About Me

Rohit leads the Pivotal Labs App Modernization Practice in engineering, delivery training & cross-functional enablement, tooling, scoping, selling, recruiting, marketing, blog posts, webinars and conference sessions. Rohit has led multiple enterprise engagements including ones featured in the Wall Street Journal. Rohit focuses on designing, implementing and consulting with enterprise software solutions for Fortune 500 companies on application migration and modernization.

Friday, February 8, 2019

How healthy is your Rabbit?

We deal with RabbitMQ a lot in our AppTx engagements. If your RabbitMQ deployment, or the microservices that rely on it for events and messaging, is unhealthy, this blog post has some hints for fixing the issues.

Please keep these two tenets in mind as you diagnose RabbitMQ performance issues with your application:

  • To achieve predictable RabbitMQ response times, dedicate an ODB service instance in PCF (better still if you can leverage isolated zones). This keeps noisy neighbors from skewing the results.
  • 99% of performance problems are caused by the applications; it is extremely rare for them to stem from a misconfiguration of RabbitMQ. We therefore advise simulating the customer's workload with the perf-test tool, taking both peak and average load into account.

The Pivotal tile for RabbitMQ 3.7.11 has a number of enhancements to the deployment health checks available to operators. If you run a RabbitMQ deployment and are wondering about its health, point your operators to https://www.rabbitmq.com/blog/2019/02/07/this-month-in-rabbitmq-feb-7-2019/ .

Here are the links you need to benchmark the deployment against Pivotal-validated numbers. Pick a workload and compare: https://github.com/rabbitmq/workloads

You can also run the performance test on PCF: https://github.com/rabbitmq/rabbitmq-perf-test-for-cf

To test the health of an individual deployment, you can use the health checks described at http://www.rabbitmq.com/monitoring.html#health-checks
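As a rough illustration of an automated health check, the sketch below polls the management plugin's HTTP API. It assumes the management plugin is enabled and that the `/api/aliveness-test/{vhost}` endpoint is available in your RabbitMQ version; the base URL, credentials and function names are hypothetical placeholders, not part of any Pivotal tooling.

```python
import base64
import json
import urllib.parse
import urllib.request


def aliveness_url(base_url: str, vhost: str = "/") -> str:
    """Build the aliveness-test URL for a virtual host.

    The vhost name must be percent-encoded, so the default vhost "/" becomes "%2F".
    """
    encoded_vhost = urllib.parse.quote(vhost, safe="")
    return f"{base_url.rstrip('/')}/api/aliveness-test/{encoded_vhost}"


def is_ok(response_body: str) -> bool:
    """Return True only when the response body is JSON with status "ok"."""
    try:
        return json.loads(response_body).get("status") == "ok"
    except (ValueError, AttributeError):
        # Not JSON, or JSON that is not an object: treat as unhealthy.
        return False


def check_aliveness(base_url: str, user: str, password: str,
                    vhost: str = "/", timeout: float = 5.0) -> bool:
    """Call the management API with basic auth; any network error counts as unhealthy."""
    request = urllib.request.Request(aliveness_url(base_url, vhost))
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    request.add_header("Authorization", f"Basic {token}")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return is_ok(response.read().decode("utf-8"))
    except OSError:
        return False
```

A call such as `check_aliveness("http://localhost:15672", "guest", "guest")` (hypothetical host and credentials) could be run from a scheduled job and fed into whatever alerting the platform team already uses.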

For application resiliency with regard to RabbitMQ, see https://github.com/rabbitmq/workloads/tree/master/resiliency for a number of recommendations.

Use RabbitMQ channel caching: Opening and closing channels frequently is CPU-intensive and incurs a performance penalty. One of the best practices to implement is caching of connections and channels. Channels should NOT be shared across threads, but they can be reused within the same thread. Use the Spring AMQP abstraction to help with this. The Spring Boot auto-configuration for Spring AMQP creates a CachingConnectionFactory, which caches connections and channels in a thread-safe manner. You can read more about RabbitMQ best practices here: https://www.cloudamqp.com/blog/2018-01-19-part4-rabbitmq-13-common-errors.html

If you are using Spring Cloud Stream you are covered. By default, the RabbitMQ binder uses Spring Boot’s ConnectionFactory, and it therefore supports all Spring Boot configuration options for RabbitMQ and wires in the CachingConnectionFactory.
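To make the "one channel per thread, reused within that thread" rule concrete, here is a minimal language-neutral sketch in Python. It is not how Spring AMQP's CachingConnectionFactory is implemented; `ThreadLocalChannelCache`, `FakeChannel` and the `open_channel` callable are hypothetical names standing in for whatever your client library provides.

```python
import threading
from typing import Any, Callable


class ThreadLocalChannelCache:
    """Hands each thread its own channel and reuses it on later calls."""

    def __init__(self, open_channel: Callable[[], Any]) -> None:
        # open_channel is an assumed factory, e.g. a wrapper around
        # the client library's "create channel on this connection" call.
        self._open_channel = open_channel
        self._local = threading.local()

    def channel(self) -> Any:
        ch = getattr(self._local, "channel", None)
        if ch is None:
            # First use on this thread: pay the cost of opening a channel once.
            ch = self._open_channel()
            self._local.channel = ch
        return ch


class FakeChannel:
    """Stand-in for a real channel; counts how many were opened."""

    opened = 0
    _lock = threading.Lock()

    def __init__(self) -> None:
        with FakeChannel._lock:
            FakeChannel.opened += 1
```

Publishing code would call `cache.channel()` every time it needs a channel; repeated calls on the same thread return the same object, while a different thread triggers exactly one additional open.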

It is very difficult to determine if a shared PCF environment is in "good shape", because the conditions at the time of the benchmark (spare CPU, number of connections, spare memory, etc.) are certainly not going to be the same as the conditions at the time we obtained the baseline. The best we can do is define a number of metrics that describe what a healthy deployment looks like, and those will depend on the solution itself.

For instance, if it is critical that a hypothetical "incoming request" queue always stay below a certain threshold, then one metric would be the depth of that queue. Or, if under normal circumstances we expect around 100 ± 10 connections, seeing more than 150 connections suggests a connection leak (something that occurs fairly often). The platform team should be monitoring RabbitMQ with a tool like Prometheus connected to an alerting solution.
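The two example metrics above can be expressed as a small check. This is only a sketch: the connection numbers (100, 10, 150) come from the example in the text, while the queue-depth limit of 1000 and all class and function names are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class HealthBaseline:
    max_queue_depth: int            # hypothetical limit for the "incoming request" queue
    expected_connections: int       # e.g. 100
    connection_tolerance: int       # e.g. 10
    connection_leak_threshold: int  # e.g. 150


def evaluate(baseline: HealthBaseline, queue_depth: int, connections: int) -> List[str]:
    """Return human-readable findings; an empty list means 'looks healthy'."""
    findings = []
    if queue_depth > baseline.max_queue_depth:
        findings.append(
            f"queue depth {queue_depth} exceeds limit {baseline.max_queue_depth}"
        )
    if connections > baseline.connection_leak_threshold:
        findings.append(f"possible connection leak: {connections} connections")
    elif abs(connections - baseline.expected_connections) > baseline.connection_tolerance:
        findings.append(f"connection count {connections} outside normal range")
    return findings


baseline = HealthBaseline(
    max_queue_depth=1000,
    expected_connections=100,
    connection_tolerance=10,
    connection_leak_threshold=150,
)
```

In practice these thresholds would live as alert rules in the monitoring stack rather than in application code; the point is only that "healthy" has to be written down as explicit, solution-specific numbers.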

- The recommendation to scale applications based on the depth of an RMQ queue sounds sensible, but increasing the number of consumer connections/channels may also put excess load on RabbitMQ, defeating the purpose of adding consumers to reduce the message backlog. So we need to be cautious here.

- Analyzing "why are there so many messages in RMQ" puts us in the right direction. Monitoring the number of consumers on the queue and the consumer utilization is extremely valuable, and so is monitoring the message ingress rate and the ack rate in the application itself.
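One simple way to reason about a backlog is to compare the ingress rate with the ack rate before deciding to add consumers. The sketch below is plain arithmetic under the assumption that both rates are measured in messages per second and stay roughly constant; the function names are hypothetical.

```python
from typing import Optional


def backlog_growth(ingress_rate: float, ack_rate: float) -> float:
    """Messages per second by which the queue grows (negative means it is draining)."""
    return ingress_rate - ack_rate


def seconds_to_drain(queue_depth: int, ingress_rate: float, ack_rate: float) -> Optional[float]:
    """Estimate how long the current backlog takes to clear.

    Returns None when consumers ack no faster than messages arrive,
    because in that case the backlog never drains at the current rates.
    """
    if queue_depth <= 0:
        return 0.0
    net_drain = ack_rate - ingress_rate
    if net_drain <= 0:
        return None
    return queue_depth / net_drain
```

For example, a backlog of 1000 messages with 50 msg/s arriving and 150 msg/s acked drains in about 10 seconds at those rates, whereas with the rates reversed it never drains, and the useful question becomes why the consumers ack so slowly rather than how many more of them to start.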

Thanks to Marcial Rosales and Anwar Chirakkattil for their guidance on scaling and health of RabbitMQ in PCF, and to Dan Frey for reviewing this article.
