We deal with RabbitMQ a lot in our AppTx engagements. If your RabbitMQ deployment, or the microservices that rely on it for events and messaging, is unhealthy, this blog post offers some hints for fixing the issues.
Please remember these two tenets as you diagnose RabbitMQ performance issues with your application:
- To achieve predictable RabbitMQ response times, dedicate an ODB service instance in PCF (better still if you can leverage isolated zones). This keeps the results from being skewed by noisy neighbors.
- 99% of performance problems are caused by the applications; it is extremely rare for them to be related to a misconfiguration of RabbitMQ. We therefore advise simulating the customer's workload with the perf-test tool, taking into account both peak and average load.
Pivotal Tile for RabbitMQ 3.7.11 has a number of enhancements that help operators check the health of the deployment. If you are on a RabbitMQ deployment and are wondering about its health, point your operators to https://www.rabbitmq.com/blog/2019/02/07/this-month-in-rabbitmq-feb-7-2019/ .
To benchmark the deployment against Pivotal validated numbers, pick a workload from https://github.com/rabbitmq/workloads and compare your results.
You can also run the performance test on PCF: https://github.com/rabbitmq/rabbitmq-perf-test-for-cf
To test the health of an individual deployment, you can use the health checks described at http://www.rabbitmq.com/monitoring.html#health-checks
If you want to run large workloads, see https://rabbitmq.github.io/rabbitmq-perf-test/stable/htmlsingle/#workloads-with-a-large-number-of-clients
For application resiliency with regard to RabbitMQ, see https://github.com/rabbitmq/workloads/tree/master/resiliency for a number of recommendations.
Use RabbitMQ channel caching: Opening and closing channels frequently is a CPU-bound process and incurs a performance penalty. One of the best practices to implement is caching of connections and channels. Channels should NOT be shared across threads, but they can certainly be reused within the same thread. Use the Spring AMQP abstraction to help with this. Spring Boot's auto-configuration for Spring AMQP creates a CachingConnectionFactory, which caches connections and channels in a thread-safe manner. You can read more about RabbitMQ best practices here: https://www.cloudamqp.com/blog/2018-01-19-part4-rabbitmq-13-common-errors.html
If you are using Spring Cloud Stream, you are covered. By default, the RabbitMQ binder uses Spring Boot’s ConnectionFactory, so it supports all Spring Boot configuration options for RabbitMQ and wires in the CachingConnectionFactory.
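To make the "one channel per thread, reused within that thread" rule concrete, here is a minimal sketch in Python. It is not how Spring AMQP's CachingConnectionFactory is implemented; `FakeConnection`, `PerThreadChannelCache`, and `run_demo` are hypothetical names invented for illustration, and the fake connection only counts how many channels were opened.

```python
import threading


class FakeConnection:
    """Hypothetical stand-in for a broker connection; it only counts channel opens."""

    def __init__(self):
        self._lock = threading.Lock()
        self.channels_opened = 0

    def channel(self):
        # A real client would do a network round trip here, which is the
        # cost that caching is meant to avoid.
        with self._lock:
            self.channels_opened += 1
        return object()


class PerThreadChannelCache:
    """Hands each thread its own channel and reuses it on later calls."""

    def __init__(self, connection):
        self._connection = connection
        self._local = threading.local()

    def get(self):
        channel = getattr(self._local, "channel", None)
        if channel is None:
            # First use on this thread: open one channel and remember it.
            channel = self._connection.channel()
            self._local.channel = channel
        return channel


def run_demo(threads=4, publishes_per_thread=1000):
    """Simulates several publisher threads; returns the number of channels opened."""
    connection = FakeConnection()
    cache = PerThreadChannelCache(connection)

    def worker():
        for _ in range(publishes_per_thread):
            cache.get()  # a real publisher would publish on this channel

    workers = [threading.Thread(target=worker) for _ in range(threads)]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
    return connection.channels_opened
```

With four threads, `run_demo()` opens four channels in total regardless of how many messages each thread publishes, whereas opening a channel per publish would open one channel per message.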
It is very difficult to determine whether a shared PCF environment is in "good shape", because the conditions of the benchmark (spare CPU, number of connections, spare memory, etc.) will certainly not be the same as the conditions at the time we obtained the baseline. The best we can do is define a number of metrics that describe what a healthy deployment looks like, and those will depend on the solution itself.
For instance, if it is critical that a hypothetical "incoming request" queue always stays below a certain threshold, then one metric would be the depth of that queue. Or, if under normal circumstances we expect around 100 ± 10 connections, having more than 150 connections suggests a connection leak (a common occurrence). The platform team should be monitoring RabbitMQ with a tool like Prometheus connected to an alerting solution.
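The two example metrics above can be sketched as a small check. This is an illustration only: the `Thresholds` fields, the `evaluate` function, and the alert strings are hypothetical, the queue depth limit of 1000 is an assumed value, and in practice the numbers would come from your monitoring system rather than being passed in by hand.

```python
from dataclasses import dataclass


@dataclass
class Thresholds:
    max_queue_depth: int        # assumed limit for the "incoming request" queue
    expected_connections: int   # normal connection count, e.g. 100
    connection_tolerance: int   # normal variation, e.g. 10
    leak_connections: int       # count above which a leak is suspected, e.g. 150


def evaluate(queue_depth, connections, thresholds):
    """Returns a list of alert strings; an empty list means the checks passed."""
    alerts = []
    if queue_depth > thresholds.max_queue_depth:
        alerts.append("queue depth above threshold")
    if connections > thresholds.leak_connections:
        alerts.append("possible connection leak")
    elif abs(connections - thresholds.expected_connections) > thresholds.connection_tolerance:
        alerts.append("connection count outside normal band")
    return alerts


# Example values taken from the text, plus an assumed queue depth limit.
EXAMPLE = Thresholds(
    max_queue_depth=1000,
    expected_connections=100,
    connection_tolerance=10,
    leak_connections=150,
)
```

For example, `evaluate(200, 105, EXAMPLE)` returns no alerts, while `evaluate(200, 160, EXAMPLE)` flags a possible connection leak.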
- The recommendation to scale applications based on the depth of a RabbitMQ queue sounds sensible, but increasing the number of consumer connections/channels may also put excess load on RabbitMQ, defeating the purpose of adding consumers to reduce the message backlog. So we need to be cautious here.
- Analyzing why there are so many messages in RabbitMQ puts us in the right direction. Monitoring the number of consumers on the queue and the consumer utilization is extremely valuable, as is monitoring the message ingress rate and the ack rate in the application itself.
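As a rough illustration of how these four signals can be combined before deciding to scale, consider the sketch below. The function name, the category labels, and the 0.9 utilization cut-off are all assumptions made for this example. It relies on the usual reading of consumer utilization as the fraction of time the queue can deliver to its consumers immediately, so a low value points at the consumers and a high value suggests that adding consumers is unlikely to help.

```python
def diagnose_backlog(ingress_rate, ack_rate, consumers, consumer_utilisation,
                     utilisation_floor=0.9):
    """Classifies a queue backlog from four monitored values.

    ingress_rate and ack_rate are in messages per second; consumer_utilisation
    is a fraction between 0.0 and 1.0. The 0.9 floor is an assumed cut-off.
    """
    if consumers == 0:
        # Nothing is reading from the queue, so the backlog can only grow.
        return "no-consumers"
    if ingress_rate <= ack_rate:
        # Consumers acknowledge at least as fast as messages arrive.
        return "keeping-up"
    if consumer_utilisation < utilisation_floor:
        # The queue often has to wait for consumers: more consumers, faster
        # consumers, or a higher prefetch may help.
        return "consumer-bound"
    # Consumers are rarely the reason for waiting; adding more of them is
    # likely to add broker load without reducing the backlog.
    return "broker-bound"
```

A result of "consumer-bound" is the case where scaling consumers is most likely to pay off; "broker-bound" is the case the caution above is about.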