About Me

Rohit is an investor, startup advisor and an Application Modernization Scale Specialist working at Google.

Friday, February 8, 2019

Power of PCF Metrics For Day 2 App Ops

PCF Metrics is a powerful, free, batteries-included application monitoring and management tool that is bundled with PCF. It showcases the power of PCF by giving an enterprise a single pane of glass for logging and metrics. Used wisely, PCF Metrics reduces the cost of relying on an expensive log aggregation tool like Splunk and eliminates the need for an APM like Dynatrace or AppDynamics, or whatever new flavor shows up in the market claiming AIOps.

So let's talk about some concrete use cases of app developers using PCF Metrics. First of all, PCF Metrics is a resource hog when it comes to the tile install, so there will be some sticker shock. You want to install the latest version of PCF Metrics, i.e. PCF Metrics 1.6. You will also need to install the Metrics Forwarder tile.

So why do I have to bother with the Metrics Forwarder? The Metrics Forwarder service enables the gathering of fine-grained application metrics by the PCF Metrics deployment. The Metrics Forwarder for PCF is a service that allows apps to emit custom metrics to Loggregator and to consume those metrics from the Loggregator Firehose. These metrics are then consumed by PCF Metrics via an internal nozzle.

PCF Metrics 1.6 lets you monitor custom application metrics, set monitors and alerts on them, and graph them on the dashboard. Yes, you can define any application metric using Micrometer, and that metric shows up in the PCF Metrics dashboard. You can set alerts on metrics and have PCF Metrics page you on Slack. This applies not only to custom metrics but also to container metrics like high CPU. What this means is that if you have a non-cooperative Ops team, you can control your own destiny: you are notified immediately via Slack when your SLOs are violated, and you can start remediation or triage right away by taking thread dumps or heap dumps.

How does all this magic work?

First, you need Java Buildpack v4.2 or later.

You can use Spring Boot Actuators to emit metrics to the Metrics Forwarder API. To do this, perform the following steps:
  1. Configure your app to use Spring Boot Actuators.
  2. Create a Metrics Forwarder for PCF service instance (tile version 1.11.3).
  3. Bind your app to the Metrics Forwarder for PCF service.
  4. Push or restage your app using the Java buildpack v4.2 or later.

Behind the Scenes

When an app is bound to the Metrics Forwarder service, the app receives credentials and the URL of the Forwarder API. It uses this information to post Spring Boot Actuator metrics to the Metrics Forwarder tile. This configuration data is stored in the VCAP_SERVICES environment variable. When you cf push or cf restage the app, the Java buildpack downloads an additional metrics exporter jar and includes it in the application droplet. When the app is running, the metrics exporter jar, now part of the application context, reads the Actuator metrics from a metrics registry every minute. It then posts the data to the Metrics Forwarder URL. From there, the Metrics Forwarder service sends this data to Loggregator. PCF Metrics then reads from the Firehose to ingest metrics data for retention and visualization.
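
The read-then-post loop described above can be pictured with a small, JDK-only sketch. This is not the buildpack's exporter: the class name MetricsExporterSketch, the in-memory registry of suppliers, and the sender callback (which stands in for the HTTP post to the Metrics Forwarder URL) are all hypothetical, and the real payload format is not shown.

```java
import java.util.Map;
import java.util.TreeMap;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.function.Consumer;
import java.util.function.DoubleSupplier;

/** Hypothetical stand-in for a metrics exporter: samples metrics and hands them to a sender. */
final class MetricsExporterSketch {
    // Metric name -> supplier of its current value (a toy substitute for a real metrics registry).
    private final Map<String, DoubleSupplier> registry = new ConcurrentHashMap<>();
    // Assumption: the sender would post the snapshot to the forwarder endpoint; here it is just a callback.
    private final Consumer<Map<String, Double>> sender;

    MetricsExporterSketch(Consumer<Map<String, Double>> sender) {
        this.sender = sender;
    }

    void register(String name, DoubleSupplier supplier) {
        registry.put(name, supplier);
    }

    /** Reads every registered metric once and returns the values sorted by name. */
    Map<String, Double> snapshot() {
        Map<String, Double> values = new TreeMap<>();
        registry.forEach((name, supplier) -> values.put(name, supplier.getAsDouble()));
        return values;
    }

    /** One read-and-publish cycle. */
    void publishOnce() {
        sender.accept(snapshot());
    }

    /** Schedules a publish cycle at a fixed period, e.g. start(60) for once a minute. */
    ScheduledExecutorService start(long periodSeconds) {
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        scheduler.scheduleAtFixedRate(() -> {
            try {
                publishOnce();
            } catch (RuntimeException e) {
                // Catch here so that one failed publish does not cancel the following runs.
                System.err.println("metrics publish failed: " + e.getMessage());
            }
        }, periodSeconds, periodSeconds, TimeUnit.SECONDS);
        return scheduler;
    }
}
```

Keeping the sender as a callback makes the cycle easy to exercise without a network; in a bound app the endpoint and credentials would come from VCAP_SERVICES.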

How do you add custom metrics to a Spring Boot app?

Micrometer is the metrics collection facility included in Spring Boot 2’s Actuator. It has also been backported to Spring Boot 1.5, 1.4, and 1.3 with the addition of the micrometer-spring-legacy dependency.

The PCF Java buildpack includes a Cloud Foundry Spring Boot Metric Writer, an extension to Spring Boot that writes metrics to a Metrics Forwarder service. Here are the gory details:

The CloudFoundryMetricWriterAutoConfiguration, through Spring Boot auto-configuration magic, creates a RestOperationsMetricPublisher that publishes metrics to the Metrics Forwarder API. Therefore, to publish metrics to PCF Metrics you don't need a separate Micrometer registry dependency; micrometer-core is enough to get the metrics published. Spring Boot auto-configures a composite MeterRegistry and adds a registry to the composite for each of the supported implementations that it finds on the classpath. In PCF, the Java buildpack configures Spring Boot with the PCF Metrics meter registry.
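
As a minimal sketch of what a custom metric can look like with micrometer-core alone: the class name OrderMetrics and the metric names orders.placed and orders.checkout are made up for illustration, and SimpleMeterRegistry stands in here for the MeterRegistry that Spring Boot would normally inject.

```java
import io.micrometer.core.instrument.Counter;
import io.micrometer.core.instrument.MeterRegistry;
import io.micrometer.core.instrument.Timer;
import io.micrometer.core.instrument.simple.SimpleMeterRegistry;

/** Hypothetical component that records two custom metrics through a MeterRegistry. */
final class OrderMetrics {
    private final Counter ordersPlaced;
    private final Timer checkoutTimer;

    // In a Spring Boot app the auto-configured MeterRegistry would be passed in here.
    OrderMetrics(MeterRegistry registry) {
        this.ordersPlaced = Counter.builder("orders.placed")
                .description("Number of orders placed")
                .register(registry);
        this.checkoutTimer = Timer.builder("orders.checkout")
                .description("Time spent in checkout")
                .register(registry);
    }

    /** Times the checkout work, then counts the order. */
    void placeOrder(Runnable checkout) {
        checkoutTimer.record(checkout);
        ordersPlaced.increment();
    }

    public static void main(String[] args) {
        // SimpleMeterRegistry keeps values in memory, which is enough to try the API locally.
        MeterRegistry registry = new SimpleMeterRegistry();
        OrderMetrics metrics = new OrderMetrics(registry);
        metrics.placeOrder(() -> { });
        System.out.println(registry.get("orders.placed").counter().count());
    }
}
```

Taking the registry as a constructor argument keeps the class independent of where the metrics end up, so the same code should work whichever registry the platform wires in.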

You can take action on custom metrics by creating monitors that alert a Slack endpoint. See https://docs.pivotal.io/pcf-metrics/1-6/using.html#monitors for configuring the right set of app-specific alerts for your application.

For sample code for Micrometer metrics in Spring Boot, check out https://github.com/micrometer-metrics/micrometer-samples-spring-boot and https://spring.io/blog/2018/05/02/spring-tips-metrics-collection-in-spring-boot-2-with-micrometer


How healthy is your Rabbit?

We deal with RabbitMQ a lot in our AppTx engagements. If your RabbitMQ deployment, or the microservices that deal with events and messaging, are unhealthy, this blog post has some hints towards fixing the issues.

Please remember these two tenets as you diagnose RabbitMQ performance issues with your application:

  • To achieve predictable RabbitMQ response times, you will want a dedicated ODB service instance in PCF (if you can leverage isolated zones, even better). This is so that the results are not skewed by noisy neighbors.
  • 99% of performance problems are caused by the applications; it is extremely rare for them to stem from a misconfiguration of RabbitMQ. Therefore, we advise simulating the customer's workload with the perf-test tool, taking into account both peak and average load.

Pivotal Tile for RabbitMQ 3.7.11 has a number of health-check enhancements for operators of the deployment. If you are on a RabbitMQ deployment and are wondering about its health, point your operators to https://www.rabbitmq.com/blog/2019/02/07/this-month-in-rabbitmq-feb-7-2019/ .

To benchmark the deployment against Pivotal-validated numbers, pick a workload and compare: https://github.com/rabbitmq/workloads

You can also run the performance test on PCF: https://github.com/rabbitmq/rabbitmq-perf-test-for-cf

To test the health of an individual deployment, you can use the health checks described at http://www.rabbitmq.com/monitoring.html#health-checks

For application resiliency with regard to RabbitMQ, see https://github.com/rabbitmq/workloads/tree/master/resiliency for a number of recommendations.

Use RabbitMQ channel caching: Opening and closing channels frequently is a CPU-bound process and incurs performance penalties. One of the best practices is to cache connections and channels. Channels should NOT be shared across threads, but they can certainly be reused within the same thread. Use the Spring AMQP abstraction to help with this. Spring Boot's auto-configuration for Spring AMQP creates a CachingConnectionFactory, which caches connections and channels in a thread-safe manner. You can read more about RabbitMQ best practices here: https://www.cloudamqp.com/blog/2018-01-19-part4-rabbitmq-13-common-errors.html
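
To make the "reuse within a thread, never share across threads" rule concrete, here is a JDK-only sketch. It is not how Spring AMQP's CachingConnectionFactory works internally; the PerThreadChannel class and the generic type C (a placeholder for whatever channel type your client library provides) are hypothetical.

```java
import java.util.function.Supplier;

/**
 * Hypothetical holder that opens one channel per thread and reuses it on that thread.
 * C is a placeholder for a client library's channel type.
 */
final class PerThreadChannel<C> {
    private final ThreadLocal<C> channel;

    // The opener is called at most once per thread, the first time that thread asks for a channel.
    PerThreadChannel(Supplier<C> opener) {
        this.channel = ThreadLocal.withInitial(opener);
    }

    /** Returns the calling thread's channel, opening it on first use. */
    C get() {
        return channel.get();
    }
}
```

The sketch leaves out closing channels and recovering from broken ones, which is a large part of why delegating to the Spring AMQP abstraction is usually the better choice.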

If you are using Spring Cloud Stream you are covered. By default, the RabbitMQ binder uses Spring Boot’s ConnectionFactory, and it therefore supports all Spring Boot configuration options for RabbitMQ and wires in the CachingConnectionFactory.

It is very difficult to determine whether a shared PCF environment is in "good shape", because the conditions of a benchmark run (spare CPU, number of connections, spare memory, etc.) are certainly not going to be the same as the conditions at the time we obtained the baseline. The best we can do is define a set of metrics that describe what a healthy deployment looks like, and those will depend on the solution itself.

For instance, if it is critical that a hypothetical "incoming request" queue always stay below a certain threshold, then one metric would be the depth of that queue. Or, if in normal circumstances we expect around 100 ± 10 connections, having more than 150 connections suggests a connection leak (something that commonly occurs). The platform team should be monitoring RabbitMQ with a tool like Prometheus connected to an alerting solution.
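
The two example checks above can be written down as a small sketch. The class and method names are hypothetical, the numbers are only the ones used in the paragraph, and in practice these rules would more likely live in your monitoring and alerting tool than in application code.

```java
/** Hypothetical health rules for the two example metrics: queue depth and connection count. */
final class RabbitHealthRules {
    enum ConnectionState { NORMAL, OUT_OF_RANGE, POSSIBLE_LEAK }

    /** A queue is considered healthy while its depth stays at or below the agreed threshold. */
    static boolean queueDepthHealthy(long depth, long maxDepth) {
        return depth <= maxDepth;
    }

    /**
     * Classifies the current connection count against an expected value and tolerance,
     * flagging anything above leakThreshold as a possible connection leak.
     */
    static ConnectionState classifyConnections(int current, int expected, int tolerance, int leakThreshold) {
        if (current > leakThreshold) {
            return ConnectionState.POSSIBLE_LEAK;
        }
        if (Math.abs(current - expected) <= tolerance) {
            return ConnectionState.NORMAL;
        }
        return ConnectionState.OUT_OF_RANGE;
    }
}
```

With the figures from the text, classifyConnections(105, 100, 10, 150) falls in the normal band, while 151 connections would be flagged as a possible leak.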

- The recommendation of scaling applications based on the depth of an RMQ queue sounds sensible, but increasing the number of consumer connections/channels may also put excess load on RabbitMQ, defeating the purpose of scaling the number of consumers to help reduce the backlog of messages. So, we need to be cautious here.

- Analyzing "why there are so many messages in RMQ" puts us in the right direction. Monitoring the number of consumers on the queue and the consumer utilization is extremely valuable, as is monitoring the message ingress rate and the ack rate in the application itself.
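
One way to reason about a backlog from the ingress and ack rates is sketched below. The names are hypothetical and the model is deliberately simple: it assumes steady rates and derives them from two samples of a cumulative message counter.

```java
/** Hypothetical helpers for reasoning about a queue backlog from ingress and ack rates. */
final class BacklogAnalysis {
    /** Messages per second between two samples of a cumulative counter. */
    static double rate(long earlierCount, long laterCount, double elapsedSeconds) {
        if (elapsedSeconds <= 0) {
            throw new IllegalArgumentException("elapsedSeconds must be positive");
        }
        return (laterCount - earlierCount) / elapsedSeconds;
    }

    /** Net queue growth in messages per second; a positive value means the backlog is growing. */
    static double backlogGrowth(double ingressRate, double ackRate) {
        return ingressRate - ackRate;
    }

    /** Seconds until the queue is empty, or infinity if consumers are not keeping up. */
    static double secondsToDrain(long queueDepth, double ingressRate, double ackRate) {
        double drainRate = ackRate - ingressRate;
        return drainRate > 0 ? queueDepth / drainRate : Double.POSITIVE_INFINITY;
    }
}
```

If the ack rate stays below the ingress rate, the drain time is unbounded, which points at slow consumers or at an overloaded broker rather than at the queue depth itself.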

Thanks to Marcial Rosales and Anwar Chirakkattil for their guidance on scaling and health of RabbitMQ in PCF, and to Dan Frey for reviewing this article.