About Me

Rohit is an investor, startup advisor and an Application Modernization Scale Specialist working at Google.

Thursday, July 9, 2015

Cloud Foundry High Availability

This blog post explores the intricacies of High Availability (HA) for Cloud Foundry (CF) components, and for the apps and services deployed to them, within one Availability Zone (AZ). The modes of recovery for CF components and services within one availability zone are as follows:

| Types of HA - What is Running                         | Owner - Who Recovers It                                          |
| CF process running within a VM or a container         | Monit supervisor process running within the VM/container         |
| VM or container                                       | BOSH Agent + Resurrector plugin                                  |
| Application instance (JVM / Node.js / Ruby / Python)  | Health Manager CF job                                            |
| Availability Zone (AZ)                                | IaaS-specific HA (VMware HA, AWS CloudWatch + EC2 Auto Recovery) |

Jobs in a typical CF 1 AZ Deployment (PCF 1.5 AWS Deployment using CloudFormation)
| Job/index                                                      |
| clock_global-partition-489dca900676578cb063/0                  |
| cloud_controller-partition-489dca900676578cb063/0              |
| cloud_controller_worker-partition-489dca900676578cb063/0       |
| consul_server-partition-489dca900676578cb063/0                 |
| dea-partition-489dca900676578cb063/0                           |
| doppler-partition-489dca900676578cb063/0                       |
| etcd_server-partition-489dca900676578cb063/0                   |
| health_manager-partition-489dca900676578cb063/0                |
| loggregator_trafficcontroller-partition-489dca900676578cb063/0 |
| nats-partition-489dca900676578cb063/0                          |
| router-partition-489dca900676578cb063/0                        |
| uaa-partition-489dca900676578cb063/0                           |

VM Instances used to Manage and Operate Cloud Foundry
i-4c40e0b2 (Ops Manager, optional - only in PCF)
i-af65c551 (NAT Instance)
i-df53f321 (microbosh-bb550035e754b2468ad0) 

Databases used in Cloud Foundry (Critical ones highlighted)
| Database           |
| app_usage_service  |
| autoscale          |
| bosh               |
| ccdb               |
| console            |
| notifications      |
| uaa                |

BlobStore to store droplets, buildpacks and compiled packages

Now let's look at the potential impact of failure for these components.

For the following jobs, the system can be recovered quickly by spawning another instance of the same process type. Failure of these processes is not catastrophic. The four levels of HA provide resiliency and recovery of the CF core jobs.
  • clock_global-partition - This component is stateless. The instance does not need to persist state and can be restored using the BOSH Resurrector.
  • cloud_controller-partition & cloud_controller_worker-partition - The Cloud Controller (CC) relies on access to the CCDB and the blobstore. While it is down, no API requests are served. Current apps and services continue to run, but service instances cannot be bound/unbound or created/deleted, and crashed apps are not restarted.
  • consul_server-partition - This consensus-based KV store is not currently deployed by default as part of cf-release. It is required by the Diego and routing-api features, so this component can be considered optional.
  • dea-partition - DEAs need to be sized to handle failure, since application instances are migrated dynamically to the remaining DEAs. With two DEA VMs, 50% over-provisioning guarantees that all app instances can restart on the remaining DEA VM. Three DEAs require 33%, and so on. App instances will fail to start if there is no capacity remaining. Running on fewer DEAs may also affect performance.
  • doppler-partition & loggregator_trafficcontroller-partition - These are stateless components. Loss of these components means app and system logs are not drained correctly or streamed back to the clients.
  • nats-partition - Cloud Foundry continues to run any apps that are already running even when NATS is unavailable for short periods of time. The components publishing messages to and consuming messages from NATS are resilient to NATS failures. As soon as NATS recovers, operations such as health management and router updates resume.
  • router-partition - Depends on NATS to receive route updates for most clients. Loss of the router will definitely impact client traffic, since it reverse proxies all requests to and responses from the DEAs.
  • uaa-partition - Depends on the uaadb. Logged-in state is lost, and new OAuth access tokens cannot be obtained from the UAA.
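
The DEA sizing rule above generalizes: to survive the loss of f out of N equally sized DEAs, roughly f/N of total capacity has to stay free. The following is a minimal sketch of that arithmetic. The function names and the memory-only capacity model are assumptions made for illustration; they are not part of CF or BOSH.

```python
import math


def headroom_fraction(dea_count, tolerated_failures=1):
    """Fraction of total DEA capacity to keep free so the surviving DEAs
    can absorb the app instances of the failed ones.

    Assumes all DEAs are the same size (an illustrative simplification).
    """
    if dea_count <= tolerated_failures:
        raise ValueError("need more DEAs than tolerated failures")
    return tolerated_failures / dea_count


def deas_needed(app_memory_gb, dea_memory_gb, tolerated_failures=1):
    """DEAs required to run the given app memory and still fit everything
    after `tolerated_failures` DEAs are lost (memory is the only resource
    modelled here)."""
    return math.ceil(app_memory_gb / dea_memory_gb) + tolerated_failures


if __name__ == "__main__":
    print(f"{headroom_fraction(2):.0%}")  # two DEAs   -> 50%
    print(f"{headroom_fraction(3):.0%}")  # three DEAs -> 33%
    print(deas_needed(app_memory_gb=48, dea_memory_gb=16))  # -> 4
```

In the last call, 48 GB of app instances on 16 GB DEAs needs three DEAs to run and a fourth so that everything still fits after one DEA is lost.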

These components together represent the entire runtime and deployment state of CF. Failure of any one of these DBs significantly degrades the end-user experience.
  • uaa - Authentication and authorization
  • ccdb - API requests and Managed Services
  • console - Apps Manager/ Developer Console
  • bosh - Current state of a CF Deployment in terms of job and VMs
  • blobstore - Currently running apps continue to run. Apps cannot be pushed successfully, and already pushed apps cannot be staged. Health management will fail to start apps because droplets cannot be downloaded.
Within a single AZ, periodic DB snapshots and an IaaS capability to back up and restore DBs are critical when DBs fail. The only route to true HA for these databases is to host them in an external, customer-provided DB deployment that offers a multi-AZ capability the CF deployment can leverage. A managed service such as AWS RDS provides multi-AZ capability as a feature of the provider. Likewise, an external S3-compatible blob store inherits the HA capabilities of its provider. For example, AWS S3 is multi-AZ and HA.

Failure Testing of Components in Cloud Foundry

- Check out cf-attack, a cf CLI plugin to perf/load test an app
- Chaos as a Service: https://github.com/skibum55/chaos-as-a-service

Failure testing hardens your systems against failure. It prevents larger outages through the practice of immunization, i.e. taking in the poison/virus in small quantities to become immune. It is critical that you know how the system responds to failure. Since failure will occur, it is prudent to plan for it and establish an Incident Control System that first mitigates, then isolates and resolves the issue. Each episode should end with a blameless post-mortem that enables the product and the failure response process to address gaps.

This blog post explains the scenarios you need to cover in your end-to-end failure testing plan for apps deployed to Cloud Foundry. We only get into the details of which scenarios to consider and how to bring them about. Response to failure, aka High Availability of CF and apps, will be covered in the next blog post.

1. Application Failure Scenarios

1. [HEAP-OOM] App instance allocates excess Java heap memory and repeatedly suffers Java heap OOM errors.
2. [NATIVE-OOM] App instance allocates excess native memory, leading to crashes of the Warden container.
3. [EXCESS-CPU] App instance is pegged at > 90% CPU for a sustained period of time.
4. [DEADLOCK] App instance stays below 50% CPU in spite of sustained, increasing request traffic.
5. [GC-PAUSES] App is unresponsive for random periods of time.
6. [RANDOM-DEATH] App instance randomly exits.

Apps for Inducing App Failure Scenarios:
- https://t.co/50H0HA10ym - Problem Diagnostics Toolkit that provides an app for creating thread concurrency issues, memory leaks, high CPU usage and JVM crashes. Unzip the ear, ignore all the other files and just push the war file. The first page takes a long time to load in the browser, since the app loads the Dojo Toolkit on the first web request. Credit for this app goes to IBM. Thanks @jpapejr. For a running example see http://pdt.cfapps.pez.pivotal.io/
- https://github.com/cf-platform-eng/cf-scale-boot - Endpoint for randomly killing app instances
- https://github.com/pivotalservices/autoscaler-test-app - Hogs memory to create OOMs
- https://github.com/Netflix/SimianArmy/wiki/The-Chaos-Monkey-Army#grow-your-own-chaos-monkey - Customize the Netflix Chaos Monkey for your own purposes.
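
If none of these apps fits, a home-grown fault hook is small. The sketch below illustrates the memory-hog and RANDOM-DEATH ideas in Python; it is an assumption-laden illustration and not how any of the apps listed above are implemented. The names leak_memory and should_die are hypothetical, and the JVM-specific HEAP-OOM scenario would need the Java equivalent.

```python
import os
import random


def leak_memory(hoard, chunks, chunk_bytes=1024 * 1024):
    """Append `chunks` buffers to `hoard` and never release them, which
    simulates a leak. Returns the total number of bytes held so far."""
    for _ in range(chunks):
        hoard.append(bytearray(chunk_bytes))
    return sum(len(chunk) for chunk in hoard)


def should_die(kill_probability, rng=random):
    """Decide, per request, whether this instance should exit abruptly."""
    return rng.random() < kill_probability


def handle_request(hoard, kill_probability=0.01, leak_chunks=1):
    """Hypothetical request handler with both faults wired in."""
    if should_die(kill_probability):
        os._exit(1)  # exit immediately, without cleanup, like a crash
    return leak_memory(hoard, leak_chunks)
```

Calling handle_request on every incoming request grows memory steadily until the container limit is hit, while about one request in a hundred kills the instance outright, so both the OOM and the random-exit recovery paths get exercised.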

2. Service Failure Scenarios

1. Logging Infrastructure service (say Splunk or LogStash) is unresponsive / not accessible
2. Monitoring Service (say AppDynamics) is unresponsive / not accessible
3. Data/Persistence services (say PostgreSQL, MySQL) are not available / accessible

Tools to simulate troublesome service interactions
- Failure Injection Testing
- Spring Integration Testing to introduce faulty service dependencies.
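
As an illustration of failure injection at the service boundary, the sketch below wraps any callable that talks to a backing service and makes a configurable share of calls fail. The class name FaultInjector and its parameters are hypothetical, chosen for this example; it is not taken from any of the tools mentioned above.

```python
import random


class FaultInjector:
    """Wraps a callable that talks to a backing service and deliberately
    fails a fraction of the calls."""

    def __init__(self, call, failure_rate, rng=None, error=ConnectionError):
        if not 0.0 <= failure_rate <= 1.0:
            raise ValueError("failure_rate must be between 0 and 1")
        self._call = call
        self._failure_rate = failure_rate
        self._rng = rng or random.Random()
        self._error = error

    def __call__(self, *args, **kwargs):
        # random() returns a value in [0, 1), so a rate of 1.0 always
        # fails and a rate of 0.0 never does.
        if self._rng.random() < self._failure_rate:
            raise self._error("injected failure: service unavailable")
        return self._call(*args, **kwargs)


def query_database(sql):
    """Stand-in for a real data service call (hypothetical)."""
    return f"rows for {sql}"


if __name__ == "__main__":
    flaky_query = FaultInjector(query_database, failure_rate=0.3)
    for _ in range(5):
        try:
            print(flaky_query("select 1"))
        except ConnectionError as exc:
            print("app must cope with:", exc)
```

Setting failure_rate to 1.0 simulates a service that is completely unreachable, which is the case the three scenarios above describe.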

3. Cloud Foundry Stateful and Stateless Job (Component) Failure Scenarios

1. [CF-STATELESS] Random failure of critical CF stateless components. DEA VMs, Collector, Doppler & Loggregator Traffic Controller VMs, Health Manager VM, etcd nodes, cleanup jobs and errands fall in this category.
2. [CF-STATEFUL] Loss of storage volumes and persistence stores in a CF deployment. This includes the loss of critical databases like the BOSHDB, CCDB, UAADB, AppsManager-DB, NFS-Server/Blobstore, etc.
3. [CF-CRITICAL] Loss of critical CF components like the front-end load balancer, UAA, gorouter, NATS, & Cloud Controller.
4. [CF-CONFIGURATION] Loss of BOSH Director, OpsManager VM, MicroBosh, Jumpbox, DNS Server
5. [CF-AZ-LOSS] Loss of an entire AZ.

Tools for killing CF components
- https://github.com/pivotal-cf/chaos-lemur - Adaptation of the Chaos Monkey concept to the SpringXD BOSH release
- https://github.com/BrianMMcClain/cloudfoundry-chaos-monkey - Chaos Monkey style service for CF

4. Hardware Failure Scenarios

1. [NETWORK-PARTITION] Entire ESXi hypervisor hosting DEAs and other CF components goes offline, i.e. becomes unreachable over the network.
2. [HYPERVISOR-DEATH] Entire ESXi hypervisor hosting DEAs and other CF components dies.

Tools To Blow Up Your Favorite Hypervisor
- Simulating VMware High Availability failover (2056634)
- Testing a VMware Fault Tolerance configuration (1020058)
- Simian Army for Creating Havoc on AWS

The best way to test resiliency of Cloud Foundry is to do the following:
1. Kill VMs from vSphere vCenter or AWS IaaS console.
2. Run bosh tasks --no-filter in a loop to watch the Resurrector bring them back up, i.e. watch -n 2 "bosh tasks --no-filter"
3. Randomly bosh ssh into a VM or container and run sudo kill -9 -1 to kill its processes.
4. bosh ssh into a DEA and kill a container.
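
To turn these steps into a number you can track, it helps to time how long a killed component takes to come back. The sketch below polls a health check until it passes and reports the elapsed time. The is_healthy callable is a placeholder you would supply yourself (for example an HTTP probe against an app route); nothing here is part of BOSH or the cf CLI.

```python
import time


def time_to_recover(is_healthy, timeout_s=600.0, interval_s=2.0,
                    clock=time.monotonic, sleep=time.sleep):
    """Poll is_healthy() until it returns True.

    Returns the elapsed seconds until the first healthy result, or None
    if the component has not recovered within timeout_s. The clock and
    sleep parameters exist so the loop can be tested without waiting.
    """
    start = clock()
    while True:
        if is_healthy():
            return clock() - start
        if clock() - start >= timeout_s:
            return None
        sleep(interval_s)
```

Start the timer right after killing the VM or container, and record the result per component so regressions in recovery time show up between test runs.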

Failure is NOT a once- or twice-a-year event. If your site gets a billion queries/requests/hits, then a once-in-a-billion event is happening 20 times a day. Plan for it and embrace it. Last but not least, always adhere to the rules of designing and developing operations-friendly services.
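
The arithmetic behind that remark is simple: the expected number of occurrences per day is the daily request count divided by the odds of the event. A small sketch, where the traffic figures are illustrative assumptions rather than measurements:

```python
def expected_per_day(requests_per_day, one_in_n):
    """Expected daily occurrences of an event that happens once in every
    `one_in_n` requests, on average."""
    return requests_per_day / one_in_n


if __name__ == "__main__":
    # One billion requests a day, one-in-a-billion event.
    print(expected_per_day(1_000_000_000, 1_000_000_000))   # 1.0
    # Twenty billion requests a day, same event.
    print(expected_per_day(20_000_000_000, 1_000_000_000))  # 20.0
```

At one billion requests a day a one-in-a-billion event is expected about once a day, and twenty occurrences a day corresponds to roughly twenty billion requests a day. Either way, "rare" events are routine at scale.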