About Me

Rohit is an investor, startup advisor and an Application Modernization Scale Specialist working at Google.

Thursday, July 9, 2015

Failure Testing of Components in Cloud Foundry

Update:
- Check out cf-attack, a cf CLI plugin to perf/load test an app
- Chaos as a Service: https://github.com/skibum55/chaos-as-a-service
- MONKEYS & LEMURS AND LOCUSTS … OH MY! ANTI-FRAGILE PLATFORMS

Failure testing hardens your systems against failure. It prevents larger outages through the practice of immunization, i.e. ingesting a poison or virus in small quantities to become immune. It is critical that you know how the system responds to failure. Since failure will occur, it is prudent to plan for it and establish an Incident Control System that first mitigates, then isolates and finally resolves the issue. The entire episode ends with a blameless post-mortem that enables the product and the failure response process to address gaps.

This blog post explains the scenarios you need to cover in your end-to-end failure testing plan for apps deployed to Cloud Foundry. We only get into which scenarios to consider and how to effect them. Response to failure, aka High Availability of CF and apps, will be covered in the next blog post.

1. Application Failure Scenarios

1. [HEAP-OOM] App instance allocates excess Java heap memory. App repeatedly suffers Java heap OOM.
2. [NATIVE-OOM] App instance allocates excess native memory, leading to crashes of the warden container.
3. [EXCESS-CPU] App instance is pegged at > 90% CPU for a sustained period of time.
4. [DEADLOCK] App instance stays below 50% CPU in spite of sustained, increasing request traffic.
5. [GC-PAUSES] App is unresponsive for random periods of time.
6. [RANDOM-DEATH] App instance randomly exits.

Apps for Effecting App Failure Scenarios:
- https://t.co/50H0HA10ym - Problem Diagnostics Toolkit that provides an app for creating thread concurrency issues, memory leaks, high CPU usage and JVM crashes. Unzip the ear, ignore all the other files and just push the war file. The first page load in the browser will take a long time since it loads the Dojo toolkit on the first web request. Credit for this app goes to IBM. Thanks @jpapejr. For a running example, see http://pdt.cfapps.pez.pivotal.io/
- https://github.com/cf-platform-eng/cf-scale-boot - Endpoint for randomly killing app instances
- https://github.com/pivotalservices/autoscaler-test-app - Hogs Memory to create OOMs
- https://github.com/Netflix/SimianArmy/wiki/The-Chaos-Monkey-Army#grow-your-own-chaos-monkey - Customize the Netflix chaos monkey for your own purposes.
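
If you would rather roll your own misbehaving app than push one of the above, the scenarios boil down to a handful of small fault primitives. The sketch below is a minimal, hypothetical Python illustration (none of these function names come from the tools listed above) of a memory hog, a CPU burner and a random-death switch. It makes no attempt to model JVM-specific behaviour such as heap vs. native memory or GC pauses.

```python
import os
import random
import time


def hog_memory(chunk_count, chunk_bytes=1024 * 1024):
    """Allocate chunk_count buffers; the caller keeps the list to simulate a leak."""
    return [bytearray(chunk_bytes) for _ in range(chunk_count)]


def burn_cpu(seconds):
    """Spin in a tight loop for roughly `seconds`; returns the iteration count."""
    deadline = time.monotonic() + seconds
    iterations = 0
    while time.monotonic() < deadline:
        iterations += 1
    return iterations


def should_die(probability, rng=random):
    """Return True with the given probability (0.0 = never, 1.0 = always)."""
    return rng.random() < probability


def maybe_exit(probability, rng=random, exit_fn=os._exit):
    """Kill the process abruptly with the given probability.

    exit_fn is injectable so the behaviour can be exercised without
    actually terminating the interpreter.
    """
    if should_die(probability, rng):
        exit_fn(1)
        return True
    return False
```

Wired behind a few HTTP endpoints of a test app, something like `maybe_exit(0.01)` on every request approximates [RANDOM-DEATH], while holding on to the result of `hog_memory(...)` across requests approximates a leak. Size the numbers to the memory limit of your app instance.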

2. Service Failure Scenarios

1. Logging Infrastructure service (say Splunk or LogStash) is unresponsive / not accessible
2. Monitoring Service (say AppDynamics) is unresponsive / not accessible
3. Data/Persistence services (say PostgreSQL, MySQL) are not available / accessible

Tools to simulate troublesome service interactions
- Failure Injection Testing
- Spring Integration Testing to introduce faulty service dependencies.
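
The idea behind both approaches is the same: put something between the app and its dependency that fails or stalls on purpose. As a rough, language-agnostic illustration (written in Python, and not the API of any of the tools named above), a faulty dependency can be sketched as a wrapper around the real call. `FlakyService`, `ServiceUnavailable` and all parameter names are hypothetical.

```python
import random
import time


class ServiceUnavailable(Exception):
    """Raised by the wrapper to simulate an unreachable backing service."""


class FlakyService:
    """Wraps a callable and injects latency and/or failures before calling it."""

    def __init__(self, call, failure_rate=0.0, delay_seconds=0.0,
                 rng=None, sleep=time.sleep):
        self._call = call
        self._failure_rate = failure_rate    # 0.0 = never fail, 1.0 = always fail
        self._delay_seconds = delay_seconds  # added latency per call
        self._rng = rng or random.Random()
        self._sleep = sleep                  # injectable so tests do not really wait

    def __call__(self, *args, **kwargs):
        if self._delay_seconds > 0:
            self._sleep(self._delay_seconds)
        if self._rng.random() < self._failure_rate:
            raise ServiceUnavailable("injected failure")
        return self._call(*args, **kwargs)
```

For example, `FlakyService(run_query, failure_rate=1.0)` (with `run_query` standing in for your own data access function) makes the persistence service look completely down, and `delay_seconds=30` makes it look unresponsive, so you can watch how the app's timeouts and retries behave.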

3. Cloud Foundry Stateful and Stateless Job (Component) Failure Scenarios

1. [CF-STATELESS] Random failure of critical CF stateless components. DEA VMs, Collector, Doppler & Loggregator Traffic Controller VMs, Health Manager VM, ETCD nodes, cleanup jobs and errands fall in this category.
2. [CF-STATEFUL] Loss of storage volumes and persistence stores in a CF deployment. This includes the loss of critical databases like the BOSHDB, CCDB, UAADB, AppsManager-DB, NFS-Server/Blobstore, etc.
3. [CF-CRITICAL] Loss of critical CF components like the front end loadbalancer, UAA, gorouter, NATS, & CloudController.
4. [CF-CONFIGURATION] Loss of BOSH Director, OpsManager VM, MicroBosh, Jumpbox, DNS Server
5. [CF-AZ-LOSS] Loss of an entire AZ.

Tools for killing CF components
- https://github.com/pivotal-cf/chaos-lemur - Adaptation of the Chaos Monkey concept to the SpringXD BOSH release
- https://github.com/BrianMMcClain/cloudfoundry-chaos-monkey - Chaos Monkey style service for CF
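
At their core, Chaos Monkey style tools repeatedly pick random victims from an inventory of VMs and destroy them. The selection step can be sketched as below. This is a hypothetical illustration only, not how chaos-lemur or cloudfoundry-chaos-monkey are implemented; the VM names and the `protected` set are made up.

```python
import random


def pick_victims(vm_names, probability, protected=(), rng=None):
    """Return the VMs selected for destruction in this round.

    Each VM not listed in `protected` is chosen independently with the
    given probability. Nothing is killed here; the caller decides what
    to do with the result (log it for a dry run, or destroy the VMs).
    """
    rng = rng or random.Random()
    candidates = [name for name in vm_names if name not in protected]
    return [name for name in candidates if rng.random() < probability]
```

For instance, `pick_victims(["dea/0", "dea/1", "router/0", "nats/0"], 0.25, protected={"nats/0"})` never selects the NATS VM. Running a few rounds as a dry run first, and only then feeding the result to whatever actually terminates VMs in your IaaS, is a sensible way to start.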

4. Hardware Failure Scenarios

1. [NETWORK-PARTITION] Entire ESXi Hypervisor hosting DEAs and other CF components goes offline, i.e. it keeps running but is cut off from the network.
2. [HYPERVISOR-DEATH]  Entire ESXi Hypervisor hosting DEAs and other CF components dies.

Tools To Blow Up Your Favorite Hypervisor
- Simulating VMware High Availability failover (2056634)
- Testing a VMware Fault Tolerance configuration (1020058)
- Simian Army for Creating Havoc on AWS

The best way to test resiliency of Cloud Foundry is to do the following:
1. Kill VMs from the vSphere vCenter or AWS IaaS console.
2. Run bosh tasks --no-filter in a loop to watch the resurrector bring them back up, i.e. watch -n 2 "bosh tasks --no-filter"
3. Randomly bosh ssh into a VM and run sudo kill -9 -1 to kill all of its processes
4. bosh ssh into a DEA and kill a container
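
While you are killing things, it helps to measure how long recovery actually takes rather than just eyeballing it. A minimal sketch of such a measurement is a polling loop around a health check. The function and parameter names below are hypothetical, and `is_healthy` is whatever check suits your app (for example, an HTTP GET against its route).

```python
import time


def wait_for_recovery(is_healthy, timeout_seconds=300.0, interval_seconds=2.0,
                      clock=time.monotonic, sleep=time.sleep):
    """Poll is_healthy() until it returns True.

    Returns the elapsed seconds until recovery, or None if the timeout
    is reached first. clock and sleep are injectable for testing.
    """
    started = clock()
    while True:
        if is_healthy():
            return clock() - started
        if clock() - started >= timeout_seconds:
            return None
        sleep(interval_seconds)
```

Start the loop right after you kill a VM or container; the returned value is a rough time-to-recovery you can track from one test run to the next.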

Footnote* 
Failure is NOT a once/twice a year event. If your site gets billions of queries/requests/hits a day, then a once-in-a-billion event happens every day; at 20 billion requests a day it happens about 20 times a day. Plan for it and embrace it. Last but not least, always adhere to the rules of designing and developing operations-friendly services.
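
As a quick sanity check on this kind of scale arithmetic: the expected number of rare events per day is simply the daily request count divided by the odds. The traffic figures in the sketch below are illustrative assumptions, not measurements.

```python
def expected_events_per_day(requests_per_day, one_in_n):
    """Expected daily occurrences of an event that happens once per one_in_n requests."""
    return requests_per_day / one_in_n


# Illustrative traffic levels (assumed): 1 billion and 20 billion requests a day.
for requests in (1_000_000_000, 20_000_000_000):
    events = expected_events_per_day(requests, 1_000_000_000)
    print(f"{requests:,} requests/day -> {events:g} once-in-a-billion events/day")
```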
