About Me

Rohit is an investor, startup advisor and an Application Modernization Scale Specialist working at Google.

Wednesday, June 8, 2016

Creating Chaos in Cloud Foundry

One of the key tenets of operational readiness is to be prepared for every emergency. The best way to institutionalize this discipline is to repeatedly create chaos in your own production deployment and monitor how the system recovers. Below is a list of tools from the PCF Solutions team at Pivotal and others for causing chaos at every level of the Cloud Foundry stack.

Tools, Presentations & Repos:
https://github.com/xchapter7x/chaospeddler
https://github.com/xchapter7x/cf-app-attack
https://github.com/strepsirrhini-army/chaos-lemur
https://github.com/FidelityInternational/chaos-galago
https://github.com/skibum55/chaos-as-a-service
Monkeys & Lemurs and Locusts Oh My - Anti-Fragile Platforms

Type of test/event/task



1. BOSH
* bosh target (director ip)
* bosh login (director username/password obtained from Ops Man)
* bosh download manifest cf-(hash) ~/cf.yml
* bosh deployment ~/cf.yml
* bosh vms/cck
* bosh ssh
* bosh logs
* bosh debug (gives you the job/task logs)
2. VM Recovery
* Terminate a VM by deleting it in vSphere, watch it come back up
3. App Recovery
* Terminate an app using a cf CLI plugin, then watch it come back up.
4. Correlate logs?
* Watch logs for steps above
5. Chaos Monkeys
* Execute Chaos Lemur and watch bosh/cf respond
6. Director
* Shut VM down/delete in vCenter
* When it's down, which apps still run?
* Once VM is gone, how do you get it back/rebuild?
7. Network switch
8. Hypervisor
9. Credentials that expire:
* Certs that have expiration date
* System Accounts (internal CF system accounts)
* vCenter API Account that CF uses
10. Log Insight goes down
11. Kill container
12. Kill VM
13. Kill DEA
14. Kill Router
15. Kill Health Manager
16. Kill Binary Repository
* Then scale
17. Over-allocate Hardware (how do we do it?)
18. Execute and backout a change to CF
19. Buildpack upgrade and rollback
20. Verify that the right apps have the right buildpack
21. Licensing server scenario (for example, can't connect)
22. Double up single components (for example, 2 BOSHes)
23. Kill internal message bus
24. DNS
25. Clock drift
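
For item 9 (credentials that expire), it helps to know which certificates are close to expiry before you rehearse the outage. The following is a minimal sketch, assuming the certificate is available as a PEM file on disk and that openssl is installed. The function name cert_expires_within and its arguments are hypothetical names chosen for illustration, not part of any CF or BOSH tooling.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of CF or BOSH): report whether a PEM
# certificate expires within a given number of days.
cert_expires_within() {
  local cert_file="$1"      # path to a PEM-encoded certificate (assumed)
  local days="${2:-30}"     # warning window in days, defaults to 30
  local seconds=$((days * 86400))

  # "openssl x509 -checkend N" exits 0 if the certificate is still valid
  # N seconds from now, and non-zero otherwise.
  if openssl x509 -checkend "$seconds" -noout -in "$cert_file" >/dev/null 2>&1; then
    echo "OK: $cert_file is valid for at least $days more days"
    return 0
  else
    echo "WARNING: $cert_file expires within $days days (or could not be read)"
    return 1
  fi
}
```

For example, cert_expires_within /path/to/router-cert.pem 60 would flag a certificate that expires within the next two months. The path here is a placeholder.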



Chaos Testing Procedure: 
Killed VMs from vSphere; used bosh tasks --no-filter in a loop to watch the resurrector bring them back up
bosh ssh and sudo kill -9 -1 are also fun
Used bosh ssh to get into a DEA and killed a container
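
The "bosh tasks --no-filter in a loop" step above can be sketched roughly as follows. This is only an illustration, assuming the v1 bosh CLI is already targeted and logged in to the director. The function name watch_bosh_tasks, the BOSH_CMD variable and the interval and poll-count arguments are hypothetical names introduced here.

```shell
#!/usr/bin/env bash
# Hypothetical helper: poll the director's task list while VMs are being
# killed, so resurrector activity shows up as it happens.
# BOSH_CMD is an illustrative override, not a BOSH feature.
BOSH_CMD="${BOSH_CMD:-bosh}"

watch_bosh_tasks() {
  local interval="${1:-10}"   # seconds to sleep between polls
  local max_polls="${2:-0}"   # 0 means keep polling until interrupted
  local count=0

  while true; do
    date '+%Y-%m-%d %H:%M:%S'
    # --no-filter asks for all task types rather than the default subset
    "$BOSH_CMD" tasks --no-filter \
      || echo "bosh tasks failed; is the director reachable?" >&2

    count=$((count + 1))
    if [ "$max_polls" -gt 0 ] && [ "$count" -ge "$max_polls" ]; then
      break
    fi
    sleep "$interval"
  done
}
```

Running watch_bosh_tasks 5 in one terminal while deleting a VM in vSphere from another shows the task list every five seconds; stop it with Ctrl-C.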
