About Me

Rohit is an investor, startup advisor and an Application Modernization Scale Specialist working at Google.

Sunday, October 8, 2017

Emergent Systems and the need for Chaos Architecture

The ideas below are an amalgamation of key signals from Adrian Cockcroft, Neal Ford, Matt Stine, Russ Miles, Michael Nygard and the rockstar engineers of Netflix who have pioneered Chaos Engineering.

For the long-term survival of your microservices system, some key concepts have now come together, chiefly: anti-fragility, continuous partial failure, and evolutionary architecture.

Since all of us seem to be building networks of distributed microservices, there is no way to exercise the emergent behavior of these systems in a test environment. We HAVE to run controlled experiments in production. Chaos engineering and a chaos-friendly architecture are critical for enterprises to maintain the availability of their applications and survive breaches. Adrian Cockcroft, in his recent Cloud Native London keynote, espoused four layers, two teams and an attitude.

Chaos Engineering is the discipline of experimenting on a distributed system
in order to build confidence in the system’s capability
to withstand turbulent conditions in production.

Chaos engineering is the best continuous, holistic way to evaluate a distributed system's ability to withstand impact and external perturbation. So what are the different levels at which chaos engineering applies?

Layer-1: Infrastructure. Lay out the infrastructure so that there is no single point of failure (SPOF): multiple zones, multiple regions. The app is distributed enough times and in enough ways to give diversity in the infrastructure.

Layer-2: Switching and interconnect. You need a strategy for interconnecting. No SPOF means data lives in more than one place, which requires replicating data to the other side of the world. Data needs to be on more than one disk, in a different building. Routing needs to handle failover across datacenters transparently. Unfortunately, DR routing/failover is the least well tested set of components in the system, and usually the error handling code explodes on impact. Once the fallen datacenter comes back up, you need to re-route and re-synchronize, bringing anti-entropy back into the system. It is critical to test failover to the backup datacenter regularly. Test HA/DR across datacenters properly instead of performing availability theater.

Layer-3: Application. What does the app do when it experiences data loss, network connectivity failures, timeouts, error returns, slow responses or network partitions? Does it hang, or go 100% busy?
Single functions and microservices can be tested to do one thing. Lambda is easy to develop for: the function is the unit of testing and deployment. Monoliths lead to combination testing, with a lot of variations.
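
To make the Layer-3 questions concrete, here is a minimal sketch of application-level failure injection in Python. It is not how FIT or any Netflix tool works; `with_chaos`, `call_with_fallback` and `lookup_price` are hypothetical names made up for this illustration. The wrapper makes a fraction of calls fail or respond slowly, and the caller shows one possible answer to "what does the app do": degrade to a fallback value.

```python
import random
import time


class InjectedFault(Exception):
    """Raised by the wrapper to simulate an error return from a dependency."""


def with_chaos(func, error_rate=0.0, delay_rate=0.0, delay_seconds=0.0,
               rng=None, sleep=time.sleep):
    """Wrap func so that a fraction of calls fail or respond slowly.

    error_rate and delay_rate are probabilities in [0, 1]; rng and sleep
    are injectable so an experiment can be made repeatable in a test.
    """
    rng = rng or random.Random()

    def wrapper(*args, **kwargs):
        roll = rng.random()  # uniform in [0.0, 1.0)
        if roll < error_rate:
            raise InjectedFault("injected error")
        if roll < error_rate + delay_rate:
            sleep(delay_seconds)  # simulate a slow response
        return func(*args, **kwargs)

    return wrapper


def call_with_fallback(func, fallback, *args, **kwargs):
    """One possible application reaction: degrade instead of blowing up."""
    try:
        return func(*args, **kwargs)
    except InjectedFault:
        return fallback


def lookup_price(item):
    # Hypothetical stand-in for a real downstream call.
    return {"apple": 3}.get(item, 0)
```

Wrapping `lookup_price` with `error_rate=1.0` forces every call down the fallback path, which is the code path that rarely gets exercised otherwise.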

Layer-4: People. When machines misbehave, people really screw it up; usually folks make it worse. There are countless stories of systems that were thrashed by their operators in a comedy of errors (Chernobyl). It is super important to practice gamedays, similar to how children in kindergarten practice a fire drill. A reboot may be the wrong thing to do when you have services. Play out fire drills so that when there is an actual fire people take the right action: disaster preparedness. Practice, practice, practice.

Tools that attack the different layers
  1. Game Days - exercise an outage. Practice how everyone is supposed to behave: getting folks on a call, finding the dashboards, and digging into the details.
  2. Simian Army - tests once a month.
  3. FIT (Failure Injection Testing) - deep injection of failures. ChAP - the Chaos Automation Platform.
  4. Gremlin Inc. - automates chaos engineering scenarios and gamedays. Has an undo button. Safer with automation.
  5. There is an excellent catalog of tools at the end of the Chaos Engineering eBook.

Two Teams
  1. Security Blue Team / Security Red Team (break into your site)
  2. Chaos Engineering Team / SRE Team

Companies and services that help make systems secure and resilient:
  • AttackIQ
  • SafeBreach
  • Spear phishing

Attitude - Improve your chaos posture
- O'Reilly book: ChAP and the Chaos Maturity Model give a roadmap to improve your chaos game.
- A Chaos Engineering community day is coming up in London. This is becoming a thing.

If you want to create a system with 99.999% or even 99.999999% availability, it is important to establish a Chaos Engineering practice that keeps the team safe and the whole stack reliable.
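
A quick back-of-the-envelope calculation shows how little room such targets leave. The sketch below assumes the figure above means "five nines" (99.999%), uses a 365-day year, and treats failures of chained components as independent, which real systems often violate. The function names are mine, made up for this illustration.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes; leap years ignored


def downtime_minutes_per_year(availability):
    """Yearly downtime budget for an availability given as a fraction, e.g. 0.99999."""
    if not 0.0 <= availability <= 1.0:
        raise ValueError("availability must be between 0 and 1")
    return (1.0 - availability) * MINUTES_PER_YEAR


def serial_availability(*parts):
    """Availability of a chain in which every part must be up.

    Assumes independent failures, so the fractions simply multiply.
    """
    result = 1.0
    for part in parts:
        result *= part
    return result
```

Under these assumptions, `downtime_minutes_per_year(0.99999)` comes to roughly 5.3 minutes a year, and two "three nines" services called in series give `serial_availability(0.999, 0.999)`, about 0.998, which is worse than either service alone. That is why every layer of the stack needs to be exercised, not just the application.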


Go run a gameday. Bring in people experienced in simulating outages, and make cleanup easy.
Start at the top and work your way down.

Emergent Behavior in Microservices [Notes from the Stine-Nygard podcast]

Emergent behaviors are not always desirable; the emergent behaviors we want are the ones that lead to favorable outcomes.
Road to salvation: microservices. But you need to be tall enough.
Failure modes vs positive emergent behavior.
Continuous partial failure => embrace that state => architecture

Continuous Partial Failure
A large, interconnected, interdependent network of systems: the enterprise is a giant distributed system.
Things fail. Things will go down; that is expected. Accepting it is a way to lead us to better design > freedom and flexibility

For a service provider there is no real difference between downtime and a release. CPF is a statistical reality. Loosening the coupling between services, so that they can be upgraded and released at their own cadence, is necessary for success.

Why microservices? Adopters fall into two camps

1. Frustrated with the pace of change in IT and with long monolithic projects. Decouple and decentralize to gain speed and maneuverability. Decentralizing > speed and maneuverability > some way of prioritizing > make new ones, kill old ones > decide what is centralized and what is decentralized. Microservices allow for disposing of code instead of paying to maintain it. You want to be able to rewrite code on a short-term basis.

2. Because Netflix OSS IS DOING it. It is the language of the day for talking about systems: decentralization and resiliency through microservices. But if I am doing a thousand deployments, what is the point if I still can't achieve a strategic objective?

Microservices are NOT a panacea
Going down the road to microservices: figure out the problems of the future.
Change in direction > bandwagon jumpers > cargo cult (see the Martin Fowler article)

What is it you are trying to achieve with microservices? Early adopters of CMM saw dramatic success; followers ended up with document templates. Cargo-culting a microservices template won't take you any closer to your goal. Teach and communicate the why of microservices broadly.

Anti-fragile software is needed because we build networks of systems with higher coupling. The tighter the coupling, the more synchronous the continuous processes, and the more production failures. A high degree of dynamic coupling means correlated dynamics: the system moves into a new regime of behavior, with integrated distributed systems running in sync and exposed to unpredictable traffic.

Systems we cannot understand and cannot control will FAIL. The usual response: make systems robust and get them to equilibrium. But robust systems do not accommodate change; they are not flexible. New systems can multiply demand on a core system and destabilize the whole company. ARB review processes slow down adaptation to the competitive environment.

Build anti-fragility into the systems: new software should leave the system more survivable instead of weaker.
Do not create a monoculture. We will be more robust with a variety of configurations instead of CREATING A MONOCULTURE. Art of War > maneuverability > microservices

Water > nature of the ground > evolutionary design.
Survival of the fittest among services: a competitive, market-based economy for services.
Build some disposable services. Plan to throw things away.

Flow of water: change course, discover the terrain in your business, as opposed to their businesses.
Competitive landscape: fitness is always being tested and challenged. Ecosystems > organisms continually create new niches. Plan to adapt and keep changing. People will create things that allow other people to do things on top of them.
Always look at
- where is my organism in my competitive landscape
- what is going on inside the company

Parallels between Evolutionary Biology and Microservices
Redundant services: redundancy is uncomfortable for conventional IT, which reduces cost by having one system. Be comfortable with inefficiency at the micro level to get competitive advantage at the macro level.

Google services = biological evolution. We tend to put more resources on slower projects and starve production resources; instead, let the strength flow into the part that is working.

Software Dogma
1. Redundancy is the opposite of DRY, and DRY is becoming a trump card.
DRY within a code base is fine, but applied across teams it creates dependencies outside the team. DRY should be applied with a view to complexity. Be skeptical of it between teams and codebases.

2. YAGNI became another commandment. 2/3 services leads to duplication. Moving everything to REST interfaces invites disaster: entity services with CRUD interfaces. We are fooled by the noun into seeing one unitary whole concept.

Behavior oriented aggregates instead of data orientation
Look at the behavior of the service, not the data. Aggregates need to have data. Beware a single service with multiple concerns: a single noun for multiple behaviors. Order Service .. < document. Partitioning things may look like duplication. Breaking up concepts at the right granularity scales and decouples the system, and breaks dependencies between teams.
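
As an illustration of partitioning by behavior rather than by noun, here is a small Python sketch. Instead of one "Order" entity service with CRUD operations, pricing and fulfillment each keep their own view of an order. All class and function names are hypothetical, invented for this example and not taken from the podcast.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LineItem:
    sku: str
    quantity: int
    unit_price_cents: int


@dataclass(frozen=True)
class PricedOrder:
    """Pricing's view of an order: only what is needed to compute a total."""
    items: tuple

    def total_cents(self):
        return sum(i.quantity * i.unit_price_cents for i in self.items)


@dataclass(frozen=True)
class Shipment:
    """Fulfillment's view of the same order: SKUs and unit counts, no prices."""
    address: str
    units: dict


def price_order(items):
    """Behavior owned by a pricing service."""
    return PricedOrder(items=tuple(items))


def plan_shipment(items, address):
    """Behavior owned by a fulfillment service."""
    units = {}
    for item in items:
        units[item.sku] = units.get(item.sku, 0) + item.quantity
    return Shipment(address=address, units=units)
```

The two views overlap, which can look like duplication, but neither team needs the other's fields or release schedule to change its own behavior.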

What services should I have ?
Influenced by the programming paradigm shift: nouns >> behaviors and verbs.
Service Topology .. verbs and behaviors ...

in the services - verbs
passed to the services - adjectives

Over-specialization vs open-ended data formats
Open-ended data formats ... over-specify > lose flexibility
Is CDC too much? ... protobufs ... schema registries ... lose schema flexibility
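
One way to keep that flexibility on the consuming side is to read only the fields you need and ignore the rest. The Python sketch below assumes a JSON payload with a `customer.email` field; the payload shape and the function name are assumptions made up for this illustration.

```python
import json


def read_customer_email(payload, default=None):
    """Pull out only the field this consumer needs and ignore everything else.

    Unknown or newly added fields in the payload do not break the reader.
    """
    document = json.loads(payload)
    customer = document.get("customer")
    if not isinstance(customer, dict):
        return default
    return customer.get("email", default)
```

A producer can then add fields, or a whole new section, without forcing this consumer to redeploy. The trade-off is that nothing checks the contract up front, which is the gap that schemas and contract tests are meant to close.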

This can be attacked in two ways in Java:
>>> language- and framework-driven: annotations >> POJOs; Clojure vs Java
>>> architectural remedy in the Java world: keep the domain decoupled from the implementation technology.

Hexagonal Architecture
Port  --> Application Layer  --> Domain Layer --> Application Layer --> Adapter

wire object >> object representation of data >> translation to domain-based object
HTTP request ... > object > shippable > object > repository
The Java 8/9 Streams API will help ...
Leverage a functional programming language like Clojure, a data-first programming language: data on the wire and not data as domain objects.
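
The port/adapter flow above can be sketched in a few lines. The notes talk about Java and Clojure; this sketch uses Python only to keep it short, and every name in it (`Shipment`, `ShipmentRepository`, `ShippingService`, `handle_http_body`, the `orderId` and `destination` wire fields) is a hypothetical choice for illustration.

```python
import json
from dataclasses import dataclass
from typing import Protocol


@dataclass(frozen=True)
class Shipment:
    """Domain object: knows nothing about HTTP, JSON or storage."""
    order_id: str
    destination: str


class ShipmentRepository(Protocol):
    """Outbound port: what the application needs from persistence."""

    def save(self, shipment: Shipment) -> None: ...


class InMemoryShipmentRepository:
    """Outbound adapter: one interchangeable implementation of the port."""

    def __init__(self):
        self.saved = []

    def save(self, shipment):
        self.saved.append(shipment)


class ShippingService:
    """Application layer: orchestrates the domain and knows only the port."""

    def __init__(self, repository: ShipmentRepository):
        self._repository = repository

    def ship(self, order_id, destination):
        if not destination:
            raise ValueError("destination is required")
        shipment = Shipment(order_id, destination)
        self._repository.save(shipment)
        return shipment


def handle_http_body(body, service):
    """Inbound adapter: wire format -> plain values -> application call."""
    wire = json.loads(body)
    return service.ship(wire["orderId"], wire["destination"])
```

Swapping the in-memory repository for a database-backed one, or the JSON handler for a message consumer, touches only the adapters; the domain and application layers stay as they are.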

Useful Resources by Michael
Portland Pattern Repository ... http://c2.com/ppr/titles.html
Books on Pattern-Oriented Software Architecture
