About Me

Rohit is an investor, startup advisor and an Application Modernization Scale Specialist working at Google.

Thursday, March 3, 2016

Diego BBS, etcd and Consul - 3 is a crowd

PCF 1.6 has now been completely redesigned around the Diego runtime. In case you are wondering how this all works now, check out the picture below.



Diego internally uses a couple of registries to store internal state: primarily Consul as a NAT-less service registry, and etcd as an ephemeral data store for the Diego bulletin board service (BBS). Once corruption creeps in because of connectivity or disk problems, these distributed registries start arguing. They cannot reconcile the state of the world with the Cloud Controller because of lost quorum, failed leader election, or distributed key lockout. Basically, the instances need to get the same key. You can specify the key in the newer versions, but nothing enforces state in the case of corruption.
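
Consul and etcd both use Raft consensus, where a write or a leader election needs a majority of the members. A minimal sketch of that arithmetic follows; the helper name quorum is made up for illustration and is not part of any PCF tooling. It shows why a cluster that loses too many members stalls, and why a single-member cluster only needs its own vote.

```shell
#!/bin/sh
# Majority quorum for a cluster of n voting members: floor(n/2) + 1.
# "quorum" is a hypothetical helper name used only for this illustration.
quorum() {
  echo $(( $1 / 2 + 1 ))
}

# Print the quorum size and how many member failures each cluster size survives.
for n in 1 2 3 5; do
  q=$(quorum "$n")
  echo "members=$n quorum=$q tolerated_failures=$(( n - q ))"
done
# members=1 quorum=1 tolerated_failures=0
# members=2 quorum=2 tolerated_failures=0
# members=3 quorum=2 tolerated_failures=1
# members=5 quorum=3 tolerated_failures=2
```

Note that two members tolerate no more failures than one, which is why these jobs are normally run with an odd number of instances.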

I am sure there is a proper technical reason why these distributed K,V stores behave the way they do in these situations; however, my intent here is to teach you how to get out of this situation if you somehow end up in it.

So in situations where the Consul, etcd and Diego Brain BBS jobs act like errant brothers and fight with one another to reconstitute the state of the world from the Cloud Controller, use the following recipe to keep sanity in the house.

1. Restart all the VMs of the Consul, ETCD and Diego BBS nodes in that order.

2. Wipe out the member DB files, i.e. nuke the data directory, on the etcd VMs, then monit restart all services on the etcd nodes. To delete the contents of the etcd data store:
  1. Run bosh vms to identify the etcd nodes in your deployment.
  2. Run bosh ssh to open a secure shell into each etcd node.
  3. Run monit stop etcd on each etcd node.
  4. Delete or move the etcd storage directory, /var/vcap/store, on each etcd node.
  5. Run monit start etcd on each etcd node.
3. Reduce the number of Consul instances to 1 and deploy the BOSH release again. If this fails, follow the same process for the other errant jobs. In the end you may be left with just one instance of each of these 3 jobs. Once you have a healthy deployment, bump the number of job instances back up in the Resources page of OpsManager.

4. Orphaned etcd processes can sometimes cause the Diego BBS job to fail. This can be remedied by running killall etcd on the Diego BBS VMs and then redeploying.

5. Nuke and pave: take a backup of your PCF deployment with cfops, do a brand-new deployment, and then restore.
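
The etcd wipe in step 2 can be sketched as a small shell script to run on each etcd node after bosh ssh. Treat it as a sketch under assumptions rather than a tested runbook: only the monit commands and the /var/vcap/store path come from the steps above, while the DRY_RUN switch, the run helper, the wipe_etcd_store function and the .bak suffix are illustrative additions. It defaults to a dry run that only prints what it would do.

```shell
#!/bin/sh
# Sketch of the etcd data store wipe (step 2), for ONE etcd node.
# Assumption: DRY_RUN, run, wipe_etcd_store and the ".bak" suffix are
# hypothetical names for this example, not part of PCF, BOSH or monit.
DRY_RUN="${DRY_RUN:-1}"                    # 1 = only print commands (default)
STORE_DIR="${STORE_DIR:-/var/vcap/store}"  # path as given in the post

# Print the command in dry-run mode, otherwise execute it.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "would run: $*"
  else
    "$@"
  fi
}

wipe_etcd_store() {
  # monit may return before the process has fully stopped; this sketch does
  # not wait, so check "monit summary" by hand before a real run moves data.
  run monit stop etcd
  # Move rather than delete, so the old data can be inspected later.
  # The path may need adjusting for your deployment.
  run mv "$STORE_DIR" "$STORE_DIR.bak"
  run monit start etcd
}

wipe_etcd_store
```

Run it once as-is to review the printed commands, then rerun with DRY_RUN=0 as root on each etcd node. Moving the directory instead of deleting it is a deliberate choice here: it leaves a way back if the wipe turns out not to be the fix.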
