About Me

Rohit leads the Pivotal Labs App Modernization Practice, covering engineering, delivery, training and cross-functional enablement, tooling, scoping, selling, recruiting, marketing, blog posts, webinars and conference sessions. He has led multiple enterprise engagements, including ones featured in the Wall Street Journal, and focuses on designing, implementing and consulting on enterprise software solutions for Fortune 500 companies around application migration and modernization.

Saturday, April 9, 2016

Debugging Node.js high CPU, crash and memory issues in Cloud Foundry - part 1

These are the scenarios that keep a DevOps engineer awake at night: an instance of a Node.js app is running at 100% CPU in Cloud Foundry with no logging and no resiliency protection like circuit breakers. This is the third occurrence of the problem in the last month. Management drags you into day-long meetings where each side blames the other, the app is restarted, the problem goes away and the cycle repeats.

With apps increasingly wrapped in Russian-doll layers of abstraction [bare metal => VM => container => Node process], getting to the root cause has become increasingly difficult.

Suppose you are a Java person with no experience debugging a production Node.js app. After drinking an appropriate amount of coffee and fruity alcoholic drinks, you land on http://techblog.netflix.com/2015/12/debugging-nodejs-in-production.html.

The recommendations from Yunong Xiao in that presentation essentially boil down to:
1. Use node-restify as the REST framework in the Node app to log and observe app performance.
2. Use Linux perf events to statistically sample stack traces and construct CPU flame graphs. This technique can reliably find the proverbial needle in the haystack.
3. Take core dumps of the app state for post-mortem debugging with the mdb debugger. Unfortunately mdb can only run on Solaris, so you have to transfer the core dump to a debug Solaris VM instance [1] [2].
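Option 2 is worth sketching, since it is the least familiar to most Java folks. A minimal flame-graph session might look like the following, assuming perf is installed on the host and Brendan Gregg's FlameGraph scripts are checked out locally; node should be started with --perf-basic-prof so perf can resolve JIT'd JavaScript frames:

```shell
# Sample on-CPU stacks of the hottest node process at 99 Hz for 30 seconds.
perf record -F 99 -p "$(pgrep -n node)" -g -- sleep 30

# Fold the sampled stacks and render an SVG flame graph.
# stackcollapse-perf.pl and flamegraph.pl come from the FlameGraph repo.
perf script | ./stackcollapse-perf.pl > out.folded
./flamegraph.pl out.folded > cpu-flamegraph.svg
```

Wide plateaus in the resulting SVG are the functions burning your CPU.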

Unfortunately both options 1 and 2 are out: option 1 because the app-dev team would need to get involved, and they don't look kindly on a DevOps guy prescribing a REST framework; option 2 because CF does not bake perf events into the OS stemcell. Yikes!!! Looks like good old gcore is the only way to go ...

OK, so with the introduction of Diego, finding your way into the container should be as easy as doing a cf ssh and then issuing the gcore command, right? Nope. Once ssh'ed into the container there is no way to elevate your privilege and attach gdb to your Node.js process. You will need to find another way to elevate privilege and take a core dump.
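To make the dead end concrete, here is roughly what that first attempt looks like (app name and instance index are illustrative):

```shell
# ssh into instance 0 of the app via Diego's cf ssh
cf ssh my-node-app -i 0

# Inside the container you are the unprivileged vcap user:
ps aux | grep node     # you can see the node process...
gdb -p <node-pid>      # ...but attaching fails: ptrace is not permitted
```

There is no sudo and no way to grant yourself ptrace capability from inside, which is what forces the detour below.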

That other way is BOSH and Veritas. The gist of the method: bosh ssh into the correct Diego cell, which you identify by first examining the app metadata. Then, on the Diego cell, install Veritas, the Garden explorer tool, and use it to map from the app instance to the container instance. From there, ./wsh into the Garden container. You can now elevate your privilege with sudo su - and issue gdb/gcore against the Node.js server process.
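Condensed into commands, the path looks roughly like this. Treat it as a sketch: the app name, Diego cell job name, container handle and exact Veritas invocation vary per installation and release, and are placeholders here:

```shell
# 1. Find the app's GUID and which cell hosts the misbehaving instance
cf app my-node-app --guid
cf curl /v2/apps/<guid>/stats      # host IP of each instance

# 2. bosh ssh into the Diego cell whose IP matched (job name varies)
bosh ssh cell_z1/0

# 3. On the cell, use veritas to map the app instance to a Garden
#    container handle (download veritas from its GitHub releases first)
./veritas containers

# 4. Drop into the Garden container's depot directory and enter it as root
cd /var/vcap/data/garden/depot/<container-handle>
sudo ./bin/wsh

# 5. Now privileged inside the container, take the core dump
gcore -o /tmp/node-core "$(pgrep -n node)"
```

The key trick is step 4: wsh is run from the cell as root, so the shell you get inside the container is privileged, unlike the cf ssh session.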

Once the dump has been generated, sftp it to a debug Solaris instance and run diagnostics with mdb commands like ::jsstack; see this video from Joyent that explains how to walk the dump.
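On the Solaris box the post-mortem session is driven by the mdb_v8 dcmds. A typical sequence, assuming the node binary and core file have been copied over and the mdb_v8 module is available (paths are illustrative):

```shell
# Open the core against the matching node binary
mdb /path/to/node /tmp/node-core

# Inside mdb:
> ::load v8.so        # load the mdb_v8 module (path/name may differ)
> ::jsstack           # JavaScript-level stack at the moment of the dump
> ::findjsobjects     # summarize JS heap objects, handy for leak hunting
> ::jsprint <addr>    # pretty-print a specific object found above
```

::jsstack alone usually answers the 100%-CPU question: it shows exactly which JS frames the event loop was stuck in.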

Part 2 of this blog goes into the gory details of discovering the container and generating the core dump.
