About Me

Rohit is an investor, startup advisor, and Application Modernization Scale Specialist at Google.

Saturday, October 10, 2015

Method to get to the root cause of a memory leak in the Cloud

From an application developer's perspective, our apps increasingly run in Russian-doll deployments: app code wrapped in an app-server container, which is wrapped in an LXC container running within a VM, running on one or more hypervisors, which in turn run on bare metal. In such a scenario, determining the root cause of a memory leak becomes difficult. Below is a process that can be used to get to the eureka moment.

The basic principle for getting to root cause is to eliminate the variables one by one. We start at the top of the stack and work our way down. Remember that the JVM is like an iceberg: there is the Java heap above the water, and an unbounded native memory portion beneath the surface. Java heap OutOfMemory errors are easier to fix than native memory leaks. Native leaks are generally caused by errant libraries/frameworks, JDKs, app-servers, or some unexplained OS-container-JDK interaction.
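Both halves of the iceberg can be observed from inside the process: the JVM's standard MemoryMXBean reports heap and non-heap (metaspace, code cache) usage separately, though native memory that libraries allocate via malloc or direct buffers shows up in neither number. A minimal sketch (the class name is illustrative):

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryMXBean;
import java.lang.management.MemoryUsage;

public class IcebergReport {
    public static void main(String[] args) {
        MemoryMXBean mem = ManagementFactory.getMemoryMXBean();
        // The part of the iceberg above the water: the Java heap.
        MemoryUsage heap = mem.getHeapMemoryUsage();
        // Part of what is below the water: metaspace, code cache, etc.
        // Note: memory malloc'd by native libraries is invisible even here.
        MemoryUsage nonHeap = mem.getNonHeapMemoryUsage();
        System.out.printf("heap used: %d MB, committed: %d MB%n",
                heap.getUsed() >> 20, heap.getCommitted() >> 20);
        System.out.printf("non-heap used: %d MB, committed: %d MB%n",
                nonHeap.getUsed() >> 20, nonHeap.getCommitted() >> 20);
    }
}
```

Comparing these numbers against the total process size from the OS (e.g. RSS in `ps`) is what reveals a native leak: the gap between process size and heap + non-heap keeps growing.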

OK, let's get to it ...

First, establish a process to measure the JVM heap and native process size, perhaps using a dump script like https://github.com/dmikusa-pivotal/cf-debug-tools#use-profiled-to-dump-the-jvm-native-memory. Remember to take heap dumps before, during, and after the load run. Once the test is close to completion, take native process core dumps using kill -6 (SIGABRT) or kill -11 (SIGSEGV). This test procedure is then repeated as you eliminate each variable below.
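Besides a dump script, a heap dump can be triggered programmatically on HotSpot-based JDKs via the com.sun.management.HotSpotDiagnosticMXBean. This is an alternative sketch, not what the script linked above does, and the class and file names here are hypothetical:

```java
import com.sun.management.HotSpotDiagnosticMXBean;
import java.lang.management.ManagementFactory;

public class HeapDumper {
    public static void dump(String path) throws Exception {
        // HotSpot-specific MXBean; assumes an OpenJDK/Oracle HotSpot JVM.
        HotSpotDiagnosticMXBean bean =
                ManagementFactory.getPlatformMXBean(HotSpotDiagnosticMXBean.class);
        // second arg true => dump only live objects (forces a GC first);
        // fails with IOException if the target file already exists.
        bean.dumpHeap(path, true);
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical file name; the .hprof can be opened in MAT.
        dump("before-load.hprof");
    }
}
```

Calling this at the before/during/after checkpoints of the load run gives you comparable .hprof snapshots without shelling into the container.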

1. [app] First look at the application source for the usual memory leak anti-patterns: DirectByteBuffers, ThreadLocals, statics, classloader retention, missing resource cleanup, etc. This is where you will get the maximum bang for the buck. Take a heap dump of the JVM process and analyze it in Eclipse Memory Analyzer (MAT) or IBM HeapAnalyzer.
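Two of these anti-patterns can be sketched in a few lines; the class and method names below are hypothetical, purely for illustration:

```java
import java.util.ArrayList;
import java.util.List;

public class LeakAntiPatterns {
    // Anti-pattern: a static collection that is only ever appended to.
    // Every entry stays reachable from the class, so the GC can never
    // reclaim it and the heap grows without bound.
    static final List<byte[]> CACHE = new ArrayList<>();

    // Anti-pattern: a ThreadLocal that is set but never removed. On a
    // pooled app-server thread the value outlives the request that set it.
    static final ThreadLocal<byte[]> PER_REQUEST = new ThreadLocal<>();

    static void handleRequest() {
        CACHE.add(new byte[64 * 1024]);      // 64 KB retained per "request"
        PER_REQUEST.set(new byte[64 * 1024]);
        // Fix: call PER_REQUEST.remove() in a finally block, and give
        // CACHE an eviction policy (or use a bounded/weak-reference cache).
    }

    public static void main(String[] args) {
        for (int i = 0; i < 100; i++) handleRequest();
        System.out.println("retained buffers: " + CACHE.size());
    }
}
```

In a heap dump both show up the same way: a dominator tree in MAT where one GC root (the class, or the pooled Thread) retains a suspiciously large, ever-growing set of objects.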

2. [jdk] Eliminate the JDK as a factor in the leak by switching JVM implementations, e.g. moving from OpenJDK to the Oracle JDK, or from a HotSpot-based JDK to the IBM JDK (J9). See the full list of JVM implementations at https://en.wikipedia.org/wiki/List_of_Java_virtual_machines.

3. [app-server] If simple eyeballing does not help, then switch the app-server, i.e. move from Tomcat to Jetty, or from Undertow to Tomcat. If your app runs on WebSphere or WebLogic and cannot be ported, then my apologies. Call 1-800-IBM-Support.

4. [container] If your droplet (app + libraries/frameworks + JVM) is running within a container in Cloud Foundry or Docker, then try switching out the containers, e.g. if running within the Warden container, run the same app within a Docker container. Try changing Docker base images and see if the leak goes away.

5. [hypervisor] If running on AWS, switch to OpenStack or vSphere, and vice versa. You get the idea. Cloud Foundry makes this easy, since you can stand up the same CF deployment on all three providers.

6. [bare-metal] Run the app on a bare-metal server to check whether the leak persists.

7. [sweep-under-the-rug] Once you are ready to pull your hair out, resort to tuning the JDK. Start playing with JVM options like -Xms234M -Xmx234M -XX:MetaspaceSize=128M -XX:MaxMetaspaceSize=128M -Xss228K. In Cloud Foundry these are set by the memory calculator, which is influenced by the memory_heuristics and memory_sizes env vars:
  • JBP_CONFIG_OPEN_JDK_JRE: '[memory_heuristics: {heap: 55, metaspace: 30}, memory_sizes: {metaspace: 4096m..}]'
  • JBP_CONFIG_OPEN_JDK_JRE: '[memory_calculator: {memory_sizes: {stack: 228k, heap: 225M}}]'
  • JBP_CONFIG_OPEN_JDK_JRE: '[memory_calculator: {memory_heuristics: {stack: .01, heap: 10, metaspace: 2}}]'
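Once the flags are set, it is worth verifying from inside the container that they actually took effect; here is a small sketch using the standard RuntimeMXBean (the class name is illustrative):

```java
import java.lang.management.ManagementFactory;

public class ShowJvmSettings {
    public static void main(String[] args) {
        // The flags the JVM was actually launched with (-Xmx, -Xss, etc.),
        // i.e. what the buildpack's memory calculator produced.
        System.out.println("launch args: "
                + ManagementFactory.getRuntimeMXBean().getInputArguments());
        // Effective max heap in MB; should line up with -Xmx.
        System.out.println("max heap MB: "
                + (Runtime.getRuntime().maxMemory() >> 20));
    }
}
```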
As you can see, the options become increasingly cumbersome as you keep going down this list. Your best hope is to catch it at 1, 2, or 3. Good Luck Hunting!
