About Me

My photo
Rohit leads the Pivotal Labs App Modernization Practice in engineering, delivery training & cross-functional enablement, tooling, scoping, selling, recruiting, marketing, blog posts, webinars and conference sessions. Rohit has led multiple enterprise engagements including ones featured in the Wall Street Journal. Rohit focuses on designing, implementing and consulting with enterprise software solutions for Fortune 500 companies on application migration and modernization.

Thursday, October 1, 2015

Chasing Cloud Foundry OutOfMemory Errors - OOM

If you are ever unfortunate enough to troubleshoot application, java heap or native process OOM issues in Cloud Foundry follow the playbook below to get to the root cause:

1. Include the attached script dump.sh at the root of your JAR/WAR file. You can edit the LOOP_WAIT variable in the script to configure how often it will dump the Java NMT info. I'd suggest somewhere between 5 and 30 seconds, depending on how long it takes for the problem to occur. If the problem happens pretty quick, go with a lower number. If it takes hours, then go with something higher.

2. Make a .profile.d directory, also in the root of the JAR / WAR file. For a detailed explanation on using .profile.d to profile native memory checkout this note from CF support engineer Daniel Mikusa.

3. In that directory, add this script.
#!/bin/bash
$HOME/dump.sh &

This script will be run before the application starts. It starts the dump.sh script and backgrounds it. The dump.sh script will loop and poll the Java NMT stats, dumping them to STDOUT. As an example see the simple-java-web-for-test application. There is also an accompanying load plan here

4. Add the following parameters to JAVA_OPTS: 
JAVA_OPTS: "-XX:+PrintGCDateStamps -XX:+PrintGCDetails -Xloggc:./jvm-gc.log -XX:NativeMemoryTracking=detail"
The -XX:NativeMemoryTracking will enable native OOM analysis [1]. More on this later. The -Xloggc will allow you to pipe all GC output to a log that you can later analyze with tool like PMAT or GCMV

5. Push or restage the application.

6. In a terminal open cf logs app-name > app-name.log. That will dump the app logs & the Java NMT info to the file. Please try to turn off as much application logging as possible as this will make it easier to pick out the Java NMT dumps.

7. Kick off the load tests.


The nice thing about Java NMT is that it will take a snapshot of the memory usage when it first runs and every other time we poll the stats we'll see a diff of the initial memory usage. This is helpful as we really only need the last Java NMT dump prior to the crash to know what parts of the memory have increased and by how much. It's also nice because it gives us insight into non-heap usage. Given the Java NMT info, it should be easier to make some suggestions for tuning the JVM so that it doesn't exceed the memory limit of the application and cause a crash.

8. If you have the ability to ssh into the container use the following commands to trigger heapdumps and coredumps
JVM Process Id:
PID=` ps -ef | grep java | grep -v "bash\|grep" | awk '{print $2}'`
Heapdump:
./jmap -dump:format=b,file=/home/vcap/app/test.hprof $PID
Coredump:
kill -6 $PID   should produce a core file and leave the server running
Analyze these dumps using tools like Eclipse Memory Analyzer and IBM HeapDump Analyzer 

9. If you have the ability to modify the application then enable the Spring Boot Actuator feature for your app if it is Boot app else integrate a servlet or script like  DumpServlet  and HeapDumpServlet into the app 

Salient Notes:
- The memory statistics reported by CF in cf app is in fact  used_memory_in_bytes which is a summation of  rss and the active and inactive caches. [2] and [3]. This is the number watched by the cgroup linux OOM killer. 
- The Cloud Foundry Java Buildpack by default enables the following parameter -XX:OnOutOfMemoryError=$PWD/.java-buildpack/open_jdk_jre/bin/killjava.sh
Please do NOT be lulled into a false sense of complacency by this parameter. I have never seen this work in production. Your best bet is to be proactive when triggering and pulling dumps using servlets, JMX, kill commands, whatever ... 

References:

No comments:

Post a Comment