Friday, November 18, 2011

Be aware of your GC

When you're learning Java and later developing your applications you usually tend to not to worry about the memory management - we have garbage collector after all, so what's to worry about. Surprisingly you should worry about memory in your Java applications as it's not that difficult to introduce memory leak after all. You should worry about memory also in order to minimize pause time GC infers, and to use CPU to process your client's requests instead of cleaning your memory. It all depends on your non-functional requirements of latency, throughput, etc. (which can be tuned with some VM settings, not changing any applicative code)

There are two (at least) perspectives of memory management and GC in Java:
1.You should be careful while developing you application and make sure not to cache any objects. In other words do not be afraid of using new in Java - creating objects in Java is extremely fast and cheap (e.g. comparing to C++) as physical memory is already allocated - more in http://www.oracle.com/technetwork/java/gc-tuning-5-138395.html short article
2.You should be aware of GC tuning options that can match your concrete application - and this is what I will address in this post.

Usually you have two options while tuning GC: either you limit number of global collections (young + tenured generations), or limit max pause of garbage collection (thus increasing number of collections). You have plenty of options and quite exhaustive list can be found at Alexey Ragozin Blog

You should start analyzing your application under heavy load by enabling GC logging first:
-Xloggc:/opt/java/logs/gc.log
-verbose:gc

After running your application (Server HotSpot 1.6.0_23-b05) you should see at least one line like this:
8.847: [GC 138240K->4866K(501248K), 0.0148485 secs]
which says that 8.8 sec after VM startup there was a young generation collection that freed memory from 138240K used to 4866K with total mem available of 501248K and that garbage collection took ~15ms.

I recommend you adding also two following VM options:
-XX:+PrintGCDetails
-XX:+PrintGCDateStamps
that will result in such GC log line:

2011-11-18T16:36:04.513+0100: -0.813: [GC [PSYoungGen: 13824K->1376K(16128K)] 13824K->1376K(48896K), 0.0030416 secs] [Times: user=0.05 sys=0.01, real=0.02 secs]

Here you basically have more details about what happened i.e. that young generation was collected indeed and more importantly you have timestamps that are critical if you're running GC logs in production environment and simply time elapsed from VM startup says absolutely nothing.

When you start some heavy load of your application GC log might start looking more like this:

2011-11-18T16:37:51.472+0100: 99.088: [GC [PSYoungGen: 17338K->573K(17600K)] 49913K->33167K(50368K), 0.0011993 secs] [Times: user=0.00 sys=0.00, real=0.00 secs]
2011-11-18T16:37:51.800+0100: 100.780: [GC [PSYoungGen: 17341K->547K(17600K)] 49935K->33165K(50368K), -1.3929204 secs] [Times: user=0.00 sys=0.00, real=0.00 secs]
2011-11-18T16:37:52.097+0100: 99.670: [GC [PSYoungGen: 17315K->512K(17664K)] 49933K->33193K(50432K), 0.0011831 secs] [Times: user=0.00 sys=0.00, real=0.00 secs]
2011-11-18T16:37:52.097+0100: 99.671: [Full GC [PSYoungGen: 512K->0K(17664K)] [PSOldGen: 32680K->13582K(32768K)] 33193K->13582K(50432K) [PSPermGen: 29382K->29382K(58880K)], 1.5108144 secs] [Times: user=0.11 sys=0.00, real=0.13 secs]
2011-11-18T16:37:52.503+0100: 100.066: [GC [PSYoungGen: 16896K->557K(17664K)] 30478K->14140K(50432K), 0.0015381 secs] [Times: user=0.05 sys=0.00, real=0.02 secs]
2011-11-18T16:37:52.800+0100: 100.337: [GC [PSYoungGen: 17453K->501K(17664K)] 31036K->14188K(50432K), 0.0011944 secs] [Times: user=0.00 sys=0.00, real=0.00 secs]
2011-11-18T16:37:53.206+0100: 100.656: [GC [PSYoungGen: 17397K->402K(17344K)] 31084K->14117K(50112K), 0.0009709 secs] [Times: user=0.00 sys=0.00, real=0.00 secs]

You can notice that on 99.671 sec from the VM startup there was a Full GC executed. It means that the old generation was full.

Now it's time for you to play with some settings like -XX:NewSize, -XX:MaxNewSize, -XX:SurvivorRatio and others to see what's the best compromise between the max pause time and the number of GC pauses.

It's also time to play with different types of GC. You still have to test -XX:+UseConcMarkSweepGC, -XX:+UseParNewGC and -XX:+UseG1GC. Be careful with the last option as if you use JVM version 6.x you have to combine it together with -XX:+UnlockExperimentalVMOptions as "Garbage-First Garbage Collector (or G1 GC for short) is a new GC that is being introduced in the Java HotSpot VM in JDK 7. An experimental version of G1 has also been released in Java SE 6 Update 14. G1 is the long-term replacement for HotSpot's low-latency Concurrent Mark-Sweep GC (widely referred to as CMS)".

IMPORTANT: there is no golden formula for correct and optimal settings / GC implementation. It really depends on your application (sometimes you would need to re-implement some parts to help GC), traffic, given memory (remember than on 32-bit machines max heap size is 4GB and it could be insufficient for some applications), number of cores, etc. But most of all it's your application that drives GC behavior. Be careful and optimize your GC setting with production settings, not to have unpleasant surprises on launch date :)

Note on HotSpot's GC on Windows: it sucks because many times I can get negative GC time in logs like '-1.3917690 secs' which is not very helpful for analysis. After some playing with GC logs I'm not sure times given there can be trusted at all - what a pity.

In the next post I will explain what monitoring / analyzing tools you can use to monitor you application in runtime as well analyze GC logs offline.

No comments: