We recently started seeing high GC pauses with our in-house, in-memory cache.
This cache has:
- All of the metadata required to build a response cached in memory.
- Low response rates.
The heap configuration we were using for this application:
- -Xmx: 44 GB
- Young generation: 15 GB
- Old generation: 30 GB
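For reference, a heap laid out like this would typically come from JVM flags along the following lines. The exact command below is our reconstruction rather than the production command line (and cache-service.jar is a placeholder); the Parallel collector itself is implied by the PSYoungGen/ParOldGen entries in the logs further down:

```
# Reconstructed launch flags (values approximate, jar name is a placeholder).
# -Xmn carves the young generation out of the total heap; the rest becomes old gen.
java -Xms44g -Xmx44g -Xmn15g \
     -XX:+UseParallelOldGC \
     -verbose:gc -XX:+PrintGCDetails \
     -jar cache-service.jar
```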
When we started investigating these high GC pauses, interestingly enough we could not find any minor GC cycles in our logs. Looking further, we found that the application was doing FULL GC at regular intervals.
Sample log lines:
[Full GC [PSYoungGen: 13417472K->1003667K(14090240K)] [ParOldGen: 30037496K->30037470K(30037504K)] 43454968K->31041137K(44127744K) [PSPermGen: 82734K->82734K(82944K)], 11.6876010 secs] [Times: user=91.27 sys=0.00, real=11.69 secs]
[Full GC [PSYoungGen: 13417472K->1021218K(14090240K)] [ParOldGen: 30037096K->30037496K(30037504K)] 43454568K->31058714K(44127744K) [PSPermGen: 82740K->82734K(82944K)], 15.4744130 secs] [Times: user=118.96 sys=0.00, real=15.47 secs]
[Full GC [PSYoungGen: 13417472K->993065K(14090240K)] [ParOldGen: 30037131K->30037096K(30037504K)] 43454603K->31030161K(44127744K) [PSPermGen: 82734K->82734K(82944K)], 17.5521800 secs] [Times: user=134.98 sys=0.00, real=17.55 secs]
In these log lines, we can clearly see that during each FULL GC cycle only the young generation is actually collected (in the first line it drops from roughly 13 GB to about 1 GB), while the old generation stays more or less the same (30037496K -> 30037470K). In other words, there were plenty of short-lived objects, and they only got collected when a FULL GC was triggered.
This was not at all what we were expecting. Our expectation was that the young generation filling up should never trigger a FULL GC. In fact, we were under the impression that whenever there is a failure to allocate a new object in the young generation, a minor GC should kick in and collect all of the short-lived objects, and we had plenty of short-lived objects in this case. But from the logs it seemed that every time an allocation in the young generation failed, a FULL GC kicked in instead, causing huge pauses in our application.
(This hurt so much because we had 30 GB of old generation: even though almost everything in it was strongly referenced and therefore not collectable, the GC threads still had to spend a huge amount of time just traversing that 30 GB of heap space.)
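To make the shape of the problem concrete, here is a hypothetical simplification of such a cache (the object and method names are made up for illustration, this is not our actual code): the metadata map is strongly referenced for the life of the process and ends up filling the old generation, while each request only produces short-lived garbage that should die in the young generation.

```scala
import scala.collection.mutable

object CacheShape {
  // Long-lived: strongly referenced for the life of the process,
  // so after a few promotions it all sits in the old generation.
  val metadata: mutable.Map[Long, Array[Byte]] = mutable.Map.empty

  def load(entries: Int, entrySizeBytes: Int): Unit =
    (1 to entries).foreach(i => metadata(i.toLong) = new Array[Byte](entrySizeBytes))

  // Short-lived: per-request objects that should die in a minor GC,
  // not in a Full GC that also has to traverse all of the live metadata.
  def serve(key: Long): String =
    metadata.get(key).map(bytes => s"entry of ${bytes.length} bytes").getOrElse("miss")
}
```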
We can clearly see this in the attached jstat snapshot. According to the snapshot, the FULL GC cycle started at 13383165.2 and ended at 13383177.3.
During this time our application was not able to serve any requests.
This FULL GC cycle was triggered by an object allocation failure in the young generation. Interestingly enough, the survivor regions had room and could easily have dealt with the allocation failure, but instead the JVM triggered a FULL GC and paid the penalty of traversing the whole old generation.
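For reference, a snapshot like the one above can be captured with jstat against the running JVM (the pid below is a placeholder); the FGC and FGCT columns are where these Full GC cycles and their 12-17 second durations show up:

```
# Print GC utilisation and counters every 1000 ms for the given JVM process.
jstat -gcutil <pid> 1000
```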
I then tried reproducing this behaviour with this small piece of code:
```scala
object GCTest {
  val str = new scala.util.Random(500)

  def main(args: Array[String]): Unit = {
    var l: List[String] = List.empty
    while (true) {
      // The generated strings stay strongly referenced via `l` forever (long-lived),
      // while each `:+` rebuilds the list, turning the previous copy into garbage (short-lived).
      val a = str.nextString(50)
      l = l :+ a
      Thread.sleep(10)
    }
  }
}
```
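To observe the same pattern without waiting for a huge heap to fill up, the repro can be run with a deliberately small heap; the file name, heap sizes, and flags below are just an illustrative choice, not what produced the logs above:

```
# Compile the repro and run it with a small heap (sizes are illustrative),
# so the old generation fills with the strongly referenced list much sooner.
scalac GCTest.scala
scala -J-Xmx256m -J-Xmn64m -J-XX:+UseParallelOldGC \
      -J-verbose:gc -J-XX:+PrintGCDetails GCTest
```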
The GC cycles were exactly as expected.
A few observations from the GC logs:
- The application is doing FULL GC cycles.
- The old generation is fully occupied by long-lived objects (objects strongly referenced by the List).
- There are no minor GC cycles.
- There are lots of non-live objects in the young generation, which get cleared in the next GC cycle.
We can clearly see in the above GC logs that, in spite of the survivor regions being empty and there being many non-live objects in the young generation, the JVM still does FULL GC cycles instead of the expected minor GC cycles, which would have collected these non-live objects and reduced the memory pressure on the young generation.
So in the end, my learning from this whole fiasco was that the JVM behaves oddly with heaps in which we have to maintain a lot of long-lived objects. All of those objects eventually get pushed to the old generation, and once the old generation is full, the application starts doing FULL GC cycles instead of minor GCs, which means much lower throughput.
Note: All of the above observations are from OpenJDK, Java version "1.7.0_151".