On Wed, Sep 18, 2013 at 04:14:10PM +0100, Matthew Slowe wrote:
> From time to time our Nagios checks have gone mad and declared that
> "nothing works" for a few minutes at a time (stuff we check below).
> These blackouts tended to affect both nodes at the same time and would
> fix themselves after a few minutes. Until today I wasn't able to get a
> good look at what was happening during one of these blackouts but I
> caught one today (by chance) -- java was hogging CPU. In an attempt to
> fix this, we're doing two things (a) move to the Sun/Oracle JVM (which
> we meant to do in the first place... oops) and (b) enable Parallel
> Garbage Collection [3] which, we've seen in the past on load profiles
> like this, to make a big difference on forced garbage collection events.
>
> [2] See http://www.win.tue.nl/~aeb/linux/lk/lk-9.html#ss9.6
>
> [3] See http://www.oracle.com/technetwork/java/javase/gc-tuning-6-140523.html#par_gc
>
> [4] See http://j.mp/1ep6FMT
Morning all!
It's been an interesting few weeks... and we're not really in much of a
better situation :-( Apologies for the pretty heavy post!
Current state of play:
3 RHEL VMs (2 vCPUs, 4GB RAM each) on VMware (spread across three hosts
with no obvious CPU contention)
All three on Sun^WOracle JVM 1.7:
$ java -version
java version "1.7.0_25"
Java(TM) SE Runtime Environment (build 1.7.0_25-b15)
Java HotSpot(TM) 64-Bit Server VM (build 23.25-b01, mixed mode)
-Xms1024m -Xmx1024m
-XX:+DisableExplicitGC
PrintGCDetails PrintGCTimeStamps PrintGCDateStamps
PrintHeapAtGC PrintGCApplicationStoppedTime
-verbose:gc -Xloggc:/var/log/tomcat6/gc.log UseGCLogFileRotation
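For reference, here's a sketch of the common flags above collected into one
place, e.g. tomcat6's JAVA_OPTS (the conf-file location varies by distro, and
the rotation sizes are illustrative assumptions; UseGCLogFileRotation does
little without NumberOfGCLogFiles/GCLogFileSize):

```shell
# Common JVM options for all three nodes, e.g. in /etc/tomcat6/tomcat6.conf
# (path is an assumption; adjust for your init script). Rotation sizes are
# illustrative.
JAVA_OPTS="-Xms1024m -Xmx1024m \
 -XX:+DisableExplicitGC \
 -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+PrintGCDateStamps \
 -XX:+PrintHeapAtGC -XX:+PrintGCApplicationStoppedTime \
 -Xloggc:/var/log/tomcat6/gc.log \
 -XX:+UseGCLogFileRotation -XX:NumberOfGCLogFiles=5 -XX:GCLogFileSize=10M"
echo "$JAVA_OPTS"
```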
Having tried a load of things, I decided to spread my bets and we're now
running three different GC models in parallel, one per VM:
1) UseParallelOldGC ConcGCThreads=4 ParallelGCThreads=4
MaxGCPauseMillis=3000
NewSize=256m MaxPermSize=128m
2) UseConcMarkSweepGC CMSIncrementalMode UseParNewGC CMSConcurrentMTEnabled
ConcGCThreads=4 ParallelGCThreads=4 CMSClassUnloadingEnabled
MaxGCPauseMillis=3000
NewSize=256m MaxPermSize=128m
3) UseG1GC MaxGCPauseMillis=5000
I'm aware this is a shiny new one, but at this point I'm willing to try
almost anything to get rid of these GC pauses!
All three VMs are seeing the same level of load and a varied
distribution of GC pauses. Most of the time they're fine and blaze
through a GC in <0.1s, but occasionally they all see *much* longer
pauses.
Is anyone seeing similar or greater load (Sam?) able to share what, if
anything, they do to manage garbage collection, and whether they see pauses
of a similar ilk?
I'm pretty close to turning off all the tuning again (back to the Shib
recommended defaults, https://wiki.shibboleth.net/confluence/display/SHIB2/IdPProdJVMTuning)
and seeing if the JVM can sort itself out.
There are loads of suggestions on the intertubes that (counterintuitively)
lowering the heap size is a good thing to do. I don't fully understand how
much the Shib IdP caches (if at all) but might we benefit from caching less?
All of the traces below cover roughly a 24-hour period and only
include the GCs which took >1s. Note that the G1GC log format is very different!
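The >1s filter can be reproduced with something along these lines (a sketch,
not the exact command I used; slow_gcs is a made-up name, and it only
understands the "real=" suffix of the ParallelGC/CMS lines, not G1's summary
format):

```shell
# Keep only GC log lines whose wall-clock ("real") time exceeded 1 second.
# Matches the ParallelGC/CMS "[Times: ... real=N.NN secs]" suffix; reads
# the log on stdin, e.g.: slow_gcs < /var/log/tomcat6/gc.log
slow_gcs() {
  awk -F'real=' 'NF > 1 { split($2, f, " "); if (f[1] + 0 > 1) print }'
}
```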
1) ParallelGC (ignored 2445 events):
2013-10-15T11:18:18.529+0100: 9772.148: [GC [PSYoungGen: 326136K->21260K(327424K)] 752269K->448129K(1026496K), 1.7355970 secs] [Times: user=0.14 sys=0.00, real=2.72 secs]
2013-10-15T11:19:19.190+0100: 9832.809: [GC [PSYoungGen: 326540K->22073K(327360K)] 753409K->452843K(1026432K), 15.4796500 secs] [Times: user=0.15 sys=0.03, real=15.47 secs]
2013-10-15T14:24:39.189+0100: 20952.808: [GC [PSYoungGen: 325919K->22514K(326976K)] 598819K->298516K(1026048K), 1.4169930 secs] [Times: user=0.06 sys=0.00, real=1.41 secs]
2013-10-16T08:05:06.675+0100: 84580.363: [GC [PSYoungGen: 326392K->22694K(324736K)] 943833K->641851K(1023808K), 47.9720620 secs] [Times: user=0.28 sys=0.30, real=48.45 secs]
2013-10-16T08:06:36.408+0100: 84670.027: [GC [PSYoungGen: 324710K->23728K(325760K)] 943867K->658653K(1024832K), 3.7116570 secs] [Times: user=0.09 sys=0.05, real=3.71 secs]
2013-10-16T08:07:23.391+0100: 84717.010: [GC [PSYoungGen: 325744K->23716K(302336K)] 960669K->674558K(1001408K), 2.1170260 secs] [Times: user=0.10 sys=0.04, real=2.11 secs]
2013-10-16T08:08:08.151+0100: 84761.770: [GC [PSYoungGen: 302308K->21489K(314048K)] 953150K->678322K(1013120K), 1.6553760 secs] [Times: user=0.07 sys=0.02, real=1.65 secs]
2013-10-16T08:08:47.580+0100: 84801.199: [GC [PSYoungGen: 300081K->20023K(314688K)] 956914K->679571K(1013760K), 3.7389560 secs] [Times: user=0.08 sys=0.01, real=3.74 secs]
2013-10-16T08:20:00.947+0100: 85474.566: [GC [PSYoungGen: 315115K->20572K(323328K)] 988968K->695248K(1022400K), 1.9098330 secs] [Times: user=0.34 sys=0.01, real=1.91 secs]
2013-10-16T08:24:25.046+0100: 85738.665: [GC [PSYoungGen: 321149K->20925K(325696K)] 999834K->700258K(1024768K), 1.1434680 secs] [Times: user=0.17 sys=0.01, real=1.14 secs]
2) CMS GC (ignored 2711 events):
2013-10-15T09:25:35.497+0100: 3498.937: [GC 3499.700: [ParNew: 292668K->20750K(306688K), 18.9112410 secs] 444927K->173730K(1014528K) icms_dc=0 , 19.4728050 secs] [Times: user=4.48 sys=0.15, real=19.69 secs]
2013-10-15T09:26:17.812+0100: 3541.251: [GC 3541.251: [ParNew: 293390K->30132K(306688K), 8.4375810 secs] 446370K->184108K(1014528K) icms_dc=0 , 8.4389720 secs] [Times: user=0.13 sys=0.03, real=8.44 secs]
2013-10-15T09:26:47.455+0100: 3570.895: [GC 3570.895: [ParNew: 302772K->29409K(306688K), 2.3354470 secs] 456748K->193906K(1014528K) icms_dc=0 , 2.3357080 secs] [Times: user=0.11 sys=0.01, real=2.34 secs]
2013-10-15T11:16:37.185+0100: 10160.624: [GC 10160.958: [ParNew: 292890K->20954K(306688K), 40.3960690 secs] 692087K->421434K(1014528K) icms_dc=0 , 40.5051210 secs] [Times: user=0.31 sys=0.14, real=40.77 secs]
2013-10-15T11:17:46.251+0100: 10229.691: [GC 10229.691: [ParNew: 293594K->28842K(306688K), 3.7343380 secs] 694074K->430833K(1014528K) icms_dc=0 , 3.7345270 secs] [Times: user=0.49 sys=0.03, real=3.73 secs]
2013-10-15T11:18:09.470+0100: 10252.909: [GC 10252.909: [ParNew: 301482K->26120K(306688K), 1.5488630 secs] 703473K->438164K(1014528K) icms_dc=4 , 1.5490490 secs] [Times: user=0.09 sys=0.01, real=1.55 secs]
2013-10-15T11:18:22.104+0100: 10265.543: [GC [1 CMS-initial-mark: 412044K(707840K)] 569118K(1014528K), 2.7722390 secs] [Times: user=0.10 sys=0.01, real=2.78 secs]
2013-10-15T11:18:39.060+0100: 10282.500: [GC 10282.500: [ParNew: 298760K->23153K(306688K), 1.0338600 secs] 710804K->439075K(1014528K) icms_dc=4 , 1.0340570 secs] [Times: user=0.08 sys=0.01, real=1.03 secs]
2013-10-15T16:56:47.089+0100: 30570.529: [GC 30570.680: [ParNew: 293275K->24113K(306688K), 37.4896370 secs] 550467K->282444K(1014528K) icms_dc=0 , 37.5648950 secs] [Times: user=0.25 sys=0.15, real=37.68 secs]
2013-10-15T16:57:42.123+0100: 30625.563: [GC 30625.563: [ParNew: 296753K->31170K(306688K), 3.3016020 secs] 555084K->294570K(1014528K) icms_dc=0 , 3.3021540 secs] [Times: user=0.13 sys=0.02, real=3.30 secs]
2013-10-15T16:58:04.358+0100: 30647.797: [GC 30647.797: [ParNew: 303810K->28342K(306688K), 2.0779480 secs] 567210K->303245K(1014528K) icms_dc=0 , 2.0781710 secs] [Times: user=0.13 sys=0.02, real=2.08 secs]
2013-10-15T17:25:32.927+0100: 32296.366: [GC 32296.366: [ParNew: 296535K->22413K(306688K), 2.5116350 secs] 634780K->364027K(1014528K) icms_dc=0 , 2.5118060 secs] [Times: user=0.09 sys=0.01, real=2.51 secs]
2013-10-15T19:02:28.585+0100: 38112.024: [GC 38112.024: [ParNew: 292218K->19569K(306688K), 1.1639890 secs] 799079K->527416K(1014528K) icms_dc=0 , 1.1641580 secs] [Times: user=0.09 sys=0.00, real=1.17 secs]
2013-10-15T19:04:19.908+0100: 38223.348: [GC 38223.348: [ParNew: 292491K->20852K(306688K), 1.0817580 secs] 802076K->531335K(1014528K) icms_dc=0 , 1.0818990 secs] [Times: user=0.08 sys=0.00, real=1.08 secs]
2013-10-15T20:04:24.731+0100: 41828.181: [GC 41828.221: [ParNew: 292076K->34047K(306688K), 5.7203690 secs] 896122K->640021K(1014528K) icms_dc=0 , 5.7206930 secs] [Times: user=0.22 sys=0.05, real=5.76 secs]
2013-10-15T20:04:58.838+0100: 41862.277: [GC 41862.277: [ParNew: 306687K->34048K(306688K), 3.8018520 secs] 912661K->656913K(1014528K) icms_dc=10 , 3.8020970 secs] [Times: user=0.19 sys=0.02, real=3.81 secs]
2013-10-15T20:05:18.774+0100: 41882.213: [GC [1 CMS-initial-mark: 622865K(707840K)] 781253K(1014528K), 6.7568280 secs] [Times: user=0.14 sys=0.01, real=6.75 secs]
2013-10-15T20:06:09.143+0100: 41932.583: [GC 41932.583: [ParNew: 299285K->23925K(306688K), 2.5954090 secs] 930733K->660246K(1014528K) icms_dc=11 , 2.5956550 secs] [Times: user=0.29 sys=0.02, real=2.60 secs]
3) G1GC (ignored 1321 events):
2013-10-15T17:07:06.409+0100: 30235.329: [GC pause (young), 22.13925700 secs]
2013-10-15T17:07:41.248+0100: 30270.009: [GC pause (young), 1.82773500 secs]
2013-10-15T19:52:52.881+0100: 40181.642: [GC pause (mixed), 1.59984500 secs]
2013-10-15T19:55:12.036+0100: 40320.797: [GC pause (young), 2.12066800 secs]
2013-10-15T20:17:49.332+0100: 41678.319: [GC pause (young), 63.77815100 secs]
2013-10-15T20:19:05.460+0100: 41754.222: [GC pause (young), 5.40757200 secs]
2013-10-15T20:19:23.764+0100: 41772.526: [GC pause (young), 1.43072900 secs]
From the one running ParallelGC (the most "standard" setup), the more verbose
logging around the GC which took ~48 seconds is:
{Heap before GC invocations=2429 (full 3):
PSYoungGen total 326400K, used 326392K [0x00000000eaab0000, 0x0000000100000000, 0x0000000100000000)
eden space 304384K, 100% used [0x00000000eaab0000,0x00000000fd3f0000,0x00000000fd3f0000)
from space 22016K, 99% used [0x00000000fd3f0000,0x00000000fe96e310,0x00000000fe970000)
to space 22720K, 0% used [0x00000000fe9d0000,0x00000000fe9d0000,0x0000000100000000)
ParOldGen total 699072K, used 617440K [0x00000000c0000000, 0x00000000eaab0000, 0x00000000eaab0000)
object space 699072K, 88% used [0x00000000c0000000,0x00000000e5af81d0,0x00000000eaab0000)
PSPermGen total 105408K, used 49027K [0x00000000b8000000, 0x00000000be6f0000, 0x00000000c0000000)
object space 105408K, 46% used [0x00000000b8000000,0x00000000bafe0fb8,0x00000000be6f0000)
2013-10-16T08:05:06.675+0100: 84580.363: [GC [PSYoungGen: 326392K->22694K(324736K)] 943833K->641851K(1023808K), 47.9720620 secs] [Times: user=0.28 sys=0.30, real=48.45 secs]
Heap after GC invocations=2429 (full 3):
PSYoungGen total 324736K, used 22694K [0x00000000eaab0000, 0x0000000100000000, 0x0000000100000000)
eden space 302016K, 0% used [0x00000000eaab0000,0x00000000eaab0000,0x00000000fd1a0000)
from space 22720K, 99% used [0x00000000fe9d0000,0x00000000ffff9a60,0x0000000100000000)
to space 23744K, 0% used [0x00000000fd1a0000,0x00000000fd1a0000,0x00000000fe8d0000)
ParOldGen total 699072K, used 619156K [0x00000000c0000000, 0x00000000eaab0000, 0x00000000eaab0000)
object space 699072K, 88% used [0x00000000c0000000,0x00000000e5ca5390,0x00000000eaab0000)
PSPermGen total 105408K, used 49027K [0x00000000b8000000, 0x00000000be6f0000, 0x00000000c0000000)
object space 105408K, 46% used [0x00000000b8000000,0x00000000bafe0fb8,0x00000000be6f0000)
}
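One thing worth pulling out of the [Times: ...] fields above (an observation,
not a diagnosis): for that 48.45s pause, user+sys is only 0.58s, so the JVM
threads barely ran at all during the pause. On a VM, long pauses with
near-zero CPU use tend to point at the hypervisor (CPU steal, swapping,
ballooning) rather than at GC work itself. A quick way to scan a log for that
pattern (gc_cpu_ratio is a made-up name):

```shell
# Print real seconds and the (user+sys)/real ratio for each GC event;
# a ratio near zero on a long pause means the JVM wasn't actually running.
# Reads the GC log on stdin.
gc_cpu_ratio() {
  awk '/real=/ {
    split($0, p, "user=|sys=|real=")   # p[2]=user, p[3]=sys, p[4]=real
    u = p[2] + 0; s = p[3] + 0; r = p[4] + 0
    if (r > 0) printf "real=%.2f cpu/real=%.2f\n", r, (u + s) / r
  }'
}
```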
Any ideas gratefully received!
--
Matthew Slowe
Server Infrastructure Team e: [log in to unmask]
IS, University of Kent t: +44 (0)1227 824265
Canterbury, UK w: www.kent.ac.uk