JiscMail Logo
Email discussion lists for the UK Education and Research communities

Help for JISC-SHIBBOLETH Archives


JISC-SHIBBOLETH Archives

JISC-SHIBBOLETH Archives


JISC-SHIBBOLETH@JISCMAIL.AC.UK


View:

Message:

[

First

|

Previous

|

Next

|

Last

]

By Topic:

[

First

|

Previous

|

Next

|

Last

]

By Author:

[

First

|

Previous

|

Next

|

Last

]

Font:

Proportional Font

LISTSERV Archives

LISTSERV Archives

JISC-SHIBBOLETH Home

JISC-SHIBBOLETH Home

JISC-SHIBBOLETH  October 2013

JISC-SHIBBOLETH October 2013

Options

Subscribe or Unsubscribe

Subscribe or Unsubscribe

Log In

Log In

Get Password

Get Password

Subject:

Re: tomcat tuning for a busy IdP

From:

Matthew Slowe <[log in to unmask]>

Reply-To:

Discussion list for Shibboleth developments <[log in to unmask]>

Date:

Wed, 16 Oct 2013 09:28:35 +0100

Content-Type:

multipart/signed

Parts/Attachments:

Parts/Attachments

text/plain (161 lines) , smime.p7s (161 lines)

On Wed, Sep 18, 2013 at 04:14:10PM +0100, Matthew Slowe wrote:
> From time to time our Nagios checks have gone mad and declared that
> "nothing works" for a few minutes at a time (stuff we check below).
> These blackouts tended to affect both nodes at the same time and would
> fix themselves after a few minutes.  Until today I wasn't able to get a
> good look at what was happening during one of these blackouts but I
> caught one today (by chance) -- java was hogging CPU. In an attempt to
> fix this, we're doing two things (a) move to the Sun/Oracle JVM (which
> we meant to do in the first place... oops) and (b) enable Parallel
> Garbage Collection [3] which, we've seen in the past on load profiles
> like this, to make a big difference on forced garbage collection events.
> 
> [2] See http://www.win.tue.nl/~aeb/linux/lk/lk-9.html#ss9.6
> 
> [3] See http://www.oracle.com/technetwork/java/javase/gc-tuning-6-140523.html#par_gc
> 
> [4] See http://j.mp/1ep6FMT

Morning all!

It's been an interesting few weeks... and we're not really in much of a
better situation :-( Apologies for the pretty heavy post!

Current state of play:

3 RHEL VMs (each are 2cpu, 4G) on VMware (spread across three hosts with no obvious CPU
contention)

All three on Sun^WOracle JVM 1.7:

	$ java -version
	java version "1.7.0_25"
	Java(TM) SE Runtime Environment (build 1.7.0_25-b15)
	Java HotSpot(TM) 64-Bit Server VM (build 23.25-b01, mixed mode)

	-Xms1024m -Xmx1024m
	-XX:+DisableExplicitGC

	PrintGCDetails PrintGCTimeStamps PrintGCDetails
	PrintHeapAtGC PrintGCTimeStamps PrintGCDateStamps
	PrintGCApplicationStoppedTime -verbose:gc
	-Xloggc:/var/log/tomcat6/gc.log UseGCLogFileRotation

Having tried a load of things I decided to spread my bets and we're
running three different GC models in parallel:

1) UseParallelOldGC ConcGCThreads=4 ParallelGCThreads=4 
   MaxGCPauseMillis=3000 
   NewSize=256m MaxPermSize=128m

2) UseConcMarkSweepGC CMSIncrementalMode UseParNewGC CMSConcurrentMTEnabled 
   ConcGCThreads=4 ParallelGCThreads=4 CMSClassUnloadingEnabled 
   MaxGCPauseMillis=3000 
   NewSize=256m MaxPermSize=128m

3) UseG1GC MaxGCPauseMillis=5000
   I'm aware this is a shiny new one but currently am willing to try almost
   anything to get rid of these GC pauses!


All three VMs are seeing the same level of load and see a varied
distribution of GC pauses. Most of the time they're fine and they blaze
through a GC in <0.1s but occationally they all see *much* longer
pauses. 


Is anyone seeing a similar or greater load (Sam?) able to share what, if
anything, they do for managing Garbage Collection and if they see pauses of a
similar ilk?

I'm preetty close to turning off all the tuning again (back to Shib recommended
defaults https://wiki.shibboleth.net/confluence/display/SHIB2/IdPProdJVMTuning)
and seeing if the JVM can sort itself out.

Loads of suggestions in the intertubes that (counterintuatively) lowering the
heap size is a good thing to do. I don't fully understand how much ShibIDP
caches stuff (if at all) but might we benefit from caching less?


All of these traces are for roughly a 24 hour period and only
include the GCs which took >1s. The G1GC log format is very different!

1) ParallelGC (ignored 2445 events):

2013-10-15T11:18:18.529+0100: 9772.148: [GC [PSYoungGen: 326136K->21260K(327424K)] 752269K->448129K(1026496K), 1.7355970 secs] [Times: user=0.14 sys=0.00, real=2.72 secs]
2013-10-15T11:19:19.190+0100: 9832.809: [GC [PSYoungGen: 326540K->22073K(327360K)] 753409K->452843K(1026432K), 15.4796500 secs] [Times: user=0.15 sys=0.03, real=15.47 secs]
2013-10-15T14:24:39.189+0100: 20952.808: [GC [PSYoungGen: 325919K->22514K(326976K)] 598819K->298516K(1026048K), 1.4169930 secs] [Times: user=0.06 sys=0.00, real=1.41 secs]
2013-10-16T08:05:06.675+0100: 84580.363: [GC [PSYoungGen: 326392K->22694K(324736K)] 943833K->641851K(1023808K), 47.9720620 secs] [Times: user=0.28 sys=0.30, real=48.45 secs]
2013-10-16T08:06:36.408+0100: 84670.027: [GC [PSYoungGen: 324710K->23728K(325760K)] 943867K->658653K(1024832K), 3.7116570 secs] [Times: user=0.09 sys=0.05, real=3.71 secs]
2013-10-16T08:07:23.391+0100: 84717.010: [GC [PSYoungGen: 325744K->23716K(302336K)] 960669K->674558K(1001408K), 2.1170260 secs] [Times: user=0.10 sys=0.04, real=2.11 secs]
2013-10-16T08:08:08.151+0100: 84761.770: [GC [PSYoungGen: 302308K->21489K(314048K)] 953150K->678322K(1013120K), 1.6553760 secs] [Times: user=0.07 sys=0.02, real=1.65 secs]
2013-10-16T08:08:47.580+0100: 84801.199: [GC [PSYoungGen: 300081K->20023K(314688K)] 956914K->679571K(1013760K), 3.7389560 secs] [Times: user=0.08 sys=0.01, real=3.74 secs]
2013-10-16T08:20:00.947+0100: 85474.566: [GC [PSYoungGen: 315115K->20572K(323328K)] 988968K->695248K(1022400K), 1.9098330 secs] [Times: user=0.34 sys=0.01, real=1.91 secs]
2013-10-16T08:24:25.046+0100: 85738.665: [GC [PSYoungGen: 321149K->20925K(325696K)] 999834K->700258K(1024768K), 1.1434680 secs] [Times: user=0.17 sys=0.01, real=1.14 secs]

2) CMS GC (ignored 2711 events):

2013-10-15T09:25:35.497+0100: 3498.937: [GC 3499.700: [ParNew: 292668K->20750K(306688K), 18.9112410 secs] 444927K->173730K(1014528K) icms_dc=0 , 19.4728050 secs] [Times: user=4.48 sys=0.15, real=19.69 secs]
2013-10-15T09:26:17.812+0100: 3541.251: [GC 3541.251: [ParNew: 293390K->30132K(306688K), 8.4375810 secs] 446370K->184108K(1014528K) icms_dc=0 , 8.4389720 secs] [Times: user=0.13 sys=0.03, real=8.44 secs]
2013-10-15T09:26:47.455+0100: 3570.895: [GC 3570.895: [ParNew: 302772K->29409K(306688K), 2.3354470 secs] 456748K->193906K(1014528K) icms_dc=0 , 2.3357080 secs] [Times: user=0.11 sys=0.01, real=2.34 secs]
2013-10-15T11:16:37.185+0100: 10160.624: [GC 10160.958: [ParNew: 292890K->20954K(306688K), 40.3960690 secs] 692087K->421434K(1014528K) icms_dc=0 , 40.5051210 secs] [Times: user=0.31 sys=0.14, real=40.77 secs]
2013-10-15T11:17:46.251+0100: 10229.691: [GC 10229.691: [ParNew: 293594K->28842K(306688K), 3.7343380 secs] 694074K->430833K(1014528K) icms_dc=0 , 3.7345270 secs] [Times: user=0.49 sys=0.03, real=3.73 secs]
2013-10-15T11:18:09.470+0100: 10252.909: [GC 10252.909: [ParNew: 301482K->26120K(306688K), 1.5488630 secs] 703473K->438164K(1014528K) icms_dc=4 , 1.5490490 secs] [Times: user=0.09 sys=0.01, real=1.55 secs]
2013-10-15T11:18:22.104+0100: 10265.543: [GC [1 CMS-initial-mark: 412044K(707840K)] 569118K(1014528K), 2.7722390 secs] [Times: user=0.10 sys=0.01, real=2.78 secs]
2013-10-15T11:18:39.060+0100: 10282.500: [GC 10282.500: [ParNew: 298760K->23153K(306688K), 1.0338600 secs] 710804K->439075K(1014528K) icms_dc=4 , 1.0340570 secs] [Times: user=0.08 sys=0.01, real=1.03 secs]
2013-10-15T16:56:47.089+0100: 30570.529: [GC 30570.680: [ParNew: 293275K->24113K(306688K), 37.4896370 secs] 550467K->282444K(1014528K) icms_dc=0 , 37.5648950 secs] [Times: user=0.25 sys=0.15, real=37.68 secs]
2013-10-15T16:57:42.123+0100: 30625.563: [GC 30625.563: [ParNew: 296753K->31170K(306688K), 3.3016020 secs] 555084K->294570K(1014528K) icms_dc=0 , 3.3021540 secs] [Times: user=0.13 sys=0.02, real=3.30 secs]
2013-10-15T16:58:04.358+0100: 30647.797: [GC 30647.797: [ParNew: 303810K->28342K(306688K), 2.0779480 secs] 567210K->303245K(1014528K) icms_dc=0 , 2.0781710 secs] [Times: user=0.13 sys=0.02, real=2.08 secs]
2013-10-15T17:25:32.927+0100: 32296.366: [GC 32296.366: [ParNew: 296535K->22413K(306688K), 2.5116350 secs] 634780K->364027K(1014528K) icms_dc=0 , 2.5118060 secs] [Times: user=0.09 sys=0.01, real=2.51 secs]
2013-10-15T19:02:28.585+0100: 38112.024: [GC 38112.024: [ParNew: 292218K->19569K(306688K), 1.1639890 secs] 799079K->527416K(1014528K) icms_dc=0 , 1.1641580 secs] [Times: user=0.09 sys=0.00, real=1.17 secs]
2013-10-15T19:04:19.908+0100: 38223.348: [GC 38223.348: [ParNew: 292491K->20852K(306688K), 1.0817580 secs] 802076K->531335K(1014528K) icms_dc=0 , 1.0818990 secs] [Times: user=0.08 sys=0.00, real=1.08 secs]
2013-10-15T20:04:24.731+0100: 41828.181: [GC 41828.221: [ParNew: 292076K->34047K(306688K), 5.7203690 secs] 896122K->640021K(1014528K) icms_dc=0 , 5.7206930 secs] [Times: user=0.22 sys=0.05, real=5.76 secs]
2013-10-15T20:04:58.838+0100: 41862.277: [GC 41862.277: [ParNew: 306687K->34048K(306688K), 3.8018520 secs] 912661K->656913K(1014528K) icms_dc=10 , 3.8020970 secs] [Times: user=0.19 sys=0.02, real=3.81 secs]
2013-10-15T20:05:18.774+0100: 41882.213: [GC [1 CMS-initial-mark: 622865K(707840K)] 781253K(1014528K), 6.7568280 secs] [Times: user=0.14 sys=0.01, real=6.75 secs]
2013-10-15T20:06:09.143+0100: 41932.583: [GC 41932.583: [ParNew: 299285K->23925K(306688K), 2.5954090 secs] 930733K->660246K(1014528K) icms_dc=11 , 2.5956550 secs] [Times: user=0.29 sys=0.02, real=2.60 secs]

3) G1GC (ignored 1321 events): 

2013-10-15T17:07:06.409+0100: 30235.329: [GC pause (young), 22.13925700 secs]
2013-10-15T17:07:41.248+0100: 30270.009: [GC pause (young), 1.82773500 secs]
2013-10-15T19:52:52.881+0100: 40181.642: [GC pause (mixed), 1.59984500 secs]
2013-10-15T19:55:12.036+0100: 40320.797: [GC pause (young), 2.12066800 secs]
2013-10-15T20:17:49.332+0100: 41678.319: [GC pause (young), 63.77815100 secs]
2013-10-15T20:19:05.460+0100: 41754.222: [GC pause (young), 5.40757200 secs]
2013-10-15T20:19:23.764+0100: 41772.526: [GC pause (young), 1.43072900 secs]



From the one running ParallelGC (the most "standard" setup), the more verbose
logging about the GC which took 50 seconds is:

{Heap before GC invocations=2429 (full 3):
 PSYoungGen      total 326400K, used 326392K [0x00000000eaab0000, 0x0000000100000000, 0x0000000100000000)
  eden space 304384K, 100% used [0x00000000eaab0000,0x00000000fd3f0000,0x00000000fd3f0000)
  from space 22016K, 99% used [0x00000000fd3f0000,0x00000000fe96e310,0x00000000fe970000)
  to   space 22720K, 0% used [0x00000000fe9d0000,0x00000000fe9d0000,0x0000000100000000)
 ParOldGen       total 699072K, used 617440K [0x00000000c0000000, 0x00000000eaab0000, 0x00000000eaab0000)
  object space 699072K, 88% used [0x00000000c0000000,0x00000000e5af81d0,0x00000000eaab0000)
 PSPermGen       total 105408K, used 49027K [0x00000000b8000000, 0x00000000be6f0000, 0x00000000c0000000)
  object space 105408K, 46% used [0x00000000b8000000,0x00000000bafe0fb8,0x00000000be6f0000)
2013-10-16T08:05:06.675+0100: 84580.363: [GC [PSYoungGen: 326392K->22694K(324736K)] 943833K->641851K(1023808K), 47.9720620 secs] [Times: user=0.28 sys=0.30, real=48.45 secs]
Heap after GC invocations=2429 (full 3):
 PSYoungGen      total 324736K, used 22694K [0x00000000eaab0000, 0x0000000100000000, 0x0000000100000000)
  eden space 302016K, 0% used [0x00000000eaab0000,0x00000000eaab0000,0x00000000fd1a0000)
  from space 22720K, 99% used [0x00000000fe9d0000,0x00000000ffff9a60,0x0000000100000000)
  to   space 23744K, 0% used [0x00000000fd1a0000,0x00000000fd1a0000,0x00000000fe8d0000)
 ParOldGen       total 699072K, used 619156K [0x00000000c0000000, 0x00000000eaab0000, 0x00000000eaab0000)
  object space 699072K, 88% used [0x00000000c0000000,0x00000000e5ca5390,0x00000000eaab0000)
 PSPermGen       total 105408K, used 49027K [0x00000000b8000000, 0x00000000be6f0000, 0x00000000c0000000)
  object space 105408K, 46% used [0x00000000b8000000,0x00000000bafe0fb8,0x00000000be6f0000)
}

Any ideas gratefully received!

-- 
Matthew Slowe
Server Infrastructure Team	e: [log in to unmask]
IS, University of Kent		t: +44 (0)1227 824265
Canterbury, UK			w: www.kent.ac.uk

Top of Message | Previous Page | Permalink

JiscMail Tools


RSS Feeds and Sharing


Advanced Options


Archives

November 2023
February 2023
January 2023
November 2022
October 2022
September 2022
June 2022
January 2022
November 2021
October 2021
September 2021
May 2021
April 2021
March 2021
February 2021
January 2021
December 2020
November 2020
October 2020
September 2020
July 2020
June 2020
May 2020
April 2020
March 2020
February 2020
January 2020
December 2019
November 2019
October 2019
September 2019
August 2019
June 2019
May 2019
March 2019
February 2019
January 2019
November 2018
July 2018
June 2018
May 2018
April 2018
March 2018
January 2018
November 2017
October 2017
September 2017
August 2017
July 2017
June 2017
May 2017
March 2017
February 2017
January 2017
October 2016
September 2016
August 2016
July 2016
June 2016
May 2016
March 2016
February 2016
January 2016
December 2015
November 2015
September 2015
August 2015
June 2015
April 2015
March 2015
February 2015
December 2014
November 2014
October 2014
September 2014
August 2014
July 2014
June 2014
May 2014
April 2014
March 2014
February 2014
January 2014
December 2013
November 2013
October 2013
September 2013
August 2013
June 2013
May 2013
April 2013
March 2013
February 2013
January 2013
December 2012
November 2012
October 2012
September 2012
August 2012
July 2012
June 2012
May 2012
April 2012
March 2012
February 2012
January 2012
December 2011
November 2011
October 2011
September 2011
August 2011
July 2011
June 2011
May 2011
April 2011
March 2011
February 2011
January 2011
December 2010
November 2010
October 2010
September 2010
August 2010
July 2010
June 2010
May 2010
April 2010
March 2010
February 2010
January 2010
December 2009
November 2009
October 2009
September 2009
August 2009
July 2009
June 2009
May 2009
April 2009
March 2009
February 2009
January 2009
December 2008
November 2008
October 2008
September 2008
August 2008
July 2008
June 2008
May 2008
April 2008
March 2008
February 2008
January 2008
December 2007
November 2007
October 2007
September 2007
August 2007
July 2007
June 2007
May 2007
April 2007
March 2007
February 2007
January 2007
December 2006
November 2006
October 2006
August 2006
July 2006
June 2006
May 2006
April 2006
March 2006
February 2006
January 2006
December 2005
November 2005
October 2005
September 2005
August 2005
July 2005
June 2005
May 2005
April 2005


JiscMail is a Jisc service.

View our service policies at https://www.jiscmail.ac.uk/policyandsecurity/ and Jisc's privacy policy at https://www.jisc.ac.uk/website/privacy-notice

For help and support help@jisc.ac.uk

Secured by F-Secure Anti-Virus CataList Email List Search Powered by the LISTSERV Email List Manager