On 18 May 2007, at 14:25, Alastair Duncan wrote:
> On Friday 18 May 2007 10:02:21 Graeme Stewart wrote:
>> Hi Steve
>>
>> The new WAR file has improved the CLOSE_WAIT a bit, but it's not
>> entirely gone:
>>
>> svr019:~# netstat -t | grep rgma | wc -l
>> 168
>> svr019:~# netstat -t | grep rgma | grep CLOSE_WAIT | wc -l
>> 39
>
> Hi Graeme,
>
> Having sockets in a CLOSE_WAIT state is not necessarily bad as long as
> they change from this state. Does the number of sockets in the
> CLOSE_WAIT state fluctuate, or does it just go up?
Oh dear, it's all gone badly wrong now:
svr019:~# netstat -t | grep CLOSE_WAIT | wc -l
1801
All of them have 1 byte in the Recv-Q.
And R-GMA's total number of TCP connections is:
svr019:~# netstat -tp | grep java | wc -l
2279
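
A minimal sketch of how the checks above could be wrapped up for repeated
use. It assumes the Linux net-tools netstat column layout (Proto, Recv-Q,
Send-Q, Local Address, Foreign Address, State); the function names
count_tcp_states and count_close_wait_pending are hypothetical and only for
illustration:

```shell
# Summarise TCP socket states from netstat-style output read on stdin.
# Assumption: column 6 holds the state, as in Linux net-tools `netstat -t`.
count_tcp_states() {
    awk '$1 ~ /^tcp/ { n[$6]++ } END { for (s in n) print s, n[s] }' | sort
}

# Count CLOSE_WAIT sockets that still have unread bytes queued in Recv-Q
# (column 2), i.e. the pattern described above.
count_close_wait_pending() {
    awk '$1 ~ /^tcp/ && $6 == "CLOSE_WAIT" && $2 > 0 { c++ } END { print c + 0 }'
}

# Usage on a live host:
#   netstat -tn | count_tcp_states
#   netstat -tn | count_close_wait_pending
```

Running these at intervals would show whether the CLOSE_WAIT count
fluctuates or only climbs. The -n flag just skips name lookups so the
output comes back faster.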
It looks like this has been happening for about 3 weeks. See
http://svr031.gla.scotgrid.ac.uk/ganglia/?r=month&c=Grid+Servers&h=svr019.gla.scotgrid.ac.uk
and look at proc_total.
Pushing the big red restart button now.
>
>>
>> And in addition, the problem was monitoring in the job wrapper adding
>> unnecessary wallclock to jobs. I don't see how this will be
>> dramatically improved, even if CLOSE_WAIT goes away entirely.
>
> As I understand the situation, the wallclock time was still high when
> the R-GMA publishing part of the job wrapper was disabled. So what else
> is done in the job wrapper that could cause this?
Don't know. I think the current plan might be to test with Nagios
instead, which is far more sane.
Cheers
Graeme
--
Dr Graeme Stewart - http://wiki.gridpp.ac.uk/wiki/User:Graeme_stewart
ScotGrid - http://www.scotgrid.ac.uk/ http://scotgrid.blogspot.com/