It seems that the guesses below are correct. We have seen that with a
low number of jobs the number of tomcat threads grows at a much slower
rate (or maybe even remains stable). For now, we have deactivated the
new jobwrapper tests (by emptying the jobwrapper-start.d/end.d
directories on all the WNs). We expect tomcat to stay alive (and
responsive) for longer (~ a week) now, but clearly this is not an
optimal solution.
I will submit a bug on the jobwrapper tests and another on the tomcat
memory problems (although the latter might be redundant).
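For reference, the one-producer pattern Alastair suggests below might look
roughly like this. This is only a minimal Python sketch: `JobProducer` is a
stand-in class (the real R-GMA producer API is not quoted here), and the
table names and fields are made up for illustration.

```python
# Sketch of the suggested pattern: ONE producer shared by all three
# tables, closed explicitly when the job finishes.  JobProducer is a
# stand-in for the real R-GMA primary-producer API; table and column
# names below are invented for the example.

class JobProducer:
    """Stand-in producer that records inserts per table."""
    def __init__(self):
        self.inserted = []          # (table, values) pairs
        self.closed = False

    def insert(self, table, values):
        self.inserted.append((table, values))

    def close(self):
        # An explicit close lets the server clean the producer up as
        # soon as the inserted tuples expire, instead of leaving it
        # hanging around (the ~700 producers seen on the MON box).
        self.closed = True

def publish_job_events(job_id):
    producer = JobProducer()        # one producer, not three
    try:
        producer.insert("JobStart",  {"jobId": job_id})
        producer.insert("JobStatus", {"jobId": job_id, "state": "done"})
        producer.insert("JobEnd",    {"jobId": job_id})
    finally:
        producer.close()            # always close, even on failure
    return producer
```

With the close in a finally block, each job's producer would be cleaned up
shortly after the job ends rather than lingering until a server-side timeout.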
Cheers,
Antonio.
> Hi,
>>> Do you also see connections from your WNs? Glite update 9 contains
>>> jobwrapper tests that publish some information about every job to
>>> R-GMA. This might be causing the increased load. More info:
>>>
>>> http://goc.grid.sinica.edu.tw/gocwiki/SAM_jobwrapper_tests
>>>
>> I suspect that this is the reason why you are having problems. I've
>> looked at your MON box and currently there are ~700 producers. It
>> appears that a close is not being called before the code exits, which
>> means the producers are left hanging around for longer than
>> necessary, especially as only one tuple is inserted for each of the
>> start and end events.
>>
> This might be the explanation (the time of the last gLite upgrade
> matches). The memory leak problem would then be just the same as
> before, but since the number of R-GMA publications has now
> multiplied, tomcat dies much faster...
>
> Has nobody else installed the upgrade and seen the problem?
>
> We can try turning off the SAM publication to see what happens
> (though probably not until tomorrow).
>> A more efficient way of coding the job wrapper scripts would be to
>> set up one producer instead of the present three, publish to all
>> three tables via this one producer, and close it explicitly when the
>> job has finished, so it will be cleaned up once the inserted data
>> has expired.
>>
> If the above is true, we'll submit a bug for this, but in any case
> the tomcat problem remains...
>
> Now, something else you may find interesting.
>
> After a comment we got, we tried setting "export
> LD_ASSUME_KERNEL=2.4.19" in /etc/tomcat5/tomcat5.conf, instead of
> what we had there before. What happened is that
> instead of a tomcat process with a lot of threads, we see a growing
> number of tomcat processes. The memory exhaustion is as before, but
> the number of connections to rgma12.pp.rl.ac.uk is just one.
>
> For now, we have reverted that change.
>
> Thank you for your help.
>
> Antonio.
>
>> Alastair
>>