On Fri, 2006-11-17 at 11:50, Antonio Delgado Peris wrote:
> It seems that the guesses below are correct. We have seen that with a
> low number of jobs the number of tomcat threads grows at a much slower
> rate (or perhaps even remains stable). For now, we have deactivated the
> new jobwrapper tests (by emptying the jobwrapper-start.d/end.d
> directories on all the WNs). We expect tomcat to stay alive (and
> responsive) for longer (~ a week) now, but clearly this is not an
> optimal solution.
>
> I will submit a bug on the jobwrapper tests and a bug on tomcat memory
> problems (although this might be redundant).
We are currently looking at both of these situations. The knock-on
effect of tomcat running out of memory is that the java bug of not
releasing connections is also triggered, which can have a detrimental
effect on the registry, as we have witnessed. It is possible that the
python bug which was mentioned has also contributed to the
proliferation of producers; this is also under investigation.
In the meantime it may be prudent to disable the jobwrapper tests until
this is understood and a solution can be found.
Alastair
>
> Cheers,
>
> Antonio.
>
>
>
> > Hi,
> >>> Do you also see connections from your WNs? Glite update 9 contains
> >>> jobwrapper tests that publish some information about every job to
> >>> R-GMA. This might be causing the increased load. More info:
> >>>
> >>> http://goc.grid.sinica.edu.tw/gocwiki/SAM_jobwrapper_tests
> >>>
> >> I suspect that this is the reason why you are having problems. I've
> >> looked at your MON box and currently there are ~700 producers. It
> >> appears that a close is not being called before the code exits, which
> >> means the producers are left hanging around for longer than
> >> necessary, especially as only one tuple is inserted for each of the
> >> start and end events.
> >>
> > This might be the explanation (the time of the last glite upgrade
> > matches). Then the memory leak problem would be just the same as before,
> > but since the number of R-GMA publications has now multiplied, tomcat
> > dies much faster...
> >
> > Has nobody else installed the upgrade and seen the problem?
> >
> > We can try turning off the SAM publication to see what happens
> > (although that will probably be tomorrow).
> >> An alternative, more efficient way of coding the job wrapper scripts
> >> would be to set up one producer instead of the present three, publish
> >> to all three tables via this one producer, and close it explicitly
> >> when the job has finished, so it will be cleaned up once the inserted
> >> data has expired.
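A minimal sketch of the suggested pattern: one producer shared by all the tables for a job, closed explicitly even if publishing fails. The `Producer` class here is a stand-in, not the real R-GMA API, whose calls and table names are not shown in this thread.

```python
class Producer:
    """Stand-in for an R-GMA primary producer (hypothetical API)."""

    def __init__(self):
        self.open = True
        self.inserted = []

    def insert(self, table, row):
        # Publish one tuple to the given table via this single producer.
        if not self.open:
            raise RuntimeError("producer already closed")
        self.inserted.append((table, row))

    def close(self):
        # Tell the server the producer is finished, so it can be cleaned
        # up as soon as the inserted tuples expire, instead of hanging
        # around among the ~700 stale producers described above.
        self.open = False


def publish_job_events(events):
    """Publish all (table, row) events for one job via a single producer."""
    producer = Producer()
    try:
        for table, row in events:
            producer.insert(table, row)
    finally:
        producer.close()  # always close, even if an insert raises
    return producer
```

The key point is the `try`/`finally`: the close happens on every exit path of the wrapper script, which is what the thread says is currently missing.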
> >>
> > If the above is true, we'll submit a bug for this, but in any case the
> > tomcat problem remains...
> >
> > Now, something else you may find interesting.
> >
> > After a comment we got, we tried setting "export
> > LD_ASSUME_KERNEL=2.4.19" in /etc/tomcat5/tomcat5.conf, rather than
> > just "LD_ASSUME_KERNEL=2.4.19" as we had it. What happened is that
> > instead of a single tomcat process with a lot of threads, we see a
> > growing number of tomcat processes. The memory exhaustion is as
> > before, but the number of connections to rgma12.pp.rl.ac.uk is just one.
> >
> > For now, we have reverted that change.
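For reference, the change being described amounts to the fragment below. Setting LD_ASSUME_KERNEL=2.4.19 makes glibc fall back to the older LinuxThreads implementation, under which each Java thread shows up as a separate process in `ps`, matching the behaviour reported above.

```shell
# /etc/tomcat5/tomcat5.conf
# With "export", the setting reaches the JVM's environment; without it,
# the variable is set only in the shell that sources this file.
export LD_ASSUME_KERNEL=2.4.19
```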
> >
> > Thank you for your help.
> >
> > Antonio.
> >
> >> Alastair
> >>