On Tue, 2 Aug 2005, Ian Fisk wrote:

> Hi Maarten,
> 
> I left about 75 running. Memory usage of the system has dropped, as
> has the load. I don't think it's related to file system problems,
> which we believe we have solved with a network upgrade. The problem
> we saw before corrupted a lock file; all the lock files now appear
> to be fine. Also, with the previous problem we would see the number
> of processes increase continuously. In this case there were quite
> suddenly 2300 processes and the number did not increase; it just
> stayed static. So far the killed processes have stayed dead, which
> also differs from the previous problem.

Hi Ian,
might there be some cron job cleaning up the grid user home directories?

In any case, can you run "lsof -p PID" for a few such PIDs and send us
the results?  We would like to see if the processes have files open that
have been deleted, and also look at their network connections, if any.
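
A minimal sketch of how that lsof output might be collected and skimmed. The helper names (sample_lsof, interesting_lines), the sample size of 5, and selecting the processes by account name with pgrep are all assumptions for illustration; none of them come from this thread. The filter relies on Linux lsof appending "(deleted)" to the name of an unlinked file.

```shell
#!/bin/sh
# Hypothetical helpers, not part of any Globus or LCG tooling.

# Print lsof output for the first N processes owned by a user.
# $1 = account name (placeholder, the thread does not name it)
# $2 = how many PIDs to sample (default 5)
sample_lsof() {
    user=$1
    count=${2:-5}
    for pid in $(pgrep -u "$user" | head -n "$count"); do
        echo "=== PID $pid ==="
        lsof -p "$pid"
    done
}

# Keep only the lines of interest from lsof output on stdin:
# open files that have been unlinked, and network sockets.
interesting_lines() {
    grep -E '\(deleted\)|IPv4|IPv6'
}

# Example (run as root or as the account itself):
#   sample_lsof someuser 5 | interesting_lines
```

The full, unfiltered output is still the thing to send; the filter is only a quick way to spot deleted files and open connections.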

> On Aug 2, 2005, at 1:17 PM, Maarten Litmaath wrote:
> 
> > Ian Fisk wrote:
> >
> >
> >> We are observing a large number of processes on the FNAL CE.
> >>
> >
> > You sure they are not related to the file server problems you had
> > at the end of May?
> >
> >
> >> Currently there are 2300 belonging to one UID. They are roughly
> >> divided between
> >>  globus-job-manager -conf /opt/globus/etc/globus-job-manager.conf -type lcgcondor -rdn jobmanager-lcgcondor -machine-type unknown -publish-jobs
> >> and
> >>  /usr/local/bin/perl /opt/globus/libexec/globus-job-manager-script.pl -m lcgcondor -f /tmp/gram_mBjvlv -c poll
> >> Roughly 1150 of each. I am not sufficiently familiar with what
> >> these two scripts are supposed to be doing. The number of
> >> processes does not appear to be growing (or shrinking). The
> >> UID in question does not currently have any active jobs in the
> >> batch system.
> >>
> >
> > I suggest you kill almost all of them, leaving a few for us to
> > look at. First kill 10 processes and check that the load does not
> > suddenly increase a lot, then kill 50, 100, ...
> >
>
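
The staged cleanup suggested at the bottom of the thread (kill 10, check the load, then 50, 100, ...) could be scripted roughly as follows. This is only a sketch under stated assumptions: a Linux /proc/loadavg, a PID list already written to a file with one PID per line, a 30-second pause between batches, and a caller-chosen load threshold. The function names are hypothetical and nothing here is from the original messages.

```shell
#!/bin/sh
# Hypothetical helpers for a staged kill; all names are illustrative.

# Print the integer part of the 1-minute load average.
# $1 = file to read (defaults to /proc/loadavg, a Linux assumption)
load_1min() {
    cut -d' ' -f1 "${1:-/proc/loadavg}" | cut -d. -f1
}

# Read PIDs on stdin, skip the first $1 (left alive for inspection),
# and print the rest.
pids_to_kill() {
    tail -n +"$(( $1 + 1 ))"
}

# Kill the PIDs listed in file $1 in batches of 10, 50, 100, 100, ...
# Stop early if the 1-minute load rises above $2.
# Note: the list file is consumed as batches are processed.
staged_kill() {
    list=$1
    max_load=$2
    batch=10
    while [ -s "$list" ]; do
        # Ignore errors from PIDs that have already exited.
        head -n "$batch" "$list" | xargs kill || true
        tail -n +"$(( batch + 1 ))" "$list" > "$list.tmp" && mv "$list.tmp" "$list"
        sleep 30   # assumed settle time before reading the load
        if [ "$(load_1min)" -gt "$max_load" ]; then
            echo "load above $max_load, stopping" >&2
            return 1
        fi
        case $batch in
            10) batch=50 ;;
            *)  batch=100 ;;
        esac
    done
}

# Example: keep 5 processes for debugging, kill the rest in stages.
#   pgrep -u someuser | pids_to_kill 5 > /tmp/pids.txt
#   staged_kill /tmp/pids.txt 20
```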