JISCMail - LCG-ROLLOUT Archives

On Fri, 27 May 2005, Ian Fisk wrote:

> We are seeing the lock corruption problem again. After stable
> running for 5 days we have two corrupted grid-manager-monitor lock
> files. The machine swapping does not appear to be the cause of the
> problem, but rather a result of it.

I have a hunch the problem is due to your system:

-----------------------------------------------------------------------------
$ globus-job-run cmslcgce.fnal.gov /bin/pwd
/uscms_data/d1/grid_home/dteam006

$ globus-job-run cmslcgce.fnal.gov /bin/df .
Filesystem           1K-blocks      Used Available Use% Mounted on
131.225.207.94:fs_ibrix02
                     2836649880 991990964 1838253096  36% /uscms_data/d1
-----------------------------------------------------------------------------

Did Lisa not send a message to the roll-out list a few days ago,
about problems with the Ibrix file server?

You may want to mount the thing with the "noac" option.
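
A minimal sketch of what that could look like, assuming the file system
is mounted over plain NFS (an Ibrix-specific client may need different
options). The helper name noac_mount_cmd is made up for illustration; it
only prints the command rather than running it, since the file system
would have to be unmounted first and the mount itself needs root. Note
that "noac" turns off attribute caching, which helps clients see a
consistent view of the lock files but costs performance.

```shell
# Print (do not run) a mount command that disables NFS attribute caching.
# Assumption: a plain NFS mount; not verified against the Ibrix setup.
noac_mount_cmd() {
    # $1 = server:export, $2 = mount point
    printf 'mount -t nfs -o noac %s %s\n' "$1" "$2"
}

# Values taken from the df output above.
noac_mount_cmd 131.225.207.94:fs_ibrix02 /uscms_data/d1
```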

> We have hundreds of the grid-monitor processes. Today the users in
> question are new people and they both came in through the CERN RB.
> In some cases the grid-monitor-manager doesn't finish, so the job
> state is not updated and users don't get their results, because the
> jobs never change state.
> 
> The grid-monitor-manager perl script seems to be set to delete itself
> after it starts, and it's submitted with the job, so I'm not sure how
> to debug it. The instance of the script recorded in the gass cache
> appears to be consistent across users. At the moment we have to
> watch and respond before the system has so few resources that it has
> to be rebooted to clear them.
> 
> -Ian
> 
> 
> On May 24, 2005, at 12:13 PM, David Smith wrote:
> 
> > On Tue, 24 May 2005, Ian Fisk wrote:
> >
> >
> >> We could use some guidance on solving this grid-manager-monitor
> >> problem.
> >>
> >> The lock file for one user is corrupted and we are seeing the
> >> grid-manager-monitor thrashing reported a few days ago. Somehow the
> >> problem seems reproducible for the user, but he is submitting through
> >> the same RB as several other people and using the same infrastructure
> >> to configure the jobs. He has submitted 200 jobs, but another user
> >> without the problem has also submitted more than 200.
> >>
> >> The gateway has all 4GB of real memory used and 2GB of swap used.
> >>
> >> The lock file is corrupted for cms002. It should be about 16 bytes
> >> long.
> >>
> >
> > Hello Ian,
> >
> > I wasn't able to understand exactly what was happening - there was a
> > problem with some internal locks for the cms002 user
> > (~cms002/.lcgjm/*lock) which was making the job managers hang. I think
> > it is likely this was from earlier problems with the machine running
> > low on memory. I was able to correct that, and now things appear to be
> > running smoothly. However, since I wasn't able to understand the
> > corruption of the lock file nor the cause of the original problem
> > exactly, it is quite possible there is a further problem or bug that
> > needs to be found. I'll try to keep a periodic check on your CE, and
> > likewise if you notice any problem send another mail, and CC me
> > directly on it.
> >
> > Yours,
> > David
> > --
> > -------------------------------------------------------------------------
> > David Smith       e-mail: [log in to unmask]        tel: +41 22 76 74462
> > Address: D. Smith, CERN G06610, Bat 28 R-007, 1211 Geneva 23, Switzerland
> > -------------------------------------------------------------------------
> >
>
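
For spotting the corruption early, a rough sketch along these lines might
help. It assumes, going only by the remarks quoted above, that the lock
files match ~cms002/.lcgjm/*lock and that a healthy one is 16 bytes long
("about 16" in the original, so the exact size is a guess and can be
overridden). The function name check_locks is made up for illustration.

```shell
# List lock files whose size differs from the expected size.
# Usage: check_locks DIR [EXPECTED_BYTES]
check_locks() {
    dir=$1
    expected=${2:-16}   # assumption: healthy lock files are 16 bytes
    for f in "$dir"/*lock; do
        [ -f "$f" ] || continue            # glob matched nothing
        size=$(wc -c < "$f" | tr -d ' ')   # byte count, whitespace stripped
        if [ "$size" -ne "$expected" ]; then
            echo "$f: $size bytes (expected $expected)"
        fi
    done
}
```

Run as, for example, "check_locks ~cms002/.lcgjm" from cron, it would
report suspicious lock files before the gateway starts swapping. It only
checks the size, not the contents.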