On Fri, 27 May 2005, Ian Fisk wrote:
> We are seeing the lock corruption problem again. After stable
> running for 5 days we have two corrupted Grid-manager-monitor lock
> files. The machine swapping does not appear to be the cause of the
> problem, but it is the result of the problem.
I have a hunch the problem is due to your system:
-----------------------------------------------------------------------------
$ globus-job-run cmslcgce.fnal.gov /bin/pwd
/uscms_data/d1/grid_home/dteam006
$ globus-job-run cmslcgce.fnal.gov /bin/df .
Filesystem           1K-blocks      Used Available Use% Mounted on
131.225.207.94:fs_ibrix02
                     2836649880 991990964 1838253096  36% /uscms_data/d1
-----------------------------------------------------------------------------
Did Lisa not send a message to the roll-out list a few days ago,
about problems with the Ibrix file server?
You may want to mount that file system with the "noac" option, so that
the NFS client does not cache file attributes.
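To check whether the option is actually in effect on the gateway,
something along these lines might help. It is only a sketch: it assumes
a Linux client whose /proc/mounts lists "noac" in the options field, and
the sample line (file system type and options) is made up for
illustration. Only the device and mount point come from the df output
above.

```python
def mount_options(mounts_text, mount_point):
    """Return the option list for mount_point from /proc/mounts-style text."""
    for line in mounts_text.splitlines():
        fields = line.split()
        # Fields: device, mount point, fs type, options, dump, pass
        if len(fields) >= 4 and fields[1] == mount_point:
            return fields[3].split(",")
    return None  # mount point not found


def has_noac(mounts_text, mount_point):
    """True if the mount point is present and mounted with noac."""
    options = mount_options(mounts_text, mount_point)
    return options is not None and "noac" in options


# Hypothetical sample line; the real text would be read from /proc/mounts.
SAMPLE = "131.225.207.94:fs_ibrix02 /uscms_data/d1 nfs rw,noac,hard,intr 0 0\n"
print("noac set:", has_noac(SAMPLE, "/uscms_data/d1"))
```

On the real machine one would pass open("/proc/mounts").read() instead
of SAMPLE.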
> We have hundreds of the grid-monitor-processes. Today the users in
> question are new people and they both came in through the CERN RB.
> In some cases the grid-monitor-manager doesn't finish, so the job
> state is not updated and users don't get the results, because the
> jobs never change state.
>
> The grid-monitor-manager perl script seems to be set to delete itself
> after it starts and it's submitted with the job, so I'm not sure how
> to debug. The instance of the script recorded in the gass cache
> appears to be consistent across users. At the moment we have to
> watch and respond before the system has so few resources that it has
> to be rebooted to clear them.
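Until the cause is found, the watching could perhaps be automated. A
rough sketch, assuming the locks live in <home>/.lcgjm/*lock and should
be about 16 bytes, as mentioned further down in this thread. The
function name, the tolerance, and the list of home directories are
hypothetical and would need adjusting.

```python
import glob
import os


def suspicious_locks(home_dirs, expected_size=16, tolerance=4):
    """Return sorted (path, size) pairs for lock files of unexpected size."""
    findings = []
    for home in home_dirs:
        for path in glob.glob(os.path.join(home, ".lcgjm", "*lock")):
            try:
                size = os.path.getsize(path)
            except OSError:
                continue  # lock vanished between listing and stat
            # The tolerance is a guess; "about 16 bytes" is all we know.
            if abs(size - expected_size) > tolerance:
                findings.append((path, size))
    return sorted(findings)
```

It could be run from cron with something like
suspicious_locks(glob.glob("/uscms_data/d1/grid_home/*")), sending a
mail whenever the returned list is not empty.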
>
> -Ian
>
>
> On May 24, 2005, at 12:13 PM, David Smith wrote:
>
> > On Tue, 24 May 2005, Ian Fisk wrote:
> >
> >
> >> We could use some guidance on solving this grid-manager-monitor
> >> problem.
> >>
> >> The lock file for one user is corrupted and we are seeing the
> >> grid-manager-monitor thrashing reported a few days ago. Somehow the
> >> problem seems reproducible for the user, but he is submitting
> >> through the same RB as several other people and using the same
> >> infrastructure to configure the jobs. He has submitted 200 jobs,
> >> but another user without the problem has also submitted more than
> >> 200.
> >>
> >> The gateway has all 4GB of real memory used and 2GB of swap used.
> >>
> >> The lock file is corrupted for cms002. It should be about 16 bytes
> >> long
> >>
> >
> > Hello Ian,
> >
> > I wasn't able to understand exactly what was happening - there was a
> > problem with some internal locks for the cms002 user
> > (~cms002/.lcgjm/*lock) which was making the job managers hang. I
> > think it is likely this was from earlier problems with the machine
> > running low on memory. I was able to correct that, and now things
> > appear to be running smoothly. However, since I wasn't able to
> > understand the corruption of the lock file nor the cause of the
> > original problem exactly, it is quite possible there is a further
> > problem or bug that needs to be found. I'll try to keep a periodic
> > check on your CE, and likewise if you notice any problem send
> > another mail, and CC me directly on it.
> >
> > Yours,
> > David
> > --
> > -------------------------------------------------------------------------
> > David Smith e-mail: [log in to unmask] tel: +41 22 76 74462
> > Address: D. Smith, CERN G06610, Bat 28 R-007, 1211 Geneva 23, Switzerland
> > -------------------------------------------------------------------------
> >
>