We are seeing the lock corruption problem again. After running
stably for 5 days we have two corrupted Grid-manager-monitor lock
files. The swapping on the machine does not appear to be the cause of
the problem, but rather a result of it.
We have hundreds of grid-monitor processes. Today the two affected
users are new, and both came in through the CERN RB. In some cases
the grid-monitor-manager doesn't finish, so the job state is never
updated and the users don't get their results.
The grid-monitor-manager Perl script seems to be set to delete itself
after it starts, and it's submitted with the job, so I'm not sure how
to debug it. The instance of the script recorded in the gass cache
appears to be consistent across users. At the moment we have to
watch and respond before the system has so few resources that it has
to be rebooted to clear them.
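
Since we currently have to watch for this by hand, here is a minimal
sketch of a check that could be run periodically on the gateway. It
assumes the lock files match ~user/.lcgjm/*lock (the pattern given in
David's reply below) and that a healthy lock file is about 16 bytes
(the figure from my earlier message). The function name, the
tolerance, and the home directory layout are hypothetical.

```python
import glob
import os

EXPECTED_SIZE = 16  # bytes; "about 16 bytes" per the thread, not a verified constant
TOLERANCE = 8       # hypothetical slack around the expected size


def find_suspect_locks(home_root, expected=EXPECTED_SIZE, tolerance=TOLERANCE):
    """Return (path, size) pairs for lock files whose size is far from expected.

    Assumes one directory per user under home_root, each holding its
    job-manager locks in .lcgjm/. Lock files whose names start with a
    dot are not matched by the "*lock" pattern.
    """
    suspects = []
    pattern = os.path.join(home_root, "*", ".lcgjm", "*lock")
    for path in sorted(glob.glob(pattern)):
        try:
            size = os.path.getsize(path)
        except OSError:
            continue  # file disappeared between listing and stat
        if abs(size - expected) > tolerance:
            suspects.append((path, size))
    return suspects
```

Calling find_suspect_locks("/home") from cron and mailing any
non-empty result would at least tell us which user is affected before
the machine starts to swap. It only flags odd sizes; it does not
explain or repair the corruption.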
-Ian
On May 24, 2005, at 12:13 PM, David Smith wrote:
> On Tue, 24 May 2005, Ian Fisk wrote:
>
>
>> We could use some guidance on solving this grid-manager-monitor
>> problem.
>>
>> The lock file for one user is corrupted and we are seeing the
>> grid-manager-monitor thrashing reported a few days ago. Somehow the
>> problem seems reproducible for the user, but he is submitting through
>> the same RB as several other people and using the same infrastructure
>> to configure the jobs. He has submitted 200 jobs, but another user
>> without the problem has also submitted more than 200.
>>
>> The gateway has all 4GB of real memory used and 2GB of swap used.
>>
>> The lock file is corrupted for cms002. It should be about 16 bytes long
>>
>
> Hello Ian,
>
> I wasn't able to understand exactly what was happening - there was a
> problem with some internal locks for the cms002 user
> (~cms002/.lcgjm/*lock) which was making the job managers hang. I think
> it is likely this was from earlier problems with the machine running
> low on memory. I was able to correct that, and now things appear to be
> running smoothly. However, since I wasn't able to understand the
> corruption of the lock file nor the cause of the original problem
> exactly, it is quite possible there is a further problem or bug that
> needs to be found. I'll try to keep a periodic check on your CE, and
> likewise if you notice any problem send another mail, and CC me
> directly on it.
>
> Yours,
> David
> --
> -------------------------------------------------------------------------
> David Smith    e-mail: [log in to unmask]    tel: +41 22 76 74462
> Address: D. Smith, CERN G06610, Bat 28 R-007, 1211 Geneva 23, Switzerland
> -------------------------------------------------------------------------
>