On Mon, 23 May 2005, Ian Fisk wrote:
> We are experiencing an interesting problem with the
> grid-monitor-manager. It appears to manifest itself in two ways:
>
> 1.) users report problems of jobs never exiting the scheduling state
> 2.) Machine memory usage increases until the system goes unstable
>
> What appears to be happening is under some circumstances the lock file
> in /opt/globus/tmp for the grid-monitor-manager is corrupted. The
In what way? Can you send me examples?
Recently you reported a similar problem: was that understood in the end?
> grid-monitor-manager decides the lock is stale and submits another copy
> of the grid-monitor-manager. Unfortunately, the lock file is not
> recreated; it is just appended to. Since it's already corrupted, the next
That sounds like another bug.
> time the monitor-manager checks the lock file it still determines it's
> stale and submits again. The grid-monitor-manager should submit one
> instance of the perl script
>
> perl /tmp/grid_manager_monitor_agent.cms001.19552.1000 --delete-self
> --maxtime=3600s
>
> per user. In this failure mode it continually submits new versions
> of the monitor-manager because the old ones are no longer seen to
> exist. After a few hours there are hundreds of copies taking 14M of
> memory each.
>
> Interestingly it appears to happen more frequently for some users than
> others. The two users we have repeatedly observed failure for have
> both used the CNAF RB for at least some of their jobs, but we have
> nothing to suggest that the CNAF RB is the cause of the problem or
> anything more than a coincidence.
>
> Has anyone experienced this error before?
No, but we will look into it.
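
On the lock file being appended to rather than recreated: to make clear what
behaviour I would expect instead, here is a minimal sketch in Python. It is
not the grid-monitor-manager code. The function names are hypothetical, and
the assumption that the lock file holds a single PID on one line is mine.
The point is only that the writer replaces the file atomically, so a
corrupted lock is repaired by the next submission instead of staying
corrupted forever.

```python
import os
import tempfile


def write_lock(path, pid):
    """Replace the lock file atomically so it holds exactly one PID."""
    directory = os.path.dirname(path) or "."
    fd, tmp = tempfile.mkstemp(dir=directory, prefix=".lock.")
    try:
        with os.fdopen(fd, "w") as fh:
            fh.write(f"{pid}\n")
        # Rename over the old file: readers see the old or the new
        # content, never a mixture, and old garbage is discarded.
        os.replace(tmp, path)
    except BaseException:
        if os.path.exists(tmp):
            os.unlink(tmp)
        raise


def read_lock_pid(path):
    """Return the PID stored in the lock file, or None if it is
    missing or cannot be parsed (assumed format: one PID per file)."""
    try:
        with open(path) as fh:
            text = fh.read().strip()
    except FileNotFoundError:
        return None
    try:
        pid = int(text)
    except ValueError:
        return None
    return pid if pid > 0 else None


def lock_is_stale(path):
    """A lock is stale if it is unreadable or its owner is gone."""
    pid = read_lock_pid(path)
    if pid is None:
        return True
    try:
        os.kill(pid, 0)  # signal 0 only checks that the process exists
    except ProcessLookupError:
        return True
    except PermissionError:
        return False  # process exists but belongs to another user
    return False
```

With this scheme a corrupted file is still judged stale once, but the
replacement written by the next submission is clean, so the following check
finds a live owner and nothing further is submitted.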