We are experiencing an interesting problem with the
grid-monitor-manager. It appears to manifest itself in two ways:
1.) Users report jobs that never exit the scheduling state.
2.) Machine memory usage increases until the system becomes unstable.
What appears to be happening is that, under some circumstances, the lock
file in /opt/globus/tmp for the grid-monitor-manager becomes corrupted.
The grid-monitor-manager decides the lock is stale and submits another
copy of the grid-monitor-manager. Unfortunately, the lock file is not
recreated; it is just appended to. Since it is already corrupted, the
next time the grid-monitor-manager checks the lock file it still
determines that the lock is stale and submits again. The
grid-monitor-manager should submit one instance of the perl script
perl /tmp/grid_manager_monitor_agent.cms001.19552.1000 --delete-self
--maxtime=3600s
per user. In this failure mode it continually submits new copies
of the monitor-manager because the old ones are no longer seen to
exist. After a few hours there are hundreds of copies, each taking
14 MB of memory.
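To make the suspected mechanism concrete, here is a minimal Python sketch of it. It is not the grid-monitor-manager code: it assumes, purely for illustration, that the lock file holds a single PID line, and the function names (read_lock_pid, lock_is_stale, append_lock, replace_lock) are hypothetical. The real lock format and staleness check may differ. It only shows why appending to a corrupted lock can never make it valid again, while rewriting it can.

```python
import os
import tempfile


def read_lock_pid(path):
    """Return the PID in the lock file, or None if missing or unparsable.

    Assumption: a valid lock file contains exactly one PID and nothing else.
    """
    try:
        with open(path) as f:
            content = f.read().strip()
    except FileNotFoundError:
        return None
    return int(content) if content.isdigit() else None


def lock_is_stale(path):
    """Treat the lock as stale if it is unreadable or its owner is gone."""
    pid = read_lock_pid(path)
    if pid is None:
        return True  # corrupted or missing lock looks stale
    try:
        os.kill(pid, 0)  # signal 0 only checks that the process exists (POSIX)
    except ProcessLookupError:
        return True
    except PermissionError:
        return False  # process exists but belongs to another user
    return False


def append_lock(path, pid):
    """Suspected faulty behaviour: add the new PID to the existing file."""
    with open(path, "a") as f:
        f.write(f"{pid}\n")


def replace_lock(path, pid):
    """Possible remedy: write a fresh file and atomically swap it in."""
    directory = os.path.dirname(path) or "."
    fd, tmp = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        f.write(f"{pid}\n")
    os.replace(tmp, path)
```

With a lock file that already contains garbage, append_lock leaves lock_is_stale returning True on every later check, so each check would trigger another submission. After replace_lock with the PID of a live process, the same check returns False.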
Interestingly, it appears to happen more frequently for some users than
others. The two users for whom we have repeatedly observed the failure have
both used the CNAF RB for at least some of their jobs, but we have
nothing to suggest that the CNAF RB is the cause of the problem or
anything more than a coincidence.
Has anyone experienced this error before?
Thanks, Ian