On Fri, 6 May 2005, Ian Fisk wrote:
> I have a strange problem. The FNAL site is being thrashed by
> hundreds of copies of /tmp/grid_manager_monitor_agent that run on the
> gateway, spawned by the fork queue. Each instance takes 14M of
> memory and before long all the system memory is used. They are all
> from the same user, who submitted a lot of jobs a few days ago, but
> killed them with edg-job-cancel. What is particularly strange is
> that I killed 700 of them this afternoon. After 6 hours there were
> more than 200 running again.
>
> At the moment I have to monitor this manually. Any thoughts on the
> cause or a solution would be appreciated.
It seems the Condor-G component of the Resource Broker (RB) does not
gracefully handle a large number of cancellations involving a single CE
(I think this used to work fine), so I will open a bug about it.
For now you would have to keep monitoring the situation; you can also
try to speed up the cleanup by doing something drastic:
1. First, to avoid new jobs, ensure the GRIS publishes something other
than "Production" for "GlueCEStateStatus" (e.g. "Draining").
2. Temporarily prevent the grid_monitor processes or anything else from
getting started by renaming /opt/globus/libexec/globus-job-manager;
the RB would quickly consider the affected jobs to have failed.
3. Ensure the jobs are removed from the batch system.
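The drastic cleanup above might be sketched as follows. This is only an
illustration: `baduser`, the GRIS LDAP endpoint, and the PBS commands
(`qselect`/`qdel`) are assumptions for a typical LCG-era site; substitute
`condor_rm`, `bkill`, etc. for your batch system.

```shell
#!/bin/sh
# Sketch only: paths are from the message above; the user name, GRIS
# port (2135), and PBS commands are site-specific assumptions.

JM=/opt/globus/libexec/globus-job-manager

# 1. Verify the GRIS no longer advertises "Production" for this CE.
ldapsearch -x -H ldap://localhost:2135 -b mds-vo-name=local,o=grid \
    '(objectClass=GlueCE)' GlueCEStateStatus

# 2. Move the job manager out of the way so nothing new can start.
mv "$JM" "$JM.disabled"

# Kill the monitor agents that are already running.
pkill -f grid_manager_monitor_agent

# 3. Remove the affected user's jobs from the batch system (PBS shown).
qdel $(qselect -u baduser)
```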
After an hour or so you would re-enable the globus-job-manager and check
what happens over the next half hour; if all looks fine, you would change
"GlueCEStateStatus" back to "Production".
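Re-enabling could then look like this sketch (the one-minute polling
interval is arbitrary, not something prescribed by the middleware):

```shell
# Put the job manager back into place.
mv /opt/globus/libexec/globus-job-manager.disabled \
   /opt/globus/libexec/globus-job-manager

# Watch whether monitor agents start piling up again:
# -f matches the full command line, -c prints only the count.
watch -n 60 'pgrep -fc grid_manager_monitor_agent'
```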