Yo,
During the night, many stale gram_job_state files belonging to user
lhcbpr06 "reappeared" on our system. They take a lickin' and keep on
tickin' .... These jobs were submitted from machines
rb114.cern.ch
rb123.cern.ch
lcgrb03.gridpp.rl.ac.uk
At least, that's what we think: there are lhcbpr06 monitoring jobs
on the CE machine with dest-url parameters pointing to these machines.
[ later: yes, they do come from these machines, as evidenced by records
in the jobmanager log file. Since midnight we've had an equal number of
connections (about 2500) from each of these RBs, i.e. about 220
connections per hour per RB. ]
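For the record, the per-RB tally was just a grep/sort/uniq pipeline. The log lines below are invented for illustration (our real jobmanager log format differs); only the pipeline is the point:

```shell
# Tally connections per source host. The "Got connection from ..." format
# is a made-up stand-in; point the awk at your real gatekeeper/jobmanager log.
log=$(mktemp)
cat > "$log" <<'EOF'
Got connection from rb114.cern.ch
Got connection from rb123.cern.ch
Got connection from rb114.cern.ch
Got connection from lcgrb03.gridpp.rl.ac.uk
EOF
awk '/Got connection/ {print $NF}' "$log" | sort | uniq -c | sort -rn
```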
There have been no jobs submitted by user lhcbpr06 on our machine
for over a month!!
In /opt there are 2981 files owned by user lhcbpr06, mostly
gram_job_state files.
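The count itself was a plain find by owner. Sketched here against a scratch directory owned by the current user, since we obviously can't assume an lhcbpr06 account everywhere; the production form is in the comment:

```shell
# Count files owned by a given user. -user accepts a name or numeric uid.
# In production this was roughly: find /opt -user lhcbpr06 | wc -l
dir=$(mktemp -d)
touch "$dir/job.a" "$dir/job.a.lock" "$dir/job.b"
find "$dir" -user "$(id -un)" -type f | wc -l
```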
There are also 17 processes (this number fluctuates with time) on our
CE running as lhcbpr06 : three are running perl scripts in a gass
cache area and have dest-url parameters as reported above; one is
running the command
'perl /tmp/grid_manager_monitor_agent.lhcbpr06.4627.1000 ...'
the rest look like
globus-job-manager -conf /opt/globus/etc/globus-job-manager.conf
-type pbs
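The process census is the usual ps/awk pipeline. One note: an exact match on the first field is safer than the `$1~/lhcbpr06/` regex used further down, which would also catch a hypothetical user like lhcbpr06x. Sample ps output is inlined so the sketch runs anywhere:

```shell
# Extract PIDs of a user's processes from ps-style output.
# Normally you'd pipe from `ps uaxw`; sample lines inlined here.
printf '%s\n' \
  'USER PID %CPU %MEM COMMAND' \
  'lhcbpr06 9265 4.0 0.0 globus-job-manager' \
  'root 1 0.0 0.0 init' \
  'lhcbpr06 9266 6.0 0.0 globus-job-manager' \
  | awk '$1=="lhcbpr06" {print $2}'
```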
There have also been about 2300 connections to the gatekeeper from this
user in the last 11 hours, each requesting to execute a script like this:
exec=https://rb123.cern.ch:20137/opt/lcg/sbin/grid_monitor.sh
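Tallying those requests per RB is a one-liner over the exec= URLs; the sample lines below mimic what we see (the exact surrounding log format varies, so treat this as a sketch):

```shell
# Count grid_monitor.sh submissions per RB host, keyed on the exec= URL host.
cat <<'EOF' | sed -n 's|.*exec=https://\([^:/]*\).*|\1|p' | sort | uniq -c
exec=https://rb123.cern.ch:20137/opt/lcg/sbin/grid_monitor.sh
exec=https://rb114.cern.ch:20123/opt/lcg/sbin/grid_monitor.sh
exec=https://rb123.cern.ch:20137/opt/lcg/sbin/grid_monitor.sh
EOF
```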
There are absolutely NO connections from lhcbpr06 trying to do
anything other than run the grid_monitor.sh script. My conclusion is
that various RBs are holding on to stale information and trying to
monitor jobs that have been gone for several weeks. (We have the same
problem with tons of biomed jobs, from users that have not been seen
on the system since early August.)
Just to check and make sure it's not us, we tried the following:
in /opt/globus :
first stop the gatekeeper
find . -uid <lhcbpr06 uid> | xargs rm
ps uaxw | awk '$1~/lhcbpr06/ {print $2}' | xargs kill -KILL
check to make sure everything is gone (it wasn't)
then start the gatekeeper again
which should remove all traces of lhcbpr06 activity from the system.
See what happens.
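One caveat on the rm step: a bare `find | xargs rm` chokes on filenames containing whitespace; NUL-terminated names are safer. A self-contained sketch on a throwaway directory:

```shell
# Safer bulk delete than `find ... | xargs rm`: -print0/-0 survives
# spaces in filenames. Demonstrated on a scratch directory.
dir=$(mktemp -d)
touch "$dir/job.1" "$dir/job.1.lock" "$dir/job with space"
find "$dir" -type f -print0 | xargs -0 rm -f
find "$dir" -type f | wc -l    # nothing left
```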
within 30 seconds we see:
> lhcbpr06 9265 4.0 0.0 4528 2444 ? S 11:01 0:00 globus-job-manager -conf /opt/globus/etc/globus-job-manager.conf -type pbs -rdn jobmanager-pbs -machine-typ$
> lhcbpr06 9266 6.0 0.0 4520 2440 ? S 11:01 0:00 globus-job-manager -conf /opt/globus/etc/globus-job-manager.conf -type pbs -rdn jobmanager-pbs -machine-typ$
> lhcbpr06 9273 0.0 0.1 5428 3752 ? R 11:01 0:00 /usr/bin/perl /opt/globus/libexec/globus-job-manager-script.pl -m pbs -f /tmp/gram_8Awxdv -c proxy_relocate
> [root@tbn20 globus]#
and state files from this user reappear:
./tmp/gram_job_state/job.tbn20.nikhef.nl.8926.1189155674.lock
./tmp/gram_job_state/job.tbn20.nikhef.nl.9118.1189155691.lock
./tmp/gram_job_state/job.tbn20.nikhef.nl.8926.1189155674
./tmp/gram_job_state/job.tbn20.nikhef.nl.9265.1189155706.lock
./tmp/gram_job_state/job.tbn20.nikhef.nl.9118.1189155691
./tmp/gram_job_state/job.tbn20.nikhef.nl.9265.1189155706
./tmp/gram_job_state/job.tbn20.nikhef.nl.11582.1189155785
./tmp/gram_job_state/job.tbn20.nikhef.nl.11582.1189155785.lock
[ all from lhcbpr06 ]
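Incidentally, the last dotted field of each state-file name looks like a Unix timestamp (the job.&lt;host&gt;.&lt;pid&gt;.&lt;time&gt; layout is our inference, not documented); pulling it off is trivial:

```shell
# Extract the trailing field of a gram_job_state file name; it appears
# to be the creation time as a Unix epoch (inferred from the values).
f=job.tbn20.nikhef.nl.9265.1189155706
ts=${f##*.}
echo "$ts"    # -> 1189155706
```

1189155706 decodes to about 09:01 UTC on 7 Sep 2007, i.e. 11:01 local time here, which seems consistent with the 11:01 timestamps in the ps output above.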
Conclusion: we can't do anything to stop this; it is an RB problem.
J "the job bomb rides again" T