Hi,
We've been seeing lots of apparently 'dead' globus-job-manager processes
lying around here for a while, and I decided to look into it today. There
seem to be two classes of job manager processes on the CE machine (tbn20):
1. 'active' job managers; they all seem to be running a perl script that
is located in their GASS cache path, and this is apparently monitoring
the status of one or more jobs, as the 'dest-url' looks like:
https://lcgrb01.gridpp.rl.ac.uk:50002/tmp/condor_g_scratch.0xa2271c0.19018/\
tbn20.nikhef.nl:2119.0x843fda8/grid-monitor-job-status.tbn20.nikhef.nl:2119.3570.733
There is apparently one job manager of this type per user, per RB.
2. 'inactive' job managers, which may actually hang around for weeks.
They look the same as the first class, except that there is no associated
perl script running. For example:
lhcb002 1493 0.0 0.1 8176 4636 ? S Jul25 0:00 \
globus-job-manager -conf \
/opt/globus/etc/globus-job-manager.conf -type fork \
-rdn jobmanager-fork -machine-type unknown -publish-jobs
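To list such leftover job managers in bulk, one could check each globus-job-manager for an accompanying perl process. Below is a minimal Python sketch of that idea. It assumes that the perl monitor script runs as a direct child of its job manager (not verified here) and that the input is the text printed by `ps -eo pid,ppid,args`. The function name is made up for illustration.

```python
def find_inactive_job_managers(ps_output):
    """Return PIDs of globus-job-manager processes that have no perl child.

    ps_output is the text printed by `ps -eo pid,ppid,args`.
    Assumption: an 'active' job manager is the direct parent of its perl script.
    """
    procs = []
    for line in ps_output.splitlines():
        fields = line.split(None, 2)
        # Skip the header line and anything that is not "pid ppid command".
        if len(fields) < 3 or not (fields[0].isdigit() and fields[1].isdigit()):
            continue
        procs.append((int(fields[0]), int(fields[1]), fields[2]))

    # PIDs that are the parent of at least one perl process.
    parents_with_perl = {
        ppid for _, ppid, args in procs if "perl" in args.split()[0]
    }
    return [
        pid
        for pid, _, args in procs
        if "globus-job-manager" in args.split()[0] and pid not in parents_with_perl
    ]
```

Feeding it the ps output from the CE would give the candidate PIDs to inspect by hand before killing anything.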
If I attach strace to that process (PID 1493) to see what it is doing:
[root@tbn20 root]# strace -p 1493
Process 1493 attached - interrupt to quit
select(18, [4 8 11 12], [], [], {7, 490000}) = 0 (Timeout)
gettimeofday({1122365060, 774869}, NULL) = 0
select(18, [4 8 11 12], [], [], {0, 114}) = 0 (Timeout)
gettimeofday({1122365060, 784780}, NULL) = 0
gettimeofday({1122365060, 784825}, NULL) = 0
_llseek(15, 608, [608], SEEK_SET) = 0
read(15, 0x80cd6d0, 4096) = -1 ESTALE (Stale NFS file handle)
gettimeofday({1122365060, 785061}, NULL) = 0
gettimeofday({1122365060, 785116}, NULL) = 0
select(18, [4 8 11 12], [], [], {9, 824242}) = 0 (Timeout)
gettimeofday({1122365070, 614070}, NULL) = 0
gettimeofday({1122365070, 614124}, NULL) = 0
gettimeofday({1122365070, 614172}, NULL) = 0
gettimeofday({1122365070, 614219}, NULL) = 0
select(18, [4 8 11 12], [], [], {0, 160764}) = 0 (Timeout)
gettimeofday({1122365070, 784035}, NULL) = 0
gettimeofday({1122365070, 784087}, NULL) = 0
_llseek(15, 608, [608], SEEK_SET) = 0
read(15, 0x80cd6d0, 4096) = -1 ESTALE (Stale NFS file handle)
gettimeofday({1122365070, 784248}, NULL) = 0
gettimeofday({1122365070, 784291}, NULL) = 0
select(18, [4 8 11 12], [], [], {9, 990692} <unfinished ...>
Process 1493 detached
My guess is that it is supposed to read something from a file, and that
file will tell it when the process should die, but the file is gone (hence
the ESTALE on fd 15), so the process never learns that it should have
terminated itself.
As for the cause, I suspect that the script/process somehow manages to wait
long enough between reads that the job's home directory mount (autofs)
'expires' and gets unmounted.
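A poller could cope with this by treating ESTALE as a cue to reopen the file by path, and by giving up if the file no longer exists. Below is a minimal Python sketch of that idea. It is not the job manager's actual code, and `read_status` and its parameters are made-up names. The `pread` argument exists only so the stale-handle case can be simulated.

```python
import errno
import os


def read_status(path, fd, offset, size=4096, pread=os.pread):
    """Read up to `size` bytes at `offset`, recovering from a stale handle.

    Returns (fd, data). If the file no longer exists, returns (None, None),
    which a caller could take as its cue to exit.
    """
    try:
        return fd, pread(fd, size, offset)
    except OSError as exc:
        if exc.errno != errno.ESTALE:
            raise
    # The descriptor is stale. Opening by path again forces a fresh lookup
    # (and, on autofs, a remount if the mount had expired).
    try:
        new_fd = os.open(path, os.O_RDONLY)
    except FileNotFoundError:
        new_fd = None
    os.close(fd)  # the old descriptor is useless either way
    if new_fd is None:
        return None, None
    return new_fd, pread(new_fd, size, offset)
```

A caller would loop on this, sleeping between polls, and exit once the returned data is None instead of retrying the same dead descriptor forever.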
Comments? Should I submit a bug?
J "lekker ontkeveren" T