Hi,
Since approx 19:00 on Friday (gotta love the timing) we've had an
average of 500 processes running on our CE.
Normally we're below 100 processes.
I've seen failing SAM tests with:
Destination = LRMS
- Host = wms206.cern.ch
- Reason = 10 data transfer to the server failed
- Result = FAIL
and
Host = wms209.cern.ch
- Reason = Got a job held event, reason: Globus error 21: the job manager failed to locate an internal script argument file
Looking at the processes I see a lot of
pillhb04 3247 0.0 0.1 6100 3720 ? S 09:12 0:00 globus-job-manager -conf /opt/globus/etc/globus-job-manager.conf -type lcgsge -rdn jobmanager-lcgsge -machine-type unknown -publish-jobs
opssgm 3320 0.0 0.1 6084 3620 ? S 09:13 0:00 globus-job-manager -conf /opt/globus/etc/globus-job-manager.conf -type lcgsge -rdn jobmanager-lcgsge -machine-type unknown -publish-jobs
and also a lot of
patlas06 16739 0.0 0.4 80144 9984 ? S 10:02 0:00 globus-job-manager-marshal: waiting in queue
patlas06 16740 0.0 0.0 1480 372 ? S 10:02 0:00 /opt/globus/libexec/globus-job-manager-script.pl -m lcgsge -f /tmp/gram_stage_ina8oBiy -c stage_in
patlas06 16741 0.0 0.4 80144 9984 ? S 10:02 0:00 globus-job-manager-marshal: waiting in queue
patlas06 16742 0.0 0.0 1480 372 ? S 10:02 0:00 /opt/globus/libexec/globus-job-manager-script.pl -m lcgsge -f /tmp/gram_stage_inZ8O1KQ -c stage_in
I think that somehow jobs are getting stuck and this causes a backlog in
the globus-job-manager-marshal but I've not found the sticking point yet.
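
To see where the backlog sits, it may help to tally the job-manager processes per user and per kind rather than eyeballing the ps listing. The following is only a rough sketch: the function names (classify, tally) are made up for illustration, it assumes the input is the text produced by `ps -eo user,args`, and it assumes the command lines look like the ones pasted above.

```python
from collections import Counter


def classify(args):
    """Map a command line to a coarse category, or None to ignore it."""
    words = args.split()
    if not words:
        return None
    exe = words[0].rsplit("/", 1)[-1]  # drop any leading directory
    if exe == "globus-job-manager-marshal:":
        # e.g. "marshal: waiting in queue" or "marshal: running"
        return "marshal: " + " ".join(words[1:])
    if exe == "globus-job-manager-script.pl":
        # the listing above shows "-c <action>", e.g. "-c stage_in"
        if "-c" in words[:-1]:
            return "script: " + words[words.index("-c") + 1]
        return "script: unknown"
    if exe == "globus-job-manager":
        return "job-manager"
    return None


def tally(ps_text):
    """Count processes per (user, category) in `ps -eo user,args` output."""
    counts = Counter()
    for line in ps_text.splitlines():
        parts = line.split(None, 1)  # user, then the rest of the line
        if len(parts) < 2:
            continue
        user, args = parts
        kind = classify(args)
        if kind is not None:
            counts[(user, kind)] += 1
    return counts
```

Something like `tally(subprocess.check_output(["ps", "-eo", "user,args"], text=True)).most_common(10)` would then list the ten largest groups, which should show whether the pile-up is mostly "waiting in queue" marshal processes and which accounts own them.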
Google gave me:
http://goc.grid.sinica.edu.tw/gocwiki/10_data_transfer_to_the_server_failed
and I'm going through the points.
Does anyone have any ideas or good tips?
Also I see:
[root@grid-lcgce ~]# ps auxwww | grep marshal | grep running
atlasprd 18989 0.0 0.4 80276 10272 ? S 10:10 0:00 globus-job-manager-marshal: running
atlasprd 18993 0.0 0.4 80276 10272 ? S 10:10 0:00 globus-job-manager-marshal: running
atlasprd 18994 0.0 0.4 80276 10272 ? S 10:10 0:00 globus-job-manager-marshal: running
atlasprd 18997 0.0 0.4 80276 10276 ? S 10:10 0:00 globus-job-manager-marshal: running
atlasprd 20133 0.0 0.4 80276 10272 ? S 10:15 0:00 globus-job-manager-marshal: running
Does this point towards a problem with the atlasprd account, or is it
just that this account is more active? (The PIDs of the processes change
over time, so I was assuming all was OK.)
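
One way to check the "PIDs change, so it is probably fine" assumption is to take two ps snapshots a few minutes apart and list the "running" marshal processes that appear in both. This is a minimal sketch under stated assumptions: the helper names (parse, survivors) are invented here, the input is assumed to come from `ps -eo pid,user,args`, and PID reuse between snapshots is ignored.

```python
def parse(ps_text):
    """Return {pid: (user, args)} from `ps -eo pid,user,args` output."""
    procs = {}
    for line in ps_text.splitlines():
        parts = line.split(None, 2)  # pid, user, rest of the line
        if len(parts) == 3 and parts[0].isdigit():  # skips the header row
            procs[int(parts[0])] = (parts[1], parts[2])
    return procs


def survivors(before, after, needle="globus-job-manager-marshal: running"):
    """PIDs matching `needle` in both snapshots with the same owner."""
    return sorted(
        pid
        for pid, (user, args) in before.items()
        if needle in args
        and pid in after
        and after[pid][0] == user
        and needle in after[pid][1]
    )
```

If `survivors(first, second)` comes back empty for snapshots taken some minutes apart, the atlasprd entries are turning over and the account is probably just busy; PIDs that keep reappearing would be the ones to look at more closely.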
cheers
johnk
--
+------------------------------------------------------------+
|Dr. John Alan Kennedy Rechenzentrum Garching (RZG) |
|Boltzmannstrasse 2 |
|Phone: +49 89 3299 2694 85748 Garching |
|Fax: +49 89 3299 1301 |
+------------------------------------------------------------+