Hi there,
we have not found a solution yet.
What we seem to have found is the problem itself.
AFAIU, on the lcg-CE the job is managed in a directory in the user's home
(which is on a shared NFS in our case):
/home/zidtm041/.globus/job/desdemona.zih.tu-dresden.de/
For every job, a directory named after some ID is created there:
/home/zidtm041/.globus/job/desdemona.zih.tu-dresden.de/13762.1289309526
Some scripts and, later on, the output can/should be found there.
Our problem is that the job creates such a directory (including e.g.
scheduler_pbs_job_script; see 18806.1289316531 below), but this
directory disappears after a short time for unknown reasons.
Then a new directory with a new ID (see 20785.1289316535 below)
appears, but with the PBS scripts missing:
[root@desdemona jobdir]# ll 18806.1289316531
total 20
-rw-r--r-- 1 root root 33 Nov 9 16:29 remote_io_url
-rw-r--r-- 1 root root 2165 Nov 9 16:29 scheduler_pbs_job_script
-rw-r--r-- 1 root root 0 Nov 9 16:29 scheduler_pbs_submit_stderr
-rw------- 1 root root 0 Nov 9 16:29 stderr
-rw------- 1 root root 0 Nov 9 16:29 stdout
-rw------- 1 root root 9946 Nov 9 16:29 x509_up
[root@desdemona jobdir]# ll 20785.1289316535
total 16
-rw------- 1 root root 249 Nov 9 16:30 stdout
-rw------- 1 root root 9950 Nov 9 16:30 x509_up
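The second part of each directory name looks like a Unix timestamp (in seconds). This is only an assumption on our side, not something we have confirmed for the job manager, but it is consistent with the file times in the listings. A minimal Python sketch, with the hypothetical helper parse_job_dir, that decodes the two names under that assumption:

```python
from datetime import datetime, timezone

def parse_job_dir(name):
    """Split '<number>.<seconds>' into its two parts.

    Assumption: the second part is a Unix timestamp in seconds.
    The meaning of the first part is unknown to us; it is returned as an int.
    """
    first, seconds = name.split(".")
    return int(first), datetime.fromtimestamp(int(seconds), tz=timezone.utc)

# Directory names taken from the listings above.
old_id, old_time = parse_job_dir("18806.1289316531")
new_id, new_time = parse_job_dir("20785.1289316535")

print("old directory:", old_id, old_time.isoformat())
print("new directory:", new_id, new_time.isoformat())
print("seconds between the two:", (new_time - old_time).total_seconds())
```

If the assumption holds, the two names differ by only 4 seconds, so the replacement directory would have been created almost immediately after the first one.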
After some time the job is submitted again (by the WMS?).
The first jobs (and the others) then run on a WN, but crash when trying
to copy stdout and stderr from the torque directory on the WN to the
CE, as the first job directory (18806.1289316531) is gone.
Can anybody give us a hint where to look, or what the problem might be?
Would switching on more logging help?
We are running out of ideas.
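One thing we could still try is to record exactly when a job directory appears and disappears, so that the times can be matched against the CE and torque logs. Below is a minimal Python sketch of such a watcher. The function names (snapshot, diff_snapshots, watch) are hypothetical helpers, not part of any grid middleware, and simple polling may miss a directory that lives for less than one polling interval. It also cannot tell which process removed the directory; it only provides timestamps.

```python
import os
import time

def snapshot(path):
    """Return the set of entry names currently in path (empty if path is missing)."""
    try:
        return set(os.listdir(path))
    except FileNotFoundError:
        return set()

def diff_snapshots(before, after):
    """Return (appeared, disappeared) entry names, each as a sorted list."""
    return sorted(after - before), sorted(before - after)

def watch(path, interval=1.0, rounds=None):
    """Poll path and print a timestamped line for each entry that appears or vanishes.

    rounds=None polls forever; an integer stops after that many polls.
    """
    seen = snapshot(path)
    done = 0
    while rounds is None or done < rounds:
        time.sleep(interval)
        current = snapshot(path)
        appeared, disappeared = diff_snapshots(seen, current)
        stamp = time.strftime("%Y-%m-%d %H:%M:%S")
        for name in appeared:
            print(f"{stamp} appeared    {name}")
        for name in disappeared:
            print(f"{stamp} disappeared {name}")
        seen = current
        done += 1

# Example call (path taken from this message; runs until interrupted):
# watch("/home/zidtm041/.globus/job/desdemona.zih.tu-dresden.de/")
```

Run on the CE while a test job is submitted, this would print one line per created or removed job directory.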
Cheers,
Ralph
Ralph Mueller-Pfefferkorn wrote on 03.11.2010 17:42:
> Hi there,
>
> we have a mysterious problem.
> We run an extra torque server and a lcg-CE.
> After an update of the torque server (operating system update), suddenly
> all jobs are submitted several times to the system.
>
> A job arrives at the lcg-CE and is passed to the torque server (other
> machine). Torque accepts the job and runs it.
> The logs both on the CE and torque look normal. But after about half a
> minute the same job (same Grid ID) is submitted again to the lcg-CE. And
> again and again. The same job is submitted 9 times.
>
> The jobs then fail when trying to copy their output from the WN to the CE:
> from WN /var/log/messages:
> Nov 3 17:36:03 r1i1n15 pbs_mom: sys_copy, command '/usr/bin/scp -rpB
> /var/spool/pbs/spool/5137506.service0.ice.zih.tu-dresden.de.OU
> [log in to unmask]:/home/ziops022/.globus/job/desdemona.zih.tu-dresden.de/32270.1288802028/stdout'
> failed with status=1, giving up after 4 attempts
> Nov 3 17:36:03 r1i1n15 pbs_mom: req_cpyfile, Unable to copy file
> /var/spool/pbs/spool/5137506.service0.ice.zih.tu-dresden.de.OU to
> [log in to unmask]:/home/ziops022/.globus/job/desdemona.zih.tu-dresden.de/32270.1288802028/stdout
>
> The reason is that the directory
> /home/ziops022/.globus/job/desdemona.zih.tu-dresden.de/32270.1288802028/
> on the CE does not exist anymore.
>
> Does anybody have an idea where to look?
>
> Cheers,
> Ralph
>