Hello Ralph,
> What we seemed to have found is the problem.
I think it is rather a symptom...
> AFAIU on the lcg-CE the job is managed in a directory of a user's home
> (which is a shared NFS in our case):
What are the NFS mount options? Do you have the same options on all
machines that mount the file system?
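
One way to compare the options is sketched below. It is only a rough illustration, assuming you have collected the contents of /proc/mounts from each machine yourself (for example over ssh). The function names and the host labels "ce" and "wn" are made up for the example and are not part of any grid middleware.

```python
def nfs_mount_options(mounts_text, mount_point):
    """Return the option set of the NFS mount at mount_point, or None if absent."""
    for line in mounts_text.splitlines():
        fields = line.split()
        # /proc/mounts columns: device, mount point, fs type, options, dump, pass
        if len(fields) < 4:
            continue
        where, fstype, options = fields[1], fields[2], fields[3]
        if where == mount_point and fstype.startswith("nfs"):
            return set(options.split(","))
    return None


def compare_hosts(mounts_by_host, mount_point):
    """Map each host to the options it has that are not shared by all hosts."""
    per_host = {
        host: nfs_mount_options(text, mount_point) or set()
        for host, text in mounts_by_host.items()
    }
    common = set.intersection(*per_host.values()) if per_host else set()
    return {host: sorted(opts - common) for host, opts in per_host.items()}


# Hypothetical sample data; real input would be each machine's /proc/mounts.
sample = {
    "ce": "server:/export/home /home nfs rw,hard,intr,actimeo=0 0 0\n",
    "wn": "server:/export/home /home nfs rw,hard,intr,noac 0 0\n",
}
print(compare_hosts(sample, "/home"))
```

Options that legitimately differ per machine (such as a client address) will also show up in the result, so the output is a starting point for a manual look rather than a verdict.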
> /home/zidtm041/.globus/job/desdemona.zih.tu-dresden.de/
>
> For every job a directory is created there named with some ID:
> /home/zidtm041/.globus/job/desdemona.zih.tu-dresden.de/13762.1289309526
>
> There some scripts and later on the output can/should be found.
>
> Our problem is, that the job creates such a directory (including e.g.
> scheduler_pbs_job_script - see below 18806.1289316531), but this
> directory disappears after a short time for unknown reasons.
> Then a new directory with new ID-numbers (see below 20785.1289316535)
> appears but with the pbs scripts missing:
>
> [root@desdemona jobdir]# ll 18806.1289316531
> total 20
> -rw-r--r-- 1 root root 33 Nov 9 16:29 remote_io_url
> -rw-r--r-- 1 root root 2165 Nov 9 16:29 scheduler_pbs_job_script
> -rw-r--r-- 1 root root 0 Nov 9 16:29 scheduler_pbs_submit_stderr
> -rw------- 1 root root 0 Nov 9 16:29 stderr
> -rw------- 1 root root 0 Nov 9 16:29 stdout
> -rw------- 1 root root 9946 Nov 9 16:29 x509_up
> [root@desdemona jobdir]# ll 20785.1289316535
> total 16
> -rw------- 1 root root 249 Nov 9 16:30 stdout
> -rw------- 1 root root 9950 Nov 9 16:30 x509_up
The second directory probably was for a fork job, e.g. the grid_monitor
submitted by the WMS/Condor-G.
> After some time the job is submitted (by the WMS?) again.
Sure, since the job failed and the user specified a non-zero retry count.
> The first jobs (and the others) then run on a WN, but crash when trying
> to copy the stdout and stderr from the torque directory on the WN to the
> CE as the first job directory (18806.1289316531) is gone.
>
> Can anybody give us a hint - where to look? what might be the problem?
Since WMS jobs fail with Maradona errors, have a look here:
http://goc.grid.sinica.edu.tw/gocwiki/Cannot_read_JobWrapper_output...
The dots are part of the URL.