Jean-Michel Barbet wrote:
> I am using a script $GLITE_LOCAL_CUSTOMIZATION_DIR/cp_1.sh that
> redirects to a local directory on the WN named job-XXXXXXX (mktemp).
> In these '/dlocal/job-XXXXXXX' directories I can have a mix of jobs
> using a directory 'https...' derived from the job ID and jobs working
> directly under the /dlocal/job-XXXXXXX directory:
>
>
> ls -als /dlocal/job-*
> /dlocal/job-gKlnk16866:
> total 12
> 4 drwx------ 3 sgmali020 alicesgm 4096 Apr 20 10:57 .
> 4 drwxrwxrwt 10 root root 4096 Apr 20 10:57 ..
> 4 drwx------ 4 sgmali020 alicesgm 4096 Apr 20 10:57 https_3a_2f_2fgrid02.lal.in2p3.fr_3a9000_2fetlK_5fBRPh5X5egTpONtqVg
>
> /dlocal/job-mkXrc16987:
> total 52
> 4 drwx------ 4 sgmali020 alicesgm 4096 Apr 20 10:57 .
> 4 drwxrwxrwt 10 root root 4096 Apr 20 10:57 ..
> 4 drwxr-xr-x 2 sgmali020 alicesgm 4096 Apr 20 10:57 .alien
> 4 drwxr-x--- 2 sgmali020 alicesgm 4096 Apr 20 11:13 alien-job-26025171
> 4 -rw-r--r-- 1 sgmali020 alicesgm 1001 Apr 20 10:57 .BrokerInfo
> 4 -rw-r--r-- 1 sgmali020 alicesgm 834 Apr 20 10:57 dg-submit.7193.sh
> 4 -rw------- 1 sgmali020 alicesgm 115 Apr 20 10:57 https_3a_2f_2fgrid02.lal.in2p3.fr_3a9000_2fRhY-QMyTeOlCqm5P4UDiqw.output
> 4 -rw-r----- 1 sgmali020 alicesgm 21 Apr 20 10:57 .root_hist
> 12 -rw-r--r-- 1 sgmali020 alicesgm 11286 Apr 20 11:12 std.err
> 8 -rw-r--r-- 1 sgmali020 alicesgm 6065 Apr 20 10:57 std.out
> 0 -rw------- 1 sgmali020 alicesgm 0 Apr 20 10:57 tmp.imlUm17103
>
> /dlocal/job-sDeMvt4461:
> total 8
> 4 drwx------ 2 sgmali016 alicesgm 4096 Dec 17 16:17 .
> 4 drwxrwxrwt 10 root root 4096 Apr 20 10:57 ..
>
> /dlocal/job-sUxTqQ4479:
> total 8
> 4 drwx------ 2 sgmali016 alicesgm 4096 Dec 17 16:05 .
> 4 drwxrwxrwt 10 root root 4096 Apr 20 10:57 ..
>
> /dlocal/job-TFtDbN4528:
> total 8
> 4 drwx------ 2 sgmali016 alicesgm 4096 Dec 17 16:17 .
> 4 drwxrwxrwt 10 root root 4096 Apr 20 10:57 ..
>
> /dlocal/job-wLSRy16215:
> total 60
> 4 drwx------ 4 sgmali020 alicesgm 4096 Apr 20 10:56 .
> 4 drwxrwxrwt 10 root root 4096 Apr 20 10:57 ..
> 4 drwxr-xr-x 2 sgmali020 alicesgm 4096 Apr 20 10:55 .alien
> 4 drwxr-x--- 2 sgmali020 alicesgm 4096 Apr 20 11:12 alien-job-26025092
> 4 -rw-r--r-- 1 sgmali020 alicesgm 1001 Apr 20 10:55 .BrokerInfo
> 4 -rw-r--r-- 1 sgmali020 alicesgm 834 Apr 20 10:55 dg-submit.7193.sh
> 4 -rw------- 1 sgmali020 alicesgm 115 Apr 20 10:55 https_3a_2f_2fgrid02.lal.in2p3.fr_3a9000_2fA9h1BAG8bLb22FYffnuamQ.output
> 4 -rw-r----- 1 sgmali020 alicesgm 21 Apr 20 10:56 .root_hist
> 20 -rw-r--r-- 1 sgmali020 alicesgm 18505 Apr 20 11:11 std.err
> 8 -rw-r--r-- 1 sgmali020 alicesgm 6110 Apr 20 10:56 std.out
> 0 -rw------- 1 sgmali020 alicesgm 0 Apr 20 10:55 tmp.jhuwl16280
>
> /dlocal/job-yZoShU4459:
> total 8
> 4 drwx------ 2 sgmali016 alicesgm 4096 Dec 17 16:17 .
> 4 drwxrwxrwt 10 root root 4096 Apr 20 10:57 ..
The WMS job wrapper always tries to mkdir the directory and then cd into it,
but it continues when either operation fails. Does /var/log/messages
show any problems for the "/dlocal" file system?
Are there any errors under ~sgmali020/.globus/job/*/*?
In theory it is also possible that the user payload moved everything
to the parent directory and deleted the original directory.
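To illustrate how such a mix can arise, here is a rough sketch of the
"try mkdir, try cd, continue regardless" pattern. It is not the actual
wrapper code; the function name enter_jobdir and its arguments are made up
for the example. If either step fails, the job simply keeps running in the
parent directory.

```shell
#!/bin/sh
# Illustrative only: mimics "try mkdir, try cd, continue on failure".
# enter_jobdir, parent and newdir are hypothetical names for this sketch.
enter_jobdir() {
    parent="$1"
    newdir="$2"
    cd "$parent" || return 1
    # Neither failure aborts the job; it is only reported on stderr.
    mkdir "$newdir" 2>/dev/null || echo "mkdir failed for $newdir" >&2
    cd "$newdir" 2>/dev/null || echo "cd failed, staying in $parent" >&2
    # Print the directory the payload would actually run in.
    pwd
}
```

Running something like this by hand as the pool account, against /dlocal,
may show which of the two steps fails.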
> Some job-XXXX directories are empty because they are not removed
> after the job ends.
That could be done by a cron job.
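A minimal sketch of such a cleanup, assuming GNU find and that only empty
job-* directories older than a week should go. The function name, the
default base directory and the 7-day threshold are assumptions to adapt
to your site.

```shell
#!/bin/sh
# Hypothetical cleanup helper: remove empty, stale job-* directories.
# Defaults (/dlocal, 7 days) are assumptions, not site policy.
cleanup_stale_jobdirs() {
    base="${1:-/dlocal}"
    days="${2:-7}"
    # -mindepth/-maxdepth 1: only direct children of $base
    # -empty: skip anything that still has content
    # -mtime +N: last modified more than N days ago
    # rmdir (not rm -rf) so a non-empty directory can never be removed
    find "$base" -mindepth 1 -maxdepth 1 -type d -name 'job-*' \
        -empty -mtime +"$days" -exec rmdir {} \;
}
```

Put into a script that calls cleanup_stale_jobdirs at the end, it could be
run once a day from root's crontab on each WN. It needs root because the
directories are mode 700 and owned by the pool accounts.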
>> That is normal. The EDG_WL_JOBID is only set for jobs sent by RB or WMS
>> nodes and directed to the batch system. The RB/WMS also sends
>> "grid_monitor" jobs running on the lcg-CE itself, and requests to clean
>> up jobs that have finished.
>
>
> All jobs are supposed to come through a WMS.
> You mean that the grid_monitor jobs have the variable EDG_WL_JOBID
> empty ?
Correct.
> PS: I have another problem that I am about to describe here and
> I am not sure if there can be a link between the two :
> Some jobs have misbehaved and tried to remove files recursively
> starting from /. I have evidence of this in undelivered PBS jobs.
That looks like a user payload error. The WMS wrapper does the following
when it ends:
    rm -rf "../${newdir}"
Here ${newdir} looks like "https_3a_2f_2f.....".
I hope $GLITE_LOCAL_CUSTOMIZATION_DIR/cp_1.sh does not redefine it?!
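If you want to rule that out, a guard along the following lines could be
tried in a local test. This is only a sketch, not the wrapper's code:
safe_cleanup is a hypothetical name, and the checks simply refuse an empty
name, "." or "..", and anything containing a slash, so that the rm cannot
reach outside the parent directory.

```shell
#!/bin/sh
# Hypothetical guard around the final cleanup; not the actual WMS wrapper.
safe_cleanup() {
    newdir="$1"
    case "$newdir" in
        ''|.|..|*/*)
            # Empty, dot names, or a path separator: do not touch anything.
            echo "refusing to remove suspicious directory name: '$newdir'" >&2
            return 1
            ;;
    esac
    rm -rf "../${newdir}"
}
```

A plain name such as the "https_3a_2f_2f....." one passes the check and is
removed as before; anything else is reported on stderr and left alone.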