Jean-Michel Barbet wrote:

> I am using a script, $GLITE_LOCAL_CUSTOMIZATION_DIR/cp_1.sh, that
> redirects jobs to a local directory on the WN named job-XXXXXXX (mktemp).
> Under these '/dlocal/job-XXXXXXX' directories I see a mix of jobs using
> a subdirectory 'https...' derived from the job ID and jobs working
> directly under the /dlocal/job-XXXXXXX directory:
> 
> 
> ls -als /dlocal/job-*
> /dlocal/job-gKlnk16866:
> total 12
> 4 drwx------   3 sgmali020 alicesgm 4096 Apr 20 10:57 .
> 4 drwxrwxrwt  10 root      root     4096 Apr 20 10:57 ..
> 4 drwx------   4 sgmali020 alicesgm 4096 Apr 20 10:57 https_3a_2f_2fgrid02.lal.in2p3.fr_3a9000_2fetlK_5fBRPh5X5egTpONtqVg
> 
> /dlocal/job-mkXrc16987:
> total 52
>  4 drwx------   4 sgmali020 alicesgm  4096 Apr 20 10:57 .
>  4 drwxrwxrwt  10 root      root      4096 Apr 20 10:57 ..
>  4 drwxr-xr-x   2 sgmali020 alicesgm  4096 Apr 20 10:57 .alien
>  4 drwxr-x---   2 sgmali020 alicesgm  4096 Apr 20 11:13 alien-job-26025171
>  4 -rw-r--r--   1 sgmali020 alicesgm  1001 Apr 20 10:57 .BrokerInfo
>  4 -rw-r--r--   1 sgmali020 alicesgm   834 Apr 20 10:57 dg-submit.7193.sh
>  4 -rw-------   1 sgmali020 alicesgm   115 Apr 20 10:57 https_3a_2f_2fgrid02.lal.in2p3.fr_3a9000_2fRhY-QMyTeOlCqm5P4UDiqw.output
>  4 -rw-r-----   1 sgmali020 alicesgm    21 Apr 20 10:57 .root_hist
> 12 -rw-r--r--   1 sgmali020 alicesgm 11286 Apr 20 11:12 std.err
>  8 -rw-r--r--   1 sgmali020 alicesgm  6065 Apr 20 10:57 std.out
>  0 -rw-------   1 sgmali020 alicesgm     0 Apr 20 10:57 tmp.imlUm17103
> 
> /dlocal/job-sDeMvt4461:
> total 8
> 4 drwx------   2 sgmali016 alicesgm 4096 Dec 17 16:17 .
> 4 drwxrwxrwt  10 root      root     4096 Apr 20 10:57 ..
> 
> /dlocal/job-sUxTqQ4479:
> total 8
> 4 drwx------   2 sgmali016 alicesgm 4096 Dec 17 16:05 .
> 4 drwxrwxrwt  10 root      root     4096 Apr 20 10:57 ..
> 
> /dlocal/job-TFtDbN4528:
> total 8
> 4 drwx------   2 sgmali016 alicesgm 4096 Dec 17 16:17 .
> 4 drwxrwxrwt  10 root      root     4096 Apr 20 10:57 ..
> 
> /dlocal/job-wLSRy16215:
> total 60
>  4 drwx------   4 sgmali020 alicesgm  4096 Apr 20 10:56 .
>  4 drwxrwxrwt  10 root      root      4096 Apr 20 10:57 ..
>  4 drwxr-xr-x   2 sgmali020 alicesgm  4096 Apr 20 10:55 .alien
>  4 drwxr-x---   2 sgmali020 alicesgm  4096 Apr 20 11:12 alien-job-26025092
>  4 -rw-r--r--   1 sgmali020 alicesgm  1001 Apr 20 10:55 .BrokerInfo
>  4 -rw-r--r--   1 sgmali020 alicesgm   834 Apr 20 10:55 dg-submit.7193.sh
>  4 -rw-------   1 sgmali020 alicesgm   115 Apr 20 10:55 https_3a_2f_2fgrid02.lal.in2p3.fr_3a9000_2fA9h1BAG8bLb22FYffnuamQ.output
>  4 -rw-r-----   1 sgmali020 alicesgm    21 Apr 20 10:56 .root_hist
> 20 -rw-r--r--   1 sgmali020 alicesgm 18505 Apr 20 11:11 std.err
>  8 -rw-r--r--   1 sgmali020 alicesgm  6110 Apr 20 10:56 std.out
>  0 -rw-------   1 sgmali020 alicesgm     0 Apr 20 10:55 tmp.jhuwl16280
> 
> /dlocal/job-yZoShU4459:
> total 8
> 4 drwx------   2 sgmali016 alicesgm 4096 Dec 17 16:17 .
> 4 drwxrwxrwt  10 root      root     4096 Apr 20 10:57 ..

The WMS job wrapper always tries a mkdir and then cd into the directory,
but will continue when either operation fails.  Does /var/log/messages
show any problems for the "/dlocal" file system?
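
To illustrate why a failed mkdir or cd would leave files directly under
/dlocal/job-XXXXXXX, here is a minimal sketch of the behaviour described
above.  It is not the actual wrapper code: the function name enter_job_dir
is made up for this example, and only the variable name newdir is taken
from the wrapper.

```shell
#!/bin/sh
# Hypothetical sketch of "try mkdir, try cd, continue regardless".
# enter_job_dir is an illustrative name, not a function of the WMS wrapper.
enter_job_dir() {
    newdir="$1"
    # Either step may fail (full disk, permissions, file system errors)...
    mkdir "${newdir}" 2>/dev/null || echo "warning: mkdir ${newdir} failed" >&2
    cd "${newdir}" 2>/dev/null || echo "warning: cd ${newdir} failed" >&2
    # ...but execution continues, so the payload may end up running in
    # the parent directory instead of the per-job subdirectory.
    pwd
}
```

If both steps succeed, the function prints the new subdirectory; if they
fail, it prints the directory it was called from, which would match the
second layout in your listing.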

Are there any errors under ~sgmali020/.globus/job/*/*?

In theory it is also possible that the user payload moved everything
to the parent directory and deleted the original directory.

> Some job-XXXX directories are empty because they are not removed
> after the job ends.

That could be done by a cron job.
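
A possible sketch of such a cleanup, to be adapted before use.  The
function name, its arguments, and the 7-day threshold are assumptions of
this example, and it relies on a find that supports -maxdepth and -empty
(GNU and BSD find do).

```shell
#!/bin/sh
# Sketch only: clean_empty_job_dirs and its defaults are assumptions.
# Removes empty job-* directories directly under the base directory that
# were last modified more than the given number of days ago.
clean_empty_job_dirs() {
    base="${1:-/dlocal}"
    days="${2:-7}"
    # rmdir refuses to remove non-empty directories, so a directory that
    # gets populated between the find test and the removal is left alone.
    find "${base}" -maxdepth 1 -type d -name 'job-*' -empty \
        -mtime "+${days}" -exec rmdir {} +
}
```

The age threshold is there so that a directory just created by mktemp
for a starting job is not removed before the job has written anything
into it.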

>> That is normal.  The EDG_WL_JOBID is only set for jobs sent by RB or WMS
>> nodes and directed to the batch system.  The RB/WMS also sends
>> "grid_monitor" jobs running on the lcg-CE itself, and requests to clean
>> up jobs that have finished.
> 
> 
> All jobs are supposed to come through a WMS.
> Do you mean that the grid_monitor jobs have an empty EDG_WL_JOBID
> variable?

Correct.

> PS: I have another problem that I will describe here, and
> I am not sure whether the two are linked:
> Some jobs have misbehaved and tried to remove files recursively
> starting from /. I have evidence of this in undelivered PBS jobs.

That looks like a user payload error.  The WMS wrapper does the following
when it ends:

   rm -rf "../${newdir}"

Here ${newdir} looks like "https_3a_2f_2f.....".

I hope $GLITE_LOCAL_CUSTOMIZATION_DIR/cp_1.sh does not redefine it?!
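
If a customization script did change that variable, the relative path
could point somewhere unintended.  As a precaution, a local script could
check the value before anything is removed.  The helper below is a
hypothetical sketch and not part of the WMS wrapper; the name
is_safe_newdir is made up for this example.

```shell
#!/bin/sh
# Hypothetical helper, not part of the WMS wrapper.
# Succeeds only if the argument is a single, non-empty path component,
# i.e. not "", ".", "..", and not containing any "/".
is_safe_newdir() {
    case "$1" in
        ''|.|..|*/*) return 1 ;;
        *) return 0 ;;
    esac
}

# Illustrative use with the cleanup command quoted above:
#   is_safe_newdir "${newdir}" && rm -rf "../${newdir}"
```

A value like "https_3a_2f_2f....." passes the check, while an empty
value or anything containing a slash is rejected.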