For comparison, what we did with the lcg-CE was to make a small modification to the job manager script. Basically, it checks whether the job is multi-node parallel (which needs a shared directory). If it is, the job runs in the home dir; otherwise it does
cd $TMPDIR
TMPDIR is configured here so that a per-job directory is created on local disk, like
/tmp/jobdir/pbs-job-id
These directories are cleaned up by the MOM automatically when the job finishes. CREAM should indeed support something like this, as having thousands of jobs run in shared homes puts a rather heavy load on the NFS infrastructure.
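A rough sketch of that check, in case it helps. This is not our actual job manager patch: the function names choose_workdir and count_job_nodes are made up for illustration, and I am assuming the node count can be taken from the distinct hosts in $PBS_NODEFILE and that $TMPDIR already points at the per-job scratch directory.

```shell
#!/bin/sh
# Sketch only: choose_workdir and count_job_nodes are hypothetical helper
# names, not part of the lcg-CE job manager or of CREAM.

# Print the directory a job should run in.
#   $1 = number of distinct nodes allocated to the job (defaults to 1)
choose_workdir() {
    n=${1:-1}
    if [ "$n" -gt 1 ]; then
        # Multi-node parallel job: needs a directory shared between nodes.
        printf '%s\n' "$HOME"
    elif [ -n "${TMPDIR:-}" ] && [ -d "$TMPDIR" ]; then
        # Single-node job: use the per-job scratch dir on local disk.
        printf '%s\n' "$TMPDIR"
    else
        # No usable scratch dir: fall back to the home dir.
        printf '%s\n' "$HOME"
    fi
}

# Print the number of distinct hosts listed in $PBS_NODEFILE,
# or 1 when the variable is unset or the file is unreadable.
count_job_nodes() {
    if [ -n "${PBS_NODEFILE:-}" ] && [ -r "$PBS_NODEFILE" ]; then
        sort -u "$PBS_NODEFILE" | wc -l | tr -d ' '
    else
        echo 1
    fi
}

# Intended use near the top of the job script:
#   cd "$(choose_workdir "$(count_job_nodes)")" || exit 1
```

Falling back to the home dir when $TMPDIR is missing keeps the job runnable on nodes where the scratch directory was not set up, at the cost of the NFS load on those nodes.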
JT
On 17 Feb 2010, at 12:48, Arnau Bria wrote:
> On Tue, 16 Feb 2010 19:31:23 +0100
> David Rebatto wrote:
>
> Hi David,
>
> [...]
>
>> I'll make a few tests and come back with more info asap.
>
> If you don't mind, I'll wait for your tests. I'm currently "playing"
> with production hosts.
>
>> If this solves the issue, we can make that template configurable so
>> that you don't need to hack the submission script anymore...
>
> torque does support tmpdir, so the CREAM CE wrapper should not modify
> it or, as Massimo pointed out, should give some kind of global var to
> use it.
>
>>> non-running jobs:
>>>
>>> this exited with 271. cpu_time exceeded.
>>>
>>> [root@td173 cmprd008]# ls home_cream_830522492
>>> CREAM830522492_jobWrapper.sh cream_830522492.proxy
>>> [root@td173 cmprd008]# du -sh home_cream_830522492
>>> 44K home_cream_830522492
>>>
>>
>> Those files should be removed by torque. The qsub man page states:
>> "On completion of the job, all staged-in and staged-out files are
>> removed from the execution system."
>
> Yep, but we sometimes hit some kind of race condition and some dirs
> are not removed. That is not the case here.
>
>
> In this case:
>
> # grep 830522492 /opt/glite/var/log/glite-ce-cream.log*|grep pbs
> [...]
> status=PROCESSING; lrmsAbsJobId=pbs/20100214/8760046.pbs02.pic.es;
> [...]
>
> look what torque tries to remove (from torque node logs):
>
> 02/16/2010 00:30:24;0080; pbs_mom;Job;8760046.pbs02.pic.es;scan_for_terminated: job 8760046.pbs02.pic.es task 1 terminated, sid=24635
> 02/16/2010 00:30:24;0008; pbs_mom;Job;8760046.pbs02.pic.es;job was terminated
> 02/16/2010 00:30:24;0080; pbs_mom;Job;8760046.pbs02.pic.es;obit sent to server
> 02/16/2010 00:30:25;0080; pbs_mom;Job;8760046.pbs02.pic.es;removing transient job directory /home/tmp/8760046.pbs02.pic.es
>
> it says nothing about /home/cmprd008/cream_...
>
> so, if I'm not wrong, the CREAM job creates its own dir in the user's
> home dir; torque creates the job's tmpdir in the scratch area, but the
> job does not use it. torque removes the scratch dir, but not the
> home_cream job dir.
>
> an example from a running job:
>
>
> root 0.0 0.0 9664 ? SLs Feb01 5:32 /usr/sbin/pbs_mom
> cmprd008 0.0 0.0 1444 ? Ss 09:23 0:00 \_ -bash
> cmprd008 0.0 0.0 1156 ? S 09:23 0:00 | \_ /bin/bash /var/spool/pbs/mom_priv/jobs/8803030.pbs.SC
> cmprd008 0.0 0.0 1160 ? S 09:23 0:00 | \_ /bin/sh /opt/lcg/libexec/jobwrapper ./CREAM332079557_jobWrapper.sh UI=000000:NS=0000000004:WM=0000
> cmprd008 0.0 0.0 1428 ? S 09:23 0:00 | \_ /bin/sh -l ./CREAM332079557_jobWrapper.sh UI=000000:NS=0000000004:WM=000005:BH=0000000000:JSS=
> cmprd008 0.0 0.0 2536 ? S 09:24 0:00 | \_ perl -e ?use Socket;??sub send_notify {? $cream_url = "193.109.175.14:9091";die "No cre
> cmprd008 0.0 0.0 1124 ? S 09:24 0:00 | \_ sh -c "./test01-MinimumBias-test_pic_backfill_rereco_v6-submit" test01-MinimumBias-tes
> cmprd008 0.0 0.0 1152 ? S 09:24 0:00 | | \_ /bin/sh ./test01-MinimumBias-test_pic_backfill_rereco_v6-submit test01-MinimumBias
> cmprd008 0.0 0.0 1148 ? S 09:24 0:00 | | \_ /bin/sh ./run.sh /home/cmprd008/home_cream_332079557/CREAM332079557/BulkSpecs/
> cmprd008 0.0 0.0 12060 ? Sl 09:24 0:06 | | \_ python /home/cmprd008/home_cream_332079557/CREAM332079557/test01-MinimumBi
> ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>
> but the torque tmpdir is also created:
>
> [root@td146 ~]# ls -lsah /home/tmp/8803030.pbs02.pic.es/
> total 16K
> 4.0K drwxr-xr-x 2 cmprd008 cmprd 4.0K Feb 17 09:23 .
> 12K drwxrwxr-x 169 root root 12K Feb 17 12:29 ..
>
>
> but not used.
>
> *I guess that torque will try to remove /home/tmp/$torque_jid and
> not /home/user/cream_$CREAMID. Then, cleanup-grid-accounts will do the
> job.
>
>
>>
>> Cheers,
>> David
> Cheers,
> Arnau