Thanks all for your replies.
I somehow forgot about and subsequently missed the EGEE pages (despite
having looked at them years ago), so thanks for pointing them out, Gareth.
This gels nicely with information offlist from Daniela and Simon. I'll
let people know how this pans out.
On 29/11/16 20:10, Tom Whyntie wrote:
>
> Having run loads of CERN@school and MoEDAL jobs with DIRAC, have we just
> been lucky with the supporting sites we've used? i.e. mainly QMUL,
> Liverpool and Glasgow? I think we "run where we land", as you put it,
> Matt...
>
My tuppence worth on this is that the onus to make sure the job runs in
the right place is on the site, not the job submission system or end
user. I suspect that these failings could just be specific to
Lancaster's setup - our NFS server doesn't perform as well as we'd like,
and we didn't realise it sooner simply because ATLAS have a `cd $TMPDIR`
in their pilots. Still, just in case it isn't, I'll be verbose with our
solution (when we figure it out).
Thanks,
Matt
> Cheers, Tom
>
> On Tue, 29 Nov 2016 at 17:03 Daniel Traynor <[log in to unmask]> wrote:
>
> For grid accounts at QM the home directory is a local directory on
> every worker node (e.g. /scratch/lcg/prdsno34); they have no
> directory on the NFS server, while local users do. The LCG epilog
> script just scps the output at the end of the job onto the CE. The
> same solution works for Slurm.
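>
> The idea is roughly this (the paths and CE hostname below are just
> placeholders, not the actual config):
>
>   #!/bin/bash
>   # SGE epilog sketch: runs on the worker node after the payload ends.
>   # Copy whatever the job left in its local scratch home back to the CE.
>   JOB_SCRATCH="/scratch/lcg/${USER}/${JOB_ID}"       # placeholder path
>   CE_DEST="ce.example.ac.uk:/var/spool/job_output"   # placeholder host/path
>   scp -rp "${JOB_SCRATCH}" "${CE_DEST}/${JOB_ID}"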
>
> dan
>
> * Dr Daniel Traynor, Grid cluster system manager
> * Tel +44(0)20 7882 6560, Particle Physics, QMUL
>
> ________________________________________
> From: Testbed Support for GridPP member institutes
> <[log in to unmask]> on behalf of Matt Doidge <[log in to unmask]>
> Sent: 29 November 2016 16:44
> To: [log in to unmask]
> Subject: TMPDIR strategies
>
> Hi all,
> Following on from Daniela's call for more sites to support DUNE, you
> might have wondered why Lancaster was having trouble running DUNE jobs
> (I doubt you did, but you might have).
>
> DUNE jobs - and other DIRAC-submitted jobs - appear to have a wonderful
> ability to kill our cluster through no real fault of their own. This is
> because we use shared sandbox and home areas, mounted from a reasonable
> NFS server - which works fine as long as jobs do their work on the local
> disk. Grid Engine, like a lot of batch systems, sets up a TMPDIR for the
> job on the local disk and points to it with an environment variable; the
> first thing a job should then do is `cd $TMPDIR`.
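>
> In other words, a well-behaved payload starts out something like this
> (just a sketch of the pattern, not any particular VO's wrapper):
>
>   #!/bin/bash
>   # Grid Engine exports TMPDIR as a job-private directory on local disk;
>   # work there rather than in the NFS-mounted home area.
>   cd "${TMPDIR:-/tmp}" || exit 1
>   # ... fetch input, run the payload ...
>   # then copy any results back to the submit directory at the end:
>   cp results.tar.gz "${SGE_O_WORKDIR}/"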
>
> I think you can guess the next bit - DUNE jobs run where they land, in
> the NFS-mounted home area, and they generate just enough IOPS that a few
> hundred of them hose the NFS mounts, grinding grid activity on our
> cluster to a halt. The worst thing is that this will happen with any
> DIRAC-submitted workload of any worthwhile size, so it's something we
> want to fix. We like the smaller VOs.
>
> So, lengthy backstory over: does anyone have strategies in place to
> ensure jobs run where you want them to, be it $TMPDIR or somewhere
> else?
>
> Daniela kindly shared a hack for the jobwrapper template with me -
> putting a `cd $TMPDIR` near the start of the jobwrapper - but for some
> reason that simple fix causes jobs to fail at start-up, when something
> in the job tries to move files to $TMPDIR despite already being there
> (although it doesn't move the actual payload, that would be too
> useful!).
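>
> One completely untested guess at a more defensive version of that hack
> (names are illustrative) - work in a subdirectory of $TMPDIR rather than
> $TMPDIR itself, so a later move of files "into" $TMPDIR still has a
> distinct destination:
>
>   if [ -n "${TMPDIR}" ] && [ -d "${TMPDIR}" ]; then
>       WORKDIR="${TMPDIR}/payload"    # subdir, so TMPDIR stays distinct
>       mkdir -p "${WORKDIR}" && cd "${WORKDIR}"
>   fi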
>
> I have also looked at attempting to do something with the job prolog,
> but I think that would have the same problem. We use the EGEE SGE file
> stager (copyright 2004...) - perhaps if we prefix the copy destination
> with $TMPDIR?
>
> Thanks for reading through that ramble - I'd be very happy to hear
> anyone's ideas or thoughts whilst I dig through all the layers of
> wrappers over the next few days!
>
> Thanks in advance,
> Matt
>