Hi,
Steve Traylen wrote:
> + Comments from DC people as to status of sites.
> + Atlas - Frederic
> + LHCb - Ian (?)
I'm assuming that "Ian" means me. Sorry, I was on vacation last week.
Here is a brief summary: LHCb DC04 currently seems to be running very
smoothly. We have really ramped up in the last few weeks, so that now we
are continuously seeing over 3000 active jobs, over 80% of which run via
LCG. Given the average run time of 15-18 hours, I think that puts us at
over 4000 jobs completed per day. Below is a copy
of the email to LCG-ROLLOUT from Ricardo Graciani who has taken
responsibility for LCG job submission (in case you haven't seen it).
This URL:
http://fpegaes1.usc.es/dmon/DC04/joblist.html
will show you the current active job status. If you have a site which
you think should be running LHCb jobs and isn't listed (or which should
be running *more* LHCb jobs), then please contact Ricardo
([log in to unmask]).
To summarise the main problems:
* Not enough space in the working directory on the WN. LHCb jobs require
1.5-2.5 GB of free space. Setting the working directory variable (the
name of which escapes me at the moment) may be a very good idea. If you
do this, the LCG wrapper will automatically change to this directory
before starting the job and also do the clean-up at the end. Ideally
this will be a large scratch space on a disk local to the WN.
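For illustration, the wrapper behaviour described above amounts to
something like the following sketch. The variable name `SCRATCH_BASE`
is a hypothetical stand-in (the real variable is middleware-specific,
as noted above); `/tmp` is used here only as a fallback so the sketch
runs anywhere:

```shell
#!/bin/sh
# Sketch of what the LCG wrapper does when the scratch variable is set.
# SCRATCH_BASE is a hypothetical name; a site would point it at a large
# disk local to the WN.
SCRATCH_BASE=${SCRATCH_BASE:-/tmp}
WORKDIR=$(mktemp -d "$SCRATCH_BASE/lhcb_job.XXXXXX") || exit 1
cd "$WORKDIR" || exit 1
# ... job payload runs here, with 1.5-2.5 GB free in $WORKDIR ...
cd / && rm -rf "$WORKDIR"   # wrapper cleans up at the end of the job
```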
* Queues that are not long enough. Basically we need 30-hour queues on a
1 GHz Pentium (more or less). Shorter queues are fine on faster processors.
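As a rough rule of thumb, the required queue length scales inversely
with CPU speed from the 30-hour / 1 GHz reference point quoted above
(a simplification; strictly the scaling should use the published SI00
benchmark rather than raw clock speed):

```shell
# Rough wall-clock queue requirement, scaled from 30 h on a 1 GHz CPU.
required_hours() {
    # $1 = CPU clock in MHz (integer arithmetic, so result is rounded down)
    echo $(( 30 * 1000 / $1 ))
}
required_hours 1000   # 30 hours on the reference 1 GHz Pentium
required_hours 2400   # 12 hours on a 2.4 GHz CPU
```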
* Wrong SI00 setting (which means jobs don't get matched to the site).
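To check what SI00 value your CE actually publishes, you can query the
information system. This is a sketch assuming a standard Glue-schema
BDII on the usual port 2170; `<bdii-host>` is a placeholder for your
own information index:

```shell
# Query the published SI00 benchmark for CEs visible in the BDII.
ldapsearch -x -h <bdii-host> -p 2170 -b "mds-vo-name=local,o=grid" \
    '(objectClass=GlueHostBenchmark)' GlueHostBenchmarkSI00
```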
Please get in touch with me or Ricardo if you have any more questions
(and take a look at his email below).
Cheers,
Ian.
--
Ian Stokes-Rees [log in to unmask]
Particle Physics, Oxford http://www-pnp.physics.ox.ac.uk/~stokes
From: Ricardo Graciani <[log in to unmask]>
Dear Site Manager,
I would like to congratulate all of you on the good work you are
doing, which has made possible the current success of LHCb DC04.
Apart from occasional problems at some sites, we have been
successfully running about 2000 concurrent jobs or more (almost reaching
3000 over this weekend). Close to 75% of our production is at this
moment taking place on LCG, versus classic DIRAC sites.
The CEs currently included in our production chain are (as seen from our RB):
Running
lxn1184.cern.ch: 608
lcgce02.ifae.es: 100
gw39.hep.ph.ic.ac.uk: 59
grid008.to.infn.it: 17 (died last week)
lcgce01.triumf.ca: 2
t2-ce-01.lnl.infn.it: 80
wn-04-07-02-a.cr.cnaf.infn.it: 330
gridkap01.fzk.de: 519
heplnx131.pp.rl.ac.uk: 44
lcgce02.gridpp.rl.ac.uk: 457 (died yesterday)
ce.gridpp.shef.ac.uk: 36
epcf36.ph.bham.ac.uk: 23
lunegw.lancs.ac.uk: 25
tbn18.nikhef.nl: 220
ce-a.ccc.ucl.ac.uk: 153 (died last week)
grid01.phy.ncu.edu.tw: 12
lcg-ce.lps.umontreal.ca: 2
lcg-ce.usc.cesga.es: 10
lcg-ce.ecm.ub.es: 8
bohr0001.tier2.hep.man.ac.uk: 44
grid109.kfki.hu: 78
cclcgceli01.in2p3.fr: 6
t2ce01.physics.ox.ac.uk: 32
farm012.hep.phy.cam.ac.uk: 5
lcg02.physics.carleton.ca: 22
mu6.matrix.sara.nl: 30
All: 2923 (of which 17 + 457 + 153 are at the CEs that died)
We are still trying to add new sites although most remaining sites have
some kind of configuration problem preventing us from doing so.
If your site is not in the above list and you would like it to be
included, please let me know. I'm looking forward to hearing from you.
Common problems with new sites are misconfiguration of the batch
system (PBS), too-short queues (or a wrong published SI00), and firewalls.
Again, thanks to all of you for your invaluable help.