I think David Bouvet is right about this: the jobs are waiting to transfer
files to the CERN CASTOR GridFTP server, which I believe crashed yesterday.
They sit in a sleep loop that checks every 10-30 minutes whether the
data transfer at the end of the job can be done. If not, they sleep again.
They do this for about 12-18 hours and then give up.
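As a rough illustration only (this is not the actual LHCb job code), the
sleep loop described above could be sketched as follows. The function name,
the try_transfer callback, and the default timings (18 hours total, 20
minutes between checks) are assumptions chosen to match the ranges quoted
above.

```python
import time


def wait_for_transfer(try_transfer, max_wait_s=18 * 3600, interval_s=20 * 60,
                      sleep=time.sleep, clock=time.monotonic):
    """Call try_transfer() until it succeeds or max_wait_s has elapsed.

    try_transfer is a hypothetical callable that attempts the upload and
    returns True on success. Returns True if the transfer eventually
    succeeded, False if we gave up.
    """
    deadline = clock() + max_wait_s
    while True:
        if try_transfer():
            return True
        # Give up if the next check would fall past the deadline.
        if clock() + interval_s > deadline:
            return False
        sleep(interval_s)
```

The sleep and clock arguments are only there so the loop can be exercised
without really waiting; a real job would use the defaults.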
We would like there to be a "better way", but unfortunately we don't
trust any other "queued data transfer" mechanism in LCG yet, so we build
it into our jobs. Of course, the reason they do this is that if the
data transfer fails then "everything" gets messed up -- the 12-18 hours
already used to create the data are wasted, and we have to catch this
failure and mark the job as "failed due to LCG", meaning it can be
resubmitted, as opposed to "failed due to LHCb", which means someone
needs to look at it to figure out what went wrong.
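To make the two failure labels concrete, here is a minimal sketch of how a
job wrapper might assign them. The function names and the two boolean
inputs are hypothetical, and treating every processing failure as "failed
due to LHCb" is an assumption made for the example, not a description of
the real bookkeeping.

```python
FAILED_LCG = "failed due to LCG"    # grid-side problem, safe to resubmit
FAILED_LHCB = "failed due to LHCb"  # someone needs to look at it
DONE = "done"


def classify_job(processing_ok, transfer_ok):
    """Return a status label for a finished job."""
    if not processing_ok:
        # Assumption: a failure while producing the data is an LHCb problem.
        return FAILED_LHCB
    if not transfer_ok:
        # Data was produced but could not be shipped out.
        return FAILED_LCG
    return DONE


def can_resubmit(status):
    """Only grid-side failures are resubmitted without human inspection."""
    return status == FAILED_LCG
```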
Joel Closier and Andrei Tsaregorodtsev are also good people to contact
about this if you see it happen at your site.
Cheers,
Ian.
Dimitris Zilaskos wrote:
> Hello ,
>
> I have a number of lhcb jobs sitting in my queue. They have been
> sitting in that exact state for more than 12 hours (the Time Use counter
> is not increasing and the process that was eating cpu appears to have
> completed its task). They appear to be waiting for something (user
> intervention?).
> There were some similar jobs 3-4 days ago that exhibited the same
> behaviour, but after around another 12 hours the jobs exited
> successfully. I have mailed Ricardo Graciani, who appears to have
> submitted those jobs, but I got no response. If someone knows what is
> going on, please let me know, because our queues have been filled for
> days and no other jobs can run (i.e. the job submission tests).
>
> Job id           Name             User             Time Use S Queue
> ---------------- ---------------- ---------------- -------- - -----
> 8.node001        STDIN            lhcb001          27:01:05 R infinite
> 9.node001        STDIN            lhcb001          27:37:44 R infinite
> 10.node001       STDIN            lhcb001          27:08:25 R infinite
> 11.node001       STDIN            lhcb001          27:33:07 R infinite
> 12.node001       STDIN            lhcb001          25:59:44 R infinite
> 13.node001       STDIN            lhcb001          26:29:33 R infinite
> 14.node001       STDIN            lhcb001          27:52:40 R infinite
> 16.node001       STDIN            lhcb001          27:13:36 R infinite
> 17.node001       STDIN            lhcb001                 0 Q infinite
> 18.node001       STDIN            lhcb001                 0 Q infinite
> 19.node001       STDIN            lhcb001                 0 Q infinite
> 20.node001       STDIN            lhcb001                 0 Q infinite
> 21.node001       STDIN            lhcb001                 0 Q infinite
> 23.node001       STDIN            dteam004                0 Q short
> (...)
>
>
> Best regards ,
> --
> =============================================================================
>
>
> Dimitris Zilaskos
>
> Department of Physics @ Aristotle University of Thessaloniki, Greece
> PGP key : http://tassadar.physics.auth.gr/~dzila/pgp_public_key.asc
> http://egnatia.ee.auth.gr/~dzila/pgp_public_key.asc
> MD5sum : de2bd8f73d545f0e4caf3096894ad83f pgp_public_key.asc
> =============================================================================
>
--
Ian Stokes-Rees
Particle Physics, Oxford http://grid.physics.ox.ac.uk/~stokes