Ian Stokes-Rees wrote:
> I think David Bouvet is right about this -- they are waiting to transfer
> files to CERN CASTOR GridFTP server, which I believe crashed yesterday.
> They sit in a sleep loop which checks every 10-30 minutes to see if the
> data transfer at the end of the job can be done. If not, they sleep.
> They do this for about 12-18 hours then give up.
Why put everything in Castor at CERN? Why not fail over to other (Castor)
MSSs that are available? That would be more in line with a grid...
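
A minimal sketch of such a fail-over, assuming a list of storage endpoints tried in order on every pass. This is not the actual LHCb job wrapper: the function name, the endpoint labels and the `try_upload` callback are all hypothetical, and the 20-minute interval and 12-hour limit are simply picked from the ranges quoted above.

```python
import time
from typing import Callable, Optional, Sequence


def upload_with_failover(
    endpoints: Sequence[str],
    try_upload: Callable[[str], bool],   # hypothetical: returns True if the copy succeeded
    retry_interval_s: float = 20 * 60,   # assumed value within the quoted 10-30 minutes
    give_up_after_s: float = 12 * 3600,  # assumed value within the quoted 12-18 hours
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> Optional[str]:
    """Return the endpoint that accepted the output, or None on timeout."""
    deadline = clock() + give_up_after_s
    while True:
        # Try every storage endpoint in order before sleeping at all.
        for endpoint in endpoints:
            if try_upload(endpoint):
                return endpoint
        # Stop once another sleep would take us past the deadline.
        if clock() + retry_interval_s > deadline:
            return None
        sleep(retry_interval_s)
```

With more than one endpoint in the list, a job only sleeps when every storage system is unavailable, rather than whenever the first one is.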
> We would like there to be a "better way", but unfortunately we don't
> trust any other "queued data transfer" mechanism in LCG yet, so we build
> it into our jobs. Of course, the reason they do this is that if the
> data transfer fails then "everything" gets messed up -- the 12-18 hours
> already used to create the data is wasted, and we have to catch this
> failure and mark the job as "failed due to LCG" meaning it can be
> resubmitted, as opposed to "failed due to LHCb", which means someone
> needs to look at it to figure out what went wrong.
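
The two failure labels described above could be modelled roughly as follows. This is only a sketch: the enum and function names are hypothetical, and it assumes that any failure of the application itself counts as "failed due to LHCb", which the message does not state explicitly.

```python
from enum import Enum


class JobOutcome(Enum):
    DONE = "done"
    FAILED_LCG = "failed due to LCG"    # infrastructure fault: can be resubmitted
    FAILED_LHCB = "failed due to LHCb"  # someone needs to look at what went wrong


def classify(application_ok: bool, transfer_ok: bool) -> JobOutcome:
    """Map the two stages of a job onto one of the outcome labels."""
    if not application_ok:
        # Assumption: an application failure is attributed to LHCb.
        return JobOutcome.FAILED_LHCB
    if not transfer_ok:
        # The data was produced but could not be stored.
        return JobOutcome.FAILED_LCG
    return JobOutcome.DONE


def should_resubmit(outcome: JobOutcome) -> bool:
    """Only infrastructure failures are resubmitted automatically."""
    return outcome is JobOutcome.FAILED_LCG
```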
>
> Joel Closier and Andrei Tsaregorodtsev are also good people to contact
> about this if you see it happen at your site.
>
> Cheers,
>
> Ian.
>
> Dimitris Zilaskos wrote:
>
>> Hello ,
>>
>> I have a number of LHCb jobs sitting in my queue. They have been
>> sitting in that exact stage for more than 12 hours (the Time Use counter
>> is not increasing and the process that was eating CPU appears to have
>> completed its task). They appear to be waiting for something (user
>> intervention?).
>> There were some similar jobs 3-4 days ago that exhibited the same
>> behaviour, but after around another 12 hours the jobs exited
>> successfully. I have mailed Ricardo Graciani, who appears to have
>> submitted those jobs, but I got no response. Does anyone know what is
>> going on? Our queues have been full for days and no other
>> jobs can run (i.e. the job submission tests).
>>
>> Job id           Name             User             Time Use S Queue
>> ---------------- ---------------- ---------------- -------- - -----
>> 8.node001        STDIN            lhcb001          27:01:05 R infinite
>> 9.node001        STDIN            lhcb001          27:37:44 R infinite
>> 10.node001       STDIN            lhcb001          27:08:25 R infinite
>> 11.node001       STDIN            lhcb001          27:33:07 R infinite
>> 12.node001       STDIN            lhcb001          25:59:44 R infinite
>> 13.node001       STDIN            lhcb001          26:29:33 R infinite
>> 14.node001       STDIN            lhcb001          27:52:40 R infinite
>> 16.node001       STDIN            lhcb001          27:13:36 R infinite
>> 17.node001       STDIN            lhcb001                 0 Q infinite
>> 18.node001       STDIN            lhcb001                 0 Q infinite
>> 19.node001       STDIN            lhcb001                 0 Q infinite
>> 20.node001       STDIN            lhcb001                 0 Q infinite
>> 21.node001       STDIN            lhcb001                 0 Q infinite
>> 23.node001       STDIN            dteam004                0 Q short
>> (...)
>>
>>
>> Best regards ,
>> --
>> =============================================================================
>>
>>
>>
>> Dimitris Zilaskos
>>
>> Department of Physics @ Aristotle University of Thessaloniki, Greece
>> PGP key : http://tassadar.physics.auth.gr/~dzila/pgp_public_key.asc
>> http://egnatia.ee.auth.gr/~dzila/pgp_public_key.asc
>> MD5sum : de2bd8f73d545f0e4caf3096894ad83f pgp_public_key.asc
>> =============================================================================
>>
>>
>
> --
> Ian Stokes-Rees
> Particle Physics, Oxford http://grid.physics.ox.ac.uk/~stokes