Hi Marteen, David and all,
At IN2P3-CC we still have the problem of lost jobs through our CE
cclcgceli02 since the LCG2.3.0 upgrade. We can lose up to 40% of jobs.
This weekend, I observed that several lost jobs have the same grid job
identifier. So, could the jobs have disappeared because of premature
resubmission?
See below some log extracts I obtained (log format: Grid Job Id,
Creation date, BQS Job Id, Final state):
> https://boswachter.nikhef.nl:9000/bdouHDysnymSOJIGKNZrWA Feb-16-21:25
> lcg0216212519-27815 OK
> https://boswachter.nikhef.nl:9000/bdouHDysnymSOJIGKNZrWA Feb-16-22:44
> lcg0216224454-10328 LOST
> https://mu3.matrix.sara.nl:9000/JCwFWsKGkeMNkUeUWp_lgQ Feb-16-21:22
> lcg0216212219-27352 OK
> https://mu3.matrix.sara.nl:9000/JCwFWsKGkeMNkUeUWp_lgQ Feb-16-22:43
> lcg0216224255-09781 LOST
> https://rb1.egee.fr.cgg.com:9000/iZnGzGCxbXlMqb2ECYrsjA Feb-18-11:45
> lcg0218114535-11241 LOST
> https://rb1.egee.fr.cgg.com:9000/iZnGzGCxbXlMqb2ECYrsjA Feb-18-11:47
> lcg0218114721-11619 LOST
> https://rb1.egee.fr.cgg.com:9000/iZnGzGCxbXlMqb2ECYrsjA Feb-18-11:51
> lcg0218115030-12221 LOST
> https://rb1.egee.fr.cgg.com:9000/iZnGzGCxbXlMqb2ECYrsjA Feb-18-11:54
> lcg0218115430-12903 LOST
> https://rb1.egee.fr.cgg.com:9000/iZnGzGCxbXlMqb2ECYrsjA Feb-18-11:56
> lcg0218115630-13307 LOST
> https://rb1.egee.fr.cgg.com:9000/iZnGzGCxbXlMqb2ECYrsjA Feb-18-12:01
> lcg0218115931-13776 LOST
> https://rb1.egee.fr.cgg.com:9000/iZnGzGCxbXlMqb2ECYrsjA Feb-18-12:03
> lcg0218120317-14473 LOST
> https://rb1.egee.fr.cgg.com:9000/iZnGzGCxbXlMqb2ECYrsjA Feb-18-12:06
> lcg0218120626-15070 LOST
> https://rb1.egee.fr.cgg.com:9000/iZnGzGCxbXlMqb2ECYrsjA Feb-18-12:09
> lcg0218120931-15613 LOST
> https://rb1.egee.fr.cgg.com:9000/iZnGzGCxbXlMqb2ECYrsjA Feb-18-12:12
> lcg0218121224-16184 LOST
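For anyone who wants to repeat this check on their own logs, here is a minimal sketch of how I group the records by grid job identifier. It assumes the four-field format described above, with records possibly wrapped over two lines and prefixed by ">" quote markers; the function names are mine and purely illustrative.

```python
from collections import defaultdict


def parse_records(text):
    """Yield (grid_job_id, created, bqs_job_id, state) tuples.

    Assumes four whitespace-separated fields per record, possibly wrapped
    over several lines and prefixed with '>' quote markers.
    """
    tokens = []
    for line in text.splitlines():
        tokens.extend(line.lstrip("> ").split())
    if len(tokens) % 4 != 0:
        raise ValueError("expected a multiple of four fields")
    for i in range(0, len(tokens), 4):
        yield tuple(tokens[i:i + 4])


def resubmitted_jobs(records):
    """Return only the grid job ids that appear more than once."""
    by_id = defaultdict(list)
    for grid_id, created, bqs_id, state in records:
        by_id[grid_id].append((created, bqs_id, state))
    return {gid: runs for gid, runs in by_id.items() if len(runs) > 1}


SAMPLE = """\
> https://boswachter.nikhef.nl:9000/bdouHDysnymSOJIGKNZrWA Feb-16-21:25
> lcg0216212519-27815 OK
> https://boswachter.nikhef.nl:9000/bdouHDysnymSOJIGKNZrWA Feb-16-22:44
> lcg0216224454-10328 LOST
"""

for grid_id, runs in resubmitted_jobs(parse_records(SAMPLE)).items():
    # One line per grid job id, followed by the final state of each submission.
    print(grid_id, [state for _, _, state in runs])
```

Any grid job identifier listed by this script was submitted to BQS more than once.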
As the Globus submission mechanism is still opaque to me, I need some
help to understand this behavior.
Could job submission follow this scenario?
1) The job is submitted to the CE by the RB.
2) The job is submitted to the batch system (BQS) via the CE.
3) The RB decides that the job should be resubmitted. (Why? Does the RB
   think that the job has failed? Is the dialog between CE and RB
   broken? ...)
   a) The RB launches the job data cleanup on the CE. (This could
      explain why my gram_job_state files are oddly disappearing.)
   b) The CE stops handling this job, but it is still submitted to
      BQS and will be run uselessly.
      (Tip of the day: it could be useful to cancel the job on the
      CE in this case...)
   c) The RB submits the job again.
4) Go to 1) unless the max retry count is reached.
In that case, the problem would be in step 3. So, why would the RB
decide to resubmit a job?
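To make the hypothesis concrete, here is a toy model of the loop I have in mind. It is only a sketch of my guess, not of what the RB or Globus really does: the function, its parameters and the "bqs-job-N" identifiers are all hypothetical, and the callback stands in for the unknown reason behind step 3.

```python
def simulate_resubmission(max_submissions, rb_gives_up_on):
    """Toy model of the hypothesised scenario (all names are hypothetical).

    rb_gives_up_on(attempt) -> bool stands in for the unknown reason the
    RB decides to resubmit (step 3).
    Returns one (bqs_job_id, still_tracked_by_ce) pair per submission.
    """
    submissions = []
    for attempt in range(1, max_submissions + 1):
        bqs_job_id = f"bqs-job-{attempt}"      # steps 1-2: CE hands the job to BQS
        if rb_gives_up_on(attempt):            # step 3: RB decides to resubmit
            # steps 3a-3b: CE-side state is cleaned up, the BQS job is orphaned
            submissions.append((bqs_job_id, False))
            continue                           # steps 3c and 4: submit again
        submissions.append((bqs_job_id, True))
        break
    return submissions


# The RB gives up on the first two attempts and keeps the third one.
print(simulate_resubmission(3, lambda attempt: attempt < 3))
```

In this model every submission the RB gives up on leaves a BQS job that the CE no longer tracks, which would give several BQS job identifiers for a single grid job identifier, as in the log extracts.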
Any ideas?
Thanks in advance for any help!
Pierre
--
______________________
Pierre GIRARD
Grid Computing Team Member
IN2P3/CNRS Computing Centre - Lyon (FRANCE)
http://cc.in2p3.fr
Tel. +33 4.78.93.08.80 | Fax. +33 4.72.69.41.70