On Mon, 21 Feb 2005, pierre girard wrote:
> Hi Marteen, David and all,
>
> At IN2P3-CC we still have the problem of lost jobs through our CE
> cclcgceli02 since the LCG2.3.0 upgrade. We can lose up to 40% of jobs.
>
> This weekend, I observed that several lost jobs have the same grid job
> identifier. So, could the jobs have disappeared because of premature
> resubmission?
>
> See below some pieces of logs I obtained (Log format : Grid Job Id,
> Creation date, BQS Job Id, Final state)
>
> > https://boswachter.nikhef.nl:9000/bdouHDysnymSOJIGKNZrWA Feb-16-21:25
> > lcg0216212519-27815 OK
> > https://boswachter.nikhef.nl:9000/bdouHDysnymSOJIGKNZrWA Feb-16-22:44
> > lcg0216224454-10328 LOST
>
> > https://mu3.matrix.sara.nl:9000/JCwFWsKGkeMNkUeUWp_lgQ Feb-16-21:22
> > lcg0216212219-27352 OK
> > https://mu3.matrix.sara.nl:9000/JCwFWsKGkeMNkUeUWp_lgQ Feb-16-22:43
> > lcg0216224255-09781 LOST
>
> > https://rb1.egee.fr.cgg.com:9000/iZnGzGCxbXlMqb2ECYrsjA Feb-18-11:45
> > lcg0218114535-11241 LOST
> > https://rb1.egee.fr.cgg.com:9000/iZnGzGCxbXlMqb2ECYrsjA Feb-18-11:47
> > lcg0218114721-11619 LOST
> > https://rb1.egee.fr.cgg.com:9000/iZnGzGCxbXlMqb2ECYrsjA Feb-18-11:51
> > lcg0218115030-12221 LOST
> > https://rb1.egee.fr.cgg.com:9000/iZnGzGCxbXlMqb2ECYrsjA Feb-18-11:54
> > lcg0218115430-12903 LOST
> > https://rb1.egee.fr.cgg.com:9000/iZnGzGCxbXlMqb2ECYrsjA Feb-18-11:56
> > lcg0218115630-13307 LOST
> > https://rb1.egee.fr.cgg.com:9000/iZnGzGCxbXlMqb2ECYrsjA Feb-18-12:01
> > lcg0218115931-13776 LOST
> > https://rb1.egee.fr.cgg.com:9000/iZnGzGCxbXlMqb2ECYrsjA Feb-18-12:03
> > lcg0218120317-14473 LOST
> > https://rb1.egee.fr.cgg.com:9000/iZnGzGCxbXlMqb2ECYrsjA Feb-18-12:06
> > lcg0218120626-15070 LOST
> > https://rb1.egee.fr.cgg.com:9000/iZnGzGCxbXlMqb2ECYrsjA Feb-18-12:09
> > lcg0218120931-15613 LOST
> > https://rb1.egee.fr.cgg.com:9000/iZnGzGCxbXlMqb2ECYrsjA Feb-18-12:12
> > lcg0218121224-16184 LOST
>
> As the Globus submission mechanism is still opaque to me, I need some
> help to understand this behavior.
>
> Could the job submission follow this scenario:
> 1) Job is submitted to the CE by the RB
> 2) Job is submitted to the batch system BQS by way of the CE
> 3) RB decides that Job should be resubmitted (Why? Does the RB think
> that the job has failed? Is the dialog between CE and RB broken? ...)
Possibly, e.g. due to firewall settings. We need the output of
edg-job-get-logging-info -v 1
for each of the distinct job IDs to see what happened according to the RB.
Only the owner of the job or the admin of the RB can do that.
> a) RB launches the Job data cleanup on the CE. (It could explain
> why my gram_job_state files are oddly disappearing)
I suspect the failed jobs never really started, so the cleanup scenario
is not the same as for a job that did start: the job manager can do the
cleanup itself here.
> b) the CE stops dealing with this job, but it is still submitted to
> BQS and will be run to no purpose...
> (Tip of the day: it could be useful to cancel the job on the
> CE in this case...)
> c) RB submits the job again
> 4) Go to 1) unless max retry count is reached
>
> In that case, the problem would be in step 3. So, why would the RB
> decide to resubmit a job?
> Any ideas?
>
> Thanks in advance for any help !
>
> Pierre
>
>
>
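To collect the distinct job IDs from a log in the format quoted above, something
like the following rough Python sketch might help. It assumes each record has
been joined onto one line with four whitespace-separated fields (Grid Job Id,
creation date, BQS Job Id, final state). The function names are made up for
illustration, and the script only prints the command lines; it does not run them.

```python
from collections import defaultdict

# A few records copied from the log excerpt above, one record per line:
# Grid Job Id, Creation date, BQS Job Id, Final state
LOG = """\
https://boswachter.nikhef.nl:9000/bdouHDysnymSOJIGKNZrWA Feb-16-21:25 lcg0216212519-27815 OK
https://boswachter.nikhef.nl:9000/bdouHDysnymSOJIGKNZrWA Feb-16-22:44 lcg0216224454-10328 LOST
https://mu3.matrix.sara.nl:9000/JCwFWsKGkeMNkUeUWp_lgQ Feb-16-21:22 lcg0216212219-27352 OK
https://mu3.matrix.sara.nl:9000/JCwFWsKGkeMNkUeUWp_lgQ Feb-16-22:43 lcg0216224255-09781 LOST
"""


def group_by_grid_id(text):
    """Map each grid job ID to its (creation date, BQS job ID, state) records."""
    groups = defaultdict(list)
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 4:
            continue  # skip blank or malformed lines
        grid_id, created, bqs_id, state = fields
        groups[grid_id].append((created, bqs_id, state))
    return dict(groups)


def logging_info_commands(groups):
    """Build one edg-job-get-logging-info command line per distinct grid job ID."""
    return [f"edg-job-get-logging-info -v 1 {grid_id}" for grid_id in groups]


groups = group_by_grid_id(LOG)

# Grid job IDs that reached BQS more than once are the suspected resubmissions.
resubmitted = {gid: recs for gid, recs in groups.items() if len(recs) > 1}

for cmd in logging_info_commands(groups):
    print(cmd)
```

The printed commands still have to be run by the job owner or the RB admin.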