Hi Maaten,
Thanks for your very quick answer. Are you also an early bird? ;)
[log in to unmask] wrote:
>On Mon, 21 Feb 2005, pierre girard wrote:
>
>>Could the job submission follow this scenario:
>>1) Job is submitted to the CE by the RB
>>2) Job is submitted to the batch system BQS via the CE
>>3) RB decides that the job should be resubmitted (Why? Does the RB think
>>that the job has failed? Is the dialog between CE and RB broken? ...)
>
>
>
>Possibly, e.g. due to firewall settings.
>
On our CE, we set:
GLOBUS_TCP_PORT_RANGE=30000 31000
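
To rule out the firewall, one could probe that range from another host (for example the test RB). Here is a rough Python sketch of the idea. The helper names are made up for illustration; the only thing taken from our setup is the "min max" format of the value. A "refused" result only means nothing is listening on that port at the moment, so it is mostly the "filtered" results that would hint at a firewall problem.

```python
import socket

def parse_port_range(value):
    """Parse a 'min max' (or 'min,max') port range string into two ints."""
    parts = value.replace(",", " ").split()
    if len(parts) != 2:
        raise ValueError("expected two port numbers, got %r" % value)
    low, high = int(parts[0]), int(parts[1])
    if not (0 < low <= high <= 65535):
        raise ValueError("invalid port range %r" % value)
    return low, high

def probe_port(host, port, timeout=2.0):
    """Classify one TCP connection attempt as 'open', 'refused' or 'filtered'."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except ConnectionRefusedError:
        return "refused"   # host reachable, nothing listening on this port
    except OSError:
        return "filtered"  # timeout or other network error; possibly a firewall
```

For instance, `parse_port_range("30000 31000")` gives the two bounds, and `probe_port("ce.example.org", 30000)` (a placeholder host name) would test a single port.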
>We need the output of
>
> edg-job-get-logging-info -v 1
>
>for each of the distinct job IDs to see what happened according to the RB.
>Only the owner of the job or the admin of the RB can do that.
>
>
I'll ask the users for this and get back to you as soon as I have
these logs ;)
We have a test RB, but so far I have not been able to reproduce the
problem.
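
Once the users send me their job IDs, collecting the output for each distinct one could be scripted roughly as below. This is only a sketch: it assumes the job ID is passed as the last argument of the command you quoted, that the EDG UI tools are installed, and that a valid proxy of the job owner is in place. The function names are mine.

```python
import subprocess

def logging_info_commands(job_ids):
    """Build one 'edg-job-get-logging-info -v 1 <jobId>' command per distinct job ID."""
    seen = set()
    commands = []
    for job_id in job_ids:
        job_id = job_id.strip()
        if not job_id or job_id in seen:
            continue  # skip blank lines and duplicates
        seen.add(job_id)
        # Assumption: the job ID is given as the final positional argument.
        commands.append(["edg-job-get-logging-info", "-v", "1", job_id])
    return commands

def collect_logging_info(job_ids):
    """Run each command and return a dict {job_id: combined stdout/stderr}."""
    results = {}
    for cmd in logging_info_commands(job_ids):
        proc = subprocess.run(cmd, stdout=subprocess.PIPE,
                              stderr=subprocess.STDOUT,
                              universal_newlines=True)
        results[cmd[-1]] = proc.stdout
    return results
```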
>> a) RB launches the job data cleanup on the CE. (That could explain
>>why my gram_job_state files are oddly disappearing)
>>
>>
>
>I suspect the failed jobs never really started, so the cleanup scenario
>is not the same as for a job that did start: the job manager can do the
>cleanup itself here.
>
>
The RB may consider the jobs failed, but all of them are scheduled and
then run on our WNs.
We use a BQS-specific Perl module for the jobmanager, which implements
the submit, poll, cancel, and other functions.
We keep logs for each function call, and this is what we notice:
- The jobmanager submits the job to BQS.
- The jobmanager polls its jobs normally until the job-related
gram_job_state files disappear (in the case of lost jobs).
- The jobs eventually run on a WN, but the CE no longer handles them at all.
- Sometimes the user reports a problem with his/her failed jobs
while these jobs are still queued or running on our WNs.
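
The pattern above (job still known to BQS, gram_job_state file gone) could be spotted with a small cross-check like the following Python sketch. It assumes we can extract from our own jobmanager logs both the list of batch job IDs that are still queued or running and the state file path expected for each of them. Both inputs and the function name are hypothetical, not part of the jobmanager.

```python
import os

def find_orphaned_jobs(state_file_by_job, active_batch_jobs):
    """Return batch job IDs still queued/running whose state file is missing.

    state_file_by_job: dict mapping batch job ID -> expected state file path
    active_batch_jobs: iterable of batch job IDs currently queued or running
    """
    orphaned = []
    for job_id in sorted(active_batch_jobs):
        path = state_file_by_job.get(job_id)
        # No known path, or the file has disappeared: the CE has lost this job.
        if path is None or not os.path.exists(path):
            orphaned.append(job_id)
    return orphaned
```

The jobs it returns would be the candidates for cancelling on the batch side, since the CE no longer follows them.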
>
>
>> b) the CE stops dealing with this job, but it is still submitted to
>>BQS and will run uselessly...
>> (Tip of the day: it could be useful to cancel the job on the
>>CE in this case...)
>> c) RB submits the job again
>>4) Go to 1) unless max retry count is reached
>>
>>In that case, the problem would be in step 3. So, why would the RB
>>decide to resubmit a job?
>>Any ideas?
>>
>>Thanks in advance for any help !
>>
>>Pierre
>>
>>
>>
>>
>>
>
>
>
--
______________________
Pierre GIRARD
Grid Computing Team Member
IN2P3/CNRS Computing Centre - Lyon (FRANCE)
http://cc.in2p3.fr
Tel. +33 4.78.93.08.80 | Fax. +33 4.72.69.41.70 | e-mail: [log in to unmask]