Hello Pierre,
do you have some RB job IDs for jobs that failed?
What does this command show for the causes:
edg-job-get-logging-info -v 1 $JOB_ID
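If there are several failed jobs, a minimal sketch like the one below could run that same command over a list of job IDs. The file name failed_jobs.txt, the function name show_failures and the LOGGING_CMD override are assumptions made for illustration; they are not part of the LCG tools.

```shell
#!/bin/sh
# Hypothetical helper: print the logging info for each RB job ID in a file.
# LOGGING_CMD defaults to the command quoted above; it can be overridden
# (e.g. with "echo") for a dry run.
LOGGING_CMD=${LOGGING_CMD:-edg-job-get-logging-info}

show_failures() {
    # $1: file with one RB job ID per line (the file name is an assumption)
    while IFS= read -r JOB_ID; do
        # skip blank lines
        [ -z "$JOB_ID" ] && continue
        echo "=== $JOB_ID ==="
        # redirect stdin so the command cannot consume the rest of the list
        "$LOGGING_CMD" -v 1 "$JOB_ID" < /dev/null
    done < "$1"
}
```

It could then be called as: show_failures failed_jobs.txt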
Unfortunately there can be many causes for jobs to fail in subtle ways,
to a large extent due to the Globus job submission model.
We expect the job submission to become a lot less intricate when we
have debugged the gLite RB that uses Condor-C (sic) instead of Globus
to submit the job to the CE.
-----Original Message-----
From: LHC Computer Grid - Rollout on behalf of pierre girard
Sent: Sun 2/13/2005 10:50 AM
To: [log in to unmask]
Cc:
Subject: Re: [LCG-ROLLOUT] Question about site_globus_tcp_range
Maarten,
Many thanks for those explanations and your documentation on this topic.
However, I still have a question about the cleanup step, because we
currently have a problem with gram_job_state files oddly disappearing on
our CE (IN2P3-CC site).
We noticed that several submitted jobs are no longer known to our
jobmanager. According to the jobmanager log file, the jobmanager stopped
handling these jobs at about 03:00 this morning, and the same thing
happened yesterday with other jobs. So I suppose that this is when their
gram_job_state files disappeared.
However, these jobs are still known to our batch system, as either running
or queued jobs. The worst part is that the running jobs hang on to an
(RB?) connection indefinitely.
So my question is:
Have you seen this strange phenomenon before? Is it possible that an RB
could launch the cleanup step too soon on the CE?
Thanks in advance for any possible explanation,
Cheers,
Pierre
Maarten Litmaath, CERN wrote:
>On Fri, 11 Feb 2005, owen maroney wrote:
>
>>This is really useful: is there a page or three on this in the
>>troubleshooting wiki?
>
>Hi Owen,
>I have added entries to the Job Submission category:
>
> http://goc.grid.sinica.edu.tw/gocwiki/SiteProblemsFollowUpFaq
>
>In particular:
>
>http://goc.grid.sinica.edu.tw/gocwiki/Dialog_between_RB_and_CE
>
>http://goc.grid.sinica.edu.tw/gocwiki/Globus_error_79%3a_connecting_to_the_job_manager_failed%2e
>
>Cheers,
> Maarten
>
--
______________________
Pierre GIRARD
Grid Computing Team Member
IN2P3/CNRS Computing Centre - Lyon (FRANCE)
http://cc.in2p3.fr
Tel. +33 4.78.93.08.80 | Fax. +33 4.72.69.41.70 | e-mail: [log in to unmask]