Hello Maarten,
Maarten Litmaath wrote:
>Hello Pierre,
>do you have some RB job IDs for jobs that failed?
>What does this command show for the causes:
>
> edg-job-get-logging-info -v 1 $JOB_ID
>
>
Unfortunately, I cannot use this command because I was not the user of
these jobs. But Ricardo, who was the user of some of them, sent me this
explanation from the RB logs:
> Thanks for reporting the problem. The error I get from the RB
> logging system for these jobs is:
>
>Cannot read JobWrapper output, both from Condor and from Maradona.
>
From my point of view, there's nothing strange here, since the jobs were
still running on the WNs or still queued. But the question is: why were
the jobs considered done by the RB?
As I reread your famous "Dialog between RB and CE" wiki page ;), I
noticed that the grid_monitor running on the CE is supposed to inform
the RB of the job state... So I took a look at the grid_monitor
processes running on my problematic CE:
> 0 S dteam004 27473 27472 0 75 0 - 1721 schedu 08:16 ?
> 00:00:01 perl /tmp/grid_manager_monitor_agent.dteam004.27472.1000
> --delete-self --maxtime=3600s
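For reference, the listing above is plain ps output, obtained with
something like:

 ps -efl | grep grid_manager_monitor_agent | grep -v grep
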
I noticed that a "maxtime" is defined on this CE (LCG2.3.0), but it is
not defined on our other CE, which works well and is still on LCG2.2.0.
Do you know anything about this maxtime?
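In case it is the culprit, a way to see whether the 3600 s limit is
actually being hit might be to compare each agent's elapsed time against
the maxtime (a sketch using standard ps options, so it should work on
any of our CEs):

 ps -eo pid,etime,args | grep grid_manager_monitor_agent | grep -v grep
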
>
>Unfortunately there can be many causes for jobs to fail in subtle ways,
>to a large extent due to the Globus job submission model.
>We expect the job submission to become a lot less intricate when we
>have debugged the gLite RB that uses Condor-C (sic) instead of Globus
>to submit the job to the CE.
>
>
Well... my problem is that I would need it now ;).
Anyway, thanks for the information, I'm very interested in this topic.
Is there any documentation or a presentation about it somewhere?
As usual... thanks in advance ;).
Cheers,
Pierre
>
>
> -----Original Message-----
> From: LHC Computer Grid - Rollout on behalf of pierre girard
> Sent: Sun 2/13/2005 10:50 AM
> To: [log in to unmask]
> Cc:
> Subject: Re: [LCG-ROLLOUT] Question about site_globus_tcp_range
>
>
>
> Maarten,
>
> Many thanks for those explanations and your documentation on this topic.
>
> But I still have a question about the cleanup step, because we
> currently have a problem with gram_job_state files oddly disappearing
> on our CE (IN2P3-CC site).
>
> Indeed, we noticed that several submitted jobs are no longer known by
> our jobmanager. Looking at the jobmanager log file, we saw that these
> jobs stopped being handled by the jobmanager at about 03:00 this
> morning, and the same happened yesterday with other jobs. So I suppose
> that is when their gram_job_state files disappeared.
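>
> If anyone wants to check the same thing on their CE, something along
> these lines should do it (the gram_job_state directory path below is
> an assumption for a default LCG2/Globus install under /opt/globus,
> adjust as needed):
>
>  # list the surviving state files and their timestamps
>  ls -l /opt/globus/tmp/gram_job_state/
>  # look for a nightly cron job that might purge them around 03:00
>  grep -ri gram_job_state /etc/cron* 2>/dev/null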
>
> However, these jobs are still known by our batch system, either as
> running or as queued jobs. Worst of all, the running jobs hang on
> indefinitely to an (RB?) connection.
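>
> (To see the connections I mean, something like the following on the CE
> should show them; $RB_HOST is just a placeholder for the RB hostname:
>
>  netstat -t | grep $RB_HOST
> )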
>
> So my question is:
> Are you familiar with this strange phenomenon? Is it possible that an
> RB could launch the cleanup step too soon on the CE?
>
> Thanks in advance for any possible explanation,
>
> Cheers,
>
> Pierre
>
> Maarten Litmaath, CERN wrote:
>
> >On Fri, 11 Feb 2005, owen maroney wrote:
> >
> >
> >
> >>This is really useful: is there a page or three on this in the
> >>troubleshooting wiki?
> >>
> >>
> >
> >Hi Owen,
> >I have added entries to the Job Submission category:
> >
> > http://goc.grid.sinica.edu.tw/gocwiki/SiteProblemsFollowUpFaq
> >
> >In particular:
> >
> >http://goc.grid.sinica.edu.tw/gocwiki/Dialog_between_RB_and_CE
> >
> >http://goc.grid.sinica.edu.tw/gocwiki/Globus_error_79%3a_connecting_to_the_job_manager_failed%2e
> >
> >Cheers,
> > Maarten
> >
> >
> >
>
> --
> ______________________
> Pierre GIRARD
> Grid Computing Team Member
> IN2P3/CNRS Computing Centre - Lyon (FRANCE)
> http://cc.in2p3.fr
> Tel. +33 4.78.93.08.80 | Fax. +33 4.72.69.41.70 | e-mail: [log in to unmask]
>
>
>
>
--
______________________
Pierre GIRARD
Grid Computing Team Member
IN2P3/CNRS Computing Centre - Lyon (FRANCE)
http://cc.in2p3.fr
Tel. +33 4.78.93.08.80 | Fax. +33 4.72.69.41.70 | e-mail: [log in to unmask]