Hello,
Thanks for all the answers so far.
As indicated by new tests, the jobs are aborted as soon as they arrive at a
particular WN. They run fine if qsub'ed, but are aborted when submitted with
globus-job-run.
Copying back the results does not seem to be the problem, nor does low disk
or memory space.
What could cause jobs submitted via Globus to be aborted while those
submitted with qsub are not?
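For comparison, this is roughly how I am testing the two submission paths (the hostname and jobmanager contact are the ones from this thread; the test script itself is just a stand-in):

```shell
#!/bin/sh
# Create a trivial test job (stand-in payload; any simple script will do).
cat > test.sh <<'EOF'
#!/bin/sh
hostname
id
EOF
chmod +x test.sh

# Path 1: direct batch submission on the CE -- this works,
# and the output comes back.
qsub test.sh

# Path 2: submission through the Globus gatekeeper -- the job is
# aborted on the WN, with no output and no error message returned.
globus-job-run gcn54/jobmanager-lcgpbs /bin/hostname
```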
Cheers,
Daniel
On Monday 07 May 2007 10:14, Daniel Lorenz wrote:
> Hello,
>
> > Is there anything in the mom logs on the WN, in particular there should
> > be reason why the file could not be copied back.
>
> For each job there is an entry like:
>
> 05/04/2007 17:26:30;0001; pbs_mom;Job;TMomFinalizeJob3;job
> 1174.gcn54.hep.physik.uni-siegen.de started, pid = 1463
> 05/04/2007 17:26:31;0080;
> pbs_mom;Job;1174.gcn54.hep.physik.uni-siegen.de;scan_for_terminated: job
> 1174.gcn54.hep.physik.uni-siegen.de task 1 terminated, sid 1463
>
> The logevent was set to 255.
>
> > Also when you say you submitted jobs successfully with qsub did the
> > output actually
> > come back?
>
> Yes.
>
> Regards,
> Daniel
>
> > Steve
> >
> > >> So it sounds like you tried everything on:
> > >>
> > >> http://goc.grid.sinica.edu.tw/gocwiki/submit-
> > >> helper_script_..._gave_error%3A_cache_export_dir_...
> > >>
> > >> Try su'ing to one of your pool accounts and qsub'ing a simple job.
> > >> Try in particular as
> > >> an ops VO.
> > >> Steve
> > >
> > > I tried this with a number of accounts and it worked fine. But if
> > > I use
> > > globus-job-run gcn54/jobmanager-lcgpbs no output (but also no error
> > > message)
> > > is returned and a "submit-helper script..." entry on the WN is
> > > generated.
> > > The fork manager also works.
> > >
> > > Regards,
> > > Daniel
> > >
> > >>> submit-helper script running on host gcn51 gave error:
> > >>> cache_export_dir
> > >>> (/home/dteam006/.lcgjm/globus-cache-export.I23975) on gatekeeper
> > >>> did not
> > >>> contain a cache_export_dir.tar archive
> > >>>
> > >>> logging info says:
> > >>> Event: Done
> > >>> - exit_code = 1
> > >>> - host = rb127.cern.ch
> > >>> - level = SYSTEM
> > >>> - priority = asynchronous
> > >>> - reason = Cannot read JobWrapper output, both
> > >>> from Condor
> > >>> and from Maradona.
> > >>> - seqcode =
> > >>> UI=000003:NS=0000000003:WM=000012:BH=0000000000:JSS=000009:LM=000019:LRMS=000000:APP=000000
> > >>> - source = LogMonitor
> > >>> - src_instance = unique
> > >>> - status_code = FAILED
> > >>>
> > >>> The CRL was up to date.
> > >>> I can copy file from the WN to the CE with
> > >>> globus-url-copy.
> > >>> The clocks are synchronized.
> > >>> Users are mapped to the same id.
> > >>>
> > >>> Has anybody an idea?
> > >>>
> > >>> Thanks in advance,
> > >>> Daniel Lorenz