Hi, I occasionally see intermittent job failures at my site; in
the QMUL tests these are reported just as:
> ce01.dur.scotgrid.ac.uk None n62.dur.scotgrid.ac.uk: OK 2011-06-02 13:53:25 ok
> ce01.dur.scotgrid.ac.uk None CRITICAL: Job was aborted. 2011-06-02 12:58:50 error
> ce01.dur.scotgrid.ac.uk None n62.dur.scotgrid.ac.uk: OK: VO-atlas-AtlasPhysics-16.6.4.1.1-i686-slc5-gcc43-opt VO-atlas-AtlasPhysics-16.6.5.1.1-i686-slc5-gcc43-opt VO-atlas-AtlasPhysics-16.6.5.2.1-i686-slc5-gcc43-opt VO-atlas-AtlasPhysics-16.6.5.3.1-i686-slc5-gcc43-opt VO-atlas-AtlasP 2011-06-02 13:53:25 ok
My question is: how do I get the QMUL ATLAS SAM test job
outputs/logs, so I can figure out which error aborted the job?
I have checked and seen "Job was aborted" happen intermittently
at a few other sites too, for example most recently:
> epgr04.ph.bham.ac.uk None u4n099: OK 2011-05-31 09:53:00 ok
> epgr04.ph.bham.ac.uk None CRITICAL: Job was aborted. 2011-06-02 13:48:53 error
> epgr04.ph.bham.ac.uk None u4n099: OK: VO-atlas-AtlasPhysics-16.6.4.1.1-i686-slc5-gcc43-opt VO-atlas-AtlasPhysics-16.6.5.1.1-i686-slc5-gcc43-opt VO-atlas-AtlasPhysics-16.6.5.2.1-i686-slc5-gcc43-opt VO-atlas-AtlasPhysics-16.6.5.3.1-i686-slc5-gcc43-opt VO-atlas-BTagging-15.6.12.1.1-i 2011-05-31 09:48:03 ok
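In case it helps anyone tally these per CE, here is a minimal sketch that picks the "Job was aborted" entries out of report lines like the ones quoted above. The line layout (CE host, a second field that shows as "None", a free-text message, a timestamp, a status word) is only my assumption from these excerpts, and the names parse_report_line and aborted_by_ce are made up for illustration; this is not anything the test framework provides.

```python
import re
from collections import Counter

# Assumed layout, inferred only from the excerpts quoted above:
#   [> ]<CE host> <second field> <message> <YYYY-MM-DD HH:MM:SS> <status>
LINE_RE = re.compile(
    r"^>?\s*(?P<ce>\S+)\s+(?P<field2>\S+)\s+(?P<message>.*?)\s+"
    r"(?P<when>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})\s+(?P<status>\w+)\s*$"
)


def parse_report_line(line):
    """Return the fields of one report line as a dict, or None if it does not match."""
    match = LINE_RE.match(line)
    return match.groupdict() if match else None


def aborted_by_ce(lines):
    """Count 'Job was aborted' error entries per CE host."""
    counts = Counter()
    for line in lines:
        rec = parse_report_line(line)
        if rec and rec["status"] == "error" and "Job was aborted" in rec["message"]:
            counts[rec["ce"]] += 1
    return counts
```

Feeding it the saved report text line by line gives a per-CE count, which at least shows whether the aborts cluster on particular sites or times.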
I also see, and my local grid users also report, a certain rate
of WMS failures, which seem to be due mostly to load (not an
unknown situation :->): typically failed load average checks,
but recently also this for the QMUL tests:
> Warning - Unable to register the job to the service: https://lcgwms02.gridpp.rl.ac.uk:7443/glite_wms_wmproxy_server
> System load is too high:
> Threshold for JC Input JobDir jobs: 6500 => Detected value for JC Input JobDir jobs /var/glite/jobcontrol/jobdir/ : 7729
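For logging how far over the limit the WMS is when this happens, a small sketch that pulls the two numbers out of the "Threshold for ..." line. The pattern is based only on the single message quoted above, so other load checks may well be worded differently, and parse_load_threshold is a made-up helper name.

```python
import re

# Pattern assumed from the one message quoted above:
#   Threshold for <check>: <limit> => Detected value for <...> : <value>
THRESHOLD_RE = re.compile(
    r"Threshold for (?P<check>.+?):\s*(?P<threshold>\d+)\s*=>\s*"
    r"Detected value for .*?:\s*(?P<detected>\d+)"
)


def parse_load_threshold(line):
    """Return check name, limit, detected value and excess, or None if no match."""
    match = THRESHOLD_RE.search(line)
    if match is None:
        return None
    threshold = int(match.group("threshold"))
    detected = int(match.group("detected"))
    return {
        "check": match.group("check"),
        "threshold": threshold,
        "detected": detected,
        "excess": detected - threshold,  # positive when over the limit
    }
```

For the message above this gives a threshold of 6500 and a detected value of 7729, i.e. 1229 jobs over the limit.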