Peter Grandi wrote:
> Hi, I occasionally see at my site intermittent jobs failures; in
> the QMUL test these get reported just as:
>
>> ce01.dur.scotgrid.ac.uk None n62.dur.scotgrid.ac.uk: OK 2011-06-02 13:53:25 ok
>> ce01.dur.scotgrid.ac.uk None CRITICAL: Job was aborted. 2011-06-02 12:58:50 error
>> ce01.dur.scotgrid.ac.uk None n62.dur.scotgrid.ac.uk: OK: VO-atlas-AtlasPhysics-16.6.4.1.1-i686-slc5-gcc43-opt VO-atlas-AtlasPhysics-16.6.5.1.1-i686-slc5-gcc43-opt VO-atlas-AtlasPhysics-16.6.5.2.1-i686-slc5-gcc43-opt VO-atlas-AtlasPhysics-16.6.5.3.1-i686-slc5-gcc43-opt VO-atlas-AtlasP 2011-06-02 13:53:25 ok
>
> Here the question is how do I get QMUL ATLAS SAM test job
> outputs/logs to figure out which error aborted the job.
>
For the ATLAS SAM tests, Steve just copies the results from the SAM testing
framework, so you need to look there directly:
https://lcg-sam.cern.ch:8443/sam/sam.py?sensors=CE&regions=UKI&vo=atlas&order=SiteName&funct=ShowSensorTests
Steve also runs his own tests as an ATLAS user - labelled "atlas tests".
For those, clicking on the failed job links to a summary page, which in
turn links to the detailed job output.
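If you want to pick just the failures out of listings like the ones you quote, a rough Python sketch along these lines might help. The field layout (host, a second field shown as "None", free-text message, timestamp, status) is only inferred from the quoted lines, and `failed_entries` is a made-up helper name, not part of any SAM tool.

```python
import re

# Assumed layout, inferred from the quoted lines only:
#   <host> <second field> <free-text message> <YYYY-MM-DD HH:MM:SS> <status>
LINE_RE = re.compile(
    r"^(?P<host>\S+)\s+(?P<field>\S+)\s+(?P<message>.*?)\s+"
    r"(?P<when>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})\s+(?P<status>\w+)$"
)


def failed_entries(lines):
    """Return (host, timestamp, message) for every line whose status is not 'ok'."""
    failures = []
    for line in lines:
        match = LINE_RE.match(line.strip())
        if match and match.group("status") != "ok":
            failures.append(
                (match.group("host"), match.group("when"), match.group("message"))
            )
    return failures


# Two of the lines quoted in the original message.
sample = [
    "ce01.dur.scotgrid.ac.uk None n62.dur.scotgrid.ac.uk: OK 2011-06-02 13:53:25 ok",
    "ce01.dur.scotgrid.ac.uk None CRITICAL: Job was aborted. 2011-06-02 12:58:50 error",
]
failures = failed_entries(sample)
for host, when, message in failures:
    print(host, when, message)
```

This only tells you which CE failed and when; the actual reason for the abort still has to come from the job output linked from the test pages.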
> I have checked other sites and seen "Job was aborted" happen
> intermittently at a few other sites, for example most recently:
>
>> epgr04.ph.bham.ac.uk None u4n099: OK 2011-05-31 09:53:00 ok
>> epgr04.ph.bham.ac.uk None CRITICAL: Job was aborted. 2011-06-02 13:48:53 error
>> epgr04.ph.bham.ac.uk None u4n099: OK: VO-atlas-AtlasPhysics-16.6.4.1.1-i686-slc5-gcc43-opt VO-atlas-AtlasPhysics-16.6.5.1.1-i686-slc5-gcc43-opt VO-atlas-AtlasPhysics-16.6.5.2.1-i686-slc5-gcc43-opt VO-atlas-AtlasPhysics-16.6.5.3.1-i686-slc5-gcc43-opt VO-atlas-BTagging-15.6.12.1.1-i 2011-05-31 09:48:03 ok
>
> I also see, and my local grid users also tell me about, a
> certain rate of WMS failures, which seem due mostly to load (not
> an unknown situation :->), typically load average checks, but
> recently also this for the QMUL tests:
>
>> Warning - Unable to register the job to the service: https://lcgwms02.gridpp.rl.ac.uk:7443/glite_wms_wmproxy_server
>> System load is too high:
>> Threshold for JC Input JobDir jobs: 6500 => Detected value for JC Input JobDir jobs /var/glite/jobcontrol/jobdir/ : 7729
Yes, that's a WMS problem. If that's the cause, you will see a vertical
stripe of red at http://pprc.qmul.ac.uk/~lloyd/gridpp/atest.html
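To tell this kind of WMS overload apart from a genuine site failure in your own logs, you could match on the threshold message. This is a minimal sketch, assuming the message always has the "Threshold for ...: N => Detected value for ... : M" shape quoted above; `wms_overload` is a hypothetical helper, not a gLite function.

```python
import re

# Pattern assumed from the single quoted WMS message; other load checks
# (e.g. load average) may well be worded differently.
THRESHOLD_RE = re.compile(
    r"Threshold for (?P<what>.+?):\s*(?P<limit>\d+)\s*=>\s*"
    r"Detected value for .+?:\s*(?P<seen>\d+)"
)


def wms_overload(message):
    """Return (check name, threshold, detected value), or None if no match."""
    match = THRESHOLD_RE.search(message)
    if match is None:
        return None
    return (match.group("what"), int(match.group("limit")), int(match.group("seen")))


# The message quoted in the original mail.
msg = (
    "Threshold for JC Input JobDir jobs: 6500 => Detected value for "
    "JC Input JobDir jobs /var/glite/jobcontrol/jobdir/ : 7729"
)
result = wms_overload(msg)
if result is not None:
    what, limit, seen = result
    print(f"{what}: detected {seen}, threshold {limit}")
```

A match means the WMS refused the job before it ever reached the site, so it says nothing about the CE itself.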
Chris