Peter Grandi wrote:
> Hi, I occasionally see at my site intermittent jobs failures; in
> the QMUL test these get reported just as:
> 
>> ce01.dur.scotgrid.ac.uk	None	n62.dur.scotgrid.ac.uk: OK	2011-06-02 13:53:25	ok
>> ce01.dur.scotgrid.ac.uk	None	CRITICAL: Job was aborted.	2011-06-02 12:58:50	error
>> ce01.dur.scotgrid.ac.uk	None	n62.dur.scotgrid.ac.uk: OK: VO-atlas-AtlasPhysics-16.6.4.1.1-i686-slc5-gcc43-opt VO-atlas-AtlasPhysics-16.6.5.1.1-i686-slc5-gcc43-opt VO-atlas-AtlasPhysics-16.6.5.2.1-i686-slc5-gcc43-opt VO-atlas-AtlasPhysics-16.6.5.3.1-i686-slc5-gcc43-opt VO-atlas-AtlasP	2011-06-02 13:53:25	ok
> 
> Here the question is how do I get QMUL ATLAS SAM test job
> outputs/logs to figure out which error aborted the job.
> 

For the ATLAS SAM tests, Steve copies the results from the SAM testing
framework - you need to look at that directly:
https://lcg-sam.cern.ch:8443/sam/sam.py?sensors=CE&regions=UKI&vo=atlas&order=SiteName&funct=ShowSensorTests

Steve also runs his own tests as an ATLAS user - labelled "atlas tests".
Clicking on a failed job links to a summary page, which in turn links to
the detailed job output.

> I have checked other sites and seen "Job was aborted" happen
> intermittently at a few other sites, for example most recently:
> 
>> epgr04.ph.bham.ac.uk	None	u4n099: OK	2011-05-31 09:53:00	ok
>> epgr04.ph.bham.ac.uk	None	CRITICAL: Job was aborted.	2011-06-02 13:48:53	error
>> epgr04.ph.bham.ac.uk	None	u4n099: OK: VO-atlas-AtlasPhysics-16.6.4.1.1-i686-slc5-gcc43-opt VO-atlas-AtlasPhysics-16.6.5.1.1-i686-slc5-gcc43-opt VO-atlas-AtlasPhysics-16.6.5.2.1-i686-slc5-gcc43-opt VO-atlas-AtlasPhysics-16.6.5.3.1-i686-slc5-gcc43-opt VO-atlas-BTagging-15.6.12.1.1-i	2011-05-31 09:48:03	ok
> 
> I also see, and my local grid users also tell me about, a
> certain rate of WMS failures, which seem due mostly to load (not
> an unknown situation :->), typically load average checks, but
> recently also this for the QMUL tests:
> 
>> Warning - Unable to register the job to the service: https://lcgwms02.gridpp.rl.ac.uk:7443/glite_wms_wmproxy_server
>> System load is too high:
>> Threshold for JC Input JobDir jobs: 6500 => Detected value for JC Input JobDir jobs /var/glite/jobcontrol/jobdir/ : 7729

Yes, that's a WMS problem. When that is the case you see a vertical
stripe of red at http://pprc.qmul.ac.uk/~lloyd/gridpp/atest.html.
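For what it's worth, if you save one of those result dumps, a few lines of Python will pull out just the failing rows. This is only a sketch: it assumes the columns are tab-separated in the order CE / node / message / timestamp / status, as they appear in the rows quoted above.

```python
# Sketch: pick the failing rows out of a tab-separated SAM result dump.
# Assumed column layout (from the rows quoted above):
#   CE <TAB> node <TAB> message <TAB> timestamp <TAB> status
rows = [
    "ce01.dur.scotgrid.ac.uk\tNone\tn62.dur.scotgrid.ac.uk: OK\t2011-06-02 13:53:25\tok",
    "ce01.dur.scotgrid.ac.uk\tNone\tCRITICAL: Job was aborted.\t2011-06-02 12:58:50\terror",
]

def failing(lines):
    """Yield (ce, timestamp, message) for each row whose status is not 'ok'."""
    for line in lines:
        ce, _node, message, stamp, status = line.rstrip("\n").split("\t")
        if status != "ok":
            yield ce, stamp, message

for ce, stamp, msg in failing(rows):
    print(ce, stamp, msg)
```

That at least gives you the timestamps of the aborted jobs to match against the detailed job output on the summary pages.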

Chris