Peter Grandi wrote:
> Hi, I occasionally see at my site intermittent jobs failures; in
> the QMUL test these get reported just as:
>
>> ce01.dur.scotgrid.ac.uk None n62.dur.scotgrid.ac.uk: OK 2011-06-02 13:53:25 ok
>> ce01.dur.scotgrid.ac.uk None CRITICAL: Job was aborted. 2011-06-02 12:58:50 error
>> ce01.dur.scotgrid.ac.uk None n62.dur.scotgrid.ac.uk: OK: VO-atlas-AtlasPhysics-16.6.4.1.1-i686-slc5-gcc43-opt VO-atlas-AtlasPhysics-16.6.5.1.1-i686-slc5-gcc43-opt VO-atlas-AtlasPhysics-16.6.5.2.1-i686-slc5-gcc43-opt VO-atlas-AtlasPhysics-16.6.5.3.1-i686-slc5-gcc43-opt VO-atlas-AtlasP 2011-06-02 13:53:25 ok
>
> Here the question is how do I get QMUL ATLAS SAM test job
> outputs/logs to figure out which error aborted the job.
>

For the ATLAS SAM tests, Steve copies the results from the SAM testing
framework, so you need to look at that directly:

https://lcg-sam.cern.ch:8443/sam/sam.py?sensors=CE&regions=UKI&vo=atlas&order=SiteName&funct=ShowSensorTests

Steve also runs his own tests as an ATLAS user, labeled "atlas tests".
Clicking on a failed job there links to a summary page, which in turn
links to the detailed job output.

> I have checked other sites and seen "Job was aborted" happen
> intermittently at a few other sites, for example most recently:
>
>> epgr04.ph.bham.ac.uk None u4n099: OK 2011-05-31 09:53:00 ok
>> epgr04.ph.bham.ac.uk None CRITICAL: Job was aborted. 2011-06-02 13:48:53 error
>> epgr04.ph.bham.ac.uk None u4n099: OK: VO-atlas-AtlasPhysics-16.6.4.1.1-i686-slc5-gcc43-opt VO-atlas-AtlasPhysics-16.6.5.1.1-i686-slc5-gcc43-opt VO-atlas-AtlasPhysics-16.6.5.2.1-i686-slc5-gcc43-opt VO-atlas-AtlasPhysics-16.6.5.3.1-i686-slc5-gcc43-opt VO-atlas-BTagging-15.6.12.1.1-i 2011-05-31 09:48:03 ok
>
> I also see, and my local grid users also tell me about, a
> certain rate of WMS failures, which seem due mostly to load (not
> an unknown situation :->), typically load average checks, but
> recently also this for the QMUL tests:
>
>> Warning - Unable to register the job to the service: https://lcgwms02.gridpp.rl.ac.uk:7443/glite_wms_wmproxy_server
>> System load is too high:
>> Threshold for JC Input JobDir jobs: 6500 => Detected value for JC Input JobDir jobs /var/glite/jobcontrol/jobdir/ : 7729

Yes, that's a WMS problem. When that is the case, you see a vertical
stripe of red at http://pprc.qmul.ac.uk/~lloyd/gridpp/atest.html.

Chris
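
P.S. The WMS warning quoted above carries both the configured threshold and the detected value, so when scanning submission logs for this kind of failure a small script can pull the two numbers out. The following is only a minimal Python sketch, assuming the message wording is exactly as quoted in this thread; the function name `parse_wms_load_warning` and the regular expression are illustrative and not part of any gLite tool.

```python
import re

# Assumption: the warning has the same wording as the one quoted in this
# thread ("Threshold for <check>: N => Detected value for <check> <path> : M").
LOAD_RE = re.compile(
    r"Threshold for (?P<check>.+?):\s*(?P<threshold>\d+)\s*=>\s*"
    r"Detected value for .+?:\s*(?P<detected>\d+)"
)

def parse_wms_load_warning(text):
    """Return (check, threshold, detected), or None if no load warning is found."""
    m = LOAD_RE.search(text)
    if m is None:
        return None
    return m.group("check"), int(m.group("threshold")), int(m.group("detected"))

# The message from the thread, joined onto one line for the example.
sample = (
    "System load is too high: "
    "Threshold for JC Input JobDir jobs: 6500 => "
    "Detected value for JC Input JobDir jobs /var/glite/jobcontrol/jobdir/ : 7729"
)
result = parse_wms_load_warning(sample)
if result is not None:
    check, threshold, detected = result
    print(f"{check}: detected {detected}, threshold {threshold}")
```

A detected value above the threshold points at the WMS rather than at the site CE, which matches the diagnosis above.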