Hi, Well, it's not harmless as whatever it is causes about 50% of the jobs to fail. Given that as far as I have worked out these jobs mainly run at RAL (as implicated in the error message) I was hoping RAL could shed some light on it. t2k as a VO is quite new to the grid, so they might hit snags other VOs have long circumvented. Cheers, Daniela On 19 January 2011 17:17, Alessandra Forti <[log in to unmask]> wrote: > Sorry... if it is the result of a job it's because a previous submission > failed and the reason should be in logging-info. My questions about that are > in the same thread. > > cheers > alessandra > > > On 19/01/11 17:08, Alessandra Forti wrote: > > You might want to look at lcg-rollout, there was a thread about it. > Apparently it's harmless > > > https://www.jiscmail.ac.uk/cgi-bin/webadmin?A2=ind1101&L=LCG-ROLLOUT&F=&S=&P=38698 > > cheers > alessandra > > > On 19/01/11 16:55, Daniela Bauer wrote: > > Hi All, > > Sorry, lengthy. > > I've just had a complaint from t2k that about half of ~400 jobs failed with > > Current Status: Aborted > Status Reason: failed (LB query failed) > > both at the RAL and Imperial WMS. > > Googling this error just seems to suggest it's not uncommon, but no hint > whatsoever what is wrong. > My own WMS here seems to be reasonably happy right now, and if I do a grep > in the log, there is nothing suspicious, at the end it just says: > > workload_manager_events.log:18 Jan, 23:03:41 -I: [Info] > operator()(dispatcher_utils.cpp:218): new jobsubmit for > https://wmslb01.hep.ph.ic.ac.uk:9000/i0o1aa5KK-Dm8Tf_LnnPaA > workload_manager_events.log:18 Jan, 23:03:44 -I: [Info] > checkRequirement(matchmakerISMImpl.cpp:89): MM for job: > https://wmslb01.hep.ph.ic.ac.uk:9000/i0o1aa5KK-Dm8Tf_LnnPaA (9/12533 [3] ) > workload_manager_events.log:18 Jan, 23:03:44 -I: [Info] > operator()(submit_request.cpp:478): > https://wmslb01.hep.ph.ic.ac.uk:9000/i0o1aa5KK-Dm8Tf_LnnPaA delivered > workload_manager_events.log:18 Jan, 23:14:24 -I: [Info] > operator()(dispatcher_utils.cpp:218): new jobresubmit for > https://wmslb01.hep.ph.ic.ac.uk:9000/i0o1aa5KK-Dm8Tf_LnnPaA > workload_manager_events.log:18 Jan, 23:14:24 -E: [Error] > unrecoverable(submit_request.cpp:111): > https://wmslb01.hep.ph.ic.ac.uk:9000/i0o1aa5KK-Dm8Tf_LnnPaA failed ( > failed (LB query failed)) > > The job went to a cream CE as far as I can tell: > > 2011-01-18 23:03:45,057 INFO - iceCommandSubmit::try_to_submit() - > TID=[140441744] For GridJobID [ > https://wmslb01.hep.ph.ic.ac.uk:9000/i0o1aa5KK-Dm8Tf_LnnPaA] CREAM > Returned CREAM-JOBID [https://lcgce05.gridpp.rl.ac.uk:8443/CREAM503909568] > DB_ID [] > > 2011-01-18 23:03:45,059 DEBUG - iceCommandSubmit::try_to_submit() - > TID=[140441744] Going to START CreamJobID [ > https://lcgce05.gridpp.rl.ac.uk:8443/CREAM503909568] related to GridJobID > [https://wmslb01.hep.ph.ic.ac.uk:9000/i0o1aa5KK-Dm8Tf_LnnPaA].<https://wmslb01.hep.ph.ic.ac.uk:9000/i0o1aa5KK-Dm8Tf_LnnPaA%5D.> > .. > > 2011-01-18 23:03:45,191 INFO - iceLBContext::setLoggingJob - Setting log > job to jobid=[https://wmslb01.hep.ph.ic.ac.uk:9000/i0o1aa5KK-Dm8Tf_LnnPaA] > LB server=[wmslb01.hep.ph.ic.ac.uk:9000] (port is not used, actually...) > > 2011-01-18 23:03:45,191 INFO - iceLBLogger::logEvent() - Cream Transfer OK > Event - [gridJobID=" > https://wmslb01.hep.ph.ic.ac.uk:9000/i0o1aa5KK-Dm8Tf_LnnPaA" CREAMJobID=" > https://lcgce05.gridpp.rl.ac.uk:8443/CREAM503909568"] > > 2011-01-18 23:03:45,214 DEBUG - filelist_request_purger - removing request > [ Arguments = [ JobAd = [ requirements = ( ( > Member("VO-t2k.org-ND280-v8r5p9",other.GlueHostApplicationSoftwareRunTimeEnvironment) > && ( !RegExp("shef.ac.uk",other.GlueCEUniqueID) ) && > other.GlueCEPolicyMaxCPUTime > 1200 ) && ( other.GlueCEStateStatus == > "Production" ) ) && !RegExp(".*sdj$",other.GlueCEUniqueID); RetryCount = 3; > edg_jobid = "https://wmslb01.hep.ph.ic.ac.uk:9000/i0o1aa5KK-Dm8Tf_LnnPaA"; > lrms_type = "torque"; CeApplicationDir = "/stage/sl3-lcg-exp/t2ksgm"; > GlobusResourceContactString = "lcgce05.gridpp.rl.ac.uk:8443/cream-pbs"; > OutputSandboxPath = > "/var/glite/SandboxDir/i0/https_3a_2f_2fwmslb01.hep.ph.ic.ac.uk_3a9000_2fi0o1aa5KK-Dm8Tf_5fLnnPaA/output"; > ce_id = "lcgce05.gridpp.rl.ac.uk:8443/cream-pbs-grid500M"; MyProxyServer = > "lcgrbp01.gridpp.rl.ac.uk"; AllowZippedISB = true; QueueName = "grid500M"; > JobType = "normal"; InputSandboxDestFileName = { "geniefc-10-17-46.sh" }; > SignificantAttributes = { "Requirements","Rank","FuzzyRank" }; Executable = > "geniefc-10-17-46.sh"; OutputSandboxDestURI = { "gsiftp:// > wms01.hep.ph.ic.ac.uk:2811/var/glite/SandboxDir/i0/https_3a_2f_2fwmslb01.hep.ph.ic.ac.uk_3a9000_2fi0o1aa5KK-Dm8Tf_5fLnnPaA/output/stdout > ","gsiftp:// > wms01.hep.ph.ic.ac.uk:2811/var/glite/SandboxDir/i0/https_3a_2f_2fwmslb01.hep.ph.ic.ac.uk_3a9000_2fi0o1aa5KK-Dm8Tf_5fLnnPaA/output/stderr" > }; CertificateSubject = "/C=UK/O=eScience/OU=Imperial/L=Physics/CN=james > dobson"; X509UserProxy = > "/var/glite/SandboxDir/i0/https_3a_2f_2fwmslb01.hep.ph.ic.ac.uk_3a9000_2fi0o1aa5KK-Dm8Tf_5fLnnPaA/user.proxy"; > StdOutput = "stdout"; VOMS_FQAN = "/t2k.org/Role=NULL/Capability=NULL"; > OutputSandbox = { "stdout","stderr" }; LB_sequence_code = > "UI=000000:NS=0000000004:WM=000004:BH=0000000000:JSS=000000:LM=000000:LRMS=000000:APP=000000:LBS=000000"; > InputSandboxPath = > "/var/glite/SandboxDir/i0/https_3a_2f_2fwmslb01.hep.ph.ic.ac.uk_3a9000_2fi0o1aa5KK-Dm8Tf_5fLnnPaA/input"; > VirtualOrganisation = "t2k.org"; rank = > -other.GlueCEStateEstimatedResponseTime; Type = "job"; ShallowRetryCount = > 10; CeRequirements = "true && ( true && ( true && > Member(\"VO-t2k.org-ND280-v8r5p9\",other.GlueHostApplicationSoftwareRunTimeEnvironment) > && other.GlueCEPolicyMaxCPUTime > 1200 ) )"; StdError = "stderr"; > WMPInputSandboxBaseURI = "gsiftp:// > wms01.hep.ph.ic.ac.uk:2811/var/glite/SandboxDir/i0/https_3a_2f_2fwmslb01.hep.ph.ic.ac.uk_3a9000_2fi0o1aa5KK-Dm8Tf_5fLnnPaA"; > DefaultRank = -other.GlueCEStateEstimatedResponseTime; ReallyRunningToken = > "gsiftp:// > wms01.hep.ph.ic.ac.uk/var/glite/SandboxDir/i0/https_3a_2f_2fwmslb01.hep.ph.ic.ac.uk_3a9000_2fi0o1aa5KK-Dm8Tf_5fLnnPaA/token.txt"; > ZippedISB = { "ISBfiles_S6Q0LX8bTIVIlI7tUg_q9A_0.tar.gz" }; InputSandbox = { > "gsiftp:// > wms01.hep.ph.ic.ac.uk:2811/var/glite/SandboxDir/i0/https_3a_2f_2fwmslb01.hep.ph.ic.ac.uk_3a9000_2fi0o1aa5KK-Dm8Tf_5fLnnPaA/input/geniefc-10-17-46.sh > ","gsiftp:// > wms01.hep.ph.ic.ac.uk:2811/var/glite/SandboxDir/i0/https_3a_2f_2fwmslb01.hep.ph.ic.ac.uk_3a9000_2fi0o1aa5KK-Dm8Tf_5fLnnPaA/input/.BrokerInfo" > } ] ]; Command = "Submit"; Source = 2; Protocol = "1.0.0" ] > > 2011-01-18 23:14:23,261 DEBUG - iceCommandStatusPoller::update_single_job() > - Updating ICE's database for gridJobID=" > https://wmslb01.hep.ph.ic.ac.uk:9000/i0o1aa5KK-Dm8Tf_LnnPaA" CREAMJobID=" > https://lcgce05.gridpp.rl.ac.uk:8443/CREAM503909568" status = [REGISTERED] > exit_code = [] failure_reason = [] description = [] > > 2011-01-18 23:14:23,262 DEBUG - iceCommandStatusPoller::update_single_job() > - Updating ICE's database for gridJobID=" > https://wmslb01.hep.ph.ic.ac.uk:9000/i0o1aa5KK-Dm8Tf_LnnPaA" CREAMJobID=" > https://lcgce05.gridpp.rl.ac.uk:8443/CREAM503909568" status = [PENDING] > exit_code = [] failure_reason = [] description = [] > > ----> Now the next bit I would interpret as the job actually having failed > to at the local batch system level, but why does this error not get > propagated ? > > 2011-01-18 23:14:23,263 DEBUG - iceCommandStatusPoller::update_single_job() > - Updating ICE's database for gridJobID=" > https://wmslb01.hep.ph.ic.ac.uk:9000/i0o1aa5KK-Dm8Tf_LnnPaA" CREAMJobID=" > https://lcgce05.gridpp.rl.ac.uk:8443/CREAM503909568" status = [ABORTED] > exit_code = [] failure_reason = [BLAH error: submission command failed (exit > code = 1) (stdout:) (stderr:pbs_iff: cannot read reply from pbs_server-No > Permission.-qsub: cannot connect to server lcgbatch01.gridpp.rl.ac.uk(errno=15007) Unauthorized Request -TERM environment variable not set.-) N/A > (jobId = CREAM503909568)] description = [] > > 2011-01-18 23:14:23,264 INFO - iceLBContext::setLoggingJob - Setting log > job to jobid=[https://wmslb01.hep.ph.ic.ac.uk:9000/i0o1aa5KK-Dm8Tf_LnnPaA] > LB server=[wmslb01.hep.ph.ic.ac.uk:9000] (port is not used, actually...) > > 2011-01-18 23:14:23,264 INFO - iceLBLogger::logEvent() - Job Aborted Event, > reason=[BLAH error: submission command failed (exit code = 1) (stdout:) > (stderr:pbs_iff: cannot read reply from pbs_server-No Permission.-qsub: > cannot connect to server lcgbatch01.gridpp.rl.ac.uk (errno=15007) > Unauthorized Request -TERM environment variable not set.-) N/A (jobId = > CREAM503909568)] - [gridJobID=" > https://wmslb01.hep.ph.ic.ac.uk:9000/i0o1aa5KK-Dm8Tf_LnnPaA" CREAMJobID=" > https://lcgce05.gridpp.rl.ac.uk:8443/CREAM503909568"] > > 2011-01-18 23:14:23,301 DEBUG - Ice::resubmit_or_purge_job() - Removing > purged job [gridJobID=" > https://wmslb01.hep.ph.ic.ac.uk:9000/i0o1aa5KK-Dm8Tf_LnnPaA" CREAMJobID=" > https://lcgce05.gridpp.rl.ac.uk:8443/CREAM503909568"] from ICE's database > > 2011-01-18 23:14:23,315 INFO - iceLBContext::setLoggingJob - Setting log > job to jobid=[https://wmslb01.hep.ph.ic.ac.uk:9000/i0o1aa5KK-Dm8Tf_LnnPaA] > LB server=[wmslb01.hep.ph.ic.ac.uk:9000] (port is not used, actually...) > > 2011-01-18 23:14:23,316 INFO - iceLBLogger::logEvent() - ICE Resubmission > Event, reason=[Job resubmitted by ICE] - [gridJobID=" > https://wmslb01.hep.ph.ic.ac.uk:9000/i0o1aa5KK-Dm8Tf_LnnPaA" CREAMJobID=" > https://lcgce05.gridpp.rl.ac.uk:8443/CREAM503909568"] > > 2011-01-18 23:14:23,329 INFO - iceLBContext::setLoggingJob - Setting log > job to jobid=[https://wmslb01.hep.ph.ic.ac.uk:9000/i0o1aa5KK-Dm8Tf_LnnPaA] > LB server=[wmslb01.hep.ph.ic.ac.uk:9000] (port is not used, actually...) > > 2011-01-18 23:14:23,329 INFO - iceLBLogger::logEvent() - NS Enqueued Start > Event, qname=[/var/glite/workload_manager/jobdir] - [gridJobID=" > https://wmslb01.hep.ph.ic.ac.uk:9000/i0o1aa5KK-Dm8Tf_LnnPaA" CREAMJobID=" > https://lcgce05.gridpp.rl.ac.uk:8443/CREAM503909568"] > > 2011-01-18 23:14:23,341 INFO - Ice::resubmit_job() - Putting [[ arguments = > [ id = "https://wmslb01.hep.ph.ic.ac.uk:9000/i0o1aa5KK-Dm8Tf_LnnPaA"; > lb_sequence_code = > "UI=000000:NS=0000000004:WM=000004:BH=0000000000:JSS=000002:LM=000010:LRMS=000000:APP=000000:LBS=000000" > ]; command = "jobresubmit"; version = "1.0.0" ]] to WM's Input file > > 2011-01-18 23:14:23,341 INFO - iceLBContext::setLoggingJob - Setting log > job to jobid=[https://wmslb01.hep.ph.ic.ac.uk:9000/i0o1aa5KK-Dm8Tf_LnnPaA] > LB server=[wmslb01.hep.ph.ic.ac.uk:9000] (port is not used, actually...) > > 2011-01-18 23:14:23,342 INFO - iceLBLogger::logEvent() - NS Enqueued OK > Event, qname=[/var/glite/workload_manager/jobdir] - [gridJobID=" > https://wmslb01.hep.ph.ic.ac.uk:9000/i0o1aa5KK-Dm8Tf_LnnPaA" CREAMJobID=" > https://lcgce05.gridpp.rl.ac.uk:8443/CREAM503909568"] > > > Does anybody have an idea what is going on here ? > > Thanks, > > Daniela > > -- > ----------------------------------------------------------- > [log in to unmask] > HEP Group/Physics Dep > Imperial College > Tel: +44-(0)20-75947810 > http://www.hep.ph.ic.ac.uk/~dbauer/<http://www.hep.ph.ic.ac.uk/%7Edbauer/> > > > > -- ----------------------------------------------------------- [log in to unmask] HEP Group/Physics Dep Imperial College Tel: +44-(0)20-75947810 http://www.hep.ph.ic.ac.uk/~dbauer/