

Hi All,

Sorry, lengthy.

I've just had a complaint from t2k that about half of ~400 jobs failed with
Current Status:     Aborted 
Status Reason: failed (LB query failed)
at both the RAL and the Imperial WMS.

Googling this error only suggests that it is not uncommon; I found no hint whatsoever as to what is actually wrong.
My own WMS here seems to be reasonably happy right now, and a grep of the log shows nothing suspicious. At the end it just says:

workload_manager_events.log:18 Jan, 23:03:41 -I: [Info] operator()(dispatcher_utils.cpp:218): new jobsubmit for
workload_manager_events.log:18 Jan, 23:03:44 -I: [Info] checkRequirement(matchmakerISMImpl.cpp:89): MM for job: (9/12533 [3] )
workload_manager_events.log:18 Jan, 23:03:44 -I: [Info] operator()(submit_request.cpp:478): delivered
workload_manager_events.log:18 Jan, 23:14:24 -I: [Info] operator()(dispatcher_utils.cpp:218): new jobresubmit for
workload_manager_events.log:18 Jan, 23:14:24 -E: [Error] unrecoverable(submit_request.cpp:111): failed ( failed (LB query failed))
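
To check whether the other failed jobs end the same way, the grep could be scripted. This is only a rough sketch: the regular expression is inferred from the five lines above and may not cover every line the WM writes, and the names WM_LINE, parse_wm_line and error_counts are made up for illustration.

```python
import re
from collections import Counter

# Line layout inferred from the excerpt above (an assumption, not a spec):
#   "<day> <Mon>, <HH:MM:SS> -<X>: [<Level>] <function>(<file>:<line>): <message>"
WM_LINE = re.compile(
    r"(?P<time>\d{1,2} \w{3}, \d{2}:\d{2}:\d{2}) "
    r"-(?P<sev>[A-Z]): \[(?P<level>\w+)\] "
    r"(?P<func>.+?)\((?P<src>[\w.]+):(?P<line>\d+)\): "
    r"(?P<msg>.*)"
)


def parse_wm_line(line):
    """Return the fields of one log line as a dict, or None if it does not match."""
    m = WM_LINE.search(line)
    return m.groupdict() if m else None


def error_counts(lines):
    """Count how often each [Error] message occurs in the given lines."""
    counts = Counter()
    for line in lines:
        rec = parse_wm_line(line)
        if rec and rec["level"] == "Error":
            counts[rec["msg"].strip()] += 1
    return counts
```

Feeding it the lines of workload_manager_events.log (for example via open(...)) and printing error_counts(...).most_common() would show whether "LB query failed" is the only error or just the most visible one.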

The job went to a CREAM CE as far as I can tell:

2011-01-18 23:03:45,057 INFO - iceCommandSubmit::try_to_submit() -  TID=[140441744] For GridJobID [] CREAM Returned CREAM-JOBID [] DB_ID []

2011-01-18 23:03:45,059 DEBUG - iceCommandSubmit::try_to_submit() -  TID=[140441744] Going to START CreamJobID [] related to GridJobID []...

2011-01-18 23:03:45,191 INFO - iceLBContext::setLoggingJob - Setting log job to jobid=[] LB server=[] (port is not used, actually...)

2011-01-18 23:03:45,191 INFO - iceLBLogger::logEvent() - Cream Transfer OK Event - [gridJobID="" CREAMJobID=""]

2011-01-18 23:03:45,214 DEBUG - filelist_request_purger - removing request [ Arguments = [ JobAd = [ requirements = ( ( Member("",other.GlueHostApplicationSoftwareRunTimeEnvironment) && (  !RegExp("",other.GlueCEUniqueID) ) && other.GlueCEPolicyMaxCPUTime > 1200 ) && ( other.GlueCEStateStatus == "Production" ) ) &&  !RegExp(".*sdj$",other.GlueCEUniqueID); RetryCount = 3; edg_jobid = ""; lrms_type = "torque"; CeApplicationDir = "/stage/sl3-lcg-exp/t2ksgm"; GlobusResourceContactString = ""; OutputSandboxPath = "/var/glite/SandboxDir/i0/"; ce_id = ""; MyProxyServer = ""; AllowZippedISB = true; QueueName = "grid500M"; JobType = "normal"; InputSandboxDestFileName = { "" }; SignificantAttributes = { "Requirements","Rank","FuzzyRank" }; Executable = ""; OutputSandboxDestURI = { "gsi","gsi" }; CertificateSubject = "/C=UK/O=eScience/OU=Imperial/L=Physics/CN=james dobson"; X509UserProxy = "/var/glite/SandboxDir/i0/"; StdOutput = "stdout"; VOMS_FQAN = "/"; OutputSandbox = { "stdout","stderr" }; LB_sequence_code = "UI=000000:NS=0000000004:WM=000004:BH=0000000000:JSS=000000:LM=000000:LRMS=000000:APP=000000:LBS=000000"; InputSandboxPath = "/var/glite/SandboxDir/i0/"; VirtualOrganisation = ""; rank =  -other.GlueCEStateEstimatedResponseTime; Type = "job"; ShallowRetryCount = 10; CeRequirements = "true && ( true && ( true && Member(\"\",other.GlueHostApplicationSoftwareRunTimeEnvironment) && other.GlueCEPolicyMaxCPUTime > 1200 ) )"; StdError = "stderr"; WMPInputSandboxBaseURI = "gsi"; DefaultRank =  -other.GlueCEStateEstimatedResponseTime; ReallyRunningToken = "gsi"; ZippedISB = { "ISBfiles_S6Q0LX8bTIVIlI7tUg_q9A_0.tar.gz" }; InputSandbox = { "gsi","gsi" } ] ]; Command = "Submit"; Source = 2; Protocol = "1.0.0" ]

2011-01-18 23:14:23,261 DEBUG - iceCommandStatusPoller::update_single_job() - Updating ICE's database for gridJobID="" CREAMJobID="" status = [REGISTERED] exit_code = [] failure_reason = [] description = []

2011-01-18 23:14:23,262 DEBUG - iceCommandStatusPoller::update_single_job() - Updating ICE's database for gridJobID="" CREAMJobID="" status = [PENDING] exit_code = [] failure_reason = [] description = []

----> The next bit I would interpret as the job having actually failed at the local batch system level, but why does this error not get propagated?

2011-01-18 23:14:23,263 DEBUG - iceCommandStatusPoller::update_single_job() - Updating ICE's database for gridJobID="" CREAMJobID="" status = [ABORTED] exit_code = [] failure_reason = [BLAH error: submission command failed (exit code = 1) (stdout:) (stderr:pbs_iff: cannot read reply from pbs_server-No Permission.-qsub: cannot connect to server (errno=15007) Unauthorized Request -TERM environment variable not set.-) N/A (jobId = CREAM503909568)] description = []

2011-01-18 23:14:23,264 INFO - iceLBContext::setLoggingJob - Setting log job to jobid=[] LB server=[] (port is not used, actually...)

2011-01-18 23:14:23,264 INFO - iceLBLogger::logEvent() - Job Aborted Event, reason=[BLAH error: submission command failed (exit code = 1) (stdout:) (stderr:pbs_iff: cannot read reply from pbs_server-No Permission.-qsub: cannot connect to server (errno=15007) Unauthorized Request -TERM environment variable not set.-) N/A (jobId = CREAM503909568)] - [gridJobID="" CREAMJobID=""]
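
Since the real reason only shows up in the ICE log, one way to see whether the other ~200 failures share it would be to tally the failure_reason of every ABORTED status update. A minimal sketch, assuming the status poller lines always have the shape shown above; ICE_STATUS, JOB_ID and aborted_reasons are names invented here, not part of ICE.

```python
import re
from collections import Counter

# Shape taken from the iceCommandStatusPoller lines above (an assumption).
ICE_STATUS = re.compile(
    r"status = \[(?P<status>\w+)\].*?"
    r"failure_reason = \[(?P<reason>.*)\] description = \["
)
# The per-job "(jobId = ...)" suffix is removed so identical reasons group together.
JOB_ID = re.compile(r"\s*\(jobId = [^)]*\)")


def aborted_reasons(lines):
    """Count the failure reasons of all ABORTED status updates in the given lines."""
    counts = Counter()
    for line in lines:
        m = ICE_STATUS.search(line)
        if m and m.group("status") == "ABORTED":
            reason = JOB_ID.sub("", m.group("reason")).strip()
            counts[reason] += 1
    return counts
```

If nearly all aborted jobs report the same BLAH/qsub error, that would point at the CE's batch system rather than at the WMS or LB.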

2011-01-18 23:14:23,301 DEBUG - Ice::resubmit_or_purge_job() - Removing purged job [gridJobID="" CREAMJobID=""] from ICE's database

2011-01-18 23:14:23,315 INFO - iceLBContext::setLoggingJob - Setting log job to jobid=[] LB server=[] (port is not used, actually...)

2011-01-18 23:14:23,316 INFO - iceLBLogger::logEvent() - ICE Resubmission Event, reason=[Job resubmitted by ICE] - [gridJobID="" CREAMJobID=""]

2011-01-18 23:14:23,329 INFO - iceLBContext::setLoggingJob - Setting log job to jobid=[] LB server=[] (port is not used, actually...)

2011-01-18 23:14:23,329 INFO - iceLBLogger::logEvent() - NS Enqueued Start Event, qname=[/var/glite/workload_manager/jobdir] - [gridJobID="" CREAMJobID=""]

2011-01-18 23:14:23,341 INFO - Ice::resubmit_job() - Putting [[ arguments = [ id = ""; lb_sequence_code = "UI=000000:NS=0000000004:WM=000004:BH=0000000000:JSS=000002:LM=000010:LRMS=000000:APP=000000:LBS=000000" ]; command = "jobresubmit"; version = "1.0.0" ]] to WM's Input file

2011-01-18 23:14:23,341 INFO - iceLBContext::setLoggingJob - Setting log job to jobid=[] LB server=[] (port is not used, actually...)

2011-01-18 23:14:23,342 INFO - iceLBLogger::logEvent() - NS Enqueued OK Event, qname=[/var/glite/workload_manager/jobdir] - [gridJobID="" CREAMJobID=""]

Does anybody have an idea what is going on here?



HEP Group/Physics Dep
Imperial College
Tel: +44-(0)20-75947810