Print

Print


Hi,

Well, it's not harmless as whatever it is causes about 50% of the jobs to
fail.
Given that as far as I have worked out these jobs mainly run at RAL (as
implicated in the error message) I was hoping RAL could shed some light on
it.
t2k as a VO is quite new to the grid, so they might hit snags other VOs have
long circumvented.

Cheers,

Daniela

On 19 January 2011 17:17, Alessandra Forti <[log in to unmask]> wrote:

>  Sorry... if it is the result of a job it's because a previous submission
> failed and the reason should be in logging-info. My questions about that are
> in the same thread.
>
> cheers
> alessandra
>
>
> On 19/01/11 17:08, Alessandra Forti wrote:
>
> You might want to look at lcg-rollout, there was a thread about it.
> Apparently it's harmless
>
>
> https://www.jiscmail.ac.uk/cgi-bin/webadmin?A2=ind1101&L=LCG-ROLLOUT&F=&S=&P=38698
>
> cheers
> alessandra
>
>
> On 19/01/11 16:55, Daniela Bauer wrote:
>
> Hi All,
>
> Sorry, lengthy.
>
> I've just had a complaint from t2k that about half of ~400 jobs failed with
>
> Current Status:     Aborted
> Status Reason:       failed (LB query failed)
>
> both at the RAL and Imperial WMS.
>
> Googling this error just seems to suggest it's not uncommon, but no hint
> whatsoever what is wrong.
> My own WMS here seems to be reasonably happy right now, and if I do a grep
> in the log, there is nothing suspicious, at the end it just says:
>
> workload_manager_events.log:18 Jan, 23:03:41 -I: [Info]
> operator()(dispatcher_utils.cpp:218): new jobsubmit for
> https://wmslb01.hep.ph.ic.ac.uk:9000/i0o1aa5KK-Dm8Tf_LnnPaA
> workload_manager_events.log:18 Jan, 23:03:44 -I: [Info]
> checkRequirement(matchmakerISMImpl.cpp:89): MM for job:
> https://wmslb01.hep.ph.ic.ac.uk:9000/i0o1aa5KK-Dm8Tf_LnnPaA (9/12533 [3] )
> workload_manager_events.log:18 Jan, 23:03:44 -I: [Info]
> operator()(submit_request.cpp:478):
> https://wmslb01.hep.ph.ic.ac.uk:9000/i0o1aa5KK-Dm8Tf_LnnPaA delivered
> workload_manager_events.log:18 Jan, 23:14:24 -I: [Info]
> operator()(dispatcher_utils.cpp:218): new jobresubmit for
> https://wmslb01.hep.ph.ic.ac.uk:9000/i0o1aa5KK-Dm8Tf_LnnPaA
> workload_manager_events.log:18 Jan, 23:14:24 -E: [Error]
> unrecoverable(submit_request.cpp:111):
> https://wmslb01.hep.ph.ic.ac.uk:9000/i0o1aa5KK-Dm8Tf_LnnPaA failed (
> failed (LB query failed))
>
> The job went to a cream CE as far as I can tell:
>
> 2011-01-18 23:03:45,057 INFO - iceCommandSubmit::try_to_submit() -
> TID=[140441744] For GridJobID [
> https://wmslb01.hep.ph.ic.ac.uk:9000/i0o1aa5KK-Dm8Tf_LnnPaA] CREAM
> Returned CREAM-JOBID [https://lcgce05.gridpp.rl.ac.uk:8443/CREAM503909568]
> DB_ID []
>
> 2011-01-18 23:03:45,059 DEBUG - iceCommandSubmit::try_to_submit() -
> TID=[140441744] Going to START CreamJobID [
> https://lcgce05.gridpp.rl.ac.uk:8443/CREAM503909568] related to GridJobID
> [https://wmslb01.hep.ph.ic.ac.uk:9000/i0o1aa5KK-Dm8Tf_LnnPaA].<https://wmslb01.hep.ph.ic.ac.uk:9000/i0o1aa5KK-Dm8Tf_LnnPaA%5D.>
> ..
>
> 2011-01-18 23:03:45,191 INFO - iceLBContext::setLoggingJob - Setting log
> job to jobid=[https://wmslb01.hep.ph.ic.ac.uk:9000/i0o1aa5KK-Dm8Tf_LnnPaA]
> LB server=[wmslb01.hep.ph.ic.ac.uk:9000] (port is not used, actually...)
>
> 2011-01-18 23:03:45,191 INFO - iceLBLogger::logEvent() - Cream Transfer OK
> Event - [gridJobID="
> https://wmslb01.hep.ph.ic.ac.uk:9000/i0o1aa5KK-Dm8Tf_LnnPaA" CREAMJobID="
> https://lcgce05.gridpp.rl.ac.uk:8443/CREAM503909568"]
>
> 2011-01-18 23:03:45,214 DEBUG - filelist_request_purger - removing request
> [ Arguments = [ JobAd = [ requirements = ( (
> Member("VO-t2k.org-ND280-v8r5p9",other.GlueHostApplicationSoftwareRunTimeEnvironment)
> && (  !RegExp("shef.ac.uk",other.GlueCEUniqueID) ) &&
> other.GlueCEPolicyMaxCPUTime > 1200 ) && ( other.GlueCEStateStatus ==
> "Production" ) ) &&  !RegExp(".*sdj$",other.GlueCEUniqueID); RetryCount = 3;
> edg_jobid = "https://wmslb01.hep.ph.ic.ac.uk:9000/i0o1aa5KK-Dm8Tf_LnnPaA";
> lrms_type = "torque"; CeApplicationDir = "/stage/sl3-lcg-exp/t2ksgm";
> GlobusResourceContactString = "lcgce05.gridpp.rl.ac.uk:8443/cream-pbs";
> OutputSandboxPath =
> "/var/glite/SandboxDir/i0/https_3a_2f_2fwmslb01.hep.ph.ic.ac.uk_3a9000_2fi0o1aa5KK-Dm8Tf_5fLnnPaA/output";
> ce_id = "lcgce05.gridpp.rl.ac.uk:8443/cream-pbs-grid500M"; MyProxyServer =
> "lcgrbp01.gridpp.rl.ac.uk"; AllowZippedISB = true; QueueName = "grid500M";
> JobType = "normal"; InputSandboxDestFileName = { "geniefc-10-17-46.sh" };
> SignificantAttributes = { "Requirements","Rank","FuzzyRank" }; Executable =
> "geniefc-10-17-46.sh"; OutputSandboxDestURI = { "gsiftp://
> wms01.hep.ph.ic.ac.uk:2811/var/glite/SandboxDir/i0/https_3a_2f_2fwmslb01.hep.ph.ic.ac.uk_3a9000_2fi0o1aa5KK-Dm8Tf_5fLnnPaA/output/stdout
> ","gsiftp://
> wms01.hep.ph.ic.ac.uk:2811/var/glite/SandboxDir/i0/https_3a_2f_2fwmslb01.hep.ph.ic.ac.uk_3a9000_2fi0o1aa5KK-Dm8Tf_5fLnnPaA/output/stderr"
> }; CertificateSubject = "/C=UK/O=eScience/OU=Imperial/L=Physics/CN=james
> dobson"; X509UserProxy =
> "/var/glite/SandboxDir/i0/https_3a_2f_2fwmslb01.hep.ph.ic.ac.uk_3a9000_2fi0o1aa5KK-Dm8Tf_5fLnnPaA/user.proxy";
> StdOutput = "stdout"; VOMS_FQAN = "/t2k.org/Role=NULL/Capability=NULL";
> OutputSandbox = { "stdout","stderr" }; LB_sequence_code =
> "UI=000000:NS=0000000004:WM=000004:BH=0000000000:JSS=000000:LM=000000:LRMS=000000:APP=000000:LBS=000000";
> InputSandboxPath =
> "/var/glite/SandboxDir/i0/https_3a_2f_2fwmslb01.hep.ph.ic.ac.uk_3a9000_2fi0o1aa5KK-Dm8Tf_5fLnnPaA/input";
> VirtualOrganisation = "t2k.org"; rank =
> -other.GlueCEStateEstimatedResponseTime; Type = "job"; ShallowRetryCount =
> 10; CeRequirements = "true && ( true && ( true &&
> Member(\"VO-t2k.org-ND280-v8r5p9\",other.GlueHostApplicationSoftwareRunTimeEnvironment)
> && other.GlueCEPolicyMaxCPUTime > 1200 ) )"; StdError = "stderr";
> WMPInputSandboxBaseURI = "gsiftp://
> wms01.hep.ph.ic.ac.uk:2811/var/glite/SandboxDir/i0/https_3a_2f_2fwmslb01.hep.ph.ic.ac.uk_3a9000_2fi0o1aa5KK-Dm8Tf_5fLnnPaA";
> DefaultRank =  -other.GlueCEStateEstimatedResponseTime; ReallyRunningToken =
> "gsiftp://
> wms01.hep.ph.ic.ac.uk/var/glite/SandboxDir/i0/https_3a_2f_2fwmslb01.hep.ph.ic.ac.uk_3a9000_2fi0o1aa5KK-Dm8Tf_5fLnnPaA/token.txt";
> ZippedISB = { "ISBfiles_S6Q0LX8bTIVIlI7tUg_q9A_0.tar.gz" }; InputSandbox = {
> "gsiftp://
> wms01.hep.ph.ic.ac.uk:2811/var/glite/SandboxDir/i0/https_3a_2f_2fwmslb01.hep.ph.ic.ac.uk_3a9000_2fi0o1aa5KK-Dm8Tf_5fLnnPaA/input/geniefc-10-17-46.sh
> ","gsiftp://
> wms01.hep.ph.ic.ac.uk:2811/var/glite/SandboxDir/i0/https_3a_2f_2fwmslb01.hep.ph.ic.ac.uk_3a9000_2fi0o1aa5KK-Dm8Tf_5fLnnPaA/input/.BrokerInfo"
> } ] ]; Command = "Submit"; Source = 2; Protocol = "1.0.0" ]
>
> 2011-01-18 23:14:23,261 DEBUG - iceCommandStatusPoller::update_single_job()
> - Updating ICE's database for gridJobID="
> https://wmslb01.hep.ph.ic.ac.uk:9000/i0o1aa5KK-Dm8Tf_LnnPaA" CREAMJobID="
> https://lcgce05.gridpp.rl.ac.uk:8443/CREAM503909568" status = [REGISTERED]
> exit_code = [] failure_reason = [] description = []
>
> 2011-01-18 23:14:23,262 DEBUG - iceCommandStatusPoller::update_single_job()
> - Updating ICE's database for gridJobID="
> https://wmslb01.hep.ph.ic.ac.uk:9000/i0o1aa5KK-Dm8Tf_LnnPaA" CREAMJobID="
> https://lcgce05.gridpp.rl.ac.uk:8443/CREAM503909568" status = [PENDING]
> exit_code = [] failure_reason = [] description = []
>
> ----> Now the next bit I would interpret as the job actually having failed
> to at the local batch system level, but why does this error not get
> propagated ?
>
> 2011-01-18 23:14:23,263 DEBUG - iceCommandStatusPoller::update_single_job()
> - Updating ICE's database for gridJobID="
> https://wmslb01.hep.ph.ic.ac.uk:9000/i0o1aa5KK-Dm8Tf_LnnPaA" CREAMJobID="
> https://lcgce05.gridpp.rl.ac.uk:8443/CREAM503909568" status = [ABORTED]
> exit_code = [] failure_reason = [BLAH error: submission command failed (exit
> code = 1) (stdout:) (stderr:pbs_iff: cannot read reply from pbs_server-No
> Permission.-qsub: cannot connect to server lcgbatch01.gridpp.rl.ac.uk(errno=15007) Unauthorized Request -TERM environment variable not set.-) N/A
> (jobId = CREAM503909568)] description = []
>
> 2011-01-18 23:14:23,264 INFO - iceLBContext::setLoggingJob - Setting log
> job to jobid=[https://wmslb01.hep.ph.ic.ac.uk:9000/i0o1aa5KK-Dm8Tf_LnnPaA]
> LB server=[wmslb01.hep.ph.ic.ac.uk:9000] (port is not used, actually...)
>
> 2011-01-18 23:14:23,264 INFO - iceLBLogger::logEvent() - Job Aborted Event,
> reason=[BLAH error: submission command failed (exit code = 1) (stdout:)
> (stderr:pbs_iff: cannot read reply from pbs_server-No Permission.-qsub:
> cannot connect to server lcgbatch01.gridpp.rl.ac.uk (errno=15007)
> Unauthorized Request -TERM environment variable not set.-) N/A (jobId =
> CREAM503909568)] - [gridJobID="
> https://wmslb01.hep.ph.ic.ac.uk:9000/i0o1aa5KK-Dm8Tf_LnnPaA" CREAMJobID="
> https://lcgce05.gridpp.rl.ac.uk:8443/CREAM503909568"]
>
> 2011-01-18 23:14:23,301 DEBUG - Ice::resubmit_or_purge_job() - Removing
> purged job [gridJobID="
> https://wmslb01.hep.ph.ic.ac.uk:9000/i0o1aa5KK-Dm8Tf_LnnPaA" CREAMJobID="
> https://lcgce05.gridpp.rl.ac.uk:8443/CREAM503909568"] from ICE's database
>
> 2011-01-18 23:14:23,315 INFO - iceLBContext::setLoggingJob - Setting log
> job to jobid=[https://wmslb01.hep.ph.ic.ac.uk:9000/i0o1aa5KK-Dm8Tf_LnnPaA]
> LB server=[wmslb01.hep.ph.ic.ac.uk:9000] (port is not used, actually...)
>
> 2011-01-18 23:14:23,316 INFO - iceLBLogger::logEvent() - ICE Resubmission
> Event, reason=[Job resubmitted by ICE] - [gridJobID="
> https://wmslb01.hep.ph.ic.ac.uk:9000/i0o1aa5KK-Dm8Tf_LnnPaA" CREAMJobID="
> https://lcgce05.gridpp.rl.ac.uk:8443/CREAM503909568"]
>
> 2011-01-18 23:14:23,329 INFO - iceLBContext::setLoggingJob - Setting log
> job to jobid=[https://wmslb01.hep.ph.ic.ac.uk:9000/i0o1aa5KK-Dm8Tf_LnnPaA]
> LB server=[wmslb01.hep.ph.ic.ac.uk:9000] (port is not used, actually...)
>
> 2011-01-18 23:14:23,329 INFO - iceLBLogger::logEvent() - NS Enqueued Start
> Event, qname=[/var/glite/workload_manager/jobdir] - [gridJobID="
> https://wmslb01.hep.ph.ic.ac.uk:9000/i0o1aa5KK-Dm8Tf_LnnPaA" CREAMJobID="
> https://lcgce05.gridpp.rl.ac.uk:8443/CREAM503909568"]
>
> 2011-01-18 23:14:23,341 INFO - Ice::resubmit_job() - Putting [[ arguments =
> [ id = "https://wmslb01.hep.ph.ic.ac.uk:9000/i0o1aa5KK-Dm8Tf_LnnPaA";
> lb_sequence_code =
> "UI=000000:NS=0000000004:WM=000004:BH=0000000000:JSS=000002:LM=000010:LRMS=000000:APP=000000:LBS=000000"
> ]; command = "jobresubmit"; version = "1.0.0" ]] to WM's Input file
>
> 2011-01-18 23:14:23,341 INFO - iceLBContext::setLoggingJob - Setting log
> job to jobid=[https://wmslb01.hep.ph.ic.ac.uk:9000/i0o1aa5KK-Dm8Tf_LnnPaA]
> LB server=[wmslb01.hep.ph.ic.ac.uk:9000] (port is not used, actually...)
>
> 2011-01-18 23:14:23,342 INFO - iceLBLogger::logEvent() - NS Enqueued OK
> Event, qname=[/var/glite/workload_manager/jobdir] - [gridJobID="
> https://wmslb01.hep.ph.ic.ac.uk:9000/i0o1aa5KK-Dm8Tf_LnnPaA" CREAMJobID="
> https://lcgce05.gridpp.rl.ac.uk:8443/CREAM503909568"]
>
>
> Does anybody have an idea what is going on here ?
>
> Thanks,
>
> Daniela
>
> --
> -----------------------------------------------------------
> [log in to unmask]
> HEP Group/Physics Dep
> Imperial College
> Tel: +44-(0)20-75947810
> http://www.hep.ph.ic.ac.uk/~dbauer/<http://www.hep.ph.ic.ac.uk/%7Edbauer/>
>
>
>
>


-- 
-----------------------------------------------------------
[log in to unmask]
HEP Group/Physics Dep
Imperial College
Tel: +44-(0)20-75947810
http://www.hep.ph.ic.ac.uk/~dbauer/