Hi Catalin,

I checked that, but most of the jobs that went to the 1000M queue also failed. The 5% that made it did, however, all go to that queue. This is surprising, since last week most jobs of the same kind were passing fine. Is the memory limit being applied more strictly now?

A more general question: when a job exceeds its memory limit, shouldn't it fail instead of being cancelled? At least then the user could get the reason for the failure.

Cheers,
Gustav

2011/6/2 Catalin Condurache <[log in to unmask]>:
> Hi Gustav,
>
> I had a look at that specific job. It was cancelled when it attempted to use more than 500M (resources_used.mem, see below). It is possible that the logging available to you hides the real cause of the abort. You should choose a larger queue for your jobs, 700M or 1000M.
>
> 06/02/2011 04:52:23;E;15490141.lcgbatch01.gridpp.rl.ac.uk;user=t2k028 group=t2k jobname=cre05_840367115 queue=grid500M
> ctime=1306986008 qtime=1306986008 etime=1306986008 start=1306986159 [log in to unmask] exec_host=lcg1278.gridpp.rl.ac.uk/5 Resource_List.cput=60:00:00 Resource_List.neednodes=lcg1278.gridpp.rl.ac.uk Resource_List.opsys=sl5 Resource_List.pcput=60:00:00 Resource_List.pmem=500mb Resource_List.walltime=72:00:00 session=11047 end=1306986743 Exit_status=271 resources_used.cput=00:22:18 resources_used.mem=548468kb resources_used.vmem=1563948kb resources_used.walltime=00:28:07
>
> Regards,
> Catalin Condurache
> RAL Tier1 Grid Services
>
>
>> -----Original Message-----
>> From: Testbed Support for GridPP member institutes [mailto:TB-
>> [log in to unmask]] On Behalf Of Gustav Wikström
>> Sent: 02 June 2011 10:54
>> To: [log in to unmask]
>> Subject: Re: 95% of jobs getting cancelled
>>
>> Hi Stuart,
>>
>> Yes, it turns out there is quite a lot of information available. One example job, with glite-wms-job-logging-info --verbosity 3, is below.
>>
>> There seems to be some retrying going on, but in the final step (at the bottom) the job runs for 13 mins before being cancelled by the LogMonitor, but by then the status of the job is DONE.
>>
>> Maybe you can make more out of this?
>>
>> Gustav
>>
>>
>> ===================== glite-job-logging-info Success
>> =====================
>>
>> LOGGING INFORMATION:
>>
>> Printing info for the Job :
>> https://lcglb01.gridpp.rl.ac.uk:9000/ZIhUvnWm77QBohlunQlXPw
>>
>> ---
>> Event: RegJob
>> - Arrived = Thu Jun 2 05:39:57 2011 CEST
>> - Host = lcgwms03.gridpp.rl.ac.uk
>> - Jobtype = SIMPLE
>> - Level = SYSTEM
>> - Ns =
>> https://lcgwms03.gridpp.rl.ac.uk:7443/glite_wms_wmproxy_server
>> - Nsubjobs = 0
>> - Priority = synchronous
>> - Seqcode =
>> UI=000000:NS=0000000001:WM=000000:BH=0000000000:JSS=000000:LM=000000:LR
>> MS=000000:APP=000000:LBS=000000
>> - Source = NetworkServer
>> - Src instance =
>> https://lcgwms03.gridpp.rl.ac.uk:7443/glite_wms_wmproxy_server
>> - Timestamp = Thu Jun 2 05:39:57 2011 CEST
>> - User = /DC=ch/DC=cern/OU=Organic
>> Units/OU=Users/CN=lwikstro/CN=627993/CN=Gustav Wikstrom
>> - Jdl =
>> SandboxDir/ZI/https_3a_2f_2flcglb01.gridpp.rl.ac.uk_3a9000_2fZIhUvnWm77
>> QBohlunQlXPw/JDLToStart
>> ---
>> Event: RegJob
>> - Arrived = Thu Jun 2 05:39:58 2011 CEST
>> - Host = lcgwms03.gridpp.rl.ac.uk
>> - Jobtype = SIMPLE
>> - Level = SYSTEM
>> - Ns =
>> https://lcgwms03.gridpp.rl.ac.uk:7443/glite_wms_wmproxy_server
>> - Nsubjobs = 0
>> - Priority = synchronous
>> - Seqcode =
>> UI=000000:NS=0000000001:WM=000000:BH=0000000000:JSS=000000:LM=000000:LR
>> MS=000000:APP=000000:LBS=000000
>> - Source = NetworkServer
>> - Src instance =
>> https://lcgwms03.gridpp.rl.ac.uk:7443/glite_wms_wmproxy_server
>> - Timestamp = Thu Jun 2 05:39:57 2011 CEST
>> - User = /DC=ch/DC=cern/OU=Organic
>> Units/OU=Users/CN=lwikstro/CN=627993/CN=Gustav Wikstrom
>> - Jdl =
>> SandboxDir/ZI/https_3a_2f_2flcglb01.gridpp.rl.ac.uk_3a9000_2fZIhUvnWm77
>> QBohlunQlXPw/JDLToStart
>> --- >> Event: Accepted >> - Arrived = Thu Jun 2 05:40:02 2011 CEST >> - From = NetworkServer >> - From host = atlas009.unige.ch >> - Host = lcgwms03.gridpp.rl.ac.uk >> - Level = SYSTEM >> - Priority = synchronous >> - Seqcode = >> UI=000000:NS=0000000002:WM=000000:BH=0000000000:JSS=000000:LM=000000:LR >> MS=000000:APP=000000:LBS=000000 >> - Source = NetworkServer >> - Src instance = >> https://lcgwms03.gridpp.rl.ac.uk:7443/glite_wms_wmproxy_server >> - Timestamp = Thu Jun 2 05:40:01 2011 CEST >> - User = /DC=ch/DC=cern/OU=Organic >> Units/OU=Users/CN=lwikstro/CN=627993/CN=Gustav Wikstrom >> --- >> Event: EnQueued >> - Arrived = Thu Jun 2 05:40:02 2011 CEST >> - Host = lcgwms03.gridpp.rl.ac.uk >> - Level = SYSTEM >> - Priority = synchronous >> - Queue = /var/glite/workload_manager/jobdir >> - Result = START >> - Seqcode = >> UI=000000:NS=0000000003:WM=000000:BH=0000000000:JSS=000000:LM=000000:LR >> MS=000000:APP=000000:LBS=000000 >> - Source = NetworkServer >> - Src instance = >> https://lcgwms03.gridpp.rl.ac.uk:7443/glite_wms_wmproxy_server >> - Timestamp = Thu Jun 2 05:40:02 2011 CEST >> - User = /DC=ch/DC=cern/OU=Organic >> Units/OU=Users/CN=lwikstro/CN=627993/CN=Gustav Wikstrom >> - Job = >> /var/glite/SandboxDir/ZI/https_3a_2f_2flcglb01.gridpp.rl.ac.uk_3a9000_2 >> fZIhUvnWm77QBohlunQlXPw/JDLToStart >> --- >> Event: EnQueued >> - Arrived = Thu Jun 2 05:40:02 2011 CEST >> - Host = lcgwms03.gridpp.rl.ac.uk >> - Level = SYSTEM >> - Priority = synchronous >> - Queue = /var/glite/workload_manager/jobdir >> - Result = OK >> - Seqcode = >> UI=000000:NS=0000000004:WM=000000:BH=0000000000:JSS=000000:LM=000000:LR >> MS=000000:APP=000000:LBS=000000 >> - Source = NetworkServer >> - Src instance = >> https://lcgwms03.gridpp.rl.ac.uk:7443/glite_wms_wmproxy_server >> - Timestamp = Thu Jun 2 05:40:02 2011 CEST >> - User = /DC=ch/DC=cern/OU=Organic >> Units/OU=Users/CN=lwikstro/CN=627993/CN=Gustav Wikstrom >> - Job = >> >> [ >> RetryCount = 3; >> LB_sequence_code = >> 
"UI=000000:NS=0000000004:WM=000000:BH=0000000000:JSS=000000:LM=000000:L >> RMS=000000:APP=000000:LBS=000000"; >> edg_jobid = >> "https://lcglb01.gridpp.rl.ac.uk:9000/ZIhUvnWm77QBohlunQlXPw"; >> Arguments = "-v v9r7p9 -i >> lfn:/grid/t2k.org/nd280/production004/B/rdp/ND280/00006000_00006999/rec >> o/oa_nd_cos_00006945-0166_muaf23qoshfp_reco_000_v9r7p5.root >> -e cosmic -p 4C -t rdp -m oaAnalysis"; >> CertificateSubject = "/DC=ch/DC=cern/OU=Organic >> Units/OU=Users/CN=lwikstro/CN=627993/CN=Gustav Wikstrom"; >> MyProxyServer = "lcgrbp01.gridpp.rl.ac.uk"; >> JobType = "normal"; >> Executable = "ND280Raw_process.py"; >> VirtualOrganisation = "t2k.org"; >> SignificantAttributes = { "Requirements","Rank","FuzzyRank" }; >> InputSandbox = { >> "gsiftp://lcgwms03.gridpp.rl.ac.uk:2811/var/glite/SandboxDir/ZI/https_3 >> a_2f_2flcglb01.gridpp.rl.ac.uk_3a9000_2fZIhUvnWm77QBohlunQlXPw/input/ND >> 280Configs.py","gsiftp://lcgwms03.gridpp.rl.ac.uk:2811/var/glite/Sandbo >> xDir/ZI/https_3a_2f_2flcglb01.gridpp.rl.ac.uk_3a9000_2fZIhUvnWm77QBohlu >> nQlXPw/input/ND280GRID.py","gsiftp://lcgwms03.gridpp.rl.ac.uk:2811/var/ >> glite/SandboxDir/ZI/https_3a_2f_2flcglb01.gridpp.rl.ac.uk_3a9000_2fZIhU >> vnWm77QBohlunQlXPw/input/ND280Job.py","gsiftp://lcgwms03.gridpp.rl.ac.u >> k:2811/var/glite/SandboxDir/ZI/https_3a_2f_2flcglb01.gridpp.rl.ac.uk_3a >> 9000_2fZIhUvnWm77QBohlunQlXPw/input/ND280Software.py","gsiftp://lcgwms0 >> 3.gridpp.rl.ac.uk:2811/var/glite/SandboxDir/ZI/https_3a_2f_2flcglb01.gr >> idpp.rl.ac.uk_3a9000_2fZIhUvnWm77QBohlunQlXPw/input/pexpect.py","gsiftp >> ://lcgwms03.gridpp.rl.ac.uk:2811/var/glite/SandboxDir/ZI/https_3a_2f_2f >> lcglb01.gridpp.rl.ac.uk_3a9000_2fZIhUvnWm77QBohlunQlXPw/input/ND280Raw_ >> process.py" >> }; >> StdOutput = "ND280Raw.out"; >> ShallowRetryCount = 10; >> InputSandboxDestFileName = { >> "ND280Configs.py","ND280GRID.py","ND280Job.py","ND280Software.py","pexp >> ect.py","ND280Raw_process.py" >> }; >> VOMS_FQAN = 
"/t2k.org/Role=production/Capability=NULL"; >> OutputSandboxPath = >> "/var/glite/SandboxDir/ZI/https_3a_2f_2flcglb01.gridpp.rl.ac.uk_3a9000_ >> 2fZIhUvnWm77QBohlunQlXPw/output"; >> requirements = ( ( >> Member("VO-t2k.org-ND280- >> v9r7p9",other.GlueHostApplicationSoftwareRunTimeEnvironment) >> && other.GlueCEPolicyMaxCPUTime > 600 && >> other.GlueHostMainMemoryRAMSize >= 512 ) && ( other.GlueCEStateStatus >> == "Production" ) ) && !RegExp(".*sdj$",other.GlueCEUniqueID); >> DataRequirements = { >> [ >> DataCatalog = "http://lfc.gridpp.rl.ac.uk:8085/"; >> InputData = { >> "lfn:/grid/t2k.org/nd280/production004/B/rdp/ND280/00006000_00006999/re >> co/oa_nd_cos_00006945-0166_muaf23qoshfp_reco_000_v9r7p5.root" >> }; >> DataCatalogType = "DLI" >> ] }; >> rank = -other.GlueCEStateEstimatedResponseTime; >> Type = "job"; >> OutputSandboxDestURI = { >> "gsiftp://lcgwms03.gridpp.rl.ac.uk:2811/var/glite/SandboxDir/ZI/https_3 >> a_2f_2flcglb01.gridpp.rl.ac.uk_3a9000_2fZIhUvnWm77QBohlunQlXPw/output/N >> D280Raw.out","gsiftp://lcgwms03.gridpp.rl.ac.uk:2811/var/glite/SandboxD >> ir/ZI/https_3a_2f_2flcglb01.gridpp.rl.ac.uk_3a9000_2fZIhUvnWm77QBohlunQ >> lXPw/output/ND280Raw.err","gsiftp://lcgwms03.gridpp.rl.ac.uk:2811/var/g >> lite/SandboxDir/ZI/https_3a_2f_2flcglb01.gridpp.rl.ac.uk_3a9000_2fZIhUv >> nWm77QBohlunQlXPw/output/Raw_cosmic_00006945-0166_v9r7p9.cfg" >> }; >> StdError = "ND280Raw.err"; >> DataAccessProtocol = "gsiftp"; >> DefaultRank = -other.GlueCEStateEstimatedResponseTime; >> WMPInputSandboxBaseURI = >> "gsiftp://lcgwms03.gridpp.rl.ac.uk:2811/var/glite/SandboxDir/ZI/https_3 >> a_2f_2flcglb01.gridpp.rl.ac.uk_3a9000_2fZIhUvnWm77QBohlunQlXPw"; >> AllowZippedISB = true; >> ZippedISB = { "ISBfiles_avupuyXqSP-D8K3jUCJpCA_0.tar.gz" }; >> X509UserProxy = >> "/var/glite/SandboxDir/ZI/https_3a_2f_2flcglb01.gridpp.rl.ac.uk_3a9000_ >> 2fZIhUvnWm77QBohlunQlXPw/user.proxy"; >> InputSandboxPath = >> "/var/glite/SandboxDir/ZI/https_3a_2f_2flcglb01.gridpp.rl.ac.uk_3a9000_ >> 
2fZIhUvnWm77QBohlunQlXPw/input"; >> OutputSandbox = { >> "ND280Raw.out","ND280Raw.err","Raw_cosmic_00006945-0166_v9r7p9.cfg" } >> ] >> --- >> Event: DeQueued >> - Arrived = Thu Jun 2 05:40:02 2011 CEST >> - Host = lcgwms03.gridpp.rl.ac.uk >> - Level = SYSTEM >> - Priority = synchronous >> - Queue = /var/glite/workload_manager/jobdir >> - Seqcode = >> UI=000000:NS=0000000004:WM=000001:BH=0000000000:JSS=000000:LM=000000:LR >> MS=000000:APP=000000:LBS=000000 >> - Source = WorkloadManager >> - Src instance = 10391 >> - Timestamp = Thu Jun 2 05:40:02 2011 CEST >> - User = /DC=ch/DC=cern/OU=Organic >> Units/OU=Users/CN=lwikstro/CN=627993/CN=Gustav >> Wikstrom/CN=proxy/CN=proxy >> --- >> Event: Match >> - Arrived = Thu Jun 2 05:40:02 2011 CEST >> - Dest id = >> lcgce05.gridpp.rl.ac.uk:8443/cream-pbs-grid500M >> - Host = lcgwms03.gridpp.rl.ac.uk >> - Level = SYSTEM >> - Priority = synchronous >> - Seqcode = >> UI=000000:NS=0000000004:WM=000002:BH=0000000000:JSS=000000:LM=000000:LR >> MS=000000:APP=000000:LBS=000000 >> - Source = WorkloadManager >> - Src instance = 10391 >> - Timestamp = Thu Jun 2 05:40:02 2011 CEST >> - User = /DC=ch/DC=cern/OU=Organic >> Units/OU=Users/CN=lwikstro/CN=627993/CN=Gustav >> Wikstrom/CN=proxy/CN=proxy >> --- >> Event: EnQueued >> - Arrived = Thu Jun 2 05:40:03 2011 CEST >> - Host = lcgwms03.gridpp.rl.ac.uk >> - Level = SYSTEM >> - Priority = synchronous >> - Queue = /var/glite/ice/jobdir >> - Result = START >> - Seqcode = >> UI=000000:NS=0000000004:WM=000003:BH=0000000000:JSS=000000:LM=000000:LR >> MS=000000:APP=000000:LBS=000000 >> - Source = WorkloadManager >> - Src instance = 10391 >> - Timestamp = Thu Jun 2 05:40:02 2011 CEST >> - User = /DC=ch/DC=cern/OU=Organic >> Units/OU=Users/CN=lwikstro/CN=627993/CN=Gustav >> Wikstrom/CN=proxy/CN=proxy >> --- >> Event: EnQueued >> - Arrived = Thu Jun 2 05:40:03 2011 CEST >> - Host = lcgwms03.gridpp.rl.ac.uk >> - Level = SYSTEM >> - Priority = synchronous >> - Queue = /var/glite/ice/jobdir >> - Result 
= OK >> - Seqcode = >> UI=000000:NS=0000000004:WM=000004:BH=0000000000:JSS=000000:LM=000000:LR >> MS=000000:APP=000000:LBS=000000 >> - Source = WorkloadManager >> - Src instance = 10391 >> - Timestamp = Thu Jun 2 05:40:03 2011 CEST >> - User = /DC=ch/DC=cern/OU=Organic >> Units/OU=Users/CN=lwikstro/CN=627993/CN=Gustav >> Wikstrom/CN=proxy/CN=proxy >> - Job = >> >> [ >> Arguments = >> [ >> JobAd = >> [ >> RetryCount = 3; >> LB_sequence_code = >> "UI=000000:NS=0000000004:WM=000004:BH=0000000000:JSS=000000:LM=000000:L >> RMS=000000:APP=000000:LBS=000000"; >> ReallyRunningToken = >> "gsiftp://lcgwms03.gridpp.rl.ac.uk/var/glite/SandboxDir/ZI/https_3a_2f_ >> 2flcglb01.gridpp.rl.ac.uk_3a9000_2fZIhUvnWm77QBohlunQlXPw/token.txt"; >> edg_jobid = >> "https://lcglb01.gridpp.rl.ac.uk:9000/ZIhUvnWm77QBohlunQlXPw"; >> lrms_type = "torque"; >> CeRequirements = "true && ( true && ( >> Member(\"VO-t2k.org-ND280- >> v9r7p9\",other.GlueHostApplicationSoftwareRunTimeEnvironment) >> && other.GlueCEPolicyMaxCPUTime > 600 && >> other.GlueHostMainMemoryRAMSize >= 512 ) )"; >> Arguments = "-v v9r7p9 -i >> lfn:/grid/t2k.org/nd280/production004/B/rdp/ND280/00006000_00006999/rec >> o/oa_nd_cos_00006945-0166_muaf23qoshfp_reco_000_v9r7p5.root >> -e cosmic -p 4C -t rdp -m oaAnalysis"; >> CertificateSubject = "/DC=ch/DC=cern/OU=Organic >> Units/OU=Users/CN=lwikstro/CN=627993/CN=Gustav Wikstrom"; >> MyProxyServer = "lcgrbp01.gridpp.rl.ac.uk"; >> ce_id = "lcgce05.gridpp.rl.ac.uk:8443/cream-pbs-grid500M"; >> QueueName = "grid500M"; >> JobType = "normal"; >> Executable = "ND280Raw_process.py"; >> VirtualOrganisation = "t2k.org"; >> SignificantAttributes = { "Requirements","Rank","FuzzyRank" >> }; >> InputSandbox = { >> "gsiftp://lcgwms03.gridpp.rl.ac.uk:2811/var/glite/SandboxDir/ZI/https_3 >> a_2f_2flcglb01.gridpp.rl.ac.uk_3a9000_2fZIhUvnWm77QBohlunQlXPw/input/ND >> 280Configs.py","gsiftp://lcgwms03.gridpp.rl.ac.uk:2811/var/glite/Sandbo >> 
xDir/ZI/https_3a_2f_2flcglb01.gridpp.rl.ac.uk_3a9000_2fZIhUvnWm77QBohlu >> nQlXPw/input/ND280GRID.py","gsiftp://lcgwms03.gridpp.rl.ac.uk:2811/var/ >> glite/SandboxDir/ZI/https_3a_2f_2flcglb01.gridpp.rl.ac.uk_3a9000_2fZIhU >> vnWm77QBohlunQlXPw/input/ND280Job.py","gsiftp://lcgwms03.gridpp.rl.ac.u >> k:2811/var/glite/SandboxDir/ZI/https_3a_2f_2flcglb01.gridpp.rl.ac.uk_3a >> 9000_2fZIhUvnWm77QBohlunQlXPw/input/ND280Software.py","gsiftp://lcgwms0 >> 3.gridpp.rl.ac.uk:2811/var/glite/SandboxDir/ZI/https_3a_2f_2flcglb01.gr >> idpp.rl.ac.uk_3a9000_2fZIhUvnWm77QBohlunQlXPw/input/pexpect.py","gsiftp >> ://lcgwms03.gridpp.rl.ac.uk:2811/var/glite/SandboxDir/ZI/https_3a_2f_2f >> lcglb01.gridpp.rl.ac.uk_3a9000_2fZIhUvnWm77QBohlunQlXPw/input/ND280Raw_ >> process.py","gsiftp://lcgwms03.gridpp.rl.ac.uk:2811/var/glite/SandboxDi >> r/ZI/https_3a_2f_2flcglb01.gridpp.rl.ac.uk_3a9000_2fZIhUvnWm77QBohlunQl >> XPw/input/.BrokerInfo" >> }; >> StdOutput = "ND280Raw.out"; >> ShallowRetryCount = 10; >> VOMS_FQAN = "/t2k.org/Role=production/Capability=NULL"; >> InputSandboxDestFileName = { >> "ND280Configs.py","ND280GRID.py","ND280Job.py","ND280Software.py","pexp >> ect.py","ND280Raw_process.py" >> }; >> OutputSandboxPath = >> "/var/glite/SandboxDir/ZI/https_3a_2f_2flcglb01.gridpp.rl.ac.uk_3a9000_ >> 2fZIhUvnWm77QBohlunQlXPw/output"; >> requirements = ( ( >> Member("VO-t2k.org-ND280- >> v9r7p9",other.GlueHostApplicationSoftwareRunTimeEnvironment) >> && other.GlueCEPolicyMaxCPUTime > 600 && >> other.GlueHostMainMemoryRAMSize >= 512 ) && ( other.GlueCEStateStatus >> == "Production" ) ) && !RegExp(".*sdj$",other.GlueCEUniqueID); >> DataRequirements = { >> [ >> DataCatalog = "http://lfc.gridpp.rl.ac.uk:8085/"; >> InputData = { >> "lfn:/grid/t2k.org/nd280/production004/B/rdp/ND280/00006000_00006999/re >> co/oa_nd_cos_00006945-0166_muaf23qoshfp_reco_000_v9r7p5.root" >> }; >> DataCatalogType = "DLI" >> ] }; >> rank = -other.GlueCEStateEstimatedResponseTime; >> Type = "job"; >> OutputSandboxDestURI = 
{ >> "gsiftp://lcgwms03.gridpp.rl.ac.uk:2811/var/glite/SandboxDir/ZI/https_3 >> a_2f_2flcglb01.gridpp.rl.ac.uk_3a9000_2fZIhUvnWm77QBohlunQlXPw/output/N >> D280Raw.out","gsiftp://lcgwms03.gridpp.rl.ac.uk:2811/var/glite/SandboxD >> ir/ZI/https_3a_2f_2flcglb01.gridpp.rl.ac.uk_3a9000_2fZIhUvnWm77QBohlunQ >> lXPw/output/ND280Raw.err","gsiftp://lcgwms03.gridpp.rl.ac.uk:2811/var/g >> lite/SandboxDir/ZI/https_3a_2f_2flcglb01.gridpp.rl.ac.uk_3a9000_2fZIhUv >> nWm77QBohlunQlXPw/output/Raw_cosmic_00006945-0166_v9r7p9.cfg" >> }; >> StdError = "ND280Raw.err"; >> DataAccessProtocol = "gsiftp"; >> DefaultRank = -other.GlueCEStateEstimatedResponseTime; >> WMPInputSandboxBaseURI = >> "gsiftp://lcgwms03.gridpp.rl.ac.uk:2811/var/glite/SandboxDir/ZI/https_3 >> a_2f_2flcglb01.gridpp.rl.ac.uk_3a9000_2fZIhUvnWm77QBohlunQlXPw"; >> CeApplicationDir = "/stage/sl3-lcg-exp/t2ksgm"; >> ZippedISB = { "ISBfiles_avupuyXqSP-D8K3jUCJpCA_0.tar.gz" }; >> AllowZippedISB = true; >> X509UserProxy = >> "/var/glite/SandboxDir/ZI/https_3a_2f_2flcglb01.gridpp.rl.ac.uk_3a9000_ >> 2fZIhUvnWm77QBohlunQlXPw/user.proxy"; >> GlobusResourceContactString = >> "lcgce05.gridpp.rl.ac.uk:8443/cream-pbs"; >> InputSandboxPath = >> "/var/glite/SandboxDir/ZI/https_3a_2f_2flcglb01.gridpp.rl.ac.uk_3a9000_ >> 2fZIhUvnWm77QBohlunQlXPw/input"; >> OutputSandbox = { >> "ND280Raw.out","ND280Raw.err","Raw_cosmic_00006945-0166_v9r7p9.cfg" } >> ] >> ]; >> Command = "Submit"; >> Source = 2; >> Protocol = "1.0.0" >> ] >> --- >> Event: DeQueued >> - Arrived = Thu Jun 2 05:40:04 2011 CEST >> - Host = lcgwms03.gridpp.rl.ac.uk >> - Level = SYSTEM >> - Local jobid = >> https://lcglb01.gridpp.rl.ac.uk:9000/ZIhUvnWm77QBohlunQlXPw >> - Priority = synchronous >> - Queue = /var/glite/ice/jobdir >> - Seqcode = >> UI=000000:NS=0000000004:WM=000004:BH=0000000000:JSS=000001:LM=000000:LR >> MS=000000:APP=000000:LBS=000000 >> - Source = JobController >> - Timestamp = Thu Jun 2 05:40:03 2011 CEST >> - User = /DC=ch/DC=cern/OU=Organic >> 
Units/OU=Users/CN=lwikstro/CN=627993/CN=Gustav >> Wikstrom/CN=proxy/CN=proxy >> --- >> Event: Transfer >> - Arrived = Thu Jun 2 05:40:04 2011 CEST >> - Dest host = >> https://lcgce05.gridpp.rl.ac.uk:8443/ce-cream/services/CREAM2 >> - Dest instance = unavailable >> - Dest jobid = unavailable >> - Destination = LRMS >> - Host = lcgwms03.gridpp.rl.ac.uk >> - Level = SYSTEM >> - Priority = synchronous >> - Reason = unavailable >> - Result = START >> - Seqcode = >> UI=000000:NS=0000000004:WM=000004:BH=0000000000:JSS=000002:LM=000001:LR >> MS=000000:APP=000000:LBS=000000 >> - Source = LogMonitor >> - Timestamp = Thu Jun 2 05:40:04 2011 CEST >> - User = /DC=ch/DC=cern/OU=Organic >> Units/OU=Users/CN=lwikstro/CN=627993/CN=Gustav >> Wikstrom/CN=proxy/CN=proxy >> - Job = >> >> [ >> RetryCount = 3; >> LB_sequence_code = >> "UI=000000:NS=0000000004:WM=000004:BH=0000000000:JSS=000002:LM=000000:L >> RMS=000000:APP=000000:LBS=000000"; >> ReallyRunningToken = >> "gsiftp://lcgwms03.gridpp.rl.ac.uk/var/glite/SandboxDir/ZI/https_3a_2f_ >> 2flcglb01.gridpp.rl.ac.uk_3a9000_2fZIhUvnWm77QBohlunQlXPw/token.txt"; >> edg_jobid = >> "https://lcglb01.gridpp.rl.ac.uk:9000/ZIhUvnWm77QBohlunQlXPw"; >> lrms_type = "torque"; >> CeRequirements = "true && ( true && ( >> Member(\"VO-t2k.org-ND280- >> v9r7p9\",other.GlueHostApplicationSoftwareRunTimeEnvironment) >> && other.GlueCEPolicyMaxCPUTime > 600 && >> other.GlueHostMainMemoryRAMSize >= 512 ) )"; >> Arguments = "-v v9r7p9 -i >> lfn:/grid/t2k.org/nd280/production004/B/rdp/ND280/00006000_00006999/rec >> o/oa_nd_cos_00006945-0166_muaf23qoshfp_reco_000_v9r7p5.root >> -e cosmic -p 4C -t rdp -m oaAnalysis"; >> CertificateSubject = "/DC=ch/DC=cern/OU=Organic >> Units/OU=Users/CN=lwikstro/CN=627993/CN=Gustav Wikstrom"; >> MyProxyServer = "lcgrbp01.gridpp.rl.ac.uk"; >> ce_id = "lcgce05.gridpp.rl.ac.uk:8443/cream-pbs-grid500M"; >> QueueName = "grid500M"; >> JobType = "normal"; >> Executable = "ND280Raw_process.py"; >> VirtualOrganisation = "t2k.org"; >> 
SignificantAttributes = { "Requirements","Rank","FuzzyRank" }; >> InputSandbox = { >> "gsiftp://lcgwms03.gridpp.rl.ac.uk:2811/var/glite/SandboxDir/ZI/https_3 >> a_2f_2flcglb01.gridpp.rl.ac.uk_3a9000_2fZIhUvnWm77QBohlunQlXPw/input/ND >> 280Configs.py","gsiftp://lcgwms03.gridpp.rl.ac.uk:2811/var/glite/Sandbo >> xDir/ZI/https_3a_2f_2flcglb01.gridpp.rl.ac.uk_3a9000_2fZIhUvnWm77QBohlu >> nQlXPw/input/ND280GRID.py","gsiftp://lcgwms03.gridpp.rl.ac.uk:2811/var/ >> glite/SandboxDir/ZI/https_3a_2f_2flcglb01.gridpp.rl.ac.uk_3a9000_2fZIhU >> vnWm77QBohlunQlXPw/input/ND280Job.py","gsiftp://lcgwms03.gridpp.rl.ac.u >> k:2811/var/glite/SandboxDir/ZI/https_3a_2f_2flcglb01.gridpp.rl.ac.uk_3a >> 9000_2fZIhUvnWm77QBohlunQlXPw/input/ND280Software.py","gsiftp://lcgwms0 >> 3.gridpp.rl.ac.uk:2811/var/glite/SandboxDir/ZI/https_3a_2f_2flcglb01.gr >> idpp.rl.ac.uk_3a9000_2fZIhUvnWm77QBohlunQlXPw/input/pexpect.py","gsiftp >> ://lcgwms03.gridpp.rl.ac.uk:2811/var/glite/SandboxDir/ZI/https_3a_2f_2f >> lcglb01.gridpp.rl.ac.uk_3a9000_2fZIhUvnWm77QBohlunQlXPw/input/ND280Raw_ >> process.py","gsiftp://lcgwms03.gridpp.rl.ac.uk:2811/var/glite/SandboxDi >> r/ZI/https_3a_2f_2flcglb01.gridpp.rl.ac.uk_3a9000_2fZIhUvnWm77QBohlunQl >> XPw/input/.BrokerInfo" >> }; >> StdOutput = "ND280Raw.out"; >> ShallowRetryCount = 10; >> InputSandboxDestFileName = { >> "ND280Configs.py","ND280GRID.py","ND280Job.py","ND280Software.py","pexp >> ect.py","ND280Raw_process.py" >> }; >> VOMS_FQAN = "/t2k.org/Role=production/Capability=NULL"; >> OutputSandboxPath = >> "/var/glite/SandboxDir/ZI/https_3a_2f_2flcglb01.gridpp.rl.ac.uk_3a9000_ >> 2fZIhUvnWm77QBohlunQlXPw/output"; >> requirements = ( ( >> Member("VO-t2k.org-ND280- >> v9r7p9",other.GlueHostApplicationSoftwareRunTimeEnvironment) >> && other.GlueCEPolicyMaxCPUTime > 600 && >> other.GlueHostMainMemoryRAMSize >= 512 ) && ( other.GlueCEStateStatus >> == "Production" ) ) && !RegExp(".*sdj$",other.GlueCEUniqueID); >> DataRequirements = { >> [ >> DataCatalog = 
"http://lfc.gridpp.rl.ac.uk:8085/"; >> InputData = { >> "lfn:/grid/t2k.org/nd280/production004/B/rdp/ND280/00006000_00006999/re >> co/oa_nd_cos_00006945-0166_muaf23qoshfp_reco_000_v9r7p5.root" >> }; >> DataCatalogType = "DLI" >> ] }; >> rank = -other.GlueCEStateEstimatedResponseTime; >> Type = "job"; >> OutputSandboxDestURI = { >> "gsiftp://lcgwms03.gridpp.rl.ac.uk:2811/var/glite/SandboxDir/ZI/https_3 >> a_2f_2flcglb01.gridpp.rl.ac.uk_3a9000_2fZIhUvnWm77QBohlunQlXPw/output/N >> D280Raw.out","gsiftp://lcgwms03.gridpp.rl.ac.uk:2811/var/glite/SandboxD >> ir/ZI/https_3a_2f_2flcglb01.gridpp.rl.ac.uk_3a9000_2fZIhUvnWm77QBohlunQ >> lXPw/output/ND280Raw.err","gsiftp://lcgwms03.gridpp.rl.ac.uk:2811/var/g >> lite/SandboxDir/ZI/https_3a_2f_2flcglb01.gridpp.rl.ac.uk_3a9000_2fZIhUv >> nWm77QBohlunQlXPw/output/Raw_cosmic_00006945-0166_v9r7p9.cfg" >> }; >> StdError = "ND280Raw.err"; >> DataAccessProtocol = "gsiftp"; >> DefaultRank = -other.GlueCEStateEstimatedResponseTime; >> CeApplicationDir = "/stage/sl3-lcg-exp/t2ksgm"; >> WMPInputSandboxBaseURI = >> "gsiftp://lcgwms03.gridpp.rl.ac.uk:2811/var/glite/SandboxDir/ZI/https_3 >> a_2f_2flcglb01.gridpp.rl.ac.uk_3a9000_2fZIhUvnWm77QBohlunQlXPw"; >> AllowZippedISB = true; >> ZippedISB = { "ISBfiles_avupuyXqSP-D8K3jUCJpCA_0.tar.gz" }; >> X509UserProxy = >> "/var/glite/SandboxDir/ZI/https_3a_2f_2flcglb01.gridpp.rl.ac.uk_3a9000_ >> 2fZIhUvnWm77QBohlunQlXPw/user.proxy"; >> GlobusResourceContactString = >> "lcgce05.gridpp.rl.ac.uk:8443/cream-pbs"; >> InputSandboxPath = >> "/var/glite/SandboxDir/ZI/https_3a_2f_2flcglb01.gridpp.rl.ac.uk_3a9000_ >> 2fZIhUvnWm77QBohlunQlXPw/input"; >> OutputSandbox = { >> "ND280Raw.out","ND280Raw.err","Raw_cosmic_00006945-0166_v9r7p9.cfg" } >> ] >> --- >> Event: Running >> - Arrived = Thu Jun 2 05:42:39 2011 CEST >> - Host = lcg1278.gridpp.rl.ac.uk >> - Level = SYSTEM >> - Node = lcg1278.gridpp.rl.ac.uk >> - Priority = synchronous >> - Seqcode = >> 
UI=000000:NS=0000000004:WM=000004:BH=0000000000:JSS=000002:LM=000002:LR >> MS=000001:APP=000000:LBS=000000 >> - Source = LRMS >> - Timestamp = Thu Jun 2 05:42:39 2011 CEST >> - User = /DC=ch/DC=cern/OU=Organic >> Units/OU=Users/CN=lwikstro/CN=627993/CN=Gustav Wikstrom >> --- >> Event: ReallyRunning >> - Arrived = Thu Jun 2 05:42:46 2011 CEST >> - Host = lcg1278.gridpp.rl.ac.uk >> - Level = SYSTEM >> - Priority = synchronous >> - Seqcode = >> UI=000000:NS=0000000004:WM=000004:BH=0000000000:JSS=000002:LM=000002:LR >> MS=000003:APP=000000:LBS=000000 >> - Source = LRMS >> - Timestamp = Thu Jun 2 05:42:46 2011 CEST >> - User = /DC=ch/DC=cern/OU=Organic >> Units/OU=Users/CN=lwikstro/CN=627993/CN=Gustav Wikstrom >> --- >> Event: Transfer >> - Arrived = Thu Jun 2 05:40:05 2011 CEST >> - Dest host = >> https://lcgce05.gridpp.rl.ac.uk:8443/ce-cream/services/CREAM2 >> - Dest instance = unavailable >> - Dest jobid = >> https://lcgce05.gridpp.rl.ac.uk:8443/CREAM840367115 >> - Destination = LRMS >> - Host = lcgwms03.gridpp.rl.ac.uk >> - Level = SYSTEM >> - Priority = synchronous >> - Reason = unavailable >> - Result = OK >> - Seqcode = >> UI=000000:NS=0000000004:WM=000004:BH=0000000000:JSS=000002:LM=000003:LR >> MS=000000:APP=000000:LBS=000000 >> - Source = LogMonitor >> - Timestamp = Thu Jun 2 05:40:04 2011 CEST >> - User = /DC=ch/DC=cern/OU=Organic >> Units/OU=Users/CN=lwikstro/CN=627993/CN=Gustav >> Wikstrom/CN=proxy/CN=proxy >> - Job = >> >> [ >> RetryCount = 3; >> LB_sequence_code = >> "UI=000000:NS=0000000004:WM=000004:BH=0000000000:JSS=000002:LM=000002:L >> RMS=000000:APP=000000:LBS=000000"; >> ReallyRunningToken = >> "gsiftp://lcgwms03.gridpp.rl.ac.uk/var/glite/SandboxDir/ZI/https_3a_2f_ >> 2flcglb01.gridpp.rl.ac.uk_3a9000_2fZIhUvnWm77QBohlunQlXPw/token.txt"; >> edg_jobid = >> "https://lcglb01.gridpp.rl.ac.uk:9000/ZIhUvnWm77QBohlunQlXPw"; >> lrms_type = "torque"; >> CeRequirements = "true && ( true && ( >> Member(\"VO-t2k.org-ND280- >> 
v9r7p9\",other.GlueHostApplicationSoftwareRunTimeEnvironment) >> && other.GlueCEPolicyMaxCPUTime > 600 && >> other.GlueHostMainMemoryRAMSize >= 512 ) )"; >> Arguments = "-v v9r7p9 -i >> lfn:/grid/t2k.org/nd280/production004/B/rdp/ND280/00006000_00006999/rec >> o/oa_nd_cos_00006945-0166_muaf23qoshfp_reco_000_v9r7p5.root >> -e cosmic -p 4C -t rdp -m oaAnalysis"; >> CertificateSubject = "/DC=ch/DC=cern/OU=Organic >> Units/OU=Users/CN=lwikstro/CN=627993/CN=Gustav Wikstrom"; >> MyProxyServer = "lcgrbp01.gridpp.rl.ac.uk"; >> ce_id = "lcgce05.gridpp.rl.ac.uk:8443/cream-pbs-grid500M"; >> QueueName = "grid500M"; >> JobType = "normal"; >> Executable = "ND280Raw_process.py"; >> VirtualOrganisation = "t2k.org"; >> SignificantAttributes = { "Requirements","Rank","FuzzyRank" }; >> InputSandbox = { >> "gsiftp://lcgwms03.gridpp.rl.ac.uk:2811/var/glite/SandboxDir/ZI/https_3 >> a_2f_2flcglb01.gridpp.rl.ac.uk_3a9000_2fZIhUvnWm77QBohlunQlXPw/input/ND >> 280Configs.py","gsiftp://lcgwms03.gridpp.rl.ac.uk:2811/var/glite/Sandbo >> xDir/ZI/https_3a_2f_2flcglb01.gridpp.rl.ac.uk_3a9000_2fZIhUvnWm77QBohlu >> nQlXPw/input/ND280GRID.py","gsiftp://lcgwms03.gridpp.rl.ac.uk:2811/var/ >> glite/SandboxDir/ZI/https_3a_2f_2flcglb01.gridpp.rl.ac.uk_3a9000_2fZIhU >> vnWm77QBohlunQlXPw/input/ND280Job.py","gsiftp://lcgwms03.gridpp.rl.ac.u >> k:2811/var/glite/SandboxDir/ZI/https_3a_2f_2flcglb01.gridpp.rl.ac.uk_3a >> 9000_2fZIhUvnWm77QBohlunQlXPw/input/ND280Software.py","gsiftp://lcgwms0 >> 3.gridpp.rl.ac.uk:2811/var/glite/SandboxDir/ZI/https_3a_2f_2flcglb01.gr >> idpp.rl.ac.uk_3a9000_2fZIhUvnWm77QBohlunQlXPw/input/pexpect.py","gsiftp >> ://lcgwms03.gridpp.rl.ac.uk:2811/var/glite/SandboxDir/ZI/https_3a_2f_2f >> lcglb01.gridpp.rl.ac.uk_3a9000_2fZIhUvnWm77QBohlunQlXPw/input/ND280Raw_ >> process.py","gsiftp://lcgwms03.gridpp.rl.ac.uk:2811/var/glite/SandboxDi >> r/ZI/https_3a_2f_2flcglb01.gridpp.rl.ac.uk_3a9000_2fZIhUvnWm77QBohlunQl >> XPw/input/.BrokerInfo" >> }; >> StdOutput = "ND280Raw.out"; >> 
ShallowRetryCount = 10; >> InputSandboxDestFileName = { >> "ND280Configs.py","ND280GRID.py","ND280Job.py","ND280Software.py","pexp >> ect.py","ND280Raw_process.py" >> }; >> VOMS_FQAN = "/t2k.org/Role=production/Capability=NULL"; >> OutputSandboxPath = >> "/var/glite/SandboxDir/ZI/https_3a_2f_2flcglb01.gridpp.rl.ac.uk_3a9000_ >> 2fZIhUvnWm77QBohlunQlXPw/output"; >> requirements = ( ( >> Member("VO-t2k.org-ND280- >> v9r7p9",other.GlueHostApplicationSoftwareRunTimeEnvironment) >> && other.GlueCEPolicyMaxCPUTime > 600 && >> other.GlueHostMainMemoryRAMSize >= 512 ) && ( other.GlueCEStateStatus >> == "Production" ) ) && !RegExp(".*sdj$",other.GlueCEUniqueID); >> DataRequirements = { >> [ >> DataCatalog = "http://lfc.gridpp.rl.ac.uk:8085/"; >> InputData = { >> "lfn:/grid/t2k.org/nd280/production004/B/rdp/ND280/00006000_00006999/re >> co/oa_nd_cos_00006945-0166_muaf23qoshfp_reco_000_v9r7p5.root" >> }; >> DataCatalogType = "DLI" >> ] }; >> rank = -other.GlueCEStateEstimatedResponseTime; >> Type = "job"; >> OutputSandboxDestURI = { >> "gsiftp://lcgwms03.gridpp.rl.ac.uk:2811/var/glite/SandboxDir/ZI/https_3 >> a_2f_2flcglb01.gridpp.rl.ac.uk_3a9000_2fZIhUvnWm77QBohlunQlXPw/output/N >> D280Raw.out","gsiftp://lcgwms03.gridpp.rl.ac.uk:2811/var/glite/SandboxD >> ir/ZI/https_3a_2f_2flcglb01.gridpp.rl.ac.uk_3a9000_2fZIhUvnWm77QBohlunQ >> lXPw/output/ND280Raw.err","gsiftp://lcgwms03.gridpp.rl.ac.uk:2811/var/g >> lite/SandboxDir/ZI/https_3a_2f_2flcglb01.gridpp.rl.ac.uk_3a9000_2fZIhUv >> nWm77QBohlunQlXPw/output/Raw_cosmic_00006945-0166_v9r7p9.cfg" >> }; >> StdError = "ND280Raw.err"; >> DataAccessProtocol = "gsiftp"; >> DefaultRank = -other.GlueCEStateEstimatedResponseTime; >> CeApplicationDir = "/stage/sl3-lcg-exp/t2ksgm"; >> WMPInputSandboxBaseURI = >> "gsiftp://lcgwms03.gridpp.rl.ac.uk:2811/var/glite/SandboxDir/ZI/https_3 >> a_2f_2flcglb01.gridpp.rl.ac.uk_3a9000_2fZIhUvnWm77QBohlunQlXPw"; >> AllowZippedISB = true; >> ZippedISB = { "ISBfiles_avupuyXqSP-D8K3jUCJpCA_0.tar.gz" }; >> 
X509UserProxy = >> "/var/glite/SandboxDir/ZI/https_3a_2f_2flcglb01.gridpp.rl.ac.uk_3a9000_ >> 2fZIhUvnWm77QBohlunQlXPw/user.proxy"; >> GlobusResourceContactString = >> "lcgce05.gridpp.rl.ac.uk:8443/cream-pbs"; >> InputSandboxPath = >> "/var/glite/SandboxDir/ZI/https_3a_2f_2flcglb01.gridpp.rl.ac.uk_3a9000_ >> 2fZIhUvnWm77QBohlunQlXPw/input"; >> OutputSandbox = { >> "ND280Raw.out","ND280Raw.err","Raw_cosmic_00006945-0166_v9r7p9.cfg" } >> ] >> --- >> Event: Running >> - Arrived = Thu Jun 2 05:50:58 2011 CEST >> - Host = lcgwms03.gridpp.rl.ac.uk >> - Level = SYSTEM >> - Node = lcg1278.gridpp.rl.ac.uk >> - Priority = synchronous >> - Seqcode = >> UI=000000:NS=0000000004:WM=000004:BH=0000000000:JSS=000002:LM=000005:LR >> MS=000000:APP=000000:LBS=000000 >> - Source = LogMonitor >> - Timestamp = Thu Jun 2 05:50:58 2011 CEST >> - User = /DC=ch/DC=cern/OU=Organic >> Units/OU=Users/CN=lwikstro/CN=627993/CN=Gustav >> Wikstrom/CN=proxy/CN=proxy >> --- >> Event: ReallyRunning >> - Arrived = Thu Jun 2 05:50:59 2011 CEST >> - Host = lcgwms03.gridpp.rl.ac.uk >> - Level = SYSTEM >> - Priority = synchronous >> - Seqcode = >> UI=000000:NS=0000000004:WM=000004:BH=0000000000:JSS=000002:LM=000007:LR >> MS=000000:APP=000000:LBS=000000 >> - Source = LogMonitor >> - Timestamp = Thu Jun 2 05:50:58 2011 CEST >> - User = /DC=ch/DC=cern/OU=Organic >> Units/OU=Users/CN=lwikstro/CN=627993/CN=Gustav >> Wikstrom/CN=proxy/CN=proxy >> - Wn seq = >> UI=000000:NS=0000000004:WM=000004:BH=0000000000:JSS=000002:LM=000002:LR >> MS=000000:APP=000000:LBS=000000 >> --- >> Event: Cancel >> - Arrived = Thu Jun 2 06:03:11 2011 CEST >> - Host = lcgwms03.gridpp.rl.ac.uk >> - Level = SYSTEM >> - Priority = synchronous >> - Seqcode = >> UI=000000:NS=0000000004:WM=000004:BH=0000000000:JSS=000002:LM=000009:LR >> MS=000000:APP=000000:LBS=000000 >> - Source = LogMonitor >> - Status code = DONE >> - Timestamp = Thu Jun 2 06:03:11 2011 CEST >> - User = /DC=ch/DC=cern/OU=Organic >> 
Units/OU=Users/CN=lwikstro/CN=627993/CN=Gustav
>> Wikstrom/CN=proxy/CN=proxy
>> ---
>> Event: Done
>> - Arrived = Thu Jun 2 06:03:11 2011 CEST
>> - Exit code = 0
>> - Host = lcgwms03.gridpp.rl.ac.uk
>> - Level = SYSTEM
>> - Priority = synchronous
>> - Seqcode =
>> UI=000000:NS=0000000004:WM=000004:BH=0000000000:JSS=000002:LM=000010:LR
>> MS=000000:APP=000000:LBS=000000
>> - Source = LogMonitor
>> - Status code = CANCELLED
>> - Timestamp = Thu Jun 2 06:03:11 2011 CEST
>> - User = /DC=ch/DC=cern/OU=Organic
>> Units/OU=Users/CN=lwikstro/CN=627993/CN=Gustav
>> Wikstrom/CN=proxy/CN=proxy
>> ---
>> Event: Clear
>> - Arrived = Thu Jun 2 06:03:12 2011 CEST
>> - Host = lcgwms03.gridpp.rl.ac.uk
>> - Level = SYSTEM
>> - Priority = synchronous
>> - Reason = 1
>> - Seqcode =
>> UI=000009:NS=0000096670:WM=000000:BH=0000000000:JSS=000000:LM=000000:LR
>> MS=000000:APP=000000:LBS=000000
>> - Source = NetworkServer
>> - Src instance = 22543
>> - Timestamp = Thu Jun 2 06:03:12 2011 CEST
>> - User =
>> /C=UK/O=eScience/OU=CLRC/L=RAL/CN=lcgwms03.gridpp.rl.ac.uk/Email=tier1a
>> [log in to unmask]
>> =======================================================================
>> ===
>>
>>
>>
>> 2011/6/2 Stuart Purdie <[log in to unmask]>:
>> >
>> > On 2 Jun 2011, at 09:49, Gustav Wikström wrote:
>> >
>> >> Hi all,
>> >>
>> >> I'm having serious problems with running my VO t2k.org jobs; currently 95% of them are being cancelled by the WMSs (lcgwms03.gridpp.rl.ac.uk and wms02.grid.hep.ic.ac.uk) or the CEs. As I understand it, when a WMS stops a job, it is labeled Aborted, and then Cancelled is when a CE stops a job? The bad thing is that there is no information about a job after it has been stopped unless it failed.
>> >>
>> >> So, what could cause a job to be cancelled? Is memory usage one of the reasons?
>> >
>> > Not the most likely culprit, as it's not the most strongly enforced constraint across all sites, but it is possible. It does have a bit of a site dependence, so if the 5% that don't get cancelled end up on a different site, that's useful data. Job CPU use and wall time are more strongly enforced; but it could also be missing input files causing the jobs to die on start up.
>> >
>> > If it's (apparently) randomly distributed across all sites, the first thing I'd be checking is proxy lifespans, job queueing time and myproxy stuff (if used).
>> >
>> > There might be more information lurking around, which, if you've not tried already, can be released with 'glite-wms-job-status --verbosity 3 <jid>', and 'glite-wms-job-logging-info --verbosity 3 <jid>'
>> > which might give more idea on where to poke at next. In particular, the WMS (by default) will try re-submitting a failed job a couple of times, and walking through that process might be informative. The amount of time jobs spend running might also help identify the root problem.
>> >
>> >
>> >
>
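
A side note on the batch accounting record quoted in Catalin's message: the memory overrun can be read off mechanically by comparing resources_used.mem with Resource_List.pmem. What follows is only a minimal sketch, not a site tool. The record string is an abbreviated copy of the quoted one (only the fields used here), the function names are made up for illustration, and two things are assumptions rather than facts from this thread: that "mb" means 1024 kb in these records, and that an Exit_status above 256 encodes "256 + signal number".

```python
import re

# Abbreviated copy of the accounting record quoted in the thread
# (only the fields this sketch needs; the rest are omitted).
RECORD = (
    "06/02/2011 04:52:23;E;15490141.lcgbatch01.gridpp.rl.ac.uk;"
    "user=t2k028 group=t2k jobname=cre05_840367115 queue=grid500M "
    "Resource_List.pmem=500mb Exit_status=271 "
    "resources_used.mem=548468kb resources_used.vmem=1563948kb"
)

# Assumption: sizes use binary multiples (1 mb = 1024 kb).
UNITS_KB = {"kb": 1, "mb": 1024, "gb": 1024 * 1024}


def parse_record(record):
    """Split a 'date;type;jobid;key=value ...' record into its parts."""
    timestamp, rec_type, job_id, message = record.split(";", 3)
    # Tokens without '=' (for example masked addresses) are skipped.
    fields = dict(tok.split("=", 1) for tok in message.split() if "=" in tok)
    return timestamp, rec_type, job_id, fields


def size_kb(text):
    """Convert a size such as '500mb' or '548468kb' to kilobytes."""
    match = re.fullmatch(r"(\d+)(kb|mb|gb)", text.lower())
    if match is None:
        raise ValueError(f"unrecognised size: {text!r}")
    return int(match.group(1)) * UNITS_KB[match.group(2)]


def memory_overrun_kb(fields):
    """Kilobytes used above the per-process limit (negative if under it)."""
    return size_kb(fields["resources_used.mem"]) - size_kb(fields["Resource_List.pmem"])


def signal_from_exit_status(fields):
    """Assumed convention: Exit_status > 256 means killed by signal (status - 256)."""
    status = int(fields["Exit_status"])
    return status - 256 if status > 256 else None


if __name__ == "__main__":
    _, _, job_id, fields = parse_record(RECORD)
    print(job_id, fields["queue"],
          "overrun_kb =", memory_overrun_kb(fields),
          "signal =", signal_from_exit_status(fields))
```

Under those assumptions the job was about 36 MB over its 500 MB limit, and Exit_status=271 would correspond to signal 15. Running the same comparison over a batch of records would show whether the 1000M-queue failures are memory overruns too or have a different cause.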