Hi Stuart,
yes, it turns out there is quite a lot of information available. One
example job, with glite-wms-job-logging-info --verbosity 3, is below.
There seems to be some retrying going on, but in the final step (at
the bottom) the job runs for 13 mins before being cancelled by the
LogMonitor, although by then the status of the job is DONE.
Maybe you can make more out of this?
Gustav
===================== glite-job-logging-info Success =====================
LOGGING INFORMATION:
Printing info for the Job :
https://lcglb01.gridpp.rl.ac.uk:9000/ZIhUvnWm77QBohlunQlXPw
---
Event: RegJob
- Arrived = Thu Jun 2 05:39:57 2011 CEST
- Host = lcgwms03.gridpp.rl.ac.uk
- Jobtype = SIMPLE
- Level = SYSTEM
- Ns =
https://lcgwms03.gridpp.rl.ac.uk:7443/glite_wms_wmproxy_server
- Nsubjobs = 0
- Priority = synchronous
- Seqcode =
UI=000000:NS=0000000001:WM=000000:BH=0000000000:JSS=000000:LM=000000:LRMS=000000:APP=000000:LBS=000000
- Source = NetworkServer
- Src instance =
https://lcgwms03.gridpp.rl.ac.uk:7443/glite_wms_wmproxy_server
- Timestamp = Thu Jun 2 05:39:57 2011 CEST
- User = /DC=ch/DC=cern/OU=Organic
Units/OU=Users/CN=lwikstro/CN=627993/CN=Gustav Wikstrom
- Jdl =
SandboxDir/ZI/https_3a_2f_2flcglb01.gridpp.rl.ac.uk_3a9000_2fZIhUvnWm77QBohlunQlXPw/JDLToStart
---
Event: RegJob
- Arrived = Thu Jun 2 05:39:58 2011 CEST
- Host = lcgwms03.gridpp.rl.ac.uk
- Jobtype = SIMPLE
- Level = SYSTEM
- Ns =
https://lcgwms03.gridpp.rl.ac.uk:7443/glite_wms_wmproxy_server
- Nsubjobs = 0
- Priority = synchronous
- Seqcode =
UI=000000:NS=0000000001:WM=000000:BH=0000000000:JSS=000000:LM=000000:LRMS=000000:APP=000000:LBS=000000
- Source = NetworkServer
- Src instance =
https://lcgwms03.gridpp.rl.ac.uk:7443/glite_wms_wmproxy_server
- Timestamp = Thu Jun 2 05:39:57 2011 CEST
- User = /DC=ch/DC=cern/OU=Organic
Units/OU=Users/CN=lwikstro/CN=627993/CN=Gustav Wikstrom
- Jdl =
SandboxDir/ZI/https_3a_2f_2flcglb01.gridpp.rl.ac.uk_3a9000_2fZIhUvnWm77QBohlunQlXPw/JDLToStart
---
Event: Accepted
- Arrived = Thu Jun 2 05:40:02 2011 CEST
- From = NetworkServer
- From host = atlas009.unige.ch
- Host = lcgwms03.gridpp.rl.ac.uk
- Level = SYSTEM
- Priority = synchronous
- Seqcode =
UI=000000:NS=0000000002:WM=000000:BH=0000000000:JSS=000000:LM=000000:LRMS=000000:APP=000000:LBS=000000
- Source = NetworkServer
- Src instance =
https://lcgwms03.gridpp.rl.ac.uk:7443/glite_wms_wmproxy_server
- Timestamp = Thu Jun 2 05:40:01 2011 CEST
- User = /DC=ch/DC=cern/OU=Organic
Units/OU=Users/CN=lwikstro/CN=627993/CN=Gustav Wikstrom
---
Event: EnQueued
- Arrived = Thu Jun 2 05:40:02 2011 CEST
- Host = lcgwms03.gridpp.rl.ac.uk
- Level = SYSTEM
- Priority = synchronous
- Queue = /var/glite/workload_manager/jobdir
- Result = START
- Seqcode =
UI=000000:NS=0000000003:WM=000000:BH=0000000000:JSS=000000:LM=000000:LRMS=000000:APP=000000:LBS=000000
- Source = NetworkServer
- Src instance =
https://lcgwms03.gridpp.rl.ac.uk:7443/glite_wms_wmproxy_server
- Timestamp = Thu Jun 2 05:40:02 2011 CEST
- User = /DC=ch/DC=cern/OU=Organic
Units/OU=Users/CN=lwikstro/CN=627993/CN=Gustav Wikstrom
- Job =
/var/glite/SandboxDir/ZI/https_3a_2f_2flcglb01.gridpp.rl.ac.uk_3a9000_2fZIhUvnWm77QBohlunQlXPw/JDLToStart
---
Event: EnQueued
- Arrived = Thu Jun 2 05:40:02 2011 CEST
- Host = lcgwms03.gridpp.rl.ac.uk
- Level = SYSTEM
- Priority = synchronous
- Queue = /var/glite/workload_manager/jobdir
- Result = OK
- Seqcode =
UI=000000:NS=0000000004:WM=000000:BH=0000000000:JSS=000000:LM=000000:LRMS=000000:APP=000000:LBS=000000
- Source = NetworkServer
- Src instance =
https://lcgwms03.gridpp.rl.ac.uk:7443/glite_wms_wmproxy_server
- Timestamp = Thu Jun 2 05:40:02 2011 CEST
- User = /DC=ch/DC=cern/OU=Organic
Units/OU=Users/CN=lwikstro/CN=627993/CN=Gustav Wikstrom
- Job =
[
RetryCount = 3;
LB_sequence_code =
"UI=000000:NS=0000000004:WM=000000:BH=0000000000:JSS=000000:LM=000000:LRMS=000000:APP=000000:LBS=000000";
edg_jobid =
"https://lcglb01.gridpp.rl.ac.uk:9000/ZIhUvnWm77QBohlunQlXPw";
Arguments = "-v v9r7p9 -i
lfn:/grid/t2k.org/nd280/production004/B/rdp/ND280/00006000_00006999/reco/oa_nd_cos_00006945-0166_muaf23qoshfp_reco_000_v9r7p5.root
-e cosmic -p 4C -t rdp -m oaAnalysis";
CertificateSubject = "/DC=ch/DC=cern/OU=Organic
Units/OU=Users/CN=lwikstro/CN=627993/CN=Gustav Wikstrom";
MyProxyServer = "lcgrbp01.gridpp.rl.ac.uk";
JobType = "normal";
Executable = "ND280Raw_process.py";
VirtualOrganisation = "t2k.org";
SignificantAttributes = { "Requirements","Rank","FuzzyRank" };
InputSandbox = {
"gsiftp://lcgwms03.gridpp.rl.ac.uk:2811/var/glite/SandboxDir/ZI/https_3a_2f_2flcglb01.gridpp.rl.ac.uk_3a9000_2fZIhUvnWm77QBohlunQlXPw/input/ND280Configs.py","gsiftp://lcgwms03.gridpp.rl.ac.uk:2811/var/glite/SandboxDir/ZI/https_3a_2f_2flcglb01.gridpp.rl.ac.uk_3a9000_2fZIhUvnWm77QBohlunQlXPw/input/ND280GRID.py","gsiftp://lcgwms03.gridpp.rl.ac.uk:2811/var/glite/SandboxDir/ZI/https_3a_2f_2flcglb01.gridpp.rl.ac.uk_3a9000_2fZIhUvnWm77QBohlunQlXPw/input/ND280Job.py","gsiftp://lcgwms03.gridpp.rl.ac.uk:2811/var/glite/SandboxDir/ZI/https_3a_2f_2flcglb01.gridpp.rl.ac.uk_3a9000_2fZIhUvnWm77QBohlunQlXPw/input/ND280Software.py","gsiftp://lcgwms03.gridpp.rl.ac.uk:2811/var/glite/SandboxDir/ZI/https_3a_2f_2flcglb01.gridpp.rl.ac.uk_3a9000_2fZIhUvnWm77QBohlunQlXPw/input/pexpect.py","gsiftp://lcgwms03.gridpp.rl.ac.uk:2811/var/glite/SandboxDir/ZI/https_3a_2f_2flcglb01.gridpp.rl.ac.uk_3a9000_2fZIhUvnWm77QBohlunQlXPw/input/ND280Raw_process.py"
};
StdOutput = "ND280Raw.out";
ShallowRetryCount = 10;
InputSandboxDestFileName = {
"ND280Configs.py","ND280GRID.py","ND280Job.py","ND280Software.py","pexpect.py","ND280Raw_process.py"
};
VOMS_FQAN = "/t2k.org/Role=production/Capability=NULL";
OutputSandboxPath =
"/var/glite/SandboxDir/ZI/https_3a_2f_2flcglb01.gridpp.rl.ac.uk_3a9000_2fZIhUvnWm77QBohlunQlXPw/output";
requirements = ( (
Member("VO-t2k.org-ND280-v9r7p9",other.GlueHostApplicationSoftwareRunTimeEnvironment)
&& other.GlueCEPolicyMaxCPUTime > 600 &&
other.GlueHostMainMemoryRAMSize >= 512 ) && ( other.GlueCEStateStatus
== "Production" ) ) && !RegExp(".*sdj$",other.GlueCEUniqueID);
DataRequirements = {
[
DataCatalog = "http://lfc.gridpp.rl.ac.uk:8085/";
InputData = {
"lfn:/grid/t2k.org/nd280/production004/B/rdp/ND280/00006000_00006999/reco/oa_nd_cos_00006945-0166_muaf23qoshfp_reco_000_v9r7p5.root"
};
DataCatalogType = "DLI"
] };
rank = -other.GlueCEStateEstimatedResponseTime;
Type = "job";
OutputSandboxDestURI = {
"gsiftp://lcgwms03.gridpp.rl.ac.uk:2811/var/glite/SandboxDir/ZI/https_3a_2f_2flcglb01.gridpp.rl.ac.uk_3a9000_2fZIhUvnWm77QBohlunQlXPw/output/ND280Raw.out","gsiftp://lcgwms03.gridpp.rl.ac.uk:2811/var/glite/SandboxDir/ZI/https_3a_2f_2flcglb01.gridpp.rl.ac.uk_3a9000_2fZIhUvnWm77QBohlunQlXPw/output/ND280Raw.err","gsiftp://lcgwms03.gridpp.rl.ac.uk:2811/var/glite/SandboxDir/ZI/https_3a_2f_2flcglb01.gridpp.rl.ac.uk_3a9000_2fZIhUvnWm77QBohlunQlXPw/output/Raw_cosmic_00006945-0166_v9r7p9.cfg"
};
StdError = "ND280Raw.err";
DataAccessProtocol = "gsiftp";
DefaultRank = -other.GlueCEStateEstimatedResponseTime;
WMPInputSandboxBaseURI =
"gsiftp://lcgwms03.gridpp.rl.ac.uk:2811/var/glite/SandboxDir/ZI/https_3a_2f_2flcglb01.gridpp.rl.ac.uk_3a9000_2fZIhUvnWm77QBohlunQlXPw";
AllowZippedISB = true;
ZippedISB = { "ISBfiles_avupuyXqSP-D8K3jUCJpCA_0.tar.gz" };
X509UserProxy =
"/var/glite/SandboxDir/ZI/https_3a_2f_2flcglb01.gridpp.rl.ac.uk_3a9000_2fZIhUvnWm77QBohlunQlXPw/user.proxy";
InputSandboxPath =
"/var/glite/SandboxDir/ZI/https_3a_2f_2flcglb01.gridpp.rl.ac.uk_3a9000_2fZIhUvnWm77QBohlunQlXPw/input";
OutputSandbox = {
"ND280Raw.out","ND280Raw.err","Raw_cosmic_00006945-0166_v9r7p9.cfg" }
]
---
Event: DeQueued
- Arrived = Thu Jun 2 05:40:02 2011 CEST
- Host = lcgwms03.gridpp.rl.ac.uk
- Level = SYSTEM
- Priority = synchronous
- Queue = /var/glite/workload_manager/jobdir
- Seqcode =
UI=000000:NS=0000000004:WM=000001:BH=0000000000:JSS=000000:LM=000000:LRMS=000000:APP=000000:LBS=000000
- Source = WorkloadManager
- Src instance = 10391
- Timestamp = Thu Jun 2 05:40:02 2011 CEST
- User = /DC=ch/DC=cern/OU=Organic
Units/OU=Users/CN=lwikstro/CN=627993/CN=Gustav
Wikstrom/CN=proxy/CN=proxy
---
Event: Match
- Arrived = Thu Jun 2 05:40:02 2011 CEST
- Dest id =
lcgce05.gridpp.rl.ac.uk:8443/cream-pbs-grid500M
- Host = lcgwms03.gridpp.rl.ac.uk
- Level = SYSTEM
- Priority = synchronous
- Seqcode =
UI=000000:NS=0000000004:WM=000002:BH=0000000000:JSS=000000:LM=000000:LRMS=000000:APP=000000:LBS=000000
- Source = WorkloadManager
- Src instance = 10391
- Timestamp = Thu Jun 2 05:40:02 2011 CEST
- User = /DC=ch/DC=cern/OU=Organic
Units/OU=Users/CN=lwikstro/CN=627993/CN=Gustav
Wikstrom/CN=proxy/CN=proxy
---
Event: EnQueued
- Arrived = Thu Jun 2 05:40:03 2011 CEST
- Host = lcgwms03.gridpp.rl.ac.uk
- Level = SYSTEM
- Priority = synchronous
- Queue = /var/glite/ice/jobdir
- Result = START
- Seqcode =
UI=000000:NS=0000000004:WM=000003:BH=0000000000:JSS=000000:LM=000000:LRMS=000000:APP=000000:LBS=000000
- Source = WorkloadManager
- Src instance = 10391
- Timestamp = Thu Jun 2 05:40:02 2011 CEST
- User = /DC=ch/DC=cern/OU=Organic
Units/OU=Users/CN=lwikstro/CN=627993/CN=Gustav
Wikstrom/CN=proxy/CN=proxy
---
Event: EnQueued
- Arrived = Thu Jun 2 05:40:03 2011 CEST
- Host = lcgwms03.gridpp.rl.ac.uk
- Level = SYSTEM
- Priority = synchronous
- Queue = /var/glite/ice/jobdir
- Result = OK
- Seqcode =
UI=000000:NS=0000000004:WM=000004:BH=0000000000:JSS=000000:LM=000000:LRMS=000000:APP=000000:LBS=000000
- Source = WorkloadManager
- Src instance = 10391
- Timestamp = Thu Jun 2 05:40:03 2011 CEST
- User = /DC=ch/DC=cern/OU=Organic
Units/OU=Users/CN=lwikstro/CN=627993/CN=Gustav
Wikstrom/CN=proxy/CN=proxy
- Job =
[
Arguments =
[
JobAd =
[
RetryCount = 3;
LB_sequence_code =
"UI=000000:NS=0000000004:WM=000004:BH=0000000000:JSS=000000:LM=000000:LRMS=000000:APP=000000:LBS=000000";
ReallyRunningToken =
"gsiftp://lcgwms03.gridpp.rl.ac.uk/var/glite/SandboxDir/ZI/https_3a_2f_2flcglb01.gridpp.rl.ac.uk_3a9000_2fZIhUvnWm77QBohlunQlXPw/token.txt";
edg_jobid =
"https://lcglb01.gridpp.rl.ac.uk:9000/ZIhUvnWm77QBohlunQlXPw";
lrms_type = "torque";
CeRequirements = "true && ( true && (
Member(\"VO-t2k.org-ND280-v9r7p9\",other.GlueHostApplicationSoftwareRunTimeEnvironment)
&& other.GlueCEPolicyMaxCPUTime > 600 &&
other.GlueHostMainMemoryRAMSize >= 512 ) )";
Arguments = "-v v9r7p9 -i
lfn:/grid/t2k.org/nd280/production004/B/rdp/ND280/00006000_00006999/reco/oa_nd_cos_00006945-0166_muaf23qoshfp_reco_000_v9r7p5.root
-e cosmic -p 4C -t rdp -m oaAnalysis";
CertificateSubject = "/DC=ch/DC=cern/OU=Organic
Units/OU=Users/CN=lwikstro/CN=627993/CN=Gustav Wikstrom";
MyProxyServer = "lcgrbp01.gridpp.rl.ac.uk";
ce_id = "lcgce05.gridpp.rl.ac.uk:8443/cream-pbs-grid500M";
QueueName = "grid500M";
JobType = "normal";
Executable = "ND280Raw_process.py";
VirtualOrganisation = "t2k.org";
SignificantAttributes = { "Requirements","Rank","FuzzyRank" };
InputSandbox = {
"gsiftp://lcgwms03.gridpp.rl.ac.uk:2811/var/glite/SandboxDir/ZI/https_3a_2f_2flcglb01.gridpp.rl.ac.uk_3a9000_2fZIhUvnWm77QBohlunQlXPw/input/ND280Configs.py","gsiftp://lcgwms03.gridpp.rl.ac.uk:2811/var/glite/SandboxDir/ZI/https_3a_2f_2flcglb01.gridpp.rl.ac.uk_3a9000_2fZIhUvnWm77QBohlunQlXPw/input/ND280GRID.py","gsiftp://lcgwms03.gridpp.rl.ac.uk:2811/var/glite/SandboxDir/ZI/https_3a_2f_2flcglb01.gridpp.rl.ac.uk_3a9000_2fZIhUvnWm77QBohlunQlXPw/input/ND280Job.py","gsiftp://lcgwms03.gridpp.rl.ac.uk:2811/var/glite/SandboxDir/ZI/https_3a_2f_2flcglb01.gridpp.rl.ac.uk_3a9000_2fZIhUvnWm77QBohlunQlXPw/input/ND280Software.py","gsiftp://lcgwms03.gridpp.rl.ac.uk:2811/var/glite/SandboxDir/ZI/https_3a_2f_2flcglb01.gridpp.rl.ac.uk_3a9000_2fZIhUvnWm77QBohlunQlXPw/input/pexpect.py","gsiftp://lcgwms03.gridpp.rl.ac.uk:2811/var/glite/SandboxDir/ZI/https_3a_2f_2flcglb01.gridpp.rl.ac.uk_3a9000_2fZIhUvnWm77QBohlunQlXPw/input/ND280Raw_process.py","gsiftp://lcgwms03.gridpp.rl.ac.uk:2811/var/glite/SandboxDir/ZI/https_3a_2f_2flcglb01.gridpp.rl.ac.uk_3a9000_2fZIhUvnWm77QBohlunQlXPw/input/.BrokerInfo"
};
StdOutput = "ND280Raw.out";
ShallowRetryCount = 10;
VOMS_FQAN = "/t2k.org/Role=production/Capability=NULL";
InputSandboxDestFileName = {
"ND280Configs.py","ND280GRID.py","ND280Job.py","ND280Software.py","pexpect.py","ND280Raw_process.py"
};
OutputSandboxPath =
"/var/glite/SandboxDir/ZI/https_3a_2f_2flcglb01.gridpp.rl.ac.uk_3a9000_2fZIhUvnWm77QBohlunQlXPw/output";
requirements = ( (
Member("VO-t2k.org-ND280-v9r7p9",other.GlueHostApplicationSoftwareRunTimeEnvironment)
&& other.GlueCEPolicyMaxCPUTime > 600 &&
other.GlueHostMainMemoryRAMSize >= 512 ) && ( other.GlueCEStateStatus
== "Production" ) ) && !RegExp(".*sdj$",other.GlueCEUniqueID);
DataRequirements = {
[
DataCatalog = "http://lfc.gridpp.rl.ac.uk:8085/";
InputData = {
"lfn:/grid/t2k.org/nd280/production004/B/rdp/ND280/00006000_00006999/reco/oa_nd_cos_00006945-0166_muaf23qoshfp_reco_000_v9r7p5.root"
};
DataCatalogType = "DLI"
] };
rank = -other.GlueCEStateEstimatedResponseTime;
Type = "job";
OutputSandboxDestURI = {
"gsiftp://lcgwms03.gridpp.rl.ac.uk:2811/var/glite/SandboxDir/ZI/https_3a_2f_2flcglb01.gridpp.rl.ac.uk_3a9000_2fZIhUvnWm77QBohlunQlXPw/output/ND280Raw.out","gsiftp://lcgwms03.gridpp.rl.ac.uk:2811/var/glite/SandboxDir/ZI/https_3a_2f_2flcglb01.gridpp.rl.ac.uk_3a9000_2fZIhUvnWm77QBohlunQlXPw/output/ND280Raw.err","gsiftp://lcgwms03.gridpp.rl.ac.uk:2811/var/glite/SandboxDir/ZI/https_3a_2f_2flcglb01.gridpp.rl.ac.uk_3a9000_2fZIhUvnWm77QBohlunQlXPw/output/Raw_cosmic_00006945-0166_v9r7p9.cfg"
};
StdError = "ND280Raw.err";
DataAccessProtocol = "gsiftp";
DefaultRank = -other.GlueCEStateEstimatedResponseTime;
WMPInputSandboxBaseURI =
"gsiftp://lcgwms03.gridpp.rl.ac.uk:2811/var/glite/SandboxDir/ZI/https_3a_2f_2flcglb01.gridpp.rl.ac.uk_3a9000_2fZIhUvnWm77QBohlunQlXPw";
CeApplicationDir = "/stage/sl3-lcg-exp/t2ksgm";
ZippedISB = { "ISBfiles_avupuyXqSP-D8K3jUCJpCA_0.tar.gz" };
AllowZippedISB = true;
X509UserProxy =
"/var/glite/SandboxDir/ZI/https_3a_2f_2flcglb01.gridpp.rl.ac.uk_3a9000_2fZIhUvnWm77QBohlunQlXPw/user.proxy";
GlobusResourceContactString =
"lcgce05.gridpp.rl.ac.uk:8443/cream-pbs";
InputSandboxPath =
"/var/glite/SandboxDir/ZI/https_3a_2f_2flcglb01.gridpp.rl.ac.uk_3a9000_2fZIhUvnWm77QBohlunQlXPw/input";
OutputSandbox = {
"ND280Raw.out","ND280Raw.err","Raw_cosmic_00006945-0166_v9r7p9.cfg" }
]
];
Command = "Submit";
Source = 2;
Protocol = "1.0.0"
]
---
Event: DeQueued
- Arrived = Thu Jun 2 05:40:04 2011 CEST
- Host = lcgwms03.gridpp.rl.ac.uk
- Level = SYSTEM
- Local jobid =
https://lcglb01.gridpp.rl.ac.uk:9000/ZIhUvnWm77QBohlunQlXPw
- Priority = synchronous
- Queue = /var/glite/ice/jobdir
- Seqcode =
UI=000000:NS=0000000004:WM=000004:BH=0000000000:JSS=000001:LM=000000:LRMS=000000:APP=000000:LBS=000000
- Source = JobController
- Timestamp = Thu Jun 2 05:40:03 2011 CEST
- User = /DC=ch/DC=cern/OU=Organic
Units/OU=Users/CN=lwikstro/CN=627993/CN=Gustav
Wikstrom/CN=proxy/CN=proxy
---
Event: Transfer
- Arrived = Thu Jun 2 05:40:04 2011 CEST
- Dest host =
https://lcgce05.gridpp.rl.ac.uk:8443/ce-cream/services/CREAM2
- Dest instance = unavailable
- Dest jobid = unavailable
- Destination = LRMS
- Host = lcgwms03.gridpp.rl.ac.uk
- Level = SYSTEM
- Priority = synchronous
- Reason = unavailable
- Result = START
- Seqcode =
UI=000000:NS=0000000004:WM=000004:BH=0000000000:JSS=000002:LM=000001:LRMS=000000:APP=000000:LBS=000000
- Source = LogMonitor
- Timestamp = Thu Jun 2 05:40:04 2011 CEST
- User = /DC=ch/DC=cern/OU=Organic
Units/OU=Users/CN=lwikstro/CN=627993/CN=Gustav
Wikstrom/CN=proxy/CN=proxy
- Job =
[
RetryCount = 3;
LB_sequence_code =
"UI=000000:NS=0000000004:WM=000004:BH=0000000000:JSS=000002:LM=000000:LRMS=000000:APP=000000:LBS=000000";
ReallyRunningToken =
"gsiftp://lcgwms03.gridpp.rl.ac.uk/var/glite/SandboxDir/ZI/https_3a_2f_2flcglb01.gridpp.rl.ac.uk_3a9000_2fZIhUvnWm77QBohlunQlXPw/token.txt";
edg_jobid =
"https://lcglb01.gridpp.rl.ac.uk:9000/ZIhUvnWm77QBohlunQlXPw";
lrms_type = "torque";
CeRequirements = "true && ( true && (
Member(\"VO-t2k.org-ND280-v9r7p9\",other.GlueHostApplicationSoftwareRunTimeEnvironment)
&& other.GlueCEPolicyMaxCPUTime > 600 &&
other.GlueHostMainMemoryRAMSize >= 512 ) )";
Arguments = "-v v9r7p9 -i
lfn:/grid/t2k.org/nd280/production004/B/rdp/ND280/00006000_00006999/reco/oa_nd_cos_00006945-0166_muaf23qoshfp_reco_000_v9r7p5.root
-e cosmic -p 4C -t rdp -m oaAnalysis";
CertificateSubject = "/DC=ch/DC=cern/OU=Organic
Units/OU=Users/CN=lwikstro/CN=627993/CN=Gustav Wikstrom";
MyProxyServer = "lcgrbp01.gridpp.rl.ac.uk";
ce_id = "lcgce05.gridpp.rl.ac.uk:8443/cream-pbs-grid500M";
QueueName = "grid500M";
JobType = "normal";
Executable = "ND280Raw_process.py";
VirtualOrganisation = "t2k.org";
SignificantAttributes = { "Requirements","Rank","FuzzyRank" };
InputSandbox = {
"gsiftp://lcgwms03.gridpp.rl.ac.uk:2811/var/glite/SandboxDir/ZI/https_3a_2f_2flcglb01.gridpp.rl.ac.uk_3a9000_2fZIhUvnWm77QBohlunQlXPw/input/ND280Configs.py","gsiftp://lcgwms03.gridpp.rl.ac.uk:2811/var/glite/SandboxDir/ZI/https_3a_2f_2flcglb01.gridpp.rl.ac.uk_3a9000_2fZIhUvnWm77QBohlunQlXPw/input/ND280GRID.py","gsiftp://lcgwms03.gridpp.rl.ac.uk:2811/var/glite/SandboxDir/ZI/https_3a_2f_2flcglb01.gridpp.rl.ac.uk_3a9000_2fZIhUvnWm77QBohlunQlXPw/input/ND280Job.py","gsiftp://lcgwms03.gridpp.rl.ac.uk:2811/var/glite/SandboxDir/ZI/https_3a_2f_2flcglb01.gridpp.rl.ac.uk_3a9000_2fZIhUvnWm77QBohlunQlXPw/input/ND280Software.py","gsiftp://lcgwms03.gridpp.rl.ac.uk:2811/var/glite/SandboxDir/ZI/https_3a_2f_2flcglb01.gridpp.rl.ac.uk_3a9000_2fZIhUvnWm77QBohlunQlXPw/input/pexpect.py","gsiftp://lcgwms03.gridpp.rl.ac.uk:2811/var/glite/SandboxDir/ZI/https_3a_2f_2flcglb01.gridpp.rl.ac.uk_3a9000_2fZIhUvnWm77QBohlunQlXPw/input/ND280Raw_process.py","gsiftp://lcgwms03.gridpp.rl.ac.uk:2811/var/glite/SandboxDir/ZI/https_3a_2f_2flcglb01.gridpp.rl.ac.uk_3a9000_2fZIhUvnWm77QBohlunQlXPw/input/.BrokerInfo"
};
StdOutput = "ND280Raw.out";
ShallowRetryCount = 10;
InputSandboxDestFileName = {
"ND280Configs.py","ND280GRID.py","ND280Job.py","ND280Software.py","pexpect.py","ND280Raw_process.py"
};
VOMS_FQAN = "/t2k.org/Role=production/Capability=NULL";
OutputSandboxPath =
"/var/glite/SandboxDir/ZI/https_3a_2f_2flcglb01.gridpp.rl.ac.uk_3a9000_2fZIhUvnWm77QBohlunQlXPw/output";
requirements = ( (
Member("VO-t2k.org-ND280-v9r7p9",other.GlueHostApplicationSoftwareRunTimeEnvironment)
&& other.GlueCEPolicyMaxCPUTime > 600 &&
other.GlueHostMainMemoryRAMSize >= 512 ) && ( other.GlueCEStateStatus
== "Production" ) ) && !RegExp(".*sdj$",other.GlueCEUniqueID);
DataRequirements = {
[
DataCatalog = "http://lfc.gridpp.rl.ac.uk:8085/";
InputData = {
"lfn:/grid/t2k.org/nd280/production004/B/rdp/ND280/00006000_00006999/reco/oa_nd_cos_00006945-0166_muaf23qoshfp_reco_000_v9r7p5.root"
};
DataCatalogType = "DLI"
] };
rank = -other.GlueCEStateEstimatedResponseTime;
Type = "job";
OutputSandboxDestURI = {
"gsiftp://lcgwms03.gridpp.rl.ac.uk:2811/var/glite/SandboxDir/ZI/https_3a_2f_2flcglb01.gridpp.rl.ac.uk_3a9000_2fZIhUvnWm77QBohlunQlXPw/output/ND280Raw.out","gsiftp://lcgwms03.gridpp.rl.ac.uk:2811/var/glite/SandboxDir/ZI/https_3a_2f_2flcglb01.gridpp.rl.ac.uk_3a9000_2fZIhUvnWm77QBohlunQlXPw/output/ND280Raw.err","gsiftp://lcgwms03.gridpp.rl.ac.uk:2811/var/glite/SandboxDir/ZI/https_3a_2f_2flcglb01.gridpp.rl.ac.uk_3a9000_2fZIhUvnWm77QBohlunQlXPw/output/Raw_cosmic_00006945-0166_v9r7p9.cfg"
};
StdError = "ND280Raw.err";
DataAccessProtocol = "gsiftp";
DefaultRank = -other.GlueCEStateEstimatedResponseTime;
CeApplicationDir = "/stage/sl3-lcg-exp/t2ksgm";
WMPInputSandboxBaseURI =
"gsiftp://lcgwms03.gridpp.rl.ac.uk:2811/var/glite/SandboxDir/ZI/https_3a_2f_2flcglb01.gridpp.rl.ac.uk_3a9000_2fZIhUvnWm77QBohlunQlXPw";
AllowZippedISB = true;
ZippedISB = { "ISBfiles_avupuyXqSP-D8K3jUCJpCA_0.tar.gz" };
X509UserProxy =
"/var/glite/SandboxDir/ZI/https_3a_2f_2flcglb01.gridpp.rl.ac.uk_3a9000_2fZIhUvnWm77QBohlunQlXPw/user.proxy";
GlobusResourceContactString = "lcgce05.gridpp.rl.ac.uk:8443/cream-pbs";
InputSandboxPath =
"/var/glite/SandboxDir/ZI/https_3a_2f_2flcglb01.gridpp.rl.ac.uk_3a9000_2fZIhUvnWm77QBohlunQlXPw/input";
OutputSandbox = {
"ND280Raw.out","ND280Raw.err","Raw_cosmic_00006945-0166_v9r7p9.cfg" }
]
---
Event: Running
- Arrived = Thu Jun 2 05:42:39 2011 CEST
- Host = lcg1278.gridpp.rl.ac.uk
- Level = SYSTEM
- Node = lcg1278.gridpp.rl.ac.uk
- Priority = synchronous
- Seqcode =
UI=000000:NS=0000000004:WM=000004:BH=0000000000:JSS=000002:LM=000002:LRMS=000001:APP=000000:LBS=000000
- Source = LRMS
- Timestamp = Thu Jun 2 05:42:39 2011 CEST
- User = /DC=ch/DC=cern/OU=Organic
Units/OU=Users/CN=lwikstro/CN=627993/CN=Gustav Wikstrom
---
Event: ReallyRunning
- Arrived = Thu Jun 2 05:42:46 2011 CEST
- Host = lcg1278.gridpp.rl.ac.uk
- Level = SYSTEM
- Priority = synchronous
- Seqcode =
UI=000000:NS=0000000004:WM=000004:BH=0000000000:JSS=000002:LM=000002:LRMS=000003:APP=000000:LBS=000000
- Source = LRMS
- Timestamp = Thu Jun 2 05:42:46 2011 CEST
- User = /DC=ch/DC=cern/OU=Organic
Units/OU=Users/CN=lwikstro/CN=627993/CN=Gustav Wikstrom
---
Event: Transfer
- Arrived = Thu Jun 2 05:40:05 2011 CEST
- Dest host =
https://lcgce05.gridpp.rl.ac.uk:8443/ce-cream/services/CREAM2
- Dest instance = unavailable
- Dest jobid =
https://lcgce05.gridpp.rl.ac.uk:8443/CREAM840367115
- Destination = LRMS
- Host = lcgwms03.gridpp.rl.ac.uk
- Level = SYSTEM
- Priority = synchronous
- Reason = unavailable
- Result = OK
- Seqcode =
UI=000000:NS=0000000004:WM=000004:BH=0000000000:JSS=000002:LM=000003:LRMS=000000:APP=000000:LBS=000000
- Source = LogMonitor
- Timestamp = Thu Jun 2 05:40:04 2011 CEST
- User = /DC=ch/DC=cern/OU=Organic
Units/OU=Users/CN=lwikstro/CN=627993/CN=Gustav
Wikstrom/CN=proxy/CN=proxy
- Job =
[
RetryCount = 3;
LB_sequence_code =
"UI=000000:NS=0000000004:WM=000004:BH=0000000000:JSS=000002:LM=000002:LRMS=000000:APP=000000:LBS=000000";
ReallyRunningToken =
"gsiftp://lcgwms03.gridpp.rl.ac.uk/var/glite/SandboxDir/ZI/https_3a_2f_2flcglb01.gridpp.rl.ac.uk_3a9000_2fZIhUvnWm77QBohlunQlXPw/token.txt";
edg_jobid =
"https://lcglb01.gridpp.rl.ac.uk:9000/ZIhUvnWm77QBohlunQlXPw";
lrms_type = "torque";
CeRequirements = "true && ( true && (
Member(\"VO-t2k.org-ND280-v9r7p9\",other.GlueHostApplicationSoftwareRunTimeEnvironment)
&& other.GlueCEPolicyMaxCPUTime > 600 &&
other.GlueHostMainMemoryRAMSize >= 512 ) )";
Arguments = "-v v9r7p9 -i
lfn:/grid/t2k.org/nd280/production004/B/rdp/ND280/00006000_00006999/reco/oa_nd_cos_00006945-0166_muaf23qoshfp_reco_000_v9r7p5.root
-e cosmic -p 4C -t rdp -m oaAnalysis";
CertificateSubject = "/DC=ch/DC=cern/OU=Organic
Units/OU=Users/CN=lwikstro/CN=627993/CN=Gustav Wikstrom";
MyProxyServer = "lcgrbp01.gridpp.rl.ac.uk";
ce_id = "lcgce05.gridpp.rl.ac.uk:8443/cream-pbs-grid500M";
QueueName = "grid500M";
JobType = "normal";
Executable = "ND280Raw_process.py";
VirtualOrganisation = "t2k.org";
SignificantAttributes = { "Requirements","Rank","FuzzyRank" };
InputSandbox = {
"gsiftp://lcgwms03.gridpp.rl.ac.uk:2811/var/glite/SandboxDir/ZI/https_3a_2f_2flcglb01.gridpp.rl.ac.uk_3a9000_2fZIhUvnWm77QBohlunQlXPw/input/ND280Configs.py","gsiftp://lcgwms03.gridpp.rl.ac.uk:2811/var/glite/SandboxDir/ZI/https_3a_2f_2flcglb01.gridpp.rl.ac.uk_3a9000_2fZIhUvnWm77QBohlunQlXPw/input/ND280GRID.py","gsiftp://lcgwms03.gridpp.rl.ac.uk:2811/var/glite/SandboxDir/ZI/https_3a_2f_2flcglb01.gridpp.rl.ac.uk_3a9000_2fZIhUvnWm77QBohlunQlXPw/input/ND280Job.py","gsiftp://lcgwms03.gridpp.rl.ac.uk:2811/var/glite/SandboxDir/ZI/https_3a_2f_2flcglb01.gridpp.rl.ac.uk_3a9000_2fZIhUvnWm77QBohlunQlXPw/input/ND280Software.py","gsiftp://lcgwms03.gridpp.rl.ac.uk:2811/var/glite/SandboxDir/ZI/https_3a_2f_2flcglb01.gridpp.rl.ac.uk_3a9000_2fZIhUvnWm77QBohlunQlXPw/input/pexpect.py","gsiftp://lcgwms03.gridpp.rl.ac.uk:2811/var/glite/SandboxDir/ZI/https_3a_2f_2flcglb01.gridpp.rl.ac.uk_3a9000_2fZIhUvnWm77QBohlunQlXPw/input/ND280Raw_process.py","gsiftp://lcgwms03.gridpp.rl.ac.uk:2811/var/glite/SandboxDir/ZI/https_3a_2f_2flcglb01.gridpp.rl.ac.uk_3a9000_2fZIhUvnWm77QBohlunQlXPw/input/.BrokerInfo"
};
StdOutput = "ND280Raw.out";
ShallowRetryCount = 10;
InputSandboxDestFileName = {
"ND280Configs.py","ND280GRID.py","ND280Job.py","ND280Software.py","pexpect.py","ND280Raw_process.py"
};
VOMS_FQAN = "/t2k.org/Role=production/Capability=NULL";
OutputSandboxPath =
"/var/glite/SandboxDir/ZI/https_3a_2f_2flcglb01.gridpp.rl.ac.uk_3a9000_2fZIhUvnWm77QBohlunQlXPw/output";
requirements = ( (
Member("VO-t2k.org-ND280-v9r7p9",other.GlueHostApplicationSoftwareRunTimeEnvironment)
&& other.GlueCEPolicyMaxCPUTime > 600 &&
other.GlueHostMainMemoryRAMSize >= 512 ) && ( other.GlueCEStateStatus
== "Production" ) ) && !RegExp(".*sdj$",other.GlueCEUniqueID);
DataRequirements = {
[
DataCatalog = "http://lfc.gridpp.rl.ac.uk:8085/";
InputData = {
"lfn:/grid/t2k.org/nd280/production004/B/rdp/ND280/00006000_00006999/reco/oa_nd_cos_00006945-0166_muaf23qoshfp_reco_000_v9r7p5.root"
};
DataCatalogType = "DLI"
] };
rank = -other.GlueCEStateEstimatedResponseTime;
Type = "job";
OutputSandboxDestURI = {
"gsiftp://lcgwms03.gridpp.rl.ac.uk:2811/var/glite/SandboxDir/ZI/https_3a_2f_2flcglb01.gridpp.rl.ac.uk_3a9000_2fZIhUvnWm77QBohlunQlXPw/output/ND280Raw.out","gsiftp://lcgwms03.gridpp.rl.ac.uk:2811/var/glite/SandboxDir/ZI/https_3a_2f_2flcglb01.gridpp.rl.ac.uk_3a9000_2fZIhUvnWm77QBohlunQlXPw/output/ND280Raw.err","gsiftp://lcgwms03.gridpp.rl.ac.uk:2811/var/glite/SandboxDir/ZI/https_3a_2f_2flcglb01.gridpp.rl.ac.uk_3a9000_2fZIhUvnWm77QBohlunQlXPw/output/Raw_cosmic_00006945-0166_v9r7p9.cfg"
};
StdError = "ND280Raw.err";
DataAccessProtocol = "gsiftp";
DefaultRank = -other.GlueCEStateEstimatedResponseTime;
CeApplicationDir = "/stage/sl3-lcg-exp/t2ksgm";
WMPInputSandboxBaseURI =
"gsiftp://lcgwms03.gridpp.rl.ac.uk:2811/var/glite/SandboxDir/ZI/https_3a_2f_2flcglb01.gridpp.rl.ac.uk_3a9000_2fZIhUvnWm77QBohlunQlXPw";
AllowZippedISB = true;
ZippedISB = { "ISBfiles_avupuyXqSP-D8K3jUCJpCA_0.tar.gz" };
X509UserProxy =
"/var/glite/SandboxDir/ZI/https_3a_2f_2flcglb01.gridpp.rl.ac.uk_3a9000_2fZIhUvnWm77QBohlunQlXPw/user.proxy";
GlobusResourceContactString = "lcgce05.gridpp.rl.ac.uk:8443/cream-pbs";
InputSandboxPath =
"/var/glite/SandboxDir/ZI/https_3a_2f_2flcglb01.gridpp.rl.ac.uk_3a9000_2fZIhUvnWm77QBohlunQlXPw/input";
OutputSandbox = {
"ND280Raw.out","ND280Raw.err","Raw_cosmic_00006945-0166_v9r7p9.cfg" }
]
---
Event: Running
- Arrived = Thu Jun 2 05:50:58 2011 CEST
- Host = lcgwms03.gridpp.rl.ac.uk
- Level = SYSTEM
- Node = lcg1278.gridpp.rl.ac.uk
- Priority = synchronous
- Seqcode =
UI=000000:NS=0000000004:WM=000004:BH=0000000000:JSS=000002:LM=000005:LRMS=000000:APP=000000:LBS=000000
- Source = LogMonitor
- Timestamp = Thu Jun 2 05:50:58 2011 CEST
- User = /DC=ch/DC=cern/OU=Organic
Units/OU=Users/CN=lwikstro/CN=627993/CN=Gustav
Wikstrom/CN=proxy/CN=proxy
---
Event: ReallyRunning
- Arrived = Thu Jun 2 05:50:59 2011 CEST
- Host = lcgwms03.gridpp.rl.ac.uk
- Level = SYSTEM
- Priority = synchronous
- Seqcode =
UI=000000:NS=0000000004:WM=000004:BH=0000000000:JSS=000002:LM=000007:LRMS=000000:APP=000000:LBS=000000
- Source = LogMonitor
- Timestamp = Thu Jun 2 05:50:58 2011 CEST
- User = /DC=ch/DC=cern/OU=Organic
Units/OU=Users/CN=lwikstro/CN=627993/CN=Gustav
Wikstrom/CN=proxy/CN=proxy
- Wn seq =
UI=000000:NS=0000000004:WM=000004:BH=0000000000:JSS=000002:LM=000002:LRMS=000000:APP=000000:LBS=000000
---
Event: Cancel
- Arrived = Thu Jun 2 06:03:11 2011 CEST
- Host = lcgwms03.gridpp.rl.ac.uk
- Level = SYSTEM
- Priority = synchronous
- Seqcode =
UI=000000:NS=0000000004:WM=000004:BH=0000000000:JSS=000002:LM=000009:LRMS=000000:APP=000000:LBS=000000
- Source = LogMonitor
- Status code = DONE
- Timestamp = Thu Jun 2 06:03:11 2011 CEST
- User = /DC=ch/DC=cern/OU=Organic
Units/OU=Users/CN=lwikstro/CN=627993/CN=Gustav
Wikstrom/CN=proxy/CN=proxy
---
Event: Done
- Arrived = Thu Jun 2 06:03:11 2011 CEST
- Exit code = 0
- Host = lcgwms03.gridpp.rl.ac.uk
- Level = SYSTEM
- Priority = synchronous
- Seqcode =
UI=000000:NS=0000000004:WM=000004:BH=0000000000:JSS=000002:LM=000010:LRMS=000000:APP=000000:LBS=000000
- Source = LogMonitor
- Status code = CANCELLED
- Timestamp = Thu Jun 2 06:03:11 2011 CEST
- User = /DC=ch/DC=cern/OU=Organic
Units/OU=Users/CN=lwikstro/CN=627993/CN=Gustav
Wikstrom/CN=proxy/CN=proxy
---
Event: Clear
- Arrived = Thu Jun 2 06:03:12 2011 CEST
- Host = lcgwms03.gridpp.rl.ac.uk
- Level = SYSTEM
- Priority = synchronous
- Reason = 1
- Seqcode =
UI=000009:NS=0000096670:WM=000000:BH=0000000000:JSS=000000:LM=000000:LRMS=000000:APP=000000:LBS=000000
- Source = NetworkServer
- Src instance = 22543
- Timestamp = Thu Jun 2 06:03:12 2011 CEST
- User =
[log in to unmask]
==========================================================================
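
For walking through a dump like the one above, here is a minimal sketch in Python that splits the output into events and measures the time between two of them. It assumes the layout shown above ("Event:" headers, "- Key = value" fields, long values wrapped onto the following lines) and English day and month names in the timestamps. The names parse_events, parse_time, when and SAMPLE are made up for this illustration and are not part of any gLite tool.

```python
import re
from datetime import datetime

# A field line looks like "- Key = value"; the value may be empty
# when it was wrapped onto the next line.
FIELD = re.compile(r"^- ([^=]+?) =\s*(.*)$")

# Short excerpt in the same layout as the log above (fields trimmed).
SAMPLE = """\
---
Event: Running
- Arrived = Thu Jun 2 05:50:58 2011 CEST
- Node = lcg1278.gridpp.rl.ac.uk
- Source = LogMonitor
- Timestamp = Thu Jun 2 05:50:58 2011 CEST
- User = /DC=ch/DC=cern/OU=Organic
Units/OU=Users/CN=lwikstro/CN=627993/CN=Gustav
Wikstrom/CN=proxy/CN=proxy
---
Event: Cancel
- Source = LogMonitor
- Status code = DONE
- Timestamp = Thu Jun 2 06:03:11 2011 CEST
"""


def parse_events(text):
    """Return one dict per 'Event:' block; wrapped values are re-joined."""
    events, current, key = [], None, None
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("Event:"):
            current = {"Event": line.split(":", 1)[1].strip()}
            events.append(current)
            key = None
        elif line.startswith("===="):
            current, key = None, None  # banner line: outside any event
        elif current is None or not line or line == "---":
            key = None
        else:
            match = FIELD.match(line)
            if match:
                key = match.group(1)
                current[key] = match.group(2)
            elif key is not None:
                # Continuation of a value wrapped onto the next line.
                current[key] = (current[key] + " " + line).strip()
    return events


def parse_time(value):
    """Parse e.g. 'Thu Jun 2 05:39:57 2011 CEST', ignoring the zone name."""
    stamp = value.rsplit(" ", 1)[0]
    return datetime.strptime(stamp, "%a %b %d %H:%M:%S %Y")


def when(events, name, source):
    """Timestamp of the first event with this name and source, or None."""
    for ev in events:
        if ev["Event"] == name and ev.get("Source") == source:
            return parse_time(ev["Timestamp"])
    return None


if __name__ == "__main__":
    events = parse_events(SAMPLE)
    for ev in events:
        print(ev["Timestamp"], ev["Event"], ev.get("Source", ""),
              ev.get("Status code", ""))
    ran = when(events, "Cancel", "LogMonitor") - when(events, "Running", "LogMonitor")
    print("LogMonitor Running -> Cancel:", ran)
```

Feeding the full output to parse_events instead of SAMPLE gives the whole timeline, one line per event. Filtering on Source matters because both the LRMS and the LogMonitor log a Running event, and they are reported at different times.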
2011/6/2 Stuart Purdie <[log in to unmask]>:
>
> On 2 Jun 2011, at 09:49, Gustav Wikström wrote:
>
>> Hi all,
>>
>> I'm having serious problems with running my VO t2k.org jobs, currently
>> 95% of them are being cancelled by the WMSs (lcgwms03.gridpp.rl.ac.uk
>> and wms02.grid.hep.ic.ac.uk) or the CEs. As I understand it, when a
>> WMS stops a job, it is labeled Aborted, and then Cancelled is when a
>> CE stops a job? The bad thing is that there is no information about a
>> job after it has been stopped unless it failed.
>>
>> So, what could cause a job to be cancelled? Is memory usage one of the reasons?
>
> Not the most likely culprit, as it's not the most strongly enforced constraint across all sites, but it is possible. It does have a bit of a site dependence, so if the 5% that don't get cancelled end up on a different site, that's useful data. Job CPU use and wall time are more strongly enforced, but it could also be missing input files causing the jobs to die on start up.
>
> If it's (apparently) randomly distributed across all sites, the first thing I'd be checking is proxy lifespans, job queueing time and myproxy stuff (if used).
>
> There might be more information lurking around, which, if you've not tried already, can be released with 'glite-wms-job-status --verbosity 3 <jid>', and 'glite-wms-job-logging-info --verbosity 3 <jid>'
> which might give more idea on where to poke at next. In particular, the WMS (by default) will try re-submitting a failed job a couple of times, and walking through that process might be informative. The amount of time jobs spend running might also help identify the root problem.
>
>
>