I submitted a job about 12 days ago. Its dg_jobId is:
https://gm03.hep.ph.ic.ac.uk:7846/130.246.183.172/111355320838832?gm03.hep.ph.ic.ac.uk:7771
I also discovered it sitting around when I ran dg-job-get-status --all;
however, this one appears to be semi-conscious (well, let's say the life
support hasn't been turned off yet). The job was almost certainly of the
"echo Hello World" variety. It is moving very slowly through Grid World,
and keeps taking multi-day vacations. Attached is a copy of
dg-job-get-logging-info output. Below is my deciphered diary of events.
What causes this sort of behaviour?
Is this just because there is a major backlog of jobs that the RB is trying
to schedule on over-committed resources?
If that is the case, then how do I:
1) Find out how many jobs the RB is trying to schedule; and,
2) Find out what the state of the queues is at the various sites the RB
serves?
Cheers,
Ian.
Fri May 23 11:13:56 2003: gm03.hep.ph.ic.ac.uk (RB) accepts job
Fri May 23 11:13:56 2003: gppui04.gridpp.rl.ac.uk transfers job to RB
... four days pass ...
Tue May 27 14:21:37 2003: RB matches job to
tuber5.phy.bris.ac.uk:2119/jobmanager-pbs-tbq
Tue May 27 14:21:41 2003: JSS refuses job at gm03 with "condor command
failed" msg
Tue May 27 14:21:41 2003: RB accepts job back from JSS
Tue May 27 14:21:58 2003: RB logs JobPending state with reason
"Resubmitting"
... six days pass ...
Mon Jun 2 11:15:26 2003: RB matches job to
bottom.phy.bris.ac.uk:2119/jobmanager-pbs-gridq
Mon Jun 2 11:16:22 2003: JSS refuses job as before, RB accepts job back,
logs JobPending
... two hours pass ...
Mon Jun 2 13:33:42 2003: RB matches job to
tuber5.phy.bris.ac.uk:2119/jobmanager-pbs-bseq
Mon Jun 2 13:34:33 2003: JSS refuses job as before, RB accepts job back,
logs JobPending
... 90 minutes pass ...
Mon Jun 2 15:06:45 2003: RB matches job to
testbed008.cnaf.infn.it:2119/jobmanager-pbs-short
Mon Jun 2 15:32:12 2003: JSS refuses job as before (but takes 25 minutes to
decide)
Mon Jun 2 15:32:30 2003: RB holds job in JobPending state with reason
"Resubmitting"
... one day passes ...
and there are no more log entries. Its current status is "Waiting", with
StatusReason "Resubmitting".
--
Ian Stokes-Rees
Particle Physics, Oxford http://www-pnp.physics.ox.ac.uk/~stokes/
**********************************************************************
LOGGING INFORMATION:
Printing info for the Job : https://gm03.hep.ph.ic.ac.uk:7846/130.246.183.172/111355320838832?gm03.hep.ph.ic.ac.uk:7771
---
Event Type = JobAccept
dg_jobId = https://gm03.hep.ph.ic.ac.uk:7846/130.246.183.172/111355320838832?gm03.hep.ph.ic.ac.uk:7771
Certificate Subject = /O=Grid/O=UKHEP/CN=host/gm03.hep.ph.ic.ac.uk
Logging Level = System
Date (UTC) = Fri May 23 11:13:56 2003
Job Accept New Id = RB assigned ID
Job Accept Source = UserInterface
Host Name = gm03
Source Program = ResourceBroker
---
Event Type = JobTransfer
dg_jobId = https://gm03.hep.ph.ic.ac.uk:7846/130.246.183.172/111355320838832?gm03.hep.ph.ic.ac.uk:7771
Certificate Subject = /O=Grid/O=UKHEP/OU=physics.ox.ac.uk/CN=Ian Stokes-Rees
Logging Level = System
Date (UTC) = Fri May 23 11:13:56 2003
Job Transfer Dest = ResourceBroker/gm03.hep.ph.ic.ac.uk:7771
Job Transfer Result = OK
Host Name = gppui04.gridpp.rl.ac.uk
Source Program = UserInterface
---
Event Type = JobMatch
dg_jobId = https://gm03.hep.ph.ic.ac.uk:7846/130.246.183.172/111355320838832?gm03.hep.ph.ic.ac.uk:7771
Certificate Subject = /O=Grid/O=UKHEP/CN=host/gm03.hep.ph.ic.ac.uk
Logging Level = System
Date (UTC) = Tue May 27 14:21:37 2003
Job Match Destination = tuber5.phy.bris.ac.uk:2119/jobmanager-pbs-tbq
Host Name = gm03
Source Program = ResourceBroker
---
Event Type = JobRefuse
dg_jobId = https://gm03.hep.ph.ic.ac.uk:7846/130.246.183.172/111355320838832?gm03.hep.ph.ic.ac.uk:7771
Certificate Subject = /O=Grid/O=UKHEP/CN=host/gm03.hep.ph.ic.ac.uk
Logging Level = System
Date (UTC) = Tue May 27 14:21:41 2003
Job Refuse Reason = Submitting job(s) - condor command failed
Job Refuse Source = JobSubmissionService
Host Name = gm03
Source Program = JobSubmissionService
---
Event Type = JobAccept
dg_jobId = https://gm03.hep.ph.ic.ac.uk:7846/130.246.183.172/111355320838832?gm03.hep.ph.ic.ac.uk:7771
Certificate Subject = /O=Grid/O=UKHEP/CN=host/gm03.hep.ph.ic.ac.uk
Logging Level = System
Date (UTC) = Tue May 27 14:21:41 2003
Job Accept New Id = Sent back by JSS
Job Accept Source = JobSubmissionService
Host Name = gm03
Source Program = ResourceBroker
---
Event Type = JobPending
Job Pending Reason = Resubmitting.
dg_jobId = https://gm03.hep.ph.ic.ac.uk:7846/130.246.183.172/111355320838832?gm03.hep.ph.ic.ac.uk:7771
Certificate Subject = /O=Grid/O=UKHEP/CN=host/gm03.hep.ph.ic.ac.uk
Logging Level = System
Date (UTC) = Tue May 27 14:21:58 2003
Host Name = gm03
Source Program = ResourceBroker
---
Event Type = JobMatch
dg_jobId = https://gm03.hep.ph.ic.ac.uk:7846/130.246.183.172/111355320838832?gm03.hep.ph.ic.ac.uk:7771
Certificate Subject = /O=Grid/O=UKHEP/CN=host/gm03.hep.ph.ic.ac.uk
Logging Level = System
Date (UTC) = Mon Jun 2 11:15:26 2003
Job Match Destination = bottom.phy.bris.ac.uk:2119/jobmanager-pbs-gridq
Host Name = gm03
Source Program = ResourceBroker
---
Event Type = JobRefuse
dg_jobId = https://gm03.hep.ph.ic.ac.uk:7846/130.246.183.172/111355320838832?gm03.hep.ph.ic.ac.uk:7771
Certificate Subject = /O=Grid/O=UKHEP/CN=host/gm03.hep.ph.ic.ac.uk
Logging Level = System
Date (UTC) = Mon Jun 2 11:16:22 2003
Job Refuse Reason = Submitting job(s) - condor command failed
Job Refuse Source = JobSubmissionService
Host Name = gm03
Source Program = JobSubmissionService
---
Event Type = JobAccept
dg_jobId = https://gm03.hep.ph.ic.ac.uk:7846/130.246.183.172/111355320838832?gm03.hep.ph.ic.ac.uk:7771
Certificate Subject = /O=Grid/O=UKHEP/CN=host/gm03.hep.ph.ic.ac.uk
Logging Level = System
Date (UTC) = Mon Jun 2 11:16:23 2003
Job Accept New Id = Sent back by JSS
Job Accept Source = JobSubmissionService
Host Name = gm03
Source Program = ResourceBroker
---
Event Type = JobPending
Job Pending Reason = Resubmitting.
dg_jobId = https://gm03.hep.ph.ic.ac.uk:7846/130.246.183.172/111355320838832?gm03.hep.ph.ic.ac.uk:7771
Certificate Subject = /O=Grid/O=UKHEP/CN=host/gm03.hep.ph.ic.ac.uk
Logging Level = System
Date (UTC) = Mon Jun 2 11:16:45 2003
Host Name = gm03
Source Program = ResourceBroker
---
Event Type = JobMatch
dg_jobId = https://gm03.hep.ph.ic.ac.uk:7846/130.246.183.172/111355320838832?gm03.hep.ph.ic.ac.uk:7771
Certificate Subject = /O=Grid/O=UKHEP/CN=host/gm03.hep.ph.ic.ac.uk
Logging Level = System
Date (UTC) = Mon Jun 2 13:33:42 2003
Job Match Destination = tuber5.phy.bris.ac.uk:2119/jobmanager-pbs-bseq
Host Name = gm03
Source Program = ResourceBroker
---
Event Type = JobRefuse
dg_jobId = https://gm03.hep.ph.ic.ac.uk:7846/130.246.183.172/111355320838832?gm03.hep.ph.ic.ac.uk:7771
Certificate Subject = /O=Grid/O=UKHEP/CN=host/gm03.hep.ph.ic.ac.uk
Logging Level = System
Date (UTC) = Mon Jun 2 13:34:33 2003
Job Refuse Reason = Submitting job(s) - condor command failed
Job Refuse Source = JobSubmissionService
Host Name = gm03
Source Program = JobSubmissionService
---
Event Type = JobAccept
dg_jobId = https://gm03.hep.ph.ic.ac.uk:7846/130.246.183.172/111355320838832?gm03.hep.ph.ic.ac.uk:7771
Certificate Subject = /O=Grid/O=UKHEP/CN=host/gm03.hep.ph.ic.ac.uk
Logging Level = System
Date (UTC) = Mon Jun 2 13:34:36 2003
Job Accept New Id = Sent back by JSS
Job Accept Source = JobSubmissionService
Host Name = gm03
Source Program = ResourceBroker
---
Event Type = JobPending
Job Pending Reason = Resubmitting.
dg_jobId = https://gm03.hep.ph.ic.ac.uk:7846/130.246.183.172/111355320838832?gm03.hep.ph.ic.ac.uk:7771
Certificate Subject = /O=Grid/O=UKHEP/CN=host/gm03.hep.ph.ic.ac.uk
Logging Level = System
Date (UTC) = Mon Jun 2 13:40:24 2003
Host Name = gm03
Source Program = ResourceBroker
---
Event Type = JobMatch
dg_jobId = https://gm03.hep.ph.ic.ac.uk:7846/130.246.183.172/111355320838832?gm03.hep.ph.ic.ac.uk:7771
Certificate Subject = /O=Grid/O=UKHEP/CN=host/gm03.hep.ph.ic.ac.uk
Logging Level = System
Date (UTC) = Mon Jun 2 15:06:45 2003
Job Match Destination = testbed008.cnaf.infn.it:2119/jobmanager-pbs-short
Host Name = gm03
Source Program = ResourceBroker
---
Event Type = JobRefuse
dg_jobId = https://gm03.hep.ph.ic.ac.uk:7846/130.246.183.172/111355320838832?gm03.hep.ph.ic.ac.uk:7771
Certificate Subject = /O=Grid/O=UKHEP/CN=host/gm03.hep.ph.ic.ac.uk
Logging Level = System
Date (UTC) = Mon Jun 2 15:32:12 2003
Job Refuse Reason = Submitting job(s) - condor command failed
Job Refuse Source = JobSubmissionService
Host Name = gm03
Source Program = JobSubmissionService
---
Event Type = JobAccept
dg_jobId = https://gm03.hep.ph.ic.ac.uk:7846/130.246.183.172/111355320838832?gm03.hep.ph.ic.ac.uk:7771
Certificate Subject = /O=Grid/O=UKHEP/CN=host/gm03.hep.ph.ic.ac.uk
Logging Level = System
Date (UTC) = Mon Jun 2 15:32:12 2003
Job Accept New Id = Sent back by JSS
Job Accept Source = JobSubmissionService
Host Name = gm03
Source Program = ResourceBroker
---
Event Type = JobPending
Job Pending Reason = Resubmitting.
dg_jobId = https://gm03.hep.ph.ic.ac.uk:7846/130.246.183.172/111355320838832?gm03.hep.ph.ic.ac.uk:7771
Certificate Subject = /O=Grid/O=UKHEP/CN=host/gm03.hep.ph.ic.ac.uk
Logging Level = System
Date (UTC) = Mon Jun 2 15:32:30 2003
Host Name = gm03
Source Program = ResourceBroker
**********************************************************************
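
The diary near the top of this message was deciphered by hand from the output above. A minimal Python sketch of the same idea follows, assuming the output keeps the layout shown here ("Key = Value" lines, records separated by "---") and that the dates parse in an English locale. The names parse_events, diary, and SAMPLE are hypothetical helpers for illustration, not part of the dg-job-* tools, and SAMPLE is a trimmed excerpt of three records from the log above.

```python
from datetime import datetime

# Layout of the "Date (UTC)" field as it appears in the output above.
# Assumption: month and weekday names are in English (C locale).
DATE_FORMAT = "%a %b %d %H:%M:%S %Y"

# Trimmed excerpt of three records from the attached log.
SAMPLE = """\
LOGGING INFORMATION:
---
Event Type = JobMatch
Date (UTC) = Tue May 27 14:21:37 2003
Job Match Destination = tuber5.phy.bris.ac.uk:2119/jobmanager-pbs-tbq
Source Program = ResourceBroker
---
Event Type = JobRefuse
Date (UTC) = Tue May 27 14:21:41 2003
Job Refuse Reason = Submitting job(s) - condor command failed
Source Program = JobSubmissionService
---
Event Type = JobMatch
Date (UTC) = Mon Jun 2 11:15:26 2003
Job Match Destination = bottom.phy.bris.ac.uk:2119/jobmanager-pbs-gridq
Source Program = ResourceBroker
"""


def parse_events(text):
    """Split the logging output into one dict per event record."""
    events = []
    current = {}
    for line in text.splitlines():
        line = line.strip()
        if line == "---" or line.startswith("*****"):
            # A separator closes the record collected so far.
            if current:
                events.append(current)
            current = {}
        elif " = " in line:
            key, value = line.split(" = ", 1)
            current[key.strip()] = value.strip()
    if current:
        events.append(current)
    # Keep only real event records (header lines carry no "Event Type").
    return [e for e in events if "Event Type" in e]


def diary(events):
    """Return (time, event type, detail, gap since previous event) tuples."""
    rows = []
    previous = None
    for event in events:
        when = datetime.strptime(event["Date (UTC)"], DATE_FORMAT)
        gap = when - previous if previous is not None else None
        detail = (
            event.get("Job Match Destination")
            or event.get("Job Refuse Reason")
            or event.get("Job Pending Reason")
            or ""
        )
        rows.append((when, event["Event Type"], detail, gap))
        previous = when
    return rows


if __name__ == "__main__":
    for when, kind, detail, gap in diary(parse_events(SAMPLE)):
        waited = "" if gap is None else "  (+%s)" % gap
        print("%s  %-10s %s%s" % (when, kind, detail, waited))
```

Feeding the full saved output to parse_events instead of SAMPLE should list every event with the time elapsed since the previous one, which makes the multi-day gaps easy to spot.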