Hi,
I suspect the large spike for RALPP last night was one of our CreamCEs
falling over.
For the longer spreads of "background" I see jobs in the accounting with an
Exit Status of 271 (exited due to signal Terminated (15)) which would
initially suggest that it's being killed by the batch system.
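As a rough illustration of how I'm reading that exit status: the sketch below assumes the batch system reports "killed by signal N" as 256 + N, which is consistent with 271 mapping to signal 15 above. The function name and the offset parameter are made up for the example, not part of any batch system tool.

```python
import signal


def decode_exit_status(status, offset=256):
    """Split a batch exit status into (kind, value, signal_name).

    Assumption: statuses >= offset mean the job was killed by signal
    (status - offset); anything lower is the job's own exit code.
    """
    if status >= offset:
        signum = status - offset
        try:
            name = signal.Signals(signum).name
        except ValueError:
            # Signal number not known on this platform
            name = "UNKNOWN"
        return ("signal", signum, name)
    return ("exit", status, None)


print(decode_exit_status(271))
```

On a Linux box this should report signal 15 (SIGTERM) for 271, and treat 0 as a normal exit.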
However, looking at the jobs they are nowhere near the limits, and if I
tracejob one of them I get:
Job: 15204762.heplnx201.pp.rl.ac.uk
06/20/2012 11:19:32 S enqueuing into grid, state 1 hop 1
06/20/2012 11:19:32 S Job Queued at request of
[log in to unmask], owner = [log in to unmask], job
name = cream_565133125, queue = grid
06/20/2012 11:19:32 A queue=grid
06/20/2012 20:53:51 S Job Modified at request of
[log in to unmask]
06/20/2012 20:53:51 S Job Run at request of [log in to unmask]
06/20/2012 20:53:51 S Job Modified at request of
[log in to unmask]
06/20/2012 20:53:51 S post_modify_req: PBSE_UNKJOBID for job
15204762.heplnx201.pp.rl.ac.uk in state RUNNING-STAGEGO, dest =
heplnc369.pp.rl.ac.uk
06/20/2012 20:53:56 A user=prdatl05 group=prdatlas
jobname=cream_565133125 queue=grid ctime=1340187572 qtime=1340187572
etime=1340187572 start=1340222036 [log in to unmask]
exec_host=heplnc369.pp.rl.ac.uk/7
Resource_List.cput=72:00:00 Resource_List.mem=2000mb
Resource_List.neednodes=1 Resource_List.nodect=1 Resource_List.nodes=1
Resource_List.pvmem=4000mb
Resource_List.walltime=96:00:00
06/20/2012 23:57:12 S Job deleted at request of
[log in to unmask]
06/20/2012 23:57:12 S Job sent signal SIGTERM on delete
06/20/2012 23:57:12 A [log in to unmask]
06/20/2012 23:57:14 S Job sent signal SIGKILL on delete
06/20/2012 23:57:27 S Exit_status=271 resources_used.cput=00:41:50
resources_used.mem=1578200kb resources_used.vmem=2174092kb
resources_used.walltime=03:19:28
06/20/2012 23:57:27 A user=prdatl05 group=prdatlas
jobname=cream_565133125 queue=grid ctime=1340187572 qtime=1340187572
etime=1340187572 start=1340222036 [log in to unmask]
exec_host=heplnc369.pp.rl.ac.uk/7
Resource_List.cput=72:00:00 Resource_List.mem=2000mb
Resource_List.neednodes=1 Resource_List.nodect=1 Resource_List.nodes=1
Resource_List.pvmem=4000mb
Resource_List.walltime=96:00:00 session=10915
end=1340233047 Exit_status=271 resources_used.cput=00:41:50
resources_used.mem=1578200kb resources_used.vmem=2174092kb
resources_used.walltime=03:19:28
06/20/2012 23:57:37 S Post job file processing error
06/20/2012 23:58:37 S dequeuing from grid, state COMPLETE
06/21/2012 00:42:24 S Unknown Job Id
06/21/2012 01:04:55 S Unknown Job Id
Notice the "Job deleted at request of [log in to unmask]" at
23:57:12.
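If anyone wants to pull the same delete events out of tracejob output for a batch of jobs, something like the following might do as a starting point. It is only a sketch: it assumes each event starts with an "MM/DD/YYYY HH:MM:SS" timestamp followed by a one-letter source flag, as in the output above, and the function name and the strings it matches on are my own choices.

```python
import re

# Timestamp, one-character source flag (S, A, ...), then the message text
EVENT_RE = re.compile(r"^(\d{2}/\d{2}/\d{4} \d{2}:\d{2}:\d{2})\s+(\S)\s+(.*)$")


def find_delete_events(lines):
    """Return (timestamp, message) for delete/signal events in tracejob lines."""
    events = []
    for line in lines:
        m = EVENT_RE.match(line.strip())
        if not m:
            # Continuation lines (wrapped fields) carry no timestamp; skip them
            continue
        timestamp, _source, message = m.groups()
        if "deleted at request" in message or "sent signal" in message:
            events.append((timestamp, message))
    return events


# A few lines copied from the tracejob output above
sample = [
    "06/20/2012 20:53:51 S Job Run at request of [log in to unmask]",
    "06/20/2012 23:57:12 S Job deleted at request of",
    "06/20/2012 23:57:12 S Job sent signal SIGTERM on delete",
    "06/20/2012 23:57:14 S Job sent signal SIGKILL on delete",
]

for ts, msg in find_delete_events(sample):
    print(ts, msg)
```

Feeding it the lines from several jobs would make it easier to see whether the deletes cluster around the same times as the "Job has been cancelled" entries.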
And looking in the Cream Logs I get:
20 Jun 2012 23:57:15,313 INFO
org.glite.ce.creamapi.jobmanagement.cmdexecutor.LeaseManager
(LeaseManager.java:343) - (TIMER) Job has been cancelled. jobId =
CREAM565133125
20 Jun 2012 23:57:20,346 ERROR
org.glite.ce.creamapi.jobmanagement.cmdexecutor.LeaseManager
(LeaseManager.java:260) - (TIMER) qdel: Unknown Job Id
14728072.heplnx201.pp.rl.ac.uk
(And I see loads of other "Job has been cancelled" entries too).
Not sure what this means but I'm going for coffee.
Chris.