Hi,

I am testing a CREAM CE that I have set up on a ScotGrid dev machine.
The current setup has the CREAM CE and Torque/Maui on different hosts, with
the Torque logs mounted via NFS on the CE.
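
Since the log parsing depends on that NFS mount, a basic visibility check
from the CE side (a sketch; the spool path is taken from the server
outputs below, and Torque names its log files by date):

    # run on the CREAM CE, not the Torque server
    ls -l /var/spool/pbs/server_logs/20090220
    tail -n 5 /var/spool/pbs/server_logs/20090220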

The job gets submitted successfully and is traceable in Torque.  However, it
moves from Q to W and waits forever.  On deeper investigation, it looks as
though Torque or CREAM thinks the job is running in a job slot that is
actually occupied by a completely different job.  Running pbsnodes for the
node and grepping for the CREAM job id returns nothing, yet tailing the
Torque logs shows another dteam job with the same exec_host as the CREAM
job.  This leads me to think that I have not set up the log parsing
correctly on the CE, or that something is getting thoroughly confused.
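
As a compact version of that cross-check (job id and node name are the
ones from this case):

    svr016:~# qstat -f 2409813 | grep exec_host
    svr016:~# pbsnodes node182 | grep -c 2409813    # comes back 0: the node reports no such job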

The outputs from the commands that led me to this conclusion are listed
below.  Any thoughts on this would be greatly appreciated.

svr016:/var/spool/pbs/server_logs# tracejob 2409813
/var/spool/pbs/mom_logs/20090220: No such file or directory
/var/spool/pbs/sched_logs/20090220: No such file or directory

Job: 2409813.svr016.gla.scotgrid.ac.uk

02/20/2009 14:36:34  S    enqueuing into q30m, state 1 hop 1
02/20/2009 14:36:34  S    Job Queued at request of [log in to unmask],
                          owner = [log in to unmask],
                          job name = cream_520507277, queue = q30m
02/20/2009 14:36:34  A    queue=q30m
02/20/2009 15:19:37  S    Job Modified at request of [log in to unmask]
02/20/2009 15:19:37  S    Job Run at request of [log in to unmask]
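
(The missing mom_logs and sched_logs directories are presumably expected
in this layout: the MOMs log on the worker nodes, and Maui keeps its own
log rather than writing PBS sched_logs.  The raw server-log lines for the
job can be pulled with:

    svr016:/var/spool/pbs/server_logs# grep 2409813 20090220
)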


svr016:/var/spool/pbs/server_logs# qstat -f 2409813
Job Id: 2409813.svr016.gla.scotgrid.ac.uk
    Job_Name = cream_520507277
    Job_Owner = [log in to unmask]
    job_state = W
    queue = q30m
    server = svr016.gla.scotgrid.ac.uk
    Checkpoint = u
    ctime = Fri Feb 20 14:36:34 2009
    Error_Path = dev011.gla.scotgrid.ac.uk:/dev/null
    exec_host = node182/2
    Execution_Time = Fri Feb 20 15:49:41 2009
    ......
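
Worth noting: as far as I know, a PBS job shows state W when its
Execution_Time lies in the future, and here Execution_Time is set to
15:49:41, i.e. roughly 30 minutes after the "Job Run" attempt in the
server log above.  To keep an eye on that field:

    svr016:~# qstat -f 2409813 | grep -E 'job_state|Execution_Time'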

svr016:~# pbsnodes node182 | grep 2409813

svr016:~# pbsnodes node182
node182
     state = job-exclusive
     np = 8
     properties = lcgpro
     ntype = cluster
     jobs = 0/2406819.svr016.gla.scotgrid.ac.uk,
            1/2409176.svr016.gla.scotgrid.ac.uk,
            2/2409262.svr016.gla.scotgrid.ac.uk,
            3/2340154.svr016.gla.scotgrid.ac.uk,
            4/2354251.svr016.gla.scotgrid.ac.uk,
            5/2407443.svr016.gla.scotgrid.ac.uk,
            6/2408238.svr016.gla.scotgrid.ac.uk,
            7/2407591.svr016.gla.scotgrid.ac.uk
     status = opsys=linux,uname=Linux node182.beowulf.cluster
              2.6.9-78.0.1.ELsmp #1 SMP Tue Aug 5 13:53:03 CDT 2008 x86_64,
              sessions=30091 1175 1856 9228 11111 20888 22885 30853,
              nsessions=8,nusers=4,idletime=3709813,totmem=5952308kb,
              availmem=2123760kb,physmem=16438780kb,ncpus=8,loadave=8.04,
              netload=4294967294,state=free,
              jobs=2340154.svr016.gla.scotgrid.ac.uk
                   2354251.svr016.gla.scotgrid.ac.uk
                   2406819.svr016.gla.scotgrid.ac.uk
                   2407443.svr016.gla.scotgrid.ac.uk
                   2407591.svr016.gla.scotgrid.ac.uk
                   2408238.svr016.gla.scotgrid.ac.uk
                   2409176.svr016.gla.scotgrid.ac.uk
                   2409262.svr016.gla.scotgrid.ac.uk,
              rectime=1235143422
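
To list the slot assignments one per line, which makes the clash easier
to spot:

    svr016:~# pbsnodes node182 | sed -n 's/^ *jobs = //p' | tr ',' '\n'

Slot 2 is held by 2409262, not by the CREAM job.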

And the matching entry from the Torque accounting log:

02/20/2009 15:20:23;S;2409262.svr016.gla.scotgrid.ac.uk;user=dteam166
group=dteam jobname=STDIN queue=q3d ctime=1235130339 qtime=1235130339
etime=1235130339 start=1235143223 [log in to unmask]
exec_host=node182/2 Resource_List.cput=72:00:00 Resource_List.neednodes=1
Resource_List.nodect=1 Resource_List.nodes=1 Resource_List.walltime=72:00:00
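
Everything the server recorded as starting on slot node182/2 that day can
be pulled from the accounting file (standard Torque location under
server_priv, so treat the path as an assumption for other layouts):

    svr016:~# grep 'exec_host=node182/2' /var/spool/pbs/server_priv/accounting/20090220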


Thanks,

Dug

-- 
ScotGrid, Room 481, Kelvin Building, University of Glasgow
tel: +44(0)141 330 6439