Are you using the new or old blparser ?
Can you see what is reported in its log file (bnotifier.log or
pbs-blparser.log) for this job ?
I see that at 21:54:08 the job entered the really-running status, and this
can happen only when the jw running on the WN started executing the user
payload.
So I don't think it is true that the job didn't start at all.
Cheers, Massimo
On Wed, 24 Nov 2010, Lukasz Flis wrote:
> Hello,
>
> Due to the fact that there are some preparations to package torque 2.3.12(13)
> in gLite, the following information might be interesting for you
>
> Recently we have upgraded our torque installation to 2.4.11 from (2.3.9).
> Tonight we know it wasn't a good idea at all...
>
> Problems with CREAM jobs we encountered may cause maui to stop scheduling.
>
> Background:
>
> * CREAMCE is using STAGE IN/OUT mechanism to deliver files from job cream
> sandbox to destination workernode
> * Torque can use rcp/scp to acomplish stagein or cp when $usecp is defined in
> mom_config
>
> Problems observed during 2.3.9 era:
> * some jobs used to stay in W state forever waiting for STAGE IN/OUT phase
> to complete
> * when we used SCP we've seen many scp complaints in the log files
> * Only CREAM jobs were affected by Waiting (Forever) State
>
> Brand new problems during 2.4.11 era:
> * no more jobs in waiting state
> * maui scheduling get stuck until job is removed by hand
> LOG: (oh here we have recent one which i'm going to qdel to make maui work)
>
> 11/24 00:38:26 ERROR: job '7561994' cannot be started: (rc: 15057 errmsg:
> 'Cannot execute at specified host because of checkpoint or stagein files
> MSG=allocated nodes must match input file stagein location' hostlist:
> 'n10-4-18.local')
>
>
> Why this happens? (2.4.11 era version)
>
> Let's look at the qob:
> - job gets into the queue with Q state
> - jobs goes to the workernode X
> - workernode X rejects the job for some reason
> - job goes back to the queue with Q state
> Now look at the CREAM: CREAM comes in:
> - CREAM thinks that job has executed and finished
> - CREAM clears job sandbox removing files defined for STAGEIN
> Now back to the job
> - Maui tries to schedule job again
> (2.3.9 - job ends in W)
> (2.4.11 - maui gets locked)
>
> What is the problem here?
>
> For me it looks like CREAM has improperly recognized job state and started
> cleanup process before the job actually started. (i reserve the right to be
> wrong here)
>
> Please let me know if this problem has been seen on your sites.
> (Stuck scheduling, Jobs in waiting state)
>
> Here are some logs related to the job:
>
> ---------------- PBS accounting -----------------------
> 11/23/2010 21:49:35;Q;7559909.batch.grid.cyf-kr.edu.pl;queue=atlas
> 11/23/2010 22:37:50;S;7559909.batch.grid.cyf-kr.edu.pl;user=pltatlas001
> group=atlasplt jobname=cream_984655218 queue=atlas ctime=1290545375
> qtime=1290545375 etime=1290545375 start=1290548270
> [log in to unmask] exec_host=n11-3-17.local/3
> Resource_List.cput=48:00:00 Resource_List.neednodes=sl5
> Resource_List.walltime=72:00:00
> 11/23/2010 22:38:08;E;7559909.batch.grid.cyf-kr.edu.pl;user=pltatlas001
> group=atlasplt jobname=cream_984655218 queue=atlas ctime=1290545375
> qtime=1290545375 etime=1290545375 start=1290548270
> [log in to unmask] exec_host=n11-3-17.local/3
> Resource_List.cput=48:00:00 Resource_List.neednodes=sl5
> Resource_List.walltime=72:00:00 session=7325 end=1290548288 Exit_status=127
> resources_used.cput=00:00:00 resources_used.mem=0kb resources_used.vmem=0kb
> resources_used.walltime=00:00:25
>
> ---------------- MOM log -----------------------------
> 11/23/2010 21:53:54;0001; pbs_mom;Job;TMomFinalizeJob3;job
> 7559909.batch.grid.cyf-kr.edu.pl started, pid = 21689
> 11/23/2010 21:53:54;0002; pbs_mom;n/a;mom_server_check_connection;connection
> to server 10.4.0.15 timeout
> 11/23/2010 21:53:54;0002; pbs_mom;n/a;mom_server_check_connection;sending
> hello to server 10.4.0.15
> 11/23/2010 21:56:03;0002; pbs_mom;Svr;pbs_mom;Torque Mom Version = 2.4.11,
> loglevel = 0
> 11/23/2010 21:56:18;0080; pbs_mom;Job;7559909.batch.grid.cyf-
> kr.edu.pl;scan_for_terminated: job 7559909.batch.grid.cyf-kr.edu.pl task 1
> terminated, sid=21689
> 11/23/2010 21:56:18;0008; pbs_mom;Job;7559909.batch.grid.cyf-kr.edu.pl;job was
> terminated
> 11/23/2010 21:56:18;0080; pbs_mom;Svr;preobit_reply;top of preobit_reply
> 11/23/2010 21:56:18;0080;
> pbs_mom;Svr;preobit_reply;DIS_reply_read/decode_DIS_replySvr worked, top of
> while loop
> 11/23/2010 21:56:18;0080; pbs_mom;Svr;preobit_reply;in while loop, no error
> from job stat
> 11/23/2010 21:56:18;0080; pbs_mom;Job;7559909.batch.grid.cyf-kr.edu.pl;obit
> sent to server
> 11/23/2010 21:56:18;0001; pbs_mom;Svr;pbs_mom;LOG_ALERT::obit_reply, server
> rejected job obit - unexpected job state
> 11/23/2010 22:38:08;0080; pbs_mom;Job;7559909.batch.grid.cyf-
> kr.edu.pl;scan_for_terminated: job 7559909.batch.grid.cyf-kr.edu.pl task 1
> terminated, sid=7325
> 11/23/2010 22:38:08;0008; pbs_mom;Job;7559909.batch.grid.cyf-kr.edu.pl;job
> was terminated
> 11/23/2010 22:38:08;0080; pbs_mom;Svr;preobit_reply;top of preobit_reply
> 11/23/2010 22:38:08;0080;
> pbs_mom;Svr;preobit_reply;DIS_reply_read/decode_DIS_replySvr worked, top of
> while loop
> 11/23/2010 22:38:08;0080; pbs_mom;Svr;preobit_reply;in while loop, no error
> from job stat
> 11/23/2010 22:38:08;0080; pbs_mom;Job;7559909.batch.grid.cyf-kr.edu.pl;obit
> sent to server
> 11/23/2010 22:38:16;0080; pbs_mom;Job;7559909.batch.grid.cyf-
> kr.edu.pl;removing transient job directory /scratch/7559909.batch.grid.cyf-
> kr.edu.pl
>
>
> ------------ CREAM log -------------------
> glite-ce-cream.log:23 Nov 2010 21:49:27,849 INFO
> org.glite.ce.creamapi.jobmanagement.cmdexecutor.AbstractJobExecutor
> (AbstractJobExecutor.java:2163) - (Worker Thread 8) JOB CREAM984655218 STATUS
> CHANGED: REGISTERED => PENDING [localUser=pltatlas001]
> [delegationId=1290235905.865076]
> glite-ce-cream.log:23 Nov 2010 21:49:39,139 INFO
> org.glite.ce.creamapi.jobmanagement.cmdexecutor.AbstractJobExecutor
> (AbstractJobExecutor.java:2163) - (Worker Thread 8) JOB CREAM984655218 STATUS
> CHANGED: PENDING => IDLE [localUser=pltatlas001]
> [delegationId=1290235905.865076]
> glite-ce-cream.log:23 Nov 2010 21:49:39,139 INFO
> org.glite.ce.creamapi.jobmanagement.cmdexecutor.AbstractJobExecutor
> (AbstractJobExecutor.java:908) - (Worker Thread 8)
> REMOTE_REQUEST_ADDRESS=128.142.167.211;
> USER_DN=/C=DE/O=GermanGrid/OU=LMU/CN=Rodney Walker; USER_FQAN={
> /atlas/Role=pilot/Capability=NULL; /atlas/Role=NULL/Capability=NULL;
> /atlas/alarm/Role=NULL/Capability=NULL; /atlas/de/Role=NULL/Capability=NULL;
> /atlas/lcg1/Role=NULL/Capability=NULL; /atlas/team/Role=NULL/Capability=NULL;
> /atlas/uk/Role=NULL/Capability=NULL; }; CMD_NAME=JOB_START;
> CMD_CATEGORY=JOB_MANAGEMENT; CMD_STATUS=PROCESSING; commandName=JOB_START;
> cmdExecutorName=BLAHExecutor;
> userId=_C_DE_O_GermanGrid_OU_LMU_CN_Rodney_Walker_atlas_Role_pilot_Capability_NULL;
> jobId=CREAM984655218; status=PROCESSING;
> lrmsAbsJobId=pbs/20101123/7559909.batch.grid.cyf-kr.edu.pl;
> glite-ce-cream.log:23 Nov 2010 21:54:07,160 INFO
> org.glite.ce.creamapi.jobmanagement.cmdexecutor.AbstractJobExecutor
> (AbstractJobExecutor.java:2163) - (Worker Thread 49) JOB CREAM984655218 STATUS
> CHANGED: IDLE => RUNNING [localUser=pltatlas001]
> [delegationId=1290235905.865076]
> glite-ce-cream.log:23 Nov 2010 21:54:08,622 INFO
> org.glite.ce.creamapi.jobmanagement.cmdexecutor.AbstractJobExecutor
> (AbstractJobExecutor.java:2163) - (Worker Thread 23) JOB CREAM984655218 STATUS
> CHANGED: RUNNING => REALLY-RUNNING [localUser=pltatlas001]
> [workerNode=n11-3-17.local] [delegationId=1290235905.865076]
> glite-ce-cream.log:23 Nov 2010 21:56:18,121 INFO
> org.glite.ce.creamapi.jobmanagement.cmdexecutor.AbstractJobExecutor
> (AbstractJobExecutor.java:2163) - (Worker Thread 34) JOB CREAM984655218 STATUS
> CHANGED: REALLY-RUNNING => DONE-OK [failureReason=reason=0]
> [localUser=pltatlas001] [workerNode=n11-3-17.local]
> [delegationId=1290235905.865076]
> glite-ce-cream.log:23 Nov 2010 21:56:57,969 INFO
> org.glite.ce.creamapi.jobmanagement.cmdexecutor.AbstractJobExecutor
> (AbstractJobExecutor.java:883) - (Worker Thread 23)
> REMOTE_REQUEST_ADDRESS=128.142.167.211;
> USER_DN=/C=DE/O=GermanGrid/OU=LMU/CN=Rodney Walker; USER_FQAN={
> /atlas/Role=pilot/Capability=NULL; /atlas/Role=NULL/Capability=NULL;
> /atlas/alarm/Role=NULL/Capability=NULL; /atlas/de/Role=NULL/Capability=NULL;
> /atlas/lcg1/Role=NULL/Capability=NULL; /atlas/team/Role=NULL/Capability=NULL;
> /atlas/uk/Role=NULL/Capability=NULL; }; CMD_NAME=JOB_PURGE;
> CMD_CATEGORY=JOB_MANAGEMENT; CMD_STATUS=PROCESSING; commandName=JOB_PURGE;
> cmdExecutorName=BLAHExecutor;
> userId=_C_DE_O_GermanGrid_OU_LMU_CN_Rodney_Walker_atlas_Role_pilot_Capability_NULL;
> jobId=CREAM984655218; status=PROCESSING;
> glite-ce-cream.log:23 Nov 2010 21:56:58,268 INFO
> org.glite.ce.cream.jobmanagement.db.table.JobTable (JobTable.java:162) -
> (Worker Thread 23) Job deleted. JobId = CREAM984655218
> glite-ce-cream.log:23 Nov 2010 21:56:58,282 INFO
> org.glite.ce.creamapi.jobmanagement.cmdexecutor.AbstractJobExecutor
> (AbstractJobExecutor.java:1946) - (Worker Thread 23) purge: purged job
> CREAM984655218
>
>
>
>
>
> Best Regards
> --
> Lukasz Flis
>
\|||/
-----------0oo----( o o )----oo0-------------------
(_)
INFN Sezione di Padova
Via Marzolo, 8
35131 Padova - Italy E-mail: massimo.sgaravatto [at] pd.infn.it
Tel: ++39 0498275908 Skype: massimo.sgaravatto
Fax: ++39 0498275952
|