Hi *,

I think the problem is caused by a bad response from the PBS server
or a timeout while running the qstat command.


CREAM periodically queries the PBS server for job status.
When the server is restarted or busy, the command may time out.
As we can see below, such a job is then considered ABORTED, which is
simply wrong in most cases.

If the job is unavailable, qstat returns exit code 153, and
pbs_status.sh, which is invoked by CREAM, should handle this case
properly. In other words, for the TORQUE LRMS only exit code 153 means
that the job has finished executing; all other codes should not trigger
a change of state to ABORTED.

Otherwise:
  * not-yet-started jobs will be unable to do STAGEIN, resulting in the "W" state
  * already-running jobs will not be able to do STAGEOUT
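
For illustration, here is a minimal shell sketch of the kind of exit-code
handling I mean. This is not the actual pbs_status.sh code; the variable
names and messages are made up:

#!/bin/sh
# Sketch only: treat qstat exit code 153 ("Unknown Job Id") as
# "job no longer known to the server"; any other non-zero code
# (timeout, connection failure, ...) is a transient error, so the
# previous job state is kept instead of flagging the job ABORTED.
jobid="$1"

out=$(qstat -f "$jobid" 2>&1)
rc=$?

if [ $rc -eq 0 ]; then
    # job still known to PBS: parse job_state from the qstat output
    echo "$out" | grep job_state
elif [ $rc -eq 153 ]; then
    # job unknown to PBS: it has left the queue, treat it as finished
    echo "job $jobid no longer known to PBS"
else
    # transient failure: report an error, do NOT report ABORTED
    echo "qstat failed with rc=$rc, keeping previous state" >&2
    exit 1
fi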


I think CREAM/blah needs to handle temporary PBS service
unavailability better.

Can anyone point me to documentation describing the CREAM -> 
pbs_{status,submit,resume,hold,cancel} interface?

Cheers,
--
Lukasz Flis



19 Feb 2012 11:47:08,865 INFO 
org.glite.ce.cream.jobmanagement.db.table.JobTable (JobTable.java:232) - 
(TP-Processor17) Job inserted. JobId = CREAM745296672
19 Feb 2012 11:47:08,949 INFO 
org.glite.ce.creamapi.jobmanagement.cmdexecutor.AbstractJobExecutor 
(AbstractJobExecutor.java:2163) - (TP-Processor17) JOB CREAM745296672 
STATUS CHANGED: -- => REGISTERED [localUser=prdatlas002] 
[delegationId=1328717924.913405]
19 Feb 2012 11:47:42,090 INFO 
org.glite.ce.creamapi.jobmanagement.cmdexecutor.AbstractJobExecutor 
(AbstractJobExecutor.java:895) - (Worker Thread 3) 
REMOTE_REQUEST_ADDRESS=128.142.194.89; USER_DN=/DC=ch/DC=cern/OU=Organic 
Units/OU=Users/CN=atlpilo2/CN=596434/CN=Robot: ATLAS Pilot2; USER_FQAN={ 
/atlas/Role=production/Capability=NULL; 
/atlas/Role=NULL/Capability=NULL; /atlas/lcg1/Role=NULL/Capability=NULL; 
}; CMD_NAME=JOB_START; CMD_CATEGORY=JOB_MANAGEMENT; 
CMD_STATUS=PROCESSING; commandName=JOB_START; 
cmdExecutorName=BLAHExecutor; 
userId=_DC_ch_DC_cern_OU_Organic_Units_OU_Users_CN_atlpilo2_CN_596434_CN_Robot__ATLAS_Pilot2_atlas_Role_production_Capability_NULL; 
jobId=CREAM745296672; status=PROCESSING;
19 Feb 2012 11:47:42,942 INFO 
org.glite.ce.creamapi.jobmanagement.cmdexecutor.AbstractJobExecutor 
(AbstractJobExecutor.java:2163) - (Worker Thread 3) JOB CREAM745296672 
STATUS CHANGED: REGISTERED => PENDING [localUser=prdatlas002] 
[delegationId=1328717924.913405]
19 Feb 2012 11:51:27,522 INFO 
org.glite.ce.creamapi.jobmanagement.cmdexecutor.AbstractJobExecutor 
(AbstractJobExecutor.java:2163) - (Worker Thread 3) JOB CREAM745296672 
STATUS CHANGED: PENDING => ABORTED [failureReason=BLAH error: no jobId 
in submission script's output (stdout:) (stderr: <blah> execute_cmd: 200 
seconds timeout expired, killing child process.-) N/A (jobId = 
CREAM745296672)] [localUser=prdatlas002] [delegationId=1328717924.913405]
19 Feb 2012 12:18:43,761 INFO 
org.glite.ce.creamapi.jobmanagement.cmdexecutor.AbstractJobExecutor 
(AbstractJobExecutor.java:883) - (Worker Thread 34) 
REMOTE_REQUEST_ADDRESS=128.142.194.89; USER_DN=/DC=ch/DC=cern/OU=Organic 
Units/OU=Users/CN=atlpilo2/CN=596434/CN=Robot: ATLAS Pilot2; USER_FQAN={ 
/atlas/Role=production/Capability=NULL; 
/atlas/Role=NULL/Capability=NULL; /atlas/lcg1/Role=NULL/Capability=NULL; 
}; CMD_NAME=JOB_PURGE; CMD_CATEGORY=JOB_MANAGEMENT; 
CMD_STATUS=PROCESSING; commandName=JOB_PURGE; 
cmdExecutorName=BLAHExecutor; 
userId=_DC_ch_DC_cern_OU_Organic_Units_OU_Users_CN_atlpilo2_CN_596434_CN_Robot__ATLAS_Pilot2_atlas_Role_production_Capability_NULL; 
jobId=CREAM745296672; status=PROCESSING;
19 Feb 2012 12:18:48,158 INFO 
org.glite.ce.cream.jobmanagement.db.table.JobTable (JobTable.java:162) - 
(Worker Thread 34) Job deleted. JobId = CREAM745296672
19 Feb 2012 12:18:48,177 INFO 
org.glite.ce.creamapi.jobmanagement.cmdexecutor.AbstractJobExecutor 
(AbstractJobExecutor.java:1946) - (Worker Thread 34) purge: purged job 
CREAM745296672







On 22.02.2012 19:15, Leslie Groer wrote:
> Hi Gila and others, thanks for the replies. I should mention that these are
> NOT cancelled jobs on our site. Most of the ones I checked today are pilots
> that have not found payloads, but I do see at least one successful user
> analysis job that consumed 27 mins of walltime and 1 minute of CPU and
> reported the "Unable to copy file" error:
>
>  From the torque server accounting logs:
>
> 20120222:02/22/2012 11:10:21;Q;10312027.gpc-sched;queue=atlas
> 20120222:02/22/2012 11:10:21;Q;10312027.gpc-sched;queue=serial
> 20120222:02/22/2012 11:10:39;S;10312027.gpc-sched;user=atlaspt1
> group=atlaspt jobname=craaa_563417814 queue=serial ctime=1329927021
> qtime=1329927021 etime=1329927021 start=1329927039
> owner=atlaspt1@lcg-admin3 exec_host=gpc-f113n066/6
> Resource_List.os=centos53computeHEP Resource_List.walltime=35:03:00
> 20120222:02/22/2012 11:37:51;E;10312027.gpc-sched;user=atlaspt1
> group=atlaspt jobname=craaa_563417814 queue=serial ctime=1329927021
> qtime=1329927021 etime=1329927021 start=1329927039
> owner=atlaspt1@lcg-admin3 exec_host=gpc-f113n066/6
> Resource_List.os=centos53computeHEP Resource_List.walltime=35:03:00
> session=28206 end=1329928671 Exit_status=0 resources_used.cput=00:01:19
> resources_used.mem=307956kb resources_used.vmem=1046852kb
> resources_used.walltime=00:27:12
>
> This job maps to PandaID
> http://panda.cern.ch/server/pandamon/query?job=1432922395
>
> which shows success, but I still get a PBS JOB email with the following:
>
> PBS Job Id: 10312027.gpc-sched
> Job Name:   craaa_563417814
> Exec host:  gpc-f113n066/6
> An error has occurred processing your job, see below.
> Post job file processing error; job 10312027.gpc-sched on host gpc-f113n066/6
>
> Unable to copy file err_craaa_563417814_StandardOutput [log in to unmask]:/opt/glite/var/cream_sandbox/atlaspt/_C_CA_O_Grid_OU_triumf_ca_CN_Asoka_De_Silva_GC1_atlas_Role_pilot_Capability_NULL_atlaspt1/56/CREAM563417814/StandardError
> *** error from copy
> scp:/opt/glite/var/cream_sandbox/atlaspt/_C_CA_O_Grid_OU_triumf_ca_CN_Asoka_De_Silva_GC1_atlas_Role_pilot_Capability_NULL_atlaspt1/56/CREAM563417814/StandardError: No such file or directory
> *** end error output
>
> Unable to copy file out_craaa_563417814_StandardOutput [log in to unmask]:/opt/glite/var/cream_sandbox/atlaspt/_C_CA_O_Grid_OU_triumf_ca_CN_Asoka_De_Silva_GC1_atlas_Role_pilot_Capability_NULL_atlaspt1/56/CREAM563417814/StandardOutput
> *** error from copy
> scp:/opt/glite/var/cream_sandbox/atlaspt/_C_CA_O_Grid_OU_triumf_ca_CN_Asoka_De_Silva_GC1_atlas_Role_pilot_Capability_NULL_atlaspt1/56/CREAM563417814/StandardOutput: No such file or directory
> *** end error output
>
>
> We do not experience any of the Q->W flapping others talk about. We did set
>
> CLASSCFG[atlas] CancelFailedJobs=TRUE
>
> in our MOAB configuration, since a failed job just burns a pilot; it is
> better to kick it out of the queue than to try to reschedule it (we were
> seeing some flapping with jobs that never got scheduled to a node
> sometimes, but that had a different pathology).
>
> Leslie
>
>
> Gila Arrondo Miguel Angel wrote:
>> Hi Leslie,
>>
>> With a very similar setup (Moab/Torque on 1 host and CREAM-CE on another) we've seen an error somewhat like the one you describe here. In our /var/log/messages we find tons of entries like this:
>>
>> Feb 22 09:10:24 wn202 pbs_mom: LOG_ERROR::sys_copy, command '/usr/bin/scp -rpB [log in to unmask]:/cream_localsandbox/data/atlas/_DC_ch_DC_cern_OU_Organic_Units_OU_Users_CN_atlpilo1_CN_614260_CN_Robot__ATLAS_Pilot1_atlas_Role_pilot_Capability_NULL_atlasplt/78/CREAM780422084/StandardError' failed with status=1, giving up after 4 attempts
>>
>> Feb 22 09:10:24 wn202 pbs_mom: LOG_ERROR::req_cpyfile, Unable to copy file err_cre02_780422084_StandardError [log in to unmask]:/cream_localsandbox/data/atlas/_DC_ch_DC_cern_OU_Organic_Units_OU_Users_CN_atlpilo1_CN_614260_CN_Robot__ATLAS_Pilot1_atlas_Role_pilot_Capability_NULL_atlasplt/78/CREAM780422084/StandardError
>>
>> Feb 22 09:10:29 wn202 pbs_mom: LOG_ERROR::sys_copy, command '/usr/bin/scp -rpB [log in to unmask]:/cream_localsandbox/data/atlas/_DC_ch_DC_cern_OU_Organic_Units_OU_Users_CN_atlpilo1_CN_614260_CN_Robot__ATLAS_Pilot1_atlas_Role_pilot_Capability_NULL_atlasplt/78/CREAM780422084/StandardOutput' failed with status=1, giving up after 4 attempts
>>
>> Have you checked whether these failed transfers are from cancelled jobs? In our experience, it was always the case.
>>
>> We've also looked at ways to mitigate this annoying verbosity, but no luck so far. The only option that we can think of is to stop using scp for copies and move the sandbox to a shared area with the WNs, so you use regular cp ($usecp directive) and these errors are hidden. But, of course, this approach has its own disadvantages as well.
>>
>> Does anyone else have a better idea?
>>
>> Cheers,
>> Miguel
>>
>>
>>
>> --
>> Miguel Gila
>> CSCS Swiss National Supercomputing Centre
>> HPC Solutions
>> Via Cantonale, Galleria 2 | CH-6928 Manno | Switzerland
>> Miguel.gila [at] cscs.ch
>>
>> From: Leslie Groer<[log in to unmask]<mailto:[log in to unmask]>>
>> Reply-To: LHC Computer Grid - Rollout<[log in to unmask]<mailto:[log in to unmask]>>
>> Date: Thu, 16 Feb 2012 14:43:56 -0500
>> To:<[log in to unmask]<mailto:[log in to unmask]>>
>> Subject: [LCG-ROLLOUT] Torque/CREAM race condition at job completion?
>>
>> scp: /opt/glite/var/cream_sandbox/atlasprd/_C_CA_O_Grid_OU_triumf_ca_CN_Asoka_De_Silva_GC1_atlas_Role_production_Capability_NULL_prdatl15/61/CREAM619112035/StandardOutput: No such file or directory
>>
>