This is a different issue (the third one discussed in this thread): the
job is not submitted to the batch system at all, because within 200 seconds
BLAH was not able to get back the LRMS job id.
Or was the job submitted?
Let me recall the other two issues discussed in this thread:
a: jobs changing status q --> w --> q ...,
which happens when the proxy expires and the proxy file is removed
I would still like to understand whether setting:
DELEGATION_PURGE_RATE=-1
in cream-config.xml and restarting Tomcat helps.
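For reference, a minimal sketch of what that setting looks like (the exact parent element and file path can differ between CREAM releases; treat this as an illustration, not an authoritative fragment):

```xml
<!-- cream-config.xml (location varies by release) -->
<!-- Sketch only: -1 is meant to disable the periodic purge of delegations -->
<parameter name="DELEGATION_PURGE_RATE" value="-1" />
```

After editing the file, restart Tomcat so CREAM picks up the change.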
b: the CREAM sandbox dir being removed too early, causing error messages such as:
Unable to copy file err_craaa_619112035_StandardOutput to
[log in to unmask]:/opt/glite/var/cream_sandbox/atlasprd/_C_CA_O_Grid_OU_triumf_ca_CN_Asoka_De_Silva_GC1_atlas_Role_production_Capability_NULL_prdatl15/61/CREAM619112035/StandardError
*** error from copy
scp:
/opt/glite/var/cream_sandbox/atlasprd/_C_CA_O_Grid_OU_triumf_ca_CN_Asoka_De_Silva_GC1_atlas_Role_production_Capability_NULL_prdatl15/61/CREAM619112035/StandardError:
No such file or directory
*** end error output
I didn't understand whether, in this case too, the job status keeps flapping
between q and w, or whether the job exits the Torque queue.
Here, as I wrote, I can guess what happened, but please provide (in a
GGUS ticket) the glite-ce-cream logs along with the ID of a problematic job.
Cheers, Massimo
On 02/24/2012 12:54 PM, Lukasz Flis wrote:
> Hi *,
>
> I think the problem is caused by a bad response from the PBS server
> or by a timeout while running the qstat command.
>
>
> CREAM periodically queries the PBS server for job status.
> When the server is restarted or busy, the command may time out.
> As we can see below, such a job is then considered ABORTED, which is
> simply wrong in most cases.
>
> If a job is no longer known, qstat returns exit code 153, and
> pbs_status.sh, which is invoked by CREAM, should handle this case
> properly. In other words, for the TORQUE LRMS only exit code 153 means
> that the job has finished executing; all other codes should not trigger
> a change of state to ABORTED.
>
> Otherwise:
> * jobs that have not yet started will be unable to do STAGEIN, resulting in the "W" state
> * already running jobs will not be able to do STAGEOUT
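> The rule above could be sketched as follows (a hypothetical helper, not
> the real pbs_status.sh; the state names are illustrative):

```shell
# classify_rc: map a qstat exit code to a job state following the rule
# above. Hypothetical sketch -- not the actual pbs_status.sh logic.
classify_rc() {
  case "$1" in
    0)   echo "ACTIVE" ;;     # qstat still knows the job
    153) echo "DONE" ;;       # TORQUE "unknown job id": job left the queue
    *)   echo "TRANSIENT" ;;  # timeout/server restart: keep previous
                              # state, do not mark the job ABORTED
  esac
}
```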
>
>
> I think CREAM/BLAH needs to handle temporary PBS service
> unavailability better.
>
> Can anyone point me to documentation describing the CREAM ->
> pbs_{status,submit,resume,hold,cancel} interface?
>
> Cheers,
> --
> Lukasz Flis
>
>
>
> 19 Feb 2012 11:47:08,865 INFO
> org.glite.ce.cream.jobmanagement.db.table.JobTable (JobTable.java:232) -
> (TP-Processor17) Job inserted. JobId = CREAM745296672
> 19 Feb 2012 11:47:08,949 INFO
> org.glite.ce.creamapi.jobmanagement.cmdexecutor.AbstractJobExecutor
> (AbstractJobExecutor.java:2163) - (TP-Processor17) JOB CREAM745296672
> STATUS CHANGED: -- => REGISTERED [localUser=prdatlas002]
> [delegationId=1328717924.913405]
> 19 Feb 2012 11:47:42,090 INFO
> org.glite.ce.creamapi.jobmanagement.cmdexecutor.AbstractJobExecutor
> (AbstractJobExecutor.java:895) - (Worker Thread 3)
> REMOTE_REQUEST_ADDRESS=128.142.194.89; USER_DN=/DC=ch/DC=cern/OU=Organic
> Units/OU=Users/CN=atlpilo2/CN=596434/CN=Robot: ATLAS Pilot2; USER_FQAN={
> /atlas/Role=production/Capability=NULL;
> /atlas/Role=NULL/Capability=NULL; /atlas/lcg1/Role=NULL/Capability=NULL;
> }; CMD_NAME=JOB_START; CMD_CATEGORY=JOB_MANAGEMENT;
> CMD_STATUS=PROCESSING; commandName=JOB_START;
> cmdExecutorName=BLAHExecutor;
> userId=_DC_ch_DC_cern_OU_Organic_Units_OU_Users_CN_atlpilo2_CN_596434_CN_Robot__ATLAS_Pilot2_atlas_Role_production_Capability_NULL;
> jobId=CREAM745296672; status=PROCESSING;
> 19 Feb 2012 11:47:42,942 INFO
> org.glite.ce.creamapi.jobmanagement.cmdexecutor.AbstractJobExecutor
> (AbstractJobExecutor.java:2163) - (Worker Thread 3) JOB CREAM745296672
> STATUS CHANGED: REGISTERED => PENDING [localUser=prdatlas002]
> [delegationId=1328717924.913405]
> 19 Feb 2012 11:51:27,522 INFO
> org.glite.ce.creamapi.jobmanagement.cmdexecutor.AbstractJobExecutor
> (AbstractJobExecutor.java:2163) - (Worker Thread 3) JOB CREAM745296672
> STATUS CHANGED: PENDING => ABORTED [failureReason=BLAH error: no jobId
> in submission script's output (stdout:) (stderr: <blah> execute_cmd: 200
> seconds timeout expired, killing child process.-) N/A (jobId =
> CREAM745296672)] [localUser=prdatlas002] [delegationId=1328717924.913405]
> 19 Feb 2012 12:18:43,761 INFO
> org.glite.ce.creamapi.jobmanagement.cmdexecutor.AbstractJobExecutor
> (AbstractJobExecutor.java:883) - (Worker Thread 34)
> REMOTE_REQUEST_ADDRESS=128.142.194.89; USER_DN=/DC=ch/DC=cern/OU=Organic
> Units/OU=Users/CN=atlpilo2/CN=596434/CN=Robot: ATLAS Pilot2; USER_FQAN={
> /atlas/Role=production/Capability=NULL;
> /atlas/Role=NULL/Capability=NULL; /atlas/lcg1/Role=NULL/Capability=NULL;
> }; CMD_NAME=JOB_PURGE; CMD_CATEGORY=JOB_MANAGEMENT;
> CMD_STATUS=PROCESSING; commandName=JOB_PURGE;
> cmdExecutorName=BLAHExecutor;
> userId=_DC_ch_DC_cern_OU_Organic_Units_OU_Users_CN_atlpilo2_CN_596434_CN_Robot__ATLAS_Pilot2_atlas_Role_production_Capability_NULL;
> jobId=CREAM745296672; status=PROCESSING;
> 19 Feb 2012 12:18:48,158 INFO
> org.glite.ce.cream.jobmanagement.db.table.JobTable (JobTable.java:162) -
> (Worker Thread 34) Job deleted. JobId = CREAM745296672
> 19 Feb 2012 12:18:48,177 INFO
> org.glite.ce.creamapi.jobmanagement.cmdexecutor.AbstractJobExecutor
> (AbstractJobExecutor.java:1946) - (Worker Thread 34) purge: purged job
> CREAM745296672
>
>
>
>
>
>
>
> On 22.02.2012 19:15, Leslie Groer wrote:
>> Hi, and thanks to Gila and others for the replies. I should mention that
>> these are NOT cancelled jobs at our site. Most are pilots that did not
>> find payloads, as I checked today, but I do see at least one successful
>> user analysis job, which consumed 27 minutes of walltime and 1 minute of
>> CPU, that reported the "Unable to copy file" error:
>>
>> From the Torque server accounting logs:
>>
>> 20120222:02/22/2012 11:10:21;Q;10312027.gpc-sched;queue=atlas
>> 20120222:02/22/2012 11:10:21;Q;10312027.gpc-sched;queue=serial
>> 20120222:02/22/2012 11:10:39;S;10312027.gpc-sched;user=atlaspt1
>> group=atlaspt jobname=craaa_563417814 queue=serial ctime=1329927021
>> qtime=1329927021 etime=1329927021 start=1329927039
>> owner=atlaspt1@lcg-admin3 exec_host=gpc-f113n066/6
>> Resource_List.os=centos53computeHEP Resource_List.walltime=35:03:00
>> 20120222:02/22/2012 11:37:51;E;10312027.gpc-sched;user=atlaspt1
>> group=atlaspt jobname=craaa_563417814 queue=serial ctime=1329927021
>> qtime=1329927021 etime=1329927021 start=1329927039
>> owner=atlaspt1@lcg-admin3 exec_host=gpc-f113n066/6
>> Resource_List.os=centos53computeHEP Resource_List.walltime=35:03:00
>> session=28206 end=1329928671 Exit_status=0 resources_used.cput=00:01:19
>> resources_used.mem=307956kb resources_used.vmem=1046852kb
>> resources_used.walltime=00:27:12
>>
>> This job maps to PandaID
>> http://panda.cern.ch/server/pandamon/query?job=1432922395
>>
>> which shows success, but I still get a PBS JOB email with the following:
>>
>> PBS Job Id: 10312027.gpc-sched
>> Job Name: craaa_563417814
>> Exec host: gpc-f113n066/6
>> An error has occurred processing your job, see below.
>> Post job file processing error; job 10312027.gpc-sched on host
>> gpc-f113n066/6
>>
>> Unable to copy file err_craaa_563417814_StandardOutput
>> [log in to unmask]:/opt/glite/var/cream_sandbox/atlaspt/_C_CA_O_Grid_OU_triumf_ca_CN_Asoka_De_Silva_GC1_atlas_Role_pilot_Capability_NULL_atlaspt1/56/CREAM563417814/StandardError
>>
>> *** error from copy
>> scp:/opt/glite/var/cream_sandbox/atlaspt/_C_CA_O_Grid_OU_triumf_ca_CN_Asoka_De_Silva_GC1_atlas_Role_pilot_Capability_NULL_atlaspt1/56/CREAM563417814/StandardError:
>> No such file or directory
>> *** end error output
>>
>> Unable to copy file out_craaa_563417814_StandardOutput
>> [log in to unmask]:/opt/glite/var/cream_sandbox/atlaspt/_C_CA_O_Grid_OU_triumf_ca_CN_Asoka_De_Silva_GC1_atlas_Role_pilot_Capability_NULL_atlaspt1/56/CREAM563417814/StandardOutput
>>
>> *** error from copy
>> scp:/opt/glite/var/cream_sandbox/atlaspt/_C_CA_O_Grid_OU_triumf_ca_CN_Asoka_De_Silva_GC1_atlas_Role_pilot_Capability_NULL_atlaspt1/56/CREAM563417814/StandardOutput:
>> No such file or directory
>> *** end error output
>>
>>
>> We do not experience any of the Q->W flapping others talk about. We
>> did set
>>
>> CLASSCFG[atlas] CancelFailedJobs=TRUE
>>
>> in our MOAB configuration, so that if a job fails it just burns a pilot
>> and is kicked out of the queue rather than being rescheduled (we were
>> sometimes seeing flapping with jobs that never got scheduled to a node,
>> but that had a different pathology).
>>
>> Leslie
>>
>>
>> Gila Arrondo Miguel Angel wrote:
>>> Hi Leslie,
>>>
>>> With a very similar setup (Moab/Torque on one host and CREAM-CE on
>>> another) we've seen an error somewhat close to what you describe here.
>>> In our /var/log/messages we find tons of entries like this:
>>>
>>> Feb 22 09:10:24 wn202 pbs_mom: LOG_ERROR::sys_copy, command
>>> '/usr/bin/scp -rpB
>>> [log in to unmask]:/cream_localsandbox/data/atlas/_DC_ch_DC_cern_OU_Organic_Units_OU_Users_CN_atlpilo1_CN_614260_CN_Robot__ATLAS_Pilot1_atlas_Role_pilot_Capability_NULL_atlasplt/78/CREAM780422084/StandardError'
>>> failed with status=1, giving up after 4 attempts
>>>
>>> Feb 22 09:10:24 wn202 pbs_mom: LOG_ERROR::req_cpyfile, Unable to copy
>>> file err_cre02_780422084_StandardError
>>> [log in to unmask]:/cream_localsandbox/data/atlas/_DC_ch_DC_cern_OU_Organic_Units_OU_Users_CN_atlpilo1_CN_614260_CN_Robot__ATLAS_Pilot1_atlas_Role_pilot_Capability_NULL_atlasplt/78/CREAM780422084/StandardError
>>>
>>>
>>> Feb 22 09:10:29 wn202 pbs_mom: LOG_ERROR::sys_copy, command
>>> '/usr/bin/scp -rpB
>>> [log in to unmask]:/cream_localsandbox/data/atlas/_DC_ch_DC_cern_OU_Organic_Units_OU_Users_CN_atlpilo1_CN_614260_CN_Robot__ATLAS_Pilot1_atlas_Role_pilot_Capability_NULL_atlasplt/78/CREAM780422084/StandardOutput'
>>> failed with status=1, giving up after 4 attempts
>>>
>>> Have you checked whether these failed transfers come from cancelled
>>> jobs? In our experience, that was always the case.
>>>
>>> We've also looked at ways to mitigate this annoying verbosity, but no
>>> luck so far. The only option we can think of is to stop using scp for
>>> the copies and move the sandbox to an area shared with the WNs, so
>>> that regular cp is used (the $usecp directive) and these errors are
>>> hidden. But, of course, this approach has its own disadvantages as well.
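>>> For reference, with such a shared area the mapping would go in the
>>> pbs_mom config roughly like this (the paths here are illustrative, not
>>> our actual setup):

```
# mom_priv/config on the WNs (illustrative paths)
# $usecp <host-spec>:<remote-path> <local-path>
$usecp *:/opt/glite/var/cream_sandbox /opt/glite/var/cream_sandbox
```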
>>>
>>> Does anyone else have a better idea?
>>>
>>> Cheers,
>>> Miguel
>>>
>>>
>>>
>>> --
>>> Miguel Gila
>>> CSCS Swiss National Supercomputing Centre
>>> HPC Solutions
>>> Via Cantonale, Galleria 2 | CH-6928 Manno | Switzerland
>>> Miguel.gila [at] cscs.ch
>>>
>>> From: Leslie
>>> Groer<[log in to unmask]<mailto:[log in to unmask]>>
>>> Reply-To: LHC Computer Grid -
>>> Rollout<[log in to unmask]<mailto:[log in to unmask]>>
>>> Date: Thu, 16 Feb 2012 14:43:56 -0500
>>> To:<[log in to unmask]<mailto:[log in to unmask]>>
>>> Subject: [LCG-ROLLOUT] Torque/CREAM race condition at job completion?
>>>
>>> scp:
>>> /opt/glite/var/cream_sandbox/atlasprd/_C_CA_O_Grid_OU_triumf_ca_CN_Asoka_De_Silva_GC1_atlas_Role_production_Capability_NULL_prdatl15/61/CREAM619112035/StandardOutput:
>>> No such file or directory
>>>
>>