Hi.
> This is a different issue (the third one discussed in this thread):
> the job is not submitted at all to the batch system because in 200
> secs it was not able to get back the lrms jobid
> Or was the job submitted?
>
Massimo, this is actually a cause of issue (b).
The sandbox is removed __before__ the job is started by PBS (while it is
queued), or while the job is running, because CREAM wrongly thinks that
the job has finished due to client->server communication errors.
Both cases cause "Unable to copy file" errors.
The job state change happens because, if the W state lasts too long, the
job may be put back into the Q state to be run on a different host.
Every client->server communication problem during a pbs_status.sh
invocation results in the sandbox being removed and the job being marked
as ABORTED, while the job itself is not removed from the PBS server queue.
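
For illustration, here is a minimal sketch of the exit-code handling I
mean (hand-written for this mail, not the actual pbs_status.sh code; the
wrapper shape is an assumption):

#!/bin/sh
# Sketch: map the TORQUE qstat exit code to a job-state decision.
jobid="$1"
qstat "$jobid" >/dev/null 2>&1
rc=$?
if [ "$rc" -eq 0 ]; then
    echo "job $jobid still known to the server"   # leave CREAM state alone
elif [ "$rc" -eq 153 ]; then
    echo "job $jobid unknown: finished"           # 153 is the only "done" signal
else
    # timeout, connection refused, server restarting, ...
    # transient failure: report it, but do NOT abort or purge the sandbox
    echo "transient qstat error (rc=$rc) for job $jobid" >&2
    exit 1
fi
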
I would like to ask you guys to check how many batch-related errors you
have in your CREAM logs:
cd /opt/glite/var/log
grep -e "failureReason=BLAH error" glite-ce-* | sed -e 's/.*stderr://' |
sed -e 's|N/A.*||' | sort | uniq -c
LOGS: 17 Feb - 24 Feb - Host: cream.grid.cyf-kr.edu.pl
[root@cream log]# grep -e "failureReason=BLAH error" glite-ce-* | sed -e
's/.*stderr://' | sed -e 's|N/A.*||' | sort | uniq -c
159 <blah> execute_cmd: 200 seconds timeout expired, killing child
process.-)
45 Connection timed out-qsub: cannot connect to server
batch.grid.cyf-kr.edu.pl (errno=110) Connection timed out-)
697 qsub: Error (15033 - Batch protocol error) submitting job-)
2 qsub: Invalid credential MSG=Hosts do not match-)
52 qsub: Maximum number of jobs already in queue MSG=total number
of jobs in queue exceeds the queue limit: user
[log in to unmask], queue biomed-)
11 qsub: Maximum number of jobs already in queue MSG=total number
of jobs in queue exceeds the queue limit: user
[log in to unmask], queue biomed-)
14 qsub: Maximum number of jobs already in queue MSG=total number
of jobs in queue exceeds the queue limit: user
[log in to unmask], queue biomed-)
4 qsub: Maximum number of jobs already in queue MSG=total number
of jobs in queue exceeds the queue limit: user
[log in to unmask], queue biomed-)
77 qsub: Maximum number of jobs already in queue MSG=total number
of jobs in queue exceeds the queue limit: user
[log in to unmask], queue biomed-)
LOGS: 13 Dec - 24 Feb - Host: cream02.grid.cyf-kr.edu.pl
625 <blah> execute_cmd: 200 seconds timeout expired, killing child
process.-)
156 Cannot connect to default server host
'batch.grid.cyf-kr.edu.pl' - check pbs_server daemon.-qsub: cannot
connect to server batch.grid.cyf-kr.edu.pl (errno=111) Connection refused-)
157 Connection timed out-qsub: cannot connect to server
batch.grid.cyf-kr.edu.pl (errno=110) Connection timed out-)
1 qsub: End of File-)
2393 qsub: Error (15033 - Batch protocol error) submitting job-)
37 qsub: Invalid credential-)
1 qsub: Invalid credential MSG=Hosts do not match-)
1 qsub: Maximum number of jobs already in queue MSG=total number
of jobs in queue exceeds the queue limit: user
[log in to unmask], queue biomed-)
25 qsub: Maximum number of jobs already in queue MSG=total number
of jobs in queue exceeds the queue limit: user
[log in to unmask], queue biomed-)
129 qsub: Maximum number of jobs already in queue MSG=total number
of jobs in queue exceeds the queue limit: user
[log in to unmask], queue biomed-)
6 qsub: Maximum number of jobs already in queue MSG=total number
of jobs in queue exceeds the queue limit: user
[log in to unmask], queue biomed-)
75 qsub: Maximum number of jobs already in queue MSG=total number
of jobs in queue exceeds the queue limit: user
[log in to unmask], queue biomed-)
1516 qsub: Maximum number of jobs already in queue MSG=total number
of jobs in queue exceeds the queue limit: user
[log in to unmask], queue biomed-)
3 qsub: Maximum number of jobs already in queue MSG=total number
of jobs in queue exceeds the queue limit: user
[log in to unmask], queue biomed-)
17 qsub: Maximum number of jobs already in queue MSG=total number
of jobs in queue exceeds the queue limit: user
[log in to unmask], queue biomed-)
213 qsub: Maximum number of jobs already in queue MSG=total number
of jobs in queue exceeds the queue limit: user
[log in to unmask], queue biomed-)
63 qsub: Maximum number of jobs already in queue MSG=total number
of jobs in queue exceeds the queue limit: user
[log in to unmask], queue biomed-)
112 qsub: Maximum number of jobs already in queue MSG=total number
of jobs in queue exceeds the queue limit: user
[log in to unmask], queue biomed-)
6 qsub: Premature end of message-)
4 qsub: Request not allowed: Server shutting down-)
>
> I remind the other two issues discussed in this thread:
>
> a: jobs changing their status q-->w-->q ...
> happening when the proxy expires and the proxy file is removed
> I still would like to understand if setting:
>
>
>
> DELEGATION_PURGE_RATE=-1
>
> in the cream-config.xml and restarting tomcat helps
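>
> For reference, in cream-config.xml the parameter would look roughly
> like this (a sketch; the exact file layout can differ between releases):
>
> <parameter name="DELEGATION_PURGE_RATE" value="-1" />
>
> followed by a tomcat restart (e.g. service tomcat5 restart).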
>
>
> b: cream sandbox dir removed too early, and causing error messages
> such as:
>
>
> Unable to copy file err_craaa_619112035_StandardOutput to
> [log in to unmask]:/opt/glite/var/cream_sandbox/atlasprd/_C_CA_O_Grid_OU_triumf_ca_CN_Asoka_De_Silva_GC1_atlas_Role_production_Capability_NULL_prdatl15/61/CREAM619112035/StandardError
> *** error from copy
> scp:
> /opt/glite/var/cream_sandbox/atlasprd/_C_CA_O_Grid_OU_triumf_ca_CN_Asoka_De_Silva_GC1_atlas_Role_production_Capability_NULL_prdatl15/61/CREAM619112035/StandardError:
> No such file or directory
> *** end error output
>
>
> I didn't understand whether in this case too the job status keeps
> changing between q and w, or whether the job exits the torque queue.
>
> Here, as I wrote, I can guess what happened, but please provide (in a
> ggus ticket) the glite-ce-cream logs along with the id of a
> problematic job
>
>
> Cheers, Massimo
>
>
>
> On 02/24/2012 12:54 PM, Lukasz Flis wrote:
>> Hi *,
>>
>> I think the problem is caused by a bad response from the PBS server
>> or a timeout while running the qstat command.
>>
>>
>> CREAM periodically queries the PBS server for job status.
>> When the server is restarted or busy, the command may time out.
>> As we can see below, such a job is then considered ABORTED, which is
>> simply wrong in most cases.
>>
>> If the job is no longer known to the server, qstat will return an exit
>> code of 153, and pbs_status.sh, which is invoked by CREAM, should
>> handle this case properly. In other words, for the TORQUE LRMS only
>> exit code 153 means that the job has finished executing; all other
>> codes should not trigger a change of state to ABORTED.
>>
>> Otherwise:
>> * jobs that have not yet started will be unable to do STAGEIN,
>> resulting in the "W" state
>> * jobs that are already running will not be able to do STAGEOUT
>>
>>
>> I think CREAM/blah needs to handle temporary PBS service
>> unavailability better.
>>
>> Can anyone point me to documentation describing the CREAM ->
>> pbs_{status,submit,resume,hold,cancel} interface?
>>
>> Cheers,
>> --
>> Lukasz Flis
>>
>>
>>
>> 19 Feb 2012 11:47:08,865 INFO
>> org.glite.ce.cream.jobmanagement.db.table.JobTable (JobTable.java:232) -
>> (TP-Processor17) Job inserted. JobId = CREAM745296672
>> 19 Feb 2012 11:47:08,949 INFO
>> org.glite.ce.creamapi.jobmanagement.cmdexecutor.AbstractJobExecutor
>> (AbstractJobExecutor.java:2163) - (TP-Processor17) JOB CREAM745296672
>> STATUS CHANGED: -- => REGISTERED [localUser=prdatlas002]
>> [delegationId=1328717924.913405]
>> 19 Feb 2012 11:47:42,090 INFO
>> org.glite.ce.creamapi.jobmanagement.cmdexecutor.AbstractJobExecutor
>> (AbstractJobExecutor.java:895) - (Worker Thread 3)
>> REMOTE_REQUEST_ADDRESS=128.142.194.89; USER_DN=/DC=ch/DC=cern/OU=Organic
>> Units/OU=Users/CN=atlpilo2/CN=596434/CN=Robot: ATLAS Pilot2; USER_FQAN={
>> /atlas/Role=production/Capability=NULL;
>> /atlas/Role=NULL/Capability=NULL; /atlas/lcg1/Role=NULL/Capability=NULL;
>> }; CMD_NAME=JOB_START; CMD_CATEGORY=JOB_MANAGEMENT;
>> CMD_STATUS=PROCESSING; commandName=JOB_START;
>> cmdExecutorName=BLAHExecutor;
>> userId=_DC_ch_DC_cern_OU_Organic_Units_OU_Users_CN_atlpilo2_CN_596434_CN_Robot__ATLAS_Pilot2_atlas_Role_production_Capability_NULL;
>>
>> jobId=CREAM745296672; status=PROCESSING;
>> 19 Feb 2012 11:47:42,942 INFO
>> org.glite.ce.creamapi.jobmanagement.cmdexecutor.AbstractJobExecutor
>> (AbstractJobExecutor.java:2163) - (Worker Thread 3) JOB CREAM745296672
>> STATUS CHANGED: REGISTERED => PENDING [localUser=prdatlas002]
>> [delegationId=1328717924.913405]
>> 19 Feb 2012 11:51:27,522 INFO
>> org.glite.ce.creamapi.jobmanagement.cmdexecutor.AbstractJobExecutor
>> (AbstractJobExecutor.java:2163) - (Worker Thread 3) JOB CREAM745296672
>> STATUS CHANGED: PENDING => ABORTED [failureReason=BLAH error: no jobId
>> in submission script's output (stdout:) (stderr: <blah> execute_cmd: 200
>> seconds timeout expired, killing child process.-) N/A (jobId =
>> CREAM745296672)] [localUser=prdatlas002]
>> [delegationId=1328717924.913405]
>> 19 Feb 2012 12:18:43,761 INFO
>> org.glite.ce.creamapi.jobmanagement.cmdexecutor.AbstractJobExecutor
>> (AbstractJobExecutor.java:883) - (Worker Thread 34)
>> REMOTE_REQUEST_ADDRESS=128.142.194.89; USER_DN=/DC=ch/DC=cern/OU=Organic
>> Units/OU=Users/CN=atlpilo2/CN=596434/CN=Robot: ATLAS Pilot2; USER_FQAN={
>> /atlas/Role=production/Capability=NULL;
>> /atlas/Role=NULL/Capability=NULL; /atlas/lcg1/Role=NULL/Capability=NULL;
>> }; CMD_NAME=JOB_PURGE; CMD_CATEGORY=JOB_MANAGEMENT;
>> CMD_STATUS=PROCESSING; commandName=JOB_PURGE;
>> cmdExecutorName=BLAHExecutor;
>> userId=_DC_ch_DC_cern_OU_Organic_Units_OU_Users_CN_atlpilo2_CN_596434_CN_Robot__ATLAS_Pilot2_atlas_Role_production_Capability_NULL;
>>
>> jobId=CREAM745296672; status=PROCESSING;
>> 19 Feb 2012 12:18:48,158 INFO
>> org.glite.ce.cream.jobmanagement.db.table.JobTable (JobTable.java:162) -
>> (Worker Thread 34) Job deleted. JobId = CREAM745296672
>> 19 Feb 2012 12:18:48,177 INFO
>> org.glite.ce.creamapi.jobmanagement.cmdexecutor.AbstractJobExecutor
>> (AbstractJobExecutor.java:1946) - (Worker Thread 34) purge: purged job
>> CREAM745296672
>>
>>
>>
>>
>>
>>
>>
>> On 22.02.2012 19:15, Leslie Groer wrote:
>>> Hi Gila and others, thanks for the replies. I should mention that
>>> these are NOT cancelled jobs on our site. Most of the ones I checked
>>> today are pilots that did not find payloads, but I do see at least one
>>> successful user analysis job that consumed 27 minutes of walltime and
>>> 1 minute of CPU and still reported the "Unable to copy file" error:
>>>
>>> From the torque server accounting logs:
>>>
>>> 20120222:02/22/2012 11:10:21;Q;10312027.gpc-sched;queue=atlas
>>> 20120222:02/22/2012 11:10:21;Q;10312027.gpc-sched;queue=serial
>>> 20120222:02/22/2012 11:10:39;S;10312027.gpc-sched;user=atlaspt1
>>> group=atlaspt jobname=craaa_563417814 queue=serial ctime=1329927021
>>> qtime=1329927021 etime=1329927021 start=1329927039
>>> owner=atlaspt1@lcg-admin3 exec_host=gpc-f113n066/6
>>> Resource_List.os=centos53computeHEP Resource_List.walltime=35:03:00
>>> 20120222:02/22/2012 11:37:51;E;10312027.gpc-sched;user=atlaspt1
>>> group=atlaspt jobname=craaa_563417814 queue=serial ctime=1329927021
>>> qtime=1329927021 etime=1329927021 start=1329927039
>>> owner=atlaspt1@lcg-admin3 exec_host=gpc-f113n066/6
>>> Resource_List.os=centos53computeHEP Resource_List.walltime=35:03:00
>>> session=28206 end=1329928671 Exit_status=0 resources_used.cput=00:01:19
>>> resources_used.mem=307956kb resources_used.vmem=1046852kb
>>> resources_used.walltime=00:27:12
>>>
>>> This job maps to PandaID
>>> http://panda.cern.ch/server/pandamon/query?job=1432922395
>>>
>>> which shows success, but I still get a PBS JOB email with the
>>> following:
>>>
>>> PBS Job Id: 10312027.gpc-sched
>>> Job Name: craaa_563417814
>>> Exec host: gpc-f113n066/6
>>> An error has occurred processing your job, see below.
>>> Post job file processing error; job 10312027.gpc-sched on host
>>> gpc-f113n066/6
>>>
>>> Unable to copy file err_craaa_563417814_StandardOutput
>>> [log in to unmask]:/opt/glite/var/cream_sandbox/atlaspt/_C_CA_O_Grid_OU_triumf_ca_CN_Asoka_De_Silva_GC1_atlas_Role_pilot_Capability_NULL_atlaspt1/56/CREAM563417814/StandardError
>>>
>>>
>>> *** error from copy
>>> scp:/opt/glite/var/cream_sandbox/atlaspt/_C_CA_O_Grid_OU_triumf_ca_CN_Asoka_De_Silva_GC1_atlas_Role_pilot_Capability_NULL_atlaspt1/56/CREAM563417814/StandardError:
>>>
>>> No such file or directory
>>> *** end error output
>>>
>>> Unable to copy file out_craaa_563417814_StandardOutput
>>> [log in to unmask]:/opt/glite/var/cream_sandbox/atlaspt/_C_CA_O_Grid_OU_triumf_ca_CN_Asoka_De_Silva_GC1_atlas_Role_pilot_Capability_NULL_atlaspt1/56/CREAM563417814/StandardOutput
>>>
>>>
>>> *** error from copy
>>> scp:/opt/glite/var/cream_sandbox/atlaspt/_C_CA_O_Grid_OU_triumf_ca_CN_Asoka_De_Silva_GC1_atlas_Role_pilot_Capability_NULL_atlaspt1/56/CREAM563417814/StandardOutput:
>>>
>>> No such file or directory
>>> *** end error output
>>>
>>>
>>> We do not experience any of the Q->W flapping others talk about. We
>>> did set
>>>
>>> CLASSCFG[atlas] CancelFailedJobs=TRUE
>>>
>>> in our MOAB configuration, so that if a job fails it just burns a
>>> pilot and is kicked out of the queue rather than being rescheduled (we
>>> were sometimes seeing flapping with jobs that never got scheduled to a
>>> node, but that had a different pathology).
>>>
>>> Leslie
>>>
>>>
>>> Gila Arrondo Miguel Angel wrote:
>>>> Hi Leslie,
>>>>
>>>> With a very similar setup (Moab/Torque on one host and the CREAM-CE
>>>> on another) we've seen an error somewhat similar to what you describe
>>>> here. In our /var/log/messages we find tons of entries like this:
>>>>
>>>> Feb 22 09:10:24 wn202 pbs_mom: LOG_ERROR::sys_copy, command
>>>> '/usr/bin/scp -rpB
>>>> [log in to unmask]:/cream_localsandbox/data/atlas/_DC_ch_DC_cern_OU_Organic_Units_OU_Users_CN_atlpilo1_CN_614260_CN_Robot__ATLAS_Pilot1_atlas_Role_pilot_Capability_NULL_atlasplt/78/CREAM780422084/StandardError'
>>>>
>>>> failed with status=1, giving up after 4 attempts
>>>>
>>>> Feb 22 09:10:24 wn202 pbs_mom: LOG_ERROR::req_cpyfile, Unable to copy
>>>> file err_cre02_780422084_StandardError
>>>> [log in to unmask]:/cream_localsandbox/data/atlas/_DC_ch_DC_cern_OU_Organic_Units_OU_Users_CN_atlpilo1_CN_614260_CN_Robot__ATLAS_Pilot1_atlas_Role_pilot_Capability_NULL_atlasplt/78/CREAM780422084/StandardError
>>>>
>>>>
>>>>
>>>> Feb 22 09:10:29 wn202 pbs_mom: LOG_ERROR::sys_copy, command
>>>> '/usr/bin/scp -rpB
>>>> [log in to unmask]:/cream_localsandbox/data/atlas/_DC_ch_DC_cern_OU_Organic_Units_OU_Users_CN_atlpilo1_CN_614260_CN_Robot__ATLAS_Pilot1_atlas_Role_pilot_Capability_NULL_atlasplt/78/CREAM780422084/StandardOutput'
>>>>
>>>> failed with status=1, giving up after 4 attempts
>>>>
>>>> Have you checked whether these failed transfers are from cancelled
>>>> jobs? In our experience, it was always the case.
>>>>
>>>> We've also looked at ways to mitigate this annoying verbosity, but
>>>> no luck so far. The only option we can think of is to stop using scp
>>>> for the copies and move the sandbox to an area shared with the WNs,
>>>> so that pbs_mom uses regular cp (the $usecp directive) and these
>>>> errors are avoided. But, of course, this approach has its own
>>>> disadvantages as well.
>>>>
>>>> Does anyone else have a better idea?
>>>>
>>>> Cheers,
>>>> Miguel
>>>>
>>>>
>>>>
>>>> --
>>>> Miguel Gila
>>>> CSCS Swiss National Supercomputing Centre
>>>> HPC Solutions
>>>> Via Cantonale, Galleria 2 | CH-6928 Manno | Switzerland
>>>> Miguel.gila [at] cscs.ch
>>>>
>>>> From: Leslie
>>>> Groer<[log in to unmask]<mailto:[log in to unmask]>>
>>>> Reply-To: LHC Computer Grid -
>>>> Rollout<[log in to unmask]<mailto:[log in to unmask]>>
>>>> Date: Thu, 16 Feb 2012 14:43:56 -0500
>>>> To:<[log in to unmask]<mailto:[log in to unmask]>>
>>>> Subject: [LCG-ROLLOUT] Torque/CREAM race condition at job completion?
>>>>
>>>> scp:
>>>> /opt/glite/var/cream_sandbox/atlasprd/_C_CA_O_Grid_OU_triumf_ca_CN_Asoka_De_Silva_GC1_atlas_Role_production_Capability_NULL_prdatl15/61/CREAM619112035/StandardOutput:
>>>>
>>>> No such file or directory
>>>>
>>>
>
>