Hi Lukasz,
LOGS: 15 April 2012 - 24 Feb - Host: lcg-admin3.scinet.utoronto.ca
Here is the result of the search on our CREAM logs:
[root@lcg-admin3 log]# grep -e "failureReason=BLAH error" glite-ce-* |
sed -e 's/.*stderr://' | sed -e 's|N/A.*||' | sort | uniq -c
25 Cannot connect to default server host 'gpc-sched.scinet.local' -
check pbs_server daemon.-qsub: cannot connect to server
gpc-sched.scinet.local (errno=111) Connection refused-)
7455 pbs_iff: cannot read reply from pbs_server-No Permission.-qsub:
cannot connect to server gpc-sched.scinet.local (errno=15007)
Unauthorized Request -)
2 qsub: Invalid credential-)
Leslie
Lukasz Flis wrote:
> Hi.
>> This is a different issue (the third one discussed in this thread):
>> the job is not submitted to the batch system at all, because within 200
>> secs it was not able to get back the LRMS jobid.
>> Or was the job submitted?
>>
>
> Massimo, this is actually a cause of issue (b).
> The sandbox is removed __before__ the job is started by PBS (while it is
> queued), or while the job is running, because CREAM wrongly thinks that
> the job has finished, due to client->server communication errors.
>
> Both cases cause "Unable to copy file" errors.
>
> The job state change is due to the fact that if state W lasts too long,
> the job may be put back into the Q state to be run on a different host.
>
> Every client->server communication problem during a pbs_status.sh
> invocation results in the sandbox being removed and the job being marked
> as aborted, but not removed from the PBS server queue.
>
> I would like to ask you guys to check how many batch-related errors
> you have in your CREAM logs:
>
> cd /opt/glite/var/log
> grep -e "failureReason=BLAH error" glite-ce-* | sed -e 's/.*stderr://'
> | sed -e 's|N/A.*||' | sort | uniq -c
>
>
> LOGS: 17 Feb - 24 Feb - Host: cream.grid.cyf-kr.edu.pl
>
> [root@cream log]# grep -e "failureReason=BLAH error" glite-ce-* | sed
> -e 's/.*stderr://' | sed -e 's|N/A.*||' | sort | uniq -c
> 159 <blah> execute_cmd: 200 seconds timeout expired, killing child
> process.-)
> 45 Connection timed out-qsub: cannot connect to server
> batch.grid.cyf-kr.edu.pl (errno=110) Connection timed out-)
> 697 qsub: Error (15033 - Batch protocol error) submitting job-)
> 2 qsub: Invalid credential MSG=Hosts do not match-)
> 52 qsub: Maximum number of jobs already in queue MSG=total number
> of jobs in queue exceeds the queue limit: user
> [log in to unmask], queue biomed-)
> 11 qsub: Maximum number of jobs already in queue MSG=total number
> of jobs in queue exceeds the queue limit: user
> [log in to unmask], queue biomed-)
> 14 qsub: Maximum number of jobs already in queue MSG=total number
> of jobs in queue exceeds the queue limit: user
> [log in to unmask], queue biomed-)
> 4 qsub: Maximum number of jobs already in queue MSG=total number
> of jobs in queue exceeds the queue limit: user
> [log in to unmask], queue biomed-)
> 77 qsub: Maximum number of jobs already in queue MSG=total number
> of jobs in queue exceeds the queue limit: user
> [log in to unmask], queue biomed-)
>
>
> LOGS: 13 Dec - 24 Feb - Host: cream02.grid.cyf-kr.edu.pl
> 625 <blah> execute_cmd: 200 seconds timeout expired, killing child
> process.-)
> 156 Cannot connect to default server host
> 'batch.grid.cyf-kr.edu.pl' - check pbs_server daemon.-qsub: cannot
> connect to server batch.grid.cyf-kr.edu.pl (errno=111) Connection
> refused-)
> 157 Connection timed out-qsub: cannot connect to server
> batch.grid.cyf-kr.edu.pl (errno=110) Connection timed out-)
> 1 qsub: End of File-)
> 2393 qsub: Error (15033 - Batch protocol error) submitting job-)
> 37 qsub: Invalid credential-)
> 1 qsub: Invalid credential MSG=Hosts do not match-)
> 1 qsub: Maximum number of jobs already in queue MSG=total number
> of jobs in queue exceeds the queue limit: user
> [log in to unmask], queue biomed-)
> 25 qsub: Maximum number of jobs already in queue MSG=total number
> of jobs in queue exceeds the queue limit: user
> [log in to unmask], queue biomed-)
> 129 qsub: Maximum number of jobs already in queue MSG=total number
> of jobs in queue exceeds the queue limit: user
> [log in to unmask], queue biomed-)
> 6 qsub: Maximum number of jobs already in queue MSG=total number
> of jobs in queue exceeds the queue limit: user
> [log in to unmask], queue biomed-)
> 75 qsub: Maximum number of jobs already in queue MSG=total number
> of jobs in queue exceeds the queue limit: user
> [log in to unmask], queue biomed-)
> 1516 qsub: Maximum number of jobs already in queue MSG=total number
> of jobs in queue exceeds the queue limit: user
> [log in to unmask], queue biomed-)
> 3 qsub: Maximum number of jobs already in queue MSG=total number
> of jobs in queue exceeds the queue limit: user
> [log in to unmask], queue biomed-)
> 17 qsub: Maximum number of jobs already in queue MSG=total number
> of jobs in queue exceeds the queue limit: user
> [log in to unmask], queue biomed-)
> 213 qsub: Maximum number of jobs already in queue MSG=total number
> of jobs in queue exceeds the queue limit: user
> [log in to unmask], queue biomed-)
> 63 qsub: Maximum number of jobs already in queue MSG=total number
> of jobs in queue exceeds the queue limit: user
> [log in to unmask], queue biomed-)
> 112 qsub: Maximum number of jobs already in queue MSG=total number
> of jobs in queue exceeds the queue limit: user
> [log in to unmask], queue biomed-)
> 6 qsub: Premature end of message-)
> 4 qsub: Request not allowed: Server shutting down-)
>
>
>
>>
>> I remind the other two issues discussed in this thread:
>>
>> a: jobs changing their status q --> w --> q ...
>> happening when the proxy expires and the proxy file is removed
>> I still would like to understand if setting:
>>
>>
>>
>> DELEGATION_PURGE_RATE=-1
>>
>> in the cream-config.xml and restarting tomcat helps
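>>
>> For example, a minimal sketch (the config path and tomcat service name
>> below are my assumptions for a typical gLite 3.2 CREAM CE; check your
>> installation before running anything):
>>
>> grep DELEGATION_PURGE_RATE /opt/glite/etc/glite-ce-cream/cream-config.xml
>> # edit that value to -1 by hand, then:
>> service tomcat5 restart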
>>
>>
>> b: cream sandbox dir removed too early, and causing error messages
>> such as:
>>
>>
>> Unable to copy file err_craaa_619112035_StandardOutput to
>> [log in to unmask]:/opt/glite/var/cream_sandbox/atlasprd/_C_CA_O_Grid_OU_triumf_ca_CN_Asoka_De_Silva_GC1_atlas_Role_production_Capability_NULL_prdatl15/61/CREAM619112035/StandardError
>>
>> *** error from copy
>> scp:
>> /opt/glite/var/cream_sandbox/atlasprd/_C_CA_O_Grid_OU_triumf_ca_CN_Asoka_De_Silva_GC1_atlas_Role_production_Capability_NULL_prdatl15/61/CREAM619112035/StandardError:
>> No such file or directory
>> *** end error output
>>
>>
>> I didn't understand whether in this case too the job status keeps
>> changing between q and w, or whether the job exits the Torque queue.
>>
>> Here, as I wrote, I can guess what happened, but please provide (in a
>> GGUS ticket) the glite-ce-cream logs along with the ID of a
>> problematic job.
>>
>>
>> Cheers, Massimo
>>
>>
>>
>> On 02/24/2012 12:54 PM, Lukasz Flis wrote:
>>> Hi *,
>>>
>>> I think the problem is caused by a bad response from the PBS server
>>> or a timeout while running the qstat command.
>>>
>>>
>>> CREAM periodically queries the PBS server for job status.
>>> When the server is restarted or busy, the command may time out.
>>> As we can see below, such a job is then considered ABORTED, which is
>>> simply wrong in most cases.
>>>
>>> If the job is unavailable, qstat will return an exit code of 153, and
>>> pbs_status.sh, which is invoked by CREAM, should handle this case
>>> properly. In other words, for the TORQUE LRMS only exit code 153 means
>>> that the job has finished executing; all other codes should not trigger
>>> a change of state to ABORTED (a rough sketch of this logic follows the
>>> list below).
>>>
>>> Otherwise:
>>> * jobs that have not yet started will be unable to do STAGEIN,
>>> resulting in the "W" state
>>> * jobs that are already running will not be able to do STAGEOUT
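>>>
>>> Something like this untested sketch is the logic I mean (not the actual
>>> pbs_status.sh code; $jobid is a placeholder for the batch job id):
>>>
>>> out=$(qstat -f "$jobid" 2>&1); rc=$?
>>> if [ $rc -eq 0 ]; then
>>>     echo "$out"              # job still known to the server: report its state
>>> elif [ $rc -eq 153 ]; then
>>>     echo "job completed"     # genuinely gone from the queue
>>> else
>>>     # transient server problem: report an error, do NOT mark the job ABORTED
>>>     echo "qstat failed (rc=$rc), retry later" >&2
>>>     exit $rc
>>> fi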
>>>
>>>
>>> I think CREAM/BLAH needs to handle temporary PBS service
>>> unavailabilities better.
>>>
>>> Can anyone point me to documentation describing the CREAM ->
>>> pbs_{status,submit,resume,hold,cancel} interface?
>>>
>>> Cheers,
>>> --
>>> Lukasz Flis
>>>
>>>
>>>
>>> 19 Feb 2012 11:47:08,865 INFO
>>> org.glite.ce.cream.jobmanagement.db.table.JobTable
>>> (JobTable.java:232) -
>>> (TP-Processor17) Job inserted. JobId = CREAM745296672
>>> 19 Feb 2012 11:47:08,949 INFO
>>> org.glite.ce.creamapi.jobmanagement.cmdexecutor.AbstractJobExecutor
>>> (AbstractJobExecutor.java:2163) - (TP-Processor17) JOB CREAM745296672
>>> STATUS CHANGED: -- => REGISTERED [localUser=prdatlas002]
>>> [delegationId=1328717924.913405]
>>> 19 Feb 2012 11:47:42,090 INFO
>>> org.glite.ce.creamapi.jobmanagement.cmdexecutor.AbstractJobExecutor
>>> (AbstractJobExecutor.java:895) - (Worker Thread 3)
>>> REMOTE_REQUEST_ADDRESS=128.142.194.89;
>>> USER_DN=/DC=ch/DC=cern/OU=Organic
>>> Units/OU=Users/CN=atlpilo2/CN=596434/CN=Robot: ATLAS Pilot2;
>>> USER_FQAN={
>>> /atlas/Role=production/Capability=NULL;
>>> /atlas/Role=NULL/Capability=NULL;
>>> /atlas/lcg1/Role=NULL/Capability=NULL;
>>> }; CMD_NAME=JOB_START; CMD_CATEGORY=JOB_MANAGEMENT;
>>> CMD_STATUS=PROCESSING; commandName=JOB_START;
>>> cmdExecutorName=BLAHExecutor;
>>> userId=_DC_ch_DC_cern_OU_Organic_Units_OU_Users_CN_atlpilo2_CN_596434_CN_Robot__ATLAS_Pilot2_atlas_Role_production_Capability_NULL;
>>>
>>> jobId=CREAM745296672; status=PROCESSING;
>>> 19 Feb 2012 11:47:42,942 INFO
>>> org.glite.ce.creamapi.jobmanagement.cmdexecutor.AbstractJobExecutor
>>> (AbstractJobExecutor.java:2163) - (Worker Thread 3) JOB CREAM745296672
>>> STATUS CHANGED: REGISTERED => PENDING [localUser=prdatlas002]
>>> [delegationId=1328717924.913405]
>>> 19 Feb 2012 11:51:27,522 INFO
>>> org.glite.ce.creamapi.jobmanagement.cmdexecutor.AbstractJobExecutor
>>> (AbstractJobExecutor.java:2163) - (Worker Thread 3) JOB CREAM745296672
>>> STATUS CHANGED: PENDING => ABORTED [failureReason=BLAH error: no jobId
>>> in submission script's output (stdout:) (stderr: <blah> execute_cmd:
>>> 200
>>> seconds timeout expired, killing child process.-) N/A (jobId =
>>> CREAM745296672)] [localUser=prdatlas002]
>>> [delegationId=1328717924.913405]
>>> 19 Feb 2012 12:18:43,761 INFO
>>> org.glite.ce.creamapi.jobmanagement.cmdexecutor.AbstractJobExecutor
>>> (AbstractJobExecutor.java:883) - (Worker Thread 34)
>>> REMOTE_REQUEST_ADDRESS=128.142.194.89;
>>> USER_DN=/DC=ch/DC=cern/OU=Organic
>>> Units/OU=Users/CN=atlpilo2/CN=596434/CN=Robot: ATLAS Pilot2;
>>> USER_FQAN={
>>> /atlas/Role=production/Capability=NULL;
>>> /atlas/Role=NULL/Capability=NULL;
>>> /atlas/lcg1/Role=NULL/Capability=NULL;
>>> }; CMD_NAME=JOB_PURGE; CMD_CATEGORY=JOB_MANAGEMENT;
>>> CMD_STATUS=PROCESSING; commandName=JOB_PURGE;
>>> cmdExecutorName=BLAHExecutor;
>>> userId=_DC_ch_DC_cern_OU_Organic_Units_OU_Users_CN_atlpilo2_CN_596434_CN_Robot__ATLAS_Pilot2_atlas_Role_production_Capability_NULL;
>>>
>>> jobId=CREAM745296672; status=PROCESSING;
>>> 19 Feb 2012 12:18:48,158 INFO
>>> org.glite.ce.cream.jobmanagement.db.table.JobTable
>>> (JobTable.java:162) -
>>> (Worker Thread 34) Job deleted. JobId = CREAM745296672
>>> 19 Feb 2012 12:18:48,177 INFO
>>> org.glite.ce.creamapi.jobmanagement.cmdexecutor.AbstractJobExecutor
>>> (AbstractJobExecutor.java:1946) - (Worker Thread 34) purge: purged job
>>> CREAM745296672
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>> On 22.02.2012 19:15, Leslie Groer wrote:
>>>> Hi Gila, and thanks to the others for the replies. I should mention
>>>> that these are NOT cancelled jobs on our site. Most of the ones I
>>>> checked today are pilots that did not find payloads, but I do see at
>>>> least one successful user analysis job that consumed 27 mins of
>>>> walltime and 1 minute of CPU and still reported the "Unable to copy
>>>> file" error:
>>>>
>>>> From the Torque server accounting logs:
>>>>
>>>> 20120222:02/22/2012 11:10:21;Q;10312027.gpc-sched;queue=atlas
>>>> 20120222:02/22/2012 11:10:21;Q;10312027.gpc-sched;queue=serial
>>>> 20120222:02/22/2012 11:10:39;S;10312027.gpc-sched;user=atlaspt1
>>>> group=atlaspt jobname=craaa_563417814 queue=serial ctime=1329927021
>>>> qtime=1329927021 etime=1329927021 start=1329927039
>>>> owner=atlaspt1@lcg-admin3 exec_host=gpc-f113n066/6
>>>> Resource_List.os=centos53computeHEP Resource_List.walltime=35:03:00
>>>> 20120222:02/22/2012 11:37:51;E;10312027.gpc-sched;user=atlaspt1
>>>> group=atlaspt jobname=craaa_563417814 queue=serial ctime=1329927021
>>>> qtime=1329927021 etime=1329927021 start=1329927039
>>>> owner=atlaspt1@lcg-admin3 exec_host=gpc-f113n066/6
>>>> Resource_List.os=centos53computeHEP Resource_List.walltime=35:03:00
>>>> session=28206 end=1329928671 Exit_status=0
>>>> resources_used.cput=00:01:19
>>>> resources_used.mem=307956kb resources_used.vmem=1046852kb
>>>> resources_used.walltime=00:27:12
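>>>>
>>>> In case it helps others dig: the records above come from the daily
>>>> accounting files. On a stock Torque install they can be pulled with
>>>> something like the following (the spool path may differ on your setup):
>>>>
>>>> grep 10312027 /var/spool/torque/server_priv/accounting/20120222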
>>>>
>>>> This job maps to PandaID
>>>> http://panda.cern.ch/server/pandamon/query?job=1432922395
>>>>
>>>> which shows success, but I still get a PBS JOB email with the
>>>> following:
>>>>
>>>> PBS Job Id: 10312027.gpc-sched
>>>> Job Name: craaa_563417814
>>>> Exec host: gpc-f113n066/6
>>>> An error has occurred processing your job, see below.
>>>> Post job file processing error; job 10312027.gpc-sched on host
>>>> gpc-f113n066/6
>>>>
>>>> Unable to copy file err_craaa_563417814_StandardOutput
>>>> [log in to unmask]:/opt/glite/var/cream_sandbox/atlaspt/_C_CA_O_Grid_OU_triumf_ca_CN_Asoka_De_Silva_GC1_atlas_Role_pilot_Capability_NULL_atlaspt1/56/CREAM563417814/StandardError
>>>>
>>>>
>>>> *** error from copy
>>>> scp:/opt/glite/var/cream_sandbox/atlaspt/_C_CA_O_Grid_OU_triumf_ca_CN_Asoka_De_Silva_GC1_atlas_Role_pilot_Capability_NULL_atlaspt1/56/CREAM563417814/StandardError:
>>>>
>>>> No such file or directory
>>>> *** end error output
>>>>
>>>> Unable to copy file out_craaa_563417814_StandardOutput
>>>> [log in to unmask]:/opt/glite/var/cream_sandbox/atlaspt/_C_CA_O_Grid_OU_triumf_ca_CN_Asoka_De_Silva_GC1_atlas_Role_pilot_Capability_NULL_atlaspt1/56/CREAM563417814/StandardOutput
>>>>
>>>>
>>>> *** error from copy
>>>> scp:/opt/glite/var/cream_sandbox/atlaspt/_C_CA_O_Grid_OU_triumf_ca_CN_Asoka_De_Silva_GC1_atlas_Role_pilot_Capability_NULL_atlaspt1/56/CREAM563417814/StandardOutput:
>>>>
>>>> No such file or directory
>>>> *** end error output
>>>>
>>>>
>>>> We do not experience any of the Q->W flapping others talk about. We
>>>> did set
>>>>
>>>> CLASSCFG[atlas] CancelFailedJobs=TRUE
>>>>
>>>> in our MOAB configuration, so that if a job fails it just burns a
>>>> pilot and is kicked out of the queue rather than being rescheduled (we
>>>> were sometimes seeing flapping with jobs that never got scheduled to a
>>>> node, but that had a different pathology).
>>>>
>>>> Leslie
>>>>
>>>>
>>>> Gila Arrondo Miguel Angel wrote:
>>>>> Hi Leslie,
>>>>>
>>>>> With a very similar setup (Moab/Torque on 1 host and CREAM-CE on
>>>>> another) we've seen an error somewhat similar to what you describe here.
>>>>> In our /var/log/messages we find tons of entries like this:
>>>>>
>>>>> Feb 22 09:10:24 wn202 pbs_mom: LOG_ERROR::sys_copy, command
>>>>> '/usr/bin/scp -rpB
>>>>> [log in to unmask]:/cream_localsandbox/data/atlas/_DC_ch_DC_cern_OU_Organic_Units_OU_Users_CN_atlpilo1_CN_614260_CN_Robot__ATLAS_Pilot1_atlas_Role_pilot_Capability_NULL_atlasplt/78/CREAM780422084/StandardError'
>>>>>
>>>>> failed with status=1, giving up after 4 attempts
>>>>>
>>>>> Feb 22 09:10:24 wn202 pbs_mom: LOG_ERROR::req_cpyfile, Unable to copy
>>>>> file err_cre02_780422084_StandardError
>>>>> [log in to unmask]:/cream_localsandbox/data/atlas/_DC_ch_DC_cern_OU_Organic_Units_OU_Users_CN_atlpilo1_CN_614260_CN_Robot__ATLAS_Pilot1_atlas_Role_pilot_Capability_NULL_atlasplt/78/CREAM780422084/StandardError
>>>>>
>>>>>
>>>>>
>>>>> Feb 22 09:10:29 wn202 pbs_mom: LOG_ERROR::sys_copy, command
>>>>> '/usr/bin/scp -rpB
>>>>> [log in to unmask]:/cream_localsandbox/data/atlas/_DC_ch_DC_cern_OU_Organic_Units_OU_Users_CN_atlpilo1_CN_614260_CN_Robot__ATLAS_Pilot1_atlas_Role_pilot_Capability_NULL_atlasplt/78/CREAM780422084/StandardOutput'
>>>>>
>>>>> failed with status=1, giving up after 4 attempts
>>>>>
>>>>> Have you checked whether these failed transfers are from cancelled
>>>>> jobs? In our experience, it was always the case.
>>>>>
>>>>> We've also looked at ways to mitigate this annoying verbosity, but no
>>>>> luck so far. The only option we can think of is to stop using scp for
>>>>> the copies and move the sandbox to an area shared with the WNs, so
>>>>> that regular cp is used (the $usecp directive) and these errors are
>>>>> hidden. But, of course, this approach has its own disadvantages as
>>>>> well.
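>>>>>
>>>>> For reference, the $usecp line goes in mom_priv/config on each WN;
>>>>> the hostname and shared path below are only illustrative:
>>>>>
>>>>> # in <torque_spool>/mom_priv/config, e.g. /var/spool/torque/mom_priv/config
>>>>> $usecp ce.example.org:/cream_localsandbox /shared/cream_localsandbox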
>>>>>
>>>>> Does anyone else have a better idea?
>>>>>
>>>>> Cheers,
>>>>> Miguel
>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Miguel Gila
>>>>> CSCS Swiss National Supercomputing Centre
>>>>> HPC Solutions
>>>>> Via Cantonale, Galleria 2 | CH-6928 Manno | Switzerland
>>>>> Miguel.gila [at] cscs.ch
>>>>>
>>>>> From: Leslie Groer <[log in to unmask]>
>>>>> Reply-To: LHC Computer Grid - Rollout <[log in to unmask]>
>>>>> Date: Thu, 16 Feb 2012 14:43:56 -0500
>>>>> To: <[log in to unmask]>
>>>>> Subject: [LCG-ROLLOUT] Torque/CREAM race condition at job completion?
>>>>>
>>>>> scp:
>>>>> /opt/glite/var/cream_sandbox/atlasprd/_C_CA_O_Grid_OU_triumf_ca_CN_Asoka_De_Silva_GC1_atlas_Role_production_Capability_NULL_prdatl15/61/CREAM619112035/StandardOutput:
>>>>>
>>>>> No such file or directory
>>>>>
>>>>
>>
>>