Hi Lukasz,
LOGS: 15 April 2012 - 24 Feb - Host: lcg-admin3.scinet.utoronto.ca
Here is the result of the search on our CREAM logs:
[root@lcg-admin3 log]# grep -e "failureReason=BLAH error" glite-ce-* |
sed -e 's/.*stderr://' | sed -e 's|N/A.*||' | sort | uniq -c
25 Cannot connect to default server host 'gpc-sched.scinet.local' -
check pbs_server daemon.-qsub: cannot connect to server
gpc-sched.scinet.local (errno=111) Connection refused-)
7455 pbs_iff: cannot read reply from pbs_server-No Permission.-qsub:
cannot connect to server gpc-sched.scinet.local (errno=15007)
Unauthorized Request -)
2 qsub: Invalid credential-)
Leslie
Lukasz Flis wrote:
> Hi.
>> This is a different issue (the third one discussed in this thread):
>> the job is not submitted to the batch system at all, because within 200
>> secs it was not able to get back the LRMS jobid.
>> Or was the job submitted?
>>
>
> Massimo, this is actually a cause of issue (b).
> The sandbox is removed __before__ the job is started by PBS (while it is
> queued), or while the job is running, because CREAM wrongly thinks that
> the job has finished, due to client->server communication errors.
>
> Both cases cause "Unable to copy file" errors.
>
> The job state change is due to the fact that if state W lasts too long,
> the job may be put back into the Q state to be run on a different host.
>
> Every client->server communication problem during a pbs_status.sh
> invocation results in the sandbox being removed and the job being marked
> as aborted, but not removed from the PBS server queue.
>
> I would like to ask you guys to check how many batch-related errors
> you have in your CREAM logs:
>
> cd /opt/glite/var/log
> grep -e "failureReason=BLAH error" glite-ce-* | sed -e 's/.*stderr://'
> | sed -e 's|N/A.*||' | sort | uniq -c
>
>
> LOGS: 17 Feb - 24 Feb - Host: cream.grid.cyf-kr.edu.pl
>
> [root@cream log]# grep -e "failureReason=BLAH error" glite-ce-* | sed
> -e 's/.*stderr://' | sed -e 's|N/A.*||' | sort | uniq -c
> 159 <blah> execute_cmd: 200 seconds timeout expired, killing child
> process.-)
> 45 Connection timed out-qsub: cannot connect to server
> batch.grid.cyf-kr.edu.pl (errno=110) Connection timed out-)
> 697 qsub: Error (15033 - Batch protocol error) submitting job-)
> 2 qsub: Invalid credential MSG=Hosts do not match-)
> 52 qsub: Maximum number of jobs already in queue MSG=total number
> of jobs in queue exceeds the queue limit: user
> [log in to unmask], queue biomed-)
> 11 qsub: Maximum number of jobs already in queue MSG=total number
> of jobs in queue exceeds the queue limit: user
> [log in to unmask], queue biomed-)
> 14 qsub: Maximum number of jobs already in queue MSG=total number
> of jobs in queue exceeds the queue limit: user
> [log in to unmask], queue biomed-)
> 4 qsub: Maximum number of jobs already in queue MSG=total number
> of jobs in queue exceeds the queue limit: user
> [log in to unmask], queue biomed-)
> 77 qsub: Maximum number of jobs already in queue MSG=total number
> of jobs in queue exceeds the queue limit: user
> [log in to unmask], queue biomed-)
>
>
> LOGS: 13 Dec - 24 Feb - Host: cream02.grid.cyf-kr.edu.pl
> 625 <blah> execute_cmd: 200 seconds timeout expired, killing child
> process.-)
> 156 Cannot connect to default server host
> 'batch.grid.cyf-kr.edu.pl' - check pbs_server daemon.-qsub: cannot
> connect to server batch.grid.cyf-kr.edu.pl (errno=111) Connection
> refused-)
> 157 Connection timed out-qsub: cannot connect to server
> batch.grid.cyf-kr.edu.pl (errno=110) Connection timed out-)
> 1 qsub: End of File-)
> 2393 qsub: Error (15033 - Batch protocol error) submitting job-)
> 37 qsub: Invalid credential-)
> 1 qsub: Invalid credential MSG=Hosts do not match-)
> 1 qsub: Maximum number of jobs already in queue MSG=total number
> of jobs in queue exceeds the queue limit: user
> [log in to unmask], queue biomed-)
> 25 qsub: Maximum number of jobs already in queue MSG=total number
> of jobs in queue exceeds the queue limit: user
> [log in to unmask], queue biomed-)
> 129 qsub: Maximum number of jobs already in queue MSG=total number
> of jobs in queue exceeds the queue limit: user
> [log in to unmask], queue biomed-)
> 6 qsub: Maximum number of jobs already in queue MSG=total number
> of jobs in queue exceeds the queue limit: user
> [log in to unmask], queue biomed-)
> 75 qsub: Maximum number of jobs already in queue MSG=total number
> of jobs in queue exceeds the queue limit: user
> [log in to unmask], queue biomed-)
> 1516 qsub: Maximum number of jobs already in queue MSG=total number
> of jobs in queue exceeds the queue limit: user
> [log in to unmask], queue biomed-)
> 3 qsub: Maximum number of jobs already in queue MSG=total number
> of jobs in queue exceeds the queue limit: user
> [log in to unmask], queue biomed-)
> 17 qsub: Maximum number of jobs already in queue MSG=total number
> of jobs in queue exceeds the queue limit: user
> [log in to unmask], queue biomed-)
> 213 qsub: Maximum number of jobs already in queue MSG=total number
> of jobs in queue exceeds the queue limit: user
> [log in to unmask], queue biomed-)
> 63 qsub: Maximum number of jobs already in queue MSG=total number
> of jobs in queue exceeds the queue limit: user
> [log in to unmask], queue biomed-)
> 112 qsub: Maximum number of jobs already in queue MSG=total number
> of jobs in queue exceeds the queue limit: user
> [log in to unmask], queue biomed-)
> 6 qsub: Premature end of message-)
> 4 qsub: Request not allowed: Server shutting down-)
>
>
>
>>
>> I remind the other two issues discussed in this thread:
>>
>> a: jobs changing their status q --> w --> q ...
>> happening when the proxy expires and the proxy file is removed
>> I still would like to understand if setting:
>>
>>
>>
>> DELEGATION_PURGE_RATE=-1
>>
>> in the cream-config.xml and restarting tomcat helps
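>>
>> For example, a minimal sketch (the config path and tomcat service name
>> below are my assumptions for a typical gLite 3.2 CREAM CE; check your
>> installation before running anything):
>>
>> grep DELEGATION_PURGE_RATE /opt/glite/etc/glite-ce-cream/cream-config.xml
>> # edit that value to -1 by hand, then:
>> service tomcat5 restart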
>>
>>
>> b: cream sandbox dir removed too early, and causing error messages
>> such as:
>>
>>
>> Unable to copy file err_craaa_619112035_StandardOutput to
>> [log in to unmask]:/opt/glite/var/cream_sandbox/atlasprd/_C_CA_O_Grid_OU_triumf_ca_CN_Asoka_De_Silva_GC1_atlas_Role_production_Capability_NULL_prdatl15/61/CREAM619112035/StandardError
>>
>> *** error from copy
>> scp:
>> /opt/glite/var/cream_sandbox/atlasprd/_C_CA_O_Grid_OU_triumf_ca_CN_Asoka_De_Silva_GC1_atlas_Role_production_Capability_NULL_prdatl15/61/CREAM619112035/StandardError:
>> No such file or directory
>> *** end error output
>>
>>
>> I didn't understand whether in this case too the job status keeps
>> changing between q and w, or whether the job exits the Torque queue.
>>
>> Here, as I wrote, I can guess what happened, but please provide (in a
>> GGUS ticket) the glite-ce-cream logs along with the ID of a
>> problematic job.
>>
>>
>> Cheers, Massimo
>>
>>
>>
>> On 02/24/2012 12:54 PM, Lukasz Flis wrote:
>>> Hi *,
>>>
>>> I think the problem is caused by a bad response from the PBS server
>>> or a timeout while running the qstat command.
>>>
>>>
>>> CREAM periodically queries the PBS server for job status.
>>> When the server is restarted or busy, the command may time out.
>>> As we can see below, such a job is then considered ABORTED, which is
>>> simply wrong in most cases.
>>>
>>> If the job is unavailable, qstat will return an exit code of 153, and
>>> pbs_status.sh, which is invoked by CREAM, should handle this case
>>> properly. In other words, for the TORQUE LRMS only exit code 153 means
>>> that the job has finished executing; all other codes should not trigger
>>> a change of state to ABORTED (a rough sketch of this logic follows the
>>> list below).
>>>
>>> Otherwise:
>>> * jobs that have not yet started will be unable to do STAGEIN,
>>> resulting in the "W" state
>>> * jobs that are already running will not be able to do STAGEOUT
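>>>
>>> Something like this untested sketch is the logic I mean (not the actual
>>> pbs_status.sh code; $jobid is a placeholder for the batch job id):
>>>
>>> out=$(qstat -f "$jobid" 2>&1); rc=$?
>>> if [ $rc -eq 0 ]; then
>>>     echo "$out"              # job still known to the server: report its state
>>> elif [ $rc -eq 153 ]; then
>>>     echo "job completed"     # genuinely gone from the queue
>>> else
>>>     # transient server problem: report an error, do NOT mark the job ABORTED
>>>     echo "qstat failed (rc=$rc), retry later" >&2
>>>     exit $rc
>>> fi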
>>>
>>>
>>> I think CREAM/BLAH needs to handle temporary PBS service
>>> unavailabilities better.
>>>
>>> Can anyone point me to documentation describing the CREAM ->
>>> pbs_{status,submit,resume,hold,cancel} interface?
>>>
>>> Cheers,
>>> --
>>> Lukasz Flis
>>>
>>>
>>>
>>> 19 Feb 2012 11:47:08,865 INFO
>>> org.glite.ce.cream.jobmanagement.db.table.JobTable
>>> (JobTable.java:232) -
>>> (TP-Processor17) Job inserted. JobId = CREAM745296672
>>> 19 Feb 2012 11:47:08,949 INFO
>>> org.glite.ce.creamapi.jobmanagement.cmdexecutor.AbstractJobExecutor
>>> (AbstractJobExecutor.java:2163) - (TP-Processor17) JOB CREAM745296672
>>> STATUS CHANGED: -- => REGISTERED [localUser=prdatlas002]
>>> [delegationId=1328717924.913405]
>>> 19 Feb 2012 11:47:42,090 INFO
>>> org.glite.ce.creamapi.jobmanagement.cmdexecutor.AbstractJobExecutor
>>> (AbstractJobExecutor.java:895) - (Worker Thread 3)
>>> REMOTE_REQUEST_ADDRESS=128.142.194.89;
>>> USER_DN=/DC=ch/DC=cern/OU=Organic
>>> Units/OU=Users/CN=atlpilo2/CN=596434/CN=Robot: ATLAS Pilot2;
>>> USER_FQAN={
>>> /atlas/Role=production/Capability=NULL;
>>> /atlas/Role=NULL/Capability=NULL;
>>> /atlas/lcg1/Role=NULL/Capability=NULL;
>>> }; CMD_NAME=JOB_START; CMD_CATEGORY=JOB_MANAGEMENT;
>>> CMD_STATUS=PROCESSING; commandName=JOB_START;
>>> cmdExecutorName=BLAHExecutor;
>>> userId=_DC_ch_DC_cern_OU_Organic_Units_OU_Users_CN_atlpilo2_CN_596434_CN_Robot__ATLAS_Pilot2_atlas_Role_production_Capability_NULL;
>>>
>>> jobId=CREAM745296672; status=PROCESSING;
>>> 19 Feb 2012 11:47:42,942 INFO
>>> org.glite.ce.creamapi.jobmanagement.cmdexecutor.AbstractJobExecutor
>>> (AbstractJobExecutor.java:2163) - (Worker Thread 3) JOB CREAM745296672
>>> STATUS CHANGED: REGISTERED => PENDING [localUser=prdatlas002]
>>> [delegationId=1328717924.913405]
>>> 19 Feb 2012 11:51:27,522 INFO
>>> org.glite.ce.creamapi.jobmanagement.cmdexecutor.AbstractJobExecutor
>>> (AbstractJobExecutor.java:2163) - (Worker Thread 3) JOB CREAM745296672
>>> STATUS CHANGED: PENDING => ABORTED [failureReason=BLAH error: no jobId
>>> in submission script's output (stdout:) (stderr: <blah> execute_cmd:
>>> 200
>>> seconds timeout expired, killing child process.-) N/A (jobId =
>>> CREAM745296672)] [localUser=prdatlas002]
>>> [delegationId=1328717924.913405]
>>> 19 Feb 2012 12:18:43,761 INFO
>>> org.glite.ce.creamapi.jobmanagement.cmdexecutor.AbstractJobExecutor
>>> (AbstractJobExecutor.java:883) - (Worker Thread 34)
>>> REMOTE_REQUEST_ADDRESS=128.142.194.89;
>>> USER_DN=/DC=ch/DC=cern/OU=Organic
>>> Units/OU=Users/CN=atlpilo2/CN=596434/CN=Robot: ATLAS Pilot2;
>>> USER_FQAN={
>>> /atlas/Role=production/Capability=NULL;
>>> /atlas/Role=NULL/Capability=NULL;
>>> /atlas/lcg1/Role=NULL/Capability=NULL;
>>> }; CMD_NAME=JOB_PURGE; CMD_CATEGORY=JOB_MANAGEMENT;
>>> CMD_STATUS=PROCESSING; commandName=JOB_PURGE;
>>> cmdExecutorName=BLAHExecutor;
>>> userId=_DC_ch_DC_cern_OU_Organic_Units_OU_Users_CN_atlpilo2_CN_596434_CN_Robot__ATLAS_Pilot2_atlas_Role_production_Capability_NULL;
>>>
>>> jobId=CREAM745296672; status=PROCESSING;
>>> 19 Feb 2012 12:18:48,158 INFO
>>> org.glite.ce.cream.jobmanagement.db.table.JobTable
>>> (JobTable.java:162) -
>>> (Worker Thread 34) Job deleted. JobId = CREAM745296672
>>> 19 Feb 2012 12:18:48,177 INFO
>>> org.glite.ce.creamapi.jobmanagement.cmdexecutor.AbstractJobExecutor
>>> (AbstractJobExecutor.java:1946) - (Worker Thread 34) purge: purged job
>>> CREAM745296672
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>> On 22.02.2012 19:15, Leslie Groer wrote:
>>>> Hi Gila, and thanks to the others for the replies. I should mention
>>>> that these are NOT cancelled jobs on our site. Most of the ones I
>>>> checked today are pilots that did not find payloads, but I do see at
>>>> least one successful user analysis job that consumed 27 mins of
>>>> walltime and 1 minute of CPU and still reported the "Unable to copy
>>>> file" error:
>>>>
>>>> From the Torque server accounting logs:
>>>>
>>>> 20120222:02/22/2012 11:10:21;Q;10312027.gpc-sched;queue=atlas
>>>> 20120222:02/22/2012 11:10:21;Q;10312027.gpc-sched;queue=serial
>>>> 20120222:02/22/2012 11:10:39;S;10312027.gpc-sched;user=atlaspt1
>>>> group=atlaspt jobname=craaa_563417814 queue=serial ctime=1329927021
>>>> qtime=1329927021 etime=1329927021 start=1329927039
>>>> owner=atlaspt1@lcg-admin3 exec_host=gpc-f113n066/6
>>>> Resource_List.os=centos53computeHEP Resource_List.walltime=35:03:00
>>>> 20120222:02/22/2012 11:37:51;E;10312027.gpc-sched;user=atlaspt1
>>>> group=atlaspt jobname=craaa_563417814 queue=serial ctime=1329927021
>>>> qtime=1329927021 etime=1329927021 start=1329927039
>>>> owner=atlaspt1@lcg-admin3 exec_host=gpc-f113n066/6
>>>> Resource_List.os=centos53computeHEP Resource_List.walltime=35:03:00
>>>> session=28206 end=1329928671 Exit_status=0
>>>> resources_used.cput=00:01:19
>>>> resources_used.mem=307956kb resources_used.vmem=1046852kb
>>>> resources_used.walltime=00:27:12
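>>>>
>>>> In case it helps others dig: the records above come from the daily
>>>> accounting files. On a stock Torque install they can be pulled with
>>>> something like the following (the spool path may differ on your setup):
>>>>
>>>> grep 10312027 /var/spool/torque/server_priv/accounting/20120222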
>>>>
>>>> This job maps to PandaID
>>>> http://panda.cern.ch/server/pandamon/query?job=1432922395
>>>>
>>>> which shows success, but I still get a PBS JOB email with the
>>>> following:
>>>>
>>>> PBS Job Id: 10312027.gpc-sched
>>>> Job Name: craaa_563417814
>>>> Exec host: gpc-f113n066/6
>>>> An error has occurred processing your job, see below.
>>>> Post job file processing error; job 10312027.gpc-sched on host
>>>> gpc-f113n066/6
>>>>
>>>> Unable to copy file err_craaa_563417814_StandardOutput
>>>> [log in to unmask]:/opt/glite/var/cream_sandbox/atlaspt/_C_CA_O_Grid_OU_triumf_ca_CN_Asoka_De_Silva_GC1_atlas_Role_pilot_Capability_NULL_atlaspt1/56/CREAM563417814/StandardError
>>>>
>>>>
>>>> *** error from copy
>>>> scp:/opt/glite/var/cream_sandbox/atlaspt/_C_CA_O_Grid_OU_triumf_ca_CN_Asoka_De_Silva_GC1_atlas_Role_pilot_Capability_NULL_atlaspt1/56/CREAM563417814/StandardError:
>>>>
>>>> No such file or directory
>>>> *** end error output
>>>>
>>>> Unable to copy file out_craaa_563417814_StandardOutput
>>>> [log in to unmask]:/opt/glite/var/cream_sandbox/atlaspt/_C_CA_O_Grid_OU_triumf_ca_CN_Asoka_De_Silva_GC1_atlas_Role_pilot_Capability_NULL_atlaspt1/56/CREAM563417814/StandardOutput
>>>>
>>>>
>>>> *** error from copy
>>>> scp:/opt/glite/var/cream_sandbox/atlaspt/_C_CA_O_Grid_OU_triumf_ca_CN_Asoka_De_Silva_GC1_atlas_Role_pilot_Capability_NULL_atlaspt1/56/CREAM563417814/StandardOutput:
>>>>
>>>> No such file or directory
>>>> *** end error output
>>>>
>>>>
>>>> We do not experience any of the Q->W flapping others talk about. We
>>>> did set
>>>>
>>>> CLASSCFG[atlas] CancelFailedJobs=TRUE
>>>>
>>>> in our MOAB configuration, so that if a job fails it just burns a
>>>> pilot and is kicked out of the queue rather than being rescheduled (we
>>>> were sometimes seeing flapping with jobs that never got scheduled to a
>>>> node, but that had a different pathology).
>>>>
>>>> Leslie
>>>>
>>>>
>>>> Gila Arrondo Miguel Angel wrote:
>>>>> Hi Leslie,
>>>>>
>>>>> With a very similar setup (Moab/Torque on 1 host and CREAM-CE on
>>>>> another) we've seen an error somewhat similar to what you describe here.
>>>>> In our /var/log/messages we find tons of entries like this:
>>>>>
>>>>> Feb 22 09:10:24 wn202 pbs_mom: LOG_ERROR::sys_copy, command
>>>>> '/usr/bin/scp -rpB
>>>>> [log in to unmask]:/cream_localsandbox/data/atlas/_DC_ch_DC_cern_OU_Organic_Units_OU_Users_CN_atlpilo1_CN_614260_CN_Robot__ATLAS_Pilot1_atlas_Role_pilot_Capability_NULL_atlasplt/78/CREAM780422084/StandardError'
>>>>>
>>>>> failed with status=1, giving up after 4 attempts
>>>>>
>>>>> Feb 22 09:10:24 wn202 pbs_mom: LOG_ERROR::req_cpyfile, Unable to copy
>>>>> file err_cre02_780422084_StandardError
>>>>> [log in to unmask]:/cream_localsandbox/data/atlas/_DC_ch_DC_cern_OU_Organic_Units_OU_Users_CN_atlpilo1_CN_614260_CN_Robot__ATLAS_Pilot1_atlas_Role_pilot_Capability_NULL_atlasplt/78/CREAM780422084/StandardError
>>>>>
>>>>>
>>>>>
>>>>> Feb 22 09:10:29 wn202 pbs_mom: LOG_ERROR::sys_copy, command
>>>>> '/usr/bin/scp -rpB
>>>>> [log in to unmask]:/cream_localsandbox/data/atlas/_DC_ch_DC_cern_OU_Organic_Units_OU_Users_CN_atlpilo1_CN_614260_CN_Robot__ATLAS_Pilot1_atlas_Role_pilot_Capability_NULL_atlasplt/78/CREAM780422084/StandardOutput'
>>>>>
>>>>> failed with status=1, giving up after 4 attempts
>>>>>
>>>>> Have you checked whether these failed transfers are from cancelled
>>>>> jobs? In our experience, it was always the case.
>>>>>
>>>>> We've also looked at ways to mitigate this annoying verbosity, but no
>>>>> luck so far. The only option we can think of is to stop using scp for
>>>>> the copies and move the sandbox to an area shared with the WNs, so
>>>>> that regular cp is used (the $usecp directive) and these errors are
>>>>> hidden. But, of course, this approach has its own disadvantages as
>>>>> well.
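>>>>>
>>>>> For reference, the $usecp line goes in mom_priv/config on each WN;
>>>>> the hostname and shared path below are only illustrative:
>>>>>
>>>>> # in <torque_spool>/mom_priv/config, e.g. /var/spool/torque/mom_priv/config
>>>>> $usecp ce.example.org:/cream_localsandbox /shared/cream_localsandbox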
>>>>>
>>>>> Does anyone else have a better idea?
>>>>>
>>>>> Cheers,
>>>>> Miguel
>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Miguel Gila
>>>>> CSCS Swiss National Supercomputing Centre
>>>>> HPC Solutions
>>>>> Via Cantonale, Galleria 2 | CH-6928 Manno | Switzerland
>>>>> Miguel.gila [at] cscs.ch
>>>>>
>>>>> From: Leslie Groer <[log in to unmask]>
>>>>> Reply-To: LHC Computer Grid - Rollout <[log in to unmask]>
>>>>> Date: Thu, 16 Feb 2012 14:43:56 -0500
>>>>> To: <[log in to unmask]>
>>>>> Subject: [LCG-ROLLOUT] Torque/CREAM race condition at job completion?
>>>>>
>>>>> scp:
>>>>> /opt/glite/var/cream_sandbox/atlasprd/_C_CA_O_Grid_OU_triumf_ca_CN_Asoka_De_Silva_GC1_atlas_Role_production_Capability_NULL_prdatl15/61/CREAM619112035/StandardOutput:
>>>>>
>>>>> No such file or directory
>>>>>
>>>>
>>
>>