Hi.
> This is a different issue (the third one discussed in this thread):
> the job is not submitted at all to the batch system because in 200
> secs it was not able to get back the lrms jobid
> Or was the job submitted?
>
Massimo, this is actually a cause of issue (b).
The sandbox is removed __before__ the job is started by PBS (while it is
queued), or while the job is running, because CREAM wrongly thinks that
the job has finished due to client->server communication errors.
Both cases cause "Unable to copy file" errors.
The job state change happens because, if the W state lasts too long, the
job may be put back into the Q state to be run on a different host.
Every client->server communication problem during a pbs_status.sh
invocation results in the sandbox being removed and the job being marked
as ABORTED, while the job itself is not removed from the PBS server queue.
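
For illustration, here is a minimal sketch of the exit-code handling I
mean (hand-written for this mail, not the actual pbs_status.sh code; the
wrapper shape is an assumption):

#!/bin/sh
# Sketch: map the TORQUE qstat exit code to a job-state decision.
jobid="$1"
qstat "$jobid" >/dev/null 2>&1
rc=$?
if [ "$rc" -eq 0 ]; then
    echo "job $jobid still known to the server"   # leave CREAM state alone
elif [ "$rc" -eq 153 ]; then
    echo "job $jobid unknown: finished"           # 153 is the only "done" signal
else
    # timeout, connection refused, server restarting, ...
    # transient failure: report it, but do NOT abort or purge the sandbox
    echo "transient qstat error (rc=$rc) for job $jobid" >&2
    exit 1
fi
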
I would like to ask you guys to check how many batch-related errors you
have in your CREAM logs:
cd /opt/glite/var/log
grep -e "failureReason=BLAH error" glite-ce-* | sed -e 's/.*stderr://' |
sed -e 's|N/A.*||' | sort | uniq -c
LOGS: 17 Feb - 24 Feb - Host: cream.grid.cyf-kr.edu.pl
[root@cream log]# grep -e "failureReason=BLAH error" glite-ce-* | sed -e
's/.*stderr://' | sed -e 's|N/A.*||' | sort | uniq -c
159 <blah> execute_cmd: 200 seconds timeout expired, killing child
process.-)
45 Connection timed out-qsub: cannot connect to server
batch.grid.cyf-kr.edu.pl (errno=110) Connection timed out-)
697 qsub: Error (15033 - Batch protocol error) submitting job-)
2 qsub: Invalid credential MSG=Hosts do not match-)
52 qsub: Maximum number of jobs already in queue MSG=total number
of jobs in queue exceeds the queue limit: user
[log in to unmask], queue biomed-)
11 qsub: Maximum number of jobs already in queue MSG=total number
of jobs in queue exceeds the queue limit: user
[log in to unmask], queue biomed-)
14 qsub: Maximum number of jobs already in queue MSG=total number
of jobs in queue exceeds the queue limit: user
[log in to unmask], queue biomed-)
4 qsub: Maximum number of jobs already in queue MSG=total number
of jobs in queue exceeds the queue limit: user
[log in to unmask], queue biomed-)
77 qsub: Maximum number of jobs already in queue MSG=total number
of jobs in queue exceeds the queue limit: user
[log in to unmask], queue biomed-)
LOGS: 13 Dec - 24 Feb - Host: cream02.grid.cyf-kr.edu.pl
625 <blah> execute_cmd: 200 seconds timeout expired, killing child
process.-)
156 Cannot connect to default server host
'batch.grid.cyf-kr.edu.pl' - check pbs_server daemon.-qsub: cannot
connect to server batch.grid.cyf-kr.edu.pl (errno=111) Connection refused-)
157 Connection timed out-qsub: cannot connect to server
batch.grid.cyf-kr.edu.pl (errno=110) Connection timed out-)
1 qsub: End of File-)
2393 qsub: Error (15033 - Batch protocol error) submitting job-)
37 qsub: Invalid credential-)
1 qsub: Invalid credential MSG=Hosts do not match-)
1 qsub: Maximum number of jobs already in queue MSG=total number
of jobs in queue exceeds the queue limit: user
[log in to unmask], queue biomed-)
25 qsub: Maximum number of jobs already in queue MSG=total number
of jobs in queue exceeds the queue limit: user
[log in to unmask], queue biomed-)
129 qsub: Maximum number of jobs already in queue MSG=total number
of jobs in queue exceeds the queue limit: user
[log in to unmask], queue biomed-)
6 qsub: Maximum number of jobs already in queue MSG=total number
of jobs in queue exceeds the queue limit: user
[log in to unmask], queue biomed-)
75 qsub: Maximum number of jobs already in queue MSG=total number
of jobs in queue exceeds the queue limit: user
[log in to unmask], queue biomed-)
1516 qsub: Maximum number of jobs already in queue MSG=total number
of jobs in queue exceeds the queue limit: user
[log in to unmask], queue biomed-)
3 qsub: Maximum number of jobs already in queue MSG=total number
of jobs in queue exceeds the queue limit: user
[log in to unmask], queue biomed-)
17 qsub: Maximum number of jobs already in queue MSG=total number
of jobs in queue exceeds the queue limit: user
[log in to unmask], queue biomed-)
213 qsub: Maximum number of jobs already in queue MSG=total number
of jobs in queue exceeds the queue limit: user
[log in to unmask], queue biomed-)
63 qsub: Maximum number of jobs already in queue MSG=total number
of jobs in queue exceeds the queue limit: user
[log in to unmask], queue biomed-)
112 qsub: Maximum number of jobs already in queue MSG=total number
of jobs in queue exceeds the queue limit: user
[log in to unmask], queue biomed-)
6 qsub: Premature end of message-)
4 qsub: Request not allowed: Server shutting down-)
>
> I remind the other two issues discussed in this thread:
>
> a: jobs changing their status q-->w-->q ...
> happening when the proxy expires and the proxy file is removed
> I still would like to understand if setting:
>
>
>
> DELEGATION_PURGE_RATE=-1
>
> in the cream-config.xml and restarting tomcat helps
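>
> For reference, in cream-config.xml the parameter would look roughly
> like this (a sketch; the exact file layout can differ between releases):
>
> <parameter name="DELEGATION_PURGE_RATE" value="-1" />
>
> followed by a tomcat restart (e.g. service tomcat5 restart).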
>
>
> b: cream sandbox dir removed too early, and causing error messages
> such as:
>
>
> Unable to copy file err_craaa_619112035_StandardOutput to
> [log in to unmask]:/opt/glite/var/cream_sandbox/atlasprd/_C_CA_O_Grid_OU_triumf_ca_CN_Asoka_De_Silva_GC1_atlas_Role_production_Capability_NULL_prdatl15/61/CREAM619112035/StandardError
> *** error from copy
> scp:
> /opt/glite/var/cream_sandbox/atlasprd/_C_CA_O_Grid_OU_triumf_ca_CN_Asoka_De_Silva_GC1_atlas_Role_production_Capability_NULL_prdatl15/61/CREAM619112035/StandardError:
> No such file or directory
> *** end error output
>
>
> I didn't understand whether in this case too the job status keeps
> changing between q and w, or whether the job exits the torque queue.
>
> Here, as I wrote, I can guess what happened, but please provide (in a
> ggus ticket) the glite-ce-cream logs along with the id of a
> problematic job
>
>
> Cheers, Massimo
>
>
>
> On 02/24/2012 12:54 PM, Lukasz Flis wrote:
>> Hi *,
>>
>> I think the problem is caused by a bad response from the PBS server
>> or a timeout while running the qstat command.
>>
>>
>> CREAM periodically queries the PBS server for job status.
>> When the server is restarted or busy, the command may time out.
>> As we can see below, such a job is then considered ABORTED, which is
>> simply wrong in most cases.
>>
>> If the job is no longer known to the server, qstat will return an exit
>> code of 153, and pbs_status.sh, which is invoked by CREAM, should
>> handle this case properly. In other words, for the TORQUE LRMS only
>> exit code 153 means that the job has finished executing; all other
>> codes should not trigger a change of state to ABORTED.
>>
>> Otherwise:
>> * jobs that have not yet started will be unable to do STAGEIN,
>> resulting in the "W" state
>> * jobs that are already running will not be able to do STAGEOUT
>>
>>
>> I think CREAM/blah needs to handle temporary PBS service
>> unavailability better.
>>
>> Can anyone point me to documentation describing the CREAM ->
>> pbs_{status,submit,resume,hold,cancel} interface?
>>
>> Cheers,
>> --
>> Lukasz Flis
>>
>>
>>
>> 19 Feb 2012 11:47:08,865 INFO
>> org.glite.ce.cream.jobmanagement.db.table.JobTable (JobTable.java:232) -
>> (TP-Processor17) Job inserted. JobId = CREAM745296672
>> 19 Feb 2012 11:47:08,949 INFO
>> org.glite.ce.creamapi.jobmanagement.cmdexecutor.AbstractJobExecutor
>> (AbstractJobExecutor.java:2163) - (TP-Processor17) JOB CREAM745296672
>> STATUS CHANGED: -- => REGISTERED [localUser=prdatlas002]
>> [delegationId=1328717924.913405]
>> 19 Feb 2012 11:47:42,090 INFO
>> org.glite.ce.creamapi.jobmanagement.cmdexecutor.AbstractJobExecutor
>> (AbstractJobExecutor.java:895) - (Worker Thread 3)
>> REMOTE_REQUEST_ADDRESS=128.142.194.89; USER_DN=/DC=ch/DC=cern/OU=Organic
>> Units/OU=Users/CN=atlpilo2/CN=596434/CN=Robot: ATLAS Pilot2; USER_FQAN={
>> /atlas/Role=production/Capability=NULL;
>> /atlas/Role=NULL/Capability=NULL; /atlas/lcg1/Role=NULL/Capability=NULL;
>> }; CMD_NAME=JOB_START; CMD_CATEGORY=JOB_MANAGEMENT;
>> CMD_STATUS=PROCESSING; commandName=JOB_START;
>> cmdExecutorName=BLAHExecutor;
>> userId=_DC_ch_DC_cern_OU_Organic_Units_OU_Users_CN_atlpilo2_CN_596434_CN_Robot__ATLAS_Pilot2_atlas_Role_production_Capability_NULL;
>>
>> jobId=CREAM745296672; status=PROCESSING;
>> 19 Feb 2012 11:47:42,942 INFO
>> org.glite.ce.creamapi.jobmanagement.cmdexecutor.AbstractJobExecutor
>> (AbstractJobExecutor.java:2163) - (Worker Thread 3) JOB CREAM745296672
>> STATUS CHANGED: REGISTERED => PENDING [localUser=prdatlas002]
>> [delegationId=1328717924.913405]
>> 19 Feb 2012 11:51:27,522 INFO
>> org.glite.ce.creamapi.jobmanagement.cmdexecutor.AbstractJobExecutor
>> (AbstractJobExecutor.java:2163) - (Worker Thread 3) JOB CREAM745296672
>> STATUS CHANGED: PENDING => ABORTED [failureReason=BLAH error: no jobId
>> in submission script's output (stdout:) (stderr: <blah> execute_cmd: 200
>> seconds timeout expired, killing child process.-) N/A (jobId =
>> CREAM745296672)] [localUser=prdatlas002]
>> [delegationId=1328717924.913405]
>> 19 Feb 2012 12:18:43,761 INFO
>> org.glite.ce.creamapi.jobmanagement.cmdexecutor.AbstractJobExecutor
>> (AbstractJobExecutor.java:883) - (Worker Thread 34)
>> REMOTE_REQUEST_ADDRESS=128.142.194.89; USER_DN=/DC=ch/DC=cern/OU=Organic
>> Units/OU=Users/CN=atlpilo2/CN=596434/CN=Robot: ATLAS Pilot2; USER_FQAN={
>> /atlas/Role=production/Capability=NULL;
>> /atlas/Role=NULL/Capability=NULL; /atlas/lcg1/Role=NULL/Capability=NULL;
>> }; CMD_NAME=JOB_PURGE; CMD_CATEGORY=JOB_MANAGEMENT;
>> CMD_STATUS=PROCESSING; commandName=JOB_PURGE;
>> cmdExecutorName=BLAHExecutor;
>> userId=_DC_ch_DC_cern_OU_Organic_Units_OU_Users_CN_atlpilo2_CN_596434_CN_Robot__ATLAS_Pilot2_atlas_Role_production_Capability_NULL;
>>
>> jobId=CREAM745296672; status=PROCESSING;
>> 19 Feb 2012 12:18:48,158 INFO
>> org.glite.ce.cream.jobmanagement.db.table.JobTable (JobTable.java:162) -
>> (Worker Thread 34) Job deleted. JobId = CREAM745296672
>> 19 Feb 2012 12:18:48,177 INFO
>> org.glite.ce.creamapi.jobmanagement.cmdexecutor.AbstractJobExecutor
>> (AbstractJobExecutor.java:1946) - (Worker Thread 34) purge: purged job
>> CREAM745296672
>>
>>
>>
>>
>>
>>
>>
>> On 22.02.2012 19:15, Leslie Groer wrote:
>>> Hi Gila and others, thanks for the replies. I should mention that
>>> these are NOT cancelled jobs on our site. Most of the ones I checked
>>> today are pilots that did not find payloads, but I do see at least one
>>> successful user analysis job that consumed 27 minutes of walltime and
>>> 1 minute of CPU and still reported the "Unable to copy file" error:
>>>
>>> From the torque server accounting logs:
>>>
>>> 20120222:02/22/2012 11:10:21;Q;10312027.gpc-sched;queue=atlas
>>> 20120222:02/22/2012 11:10:21;Q;10312027.gpc-sched;queue=serial
>>> 20120222:02/22/2012 11:10:39;S;10312027.gpc-sched;user=atlaspt1
>>> group=atlaspt jobname=craaa_563417814 queue=serial ctime=1329927021
>>> qtime=1329927021 etime=1329927021 start=1329927039
>>> owner=atlaspt1@lcg-admin3 exec_host=gpc-f113n066/6
>>> Resource_List.os=centos53computeHEP Resource_List.walltime=35:03:00
>>> 20120222:02/22/2012 11:37:51;E;10312027.gpc-sched;user=atlaspt1
>>> group=atlaspt jobname=craaa_563417814 queue=serial ctime=1329927021
>>> qtime=1329927021 etime=1329927021 start=1329927039
>>> owner=atlaspt1@lcg-admin3 exec_host=gpc-f113n066/6
>>> Resource_List.os=centos53computeHEP Resource_List.walltime=35:03:00
>>> session=28206 end=1329928671 Exit_status=0 resources_used.cput=00:01:19
>>> resources_used.mem=307956kb resources_used.vmem=1046852kb
>>> resources_used.walltime=00:27:12
>>>
>>> This job maps to PandaID
>>> http://panda.cern.ch/server/pandamon/query?job=1432922395
>>>
>>> which shows success, but I still get a PBS JOB email with the
>>> following:
>>>
>>> PBS Job Id: 10312027.gpc-sched
>>> Job Name: craaa_563417814
>>> Exec host: gpc-f113n066/6
>>> An error has occurred processing your job, see below.
>>> Post job file processing error; job 10312027.gpc-sched on host
>>> gpc-f113n066/6
>>>
>>> Unable to copy file err_craaa_563417814_StandardOutput
>>> [log in to unmask]:/opt/glite/var/cream_sandbox/atlaspt/_C_CA_O_Grid_OU_triumf_ca_CN_Asoka_De_Silva_GC1_atlas_Role_pilot_Capability_NULL_atlaspt1/56/CREAM563417814/StandardError
>>>
>>>
>>> *** error from copy
>>> scp:/opt/glite/var/cream_sandbox/atlaspt/_C_CA_O_Grid_OU_triumf_ca_CN_Asoka_De_Silva_GC1_atlas_Role_pilot_Capability_NULL_atlaspt1/56/CREAM563417814/StandardError:
>>>
>>> No such file or directory
>>> *** end error output
>>>
>>> Unable to copy file out_craaa_563417814_StandardOutput
>>> [log in to unmask]:/opt/glite/var/cream_sandbox/atlaspt/_C_CA_O_Grid_OU_triumf_ca_CN_Asoka_De_Silva_GC1_atlas_Role_pilot_Capability_NULL_atlaspt1/56/CREAM563417814/StandardOutput
>>>
>>>
>>> *** error from copy
>>> scp:/opt/glite/var/cream_sandbox/atlaspt/_C_CA_O_Grid_OU_triumf_ca_CN_Asoka_De_Silva_GC1_atlas_Role_pilot_Capability_NULL_atlaspt1/56/CREAM563417814/StandardOutput:
>>>
>>> No such file or directory
>>> *** end error output
>>>
>>>
>>> We do not experience any of the Q->W flapping others talk about. We
>>> did set
>>>
>>> CLASSCFG[atlas] CancelFailedJobs=TRUE
>>>
>>> in our MOAB configuration, so that if a job fails it just burns a
>>> pilot and is kicked out of the queue rather than being rescheduled (we
>>> were sometimes seeing flapping with jobs that never got scheduled to a
>>> node, but that had a different pathology).
>>>
>>> Leslie
>>>
>>>
>>> Gila Arrondo Miguel Angel wrote:
>>>> Hi Leslie,
>>>>
>>>> With a very similar setup (Moab/Torque on one host and the CREAM-CE
>>>> on another) we've seen an error somewhat similar to what you describe
>>>> here. In our /var/log/messages we find tons of entries like this:
>>>>
>>>> Feb 22 09:10:24 wn202 pbs_mom: LOG_ERROR::sys_copy, command
>>>> '/usr/bin/scp -rpB
>>>> [log in to unmask]:/cream_localsandbox/data/atlas/_DC_ch_DC_cern_OU_Organic_Units_OU_Users_CN_atlpilo1_CN_614260_CN_Robot__ATLAS_Pilot1_atlas_Role_pilot_Capability_NULL_atlasplt/78/CREAM780422084/StandardError'
>>>>
>>>> failed with status=1, giving up after 4 attempts
>>>>
>>>> Feb 22 09:10:24 wn202 pbs_mom: LOG_ERROR::req_cpyfile, Unable to copy
>>>> file err_cre02_780422084_StandardError
>>>> [log in to unmask]:/cream_localsandbox/data/atlas/_DC_ch_DC_cern_OU_Organic_Units_OU_Users_CN_atlpilo1_CN_614260_CN_Robot__ATLAS_Pilot1_atlas_Role_pilot_Capability_NULL_atlasplt/78/CREAM780422084/StandardError
>>>>
>>>>
>>>>
>>>> Feb 22 09:10:29 wn202 pbs_mom: LOG_ERROR::sys_copy, command
>>>> '/usr/bin/scp -rpB
>>>> [log in to unmask]:/cream_localsandbox/data/atlas/_DC_ch_DC_cern_OU_Organic_Units_OU_Users_CN_atlpilo1_CN_614260_CN_Robot__ATLAS_Pilot1_atlas_Role_pilot_Capability_NULL_atlasplt/78/CREAM780422084/StandardOutput'
>>>>
>>>> failed with status=1, giving up after 4 attempts
>>>>
>>>> Have you checked whether these failed transfers are from cancelled
>>>> jobs? In our experience, it was always the case.
>>>>
>>>> We've also looked at ways to mitigate this annoying verbosity, but
>>>> no luck so far. The only option we can think of is to stop using scp
>>>> for the copies and move the sandbox to an area shared with the WNs,
>>>> so that pbs_mom uses regular cp (the $usecp directive) and these
>>>> errors are avoided. But, of course, this approach has its own
>>>> disadvantages as well.
>>>>
>>>> Does anyone else have a better idea?
>>>>
>>>> Cheers,
>>>> Miguel
>>>>
>>>>
>>>>
>>>> --
>>>> Miguel Gila
>>>> CSCS Swiss National Supercomputing Centre
>>>> HPC Solutions
>>>> Via Cantonale, Galleria 2 | CH-6928 Manno | Switzerland
>>>> Miguel.gila [at] cscs.ch
>>>>
>>>> From: Leslie
>>>> Groer<[log in to unmask]<mailto:[log in to unmask]>>
>>>> Reply-To: LHC Computer Grid -
>>>> Rollout<[log in to unmask]<mailto:[log in to unmask]>>
>>>> Date: Thu, 16 Feb 2012 14:43:56 -0500
>>>> To:<[log in to unmask]<mailto:[log in to unmask]>>
>>>> Subject: [LCG-ROLLOUT] Torque/CREAM race condition at job completion?
>>>>
>>>> scp:
>>>> /opt/glite/var/cream_sandbox/atlasprd/_C_CA_O_Grid_OU_triumf_ca_CN_Asoka_De_Silva_GC1_atlas_Role_production_Capability_NULL_prdatl15/61/CREAM619112035/StandardOutput:
>>>>
>>>> No such file or directory
>>>>
>>>
>
>