Elena has asked me to copy this email into TB_SUPPORT.
Please see below.
The fix here at Liverpool was to run the following on our ARC/CONDOR system:
/bin/sed -i -e 's/JobCpuLimit \= \$maxcputime/JobCpuLimit \= \$joboption_cputime/' /usr/share/arc/submit-condor-job
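A quick way to confirm the edit took effect (just a sketch; the exact layout of that line may vary between ARC releases):

# The JobCpuLimit assignment should now reference $joboption_cputime, not $maxcputime.
grep -n 'JobCpuLimit' /usr/share/arc/submit-condor-job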
Cheers,
Steve
-------- Forwarded Message --------
Subject: Re: mcore jobs with lost heartbeat at RAL
Date: Mon, 2 Mar 2015 14:53:34 +0000
From: Alastair Dewhurst <[log in to unmask]>
To: Rodney Walker <[log in to unmask]>
CC: Alessandra Forti <[log in to unmask]>, Andrzej Olszewski
<[log in to unmask]>, atlas-support-cloud-uk (ATLAS support
contact for UK cloud) <[log in to unmask]>, Josh McFayden
<[log in to unmask]>, [log in to unmask]
<[log in to unmask]>, Robert Ball <[log in to unmask]>,
[log in to unmask] <[log in to unmask]>
Hi
I think we have figured out the problem.
In AGIS there is a maxtime of 345600 seconds (i.e. 4 days). When a job comes
into the ARC CE this is converted into both a wall-time limit and
a CPU-time limit. The CPU time should be <number of cores> * wall time,
and the pilot is correctly asking for this. What we have found is that
ARC then divides this CPU limit by the number of cores. We believe this
is just a bug and Andrew Lahiff has submitted a ticket:
http://bugzilla.nordugrid.org/show_bug.cgi?id=3452
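A rough back-of-the-envelope check (assuming 8-core MCORE slots; the core count is an assumption, not stated in this thread) shows how this matches the 11-12 h job lifetimes reported further down the thread:

# Assumed 8-core slots (an assumption, not taken from the thread).
WALL_LIMIT=345600                        # AGIS maxtime: 4 days in seconds
CORES=8
REQUESTED_CPU=$(( CORES * WALL_LIMIT ))  # what the pilot asks for: 2764800 s
BUGGY_CPU=$(( REQUESTED_CPU / CORES ))   # ARC divides it back down to 345600 s total
# A job keeping all cores busy burns ~CORES CPU-seconds per wall-clock second,
# so it hits the reduced limit after roughly:
echo "$(( BUGGY_CPU / CORES / 3600 )) hours"   # -> 12 hours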
We have made a manual change to remove this division, so it should work for NEW jobs.
Other sites using ARC + HTCondor will also be affected, so they may want
to apply the fix from the ticket as well.
Alastair
On 2 Mar 2015, at 10:51, Rodney Walker <[log in to unmask]> wrote:
> Hi,
> Maybe you could look at one of the running jobs now. The modification
> time (in UTC) should be less than 30 mins old, otherwise at least one
> update is missing.
> http://bigpanda.cern.ch/job?pandaid=2403474798
> 9:06, which is nearly 2 hrs old. Is this batch job still running?
>
> The APF thinks it is
> http://aipanda019.cern.ch/pilots/2015-03-01/RAL-LCG2_MCORE-7255/13922466.1.log
>
> although all that eviction and resubmission is bad.
>
> Cheers,
> Rod.
>
>
> On 2 March 2015 at 11:40, Alastair Dewhurst <[log in to unmask]> wrote:
>
> Hi
>
> That job exited normally according to Condor. I can send you the
> full log if you want, but there is nothing terribly interesting
> there. There have been a couple of incidents like this before. I
> can’t believe it is a network problem that just affects one task.
> I can have a think to see if we can come up with a reason (or a
> test) for the lost heartbeat errors, but for now I would assume
> that the cause is the same as for the jobs that are reporting errors.
>
> Alastair
>
>
>
>
> On 2 Mar 2015, at 09:23, Rodney Walker <[log in to unmask]> wrote:
>
>> Just hidden
>> http://bigpanda.cern.ch/job/2402892925/
>>
>> Cheers,
>> Rod.
>>
>> On 2 March 2015 at 10:22, Alastair Dewhurst <[log in to unmask]> wrote:
>>
>> Hi
>>
>> I clicked the link; the jobs are failing with:
>> Executable error 65: Non-zero return code from EVNTtoHITS
>> (65); Error in logfile: "04:37:11 *** G4Exception: Aborting
>> execution *** 04:37:11 04:37:11 04:37:11 File:
>> athenaMP-workers-EVNTtoHITS-sim/worker_7/AthenaMP.log
>> 04:37:11 04:37:11 AthMpEvtLoopMgr... INFO Logs redirected in
>> the AthenaMP event worker PID=5447 04:37:11
>> AthMpEvtLoopMgr... INFO Io registry updated in the AthenaMP
>> event worker PID=5447 04:37:11 AthMpEvtLoopMgr...WARNING The
>> file
>> /pool/condor/dir_21151/tfSMDm1O9llnc1XDjqYugZkqABFKDmABFKDmA4KKDmC
>>
>> Log files for a particular job can be found here:
>> http://bigpanda.cern.ch/filebrowser/?guid=ca8c6a18-f38a-4513-a013-9e497789966e&lfn=log.04948732._000034.job.log.tgz.1&site=RAL-LCG2_SL6&scope=mc12_5TeV
>>
>> This error should be handled by someone in the MC production
>> group. Did you send the wrong link initially? I can’t
>> find any lost heartbeat errors in this task.
>>
>> Alastair
>>
>>
>>
>>
>> On 2 Mar 2015, at 08:58, Alessandra Forti <[log in to unmask]> wrote:
>>
>>> Redirecting to the cloud squad.
>>>
>>> On 02/03/2015 08:55, Andrzej Olszewski wrote:
>>>> Hi Rod,
>>>>
>>>> there is a simulation production task
>>>> http://bigpanda.cern.ch/jobs/?jeditaskid=4948732 for the
>>>> HeavyIon group running at RAL-LCG2_MCORE, where most of the
>>>> jobs (all of those currently at attempt no. 3) are failing
>>>> with a lost heartbeat message and the logs are not available.
>>>> Jobs are running for about 11-12 h. Is there a timeout limit
>>>> somewhere of this order (like detection of a hanging job)? I
>>>> can imagine that these jobs may run some single event for
>>>> longer than 12 h and thus not write anything to the output.
>>>> There is a possibility that these jobs have been defined
>>>> with too many events to simulate, but in that case I would
>>>> expect the jobs to run much longer. Or can you advise on
>>>> any other reason?
>>>>
>>>> Best, Andrzej
>>>
>>> --
>>> Respect is a rational process. \\//
>>>
>>
>>
>>
>>
>> --
>> Tel. +49 89 289 14152
>
>
>
>
> --
> Tel. +49 89 289 14152