Hi all,
as one can see on the given page, the vast majority of reasons for the Held state
essentially amount to "fatal error - giving up". Using the Held state for spooling
files is a design mistake IMO, but that usage should normally be short-lived in any
case, certainly for well-behaved (!) grid jobs that do not have huge input or output
sandboxes to be transferred. That makes it quite unlikely for any grid job to be
caught in the Held state while spooling input files.
Following the argument that only HoldReasonCode == 16 should be counted as waiting,
the ARC CE info provider still appears to have a bug, because in JM's case it also
counted jobs that were killed by the OOM killer.
What criteria does the ARC CE info provider apply to obtain the number of waiting
jobs for the GLUE and NorduGrid schemas?
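For illustration, the classification proposed further down in this thread could look
like the following minimal sketch (Python, with made-up job ads; JobStatus 5 = Held,
and HoldReasonCode 16 means the input files are being spooled, per the manual page
David linked below):

```python
# Sketch: count a job as "waiting" if it is Idle, or if it is Held only
# because its input files are still being spooled (HoldReasonCode == 16).
# Held jobs with any other reason code (e.g. 34, memory limit exceeded)
# are NOT waiting. The job ads below are made up for illustration.

IDLE, HELD = 1, 5
SPOOLING_INPUT = 16  # HoldReasonCode: "Input files are being spooled"

def is_waiting(ad):
    if ad.get("JobStatus") == IDLE:
        return True
    if ad.get("JobStatus") == HELD:
        return ad.get("HoldReasonCode") == SPOOLING_INPUT
    return False

jobs = [
    {"JobStatus": 1},                        # idle: waiting
    {"JobStatus": 5, "HoldReasonCode": 16},  # spooling input: waiting
    {"JobStatus": 5, "HoldReasonCode": 34},  # OOM/memory limit: not waiting
]
print(sum(is_waiting(ad) for ad in jobs))
```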
On 09/07/18 12:06, David Rebatto wrote:
> Hi,
> David is right, held is also a transient state between idle and running, while the
> necessary files are staged to the execution machine.
> Still, it is pretty easy to tell whether to ignore the job or classify it as waiting, just
> by looking at the HoldReasonCode:
>
> http://research.cs.wisc.edu/htcondor/manual/v8.7/JobClassAdAttributes.html#dx170-1249191
>
> At a first glance, I'd suggest to disregard all held jobs but the ones with HoldReasonCode
> == 16.
>
> Cheers,
> David
>
>
> On 07/09/2018 11:41, Maarten Litmaath wrote:
>> Hi David, all,
>> AFAIK the Held state is there to allow the _user_ (or a service admin)
>> to decide what to do with the job, because HTCondor encountered a
>> hard error for it and cannot solve the matter on its own.
>>
>> As far as grid jobs are concerned, the Held state is useless in practice
>> and jobs in that state should not be counted as waiting, but instead sit
>> in another category that is dealt with separately. In practice, held jobs
>> typically are purged fairly quickly, potentially leaving a buffer covering
>> a number of hours / days to help in debugging, should that be needed.
>>
>> That is how things work e.g. on ALICE VOBOXes submitting to HTCondor.
>>
>> ________________________________________
>> From: David Cameron
>> Sent: 07 September 2018 09:38
>> To: Maarten Litmaath; wlcg-arc-ce-discuss (ML for discussion around arc-ce usage and
>> issues in WLCG)
>> Cc: LHC Computer Grid - Rollout; Jean-Michel Barbet
>> Subject: Re: [LCG-ROLLOUT] HT-Condor and held jobs because out-of-memory
>>
>> Hi Maarten,
>>
>> I'm no condor expert, but as far as I know jobs can go into Held state
>> for many reasons, such as waiting for input files to be spooled, in
>> which case you would want to count them as queued. If everyone is fine with
>> ARC ignoring all Held jobs then it's trivial to fix the code, but maybe
>> it's worth consulting a condor expert first (I'm sure there are plenty
>> on these lists!).
>>
>> Cheers,
>> David
>>
>>
>> On 06/09/18 15:07, Maarten Litmaath wrote:
>>> Hi all,
>>> though that is an easy workaround, the ARC CE info provider still has a bug to be fixed.
>>>
>>> ________________________________________
>>> From: LHC Computer Grid - Rollout [[log in to unmask]] on behalf of Jean-Michel
>>> Barbet [[log in to unmask]]
>>> Sent: 06 September 2018 10:02
>>> To: [log in to unmask]
>>> Subject: Re: [LCG-ROLLOUT] HT-Condor and held jobs because out-of-memory
>>>
>>> On 09/05/2018 03:08 PM, Max Fischer wrote:
>>>> Hi Maarten, Jean-Michel,
>>>>
>>>> for reference, we let the schedds on the ARC-CEs do the cleanup:
>>>>
>>>> # /etc/condor/config.d/schedd.cfg
>>>> JOBSTATE_HELD = 5
>>>> SCHEDD.SYSTEM_PERIODIC_REMOVE = ((JobStatus == $(JOBSTATE_HELD)) && (time() -
>>>> EnteredCurrentStatus > 2 * $(DAY)))
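For illustration, the effect of that periodic-remove expression can be sketched in
Python; the 2-day threshold and the attribute semantics follow the config quoted
above, while the sample timestamps are made up:

```python
import time

DAY = 86400        # seconds, mirroring condor's $(DAY) macro
JOBSTATE_HELD = 5  # JobStatus value for Held, as in the config above

def should_remove(job_status, entered_current_status, now=None):
    """Mirror of the SYSTEM_PERIODIC_REMOVE expression above:
    remove a job that has been sitting in the Held state for more
    than 2 days."""
    now = time.time() if now is None else now
    return job_status == JOBSTATE_HELD and (now - entered_current_status) > 2 * DAY

# Made-up example: a job held for 3 days would be removed,
# one held for only an hour would be kept.
now = 1_000_000_000
print(should_remove(5, now - 3 * DAY, now))
print(should_remove(5, now - 3600, now))
```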
>>> Hi Max,
>>>
>>> I like your solution. I was monitoring the rate at which held jobs
>>> appear in our cluster and it is relatively frequent. Depending on the
>>> workload, I can have ~10/hour for ~400 cores. I think I am simply going
>>> to put an hourly cron job with something like:
>>>
>>> condor_rm -constraint "JobStatus == 5"
>>>
>>> JM
>>>
>>> --
>>> ------------------------------------------------------------------------
>>> Jean-michel BARBET | Tel: +33 (0)2 51 85 84 86
>>> Laboratoire SUBATECH Nantes France | Fax: +33 (0)2 51 85 84 79
>>> CNRS-IN2P3/IMT-Atlantique/Univ.Nantes | E-Mail: [log in to unmask]
>>> ------------------------------------------------------------------------
>>>
########################################################################
To unsubscribe from the LCG-ROLLOUT list, click the following link:
https://www.jiscmail.ac.uk/cgi-bin/webadmin?SUBED1=LCG-ROLLOUT&A=1