Hi David, all,
AFAIK the Held state is there to allow the _user_ (or a service admin)
to decide what to do with the job, because HTCondor encountered a
hard error for it and cannot solve the matter on its own.
As far as grid jobs are concerned, the Held state is useless in practice:
jobs in that state should not be counted as waiting, but should instead sit
in a separate category that is dealt with on its own. Held jobs are
typically purged fairly quickly, possibly after a grace period of some
hours or days to help with debugging, should that be needed.
That is how things work e.g. on ALICE VOBOXes submitting to HTCondor.
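
For illustration only, here is a rough Python sketch of that bookkeeping. It is
not the ALICE or ARC code. The function name, the dict layout and the 2-day
default are assumptions made for the example; the job dicts are assumed to
carry the usual ClassAd attributes, e.g. as parsed from "condor_q -json".

```python
from collections import Counter

# HTCondor JobStatus codes used here: 1 = Idle, 2 = Running, 5 = Held.
IDLE, RUNNING, HELD = 1, 2, 5


def summarize(jobs, now, purge_after=2 * 86400):
    """Tally jobs per category and list held jobs older than purge_after seconds.

    Hypothetical helper: `jobs` is a list of dicts with ClusterId, ProcId,
    JobStatus and EnteredCurrentStatus keys; `now` is a Unix timestamp.
    """
    counts = Counter(waiting=0, running=0, held=0, other=0)
    to_purge = []
    for job in jobs:
        status = job["JobStatus"]
        if status == IDLE:
            counts["waiting"] += 1
        elif status == RUNNING:
            counts["running"] += 1
        elif status == HELD:
            # Held jobs get their own category and never count as waiting.
            counts["held"] += 1
            if now - job["EnteredCurrentStatus"] > purge_after:
                to_purge.append((job["ClusterId"], job["ProcId"]))
        else:
            counts["other"] += 1
    return counts, to_purge
```

The returned list only names the candidates; the actual removal would still be
done by the schedd or by condor_rm.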
________________________________________
From: David Cameron
Sent: 07 September 2018 09:38
To: Maarten Litmaath; wlcg-arc-ce-discuss (ML for discussion around arc-ce usage and issues in WLCG)
Cc: LHC Computer Grid - Rollout; Jean-Michel Barbet
Subject: Re: [LCG-ROLLOUT] HT-Condor and held jobs because out-of-memory
Hi Maarten,
I'm no condor expert, but as far as I know jobs can go into Held state
for many reasons, such as waiting for input files to be spooled, in
which case you would want to count them as queued. If everyone is fine with
ARC ignoring all Held jobs then it's trivial to fix the code, but maybe
it's worth consulting a condor expert first (I'm sure there are plenty
on these lists!).
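
To make the distinction concrete, a minimal Python sketch of the alternative
to ignoring every Held job. It assumes that HoldReasonCode 16 is the code
HTCondor uses for "input files are being spooled", which should be checked
against the manual for the version in use. The function name is hypothetical
and this is not the ARC info provider code.

```python
# HTCondor JobStatus codes: 1 = Idle, 5 = Held.
IDLE, HELD = 1, 5

# Assumption: HoldReasonCode 16 means the job is held while its
# input files are being spooled. Verify for your HTCondor version.
SPOOLING_INPUT = 16


def counts_as_queued(job):
    """Return True if a job (a dict of ClassAd attributes) should count as queued."""
    status = job.get("JobStatus")
    if status == IDLE:
        return True
    # Only held jobs that are still spooling input are treated as queued;
    # every other held job is left out of the queued count.
    return status == HELD and job.get("HoldReasonCode") == SPOOLING_INPUT
```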
Cheers,
David
On 06/09/18 15:07, Maarten Litmaath wrote:
> Hi all,
> though that is an easy workaround, the ARC CE info provider still has a bug to be fixed.
>
> ________________________________________
> From: LHC Computer Grid - Rollout on behalf of Jean-Michel Barbet
> Sent: 06 September 2018 10:02
> To: [log in to unmask]
> Subject: Re: [LCG-ROLLOUT] HT-Condor and held jobs because out-of-memory
>
> On 09/05/2018 03:08 PM, Max Fischer wrote:
>> Hi Maarten, Jean-Michel,
>>
>> for reference, we let the schedds on the ARC-CEs do the cleanup:
>>
>> # /etc/condor/config.d/schedd.cfg
>> JOBSTATE_HELD = 5
>> SCHEDD.SYSTEM_PERIODIC_REMOVE = ((JobStatus == $(JOBSTATE_HELD)) && (time() - EnteredCurrentStatus > 2 * $(DAY)))
> Hi Max,
>
> I like your solution. I was monitoring the rate at which held jobs
> appear in our cluster and it is relatively frequent. Depending on the
> workload, I can have ~10/hour for ~400 cores. I think I am simply going
> to put in an hourly cron job with something like:
>
> condor_rm -constraint "JobStatus == 5"
>
> JM
>
> --
> ------------------------------------------------------------------------
> Jean-michel BARBET | Tel: +33 (0)2 51 85 84 86
> Laboratoire SUBATECH Nantes France | Fax: +33 (0)2 51 85 84 79
> CNRS-IN2P3/IMT-Atlantique/Univ.Nantes | E-Mail: [log in to unmask]
> ------------------------------------------------------------------------
>
> ########################################################################
>
> To unsubscribe from the LCG-ROLLOUT list, click the following link:
> https://www.jiscmail.ac.uk/cgi-bin/webadmin?SUBED1=LCG-ROLLOUT&A=1
>