LCG-ROLLOUT Archives
LCG-ROLLOUT@JISCMAIL.AC.UK

Subject: Re: HT-Condor and held jobs because out-of-memory
From: Maarten Litmaath <[log in to unmask]>
Reply-To: LHC Computer Grid - Rollout <[log in to unmask]>
Date: Mon, 15 Oct 2018 20:16:19 +0000
Content-Type: text/plain
Parts/Attachments: text/plain (423 lines)

Hi Florido, all,
the LHC experiments and now also other VOs submit pilot jobs
whose outputs usually should not matter and which therefore
should not need to be tracked individually.

Certainly in ALICE the motto is: fire and forget.

In ALICE we typically switch off the output sandbox of such jobs
to help lower the pressure on the file system holding the sandboxes
and in particular to avoid it filling up (which has happened, unfortunately).

For ALICE, then, only two quantities matter:
1. how many jobs are supposedly running (to be compared with MonALISA);
2. how many jobs are waiting with a good chance of running.

For each site there are caps defined for the sum and for case 2 alone.

Those numbers can very conveniently and cheaply be obtained from
a resource or site BDII, provided the BDII publishes _sensible_ values
for the relevant attributes.  In the CREAM CE that works very nicely.
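
For illustration, a query along these lines returns those per-share counts
(host name hypothetical; the GLUE 2 LDAP rendering on the usual BDII port
2170 is assumed):

  ldapsearch -x -LLL -H ldap://ce.example.org:2170 -b o=glue \
      '(objectClass=GLUE2ComputingShare)' \
      GLUE2ComputingShareRunningJobs GLUE2ComputingShareWaitingJobs \
      GLUE2ComputingShareStagingJobs GLUE2ComputingShareSuspendedJobs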

We do not want fatally held jobs to muddy the waters here.

We should have defined a GLUE 2 attribute for those, but as we did not,
they need to be ignored altogether.

HoldReasonCode == 16 may be tricky to reproduce, as it should be hard
to catch a reasonable job in such a state, viz. staging its input files;
maybe submit a test job with a few GB of input data and run condor_q
repeatedly to catch it staging that data?
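
A rough sketch of such a test (file names hypothetical; spooling the input
with "condor_submit -spool" is what should briefly put the job on hold with
HoldReasonCode == 16, as discussed further down this thread):

  # a few GB of dummy input
  dd if=/dev/zero of=bigfile.dat bs=1M count=4096

  # spooltest.sub -- minimal submit description
  executable              = /bin/sleep
  arguments               = 600
  should_transfer_files   = YES
  when_to_transfer_output = ON_EXIT
  transfer_input_files    = bigfile.dat
  queue

  # submit with spooling, then poll for the held/staging phase
  condor_submit -spool spooltest.sub
  condor_q -constraint 'JobStatus == 5' -af ClusterId HoldReasonCode HoldReason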


________________________________________
From: Florido Paganelli [[log in to unmask]]
Sent: 15 October 2018 10:47
To: Maarten Litmaath; David Cameron; James Frey
Cc: wlcg-arc-ce-discuss (ML for discussion around arc-ce usage and issues in WLCG); LHC Computer Grid - Rollout; Jean-Michel Barbet
Subject: Re: [LCG-ROLLOUT] HT-Condor and held jobs because out-of-memory

Hi Maarten, all,

On 2018-10-13 19:32, Maarten Litmaath wrote:
 > Hi Florido, all,
 > reading the ticket we can appreciate that fixing this issue is non-trivial,
 > as ARC needs to provide workable numbers not only in GLUE 2, but
 > also to ARC clients.  Rereading the relevant part of GLUE 2, I think
 > we should have foreseen one or two more states that jobs can be in,
 > so that we could have matched more easily not only the held state,
 > but also the completed state (which may also last quite a while).
 >
 > As far as GLUE 2 is concerned, we mostly care about these concepts:
 >
 > running jobs = jobs that currently occupy active job slots
 > waiting jobs = jobs that still have a real chance of running
 >
 > In terms of GLUE 2 attributes, I suppose clients would need to do this:
 >
 > running jobs = RunningJobs + StagingJobs
 > waiting jobs = WaitingJobs + SuspendedJobs + PreLRMSWaitingJobs
 >
 > By that logic we must _NOT_ have the fatally held jobs added to the
 > SuspendedJobs numbers, because that would bring us to a situation
 > very much like the one that prompted this whole discussion!
 >

I think submitting clients should check the job's Status field and not
just the aggregated values.
Clients that just collect the infosys data for statistics may care if
thousands of jobs are in the HOLD state.
What kind of clients are you talking about?

Well, we don't count SuspendedJobs as WaitingJobs in ARC clients at the
moment; they're in the Running state but not accounted as such in GLUE2,
so my suggestion suits the purpose for now. You can query detailed job
information with the ARC Client API if you're curious, and if you want it
in GLUE2 you can enable job information in LDAP (although it might kill
your LDAP server ;) ).
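
As a sketch, per-job status can then be read from LDAP roughly like this
(host name hypothetical; this assumes the classic NorduGrid rendering on
the default port 2135 and that nordugrid-job entries are being published):

  ldapsearch -x -LLL -H ldap://arc-ce.example.org:2135 \
      -b 'Mds-Vo-name=local,o=grid' \
      '(objectClass=nordugrid-job)' nordugrid-job-status

With the GLUE2 job information enabled, the equivalent would be the
GLUE2ComputingActivity entries and their GLUE2ComputingActivityState
attribute under the o=glue base.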

Is anyone else using this SuspendedJobs field? Looking at the GLUE2
definitions, it seems it was meant for pilot jobs.

I don't particularly like the trend of hiding LRMS information,
especially for accounting and performance evaluation purposes. These held
jobs waste CPU time in the LRMS scheduler and processing time on the
frontend. To me they are jobs in an error state and should be reported.
But I assume you don't really care about measuring the time you waste --
at least not via the infosys.

 > I suspect your only solution in GLUE 2 is to _ignore_ those jobs, OK?
 >

I personally think it is not OK to ignore them in GLUE2, but it's broken
and dying anyway, so I can change my patch to just discard them as you
suggest.

Mind that even if I don't show them in the infosys statistics, they will
still be processed by the grid manager, as they stay in the INLRMS state
forever until the sysadmin deletes them, so this is a waste of frontend
CPU. I don't think the Condor backend should do automatic cleanup,
because the sysadmin should have time to see what happened.
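
For that manual review, plain HTCondor commands are enough (a sketch,
nothing ARC-specific assumed):

  # list held jobs with the time and reason they were held
  condor_q -hold
  condor_q -constraint 'JobStatus == 5' -af ClusterId ProcId HoldReasonCode HoldReason

  # only after inspection, e.g.:
  # condor_rm <cluster>.<proc>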

I will discuss with the other developers what is best to do with those,
given that you suggest ignoring them completely.

 > Jobs with HoldReasonCode == 16 should be added to StagingJobs
 > and hence be seen as running, as already argued by David Rebatto.
 >
 >

Yes. I am no Condor expert and I am doing this in my spare time; I might
need help to understand what HoldReasonCode == 16 means and how to
trigger it so I can reproduce it.
Condor experts on the list, is there a way to reproduce this
HoldReasonCode == 16 situation? Thanks!

Cheers,
Florido

 > ________________________________________
 > From: Florido Paganelli [[log in to unmask]]
 > Sent: 12 October 2018 17:15
 > To: David Cameron; James Frey; Maarten Litmaath
 > Cc: wlcg-arc-ce-discuss (ML for discussion around arc-ce usage and
issues in WLCG); LHC Computer Grid - Rollout; Jean-Michel Barbet
 > Subject: Re: [LCG-ROLLOUT] HT-Condor and held jobs because out-of-memory
 >
 > Hi all,
 >
 > I'm working on fixing this issue for ARC.
 >
 > The quickest fix I could do to implement David Cameron's suggestion of
 > not counting is described and commented in this Bugzilla ticket:
 >
 > https://bugzilla.nordugrid.org/show_bug.cgi?id=3753
 >
 > The solution is that I will move these jobs to be counted as SUSPENDED.
 > This will avoid them being counted as Queued or Running, which I believe
 > is the goal we want to achieve as a consequence of this thread.
 >
 > However, many here said that these jobs should not be counted as
 > waiting, but I think you guys are wrong. You should re-read the
 > definitions of NorduGRID states and EMI-ES states and you will see that
 > HOLD definitely belongs to jobs in the waiting state, and that clients
 > have all the needed information to sort out things.
 >
 > A couple of comments to support my claim:
 >
 > 1) If the job is NOT waiting (i.e. queued) then what is it?
 > The NorduGRID/EMIES model tries to be LRMS-independent so I really would
 > like to avoid a special LRMS "HOLD" state for a job that is basically
 > pending/queued forever. To me it is exactly like a queued or re-queued
 > job with infinite waiting time.
 > Why do you think we need to differentiate?
 > Note that currently these jobs are in the INLRMS:O (Other) state.
 >    From the ARC infosys tech manual[1] page 38:
 >
 > - INLRMS:Q
 >     The job is queuing in the LRMS, waiting for a node, being put on hold,
 >     for some reason the job is in a ’pending state’ of the LRMS.
 >     internal state:  INLRMS
 >
 > - INLRMS:O
 >    Any other native LRMS state which can not be mapped to the above
 >    general states must be labeled as ’O’, meaning ”other”
 >    internal state:  INLRMS
 >
 > - INLRMS:S
 >    An already running job is in a suspended state.
 >    internal state:  INLRMS
 >
 > For historical and sound reasons jobs in the O state are counted as
 > RUNNING, because they are under LRMS control and ARC cannot override
 > the LRMS decisions, so it is safer to consider them running.
 >
 > I think the client has all the information it needs to take decisions;
 > I see no reason for faking the statistics. I see no bugs. But anyway,
 > let's remove them from waiting as suggested.
 >
 >   From GLUE2 definitions for counting jobs page 28-29 [2]:
 >
 > WaitingJobs
 >     The number of jobs which are currently
 >     waiting to start execution, submitted via any
 >     type of interface (local and Grid). Usually
 >     these will be queued in the underlying
 >     Computing Manager (i.e., a Local Resource
 >     Managment System or LRMS).
 >
 > SuspendedJobs:
 >     The number of jobs, submitted via any type of
 >     interface (local and Grid), which have started
 >     their execution, but are currently suspended
 >     (e.g., having been preempted by another job).
 >
 > Also here I see no bugs or contradictions. The good thing is that
 > we can definitely use Suspended to park these jobs in HOLD, I think.
 >
 > Anyway with the latest patch they will be counted as Suspended in all
 > renderings.
 >
 > 2) I can work on the HoldReasonCode == 16 suggestion to include these in
 > the queued jobs. However, is this worth implementing as such?
 > In other words, will it help to count them that way, or would you prefer
 > marking those as Suspended anyway?
 >
 > Cheers,
 > Florido
 >
 > [1] ARC LDAP Infosys technical manual
 >       http://www.nordugrid.org/documents/arc_infosys.pdf
 > [2] GLUE2 Specification, GFD.147
 >       https://www.ogf.org/documents/GFD.147.pdf
 >
 > On 2018-09-13 14:27, David Cameron wrote:
 >> Hi all,
 >>
 >> Just to conclude this thread, we will stop counting Held jobs as waiting
 >> in the next ARC release.
 >>
 >> Cheers,
 >> David
 >>
 >>
 >> On 12/09/18 22:54, Jaime Frey wrote:
 >>> Hi, Condor expert here.
 >>> The Held status can be triggered by the user, the admin, or the
 >>> HTCondor system itself. When it’s triggered by the system, it is
 >>> indeed usually “fatal error - giving up”.
 >>> I agree that using the Held status for spooling of input files is a
 >>> design mistake. It was a quick hack at the time, and fixing it since
 >>> hasn’t been worth the effort. This use of Held only occurs if the job
 >>> submitter wants to spool the job’s input files over the network
 >>> connection to the condor_schedd daemon. When using condor_submit, this
 >>> only happens when the -spool or -remote argument is used. If the ARC
 >>> CE isn’t using those arguments with condor_submit, then that case can
 >>> probably be ignored. (I’m assuming the ARC monitoring code isn’t
 >>> concerned about jobs submitted to Condor directly by third parties.)
 >>>
 >>> For all other causes of the Held status, it sounds like the jobs
 >>> shouldn’t be treated as waiting.
 >>>
 >>>    - Jaime
 >>>
 >>>> On Sep 7, 2018, at 6:46 AM, Maarten Litmaath
 >>>> <[log in to unmask]> wrote:
 >>>>
 >>>> Hi all,
 >>>> as one can see on the given page, the vast majority of reasons for
 >>>> the Held state
 >>>> essentially are "fatal error - giving up".  Using the Held state for
 >>>> spooling files
 >>>> is a design mistake IMO, but that usage should in any case be
 >>>> short-lived normally,
 >>>> certainly for well-behaved (!) grid jobs that do not have huge input
 >>>> or output
 >>>> sandboxes to be transferred.  That implies it is quite unlikely for
 >>>> any grid job to
 >>>> be caught in the Held state while spooling input files.
 >>>>
 >>>> Following the argument that only HoldReasonCode == 16 should be
 >>>> counted as waiting,
 >>>> the ARC CE info provider still appears to have a bug, because in JM's
 >>>> case it also
 >>>> counted jobs that were killed by the OOM killer.
 >>>>
 >>>> What criteria does the ARC CE info provider apply to obtain the
 >>>> number of waiting
 >>>> jobs for the GLUE and NorduGrid schemas?
 >>>>
 >>>>
 >>>> On 09/07/18 12:06, David Rebatto wrote:
 >>>>> Hi,
 >>>>> David is right, held is also a transient state between idle and
 >>>>> running, while the necessary files are staged to the execution machine.
 >>>>> Still, it is pretty easy to tell whether to ignore the job or
 >>>>> classify it as waiting, just by looking at the HoldReasonCode:
 >>>>>
 >>>>> http://research.cs.wisc.edu/htcondor/manual/v8.7/JobClassAdAttributes.html#dx170-1249191
 >>>>>
 >>>>> At first glance, I'd suggest disregarding all held jobs but the
 >>>>> ones with HoldReasonCode == 16.
 >>>>> Cheers,
 >>>>> David
 >>>>> On 07/09/2018 11:41, Maarten Litmaath wrote:
 >>>>>> Hi David, all,
 >>>>>> AFAIK the Held state is there to allow the _user_ (or a service admin)
 >>>>>> to decide what to do with the job, because HTCondor encountered a
 >>>>>> hard error for it and cannot solve the matter on its own.
 >>>>>>
 >>>>>> As far as grid jobs are concerned, the Held state is useless in practice
 >>>>>> and jobs in that state should not be counted as waiting, but instead sit
 >>>>>> in another category that is dealt with separately.  In practice, held jobs
 >>>>>> typically are purged fairly quickly, potentially leaving a buffer covering
 >>>>>> a number of hours / days to help in debugging, should that be needed.
 >>>>>>
 >>>>>> That is how things work e.g. on ALICE VOBOXes submitting to HTCondor.
 >>>>>>
 >>>>>> ________________________________________
 >>>>>> From: David Cameron
 >>>>>> Sent: 07 September 2018 09:38
 >>>>>> To: Maarten Litmaath; wlcg-arc-ce-discuss (ML for discussion around
 >>>>>> arc-ce usage and issues in WLCG)
 >>>>>> Cc: LHC Computer Grid - Rollout; Jean-Michel Barbet
 >>>>>> Subject: Re: [LCG-ROLLOUT] HT-Condor and held jobs because
 >>>>>> out-of-memory
 >>>>>>
 >>>>>> Hi Maarten,
 >>>>>>
 >>>>>> I'm no condor expert, but as far as I know jobs can go into Held state
 >>>>>> for many reasons, such as waiting for input files to be spooled, for
 >>>>>> which you would want to count them as queued. If everyone is fine with
 >>>>>> ARC ignoring all Held jobs then it's trivial to fix the code, but maybe
 >>>>>> it's worth consulting a condor expert first (I'm sure there are plenty
 >>>>>> on these lists!).
 >>>>>>
 >>>>>> Cheers,
 >>>>>> David
 >>>>>>
 >>>>>>
 >>>>>> On 06/09/18 15:07, Maarten Litmaath wrote:
 >>>>>>> Hi all,
 >>>>>>> though that is an easy workaround, the ARC CE info provider still
 >>>>>>> has a bug to be fixed.
 >>>>>>>
 >>>>>>> ________________________________________
 >>>>>>> From: LHC Computer Grid - Rollout [[log in to unmask]] on
 >>>>>>> behalf of Jean-Michel Barbet [[log in to unmask]]
 >>>>>>> Sent: 06 September 2018 10:02
 >>>>>>> To: [log in to unmask]
 >>>>>>> Subject: Re: [LCG-ROLLOUT] HT-Condor and held jobs because
 >>>>>>> out-of-memory
 >>>>>>>
 >>>>>>> On 09/05/2018 03:08 PM, Max Fischer wrote:
 >>>>>>>> Hi Maarten, Jean-Michel,
 >>>>>>>>
 >>>>>>>> for reference, we let the schedds on the ARC-CEs do the cleanup:
 >>>>>>>>
 >>>>>>>> # /etc/condor/config.d/schedd.cfg
 >>>>>>>> JOBSTATE_HELD = 5
 >>>>>>>> SCHEDD.SYSTEM_PERIODIC_REMOVE = ((JobStatus == $(JOBSTATE_HELD))
 >>>>>>>> && (time() - EnteredCurrentStatus > 2 * $(DAY)))
 >>>>>>> Hi Max,
 >>>>>>>
 >>>>>>> I like your solution. I was monitoring the rate at which the held
 >>>>>>> jobs appear in our cluster and it is relatively frequent. Depending
 >>>>>>> on the workload, I can have ~10/hour for ~400 cores. I think I am
 >>>>>>> simply going to set up an hourly cron with something like:
 >>>>>>>
 >>>>>>> condor_rm -constraint "JobStatus == 5"
 >>>>>>>
 >>>>>>> JM
 >>>>>>>
 >>>>>>> --
 >>>>>>>
------------------------------------------------------------------------
 >>>>>>>
 >>>>>>> Jean-michel BARBET                    | Tel: +33 (0)2 51 85 84 86
 >>>>>>> Laboratoire SUBATECH Nantes France    | Fax: +33 (0)2 51 85 84 79
 >>>>>>> CNRS-IN2P3/IMT-Atlantique/Univ.Nantes | E-Mail:
 >>>>>>> [log in to unmask]
 >>>>>>>
------------------------------------------------------------------------
 >>>>>>>

--
==================================================
  Florido Paganelli
    ARC Middleware Developer - NorduGrid Collaboration
    System Administrator
  Lund University
  Department of Physics
  Division of Particle Physics
  BOX118
  221 00 Lund
  Office Location: Fysikum, Hus A, Rum A403
  Office Tel: 046-2220272
  Email: [log in to unmask]
  Homepage: http://www.hep.lu.se/staff/paganelli
==================================================

