LCG-ROLLOUT Archives
LCG-ROLLOUT@JISCMAIL.AC.UK

Subject: Re: HT-Condor and held jobs because out-of-memory
From: Maarten Litmaath <[log in to unmask]>
Reply-To: LHC Computer Grid - Rollout <[log in to unmask]>
Date: Mon, 15 Oct 2018 20:16:19 +0000
Content-Type: text/plain
Parts/Attachments: text/plain (423 lines)

Hi Florido, all,
the LHC experiments and now also other VOs submit pilot jobs
whose outputs usually should not matter and which therefore
should not need to be tracked individually.

Certainly in ALICE the motto is: fire and forget.

In ALICE we typically switch off the output sandbox of such jobs
to help lower the pressure on the file system holding the sandboxes
and in particular to avoid it filling up (which has happened, unfortunately).

For ALICE, then, only two quantities matter:
1. how many jobs are supposedly running (to be compared with MonALISA);
2. how many jobs are waiting with a good chance of running.

For each site there are caps defined for the sum and for case 2 alone.

Those numbers can very conveniently and cheaply be obtained from
a resource or site BDII, provided the BDII publishes _sensible_ values
for the relevant attributes.  In the CREAM CE that works very nicely.
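
For illustration, a query along these lines returns those per-share counts
(host name hypothetical; the GLUE 2 LDAP rendering on the usual BDII port
2170 is assumed):

  ldapsearch -x -LLL -H ldap://ce.example.org:2170 -b o=glue \
      '(objectClass=GLUE2ComputingShare)' \
      GLUE2ComputingShareRunningJobs GLUE2ComputingShareWaitingJobs \
      GLUE2ComputingShareStagingJobs GLUE2ComputingShareSuspendedJobs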

We do not want fatally held jobs to muddy the waters here.

We should have defined a GLUE 2 attribute for those, but as we did not,
they need to be ignored altogether.

HoldReasonCode == 16 may be tricky to reproduce, as it should be hard
to catch a reasonable job in such a state, viz. staging its input files;
maybe submit a test job with a few GB of input data and run condor_q
repeatedly to catch it staging that data?
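
A rough sketch of such a test (file names hypothetical; spooling the input
with "condor_submit -spool" is what should briefly put the job on hold with
HoldReasonCode == 16, as discussed further down this thread):

  # a few GB of dummy input
  dd if=/dev/zero of=bigfile.dat bs=1M count=4096

  # spooltest.sub -- minimal submit description
  executable              = /bin/sleep
  arguments               = 600
  should_transfer_files   = YES
  when_to_transfer_output = ON_EXIT
  transfer_input_files    = bigfile.dat
  queue

  # submit with spooling, then poll for the held/staging phase
  condor_submit -spool spooltest.sub
  condor_q -constraint 'JobStatus == 5' -af ClusterId HoldReasonCode HoldReason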


________________________________________
From: Florido Paganelli [[log in to unmask]]
Sent: 15 October 2018 10:47
To: Maarten Litmaath; David Cameron; James Frey
Cc: wlcg-arc-ce-discuss (ML for discussion around arc-ce usage and issues in WLCG); LHC Computer Grid - Rollout; Jean-Michel Barbet
Subject: Re: [LCG-ROLLOUT] HT-Condor and held jobs because out-of-memory

Hi Maarten, all,

On 2018-10-13 19:32, Maarten Litmaath wrote:
 > Hi Florido, all,
 > reading the ticket we can appreciate that fixing this issue is non-trivial,
 > as ARC needs to provide workable numbers not only in GLUE 2, but
 > also to ARC clients.  Rereading the relevant part of GLUE 2, I think
 > we should have foreseen one or two more states that jobs can be in,
 > so that we could have matched more easily not only the held state,
 > but also the completed state (which may also last quite a while).
 >
 > As far as GLUE 2 is concerned, we mostly care about these concepts:
 >
 > running jobs = jobs that currently occupy active job slots
 > waiting jobs = jobs that still have a real chance of running
 >
 > In terms of GLUE 2 attributes, I suppose clients would need to do this:
 >
 > running jobs = RunningJobs + StagingJobs
 > waiting jobs = WaitingJobs + SuspendedJobs + PreLRMSWaitingJobs
 >
 > By that logic we must _NOT_ have the fatally held jobs added to the
 > SuspendedJobs numbers, because that would bring us to a situation
 > very much like the one that prompted this whole discussion!
 >

I think submitting clients should check the job's Status field and not
just the aggregated values.
Clients that just collect the infosys data for statistics may care if
thousands of jobs are in the HOLD state.
What kind of clients are you talking about?

Well, we don't count SuspendedJobs as WaitingJobs in ARC clients at the
moment; they're in the Running state but not accounted as such in GLUE2,
so my suggestion suits the purpose for now. You can query detailed job
information with the ARC Client API if you're curious, and if you want it
in GLUE2 you can enable job information in LDAP (although it might kill
your LDAP server ;) ).
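
As a sketch, per-job status can then be read from LDAP roughly like this
(host name hypothetical; this assumes the classic NorduGrid rendering on
the default port 2135 and that nordugrid-job entries are being published):

  ldapsearch -x -LLL -H ldap://arc-ce.example.org:2135 \
      -b 'Mds-Vo-name=local,o=grid' \
      '(objectClass=nordugrid-job)' nordugrid-job-status

With the GLUE2 job information enabled, the equivalent would be the
GLUE2ComputingActivity entries and their GLUE2ComputingActivityState
attribute under the o=glue base.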

Is anyone else using this SuspendedJobs field? Looking at the GLUE2
definitions, it seems it was meant for pilot jobs.

I don't particularly like the trend of hiding LRMS information,
especially for accounting and performance evaluation purposes. These held
jobs waste CPU time in the LRMS scheduler and processing time on the
frontend. To me they are jobs in an error state and should be reported.
But I assume you don't really care about measuring the time you waste --
at least not via the infosys.

 > I suspect your only solution in GLUE 2 is to _ignore_ those jobs, OK?
 >

I personally think it is not OK to ignore them in GLUE2, but it's broken
and dying anyway, so I can change my patch to just discard them as you
suggest.

Mind that even if I don't show them in the infosys statistics, they will
still be processed by the grid manager, as they stay in the INLRMS state
forever until the sysadmin deletes them, so this is a waste of frontend
CPU. I don't think the Condor backend should do automatic cleanup,
because the sysadmin should have time to see what happened.
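
For that manual review, plain HTCondor commands are enough (a sketch,
nothing ARC-specific assumed):

  # list held jobs with the time and reason they were held
  condor_q -hold
  condor_q -constraint 'JobStatus == 5' -af ClusterId ProcId HoldReasonCode HoldReason

  # only after inspection, e.g.:
  # condor_rm <cluster>.<proc>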

I will discuss with the other developers what is best to do with those,
given that you suggest ignoring them completely.

 > Jobs with HoldReasonCode == 16 should be added to StagingJobs
 > and hence be seen as running, as already argued by David Rebatto.
 >
 >

Yes. I am no Condor expert and I am doing this in my spare time; I might
need help to understand what HoldReasonCode == 16 means and how to
trigger it so I can reproduce it.
Condor experts on the list, is there a way to reproduce this
HoldReasonCode == 16 situation? Thanks!

Cheers,
Florido

 > ________________________________________
 > From: Florido Paganelli [[log in to unmask]]
 > Sent: 12 October 2018 17:15
 > To: David Cameron; James Frey; Maarten Litmaath
 > Cc: wlcg-arc-ce-discuss (ML for discussion around arc-ce usage and
issues in WLCG); LHC Computer Grid - Rollout; Jean-Michel Barbet
 > Subject: Re: [LCG-ROLLOUT] HT-Condor and held jobs because out-of-memory
 >
 > Hi all,
 >
 > I'm working on fixing this issue for ARC.
 >
 > The quickest fix I could do to implement David Cameron's suggestion of
 > not counting is described and commented in this Bugzilla ticket:
 >
 > https://bugzilla.nordugrid.org/show_bug.cgi?id=3753
 >
 > The solution is that I will move these jobs to be counted as SUSPENDED.
 > This will avoid them being counted as Queued or Running, which I believe
 > is the goal we want to achieve as a consequence of this thread.
 >
 > However, many here said that these jobs should not be counted as
 > waiting, but I think you guys are wrong. You should re-read the
 > definitions of NorduGRID states and EMI-ES states and you will see that
 > HOLD definitely belongs to jobs in the waiting state, and that clients
 > have all the needed information to sort out things.
 >
 > A couple of comments to support my claim:
 >
 > 1) If the job is NOT waiting (i.e. queued) then what is it?
 > The NorduGRID/EMIES model tries to be LRMS-independent so I really would
 > like to avoid a special LRMS "HOLD" state for a job that is basically
 > pending/queued forever. To me it is exactly like a queued or re-queued
 > job with infinite waiting time.
 > Why do you think we need to differentiate?
 > Note that currently these jobs are in the INLRMS:O (Other) state.
 >    From the ARC infosys tech manual[1] page 38:
 >
 > - INLRMS:Q
 >     The job is queuing in the LRMS, waiting for a node, being put on hold,
 >     for some reason the job is in a ’pending state’ of the LRMS.
 >     internal state:  INLRMS
 >
 > - INLRMS:O
 >    Any other native LRMS state which can not be mapped to the above
 >    general states must be labeled as ’O’, meaning ”other”
 >    internal state:  INLRMS
 >
 > - INLRMS:S
 >    An already running job is in a suspended state.
 >    internal state:  INLRMS
 >
 > For historical and sound reasons jobs in the O state are counted as
 > RUNNING, because they are under LRMS control and ARC cannot override
 > the LRMS decisions, so it is safer to consider them running.
 >
 > I think the client has all the information it needs to take decisions;
 > I see no reason for faking the statistics. I see no bugs. But anyway,
 > let's remove them from waiting as suggested.
 >
 >   From GLUE2 definitions for counting jobs page 28-29 [2]:
 >
 > WaitingJobs
 >     The number of jobs which are currently
 >     waiting to start execution, submitted via any
 >     type of interface (local and Grid). Usually
 >     these will be queued in the underlying
 >     Computing Manager (i.e., a Local Resource
 >     Managment System or LRMS).
 >
 > SuspendedJobs:
 >     The number of jobs, submitted via any type of
 >     interface (local and Grid), which have started
 >     their execution, but are currently suspended
 >     (e.g., having been preempted by another job).
 >
 > Also here I see no bugs or contradictions. The good thing is that
 > we can definitely use Suspended to park these jobs in HOLD, I think.
 >
 > Anyway with the latest patch they will be counted as Suspended in all
 > renderings.
 >
 > 2) I can work on the HoldReasonCode == 16 suggestion to include these in
 > the queued jobs. However, is this worth implementing as such?
 > In other words, will it help to count them that way, or would you prefer
 > marking those as Suspended anyway?
 >
 > Cheers,
 > Florido
 >
 > [1] ARC LDAP Infosys technical manual
 >       http://www.nordugrid.org/documents/arc_infosys.pdf
 > [2] GLUE2 Specification, GFD.147
 >       https://www.ogf.org/documents/GFD.147.pdf
 >
 > On 2018-09-13 14:27, David Cameron wrote:
 >> Hi all,
 >>
 >> Just to conclude this thread, we will stop counting Held jobs as waiting
 >> in the next ARC release.
 >>
 >> Cheers,
 >> David
 >>
 >>
 >> On 12/09/18 22:54, Jaime Frey wrote:
 >>> Hi, Condor expert here.
 >>> The Held status can be triggered by the user, the admin, or the
 >>> HTCondor system itself. When it’s triggered by the system, it is
 >>> indeed usually “fatal error - giving up”.
 >>> I agree that using the Held status for spooling of input files is a
 >>> design mistake. It was a quick hack at the time, and fixing it since
 >>> hasn’t been worth the effort. This use of Held only occurs if the job
 >>> submitter wants to spool the job’s input files over the network
 >>> connection to the condor_schedd daemon. When using condor_submit, this
 >>> only happens when the -spool or -remote argument is used. If the ARC
 >>> CE isn’t using those arguments with condor_submit, then that case can
 >>> probably be ignored. (I’m assuming the ARC monitoring code isn’t
 >>> concerned about jobs submitted to Condor directly by third parties.)
 >>>
 >>> For all other causes of the Held status, it sounds like the jobs
 >>> shouldn’t be treated as waiting.
 >>>
 >>>    - Jaime
 >>>
 >>>> On Sep 7, 2018, at 6:46 AM, Maarten Litmaath
 >>>> <[log in to unmask]> wrote:
 >>>>
 >>>> Hi all,
 >>>> as one can see on the given page, the vast majority of reasons for
 >>>> the Held state
 >>>> essentially are "fatal error - giving up".  Using the Held state for
 >>>> spooling files
 >>>> is a design mistake IMO, but that usage should in any case be
 >>>> short-lived normally,
 >>>> certainly for well-behaved (!) grid jobs that do not have huge input
 >>>> or output
 >>>> sandboxes to be transferred.  That implies it is quite unlikely for
 >>>> any grid job to
 >>>> be caught in the Held state while spooling input files.
 >>>>
 >>>> Following the argument that only HoldReasonCode == 16 should be
 >>>> counted as waiting,
 >>>> the ARC CE info provider still appears to have a bug, because in JM's
 >>>> case it also
 >>>> counted jobs that were killed by the OOM killer.
 >>>>
 >>>> What criteria does the ARC CE info provider apply to obtain the
 >>>> number of waiting
 >>>> jobs for the GLUE and NorduGrid schemas?
 >>>>
 >>>>
 >>>> On 09/07/18 12:06, David Rebatto wrote:
 >>>>> Hi,
 >>>>> David is right, held is also a transient state between idle and
 >>>>> running, while the necessary files are staged to the execution machine.
 >>>>> Still, it is pretty easy to tell whether to ignore the job or
 >>>>> classify it as waiting, just by looking at the HoldReasonCode:
 >>>>>
 >>>>> http://research.cs.wisc.edu/htcondor/manual/v8.7/JobClassAdAttributes.html#dx170-1249191
 >>>>>
 >>>>> At first glance, I'd suggest disregarding all held jobs but the
 >>>>> ones with HoldReasonCode == 16.
 >>>>> Cheers,
 >>>>> David
 >>>>> On 07/09/2018 11:41, Maarten Litmaath wrote:
 >>>>>> Hi David, all,
 >>>>>> AFAIK the Held state is there to allow the _user_ (or a service admin)
 >>>>>> to decide what to do with the job, because HTCondor encountered a
 >>>>>> hard error for it and cannot solve the matter on its own.
 >>>>>>
 >>>>>> As far as grid jobs are concerned, the Held state is useless in practice
 >>>>>> and jobs in that state should not be counted as waiting, but instead sit
 >>>>>> in another category that is dealt with separately.  In practice, held jobs
 >>>>>> typically are purged fairly quickly, potentially leaving a buffer covering
 >>>>>> a number of hours / days to help in debugging, should that be needed.
 >>>>>>
 >>>>>> That is how things work e.g. on ALICE VOBOXes submitting to HTCondor.
 >>>>>>
 >>>>>> ________________________________________
 >>>>>> From: David Cameron
 >>>>>> Sent: 07 September 2018 09:38
 >>>>>> To: Maarten Litmaath; wlcg-arc-ce-discuss (ML for discussion around
 >>>>>> arc-ce usage and issues in WLCG)
 >>>>>> Cc: LHC Computer Grid - Rollout; Jean-Michel Barbet
 >>>>>> Subject: Re: [LCG-ROLLOUT] HT-Condor and held jobs because
 >>>>>> out-of-memory
 >>>>>>
 >>>>>> Hi Maarten,
 >>>>>>
 >>>>>> I'm no condor expert, but as far as I know jobs can go into Held state
 >>>>>> for many reasons, such as waiting for input files to be spooled, for
 >>>>>> which you would want to count them as queued. If everyone is fine with
 >>>>>> ARC ignoring all Held jobs then it's trivial to fix the code, but maybe
 >>>>>> it's worth consulting a condor expert first (I'm sure there are plenty
 >>>>>> on these lists!).
 >>>>>>
 >>>>>> Cheers,
 >>>>>> David
 >>>>>>
 >>>>>>
 >>>>>> On 06/09/18 15:07, Maarten Litmaath wrote:
 >>>>>>> Hi all,
 >>>>>>> though that is an easy workaround, the ARC CE info provider still
 >>>>>>> has a bug to be fixed.
 >>>>>>>
 >>>>>>> ________________________________________
 >>>>>>> From: LHC Computer Grid - Rollout [[log in to unmask]] on
 >>>>>>> behalf of Jean-Michel Barbet [[log in to unmask]]
 >>>>>>> Sent: 06 September 2018 10:02
 >>>>>>> To: [log in to unmask]
 >>>>>>> Subject: Re: [LCG-ROLLOUT] HT-Condor and held jobs because
 >>>>>>> out-of-memory
 >>>>>>>
 >>>>>>> On 09/05/2018 03:08 PM, Max Fischer wrote:
 >>>>>>>> Hi Maarten, Jean-Michel,
 >>>>>>>>
 >>>>>>>> for reference, we let the schedds on the ARC-CEs do the cleanup:
 >>>>>>>>
 >>>>>>>> # /etc/condor/config.d/schedd.cfg
 >>>>>>>> JOBSTATE_HELD = 5
 >>>>>>>> SCHEDD.SYSTEM_PERIODIC_REMOVE = ((JobStatus == $(JOBSTATE_HELD))
 >>>>>>>> && (time() - EnteredCurrentStatus > 2 * $(DAY)))
 >>>>>>> Hi Max,
 >>>>>>>
 >>>>>>> I like your solution. I was monitoring the rate at which the held
 >>>>>>> jobs appear in our cluster and it is relatively frequent. Depending
 >>>>>>> on the workload, I can have ~10/hour for ~400 cores. I think I am
 >>>>>>> simply going to set up an hourly cron with something like:
 >>>>>>>
 >>>>>>> condor_rm -constraint "JobStatus == 5"
 >>>>>>>
 >>>>>>> JM
 >>>>>>>
 >>>>>>> --
 >>>>>>>
------------------------------------------------------------------------
 >>>>>>>
 >>>>>>> Jean-michel BARBET                    | Tel: +33 (0)2 51 85 84 86
 >>>>>>> Laboratoire SUBATECH Nantes France    | Fax: +33 (0)2 51 85 84 79
 >>>>>>> CNRS-IN2P3/IMT-Atlantique/Univ.Nantes | E-Mail:
 >>>>>>> [log in to unmask]
 >>>>>>>
------------------------------------------------------------------------
 >>>>>>>

--
==================================================
  Florido Paganelli
    ARC Middleware Developer - NorduGrid Collaboration
    System Administrator
  Lund University
  Department of Physics
  Division of Particle Physics
  BOX118
  221 00 Lund
  Office Location: Fysikum, Hus A, Rum A403
  Office Tel: 046-2220272
  Email: [log in to unmask]
  Homepage: http://www.hep.lu.se/staff/paganelli
==================================================

