Elena,
If you are using Torque, you could put this in the qmgr config:
set queue long max_queuable = 6000
That way, no VO can submit jobs once you have 6000 in the queue.
The limit gets advertised in the BDII as ...MaxTotalJobs, and the WMS
is supposed to read and obey it.
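Roughly, with qmgr that would be something like this (just a sketch;
I'm assuming a queue called "long", so adjust to suit your setup):
qmgr -c "set queue long max_queuable = 6000"   # cap on jobs allowed in the queue at once
qmgr -c "list queue long"                      # list the queue attributes to confirm the setting took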
Sadly it is only a partial solution. If (say) biomed burps out a
sudden blast of jobs, your queue will fill and all VOs will be
blocked, not just biomed. It would be better if the feature
were more granular (a per-VO cap, say).
Another possibility is to ban the user concerned. For example,
I have this in my Argus policy-loading shell script:
pap-admin ban subject "/O=GRID-FR/C=FR/O=ISCPIF/CN=Romain Reuillon"
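If you later need to see what is banned or lift a ban, pap-admin has
matching commands (a sketch, using the same DN as above):
pap-admin list-policies    # dump the currently loaded policies, bans included
pap-admin un-ban subject "/O=GRID-FR/C=FR/O=ISCPIF/CN=Romain Reuillon"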
The ban command is a shortcut; you could also put deny rules directly
in your policy file to achieve the same goal, e.g. to ban a few users:
resource ".*" {
action ".*" {
rule deny { subject="CN=Julien Delile,O=ISCPIF,C=FR,O=GRID-FR" }
rule deny { subject="CN=Romain Reuillon,O=ISCPIF,C=FR,O=GRID-FR" }
rule deny { subject="CN=john green
(ssc),L=RAL,OU=CLRC,O=eScience,C=UK" }
}
}
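If you go the policy-file route, the file is not picked up by itself;
something along these lines reloads it (a sketch: /etc/argus/policies.txt
is only a placeholder for wherever you keep the file):
pap-admin remove-all-policies                              # drop whatever is currently loaded
pap-admin add-policies-from-file /etc/argus/policies.txt   # load the file with the deny rules above
The PDP should then pick up the change on its next policy refresh.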
Cheers,
Steve
On 07/14/2014 06:10 PM, Elena Korolkova wrote:
> Hi
>
> I also see a huge number of biomed jobs and I have only a few slots for them. Biomed job submission has become too aggressive.
> Is there a way to throttle arriving jobs?
>
> Thanks
> Elena
> On 14 Jul 2014, at 18:00, Christopher J. Walker <[log in to unmask]> wrote:
>
>>> On 14 Jul 2014, at 15:04, "Stephen Jones" <[log in to unmask]> wrote:
>>>
>>>> On 07/13/2014 05:52 PM, Daniela Bauer wrote:
>>>> Well, my WMS have an attack of the biomeds (5000 jobs each), so I can't guarantee anything.
>>> I came into work last Monday and found 25,000 biomeds queued on our cluster.
>>> I had to ban "/O=GRID-FR/C=FR/O=ISCPIF/CN=Romain Reuillon", then spend
>>> the morning getting things sorted out. Then I had a ticket from biomed complaining.
>>>
>> And rightly so, though it isn't your fault.
>>
>> IMHO, it's a bug in the WMS that it brokers that many jobs to you if they aren't going to get run.
>>
>> I guess nobody is likely to do anything about it though.
>>
>> Chris
>>
>>
>>> Steve
>>>
>>>
>>>
>>>> Daniela
>>>>
>>>>
>>>> On 12 July 2014 18:35, Kashif Mohammad <[log in to unmask]> wrote:
>>>>
>>>> Hi Winnie
>>>>
>>>> Thanks for pointing this out. A large number of CREAM CEs are
>>>> failing and it looks like a high-load issue with the WMS. I have
>>>> changed the order of the WMSes in the Nagios configuration, so that might help.
>>>>
>>>> Thanks
>>>> Kashif
>>>> ________________________________________
>>>> From: Testbed Support for GridPP member institutes
>>>> [[log in to unmask]] on behalf of Winnie Lacesso [[log in to unmask]]
>>>> Sent: Saturday, July 12, 2014 2:07 PM
>>>> To: [log in to unmask]
>>>> Subject: "Upstream" problem for UKI?
>>>>
>>>> Dear *,
>>>>
>>>> Is there some "upstream" problem for ops/UKI?
>>>> Bristol's 2 CREAM-CEs have been red for 17 hrs now, failing 2 tests.
>>>> Looking at https://mon.egi.eu/myegi/ for our & other UKI sites, the
>>>> CREAM-CEs all look similar: red for the last 7 to 17 hrs.
>>>>
>>>> The 2 errors seem to be the same for each site checked in Nagios,
>>>> except most sites are only 7-13 hrs red, not 17 like Bristol's
>>>> (did I make any change on a Friday afternoon?!?! Noooo!)
>>>>
>>>> emi.cream.CREAMCE-JobSubmit-/ops/Role=lcgadmin
>>>> CRITICAL 07-12-2014 12:39:10 0d 7h 1m 5s 2/2 CRITICAL: [3W/2] [Running->Cancelled [timeout/dropped]]
>>>>
>>>> emi.cream.glexec.CREAMCE-JobSubmit-/ops/Role=pilot
>>>> CRITICAL 07-12-2014 10:32:03 0d 9h 8m 46s 2/2 CRITICAL: [3W/2] [Running->Cancelled [timeout/dropped]]
>>>>
>>>> It looks like opssgm jobs hit the 30-min short/express walltime queue
>>>> timeout & fail, causing a backlog of short/express jobs (big queue of
>>>> them).
>>>>
>>>> Tracing job on WN,
>>>>
>>>> /home/opssgm/home_cream_134122892/CREAM134122892/gridjob.out ends at
>>>>
>>>> Python 2.6.6
>>>> Can we import Python LDAP ...
>>>> YES.
>>>> Launching MTA.
>>>> /home/opssgm/home_cream_134122892/CREAM134122892/nagios/bin/mta-simple
>>>> --dirq /tmp/sam.16938.24688/msg-outgoing --destination
>>>> /queue/grid.probe.metricOutput.EGEE.gridppnagios_lancs_ac_uk
>>>> --broker-network PROD --pidfiledir
>>>> /home/opssgm/home_cream_134122892/CREAM134122892/nagios/var/ -v
>>>> info --bdii-uri lcgbdii.gridpp.rl.ac.uk:2170,topbdii.grid.hep.ph.ic.ac.uk:2170,top-bdii.tier2.hep.manchester.ac.uk:2170
>>>> No handlers could be found for logger "stomp.py"
>>>>
>>>> Anyone know what No handlers could be found for logger "stomp.py"
>>>> means?
>>>>
>>>> Process tree:
>>>> root@sm23> pstree -lp 16803
>>>> bash(16803)---1125180.lcgce04(16818)---CREAM134122892_(16823)---perl(16934)-+-perl(16936)
>>>> `-sh(16935)---nagrun.sh(16938)---python(16961)
>>>>
>>>> root@sm23> strace -p 16961
>>>> Process 16961 attached - interrupt to quit
>>>> connect(4, {sa_family=AF_INET, sin_port=htons(6163),
>>>> sin_addr=inet_addr("195.251.55.91")}, 16^C <unfinished ...>
>>>>
>>>> root@bse11> nslookup 195.251.55.91
>>>> 91.0/25.55.251.195.in-addr.arpa name = mq.afroditi.hellasgrid.gr.
>>>>
>>>> It's the same on both clusters (Xeon & AMD)
>>>>
>>>> Is there an upstream problem with hellasgrid.gr?
>>>> (for all of UK)?
>>>>
>>>> Winnie Lacesso / Bristol University Particle Physics Computing Systems
>>>> HH Wills Physics Laboratory, Tyndall Avenue, Bristol, BS8 1TL, UK
>>>>
>>>>
>>>>
>>>>
>>>> --
>>>> Sent from the pit of despair
>>>>
>>>> -----------------------------------------------------------
>>>> [log in to unmask]
>>>> HEP Group/Physics Dep
>>>> Imperial College
>>>> London, SW7 2BW
>>>> Tel: +44-(0)20-75947810
>>>> http://www.hep.ph.ic.ac.uk/~dbauer/
>>>
>>> --
>>> Steve Jones [log in to unmask]
>>> System Administrator office: 220
>>> High Energy Physics Division tel (int): 42334
>>> Oliver Lodge Laboratory tel (ext): +44 (0)151 794 2334
>>> University of Liverpool http://www.liv.ac.uk/physics/hep/
--
Steve Jones [log in to unmask]
System Administrator office: 220
High Energy Physics Division tel (int): 42334
Oliver Lodge Laboratory tel (ext): +44 (0)151 794 2334
University of Liverpool http://www.liv.ac.uk/physics/hep/