> I guess nobody is likely to do anything about it though.
If the load is impacting the infrastructure overall then it will be escalated and (possibly) resolved. Do we know how the jobs are being submitted? Presumably site fair shares are kicking in and the issue is about the size of the queues?
I’ll check if there is a way to throttle the submissions - even if it is just to ask them to moderate the total number of jobs submitted.
Jeremy
On 14 Jul 2014, at 18:00, Christopher J. Walker <[log in to unmask]> wrote:
>> On 14 Jul 2014, at 15:04, "Stephen Jones" <[log in to unmask]> wrote:
>>
>>> On 07/13/2014 05:52 PM, Daniela Bauer wrote:
>>> Well, my WMSes are having an attack of the biomeds (5000 jobs each), so I can't guarantee anything.
>>
>> I came into work last Monday and found 25,000 biomed jobs queued on our cluster.
>> I had to ban "/O=GRID-FR/C=FR/O=ISCPIF/CN=Romain Reuillon", then spent
>> the morning getting things sorted out. Then I had a ticket from biomed complaining.
>>
>
> And rightly so, though it isn't your fault.
>
> IMHO, it's a bug in the WMS that it brokers that many jobs to you if they aren't going to get run.
>
> I guess nobody is likely to do anything about it though.
>
> Chris
>
>
>> Steve
>>
>>
>>
>>>
>>> Daniela
>>>
>>>
>>> On 12 July 2014 18:35, Kashif Mohammad <[log in to unmask]> wrote:
>>>
>>> Hi Winnie
>>>
>>> Thanks for pointing this out. A large number of CREAM CEs are
>>> failing, and it looks like a high-load issue with the WMS. I have
>>> changed the order of the WMSes in the Nagios configuration, which might help.
>>>
>>> Thanks
>>> Kashif
>>> ________________________________________
>>> From: Testbed Support for GridPP member institutes
>>> [[log in to unmask]] on behalf of Winnie Lacesso [[log in to unmask]]
>>> Sent: Saturday, July 12, 2014 2:07 PM
>>> To: [log in to unmask]
>>> Subject: "Upstream" problem for UKI?
>>>
>>> Dear *,
>>>
>>> Is there some "upstream" problem for ops/UKI?
>>> Bristol's 2 CREAM-CEs have been red for 17 hrs now, failing 2 tests.
>>> Looking at https://mon.egi.eu/myegi/ for our & other UKI sites, the
>>> CREAM-CEs all look similar - red for the last 7 to 17 hrs.
>>>
>>> The 2 errors seem to be the same for each site checked in Nagios,
>>> except most sites are only 7-13 hrs red, not 17 like Bristol's (did I
>>> make any change on a Friday afternoon?!?! Noooo!)
>>>
>>> emi.cream.CREAMCE-JobSubmit-/ops/Role=lcgadmin
>>> CRITICAL 07-12-2014 12:39:10 0d 7h 1m 5s 2/2 CRITICAL: [3W/2] [Running->Cancelled [timeout/dropped]]
>>>
>>> emi.cream.glexec.CREAMCE-JobSubmit-/ops/Role=pilot
>>> CRITICAL 07-12-2014 10:32:03 0d 9h 8m 46s 2/2 CRITICAL: [3W/2] [Running->Cancelled [timeout/dropped]]
>>>
>>> It looks like opssgm jobs hit the 30-min short/express walltime queue
>>> timeout & fail, causing a backlog of short/express jobs (big queue of
>>> them).
>>>
>>> Tracing the job on a WN,
>>>
>>> /home/opssgm/home_cream_134122892/CREAM134122892/gridjob.out ends at
>>>
>>> Python 2.6.6
>>> Can we import Python LDAP ...
>>> YES.
>>> Launching MTA.
>>> /home/opssgm/home_cream_134122892/CREAM134122892/nagios/bin/mta-simple
>>> --dirq /tmp/sam.16938.24688/msg-outgoing
>>> --destination /queue/grid.probe.metricOutput.EGEE.gridppnagios_lancs_ac_uk
>>> --broker-network PROD
>>> --pidfiledir /home/opssgm/home_cream_134122892/CREAM134122892/nagios/var/
>>> -v info
>>> --bdii-uri lcgbdii.gridpp.rl.ac.uk:2170,topbdii.grid.hep.ph.ic.ac.uk:2170,top-bdii.tier2.hep.manchester.ac.uk:2170
>>> No handlers could be found for logger "stomp.py"
>>>
>>> Anyone know what No handlers could be found for logger "stomp.py"
>>> means?
>>>
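[To answer the question above: "No handlers could be found for logger" is Python 2's logging module complaining that a library - here the stomp.py messaging client used by the MTA - emitted a log record before the application configured any handler. It is cosmetic, not the fault itself; the real symptom is the hanging connect() further down. A minimal sketch of the mechanism and the usual fix:]

```python
import logging

# A library such as the stomp.py client logs through its own named
# logger and (correctly) does not configure handlers itself:
lib_log = logging.getLogger("stomp.py")

# Without any configuration, Python 2 printed
#   No handlers could be found for logger "stomp.py"
# the first time lib_log emitted a record. Configuring the root logger
# in the application routes those records to stderr instead:
logging.basicConfig(level=logging.INFO)

lib_log.info("connecting to broker ...")  # handled, no warning
```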
>>> Process tree:
>>> root@sm23> pstree -lp 16803
>>> bash(16803)---1125180.lcgce04(16818)---CREAM134122892_(16823)---perl(16934)-+-perl(16936)
>>> `-sh(16935)---nagrun.sh(16938)---python(16961)
>>>
>>> root@sm23> strace -p 16961
>>> Process 16961 attached - interrupt to quit
>>> connect(4, {sa_family=AF_INET, sin_port=htons(6163),
>>> sin_addr=inet_addr("195.251.55.91")}, 16^C <unfinished ...>
>>>
>>> root@bse11> nslookup 195.251.55.91
>>> 91.0/25.55.251.195.in-addr.arpa name = mq.afroditi.hellasgrid.gr.
>>>
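[The strace above shows the probe blocked in connect() to the broker that nslookup resolves to mq.afroditi.hellasgrid.gr. A quick way to repeat that check without hanging a shell is a TCP connect with a short timeout - a hypothetical helper, not part of the probe; host and port are taken from the strace output:]

```python
import socket

def broker_reachable(host, port, timeout=5.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        # create_connection resolves the name and applies the timeout to
        # the connect itself, so an unreachable broker fails fast instead
        # of blocking indefinitely like the probe's connect() did.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# e.g. broker_reachable("mq.afroditi.hellasgrid.gr", 6163)
```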
>>> It's the same on both clusters (Xeon & AMD)
>>>
>>> Is there an upstream problem with hellasgrid.gr (for all of the UK)?
>>>
>>> Winnie Lacesso / Bristol University Particle Physics Computing Systems
>>> HH Wills Physics Laboratory, Tyndall Avenue, Bristol, BS8 1TL, UK
>>>
>>>
>>>
>>>
>>> --
>>> Sent from the pit of despair
>>>
>>> -----------------------------------------------------------
>>> [log in to unmask]
>>> HEP Group/Physics Dep
>>> Imperial College
>>> London, SW7 2BW
>>> Tel: +44-(0)20-75947810
>>> http://www.hep.ph.ic.ac.uk/~dbauer/
>>
>>
>> --
>> Steve Jones [log in to unmask]
>> System Administrator office: 220
>> High Energy Physics Division tel (int): 42334
>> Oliver Lodge Laboratory tel (ext): +44 (0)151 794 2334
>> University of Liverpool http://www.liv.ac.uk/physics/hep/