Hi
I also see a huge number of biomed jobs, and I have only a few slots for them. Biomed job submission has become too aggressive.
Is there a way to throttle arriving jobs?
Thanks
Elena
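
(One common way to throttle a single VO is a per-group cap in the batch scheduler rather than at the CE. A minimal sketch, assuming a Torque/Maui setup; the group name and limits are illustrative, not taken from any site above:)

```
# maui.cfg -- cap the biomed group at 50 running and 200 queued jobs
GROUPCFG[biomed]  MAXJOB=50  MAXIJOB=200
```

With a cap like this, excess biomed jobs sit idle (or are rejected) instead of starving other VOs' slots.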
On 14 Jul 2014, at 18:00, Christopher J. Walker <[log in to unmask]> wrote:
>> On 14 Jul 2014, at 15:04, "Stephen Jones" <[log in to unmask]> wrote:
>>
>>> On 07/13/2014 05:52 PM, Daniela Bauer wrote:
>>> Well, my WMS are having an attack of the biomeds (5000 jobs each), so I can't guarantee anything.
>>
>> I came into work last Monday and found 25,000 biomed jobs queued on our cluster.
>> I had to ban "/O=GRID-FR/C=FR/O=ISCPIF/CN=Romain Reuillon", then spent
>> the morning getting things sorted out. Then I had a ticket from biomed complaining.
>>
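
(For reference, on many EMI-era sites a per-DN ban like the one above is done via the LCAS userban plugin; a hedged sketch, assuming that setup — the file path and format are typical but should be checked against the local install:)

```
# /etc/lcas/ban_users.db -- one banned DN per line; matching users are
# refused authorization on the CE until the line is removed
"/O=GRID-FR/C=FR/O=ISCPIF/CN=Romain Reuillon"
```
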
>
> And rightly so, though it isn't your fault.
>
> IMHO, it's a bug in the WMS that it brokers that many jobs to you if they aren't going to get run.
>
> I guess nobody is likely to do anything about it though.
>
> Chris
>
>
>> Steve
>>
>>
>>
>>>
>>> Daniela
>>>
>>>
>>> On 12 July 2014 18:35, Kashif Mohammad <[log in to unmask]> wrote:
>>>
>>> Hi Winnie
>>>
>>> Thanks for pointing this out. A large number of CREAM CEs are
>>> failing, and it looks like a high-load issue with the WMS. I have
>>> changed the order of the WMSes in the Nagios configuration, which might help.
>>>
>>> Thanks
>>> Kashif
>>> ________________________________________
>>> From: Testbed Support for GridPP member institutes
>>> [[log in to unmask]] on behalf of Winnie Lacesso [[log in to unmask]]
>>> Sent: Saturday, July 12, 2014 2:07 PM
>>> To: [log in to unmask]
>>> Subject: "Upstream" problem for UKI?
>>>
>>> Dear *,
>>>
>>> Is there some "upstream" problem for ops/UKI?
>>> Bristol's 2 CREAM-CEs have been red for 17 hrs now, failing 2 tests. Looking
>>> at https://mon.egi.eu/myegi/ for our & other UKI sites, the
>>> CREAM-CEs are all similar - red for the last 7 to 17 hrs.
>>>
>>> The 2 errors seem to be the same for each site checked in Nagios, except
>>> most sites are only 7-13 hrs red, not 17 like Bristol's (did I make any
>>> change on a Friday afternoon?!?! Noooo!)
>>>
>>> emi.cream.CREAMCE-JobSubmit-/ops/Role=lcgadmin
>>> CRITICAL 07-12-2014 12:39:10 0d 7h 1m 5s 2/2 CRITICAL: [3W/2] [Running->Cancelled [timeout/dropped]]
>>>
>>> emi.cream.glexec.CREAMCE-JobSubmit-/ops/Role=pilot
>>> CRITICAL 07-12-2014 10:32:03 0d 9h 8m 46s 2/2 CRITICAL: [3W/2] [Running->Cancelled [timeout/dropped]]
>>>
>>> It looks like opssgm jobs hit the 30-min short/express walltime queue
>>> timeout & fail, causing a backlog of short/express jobs (big queue of
>>> them).
>>>
>>> Tracing job on WN,
>>>
>>> /home/opssgm/home_cream_134122892/CREAM134122892/gridjob.out ends at
>>>
>>> Python 2.6.6
>>> Can we import Python LDAP ...
>>> YES.
>>> Launching MTA.
>>> /home/opssgm/home_cream_134122892/CREAM134122892/nagios/bin/mta-simple
>>> --dirq /tmp/sam.16938.24688/msg-outgoing --destination
>>> /queue/grid.probe.metricOutput.EGEE.gridppnagios_lancs_ac_uk
>>> --broker-network PROD --pidfiledir
>>> /home/opssgm/home_cream_134122892/CREAM134122892/nagios/var/ -v
>>> info --bdii-uri lcgbdii.gridpp.rl.ac.uk:2170,topbdii.grid.hep.ph.ic.ac.uk:2170,top-bdii.tier2.hep.manchester.ac.uk:2170
>>> No handlers could be found for logger "stomp.py"
>>>
>>> Anyone know what No handlers could be found for logger "stomp.py"
>>> means?
>>>
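
(That message is Python's logging module complaining that the stomp.py library logged a record — most likely the broker connection failure visible in the strace below — but the probe never configured a handler, so the actual error text was discarded. A minimal sketch of recovering it; the warning message is illustrative:)

```python
import logging

# "No handlers could be found for logger 'stomp.py'" means the stomp.py
# library emitted a log record but no handler was attached anywhere, so
# the message was dropped. Attaching any handler makes it visible:
handler = logging.StreamHandler()            # writes to stderr by default
log = logging.getLogger("stomp.py")
log.addHandler(handler)
log.setLevel(logging.WARNING)

log.warning("connection to broker failed")   # now printed instead of lost
```

So the warning itself is harmless; the real problem is whatever error stomp.py was trying to report, which the probe silently swallowed.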
>>> Process tree:
>>> root@sm23> pstree -lp 16803
>>> bash(16803)---1125180.lcgce04(16818)---CREAM134122892_(16823)---perl(16934)-+-perl(16936)
>>> `-sh(16935)---nagrun.sh(16938)---python(16961)
>>>
>>> root@sm23> strace -p 16961
>>> Process 16961 attached - interrupt to quit
>>> connect(4, {sa_family=AF_INET, sin_port=htons(6163),
>>> sin_addr=inet_addr("195.251.55.91")}, 16^C <unfinished ...>
>>>
>>> root@bse11> nslookup 195.251.55.91
>>> 91.0/25.55.251.195.in-addr.arpa name = mq.afroditi.hellasgrid.gr.
>>>
>>> It's the same on both clusters (Xeon & AMD)
>>>
>>> Is there an upstream problem with hellasgrid.gr
>>> (for all of UK)?
>>>
>>> Winnie Lacesso / Bristol University Particle Physics Computing Systems
>>> HH Wills Physics Laboratory, Tyndall Avenue, Bristol, BS8 1TL, UK
>>>
>>>
>>>
>>>
>>> --
>>> Sent from the pit of despair
>>>
>>> -----------------------------------------------------------
>>> [log in to unmask]
>>> HEP Group/Physics Dep
>>> Imperial College
>>> London, SW7 2BW
>>> Tel: +44-(0)20-75947810
>>> http://www.hep.ph.ic.ac.uk/~dbauer/
>>
>>
>> --
>> Steve Jones [log in to unmask]
>> System Administrator office: 220
>> High Energy Physics Division tel (int): 42334
>> Oliver Lodge Laboratory tel (ext): +44 (0)151 794 2334
>> University of Liverpool http://www.liv.ac.uk/physics/hep/