> On 14 Jul 2014, at 15:04, "Stephen Jones" <[log in to unmask]> wrote:
>
>> On 07/13/2014 05:52 PM, Daniela Bauer wrote:
>> Well, my WMS have an attack of the biomeds (5000 jobs each), so I can't guarantee anything.
>
> I came into work last Monday and found 25,000 biomeds queued on our cluster.
> I had to ban "/O=GRID-FR/C=FR/O=ISCPIF/CN=Romain Reuillon", then spend
> the morning getting things sorted out. Then I had a ticket from biomed complaining.
>
And rightly so, though it isn't your fault.
IMHO, it's a bug in the WMS that it brokers that many jobs to you if they aren't going to get run.
I guess nobody is likely to do anything about it though.
Chris
> Steve
>
>
>
>>
>> Daniela
>>
>>
>> On 12 July 2014 18:35, Kashif Mohammad <[log in to unmask]> wrote:
>>
>> Hi Winnie
>>
>> Thanks for pointing this out. A large number of CREAM CEs are
>> failing and it looks like a high-load issue with the WMS. I have
>> changed the order of the WMSes in the Nagios configuration, so it might help.
>>
>> Thanks
>> Kashif
>> ________________________________________
>> From: Testbed Support for GridPP member institutes
>> [[log in to unmask]] on behalf of Winnie Lacesso [[log in to unmask]]
>> Sent: Saturday, July 12, 2014 2:07 PM
>> To: [log in to unmask]
>> Subject: "Upstream" problem for UKI?
>>
>> Dear *,
>>
>> Is there some "upstream" problem for ops/UKI?
>> Bristol's 2 CREAM-CEs have been red for 17hrs now failing 2 tests. In
>> looking at https://mon.egi.eu/myegi/ for our & other UKI sites, the
>> CREAM-CEs are all similar - red last 7 to 17 hrs.
>>
>> The 2 errors seem to be the same for each site checked in nagios, except most
>> sites are only 7-13 hrs red, not 17 like Bristol's (did I do any change on a
>> Friday afternoon?!?! Noooo!)
>>
>> emi.cream.CREAMCE-JobSubmit-/ops/Role=lcgadmin
>> CRITICAL 07-12-2014 12:39:10 0d 7h 1m 5s 2/2 CRITICAL: [3W/2] [Running->Cancelled [timeout/dropped]]
>>
>> emi.cream.glexec.CREAMCE-JobSubmit-/ops/Role=pilot
>> CRITICAL 07-12-2014 10:32:03 0d 9h 8m 46s 2/2 CRITICAL: [3W/2] [Running->Cancelled [timeout/dropped]]
>>
>> It looks like opssgm jobs hit the 30-min short/express walltime queue
>> timeout & fail, causing a backlog of short/express jobs (big queue of
>> them).
>>
>> Tracing job on WN,
>>
>> /home/opssgm/home_cream_134122892/CREAM134122892/gridjob.out ends at
>>
>> Python 2.6.6
>> Can we import Python LDAP ...
>> YES.
>> Launching MTA.
>> /home/opssgm/home_cream_134122892/CREAM134122892/nagios/bin/mta-simple
>> --dirq /tmp/sam.16938.24688/msg-outgoing --destination
>> /queue/grid.probe.metricOutput.EGEE.gridppnagios_lancs_ac_uk
>> --broker-network PROD --pidfiledir
>> /home/opssgm/home_cream_134122892/CREAM134122892/nagios/var/ -v
>> info --bdii-uri lcgbdii.gridpp.rl.ac.uk:2170,topbdii.grid.hep.ph.ic.ac.uk:2170,top-bdii.tier2.hep.manchester.ac.uk:2170
>> No handlers could be found for logger "stomp.py"
>>
>> Anyone know what No handlers could be found for logger "stomp.py"
>> means?
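
On the stomp.py question: as far as I know, that line comes from Python 2's logging module rather than from the probe itself. It is printed once when something logs to a logger (here the one named "stomp.py") that has no handler attached, so the real message is being dropped. It is probably a symptom of the broker connection trouble, not the cause. A minimal sketch of attaching a handler so the message becomes visible, written for a current Python; attach_handler and the sample message are made up for illustration:

```python
import io
import logging


def attach_handler(logger_name, stream):
    """Attach a stream handler so records sent to `logger_name` are written out."""
    logger = logging.getLogger(logger_name)
    handler = logging.StreamHandler(stream)
    handler.setFormatter(logging.Formatter("%(name)s %(levelname)s: %(message)s"))
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    # Keep this demo independent of whatever handlers the root logger has.
    logger.propagate = False
    return logger


# Hypothetical demo: capture what the "stomp.py" logger would have said.
buf = io.StringIO()
log = attach_handler("stomp.py", buf)
log.warning("could not connect to broker")
captured = buf.getvalue()
```

With a handler in place, whatever stomp.py was trying to report (most likely a connection error) would be printed instead of the "No handlers" line.
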
>>
>> Process tree:
>> root@sm23> pstree -lp 16803
>> bash(16803)---1125180.lcgce04(16818)---CREAM134122892_(16823)---perl(16934)-+-perl(16936)
>> `-sh(16935)---nagrun.sh(16938)---python(16961)
>>
>> root@sm23> strace -p 16961
>> Process 16961 attached - interrupt to quit
>> connect(4, {sa_family=AF_INET, sin_port=htons(6163),
>> sin_addr=inet_addr("195.251.55.91")}, 16^C <unfinished ...>
>>
>> root@bse11> nslookup 195.251.55.91
>> 91.0/25.55.251.195.in-addr.arpa name = mq.afroditi.hellasgrid.gr.
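
The strace shows the probe sitting in connect() to port 6163 on that host. A quick way to check reachability from a WN without waiting for the walltime limit is a connect with a timeout. A rough sketch; can_connect is a hypothetical helper, and the 5-second default is an arbitrary choice:

```python
import socket


def can_connect(host, port, timeout=5.0):
    """Return True if a TCP connection to (host, port) completes within `timeout` seconds."""
    try:
        sock = socket.create_connection((host, port), timeout=timeout)
    except OSError:
        # Covers refused connections, unreachable hosts, DNS failures and timeouts.
        return False
    sock.close()
    return True
```

Calling can_connect("mq.afroditi.hellasgrid.gr", 6163) from a worker node would show whether the broker port is reachable at all; if it comes back False from several UK sites, that would point at an upstream problem rather than anything local.
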
>>
>> It's the same on both clusters (Xeon & AMD)
>>
>> Is there an upstream problem with hellasgrid.gr (for all of UK)?
>>
>> Winnie Lacesso / Bristol University Particle Physics Computing Systems
>> HH Wills Physics Laboratory, Tyndall Avenue, Bristol, BS8 1TL, UK
>>
>>
>>
>>
>> --
>> Sent from the pit of despair
>>
>> -----------------------------------------------------------
>> [log in to unmask]
>> HEP Group/Physics Dep
>> Imperial College
>> London, SW7 2BW
>> Tel: +44-(0)20-75947810
>> http://www.hep.ph.ic.ac.uk/~dbauer/
>
>
> --
> Steve Jones [log in to unmask]
> System Administrator office: 220
> High Energy Physics Division tel (int): 42334
> Oliver Lodge Laboratory tel (ext): +44 (0)151 794 2334
> University of Liverpool http://www.liv.ac.uk/physics/hep/