I think this is happening again. On both of our CREAM-CEs, opssgm jobs
are hanging on a python process and hitting the 30-minute walltime limit
of the short/express queue.
Exactly as before, the opssgm process tree ends in a hung python process
that is blocked in connect() to 195.251.55.91, port 6163:
root@bse09> tail gridjob.out
=== WN: bse09.phy.bris.ac.uk
=== WN arch: x86_64
Check Python version:
/usr/bin/python
Python 2.6.6
Can we import Python LDAP ...
YES.
Launching MTA.
/home/opssgm/home_cream_364446563/CREAM364446563/nagios/bin/mta-simple --dirq /tmp/sam.15538.7233/msg-outgoing --destination /queue/grid.probe.metricOutput.EGEE.gridppnagios_physics_ox_ac_uk --broker-network PROD --pidfiledir /home/opssgm/home_cream_364446563/CREAM364446563/nagios/var/ -v info --bdii-uri lcgbdii.gridpp.rl.ac.uk:2170,topbdii.grid.hep.ph.ic.ac.uk:2170,top-bdii.tier2.hep.manchester.ac.uk:2170
No handlers could be found for logger "stomp.py"
root@bse09> pstree -lp 15402
bash(15402)---871773.lcgce03.(15417)---CREAM364446563_(15422)---perl(15535)-+-perl(15537)
                                                                            `-sh(15536)---nagrun.sh(15538)---python(15562)
root@bse09> strace -p 15562
Process 15562 attached - interrupt to quit
connect(4, {sa_family=AF_INET, sin_port=htons(6163), sin_addr=inet_addr("195.251.55.91")}, 16
^C <unfinished ...>
Process 15562 detached
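To confirm the blockage independently of the probe, a connect test with a timeout can be run from the worker node. The following is only a minimal sketch: the function name check_tcp and the 10-second default timeout are illustrative assumptions and are not part of the Nagios probe or stomp.py. Only the address and port are taken from the strace output above. It uses only the standard socket module and should run on the node's Python 2.6 as well as on Python 3.

```python
import socket


def check_tcp(host, port, timeout=10.0):
    """Attempt one TCP connect and return (ok, detail) rather than blocking forever."""
    try:
        # create_connection applies the timeout to the connect() call itself
        sock = socket.create_connection((host, port), timeout)
    except socket.timeout:
        # No answer at all within the timeout: typical of a firewall silently dropping packets
        return False, "timed out after %.0fs" % timeout
    except socket.error as exc:
        # Active failure, e.g. connection refused or no route to host
        return False, "failed: %s" % exc
    sock.close()
    return True, "connected"
```

Calling check_tcp("195.251.55.91", 6163) from bse09 would then report whether the connect succeeds, is refused, or times out, which helps distinguish a dropped-packet firewall problem from a broker that is simply down.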
What can we as a site do to get this fixed ASAP?
Winnie Lacesso / Bristol University Particle Physics Computing Systems
HH Wills Physics Laboratory, Tyndall Avenue, Bristol, BS8 1TL, UK