Tickets Ahoy!
40 Open tickets this week, and it's the start of the month so we get to
go over all of them! Maybe every month is a little too often for such a
review...
NGI
https://ggus.eu/ws/ticket_info.php?ticket=84381 (19/7)
COMET VO creation, On Hold pending the other VO creation gubbins (6/9)
https://ggus.eu/ws/ticket_info.php?ticket=82492 (24/5)
Chris's VOMS request rejig ticket. On hold until the UK VOMS reshuffle
is complete; the reminder date (24/9) has passed. (6/9)
TIER 1
https://ggus.eu/ws/ticket_info.php?ticket=86570 (1/10)
GGUS is moving to a SHA2 certificate on their next release (~24th), and
have asked if the SHA2 cert will cause any trouble. Gareth has noted the
ticket, but it's unclear if others will take notice. In progress (1/10)
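For anyone wanting to check a service ahead of the switch, here's a
rough sketch (mine, not from the ticket; Python, assuming openssl is on
the path) that prints the signature algorithm of a host certificate:

  # Fetch a service's host certificate and print its signature
  # algorithm, to see whether it's still SHA-1 or already SHA-2.
  import ssl
  import subprocess

  host, port = "ggus.eu", 443  # swap in whatever service you care about

  pem = ssl.get_server_certificate((host, port))
  text = subprocess.run(
      ["openssl", "x509", "-noout", "-text"],
      input=pem, capture_output=True, text=True, check=True,
  ).stdout

  for line in text.splitlines():
      if "Signature Algorithm" in line:
          # e.g. "Signature Algorithm: sha256WithRSAEncryption"
          print(line.strip())
          break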
https://ggus.eu/ws/ticket_info.php?ticket=86552 (30/9)
Atlas transfers from/to RAL-LCG2 failed, apparently due to high load at
the RAL end. Found to be caused by a database problem. Should be fixed,
at risk for a little while longer. In Progress (1/10)
https://ggus.eu/ws/ticket_info.php?ticket=86541 (29/9)
Before the above problem, atlas transfers were failing with
SECURITY_ERRORs. A known FTS bug caused this
(https://ggus.eu/tech/ticket_show.php?ticket=81844). Patch applied this
morning. In progress (1/10)
https://ggus.eu/ws/ticket_info.php?ticket=86152 (17/9)
Duncan has ticketed the Tier 1 over packet loss seen on many (not all)
Perfsonar tests where the RAL perfsonar is the destination. The RAL
chaps are looking into it, but aren't expecting a solution to easily
present itself. In progress (19/9)
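For a crude cross-check of the loss (my sketch, not from the ticket;
the hostname is a placeholder and it assumes the standard Linux ping),
something like this from a box at the far end:

  # Fire a burst of pings at the destination and parse out the
  # packet loss figure that ping reports in its summary line.
  import re
  import subprocess

  host = "perfsonar.example.rl.ac.uk"  # placeholder, not the real box
  out = subprocess.run(
      ["ping", "-c", "100", "-i", "0.2", host],
      capture_output=True, text=True,
  ).stdout

  match = re.search(r"(\d+(?:\.\d+)?)% packet loss", out)
  if match:
      print(f"{host}: {match.group(1)}% loss")

Obviously perfsonar's tests are far more thorough; this just gives a
quick yes/no from an arbitrary host.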
https://ggus.eu/ws/ticket_info.php?ticket=85077 (13/8)
biomed nagios jobs can't register files on srm-biomed.gridpp.rl.ac.uk.
An odd problem that only seemed to affect biomed jobs. Looked to be
dealt with for a while, but seems to have re-emerged. In progress (24/9)
https://ggus.eu/ws/ticket_info.php?ticket=68853 (22/3/11)
SL4 DPM retirement master ticket. On hold, but should be put In
progress with a view to closing it (6/9)
DURHAM
https://ggus.eu/ws/ticket_info.php?ticket=86578 (1/10)
Ops srm-put tests are failing. The ticket only went in this morning
though, so it's still just assigned. (1/10)
https://ggus.eu/ws/ticket_info.php?ticket=86534 (28/9)
Ops wn-rep tests failing. Related to the above? Still just assigned (28/9)
https://ggus.eu/ws/ticket_info.php?ticket=86281 (21/9)
Another wn-rep related ticket (for a different CE). This one too is just
assigned. Are these getting to Mike? (21/9)
https://ggus.eu/ws/ticket_info.php?ticket=86242 (20/9)
Biomed having trouble submitting to cream02, with "no space left on
device" errors. Not much movement, just put In progress (24/9)
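Worth remembering that "no space left on device" can mean exhausted
inodes as well as full blocks, so it's worth checking both. A quick
sketch (the path is a guess; point it at whichever filesystem cream
writes to):

  # Report free bytes and free inodes for a filesystem; either
  # hitting zero produces ENOSPC ("no space left on device").
  import os

  path = "/var/log"  # placeholder path
  st = os.statvfs(path)

  free_gb = st.f_bavail * st.f_frsize / 1e9
  print(f"{path}: {free_gb:.1f} GB free, {st.f_favail} inodes free")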
https://ggus.eu/ws/ticket_info.php?ticket=85181 (20/8)
One of the last two glite 3.1 retirement tickets. No reply since Daniela
asked if the BDII was indeed glite 3.1. In Progress (13/9)
https://ggus.eu/ws/ticket_info.php?ticket=84123 (11/7)
High atlas production failure rate at Durham. Durham's rocky summer
hasn't helped, but hopefully they're out of the woods(?). On Hold (3/9)
https://ggus.eu/ws/ticket_info.php?ticket=75488 (19/10/11)
Ancient compchem ticket. On hold, but might not be relevant any more as
all the CEs have been reinstalled (6/9)
https://ggus.eu/ws/ticket_info.php?ticket=68859 (22/3/11)
SL4 retirement ticket. It looks like it can be closed; we just need
confirmation from someone on the Durham side. In progress (28/9)
OXFORD
https://ggus.eu/ws/ticket_info.php?ticket=86544 (29/9)
Problems after running out of atlas pool accounts at Oxford. Probably
caused by the lcg-expiregridmapdir bug (I missed the discussion of
this, maybe it was offline?); a fix is in place. The long-term plan is
to up the number of atlas pool accounts. In progress (29/9)
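As an aside, the leases are easy to audit by hand: gridmapdir leasing
works by hard-linking the URL-encoded DN entry to a pool account file,
so any account file with a link count above one is in use. A sketch
(conventional path and account prefix assumed, adjust for the site):

  # Count how many atlas pool accounts are currently leased, using
  # the gridmapdir hard-link convention (st_nlink > 1 means leased).
  import os
  import re

  gridmapdir = "/etc/grid-security/gridmapdir"
  pattern = re.compile(r"^atlas\d+$")  # plain atlas pool accounts

  accounts = [f for f in os.listdir(gridmapdir) if pattern.match(f)]
  leased = [f for f in accounts
            if os.stat(os.path.join(gridmapdir, f)).st_nlink > 1]
  print(f"{len(leased)} of {len(accounts)} atlas pool accounts leased")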
https://ggus.eu/ws/ticket_info.php?ticket=86106 (14/9)
Low atlas sonar rate seen between Oxford & BNL. Ewan has been looking
into it, and still is (17/9)
https://ggus.eu/ws/ticket_info.php?ticket=85968 (10/9)
Oxford being bitten by the EMI lcg_utils bug. On hold pending EMI
pulling their finger out. (20/9)
LIVERPOOL
https://ggus.eu/ws/ticket_info.php?ticket=86542 (29/9)
Liverpool suffered a bunch of SRM transfer failures in a short
timeframe; no obvious cause was found at the time. They were
investigating, but were probably interrupted by their unexpected
cable-bisecting incident today. In progress (29/9).
https://ggus.eu/ws/ticket_info.php?ticket=86095 (14/9)
Liverpool's encounter with the EMI lcg-utils bug mucking up their WN-rep
ops tests. On hold, but has been green for a while - maybe they've just
been lucky? (20/9)
BIRMINGHAM
https://ggus.eu/ws/ticket_info.php?ticket=86540 (28/9)
Atlas transfers to Birmingham failed with "SRM_ABORTED" messages. Mark
reports that the VM they are using as a headnode isn't beefy enough to
cope with the demand, causing SRM responses to be too slow. He upped the
power of the VM, but that wasn't a full fix; he's hoping to get a
reinstall in today. A note from atlas this morning mentions that transfers fail for
DATADISK but not for PRODDISK, which is odd. Are there any differences
in the nature of these transfers? In progress (1/10)
https://ggus.eu/ws/ticket_info.php?ticket=86105 (14/9)
One of the tickets clocking poor atlas sonar rates between Birmingham
and BNL. Mark and Laurie have looked into this, but not come up with
anything conclusive. In progress (19/9)
BRUNEL
https://ggus.eu/ws/ticket_info.php?ticket=86533 (28/9)
Ops "WN-RepDel" tests failing, likely due to the known EMI WN lcg-utils
timing out bug. As Brunel already have at ticket about this issue on a
different CE (presumably fronting the same WNs) then Daniela asks if the
ROD team can sum it in one ticket rather then multiples. In progress (1/10)
https://ggus.eu/ws/ticket_info.php?ticket=85973 (10/9)
The "original" RepDel test failure ticket at Brunel. On hold (awaiting a
fix from EMI) (20/9)
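Since this bug keeps cropping up (Oxford, Liverpool, now Brunel), one
local mitigation while we wait on EMI - my suggestion, not something
from the tickets - is to wrap the lcg-utils call in a hard timeout so a
wedged call fails fast. The command line below is illustrative, not the
exact one the ops test runs:

  # Run a replication command under a hard timeout so a hung
  # lcg-utils call gets killed rather than blocking forever.
  import subprocess

  cmd = ["lcg-cr", "--vo", "ops", "file:///tmp/testfile"]  # illustrative
  try:
      subprocess.run(cmd, timeout=300, check=True)
      print("replication OK")
  except subprocess.TimeoutExpired:
      print("call hung past 300s - smells like the known EMI bug")
  except subprocess.CalledProcessError as exc:
      print(f"replication failed outright: rc={exc.returncode}")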
IMPERIAL
https://ggus.eu/ws/ticket_info.php?ticket=86426 (26/9)
Hone are having trouble submitting to the Imperial WMSes. Daniela reports that
the machines are suffering from being too old (something we can all
relate to), replacements should have arrived on Friday but hadn't. Dell
report a new delivery date of the 8th. In progress (could be on hold
until the kit arrives?) (29/9).
GLASGOW
https://ggus.eu/ws/ticket_info.php?ticket=86391 (25/9)
Atlas were having staging-in problems due to high disk server load.
However, the problems persisted for a while after the load on the
server calmed down. Did things sort themselves out over the weekend? In
progress (27/9)
https://ggus.eu/ws/ticket_info.php?ticket=85183 (14/8)
One of the last few glite 3.1 retirement tickets. Due to the severe
crustiness of the old WMS hardware, Glasgow powered it down rather than
upgrading (was it only 32-bit hardware?) and are now pondering the next
steps. In progress (28/9)
https://ggus.eu/ws/ticket_info.php?ticket=85025 (9/8)
Sno+ WMS problems at Glasgow. AFAICS the WMS in question has been
switched off for the reasons above? It might be useful to make that
clear to Sno+! In progress (10/9).
RHUL
https://ggus.eu/ws/ticket_info.php?ticket=86383 (25/9)
RHUL stopped publishing UserDN accounting after "upgrading" from glite
to EMI apel in August. Apel support have been called in, and Daniela
suggests checking the FAQ. In progress (1/10)
QMUL
https://ggus.eu/ws/ticket_info.php?ticket=86378 (25/9)
Hone had jobs waiting "too long" at QM, but the problems disappeared -
along with a bunch of jobs. It looks like the QM creams suffered from
the database-resetting issue
(https://ggus.eu/tech/ticket_show.php?ticket=85970, as advertised by
Daniela). In progress (27/9)
https://ggus.eu/ws/ticket_info.php?ticket=86306 (22/9)
Queen Mary is being swamped by unkillable lhcb zombie pilots. Neither
the submitters nor the site admins can do aught about them using
"normal" tools. Daniela has suggested some DB queries to try, or attempting to use
the JobPurger tool (which would be my suggestion too). In progress (1/10).
https://ggus.eu/ws/ticket_info.php?ticket=85967 (10/9)
QM failing ops Apel tests. Chris ticketed apel support for help
(https://ggus.eu/ws/ticket_info.php?ticket=84326), but not having much
luck due to the sheer size of their DB, and progress was interrupted by
GridPP last week. Hopefully they'll crack this problem this week. On hold
(21/9)
ECDF
https://ggus.eu/ws/ticket_info.php?ticket=86334 (24/9)
Poor atlas sonar rates between BNL and ECDF. Waiting on moving disk
servers to new switches and other general network wizardry scheduled for
this week. On hold till then (28/9).
CAMBRIDGE
https://ggus.eu/ws/ticket_info.php?ticket=86108 (14/9)
Duncan noticed a WAN bandwidth asymmetry at Cambridge. John contacted
the local networking guys, who've investigated and found nothing. Still
in progress (26/9)
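If the networking folk fancy another data point, classic iperf's
tradeoff mode tests each direction in turn, which makes asymmetry easy
to spot. A sketch (placeholder hostname; assumes iperf2 on both ends,
with `iperf -s` running at the far side):

  # Run iperf in tradeoff mode (-r): client-to-server first, then
  # server-to-client, printing a bandwidth figure for each leg.
  import subprocess

  far_end = "test-box.example.ac.uk"  # placeholder
  result = subprocess.run(["iperf", "-c", far_end, "-r"],
                          capture_output=True, text=True)
  print(result.stdout)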
LANCASTER
https://ggus.eu/ws/ticket_info.php?ticket=85367 (20/8)
ilc were having trouble submitting jobs to one of Lancaster's CEs. Robin
tracked the issues to high disk IO load, and we're figuring out some
ways of mitigating these problems. In progress (1/10)
https://ggus.eu/ws/ticket_info.php?ticket=84583 (26/7)
lhcb jobs failing on a Lancaster CE, originally due to a pool account
misconfiguration. The problem has been fixed (probably...), but files
don't appear to be staged in for lhcb and there are no errors (or any
mention of lhcb at all) in the gridftp logs. Debugging is not being
helped by the load issues documented above. In progress (27/9)
https://ggus.eu/ws/ticket_info.php?ticket=84461 (23/7)
t2k.org transfers from RAL to Lancaster timing out. We hoped the gateway
upgrade would improve things, but we were disappointed. Back to the
network investigation. In progress (1/10)
RALPP
https://ggus.eu/ws/ticket_info.php?ticket=85019 (9/8)
ILC had some adventures due to VO misconfiguration at RALPP, but it
looks like things are fixed and the ticket can be closed now. In progress (1/10)
SUSSEX
https://ggus.eu/ws/ticket_info.php?ticket=81784 (1/5)
Emyr wondered last week if this was the longest ticket ever? Sadly I
doubt it! The baton has, oddly enough, passed to Lancaster, as we've
come across a bizarre problem whereby communication from the Sussex
cream CE (and only the cream CE) is being refused by machines on a
specific Lancaster subnet. Sadly this is the subnet where the Lancaster
nagios box is sitting. We've ruled out firewalls and had the network chaps at
both sides take a look. Traffic is being stopped at the Lancaster end,
but by the servers themselves (not the network gateways). I'm currently
investigating to see if there's any oddity with our network settings. In
progress (26/9)
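For the record, the reasoning behind "the servers themselves, not the
gateways": a TCP connect that comes back "connection refused" means the
far host actively answered with a RST, whereas a silent timeout points
at something dropping packets en route. A minimal sketch of that probe
(placeholder host and port):

  # Distinguish an active rejection (RST from the host itself) from
  # a silent drop (timeout, typical of a firewall in the path).
  import socket

  def probe(host, port, wait=5):
      try:
          socket.create_connection((host, port), timeout=wait).close()
          return "connected"
      except ConnectionRefusedError:
          return "refused - the server itself sent a RST"
      except socket.timeout:
          return "timed out - silent drop somewhere en route"

  print(probe("nagios.example.lancs.ac.uk", 443))  # placeholder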
Ticket of Interest:
https://ggus.eu/tech/ticket_show.php?ticket=85970
As mentioned above, the ticket documenting the EMI2 cream database
"reset" problems.
Solved Tickets
Ran out of time for these, but I notice that most of the glite 3.1
tickets are closed and the neurogrid VO has taken off. Good stuff!