Hello Everybody.
It's the first Monday of the month again, so as you all know (and
dread?), it's time to go over all of the currently open UK tickets.
On top of the lovely tickets, there was a discussion in the Ops team last
week where it was mentioned that it would be handy to look at how sites
are doing on the VO nagios, so I thought I'd go over that here.
https://vo-nagios.physics.ox.ac.uk/nagios/cgi-bin/status.cgi?host=all&servicestatustypes=16&hoststatustypes=15
(forgive the longhand URL).
Sites that seem to be having trouble on one or more of their nodes at
the time of writing are:
Durham: pheno and gridpp
Lancaster: pheno and gridpp
Sussex: snoplus
EFDA-JET: gridpp, pheno, southgrid
Liverpool: gridpp, snoplus
Sheffield: gridpp, snoplus
QMUL: t2k.org
TIER 1: snoplus and t2k
Only Lancaster, Sheffield and the Tier 1 seem to be having really
long-term problems, though.
(I'm still trying to think how best to parse this information, so my
apologies that it's poorly presented).
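For anyone who'd rather pull this out of nagios programmatically than
squint at the web page, here's a rough Python (2) sketch of the sort of
thing I mean. The query parameters follow the usual Nagios CGI bitmasks
(if I'm reading them right, servicestatustypes=16 means "services in
CRITICAL only" and hoststatustypes=15 means "hosts in any state"), but
the HTML scraping is a crude guess at the table layout rather than a
tested tool, and you may need to sort out certificate handling yourselves:

#!/usr/bin/env python
# Rough sketch: list hosts on the VO nagios with services in CRITICAL.
# Assumes the standard Nagios CGI bitmasks (servicestatustypes: 16 =
# CRITICAL; hoststatustypes: 15 = any state) and scrapes the returned
# HTML with a crude regex - illustrative only.
import re
import urllib
import urllib2

BASE = "https://vo-nagios.physics.ox.ac.uk/nagios/cgi-bin/status.cgi"
params = {
    "host": "all",
    "servicestatustypes": 16,  # service state bitmask: 16 = CRITICAL
    "hoststatustypes": 15,     # host state bitmask: 15 = any state
}

url = BASE + "?" + urllib.urlencode(params)
html = urllib2.urlopen(url).read()

# Pull host names out of the status table links and print each once.
hosts = sorted(set(re.findall(r"status\.cgi\?host=([\w.-]+)", html)))
for host in hosts:
    if host != "all":
        print host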
On to the tickets.
Only 24 open UK tickets this month (organised by site).
SUSSEX
https://ggus.eu/index.php?mode=ticket_info&ticket_id=108765 (24/9)
Sussex have a ROD ticket, originating from a glue validation error
(although it's just picked up some SHA-2 failures). Matt RB was away
though, so not much progress - Matt, can you get to it this week? In
progress (3/10)
RALPP
https://ggus.eu/index.php?mode=ticket_info&ticket_id=109115 (6/10)
A fresh ticket from CMS, complaining that RALPP don't have any backup
squids listed in their site XML file. Assigned (6/10)
BRISTOL
https://ggus.eu/index.php?mode=ticket_info&ticket_id=106325 (18/6)
CMS pilots losing network connectivity. CMS have confirmed that it is
only a subset of the Bristol clusters seeing pilots dropping
connections. Winnie has continued to poke and prod this, and between her
and CMS they've (more or less) ruled out NATing as the cause of the
problem. Bristol are still quite stuck, and kind of hoping some
unrelated network tweaks might sweep this issue away. On Hold (2/10)
ECDF
https://ggus.eu/index.php?mode=ticket_info&ticket_id=95303 (1/7/13)
tarball glexec deployment - see the Lancaster entry on the same issue. On
hold (29/8)
DURHAM
https://ggus.eu/index.php?mode=ticket_info&ticket_id=108273 (5/9)
Durham experienced a sudden, odd change in their perfsonar results
(outbound bandwidth went up, inbound dropped). The Durham chaps were
looking into this but were interrupted by this Shellshock business.
Oliver has included some long term plans in the ticket and will update
it again when they have their perfsonar back. On hold (6/10)
SHEFFIELD
https://ggus.eu/index.php?mode=ticket_info&ticket_id=108716 (23/9)
Snoplus jobs not running at Sheffield. Elena had to bash one of her CEs
into shape; it should be fixed now, and she has asked Matt M if he still
sees a problem. Waiting for reply (6/10)
MANCHESTER
https://ggus.eu/index.php?mode=ticket_info&ticket_id=109001 (2/10)
Not quite a site problem, but David M was having trouble committing to
the SVN hosted at Manchester (and a reminder that I believe the
"official" way of reporting problems with these services is to ticket
the site). It looks like this has been solved and the ticket can
probably be closed. In progress (3/10)
https://ggus.eu/index.php?mode=ticket_info&ticket_id=109049 (4/10)
Atlas transfer problems - the underlying issue being a downed (and dead)
disk server. Alessandra is doing the lost file declaration stuff and
offered to provide lists of these files to the users directly. Not much
more that Manchester can do. In progress (6/10)
LANCASTER
https://ggus.eu/index.php?mode=ticket_info&ticket_id=100566 (27/1)
Poor, unexplained perfsonar performance. Although some ideas have been
floated on how to tackle this, holidays and then Shellshock have got in
the way of implementing them. On hold (1/10)
https://ggus.eu/index.php?mode=ticket_info&ticket_id=108715 (23/9)
Sno+ jobs not running at Lancaster. Hopefully a tweak to the
information system on our CEs has fixed this - as Duncan pointed out,
things are looking okay on the VO nagios. I've asked Matt M how things
are looking for "real" Sno+ work. Waiting for reply (1/10)
https://ggus.eu/index.php?mode=ticket_info&ticket_id=95299 (1/7/13)
tarball glexec ticket. As mentioned in last week's Ops meeting, due to
holidays there has been no progress over the last month, but things look
hopeful. On hold (9/9)
UCL
https://ggus.eu/index.php?mode=ticket_info&ticket_id=95298 (1/7/13)
Non-tarball glexec ticket. Ben's been trying to install this but is
having dependency troubles - did anyone who uses RPMs notice this when
they last tried to install the glexec WN? In progress (29/9)
https://ggus.eu/index.php?mode=ticket_info&ticket_id=109039 (3/10)
Another Glue2 validation ROD ticket. In progress (3/10)
IMPERIAL
https://ggus.eu/index.php?mode=ticket_info&ticket_id=108723 (23/9)
Chris W has ticketed Imperial with a few DIRAC file catalogue queries.
Duncan responded with some documentation that others might also find
useful and some other information. I believe the ticket is now waiting
for feedback from Chris (who may in turn be waiting for feedback from
the other VO user groups). Waiting for reply (1/10)
EFDA-JET
https://ggus.eu/index.php?mode=ticket_info&ticket_id=108735 (23/9)
biomed have asked that JET activate the biomed CVMFS repo at their site.
Ticket seen but no news or action. In progress (23/9)
https://ggus.eu/index.php?mode=ticket_info&ticket_id=97485 (21/9/13)
One of the ancient tickets, whose solution (alongside that of the glexec
tarball tickets) will herald the start of the end times (no offence
intended to the Jet admins - this issue is an absolute dog). LHCb are
having authentication errors at Jet. No change. On hold (1/10)
https://ggus.eu/index.php?mode=ticket_info&ticket_id=109080 (6/10)
A fresh ROD ticket about a number of alarms - at first glance I would
say a certificate has expired. In progress (6/10)
100IT
https://ggus.eu/index.php?mode=ticket_info&ticket_id=108356 (10/9)
VM images from fedcloud.egi.eu not available at 100IT. This ticket
showed up an issue with creating an AppDB profile, but that has since
been solved. No news on the state of this ticket other than that the
issue persists. In progress (1/10)
Last but not least:
THE TIER 1
https://ggus.eu/index.php?mode=ticket_info&ticket_id=107935 (27/8)
"BDII vs SRM inconsistent storage capacity numbers". No news on this for
a long time. This ticket really could do with some love (or at least on
holding!). In progress (3/9)
https://ggus.eu/index.php?mode=ticket_info&ticket_id=106324 (18/6)
CMS pilots losing connection, similar to the Bristol ticket. After
comparing firewall rules with RALPP, the issue has been narrowed down to
*something* in the Tier 1's internal network. CMS have updated the ticket
with some more information and some nice plots, but the long and the
short of it is the problem persists. In progress (1/10)
https://ggus.eu/index.php?mode=ticket_info&ticket_id=108546 (16/9)
atlas seeing failures on the RAL-LCG2_HIMEM_SL6 queue. Ticket in an odd
state - the atlas shifters seem to think the problem was transient, but
Gareth and co are seeing a lot of load on the disk servers despite
nothing showing in BigPanda. The RAL team is keeping an eye on it, but
this ticket could do with some updates/on holding in the meantime. In
progress (22/9)
https://ggus.eu/index.php?mode=ticket_info&ticket_id=107880 (26/8)
Sno+ asking RAL for help/alternatives with srmcp-ing for a small group of
seemingly awkward SUSE-using users. Some input from others but not much
word from Sno+ or the Tier 1 - Chris, could you please take a peek with
your small VO hat on? In progress (30/9)
https://ggus.eu/index.php?mode=ticket_info&ticket_id=108944 (1/10)
CMS running into a lot of "file not found" errors when running an AAA
check at RAL, and asking if things are alright. Looking over the whole
Castor namespace, it appears that all files are present and correct,
which doesn't explain why CMS had trouble finding them. In progress (1/10)
https://ggus.eu/index.php?mode=ticket_info&ticket_id=108845 (27/9)
Atlas seeing gridftp timeouts. This looks to be a hotspot problem (at
this point in the review I'm just skim reading tickets). Atlas also
report seeing deletion errors, and have included links. I'm not sure if
this ticket will be impacted by this afternoon's Castor intervention.
Still very much In Progress (5/10)
And that's all folks! Thanks for bearing with me.