Hello everybody!
I've cast my eyes across our tickets once again, and chronicle my
findings here for your judgement and approval. Also, last week I was
reminded that I had taken on the task of having a peek at the "Other
VO" Nagios results and letting it be known what I saw there. I kept
forgetting to do that - but not this week! This week I remembered
(probably at the expense of forgetting something else...).
Other VO Nagios Status:
https://vo-nagios.physics.ox.ac.uk/nagios/cgi-bin/status.cgi?host=all&servicestatustypes=16&hoststatustypes=15
At the time of writing I see:
Imperial: gridpp VO job submission errors (but only 34 minutes old so
probably naught to worry about).
Brunel: gridpp VO jobs aborted (one of these is 94 days old, so might be
something to worry about).
Lancaster: pheno failures (I can't see what's wrong, but this CE only
has 10 days left to live).
Sussex: snoplus failures (but I think Sussex is in downtime).
RALPP: A number of failures across a number of CEs, all a few hours old.
An SE problem?
Sheffield: gridpp VO job submission failure, but only 6 hours old.
And of course there are the srm-$VONAME failures at the Tier 1, which
are caused by an incompatibility between the tests and Castor, AIUI.
Otherwise things are generally looking good.
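
As an aside, if anyone fancies pulling that same filtered view out of
Nagios programmatically rather than clicking through, here's a very
rough Python sketch. Caveats: it assumes the page is reachable and that
its certificate verifies against your system trust store (it may well
want a login or a grid certificate instead), and that the status-type
bitmasks (16 = CRITICAL services, 15 = hosts in any state) are the
standard Nagios CGI ones as I understand them.

import urllib.parse
import urllib.request

BASE = "https://vo-nagios.physics.ox.ac.uk/nagios/cgi-bin/status.cgi"

params = {
    "host": "all",
    "servicestatustypes": 16,  # CRITICAL services only (assumed bitmask)
    "hoststatustypes": 15,     # hosts in any state (assumed bitmask)
}
url = BASE + "?" + urllib.parse.urlencode(params)

# Fetch the status page and do a very crude tally of CRITICAL entries.
with urllib.request.urlopen(url, timeout=30) as response:
    html = response.read().decode("utf-8", errors="replace")

print(url)
print("CRITICAL mentions on the page:", html.count("CRITICAL"))
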
To the tickets!
22 Open UK Tickets this week.
NGI/100IT
https://ggus.eu/?mode=ticket_info&ticket_id=111333 (22/1)
The NGI has been asked to upgrade the cloud accounting probe and then
notify our (currently only) cloud site to republish their accounting.
I'm not entirely sure what this entails or who it falls to, so I
assigned it to NGI-OPERATIONS (and also noticed that 100IT isn't on the
"notify site" list - odd). Assigned (22/1)
TIER 1
https://ggus.eu/?mode=ticket_info&ticket_id=108944 (1/10/14)
CMS AAA test failures. Andrew Lahiff reported last week that the Tier 1
is preparing a replacement xrootd box. If that will take a while, could
the ticket be put on hold? In progress (19/1)
QMUL
https://ggus.eu/?mode=ticket_info&ticket_id=110353 (25/11/14)
An atlas ticket asking for httpd access at QMUL. The QM chaps were
waiting on a production-ready StoRM release that could handle this, and
are preparing to test one out. This is another ticket that looks like it
might need to be put On Hold (I'll leave that up to you chaps - there's
a big difference between "slow and steady" progress and "no progress for
a while"). In progress (21/1)
RHUL
https://ggus.eu/?mode=ticket_info&ticket_id=111355 (23/1)
A dteam ticket concerning http access to RHUL's SE. Although the
initial observation that the SE certificate had expired was incorrect
(the expiry date was reported as 5/1/15, which to be fair I would also
read as the 5th of January rather than the 1st of May!), there is still
some underlying problem here with intermittent test failures. This
ticket also raises the question of what context these tests are being
run under. Does anyone know, or shall we ask the submitter? In progress (26/1)
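
On the date-format confusion: the certificate itself records its expiry
with the month spelled out, so it's unambiguous if you read it straight
off the server. Here's a minimal Python sketch of doing just that - the
hostname and port are hypothetical placeholders (not the real RHUL
endpoint), and it assumes the signing CA is in your default trust store.

import socket
import ssl
from datetime import datetime, timezone

def cert_expiry(host, port):
    """Return the server certificate's notAfter time as a UTC datetime."""
    context = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=10) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            cert = tls.getpeercert()
    # notAfter comes back like "May  1 12:00:00 2015 GMT" - the month is
    # spelled out, so there's no 5/1 vs 1/5 ambiguity to argue over.
    expiry_ts = ssl.cert_time_to_seconds(cert["notAfter"])
    return datetime.fromtimestamp(expiry_ts, tz=timezone.utc)

if __name__ == "__main__":
    # Hypothetical SE endpoint, purely for illustration.
    print("Certificate expires:", cert_expiry("se01.example.ac.uk", 8443))
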
BIOMED PROBLEMS:
Manchester: https://ggus.eu/?mode=ticket_info&ticket_id=111356 (23/1)
Imperial: https://ggus.eu/?mode=ticket_info&ticket_id=111357 (23/1)
Biomed are having job problems, which look to be caused by using crusty
old WMSes to communicate with these sites' shiny up-to-date CEs.
According to ticket 110635 a CREAM-side fix (CREAM 1.16.5) should be out
by the end of January, although Alessandra suggests that Biomed should
try using newer, working WMSes - or DIRAC instead!
I think that's all folks. I need to save myself for next week's full review!
Cheers all,
Matt