Hello all!
As the next few weeks are looking weird (Bank Holiday, HEPSYSMAN and I'm
on leave again!) I thought I'd do a full review this week.
Other VO Nagios
https://vo-nagios.physics.ox.ac.uk/nagios/cgi-bin/status.cgi?host=all&servicestatustypes=16&hoststatustypes=15
At time of writing I see problems with test jobs at Brunel for pheno and
Liverpool for a number of VOs (see Sno+ ticket for probable cause and
fix at Liverpool).
22 Open UK Tickets this week. Going site-by-site:
APEL/NGI
https://ggus.eu/?mode=ticket_info&ticket_id=113473 (4/5)
Missing accounting data for April for some sites. Raul is discussing
things for Brunel in the ticket, although they have republished. I think
it's only ECDF left to republish their April data. In progress (16/5)
OXFORD
https://ggus.eu/?mode=ticket_info&ticket_id=113482 (26/4)
Loss of accounting data for Oxford needing an APEL republish. The Oxford
guys republished, but there is some confusion with the resulting
numbers. Discussion is ongoing, John G is currently looking at the
records. In progress (14/5)
https://ggus.eu/?mode=ticket_info&ticket_id=113650 (11/5)
CMS glideins failing at Oxford. The original problem was with a config
tweak being left out of the cvmfs setup, but the ticket has been
reopened citing problems persisting on the ARC CE (the CREAM appears to
be fixed). Reopened (16/5)
GLASGOW
https://ggus.eu/?mode=ticket_info&ticket_id=113095 (17/4)
ROD ticket about batch system BDII failures, left open to avoid
unnecessary ticket filing. Gareth noted that the full migration to ARC
and HTCondor, which should see the end of these issues, will hopefully
be completed by the end of June. On Hold (12/5)
SHEFFIELD
https://ggus.eu/?mode=ticket_info&ticket_id=113769 (18/5)
LHCb see a cvmfs problem at Sheffield. Elena has probably fixed the
problem (restarted the sssd), just waiting to see if it all pans out. In
progress (18/5)
MANCHESTER
https://ggus.eu/?mode=ticket_info&ticket_id=113744 (15/5)
For the VOMS rather than the site, Jens' request for the creation of the
DiRAC VO, vo.dirac.ac.uk. In progress (18/5)
https://ggus.eu/?mode=ticket_info&ticket_id=113692 (13/5)
A request from pheno to add support for their new cvmfs area at
Manchester, and as I understand it, to support them in a new "form"
(pheno.egi.eu). In progress (13/5)
LIVERPOOL
https://ggus.eu/?mode=ticket_info&ticket_id=113742 (15/5)
Sno+ noticed their nagios failures at Liverpool. Rob reckons this was a
problem with the DPM BDII service certificate not being updated (that's
bitten me too), and fixed things this morning. Let's see how that goes.
In progress (18/5)
LANCASTER
https://ggus.eu/?mode=ticket_info&ticket_id=95299 (1/7/13!)
Lancaster's vintage glexec ticket. An update on this - after having a
roundtuit session last week I was building glexec for different paths.
It still needs some testing to make sure it works properly. However,
there definitely won't be a one-size-fits-all tarball solution. On
hold (15/5)
https://ggus.eu/?mode=ticket_info&ticket_id=100566 (27/1/14)
Only the crustiest old tickets for us at Lancaster! Poor perfsonar
performance. Sadly didn't get roundtuit on this one - we're pushing
getting these nodes dual stacked as Ewan had pointed out that it would
be interesting to see if IPv6 tests also saw this issue. On hold (18/5)
UCL
https://ggus.eu/?mode=ticket_info&ticket_id=113721 (14/5)
The only UCL ticket, this is an EGI "low availability" ticket. However,
Daniela notes that the plots are on the rise, so things are looking
alright. Probably want to "On Hold" it but otherwise not much to be
done. In progress (14/5)
IMPERIAL
https://ggus.eu/?mode=ticket_info&ticket_id=113743 (15/5)
A ticket from Durham concerning the Dirac instance at Imperial's
settings for their site. Daniela hopes to get it fixed soon. In progress
(15/5)
100IT
https://ggus.eu/?mode=ticket_info&ticket_id=112948 (10/4)
CA certificate update at 100IT leading to a discussion of other
authentication based failures. David has asked for voms information
after posting his configs. In progress (13/5)
TIER 1
https://ggus.eu/?mode=ticket_info&ticket_id=113035 (14/4)
Ticket tracking the decommissioning of the Tier 1 CREAM CEs. I think
things are just about done now, this ticket can soon be closed. In
progress (11/5)
https://ggus.eu/?mode=ticket_info&ticket_id=109694 (28/10/14)
Sno+ gfal-copy ticket. Brian reports that the Tier 1 is upgrading gfal2
on their WNs, and notes that there's a lot of active debugging work
going on in the area. As he eloquently puts it "situation is quite
fluid". In progress (13/5)
https://ggus.eu/?mode=ticket_info&ticket_id=108944 (1/10/14)
CMS AAA tests failing at the Tier 1. There's been a lot of work on this,
deploying then trying to get the new xrootd redirector configured. New
problems have cropped up, and are under investigation. In progress (11/5)
https://ggus.eu/?mode=ticket_info&ticket_id=112721 (28/3)
Atlas transfer failures ("failed to get source file size"). Tracked to an
odd double transfer error, possibly introduced in one of the recent
"upgrades". Brian has been declaring these files as bad, and a
workaround or solution is being thought about. In progress (14/5)
https://ggus.eu/?mode=ticket_info&ticket_id=113705 (13/5)
Atlas transfer failures from RAL tape. Checksum failures, which Brian
tracked to the checksum not being of a type Castor supports. Brian has
asked if this can be changed at the CERN FTS or in rucio. Waiting for
reply (14/5)
https://ggus.eu/?mode=ticket_info&ticket_id=113748 (16/5)
Another atlas transfer ticket, but as the error indicates no space left
at the Brunel space token being transferred to, Elena has noted that
this isn't a site problem, telling the submitter to put in a JIRA ticket
instead. Waiting for reply, but probably can be just closed (16/5)
https://ggus.eu/?mode=ticket_info&ticket_id=112866 (7/4)
Lots of cms job failures at RAL. This has been traced to some super-hot
files, mitigation is being looked into. Perhaps a candidate for putting
On Hold, depending on the time frame of a workaround. In progress (13/5)
https://ggus.eu/?mode=ticket_info&ticket_id=113320 (27/4)
CMS data transfer issues. I'm not actually too sure what's going on.
There are files that need invalidating, which seems to be the root of
the evil befalling transfers. The issue is being actively worked on
though. In progress (18/5)
That's all the tickets! Catch y'all tomorrow.
Matt