> -----Original Message-----
> From: Testbed Support for GridPP member institutes [mailto:TB-
> [log in to unmask]] On Behalf Of Jeremy Coles
> Sent: 13 January 2015 05:00
> To: [log in to unmask]
> Subject: Ops meeting @ 11am
>
> Our first ops meeting of 2015 is today and has the agenda at
> https://indico.cern.ch/event/360124/.
Draft minutes are attached to this email; comments/corrections/etc.
all welcome.
Ewan
GridPP Weekly Operations meeting 2015 01 13
===========================================
-- https://indico.cern.ch/event/360124/ --
Experiments
============
LHCb:
------
Raja was having microphone/Vidyo problems and was unable to be heard, but
provided a report via the meeting chat:
The restripping campaign at Tier-1s is almost finished. This caused a huge
load on the RAL SRMs; Castor itself worked fine. The RAL SRMs still seem to
be under load right now, even though the number of running LHCb jobs is
quite low.
For Tier-2 sites there were some problems at Edinburgh yesterday, e.g.:
[dirac@lbvobox15 dirac]$ glite-ce-job-status -L0 -a -e
ce6.glite.ecdf.ed.ac.uk
2015-01-13 10:01:05,097 FATAL - Received NULL fault; the error is due to another cause: FaultString=[connection error] -
FaultCode=[SOAP-ENV:Client] - FaultSubCode=[SOAP-ENV:Client] - FaultDetail=[Connection timed out]
There was also load at Edinburgh due to LHCb jobs.
CMS:
-----
Daniela reported that, as is normal for CMS, there is nothing much to
report, but did go on to explain that there have been some problems with
CMS' internal site monitoring not working reliably. Major site problems
would still have been found though, and none have been.
ATLAS:
-------
Elena reported that there had been a couple of problems with ATLAS
infrastructure over Christmas. There were no major issues with rucio, but
there is a problem with deletions from LOCALGROUPDISK - files can
currently only be removed by their owners, whereas it should also be
possible for users with production roles to delete things. This will be
fixed in the next release of rucio; in the interim it is possible to have
files removed manually by filing a Jira ticket.
Several sites have been querying low usage; this is a generic issue with
low amounts of production work running due to some central problems with
tasks getting stuck and a problem with random seed values(?). Work on all
these problems is ongoing, and there may be more news at the weekly ADC
meeting later today.
ATLAS have also introduced a new metrics system - ATLAS Site Availability
and Performance (ASAP) - which Elena presented at the 9th December ops
meeting ( https://indico.cern.ch/event/357311/ ), with slides attached to
the agenda. When she sent round the first set of ASAP results this year,
several questions were raised; Elena explained a bit more about the
context and use of the new system - it replaces, but is very similar to,
the previous A, B, C ratings used for data distribution. ASAP results are
based entirely on ATLAS analysis availability as measured by the
hammercloud tests. It is technically possible to also take availability
for production into account, but analysis is considered to be more
interesting, though that might change in the future if there's a need. As
well as being used to determine data shares, there's also a threshold
whereby sites below 80% for extended periods may find themselves being
demoted.
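As a purely illustrative sketch (this is not the actual ATLAS
implementation, which runs centrally, and the data layout below is invented
for the example), an ASAP-style number is essentially the fraction of
hammercloud analysis tests passed over a period, compared against the 80%
threshold, e.g. in Python:

# Illustrative sketch only - not the real ASAP code. The real metric is
# computed centrally by ATLAS from hammercloud analysis test results.
from datetime import datetime

# (timestamp, test passed?) for each hammercloud analysis test in the period.
results = [
    (datetime(2015, 1, 12, 9, 0), True),
    (datetime(2015, 1, 12, 12, 0), True),
    (datetime(2015, 1, 12, 15, 0), False),
    (datetime(2015, 1, 13, 9, 0), True),
]

def availability(test_results):
    """Fraction of hammercloud analysis tests passed in the period."""
    if not test_results:
        return 0.0
    passed = sum(1 for _, ok in test_results if ok)
    return float(passed) / len(test_results)

THRESHOLD = 0.80  # sites persistently below this may be demoted

avail = availability(results)
print("availability = {:.0%}".format(avail))
if avail < THRESHOLD:
    print("below the 80% threshold for this period")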
Elena's slides from this meeting:
https://indico.cern.ch/event/360124/contribution/1/material/slides/0.pdf
contain links to ASAP monitoring pages, and she endorsed and pointed to
Steve Jones' explanation of the system on the GridPP wiki:
https://www.gridpp.ac.uk/wiki/ATLAS_Site_Availability_and_Performance_(ASAP)
In particular there have been questions about how sites can be alerted when
they're failing hammercloud tests; Elena explained that hammercloud state
change warning emails are sent to the [log in to unmask]
list; in principle the UK Cloud team will then follow them up with sites,
but Elena noted that in practice most UK ATLAS sites are already following
the list (with the possible exceptions of RALPPD and Birmingham). The list
is useful for support as well as alerts, and Elena encouraged anyone who
wanted help understanding or dealing with their test results to email the
list for assistance. Most UK sites seem to be in decent shape, but there
are some issues to resolve around Sussex, which is in a somewhat unusual
state by design.
Pete raised monitoring results from Steve Lloyd's page at:
http://pprc.qmul.ac.uk/~lloyd/gridpp/hammercloud.html
(and implicitly the ATLAS page from which it gets its data) and asked why
a site (specifically Oxford) was showing as 'grey' for a long time after a
short failure period, despite having resumed passing tests many hours
before. Elena wasn't sure and said she'd look into it, but noted that the
pages in question are not used as inputs to ASAP.
Other VOs:
-----------
The 'other VOs' role has been taken over by Imperial, but there's nothing
to report this week except very early moves to provide CVMFS for the four regional
starter VOs; see the tb-support discussion for more:
https://www.jiscmail.ac.uk/cgi-bin/webadmin?A2=ind1501&L=tb-support&F=&S=&X=721CBCE9CDB1708DE7&P=14439
General updates
================
WLCG Site availabilities
-------------------------
The site availability metrics from December had been circulated to
tb-support by Jeremy:
https://www.jiscmail.ac.uk/cgi-bin/webadmin?A2=ind1501&L=tb-support&F=&S=&P=12748
and Pete briefly ran through the results:
- Oxford was low for ALICE, which Pete suggested may be due to AC problems.
- Sheffield was low for ATLAS; Elena reminded everyone that she'd explained
this on tb-support as being due to an unfortunate upgrade to the latest DPM.
- For LHCb there were problems at Durham, Bristol and JET. JET's is
long-standing, and there was no information about the other two.
WLCG Ops Coordination
----------------------
There was only a 'virtual' meeting on 8th January, and no major updates.
Tier 1
-------
Catalin reported that Gareth was unable to attend today in person, but had
updated the bulletin page:
https://www.gridpp.ac.uk/w/index.php?title=Operations_Bulletin_Latest&oldid=7173
Pete noted that there had been issues with the site router, fixed by
Martin on Boxing Day, and a forthcoming 'at risk' for electrical testing
later this week, though no problems are expected:
https://goc.egi.eu/portal/index.php?Page_Type=Downtime&id=16402
Accounting
-----------
There had been some glitches over the holiday due to the RAL site
networking problem, but everything is believed to be fine now.
Interoperations
------------------
There was a meeting yesterday - https://wiki.egi.eu/wiki/Agenda-12-01-2015
including:
- A discussion of StoRM.
- There was a general reminder to check site contact details for staged
rollout, with David particularly querying whether Chris Walker is still
listed as the contact for QMUL.
- There is a call for suitable sites to test the new FTS3 and Squid
releases - anyone already running them is asked to get in touch. David
noted that Glasgow are running the new squid, and are already involved in
the Staged Rollout process.
- He also highlighted yesterday's broadcast re multicore accounting:
https://operations-portal.egi.eu/broadcast/archive/id/1236
- There is a slight delay in organising the EGI forum in May while an
approval is awaited, but this is expected to be resolved fairly soon.
On Duty
--------
Kashif was on ROD duty over Christmas, and submitted a written report:
A few tickets were opened during that period; these were subsequently
solved the following week, except the EFDA-JET one. I looked at the
availability and reliability figures during the brief network issue at the
RAL Tier 1 on 25-26 December; availability/reliability was not affected
for any site.
Security
---------
There was a brief discussion of the UK's response to the critical kernel
security alert issued just before the holiday period:
https://wiki.egi.eu/wiki/EGI_CSIRT:Alerts/Linux-2014-12-17
Ewan reported that the UK response had been generally very good, with most
sites having installed the new kernels and rebooted their worker nodes
into them very quickly. Four sites were still showing vulnerable worker
nodes this year and had been sent reminders by the EGI security team. Of
those, two had responded by updating, leaving two sites with particular
problems. There is a general issue with the recent RHEL/SL kernels and
IP-over-InfiniBand systems which was affecting Sussex, and ECDF were once
again suffering from their worker nodes being run by their extremely
update-averse university team, not by their grid sysadmins. Matt noted the
very different experience that Lancaster have with their central team,
saying that they had shared our concerns about the vulnerability and had
acted swiftly to update the systems that they manage.
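As an aside, the basic check here is whether a node is still running an
older kernel than the newest one it has installed. A rough Python sketch of
such a local check for a RHEL/SL node is below; it is illustrative only (it
is not the check the EGI team run), and it assumes the newest installed
kernel RPM is the patched one:

# Rough sketch: flag a RHEL/SL node that has a newer kernel installed than
# the one it is currently running, i.e. it still needs a reboot. Assumes
# the newest installed kernel RPM is the patched one; the version sort is
# a crude lexicographic one, good enough for a sketch.
import subprocess

def running_kernel():
    # e.g. "2.6.32-504.3.3.el6.x86_64"
    return subprocess.check_output(["uname", "-r"]).decode().strip()

def installed_kernels():
    # One "version-release.arch" string per installed kernel package.
    out = subprocess.check_output(
        ["rpm", "-q", "kernel", "--queryformat",
         "%{VERSION}-%{RELEASE}.%{ARCH}\n"]).decode()
    return out.split()

running = running_kernel()
newest = sorted(installed_kernels())[-1]
if running != newest:
    print("reboot needed: running %s, newest installed %s" % (running, newest))
else:
    print("running the newest installed kernel (%s)" % running)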
Pete also reported a contact from Wahid regarding a problem with the ECDF
site ARGUS server, which has been crashing due to apparent hardware
problems.
Tickets
--------
There was a short ticket update today; Matt has not carried out a full
ticket review since the break, but he did report that there were no
obvious problem tickets, and that normal service will resume next week.
VOs
----
Steve noted his recent update on tb-support for a change to the geant4 VO
details to reflect the new CERN/WLCG VOMS servers:
https://www.jiscmail.ac.uk/cgi-bin/webadmin?A2=ind1501&L=tb-support&F=&S=&P=2196
Sites
======
Pete went round the group asking for a brief update of current status, as
well as anything interesting from over Christmas:
RALPPD: Nothing much, intermittent load issues on dCache being worked on.
IC: Pretty uneventful
QMUL: There was a problem with a gridftp node over the weekend. Dan is
trying to re-deploy xrootd on a dual-stack node and is having some
trouble; he will probably follow this up by email.
Glasgow: Generally OK, one disk failure in a storage node. There was a
problem with some recently (re)installed worker nodes not having gfortran
installed. Ewan queried whether gfortran is required by HEPOSLibs, and
David explained that it isn't, and probably shouldn't need to be - there
were signs that the real problem was ATLAS jobs not picking up the CVMFS
copy of gfortran that they should be using, and instead falling back to a
local copy when one was present; in principle a local copy shouldn't be
required at all. Alessandra asked for the details, and it was agreed that
they would be sent to a suitable mailing list.
Manchester: No major problems, one WN had a disk failure, one had a
motherboard failure, everything else all smooth.
Sheffield: Fixed a problem with glexec after a ticket from ATLAS and some
SAM test failures. Elena has started a thread on lcg-rollout concerning a
mysterious 'detection of lifetime of proxy' problem; that is ongoing, and
Elena appealed for any input:
https://www.jiscmail.ac.uk/cgi-bin/webadmin?A2=ind1501&L=lcg-rollout&F=&S=&P=56
Oxford: Was basically fine.
RHUL: Quite quiet over the holiday, but there has recently been a problem
with a squid being unable to access the CERN CVMFS servers (still being
worked on; see the connectivity-check sketch after this round-table). The
site has also recently had a 10Gbit link to Janet installed, but it's not
yet in production pending the arrival of some new kit.
Liverpool: Had an issue with some new Condor WNs getting globus libraries
installed in the wrong place, and some DPM filesystems being offlined and
not put back due to the holidays, both now resolved.
Lancs: Christmas had been quiet, but there was a bit of a problem with the
pre-holiday kernel updates caused by bad Chelsio network drivers. An
update was obtained directly from Chelsio but has been causing problems.
Matt recommended that people avoid Chelsio cards, and noted that their
Mellanox equipped machines were very good and trouble-free.
Brunel: Nothing new from Brunel
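On the RHUL squid/CVMFS problem mentioned above, a common way to check a
squid's path to the CVMFS servers is to fetch a repository's .cvmfspublished
manifest through the proxy. The sketch below is illustrative only: the
squid address is a placeholder, and the stratum-one URL is an assumption to
adapt to whichever server the site is actually configured against.

# Illustrative sketch (Python 3): fetch a CVMFS repository manifest through
# a site squid to test the proxy's path to a CERN stratum-one server.
# The proxy address is a placeholder; adapt the URL to the site's config.
import urllib.request

PROXY = "http://squid.example.ac.uk:3128"  # placeholder site squid
URL = ("http://cvmfs-stratum-one.cern.ch/cvmfs/"
       "atlas.cern.ch/.cvmfspublished")    # assumed stratum-one + repository

opener = urllib.request.build_opener(
    urllib.request.ProxyHandler({"http": PROXY}))
try:
    response = opener.open(URL, timeout=10)
    print("HTTP", response.getcode())
    print(response.read(200))
except Exception as exc:
    print("fetch via squid failed:", exc)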
Priorities/Discussion
======================
Pete asked for updates on IPv6; Ewan reported that there's been little
concrete movement, but that the plan remains to push strongly for
deployments, particularly with dual-stack perfSONARs at Tier 1s, and that
there's a face-to-face meeting of the HEPiX working group towards the end
of the month (actually next Wednesday -
https://indico.cern.ch/event/352638/ ), so there may be more after that.
Pete also asked about ARGUS; there was no general news, but Ewan noted
that some Tier 2 sites may be seeing a non-critical security test failure
in the EGI security dashboard, but that this was spurious - it's only
supposed to be testing NGI Argus servers, not site ones.
AOB
====
Alessandra noted that today's pre-GDB meeting is covering storage access
protocols, with Wahid speaking at 12:30 UK time, and she suggested that we
should all go join that meeting after this one; Pete concurred.
Chat Window log:
=====================
Brian @RAL: (13/01/2015 11:01:29)
can I check if anyone is talking?
Raja Nandakumar: (11:01 AM)
Pete is talking now
Ewan Mac Mahon: (11:01 AM)
Pete's just started.
Brian @RAL: (11:01 AM)
thanks
Raja Nandakumar: (11:02 AM)
Apparently noone can hear me
Apologies - I will try to reconnect
Ewan Mac Mahon: (11:03 AM)
There's nothing to report for CMS, there never is.
Steve Jones: (11:12 AM)
https://www.gridpp.ac.uk/wiki/ATLAS_Site_Availability_and_Performance_%28ASAP%29
Work in progress....
Chris Brew: (11:16 AM)
We've had an issue with high load on the dCache servers affecting all VOs. We are actively looking into it
Peter Gronbech: (11:16 AM)
http://pprc.qmul.ac.uk/~lloyd/gridpp/hammercloud.html points to http://hammercloud.cern.ch/hc/app/atlas/robot/
Raja Nandakumar: (11:23 AM)
LHCb ...
Ewan Mac Mahon: (11:24 AM)
I'm minuting that as 'Duncan is in a wind tunnel'
Raja Nandakumar: (11:25 AM)
Apparently you cannot still hear me - my apologies.
Restripping campaign at Tier-1s almost finished. This caused huge load on RAL srm-s. Castor itself worked fine.
The RAL srm-s still seem to be under load right now, even though the number of LHCb running jobs is quite low.
For Tier-2 sites :
Some problem at Edinburgh yesterday
[dirac@lbvobox15 dirac]$ glite-ce-job-status -L0 -a -e ce6.glite.ecdf.ed.ac.uk
2015-01-13 10:01:05,097 FATAL - Received NULL fault; the error is due to another cause: FaultString=[connection error] - FaultCode=[SOAP-ENV:Client] - FaultSubCode=[SOAP-ENV:Client] - FaultDetail=[Connection timed out]
And load due to LHCb jobs
Matt Doidge: (11:39 AM)
Perhaps someone needs to have a chat with the central ECDF team - our local admins shared our concerns and updated and rebooted without quibble.
Matt Doidge: (11:50 AM)
Sorry if I missed Lancaster's turn on the round table - darn Vidyo playing up.
Peter Gronbech: (11:51 AM)
we can hear you
Ewan Mac Mahon: (11:53 AM)
Clearly it's a half duplex link.
Govind: (11:54 AM)
I can't hear anything now..
Did you hear what I said...
otherwise I will type it.
Samuel Cadellin Skipsey: (11:54 AM)
Yes, Govind, we heard you.
We did realise that you couldn't hear us, though.
Govind: (11:55 AM)
still can not hear.. will try to re-join
Raja Nandakumar: (11:55 AM)
No sound
raul: (11:56 AM)
Nothing new from Brunel
Raja Nandakumar: (11:56 AM)
Pete - my apologies.
My microphone is not working unfortunately.
I put in my report earlier in the chat window
Peter Gronbech: (11:57 AM)
ok I've seen the report now so will read it.
Ewan Mac Mahon: (11:58 AM)
We're having a bit of a RAL / Raul issue here.
Samuel Cadellin Skipsey: (11:59 AM)
It's a good job that Raul doesn't work at RAL or RHUL ;)
Ewan Mac Mahon: (12:00 PM)
I don't know, might help if he did - you'd get to the same place either way.
Steve Jones: (12:03 PM)
Cheer Peter