TB-SUPPORT Archives
TB-SUPPORT@JISCMAIL.AC.UK



Subject: Re: Ops meeting @ 11am
From: Ewan MacMahon <[log in to unmask]>
Reply-To: Testbed Support for GridPP member institutes <[log in to unmask]>
Date: Wed, 14 Jan 2015 15:31:29 +0000
Content-Type: multipart/mixed
Parts/Attachments: text/plain (16 lines), ops-20150113.txt (1 line)

> -----Original Message-----
> From: Testbed Support for GridPP member institutes [mailto:TB-
> [log in to unmask]] On Behalf Of Jeremy Coles
> Sent: 13 January 2015 05:00
> To: [log in to unmask]
> Subject: Ops meeting @ 11am
> 
> Our first ops meeting of 2015 is today and has the agenda at
> https://indico.cern.ch/event/360124/.

Draft minutes are attached to this email; comments/corrections/etc.
all welcome.

Ewan



GridPP Weekly Operations meeting 2015-01-13
===========================================
-- https://indico.cern.ch/event/360124/ --

Experiments
===========

LHCb:
-----
Raja was having microphone/Vidyo problems and could not be heard, but
provided a report via the meeting chat: the restripping campaign at the
Tier-1s is almost finished. This caused a huge load on the RAL SRMs;
Castor itself worked fine. The RAL SRMs still seem to be under load
right now, even though the number of running LHCb jobs is quite low.
For the Tier-2 sites there were some problems at Edinburgh yesterday,
e.g.:

    [dirac@lbvobox15 dirac]$ glite-ce-job-status -L0 -a -e ce6.glite.ecdf.ed.ac.uk
    2015-01-13 10:01:05,097 FATAL - Received NULL fault; the error is due to
    another cause: FaultString=[connection error] - FaultCode=[SOAP-ENV:Client] -
    FaultSubCode=[SOAP-ENV:Client] - FaultDetail=[Connection timed out]

as well as load due to LHCb jobs.

CMS:
----
Daniela reported that, as is normal for CMS, there is not much to
report, but went on to explain that there have been some problems with
CMS' internal site monitoring not working reliably. Major site problems
would still have been found, though, and none have been.

ATLAS:
------
Elena reported that there had been a couple of problems with ATLAS
infrastructure over Christmas. There were no major issues with Rucio,
but there is a problem with deletions from LOCALGROUPDISK - files can
currently only be removed by their owners, whereas users with
production roles should also be able to delete them. This will be fixed
in the next Rucio release; in the interim files can be removed manually
by filing a Jira ticket. Several sites have been querying low usage;
this is a generic issue with low amounts of production work running,
due to some central problems with tasks getting stuck and a problem
with random seed values(?). Work on all these problems is ongoing, and
there may be more news at the weekly ADC meeting later today.

ATLAS have also introduced a new metrics system - ATLAS Site
Availability and Performance (ASAP) - which Elena introduced to us at
the 9th December ops meeting ( https://indico.cern.ch/event/357311/ )
with slides attached to the agenda. When she sent round the first set
of ASAP results this year several questions were raised; Elena
explained a bit more about the context and use of the new system - it
replaces, but is very similar to, the previous A, B, C ratings used for
data distribution. ASAP results are based entirely on ATLAS analysis
availability as measured by the HammerCloud tests. It is technically
possible to also take availability for production into account, but
analysis is considered more interesting, though that might change in
the future if there's a need. As well as being used to determine data
shares, there is also a threshold whereby sites below 80% for extended
periods may find themselves being demoted.
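As a minimal sketch of how that threshold works - assuming availability
is simply the pass fraction of HammerCloud analysis test results over
some window, with made-up site names and numbers - a check might look
like:

    # Sketch only: estimate ASAP-style availability as the pass fraction of
    # HammerCloud analysis results and flag sites under the 80% threshold.
    # The input format (site -> list of pass/fail booleans) is illustrative.

    def availability(results):
        """Fraction of passed tests; 0.0 if there are no results."""
        return sum(results) / len(results) if results else 0.0

    def sites_below_threshold(all_results, threshold=0.80):
        """Sites whose availability falls below the given threshold."""
        return sorted(site for site, results in all_results.items()
                      if availability(results) < threshold)

    if __name__ == "__main__":
        # Hypothetical results for two sites over a week of tests.
        sample = {
            "UKI-SOUTHGRID-OX-HEP": [True] * 95 + [False] * 5,   # 95% -> fine
            "UKI-EXAMPLE-SITE":     [True] * 70 + [False] * 30,  # 70% -> flagged
        }
        for site in sites_below_threshold(sample):
            print(f"{site}: availability {availability(sample[site]):.0%} < 80%")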
Elena's slides from this meeting:
https://indico.cern.ch/event/360124/contribution/1/material/slides/0.pdf
contain links to the ASAP monitoring pages, and she endorsed and
pointed to Steve Jones' explanation of the system on the GridPP wiki:
https://www.gridpp.ac.uk/wiki/ATLAS_Site_Availability_and_Performance_(ASAP)

In particular there have been questions about how sites can be alerted
when they are failing HammerCloud tests. Elena explained that
HammerCloud state-change warning emails are sent to the
[log in to unmask] list; in principle the UK Cloud team will then
follow them up with sites, but Elena noted that in practice most UK
ATLAS sites are already following the list (with the possible
exceptions of RALPPD and Birmingham). The list is useful for support as
well as alerts, and Elena encouraged anyone who wanted help
understanding or dealing with their test results to email the list for
assistance. Most UK sites seem to be in decent shape, but there are
some issues to resolve around Sussex, which is in a somewhat unusual
state by design.

Pete raised the monitoring results on Steve Lloyd's page at
http://pprc.qmul.ac.uk/~lloyd/gridpp/hammercloud.html (and implicitly
the ATLAS page from which it gets its data) and asked why a site
(specifically Oxford) was showing as 'grey' for a long time after a
short failure period, despite having resumed passing tests many hours
before. Elena wasn't sure but said she would look into it; she noted
that the pages in question are not used as inputs to ASAP.

Other VOs:
----------
The 'other VOs' role has been taken over by Imperial, but there is
nothing to report this week except very early moves to provide CVMFS
for the four regional starter VOs; see the tb-support discussion for
more:
https://www.jiscmail.ac.uk/cgi-bin/webadmin?A2=ind1501&L=tb-support&F=&S=&X=721CBCE9CDB1708DE7&P=14439

General updates
===============

WLCG Site availabilities
------------------------
The site availability metrics for December had been circulated to
tb-support by Jeremy:
https://www.jiscmail.ac.uk/cgi-bin/webadmin?A2=ind1501&L=tb-support&F=&S=&P=12748
and Pete briefly ran through the results:
- Oxford was low for ALICE, which Pete suggested may be due to AC
  problems.
- Sheffield was low for ATLAS; Elena reminded everyone that she had
  explained this on tb-support as being due to an unfortunate upgrade
  to the latest DPM.
- For LHCb there were problems at Durham, Bristol and JET. JET's is
  long-standing, and there was no information about the other two.

WLCG Ops Coordination
---------------------
There was only a 'virtual' meeting on 8th January, and no major
updates.

Tier 1
------
Catalin reported that Gareth was unable to attend in person today, but
had updated the bulletin page:
https://www.gridpp.ac.uk/w/index.php?title=Operations_Bulletin_Latest&oldid=7173
Pete noted that there had been issues with the site router, fixed by
Martin on Boxing Day, and a forthcoming 'at risk' for electrical
testing later this week, though no problems are expected:
https://goc.egi.eu/portal/index.php?Page_Type=Downtime&id=16402

Accounting
----------
There had been some glitches over the holiday due to the RAL site
networking problem, but everything is believed to be fine now.

Interoperations
---------------
There was a meeting yesterday -
https://wiki.egi.eu/wiki/Agenda-12-01-2015 - including:
- A discussion of StoRM.
- A general reminder to check site contact details for staged rollout,
  with David particularly querying whether Chris Walker is still listed
  as the contact for QMUL.
- A call for suitable sites to test the new FTS3 and Squid releases -
  anyone already running them is asked to get in touch. David noted
  that Glasgow are running the new Squid and are already involved in
  the staged rollout process.
- David also highlighted yesterday's broadcast on multicore
  accounting:
  https://operations-portal.egi.eu/broadcast/archive/id/1236
- There is a slight delay in organising the EGI forum in May while an
  approval is awaited, but this is expected to be resolved fairly soon.

On Duty
-------
Kashif was on ROD duty over Christmas and submitted a written report: a
few tickets were opened in that period, all of which were solved the
following week except the EFDA-JET one. He looked at the availability
and reliability figures during the brief network issue at the RAL Tier
1 on 25-26 December; availability/reliability was not affected for any
site.

Security
--------
There was a brief discussion of the UK's response to the recent
critical kernel security alert issued just before the holiday period:
https://wiki.egi.eu/wiki/EGI_CSIRT:Alerts/Linux-2014-12-17
Ewan reported that the UK response had been generally very good, with
most sites having installed the new kernels and rebooted their worker
nodes into them very quickly. Four sites were still showing vulnerable
worker nodes this year and had been sent reminders by the EGI security
team. Of those, two had responded by updating, leaving two sites with
particular problems. There is a general issue with the recent RHEL/SL
kernels on IP-over-InfiniBand systems which was affecting Sussex, and
ECDF were once again suffering from their worker nodes being run by
their extremely update-averse central university team rather than by
their grid sysadmins. Matt noted the very different experience that
Lancaster have with their central team, who had shared our concerns
about the vulnerability and acted swiftly to update the systems that
they manage. Pete also reported a contact from Wahid regarding a
problem with the ECDF site ARGUS server, which has been crashing due to
apparent hardware problems.

Tickets
-------
There was a short ticket update today; Matt has not carried out a full
ticket review since the break, but he reported that there were no
obvious problem tickets and that normal service will resume next week.

VOs
---
Steve noted his recent update on tb-support about a change to the
geant4 VO details to reflect the new CERN/WLCG VOMS servers:
https://www.jiscmail.ac.uk/cgi-bin/webadmin?A2=ind1501&L=tb-support&F=&S=&P=2196

Sites
=====
Pete went round the group asking for a brief update of current status,
as well as anything interesting from over Christmas:

RALPPD: Nothing much; intermittent load issues on dCache are being
worked on.

IC: Pretty uneventful.

QMUL: There was a problem with a GridFTP node over the weekend. Dan is
trying to re-deploy xrootd on a dual-stack node and is having some
trouble; he will probably follow this up by email.

Glasgow: Generally OK, with one disk failure in a storage node. There
was a problem with some recently (re)installed worker nodes not having
gfortran installed. Ewan queried whether it was required by HEPOSLibs,
and David explained that it isn't, but that it possibly shouldn't need
to be - there were signs that the real problem was ATLAS jobs not
picking up the CVMFS copy of gfortran that they should have been using,
and falling back to a local copy when one was present; in principle a
local copy shouldn't have been required. Alessandra asked for the
details, and it was agreed that they would be sent to a suitable
mailing list. (A small illustrative check of where gfortran is picked
up from is sketched after this round-up.)

Manchester: No major problems; one WN had a disk failure and another a
motherboard failure, but everything else was smooth.

Sheffield: Fixed a problem with glexec after a ticket from ATLAS, along
with some SAM test failures. Elena has started a thread on LCG-ROLLOUT
about a mysterious 'detection of lifetime of proxy' problem; that is
ongoing, and Elena appealed for any input:
https://www.jiscmail.ac.uk/cgi-bin/webadmin?A2=ind1501&L=lcg-rollout&F=&S=&P=56

Oxford: Basically fine.

RHUL: Quite quiet over the holiday, but recently had a problem with a
Squid being unable to access the CERN CVMFS servers (still being worked
on). The site has recently had a 10Gbit link to Janet installed, but it
is not yet in production pending some new kit.

Liverpool: Had an issue with some new Condor WNs getting Globus
libraries installed in the wrong place, and with some DPM filesystems
being offlined and not put back over the holidays; both are now
resolved.

Lancs: Christmas had been quiet, but there was a bit of a problem with
the pre-holiday kernel updates caused by bad Chelsio network drivers.
An update was obtained directly from Chelsio but has been causing
problems. Matt recommended that people avoid Chelsio cards, and noted
that their Mellanox-equipped machines have been very good and
trouble-free.

Brunel: Nothing new from Brunel.
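As a minimal sketch of the check implied by the Glasgow item - it only
asks which gfortran the current PATH would pick up and whether that
binary lives under /cvmfs or is a local copy; real ATLAS jobs set up
their compiler through their own tooling, so this is illustrative
rather than what was actually run:

    # Sketch only: report which gfortran would be used and whether it comes
    # from CVMFS or from a locally installed copy. It just inspects PATH.
    import shutil

    def gfortran_origin():
        path = shutil.which("gfortran")
        if path is None:
            return "gfortran not found on PATH"
        if path.startswith("/cvmfs/"):
            return f"gfortran from CVMFS: {path}"
        return f"local gfortran: {path}"

    if __name__ == "__main__":
        print(gfortran_origin())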
Priorities/Discussion
=====================
Pete asked for updates on IPv6. Ewan reported that there has been
little concrete movement, but that the plan remains to push strongly
for deployments, particularly dual-stack perfSONARs at the Tier 1s, and
that there is a face-to-face meeting of the HEPiX working group towards
the end of the month (actually next Wednesday -
https://indico.cern.ch/event/352638/ ), so there may be more news after
that. (A trivial dual-stack check is sketched below.)

Pete also asked about ARGUS. There was no general news, but Ewan noted
that some Tier 2 sites may be seeing a non-critical security test
failure in the EGI security dashboard, and that this is spurious - the
test is only supposed to cover NGI Argus servers, not site ones.
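As a trivial sketch of what 'dual stack' means for a service host -
checking whether a name resolves to both an IPv4 (A) and an IPv6 (AAAA)
address; the hostname used below is a placeholder, not a real perfSONAR
instance:

    # Sketch only: a host counts as dual stack if DNS returns both an IPv4
    # and an IPv6 address for it. The hostname is a placeholder.
    import socket

    def is_dual_stack(hostname):
        families = set()
        try:
            for family, *_ in socket.getaddrinfo(hostname, None):
                families.add(family)
        except socket.gaierror:
            return False
        return socket.AF_INET in families and socket.AF_INET6 in families

    if __name__ == "__main__":
        host = "perfsonar.example.ac.uk"  # placeholder hostname
        print(f"{host} dual stack: {is_dual_stack(host)}")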
AOB
===
Alessandra noted that today's pre-GDB meeting is covering storage
access protocols, with Wahid speaking at 12:30 UK time, and she
suggested that we should all join that meeting after this one; Pete
concurred.

Chat Window log:
================
Brian @RAL: (13/01/2015 11:01:29) can I check if anyone is talking?
Raja Nandakumar: (11:01 AM) Pete is talking now
Ewan Mac Mahon: (11:01 AM) Pete's just started.
Brian @RAL: (11:01 AM) thanks
Raja Nandakumar: (11:02 AM) Apparently noone can hear me. Apologies - I
  will try to reconnect
Ewan Mac Mahon: (11:03 AM) There's nothing to report for CMS, there
  never is.
Steve Jones: (11:12 AM) https://www.gridpp.ac.uk/wiki/ATLAS_Site_Availability_and_Performance_%28ASAP%29
  Work in progress....
Chris Brew: (11:16 AM) We'e had an issue with high load on the dCache
  servers affecting all VOs. We are actively looking into it
Peter Gronbech: (11:16 AM) http://pprc.qmul.ac.uk/~lloyd/gridpp/hammercloud.html
  points to http://hammercloud.cern.ch/hc/app/atlas/robot/
Raja Nandakumar: (11:23 AM) LHCb ...
Ewan Mac Mahon: (11:24 AM) I'm minuting that as 'Duncan is in a wind
  tunnel'
Raja Nandakumar: (11:25 AM) Apparently you cannot still hear me - my
  apologies. Restripping campaign at Tier-1s almost finished. This
  caused huge load on RAL srm-s. Castor itself worked fine. The RAL
  srm-s still seem to be under load right now, even though the number
  of LHCb running jobs is quite low. For Tier-2 sites: Some problem at
  Edinburgh yesterday
  [dirac@lbvobox15 dirac]$ glite-ce-job-status -L0 -a -e ce6.glite.ecdf.ed.ac.uk
  2015-01-13 10:01:05,097 FATAL - Received NULL fault; the error is due
  to another cause: FaultString=[connection error] -
  FaultCode=[SOAP-ENV:Client] - FaultSubCode=[SOAP-ENV:Client] -
  FaultDetail=[Connection timed out]
  And load due to LHCb jobs
Matt Doidge: (11:39 AM) Perhaps someone needs to have a chat with the
  central ECDF team - our local admins shared our concerns and updated
  and rebooted without quibble.
Matt Doidge: (11:50 AM) Sorry if I missed Lancaster's turn on the round
  table - darn Vidyo playing up.
Peter Gronbech: (11:51 AM) we can hear you
Ewan Mac Mahon: (11:53 AM) Clearly it's a half duplex link.
Govind: (11:54 AM) I can't hear anything now.. Did you hear what I
  said... otherwise I will type it.
Samuel Cadellin Skipsey: (11:54 AM) Yes, Govind, we heard you. We did
  realise that you couldn't hear us, though.
Govind: (11:55 AM) still can not hear.. will try to re-join
Raja Nandakumar: (11:55 AM) No sound
raul: (11:56 AM) Nothing new from Brunel
Raja Nandakumar: (11:56 AM) Pete - my apologies. My microphone is not
  working unfortunately. I put in my report earlier in the chat window
Peter Gronbech: (11:57 AM) ok I've seen the report now so will read it.
Ewan Mac Mahon: (11:58 AM) Wr're having a bit of a RAL / Raul issue
  here.
Samuel Cadellin Skipsey: (11:59 AM) It's a good job that Raul doesn't
  work at RAL or RHUL ;)
Ewan Mac Mahon: (12:00 PM) I don't know, might help if he did - you'd
  get to the same place either way.
Steve Jones: (12:03 PM) Cheer Peter
