GridPP PMB Meeting 669 (21.05.18)
=================================

Present: Dave Britton (Chair), Pete Clarke, Jeremy Coles, David Colling, Alastair Dewhurst, Tony Doyle, Pete Gronbech, Roger Jones, Steve Lloyd, Andrew Sansum, Louisa Campbell (Minutes).

Apologies: Tony Cass, Dave Kelsey, Andrew McNab.

1. CMS FTS issues
==================
This relates to the CMS FTS issue raised by DC last week. Information has now been circulated and the matter has been clarified by Chris Brew. DC confirmed these refer to three separate incidents that caused significant risks through CMS. AD noted that around four weeks ago an IPv6 firewall change issue was picked up and fixes were put in place that resulted in periods where connectivity dropped; Chris and others were aware of this in advance. Because this was being monitored, an additional period of poor efficiency was picked up last Tuesday; no other VOs were affected. The thread was on the Ops list and it may be helpful for Chris to update this system in future. This is referred to in the Tier-1 Manager's Report, below, as AD has been monitoring and believes this is not related to the Tier-1 but to CMS sites. On a negative note this highlights some miscommunication, but on a positive note it is good that the PMB picked up on this and that it was resolved very quickly.

2. DUNE Computing request
=========================
PC advised raising the profile of DUNE as a current need and will discuss this with other relevant people. DUNE is high profile and this requires some tactical thinking in relation to GridPP6. PC was at Fermilab and had several relevant conversations in this regard. DUNE have a shortfall in computing resources and want to internationalise, and are pleased to hold up the UK as an example of an international partner. PC pointed out that we organise this via GridPP, as Fermilab and DUNE are important partners.
This may not amount to a great deal in the long run: it is 2-3 PB and we may have the money for that via IRIS, so we should take care to ensure that if there is a need elsewhere it can be redirected there as a priority; RAL and Manchester will determine where that storage is. The main question surrounded when GridPP can start providing resource for DUNE. DC noted that STFC has committed £65M and it is a particle physics experiment, so we should support it. DB summarised that GridPP is being asked to underwrite this: in the mid term the main resource may come from IRIS, but it is not yet clear what their resources or availability will be, so the request is for GridPP to underwrite this in the meantime. GridPP6 would be an excellent opportunity to support this in the longer term. PC is writing the computing part of the DUNE proposal and is incorporating this therein. The question is whether we can underwrite storage at 2 PB in the meantime, and whether the PMB regards DUNE as within the scope of GridPP5 and possibly GridPP6. The PMB agreed that if they were genuinely going to put 2 PB onto Echo we could provide this, but AD noted it could impact our ability to move off Castor and it is not yet clear what is happening with the £460K non-spend. There was some discussion on the logistics of finessing and managing this between now and October, alongside aspects surrounding Castor. DC enquired whether the resource could be distributed across various sites, e.g. Glasgow, Manchester, IC, etc., and suggested that on that basis it could be managed. Currently DUNE are moving data between EOS and Fermilab. This is a timing issue: with IRIS having h/w money over the next 4 years there is no resource issue, and in our experience these things can be late but normally work out, so we are accustomed to making decisions on that basis. DB is happy to support this based on our past experience, despite not having the full details, and asked the PMB to support this. This was agreed and DB will phrase a formal response.
A chat is planned at 3pm today and PC invited anyone who is interested to join the discussion. DC confirmed they were expecting to formally join DUNE during this summer.

ACTION 669.1: DB to respond to a request for resources from DUNE.

3. AOCB
=======
a) PG has been making progress on Tier2 H/W allocation as a result of the survey and comparison with the experimental requirements. Looking at Tier2 resources, we would start to go negative in April 2021 on CPU and in 2020 on storage. Based on current costings, meeting the requirements would equate to 60% on CPU and 40% on disk, and splitting the funds in that ratio means we could split the CPU according to metrics (SL needs to check the latest situation and check anomalies). There are some questions for groups, e.g. the 4 main ATLAS sites (QMUL, Lancaster, Manchester and Glasgow) and whether all storage goes to each site. ATLAS confirms this will be clearer by the end of June and this should be further discussed at the F2F on 6 June; PG is preparing a brief presentation in this regard. Much depends on the future of RAL PP for CMS. There will be no definitive answers for experiments by the F2F, but it will be possible to undertake some modelling to see how that looks and potentially assist in decision-making. PG is sending the ageing profile to RJ for info.

5. Standing Items
===================

SI-0 Bi-Weekly Report from Technical Group (DC)
-----------------------------------------------
DC circulated an email last week on the future of the working group as things have evolved. The plan is to continue fortnightly meetings with topics suggested for each one; there will be a speaker for each and DC will circulate details. If this does not progress we should consider whether this should continue to be a standing item. DB expressed concern that the Technical meeting might disappear if not structured around a standing item.
It was agreed that this should perhaps continue along the lines of pre-DMB, with a discussion at least once monthly on a subject identified in advance and with specifically relevant attendees. Some potentially relevant topics and speakers were covered, as well as the best way to advertise, e.g. at Ops meetings.

SI-1 ATLAS Weekly Review and Plans (RJ)
---------------------------------------
ATLAS have agreed to delete 3 PB of secondary storage; some test deletions have begun and the remainder will be undertaken soon. X3D copying errors increased to 10% and have now been reduced by restarting them; this is only a short-term fix and the problem probably relates to a memory leak. Tim is reviewing this. A Putdown error affecting 3 or 4 UK Tier2s around CentOS 7 is with developers and being investigated. AD is handing over 600TB allocated at RDF. Various errors and debugs are being reviewed and production data should soon move to that storage. AD is also handling the generation of import metrics so RJ can complete quarterly reports.

SI-2 CMS Weekly Review and Plans (DC)
-------------------------------------
Nothing significant to report. DC mentioned ongoing work on CMS efficiency and work coming through quicker than expected. The Quarterly Report has now been sent to PG.

SI-3 LHCb Weekly Review and Plans (PC)
--------------------------------------
Nothing to report.

SI-4 Production Manager’s report (JC)
-------------------------------------
A series of information updates this week:

1. To make progress with LSST whilst questions around a VO-specific CVMFS setup are agreed, we will be making use of the GridPP CVMFS space for LSST.
2. In line with the discussion on DUNE at the PMB last week, the GridPP VO “incubator” page has been updated: https://www.gridpp.ac.uk/wiki/GridPP_VO_Incubator#DUNE
   There are several DUNE activities to complete this month.
3.
We are keeping an eye on GGUS messaging as some UK sites are not getting their regular ticket updates following a GGUS update last week.
4. At 97%, NGI_UK remains well above the availability/reliability targets for EGI: http://argo.egi.eu/ar-ngi?month=2018-04
   For WLCG in April only QMUL and Bristol were below target, the summary per VO being:

   ALICE (http://wlcg-sam.cern.ch/reports/2018/201804/wlcg/WLCG_All_Sites_ALICE_Apr2018.pdf)
   All OK

   ATLAS (http://wlcg-sam.cern.ch/reports/2018/201804/wlcg/WLCG_All_Sites_ATLAS_Apr2018.pdf)
   QMUL 87%, 87%

   CMS (http://wlcg-sam.cern.ch/reports/2018/201804/wlcg/WLCG_All_Sites_CMS_Apr2018.pdf)
   Bristol 73%, 73%

   LHCb (http://wlcg-sam.cern.ch/reports/2018/201804/wlcg/WLCG_All_Sites_LHCB_Apr2018.pdf)
   QMUL 75%, 75%

   The QMUL issue for ATLAS appears to relate to an SE configuration that has now been modified. For LHCb the issue appears to relate to some older nodes pulling in and failing jobs. A UMD update was performed and some capacity taken offline.
5. Within the wider community there has been a security incident of note.

SI-5 Tier-1 Manager's Report (AD)
---------------------------------
- Tape:
  o CMS have finished their spike of tape writes and the tape write rate has dropped significantly.
  o The new tape procurement has passed the GDPR hurdle and delivery is expected in the second half of June (a delay of 2 weeks).
  o ATLAS have approved a tape deletion and done some test deletions.
  o CMS have modified their tape deletion script (removing unnecessary stager commands) and are running a steady deletion at 0.2 Hz. We have informed them they can go faster.
- The first of this year’s procurement is going into Echo this week. We now have some headroom and can start increasing quotas for VOs again. We aim to deploy it all by July 1st.
- Three disk servers were taken out of Castor for LHCb last week. Two are back in production in read-only mode. One is still rebuilding.
  We will need to consult with LHCb, as if this happened for CMS/ATLAS we would simply drain and remove this hardware (because they can use quota on Echo).
- Niggly problems with ARC CEs caused a few SAM test failures and left the batch system slightly less full than it should be.
- Darren Moore is getting a CMS certificate to provide better CMS cover in the immediate future. He observed very variable CMS FTS efficiency on Tuesday 15th May (all other VOs were unaffected).

SI-6 LCG Management Board Report of Issues (DB)
-----------------------------------------------
The meeting was cancelled and no report was submitted.

SI-7 External Contexts (PC)
---------------------------------
PC noted nothing specific, except that IRIS is progressing. The main potential risk with IRIS is an inability to spend and use the funds it received in year 1, but this is a largely positive risk.

REVIEW OF ACTIONS
=================
644.4: AS will progress capture of funds for Dirac with Mark Wilkinson. (Update: funding from DIRAC. AS has emailed Mark. They are now using it more heavily. Could use the money for tape, but have to be careful not to buy tape we won’t use. May be better charging later rather than during this FY? AD will now progress). Ongoing.
663.2: PG will canvass sites to ascertain when they want to spend money and determine how disk will be phased out. Done.
663.3: RJ and DC will advise how the experiments want disk divided for the start of Run 3 (ALICE and LHCb are resolved). Ongoing.
663.8: JC will examine GridPP staff roles/service/areas of expertise. (UPDATE: JC will provide a table with information for discussion at June F2F). Ongoing.
665.2: AD will produce Procurement schedule for the coming FY to build in an additional month to buffer any delays in the future. Ongoing.
667.1: PG to clarify with STFC what exactly is required for the OC feedback wrt the Capital reporting. Ongoing.
667.2: Need to do h/w planning before next OC to provide OC with details of shortfall in funds.
Ongoing.

ACTIONS AS OF 21.05.18
======================
644.4: AS will progress capture of funds for Dirac with Mark Wilkinson. (Update: funding from DIRAC. AS has emailed Mark. They are now using it more heavily. Could use the money for tape, but have to be careful not to buy tape we won’t use. May be better charging later rather than during this FY? AD will now progress). Ongoing.
663.3: RJ and DC will advise how the experiments want disk divided for the start of Run 3 (ALICE and LHCb are resolved). Ongoing.
663.8: JC will examine GridPP staff roles/service/areas of expertise. (UPDATE: JC will provide a table with information for discussion at June F2F). Ongoing.
665.2: AD will produce Procurement schedule for the coming FY to build in an additional month to buffer any delays in the future. Ongoing.
667.1: PG to clarify with STFC what exactly is required for the OC feedback wrt the Capital reporting. Ongoing.
667.2: Need to do h/w planning before next OC to provide OC with details of shortfall in funds. Ongoing.
669.1: DB to respond to a request for resources from DUNE.