Dear All,
Please find attached the GridPP Project Management Board
Meeting minutes for the 491st meeting.
The latest minutes can be found at:
http://www.gridpp.ac.uk/php/pmb/minutes.php?latest
as well as being listed with other minutes at:
http://www.gridpp.ac.uk/php/pmb/minutes.php
Cheers, Dave.
GridPP PMB Minutes 491 (11.03.2013)
===================================
Present: Dave Britton (Chair), Pete Gronbech, Andrew Sansum, Tony Cass, Dave Colling, Dave Kelsey, Steve Lloyd, Pete Clarke (Minutes - Suzanne Scott)
Apologies: Tony Doyle, Roger Jones, Jeremy Coles, Claire Devereux, Neil Geddes
1. STFC e-VAL ResearchFish
===========================
PG reported that he was still working on this. Were there any new GridPP papers that could be added? DC noted he had no cloud papers yet, however he could add a couple of GridPP ones. PG advised that the 'Collaborations' section was now done. PG would check with STFC what they meant in relation to grants. PG was working on the 'engagement' section at present.
ACTION
491.1 ALL to let PG know about any roles they have on committees etc. - positions and roles information is required, also how many talks have been given at meetings and conferences.
PG noted that no instructions had been given to PIs about how to register online and report that the GridPP information was being provided centrally. PG would check with STFC.
2. ASPIRE Report on Clouds
===========================
DK had circulated the final report and recommendations of the ASPIRE study on the future of NRENs and their services including clouds. This was 'A Study on the Prospects of the Internet for Research and Education 2014-2020'. DK noted there was a tension between who provided and who ran the services, however there were opportunities to get involved in cloud issues. PC considered this might be a useful role in the brokering of a national repository; JANET could be involved. The TERENA community were also running pilot services. DK advised that he used to be on the TERENA technical committee and hence
still had good contacts with that community. He would forward any relevant information on clouds to the lists from now on.
3. RAL Network Issue
=====================
AS reported that there had been problems last week which had resulted in a 5-hour unscheduled downtime. The Tier-1 had also run 'at risk' for around 25 hours. They had lost 20% of their availability on Wednesday and Thursday. These problems had arisen due to a site firewall and router issue - feedback had been provided to the Head of IT. No further information was available at present, however the incident had not been a concern operationally for the Tier-1.
4. PMB F2F Agenda
==================
PG had put together an Agenda for the PMB F2F on Monday 25th March at GridPP30. DK asked if it would be possible to phone in; DB thought that Skype would be available. PG went through the Agenda. PC noted that item 2 should state: 'Report from CAP Chair'. PG would amend. PG asked if there were any issues which had been missed. PC asked for a small item at the end regarding increased networking requirements. DC/PC would discuss this.
PG advised that there were items listed for the PMB/Ops meeting as well. This looked OK to SL - he would do the Agenda for the joint PMB/Ops team meeting, which would take place at 9.00 am on Tuesday 26th March.
5. AOCB
========
- LHCb use of Tier-2
PC advised that he was asking questions of the Tier-2s world-wide. LHCb would not generally use a Tier-2 unless a certain amount of disk was available - i.e. 300TB would need to be available at the Tier-2 for LHCb to use it. What did ATLAS require in terms of space? DB wasn't sure. PG advised that an accounting model metric would be required before the Tier-2s could provide any disk space; an incentive would also be required for spare disk to be provided. ATLAS received their allocation by way of the percentage of Tier-1 authors. DB advised that funding for the Tier-2s was based on no disk for LHCb - we would have to build this into GridPP5 in order to provide it. The problem would, however, occur before 2016 - SL was looking at the resources at the moment. Providing a spare 300TB for LHCb at a single site would be a problem, but he would know more once he had compiled his figures. PC concluded that it would not be easy for LHCb to use the Tier-2s. DB summarised that the initial response was that it would be difficult, but that we would be able to provide a better response once SL had compiled his figures; SL was currently awaiting replies from sites. DB advised that we did need to decide when to spend the hardware funding - we should be able to come up with an answer in a few weeks.
ACTION
491.2 DB to respond to LHCb regarding their Tier-2 disk requirements.
STANDING ITEMS
==============
SI-1 Dissemination Report
--------------------------
SL noted there wasn't much to report - he and Neasan O'Neill had met with Alex Efimov of the KE network. Alex had ideas about the HPC world and had a contact at a bank that was currently having problems that might be suited to our sort of technology. SL/Neasan would pursue this.
SI-2 ATLAS weekly review & plans
---------------------------------
RJ was not present.
SI-3 CMS weekly review & plans
-------------------------------
DC had left the meeting.
SI-4 LHCb weekly review & plans
--------------------------------
PC noted there wasn't much to report - he had responded to the JISC survey regarding the data repository; this issue may come round again.
SI-5 Production Manager's Report
---------------------------------
JC reported in absentia:
1) The main open operations action relates to EMI-1 middleware updates. Recent developments in this area include that QMUL has a potential concern with the StoRM migration (there was a lack of communication with respect to SL5 StoRM EMI-2, so the site will move directly to the as-yet-unavailable SL6 EMI-3 release expected in mid-April), and there has also been a reminder that dCache 1.9.12 is only supported until the end of April - until last week it had remained a WLCG baseline recommended version. Our ROD team and sites have been active in following up on EMI-1 Nagios alerts.
2) perfSONAR deployment in the UK will likely have some gaps, as not all sites purchased the required hardware (or have yet to set it up). For the current status see http://perfsonar.racf.bnl.gov:8080/exda/?page=25&cloudName=UK. There is perhaps an expectation that all WLCG sites will be perfSONAR-enabled, though the need is obviously greater at larger sites. The GridPP sites affected are currently Bristol, Sussex, UCL and EFDA-JET.
3) We have (as of last week) a few sites with APEL publishing delays/issues: QMUL, ECDF, Lancaster and Durham. The underlying problems are variously associated with new CEs or problematic APEL publishing, but each case is being addressed.
4) The WLCG Tier-2 availability:reliability figures for February were released in draft last week: http://sam-reports.web.cern.ch/sam-reports/2013/201302/wlcg/WLCG_Tier2_OPS_Feb2013.pdf. So far no recomputation requests have been made by UK sites. Of the UK sites below 90% (in either category) the issues encountered are as follows:
EFDA-JET (89%:89%): The main issue was a DPM server problem (RAID crash) that was resolved within a few days. There has also been a longstanding issue with their CE (GlueCEStateWaitingJobs: 444444 problem) which was fixed last week with an EMI-2 update 9 upgrade.
RAL-PP (79%:97%): The main issue was a weekend power outage (16th and 17th Feb) and the site being down for 3 days (including a period for draining the batch system). During the month RAL-PP also witnessed processes either locking up or dying on their Argus server causing problems on their CreamCEs.
Lancaster (86%:87%): Impacted over one weekend by broken CRLs on a pool node.
Durham (71%:72%): Site was still in a rebuild state during early February, with the following work being done: all WNs reinstalled and reconfigured; CEs decommissioned and replaced; batch system replaced; publishing reimplemented; new configuration management implemented; new VM infrastructure deployed; backup systems and procedures replaced.
UCL (79%:91%): Outages due to electrical work and DPM pool node upgrades.
5) We have been informed that the lead DPM developer, Ricardo Rocha, will be taking a break from grid data management and will leave the team at the end of March. The role of technical lead will be taken by Fabrizio Furano, “who has been working with the project throughout EMI and who brings a great deal of experience in data management, with DPM and beyond. He will take over the role at the beginning of April”. This is a concern, but with the DPM community now in place it ought not to generate large new risks for DPM.
6) A reminder that the CHEP 2013 abstract deadline is 25th March: http://www.chep2013.org/bulletins/2.
7) RAL was affected by some network outages on Wednesday 5th/Thursday 6th March (the problem was later identified as a “combination of issues including a complex networking loop and problems with a specific system”). The GOCDB and APEL experienced some disruption to their services due to the network outages.
For information:
A) On Tuesday this week there is a pre-GDB on Clouds: https://indico.cern.ch/conferenceDisplay.py?confId=223689.
B) On Wednesday there is a GDB: https://indico.cern.ch/conferenceOtherViews.py?view=standard&confId=197801. The topics include GLUE2 and the IS; EMI-3 highlights; storage accounting; operations coordination updates; and a report from the Storage working group.
C) The next LHCONE/LHCOPN meeting takes place on 17th & 18th March: http://indico.cern.ch/conferenceDisplay.py?confId=236955.
SI-6 Tier-1 Manager's Report
-----------------------------
AS reported as follows:
Fabric:
1) Disk - both sets delivered; acceptance testing projected to end 11th and 15th March. Testing still going OK; test data still to be reviewed.
2) CPU - half of one set is in the SL6 queue; remaining new capacity expected to deploy to SL5 this week.
3) We expect to replace the core Tier-1 network switch (C300) on Tuesday 12th March, with downtime scheduled from 08:45 to 15:30. This change is necessary as a precursor to our 40Gb upgrade later in Q2; however, we also hope it may resolve outbound network performance problems.
4) Network problems caused 20% availability impact on both Wednesday and Thursday. Fault traced to two separate problems, one impacting site firewall (single host identified), the second to a network loop impacting site routers.
Service:
1) Significant disruption caused by network problems on Wednesday and Thursday:
http://www.gridpp.ac.uk/wiki/Tier1_Operations_Report_2013-03-06
2) CASTOR
- nothing to report
3) We are investigating low level job failure rates for ATLAS and LHCb jobs during setup phase.
Staff:
1) Paperwork for two system admin posts for Fabric team approved - preparing advertisements.
SI-7 LCG Management Board Report
---------------------------------
There was no MB meeting.
REVIEW OF ACTIONS
=================
438.9 AS to contact relevant site managers to ask whether or not they would be interested in having retired Tier-1 hardware - if a site were interested then they should submit a proposal as to what they want and why. Ongoing.
480.2 JC to consider the imminent demise of EMI and the resultant effect on the GridPP community - concrete issues and action requests to be brought back to the PMB. Ongoing.
485.3 AS to poll for date in May/June for T1 review. Date has been set for 10th May. Done, item closed.
487.4 ALL to send PG a list of the occasions any PMB member was a keynote speaker at conferences. Done, item superseded.
488.1 AS to notify the community, giving three months' notice, that the AFS service would be shut down. Ongoing.
488.2 DB to speak to DK/STFC regarding the EGI fee payment and let AS know. DB had contacted Janet Seed and Adrian - he had been advised to hold off for a while pending EPSRC/JISC discussions; we must wait meantime. Done, item closed.
ACTIONS AS OF 11.03.13
======================
438.9 AS to contact relevant site managers to ask whether or not they would be interested in having retired Tier-1 hardware - if a site were interested then they should submit a proposal as to what they want and why.
480.2 JC to consider the imminent demise of EMI and the resultant effect on the GridPP community - concrete issues and action requests to be brought back to the PMB.
488.1 AS to notify the community, giving three months' notice, that the AFS service would be shut down.
491.1 ALL to let PG know about any roles they have on committees etc - positions and roles info. is required; also how many keynote talks have been given at meetings and conferences.
491.2 DB to respond to LHCb regarding their Tier-2 disk requirements.
The next PMB would take place on Monday 18th March at 12:55 pm. The following meeting would be the PMB F2F on Monday 25th at GridPP30.