GridPP PMB Meeting 518 (17.02.14)
=================================
Present: Dave Britton (Chair), Pete Gronbech, Andrew Sansum, Jeremy Coles, Dave Kelsey, Steve Lloyd, Roger Jones, Pete Clarke, Tony Cass, Tony Doyle, Dave Colling (Minutes - Suzanne Scott)
Apologies: Claire Devereux

1. The GridPP5 Papers
======================
The purpose of the meeting was to review the GridPP5 Project Brief & Guidelines, the GridPP Strategic Review Terms of Reference, and the draft GridPP5 Proposal document (v14b).

i) Project Brief
-----------------
Section 3.3: DB noted that we did address user support in the Deployment, Operations and Support (DOS) section; did a sentence need to be added to cover the point raised in 3.3? JC would add a sentence.

Section 3.3: Regarding the comment about a 'key element', we could add something about having already influenced the computing models and continuing to do so, e.g. the CMS high-level trigger farm. JC noted that we also deploy things and see how they work. PC considered that GridPP had had a major influence on the model. Could PC provide a sentence? Yes. RJ noted that GridPP had influenced the computing models from the start and had been key players in all decisions. PG also noted that we had key roles in the collaboration as a whole. PC considered that Section 3.3 indicated that STFC thought of GridPP as being like DiRAC, which was not the case; there was inherent confusion in the question. Our users were not individual physicists sitting at a machine; for LHCb, for example, it was a team issue. PC thought the question was ill-founded in its focus on 'users': ATLAS, CMS and LHCb were the 'users', and UK sites were 'user-supporting'. DB noted that the big picture comprised how our resources were used to meet international obligations and also to meet UK user needs.

Section 3.3: Regarding the 'clear boundaries of responsibility between GridPP and the Experiments', did we need to distinguish between the experiment support posts?
Were there posts on the experiment grants for software support? The grey area was in production and analysis, e.g. with the ATLAS experiment grants. The Ganga posts were situated in this grey area, but were within 'development' rather than 'operations' at the moment.

Section 3.4: Regarding the 'international context' and 'synergies' with STFC, AS noted that he had information in his document which could be used as a high-level extract. Could AS write one? There were two parts to this: first, the CASTOR storage system was used by STFC and the RC communities; second, other communities worked in this area. AS would provide a paragraph. DB noted that there were also opportunities with UK3A.

Section 3.5: Regarding demonstrating close collaboration with industry, we could cite CERN here. SL asked whether we also had collaborations with suppliers. DK added that we had collaborations through EGI and commercial clouds; ATLAS had been working with Google. AS considered that the Tier-1 had a few relevant examples and would write something.

ACTION 518.1 AS to send SL some info on collaborating with industry in relation to the Tier-1.
ACTION 518.2 SL to amend the Impact section; a new version of two paragraphs' length (perhaps 3/4 of a page) was required. SL would amend the version with the inputs noted above from AS on engagement with industry.

ii) Proposal Guidelines
------------------------
Number 3: DB noted that an analysis of risks and benefits was required. Project Management information of a page or two was needed. The Panel would do a SWOT analysis (strengths, weaknesses, opportunities, threats). PG confirmed he was working on this. He had circulated the risks from GridPP4, and we must work out priorities for GridPP5. He had looked at previous documents to assist him with this. DB advised that PG should keep this brief: it should be an annual cycle of milestones, driven by pledges/procurement/installation/delivery.
DK asked whether we were not already moving towards metrics rather than milestones. DB confirmed that we were metrics-driven, but noted that we did need an annual set of milestones. DK considered that it was evolution, not development, that we should consider; it was the evolution element that would have milestones. DB noted that we could, for example, review the medium-term future of CASTOR at a set point. DB advised that there were key milestones that could be extracted from AS's list. PG was tasked with doing this and with looking at high-level risks. The subject could be discussed offline.

Number 14: Regarding 'Collaborative projects', we must include the information appropriate to this section.

iii) Terms of Reference
------------------------
Number 3.2 (bulleted list): a SWOT analysis of the scenarios needed to be included within the Project Management section. Again, a yearly cycle of milestones and pledges would comprise an annual process of review. Regarding the last bullet, on Impact, SL must provide more information as previously discussed. Regarding the bullet point on 'other funding', DK thought we should advise that EU funding was not guaranteed and should not be assumed by the Review Panel. We should state what we received in GridPP4 and note how this would diminish. CD must provide the figure for GridPP4. TD suggested that we should comment on the reduction from previous phases, which had followed a downward trend. DK noted that the expectation was that we would receive less, in Horizon 2020 etc., and this had to be stated clearly.

iv) Draft GridPP5 Proposal Doc (v14b)
--------------------------------------
Cover page: It was noted that most Institutes wished to continue their association with GridPP. Confirmation was awaited from the last few.
page 1: DB asked whether the Foreword was long enough. Should we add a set of background documents and provide the URL? Agreed. SL to do this. SS to provide document numbers (done following the meeting).
ACTION 518.3 SL to add the background documents to the website and provide the URL for the Strategic Review Committee. Adam's document to be added.
page 2: Motivation: it was agreed to leave in the comment on ranking. It was agreed to use 'ILC' and to add 'Collaborations' after NA62.
page 4: re clouds and I/O cost, the comment was included for information only. DB's highlighted statement should remain.
page 5: re SL's comment, remove the word 'such'.
page 6: were these metrics for the 4th Quarter? No, it was confirmed that these were for the 3rd. It was agreed to use them as a snapshot.
page 8: re TC's statement in 6.1, should the sentence be included or not? It was agreed to delete it.
page 11: the wording in 6.5 should be changed to 'global exchange rate' rather than specifying a particular currency.
page 13: in 6.6 it was confirmed that SJ6 should be changed to JANET6.
page 14: in 7.1.1, could we include the numbers on reliability? Yes, agreed.
page 15: in 7.1.2, add the word 'slow' as suggested by TC.
page 18: 7.2.1 should begin with what the Tier-2s were used for in Run 1 etc., not with the cloud information. Could SL iterate with RJ on this paragraph, move the cloud information to the end, and address the section generally? Yes. It was suggested that the paragraph start with the Tier-2s in Runs 1 and 2.
ACTION 518.4 SL/RJ to re-work 7.2.1 on page 18, starting the paragraph with Tier-2 in Runs 1 and 2 and moving the cloud information to the end.
page 19: 7.2.1, regarding the CMS section: 'vulnerable' needed to be clarified; should 'resilience' be used instead? It depended on which site. DC noted that, regarding Tier-2 functionality, the wording should be 'would remove this resilience'. It was pointed out that we would need 2 FTE at each of the large sites; we needed to make the point that each extra 0.5 FTE was funded by Brunel and RAL PPD. DB would change the text accordingly.
ACTION 518.5 DC to check on the yellow fraction in the UK as shown in the figure on page 19.
page 20: 7.2.2, should the capacity numbers be updated? No; it was agreed to leave the table as it was, since it provided a general indication of size and manpower.
page 21: the paragraph immediately below the table should be deleted, and the paragraph beneath the table should then begin: 'It is clear that the ..'. Regarding the estimation, this could be covered by a single sentence beneath the table referring to the background document, by a footnote (explanation of costs), or by a highlighted note at the end of the paragraph.
page 22: 7.2.4, below the plots: it was agreed to leave in the comparison to the Tier-1.
page 23: 7.3, at the bottom of the page, regarding monitoring systems and teams: AS to discuss with JC offline in case this question needed to be addressed (we run different services in different ways, so they need to be monitored differently).
page 24: it was questioned whether to use 'sysadmins' due to the fEC issue; it was agreed to use 'site personnel', 'grid expert' or 'Tier-2 staff' instead, depending on the context.
page 25: regarding the comment on 'development', it was better to say 'required evolution', and to use 'evolution' rather than 'development' generally.
page 26: at the top of the page, it was asked why this related specifically to ATLAS. It was noted that this was historical, because Brian used to do ATLAS work. It was agreed to replace 'ATLAS' with 'experiment liaison'.
page 27: 7.3.2, the D/O/S heading should be amended here and in other sections: it had to be written in full at some point before the acronym could be used. The wording was also to be changed to: 'make balanced reductions to the remaining 7 FTE'. It was agreed to leave the table in for the moment.
page 28: 7.4.2, SL to re-work.
page 31: at the top, DB to consider.

DB advised that he would address the above comments today and then go through the document again tomorrow.
A Project Management section was required. On Thursday, SS would do the table/figure numbering, correct any typos and update the list of acronyms. SL asked about CB feedback; a version would be circulated on Wednesday to elicit CB comments. It was planned to submit the document on Thursday evening. DB thanked everyone for their valued work and contributions.

2. GridPP32
============
PG had received no input on the proposed agenda; there was room for one more talk. Who would give the keynote? It was noted that Graeme Stewart would give a talk.

The Standing Items were not reviewed. The reports submitted are below:

SI-1 Dissemination Report
-------------------------
SL reported on behalf of Tom Whyntie:

=> News Item - How Big is a year of Big Data for ATLAS?
This was a short News Item that used the ATLAS Dashboard to retrieve WLCG usage statistics for the ATLAS experiment in 2013. Figures for the T0, T1 and T2 sites suggested that 1.2 exabytes (EB) were processed in all, with some 10% of that carried out by UK sites (with RAL and QMUL among the 10 busiest sites). Rather than expressing this as a number of stacked CDs, the figure was converted into YouTube views of 2013's top "viral" video.
Useful links:
* [How Big is a year of Big Data for ATLAS?](http://www.gridpp.ac.uk/news/?p=3158);

=> Summer Science Festivals - Cheltenham Science Festival, June 2014, and the Royal Society Summer Science Exhibition
The Cheltenham Science Festival has approached TW with a pitch for sponsoring a "Big Data"-themed event at this year's festival. The standard rate for this is £3250 + VAT, but negotiations are ongoing (either joint sponsorship or a smaller event). At the invitation of Wahid Bhimji, TW has joined the team designing the "Higgs Boson and Beyond" stand at the Royal Society Summer Science Exhibition, with the aim of featuring GridPP's contribution to the Higgs boson physics programme.
WB and TW attended a meeting on Thursday 13th February 2014 to help develop ideas.
Useful links:
* [Cheltenham Science Festival 2014](http://www.cheltenhamfestivals.com/science);
* [Royal Society Summer Science Exhibition](http://royalsociety.org/summer-science/).

=> EGI Community Forum 2014
TW has submitted an abstract to the EGI Community Forum 2014 in Helsinki for a presentation entitled "Developing new GridPP user communities: a case study with CERN@school" under "Requirements and solutions for data management and computing", with the aim of promoting GridPP's activities in engaging new user communities.

SI-5 Production Manager's Report
--------------------------------
JC reported as follows:
Operations updates from the last week:

1) We have reviewed the January Tier-2 availability/reliability figures. There is a recognized problem in the way LHC VO SAM tests are submitted (they have no priority and currently use the WMS, which is being phased out); nevertheless, comments on the specific January figures are as follows:
* For ALICE: http://sam-reports.web.cern.ch/sam-reports/2014/201401/wlcg/WLCG_All_Sites_ALICE_Jan2014.pdf. All fine.
* For ATLAS: http://sam-reports.web.cern.ch/sam-reports/2014/201401/wlcg/WLCG_All_Sites_ATLAS_Jan2014.pdf (pages 8-9). Below 90% are:
  UCL (77%:77%) - TBC.
  Durham (83%:83%) - Cluster full.
  RALPP (82%:82%) - Trying to get more information from SAM.
  Sussex (71%:71%) - Encountered a host certificate problem. WN scratch space was full at the end of January. Some downtime while glexec issues were investigated.
* For CMS: http://sam-reports.web.cern.ch/sam-reports/2014/201401/wlcg/WLCG_All_Sites_CMS_Jan2014.pdf (page 8). Below 90% is:
  RALPP (72%:72%) - Trying to get more information from SAM.
* For LHCb: http://sam-reports.web.cern.ch/sam-reports/2014/201401/wlcg/WLCG_All_Sites_LHCB_Jan2014.pdf (pages 6-7). Below 90% are:
  Sheffield (77%:77%) - LHCb has lower priority than ATLAS. ATLAS SAM tests are run by the sgmatl user, which has top priority. We have given sgmlhb the same priority, but it doesn't help.
  Durham (86%:86%) - Cluster full.
  RALPP (83%:98%) - Trying to get more information from SAM.
2) perfSONAR is now considered a required service by WLCG. We made good initial progress, but some sites still need to resolve issues: ECDF, Sheffield, Brunel and RALPP.
3) LHCb cannot run reliably on ARC CEs. There are issues with jobs not setting up their environment without workarounds, and job monitoring does not update quickly. The latter problem often results in DIRAC aborting the job; since late January the jobs need queue information. These issues are becoming increasingly important because several other sites are looking seriously at moving to ARC. There are suggestions for ways forward, but at least one requires the use of RFC proxies. For the time being, ARC CE sites are running with suppressed MC jobs.
4) A RIPE ATLAS probe proposal is almost ready; I am waiting for feedback from RIPE on a few items before bringing this back to the PMB. Has the dissemination budget been checked?
5) There are plans to hold the next WLCG workshop in Barcelona in late June or, more likely, early July. A poll has been set up to gather feedback on potential dates: http://doodle.com/5s6dessc7vtem45n.
6) ATLAS has automated the metrics and process for evaluating T2D and ABCD status. This system is expected to go into production in March, and feedback is currently being gathered.
7) There was a GDB last week at CERN. A summary of the meeting is available at https://twiki.cern.ch/twiki/bin/view/LCG/GDBMeetingNotes20140212. UK participation in activities was noted several times, and the following gained specific mention: Alessandra Forti for leading the multi-core task force; Duncan Rand for work on perfSONAR and IPv6; and David Crooks for monitoring consolidation work.
8) The pre-GDB meeting was on Operations Coordination (https://indico.cern.ch/event/272784/). A decision taken at that meeting was that there is no need to arrange a joint scale test ahead of Run 2. CERN will decommission its WMSes in June. Experiment usage will stop in April, but the experiment SAM tests will continue until June.
9) The plan for moving the GridPP website to SHA-2 compatibility is to have a testable site, including the wiki, at the start of next week. This will be followed by a week of testing by several people. The new site will go live at the start of March.

ACTIONS AS OF 17.02.14
======================
There was no time to review the Actions:

512.2 Regarding the outturn forecast and the possible spend on tape media, travel, etc., PG to work out what was left and ask Tony Medland for re-profiling. PG should make a plan balancing staff against capital hardware and submit it as soon as possible (DK to assist).
513.3 DB to send PG the EVAL information required.
513.4 DC to follow up with Alex Efimov/Tom Whyntie regarding Simon's time on coding for bit-splitting work on Linux; DC to clarify the issues involved and report back if a PMB decision was required.
514.3 DC to provide information to PC to enable him to complete the network forward-look.
517.1 JC to write a few lines on networking and send them to DB for inclusion on page 24 of the GridPP5 Proposal.
517.2 JC to include the security issues sent by DK, re-send the amended document to DB, and remind DB that he needs to adjust the Security section accordingly.
517.3 ALL to let PG know any thoughts/preferences for the GridPP32 agenda.
518.1 AS to send SL some info on collaborating with industry in relation to the Tier-1.
518.2 SL to amend the Impact section; a new version of two paragraphs' length (perhaps 3/4 of a page) was required. SL would amend the version with the inputs noted above from AS on engagement with industry.
518.3 SL to add the background documents to the website and provide the URL for the Strategic Review Committee. Adam's document to be added.
518.4 SL/RJ to re-work 7.2.1 on page 18, starting the paragraph with Tier-2 in Runs 1 and 2 and moving the cloud information to the end.
518.5 DC to check on the yellow fraction in the UK as shown in the figure on page 19.

The next PMB would take place on Monday 24 February at 12:55pm.