UKHEPGRID@JISCMAIL.AC.UK



Subject:

Minutes of the 512th - 519th GridPP PMB meeting

From:

David Britton <[log in to unmask]>

Reply-To:

David Britton <[log in to unmask]>

Date:

Mon, 31 Mar 2014 10:59:40 +0100

Content-Type:

multipart/mixed

Parts/Attachments:


text/plain (56 lines) , 140224.txt (1 lines) , 140217.txt (1 lines) , 140210.txt (1 lines) , 140203.txt (1 lines) , 140127.txt (1 lines) , 140120.txt (1 lines) , 140113.txt (1 lines) , 131216.txt (1 lines)

Dear All,


Please find attached the GridPP Project Management Board
Meeting minutes for the 512th to 519th meetings.

The latest minutes can be found at:

http://www.gridpp.ac.uk/php/pmb/minutes.php?latest

as well as being listed with other minutes at:

http://www.gridpp.ac.uk/php/pmb/minutes.php

Cheers, Dave.

GridPP PMB Meeting 519 (24.02.14)
=================================

Present: Dave Britton (Chair), Pete Gronbech, Andrew Sansum, Jeremy Coles, Dave Kelsey, Pete Clarke, Claire Devereux, Tony Cass, Tony Doyle (Minutes - Suzanne Scott)

Apologies: Roger Jones, Steve Lloyd, Dave Colling

1. GridPP32 Agenda
===================

PG reported that all of the slots were filled. A member of the PMB would chair each session. DB thought it all looked fine - had PG checked with RJ/GS that the ATLAS talks were orthogonal? Yes. DB noted that Dan Protopopescu could be added for a 5-minute micro-talk, and also for other VOs if they wished - PG should refer Dan to Chris Walker, and CW could choose what he was doing. T2K did not need a 25-minute talk, 15-minute would do. DB thought that after lunch the jobs/disk/batch sessions looked fine, and also the clouds/VM/storage was ok too. SS would contact DDN regarding their presentation slot [done following the meeting]. DB noted that networking on day 2 and the monitoring etc, all looked fine. It was good that Tom Whyntie was doing a talk. DB summarised that the Agenda was looking good. There were no other comments.

2. Quarterly Reports
=====================

PG reported that he had only received four reports and had sent out reminders. There was a plea from PG for everyone to assist by doing their Quarterly Reports asap.

3. ResearchFish
================

PG reported that he had received minor inputs - he would send round a reminder email. He had received a list of papers from LHCb. RJ/DC should provide experiment papers as they had done last time. It was suggested that PG get these direct from 'Inspire'. PC would forward to PG the instructions for this. It was suggested to use an author count of over 500 as a selector. PG reminded the PMB that he needed a note of roles on international committees etc - if there were any new ones since last time/any changes, please let him know.

4. RIPE Proposal
=================

JC had circulated a proposal.
There had been a sub-meeting of the Ops Team group, who had discussed using probes in Schools, and encouraging Schools to adopt them as projects. Funds might be available in the Dissemination Budget? JC reported that 100 Probes would cost ~5000 Euros. If we had 100 Probes to distribute, we could work with other groups. We could also take on an Ambassador role and work with Schools. DB considered that this was a lot of money, and comments from DK would be required. This could be packaged as 'dissemination', how much remained in the budget? PC considered the cost far too high, and it was not a major thing for GridPP to be involved with. The level of 1000 Euros could perhaps be justified - what 'dissemination' could be got for that figure? JC noted the possibility of linked publicity, if we did not take on Ambassador status - we could deploy an Anchor. DB thought there was merit for Schools to do projects with this, but 5-6000 Euros was not defensible for GridPP to spend. JC would check the minimum investment possible. DB also noted that we had, at the present time, to be very cautious indeed as funding in the current climate was problematic. TD noted that sites could engage with RIPE independently. It was agreed that a small dissemination project, with investment around 1000 Euros would be possible - and we could see where this led.

STANDING ITEMS
==============

SI-0 Monthly Report from Development Group
-------------------------------------------

There was no report.

SI-1 Dissemination Report
--------------------------

SL had circulated a brief report from Tom Whyntie:

- Locked Space Technologies - update

Alex Efimov has been in touch with the group at Imperial working on the bit-splitting grid test bed. Simon Fayer will produce a CentOS VM image with the software required to demonstrate the technology working on the grid. The hope is to deploy it with OpenStack at Imperial or QML - 81 nodes would be required for a first demonstration.

- RIPE Atlas probes

See notes from Jeremy Coles - TW would be involved on the schools, distribution and reporting front (particularly regarding News Items).

SI-2 ATLAS weekly review & plans
---------------------------------

RJ was absent.

SI-3 CMS weekly review & plans
-------------------------------

DC was absent.

SI-4 LHCb weekly review & plans
---------------------------------

PC noted nothing major to discuss.

SI-5 Production Manager's Report
---------------------------------

JC reported as follows:

1) An EMI-2 decommissioning activity is about to start. Sites will receive alarms from 3rd March. EMI-2 is to be removed from production systems by April.

2) In relation to item 1), Matt Doidge is working on an EMI-3 WN tarball.

3) A GLUE2 validation exercise is starting. Test probes will become operational from 10th March and sites will then have two weeks to resolve issues before alarming is activated. There are several GridPP sites that need to resolve issues.

4) CMS is updating its T1 scheduling policy by reintroducing 5% share for analysis jobs (/cms/Role=pilot). This is being done partly as SAM tests are timing out, but also to allow more analysis work.

5) There were several FTS3 incidents at RAL on 18th February.

6) Testing of the SHA-2 compliant GridPP website is taking place this week. Following this GridPP is ready for the SHA-2 switch, though there are still some issues in WLCG which require manual workarounds. No show stopper issues have been identified by the French switch in January. The UK CA switch date is still mid-March.

7) Ahead of the March pre-GDB on batch systems we are gathering a UK overview at https://www.gridpp.ac.uk/wiki/Batch_system_status. As expected there is a lot of concern and growing problems in relation to Torque/Maui which most of our sites use. Early indications suggest that Slurm and HTCondor for larger sites are the preferred replacements. This will become an active area in the coming months.

SI-6 Tier-1 Manager's Report
-----------------------------

AS reported as follows:

Fabric:
--------

1) Capacity procurements:
   a) One tranche of disk accepted and now being tested by CASTOR team ahead of deployment;
   b) Second tranche of disk completed 4 week vendor proving test - scheduled to start 1 week acceptance test by RAL today;
   c) One tranche of CPU accepted - being performance and power benchmarked;
   d) Second tranche of CPU completed vendor proving test, now starting RAL acceptance tests.

2) Management network and Z9000 40Gb mesh in operation. New Tier-1 router change ready to deploy and waiting to fix a slot with site networking.

3) Preparing an upgrade to migrate the tape robot control system (ACSLS) from Solaris to Linux and to support T10KD.

4) The Tier-1 will be moved to a new site firewall - most likely date is now 17th March TBC.

Service:
---------

1) Some CASTOR test failures impacting availability - full report available at: http://www.gridpp.ac.uk/wiki/Tier1_Operations_Report_2014-02-19

2) CASTOR
   a) Chasing a problem that is causing occasional timeouts in the availability monitoring. No evidence it is impacting operations;
   b) CASTOR 2.1.14 upgrade just about ready to go - will probably approve next week and then ready for deployment;
   c) The CASTOR info provider (CIP) was upgraded to 2.2.16-1.

3) FTS-3 Outage
   There were two extended breaks on the FTS-3 test service last Tuesday. These occurred when an attempt was made to transparently migrate the server from the test hypervisor in ATLAS centre to the main R89 cluster. An SiR will be performed.

Staff:
------

1) George Ryall is now working for us until the end of August on CEPH and the cloud.

SI-7 LCG Management Board Report
---------------------------------

There had been no meeting.

AOB
===

It was reported that the next HEPiX meeting would take place in Oxford, on 23-27 March 2015.

REVIEW OF ACTIONS
=================

512.2 Regarding the outturn forecast and the possible spend on tape media, travel etc, PG to work out what was left and ask Tony Medland for re-profiling. PG should make a plan: balance staff against capital hardware and submit as soon as possible (DK to assist). Done, item closed.

513.3 DB to send to PG the EVAL information required. Ongoing.

513.4 DC to follow-up with Alex Efimov/Tom Whyntie regarding Simon's time on coding for bit-splitting work on Linux - DC to clarify the issues involved and report-back if a PMB decision was required. Ongoing.

514.3 DC to provide information to PC to enable him to complete the network forward-look. Action closed and reverted to PC (519.1).

517.1 JC to write a few lines on networking and send them to DB for inclusion on page 24 of the GridPP5 Proposal. Done, action closed.

517.2 JC to include the security issues, sent by DK, and re-send the amended document to DB, and remind DB that he needs to adjust the Security section accordingly. Done, action closed.

517.3 ALL: to let PG know any thoughts/preferences for the GridPP32 Agenda. Done, action closed.

518.1 AS to send SL some info on collaborating with industry in relation to the Tier-1. Done, action closed.

518.2 SL to amend the Impact section, a new version was required of two paragraphs' length, perhaps 3/4 of a page. SL would amend the version with inputs as noted above from AS re engagement with industry. Done, action closed.

518.3 SL to add the background documents to the website and provide the url for the Strategic Review Committee. Adam's document to be added. Done, action closed.

518.4 SL/RJ to re-work 7.2.1 on page 18, starting the paragraph with Tier-2 in Runs 1 and 2 and moving the cloud information to the end. Done, action closed.

518.5 DC to check on the yellow fraction in the UK as shown in the figure on page 19. Done, action closed.

ACTIONS AS OF 24.02.14
======================

513.3 DB to send to PG the EVAL information required.

513.4 DC to follow-up with Alex Efimov/Tom Whyntie regarding Simon's time on coding for bit-splitting work on Linux - DC to clarify the issues involved and report-back if a PMB decision was required.

519.1 PC to complete the Network Forward Look.

The PMB for Monday 3rd March was cancelled. The next PMB would take place on Monday 10th March at 12:55 pm. DB noted that he was away from Wednesday of this week, SL would cover in his absence.

GridPP PMB Meeting 518 (17.02.14)
=================================

Present: Dave Britton (Chair), Pete Gronbech, Andrew Sansum, Jeremy Coles, Dave Kelsey, Steve Lloyd, Roger Jones, Pete Clarke, Tony Cass, Tony Doyle, Dave Colling (Minutes - Suzanne Scott)

Apologies: Claire Devereux

1. The GridPP5 Papers
======================

The purpose of the meeting was to review the GridPP5 Project Brief & Guidelines; the GridPP Strategic Review Terms of Reference; and the draft GridPP5 Proposal document (v14b).

i) Project Brief
-----------------

Section 3.3: DB noted that we did address user support in the Deployment, Operations and Support (DOS) section, did a sentence need to be added to cover the point mentioned in 3.3? JC would add a sentence.

Section 3.3: Regarding the comment about a 'key element', we could add something about already having influenced the computing models and that we would continue to do so, eg: the CMS high level trigger farm. JC noted that we also deploy things and see how they work. PC considered that GridPP had made a major influence on the model. Could PC provide a sentence? Yes. RJ noted that GridPP had influenced the computing models from the start, and were key players in all decisions. PG noted also we had key roles in the collaboration as a whole. PC considered that this section 3.3 indicated that STFC thought of GridPP as being like Dirac, which was not the case - there was inherent confusion in the question. Our users were not physicists sitting at a machine, for example for LHCb it was a team issue. PC thought the question ill-founded by its focus on 'users' - ATLAS, CMS and LHCb were 'users'. UK sites were 'user-supporting'. DB noted that the big picture comprised how our resources were used in order to meet international obligations and also to meet UK user needs.

Section 3.3: Regarding the 'clear boundaries of responsibility' between GridPP and the Experiments, did we need to distinguish between the experiment support posts?
Were there posts on the experiment grants for software support? The grey area was in production and analysis, eg: with the ATLAS experiment grants. The ganga posts were situated in this grey area, but were within 'development' rather than 'operations' at the moment.

Section 3.4: Regarding the 'international context' and 'synergies' with STFC, AS noted that he had information in his document which could be used as a high-level extract? AS could write one? There were two parts to this: the first was the storage system CASTOR was used by STFC and the RC communities; second, other communities worked in this area. AS would provide a paragraph. DB noted there were also opportunities with UK3A.

Section 3.5: Regarding demonstrating close collaboration with industry, we could cite CERN here. SL noted that we also had collaborations with suppliers? DK added that we had collaborations through EGI and commercial clouds - ATLAS had been working with Google. AS considered that the Tier-1 had a few relevant things, he would write something.

ACTION 518.1 AS to send SL some info on collaborating with industry in relation to the Tier-1.

518.2 SL to amend the Impact section, a new version was required of two paragraphs' length, perhaps 3/4 of a page. SL would amend the version with inputs as noted above from AS re engagement with industry.

ii) Proposal Guidelines
------------------------

Number 3: DB noted that an analysis of risks and benefits was required. Project Management information was needed, the length of a page or two. The Panel would do a SWOT analysis (strengths, weaknesses, opportunities, threats). PG confirmed he was working on this. He had circulated the risks from GridPP4 and we must work out priorities for GridPP5. He had looked at previous documents to assist him with this. DB advised that PG should keep this brief - it should be an annual cycle of milestones, driven by pledges/procurement/installation/delivery.
DK asked whether we were not already moving towards metrics rather than milestones? DB noted yes, we were metrics-driven but we did need an annual set of milestones. DK considered that it was evolution and not development we should consider - it was the evolution element that would have milestones. DB noted we could review, for example, the medium-term future of CASTOR at a set point. DB advised that there were key milestones that could be extracted from AS's list. PG was tasked to do this, also look at high-level risks. The subject could be discussed offline.

Number 14: Regarding 'Collaborative projects' - we must include relevant information appropriate to this section.

iii) Terms of Reference
------------------------

Number 3.2: bulleted list - a SWOT analysis of the scenarios needed to be included within the Project Management section. Again, a yearly cycle of milestones and pledges would comprise an annual process of review. Regarding the last bullet on Impact, SL must provide more information as previously discussed. Regarding the bullet point on 'other funding', DK thought we should advise that funding from the EU was not guaranteed and should not be assumed on the part of the Review Panel. We should say what we received in GridPP4 and note how this will diminish. CD must provide the figure for GridPP4. TD suggested that we should comment on the reduction from previous phases, which have taken a downward slope. DK noted that the expectation was that we would receive less, in Horizon2020 etc, and this had to be stated clearly.

iv) draft GridPP5 Proposal Doc (v14b)
--------------------------------------

Cover page: It was noted that most Institutes wished to continue their association with GridPP. Confirmation was awaited from the last few.

page 1: DB asked if the Foreword was long enough? Should we add a set of background documents and provide the url? Agreed. SL to do this. SS to provide document numbers (done following the meeting).

ACTION 518.3 SL to add the background documents to the website and provide the url for the Strategic Review Committee. Adam's document to be added.

page 2: Motivation: it was agreed to leave-in the comment on ranking. It was agreed to use 'ILC' and add 'Collaborations' after NA62.

page 4: re clouds and I/O cost, the comment was included for information only. DB's highlighted statement should remain.

page 5: re SL's comment - remove the word 'such'

page 6: were these metrics for the 4th Quarter? No - it was confirmed these were for the 3rd. It was agreed to use them as a snapshot.

page 8: re TC's statement in 6.1 - should the sentence be included or not? It was agreed to delete it.

page 11: the wording in 6.5 should be changed to 'global exchange rate' rather than specifying a particular currency.

page 13: in 6.6 it was confirmed that SJ6 should be changed to JANET6.

page 14: in 7.1.1 could we include the numbers re reliability? Yes, agreed.

page 15: in 7.1.2 add the word 'slow' as suggested by TC.

page 18: in 7.2.1 this should begin with what the Tier-2s in Run 1 were used for etc, not the cloud information. Could SL iterate with RJ on this paragraph and move the cloud information to the end, and address the section generally? Yes. It was suggested to start the paragraph with Tier-2 in Runs 1 and 2.

ACTION 518.4 SL/RJ to re-work 7.2.1 on page 18, starting the paragraph with Tier-2 in Runs 1 and 2 and moving the cloud information to the end.

page 19: 7.2.1 - regarding the CMS section, 'vulnerable' needed to be clarified - should 'resilience' not be used? It depended on which site. DC noted that regarding Tier-2 functionality, the wording should be 'would remove this resilience'. It was pointed out that we would need 2FTE at each of the large sites - we needed to make the point that each of the extra 0.5FTE was funded by Brunel and RAL PPD. DB would change the text accordingly.

ACTION 518.5 DC to check on the yellow fraction in the UK as shown in the figure on page 19.

page 20: 7.2.2 - should the capacity numbers be updated? No - it was agreed to leave the table as it was, it provided a general indication of size and manpower.

page 21: the paragraph immediately below the table should be deleted. The paragraph beneath the table should begin: 'It is clear that the ..'. Regarding the estimation, this could refer to the background document in a single sentence beneath the table, or in a footnote (explanation of costs), or it could be at the end of the paragraph, highlighted.

page 22: 7.2.4 - below the plots, it was agreed to leave in the comparison to the Tier-1.

page 23: 7.3 - at the bottom of the page regarding monitoring systems and teams, AS to discuss with JC offline in case this question needed to be addressed (we run different services, in different ways, therefore they require to be monitored differently).

page 24: it was questioned whether to use 'sysadmins' due to the fEC issue - it was agreed to use instead: 'site personnel', 'grid expert', or 'Tier-2 staff' depending on relevant context.

page 25: regarding the comment on 'development' - it was better to show 'required evolution', using evolution rather than development generally.

page 26: at the top of the page, it was asked why this specifically related to ATLAS? It was noted this was historical, because Brian used to do ATLAS stuff. It was agreed to remove 'ATLAS' and replace with 'experiment liaison'.

page 27: 7.3.2 - The heading of D/O/S should be amended here and in other sections. It had to be in full at some point before the acronym could be used. Wording was also required to be changed to: 'make balanced reductions to the remaining 7 FTE'. It was agreed to leave the table in at the moment.

page 28: 7.4.2 - SL to re-work.

page 31: at the top: DB to consider.

DB advised that he would address the above comments today then go through the document again tomorrow.
A Project Management Section was required. On Thursday, SS would do the table/figure numbers, correct any typos, and update the list of Acronyms. SL asked about CB feedback? A version would be circulated on Wednesday to elicit CB comments. It was planned to submit the document on Thursday evening. DB thanked everyone for their valued work and contributions.

2. GridPP32
============

PG had received no inputs in relation to the proposed Agenda - there was room for one more talk. Who would do the keynote? It was noted that Graeme Stewart would give a talk.

The Standing Items were not reviewed. The reports submitted are below:

SI-1 Dissemination Report
-------------------------

SL reported on behalf of Tom Whyntie:

=> News Item - How Big is a year of Big Data for ATLAS?

This was a short News Item that made use of the ATLAS Dashboard to retrieve WLCG usage statistics for the ATLAS experiment in 2013. Figures for the T0, T1 and T2 sites suggested that 1.2 exabytes (EB) were processed in all, with some 10% of that carried out by UK sites (with RAL and QMUL in the top 10 busiest sites). Rather than compare this to the number of stacked CDs this represents, the number was converted to YouTube video views of 2013's top "viral" video.

Useful links:
* [How Big is a year of Big Data for ATLAS?](http://www.gridpp.ac.uk/news/?p=3158);

=> Summer Science Festivals - Cheltenham Science Festival, June 2014, and the Royal Society Summer Science Exhibition

Cheltenham Science Festival have approached TW with a pitch for sponsoring a "Big Data"-themed event at this year's science festival. The standard rate for this is £3250 + VAT but negotiations are ongoing (either joint sponsorship or a smaller event).

TW has, at the invitation of Wahid Bhimji, joined the team designing the "Higgs Boson and Beyond" stand at the Royal Society Summer Science Exhibition with the aim of featuring GridPP's contribution to the Higgs boson physics programme.
WB and TW attended a meeting on Thursday 13th February 2014 to help develop ideas.

Useful links:
* [Cheltenham Science Festival 2014](http://www.cheltenhamfestivals.com/science);
* [Royal Society Summer Science Exhibition](http://royalsociety.org/summer-science/).

=> EGI Community Forum 2014

TW has submitted an abstract for the EGI Community Forum 2014 in Helsinki for a presentation entitled "Developing new GridPP user communities: a case study with CERN@school" to the "Requirements and solutions for data management and computing" with the aim of promoting GridPP's activities in engaging new user communities.

SI-5 Production Manager's Report
--------------------------------

JC reported as follows:

Operations updates from the last week:

1) We have reviewed the January Tier-2 availability/reliability figures. There is a recognized problem in the way in which LHC VO SAM tests are submitted (they have no priority and currently use the WMS which is being phased out), nevertheless comments on the specific January figures are as follows:

* For ALICE: http://sam-reports.web.cern.ch/sam-reports/2014/201401/wlcg/WLCG_All_Sites_ALICE_Jan2014.pdf. All fine.

* For ATLAS: http://sam-reports.web.cern.ch/sam-reports/2014/201401/wlcg/WLCG_All_Sites_ATLAS_Jan2014.pdf (page 8-9). Below 90% are:
  UCL (77%:77%) - TBC
  Durham (83%:83%) - Cluster full.
  RALPP (82%:82%) - Trying to get more information from SAM.
  Sussex (71%:71%) - Encountered a host certificate problem. WN scratch space was full at the end of January. Some downtime while glexec issues investigated.

* For CMS: http://sam-reports.web.cern.ch/sam-reports/2014/201401/wlcg/WLCG_All_Sites_CMS_Jan2014.pdf (page 8). Below 90% is:
  RALPP (72%:72%) - Trying to get more information from SAM

* For LHCb: http://sam-reports.web.cern.ch/sam-reports/2014/201401/wlcg/WLCG_All_Sites_LHCB_Jan2014.pdf (pages 6-7). Below 90% are:
  Sheffield (77%:77%) - LHCb has lower priority than ATLAS.
  ATLAS SAM tests are run by the sgmatl user and it has top priority. We have the same priority for sgmlhb but it doesn't help.
  Durham (86%:86%) - Cluster full
  RALPP (83%:98%) - Trying to get more information from SAM

2) perfSONAR is now considered by WLCG as a required service. We made good initial progress but still have some sites needing to resolve issues: ECDF, Sheffield, Brunel and RALPP.

3) LHCb cannot run reliably on ARC CEs. There are issues with jobs not setting their environment without workarounds and job monitoring does not update quickly. This latter problem often results in DIRAC aborting the job - since late January the jobs need queue information. These issues are becoming increasingly important because we have several other sites looking seriously at moving to ARC. There are suggestions for ways forward but at least one requires the use of RFC proxies. For the time being ARC CE sites are running with suppressed MC jobs.

4) A RIPE ATLAS probe proposal is almost ready; I am waiting for feedback from RIPE on a few items before bringing this back to the PMB. Has the dissemination budget been checked?

5) There are plans to hold the next WLCG workshop in Barcelona late June or more likely early July. A poll was set up to gather feedback on potential dates: http://doodle.com/5s6dessc7vtem45n.

6) ATLAS has automated the metrics and process for evaluating T2D and ABCD status. This system is expected to go into production in March and feedback is currently being gathered.

7) There was a GDB last week at CERN. A summary of the meeting is available via https://twiki.cern.ch/twiki/bin/view/LCG/GDBMeetingNotes20140212. UK participation in activities was noted several times and the following gained specific mention: Alessandra Forti in connection with leading the multi-core task force; Duncan Rand for work on perfSONAR and IPv6 and David Crooks in relation to monitoring consolidation work.

8) The pre-GDB meeting was on Operations Coordination (https://indico.cern.ch/event/272784/). A decision taken at that meeting was that there is no need to arrange a joint scale test ahead of Run-2. CERN will decommission its WMSes in June. Experiment usage will stop in April but the experiment SAM tests will continue until June.

9) The plan for the GridPP website move to SHA-2 compatibility is to have a testable site including the wiki at the start of next week. This will be followed by a week of testing with several people. The new site will go live at the start of March.

ACTIONS AS OF 17.02.14
======================

There was no time to review the Actions:

512.2 Regarding the outturn forecast and the possible spend on tape media, travel etc, PG to work out what was left and ask Tony Medland for re-profiling. PG should make a plan: balance staff against capital hardware and submit as soon as possible (DK to assist).

513.3 DB to send to PG the EVAL information required.

513.4 DC to follow-up with Alex Efimov/Tom Whyntie regarding Simon's time on coding for bit-splitting work on Linux - DC to clarify the issues involved and report-back if a PMB decision was required.

514.3 DC to provide information to PC to enable him to complete the network forward-look.

517.1 JC to write a few lines on networking and send them to DB for inclusion on page 24 of the GridPP5 Proposal.

517.2 JC to include the security issues, sent by DK, and re-send the amended document to DB, and remind DB that he needs to adjust the Security section accordingly.

517.3 ALL: to let PG know any thoughts/preferences for the GridPP32 Agenda.

518.1 AS to send SL some info on collaborating with industry in relation to the Tier-1.

518.2 SL to amend the Impact section, a new version was required of two paragraphs' length, perhaps 3/4 of a page. SL would amend the version with inputs as noted above from AS re engagement with industry.

518.3 SL to add the background documents to the website and provide the url for the Strategic Review Committee. Adam's document to be added.

518.4 SL/RJ to re-work 7.2.1 on page 18, starting the paragraph with Tier-2 in Runs 1 and 2 and moving the cloud information to the end.

518.5 DC to check on the yellow fraction in the UK as shown in the figure on page 19.

The next PMB would take place on Monday 24 February at 12:55pm.

GridPP PMB Meeting 517 (10.02.14)
=================================

Present: Dave Britton (Chair), Pete Gronbech, Andrew Sansum, Jeremy Coles, Dave Kelsey, Steve Lloyd, Roger Jones, Pete Clarke, Tony Cass, Claire Devereux (Minutes - Suzanne Scott)

Apologies: Tony Doyle, Dave Colling

1. The GridPP5 Papers
======================

The papers were reviewed as to status:

A. Tier-2 (SL)
---------------

SL reported that he had received input from RJ, DC and PG. He was amending v5 now, as a separate document. He was aware that the structure was not ideal however he was unsure as to what further inputs he could provide. DB noted that all of the other documents were still being worked on. He suggested that the PMB use the time during this meeting to address issues within v7 of the GridPP5 Proposal document which had been circulated and commented upon:

- page 2: regarding the background and context, DB would re-order this and re-draft it

- page 4: DB asked what the current status was of the European context and Horizon2020? CD was working on this.

- page 5: DB considered that we could dismiss comment 1 by stating 'on budget', which was a secure statement. AS noted the comment on risk. DB thought we did need to say this as it was a response to the success of the machine commissioning, which had been more successful than anticipated. AS noted that we must show that the planning was solid. DB would re-word the section. Regarding the comments on 'domino effect', DB would re-phrase this to draw out the stability aspects - it was important that no precedent was set that would start to weaken WLCG.

- page 6: This related to the discussion on work packages. It was agreed to retain the work package on Management/Travel/Impact/Admin. Regarding WP-C, it was felt that this should be amalgamated with WP-D. In the past GridPP had tried to ring-fence the experiment support posts but they could now be added to WP-C. It was recognised that there was a danger that this WP would be cut.
DB agreed to combine C and D, and this would also make writing easier. The experiment support posts had been created for the Tier-1 but these had been an invaluable resource and it was expected that these posts would widen and be involved with the Tier-2 within the changing model.

- page 8: AS had commented on the words 'private cloud'. It was agreed that this Technical document had been submitted very late as the structure and content of the master document were already extant. Did we need, now, to insert an earlier section on context which showed an evolution of technology? AS considered yes, but the document already comprised a large amount of background. DB agreed that it was hard to strike a balance between complexity and context, in order to inform both the 'why' and the 'what'. AS recommended that a statement like: 'there is a move to private clouds' should be mentioned. PG noted that later on in the document there was a comment on cloud resource and using external suppliers like Google and Amazon. DB noted this was a move in the other direction. He agreed to insert some relevant information into the context section. SL advised that the significance of 'cloud' as opposed to 'region' should be kept separate as they were different things.

- page 10: Should a full table be included or only a summary? DB noted this would be re-done for the proposal anyway. It was agreed to leave-in the full table as this showed due diligence.

- page 22: Regarding the comment on explanation of why there were separate monitoring systems and teams, and different 'Production Managers' - these were becoming confused in the document. Work was needed with JC to produce a common picture and language. The back-up documents would be posted on the webpage. It was agreed to change the titles to: 'Tier-1 Production Manager' and 'GridPP Production Manager' to ensure the distinction between the two was obvious.

- page 24: DB had extracted and condensed the list of tasks.
Should we add 'networking' (Robin Tasker)? JC noted that we were doing things in the networking arena, eg: IPV6. DB asked whether we needed an extra few lines about networking, and could someone write this? Yes, JC would do so. ACTION 517.1 JC to write a few lines on networking and send them to DB for inclusion on page 24 of the GridPP5 Proposal. DK reminded that in all sections, the security issues needed updating. DK had sent this information to JC last week. JC apologised for the non-inclusion, which had occurred due to the different versions being circulated. ACTION 517.2 JC to include the security issues, sent by DK, and re-send the amended document to DB, and remind DB that he needs to adjust the Security section accordingly. - page 25: This related to the Tier-2 staff occupied in Core Ops-Team tasks. JC noted that the table did not record the extra contributions. DB advised that he was trying to defend funded posts. The 20 people at the Tier-2 were not just running hardware, if we lost them we would lose more than the hardware. It seemed that 6FTE was a reasonable number. PG considered this equated to 12 @ 0.5FTE which seemed about right in reality. JC agreed. PG noted that we had not mentioned Nagios monitoring. PG would add this. DB asked if there were any other issues to raise? No. DB advised that the current status was that DB was awaiting input from SL on the Tier-2, following which DB would combine the work packages as discussed, then deal with the sub-flat-cash option, describing the de-scoping steps as per Document G. DB was currently working on Document G and this would take until midweek to finish. He would circulate an amended version by close-of-play on Wednesday. Regarding the Collaboration Board (CB), it was felt that the situation was still evolving and there were no definitive facts to give them. It was agreed that at some point, the PMB must decide when to inform/consult the CB. DK asked about submission of the Outline by 18th February. 
It was noted that it would be difficult to do anything other than inform the CB about this - a consultation was not possible. This was not the final Proposal in relation to Institutes etc. It was agreed to circulate the final version of the current Outline document to the CB before submission to the Strategic Review Panel. DB noted that we were presenting two scenarios as requested, however the second scenario that led to de-scoping required step-change descriptions as to what was actually possible with what funding level. DB would speak to SL (who had now left the meeting) regarding the CB. It would be possible to give the CB access to the background documents just before final submission on 18th February. 2. GridPP32 ============ Regarding the Agenda, PG had now populated the page. DB might wish to re-order this? DB noted that the T2K update/InstantUI/VOs/SMEs might go well together in one session? There should be another session focussed on 'other' experiments and impact, ie: a non-LHC session. DB asked PG to add-in a slot for DDN to do a presentation at the end of session 4, around 20 minutes, entitled 'DDN & LHC Computing'. PG asked if there were any value-added Tier-2 talks? Or any additionals from the Tier-1, eg: the new disk system? DB asked PG to try and schedule some discussion time. ACTION 517.3 ALL: to let PG know any thoughts/preferences for the GridPP32 Agenda. 3. AOB ======= - Matt Williams from Birmingham had asked to attend the Berlin Python Conference. He had applied to give a Ganga talk. The cost might be ~£900 inclusive of Registration Fee, and would take place on 21st July. He would submit an abstract. It was agreed that if the Paper was accepted then he would be funded to go. - PG asked about the hardware money, had the PIs spent this on time? It seemed yes, as positive replies had been received. Info from Imperial; RHUL; Manchester; Edinburgh was awaited. PG would chase this. 
STANDING ITEMS ============== SI-1 Dissemination Report -------------------------- SL had left the meeting. SI-2 ATLAS weekly review & plans --------------------------------- RJ reported that there was a lot of work going on re optimising frontier access; other recent issues had been partly resolved; 50% of transfers were now using FTS3; the re-naming of the file management system was now complete - this exercise had thrown-up missing files generally, so it had been a useful exercise. SI-3 CMS weekly review & plans ------------------------------- DC was not present. SI-4 LHCb weekly review & plans -------------------------------- PC noted nothing major to report. SI-5 Production Manager's Report --------------------------------- JC reported as follows: 1) A subset of the operations team has discussed the options for taking the RIPE ATLAS probe engagement forward. A proposal will be presented next week. 2) We are reviewing the January availability/reliability reports which are the first to focus on the experiment SAM results. As noted elsewhere there is an impact for some sites as a result of the test jobs having regular priority. 3) There was good news at the WLCG middleware readiness meeting last week (https://indico.cern.ch/event/272784/) in that INFN management have committed to long-term maintenance of the EMI repository. We are making steady progress towards a workable m/w readiness approach. 4) There is a face-to-face meeting of the WLCG Operations Coordination Team at CERN tomorrow (https://indico.cern.ch/conferenceDisplay.py?confId=272784). Focus is on the Experiment Computing Commissioning activities. 5) Sussex has now managed to reach stability following their sysadmin changes. 6) We are currently reviewing Glue2 publishing – several GridPP sites have publishing issues to address. SI-6 Tier-1 Manager's Report ----------------------------- AS noted nothing major to report this week. 
SI-7 LCG Management Board Report
---------------------------------
There had been no meeting.

REVIEW OF ACTIONS
=================
512.2 Regarding the outturn forecast and the possible spend on tape media, travel etc, PG to work out what was left and ask Tony Medland for re-profiling. It was noted that we need to say what we are going to spend this on - PG should make a plan: balance staff against capital hardware and submit as soon as possible. DK asked what the SLA estimates were for next year. PG/DK should sort out this issue. DB noted we could do the capital/resource split on an annual basis. Ongoing.
513.3 DB to send to PG the EVAL information required. Ongoing.
513.4 DC to follow up with Alex Efimov/Tom Whyntie regarding Simon's time on coding for bit-splitting work on Linux - DC to clarify the issues involved and report back if a PMB decision was required. Ongoing.
514.2 DC to provide information about the CMS computing model to SL for the Tier-2 document. Done, item closed.
514.3 DC to provide information to PC to enable him to complete the network forward-look. Ongoing.
515.3 ALL: to send DB suggestions for a theme for GridPP32 and suggestions for people who could give talks. Done, item closed.

ACTIONS AS OF 10.02.14
======================
512.2 Regarding the outturn forecast and the possible spend on tape media, travel etc, PG to work out what was left and ask Tony Medland for re-profiling. PG should make a plan: balance staff against capital hardware and submit as soon as possible (DK to assist).
513.3 DB to send to PG the EVAL information required.
513.4 DC to follow up with Alex Efimov/Tom Whyntie regarding Simon's time on coding for bit-splitting work on Linux - DC to clarify the issues involved and report back if a PMB decision was required.
514.3 DC to provide information to PC to enable him to complete the network forward-look.
517.1 JC to write a few lines on networking and send them to DB for inclusion on page 24 of the GridPP5 Proposal.
517.2 JC to include the security issues, sent by DK, and re-send the amended document to DB, and remind DB that he needs to adjust the Security section accordingly.
517.3 ALL: to let PG know any thoughts/preferences for the GridPP32 Agenda.

The next PMB would take place on Monday 17 February at 12.55pm. This meeting would sign off the GridPP5 Outline Proposal to be submitted to the Strategic Review Panel on the following day. The background document must therefore be completed by Thursday of this week, and circulated to the CB on Friday.
GridPP PMB Meeting 516 (03.02.14) ================================= Present: Dave Britton (Chair), Pete Gronbech, Andrew Sansum, Jeremy Coles, Tony Doyle, Dave Kelsey, Steve Lloyd, Roger Jones, Pete Clarke, Tony Cass, Dave Colling (Minutes - Suzanne Scott) Apologies: Claire Devereux 1. GridPP5 Strategic Review Papers =================================== The current status of the Papers was reviewed: A. Tier-2 (SL) --------------- SL had circulated the last version of his document, however information which he required was still awaited. He had not therefore circulated the latest version - this was pending information from CMS. SL confirmed that he had received the other inputs he required. SL advised that the electricity calculations were based on information from AS. There ensued a discussion on electricity costs. It was noted that kit generally was on average two years old, and SL should utilise the numbers from AS and extrapolate them as appropriate. SL confirmed that his document was nearly complete, except for the CMS information. B. Tier-1 (AS) --------------- AS had circulated a 29-page document. DB commented that this looked good and was very useful. AS confirmed that the bulk of it was complete and he had only a few sections to finalise and a few more days should do it. DB advised that a similar case as this was required for the Tier-2s. C. Technical Overview (DC) --------------------------- DC had circulated nothing to date. DC promised he would circulate something this afternoon. Was there any capital investment at Imperial? DC noted yes, they had received large funding of £100k for infrastructure support from the University. DB advised him again that the Technical Overview paper was well overdue. DC said that he did have a draft, and would work on it today. DB reminded DC that the document must include cloud computing and its prospective cost. 
It should also describe the way in which CERN was moving to provision at Tier-0 in relation to cloud technology, and the impact at RAL. No draft had been available yet from DC to enable iterations on content, and time was now exceedingly short.

D. Deployment, Operations & Support (JC/TD)
--------------------------------------------
Versions of this paper had been circulated. JC had not yet circulated the final draft; he would circulate the latest version today. DB reminded JC that we required a powerful and coherent argument for the Tier-2s - this should include the hardware staff being the backbone of the Ops/Deployment Team and dealing with Core Tasks - this was as important as site-based work. There was no information included for these staff members so far. The support case for the posts at the Tier-2s was required. JC confirmed he would finalise a version within the next few days.

E. Experiment Requirements (DB)
--------------------------------
DB reported that he had modified his document. PG had checked the numbers and they seemed ok based on the data we had.

F. Rationale for 50% Plan (DB)
-------------------------------
DB noted that this document had now been incorporated elsewhere, in what had become a new Document G. This had been circulated last week showing the four levels possible for four different funding scenarios. Some inputs were still awaited. DB had circulated a modified version over the weekend. This document would form the crux of the submission to the Strategic Review Panel. Feedback from the PMB was required. Providing scenarios which covered the gap between the £6.2 million 'Minimal Frozen Service' (MFS) and the £4.7 million 'Partial Tier-1' went beyond the brief requested, so these had not been provided. It was not practical to provide additional scenarios to the four presented. PC asked if we knew who the panel members were. DB noted yes, we had a list now.
DB outlined who the Strategic Review Panel members were, and what experiments/institutes they were involved with. Concern was noted that the members may not know much about the industrial-level computing of the LHC. PC suggested that we present our information as if to a complete novice, showing that we cannot do differently to our international collaborators, especially as none of our main customers were presently on the panel. DB noted that we could reply, noting that there was no-one on the Panel who either understood the experiment requirements or the entire service-level of GridPP. Other than that, there were no major objections to the Panel members proposed. DB noted particularly that, at the Minimum Frozen Service (MFS) of £6.2 million - below this level we would be forced to renege on international commitments, and this could close experiments down, therefore discussion at management level would be required. Were there any other comments on Document G? PC, SL and PG had provided comments. It was noted that we were not discussing the relocation of hardware. SL asked if DB could make Tables 8 and 9 clearer. DB advised that more feedback on this document would be welcome - it was urgent however, as all had to be finalised within 48 hrs max. Final Drafts of all documents had to be completed by Wednesday. There ensued a discussion on data preservation at the Tier-1 and how did STFC maintain this? Functionality of the Tier-1 and data preservation issues should be added to the proposal. DB reported that he had started on the GridPP5 document and would have an outline proposal for the PMB by next Monday. The possible date for the Strategic Review was 17th March, just prior to the GridPP32 Collaboration Meeting. 2. GridPP32 Collaboration Meeting ================================== DB reported that the website was being worked on, and Registrations would be open soon. 
There were no talks scheduled yet - the theme would be 'taking stock' or a 'mini review' or similar, in order to discuss the updated computing models and technical developments. Suggested talks please to DB. Regarding sponsorship, DELL had pulled out and DDN would sponsor. DELL may sponsor GridPP33.

3. STFC ResearchFish
=====================
PG advised that PIs were still being asked to add information on an individual basis. Reminders were being sent. PG had asked PIs to send inputs, but he had received nothing. The deadline for inputs was the end of March 2014.

4. HEPSYSMAN
=============
PG had circulated an email requesting approval of a two-day meeting at RAL in early Spring/Summer. The meeting would be on monitoring. DC would do a presentation. This was approved.

STANDING ITEMS
==============

SI-1 Dissemination Report
--------------------------
SL reported on behalf of Tom Whyntie as follows:

Institute of Physics Top 50 Work Placements Scheme - £2000 for student-based GridPP engagement projects

A potential source of effort for projects that engage external groups (SMEs, non-HEP research groups, etc.) is summer students, based either with the GridPP institution or the external group in question. The IOP runs the "Top 50 Work Placements Scheme" that funds students to the tune of £2000, as well as providing support for student(s) while on placement. Likewise,

Key dates:
* **3 March 2014**: Deadline for employers to submit their request for a bursary;
* **7 April 2014**: 'Top50' Work Placements Scheme opens to students;
* **28 April 2014**: Scheme is closed to applications.

Useful links:
* [Scheme homepage](http://www.iop.org/careers/top50/index.html).

Meeting with Palantir Technologies

On Thursday 30th January 2014, TW met with two former members of the CMS experiment who now work for the UK offices of Palantir Technologies.
While there is probably not the scope for direct collaboration - in terms of scale, nature of business, and _modus operandi_ - it was a fruitful discussion and interesting to learn how former members of the HEP community were working in commercial/governmental "Big Data". A number of technologies used in the Big Data community were also discussed that may be of interest to the GridPP community (see links below). They may also be willing to support more traditional "outreach" activities (school code workshops, competitions).

Useful links:
* [The Palantir Technologies homepage](http://www.palantir.com/);
* [Palantir's demonstration platform, using US Government open data](https://analyzethe.us/);
* [Protocol Buffers - Protobuf - Google's data encoding format](https://code.google.com/p/protobuf/);
* [Apache Spark](http://spark.incubator.apache.org/);
* [Hadoop](http://hadoop.apache.org/);
* [Kaggle - platform for big data competitions](http://www.kaggle.com/).

SI-2 ATLAS weekly review & plans
---------------------------------
RJ reported that disk space was being freed up at RAL; the CMS issue had to do with transfer loads on the disk servers - these had halved the ATLAS transfers; the events service at RAL in relation to CASTOR was being investigated; the Tier-2s were quiet.

SI-3 CMS weekly review & plans
-------------------------------
There was nothing to report.

SI-4 LHCb weekly review & plans
--------------------------------
There was nothing to report.

SI-5 Production Manager's Report
---------------------------------
JC reported as follows:

1) Several of our sites have installed free probes from RIPE – an open consortium that deploys probes to “measure Internet connectivity and reachability, providing an unprecedented understanding of the state of the Internet in real time”. The consortium is seeking to improve networks for the benefit of all (public good) using member leverage. The measurements complement perfSONAR.
RIPE coordinates and relies on sponsorship for which extra data access is given. There is a suggestion arising from the core operations team that GridPP may wish to consider becoming a RIPE sponsor. It would help to show further leadership in the UK High Throughput Networking area and we have an opportunity for gaining some useful data and publicity and build upon our impact agenda (example output can be seen here https://atlas.ripe.net/results/maps/). The probes do not replace perfSONAR which is a HEP solution (that will be less widely deployed than RIPE) but it would improve our reach and diagnostic capabilities (which will definitely be needed moving to IPv6 for example) while also supporting the wider community. We could sponsor probes to cover other JANET sites. I have made contact with RIPE and can elaborate on costs and develop a proposal if the PMB is behind the idea in principle. Our involvement can vary from sponsor to ambassador to providing not just probes at sites but an anchor point (https://atlas.ripe.net/about/anchors/). Following our ops meeting discussion last week Ewan MacMahon sent out a summary which puts this in context, there is also a summary that is useful (reproduced below). There ensued a discussion about RIPE - it was thought that sites could do this themselves, but was GridPP sponsorship at a level of £2k possible? This could be an outreach opportunity linked to a Schools programme, however we needed to be careful about spending funds. This might be possible as part of an outreach programme for small cost, however a robust proposal was needed. JC should speak to Tom Whyntie and put in a proposal to the PMB noting both costs and benefits. 2) Globus has now fixed issues that led to (default) key size incompatibilities between grid middleware products using openssl. We are waiting for these to be incorporated into deployable releases and will then upgrade service nodes before requesting sites to upgrade other node types. 
3) Batch system evolution and also support for multi-core jobs continues to be a current ‘hot topic’. There are plans to push this work forward at a pre-GDB in March. Site engagement is crucial. 4) There is a plan to migrate some SAM services from CERN to an EGI consortium based on GRNET, CNRS and SRCE. The experiment SAM testing will remain with CERN. The only WLCG visible impact should be short downtimes during infrastructure reconfiguration over the coming months. It is worth noting that this division will also lead to a fork in the SAM code base. Although WLCG will see little impact, it is a move that can create confusion for sites in the future as we remain part of WLCG and EGI. 5) The WLCG ops coordination team is stepping up efforts in the area of WMS decommissioning. The CERN shared instances are being phased out on the April timescale. This has the implication that we will ourselves in the near future be running some services purely for non-LHC communities. 6) Plans are being developed for Experiment Computing Commissioning in 2014. The activity will be "STEP14". The plans will be developed during an Operations Coordination F2F meeting in February. SI-6 Tier-1 Manager's Report ----------------------------- AS reported as follows: Fabric ------- 1) All capacity equipment now delivered. One tranche of disk and one tranche of CPU now successfully completed - vendor proving tests. Now moving into our own shorter acceptance tests. Second disk and cpu tranches in vendor proving test. [no update from last week as 2 staff involved unexpectedly out of office today]. 2) A number of major network changes are in the process of being scheduled. a) On the 11th March the Tier-1 control traffic will be moved from the old firewall to the new firewall. 
As the old rule-set needs to be re-implemented on a different manufacturer's hardware this is potentially going to lead to teething problems;
b) Late in February (TBC) the Tier-1 will deploy its own top level router and move to a direct 40Gb bypass of the firewall for data traffic;
c) Two preparatory changes will need to be implemented in the next few weeks:
- physically separate out the management network (manages the network switches)
- deploy the Z9000 mesh layer (required ahead of the main change in order to attach new hardware).

Service
-------
Operations again very smooth last week. Details at: https://www.gridpp.ac.uk/wiki/Tier1_Operations_Report_2014-01-22

Staff
-----
1) George Ryall will be joining the team for 6 months. George will be working on the CEPH storage system and proto-cloud.
2) Eddie Grabczewski will be leaving the database team in March. Eddie not only worked 70% on the Tier-1 Databases but also supported on-call. In the short term we are considering getting a contractor to help cover daytime activities.

SI-7 LCG Management Board Report
---------------------------------
There had been no meeting.

REVIEW OF ACTIONS
=================
512.2 Regarding the outturn forecast and the possible spend on tape media, travel etc, PG to work out what was left and ask Tony Medland for re-profiling. It was noted that we need to say what we are going to spend this on - PG should make a plan: balance staff against capital hardware and submit as soon as possible.
513.2 PG to review the EVAL requirements and email everyone a summary and reminder. Done, item closed.
513.3 DB to send to PG the EVAL information required. Ongoing.
513.4 DC to follow up with Alex Efimov/Tom Whyntie regarding Simon's time on coding for bit-splitting work on Linux - DC to clarify the issues involved and report back if a PMB decision was required. Ongoing.
514.2 DC to provide information about the CMS computing model to SL for the Tier-2 document. Ongoing.
514.3 RJ/DC to provide information to PC to enable him to complete the network forward-look. RJ had provided this. Action on DC.
515.1 AS to give SL the numbers for power usage for delivery of disk & CPU, to enable SL to calculate electricity costs. Done, action closed.
515.2 JC to consider the issue of DIRAC funding and manpower, in comparison with GridPP. Done, action closed.
515.3 ALL: to send DB suggestions for a theme for GridPP32 and suggestions for people who could give talks. Ongoing.

ACTIONS AS OF 03.02.14
======================
512.2 Regarding the outturn forecast and the possible spend on tape media, travel etc, PG to work out what was left and ask Tony Medland for re-profiling. It was noted that we need to say what we are going to spend this on - PG should make a plan: balance staff against capital hardware and submit as soon as possible.
513.3 DB to send to PG the EVAL information required.
513.4 DC to follow up with Alex Efimov/Tom Whyntie regarding Simon's time on coding for bit-splitting work on Linux - DC to clarify the issues involved and report back if a PMB decision was required.
514.2 DC to provide information about the CMS computing model to SL for the Tier-2 document.
514.3 DC to provide information to PC to enable him to complete the network forward-look.
515.3 ALL: to send DB suggestions for a theme for GridPP32 and suggestions for people who could give talks.

The next PMB would take place on Monday 10 February at 12.55 pm.
GridPP PMB Meeting 515 (27.01.14) ================================= Present: Dave Britton (Chair), Pete Gronbech, Andrew Sansum, Jeremy Coles, Tony Doyle, Dave Kelsey, Steve Lloyd, Roger Jones, Pete Clarke, Claire Devereux (Minutes - Suzanne Scott) Apologies: Tony Cass, Dave Colling 1. GridPP5 Proposal Documents ============================== The present status of these draft documents was reviewed: A. Tier-2 (SL) --------------- SL had sent round an update, however he was still unclear about the electricity issue - he had put in an estimate. PG considered that it wasn't worth doing a great deal of work on this and advised that SL should get the information from sites. DB noted that it wouldn't be possible to get figures from everyone, but what should we say in general about electricity? PC considered this was very difficult as things would be done differently in different places so it was hard to get a consistent picture - it could be costed at the Tier-2s ok. SL noted that for the Tier-2 at RAL, it was likely they had to pay electricity on that. AS advised that they were required to outline their costs at RAL. DB thought we should state what the cost would be to power kit at RAL - get the current information on this, and it could be scaled from the Tier-2 figures. SL noted that he needed AS's spend to deliver disk and CPU, the Tier-2 delivered 1.2 or 1.6 of that, therefore he could multiply the number. This would make the baseline number fairly robust. ACTION 515.1 AS to give SL the numbers for power usage for delivery of disk & CPU, to enable SL to calculate electricity costs. SL advised that he was still waiting on site descriptions. JC had summarised the up-to-date status by email. CMS information was also awaited. DB considered that for the small sites, SL could write a sentence based on his knowledge, as further information was unlikely to be forthcoming. SL noted that he needed a definitive list of FTEs at each site. 
DB reminded that SL should focus on leverage / engagement / reduced fault tolerance (redundancy). B. Tier-1 (AS) --------------- AS advised that version 6 had been circulated on Saturday. He was trying to walk a fine line - there was a lot of stuff going on at the Tier-1 and having done many good things he needed to show that they had only gained effort in order to stand still. Efficiency savings had already been made and the staff effort was down to 17.5FTE. AS considered that going forward it would be a big mistake to cut effort to 15FTE, as 2.5FTE of effort was allocated to some large upcoming projects that will need key effort. His document was nearly completed. DB agreed with the arguments and asked AS to develop them to conclusion. The document was in reasonable shape. C. Technical Overview (DC) --------------------------- Nothing had yet been received from DC. D. Deployment, Operations & Support (JC) ----------------------------------------- JC reported that he had already circulated the latest version - it was in good shape but some areas needed development. DB noted that he had circulated a spreadsheet last week which showed post development, along with various models. The cut-backs proposed were minimal. The scope of work for both VOMS and Security was being considered, also some EGI effort had to be decided upon. DB reminded that the overall document had to be a maximum of 40 pages in length - this would go to Strategic Review. The individual documents would be posted on the web with a protected url so that Review members could view them as background information, if required. The detail contained in each document was therefore crucial. PC voiced his concern regarding the lack of posts for LHCb at the Tier-2s. DB had suggested that these posts go to Tier-2Ds. SL reminded that the ATLAS posts weren't really ATLAS posts, they were simply posts at the Tier-2. 
DB concurred, noting that these posts were there to support a site whose primary focus was to support CMS/ATLAS. LHCb had been addressed by sites looking after 'other' experiments. It depended on the scenario being looked at. If we didn't have a Tier-1, which sites would do what? The post-holders needed to support multiple VOs and GridPP had to be experiment-led. It was agreed that the distinctions as to which experiment the posts were allocated to, should not be so black and white on the tables presented. It was agreed from henceforth to call the support posts (whether they were ATLAS or CMS) by the more generic title of 'Site Support for VOs'. E. Experiment Requirements (DB) -------------------------------- DB was working on this document. F. Rationale for 50% plan (DB) ------------------------------- DB had circulated a partial draft and received comments. G. new document outlining 4 scenarios (DB) ------------------------------------------- DB had circulated a partial draft and would describe the step-changes (from flat-cash to 50% funding) in this document. DB reminded that all documents had to be completed this week. It was agreed to give the Collaboration Board (CB) the scenarios by the end of the week once the documents had been finalised. DB hoped to circulate Document G on the scenarios by Friday to the CB and get feedback about the requirement for a F2F meeting. The date for the Strategic Review was possibly 17th March 2014. Regarding attendance, if the entire PMB did not attend, then DB, SL and the three experiment reps should go, followed by the three 'provider' reps AS/JC/DK. DB had asked if the meeting could take place in London. PC brought up another issue for discussion, that of DIRAC. PC considered that we should be prepared in relation to DIRAC, as they had minimal support staff and we should consider our response to the Strategic Review if they believe that our support staff numbers are a luxury in comparison. 
DB agreed that he needed to take a closer look at DIRAC's funding situation. There was a difference between operating a Grid infrastructure and an HPC facility. JC should consider this issue, and it would be revisited. ACTION 515.2 JC to consider the issue of DIRAC funding and manpower, in comparison with GridPP. AS suggested that we should look at the questions which the PPRP asked us last time round. He could circulate them [done following the meeting]. DB noted that STFC had informed us that we would know the membership of the Strategic Review panel - this was something we needed to scrutinise. 2. GridPP32 ============ DB reported that Andrew McNab was working on a draft website at the moment. We needed an Agenda and a Theme. It was too early for Run 2 as a topic. All suggestions would be welcome. ACTION 515.3 ALL: to send DB suggestions for a theme for GridPP32 and suggestions for people who could give talks. It would be an opportunity for JC's team to present what they've been doing. DB noted this should be work that has reached a level of maturity and it would be a good moment to review best practice. 3. STANDING ITEMS ================== SI-1 Dissemination Report -------------------------- SL reported on behalf of Tom Whyntie as follows: GridPP guides for small VOs/SMEs ------------------------------------------ TW attended both ops and storage weekly meetings last week; in the latter some requirements of the Langton Ultimate Cosmic ray Intensity Detector (LUCID) experiment (part of CERN@school) were outlined. Steve Jones suggested that a guide for small VOs (potentially including SMEs) should be pulled together from the existing documentation on the wiki, recent work (including the instantUI from VomsSnooper, etc.) and TW's work with [log in to unmask] This could make for an interesting session/workshop at GridPP32 and (possibly) the EGI Community Forum in Helsinki (May 2014). 
Useful links: * [The grid user crash course on the GridPP wiki](https://www.gridpp.ac.uk/wiki/Grid_user_crash_course); * [The instantUI guide on the GridPP wiki](https://www.gridpp.ac.uk/wiki/User_Interface_%28UI%29_to_support_approved_VOs); * [EGI Community Forum 2014 "Advancing Excellent Science"](http://cf2014.egi.eu/). SME engagement - Palantir Technologies -------------------------------------------------- TW has engaged with Alex Sparrow (formerly of the CMS experiment/Imperial) from Palantir Technologies with regard to the TSB "Data Exploration" competition. Useful links: * [Palantir Technologies](http://www.palantir.com/); * [TSB data exploration competition](https://www.innovateuk.org/competition-display-page/-/asset_publisher/RqEt2AKmEBhi/content/data-exploration-creating-new-insight-and-value SI-2 ATLAS weekly review & plans --------------------------------- RJ reported that there had been one incident at the Tier-1 relating to an accidental Condor restart on the Worker Nodes which had killed all the jobs. The renaming of files was now complete. They were continuing to work on IPv6. Multi-core queues were progressing. Work was ongoing at QMUL and RAL. There were outstanding list requests which were starting to clear. SI-3 CMS weekly review & plans ------------------------------- DC was absent. SI-4 LHCb weekly review & plans -------------------------------- PC had no major issues to report. SI-5 Production Manager's Report --------------------------------- JC reported as follows: 1) The final WLCG T2 availability/reliability figures for December were released last week. https://espace.cern.ch/WLCG-document-repository/ReliabilityAvailability/Tier-2/2013/WLCG_Tier2_OPS_Dec2013.pdf. Overall the GridPP sites were good with many above 98% and several at 100%. Those under 90% in at least one category were: UCL – 45%:95%. The site was unavailable due to declared downtimes during which WN and SE upgrades were performed. Bristol - 89%: 90%. 
Problems with new ARC-CE debugging. Power surge on 18th December led to 1.5 days outage. Intermittent GPFS problems leading to StoRM lockup affecting CEs and WNs for 2 days. RALPP – 89%: 89%. Main issue was the dCache database filling up a log partition on the server and stopping on a Friday evening, which caused a full weekend of downtime. A smaller issue was that small VOs on a shared disk pool with ops led to all transfer slots being used, so SAM tests timed out. Sussex – 71%: 71%. The majority of downtime came from a period where the site was unavailable due to a CVMFS server replacement taking longer than expected. Following the replacement a configuration error led to further CVMFS failures. Finally at the end of December the Lustre file system object servers crashed causing StoRM servers to hang for several days. 2) Last week one GridPP site reported a security incident [1]. Our security team are happy with the handling of the incident. A more interesting incident is coming to light outside of the UK but may yet affect us [2]. 3) On Thursday we had a core ops tasks meeting (http://indico.cern.ch/conferenceDisplay.py?confId=297062). Points to note: - There is good progress being made with IPv6 testing (see Duncan’s talk http://indico.cern.ch/materialDisplay.py?contribId=1&materialId=slides&confId=297062). Last week at the ops meeting we approved the ipv6.hepix.org VO. - Work is ongoing to refactor WLCG monitoring. - Steady progress is being made with Puppet (module sharing). - EGI Staged Rollout has become very quiet. We are contributing to the WLCG discussion on middleware readiness. - Some documents are in need of review. Blog posting needs to be revived. - Use of DIRAC for smaller VOs has been slow to progress but is now getting more focus. - A move to our resilient VOMS network is now taking place. - We have not yet come up with a workable WN tarball that incorporates glexec – CMS is making availability of glexec a critical test. 
4) Jens and I are trying to follow up with DiRAC to arrange a joint technical meeting to discuss areas of possible mutual interest and collaboration. 5) Pete has spotted (from the Q4 quarterly reports) that several sites are not publishing their new benchmark figures (HS06) following their migration to SL6 late last year. We are reviewing. SI-6 Tier-1 Manager's Report ----------------------------- AS reported as follows: Fabric: -------- 1) All capacity equipment now delivered. One tranche of disk and one tranche of CPU have now successfully completed vendor proving tests. Now moving into our own shorter acceptance tests. Second disk and CPU tranches in vendor proving test. 2) Microcode updated on the robot in order to support T10KD drives. Service: --------- Operations again very smooth last week; report at: https://www.gridpp.ac.uk/wiki/Tier1_Operations_Report_2014-01-22 1) CASTOR a) Merged the ALICE and GEN tape service classes, reducing the need for disk pool in front of tape. SI-7 LCG Management Board Report --------------------------------- There was no meeting. REVIEW OF ACTIONS ================= 512.2 Regarding the outturn forecast and the possible spend on tape media, travel etc, PG to work out what was left and ask Tony Medland for re-profiling. Ongoing. 513.1 RJ/DC/PC each to provide a 1-page summary of functionality at the Tier-1 and Tier-2, including present situation and evolution. Replaced by action 514.2. Closed. 513.2 PG to review the EVAL requirements and email everyone a summary and reminder. Ongoing. 513.3 DB to send to PG the EVAL information required. Ongoing. 513.4 DC to follow-up with Alex Efimov/Tom Whyntie regarding Simon's time on coding for bit-splitting work on Linux - DC to clarify the issues involved and report-back if a PMB decision was required. Ongoing. 514.1 PG to chase-up the sites regarding a line of information required from them to insert into SL's Tier-2 document. Done, item closed. 
514.2 DC/PC to provide information about their computing models to SL for the Tier-2 document. PC has now done this. Action on DC only. 514.3 RJ/DC to provide information to PC to enable him to complete the network forward-look. Ongoing. ACTIONS AS OF 27.01.14 ====================== 512.2 Regarding the outturn forecast and the possible spend on tape media, travel etc, PG to work out what was left and ask Tony Medland for re-profiling. 513.2 PG to review the EVAL requirements and email everyone a summary and reminder. 513.3 DB to send to PG the EVAL information required. 513.4 DC to follow-up with Alex Efimov/Tom Whyntie regarding Simon's time on coding for bit-splitting work on Linux - DC to clarify the issues involved and report-back if a PMB decision was required. 514.2 DC to provide information about the CMS computing model to SL for the Tier-2 document. 514.3 RJ/DC to provide information to PC to enable him to complete the network forward-look. 515.1 AS to give SL the numbers for power usage for delivery of disk & CPU, to enable SL to calculate electricity costs. 515.2 JC to consider the issue of DIRAC funding and manpower, in comparison with GridPP. 515.3 ALL: to send DB suggestions for a theme for GridPP32 and suggestions for people who could give talks. Next PMB Monday 3rd February 2014 @ 12.55 pm.
GridPP PMB Meeting 514 (20.01.14) ================================= Present: Dave Britton (Chair), Pete Gronbech, Andrew Sansum, Jeremy Coles, Tony Doyle, Dave Kelsey, Steve Lloyd, Dave Colling, Tony Cass, Roger Jones, Pete Clarke, Claire Devereux (Minutes - Suzanne Scott) Apologies: None 1. The GridPP5 Proposal ======================== DB noted that today's meeting was specifically focussing on the GridPP5 proposal - the status of all documents would be checked. PC reported that there was a service infrastructure funding round which was due imminently. He had noted the email exchanges on what should be included in the Proposal - it was understood that STFC had requested 'flat cash' and 50% scenarios, however he suggested that we add 'minimum acceptable' and 'minimum viable' scenarios, also that we should include one entitled 'optimal', ie: what we would like if there were the possibility of no constraints. PC advised that a new consultation was due to be launched in relation to infrastructure priorities from BIS and that funds might be available from that. DB considered that we could be in a difficult situation if the funding was cut and we tried to get a hardware grant from BIS which we couldn't support - the division of hardware and resource was very difficult to manage. The hardware level we needed was around £11-12 million but in a de-scoped scenario the likelihood was only for £6-7 million. TD thought that in all scenarios we should assume that through a secondary process hardware might become available and that we should tune the scenarios accordingly. DB noted that we wouldn't know the outcome of the BIS consultation until it was too late; he had spoken to Janet Seed and she had been positive about the possibilities within the BIS consultation. 
DB added that he had received an email from Tony Medland, who advised that the Strategic Review Panel would assess a realistic scale of funding to support the STFC experimental programme - TM had warned that we could not simply give a 'flat cash' and a 'doomsday' scenario - intermediate options should be included. AS had suggested four scenarios and DB considered that including a couple of intermediate options would be useful - it would enable the balancing of functionality vs risk. This would cover the 'middle' ground of £4-5 million funding, whereas the £3.5 million scenario was a complete disaster. The option of a de-scoped Tier-1 would be around £6 million. Therefore we should provide project scenarios at £7 million, £6 million, £4.5 million and £3.5 million, which would show the step-change risk from 'flat cash' to no Tier-1 at all. This would give a clear picture to the Strategic Review Panel. RJ agreed, noting that we needed to identify clearly the risks at each level, especially capital. PC suggested that if there were any real optimism about the Infrastructure funding, perhaps STFC could provide bridging funding until the BIS outcome were known? We should raise this issue at the Strategic Review. DB asked the PMB if we should provide the range of scenarios as discussed? This was agreed. DB noted that the document outline details would therefore be changed. The status of the documents was next reviewed: A. Tier-2 (SL) --------------- SL had circulated his draft document - he had been unsure as to what the slant should be - the document outlines where we are at the moment, however is it meant to be a proposal or background information only? DB advised that the document would help formulate the arguments regarding leverage etc. SL noted that some information was still missing - he needed a line of information from each site. PG would chase this up. 
ACTION 514.1 PG to chase-up the sites regarding a line of information required from them to insert into SL's Tier-2 document. PG would check what hardware was available at sites and ascertain which ones could potentially be expanded. SL noted that he needed information from CMS and LHCb about the computing models. Inputs to SL as soon as possible. ACTION 514.2 DC/PC to provide information about their computing models to SL for the Tier-2 document. B. Tier-1 (AS) --------------- DB advised AS that the document which AS had circulated would be very useful for the proposal. AS confirmed he would progress his document. C. Technical Overview (DC) --------------------------- DC confirmed that he should have a draft available by midweek this week. D. Deployment, Ops & Support (JC/TD) ------------------------------------- JC reported that progress was being made - he had discussed with TD who would contribute to what sections: DK would contribute on security and CD would contribute on the international situation/context. DB advised JC that he must focus on the most important issues and provide hard-hitting facts. E. Experiment Requirements (DB) -------------------------------- DB reported that he had been working on this, looking at CERN requirements in comparison with his spreadsheet on hardware. The draft was nearly ready. F. Rationale for 50% plan (DB) ------------------------------- DB noted that this must now be adapted to cover intermediate levels and risks in light of what was discussed today - he needs to revise this. He would circulate it by the end of the week. DB summarised that all documents now appeared to be in progress, however by Friday of this week we must have a completed document that would form the basis of the arguments we were going to make. All draft documents must be completed and circulated by Friday. These drafts would be reviewed next Monday then the writing of the proposal itself would be mapped-out. 
STANDING ITEMS ============== SI-0 Development Group Report ------------------------------ DC reported that there had been a meeting last Friday, and he provided a link for the Minutes: https://indico.cern.ch/getFile.py/access?resId=0&materialId=minutes&confId=295905 ATLAS had been running well, with an image, and were running simulation. They may look at Oxford again. CMS were focused on HLT, the VM was running again after about 20 minutes and was working well. There were 3,500 jobs on that, which they will suspend and resume, it was going reasonably well. Regarding LHCb there had been staff changes and there would be a development meeting taking place tomorrow to evaluate how to move forward. PG reported that at the GDB there had been a cloud focus, and much confusion there between the different usages of virtual machines and sites. DC agreed, and considered the meeting had been highly unproductive. They needed to get things working. JC agreed there was no common understanding of the areas. SI-1 Dissemination Report -------------------------- SL reported on behalf of Tom Whyntie as follows: ### instantUI news item published A new item on Steve Jones's instantUI tool has been published on the GridPP website. EGI (through Neasan) have picked it up and may follow up with a blog post. Useful links: > * [instantUI: work with the grid from anywhere](http://www.gridpp.ac.uk/news/?p=3137); > * [The instantUI GridPP wiki guidelines](https://www.gridpp.ac.uk/wiki/User_Interface_%28UI%29_to_support_approved_VOs). 
### CERN@school software deployed with CVMFS Thanks to Catalin Condurache (RAL Tier-1), the CERN@school VO (cernatschool.org) can now deploy software to the grid using CVMFS. The next step will be uploading the students' software; the most likely candidate for this is code that processes calibration data obtained from the Institute of Experimental and Applied Physics (IEAP), Czech Technical University in Prague for the Langton Ultimate Cosmic ray Intensity Detector (LUCID) experiment. SI-2 ATLAS weekly review & plans --------------------------------- RJ noted not much to report, they were struggling with storage areas getting full however sites were responding. SI-3 CMS weekly review & plans ------------------------------- DC noted he was in discussion with CMS computing management regarding the GridPP scenarios for GridPP5. SI-4 LHCb weekly review & plans -------------------------------- PC noted nothing to raise. SI-5 Production Manager's Report --------------------------------- JC reported as follows: 1) A WLCG T2 availability/reliability report for December was circulated last week: http://sam-reports.web.cern.ch/sam-reports/2013/201312/wlcg/WLCG_Tier2_OPS_Dec2013.pdf . There were no UK requests for re-computation, but I am still waiting for explanations from some (of the 4) sites that were under 90% and will review these next week. 2) One of the concerns raised at the HEPSYSMAN meeting last Monday was about the future of batch system support. This will be discussed at the WLCG/HEPiX level in coming months. Condor with the ARC CE is gaining some attention as is Slurm and GridEngine. We can expect this to affect most of our T2 sites. 3) One of the areas driving discussion at (and after) the GDB last week (http://indico.cern.ch/conferenceDisplay.py?confId=272795) was the adoption of a new benchmark - made more complicated with the evolving computing environment (i.e. considering how to account with cloud resources, multi-core and whole node etc.). 
4) There is a WLCG multi-core Task Force kick-off meeting tomorrow (https://indico.cern.ch/conferenceDisplay.py?confId=296031). SI-6 Tier-1 Manager's Report ----------------------------- AS reported as follows: Fabric: 1) Both disk and one CPU tranche now delivered. Last major item (second CPU tranche) delivering today. All on track for deployment. 2) We expect to carry out at least two major interventions on the Tier-1 network in the coming 6 weeks in order to attach to the site network at 40Gb/s. Details are being finalised. Service: Operations again very smooth last week; report at: https://www.gridpp.ac.uk/wiki/Tier1_Operations_Report_2014-01-15 1) CASTOR a) Overheads reduced from 5% to 1% - extra space available in all service classes. SI-7 LCG Management Board Report --------------------------------- There had been no meeting. REVIEW OF ACTIONS ================= 496.2 PC to update the network forward-look. This was now pending information from RJ/DC. Done, item closed. 512.2 Regarding the outturn forecast and the possible spend on tape media, travel etc, DB/PG to work out what was left and ask Tony Medland for re-profiling. PG would take-over this action. Ongoing. 513.1 RJ/DC/PC each to provide a 1-page summary of functionality at the Tier-1 and Tier-2, including present situation and evolution. Ongoing. 513.2 PG to review the EVAL requirements and email everyone a summary and reminder. Ongoing. 513.3 DB to send to PG the EVAL information required. Ongoing. 513.4 DC to follow-up with Alex Efimov/Tom Whyntie regarding Simon's time on coding for bit-splitting work on Linux - DC to clarify the issues involved and report-back if a PMB decision was required. Ongoing. ACTIONS AS OF 20.01.14 ====================== 512.2 Regarding the outturn forecast and the possible spend on tape media, travel etc, PG to work out what was left and ask Tony Medland for re-profiling. 
513.1 RJ/DC/PC each to provide a 1-page summary of functionality at the Tier-1 and Tier-2, including present situation and evolution. 513.2 PG to review the EVAL requirements and email everyone a summary and reminder. 513.3 DB to send to PG the EVAL information required. 513.4 DC to follow-up with Alex Efimov/Tom Whyntie regarding Simon's time on coding for bit-splitting work on Linux - DC to clarify the issues involved and report-back if a PMB decision was required. 514.1 PG to chase-up the sites regarding a line of information required from them to insert into SL's Tier-2 document. 514.2 DC/PC to provide information about their computing models to SL for the Tier-2 document. 514.3 RJ/DC to provide information to PC to enable him to complete the network forward-look. Next PMB Monday 27 January @ 12.55 pm.
GridPP PMB Meeting 513 (13.01.14) ================================= Present: Dave Britton (Chair), Pete Gronbech, Andrew Sansum, Jeremy Coles, Steve Lloyd, Dave Colling, Tony Cass, Roger Jones, Pete Clarke, Claire Devereux (Minutes - Suzanne Scott) Apologies: Tony Doyle, Dave Kelsey 1. The GridPP5 Proposal ======================== DB reminded the PMB that we needed to make a start with this, based on the STFC Briefing Document. There was a limit of 40 pages in length, including appendices. DB considered that it was best if the case were written by way of background documents in order to set down the full arguments, then the proposal could be extracted from that. DB had circulated a list of the suggested documents for discussion: A. Tier-2 - SL would do this with inputs required from others. Up-to-date information was required on functionality and all leverage arguments. SL asked whether the experiments could summarise their computing models? ACTION 513.1 RJ/DC/PC to each provide a 1-page summary of functionality at the Tier-1 and Tier-2, including present situation and evolution. DB advised that the timescale overall was compressed - inputs were needed within the next few days, therefore they needed to be done immediately. This week the inputs should be in draft form, which would then be finalised next week. SL noted that he also needed 'added value' inputs - he would send a list to JC of the details outstanding. SL considered that he could probably calculate the power himself if he had a few actual numbers which he could relate to the hardware - this might give an average number he could apply to everyone. DB noted he could look at the accounting over GridPP4 to see the resource delivered. SL was looking at what was on the ground and would use the quarterly reports. PG advised that this was done through the experiment reports. B. 
Tier-1 - AS noted that the information already existed for this section and it just needed to be pulled together in one place with respect to manpower, functionality, and impact. He advised that his department would be providing a briefing document to the STFC Executive Board regarding the proposal from GridPP, giving them an outline of it. Some of the text from that impact statement document would be suitable for use in the Tier-1 document. DB noted that he needed to show the advantage of the Tier-1 and its positive impact. Dave Wark had recently had a discussion with DB regarding the proposal strategy and the likely fallout, although the actual costs were as yet unknown. The PMB agreed that any likely fallout should not be included in the proposal as this fell outwith its scope. DB noted that this was another reason for background documents being required, as the overall process would not be straightforward. Regarding the experiments' view, DC advised that the document from CMS would state their preference to retain the current infrastructure but if forced to choose would favour the Tier-2 service. PC noted that he had brought this up at the LHCb Steering Board. Historically, LHCb only uses the Tier-1 except for Monte Carlo production but CMS and ATLAS tend to take the lead because of their relative size, therefore LHCb would hope to be able to adapt to any change, depending on the situation. DB advised that there were possibilities for the infrastructure that might be able to fulfil LHCb requirements, with Tier-2Ds for example. A statement from LHCb would be very helpful. PC noted he would need to check with LHCb about what the statement could contain. RJ advised that within the UK he had not, as yet, had any formal discussions. DB considered that now that the Collaboration Board (CB) had been informed, things were out in the open, and RJ could go ahead and discuss the options and the ATLAS view of the GridPP proposals. C. 
Technical Overview - this would need to describe our technical evolution, including at CERN, and also longer term sustainability. DC confirmed that he would do a document. D. Deployment, Ops and Support - JC must describe the support mechanisms and the interface with the experiments. TD would give support with this as a wide-ranging scoped document was required - TD/JC should iterate. CD would also assist in relation to the ex-EGI effort/services. JC should also co-ordinate with DK regarding security. Progress would be reviewed next week. E. & F. Experiment Requirements & Rationale for the 50% plan - DB would deal with these documents which described the choices facing the Project. DB advised that all of these documents as described would lead into the proposal. The outline had been circulated along with page limits. DB reminded about the urgency of this due to the tight timescale. Were there any other comments/suggestions? There were none. Progress would be reviewed next week. 2. EVAL reminder ================= PG reminded the PMB about the email from ResearchFish - a list of papers were required, all the new ones since last year. DB requested that any inputs should be marked against GridPP if being entered directly onto the system, or they should be sent to PG who would upload them. PG reminded that there were all of the other sections as well, which required inputs - PG would review the requirements and email everyone. ACTION 513.2 PG to review the EVAL requirements and email everyone a summary and reminder. 513.3 DB to send to PG the EVAL information required. 3. Quarterly Report summary ============================ PG noted that it had been difficult to get the reports in for Q3, and that the Q4 ones were now due. There were some points to highlight: - hardware at some sites was now old - operations personnel had been lost during the period - the metric for vacant posts was nearly at 6 DB asked about the issue of the Tier-1 as a 1 x 10g link? 
AS noted this was no longer the case - they were now on twinned links. PG highlighted this was one of the examples of how out-of-date the reports were. The Q4 reports were due in asap - could everyone do their best to attend to these now. STANDING ITEMS ============== SI-0 Monthly report from Development Group ------------------------------------------- It was agreed that cloud issues would be dealt with next week. SI-1 Dissemination Report -------------------------- Tom Whyntie reported via SL as follows: > instantUI TW had tested Stephen Jones' instantUI tool. Part of the VomsSnooper suite, it allows the user to create a stand-alone user interface that can be used to submit grid jobs from anywhere by anyone with a valid grid certificate. A news item is in preparation, as well as instructions for installing on a generic SL6 Virtual Machine. Useful links: [The instantUI GridPP wiki guidelines](https://www.gridpp.ac.uk/wiki/User_Interface_%28UI%29_to_support_approved_VOs). >Technology Strategy Board "Data Exploration" funding competition - registration open Registration was now open for the Technology Strategy Board (TSB) "Data Exploration" funding competition, which offers £100k-£500k for projects lasting 6-24 months "that address the technical challenges and business opportunities presented by the huge growth in data". Applications should be led by an SME and submitted by 5th March 2014 (registration deadline 26th February 2014). Useful links: ["Data Exploration" competition information](https://www.innovateuk.org/competition-display-page/-/asset_publisher/RqEt2AKmEBhi/content/data-exploration-creating-new-insight-and-value SI-2 ATLAS weekly review & plans --------------------------------- RJ reported that they had been busy over Christmas, they had been running multi-core jobs, there had been an issue with Lancaster re multi-core queues. JC reported that sites had been concerned at the set up of these just prior to Christmas, with minimal usage. 
SI-3 CMS weekly review & plans ------------------------------- DC reported that they were also running multi-core jobs; things were generally ok. They were exploring a new structure at CERN. SI-4 LHCb weekly review & plans -------------------------------- PC noted nothing major to report. SI-5 Production Manager's Report --------------------------------- JC reported as follows: 1) GridPP resources ran smoothly over the Christmas and New Year periods. - Thanks to the ROD team for continuing to keep an eye on things over the whole period (Daniela undertook the Christmas period and Andrew the new year) - transient problems were noticed and tickets kept in check. - In discussion of 'issues' to follow-up at the ops meeting last week, only one topic was mentioned. There was some frustration that ATLAS requested multi-core queues (with, where available, 50% of resources assigned to them) shortly before the holiday period and did not then really use them. This affected 5-6 UK sites. There was concern the last minute changes could have led to instabilities and setting up multi-core wastes resource (this includes draining ahead of the configuration). I understand that the ATLAS strategy for multi-core going forwards is to be discussed at an upcoming meeting but the need is increasing. The wider approach to multi-core is becoming a hot topic (there is now a WLCG ops task force to look at this area which is to be led by Alessandra Forti). 2) The current expectation is that the UK CA will move to SHA-2 as default in March - as WLCG is not entirely ready (indeed LHCb have recently discovered another problem to be addressed). Meanwhile, there have been further problems in trying to move the GridPP website to a SHA-2 ready status - these are due to problems updating the wiki schema. We have agreed a workaround whereby wiki pages will be moved only with their latest update and the history will be put within an archive. 
3) Chris Walker has been pushing the testing of resources with the network of GridPP VOMS instances for many months now. Issues have been identified and resolved. Several of the smaller VOs though have not tested - we have been helping where possible including adding their VO to a VO-Nagios testing instance. As there have been many warnings and sufficient time for testing, the plan is for sites to update their resources to be aware of the wider network in the coming week. There may be some issues we will only become aware of after the move and will resolve them as a priority. 4) Since December there has been a known openssl issue that has required updates of service nodes (e.g. WMSes) before sites update their versions. Our plan (agreed by the GridPP security team on Friday) is to finish the service updates this week and then give the green light for sites to update their nodes. For information: A) There is a HEPSYSMAN meeting taking place in Birmingham today: http://indico.cern.ch/conferenceDisplay.py?confId=286936. B) There is a pre-GDB on Cloud Issues this Tuesday: http://indico.cern.ch/conferenceDisplay.py?confId=272783. C) The GDB this month is on Wednesday 15th: http://indico.cern.ch/conferenceDisplay.py?confId=272795. This includes discussion of the work plan for SPEC14 readiness. DB added that the PMB had been very pleased to see the Tier-1 and Tier-2 infrastructure running well over Christmas and they wished to thank the on-call team at the Tier-1 and the ROD team (particularly Daniela and Andrew) for keeping things running well. SI-6 Tier-1 Manager's Report ----------------------------- AS reported as follows: Fabric ------ 1) One disk and one CPU tranche are now in proving test. Second tranche of disk has just arrived this morning and second CPU delivery is expected next week. We are in very good shape to meet the April MoU commitments. 2) Enlarged tape media order now delivered. 
3) We expect to have all the Tier-1 network infrastructure in place by the end of this week in order to be ready for deployment of the new mesh and routing layer. We are in the process of negotiating a slot in the Network Group's upgrade schedule for the site. Service ------- Overall, operations has been very smooth for some time and this is beginning to release effort to carry out further development work. 1) We had a very reliable holiday period with 100% availability for all VOs since 19th December. Only a few minor callouts. Reports covering the holiday period available at: https://www.gridpp.ac.uk/wiki/Tier1_Operations_Report_2013-12-11 2) CASTOR a) Work continues on CASTOR 2.1.14 testing. A number of issues have been identified and updates received from CERN. b) ATLAS file renaming has almost finished: 14M files renamed, another 3M to go. Investigations into files missing when renamed indicate the cause was a bug in the renaming process and the true loss rate is a factor 10 less than originally observed. Investigations into the relatively slow rate for renaming suggest that the problem may be outside the Tier-1. Any ongoing work rests with ATLAS. 3) Next Gen Storage: work on this project has restarted. Our conclusion some time ago was that no candidates were currently suitable for large scale deployment, however the situation has changed somewhat. CEPH looks an increasingly attractive solution with increasing commercial and community interest (including some WLCG sites). Code development is very active too and of particular note is the release, expected in February, of a version supporting "Erasure Coding" (RAID-like block level parity based replication over a number of disk servers). We have decided to use 1PB of operational reserve disk capacity to deploy a test instance that the experiments can experiment with. The storage group have also (using non-Tier-1 effort) deployed an EOS instance to trial with facilities. 
Staff
-----
1) We are beginning recruitment of a 1-year contractor position to work on cloud infrastructure.
2) At the end of the month we will have a new (grad program) starter to work with us on CEPH and the cloud.

SI-7 LCG Management Board Report
---------------------------------

There was no meeting.

REVIEW OF ACTIONS
=================

496.2 PC to update the network forward-look. In progress.

512.1 DB to email Alex and cc Tom Whyntie regarding Simon's time on coding for the bit-splitting work on Linux. DC to follow-up this action and clarify the issues involved. Action closed for DB, re-allocated to DC.

512.2 Regarding the outturn forecast and the possible spend on tape media, travel etc, DB/PG to work out what was left and ask Tony Medland for re-profiling. Ongoing.

ACTIONS AS OF 13.01.14
======================

496.2 PC to update the network forward-look.

512.2 Regarding the outturn forecast and the possible spend on tape media, travel etc, DB/PG to work out what was left and ask Tony Medland for re-profiling.

513.1 RJ/DC/PC each to provide a 1-page summary of functionality at the Tier-1 and Tier-2, including present situation and evolution.

513.2 PG to review the EVAL requirements and email everyone a summary and reminder.

513.3 DB to send to PG the EVAL information required.

513.4 DC to follow-up with Alex Efimov/Tom Whyntie regarding Simon's time on coding for bit-splitting work on Linux - DC to clarify the issues involved and report-back if a PMB decision was required.

Regarding the next PMB meeting, DB noted that he would be late. PG to chair. SL gave his apologies. The focus of the meeting would be the GridPP5 documents' review. DB advised PG to read the briefing outline from STFC in relation to the main proposal to ascertain the Project Management information required.

Next PMB Monday 20 January @ 12.55 pm.
GridPP PMB Meeting 512 (16.12.13)
===============================

Present: Dave Britton (Chair), Pete Gronbech (Minutes), Andrew Sansum, Jeremy Coles, Steve Lloyd, Tony Doyle, Dave Colling, Tony Cass, Dave Kelsey, Roger Jones

Apologies: Pete Clarke, Claire Devereux

1. The Collaboration Board (CB) meeting
========================================

It was considered unnecessary to present the full talk that was given at the recent PMB F2F. The CB would be asked to endorse the decision that, if we were forced to choose, then we would only provide a Tier-2 service in the UK. The CB decision had to be based on correct reasoning - the location(s) of the Tier-2 service was a separate issue.

DC asked, if STFC cut GridPP by 50% and we could not provide both Tier-1 and Tier-2, what would break? We would lose the leverage from the universities if we relocated or removed the Tier-2 service. It was felt that this was a second order issue. LHCb relied on the Tier-1 and would not support this decision. As they could get plenty of CPU from other sites, losing the T1 would be a disaster. There was a whole range of impacts related to losing a T1; ATLAS and CMS would also not be happy. RJ agreed.

Were we agreed that DB would ask the CB to endorse this as the strategy if we were only to be funded at the £3.5M/year level? PC considered that it would be wrong to ask them to endorse this without a paper being submitted, or a pre-warning. Perhaps a second meeting would be required. Support from the CB would be required in order to go ahead and draft a plan on that basis. It was also clearly difficult to ask for approval of a plan where a vested interest was involved. PC reported that the advice received from experiment reps indicated that they were likely to choose the Tier-2 solution, mainly due to ATLAS and CMS support. We would require another CB meeting or email exchange early in the New Year. STFC would present it to Science Board to decide. How would we cope with a £3.5M funding level?
We trusted the institute leaders to have the experiments' best interests at heart. SL read out the definition of the CB. PC noted that the GridPP Project's job was to understand and provide what the experiments required, not to provide jobs at all sites. It was not about the institutes; rather, it was about the functionality. DB did not intend to send out the slides.

PC considered that once Swindon knew about the choice between the T1 or T2 they would ask for a minimum viable cost to include both. DB already had this – which is why we knew that a £3.5m/yr budget would mean either one or the other. Even at that level, it was not enough for a robust Tier-1 service. There was a programme for LHC physics and there was a required amount for computing - if they voted for the physics then they would have to provide the computing. The £3.5m/yr budget was not a feasible option.

STANDING ITEMS
==============

SI-0 Report from Development Group
-----------------------------------

DC advised that no report had been prepared - there was lots going on and he would email round a report. There was activity in CMS, ATLAS and LHCb; they had upgraded the client to Havana. A cloud resource could be made to look like a grid resource. This was a new setup (AKA Stealth Cloud). DB had sent a link to a paper on cloud provision at Tier-2 centres to DC on 11.12.13. This could be a useful input to our proposals.

SI-1 Dissemination Report
--------------------------

SL presented the Report from Tom Whyntie:

> News Item - Big Data on the BBC
A recent BBC Radio 4 documentary "Data, Data Everywhere..." featured the LHC and the huge computing effort required to find the Higgs boson. The news item may be found here: http://www.gridpp.ac.uk/news/?p=3108

> A Collaborator's Guide to GridPP on the GridPP website
Alex Efimov has produced a PDF guide for potential collaborators.
This can be found at: https://www.gridpp.ac.uk/wider/

> TW to Guest Curate "Science Showoff", 15th April 2014
TW has been invited to guest curate a "Science Showoff" event on the 15th April 2014, with a working title "Big Data, Big Deal". Science Showoff is a charity event where scientists are encouraged to give a 9 minute talk - in any format - about their research. If anyone wants a slot, let me know - TW will be going with the Data Exploration/citizen science theme. Further information about Science Showoff may be found at http://scienceshowoff.org/

> CERN@school and CVMFS
Catalin Condurache (RAL) has now enabled CVMFS for cernatschool.org at RAL and a test tarball has been created for deployment of a simple ROOT-based executable. Further tests to follow this week.

- Regarding the project with Alex, Tom was trying to find out about the bit-splitting work. They had finally got it working, but it was difficult to get it working on Linux - this would require a week’s solid coding from Simon. Did we want Simon to spend this much time on this when it was not our top priority? Why should we be doing this? DB agreed. DC to email Alex and cc Tom.

ACTION 512.1 DB to email Alex and cc Tom Whyntie regarding Simon's time on coding for the bit-splitting work on Linux.

DC advised that he still had not been paid for the journal. It had happened before he could get college to pay for it. The amount was £1680. Apparently we said DC would try to get funds but otherwise it would come out of travel funds. Could DB try to get funding from his libraries? Money went to 23 universities about a year ago but may have been spent now. The publication was in 2012. The Royal Society were looking for payment. It was agreed DC should pay this and claim it back.

- Tom went to see a 3-man SME, ‘python anywhere’, which was ideal for schools as it was in a web browser. They rented time on Amazon. Tom wondered if we could offer resources to help this?
It would need an EC2 interface (which DC had). How would we do due diligence on what was being run? So long as it had a limit it would be acceptable. It was potentially interesting work.

- PG had given a GridPP talk to IATUL.

SI-2 ATLAS weekly report & plans
---------------------------------

RJ noted not much to report. There had been an incident concerning inappropriate use of resources, which had been dealt with rapidly; the user had admitted to it and had been admonished. The Panda systems alerted them to odd behaviour. The person’s certificate had been revoked. From the RAL end, the ATLAS response had been excellent, reflecting well on site security. They did report to higher management but nothing had come of it. RAL was now scanning logs to see if any other inappropriate workloads had been run. DK reported that the incident was closed. Operationally things went very well and AS thanked ATLAS.

SI-3 CMS weekly report & plans
-------------------------------

DC noted not much to report. They were planning an exercise next year. The Ops SAM tests had changed. Fair share policies were affecting new SAM tests (previously SAM tests were based on OPS, for which sites tended to make a small reservation to ensure they did not get blocked). WLCG was not entirely ready for SHA2 certificates. This had been postponed to January. The DPM collaboration workshop hosted in Edinburgh last week had gone very well. For November Tier-2 availability, there had been four sites below target: UCL (downtime due to SE/WN upgrade); Durham; Birmingham; Sussex. RALPPD SAM jobs had been stuck due to fair share issues. The site would be 'at risk' over the Xmas break as normal.

SI-4 LHCb weekly review & plans
--------------------------------

There was nothing to report.

SI-5 Production Manager's Report
---------------------------------

JC reported as follows:

1) There was a GDB last Wednesday (http://indico.cern.ch/conferenceDisplay.py?confId=251192).
The most discussed item related to the move to using experiment SAM tests for WLCG site availability/reliability reporting and issues seen whereby the test jobs get ‘stuck’ due to fairshare policies. SHA-2 readiness was also reviewed – the infrastructure overall is not yet completely ready; the French CA (at least) is likely to issue SHA-2 certificates by default from this week. In the UK we postponed the switch to next year – possibly now March.

2) The GridPP hosted DPM collaboration workshop took place in Edinburgh last Friday – thanks to Wahid Bhimji who organised it. The event was well attended and received. The efforts of the collaboration members mean that the DPM product remains a core component at many sites and its future looks more certain now with a good selection of new interfaces in development or test. GridPP is making a solid contribution to the work - there is also increased participation from other countries compared to when the collaboration started.

3) The November WLCG Tier-2 availability/reliability report is now final: http://indico.cern.ch/conferenceDisplay.py?confId=251192. GridPP sites under the targets were:

UCL (28%:54%): Downtime/impacts related to SE and WN upgrades.
Durham (71%:71%): Downtime due to campus wide power maintenance.
Birmingham (89%:89%): Submissions stopped due to a CE issue requiring the node to be rebooted and the situation covered a weekend.
Sussex (58%:58%): Issues remained following the SL6 upgrade.

RALPPD reported one issue with the experiment test results (currently run in parallel to ops) that is being pursued in a ticket: https://ggus.eu/ws/ticket_info.php?ticket=99319.

4) Experiment plans for running over the Christmas period have been mentioned in a number of forums including the WLCG ops coordination planning meeting: http://tinyurl.com/mx3oq8y. All the experiments understand (and are grateful) that support during this period will be on a 'best efforts' basis.
SI-6 Tier-1 Manager's Report
-----------------------------

AS reported as follows:

Fabric:
1) First CPU delivery just arrived this morning. Second CPU delivery and two disk deliveries scheduled for January.
2) Uplifted tape media order placed. The cost for tape media, £40k (for T2K), was raised to £180k; the order had gone out and should be delivered in January. The price on the new framework was ~15% better than before.
3) We are having to consider the rapid disposal of part of the 2007 generation of hardware owing to constraints on machine room floor space. Will email separately.
4) A generator load test was carried out successfully last week. We are discussing what the appropriate test interval is with estates. Hopefully it will revert to 3-monthly intervals.

Service:
1) Reports covering last week available at: https://www.gridpp.ac.uk/wiki/Tier1_Operations_Report_2013-12-11
2) CASTOR
a) Work continues on CASTOR 2.1.14 testing. A number of issues have been identified and updates received from CERN.
b) ATLAS file renaming is so far throwing up 0.004% missing files. We don't have enough logging information to understand the cause of the file loss. It is not necessarily a local data retention problem.

Staff:
1) We are beginning recruitment of a 1-year contractor position to work on cloud infrastructure.

SI-7 LCG Management Board Report
---------------------------------

There had been no meeting.

REVIEW OF ACTIONS
=================

496.2 PC to update the network forward-look. This was close to being started, but was waiting for input from RJ and DC. DC said that the position is evolving in CMS. RJ hopes to have a look at this later this week. PC does need half a page from each expt to set the scene. Remote access to data is scaling well and, depending on how well this works, the bandwidth required will change. PC asked RJ/DC to note down what they expect but say that it may change and that they will inform JANET if so.
If there is a Tier-2 site that requires a better connection to JANET the experiments must say that.

511.1 AS/DK to do the outturn forecast, look at the possible spend on tape media and advise Tony Medland about the profile for next year. A realistic outturn forecast for travel was also required. Action closed and replaced.

511.2 CD to discuss GridPP's input with the UK NGI concerning interest in the Distributed Competence Centre. JC says CD did raise the issue and the UK are on the list. So probably complete.

ACTIONS AS OF 16.12.13
======================

496.2 PC to update the network forward-look.

512.1 DB to email Alex and cc Tom Whyntie regarding Simon's time on coding for the bit-splitting work on Linux.

512.2 Regarding the outturn forecast and the possible spend on tape media, travel etc, DB/PG to work out what was left and ask Tony Medland for re-profiling.

Next PMB: Monday 13th January @ 12.55pm
