GridPP PMB Minutes 488 (18.02.2013)
===================================

Present: Dave Britton (Chair), Tony Doyle, Pete Gronbech, Andrew Sansum, Jeremy Coles, Tony Cass, Pete Clarke, Dave Colling, Roger Jones (Minutes - Suzanne Scott)

Apologies: Dave Kelsey, Steve Lloyd, Claire Devereux, Neil Geddes

0) Closure of AFS Service
==========================

AS had circulated a report giving an overview of the history of this. In 2007 the Tier-1 Board had said we should shut it down, but some complaints had been received from the User Board regarding users. A slight upgrade had been effected to keep it going. The issue had come round again now as the hardware was fairly old. We needed to decide what to do, as the AFS Service did not fit in with the Tier-1 model and the community had not found it useful. Recently there had been fileserver problems and there were still some users, however usage was limited. Moving forward, to maintain it, we would need to invest effort in staff, upgrades, and also the user registration process, however we probably could not support it at that level. No funding was available. The logical outcome was to close it.

DB asked if there were a use case for the AFS as part of the core mission of the Tier-1? AS noted no. DB considered it to be peripheral therefore and the service did not require to be run. It may affect some individual users if it were closed. DB considered we should turn it off unless we had to respond to an urgent issue. This was agreed. AS would broadcast the notification.

DB noted there was no defined use case, therefore there was no justification for refresh and manpower. We would announce the termination of the service and see what the outcome was. This was agreed. PC asked whether a long process of advertisement would be required? AS advised that he preferred to keep this to within a four-month period. AS would send out a notification and reminders.

ACTION 488.1 AS to notify the community, giving three months' notice, that the AFS service would be shut down.

1) Quarterly Report Summary
============================

PG had circulated a report. Compared with the previous quarter the experiments and the Tier-1 were green. There were some reds at the Tier-1. The LFC and FTS service fell below the 99% target. The CMS VO box metric was no longer required. Regarding storage, there had been similar drops due to power outages. A lot of effort had been put into power incidents and upgrades. The CASTOR staff levels were critical. Jens Jensen was working on three recruitments at the moment.

For ATLAS all metrics were green. RJ advised that ATLAS use of resources was not all green but the site performance was acceptable.

For CMS all metrics were green. All was OK apart from Bristol. DC noted that we needed to tread carefully with this site due to manpower and other issues. There were storage issues to be resolved. There had been an improvement however.

For LHCb all metrics were green. RAL had performed excellently during the Quarter.

For 'Other' experiments all metrics were green. There had been the addition of the EPIC VO. NGS VOs had been added onto the VOMS Server. T2k had increasing storage requirements. There were LFC support issues.

For Ops, everything was going fairly smoothly. There had been some upgrade issues.

For DataGroup all metrics were green.

For Experiment Support all metrics were green.

2) EGI Fees
============

It was noted that JISC would not pay the current year's EGI fee for the UK, therefore there had been a request that £60k be funded by someone other than JISC, i.e. GridPP and NGS. NGS could pay half and it was requested that GridPP pay £30-35k. The only mechanism available was out of the travel budget. DK could let us know whether this was feasible. DB asked for comments. RJ asked what we got in return for the EGI fee?
DB advised that staff were funded by EGI, there were a few FTE and 4 x 0.5FTE at the Tier-2 which were funded by EGI. DB advised that matching effort was also required - other people reported time into the PPT timesheet system. There was one year left of EGI, which had followed on from EGEEII and EGEEIII. DB needed to speak to DK and STFC before taking action. JC advised that Ireland hadn't paid and they had withdrawn. The Portuguese payment was delayed. 4 x FTE were also on APEL and the GocDB, which we relied on. It was agreed that DB should speak to DK/STFC regarding the EGI fee payment and let AS know.

ACTION 488.2 DB to speak to DK/STFC regarding the EGI fee payment and let AS know.

3) Horizon 2020
===============

It was noted that the EU were widening their search for experts in all fields for Horizon 2020 proposals. Had anyone responded to the call? No-one had. Did anyone wish to volunteer? There were strategic priorities and an Agenda to be discussed. If anyone did wish to get involved they should let DB know.

STANDING ITEMS
==============

SI-0 Report from Cloud Group
=============================

DC advised that meetings were happening fortnightly; the twiki was in progress; hardware was limited so far. DC reported as follows:

Organisation is settling down and we have fortnightly meetings. There is a growing twiki and a community is starting to form. There is an ongoing discussion between Ian C., JC and DC about how best to form this into a community. Other cloud sites within GridPP are encouraged to add a description and link on the GridPP twiki site (as ECDF already have done).

Physical Hardware (description from Adam H.)
---------------------------------------------

Four compute nodes are in the process of being provisioned as extra compute nodes. Each has 64GB RAM and 32 cores with HyperThreading. This will bring the total compute capacity to:

200 Cores
400 GiB RAM

A storage node is also being provisioned, to provide an S3-compatible service.
The raw usable capacity (before any reduction for replication) is:

20 TiB

Further storage may be added using space on other nodes in the cluster, if the loading on single machines is such that multiple roles can be accommodated safely.

The cluster is running OpenStack Folsom. Hosts have been added to the Imperial monitoring system. It is planned to provide monitoring of individual instances too.

Activities:
-----------

Cloud Storage testing
---------------------

None of the LHC experiments are currently using cloud storage; however, storage is being added to the cluster so that Wahid can perform some tests.

ATLAS
-----

Nobody was able to report on ATLAS activities at the last meeting, but at the previous meeting Peter L. had reported that he was planning to use cloud scheduler. As yet there have been no images on the GridPP cloud from ATLAS but I believe that Peter had some configuration to do in Lancaster before he would try anything at our end.

CMS
---

CMS have been very active both with the GridPP cloud and the HLT farm at CERN. The HLT farm has run ~4000 concurrent reprocessing jobs; however, under that loading the jobs started to fail. This is believed to be a simple network bandwidth problem as the data was going over the 1Gb/s pipe, not the 10Gb/s. After the low energy run Andrew L. and Toni (from CERN) are to map the requirements of the reprocessing jobs and then rearrange the network as needed. These jobs are submitted from a glideinWMS sitting at CERN. Data are read in and out over xroot.

In the UK, user data analysis jobs are now being submitted using the regular CMS tools, going via the glideinWMS at RAL and being run on the GridPP cloud at Imperial. It should be noted that the glideinWMS not only controls the jobs but also performs the instantiation of the VM itself. Data is read in using xroot and staged out using conventional grid tools. Currently there is a problem that some jobs fail because of stageout timeout problems. This is being investigated.
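
As a rough illustration of the bandwidth explanation given for the HLT farm failures, the sketch below estimates the average share of a link available to each of ~4000 concurrent jobs. It assumes the link is shared equally between jobs, takes 1Gb/s as 1000 Mbit/s, and ignores protocol overhead; the function name `per_job_share_mbit` is illustrative only, and the actual per-job requirements have not yet been mapped.

```python
def per_job_share_mbit(link_gbit: float, concurrent_jobs: int) -> float:
    """Average bandwidth per job in Mbit/s, assuming the link is shared equally."""
    if concurrent_jobs <= 0:
        raise ValueError("concurrent_jobs must be positive")
    # Assumption: 1 Gb/s = 1000 Mbit/s, no protocol overhead.
    return link_gbit * 1000.0 / concurrent_jobs


JOBS = 4000  # approximate concurrent job count quoted in the report

for link_gbit in (1.0, 10.0):
    share = per_job_share_mbit(link_gbit, JOBS)
    print(f"{link_gbit:g} Gb/s link: {share:.2f} Mbit/s per job")
```

Under these assumptions the 1Gb/s pipe leaves each job about a tenth of what the 10Gb/s pipe would, which is consistent with the view that the routing of the data, rather than the jobs themselves, was the limiting factor.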

LHCb
----

Andrew MN. described that LHCb at CERN were using the hampster set up to create individual VMs on the agile infrastructure and he was going to try doing something similar in Manchester and then possibly on the GridPP Cloud.

Relations with other Cloud projects
------------------------------------

We are in the process of joining the EGI Federated Cloud and had a 'phone meeting with Dave W and Matteo last Friday. This sounds as though it will be about 2 weeks of work, which would mean that we would be part of the demo at the user forum. We will then look at trying to run CMS (and hopefully other VOs if effort is available) jobs on the EGI FC.

We have been in touch with Helix Nebula and we will be a resource provider via the EGI FC but will also be part of a dialogue with Helix Nebula on how they can work with national structures such as GridPP and national funding agencies, especially concerning hybrid cloud models.

Security
--------

Regarding security, John the Security Officer had agreed to take on the security remit for the Cloud Group as well.

SI-1 Dissemination Report
==========================

There was no report.

SI-2 ATLAS weekly review & plans
=================================

RJ confirmed there had been minor issues and that space tokens were filling up. Group production was being done by those who might not know how their job would behave; this meant space tokens were being used up and it needed to be sorted out. RJ advised that at several sites people were submitting jobs using proof, running root in multicore - individual users had been contacted and there was a need to control the user base.

RJ reported there were FTS transfer issues in relation to Lustre - this was on hold pending the return of Shawn de Witt. AS was aware of the issue. It was hoped that the problem would go away with SL6 deployment. RJ considered it to be a low-level problem and they were keeping a watching brief.
RJ noted another issue with the SRM timeout option in relation to CASTOR sites: jobs went into pending mode then died. SL6 large-scale testing was imminent; they were awaiting news of the RAL half-day intervention. Delayed stream reprocessing would be put in as a modest priority. This equated to half of the resource globally, and was due to start at the beginning of April.

SI-3 CMS weekly review & plans
===============================

DC had left the meeting.

SI-4 LHCb weekly review & plans
================================

PC noted nothing major to report.

SI-5 Production Manager's Report
=================================

JC reported as follows:

1) Some Tier-2 sites have had issues with certain ATLAS user jobs (proof-lite running multi-threaded root) running with high cpu usage and causing WNs to crash. Individual users are being contacted to cancel jobs.

2) A new version of the DPM Collaboration document (final) has been produced with a first draft annex allocating tasks amongst partners. This is currently being revised - the stated GridPP contribution being 1 FTE but the current figures reflect comments about estimated current effort. Comments on the IPR and licensing text will be fed back.

3) The final WLCG Tier-2 availability report for January is now available: https://espace.cern.ch/WLCG-document-repository/ReliabilityAvailability/Tier-2/2013/WLCG_Tier2_Jan2013.pdf.

Comments on WLCG marked amber sites:

UCL: 41%:48% - SE problems and upgrade.
Manchester: 85%:85% - CEs stopped accepting jobs.
Durham: 65%:65% - The site was being 'rebuilt' during January and therefore in downtime.
Birmingham: 68%:68% - DPM head-node upgrade. ops VOMS settings incorrect.

Aside: ATLAS analysis availability is discussed at http://tinyurl.com/b9yy8ja.

4) GridPP contributors (mainly Wahid, Sam and Jens) will lead a storage 'workshop' at the EGI CF. This is leading to additional travel requests for which we may wish to set a quota.
There are also questions about registration for those with accepted submissions (posters/talks) as the fees are high (http://cf2013.egi.eu/registration/). Do we encourage day participation? Early bird registration is until 22nd February.

Apparently DK had been receiving travel requests for this; could we clarify the fee payment? DB advised that we wanted to support the EGI Community Forum but considered that 20 people going was too many. DB advised that it depended on whether the person going needed to be there for the week or not, or could we cap the cost at a certain level? DB would contact DK and check the cap level. The priority was for those with talks and posters to present. Those attending the storage workshop would probably attend on that day only. DB noted it was complicated - it could be a full day or it could be interleaved with the main conference as a thread. There were room issues as well. DB was not aware that the storage workshop was going ahead. DB would contact DK.

ACTION 488.3 DB to contact DK regarding travel and other costs to the EGI Community Forum in Manchester.

5) Glasgow is currently running 'at risk' due to power feed issues.

6) The ops team focus is going to be on networking/perfsonar, IPv6, SL6 and glexec over the coming month(s).

For information:

A) There was a GDB last week: http://indico.cern.ch/conferenceDisplay.py?confId=197800. Topics covered included EGI's plans post EMI, IPv6, reports from the Ops coordination team groups and an update on Clouds and Storage Federations.

SI-6 Tier-1 Manager's Report
=============================

Fabric:

1) Disk - both sets delivered - acceptance testing.

2) CPU - both delivered - one set completed our tests but waiting for a supplier fix to power distribution this week. Second set has our acceptance tests to run - will complete in about 2 weeks. Still need to configure both deliveries into final network configuration - cannot do this until early March.
In any case we plan not to deploy the new CPU to production capacity until late March; in the meantime we will use it for SL6 capacity testing and other CASTOR load tests.

3) A short core site network intervention is being scheduled for Tuesday 26th February (adds resilience). We are evaluating the likely impact and will schedule an at-risk/downtime as appropriate.

Service:

1) A relatively quiet 2 weeks:
https://www.gridpp.ac.uk/wiki/Tier1_Operations_Report_2013-02-06
https://www.gridpp.ac.uk/wiki/Tier1_Operations_Report_2013-02-13

2) CASTOR
- Chasing a problem where the CASTOR SRM response has an invalid format, impacting some ATLAS transfer management, particularly from QMW. The fault appears to be in the GSI/gsoap layer. We hope it will be fixed when we upgrade to SL6 SRMs. Will need to discuss with ATLAS if they can wait that long.
- Chasing a slowdown problem on a generation of disk servers which causes timeouts and causes us to be placed offline in ATLAS production. Rolling out a RAID controller firmware update to address this problem.

3) BATCH
- Problems with low start rate in batch system, causing periods of under-utilisation. Needing manual intervention regularly. Problem in Maui proving hard to diagnose. Working on a plan of how we will progress this but may need to deploy an alternative to Torque/Maui.

4) AFS
- Consultations underway on possible termination of rl.ac.uk AFS cell.

Staff:
- Paperwork for two system admin posts for Fabric team is in the system awaiting approval.

ACTION 488.4 AS to let DB know the SL5 estimated benchmark figure for new CPU purchase.

SI-7 LCG Management Board Report
=================================

There was no MB.

REVIEW OF ACTIONS
=================

438.9 AS to contact relevant site managers to ask whether or not they would be interested in having retired Tier-1 hardware - if a site were interested then they should submit a proposal as to what they want and why. Ongoing.
480.2 JC to consider the imminent demise of EMI and the resultant effect on the GridPP community - concrete issues and action requests to be brought back to the PMB. Ongoing.

484.1 DB to investigate plan for support of GridPP resources at Durham. PC as Chair of ScotGrid may have some input to this. Ongoing.

485.1 DB to speak to STFC regarding GridPP5 timetable. Done, item closed.

485.3 AS to poll for date in May/June for T1 review. Ongoing.

486.1 DB to make a proposal regarding the increase in T2K data storage requirements, so that this can be discussed. Done, item closed.

487.1 RJ/DC/PC to send PG a BibTeX file of experiment publications for the STFC e-VAL survey. Done, item closed.

487.4 ALL to send PG a list of the occasions DB was a keynote speaker at conferences. Ongoing.

487.5 AS to check with Simon Lambert and Juan at RAL about DPHEP and ATLAS data curation, and report back. Done, item closed.

ACTIONS AS AT 18.02.13
======================

438.9 AS to contact relevant site managers to ask whether or not they would be interested in having retired Tier-1 hardware - if a site were interested then they should submit a proposal as to what they want and why.

480.2 JC to consider the imminent demise of EMI and the resultant effect on the GridPP community - concrete issues and action requests to be brought back to the PMB.

484.1 DB to investigate plan for support of GridPP resources at Durham. PC as Chair of ScotGrid may have some input to this. DB/PC would meet to discuss this and report back to the PMB.

485.3 AS to poll for date in May/June for T1 review.

487.4 ALL to send PG a list of the occasions any PMB member was a keynote speaker at conferences.

488.1 AS to notify the community, giving three months' notice, that the AFS service would be shut down.

488.2 DB to speak to DK/STFC regarding the EGI fee payment and let AS know.

488.3 DB to contact DK regarding travel and other costs to the EGI Community Forum in Manchester.

488.4 AS to let DB know the SL5 estimated benchmark figure for new CPU purchase.

The next PMB would take place on Monday 25th February at 12:55 pm.