UKHEPGRID Archives

UKHEPGRID@JISCMAIL.AC.UK


Subject: Minutes of the 488th to 490th GridPP PMB meeting
From: David Britton <[log in to unmask]>
Reply-To: David Britton <[log in to unmask]>
Date: Mon, 11 Mar 2013 12:09:52 +0000
Content-Type: multipart/mixed
Parts/Attachments: text/plain (48 lines), 130218.txt (1 lines), 130225.txt (1 lines), 130304.txt (1 lines)

Dear All,


Please find attached the GridPP Project Management Board
meeting minutes for the 488th to 490th meetings.

The latest minutes can be found at:

http://www.gridpp.ac.uk/php/pmb/minutes.php?latest

as well as being listed with other minutes at:

http://www.gridpp.ac.uk/php/pmb/minutes.php

Cheers, Dave.

GridPP PMB Minutes 488 (18.02.2013)
===================================
Present: Dave Britton (Chair), Tony Doyle, Pete Gronbech, Andrew Sansum, Jeremy Coles, Tony Cass, Pete Clarke, Dave Colling, Roger Jones (Minutes - Suzanne Scott)
Apologies: Dave Kelsey, Steve Lloyd, Claire Devereux, Neil Geddes

0) Closure of AFS Service
=========================
AS had circulated a report giving an overview of the history of this. In 2007 the Tier-1 Board had said we should shut the service down; some complaints had been received from the User Board regarding users, and a slight upgrade had been made to keep it going. The issue had now come round again because the hardware was fairly old and we needed to decide what to do: the AFS service did not fit in with the Tier-1 model and the community had not found it useful. Recently there had been fileserver problems. There were still some users, but usage was limited. To maintain the service going forward we would need to invest effort in staff, upgrades and the user registration process, and we probably could not support it at that level. No funding was available. The logical outcome was to close it.

DB asked if there was a use case for AFS as part of the core mission of the Tier-1. AS noted there was not. DB therefore considered it peripheral and not a service that needed to be run, although closing it might affect some individual users. DB considered we should turn it off unless we had to respond to an urgent issue. This was agreed. AS would broadcast the notification. DB noted there was no defined use case, and therefore no justification for hardware refresh and manpower; we would announce the termination of the service and see what the outcome was. This was agreed. PC asked whether a long period of advertisement would be required. AS advised that he preferred to keep this within a four-month period and would send out a notification and reminders.

ACTION 488.1 AS to notify the community, giving three months' notice, that the AFS service would be shut down.

1) Quarterly Report Summary
===========================
PG had circulated a report. Compared with the previous quarter the experiments and the Tier-1 were green, though there were some reds at the Tier-1: the LFC and FTS services fell below the 99% target. The CMS VO box metric was no longer required. Regarding storage, there had been similar drops due to power outages; a lot of effort had been put into power incidents and upgrades. CASTOR staffing levels were critical, and Jens Jensen was working on three recruitments at the moment.

For ATLAS all metrics were green. RJ advised that ATLAS use of resources was not all green but site performance was acceptable. For CMS all metrics were green; all was OK apart from Bristol. DC noted that we needed to tread carefully with this site due to manpower and other issues; there were storage issues to be resolved, though there had been an improvement. For LHCb all metrics were green; RAL had performed excellently during the quarter. For 'Other' experiments all metrics were green: the EPIC VO had been added, NGS VOs had been added to the VOMS server, T2K had increasing storage requirements, and there were LFC support issues. For Ops, everything was going fairly smoothly apart from some upgrade issues. For DataGroup all metrics were green. For Experiment Support all metrics were green.
2) EGI Fees
===========
It was noted that JISC would not pay the current year's EGI fee for the UK, and there had therefore been a request that the £60k be funded by someone other than JISC, i.e. GridPP and NGS. NGS could pay half and it was requested that GridPP pay £30-35k. The only mechanism available was the travel budget; DK could let us know whether this was feasible. DB asked for comments. RJ asked what we got in return for the EGI fee. DB advised that staff were funded by EGI: there were a few FTE, plus 4 x 0.5 FTE at the Tier-2s. Matching effort was also required - other people reported time into the PPT timesheet system. There was one year left of EGI, which had followed on from EGEE-II and EGEE-III. DB needed to speak to DK and STFC before taking action. JC advised that Ireland hadn't paid and had withdrawn, and the Portuguese payment was delayed. 4 x FTE were also on APEL and the GOCDB, which we relied on. It was agreed that DB should speak to DK/STFC regarding the EGI fee payment and let AS know.

ACTION 488.2 DB to speak to DK/STFC regarding the EGI fee payment and let AS know.

3) Horizon 2020
===============
It was noted that the EU were widening their search for experts in all fields for Horizon 2020 proposals. Had anyone responded to the call? No-one had. Did anyone wish to volunteer? There were strategic priorities and an agenda to be discussed. Anyone who wished to get involved should let DB know.

STANDING ITEMS
==============

SI-0 Report from Cloud Group
============================
DC advised that meetings were happening fortnightly; the twiki was in progress; hardware was limited so far. DC reported as follows:

Organisation is settling down and we have fortnightly meetings. There is a growing twiki and a community is starting to form. There is an ongoing discussion between Ian C., JC and DC about how best to form this into a community. Other cloud sites within GridPP are encouraged to add a description and link on the GridPP twiki site (as ECDF have already done).

Physical Hardware (description from Adam H.)
--------------------------------------------
Four compute nodes are in the process of being provisioned as extra compute nodes. Each has 64GB RAM and 32 cores with HyperThreading. This will bring the total compute capacity to 200 cores and 400 GiB RAM. A storage node is also being provisioned, to provide an S3-compatible service. The raw usable capacity (before any reduction for replication) is 20 TiB. Further storage may be added using space on other nodes in the cluster, if the loading on single machines is such that multiple roles can be accommodated safely. The cluster is running OpenStack Folsom. Hosts have been added to the Imperial monitoring system. It is planned to provide monitoring of individual instances too.

Activities:
-----------

Cloud Storage testing
---------------------
None of the LHC experiments are currently using cloud storage; however, storage is being added to the cluster so that Wahid can perform some tests (an illustrative endpoint check is sketched after these minutes).

ATLAS
-----
Nobody was able to report on ATLAS activities at the last meeting, but at the previous meeting Peter L. had reported that he was planning to use cloud scheduler. As yet there have been no images on the GridPP cloud from ATLAS, but I believe that Peter had some configuration to do in Lancaster before he would try anything at our end.

CMS
---
CMS have been very active both with the GridPP cloud and the HLT farm at CERN.
The HLT farm has run ~4000 concurrent reprocessing jobs; however, under that loading the jobs started to fail. This is believed to be a simple network bandwidth problem, as the data was going over the 1Gb/s pipe rather than the 10Gb/s one. After the low energy run, Andrew L. and Toni (from CERN) are to map the requirements of the reprocessing jobs and then rearrange the network as needed. These jobs are submitted from a glideinWMS sitting at CERN, with data read in and out over xroot. In the UK, user data analysis jobs are now being submitted using the regular CMS tools, going via the glideinWMS at RAL and being run on the GridPP cloud at Imperial. It should be noted that the glideinWMS not only controls the jobs but also performs the instantiation of the VMs themselves. Data is read in using xroot and staged out using conventional grid tools. Currently some jobs fail because of stage-out timeout problems; this is being investigated.

LHCb
----
Andrew MN. described that LHCb at CERN were using the hampster set-up to create individual VMs on the agile infrastructure, and he was going to try doing something similar in Manchester and then possibly on the GridPP cloud.

Relations with other Cloud projects
-----------------------------------
We are in the process of joining the EGI Federated Cloud and had a phone meeting with Dave W and Matteo last Friday. This sounds as though it will be about two weeks of work, which would mean that we would be part of the demo at the user forum. We will then look at trying to run CMS jobs (and hopefully those of other VOs, if effort is available) on the EGI FC. We have been in touch with Helix Nebula; we will be a resource provider via the EGI FC, but will also be part of a dialogue with Helix Nebula on how they can work with national structures such as GridPP and national funding agencies, especially concerning hybrid cloud models.

Security
--------
Regarding security, John the Security Officer had agreed to take on the security remit for the Cloud Group as well.

SI-1 Dissemination Report
=========================
There was no report.

SI-2 ATLAS weekly review & plans
================================
RJ confirmed there had been minor issues and that space tokens were filling up. Group production was being done by those who might not know how their jobs would behave; this meant space tokens were being used up and it needed to be sorted out. RJ advised that at several sites people were submitting jobs using proof, running root in multicore mode - individual users had been contacted and there was a need to control the user base. RJ reported there were FTS transfer issues in relation to Lustre; this was on hold pending the return of Shawn de Witt. AS was aware of the issue. It was hoped that the problem would go away with SL6 deployment. RJ considered it a low-level problem and they were keeping a watching brief. RJ noted another issue with the SRM timeout option in relation to CASTOR sites: jobs went into pending mode then died. SL6 large-scale testing was imminent, and they were awaiting news of the RAL half-day intervention. Delayed stream reprocessing would be put in at a modest priority; this equated to half of the resource globally and was due to start at the beginning of April.

SI-3 CMS weekly review & plans
==============================
DC had left the meeting.

SI-4 LHCb weekly review & plans
===============================
PC noted nothing major to report.
SI-5 Production Manager's Report
================================
JC reported as follows:

1) Some Tier-2 sites have had issues with certain ATLAS user jobs (proof-lite running multi-threaded root) running with high CPU usage and causing WNs to crash. Individual users are being contacted to cancel jobs.

2) A new version of the DPM Collaboration document (final) has been produced with a first draft annex allocating tasks amongst partners. This is currently being revised - the stated GridPP contribution being 1 FTE, but the current figures reflect comments about estimated current effort. Comments on the IPR and licensing text will be fed back.

3) The final WLCG Tier-2 availability report for January is now available: https://espace.cern.ch/WLCG-document-repository/ReliabilityAvailability/Tier-2/2013/WLCG_Tier2_Jan2013.pdf. Comments on WLCG marked amber sites: UCL 41%:48% - SE problems and upgrade. Manchester 85%:85% - CEs stopped accepting jobs. Durham 65%:65% - the site was being 'rebuilt' during January and therefore in downtime. Birmingham 68%:68% - DPM head-node upgrade; ops VOMS settings incorrect. Aside: ATLAS analysis availability is discussed at http://tinyurl.com/b9yy8ja.

4) GridPP contributors (mainly Wahid, Sam and Jens) will lead a storage 'workshop' at the EGI CF. This is leading to additional travel requests for which we may wish to set a quota. There are also questions about registration for those with accepted submissions (posters/talks), as the fees are high (http://cf2013.egi.eu/registration/). Do we encourage day participation? Early bird registration is until 22nd February.

Apparently DK had been receiving travel requests for this; could we clarify the fee payment? DB advised that we wanted to support the EGI Community Forum but considered that 20 people going was too many. DB advised that it depended on whether the person going needed to be there for the week or not, or could we cap the cost at a certain level? DB would contact DK and check the cap level. The priority was for those with talks and posters to present. Those attending the storage workshop would probably attend on that day only. DB noted it was complicated - the workshop could be a full day or it could be interleaved with the main conference as a thread, and there were room issues as well. DB was not aware that the storage workshop was going ahead; DB would contact DK.

ACTION 488.3 DB to contact DK regarding travel and other costs to the EGI Community Forum in Manchester.

5) Glasgow is currently running 'at risk' due to power feed issues.

6) The ops team focus is going to be on networking/perfSONAR, IPv6, SL6 and glexec over the coming month(s).

For information:

A) There was a GDB last week: http://indico.cern.ch/conferenceDisplay.py?confId=197800. Topics covered included EGI's plans post EMI, IPv6, reports from the Ops coordination team groups and an update on Clouds and Storage Federations.

SI-6 Tier-1 Manager's Report
============================
Fabric:

1) Disk - both sets delivered - acceptance testing.

2) CPU - both delivered - one set completed our tests but we are waiting for a supplier fix to power distribution this week. The second set has our acceptance tests to run - will complete in about 2 weeks. Still need to configure both deliveries into the final network configuration - cannot do this until early March. In any case we plan not to deploy the new CPU to production capacity until late March; in the meantime we will use it for SL6 capacity testing and other CASTOR load tests.
3) A short core site network intervention is being scheduled for Tuesday 26th February (adds resilience). We are evaluating the likely impact and will schedule an at-risk/downtime as appropriate.

Service:

1) A relatively quiet 2 weeks:
https://www.gridpp.ac.uk/wiki/Tier1_Operations_Report_2013-02-06
https://www.gridpp.ac.uk/wiki/Tier1_Operations_Report_2013-02-13

2) CASTOR
- Chasing a problem where the CASTOR SRM response has an invalid format, impacting some ATLAS transfer management, particularly from QMW. The fault appears to be in the GSI/gsoap layer. We hope it will be fixed when we upgrade to SL6 SRMs; will need to discuss with ATLAS whether they can wait that long.
- Chasing a slowdown problem on a generation of disk servers which causes timeouts and causes us to be placed offline in ATLAS production. Rolling out a RAID controller firmware update to address this problem.

3) BATCH
- Problems with a low start rate in the batch system, causing periods of under-utilisation and needing manual intervention regularly. The problem in Maui is proving hard to diagnose. Working on a plan for how we will progress this, but we may need to deploy an alternative to Torque/Maui.

4) AFS
- Consultations underway on possible termination of the rl.ac.uk AFS cell.

Staff:
- Paperwork for two system admin posts for the Fabric team is in the system awaiting approval.

ACTION 488.4 AS to let DB know the SL5 estimated benchmark figure for the new CPU purchase.

SI-7 LCG Management Board Report
================================
There was no MB.

REVIEW OF ACTIONS
=================
438.9 AS to contact relevant site managers to ask whether or not they would be interested in having retired Tier-1 hardware - if a site were interested then they should submit a proposal as to what they want and why. Ongoing.
480.2 JC to consider the imminent demise of EMI and the resultant effect on the GridPP community - concrete issues and action requests to be brought back to the PMB. Ongoing.
484.1 DB to investigate plan for support of GridPP resources at Durham. PC as Chair of ScotGrid may have some input to this. Ongoing.
485.1 DB to speak to STFC regarding GridPP5 timetable. Done, item closed.
485.3 AS to poll for date in May/June for T1 review. Ongoing.
486.1 DB to make a proposal regarding the increase in T2K data storage requirements, so that this can be discussed. Done, item closed.
487.1 RJ/DC/PC to send PG a BibTeX file of experiment publications for the STFC e-VAL survey. Done, item closed.
487.4 ALL to send PG a list of the occasions DB was a keynote speaker at conferences. Ongoing.
487.5 AS to check with Simon Lambert and Juan at RAL about DPHEP and ATLAS data curation, and report back. Done, item closed.

ACTIONS AS AT 18.02.13
======================
438.9 AS to contact relevant site managers to ask whether or not they would be interested in having retired Tier-1 hardware - if a site were interested then they should submit a proposal as to what they want and why.
480.2 JC to consider the imminent demise of EMI and the resultant effect on the GridPP community - concrete issues and action requests to be brought back to the PMB.
484.1 DB to investigate plan for support of GridPP resources at Durham. PC as Chair of ScotGrid may have some input to this. DB/PC would meet to discuss this and report back to the PMB.
485.3 AS to poll for date in May/June for T1 review.
487.4 ALL to send PG a list of the occasions any PMB member was a keynote speaker at conferences.
488.1 AS to notify the community, giving three months' notice, that the AFS service would be shut down.
488.2 DB to speak to DK/STFC regarding the EGI fee payment and let AS know.
488.3 DB to contact DK regarding travel and other costs to the EGI Community Forum in Manchester.
488.4 AS to let DB know the SL5 estimated benchmark figure for the new CPU purchase.

The next PMB would take place on Monday 25th February at 12:55 pm.
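
Note on the cloud storage testing mentioned in minutes 488 above: the report says an S3-compatible service is being provisioned so that tests can be run. Purely as an illustration of the kind of basic round-trip check such testing might start with, here is a minimal sketch using the boto3 client against a generic S3-compatible endpoint; the endpoint URL, credentials and bucket name are placeholders, not details of the Imperial service.

    # Minimal smoke test for an S3-compatible endpoint (illustrative only).
    # The endpoint URL, credentials and bucket name are placeholders,
    # not values taken from the GridPP cloud described in the minutes.
    import boto3

    s3 = boto3.client(
        "s3",
        endpoint_url="https://s3.example.gridpp.cloud",  # hypothetical endpoint
        aws_access_key_id="EXAMPLE_KEY",
        aws_secret_access_key="EXAMPLE_SECRET",
    )

    bucket = "smoke-test"
    s3.create_bucket(Bucket=bucket)                               # create a test bucket
    s3.put_object(Bucket=bucket, Key="hello.txt", Body=b"hello")  # write an object
    obj = s3.get_object(Bucket=bucket, Key="hello.txt")           # read it back
    assert obj["Body"].read() == b"hello"
    print("round-trip OK")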
GridPP PMB Minutes 489 (25.02.2013)
===================================
Present: Dave Britton (Chair), Pete Gronbech, Andrew Sansum, Jeremy Coles, Tony Cass, Dave Colling, Dave Kelsey, Steve Lloyd, Claire Devereux (Minutes - Pete Gronbech)
Apologies: Pete Clarke, Tony Doyle, Roger Jones, Neil Geddes

1) Finances
===========
DB and AS had discussed the finance plan. AS had not yet looked at DB's figures to double-check them - the amount of disk might be a little low. AS would check later today. The RAL T2 figures were in the budget.

2) GridPP5
==========
DB had forwarded an email to the PMB which gave an outline schedule. The SoI should be submitted to the December 2013 Science Board meeting and the proposal to the PPRP in February 2014. DB noted that this meant going through the PPRP rather than another funding mechanism (such as the Consolidated Grant); this was probably preferable, as a four-year project running until LS2 was possible. It meant that the timing had to work backwards from the SoI in December: the key issue was to do the bulk of the work in September/October this year so that the proposal could be finalised by February 2014, with Christmas falling during that period. The GridPP31 Collaboration Meeting should therefore focus on scoping out GridPP5. DC considered that this should work well, as the updated TDR computing documents would have to be ready for September. Over the summer we needed to be thinking about how we wished to shape this.

DB suggested that we needed to think about the following issues:

1. An operational vs developmental project: a good argument for any development would be needed (new technologies were a concern, as was the successful maintenance of current operations). How we packaged GridPP5 needed to be considered carefully, to avoid it being separated off with the risk of not being funded at all.
2. Technical implementation: what would the GridPP5 Grid look like? This question was tied up with cloud work and developments in computer hardware.
3. Political instantiation of the grid: would it be more of the same, or would it be rationalised to fewer institutes?
4. Boundary services: NGI/EGI APEL, CA, VOMS, and network - these were all things we currently relied upon. How would they be sustained?
5. There was currently a big push in the UK to join up the computing ecosystem (including HPC), and this needed to be an energy-efficient computing ecosystem. We could not submit a bid in isolation and we needed to know how we might relate to this new world.
6. Impact agenda: how do we respond to this and can we get funding in this area?

It was agreed that we should structure the meeting at GridPP31 around these (and possibly other) issues. DC asked whether there was any European activity. CD noted there was no follow-on from EGI-InSPIRE; there were some smaller projects, but no details were available yet. DB considered it unlikely that we would get significant outside funding. We had 4 x 0.5 FTE at institutes and several at the T1 - the total was around 6-7 FTE. Potentially we might have to ask for more this time, but to do that we would have to show very clearly how we would fit into the UK ecosystem.

3) Support of WN/UI Tarball
===========================
JC advised that Tiziana had enquired about ongoing support for the WN and UI tarballs. Matt Doidge thought that support should be fairly low-load, but work was required for each new release. It was noted that there were other countries using it (approximately 10 non-UK sites), but in some ways it was good to be offering support for something we were using.
We would have to say that it was on a 'best effort' basis only - we had no extra effort available, so if the load increased we could not commit to supporting it.

4) HEPSYSMAN and Security Training
==================================
The PMB had approved the revised HEPSYSMAN/security training plan.

STANDING ITEMS
==============

SI-1 Dissemination Report
-------------------------
SL reported on behalf of Neasan O'Neill as follows:

Royal Soc:
* Attended digital training for the exhibition; I'll be helping compile digital content and managing online interactions.
* Compiled ideas for "eye witness" stories for the booklet.

News Items:
* VomsSnooper published.
* Working with Claire Devereux on a profile of her as a news item.
* Working on an EPIC news item.

Social Media:
* We now have a Facebook page: http://facebook.com/gridpp
* Have drawn up a small plan to increase presence on the various channels.
* Could people on the PMB push use of the blogs again?

Events:
* We have a booth at CF13; currently trying to work out what we have to offer and who is attending.

KE/Impact:
* Working on sessions/talks for GridPP30; suggested agenda here: http://www.gridpp.ac.uk/gridpp30/day2.html
* Have Jamie Coleman to talk at GridPP30.
* Trying to sort out a date for Mark Mitchell's talk at Edinburgh's TechCube.
* I have wording for GridPP's offering to academia and SMEs waiting for feedback.

SI-2 ATLAS weekly review & plans
--------------------------------
There was no report, RJ was absent.

SI-3 CMS weekly review & plans
------------------------------
DC noted nothing of significance to report.

SI-4 LHCb weekly review & plans
-------------------------------
There was no report, PC was absent.

SI-5 Production Manager's Report
--------------------------------
JC reported as follows:

1) Tiziana Ferrari (EGI) has asked about GridPP support for the tarball WN/UI. (See email to PMB on 20th February.)

2) ATLAS users using multi-core proof caused a few additional problems during last week, but overall the situation was handled well. There is now a discussion about how to deal with such jobs in future if there is a genuine user need for them.

3) PerfSONAR showed some, but not all, GridPP sites having poor rates to BNL. TCP tuning of several parameters appears to have markedly improved the situation, and there is now work to understand which settings particularly influence the rates and why (typical parameters are sketched after these minutes).

4) The GDB actions list (https://twiki.cern.ch/twiki/bin/view/LCG/GDBActionInProgress) has been updated; I highlight these activities:
- evaluation of the new CVMFS version (2.1.5) starting (new features: NFS export, shared caches)
- starting tests with volunteering sites for multi-core jobs
- the next pre-GDB (12th March, http://indico.cern.ch/conferenceDisplay.py?confId=223689) will be on "Cloud issues" and building a work plan for future work in the area
- SHA-2 readiness testing of sites is starting: no need for RFC proxies any more
- sites with perfSONAR should move to a centrally managed configuration.

5) In addition to Glasgow, sites that are to start looking at IPv6 are Imperial, QMUL and possibly Oxford.

SI-6 Tier-1 Manager's Report
----------------------------
AS reported as follows:

Fabric:

1) Disk - both sets delivered - acceptance testing.

2) CPU - both delivered - one set completed our tests but we are waiting for a supplier fix to power distribution this week. The second set has our acceptance tests to run - will complete in about 2 weeks.

3) A short core site network intervention is being scheduled for Tuesday 26th February (adds resilience).
We have declared a 1 hour "at risk".

4) We expect to replace the core Tier-1 network switch (C300) on Tuesday 12th March. Details to be finalised.

5) We lost a disk server filesystem (gdss594) - a tape-backed server - 68 T2K files were un-migrated and lost. A post mortem review is underway.

Service:

1) A quiet week: https://www.gridpp.ac.uk/wiki/Tier1_Operations_Report_2013-02-20

2) CASTOR
- The CASTOR SRM was down for a few hours on Saturday evening - cause still unknown.
- Chasing a problem where the CASTOR SRM response has an invalid format, impacting some ATLAS transfer management, particularly from QMW. The fault appears to be in the GSI/gsoap layer. We hope it will be fixed when we upgrade to SL6 SRMs; still need to discuss with ATLAS whether they can wait that long.
- Chasing a slowdown problem on a generation of disk servers which causes timeouts and causes us to be placed offline in ATLAS production. Rolling out a RAID controller firmware update to address this problem.

3) BATCH
- Problems with a low start rate in the batch system, causing periods of under-utilisation, mainly when jobs are very short, and needing manual intervention regularly. The problem in Maui is proving hard to diagnose. Working on a plan for how we will progress this, but we may need to deploy an alternative to Torque/Maui.

4) Investigating unusual job failure rates for LHCb and ATLAS. These may occur during job set-up and be related to CVMFS; investigations are underway.

Staff:

1) Paperwork for two system admin posts for the Fabric team is in the system awaiting approval by STFC.

SI-7 LCG Management Board Report
--------------------------------
There had been no MB.

ACTIONS AS OF 25.02.13
======================
438.9 AS to contact relevant site managers to ask whether or not they would be interested in having retired Tier-1 hardware - if a site were interested then they should submit a proposal as to what they want and why.
480.2 JC to consider the imminent demise of EMI and the resultant effect on the GridPP community - concrete issues and action requests to be brought back to the PMB.
484.1 DB to investigate plan for support of GridPP resources at Durham. PC as Chair of ScotGrid may have some input to this. DB/PC would meet to discuss this and report back to the PMB.
485.3 AS to poll for date in May/June for T1 review.
487.4 ALL to send PG a list of the occasions any PMB member was a keynote speaker at conferences.
488.1 AS to notify the community, giving three months' notice, that the AFS service would be shut down.
488.2 DB to speak to DK/STFC regarding the EGI fee payment and let AS know.
488.3 DB to contact DK regarding travel and other costs to the EGI Community Forum in Manchester.
488.4 AS to let DB know the SL5 estimated benchmark figure for the new CPU purchase.

The next PMB would take place on Monday 4th March 2013 at 12:55 pm.
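
Note on item 3 of the Production Manager's report in minutes 489 above: it records that TCP tuning markedly improved transfer rates to BNL, with work ongoing to understand which settings matter. The sketch below lists the standard Linux kernel parameters typically examined in this kind of wide-area tuning and shows how they might be inspected or applied; the values are illustrative examples only, not the settings actually used at GridPP sites.

    # Illustrative sketch of kernel TCP settings typically examined when tuning
    # wide-area transfer rates. The parameter names are standard Linux sysctls;
    # the values are example figures, not those applied at GridPP sites.
    from pathlib import Path

    SYSCTLS = {
        "net.core.rmem_max": "67108864",                 # max receive socket buffer (bytes)
        "net.core.wmem_max": "67108864",                 # max send socket buffer (bytes)
        "net.ipv4.tcp_rmem": "4096 87380 33554432",      # min/default/max TCP receive buffer
        "net.ipv4.tcp_wmem": "4096 65536 33554432",      # min/default/max TCP send buffer
        "net.ipv4.tcp_congestion_control": "htcp",       # congestion algorithm often used on long fat networks
    }

    def sysctl_path(name: str) -> Path:
        # sysctl names map onto files under /proc/sys with dots replaced by slashes
        return Path("/proc/sys") / name.replace(".", "/")

    for name, value in SYSCTLS.items():
        current = sysctl_path(name).read_text().strip()
        print(f"{name}: current={current!r} proposed={value!r}")
        # To apply a value (requires root), uncomment:
        # sysctl_path(name).write_text(value + "\n")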
GridPP PMB Minutes 490 (04.03.2013)
===================================
Present: Dave Britton (Chair), Pete Gronbech, Andrew Sansum, Jeremy Coles, Tony Cass, Dave Colling, Dave Kelsey, Steve Lloyd, Roger Jones (Minutes - Suzanne Scott)
Apologies: Tony Doyle, Pete Clarke, Claire Devereux, Neil Geddes

1. Tier-1 Resources
===================
DB advised that historically LHCb had used the figure of 18.6% to calculate the LHCb fraction of resources, based on authors multiplied by global resource requests. LHCb had now realised that this figure was not correct for the Tier-1 - it was right for the Tier-2. Applying that algorithm to the Tier-1 meant resources there were chronically under-provided, which had a big effect on LHCb. DB had confirmed with PC that the formula was 'authors in the UK' divided by 'authors in all Tier-1 countries' (a worked example of the calculation follows these minutes). The Tier-1 was therefore currently providing less to LHCb at RAL, but this was what LHCb had requested. It was unlikely that we could find extra resources in GridPP4. DB wanted to know what the actual number was in order to see how far we could meet it. There was a pressing need for disk - an extra 300TB. It was noted we had ~1PB of contingency, so we could probably meet LHCb part-way; we had to get through the procurements first before determining the timing. We would need to treat LHCb like ATLAS and respond appropriately.

AS was concerned: over the coming year we had three calls on disk: 1. the FY14 delivery would be more than 4PB; 2. the operational size of existing tranches ranged considerably, and solving problems with tranches would be beyond our ability to cope if the buffer dropped below 1PB; 3. we had to deploy another storage instance this year in FY13. DB asked if it was possible to provision LHCb with tape-backed disk; AS wasn't sure. DB concluded that we needed to get the numbers from LHCb and look at the operational concerns from our side. DC thought we should help if we could. AS advised that if we lost the Streamline 2009 we might not have enough capacity. DB asked: if it was 300TB that they wanted, what was the risk of giving them 100/200/300TB and making a decision on that basis? We might then be able to meet them part-way. PC would provide the final percentage figure of UK authors over Tier-1 authors. AS asked about Alice - were Alice short of authors from Tier-1 countries? DB advised that we weren't funded to support Alice at all, and the provision we currently made for them was best effort.

DB advised that the other issue was that pledges had been made last year before the experiment requirements were approved by the CRRB. Which numbers did we provision against? Sensibly this would be against the numbers in Rebus, but that would be less than actually pledged in some cases. If, on the other hand, we followed what we had pledged, then we would be provisioning against the wrong numbers. DB had emailed Ian Bird, asking whether we should provision against Rebus or the pledges - a response was awaited. It made sense to provision against Rebus; we would return to this issue.

2. AOCB
=======
- PC had withdrawn his request for the LHCb workshop; there were not enough people attending the IoP to make it worthwhile.
- EGI fees: what was the timing of this? DB had received an email from Adrian about this; DB needed to speak to CD. AS asked whether we needed to look at the risks w.r.t. project planning of NES. DB noted that the NGI were developing a disaster management plan to cover all services on which GridPP depended.
DB had emailed Janet Seed and CAP (PC) recently about the funding issue. The LHC computing profile was not high enough; recent funding had gone to projects in development, not to established projects like GridPP.
- Regarding travel for HEPiX: the next meeting was on 15-19 April in Bologna, with early-bird registration until mid-March. DK asked how many people we intended to fund. For the last meeting, in Prague, three people from the Tier-1 and PG had attended; there had been no engagement from the Tier-2s. DB considered that it would be good if up to one person per Tier-2 could attend. AS noted he was hoping for 2-3 people to attend from the Tier-1 this time. DB thought it entirely reasonable for a few from the Tier-1 and one from each Tier-2 to go; we could consider any requests beyond those figures. JC would remind the Ops people tomorrow and encourage attendance. He would also email suggestions to DK regarding who would be best to go.
- PG asked about the allocation of hardware funding. DB advised that CMS needed to say whether Bristol was part of their policy or not. SL noted that GridPP5 had not yet been discussed - we would keep things going until then. CMS didn't update their metrics very often.

STANDING ITEMS
==============

SI-1 Dissemination Report
-------------------------
SL reported that Alex Efimov, who had worked at QMUL for a time, had asked for a meeting with himself and Neasan regarding 'industry engagement'. SL and Neasan O'Neill would meet with him.

SI-2 ATLAS weekly review & plans
--------------------------------
RJ noted small issues with the batch farm at RAL - they had problems filling the farm due to Maui. Apart from that, there were issues with SL6 and node-testing; they had people making progress with the testing infrastructure. Xrootd was being rolled out across DPM sites. Durham was currently up and running.

SI-3 CMS weekly review & plans
------------------------------
DC noted nothing major to report.

SI-4 LHCb weekly review & plans
-------------------------------
PC was absent.

SI-5 Production Manager's Report
--------------------------------
JC reported as follows:

1) As of today sites are being alerted about the end-of-life of EMI-1 middleware and the decommissioning campaign is starting (sites have until the end of March to remove the middleware).

2) EMI 3 (Monte Bianco) is expected to be released this Thursday. We will review UK staged-rollout involvement at tomorrow's ops meeting.

3) SNO+ and T2K have both experienced proxy renewal issues in recent months; jobs are not consistently failing, so it is difficult to pin-point the underlying problem(s). At least one problem was reported as a bug with the WMS that was subsequently fixed, but the release failed staged rollout for other reasons.

It was noted that both Sussex and Durham were up and running at the moment. PG asked whether Sussex had been added to the accounting metric page; SL noted he would add them.

SI-6 Tier-1 Manager's Report
----------------------------
AS reported as follows:

Fabric:

1) Disk - both sets delivered - acceptance testing, projected to end 1th and 15th March (if no problems).

2) CPU - both delivered - one set available for the test queue (probably SL6); the second set is expected to complete acceptance tests this week. We do not plan to deploy to production queues until required to meet MoU commitments.

3) We expect to replace the core Tier-1 network switch (C300) on Tuesday 12th March. The Tier-1 network is complex, with many switch stacks.
We expect to schedule a 6 hour downtime (TBC), which includes some contingency to allow time to resolve problems with uplinks or switch stacks disturbed by the change. Full details will be announced later this week.

4) Preparations are underway for moving the Tier-1 to the new 40Gb network infrastructure. A major intervention is likely in late April or May.

Service:

1) A relatively quiet week: https://www.gridpp.ac.uk/wiki/Tier1_Operations_Report_2013-02-27

AS advised that there had been job-start rate issues in Maui - they could solve it when it occurred, but it seemed worse when the experiments were submitting short jobs; the issue tended to come and go. The other issue was a low-level loss of jobs in the set-up phase for both ATLAS and LHCb - work was ongoing on this, but the cause was not yet known.

2) CASTOR - development continues on 2.1.13. Stress testing is well advanced. Some tape servers have been upgraded to test in production. We expect to upgrade the Facilities instance this month, with the Tier-1 instances likely to start in April.

Staff:

1) Paperwork for two system admin posts for the Fabric team is in the system awaiting approval by STFC.

SI-7 LCG Management Board Report
--------------------------------
There had been no MB.

REVIEW OF ACTIONS
=================
438.9 AS to contact relevant site managers to ask whether or not they would be interested in having retired Tier-1 hardware - if a site were interested then they should submit a proposal as to what they want and why. Ongoing.
480.2 JC to consider the imminent demise of EMI and the resultant effect on the GridPP community - concrete issues and action requests to be brought back to the PMB. Ongoing.
484.1 DB to investigate plan for support of GridPP resources at Durham. PC as Chair of ScotGrid may have some input to this. DB/PC would meet to discuss this and report back to the PMB. Done, item closed.
485.3 AS to poll for date in May/June for T1 review. Ongoing.
487.4 ALL to send PG a list of the occasions any PMB member was a keynote speaker at conferences. Ongoing.
488.1 AS to notify the community, giving three months' notice, that the AFS service would be shut down. Ongoing.
488.2 DB to speak to DK/STFC regarding the EGI fee payment and let AS know. Ongoing.
488.3 DB to contact DK regarding travel and other costs to the EGI Community Forum in Manchester. Done, item closed.
488.4 AS to let DB know the SL5 estimated benchmark figure for the new CPU purchase. Done, item closed.

ACTIONS AS OF 04.03.13
======================
438.9 AS to contact relevant site managers to ask whether or not they would be interested in having retired Tier-1 hardware - if a site were interested then they should submit a proposal as to what they want and why.
480.2 JC to consider the imminent demise of EMI and the resultant effect on the GridPP community - concrete issues and action requests to be brought back to the PMB.
485.3 AS to poll for date in May/June for T1 review.
487.4 ALL to send PG a list of the occasions any PMB member was a keynote speaker at conferences.
488.1 AS to notify the community, giving three months' notice, that the AFS service would be shut down.
488.2 DB to speak to DK/STFC regarding the EGI fee payment and let AS know.

The next meeting would take place on Monday 11th March at 12:55 pm. RJ advised of apologies for the next two meetings.
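
Note on item 1 of minutes 490 above: it records that the LHCb Tier-1 share should be calculated as 'authors in the UK' divided by 'authors in all Tier-1 countries', applied to the experiment's global resource request. The sketch below works that calculation through with made-up placeholder numbers (the real percentage was still to be supplied by PC), purely to illustrate the arithmetic.

    # Worked sketch of the Tier-1 share calculation discussed in item 1 of minutes 490:
    #   UK share = (LHCb authors in the UK) / (LHCb authors in all Tier-1 countries)
    # applied to the experiment's global Tier-1 resource request.
    # All numbers below are made-up placeholders, not figures from the minutes.

    uk_authors = 80                          # placeholder
    tier1_country_authors = 430              # placeholder
    global_tier1_disk_request_tb = 10_000    # placeholder, TB

    uk_share = uk_authors / tier1_country_authors
    uk_provision_tb = uk_share * global_tier1_disk_request_tb

    print(f"UK share: {uk_share:.1%}")
    print(f"Implied UK Tier-1 disk provision: {uk_provision_tb:.0f} TB")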
