GridPP PMB Minutes 380 (08.03.10)
=================================

Present: David Britton (Chair), Steve Lloyd, Sarah Pearce, Andrew Sansum, Tony Doyle, Dave Colling, Robin Middleton, Pete Clarke, Roger Jones, Pete Gronbech (Suzanne Scott, Minutes)

Apologies: David Kelsey, Tony Cass, John Gordon, Jeremy Coles, Glenn Patrick, Neil Geddes

1. DB Agenda for RHUL
======================

SL asked when we would switch to the new arrangements outlined in the GridPP4 proposal - presumably not this time? DB confirmed that we would switch in advance of GridPP4, but not before the PPRP; it would be better to discuss this at Ambleside. It was noted that the PMB membership would change, the Deployment Board would cease, and the Ops Team would start.

SL asked about the Tier-2 hardware situation. DB suggested that a working group be set up, with the criteria set by the experiments, though this might not be possible on the timescale of RHUL. PG asked how the GridPP4 funding would be split between sites. SL noted that Monte Carlo and analysis would be treated separately. PG asked whether the split would be based on measured site performance or on the amount of resource provided, and what would happen to the small sites. SL advised that these were all issues that needed to be discussed. DB advised that ATLAS and CMS should drive this at a high level; RJ noted that a uniform metric was probably not possible. DB advised that the starting point would be receipt of high-level statements from ATLAS and CMS, after which the PMB could discuss implementation.

SL advised that there was also regional monitoring to consider - SAM, Nagios etc. The statistics we have now may not exist later. PG reported that we were part way through the conversion process: Nagios services run by CERN would move to regional Nagios services, likely to be the Oxford dashboard for the UK. This would be a flexible system that could be modified. CERN were working to the end of EGEE.

DB asked if we wished to get ATLAS and CMS to make preliminary presentations to the Deployment Board. The Tier-2 spreadsheet was discussed; PG asked if it was generally available. It had been distributed to the CB. SL advised that the Tier-2 Co-ordinators should inform their SysAdmins, but that it should not be generally available yet. DB advised that the Word document sent to the CB had all of the numbers in it. PG queried the MoU generated by the spreadsheet, which makes RAL PPD a 'small' ATLAS site.

DB noted that the other issue for the Deployment Board could be regional Nagios. SL agreed and would circulate an agenda.

ACTION 380.1 SL to circulate an Agenda for the Deployment Board meeting at RHUL.

2. Tier-2 Investments
======================

SL had circulated an email. DB advised that we needed a high-level picture of investments in infrastructure to defend our case if this were raised at the PPRP; we needed to show leverage of investment. SL had only received one or two responses thus far.

ACTION 380.2 ALL: to send SL information on infrastructure investments at their respective institutes.
380.3 AS to send SL assumptions re electricity (in relation to investments in infrastructure).
380.4 SP to send SL historical numbers on unfunded effort (in relation to investments in infrastructure).

3. EGI/NGI Paper
=================

DB noted that several questions needed to be answered:
- how does GridPP relate to an NGI structure?
- what happens if EGI does not go ahead?
- what happens if NGS is not funded?
- how much is GridPP doing in an NGI which is not directly related to GridPP?
DB reminded the PMB that he had prepared a draft document last year. RM had circulated an update to this which provided a framework for argument. DB went through the document:
- p3, top paragraph: RM needs to qualify this - a statement is missing, e.g. 'this reflects a reduced particle physics influence going from EGEE to EGI' (cf. the statement in DB's covering letter to the GridPP4 proposal).

DB noted that this should remain an internal document; however, issues would be raised prior to the site visit, so a public version should be produced once it is in good shape. It was understood that GridPP has to be represented on an NGI MB in proportion to the size of its resources and user base. In relation to global tasks, security and training were clear (the former is very closely coupled to GridPP; the latter does not involve the Grid). We are proposing to continue with configuration and accounting. DB emphasised that, in relation to APEL, there had been a negative reputational effect due to the recent problems; we needed management buy-in.

DB noted that the large blue tables in RM's document were incomplete, as follows:
- RM should fill in the GridPP effort at top level in relation to global tasks;
- RM needed to fill in the NGS column;
- a 'totals' column was required;
- the EU contribution needed to be clarified.

TD advised that, regarding the relative size of GridPP within the NGI, there was a difference between the hardware resources and the users. In the 'risks' section of the document DB noted that if NGS were not funded, an NGI would be de-scoped - we would drop training, but we could still do the global tasks. For Risk 1 the following was required:
- the paragraph needed to be quantified;
- RM should add two columns to the table: the status quo was an NGS and an EGI - if there were no NGS what would we do? If there were no EGI what would we do? These manpower changes should be shown in the new columns using an X or similar.

PC advised that the first sentence should read: 'This is the extra effort we need if these are unfunded'. PG asked if Nagios at Oxford would be part of the NGI; TD noted that, as it would be devolved from CERN, it would rest with us.

DB noted that the document was a good start. RM/SP should make the changes as discussed and try to quantify some of the issues. Over the next week DB would iterate with them in order to push the document forward. The text would follow the numbers - they should concentrate on the numbers first, i.e. task vs effort and where the effort comes from. As noted, two strategy columns should be added. RM advised that he would have an internal meeting with JG. By the end of the week, SP/DB would try to talk.

ACTION 380.5 RM/SP to make changes to the EGI/NGI paper as discussed and bring back a revised version to next week's PMB.
380.6 ALL: to feedback comments on the EGI/NGI paper to DB, RM or SP before next week's PMB.

4. Week's Notes
================

- DB advised that the PPRP Agenda had been changed, but the GridPP timing for the meeting remained unaltered.
- Re the OPN backup link, AS advised that he had received an Invoice. They were scheduled to get the line at the end of March, which would then be tested during April. DB noted that we needed to confirm the delivery date and the usage/testing plans. The invoice should not be paid if there was a possibility that the link would not be installed for several months.

ACTION 380.7 Re the OPN backup link: AS to find out: 1. When the link is supposed to be operational; 2. More detail about how and when the link will be tested.
If possible AS should delay payment of the Invoice until more information was forthcoming. There ensued a discussion on the use and capacity of the link, plus the strategy required in relation to usage - was a cap possible? The traffic could be split two ways if the link were to be used for production.

STANDING ITEMS
==============

SI-1 Tier-1 Manager's Report
-----------------------------

AS reported as follows:

Fabric:
1) FY09 procurements:
- All disk and CPU has been delivered.
- We expect to be able to start acceptance tests on one lot of disk and CPU this week; the second lot is still being installed.
2) FY10 procurements:
- We have started the process of updating the procurement documentation for FY10 procurements. We are considering alternative options to a restricted EU tender.

DB noted that the pros and cons of this needed to be set out; a HAG would be preferable. AS noted that a teleconference would be required.

There was no update re the UPS. DB gave direction to AS that, from GridPP's viewpoint, the equipment was not fit-for-purpose and should at this point be returned to the vendor, instead of allowing alternatives that only added other points of failure. AS advised that he did not control the process, which was being handled by Estates & Buildings. DB noted he could speak with someone if required; AS would check and get back to him.

3) We have concluded that one lot of the 2006 procurement (about 250TB) is too unreliable (high drive eject rate) and we are discussing phase-out options with the UB. This lot was the source of all multi-drive filesystem losses during 2009 and has generated the majority of drive ejects in the last 12 months. We do not expect the phase-out to impact our WLCG commitments.

Service:
1) SAM test availability for the ops VO was 100%.
2) We are working on an upgrade strategy for CASTOR from 2.1.7 to 2.1.8 or 2.1.9. We expect to discuss this with the UK VO representatives in 1-2 weeks and then at the PMB.
3) We have been reviewing our position on the CASTOR database hardware in the light of the problems encountered during the migration back to the EMC RAID arrays. The current configuration is not fully resilient: at present a storage array failure may lead to an outage of the CASTOR database SAN. Our conclusion is that we will need to move the database service back off the EMC units to allow a reconfiguration of the SAN to a well-tested and working configuration. We will have to do this by temporarily deploying new hardware to stage the service onto. We are still reviewing the exact configuration required and the hardware options. We also have to find a good time window for a 1-2 day intervention to release the existing hardware (probably not during the early stages of data taking) and then a further timeslot to move back onto it.
4) On Friday we made an emergency change on the CMS CASTOR instance in order to address a hot-file issue (a new service class was created overlaying the existing disk pool).

SI-2 ATLAS weekly review & plans
---------------------------------

RJ reported that things had been quiet last week; there were production jobs due this week. There had been a problem over the weekend with the pilot factory at Glasgow due to a proxy expiring, but this had been fixed today. There was also a bug in the distribution of hardware tasks which meant they were blacklisted, wrongly, by the ganga robot - this was causing problems on the ATLAS side.
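
As an aside on the proxy expiry noted under SI-2: a minimal sketch of a cron-style check that warns before a pilot-factory proxy runs out is shown below. It assumes the standard voms-proxy-info client is on the PATH; the proxy path and warning threshold are illustrative placeholders, not actual GridPP or Glasgow settings.

    #!/usr/bin/env python
    # Sketch: warn when a grid proxy is close to expiry.
    # Assumes the VOMS client tools are installed; the file location and
    # threshold below are hypothetical, not the pilot-factory configuration.
    import subprocess
    import sys

    PROXY_FILE = "/var/lib/pilot-factory/proxy.pem"   # hypothetical path
    WARN_SECONDS = 24 * 3600                          # warn with a day in hand

    def proxy_time_left(proxy_file):
        """Return the remaining proxy lifetime in seconds, or 0 on error."""
        try:
            out = subprocess.check_output(
                ["voms-proxy-info", "-file", proxy_file, "-timeleft"])
            return int(out.decode().strip())
        except (subprocess.CalledProcessError, OSError, ValueError):
            return 0

    if __name__ == "__main__":
        remaining = proxy_time_left(PROXY_FILE)
        if remaining < WARN_SECONDS:
            print("WARNING: proxy expires in %d seconds - renew it" % remaining)
            sys.exit(1)
        print("OK: proxy valid for another %d seconds" % remaining)

Run regularly (e.g. from cron), such a check would flag the weekend failure mode before jobs start failing.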
SI-3 CMS weekly review & plans
-------------------------------

DC reported that they were preparing for 7 TeV Monte Carlo - nothing unusual was happening at present. There ensued a discussion about a change made on Friday afternoon by AS at the Tier-1. DB commented that the Tier-1 should be responsive and that they had made the right decision.

SI-4 LHCb weekly review & plans
--------------------------------

In absentia GP reported as follows:
1) Low-level Monte Carlo productions. Most went without problem. The bulk of LHCb work on the Grid is currently user analysis.
2) Problem uploading data out of the site at three UK Tier-2s: Sheffield, Glasgow and Brunel. GGUS tickets have been opened against them and the issue raised on the dTeam mailing list. This particular problem is limited to just these three sites on the (LHCb) Grid. Working with the sites to understand it.
3) dCache Tier-1s were brought back into the mask last Tuesday after a new stack of LHCb software was released with fixed versions of ROOT. Analysis jobs are now fine at most sites.
4) CASTORLHCB was successfully upgraded to version 2.1.9.4 at CERN this morning.

SI-5 Production Manager's Report
---------------------------------

PG presented JC's report as follows:

1) CREAM & SCAS/glexec status (may be updated):
- Oxford - two CREAM installs, both in production. One uses SCAS. glexec on a small set of WNs.
- Manchester - one CREAM CE in production. SCAS/glexec deployed but not in production.
- RAL T1 - CREAM CE in production. SCAS/glexec installed on a test cluster.
- Glasgow - one CREAM instance in production. SCAS and glexec in production; CREAM is using only its own worker nodes at present. It has been tested with CREAM and with the lcg-CE. No explicit testing by any major VO yet, but a problem was found with proxy lease and renewals with ATLAS Condor submissions. Still to implement a pilot ops role for ops glexec testing.
- Imperial - work in progress on CE and SCAS.
Sites with more than one CE have been asked to move one to CREAM. Several site administrators were concerned about doing this while there remains a critical bug affecting ATLAS submissions. There ensued a discussion about the problems at Imperial and RHUL in relation to the CREAM CE. It was noted that sites are still finishing the SL5 upgrade. DB asked about UK site testing of SCAS and glexec; DC and RJ noted that, as far as they knew, this was not happening.

ACTION 380.8 RJ/DC to advise us of what the experiment plans are in the UK in relation to SCAS and glexec.

DB asked whether the sites were using these at all. DC didn't know; PG would check his logs. RJ didn't know - they were not doing specific testing as far as he knew, and there was certainly no pressure to do so from ATLAS. DB confirmed that comment from ATLAS and CMS was required: some sites have it installed and some don't, therefore direction was needed. PG noted that ATLAS didn't use the CREAM CE anyway at the moment - lcg-CEs were still required.

2) A post-mortem/incident report for the outage of the gridpp.ac.uk DNS is now available in the wiki: https://www.gridpp.ac.uk/wiki/Manchester_Incident_20100227 . The specific cause of the problem was a kernel panic on the DNS host. The impact was larger than it should have been because the DNS and several other services were in the process of host migration at Manchester. To mitigate future occurrences, DNS backups are being sought in the Manchester computer centre and at RAL (see the sketch after this report).

3) The transition to Nagios took place last week. Once used in production many new bugs were quickly identified. There remain issues such as: the CREAM CE is missing from the myEGEE interface; multiple top-level BDIIs are not supported; some data show differently between the dashboard and the Nagios portal.
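
As a purely illustrative follow-up to item 2 above: a minimal sketch of a check that the gridpp.ac.uk name answers from more than one nameserver - the kind of test that the planned DNS backups would make meaningful. It assumes the 'dig' utility from the BIND tools is installed; the nameserver hostnames are placeholders, not the actual GridPP servers.

    #!/usr/bin/env python
    # Sketch: confirm a record resolves from both the primary and a backup
    # nameserver. Server names below are hypothetical placeholders.
    import subprocess

    NAME = "www.gridpp.ac.uk"
    NAMESERVERS = ["ns-primary.example.ac.uk", "ns-backup.example.ac.uk"]

    def resolves(name, server):
        """Return True if 'server' returns at least one A record for 'name'."""
        try:
            out = subprocess.check_output(
                ["dig", "+short", "+time=5", "@" + server, name, "A"])
            return bool(out.strip())
        except (subprocess.CalledProcessError, OSError):
            return False

    if __name__ == "__main__":
        for server in NAMESERVERS:
            print("%s: %s" % (server, "OK" if resolves(NAME, server) else "FAILED"))

With a secondary in place, a single host failure such as the kernel panic described in the incident report would no longer take the zone offline.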
SI-6 LCG Management Board Report
----------------------------------

DB reported on issues as follows:
1. On Tuesday there was a clear statement from CERN on DPM and CASTOR: both are, and will continue to be, supported at CERN at the same level. The CASTOR situation was particularly good at present. The statement was carefully made. There was a normal rotation of 3-year posts happening - all in a steady state.
2. JG had provided an update on the GDB - there was a suspension of clauses in the security policy. What was GridPP's position? It would be better to discuss this when DK and JG were present.
3. The APEL issue - there was a perception that this was done in the UK at RAL and was synonymous with GridPP. DB noted we have to see this differently in future in relation to the NGI, as it affected GridPP's reputation. We have to take lessons from this going forward and need to do better on communication - this had been a retrograde step.
4. Were the experiments working on resource estimates for the upcoming period? RJ noted that they would certainly be different.

ACTION 380.9 RJ/DC to send info to DB regarding resource estimates for the upcoming period, as this info will be needed after the PPRP.

SI-7 Dissemination Report
--------------------------

SP reported that planning was ongoing for an upcoming meeting, at which the Chief of STFC would be giving a speech. There was nothing further on the LHC at present. It was noted that an email had been circulated re STFC Innovations Partnership Scheme (IPS) Panel Nominations. SP asked whether we wanted to nominate someone; two academics were required. No-one was available.

AOB
===

SP reminded members that the Quarterly Reports were due. RJ noted he was working on his; the information systems had been changed. DB noted that there would be issues from the Quarter which should be raised at the PMB.

REVIEW OF ACTIONS
=================

354.2 JC to consult with site admins on a framework policy for releases, with a mechanism for escalation, plus a mechanism for monitoring. JC reported that the consultation had happened. There were a few suggestions in the deployment team about how to progress in this area. It needs writing up and an implementation plan. JC to progress. Pending.

366.8 AS to confirm that the Tier-1 proposes to use tape-based storage in the period 2011-2015. DB advised this related to long-term plans and power capacity. Physical footprint/space? Alternatives? AS had sent technical questions round the team and would forward inputs when available. AS noted that further alternative costings were required. AS to progress. Ongoing.

367.2 RM to fill in the grey boxes on DB's UK NGI diagram of a minimal NGI, as to what NGS would be doing in the areas listed. RM reported that there wasn't enough information available at present to carry out this action, but he had met with Andy Richards. RM/SP to circulate a document. Done, item closed.

375.9 RM to provide a skeleton outline plan, including post details, of GridPP/NGS convergence. RM reported that a draft plan would be available soon. RM/SP to circulate a document. Done, item closed.

379.1 Re GridPP4 proposal and forthcoming PPRP meeting: SP to begin work on 'background' financial planning. Ongoing.

379.2 Re GridPP4 proposal and forthcoming PPRP meeting: AS to look at the CERN hardware paper and work on the CPU and disk costings. Ongoing.
379.3 Re GridPP4 proposal and forthcoming PPRP meeting: SP to add more detailed information to the WBS. Ongoing.

379.4 Re GridPP4 proposal and forthcoming PPRP meeting: RM to progress the EGI/NGI/NGS model for next week's PMB (in relation to Actions 367.2 & 375.9). Done, item closed.

379.5 RM/SP to assimilate the information in DB's paper on NGI within the GridPP4 Proposal, and circulate a new updated paper before next week's PMB. This would be a transition document addressing the possibility that: 1. There would be no NGI; 2. There would be no future funding for NGS. Ongoing.

379.6 SL to ensure that the OC documents are made publicly available [done following the meeting].

379.7 JC to follow up the issue of merging VO lists and the ILDG VO. Ongoing.

ACTIONS AS AT 08.03.10
======================

354.2 JC to consult with site admins on a framework policy for releases, with a mechanism for escalation, plus a mechanism for monitoring. JC reported that the consultation had happened. There were a few suggestions in the deployment team about how to progress in this area. It needs writing up and an implementation plan. JC to progress.

366.8 AS to confirm that the Tier-1 proposes to use tape-based storage in the period 2011-2015. DB advised this related to long-term plans and power capacity. Physical footprint/space? Alternatives? AS had sent technical questions round the team and would forward inputs when available. AS noted that further alternative costings were required. AS to progress.

379.1 Re GridPP4 proposal and forthcoming PPRP meeting: SP to begin work on 'background' financial planning.

379.2 Re GridPP4 proposal and forthcoming PPRP meeting: AS to look at the CERN hardware paper and work on the CPU and disk costings.

379.3 Re GridPP4 proposal and forthcoming PPRP meeting: SP to add more detailed information to the WBS.

379.5 RM/SP to assimilate the information in DB's paper on NGI within the GridPP4 Proposal, and circulate a new updated paper before next week's PMB. This would be a transition document addressing the possibility that: 1. There would be no NGI; 2. There would be no future funding for NGS.

379.7 JC to follow up the issue of merging VO lists and the ILDG VO.

380.1 SL to circulate an Agenda for the Deployment Board meeting at RHUL.

380.2 ALL: to send SL information on infrastructure investments at their respective institutes.

380.3 AS to send SL assumptions re electricity (in relation to investments in infrastructure).

380.4 SP to send SL historical numbers on unfunded effort (in relation to investments in infrastructure).

380.5 RM/SP to make changes to the EGI/NGI paper as discussed and bring back a revised version to next week's PMB.

380.6 ALL: to feedback comments on the EGI/NGI paper to DB, RM or SP before next week's PMB.

380.7 Re the OPN backup link: AS to find out: 1. When the link is supposed to be operational; 2. More detail about how and when the link will be tested. If possible AS should delay Invoice payment until more information was forthcoming.

380.8 RJ/DC to advise us of what the experiment plans are in the UK in relation to SCAS and glexec.

380.9 RJ/DC to send info to DB regarding resource estimates for the upcoming period, as this info will be needed after the PPRP.

INACTIVE CATEGORY
=================

359.4 JC to follow up dTeam actions from the DB, as follows:
---------------------------
05.02 dTeam to try and sort out CPU shares and priority resources, at Glasgow first (perhaps by raising the job priority in Panda).
---------------------------
JC would check the situation with Graeme Stewart (who was currently on annual leave). JC followed up with Graeme and the other experiments. A test was started, but this area has been deemed low priority and further progress is not expected for some time. ATLAS see no issues with contention. LHCb are not intending to pursue anything in this area. A CMS discussion has started but again it does not appear to be urgent. If the experiments are not pushing this internally then there is nothing for the deployment team to follow up! It was noted that there was no priority in ATLAS at present; this will be pending for a while. Move to inactive as it is a long-term action.
---------------------

The meeting closed at 3:00 pm. The next PMB would take place on Monday 15th March at 12:55 pm.