UKHEPGRID  December 2012
Minutes of the 476th to 479th GridPP PMB meeting


David Britton <[log in to unmask]>


Mon, 10 Dec 2012 10:47:43 +0000





text/plain (46 lines) , 121015.txt (1 lines) , 121029.txt (1 lines) , 121105.txt (1 lines) , 121112.txt (1 lines)

Dear All,

Please find attached the GridPP Project Management Board
Meeting minutes for the 476th meeting to 479th meeting.

The latest minutes can be found at:


as well as being listed with other minutes at:


Cheers, Dave.

GridPP PMB Minutes 476 (15.10.2012) =================================== Present: Dave Britton (Chair), Pete Gronbech, Jeremy Coles, Andrew Sansum, Apologies: Roger Jones, Steve Lloyd, John Gordon, Dave Kelsey, Pete Clarke, Tony Cass, Tony Doyle, Dave Colling, Claire Devereux, Neil Geddes 1. Synergies with DIRAC ======================== Jeremy Yates had circulated a document and DB asked how we should respond? DB proposed that the last section of the document be discussed to see what was possible and there may be actions to generate. - identity management DB noted that the UK3A bid had already been submitted - GPFS multi-cluster DB had been pursuing this already but GridPP probably would not want to consider GPFS because of long-term licensing costs. We could however monitor DIRAC's progress - who was available to do that? It could be delegated to the Storage Group. JC advised that GPFS was not popular among the Storage Group. DB considered that we wanted someone to monitor and understand what DIRAC was doing, then we could get a presentation in a year's time. ACTION 476.1 PG to ask the Storage Group to be aware that DIRAC may deploy/test a form of GPFS as a prototype for a national system, the Storage Group to monitor and keep abreast of progress. - creating VOs Did this relate to the technical side of things, or to outreach? In the long term, could we envisage VOs which might need/use both GridPP and DIRAC? It was premature at this stage to consider this. - sharing resources It was noted that our CPU was full these days, but HPC was not so full - could we use the compute power that was available out there? We couldn't use HECTOR at Edinburgh due to the architecture - was there anything else? AS considered we might make use of other shared facilities, support edge nodes and buy-in, however this had been difficult to do in the past. DB thought that Institute resources would be better, for example DIRAC at Cambridge - should we talk to them? 
Was it too much work for too little gain? PG advised that Cambridge suffered from a lack of manpower. JC noted that we had just disengaged from the Condor cluster. DB thought it was good for Institutes to have their clusters used, but recognised the manpower issues. This needed to be devolved to Institutes to take forward, namely Oxford, SouthGrid, and Cambridge. JC noted we needed manpower to pursue the technical side. - helpdesk DB noted this related to local support for DIRAC. AS considered that it was a different concept to what we did at our Helpdesk. - training This would only be needed if we had things in common. - security policy DB considered that until DIRAC joined-up their technology then security was really a local issue only. AS advised that there could be federated issues - they were at the stage we had been around 8-10 years ago, sites not disclosing issues etc. DB noted that there was a new GridPP Security Officer now, we could ask DK and the new Officer to have a dialogue with DIRAC to ascertain whether there was any common ground. - operations Could we collaborate here? It was thought no, not until we had something in common. AS noted it might be possible in relation to monitoring frameworks. ACTION 476.2 DB to invite JY and his Sysadmin to visit Lancaster or attend a HEPSYSMAN meeting. - outreach It was thought collaboration in this area was possible, Neasan O'Neill should be involved. ACTION 476.3 DB to feedback to JY the PMB discussion regarding possible synergies with DIRAC. 2. AOB ======= - Track Convenors There had been a call for CHEP Convenors, however possible contenders were not here today. RJ was a possibility. - NGS CertWizard This would be discussed in JC's report, but it was noted that this issue was being widely discussed at present. Constructive comments were required so that we could feed back relevant information. 
It was known that NGS CertWizard was causing some problems that might be due to the clarity of the instructions or more technical in nature. JC noted he would be getting feedback from Jens Jensen at the Ops Team. DB noted that this issue needed to be sorted out by someone. - next PMB DB noted he was travelling next week and other PMB members were also away. PG advised that the Quarterly Reports were still awaited from some. It was agreed, in the light of absences, that there would be no PMB meeting next Monday 22nd October. The next PMB would take place on Monday 29th October. STANDING ITEMS ============== SI-1 Dissemination Report -------------------------- SL was absent. SI-2 ATLAS weekly review & plans --------------------------------- RJ was absent. SI-3 CMS weekly review & plans --------------------------------- DC was absent. SI-4 LHCb weekly review & plans --------------------------------- PC was absent. SI-5 Production Manager's Report --------------------------------- JC reported as follows: 1) There were a number of current topics touched upon at the GDB last week (http://indico.cern.ch/conferenceOtherViews.py?view=standard&confId=155073). Sites running unsupported gLite 3.2 services will be ticketed from the start of November and must by then have a plan to move to EMI or an escalated technical reason that prevents them upgrading. The GridPP sites (still using gLite CEs) at the ops meeting last Tuesday all indicated plans to move their CEs before the end of October. - There are a number of activities involved with Storage Federations (failover, self-healing, caching…). GridPP sites are involved with both the ATLAS and CMS testing. - Publishing WN environments is still being tested. - Jamie Shiers's talk on post EGI-Inspire emphasised the need for WLCG to work closely with other communities in new areas post EGI-Inspire. FP8/Horizon 2020 calls likely in data management and data preservation. 
JS should meet with PC/DC/RJ to push forward a common position regarding data preservation in the context of potential funding. ACTION 476.4 PC/DC/RJ to meet with Jamie Shiers in order to push forward a common position regarding data preservation in the context of potential funding and FP8/Horizon 2020 calls. - Markus Schulz circulated a proposal paper for middleware support post EMI (http://indico.cern.ch/materialDisplay.py?contribId=12&sessionId=1&materialId=paper&confId =155073). 2) As part of our (GridPP) contribution to the future necessity of community supported activities, some of the ops team are now learning how to produce the WN tarball installs that we need. In the DPM area, there is now confirmed interest in the community support model from France and Taiwan, and it is likely that we will be able to continue without the initially proposed MoU structure. CERN management have yet to discuss the CERN contribution. There were possible alternative fixes from DPM - information to be sent by JC to the Glasgow Team. ACTION 476.5 JC to send info on possible alternative DPM fixes to the Glasgow Team. 3) The next EGI Community Forum will be hosted by the departments of IT Services and Particle Physics, University of Manchester, UK between 8-12 April 2013. Wahid would like that we consider running a Storage Workshop in conjunction with this meeting (an extended version of a DPM workshop that will likely take place in the UK around April). DB noted that in principle this was a good idea, but we needed to ensure that our costs would not be too high as a result. 4) Last Thursday the core ops team discussed progress and plans in each of the core task areas. Updates are captured in the meeting page here https://indico.cern.ch/conferenceDisplay.py?confId=212408. (This is for reference but I can talk through the areas at the PMB if there is time/interest). One item of note concerns other VOs. 
We currently point these VOs to use SRM, WMS and LFC yet there are indications that the LHC experiments will move away from them. There was an issue about support in the longer term. 5) Communications have been sent out to our UK hosted VOs informing the VO-admins about upcoming changes in a number of areas and particularly with the EMI middleware transition (CEs and WNs). There are few indications that the VOs are testing and most likely problems will need to be dealt with if and when they arise. 6) There have been multiple discussions about the CA CertWizard (http://www.ngs.ac.uk/use/tools/certwizard) in the last week. It is a tool for managing certificates. There are no current plans to replace the browser interface for certificate management, but Jens will be joining the ops meeting tomorrow to explain the rationale, plans and take feedback. For information: A) HEPiX takes place this week in Beijing: https://indico.cern.ch/conferenceOtherViews.py?view=standard&confId=199025. B) The next WLCG coordination meeting takes place this Thursday: https://indico.cern.ch/conferenceDisplay.py?confId=212691. C) The next HEPSYSMAN meeting takes place on 9th November in Lancaster: http://hepwww.rl.ac.uk/sysman/Nov2012/main.html. SI-6 Tier-1 Manager's Report ----------------------------- AS reported as follows: Fabric ------ 1) Disk tender closed - evaluation underway 2) CPU tender evaluation complete - now with procurement team Service ------- 1) Operations continue generally smoothly 2) CASTOR a) CASTOR 2.1.12 upgrade for LHCB was cancelled last Tuesday while we investigated a possible problem with the previous ATLAS upgrade. This eventually turned out to be a false alarm and upgrade scheduling is underway again. b) CMS upgrade now scheduled for this Tuesday 16th October. LHCb upgrade planned (TBC) for 23rd October. 3) Upgrade to EMI2 CREAM CE in final tests but some publishing problems remain. 
Things are tight for us to meet our deadline to have switched off the old gLite CEs by the end of October or face possible suspension. However, systems are deployed and being tested and we expect to move to full production this week.

4) Hyper-threading change has been approved to exploit hyper-threading by running more jobs than cores. This is a simple change to implement but does come with some risks/issues as well as benefits. Implementation is scheduled for next month, after the CE change this month.
- We will nominally gain an additional 8647 HEPSPEC from the existing hardware.
- We will allow an additional 2048 job slots to run. The amount we over-commit will differ on the different generations:
  * 10 slots on the 8 core 2009 generation
  * 20 slots on the 12 core 2010/2011 generations
- We will gradually ramp up the number of additional job slots in case of load issues on the batch server (risk).
- CPU scale factors will be set according to the new benchmarked per job slot performance. This is only relevant when the worker node is fully occupied. When occupancy is below max, CPUs will effectively be faster than published and so we will under-account work done at the accounting portal.
- Job efficiency will still be able to discriminate between efficient and inefficient work, but average job efficiency is no longer a measure of how much useful work is done on the farm (it remains a measure of how efficient jobs are).
- "Wasted CPU hours" from the efficiency stats becomes even less meaningful because, if a job does not use execution units, another overcommitted job will.
- By committing memory to run more jobs per node we have reduced our capacity to run large memory jobs (or vice versa). New hardware will be purchased configured with enough memory to support all hyper-threads concurrently.

5) Backup Oracle (and Frontier) Service for CMS - we expect to receive a formal request shortly to run a global backup Oracle service for the CMS conditions D/B.
Given the reduction in load on Oracle from ATLAS LFC and LHCB 3D/LFC we expect to be able to meet Oracle licensing and database hardware mainly from existing resources, but we'll need to assess exact requirement before reaching a final conclusion. DB noted that DC should request this via the PMB. AOB === - GridPP30 PG asked what was happening about this? DB advised that DC said he would look into hosting the meeting at the Royal Geographical Society near Imperial. ACTION 476.6 DC to investigate the hosting of GridPP30 at the Royal Geographical Society near Imperial, and report back. - European PP Strategy AS reported that there had been an internal request within STFC regarding the European Particle Physics Strategy process and a discussion about national laboratories. John Wormersley was putting together the proposal that RAL was a National Lab including the Tier-1. ACTION 476.7 AS to check with John Wormersley regarding the proposal that RAL be considered as a National Lab including the Tier-1. AS to find out status of the proposal and report back. REVIEW OF ACTIONS ================= 438.9 AS to contact relevant site managers to ask whether or not they would be interested in having retired Tier-1 hardware - if a site were interested then they should submit a proposal as to what they want and why. Ongoing. 475.1 DB/JC, in conjunction with AS, to consider and draft Terms of Reference (ToR) for the proposed GridPP Cloud Group. Ongoing. 475.2 DB to draft a response to Peter Coveney's email request, using PC's suggestions and in the light of PMB discussion. Done, item closed. ACTIONS AS AT 15.10.12 ====================== 438.9 AS to contact relevant site managers to ask whether or not they would be interested in having retired Tier-1 hardware - if a site were interested then they should submit a proposal as to what they want and why. 475.1 DB/JC, in conjunction with AS, to consider and draft Terms of Reference (ToR) for the proposed GridPP Cloud Group. 
476.1 PG to ask the Storage Group to be aware that DIRAC may deploy/test a form of GPFS as a prototype for a national system, the Storage Group to monitor and keep abreast of progress. 476.2 DB to invite Jeremy Yates and his Sysadmin to visit Lancaster or attend a HEPSYSMAN meeting, to help move forward with DIRAC synergies. 476.3 DB to feedback to Jeremy Yates the PMB discussion regarding possible synergies with DIRAC. 476.4 PC/DC/RJ to meet with Jamie Shiers in order to push forward a common position regarding data preservation in the context of potential funding and FP8/Horizon 2020 calls. 476.5 JC to send info on possible alternative DPM fixes to the Glasgow Team. 476.6 DC to investigate the hosting of GridPP30 at the Royal Geographical Society near Imperial, and report back. 476.7 AS to check with John Wormersley regarding the proposal that RAL be considered as a National Lab including the Tier-1. AS to find out current status of the proposal and report back. There would be *no* PMB on Monday 22nd October. The next PMB would take place on Monday 29th October at 12:55 pm.
GridPP PMB Minutes 477 (29.10.2012) ======================================= Present: Dave Britton (Chair), Andrew Sansum, Roger Jones, Pete Clarke, Tony Cass, Tony Doyle, Dave Colling, Claire Devereux (Suzanne Scott -Minutes) Apologies: Dave Kelsey, Steve Lloyd, John Gordon, Jeremy Coles, Pete Gronbech, Neil Geddes STANDING ITEMS ============== SI-1 Dissemination Report -------------------------- SL was not present. SI-2 ATLAS weekly report & plans --------------------------------- RJ reported that there had been a rolling changeover to the EMI CE at RAL last week, there had been discussions about the process, extra disk for ATLAS at RAL was being installed this week but they had held back on the hyperthreading. High memory MC jobs had gone to the Tier-1 recently, the Tier-2s could also contribute to this but this was to be discussed. RJ had no major problems to report. SI-3 CMS weekly review & plans ------------------------------- DC was not present at this stage in the meeting. SI-4 LHCb weekly review & plans -------------------------------- PC reported that they were progressing with reprocessing, which was going fine, after Christmas they would be doing the 2011 data reprocessing. SI-5 Production Manager's Report --------------------------------- JC was absent but had sent a brief note: We have made steady progress with removing gLite 3.2 CEs/BDIIs, but some (more than I hoped) will certainly remain in early November. Sites have received tickets and all have now responded but I am concerned that some of the smaller sites will not follow-up and there is a growing possibility they will be suspended/uncertified at some point in the coming month. I will send an update next week. The WN tarball help has not so far developed which is another problem on the horizon when the gLite 3.2 WN deadline arrives at the end of November. 
SI-6 Tier-1 Manager's Report ----------------------------- AS reported as follows: Fabric: 1) Disk tender closed - evaluation expected to complete this week. 2) CPU tender standstill complete. Orders about to be raised. 3) Asymmetric network routing discovered for some Tier-1 to RAL traffic. External sites had not accepted our OP_N routing. Now corrected. 4) A disk server operating system was accidentally re-installed (human error). This was risk 6 in our accidental data loss risk analysis. Mitigation worked - no data lost. Service: 1) Operations continue generally smoothly 2) CASTOR a) CASTOR 2.1.12 upgrade for CMS+LHCB completed. Gen instance will be carried out on Tuesday 30th. 3) Upgrade to EMI2 CREAM CE completed. Went very well but experiments did not promptly change SAM test endpoints so incorrect availability will need correcting. Old glite nodes will be turned off by end of month. 4) WMS services upgrade from glite. We should now be glite free. 5) Hyper-threading change has been approved to exploit hyper-threading by running more jobs than cores. This is a simple change to implement but does come with some risks/issues as well as benefits. Implementation scheduled for next month after CE change this month. 6) Backup Oracle (and Frontier) Service for CMS - we expect to receive a formal request shortly to run a global backup Oracle service for the CMS conditions D/B. Given the reduction in load on Oracle from ATLAS LFC and LHCB 3D/LFC we expect to be able to meet Oracle licensing and database hardware mainly from existing resources, but we'll need to assess exact requirement before reaching a final conclusion. 
SI-7 LCG Management Board Report
---------------------------------
DB reported that there had been a discussion re Oracle licences: they were identifying cases where Oracle was in use at the Tier-1s. There had also been the issue of OSG's contingency plans for their CA; users were requesting contingency planning for various scenarios if Certs could not be issued - the documents were available publicly. DB noted that GridPP was in the same situation and we should ask the same question for services we don't directly run - the next NGI meeting would discuss this on 12th November. DB noted that the documents re the CA and infrastructure were fairly generic and could maybe be used. There needed to be contingency plans for all NGI services. DB would report back from the NGI meeting. CD noted she had this issue on the NGI Agenda.

DB continued - there had been an update on the WLCG networking group by Michael Ernst. The Oversight Board had raised a query about the networking group's remit, in order to clarify how it related to other bodies. DB reported that there had been a bit of discussion about this group generally and 'bandwidth on demand'; no further action was required at present. There had followed a discussion on common projects, then a discussion on the WLCG software life-cycle process. DB noted there would shortly be a Russian Tier-1.

AS had sent an email regarding Oracle. He advised that the licence requirements were reducing over the next few years but the maintenance bill was due in GridPP4. AS noted he was awaiting formal information from CERN. DB thought we would need fewer licences going forward than was originally planned? AS confirmed yes - the bulk of licences go on CASTOR. DB noted that at RAL the dominant factor was CASTOR, therefore the LFC and FTS changes would not affect things much. AS agreed, and he would send round a summary. DB noted that regarding the backup service for CMS we didn't want additional costs.
DC had joined the meeting and advised that he had a chat with Ian this morning. The CMS request was not high on their wishlist but it would be good to have. CMS may try and move away from Oracle. DC noted that Fermilab had almost no Oracle licences at all. 1. ToR for Cloud Group ======================= A proposal document had been circulated by DB and he had sent it to AS for comment. AS noted only one minor thing: 'production' cloud service could perhaps be modified to 'prototype' cloud service. DC was to give feedback. Any other comments should be sent to DB/DC. It was noted that the document would be used as the basis for moving forward. There would be a monthly report to the PMB. Would PC and RJ be involved? PC advised that a PDRA post was being advertised and this was something that the prospective member of staff could be involved with on behalf of LHCb. RJ advised that he had been discussing this within ATLAS and a few people were interested, but this was to be confirmed. DC should convene a meeting soon to start-off this Cloud Group. 2. AOB ======= - DELL LHC Programme It was noted that George Jones had left DELL. PG had received a message from Gary Kriegel noting that the Programme was currently in transition and that LHC pricing was being determined for the future. It was thought that the programme could disappear entirely. RJ would contact Andy Langford and thereafter the DELL contact he met at Manchester. ACTION 477.1 RJ to contact Andy Langford and thereafter the DELL contact he met at Manchester in relation to DELL LHC programme changes. AS advised that DELL hadn't made the cut for the CPU service, possibly reflecting their change of emphasis. - DPHEP meeting DB asked about this meeting - was anyone going? PC noted no - it was difficult to get to Marseille from Edinburgh. RJ noted he had also dropped out due to the change of venue from Munich. PC advised that Marco would be going for LHCb. ATLAS would not have any representation. 
REVIEW OF ACTIONS
=================
438.9 AS to contact relevant site managers to ask whether or not they would be interested in having retired Tier-1 hardware - if a site were interested then they should submit a proposal as to what they want and why. Ongoing for 2006 generation.
475.1 DB/JC, in conjunction with AS, to consider and draft Terms of Reference (ToR) for the proposed GridPP Cloud Group. Done, item closed.
476.1 PG to ask the Storage Group to be aware that DIRAC may deploy/test a form of GPFS as a prototype for a national system, the Storage Group to monitor and keep abreast of progress. Ongoing.
476.2 DB to invite Jeremy Yates and his Sysadmin to visit Lancaster or attend a HEPSYSMAN meeting, to help move forward with DIRAC synergies. Done, item closed.
476.3 DB to feedback to Jeremy Yates the PMB discussion regarding possible synergies with DIRAC. Done, item closed.
476.4 PC/DC/RJ to meet with Jamie Shiers in order to push forward a common position regarding data preservation in the context of potential funding and FP8/Horizon 2020 calls. Done, item closed.
476.5 JC to send info on possible alternative DPM fixes to the Glasgow Team. Done, item closed.
476.6 DC to investigate the hosting of GridPP30 at the Royal Geographical Society near Imperial, and report back. DC would check the Physics Dept and Halls of Residence. Done, item closed.
476.7 AS to check with John Wormersley regarding the proposal that RAL be considered as a National Lab including the Tier-1. AS to find out current status of the proposal and report back. Done, item closed.

ACTIONS AS AT 29.10.12
======================
438.9 AS to contact relevant site managers to ask whether or not they would be interested in having retired Tier-1 hardware - if a site were interested then they should submit a proposal as to what they want and why.
476.1 PG to ask the Storage Group to be aware that DIRAC may deploy/test a form of GPFS as a prototype for a national system, the Storage Group to monitor and keep abreast of progress. 477.1 RJ to contact Andy Langford and thereafter the DELL contact he met at Manchester in relation to DELL LHC programme changes. The next PMB meeting would take place on Monday 5th November at 12:55 pm.
GridPP PMB Minutes 478 (05.11.2012)
=======================================
Present: Dave Britton (Chair), Pete Gronbech, Andrew Sansum, Roger Jones, Pete Clarke, Tony Cass, Dave Colling, Claire Devereux, Steve Lloyd, John Gordon, Jeremy Coles, Dave Kelsey
Apologies: Tony Doyle, Neil Geddes

Agenda:

1. ATLAS - Oracle for conditions DB and Frontier Server at RAL [RJ/AS]
======================================================================
ATLAS has asked the 5 Tier-1s (which include RAL) that host the Conditions DataBase and Frontier Servers in addition to CERN whether they intended to continue to do so for Run2 (i.e. until 2018). ATLAS were not sure how many instances were required: it might not be 5 but it was certainly "some". AS noted that the 3D database required some 6 Oracle licences (compared to something like 30 for CASTOR) and this might reduce to 4, so was not a dominant factor. RJ had yet to receive an answer from ATLAS as to the experiment's longer-term plans with respect to Oracle. ATLAS has requested a response by mid-Nov. DB suggested that RJ find out a little more about ATLAS' position and draft an initial response on the basis that it was not regarded as a big problem by the Tier-1. DB would want to add some caveats about the timeframe involved.

ACTION 478.1 RJ to draft response to the ATLAS message and iterate with DB.

AOCB
====
1) PG had been away last week and would summarise quarterly reports at the next PMB meeting.
2) DC had made some enquiries about GridPP30 at Imperial and would make a proposal on dates to the PMB this week.

ACTION 478.2 DC to propose dates for GridPP30.
STANDING ITEMS
==============

SI-1 Dissemination Report [SL]
-------------------------
SL reported that he had received the following from NO:
- Published Ganga News item
- Waiting to publish LCG CE news item
- Sussex news item ready for when they go into production
- perfSONAR news item in the works
- VOMS Snooper news item also in the works
- GridPP (and PG) in Linux Format this month
- I've been officially added to the LOC for the Community Forum (well I'm included in the phone calls)

DB expressed a concern that the events of September had demonstrated that our dissemination overall as a project had some gaps. News items were fine but they only addressed one area of dissemination. In particular, GridPP needs better contact with industry and better visibility within the developing UK e-infrastructure community. A discussion ensued, with broad agreement that there was an issue. It was felt that we need to target some very specific things: a project with an industrial partner would be valuable; money might be available from the various STFC impact programmes if something could be identified.

ACTION 478.3 SL to talk with NO; possibly a meeting with DB/SL/NO/CD?

RJ noted that the website needed to be fixed so that the old Excel visit-notice was no longer linked from the resources page. DK said he would contact Andrew McNab.

SI-2 ATLAS Weekly Review and Plans [RJ]
----------------------------------
Main issue was that RAL had been moved out of raw-data export. This might be due to OPN saturation but there are several independent network-related issues ongoing at RAL and AS was still trying to get to the bottom of this. The UK Tier-2s also seem to have a number of unrelated issues at present, but nothing too serious. Lancaster would shortly be moved off the light path now that the link north was up and running.

SI-3 CMS Weekly Review and Plans [DC]
--------------------------------
DC reported that things were fine with CMS.
He had noted that the UK Tier-2s had appeared in the top grouping of global CMS Tier-2 sites (along with the US and DESY) in terms of cpu-hours delivered and analysis delivered. DC noted that he was currently setting up the cloud-group and an email list would be established this week. The possibility of hosting a duplicate CMS conditions db at RAL was discussed. The costs included £2.5k for nodes; £8.7k for disk; and £2k? for Oracle Licence(s). It was not yet clear how many Oracle Licences would be needed. AS would get back to DC with the complete details and DC would talk to Ian Fisk as to whether the costs were justifiable.

SI-4 LHCb Weekly Review and Plans [GP]
---------------------------------
PC reported that there were no issues on the LHCb side.

SI-5 Production Manager's weekly report [JC]
---------------------------------------
JC reported that:

1) We have agreed a VOMS upgrade/switch for 14th November. There will be a brief period where VO information will not be editable but otherwise the switch will be transparent for VOs already hosted on gridpp.ac.uk. David Wallom has been liaising with the NGS VOs that are coming on to the gridpp VOMS.

2) A validator script running on VOMRS to check the status of issuer DNs produced some confusing messages for (LHC) users last week, as old certificate DNs were not deleted in VOMS but the certificates against the old CA DN were picked up as failing the validation (due to the old UK CA now having expired?). This seems to have impacted ATLAS team memberships within GGUS for editing tickets, which used the old certificate status for team membership confirmation.

3) As of 1st November several GridPP sites were still running gLite 3.2 CEs with no EMI CEs in parallel: UCL, Durham and ECDF. Additional sites with 3.2 CEs that will be removed soon (when the EMI CEs are shown stable): Manchester, Sheffield, Bristol and Cambridge. Some sites have deployed EMI-2 SL5 WNs (the status tables are being updated).
Alessandra has been tracking plans for ATLAS via this page: https://www.gridpp.ac.uk/wiki/UK_EMI2_Deployment.

4) Last week joint work (finally) began on producing EMI WN tarballs. Needless to say it is not quite as simple as early reports suggested it would be. Matt Doidge at Lancaster together with Wahid Bhimji are providing the GridPP input. Issues include what 'extra' SL rpms need to be included and a policy for later allowing use of glexec.

5) There was a request on TB-SUPPORT for more information on GridPP30 dates.

6) Are there any further PMB comments on the DPM collaboration notice I forwarded from Oliver Keeble last week? It mentions the in-principle agreement to support the collaboration from 3 countries and core development effort being provided by CERN.

For information:
A) There is a GDB next week: http://indico.cern.ch/conferenceDisplay.py?confId=155074.
B) There is a HEPSYSMAN meeting on Friday: http://hepwww.rl.ac.uk/SYSMAN/Nov2012/main.html.

SI-6 Tier-1 Manager's weekly report [AS]
-----------------------------------
AS reported that:

Fabric
------
1) Disk tender closed - HAG meeting scheduled for Tuesday.
2) CPU orders placed.
3) Review of our network performance indicates problem with our outbound rate to most/all sites. Still investigating.
4) High traffic rate on LHCOPN to RAL at the moment (since Friday) under investigation. May need to consider load balancing on backup link in future.
5) Failure of the primary OPN for about 10 hours on 30th October owing to a major fibre cut between Gravelines and Bois-Grenier in France.
6) Site networking plan a short intervention on our board on the main site router on Tuesday 13th November. This will lead to a short scheduled outage. We may take this opportunity to schedule other network work such as performance tests and an upgrade to address bandwidth limitations on one of our stack uplinks.
Service
-------
1) Operations report at: https://www.gridpp.ac.uk/wiki/Tier1_Operations_Report_2012-10-31

2) CASTOR
a) CASTOR 2.1.12 upgrade now complete on all instances.
b) CASTOR 2.1.13 certification has commenced.
c) Lengthy (7 hours) downtime on ATLAS instance over weekend. Cause was non-optimal change in execution plan on SRM database. DB team plan to lock down execution plan using Oracle 11 feature.

3) Hyper-threading change expected to be implemented shortly.

SI-7 LCG Management Board Report of Issues [JG/DB]
--------------------------------------------------
There had been no MB.

REVIEW OF ACTIONS
=================
476.1 had been done.
477.1 had been done but DB opened a new action:

ACTION 478.4 RJ to let PMB know more details about the future of the DELL LHC programme after he'd talked to Andy Langford.

ACTIONS AS OF 05.11.12
======================
438.9 AS to contact relevant site managers to ask whether or not they would be interested in having retired Tier-1 hardware - if a site were interested then they should submit a proposal as to what they want and why.

478.1 RJ to draft response to the ATLAS message about Conditions db and Frontier server and iterate with DB.

478.2 DC to propose dates for GridPP30.

478.3 SL to talk with NO; possibly a meeting with DB/SL/NO/CD about targeting our dissemination.

478.4 RJ to report back to the PMB about the DELL LHC programme after he'd talked to Andy Langford.

The next PMB would take place on Monday 12 November at 12:55 pm.
GridPP PMB Minutes 479 (12.11.2012)
===================================
Present: Dave Britton (Chair), Pete Gronbech, Andrew Sansum, Pete Clarke, Tony Cass, Dave Colling, Claire Devereux, Steve Lloyd, John Gordon, Jeremy Coles, Dave Kelsey

Apologies: Tony Doyle, Roger Jones, Neil Geddes

0. Summary of NGI Management Meeting [CD]
=========================================
Claire reported that the monthly NGI meeting had just been held. Dave Wallom was representing the UK on the EGI Elixir Virtual Team. There is a call for EGI Champions, so nominations were solicited (basically can fund some travel). The meeting discussed the imminent VOMS migration and Claire was asked whether all UK NGI services had been restored following the power cut at RAL (the answer was "yes"). DB raised the issue of contingency planning for NGI services. It was agreed to make a list of services and to evaluate the need and status of contingency plans against each.

1. Tier-1 Power Outage [AS]
===========================
AS described the events of last week when a power cut at RAL and the failure of the generator brought down the whole Tier-1. The only data loss was "data-in-flight" and only a modest amount of hardware had to be repaired. A full SIR will be made available; there are some more details in the Tier-1 report below. It was noted that although the generator was tested on a monthly basis, it had not been load tested. DB asked whether the recent departure of the Operations Manager had compounded the situation (probably not).

2. Quarterly Reports: Issues from 12Q3 [PG]
===========================================
PG circulated a summary of 12Q3 quarterly reports. The Tier-1's performance in Q3 had been excellent. PG/AS asked whether there should be a review of the Tier-1 next May as per the project milestones. DB noted that the lightweight, informal review held last June had been very informative; AS confirmed that it had been useful.
Therefore, it was agreed that a repeat should be scheduled in May 2013.

It was noted that there was a slight delay in the disk procurement that increased the risk of missing the deployment deadline for the MOU in April 2013. Delivery was due in January. DB noted that this should still give time for 4-6 weeks of burn-in and then deployment before the deadline. JG noted that we might expect to run into some problems, so there was a chance that perhaps half the capacity might be late. DB expressed his hope that this would not happen.

Q3 had been less stellar at the Tier-2s, with poor availability at Glasgow for ATLAS (power issues) and data loss at Cambridge. CMS and LHCb had had a good quarter. T2K were investigating their storage requirements; it was hard for them to work out how much disk they were using at Tier-2s due to shared resources with other VOs. The transition to EMI middleware had been somewhat of a concern at the end of the quarter but now, one month later, the UK was in good shape.

AOCB
====
1) EU Researcher Article: This non-refereed journal had approached DB about GridPP paying to publish an article. DB had referred it to Neasan. The proposal was for 1500 words for £3000. The PMB could not see how this would be of value. The decision was not to proceed.

2) ORACLE Licenses: CERN (Tony Cass) had written to GridPP (DB) to request planning numbers of ORACLE Licences. AS had started the inventory but there were some outstanding questions, particularly around ATLAS. DB had discussed with RJ: it looked likely that ATLAS would like RAL to continue to host the 3D DB but not likely that the TAG DB would be required in its current form. AS would use this input and come back with a plan next week.

ACTION 479.1 AS to provide ORACLE licence plan.

3) HAG: The hardware advisory group had met. JG had circulated an email to the PMB and the salient points were in the Tier-1 Manager's report below.

4) EGI Software Support: Oxford had received an email about SAM support.
This was something that had been discussed a long time ago by JG with EGI: providing support for APEL and SAM. There was the odd month of effort funded to provide this, but it was felt to be a very low-level commitment and it was agreed that no further action was required (such as transferring this month of funding to Oxford) unless the task proved more onerous than expected.

5) GridPP30: DC reported that IC no longer had student accommodation at Easter. DB asked about local hotels but realised this was unlikely to be affordable. DC would check. PG suggested contacting Dell about their conference centre in Ireland. CD suggested holding it in conjunction with EGI in Manchester. DB/CD/PG/DC would look into these options.

STANDING ITEMS
==============

SI-1 Dissemination Report [SL]
------------------------------
SL noted that a KE meeting had been arranged for Nov 27th at QM to be attended by at least SL, NO, DB and CD. Other PMB members were invited. DC and JC expressed interest. It was agreed, therefore, to start at 12:45 to avoid Ops-team.

SI-2 ATLAS Weekly Review and Plans [RJ]
---------------------------------------
RJ was not present due to teaching.

SI-3 CMS Weekly Review and Plans [DC]
-------------------------------------
DC reported no issues from CMS operations. However, Stuart Wakefield had now left and some issues with Brunel had been found where his certificate had been hardwired.

SI-4 LHCb Weekly Review and Plans [PC]
--------------------------------------
No issues for LHCb.

SI-5 Production Manager's weekly report [JC]
--------------------------------------------
JC reported as follows:

1) An upgrade of the GridPP VOMS takes place this Wednesday (14th). VO-admins have been informed of the read-only period during the upgrade and that the new VOMS version has new notification policies; in particular, VO-admins will now "get regular emails about expired users, or users that are going to expire" (see details here: https://www.gridpp.ac.uk/wiki/VOMS_Notifications).
2) There was a power cut that affected RAL at 11:30 UTC last Wednesday 7th November and the backup diesel generators failed. This affected UK Tier-2 work but did not lead to any complaints. We will review the impacts (and any lessons learned) at the ops meeting tomorrow - for example top-BDII settings used by the UK Nagios testing and GOCDB failover. APEL processing at RAL was also affected and sites were asked to temporarily avoid republishing data.

3) No GridPP/UK sites have been designated as unresponsive by EGI in regards to their EMI upgrade progress and plans (but see D below for the process being followed).

4) Steady (positive) progress is being made with producing an EMI-2 tarball WN. Testing last week showed a working version with ATLAS. (Reminder: The current deadline for sites to move from gLite 3.2 WNs is the end of November).

5) HEPSYSMAN took place at Lancaster on Friday (https://indico.cern.ch/conferenceDisplay.py?confId=211206). A flexible format and short-talks approach worked well.

For information:

A) There is a GDB this week: http://indico.cern.ch/conferenceDisplay.py?confId=155074. Topics include: GGUS recent developments; an update on the Security WG activities; Glue 2.0; IPv6 and plans for the deployment of M/W clients (in light of EMI ending soon).

B) A statement on the DPM collaboration is now online: https://svnweb.cern.ch/trac/lcgdm/blog. Planning for the DPM community workshop in December has started: http://indico.cern.ch/conferenceDisplay.py?confId=214478.

C) The EGI-Inspire task TSA1.5 (accounting) has been handed over from John to Alison Packer (STFC).

D) An EGI CSIRT process to handle unsupported gLite service end-points of unresponsive sites that failed to reply to COD tickets and to provide information about their upgrade plans has now been agreed. From today sites affected will be asked to put old endpoints into downtime and from 19th unresponsive sites will risk suspension.
SI-6 Tier-1 Manager's weekly report [AS]
----------------------------------------
AS reported as follows:

Fabric
------
1) Disk tender evaluation complete. Expect to start standstill shortly.

2) CPU orders placed.

3) Review of our network performance indicates problem with our outbound rate to most/all sites. Still investigating.

4) Site networking plan a short intervention on our board on the main site router on Tuesday 13th November. This will lead to a short scheduled outage. We will not be scheduling an intervention on our internal stacks as suggested last week, as testing could not be completed owing to the power failure.

Service
-------
1) A major (>50%) site-wide power failure at 11:20 on Wednesday 7th November (last major power failure 44 months ago). Trip occurred at main site substation (cause being investigated). UPS generator started but would not accept load (cause being investigated). Critical (UPS battery protected) services operated for about 20 minutes but had to be shut down as cooling requires generator. Power to machine room restored at 14:20. External national and international services (FTS, BDII, WMS, LFC, GOC, APEL) restored by 18:00 (some much earlier). Batch and CASTOR services restored by 14:00 on 8th November. Generator circuit remains faulty. Generator will not start in event of another power failure. Investigation and generator load test being scheduled for 20th November but until then our UPS critical systems remain at risk in event of further power problems. Post Mortem (SIR) underway.

2) CASTOR
a) On Sunday (again) problems with ATLAS SRM owing to database choosing non-optimal execution plan. Expect to lock down the execution plans this Tuesday.
b) Intermittent CMS SRM test failures, leading to around 20% degradation in test results. Seems to be an increasing problem, but the cause is not understood. Does not seem to be noticeably impacting production work.

3) On Saturday problems with CRLs expiring on CEs.
Investigating how this happened. Inconvenient that CERN CRLs expire on Saturday (known problem).

4) Hyper-threading change rollout started.

5) EMI-2 workernode update in pipeline. Expected before end of month.

SI-7 LCG Management Board Report of Issues [JG/DB]
--------------------------------------------------
There had been no MB. JC asked about the software lifecycle plan that had been presented in outline at the last but one MB and then at the GDB. DB had not heard anything more.

REVIEW OF ACTIONS
=================
438.9 AS to contact relevant site managers to ask whether or not they would be interested in having retired Tier-1 hardware - if a site were interested then they should submit a proposal as to what they want and why. ONGOING

478.1 RJ to draft response to the ATLAS message about Conditions db and Frontier server and iterate with DB. ONGOING

478.2 DC to propose dates for GridPP30. NO ACCOMMODATION. ACTION CLOSED

478.3 SL to talk with NO; possibly a meeting with DB/SL/NO/CD about targeting our dissemination. DONE - ARRANGED FOR 27TH

478.4 RJ to report back to the PMB about the DELL LHC programme after he'd talked to Andy Langford. DONE AND ONGOING!

ACTIONS AS OF 12.11.12
======================
438.9 AS to contact relevant site managers to ask whether or not they would be interested in having retired Tier-1 hardware - if a site were interested then they should submit a proposal as to what they want and why.

478.1 RJ to draft response to the ATLAS message about Conditions db and Frontier server and iterate with DB.

479.1 AS to finalise ORACLE licence planning.

479.2 RJ to report back to the PMB about the DELL LHC programme after he'd talked to Andy Langford.
