Email discussion lists for the UK Education and Research communities

UKHEPGRID Archives


UKHEPGRID@JISCMAIL.AC.UK


UKHEPGRID  January 2008

Subject: Minutes of the 286th GridPP PMB meeting
From: Tony Doyle <[log in to unmask]>
Reply-To: Tony Doyle <[log in to unmask]>
Date: Thu, 10 Jan 2008 16:44:27 +0000
Content-Type: MULTIPART/MIXED
Parts/Attachments: TEXT/PLAIN (20 lines), 080107.txt (1 lines)

Dear All,

     Please find attached the latest weekly GridPP Project Management 
Board Meeting minutes. The latest minutes can be found each week in:

http://www.gridpp.ac.uk/php/pmb/minutes.php?latest

as well as being listed with other minutes at:

http://www.gridpp.ac.uk/php/pmb/minutes.php

Cheers, Tony
________________________________________________________________________
Prof. A T Doyle, FInstP FRSE                       GridPP Project Leader
Rm 478, Kelvin Building                      Telephone: +44-141-330 5899
Dept of Physics and Astronomy                  Telefax: +44-141-330 5881
University of Glasgow                   EMail: [log in to unmask]
G12 8QQ, UK                 Web: http://ppewww.physics.gla.ac.uk/~doyle/
________________________________________________________________________


GridPP PMB Minutes 286 - 7th January 2008
=========================================

Present: Tony Doyle, Sarah Pearce, Roger Jones, David Britton, Steve Lloyd, Robin Middleton, John Gordon, Glenn Patrick, Andrew Sansum, Dave Colling, Suzanne Scott (Minutes)

Apologies: Stephen Burke, David Kelsey, Dave Newbold, Tony Cass, Jeremy Coles, Peter Clarke, Neil Geddes

1. CMS Representation on PMB
=============================

TD advised that the transition from Dave Newbold to Dave Colling, as CMS Representative on the PMB, was taking place at today's meeting. The PMB wished to formally thank Dave Newbold for all his inputs on CMS issues and relating to CASTOR testing. It was noted that DN would remain on the PMB mailing list for the next month or so in order to effect a handover period. GP and SL noted the changeover in relation to the UB and CB respectively. DN would continue on the Tier-1 Board.

2. GridPP MoU Draft for CB
===========================

TD had circulated v3.1 for consideration by the PMB. TD advised that this incorporated minor changes post the F2F, including modifications to the Tier-1 hardware breakdown for CMS in 2008. TD proposed that this version be circulated to the CB. It was agreed that this version be circulated and it was noted that the CB will sign this off at the next meeting. SL noted that a short phone meeting was expected, and that Dave Colling was now also on the CB as CMS Representative.

TD reported that the figures in v3.1 were the final version of the planning figures relating to the LCG in December (cf the letter to Les Robertson). In relation to EGEE, they needed to know the planning figures - 1% of the hardware allocated was for EGEE purposes (incorporated in LHC anyway), but agreement with Ian Bird was required via RM/JG, who would refer to the figures given. JG noted that a meeting was due within the next few weeks, and EGEE planning was likely to be on the Agenda. JG would sum-up the GridPP/NGS/Ireland contribution.
The PMB approved circulation of the MoU to the CB.

3. Tier-2 Hardship Fund
========================

SL had circulated a document for consideration by the PMB. SL reported that the F2F meeting had agreed to make a further 100k available for cases to be made. Four bids had been received totalling 113k; SL had circulated the cases and conclusions/recommendations. It was proposed to fund two cases in full; one was vague in terms of stated outputs but 25k had been recommended. NorthGrid wished to allocate funds themselves. SL recommended release of funds in goodwill that they would be sensibly disseminated.

TD proposed that the PMB accept SL and NG's recommendations as given, and that the conclusions be endorsed; the information would be relayed to STFC as PMB-approved allocations. DB agreed that the PMB should proceed as proposed. SL asked whether a breakdown of the funds would be required of NorthGrid. DC noted that a mechanism might be required to place the funds temporarily. SL advised that the grants would need to be issued to institutions; the proposers would need to invoice each other internally in order to effect internal transfers.

DB advised that some conditions should be attached to the grants relating to delivery of resources being attached to the funds. TD noted that the MoU has Regional responsibility - the additional amounts for hardship would not change the overarching MoU and associated hardware delivery. DB noted that something should be written into the grant conditions in case of site failure. There was a discussion in relation to the various aspects of this. TD noted that agreement of future resource allocations would be based on 'past performance'. SL suggested Institutional MoUs that each region would sign-up to. The question was, who gets penalised in case of failure to deliver - Site or Region?
DB reiterated that draft wording attached to the Grants was required, to build-in something to which the PMB had recourse in case of failure to deliver at site level. RJ noted that NorthGrid would call a meeting and give allocations to Institutions - it should be the Institutions that get penalised in case of failure. TD advised that the Institutions themselves would also have to instigate regional agreements to transfer funds. TD advised that the MoU figures would not be modified. SL and DB to iterate. RJ to call a NorthGrid meeting to organise dissemination of funds. DC noted that overall responsibility lies with the Tier-2. TD confirmed that all other internal arrangements are devolved to London, SouthGrid, NorthGrid and ScotGrid.

4. Post-Mortem on Tier-1 running
=================================

AS reported on various problems experienced over the Christmas break, as follows:

The Tier-1 ran unattended from Saturday 22nd December 2007 until Tuesday 1st January inclusive. During that period Tier-1 staff continued to monitor the service and carried out a number of interventions when they detected problems. In general good availability was maintained for most of the service; however, there were major problems with ATLAS access to CASTOR which are yet to be understood and are still being investigated. Key problems and interventions were:

00) At 20:00 on Saturday 22nd December approximately 70% of the batch workers went offline following a transient overload of the home filesystem. Detected at midnight and workers restarted by 01:30 the following morning. (Adams)

0) dCache failed at 11:00 on 24th after a logfile unexpectedly filled the system disk of the pnfs server - was corrected at 13:00 and again after a recurrence at 14:30 (2GB written in 1.5 hours) - change of use pattern probably by MINOS. (Ross)

1) The failure (partial/) of ATLAS access to CASTOR from 24th December (still being investigated).
Intervention attempted by Kruk (during the holiday period - no info available) and by Bly (see below). Possibly caused by a load-related problem on the ATLAS stager but still being investigated.

2) The Nagios monitoring system failed and was restarted on the 25th December. (Bly)

3) The restart of various CASTOR SRMs on the 27th December in response to reported (by ATLAS) problems with CASTOR. High process count alarm on the SRMs and no other faults reported by other CASTOR components. (Bly)

4) The restart of the ganglia server on 27th after logging failed. Caused by 'out of memory' - more now ordered. (Thorne)

5) The CE gatekeeper daemon died at 04:00 on 29th December and was restarted at 15:40. (Ross)

6) The nagios server was restarted at 07:49 on 31st December and was restarted at 12:09. (Thorne)

7) The CMS CASTOR WANIN pool became very busy on 31st December, logging NFS errors. (White)

8) rb01 became overloaded (excessive job cancellations) on 31st December (not detected) and was taken out of production on 2nd January for investigation and repair.

9) Backups of system/home filesystems failed over the holiday period after a problem with one of the ADS tape servers.

AS reported that on-call payments had been instigated for the first time in relation to these issues. Problems experienced were currently being diagnosed, particularly with CASTOR and ATLAS. The main challenge with operating a 'holiday' service was expertise with specific complex problems. The PMB agreed that GridPP should formally apologise to ATLAS for the production difficulties over the festive season. It was hoped that more specific diagnoses and full understanding of the problems would become apparent in due course, resolution of which could be disseminated UK-wide. AS to apologise to ATLAS on behalf of GridPP. The issues above would be discussed again at next week's PMB.

5. AOCB
========

GP had submitted a paper on special cases for non-Grid access to the UK Tier-1.
GP put forward the cases he had received along with recommendations for action from the UB relating to the LHC experiments (approved by RJ and DC) and BaBar. GP reported that MINOS had suffered from CASTOR delays and lack of testing; a working instance of CASTOR was still awaited. Re CALICE, one issue was the RAL firewall. GP expected that more cases would be received once qsub was withdrawn. The PMB agreed with GP's (UB's) recommendations and asked GP to advise the UB accordingly. TD noted that a service message was required at login - AS to organise.

JG asked about AFS for BaBar? It was proposed to continue to run this (it was also required for the Oversight Board in relation to disaster planning) - the PMB agreed. JC and SB to incorporate the AFS Service into the disaster planning document.

STANDING ITEMS
==============

SI-1 Dissemination Officer's Report
------------------------------------

SP reported that there was news from STFC regarding the award on LHC@Home, which had progressed to the next stage. This would need to be submitted by the end of January and a presentation given in March. The message for the LHC Promotion Advisory Group had been agreed and accepted - this will be put to the next meeting. SP advised that the Christmas Story had been posted as a news item and had been put on the website. If anyone has any issues they wish reported, please send them to SP.

Rob Edgecock had advised that a nomination was required for the Science in Society Advisory Panel. RJ was willing. TD would nominate him.

SI-2 Tier-1 Manager's Report
-----------------------------

AS provided the following report:

1) Tenders:
   a) Disk tender - order placed - planned delivery date now agreed for 11th January (may be delayed by up to 1 week).
   b) CPU tender - order placed and scheduled for delivery 28 February.
   c) Tape drive purchase - number of additional drives to be finalised in the next 1-2 weeks in order to ensure delivery (mainly the servers) this FY.
2) Memory upgrade - the Woodcrest (Streamline) systems have been upgraded to 2GB per core and the AMD (Compusys) systems will be upgraded today.
3) Work on the power supply is proceeding - so far with no disruption to service.
4) We expect to borrow about 80TB of unused disk capacity from the Tier-2 in order to (partially) tide us over until new capacity becomes available at the end of March.

2) Service:
   1) SAM availability for last week was 99% and the month's availability was 93%.
   2) CASTOR - No update on general deployment of CASTOR.
   3) SL4 Migration - The SL4 UI is configured and is being tested.
   4) dCache - no update.
   5) The LHCB ORACLE based LFC is installed, has had limited testing and is now handed over to LHCB.

Progress to Grid Only Access - This standing item documents the status of work towards achieving GRIDPP milestone 0.18 "Access to Tier-1 resources by Grid Interfaces Only"

1) The scheduled termination has been announced. Special cases for continued access have been passed to the PMB for review. We continue to expect to terminate qsub access on 11th January 2008.

SI-3 Production Manager's Report
---------------------------------

It was noted that JC was on annual leave.

SI-4 LCG Management Board Report
---------------------------------

TD reported that he did not attend the last meeting due to a meeting clash. It was noted that Ian Bird was now in charge of the LCG MB.

SI-5 Documentation Officer's Report
------------------------------------

It was noted that SB was on annual leave.

REVIEW OF ACTIONS
=================

272.4 AS to check the current Tier-1 disaster recovery plan and circulate the existing version to the PMB. It was reported that this document does not exist, but it was planned to have one in the longer term. TD would incorporate in v0.4 anything that AS considered relevant. AS will check and advise additions. Ongoing.

277.2 DN to provide an update and re-evaluation of CMS/CASTOR deliverables.
TD advised that there was a CMS/CASTOR document on deliverables which should be revised in light of the December '07 tests. DC to take the token for this now and iterate with DN.

277.5 Disaster Recovery 'Team B': SB, JC, TD, SP, DB to analyse the wider issues of disaster planning, mapped to the experiments' lists, and this work would include Project Management. A Recovery Plan was required. It was agreed that JC was in charge of this and the experiment input relating to subsets of the disaster plan. SB/JC to progress.

277.8 User Experience 'Team C': SB, SP, SL, with input from JC to deal with the issue of user experience and design of an easily-found lookup facility for grid error messages. SL reported that he had started the ATLAS wiki page and would circulate the url. Ongoing.

280.6 JG to bring up this issue (the biomed VO and 'sieving') at the ROC Manager's meeting (done) - a broadcast is to go out from EGEE which will be helpful in underlining acceptable use of Grid resources and would act as a reminder to VOs about the policy they have signed-up to in relation to their users. JC had now emailed the Chair to have this discussed - EGEE broadcast part of this action ongoing. JG reported that a new VO was now set up but there were no resources allocated to it as yet, although one Institute may be giving funds. Pending further info from JC. EGEE broadcast action ongoing.

280.7 JC to mention the issues (when approached by a VO with regard to joining) of the 'standard' 6-month introduction period, following which the VO must set-up something specific to them, if appropriate. This had been discussed at DTeam, done. JC to email GridPP VO members if possible - ongoing. This was a standing action - JC had discussed it with the Tier-2 Co-ordinators in relation to VO members. The emailing part was ongoing but the first part of the action was completed. JC to send email. Ongoing.

280.8 JG to investigate the UKI ROC website - any change/progress, and report-back. Ongoing.
282.2 SP to progress the Project Map using the T1 service areas and input from the meeting. Ongoing.

282.3 SL and NG to progress issues relating to Tier-2 hardware allocation/complaints and iterate procedure with T2s. It was noted that there was a deadline of 14 December for inputs to SL and NG. SL had received inputs. To be re-evaluated in the New Year. Done, item closed.

282.5 Updated GridPP3 MOU needs to be sent to CB (TD to provide updated version for SL to circulate). TD reported that he was working on this, on the latest numbers required and comments would be sent to JC. Version 3.1 had been prepared for the CB. Done, item closed.

282.6 JC and SB to progress existing 'disaster planning' template for next F2F meeting on 1st Feb. Involve experiments as necessary. Ongoing.

283.1 TD to arrange a phone connection at TC Dublin for RJ to join the GridPP20 meeting remotely. Ongoing.

283.3 RM/TD to prepare use cases appropriate for the UK community, [relating to item 278.10 EGEEIII -> EGI]. RM reported that he would be attending a workshop at the end of January at CERN (by EGI design study project) and would report-back at that time. Ongoing.

285.1 SP to circulate LPAG Grid Message paper to PMB once further comments received. Done, item closed.

285.2 GP to compile a document showing the applications for non-Grid access, and circulate to the PMB. Done, item closed.

285.3 JG to check the status of the Tier-1 Review Plan regarding 'on-call' service, and circulate. JG reported that a wiki had been created. Done, item closed.

ACTIONS AS AT 07.01.08
======================

272.4 AS to check the current Tier-1 disaster recovery plan and circulate the existing version to the PMB. It was reported that this document does not exist, but it was planned to have one in the longer term. TD would incorporate in v0.4 anything that AS considered relevant. AS will check and advise additions.

277.2 DN to provide an update and re-evaluation of CMS/CASTOR deliverables.
TD advised that there was a CMS/CASTOR document on deliverables which should be revised in light of the December '07 tests. DC to take the token for this now and iterate with DN.

277.5 Disaster Recovery 'Team B': SB, JC, TD, SP, DB to analyse the wider issues of disaster planning, mapped to the experiments' lists, and this work would include Project Management. A Recovery Plan was required. It was agreed that JC was in charge of this and the experiment input relating to subsets of the disaster plan. SB/JC to progress.

277.8 User Experience 'Team C': SB, SP, SL, with input from JC to deal with the issue of user experience and design of an easily-found lookup facility for grid error messages. SL reported that he had started the ATLAS wiki page and would circulate the url.

280.6 JG to bring up this issue (the biomed VO and 'sieving') at the ROC Manager's meeting (done) - a broadcast is to go out from EGEE which will be helpful in underlining acceptable use of Grid resources and would act as a reminder to VOs about the policy they have signed-up to in relation to their users. JC had now emailed the Chair to have this discussed - EGEE broadcast part of this action ongoing. JG reported that a new VO was now set up but there were no resources allocated to it as yet, although one Institute may be giving funds. Pending further info from JC. EGEE broadcast action ongoing.

280.7 JC to mention the issues (when approached by a VO with regard to joining) of the 'standard' 6-month introduction period, following which the VO must set-up something specific to them, if appropriate. This had been discussed at DTeam, done. JC to email GridPP VO members if possible - ongoing. This was a standing action - JC had discussed it with the Tier-2 Co-ordinators in relation to VO members. The emailing part was ongoing but the first part of the action was completed. JC to send email. Ongoing.

280.8 JG to investigate the UKI ROC website - any change/progress, and report-back.
282.2 SP to progress the Project Map using the T1 service areas and input from the meeting.

282.6 JC and SB to progress existing 'disaster planning' template for next F2F meeting on 1st Feb. Involve experiments as necessary.

283.1 TD to arrange a phone connection at TC Dublin for RJ to join the GridPP20 meeting remotely.

283.3 RM/TD to prepare use cases appropriate for the UK community, [relating to item 278.10 EGEEIII -> EGI]. RM reported that he would be attending a workshop at the end of January at CERN (by EGI design study project) and would report-back at that time. Ongoing.

286.1 RJ to call a NorthGrid meeting to decide hardship funding allocations to Institutes.

286.2 SL and DB to iterate regarding clause associated with the issuing of Tier-2 hardware grants.

286.3 AS to formally apologise to ATLAS on behalf of GridPP for the outage problems over the Christmas period.

286.4 GP to advise the UB that the special cases for non-Grid access to the UK Tier-1 were approved.

286.5 AS to organise a service message at login relating to non-Grid access being withdrawn.

286.6 JC and SB to incorporate the AFS Service into the disaster planning document.

INACTIVE CATEGORY
=================

271.1 PMB to examine the issue of fibre breakage and outages, CERN-RAL OPN link, in one year's time, when actual data on breakages is available. Due date would be September '08.

271.3 Re CERN-RAL OPN link breakage and backup generally, PC to oversee the issue and collate info so that the PMB have something to revisit in one year's time. Due date September '08. It was noted that PC would circulate a revised document after discussion with ATLAS (RJ/PC/DN to iterate).

282.8 RM to monitor how R-GMA and networking issues impact on GridPP as matters progress. RM advised that this item should be moved to the 'inactive' category as it will develop over the coming months. RM discussed the issue with Steve Fisher and advised that support of R-GMA is required whilst APEL is dependent on it.
RM reported that he has spoken to SF and there is currently no change to the R-GMA situation - process ongoing.

The meeting closed at 2:20 pm. The next PMB would take place on Monday 14 January 2008 at 1:00 pm.

