UKHEPGRID Archives


UKHEPGRID@JISCMAIL.AC.UK


UKHEPGRID  November 2007

Subject: Re: Minutes of the 282nd GridPP PMB F2F meeting
From: Tony Doyle <[log in to unmask]>
Reply-To: Tony Doyle <[log in to unmask]>
Date: Thu, 29 Nov 2007 12:01:23 +0000
Content-Type: MULTIPART/MIXED
Parts/Attachments: TEXT/PLAIN (34 lines), 071122.txt (1 lines)

Now attached.

Cheers, Tony
________________________________________________________________________
Prof. A T Doyle, FInstP FRSE                       GridPP Project Leader
Rm 478, Kelvin Building                      Telephone: +44-141-330 5899
Dept of Physics and Astronomy                  Telefax: +44-141-330 5881
University of Glasgow                   EMail: [log in to unmask]
G12 8QQ, UK                 Web: http://ppewww.physics.gla.ac.uk/~doyle/
________________________________________________________________________

On Thu, 29 Nov 2007, Tony Doyle wrote:

> Dear All,
> 
>      Please find attached the latest weekly GridPP Project Management 
> Board Meeting minutes. The latest minutes can be found each week in:
> 
> http://www.gridpp.ac.uk/php/pmb/minutes.php?latest
> 
> as well as being listed with other minutes at:
> 
> http://www.gridpp.ac.uk/php/pmb/minutes.php
> 
> Cheers, Tony
> ________________________________________________________________________
> Prof. A T Doyle, FInstP FRSE                       GridPP Project Leader
> Rm 478, Kelvin Building                      Telephone: +44-141-330 5899
> Dept of Physics and Astronomy                  Telefax: +44-141-330 5881
> University of Glasgow                   EMail: [log in to unmask]
> G12 8QQ, UK                 Web: http://ppewww.physics.gla.ac.uk/~doyle/
> ________________________________________________________________________
> 


GridPP PMB Minutes 282 - 22nd November 2007
===========================================

Face-to-face meeting at RAL.

Present: David Britton, Stephen Burke, Peter Clarke, Jeremy Coles, Tony Doyle, Neil Geddes, John Gordon, Dave Kelsey, Steve Lloyd, Robin Middleton, Trish Mullins, Sarah Pearce (EVO), Dave Newbold, Andrew Sansum, Glenn Patrick (minutes)

Apologies: Roger Jones

Yingqin Zheng observed for the Pegasus project.

1. Tier 1 Review
=================

TD outlined the T1 Review that had been held the previous day. One recommendation was for a Service Delivery Plan and monitoring system for future planning of GridPP3 (includes definition of all the services; assessment of their criticality; monitoring technique; failover procedure; expert list and call-out procedure, as well as the disaster recovery plan already envisaged). Also, there is a mismatch between current metrics in the old ProjectMap and the service delivery to experiments now required. Better integration with Deployment Team was also recommended. Encourage more handshaking between T1 and T2 in establishing good practice and collaboration.

PMB in first instance should review current experiment requirements and present them in a succinct form to T1 as implementation model. Talks from SRM workshop could be a useful starting point? Take a snapshot of experiment high-level (top-down) requirements rather than having dispersed information. DB made the point that comments about all information being in computing TDRs were no longer useful.

ACTION: TD and JG to constitute a group with experiments and other parties to capture experiment requirements and how they relate to UK.

2. GridPP3 Project Planning
============================

SP had circulated an email with initial thoughts on experiments and production metrics. Initial ideas sought from GP and DN on 6-7 general areas for monitoring. CMS gave 5 technical QOS areas, whilst LHCb had 7 high-level experiment areas of Grid use.
DN said it was important that metrics are linked to perceived results and measurements that could be made. GP said that although LHCb and CMS had adopted different approaches, it looked like the underlying metrics would be very similar. DB asked how we deal with metrics which go "red" and we can't do anything to rectify them (e.g. for the OSC). High-level metrics are supported, but we have to be able to define the next steps. Concluded that this looks like right approach, but need to be able to dissect things when they go wrong. SP will work with experiments to define specific metrics. DB made the point that we are likely to need to refine and/or redefine the metric set with the benefit of experience.

RAS commented that we should also look in the WLCG MOU at current metrics. These will be measured anyway (e.g. ticket response).

GP had raised what to do about ALICE? DB suggested to keep ALICE-specific metrics in "Other" box. For the remaining "other" experiments it may be sufficient to see if there are some simple generic metrics (e.g. how many VOs at Tier 2s) or it may also be appropriate to have experiment-specific metrics.

A whole range of production metrics needed to be evaluated. Some needed to be dropped and some amended. DB suggested starting with the list of services in the T1 questionnaire and the metrics should measure how they relate to experiment delivery. The 10 metrics recommended for dropping looked acceptable. SB raised whether the average number of sites/quarter available in VO selection (0.144) should be dropped. DB suggested monitoring blacklists might be appropriate.

Metrics where there was no agreement on future:
0.110 (GridPP Tape Storage) - DB suggested change this metric to be based on "does tape service work".
0.117 (Job failure rates) - should be retained, although difficult to measure.
0.127 (T1 meeting PPS commitments) and 0.128 (meeting JRA1 commitments) - agreed to drop.
0.129 (T1 meeting "other" user commitments) - should be covered by user area.
0.131 (T1 service disaster recovery) - this has been overtaken by events.

Metrics to be amended:
0.104 (no. job slots) should be kept.
0.105 (fraction of LCG job slots used) should be kept.
0.107 (GridPP KSI2K available to EGEE/LCG) should appear separately for T1 and each T2 centre.
0.114 (fraction of available tape used in quarter) should be dropped as covered by monitoring tape service.
0.124 (GridPP security audit) covered elsewhere.
0.130 (testbed) should be dropped.
0.132 (Prod. Service risks/issues) covered by Service Delivery Plan and not needed.
0.136 and 0.137 (delivering to LCG MOU - availability targets) retained as availability metric.

Spreadsheet items:
0.101 and 0.102 (registered and active users) - numbers need to be known, but not as up-front metric in the project map.

ACTION: SP to progress the Project Map using the T1 service areas and input from the meeting.

3. Tier-2 Hardware Allocation
==============================

SL showed slides from the T2 Board held on 16 November. Hardware allocated by formula as advertised in advance. Using experiment/institute matrix and costing model, hardware for 2008/2009 was allocated. No obligation to support experiments of which the institute is not part, but credit is given for supporting any VO. Some issues over those institutes supporting more than one VO being advantaged. T2 Board agreed it would not change the matrix. Acceptable for institutes to move hardware around within a T2, as with manpower. For next phase, the accounting period will be 2Q08 - 1Q09 inclusive.

A number of complaints had arisen. SL proposed, because of possible anomalies arising from the formula approach, that:
(a) GridPP make available an extra 100K to help in genuine cases,
(b) a special (1/2 page) case to be made to GridPP for consideration,
(c) cases to be collated by each T2 Chair who then send the cases and list of any internal transfers to SL.
Procedure supported by PMB with SL and NG to evaluate cases.

ACTION: SL and NG to progress and iterate procedure with T2s.

4. Dissemination Issues
========================

SP presented the new Web page. Only comment was that it was too wide for screens - maybe one column too much? Otherwise, everyone thought it was a good first draft.

5. GLite Support Proposal
==========================

Nick Trigg (STFC/CLIK) introduced a proposal to provide commercial support service for gLite. This could be for individual GridPP institutes. Funding could flow through Constellation Technologies with added value. DB noted that would need to investigate a specific case to see if this model could work. NT to communicate with DB to explore if there are any possibilities.

6. UK Prioritisation of Resources
==================================

TD raised the question of the appropriate level to set UK prioritised resources for T1 and T2. RJ had suggested that ~20% of the total be set aside for ATLAS. Revised WLCG pledges had not yet been made, but were now urgent. There was now a need to identify two numbers for ATLAS, CMS and LHCb - namely the fraction of activity at T1 and T2 reserved only for UK usage. For CMS, DN agreed to a figure of 25% for T2 and 0% for T1. GP made the point that LHCb has a very different computing model and it was not obvious how this could be implemented. However, the principle of reserving 0% of T2 and 25% of T1 for LHCb could be agreed - but it would then be up to the experiment whether it then chose to use these resources for UK or wider purposes.

7. GridPP3 MOU
===============

TD introduced version 2 of the MOU. Binds UK project for three years and will be signed by reps of the four regional T2s and the T1. Agreed with STFC. First action of the new Deployment Board will be to sign off the MoU. Agreed hardware fractions (minimum) broken down by experiment for each institute need to appear.
Need to draft next Monday for STFC - a draft of WLCG pledges would then be ready prior to global deadline of Friday 30 November.

Note: The current (working) version of the GridPP MoU is available at:
http://www.gridpp.ac.uk/db/GridPP3_MoU_v2.1.doc
The input to WLCG planning is available at:
http://www.gridpp.ac.uk/db/WLCG_MoU_UK_Nov07.doc

ACTION: Updated MOU needs to be sent to CB.

8. EGI/NGI Plans and Planning
==============================

RM outlined EGI Design Study now underway. Only 18 months before transition from EGEE starts. More science than just HEP needed, more funding for NGI specific functions and community representation, governance, etc. 9 partner institutes including STFC and CERN. Six work packages. Deliverables: D2.1 (Dec 2007) EGI consolidated requirements and use cases; March 2008 EGI Workshop; June 2008 EGI Blueprint publication.

UK NGI - assumed to be based on NGS. GridPP sites become partner/affiliate. Interoperability - NGS VO on GridPP, SRM-SRB interoperation. Some services already in NGS such as Certificate Authority, GGUS, VOMS. Funding line will become clearer in April 2008. Proposal to NGS Board on 6 December 2007. GridPP strategy for transition to EGI/NGI needs to be defined.

9. Disaster Planning
====================

SB presented a set of slides, mainly from JC. OC had been extremely concerned that we do not have planning for a wide range of potential disasters. Disasters covered "known knowns" (disks will fail), "known unknowns" (fire) and "unknown unknowns" (something preventing data transfers). Probability, impact and scope are the important factors. Plans for both disaster recovery and business continuity planning needed. For OC meeting on 10 October, a paper was submitted covering high-level failure modes and impact on experiment services. Networking perhaps should be in a separate document. Tier 1 included, but should have own disaster plan. CMS experiment scenarios had been included (along with LHCb).
DN pointed out that network throughput limitations observed in CSA07 could be a disaster during real data taking. There needs to be a way of declaring a "disaster" for things which are beyond experiment control. JC commented that need a strategy to deal with each scenario. TM emphasised that OC meant GridPP to look at the basic things that could be put in place to correct things when they go wrong. DB pointed out that for T1 there would be a Service Delivery Plan which should cover operational responses.

ACTION: JC and SB to progress existing template for next F2F meeting on 21 Feb. Involve experiments as necessary.

10. Network Resilience
======================

PC said main thrust was whether a single 10Gb link is an issue? Does not warrant diverting GridPP funds, but keep under review. However, ATLAS (RJ) say cannot stand 6 day outage. Needs further consideration. Brookhaven and Fermilab have triangle connection, and other T1s fall back on cross 10Gb link with another T1.

ACTION: Need further input from RJ on 6 day issue and decide on way forward. Keep action open.

11. Castor Status and CSA07 Outcome
===================================

DN: Goal was ~50% test of entire 2008 computing system. Tried to do for 6 weeks and included T2 for analysis. Extensive programme of "link commissioning" which did not converge fully in time for CSA07. Outcomes - serious issues with implications for physics goals. Transfers were a weak point (storage system capability). Most individual components tested.

RAL T1: In general, late coming up to speed for CSA. All SL4 resources available (though not used). Major update of software area carried out without problem. Weak point - CASTOR 2.1.3 performance for WAN transfers. Never exceeded 100MB/s for more than a day or so. Weak point - JANET and OPN connections to RAL.

T2 centres: Bristol/Brunel did not work (DPM incompat, etc). PPD/Imperial worked very well.

Castor - just started testing the new 2.1.4 prod instance.
Extensive programme of tests planned for December. Attempting to understand complexity wrt tape handling (cannot control allocation of tape drives to streams). Plan to test Castor internal data flow (i.e. d2d and tape migration). Also, test CASTOR SRM2+internal_RFIO_gridftp from RAL PPD. Bring resources back online for CMS, recommission links. Bottom line - more testing required to achieve confidence in Castor for 2008.

12. R-GMA and Networking
========================

RM presented slides after talking to Steve Fisher and Robin Tasker.

R-GMA: re-engineering to new design will be completed by 31 March 2008. Remain part of gLite distribution. Support being negotiated outside EGEE and GridPP3 (1FTE). Important that work is completed by end March since no obvious source for future development. Used by dashboards, APEL, Grid Ireland. Expect new users. RAS raised problem of T1 service run for R-GMA if there is no support after March.

Service Discovery - API to hide underlying information system. Work on SAGA (OGF) spec about to go public. C++ version by 31 March 2008. New activity also in SA3. 1 FTE being funded in this area by GridPP for 2 years - skills required to support R-GMA.

ACTION: RM to monitor how this impacts GridPP as matters progress.

Networking - some GridPP2 deliverables outstanding. New GridPP2+ deliverables - UKLIGHT, Gridmon, etc. GridMon effort seconded (part time) to JA.NET.

13. AOB
========

The meeting finished at 16:20. Next F2F meeting in Glasgow on 1st Feb. 2008. Next EVO meeting on Monday 3rd December.

ACTIONS AS AT 22.11.07
======================

271.2 Re CERN-RAL OPN link breakage, RJ to provide an analysis of what the consequences would be to Experiments for a one-day break, a three-day break, a five-day break, etc. The outcome of these need to be assessed for disaster scenario planning.
272.4 AS to check the current Tier-1 disaster recovery plan and circulate the existing version to the PMB.
It was reported that this document does not exist, but it was planned to have one in the longer term. TD would incorporate in v0.4 anything that AS considered relevant. AS will check and advise additions.
277.2 DN to provide an update and re-evaluation of CMS/CASTOR deliverables.
277.4 Castor 'Team A': TC, AS, JG, RJ, DN, GP to provide inputs relating to CASTOR and a breakdown of issues that could be incorporated into meta-level deliverables for the next 6-month period.
277.5 Disaster Recovery 'Team B': SB, JC, TD, SP, DB to analyse the wider issues of disaster planning, mapped to the experiments' lists, and this work would include Project Management. A Recovery Plan was required. It was agreed that JC was in charge of this and the experiment input relating to subsets of the disaster plan. TD noted that first thoughts on categorising inputs would be required for the next F2F meeting - this would ensure categories were laid down and an idea of what could be said under each category by way of examples that were clear. DB noted that SB could deal with this as an Agenda item at the next F2F meeting and provide a pre-idea of evolution (on behalf of JC who would not be present).
277.7 SP and NO to review existing user documentation areas - it was noted that these need to appeal to the lowest common denominator, be less technical, and be easier to find. SP reported that she and NO were working on a re-designed front page that would be easier to use. SP would send an email to SB summarising her ongoing thoughts and would iterate with SB.
277.8 User Experience 'Team C': SB, SP, SL, with input from JC to deal with the issue of user experience and design of an easily-found lookup facility for grid error messages.
277.9 24x7 cover at Tier 1 'Team D': AS and JG to discuss this issue and see what could be achieved in relation to possible shift rotas/on call/overtime at weekends.
278.3 JC to look at the Quarterly Reports, funded vs unfunded effort, to see if there is a correlation between the lack of unfunded effort and related site problems.
278.8 Regarding the GridPP3 SLA and EGEE SA1 putting forward a draft of its Service Level Description for sites/ROCs to discuss - it was agreed that TD & DK would go through the GridPP3 SLA and review it in terms of consistency of style.
278.10 ALL: inputs on EGEEIII -> EGI to be sent to RM/TD.
279.4 Regarding CASTOR, DN to provide input on CMS after CSA07, and AS to speak to Bonny Strong (high-level planning to be met - a formal recognition of progress is required with well-stated goals).
280.3 JC to elicit more specific objections from Site Admins, to set UID for glexec, to be built-into glexec testing and cert procedures.
280.6 JG to bring up this issue (the biomed VO and 'sieving') at the ROC Manager's meeting (done) - a broadcast is to go out from EGEE which will be helpful in underlining acceptable use of Grid resources and would act as a reminder to VOs about the policy they have signed-up to in relation to their users. JC had now emailed the Chair to have this discussed - EGEE broadcast part of this action ongoing.
280.7 JC to mention the issues (when approached by a VO with regard to joining) of the 'standard' 6-month introduction period, following which the VO must set-up something specific to them, if appropriate. This had been discussed at DTeam, done. JC to email GridPP VO members if possible - ongoing.
280.8 JG to investigate the UKI ROC website - any change/progress, and report-back.
281.1 DB to circulate an updated F2F Agenda.
281.2 TD to circulate an outline Agenda for the GridPP20 Collaboration Meeting.
281.3 SL to raise the issue of user checks on running code (pre-testing procedure/workbook advice) at the Software Installation Tools (SIT) meeting to be able to point people in the right direction prior to releasing code across the grid.
282.1 TD and JG to constitute a group with experiments and other parties to capture experiment requirements and how they relate to UK.
282.2 SP to progress the Project Map using the T1 service areas and input from the meeting.
282.3 SL and NG to progress issues relating to Tier-2 hardware allocation/complaints and iterate procedure with T2s.
282.4 Nick Trigg and DB to iterate regarding the possibility of provision of commercial support service for gLite.
282.5 Updated GridPP3 MOU needs to be sent to CB (TD to provide updated version for SL to circulate).
282.6 JC and SB to progress existing 'disaster planning' template for next F2F meeting on 21 Feb. Involve experiments as necessary.
282.7 RJ to provide input relating to '6 day issue' (network resilience outage) and decide on way forward. Keep action open.
282.8 RM to monitor how R-GMA and networking issues impact on GridPP as matters progress.

INACTIVE CATEGORY
=================

247.2 RJ to get further information from ATLAS regarding use of Grid for testing of PANDA, and report-back. RJ reported that there was a planned series of tests for a few sites in the UK - Rod Walker was in charge of this. No further details were available at present.
251.1 TD to raise the issue of memory vs CPU cost at the MB [in order to work out what the requirement was between 1GB and 2GB memory per core]. This was discussed at the MB, cost was understood, it was agreed that 2GB memory per core was now a requirement in relation to future procurements. AS noted this and agreed. Done, item closed.
271.1 PMB to examine the issue of fibre breakage and outages, CERN-RAL OPN link, in one year's time, when actual data on breakages is available. Due date would be September '08.
271.3 Re CERN-RAL OPN link breakage and backup generally, PC to oversee the issue and collate info so that the PMB have something to revisit in one year's time. Due date September '08.

There would be no PMB next Monday (26th November) due to the F2F. Next meeting Monday 3rd December.
