UKHEPGRID Archives


UKHEPGRID@JISCMAIL.AC.UK


Subject:

Minutes of the 167th GridPP PMB meeting

From:

Tony Doyle <[log in to unmask]>

Reply-To:

Tony Doyle <[log in to unmask]>

Date:

Thu, 28 Apr 2005 17:02:29 +0100

Content-Type:

MULTIPART/MIXED

Parts/Attachments:


TEXT/PLAIN (24 lines) , 050414.txt (1 lines)

Dear All,

     Please find attached the F2F GridPP Project Management Board Meeting
minutes. These can also be found at:

http://www.gridpp.ac.uk/pmb/minutes/050414.txt

as well as being listed with other minutes at

http://www.gridpp.ac.uk/php/pmb/minutes.php

Cheers, Tony
________________________________________________________________________
Tony Doyle, GridPP Project Leader            Telephone: +44-141-330 5899
Rm 478, Kelvin Building                        Telefax: +44-141-330 5881
Dept of Physics and Astronomy           EMail: [log in to unmask]
University of Glasgow             Web: http://ppewww.ph.gla.ac.uk/~doyle
G12 8QQ, UK                                      Video - IP: 194.36.1.32
________________________________________________________________________






GridPP PMB Minutes 167 - 14th April 2005
========================================

Face to Face Meeting at Imperial
--------------------------------

Present: Dave Britton (pm), Jeremy Coles, Tony Doyle, John Gordon,
Roger Jones, Dave Kelsey, Steve Lloyd, Deborah Miller, Sarah Pearce
(phone, pm), Dan Tovey (phone, am)

Apologies: Tony Cass, Robin Middleton

1. Tier-1 Issues - JG
=====================

Summary of issues presented by A Sansum to Tier1A Stakeholders Board on
11th April was discussed.

1) Another power cut in March. Probable cause - electrical work close to
the EPO system. Impact - 1 day outage. Consequence - must give greater
consideration to protecting critical and sensitive systems from power
failure. Will assess requirement for UPS for critical services. May be
financial cost to GridPP.

2) Service Challenge activity has sharply increased and expectations from
LCG are higher than a few months ago. Additional effort will be needed to
meet the ongoing challenges in this area. Ditto capacity requirements.
The service challenge work is becoming a bigger driver than any single
experiment. Discussion on networking is planned; need description of
roles. Will raise at networking meeting.

NEW ACTION 167.1: JG to report back from networking meeting on service
challenge activity/organisation.

3) Increasingly concerned that there will be difficulties meeting Service
Challenge commitments on either the production or UKLIGHT network over
the next 18 months. However further work needs to be done before we have
a clear picture of either the requirement (from T1) or the capability of
UKLIGHT or SJ4/5. A meeting is scheduled with the GRIDPP networking team
to discuss requirements / capabilities.

4) SRM worked well in SC2. Continuing issue that xrootd is deployed at
T1A only for BaBar.

5) We are finding it extremely difficult to allocate disk in the dynamic
manner requested by the user board. Provisioning is slow and
labour-intensive.
A lesser issue is obtaining the release of capacity from experiments in
order to re-allocate it. RJ said there was no use case to reduce disk
allocations. Once allocated, disk will be used essentially for ever. Not
obvious how this fits with the UB model for data challenges.

6) Tenders delayed until - a) effort was available and b) pending
evidence that utilization justified additional capacity purchasing. This
now appears to be the case.

7) Advertisements for Tier1 positions should appear very shortly (maybe
1-2 weeks).

Previous Issues that have been resolved:

1) dCache disk SRM is deployed and is working. CMS much happier. However
support effort for dCache is essentially 1 FTE which is rather worrying
if it does not reduce over the next 3 months.

2) Tape back end to dCache/SRM implemented and tested. Full test under
SC2 shortly.

3) CPU Utilisation figures for January-March are much healthier.

4) Redhat 7.3 almost entirely phased out (except disk servers) - SL3
installed on LCG. SLC3 still an issue - it was good that we chose SL3.

5) Scheduling - Although work continues in this area, much progress has
been made and the hard partition between the Tier1 and Tier-A has become
much softer, allowing jobs to migrate between both infrastructures with a
single common scheduler.

6) Sun services terminated - but disposal not completed.

BaBar - TD reported on long discussion with BaBar Computing Coordinator
Stephen Gowdy. No US effort for Grid; all impetus coming from Europe. The
Monte Carlo problem is soluble, but no progress on analysis. TD told SG
the current position and how we would meet their 2005-06 requirements
(they should use Grid to share with LHC).

Overall Tier1A Plan has been sent to Phase 2 Planning Committee for LCG
and to PPARC. Ken sent an accompanying note to Janet Seed observing that
the UK Tier1 resources were light. A pulse of resources is required in
2007/8. JS advises us to have plans for cutbacks in other areas if there
is no extra money.
SRIF3 outcome may not be good for T2 resources. JC asked about tape
planning.

2. Tier-2 Issues - SL
=====================

1) Continuing delays in appointing hardware support people. Four out of
nine are in post with another two appointed. Total T2 effort has gone up
even though unfunded effort may be under-reported. The issue of delayed
grant start was discussed. Six month delays in grants are allowed by
PPARC. Some further slippage of three months is allowed by the reporting
period, making June the cut-off.

2) T2 middleware people should report through RM.

3) Use of Tier-2 Hardware money - testbeds now or contingency for later?

4) Failure of sites to upgrade/LCG to deliver upgrades. Eventually got to
2.3.0 or 2.3.1 at most sites. Will see how 2.4.0 migration goes. Give
feedback to CERN on shortage of migration period and lack of sufficient
testing.

5) (Lack of) understanding of experiment computing models (esp analysis).
Task force led by Roger to provide detailed bandwidth requirements for
UK.

6) Future resources - SRIF3 etc. QMUL have been successful but few other
successes are known so far.

7) The PMB considered the hardware bids received via T2s. All T2s had bid
for systems to use for the planned UK testzone and some middleware
development. After long discussion, the PMB approved the following:

London
======
8x£1000 systems for the testzone
10x£300 boxes for WLMS development
Total £11,000
Rejected network systems

NorthGrid
=========
4x£1000 systems for the testzone
3x£1000 VO Operation systems
3x£1000 GridPP web server
Total £10,000
Network Monitoring deferred (see below); dCache rejected.

SouthGrid
=========
7x£1000 systems for the testzone
Total £7,000
RAID rejected
VMware was discussed, but not considered appropriate to commit funding to
at this time.

ScotGrid
========
4x£1000 systems for the testzone
8x£1000 replica catalogue, data management testing, metadata development
2x£1000 storage but not with special interfaces
Total £14,000
Rejected machine to act as storage for common home areas
Rejected storage middleware development machines
Rejected security intrusion detection (see below)

Tier-1
======
7x£1000 for dCache pro, dev, tape middleware and integration test
Total £7,000
Rejected testing other people's SRMs.

Notes:

1. Testzone and other boxes were generally costed at £1k allowing some
contingency for KVM or other things.

2. T2 to decide where boxes go based on this feedback and institutes to
submit PPARC grants referring to these minutes.

It was agreed that testzone boxes (£23k) are funded by T2 hardware
budget. The remainder (£26k) will be found from contingency after
discussion between TD and DB. (Agreed following the meeting).

The PMB encouraged further national bids for (a) network monitoring -
setting up GridMon nodes at sites; and (b) security intrusion detection
(log server, snort, tripwire, nessus) if a joint case can be made on how
to run/configure these across all institutes.

3. Production Manager Issues - JC
=================================

1) Responsiveness of developers/EGEE to problems with middleware - over
the last few months we have raised a number of security related problems.
Some are fixed while others are not, and once raised the discussions tend
to be CERN centric. A number of problems have been raised at
meetings/workshops and on LCG-ROLLOUT that seem not to get registered
anywhere to be dealt with. Even problems logged in Savannah sometimes do
not progress. One problem in particular should be of concern, relating to
the instability of the RAL RB for BaBar jobs - it falls over after 30
mins. This was raised months ago and is stopping BaBar from making any
real use of Grid resources. UK should keep its own list of issues and
pursue them.

NEW ACTION 167.2: JC, with input from Stephen Burke, to maintain a list
of middleware issues that should be addressed.

2) Deployment schedules for LCG2/gLite releases. An announcement was made
that a release would be made every 3 months starting on 1st April.
Several sites scheduled manpower to perform upgrades around this release
timescale and, as it was not delivered on time, such sites will have
difficulty meeting requests for timely upgrades. We tried hard for a
fixed release timeline to ease deployment concerns at sites; missing the
release date impacts the credibility of GridPP, LCG and EGEE. The
transition to gLite is not clear. JG said that it was quite clear: SA1
will deploy elements of gLite on pre-production service and only consider
deploying them in production when they are shown to be at least as good
as the elements they are replacing. All gLite elements are designed to
co-exist with the previous software so they could be deployed in
production for a transition phase while users move. LCG 2_4_0 was
released on 7/4 which is better than forecast but not perfect. The three
week target for migration started then.

3) Networking provision and installation. RAL only just made it into
Service Challenge 2 after a commendable effort from the Tier-1 and
UKERNA. From this it is clear that the deployment and network areas of
GridPP need to be more aligned and that the deployment team need a single
point of contact who will be accountable for ensuring that network
hardware is installed and operational to agreed plans. The situation has
improved recently but there is still an issue as currently short (for
SC3) and long term (for end 2006) planning are not clear. There are also
questions in the deployment team about what support is available for
networking problems and what other GridPP activities are going to
provide - sites for instance want help monitoring their networks. This
may be an issue of communication.
These issues and more will be addressed in a phone conference to be held
on 15/4.

4) Engagement of experiments with UK deployment. Despite requests we are
struggling to get any direct feedback from experiments on usability,
configuration, reliability etc. of GridPP resources, yet related issues
keep being raised at the Board levels with LCG sites. In addition site
managers struggle to understand why sometimes their sites are used but at
other times they are completely empty - 0% utilisation! If we are to
provide a better service to the experiments we need to have feedback on
what is wrong - especially as jobs the DTEAM can run are only for the
DTEAM VO. Can we define UK contacts for each experiment who will respond
on experiment operational issues like BDII, software installation?

5) Tier-2 coordinator and hardware support activities. While the DTEAM
structure is improving there is room for improvement in the visible
output from the team and from the Tier-2s (web pages). The GridPP
structure of having hardware support posts report through a separate
channel does not help as it is not clear how their work is going to be
coordinated with the rest of GridPP deployment. The use of weekly
reporting has improved the situation for the coordinators but there is
still a disconnect between production manager needs from the team and the
local requirements on them - non-local and indirect management are not
ideal to meet GridPP and EGEE objectives. The hardware people only report
to SL for FTE figures in QPR. Their work should be an integral part of
the Deployment Team and engaged with the T2 coordinators.

There are a number of other issues that can and should be discussed
though there is some overlap with Tier-2 Board and Deployment Board
responsibilities:

6) Deployment of resources across the Tier-2s is well behind anticipated
levels.

7) Tools (even though also essential for local monitoring and security)
are not being deployed quickly enough at GridPP sites.
Ganglia is an example. Repeated requests have seen the number of sites
with such a tool increase from about 35% to 45%.

8) Understanding and following of procedures at sites is still quite
poor. This issue is raised in light of a number of sites not properly
reporting or working around security incidents. A HEPSYSMAN meeting in
April will focus on such issues but the pressure needs to come from
within the institutes.

9) Planning in Tier-2s was reported to be improving at the last Tier-2
Board - JC is still concerned though that the level of planning is
inadequate.

10) Support for operations and users. There are ideas to improve this but
so far none have been followed up, partly because of lack of manpower and
partly because of focus elsewhere for myself and the DTEAM.

11) Coordination across LCG/EGEE does not always provide much needed
help. For example an issue about migrating VO data off of classic SEs has
appeared as a UK issue at the weekly operations meeting for several
months.

NEW ACTION 167.3: RJ to define UK contacts for each experiment who will
respond on experiment operational issues.

4. Deployment Board Issues - DK
===============================

T2 and site engagement - JC constantly has to chase for reports. T2
coordinators engaged but are having problems getting replies from sites.
Recruitment of posts has been late but we need to get a culture change
from R&D to production. Prediction of battles ahead with gLite.

We should encourage people to travel to HEP Sysman Meeting (RAL in April)
where Grid deployment and support would be discussed. Travel funds are
quoted as an issue as such meetings are explicitly excluded in the travel
policy. UB had also noted this and the lack of support for training.
Travel policy should allow for the Sysman meeting.

NEW ACTION 167.4 TD to raise with RM possibilities for funding HEP SysMan
meetings

Reporting load - Weekly EGEE Operations, Quarterly Report. These were
placing a strain on all concerned.
Most concern - deliverables on project map - how to review them. DB - Not
yet got the metrics in the Project Map into its final form. DK concerned
load is too high. TD PowerPoint reports are focused on metrics.

Production, User support - culture still R&D not production. Activity at
higher level (ROC etc) but not at Tier-2 level.

Service Challenges - get T2s to same level as T1 - enormous amount of
manpower required. Maybe people will find this a challenge and get
engaged. Need a person to lead it in the UK. Who? Currently JC but time
is limited with other UK-wide role.

Documentation - we would definitely benefit from a proper technical
documentation writer. Needs a dedicated person with the right skills.
Unspent Tier-1 Staff money at RAL otherwise will go on hardware. Should
take this seriously.

NEW ACTION 167.5 DK to define a possible technical documentation role.

5. Dissemination - SP
=====================

JPhysG submitted - confirmation that they've got it. Quick review by one
of their Board Members.

Bid to PPARC PUS award. Grid Café with Dana Centre - Dave Colling,
Francois Grey, SP should hear back in a couple of months.

AHM 18 Abstracts 15 on Web 2 up from previous years.

AHM planning meeting in May - Dave Colling + Bekki going. DM needs list
of who is on the booth (free places).

NEW ACTION 167.6 DM to circulate draft agenda of AHM2005.

Bekki Pearce started as Events Officer. Working on Schools Talk. Will
attend QM Masterclass tomorrow to talk to sixth formers. Will also attend
PPARC Healthcare Industry Day on 28th.

Press release on SC2 in progress.

Business cards. PMB said they looked nice and we should all get them.

[Sarah cut off at that point]

6. UB Issues - DT
=================

Non-LHC non-BaBar concern about UK Grid strategy and wider issues e.g.
CDF. People less strongly wedded to Grid feel their science might be
compromised if they have to use RAL facilities through the Grid. PMB
don't really understand what the problems are.
GridPP mission is to build Grid. CMS also concerned about LCG. Maybe
others haven't tried yet. Need to show how e.g. see SLL Dublin Talk. DT
asked how much effort to get ZEUS going on LCG - 9 Staff Months perhaps.
RJ discussed yesterday - show and tell - what can be done, how much
effort etc. Schedule for next UB. Invite Stephen Burke to give basic talk
on job submission (or SLL possibly).

Technical Side - problems with site configuration (ATLAS). Already
covered this morning. JC wants contact with each experiment through UB
(technical person).

Data Management and Movement - tools weren't there (like they were for
cpu management). Summary now exists.

NEW ACTION 167.7 TD to forward Graeme Stewart's summary of Storage/Data
Management Workshop at CERN to PMB.

Lack of stability. Is it getting any better? From ATLAS point of view
getting better for Rome production. Whitelisting of sites for a VO will
be possible soon. Many UK sites stable (and off!). Can't rely on Dteam
tests each morning - too harsh.

RJ - Experiments had asked how they should request that a site supports a
particular OS - The request should be made via UB who will pass it on to
T2B which will result in an approach to the site.

7. Applications - RJ
====================

Need to firm up Service Challenges. Central planning still fluid. Also
what happens beyond this year. Partition some resources for SCs. SRM at
different sites on critical path. Thanked RAL for effort put in.

CDF SAMGrid. Rick now given up as link person leading Grid efforts. Rest
of CDF moderately atheist towards Grid. CDF once had a separate strategy,
now converging with LCG (in parts). Fermi will support CMS activities and
see if anything benefits CDF. SAMGrid deliverables need to be modified.
Also manpower issues. Morag leaving in May. PMB agreed that Metadata post
should be readvertised as Metadata not CDF. High level deliverables
probably fine but secondary ones should not be CDF specific. What about
Oxford post?
There need to be discussions with Oxford as to where this is going. Maybe
to adapt CDF software to LCG? Needs new deliverables to reflect this.

NEW ACTION 167.8 RJ to raise CDF deliverables definition with Todd
Huffmann.

8. M/S/N - DK
=============

Apologies from RM.

gLite release testing etc increasingly a deployment issue. The PMB issue
is expectation management. EGEE/CERN problem? Looks like a professional
release of many disparate components. Problem is integration.

Monitoring concerns. Discussion of R-GMA - outcome: invite Steves Fisher
and Traylen to present to PMB as to how to move R-GMA forward.

NEW ACTION 167.9 TD to invite Steve Fisher to discuss R-GMA status and
plans at a forthcoming PMB meeting.

Unfilled MSN posts. Less of an issue than elsewhere. Security middleware
report filled?

Quarterly reporting (security) - DK suggested separating out reporting.
TD not happy - helpful if they do it together (as in other areas).

Imperial integrated SGE in WMS. Who else interested - Durham?

DESY visiting RAL this week to talk about dCache.

NEW ACTION 167.10 DB to chase up all unfilled GridPP posts at this point.

9. Semi-External Issues
=======================

a) LCG Issues - (TC) - none reported

b) GGF Report - PC - none reported

c) LCG MOU / NGS issues - NG - none reported

d) EGEE PMB Report - DK for RM - RM has been proposed to continue as PMB
chair for another 6 months. EGEE2 task force is due to report. UK
partners will be meeting in Athens.

e) Phase 2 planning - CERN have sent letters to funding agencies with
annexes listing T1/T2 asking for validation and T1/T2 contacts. Many
replied that they will reply during CRRB. DM contacted Richard Wade after
the meeting. Experiments need capacity figures for T2s by end of May for
inclusion in computing TDRs. End of May is unrealistic - decided to
produce a separate document to accompany the TDRs to review. Should be
complete by mid August.
- All figures in TDRs will be planning ones as no-one will have signed
the MoU by the end of May.

Chris Eck reckons UK T1 plans don't balance our allocations with
Experiment Requirements separately for cpu/disk/tape. DB confirmed this
and said that tape needed to be adjusted as noted in his report. New
input has been requested from Dave Corney et al.

Networking - Comment from Dave Foster about UKERNA being confused over
timing of SJ5. This is wrong but if he had got this message we have a
problem. Action JG to raise at networking phone conference. The only new
information was the assumption that T1s need an extra 10Gb connection for
their traffic to T2s although this could be over a production IP
network?

Tapes - a small working group will be formed to discuss experiment models
for data transfer between T1 and T2.

Next Phase 2 Planning meeting - end June.

10. AOB
=======

1) Discussion of GridPP13 agenda - TD. No commitment yet from CERN to
send people. Later discussion with TC indicated there would be CERN
people available. Need this before planning agenda. No decision on
parallel session. Leave gap in programme for ad hoc discussions.

2) Dates of next meetings. Oversight Committee Fri 1st July. Cancel F2F
8th July. Next one 5th September in Birmingham. Collaboration Meeting at
Glasgow in 2007? Try for end May. Another one at RAL in Jan 2006? or,
more speculatively, Jan 2007. Volunteer sites required for future
Collaboration Meetings.

New Actions
===========

167.1: JG to report back from networking meeting on service challenge
activity/organisation.
167.2: JC, with input from Stephen Burke to maintain a list of middleware
issues that should be addressed
167.3: RJ to define UK contacts for each experiment who will respond on
experiment operational issues.
167.4 TD to discuss with RM revised funding policy for GridPP people
attending HEP SysMan meetings
167.5 DK to define a possible technical documentation role.
167.6 DM to circulate draft agenda of AHM2005.
167.8 RJ to raise CDF deliverables definition with Todd Huffmann.
167.10 DB to chase up all unfilled GridPP posts at this point.
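
Cross-check of the hardware figures: the per-site totals approved under
Tier-2 Issues item 7, and the split into testzone boxes (£23k) and the
remainder (£26k), can be verified with a short calculation. The sketch
below is illustrative only. The quantities and prices are transcribed
from the minutes; the data layout and the names APPROVED, site_total,
site_totals, grand_total, testzone_total and remainder are assumptions
made for this example and are not part of the minutes.

```python
# Approved allocations per site as (quantity, unit price in GBP, purpose).
# Figures come from the minutes; the structure itself is hypothetical.
APPROVED = {
    "London": [
        (8, 1000, "testzone"),
        (10, 300, "WLMS development"),
    ],
    "NorthGrid": [
        (4, 1000, "testzone"),
        (3, 1000, "VO Operation systems"),
        (3, 1000, "GridPP web server"),
    ],
    "SouthGrid": [
        (7, 1000, "testzone"),
    ],
    "ScotGrid": [
        (4, 1000, "testzone"),
        (8, 1000, "replica catalogue, data management testing, metadata development"),
        (2, 1000, "storage"),
    ],
    "Tier-1": [
        (7, 1000, "dCache pro, dev, tape middleware and integration test"),
    ],
}


def site_total(items):
    """Sum quantity times unit price over one site's approved items."""
    return sum(quantity * price for quantity, price, _purpose in items)


# Total approved for each site.
site_totals = {site: site_total(items) for site, items in APPROVED.items()}

# Total approved across all sites.
grand_total = sum(site_totals.values())

# Part of the total that is testzone boxes (charged to the T2 hardware budget).
testzone_total = sum(
    quantity * price
    for items in APPROVED.values()
    for quantity, price, purpose in items
    if purpose == "testzone"
)

# Everything else (to be found from contingency).
remainder = grand_total - testzone_total

for site, total in site_totals.items():
    print(f"{site}: £{total:,}")
print(f"All sites: £{grand_total:,}")
print(f"Testzone boxes: £{testzone_total:,}")
print(f"Remainder: £{remainder:,}")
```

Under these assumptions the per-site sums agree with the totals quoted in
the minutes, and the testzone and remainder figures come to £23,000 and
£26,000 respectively.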
