Dear All,
Please find attached the F2F GridPP Project Management Board Meeting
minutes. These can also be found at:
http://www.gridpp.ac.uk/pmb/minutes/050414.txt
as well as being listed with other minutes at
http://www.gridpp.ac.uk/php/pmb/minutes.php
Cheers, Tony
________________________________________________________________________
Tony Doyle, GridPP Project Leader Telephone: +44-141-330 5899
Rm 478, Kelvin Building Telefax: +44-141-330 5881
Dept of Physics and Astronomy EMail: [log in to unmask]
University of Glasgow Web: http://ppewww.ph.gla.ac.uk/~doyle
G12 8QQ, UK Video - IP: 194.36.1.32
________________________________________________________________________
GridPP PMB Minutes 167 - 14th April 2005
========================================
Face to Face Meeting at Imperial
--------------------------------
Present: Dave Britton (pm), Jeremy Coles, Tony Doyle, John Gordon, Roger
Jones, Dave Kelsey, Steve Lloyd, Deborah Miller, Sarah Pearce (phone, pm),
Dan Tovey (phone, am)
Apologies: Tony Cass, Robin Middleton
1. Tier-1 Issues - JG
=====================
Summary of issues presented by A Sansum to Tier1A Stakeholders Board on
11th April was discussed.
1) Another power cut in March. Probable cause - electrical work
close to the EPO system. Impact - 1 day outage. Consequence - must give
greater consideration to protecting critical and sensitive systems from
power failure. Will assess requirement for UPS for critical services. May
be financial cost to GridPP.
2) Service Challenge activity has sharply increased and expectations from
LCG are higher than a few months ago. Additional effort will be needed to
meet the ongoing challenges in this area. Ditto capacity requirements. The
service challenge work is becoming a bigger driver than any single
experiment. A discussion on networking is planned; a description of roles is
needed. Will raise at the networking meeting.
NEW ACTION 167.1: JG to report back from networking meeting on service
challenge activity/organisation.
3) Increasingly concerned that there will be difficulties meeting Service
Challenge commitments on either the production or UKLIGHT network over the
next 18 months. However further work needs to be done before we have a
clear picture of either the requirement (from T1) or the capability of
UKLIGHT or SJ4/5. A meeting is scheduled with the GridPP networking team
to discuss requirements / capabilities.
4) SRM worked well in SC2. Continuing issue that xrootd is deployed at
T1A only for BaBar.
5) We are finding it extremely difficult to allocate disk in the dynamic
manner requested by the User Board. Provisioning is slow and
labour-intensive. A lesser issue is obtaining the release of capacity from
experiments in order to re-allocate it. RJ said there was no use case to
reduce disk allocations. Once allocated, disk will be used essentially
forever. It is not obvious how this fits with the UB model for data challenges.
6) Tenders delayed until - a) effort was available and b) pending evidence
that utilization justified additional capacity purchasing. This now
appears to be the case.
7) Advertisements for Tier1 positions should appear very shortly (maybe
1-2 weeks).
Previous Issues that have been resolved:
1) dCache disk SRM is deployed and is working. CMS much happier. However
support effort for dCache is essentially 1 FTE which is rather worrying if
it does not reduce over the next 3 months.
2) Tape back end to dCache/SRM implemented and tested. Full test under SC2
shortly.
3) CPU Utilisation figures for January-March are much healthier
4) Red Hat 7.3 almost entirely phased out (except disk servers) - SL3
installed on LCG. SLC3 still an issue - it was good that we chose SL3.
5) Scheduling - Although work continues in this area, much progress has
been made and the hard partition between the Tier1 and Tier-A has become
much softer allowing jobs to migrate between both infrastructures with a
single common scheduler.
6) Sun services terminated - but disposal not completed
BaBar - TD reported on long discussion with BaBar Computing Coordinator
Stephen Gowdy. No US effort for Grid; all impetus coming from Europe. The
Monte Carlo problem is soluble, but no progress on analysis. TD told SG
the current position and how we would meet their 2005-06 requirements
(they should use Grid to share with LHC).
Overall Tier1A Plan has been sent to Phase 2 Planning Committee for LCG
and to PPARC. Ken sent an accompanying note to Janet Seed observing that
the UK Tier1 resources were light. A pulse of resources is required in
2007/8. JS advises us to have plans for cutbacks in other areas if there
is no extra money. SRIF3 outcome may not be good for T2 resources.
JC asked about tape planning.
2. Tier-2 Issues - SL
=====================
1) Continuing delays in appointing hardware support people. Four out of
nine are in post with another two appointed. Total T2 effort has gone up
even though unfunded effort may be under-reported.
The issue of delayed grant start was discussed. Six month delays in grants
are allowed by PPARC. Some further slippage of three months is allowed by
the reporting period making June the cut-off.
2) T2 middleware people should report through RM.
3) Use of Tier-2 Hardware money - testbeds now or contingency for later?
4) Failure of sites to upgrade/LCG to deliver upgrades. Eventually got to
2.3.0 or 2.3.1 at most sites. Will see how 2.4.0 migration goes. Give
feedback to CERN on shortage of migration period and lack of sufficient
testing.
5) (Lack of) understanding of experiment computing models (esp analysis).
Task force led by Roger to provide detailed bandwidth requirements for UK.
6) Future resources - SRIF3 etc. QMUL have been successful but few
other successes are known so far.
7) The PMB considered the hardware bids received via T2s
All T2s had bid for systems to use for the planned UK testzone and some
middleware development.
After long discussion, the PMB approved the following:
London
======
8x£1000 systems for the testzone
10x£300 boxes for WLMS development
Total £11,000
Rejected network systems
NorthGrid
=========
4x£1000 systems for the testzone
3x£1000 VO Operation systems
3x£1000 GridPP web server
Total £10,000
Network Monitoring deferred (see below); dCache rejected.
SouthGrid
=========
7x£1000 systems for the testzone
Total £7,000
RAID rejected
VMware was discussed, but not considered appropriate to commit funding
to at this time.
ScotGrid
========
4x£1000 systems for the testzone
8x£1000 replica catalogue, data management testing, metadata development
2x£1000 storage but not with special interfaces
Total £14,000
Rejected machine to act as storage for common home areas
Rejected storage middleware development machines
Rejected security intrusion detection (see below)
Tier-1
======
7x£1000 for dCache production, development, tape middleware and integration test
Total £7,000
Rejected testing other people's SRMs.
Notes:
1. Testzone and other boxes were generally costed at £1k allowing some
contingency for KVM or other things.
2. T2 to decide where boxes go based on this feedback and institutes to
submit PPARC grants referring to these minutes.
It was agreed that testzone boxes (£23k) are funded by T2 hardware budget.
The remainder (£26k) will be found from contingency after discussion
between TD and DB. (Agreed following the meeting).
The PMB encouraged further national bids for
(a) network monitoring - setting up GridMon nodes at sites;
and
(b) security intrusion detection (log server, snort, tripwire, nessus)
if a joint case can be made on how to run/configure these across all
institutes.
3. Production Manager Issues - JC
=================================
1) Responsiveness of developers/EGEE to problems with middleware - over
the last few months we have raised a number of security related problems.
Some are fixed while others are not, and once raised the discussions tend
to be CERN-centric. A number of problems have been raised at
meetings/workshops and on LCG-ROLLOUT that seem not to get registered
anywhere to be dealt with. Even problems logged in Savannah sometimes do
not progress. One problem in particular should be of concern relating to
the instability of the RAL RB for BaBar jobs - it falls over after 30
mins. This was raised months ago and is stopping BaBar from performing any
real use of Grid resources.
UK should keep its own list of issues and pursue them.
NEW ACTION 167.2: JC, with input from Stephen Burke, to maintain a list
of middleware issues that should be addressed.
2) Deployment schedules for LCG2/gLite releases. An announcement was
made that a release would be made every 3 months starting on 1st April.
Several sites scheduled manpower to perform upgrades around this release
timescale and as it was not delivered on time such sites will have
difficulty meeting requests for timely upgrades. We tried hard for a
fixed release timeline to ease deployment concerns at sites; missing the
release date impacts the credibility of GridPP, LCG and EGEE.
The transition to gLite is not clear. JG said that it was quite clear: SA1
will deploy elements of gLite on pre-production service and only consider
deploying them in production when they are shown to be at least as good as
the elements they are replacing. All gLite elements are designed to
co-exist with the previous software so they could be deployed in
production for a transition phase while users move.
LCG 2_4_0 was released on 7/4 which is better than forecast but not
perfect. The three week target for migration started then.
3) Networking provision and installation. RAL only just made it into
Service Challenge 2 after a commendable effort from the Tier-1 and UKERNA.
From this it is clear that the deployment and network areas of GridPP need
to be more aligned and that the deployment team need a single point of
contact who will be accountable for ensuring that network hardware is
installed and operational to agreed plans. The situation has improved
recently but there is still an issue, as currently short-term (for SC3) and
long-term (for end 2006) planning are not clear. There are also questions in
the deployment team about what support is available for networking
problems and what other GridPP activities are going to provide - sites for
instance want help monitoring their networks. This may be an issue of
communication.
These issues and more will be addressed in a phone conference to be held
on 15/4.
4) Engagement of experiments with UK deployment. Despite requests we are
struggling to get any direct feedback from experiments on usability,
configuration, reliability etc. of GridPP resources yet related issues
keep being raised at the Board levels with LCG sites. In addition site
managers struggle to understand why sometimes their sites are used but at
other times they are completely empty - 0% utilisation! If we are to
provide a better service to the experiments we need to have feedback on
what is wrong - especially as the only jobs the DTEAM can run are for the
DTEAM VO.
Can we define UK contacts for each experiment who will respond on
experiment operational issues like - BDII, software installation.
5) Tier-2 coordinator and hardware support activities. While the DTEAM
structure is improving there is room for improvement in the visible output
from the team and from the Tier-2s (web pages). The GridPP structure of
having hardware support posts report through a separate channel does not
help as it is not clear how their work is going to be coordinated with the
rest of GridPP deployment. The use of weekly reporting has improved the
situation for the coordinators but there is still a disconnect between
production manager needs from the team and the local requirements on them
- non-local and indirect management are not ideal to meet GridPP and EGEE
objectives.
The hardware people only report to SL for FTE figures in the QPR. Their work
should be an integral part of the Deployment Team and engaged with the T2
coordinators.
There are a number of other issues that can and should be discussed though
there is some overlap with Tier-2 Board and Deployment Board
responsibilities:
6) Deployment of resources across the Tier-2s is well behind anticipated
levels.
7) Tools (even though also essential for local monitoring and security)
are not being deployed quickly enough at GridPP sites. Ganglia is an
example. Repeated requests have seen the number of sites with such a tool
increase from about 35% to 45%.
8) Understanding and following of procedures at sites is still quite poor.
This issue is raised in light of a number of sites not properly reporting
or working around security incidents. A HEPSYSMAN meeting in
April will focus on such issues but the pressure needs to come from within
the institutes.
9) Planning in Tier-2s was reported to be improving at the last Tier-2
Board - JC is still concerned though that the level of planning is
inadequate.
10) Support for operations and users. There are ideas to improve this but
so far none have been followed up partly because of lack of manpower and
partly because of focus elsewhere for JC and the DTEAM.
11) Coordination across LCG/EGEE does not always provide much needed help.
For example an issue about migrating VO data off classic SEs has
appeared as a UK issue at the weekly operations meeting for several
months.
NEW ACTION 167.3: RJ to define UK contacts for each experiment who will
respond on experiment operational issues.
4. Deployment Board Issues - DK
===============================
T2 and site engagement - JC constantly has to chase for reports. T2
coordinators engaged but are having problems getting replies from sites.
Recruitment of posts has been late but we need to get a culture change
from R&D to production. Prediction of battles ahead with gLite.
We should encourage people to travel to HEP Sysman Meeting (RAL in April)
where Grid deployment and support would be discussed. Travel funds are
quoted as an issue as such meetings are explicitly excluded in the travel
policy. UB had also noted this and the lack of support for training.
Travel policy should allow for the Sysman meeting.
NEW ACTION 167.4 TD to raise with RM possibilities for funding HEP SysMan
meetings
Reporting load - Weekly EGEE Operations, Quarterly Report. These were
placing a strain on all concerned. Most concern - deliverables on project
map - how to review them.
DB - the metrics in the Project Map are not yet in their final form. DK
concerned the load is too high. TD - PowerPoint reports are focused on metrics.
Production, User support - culture still R&D not production. Activity at
higher level (ROC etc) but not at Tier-2 level.
Service Challenges - get T2s to same level as T1 - enormous amount of
manpower required. Maybe people will find this a challenge and get
engaged. Need a person to lead it in the UK. Who? Currently JC but time is
limited with other UK-wide role.
Documentation - we would definitely benefit from a proper technical
documentation writer. Needs a dedicated person with the right skills.
Unspent Tier-1 staff money at RAL will otherwise go on hardware. Should
take this seriously.
NEW ACTION 167.5 DK to define a possible technical documentation role.
5. Dissemination - SP
=====================
JPhysG submitted - confirmation that they have received it. Quick review by
one of their Board Members.
Bid to PPARC PUS award: Grid Café with the Dana Centre - Dave Colling,
Francois Grey. SP should hear back in a couple of months.
AHM: 18 abstracts, 15 on the web, up 2 from previous years.
AHM planning meeting in May - Dave Colling + Bekki going. DM needs list of
who is on the booth (free places).
NEW ACTION 167.6 DM to circulate draft agenda of AHM2005.
Bekki Pearce started as Events Officer. Working on Schools Talk. Will
attend QM Masterclass tomorrow to talk to sixth formers.
Will also attend the PPARC Healthcare Industry Day on the 28th.
Press release on SC2 in progress.
Business cards. PMB said they looked nice and we should all get them.
[Sarah cut off at that point]
6. UB Issues - DT
=================
Non-LHC non-BaBar concern about UK Grid strategy and wider issues e.g.
CDF. People less strongly wedded to Grid feel their science might be
compromised if they have to use RAL facilities through the Grid. PMB don't
really understand what the problems are. GridPP mission is to build Grid.
CMS also concerned about LCG. Maybe others haven't tried yet. Need to show
how - see e.g. SLL's Dublin talk. DT asked how much effort to get ZEUS going
on LCG - 9 Staff Months perhaps. RJ discussed yesterday - show and tell -
what can be done, how much effort etc. Schedule for next UB. Invite
Stephen Burke to give basic talk on job submission (or SLL possibly).
Technical Side - problems with site configuration (ATLAS). Already
covered this morning. JC wants contact with each experiment through UB
(technical person).
Data Management and Movement - tools weren't there (like they were for cpu
management). Summary now exists.
NEW ACTION 167.7 TD to forward Graeme Stewart's summary of Storage/Data
Management Workshop at CERN to PMB.
Lack of stability. Is it getting any better? From ATLAS point of view
getting better for Rome production. Whitelisting of sites for a VO will be
possible soon. Many UK sites stable (and off!). Can't rely on DTEAM tests
each morning - too harsh.
RJ - Experiments had asked how they should request that a site supports a
particular OS - The request should be made via UB who will pass it on to
T2B which will result in an approach to the site.
7. Applications - RJ
====================
Need to firm up Service Challenges. Central planning still fluid. Also
what happens beyond this year. Partition some resources for SCs.
SRM at different sites on critical path. Thanked RAL for effort put in.
CDF SAMGrid. Rick has now given up as the link person leading Grid efforts.
The rest of CDF is moderately atheist towards Grid. CDF once had a separate
strategy, now converging with LCG (in parts).
Fermi will support CMS activities and see if anything benefits CDF.
SAMGrid deliverables need to be modified. Also manpower issues. Morag
leaving in May. PMB agreed that Metadata post should be readvertised as
Metadata not CDF. High level deliverables probably fine but secondary ones
should not be CDF specific. What about the Oxford post? There need to be
discussions with Oxford as to where this is going. Maybe to adapt CDF
software to LCG?
Needs new deliverables to reflect this.
NEW ACTION 167.8 RJ to raise CDF deliverables definition with Todd
Huffmann.
8. M/S/N - DK
=============
Apologies from RM.
gLite release testing etc. is increasingly a deployment issue. The PMB issue
is expectation management. EGEE/CERN problem? It looks like a professional
release of many disparate components. The problem is integration.
Monitoring concerns. Discussion of R-GMA - outcome: invite Steve Fisher
and Steve Traylen to present to the PMB on how to move R-GMA forward.
NEW ACTION 167.9 TD to invite Steve Fisher to discuss R-GMA status and
plans at a forthcoming PMB meeting.
Unfilled MSN posts. Less of an issue than elsewhere. Security middleware
report filled?
Quarterly reporting (security) - DK suggested separating out reporting.
TD not happy - helpful if they do it together (as in other areas).
Imperial integrated SGE in WMS. Who else interested - Durham?
DESY visiting RAL this week to talk about dCache.
NEW ACTION 167.10 DB to chase up all unfilled GridPP posts at this point.
9. Semi-External Issues
=======================
a) LCG Issues - (TC) - none reported
b) GGF Report - PC - none reported
c) LCG MOU / NGS issues - NG - none reported
d) EGEE PMB Report - DK for RM - RM has been proposed to continue as
PMB chair for another 6 months. EGEE2 task force is due to report. UK
partners will be meeting in Athens.
e) Phase 2 planning - CERN have sent letters to funding agencies with
annexes listing T1/T2, asking for validation and T1/T2 contacts. Many
replied that they will reply during the CRRB.
DM contacted Richard Wade after the meeting
Experiments need capacity figures for T2s by end of May for inclusion
in computing TDRs. End of May is unrealistic - decided to produce a
separate document to accompany the TDRs for review. Should be complete by
mid-August. All figures in TDRs will be planning ones as no-one will
have signed the MoU by the end of May.
Chris Eck reckons the UK T1 plans don't balance our allocations with
experiment requirements separately for CPU/disk/tape.
DB confirmed this and said that tape needed to be adjusted as noted in
his report. New input has been requested from Dave Corney et al.
Networking - Comment from Dave Foster about UKERNA being confused over the
timing of SJ5. This is wrong, but if he had got this message we have a
problem. Action JG to raise at the networking phone conference. The only new
information was the assumption that T1s need an extra 10 Gb connection for
their traffic to T2s, although this could be over a production IP network.
Tapes - a small working group will be formed to discuss experiment models
for data transfer between T1 and T2.
Next Phase 2 Planning meeting - end June.
10. AOB
=======
1) Discussion of GridPP13 agenda - TD
No commitment yet from CERN to send people. Later discussion with TC
indicated there would be CERN people available.
Need this before planning agenda.
No decision on parallel session.
Leave gap in programme for ad hoc discussions.
2) Dates of next meetings.
Oversight Committee Fri 1st July
Cancel F2F 8th July. Next one 5th September in Birmingham.
Collaboration Meeting at Glasgow in 2007? Try for end May.
Another one at RAL in Jan 2006? or, more speculatively, Jan 2007.
Volunteer sites required for future Collaboration Meetings.
New Actions
===========
167.1: JG to report back from networking meeting on service
challenge activity/organisation.
167.2: JC, with input from Stephen Burke to maintain a list of middleware
issues that should be addressed
167.3: RJ to define UK contacts for each experiment who will
respond on experiment operational issues.
167.4 TD to discuss with RM revised funding policy for GridPP people
attending HEP SysMan meetings
167.5 DK to define a possible technical documentation role.
167.6 DM to circulate draft agenda of AHM2005.
167.7 TD to forward Graeme Stewart's summary of Storage/Data Management
Workshop at CERN to PMB.
167.8 RJ to raise CDF deliverables definition with Todd Huffmann.
167.9 TD to invite Steve Fisher to discuss R-GMA status and plans at a
forthcoming PMB meeting.
167.10 DB to chase up all unfilled GridPP posts at this point.