Dear All,
Please find attached the F2F GridPP Project Management Board Meeting
minutes. These can also be found at:
http://www.gridpp.ac.uk/pmb/minutes/050414.txt
as well as being listed with other minutes at
http://www.gridpp.ac.uk/php/pmb/minutes.php
Cheers, Tony
________________________________________________________________________
Tony Doyle, GridPP Project Leader Telephone: +44-141-330 5899
Rm 478, Kelvin Building Telefax: +44-141-330 5881
Dept of Physics and Astronomy EMail: [log in to unmask]
University of Glasgow Web: http://ppewww.ph.gla.ac.uk/~doyle
G12 8QQ, UK Video - IP: 194.36.1.32
________________________________________________________________________
GridPP PMB Minutes 167 - 14th April 2005
========================================
Face to Face Meeting at Imperial
--------------------------------
Present: Dave Britton (pm), Jeremy Coles, Tony Doyle, John Gordon, Roger
Jones, Dave Kelsey, Steve Lloyd, Deborah Miller, Sarah Pearce (phone, pm),
Dan Tovey (phone, am)
Apologies: Tony Cass, Robin Middleton
1. Tier-1 Issues - JG
=====================
Summary of issues presented by A Sansum to Tier1A Stakeholders Board on
11th April was discussed.
1) Another power cut in March. Probable cause - electrical work
close to the EPO system. Impact - 1 day outage. Consequence - must give
greater consideration to protecting critical and sensitive systems from
power failure. Will assess requirement for UPS for critical services. May
be financial cost to GridPP.
2) Service Challenge activity has sharply increased and expectations from
LCG are higher than a few months ago. Additional effort will be needed to
meet the ongoing challenges in this area. Ditto capacity requirements. The
service challenge work is becoming a bigger driver than any single
experiment. A discussion on networking is planned; a description of roles is
needed. Will raise at the networking meeting.
NEW ACTION 167.1: JG to report back from networking meeting on service
challenge activity/organisation.
3) Increasingly concerned that there will be difficulties meeting Service
Challenge commitments on either the production or UKLIGHT network over the
next 18 months. However further work needs to be done before we have a
clear picture of either the requirement (from T1) or the capability of
UKLIGHT or SJ4/5. A meeting is scheduled with the GridPP networking team
to discuss requirements / capabilities.
4) SRM worked well in SC2. Continuing issue that xrootd is deployed at
T1A only for BaBar.
5) We are finding it extremely difficult to allocate disk in the dynamic
manner requested by the User Board. Provisioning is slow and
labour-intensive. A lesser issue is obtaining the release of capacity from
experiments in order to re-allocate it. RJ said there was no use case to
reduce disk allocations. Once allocated, disk will be used essentially
forever. It is not obvious how this fits with the UB model for data challenges.
6) Tenders delayed until - a) effort was available and b) pending evidence
that utilization justified additional capacity purchasing. This now
appears to be the case.
7) Advertisements for Tier1 positions should appear very shortly (maybe
1-2 weeks).
Previous Issues that have been resolved:
1) dCache disk SRM is deployed and is working. CMS much happier. However
support effort for dCache is essentially 1 FTE which is rather worrying if
it does not reduce over the next 3 months.
2) Tape back end to dCache/SRM implemented and tested. Full test under SC2
shortly.
3) CPU Utilisation figures for January-March are much healthier
4) Red Hat 7.3 almost entirely phased out (except disk servers) - SL3
installed on LCG. SLC3 still an issue - it was good that we chose SL3.
5) Scheduling - Although work continues in this area, much progress has
been made and the hard partition between the Tier1 and Tier-A has become
much softer allowing jobs to migrate between both infrastructures with a
single common scheduler.
6) Sun services terminated - but disposal not completed
BaBar - TD reported on long discussion with BaBar Computing Coordinator
Stephen Gowdy. No US effort for Grid; all impetus coming from Europe. The
Monte Carlo problem is soluble, but no progress on analysis. TD told SG
the current position and how we would meet their 2005-06 requirements
(they should use Grid to share with LHC).
Overall Tier1A Plan has been sent to Phase 2 Planning Committee for LCG
and to PPARC. Ken sent an accompanying note to Janet Seed observing that
the UK Tier1 resources were light. A pulse of resources is required in
2007/8. JS advises us to have plans for cutbacks in other areas if there
is no extra money. SRIF3 outcome may not be good for T2 resources.
JC asked about tape planning.
2. Tier-2 Issues - SL
=====================
1) Continuing delays in appointing hardware support people. Four out of
nine are in post with another two appointed. Total T2 effort has gone up
even though unfunded effort may be under-reported.
The issue of delayed grant start was discussed. Six month delays in grants
are allowed by PPARC. Some further slippage of three months is allowed by
the reporting period making June the cut-off.
2) T2 middleware people should report through RM.
3) Use of Tier-2 Hardware money - testbeds now or contingency for later?
4) Failure of sites to upgrade/LCG to deliver upgrades. Eventually got to
2.3.0 or 2.3.1 at most sites. Will see how 2.4.0 migration goes. Give
feedback to CERN on shortage of migration period and lack of sufficient
testing.
5) (Lack of) understanding of experiment computing models (esp analysis).
Task force led by Roger to provide detailed bandwidth requirements for UK.
6) Future resources - SRIF3 etc. QMUL have been successful but few
other successes are known so far.
7) The PMB considered the hardware bids received via T2s
All T2s had bid for systems to use for the planned UK testzone and some
middleware development.
After long discussion, the PMB approved the following:
London
======
8x£1000 systems for the testzone
10x£300 boxes for WLMS development
Total £11,000
Rejected network systems
NorthGrid
=========
4x£1000 systems for the testzone
3x£1000 VO Operation systems
3x£1000 GridPP web server
Total £10,000
Network Monitoring deferred (see below); dCache rejected.
SouthGrid
=========
7x£1000 systems for the testzone
Total £7,000
RAID rejected
VMware was discussed, but not considered appropriate to commit funding
to at this time.
ScotGrid
========
4x£1000 systems for the testzone
8x£1000 replica catalogue, data management testing, metadata development
2x£1000 storage but not with special interfaces
Total £14,000
Rejected machine to act as storage for common home areas
Rejected storage middleware development machines
Rejected security intrusion detection (see below)
Tier-1
======
7x£1000 for dCache production, development, tape middleware and integration test
Total £7,000
Rejected testing other people's SRMs.
Notes:
1. Testzone and other boxes were generally costed at £1k allowing some
contingency for KVM or other things.
2. T2 to decide where boxes go based on this feedback and institutes to
submit PPARC grants referring to these minutes.
It was agreed that testzone boxes (£23k) are funded by T2 hardware budget.
The remainder (£26k) will be found from contingency after discussion
between TD and DB. (Agreed following the meeting).
The PMB encouraged further national bids for
(a) network monitoring - setting up GridMon nodes at sites;
and
(b) security intrusion detection (log server, snort, tripwire, nessus)
if a joint case can be made on how to run/configure these across all
institutes.
3. Production Manager Issues - JC
=================================
1) Responsiveness of developers/EGEE to problems with middleware - over
the last few months we have raised a number of security related problems.
Some are fixed while others are not, and once raised the discussions tend
to be CERN-centric. A number of problems have been raised at
meetings/workshops and on LCG-ROLLOUT that seem not to get registered
anywhere to be dealt with. Even problems logged in Savannah sometimes do
not progress. One problem in particular should be of concern relating to
the instability of the RAL RB for BaBar jobs - it falls over after 30
mins. This was raised months ago and is stopping BaBar from performing any
real use of Grid resources.
UK should keep its own list of issues and pursue them.
NEW ACTION 167.2: JC, with input from Stephen Burke, to maintain a list
of middleware issues that should be addressed.
2) Deployment schedules for LCG2/gLite releases. An announcement was
made that a release would be made every 3 months starting on 1st April.
Several sites scheduled manpower to perform upgrades around this release
timescale and as it was not delivered on time such sites will have
difficulty meeting requests for timely upgrades. We tried hard for a
fixed release timeline to ease deployment concerns at sites; missing the
release date impacts the credibility of GridPP, LCG and EGEE.
The transition to gLite is not clear. JG said that it was quite clear: SA1
will deploy elements of gLite on pre-production service and only consider
deploying them in production when they are shown to be at least as good as
the elements they are replacing. All gLite elements are designed to
co-exist with the previous software so they could be deployed in
production for a transition phase while users move.
LCG 2_4_0 was released on 7/4 which is better than forecast but not
perfect. The three week target for migration started then.
3) Networking provision and installation. RAL only just made it into
Service Challenge 2 after a commendable effort from the Tier-1 and UKERNA.
From this it is clear that the deployment and network areas of GridPP need
to be more aligned and that the deployment team need a single point of
contact who will be accountable for ensuring that network hardware is
installed and operational to agreed plans. The situation has improved
recently but there is still an issue, as currently short-term (for SC3) and
long-term (for end 2006) planning are not clear. There are also questions in
the deployment team about what support is available for networking
problems and what other GridPP activities are going to provide - sites for
instance want help monitoring their networks. This may be an issue of
communication.
These issues and more will be addressed in a phone conference to be held
on 15/4.
4) Engagement of experiments with UK deployment. Despite requests we are
struggling to get any direct feedback from experiments on usability,
configuration, reliability etc. of GridPP resources yet related issues
keep being raised at the Board levels with LCG sites. In addition site
managers struggle to understand why sometimes their sites are used but at
other times they are completely empty - 0% utilisation! If we are to
provide a better service to the experiments we need to have feedback on
what is wrong - especially as the only jobs the DTEAM can run are for the
DTEAM VO.
Can we define UK contacts for each experiment who will respond on
experiment operational issues like - BDII, software installation.
5) Tier-2 coordinator and hardware support activities. While the DTEAM
structure is improving there is room for improvement in the visible output
from the team and from the Tier-2s (web pages). The GridPP structure of
having hardware support posts report through a separate channel does not
help as it is not clear how their work is going to be coordinated with the
rest of GridPP deployment. The use of weekly reporting has improved the
situation for the coordinators but there is still a disconnect between
production manager needs from the team and the local requirements on them
- non-local and indirect management are not ideal to meet GridPP and EGEE
objectives.
The hardware people only report to SL for FTE figures in the QPR. Their work
should be an integral part of the Deployment Team and engaged with the T2
coordinators.
There are a number of other issues that can and should be discussed though
there is some overlap with Tier-2 Board and Deployment Board
responsibilities:
6) Deployment of resources across the Tier-2s is well behind anticipated
levels.
7) Tools (even though also essential for local monitoring and security)
are not being deployed quickly enough at GridPP sites. Ganglia is an
example. Repeated requests have seen the number of sites with such a tool
increase from about 35% to 45%.
8) Understanding and following of procedures at sites is still quite poor.
This issue is raised in light of a number of sites not properly reporting
or working around security incidents. A HEPSYSMAN meeting in
April will focus on such issues but the pressure needs to come from within
the institutes.
9) Planning in Tier-2s was reported to be improving at the last Tier-2
Board - JC is still concerned though that the level of planning is
inadequate.
10) Support for operations and users. There are ideas to improve this but
so far none have been followed up partly because of lack of manpower and
partly because of focus elsewhere for JC and the DTEAM.
11) Coordination across LCG/EGEE does not always provide much needed help.
For example an issue about migrating VO data off classic SEs has
appeared as a UK issue at the weekly operations meeting for several
months.
NEW ACTION 167.3: RJ to define UK contacts for each experiment who will
respond on experiment operational issues.
4. Deployment Board Issues - DK
===============================
T2 and site engagement - JC constantly has to chase for reports. T2
coordinators engaged but are having problems getting replies from sites.
Recruitment of posts has been late but we need to get a culture change
from R&D to production. Prediction of battles ahead with gLite.
We should encourage people to travel to HEP Sysman Meeting (RAL in April)
where Grid deployment and support would be discussed. Travel funds are
quoted as an issue as such meetings are explicitly excluded in the travel
policy. UB had also noted this and the lack of support for training.
Travel policy should allow for the Sysman meeting.
NEW ACTION 167.4 TD to raise with RM possibilities for funding HEP SysMan
meetings
Reporting load - Weekly EGEE Operations, Quarterly Report. These were
placing a strain on all concerned. Most concern - deliverables on project
map - how to review them.
DB - the metrics in the Project Map are not yet in their final form. DK
concerned the load is too high. TD - PowerPoint reports are focused on metrics.
Production, User support - culture still R&D not production. Activity at
higher level (ROC etc) but not at Tier-2 level.
Service Challenges - get T2s to same level as T1 - enormous amount of
manpower required. Maybe people will find this a challenge and get
engaged. Need a person to lead it in the UK. Who? Currently JC but time is
limited with other UK-wide role.
Documentation - we would definitely benefit from a proper technical
documentation writer. Needs a dedicated person with the right skills.
Unspent Tier-1 staff money at RAL will otherwise go on hardware. Should
take this seriously.
NEW ACTION 167.5 DK to define a possible technical documentation role.
5. Dissemination - SP
=====================
JPhysG submitted - confirmation that they have received it. Quick review by
one of their Board Members.
Bid to PPARC PUS award: Grid Café with the Dana Centre - Dave Colling,
Francois Grey. SP should hear back in a couple of months.
AHM: 18 abstracts, 15 on the web, up 2 from previous years.
AHM planning meeting in May - Dave Colling + Bekki going. DM needs list of
who is on the booth (free places).
NEW ACTION 167.6 DM to circulate draft agenda of AHM2005.
Bekki Pearce started as Events Officer. Working on Schools Talk. Will
attend QM Masterclass tomorrow to talk to sixth formers.
Will also attend the PPARC Healthcare Industry Day on the 28th.
Press release on SC2 in progress.
Business cards. PMB said they looked nice and we should all get them.
[Sarah cut off at that point]
6. UB Issues - DT
=================
Non-LHC non-BaBar concern about UK Grid strategy and wider issues e.g.
CDF. People less strongly wedded to Grid feel their science might be
compromised if they have to use RAL facilities through the Grid. PMB don't
really understand what the problems are. GridPP mission is to build Grid.
CMS also concerned about LCG. Maybe others haven't tried yet. Need to show
how - see e.g. SLL's Dublin talk. DT asked how much effort to get ZEUS going
on LCG - 9 Staff Months perhaps. RJ discussed yesterday - show and tell -
what can be done, how much effort etc. Schedule for next UB. Invite
Stephen Burke to give basic talk on job submission (or SLL possibly).
Technical Side - problems with site configuration (ATLAS). Already
covered this morning. JC wants contact with each experiment through UB
(technical person).
Data Management and Movement - tools weren't there (like they were for cpu
management). Summary now exists.
NEW ACTION 167.7 TD to forward Graeme Stewart's summary of Storage/Data
Management Workshop at CERN to PMB.
Lack of stability. Is it getting any better? From ATLAS point of view
getting better for Rome production. Whitelisting of sites for a VO will be
possible soon. Many UK sites stable (and off!). Can't rely on DTEAM tests
each morning - too harsh.
RJ - Experiments had asked how they should request that a site supports a
particular OS - The request should be made via UB who will pass it on to
T2B which will result in an approach to the site.
7. Applications - RJ
====================
Need to firm up Service Challenges. Central planning still fluid. Also
what happens beyond this year. Partition some resources for SCs.
SRM at different sites on critical path. Thanked RAL for effort put in.
CDF SAMGrid. Rick has now given up as the link person leading Grid efforts.
The rest of CDF is moderately atheist towards Grid. CDF once had a separate
strategy, now converging with LCG (in parts).
Fermi will support CMS activities and see if anything benefits CDF.
SAMGrid deliverables need to be modified. Also manpower issues. Morag
leaving in May. PMB agreed that Metadata post should be readvertised as
Metadata not CDF. High level deliverables probably fine but secondary ones
should not be CDF specific. What about the Oxford post? There need to be
discussions with Oxford as to where this is going. Maybe to adapt CDF
software to LCG?
Needs new deliverables to reflect this.
NEW ACTION 167.8 RJ to raise CDF deliverables definition with Todd
Huffmann.
8. M/S/N - DK
=============
Apologies from RM.
gLite release testing etc. is increasingly a deployment issue. The PMB issue
is expectation management. EGEE/CERN problem? It looks like a professional
release of many disparate components. The problem is integration.
Monitoring concerns. Discussion of R-GMA - outcome: invite Steve Fisher
and Steve Traylen to present to the PMB on how to move R-GMA forward.
NEW ACTION 167.9 TD to invite Steve Fisher to discuss R-GMA status and
plans at a forthcoming PMB meeting.
Unfilled MSN posts. Less of an issue than elsewhere. Security middleware
report filled?
Quarterly reporting (security) - DK suggested separating out reporting.
TD not happy - helpful if they do it together (as in other areas).
Imperial integrated SGE in WMS. Who else interested - Durham?
DESY visiting RAL this week to talk about dCache.
NEW ACTION 167.10 DB to chase up all unfilled GridPP posts at this point.
9. Semi-External Issues
=======================
a) LCG Issues - (TC) - none reported
b) GGF Report - PC - none reported
c) LCG MOU / NGS issues - NG - none reported
d) EGEE PMB Report - DK for RM - RM has been proposed to continue as
PMB chair for another 6 months. EGEE2 task force is due to report. UK
partners will be meeting in Athens.
e) Phase 2 planning - CERN have sent letters to funding agencies with
annexes listing T1/T2, asking for validation and T1/T2 contacts. Many
replied that they will reply during the CRRB.
DM contacted Richard Wade after the meeting
Experiments need capacity figures for T2s by end of May for inclusion
in computing TDRs. End of May is unrealistic - decided to produce a
separate document to accompany the TDRs for review. Should be complete by
mid-August. All figures in TDRs will be planning ones as no-one will
have signed the MoU by the end of May.
Chris Eck reckons the UK T1 plans don't balance our allocations with
experiment requirements separately for CPU/disk/tape.
DB confirmed this and said that tape needed to be adjusted as noted in
his report. New input has been requested from Dave Corney et al.
Networking - Comment from Dave Foster about UKERNA being confused over the
timing of SJ5. This is wrong, but if he had got this message we have a
problem. Action JG to raise at the networking phone conference. The only new
information was the assumption that T1s need an extra 10 Gb connection for
their traffic to T2s, although this could be over a production IP network.
Tapes - a small working group will be formed to discuss experiment models
for data transfer between T1 and T2.
Next Phase 2 Planning meeting - end June.
10. AOB
=======
1) Discussion of GridPP13 agenda - TD
No commitment yet from CERN to send people. Later discussion with TC
indicated there would be CERN people available.
Need this before planning agenda.
No decision on parallel session.
Leave gap in programme for ad hoc discussions.
2) Dates of next meetings.
Oversight Committee Fri 1st July
Cancel F2F 8th July. Next one 5th September in Birmingham.
Collaboration Meeting at Glasgow in 2007? Try for end May.
Another one at RAL in Jan 2006? or, more speculatively, Jan 2007.
Volunteer sites required for future Collaboration Meetings.
New Actions
===========
167.1: JG to report back from networking meeting on service
challenge activity/organisation.
167.2: JC, with input from Stephen Burke to maintain a list of middleware
issues that should be addressed
167.3: RJ to define UK contacts for each experiment who will
respond on experiment operational issues.
167.4 TD to discuss with RM revised funding policy for GridPP people
attending HEP SysMan meetings
167.5 DK to define a possible technical documentation role.
167.6 DM to circulate draft agenda of AHM2005.
167.7 TD to forward Graeme Stewart's summary of Storage/Data Management
Workshop at CERN to PMB.
167.8 RJ to raise CDF deliverables definition with Todd Huffmann.
167.9 TD to invite Steve Fisher to discuss R-GMA status and plans at a
forthcoming PMB meeting.
167.10 DB to chase up all unfilled GridPP posts at this point.