Dear All,
Please find attached the latest GridPP Project Management Board
Meeting minutes. The latest minutes can be found each week in:
http://www.gridpp.ac.uk/php/pmb/minutes.php?latest
as well as being listed with other minutes at:
http://www.gridpp.ac.uk/php/pmb/minutes.php
Cheers, Tony
________________________________________________________________________
Prof. A T Doyle, FInstP FRSE GridPP Project Leader
Rm 478, Kelvin Building Telephone: +44-141-330 5899
Dept of Physics and Astronomy Telefax: +44-141-330 5881
University of Glasgow EMail: [log in to unmask]
G12 8QQ, UK Web: http://ppewww.physics.gla.ac.uk/~doyle/
________________________________________________________________________
GridPP PMB Minutes 296 - 20th March 2008
========================================
Present: Tony Doyle, Sarah Pearce, David Britton, David Kelsey, Steve Lloyd,
Robin Middleton, Jeremy Coles, Glenn Patrick, Andrew Sansum, Dave Colling,
Suzanne Scott (Minutes)
Apologies: Roger Jones, Stephen Burke, Tony Cass, John Gordon, Pete Clarke,
Neil Geddes
1. Tier-1 short-term spending
==============================
AS made a request to the PMB as follows:
Some time ago AS alerted the PMB to the fact that we may need to agree
payment of the disk and CPU deliveries before formal acceptance had
completed. That is to allow us to pay the bill in this financial year
which is considered desirable.
Our current position is that:
1) 120 (of 182) disk servers have reached 19 out of 28 days planned RAL
load test. The remainder trail some way behind for a variety of minor
reasons (such as problems with a network switch or to hand over
initially, a few with hardware failures such as the RAID card). At
present our drive ejection/failure rate is consistent with about 4% per
annum. A little high, but not unexpected during the burn in period
(steady state rates are usually nearer 2-3%). Although we have not
completed the testing we have done a more careful and systematic job
than previously (which were also heavily tested) and are happy that we
see no systematic problems. Hardware should start churning out of the
end of the pipe by next Wednesday.
2) The CPUs are lagging behind somewhat, but the hardware has been
installed to Martin's satisfaction and the suppliers have run a 7 day
load test with the SL4 O/S. We are happy the system runs and there are
no thermal concerns. We hope to start our load test before Easter for
one lot, but this is unlikely to provide further useful input for
several weeks.
At this stage we would like to pay the bills and resolve any further
(hopefully minor) problems as they crop up operationally. Will the PMB
approve payment?
TD commented that this request appeared to be relatively straightforward.
AS agreed, except for the financial implications of not spending within
this financial year. AS reported that, regarding the hardware, stress-testing
of the disk servers was continuing and the servers should be in production
within 7 days. Payment of the bill was required by the middle of next week.
The CPUs had been delivered; the suppliers had run their own load testing
and were satisfied. The hardware was running OK and our own stress-testing
was about to commence. DK asked if the bill
would therefore be paid around 2nd April? AS confirmed yes, but this
would still count as within this financial year. AS confirmed he would
chase-up the CPU testing. Regarding tape, AS advised that it was
difficult to purchase, have it installed, and pay invoices on time at this
stage; future drive upgrades were as yet unknown. DB noted that we did
not want to buy equipment that we were not absolutely sure that we needed.
DB proposed supporting the payment which AS requested. AS advised that it
was a large sum and we were not quite through the whole process, but it
was important to involve the PMB at this time. TD requested that if
possible, the CPU tests should be carried out over Easter. The early
payment was agreed and approved.
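The "about 4% per annum" drive failure rate quoted in AS's request is an
annualised figure taken from a short burn-in window. A minimal sketch of that
scaling follows. The drive and failure counts are hypothetical, chosen only
to land near the quoted rate; the minutes do not record the actual numbers.

```python
def annualised_failure_rate(failures: int, drives: int, days_observed: float) -> float:
    """Scale the failure fraction seen in a short window to a per-annum rate."""
    if drives <= 0 or days_observed <= 0:
        raise ValueError("drives and days_observed must be positive")
    return (failures / drives) * (365.0 / days_observed)


# Hypothetical inputs: 2 drive ejections among 1000 drives after 19 days of load test.
rate = annualised_failure_rate(failures=2, drives=1000, days_observed=19)
print(f"Annualised failure rate: {rate:.1%}")
```

With so few failures in a window this short, the estimate carries a large
statistical uncertainty, which fits the description of the rate as only
"consistent with" 4%.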
2. GridPP input to STFC consultation process
=============================================
DB had circulated a draft email response. TD noted that the approach was
to tread carefully between the CB and STFC and provide input into the
consultation process - it was hoped the final version would be signed-off
and submitted to STFC later today. DB went through each point as stated.
It was noted that we have obligations to the wider community; there should
not be further cutbacks to computing support; the 3-month delay to the
hardware support is a problem; we have suffered four major cuts of £13
million in total - although the Review believes that 5% is a small amount,
seen in context it is extremely damaging.
GP asked for specific support for LHCb. DB recommended that this not be
made explicit in the overall GridPP response. TD noted that LHCb will be
overtly supported elsewhere. SL agreed, noting that lots of others will
make statements about LHCb, and that it was important not to dilute the
GridPP message. GP advised that GridPP was in danger of losing two
experiments off the Project Map: LHCb and ALICE. DK noted that a balance
was important, but that the GridPP message should not be diluted. DB
suggested that a statement could be inserted on the effect the cuts to
LHCb and ALICE would have on the Tier-1 and its ability to provide a
viable service - point two could be enlarged slightly without mentioning
specific experiments. DK noted that there were thresholds below which
things were not viable. It was agreed to add wording expressing concern that
cuts to GridPP support for individual experiments would take the Tier-1
below a viable level: "reduce Tier-1 level below a critical threshold" or
similar.
DK commented on point three: that any future delay would become even more
critical. Some comments by SP were still to be incorporated. It was
agreed to submit the statement as amended and circulate via the Minutes.
It was also circulated to the CB. The final statement, following feedback
from CB and PMB members, was as follows:
GridPP feedback on the programmatic review.
1. GridPP acknowledges that it is a user-led project that provides a
service to the community and if the scope of the constituent community
is altered then GridPP should respond appropriately. GridPP is
similarly aware that it must meet international obligations and has
already purchased the hardware necessary to meet the 2008 requirements
which restricts the options for incorporating new reductions.
2. The strong scientific merit of all the experiments serviced by GridPP
has previously been established by rigorous scientific peer-review. We
believe their reclassification by the recent Programmatic Review is a
reaction to a funding crisis and not a better representation of their
scientific value. We are concerned that cutting back GridPP support for
specific experiments will reduce their Tier-1 capability below a
critical threshold and translate directly into a disproportionate
reduction in the UK physics output.
3. Regardless of the proposed cut, the Programmatic Review has delayed £2m
of Tier-2 grants for analysis hardware by a minimum of 3 months. This
delay is already causing problems for the UK LHC groups in the
preparations for first data.
4. GridPP notes that the proposed reduction is the fourth in a sequence of
cuts and takes the project further below the level judged to be the
"minimum viable" by the PPRP review committee. There is an increased
risk that GridPP will fail to deliver a competitive service for UK
physicists. The sequence of cuts was as follows:
A. The GridPP3 proposal was de-scoped to a 70% scenario in the STFC award
of March 2007. The PPRP agreed that this was the "minimum viable
level".
B. The GridPP3 project was further reduced in July 2007 by the removal of
£1.3m that had been preserved in the GridPP2 project through careful
management in response to delays in the LHC schedule and due to the
success in attracting European funding for some GridPP posts.
C. The lack of funding for application support posts in the Rolling Grant
round was recognised by GridPP as a serious risk to the UK success in
extracting LHC Physics. We proposed to use the majority of the £1.3m
saved within GridPP2 to support this activity. When that was removed,
the posts were funded out of the GridPP Working Allowance with the
support of four Oversight Committees (GridPP, ATLAS, CMS, and LHCb).
However, this further restricted the future options for managing the
core GridPP3 project.
5. GridPP is now concerned at the prospect of a further 5% cut just at the
point of delivery. Cumulative cuts of £13m in the last year threaten our
ability to meet international obligations and UK physics analysis
goals. We are concerned that these previous cuts were not fully
appreciated by the review committee.
GridPP Collaboration.
3. AOCB
========
Re the DANTE proposal, TD proposed that the PMB endorse Robin Tasker's
email to David Foster, as follows:
"The UK is not supportive of the proposal from DANTE on many grounds. The
existing OPN is already operational and we judge we are close to
consensus on agreeing the operational handbook; there would be
considerable additional cost to follow the DANTE proposal; and concern
was expressed that the "ownership" of the OPN would shift which could
restrict our ability to manage its operation and development. However
we are also concerned that DANTE do not see this outcome in too negative
a light as their contribution engaging with the LHC community is to be
valued and encouraged."
The PMB agreed that the DANTE proposal was not appropriate. DB noted that it
would be useful to know what it was that DANTE had wanted to achieve. TD
advised that he would send DB some further info. It was agreed to endorse Robin
Tasker's response.
There had been an AHM call for papers which DB had circulated. This would
be discussed at the PMB next Thursday and should be added to the Agenda.
STANDING ITEMS
==============
SI-1 Dissemination Officer's Report
------------------------------------
SP reported that Neasan O'Neill had done a news item on GridPP20. SP was
currently awaiting a news item on the CCRC from GS and Raja Nandakumar.
She had received a response from STFC relating to LHC@Home - the grant
application had been turned down, with the feedback that the proposal did
not encourage enough engagement with LHC and hadn't before tried schools
as part of it. SP reported that she was continuing work with the
experiments to get more applications to run on LHC@Home. SP asked whether
a press release would be appropriate as yet for GridPP3? TD suggested
sending a draft to STFC for review. SP noted that finances did not
require to be mentioned. It was agreed that SP approach STFC for feedback
on a press release. TD suggested that a news item on the Project Map
would be useful. SP suggested that she could do this for the website, but
a press release itself would take a different form. DK noted that as we
are entering the data-taking phase, it is important to report something on
GridPP3.
SI-2 Tier-1 Manager's Report
-----------------------------
AS provided the following report:
1) Purchases:
a) Disk tender - supplier load test completed. Our 28 day load test has
now completed about 21 days for the majority of servers and is
progressing well.
b) CPU tender - Delivery received, installed and tested by suppliers. Our
28 day load test is about to commence.
c) Tape servers received and installed - closed.
d) Non-Capacity hardware delivered and accepted - will move into
production as required - closed.
e) Oracle server hardware upgrade order has been placed - eta next 7 days.
f) A Force10 C300 switch with 32 non-blocking 10Gb ports has been received
and will probably move into production in April (planning still
underway). This will be the main Tier-1 top level switch replacing our
Nortel 5530 central stack.
g) All tape media has been received - closed.
h) Additional RAID cards for the 2007 disk servers have been ordered and
are expected in the next week.
i) Replacement AFS servers have been received - they will go into
production as part of the AFS migration (covered under a separate item
in a later report).
j) Some Xen capable hosts have been ordered for the PPS cluster.
2) Backplane work has nearly completed - there are twelve servers
outstanding on the ATLAS CASTOR instance.
3) There was a scheduled 40min network outage this Tuesday as the main
site router was upgraded. We don't route our data services through this
router so will see little direct benefit.
Service
-------
1) SAM availability for last week was 98% (SL extract). RAL-LCG2
reliability for February (MB report) was 93% (target 93%).
2) CASTOR:
a) Upgrades to 2.1.6 are underway but have been delayed after encountering
a bug. Work is rescheduled for next week after hot fixes have been
received.
b) Upgrades to the ORACLE RACs have been delayed after sudden loss of
staff from the database team. This work is now scheduled to restart in
April.
c) We have encountered problems on the 2.1.4 CMS instance with disk to
disk copies running wild - however we don't intend to pursue this until
after the upgrade to 2.1.6.
d) Migration rates to tape have been improved to 20-30MB/s following
system tuning.
3) The Tier-1 is now primarily a Grid-only service. Only approved
exceptions continue to have access.
4) SL4 Migration
The SL4 UI continues to be held up owing to team priorities being
focused on hardware procurement, installation and acceptance.
SI-3 Production Manager's Report
---------------------------------
JC provided the following report:
Many deployment matters were raised at GridPP20, in the PMB and DB. Here
are a few updates/new items.
1) There has been some discussion about the UKQCD Tier-2 requirements
document, but there is more discussion needed. The main requirement is
"at least" 2GB memory per core with 4 GB preferred. Use of MPI is
desirable. The combination of these requirements will make it difficult
for most T2 sites to be of use in the short-term.
2) There seems to be renewed interest around EGEE/WLCG concerning the SAM
tests and availability calculations. In particular how site metrics are
impacted by "core" problems that are not a site fault.
3) GGUS have circulated a document outlining an Operational Level
Agreement between them and TPMs (http://edms.cern.ch/document/888089).
The requirements on TPMs are a concern for UK teams whose work is
already divided. We will put together a response with our concerns.
4) For EGEE-III an automation team is being created with a mandate to,
among other things, improve the integration of grid and site
monitoring: http://edms.cern.ch/document/888089. A Nagios instance is
already available (sysadmins look here:
https://twiki.cern.ch/twiki/bin/view/LCG/GridMonitoringNcg).
5) ATLAS will soon move production to use TierofAtlas settings for
ATLASMCDISK for production output. Few GridPP sites currently advertise
this token: http://wn3.epcc.ed.ac.uk/srm/xml/srm_token_table. Sites
which are to be used are being asked to update their configuration.
Meetings:
a) CCRC'08 F2F meeting at CERN - 1st April.
http://indico.cern.ch/conferenceDisplay.py?confId=30246.
b) The next UB meeting has been moved to 14:00 Wednesday 16th April. It
was scheduled for 19th March.
SI-4 LCG Management Board Report
---------------------------------
TD noted a report on the CCRC February Phase 1 and May Phase 2 tests. TC
had raised the issue that all projected IT and CPU capacity and power plans
were growing at 30% per annum, which implied a limit on what could be done
at the Tier-0 in the longer term. There were no other issues.
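The effect of 30% annual growth can be illustrated with simple compounding.
This is a minimal sketch that assumes a constant growth rate and a normalised
starting capacity of 1.0. The function names are illustrative and no actual
Tier-0 capacity or power figures are taken from the report.

```python
import math


def projected_capacity(base: float, annual_growth: float, years: int) -> float:
    """Capacity after compounding a fixed annual growth rate for `years` years."""
    return base * (1.0 + annual_growth) ** years


def doubling_time_years(annual_growth: float) -> float:
    """Number of years for capacity to double at a fixed annual growth rate."""
    return math.log(2.0) / math.log(1.0 + annual_growth)


# Normalised starting capacity of 1.0 (assumption), growing at 30% per annum.
for year in range(0, 6):
    print(year, round(projected_capacity(1.0, 0.30, year), 3))

print("Doubling time (years):", round(doubling_time_years(0.30), 2))
```

At this rate demand roughly doubles in a little over two and a half years,
which is why fixed power or floor-space limits become a constraint in the
longer term.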
SI-5 Documentation Officer's Report
------------------------------------
SB was not present.
REVIEW OF ACTIONS
=================
277.8 User Experience 'Team C': SB, SP, SL, with input from JC to deal
with the issue of user experience and design of an easily-found lookup
facility for grid error messages. SL reported that he had started the
ATLAS wiki page and would circulate the url. SB was leading this with
inputs from SP, SL and JC where needed. A new simple summary was required
of all areas available plus a lookup/links facility, for the OC to review.
This would include a list of most recent types of problems (possibly a
'top 12' for users - what the error means and the course of action to
follow). SB to progress this. It was noted that James Catmore (via the
DB) had volunteered to do this. This action is therefore transferred to
SL for progression via the Deployment Board. Done, item closed.
280.7 JC to mention the issues (when approached by a VO with regard to
joining) of the 'standard' 6-month introduction period, following which
the VO must set-up something specific to them, if appropriate. This was
discussed at DTeam. JC to email GridPP VO members if possible - ongoing.
This was a standing action - JC had discussed it with the Tier-2
Co-ordinators in relation to VO members. JC to send email. Ongoing -
Regional VOs are not yet validated - pending at the moment.
290.4 AS and JG to iterate regarding what could replace the Tier-1 Board.
Ongoing.
290.7 AS to provide numbers in the Quarterly Report for the Tier-1 as per
the ones provided for Tier-2. Ongoing - AS to provide the final GridPP2
and 2+ Quarterly Reports by end March.
290.8 AS/SP to iterate regarding the financial summary in the Quarterly
Reporting (eg: Outturn figures). Ongoing.
290.9 Quarterly Report for Tier-2 staff to be compiled by the Production
Manager. Done, item closed.
290.10 TD as Technical Director to provide a report showing effort
figures; milestones & metrics; and a table of posts showing Technical
Support. SP was currently progressing this - done, item closed.
290.18 Regarding the LCG box on the Project Map, SP to iterate with TC and
bring this issue back to the PMB. Content had now been sent by TC, done -
item closed.
290.20 RM to provide more detailed figures on travel expenditure -
broad-brush percentages would assist with decisions re travel in GridPP3.
This was now replaced by an action from the PMB F2F (see 295.10 below) -
done, item closed.
290.23 AS/JC to iterate on the Disaster Recovery template and remove
capturable items that were considered to be minor. Some progress had been
made - item ongoing.
290.24 JC to progress his suggested template to use when a crisis occurs -
to be revisited subsequently at a PMB. Some progress made - item ongoing.
292.1 TC and JC to iterate regarding the CERN system that recorded service
interdependence and enabled them to recover from crisis events. Reply
awaited, to be followed up - ongoing.
292.2 JG to review the interplay between Footprints and GGUS tickets on
the helpdesk. It was agreed that GGUS will be used as a helpdesk in the
UK as determined by the DB. Action closed.
292.4 JC to use the template from the disaster planning and apply it to
the RAL power failure. This has been done, and JC will circulate. Done,
item closed.
293.2 A PMB document to be written for the OC regarding NGI metrics, and
SP would provide some metrics for this. This has been replaced by an
action from the PMB F2F (see 295.8 below). Done, item closed.
294.1 Steve Fisher to speak to Pat Kite in the first instance re core
funding for training, and revert to the PMB if he required assistance with
a formal proposal document. Done, item closed.
294.2 All - to provide DB with Agenda items for the F2F in Dublin. Done,
item closed.
294.3 DB to contact Janet Seed or Jordan of STFC regarding up-to-date
financial information. Done, item closed.
295.1 DB to re-draft the attachment to the GridPP letter to STFC (in
response to the latest cuts imposed) and recirculate to PMB for approval.
Done, item closed.
295.2 Re the Project Map, SP to insert 'network plans' to ensure they were
up-to-date at each site - this would ensure 'suitable network planning
provision'. [SP to see the wiki sent by TD]. Ongoing.
295.3 It was agreed that there should be a formal look at Network Planning
for the Project Map next year involving PC, RJ, DK and RM - PC to
organise. Ongoing.
295.4 TD (as Technical Director) to address the issue of Data & Storage on
the Project Map and get back to SP with inputs. Ongoing.
295.5 RM to get back to SP with inputs regarding the EGEE box on the
Project Map. RM gave clarification on R-GMA, and was still working on the
EGEE box. Ongoing.
295.6 SP noted that she was awaiting a VOMS report from AS and a Grid
Vulnerability report from DK - these were almost in the nature of two
Quarterly Reports. AS and DK to provide appropriate inputs. These
related to metrics and milestones from the Project Map. Ongoing.
295.7 Re network contingency, PC to request clarification from Robin
Tasker if the cost quoted was for 1Gig only. Ongoing.
295.8 Re NGI planning, JG to produce a document/statement on the GridPP
position (due to his MB perspective), and SP to assist with metrics. JG
to liaise with RM re EGEE inputs. Ongoing.
295.9 DB, RM and SP to target categories for the travel budget for the
coming year. Targets are required for how much GridPP might spend and in
what categories of expenditure. Ongoing.
295.10 RM to provide categories and breakdown of travel + additionals to
enable monitoring and decision-making. Ongoing.
ACTIONS AS AT 20.03.08
======================
280.7 JC to mention the issues (when approached by a VO with regard to
joining) of the 'standard' 6-month introduction period, following which
the VO must set-up something specific to them, if appropriate. This was
discussed at DTeam. JC to email GridPP VO members if possible - ongoing.
This was a standing action - JC had discussed it with the Tier-2
Co-ordinators in relation to VO members. JC to send email.
290.4 AS and JG to iterate regarding what could replace the Tier-1 Board.
290.7 AS to provide numbers in the Quarterly Report for the Tier-1 as per
the ones provided for Tier-2. AS to provide the final GridPP2 and 2+
Quarterly Reports by end March.
290.8 AS/SP to iterate regarding the financial summary in the Quarterly
Reporting (eg: Outturn figures).
290.23 AS/JC to iterate on the Disaster Recovery template and remove
capturable items that were considered to be minor.
290.24 JC to progress his suggested template to use when a crisis occurs -
to be revisited subsequently at a PMB.
292.1 TC and JC to iterate regarding the CERN system that recorded service
interdependence and enabled them to recover from crisis events. Reply
awaited, to be followed up.
295.2 Re the Project Map, SP to insert 'network plans' to ensure they were
up-to-date at each site - this would ensure 'suitable network planning
provision'. [SP to see the wiki sent by TD].
295.3 It was agreed that there should be a formal look at Network Planning
for the Project Map next year involving PC, RJ, DK and RM - PC to
organise.
295.4 TD (as Technical Director) to address the issue of Data & Storage on
the Project Map and get back to SP with inputs.
295.5 RM to get back to SP with inputs regarding the EGEE box on the
Project Map.
295.6 SP noted that she was awaiting a VOMS report from AS and a Grid
Vulnerability report from DK - these were almost in the nature of two
Quarterly Reports. AS and DK to provide appropriate inputs. These
related to metrics and milestones from the Project Map.
295.7 Re network contingency, PC to request clarification from Robin
Tasker if the cost quoted was for 1Gig only.
295.8 Re NGI planning, JG to produce a document/statement on the GridPP
position (due to his MB perspective), and SP to assist with metrics. JG
to liaise with RM re EGEE inputs.
295.9 DB, RM and SP to target categories for the travel budget for the
coming year. Targets are required for how much GridPP might spend and in
what categories of expenditure.
295.10 RM to provide categories and breakdown of travel + additionals to
enable monitoring and decision-making.
296.1 SP to approach STFC for feedback on a proposed press release
relating to GridPP3.
INACTIVE CATEGORY
=================
271.1 PMB to examine the issue of fibre breakage and outages, CERN-RAL OPN
link, in one year's time, when actual data on breakages is available.
Due date would be September '08.
271.3 Re CERN-RAL OPN link breakage and backup generally, PC to oversee
the issue and collate info so that the PMB have something to revisit in
one year's time. Due date September '08. It was noted that PC would
circulate a revised document after discussion with ATLAS (RJ/PC/DN to
iterate).
282.8 RM to monitor how R-GMA and networking issues impact on GridPP as
matters progress. RM advised that this item should be moved to the
'inactive' category as it will develop over the coming months. RM
discussed the issue with Steve Fisher and advised that support of R-GMA is
required whilst APEL is dependent on it. RM reported that he has spoken
to SF and there is currently no change to the R-GMA situation - process
ongoing.
290.19 DB/SP to progress the details of the Project Map over the next few
months, cross-checking that all elements are incorporated, including
strategic priorities and staffing. To be completed before the next
Oversight Committee.
The next PMB would take place on Thursday 27 March 2008 at 1:00 pm.