Prof. A T Doyle, FInstP FRSE GridPP Project Leader
Rm 478, Kelvin Building Telephone: +44-141-330 5899
Dept of Physics and Astronomy Telefax: +44-141-330 5881
University of Glasgow
G12 8QQ, UK Web: http://ppewww.physics.gla.ac.uk/~doyle/
On Thu, 29 Nov 2007, Tony Doyle wrote:
> Dear All,
> Please find attached the latest weekly GridPP Project Management
> Board Meeting minutes. The latest minutes can be found each week in:
> as well as being listed with other minutes at:
> Cheers, Tony
GridPP PMB Minutes 282 - 22nd November 2007
Face-to-face meeting at RAL.
Present: David Britton, Stephen Burke, Peter Clarke, Jeremy Coles,
Tony Doyle, Neil Geddes, John Gordon, Dave Kelsey, Steve Lloyd,
Robin Middleton, Trish Mullins, Sarah Pearce (EVO), Dave Newbold,
Andrew Sansum, Glenn Patrick (minutes)
Apologies: Roger Jones
Yingqin Zheng observed for the Pegasus project.
1. Tier 1 Review
TD outlined the T1 Review that had been held the previous day. One
recommendation was for a Service Delivery Plan and monitoring system for
future planning of GridPP3 (includes definition of all the services;
assessment of their criticality; monitoring technique; failover procedure;
expert list and call-out procedure, as well as the disaster recovery plan
already envisaged). Also, there is a mismatch between current metrics in
the old ProjectMap and the service delivery to experiments now required.
Better integration with Deployment Team was also recommended. Encourage
more handshaking between T1 and T2 in establishing good practice and
PMB in first instance should review current experiment requirements and
present them in a succinct form to T1 as implementation model. Talks from
SRM workshop could be a useful starting point? Take a snapshot of
experiment high-level (top-down) requirements rather than having dispersed
DB made the point that comments about all information being in computing
TDRs were no longer useful.
ACTION: TD and JG to constitute a group with experiments and other parties
to capture experiment requirements and how they relate to UK.
2. GridPP3 Project Planning
SP had circulated an email with initial thoughts on experiments and
production metrics. Initial ideas sought from GP and DN on 6-7 general
areas for monitoring. CMS gave 5 technical QOS areas, whilst LHCb had 7
high-level experiment areas of Grid use.
DN said it was important that metrics are linked to perceived results and
measurements that could be made. GP said that although LHCb and CMS had
adopted different approaches, it looked like the underlying metrics would
be very similar.
DB asked how we deal with metrics which go "red" and we can't do anything
to rectify them (e.g. for the OSC). High-level metrics are supported, but
we have to be able to define the next steps. Concluded that this looks
like right approach, but need to be able to dissect things when they go
SP will work with experiments to define specific metrics. DB made the
point that we are likely to need to refine and/or redefine the metric set
with the benefit of experience.
RAS commented that we should also look in the WLCG MOU at current metrics.
These will be measured anyway (e.g. ticket response).
GP had raised what to do about ALICE? DB suggested to keep ALICE-specific
metrics in "Other" box. For the remaining "other" experiments it may be
sufficient to see if there are some simple generic metrics (e.g. how many
VOs at Tier 2s) or it may also be appropriate to have experiment-specific
A whole range of production metrics needed to be evaluated. Some needed to
be dropped and some amended. DB suggested starting with the list of
services in the T1 questionnaire and the metrics should measure how they
relate to experiment delivery. The 10 metrics recommended for dropping
SB raised whether the average number of sites/quarter available in VO
selection (0.144) should be dropped. DB suggested monitoring blacklists
might be appropriate.
Metrics where there was no agreement on future:
0.110 (GridPP Tape Storage) - DB suggested changing this metric to be based
on "does tape service work".
0.117 (Job failure rates) - should be retained, although difficult to
0.127 (T1 meeting PPS commitments) and 0.128 (meeting JRA1 commitments) -
agreed to drop.
0.129 (T1 meeting "other" user commitments) - should be covered by user
area.
0.131 (T1 service disaster recovery) - this has been overtaken by
Metrics to be amended:
0.104 (no. job slots) should be kept.
0.105 (fraction of LCG job slots used) should be kept.
0.107 (GridPP KSI2K available to EGEE/LCG) should appear separately for T1
and each T2 centre.
0.114 (fraction of available tape used in quarter) should be dropped as
covered by monitoring tape service.
0.124 (GridPP security audit) covered elsewhere.
0.130 (testbed) should be dropped.
0.132 (Prod. Service risks/issues) covered by Service Delivery Plan and
0.136 and 0.137 (delivering to LCG MOU - availability targets) retained as
Spreadsheet items: 0.101 and 0.102 (registered and active users) - numbers
need to be known, but not as up-front metric in the project map.
ACTION: SP to progress the Project Map using the T1 service areas and
input from the meeting.
3. Tier-2 Hardware Allocation
SL showed slides from the T2 Board held on 16 November. Hardware allocated
by formula as advertised in advance. Using experiment/institute matrix and
costing model, hardware for 2008/2009 was allocated. No obligation to
support experiments which an institute is not part of, but credit is given for
supporting any VO. Some issues over those institutes supporting more than
one VO being advantaged. T2 Board agreed it would not change the matrix.
Acceptable for institutes to move hardware around within a T2, as with
manpower. For next phase, the accounting period will be 2Q08 - 1Q09
A number of complaints had arisen. Because of possible anomalies arising
from the formula approach, SL proposed that: (a) GridPP make available an
extra 100K to help in genuine cases, (b) a special (1/2 page) case be
made to GridPP for consideration, (c) cases be collated by each T2
Chair, who then sends the cases and a list of any internal transfers to SL.
Procedure supported by PMB with SL and NG to evaluate cases.
ACTION: SL and NG to progress and iterate procedure with T2s.
4. Dissemination Issues
SP presented the new Web page. Only comment was that it was too wide for
screens - maybe one column too much? Otherwise, everyone thought it was a
good first draft.
5. gLite Support Proposal
Nick Trigg (STFC/CLIK) introduced a proposal to provide commercial support
service for gLite. This could be for individual GridPP institutes. Funding
could flow through Constellation Technologies with added value.
DB noted that a specific case would need to be investigated to see if this
model could work. NT to communicate with DB to explore if there are any
6. UK Prioritisation of Resources
TD raised the question of the appropriate level to set UK prioritised
resources for T1 and T2. RJ had suggested that ~20% of the total be set
aside for ATLAS. Revised WLCG pledges had not yet been made, but were now
urgent. There was now a need to identify two numbers for ATLAS, CMS and
LHCb - namely the fraction of activity at T1 and T2 reserved only for UK
For CMS, DN agreed to a figure of 25% for T2 and 0% for T1. GP made the
point that LHCb has a very different computing model and it was not
obvious how this could be implemented. However, the principle of reserving
0% of T2 and 25% of T1 for LHCb could be agreed - but it would then be up
to the experiment whether it then chose to use these resources for UK or
7. GridPP3 MOU
TD introduced version 2 of the MOU. Binds UK project for three years and
will be signed by reps of the four regional T2s and the T1. Agreed with
STFC. First action of the new Deployment Board will be to sign off the
Agreed hardware fractions (minimum) broken down by experiment for each
institute need to appear. Need to draft next Monday for STFC - a draft of
WLCG pledges would then be ready prior to global deadline of Friday 30
The current (working) version of the GridPP MoU is available at:
The input to WLCG planning is available at:
ACTION: Updated MOU needs to be sent to CB.
8. EGI/NGI Plans and Planning
RM outlined EGI Design Study now underway. Only 18 months before
transition from EGEE starts. More science than just HEP needed, more
funding for NGI specific functions and community representation,
governance, etc. 9 partner institutes including STFC and CERN. Six work
Deliverables: D2.1 Dec 2007 EGI consolidated requirements and use
cases, March 2008 EGI Workshop, June 2008 EGI Blueprint publication.
UK NGI - assumed to be based on NGS. GridPP sites become
partner/affiliate. Interoperability - NGS VO on GridPP, SRM-SRB
interoperation. Some services already in NGS such as Certificate
Authority, GGUS, VOMS. Funding line will become clearer in April 2008.
Proposal to NGS Board on 6 December 2007. GridPP strategy for transition
to EGI/NGI needs to be defined.
9. Disaster Planning
SB presented a set of slides, mainly from JC. OC had been extremely
concerned that we do not have planning for a wide range of potential disasters.
Disasters covered "known knowns" (disks will fail), "known unknowns" (fire)
and "unknown unknowns" (something preventing data transfers). Probability,
impact and scope are the important factors.
Plans for both disaster recovery and business continuity planning needed.
For OC meeting on 10 October, a paper was submitted covering high-level
failure modes and impact on experiment services. Networking perhaps should
be in a separate document. Tier 1 included, but should have own disaster
plan. CMS experiment scenarios had been included (along with LHCb).
DN pointed out that network throughput limitations observed in CSA07 could
be a disaster during real data taking. There needs to be a way of
declaring a "disaster" for things which are beyond experiment control. JC
commented that a strategy is needed to deal with each scenario.
TM emphasised that OC meant GridPP to look at the basic things that could
be put in place to correct things when they go wrong. DB pointed out that
for T1 there would be a Service Delivery Plan which should cover
ACTION: JC and SB to progress existing template for next F2F meeting on 21
Feb. Involve experiments as necessary.
10. Network Resilience
PC said main thrust was whether a single 10Gb link is an issue? Does not
warrant diverting GridPP funds, but keep under review. However, ATLAS (RJ)
say they cannot stand a 6 day outage. Needs further consideration.
Brookhaven and Fermilab have triangle connection, and other T1s fall back
on cross 10Gb link with another T1.
ACTION: Need further input from RJ on 6 day issue and decide on way
forward. Keep action open.
11. Castor Status and CSA07 Outcome
DN: Goal was ~50% test of entire 2008 computing system. Tried to do for 6
weeks and included T2 for analysis. Extensive programme of "link
commissioning" which did not converge fully in time for CSA07. Outcomes -
serious issues with implications for physics goals. Transfers were a weak
point (storage system capability). Most individual components tested.
RAL T1: In general, late coming up to speed for CSA. All SL4 resources
available (though not used). Major update of software area carried out
without problem. Weak point - CASTOR 2.1.3 performance for WAN transfers.
Never exceeded 100MB/s for more than a day or so. Weak point - JANET and
OPN connections to RAL.
T2 centres: Bristol/Brunel did not work (DPM incompat, etc). PPD/Imperial
worked very well.
Castor - just started testing the new 2.1.4 prod instance. Extensive
programme of tests planned for December. Attempting to understand
complexity wrt tape handling (cannot control allocation of tape drives to
Plan to test Castor internal data flow (i.e. d2d and tape migration). Also,
test CASTOR SRM2+internal_RFIO_gridftp from RAL PPD. Bring resources back
online for CMS, recommission links.
Bottom line - more testing required to achieve confidence in Castor for
12. R-GMA and Networking
RM presented slides after talking to Steve Fisher and Robin Tasker.
R-GMA: re-engineering to new design will be completed by 31 March 2008.
Remain part of gLite distribution. Support being negotiated outside EGEE
and GridPP3 (1FTE). Important that work is completed by end March since no
obvious source for future development. Used by dashboards, APEL, Grid
Ireland. Expect new users.
RAS raised problem of T1 service run for R-GMA if there is no support
Service Discovery - API to hide underlying information system. Work on
SAGA (OGF) spec about to go public. C++ version by 31 March 2008. New
activity also in SA3.
1 FTE being funded in this area by GridPP for 2 years - skills required to
ACTION: RM to monitor how this impacts GridPP as matters progress.
Networking - some GridPP2 deliverables outstanding. New GridPP2+
deliverables - UKLIGHT, Gridmon, etc. GridMon effort seconded (part time)
The meeting finished at 16:20. Next F2F meeting in Glasgow on 1st Feb.
2008. Next EVO meeting on Monday 3rd December.
ACTIONS AS AT 22.11.07
271.2 Re CERN-RAL OPN link breakage, RJ to provide an analysis of what the
consequences would be to Experiments for a one-day break, a three-day
break, a five-day break, etc. The outcome of these need to be assessed
for disaster scenario planning.
272.4 AS to check the current Tier-1 disaster recovery plan and circulate
the existing version to the PMB. It was reported that this document does
not exist, but it was planned to have one in the longer term. TD would
incorporate in v0.4 anything that AS considered relevant. AS will check
and advise additions.
277.2 DN to provide an update and re-evaluation of CMS/CASTOR
277.4 Castor 'Team A': TC, AS, JG, RJ, DN, GP to provide inputs relating
to CASTOR and a breakdown of issues that could be incorporated into
meta-level deliverables for the next 6-month period.
277.5 Disaster Recovery 'Team B': SB, JC, TD, SP, DB to analyse the wider
issues of disaster planning, mapped to the experiments' lists, and this
work would include Project Management. A Recovery Plan was required. It
was agreed that JC was in charge of this and the experiment input relating
to subsets of the disaster plan. TD noted that first thoughts on
categorising inputs would be required for the next F2F meeting - this
would ensure categories were laid down and an idea of what could be said
under each category by way of examples that were clear. DB noted that SB
could deal with this as an Agenda item at the next F2F meeting and provide
a pre-idea of evolution (on behalf of JC who would not be present).
277.7 SP and NO to review existing user documentation areas - it was noted
that these need to appeal to the lowest common denominator, be less
technical, and be easier to find. SP reported that she and NO were working
on a re-designed front page that would be easier to use. SP would send an
email to SB summarising her ongoing thoughts and would iterate with SB.
277.8 User Experience 'Team C': SB, SP, SL, with input from JC to deal
with the issue of user experience and design of an easily-found lookup
facility for grid error messages.
277.9 24x7 cover at Tier 1 'Team D': AS and JG to discuss this issue and
see what could be achieved in relation to possible shift rotas/on
call/overtime at weekends.
278.3 JC to look at the Quarterly Reports, funded vs unfunded effort, to
see if there is a correlation between the lack of unfunded effort and
related site problems.
278.8 Regarding the GridPP3 SLA and EGEE SA1 putting forward a draft of
its Service Level Description for sites/ROCs to discuss - it was agreed
that TD & DK would go through the GridPP3 SLA and review it in terms of
consistency of style.
278.10 ALL: inputs on EGEEIII -> EGI to be sent to RM/TD.
279.4 Regarding CASTOR, DN to provide input on CMS after CSA07, and AS to
speak to Bonny Strong (high-level planning to be met - a formal
recognition of progress is required with well-stated goals).
280.3 JC to elicit more specific objections from Site Admins, to set UID
for glexec, to be built-into glexec testing and cert procedures.
280.6 JG to bring up this issue (the biomed VO and 'sieving') at the ROC
Manager's meeting (done) - a broadcast is to go out from EGEE which will
be helpful in underlining acceptable use of Grid resources and would act
as a reminder to VOs about the policy they have signed-up to in relation
to their users. JC had now emailed the Chair to have this discussed -
EGEE broadcast part of this action ongoing.
280.7 JC to mention the issues (when approached by a VO with regard to
joining) of the 'standard' 6-month introduction period, following which
the VO must set-up something specific to them, if appropriate. This had
been discussed at DTeam, done. JC to email GridPP VO members if possible
280.8 JG to investigate the UKI ROC website - any change/progress, and
281.1 DB to circulate an updated F2F Agenda.
281.2 TD to circulate an outline Agenda for the GridPP20 Collaboration
281.3 SL to raise the issue of user checks on running code (pre-testing
procedure/workbook advice) at the Software Installation Tools (SIT)
meeting to be able to point people in the right direction prior to
releasing code across the grid.
282.1 TD and JG to constitute a group with experiments and other parties
to capture experiment requirements and how they relate to UK.
282.2 SP to progress the Project Map using the T1 service areas and input
from the meeting.
282.3 SL and NG to progress issues relating to Tier-2 hardware
allocation/complaints and iterate procedure with T2s.
282.4 Nick Trigg and DB to iterate regarding the possibility of provision
of commercial support service for gLite.
282.5 Updated GridPP3 MOU needs to be sent to CB (TD to provide updated
version for SL to circulate).
282.6 JC and SB to progress existing 'disaster planning' template for next
F2F meeting on 21 Feb. Involve experiments as necessary.
282.7 RJ to provide input relating to '6 day issue' (network resilience
outage) and decide on way forward. Keep action open.
282.8 RM to monitor how R-GMA and networking issues impact on GridPP as
247.2 RJ to get further information from ATLAS regarding use of Grid for
testing of PANDA, and report-back. RJ reported that there was a planned
series of tests for a few sites in the UK - Rod Walker was in charge of
this. No further details were available at present.
251.1 TD to raise the issue of memory vs CPU cost at the MB [in order to
work out what the requirement was between 1GB and 2GB memory per core].
This was discussed at the MB, cost was understood, it was agreed that 2GB
memory per core was now a requirement in relation to future procurements.
AS noted this and agreed. Done, item closed.
271.1 PMB to examine the issue of fibre breakage and outages, CERN-RAL OPN
link, in one year's time, when actual data on breakages is available.
Due date would be September '08.
271.3 Re CERN-RAL OPN link breakage and backup generally, PC to oversee
the issue and collate info so that the PMB have something to revisit in
one year's time. Due date September '08.
There would be no PMB next Monday (26th November) due to the F2F.
Next meeting Monday 3rd December.