Dear All,
Please find attached the latest F2F GridPP Project Management
Board Meeting minutes. The latest minutes can be found each week in:
http://www.gridpp.ac.uk/php/pmb/minutes.php?latest
as well as being listed with other minutes at:
http://www.gridpp.ac.uk/php/pmb/minutes.php
Cheers, Tony
________________________________________________________________________
Prof. A T Doyle, FInstP FRSE GridPP Project Leader
Rm 478, Kelvin Building Telephone: +44-141-330 5899
Dept of Physics and Astronomy Telefax: +44-141-330 5881
University of Glasgow
G12 8QQ, UK Web: http://ppewww.physics.gla.ac.uk/~doyle/
________________________________________________________________________
GridPP PMB Minutes 290 - 1st February 2008
==========================================
Face-to-face meeting at Glasgow.
Present: Tony Doyle, Roger Jones, Stephen Burke, David Britton, David Kelsey,
Steve Lloyd, Robin Middleton, John Gordon, Jeremy Coles, Glenn Patrick,
Andrew Sansum, Neil Geddes, Dave Colling, Suzanne Scott (Minutes)
By Phone: Sarah Pearce, Tony Cass (pm only)
In attendance: Trish Mullins (STFC)
Apologies: Peter Clarke
1. GridPP3 Board Compositions
==============================
DB had circulated documents and now invited comments. DK suggested adding
a Security Officer as a standing member to assist with security policy
advice. TD suggested that this was more an issue for the more
technically-focussed DTeam, and that a route already existed for
high-level inputs. SL agreed that there was a danger of technical issues
overtaking business and it was noted that DK already wore a security 'hat'
as representative on the PMB and, for coherence, should sit on the
Deployment Board in a similar capacity. DK noted that the Deployment Board
should ultimately sign-off on security issues, but that DTeam was much
more of an 'open' meeting where technical details could be debated. DB
advised that membership of DTeam needed to be more formalised and written
down; in particular, someone was needed to replace Steve Traylen from the
Tier-1. AS noted that this was likely to be Derek Ross. It was noted
that more Tier-1 input to the Deployment Board might be required, and possibly
a Technical Co-ordinator from Tier-1 should attend. DB asked about the
experiment representation and requested that individuals should be
nominated.
Actions: JC to write-down membership of DTeam. RJ, DC and GP to nominate
experiment user representatives for the Deployment Board.
Decision: It was agreed that DK would sit on the Deployment Board as
Security Policy liaison.
DB noted that the other Board under discussion was the Tier-1 Board.
This had been set up originally at the request of RAL as an
oversight-type requirement. DK noted that some kind of Board was
required. DB also noted that the Terms of Reference overlapped with some
other boards. TD advised that the current structure also resulted in some
reporting direct to Janet Seed (STFC). The Tier-1 Board meeting that had
been scheduled for the beginning of March had been cancelled due to the
timing. TD asked how much of the Tier-1 remit the Deployment Board could
incorporate?
Decision: It was agreed that the Tier-1 Board would be suspended for 6
months, and it would be seen to what extent the Deployment Board could
cover the remit.
Action: SL and DB to review the Tier-1 Board Terms of Reference and see
what could be formally incorporated into the new Deployment Board Terms of
Reference. AS and JG to discuss what could replace the Tier-1 Board at
RAL.
The discussion turned to STFC input. Should there be STFC representation
on the PMB, or at the F2F? DB asked whether Swindon wanted to be
represented on the Board, with Trish Mullins as member? observer? It was
agreed that TM would be 'in attendance' at Face-to-Face PMB meetings.
SL asked whether the Collaboration Board would remain unchanged? It was
felt that the experiments were well represented there and it should remain
as is. It was agreed that SP would be on the Collaboration Board as
Project Manager 'ex officio'. SP asked whether the CB could be arranged
to coincide with the Oversight Committee which would mean she could attend
in person - this was agreed.
Regarding the User Board, DB asked what worked best? GP noted that it was
really a 'user forum', and better identified when it was an 'experiments'
board'. The core was experiments' contacts but anyone could join. More
generally it was noted that all users should be encouraged to have access.
DB asked if the Terms of Reference were OK? GP advised yes, but there was
an overlap with DTeam and that should be clarified; the Terms were fine.
DB asked about the meeting schedule? GP noted every Quarter.
DB asked if there were any further points/questions? On the related issue
of roles, TD was noted as the new Technical Director and would iterate
with JC as this was a new role for inclusion in the PMB. The Project
Manager's role would be taken on by SP and SP/DB would need to iterate
this week. Regarding the role of Production Manager, JC needs to check
this. It was noted that everyone should look at their new roles and
advise DB of any changes required.
Regarding the LCG GDB and the MB - DB noted that although the GDB was part
of JC's role, being deputy should be included as part of JC's role. The
Project Leader is by default the MB representative but this needs to be
discussed with Ian Bird as the default is actually a Tier-1 representative
from each country.
Action: TD to contact Ian Bird and suggest that the MB UK membership
change as at 1st April 2008.
2. GridPP3 Reporting and Reporting Routes
==========================================
DB gave a presentation on reporting routes covering the following areas:
Tier-1 staff, Tier-2 staff, Technical staff and others.
Tier-1 staff
------------
Regarding the Quarterly Reporting - SP/AS to discuss; to be aligned with
the reporting which AS already carries out. This will contain the effort
figures reported against the service areas; a summary of service levels
achieved; and an update of milestones and metrics related to the project
map. SL noted that CPU and disk figures in the same format as those
provided by the Tier-2s would be useful - this would give unified
information. DB noted that in the Tier-1 Quarterly Report there should be
an appropriate set of financial numbers (details to be discussed).
Action: AS to provide CPU/Disk usage numbers in the Quarterly Report for
the Tier-1 as per the ones provided for Tier-2. AS/SP to iterate
regarding the financial summary in the Quarterly Reporting (eg: Outturn
figures).
Tier-2 staff
------------
It was agreed that the Quarterly Report should be compiled by the
Production Manager, to include:
- effort per Institute (for grant monitoring)
- CPU/disk availability (for MoU verification)
- milestones & metrics update (for Project Map updates)
There was a discussion on unfunded effort and reporting. It was agreed to
encourage institutes to continue to report "Unfunded" effort if they
wished.
Technical staff
---------------
TD as Technical Director is to compile the Quarterly Report containing the
following:
- effort figures (for grant monitoring)
- milestones & metrics (to update Project Map)
It was noted that GOC posts come under the Technical Director's brief in
relation to overview, but are also in the EGEE slot in the Project Map.
There was a discussion on the dove-tailing of GridPP2 - GridPP3 posts at
Manchester - DB to progress.
User-related staff
------------------
It was noted that the Portal post best fits under the overview of the
Chair of the User Board. DC noted that the other half post was related to
RTM - the UB gets 50% of this post. It was agreed that DC and GP would
iterate each quarter on the direction of this work. DB asked in relation
to the Technical Documentation 50% FTE post - what exactly will the post
address? SB advised that up until now, it related to user documentation.
GP noted that this would only be of use if it was closely linked to user
requirements, or else the post was redundant - many users don't know that
it exists at all. SL suggested that it was not a useful post - VO support
is what is needed and support for system managers - it was up to the
experiments to support users effectively. TD noted that the post needs to
be more visible - would a blog by SB help? SB advised that individuals
prefer dialogue/email, which results in an immediate response. SL
suggested that use cases need to be documented. JG advised that everyone should
know that the Grid Documentation page exists, and user support should be
able to respond appropriately and give links; however a documentation
person needs to ensure that these links actually exist, and should let all
the managers know where to access them. DB summarised that this issue
needs to be better defined - the post should be called 'User Support' and
should complement the portal post: The Portal Post provides technical help
to new users/VO's interfacing their proprietary software to the Grid
middleware; the Documentation Post supports users/VO's by leading them through
the process of getting engaged and pointing them at the right
documentation (and ensuring that documentation exists in an up-to-date
form).
Action: GP/SB/DC to define these Support posts and ensure they form a
comprehensive basis for user support (both documentation and Grid access
assistance), overseen by the UB Chair.
Regarding the new experiment posts, DB noted that they would naturally be
overseen by the UB chair but this did raise the question as to whether the
Tier-1 user experiment support posts also fell under this oversight. It
was felt important to try and make the experimental support posts more
visible outside the Tier-1. AS asked how the metrics would be monitored?
DB advised that these staff members should stay under the Tier-1 reporting
line for practical reasons but the User Board should monitor them. DB
noted that the danger was that nothing would be achieved, however AS
should provide a report to GP. AS advised that staff should also report
to the experiments. DB suggested that they be included on the User Board.
DC suggested that the experiments and the Tier-1 were already
well-aligned, and he did not see too much danger of lack of communication
or reporting.
Decision: DB summarised that the consensus was that the Tier-1 experiment
support posts should report via the Tier-1 but that GP as Chair of the
User Board should be aware of this effort.
Other staff
-----------
There was an outreach post at 50%FTE that would report direct to SP.
DB would complete the document and circulate it, thereafter it would be
posted on the website as a record.
It was noted that there were no other dissemination issues that required
discussion at present. TD, SL and RJ broke for the Group Leaders' phone
conference. DB proposed bringing forward the items on EGEE/EGI - RM and
JG to report.
3. EGI Workshop Report
=======================
RM reported that the workshop was in relation to EGI Work Package 3.
Around 30 people attended to discuss this issue. The workshop was a
brainstorming and consensus session driven by the deliverables schedule
for EGI which was due in March. RM advised that it was very much a
'closed' project and it was difficult to find out what was going on and
who exactly was involved - no list had been published of any international
collaborators. RM would circulate the Agenda for info. RM gave a
presentation on the various discussions. It was noted that Malcolm
Atkinson and NG had been invited to the meeting in March, but this clashed
with GridPP20 and MA would not be able to attend - also, NG might not be
able to attend and he advised that he had invited Andy Richards to go in
his stead. There was therefore a place available if anyone else wished to
go. RM reported that Work Package 2 related to use cases and there had
been discussions regarding middleware, ramp-up, and ways of handling the
projects and the facilities.
RM reported that Day Two had comprised the operations model (which needed
clarification) and more research members of the Grid were hoped-for;
services for VOs had been discussed and a sub-group had been convened to
progress issues. The workshop had addressed middleware issues, the
building and testing of systems, legal issues and MoUs. Types of
management had been addressed, also resource provisioning, VO entry
strategy, standardization, and policies.
Day Three had concentrated on industry take-up, application support (with
a possible taskforce being convened to progress this), training, outreach
and dissemination, security, and then Agenda and actions for the
forthcoming Rome meeting in March.
DB asked about cost and distribution of effort. RM said that not enough
information was available as yet. JG asked about the transition between
EGEE III and EGI, but again, RM noted that not enough information was
available. DB suggested that the workshop had not really been a useful
meeting but that if a 'strawman' existed by March then it could be helpful
to attend and GridPP should have an opinion - it would be good if JG could
go. JG would check with Malcolm Atkinson.
NG asked what the fallback plan was if EGI does not happen after EGEE III?
It was advised that GridPP would cover their commitments to the Particle
Physics Community as our dependency on EGEE is limited in this respect.
NG noted that he had provided a draft paper but would add information that
addressed the period beyond 2011 and re-circulate.
4. How ready are we for EGEE III?
=================================
JG reported on SA1 posts and staffing. He noted that assurances had been
given that things would go ahead, but more information might be
forthcoming in February. JG gave an overview of status as follows:
NA2   Dissemination                        Manchester, QMUL, Imperial
                                           (RTM funding shortfall)
NA3   Training                             Edinburgh (not yet appointed)
NA4   User Support                         Glasgow (MBL)
NA5   Policy/strategy/external relations   STFC
JRA1  GridSite Security                    Manchester
SA3   R-GMA                                STFC
JG then reported on SA1 staffing, work, and duties overview. Additional
personnel comprised Linda Cornwall (Vulnerability), Dave Kelsey
(Security), Dave Kant (APEL) and Andy Newton (GOCDB). SA1 personnel would
be based at Glasgow, Imperial, Manchester, Oxford, and STFC. The SA1
summary required more effort at the Tier-2s and stricter oversight of
effort overall. DB noted that the Tier-2s need to think about booking much
earlier than in EGEE. It was advised that the Tier-2 Co-ordinators would
assist with this. JG noted that JC and DTeam plus NGS effort would help
to deliver requirements.
RM noted that on 6-7 May there would be an EGEE transition meeting at CERN
relating to the transition of activities, if anyone was interested in
attending.
The meeting broke for lunch and re-convened at 1:00 pm.
5. GridPP3 Milestones & Metrics
================================
SP had circulated the Project Map. SP confirmed that she had iterated
with owners of some of the boxes: the experiments, SL, RM re NGI,
dissemination, Tier-1 and Ops with JC and AS. All metrics thus far had
been incorporated and older ones that were no longer relevant had been
deleted. SP advised that a few milestones were still unfinished. SP
suggested that there were too many metrics in some areas and we needed to
define more clearly what it was important to measure - this might become
easier after the dress rehearsal.
SP reported that the Tier-2 section was different in overview - it had
been restructured so that the top level had a box for each of the Tier-2s
- you could then drill-down to give metrics for each individual Tier-2.
SP asked whether the PMB wished performance to be at the top level, or did
they want the Tier-2 aggregated, or shown by different sites? DB advised
that the performance of the Tier-2 should reflect the management structure
and the devolution of responsibility - for example, red boxes caused by a
single bad site should not be visible here - rather, a green box that
shows that the Tier-2 as a whole is performing as defined by the MOU - it
needs to reflect the structure that has been set up. The PMB agreed.
SP raised the issue of metrics which SL suggested for the Tier-2s - should
SL's tests be included along with SAM availability? DC thought no, SAM
availability was important to include - utilisation could be measured
separately, therefore a combination of the SAM tests + utilisation would
show how the Tier-2 is performing. DB advised that at the top level the
boxes need to monitor the MoU which would provide parameters for the next
hardware allocation, but, SL's tests were useful too for additional
information. AS asked about response times - were they part of the MoU?
This was covered by No 11 on the list: 4.x11. DC thought that having 12
metrics quarterly was a lot. SP noted that as many as possible should be
automatic but she asked whether there were any here that were difficult to
collect? DC advised that each one seemed reasonable, but SL's tests
should remain unofficial. JC asked about response time of sites to
problems? This was dealt with by No 11. DC noted that No 11 was
difficult to quantify - a yes/no answer was not sufficient - a percentage
was needed. DB advised that re the ATLAS tests, ATLAS was the most
important experiment to some of the Tier-2s but for London, CMS might want
to have performance metrics. DC suggested that it was fairly well
balanced between ATLAS and CMS in London. SP noted that the number of
metrics in each experiment were looking at service levels etc; on the
Project Map these will not be aggregated. RJ advised that the experiments
were interested in sites, not aggregation, and that there existed a
tension between the MoU and the conditions of grant. DB advised that the
Project Map should monitor the Tier-2s.
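
A minimal sketch of how a Tier-2 level box might be derived from per-site
SAM availability, so that a single poor site does not by itself turn the
box red. The capacity weighting, the site figures and the target value are
all assumptions made for illustration; none of them was specified in the
discussion or taken from the MoU.

```python
def tier2_availability(sites):
    """Capacity-weighted availability across the sites of one Tier-2.

    `sites` maps site name -> (availability fraction, capacity weight).
    Weighting by capacity is an assumption, not an agreed definition.
    """
    total = sum(cap for _, cap in sites.values())
    return sum(avail * cap for avail, cap in sites.values()) / total


def box_colour(sites, mou_target):
    """Green if the Tier-2 as a whole meets the target, red otherwise."""
    return "green" if tier2_availability(sites) >= mou_target else "red"


# Illustrative values only: one poor, small site in an otherwise healthy Tier-2.
example = {
    "site1": (0.98, 500),
    "site2": (0.96, 300),
    "site3": (0.50, 50),
}
MOU_TARGET = 0.90  # hypothetical threshold, not taken from the MoU

print(box_colour(example, MOU_TARGET))
```

Per-site figures would still be available by drilling down, which is where
the experiments' interest in individual sites could be served.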
SP asked about milestones and measurements for EGI? RM advised that
GridPP were not doing this alone and that we needed to participate,
therefore meaningful milestones were difficult at present. TD noted that
this was a bigger issue for NGS, but that the European integration of
GridPP was OK. JG asked if it could be considered an internal target?
TD noted that using the EGEE infrastructure was a key metric and that a
higher-level box for SA1 and NA4 would be useful. DB agreed, noting one
box for each of the SA and NA areas, showing them failing or succeeding. RM
confirmed it was better to wait and see how things crystallise within
EGEE/EGI. DB noted that we needed something that measures our
involvement. RM said that for EGEE this was clearer but for EGI not so -
we would need to wait and revisit the Project Map in light of subsequent
events.
Action: SP would look at the EGI wiki, and NG would consider more inputs
relating to box 6.2.
SP then asked about a separate box for the GOC post? JG asked whether
this was within EGEE SA1? TD noted that it should be under Grid Services.
DB advised that the development aspect could go under EGEE. SP noted that
it would be easier to put it under EGEE as it was a GOC activity.
SP queried the LCG box - did we need one? If yes, content? TD noted a
Tier-1 overlap and that TC would have a view. SP assumed that we did not
want service delivery metrics in this box? DB advised that it should
measure our relationship with LCG, but any high-level LCG requirements
should go through the Tier-1.
Action: SP to iterate with TC and bring this issue back to the PMB.
It was agreed that DB/SP need to progress the details of the Map over the
next few months. DB noted that a cross-check would need to be done,
checking all elements are included in the Project Map, including strategic
priorities and staffing. This needed to be completed before the next
Oversight Committee. SP hoped that it would be complete by GridPP20.
6. GridPP3 Meetings and Travel Policy
======================================
RM presented an outline of retrospective travel costs and suggested a
review of the Travel Policy, with queries as to allocation of funds.
Trish Mullins confirmed that the pressure year would be next year, and
anything that can be done to save this year will help. TM noted that
GridPP will be given flexibility to carry forward any unspent travel.
DB noted that the travel budget would reduce by 100k to 188k + inflation in the
subsequent years. RM noted the issues as follows:
- continue support of non-GridPP (experiment) people?
- what level of support for conferences/workshops?
- what level of support to sysadmins and WLCG?
DB asked what we spent in GridPP2 for Collaboration Meetings etc - a
broad-brush comparison would be useful if approximate percentages could be
given? SL thought that we should not rule out any categories. DB noted
that how much we have spent would help the decision for next year - and
that at this stage, categories should not be ruled-out. DB advised that
if there are problems with STFC funding, GridPP would need to cut back and
this was a wider issue which would need to be revisited - and could not be
unequivocally decided just now. GP asked about travel for Ganga? TD
advised that that would be looked at as part of a larger discussion. SL
pointed out that there was no budget for dissemination consumables in
GridPP3 (posters, brochures, equipment hire etc) and these would also have
to come out of the travel budget.
DB concluded that there was no obvious need for a written change in the
travel policy at the moment, but he wished to understand the numbers
better and in more detail. If there were further cuts from STFC then
travel would form part of a larger discussion, but little flexibility
would be anticipated. JC asked for a concrete decision now regarding the
WLCG in April. TD noted that this was core business and could not be
compromised - the PMB agreed. It was advisable to book early to keep
costs down.
Action: RM to provide more detailed figures on travel expenditure -
broad-brush percentages would assist with decisions re travel in GridPP3.
SS was asked to hand-out travel forms at Dublin ('overseas' claim on web
would be submitted as 'actuals' and should be submitted before the end of
March 2008).
7. Tier-1 Review/progress towards recommendations
==================================================
AS submitted the following report:
AS reported that the reviewers had recommended actions in the following areas:
3.1 The CASTOR level of effort is appropriate for steady-state operation,
but given the current status, it needs to be monitored. Based on current
input, we do not believe that a long-term redistribution of manpower in
this area would lead to an optimum overall plan. In the short term, it is
recognised that dedicated effort is required for testing. This should be
regarded as transitionary. (Point-2.1)
Because of the current financial position within e-Science no decision can
be made regarding additional support for CASTOR in FY08. We expect our
CASE student to work full time on CASTOR until mid-summer and we have some
money agreed to be available for funding a contractor for most likely at
least 3 months in 2Q08. Beyond this the situation is uncertain.
DB asked if there was sufficient effort on CASTOR at present? In the short
term funding was available (within FY07). DB asked if everything possible
had been done to sort-out CASTOR? AS advised yes, the issues list was
much smaller and now general service was more of an issue, not CASTOR per
se.
3.2 Details of the operational arrangements to meet the 24x7 requirements
are required, detailing the people and the systems that will be
employed. (Point-2.2)
A plan and schedule had been drawn up before the review took place and
could have been made available if we had realized it was required. It is
available at: http://www.gridpp.ac.uk/wiki
In summary: Nagios will be used as the exception monitoring system; this
in turn will pass alarms to the Automate/SURE system which will (for now)
be used to generate the callout to bleeper. Critical systems and
responses have already been documented. We plan to run a small team of
first line on-call experts who will be able to resolve straightforward
problems and carry out hardware repair. Several second line on-call
experts will be available on any given day to provide cover for CASTOR,
Databases and Tier-1 infrastructure. These staff all guarantee to respond
within 2 hours. Most other Tier-1 staff have agreed to provide reasonable
efforts response and will handle alarms when they are able but do not
guarantee to do so.
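
The callout arrangement above could be sketched as a simple routing rule.
This is an illustration only: the class, field and function names are
hypothetical, it does not use any Nagios or Automate/SURE interface, and
the handling of non-critical alarms and the fallback case are assumptions
rather than part of the documented plan.

```python
from dataclasses import dataclass
from typing import Optional

# Areas with second-line cover, as listed in the report.
SECOND_LINE_AREAS = {"castor", "database", "infrastructure"}
GUARANTEED_RESPONSE_HOURS = 2  # response time guaranteed by on-call staff


@dataclass
class Alarm:
    service: str           # e.g. "castor"
    critical: bool         # is the system on the documented critical list?
    straightforward: bool  # resolvable by first line (e.g. hardware repair)?


def route_alarm(alarm: Alarm) -> Optional[str]:
    """Return who is called out for an alarm, or None if nobody is paged."""
    if not alarm.critical:
        # Assumption: non-critical alarms are left for best-efforts handling.
        return None
    if alarm.straightforward:
        return "first-line on-call"
    if alarm.service in SECOND_LINE_AREAS:
        return f"second-line on-call ({alarm.service})"
    # Assumption: anything else goes to first line, who escalate by hand.
    return "first-line on-call"


print(route_alarm(Alarm("castor", critical=True, straightforward=False)))
```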
3.3 The UK is seen as decoupled from the real user requirements. This
particularly affects the Tier-1. The GridPP PMB should address this by
reviewing the current requirements and providing the necessary
requirements summary. (Point-2.3)
A meeting with experiments was held on 14th December. Information was
collected from some experiments and the CCRC planning has exposed more.
3.4 The Tier-1 needs an overarching experiments coordinator to manage the
interactions with the experiments and the User Board. (Point-2.3)
We agree that this would be a big improvement on the current situation.
Success here depends on getting the right person for the role. We do not
plan to name a person for this role until the recruitment described in 3.7
below. This recommendation could be implemented in a matter of days once
3.7 is resolved.
3.5 Given the user requirements summary in 3.3, quantify and implement the
implications of the user requirements for UK Tier-1 services more fully.
(Point-2.3)
We plan to complete by April. An initial attempt was made by Andrew Sansum
in mid January to collate information based on the CCRC planning
information then available but major holes still existed at that time and
the experiments were asked to provide more detail. It is likely that
sufficient information is now available for another attempt.
3.6 User support (this distinguishes the Tier-1 from the Tier-2s). The
Service Delivery Plan should address potential detachment issues. (Points
2.3 and 2.7)
The recruitment of additional staff part funded by the experiments and
part funded by the Tier-1 should make a big difference in this area. The
designation of an experiments coordinator will also help. The service
delivery plan will address this (but does not do so yet in its current
draft).
3.7 The 1.5FTEs of GridPP-funded experiment support is not optimally
deployed. Further effort is envisaged and a review of overall
effectiveness in this area should be undertaken in conjunction with the
experiments. It is important to recruit the right person/people with the
right experimental background. (Point-2.3)
Recruitment is held up by the current STFC financial situation. PPD (who
are running this recruitment) do not expect to be in a position to move
this forward for perhaps 3 months while details of the program are
finalized.
3.8 A dedicated formal weekly management meeting is suggested. Regular
weekly meetings should be reduced in order to allow for a more formal
weekly Tier-1 Project management meeting that provides coordinated input
to the Deployment Team. This single body should manage the week-to-week
problems and review progress on the Service Delivery Plan. (Points-2.4,
2.6 and 2.7)
This meeting is presently scheduled for 2pm on Tuesdays. The first meeting
has been held. Membership is: Neil Geddes (Dept Head), John Gordon (Div
Head), Andrew Sansum (Tier-1 Manager), David Corney (Castor GL), David
Kelsey (Finance), Jeremy Coles (Production Manager), Robin Tasker (Network
GL), Gordon Brown (Database GL), Martin Bly (Fabric Manager), Bonny Strong
(Castor Service Manager), Matt Hodges (Resource Manager).
A formal agenda is in the process of being drafted, but the meeting will
review the changing experiment and project requirements, maintain the
Service Delivery Plan (with its milestones and metrics), review staffing and
finance, and set work priorities. Incidents/Operational problems will be
reviewed in detail elsewhere but this body will ensure they are being
prioritized/handled/progressed satisfactorily.
3.9 The operational responses on the questionnaire were not developed
sufficiently in areas of strategic importance. For example, network
contention and data rates within the Tier-1 should be fully understood and
not left until they become a problem. (Point-2.4)
We have commenced work on a full I/O model of the Tier-1 which will
simulate hardware performance and I/O rates into individual CASTOR disk
pools. The model will include load from both the wide area network and
local batch farm. It has already been prototyped in Python and will be
implemented in C++. Inputs will be the experiment requirements as defined
in 3.3 and measured hardware performance (in the process of profiling
disks and tape drives). We plan to be in a position to use the results
of the model by May, although work is expected to continue on it through
until August.
If it works satisfactorily, the model will allow us to quantify I/O rates
to different disk pools and parts of the batch farm. This in turn should
allow us to validate planned configuration changes and new use cases and
even help with planned hardware purchases.
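
As a rough indication of the kind of calculation such a model performs, a
toy sketch follows. It is not the Tier-1 prototype: the pool names, rates
and the simple demand/capacity ratio are assumptions chosen for
illustration, and a real model would also simulate hardware behaviour over
time.

```python
from dataclasses import dataclass


@dataclass
class DiskPool:
    name: str
    capacity_mb_s: float  # sustainable I/O rate of the pool's hardware
    wan_mb_s: float       # demand from the wide area network
    batch_mb_s: float     # demand from the local batch farm

    @property
    def demand_mb_s(self) -> float:
        return self.wan_mb_s + self.batch_mb_s

    @property
    def utilisation(self) -> float:
        return self.demand_mb_s / self.capacity_mb_s


def overloaded(pools, threshold=1.0):
    """Names of pools whose combined demand exceeds threshold * capacity."""
    return [p.name for p in pools if p.utilisation > threshold]


# Illustrative numbers only; not measurements from the Tier-1.
pools = [
    DiskPool("poolA", capacity_mb_s=500.0, wan_mb_s=150.0, batch_mb_s=100.0),
    DiskPool("poolB", capacity_mb_s=200.0, wan_mb_s=120.0, batch_mb_s=130.0),
]
for p in pools:
    print(f"{p.name}: demand {p.demand_mb_s:.0f} MB/s, "
          f"utilisation {p.utilisation:.2f}")
print("overloaded:", overloaded(pools))
```

Changing the demand figures for a proposed configuration and re-running the
check is the sense in which a model of this kind could help validate
planned changes.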
3.10 Network planning needs to be quantified: the planning and management
is evident in sub-areas, but the overall planning is not clear. We
recommend a networking task force be set up prior to data-taking c.f.
CASTOR team approach as part of an overall management plan. (Point-2.5).
By April we will hold a planning meeting to review experiment bandwidth
requirements to be used as input to the architectural and capacity
planning of the Tier-1 and site network.
We have considered how we could set up a Networking Task Force to address
performance issues as they are identified on the RAL network or external
links. However, funded effort and staff expertise in this area is limited.
We recently invested money in training a member of the Tier-1 team in
network performance tuning; unfortunately, they have subsequently resigned.
We plan to work to raise the overall expertise of various members of staff
in this area (both Networking Group and e-Science) as it is clear that the
current level of expertise is too low. Richard Hughes-Jones (now at Dante)
has agreed to come to RAL for a couple of days to provide training and
work on any current problems. We expect that with the deployment of
PerfSonar on the OPN, Dante will become more involved in end to end network
performance issues in that area.
3.11 There is too much emphasis on internal firefighting of issues that do
not address the high-level user requirements. The Tier-1 management team
(AS, JG and NG) should address this together with the Production Manager
(JC). (Point-2.5)
The service delivery plan will ensure high-level requirements are
prioritized. The Production Manager will attend the weekly management
meeting.
3.12 An analysis strategy needs to be developed at the Tier-1 that meets
the requirements of LHCb (since LHCb-UK relies on this service).
(Point-2.5 and 2.3).
Discussions have commenced between Andrew Sansum, Raja Nandakumar and Nick
Brook. A face to face meeting will happen in March.
3.13 The Tier-1 and the Production Manager should investigate ways to
strengthen the integration of the Tier-1 with the wider deployment team.
Greater involvement is required at this level. A dedicated meeting is
suggested. This needs to be addressed as a first step towards a Tier-1
Service Delivery Plan. (Points-2.6 and 2.7).
Meetings and discussions have taken place between the Tier-1 manager and
the Production Manager.
UK site admins were asked where they consider more Tier-1 information
sharing would be useful. The general feeling was that it is difficult to
suggest anything specific as little was known about the internal
functioning and work activities of the Tier-1. In the first instance site
admins suggested that exposing some sort of Tier-1 diary or blog would be
useful and this prompted the idea of a bulletin. Increasing awareness of
the public Tier-1 actions page would also help in this area. Finally,
having a standing item to talk about current Tier-1 work at the UKI
monthly operations meeting (and dteam) would also improve transparency.
It was hoped that introducing standing items into current scheduled
meetings will meet this need as there is already a general feeling within
the dteam that there are too many meetings.
By the end of February, the Tier-1 Grid team will trial a blog where key
items of interest are identified each week for publication on the blog.
The management meeting may also identify items (to be discussed at the
next meeting). These will ideally be pointers to existing updated
material but may also include short standalone items. If the dteam finds
this useful, we will expand the blog to other parts of the Tier-1.
A draft of the service delivery plan will be made available to the dteam
once there is sufficient content to enable the dteam to consider and
suggest modifications.
3.14 The UK deployment team needs to be informed of the Tier-1 deployment
requirements. (Point-2.6)
The Production Manager will attend the weekly management meeting. A
regular slot will be available at the dteam meeting where Tier-1
deployment requirements and information of interest can be discussed.
3.15 Quantification is required (at back of the envelope level) prior to
defining the Service Delivery Plan. (Point-2.7)
The Service Delivery Plan will take account of quantitative requirements
as far as possible. However, not all of these will be available by the time
the preliminary draft of the Service Delivery plan is scheduled to be
completed (end of February).
3.16 RAL runs important UK-wide services e.g. the BDII and RBs. Improved
procedures for monitoring, announcing down-time and reporting problems
should be established as part of the Service Delivery Plan. (Point-2.7).
There is work ongoing at CERN to allow SAM tests to be propagated to
remote sites via a push from Nagios to our own internal monitoring system.
We have asked to be involved in early testing of this when it becomes
available. This should improve our response time for critical alarms.
There is clearly a problem with the current process of announcing
downtimes. The Tier-1 has a standard process to ensure that downtimes are
registered properly in the GOCDB and broadcast via EGEE. Unfortunately UK
production staff find themselves swamped by announcements from the many
sites that they use and do not necessarily notice announcements. The Tier-1
will raise the issue with the dteam, but preliminary discussions with
various people indicate that a further mailing list will be unhelpful. A
Tier-1 dashboard will help but it may be that better tools are required
from EGEE to allow better visualization of scheduled/unscheduled
downtimes.
3.17 Improved procedures for documenting and replicating the above
services as part of a resilient Grid should be established via the
Production Manager and together with the deployment team. (Point-2.7).
The Tier-1 Manager and Production Manager have discussed this but a
schedule/plan has not yet been agreed. We have also started to explore the
possibility of locating some critical stateful services at Daresbury
Laboratory and managing them remotely if required. This has the advantage
that IP addresses/names are managed by the same set of DNS servers for
both sites.
3.18 All current Tier-1 metrics should be reviewed with an emphasis on
services and user-impact, incorporating input from the GridPP Project
Manager. (Point-2.9)
A preliminary set of metrics has been incorporated into the draft Service
Delivery Plan. These have included those proposed by the GridPP Project
Manager. Metrics will be established not only to measure the quality of
service delivered to the users but also to ensure that internal systems
and processes are working well.
3.19 STFC Networking and the Tier-1 should jointly plan to ensure that the
future site network connectivity and performance aim to meet the Tier-1
networking requirements in a timely fashion. (Point-2.10)
A number of changes will take place to the way STFC and the Tier-1 manage
the site network. A proposal has been made by Networking Group to set up a
Technical Design Authority which will be responsible for establishing and
maintaining the overall technical architecture of the RAL core network. It
will have representation from the internal and external science
communities and will contain external network experts to provide peer
review of its program. The Tier-1 will wish to provide input of
requirements to this authority.
A RAL Network User Group will be set up which will provide accountability
and review of day to day operation of the site network. Representation on
this body is not yet agreed but e-Science would expect to be represented.
The twice monthly meeting between Tier-1 Network experts and Networking
Group will continue to consider implementation of planned changes to the
network.
Robin Tasker will attend the weekly Tier-1 management meeting.
3.20 We recommend the development and dissemination of a Tier-1 dashboard
that presents a complete view of the monitoring for the centre.
(Point-2.11)
We will develop a dashboard, although WLCG is also planning work in this
area to present Tier-1 Dashboards.
DB noted that good progress was being made in some areas, but that this
would need to be revisited at Dublin, though possibly the new Deployment
Board would be the right forum.
8. Disaster Recovery Plan
==========================
JC had circulated a template and had received comments from the last PMB,
and had now mapped the template to AS's documents. JC's report provided
an associated reference and a description of the disaster/failure, with a
bulleted list of services and areas affected. Issues to be considered
included: classification of the severity/impact (to help with
prioritisation); an action plan; an individual to be allocated as 'lead'
with a Deputy; contacts and timescales. For each failure a one-page
summary sheet was to be completed. AS advised that a lot of the
'disasters' experienced were standard things which happened, and that the
communication list would be long. TD agreed, noting that the report
should be short and readable. JC advised that this sheet would document
the disaster; the actual plan would be elsewhere. DB asked how many
disaster templates would need to be completed. AS thought that they would
number in the hundreds. JC noted that it depended on the level of the
disaster itself. JG noted that it would not be maintainable if it
included every smaller issue. DB agreed, noting that it would not
converge. AS suggested that it needed to be deliverable, maintainable,
and useful.
Action: TD noted that AS/JC should iterate and remove capturable items
that were considered to be minor.
AS noted that for number 7, response co-ordination, there were various
stages to this. TD noted that someone needed to be defined as 'lead' and
someone as 'deputy'. AS suggested this be the project's decision - and it
would depend on the level of the crisis. TD noted the following criteria:
- category of disaster
- who to inform
- who is in charge
The document needed to be fairly high-level. TD and DB noted that the
template looked good but it was not to be issued in the hundreds. RJ
noted that for experiments, such a 'disaster' would be loss of data. AS
noted that for the Tier-1, it would be loss of CASTOR metadata. It was
agreed to go with the template as provided by JC and see how it goes when
it is actually used - then the issue can be revisited if necessary. JC to
progress this.
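
For illustration only, the one-page summary sheet discussed above could be
sketched as a small Python record. This is a hypothetical layout, not JC's
actual template: the field names, the three severity labels and the
worth_recording helper are all assumptions, chosen to mirror the items
listed in the discussion (reference, description, services affected,
severity, lead and deputy, contacts, timescales).

```python
from dataclasses import dataclass, field
from typing import List

# Assumed severity categories; the minutes only say a classification is needed.
SEVERITIES = ("minor", "major", "critical")


@dataclass
class DisasterSummary:
    """Hypothetical one-page summary sheet for a single disaster/failure."""
    reference: str                 # associated reference
    description: str               # description of the disaster/failure
    services_affected: List[str]   # services and areas affected
    severity: str                  # category of disaster
    lead: str                      # who is in charge
    deputy: str                    # stands in for the lead
    contacts: List[str] = field(default_factory=list)  # who to inform
    timescale: str = ""            # expected timescale, free text

    def __post_init__(self) -> None:
        if self.severity not in SEVERITIES:
            raise ValueError(f"unknown severity: {self.severity!r}")

    def render(self) -> str:
        """Return the sheet as short, readable plain text."""
        lines = [
            f"Reference: {self.reference}",
            f"Severity:  {self.severity}",
            f"Failure:   {self.description}",
            f"Lead:      {self.lead} (deputy: {self.deputy})",
            "Services and areas affected:",
        ]
        lines += [f"  - {name}" for name in self.services_affected]
        lines.append("Who to inform: " + ", ".join(self.contacts))
        if self.timescale:
            lines.append(f"Timescale: {self.timescale}")
        return "\n".join(lines)


def worth_recording(sheets: List[DisasterSummary]) -> List[DisasterSummary]:
    """Drop sheets classed as minor, so the set of sheets stays maintainable."""
    return [sheet for sheet in sheets if sheet.severity != "minor"]
```

Filtering on severity is one possible way to keep the number of sheets well
below the hundreds; where the cut-off sits would be for AS and JC to decide.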
9. AOCB
========
RJ initiated a discussion on job slots on the Tier-1. AS would get back
to RJ, and it was noted that an occasional forum was required. It was
understood that weekly CASTOR meetings took up a lot of time and might
need to be reconfigured.
ACTIONS AS AT 01.02.08
======================
272.4 AS to check the current Tier-1 disaster recovery plan and circulate
the existing version to the PMB. It was reported that this document does
not exist, but it was planned to have one in the longer term. TD would
incorporate in v0.4 anything that AS considered relevant. AS will check
and advise additions.
277.2 DN to provide an update and re-evaluation of CMS/CASTOR
deliverables. TD advised that there was a CMS/CASTOR document on
deliverables which should be revised in light of the December '07 tests.
DC to take the token for this now and iterate with DN.
277.5 Disaster Recovery 'Team B': SB, JC, TD, SP, DB to analyse the wider
issues of disaster planning, mapped to the experiments' lists, and this
work would include Project Management. A Recovery Plan was required. It
was agreed that JC was in charge of this and the experiment input relating
to subsets of the disaster plan. SB/JC to progress.
277.8 User Experience 'Team C': SB, SP, SL, with input from JC to deal
with the issue of user experience and design of an easily-found lookup
facility for grid error messages. SL reported that he had started the
ATLAS wiki page and would circulate the url. SB was leading this with
inputs from SP, SL and JC where needed. A new simple summary was required
of all areas available plus a lookup/links facility, for the OC to review.
This would include a list of most recent types of problems (possibly a
'top 12' for users - what the error means and the course of action to
follow). SB to progress this.
280.7 JC to mention the issues (when approached by a VO with regard to
joining) of the 'standard' 6-month introduction period, following which
the VO must set-up something specific to them, if appropriate. This was
discussed at DTeam. JC to email GridPP VO members if possible - ongoing.
This was a standing action - JC had discussed it with the Tier-2
Co-ordinators in relation to VO members. JC to send email.
280.8 JG to investigate the UKI ROC website - any change/progress, and
report-back. SB to iterate with JG in order to sign-off this item next
week. Ongoing.
282.2 SP to progress the Project Map using the T1 service areas and input
from the meeting.
282.6 JC and SB to progress existing 'disaster planning' template for next
F2F meeting on 1st Feb. Involve experiments as necessary. This was a
follow-up from the last F2F, and was to be distinguished from action 277.5,
which is a longer-term one relating to the OC.
289.1 AS to provide an analysis of the ATLAS disk server failures on the
RAID controller.
289.2 DC to check current situation regarding gLite WMS and SL4 - current
status to be conveyed to DTeam.
289.3 JC to check the VOMS/-skipcacheck issue (in relation to UK CA
certificate change) with Jens Jensen and raise the issue at an Operations
meeting.
289.4 SP to speak to the KT person at STFC who assisted with the PIPSS
case, to help with the post-competitive phase (in relation to EGEE only
providing support to pre-competitive startup). SP to involve NG.
290.1 JC to write-down membership of DTeam.
290.2 RJ, DC and GP to nominate experiment user representatives for the
Deployment Board.
290.3 SL and DB to review the Tier-1 Board Terms of Reference and see what
could be formally incorporated into the new Deployment Board Terms of
Reference.
290.4 AS and JG to iterate regarding what could replace the Tier-1 Board.
290.5 All: to check their individual roles as outlined and advise DB of
any required changes.
290.6 TD to contact Ian Bird and suggest that the GDB and MB UK membership
change as of 1st April 2008.
290.7 AS to provide numbers in the Quarterly Report for the Tier-1 as per
the ones provided for Tier-2.
290.8 AS/SP to iterate regarding the financial summary in the Quarterly
Reporting (eg: Outturn figures).
290.9 Quarterly Report for Tier-2 staff to be compiled by the Production
Manager.
290.10 TD as Technical Director to provide a report showing effort
figures; milestones & metrics; and a table of posts showing Technical
Support.
290.11 DB to progress the situation at Manchester.
290.12 GP/SB/DC to define these Support posts and ensure they form a
comprehensive basis for user support (both documentation and Grid access
assistance), overseen by the UB Chair.
290.13 DB to complete the document re Reporting and Reporting Routes
relating to staff, and circulate it, thereafter it would be posted on the
website as a record.
290.14 RM to circulate the EGI Workshop Agenda.
290.15 JG to check with Malcolm Atkinson re attending the next EGI
workshop in Rome (March).
290.16 NG noted that he had provided a draft paper relating to the end of
EGEE III but would add information that addressed the period beyond 2011
and re-circulate.
290.17 Re the Project Map, SP would look at the EGI wiki, and NG would
consider more inputs relating to box 6.2.
290.18 Regarding the LCG box on the Project Map, SP to iterate with TC and
bring this issue back to the PMB.
290.19 DB/SP to progress the details of the Project Map over the next few
months, cross-checking that all elements are incorporated, including
strategic priorities and staffing. To be completed before the next
Oversight Committee.
290.20 RM to provide more detailed figures on travel expenditure -
broad-brush percentages would assist with decisions re travel in GridPP3.
290.21 SS to hand-out travel forms at Dublin ('overseas' claim on web to
be submitted as 'actuals' and should be submitted before the end of March
2008).
290.22 AS to get back to RJ regarding job slots at the Tier-1.
290.23 AS/JC to iterate on the Disaster Recovery template and remove
capturable items that were considered to be minor.
290.24 JC to progress his suggested template to use when a crisis occurs -
to be revisited subsequently at a PMB.
INACTIVE CATEGORY
=================
271.1 PMB to examine the issue of fibre breakage and outages, CERN-RAL OPN
link, in one year's time, when actual data on breakages is available.
Due date would be September '08.
271.3 Re CERN-RAL OPN link breakage and backup generally, PC to oversee
the issue and collate info so that the PMB have something to revisit in
one year's time. Due date September '08. It was noted that PC would
circulate a revised document after discussion with ATLAS (RJ/PC/DN to
iterate).
282.8 RM to monitor how R-GMA and networking issues impact on GridPP as
matters progress. RM advised that this item should be moved to the
'inactive' category as it will develop over the coming months. RM
discussed the issue with Steve Fisher and advised that support of R-GMA is
required whilst APEL is dependent on it. RM reported that he has spoken
to SF and there is currently no change to the R-GMA situation - process
ongoing.
There was no other business, and the meeting closed at 3:45 pm. The next
PMB would be at 1:00 pm on Monday 11 February.