Dear All,
Please find attached the GridPP Project Management Board Meeting minutes
for the 407th meeting.
The latest minutes can be found each week in:
http://www.gridpp.ac.uk/php/pmb/minutes.php?latest
as well as being listed with other minutes at:
http://www.gridpp.ac.uk/php/pmb/minutes.php
Cheers, Dave.
--
________________________________________________________________________
Prof. David Britton GridPP Project Leader
Rm 480, Kelvin Building Telephone: +44 141 330 5454
School of Physics and Astronomy Telefax: +44-141-330 5881
University of Glasgow EMail: [log in to unmask]
G12 8QQ, UK
________________________________________________________________________
GridPP PMB Minutes 407 (22.11.10)
=================================
Present: Dave Britton (Chair), Sarah Pearce, Andrew Sansum, Steve Lloyd, Tony Cass, Robin
Middleton, John Gordon, Tony Doyle, Dave Kelsey, (Suzanne Scott - Minutes)
Apologies: Roger Jones, Jeremy Coles, Pete Clarke, Glenn Patrick, Dave Colling, Neil Geddes
1. Request for funding by Oxford
=================================
A proposal had been received for a Dell machine at Oxford to run three dedicated services (SE,
WMS and myProxy) in order to separate the Nagios tests from the production infrastructure. DB
asked whether it had been decided that this was the way to go. JC was not present. AS advised that it
didn't look like a conclusion had been reached at the dTeam. DB confirmed he had received an
email from JC, but the end conclusion was to use the Tier-1 storage for SRM tests and use Oxford
as a failover, which contradicted Oxford hosting the SE. He wasn't sure whether it had been concluded
from a technical point of view. It was agreed that we needed to ask the dTeam to decide
technically if this should be done or not. If they did approve this, were the PMB prepared to
approve the purchase? The cost was £3,300 plus VAT. SL noted it was a large machine for just 10
jobs per hour. DK asked whether this had to do with fault tolerance. TD advised that the dTeam should
confirm/agree that it was sensible to run these services at Oxford, and also that the architecture
proposed was sensible.
ACTION
407.1 Re Oxford's funding request for a Dell to run three dedicated services: DB to reply to Pete
Gronbech and JC that in principle this was ok, but that the dTeam should confirm that it was the
correct solution.
2. Future Project Manager
==========================
DB reported that he had received an email from SP, who would be leaving GridPP at the end of
January after 7 years. DB congratulated SP on her news, noted that GridPP owed her a debt of
gratitude, and made the following statement:
Sarah has informed me that she has just been appointed Deputy Chief in the Astronomy and Space
Science division of Australia's Commonwealth Scientific and Industrial Research Organisation,
CSIRO, and will be leaving GridPP at the end of January. First of all, I hope you will join me both in
congratulating Sarah on her appointment, and in
thanking her for her enormous contributions to GridPP over the years. Not only has Sarah largely
defined the successful dissemination agenda within GridPP but has also diligently shepherded the
project management of GridPP3. We owe her a large vote of thanks and I wish her the best of luck
in transferring the skills honed within GridPP to her new project, which I believe includes a
special responsibility for the Square Kilometre Array, the largest Australian science project for the
next ten years.
DB noted that action in GridPP was now required - this came under 'loss of key personnel' in the
Risk Register. For the next 4-6 months we needed to 'close' GridPP to both the Oversight
Committee and STFC, and this would include documenting expenditure, effort, metrics, milestones
etc. We also needed to refine the GridPP4 project in terms of strategy etc. Being realistic, DB
noted that SP could complete GridPP3 to January 2011 in terms of budget/effort/milestones etc
but we needed to evolve a new Project Manager for GridPP4. DB advised that we couldn't appoint
someone quickly enough due to advertising requirements and Christmas timing, and there might
be no overlap with SP at all. DB had discussed this with SL and it would be possible to identify
someone temporarily to work in parallel with SP for 4-6 months. The appointment could be
funded from the existing QMUL GridPP3 grant, therefore there would be no net cost to the Project
overall. Someone would be required to act as a bridge between SP and the new Project Manager
come the time. Could we identify a possible person for this role? DK agreed that it sounded like a
good idea, if it were achievable. DB noted it could be an academic who was not teaching this
period; or someone else involved in the Project. DB asked the PMB to consider this idea and send
any suggestions to him. This would probably have to be taken to the CB fairly quickly.
ACTION
407.2 ALL: to consider a possible internal candidate for temporary role of Project Manager whilst
SP worked her notice - suggestions to be sent to DB as soon as possible.
3. Status of OC preparation
============================
a) Intro and overview - DB was working on this
b) Project Map - SP noted the Project Map was complete, she had a few questions outstanding, but
it would probably be available by Wednesday.
c) LHC/wLCG status - (JG) this was pending.
d) EU-update - (NG) DB had received this, but it needed work.
e) Tier-1 status - AS noted he had three-quarters of this written, it would be ready by tomorrow.
f) Deployment status - a document had been received from JC, and SL had sent comments;
however, it needed work, and the table was no longer accurate because it was several months old.
g) Users - (GP) - no info as to status was available.
h) Impact - SP advised that Neasan O'Neill had been at SuperComputing but something would be
available soon.
DB advised that contributions were required within the next day or two - he would follow up with
others as to status.
STANDING ITEMS
==============
SI-1 Tier-1 Manager's Report
-----------------------------
AS reported as follows:
Fabric:
1) FY10 procurements
- Disk tender - orders placed. Delivery scheduled this week and next.
- CPU tender - orders placed. Delivery has commenced for one supplier, 2nd supplier scheduled to
deliver in December.
- Various small system purchases nearly complete.
- Tape drive and media purchase still outstanding, waiting for hardware availability.
2) Robotics
An intervention was made on the tape robot on 2nd to address an overheating problem.
Unfortunately this was only partially successful and a further intervention will take place Tuesday
(23rd November) to resolve the problem.
3) A review of disk server failures on one tranche of 2008 hardware indicates a range of
unexpected failure modes. Three of these have led to data loss this year but most other single
drive failures have led to anomalous behaviour. Typically the drive failure either triggers a server
crash (unscheduled server outage) or leads to an emergency scheduled intervention. Our
investigations are continuing and we are also looking at a range of possible responses. We have a
RAID controller firmware update likely to be scheduled this week and are also considering other
changes to give better protection to the filesystem in the event of a crash. However we may
seriously have to consider temporarily taking this capacity out of production until the problem is
resolved.
Service:
1) Summary of operational issues is at:
https://www.gridpp.ac.uk/wiki/Tier1_Operations_Report_2010-11-17
2) CASTOR
- CMS upgrade to 2.1.9 was completed. The upgrade went well but we are still investigating
load-related transfer failures which were originally believed to be due to peak load after startup. A
long-standing bug was identified which limits our ability to throttle disk-to-disk transfers, which
were then causing remote file transfers to time out. A change was made to address this problem,
but we have been surprised to see transfer failures continue over the weekend.
- ATLAS upgrade: Monday to Wednesday, 6-8 December (subject to review next Monday).
- An upgrade to the ATLAS SRMs is scheduled for this week (TBC) to address load-related issues
around the high rate of FTS status requests.
SI-2 ATLAS weekly review & plans
---------------------------------
RJ was absent.
SI-3 CMS weekly review & plans
-------------------------------
DC was absent.
SI-4 LHCb weekly review & plans
--------------------------------
GP was absent.
SI-5 Production Manager's Report
---------------------------------
JC was absent.
SI-6 LCG Management Board Report
---------------------------------
DB noted that the next meeting was tomorrow.
SI-7 Dissemination Report
--------------------------
SP reported that Neasan O'Neill had been at SuperComputing. A survey for the Digital Curation
Centre was now out and needed to be completed: an email had been circulated and, as we had
pledged to be involved in this, everyone was asked to respond to the questionnaire.
AOB
---
AS reported that at the EGI Advisory meeting last week there had been an item about local root
end points - there was to be a 7-day cap or sites would be closed down. DK advised that they
would be discussing this issue at the HEPSYSMAN meeting today. DB would follow this up.
REVIEW OF ACTIONS
=================
384.6 TD/JC to take the lead on the 'GridPP to NGI' document that addresses the forward-moving
technical and other issues from a GridPP perspective. JC was gathering info. It was noted that the
recipient was likely to be Dave Wallom. Deadline of late November for discussion. This should be
on the F2F Agenda for 9th December meeting. Draft will be available soon for comment. Ongoing.
398.6 DC to provide updated LondonGrid MoU. DC reported that the meeting had happened, the
LondonGrid MoU had been discussed, DC would incorporate comments. Ongoing.
398.7 DK to check that all is up-to-date in terms of GridPP Security Policies - email DB. If there
are any issues, DK to let DB know. DK reported that the GridPP Security Policy phase was ongoing
at present, however other policies had been approved by LCG. DK advised that EGI formal signoff
was awaited, then the GridPP pages would be updated. Ongoing.
398.10 RJ/Graeme Stewart to provide URLs of the place(s) where info is located re ATLAS site tests
and measurements (so that sites understand what they're being measured on). In progress.
398.13 DB to consider how to evolve the User Board into a useful meeting in the future, DB to
initiate in the timeframe between now and GridPP4. This should be on the F2F Agenda for 9th
December meeting. Done, item closed.
400.4 SL to co-ordinate changing the current GridPP MoU towards an MoU for GridPP4. Ongoing.
403.2 RJ to broadcast the move to ATLAS adaptive data placement at RAL, specifically for PD2P
only, via ATLAS and GridPP standard channels. Ongoing.
404.2 SP to circulate requirements relating to the OC meeting, for discussion at the PMB on 15th
November. Done, item closed.
406.1 SP to speak to work package area leaders to see if specific metrics can be devised for an
intermediate view in relation to monthly/quarterly reporting in GridPP4. Ongoing.
406.2 RJ/DC/GP to provide AS with confirmation that they are happy or otherwise with the Tier-1
Disaster Management document - if not, they should provide any detail from their point of view.
Ongoing.
406.3 JG to ask about services in EGI, and Service Level Agreements, and report back. [This was in
the context of the status of EGI/NGI Service Level Agreements, and GridPP agreeing with the level
of service provided, ensuring that it is as required]. Done, item closed.
406.4 DC to report back to the PMB re the status of CMS sites in London. Ongoing.
ACTIONS AS AT 22.11.10
======================
384.6 TD/JC to take the lead on the 'GridPP to NGI' document that addresses the forward-moving
technical and other issues from a GridPP perspective. JC was gathering info. It was noted that the
recipient was likely to be Dave Wallom. Deadline of late November for discussion. This should be
on the F2F Agenda for 9th December meeting. Draft will be available soon for comment.
398.6 DC to provide updated LondonGrid MoU. DC reported that the meeting had happened, the
LondonGrid MoU had been discussed, DC would incorporate comments.
398.7 DK to check that all is up-to-date in terms of GridPP Security Policies - email DB. If there
are any issues, DK to let DB know. DK reported that the GridPP Security Policy phase was ongoing
at present, however other policies had been approved by LCG. DK advised that EGI formal signoff
was awaited, then the GridPP pages would be updated.
398.10 RJ/Graeme Stewart to provide URLs of the place(s) where info is located re ATLAS site tests
and measurements (so that sites understand what they're being measured on).
400.4 SL to co-ordinate changing the current GridPP MoU towards an MoU for GridPP4.
403.2 RJ to broadcast the move to ATLAS adaptive data placement at RAL, specifically for PD2P
only, via ATLAS and GridPP standard channels.
406.1 SP to speak to work package area leaders to see if specific metrics can be devised for an
intermediate view in relation to monthly/quarterly reporting in GridPP4.
406.2 RJ/DC/GP to provide AS with confirmation that they are happy or otherwise with the Tier-1
Disaster Management document - if not, they should provide any detail from their point of view.
406.4 DC to report back to the PMB re the status of CMS sites in London.
407.1 Re Oxford's funding request for a Dell to run three dedicated services: DB to reply to Pete
Gronbech and JC that in principle this was ok, but that the dTeam should confirm that it was the
correct solution.
407.2 ALL: to consider a possible internal candidate for temporary role of Project Manager whilst
SP worked her notice - suggestions to be sent to DB as soon as possible.
DB noted that next Monday's PMB would focus on the OC papers. There would be NO meeting on
6th December because of the F2F on 9th at QMUL. There would also be NO meeting on 13th
December.