Dear All,
Please find attached the GridPP Project Management Board Meeting minutes
for the 405th meeting.
The latest minutes can be found each week in:
http://www.gridpp.ac.uk/php/pmb/minutes.php?latest
as well as being listed with other minutes at:
http://www.gridpp.ac.uk/php/pmb/minutes.php
Cheers, Dave.
--
________________________________________________________________________
Prof. David Britton GridPP Project Leader
Rm 480, Kelvin Building Telephone: +44 141 330 5454
School of Physics and Astronomy Telefax: +44-141-330 5881
University of Glasgow
G12 8QQ, UK
________________________________________________________________________
GridPP PMB Minutes 405 (08.11.10)
=================================
Present: Dave Britton (Chair and minutes), Sarah Pearce, Tony Doyle, Jeremy Coles, Andrew
Sansum, Steve Lloyd, Roger Jones, Glenn Patrick, John Gordon, Dave Colling.
Apologies: Dave Kelsey, Pete Clarke, Neil Geddes, Tony Cass, Robin Middleton
1. GridPP26
============
DB reported that a provisional booking had been made at a hotel in Hove for Mon 28th March to
Thu 31st March. Rooms and breakfast were £50/night, including doubles and twins for single
occupancy. Furthermore, conference rooms for the PMB and Storage meetings had been offered at
£50/day. The main meeting would be at the University of Sussex, though there were issues about
transport (may need to organise some buses) and about power and wireless access at the
conference venue. Places for the conference dinner were being looked into. The hotel was a
possibility, but a restaurant may be better.
2. Installed Capacity
=======================
In advance of an MB discussion at CERN tomorrow, JG had raised the issue of installed capacity
reporting by the Tier-2s in the UK. JC had circulated a spreadsheet comparing the gstat values; the
actual capacity from the latest quarterly report; the Tier2-GridPP MOU numbers; and the wLCG
pledge. Although there were some issues, the big picture is fine: the per-Tier-2 gstat values
satisfied the wLCG pledge, except for ScotGrid, which was 5% under. This was not a real shortfall, as
the installed capacity is there, but a reporting issue. At the individual site level, there were various
discrepancies with the GridPP-MoU either due to kit currently being installed or due to
DPM/Storm reporting issues.
TD noted that transcription errors were an issue - we need something that automatically pulls
figures out of the quarterly reports. JC did have something, but the quarterly report format changes
as extra columns are added for specific issues each quarter.
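As a minimal sketch of the kind of automatic extraction TD describes, assuming the quarterly
reports can be exported as CSV with a header row, figures can be looked up by header name rather
than by column position, so that extra columns added in a given quarter do not break the
extraction. The column names (Site, CPU_HS06, Disk_TB) and the function name are hypothetical
and are not taken from the actual report format or from JC's existing tool.

```python
import csv
import io


def extract_figures(report_text, wanted):
    """Return {site: {column: value}} for the wanted columns.

    Columns are found by header name, so added or reordered columns
    are ignored. "Site" and the names in `wanted` are hypothetical.
    """
    reader = csv.DictReader(io.StringIO(report_text))
    headers = reader.fieldnames or []
    missing = [col for col in ["Site", *wanted] if col not in headers]
    if missing:
        # Fail loudly instead of silently reading the wrong column.
        raise KeyError(f"report is missing columns: {missing}")
    return {
        row["Site"]: {col: float(row[col]) for col in wanted}
        for row in reader
    }
```

Raising an error when an expected header is absent means a format change is noticed immediately
rather than producing a wrong number in the comparison spreadsheet.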
3. Quarterly Reporting Status [SP]
==================================
After a flurry of activity this morning (presumably in response to the agenda item!) all reports had
been received except those from RM and JG.
4. EPSRC call [SP/JG]
=====================
JG had circulated the EPSRC call which was directed at EPSRC-funded subjects so was not directly
applicable to GridPP. Nevertheless, there was some scope in the dissemination/outreach area for
some kind of joint proposal. Neasan had talked to Catherine (EGI) and would talk with the NGS.
There may be some scope for minor GridPP involvement here.
5. Security Statement [DB]
==========================
JC had raised this issue last week; the following statement was iterated upon over the last few
days:
The GridPP PMB encourages sites to make decisions on security related
matters in accordance with their own site's security policy and the
common wLCG/EGI/GridPP security policy
(https://wiki.egi.eu/wiki/SPG:Documents), taking into account the advice
received from their own site security team, the information provided by
the GridPP security team, and in consultation with other sites. The PMB
acknowledges that security responses may differ from site-to-site,
reflecting different institutional policies, grid architectures,
configurations and installed packages. Ultimately, each site must weigh
the risk vs benefit of continuing to provide a service. The risk
analysis must consider the severity of the threat; the time-frame of the
exposure; the risk that lack of response would cause to other sites,
services and the infrastructure; and the ability of a site to monitor
and respond. GridPP encourages sites to consider each incident
objectively and does not wish to influence, in either direction, the
outcome of that consideration. That is, GridPP does not encourage sites
to take on a higher level of risk than they feel comfortable with in
order to preserve service, nor does GridPP encourage sites to shut down
service prematurely to eliminate small risks.
STANDING ITEMS
==============
SI-1 Tier-1 Manager's Report
-----------------------------
Fabric:
1) FY09 procurements:
- SL09 tranche completed acceptance test, very few problems encountered. It is likely that we will
formally accept the hardware this week.
2) FY10 procurements
- Disk tender - orders placed. Delivery late November.
- CPU tender - orders placed. Delivery late November and December.
- Various small system purchases being made.
3) Robotics
An intervention was made on the tape robot on 2nd November to address an overheating
problem. Unfortunately this was only partially successful and a further intervention will be
required.
Service:
Overall a better week operationally than recent weeks.
1) Summary of operational issues is at:
https://www.gridpp.ac.uk/wiki/Tier1_Operations_Report_2010-11-03
2) CASTOR
On Monday (1st November) the Atlas SRMs crashed repeatedly. This was triggered by the use of a
particular SRM command that checked the status of files recalled from tape. (There had been a
change in the Atlas software that exposed this problem.) On Tuesday morning the Atlas SRMs
were upgraded to fix the problem. So far this is looking good.
There is now an SIR of the previous week's problems on the LHCB instance:
https://www.gridpp.ac.uk/wiki/RAL_Tier1_Incident_20101026_LHCb_SRM_Bad_TURL_and_Outage
A change will be scheduled to move all the disk servers to 64 bit in order to fix the checksum
problem on the LHCB instance (pending since the upgrade). Our plan is to do the LHCB disk
servers on Wednesday then open negotiations with ATLAS and CMS to schedule work on theirs.
We plan to upgrade the LHCB SRMs (capacity not SRM release) in order to address possible
performance issues. We are working on a schedule to carry out this work. Issues around the exact
configuration to be deployed remain to be agreed but we hope to get them in this week before
LHCB reprocessing starts.
No problems have emerged from the Gen instance upgrade nor do we believe it is likely that
recent problems with the LHCB SRMs relate to the upgrade. We have therefore concluded that it
will be safe to proceed with upgrades to CMS and ATLAS. The schedule is now:
# Upgrade CMS - Tuesday to Thursday 16-18 November.
# Upgrade ATLAS - Monday to Wednesday 6 - 8 December.
SI-2 ATLAS weekly review & plans
---------------------------------
RJ reported that a data loss at Lancaster was worrying as it looked very like an earlier incident at
Glasgow. AS asked whether it related to a generation of controllers that they were concerned
about at RAL? Andrew will follow up with Peter or Matt. Because of SRM worries ATLAS didn't
move to PD2P last week - plan to do today.
SI-3 CMS weekly review & plans
-------------------------------
DC reported that CMS was fine; there had been some issues with Tier-1 SAM test due to load. The
CASTOR upgrade had been moved back by a week or so.
SI-4 LHCb weekly review & plans
--------------------------------
GP reported as follows:
1) RAL Tier 1. Reasonable running over last week (since problems of previous weekend)
although load has been somewhat lower. Investigations continue, but a lot of SRM hits appear to
come from FTS. Plan is to upgrade LHCb SRM machines to increase performance. LHCb
reprocessing due to start mid-November - so aim to upgrade/test before this.
2) UK Tier 2. Some problems with shared area at Bristol and Birmingham. Issue with queue
length parameters at UCL causing jobs to be killed.
SI-5 Production Manager's Report
---------------------------------
JC reported as follows:
1) There have been problems with the WMSes in the UK over the last week and this has been reflected
in the Nagios test results. The underlying problem is not really understood at the moment (see for
example the RAL ticket https://gus.fzk.de/ws/ticket_info.php?ticket=63912 and the Glasgow
ticket https://gus.fzk.de/ws/ticket_info.php?ticket=63931). Jobs enter the waiting state and never
complete. This has affected the SL test jobs too.
2) An estimate from Alastair Dewhurst suggests that there is of order 33TB of "dark data" in
ATLAS LOCALGROUPDISK. The current policy is to have 20% of a T2 disk allocated to this
spacetoken. There is currently no deletion policy for this area which is of concern to many sites -
but ultimately an ATLAS problem!
3) We currently have a problem with our ROD-COD communication as our regional operations
list is unsubscribed to the COD list due to email bounce problems (this has arisen due to the
change from a CERN based list to an egi.eu one).
4) A number of GridPP sites are being picked up by Pakiti as having nodes still vulnerable to a
recently announced vulnerability.
SI-6 LCG Management Board Report
---------------------------------
No meeting since last week.
SI-7 Dissemination Report
--------------------------
SP noted CHEP news item on GridPP website, and a note was being written on lessons to be drawn
from the stand and conference. SL noted that the CERN@school VO had been established.
REVIEW OF ACTIONS
=================
398.12 TD/DB to make renewed efforts to engage someone at Glasgow to tackle GridMon and to
have access transferred in order to ensure the instances were up-to-date and running ok - DB
would insist on a meeting with Mark Leese for a handover. To be done by the end of GridPP3.
It was decided that this action had been done by setting up the meeting. Progress would now be
monitored in the normal way (quarterly reports).
402.1 JC/JG to address the issue of ticket workflow in the UK in relation to NGS/NGI, to clarify
what the support process is: tickets were ending in dead ends.
JC was meeting this pm to discuss.
402.2 JC/JG to provide status report on EGI/NGI Service Level Agreements in the context of
GridPP agreeing with the level of service provided, ensuring that it is as GridPP requires.
JC and JG were meeting tomorrow to discuss. Some of this might have input into the GridPP4 MoU.
404.4 DB to provide a draft statement for the Minutes which should assist sites in dealing with
expectations on them in relation to risk strategies and work required.
DB had done this.
ACTIONS AS OF 08.11.10
======================
384.6 TD/JC to take the lead on the 'GridPP to NGI' document that addresses the forward-moving
technical and other issues from a GridPP perspective. JC was gathering info. It was noted that the
recipient was likely to be Dave Wallom. Deadline of late November for discussion. This should be
on the F2F Agenda for 9th December meeting.
397.1 AS to provide a high-level summary of the Disaster and Business Continuity Plan for input
to the next OC meeting - by November 15th latest - and also provide a web link to further, more
detailed documents.
398.6 DC to provide updated LondonGrid MoU. DC reported that the meeting had happened, the
LondonGrid MoU had been discussed, DC would incorporate comments.
398.7 DK to check that all is up-to-date in terms of GridPP Security Policies - email DB. If there
are any issues, DK to let DB know. DK reported that the GridPP Security Policy phase was ongoing
at present, however other policies had been approved by LCG. DK advised that EGI formal signoff
was awaited, then the GridPP pages would be updated.
398.10 RJ/Graeme Stewart to provide urls of the place(s) where info is located re ATLAS site tests
and measurements (so that sites understand what they're being measured on).
398.13 DB to consider how to evolve the User Board into a useful meeting in the future, DB to
initiate in the timeframe between now and GridPP4. This should be on the F2F Agenda for 9th
December meeting.
400.2 JC to confirm that priorities have been documented for the major experiments for
recovering files from disk servers.
400.4 SL to co-ordinate changing the current GridPP MoU towards an MoU for GridPP4.
402.1 JC/JG to address the issue of ticket workflow in the UK in relation to NGS/NGI, to clarify
what the support process is: tickets were ending in dead ends.
402.2 JC/JG to provide status report on EGI/NGI Service Level Agreements in the context of
GridPP agreeing with the level of service provided, ensuring that it is as GridPP requires.
403.2 RJ to broadcast the move to ATLAS adaptive data placement at RAL, specifically for PD2P
only, via ATLAS and GridPP standard channels.
404.1 DB to send round requests for papers from the PMB for the forthcoming OC meeting.
404.2 SP to circulate requirements relating to the OC meeting, for discussion at the PMB on 15th
November.
404.3 JC/JG to document the process for setting up a new VO in the UK and make it available in
the appropriate places.
The next PMB would take place on Monday 15th November at 12:55 pm.