Dear All,
Please find attached the latest weekly GridPP Project Management
Board Meeting minutes. The latest minutes can be found each week in:
http://www.gridpp.ac.uk/php/pmb/minutes.php?latest
as well as being listed with other minutes at:
http://www.gridpp.ac.uk/php/pmb/minutes.php
Cheers, Tony
________________________________________________________________________
Prof. A T Doyle, FInstP FRSE GridPP Project Leader
Rm 478, Kelvin Building Telephone: +44-141-330 5899
Dept of Physics and Astronomy Telefax: +44-141-330 5881
University of Glasgow EMail: [log in to unmask]
G12 8QQ, UK Web: http://ppewww.physics.gla.ac.uk/~doyle/
________________________________________________________________________
GridPP PMB Minutes 286 - 7th January 2008
=========================================
Present: Tony Doyle, Sarah Pearce, Roger Jones, David Britton, Steve Lloyd,
Robin Middleton, John Gordon, Glenn Patrick, Andrew Sansum, Dave Colling,
Suzanne Scott (Minutes)
Apologies: Stephen Burke, David Kelsey, Dave Newbold, Tony Cass, Jeremy Coles,
Peter Clarke, Neil Geddes
1. CMS Representation on PMB
=============================
TD advised that the transition from Dave Newbold to Dave Colling, as CMS
Representative on the PMB, was taking place at today's meeting. The PMB
wished to formally thank Dave Newbold for all his inputs on CMS issues and
relating to CASTOR testing. It was noted that DN would remain on the PMB
mailing list for the next month or so in order to effect a handover
period. GP, SL, noted the changeover in relation to the UB, CB
respectively. DN would continue on the Tier-1 Board.
2. GridPP MoU Draft for CB
===========================
TD had circulated v3.1 for consideration by the PMB. TD advised that this
incorporated minor changes post the F2F, including modifications to the
Tier-1 hardware breakdown for CMS in 2008. TD proposed that this version
be circulated to the CB. It was agreed that this version be circulated
and it was noted that the CB will sign this off at the next meeting. SL
noted that a short phone meeting was expected; and that Dave Colling was
now also on the CB as CMS Representative. TD reported that the figures in
v3.1 were the final version of the planning figures relating to the LCG in
December (cf the letter to Les Robertson). In relation to EGEE, they
needed to know the planning figures - 1% of the hardware allocated was for
EGEE purposes (incorporated in LHC anyway), but agreement with Ian Bird
was required via RM/JG, who would refer to the figures given. JG noted
that a meeting was due within the next few weeks, and EGEE planning was
likely to be on the Agenda. JG would sum-up the GridPP/NGS/Ireland
contribution. The PMB approved circulation of the MoU to the CB.
3. Tier-2 Hardship Fund
========================
SL had circulated a document for consideration by the PMB. SL reported
that the F2F meeting had agreed to make a further 100k available for cases
to be made. Four bids had been received totalling 113k; SL had circulated
the cases and conclusions/recommendations. It was proposed to fund two
cases in full; one was vague in terms of stated outputs but 25k had been
recommended. NorthGrid wished to allocate funds themselves. SL
recommended release of funds in goodwill that they would be sensibly
disseminated. TD proposed that the PMB accept SL and NG's recommendations
as given and endorse the conclusions; the information would then be
relayed to STFC as PMB-approved allocations. DB agreed that the PMB
should proceed as proposed. SL asked whether a breakdown of the funds
would be required of NorthGrid. DC noted that a mechanism might be
required to place the funds temporarily. SL advised that the grants would
need to be issued to institutions; the proposers would need to invoice
each other internally in order to effect internal transfers. DB advised
that some conditions should be attached to the grants, tying the funds to
delivery of resources. TD noted that the MoU has Regional
responsibility - the additional amounts for hardship would not change the
overarching MoU and associated hardware delivery. DB noted that something
should be written into the grant conditions in case of site failure.
There was a discussion in relation to the various aspects of this. TD
noted that agreement of future resource allocations would be based on
'past performance'. SL suggested Institutional MoUs that each region
would sign-up to. The question was, who gets penalised in case of failure
to deliver - Site or Region? DB reiterated that draft wording attached to
the Grants was required, to build-in something to which the PMB had
recourse in case of failure to deliver at site level. RJ noted that
NorthGrid would call a meeting and give allocations to Institutions - it
should be the Institutions that get penalised in case of failure. TD
advised that the Institutions themselves would also have to instigate
regional agreements to transfer funds. TD advised that the MoU figures
would not be modified. SL and DB to iterate. RJ to call a NorthGrid
meeting to organise dissemination of funds. DC noted that overall
responsibility lies with the Tier-2. TD confirmed that all other internal
arrangements are devolved to London, SouthGrid, NorthGrid and ScotGrid.
4. Post-Mortem on Tier-1 running
=================================
AS reported on various problems experienced over the Christmas break, as
follows:
The Tier-1 ran unattended from Saturday 22nd December 2007 until Tuesday
1st January inclusive. During that period Tier-1 staff continued to
monitor the service and carried out a number of interventions when they
detected problems. In general good availability was maintained for most
of the service, however there were major problems with ATLAS access to
CASTOR which are yet to be understood and are still being investigated.
Key problems and interventions were:
00) At 20:00 on Saturday 22nd December approximately 70% of the batch
workers went offline following a transient overload of the home
filesystem. Detected at midnight and workers restarted by 01:30 the
following morning. (Adams)
0) dCache failed at 11:00 on 24th after a logfile unexpectedly filled the
system disk of the pnfs server - was corrected at 13:00 and again after
a recurrence at 14:30 (2GB written in 1.5 hours) - change of use
pattern probably by MINOS. (Ross)
1) The (partial) failure of ATLAS access to CASTOR from 24th December
(still being investigated). Intervention attempted by Kruk (during the
holiday period - no info available) and by Bly (see below). Possibly
caused by a load-related problem on the ATLAS stager but still being
investigated.
2) The Nagios monitoring system failed and was restarted on the 25th
December. (Bly)
3) The restart of various CASTOR SRMs on the 27th December in response to
problems with CASTOR reported by ATLAS. High process count alarm on
the SRMs and no other faults reported by other CASTOR components. (Bly)
4) The restart of the ganglia server on 27th after logging failed.
Caused by 'out of memory' - more now ordered. (Thorne)
5) The CE gatekeeper daemon died at 04:00 on 29th December and was
restarted at 15:40. (Ross)
6) The Nagios server was restarted at 07:49 on 31st December and was
restarted at 12:09. (Thorne)
7) The CMS CASTOR WANIN pool became very busy on 31st December, logging NFS
errors. (White)
8) rb01 became overloaded (excessive job cancellations) on 31st December
(not detected) and was taken out of production on 2nd January for
investigation and repair.
9) Backups of system/home filesystems failed over the holiday period after
a problem with one of the ADS tape servers.
AS reported that on-call payments had been instigated for the first time
in relation to these issues. Problems experienced were currently being
diagnosed, particularly with CASTOR and ATLAS. The main challenge in
operating a 'holiday' service was the availability of expertise for
specific complex problems. The PMB agreed that GridPP should formally apologise to ATLAS
for the production difficulties over the festive season. It was hoped
that more specific diagnoses and full understanding of the problems would
become apparent in due course, resolution of which could be disseminated
UK-wide. AS to apologise to ATLAS on behalf of GridPP. The issues above
would be discussed again at next week's PMB.
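For next week's discussion, the timings quoted in the list of
interventions above can be tallied with a short script. This is only an
illustrative sketch in Python: the incident labels and function names are
hypothetical, only incidents for which the minutes give both a failure
time and a recovery time are included, and 1 GB is assumed to be 1000 MB.

```python
from datetime import datetime

# Failure and recovery times transcribed from the interventions list.
# Labels are illustrative; the recovery time is taken as the time the
# service was reported restarted or corrected.
INCIDENTS = {
    "batch workers": (datetime(2007, 12, 22, 20, 0), datetime(2007, 12, 23, 1, 30)),
    "dCache (first failure)": (datetime(2007, 12, 24, 11, 0), datetime(2007, 12, 24, 13, 0)),
    "CE gatekeeper": (datetime(2007, 12, 29, 4, 0), datetime(2007, 12, 29, 15, 40)),
}


def outage_durations(incidents):
    """Return a dict mapping each incident label to its outage length."""
    return {name: end - start for name, (start, end) in incidents.items()}


def fill_rate_mb_per_s(gigabytes, hours, mb_per_gb=1000):
    """Average write rate in MB/s (assumes 1 GB = 1000 MB by default)."""
    return gigabytes * mb_per_gb / (hours * 3600)


if __name__ == "__main__":
    for name, duration in outage_durations(INCIDENTS).items():
        print(f"{name}: {duration}")
    # pnfs server logfile: 2GB written in 1.5 hours
    print(f"pnfs logfile growth: {fill_rate_mb_per_s(2, 1.5):.2f} MB/s")
```

The durations are upper bounds on the time each service was degraded, since
the minutes record when a fault was corrected rather than when users
regained full service.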
5. AOCB
========
GP had submitted a paper on special cases for non-Grid access to the UK
Tier-1. GP put forward the cases he had received along with
recommendations for action from the UB relating to the LHC experiments
(approved by RJ and DC) and BaBar. GP reported that MINOS had suffered
from CASTOR delays and lack of testing; a working instance of CASTOR was
still awaited. Re CALICE, one issue was the RAL firewall. GP expected
that more cases would be received once qsub was withdrawn. The PMB agreed
with GP's (UB's) recommendations and asked GP to advise the UB
accordingly. TD noted that a service message was required at login - AS
to organise. JG asked about AFS for BaBar. It was proposed to continue
to run this (it was also required for the Oversight Board in relation to
disaster planning) - the PMB agreed. JC and SB to incorporate the AFS
Service into the disaster planning document.
STANDING ITEMS
==============
SI-1 Dissemination Officer's Report
------------------------------------
SP reported that there was news from STFC regarding the award on LHC@Home,
which had progressed to the next stage. This would need to be submitted
by the end of January and a presentation given in March. The message for
the LHC Promotion Advisory Group had been agreed and accepted - this will
be put to the next meeting. SP advised that the Christmas Story had been
posted as a news item and had been put on the website. If anyone has any
issues they wish reported, please send them to SP. Rob Edgecock had
advised that a nomination was required for the Science in Society Advisory
Panel. RJ was willing. TD would nominate him.
SI-2 Tier-1 Manager's Report
-----------------------------
AS provided the following report:
1) Tenders:
a) Disk tender - order placed - planned delivery date now agreed for 11th
January (may be delayed by up to 1 week).
b) CPU tender - order placed and scheduled for delivery 28 February.
c) Tape drive purchase - number of additional drives to be finalised in
the next 1-2 weeks in order to ensure delivery (mainly the servers)
this FY.
2) Memory upgrade - the Woodcrest (Streamline) systems have been upgraded
to 2GB per core and the AMD (Compusys) systems will be upgraded today.
3) Work on the power supply is proceeding - so far with no disruption to
service.
4) We expect to borrow about 80TB of unused disk capacity from the Tier-2
in order to (partially) tide us over until new capacity becomes
available at the end of March.
2) Service:
1) SAM availability for last week was 99% and the month's availability was 93%.
2) CASTOR - No update on general deployment of CASTOR.
3) SL4 Migration - The SL4 UI is configured and is being tested.
4) dCache - no update.
5) The LHCb Oracle-based LFC is installed, has had limited testing and is
now handed over to LHCb.
Progress to Grid Only Access - This standing item documents the status of
work towards achieving GridPP milestone 0.18 "Access to Tier-1 resources
by Grid Interfaces Only"
1) The scheduled termination has been announced. Special cases for
continued access have been passed to the PMB for review. We continue to
expect to terminate qsub access on 11th January 2008.
SI-3 Production Manager's Report
---------------------------------
It was noted that JC was on annual leave.
SI-4 LCG Management Board Report
---------------------------------
TD reported that he did not attend the last meeting due to a meeting
clash. It was noted that Ian Bird was now in charge of the LCG MB.
SI-5 Documentation Officer's Report
------------------------------------
It was noted that SB was on annual leave.
REVIEW OF ACTIONS
=================
272.4 AS to check the current Tier-1 disaster recovery plan and circulate
the existing version to the PMB. It was reported that this document does
not exist, but it was planned to have one in the longer term. TD would
incorporate in v0.4 anything that AS considered relevant. AS will check
and advise additions. Ongoing.
277.2 DN to provide an update and re-evaluation of CMS/CASTOR
deliverables. TD advised that there was a CMS/CASTOR document on
deliverables which should be revised in light of the December '07 tests.
DC to take the token for this now and iterate with DN.
277.5 Disaster Recovery 'Team B': SB, JC, TD, SP, DB to analyse the wider
issues of disaster planning, mapped to the experiments' lists, and this
work would include Project Management. A Recovery Plan was required. It
was agreed that JC was in charge of this and the experiment input relating
to subsets of the disaster plan. SB/JC to progress.
277.8 User Experience 'Team C': SB, SP, SL, with input from JC to deal
with the issue of user experience and design of an easily-found lookup
facility for grid error messages. SL reported that he had started the
ATLAS wiki page and would circulate the url. Ongoing.
280.6 JG to bring up this issue (the biomed VO and 'sieving') at the ROC
Manager's meeting (done) - a broadcast is to go out from EGEE which will
be helpful in underlining acceptable use of Grid resources and would act
as a reminder to VOs about the policy they have signed-up to in relation
to their users. JC had now emailed the Chair to have this discussed -
EGEE broadcast part of this action ongoing. JG reported that a new VO was
now set up but there were no resources allocated to it as yet, although
one Institute may be giving funds. Pending further info from JC. EGEE
broadcast action ongoing.
280.7 JC to mention the issues (when approached by a VO with regard to
joining) of the 'standard' 6-month introduction period, following which
the VO must set-up something specific to them, if appropriate. This had
been discussed at DTeam, done. JC to email GridPP VO members if possible
- ongoing. This was a standing action - JC had discussed it with the
Tier-2 Co-ordinators in relation to VO members. The emailing part was
ongoing but the first part of the action was completed. JC to send email.
Ongoing.
280.8 JG to investigate the UKI ROC website - any change/progress, and
report-back. Ongoing.
282.2 SP to progress the Project Map using the T1 service areas and input
from the meeting. Ongoing.
282.3 SL and NG to progress issues relating to Tier-2 hardware
allocation/complaints and iterate procedure with T2s. It was noted that
there was a deadline of 14 December for inputs to SL and NG. SL had
received inputs. To be re-evaluated in the New Year. Done, item closed.
282.5 Updated GridPP3 MOU needs to be sent to CB (TD to provide updated
version for SL to circulate). TD reported that he was working on this, on
the latest numbers required and comments would be sent to JC. Version 3.1
had been prepared for the CB. Done, item closed.
282.6 JC and SB to progress existing 'disaster planning' template for next
F2F meeting on 1st Feb. Involve experiments as necessary. Ongoing.
283.1 TD to arrange a phone connection at TC Dublin for RJ to join the
GridPP20 meeting remotely. Ongoing.
283.3 RM/TD to prepare use cases appropriate for the UK community,
[relating to item 278.10 EGEEIII -> EGI]. RM reported that he would be
attending a workshop at the end of January at CERN (by EGI design study
project) and would report-back at that time. Ongoing.
285.1 SP to circulate LPAG Grid Message paper to PMB once further comments
received. Done, item closed.
285.2 GP to compile a document showing the applications for non-Grid
access, and circulate to the PMB. Done, item closed.
285.3 JG to check the status of the Tier-1 Review Plan regarding 'on-call'
service, and circulate. JG reported that a wiki had been created. Done,
item closed.
ACTIONS AS AT 07.01.08
======================
272.4 AS to check the current Tier-1 disaster recovery plan and circulate
the existing version to the PMB. It was reported that this document does
not exist, but it was planned to have one in the longer term. TD would
incorporate in v0.4 anything that AS considered relevant. AS will check
and advise additions.
277.2 DN to provide an update and re-evaluation of CMS/CASTOR
deliverables. TD advised that there was a CMS/CASTOR document on
deliverables which should be revised in light of the December '07 tests.
DC to take the token for this now and iterate with DN.
277.5 Disaster Recovery 'Team B': SB, JC, TD, SP, DB to analyse the wider
issues of disaster planning, mapped to the experiments' lists, and this
work would include Project Management. A Recovery Plan was required. It
was agreed that JC was in charge of this and the experiment input relating
to subsets of the disaster plan. SB/JC to progress.
277.8 User Experience 'Team C': SB, SP, SL, with input from JC to deal
with the issue of user experience and design of an easily-found lookup
facility for grid error messages. SL reported that he had started the
ATLAS wiki page and would circulate the url.
280.6 JG to bring up this issue (the biomed VO and 'sieving') at the ROC
Manager's meeting (done) - a broadcast is to go out from EGEE which will
be helpful in underlining acceptable use of Grid resources and would act
as a reminder to VOs about the policy they have signed-up to in relation
to their users. JC had now emailed the Chair to have this discussed -
EGEE broadcast part of this action ongoing. JG reported that a new VO was
now set up but there were no resources allocated to it as yet, although
one Institute may be giving funds. Pending further info from JC. EGEE
broadcast action ongoing.
280.7 JC to mention the issues (when approached by a VO with regard to
joining) of the 'standard' 6-month introduction period, following which
the VO must set-up something specific to them, if appropriate. This had
been discussed at DTeam, done. JC to email GridPP VO members if possible
- ongoing. This was a standing action - JC had discussed it with the
Tier-2 Co-ordinators in relation to VO members. The emailing part was
ongoing but the first part of the action was completed. JC to send email.
Ongoing.
280.8 JG to investigate the UKI ROC website - any change/progress, and
report-back.
282.2 SP to progress the Project Map using the T1 service areas and input
from the meeting.
282.6 JC and SB to progress existing 'disaster planning' template for next
F2F meeting on 1st Feb. Involve experiments as necessary.
283.1 TD to arrange a phone connection at TC Dublin for RJ to join the
GridPP20 meeting remotely.
283.3 RM/TD to prepare use cases appropriate for the UK community,
[relating to item 278.10 EGEEIII -> EGI]. RM reported that he would be
attending a workshop at the end of January at CERN (by EGI design study
project) and would report-back at that time. Ongoing.
286.1 RJ to call a NorthGrid meeting to decide hardship funding
allocations to Institutes.
286.2 SL and DB to iterate regarding clause associated with the issuing of
Tier-2 hardware grants.
286.3 AS to formally apologise to ATLAS on behalf of GridPP for the outage
problems over the Christmas period.
286.4 GP to advise the UB that the special cases for non-Grid access to
the UK Tier-1 were approved.
286.5 AS to organise a service message at login relating to non-Grid
access being withdrawn.
286.6 JC and SB to incorporate the AFS Service into the disaster planning
document.
INACTIVE CATEGORY
=================
271.1 PMB to examine the issue of fibre breakage and outages, CERN-RAL OPN
link, in one year's time, when actual data on breakages is available.
Due date would be September '08.
271.3 Re CERN-RAL OPN link breakage and backup generally, PC to oversee
the issue and collate info so that the PMB have something to revisit in
one year's time. Due date September '08. It was noted that PC would
circulate a revised document after discussion with ATLAS (RJ/PC/DN to
iterate).
282.8 RM to monitor how R-GMA and networking issues impact on GridPP as
matters progress. RM advised that this item should be moved to the
'inactive' category as it will develop over the coming months. RM
discussed the issue with Steve Fisher and advised that support of R-GMA is
required whilst APEL is dependent on it. RM reported that he has spoken
to SF and there is currently no change to the R-GMA situation - process
ongoing.
The meeting closed at 2:20 pm. The next PMB would take place on Monday 14
January 2008 at 1:00 pm.