Dear All,
Please find attached the latest weekly GridPP Project Management
Board Meeting minutes. The latest minutes can be found each week in:
http://www.gridpp.ac.uk/php/pmb/minutes.php?latest
as well as being listed with other minutes at:
http://www.gridpp.ac.uk/php/pmb/minutes.php
Cheers, Tony
________________________________________________________________________
Prof. A T Doyle, FInstP FRSE     GridPP Project Leader
Rm 478, Kelvin Building          Telephone: +44-141-330 5899
Dept of Physics and Astronomy    Telefax: +44-141-330 5881
University of Glasgow
G12 8QQ, UK                      Web: http://ppewww.physics.gla.ac.uk/~doyle/
________________________________________________________________________
GridPP PMB Minutes 289 - 28th January 2008
==========================================
Present: Tony Doyle, Sarah Pearce, Roger Jones, Stephen Burke, David Britton,
Steve Lloyd, Tony Cass, John Gordon, Jeremy Coles, Peter Clarke,
Glenn Patrick, Andrew Sansum, Dave Colling, Suzanne Scott (Minutes)
Apologies: David Kelsey, Robin Middleton, Neil Geddes
1. F2F Agenda
==============
DB reported that he had circulated a draft Agenda but had received no
comments. Items to be discussed included: the composition of the Board
and the DB, Quarterly Reporting, a brief item on Dissemination, progress
of milestones and metrics, travel policy, how the Tier-1 responded to the
Tier-1 Site Review and implementation plans required, disaster recovery
planning, EGEE/EGI. There would be a break in the middle of the meeting
for a Group Leaders' phone conference in order to report issues (to
include TD, RJ and SL). The PMB agreed the Agenda. DB asked to be told of
any other items. DC advised that he might be delayed due to his flight,
but hoped to arrive around 09:30.
2. LHCb Funding
================
GP had circulated an email summary of LHCb requirements in relation to
another software course on distributed analysis and Grid computing. GP
noted that the funding requirement was similar to last year but new people
were involved - 6k was requested from GridPP for accommodation and room
costs, LHCb would cover travel and meal costs. The course was likely to
take place sometime in March. DB asked how many people would be involved:
around 30. JG asked how much money had been awarded last time: a similar
sum. TD noted the requirement for an updated version of the LHCb proposal
from last time, which would be dealt with by TD/RM (similar to the recent
ATLAS request). DB recommended support for the LHCb request. TD noted
that the PMB approved the request in outline, subject to receipt of a
one-page proposal case. GP would submit this.
3. GridPP20
============
TD advised that the registration deadline was this Wednesday. It was
noted that numbers were down this year for a variety of reasons, but
chiefly, it was thought, because of the requirement to book early. TD
would put out another
reminder tomorrow (Tue). TD requested that PMB members ensure that people
within their groups have thought about attendance; TD had already spoken
about this to AS. TD advised that accommodation could not be held
following the end of January but that GridPP registrations could remain
open following further negotiations. SS would find out if the Harcourt
Hotel might allow more time [emailed following the meeting]. TD noted
that experiments' section names were not yet there. It was advised that
they were still to come. Raja Nandakumar and Mingchao Ma were awaiting
visas.
4. AOCB
========
None.
STANDING ITEMS
==============
SI-1 Dissemination Officer's Report
------------------------------------
SP reported two news items in preparation: 1) Neasan O'Neill was
attending the All Activities meeting; 2) SP was awaiting final approval
for a report on ELSSI at Glasgow. SP asked whether a press release was
required for CCRC08. TD advised that a release afterwards, once
achievements were known, would be more appropriate. SP would go forward on
the basis that there may be a release in due course. TD advised that this
could be
either March or May. SP reported that the LHC@Home bid was due to be
submitted to STFC on Thursday. SP advised that she had been in contact
with BBC Radio 4 who were planning a day of programmes about the LHC and
may want someone in the studio who knew about data processing and the
Grid. SP was going to meet them in March regarding Grid input to the
day. SP advised that she was working on the Project Map and hoped to be
able to remit this to the PMB F2F on Friday morning.
SI-2 Tier-1 Manager's Report
-----------------------------
AS provided the following report:
1) Tenders
a) Disk tender - installation is complete and supplier load test is
running. This is scheduled to end on Friday, after which our own 29-day
load test will commence.
b) CPU tender - Order placed and scheduled for delivery 28 February.
c) Tape drive purchase - Order for 6 drives has been placed.
d) Non-Capacity hardware order has been placed.
e) Oracle server hardware upgrade order has been placed.
f) An order for a 32 port non-blocking 10Gb switch is expected to be
placed shortly (e-Science funded). This will be the new core switch of
the Tier-1 network.
2) Work on the power supply is proceeding - work on the first transformer
is now complete and work on the second transformer has started and will
continue for 2-3 further weeks.
3) Disk server failures
Last week we suffered a severe filesystem corruption on an ATLAS disk
server after the RAID controller took an already offlined drive back into
the RAID set. This led to corruption of filesystem metadata and the
filesystem being marked as read only. On investigation many bad data
blocks were found and as it was not possible to identify which files had
been corrupted the filesystem was written off and over 2600 ATLAS files
lost. This is the second incident of this kind. We do not believe that
the drive should have been accepted back into the array and will contact
the supplier this week to escalate the matter with the controller
manufacturer.
There was a discussion on investigation and recovery of data. TD noted
that it was a reasonably manageable problem but asked AS whether he
expected this to happen again, and whether any statistics were available.
AS noted that this issue could be a systematic problem rather than a
straight disk failure, so it was hard to tell at this stage. AS noted that
such issues would be less likely with RAID6. TD asked whether the next
disks were RAID6 and, if so, whether that would help with controller-level
errors. AS confirmed yes, but said it was hard to say definitively. TD
asked if AS could assess the level of the problem and provide a review.
AS confirmed that he would.
Service
1) SAM availability for last week was 99%.
2) CASTOR
c) Preparations for CCRC08
i) ATLAS: Service classes/Space tokens set up. Disk pools fully configured
and ready.
ii) LHCb: Service classes/Space tokens set up. Disk pools fully configured
and ready.
iii) CMS have now decided how to allocate disk pools and deployment of the
last disk servers will commence.
iv) ALICE have now made a request to the UB for disk space. It is not
likely that we will be able to iterate the allocation changes, redeploy
the disk and implement a CASTOR instance (with an untested xrootd
interface) in time for CCRC08.
3) SL4 Migration
The SL4 UI build is now working and a test system has been installed.
One of the production UIs will be retasked to provide an SL4 service
this week.
4) On-call: We have now completed our analysis and documented critical
systems and alarms for on-call. We have also tested the full
callout chain from alarm to pager.
Progress to Grid Only Access
============================
This standing item documents the status of work towards achieving GridPP
milestone 0.18 "Access to Tier-1 resources by Grid Interfaces Only".
1) We are still finalising the list of users to be allowed to continue
using qsub. This was scheduled to be completed last week. Once we have the
list we will terminate qsub with no notice.
2) Message of the day is in place for the termination of the interactive
service at the end of February.
SI-3 Production Manager's Report
---------------------------------
JC provided the following report:
1) UKI is the TPM backup team this week for EGEE. We are next on
operations duty (backup) from 2nd February. Some thought needs to be
given to who provides the support from April.
2) There is a little more clarity about sites that are expected to play a
full part in the CCRC08 run in February, though it has not been
straightforward to get the required information about T2 space token
requirements and the hardware to be committed. Publishing the space token
(needed for accounting systems) requires GLUE 1.3, currently only used at
the T1. Several sites are scheduled to undertake SE upgrades
this week. UK SRM 2.2 instances can be seen here:
http://tinyurl.com/348t52.
3) The shared cluster at Edinburgh (ECDF) took a while to become
operational (some lessons on trying to integrate shared clusters) but
now publishes just over 500 job slots to the Grid. The site is now
entering its installation Phase 2.
4) The RB is still being seen as a weak point in the UK service. Checking
the tests this morning revealed that all the UK production RBs appear
to have problems (http://wn3.epcc.ed.ac.uk/srm/xml/srm_versions_bar).
We need a strategy for moving to the gLite-WMS or to revisit the
problems users are having with the RBs. Has the T1 had user feedback on
the gLite-WMS installed at the end of last year? DC advised that the
gLite-WMS was not certified for SL4 and problems were being worked on
at present. The current situation would continue in the meantime, was
being monitored, and would be revisited once the SL4 version was
certified. Once certified there should be a quick introduction. TD noted that
this information should be circulated to DTeam. DC will check the
current situation. SL will investigate advertising his RB switches in
the interim.
5) The UK CA certificate change (carried out as a precaution in case the
private key was compromised late last year) appears not to be as
transparent as was hoped. The first case of a new certificate not being
recognized by various VOMS servers has been reported. Instructions to
rectify the situation were created here
http://www.gridpp.ac.uk/wiki/Instruction_for_VO_administrators under CA
Rollover. A fix for the previously reported bug
(https://savannah.cern.ch/bugs/?func=detailitem&item_id=20789) has been
implemented but requires VOMS to be run with the option skipcacheck
enabled. It is suspected that the CERN instance did not have this
enabled. It was noted that this was an issue to be brought up at an
Operations meeting. JC will check with Jens Jensen and raise the issue
at Ops.
6) Imense Ltd (the company behind Cambridge ontology) are starting to
examine options for expanding their access to resources. In following
up within EGEE I received the following from Gabriel Zaquine: "After
examination of each business project, EGEE will support some of them,
only during the pre-competitive phase. Due to GEANT, EGEE can't be used
for business purposes. However, EGEE will encourage and support
companies willing to exploit the EGEE technology (e.g. providing
services or applications based on gLite on their own infrastructure)." Is
the GridPP/UK message back to camont the same, i.e. there is no
pay-for-use option available once their project moves to a competitive
stage? It was asked whether NGS could take over this issue. TD noted that
our response would be the same as the EGEE response; even NGS would find
it difficult to support 'real' use over the JANET network. SP advised
that she could speak to the KT person at STFC who assisted with the
PIPSS case, to help with the post-competitive phase. TD noted that NG
could also be involved.
Meetings:
A) There will be a pre-GDB CCRC08 F2F meeting on Tuesday 5th February at
CERN: http://indico.cern.ch/conferenceDisplay.py?confId=26922
B) There is a GDB next Wednesday:
http://indico.cern.ch/conferenceDisplay.py?confId=20226.
C) There was an ATLAS UK operations meeting on 17th January (omitted from
last week's summary):
http://indico.cern.ch/conferenceDisplay.py?confId=26907.
SI-4 LCG Management Board Report
---------------------------------
TD noted that there was nothing more to say at present, although the ALICE
issue might come up in relation to UK support.
SI-5 Documentation Officer's Report
------------------------------------
SB noted that there was nothing to add to last week's report. Issues were
ongoing.
REVIEW OF ACTIONS
=================
272.4 AS to check the current Tier-1 disaster recovery plan and circulate
the existing version to the PMB. It was reported that this document does
not exist, but it was planned to have one in the longer term. TD would
incorporate in v0.4 anything that AS considered relevant. AS will check
and advise additions. AS noted that he had the basic document laid out,
and had completed one section in detail. Ongoing.
277.2 DN to provide an update and re-evaluation of CMS/CASTOR
deliverables. TD advised that there was a CMS/CASTOR document on
deliverables which should be revised in light of the December '07 tests.
DC to take the token for this now and iterate with DN. DC reported that
he had discussed this with DN. Ongoing.
277.5 Disaster Recovery 'Team B': SB, JC, TD, SP, DB to analyse the wider
issues of disaster planning, mapped to the experiments' lists, and this
work would include Project Management. A Recovery Plan was required. It
was agreed that JC was in charge of this and the experiment input relating
to subsets of the disaster plan. SB/JC to progress. Ongoing.
277.8 User Experience 'Team C': SB, SP, SL, with input from JC to deal
with the issue of user experience and design of an easily-found lookup
facility for grid error messages. SL reported that he had started the
ATLAS wiki page and would circulate the url. SB was leading this with
inputs from SP, SL and JC where needed. A new simple summary was required
of all areas available plus a lookup/links facility, for the OC to review.
This would include a list of most recent types of problems (possibly a
'top 12' for users - what the error means and the course of action to
follow). SB to progress this. Ongoing.
280.6 JG brought up the issue of the biomed VO and 'sieving' at the ROC
Manager's meeting - a broadcast is to go out from EGEE which will be
helpful in underlining acceptable use of Grid resources and would act as a
reminder to VOs about the policy they have signed up to in relation to
their users. JC had now emailed the Chair to have this discussed. JG
reported that a new VO was now set up but there were few resources
allocated to it as yet, although the home Institute may be giving funds.
Pending further info from JC. EGEE broadcast action ongoing - JG will
bring up the broadcast action at the ROC VO meeting tomorrow (Tue 15).
JG reported that Heinz may bring up the issue of being banned. JG will
provide an update at the next meeting, probably the F2F. Done, item
closed.
280.7 JC to mention the issues (when approached by a VO with regard to
joining) of the 'standard' 6-month introduction period, following which
the VO must set up something specific to them, if appropriate. This was
discussed at DTeam. JC to email GridPP VO members if possible - ongoing.
This was a standing action - JC had discussed it with the Tier-2
Co-ordinators in relation to VO members. JC to send email. JC reported
that he has brought this up but we do not have stable regional VOs as yet,
to which people can migrate. VOs have been set up at VOMS but not at
sites. Ongoing.
280.8 JG to investigate the UKI ROC website - any change/progress, and
report back. SB to iterate with JG in order to sign off this item next
week. Ongoing.
282.2 SP to progress the Project Map using the T1 service areas and input
from the meeting. Ongoing.
282.6 JC and SB to progress existing 'disaster planning' template for next
F2F meeting on 1st Feb. Involve experiments as necessary. This was a
follow-up from the last F2F, and was to be distinguished from 277.5 action
which is a longer-term one relating to the OC.
286.5 AS to organise a service message at login relating to non-Grid
access being withdrawn. Done, item closed.
287.3 url to be sent to FP, RJ, DC, relating to CCRC08 with planning
meeting details, so that the summary of experiment requirements can be
checked to ensure no major mismatch. This was circulated, no response
received. RM to re-circulate urls with deadline. [Done following the
meeting]. Done, item closed.
288.1 All: to email DB if planning not to be in Glasgow the night before
the F2F meeting. Done, item closed.
ACTIONS AS AT 28.01.08
======================
272.4 AS to check the current Tier-1 disaster recovery plan and circulate
the existing version to the PMB. It was reported that this document does
not exist, but it was planned to have one in the longer term. TD would
incorporate in v0.4 anything that AS considered relevant. AS will check
and advise additions.
277.2 DN to provide an update and re-evaluation of CMS/CASTOR
deliverables. TD advised that there was a CMS/CASTOR document on
deliverables which should be revised in light of the December '07 tests.
DC to take the token for this now and iterate with DN.
277.5 Disaster Recovery 'Team B': SB, JC, TD, SP, DB to analyse the wider
issues of disaster planning, mapped to the experiments' lists, and this
work would include Project Management. A Recovery Plan was required. It
was agreed that JC was in charge of this and the experiment input relating
to subsets of the disaster plan. SB/JC to progress.
277.8 User Experience 'Team C': SB, SP, SL, with input from JC to deal
with the issue of user experience and design of an easily-found lookup
facility for grid error messages. SL reported that he had started the
ATLAS wiki page and would circulate the url. SB was leading this with
inputs from SP, SL and JC where needed. A new simple summary was required
of all areas available plus a lookup/links facility, for the OC to review.
This would include a list of most recent types of problems (possibly a
'top 12' for users - what the error means and the course of action to
follow). SB to progress this.
280.7 JC to mention the issues (when approached by a VO with regard to
joining) of the 'standard' 6-month introduction period, following which
the VO must set up something specific to them, if appropriate. This was
discussed at DTeam. JC to email GridPP VO members if possible - ongoing.
This was a standing action - JC had discussed it with the Tier-2
Co-ordinators in relation to VO members. JC to send email.
280.8 JG to investigate the UKI ROC website - any change/progress, and
report back. SB to iterate with JG in order to sign off this item next
week. Ongoing.
282.2 SP to progress the Project Map using the T1 service areas and input
from the meeting.
282.6 JC and SB to progress existing 'disaster planning' template for next
F2F meeting on 1st Feb. Involve experiments as necessary. This was a
follow-up from the last F2F, and was to be distinguished from 277.5 action
which is a longer-term one relating to the OC.
289.1 AS to provide an analysis of the ATLAS disk server failures on the
RAID controller.
289.2 DC to check current situation regarding gLite WMS and SL4 - current
status to be conveyed to DTeam.
289.3 JC to check the VOMS/-skipcacheck issue (in relation to UK CA
certificate change) with Jens Jensen and raise the issue at an Operations
meeting.
289.4 SP to speak to the KT person at STFC who assisted with the PIPSS
case, to help with the post-competitive phase (in relation to EGEE only
providing support to pre-competitive startup). SP to involve NG.
INACTIVE CATEGORY
=================
271.1 PMB to examine the issue of fibre breakage and outages, CERN-RAL OPN
link, in one year's time, when actual data on breakages is available.
Due date would be September '08.
271.3 Re CERN-RAL OPN link breakage and backup generally, PC to oversee
the issue and collate info so that the PMB have something to revisit in
one year's time. Due date September '08. It was noted that PC would
circulate a revised document after discussion with ATLAS (RJ/PC/DN to
iterate).
282.8 RM to monitor how R-GMA and networking issues impact on GridPP as
matters progress. RM advised that this item should be moved to the
'inactive' category as it will develop over the coming months. RM
discussed the issue with Steve Fisher and advised that support of R-GMA is
required whilst APEL is dependent on it. RM reported that he has spoken
to SF and there is currently no change to the R-GMA situation - process
ongoing.
The next PMB would be a F2F meeting on Friday 1st Feb in Glasgow.