Dear All,
Please find attached the latest GridPP Project Management Board
Meeting minutes. The latest minutes can be found each week in:
http://www.gridpp.ac.uk/php/pmb/minutes.php?latest
as well as being listed with other minutes at:
http://www.gridpp.ac.uk/php/pmb/minutes.php
Cheers, Tony
________________________________________________________________________
Prof. A T Doyle, FInstP FRSE GridPP Project Leader
Rm 478, Kelvin Building Telephone: +44-141-330 5899
Dept of Physics and Astronomy Telefax: +44-141-330 5881
University of Glasgow
G12 8QQ, UK Web: http://ppewww.physics.gla.ac.uk/~doyle/
________________________________________________________________________
GridPP PMB Minutes 292 - 18th February 2008
===========================================
Present: Tony Doyle, Sarah Pearce, Roger Jones, Stephen Burke, David Britton,
Steve Lloyd, John Gordon, Jeremy Coles, Peter Clarke, Glenn Patrick,
Andrew Sansum, Dave Colling, Tony Cass, (notes by DC)
Apologies: David Kelsey, Robin Middleton, John Gordon, Neil Geddes
1. Disaster Planning - Tier-1 power failure issues
===================================================
JC reported on the disaster planning review of the recent power failure at
RAL - his document had been circulated prior to the meeting. It was noted
that there was now an internal CERN number for such communications, which
was Ext 75011. DB asked if a meeting would always be needed in order to
determine the order in which services should be brought back. He thought
that it would make sense to have an ordered list/flow diagram. TC pointed
out that CERN had a system that recorded the service interdependence and
enabled them to recover from crisis events. TC and JC to iterate
regarding this following the meeting. TD commented that this report was
very specific and requested that more general lessons be learned. There
was a discussion between TD and JC concerning footprints/GGUS use in these
circumstances. JC asked what the order should be for bringing up CASTOR
instances. TD suggested that the Tier 1 should make a plan of this order
which would then be circulated to the experiments.
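Note: an ordered restart list of the kind DB suggested can be derived
mechanically once service dependencies are written down. The following is
a minimal sketch only, not a description of the CERN system or of the
Tier-1 setup: the service names and the dependency map are hypothetical
and chosen purely for illustration. It uses a topological sort from the
Python standard library (Python 3.9 or later).

```python
from graphlib import TopologicalSorter

# Hypothetical dependency map (illustrative names only, not the real
# Tier-1 layout): each service lists the services that must be up first.
DEPENDS_ON = {
    "network": [],
    "database": ["network"],
    "castor_nameserver": ["database"],
    "castor_instance": ["castor_nameserver"],
    "batch_system": ["network"],
    "grid_gateway": ["batch_system", "castor_instance"],
}


def restart_order(depends_on):
    """Return the services in an order where every dependency comes first."""
    # TopologicalSorter treats each dict value as the node's predecessors
    # and raises graphlib.CycleError if the dependencies are circular.
    return list(TopologicalSorter(depends_on).static_order())


if __name__ == "__main__":
    for step, service in enumerate(restart_order(DEPENDS_ON), start=1):
        print(f"{step}. {service}")
```

A list produced this way could be agreed in advance and circulated, so
that only exceptions need to be discussed during an incident.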
2. AOCB
========
It was reported that Greig Cowan would join a group of dCache experts.
There was some discussion about the appropriate level of support and long
term support for dCache.
STANDING ITEMS
==============
SI-1 Dissemination Officer's Report
------------------------------------
SP reported that Neasan O'Neill had taken a stand to EGEE user forum.
This had received a fair amount of attention largely because it wasn't just
GridPP, so it has been decided to do the same for Istanbul. TD noted that
Morag Burgon-Lyon had found the meeting useful. JC noted that it had been
an interesting meeting but that there were not many actual users present.
SP reported that there will be news releases on the User Forum, the Atlas
meeting and there will be an article about Jens Jensen's work on SRB and
SRM interoperability. SP asked if it was worth having a press release on
CCRC, and if not there could be some news items for the GridPP website.
TD suggested that rather than three news articles there should be a single
one with input from the experiments. JC pointed out that the GDB in March
will have a large Post Mortem on CCRC. It was noted that there will be an
Industry meeting on the 21st of May. TD will be a speaker at this meeting
and DC will also attend. It was noted that iSGTW was looking for punchy,
innovative ideas in an attempt to get some readers.
SI-2 Tier-1 Manager's Report
-----------------------------
In absentia AS provided the following report:
1) Tenders
a) Disk tender - supplier load test completed. Our 28-day load test is
running late: it has not yet started but is planned to start today.
b) CPU tender - Order placed and scheduled for delivery 28 February.
Suppliers may deliver 1-2 weeks early. It will probably not be
possible to complete the full 28-day acceptance test before it is
necessary to pay the bill in this financial year. Once we have 1-2
weeks load test results the PMB will be asked to approve payment.
c) Tape drive purchase - Six tape drives have been received (5 drives are
currently being borrowed). Tape server requisition now signed and
order expected to be placed early this week.
d) Non-Capacity hardware order has been placed. Delivery is expected to be
1-2 weeks later than the CPU delivery.
e) Oracle server hardware upgrade order has been placed.
f) An order for a 32 port non-blocking 10Gb switch has been placed. Delivery is expected in mid-March.
g) An order for about 40K of tape media has yet to be placed but is
planned to be placed this week.
2) Work on the power supply is complete.
3) We expect to commence work on replacing disk server backplanes w/b 25th
February. CCRC equipment will not be dealt with until after the CCRC
has finished at the end of February.
Service
-------
1) SAM availability for last week was 96% although some experiments were
impacted by only partial functioning of CASTOR early in the week
(fallout from the power failure) which was not detected by SAM.
2) CASTOR appears to be working well for ATLAS, CMS and LHCb.
3) SL4 Migration - The SL4 UI build has minor changes to be made and it
will then be ready for release.
Progress to Grid Only Access
----------------------------
This standing item documents the status of work towards achieving GridPP
milestone 0.18 "Access to Tier-1 resources by Grid Interfaces Only".
1) We have a list of users allowed to submit via qsub. When non-Grid
submission is reinstated only this list will be used.
There was a discussion on the replacement backplanes. The new timetable
was fine with the experiments.
SI-3 Production Manager's Report
---------------------------------
JC presented the following report:
1) Experiments/CCRC: LHCb have little production happening at the moment
but transfer tests for CCRC have started. They still have SAM test
problems to resolve and others connected with the RB/WMS.
CMS ramping up for CCRC but suffered due to loss of disk servers last
week. Still have CASTOR issues. IC and Brunel fully ready for CMS CCRC
activities. Bristol now appearing in CMS lists.
ATLAS FDR data is reaching T2s well now (but there is not much of it).
Problems come and go at the sites and are being resolved when they occur.
Required space tokens are in place at most T2s now. Site readiness for FDR
work is being tracked here:
http://www.gridpp.ac.uk/wiki/AtlasFdr1
Biggest T2 problems have surrounded dCache SRM v2.2. It is thought that
most problems faced have now been understood/resolved.
2) CPU utilisation has increased over the last week and has remained in
the range 50%-67%. The SAM test average for UK sites is up from 84% to
86% for the last week. The WLCG Tier-2 reliability report for January
2008 has now been circulated. The reliability:availability figures
given are: London (67%:73%); NorthGrid (89%:89%); ScotGrid (95%:95%)
and SouthGrid (90%:87%).
3) The UKI helpdesk importer for GGUS to Footprints has experienced
difficulty following a move to "validated" as the last ticket status. A
manual checking process is currently in place.
4) camont observe a factor of 6 improvement in submission times when
moving from the LCG RB to the gLite WMS. There was a discussion about
the transition to the gLite WMS. This will be discussed at the dteam
meeting and the location of the current SL3 versions will be
rebroadcast.
5) The newly created Tier-1 blog (http://www.gridpp.rl.ac.uk/blog/) has
now been added to the GridPP aggregator: http://planet.gridpp.ac.uk/.
6) Many sites have complained about inefficient biomed jobs and a lack of
VO/user response in understanding them. This is now being taken up
directly with the VO management.
Meetings:
A) WLCG workshop 21st-25th April
(http://indico.cern.ch/conferenceDisplay.py?confId=6552). I have requested
sites to inform me of their intention to send someone with GridPP funding.
So far I have had 7 (T2 site) replies.
B) There is an EGEE ROC manager's meeting tomorrow:
http://indico.cern.ch/conferenceDisplay.py?confId=23754.
C) ATLAS software & computing workshop takes place next Monday-Friday at
CERN: http://indico.cern.ch/conferenceDisplay.py?confId=22132. On the
Wednesday there is an ATLAS T0/1/2/3 Jamboree.
SI-4 LCG Management Board Report
---------------------------------
JG gave a quick summary of his talk to the MB, noting that the T1s are not
as ready as had been hoped at this stage.
SI-5 Documentation Officer's Report
------------------------------------
SB reported that he had done some work on both the web pages and the user
guide. This would be given to EGEE documentation group in due course.
REVIEW OF ACTIONS
=================
277.2 DN to provide an update and re-evaluation of CMS/CASTOR
deliverables. TD advised that there was a CMS/CASTOR document on
deliverables which should be revised in light of the December '07 tests.
DC to take the token for this now and iterate with DN. DC reported that
the document would be sent out this week.
277.8 User Experience 'Team C': SB, SP, SL, with input from JC to deal
with the issue of user experience and design of an easily-found lookup
facility for grid error messages. SL reported that he had started the
ATLAS wiki page and would circulate the url. SB was leading this with
inputs from SP, SL and JC where needed. A new simple summary was required
of all areas available plus a lookup/links facility, for the OC to review.
This would include a list of most recent types of problems (possibly a
'top 12' for users - what the error means and the course of action to
follow). SB to progress this.
280.7 JC to mention the issues (when approached by a VO with regard to
joining) of the 'standard' 6-month introduction period, following which
the VO must set up something specific to them, if appropriate. This was
discussed at DTeam. JC to email GridPP VO members if possible - ongoing.
This was a standing action - JC had discussed it with the Tier-2
Co-ordinators in relation to VO members. JC to send email.
289.2 DC to check current situation regarding gLite WMS and SL4 - current
status to be conveyed to DTeam.
290.1 JC to write down membership of DTeam.
290.2 RJ, DC and GP to nominate experiment user representatives for the
Deployment Board. ATLAS user person for the DB will be James Catmore and
also Raja Nandakumar and Stuart Wakefield. Done, item closed.
290.3 SL and DB to review the Tier-1 Board Terms of Reference and see what
could be formally incorporated into the new Deployment Board Terms of
Reference. DB to forward to JG to see if we really need a Tier-1 Board.
JG pointed out that purchasing was taking a long time and so we need to
start earlier in future - this will require knowledge of the scale of the
purchase. Done, item closed.
290.4 AS and JG to iterate regarding what could replace the Tier-1 Board.
290.5 All: to check their individual roles as outlined and advise DB of
any required changes. DB advised that he required input by next Monday
18th. Done, item closed.
290.7 AS to provide numbers in the Quarterly Report for the Tier-1 as per
the ones provided for Tier-2.
290.8 AS/SP to iterate regarding the financial summary in the Quarterly
Reporting (eg: Outturn figures).
290.9 Quarterly Report for Tier-2 staff to be compiled by the Production
Manager.
290.10 TD as Technical Director to provide a report showing effort
figures; milestones & metrics; and a table of posts showing Technical
Support.
290.11 DB to progress the situation at Manchester.
290.12 GP/SB/DC to define the portal and documentation Support posts and
ensure they form a comprehensive basis for user support (both
documentation and Grid access assistance), overseen by the UB Chair.
290.13 DB to complete the document re Reporting and Reporting Routes
relating to staff, and circulate it, thereafter it would be posted on the
website as a record.
290.14 RM to circulate the EGI Workshop Agenda.
290.15 JG to check with Malcolm Atkinson re attending the next EGI
workshop in Rome (March). JG will attend the EGI meeting in Rome. Done,
item closed.
290.16 NG noted that he had provided a draft paper relating to the end of
EGEE III but would add information that addressed the period beyond 2011
and re-circulate. NG will bring this to the PMB next week. Done, item
closed.
290.17 Re the Project Map, SP would look at the EGI wiki, and NG would
consider more inputs relating to box 6.2.
290.18 Regarding the LCG box on the Project Map, SP to iterate with TC and
bring this issue back to the PMB.
290.20 RM to provide more detailed figures on travel expenditure -
broad-brush percentages would assist with decisions re travel in GridPP3.
290.21 SS to hand out travel forms at Dublin ('overseas' claim on web to
be submitted as 'actuals' and should be submitted before the end of March
2008).
290.23 AS/JC to iterate on the Disaster Recovery template and remove
capturable items that were considered to be minor.
290.24 JC to progress his suggested template to use when a crisis occurs -
to be revisited subsequently at a PMB.
291.01 AS and JC to iterate on Thursday afternoon with a view to reporting
back on the recent Tier-1 outage to the PMB next Monday. Done, item
closed.
291.02 JG to raise the issue of UK CA certificates being taken out of CERN
VOMS, as an item at the MB. JC confirmed he would put it on the Ops
meeting Agenda. Done, item closed.
ACTIONS AS AT 18.02.08
======================
277.2 DN to provide an update and re-evaluation of CMS/CASTOR
deliverables. TD advised that there was a CMS/CASTOR document on
deliverables which should be revised in light of the December '07 tests.
DC to take the token for this now and iterate with DN. DC reported that
the document would be sent out this week.
277.8 User Experience 'Team C': SB, SP, SL, with input from JC to deal
with the issue of user experience and design of an easily-found lookup
facility for grid error messages. SL reported that he had started the
ATLAS wiki page and would circulate the url. SB was leading this with
inputs from SP, SL and JC where needed. A new simple summary was required
of all areas available plus a lookup/links facility, for the OC to review.
This would include a list of most recent types of problems (possibly a
'top 12' for users - what the error means and the course of action to
follow). SB to progress this.
280.7 JC to mention the issues (when approached by a VO with regard to
joining) of the 'standard' 6-month introduction period, following which
the VO must set up something specific to them, if appropriate. This was
discussed at DTeam. JC to email GridPP VO members if possible - ongoing.
This was a standing action - JC had discussed it with the Tier-2
Co-ordinators in relation to VO members. JC to send email.
289.2 DC to check current situation regarding gLite WMS and SL4 - current
status to be conveyed to DTeam.
290.1 JC to write down membership of DTeam.
290.4 AS and JG to iterate regarding what could replace the Tier-1 Board.
290.7 AS to provide numbers in the Quarterly Report for the Tier-1 as per
the ones provided for Tier-2.
290.8 AS/SP to iterate regarding the financial summary in the Quarterly
Reporting (eg: Outturn figures).
290.9 Quarterly Report for Tier-2 staff to be compiled by the Production
Manager.
290.10 TD as Technical Director to provide a report showing effort
figures; milestones & metrics; and a table of posts showing Technical
Support.
290.11 DB to progress the situation at Manchester.
290.12 GP/SB/DC to define the portal and documentation Support posts and
ensure they form a comprehensive basis for user support (both
documentation and Grid access assistance), overseen by the UB Chair.
290.13 DB to complete the document re Reporting and Reporting Routes
relating to staff, and circulate it, thereafter it would be posted on the
website as a record.
290.14 RM to circulate the EGI Workshop Agenda.
290.17 Re the Project Map, SP would look at the EGI wiki, and NG would
consider more inputs relating to box 6.2.
290.18 Regarding the LCG box on the Project Map, SP to iterate with TC and
bring this issue back to the PMB.
290.20 RM to provide more detailed figures on travel expenditure -
broad-brush percentages would assist with decisions re travel in GridPP3.
290.21 SS to hand out travel forms at Dublin ('overseas' claim on web to
be submitted as 'actuals' and should be submitted before the end of March
2008).
290.23 AS/JC to iterate on the Disaster Recovery template and remove
capturable items that were considered to be minor.
290.24 JC to progress his suggested template to use when a crisis occurs -
to be revisited subsequently at a PMB.
292.1 TC and JC to iterate regarding the CERN system that recorded service
interdependence and enabled them to recover from crisis events.
292.2 JG to review the interplay between Footprints and GGUS tickets on
the helpdesk.
292.3 AS to produce an order for the CASTOR instances to be brought back.
292.4 JC to use the template from the disaster planning and apply it to
the RAL power failure.
INACTIVE CATEGORY
=================
271.1 PMB to examine the issue of fibre breakage and outages, CERN-RAL OPN
link, in one year's time, when actual data on breakages is available.
Due date would be September '08.
271.3 Re CERN-RAL OPN link breakage and backup generally, PC to oversee
the issue and collate info so that the PMB have something to revisit in
one year's time. Due date September '08. It was noted that PC would
circulate a revised document after discussion with ATLAS (RJ/PC/DN to
iterate).
282.8 RM to monitor how R-GMA and networking issues impact on GridPP as
matters progress. RM advised that this item should be moved to the
'inactive' category as it will develop over the coming months. RM
discussed the issue with Steve Fisher and advised that support of R-GMA is
required whilst APEL is dependent on it. RM reported that he has spoken
to SF and there is currently no change to the R-GMA situation - process
ongoing.
290.19 DB/SP to progress the details of the Project Map over the next few
months, cross-checking that all elements are incorporated, including
strategic priorities and staffing. To be completed before the next
Oversight Committee.