Dear All,
Please find attached the latest GridPP Project Management Board
Meeting minutes. The latest minutes can be found each week in:
http://www.gridpp.ac.uk/php/pmb/minutes.php?latest
as well as being listed with other minutes at:
http://www.gridpp.ac.uk/php/pmb/minutes.php
Cheers, Tony
________________________________________________________________________
Prof. A T Doyle, FInstP FRSE GridPP Project Leader
Rm 478, Kelvin Building Telephone: +44-141-330 5899
Dept of Physics and Astronomy Telefax: +44-141-330 5881
University of Glasgow
G12 8QQ, UK Web: http://ppewww.physics.gla.ac.uk/~doyle/
________________________________________________________________________
GridPP PMB Minutes 293 - 25th February 2008
===========================================
Present: Tony Doyle, Sarah Pearce, Roger Jones, Robin Middleton, John Gordon,
Jeremy Coles, Peter Clarke, Andrew Sansum, Dave Colling, Tony Cass,
Neil Geddes, (Suzanne Scott, Minutes)
Apologies: Stephen Burke, David Britton, Steve Lloyd, David Kelsey,
Glenn Patrick
1. 40 years of CPC - HPC thematic issue
=========================================
It was noted that NG, PC and TD had been invited to contribute and the
documentation had been forwarded to DB. TD suggested advertising via
UKHEPGRID, asking people to contact DB if they wanted to contribute to
this journal special issue by publishing a software program developed
during GridPP - the timescale was March '09. It was agreed that PC would
raise the issue with UKQCD, TD would contact Andrew McNab re GridSite, and
NG would contact the NGS side to see if there was any interest.
2. NGI Metrics
===============
DB and SL were not available today. SP noted that it was planned to put
something in the Project Map and also bring up the issue at GridPP20. It
was noted that the OC in May was the deadline for the Project Map.
RM noted that the EGI blueprint would help but strategic issues needed to
be addressed in parallel. TD noted that input was required from STFC. RM
advised that we could go to the OC on the basis of the EGI blueprint. TD
noted that there were no problems with the metrics but that the milestones
were difficult - these would need to be signed off by STFC. The OC would
be a useful forum for raising the issue - a document for this would need
to be presented, with inputs from SP, NG, DB etc - it should be a PMB
document. It was noted that LCG would be a part of this issue and the
EGEE/EGI/NGI infrastructure would also be involved, but a best funding
model was yet to be devised. It was agreed that a document would be
written in this space for the OC, and SP would provide some metrics. It
was agreed that TD would contact Trish Mullins and apprise her that an
Agenda item, with further discussion, was planned for the F2F.
STANDING ITEMS
==============
SI-1 Dissemination Officer's Report
------------------------------------
SP reported a news item by Neasan O'Neill on the User Forum and the
EGEE-All activities meeting - these were posted last week. There would be
a news item on the ATLAS workshop. Re the new version of the website, SP
advised that Andrew McNab and NO were working on it at present and it
should be ready by the end of this week for the dissemination team to look
at. There were preparations ongoing for the IoP HEPGRID meeting and
posters had been requested. NO would check that all poster requests had
been received. SP and NO were preparing for the presentation of the
LHC@Home large award at Swindon. Regarding the industry workshop, SP
asked whether TD and DC had replied to Alex Efimov to confirm that they
could speak. TD had still to do so [done following meeting]; DC had
already confirmed.
SI-2 Tier-1 Manager's Report
-----------------------------
AS provided the following report:
1) Tenders:
a) Disk tender - supplier load test completed. Our 28-day load test has
not yet started and is running late: the start was delayed by disruption
from the power cut and the need to restart the supplier load test. We
expect it to start later today.
b) CPU tender - Order placed and scheduled for delivery by 28 February. We
expect one supplier to deliver this Thursday but have no confirmed date
from the second supplier yet.
c) Tape drive purchase - Tape drives are in production. Tape servers are
ordered.
d) Non-Capacity hardware order has been placed. Delivery is expected to be
1-2 weeks later than the CPU delivery.
e) Oracle server hardware upgrade order has been placed.
f) An order for a 32 port non-blocking 10Gb switch has been placed.
Delivery is expected in mid March.
g) An order for about 40K of tape media has been placed.
2) Backplane work on non-CCRC disk servers will commence this week.
Service:
1) SAM availability for last week was 100% (SL's tests).
2) CASTOR
a) CASTOR appears to be working well for ATLAS, CMS and LHCb CCRC.
b) Work on ALICE is underway but deployment of the xrootd side of the
service has been problematic.
3) SL4 Migration - The SL4 UI build has minor changes to be made and it
will then be ready for release.
Progress to Grid Only Access:
This standing item documents the status of work towards achieving GridPP
milestone 0.18, "Access to Tier-1 resources by Grid Interfaces Only".
1) Non-Grid job submission has ended.
DC reported that, from the CMS experiment's point of view, things were
going OK and the milestones had been passed. RJ was not sure about ATLAS,
where things were not going as smoothly at present. AS noted that the
current rate of failure on servers was 'acceptable' - not too exceptional -
and that the crash rate was likely to be high.
SI-3 Production Manager's Report
---------------------------------
JC provided the following report:
1) A UKI monthly meeting was held last week. Among the items discussed
were the move to have APEL publishing as a critical test (fails after
31 days without records being published) and storage token use being
driven by CCRC activities. The experiments are asking for SL4 WNs at
sites but many sites have yet to upgrade.
2) From the last WLCG GDB, "The LHC experiments requested sites which have
WN capable of running in 64 bit mode to run them that way and to
advertise the fact in the BDII." The request was also to install the
32-bit compatibility libraries so that certain jobs can still run.
Aside: A 64-bit WN release has recently entered the PPS.
3) CCRC: On February 23rd between 22:00 and 23:00 GMT the average transfer
rate from CERN to "anywhere" was 2.2 GB/s which was the highest so far
(http://tinyurl.com/2vqv76). The main (e-logged) experiment issue
reported against RAL T1 (21st) concerned proxies for CMS - now fixed.
This led to the observation that one needs to be careful when using one
certificate to manage multiple transfers (see: http://tinyurl.com/2nx2gs).
4) Greig Cowan has looked at ways of debugging dCache mapping issues. Have
a look at some of the graphs to understand how complicated dCache pool
management has become: http://tinyurl.com/3b7lxt. More details in the
Storage blog http://gridpp-storage.blogspot.com/.
5) The ATLAS FDR information page for GridPP sites is providing a useful
summary http://www.gridpp.ac.uk/wiki/AtlasFdr1. Does such a page exist
for CMS or LHCb and if not would one be useful for their respective
challenges?
6) There again seem to be instances of SAM critical tests failing at sites
where it may be the test itself not the site at fault. The observed
instances are being followed up.
7) Questions have arisen in the last week about use of pooled sgm and prod
roles and the appropriate configuration at sites. Meanwhile discussion
is ongoing about how to prevent T1 resources being used by (ATLAS) user
jobs.
8) A security incident was reported at one INFN site last week. So far no
UK sites have reported any linked concerns, but the available
information on the incident is sparse.
9) A gLite-WMS migration strategy was discussed at the last DTEAM meeting.
UIs will not pose a major problem as they can support both the LCG
and gLite implementations simultaneously. A parallel service will be
run at each of the existing providing sites for about 3 months. After
this the LCG machines will be used to provide additional WMS resources.
One minor issue is the need/recommendation to host the LB on a separate
machine. TD reported that there had been an incident at Glasgow which
had caused problems with the WMS and CE plus the compute element
functions.
10) The deployment team membership for GridPP3 has been discussed. There
is general agreement that this should include one representative from
ATLAS, CMS and LHCb (they already attend). Representatives of other
VOs and technical experts (such as from T1) will be affiliated and
invited to attend specific meetings/discussions of relevance. Core
members will be expected to attend the weekly meeting.
11) Some sites have been asking about the timelines for the GridPP3
hardware money to become available - you will recall that the
allocations were agreed some months ago. The current position is that
STFC have frozen the grants pending further review of the current
issues being faced by the council. TD reported that there was no
formal statement as yet - we were hoping to receive something by the
beginning of March, following which there would be a three-week
consultation process.
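As an illustration of the APEL critical-test rule mentioned in item 1
above (the test fails after 31 days without records being published), the
check could be sketched as follows. This is a minimal sketch only, not the
actual SAM/APEL test: the function name, the use of calendar dates, and
the choice that exactly 31 days still passes are all assumptions.

```python
from datetime import date, timedelta

# Threshold taken from item 1 of the report; everything else here is a
# hypothetical illustration, not the real SAM/APEL implementation.
APEL_MAX_AGE = timedelta(days=31)


def apel_test_fails(last_published: date, today: date) -> bool:
    """Return True if more than 31 days have passed since the last record.

    Assumption: a gap of exactly 31 days still passes; the report does not
    say whether the boundary is inclusive.
    """
    return (today - last_published) > APEL_MAX_AGE


if __name__ == "__main__":
    # Example: a site that last published on 1st January 2008, checked on
    # the date of this meeting (25th February 2008).
    print(apel_test_fails(date(2008, 1, 1), date(2008, 2, 25)))
```

A site could run such a check ahead of the test becoming critical to see
whether its accounting records are at risk of being flagged.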
Meetings:
A) There is an ATLAS jamboree this Wednesday:
http://indico.cern.ch/conferenceDisplay.py?confId=22132#2008-02-27.
B) There is a WLCG GDB next week:
http://indico.cern.ch/conferenceDisplay.py?confId=20227.
The pre-GDB will be used to review the February CCRC:
http://indico.cern.ch/conferenceDisplay.py?confId=29170.
Derek Ross will be reporting a site's perspective on behalf of RAL T1.
C) Not all sites have responded to the WLCG workshop funding request for
the meeting in April (more agenda items now online
http://indico.cern.ch/conferenceTimeTable.py?confId=6552).
However, some sites have more than one request. Tony has helped us secure
a block booking in the CERN hostel which will help keep costs down.
D) The next GridPP User Board meeting has now been rescheduled to 19th
March at 14:00.
SI-4 LCG Management Board Report
---------------------------------
There was nothing to report.
SI-5 Documentation Officer's Report
------------------------------------
It was noted that SB was unavailable today.
REVIEW OF ACTIONS
=================
277.2 DN to provide an update and re-evaluation of CMS/CASTOR
deliverables. TD advised that there was a CMS/CASTOR document on
deliverables which should be revised in light of the December '07 tests.
DC to take the token for this now and iterate with DN. DC reported that
the document would be sent out this week.
277.8 User Experience 'Team C': SB, SP, SL, with input from JC to deal
with the issue of user experience and design of an easily-found lookup
facility for grid error messages. SL reported that he had started the
ATLAS wiki page and would circulate the url. SB was leading this with
inputs from SP, SL and JC where needed. A new simple summary was required
of all areas available plus a lookup/links facility, for the OC to review.
This would include a list of most recent types of problems (possibly a
'top 12' for users - what the error means and the course of action to
follow). SB to progress this.
280.7 JC to mention the issues (when approached by a VO with regard to
joining) of the 'standard' 6-month introduction period, following which
the VO must set up something specific to them, if appropriate. This was
discussed at DTeam. JC to email GridPP VO members if possible - ongoing.
This was a standing action - JC had discussed it with the Tier-2
Co-ordinators in relation to VO members. JC to send email.
289.2 DC to check current situation regarding gLite WMS and SL4 - current
status to be conveyed to DTeam. Done, item closed.
290.1 JC to write down membership of DTeam. Currently being done. Item
closed.
290.4 AS and JG to iterate regarding what could replace the Tier-1 Board.
290.7 AS to provide numbers in the Quarterly Report for the Tier-1 as per
the ones provided for Tier-2.
290.8 AS/SP to iterate regarding the financial summary in the Quarterly
Reporting (eg: Outturn figures).
290.9 Quarterly Report for Tier-2 staff to be compiled by the Production
Manager.
290.10 TD as Technical Director to provide a report showing effort
figures; milestones & metrics; and a table of posts showing Technical
Support.
290.11 DB to progress the situation at Manchester.
290.12 GP/SB/DC to define the portal and documentation Support posts and
ensure they form a comprehensive basis for user support (both
documentation and Grid access assistance), overseen by the UB Chair.
290.13 DB to complete the document re Reporting and Reporting Routes
relating to staff, and circulate it, thereafter it would be posted on the
website as a record.
290.14 RM to circulate the EGI Workshop Agenda. Done, item closed.
290.17 Re the Project Map, SP would look at the EGI wiki, and NG would
consider more inputs relating to box 6.2. Done, item closed.
290.18 Regarding the LCG box on the Project Map, SP to iterate with TC and
bring this issue back to the PMB.
290.20 RM to provide more detailed figures on travel expenditure -
broad-brush percentages would assist with decisions re travel in GridPP3.
290.21 SS to hand out travel forms at Dublin ('overseas' claim on web to
be submitted as 'actuals' and should be submitted before the end of March
2008). Will be done. Item closed.
290.23 AS/JC to iterate on the Disaster Recovery template and remove
capturable items that were considered to be minor.
290.24 JC to progress his suggested template to use when a crisis occurs -
to be revisited subsequently at a PMB.
292.1 TC and JC to iterate regarding the CERN system that recorded service
interdependence and enabled them to recover from crisis events.
292.2 JG to review the interplay between Footprints and GGUS tickets on
the helpdesk.
292.3 AS to produce an order for the CASTOR instances to be brought back.
This is not really required in advance, will be dealt with on a
case-by-case basis as required. Done, item closed.
292.4 JC to use the template from the disaster planning and apply it to
the RAL power failure.
ACTIONS AS AT 25.02.08
======================
277.2 DN to provide an update and re-evaluation of CMS/CASTOR
deliverables. TD advised that there was a CMS/CASTOR document on
deliverables which should be revised in light of the December '07 tests.
DC to take the token for this now and iterate with DN. DC reported that
the document would be sent out this week.
277.8 User Experience 'Team C': SB, SP, SL, with input from JC to deal
with the issue of user experience and design of an easily-found lookup
facility for grid error messages. SL reported that he had started the
ATLAS wiki page and would circulate the url. SB was leading this with
inputs from SP, SL and JC where needed. A new simple summary was required
of all areas available plus a lookup/links facility, for the OC to review.
This would include a list of most recent types of problems (possibly a
'top 12' for users - what the error means and the course of action to
follow). SB to progress this.
280.7 JC to mention the issues (when approached by a VO with regard to
joining) of the 'standard' 6-month introduction period, following which
the VO must set up something specific to them, if appropriate. This was
discussed at DTeam. JC to email GridPP VO members if possible - ongoing.
This was a standing action - JC had discussed it with the Tier-2
Co-ordinators in relation to VO members. JC to send email.
290.4 AS and JG to iterate regarding what could replace the Tier-1 Board.
290.7 AS to provide numbers in the Quarterly Report for the Tier-1 as per
the ones provided for Tier-2.
290.8 AS/SP to iterate regarding the financial summary in the Quarterly
Reporting (eg: Outturn figures).
290.9 Quarterly Report for Tier-2 staff to be compiled by the Production
Manager.
290.10 TD as Technical Director to provide a report showing effort
figures; milestones & metrics; and a table of posts showing Technical
Support.
290.11 DB to progress the situation at Manchester.
290.12 GP/SB/DC to define the portal and documentation Support posts and
ensure they form a comprehensive basis for user support (both
documentation and Grid access assistance), overseen by the UB Chair.
290.13 DB to complete the document re Reporting and Reporting Routes
relating to staff, and circulate it, thereafter it would be posted on the
website as a record.
290.18 Regarding the LCG box on the Project Map, SP to iterate with TC and
bring this issue back to the PMB.
290.20 RM to provide more detailed figures on travel expenditure -
broad-brush percentages would assist with decisions re travel in GridPP3.
290.23 AS/JC to iterate on the Disaster Recovery template and remove
capturable items that were considered to be minor.
290.24 JC to progress his suggested template to use when a crisis occurs -
to be revisited subsequently at a PMB.
292.1 TC and JC to iterate regarding the CERN system that recorded service
interdependence and enabled them to recover from crisis events.
292.2 JG to review the interplay between Footprints and GGUS tickets on
the helpdesk.
292.4 JC to use the template from the disaster planning and apply it to
the RAL power failure.
293.1 Re HPC thematic issue invites: it was agreed that PC would raise the
issue with UKQCD; TD would contact Andrew McNab re GridSite; NG would
contact the NGS side to see if there was any interest.
293.2 A PMB document to be written for the OC regarding NGI metrics, and
SP would provide some metrics for this.
293.3 TD to contact Trish Mullins and apprise her that an Agenda item
relating to NGI metrics was planned for the F2F.
293.4 NO to re-send poster requests.
293.5 TD to reply to Alex re speaking at the industry workshop.
INACTIVE CATEGORY
=================
271.1 PMB to examine the issue of fibre breakage and outages, CERN-RAL OPN
link, in one year's time, when actual data on breakages is available.
Due date would be September '08.
271.3 Re CERN-RAL OPN link breakage and backup generally, PC to oversee
the issue and collate info so that the PMB have something to revisit in
one year's time. Due date September '08. It was noted that PC would
circulate a revised document after discussion with ATLAS (RJ/PC/DN to
iterate).
282.8 RM to monitor how R-GMA and networking issues impact on GridPP as
matters progress. RM advised that this item should be moved to the
'inactive' category as it will develop over the coming months. RM
discussed the issue with Steve Fisher and advised that support of R-GMA is
required whilst APEL is dependent on it. RM reported that he has spoken
to SF and there is currently no change to the R-GMA situation - process
ongoing.
290.19 DB/SP to progress the details of the Project Map over the next few
months, cross-checking that all elements are incorporated, including
strategic priorities and staffing. To be completed before the next
Oversight Committee.
The next PMB would take place on Monday 3rd March at 1:00 pm. The meeting
closed at 2:15 pm.