Dear All,
Please find attached the latest GridPP Project Management Board
Meeting minutes. The latest minutes can be found each week in:
http://www.gridpp.ac.uk/php/pmb/minutes.php?latest
as well as being listed with other minutes at:
http://www.gridpp.ac.uk/php/pmb/minutes.php
Cheers, Tony
________________________________________________________________________
Prof. A T Doyle, FInstP FRSE GridPP Project Leader
Rm 478, Kelvin Building Telephone: +44-141-330 5899
Dept of Physics and Astronomy Telefax: +44-141-330 5881
University of Glasgow EMail: [log in to unmask]
G12 8QQ, UK Web: http://ppewww.physics.gla.ac.uk/~doyle/
________________________________________________________________________
GridPP PMB Minutes 291 - 11th February 2008
===========================================
Present: Tony Doyle, Sarah Pearce, Roger Jones, Stephen Burke, David Britton,
Steve Lloyd, John Gordon, Jeremy Coles, Peter Clarke, Glenn Patrick,
Andrew Sansum, Dave Colling, Suzanne Scott (Minutes)
Apologies: David Kelsey, Tony Cass, Robin Middleton, Neil Geddes
SI-1 Dissemination Officer's Report
------------------------------------
SP reported two news items, one at draft stage regarding UKQCD on the
PIPSS award relating to biotechnology, and the other relating to Neasan
O'Neill currently attending the EGEE User Forum in France with the UKI
stand. SP reported that Neasan O'Neill had met with A McNab regarding the
GridPP Website initial beta release at the end of the month, discussing
the new design and the background changes. There would be a test
opportunity for the Website in March; anyone who wished anything to be
included should let SP know. Neasan O'Neill was also currently gathering
updates for posters for the IoP HEPP Group meeting - these were
generally updated once per year. It was noted that some were being
omitted.
AS and JC provided an update on the recent outage at the Tier-1 and its
effect on Grid Production. The discussion would be continued at next
week's PMB as one input to overall disaster planning.
SI-2 Tier-1 Manager's Report
-----------------------------
AS reported as follows:
3) While working with one resilient transformer down, the other transformer
failed. We have not yet received a post-mortem analysis from building
services - although the failure was probably related to the ongoing
work. Power failed at about 12:10 on 7th February and was restored
about an hour later. Most critical national and global core services
were re-established later that day. The GOCDB was not available until
the Friday morning after problems rolling the database forward from the
Oracle journal. The CASTOR and Grid batch services were available late
on Friday although full batch capacity did not appear until Saturday.
Notification was problematic with no power/phones in the building and
no means of issuing a broadcast (GOCDB down). Notification to the VOs
was achieved within 30-60 minutes (ditto the PMB) and WLCG was informed
by 14:30.
4) We have seen a couple of incidents where drive failures appear to have
been handled incorrectly by the RAID controller. It appears likely that
this is related to the backplane issues identified below. This item is
closed unless we experience a recurrence.
5) On the evening of 31st January a disk server failed and smoke/fumes from
the failure triggered the fire alarm. Investigation showed that a hole had
been burnt through the disk backplane. This is the second failure to
trigger the fire alarm and is being treated as a safety matter. We have
received good support from the supplier and manufacturer and have now
received a written report which indicates a batch-related problem with
the server backplane. Although the supplier's assessment is that there
is no significant risk to safety, our safety review concluded that we
must address this problem as soon as practical. The supplier should
have replacement backplanes early this week (they should have arrived
in the UK on Friday) and they are reviewing what staff effort they have
to carry out an urgent intervention. It will probably take 20-30
minutes per server and we have 86 servers to repair. It is likely that
we will need to announce a lengthy downtime for mid/later this week but
have yet to decide if we will take the whole service down at once or
intervene VO by VO.
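For scale, a back-of-envelope estimate of the serial repair time implied
by the figures in item 5 (a minimal sketch; it assumes one server at a
time, whereas the actual schedule would depend on supplier staffing):

    # 20-30 minutes per server across 86 servers, done one at a time.
    servers = 86
    for minutes_per_server in (20, 30):
        hours = servers * minutes_per_server / 60
        print(f"{minutes_per_server} min/server -> {hours:.0f} hours in total")

At roughly 29-43 hours of serial work, either a lengthy whole-service
downtime or a rolling VO-by-VO intervention is needed, which is the choice
described above.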
AS reported that an analysis by the supplier had shown that the outage was
caused by a printed circuit board and alignment problems with layers and
crossover - this had caused resistors to overheat. There had been three
board failures since August '07, but not all of them had set off the
fire alarms. AS was unsure exactly how much downtime would be
required. TD advised that doing it by VO seemed a sensible approach. AS
confirmed that once he had more definitive information from the supplier
as to what they can do, and staff availability to carry out the work etc,
he would iterate with the experiments and would make a judgement about
downtime at that point.
AS further reported that as of today, all was working except the GOCDB,
and there had also been a further CASTOR issue over the weekend. Later
this week a retrospective assessment would be carried out. AS noted that
the volume of equipment was an issue, as was the streamlining of staff in
such an instance. Notification had also been an issue - AS had managed to
contact TD and also Graeme Stewart regarding dissemination of the outage.
AS had been advised to contact the CERN line 5501 should anything similar
happen again. AS confirmed he would document all of the issues. TD asked
JC to comment on the communication channels. JC noted that he had not yet
done a retrospective analysis but would do so this week; however, he had
received five notifications within the hour, and would examine the impact
of the outage. AS and JC would iterate on Thursday afternoon with a view
to reporting back to the PMB next Monday.
1) Tenders:
a) Disk tender - supplier load test completed. Our 28-day load test has
not started and is now running late. We plan to start it this week, but
have experienced disruption due to power and hardware problems.
b) CPU tender - order placed and scheduled for delivery 28 February.
Suppliers may deliver 1-2 weeks early. It will probably not be possible
to complete the full 28-day acceptance test before it is necessary to
pay the bill in this financial year. Once we have 1-2 weeks load test
results the PMB will be asked to approve payment.
c) Tape drive purchase - six tape drives have been received (five drives
are currently being borrowed). Tape servers must be ordered shortly.
d) Non-Capacity hardware order has been placed. Delivery is expected to be
1-2 weeks later than the CPU delivery.
e) Oracle server hardware upgrade order has been placed.
f) An order for a 32 port non-blocking 10Gb switch has been placed.
Delivery is expected in mid-March.
2) Work on the power supply is nearly complete. Switchover back to two
supplies is scheduled for 13:00 today (Monday).
Service:
1) SAM availability for last week was 75% - main problem was the power
failure, although site DNS problems also impacted the service on
Wednesday night.
2) CASTOR:
a) ATLAS, CMS and LHCb were fully configured for CCRC08. Alice have a disk
allocation and have started to attend the CASTOR teleconference. Work
on an Alice disk pool has now started.
b) One of the CASTOR endpoints (ralsrmd) has been down since Saturday. A
restart was attempted on Saturday but failed and the problem is still
under investigation.
3) SL4 Migration - The SL4 UI build is now working and a test system is
available. A minor problem was found and a new build was scheduled for
last Thursday; unfortunately this did not happen and the build will now
be completed this week.
4) At the PMB face-to-face an issue was raised about the number of hosts
with 2GB/core available. There are about 400 job slots guaranteed to
have at least 2GB/core at their disposal (about half the farm by
KSI2K). The remaining servers have 1GB/core and may be able to run
1.5GB jobs if there is a mix of jobs with lower memory requirements (a
small worked example of such a mix appears after this list). DC noted
that memory requirements had been growing over the past few months;
very little of the Tier-1 use was user analysis - it was mostly
scheduled reconstruction. AS noted that the older kit was about half
of the farm capacity and was due to be phased out. TD asked whether AS
was now formally in 'recovered disaster' mode. AS said that this
classification was unclear at present. TD advised that if there were
to be an imminent turn-off of disks this week, users needed to know.
JG advised that the incident could be classed not as a disaster but as
a major incident. TD suggested that users did need to know about any
expected significant downtime - this would need to be disseminated. AS
confirmed that once he had definitive feedback from the supplier about
manpower and timescales, he would advertise any potential downtime.
5) VO boxes
- We are working with Alice on an upgrade to SL4
- We have arranged with CMS to split the VO box into two, one running
Frontier - the other Phedex. This will be carried out with a planned
upgrade to Phedex.
6) AFS
There have been performance/availability problems with AFS, which have
been traced to a misconfigured firewall at a remote site. These had been
severely impacting BaBar but have now been resolved.
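Returning to the memory point in item 4 above: a minimal worked example
of why a 1GB/core node can sometimes host a 1.5GB job. The 4-core/4GB
node below is a hypothetical configuration for illustration, not a
statement about the actual Tier-1 hardware.

    def job_mix_fits(node_mem_gb, job_mems_gb):
        """True if one job per core fits within the node's total memory."""
        return sum(job_mems_gb) <= node_mem_gb

    # Hypothetical 4-core node at 1 GB/core (4 GB in total): a 1.5 GB job
    # fits only if the other three jobs average no more than ~0.83 GB each.
    print(job_mix_fits(4.0, [1.5, 0.8, 0.8, 0.8]))  # True  (3.9 <= 4.0)
    print(job_mix_fits(4.0, [1.5, 1.0, 1.0, 1.0]))  # False (4.5 >  4.0)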
Progress to Grid Only Access
============================
This standing item documents the status of work towards achieving GridPP
milestone 0.18, "Access to Tier-1 resources by Grid Interfaces Only".
1) We have a list of users allowed to submit via qsub. When non-Grid
submission is re-instated, only this list will be used (a sketch of such
an allow-list gate follows).
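As an illustration only, a minimal sketch of how such an allow-list gate
might work, e.g. as a wrapper in front of qsub. The file path, file
format and wrapper mechanism are all assumptions, not the Tier-1's
actual implementation.

    import os
    import pwd
    import sys

    # Hypothetical location and format (one username per line).
    ALLOW_LIST = "/etc/batch/qsub-allowed-users"

    def user_may_submit(username):
        """True if the user appears in the plain-text allow-list."""
        try:
            with open(ALLOW_LIST) as f:
                allowed = {line.strip() for line in f
                           if line.strip() and not line.startswith("#")}
        except OSError:
            return False  # fail closed if the list cannot be read
        return username in allowed

    if __name__ == "__main__":
        user = pwd.getpwuid(os.getuid()).pw_name
        if not user_may_submit(user):
            sys.exit("non-Grid submission is restricted; "
                     "please use the Grid interfaces")
        # ...otherwise hand the request off to the real qsub...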
SI-3 Production Manager's Report
---------------------------------
JC reported as follows:
1) The power cut that hit RAL last week took out the GOCDB. This has led
to a number of observations, including that the Broadcast system depends
in some way on the GOCDB being available (a sketch of an independent
fallback check appears at the end of this report). We will review how
communications then progressed to develop our disaster planning. The
GOCDB problem was traced to the NGS Oracle database, which supports the
GOCDB, being down. This has now been fixed.
2) UK CA certificates were taken out of CERN VOMS last week as it was
thought that they had been revoked. A user ticket hinted that a
specific problem being experienced might be due to old CA DN
information still being present, and certificates based on the old DN
were then removed without further investigation - i.e. the old root
certificate was suspected of having been compromised. This resulted in
the majority of UK-issued certificates not working for several hours,
until all users had been registered with both issuer names. JG noted
that CERN had not kept their VOMS certificates up to date. TD asked
what the policy was - there had to be a chain of trust from the DN to
the CA (a sketch of a basic issuer-DN check follows after this list).
JG would raise this as an item at the MB. JC confirmed he would put it
on the Ops meeting Agenda.
3) CCRC08 February run has started. There was a meeting at CERN last
Tuesday to discuss the direction. ATLAS intends to use T2s from this
week. One question they face is how to enforce analysis at T2s only.
The question of what space tokens to use for each of the experiments
now seems clearer. CCRC progress is being logged in e-logbooks here:
https://prod-grid-logger.cern.ch/elog/CCRC'08+Observations/. Other
links of interest will be found here: http://tinyurl.com/2tk5q9.
TD reported that a small group now existed, comprising TD, RJ, Alan Barr
and Dan Tovey, which had been working on the GridPP wiki regarding ATLAS
FDR plans, showing a breakdown of sites and available streams. TD advised
that the course of action might be to limit who could submit jobs to the
Tier-1, or to delete the AOD at the Tier-1; however, a definitive solution
was not yet apparent - it was noted that FDR tests the computing model,
not just the resources.
See
https://www.gridpp.ac.uk/wiki/AtlasFdr1
4) The topic of SL4 was discussed at the GDB last week. The majority of UK
sites now have plans to move and some have started asking about SL5. JG
noted that SL5 had been discussed at the MB - the PMB should consult
TC's slides as well as the conclusions outlined in the Minutes. TD
sent round a url in the chat window relating to the SL4 discussion.
https://twiki.cern.ch/twiki/pub/LCG/MbMeetingsMinutes/LCG_Management_Board_2008_02_05.htm
"Post SLC4 options at CERN - Tony Cass"
5) GridIreland colleagues have now created their own blog
(http://gridirelandops.blogspot.com/) and it is being aggregated along
with our GridPP blogs.
6) The overall job load is up in the last week. Today CPU is running at
roughly 55%. The UK SAM test average is down at 81% for the last week.
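A footnote to item 2 above: a minimal sketch of inspecting which issuer
DN a certificate carries, using openssl from Python. The file names are
hypothetical, and the real fix involved registering users under both
issuer names in VOMS; this shows only the basic DN check.

    import subprocess

    def issuer_dn(cert_path):
        """Return the issuer DN openssl reports for a PEM certificate."""
        out = subprocess.run(
            ["openssl", "x509", "-in", cert_path, "-noout", "-issuer"],
            capture_output=True, text=True, check=True,
        ).stdout.strip()
        # openssl prints a line like "issuer= /C=UK/..."; exact spacing
        # and DN formatting vary between openssl versions.
        return out.split("=", 1)[1].strip()

    # A user certificate issued before and after a CA rollover would show
    # different issuer DNs (file names hypothetical).
    print(issuer_dn("usercert-old.pem"))
    print(issuer_dn("usercert-new.pem"))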
Meetings
A) There is an EGEE User Forum in Clermont-Ferrand this week (11th-14th):
http://www.eu-egee.org/egee_events/userforum/3-user-forum/
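A footnote to item 1 above: a minimal sketch of an availability probe
whose fallback notification path does not depend on the GOCDB - the
observation being that the Broadcast system currently shares the GOCDB's
failure mode. The URL and contact list are hypothetical placeholders.

    import urllib.request

    GOCDB_URL = "https://goc.example.org/"            # hypothetical
    FALLBACK_CONTACTS = ["ops-oncall@example.org"]    # hypothetical

    def gocdb_reachable(timeout=10):
        """True if the GOCDB front page answers within the timeout."""
        try:
            with urllib.request.urlopen(GOCDB_URL, timeout=timeout):
                return True
        except OSError:
            return False

    if not gocdb_reachable():
        # The normal broadcast tool would be unusable at this point, so
        # notify via a channel that does not depend on the GOCDB.
        print("GOCDB unreachable - using fallback contacts:",
              FALLBACK_CONTACTS)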
SI-4 LCG Management Board Report
---------------------------------
TD reported that an issue under discussion at present was user accounting
and its impact within the experiments. There was agreement that an
interim policy, proposed by JG, needed to be implemented and that everyone
needed to know how data was accounted for. JG noted that a prototype
portal had been published to show what data existed. TD reported that
there was a summary available of the discussion on CCRC; there had also
been a discussion of the CASTOR metrics used to clarify how well the
system was working internally; SL4 had been discussed, along with what
should happen afterwards and the proposed timescale. Other discussions
covered VO box support - this issue would be addressed this week - there
had been a presentation from Alice, and also a review of the applications
area.
See
https://twiki.cern.ch/twiki/pub/LCG/MbMeetingsMinutes/LCG_Management_Board_2008_02_05.htm
SI-5 Documentation Officer's Report
------------------------------------
SB noted that items previously discussed were all ongoing.
AOCB
====
GP reported that the LHCb software course had been delayed and might not
happen until the next financial year. He asked whether it would be
possible to carry forward the £6k allocated by GridPP for this. TD noted
that information was still awaited from RM regarding action 290.20
relating to travel figures and information. DB advised that if the course
was delayed then in principle it would still be supported, but he noted
that if there were to be financial difficulties then that expenditure
would be considered along with all other expenditure. TD reported that
the announcements from STFC were expected at the end of February,
following which there would be a three-week consultation in March.
REVIEW OF ACTIONS
=================
272.4 AS to check the current Tier-1 disaster recovery plan and circulate
the existing version to the PMB. It was reported that this document does
not exist, but it was planned to have one in the longer term. TD would
incorporate in v0.4 anything that AS considered relevant. AS will check
and advise additions. Done, item closed.
277.2 DN to provide an update and re-evaluation of CMS/CASTOR
deliverables. TD advised that there was a CMS/CASTOR document on
deliverables which should be revised in light of the December '07 tests.
DC to take the token for this now and iterate with DN. DC reported that
the document would be sent out this week.
277.5 Disaster Recovery 'Team B': SB, JC, TD, SP, DB to analyse the wider
issues of disaster planning, mapped to the experiments' lists, and this
work would include Project Management. A Recovery Plan was required. It
was agreed that JC was in charge of this and the experiment input relating
to subsets of the disaster plan. SB/JC to progress. This was now being
dealt with via F2F actions. Done, item closed.
277.8 User Experience 'Team C': SB, SP, SL, with input from JC to deal
with the issue of user experience and design of an easily-found lookup
facility for grid error messages. SL reported that he had started the
ATLAS wiki page and would circulate the url. SB was leading this with
inputs from SP, SL and JC where needed. A new simple summary was required
of all areas available plus a lookup/links facility, for the OC to review.
This would include a list of most recent types of problems (possibly a
'top 12' for users - what the error means and the course of action to
follow). SB to progress this. Ongoing.
280.7 JC to mention the issues (when approached by a VO with regard to
joining) of the 'standard' 6-month introduction period, following which
the VO must set up something specific to them, if appropriate. This was
discussed at DTeam. JC to email GridPP VO members if possible - ongoing.
This was a standing action - JC had discussed it with the Tier-2
Co-ordinators in relation to VO members. JC to send email. Ongoing.
280.8 JG to investigate the UKI ROC website - any change/progress, and
report-back. SB to iterate with JG in order to sign-off this item next
week. Done, item closed.
282.2 SP to progress the Project Map using the T1 service areas and input
from the meeting. Done, item closed.
282.6 JC and SB to progress existing 'disaster planning' template for next
F2F meeting on 1st Feb. Involve experiments as necessary. This was a
follow-up from the last F2F, and was to be distinguished from 277.5 action
which is a longer-term one relating to the OC. Done, item closed.
289.1 AS to provide an analysis of the ATLAS disk server failures on the
RAID controller. It was noted that this may be due to the backplanes
problem, and was considered non-catastrophic - outcome awaited of the
investigation. Done, item closed.
289.2 DC to check current situation regarding gLite WMS and SL4 - current
status to be conveyed to DTeam. Ongoing.
289.3 JC to check the VOMS/-skipcacheck issue (in relation to UK CA
certificate change) with Jens Jensen and raise the issue at an Operations
meeting. This was raised via a ticket and will be discussed today at the
Ops meeting. Done, action closed.
289.4 SP to speak to the KT person at STFC who assisted with the PIPSS
case, to help with the post-competitive phase (in relation to EGEE only
providing support to pre-competitive startup). SP to involve NG. SP had
emailed and was awaiting a reply. Done, item closed.
290.1 JC to write down membership of DTeam.
290.2 RJ, DC and GP to nominate experiment user representatives for the
Deployment Board.
290.3 SL and DB to review the Tier-1 Board Terms of Reference and see what
could be formally incorporated into the new Deployment Board Terms of
Reference.
290.4 AS and JG to iterate regarding what could replace the Tier-1 Board.
290.5 All: to check their individual roles as outlined and advise DB of
any required changes. Ongoing.
290.6 TD to contact Iain and suggest that the GDB and MB roles change as
at 1st April 2008. It was noted that DB would take over on the MB from
1st April. Done, item closed.
290.7 AS to provide numbers in the Quarterly Report for the Tier-1 as per
the ones provided for Tier-2.
290.8 AS/SP to iterate regarding the financial summary in the Quarterly
Reporting (eg: Outturn figures).
290.9 Quarterly Report for Tier-2 staff to be compiled by the Production
Manager.
290.10 TD as Technical Director to provide a report showing effort
figures; milestones & metrics; and a table of posts showing Technical
Support.
290.11 DB to progress the situation at Manchester.
290.12 GP/SB/DC to define the portal and documentation Support posts and
ensure they form a comprehensive basis for user support (both documentation
and Grid access assistance), overseen by the UB Chair.
290.13 DB to complete the document re Reporting and Reporting Routes
relating to staff, and circulate it, thereafter it would be posted on the
website as a record.
290.14 RM to circulate the EGI Workshop Agenda.
290.15 JG to check with Malcolm Atkinson re attending the next EGI
workshop in Rome (March).
290.16 NG noted that he had provided a draft paper relating to the end of
EGEE III but would add information that addressed the period beyond 2011
and re-circulate.
290.17 Re the Project Map, SP would look at the EGI wiki, and NG would
consider more inputs relating to box 6.2.
290.18 Regarding the LCG box on the Project Map, SP to iterate with TC and
bring this issue back to the PMB.
290.19 DB/SP to progress the details of the Project Map over the next few
months, cross-checking that all elements are incorporated, including
strategic priorities and staffing. To be completed before the next
Oversight Committee. It was agreed to move this to the 'inactive'
category.
290.20 RM to provide more detailed figures on travel expenditure -
broad-brush percentages would assist with decisions re travel in GridPP3.
290.21 SS to hand out travel forms at Dublin ('overseas' claim on web to
be submitted as 'actuals' and should be submitted before the end of March
2008).
290.22 AS to get back to RJ regarding job slots at the Tier-1. Done, item
closed.
290.23 AS/JC to iterate on the Disaster Recovery template and remove
capturable items that were considered to be minor.
290.24 JC to progress his suggested template to use when a crisis occurs -
to be revisited subsequently at a PMB.
ACTIONS AS AT 11.02.08
======================
277.2 DN to provide an update and re-evaluation of CMS/CASTOR
deliverables. TD advised that there was a CMS/CASTOR document on
deliverables which should be revised in light of the December '07 tests.
DC to take the token for this now and iterate with DN. DC reported that
the document would be sent out this week.
277.8 User Experience 'Team C': SB, SP, SL, with input from JC to deal
with the issue of user experience and design of an easily-found lookup
facility for grid error messages. SL reported that he had started the
ATLAS wiki page and would circulate the url. SB was leading this with
inputs from SP, SL and JC where needed. A new simple summary was required
of all areas available plus a lookup/links facility, for the OC to review.
This would include a list of most recent types of problems (possibly a
'top 12' for users - what the error means and the course of action to
follow). SB to progress this.
280.7 JC to mention the issues (when approached by a VO with regard to
joining) of the 'standard' 6-month introduction period, following which
the VO must set up something specific to them, if appropriate. This was
discussed at DTeam. JC to email GridPP VO members if possible - ongoing.
This was a standing action - JC had discussed it with the Tier-2
Co-ordinators in relation to VO members. JC to send email.
289.2 DC to check current situation regarding gLite WMS and SL4 - current
status to be conveyed to DTeam.
290.1 JC to write down membership of DTeam.
290.2 RJ, DC and GP to nominate experiment user representatives for the
Deployment Board.
290.3 SL and DB to review the Tier-1 Board Terms of Reference and see what
could be formally incorporated into the new Deployment Board Terms of
Reference.
290.4 AS and JG to iterate regarding what could replace the Tier-1 Board.
290.5 All: to check their individual roles as outlined and advise DB of
any required changes. DB advised that he required input by next Monday
18th.
290.7 AS to provide numbers in the Quarterly Report for the Tier-1 as per
the ones provided for Tier-2.
290.8 AS/SP to iterate regarding the financial summary in the Quarterly
Reporting (eg: Outturn figures).
290.9 Quarterly Report for Tier-2 staff to be compiled by the Production
Manager.
290.10 TD as Technical Director to provide a report showing effort
figures; milestones & metrics; and a table of posts showing Technical
Support.
290.11 DB to progress the situation at Manchester.
290.12 GP/SB/DC to define the portal and documentation Support posts and
ensure they form a comprehensive basis for user support (both
documentation and Grid access assistance), overseen by the UB Chair.
290.13 DB to complete the document re Reporting and Reporting Routes
relating to staff, and circulate it, thereafter it would be posted on the
website as a record.
290.14 RM to circulate the EGI Workshop Agenda.
290.15 JG to check with Malcolm Atkinson re attending the next EGI
workshop in Rome (March).
290.16 NG noted that he had provided a draft paper relating to the end of
EGEE III but would add information that addressed the period beyond 2011
and re-circulate.
290.17 Re the Project Map, SP would look at the EGI wiki, and NG would
consider more inputs relating to box 6.2.
290.18 Regarding the LCG box on the Project Map, SP to iterate with TC and
bring this issue back to the PMB.
290.20 RM to provide more detailed figures on travel expenditure -
broad-brush percentages would assist with decisions re travel in GridPP3.
290.21 SS to hand out travel forms at Dublin ('overseas' claim on web to
be submitted as 'actuals' and should be submitted before the end of March
2008).
290.23 AS/JC to iterate on the Disaster Recovery template and remove
capturable items that were considered to be minor.
290.24 JC to progress his suggested template to use when a crisis occurs -
to be revisited subsequently at a PMB.
291.01 AS and JC to iterate on Thursday afternoon with a view to reporting
back on the recent Tier-1 outage to the PMB next Monday.
291.02 JG to raise the issue of UK CA certificates being taken out of CERN
VOMS, as an item at the MB. JC confirmed he would put it on the Ops
meeting Agenda.
INACTIVE CATEGORY
=================
271.1 PMB to examine the issue of fibre breakage and outages, CERN-RAL OPN
link, in one year's time, when actual data on breakages is available.
Due date would be September '08.
271.3 Re CERN-RAL OPN link breakage and backup generally, PC to oversee
the issue and collate info so that the PMB have something to revisit in
one year's time. Due date September '08. It was noted that PC would
circulate a revised document after discussion with ATLAS (RJ/PC/DN to
iterate).
282.8 RM to monitor how R-GMA and networking issues impact on GridPP as
matters progress. RM advised that this item should be moved to the
'inactive' category as it will develop over the coming months. RM
discussed the issue with Steve Fisher and advised that support of R-GMA is
required whilst APEL is dependent on it. RM reported that he had spoken
to SF and there was currently no change to the R-GMA situation - process
ongoing.
290.19 DB/SP to progress the details of the Project Map over the next few
months, cross-checking that all elements are incorporated, including
strategic priorities and staffing. To be completed before the next
Oversight Committee.
The meeting closed at 2:30 pm. The next PMB would be at 1:00 pm on Monday
18 February.