Dear All,
Please find attached the GridPP Project Management Board
Meeting minutes for the 388th meeting.
The latest minutes can be found each week in:
http://www.gridpp.ac.uk/php/pmb/minutes.php?latest
as well as being listed with other minutes at:
http://www.gridpp.ac.uk/php/pmb/minutes.php
Cheers, Dave.
--
________________________________________________________________________
Prof. David Britton GridPP Project Leader
Rm 480, Kelvin Building Telephone: +44 141 330 5454
Dept of Physics and Astronomy Telefax: +44-141-330 5881
University of Glasgow
G12 8QQ, UK
________________________________________________________________________
GridPP PMB Minutes 388 (24.05.10)
=================================
Present: John Gordon (Chair), Andrew Sansum, Tony Doyle, Jeremy Coles, Glenn Patrick, David
Kelsey, Sarah Pearce, Steve Lloyd, Tony Cass, Robin Middleton (Suzanne Scott, Minutes)
Apologies: Roger Jones, David Britton, Tony Cass, Pete Clarke, Dave Colling, Neil Geddes
1. Feedback from PPRP
======================
DB had circulated a note regarding this. The summary was that the PPRP had taken on board the
5% cuts, but this may not be enough. The situation was pending at present and it would be
discussed again at the upcoming OC meeting.
2. Oversight Committee
=======================
It was noted that the OC meeting was on 18th June. SP was not planning to attend. DB, SL, JG &
TD would be there. No Agenda had been received as yet.
3. Data Jamboree
=================
JG had provided an Agenda for this. JG noted he had hoped for better UK attendance but this
wouldn't be possible due to the cap on numbers. From the UK, JG, Jens Jensen, Shaun de Witt,
Matthew Viljoen, Wahid Bhimji and Sam Skipsey would be there. JG noted that this was a long
term issue in any case, and it wouldn't affect things at present.
4. wLCG Workshop at Imperial
=============================
DC had publicised this to the UK. There had been a PMB decision by email to encourage people to
attend, and they would be funded to do so. On the Wednesday, the first session on issues from
T1/2 and experiments is to be organised by the UK. It was agreed that JC would contact Jamie
Shiers and find out how we could help with session planning etc.
ACTION
388.1 JC to contact Jamie Shiers re the wLCG Workshop at Imperial, and find out how we could
help with first session planning and/or provide a Chair for the session.
STANDING ITEMS
==============
SI-1 Tier-1 Manager's Report
-----------------------------
AS reported as follows:
Fabric:
------
1) FY09 procurements:
- Disk servers from FY08 lot 2 and FY09 lot 1 are moving into LHC service classes. We expect to
have sufficient capacity in VOs non-prod service classes by 1st June to meet the MoU
commitments.
- Second lot of FY09 disk servers had problems during the supplier proving test. Supplier resolved
the problem with firmware updates and demonstrated 1 week of stable operation. We accepted
the servers into our own 28-day acceptance test and are currently testing. There is some indication of
further problems - a DM review meeting will be scheduled this week to review them.
- Second lot of CPU servers is proceeding through acceptance and is expected to complete
successfully this week.
2) FY10 procurements
- PQQ stage of the disk tender is being evaluated. Delivery target is December.
- CPU PQQ is nearly finalised and is planned to be submitted this week.
3) We have had a couple of unusual disk server crashes. See:
https://www.gridpp.ac.uk/wiki/RAL_Tier1_Incident_20100515_Disk_Server_Outage These have
been operationally disruptive owing to the length of downtime (precautionary to retain data). We
are investigating the underlying problem, but have also reviewed our disk server crash process in
order to improve turnaround on future failures. A recently received firmware update failed to
resolve the problem.
4) Commissioning of the extra RAL site 10Gb/s link is ongoing. Currently implementation has
been completed on the failover part of the production network and is being tested.
5) Commissioning of the second, resilient 10Gb/s OPN link to CERN is ongoing. A fault was traced
to a problem in London.
Service:
-------
Other than the disk server failures, operations continue to be good. Network rates are gradually
climbing but continue to be unproblematic.
1) The weekly operations summary is at:
http://www.gridpp.ac.uk/wiki/Tier1_Operations_Report_2010-05-19
2) SAM test availability for the ops VO was unreliable last week owing to false positives against
our site BDII from Taiwan. This is not going to be fixed as this infrastructure will be phased out
in June.
3) Load-related problems on the ATLAS software server continue and we are working on a
temporary solution (faster server) that we will deploy. Longer term we are considering the use of
AFS.
4) We expect to complete the deployment of SCAS/glexec this week.
5) Oracle patching of databases will lead to "At Risks" on OGMA (Atlas 3D) on Tuesday 25th May,
LUGH (LHCb 3D & LFC) Thursday 27th May and SOMNUS (LFC, FTS) on Wednesday 2nd June.
6) The phaseout of SL4 is scheduled to complete in August - announcements have been made.
SI-2 ATLAS weekly review & plans
---------------------------------
RJ was absent.
SI-3 CMS weekly review & plans
-------------------------------
DC was absent.
SI-4 LHCb weekly review & plans
--------------------------------
GP reported as follows:
1. Diskserver gdss380 went down twice recently at UK Tier 1 - 14 May and then 22 May. See item
3 in Tier1 report.
1.1. First failure caused quite a few LHCb user jobs to fail. Second failure seems so far much less
problematic.
1.2. Hopefully procedures for reporting diskserver failures to the VO will be improved in future.
2. PIC power failure on Friday brought down LHCb grid job accounting again.
3. Continuing problems with uploading data out of Sheffield, Brunel, Liverpool, Bristol and
Glasgow. See also item 1 in Production manager’s Report.
3.1. Problem alleviated in Glasgow by firewall tweaking, but still exists.
3.2. This issue exists only in these 5 UK sites of those that LHCb runs on in the worldwide grid.
3.3. There are some indications that this upload issue may have temporarily overloaded the
lhcbFailover space token at CERN.
4. LHCb has updated (increased) the time needed per job in the VO-card. All sites are requested to
provide long enough queues if they do not already do so.
5. Problem with bdii at SARA (wms.sara.nl) which froze at a moment unfortunately coincident
with the time lcgce07 was being brought up after glexec updates.
5.1. All queues on lcgce07 were considered available for LHCb.
5.2. Jobs failed at RAL as they ended up in low memory queues.
5.3. Problem solved by restarting the SARA bdii.
SI-5 Production Manager's Report
---------------------------------
JC reported that generally things were running smoothly. Items to note were as follows:
1) The problems affecting LHCb transfers continue but progress has been made. In particular
there is a correlation between use of NAT and (high) failure rates when copying files from the
WNs to remote SEs. When the remote SE is a DPM installation the transfers are successful! At a
basic level this suggests problematic middleware implementations; at an operational level the
pulling of files directly from the WN disk is not a recommended approach, but the sites affected
continue to carry out more detailed tests to find workarounds. Oddly, it is still only UK sites that see
this particular issue. The scope of the problem can be seen in these plots:
http://hepwww.rl.ac.uk/nraja/UKUploadProblems/index.html.
2) There is a new call to support the staged rollout of new gLite 3.2 middleware. APEL gLite3.2
SL5 is in the list (on a related note, ActiveMQ-based APEL was recently certified:
https://savannah.cern.ch/patch/?3612)
3) The UKI regional Nagios is validated:
https://twiki.cern.ch/twiki/bin/view/EGEE/ExternalROCNagios. The latest schedule plans to
switch off the central regional Nagios instance on the 15th June for all regions that are validated,
the same date as the central OPS SAM tests will be switched off.
SI-6 LCG Management Board Report
---------------------------------
JG reported on the last meeting that had taken place on 11th May. The main issues under
discussion had been the approval of DK's two revised security policies and a review of the
comparison of Nagios operations availability (this will happen again in June) - if everything is ok it
will be switched off on 15th. JG summarised the RRB and LHCC meetings: the scrutiny group
reported that since their estimates are within 10% of the experiment requests, sites should use
the experiment figures. JG reported that info needed to be gathered on hot data sets that were on
disk. JG noted that a new full scrutiny was required by 1st September. It was noted that CERN
management had sent a 'congratulations' round the sites. JG reported that the overall difficulties
noted were Alice resources; delays to the Tier-0; EGEE to EGI progress; and long-term
sustainability for middleware.
SI-7 Dissemination Report
--------------------------
SL reported on behalf of SP that some changes might take place in relation to EGI personnel.
REVIEW OF ACTIONS
=================
354.2 JC to consult with site admins on a framework policy for releases, with a mechanism for
escalation, plus a mechanism for monitoring. It needs writing up and an implementation plan. JC
to progress. Done, item closed.
366.8 AS to confirm that the Tier-1 proposes to use Tape-based storage in the period 2011 - 2015.
Ongoing.
380.5 RM/SP to make changes to the EGI/NGI paper as discussed and bring back a revised
version to next week's PMB. JG would check the numbers and circulate to the PMB - internal
only. Done, item closed.
380.9 RJ/DC to send info to DB regarding resource estimates for the upcoming period, as this info
will be needed after the PPRP. Ongoing.
382.1 RM to circulate updated paper (effort numbers, tables, text, NGI governance, risk etc) on
EGI/NGI (DB to use to prepare slides for the PPRP). Ongoing.
383.1 JG to provide a note of expected procurement dates following the HAG meeting. Done, item
closed.
384.1 AS to provide a plan for how to deal with the ADS Service, and bring back to the PMB. HEP
data in the ADS had been greatly reduced but it was not obvious if the work to reduce it to zero
would be cost effective. Ongoing.
384.5 ALL: to think about two levels of response to the NGS Technical Roadmap document:
1. endorse the general direction but correct any anomalies
2. ensure that the technical roadmap is aligned with GridPP's own aims and intentions
A high-level response should be made - ALL to re-read the document and it will be discussed at
the next PMB meeting. Ongoing.
384.6 TD/JC to take the lead on the response to the NGS Technical Roadmap document - we
should devise our own response: GridPP to NGI document that addresses the forward-moving
technical and other issues from a GridPP perspective - a skeleton outline should be circulated.
Ongoing.
384.7 JC to organise a poll of sites to find out how they pick up on issues, what they currently
check and monitor - was it a screen in the office, an auto-email etc? Done, item closed.
ACTIONS AS AT 24.05.10
======================
366.8 AS to confirm that the Tier-1 proposes to use Tape-based storage in the period 2011 - 2015.
380.9 RJ/DC to send info to DB regarding resource estimates for the upcoming period, as this info
will be needed after the PPRP.
382.1 RM to circulate updated paper (effort numbers, tables, text, NGI governance, risk etc) on
EGI/NGI (DB to use to prepare slides for the PPRP).
384.1 AS to provide a plan for how to deal with the ADS Service, and bring back to the PMB.
384.5 ALL: to think about two levels of response to the NGS Technical Roadmap document:
1. endorse the general direction but correct any anomalies
2. ensure that the technical roadmap is aligned with GridPP's own aims and intentions
A high-level response should be made - ALL to re-read the document and it will be discussed at
the next PMB meeting.
384.6 TD/JC to take the lead on the response to the NGS Technical Roadmap document - we
should devise our own response: GridPP to NGI document that addresses the forward-moving
technical and other issues from a GridPP perspective - a skeleton outline should be circulated.
388.1 JC to contact Jamie Shiers re the wLCG Workshop at Imperial, and find out how we could
help with first session planning and/or provide a Chair for the session.
INACTIVE CATEGORY
=================
359.4 JC to follow up dTeam actions from the DB, as follows:
-------
05.02 JC/dTeam to try and sort out CPU shares and priority resources,
at Glasgow first (perhaps by raising the job priority in Panda).
-------
The next PMB would take place at NOON (12:00 pm) on TUESDAY 1st June.