Dear All,
Registration for GridPP24 (http://www.gridpp.ac.uk/gridpp24/ ) closes a
week today.
Please find attached the GridPP Project Management Board
Meeting minutes for the 381st meeting. The latest minutes can
be found each week in:
http://www.gridpp.ac.uk/php/pmb/minutes.php?latest
as well as being listed with other minutes at:
http://www.gridpp.ac.uk/php/pmb/minutes.php
Cheers, Dave.
--
________________________________________________________________________
Prof. David Britton GridPP Project Leader
Rm 480, Kelvin Building Telephone: +44 141 330 5454
Dept of Physics and Astronomy Telefax: +44-141-330 5881
University of Glasgow EMail: [log in to unmask]
G12 8QQ, UK
________________________________________________________________________
GridPP PMB Minutes 381 (15.03.10)
=================================
Present: David Britton (Chair), Steve Lloyd, Sarah Pearce, Andrew Sansum, Tony Doyle, Dave
Colling, Robin Middleton, Pete Clarke, Roger Jones, Tony Cass, Jeremy Coles, Glenn Patrick, Neil
Geddes (Suzanne Scott, Minutes)
Apologies: David Kelsey, John Gordon
1. Week's Notes
================
a) Tier-2 Investments
---------------------
SL reported that he had received inputs from various people, which he is currently working on.
Inputs related to construction and building costs etc. A few more had been promised but he could
make a start on the paper and figures now. SL had received unfunded effort info from SP and also
has some electricity estimates.
b) Experiment speakers for RHUL
-------------------------------
RJ and DC reported that either they, or someone else, would speak at RHUL. Both would
advise DB by Wednesday if possible.
c) OPN link status
-------------------
AS reported that he had circulated info on the OPN, and was still in discussion with Robin Tasker.
DB advised that he needed the info within the next 2 weeks. DB noted that if ATLAS were going to
use 9 Gb/s then we needed the backup link for load balancing. AS noted that we could decide what
we provide.
d) CERN Hardware paper
-----------------------
Email information had been circulated and there had been a subsequent email discussion
exchange. Bernd Panzer had produced a CERN summary paper on hardware costings. Price
comparisons had been made. Iterations had taken place re available capacity. AS reported that
although CERN prices appeared higher by 60% - they were using a RAID 1 system and two hot
spares plus two system disks. Our (RAL) configuration was different, therefore there were
different overheads. Out of 24 drives, CERN had 10 data drives, RAL had 20. AS advised that
trying to pin-down the remaining differences was very difficult. In the final analysis AS thought
that we were within about 5% for the hardware cost. DB noted that the paper which AS was
preparing would be used to support hardware costings for GridPP4, if required.
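As a rough illustration of why the raw price comparison is hard to pin down, the drive counts quoted above can be turned into data fractions. This is only a sketch: the function name is illustrative, and it assumes every drive has the same cost and capacity, which is not stated in the CERN or RAL figures.

```python
def data_fraction(data_drives: int, total_drives: int) -> float:
    """Fraction of the drives in one server that hold user data."""
    return data_drives / total_drives


TOTAL_DRIVES = 24  # drives per server, as quoted in the minutes

cern = data_fraction(10, TOTAL_DRIVES)  # CERN: 10 data drives
ral = data_fraction(20, TOTAL_DRIVES)   # RAL: 20 data drives

print(f"CERN data fraction: {cern:.1%}")
print(f"RAL data fraction:  {ral:.1%}")
# How many data drives RAL gets per CERN data drive, for the same 24 drives
print(f"RAL/CERN ratio:     {ral / cern:.1f}x")
```

On these figures a RAL server exposes twice as many data drives as a CERN server out of the same 24, so per-server prices cannot be compared directly without allowing for the different overheads.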
2. EGI/NGI paper
=================
RM had circulated a spreadsheet. NG was still working on governance area text. RM reported that
the spreadsheet had three worksheets giving detail. The first worksheet gave columns for a likely
hopeful outcome scenario, compared with a minimum UK NGI requirement and a desirable
requirement. DB noted that the idea was to make some slides for the PPRP to address the three
scenarios (default; no-EGI; and no-NGS) including risks and opportunities.
The second worksheet related to no EGI - the GridPP contingency of 1 FTE had been added. The
final worksheet related to no NGS4 at all. It was noted that JISC had committed to EGI so it was
unlikely that no funding would be forthcoming. JISC had signed up to the legal entity EGI.eu
statute.
RM noted that it was difficult to make choices at the moment until the situation was firmed-up.
GridPP could itself become an NGI or it could simply relate to wLCG and there would be no NGI at
all. PC noted he was nervous that we would take on something that JISC wouldn't fund, as it
wouldn't help us, or help to get particle physics out of CERN. What difference did it actually make
being at the 'top table'? NG commented that it would depend on wLCG commitment on achieving
goals through EGI. DB thought it was a longer-term question really. PC noted that for GridPP to
carry the responsibility for the UK when the UK were not interested, would be a difficult task. NG
advised that we still have to deliver certain elements of an NGI - we couldn't deliver the whole of
the NGI requirement, only the bits that we were concerned with. DB thought that it was a grey
area - if we do the things we need, who do we deliver them to? We would still need to do
security/accounting/regional support - we would be doing these anyway. DB advised that in
order that we could answer the PPRP, various scenarios needed to be considered.
It was agreed that RM should embed the numbers into the document and get input from NG. DB
would use the document as a basis for backup slides for the PPRP. DB noted that this document
should be PMB-internal.
3. Travel
==========
a) RM reported that he had received requests to go to Hepix in Europe (Lisbon), but the cost per
person was £1500, which equated to £200+ per day overall rate. How much was GridPP prepared
to fund? He had received 2 requests so far. It was proposed to fund 1 person per Tier-2, but this
was expensive. TC advised that CERN funded 220 CHF per day. RJ thought we should specify a
lower rate; DB agreed, noting actuals with a limit on the hotel. It was decided to take the CERN
rate and divide by ~1.6. It was agreed that a maximum daily rate of £130 should apply.
b) RM reported that the issue would be the same for CHEP, which was taking place in Taiwan
(Taipei). RM advised that 15-20 had gone in previous years. DB advised that only those giving
talks or organising sessions should go. There would be a bidding process and GridPP would only
pay 50% of costs. SP advised that Neasan O'Neil was waiting for an announcement re the stands.
CHEP was most useful re the stands, as we got a lot of attention. RM suggested funding only
Neasan at full cost. Any others doing papers and manning the stand would be part-funded. It was
agreed that as a requirement of funding, delegates worked on the stand. DB also asked if we could
ensure that it was not the same talk being given by the same people all the time. The same talk
didn't warrant repeated funding.
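The daily-rate arithmetic in item (a) can be checked with a short sketch. The divisor of 1.6 is the approximate figure quoted in the minutes and is assumed here to be a CHF-per-GBP conversion. The variable and function names are illustrative, and capping actual costs at the agreed maximum is one reading of "actuals with a limit", not a stated policy.

```python
CERN_RATE_CHF = 220    # CERN daily rate quoted by TC
DIVISOR = 1.6          # approximate divisor from the minutes (assumed CHF per GBP)
AGREED_CAP_GBP = 130   # maximum daily rate agreed by the PMB

converted = CERN_RATE_CHF / DIVISOR
print(f"CERN rate converted: about {converted:.2f} GBP per day")
print(f"Agreed maximum:      {AGREED_CAP_GBP} GBP per day")


def reimbursable(actual_daily_gbp: float, cap: float = AGREED_CAP_GBP) -> float:
    """Return the actual daily cost, limited to the agreed maximum."""
    return min(actual_daily_gbp, cap)


# A claim above the cap is limited; a claim below it is paid as claimed.
print(reimbursable(200))
print(reimbursable(100))
```

The converted CERN figure comes out slightly above the £130 maximum that was agreed.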
STANDING ITEMS
==============
SI-1 Tier-1 Manager's Report
-----------------------------
AS reported as follows:
Fabric:
1) FY09 procurements:
- All disk and CPU has been delivered.
- We expect to be able to start acceptance tests on one lot of disk and CPU today.
2) FY10 procurements
- We have started the process of updating the procurement documentation for FY10
procurements.
3) We have agreed the change request to move CMS to T10KB drives and are working on
implementation. Initial testing is underway.
4) We have placed an order for a second C300 core network switch (not funded by GRIDPP). This
is to act initially as a cold standby switch in the event of a major failure of the main core network
switch. Eventually the second switch may be deployed in parallel with the existing switch to offer
greater operational resilience.
Service:
1) SAM test availability for the ops VO was 100%.
2) We are working on an upgrade strategy for CASTOR from 2.1.7 to 2.1.8 or 2.1.9. We expect to
discuss it with the UK VO representatives next week and then at the PMB.
3) We are starting the LHCb drain of problematic RAID 5 disk servers as agreed with LHCb. The
aggressive draining led to failed SAM tests (ops VO); however, the LHCb VO tests remained OK owing
to the longer timeout, and LHCb were satisfied with the process agreed.
4) LHCb 3D database streaming had problems last week [probably fixed now but no authoritative
update available this morning]
5) FTS will be upgraded to version 2.2.3 on Wednesday in order to meet WLCG baseline versions
and provide checksumming functionality.
SI-2 ATLAS weekly review & plans
---------------------------------
RJ reported that there had been an issue at RAL yesterday. Cambridge had a broken install which
was now fixed - production load and real data were expected. Re the use of Cream and SCAS,
ATLAS have done testing with Cream and had encountered problems in relation to tokens - Cream
was not much use to them until it was fixed. RJ advised that ATLAS policy was that GDB were
pushing Cream, not ATLAS, and ATLAS were not happy to have it deployed at present. Re glexec,
the concerns were coming from Security, not ATLAS. RJ confirmed there was no push from ATLAS
about either of these, certainly not until they were fixed and problems had been ironed out. DB
commented that Graeme Stewart was managing this at present but operationally he agreed it
wasn't good. DB asked JC if it was deployed and testable? JC noted yes, at 3 sites. The GDB and
the MB were pushing this to ensure testing. DB advised that the MB had agreed to suspend
security policy until April 1st - the exemption could be extended or there would be a move to
using glexec to satisfy Security policy - were we ready for either of these scenarios? JC advised no.
DB observed this could be messy. JC noted it was deployed at a number of sites but not all, and
bugs remain. DB asked JC to keep abreast of this, especially as he attends the GDB. JC confirmed
that what has been deployed up until now has not been tested. DB noted that whilst not all sites
want to do this, they should not be surprised at a short timescale request to move. There has been
plenty of warning.
DB asked whether, given that three sites were deploying SCAS and glexec, CMS had used them. These were
deployed at Glasgow, Lancaster, Oxford and Manchester. DC reported that CMS occasionally use
Oxford and Manchester - he would check and see if they've been used. He confirmed that CMS
would use a site if SCAS and glexec were installed there.
SI-3 CMS weekly review & plans
-------------------------------
DC reported that there wasn't much happening, they were ticking along, starting Monte Carlo
production. There had been a problem at RAL PPD but all in all they weren't in bad shape.
SI-4 LHCb weekly review & plans
--------------------------------
GP reported as follows:
1. Problem with "resolv.conf" on a T1 diskserver preventing access to data on the diskserver by
user jobs and interactive use. Fixed by Shaun on 8 March.
2. Various problems over the week with jobs failing to access data at RAL. The data was on RAID5
servers on lhcbDst space token which was already being drained by Brian. To finish the draining
in a reasonable time, RAL-DST (lhcbDst) was banned within LHCb on Thursday night and
intensive drains were started on Friday morning. The LHCb lhcbDst RAID 5 diskservers were all
drained on Saturday and the space token has been put back in production today.
3. Problem with LFC at RAL. New record created at RAL on 2 March on the LHCb LFC - should not
have happened as RAL was read-only. The Read-Only user in the Oracle dB had been created with
write permissions (fixed now). The user was setting up Nagios tests based at Oxford and it is not
clear why there should have been a request to create a new record sent to RAL. Investigations
ongoing.
4. Problem with uploading data out of some UK tier-2 sites ongoing. It is usually a very small
problem on most sites, but Glasgow is particularly hit by it and has been banned within LHCb.
Other sites are within the LHCb mask and accepting jobs.
SI-5 Production Manager's Report
---------------------------------
JC reported as follows:
1) Registration for the storage workshop being run with GridPP24 was now open:
http://www.gridpp.ac.uk/gridpp24/StorageWorkshopRegistration.html . The funding was agreed
at 15 places and was being allocated on a first-come first-served basis.
2) There is going to be a joint NGS-GridPP operations meeting in April to better understand the
functions of, and directions required for, NGI operations work. Please let me know any particular
points the PMB wish considered.
3) Some concerns have arisen with ATLAS users (at Liverpool) being able to submit an (arbitrary
number of) Ganga jobs (direct and avoiding the pilot system) with data access performed by rfio –
which very quickly degrades the performance of DPM. It may be that the queue being used can be
disabled but if not this raises a few questions as ATLAS suggest the site should optimise rfio and
the site points out that there is no one good optimisation for all cases.
4) The EGEE league table for February has been published:
https://edms.cern.ch/document/963325 . Three GridPP sites are mentioned as not hitting the
availability/reliability targets.
UKI-LT2-UCL-CENTRAL: Scheduled downtime during 10 days due to 'lfsck'ing Lustre'.
UKI-SOUTHGRID-BRIS-HEP: Scheduled downtime during the whole month due to DPM retirement
& lcgce04's WN configuration.
UKI-SOUTHGRID-RALPP: Unscheduled downtime due to problems with air conditioning.
We are currently looking into the Bristol case as although some components were in downtime
the site remained fully functional and ran jobs.
5) The status for the CREAM CE remains as last week. No further news on releases for SGE or
Condor has appeared. Both ATLAS and CMS have reported problems using CREAM.
Additional:
(A) There is a GDB in Amsterdam next Wednesday 24th March:
http://indico.cern.ch/conferenceDisplay.py?confId=84636
A discussion of the pilot jobs status is expected to take place.
(B) Regional Nagios monitoring had been using dteam but has now returned to ops.
SI-7 Dissemination Report
--------------------------
SP reported that Neasan was drafting a press release re the collisions event at CERN. He will
probably ask for a quote from us (with his ATLAS hat on). There was nothing new to report on the
R89 opening. A news item on KE was being drafted. GridTalk had been approved for funding,
QMUL and Imperial were involved.
AOB
===
NG noted that due to CERN rearranging 'first beam' for March 30th, there might be fewer people
at the Tier1/R89 Opening at RAL. Invitations were extended to the PMB. It was noted that
unfortunately we would be at RHUL at that time. NG advised that we could encourage others to
attend the NGS event, as spaces would be available.
REVIEW OF ACTIONS
=================
354.2 JC to consult with site admins on a framework policy for releases, with a mechanism for
escalation, plus a mechanism for monitoring. JC reported that the consultation happened. There
were a few suggestions in the deployment team about how to progress in this area. It needs
writing up and an implementation plan. JC to progress. Ongoing.
366.8 AS to confirm that the Tier-1 proposes to use Tape-based storage in the period 2011 - 2015.
DB advised this related to long-term plans and power capacity. Physical footprint space?
Alternatives? AS had sent tech questions round the team and would forward inputs when
available. AS noted that alternative further costings were required. AS to progress. Ongoing.
379.1 Re GridPP4 proposal and forthcoming PPRP meeting: SP to begin work on 'background'
financial planning. Done, item closed.
379.2 Re GridPP4 proposal and forthcoming PPRP meeting: AS to look at the CERN hardware
paper and work on the CPU and disk costings. Done, item closed.
379.3 Re GridPP4 proposal and forthcoming PPRP meeting: SP to add more detailed information
to the WBS. Ongoing.
379.5 RM/SP to assimilate the information in DB's paper on NGI within the GridPP4 Proposal, and
circulate a new updated paper before next week's PMB. This would be a transition document
addressing the possibility that:
1. There would be no NGI;
2. There would be no future funding for NGS. Done, item closed.
379.7 JC to follow-up the issue of merging VO lists and ILDG VO. Done, item closed.
380.1 SL to circulate an Agenda for the Deployment Board meeting at RHUL. Ongoing.
380.2 ALL: to send SL information on infrastructure investments at their respective institutes.
Done, item closed.
380.3 AS to send SL assumptions re electricity (in relation to investments in infrastructure).
Done, item closed.
380.4 SP to send SL historical numbers on unfunded effort (in relation to investments in
infrastructure). Done, item closed.
380.5 RM/SP to make changes to the EGI/NGI paper as discussed and bring back a revised
version to next week's PMB. Ongoing.
380.6 ALL: to feedback comments on the EGI/NGI paper to DB, RM or SP before next week's PMB.
Done, item closed.
380.7 Re the OPN backup link: AS to find out: 1. When the link is supposed to be operational; 2.
More detail about how and when the link will be tested. If possible AS should delay Invoice
payment until more information was forthcoming. Ongoing.
380.8 RJ/DC to advise us of what the experiment plans are in the UK in relation to SCAS and
glexec. Done, item closed.
380.9 RJ/DC to send info to DB regarding resource estimates for the upcoming period, as this info
will be needed after the PPRP. Ongoing.
ACTIONS AS AT 15.03.10
======================
354.2 JC to consult with site admins on a framework policy for releases, with a mechanism for
escalation, plus a mechanism for monitoring. JC reported that the consultation happened. There
were a few suggestions in the deployment team about how to progress in this area. It needs
writing up and an implementation plan. JC to progress.
366.8 AS to confirm that the Tier-1 proposes to use Tape-based storage in the period 2011 - 2015.
DB advised this related to long-term plans and power capacity. Physical footprint space?
Alternatives? AS had sent tech questions round the team and would forward inputs when
available. AS noted that alternative further costings were required. AS to progress.
379.3 Re GridPP4 proposal and forthcoming PPRP meeting: SP to add more detailed information
to the WBS.
380.1 SL to circulate an Agenda for the Deployment Board meeting at RHUL.
380.5 RM/SP to make changes to the EGI/NGI paper as discussed and bring back a revised
version to next week's PMB.
380.7 Re the OPN backup link: AS to find out: 1. When the link is supposed to be operational; 2.
More detail about how and when the link will be tested. If possible AS should delay Invoice
payment until more information was forthcoming.
380.9 RJ/DC to send info to DB regarding resource estimates for the upcoming period, as this info
will be needed after the PPRP.
INACTIVE CATEGORY
=================
359.4 JC to follow up dTeam actions from the DB, as follows:
---------------------------
05.02 dTeam to try and sort out CPU shares and priority resources, at
Glasgow first (perhaps by raising the job priority in Panda).
---------------------------
JC would check the situation with Graeme Stewart (who was currently on annual leave).
JC followed up with Graeme and the other experiments. A test was
started but this area has been deemed low priority and further
progress is not expected for some time. ATLAS see no issues with
contention. LHCb are not intending to pursue anything in this area. A
CMS discussion has started but again it does not appear to be urgent.
If the experiments are not pushing this internally then there is
nothing for the deployment team to follow up!
It was noted there was no priority in ATLAS at present, this will be pending for a while. Move to
inactive as it is a long-term action.
---------------------
The next PMB would take place on Monday 22nd March at 12:55 pm.