Dear All,
Please find attached the GridPP Project Management Board Meeting minutes
for the 429th meeting.
The latest minutes can be found each week in:
http://www.gridpp.ac.uk/php/pmb/minutes.php?latest
as well as being listed with other minutes at:
http://www.gridpp.ac.uk/php/pmb/minutes.php
Cheers, Dave.
--
________________________________________________________________________
Prof. David Britton GridPP Project Leader
Rm 480, Kelvin Building Telephone: +44 141 330 5454
School of Physics and Astronomy Telefax: +44-141-330 5881
University of Glasgow
G12 8QQ, UK
________________________________________________________________________
GridPP PMB Minutes 429 (13.06.11)
=================================
Present: Dave Britton (Chair), Dave Colling, Jeremy Coles, Pete Gronbech, Robin Middleton, Glenn
Patrick, Dave Kelsey, Steve Lloyd, John Gordon, Pete Clarke, Tony Doyle, Roger Jones, Andrew
Sansum (Suzanne Scott - Minutes)
Apologies: Tony Cass, Neil Geddes
1. Accounting Issues
=====================
DB suggested that we continue the discussion on accounting. SL reported that he had responded
to a query from Mike Seymour but had not heard back yet, and had nothing to add since last time
as there was no change. DB asked if there was consensus regarding Glasgow receiving jobs from
other clouds. SL commented that other sites should also get work from elsewhere rather than stopping
Glasgow getting jobs. JC asked about the issue of capping. DB advised that the PMB had agreed
not to do that at the last meeting, but the issue could be re-visited if required. DB noted that
ATLAS wanted a couple of large Tier-2s in each country connected to multiple clouds in order to
help with load-levelling, and we should encourage sites to ask, but it was ATLAS' decision. SL
considered that in relation to Glasgow there was not a very large effect generally, and he thought
we should carry on as we are and check it again later. DB noted he couldn't tell what jobs came
from where. JG advised that some sites had been complaining that there wasn't enough work. DB
had looked back at the correlation, and this related to the oscillating nature of the job load in the
UK generally compared with other clouds - it went from zero to peak - and this didn't seem to
happen with other clouds. TD noted that Graeme Stewart could probably answer that. DB
suggested that this could be discussed at the CERN meeting.
DB advised that at STFC there was a change in the way capital would be funded, but that this
shouldn't affect GridPP4. Tony Medland had said that the 'old' rules applied and that it didn't
affect GridPP4.
DB noted that the other outstanding accounting issue was that of Lancaster. SL reported that he
had had no contact with RJ. DB advised that their resources weren't being fully utilised, relating
to brokering of two VOs. RJ was to pursue this and sort it out in order to receive more jobs. PG
commented that if the site were not being used then it was the site's problem as well as the
experiments'. DB noted it was possibly a brokering system issue relating to the 'slow start'
problem. PG advised that in general, sites should be addressing this and speaking to experiments
etc, being proactive in response. RJ needed to report-back on the situation.
JC advised that Lancaster had other issues as well - in recent ops meetings they were often
mentioned as an ATLAS site in the brokeroff state. JG noted they had publishing issues in gstat as
well. DB advised that we would ask RJ in two weeks' time.
RJ joined the meeting at this point. DB asked for an update on Lancaster. RJ reported that there
were no developments in relation to ATLAS but they do seem to be flat-topping on ATLAS jobs. He
needed to look at it further in order to understand this. RJ had provided comments re the 'other'
VOs being on the older cluster. Lancaster were still not filling slots. DB noted that ATLAS was
using QMUL resources - was Lancaster not set up correctly? RJ advised that the set-up of slots
might be an issue. Panda brokering does not discover the power of the resources; it looks at
job-slots only. RJ noted they were not filling empty jobs, but that the pilots were going through OK. SL
advised that QMUL had 5,700 jobs running at present. DC also noted that Imperial (IC) had had 1000
ATLAS jobs running recently. RJ noted that this was brokered at the other side, and you couldn't
simply pull-in from a queue. DB asked whether the current total of CPU should be used. RJ
advised that it was usable, but just wasn't being used. RJ added that they also were not getting as
much from LHCb as usual. RJ noted there was no inherent problem with running jobs, but it was
rare that Lancaster was full. They also had other demands on the cluster.
DB advised that it was two weeks until the next PMB. Could RJ sort out the issue that was
preventing ATLAS from using available slots in that time? RJ said yes, he could try.
STANDING ITEMS
==============
SI-1 Tier-1 Manager's Report
-----------------------------
AS reported as follows:
Fabric:
1) FY11 procurements
- EU tender for disk framework agreement PQQ stage being evaluated (eval meeting today)
- CPU framework expected to go out shortly (running late but nearly ready)
2) SL08 remains out of production.
- Concluded that original problem (lost raid set after single drive failure) resolved
- Further problem with new drives not recognised, now understood to be inconsistent device
driver update - now resolved and last 7 day test run to gain confidence
- Outstanding question of 3*(multi drive failure) in May, but drive failure rate generally high in
May (double); cause unknown at the moment. Plan to redeploy shortly into T1D0 service classes.
Service:
Generally operations running reasonably smoothly.
1) Summary of operational issues is at:
http://www.gridpp.ac.uk/wiki/Tier1_Operations_Report_2011-06-08
2) CASTOR
* LHCb experienced problems where recalled files were garbage collected before being used, causing
thrashing of the tape recall system. A new garbage collection policy for LHCb is being trialled.
* Expect to upgrade CASTOR tape servers to 2.1.10-1 to enable T10KC. No downtime required.
3) Databases
* Minor update (at risk) to Oracle configuration completed to resolve problem with Oracle
statistics gathering.
Staff:
1) Grid team leader post internal interviews within 1-2 weeks (being rescheduled from
Wednesday)
2) Paperwork for four other vacancies submitted to STFC for approval has not been approved -
* Two system admins for Fabric team
* One CASTOR admin
* One Grid Team member
SI-2 Production Manager's Report
---------------------------------
JC reported as follows:
1) EGI has released version 1.0 of the EGI Operational Level Agreement document:
https://documents.egi.eu/document/31. The document covers the services a resource centre is
expected to provide and the associated service levels. The main measures are:
“1. The Resource Centre MUST be available (UP) at least 70% of the time per month (daily
availability is measured over 24 hours).
2. Resource Centre reliability MUST be at least 75% per month.”
2) In the ops meeting last week most GridPP Tier-2 sites confirmed that they are on sub-nets
within their university. The majority of site administrators have their own, or access to, useful site
monitoring (mainly cacti or ganglia based) of network traffic. The topic of monitoring and site
configuration is of widespread interest and will be explored further during site update talks at the
HEPSYSMAN meeting at the end of the month.
3) At the GDB (http://indico.cern.ch/conferenceDisplay.py?confId=106645) last week the
“Security futures” discussion indicated that the glexec discussion is likely to reopen over the
coming months, at first in the context of a working group being led by Jeff Templon and Markus
Schulz. The technical discussion group will attempt to distil core issues and proposals in
numerous areas where a more joined-up or simplified approach may benefit WLCG in the
medium/long term. The immediate approach remains to use glexec and integrate this with the
experiment frameworks.
JG reported that he had a few contacts in relation to a private group to come up with solutions. If
this covers too many different areas, it may be difficult to do it in any depth. JG noted that security
experts would be at HEPSYSMAN. DB agreed that we had a significant interest in the security side
of things.
4) The (provisional) Tier-2 reliability: availability figures for May (http://tinyurl.com/6fqwwc3)
indicate problems at UCL-HEP (41%:28%) due to unresolved CREAM-CE problems; EFDA-JET
(73%:49%) and Birmingham (87%:87%) which had disk/controller problems.
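
As an illustration only, the provisional May figures in item 4 can be set against the two EGI OLA
measures quoted in item 1 (availability at least 70%, reliability at least 75%). The sketch below
is not part of the meeting record and the minutes do not themselves make this comparison; the
function and variable names are hypothetical, and the figures are the provisional ones listed
above, read as reliability:availability.

```python
# Thresholds quoted from the EGI OLA v1.0 text in item 1 (percent per month).
MIN_AVAILABILITY = 70
MIN_RELIABILITY = 75

# Provisional May figures from item 4, as (reliability, availability) in percent.
MAY_FIGURES = {
    "UCL-HEP": (41, 28),
    "EFDA-JET": (73, 49),
    "Birmingham": (87, 87),
}


def ola_shortfalls(reliability, availability):
    """Return the OLA measures a site misses (empty list if it meets both)."""
    missed = []
    if availability < MIN_AVAILABILITY:
        missed.append("availability")
    if reliability < MIN_RELIABILITY:
        missed.append("reliability")
    return missed


def check_sites(figures):
    """Map each site name to the list of measures it falls short on."""
    return {
        site: ola_shortfalls(reliability, availability)
        for site, (reliability, availability) in figures.items()
    }


if __name__ == "__main__":
    for site, missed in check_sites(MAY_FIGURES).items():
        if missed:
            print(f"{site}: below threshold on " + ", ".join(missed))
        else:
            print(f"{site}: meets both measures")
```

On these provisional numbers UCL-HEP and EFDA-JET fall below both measures, while Birmingham
meets both.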
SI-3 ATLAS weekly review & plans
---------------------------------
RJ noted there was not much to report - not much news at the Tier-1, which had been OK over the past
week or so. There had been air conditioning issues at Manchester, which were now fixed. They also had
DPM problems, and there had been a squid problem. Things were generally functional: ATLAS
were trying to do hammercloud tests, which were showing higher failure rates, and they were helping
with the configuration. Questions were being asked about future resource requests. DB asked
about the ATLAS ongoing global resources for disk - in the UK at the end of the accounting period,
funds needed to be spent. We also had unused disk. RJ advised that analysis jobs generally were
drifting to the Tier-1s. If we placed more data at the Tier-2s then this would balance it out. The
resource request from ATLAS was submitted recently. DB noted the issue of the forward look in
3-4 years as well. Prior to the pledge in October we needed to decide what we were doing with
the limited UK Tier-2 resources generally. We were in a transition phase.
SI-4 CMS weekly review & plans
-------------------------------
DC noted that everything was positive at present - the Tier-1 was running well, the Tier-2 was in
107% readiness, and availability was great.
SI-5 LHCb weekly review & plans
--------------------------------
GP reported that last week there had been problems with RAL - the Tier-1 was set to nominal
share, a new lot of data was needed. Garbage collection had also been an issue. PC advised that all
of the Tier-1s were empty at present, but stripping jobs were due and the re-stripping of 2010
data would be commencing tomorrow. AS advised that the job start rate needed to be changed - it
was inadequate at present. The change was done but was not yet permanent. He would warn the
team about the stripping work which was imminent. PC advised that they may use the Tier-2s for
re-processing in the future. They were also doing pilot work with Manchester in the UK for a few
months.
SI-6 User Co-ordination issues
-------------------------------
There were no issues to report.
SI-7 LCG Management Board Report
---------------------------------
DB advised there had been a discussion about the timeline for the glexec report in relation to the
identity federation workshop. JG noted that there had been two separate problems at RAL with
glexec but that they had been resolved. JG advised that ATLAS had highlighted the poor level of
support provided by the Netherlands Tier-1, which did not always respond.
AOB
===
PG advised that he needed RJ and DC to assist with tightening up the metrics for GridPP4. DC
confirmed that what PG now had was correct. PG would compile a template report and send this
round. RJ advised that he was happy with the metrics, but less happy about their ability to
measure them, due to changes in the dashboard.
PG noted he was happy that the metrics had been agreed by all sides, so he would send out
template reports for review.
REVIEW OF ACTIONS
=================
400.4 SL to co-ordinate changing the current GridPP MoU towards an MoU for GridPP4. In
progress - document had been circulated. Any corrections to be sent to SL. Ongoing.
409.1 JC to revisit document with a GridPP-NGI-NGS structure, not use the document Dave
Wallom produced. JG will provide input. Visions for today and for the future. Done, item closed.
424.3 DB to contact ALICE-UK about Tier-2 resources. Ongoing.
425.7 DC to have an internal discussion within CMS relating to use of future technology and
evolution of the computing model, from September to the next couple of years. DC to come up
with possible suggestion of theme/topics for GridPP27 at CERN. Ongoing.
425.8 AS to consider any longer-term issues relating to storage, DPM, databases etc, and come
back to DB with any ideas for sessions at GridPP27. Ongoing.
428.1 RJ and AS to respond to DC regarding inputs for the AHM paper. Done, item closed.
428.2 DC to check at Imperial regarding the new person dealing with ganga, in relation to a talk at
ACAT. Ongoing.
428.3 JC to compile an info list relating to sub-nets at sites. Ongoing.
428.4 JC/PC to ask through the Ops Team or HEPSYSMAN whether or not there was an easy way
to measure Tier-2 traffic, and to find out what was possible at Tier-2 sites. Done, item closed.
428.5 DB to contact David Salmon and apprise him of the Network Document which had already
been produced and contained our 'best knowledge' at present. He would also advise DS that we
would progress his request and see what we could provide in terms of traffic measurement. Done,
item closed.
428.6 AS to come up with a proposal for how to use the current disk buffer at the Tier-1. Ongoing.
ACTIONS AS AT 13.06.11
======================
400.4 SL to co-ordinate changing the current GridPP MoU towards an MoU for GridPP4. In
progress - document had been circulated. Any corrections to be sent to SL.
424.3 DB to contact ALICE-UK about Tier-2 resources.
425.7 DC to have an internal discussion within CMS relating to use of future technology and
evolution of the computing model, from September to the next couple of years. DC to come up
with possible suggestion of theme/topics for GridPP27 at CERN.
425.8 AS to consider any longer-term issues relating to storage, DPM, databases etc, and come
back to DB with any ideas for sessions at GridPP27.
428.2 DC to check at Imperial regarding the new person dealing with ganga, in relation to a talk at
ACAT.
428.3 JC to compile an info list relating to sub-nets at sites.
428.6 AS to come up with a proposal for how to use the current disk buffer at the Tier-1.
Forthcoming PMB meeting dates would be as follows, at the usual time:
Mon June 27th
Mon July 11th (doodle poll required - date not suitable)
Mon July 25th
Mon Aug 8th
Mon Aug 22nd
Mon Sep 5th
TUE Sep 13th F2F@CERN
Mon Sep 26th