Dear All,
Please find attached the GridPP Project Management Board Meeting minutes
for the 428th meeting.
The latest minutes can be found each week in:
http://www.gridpp.ac.uk/php/pmb/minutes.php?latest
as well as being listed with other minutes at:
http://www.gridpp.ac.uk/php/pmb/minutes.php
Cheers, Dave.
--
________________________________________________________________________
Prof. David Britton GridPP Project Leader
Rm 480, Kelvin Building Telephone: +44 141 330 5454
School of Physics and Astronomy Telefax: +44-141-330 5881
University of Glasgow
G12 8QQ, UK
________________________________________________________________________
GridPP PMB Minutes 428 (06.06.11)
================================
Present: Dave Britton (Chair), Dave Colling, Jeremy Coles, Pete Gronbech, Robin Middleton, Glenn
Patrick, Dave Kelsey, Steve Lloyd, John Gordon, Pete Clarke, (Suzanne Scott - Minutes)
Apologies: Tony Doyle, Roger Jones, Tony Cass, Andrew Sansum, Neil Geddes
1. AHM Paper status
====================
DC reported that he was still awaiting info. from the Tier-1 and ATLAS. DB advised that RJ was
interviewing today and would not be present, although he may have time afterwards to deal with
this. DB would also check status with AS. JG thought that the deadline might be extended. DC
advised that he needed a couple of paragraphs only from each, so that he could pull things
together to provide a couple of pages of text.
ACTION
428.1 RJ and AS to respond to DC regarding inputs for the AHM paper.
2. Speakers for ACAT conference
================================
DB reported there was an Advanced Computing & Analysis Techniques in Physics Research (ACAT)
conference taking place at Brunel in September. DB advised that this was a better opportunity for
GridPP than the AHM to get publications - there was a procedure for refereeing the papers and
they would be published with a good impact-factor. DB was organising part of the conference and
wanted to identify people to speak. Did we have a list of submissions from GridPP to the AHM?
None were known. DC noted there was one on CMS and clouds. JG noted that Jens had a few
papers lined up. DB advised that this material could be re-used for ACAT. DB asked if there were
any other people we could contact? DC noted a new person dealing with ganga at Imperial - he
would look into this for Track 1.
ACTION
428.2 DC to check at Imperial regarding the new person dealing with ganga, in relation to a talk at
ACAT.
DB advised that the deadline was 2nd July. DB noted that within Track 1, grid and cloud
computing was a broad area. Was a security talk possible? DK indicated it was possible. DB
thought that more general talks might also be possible, however more targeted talks would be
good, e.g. adaptive data placement. It could be a more GridPP-focussed talk with an emphasis on
networking. Then there were new architectures and many-core - possibly Dave Newbold or Simon
Metson, or Phil Clark might be suitable. JC noted Andrew Washbrook as well. DB asked about
virtualisation - was there someone appropriate at the Tier-1? JC noted Martin or Ian Collier. The
Tier-1 was doing a fair bit of virtualisation of infrastructure. JG agreed to forward DB's email to
the Tier-1. DB noted other topics also, which were less related to GridPP and he asked if PG could
consider something on monitoring? PG would look at the topics and see if anything were possible.
DB noted that Track 2 was data analysis, algorithms and tools, with a subset list. These more
naturally fell under the experiments' brief rather than GridPP. The third Track was computation
in theoretical physics, which was probably outwith our remit. DB emphasised that we could take
the opportunity to submit abstracts.
3. Accounting - HS06 etc
=========================
JG advised that Alessandra Forti and Martin Bly had been skeptical about the published figures.
SL agreed that there was a general feeling that the figures were not correct, which was borne out
by his measurements. PG commented that we all knew that HEPSPEC produces a better result on
SL5 64-bit systems compared with SL4 32-bit, and some sites may not have re-run the benchmarks
after the upgrade. There ensued a discussion on HEPSPEC, sites, and CPU. It was agreed that
HEPSPEC was not a proper benchmark of ATLAS code. SL emphasised that machines were all
different, and HEPSPEC took some combination of CPU speed, memory, I/O etc. into account but
apparently not the right combination for ATLAS code. SL noted we could get the production right
at least. DB noted that for ATLAS, using results from production jobs would be the easiest thing.
DB summarised the PMB view that HEPSPEC06 was not helping - using production jobs to get
empirical numbers was the best way to proceed pragmatically.
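
As an illustration of the empirical approach summarised above, the following is a minimal sketch, not something presented at the meeting. It assumes that each site can supply, for a batch of production jobs, the number of events processed, the CPU seconds used and the HS06-per-core value it publishes. The function names, site labels and numbers are all hypothetical placeholders.

```python
from statistics import median


def events_per_hs06_second(events, cpu_seconds, hs06_per_core):
    """Events processed per HS06-second for one batch of production jobs."""
    if cpu_seconds <= 0 or hs06_per_core <= 0:
        raise ValueError("cpu_seconds and hs06_per_core must be positive")
    return events / (cpu_seconds * hs06_per_core)


def relative_to_median(rates):
    """Map each site to its rate divided by the median rate across all sites."""
    mid = median(rates.values())
    return {site: rate / mid for site, rate in rates.items()}


# Placeholder numbers for illustration only; these are NOT measurements
# from any real site. Each tuple is (events, CPU seconds, published HS06/core).
sample = {
    "site_a": (100_000, 50_000.0, 10.0),
    "site_b": (100_000, 40_000.0, 10.0),
    "site_c": (100_000, 50_000.0, 8.0),
}

rates = {site: events_per_hs06_second(*rec) for site, rec in sample.items()}
spread = relative_to_median(rates)
```

If the published HS06 values described the production workload well, the entries of spread would all sit close to 1. A site well away from 1 would be a candidate for re-running the benchmark or for a closer look at how jobs are directed to its nodes.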
DB reported that he had been discussing Lancaster with RJ. The issue was still being investigated,
however RJ had reported that waiting jobs were not an issue, as this depended on Panda. The
peak number of jobs was the more interesting issue, and RJ was looking into this. DB reported
that the Glasgow team were going to let DB know what jobs they were receiving from other clouds
- this was currently under investigation. RJ thought that the issue of Lancaster not being full
probably rested on several reasons internal to ATLAS - the Panda system operated in a particular
way and there was the internal issue of Panda brokering. RJ also wanted to measure resources
available, not just resources used. The Glasgow cloud issue and the Lancaster capacity number
were to be continued. DC noted his disagreement with using 'resources available'. DB noted that the
issue was internal to ATLAS - they knew globally that they were not using the resources that were
there, due to the issue of Panda brokering. This had nothing to do with sites not providing
resources to ATLAS.
4. AOCB
========
- networking
DB reported that David Salmon had sent notes and slides from the network meeting that had
taken place in Paris. There was a specific request to GridPP to check the situation with respect to
the Tier-2s:
1. check whether UK Tier-2 resources were on well-defined sub-nets within the universities;
2. ask Tier-2 sites to monitor traffic levels in and out of the Tier-2 resources
ACTION
428.3 JC to compile an info list relating to sub-nets at sites.
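
One way a site might check the first request (Tier-2 resources on well-defined sub-nets) is sketched below. This is only an illustration under stated assumptions: the site is assumed to have a list of Tier-2 host addresses and a list of sub-nets it believes they occupy. The function name is hypothetical and the addresses are taken from the reserved documentation ranges, not from any real site.

```python
import ipaddress


def hosts_outside_subnets(host_ips, declared_subnets):
    """Return the host addresses that fall in none of the declared sub-nets."""
    nets = [ipaddress.ip_network(s) for s in declared_subnets]
    outside = []
    for ip in host_ips:
        addr = ipaddress.ip_address(ip)
        if not any(addr in net for net in nets):
            outside.append(ip)
    return outside


# Documentation-range addresses used purely as placeholders.
example_hosts = ["192.0.2.10", "192.0.2.200", "198.51.100.7"]
example_subnets = ["192.0.2.0/24"]
stragglers = hosts_outside_subnets(example_hosts, example_subnets)
```

An empty result would suggest that the declared sub-nets cover the Tier-2 hosts. Any address returned would need either an additional sub-net in the list or a closer look at where that host sits.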
DB asked if it were possible to measure the traffic volume in and out of the Tier-2s? This was
about co-existence with different resources in Europe. DB advised that everything was under
control at this point and there was no proposal to do anything, however there was a need to keep
an eye on things. PC asked why the Network Document was not sufficient for David's purposes?
It provided at least 60% of what he needed to know? DB advised that they were asking us to
measure volume. JG noted that we could monitor the FTS but Tier-2 to Tier-2 traffic was difficult
as there were many kinds of dataflows. DB agreed that it would be overkill to do this for every
site, but some of the larger sites could provide useful information. DB asked JC to find out if there
were an easy way to measure this, and whether any monitoring was already in place.
DB noted that overall this was a longer term issue and that we couldn't commence a huge
programme of work, however we could compile some info just now. PC suggested that the
timescale for this should be the GridPP Collaboration Meeting at CERN in September. This might
be trivial to do at Glasgow, which had a separate cluster, and we could limit it to sites that were
similar. The lowest level of detail was total traffic; beyond that, it depended on how difficult the
monitoring would be.
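
For the lowest level of detail, total traffic, a minimal sketch is given below. It assumes that a site can periodically read a cumulative byte counter for the link serving its Tier-2 cluster (how the counter is obtained is left open), that the counter wraps at a known width, and that it wraps at most once between consecutive readings. The function name and the sample readings are hypothetical.

```python
def bytes_transferred(samples, counter_bits=64):
    """Total bytes moved between the first and last reading of a cumulative,
    wrapping byte counter, assuming at most one wrap between readings."""
    modulus = 1 << counter_bits
    total = 0
    for previous, current in zip(samples, samples[1:]):
        # The modulo turns a negative difference caused by a wrap
        # back into the true positive increment.
        total += (current - previous) % modulus
    return total


# Placeholder readings for illustration only.
inbound_total = bytes_transferred([100, 400, 1000])
wrapped_total = bytes_transferred([2**32 - 10, 5], counter_bits=32)
```

Running this separately on inbound and outbound counters would give the total traffic in and out of the Tier-2 resources over the sampling period, without needing to distinguish the many kinds of dataflows.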
ACTION
428.4 JC/PC to ask through the Ops Team or HEPSYSMAN whether or not there was an easy way
to measure Tier-2 traffic, and to find out what was possible at Tier-2 sites.
PC asked that David Salmon be reminded of the Network Document, which did contain the bulk of
information which he required. DB agreed to follow this up.
ACTION
428.5 DB to contact David Salmon and apprise him of the Network Document which had already
been produced and contained our 'best knowledge' at present. He would also advise DS that we
would progress his request and see what we could provide in terms of traffic measurement.
- Resource Meeting
GP reported that the issue of extra disk had arisen at the Resource Meeting - he would need to ask
AS about this.
ACTION
428.6 AS to come up with a proposal for how to use the current disk buffer at the Tier-1.
STANDING ITEMS
==============
SI-1 Tier-1 Manager's Report
-----------------------------
AS was not present.
SI-2 Production Manager's Report
---------------------------------
JC reported as follows:
1) There was an update of the UK VOMS that led to T2K job failures (proxy problems) during the
“at risk” period. T2K are also suffering due to jobs exceeding queue memory limits.
On the topic of Steve's observed HS06 spreads seen across the sites, many of you will have read the
comments on TB-SUPPORT. In particular Martin Bly's remarks about the nature of the current
environment leading to distortions: "the prevalence of 64bit over 32bit since we did the original
tests, the I/O regime in which the tests are performed, changes to the code bases, to name some. I
suspect that I/O regimes will make the greatest difference to events/HS06 for two otherwise
identical nodes" and Alessandra Forti's comment about the test (and user) jobs being directed to
slower nodes in the cluster (and the impact of fairshares).
SI-3 ATLAS weekly review & plans
---------------------------------
In absentia, RJ reported briefly as follows:
ATLAS Status: Tier-1
- Testing xrootd queue at RAL
- Questions about the number of concurrent jobs running at RAL from our side – does this sound
familiar?! We may need more pilots at RAL.
- Frontier server switching from PIC to Lyon.
- Cernvmfs testing is going well.
ATLAS Status: Tier-2
- Minor T2 issues. Four more sites up for T2D sonar tests. All look OK on current tests.
SI-4 CMS weekly review & plans
-------------------------------
DC reported minor problems at the Tier-1 in relation to job and disk pools; generally everything
had been ok over the last week. For the Tier-2s, all of the UK had been at 100% (except Bristol). SAM
tests were Nagios-based, and there were some differences as a result.
SI-5 LHCb weekly review & plans
--------------------------------
GP reported as follows:
1) A backlog of jobs (~4500 jobs at peak) built up at UK T1 over the week with its peak on
Thursday. For various reasons (batch farm full, flickering publishing in the BDII, possibly a CREAM
issue?), RAL was not picking up LHCb jobs. Moved to direct submission of jobs to lcgce09 on Friday
and since then the backlog has almost been eliminated (~250 jobs on Monday morning).
2) RAL share of new data set to 0 until the backlog was eliminated. Expect it to be increased this
week again.
3) Added 6TB diskserver on Friday to lhcbRawRdst (d0t1) to help with above issue.
4) Large number of failures due to "input data resolution" mainly because of the time the jobs
have been waiting - files have been garbage collected by Castor and will need to be restaged
(being done automatically as needed).
5) Smooth running at Tier-2 sites.
SI-6 User Co-ordination issues
-------------------------------
GP noted nothing to report.
SI-7 LCG Management Board Report
---------------------------------
DB advised that the next meeting was tomorrow.
SI-8 Dissemination Report
--------------------------
SL reported that the Magic Cubes had arrived, and they had already been paid for.
REVIEW OF ACTIONS
=================
400.4 SL to co-ordinate changing the current GridPP MoU towards an MoU for GridPP4. In
progress - document had been circulated. Any corrections to be sent to SL. Ongoing.
409.1 JC to revisit document with a GridPP-NGI-NGS structure, not use the document Dave
Wallom produced. JG will provide input. Visions for today and for the future. Ongoing.
424.3 DB to contact ALICE-UK about Tier-2 resources. Ongoing.
424.6 DC to complete CMS metrics - DC would circulate this after the meeting tomorrow. Done,
item closed.
424.10 DB to contact JG to suggest topics for CERN Meeting. Done, item closed.
425.7 DC to have an internal discussion within CMS relating to use of future technology and
evolution of the computing model, from September to the next couple of years. DC to come up
with possible suggestion of theme/topics for GridPP27 at CERN. Ongoing.
425.8 AS to consider any longer-term issues relating to storage, DPM, databases etc, and come
back to DB with any ideas for sessions at GridPP27. Ongoing.
427.1 Re Tier-2 accounting figures: DB to contact RJ and ask him to explain why there were so
many jobs waiting at Lancaster, when they had such a large share available. Done, item closed.
427.2 Re Tier-2 accounting figures: DB to contact RJ and ask him about Glasgow getting
production jobs from other clouds, when other sites don't. DB would also check with the Glasgow
team. Done, item closed.
427.3 DB to circulate an email to the CB re the OC outcome and the finalising of GridPP3, and
point the CB at the documents. He would advise that a CB meeting might be useful in around 6
months' time, after the accounting period. Done, item closed.
ACTIONS AS OF 06.06.11
======================
400.4 SL to co-ordinate changing the current GridPP MoU towards an MoU for GridPP4. In
progress - document had been circulated. Any corrections to be sent to SL.
409.1 JC to revisit document with a GridPP-NGI-NGS structure, not use the document Dave
Wallom produced. JG will provide input. Visions for today and for the future.
424.3 DB to contact ALICE-UK about Tier-2 resources.
425.7 DC to have an internal discussion within CMS relating to use of future technology and
evolution of the computing model, from September to the next couple of years. DC to come up
with possible suggestion of theme/topics for GridPP27 at CERN.
425.8 AS to consider any longer-term issues relating to storage, DPM, databases etc, and come
back to DB with any ideas for sessions at GridPP27.
428.1 RJ and AS to respond to DC regarding inputs for the AHM paper.
428.2 DC to check at Imperial regarding the new person dealing with ganga, in relation to a talk at
ACAT.
428.3 JC to compile an info list relating to sub-nets at sites.
428.4 JC/PC to ask through the Ops Team or HEPSYSMAN whether or not there was an easy way
to measure Tier-2 traffic, and to find out what was possible at Tier-2 sites.
428.5 DB to contact David Salmon and apprise him of the Network Document which had already
been produced and contained our 'best knowledge' at present. He would also advise DS that we
would progress his request and see what we could provide in terms of traffic measurement.
428.6 AS to come up with a proposal for how to use the current disk buffer at the Tier-1.
Forthcoming PMB meeting dates were as follows, at the usual time:
Mon June 13th
Mon June 27th
Mon July 11th
Mon July 25th
Mon Aug 8th
Mon Aug 22nd
Mon Sep 5th
TUE Sep 13th F2F@CERN
Mon Sep 26th