GridPP PMB Minutes 428 (06.06.11)
=================================

Present: Dave Britton (Chair), Dave Colling, Jeremy Coles, Pete Gronbech, Robin Middleton, Glenn Patrick, Dave Kelsey, Steve Lloyd, John Gordon, Pete Clarke, (Suzanne Scott - Minutes)

Apologies: Tony Doyle, Roger Jones, Tony Cass, Andrew Sansum, Neil Geddes

1. AHM Paper status
===================

DC reported that he was still awaiting information from the Tier-1 and ATLAS. DB advised that RJ was interviewing today and would not be present, although he might have time afterwards to deal with this. DB would also check status with AS. JG thought that the deadline might be extended. DC advised that he needed only a couple of paragraphs from each, so that he could pull things together into a couple of pages of text.

ACTION 428.1 RJ and AS to respond to DC regarding inputs for the AHM paper.

2. Speakers for ACAT conference
===============================

DB reported that the Advanced Computing & Analysis Techniques (ACAT) in Physics Research conference was taking place at Brunel in September. DB advised that this was a better opportunity than the AHM for GridPP to get publications - there was a procedure for refereeing the papers and they would be published with a good impact factor. DB was organising part of the conference and wanted to identify people to speak. Did we have a list of submissions from GridPP to the AHM? None were known. DC noted there was one on CMS and clouds. JG noted that Jens had a few papers lined up. DB advised that this material could be re-used for ACAT. DB asked if there were any other people we could contact. DC noted a new person dealing with Ganga at Imperial - he would look into this for Track 1.

ACTION 428.2 DC to check at Imperial regarding the new person dealing with Ganga, in relation to a talk at ACAT.

DB advised that the deadline was 2nd July. DB noted that within Track 1, grid and cloud computing was a broad area. Was a security talk possible? DK indicated that it was.
DB thought that more general talks might also be possible; however, more targeted talks would be better, e.g. adaptive data placement. It could be a more GridPP-focussed talk with an emphasis on networking. Then there were new architectures and many-core - possibly Dave Newbold, Simon Metson or Phil Clark might be suitable. JC noted Andrew Washbrook as well. DB asked about virtualisation - was there someone appropriate at the Tier-1? JC noted Martin or Ian Collier; the Tier-1 was doing a fair amount of virtualisation of its infrastructure. JG agreed to forward DB's email to the Tier-1. DB noted other topics also, which were less related to GridPP, and asked if PG could consider something on monitoring. PG would look at the topics and see if anything was possible.

DB noted that Track 2 was data analysis, algorithms and tools, with a subset list. These fell more naturally under the experiments' brief than GridPP's. The third Track was computation in theoretical physics, which was probably outwith our remit. DB emphasised that we could take the opportunity to submit abstracts.

3. Accounting - HS06 etc
========================

JG advised that Alessandra Forti and Martin Bly had been sceptical about the published figures. SL agreed that there was a general feeling that the figures were not correct, which was borne out by his measurements. PG commented that we all knew HEPSPEC produces a better result on SL5 64-bit systems than on SL4 32-bit systems, and some sites might not have re-run the benchmarks after the upgrade. There ensued a discussion on HEPSPEC, sites and CPUs. It was agreed that HEPSPEC was not a proper benchmark of ATLAS code. SL emphasised that machines were all different, and HEPSPEC took some combination of CPU speed, memory, I/O etc into account, but apparently not the right combination for ATLAS code. SL noted we could at least get the production figures right. DB noted that for ATLAS, using results from production jobs would be the easiest approach.
DB summarised the PMB view that HEPSPEC06 was not helping - using production jobs to obtain empirical numbers was the most pragmatic way to proceed.

DB reported that he had been discussing Lancaster with RJ. The issue was still being investigated; however, RJ had reported that waiting jobs were not an issue, as this depended on Panda. The peak number of jobs was the more interesting issue, and RJ was looking into this. DB reported that the Glasgow team were going to let him know what jobs they were receiving from other clouds - this was currently under investigation. RJ thought that the issue of Lancaster not being full probably rested on several reasons internal to ATLAS - the Panda system operated in a particular way and there was the internal issue of Panda brokering. RJ also wanted to measure resources available, not just resources used. The Glasgow cloud issue and the Lancaster capacity number were to be continued. DC noted his disagreement with using 'resources available'. DB noted that the issue was internal to ATLAS - they knew globally that they were not using the resources that were there, due to the issue of Panda brokering. This had nothing to do with sites not providing resources to ATLAS.

4. AOCB
=======

- Networking

DB reported that David Salmon had sent notes and slides from the network meeting that had taken place in Paris. There was a specific request to GridPP to check the situation with respect to the Tier-2s:

1. check whether UK Tier-2 resources were on well-defined sub-nets within the universities;
2. ask Tier-2 sites to monitor traffic levels in and out of the Tier-2 resources.

ACTION 428.3 JC to compile an info list relating to sub-nets at sites.

DB asked if it were possible to measure the traffic volume in and out of the Tier-2s. This was about co-existence with other resources in Europe. DB advised that everything was under control at this point and there was no proposal to do anything; however, there was a need to keep an eye on things.
PC asked why the Network Document was not sufficient for David's purposes, as it provided at least 60% of what he needed to know. DB advised that they were asking us to measure volume. JG noted that we could monitor the FTS, but Tier-2 to Tier-2 traffic was difficult as there were many kinds of dataflows. DB agreed that it would be overkill to do this for every site, but some of the larger sites could provide useful information. DB asked JC to find out if there were an easy way to measure this - was any monitoring already in place? DB noted that overall this was a longer-term issue and that we couldn't commence a huge programme of work; however, we could compile some information now. PC suggested that the timescale for this should be the GridPP Collaboration Meeting at CERN in September. This might be trivial to do at Glasgow, which had a separate cluster, and we could limit it to sites that were similar. The lowest level of detail was total traffic; beyond that, it depended on how difficult the monitoring would be.

ACTION 428.4 JC/PC to ask through the Ops Team or HEPSYSMAN whether or not there was an easy way to measure Tier-2 traffic, and to find out what was possible at Tier-2 sites.

PC asked that David Salmon be reminded of the Network Document, which did contain the bulk of the information he required. DB agreed to follow this up.

ACTION 428.5 DB to contact David Salmon and apprise him of the Network Document which had already been produced and contained our 'best knowledge' at present. He would also advise DS that we would progress his request and see what we could provide in terms of traffic measurement.

- Resource Meeting

GP reported that the issue of extra disk had arisen at the Resource Meeting - he would need to ask AS about this.

ACTION 428.6 AS to come up with a proposal for how to use the current disk buffer at the Tier-1.

STANDING ITEMS
==============

SI-1 Tier-1 Manager's Report
----------------------------

AS was not present.
SI-2 Production Manager's Report
--------------------------------

JC reported as follows:

1) There was an update of the UK VOMS that led to T2K job failures (proxy problems) during the "at risk" period. T2K are also suffering due to jobs exceeding queue memory limits.

2) On the topic of Steve's observed HS06 spreads across the sites, many of you will have read the comments on TB-SUPPORT. In particular, Martin Bly's remarks about the nature of the current environment leading to distortions: "the prevalence of 64bit over 32bit since we did the original tests, the I/O regime in which the tests are performed, changes to the code bases, to name some. I suspect that I/O regimes will make the greatest difference to events/HS06 for two otherwise identical nodes"; and Alessandra Forti's comment about the test (and user) jobs being directed to slower nodes in the cluster (and the impact of fairshares).

SI-3 ATLAS weekly review & plans
--------------------------------

In absentia, RJ reported briefly as follows:

ATLAS Status: Tier-1
- Testing xrootd queue at RAL.
- Questions about the number of concurrent jobs running at RAL from our side - does this sound familiar?! We may need more pilots at RAL.
- Frontier server switching from PIC to Lyon.
- CernVM-FS testing is going well.

ATLAS Status: Tier-2
- Minor T2 issues. Four more sites up for T2D sonar tests. All look OK on current tests.

SI-4 CMS weekly review & plans
------------------------------

DC reported minor problems at the Tier-1 in relation to job and disk pools; generally everything had been OK over the last week. For the Tier-2s, all of the UK had been at 100% except Bristol; SAM tests were Nagios-based, and there were some differences as a result.

SI-5 LHCb weekly review & plans
-------------------------------

GP reported as follows:

1) A backlog of jobs (~4500 at peak) built up at the UK T1 over the week, peaking on Thursday.
For various reasons (batch farm full, flickering publishing in the BDII - possibly a CREAM issue?) RAL was not picking up LHCb jobs. LHCb moved to direct submission of jobs to lcgce09 on Friday, and since then the backlog has almost been eliminated (~250 jobs on Monday morning).

2) The RAL share of new data was set to 0 until the backlog was eliminated. It is expected to be increased again this week.

3) A 6TB disk server was added on Friday to lhcbRawRdst (d0t1) to help with the above issue.

4) There were a large number of failures due to "input data resolution", mainly because of the time the jobs had been waiting - files have been garbage-collected by Castor and will need to be restaged (being done automatically as needed).

5) Smooth running at Tier-2 sites.

SI-6 User Co-ordination issues
------------------------------

GP noted nothing to report.

SI-7 LCG Management Board Report
--------------------------------

DB advised that the next meeting was tomorrow.

SI-8 Dissemination Report
-------------------------

SL reported that the Magic Cubes had arrived, and had already been paid for.

REVIEW OF ACTIONS
=================

400.4 SL to co-ordinate changing the current GridPP MoU towards an MoU for GridPP4. In progress - the document had been circulated; any corrections to be sent to SL. Ongoing.

409.1 JC to revisit the document with a GridPP-NGI-NGS structure, not using the document Dave Wallom produced. JG will provide input. Visions for today and for the future. Ongoing.

424.3 DB to contact ALICE-UK about Tier-2 resources. Ongoing.

424.6 DC to complete CMS metrics - DC would circulate this after the meeting tomorrow. Done, item closed.

424.10 DB to contact JG to suggest topics for the CERN Meeting. Done, item closed.

425.7 DC to have an internal discussion within CMS relating to the use of future technology and the evolution of the computing model, from September over the next couple of years. DC to come up with possible suggestions of themes/topics for GridPP27 at CERN. Ongoing.
425.8 AS to consider any longer-term issues relating to storage, DPM, databases etc, and come back to DB with any ideas for sessions at GridPP27. Ongoing.

427.1 Re Tier-2 accounting figures: DB to contact RJ and ask him to explain why there were so many jobs waiting at Lancaster when they had such a large share available. Done, item closed.

427.2 Re Tier-2 accounting figures: DB to contact RJ and ask him about Glasgow getting production jobs from other clouds when other sites don't. DB would also check with the Glasgow team. Done, item closed.

427.3 DB to circulate an email to the CB re the OC outcome and the finalising of GridPP3, and point the CB at the documents. He would advise that a CB meeting might be useful in around six months' time, after the accounting period. Done, item closed.

ACTIONS AS OF 06.06.11
======================

400.4 SL to co-ordinate changing the current GridPP MoU towards an MoU for GridPP4. In progress - the document had been circulated; any corrections to be sent to SL.

409.1 JC to revisit the document with a GridPP-NGI-NGS structure, not using the document Dave Wallom produced. JG will provide input. Visions for today and for the future.

424.3 DB to contact ALICE-UK about Tier-2 resources.

425.7 DC to have an internal discussion within CMS relating to the use of future technology and the evolution of the computing model, from September over the next couple of years. DC to come up with possible suggestions of themes/topics for GridPP27 at CERN.

425.8 AS to consider any longer-term issues relating to storage, DPM, databases etc, and come back to DB with any ideas for sessions at GridPP27.

428.1 RJ and AS to respond to DC regarding inputs for the AHM paper.

428.2 DC to check at Imperial regarding the new person dealing with Ganga, in relation to a talk at ACAT.

428.3 JC to compile an info list relating to sub-nets at sites.
428.4 JC/PC to ask through the Ops Team or HEPSYSMAN whether or not there was an easy way to measure Tier-2 traffic, and to find out what was possible at Tier-2 sites.

428.5 DB to contact David Salmon and apprise him of the Network Document which had already been produced and contained our 'best knowledge' at present. He would also advise DS that we would progress his request and see what we could provide in terms of traffic measurement.

428.6 AS to come up with a proposal for how to use the current disk buffer at the Tier-1.

Forthcoming PMB meeting dates (at the usual time):

Mon June 13th
Mon June 27th
Mon July 11th
Mon July 25th
Mon Aug 8th
Mon Aug 22nd
Mon Sep 5th
TUE Sep 13th (F2F @ CERN)
Mon Sep 26th