GridPP PMB Minutes 428 (06.06.11)
=================================

Present: Dave Britton (Chair), Dave Colling, Jeremy Coles, Pete Gronbech, Robin Middleton, Glenn Patrick, Dave Kelsey, Steve Lloyd, John Gordon, Pete Clarke, (Suzanne Scott - Minutes)

Apologies: Tony Doyle, Roger Jones, Tony Cass, Andrew Sansum, Neil Geddes

1. AHM Paper status
===================

DC reported that he was still awaiting information from the Tier-1 and ATLAS. DB advised that RJ was interviewing today and would not be present, although he might have time afterwards to deal with this. DB would also check status with AS. JG thought that the deadline might be extended. DC advised that he needed only a couple of paragraphs from each, so that he could pull things together into a couple of pages of text.

ACTION 428.1 RJ and AS to respond to DC regarding inputs for the AHM paper.

2. Speakers for ACAT conference
===============================

DB reported that the Advanced Computing & Analysis Techniques (ACAT) in Physics Research conference was taking place at Brunel in September. DB advised that this was a better opportunity than the AHM for GridPP to get publications - there was a procedure for refereeing the papers and they would be published with a good impact factor. DB was organising part of the conference and wanted to identify people to speak. Did we have a list of submissions from GridPP to the AHM? None were known. DC noted there was one on CMS and clouds. JG noted that Jens had a few papers lined up. DB advised that this material could be re-used for ACAT. DB asked if there were any other people we could contact. DC noted a new person dealing with Ganga at Imperial - he would look into this for Track 1.

ACTION 428.2 DC to check at Imperial regarding the new person dealing with Ganga, in relation to a talk at ACAT.

DB advised that the deadline was 2nd July. DB noted that within Track 1, grid and cloud computing was a broad area. Was a security talk possible? DK indicated that it was.
DB thought that more general talks might also be possible; however, more targeted talks would be better, e.g. adaptive data placement. It could be a more GridPP-focussed talk with an emphasis on networking. Then there were new architectures and many-core - possibly Dave Newbold, Simon Metson or Phil Clark might be suitable. JC noted Andrew Washbrook as well. DB asked about virtualisation - was there someone appropriate at the Tier-1? JC noted Martin or Ian Collier; the Tier-1 was doing a fair amount of virtualisation of its infrastructure. JG agreed to forward DB's email to the Tier-1. DB noted other topics also, which were less related to GridPP, and asked if PG could consider something on monitoring. PG would look at the topics and see if anything was possible.

DB noted that Track 2 was data analysis, algorithms and tools, with a subset list. These fell more naturally under the experiments' brief than GridPP's. The third Track was computation in theoretical physics, which was probably outwith our remit. DB emphasised that we could take the opportunity to submit abstracts.

3. Accounting - HS06 etc
========================

JG advised that Alessandra Forti and Martin Bly had been sceptical about the published figures. SL agreed that there was a general feeling that the figures were not correct, which was borne out by his measurements. PG commented that we all knew HEPSPEC produces a better result on SL5 64-bit systems than on SL4 32-bit systems, and some sites might not have re-run the benchmarks after the upgrade. There ensued a discussion on HEPSPEC, sites and CPUs. It was agreed that HEPSPEC was not a proper benchmark of ATLAS code. SL emphasised that machines were all different, and HEPSPEC took some combination of CPU speed, memory, I/O etc into account, but apparently not the right combination for ATLAS code. SL noted we could at least get the production figures right. DB noted that for ATLAS, using results from production jobs would be the easiest approach.
DB summarised the PMB view that HEPSPEC06 was not helping - using production jobs to obtain empirical numbers was the most pragmatic way to proceed.

DB reported that he had been discussing Lancaster with RJ. The issue was still being investigated; however, RJ had reported that waiting jobs were not an issue, as this depended on Panda. The peak number of jobs was the more interesting issue, and RJ was looking into this. DB reported that the Glasgow team were going to let him know what jobs they were receiving from other clouds - this was currently under investigation. RJ thought that the issue of Lancaster not being full probably rested on several reasons internal to ATLAS - the Panda system operated in a particular way and there was the internal issue of Panda brokering. RJ also wanted to measure resources available, not just resources used. The Glasgow cloud issue and the Lancaster capacity number were to be continued. DC noted his disagreement with using 'resources available'. DB noted that the issue was internal to ATLAS - they knew globally that they were not using the resources that were there, due to the issue of Panda brokering. This had nothing to do with sites not providing resources to ATLAS.

4. AOCB
=======

- Networking

DB reported that David Salmon had sent notes and slides from the network meeting that had taken place in Paris. There was a specific request to GridPP to check the situation with respect to the Tier-2s:

1. check whether UK Tier-2 resources were on well-defined sub-nets within the universities;
2. ask Tier-2 sites to monitor traffic levels in and out of the Tier-2 resources.

ACTION 428.3 JC to compile an info list relating to sub-nets at sites.

DB asked if it were possible to measure the traffic volume in and out of the Tier-2s. This was about co-existence with other resources in Europe. DB advised that everything was under control at this point and there was no proposal to do anything; however, there was a need to keep an eye on things.
PC asked why the Network Document was not sufficient for David's purposes, as it provided at least 60% of what he needed to know. DB advised that they were asking us to measure volume. JG noted that we could monitor the FTS, but Tier-2 to Tier-2 traffic was difficult as there were many kinds of dataflows. DB agreed that it would be overkill to do this for every site, but some of the larger sites could provide useful information. DB asked JC to find out if there were an easy way to measure this - was any monitoring already in place? DB noted that overall this was a longer-term issue and that we couldn't commence a huge programme of work; however, we could compile some information now. PC suggested that the timescale for this should be the GridPP Collaboration Meeting at CERN in September. This might be trivial to do at Glasgow, which had a separate cluster, and we could limit it to sites that were similar. The lowest level of detail was total traffic; beyond that, it depended on how difficult the monitoring would be.

ACTION 428.4 JC/PC to ask through the Ops Team or HEPSYSMAN whether or not there was an easy way to measure Tier-2 traffic, and to find out what was possible at Tier-2 sites.

PC asked that David Salmon be reminded of the Network Document, which did contain the bulk of the information he required. DB agreed to follow this up.

ACTION 428.5 DB to contact David Salmon and apprise him of the Network Document which had already been produced and contained our 'best knowledge' at present. He would also advise DS that we would progress his request and see what we could provide in terms of traffic measurement.

- Resource Meeting

GP reported that the issue of extra disk had arisen at the Resource Meeting - he would need to ask AS about this.

ACTION 428.6 AS to come up with a proposal for how to use the current disk buffer at the Tier-1.

STANDING ITEMS
==============

SI-1 Tier-1 Manager's Report
----------------------------

AS was not present.
SI-2 Production Manager's Report
--------------------------------

JC reported as follows:

1) There was an update of the UK VOMS that led to T2K job failures (proxy problems) during the "at risk" period. T2K are also suffering due to jobs exceeding queue memory limits.

2) On the topic of Steve's observed HS06 spreads across the sites, many of you will have read the comments on TB-SUPPORT. In particular, Martin Bly's remarks about the nature of the current environment leading to distortions: "the prevalence of 64bit over 32bit since we did the original tests, the I/O regime in which the tests are performed, changes to the code bases, to name some. I suspect that I/O regimes will make the greatest difference to events/HS06 for two otherwise identical nodes"; and Alessandra Forti's comment about the test (and user) jobs being directed to slower nodes in the cluster (and the impact of fairshares).

SI-3 ATLAS weekly review & plans
--------------------------------

In absentia, RJ reported briefly as follows:

ATLAS Status: Tier-1
- Testing xrootd queue at RAL.
- Questions about the number of concurrent jobs running at RAL from our side - does this sound familiar?! We may need more pilots at RAL.
- Frontier server switching from PIC to Lyon.
- CernVM-FS testing is going well.

ATLAS Status: Tier-2
- Minor T2 issues. Four more sites up for T2D sonar tests. All look OK on current tests.

SI-4 CMS weekly review & plans
------------------------------

DC reported minor problems at the Tier-1 in relation to job and disk pools; generally everything had been OK over the last week. For the Tier-2s, all of the UK had been at 100% except Bristol; SAM tests were Nagios-based, and there were some differences as a result.

SI-5 LHCb weekly review & plans
-------------------------------

GP reported as follows:

1) A backlog of jobs (~4500 at peak) built up at the UK T1 over the week, peaking on Thursday.
For various reasons (batch farm full, flickering publishing in the BDII - possibly a CREAM issue?) RAL was not picking up LHCb jobs. LHCb moved to direct submission of jobs to lcgce09 on Friday, and since then the backlog has almost been eliminated (~250 jobs on Monday morning).

2) The RAL share of new data was set to 0 until the backlog was eliminated. It is expected to be increased again this week.

3) A 6TB disk server was added on Friday to lhcbRawRdst (d0t1) to help with the above issue.

4) There were a large number of failures due to "input data resolution", mainly because of the time the jobs had been waiting - files have been garbage-collected by Castor and will need to be restaged (being done automatically as needed).

5) Smooth running at Tier-2 sites.

SI-6 User Co-ordination issues
------------------------------

GP noted nothing to report.

SI-7 LCG Management Board Report
--------------------------------

DB advised that the next meeting was tomorrow.

SI-8 Dissemination Report
-------------------------

SL reported that the Magic Cubes had arrived, and had already been paid for.

REVIEW OF ACTIONS
=================

400.4 SL to co-ordinate changing the current GridPP MoU towards an MoU for GridPP4. In progress - the document had been circulated; any corrections to be sent to SL. Ongoing.

409.1 JC to revisit the document with a GridPP-NGI-NGS structure, not using the document Dave Wallom produced. JG will provide input. Visions for today and for the future. Ongoing.

424.3 DB to contact ALICE-UK about Tier-2 resources. Ongoing.

424.6 DC to complete CMS metrics - DC would circulate this after the meeting tomorrow. Done, item closed.

424.10 DB to contact JG to suggest topics for the CERN Meeting. Done, item closed.

425.7 DC to have an internal discussion within CMS relating to the use of future technology and the evolution of the computing model, from September over the next couple of years. DC to come up with possible suggestions of themes/topics for GridPP27 at CERN. Ongoing.
425.8 AS to consider any longer-term issues relating to storage, DPM, databases etc, and come back to DB with any ideas for sessions at GridPP27. Ongoing.

427.1 Re Tier-2 accounting figures: DB to contact RJ and ask him to explain why there were so many jobs waiting at Lancaster when they had such a large share available. Done, item closed.

427.2 Re Tier-2 accounting figures: DB to contact RJ and ask him about Glasgow getting production jobs from other clouds when other sites don't. DB would also check with the Glasgow team. Done, item closed.

427.3 DB to circulate an email to the CB re the OC outcome and the finalising of GridPP3, and point the CB at the documents. He would advise that a CB meeting might be useful in around six months' time, after the accounting period. Done, item closed.

ACTIONS AS OF 06.06.11
======================

400.4 SL to co-ordinate changing the current GridPP MoU towards an MoU for GridPP4. In progress - the document had been circulated; any corrections to be sent to SL.

409.1 JC to revisit the document with a GridPP-NGI-NGS structure, not using the document Dave Wallom produced. JG will provide input. Visions for today and for the future.

424.3 DB to contact ALICE-UK about Tier-2 resources.

425.7 DC to have an internal discussion within CMS relating to the use of future technology and the evolution of the computing model, from September over the next couple of years. DC to come up with possible suggestions of themes/topics for GridPP27 at CERN.

425.8 AS to consider any longer-term issues relating to storage, DPM, databases etc, and come back to DB with any ideas for sessions at GridPP27.

428.1 RJ and AS to respond to DC regarding inputs for the AHM paper.

428.2 DC to check at Imperial regarding the new person dealing with Ganga, in relation to a talk at ACAT.

428.3 JC to compile an info list relating to sub-nets at sites.
428.4 JC/PC to ask through the Ops Team or HEPSYSMAN whether or not there was an easy way to measure Tier-2 traffic, and to find out what was possible at Tier-2 sites.

428.5 DB to contact David Salmon and apprise him of the Network Document which had already been produced and contained our 'best knowledge' at present. He would also advise DS that we would progress his request and see what we could provide in terms of traffic measurement.

428.6 AS to come up with a proposal for how to use the current disk buffer at the Tier-1.

Forthcoming PMB meeting dates (at the usual time):

Mon June 13th
Mon June 27th
Mon July 11th
Mon July 25th
Mon Aug 8th
Mon Aug 22nd
Mon Sep 5th
TUE Sep 13th (F2F @ CERN)
Mon Sep 26th