UKHEPGRID@JISCMAIL.AC.UK


Subject: Minutes of the 427th GridPP PMB meeting
From: David Britton <[log in to unmask]>
Reply-To: David Britton <[log in to unmask]>
Date: Mon, 6 Jun 2011 11:25:18 +0100
Content-Type: multipart/mixed
Parts/Attachments: text/plain (83 lines), 110531.txt (395 lines)

Dear All,

Please find attached the GridPP Project Management Board Meeting minutes
for the 427th meeting.

   The latest minutes can be found each week in:

http://www.gridpp.ac.uk/php/pmb/minutes.php?latest

as well as being listed with other minutes at:

http://www.gridpp.ac.uk/php/pmb/minutes.php

Cheers, Dave.

-- 
________________________________________________________________________
Prof. David Britton                          GridPP Project Leader
Rm 480, Kelvin Building                      Telephone: +44 141 330 5454
School of Physics and Astronomy              Telefax: +44-141-330 5881
University of Glasgow                        EMail: [log in to unmask]
G12 8QQ, UK
________________________________________________________________________

GridPP PMB Minutes 427 (31.05.11)
================================

Present: Dave Britton (Chair), Dave Colling, Jeremy Coles, Pete Gronbech, Tony Cass, Robin Middleton, Glenn Patrick, Tony Doyle, Dave Kelsey, Steve Lloyd (Suzanne Scott - Minutes)

Apologies: Roger Jones, John Gordon, Pete Clarke, Andrew Sansum, Neil Geddes

1. OC Feedback
===============

DB had circulated the OC feedback. The meeting had gone well, the plan was to schedule a mid-term review of GridPP4 in the Spring of 2013, with the current OC personnel. Meantime, DB would liaise with Tony Medland should an interim meeting be required.

The formal feedback had been positive - the OC thanked the PMB for the documents, they felt it was the right amount of material and provided on time. The OC had noted that GridPP continued to be well-managed and successful. The OC encouraged us to keep abreast of new technology and collaborate where possible. The OC appreciated the good Project Management in GridPP3. They appreciated the 'Lessons Learnt' document. This was now the end of the OC in the current format but the OC members would be invited to GridPP Collaboration Meetings, to give them the opportunity to remain up-to-date.

2. Month-1 of Tier-2 Accounting
================================

SL reported that we had reached the end of month 1 in the Tier-2 Accounting. SL had circulated a few points at issue, to which DB had responded.

Regarding CPU, it was noted that CPU time was different on different machines. SL had corrected for HEPSPEC, however a similar exercise by ATLAS had shown that the spread of results had become worse as a result. HEPSPEC was not a good measure of ATLAS code when it was running. SL could compare it with his own benchmark jobs for the UK. PG was looking at work done at his site, and APEL and ATLAS appeared to be in line. DB noted that the evidence was that HEPSPEC did not help. We could ignore the HEPSPEC or use SL's measure of HEPSPEC as a multiplier. SL wanted to check his figures before we decided on anything. PG noted that if some sites used the average HEPSPEC published figure, they get a different result. SL advised no, it depended on the machine as well. SL would check his figures, and noted in addition there was a problem with the ones we can't measure, eg: Cambridge. PG confirmed they usually send manual figures.

Regarding Lancaster (and ECDF), SL advised that concerns had been raised regarding the published CPU available numbers. They both had a shared cluster. For Lancaster, there was 90% published, which was a large number in comparison with others listed in the column, yet ATLAS didn't tend to use Lancaster. TD advised that for CPU the figure should show what was actually 'used' rather than what was 'available', as the latter was distortive. DB advised that if there was not enough disk, or not enough bandwidth, then what was 'available' was not a realistic number, but it was hard to prove otherwise. PG agreed with TD that we should be measuring 'reality' and showing what was actually done at the site. SL advised that this measurement is in fact done at present. JC noted that utilisation across all UK resources was rarely above 60%. PG noted his concern about the accuracy of the Lancaster figures. JC agreed, noting that Lancaster had the worst utilisation figures for Quarter 1: 14%. There was definitely an issue at Lancaster. TD commented that the PANDA system should be monitoring this, and that a declaration of huge CPU available that was under-utilised was not useful. SL suggested that we need to ask RJ about the waiting jobs.

TD considered that giving 50 points for providing CPU seemed wrong - we should, rather, be measuring overall CPU throughput. SL thought that we should not penalise sites for buying a lot of kit that was under-utilised. TD noted the same issue with ECDF - they had a lot of CPU. TD noted that the right way to measure was the number of job slots x fraction. PG proposed to drop the column altogether, and measure what was actually done. TD agreed. SL observed that large shared clusters were the issue. TD noted we could adjust the fraction afterwards. SL agreed that we could publish the realistic share. SL noted that RJ had to answer why there were all the jobs waiting, when he had a large share available. TD suggested that it should be the effective share over the month that was measured, retrospectively, giving the usage they actually got out of the cluster (for a shared site). DC commented however that scheduling policy can work against the site. SL summarised by noting that we must get the answer as to why things still look so wrong, and then maybe drop the column altogether - this would apply to ECDF as well.

DB had circulated some graphs showing CPU available against MC Production: there was a reasonable correlation for all sites except Lancaster, ECDF and QMUL that showed much higher levels of available CPU than was consistent with the work actually being performed. SL noted that at QMUL this was due to the current deployment of new resources.

ACTION 427.1 Re Tier-2 accounting figures: DB to contact RJ and ask him to explain why there were so many jobs waiting at Lancaster, when they had such a large share available.

Regarding Glasgow getting production jobs from other clouds when others don't, DB noted that we needed to ask RJ, however there was no evidence from the graphs that Glasgow was getting more work than expected. PG observed that if the cluster is full then there would be no complaint from any site.

ACTION 427.2 Re Tier-2 accounting figures: DB to contact RJ and ask him about Glasgow getting production jobs from other clouds, when other sites don't. DB would also check with the Glasgow team.

Regarding the issue of LOCALGROUPDISK, it was noted that there were local users at a site, plus 10-20% set aside for GridPP. SL had asked RJ if he wanted this included or not. RJ had said yes, up to 20% of the pledge. SL had said this was too complicated to implement and that it should be either all or none. The latter had been agreed. TD noted that we should say it is zero and be driven by the major fraction available to everyone.

Regarding QMUL, it was noted that 60TB was dedicated to T2K but this didn't count anywhere. DB suggested that we introduce a metric under 'others'. SL noted it wasn't clear we could measure it. DB thought we probably couldn't measure it but noted that in the big picture this was a very small effect.

Regarding capping any one site at 20%, it was noted that Glasgow had been in the 20s recently. DB noted however that Glasgow had already fallen below 20% as other sites deployed new kit and, anyway, the power had to go down soon, so it would not achieve 20%. TD suggested that the maximum allocated to a site should be £200k? DB did not want to set a figure at this stage.

DB noted that Oxford and Manchester seemed to do more analysis than the amount of disk would indicate. DB noted that the big picture in the UK was that we needed to spend money on disk. PG advised that the Storage Group had discussed the disk at QMUL - they have a lot of disk but have bandwidth issues, therefore buying more disk would be a waste. TD observed that whether the network would throttle was a larger issue - pattern of use was more crucial.

The conclusion of the discussions was that there were no major problems, however there remained detail which we needed to understand.

3. Misc Items
==============

- AHM paper

It was noted that the AHM deadline had been extended. DC had received a few inputs, and would contact the Tier-1. AS was away at present. DB advised that the OC documents had plots, if any were required, and DB's talk might also be helpful. DC would attend to this.

- GridPP MoU

SL had circulated the MoU and had received comments from JG in relation to EGI. SL would modify the document and add appropriate footnotes. SL asked if we needed a CB meeting? TD thought it would help, especially in order to provide the OC feedback. DB thought it preferable to wait until we were clearer about the hardware figures and had something more useful to report. DB could circulate an email re the OC outcome and the finalising of GridPP3, and point the CB at the documents. TD agreed, noting that it would be good to provide a report.

ACTION 427.3 DB to circulate an email to the CB re the OC outcome and the finalising of GridPP3, and point the CB at the documents. He would advise that a CB meeting might be useful in around 6 months' time, after the accounting period.

- UKHEPSYSMAN sponsorship

PG had the budget breakdown for this, which he had sent to RM, who had agreed the expenditure. PG noted that the event would be similar to last year, and that a barbecue would be the best option.

4. Proposal for PMB Dates
==========================

DB had proposed a list of dates for forthcoming PMB meetings, in order to avoid holiday weekends etc. Could everyone check these and let him know if there was anything missed that might make any of the meeting dates impractical. These meetings would take place at the usual time: 12.55 pm.

Mon June 6th
Mon June 13th
Mon June 27th
Mon July 11th
Mon July 25th
Mon Aug 8th
Mon Aug 22nd
Mon Sep 5th
TUE Sep 13th F2F@CERN
Mon Sep 26th

5. AOCB
========

No other business.

STANDING ITEMS
==============

SI-1 Tier-1 Manager's Report
-----------------------------

AS was absent.

SI-2 Production Manager's Report
---------------------------------

JC reported as follows:

1) The request for a relocatable install for glexec has been pushed by Maarten Litmaath in the last week. For those interested in the discussion thread see http://indico.cern.ch/materialDisplay.py?contribId=2&materialId=0&confId=141553. It is very likely that sites waiting on this will miss the end of June deadline from WLCG. There is no point in sites building from source at this stage. The updated policy for glexec deployment is here https://twiki.cern.ch/twiki/bin/view/LCG/GlexecDeployment. The current status across UK sites can be ascertained from the UK Nagios tests now being run: http://tinyurl.com/3fhvh9z. Currently this shows success for: RHUL, Liverpool, RALPP, RAL Tier-1, Glasgow and Oxford.

2) There was a meeting of the CA Technical Advisory Group last week. Action will need to be taken soon as the CA certificate needs to be renewed on the September timescale. A statement from the group reads as follows:

“The UK e-Science CA is due to go through another rollover, i.e. the CA certificate has to be renewed. This is scheduled for the end of September 2011. We will remain compliant with the IGTF requirements, but aim to modernise many of the processes. You may already have seen early versions of the CertWizard, the java-based client which makes certificate management a lot easier. Another planned improvement is to bring the CA wholly online, so that instead of signing certificates within one working day (after approval of the request) they will be issued immediately (or at least within minutes.) This will also improve the security of the infrastructure as fresh revocation lists can be issued whenever they are needed.

The new CA certificate will have a longer lifetime: the rules now allow this. The plan is to generate it at the end of May, and then push it out via the IGTF in June, to ensure it is widely distributed by the end of September.

A few other modernisations are under way, but phased, so they are not introduced all at once. The policy is being rewritten, not so much to change it but more to make it clearer and to allow for more flexibility and resilience in following IGTF requirements. The extensions in the certificates will be modernised (the current ones are quite old by today's standard.) Throughout the process, the major relying parties have been consulted via the Technical Advisory Group, or TAG.”

DB considered that this was a recipe for disaster, and urged anyone who was a member of the TAG to keep a very close eye on this. The TAG group remit was to be vigilant in any case and it was confirmed that they would do everything possible to avoid any serious problems. DB advised particularly that they should be on the lookout for unforeseen delays.

3) Security Service Challenge 5 ran last week across EGI and included 43 sites. The challenge involved pilot jobs being submitted to sites with user payloads being run and various manipulations being made on files on the site storage. First impressions are that it was a useful test for everyone involved and UK participation and performance was good. The final reports are still being submitted and following that the EGI team will need to review and evaluate to see what can be learned to improve incident response procedures and assess site effectiveness. Thank you to the GridPP participating sites: Lancaster, Cambridge, RHUL and Glasgow.

SI-3 ATLAS weekly review & plans
---------------------------------

RJ was absent.

SI-4 CMS weekly review & plans
-------------------------------

DC reported that from the UK side, CMS was using glexec now. They were doing tests with xrootd. They were looking at scheduling: half of the Tier-1 would be scheduled by the end of the year. For accounting, the first month had been completed for CMS, the results/ratios were much as expected, the only surprise was that QMUL was coming in as a sizeable partner, even running analysis jobs. No data was scheduled there, so it must be private MC being run - this made QMUL an anomaly. DC noted issues at Bristol - they had a readiness of 0% last month and 0 transfers. CPU and MC were 0. They have transfer issues.

SI-5 LHCb weekly review & plans
--------------------------------

GP reported as follows:

1) Tier 1 disk-server gdss120 (t1d0 / lhcbRawRdst) out of production for 24 hours, from 25 May.

2) Looking at optimising job start rate at T1 (currently 3 jobs/minute) to improve throughput.

3) Full reprocessing of 2011 data ongoing with latest version of LHCb software, expected to be used for summer conferences.

SI-6 User Co-ordination Issues
===============================

There was a resource meeting due to happen tomorrow; no issues to report.

SI-7 LCG Management Board report
=================================

There had been no meeting.

SI-8 Dissemination
===================

No issues to report.

REVIEW OF ACTIONS
=================

400.4 SL to co-ordinate changing the current GridPP MoU towards an MoU for GridPP4. In progress - document had been circulated. Any corrections to be sent to SL. Ongoing.

409.1 JC to revisit document with a GridPP-NGI-NGS structure, not use the document Dave Wallom produced. JG will provide input. Visions for today and for the future. Ongoing.

424.1: PG to sketch out a technical plan for establishing the appropriate Grid-services at Sussex. Done, action closed.

424.3: DB to contact ALICE-UK about Tier-2 resources. Ongoing.

424.5: SL to complete metrics web-page. Done, item closed.

424.6: DC to complete CMS metrics - DC would circulate this after the meeting tomorrow. Ongoing.

424.9 JC to suggest topics for CERN Meeting. Done, item closed. Following the meeting, JC reported as follows:

"Here is an initial response with some ideas of topics that might be covered during the GridPP meeting at CERN. The speakers for many of the topics would ideally be people from CERN - for example the experiment technical experts/reps who speak at the GDBs. Probably we want to pick up on and develop outcomes of the WLCG workshop in Hamburg in July (https://indico.desy.de/conferenceTimeTable.py?confId=4019#all). The meeting theme might be something connected to accommodating changes (the EMI-1 release, the machine schedule, increasing event sizes, new technologies, shrinking budgets!). My first guess would be "(In)Stabilities"."

424.10 DB to contact JG to suggest topics for CERN Meeting. Ongoing.

425.1 DB to provide PG with text for Risk 18, noting that despite the likelihood being raised to 3, the risk was not immediate. Done, item closed.

425.2 PG to check whether the current Risk Register would map onto the new one for GridPP4. PG to summarise findings in an email to PMB. Done, item closed.

425.3 Owners of the new risks for GridPP4 should check the new Risk Register and get back to PG with any comments/amendments. They should also look at the 'old' spreadsheet and ensure that all previous risks relating to them are adequately covered within the new version. Done, item closed.

425.4 DB to firm-up the structure of document 155, providing a detailed document map and targeting AS with specific sections to complete on the Tier-1 as a quantified success. Done, item closed.

425.5 DK to provide text on security challenges during GridPP3 and UK performance, for document 155. Done, item closed.

425.6 ALL: to review document 156 on GridPP3 Financial Status. Comments to be sent to DB. Done, item closed.

425.7 DC to have an internal discussion within CMS relating to use of future technology and evolution of the computing model, from September to the next couple of years. DC to come up with possible suggestion of theme/topics for GridPP27 at CERN. Ongoing.

425.8 AS to consider any longer-term issues relating to storage, DPM, databases etc, and come back to DB with any ideas for sessions at GridPP27. Ongoing.

426.1 JC to check on blacklisted sites Manchester & Glasgow, and the timescales involved. Following the meeting, JC reported as follows:

"As often happens on closer examination things are subtly different. The wording should have been "Manchester and Glasgow are currently closest to being blacklisted for example". In fact the status is now that nearly every site will be blacklisted:

Brunel
QMUL
RHUL
Liverpool
Manchester
Sheffield
Durham
ECDF
Glasgow
Oxford

These sites have spacetokens close to 80% full, the point at which ATLAS blacklists the site. At the moment of writing I do not think there has been a negative impact on site accounting. The question I was asked to check was how long it takes to subsequently remove a site from the blacklist (http://bourricot.cern.ch/blacklisted_production.html). The answer in this case is a minimum of 24hrs, this being the length of time for the deletion service (triggered at the 80% full point) to clear old files from the spacetoken.

Since I am correcting/clarifying I should also point out that:

> B) EGI is looking at site entries in the GOCDB and asking NGIs to close sites that have been in candidate/uncertified states for a long period. In the UK this affects many of the NGS registered sites.

Is now out of date. Uncertified sites can remain indefinitely in that state after a meeting today updated the procedures."

Done, item closed.

ACTIONS AS OF 31.05.11
======================

400.4 SL to co-ordinate changing the current GridPP MoU towards an MoU for GridPP4. In progress - document had been circulated. Any corrections to be sent to SL.

409.1 JC to revisit document with a GridPP-NGI-NGS structure, not use the document Dave Wallom produced. JG will provide input. Visions for today and for the future.

424.3: DB to contact ALICE-UK about Tier-2 resources.

424.6: DC to complete CMS metrics - DC would circulate this after the meeting tomorrow.

424.10 DB to contact JG to suggest topics for CERN Meeting.

425.7 DC to have an internal discussion within CMS relating to use of future technology and evolution of the computing model, from September to the next couple of years. DC to come up with possible suggestion of theme/topics for GridPP27 at CERN.

425.8 AS to consider any longer-term issues relating to storage, DPM, databases etc, and come back to DB with any ideas for sessions at GridPP27.

427.1 Re Tier-2 accounting figures: DB to contact RJ and ask him to explain why there were so many jobs waiting at Lancaster, when they had such a large share available.

427.2 Re Tier-2 accounting figures: DB to contact RJ and ask him about Glasgow getting production jobs from other clouds, when other sites don't. DB would also check with the Glasgow team.

427.3 DB to circulate an email to the CB re the OC outcome and the finalising of GridPP3, and point the CB at the documents. He would advise that a CB meeting might be useful in around 6 months' time, after the accounting period.

Forthcoming PMB meeting dates were as follows, at the usual time:

Mon June 6th
Mon June 13th
Mon June 27th
Mon July 11th
Mon July 25th
Mon Aug 8th
Mon Aug 22nd
Mon Sep 5th
TUE Sep 13th F2F@CERN
Mon Sep 26th
