Dear All,
Please find attached the GridPP Project Management Board Meeting minutes
for the 427th meeting.
The latest minutes can be found each week in:
http://www.gridpp.ac.uk/php/pmb/minutes.php?latest
as well as being listed with other minutes at:
http://www.gridpp.ac.uk/php/pmb/minutes.php
Cheers, Dave.
--
________________________________________________________________________
Prof. David Britton GridPP Project Leader
Rm 480, Kelvin Building Telephone: +44 141 330 5454
School of Physics and Astronomy Telefax: +44-141-330 5881
University of Glasgow EMail: [log in to unmask]
G12 8QQ, UK
________________________________________________________________________
GridPP PMB Minutes 427 (31.05.11)
================================
Present: Dave Britton (Chair), Dave Colling, Jeremy Coles, Pete Gronbech, Tony Cass, Robin
Middleton, Glenn Patrick, Tony Doyle, Dave Kelsey, Steve Lloyd (Suzanne Scott - Minutes)
Apologies: Roger Jones, John Gordon, Pete Clarke, Andrew Sansum, Neil Geddes
1. OC Feedback
===============
DB had circulated the OC feedback. The meeting had gone well; the plan was to schedule a
mid-term review of GridPP4 in the spring of 2013, with the current OC personnel. In the
meantime, DB would liaise with Tony Medland should an interim meeting be required.
The formal feedback had been positive: the OC thanked the PMB for the documents, which they
felt were the right amount of material, provided on time. The OC had noted that GridPP continued
to be well-managed and successful. The OC encouraged us to keep abreast of new technology and
collaborate where possible. The OC appreciated the good Project Management in GridPP3. They
appreciated the 'Lessons Learnt' document.
This was now the end of the OC in the current format but the OC members would be invited to
GridPP Collaboration Meetings, to give them the opportunity to remain up-to-date.
2. Month-1 of Tier-2 Accounting
================================
SL reported that we had reached the end of month 1 in the Tier-2 Accounting. SL had circulated a
few points at issue, to which DB had responded.
Regarding CPU, it was noted that CPU time differed between machines. SL had corrected for
HEPSPEC; however, a similar exercise by ATLAS had shown that the spread of results became
worse as a result. HEPSPEC was not a good measure of how ATLAS code actually performed when running.
SL could compare it with his own benchmark jobs for the UK. PG was looking at work done at his
site, and APEL and ATLAS appeared to be in line. DB noted that the evidence was that HEPSPEC
did not help. We could ignore the HEPSPEC or use SL's measure of HEPSPEC as a multiplier. SL
wanted to check his figures before anything was decided. PG noted that if some sites used the
average published HEPSPEC figure, they would get a different result; SL advised that it depended
on the machine as well. SL would check his figures, and noted in addition that there was a
problem with the sites we could not measure, e.g. Cambridge. PG confirmed that Cambridge usually sent manual figures.
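The HEPSPEC multiplier being debated amounts to scaling raw CPU time by a per-machine benchmark factor before comparing sites. A minimal sketch of that normalisation follows; the machine names and benchmark factors are hypothetical, not actual GridPP or APEL figures.

```python
# Sketch of HEPSPEC06 normalisation of raw CPU time.
# Machine names and per-core benchmark scores below are illustrative only.

HEPSPEC06_PER_CORE = {
    "machine_a": 8.0,    # hypothetical HS06 score per core
    "machine_b": 12.5,
}

def normalised_cpu(raw_cpu_seconds, machine):
    """Scale raw CPU seconds into HS06-hours for cross-site comparison."""
    factor = HEPSPEC06_PER_CORE[machine]
    return raw_cpu_seconds * factor / 3600.0

# A job using 7200 CPU-seconds on machine_b contributes
# 7200 * 12.5 / 3600 = 25.0 HS06-hours.
assert normalised_cpu(7200, "machine_b") == 25.0
```

The spread ATLAS observed arises when the benchmark factor does not track how the experiment code actually performs on each machine, which is the objection recorded above.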
Regarding Lancaster (and ECDF), SL advised that concerns had been raised regarding the
published CPU-available numbers; both sites had shared clusters. Lancaster had published a 90%
figure, large in comparison with others listed in the column, yet ATLAS did not tend to use
Lancaster. TD advised that for CPU the figure should show what was actually
'used' rather than what was 'available', as the latter was distortive. DB advised that if there was
not enough disk, or not enough bandwidth, then what was 'available' was not a realistic number,
but it was hard to prove otherwise. PG agreed with TD that we should be measuring 'reality' and
showing what was actually done at the site. SL advised that this measurement was in fact already
being made.
JC noted that utilisation across all UK resources was rarely above 60%. PG noted his concern
about the accuracy of the Lancaster figures. JC agreed, noting that Lancaster had the worst
utilisation figures for Quarter 1: 14%. There was definitely an issue at Lancaster. TD commented
that the PANDA system should be monitoring this, and that a declaration of huge CPU available
that was under-utilised was not useful. SL suggested that we need to ask RJ about the waiting
jobs.
TD considered that giving 50 points for providing CPU seemed wrong - we should, rather, be
measuring overall CPU throughput. SL thought that we should not penalise sites for buying a lot
of kit that was under-utilised. TD noted the same issue with ECDF - they had a lot of CPU. TD
noted that the right way to measure was the number of job slots x fraction. PG proposed to drop
the column altogether, and measure what was actually done. TD agreed. SL observed that large
shared clusters were the issue. TD noted we could adjust the fraction afterwards. SL agreed that
we could publish the realistic share. SL noted that RJ had to answer why there were all the jobs
waiting, when he had a large share available. TD suggested that it should be the effective share
over the month that was measured, retrospectively, giving the usage they actually got out of the
cluster (for a shared site). DC commented however that scheduling policy can work against the
site. SL summarised by noting that we must get the answer as to why things still look so wrong,
and then maybe drop the column altogether - this would apply to ECDF as well.
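TD's proposal of a retrospective effective share (job slots x fraction, measured over the month) reduces to the fraction of a shared cluster's slot-hours that grid work actually consumed. A minimal sketch, with purely illustrative numbers rather than real site figures:

```python
# Sketch of the retrospective "effective share" measure discussed above:
# the fraction of a shared cluster's slot-hours used by grid jobs over a
# month. All numbers are illustrative, not real site figures.

def effective_share(grid_slot_hours, total_slots, hours_in_month=720):
    """Fraction of the cluster's total slot-hours consumed by grid work."""
    total_slot_hours = total_slots * hours_in_month
    return grid_slot_hours / total_slot_hours

# e.g. 100,800 grid slot-hours on a hypothetical 1000-slot shared cluster
# over a 720-hour month gives an effective share of 0.14 (14%).
share = effective_share(100_800, 1000)
assert round(share, 2) == 0.14
```

Measured this way, a shared site is credited with the usage it actually obtained from the cluster, rather than the nominal capacity it declared.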
DB had circulated some graphs showing CPU available against MC Production: there was a
reasonable correlation for all sites except Lancaster, ECDF and QMUL that showed much higher
levels of available CPU than was consistent with the work actually being performed. SL noted that
at QMUL this was due to the current deployment of new resources.
ACTION
427.1 Re Tier-2 accounting figures: DB to contact RJ and ask him to explain why there were so
many jobs waiting at Lancaster, when they had such a large share available.
Regarding Glasgow getting production jobs from other clouds when others don't, DB noted that
we needed to ask RJ, however there was no evidence from the graphs that Glasgow was getting
more work than expected. PG observed that if the cluster is full then there would be no complaint
from any site.
ACTION
427.2 Re Tier-2 accounting figures: DB to contact RJ and ask him about Glasgow getting
production jobs from other clouds, when other sites don't. DB would also check with the Glasgow
team.
Regarding the issue of LOCALGROUPDISK, it was noted that there were local users at a site, plus
10-20% set aside for GridPP. SL had asked RJ if he wanted this included or not. RJ had said yes,
up to 20% of the pledge. SL had said this was too complicated to implement and that it should be
either all or none. The latter had been agreed. TD noted that we should say it is zero and be
driven by the major fraction available to everyone.
Regarding QMUL, it was noted that 60TB was dedicated to T2K but this didn't count anywhere.
DB suggested that we introduce a metric under 'others'. SL noted it wasn't clear we could
measure it. DB thought we probably couldn't measure it but noted that in the big picture this was
a very small effect.
Regarding capping any one site @ 20%, it was noted that Glasgow had been in the 20s recently.
DB noted however that Glasgow had already fallen below 20% as other sites deployed new kit
and, in any case, the power had to go down soon, so it would not reach 20%. TD suggested that the
maximum allocated to a site should perhaps be £200k; DB did not want to set a figure at this stage. DB
noted that Oxford and Manchester seemed to do more analysis than the amount of disk would
indicate. DB noted that the big picture in the UK was that we needed to spend money on disk. PG
advised that the Storage Group had discussed the disk at QMUL - they have a lot of disk but have
bandwidth issues, therefore buying more disk would be a waste. TD observed that whether the
network would throttle was a larger issue - pattern of use was more crucial.
The conclusion of the discussions was that there were no major problems, however there
remained detail which we needed to understand.
3. Misc Items
==============
- AHM paper
It was noted that the AHM deadline had been extended. DC had received a few inputs, and would
contact the Tier-1. AS was away at present. DB advised that the OC documents had plots, if any
were required, and DB's talk might also be helpful. DC would attend to this.
- GridPP MoU
SL had circulated the MoU and had received comments from JG in relation to EGI. SL would
modify the document and add appropriate footnotes. SL asked whether a CB meeting was needed. TD
thought it would help, especially in order to provide the OC feedback. DB thought it preferable to
wait until we were clearer about the hardware figures and had something more useful to report.
DB could circulate an email re the OC outcome and the finalising of GridPP3, and point the CB at
the documents. TD agreed, noting that it would be good to provide a report.
ACTION
427.3 DB to circulate an email to the CB re the OC outcome and the finalising of GridPP3, and
point the CB at the documents. He would advise that a CB meeting might be useful in around 6
months' time, after the accounting period.
- UKHEPSYSMAN sponsorship
PG had the budget breakdown for this, which he had sent to RM, who had agreed the expenditure.
PG noted that the event would be similar to last year's, and that a barbecue would be the best
option.
4. Proposal for PMB Dates
==========================
DB had proposed a list of dates for forthcoming PMB meetings, in order to avoid holiday
weekends etc. Everyone should check these and let DB know of anything missed that might make
any of the dates impractical. These meetings would take place at the usual
time: 12.55 pm.
Mon June 6th
Mon June 13th
Mon June 27th
Mon July 11th
Mon July 25th
Mon Aug 8th
Mon Aug 22nd
Mon Sep 5th
TUE Sep 13th F2F@CERN
Mon Sep 26th
5. AOCB
========
No other business.
STANDING ITEMS
==============
SI-1 Tier-1 Manager's Report
-----------------------------
AS was absent.
SI-2 Production Manager's Report
---------------------------------
JC reported as follows:
1) The request for a relocatable install for glexec has been pushed by Maarten Litmaath in the
last week. For those interested in the discussion thread see
http://indico.cern.ch/materialDisplay.py?contribId=2&materialId=0&confId=141553. It is very
likely that sites waiting on this will miss the end of June deadline from WLCG. There is no point in
sites building from source at this stage. The updated policy for glexec deployment is here
https://twiki.cern.ch/twiki/bin/view/LCG/GlexecDeployment.
The current status across UK sites can be ascertained from the UK Nagios tests now being run:
http://tinyurl.com/3fhvh9z. Currently this shows success for: RHUL, Liverpool, RALPP, RAL Tier-
1, Glasgow and Oxford.
2) There was a meeting of the CA Technical Advisory Group last week. Action will need to be
taken soon as the CA certificate needs to be renewed on the September timescale. A statement
from the group reads as follows:
“The UK e-Science CA is due to go through another rollover, i.e. the CA certificate has to be
renewed. This is scheduled for the end of September 2011. We will remain compliant with the
IGTF requirements, but aim to modernise many of the processes. You may already have seen
early versions of the CertWizard, the java-based client which makes certificate management a lot
easier. Another planned improvement is to bring the CA wholly online, so that instead of signing
certificates within one working day (after approval of the request) they will be issued
immediately (or at least within minutes.) This will also improve the security of the infrastructure
as fresh revocation lists can be issued whenever they are needed. The new CA certificate will have
a longer lifetime: the rules now allow this. The plan is to generate it at the end of May, and then
push it out via the IGTF in June, to ensure it is widely distributed by the end of September. A few
other modernisations are under way, but phased, so they are not introduced all at once. The
policy is being rewritten, not so much to change it but more to make it clearer and to allow for
more flexibility and resilience in following IGTF requirements. The extensions in the certificates
will be modernised (the current ones are quite old by today's standard.) Throughout the process,
the major relying parties have been consulted via the Technical Advisory Group, or TAG.”
DB considered that this was a recipe for disaster, and urged anyone who was a member of the TAG
to keep a very close eye on this. The TAG group remit was to be vigilant in any case and it was
confirmed that they would do everything possible to avoid any serious problems. DB advised
particularly that they should be on the lookout for unforeseen delays.
3) Security Service Challenge 5 ran last week across EGI and included 43 sites. The challenge
involved pilot jobs being submitted to sites with user payloads being run and various
manipulations being made on files on the site storage. First impressions are that it was a useful
test for everyone involved, and UK participation and performance were good. The final reports are
still being submitted and following that the EGI team will need to review and evaluate to see what
can be learned to improve incident response procedures and assess site effectiveness. Thank you
to the GridPP participating sites: Lancaster, Cambridge, RHUL and Glasgow.
SI-3 ATLAS weekly review & plans
---------------------------------
RJ was absent.
SI-4 CMS weekly review & plans
-------------------------------
DC reported that from the UK side, CMS was using glexec now. They were doing tests with xrootd.
They were looking at scheduling: half of the Tier-1 would be scheduled by the end of the year. For
accounting, the first month had been completed for CMS, the results/ratios were much as
expected, the only surprise was that QMUL was coming in as a sizeable partner, even running
analysis jobs. No data was scheduled there, so it must be private MC being run - this made QMUL
an anomaly. DC noted issues at Bristol: the site had a readiness of 0% last month, with 0
transfers, and CPU and MC figures were also 0; Bristol had transfer issues.
SI-5 LHCb weekly review & plans
--------------------------------
GP reported as follows:
1) Tier 1 disk-server gdss120 (t1d0 / lhcbRawRdst) out of production for 24 hours, from 25 May.
2) Looking at optimising job start rate at T1 (currently 3 jobs/minute) to improve throughput.
3) Full reprocessing of 2011 data ongoing with latest version of LHCb software, expected to be
used for summer conferences.
SI-6 User Co-ordination Issues
===============================
There was a resource meeting due to happen tomorrow; no issues to report.
SI-7 LCG Management Board report
=================================
There had been no meeting.
SI-8 Dissemination
===================
No issues to report.
REVIEW OF ACTIONS
=================
400.4 SL to co-ordinate changing the current GridPP MoU towards an MoU for GridPP4. In
progress - document had been circulated. Any corrections to be sent to SL. Ongoing.
409.1 JC to revisit document with a GridPP-NGI-NGS structure, not use the document Dave
Wallom produced. JG will provide input. Visions for today and for the future. Ongoing.
424.1: PG to sketch out a technical plan for establishing the appropriate Grid-services at Sussex.
Done, action closed.
424.3: DB to contact ALICE-UK about Tier-2 resources. Ongoing.
424.5: SL to complete metrics web-page. Done, item closed.
424.6: DC to complete CMS metrics - DC would circulate this after the meeting tomorrow.
Ongoing.
424.9 JC to suggest topics for CERN Meeting. Done, item closed.
Following the meeting, JC reported as follows:
"Here is an initial response with some ideas of topics that might be covered during the GridPP
meeting at CERN. The speakers for many of the topics would ideally be people from CERN - for
example the experiment technical experts/reps who speak at the GDBs.
Probably we want to pick up on and develop outcomes of the WLCG workshop in Hamburg in July
(https://indico.desy.de/conferenceTimeTable.py?confId=4019#all). The meeting theme might be
something connected to accommodating changes (the EMI-1 release, the machine schedule,
increasing event sizes, new technologies, shrinking budgets!). My first guess would be
"(In)Stabilities"."
424.10 DB to contact JG to suggest topics for CERN Meeting. Ongoing.
425.1 DB to provide PG with text for Risk 18, noting that despite the likelihood being raised to 3,
the risk was not immediate. Done, item closed.
425.2 PG to check whether the current Risk Register would map onto the new one for GridPP4.
PG to summarise findings in an email to PMB. Done, item closed.
425.3 Owners of the new risks for GridPP4 should check the new Risk Register and get back to PG
with any comments/amendments. They should also look at the 'old' spreadsheet and ensure that
all previous risks relating to them are adequately covered within the new version. Done, item
closed.
425.4 DB to firm-up the structure of document 155, providing a detailed document map and
targeting AS with specific sections to complete on the Tier-1 as a quantified success. Done, item
closed.
425.5 DK to provide text on security challenges during GridPP3 and UK performance, for
document 155. Done, item closed.
425.6 ALL: to review document 156 on GridPP3 Financial Status. Comments to be sent to DB.
Done, item closed.
425.7 DC to have an internal discussion within CMS relating to use of future technology and
evolution of the computing model, from September to the next couple of years. DC to come up
with possible suggestion of theme/topics for GridPP27 at CERN. Ongoing.
425.8 AS to consider any longer-term issues relating to storage, DPM, databases etc, and come
back to DB with any ideas for sessions at GridPP27. Ongoing.
426.1 JC to check on blacklisted sites Manchester & Glasgow, and the timescales involved.
Following the meeting, JC reported as follows:
"As often happens on closer examination things are subtly different. The wording should have
been "Manchester and Glasgow are currently closest to being blacklisted for example".
In fact the status is now that nearly every site will be blacklisted:
Brunel
QMUL
RHUL
Liverpool
Manchester
Sheffield
Durham
ECDF
Glasgow
Oxford
These sites have spacetokens close to 80% full, the point at which ATLAS blacklists the site. At the
moment of writing I do not think there has been a negative impact on site accounting. The
question I was asked to check was how long it takes to subsequently remove a site from the
blacklist (http://bourricot.cern.ch/blacklisted_production.html). The answer in this case is a
minimum of 24hrs, this being the length of time for the deletion service (triggered at the 80% full
point) to clear old files from the spacetoken.
Since I am correcting/clarifying I should also point out that:
> B) EGI is looking at site entries in the GOCDB and asking NGIs to close sites that have been in
candidate/uncertified states for a long period. In the UK this affects many of the NGS registered
sites.
Is now out of date. Uncertified sites can remain indefinitely in that state after a meeting today
updated the procedures." Done, item closed.
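The 80%-full blacklisting trigger JC describes is a simple threshold check on each spacetoken, and the 24-hour recovery time is set by the deletion service it triggers. A minimal sketch of the check, with hypothetical token sizes rather than real site numbers:

```python
# Sketch of the ATLAS spacetoken check described above: a site is
# blacklisted once a token reaches 80% full, and the deletion service
# (triggered at the same point) takes a minimum of ~24h to clear old
# files. Token sizes below are hypothetical.

BLACKLIST_THRESHOLD = 0.80

def at_risk(used_tb, capacity_tb):
    """True once a spacetoken has crossed the 80%-full blacklisting point."""
    return used_tb / capacity_tb >= BLACKLIST_THRESHOLD

assert at_risk(82, 100) is True    # 82% full: over the threshold
assert at_risk(60, 100) is False   # 60% full: safe
```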
ACTIONS AS OF 31.05.11
======================
400.4 SL to co-ordinate changing the current GridPP MoU towards an MoU for GridPP4. In
progress - document had been circulated. Any corrections to be sent to SL.
409.1 JC to revisit document with a GridPP-NGI-NGS structure, not use the document Dave
Wallom produced. JG will provide input. Visions for today and for the future.
424.3: DB to contact ALICE-UK about Tier-2 resources.
424.6: DC to complete CMS metrics - DC would circulate this after the meeting tomorrow.
424.10 DB to contact JG to suggest topics for CERN Meeting.
425.7 DC to have an internal discussion within CMS relating to use of future technology and
evolution of the computing model, from September to the next couple of years. DC to come up
with possible suggestion of theme/topics for GridPP27 at CERN.
425.8 AS to consider any longer-term issues relating to storage, DPM, databases etc, and come
back to DB with any ideas for sessions at GridPP27.
427.1 Re Tier-2 accounting figures: DB to contact RJ and ask him to explain why there were so
many jobs waiting at Lancaster, when they had such a large share available.
427.2 Re Tier-2 accounting figures: DB to contact RJ and ask him about Glasgow getting
production jobs from other clouds, when other sites don't. DB would also check with the Glasgow
team.
427.3 DB to circulate an email to the CB re the OC outcome and the finalising of GridPP3, and
point the CB at the documents. He would advise that a CB meeting might be useful in around 6
months' time, after the accounting period.
Forthcoming PMB meeting dates were as follows, at the usual time:
Mon June 6th
Mon June 13th
Mon June 27th
Mon July 11th
Mon July 25th
Mon Aug 8th
Mon Aug 22nd
Mon Sep 5th
TUE Sep 13th F2F@CERN
Mon Sep 26th