Dear All,
Please find attached the GridPP Project Management Board Meeting minutes
for the 414th meeting.
The latest minutes can be found each week in:
http://www.gridpp.ac.uk/php/pmb/minutes.php?latest
as well as being listed with other minutes at:
http://www.gridpp.ac.uk/php/pmb/minutes.php
Cheers, Dave.
--
________________________________________________________________________
Prof. David Britton GridPP Project Leader
Rm 480, Kelvin Building Telephone: +44 141 330 5454
School of Physics and Astronomy Telefax: +44-141-330 5881
University of Glasgow
G12 8QQ, UK
________________________________________________________________________
GridPP PMB Minutes 414 (31.01.11)
=================================
Present: Dave Britton (Chair), Sarah Pearce, Andrew Sansum, Steve Lloyd, Robin Middleton, John
Gordon, Jeremy Coles, Pete Gronbech, Pete Clarke, Glenn Patrick, Tony Doyle, Roger Jones, Dave
Kelsey (Suzanne Scott - Minutes)
Apologies: Tony Cass, Dave Colling, Neil Geddes
DB began by thanking SP for her input and contributions to GridPP over the years both within and
outwith her Job Description! Contributions had been collected for SP as a token of the
community's appreciation. SP thanked everyone for 7 enjoyable years, especially because of the
people in the Project, which had come a long way since 2004.
1. Agenda for GridPP26
=======================
DB had circulated a draft Agenda with the overarching theme of 'efficiency'. The opening session
would comprise a talk from DB, then PG would outline GridPP4. The third opening talk would be
from ATLAS; Graeme Stewart had been invited to give this as a keynote talk. Was RJ happy with
this? RJ confirmed that he was.
The 2nd session comprised complementary talks from the experiments, although ALICE had yet to
confirm. (Note added: ALICE is now confirmed).
The 3rd session of the first day traditionally comprised a discussion session, and this year it would
be on the GridPP4 Tier-2 algorithm. DB advised that the algorithm had to be discussed at the
forthcoming F2F in Lancaster, with input from the experiments.
PG asked if the algorithm could realistically be changed by the time it was discussed at GridPP26?
DB noted a number of possible outcomes: acceptance, modification, starting date and length of
run time; dissemination of funds and dates - we would need to have this discussion. PG noted that
if the start was April 1st, there was only one day's leeway. DB advised that if modifications were
required then the start date would not be 1st April. SL noted that we didn't want a long period of
changes re the algorithm. PG emphasised how important it was that sites were treated equally - at
the moment, regarding SL's proposal, this looked similar to last year, and could change with
experiment input. JC considered there should be a period of time after the announcement prior to
implementation.
PG asked what ATLAS wanted from sites. Did they want five equal sites which shared the load
equally, or did they want a disparity in size from large to small? RJ advised that they needed a
core of well-supported sites, and what we have currently worked well - it was a group of large
sites. PG pointed out that it was difficult for small sites to improve, and they didn't do well from
the algorithm. DB noted that we don't want a lot of sites that are the same size - a hierarchy of
sites is preferred. SL noted that feedback had already happened to some extent which had created
what we have currently. DB noted that we tried to design the support according to the best view
of the experiments - and this was a set of well-run larger sites, with other sites prepared to assist
as necessary.
The algorithm would need to be discussed first at Lancaster; it could be published then and
confirmed at GridPP26, with a view to starting on a named date. If we were not ready by
Lancaster then it could be published at GridPP26. JG asked about CB involvement? SL considered
there was no reason to involve the CB. DB noted that the CB view would be from a different angle
and at present our view was to best fit the experiment needs. DB noted in any case that an
information summary meeting was due for the Collaboration Board in order to wrap-up GridPP3
and start GridPP4. DB noted that the immediate step was for RJ and DC to give input to SL. SL
would make progress on this before the F2F at Lancaster.
There would be a one-hour discussion session allowed for the algorithm at GridPP26 and then a
storage discussion.
Day two of the Collaboration Meeting would commence with Tier-2 reports from the Tier-2
Co-ordinators, from a site perspective, followed by a discussion session around three themes, with
main points brought up. There would be a guest speaker re support and ticketing, and the final
area to be covered was data transfer.
The final session on day two would focus on the Tier-1: AS and Gareth would probably present.
DB wanted relevance to the experiments, the project overall, and the Tier-2. PG noted that some of
the Tier-2s would lose manpower in GridPP4 and they may not be keen on change. There ensued
a discussion on Quattor and fabric management. DB noted that when he contacts the Tier-2s, he
can ask them to think about fabric management tools. JC and AS should approach this issue in a
relevant way within their sections/talks.
DB asked for suggestions regarding the last talk of the meeting. AS suggested it should focus on
vision and objectives, as at the beginning of GridPP3 - where will we be and where do we want to
be by 2015? DB considered this to be a good idea and suggested it be extended to the Tier-1 talk -
where did the Tier-1 go in GridPP3, how did it evolve - Quattor, monitoring etc happened within
that time. Given the meeting context was 'efficiency' then the question could be asked: where do
we want to be in 2015 at the end of GridPP4? AS thought this vision was useful whilst we were
moving through the Project to each next checkpoint. DB noted 'vision 2015' - we needed a vision
statement from the Tier-1, sites, and the experiments. TD asked if we also needed the CERN view?
PC also thought that upgrade proposals might be useful? DB would think about the last session in
terms of 'vision' for the future. RJ suggested that Roger Gough could be invited from DELL.
2. Project Management Transition conclusion
============================================
SP reported that things had gone well; they had held regular meetings and covered all the areas they
had wanted to cover. PG had not yet done a budget, so questions were anticipated when the time came,
and SP would be available to assist. The Quarterly Reports, Project Map and personnel reporting were
all under control. PG thanked RJ and JG for their reports; he would be able to finalise the Quarterly
Reports soon. PG advised that he was still awaiting the report from CMS, and he also needed to
look at the manpower spreadsheet - the Tier-1 was the most complex and PG would meet with AS
to get the background to the current situation. PG asked if it might be easier for DC to delegate the
Quarterly Report? SP agreed that it would be good if DC could delegate the CMS Quarterly
Reporting. DB noted this would be discussed at Lancaster - lightweight but timely Quarterly
Reporting would be required in GridPP4.
3. Project Management Issues
=============================
PG had circulated a Project Map. DB outlined the history and the changes (versions) of this. We
could re-arrange the current one. There ensued a discussion on finances and layout. DB
suggested moving 'Grid Operations' to Work Package B, and moving Work Package A next to the
Experiments Tier-1. It was re-iterated that having two boxes per experiment did
not mean two separate reports. A single report could incorporate all tasks. The Project Map was
primarily a tool for the Project Manager. PG agreed to try making the changes as suggested by DB
and see how this worked out.
4. Publishing VO shares
========================
There had been an email discussion regarding publishing in GSTAT2. JC noted that sites wanted
some element of freedom of reporting. It was understood that they could do so in principle, though
apparently not according to the documentation. PG reported that the documentation was falling
behind reality at present - the document needed updating, and it was agreed that we should publish
something that makes sense. JC noted that SL's spreadsheet was different at different stages of the
project - the hardware allocations should be taken into account; also, the ALICE figures weren't correct. It was
agreed that JC should publish according to SL's spreadsheet as discussed.
STANDING ITEMS
==============
SI-1 Tier-1 Manager's Report
-----------------------------
AS reported as follows:
Fabric:
1) FY10 procurements
- Disk tender - accepted!
- CPU tender - all delivered. Acceptance testing has started on V10 (scheduled to complete 8th
Feb). CL10 problems resolved and expect them to complete, supplier proving test this week.
- Tape drive and media purchase still outstanding, waiting for hardware availability. Expect to
finalise plan early this week.
2) The removal of the SL08 disk servers is complete (reported verbally last Monday). Agreed plan
of action with supplier. Load test did not start last week as planned - increasing priority of work.
Service:
A quiet week operationally.
1) Summary of operational issues is at:
https://www.gridpp.ac.uk/wiki/Tier1_Operations_Report_2011-01-26
2) We have commenced a 2 day downtime:
- CASTOR database upgrade
- network intervention to add new address space for CPU nodes and increase internal links
- CMS disk servers upgrade to SL5 (64bit)
- Batch server O/S update
3) Large queues of batch jobs built up last week, waiting for free batch slots. This was traced to a
CE information publishing problem (its interaction with VO job submission).
4) Bad checksum files continue to be an operational problem. Manual deletions required by VO
and emergency interventions by us to ungum tape migration. We will consider an emergency
change to gridftp ASAP once we have a solution ready.
SI-2 Production Manager's Report
---------------------------------
JC reported as follows:
1) In the deployment team meeting last week there was a brief discussion of the GridPP4
accounting metrics to be used for Tier-2 hardware allocation, and the period that will be used for
the assessment. Apparently a commitment was made at the Collaboration Board to publish the
metrics “well in advance”. Please could we indicate to sites the timeline for sharing the metrics –
discussion at GridPP26 (29th-30th March) for a period starting 1st April is rather late.
2) In relation to the publishing of VO shares issue, we have now received a GGUS ticket from the
WLCG Information Officer (Flavia Donno) https://gus.fzk.de/ws/ticket_info.php?ticket=66564.
The shares will be discussed at tomorrow’s deployment team & sites meeting. To a first
approximation we will use the 2nd tranche hardware allocation figures. Is this acceptable to the
PMB providing the WLCG Tier-2 per VO pledges are met?
3) A VOMS intervention at CERN last week was unsuccessful, leading to the server supporting the
ops VO being down for longer than the original 2 hr downtime. Ops VO proxies are valid for 4 hrs, so the
concern here is that globally site availability/reliability metrics would have been affected. Does
the MB proactively correct for this sort of effect? Fortunately last week the server returned just
before the (UK ops) proxy expired.
4) Sites continued upgrades to their ATLAS Frontier squids last week amid concerns about the
level of customisation in the rpm and lack of documentation provided for the installation. Several
sites broke their services. A savannah request was submitted to request improvements in the
documentation.
5) Two additional issues/concerns around GOCDB4 have been raised. It is no longer possible to
tag site services as pre-production or test, therefore any site that is trying a new release will get
ticketed for all resulting problems seen in the monitoring (many would be seen quickly by the site
anyway) and these have to be treated as normal tickets by the ROD team. That is, sites that are in
the critical state cannot have their alarms closed, even if they are testing a release. The extra
critical tickets then impact the region performance metrics. Was there a consultation ahead of
such a change and will (or can) this happen for all central grid tools?
SI-3 ATLAS weekly review & plans
---------------------------------
RJ reported that the downtime for the Oracle upgrade for CASTOR had affected ATLAS. They
were due to perform a local file catalogue upgrade due to this as well, probably in February.
There would be a 6-hour outage that would take down the cloud. In August they might then move over
to a central LFC system; the UK backup needed additional licences, so this would not be a priority. A
full instance for the TAG database on the LFC was preferred (there was no timescale at present as
the tools were not available) - they would need to look at the pros and cons for UK operations. RJ
reported on another issue re software installations: at RAL all was good/green but they were
missing software releases. A ticket was in for this. The installations were being done by hand. If
there were many releases, and production work goes from the Tier-1 to the Tier-2 and fails, it can
be because the Tier-1 have not upgraded the software release. MC production was proceeding
with heavy ions pending. RJ advised that there were issues on certificates; the Quarterly
Reporting had been late due to changes onto the new dashboard - the information was incomplete
and inconsistent, so they had moved back to the old dashboard but this was being discontinued.
The metrics previously used were not so reliable, therefore metrics were amber in the Quarterly
Reporting. The ATLAS production dashboard had its own problems internally and was also
inconsistent with the WLCG figures for the quarter.
SI-4 CMS weekly review & plans
-------------------------------
DC was absent.
SI-5 LHCb weekly review & plans
--------------------------------
GP reported that the queued jobs issue had now been resolved; it had been a good week; they
were tidying up their disk space.
SI-6 User Co-ordination issues
-------------------------------
GP noted there was nothing to report.
SI-7 LCG Management Board Report
---------------------------------
DB noted there had been no meeting.
SI-8 Dissemination Report
--------------------------
SP noted there was not much to report; LHC@Home was moving from QMUL back to CERN.
AOB
===
DB reported that he was due to have a 'phone meeting with Tony Medland this afternoon.
Funding was due to be released for the rest of GridPP4, however the Tier-2 hardware funding
might be an issue. The RAL staffing could be finalised. DB would email details of the meeting or
he would update the PMB at the next meeting.
Next Monday's PMB was CANCELLED: there would be NO meeting on Monday 7th February. DB
noted he was not available the following week, 14th February, so if possible JG could Chair. This
meeting may also be cancelled due to the upcoming F2F at Lancaster.
ACTIONS AS AT 31.01.11
======================
398.7 Re the GridPP Security Policies - DK advised that EGI formal signoff had now been given, he
would update the GridPP website pages.
400.4 SL to co-ordinate changing the current GridPP MoU towards an MoU for GridPP4.
409.1 JC to revisit document with a GridPP-NGI-NGS structure, not Dave Wallom’s. JG will
provide input. Visions for today and for the future.
409.2 GP to produce new role description for the Chair of the UB.
411.1 DB to organise an Agenda around the theme of 'Efficiency' for GridPP26 at Sussex.
411.3 SL to co-ordinate with RJ, DC, and GP, regarding monitoring site performance and
distribution of GridPP4 funds, and provide a draft document to which the PMB could respond.
This should be finalised at the F2F meeting in March, in relation to how much money was to be
allocated. We would need a starting point by the F2F in February. SL was awaiting input from RJ
and DC - they need to respond ASAP.
412.3 JG to check with AS and RJ re the issue of the Tier-1 continuing to provide LFC services (the
issue here was extra effort, a proposal was required).
413.1 RM to check the travel budget in relation to contributing to the costs of being involved with
the Royal Society Summer Science Exhibition, in conjunction with Birmingham/Cambridge.
413.2 DB to contact Karl Harrison and confirm GridPP's involvement in the Royal Society
Exhibition, noting a contribution in terms of a possible demo, manpower, and promotional
materials.
413.3 JG to find out at the EGI meeting today if there was a GOCDB4 failover still in existence (the
last one ended with EGEE-III).
413.4 Regarding GSTAT2 publishing and sites filling-in the numbers as per SL's spreadsheet table
showing the fraction (ie: publish the theoretical model in GSTAT) - PG to send the relevant
spreadsheet to JC so that dTeam could progress this.