Dear All,
Please find attached the GridPP Project Management Board Meeting minutes
for the 403rd meeting.
The latest minutes can be found each week in:
http://www.gridpp.ac.uk/php/pmb/minutes.php?latest
as well as being listed with other minutes at:
http://www.gridpp.ac.uk/php/pmb/minutes.php
Cheers, Dave.
--
________________________________________________________________________
Prof. David Britton GridPP Project Leader
Rm 480, Kelvin Building Telephone: +44 141 330 5454
School of Physics and Astronomy Telefax: +44-141-330 5881
University of Glasgow
G12 8QQ, UK
________________________________________________________________________
GridPP PMB Minutes 403 (25.10.10)
=================================
Present: Dave Britton (Chair), Sarah Pearce, Tony Doyle, Jeremy Coles, Andrew Sansum, Steve
Lloyd, Roger Jones, Glenn Patrick, Dave Kelsey (Suzanne Scott - Minutes)
Apologies: Tony Cass, Robin Middleton, John Gordon, Pete Clarke, Dave Colling, Neil Geddes
1. GridPP26
============
DB reported that there were problems with booking GridPP26 at Sheffield. The University
student accommodation was not available and a hotel had been recommended instead; however,
the cost was prohibitive, with accommodation alone at around double our usual cost. There
was also a snooker tournament due to take place in Sheffield at the same time as GridPP26, which
would mean that other hotel accommodation would be difficult to source, and probably expensive
as well. DB suggested that we think about going somewhere else - he could raise it with
Manchester or we could try Sussex? DB noted that he had mentioned the possibility to
Manchester in general terms a few weeks ago. Comments? GP advised that the Sussex campus
was some way outside of Brighton. SL noted that it might be a good idea for him to approach
Sussex in the first instance, as we had no contact with them outside of the CB. This was agreed.
DB advised that it might be possible to be flexible about the date, possibly moving to the end of
March, but this was during term time, when accommodation would be unavailable. It was agreed that SL
would contact Sussex and ask about our original dates of 18-21 April 2011.
ACTION
403.1 SL to contact Sussex and enquire about the possibility of them hosting GridPP26 in April
2011.
2. ATLAS adaptive data placement
=================================
RJ advised of new data placement models within ATLAS - these were adaptive to the user needs.
Accessing downloads as required had been trialled in the US and elsewhere, and it seemed to
work, with the knock-on effect of reducing network traffic. The model resulted in more specific
usage and data was only moved when it was needed. RJ noted that his initial concerns about a
potential network issue had been allayed, although problems were still possible. There was a
request from ATLAS to do this in the UK. It was noted that Tier-2 usage was falling
below capacity and with this model you could have multiple copies and make better use of the
resources. DB thought it sounded sensible but asked why the transfers had to run as a user job at
the Tier-1? RJ noted you shouldn't need to do that but that was how it had been implemented - it
was a transfer from the Tier-1 to the Tier-2. RJ advised that the UK were the only people
following the 'correct' model now but he suggested we do this. DB commented that it did open the
door to users to submit any job to RAL. RJ noted yes, but the slots needed to remain open, so the
jobs would be throttled-back - he believed this could be done technically. Graeme Stewart was an
advocate of running user jobs on the Tier-1 but it did pose a risk to the organisation. TD asked if
it could be limited to a subset of users only? RJ noted no, any user could run jobs. SL commented
that his tests would start to run at RAL again as a result of this. RJ advised that tape access could
be an issue. AS noted that they could not gain tape access through the normal tools but there
would be a Nagios check. For LHCb the user jobs were not a problem. AS thought the model
should not be too much of a problem. RJ noted that the load on the software server could be a
concern, but we should keep this separate from production, therefore it was less of a risk. DB
asked what the likelihood was that in eight weeks' time they would need more job slots? RJ noted
yes, this was likely - they had started at 100 in the US and had increased it, so he agreed that more
might be needed. DB advised that if we proceeded with this we would need to be clear that this
was a specific solution to a PD2P problem, not a change in Policy - this was not an automatic
increase to the number of slots in order to solve a backlog, unless to PD2P, and overall it was to
the benefit of the Tier-2. RJ noted that this did not open up the Tier-1 for analysis - the jobs were
only for PD2P. AS advised that the other issue was not allowing access to data that needed to be
maintained. AS also noted that this usage would need to be tracked. DB agreed, noting that it
would be good to keep a watching brief on the situation. The proposal was agreed. AS asked that
the timing of this waited until next Tuesday as they were due to do the CASTOR upgrade. This was
agreed. RJ would feed this back to the Operations Meeting today. TD asked if this ATLAS adaptive
data placement at RAL was temporary? RJ noted that techniques might evolve to do this in a
factored way but they didn't exist yet. It allowed a small number of slots to be available rather
than allocating a 'Super User' status. TD advised that communication with users would be an
issue - we needed to broadcast this via ATLAS and GridPP channels. RJ noted yes, he would do
this.
ACTION
403.2 RJ to broadcast the move to ATLAS adaptive data placement at RAL, specifically for PD2P
only, via ATLAS and GridPP standard channels.
3. Quarterly Reporting Status
==============================
DB asked what was the status of the Q3 reporting? SP advised that she had received some reports,
some were due at the end of October. The ATLAS and CMS reports had been due last week.
People were working on these, but SP reminded members that the reports were urgent and due as soon as possible.
4. Minor Items
===============
- F2F meeting: DB suggested that the PMB hold a F2F meeting before Christmas. The Oversight
Committee meeting was on 10th December at QMUL, and by then we may know about the CSR
impact. DB advised that we needed to think about the GridPP4 detail with respect to deliverables
and reporting. It was agreed to pencil-in a F2F meeting for 9th December at QMUL. SL to book a
room from 11am until 5pm.
ACTION
403.3 SL to book a room at QMUL for the PMB F2F meeting from 11am to 5pm on 9th December,
prior to the OC on 10th.
STANDING ITEMS
==============
SI-1 Tier-1 Manager's Report
-----------------------------
AS reported as follows:
Fabric:
1) FY09 procurements:
- SL09 tranche continues in acceptance test - expected to complete 5th November.
2) FY10 procurements:
- Disk tender - orders placed. Delivery late November.
- CPU tender - orders placed
- Various small system purchases being made
Service:
1) Summary of operational issues is at:
https://www.gridpp.ac.uk/wiki/Tier1_Operations_Report_2010-10-20
2) CASTOR
The LHCb CASTOR instance is generally working well and has sustained rates of up to 1000 MB/s
(about the previous peak). Problems with file status info not being updated have been resolved (two
separate problems: one was workload on the stager database server, the second was an error in
the upgrade where multiple stagers were started but not authorised). A change will be scheduled
to move all the disk servers to 64 bit in order to fix the checksum problem reported last week.
The gen instance upgrade is proceeding and is currently on schedule.
3) ATLAS adaptive data placement at RAL
ATLAS intend to commence limited user analysis work at RAL in order to support the data
placement service for the UK cloud. Although primarily an ATLAS decision a change request was
submitted and reviewed by the change team. Key issues identified were:
a) They plan to use the CERNVMFS service which is still a development service.
b) ATLAS have noted that there are insufficient controls in CASTOR to prevent user jobs
accidentally deleting data if the standard ATLAS tools are bypassed.
We plan to move the proxy servers that support CERNVMFS to production this week (but
CERNVMFS itself remains test). We have flagged the deletion problem as an urgent issue which
cannot be fixed before user work starts and therefore remains a residual risk. It's important to
highlight that the RAL ATLAS user service is experimental and we will have to feel our way in
carefully as we gain experience.
SI-2 ATLAS weekly review & plans
---------------------------------
RJ reported that apart from the data placement issue, there was not much more to report.
Reprocessing was going through, they were aware of changes coming, but there were no other
issues.
SI-3 CMS weekly review & plans
-------------------------------
DC was absent.
SI-4 LHCb weekly review & plans
--------------------------------
GP reported as follows:
LHCb status: Reasonably smooth week for UK/RAL.
1) RAL T1 operating with limits of 3 job starts/minute and 800 simultaneous batch jobs.
2) Disk server (gdss463) taken out of service for a day - backplane replacement on 19 October.
3) Upload problems continue at Brunel.
DB asked if this was a long way from the LHCb spec? GP advised that limits would be increased
this week for the job starts, and they were trying to throttle the number of jobs, and would see
how it goes. DB asked if there was still a perception that the UK remained a problem? GP advised
probably yes, but things were progressing now and we were not blacklisted. AS advised that the
workload was variable but there was no cause for concern at the moment - all looked ok. DB
asked if GP considered there was further public relations work to be done? GP thought things
were really ok, everything had already been covered, and the other Tier-1s had different
problems.
SI-5 Production Manager's Report
---------------------------------
JC reported as follows:
1) Another RHEL5 vulnerability has been identified (this affects derivatives like
SL5/SLC5/CentOS5). It was patched for RHEL5 on Friday (22nd October
https://rhn.redhat.com/errata/RHSA-2010-0787.html) and sites are in the process of rolling it
out. The vulnerability allows a user to escalate their privileges.
2) At the last LHCOPN meeting, the LHCOPN community was mandated to design a solution to
improve network connectivity for the LHC Tier2s. Anyone interested in actively participating in
the discussions can now join the discussion list via
https://e-groups.cern.ch/e-groups/Egroup.do?egroupId=218645.
RM couldn't attend the last meeting and had sent round a report. DB noted that we were 17 sites
now rather than 4 x Tier-2s - he had flagged this to RM that they should discuss sites rather than
Tiers. JC noted he would join the list.
3) In the deployment team we are currently trying to match storage pledge figures in gstat with
those agreed in GridPP for 2010/2011. A useful reference page is
http://bourricot.cern.ch/dq2/accounting/federation_reports/UKSITES/. The pledge figures
shown in gstat appear not to be close to those in the 2nd tranche allocations spreadsheet
circulated (i.e. the agreed GridPP pledges). What figures were sent to the WLCG project office?
4) A new GOCDB4 interface came into production during the week of Monday 11th October. There
were some initial problems, the most significant being that sites could not log new downtimes for
the entire site. This issue was quickly resolved and the service appears to be running smoothly.
SI-6 LCG Management Board Report
---------------------------------
DB reported that the next MB was tomorrow but there was no Agenda set. For the one which took
place two weeks ago, there had been nothing relevant in the Ops Report; there had been a report
from the CRRB; there had been an EMI discussion which JG may have attended. AS commented
that the Ops Reports were always issued either very late or were not available at all, and he
advised that they also get circulated differently each time. These things meant that giving
feedback was difficult. DB noted he could raise this with Jamie Speirs. AS thought that an earlier
report would be helpful as, if it was too late, he couldn't give any useful comment.
SI-7 Dissemination Report
--------------------------
SP reported that Neasan O'Neill had attended CHEP 2010. The Stand had been quiet, but there
weren't many stands there overall, and the location of the GridPP stand had not been ideal. NO
would provide a report and conclusions in due course. There was also a news item on NorduGrid
coming.
AOB
===
TD advised that the EPSRC call was about to close. Akram Kham had asked to refer to the GridPP
section published on Cloud Computing, within his application - was this ok? It related to a
six-month pilot funded by JISC and EPSRC. DB noted that in the GridPP4 proposal there had been a
section on Cloud Computing. It was agreed that AK could reference this if he wished. TD would let
him know.
REVIEW OF ACTIONS
=================
384.6 TD/JC to take the lead on the 'GridPP to NGI' document that addresses the forward-moving
technical and other issues from a GridPP perspective. JC was gathering info. It was noted that the
recipient was likely to be Dave Wallom. Deadline of late November for discussion. Ongoing.
397.1 AS to provide a high-level summary of the Disaster and Business Continuity Plan - by
November 15th latest - and also provide a web link to further more detailed documents. Ongoing.
398.6 DC to provide updated LondonGrid MoU. DC reported that the meeting had happened, the
LondonGrid MoU had been discussed, DC would incorporate comments. Ongoing.
398.7 DK to check that all is up-to-date in terms of GridPP Security Policies - email DB. If there
are any issues, DK to let DB know. DK reported that the GridPP Security Policy phase was ongoing
at present, however other policies had been approved by LCG. DK advised that EGI formal signoff
was awaited, then the GridPP pages would be updated. Ongoing.
398.9 RJ to provide an updated NorthGrid MoU (only requires to be modified in relation to
EGEE/EGI). Meeting will take place 3rd week in October, it will be done then. Ongoing.
398.10 RJ/Graeme Stewart to provide urls of the place(s) where info is located re ATLAS site tests
and measurements (so that sites understand what they're being measured on). Ongoing.
398.12 TD/DB to make renewed efforts to engage someone at Glasgow to tackle GridMon and to
have access transferred in order to ensure the instances were up-to-date and running ok - DB
would insist on a meeting with Mark Leese for a handover. To be done by the end of GridPP3.
Ongoing.
398.13 DB to consider how to evolve the User Board into a useful meeting in the future, DB to
initiate in the timeframe between now and GridPP4. Ongoing.
400.2 JC to confirm that priorities have been documented for the major experiments for
recovering files from disk servers. Ongoing.
400.4 SL to co-ordinate changing the current GridPP MoU towards an MoU for GridPP4.
401.4 JG to progress issue of end-to-end network problems and the requirement for someone
neutral and part of central management, who had a good overview and who could solve problems
from a 'middle' position - JG to progress this at GDB. Ongoing.
402.1 Action on the PMB re ticket workflow in the UK in relation to NGS/NGI: tickets were ending
in dead ends. This action should be moved to JC/JG. Ongoing.
402.2 JC/JG to provide status report on EGI/NGI Service Level Agreements in the context of
GridPP agreeing with the level of service provided, ensuring that it is as GridPP requires. Ongoing.
ACTIONS AS AT 25.10.10
======================
384.6 TD/JC to take the lead on the 'GridPP to NGI' document that addresses the forward-moving
technical and other issues from a GridPP perspective. JC was gathering info. It was noted that the
recipient was likely to be Dave Wallom. Deadline of late November for discussion. This should be
on the F2F Agenda for 9th December meeting.
397.1 AS to provide a high-level summary of the Disaster and Business Continuity Plan for input
to the next OC meeting - by November 15th latest - and also provide a web link to further more
detailed documents.
398.6 DC to provide updated LondonGrid MoU. DC reported that the meeting had happened, the
LondonGrid MoU had been discussed, DC would incorporate comments.
398.7 DK to check that all is up-to-date in terms of GridPP Security Policies - email DB. If there
are any issues, DK to let DB know. DK reported that the GridPP Security Policy phase was ongoing
at present, however other policies had been approved by LCG. DK advised that EGI formal signoff
was awaited, then the GridPP pages would be updated.
398.9 RJ to provide an updated NorthGrid MoU (only requires to be modified in relation to
EGEE/EGI). Meeting will take place 3rd week in October, it will be done then.
398.10 RJ/Graeme Stewart to provide urls of the place(s) where info is located re ATLAS site tests
and measurements (so that sites understand what they're being measured on).
398.12 TD/DB to make renewed efforts to engage someone at Glasgow to tackle GridMon and to
have access transferred in order to ensure the instances were up-to-date and running ok - DB
would insist on a meeting with Mark Leese for a handover. To be done by the end of GridPP3.
398.13 DB to consider how to evolve the User Board into a useful meeting in the future, DB to
initiate in the timeframe between now and GridPP4. This should be on the F2F Agenda for 9th
December meeting.
400.2 JC to confirm that priorities have been documented for the major experiments for
recovering files from disk servers.
400.4 SL to co-ordinate changing the current GridPP MoU towards an MoU for GridPP4.
401.4 JG to progress issue of end-to-end network problems and the requirement for someone
neutral and part of central management, who had a good overview and who could solve problems
from a 'middle' position - JG to progress this at GDB.
402.1 JC/JG to address the issue of ticket workflow in the UK in relation to NGS/NGI, to clarify
what the support process is, as tickets were ending in dead ends.
402.2 JC/JG to provide status report on EGI/NGI Service Level Agreements in the context of
GridPP agreeing with the level of service provided, ensuring that it is as GridPP requires.
403.1 SL to contact Sussex and enquire about the possibility of them hosting GridPP26 in April
2011.
403.2 RJ to broadcast the move to ATLAS adaptive data placement at RAL, specifically for PD2P
only, via ATLAS and GridPP standard channels.
403.3 SL to book a room at QMUL for the PMB F2F meeting from 11am to 5pm on 9th December,
prior to the OC on 10th.
The next PMB will take place on Monday 1st November at 12:55 pm.