Dear All,
_-_-_-_-_-_-_-_-_-_-_-_-_-REMINDER_-_-_-_-_-_-_-_-_-_-_-_-_-_
Registration for GridPP28 at http://www.gridpp.ac.uk/gridpp28/
closes today.
_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-
Please find attached the minutes of the 453rd GridPP Project Management Board meeting.
The latest minutes can be found each week at:
http://www.gridpp.ac.uk/php/pmb/minutes.php?latest
as well as being listed with other minutes at:
http://www.gridpp.ac.uk/php/pmb/minutes.php
Cheers, Dave.
GridPP PMB Minutes 453 (27.02.2012)
===================================
Present: Dave Britton (Chair), Jeremy Coles, Pete Gronbech, Steve Lloyd, Pete Clarke, Tony Cass,
Robin Middleton, Andrew Sansum (Suzanne Scott - Minutes)
Apologies: Tony Doyle, Roger Jones, Dave Kelsey, John Gordon, Glenn Patrick, Dave Colling, Neil
Geddes
1. Summary of Quarterly Report Issues
======================================
PG reported that all of the Reports had been received except for CMS. DC was waiting on
information from RJ.
Red Metrics were as follows:
- the Tier-1 staffing situation was still an issue during Q4, but four new staff had started recently
and it was expected that this metric would move to amber for 12Q1.
- the Tier-1 ranking for ATLAS was red. AS advised that the current round of upgrades should
make a difference. PG noted that disk procurement was underway, modulo the floods and the
reduced cost. The DRI funding had helped.
- for ATLAS there was one red metric: data availability had dropped from 98% to 92%, and the
reason was not yet known. AS advised this was probably due to CASTOR issues and other minor
things; there had been Oracle database problems. PG advised that the AHM paper would suffice
for the milestone report.
The LHCb report was pretty good; there had been a couple of amber 'below target' metrics, but the
targets may need to be modified due to new practices. DB advised that we needed to re-visit the
Risk Register and that this should be put on the Agenda for the F2F meeting.
ACTION
453.1 PG to add the Risk Register to the Agenda for the upcoming F2F meeting at Manchester.
PG continued - for 'Other Experiments' the report was not too bad; there were a couple of amber
metrics which showed a drop to 63.3% (below the target of 75%). AS reported that overall job
efficiency was excellent, and that in context the drop involved only a small volume of data. DB
noted that he didn't want to drop the target - small VOs should try to increase their efficiency.
ACTION
453.2 AS to get the issue of small-VO efficiency (which should be increased) onto the Agenda at the
experiment liaison meeting.
DB noted that we needed to examine Tier-2 disk usage by non-LHC VOs, and return to this issue
later this year. PG to retain this issue on his list. AS noted that there had been a lot of demand
from LHC VOs which meant that spare capacity was down.
PG continued - re Deployment and Ops, there were some amber metrics but most were close to
target. DB noted that the Durham issue remained - he queried why Edinburgh and Lancaster did
not use such a large amount. JC advised that two-thirds of the Lancaster numbers were missing
from the accounting. DB thought that the problem was now resolved. PG noted that the
calculation had come from the ScotGrid report; for accounting we used the lower figure. JC would
check whether the number had been propagated through the accounting.
ACTION
453.3 JC to check the ScotGrid quarterly report to see whether or not the incorrect number had
propagated through the accounting system.
PG continued - all sites now had a CREAM CE installed; however, overall reliability and availability
had fallen. Oxford was now supporting ALICE. The UK CA had caused some problems.
For the Data Group there was one red metric: blog posts were low. One milestone was overdue,
with Argus deployment pending.
For Security there had been no incidents, but there was a red milestone - the security framework
from EGI was still under development.
For the NGI, work was continuing on APEL and the GOCDB; Durham had been marked red by EGI.
For Execution, manpower was low. PG queried the 'Year 1 review of service to the experiments'.
AS noted we were keeping track of this annually and that both points of view were required: what
the experiments received, and what we provided. The questionnaire was inadequate and the issue
required serious attention - a 5-minute response was not good enough. PG noted the metric had
been there originally but he was unsure of its origin. DB asked PG to give his conclusions and
recommendations after speaking to sources, and noted we could discuss this at the F2F. PG would
provide a bullet list for the PMB to address during the meeting. DB noted that the CPU and disk
provided were already covered by the metrics, so we did not need to ask again - we needed to
identify the high-level problems. PG advised that this could be summarised from issues arising
during the year, and could be delivered via a couple of slides at the meeting, in order to meet the
milestone.
ACTION
453.4 PG to provide, at the F2F in Manchester, a bulleted list of summarised issues which had
arisen during the year and were noted in the Quarterly Reports. This would meet the milestone
required.
PG continued - regarding Outreach, the website was failing to meet the target; there had been no
KE meeting and no press releases. DB advised that the onus was on us to help Neasan O'Neill meet
the targets - we should give him more of a platform at GridPP29 and help him reach his objectives.
DB noted that in this category it was no problem to have aspirations, even if they were marked
amber or red - we had been under-funded on Dissemination by STFC.
2. Tier-2 Disk
===============
DB advised that the current accounting policy discouraged Tier-2 sites from allocating disk to
non-LHC VOs; we needed to adjust the accounting algorithm soon to rectify this. SL noted we
needed to do the count first. PG advised that he had already asked the Tier-2 Co-ordinators to
provide that information. DB asked how we should weight the algorithm: if we wanted sites to
deploy 3% for non-LHC VOs, then disk needed to carry the same weight whether allocated to T2K
or to ATLAS (a sketch of one possible weighting follows after the action below). PG noted that the
non-LHC VOs could then be added in to the ATLAS sites. SL would think about this. DB noted it
would be good to resolve this before Manchester.
ACTION
453.5 SL to help resolve the issue of weighting for non-LHC VOs at the Tier-2s.
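To make the weighting question concrete, the following is a minimal sketch in Python. The VO
names, allocation figures and metric definitions are hypothetical; the actual algorithm was still to
be agreed.

  # Illustrative sketch only: under a metric that credits only LHC-experiment
  # disk, a site gains nothing by allocating space to T2K; weighting all VOs
  # equally removes that disincentive. All names and numbers are hypothetical.

  LHC_VOS = {"atlas", "cms", "lhcb"}
  NON_LHC_TARGET = 0.03  # the 3% share for non-LHC VOs mentioned above

  def lhc_only_credit(allocations_tb):
      """Current-style credit: only disk allocated to LHC VOs counts."""
      return sum(tb for vo, tb in allocations_tb.items() if vo in LHC_VOS)

  def equal_weight_credit(allocations_tb):
      """Proposed-style credit: a TB for T2K counts the same as one for ATLAS."""
      return sum(allocations_tb.values())

  def meets_non_lhc_target(allocations_tb):
      """True if at least 3% of the site's disk is allocated to non-LHC VOs."""
      total = sum(allocations_tb.values())
      non_lhc = total - lhc_only_credit(allocations_tb)
      return total > 0 and non_lhc / total >= NON_LHC_TARGET

  site = {"atlas": 970.0, "t2k": 30.0}  # hypothetical allocations in TB
  print(lhc_only_credit(site))       # 970.0 - T2K disk earns no credit
  print(equal_weight_credit(site))   # 1000.0 - T2K disk weighted like ATLAS
  print(meets_non_lhc_target(site))  # True - exactly 3% goes to non-LHC VOs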
3. AOCB
========
- GridPP29
DB had circulated suggestions - possibly Oxford next time? PG had spoken to Sue Geddes and
there was a lecture theatre available. The date could be 10-12 September 2012 - were there any
clashes? The AHM was also 10-12 September; ATLAS week was the first week of October; LHCb
week was 3-7 September. DB thought that the end of that week, 13 and 14 September, might then
be possible. Could PG check those dates instead? We would have the PMB on the 12th. [Note
added: GridPP29 dates now converging on the week of 24th Sep 2012.]
ACTION
453.6 PG to check lecture theatre availability for week of 24th September for GridPP29.
- EMI-2 early adopters
JC would meet with Daniela Bauer and Duncan Rand this week. Brunel had volunteered.
- FP7 Data Preservation Project
DB noted that DC/PC/RJ were all interested in this. This was a CSA (Coordination and Support
Action), therefore matching funding was not required. Should we get involved? PC had mixed
feelings, as the deadline was very close. It was peripheral to taking and analysing the data, so it
was borderline whether we should get involved; wearing a UK hat, however, PC thought that the
work had to be done. DB agreed on the money side - it was a lot of work for not much return - but
it was an area in which it was better if we were involved. DC and RJ were too busy; who could
express interest? DB would express some interest and lay out the constraints.
ACTION
453.7 DB to 'express interest' in the FP7 Data Preservation Project and to contact Jamie to check
the scope and what was required. PC, RJ and DC were interested on behalf of the experiments.
- ATLAS UK tutorial
RM advised that the cost of ~£2k for travel for this seemed fine. DB proposed PMB support. This
was agreed, subject to further detailed information from RJ.
STANDING ITEMS
==============
SI-1 Dissemination Report
--------------------------
SL reported on behalf of Neasan O'Neill:
1) The website was almost done; NO was working on getting it all live. Some of the updates, such
as the "Docs" page (http://www.gridpp.ac.uk/docs/), could already be seen.
2) NO was in Taipei this week, so would mostly be out of contact
3) Masterclasses had been confirmed: Daresbury, Oxford, UCL and QMUL
4) There would be no UK NGI/GridPP/NGS stand at Munich as NGS could not confirm funds. By
the time they had confirmed, there were no booths left, but we were on the waiting list.
5) NO was attending a meeting in Glasgow on the 10th of March about TuringFest and GridPP's
involvement - they wanted to do an entire session on the Higgs. Jamie Colman had contacted Mark
Mitchell, who suggested that we should involve Neasan O'Neill. We should identify some Grid
speakers. It was a good event with a technical audience.
SI-2 ATLAS weekly review & plans
---------------------------------
RJ was absent.
SI-3 CMS weekly review & plans
-------------------------------
DC was absent.
SI-4 LHCb weekly review & plans
---------------------------------
GP was absent.
SI-5 User Co-ordination issues
-------------------------------
GP was absent.
SI-6 Production Manager's Report
---------------------------------
JC reported as follows:
1) We received a COD escalation last week for a site reportedly exceeding a month in downtime.
Upon investigation it was found that the RAL node was not marked in-production and was in a
testing state for ATLAS. The issue now requires follow-up by the dashboard developers – it is not
possible to issue tickets to sites that are not in production so those sites need to have a special
status in the dashboard.
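As an illustration of the special status being requested of the dashboard developers, the following
is a minimal sketch; the status names, node names and data model are hypothetical, not the
dashboard's actual implementation.

  # Hypothetical sketch: nodes in a 'testing' state should be excluded from
  # downtime-escalation ticketing rather than treated like production nodes.
  from enum import Enum

  class NodeStatus(Enum):
      PRODUCTION = "production"
      TESTING = "testing"      # e.g. a node being tested for ATLAS
      DOWNTIME = "downtime"

  def ticketable(status):
      """Only in-production nodes may receive downtime-escalation tickets."""
      return status is NodeStatus.PRODUCTION

  nodes = {"ral-test-node": NodeStatus.TESTING, "ral-ce01": NodeStatus.PRODUCTION}
  escalate = [name for name, status in nodes.items() if ticketable(status)]
  print(escalate)  # ['ral-ce01'] - the testing node is not escalated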
2) A few sites are beginning to look at perfSONAR installations. GridMon nodes are also arriving,
but for now the recommendation is to install the nodes and leave them switched off, pending
configuration details and work on the policy.
3) Manchester is facing a period of poor availability as reported by the ops tests, due to a problem
(reported back to WLCG and the DPM developers on several previous occasions) with the SE put
tests, which are marked critical and fail even though there is sufficient space for the files. The
problem occurs because of a DPM bug: when the site's "other VOs disk server" is down or marked
read-only, other disk is not counted (a sketch of this accounting behaviour follows at the end of
this item). The site's view:
“… there is more than enough disk space available to cover those 22TB that are down due to a
dodgy raid card that needs replacing. I'd like to underline that that is one of the small file systems
not even one of the big ones.
The system doesn't see that because the space is reserved by space tokens and anything is
subtracted by the common space first and then in a weird way from the space tokens.”
Several sites work around this monitoring problem by creating a pool specifically for ops/sgmops
or by allowing ops to write into other reserved areas. The argument against setting up separate
pools is that ops is then not testing the most important filesystem(s). Similarly, sites set up CPU
queues for ops to ensure the tests can run. Some sysadmins argue that these tests then create an
environment that is not optimal for "real" work in order to return good results for operations grid
metrics. I raise this (once again) here so that the PMB is fully aware of the situation as it relates to
use of the ops VO for testing. The tests are useful and spot problems, though in some ways they
are clearly not ideal. For the SE tests there may be some scope to adjust what is critical, and this
can be fed back to EGI. In the meantime Manchester has responded that they "can remove space
from atlas so that these tests can run if the PMB doesn't remove them from the accounting".
The status of Manchester is about to trigger a non-performance ticket to the COD as the ROD ticket
cannot be extended beyond 30 days.
DB advised that we should not make a policy for Manchester that differed from that for other
sites. JC noted that the situation would need to escalate to a ROD ticket. DB noted that if other
sites had found appropriate workarounds then Manchester should do likewise.
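To illustrate the accounting behaviour described in the site's comment above, here is a minimal
sketch; the field names, figures and simplified arithmetic are assumptions for illustration, not
DPM's actual data model.

  # Illustrative sketch: capacity on a down/read-only disk server is charged
  # against the unreserved ('common') space before the space tokens, so a
  # small outage can make the ops area appear full even though plenty of
  # disk remains writable elsewhere. All figures are hypothetical.

  def reported_common_free(total_tb, token_reservations_tb, used_tb, offline_tb):
      """Free space the ops put test sees in the common (unreserved) area."""
      reserved = sum(token_reservations_tb.values())
      # Offline capacity is subtracted from the common area first:
      common = total_tb - reserved - offline_tb
      return max(common - used_tb, 0.0)

  # A pool with 100 TB in total, 70 TB reserved via space tokens, 8 TB of
  # common space in use, and a 22 TB disk server offline (the 'dodgy raid
  # card' case): the reported common free space collapses to zero.
  print(reported_common_free(100.0, {"ATLASDATADISK": 70.0}, 8.0, 22.0))  # 0.0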
4) The matter of disk space for other VOs (3%) is being checked, but the PMB needs to consider
how making this disk available is wrapped into the T2 metrics. As Steve (Lloyd) has pointed out,
at the moment "it's better for sites to have empty ATLAS disk than full T2K etc."
SI-7 Tier-1 Manager's Report
-----------------------------
AS reported as follows:
Fabric:
1) FY11 procurements
- D12 CPU delivery: expected to deploy to production in the next 24-48 hours.
- V12 CPU delivery: under RAL proving test; expect to deploy around 9th March.
- CV12 disk: some minor problems during vendor acceptance; expect to complete acceptance tests
9th March.
- V12 disk: expect to complete acceptance tests 9th March.
2) A power failure external to RAL on 14th February led to three racks of disk servers losing
power. This could not have happened at a better time, as the service was in scheduled downtime
for the CASTOR nameserver upgrade.
3) Work on the essential supply board of the UPS supply takes place on Tuesday and Thursday
this week, with an increased risk of loss of power or cooling to equipment in the UPS room.
4) The hardware intervention on the core C300 switch on 8th February was successful.
Service:
Many interventions have been scheduled - most are transparent or involve minimal disruption,
but a number of major interventions were planned during February, two of which (CASTOR and
the batch farm) carry high risk. ATLAS had a very difficult time in January owing to poor SRM
availability.
1) Summary of operational issues and scheduled interventions is at:
http://www.gridpp.ac.uk/wiki/Tier1_Operations_Report_2012-02-15
http://www.gridpp.ac.uk/wiki/Tier1_Operations_Report_2012-02-22
2) CASTOR
a) The upgrade to CASTOR 2.1.11-8 was completed successfully for ATLAS and CMS last week;
LHCb and Gen follow this week.
b) Ongoing SRM problems after the SRM upgrade have required an aggressive re-starter cron job
(see the sketch below). Work to identify the underlying cause will commence once we have
completed the CASTOR upgrades this week.
c) We expect to move the CASTOR database servers to their final hardware configuration in 1-2
weeks' time. This will require a 1-2 hour outage on all instances.
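As an illustration of what such an 'aggressive re-starter' might look like when run from cron, here
is a minimal sketch; the health-check endpoint and restart command are placeholders, not the
actual RAL SRM configuration.

  # Hypothetical watchdog: restart the SRM front end whenever a health check
  # fails. 'Aggressive' in the sense that it restarts on any single failure,
  # with no retries or back-off. Endpoint and command are placeholders.
  import subprocess
  import urllib.request

  HEALTH_URL = "http://localhost:8443/srm/ping"               # placeholder
  RESTART_CMD = ["/sbin/service", "srm-frontend", "restart"]  # placeholder

  def srm_healthy(timeout=10):
      """Return True if the SRM front end answers its health check."""
      try:
          with urllib.request.urlopen(HEALTH_URL, timeout=timeout) as resp:
              return resp.status == 200
      except OSError:
          return False

  if __name__ == "__main__":
      if not srm_healthy():
          subprocess.run(RESTART_CMD, check=False)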
3) The upgrade to the batch server was completed and was mainly successful; however, publishing
problems left LHCb unable to submit work in the latter part of last week.
4) Problems with the CREAM CE (zombie jobs) prevented LHCb from using the WMS for job
submission. We have a workaround, which will go into operation later today or tomorrow, to
delete zombie jobs, but do not yet have a proper solution. The problem has also been seen at LAL.
5) The move of the ATLAS LFC to CERN took place last week. The RAL LFC is no longer critical for
the ATLAS UK cloud.
6) We expect to schedule an upgrade of the FTS to EMI FTS 2.2.8 in the week beginning 5th March.
7) The MYPROXY upgrade on 9th February was reverted after problems were encountered. Now
suspected to be a hardware problem on the new target hardware rather than a problem with the
upgrade.
8) The CIP (CASTOR information provider) upgrade on the 9th February was reverted after
problems were encountered. Currently CIP is providing inaccurate disk capacity data. We are
reviewing how the Tier-1 supports and maintains the CIP.
Staff:
1) Grid team leader post. Ian Collier will lead the team. We will backfill Ian's post by recruiting a
new system admin for the Fabric team.
2) Recruitments
* Database post - offered and verbally accepted.
SI-8 LCG Management Board Report
---------------------------------
There had been no MB.
AOB
===
Re the DRI situation, PG to send an email to all PIs, reminding them that there were 5 weeks left
on the DRI spend.
ACTION
453.8 PG to send an email to all PIs, reminding them that there were 5 weeks left on the DRI spend.
REVIEW OF ACTIONS
=================
436.12 DB to produce a financial proposal for adjustments to the Tier-2 staffing profile over the
term of GridPP4.
438.8 TC to advise when it is a good time to move to Vidyo - early adopters were possible.
438.9 AS to contact relevant site managers to ask whether or not they would be interested in
having retired Tier-1 hardware - if a site were interested then they should submit a proposal as to
what they want and why. Ongoing.
448.4 ALL to send thoughts/suggestions to DB regarding the replacement for GP in the User Co-
ordinator position (not necessarily based at RAL).
448.7 RJ/PC to draw-up GridPP guidelines in relation to a Data Management Policy: RJ/PC to keep
abreast of Policy and inform GridPP as this develops.
449.1 AS to document the recent network incidents at RAL. Ongoing.
450.1 DC to send the CMS spreadsheet accounting numbers to December, to SL.
450.2 Re SL6, JC to come back to the PMB with regard to plans & schedules. Ongoing.
451.1 DB to respond to Gillian re the EGI Community Forum, noting GridPP's willingness to lend
support and to be involved, and to have a presence on the Organising Committee. DB to see if we
could co-locate a GridPP or NGI event. Done, item closed.
451.2 JG to respond to Tiziana Ferrari re the RC Forum and note that GridPP would like to be
involved. JG to consider how we contribute and report-back. Done, item closed.
451.3 PG/JC to look at non-LHC VO storage use at the Tier-2s and report back. Done, item closed.
ACTIONS AS OF 27.02.12
======================
436.12 DB to produce a financial proposal for adjustments to the Tier-2 staffing profile over the
term of GridPP4.
438.8 TC to advise when it is a good time to move to Vidyo - early adopters were possible.
438.9 AS to contact relevant site managers to ask whether or not they would be interested in
having retired Tier-1 hardware - if a site were interested then they should submit a proposal as to
what they want and why.
448.4 ALL to send thoughts/suggestions to DB regarding the replacement for GP in the User Co-
ordinator position (not necessarily based at RAL).
448.7 RJ/PC to draw-up GridPP guidelines in relation to a Data Management Policy: RJ/PC to keep
abreast of Policy and inform GridPP as this develops.
449.1 AS to document the recent network incidents at RAL.
450.1 DC to send the CMS spreadsheet accounting numbers to December, to SL.
450.2 Re SL6, JC to come back to the PMB with regard to plans & schedules.
453.1 PG to add the Risk Register to the Agenda for the upcoming F2F meeting at Manchester.
453.2 AS to get the issue of small-VO efficiency (which should be increased) onto the Agenda at the
experiment liaison meeting.
453.3 JC to check the ScotGrid quarterly report to see whether or not the incorrect number had
propagated through the accounting system.
453.4 PG to provide, at the F2F in Manchester, a bulleted list of summarised issues which had
arisen during the year and were noted in the Quarterly Reports. This would meet the milestone
required.
453.5 SL to help resolve the issue of weighting for non-LHC VOs at the Tier-2s.
453.6 PG to check lecture theatre availability for week of 24th September for GridPP29.
453.7 DB to 'express interest' in the FP7 Data Preservation Project and to contact Jamie to check
the scope and what was required. PC, RJ and DC were interested on behalf of the experiments.
453.8 PG to send an email to all PIs, reminding them that there were 5 weeks left on the DRI spend.
The next meeting would take place on Monday 5 March at 12:55 pm.