Dear All,
Please find attached the GridPP Project Management Board
Meeting minutes for the 488th to 490th meetings.
The latest minutes can be found at:
http://www.gridpp.ac.uk/php/pmb/minutes.php?latest
as well as being listed with other minutes at:
http://www.gridpp.ac.uk/php/pmb/minutes.php
Cheers, Dave.
GridPP PMB Minutes 488 (18.02.2013)
===================================
Present: Dave Britton (Chair), Tony Doyle, Pete Gronbech, Andrew Sansum, Jeremy Coles, Tony
Cass, Pete Clarke, Dave Colling, Roger Jones (Minutes - Suzanne Scott)
Apologies: Dave Kelsey, Steve Lloyd, Claire Devereux, Neil Geddes
0) Closure of AFS Service
==========================
AS had circulated a report giving an overview of the history of this. In 2007 the Tier-1 Board had
said we should shut it down, some complaints had been received from the User Board regarding
users. A slight upgrade had been effected to keep it going. The issue had come round again now
as the hardware was fairly old. We needed to decide what to do, as the AFS Service did not fit-in
with the Tier-1 model and the community had not found it useful. Recently there had been
fileserver problems and there were still come users, however usage was limited. Moving forward,
to maintain it, we would need to invest effort in staff, upgrades, and also the user registration
process, however we probably could not support it at that level. No funding was available. The
logical outcome was to close it. DB asked if there were a use case for the AFS as part of the core
mission of the Tier-1? AS noted no. DB therefore considered it to be peripheral and the service
did not need to be run. It might affect some individual users if it were closed.
DB considered we should turn it off unless we had to respond to an urgent issue. This was agreed.
AS would broadcast the notification. DB noted there was no defined use case, therefore there was
no justification for refresh and manpower. We would announce the termination of the service and
see what the outcome was. This was agreed. PC asked whether a long process of advertisement
would be required? AS advised that he preferred to keep this to within a four-month period. AS
would send out a notification and reminders.
ACTION
488.1 AS to notify the community, giving three months' notice, that the AFS service would be shut
down.
1) Quarterly Report Summary
============================
PG had circulated a report. Compared with the previous quarter the experiments and the Tier-1
were green. There were some reds at the Tier-1. The LFC and FTS service fell below the 99%
target.
The CMS VO box metric was no longer required. Regarding storage, there had been similar drops
due to power outages. A lot of effort had been put into power incidents and upgrades. The
CASTOR staff levels were critical. Jens Jensen was working on three recruitments at the moment.
For ATLAS all metrics were green. RJ advised that ATLAS use of resources was not all green but
the site performance was acceptable. For CMS all metrics were green. All was OK apart from
Bristol. DC noted that we needed to tread carefully with this site due to manpower and other
issues. There were storage issues to be resolved. There had been an improvement however.
For LHCb all metrics were green. RAL had performed excellently during the Quarter. For 'Other'
experiments all metrics were green. There had been the addition of the EPIC VO. NGS VOs had
been added onto the VOMS Server. T2K had increasing storage requirements. There were LFC
support issues.
For Ops, everything was going fairly smoothly. There had been some upgrade issues. For
DataGroup all metrics were green. For Experiment Support all metrics were green.
2) EGI Fees
============
It was noted that JISC would not pay the current year's EGI fee for the UK, therefore there had
been a request that £60k be funded by someone other than JISC, i.e. GridPP and NGS. NGS could
pay half and it was requested that GridPP pay £30-35k. The only mechanism available was out of
the travel budget. DK could let us know whether this was feasible. DB asked for comments. RJ
asked what we got in return for the EGI fee? DB advised that staff were funded by EGI, there were
a few FTE and 4 x 0.5FTE at the Tier-2 which were funded by EGI. DB advised that matching effort
was also required - other people reported time into the PPT timesheet system. There was one
year left of EGI, which had followed-on from EGEEII and EGEEIII. DB needed to speak to DK and
STFC before taking action. JC advised that Ireland hadn't paid and they had withdrawn. The
Portuguese payment was delayed. 4 x FTE were also on APEL and the GocDB, which we relied on.
It was agreed that DB should speak to DK/STFC regarding the EGI fee payment and let AS know.
ACTION
488.2 DB to speak to DK/STFC regarding the EGI fee payment and let AS know.
3) Horizon 2020
===============
It was noted that the EU were widening their search for experts in all fields for Horizon 2020
proposals. Had anyone responded to the call? No-one had. Did anyone wish to volunteer? There
were strategic priorities and an Agenda to be discussed. If anyone did wish to get involved they
should let DB know.
STANDING ITEMS
==============
SI-O Report from Cloud Group
=============================
DC advised that meetings were happening fortnightly; the twiki was in progress; hardware was
limited so far. DC reported as follows:
Organisation is settling down and we have fortnightly meetings. There is a growing twiki and a
community is starting to form. There is an ongoing discussion between Ian C., JC and DC about how
best to form this into a community. Other cloud sites within GridPP are encouraged to add a
description and link on the GridPP twiki site (as ECDF already have done).
Physical Hardware (description from Adam H.)
---------------------------------------------
Four nodes are in the process of being provisioned as extra compute nodes. Each has
64GB RAM and 32 cores with HyperThreading. This will bring the total compute capacity to:
200 Cores
400 GiB RAM
A storage node is also being provisioned, to provide an S3-compatible service. The raw usable
capacity (before any reduction for replication) is:
20 TiB
Further storage may be added using space on other nodes in the cluster, if the loading on single
machines is such that multiple roles can be accommodated safely. The cluster is running
OpenStack Folsom. Hosts have been added to the Imperial monitoring system. It is planned to
provide monitoring of individual instances too.
Activities:
-----------
Cloud Storage testing
---------------------
None of the LHC experiments are currently using cloud storage; however, storage is being added to
the cluster so that Wahid can perform some tests.
ATLAS
-----
Nobody was able to report on ATLAS activities at the last meeting, but at the previous meeting Peter
L. had reported that he was planning to use Cloud Scheduler. As yet there have been no images on
the GridPP cloud from ATLAS, but I believe that Peter had some configuration to do in Lancaster
before he would try anything at our end.
CMS
---
CMS have been very active both with the GridPP cloud and the HLT farm at CERN.
The HLT farm has run ~4000 concurrent reprocessing jobs; however, under that loading the jobs
started to fail. This is believed to be a simple network bandwidth problem, as the data was going
over the 1Gb/s pipe rather than the 10Gb/s one. After the low energy run Andrew L. and Toni (from CERN)
are to map the requirements of the reprocessing jobs and then rearrange the network as needed.
These jobs are submitted from a glideinWMS sitting at CERN. Data are read in and out over xroot.
In the UK, user data analysis jobs are now being submitted using the regular CMS tools, going via
the glideinWMS at RAL and being run on the GridPP cloud at Imperial. It should be noted that the
glideinWMS not only controls the jobs but also performs the instantiation of the VMs themselves. Data is
read in using xroot and staged out using conventional grid tools. Currently there is a problem that
some jobs fail because of stage-out timeout problems. This is being investigated.
LHCb
----
Andrew MN. described that LHCb at CERN were using the hampster set-up to create individual
VMs on the Agile Infrastructure, and he was going to try doing something similar in Manchester
and then possibly on the GridPP Cloud.
Relations with other Cloud projects
------------------------------------
We are in the process of joining the EGI Federated Cloud and had a 'phone meeting with Dave W
and Matteo last Friday. This sounds as though it will be about 2 weeks of work, which would mean
that we would be part of the demo at the User Forum. We will then look at trying to run CMS jobs
(and hopefully those of other VOs if effort is available) on the EGI FC.
We have been in touch with Helix Nebula: we will be a resource provider via the EGI FC, but
will also be part of a dialogue with Helix Nebula on how they can work with national structures
such as GridPP and national funding agencies, especially concerning hybrid cloud models.
Security
--------
Regarding security, John the Security Officer had agreed to take on the security remit for the Cloud
Group as well.
SI-1 Dissemination Report
==========================
There was no report.
SI-2 ATLAS weekly review & plans
=================================
RJ confirmed there had been minor issues and that space tokens were filling up. Group
production was being done by those who might not know how their jobs would behave; this meant
space tokens were being used up and it needed to be sorted out. RJ advised that at several sites,
people were submitting jobs using proof, running root in multicore - individual users had been
contacted and there was a need to control the user base.
RJ reported there were FTS transfer issues in relation to Lustre - this was on hold pending the
return of Shawn de Witt. AS was aware of the issue. It was hoped that the problem would go
away with SL6 deployment. RJ considered it to be a low-level problem and they were keeping a
watching brief. RJ noted another issue with the SRM timeout option in relation to CASTOR sites,
jobs went into pending mode then died. SL6 large-scale testing was imminent, they were awaiting
news of the RAL half-day intervention. Delayed stream reprocessing would be put in as a modest
priority. This equated to half of the resource globally, and was due to start at the beginning of
April.
SI-3 CMS weekly review & plans
===============================
DC had left the meeting.
SI-4 LHCb weekly review & plans
================================
PC noted nothing major to report.
SI-5 Production Manager's Report
=================================
JC reported as follows:
1) Some Tier-2 sites have had issues with certain ATLAS user jobs (proof-lite running multi-
threaded root) running with high cpu usage and causing WNs to crash. Individual users are being
contacted to cancel jobs.
2) A new version of the DPM Collaboration document (final) has been produced with a first draft
annex allocating tasks amongst partners. This is currently being revised – the stated GridPP
contribution is 1 FTE, but the current figures reflect comments about estimated current effort.
Comments on the IPR and licensing text will be fed back.
3) The final WLCG Tier-2 availability report for January is now available:
https://espace.cern.ch/WLCG-document-repository/ReliabilityAvailability/Tier-2/2013/WLCG_Tier2_Jan2013.pdf
Comments on WLCG marked amber sites:
UCL: 41%:48% - SE problems and upgrade.
Manchester: 85%:85% - CEs stopped accepting jobs.
Durham: 65%:65% - the site was being ‘rebuilt’ during January and therefore in downtime.
Birmingham: 68%:68% - DPM head-node upgrade; ops VOMS settings incorrect.
Aside: ATLAS analysis availability is discussed at http://tinyurl.com/b9yy8ja.
4) GridPP contributors (mainly Wahid, Sam and Jens) will lead a storage ‘workshop’ at the EGI CF.
This is leading to additional travel requests for which we may wish to set a quota. There are also
questions about registration for those with accepted submissions (posters/talks), as the fees are
high (http://cf2013.egi.eu/registration/). Do we encourage day participation? Early bird
registration is until 22nd February.
Apparently DK had been receiving travel requests for this; could we clarify the fee payment? DB
advised that we wanted to support the EGI Community Forum but considered that 20 people
going was too many. DB advised that it depended on whether the person going needed to be there
for the week or not, or could we cap the cost at a certain level? DB would contact DK and check
the cap level. The priority was for those with talks and posters to present. Those attending the
storage workshop would probably attend on that day only. DB noted it was complicated - it could
be a full day or it could be interleaved with the main conference as a thread. There were room
issues as well. DB was not aware that the storage workshop was going ahead. DB would contact
DK.
ACTION
488.3 DB to contact DK regarding travel and other costs to the EGI Community Forum in
Manchester.
5) Glasgow is currently running ‘at risk’ due to power feed issues.
6) The ops team focus is going to be on networking/perfsonar, IPv6, SL6 and glexec over the
coming month(s).
For information:
A) There was a GDB last week: http://indico.cern.ch/conferenceDisplay.py?confId=197800.
Topics covered included EGI’s plans post EMI, IPv6, reports from the Ops coordination team
groups and an update on Clouds and Storage Federations.
SI-6 Tier-1 Manager's Report
=============================
Fabric:
1) Disk - both sets delivered - acceptance testing.
2) CPU - both delivered - one set completed our tests but waiting for a supplier fix to power
distribution this week. Second set has our acceptance tests to run - will complete in about 2
weeks. Still need to configure both deliveries into final network configuration - cannot do this
until early March. In any case we plan not to deploy the new CPU to production capacity until late
March; in the meantime we will use it for SL6 capacity testing and other CASTOR load tests.
3) A short core site network intervention is being scheduled for Tuesday 26th February (adds
resilience). We are evaluating the likely impact and will schedule an at-risk/downtime as
appropriate.
Service:
1) A relatively quiet 2 weeks:
https://www.gridpp.ac.uk/wiki/Tier1_Operations_Report_2013-02-06
https://www.gridpp.ac.uk/wiki/Tier1_Operations_Report_2013-02-13
2) CASTOR
- Chasing a problem where the CASTOR SRM response has an invalid format, impacting some ATLAS
transfer management, particularly from QMW. The fault appears to be in the GSI/gsoap layer. We hope
it will be fixed when we upgrade to SL6 SRMs. Will need to discuss with ATLAS whether they can wait
that long.
- Chasing a slowdown problem on a generation of disk servers which causes timeouts and causes us
to be placed offline in ATLAS production. Rolling out a RAID controller firmware update to
address this problem.
3) BATCH - Problems with low start rate in the batch system, causing periods of under-utilisation.
Regular manual intervention is needed. The problem in Maui is proving hard to diagnose. Working on
a plan of how we will progress this, but may need to deploy an alternative to Torque/Maui.
4) AFS - Consultations underway on possible termination of rl.ac.uk AFS cell.
Staff:
- Paperwork for two system admin posts for the Fabric team is in the system awaiting approval.
ACTION
488.4 AS to let DB know the SL5 estimated benchmark figure for new CPU purchase.
SI-7 LCG Management Board Report
=================================
There was no MB.
REVIEW OF ACTIONS
=================
438.9 AS to contact relevant site managers to ask whether or not they would be interested in
having retired Tier-1 hardware - if a site were interested then they should submit a proposal as to
what they want and why. Ongoing.
480.2 JC to consider the imminent demise of EMI and the resultant effect on the GridPP
community - concrete issues and action requests to be brought back to the PMB. Ongoing.
484.1 DB to investigate plan for support of GridPP resources at Durham. PC as Chair of ScotGrid
may have some input to this. Ongoing.
485.1 DB to speak to STFC regarding GridPP5 timetable. Done, item closed.
485.3 AS to poll for date in May/June for T1 review. Ongoing.
486.1 DB to make a proposal regarding the increase in T2K data storage requirements, so that
this can be discussed. Done, item closed.
487.1 RJ/DC/PC to send PG a BibTeX file of experiment publications for the STFC e-VAL survey.
Done, item closed.
487.4 ALL to send PG a list of the occasions DB was a keynote speaker at conferences. Ongoing.
487.5 AS to check with Simon Lambert and Juan at RAL about DPHEP and ATLAS data curation,
and report back. Done, item closed.
ACTIONS AS AT 18.02.13
======================
438.9 AS to contact relevant site managers to ask whether or not they would be interested in
having retired Tier-1 hardware - if a site were interested then they should submit a proposal as to
what they want and why.
480.2 JC to consider the imminent demise of EMI and the resultant effect on the GridPP
community - concrete issues and action requests to be brought back to the PMB.
484.1 DB to investigate plan for support of GridPP resources at Durham. PC as Chair of ScotGrid
may have some input to this. DB/PC would meet to discuss this and report-back to the PMB.
485.3 AS to poll for date in May/June for T1 review.
487.4 ALL to send PG a list of the occasions any PMB member was a keynote speaker at
conferences.
488.1 AS to notify the community, giving three months' notice, that the AFS service would be shut
down.
488.2 DB to speak to DK/STFC regarding the EGI fee payment and let AS know.
488.3 DB to contact DK regarding travel and other costs to the EGI Community Forum in
Manchester.
488.4 AS to let DB know the SL5 estimated benchmark figure for new CPU purchase.
The next PMB would take place on Monday 25th February at 12:55 pm.
GridPP PMB Minutes 489 (25.02.2013)
===================================
Present: Dave Britton (Chair), Pete Gronbech, Andrew Sansum, Jeremy Coles, Tony Cass, Dave
Colling, Dave Kelsey, Steve Lloyd, Claire Devereux (Minutes - Pete Gronbech)
Apologies: Pete Clarke, Tony Doyle, Roger Jones, Neil Geddes
1) Finances
============
DB and AS had discussed the finance plan. AS had not yet looked at DB’s figures to double check
them - the amount of disk may be a little low. AS would check later today. RAL T2 figures were in
the budget.
2) GridPP5
===========
DB had forwarded an email to the PMB which gave an outline schedule. The SoI should be
submitted to the December 2013 Science Board meeting and the proposal to the PPRP in February
2014. DB noted that this was going through the PPRP rather than another funding mechanism (such
as Consolidated Grant). This was probably preferable, as a four-year project until LS2 was
possible.
This meant that the timing had to work backwards from the SoI in December; the key issue was to
do the bulk of the work in Sept/Oct this year, which could then be finalised in February 2014.
Christmas was during that period. The GridPP31 Collaboration Meeting should therefore focus on
scoping-out GridPP5.
DC considered that this should work really well, as the updated TDR computing documents would
have to be ready for September. Over the summer we needed to be thinking about how we
wished to shape this. DB suggested that we need to think about the following issues:
1. An operational vs developmental project: A good argument for any development would be
needed (new technologies were a concern, as was the successful maintenance of current
operations). How we packaged GridPP5 needed to be considered carefully, to avoid it being
separated off with the risk of not being funded at all.
2. Technical implementation: What would the GridPP5 Grid look like? This question was tied-up
with cloud work and developments in computer hardware.
3. Political instantiation of the grid: Would it be more of the same, or would it be rationalised to
fewer institutes?
4. Boundary services: NGI/EGI APEL, CA, VOMS, and network - these were all things we currently
relied upon - how would they be sustained?
5. Currently a big push in the UK to join-up the computing ecosystem (including HPC) - this
needed to be an energy efficient computing ecosystem. We could not submit a bid in isolation and
we needed to know how we might relate to this new world.
6. Impact agenda: How do we respond to this and can we get funding in this area?
It was agreed that we should structure the meeting at GridPP31 around these (and possibly other)
issues.
DC asked whether there was any European activity? CD noted there was no follow-on from
EGI-InSPIRE; there were some smaller projects, but no details were available yet. DB
considered it was unlikely that we would get significant outside funding. We had 4 x 0.5FTE at
institutes and several at the T1. The total was around 6-7. Potentially we might have to ask for
more this time, but to do that we would have to show very clearly how we would fit into the UK
ecosystem.
3) Support of WN/UI Tarball
============================
JC advised that Tiziana had enquired about ongoing support for WN and UI. Matt Doidge thought
that it should be fairly low-load but work was required for each new release. It was noted that
there were other countries using it (approximately 10 non-UK sites), but in some ways it was
good to be offering support for something we were using. We would have to say that it was on a
'best effort' basis only - we had no extra effort available, so if the load increased we could not
commit to supporting it.
4) HEPSYSMAN and Security Training
===================================
The PMB had approved the revised HEPSYSMAN/security training plan.
STANDING ITEMS
==============
SI-1 Dissemination Report
--------------------------
SL reported on behalf of Neasan O'Neill as follows:
Royal Soc:
* Attended Digital Training for the exhibition; I'll be helping compile digital content and manage
online interactions
* Compiled ideas for "eye witness" stories for booklet
News Items:
* VomsSnooper published
* Working with Claire Devereux on a profile of her as a news item
* Working on an EPIC news item
Social Media:
* We now have a Facebook page: http://facebook.com/gridpp
* Have drawn up a small plan to increase presence on the various channels
* Could people on PMB push use of the blogs again?
Events:
* We have a booth at CF13; currently trying to work out what we have to offer/who is attending
KE/Impact:
* Working on sessions/talks for GridPP30, suggested agenda here
http://www.gridpp.ac.uk/gridpp30/day2.html
* Have Jamie Coleman to talk at GridPP30
* Trying to sort out a date for Mark Mitchell's talk at Edinburgh's TechCube
* I have wording for GridPP's offering to academia and SMEs awaiting feedback
SI-2 ATLAS weekly review & plans
---------------------------------
There was no report, RJ was absent.
SI-3 CMS weekly review & plans
-------------------------------
DC noted nothing of significance to report.
SI-4 LHCb weekly review & plans
--------------------------------
There was no report, PC was absent.
SI-5 Production Manager's Report
---------------------------------
JC reported as follows:
1) Tiziana Ferrari (EGI) has asked about GridPP support for the tarball WN/UI. (See email to PMB
on 20th February).
2) ATLAS users using multi-core proof caused a few additional problems during last week, but
overall the situation was handled well. There is now a discussion about how to deal with such jobs
in future if there is a genuine user need for them.
3) PerfSONAR showed some but not all GridPP sites having poor rates to BNL. TCP tuning of
several parameters appears to have markedly improved the situation, and there is now work to
understand which settings particularly influence the rates and why.
4) The GDB actions list (https://twiki.cern.ch/twiki/bin/view/LCG/GDBActionInProgress) has
been updated and I highlight these activities:
- evaluation of new CVMFS version (2.1.5) starting (new features: NFS export, shared caches)
- starting tests with volunteering sites for multi-core jobs
- the next pre-GDB (12th March http://indico.cern.ch/conferenceDisplay.py?confId=223689) will
be on "Cloud issues" and building a work plan for future work in the area
- SHA-2 readiness of sites testing is starting: no need for RFC proxy anymore
- Sites with perfSONAR should move to a centrally managed configuration.
5) In addition to Glasgow, sites that are to start looking at IPv6 are Imperial, QMUL and possibly
Oxford.
SI-6 Tier-1 Manager's Report
-----------------------------
AS reported as follows:
Fabric:
1) Disk - both sets delivered - acceptance testing.
2) CPU - both delivered - one set completed our tests but waiting for a supplier fix to power
distribution this week. Second set has our acceptance tests to run - will complete in about 2
weeks.
3) A short core site network intervention is being scheduled for Tuesday 26th February (adds
resilience). We have declared a 1 hour "at risk".
4) We expect to replace the core Tier-1 network switch (C300) on Tuesday 12th March. Details to
be finalised.
5) We lost a disk server filesystem (gdss594) - a tape-backed server - 68 T2K files were
un-migrated and lost. A post mortem review is underway.
Service:
1) A quiet week:
https://www.gridpp.ac.uk/wiki/Tier1_Operations_Report_2013-02-20
2) CASTOR
- The CASTOR SRM was down for a few hours on Saturday evening - cause still unknown.
- Chasing a problem where the CASTOR SRM response has an invalid format, impacting some ATLAS
transfer management, particularly from QMW. The fault appears to be in the GSI/gsoap layer. We hope
it will be fixed when we upgrade to SL6 SRMs. Still need to discuss with ATLAS whether they can wait
that long.
- Chasing a slowdown problem on a generation of disk servers which causes timeouts and causes us
to be placed offline in ATLAS production. Rolling out a RAID controller firmware update to
address this problem.
3) BATCH - Problems with low start rate in the batch system, causing periods of under-utilisation,
mainly when jobs are very short. Regular manual intervention is needed. The problem in Maui is
proving hard to diagnose. Working on a plan of how we will progress this, but may need to deploy
an alternative to Torque/Maui.
4) Investigating unusual job failure rates for LHCb and ATLAS. These may occur during job set-up
and be related to CVMFS; investigations are underway.
Staff:
1) Paperwork for two system admin posts for the Fabric team is in the system awaiting approval by STFC.
SI-7 LCG Management Board Report
---------------------------------
There had been no MB.
ACTIONS AS OF 25.02.13
======================
438.9 AS to contact relevant site managers to ask whether or not they would be interested in
having retired Tier-1 hardware - if a site were interested then they should submit a proposal as to
what they want and why.
480.2 JC to consider the imminent demise of EMI and the resultant effect on the GridPP
community - concrete issues and action requests to be brought back to the PMB.
484.1 DB to investigate plan for support of GridPP resources at Durham. PC as Chair of ScotGrid
may have some input to this. DB/PC would meet to discuss this and report-back to the PMB.
485.3 AS to poll for date in May/June for T1 review.
487.4 ALL to send PG a list of the occasions any PMB member was a keynote speaker at
conferences.
488.1 AS to notify the community, giving three months' notice, that the AFS service would be shut
down.
488.2 DB to speak to DK/STFC regarding the EGI fee payment and let AS know.
488.3 DB to contact DK regarding travel and other costs to the EGI Community Forum in
Manchester.
488.4 AS to let DB know the SL5 estimated benchmark figure for new CPU purchase.
The next PMB would take place on Monday 4th March 2013 at 12:55 pm.
GridPP PMB Minutes 490 (04.03.2013)
===================================
Present: Dave Britton (Chair), Pete Gronbech, Andrew Sansum, Jeremy Coles, Tony Cass, Dave
Colling, Dave Kelsey, Steve Lloyd, Roger Jones (Minutes - Suzanne Scott)
Apologies: Tony Doyle, Pete Clarke, Claire Devereux, Neil Geddes
1. Tier-1 Resources
====================
DB advised that historically LHCb had used the figure of 18.6% to calculate the LHCb fraction of
resources, based on authors multiplied by global resource requests. LHCb had now realised that
this figure was not the correct number for the Tier-1 - it was right for the Tier-2. By applying the
algorithm to the Tier-1, resources for LHCb had been chronically under-provided. This had a big effect on
LHCb. DB had confirmed with PC that the formula was 'authors in the UK' divided by 'authors in
all Tier-1 countries'. The Tier-1 was currently therefore providing less to LHCb at RAL but this
was what LHCb had requested. It was unlikely that we could find extra resources in GridPP4. DB
wanted to know what the actual number was in order to see how far we could meet it. There was
a pressing need for disk, an extra 300TB. It was noted we had ~1PB of contingency so we could
probably meet LHCb part-way. We had to get through the procurements first before determining
the timing of this. We would need to treat LHCb like ATLAS and respond appropriately. AS was
concerned - over the coming year we had 3 calls on disk: 1. the FY14 delivery would be more than
4PB; 2. the operational size of existing tranches ranged considerably, and solving problems with
tranches would be outwith our ability to cope if the buffer dropped below 1PB; 3. we had to
deploy another storage instance this year in FY13. DB asked if it was possible to provision LHCb
with tape-backed disk? AS wasn't sure.
DB concluded that we needed to get the numbers from LHCb and look at the operational concerns
from our side. DC thought we should help if we could. AS advised that if we lost the Streamline
2009 we might not have enough capacity. DB asked: if it was 300TB that they wanted, what
was the risk of giving them 100/200/300TB, making a decision on that? We might then be able
to meet them part-way. PC would provide the final percentage figure of UK authors over Tier-1
authors. AS asked about ALICE - was ALICE short of authors from Tier-1 countries? DB advised that
we weren't funded to support Alice at all and the fact that we did provide for them at the moment
was best effort.
DB advised that the other issue was that pledges were made last year before the experiment
requirements were approved by the CRRB. Which numbers did we provision against? Sensibly
this would be against the numbers in Rebus, but that would be less than actually pledged in some
cases. If, on the other hand, we followed what we had pledged then we were provisioning against
the wrong numbers. DB had emailed Ian Bird, asking whether we should provision against Rebus or
the pledges - a response was awaited. It made sense to provision against Rebus - we would return
to this issue.
2. AOCB
========
- PC had withdrawn his request for the LHCb workshop; there were not enough people attending
the IoP to make it worthwhile.
- EGI fees: what was the timing of this? DB had received an email from Adrian about this; DB
needed to speak to CD. AS asked whether we needed to look at the risks w.r.t. project planning of
NES? DB noted that the NGI were developing a disaster management plan to cover all services on
which GridPP depended.
DB had emailed Janet Seed and CAP (PC) recently about the funding issue. The LHC computing
profile was not high enough. Recent funding had gone to projects in development; it had not been
given to established projects like GridPP.
- Regarding travel for Hepix: the next meeting was on 15-19 April in Bologna, early-bird
registration was mid-March. DK asked how many people we intended to fund? For the last
meeting in Prague, 3 people from the Tier-1 and PG had attended. There had been no engagement
from the Tier-2s. DB considered that it would be good if up to one person per Tier-2 could attend.
AS noted he was hoping for 2-3 people to attend from the Tier-1 this time. DB thought it entirely
reasonable for a few from the Tier-1 and one from each Tier-2 to go. We could consider any
requests beyond those figures. JC would remind the Ops people tomorrow and encourage
attendance. He would also email suggestions to DK regarding who would be best to go.
- PG asked about the allocation of hardware funding? DB advised that CMS needed to say whether
Bristol was part of their policy or not. SL noted that GridPP5 had not yet been discussed - we
would keep things going until then. CMS didn't update their metrics very often.
STANDING ITEMS
==============
SI-1 Dissemination Report
--------------------------
SL reported that Alex Efimov, who had worked at QMUL for a time, had asked for a meeting with
himself and Neasan regarding 'industry engagement'. SL and Neasan O'Neill would meet with
him.
SI-2 ATLAS weekly review & plans
---------------------------------
RJ noted small issues with the batch farm at RAL - they had problems filling the farm due to Maui.
Apart from that, there were issues with SL6 and node-testing. They had people making progress
with the testing infrastructure. Xrootd was being rolled-out across DPM sites. Durham was
currently up and running.
SI-3 CMS weekly review & plans
-------------------------------
DC noted nothing major to report.
SI-4 LHCb weekly review & plans
--------------------------------
PC was absent.
SI-5 Production Manager's Report
---------------------------------
JC reported as follows:
1) As of today sites are being alerted about the end-of-life of EMI-1 middleware and the
decommissioning campaign is starting (sites have until the end of March to remove the
middleware).
2) EMI 3 (Monte Bianco) is expected to be released this Thursday. We will review UK staged-
rollout involvement at tomorrow’s ops meeting.
3) SNO+ and T2K have both experienced proxy renewal issues in recent months; jobs are not
consistently failing so it is difficult to pin-point the underlying problem(s). At least one problem
was reported as a bug with the WMS that was subsequently fixed but the release failed staged
rollout for other reasons.
It was noted that both Sussex and Durham were up and running at the moment. PG asked
whether Sussex had been added to the accounting metric page? SL noted he would add them.
SI-6 Tier-1 Manager's Report
-----------------------------
AS reported as follows:
Fabric:
1) Disk - both sets delivered - acceptance testing, projected to end 1st and 15th March (if no
problems).
2) CPU - both delivered - one set available for test queue (probably SL6) second set expected to
complete acceptance tests this week. We do not plan to deploy to production queues until
required to meet MoU commitment.
3) We expect to replace the core Tier-1 network switch (C300) on Tuesday 12th March. The Tier-1
network is complex with many switch stacks. We expect to schedule a 6 hour downtime (TBC)
which includes some contingency to allow time to resolve problems with uplinks or switch stacks
disturbed by the change. Full details will be announced later this week.
4) Preparations are underway for moving Tier-1 to new 40Gb network infrastructure. Major
intervention likely in late April or May.
Service:
1) A relatively quiet week:
https://www.gridpp.ac.uk/wiki/Tier1_Operations_Report_2013-02-27
AS advised that there had been job-start rate issues in Maui - they could solve it when it occurred,
but it seemed worse when the experiments were submitting short jobs, the issue tended to come
and go. The other issue was low-level loss of jobs in the setup phase for both ATLAS and LHCb -
work was ongoing on this however the cause was not yet known.
2) CASTOR
- development continues on 2.1.13. Stress testing is well advanced. Some tape servers upgraded
to test in production. Expect to upgrade the Facilities instance this month and Tier-1 instances
likely to start in April.
Staff:
1) Paperwork for two system admin posts for the Fabric team is in the system awaiting approval by STFC.
SI-7 LCG Management Board Report
---------------------------------
There had been no MB.
REVIEW OF ACTIONS
=================
438.9 AS to contact relevant site managers to ask whether or not they would be interested in
having retired Tier-1 hardware - if a site were interested then they should submit a proposal as to
what they want and why. Ongoing.
480.2 JC to consider the imminent demise of EMI and the resultant effect on the GridPP
community - concrete issues and action requests to be brought back to the PMB. Ongoing.
484.1 DB to investigate plan for support of GridPP resources at Durham. PC as Chair of ScotGrid
may have some input to this. DB/PC would meet to discuss this and report-back to the PMB.
Done, item closed.
485.3 AS to poll for date in May/June for T1 review. Ongoing.
487.4 ALL to send PG a list of the occasions any PMB member was a keynote speaker at
conferences. Ongoing.
488.1 AS to notify the community, giving three months' notice, that the AFS service would be shut
down. Ongoing.
488.2 DB to speak to DK/STFC regarding the EGI fee payment and let AS know. Ongoing.
488.3 DB to contact DK regarding travel and other costs to the EGI Community Forum in
Manchester. Done, item closed.
488.4 AS to let DB know the SL5 estimated benchmark figure for new CPU purchase. Done, item
closed.
ACTIONS AS OF 04.03.13
======================
438.9 AS to contact relevant site managers to ask whether or not they would be interested in
having retired Tier-1 hardware - if a site were interested then they should submit a proposal as to
what they want and why.
480.2 JC to consider the imminent demise of EMI and the resultant effect on the GridPP
community - concrete issues and action requests to be brought back to the PMB.
485.3 AS to poll for date in May/June for T1 review.
487.4 ALL to send PG a list of the occasions any PMB member was a keynote speaker at
conferences.
488.1 AS to notify the community, giving three months' notice, that the AFS service would be shut
down.
488.2 DB to speak to DK/STFC regarding the EGI fee payment and let AS know.
The next meeting would take place next Monday 11th March at 12:55 pm. RJ advised of apologies
for the next two meetings.