Dear All,
Please find attached the GridPP Project Management Board
Meeting minutes for the 488th to 490th meetings.
The latest minutes can be found at:
http://www.gridpp.ac.uk/php/pmb/minutes.php?latest
as well as being listed with other minutes at:
http://www.gridpp.ac.uk/php/pmb/minutes.php
Cheers, Dave.
GridPP PMB Minutes 488 (18.02.2013)
===================================
Present: Dave Britton (Chair), Tony Doyle, Pete Gronbech, Andrew Sansum, Jeremy Coles, Tony
Cass, Pete Clarke, Dave Colling, Roger Jones (Minutes - Suzanne Scott)
Apologies: Dave Kelsey, Steve Lloyd, Claire Devereux, Neil Geddes
0) Closure of AFS Service
==========================
AS had circulated a report giving an overview of the history of this. In 2007 the Tier-1 Board had
said we should shut it down, some complaints had been received from the User Board regarding
users. A slight upgrade had been effected to keep it going. The issue had come round again now
as the hardware was fairly old. We needed to decide what to do, as the AFS Service did not fit-in
with the Tier-1 model and the community had not found it useful. Recently there had been
fileserver problems and there were still come users, however usage was limited. Moving forward,
to maintain it, we would need to invest effort in staff, upgrades, and also the user registration
process, however we probably could not support it at that level. No funding was available. The
logical outcome was to close it. DB asked if there were a use case for the AFS as part of the core
mission of the Tier-1? AS noted no. DB therefore considered it to be peripheral and the service
did not need to be run. It might affect some individual users if it were closed.
DB considered we should turn it off unless we had to respond to an urgent issue. This was agreed.
AS would broadcast the notification. DB noted there was no defined use case, therefore there was
no justification for refresh and manpower. We would announce the termination of the service and
see what the outcome was. This was agreed. PC asked whether a long process of advertisement
would be required? AS advised that he preferred to keep this to within a four-month period. AS
would send out a notification and reminders.
ACTION
488.1 AS to notify the community, giving three months' notice, that the AFS service would be shut
down.
1) Quarterly Report Summary
============================
PG had circulated a report. Compared with the previous quarter the experiments and the Tier-1
were green. There were some reds at the Tier-1. The LFC and FTS service fell below the 99%
target.
The CMS VO box metric was no longer required. Regarding storage, there had been similar drops
due to power outages. A lot of effort had been put into power incidents and upgrades. The
CASTOR staff levels were critical. Jens Jensen was working on three recruitments at the moment.
For ATLAS all metrics were green. RJ advised that ATLAS use of resources was not all green but
the site performance was acceptable. For CMS all metrics were green. All was OK apart from
Bristol. DC noted that we needed to tread carefully with this site due to manpower and other
issues. There were storage issues to be resolved. There had been an improvement however.
For LHCb all metrics were green. RAL had performed excellently during the Quarter. For 'Other'
experiments all metrics were green. There had been the addition of the EPIC VO. NGS VOs had
been added onto the VOMS Server. T2K had increasing storage requirements. There were LFC
support issues.
For Ops, everything was going fairly smoothly. There had been some upgrade issues. For
DataGroup all metrics were green. For Experiment Support all metrics were green.
2) EGI Fees
============
It was noted that JISC would not pay the current year's EGI fee for the UK, therefore there had
been a request that £60k be funded by someone other than JISC, i.e. GridPP and NGS. NGS could
pay half and it was requested that GridPP pay £30-35k. The only mechanism available was out of
the travel budget. DK could let us know whether this was feasible. DB asked for comments. RJ
asked what we got in return for the EGI fee? DB advised that staff were funded by EGI, there were
a few FTE and 4 x 0.5FTE at the Tier-2 which were funded by EGI. DB advised that matching effort
was also required - other people reported time into the PPT timesheet system. There was one
year left of EGI, which had followed-on from EGEEII and EGEEIII. DB needed to speak to DK and
STFC before taking action. JC advised that Ireland hadn't paid and they had withdrawn. The
Portuguese payment was delayed. 4 x FTE were also on APEL and the GocDB, which we relied on.
It was agreed that DB should speak to DK/STFC regarding the EGI fee payment and let AS know.
ACTION
488.2 DB to speak to DK/STFC regarding the EGI fee payment and let AS know.
3) Horizon 2020
===============
It was noted that the EU were widening their search for experts in all fields for Horizon 2020
proposals. Had anyone responded to the call? No-one had. Did anyone wish to volunteer? There
were strategic priorities and an Agenda to be discussed. If anyone did wish to get involved they
should let DB know.
STANDING ITEMS
==============
SI-O Report from Cloud Group
=============================
DC advised that meetings were happening fortnightly; the twiki was in progress; hardware was
limited so far. DC reported as follows:
Organisation is settling down and we have fortnightly meetings. There is a growing twiki and a
community is starting to form. There is an ongoing discussion between Ian C., JC and DC about how
best to form this into a community. Other cloud sites within GridPP are encouraged to add a
description and link on the GridPP twiki site (as ECDF already have done).
Physical Hardware (description from Adam H.)
---------------------------------------------
Four nodes are in the process of being provisioned as extra compute nodes. Each has
64GB RAM and 32 cores with HyperThreading. This will bring the total compute capacity to:
200 Cores
400 GiB RAM
A storage node is also being provisioned, to provide an S3-compatible service. The raw usable
capacity (before any reduction for replication) is:
20 TiB
Further storage may be added using space on other nodes in the cluster, if the loading on single
machines is such that multiple roles can be accommodated safely. The cluster is running
OpenStack Folsom. Hosts have been added to the Imperial monitoring system. It is planned to
provide monitoring of individual instances too.
Activities:
-----------
Cloud Storage testing
---------------------
None of the LHC experiments are currently using cloud storage; however, storage is being added to
the cluster so that Wahid can perform some tests.
ATLAS
-----
Nobody was able to report on ATLAS activities at the last meeting, but at the previous meeting Peter
L. had reported that he was planning to use Cloud Scheduler. As yet there have been no images on
the GridPP cloud from ATLAS, but I believe that Peter had some configuration to do in Lancaster
before he would try anything at our end.
CMS
---
CMS have been very active both with the GridPP cloud and the HLT farm at CERN.
The HLT farm has run ~4000 concurrent reprocessing jobs; however, under that loading the jobs
started to fail. This is believed to be a simple network bandwidth problem, as the data was going
over the 1Gb/s pipe rather than the 10Gb/s one. After the low energy run Andrew L. and Toni (from CERN)
are to map the requirements of the reprocessing jobs and then rearrange the network as needed.
These jobs are submitted from a glideinWMS sitting at CERN. Data are read in and out over xroot.
In the UK, user data analysis jobs are now being submitted using the regular CMS tools, going via
the glideinWMS at RAL and being run on the GridPP cloud at Imperial. It should be noted that the
glideinWMS not only controls the jobs but also performs the instantiation of the VMs themselves. Data is
read in using xroot and staged out using conventional grid tools. Currently there is a problem that
some jobs fail because of stage-out timeout problems. This is being investigated.
LHCb
----
Andrew MN. described that LHCb at CERN were using the hampster set-up to create individual
VMs on the Agile Infrastructure, and he was going to try doing something similar in Manchester
and then possibly on the GridPP Cloud.
Relations with other Cloud projects
------------------------------------
We are in the process of joining the EGI Federated Cloud and had a 'phone meeting with Dave W
and Matteo last Friday. This sounds as though it will be about 2 weeks of work, which would mean
that we would be part of the demo at the User Forum. We will then look at trying to run CMS jobs
(and hopefully those of other VOs if effort is available) on the EGI FC.
We have been in touch with Helix Nebula: we will be a resource provider via the EGI FC, but
will also be part of a dialogue with Helix Nebula on how they can work with national structures
such as GridPP and national funding agencies, especially concerning hybrid cloud models.
Security
--------
Regarding security, John the Security Officer had agreed to take on the security remit for the Cloud
Group as well.
SI-1 Dissemination Report
==========================
There was no report.
SI-2 ATLAS weekly review & plans
=================================
RJ confirmed there had been minor issues and that space tokens were filling up. Group
production was being done by those who might not know how their jobs would behave; this meant
space tokens were being used up and it needed to be sorted out. RJ advised that at several sites,
people were submitting jobs using proof, running root in multicore - individual users had been
contacted and there was a need to control the user base.
RJ reported there were FTS transfer issues in relation to Lustre - this was on hold pending the
return of Shawn de Witt. AS was aware of the issue. It was hoped that the problem would go
away with SL6 deployment. RJ considered it to be a low-level problem and they were keeping a
watching brief. RJ noted another issue with the SRM timeout option in relation to CASTOR sites,
jobs went into pending mode then died. SL6 large-scale testing was imminent, they were awaiting
news of the RAL half-day intervention. Delayed stream reprocessing would be put in as a modest
priority. This equated to half of the resource globally, and was due to start at the beginning of
April.
SI-3 CMS weekly review & plans
===============================
DC had left the meeting.
SI-4 LHCb weekly review & plans
================================
PC noted nothing major to report.
SI-5 Production Manager's Report
=================================
JC reported as follows:
1) Some Tier-2 sites have had issues with certain ATLAS user jobs (proof-lite running multi-
threaded root) running with high cpu usage and causing WNs to crash. Individual users are being
contacted to cancel jobs.
2) A new version of the DPM Collaboration document (final) has been produced with a first draft
annex allocating tasks amongst partners. This is currently being revised – the stated GridPP
contribution is 1 FTE, but the current figures reflect comments about estimated current effort.
Comments on the IPR and licensing text will be fed back.
3) The final WLCG Tier-2 availability report for January is now available:
https://espace.cern.ch/WLCG-document-repository/ReliabilityAvailability/Tier-2/2013/WLCG_Tier2_Jan2013.pdf
Comments on WLCG marked amber sites:
UCL: 41%:48% - SE problems and upgrade.
Manchester: 85%:85% - CEs stopped accepting jobs.
Durham: 65%:65% - the site was being ‘rebuilt’ during January and therefore in downtime.
Birmingham: 68%:68% - DPM head-node upgrade; ops VOMS settings incorrect.
Aside: ATLAS analysis availability is discussed at http://tinyurl.com/b9yy8ja.
4) GridPP contributors (mainly Wahid, Sam and Jens) will lead a storage ‘workshop’ at the EGI CF.
This is leading to additional travel requests for which we may wish to set a quota. There are also
questions about registration for those with accepted submissions (posters/talks), as the fees are
high (http://cf2013.egi.eu/registration/). Do we encourage day participation? Early bird
registration is until 22nd February.
Apparently DK had been receiving travel requests for this; could we clarify the fee payment? DB
advised that we wanted to support the EGI Community Forum but considered that 20 people
going was too many. DB advised that it depended on whether the person going needed to be there
for the week or not, or could we cap the cost at a certain level? DB would contact DK and check
the cap level. The priority was for those with talks and posters to present. Those attending the
storage workshop would probably attend on that day only. DB noted it was complicated - it could
be a full day or it could be interleaved with the main conference as a thread. There were room
issues as well. DB was not aware that the storage workshop was going ahead. DB would contact
DK.
ACTION
488.3 DB to contact DK regarding travel and other costs to the EGI Community Forum in
Manchester.
5) Glasgow is currently running ‘at risk’ due to power feed issues.
6) The ops team focus is going to be on networking/perfsonar, IPv6, SL6 and glexec over the
coming month(s).
For information:
A) There was a GDB last week: http://indico.cern.ch/conferenceDisplay.py?confId=197800.
Topics covered included EGI’s plans post EMI, IPv6, reports from the Ops coordination team
groups and an update on Clouds and Storage Federations.
SI-6 Tier-1 Manager's Report
=============================
Fabric:
1) Disk - both sets delivered - acceptance testing.
2) CPU - both delivered - one set completed our tests but waiting for a supplier fix to power
distribution this week. Second set has our acceptance tests to run - will complete in about 2
weeks. Still need to configure both deliveries into final network configuration - cannot do this
until early March. In any case we plan not to deploy the new CPU to production capacity until late
March; in the meantime we will use it for SL6 capacity testing and other CASTOR load tests.
3) A short core site network intervention is being scheduled for Tuesday 26th February (adds
resilience). We are evaluating the likely impact and will schedule an at-risk/downtime as
appropriate.
Service:
1) A relatively quiet 2 weeks:
https://www.gridpp.ac.uk/wiki/Tier1_Operations_Report_2013-02-06
https://www.gridpp.ac.uk/wiki/Tier1_Operations_Report_2013-02-13
2) CASTOR
- Chasing a problem where the CASTOR SRM response has an invalid format, impacting some ATLAS
transfer management, particularly from QMW. The fault appears to be in the GSI/gsoap layer. We hope
it will be fixed when we upgrade to SL6 SRMs. Will need to discuss with ATLAS whether they can wait
that long.
- Chasing a slowdown problem on a generation of disk servers which causes timeouts and causes us
to be placed offline in ATLAS production. Rolling out a RAID controller firmware update to
address this problem.
3) BATCH - Problems with low start rate in the batch system, causing periods of under-utilisation.
Regular manual intervention is needed. The problem in Maui is proving hard to diagnose. Working on
a plan of how we will progress this, but may need to deploy an alternative to Torque/Maui.
4) AFS - Consultations underway on possible termination of rl.ac.uk AFS cell.
Staff:
- Paperwork for two system admin posts for the Fabric team is in the system awaiting approval.
ACTION
488.4 AS to let DB know the SL5 estimated benchmark figure for new CPU purchase.
SI-7 LCG Management Board Report
=================================
There was no MB.
REVIEW OF ACTIONS
=================
438.9 AS to contact relevant site managers to ask whether or not they would be interested in
having retired Tier-1 hardware - if a site were interested then they should submit a proposal as to
what they want and why. Ongoing.
480.2 JC to consider the imminent demise of EMI and the resultant effect on the GridPP
community - concrete issues and action requests to be brought back to the PMB. Ongoing.
484.1 DB to investigate plan for support of GridPP resources at Durham. PC as Chair of ScotGrid
may have some input to this. Ongoing.
485.1 DB to speak to STFC regarding GridPP5 timetable. Done, item closed.
485.3 AS to poll for date in May/June for T1 review. Ongoing.
486.1 DB to make a proposal regarding the increase in T2K data storage requirements, so that
this can be discussed. Done, item closed.
487.1 RJ/DC/PC to send PG a BibTeX file of experiment publications for the STFC e-VAL survey.
Done, item closed.
487.4 ALL to send PG a list of the occasions DB was a keynote speaker at conferences. Ongoing.
487.5 AS to check with Simon Lambert and Juan at RAL about DPHEP and ATLAS data curation,
and report back. Done, item closed.
ACTIONS AS AT 18.02.13
======================
438.9 AS to contact relevant site managers to ask whether or not they would be interested in
having retired Tier-1 hardware - if a site were interested then they should submit a proposal as to
what they want and why.
480.2 JC to consider the imminent demise of EMI and the resultant effect on the GridPP
community - concrete issues and action requests to be brought back to the PMB.
484.1 DB to investigate plan for support of GridPP resources at Durham. PC as Chair of ScotGrid
may have some input to this. DB/PC would meet to discuss this and report-back to the PMB.
485.3 AS to poll for date in May/June for T1 review.
487.4 ALL to send PG a list of the occasions any PMB member was a keynote speaker at
conferences.
488.1 AS to notify the community, giving three months' notice, that the AFS service would be shut
down.
488.2 DB to speak to DK/STFC regarding the EGI fee payment and let AS know.
488.3 DB to contact DK regarding travel and other costs to the EGI Community Forum in
Manchester.
488.4 AS to let DB know the SL5 estimated benchmark figure for new CPU purchase.
The next PMB would take place on Monday 25th February at 12:55 pm.
GridPP PMB Minutes 489 (25.02.2013)
===================================
Present: Dave Britton (Chair), Pete Gronbech, Andrew Sansum, Jeremy Coles, Tony Cass, Dave
Colling, Dave Kelsey, Steve Lloyd, Claire Devereux (Minutes - Pete Gronbech)
Apologies: Pete Clarke, Tony Doyle, Roger Jones, Neil Geddes
1) Finances
============
DB and AS had discussed the finance plan. AS had not yet looked at DB’s figures to double check
them - the amount of disk may be a little low. AS would check later today. RAL T2 figures were in
the budget.
2) GridPP5
===========
DB had forwarded an email to the PMB which gave an outline schedule. The SoI should be
submitted to the December 2013 Science Board meeting and the proposal to the PPRP in February
2014. DB noted that this was going through the PPRP rather than another funding mechanism (such
as Consolidated Grant). This was probably preferable, as a four-year project until LS2 was
possible.
This meant that the timing had to work backwards from the SoI in December; the key issue was to
do the bulk of the work in Sept/Oct this year, which could then be finalised in February 2014.
Christmas was during that period. The GridPP31 Collaboration Meeting should therefore focus on
scoping-out GridPP5.
DC considered that this should work really well, as the updated TDR computing documents would
have to be ready for September. Over the summer we needed to be thinking about how we
wished to shape this. DB suggested that we need to think about the following issues:
1. An operational vs developmental project: A good argument for any development would be
needed (new technologies were a concern, as was the successful maintenance of current
operations). How we packaged GridPP5 needed to be considered carefully, to avoid it being
separated off with the risk of not being funded at all.
2. Technical implementation: What would the GridPP5 Grid look like? This question was tied-up
with cloud work and developments in computer hardware.
3. Political instantiation of the grid: Would it be more of the same, or would it be rationalised to
fewer institutes?
4. Boundary services: NGI/EGI APEL, CA, VOMS, and network - these were all things we currently
relied upon - how would they be sustained?
5. Currently a big push in the UK to join-up the computing ecosystem (including HPC) - this
needed to be an energy efficient computing ecosystem. We could not submit a bid in isolation and
we needed to know how we might relate to this new world.
6. Impact agenda: How do we respond to this and can we get funding in this area?
It was agreed that we should structure the meeting at GridPP31 around these (and possibly other)
issues.
DC asked whether there was any European activity? CD noted there was no follow-on from
EGI-InSPIRE; there were some smaller projects, but no details were available yet. DB
considered it was unlikely that we would get significant outside funding. We had 4 x 0.5FTE at
institutes and several at the T1. The total was around 6-7. Potentially we might have to ask for
more this time, but to do that we would have to show very clearly how we would fit into the UK
ecosystem.
3) Support of WN/UI Tarball
============================
JC advised that Tiziana had enquired about ongoing support for WN and UI. Matt Doidge thought
that it should be fairly low-load but work was required for each new release. It was noted that
there were other countries using it (approximately 10 non-UK sites), but in some ways it was
good to be offering support for something we were using. We would have to say that it was on a
'best effort' basis only - we had no extra effort available, so if the load increased we could not
commit to supporting it.
4) HEPSYSMAN and Security Training
===================================
The PMB had approved the revised HEPSYSMAN/security training plan.
STANDING ITEMS
==============
SI-1 Dissemination Report
--------------------------
SL reported on behalf of Neasan O'Neill as follows:
Royal Soc:
* Attended Digital Training for the exhibition; I'll be helping compile digital content and manage
online interactions
* Compiled ideas for "eye witness" stories for booklet
News Items:
* VomsSnooper published
* Working with Claire Devereux on a profile of her as a news item
* Working on an EPIC news item
Social Media:
* We now have a Facebook page: http://facebook.com/gridpp
* Have drawn up a small plan to increase presence on the various channels
* Could people on PMB push use of the blogs again?
Events:
* We have a booth at CF13; currently trying to work out what we have to offer/who is attending
KE/Impact:
* Working on sessions/talks for GridPP30, suggested agenda here
http://www.gridpp.ac.uk/gridpp30/day2.html
* Have Jamie Coleman to talk at GridPP30
* Trying to sort out a date for Mark Mitchell's talk at Edinburgh's TechCube
* I have wording for GridPP's offering to academia and SMEs awaiting feedback
SI-2 ATLAS weekly review & plans
---------------------------------
There was no report, RJ was absent.
SI-3 CMS weekly review & plans
-------------------------------
DC noted nothing of significance to report.
SI-4 LHCb weekly review & plans
--------------------------------
There was no report, PC was absent.
SI-5 Production Manager's Report
---------------------------------
JC reported as follows:
1) Tiziana Ferrari (EGI) has asked about GridPP support for the tarball WN/UI. (See email to PMB
on 20th February).
2) ATLAS users using multi-core proof caused a few additional problems during last week, but
overall the situation was handled well. There is now a discussion about how to deal with such jobs
in future if there is a genuine user need for them.
3) PerfSONAR showed some but not all GridPP sites having poor rates to BNL. TCP tuning of
several parameters appears to have markedly improved the situation, and there is now work to
understand which settings particularly influence the rates and why.
4) The GDB actions list (https://twiki.cern.ch/twiki/bin/view/LCG/GDBActionInProgress) has
been updated and I highlight these activities:
- evaluation of new CVMFS version (2.1.5) starting (new features: NFS export, shared caches)
- starting tests with volunteering sites for multi-core jobs
- the next pre-GDB (12th March http://indico.cern.ch/conferenceDisplay.py?confId=223689) will
be on "Cloud issues" and building a work plan for future work in the area
- SHA-2 readiness of sites testing is starting: no need for RFC proxy anymore
- Sites with perfSONAR should move to a centrally managed configuration.
5) In addition to Glasgow, sites that are to start looking at IPv6 are Imperial, QMUL and possibly
Oxford.
SI-6 Tier-1 Manager's Report
-----------------------------
AS reported as follows:
Fabric:
1) Disk - both sets delivered - acceptance testing.
2) CPU - both delivered - one set completed our tests but waiting for a supplier fix to power
distribution this week. Second set has our acceptance tests to run - will complete in about 2
weeks.
3) A short core site network intervention is being scheduled for Tuesday 26th February (adds
resilience). We have declared a 1 hour "at risk".
4) We expect to replace the core Tier-1 network switch (C300) on Tuesday 12th March. Details to
be finalised.
5) We lost a disk server filesystem (gdss594) - a tape-backed server - 68 T2K files were
un-migrated and lost. A post mortem review is underway.
Service:
1) A quiet week:
https://www.gridpp.ac.uk/wiki/Tier1_Operations_Report_2013-02-20
2) CASTOR
- The CASTOR SRM was down for a few hours on Saturday evening - cause still unknown.
- Chasing a problem where the CASTOR SRM response has an invalid format, impacting some ATLAS
transfer management, particularly from QMW. The fault appears to be in the GSI/gsoap layer. We hope
it will be fixed when we upgrade to SL6 SRMs. Still need to discuss with ATLAS whether they can wait
that long.
- Chasing a slowdown problem on a generation of disk servers which causes timeouts and causes us
to be placed offline in ATLAS production. Rolling out a RAID controller firmware update to
address this problem.
3) BATCH - Problems with low start rate in the batch system, causing periods of under-utilisation,
mainly when jobs are very short. Regular manual intervention is needed. The problem in Maui is
proving hard to diagnose. Working on a plan of how we will progress this, but may need to deploy
an alternative to Torque/Maui.
4) Investigating unusual job failure rates for LHCb and ATLAS. These may occur during job set-up
and be related to CVMFS; investigations are underway.
Staff:
1) Paperwork for two system admin posts for the Fabric team is in the system awaiting approval by STFC.
SI-7 LCG Management Board Report
---------------------------------
There had been no MB.
ACTIONS AS OF 25.02.13
======================
438.9 AS to contact relevant site managers to ask whether or not they would be interested in
having retired Tier-1 hardware - if a site were interested then they should submit a proposal as to
what they want and why.
480.2 JC to consider the imminent demise of EMI and the resultant effect on the GridPP
community - concrete issues and action requests to be brought back to the PMB.
484.1 DB to investigate plan for support of GridPP resources at Durham. PC as Chair of ScotGrid
may have some input to this. DB/PC would meet to discuss this and report-back to the PMB.
485.3 AS to poll for date in May/June for T1 review.
487.4 ALL to send PG a list of the occasions any PMB member was a keynote speaker at
conferences.
488.1 AS to notify the community, giving three months' notice, that the AFS service would be shut
down.
488.2 DB to speak to DK/STFC regarding the EGI fee payment and let AS know.
488.3 DB to contact DK regarding travel and other costs to the EGI Community Forum in
Manchester.
488.4 AS to let DB know the SL5 estimated benchmark figure for new CPU purchase.
The next PMB would take place on Monday 4th March 2013 at 12:55 pm.
GridPP PMB Minutes 490 (04.03.2013)
===================================
Present: Dave Britton (Chair), Pete Gronbech, Andrew Sansum, Jeremy Coles, Tony Cass, Dave
Colling, Dave Kelsey, Steve Lloyd, Roger Jones (Minutes - Suzanne Scott)
Apologies: Tony Doyle, Pete Clarke, Claire Devereux, Neil Geddes
1. Tier-1 Resources
====================
DB advised that historically LHCb had used the figure of 18.6% to calculate the LHCb fraction of
resources, based on authors multiplied by global resource requests. LHCb had now realised that
this figure was not the correct number for the Tier-1 - it was right for the Tier-2. By applying the
algorithm to the Tier-1, resources for LHCb had been chronically under-provided. This had a big effect on
LHCb. DB had confirmed with PC that the formula was 'authors in the UK' divided by 'authors in
all Tier-1 countries'. The Tier-1 was currently therefore providing less to LHCb at RAL but this
was what LHCb had requested. It was unlikely that we could find extra resources in GridPP4. DB
wanted to know what the actual number was in order to see how far we could meet it. There was
a pressing need for disk, an extra 300TB. It was noted we had ~1PB of contingency so we could
probably meet LHCb part-way. We had to get through the procurements first before determining
the timing of this. We would need to treat LHCb like ATLAS and respond appropriately. AS was
concerned - over the coming year we had 3 calls on disk: 1. the FY14 delivery would be more than
4PB; 2. the operational size of existing tranches ranged considerably, and solving problems with
tranches would be outwith our ability to cope if the buffer dropped below 1PB; 3. we had to
deploy another storage instance this year in FY13. DB asked if it was possible to provision LHCb
with tape-backed disk? AS wasn't sure.
DB concluded that we needed to get the numbers from LHCb and look at the operational concerns
from our side. DC thought we should help if we could. AS advised that if we lost the Streamline
2009 we might not have enough capacity. DB asked: if it was 300TB that they wanted, what
was the risk of giving them 100/200/300TB, making a decision on that? We might then be able
to meet them part-way. PC would provide the final percentage figure of UK authors over Tier-1
authors. AS asked about ALICE - was ALICE short of authors from Tier-1 countries? DB advised that
we weren't funded to support Alice at all and the fact that we did provide for them at the moment
was best effort.
DB advised that the other issue was that pledges were made last year before the experiment
requirements were approved by the CRRB. Which numbers did we provision against? Sensibly
this would be against the numbers in Rebus, but that would be less than actually pledged in some
cases. If, on the other hand, we followed what we had pledged then we were provisioning against
the wrong numbers. DB had emailed Ian Bird, asking whether we should provision against Rebus or
the pledges - a response was awaited. It made sense to provision against Rebus - we would return
to this issue.
2. AOCB
========
- PC had withdrawn his request for the LHCb workshop; there were not enough people attending
the IoP to make it worthwhile.
- EGI fees: what was the timing of this? DB had received an email from Adrian about this; DB
needed to speak to CD. AS asked whether we needed to look at the risks w.r.t. project planning of
NES? DB noted that the NGI were developing a disaster management plan to cover all services on
which GridPP depended.
DB had emailed Janet Seed and CAP (PC) recently about the funding issue. The LHC computing
profile was not high enough. Recent funding had gone to projects in development; it had not been
given to established projects like GridPP.
- Regarding travel for Hepix: the next meeting was on 15-19 April in Bologna, early-bird
registration was mid-March. DK asked how many people we intended to fund? For the last
meeting in Prague, 3 people from the Tier-1 and PG had attended. There had been no engagement
from the Tier-2s. DB considered that it would be good if up to one person per Tier-2 could attend.
AS noted he was hoping for 2-3 people to attend from the Tier-1 this time. DB thought it entirely
reasonable for a few from the Tier-1 and one from each Tier-2 to go. We could consider any
requests beyond those figures. JC would remind the Ops people tomorrow and encourage
attendance. He would also email suggestions to DK regarding who would be best to go.
- PG asked about the allocation of hardware funding? DB advised that CMS needed to say whether
Bristol was part of their policy or not. SL noted that GridPP5 had not yet been discussed - we
would keep things going until then. CMS didn't update their metrics very often.
STANDING ITEMS
==============
SI-1 Dissemination Report
--------------------------
SL reported that Alex Efimov, who had worked at QMUL for a time, had asked for a meeting with
himself and Neasan regarding 'industry engagement'. SL and Neasan O'Neill would meet with
him.
SI-2 ATLAS weekly review & plans
---------------------------------
RJ noted small issues with the batch farm at RAL - they had problems filling the farm due to Maui.
Apart from that, there were issues with SL6 and node-testing. They had people making progress
with the testing infrastructure. Xrootd was being rolled-out across DPM sites. Durham was
currently up and running.
SI-3 CMS weekly review & plans
-------------------------------
DC noted nothing major to report.
SI-4 LHCb weekly review & plans
--------------------------------
PC was absent.
SI-5 Production Manager's Report
---------------------------------
JC reported as follows:
1) As of today sites are being alerted about the end-of-life of EMI-1 middleware and the
decommissioning campaign is starting (sites have until the end of March to remove the
middleware).
2) EMI 3 (Monte Bianco) is expected to be released this Thursday. We will review UK staged-
rollout involvement at tomorrow’s ops meeting.
3) SNO+ and T2K have both experienced proxy renewal issues in recent months; jobs are not
consistently failing so it is difficult to pin-point the underlying problem(s). At least one problem
was reported as a bug with the WMS that was subsequently fixed but the release failed staged
rollout for other reasons.
It was noted that both Sussex and Durham were up and running at the moment. PG asked
whether Sussex had been added to the accounting metric page? SL noted he would add them.
SI-6 Tier-1 Manager's Report
-----------------------------
AS reported as follows:
Fabric:
1) Disk - both sets delivered - acceptance testing, projected to end 1st and 15th March (if no
problems).
2) CPU - both delivered - one set available for test queue (probably SL6) second set expected to
complete acceptance tests this week. We do not plan to deploy to production queues until
required to meet MoU commitment.
3) We expect to replace the core Tier-1 network switch (C300) on Tuesday 12th March. The Tier-1
network is complex with many switch stacks. We expect to schedule a 6 hour downtime (TBC)
which includes some contingency to allow time to resolve problems with uplinks or switch stacks
disturbed by the change. Full details will be announced later this week.
4) Preparations are underway for moving Tier-1 to new 40Gb network infrastructure. Major
intervention likely in late April or May.
Service:
1) A relatively quiet week:
https://www.gridpp.ac.uk/wiki/Tier1_Operations_Report_2013-02-27
AS advised that there had been job-start rate issues in Maui - they could solve it when it occurred,
but it seemed worse when the experiments were submitting short jobs, the issue tended to come
and go. The other issue was low-level loss of jobs in the setup phase for both ATLAS and LHCb -
work was ongoing on this however the cause was not yet known.
2) CASTOR
- development continues on 2.1.13. Stress testing is well advanced. Some tape servers upgraded
to test in production. Expect to upgrade the Facilities instance this month and Tier-1 instances
likely to start in April.
Staff:
1) Paperwork for two system admin posts for the Fabric team is in the system awaiting approval by STFC.
SI-7 LCG Management Board Report
---------------------------------
There had been no MB.
REVIEW OF ACTIONS
=================
438.9 AS to contact relevant site managers to ask whether or not they would be interested in
having retired Tier-1 hardware - if a site were interested then they should submit a proposal as to
what they want and why. Ongoing.
480.2 JC to consider the imminent demise of EMI and the resultant effect on the GridPP
community - concrete issues and action requests to be brought back to the PMB. Ongoing.
484.1 DB to investigate plan for support of GridPP resources at Durham. PC as Chair of ScotGrid
may have some input to this. DB/PC would meet to discuss this and report-back to the PMB.
Done, item closed.
485.3 AS to poll for date in May/June for T1 review. Ongoing.
487.4 ALL to send PG a list of the occasions any PMB member was a keynote speaker at
conferences. Ongoing.
488.1 AS to notify the community, giving three months' notice, that the AFS service would be shut
down. Ongoing.
488.2 DB to speak to DK/STFC regarding the EGI fee payment and let AS know. Ongoing.
488.3 DB to contact DK regarding travel and other costs to the EGI Community Forum in
Manchester. Done, item closed.
488.4 AS to let DB know the SL5 estimated benchmark figure for new CPU purchase. Done, item
closed.
ACTIONS AS OF 04.03.13
======================
438.9 AS to contact relevant site managers to ask whether or not they would be interested in
having retired Tier-1 hardware - if a site were interested then they should submit a proposal as to
what they want and why.
480.2 JC to consider the imminent demise of EMI and the resultant effect on the GridPP
community - concrete issues and action requests to be brought back to the PMB.
485.3 AS to poll for date in May/June for T1 review.
487.4 ALL to send PG a list of the occasions any PMB member was a keynote speaker at
conferences.
488.1 AS to notify the community, giving three months' notice, that the AFS service would be shut
down.
488.2 DB to speak to DK/STFC regarding the EGI fee payment and let AS know.
The next meeting would take place next Monday 11th March at 12:55 pm. RJ advised of apologies
for the next two meetings.