Dear All,
Please find attached the GridPP Project Management Board
Meeting minutes for the 476th meeting to 479th meeting.
The latest minutes can be found at:
http://www.gridpp.ac.uk/php/pmb/minutes.php?latest
as well as being listed with other minutes at:
http://www.gridpp.ac.uk/php/pmb/minutes.php
Cheers, Dave.
GridPP PMB Minutes 476 (15.10.2012)
===================================
Present: Dave Britton (Chair), Pete Gronbech, Jeremy Coles, Andrew Sansum,
Apologies: Roger Jones, Steve Lloyd, John Gordon, Dave Kelsey, Pete Clarke, Tony Cass, Tony
Doyle, Dave Colling, Claire Devereux, Neil Geddes
1. Synergies with DIRAC
========================
Jeremy Yates had circulated a document and DB asked how we should respond? DB proposed that
the last section of the document be discussed to see what was possible and there may be actions
to generate.
- identity management
DB noted that the UK3A bid had already been submitted
- GPFS multi-cluster
DB had been pursuing this already but GridPP probably would not want to consider GPFS because
of long-term licensing costs. We could however monitor DIRAC's progress - who was available to
do that? It could be delegated to the Storage Group. JC advised that GPFS was not popular among
the Storage Group. DB considered that we wanted someone to monitor and understand what
DIRAC was doing, then we could get a presentation in a year's time.
ACTION
476.1 PG to ask the Storage Group to be aware that DIRAC may deploy/test a form of GPFS as a
prototype for a national system, the Storage Group to monitor and keep abreast of progress.
- creating VOs
Did this relate to the technical side of things, or to outreach? In the long term, could we envisage
VOs which might need/use both GridPP and DIRAC? It was premature at this stage to consider
this.
- sharing resources
It was noted that our CPU was full these days, but HPC was not so full - could we use the compute
power that was available out there? We couldn't use HECToR at Edinburgh due to the
architecture - was there anything else? AS considered we might make use of other shared
facilities, support edge nodes and buy-in, however this had been difficult to do in the past. DB
thought that Institute resources would be better, for example DIRAC at Cambridge - should we
talk to them? Was it too much work for too little gain? PG advised that Cambridge suffered from a
lack of manpower. JC noted that we had just disengaged from the Condor cluster. DB thought it
was good for Institutes to have their clusters used, but recognised the manpower issues. This
needed to be devolved to Institutes to take forward, namely Oxford, SouthGrid, and Cambridge. JC
noted we needed manpower to pursue the technical side.
- helpdesk
DB noted this related to local support for DIRAC. AS considered that it was a different concept to
what we did at our Helpdesk.
- training
This would only be needed if we had things in common.
- security policy
DB considered that until DIRAC joined-up their technology then security was really a local issue
only. AS advised that there could be federated issues - they were at the stage we had been around
8-10 years ago, sites not disclosing issues etc. DB noted that there was a new GridPP Security
Officer now, we could ask DK and the new Officer to have a dialogue with DIRAC to ascertain
whether there was any common ground.
- operations
Could we collaborate here? It was thought no, not until we had something in common. AS noted it
might be possible in relation to monitoring frameworks.
ACTION
476.2 DB to invite JY and his Sysadmin to visit Lancaster or attend a HEPSYSMAN meeting.
- outreach
It was thought collaboration in this area was possible; Neasan O'Neill should be involved.
ACTION
476.3 DB to feedback to JY the PMB discussion regarding possible synergies with DIRAC.
2. AOB
=======
- Track Convenors
There had been a call for CHEP Convenors, however possible contenders were not here today. RJ
was a possibility.
- NGS CertWizard
This would be discussed in JC's report, but it was noted that this issue was being widely discussed
at present. Constructive comments were required so that we could feed back relevant
information. It was known that NGS CertWizard was causing some problems that might be due to
the clarity of the instructions or more technical in nature. JC noted he would be getting feedback
from Jens Jensen at the Ops Team. DB noted that this issue needed to be sorted out by someone.
- next PMB
DB noted he was travelling next week and other PMB members were also away. PG advised that
the Quarterly Reports were still awaited from some. It was agreed, in the light of absences, that
there would be no PMB meeting next Monday 22nd October. The next PMB would take place on
Monday 29th October.
STANDING ITEMS
==============
SI-1 Dissemination Report
--------------------------
SL was absent.
SI-2 ATLAS weekly review & plans
---------------------------------
RJ was absent.
SI-3 CMS weekly review & plans
---------------------------------
DC was absent.
SI-4 LHCb weekly review & plans
---------------------------------
PC was absent.
SI-5 Production Manager's Report
---------------------------------
JC reported as follows:
1) There were a number of current topics touched upon at the GDB last week
(http://indico.cern.ch/conferenceOtherViews.py?view=standard&confId=155073). Sites running
unsupported gLite 3.2 services will be ticketed from the start of November and must by then have
a plan to move to EMI or an escalated technical reason that prevents them upgrading. The GridPP
sites (still using gLite CEs) at the ops meeting last Tuesday all indicated plans to move their CEs
before the end of October.
- There are a number of activities involved with Storage Federations (failover, self-healing,
caching…). GridPP sites are involved with both the ATLAS and CMS testing.
- Publishing WN environments is still being tested.
- Jamie Shiers's talk on post EGI-Inspire emphasised the need for WLCG to work closely with other
communities in new areas post EGI-Inspire. FP8/Horizon 2020 calls likely in data management
and data preservation. JS should meet with PC/DC/RJ to push forward a common position
regarding data preservation in the context of potential funding.
ACTION
476.4 PC/DC/RJ to meet with Jamie Shiers in order to push forward a common position regarding
data preservation in the context of potential funding and FP8/Horizon 2020 calls.
- Markus Schulz circulated a proposal paper for middleware support post EMI
(http://indico.cern.ch/materialDisplay.py?contribId=12&sessionId=1&materialId=paper&confId
=155073).
2) As part of our (GridPP) contribution to the future necessity of community supported activities,
some of the ops team are now learning how to produce the WN tarball installs that we need.
In the DPM area, there is now confirmed interest in the community support model from France
and Taiwan, and it is likely that we will be able to continue without the initially proposed MoU
structure. CERN management have yet to discuss the CERN contribution. There were possible
alternative fixes from DPM - information to be sent by JC to the Glasgow Team.
ACTION
476.5 JC to send info on possible alternative DPM fixes to the Glasgow Team.
3) The next EGI Community Forum will be hosted by the departments of IT Services and Particle
Physics, University of Manchester, UK, on 8-12 April 2013. Wahid would like us to consider
running a Storage Workshop in conjunction with this meeting (an extended version of a
DPM workshop that will likely take place in the UK around April).
DB noted that in principle this was a good idea, but we needed to ensure that our costs would not
be too high as a result.
4) Last Thursday the core ops team discussed progress and plans in each of the core task areas.
Updates are captured in the meeting page here
https://indico.cern.ch/conferenceDisplay.py?confId=212408. (This is for reference but I can talk
through the areas at the PMB if there is time/interest). One item of note concerns other VOs. We
currently point these VOs to use SRM, WMS and LFC yet there are indications that the LHC
experiments will move away from them. There was an issue about support in the longer term.
5) Communications have been sent out to our UK hosted VOs informing the VO-admins about
upcoming changes in a number of areas and particularly with the EMI middleware transition (CEs
and WNs). There are few indications that the VOs are testing and most likely problems will need
to be dealt with if and when they arise.
6) There have been multiple discussions about the CA CertWizard
(http://www.ngs.ac.uk/use/tools/certwizard) in the last week. It is a tool for managing
certificates. There are no current plans to replace the browser interface for certificate
management, but Jens will be joining the ops meeting tomorrow to explain the rationale, plans
and take feedback.
For information:
A) HEPiX takes place this week in Beijing:
https://indico.cern.ch/conferenceOtherViews.py?view=standard&confId=199025.
B) The next WLCG coordination meeting takes place this Thursday:
https://indico.cern.ch/conferenceDisplay.py?confId=212691.
C) The next HEPSYSMAN meeting takes place on 9th November in Lancaster:
http://hepwww.rl.ac.uk/sysman/Nov2012/main.html.
SI-6 Tier-1 Manager's Report
-----------------------------
AS reported as follows:
Fabric
------
1) Disk tender closed - evaluation underway
2) CPU tender evaluation complete - now with procurement team
Service
-------
1) Operations continue generally smoothly
2) CASTOR
a) CASTOR 2.1.12 upgrade for LHCb was cancelled last Tuesday while we investigated a possible
problem with the previous ATLAS upgrade. This eventually turned out to be a false alarm and
upgrade scheduling is underway again.
b) CMS upgrade now scheduled for this Tuesday 16th October. LHCb upgrade planned (TBC) for
23rd October.
3) Upgrade to EMI2 CREAM CE in final tests but some publishing problems remain. Things are
tight for us to meet our deadline to have switched off the old gLite CEs by the end of October or
face possible suspension. However systems are deployed and being tested and we expect to move
to full production this week.
4) Hyper-threading change has been approved to exploit hyper-threading by running more jobs
than cores. This is a simple change to implement but does come with some risks/issues as well as
benefits. Implementation scheduled for next month after CE change this month.
- We will gain an additional 8647 HEPSPEC from the existing hardware nominally
- We will allow an additional 2048 job slots to run. The amount we over-commit will differ on the
different generations:
* 10 slots on the 8 core 2009 generation
* 20 slots on the 12 core 2010/2011 generations
- We will gradually ramp up the number of additional job slots in case of load issues on the batch
server (risk)
- CPU scale factors will be set according to the new benchmarked per job slot performance. This is
only relevant when the worker node is fully occupied. When occupancy is below max, CPUs will
effectively be faster than published and so we will under account work done at the accounting
portal.
- Job efficiency will still be able to discriminate between efficient and inefficient work, but average
job efficiency is no longer a measure of how much useful work is done on the farm (it remains a
measure of how efficient jobs are).
- "wasted CPU hours" from the efficiency stats becomes even less meaningful, because if a job does
not use execution units another overcommitted job will.
- By committing memory to run more jobs per node we have reduced our capacity to run large
memory jobs (or vice versa). New hardware will be purchased configured with enough memory to
support all hyper-threads concurrently.
5) Backup Oracle (and Frontier) Service for CMS - we expect to receive a formal request shortly to
run a global backup Oracle service for the CMS conditions D/B. Given the reduction in load on
Oracle from ATLAS LFC and LHCb 3D/LFC we expect to be able to meet Oracle licensing and
database hardware mainly from existing resources, but we'll need to assess exact requirement
before reaching a final conclusion.
DB noted that DC should request this via the PMB.
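For reference, the over-commit figures quoted under the hyper-threading item (4) above can be
turned into ratios with a few lines of Python. This is only an illustrative sketch: it assumes the
10 and 20 slot figures are total job slots per node, and the function and variable names are made
up for the example rather than taken from any Tier-1 tooling.

```python
# Figures quoted in the Tier-1 report; the reading of "slots" as total
# job slots per node is an assumption made for this sketch.
GENERATIONS = {
    "2009 (8 core)": {"cores": 8, "slots": 10},
    "2010/2011 (12 core)": {"cores": 12, "slots": 20},
}

EXTRA_HEPSPEC = 8647  # additional nominal HEPSPEC quoted in the report
EXTRA_SLOTS = 2048    # additional job slots quoted in the report


def overcommit_ratio(cores: int, slots: int) -> float:
    """Job slots per physical core for one hardware generation."""
    return slots / cores


# Average nominal HEPSPEC gained per additional job slot.
hepspec_per_extra_slot = EXTRA_HEPSPEC / EXTRA_SLOTS

if __name__ == "__main__":
    for name, gen in GENERATIONS.items():
        ratio = overcommit_ratio(gen["cores"], gen["slots"])
        print(f"{name}: {ratio:.2f} slots per core")
    print(f"HEPSPEC per extra slot: {hepspec_per_extra_slot:.2f}")
```

On those assumptions the 2009 generation runs 1.25 slots per core, the 2010/2011 generations
about 1.67, and each additional slot is worth roughly 4.2 HEPSPEC.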
AOB
===
- GridPP30
PG asked what was happening about this? DB advised that DC said he would look into hosting the
meeting at the Royal Geographical Society near Imperial.
ACTION
476.6 DC to investigate the hosting of GridPP30 at the Royal Geographical Society near Imperial,
and report back.
- European PP Strategy
AS reported that there had been an internal request within STFC regarding the European Particle
Physics Strategy process and a discussion about national laboratories. John Womersley was
putting together the proposal that RAL was a National Lab including the Tier-1.
ACTION
476.7 AS to check with John Womersley regarding the proposal that RAL be considered as a
National Lab including the Tier-1. AS to find out status of the proposal and report back.
REVIEW OF ACTIONS
=================
438.9 AS to contact relevant site managers to ask whether or not they would be interested in
having retired Tier-1 hardware - if a site were interested then they should submit a proposal as to
what they want and why. Ongoing.
475.1 DB/JC, in conjunction with AS, to consider and draft Terms of Reference (ToR) for the
proposed GridPP Cloud Group. Ongoing.
475.2 DB to draft a response to Peter Coveney's email request, using PC's suggestions and in the
light of PMB discussion. Done, item closed.
ACTIONS AS AT 15.10.12
======================
438.9 AS to contact relevant site managers to ask whether or not they would be interested in
having retired Tier-1 hardware - if a site were interested then they should submit a proposal as to
what they want and why.
475.1 DB/JC, in conjunction with AS, to consider and draft Terms of Reference (ToR) for the
proposed GridPP Cloud Group.
476.1 PG to ask the Storage Group to be aware that DIRAC may deploy/test a form of GPFS as a
prototype for a national system, the Storage Group to monitor and keep abreast of progress.
476.2 DB to invite Jeremy Yates and his Sysadmin to visit Lancaster or attend a HEPSYSMAN
meeting, to help move forward with DIRAC synergies.
476.3 DB to feedback to Jeremy Yates the PMB discussion regarding possible synergies with
DIRAC.
476.4 PC/DC/RJ to meet with Jamie Shiers in order to push forward a common position regarding
data preservation in the context of potential funding and FP8/Horizon 2020 calls.
476.5 JC to send info on possible alternative DPM fixes to the Glasgow Team.
476.6 DC to investigate the hosting of GridPP30 at the Royal Geographical Society near Imperial,
and report back.
476.7 AS to check with John Womersley regarding the proposal that RAL be considered as a
National Lab including the Tier-1. AS to find out current status of the proposal and report back.
There would be *no* PMB on Monday 22nd October. The next PMB would take place on Monday
29th October at 12:55 pm.
GridPP PMB Minutes 477 (29.10.2012)
=======================================
Present: Dave Britton (Chair), Andrew Sansum, Roger Jones, Pete Clarke, Tony Cass, Tony Doyle,
Dave Colling, Claire Devereux (Suzanne Scott - Minutes)
Apologies: Dave Kelsey, Steve Lloyd, John Gordon, Jeremy Coles, Pete Gronbech, Neil Geddes
STANDING ITEMS
==============
SI-1 Dissemination Report
--------------------------
SL was not present.
SI-2 ATLAS weekly report & plans
---------------------------------
RJ reported that there had been a rolling changeover to the EMI CE at RAL last week, there had
been discussions about the process, extra disk for ATLAS at RAL was being installed this week but
they had held back on the hyperthreading. High memory MC jobs had gone to the Tier-1 recently,
the Tier-2s could also contribute to this but this was to be discussed. RJ had no major problems to
report.
SI-3 CMS weekly review & plans
-------------------------------
DC was not present at this stage in the meeting.
SI-4 LHCb weekly review & plans
--------------------------------
PC reported that they were progressing with reprocessing, which was going fine; after Christmas
they would be doing the 2011 data reprocessing.
SI-5 Production Manager's Report
---------------------------------
JC was absent but had sent a brief note:
We have made steady progress with removing gLite 3.2 CEs/BDIIs, but some (more than I hoped)
will certainly remain in early November. Sites have received tickets and all have now responded
but I am concerned that some of the smaller sites will not follow-up and there is a growing
possibility they will be suspended/uncertified at some point in the coming month. I will send an
update next week.
The WN tarball help has not so far developed which is another problem on the horizon when the
gLite 3.2 WN deadline arrives at the end of November.
SI-6 Tier-1 Manager's Report
-----------------------------
AS reported as follows:
Fabric:
1) Disk tender closed - evaluation expected to complete this week.
2) CPU tender standstill complete. Orders about to be raised.
3) Asymmetric network routing discovered for some Tier-1 to RAL traffic. External sites had not
accepted our OPN routing. Now corrected.
4) A disk server operating system was accidentally re-installed (human error). This was risk 6 in
our accidental data loss risk analysis. Mitigation worked - no data lost.
Service:
1) Operations continue generally smoothly
2) CASTOR
a) CASTOR 2.1.12 upgrade for CMS+LHCb completed. Gen instance upgrade will be carried out on Tuesday
30th.
3) Upgrade to EMI2 CREAM CE completed. Went very well but experiments did not promptly
change SAM test endpoints so incorrect availability will need correcting. Old gLite nodes will be
turned off by end of month.
4) WMS services upgraded from gLite. We should now be gLite-free.
5) Hyper-threading change has been approved to exploit hyper-threading by running more jobs
than cores. This is a simple change to implement but does come with some risks/issues as well as
benefits. Implementation scheduled for next month after CE change this month.
6) Backup Oracle (and Frontier) Service for CMS - we expect to receive a formal request shortly to
run a global backup Oracle service for the CMS conditions D/B. Given the reduction in load on
Oracle from ATLAS LFC and LHCb 3D/LFC we expect to be able to meet Oracle licensing and
database hardware mainly from existing resources, but we'll need to assess exact requirement
before reaching a final conclusion.
SI-7 LCG Management Board Report
---------------------------------
DB reported that there had been a discussion re Oracle licences, they were identifying cases
where Oracle was in use at the Tier-1s; there had been the issue of OSG's contingency plans for
their CA, users were requesting contingency planning for various scenarios if Certs could not be
issued - the documents were available publicly. DB noted that GridPP was in the same situation
and we should ask the same question for services we don't directly run - the next NGI meeting
would discuss this on 12th November. DB noted that the documents re the CA and infrastructure
were fairly generic and could maybe be used. There needed to be contingency plans for all NGI
services. DB would report-back from the NGI meeting. CD noted she had this issue on the NGI
Agenda.
DB continued - there had been an update on the WLCG networking group by Michael Ernst. The
Oversight Board had raised a query about the networking group's remit, in order to clarify how it
related to other bodies. DB reported that there had been a bit of discussion about this group
generally and 'bandwidth on demand', no further action was required at present. There had
followed a discussion on common projects; then a discussion on WLCG software life-cycle process.
DB noted there would shortly be a Russian Tier-1.
AS had sent an email regarding Oracle. He advised that the licence requirements were reducing
over the next few years but the maintenance bill was due in GridPP4. AS noted he was awaiting
formal information from CERN. DB asked whether we would need fewer licences going forward than was
originally planned. AS confirmed yes - the bulk of licences go on CASTOR. DB noted that at RAL
the dominant factor was CASTOR therefore the LFC and FTS changes would not affect things
much. AS agreed, and he would send round a summary. DB noted that regarding the backup
service for CMS we didn't want additional costs.
DC had joined the meeting and advised that he had a chat with Ian this morning. The CMS request
was not high on their wishlist but it would be good to have. CMS may try and move away from
Oracle. DC noted that Fermilab had almost no Oracle licences at all.
1. ToR for Cloud Group
=======================
A proposal document had been circulated by DB and he had sent it to AS for comment. AS noted
only one minor thing: 'production' cloud service could perhaps be modified to 'prototype' cloud
service. DC was to give feedback. Any other comments should be sent to DB/DC. It was noted
that the document would be used as the basis for moving forward. There would be a monthly
report to the PMB. Would PC and RJ be involved? PC advised that a PDRA post was being
advertised and this was something that the prospective member of staff could be involved with on
behalf of LHCb. RJ advised that he had been discussing this within ATLAS and a few people were
interested, but this was to be confirmed. DC should convene a meeting soon to start-off this Cloud
Group.
2. AOB
=======
- DELL LHC Programme
It was noted that George Jones had left DELL. PG had received a message from Gary Kriegel noting
that the Programme was currently in transition and that LHC pricing was being determined for
the future. It was thought that the programme could disappear entirely. RJ would contact Andy
Langford and thereafter the DELL contact he met at Manchester.
ACTION
477.1 RJ to contact Andy Langford and thereafter the DELL contact he met at Manchester in
relation to DELL LHC programme changes.
AS advised that DELL hadn't made the cut for the CPU service, possibly reflecting their change of
emphasis.
- DPHEP meeting
DB asked about this meeting - was anyone going? PC noted no - it was difficult to get to Marseille
from Edinburgh. RJ noted he had also dropped out due to the change of venue from Munich. PC
advised that Marco would be going for LHCb. ATLAS would not have any representation.
REVIEW OF ACTIONS
=================
438.9 AS to contact relevant site managers to ask whether or not they would be interested in
having retired Tier-1 hardware - if a site were interested then they should submit a proposal as to
what they want and why. Ongoing for 2006 generation.
475.1 DB/JC, in conjunction with AS, to consider and draft Terms of Reference (ToR) for the
proposed GridPP Cloud Group. Done, item closed.
476.1 PG to ask the Storage Group to be aware that DIRAC may deploy/test a form of GPFS as a
prototype for a national system, the Storage Group to monitor and keep abreast of progress.
Ongoing.
476.2 DB to invite Jeremy Yates and his Sysadmin to visit Lancaster or attend a HEPSYSMAN
meeting, to help move forward with DIRAC synergies. Done, item closed.
476.3 DB to feedback to Jeremy Yates the PMB discussion regarding possible synergies with
DIRAC. Done, item closed.
476.4 PC/DC/RJ to meet with Jamie Shiers in order to push forward a common position regarding
data preservation in the context of potential funding and FP8/Horizon 2020 calls. Done, item
closed.
476.5 JC to send info on possible alternative DPM fixes to the Glasgow Team. Done, item closed.
476.6 DC to investigate the hosting of GridPP30 at the Royal Geographical Society near Imperial,
and report back. DC would check the Physics Dept and Halls of Residence. Done, item closed.
476.7 AS to check with John Womersley regarding the proposal that RAL be considered as a
National Lab including the Tier-1. AS to find out current status of the proposal and report back.
Done, item closed.
ACTIONS AS AT 29.10.12
======================
438.9 AS to contact relevant site managers to ask whether or not they would be interested in
having retired Tier-1 hardware - if a site were interested then they should submit a proposal as to
what they want and why.
476.1 PG to ask the Storage Group to be aware that DIRAC may deploy/test a form of GPFS as a
prototype for a national system, the Storage Group to monitor and keep abreast of progress.
477.1 RJ to contact Andy Langford and thereafter the DELL contact he met at Manchester in
relation to DELL LHC programme changes.
The next PMB meeting would take place on Monday 5th November at 12:55 pm.
GridPP PMB Minutes 478 (05.11.2012)
=======================================
Present: Dave Britton (Chair), Pete Gronbech, Andrew Sansum, Roger Jones, Pete Clarke, Tony
Cass, Dave Colling, Claire Devereux, Steve Lloyd, John Gordon, Jeremy Coles, Dave Kelsey
Apologies: Tony Doyle, Neil Geddes
Agenda:
1. ATLAS - Oracle for conditions DB and Frontier Server at RAL [RJ/AS]
======================================================================
ATLAS has asked the 5 Tier-1s (which includes RAL) that host the Conditions DataBase and
Frontier Servers in addition to CERN, whether they intended to continue to do so for Run2 (i.e.
until 2018). ATLAS were not sure how many instances were required: It might not be 5 but it was
certainly "some". AS noted that the 3D database required some 6 Oracle licences (compared to
something like 30 for CASTOR) and this might reduce to 4, so was not a dominant factor. RJ had
yet to receive an answer from ATLAS as to the experiment's longer-term plans with respect to Oracle.
ATLAS has requested a response by mid-Nov. DB suggested that RJ find out a little more about
ATLAS' position and draft an initial response on the basis that it was not regarded as a big problem
by the Tier-1. DB would want to add some caveats about the timeframe involved.
ACTION
478.1 RJ to draft response to the ATLAS message and iterate with DB.
AOCB
====
1) PG had been away last week and would summarise quarterly reports at the next PMB meeting.
2) DC had made some enquiries about GridPP30 at Imperial and would make a proposal on dates
to the PMB this week.
ACTION
478.2 DC to propose dates for GridPP30.
STANDING ITEMS
==============
SI-1 Dissemination Report [SL]
-------------------------
SL reported that he had received the following from NO:
- Published Ganga news item
- Waiting to publish LCG CE news item
- Sussex news item ready for when they go into production
- perfSONAR news item in the works
- VOMS Snooper news item also in the works
- GridPP (and PG) in Linux Format this month
- I've been officially added to the LOC for the Community Forum (well I'm included in the phone
calls)
DB expressed a concern that the events of September had demonstrated that our dissemination
overall as a project had some gaps. News items were fine but they only addressed
one area of dissemination. In particular, GridPP needs better contact with industry and better
visibility within the developing UK e-infrastructure community. A discussion ensued, with broad
agreement that there was an issue. It was felt that we need to target some very specific things: A
project with an industrial partner would be valuable; money might be available from the various
STFC impact programmes if something could be identified.
ACTION
478.3 SL to talk with NO; possibly a meeting with DB/SL/NO/CD?
RJ noted that the website needed to be fixed so that the old Excel visit-notice was no longer linked from
the resources page. DK said he would contact Andrew McNab.
SI-2 ATLAS Weekly Review and Plans [RJ]
----------------------------------
Main issue was that RAL had been moved out of raw-data export. This might be due to OPN
saturation but there are several independent network-related issues on-going at RAL and AS was
still trying to get to the bottom of this. The UK Tier-2s also seem to have a number of unrelated
issues at present, but nothing too serious. Lancaster would shortly be moved off the light path
now that the link north was up and running.
SI-3 CMS Weekly Review and Plans [DC]
--------------------------------
DC reported that things were fine with CMS. He had noted that the UK Tier-2s had appeared in the
top grouping of global CMS Tier-2 sites (along with the US and DESY) in terms of cpu-hours
delivered and analysis delivered. DC noted that he was currently setting up the cloud-group and
an email list would be established this week. The possibility of hosting a duplicate CMS conditions
db at RAL was discussed. The costs included £2.5k for nodes; £8.7k for disk; and £2k? for Oracle
Licence(s). It was not yet clear how many Oracle Licences would be needed. AS would get back to
DC with the complete details and DC would talk to Ian Fisk as to whether the costs were
justifiable.
SI-4 LHCb Weekly Review and Plans [GP]
---------------------------------
PC reported that there were no issues on the LHCb side.
SI-5 Production Manager's weekly report [JC]
---------------------------------------
JC reported that:
1) We have agreed a VOMS upgrade/switch for 14th November. There will be a brief period
where VO information will not be editable but otherwise the switch will be transparent for VOs
already hosted on gridpp.ac.uk. David Wallom has been liaising with the NGS VOs that are coming
on to the gridpp VOMS.
2) A validator script running on VOMRS to check the status of issuer DNs produced some
confusing messages for (LHC) users last week as old certificate DNs were not deleted in VOMS but
the certificates against the old CA DN were picked up as failing (due to the old UK CA now having
expired?) the validation. This seems to have impacted ATLAS team memberships within GGUS for
editing tickets which used the old certificate status for team membership confirmation.
3) As of 1st November several GridPP sites were still running gLite 3.2 CEs with no EMI CEs in
parallel: UCL, Durham and ECDF. Additional sites with 3.2 CEs that will be removed soon (when
the EMI CEs are shown stable): Manchester, Sheffield, Bristol and Cambridge. Some sites have
deployed EMI-2 SL5 WNs (the status tables are being updated). Alessandra has been tracking
plans for ATLAS via this page: https://www.gridpp.ac.uk/wiki/UK_EMI2_Deployment.
4) Last week joint work (finally) began on producing EMI WN tarballs. Needless to say it is not
quite as simple as early reports suggested it would be. Matt Doidge at Lancaster together with
Wahid Bhimji are providing the GridPP input. Issues include what 'extra' SL rpms need to be
included and a policy for later allowing use of glexec.
5) There was a request on TB-SUPPORT for more information on GridPP30 dates.
6) Are there any further PMB comments on the DPM collaboration notice I forwarded from Oliver
Keeble last week? It mentions the in principle agreement to support the collaboration from 3
countries and core development effort being provided by CERN.
For information
A) There is a GDB next week http://indico.cern.ch/conferenceDisplay.py?confId=155074.
B) There is a HEPSYSMAN meeting on Friday:
http://hepwww.rl.ac.uk/SYSMAN/Nov2012/main.html.
SI-6 Tier-1 Manager's weekly report [AS]
-----------------------------------
AS reported that:
Fabric
------
1) Disk tender closed - HAG meeting scheduled for Tuesday
2) CPU orders placed.
3) Review of our network performance indicates problem with our outbound rate to most/all
sites. Still investigating.
4) High traffic rate on LHCOPN to RAL at the moment (since Friday) under investigation. May
need to consider load balancing on backup link in future.
5) Failure of the primary OPN for about 10 hours on 30th October owing to a major fibre cut
between Gravelines and Bois-Grenier in France.
6) Site networking plan a short intervention on our board on the main site router on Tuesday 13th
November. This will lead to a short scheduled outage. We may take this opportunity to schedule
other network work such as performance tests and an upgrade to address bandwidth limitations
on one of our stack uplinks.
Service
-------
1) Operations report at:
https://www.gridpp.ac.uk/wiki/Tier1_Operations_Report_2012-10-31
2) CASTOR
a) CASTOR 2.1.12 upgrade now complete on all instances.
b) CASTOR 2.1.13 certification has commenced.
c) Lengthy (7 hours) downtime on ATLAS instance over weekend. Cause was non-optimal change in
execution plan on SRM database. DB team plan to lock down execution plan using an Oracle 11
feature.
3) Hyper-threading change expected to be implemented shortly.
SI-7 LCG Management Board Report of Issues [JG/DB]
------------------------------------------
There had been no MB.
REVIEW OF ACTIONS
=================
476.1 had been done
477.1 had been done but DB opened a new action:
ACTION
478.4 RJ to let PMB know more details about the future of the DELL LHC programme after he'd
talked to Andy Langford.
ACTIONS AS OF 05.11.12
======================
438.9 AS to contact relevant site managers to ask whether or not they would be interested in
having retired Tier-1 hardware - if a site were interested then they should submit a proposal as to
what they want and why.
478.1 RJ to draft response to the ATLAS message about Conditions db and Frontier server and
iterate with DB.
478.2 DC to propose dates for GridPP30.
478.3 SL to talk with NO; possibly a meeting with DB/SL/NO/CD about targeting our
dissemination.
478.4 RJ to report back to the PMB about the DELL LHC programme after he'd talked to Andy
Langford.
The next PMB would take place on Monday 12 November at 12:55 pm.
GridPP PMB Minutes 479 (12.11.2012)
===================================
Present: Dave Britton (Chair), Pete Gronbech, Andrew Sansum, Pete Clarke, Tony Cass, Dave
Colling, Claire Devereux, Steve Lloyd, John Gordon, Jeremy Coles, Dave Kelsey
Apologies: Tony Doyle, Roger Jones, Neil Geddes
0. Summary of NGI Management Meeting [CD]
=========================================
Claire reported that the monthly NGI meeting had just been held. Dave Wallom was representing
the UK on the EGI Elixir Virtual Team. There is a call for EGI Champions - so nominations were
solicited (basically can fund some travel). The meeting discussed the imminent VOMS migration
and Claire was asked whether all UK NGI services had been restored following the power cut at
RAL (the answer was "yes"). DB raised the issue of contingency planning for NGI services. It was
agreed to make a list of services and to evaluate the need and status of contingency plans against
each.
1. Tier-1 Power Outage [AS]
===========================
AS described the events of last week when a power cut at RAL and the failure of the generator
brought down the whole Tier-1. The only data loss was "data-in-flight" and only a modest amount
of hardware had to be repaired. A full SIR will be made available; there are some more details in
the Tier-1 report below. It was noted that although the generator was tested on a monthly basis, it
had not been load tested. DB asked whether the recent departure of the Operations Manager had
compounded the situation (probably not).
2. Quarterly Reports: Issues from 12Q3 [PG]
===========================================
PG circulated a summary of 12Q3 quarterly reports. The Tier-1's performance in Q3 had been
excellent. PG/AS asked whether there should be a review of the Tier-1 next May as per the project
milestones. DB noted that the lightweight, informal review held last June had been very
informative; AS confirmed that it had been useful. Therefore, it was agreed that a repeat should be
scheduled in May 2013. It was noted that there was a slight delay in the disk procurement that
increased the risk of missing the deployment deadline for the MOU in April 2013. Delivery was
due in January. DB noted that this should still give time for 4-6 weeks burn-in and then deployment
before the deadline. JG noted that we might expect to run into some problems so there
was a chance that perhaps half the capacity might be late. DB expressed his hope that this would
not happen.
Q3 had been less stellar at the Tier-2s, with poor availability at Glasgow for ATLAS (power issues)
and data loss at Cambridge. CMS and LHCb had had a good quarter. T2K were investigating their
storage requirements; it was hard for them to work out how much disk they were using at Tier-2s
due to shared resources with other VOs. The transition to EMI middleware had been something of a
concern at the end of the quarter but now, one month later, the UK was in good shape.
AOCB
====
1) EU Researcher Article: This non-refereed journal had approached DB about GridPP paying to
publish an article. DB had referred this to Neasan. The proposal was for 1500 words for £3000. The
PMB could not see how this would be of value. The decision was not to proceed.
2) ORACLE Licences: CERN (Tony Cass) had written to GridPP (DB) to request planning numbers
of ORACLE Licences. AS had started the inventory but there were some outstanding questions,
particularly around ATLAS. DB had discussed this with RJ: it looked likely that ATLAS would like RAL
to continue to host the 3D DB but not likely that the TAG DB would be required in its current form.
AS would use this input and come back with a plan next week.
ACTION
479.1 AS to provide ORACLE licence plan.
3) HAG: The hardware advisory group had met. JG had circulated an email to the PMB and the
salient points were in the Tier-1 Manager's report below.
4) EGI Software Support: Oxford had received an email about SAM support. This was something
that had been discussed a long time ago by JG with EGI - providing support for APEL and SAM.
There was the odd month of effort funded to provide this, but it was felt to be a very low level
commitment and it was agreed that no further action was required (such as transferring this
month of funding to Oxford) unless the task proved more onerous than expected.
5) GridPP30: DC reported that IC no longer had student accommodation at Easter. DB asked about
local hotels but realised this was unlikely to be affordable. DC would check. PG suggested
contacting Dell about their conference centre in Ireland. CD suggested holding it in conjunction
with EGI in Manchester. DB/CD/PG/DC would look into these options.
STANDING ITEMS
==============
SI-1 Dissemination Report [SL]
-------------------------
SL noted that a KE meeting had been arranged for Nov 27th at QM to be attended by at least
SL, NO, DB and CD. Other PMB members were invited. DC and JC expressed interest. It was agreed,
therefore, to start at 12:45 to avoid Ops-team.
SI-2 ATLAS Weekly Review and Plans [RJ]
----------------------------------
RJ was not present due to teaching.
SI-3 CMS Weekly Review and Plans [DC]
--------------------------------
DC reported no issues from CMS operations. However, Stuart Wakefield had now left and some
issues with Brunel had been found where his certificate had been hardwired.
SI-4 LHCb Weekly Review and Plans [PC]
---------------------------------
No issues for LHCb.
SI-5 Production Manager's weekly report [JC]
---------------------------------------
JC reported as follows:
1) An upgrade of the GridPP VOMS takes place this Wednesday (14th). VO-admins have been
informed of the read-only period during the upgrade and that the new VOMS version has new
notification policies and in particular VO-admins will now "get regular emails about expired
users, or users that are going to expire" (see details here:
https://www.gridpp.ac.uk/wiki/VOMS_Notifications).
2) There was a power cut that affected RAL at 11:30 UTC last Wednesday 7th November and the
backup diesel generators failed. This affected UK Tier-2 work but did not lead to any complaints.
We will review the impacts (and any lessons learned) at the ops meeting tomorrow - for example
top-BDII settings used by the UK Nagios testing and GOCDB failover. APEL processing at RAL was
also affected and sites were asked to temporarily avoid republishing data.
3) No GridPP/UK sites have been designated as unresponsive by EGI with regard to their EMI
upgrade progress and plans (but see D below for the process being followed).
4) Steady (positive) progress is being made with producing an EMI-2 tarball WN. Testing last
week showed a working version with ATLAS. (Reminder: The current deadline for sites to move
from gLite 3.2 WNs is the end of November).
5) HEPSYSMAN took place at Lancaster on Friday
(https://indico.cern.ch/conferenceDisplay.py?confId=211206). A flexible format and short-talks
approach worked well.
For information:
A) There is a GDB this week: http://indico.cern.ch/conferenceDisplay.py?confId=155074. Topics
include: GGUS recent developments; an update on the Security WG activities; Glue 2.0; IPv6 and
plans for the deployment of M/W clients (in light of EMI ending soon).
B) A statement on the DPM collaboration is now online:
https://svnweb.cern.ch/trac/lcgdm/blog. Planning for the DPM community workshop in
December has started: http://indico.cern.ch/conferenceDisplay.py?confId=214478.
C) The EGI-Inspire task TSA1.5 (accounting) has been handed over from John to Alison Packer
(STFC).
D) An EGI CSIRT process to handle unsupported gLite service end-points of unresponsive sites
that failed to reply to COD tickets and to provide information about their upgrade plans has now
been agreed. From today sites affected will be asked to put old endpoints into downtime and from
the 19th, unresponsive sites will risk suspension.
SI-6 Tier-1 Manager's weekly report [AS]
-----------------------------------
AS reported as follows:
Fabric
------
1) Disk tender evaluation complete. Expect to start standstill shortly.
2) CPU orders placed.
3) Review of our network performance indicates problem with our outbound rate to most/all
sites. Still investigating.
4) Site networking plan a short intervention on our board on the main site router on Tuesday 13th
November. This will lead to a short scheduled outage. We will not be scheduling an intervention on
our internal stacks as suggested last week as testing could not be completed owing to the power
failure.
Service
-------
1) A major (>50%) site wide power failure at 11:20 on Wednesday 7th November (last major
power failure 44 months ago). Trip occurred at main site substation (cause being investigated).
UPS generator started but would not accept load (cause being investigated). Critical (UPS battery
protected) services operated for about 20 minutes but had to be shut down as cooling requires
generator. Power to machine room restored at 14:20. External national and international services
(FTS, BDII, WMS, LFC, GOC, APEL) restored by 18:00 (some much earlier). Batch and CASTOR
services restored by 14:00 on 8th November. Generator circuit remains faulty. Generator will not
start in event of another power failure. Investigation and generator load test being scheduled for
20th November but until then our UPS critical systems remain at risk in event of further power
problems. Post Mortem (SIR) underway.
2) CASTOR
a) On Sunday (again) problems with ATLAS SRM owing to database choosing non-optimal
execution plan. Expect to lock down the execution plans this Tuesday.
b) Intermittent CMS SRM test failures - leading to around 20% degradation in test results. Seems
to be an increasing problem, but the cause is not understood. Does not seem to be noticeably
impacting production work.
3) On Saturday problems with CRLs expiring on CEs. Investigating how this happened.
Inconvenient that CERN CRLs expire on Saturday (known problem).
4) Hyper-threading change rollout started.
5) EMI-2 worker node update in pipeline. Expected before end of month.
SI-7 LCG Management Board Report of Issues [JG/DB]
------------------------------------------
There had been no MB. JC asked about the software lifecycle plan that had been presented in
outline at the last but one MB and then at the GDB. DB had not heard anything more.
REVIEW OF ACTIONS
=================
438.9 AS to contact relevant site managers to ask whether or not they would be interested in
having retired Tier-1 hardware - if a site were interested then they should submit a proposal as to
what they want and why.
ONGOING
478.1 RJ to draft response to the ATLAS message about Conditions db and Frontier server and
iterate with DB.
ONGOING
478.2 DC to propose dates for GridPP30.
NO ACCOMMODATION. ACTION CLOSED
478.3 SL to talk with NO; possibly a meeting with DB/SL/NO/CD about targeting our
dissemination.
DONE - ARRANGED FOR 27TH
478.4 RJ to report back to the PMB about the DELL LHC programme after he'd talked to Andy
Langford.
DONE AND ONGOING!
ACTIONS AS OF 12.11.12
======================
438.9 AS to contact relevant site managers to ask whether or not they would be interested in
having retired Tier-1 hardware - if a site were interested then they should submit a proposal as to
what they want and why.
478.1 RJ to draft response to the ATLAS message about Conditions db and Frontier server and
iterate with DB.
479.1 AS to finalise ORACLE licence planning.
479.2 RJ to report back to the PMB about the DELL LHC programme after he'd talked to Andy
Langford.