JISCMail - UKHEPGRID Archives

UKHEPGRID@JISCMAIL.AC.UK

Subject: Minutes of the 476th to 479th GridPP PMB meeting
From: David Britton
Reply-To: David Britton
Date: Mon, 10 Dec 2012 10:47:43 +0000
Content-Type: multipart/mixed
Parts/Attachments: text/plain (46 lines), 121015.txt (1 lines), 121029.txt (1 lines), 121105.txt (1 lines), 121112.txt (1 lines)

Dear All,


Please find attached the GridPP Project Management Board
Meeting minutes for the 476th meeting to 479th meeting.

The latest minutes can be found at:

http://www.gridpp.ac.uk/php/pmb/minutes.php?latest

as well as being listed with other minutes at:

http://www.gridpp.ac.uk/php/pmb/minutes.php

Cheers, Dave.


GridPP PMB Minutes 476 (15.10.2012)

===================================



Present:  Dave Britton (Chair), Pete Gronbech, Jeremy Coles, Andrew Sansum,



Apologies:   Roger Jones, Steve Lloyd, John Gordon, Dave Kelsey, Pete Clarke, Tony Cass, Tony 
Doyle, Dave Colling, Claire Devereux, Neil Geddes



1.  Synergies with DIRAC

========================

Jeremy Yates had circulated a document and DB asked how we should respond.  DB proposed that 
the last section of the document be discussed to see what was possible; there might be actions 
to generate.



- identity management

DB noted that the UK3A bid had already been submitted



- GPFS multi-cluster

DB had been pursuing this already but GridPP probably would not want to consider GPFS because 
of long-term licensing costs.  We could however monitor DIRAC's progress - who was available to 
do that?  It could be delegated to the Storage Group.  JC advised that GPFS was not popular among 
the Storage Group.  DB considered that we wanted someone to monitor and understand what 
DIRAC was doing, then we could get a presentation in a year's time.



ACTION

476.1  PG to ask the Storage Group to be aware that DIRAC may deploy/test a form of GPFS as a 
prototype for a national system, the Storage Group to monitor and keep abreast of progress.



- creating VOs

Did this relate to the technical side of things, or to outreach?  In the long term, could we envisage 
VOs which might need/use both GridPP and DIRAC?  It was premature at this stage to consider 
this.



- sharing resources

It was noted that our CPU was full these days, but HPC was not so full - could we use the compute 
power that was available out there?  We couldn't use HECTOR at Edinburgh due to the 
architecture - was there anything else?  AS considered we might make use of other shared 
facilities, support edge nodes and buy-in, however this had been difficult to do in the past.  DB 
thought that Institute resources would be better, for example DIRAC at Cambridge - should we 
talk to them?  Was it too much work for too little gain?  PG advised that Cambridge suffered from a 
lack of manpower.  JC noted that we had just disengaged from the Condor cluster.  DB thought it 
was good for Institutes to have their clusters used, but recognised the manpower issues.  This 
needed to be devolved to Institutes to take forward, namely Oxford, SouthGrid, and Cambridge.  JC 
noted we needed manpower to pursue the technical side.



- helpdesk

DB noted this related to local support for DIRAC.  AS considered that it was a different concept to 
what we did at our Helpdesk.



- training

This would only be needed if we had things in common.



- security policy

DB considered that until DIRAC joined-up their technology then security was really a local issue 
only.  AS advised that there could be federated issues - they were at the stage we had been around 
8-10 years ago, sites not disclosing issues etc.  DB noted that there was a new GridPP Security 
Officer now, we could ask DK and the new Officer to have a dialogue with DIRAC to ascertain 
whether there was any common ground.



- operations

Could we collaborate here?  It was thought no, not until we had something in common.  AS noted it 
might be possible in relation to monitoring frameworks.



ACTION

476.2  DB to invite JY and his Sysadmin to visit Lancaster or attend a HEPSYSMAN meeting.



- outreach

It was thought that collaboration in this area was possible and that Neasan O'Neill should be involved.



ACTION

476.3  DB to feedback to JY the PMB discussion regarding possible synergies with DIRAC.



2.  AOB

=======

- Track Convenors

There had been a call for CHEP Convenors, however possible contenders were not here today.  RJ 
was a possibility.



- NGS CertWizard

This would be discussed in JC's report, but it was noted that this issue was being widely discussed 
at present.  Constructive comments were required so that we could feed back relevant 
information.  It was known that NGS CertWizard was causing some problems that might be due to 
the clarity of the instructions or more technical in nature.  JC noted he would be getting feedback 
from Jens Jensen at the Ops Team.  DB noted that this issue needed to be sorted out by someone.



- next PMB

DB noted he was travelling next week and other PMB members were also away.  PG advised that 
the Quarterly Reports were still awaited from some.  It was agreed, in the light of absences, that 
there would be no PMB meeting next Monday 22nd October.  The next PMB would take place on 
Monday 29th October.



STANDING ITEMS

==============

SI-1  Dissemination Report

--------------------------

SL was absent.



SI-2  ATLAS weekly review & plans

---------------------------------

RJ was absent.



SI-3  CMS weekly review & plans

---------------------------------

DC was absent.



SI-4  LHCb weekly review & plans

---------------------------------

PC was absent.



SI-5  Production Manager's Report

---------------------------------

JC reported as follows:

1) There were a number of current topics touched upon at the GDB last week 
(http://indico.cern.ch/conferenceOtherViews.py?view=standard&confId=155073). Sites running 
unsupported gLite 3.2 services will be ticketed from the start of November and must by then have 
a plan to move to EMI or an escalated technical reason that prevents them upgrading. The GridPP 
sites (still using gLite CEs) at the ops meeting last Tuesday all indicated plans to move their CEs 
before the end of October. 



- There are a number of activities involved with Storage Federations (failover, self-healing, 
caching...). GridPP sites are involved with both the ATLAS and CMS testing.



- Publishing WN environments is still being tested. 



- Jamie Shiers's talk on post EGI-Inspire emphasised the need for WLCG to work closely with other 
communities in new areas post EGI-Inspire. FP8/Horizon 2020 calls likely in data management 
and data preservation. JS should meet with PC/DC/RJ to push forward a common position 
regarding data preservation in the context of potential funding.



ACTION

476.4  PC/DC/RJ to meet with Jamie Shiers in order to push forward a common position regarding 
data preservation in the context of potential funding and FP8/Horizon 2020 calls.



- Markus Schulz circulated a proposal paper for middleware support post EMI 
(http://indico.cern.ch/materialDisplay.py?contribId=12&sessionId=1&materialId=paper&confId
=155073).



2) As part of our (GridPP) contribution to the future necessity of community supported activities, 
some of the ops team are now learning how to produce the WN tarball installs that we need.



In the DPM area, there is now confirmed interest in the community support model from France 
and Taiwan, and it is likely that we will be able to continue without the initially proposed MoU 
structure. CERN management have yet to discuss the CERN contribution.  There were possible 
alternative fixes from DPM - information to be sent by JC to the Glasgow Team.



ACTION

476.5  JC to send info on possible alternative DPM fixes to the Glasgow Team.



3) The next EGI Community Forum will be hosted by the departments of IT Services and Particle 
Physics, University of Manchester, UK between 8-12 April 2013. Wahid would like us to 
consider running a Storage Workshop in conjunction with this meeting (an extended version of a 
DPM workshop that will likely take place in the UK around April).



DB noted that in principle this was a good idea, but we needed to ensure that our costs would not 
be too high as a result.



4) Last Thursday the core ops team discussed progress and plans in each of the core task areas. 
Updates are captured in the meeting page here 
https://indico.cern.ch/conferenceDisplay.py?confId=212408. (This is for reference but I can talk 
through the areas at the PMB if there is time/interest). One item of note concerns other VOs. We 
currently point these VOs to use SRM, WMS and LFC yet there are indications that the LHC 
experiments will move away from them.  There was an issue about support in the longer term.



5) Communications have been sent out to our UK hosted VOs informing the VO-admins about 
upcoming changes in a number of areas and particularly with the EMI middleware transition (CEs 
and WNs). There are few indications that the VOs are testing and most likely problems will need 
to be dealt with if and when they arise. 



6) There have been multiple discussions about the CA CertWizard 
(http://www.ngs.ac.uk/use/tools/certwizard) in the last week. It is a tool for managing 
certificates. There are no current plans to replace the browser interface for certificate 
management, but Jens will be joining the ops meeting tomorrow to explain the rationale, plans 
and take feedback.



For information:



A) HEPiX takes place this week in Beijing: 
https://indico.cern.ch/conferenceOtherViews.py?view=standard&confId=199025. 



B) The next WLCG coordination meeting takes place this Thursday: 
https://indico.cern.ch/conferenceDisplay.py?confId=212691.



C) The next HEPSYSMAN meeting takes place on 9th November in Lancaster: 
http://hepwww.rl.ac.uk/sysman/Nov2012/main.html. 



SI-6  Tier-1 Manager's Report

-----------------------------

AS reported as follows:



Fabric

------

1) Disk tender closed - evaluation underway

2) CPU tender evaluation complete - now with procurement team



Service

-------

1) Operations continue generally smoothly 



2) CASTOR

a) CASTOR 2.1.12 upgrade for LHCb was cancelled last Tuesday while we investigated a possible 
problem with the previous ATLAS upgrade. This eventually turned out to be a false alarm and 
upgrade scheduling is underway again.

b) CMS upgrade now scheduled for this Tuesday 16th October. LHCb upgrade planned (TBC) for 
23rd October.

     

3) Upgrade to EMI2 CREAM CE in final tests but some publishing problems remain. Things are 
tight for us to meet our deadline to have switched off the old gLite CEs by the end of October or 
face possible suspension. However systems are deployed and being tested and we expect to move 
to full production this week.



4) Hyper-threading change has been approved to exploit hyper-threading by running more jobs 
than cores. This is a simple change to implement but does come with some risks/issues as well as 
benefits. Implementation scheduled for next month after CE change this month.



- We will nominally gain an additional 8647 HEPSPEC from the existing hardware

- We will allow an additional 2048 job slots to run. The amount we over-commit will differ on the 
different generations:

  * 10 slots on the 8-core 2009 generation

  * 20 slots on the 12-core 2010/2011 generations

- We will gradually ramp up the number of additional job slots in case of load issues on the batch 
server (risk)

- CPU scale factors will be set according to the new benchmarked per job slot performance. This is 
only relevant when the worker node is fully occupied. When occupancy is below max, CPUs will 
effectively be faster than published and so we will under account work done at the accounting 
portal.

- Job efficiency will still be able to discriminate between efficient and inefficient work, but average 
job efficiency is no longer a measure of how much useful work is done on the farm (it remains a 
measure of how efficient jobs are).

- "Wasted CPU hours" from the efficiency stats become even less meaningful, because if a job does 
not use the execution units another overcommitted job will.

- By committing memory to run more jobs per node we have reduced our capacity to run large 
memory jobs (or vice versa). New hardware will be purchased configured with enough memory to 
support all hyper-threads concurrently.
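
As a rough illustration of the over-commit arithmetic in the list above, the following Python sketch computes the extra job slots per node and the slots-per-core ratio for the two hardware generations. The core and slot counts are the ones quoted above; the class and function names are invented for this sketch, and the node counts in the usage lines are hypothetical placeholders, because the minutes do not give them.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Generation:
    """One worker-node generation: physical cores and job slots per node."""
    name: str
    cores: int
    slots: int

    @property
    def extra_slots_per_node(self) -> int:
        # Job slots beyond one job per physical core.
        return self.slots - self.cores

    @property
    def overcommit_ratio(self) -> float:
        # Job slots per physical core.
        return self.slots / self.cores


# Figures quoted in the minutes.
GEN_2009 = Generation("2009", cores=8, slots=10)
GEN_2010_11 = Generation("2010/2011", cores=12, slots=20)


def total_extra_slots(node_counts: dict[Generation, int]) -> int:
    """Sum the extra slots over a farm; node counts are supplied by the caller."""
    return sum(gen.extra_slots_per_node * n for gen, n in node_counts.items())


if __name__ == "__main__":
    for gen in (GEN_2009, GEN_2010_11):
        print(gen.name, gen.extra_slots_per_node, round(gen.overcommit_ratio, 2))
    # Hypothetical node counts, for illustration only (not from the minutes).
    print(total_extra_slots({GEN_2009: 100, GEN_2010_11: 200}))
```

The sketch covers only the slot counts. It does not model the accounting effect described above, where a partly occupied node runs each job faster than the published per-slot scale factor.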



5) Backup Oracle (and Frontier) Service for CMS - we expect to receive a formal request shortly to 
run a global backup Oracle service for the CMS conditions D/B. Given the reduction in load on 
Oracle from ATLAS LFC and LHCB 3D/LFC we expect to be able to meet Oracle licensing and 
database hardware mainly from existing resources, but we'll need to assess exact requirement 
before reaching a final conclusion.



DB noted that DC should request this via the PMB.



AOB

===

- GridPP30

PG asked what was happening about this?  DB advised that DC said he would look into hosting the 
meeting at the Royal Geographical Society near Imperial.



ACTION

476.6  DC to investigate the hosting of GridPP30 at the Royal Geographical Society near Imperial, 
and report back.



- European PP Strategy

AS reported that there had been an internal request within STFC regarding the European Particle 
Physics Strategy process and a discussion about national laboratories.  John Womersley was 
putting together the proposal that RAL was a National Lab including the Tier-1.



ACTION

476.7  AS to check with John Womersley regarding the proposal that RAL be considered as a 
National Lab including the Tier-1.  AS to find out status of the proposal and report back.





REVIEW OF ACTIONS

=================

438.9  AS to contact relevant site managers to ask whether or not they would be interested in 
having retired Tier-1 hardware - if a site were interested then they should submit a proposal as to 
what they want and why.  Ongoing.



475.1  DB/JC, in conjunction with AS, to consider and draft Terms of Reference (ToR) for the 
proposed GridPP Cloud Group.  Ongoing.



475.2  DB to draft a response to Peter Coveney's email request, using PC's suggestions and in the 
light of PMB discussion.  Done, item closed.





ACTIONS AS AT 15.10.12

======================

438.9  AS to contact relevant site managers to ask whether or not they would be interested in 
having retired Tier-1 hardware - if a site were interested then they should submit a proposal as to 
what they want and why.



475.1  DB/JC, in conjunction with AS, to consider and draft Terms of Reference (ToR) for the 
proposed GridPP Cloud Group.



476.1  PG to ask the Storage Group to be aware that DIRAC may deploy/test a form of GPFS as a 
prototype for a national system, the Storage Group to monitor and keep abreast of progress.



476.2  DB to invite Jeremy Yates and his Sysadmin to visit Lancaster or attend a HEPSYSMAN 
meeting, to help move forward with DIRAC synergies.



476.3  DB to feedback to Jeremy Yates the PMB discussion regarding possible synergies with 
DIRAC.



476.4  PC/DC/RJ to meet with Jamie Shiers in order to push forward a common position regarding 
data preservation in the context of potential funding and FP8/Horizon 2020 calls.



476.5  JC to send info on possible alternative DPM fixes to the Glasgow Team.



476.6  DC to investigate the hosting of GridPP30 at the Royal Geographical Society near Imperial, 
and report back.



476.7  AS to check with John Womersley regarding the proposal that RAL be considered as a 
National Lab including the Tier-1.  AS to find out current status of the proposal and report back.



There would be *no* PMB on Monday 22nd October.  The next PMB would take place on Monday 
29th October at 12:55 pm.




GridPP PMB Minutes 477 (29.10.2012)

===================================



Present:  Dave Britton (Chair), Andrew Sansum, Roger Jones, Pete Clarke, Tony Cass, Tony Doyle, 
Dave Colling, Claire Devereux, Suzanne Scott (Minutes)



Apologies:  Dave Kelsey, Steve Lloyd, John Gordon, Jeremy Coles, Pete Gronbech, Neil Geddes



STANDING ITEMS

==============

SI-1  Dissemination Report

--------------------------

SL was not present.



SI-2  ATLAS weekly report & plans

---------------------------------

RJ reported that there had been a rolling changeover to the EMI CE at RAL last week and that there 
had been discussions about the process.  Extra disk for ATLAS at RAL was being installed this week, 
but they had held back on the hyperthreading.  High-memory MC jobs had gone to the Tier-1 
recently; the Tier-2s could also contribute to this, but this was to be discussed.  RJ had no major 
problems to report.



SI-3  CMS weekly review & plans

-------------------------------

DC was not present at this stage in the meeting.



SI-4  LHCb weekly review & plans

--------------------------------

PC reported that reprocessing was progressing well; after Christmas they would be doing the 
2011 data reprocessing.



SI-5  Production Manager's Report

---------------------------------

JC was absent but had sent a brief note:



We have made steady progress with removing gLite 3.2 CEs/BDIIs, but some (more than I hoped) 
will certainly remain in early November. Sites have received tickets and all have now responded 
but I am concerned that some of the smaller sites will not follow-up and there is a growing 
possibility they will be suspended/uncertified at some point in the coming month. I will send an 
update next week.



The WN tarball help has not so far developed which is another problem on the horizon when the 
gLite 3.2 WN deadline arrives at the end of November.



SI-6  Tier-1 Manager's Report

-----------------------------

AS reported as follows:



Fabric:

 

1) Disk tender closed - evaluation expected to complete this week.

2) CPU tender standstill complete. Orders about to be raised.

3) Asymmetric network routing discovered for some Tier-1 to RAL traffic. External sites had not 
accepted our OP_N routing. Now corrected.

4) A disk server operating system was accidentally re-installed (human error). This was risk 6 in 
our accidental data loss risk analysis. Mitigation worked - no data lost.



Service:



1) Operations continue generally smoothly 

2) CASTOR

a) CASTOR 2.1.12 upgrade for CMS+LHCB completed. Gen instance will be carried out on Tuesday 
30th. 

     

3) Upgrade to EMI2 CREAM CE completed. Went very well but experiments did not promptly 
change SAM test endpoints so incorrect availability will need correcting. Old gLite nodes will be 
turned off by end of month.



4) WMS services upgrade from gLite. We should now be gLite free.



5) Hyper-threading change has been approved to exploit hyper-threading by running more jobs 
than cores. This is a simple change to implement but does come with some risks/issues as well as 
benefits. Implementation scheduled for next month after CE change this month.



6) Backup Oracle (and Frontier) Service for CMS - we expect to receive a formal request shortly to 
run a global backup Oracle service for the CMS conditions D/B. Given the reduction in load on 
Oracle from ATLAS LFC and LHCB 3D/LFC we expect to be able to meet Oracle licensing and 
database hardware mainly from existing resources, but we'll need to assess exact requirement 
before reaching a final conclusion. 



SI-7  LCG Management Board Report

---------------------------------

DB reported that there had been a discussion re Oracle licences, they were identifying cases 
where Oracle was in use at the Tier-1s; there had been the issue of OSG's contingency plans for 
their CA, users were requesting contingency planning for various scenarios if Certs could not be 
issued - the documents were available publicly.  DB noted that GridPP was in the same situation 
and we should ask the same question for services we don't directly run - the next NGI meeting 
would discuss this on 12th November.  DB noted that the documents re the CA and infrastructure 
were fairly generic and could maybe be used.  There needed to be contingency plans for all NGI 
services.  DB would report-back from the NGI meeting.  CD noted she had this issue on the NGI 
Agenda.



DB continued - there had been an update on the WLCG networking group by Michael Ernst.  The 
Oversight Board had raised a query about the networking group's remit, in order to clarify how it 
related to other bodies.  DB reported that there had been a bit of discussion about this group 
generally and 'bandwidth on demand'; no further action was required at present.  There had 
followed a discussion on common projects, then a discussion on the WLCG software life-cycle 
process.  DB noted there would shortly be a Russian Tier-1.



AS had sent an email regarding Oracle.  He advised that the licence requirements were reducing 
over the next few years but the maintenance bill was due in GridPP4.  AS noted he was awaiting 
formal information from CERN.  DB asked whether we would need fewer licences going forward 
than was originally planned.  AS confirmed yes - the bulk of licences go on CASTOR.  DB noted that 
at RAL the dominant factor was CASTOR, therefore the LFC and FTS changes would not affect 
things much.  AS agreed, and he would send round a summary.  DB noted that regarding the 
backup service for CMS we didn't want additional costs.



DC had joined the meeting and advised that he had a chat with Ian this morning.  The CMS request 
was not high on their wishlist but it would be good to have.  CMS may try and move away from 
Oracle.  DC noted that Fermilab had almost no Oracle licences at all.



1.  ToR for Cloud Group

=======================

A proposal document had been circulated by DB and he had sent it to AS for comment.  AS noted 
only one minor thing: 'production' cloud service could perhaps be modified to 'prototype' cloud 
service.  DC was to give feedback.  Any other comments should be sent to DB/DC.  It was noted 
that the document would be used as the basis for moving forward.  There would be a monthly 
report to the PMB.  Would PC and RJ be involved?  PC advised that a PDRA post was being 
advertised and this was something that the prospective member of staff could be involved with on 
behalf of LHCb.  RJ advised that he had been discussing this within ATLAS and a few people were 
interested, but this was to be confirmed.  DC should convene a meeting soon to start-off this Cloud 
Group.



2.  AOB

=======

- DELL LHC Programme

It was noted that George Jones had left DELL.  PG had received a message from Gary Kriegel noting 
that the Programme was currently in transition and that LHC pricing was being determined for 
the future.  It was thought that the programme could disappear entirely.  RJ would contact Andy 
Langford and thereafter the DELL contact he met at Manchester.



ACTION

477.1  RJ to contact Andy Langford and thereafter the DELL contact he met at Manchester in 
relation to DELL LHC programme changes.



AS advised that DELL hadn't made the cut for the CPU service, possibly reflecting their change of 
emphasis.



- DPHEP meeting

DB asked about this meeting - was anyone going?  PC noted no - it was difficult to get to Marseille 
from Edinburgh.  RJ noted he had also dropped out due to the change of venue from Munich.  PC 
advised that Marco would be going for LHCb.  ATLAS would not have any representation.





REVIEW OF ACTIONS

=================

438.9  AS to contact relevant site managers to ask whether or not they would be interested in 
having retired Tier-1 hardware - if a site were interested then they should submit a proposal as to 
what they want and why.  Ongoing for 2006 generation.



475.1  DB/JC, in conjunction with AS, to consider and draft Terms of Reference (ToR) for the 
proposed GridPP Cloud Group.  Done, item closed.



476.1  PG to ask the Storage Group to be aware that DIRAC may deploy/test a form of GPFS as a 
prototype for a national system, the Storage Group to monitor and keep abreast of progress.  
Ongoing.



476.2  DB to invite Jeremy Yates and his Sysadmin to visit Lancaster or attend a HEPSYSMAN 
meeting, to help move forward with DIRAC synergies.  Done, item closed.



476.3  DB to feedback to Jeremy Yates the PMB discussion regarding possible synergies with 
DIRAC.  Done, item closed.



476.4  PC/DC/RJ to meet with Jamie Shiers in order to push forward a common position regarding 
data preservation in the context of potential funding and FP8/Horizon 2020 calls.  Done, item 
closed.



476.5  JC to send info on possible alternative DPM fixes to the Glasgow Team.  Done, item closed.



476.6  DC to investigate the hosting of GridPP30 at the Royal Geographical Society near Imperial, 
and report back.  DC would check the Physics Dept and Halls of Residence.  Done, item closed.



476.7  AS to check with John Womersley regarding the proposal that RAL be considered as a 
National Lab including the Tier-1.  AS to find out current status of the proposal and report back.  
Done, item closed.



ACTIONS AS AT 29.10.12

======================

438.9  AS to contact relevant site managers to ask whether or not they would be interested in 
having retired Tier-1 hardware - if a site were interested then they should submit a proposal as to 
what they want and why.



476.1  PG to ask the Storage Group to be aware that DIRAC may deploy/test a form of GPFS as a 
prototype for a national system, the Storage Group to monitor and keep abreast of progress.



477.1  RJ to contact Andy Langford and thereafter the DELL contact he met at Manchester in 
relation to DELL LHC programme changes.



The next PMB meeting would take place on Monday 5th November at 12:55 pm.








GridPP PMB Minutes 478 (05.11.2012)

===================================



Present:  Dave Britton (Chair), Pete Gronbech, Andrew Sansum, Roger Jones, Pete Clarke, Tony 
Cass, Dave Colling, Claire Devereux, Steve Lloyd, John Gordon, Jeremy Coles, Dave Kelsey



Apologies:  Tony Doyle, Neil Geddes





Agenda:



1. ATLAS - Oracle for conditions DB and Frontier Server at RAL [RJ/AS]

======================================================================

ATLAS has asked the 5 Tier-1s (which includes RAL) that host the Conditions DataBase and 
Frontier Servers in addition to CERN, whether they intended to continue to do so for Run2 (i.e. 
until 2018). ATLAS were not sure how many instances were required: it might not be 5 but it was 
certainly "some".  AS noted that the 3D database required some 6 Oracle licences (compared to 
something like 30 for CASTOR) and this might reduce to 4, so was not a dominant factor. RJ had 
yet to receive an answer from ATLAS as to the experiment's longer-term plans with respect to 
Oracle. ATLAS has requested a response by mid-Nov. DB suggested that RJ find out a little more 
about ATLAS' position and draft an initial response on the basis that it was not regarded as a big 
problem by the Tier-1. DB would want to add some caveats about the timeframe involved.



ACTION

478.1  RJ to draft response to the ATLAS message and iterate with DB.



AOCB

====

1) PG had been away last week and would summarise quarterly reports at the next PMB meeting.

2) DC had made some enquiries about GridPP30 at Imperial and would make a proposal on dates 
to the PMB this week.



ACTION

478.2  DC to propose dates for GridPP30.





STANDING ITEMS

==============

SI-1 Dissemination Report [SL]

-------------------------

SL reported that he had received the following from NO:

- Published Ganga news item

- Waiting to publish LCG CE news item

- Sussex news item ready for when they go into production

- perfSONAR news item in the works

- VOMS Snooper news item also in the works

- GridPP (and PG) in Linux Format this month

- I've been officially added to the LOC for the Community Forum (well I'm included in the phone 
calls)



DB expressed a concern that the events of September had demonstrated that our dissemination 
overall as a project had some gaps. News items were fine but they only addressed one area of 
dissemination. In particular, GridPP needs better contact with industry and better 
visibility within the developing UK e-infrastructure community. A discussion ensued, with broad 
agreement that there was an issue. It was felt that we need to target some very specific things: a 
project with an industrial partner would be valuable; money might be available from the various 
STFC impact programmes if something could be identified.



ACTION

478.3  SL to talk with NO; possibly a meeting with DB/SL/NO/CD?



RJ noted that the website needed to be fixed so that the old Excel visit-notice was no longer linked 
from the resources page. DK said he would contact Andrew McNab.



SI-2 ATLAS Weekly Review and Plans [RJ]

----------------------------------

Main issue was that RAL had been moved out of raw-data export. This might be due to OPN 
saturation but there are several independent network-related issues on-going at RAL and AS was 
still trying to get to the bottom of this. The UK Tier-2s also seem to have a number of unrelated 
issues at present, but nothing too serious. Lancaster would shortly be moved off the light path 
now that the link north was up and running.



SI-3 CMS Weekly Review and Plans [DC]

--------------------------------

DC reported that things were fine with CMS. He had noted that the UK Tier-2s had appeared in the 
top grouping of global CMS Tier-2 sites (along with the US and DESY) in terms of cpu-hours 
delivered and analysis delivered. DC noted that he was currently setting up the cloud-group and 
an email list would be established this week. The possibility of hosting a duplicate CMS conditions 
db at RAL was discussed. The costs included £2.5k for nodes; £8.7k for disk; and £2k? for Oracle 
Licence(s). It was not yet clear how many Oracle Licences would be needed. AS would get back to 
DC with the complete details and DC would talk to Ian Fisk as to whether the costs were 
justifiable.
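
For reference, the indicative cost items quoted in this discussion can be totalled with a short Python sketch. The amounts are taken as pounds sterling, which is an assumption about the currency; the dictionary keys and function names are invented for illustration, and the Oracle licence figure was itself marked as uncertain in the discussion.

```python
# Indicative one-off costs from the discussion, in pounds (currency assumed).
# The Oracle licence figure was flagged as uncertain in the minutes.
COSTS_GBP = {
    "nodes": 2_500,
    "disk": 8_700,
    "oracle_licences": 2_000,
}


def total_cost(costs: dict[str, int]) -> int:
    """Return the sum of all cost items."""
    return sum(costs.values())


def total_without(costs: dict[str, int], item: str) -> int:
    """Return the total excluding one item, e.g. the uncertain licence cost."""
    return sum(value for key, value in costs.items() if key != item)


if __name__ == "__main__":
    print(total_cost(COSTS_GBP))
    print(total_without(COSTS_GBP, "oracle_licences"))
```

Keeping the licence cost as a separate item makes it easy to see how much of the total depends on the number of Oracle licences, which had not yet been settled.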



SI-4 LHCb Weekly Review and Plans [GP]

---------------------------------

PC reported that there were no issues on the LHCb side.



SI-5 Production Manager's weekly report [JC]

---------------------------------------

JC reported that:

1)   We have agreed a VOMS upgrade/switch for 14th November. There will be a brief period 
where VO information will not be editable but otherwise the switch will be transparent for VOs 
already hosted on gridpp.ac.uk. David Wallom has been liaising with the NGS VOs that are coming 
on to the gridpp VOMS.

 

2)   A validator script running on VOMRS to check the status of issuer DNs produced some 
confusing messages for (LHC) users last week as old certificate DNs were not deleted in VOMS but 
the certificates against the old CA DN were picked up as failing (due to the old UK CA now having 
expired?) the validation. This seems to have impacted ATLAS team memberships within GGUS for 
editing tickets which used the old certificate status for team membership confirmation.



3)   As of 1st November several GridPP sites were still running gLite 3.2 CEs with no EMI CEs in 
parallel:  UCL, Durham and ECDF. Additional sites with 3.2 CEs that will be removed soon (when 
the EMI CEs are shown stable): Manchester, Sheffield, Bristol and Cambridge.  Some sites have 
deployed EMI-2 SL5 WNs (the status tables are being updated). Alessandra has been tracking 
plans for ATLAS via this page: https://www.gridpp.ac.uk/wiki/UK_EMI2_Deployment.



4)   Last week joint work (finally) began on producing EMI WN tarballs. Needless to say it is not 
quite as simple as early reports suggested it would be. Matt Doidge at Lancaster together with 
Wahid Bhimji are providing the GridPP input. Issues include what "extra" SL rpms need to be 
included and a policy for later allowing use of glexec.



5)  There was a request on TB-SUPPORT for more information on GridPP30 dates.



6) Are there any further PMB comments on the DPM collaboration notice I forwarded from Oliver 
Keeble last week? It mentions the in principle agreement to support the collaboration from 3 
countries and core development effort being provided by CERN.



For information

A)   There is a GDB next week http://indico.cern.ch/conferenceDisplay.py?confId=155074.

B) There is a HEPSYSMAN meeting on Friday:  
http://hepwww.rl.ac.uk/SYSMAN/Nov2012/main.html.



SI-6 Tier-1 Manager's weekly report [AS]

-----------------------------------

AS reported that:

Fabric

------

1) Disk tender closed - HAG meeting scheduled for Tuesday

2) CPU orders placed.

3) Review of our network performance indicates problem with our outbound rate to most/all 
sites. Still investigating.

4) High traffic rate on LHCOPN to RAL at the moment (since Friday) under investigation. May 
need to consider load balancing on backup link in future.

5) Failure of the primary OPN for about 10 hours on 30th October owing to a major fibre cut 
between Gravelines and Bois-Grenier in France.

6) Site networking plan a short intervention on our board on the main site router on Tuesday 13th 
November. This will lead to a short scheduled outage. We may take this opportunity to schedule 
other network work such as performance tests and an upgrade to address bandwidth limitations 
on one of our stack uplinks. 



Service

-------

1) Operations report at:

   https://www.gridpp.ac.uk/wiki/Tier1_Operations_Report_2012-10-31



2) CASTOR

   a) CASTOR 2.1.12 upgrade now complete on all instances.

   b) CASTOR 2.1.13 certification has commenced.

   c) Lengthy (7 hours) downtime on ATLAS instance over weekend. Cause was non-optimal change 
      in execution plan on SRM database. DB team plan to lock down execution plan using Oracle 11 
      feature.

     

3) Hyper-threading change expected to be implemented shortly.





SI-7 LCG Management Board Report of Issues [JG/DB]

------------------------------------------

There had been no MB.



REVIEW OF ACTIONS

=================

476.1 had been done

477.1 had been done but DB opened a new action:



ACTION

478.4  RJ to let PMB know more details about the future of the DELL LHC programme after he'd 
talked to Andy Langford.



ACTIONS AS OF 05.11.12

======================

438.9  AS to contact relevant site managers to ask whether or not they would be interested in 
having retired Tier-1 hardware - if a site were interested then they should submit a proposal as to 
what they want and why.



478.1 RJ to draft response to the ATLAS message about Conditions db and Frontier server and 
iterate with DB.



478.2 DC to propose dates for GridPP30.



478.3 SL to talk with NO; possibly a meeting with DB/SL/NO/CD about targeting our 
dissemination.



478.4 RJ to report back to the PMB about the DELL LHC programme after he'd talked to Andy 
Langford.



The next PMB would take place on Monday 12 November at 12:55 pm.


GridPP PMB Minutes 479 (12.11.2012)

===================================



Present:  Dave Britton (Chair), Pete Gronbech, Andrew Sansum, Pete Clarke, Tony Cass, Dave 
Colling, Claire Devereux, Steve Lloyd, John Gordon, Jeremy Coles, Dave Kelsey



Apologies:  Tony Doyle, Roger Jones, Neil Geddes





0. Summary of NGI Management Meeting [CD]

========================================= 

Claire reported that the monthly NGI meeting had just been held. Dave Wallom was representing 
the UK on the EGI Elixir Virtual Team. There was a call for EGI Champions, so nominations were 
solicited (the scheme can basically fund some travel). The meeting discussed the imminent VOMS migration 
and Claire was asked whether all UK NGI services had been restored following the power cut at 
RAL (the answer was "yes"). DB raised the issue of contingency planning for NGI services. It was 
agreed to make a list of services and to evaluate the need and status of contingency plans against 
each.



1. Tier-1 Power Outage [AS]

=========================== 

AS described the events of last week when a power cut at RAL and the failure of the generator 
brought down the whole Tier-1. The only data loss was "data-in-flight" and only a modest amount 
of hardware had to be repaired. A full SIR will be made available; there are some more details in 
the Tier-1 report below. It was noted that although the generator was tested on a monthly basis, it 
had not been load tested. DB asked whether the recent departure of the Operations Manager had 
compounded the situation (probably not).



2. Quarterly Reports: Issues from 12Q3 [PG]

=========================================== 

PG circulated a summary of 12Q3 quarterly reports. The Tier-1's performance in Q3 had been 
excellent. PG/AS asked whether there should be a review of the Tier-1 next May as per the project 
milestones? DB noted that the lightweight-informal review held last June had been very 
informative; AS confirmed that it had been useful. Therefore, it was agreed that a repeat should be 
scheduled in May 2013. It was noted that there was a slight delay in the disk procurement that 
increased the risk of missing the deployment deadline for the MOU in April 2013. Delivery was 
due in January. DB noted that this should still give time for 4-6 weeks burn-in and then deployment 
before the deadline. JG noted that we might expect to run into some problems so there 
was a chance that perhaps half the capacity might be late. DB expressed his hope that this would 
not happen.

Q3 had been less stellar at the Tier-2s, with poor availability at Glasgow for ATLAS (power issues) 
and data loss at Cambridge. CMS and LHCb had had a good quarter. T2K were investigating their 
storage requirements; it was hard for them to work out how much disk they were using at Tier-2s 
due to shared resources with other VOs. The transition to EMI middleware had been something of a 
concern at the end of the quarter but now, one month later, the UK was in good shape.



AOCB

====

1) EU Researcher Article: This non-refereed journal had approached DB about GridPP paying to 
publish an article. DB had referred it to Neasan. The proposal was for 1500 words for £3000. The 
PMB could not see how this would be of value. The decision was not to proceed.



2) ORACLE Licences: CERN (Tony Cass) had written to GridPP (DB) to request planning numbers 
of ORACLE Licences. AS had started the inventory but there were some outstanding questions, 
particularly around ATLAS. DB had discussed this with RJ: it looked likely that ATLAS would like RAL 
to continue to host the 3D DB but not likely that the TAG DB would be required in its current form. 
AS would use this input and come back with a plan next week.



ACTION

479.1  AS to provide ORACLE licence plan. 



3) HAG: The hardware advisory group had met. JG had circulated an email to the PMB and the 
salient points were in the Tier-1 Manager's report below.



4) EGI Software Support: Oxford had received an email about SAM support. This was something 
that had been discussed a long time ago by JG with EGI - providing support for APEL and SAM. 
There was the odd month of effort funded to provide this, but it was felt to be a very low level 
commitment and it was agreed that no further action was required (such as transferring this 
month of funding to Oxford) unless the task proved more onerous than expected.



5) GridPP30: DC reported that IC no longer had student accommodation at Easter. DB asked about 
local hotels but realised this was unlikely to be affordable. DC would check. PG suggested 
contacting Dell about their conference centre in Ireland. CD suggested holding it in conjunction 
with EGI in Manchester. DB/CD/PG/DC would look into these options.



STANDING ITEMS

==============

SI-1 Dissemination Report [SL]

-------------------------

SL noted that a KE meeting had been arranged for Nov 27th at QM to be attended by at least 
SL, NO, DB and CD. Other PMB members were invited. DC and JC expressed interest. It was agreed, 
therefore, to start at 12:45 to avoid Ops-team.



SI-2 ATLAS Weekly Review and Plans [RJ]

----------------------------------

RJ was not present due to teaching.



SI-3 CMS Weekly Review and Plans [DC]

--------------------------------

DC reported no issues from CMS operations. However, Stuart Wakefield had now left and some 
issues with Brunel had been found where his certificate had been hardwired. 



SI-4 LHCb Weekly Review and Plans [PC]

---------------------------------

No issues for LHCb.



SI-5 Production Manager's weekly report [JC]

---------------------------------------

JC reported as follows:

1)   An upgrade of the GridPP VOMS takes place this Wednesday (14th). VO-admins have been 
informed of the read-only period during the upgrade and that the new VOMS version has new 
notification policies; in particular, VO-admins will now get regular emails about expired 
users, or users that are going to expire (see details here: 
https://www.gridpp.ac.uk/wiki/VOMS_Notifications).

 

2)   There was a power cut that affected RAL at 11:30 UTC last Wednesday 7th November and the 
backup diesel generators failed. This affected UK Tier-2 work but did not lead to any complaints. 
We will review the impacts (and any lessons learned) at the ops meeting tomorrow - for example 
top-BDII settings used by the UK Nagios testing and GOCDB failover. APEL processing at RAL was 
also affected and sites were asked to temporarily avoid republishing data.

 

3)    No GridPP/UK sites have been designated as unresponsive by EGI with regard to their EMI 
upgrade progress and plans (but see D below for the process being followed).

 

4)    Steady (positive) progress is being made with producing an EMI-2 tarball WN. Testing last 
week showed a working version with ATLAS. (Reminder: The current deadline for sites to move 
from gLite 3.2 WNs is the end of November).

 

5)     HEPSYSMAN took place at Lancaster on Friday 
(https://indico.cern.ch/conferenceDisplay.py?confId=211206). A flexible format and short-talks 
approach worked well. 

 

For information:

A)   There is a GDB this week: http://indico.cern.ch/conferenceDisplay.py?confId=155074. Topics 
include: GGUS recent developments; an update on the Security WG activities; Glue 2.0; IPv6 and 
plans for the deployment of M/W clients (in light of EMI ending soon).



B)   A statement on the DPM collaboration is now online: 
https://svnweb.cern.ch/trac/lcgdm/blog. Planning for the DPM community workshop in 
December has started: http://indico.cern.ch/conferenceDisplay.py?confId=214478.



C) The EGI-Inspire task TSA1.5 (accounting) has been handed over from John to Alison Packer 
(STFC).



D) An EGI CSIRT process to handle unsupported gLite service end-points of unresponsive sites 
that failed to reply to COD tickets and to provide information about their upgrade plans has now 
been agreed. From today sites affected will be asked to put old endpoints into downtime and from 
the 19th unresponsive sites will risk suspension.



SI-6 Tier-1 Manager's weekly report [AS]

-----------------------------------

AS reported as follows:

Fabric

------

1) Disk tender evaluation complete. Expect to start standstill shortly. 

2) CPU orders placed.

3) Review of our network performance indicates problem with our outbound rate to most/all 
sites. Still investigating.

4) Site networking plan a short intervention on our board on the main site router on Tuesday 13th 
November. This will lead to a short scheduled outage. We will not be scheduling an intervention on 
our internal stacks as suggested last week as testing could not be completed owing to the power 
failure.



Service

-------

1) A major (>50%) site-wide power failure at 11:20 on Wednesday 7th November (last major 
power failure 44 months ago). Trip occurred at main site substation (cause being investigated). 
UPS generator started but would not accept load (cause being investigated). Critical (UPS battery 
protected) services operated for about 20 minutes but had to be shut down as cooling requires 
generator. Power to machine room restored at 14:20. External national and international services 
(FTS, BDII, WMS, LFC, GOC, APEL) restored by 18:00 (some much earlier). Batch and CASTOR 
services restored by 14:00 on 8th November. Generator circuit remains faulty. Generator will not 
start in event of another power failure. Investigation and generator load test being scheduled for 
20th November but until then our UPS critical systems remain at risk in event of further power 
problems. Post Mortem (SIR) underway.



2) CASTOR

   a) On Sunday (again) problems with ATLAS SRM owing to database choosing non-optimal 
execution plan. Expect to lock down the execution plans this Tuesday.

   b) Intermittent CMS SRM test failures - leading to around 20% degradation in test results. Seems 
to be an increasing problem, but the cause is not understood. Does not seem to be noticeably 
impacting production work. 

     

3) On Saturday problems with CRLs expiring on CEs. Investigating how this happened. 
Inconvenient that CERN CRLs expire on Saturday (known problem).



4) Hyper-threading change rollout started.



5) EMI-2 worker node update in the pipeline. Expected before end of month.



SI-7 LCG Management Board Report of Issues [JG/DB]

------------------------------------------

There had been no MB. JC asked about the software lifecycle plan that had been presented in 
outline at the last but one MB and then at the GDB. DB had not heard anything more.



REVIEW OF ACTIONS

=================

438.9  AS to contact relevant site managers to ask whether or not they would be interested in 
having retired Tier-1 hardware - if a site were interested then they should submit a proposal as to 
what they want and why.

ONGOING



478.1 RJ to draft response to the ATLAS message about Conditions db and Frontier server and 
iterate with DB.

ONGOING



478.2 DC to propose dates for GridPP30.

NO ACCOMMODATION. ACTION CLOSED



478.3 SL to talk with NO; possibly a meeting with DB/SL/NO/CD about targeting our 
dissemination.

DONE - ARRANGED FOR 27TH



478.4 RJ to report back to the PMB about the DELL LHC programme after he'd talked to Andy 
Langford. 

DONE AND ONGOING!



ACTIONS AS OF 12.11.12

======================

438.9  AS to contact relevant site managers to ask whether or not they would be interested in 
having retired Tier-1 hardware - if a site were interested then they should submit a proposal as to 
what they want and why.



478.1 RJ to draft response to the ATLAS message about Conditions db and Frontier server and 
iterate with DB.



479.1 AS to finalise ORACLE licence planning.



479.2 RJ to report back to the PMB about the DELL LHC programme after he'd talked to Andy 
Langford.