Dear All,
Please find attached the GridPP Project Management Board
Meeting minutes for the 447th and 448th meetings.
The latest minutes can be found each week in:
http://www.gridpp.ac.uk/php/pmb/minutes.php?latest
as well as being listed with other minutes at:
http://www.gridpp.ac.uk/php/pmb/minutes.php
Cheers, Dave.
GridPP PMB Minutes 447 (19.12.11)
=================================
Present: Dave Britton (Chair), John Gordon, Jeremy Coles, Andrew Sansum, Dave Colling, Dave
Kelsey, Tony Doyle, Tony Cass, Glenn Patrick, Roger Jones (Suzanne Scott - Minutes)
Apologies: Pete Gronbech, Steve Lloyd, Robin Middleton, Pete Clarke, Neil Geddes
1. DRI Status
==============
DB reported that all was ok; there seemed to be a last-minute hitch but it was resolved. Things
were on track as far as he knew. AS advised that they had checked with Trish Mullins regarding
flexibility - there wasn't any so they would probably be placing orders this week. There was a
price drop due after Christmas. The outstanding issue they had was the £40k resource cost - this
would be for maintenance on switches and they needed to resolve it. They may be able to profile
it across 3 years. DB advised that the end date of all University grants was 31st March 2012. Tony
Medland had confirmed that we can spend up to that point but not after. This would be followed
by the usual 3-month Final Claim period. Final Claims would have to be submitted by end of June
2012.
2. RAL Network Issues
======================
Some time ago, DB had raised the issue of network outages at the Tier-1. AS confirmed that he
had just circulated a document regarding this. AS advised that Gareth and he had discussed
network issues and gone through all of them over the past 12-month period. They did this at the
end of October. Over the year there had been a couple of major issues at sites, also scheduled
interventions. There had been around 2% of lost time over the period. At the Tier-1 the issue had
been the firewall router and the access router. The main challenge was that the site network was not
resilient - they needed to take the site down for interventions. In future, availability should be
better. JG noted that they didn't have an underlying serious hardware problem - human error had
been the main cause. There had been a few recent incidents but the stats over the year showed a
small effect only, at 2%.
DB asked regarding the investment of DRI money - which areas would this address? AS advised
that operationally, the DRI money would allow resilience for doing interventions. DB thought that
next year we should perform a Tier-1 review.
3. RAL Database Issues
=======================
DB asked about the SRM database. AS noted that the situation followed on from the Thursday
network breaks. The problem was they couldn't unstick the ATLAS SRM database, the Resource
Manager was not releasing resources. The team tried to resolve this but got repetition of the
difficulty. Post Mortem detail was awaited. DB advised that we needed to understand whether
anything else was vulnerable. If things start going wrong, what would they do? AS noted they
would move over the databases, and they were prepped and ready to go as it had already been
discussed. More worrying was the fact that the 10g maintenance had expired - this came to light
last week. They would hold a meeting soon to discuss the strategy for the 11g upgrade. TC advised
that the licences came from CERN and the maintenance should be paid for; extended
maintenance was paid. AS would pass this info on to Rich to get in touch with TC if necessary.
4. AOCB
========
DB noted that he had circulated the Minutes from NG re the NGS/GridPP meeting. Could the PMB
get back to him if they had any comments/questions.
STANDING ITEMS
==============
SI-1 Tier-1 Manager's Report
-----------------------------
AS reported that the Viglen deliveries had a delivery date of w/c 24/01/2012 - this was for Viglen
disk and CPU. Clustervision had all of the parts and would probably deliver earlier than Viglen
(Clustervision disk). For DELL there was no date as yet. They had received all of the tape media
and the tape drives were in small orders and were being received. DELL may be running into
problems with disk.
Regarding staff, there were start dates for two fabric Sysadmins, and there had been a new start
on the CASTOR Team recently. They should have four starts in total, and the forecast timeline was
ok.
SI-2 Production Manager's Report
---------------------------------
JC reported as follows:
1) Almost all sites now have CVMFS installed.
2) The latest version of CREAM was planned for release in UMD 1.4.0 (due out today) as SGE
sites need it. But a serious bug was found. Sites are looking at the EMI version but this is
unfortunate given the deadline to remove LCG-CEs. In any case, it is unwise for sites to upgrade
at this stage, just before the holiday period.
3) The UK CA DN update has caused problems for VOMRS renewals for several UK users
(affecting all the LHC VOs, dteam and others) and this has become increasingly evident. An
operation on the backend database at CERN has attempted to work around the problem of manual
interventions being needed. This was done for CMS on Wednesday last week and will be
undertaken for all other VOs too if no unexpected issues arise. The intervention is expected to be
transparent for users. We have not directly informed users to avoid confusion, but a technical
explanation has been posted by Jens for those who do query what has happened:
http://nationalgridservice.blogspot.com/2011/12/ca-stuff.html. Once the workaround is
established as successful and other VO membership entries are updated we will send an update to
GridPP-Users explaining what has happened.
DB asked why this had happened. DK noted it was a new CA, not a VOMS issue. JC advised that
the workaround seemed to be working. He would put a note round the GridPP user list.
For information:
A) There is a DPM community event being proposed for February/March. The storage group (and
ops team) are discussing whether we might consider hosting a DPM workshop next year.
SI-3 ATLAS weekly review & plans
---------------------------------
RJ noted they had been disrupted by networking issues at RAL; they would be working flat-out
over Christmas and there was a call for extra capacity. Apart from that, there were no issues to
report.
SI-4 CMS weekly review & plans
-------------------------------
DC was absent.
SI-5 LHCb weekly review & plans
--------------------------------
GP advised that MC11 production was now ongoing. There were no issues to report.
SI-6 User Co-ordination Issues
-------------------------------
None to report.
SI-7 LCG Management Board Report
---------------------------------
There had been no MB.
SI-8 Dissemination Report
--------------------------
SL was absent.
REVIEW OF ACTIONS
=================
436.12 DB to produce a financial proposal for adjustments to the Tier-2 staffing profile over the
term of GRIDPP4. Ongoing.
438.2 PC to provide feedback and guidance about the data management plan following the CAP
meeting on 4th October 2011. [PC will circulate something to the PMB before submission to STFC
- RJ and PC are dealing with this.] Ongoing.
438.8 TC to advise when it is a good time to move to Vidyo - early adopters were possible.
Ongoing.
438.9 AS to contact relevant site managers to ask whether or not they would be interested in
having retired Tier-1 hardware - if a site were interested then they should submit a proposal as to
what they want and why. Ongoing.
439.1 AS to put together a summary of network issues recently experienced at the Tier-1. Done,
item closed.
446.1 PG to contact each PI individually re the DRI grants to ensure they understood they had to
order/commit an equipment spend on their Institutes' systems before 31st March 2012. Ongoing.
446.2 Re the DRI grants: PG to follow-up with PIs regarding evolution of plans and quotes in
order to monitor spend progress. Ongoing.
446.3 JG to put together a plan for the next joint GridPP/NGS management meeting in January
(which would be followed by the first NGI technical meeting the next day).
446.4 DB to contact Tony Medland and clarify the DRI 'spend by' date. Done, item closed.
446.5 JC to inform Lancaster that they should fund the backup Nagios server from the recent large
grants awarded to it. Done, item closed.
ACTIONS AS OF 19.12.11
======================
436.12 DB to produce a financial proposal for adjustments to the Tier-2 staffing profile over the
term of GRIDPP4.
438.2 PC to provide feedback and guidance about the data management plan following the CAP
meeting on 4th October 2011. [PC will circulate something to the PMB before submission to STFC
- RJ and PC are dealing with this.]
438.8 TC to advise when it is a good time to move to Vidyo - early adopters were possible.
438.9 AS to contact relevant site managers to ask whether or not they would be interested in
having retired Tier-1 hardware - if a site were interested then they should submit a proposal as to
what they want and why.
446.1 PG to contact each PI individually re the DRI grants to ensure they understood they had to
order/commit an equipment spend on their Institutes' systems before 31st March 2012.
446.2 Re the DRI grants: PG to follow-up with PIs regarding evolution of plans and quotes in
order to monitor spend progress.
446.3 JG to put together a plan for the next joint GridPP/NGS management meeting in January
(which would be followed by the first NGI technical meeting the next day).
The next PMB would take place on Monday 9th January 2012 at 12:55 pm.
GridPP PMB Minutes 448 (09.01.2012)
===================================
Present: Dave Britton (Chair), John Gordon, Jeremy Coles, Andrew Sansum, Dave Colling, Dave
Kelsey, Tony Doyle, Tony Cass, Glenn Patrick, Roger Jones, Pete Gronbech, Steve Lloyd, Robin
Middleton, Pete Clarke (Suzanne Scott - Minutes)
Apologies: Neil Geddes
1. DRI Status
==============
PG reported no change to the release of the grants - he had checked with Malcolm Booy and Trish
Mullins, and they had said they were trying to sort it out. The profile of the grant had to match
what we needed, and this apparently was not easy to effect on the new system. PG advised that,
where Universities agree, PIs could start ordering. This was unlikely however, as Universities
could not spend on credit. All DRI bids were now submitted, but were not yet approved.
PC advised that STFC could transfer money for the grant onto an account in advance - they could
probably do that now - which would allow PIs to begin ordering equipment. DB asked PG to
discuss possible options with STFC - could they perhaps advance half of the funding? PC noted
that time was critical now and orders might not be done and delivered within timescale - he
agreed that we should check with STFC as to whether they could do an advance. DB thought that
the issue was they were having trouble with profiling on their new system. PG would check.
AS noted that he had received the new DELL pricing for the Force 10 kit, and it was about half the
price compared with before Christmas.
ACTION
448.1 PG to contact STFC again and discuss any possibilities regarding release of part of the
funding, in order to allow procurements to commence at institutes, and also to check current
approval status.
2. CHEP Travel Guidelines
==========================
DB advised that people had been in contact to note that submissions had largely been accepted as
posters. What was the policy for funding posters? RM advised that it was currently 50%, but
people, if funded, needed to stand by their poster and engage with the public - one's name on the
poster wasn't enough. DB asked about a group of posters per one person? RM thought that we
weren't usually that prescriptive, but the cost would be in the next FY. PG noted that we needed a
list of the people who wanted to go. DB asked PG to follow this up and co-ordinate it. PG would
check and confirm current status for next week's PMB.
ACTION
448.2 PG to check with all those who had submitted a paper to CHEP, who had been awarded a
poster instead, and ascertain who actually wanted to go. The PMB would decide once they saw
the list of people and sites.
RM advised that the wLCG workshop was immediately before CHEP and that we usually funded
this at 100%, including subsistence, for that event. The support went down to 50% for CHEP.
ACTION
448.3 PG to establish who, out of the list of those wanting to go to CHEP, also intended to go to the
wLCG workshop immediately before CHEP.
3. User Co-ordinator Position
==============================
DB advised that GP had been in this position for some time now. He would be leaving RAL at the
end of May 2012, therefore there was a need to consider the future role of his post. DB considered
that it was an opportunity to think about the scope of the role and who might be the best person
to take over. DB invited the PMB to think this over and contact him directly with
thoughts/suggestions. We also needed to consider how GridPP positions itself beyond GridPP4.
The role had also evolved over the years so we needed to look ahead at this point. DB noted that
we could bring in new people, but it would need someone with a broad view and also time and
interest to take it on.
PG asked about RAL involvement - was the role at RAL mandatory/expected? DB noted no, this
was not a RAL position, it was a GridPP position and was potentially possible at any institute. DB
noted he needed inputs from everyone, ideas were required at this point, and it would be good to
discuss this at a F2F meeting.
ACTION
448.4 ALL to send thoughts/suggestions to DB regarding the replacement for GP in the User Co-
ordinator position (not necessarily based at RAL).
4. AOCB
========
DB noted that he had re-shuffled the Standing Items due to both constraints of meeting
attendance for PMB members and the logical progression of reporting prior to the Tier-2 report.
AHM Paper: DC requested a draft asap please, from RJ, GP, AS, JC, (and himself).
Alice: Regarding Alice, DC had emailed Lee - was he still working with Alice? AS advised that he
had called into the last meeting.
STANDING ITEMS
==============
SI-1 Dissemination Report
--------------------------
SL had circulated an email report from Neasan O'Neill:
1) Website revamp will be done by end of the month/start of Feb.
2) Neasan should be attending the e-ScienceTalk Face2Face, next week. DB asked if Neasan could
do a presentation report on this at the Manchester GridPP meeting?
3) Neasan had some news items to chase/check on, but there should be one up this week (as long as
the ENROLLER work got done over the holidays).
SI-2 ATLAS weekly review & plans
---------------------------------
RJ advised that on the RAL side there had been network interruptions and an SRM problem last
week; sites were slow to respond, but three would need to account for it at the next ADC meeting.
The sites were: Durham, UCL, and Birmingham (which was problematic). For the other Tier-2s, they were
switching to CVMFS - this was an issue for CMS in relation to the way that the cluster was
configured, but they were trying a workaround. RJ noted there had been a lot of jobs processed
over Christmas; scheduled downtime was happening soon and they would contact sites to advise.
SI-3 CMS weekly review & plans
-------------------------------
DC reported a network outage over Christmas; there had been an Oracle issue on 16th December,
but all else was ok. The Tier-2s were doing fairly well. Bristol however was at 30%.
SI-4 LHCb weekly review & plans
--------------------------------
GP noted that things had been steady over Christmas; one disk server at the Tier-1 had been out
for one day, then there had been scheduled interventions on Castor. There was steady MC
production at present.
SI-5 Production Manager's Report
---------------------------------
JC reported as follows:
The Christmas and New Year periods passed without a major incident. CERN reopened last
Thursday 5th January. For global operations ATLAS reported that the grid ran smoothly with
occasional Tier-2 problems not significantly impacting global production. There was a VOMS issue
at BNL that affected PanDA jobs on 3rd January. RAL and some UK T2s were affected by a change
in Stratum0/1 configurations at CERN in December that led to an issue with latest software
versions not being installed. Next Monday an ATLAS database migration to 11g means there will
be no grid activity during 16th and 17th January.
CMS did not report any major problems. Overall 60 tickets were submitted (globally) and over
half were closed promptly. Analysis levels dipped over the Christmas period but Tier-2 availability
remained good. LHCb also reported a good service over the holiday period though some issues
with Monte Carlo merging jobs were seen at RAL around 1st January (possibly a failing disk
server).
New ROD tickets created over the period were set to expire on 4th January. Some hosts at Brunel
were down from 27th December due to certificate expiry. Birmingham was affected by missing
ATLAS DB release files from 23rd Dec. Lancaster was affected by excessive data transfers by T2K
on 23rd December (Liverpool reported seeing an issue too so this needs following up) which
consumed a lot of resources. Bristol experienced CE issues and was down from 24th to 28th
December. Some sites declared themselves at risk over the period: Brunel (23rd-5th); Glasgow
(23rd-5th); RALPP (23rd-3rd) and ECDF (24th-4th).
There was a root vulnerability announced over the Christmas period (announced 24th December;
EGI advisory sent on 26th December). The service affected was not found running on any CEs (on
the expected port) and therefore did not require urgent attention, though there is an ongoing
assessment of other potential impacts.
For information:
A) There is a GDB this week: http://indico.cern.ch/conferenceDisplay.py?confId=155064.
Remote participation is to be via Vidyo.
B) Tier-2 quarterly reports have been requested.
SI-6 Tier-1 Manager's weekly report
------------------------------------
AS reported as follows:
FABRIC:
1) FY11 procurements
- Disk deliveries expected w/b 12th January (TBC) and 24th January
- CPU deliveries expected w/b 16th January and 24th January.
- T10KC media all received
- Tape drives received
2) A number of incidents on the site network leading up to Christmas (two independent
problems) but appear to be resolved and no further issues over the holiday period.
3) Site DNS upgrade went very smoothly (two servers remain to be changed - at our request). We
expect the servers more critical to us to be upgraded Tuesday 10th - we do not expect any
problems.
4) Repacking of ATLAS data to T10KC has been completed. We expect to keep LHCb and GEN on
T10KA as long as possible - possibly right through 2012 depending on demand for the A/B series
tapes.
5) Fabric team busy last week moving racks in the machine room in order to accommodate
incoming deliveries.
6) We expect a lot of work in the machine room in February/March - hardware installations,
electrical and cooling work and cold aisle installation. Some increased risk of incidents.
SERVICE:
1) Summary of operations for the week leading up to Christmas is at:
https://www.gridpp.ac.uk/wiki/Tier1_Operations_Report_2011-12-21
2) Holiday period operations were very smooth. Scheduled routine checks were carried out and
on-call team made a number of interventions but generally no problems. Fabric team (Kash)
attended on-site once (on the 2th) to resolve a number of hardware problems.
https://www.gridpp.ac.uk/wiki/Tier1_Operations_Report_2012-01-04
3) CASTOR
The CASTOR ORACLE database servers were moved to temporary hardware as part of the planned
migration to a new ORACLE configuration. The upgrade started late (after a fallen tree blocked the
Wantage road early in the morning) but still just completed within the scheduled downtime.
Generally the upgrade went very well, but there have been performance issues on the logging
volume (which were not seen under load test). However this configuration is scheduled to be in
place for just 3 weeks until phase 2 completes and moves ORACLE on to its final hardware
configuration.
STAFF:
1) Grid team leader post ongoing.
2) Recruitments underway
* Two system admins for Fabric team - both expected to start this month.
* One CASTOR admin - started (Rob Appleyard)
* One Grid Team member - expected to start in next few weeks.
* This week is Keir Hawker's (Database team lead) last week (leave of absence). Richard Sinclair will
be acting team lead. DB admin post advertised.
SI-7 User Co-ordination Issues
-------------------------------
Ulrich had spoken to GP about the Ganga Development Day at Birmingham, organised by Mark
Slater. He was looking for funding for an Italian Developer to attend and contribute to the day (re
SuperB) and had asked for around £120. DB thought this sum was fine as it was minimal, and it
would be nice to encourage SuperB. DB would respond to him.
ACTION
448.5 DB to respond to Ulrich about the Italian Developer attending the Ganga Development Day
at Birmingham - £120 funding had been authorised.
AS noted that he had been trying to get a response from D0 re their old file system but had
received nothing back. They would be dropping the file system soon if D0 didn't respond - no
authoritative response had been received, yet they had been trying to contact D0 for 9 months
now. GP advised that in the past his contact would have been Gavin Davies, and he suggested that
AS try and contact him. GP could do it, if it would help. DB asked if GP could follow this up? GP
noted yes. AS would forward the email thread.
ACTION
448.6 GP to try and contact Gavin Davies, on behalf of AS, to try and get a response regarding the
imminent drop of the D0 file system.
SI-8 LCG Management Board Report
---------------------------------
It was noted that there was an MB taking place tomorrow. JG advised that he had been talking
with Ian Bird prior to Christmas and that he would be giving up the position of Chair. They would
be seeking a new Chair to replace JG in the Spring. Countries would be asked to nominate a Chair.
DB commented that presumably there would be a bias against the UK following JG's tenure? JG
agreed probably yes. PC asked if anyone were suitable or were there constraints? It couldn't be
someone from a site. JG noted that PG/JC would be eligible to apply. JG advised that he would be
on the search committee.
REVIEW OF ACTIONS
=================
436.12 DB to produce a financial proposal for adjustments to the Tier-2 staffing profile over the
term of GRIDPP4. Ongoing.
438.2 PC to provide feedback and guidance about the data management plan following the CAP
meeting on 4th October 2011. [PC will circulate something to the PMB before submission to STFC
- RJ and PC are dealing with this.]
PC had finished the CAP document now, which provides info to STFC re data policy. Inputs from
RJ were still awaited. It was agreed that this action would be closed and a new action opened in
its place:
448.7 RJ/PC to draw-up GridPP guidelines in relation to a Data Management Policy.
438.8 TC to advise when it is a good time to move to Vidyo - early adopters were possible. No
further info available at present.
438.9 AS to contact relevant site managers to ask whether or not they would be interested in
having retired Tier-1 hardware - if a site were interested then they should submit a proposal as to
what they want and why. Ongoing.
446.1 PG to contact each PI individually re the DRI grants to ensure they understood they had to
order/commit an equipment spend on their Institutes' systems before 31st March 2012. Done,
item closed.
446.2 Re the DRI grants: PG to follow-up with PIs regarding evolution of plans and quotes in
order to monitor spend progress. Done, item closed.
446.3 JG to put together a plan for the next joint GridPP/NGS management meeting in January
(which would be followed by the first NGI technical meeting the next day). In progress, being
done.
ACTIONS AS OF 09.01.2012
========================
436.12 DB to produce a financial proposal for adjustments to the Tier-2 staffing profile over the
term of GRIDPP4.
438.8 TC to advise when it is a good time to move to Vidyo - early adopters were possible.
438.9 AS to contact relevant site managers to ask whether or not they would be interested in
having retired Tier-1 hardware - if a site were interested then they should submit a proposal as to
what they want and why.
446.3 JG to put together a plan for the next joint GridPP/NGS management meeting in January
(which would be followed by the first NGI technical meeting the next day).
448.1 PG to contact STFC again and discuss any possibilities regarding release of part of the
funding, in order to allow procurements to commence at institutes, and also to check current
approval status.
448.2 PG to check with all those who had submitted a paper to CHEP, who had been awarded a
poster instead, and check who actually wanted to go. The PMB would decide once they saw the
list of people and sites.
448.3 PG to establish who, out of the list of those wanting to go to CHEP, also intended to go to the
wLCG workshop immediately before CHEP.
448.4 ALL to send thoughts/suggestions to DB regarding the replacement for GP in the User Co-
ordinator position (not necessarily based at RAL).
448.5 DB to respond to Ulrich about the Italian Developer attending the Ganga Development Day
at Birmingham - £120 funding had been authorised.
448.6 GP to try and contact Gavin Davies, on behalf of AS, to try and get a response regarding the
imminent drop of the D0 file system.
448.7 RJ/PC to draw-up GridPP guidelines in relation to a Data Management Policy.
The next PMB would take place on Monday 16 January at 12:55 pm.