Dear All,
Please find attached the GridPP Project Management Board Meeting minutes
for the 423rd meeting.
The latest minutes can be found each week in:
http://www.gridpp.ac.uk/php/pmb/minutes.php?latest
as well as being listed with other minutes at:
http://www.gridpp.ac.uk/php/pmb/minutes.php
Cheers, Dave.
--
________________________________________________________________________
Prof. David Britton GridPP Project Leader
Rm 480, Kelvin Building Telephone: +44 141 330 5454
School of Physics and Astronomy Telefax: +44-141-330 5881
University of Glasgow EMail: [log in to unmask]
G12 8QQ, UK
________________________________________________________________________
GridPP PMB Minutes 423 (18.04.11)
================================
Present: Dave Britton (Chair), Dave Colling, Robin Middleton, Dave Kelsey, Jeremy Coles, Glenn
Patrick, Steve Lloyd, John Gordon, Pete Clarke, Roger Jones, Andrew Sansum.
Apologies: Tony Doyle, Pete Gronbech, Tony Cass, Neil Geddes
1. Security Document
=====================
DK had circulated the updated security document, the production of which had been a milestone
for the last quarter of GridPP3. It would be a new PMB document for the list. DB advised that he
had read it and could not see anything objectionable. DK noted that the thrust of the
milestone was to ensure that we knew what was happening in relation to operational policy and
security for GridPP4. DB thought it was a useful document - it documented security status and
was a completed milestone. He noted that it could be referred to if there were to be a major
security incident - would the document be adequate in such a circumstance? DK noted yes, that
links to all procedural databases in EGI were provided - all documents were there and available. It
was agreed to accept the updated version as a PMB document. DK would forward it to SS for
upload to the GridPP website.
2. Input to the Oracle meeting
===============================
It was understood that this was an opportunity for JG to give input to Frederic. JG noted he had
received feedback from the tape and database people - the experience was that this was not a
great service, yet cost a lot of money. There ensued a discussion on tape drives and the
maintenance contract. JG advised that one-off problems were worth documenting. DB noted they
had experienced problems with their logistics - the issue was value for money and on occasion we
don't get the value we should out of the relationship. JG advised that TB support would agree with
that point of view. AS noted that since the move to Sun, a lot of their systems didn't seem to work
so well - for example, there had been migration issues and serial numbers had been corrupted. DC
asked if we had an alternative. AS noted no, not with the hardware. DB advised that it was the
business process system that seemed to be at fault. AS added that we had lost contact with
individuals as well.
STANDING ITEMS
==============
SI-1 Tier-1 Manager's Report
-----------------------------
AS reported as follows:
Fabric:
1) FY10 procurements
- CPU tender - deployed
- Disk - deployed
- Tape drives - delivered (4th April)
- Tape media - delivered
2) SL08 remains out of production - load test underway - fault-free so far (four successful drive
ejects). A further 2-4 weeks of testing should yield the last two drive ejects we require for
assurance that the problem is resolved.
DB noted that the time out-of-service would be about 9 months in total? AS noted possibly 6
months. DB commented that 6 months of testing is one-eighth of the drives' lifetime - was it
worth continuing? What was the motivation for continuing to test? The drives had a finite
lifetime with a significant period under test - we should simply use them for non-critical work.
3) FY11 procurements
- EU tender for disk framework agreement just about to go out.
- CPU framework about 1 week behind.
- Frameworks need to be renewed on tape drives and media this year.
4) Various network issues
- Time-varying (but 5-10% at peak) packet loss on the production route to SJ5. Site networking
team working to find the cause - possibly a protocol/load problem - and to address an identified issue.
- Internal Tier-1 network stack problems.
* Short breaks in connectivity (< 1 min) to some services - a stack supporting some critical
services is suspected. An emergency intervention planned for Tuesday will cause a short network
break to some services. Announcement to follow.
* Second stack (stack 15) unstable and splitting into two. Has caused out of hours callouts.
Possible overheating problem addressed. Waiting to see if it is fixed.
Service:
1) Summary of operational issues is at:
http://www.gridpp.ac.uk/wiki/Tier1_Operations_Report_2011-04-13
2) CASTOR
- The upgrade to CASTOR 2.1.10-0 was successfully completed in March
- The CMS and LHCB upgrades to SRM 2.10 went well but problems were encountered following
the ATLAS upgrade on Thursday 14th April. SRM-ATLAS stopped responding at 01:00 UTC on
Friday 15th. Alarm Ticket from ATLAS at 07:20 UTC. ATLAS SRM was taken down for 15 minutes for
investigation. Transfers throttled and job load reduced to 200 jobs over weekend. Problem traced
to Oracle statistics for search path rapidly being invalidated - cause unknown. Work ongoing. Job
limit now at 2000.
3) LHCb batch work has been switched to use CVMFS for obtaining the LHCb software. CVMFS is
still not a CERN-supported production service; however, LHCb are using it at several Tier-1 sites
now. RAL hosts a production mirror of CVMFS which reduces the risk somewhat.
Staff:
1) James Thorne and Richard Hellier have now left.
2) Matt Hodges (Grid team leader) leaves on Wednesday 20th April.
3) Derek Ross has accepted another job in e-Science and will leave 12th July.
Grid team in particular will be severely under-staffed until new starters begin. Looking at
temporary work offload.
- Vacancy notice for Grid team leader expected to be out in 5-10 days
- Paperwork for other Fabric team vacancies in draft.
DB asked that the posts be expedited as soon as possible - this was a high-level concern that a
number of people were leaving right at the beginning of a long data-taking phase, and it meant
erosion of expertise. AS advised that it was probably as a result of the long uncertainty over
funding at STFC, and also the pay freeze was probably an issue.
SI-2 Production Manager's Report
---------------------------------
JC reported as follows:
1) There is a new glibc vulnerability to be addressed by sites. Most kernels had patches available
last week. There is currently no public exploit, so the EGI rating is high-risk rather than critical.
2) At last week’s ops meeting there was some concern expressed about impacts from site
spacetokens becoming full and as a consequence the site receiving less work since this was under
experiment control. “Missing release has caused reco jobs to go to T2s. A number of sites had to
increase their space in PRODDISK. Sites get blacklisted in DDM automatically if the space is
completely filled”. We need to remain aware of these issues but can we do anything more? Should
we maintain a table of events impacting site performance?
It was noted that the request was made by the experiment to increase their PRODDISK space. DB
noted that there was a higher-level picture: if the sites had more disk, the PRODDISK would be
increased, therefore space would be enough. Sites should also be monitoring what was happening
and proactively ask the experiment if they needed a PRODDISK increase. JC advised that the sites
had been blacklisted before they could do this. DB advised that SL needed to be the owner of the
issue of correcting the accounting, and he should apply judgement in conjunction with the Ops
Team. We probably do need to keep a list. This should be a 'standing item' at the Ops Team, if
there were any issues during the previous week then there should be a mechanism to request a
correction to the accounting. It was suggested that we keep a record and correct at the end of the
year. If a large issue was apparent then this could be dealt with at that time, however smaller
issues would average-out. DB asked that JC keep a list but not assume that little corrections would
be done - it needed to be monitored and we would correct at the end of the year if necessary.
Smaller issues would average out and we should not try and correct things too much.
3) There was an intention to run Security Service Challenge 5 at the end of May. The challenge
would involve a subset of sites in each NGI to help understand how sites would respond to a major
distributed incident. Unfortunately this would happen during the GridPP T2 accounting period.
The accounting period starts on 1st May and SSC5 cannot start earlier, which leaves only two
possibilities: either we do not take part, or accounting for that one-week period is not counted.
There is an assumption here that sites would take nodes offline during the response, but they may
not have to take such action.
DB noted that we should take part in the Security Challenge as scheduled. If sites close queues etc
then we will correct for that. JC asked whether this should go to sites with extra staff effort? DB
thought no, it should be entirely random - but if volunteers were required then all 8 sites with 2
people should volunteer. DC agreed - completely random was the only way that the test made
sense. AS advised that the sites should not consider this as a bad thing - sites do get huge benefit
working through procedures etc, it was a good learning tool.
4) The March WLCG availability/reliability figures were released a few weeks ago. No GridPP sites
have reported any concerns.
http://gvdev.cern.ch/GRIDVIEW/downloads/Reports/201103/wlcg/WLCG_Tier2_Mar2011.pdf
Four sites have been flagged:
QMUL – availability 64%: Site had scheduled downtime due to air-conditioning upgrade work.
During the month there were also reported problems with the storage and packet loss on the
WAN.
RHUL – reliability 85%, availability 81%: There was a mixture of scheduled and unscheduled
downtime periods to resolve network problems.
Oxford – reliability & availability 85%: Storage related?
RALPP – reliability 84% & availability 83%: The availability was down due to scheduled
networking outages. Reliability was affected by problems experienced with the site dCache
database.
5) EGI have asked for priorities in certification and release of EMI-1 components. The SL5 WMS
heads the UK list followed by SE releases and SL5 myproxy.
SI-3 ATLAS weekly review & plans
---------------------------------
RJ was not present at this point.
SI-4 CMS weekly review & plans
-------------------------------
DC reported that they were taking data. The Tier-1 was below 80% on CMS readiness. AS
confirmed he would get back to DC on this. RALPP were about 70% readiness.
SI-5 LHCb weekly review & plans
--------------------------------
GP noted there wasn't much to report. AS had already mentioned a few issues. They had a
memory footprint problem on stripping jobs.
SI-6 User Co-ordination issues
-------------------------------
GP had nothing to report.
SI-7 LCG Management Board Report
---------------------------------
The last meeting had been before the F2F at Brighton, there were no major issues to report.
SI-8 Dissemination Report
--------------------------
Neasan O'Neill had attended EGI at Vilnius. GridPP had a good stand location, joint with NGS,
under the NGI banner.
AOB
===
DB brought up the issue of forthcoming PMB meetings. The suggestion was as follows:
- Thursday 28th April @ 12.55 pm
- there would be NO meeting on May 2nd
- Monday 9th May @ 12.55 pm
(there was an STFC visit to GU on 16th May)
- Wednesday 18th May @ 11.00 am
- Tuesday 31st May @ 12.55 pm
ACTION
423.1 DB to do a doodle poll proposing PMB meetings during May.
It was noted that we might need a special meeting in order to discuss the Tier-2 algorithm, as time
was short. The OC documents had to be ready by 18th May.
3. Status of Tier-2 Algorithm
==============================
SL reported on progress - the issue had arisen because people wanted to use CPU time rather
than job counts, and then corrected CPU rather than raw CPU. There had been discussion at
Brighton, following which SL had tried to measure outputs. SL had circulated a spreadsheet for
discussion.
On the table, March looked greener than the others. The algorithm proposed was based on
HEPSPEC numbers from APEL divided by ATLAS. The issue, however, was that ATLAS and APEL
don't see the same number of seconds. You would expect the CPU totals to be the same - APEL
divided by ATLAS was OK in most places, but four sites in particular seemed wrong: Cambridge
and QMUL for known reasons, whereas the causes at Lancaster and elsewhere were not known. The
crosscheck was the 'PROD in seconds for the event' column - the amount of HEPSPEC per
production event should be constant at all sites, if correctly done. Then this is multiplied by 8.3.
In this crosscheck, green shows agreement within 10%, and this gives a consistent answer for
around half of the sites. The last column in the spreadsheet showed what was actually being
published, by range.
In summary - there was an HS06 APEL-to-seconds ratio; green sites agreed well; red sites
disagreed. This was worse in April, as the APEL numbers seemed incorrect.
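The crosscheck described above can be sketched roughly as follows. This is an illustrative
reconstruction only: the site names and numbers are hypothetical, the 8.3 scale factor and 10%
tolerance are taken from the minutes, and the real inputs would come from the APEL portal and
ATLAS accounting rather than hard-coded values.

```python
# Illustrative sketch of the Tier-2 accounting crosscheck discussed in the
# minutes. Site names and figures below are hypothetical examples.

TOLERANCE = 0.10  # "green" = agreement within 10%

def crosscheck(apel_hs06_seconds, atlas_seconds, scale=8.3):
    """Compare APEL-published HS06-seconds against ATLAS-reported CPU
    seconds scaled by the nominal factor (8.3 in the minutes).
    Returns the ratio and a green/red flag."""
    expected = atlas_seconds * scale
    ratio = apel_hs06_seconds / expected
    status = "green" if abs(ratio - 1.0) <= TOLERANCE else "red"
    return ratio, status

# Hypothetical per-site inputs: (APEL HS06-seconds, ATLAS CPU seconds)
sites = {
    "SiteA": (1.02e9, 1.2e8),  # publishes close to the ATLAS-derived total
    "SiteB": (1.60e9, 1.2e8),  # publishes far more than ATLAS sees
}

for name, (apel, atlas) in sites.items():
    ratio, status = crosscheck(apel, atlas)
    print(f"{name}: ratio={ratio:.2f} -> {status}")
```

Run over real monthly APEL and ATLAS totals, a table of such ratios would reproduce the
green/red columns of the circulated spreadsheet.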
JG noted that it could take a day or so to reach the portal. SL advised that one day didn't explain
the discrepancy. SL reported that he had been sending full-chain, single-event generation jobs to
sites at the weekend and checking how long they take and which CPUs they run on. DB
summarised the issues:
1. how did we make progress on this - was it possible to understand why these were red? There
were four red sites in March: QMUL/Lancaster/Cambridge/PPD. DK would check with RAL PPD.
For PPD in April, they had green in the last column (SL noted that they agree). DB asked how this
would converge in time, in order to publish the algorithm to sites? JC would discuss with sites at
the Ops meeting.
DB noted the second issue was - where were we with CMS and the other experiments? SL noted
he could pull numbers out from the same place for 'other' experiments, but there was no
crosscheck. DC commented that most sites publish and have a weighted average. JG suggested he
could check the number of jobs with APEL. DB suggested checking the total number of jobs at
ATLAS and APEL - it was in the database. DB suggested that we view the four 'red' sites as being
resolved. DK would look into RAL PPD. SL would look at the QMUL situation. RJ noted it should
be consistent month to month. SL advised that we could investigate the red sites and he could put
another column in the spreadsheet. JC noted there was an Ops Team at 11.00 am tomorrow and
he could get some insight from the sites direct. DB advised that we could delay the start of the
accounting period for a few weeks.
DB asked whether we believed that this method fundamentally worked? RJ said it had to work
everywhere, at all sites. DB asked whether there would be a wiki page on the web to let people
see it? DC noted that a wiki page was possible. SL would also contact LHCb and others. In
conclusion, there was 1. the Ops Team tomorrow; 2. there were 10 days until the PMB on 28th to
sort this out; 3. was a meeting required to review this beforehand? There was no time, as holidays
intervened.
REVIEW OF ACTIONS
=================
400.4 SL to co-ordinate changing the current GridPP MoU towards an MoU for GridPP4. In
progress - document had been circulated. Any corrections to be sent to SL. Ongoing.
409.1 JC to revisit document with a GridPP-NGI-NGS structure, not Dave Wallom’s. JG will
provide input. Visions for today and for the future. Ongoing.
416.5 PG to establish a process to generate a final project map in conjunction with work package
and task owners. Ongoing.
416.8 RJ/DB to establish ATLAS networking test programme to investigate Tier-2 connectivity
using Glasgow as an example. Done, item closed.
417.3 JC to follow-up with the Tier-2s, on a site-by-site basis, regarding deployment of
glexec/Argus and tarball installation packages - site readiness/difficulty to be reported-back to
the PMB. JC was working on this - the glexec deadline of May would not be met in all cases, but
this was now an open action in the Ops Team. Done, item closed.
419.1 SL to contact the Tier-2 sites, by the best route possible, and ask two questions relating to
hardware status: 1. what is the minimum available at present; and 2. what are they likely to be
able to pledge in April 2012 and April 2013? SL reported that: I contacted the sites and we got the
answers for 2011 as presented at Brighton. There is little info on 12/13. Done, item closed.
419.2 SL to respond to Amazon Web Services' invite to attend the cloud computing summit at
Oxford, and nominate PG to attend. SL reported that: I did this and nominated Pete but never got
any reply. Done, item closed.
419.3 PG to provide feedback on the Amazon Web Services (AWS) Academic Research Summit on
cloud computing at Oxford on April 12th. PG reported that: Despite me asking Amazon directly
for details I never heard anything back from them re an invite. Done, item closed.
419.4 PG to take the mandatory issue of glexec back to sites, and get clarification of status. This is
mandatory by June 2011 or sites will not get default analysis jobs. PG reported that: This has been
discussed at dteam meetings and I suggested sites update their status on the wiki at:
http://www.gridpp.ac.uk/wiki/Site_status_and_plans Jeremy has asked again at this week’s
meeting for sites to update the page, as many entries are out of date. Done, item closed.
419.5 PG to ask at dTeam whether sites had any issues/experience with storage servers from
Supermicro. PG reported that: The supermicro issues are currently affecting Oxford and UCL,
Cambridge have recently ordered a server of the same type. Others have very similar kit, so far
we think it’s just these three sites that have the actual combination of Adaptec 5805 controllers
and Western Digital Green 2TB drives. Viglen have upgraded the backplane firmware in the
servers at UCL and Oxford and are now suggesting a particular version of firmware for the
adaptecs that is being used on over 100 servers at CERN. Needless to say we are not very happy
about the situation and (at Oxford) have ordered ~100TB from an alternative vendor to
mitigate our lack of storage. Done, item closed.
420.1 DB to explain the new ops structure to Ops Team and Collaboration – including some
clarity relating to the personnel who are explicitly expected to take on national/ops team roles.
420.2 PG and JC to establish details of Ops Team work remit.
420.3 JC to advertise the 'open' nature of the Ops Team meetings and encourage site attendance.
JC to ensure that (as agreed at Lancaster) the managers of the T2’s should identify a person for
reporting on T2 deliverables and metrics.
420.4 PG to ensure that each metric/deliverable has an owner identified.
422.1 DB to email the CB with GridPP input to the e-VAL questionnaire/survey and elicit some
guidance.
422.2 DB to prepare e-VAL input for GridPP including information on Roles appended to
GridPP3 proposal.
ACTIONS AS AT 18.04.11
======================
400.4 SL to co-ordinate changing the current GridPP MoU towards an MoU for GridPP4. In
progress - document had been circulated. Any corrections to be sent to SL.
409.1 JC to revisit document with a GridPP-NGI-NGS structure, not Dave Wallom’s. JG will
provide input. Visions for today and for the future.
416.5 PG to establish a process to generate a final project map in conjunction with work package
and task owners. PG reported: I have had some input from some of the areas but will need to
tackle the Tier 1 and the experiments asap.
420.1 DB to explain the new ops structure to Ops Team and Collaboration – including some
clarity relating to the personnel who are explicitly expected to take on national/ops team roles.
420.2 PG and JC to establish details of Ops Team work remit.
420.3 JC to advertise the 'open' nature of the Ops Team meetings and encourage site attendance.
JC to ensure that (as agreed at Lancaster) the managers of the T2’s should identify a person for
reporting on T2 deliverables and metrics.
420.4 PG to ensure that each metric/deliverable has an owner identified.
422.1 DB to email the CB with GridPP input to the e-VAL questionnaire/survey and elicit some
guidance.
422.2 DB to prepare e-VAL input for GridPP including information on Roles appended to
GridPP3 proposal.
423.1 DB to do a doodle poll proposing PMB meetings during May.
The next PMB would take place on - Thursday 28th April @ 12.55 pm. DB would do a doodle poll
to establish meetings in May.