Dear All,
Please remember to register for GridPP26 using the following link:
http://www.gridpp.ac.uk/gridpp26/index.html
Please find attached the GridPP Project Management Board Meeting minutes
for the 415th meeting.
The latest minutes can be found each week in:
http://www.gridpp.ac.uk/php/pmb/minutes.php?latest
as well as being listed with other minutes at:
http://www.gridpp.ac.uk/php/pmb/minutes.php
Cheers, Dave.
--
________________________________________________________________________
Prof. David Britton GridPP Project Leader
Rm 480, Kelvin Building Telephone: +44 141 330 5454
School of Physics and Astronomy Telefax: +44-141-330 5881
University of Glasgow EMail: [log in to unmask]
G12 8QQ, UK
________________________________________________________________________
GridPP PMB Minutes 415 (14.02.11)
=================================
Present: John Gordon (Chair), Andrew Sansum, Steve Lloyd, Robin Middleton, Jeremy Coles, Pete
Gronbech, Pete Clarke, Glenn Patrick, Dave Kelsey, Tony Cass (Suzanne Scott - Minutes)
Apologies: Dave Britton, Tony Doyle, Roger Jones, Dave Colling, Neil Geddes
1. Spend Plan
==============
DB had circulated an email regarding his discussion with Tony Medland. AS reported that he had
done the Tier-1 Outturn Report and had increased the recurrent figure. Tony Medland had noted
we should increase capital spend to our agreed limit. AS had a query re the recurrent: given the
uncertainty in the information from SSC, T-M had suggested spending half of the remaining figure;
therefore AS needed to have the increase on recurrent approved. The other issue was that we
needed to be clear that the capital and recurrent outturn being produced are accurate - this was
difficult due to the new STFC SSC system at RAL, and AS needed to check all of the figures with the
Finance section.
ACTION
415.1 DK to check on the correct total allocation figure for both capital and recurrent with Tony
Medland.
415.2 AS to clarify the outturn forecast with RAL finance section and organise the spend.
415.3 PG to follow-up with sites re their Tier-2 hardware spend from GridPP3.
It was noted that the Tier-2 hardware spend in GridPP4 was still unknown.
415.4 DB to summarise the GridPP4 Tier-2 hardware spend in preparation for an email to Tony
Medland.
415.5 Re the JeS forms for the second half of GridPP4, DB to chase this up during the next month
or so.
2. Security Policy and glexec
==============================
DK reported that this issue had arisen at the MB and the GDB - pilot jobs were still being run with
no identity switching and this was in violation of Policy. The question was whether we extended
the suspension of the policy. A detailed report had been given regarding sites' ability to identity-
switch. DK reported that the conclusion had been that we weren't there yet, so the Policy
exclusion had been extended for a short while. The Tier-0 and Tier-1 should be ready to do the
identity switching around March 2011, however the Tier-2 had not really been looked at, but it
was asked that they be ready by 30th June 2011. DK asked if EGI/NGI could assist with glexec
and identity-switching, and whether JC could work with EGI operations to take this issue forward. JC
noted that we had started on this a while ago and that he could discuss it at dTeam; it might be
appropriate for sites starting this now to use Argus.
ACTION
415.6 JC to bring up the issue of glexec and identity-switching at dTeam, Tier-2 sites to be ready
by 30th June, it might be appropriate for sites starting to switch now, to use Argus.
3. Top-Level BDII plans
========================
AS reported that a couple of weeks ago he had been ticketed re volunteering to do top-level BDII
for a global service, and had been given a detailed spec. He had responded yes, but had
discovered at the GDB recently that the plan was not as well developed as he had thought. There
was no architectural plan as yet and the issue had not been well thought through. AS noted that
clarification was required, as we could not sign up for it until further information
was provided.
4. Phenogrid Issues
========================
It was reported that Peter Richardson had asked for action in relation to problems pheno users
had experienced with the grid recently. The main problem concerned proxy renewals; GridPP had
not seen the issue with other VOs as they either do not use the same services (or combinations of
services) or run shorter jobs that do not require job proxy renewal. JC noted that the problem was
introduced with a software update, but closing in on the main problem took a while because two sites
were involved (myproxy at RAL and WMS at Glasgow). JC reported that in a pheno testing phase, sites
initially responded quickly to tickets, but follow-through was then mixed as sites (correctly)
did not consider problems with the WMS to be their responsibility to fix. It should be clear that
there was no evidence of sites ignoring the concerns and the underlying problem was with
middleware provided to GridPP.
JC noted a separate problem in relation to the length of time some tickets remained outstanding,
and he had listed his conclusions and recommendations in his circulated report; he suggested that
open tickets should be reviewed in more detail after two weeks.
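
The two-week review suggested above could be sketched as a simple filter over open tickets. This is only an illustration: the Ticket record, its field names, and the sample tickets are hypothetical, and nothing here reflects the actual helpdesk interface.

```python
from dataclasses import dataclass
from datetime import date, timedelta


# Hypothetical ticket record; the field names are assumptions for illustration.
@dataclass
class Ticket:
    ticket_id: int
    vo: str
    opened: date
    is_open: bool = True


REVIEW_AFTER = timedelta(days=14)  # "two weeks", as suggested in the report


def tickets_needing_review(tickets, today):
    """Return open tickets that have been outstanding for more than two weeks."""
    return [t for t in tickets if t.is_open and today - t.opened > REVIEW_AFTER]


# Made-up sample data, not real tickets.
sample = [
    Ticket(1, "pheno", date(2011, 1, 10)),
    Ticket(2, "pheno", date(2011, 2, 7)),
    Ticket(3, "lhcb", date(2011, 1, 20), is_open=False),
]
flagged = tickets_needing_review(sample, today=date(2011, 2, 14))
print([t.ticket_id for t in flagged])
```

A filter like this only selects tickets by age; the more detailed look at ticket content would still be done by a person.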
Following his in-depth investigation, JC concluded that Pheno had had a bad
experience primarily with the WMS, due to the proxy renewals issue that was only impacting
them. They were also frustrated in their attempts to make progress by a VOMS certificate update
that was not picked up by all site services. AS pointed out that Pheno may also be quite far down
on sites' priority lists. JC reported that Glasgow, for instance, had thought that
Pheno had problems with their site alone, not that Pheno were having problems throughout the
UK, and so treated the reported incident as a user-specific problem (shorter running jobs from the
VO were running successfully). The escalation routes open to VOs (such as the weekly deployment
meetings) need to be better publicised and ongoing issues within a Tier-2 made known earlier. JC
would be having a meeting with Peter this week.
ACTION
415.7 JC to follow-up the outcomes of his recent report on Phenogrid and begin to address
changes to the way tickets are handled.
415.8 JC to review the Helpdesk and ascertain if tickets can be reviewed more accurately by
personnel, who could look at ticket detail rather than length of time the ticket had been open.
5. F2F at Lancaster
====================
JG reminded everyone that the next PMB was the F2F at Lancaster on 24th February. Everyone who
had not already done so should contact RJ regarding attendance and accommodation requirements.
Apologies for non-attendance at the Lancaster F2F were noted in advance by: GP, RM, JG, DK, TC.
ACTION
415.9 ALL: to contact RJ and advise attendance and accommodation requirements for the F2F at
Lancaster.
STANDING ITEMS
==============
SI-1 Tier-1 Manager's Report
-----------------------------
AS reported as follows:
Fabric:
1) FY10 procurements
- CPU tender - all delivered. Acceptance testing has started: V10 has completed acceptance testing;
CL10 is now in acceptance testing.
- Tape drive order placed (although follow up second order may be required). Media availability is
now early March and we expect to place an order shortly.
2) Load test on SL08 has now started in order to reproduce the problems seen or to re-certify the
hardware.
3) A tape fault was discovered that has resulted in the loss of 78 LHCb files. The fault was caused
by a faulty tape drive that overwrote part of the data on the tape. A Post Mortem for this incident
is in preparation at:
http://www.gridpp.ac.uk/wiki/RAL_Tier1_Incident_20110202_Tape_Data_Loss_LHCb
4) All except CASTOR Gen instance disk servers have now been upgraded to SL5 64bit.
Service:
1) Summary of operational issues is at:
http://www.gridpp.ac.uk/wiki/Tier1_Operations_Report_2011-02-09
2) Bad checksum problem on CASTOR now resolved by upgrading gridftp server code. Now no
requirement for upgrade of whole of CASTOR to 2.1.9-10. Certification of 2.1.10-0 has
commenced. We are aiming to be able to deploy this into production late in March.
3) We are rolling out updates to disk server tcp tuning parameters, increasing default and max
window sizes.
4) Updates to Oracle (PSU) were rolled out.
SI-2 Production Manager's Report
---------------------------------
JC reported as follows:
1) The VO share information published by sites is progressing. The current status on Friday is
shown here:
http://indico.cern.ch/getFile.py/access?contribId=4&resId=0&materialId=0&confId=113884.
2) The issues affecting the pheno VO have been reviewed (see report) and improvements that we
can make in our support processes identified. The underlying problems are still related to
problematic middleware (particularly the WMS).
3) A change to EGI repositories for CA trust anchors has prompted some discussion about
WLCG/GridPP policy in this area. This will be discussed at the deployment team & sites meeting
tomorrow.
There ensued a discussion of communication from WLCG and notices from GridPP to sites to
confirm changes that should be made, as sites were often not aware that action on their part was
required.
4) The LHCOPN Tier-2 network working group have produced a new version of the LHC Open
Network Environment (LHCONE) Architecture.v2.1 document. The goal of LHCONE is to provide a
collection of access locations that are effectively entry points into a network that is private to the
LHC T1/2/3 sites. LHCONE is not intended to supplant LHCOPN but rather to complement it. At
this stage the document is helping to shape discussions on future networking - the GridPP position
needs to be discussed.
5) The NEISS VO (http://www.geog.leeds.ac.uk/projects/neiss/about.php) would like to make
use of the RAL based LFC. The technical requirements will be discussed tomorrow, but are there
any in-principle objections to supporting this VO? You may also recall that we recently set up the
NA62 VO and it was almost implicit that they would require LFC enablement too. Can this VO be
added to the “GridPP approved VOs” list if there are no technical objections tomorrow? [Aside: the
approved list is used by the Tier-1 team to decide if a support request can be actioned].
The PMB approved this addition.
6) A management request following last week's MB was for sites to install ARGUS/glexec to a
timeline of 31 March 2011 for T0/T1 and (the end of) June for T2s. ATLAS and ALICE still
encounter problems using glexec.
7) At last week’s GDB (http://indico.cern.ch/conferenceDisplay.py?confId=106641) the
transition from gLite to EMI-1 was discussed. There still seems to be some uncertainty about where the
integration testing takes place and who “loads” the middleware repository used by WLCG sites.
EMI-1 is due at the end of April.
8) The January WLCG Tier-2 availability/reliability report is now available:
http://gvdev.cern.ch/GRIDVIEW/downloads/Reports/201101/wlcg/WLCG_Tier2_Jan2011.pdf.
Sites where we wanted to check on problems encountered were:
QMUL (98%:83%) – availability down due to work on electrical supply to the machine room
(prolonged by one day due to contractor availability).
UCL (100%:79%) – problems on one CE that developed during the Christmas vacation were not
fixed until returning to work in January.
MAN-HEP (84%:84%) – The site had downtime to upgrade the site DPM. Reliability was down due
to site-BDII problems (fixed with scripted restarts).
BHAM-HEP (85%:82%) – suffered due to the site-BDII crashing. The component needed to be
upgraded and an automatic restart implemented.
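
The site figures above can also be checked mechanically. The sketch below parses the percentage pairs exactly as quoted and lists sites where either figure falls below a cut-off. The 90% cut-off and the helper names are assumptions for illustration, not a WLCG or GridPP rule, and the sketch makes no claim about which figure is availability and which is reliability.

```python
import re

# Figures copied from the list above, in the order quoted.
REPORT = {
    "QMUL": "98%:83%",
    "UCL": "100%:79%",
    "MAN-HEP": "84%:84%",
    "BHAM-HEP": "85%:82%",
}

THRESHOLD = 90  # hypothetical cut-off, in percent


def parse_pair(text):
    """Turn a string such as '98%:83%' into a pair of integers (98, 83)."""
    match = re.fullmatch(r"(\d+)%:(\d+)%", text)
    if match is None:
        raise ValueError(f"unrecognised figure: {text!r}")
    return int(match.group(1)), int(match.group(2))


def below_threshold(report, threshold=THRESHOLD):
    """Return the sites where either quoted figure is below the threshold."""
    flagged = {}
    for site, text in report.items():
        first, second = parse_pair(text)
        if min(first, second) < threshold:
            flagged[site] = (first, second)
    return flagged


for site, pair in below_threshold(REPORT).items():
    print(site, pair)
```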
SI-3 ATLAS weekly review & plans
---------------------------------
RJ was absent.
SI-4 CMS weekly review & plans
-------------------------------
DC was absent.
SI-5 LHCb weekly review & plans
--------------------------------
GP reported on problems with pilot jobs at Glasgow due to a CE problem - this was now resolved.
The ORACLE upgrade had gone ok. There had been an upgrade to the FTS at the Tier-1, but there
were no new instances of the corrupted file issue, which was good. A new resource profile had
been requested from each experiment, post-Chamonix, and LHCb were dealing with this.
SI-6 User Co-ordination issues
-------------------------------
GP noted nothing else to report; Phenogrid had already been discussed.
SI-7 LCG Management Board report
---------------------------------
JG noted that some of the issues had already been discussed; installed capacity was an ongoing
issue.
SI-8 Dissemination Report
--------------------------
SL reported that Neasan O'Neill had provided a report as follows:
Events:
* EGI User Forum, there will be a UK NGI stand at the event, it is ~€450 shared between GridPP
and NGS.
* Royal Society Summer Exhibition, working with Karl Harrison and Cristina Lazzeroni on GridPP
involvement in their stand "Discovering particles: from Rutherford scattering to the Large Hadron
Collider"
* IOP Nuclear and Particle Physics Divisional Conference, after some haggling we will have a stand
at this too, it will be £454 (including my registration), waiting on invoice/cost of screen rental
* Masterclass half day meeting, on Wednesday discussing the masterclasses and I'll be trying to
get more grid into them
* Big Bang Fair, in London next month with an IoP physics stand, grid demo on that each day
* National Science and Engineering Week event, at QMUL also March, grid demo at that as well
Materials:
* Brochure - Invoice has been sent to Robin, Next version expected soon
* Magic Cubes - Do we want to do new cubes and refresh the design? Origination is £300, £2.55 a
cube for 1,000 and shipping was £200 in 2007. So 1,000 cubes would be £3050, 500 would be
£2360 (£3.72 a cube).
RM reported that he had discussed this internally at RAL and any marketing costs overall for
GridPP were fine provided they were under £25,000. This particular expenditure was approved
for the Royal Society meeting but we may have to log the dissemination budget for GridPP4.
Funding was agreed for the magic cubes at ~£3k. JG noted it would be useful to have new ones.
SL suggested that if anything obviously needed to be changed (e.g. EGEE being mentioned) then it
should be changed. We should go ahead and buy 1000 of them. This was agreed.
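
The magic cube totals quoted above follow from origination plus shipping plus the per-cube price. A minimal sketch of that arithmetic, assuming the 2007 shipping figure still applies and the per-cube prices are as quoted (the function and constant names are illustrative only):

```python
ORIGINATION_GBP = 300   # one-off origination cost, as quoted
SHIPPING_GBP = 200      # shipping cost from 2007; assumed unchanged

# Per-cube price in pence for each quoted order size (avoids float rounding).
UNIT_PRICE_PENCE = {1000: 255, 500: 372}


def total_cost_gbp(quantity):
    """Total cost in pounds for one of the quoted order sizes."""
    fixed_pence = (ORIGINATION_GBP + SHIPPING_GBP) * 100
    return (fixed_pence + UNIT_PRICE_PENCE[quantity] * quantity) / 100


for quantity in (1000, 500):
    print(quantity, total_cost_gbp(quantity))
```

Both totals match the figures in the report (£3050 for 1,000 cubes and £2360 for 500), so the agreed ~£3k covers the larger order.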
News Items:
* Sussex: could have something soon?
* Have 4 items on Licensing in draft, would like comments if anyone is interested
* Suggestions?
Website:
* RTM site has been cleaned up to reflect EGI/e-ScienceTalk involvement; also moved the design
over to the GridLoad graphs pages (http://gridportal-ws01.hep.ph.ic.ac.uk/gridload/)
* Still working on the website review; will have that by next PMB meeting.
AOB
===
- PG reported that the Quarterly Reports had now been submitted and were uploaded for review.
- SL reported that Frank Krauss would be taking over from Prof
Nigel Glover as the Durham representative on the GridPP collaboration board.
The next PMB meeting would be a F2F meeting and would take place at Lancaster on Thursday
24th February. Advance apologies had been recorded for GP, RM, JG, DK, TC. There would be NO meeting
next Monday 21st February.
REVIEW OF ACTIONS
=================
398.7 Re the GridPP Security Policies - DK advised that EGI formal signoff had now been given, he
would update the GridPP website pages. Ongoing.
400.4 SL to co-ordinate changing the current GridPP MoU towards an MoU for GridPP4. Ongoing.
409.1 JC to revisit document with a GridPP-NGI-NGS structure, not Dave Wallom’s. JG will
provide input. Visions for today and for the future. Ongoing.
409.2 GP to produce new role description for the Chair of the UB. Ongoing.
411.1 DB to organise an Agenda around the theme of 'Efficiency' for GridPP26 at Sussex. Done,
item closed.
411.3 SL to co-ordinate with RJ, DC, and GP, regarding monitoring site performance and
distribution of GridPP4 funds, and provide a draft document to which the PMB could respond.
This should be finalised at the F2F meeting in March, in relation to how much money was to be
allocated. We would need a starting point by the F2F in February. SL was awaiting input from RJ
and DC - they need to respond ASAP. SL reported that a meeting would be taking place next week.
Done, item closed.
412.3 JG to check with AS and RJ re the issue of the Tier-1 continuing to provide LFC services (the
issue here was extra effort, a proposal was required). Done, item closed.
413.1 RM to check the travel budget in relation to contributing to the costs of being involved with
the Royal Society Summer Science Exhibition, in conjunction with Birmingham/Cambridge. Done,
item closed.
413.2 DB to contact Karl Harrison and confirm GridPP's involvement in the Royal Society
Exhibition, noting a contribution in terms of a possible demo, manpower, and promotional
materials. Done, item closed.
413.3 JG to find out at the EGI meeting today if there was a GOCDB4 failover still in existence (the
last one ended with EGEEIII). JG reported that this was on the Work Plan, but one was not in
existence. This was being worked on at present. Done, item closed.
413.4 Regarding GSTAT2 publishing and sites filling-in the numbers as per SL's spreadsheet table
showing the fraction (ie: publish the theoretical model in GSTAT) - PG to send the relevant
spreadsheet to JC so that dTeam could progress this. Done, item closed.
ACTIONS AS AT 14.02.11
======================
398.7 Re the GridPP Security Policies - DK advised that EGI formal signoff had now been given, he
would update the GridPP website pages.
400.4 SL to co-ordinate changing the current GridPP MoU towards an MoU for GridPP4.
409.1 JC to revisit document with a GridPP-NGI-NGS structure, not Dave Wallom’s. JG will
provide input. Visions for today and for the future.
409.2 GP to produce new role description for the Chair of the UB.
415.1 DK to check on the correct total allocation figure for both capital and recurrent with Tony
Medland.
415.2 AS to clarify the outturn forecast with RAL finance section and organise the spend.
415.3 PG to follow-up with sites re their Tier-2 hardware spend from GridPP3.
It was noted that the Tier-2 hardware spend in GridPP4 was still unknown.
415.4 DB to summarise the GridPP4 Tier-2 hardware spend in preparation for an email to Tony
Medland.
415.5 Re the JeS forms for the second half of GridPP4, DB to chase this up during the next month
or so.
415.6 JC to bring up the issue of glexec and identity-switching at dTeam, Tier-2 sites to be ready
by 30th June, it might be appropriate for sites starting to switch now, to use Argus.
415.7 JC to follow-up the outcomes of his recent report on Phenogrid and begin to address
changes to the way tickets are handled.
415.8 JC to review the Helpdesk and ascertain if tickets can be reviewed more accurately by
personnel, who could look at ticket detail rather than length of time the ticket had been open.
415.9 ALL: to contact RJ and advise attendance and accommodation requirements for the F2F at
Lancaster.