JISCMail - UKHEPGRID Archives

Email discussion lists for the UK Education and Research communities

Subscriber's Corner

Email Lists

UKHEPGRID Archives

UKHEPGRID@JISCMAIL.AC.UK

View:

Message:

[

First

Last

]

By Topic:

[

First

Last

]

By Author:

[

First

Last

]

Font:

Proportional Font

		LISTSERV Archives
		UKHEPGRID Home
		UKHEPGRID March 2010

Options

Subscribe or Unsubscribe

Get Password

Subject:

Minutes of the 380th GridPP PMB meeting

From:

David Britton <[log in to unmask]>

Reply-To:

David Britton <[log in to unmask]>

Date:

Wed, 10 Mar 2010 16:36:42 +0000

Content-Type:

multipart/mixed

Parts/Attachments:

text/plain (47 lines) , 100308.txt (848 lines)

Dear All,

Please find attached the GridPP Project Management Board
Meeting minutes for the 380th meetings. The latest minutes can
be found each week in:

http://www.gridpp.ac.uk/php/pmb/minutes.php?latest

as well as being listed with other minutes at:

http://www.gridpp.ac.uk/php/pmb/minutes.php

Cheers, Dave.

--
________________________________________________________________________
Prof. David Britton GridPP Project Leader
Rm 480, Kelvin Building Telephone: +44 141 330 5454
Dept of Physics and Astronomy Telefax: +44-141-330 5881
University of Glasgow EMail: [log in to unmask]
G12 8QQ, UK
________________________________________________________________________

GridPP PMB Minutes 380 (08.03.10)
=================================

Present: David Britton (Chair), Steve Lloyd, Sarah Pearce, Andrew Sansum, Tony Doyle, Dave
Colling, Robin Middleton, Pete Clarke, Roger Jones, Pete Gronbech (Suzanne Scott, Minutes)

Apologies: David Kelsey, Tony Cass, John Gordon, Jeremy Coles, Glenn Patrick, Neil Geddes

1. DB Agenda for RHUL
======================
SL asked when we were going to switch to the new arrangements outlined in the GridPP4
proposal? Presumably not this time? DB noted yes, we would switch in advance of GridPP4 but
not before the PPRP. It would be better to discuss this issue at Ambleside. It was noted that the
PMB membership would change, the DB would cease, and the Ops Team would start.

SL asked about the Tier-2 hardware situation? DB suggested that a working group be set up, with
the criteria set by the experiments. On the timescale of RHUL it might not be possible however.
PG asked about the GridPP4 funding and how this would be split between sites? SL noted that
monte carlo and analysis would be treated separately. PG asked about measuring of site
performance, or would it be by amount of resource? And what about the small sites? SL advised
that these were all issues that needed to be discussed. DB advised that ATLAS and CMS should
drive this at high level. RJ noted that a uniform metric was probably not possible. DB advised
that the starting point would be receipt of statements from ATLAS and CMS at high level, then the
PMB could discuss implementation.

SL advised that there was also regional monitoring to consider - SAM and Nagios etc. The
statistics we have now may not exist later. PG reported that we were part way down the
conversion process - Nagios services run by CERN would move to regional Nagios services, likely
to be the Oxford dashboard for the UK. This would be a flexible system that can be modified.
CERN were working to the end of EGEE. DB asked if we wished to get ATLAS and CMS to make
preliminary presentations to the Deployment Board?

The Tier-2 spreadsheet was discussed, PG asked if it was generally available? It had been
distributed to the CB. SL advised that the Tier-2 Co-ordinators should inform their SysAdmins,
but it should not be generally available yet. DB advised that the word document which was sent to
the CB had all of the numbers in it. PG queried the MoU generated by the spreadhsheet, which
makes RAL PPD a 'small' ATLAS site. DB noted the other issue at the Deployment Board could be
regional nagios. SL agreed. SL would circulate an agenda.

ACTION
380.1 SL to circulate an Agenda for the Deployment Board meeting at RHUL.

2. Tier-2 Investments
======================
SL had circulated an email. DB advised that we needed a high-level picture of investments in
infrastructure to defend our case if this were to be raised at the PPRP. We needed to show
leverage of investment. SL had only received one or two responses thus far.

ACTION
380.2 ALL: to send SL information on infrastructure investments at their respective institutes.

380.3 AS to send SL assumptions re electricity (in relation to investments in infrastructure).

380.4 SP to send SL historical numbers on unfunded effort (in relation to investments in
infrastructure).

3. EGI/NGI Paper
=================
DB noted that several questions needed to be answered:
- how does GridPP relate to an NGI structure?
- what happens if EGI does not go ahead?
- what happens if NGS is not funded?
- how much is GridPP doing in NGI which is not directly related to GridPP?

DB reminded that he had prepared a draft document last year. RM had circulated an update to
this which provided a framework for argument. DB went through this document:
p3 top para - RM needs to qualify this, a statement is missing, eg: 'this reflects a reduced particle
physics influence going from EGEE to EGI' (cf the statement in DB's covering letter to the GridPP4
proposal)

DB noted this should be an internal document, however there will be issues raised prior to the site
visit. We should do a public version of the document once it is in good shape. It was understood
that GridPP has to be represented on an NGI MB in proportion to the size of resources and user
base. In relation to global tasks, security and training were clear (the former is very closely
coupled to GridPP; the latter does not involve Grid). We are proposing to continue with
configuration and accounting. DB emphasised that in relation to APEL there had been a negative
reputational effect due to the recent problems. We needed management buy-in. DB noted that
the large blue tables in RM's document were incomplete as follows:

- RM should fill-in the GridPP effort at top-level in relation to global tasks
- RM needed to fill-in the NGS column
- a 'totals' column was required
- the EU contribution needed to be clarified

TD advised that there was a difference between the hardware resources and users re the relative
size of GridPP within the NGI. In the 'risks' section of the document DB noted that if NGS is not
funded, an NGI would be de-scoped - we would drop training, but we could still do the global
tasks.

For Risk 1 the following was required:

- the paragraph needed to be quantified
- RM should add two columns to the table: the status quo was an NGS and an EGI, if there was no
NGS what do we do? If there is no EGI what do we do? These manpower changes should be
shown in these columns using an X or similar.

PC advised that the first sentence should be: 'This is the extra effort we need if these are
unfunded'. PG asked if Nagios at Oxford would be part of NGI? TD noted that as it would be
devolved from CERN then it would rest with us.

DB noted that the document was a good start. RM/SP should make the changes as discussed and
try and quantify some of the issues. Over the next week DB would iterate with them in order to
push the document forward. The text would follow the numbers - they should concentrate on the
numbers first, ie: task vs effort, and where the effort comes from. As noted, two strategy columns
should be added. RM advised that he would have an internal meeting with JG. By the end of the
week, SP/DB would try and talk.

ACTION
380.5 RM/SP to make changes to the EGI/NGI paper as discussed and bring back a revised
version to next week's PMB.

380.6 ALL: to feedback comments on the EGI/NGI paper to DB, RM or SP before next week's PMB.

4. Week's Notes
================
- DB advised that the PPRP Agenda had been changed, but the GridPP timing for the meeting
remained unaltered.
- Re the OPN backup link, AS advised that he had received an Invoice. They were scheduled to get
the line at the end of March, which would then be tested during April. DB noted that we needed to
confirm the delivery date and the usage/testing plans. The invoice should not be paid if there was
a possibility that the link would not be installed for several months.

ACTION
380.7 Re the OPN backup link: AS to find out: 1. When the link is supposed to be operational; 2.
More detail about how and when the link will be tested. If possible AS should delay Invoice
payment until more information was forthcoming.

There ensued a discussion on use and capacity of the link plus strategy required in relation to
usage - was a cap possible? The traffic could be split two ways if the link were to be used for
production.

STANDING ITEMS
==============
SI-1 Tier-1 Manager's Report
-----------------------------
AS reported as follows:
Fabric:
1) FY09 procurements:
- All disk and CPU has been delivered.
- We expect to be able to start acceptance tests on one lot of disk and CPU this week, the second lot
is still being installed.

2) FY10 procurements
- We have started the process of updating the procurement documentation for FY10
procurements. We are considering alternative options to a restricted EU tender.

DB noted there were pros and cons required for this issue. A HAG would be preferable - AS noted
that a teleconference would be required. There was no update re the UPS. DB gave direction to AS
that from GridPP's viewpoint the equipment was not fit-for-purpose and should at this point be
returned to the vendor, instead of allowing alternatives that only added other points of failure. AS
advised that he did not control the process which was being handled by Estates & Buildings. DB
noted he could speak with someone if required. AS would check and get back to him.

3) We have concluded that one lot of the 2006 procurement (about 250TB) is too unreliable (high
drive eject rate) and we are discussing phasout options with the UB. This lot was the source of all
multi-drive filesystem losses during 2009 and has generated the majority of drive ejects in the last
12 months. We do not expect the phasout to impact our WLCG commitments.
Service:
1) SAM test availability for the ops VO was 100%.
2) We are working on an upgrade strategy for CASTOR from 2.1.7 to 2.1.8 or 2.1.9 we expect to
discuss with the UK VO representatives in 1-2 weeks then discuss at the PMB.

3) We have been reviewing our position wrt the CASTOR database hardware wrt the problems
encountered during the migration back to the EMC RAID arrays. The current configuration is not
fully resiliant, currently a storage array break may lead to an outage of the CASTOR database SAN.
Our conclusions are that we will need to move the database service back off the EMC units to
allow a reconfiguration of the SAN to a well tested and working configuration. We will have to do
this by deploying new hardware temporarily to stage the service onto. We are still review exact
required configuration and hardware options. We also have to find a good time window for a 1-2
day intervention to release the existing hardware (not during the early stages of data taking
probably) and then a further timeslot to move back onto it.

4) On Friday we made an emergency change on the CMS CASTOR instance in order to address a
hot file issue (created a new service class overlaying the existing disk pool).

SI-2 ATLAS weekly review & plans
---------------------------------
RJ reported that things had been quiet last week; there were production jobs due this week.
There had been a problem over the weekend re the pilot factory at Glasgow due to the end of a
proxy, but this had been fixed today. There was also a bug in the distribution of hardware tasks
which meant they were blacklisted, wrongly, by the ganga robot - this was causing problems on
the ATLAS side.

SI-3 CMS weekly review & plans
-------------------------------
DC reported that they were preparing for 7TeV monte carlo - nothing unusual was happening at
present. There ensued a discussion about a change made on Friday afternoon by AS at the Tier-1.
DB commented that the Tier-1 should be responsive and they had made the right decision.

SI-4 LHCb weekly review & plans
--------------------------------
In absentia GP reported as follows:
1) Low level Monte Carlo productions. Most went without problem. Bulk of LHCb work on the Grid
is currently user analysis.

2) Problem uploading data out of the site at 3 UK Tier-2s : Sheffield, Glasgow and Brunel. GGUS
tickets opened against them and issue raised in dTeam mailing list. This particular problem is
limited to just these 3 sites on the (LHCb) Grid. Working with sites to understand.

3)dCache Tier-1s were brought back in to the mask last Tuesday after a new stack of LHCb
software was released with fixed versions of root. Analysis jobs now fine at most sites.
4)CASTORLHCB successfully upgraded to version 2.1.9.4 at CERN this morning.

SI-5 Production Manager's Report
---------------------------------
PG presented JC's report as follows:
1) CREAM & SCAS/glexec status (may be updated):

Oxford - two CREAM installs. Both in production. One uses SCAS. glexec
on small set of WNs.
Manchester - one CREAM CE in production. SCAS/glexec deployed but not
in production.
RAL T1 - CREAM CE in production. SCAS/glexec installed on test cluster
Glasgow - 1 CREAM instance in production. SCAS and glexec in
production. CREAM is using it only worker nodes at present. It has
been tested with CREAM and with the lcg-CE. No explicit testing by
any major VO yet, but found problem with proxy lease and renewals with
ATLAS condor submissions. Still to implement pilot ops role for ops
glexec testing.
Imperial - work in progress on CE and SCAS.

Sites with more than one CE have been asked to move one to CREAM.
Several sites administrators were concerned about doing this while
there remains a critical bug affecting ATLAS submissions.

There ensued a discussion about the problems at Imperial and RHUL in relation to the CREAM CE.
It was noted that sites are still finishing SL5 upgrade. DB asked about UK site testing of SCAS and
glexec? DC and RJ noted no this was not happening, not as far as they knew.

ACTION
380.8 RJ/DC to advise us of what the experiment plans are in the UK in relation to SCAS and
glexec.

DB asked whether the sites were using these at all? DC didn't know. PG would check his logs. RJ
didn't know - they were not doing specific testing as far as he knew. There was certainly no
pressure to do so from ATLAS. DB confirmed that comment from ATLAS and CMS was required.
Some sites have it installed and some don't, therefore direction was needed. PG noted that ATLAS
didn't use CREAM CE anyway at the moment - lcg CEs were still required.

2) A post-mortem/incident report for the outage of the gridpp.ac.uk
DNS is now available in the wiki: https://www.gridpp.ac.uk/wiki/Manchester_Incident_20100227
. The specific cause of the problem was a kernel panic on the DNS
host. The impact was larger than it should have been due to the DNS
and several other services being in the process of host migration at
Manchester. To mitigate future occurrences DNS backups are being
sought in the Manchester computer centre and at RAL.

3) The transition to Nagios took place last week. Once used in
production many new bugs were quickly identified. There remain issues
such as: the CREAM CE is missing from the myEGEE interface; multiple
top-level BDIIs are not supported; some data show differently between
the dashboard and Nagios portal.

SI-6 LCG Mangagement Board Report
----------------------------------
DB reported on issues as follows:
1. on Tuesday there was a clear statement from CERN on DPM and CASTOR. Both are and will
continue to be supported at CERN at the same level. The CASTOR situation was particularly good
at present. The statement was carefully made. There was a normal rotation of 3-year posts
happening - all in a steady state.
2. JG had provided an update on the GDB - there was a suspension of clauses in the security policy.
What was GridPP's position? It would be better to discuss this when DK and JG were present.
3. The APEL issue - there was a perception that this was done in the UK at RAL and was
synonymous with GridPP. DB noted we have to see this differently in future in relation to NGI, as
it affected GridPP's reputation. We have to take lessons from this going forward and need to do
better re communication - this had been a retrograde step.
4. Were the experiments working on resource estimates for the upcoming period? RJ noted they
will certainly be different.

ACTION
380.9 RJ/DC to send info to DB regarding resource estimates for the upcoming period, as this info
will be needed after the PPRP.

SI-7 Dissemination Report
--------------------------
SP reported that planning was ongoing for an upcoming meeting, where the Chief of STFC would
be giving a speech. There was nothing further on the LHC at present. It was noted that an email
had been circulated re STFC Innovations Partnership Scheme (IPS) Panel Nominations. SP asked
whether we wanted to nominate someone? Two academics were required. No-one was available.

AOB
===
SP reminded that the Quarterly Reports were due. RJ noted he was working on his, the info
systems had been changed. DB noted that there would be issues from the Quarter which should
be raised at the PMB.

REVIEW OF ACTIONS
=================
354.2 JC to consult with site admins on a framework policy for releases, with a mechanism for
escalation, plus a mechanism for monitoring. JC reported that the consultation happened. There
were a few suggestions in the deployment team about how to progress in this area. It needs
writing up and an implementation plan. JC to progress. Pending.
366.8 AS to confirm that the Tier-1 proposes to use Tape-based storage in the period 2011 - 2015.
DB advised this related to long-term plans and power capacity. Physical footprint space?
Alternatives? AS had sent tech questions round the team and would forward inputs when
available. AS noted that alternative further costings were required. AS to progress. Ongoing.

367.2 RM to fill-in the grey boxes on DB's UK NGI diagram of a minimal NGI, as to what NGS
would be doing in the areas listed. RM reported that there wasn't enough information available at
present to carry out this action, but he had met with Andy Richards. RM/SP to circulate a
document. Done, item closed.

375.9 RM to provide a skeleton outline plan, including post details, of GridPP/NGS convergence.
RM reported that a draft plan would be available soon. RM/SP to circulate a document. Done,
item closed.

379.1 Re GridPP4 proposal and forthcoming PPRP meeting: SP to begin work on 'background'
financial planning. Ongoing.

379.2 Re GridPP4 proposal and forthcoming PPRP meeting: AS to look at the CERN hardware
paper and work on the CPU and disk costings. Ongoing.

379.3 Re GridPP4 proposal and forthcoming PPRP meeting: SP to add more detailed information
to the WBS. Ongoing.

379.4 Re GridPP4 proposal and forthcoming PPRP meeting: RM to progress the EGI/NGI/NGS
model for next week's PMB (in relation to Actions 367.2 & 375.9). Done, item closed.

379.5 RM/SP to assimilate the information in DB's paper on NGI within the GridPP4 Proposal, and
circulate a new updated paper before next week's PMB. This would be a transition document
addressing the possibility that:
1. There would be no NGI;
2. There would be no future funding for NGS. Ongoing.

379.6 SL to ensure that the OC documents are made publicly available [done following the
meeting].

379.7 JC to follow-up the issue of merging VO lists and ILDG VO. Ongoing.

ACTIONS AS AT 08.03.10
======================
354.2 JC to consult with site admins on a framework policy for releases, with a mechanism for
escalation, plus a mechanism for monitoring. JC reported that the consultation happened. There
were a few suggestions in the deployment team about how to progress in this area. It needs
writing up and an implementation plan. JC to progress.
366.8 AS to confirm that the Tier-1 proposes to use Tape-based storage in the period 2011 - 2015.
DB advised this related to long-term plans and power capacity. Physical footprint space?
Alternatives? AS had sent tech questions round the team and would forward inputs when
available. AS noted that alternative further costings were required. AS to progress.

379.1 Re GridPP4 proposal and forthcoming PPRP meeting: SP to begin work on 'background'
financial planning.

379.2 Re GridPP4 proposal and forthcoming PPRP meeting: AS to look at the CERN hardware
paper and work on the CPU and disk costings.

379.3 Re GridPP4 proposal and forthcoming PPRP meeting: SP to add more detailed information
to the WBS.

379.7 JC to follow-up the issue of merging VO lists and ILDG VO.

380.1 SL to circulate an Agenda for the Deployment Board meeting at RHUL.

380.2 ALL: to send SL information on infrastructure investments at their respective institutes.

380.3 AS to send SL assumptions re electricity (in relation to investments in infrastructure).

380.4 SP to send SL historical numbers on unfunded effort (in relation to investments in
infrastructure).

380.5 RM/SP to make changes to the EGI/NGI paper as discussed and bring back a revised
version to next week's PMB.

380.6 ALL: to feedback comments on the EGI/NGI paper to DB, RM or SP before next week's PMB.

380.7 Re the OPN backup link: AS to find out: 1. When the link is supposed to be operational; 2.
More detail about how and when the link will be tested. If possible AS should delay Invoice
payment until more information was forthcoming.

380.8 RJ/DC to advise us of what the experiment plans are in the UK in relation to SCAS and
glexec.

380.9 RJ/DC to send info to DB regarding resource estimates for the upcoming period, as this info
will be needed after the PPRP.

INACTIVE CATEGORY
=================
359.4 JC to follow up dTeam actions from the DB, as follows:
---------------------------
05.02 dTeam to try and sort out CPU shares and priority resources, at
Glasgow first (perhaps by raising the job priority in Panda).
---------------------------
JC would check the situation with Graeme Stewart (who was currently on annual leave).

JC followed up with Graeme and the other experiments. A test was
started but this area has been deemed low priority and further
progress is not expected for some time. ATLAS see no issues with
contention. LHCb are not intending to pursue anything in this area. A
CMS discussion has started but again it does not appear to be urgent.
If the experiments are not pushing this internally then there is
nothing for the deployment team to follow up!

It was noted there was no priority in ATLAS at present, this will be pending for a while. Move to
inactive as it is a long-term action.
---------------------

The meeting closed at 3:00 pm. The next PMB would take place on Monday 15th March at 12:55
pm.