Dear All,
Please find attached the latest weekly GridPP Project Management
Board Meeting minutes. The latest minutes can be found each week in:
http://www.gridpp.ac.uk/php/pmb/minutes.php?latest
as well as being listed with other minutes at:
http://www.gridpp.ac.uk/php/pmb/minutes.php
Cheers, Tony
________________________________________________________________________
Prof. A T Doyle, FInstP FRSE GridPP Project Leader
Rm 478, Kelvin Building Telephone: +44-141-330 5899
Dept of Physics and Astronomy Telefax: +44-141-330 5881
University of Glasgow
G12 8QQ, UK Web: http://ppewww.physics.gla.ac.uk/~doyle/
________________________________________________________________________
GridPP PMB Minutes 287 - 14th January 2008
==========================================
Present: Tony Doyle, Sarah Pearce, Roger Jones, David Britton, Steve Lloyd,
Robin Middleton, John Gordon, Jeremy Coles, Peter Clarke, Glenn Patrick,
Andrew Sansum, Dave Colling, Suzanne Scott (Minutes)
Apologies: Stephen Burke, David Kelsey, Tony Cass, Neil Geddes
1. ALICE priority
==================
AS reported that at present Alice have zero disk allocation and have not
yet had their CASTOR disk space set up. In order to take part in
February's CCRC they (Alice central rather than UK) have requested 1.1 TB
but it was likely that the requirement would be at least 1 disk server,
which implies 4-6TB depending on exactly what can be made free. When we
get the disk space we have to install the xrootd interfaces. It is
probably not much work to install xrootd but if it gives any problems it
will be in competition with higher priority work for ATLAS/CMS and LHCb in
prep for CCRC08. Setting up the Alice CASTOR endpoints (on our shared
server) is less than half-a-day's effort. It was noted that if this work
does not start early next week there will be no chance of getting Alice
ready for CCRC08. Even if the effort is invested next week, the chances of
success are not great given the untried interfaces (at RAL), lack of
priority, and time to resolve problems. On Tuesday the MB will need to
know where we stand WRT the endpoint setup for all 4 experiments.
How does the PMB wish to proceed? The disk space issue was discussed
before but now our position of zero allocation will become very clear to
the WLCG and is inconsistent with our MoU commitments. In the event that
they want us to proceed, how should we prioritise Alice WRT the other LHC
experiments and even Minos and Babar over the next 6 weeks or so?
GP noted that the problem was lack of input from Alice, and the fact that
their disk allocation had been used elsewhere due to lack of uptake and
lack of engagement. GP had been given a technical contact at CERN but the
Alice request (which GP had estimated) had not been confirmed. GP advised
that minimal storage would be fine, but the priority would need to be set
at 'low'. TD asked if the PMB felt it reasonable to require a response
from Alice-UK prior to setting-up of support - the agreement was yes,
engagement is required. TD and GP would iterate, draft an email and
contact the individual involved - engagement was required along with
estimates of requirements, otherwise no priority could be afforded Alice.
AS advised that input was required before Wednesday at 10:30 am, which was
the next CASTOR Team Meeting. TD noted a deadline of Tuesday evening for
a response from Alice.
2. Tape Access
===============
TD reported that there was a major issue w.r.t. tape use at CERN raised at
last Tuesday's MB - in current operation it was clear that tape access was
~10MB/s (or less) rather than 50MB/s. The agenda link is here:
http://indico.cern.ch/conferenceDisplay.py?confId=22194
-> Storage Efficiency
TD advised that slides had been provided regarding rates at CERN for all
experiments. The discussion at the MB related to tests of the tape system
being incorporated into planning, but it was noted that there had been
problems accessing tape. RJ advised that CERN were not providing D1T0 but
were backing up to tape. There was a discussion regarding the processing
and reading of tape. AS advised that there were performance issues as
well, relating to concurrent writing to disk and reading from disk, and
multiple streams. TD noted that CCRC was meant to address simultaneous
contention; a week should be designated for ATLAS, CMS and LHCb re file
access alongside user analysis. GP advised that all CASTOR sites were
banned at LHCb at present for other reasons, therefore no efficiency
figures were available. TD asked if a week was possible for large
sequential access tests. AS advised that no week was yet designated
except for CCRC. GP noted that migration to CASTOR has to happen for all
experiments first. DB asked if extra tape drives were required at the
moment. TD noted no, not yet - types of rate were required along with
figures from tests, which would give realistic throughput to determine
accurate disk/tape balance. JG suggested we go with the plan for February
'08 then determine access rates in May. AS would contact Tim Folkes to
order six tape drives as per the original plan.
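As a rough illustration of why the measured rates matter for the disk/tape
balance, the following Python sketch compares the ~10MB/s observed rate with
the 50MB/s planning figure quoted above. The 1 TB volume and the 200 MB/s
aggregate target are hypothetical values chosen only for the example, decimal
units (1 TB = 1,000,000 MB) are assumed, and the function names are
illustrative rather than taken from any GridPP or CASTOR tool.

```python
import math

# Per-drive rates quoted in the minutes (MB/s).
PLANNED_RATE_MB_S = 50.0
OBSERVED_RATE_MB_S = 10.0


def hours_to_read(volume_tb: float, rate_mb_s: float) -> float:
    """Hours for a single drive to stream volume_tb at rate_mb_s.

    Assumes decimal units: 1 TB = 1,000,000 MB.
    """
    seconds = volume_tb * 1_000_000 / rate_mb_s
    return seconds / 3600


def drives_needed(target_mb_s: float, per_drive_mb_s: float) -> int:
    """Smallest number of drives whose combined rate reaches target_mb_s."""
    return math.ceil(target_mb_s / per_drive_mb_s)


if __name__ == "__main__":
    volume_tb = 1.0        # hypothetical volume, not from the minutes
    target_mb_s = 200.0    # hypothetical aggregate target, not from the minutes
    for label, rate in (("planned", PLANNED_RATE_MB_S),
                        ("observed", OBSERVED_RATE_MB_S)):
        print(f"{label}: {hours_to_read(volume_tb, rate):.1f} h per TB, "
              f"{drives_needed(target_mb_s, rate)} drives for "
              f"{target_mb_s:.0f} MB/s")
```

Under these assumptions a drive at the observed rate takes five times as long
per terabyte, and five times as many drives would be needed for the same
aggregate rate. Realistic figures would have to come from the tests discussed
above.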
3. GridPP20 Agenda
===================
TD asked whether there were any user-based talks? Did GP, RJ, or DC have
any speakers relating to hands-on experience of experiments? TD advised
that the registration listing was currently being used to determine
possible speakers but Chairs had not yet been finalised. Were there any
updates to the main Agenda? This was ongoing.
4. AOCB
========
None.
STANDING ITEMS
==============
SI-1 Dissemination Officer's Report
------------------------------------
SP reported that a rejection had been received from the Royal Society
Summer Exhibition - SP would pursue feedback regarding this rejection.
However, STFC had an LHC stand accepted and have said they will aim to
include something about Grid on this. SP expressed thanks to DB for
passing on a couple of suggestions about news items. SP had contacted
UKQCD about news items on their biomed mini-PIPSS award and a demo of
integrating 5 regional Grids shown at a recent conference. SP was also
currently working on something about GANGA, and Mike Kenyon would forward
information on ELSSI. SP reported that Neasan O'Neill would attend the
EGEE All Activities meeting in Bulgaria next week at the request of EGEE
NA2, to take part in a meeting discussing Grid communication strategies.
The second phase of the bid for an STFC Science in Society large award, to
fund someone for LHC@home, was currently being worked on. This was due at
the end of this month.
SI-2 Tier-1 Manager's Report
-----------------------------
AS reported as follows:
1) Tenders:
a) Disk tender - delivery is scheduled for Thursday this week - if all
goes to schedule, acceptance will be complete by the end of February.
b) CPU tender - the order had been placed and scheduled for delivery 28
February.
c) Tape drive purchase - the purchase plan was being finalised. If the
order is placed in the next couple of days we may be able to get the
equipment on the ground in time for February's CCRC08.
2) Memory upgrades are all completed. Closed.
3) Work on the power supply is proceeding - so far with no disruption to
service. Measurements indicate that we have (just) sufficient power to
operate with one transformer out of service. This will continue to be
the case until late February (when the next CPU delivery will push us
over the limit). As the transformer work is likely to be completed
before the CPU delivery, e-Science will probably not have to reduce
electrical load.
4) The RAL PPD disk space loan (approx 80TB) is available.
Service
-------
1) SAM availability for last week was 99%.
2) CASTOR:
a) Problems with the ATLAS CASTOR instance were traced to queries
overflowing the Oracle query cache. This was increased and ATLAS
production restarted on Wednesday.
b) LHCb have encountered problems (also at CNAF) where rfio requests leave
files open after the end of the IO job. This gradually leads to a
degradation in performance as all IO job slots become occupied.
Investigations are still underway.
3) SL4 Migration - The SL4 UI is configured and is being tested.
4) The LHCb Oracle-based LFC is operating well - Item closed.
Progress to Grid Only Access - This standing item documents the status of
work towards achieving GridPP milestone 0.18 "Access to Tier-1 resources
by Grid Interfaces Only".
1) qsub access was scheduled to terminate last Friday but we have a few
details to finalise and will finally switch off qsub by Wednesday.
SI-3 Production Manager's Report
---------------------------------
JC reported as follows:
1) There have been several requests for improvement/changes to the EGEE
broadcast system.
2) A new process has been introduced whereby a ticket is not closed but
goes into the "verify" state.
3) A bug in the service availability algorithm in Gridview (so that the
calculation considers services with no critical tests as up and
available) will be corrected from today.
4) Manchester has ~9GB of space occupied by CMS and ALICE software.
Considering the policies of these experiments the site wants to know
how to deal with this software (extra space on the software servers
would be useful).
5) Over the Christmas period the old gridpp VOMS certificate expired. The
resultant site reaction indicated that the changeover was not widely
known.
6) Ops test performance over the Christmas and New Year period has been
stable for most sites. Several sites were 100% available. The worst
performing sites over the period are similar to those in November/early
December. Overall Q4 saw an average availability of 86% vs 85% for Q3.
7) The most significant problem over the last few weeks (as already
discussed) was for ATLAS due to CASTOR. This has led to reduced use of
UK Tier-2s.
There was a discussion regarding enabling and supporting VOs and the space
available to them that sites are responsible for. It was agreed that 9GB
was not felt to be excessive for a software area and that a bigger area
was appropriate if required. TD noted that VOs should be supported on a
site basis and any plans to drop individual VO support should follow
discussion with the Region and ultimately with the VO concerned. It was
reported that ECDF at Edinburgh was now a new site with a shared cluster.
Meetings:
A) There was a CCRC'08 planning meeting on 10th Jan:
http://indico.cern.ch/conferenceDisplay.py?confId=24844
B) There was a GDB last week:
http://indico.cern.ch/conferenceDisplay.py?confId=20225
The focus was benchmarking, data management, worker node issues and
security policies.
SI-4 LCG Management Board Report
---------------------------------
It was noted that experiment requirements were still awaited in response
to MB questions. RJ, GP, DC would be sent a url relating to CCRC08 with
planning meeting details, so that the summary of experiment requirements
can be checked to ensure no major mismatch [done during meeting]. TD
reported that the tape issue had already been covered and that CCRC
planning would be reviewed again next time.
SI-5 Documentation Officer's Report
------------------------------------
SB was not present.
REVIEW OF ACTIONS
=================
272.4 AS to check the current Tier-1 disaster recovery plan and circulate
the existing version to the PMB. It was reported that this document does
not exist, but it was planned to have one in the longer term. TD would
incorporate in v0.4 anything that AS considered relevant. AS will check
and advise additions. Ongoing.
277.2 DC to provide an update and re-evaluation of CMS/CASTOR
deliverables. TD advised that there was a CMS/CASTOR document on
deliverables which should be revised in light of the December '07 tests.
DC to take the token for this now and iterate with DN. Ongoing.
277.5 Disaster Recovery 'Team B': SB, JC, TD, SP, DB to analyse the wider
issues of disaster planning, mapped to the experiments' lists, and this
work would include Project Management. A Recovery Plan was required. It
was agreed that JC was in charge of this and the experiment input relating
to subsets of the disaster plan. SB/JC to progress. It was noted that
the AFS Service was also linked to this. Ongoing.
277.8 User Experience 'Team C': SB, SP, SL, with input from JC to deal
with the issue of user experience and design of an easily-found lookup
facility for grid error messages. SL reported that he had started the
ATLAS wiki page and would circulate the url. Ongoing.
280.6 JG brought up the issue of the biomed VO and 'sieving' at the ROC
Manager's meeting - a broadcast is to go out from EGEE which will be
helpful in underlining acceptable use of Grid resources and would act as a
reminder to VOs about the policy they have signed-up to in relation to
their users. JC had now emailed the Chair to have this discussed. JG
reported that a new VO was now set up but there were few resources
allocated to it as yet, although the home Institute may be giving funds.
Pending further info from JC. EGEE broadcast action ongoing - JG will
bring-up the broadcast action at the ROC VO meeting tomorrow (Tue 15).
Ongoing.
280.7 JC to mention the issues (when approached by a VO with regard to
joining) of the 'standard' 6-month introduction period, following which
the VO must set-up something specific to them, if appropriate. This was
discussed at DTeam. JC to email GridPP VO members if possible - ongoing.
This was a standing action - JC had discussed it with the Tier-2
Co-ordinators in relation to VO members. JC to send email. JC reported
that he had received a request from OMII to set-up a GridPP VO - it was
preferable for this to be done through NGS. Ongoing.
280.8 JG to investigate the UKI ROC website - any change/progress, and
report-back. Ongoing.
282.2 SP to progress the Project Map using the T1 service areas and input
from the meeting. Ongoing.
282.6 JC and SB to progress existing 'disaster planning' template for next
F2F meeting on 1st Feb. Involve experiments as necessary. This was a
follow-up from the last F2F, and was to be distinguished from 277.5 action
which is a longer-term one relating to the OC.
283.1 TD to arrange a phone connection at TC Dublin for RJ to join the
GridPP20 PMB meeting remotely. Ongoing.
283.3 RM/TD to prepare use cases appropriate for the UK community,
[relating to item 278.10 EGEEIII -> EGI]. RM reported that he would be
attending a workshop at the end of January at CERN (by EGI design study
project) and would report-back at that time. RM reported that use case
and functions parts of the EGI website were now publicly visible. RM
would circulate the url for the use cases - a template was available to be
completed. All: to provide inputs to RM in the template format provided
via the url. Done, action closed.
286.1 RJ to call a NorthGrid meeting to decide hardship funding
allocations to Institutes. RJ reported that a meeting had been held this
morning. Information would be sent to SL. RJ summarised that the largest
figure would go to Sheffield: 12k, with 6k each to Liverpool, Lancaster,
and Manchester.
286.2 SL and DB to iterate regarding clause associated with the issuing of
Tier-2 hardware grants. SL had sent DB an email with suggestions.
Ongoing.
286.3 AS to formally apologise to ATLAS on behalf of GridPP for the outage
problems over the Christmas period. AS reported that he had sent a formal
email apology to Kors. The identified cause had now been resolved and
ATLAS production re-started ok. Done, item closed.
286.4 GP to advise the UB that the special cases for non-Grid access to
the UK Tier-1 were approved. Done, item closed.
286.5 AS to organise a service message at login relating to non-Grid
access being withdrawn. Ongoing.
286.6 JC and SB to incorporate the AFS Service into the disaster planning
document. This was added to the list. Done, item closed.
ACTIONS AS AT 14.01.08
======================
272.4 AS to check the current Tier-1 disaster recovery plan and circulate
the existing version to the PMB. It was reported that this document does
not exist, but it was planned to have one in the longer term. TD would
incorporate in v0.4 anything that AS considered relevant. AS will check
and advise additions.
277.2 DN to provide an update and re-evaluation of CMS/CASTOR
deliverables. TD advised that there was a CMS/CASTOR document on
deliverables which should be revised in light of the December '07 tests.
DC to take the token for this now and iterate with DN.
277.5 Disaster Recovery 'Team B': SB, JC, TD, SP, DB to analyse the wider
issues of disaster planning, mapped to the experiments' lists, and this
work would include Project Management. A Recovery Plan was required. It
was agreed that JC was in charge of this and the experiment input relating
to subsets of the disaster plan. SB/JC to progress.
277.8 User Experience 'Team C': SB, SP, SL, with input from JC to deal
with the issue of user experience and design of an easily-found lookup
facility for grid error messages. SL reported that he had started the
ATLAS wiki page and would circulate the url.
280.6 JG brought up the issue of the biomed VO and 'sieving' at the ROC
Manager's meeting - a broadcast is to go out from EGEE which will be
helpful in underlining acceptable use of Grid resources and would act as a
reminder to VOs about the policy they have signed-up to in relation to
their users. JC had now emailed the Chair to have this discussed. JG
reported that a new VO was now set up but there were few resources
allocated to it as yet, although the home Institute may be giving funds.
Pending further info from JC. EGEE broadcast action ongoing - JG will
bring-up the broadcast action at the ROC VO meeting tomorrow (Tue 15).
280.7 JC to mention the issues (when approached by a VO with regard to
joining) of the 'standard' 6-month introduction period, following which
the VO must set-up something specific to them, if appropriate. This was
discussed at DTeam. JC to email GridPP VO members if possible - ongoing.
This was a standing action - JC had discussed it with the Tier-2
Co-ordinators in relation to VO members. JC to send email.
280.8 JG to investigate the UKI ROC website - any change/progress, and
report-back.
282.2 SP to progress the Project Map using the T1 service areas and input
from the meeting.
282.6 JC and SB to progress existing 'disaster planning' template for next
F2F meeting on 1st Feb. Involve experiments as necessary. This was a
follow-up from the last F2F, and was to be distinguished from 277.5 action
which is a longer-term one relating to the OC.
283.1 TD to arrange a phone connection at TC Dublin for RJ to join the
GridPP20 meeting remotely.
286.1 RJ to call a NorthGrid meeting to decide hardship funding
allocations to Institutes. RJ reported that a meeting was scheduled for
this morning. Information would be sent to SL. RJ summarised that the
largest figure would go to Sheffield: 12k, with 6k each to Liverpool,
Lancaster, and Manchester.
286.2 SL and DB to iterate regarding clause associated with the issuing of
Tier-2 hardware grants. Ongoing.
286.5 AS to organise a service message at login relating to non-Grid
access being withdrawn.
287.1 TD and GP to iterate, draft an email, contact the Alice
representative (technical) at CERN and request inputs regarding estimates
of requirements for disk allocation - deadline for response from Alice was
Tue evening (15 Jan).
287.2 AS to contact Tim Folkes to order six tape drives as per original
plan.
287.3 All: to provide inputs to RM in the template format provided via
the circulated url - re EGEEIII -> EGI and use cases.
INACTIVE CATEGORY
=================
271.1 PMB to examine the issue of fibre breakage and outages, CERN-RAL OPN
link, in one year's time, when actual data on breakages is available.
Due date would be September '08.
271.3 Re CERN-RAL OPN link breakage and backup generally, PC to oversee
the issue and collate info so that the PMB have something to revisit in
one year's time. Due date September '08. It was noted that PC would
circulate a revised document after discussion with ATLAS (RJ/PC/DN to
iterate).
282.8 RM to monitor how R-GMA and networking issues impact on GridPP as
matters progress. RM advised that this item should be moved to the
'inactive' category as it will develop over the coming months. RM
discussed the issue with Steve Fisher and advised that support of R-GMA is
required whilst APEL is dependent on it. RM reported that he has spoken
to SF and there is currently no change to the R-GMA situation - process
ongoing.
The meeting closed at 2:30 pm. The next PMB would take place on Monday 21
January at 1:00 pm.