Dear All,
Please find attached the latest weekly GridPP Project Management
Board Meeting minutes. The latest minutes can be found each week in:
http://www.gridpp.ac.uk/php/pmb/minutes.php?latest
as well as being listed with other minutes at:
http://www.gridpp.ac.uk/php/pmb/minutes.php
Cheers, Tony
________________________________________________________________________
Prof. A T Doyle, FInstP FRSE GridPP Project Leader
Rm 478, Kelvin Building Telephone: +44-141-330 5899
Dept of Physics and Astronomy Telefax: +44-141-330 5881
University of Glasgow EMail: [log in to unmask]
G12 8QQ, UK Web: http://ppewww.physics.gla.ac.uk/~doyle/
________________________________________________________________________
GridPP PMB Minutes 266 - 23rd July 2007
=======================================
Present: Tony Doyle, Sarah Pearce, Stephen Burke, David Britton, Dave Newbold,
Steve Lloyd, John Gordon, Jeremy Coles, Peter Clarke, Glenn Patrick,
Andrew Sansum, Suzanne Scott (Minutes)
Apologies: Roger Jones, David Kelsey, Tony Cass, Robin Middleton, Neil Geddes
Yingqin Zheng was continuing to observe for the Pegasus project.
0. EVO testing
===============
TD reported that JC, AS, GP, JG, SB, & TD had participated in an EVO test
prior to the PMB. The test was not entirely successful and was
inconclusive. It was agreed to try another test next Monday but it was
understood that the problems were not going to be immediately resolved.
JC noted that EVO was to be used for the dTeam meeting on Tuesday so
further results would be available then.
[Note: later dTeam meeting was more successful]
On a separate issue, regarding delays in JISCMAIL delivery, PMB members
were to send relevant information to JG to enable him to investigate.
1. CB Meeting
==============
http://www.gridpp.ac.uk/cb/
Meeting 12 - 16 Jul 2007
SL reported that there had been a CB meeting last Monday and draft Minutes
were available on the CB page. The main business had been to look at the
final GridPP3 plan along with the leftover funds from GridPP1 & 2 - DB had
presented a report. There had been a discussion regarding what could be
done if the Grid didn't work: the risk register was discussed, as was
access to CMS data by Bristol users, and MoUs. It was noted that sites
must allow access to data. SL noted that everyone can access data but
local logins would not be acceptable to sites. There was a question
regarding UKQCD and how they would get their requirements into the system
through user support. Regarding the election of a Chair for the CB: Pete
Watkins would canvass opinion. The next CB meeting would take place in the
New Year '08 and would probably be a F2F meeting.
TD asked if DB could provide a summary statement of the GridPP3 financial
position. DB reported that there had been iterations during the week; the
initial response from STFC had not been positive and discussion had been
delayed. DB reported that STFC had decided to remove from GridPP3 the 1.3
million saved from GridPP1 & 2 but GridPP3 could spend some of the working
allowance of 1.2 million on some of the proposed unfunded posts, excluding
the VOMS post at Manchester - this post had been referred back to the PPRP
to decide. DB reported that the Experiments' posts had not been funded,
only 1 out of the 1.5 posts for the Experiments had been agreed - the rest
of the posts were to be funded out of the working allowance. It was noted
that using the working allowance upfront was a significant risk.
TD noted that STFC had stated they would be flexible with contingency
funding. DB noted that the UB Chair and the GridPP3 Administrative post
would both be funded as proposed, out of the working allowance.
DN recommended that the Experiments now go back to their Oversight
Committees and provide them with information.
TD asked whether, in principle, the PMB agreed to use the working
allowance for the purpose proposed, and agreed that the Experiments should
now go back to their OCs. This was agreed. The PMB would also
report back to the CB with an agreed statement. DN noted that the GridPP
OC was upcoming. TD noted that GridPP had not been allowed to discuss
funding issues with the Oversight Committee; STFC would refer it
themselves. DN noted that there would be an invitation from the Oversight
Committee to the Experiments but this would not happen until October. DB
noted that grants need to be issued this weekend and decisions made about
posts. DN noted that it was not possible to run a service with what has
been agreed and that after the funding is finalised GridPP should
challenge STFC's decision.
PC asked who was in charge of computing at STFC, including Grid &
AstroGrid; who was on the Management Team and had responsibility for
setting policy? It was noted that Andrew Taylor was the contact person.
PC reported on a high-level document that had been sent to STFC Management
in order to raise their awareness of the need for global computing
support.
The PMB endorsed DB's report and statements, and it was agreed that DB
would summarise the situation for the CB and circulate this to the PMB for
prior approval. It was understood that if the VOMS service remains
unfunded, it seriously compromises the running of the Grid.
2. GridPP19 Programme & Planning
=================================
http://www.gridpp.ac.uk/gridpp19/
TD had circulated a proposed Agenda and had invited comments. It was
hoped to invite speakers soon and announce the Collaboration Meeting to
UKHEPGRID on Wednesday. It was understood that delegates were likely to
leave early in order to attend CHEP. The Registration deadline was
planned for mid-August. It was agreed that the Tier-1 and Tier-2 talks
should also be forward-looking and not just retrospective. It was hoped
that registrations would begin from the middle of this week. The PMB
approved the programme.
3. Policy on Sites Stopping Stalled Jobs
=========================================
TD noted that new input from Tier-1 had been included in the latest
version of the Policy document. This was unlikely to change but there may
be new plots from Imperial College. TD proposed to implement a draft
Policy from August 1st to December 31st in response to Les Robertson's
statement at the MB. The Policy would cover all UK sites and would be
reviewed for 2008. JG noted that more interaction was required with
middleware on why jobs are cancelled. TD noted that individual sites can
determine queue lengths. DN observed, however, that sites will still need
to rely on central advice. There followed a discussion on queue types.
The PMB approved the draft Policy document and the timescale for review.
The draft policy document can now be found in its final form at:
http://www.gridpp.ac.uk/pmb/docs/GridPP-PMB-113-Inefficient_Jobs_v1.0.doc
or
http://www.gridpp.ac.uk/pmb/docs/GridPP-PMB-113-Inefficient_Jobs_v1.0.pdf
4. Job Queue Lengths
=====================
It was understood that the dTeam need to discuss this issue and ask all
sites to provide information regarding current queue lengths - this will
provide a clearer idea of current status. JC will send out this request
when he publicises the dTeam meeting on Thursday.
STANDING ITEMS
==============
SI-1 Dissemination Officer's Report
------------------------------------
SP reported that a news item on the Site Reviews was ready, and she had
received a quote from each of the review teams. A Press Release relating
to Cambridge Ontology was also prepared. Regarding the NA2 meeting at
CERN, Neasan O'Neill was attending, and SP would join by videolink - this
related to plans for EGEEIII and funding for Dissemination posts. SP
noted that there was a parallel bid going into the EU proposal to try and
get two people - UK would bid for someone to work on the Grid Cafe and
MultiMedia; another on International Science Grid - it had been indicated
that this bid would be favourably received. SP noted she had started to
consider SuperComputing '07 - JG thought it would go ahead - the funding was
currently being finalised but only 9 were booked so far. JG noted that
only smaller screens would be available this time; the call would be going
out soon. It was noted that the Project would pay for travel and
accommodation. SP noted her thanks for text received relating to the All
Hands leaflets.
SI-2 Tier-1 Manager's Report
-----------------------------
AS provided the following report:
Hardware:
--------
- 10Gb path from Tier-1 to SJ5: The 10Gb firewall bypass was not yet
available for testing to commence.
- RAL networking group are in the process of obtaining a public AS number
in order that the Tier-1 can route Tier-1 -> Tier-1 traffic by the OPN.
This has not yet been resolved. AS would speak to JG and apply leverage
by a different route.
- Tenders:
a) The pre-qualification stage of the disk and CPU tenders closed Friday
29th June. Evaluation was almost complete. Financial approval for the full
requisition sum will be needed later this week.
b) Tape media Framework was running to schedule.
c) Tape media interim purchase had now arrived - item closed.
d) Tape drive Framework - went out on Friday.
e) Tape drive interim purchase - increasing concern was noted that the
existing 6 drives (even supplemented by temporary loans) will not meet
experiments' requirements in the autumn when it is expected that there
will be concurrent use by several experiments. An interim purchase of
5-6 tape drive bricks was planned; an order would be placed within the
next 1-2 weeks.
Service:
-------
- SAM availability for the last 7 days was 99% (as recorded by Steve
Lloyd's summary page).
- CASTOR: The CASTOR 2.1.3 instances all ran reliably last week.
a) On Friday ATLAS was notified that they were ready to accept wide area
T0->T1 transfers - these were expected to start today.
b) CMS T0->T1 testing was planned to commence today.
c) The LHCb instance was ready for LHCb to run their tests.
SL4 testing:
-----------
It was noted that a new production quality CE was being built for the SL4
service; this will replace the existing CE later this week. Once that is
available experiments will be polled for guidance as to when their
capacity should be moved to SL4.
SI-3 Production Manager's Report
---------------------------------
JC provided the following report:
1) At last Monday's ops meeting it was agreed that sites should start
migrating to SL4 WNs. Birmingham and Brunel have moved already. Many
sites were planning the change towards the end of August. It was very
unlikely that all GridPP sites will have moved before September but
CMS-dominated sites are expected to have moved.
2) Oxford saw its CE getting overloaded last week - there were 900 biomed
jobs queued at the site but only 20 could run simultaneously. To clean
up, the site closed network connections which led to SAM tests being
failed. Durham's CE also "locked up" last week. SB noted that it was
possible to block submissions.
3) A large backlog of work developed last week on the RAL SL4 test queues.
LHCb jobs do not appear to be running correctly - the cause is being
investigated. There have been reports that ATLAS code that needs
compiling on the WNs runs into problems. There was a discussion
regarding release 13 and tests by ATLAS.
4) Lancaster was suffering from intermittent but regular dCache errors.
Developers were involved in finding the cause. There is useful
information on SRM2.2 for T2 sites in the Storage Group minutes from
last week:
http://indico.cern.ch/getFile.py/access?resId=0&materialId=minutes&confId=19037.
5) RAL Tier-1 runs the hardware for the RGMA registry. The service is
showing signs that the hardware is reaching its limits (especially
memory). Moving to new hardware will require a change of IP address
which will need to be reflected in site firewall rules at all sites.
This will be raised as an issue at today's ops meeting. AS noted that
hardware was available at the moment; sites should be given notice of
change even if implementation was not imminent. TD noted that it would
be useful (as already agreed) to have a SAM summary in a set format
from JC on the 1st Monday of each month - this would cover items 6, 7,
and 9, and would formalise the report.
6) Last week's scheduled downtime: UCL-CENTRAL (for planned electrical
work); Durham (scheduled power outage); Birmingham (upgrade to SL4 &
aircon problems); Brunel (SE problems); EFDA-JET (issues with R27
update and electrical maintenance); Glasgow (15 mins downtime for R27
update).
7) Last week's unscheduled downtime: UCL-HEP (3 hrs due to power outage);
RAL-PP (air-conditioning failure).
8) There was an announcement this morning that support for LCG-2_7_0 will
stop from the end of August. Apparently 14 sites still indicate this as
their middleware version. For the UKI region gstat reports: gLite-3_1_0
(2 sites); gLite-3_0_2 (5 sites); gLite-3_0_1 (1 site) and gLite-3_0_0
(16 sites). This is a similar distribution across the versions as seen
for EGEE as a whole.
9) The SAM availability figures (from Steve's page) for all GridPP sites
showed improvement over the last month. Availability was 82% over the
last month and 85% over the last week, and the 24hr measure shows 87%.
At GridPP18 the target mentioned for June was >85% and
for July >90%. Now that we have some historical data for performance it
will be possible to look at each site in turn and report reasons for
not meeting the target. IC-HEP appears to be the best performing site
at 99% for the last month while UCL-CENTRAL is at 52% for the month due
to a machine room move and electrical work.
10) In discussing T2 network testing and SE tuning at the dTeam meeting
last week it became apparent that tests are now limited due to not
using CASTOR. Given the current situation with CASTOR what is the PMB
view on this matter? Running transfers under dteam will provide very
few resources and so far experiment transfers were not fully
exercising many of the T2 sites (in terms of sustained and high
bandwidth transfers).
DN noted that Experiments should be doing this and dTeam should
provide support. CMS can, and will, do this; internal tests
had already been carried out at RAL, but it needs to be
done in a highly structured and controlled way - no random testing
should be allowed. JC asked whether ATLAS would be doing this for
each site that they will use. TD replied that it had been only a subset
up to this point. This can now be opened further given that CASTOR problems
at RAL are now reduced.
11) The deployment team will revisit NGS-GridPP interoperation areas at
this week's dTeam meeting. To allow better tracking of progress we
will start logging developments in the GridPP wiki:
http://www.gridpp.ac.uk/wiki/Interoperation_activities
12) There will be a UKI monthly operations meeting on Thursday of this
week from 10:30:
http://indico.cern.ch/conferenceDisplay.py?confId=19090 (agenda is
still being arranged).
SI-4 LCG Management Board Report
---------------------------------
See https://twiki.cern.ch/twiki/bin/view/LCG/MbMeetingsMinutes
It was noted that the Minutes of the last meeting were not yet available.
It was reported that SRM2.2 was continuing to be used.
SI-5 Documentation Officer's Report
------------------------------------
SB noted that a CERN student was now working on UIG pages and was carrying
out updates for accuracy.
REVIEW OF ACTIONS
=================
250.4 RJ, DN, GP, TD to meet to integrate experiment requirements of
Tier-2s going to Tier-1 - sites are aware of requirements but discussion
still has to take place. It was noted that this issue is not high
priority. A meeting is to take place with Barney Garrett - this is
ongoing and still to be arranged.
252.3 RM has now received inputs for his one-page summary regarding the
transition of each of the existing Middleware areas from GridPP2 to
GridPP2+ to GridPP3 - this to go to DB. This was to be done by Friday 8th
June but is still ongoing.
254.2 ALL PMB members have now signed up to EVO. Tests were ongoing but
this action is on hold due to H.323 requirements which must be resolved.
JG/RM will resolve EVO issues. RJ reported that he had joined an
evaluation group on EVO and asked that all information should be sent to
him to enable him to document the problems involved. It was agreed that
an EVO test would take place the week after next (PMB) as next week's
meeting was a short one due to the CB meeting at 2.00 pm.
259.5 JC to provide recommendations to the PMB on PPS testing and a
summary of what is currently available on the system. JC reported that he
had received feedback from 2 out of the 3 PPS sites. Ongoing.
260.1 RM to provide final feedback for site reviews to SL for
https://www.gridpp.ac.uk/tier2/Readiness_Reviews/index.html.
260.3 RM, NG, TD, DK to inform SL which site-review information is
public/private.
261.1 TD and JG to prepare a PMB statement for the MB
regarding SL4 releases of basic middleware, which were still awaited and
were an issue at sites. JG reported that he would be doing this for
tomorrow. Sites should be encouraged to proceed with SL4 upgrades which
are to be tracked by JC. JG will give a summary statement to the MB as to
what we believe the current situation is - this will include 'SL5 on
hold'.
261.2 DN, RJ, GP: An action on the experiments to define the future
outlook for 64-bit applications and resultant effects on hardware
purchasing. Experiment reps to define the outlook. It was noted that the
priority is 32-bit at the moment; there is no advantage to 64-bit. A
short statement is required.
261.4 DB to look through the input in detail in relation to GGUS problems.
261.5 JC and dTeam to carry out a survey on sites' experiences of GGUS,
when possible to organise. Ongoing.
261.6 JC to look into the issue of 2-hour response timing at Tier-2 sites
and understand the problem in greater detail - sites also need to
understand what the two-hour response time actually means. This needs
clarification. What are GridPP expecting on the 2-hour timeline?
Presumably this is a response indicating that the matter is being
investigated. Is this specific to when a GGUS ticket has been raised?
261.11 SL to progress receipt of final site documents from SouthGrid and
London T2 which were still outstanding. It was noted that SL was still
awaiting information.
261.13 DK to progress receipt of ScotGrid feedback.
261.14 RM to progress receipt of LT2 feedback.
261.17 JC to assess the general effectiveness of RSS feeds and
subscription-based updates, in relation to GridPP blogs. It was noted
that blogs are aggregated: PlanetGridPP is the mechanism, but RSS-feeds
that can be subscribed to don't exist. JC will bring this up at the
Deployment Board meeting. The Deployment Board did not have time for a
full consideration of things from JC. He will ask the question at the UKI
meeting on Thursday and take it from there. Ongoing.
262.2 SL to clarify GridPP contribution (what is accounted rather than
what is available) with the Tier-2 Board. SL reported that he understood
this to mean that the contribution of a site to GridPP will be measured in
terms of the amount of CPU used and disk available (whether used or not)
to GridPP enabled VOs via 'the Grid' i.e. currently using EDG/gLite
middleware, SRM etc. These numbers will be taken from the relevant
accounting systems. SL did not think that this had been formally agreed
yet and the exact wording will need to be incorporated in the new GridPP3
MoU. Done, item closed.
262.4 JC to ascertain the specific problems in relation to Condor support
issues. JC reported the following list from Santanu Das to follow up:
1. CE hard-wired to a dead version of Condor i.e. v6.7.10
2. A simple "yum update" is not possible if condor-6.7.10 is not
installed
3. CE still installs "torque" by default even for a condor site
4. "TORQUE_SERVER" has a default requirement for "configure_gip*" part
5. APEL is broken most of the time on condor (rpms are not part of the
official release though)
6. Info provider service doesn't work - running jobs, waiting jobs etc.
are always wrong.
7. Middleware doesn't provide anything to configure Condor for
gLite
262.5 Regarding poor response time of middleware developers: DK to
propose the following recommendation to the Deployment Board: to recommend
that if specific issues were involved, GGUS should be used. If issues were
general, the TCG representative at the Tier-2 site should be informed.
The TCG rep in turn should raise the issue as appropriate at the TCG
meetings.
262.6 JC to raise the issue of PPS feedback information relating to
upgrade issues with the relevant individual(s) on the PPS, and ask if
there was anything else that could be done.
262.9 Non-Grid access relating to VOs. A document is to be produced detailing
this issue as VOs need a mechanism 'in'. AS to detail the issue in a
separate report and circulate to the PMB. What can and can't be offered
to non-Grid users: detail is required - AS still to do.
263.1 Robin Tasker to re-circulate his paper regarding the RAL-CERN OPN
link, once further information was available. What is the timescale for
this? PC to review the Minutes and discuss with Robin Tasker. TD
reported that RT had been in touch re the risk figures for fibre breaks -
pricing on resilience would be available soon. Action continues on RT and
PC.
263.2 JG to further investigate the lack of ability to pass job
requirements to the batch system and report back (Tier-2 review issue).
JG will raise this through the GDB.
ACTIONS AS AT 23 JULY 2007
==========================
250.4 RJ, DN, GP, TD to meet to integrate experiment requirements of
Tier-2s going to Tier-1 - sites are aware of requirements but discussion
still has to take place. It was noted that this issue is not high
priority. A meeting is to take place with Barney Garrett - this is
ongoing and still to be arranged.
252.3 RM has now received inputs for his one-page summary regarding the
transition of each of the existing Middleware areas from GridPP2 to
GridPP2+ to GridPP3 - this to go to DB. This was to be done by Friday 8th
June but is still ongoing.
254.2 ALL PMB members have now signed up to EVO. Tests were ongoing but
this action is on hold due to H.323 requirements which must be resolved.
JG/RM will resolve EVO issues. RJ reported that he had joined an
evaluation group on EVO and asked that all information should be sent to
him to enable him to document the problems involved. It was agreed that a
further EVO test would take place just prior to next Monday's PMB. The
dTeam experience will also be reviewed. It may be that VRVS will continue
to be used if EVO problems remain unresolved.
259.5 JC to provide recommendations to the PMB on PPS testing and a
summary of what is currently available on the system.
260.1 RM to provide final feedback for site reviews to SL for
https://www.gridpp.ac.uk/tier2/Readiness_Reviews/index.html.
260.3 RM, NG, TD, DK to inform SL which site-review information is
public/private.
261.1 TD and JG to prepare a PMB statement for the MB
regarding SL4 releases of basic middleware, which were still awaited and
were an issue at sites. JG reported that he would be doing this for
tomorrow. Sites should be encouraged to proceed with SL4 upgrades which
are to be tracked by JC. JG will give a summary statement to the MB as to
what we believe the current situation is - this will include 'SL5 on
hold'.
261.2 DN, RJ, GP: An action on the experiments to define the future
outlook for 64-bit applications and resultant effects on hardware
purchasing. Experiment reps to define the outlook. It was noted that the
priority is 32-bit at the moment; there is no advantage to 64-bit. A
short statement is required.
261.4 DB to look through the input in detail in relation to GGUS problems.
261.5 JC and dTeam to carry out a survey on sites' experiences of GGUS,
when possible to organise.
261.6 JC to look into the issue of 2-hour response timing at Tier-2 sites
and understand the problem in greater detail - sites also need to
understand what the two-hour response time actually means. This needs
clarification. What are GridPP expecting on the 2-hour timeline?
Presumably this is a response indicating that the matter is being
investigated. Is this specific to when a GGUS ticket has been raised?
261.11 SL to progress receipt of final site documents from SouthGrid and
London T2 which were still outstanding. It was noted that SL was still
awaiting information.
261.13 DK to progress receipt of ScotGrid feedback.
261.14 RM to progress receipt of LT2 feedback.
261.17 JC to assess the general effectiveness of RSS feeds and
subscription-based updates, in relation to GridPP blogs. It was noted
that blogs are aggregated: PlanetGridPP is the mechanism, but RSS-feeds
that can be subscribed to don't exist. JC will bring this up at the
Deployment Board meeting. The Deployment Board did not have time for a
full consideration of things from JC. He will ask the question at the UKI
meeting on Thursday and take it from there.
262.4 JC to ascertain the specific problems in relation to Condor support
issues. Currently following-up a list of issues provided by Santanu Das.
262.5 Regarding poor response time of middleware developers: DK to
propose the following recommendation to the Deployment Board: to recommend
that if specific issues were involved, GGUS should be used. If issues were
general, the TCG representative at the Tier-2 site should be informed.
The TCG rep in turn should raise the issue as appropriate at the TCG
meetings.
262.6 JC to raise the issue of PPS feedback information relating to
upgrade issues with the relevant individual(s) on the PPS, and ask if
there was anything else that could be done.
262.9 Non-Grid access relating to VOs. A document is to be produced detailing
this issue as VOs need a mechanism 'in'. AS to detail the issue in a
separate report and circulate to the PMB. What can and can't be offered
to non-Grid users: detail is required - AS still to do.
263.1 Robin Tasker to re-circulate his paper regarding the RAL-CERN OPN
link, once further information was available. What is the timescale for
this? PC to review the Minutes and discuss with Robin Tasker.
263.2 JG to further investigate the lack of ability to pass job
requirements to the batch system and report back (Tier-2 review issue).
JG will raise this through the GDB.
266.1 PMB (All) to send information on JISCMAIL delays/difficulties to JG
to enable him to investigate.
266.2 DB to summarise the current financial status of GridPP3, including
the recent STFC decision regarding GridPP1 & 2 saved funds, for the CB.
This to be circulated to the PMB for prior approval.
INACTIVE CATEGORY
=================
247.2 RJ to get further information from ATLAS regarding use of Grid for
testing of PANDA, and report back.
251.1 TD to raise the issue of memory vs CPU cost at the MB [in order to
work out what the requirement was between 1GB and 2GB memory per core].
253.1 AS has commenced work on the report on data integrity at Tier-1, in
relation to implementation of checksums. Ongoing, AS hopes to complete
this by end August.
It was agreed that a further EVO test would take place just prior to next
Monday's PMB. The dTeam experience will also be reviewed. It may be that
VRVS will continue to be used if EVO problems remain unresolved. The next
PMB would take place at 1.15 pm on Monday 30th July. The meeting closed
at 3.00 pm.