Dear All,
Please find attached the latest weekly GridPP Project Management
Board Meeting minutes. The latest minutes can be found each week in:
http://www.gridpp.ac.uk/php/pmb/minutes.php?latest
as well as being listed with other minutes at:
http://www.gridpp.ac.uk/php/pmb/minutes.php
The previous minutes are at:
http://www.gridpp.ac.uk/pmb/minutes/070702.txt
Cheers, Tony
________________________________________________________________________
Prof. A T Doyle, FInstP FRSE GridPP Project Leader
Rm 478, Kelvin Building Telephone: +44-141-330 5899
Dept of Physics and Astronomy Telefax: +44-141-330 5881
University of Glasgow EMail: [log in to unmask]
G12 8QQ, UK Web: http://ppewww.physics.gla.ac.uk/~doyle/
________________________________________________________________________
GridPP PMB Minutes 264 - 9th July 2007
======================================
Present: Tony Doyle, Sarah Pearce, Roger Jones, Stephen Burke, David Britton,
David Kelsey, Steve Lloyd, Tony Cass, Robin Middleton, John Gordon,
Jeremy Coles, Peter Clarke, Andrew Sansum, Neil Geddes, Suzanne Scott (Minutes)
Apologies: Dave Newbold, Glenn Patrick
0. Approval of Previous Minutes
================================
It was agreed to send any amendments to SS by email, preferably by noon
tomorrow (Tue).
1. EGEE III Proposal
=====================
NG had circulated an email to UK/I EGEE partners. A workplan had been
refined by the PEB but the bids from federations were still in excess of it.
Two issues were involved: 1. trim the bids to reflect the programme of work;
2. trim the programme of work itself. The EGEE PMB had met last Friday
(6th July) in closed session to discuss the bids. For SA1, the conclusion
was to approve all of the work proposed by the activity leader - this
would be translated into euros which would provide the approved budget for
each bid. The meeting also discussed the Applications Support area.
Bids had been sent in which were not in the programme of work, and were
not well defended. For some of the other bids it was agreed that they
needed to be combined into one bid. Further discussion of this area will
happen this week. There had been a discussion on testbeds and other
non-(full)-production services which is likely to result in a
consolidation of these activities. The final budget table would be
discussed this week, and the next EGEE PMB meeting was scheduled for 16th
July.
2. Review of Tier-2 Issues
===========================
It was agreed that DB's list had been gone through and actions generated.
DB noted that JC had not been present at last week's meeting but his
comments had been incorporated in the Minutes. It was agreed that DB
would extract the issues and actions generated from the Review and put
these on the Tier-2 site.
Note: done, see
http://www.gridpp.ac.uk/tier2/Tier-2_Review_Issues_2007.doc (.pdf)
3. GridPP3 Planning
====================
DB had circulated an email. The indication was that no further formal
input from GridPP was required by STFC at this point. It was
understood that all of the money had been approved by PPRP and other
Committees but that the carry-forward of GridPP2 funds was not yet
quite confirmed. It was noted that a CB meeting was happening next
week and the funding issue would be raised with Group Leaders.
Everyone was aware that we have grants awaiting issue in 7 weeks' time.
It was agreed that DB would contact Janet Seed again to ask her advice
about a formal statement re the plan.
4. AOCB
========
None.
STANDING ITEMS
==============
SI-1 Dissemination Officer's Report
------------------------------------
SP reported a news article on blogs and the new PlanetGridPP blog. SP
asked about the situation relating to an article on the Site Reviews.
Information generally was not yet available for release. It was agreed
that SP would not be able to point to all detailed feedback; DB's summary
of outcomes could be the basis for a news item. It was noted that not all
of the positive outcomes had been documented. SP will draft an item and draw
together the positive aspects of the Review, using some specific examples
- but release of information would be checked with sites. It had been
agreed that there would be a joint NGS/STFC stand at EGEE07. Neasan
O'Neill had produced a new website for LHC@Home, and the statistics were
also working now. Last Monday there had been a meeting of the LHC
Promotion Group regarding Grid promotion - a strategy document will be
drawn up with key messages. The Parliamentary POSTnote had been published
last week and there will be a link on the 'documents' page. An article is
being done for GridPP news and iSGTW.
SI-2 Tier-1 Manager's Report
-----------------------------
AS provided the following report:
Hardware: Regarding the 10Gb path from Tier-1 to SJ5, they were currently
waiting for the network group to finish testing.
The RAL networking group are still in the process of obtaining a public AS
number in order that the Tier-1 can route Tier-1 -> Tier-1 traffic via the
OPN. This would be raised at the meeting on Wednesday (11th July).
The pre-qualification stage of the disk and CPU tenders closed Friday 29th
June. Evaluation is underway. AS reported three issues: 1) state of
evaluation; 2) tape planning; 3) input from the Tier-1 Board regarding
Tender Documents. It was noted that there is a Tier-1 Procurement Team
Meeting on Tuesday afternoon (10th July).
A tender to set up a Framework Purchasing agreement for tape media has now
commenced. This is expected to be able to deliver media in 2007Q4. 50% of
an interim purchase of 300TB of tape media has now been received and the
remainder is expected this week.
Service: SAM availability for the last 7 days was 96% (94%?). Reliability
for June (as measured by WLCG) was 87% - the average for the best 8 sites
was also 87%. Main impact was caused by the network outage in the middle
of the month - load related problems on the CE also contributed.
Regarding CASTOR:
The CMS CASTOR instance had some problems under the highest CMS load tests
of a week ago. However it has subsequently been stable and we are now
working to understand throughput rates, which CMS believe are still
insufficient to meet their CSA07 objectives. Further load testing is
scheduled. The standalone CASTOR for ATLAS is being tested by ATLAS. The
standalone CASTOR for LHCB is built and has had basic functionality tests
completed by the CASTOR team. Further load tests will be carried out by
the CASTOR team and it will then be released to LHCB for testing.
BDII: All 3 top-level BDII servers have now been upgraded to the latest
release. Load on the BDII servers appears to be low and there do not
appear to be timeout problems at the Tier-1 since the upgrade.
RB: Both rb01 and rb02 were back in production last week. rb03 was
brought online for Alice. Over the weekend rb01 broke again and we are now
looking to move LHCB production work off these servers to rb03 to reduce
the load further. We also note that this morning both rb01 and rb02 are
flagged as OK by SAM but marked as Bad by SL's tests; this discrepancy is
not yet understood. The current strategy is to spread the load and keep
things going until WMS is available. SL4 is running and is available
externally - testing is commencing.
SI-3 Production Manager's Report
---------------------------------
JC commented on AS's report (above) by noting that the Alice RB problems
had not been their fault - JC would re-check the BDII timeouts, as the
reports are not currently working and so provide no information.
JC reported as follows:
1) The issue of SL4 rollout was discussed at the GDB last week. The
experiments all claimed to be ready but the holding point on sites
deploying SL4 is confirmation of additional dependencies the
experiments may have on the OS over what is required for the gLite
middleware (in earlier middleware, additional packages were included in
a release to ensure that the experiment software computing environment
requirements were met). There is particular concern about circular
dependencies which may lead to incompatible requirements. To make
progress a series of SL4 WNs have been set up for the experiments to
test against - this is being done at LAL and RAL Tier-1 (Birmingham
will join this week). Experiments were asked to upload known
dependencies to their CIC portal ID card but so far only LHCb has done
it.
There was a discussion of Experiment requirements - a list from ATLAS had
been provided showing all of the libraries and links that they needed.
LHCb had also sent in a requirements list. It was noted that SL4 is
currently meeting ATLAS requirements and many sites have already installed
SL4. JC noted that he was not confident about the non-LHC Experiments.
TD noted that we need to push ahead anyway now. JC noted that the phased
transition would be discussed at the Deployment Board meeting on Thursday
(12th July).
Status for RAL WNs: ALICE added the queue to their production system. LHCb
agreed to run dedicated tests when production staff return from holiday.
Without dedicated testing we cannot be sure that the jobs currently running
exercise all classes of job (they may simply be whatever the matchmaking
assigns). This morning
200+ jobs were queued for 6 job slots. CMS have not communicated any
specific requirements. Before any migration can happen for the Tier-1 it
needs to be confirmed that the other non-LHC experiments work without
problem on SL4.
2) glexec on WNs is the subject of a lot of discussion at the moment. We
are trying to understand the principal objections. The real sticking
point appears to be whether glexec can easily (i.e. as a default) be
installed in non-SUID mode. SUID mode allows UID switching and is
frowned upon especially at non-HEP dedicated sites. In contrast other
sites in WLCG/EGEE require the job to always run under the ID of the
person whose work is being run. This issue was to be discussed at the
Deployment Board meeting on Thursday (12th July).
3) Since the move to GOCDB3 there have been problems creating the UKI tree
structure needed for the ROC reports. The accounting data for most/all
sites also seems to have stopped updating as seen in the site charts in
the portal.
4) As reported previously Glasgow has encouraged a number of groups to
join the gridpp VO to test the infrastructure. A significant amount of
work now seen at Glasgow is from this VO - the site remains full while
most other UK sites have plenty of spare capacity. Last week Graeme
Stewart managed to get MPI jobs running (required by engineers) at
Glasgow which is likely to further increase usage.
5) The question of specInt ratings is being raised once again as the T2
Co-ordinators fill out the Q2 report. The values being used by the T2s
differ, and this clearly impacts the overall site and Tier-2 KSI2K. If
the KSI2K figures are being used for Tier-2 hardware allocations then
do we need to do better benchmarking?
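The scale of the problem can be sketched with a small example (the per-core SpecInt2000 ratings below are purely illustrative, not the figures any T2 actually reported):

```python
def ksi2k(cores, si2k_per_core):
    """Aggregate capacity in KSI2K: cores x (SpecInt2000 rating / 1000)."""
    return cores * si2k_per_core / 1000.0

# Two sites with identical hardware (500 cores), whose co-ordinators
# assume different per-core SI2K ratings for the same CPU model:
site_a = ksi2k(500, 1500)   # 750.0 KSI2K
site_b = ksi2k(500, 1800)   # 900.0 KSI2K
print(site_a, site_b)       # same kit, a 20% spread from the rating alone
```

If allocations are driven by these totals, the choice of rating matters as much as the hardware itself, which is the case for better benchmarking.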
6) The introduction of faster cores means that historical batch queue
limits need revisiting. TD noted that the given default time should be
retained - downstream the problem was concatenating files. JC to feed
this back to Graeme - and this was being discussed at the DTeam meeting
as well. TD noted that it should not require revisiting as the defaults
should remain unchanged.
7) The RAL-PPS instance of the PPS SAM testing framework is now up and
running.
8) SL joined the dteam VO to run his jobs outside of the ATLAS
environment. This led to the discovery of various problems, including
some with the use of VOMS/Gridmap files and edg-job-submit. There is one
remaining problem with use of the Glasgow RB that needs further
investigation.
9) There is a deployment board meeting in London this Thursday. The agenda
is here: http://indico.cern.ch/conferenceDisplay.py?confId=18446
10) There were FTS problems (~24hrs) last week. The CERN grid service
operators did not notice that a host certificate for the production
service was about to expire; it duly did, with obvious repercussions for
the MyProxy service. JG noted that it is better to have unwanted tickets
than to have these problems.
11) Finally JC has received several questions from people involved in
deployment roles who are still unsure where they stand with GridPP3
continuation of their posts. [see item 3, above]
SI-4 LCG Management Board Report
---------------------------------
JG reported that he had presented a document regarding the policy of
killing jobs. The feedback was that the VOs wanted to know what was going
wrong so that they could fix it, rather than the jobs simply being killed.
The VOs want to work with GridPP to resolve these issues. It was noted
that we need to flag when jobs are cancelled otherwise the Experiments
don't know why jobs have been cancelled. TD noted that we can get
statistics from Tier-1 regarding jobs, but rather than average efficiency,
we need profiled jobs. TD noted that the cut is on 2.7% efficiency, and
all that is required is a histogram to be inserted into the document. It
was agreed that AS would speak to Matt Hodges. DK noted that this issue
would also be discussed at the Deployment Board - but it was noted that it
was a User Board issue too.
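The point about profiled jobs rather than average efficiency can be illustrated with a small sketch on synthetic data (the numbers are invented for illustration and do not come from the Tier-1):

```python
# Synthetic per-job CPU/wall-time efficiencies: a mix of healthy jobs
# (~90%) and stalled jobs (~3%). The mean alone hides the stalled cluster.
effs = [0.90] * 80 + [0.03] * 20

mean = sum(effs) / len(effs)        # ~0.726: looks tolerable in isolation
# A histogram in 10%-wide bins exposes the stalled population directly.
bins = [0] * 10
for e in effs:
    bins[min(int(e * 10), 9)] += 1

print(f"mean={mean:.3f}", bins)     # bin 0 holds the 20 stalled jobs
```

A bimodal distribution like this is exactly what a single average conceals and what the requested histogram in the document would make visible.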
JG reported on an action to set up SLAs to run VO boxes. A presentation
had been given regarding security etc. JG asked whether all of the
Tier-1s have SLAs. The issue for the future would be to have a generic
one.
JG reported that there had been a talk on OSG site validation; and SRM2.2
issues/options had also been discussed.
SI-5 Documentation Officer's Report
------------------------------------
It was noted that SB had been away at CERN.
REVIEW OF ACTIONS
=================
247.2 RJ to get further information from ATLAS regarding use of Grid for
testing of PANDA, and report-back. This is not a live topic and it was
agreed to initiate a new listing of 'Inactive' items. This to be moved to
that category.
250.4 RJ, DN, GP, TD to meet to integrate experiment requirements of
Tier-2s going to Tier-1 - sites are aware of requirements but discussion
still has to take place. It was noted that this issue is not high
priority. A meeting is to take place with Barney Garrett - this is
ongoing and still to be arranged.
251.1 TD to raise the issue of memory vs CPU cost at the MB [in order to
work out what the requirement was between 1GB and 2GB memory per core].
It was agreed this to be moved to 'Inactive' category.
252.3 RM has now received inputs for his one-page summary regarding the
transition of each of the existing Middleware areas from GridPP2 to
GridPP2+ to GridPP3 - this to go to DB. Ongoing.
253.1 AS has commenced work on the report on data integrity at Tier-1, in
relation to implementation of checksums. AS is still working on this and
it will take a further couple of weeks to complete. This is ongoing, and
AS hopes it will be finished by the end of August. It was agreed to move
this to 'Inactive' category.
254.2 ALL PMB members have now signed up to EVO. Tests were ongoing but
this action is on hold due to H323 requirements which must be resolved.
JG/RM will resolve EVO issues. RJ reported that he had joined an
evaluation group on EVO and asked that all information should be sent to
him to enable him to document the problems involved. It was agreed that
an EVO test would take place the week after next (PMB) as next week's
meeting was a short one due to the CB meeting at 2.00 pm.
259.5 JC to provide recommendations to the PMB on PPS testing and a
summary of what is currently available on the system. Ongoing.
260.1 RM to provide final feedback for site reviews to SL for
https://www.gridpp.ac.uk/tier2/Readiness_Reviews/index.html. Ongoing.
260.3 RM, NG, TD, DK to inform SL which site-review information is
public/private. Ongoing.
261.1 TD and JG to prepare a PMB statement for the MB
regarding SL4 releases of basic middleware, which were still awaited and
were an issue at sites. JG reported that he would be doing this for
tomorrow. Sites should be encouraged to proceed with SL4 upgrades which
are to be tracked by JC. JG will give a summary statement to the MB as to
what we believe the current situation is - this will include 'SL5 on
hold'.
261.2 DN, RJ, GP: An action on the experiments to define the future
outlook for 64-bit applications and resultant effects on hardware
purchasing. Experiment reps to define the outlook. It was noted that the
priority is 32-bit at the moment; there is no advantage to 64-bit. A
short statement is required.
261.4 DB to look through the input in detail in relation to GGUS problems.
Ongoing.
261.5 JC and dTeam to carry out a survey on sites' experiences of GGUS,
when possible to organise. Ongoing.
261.6 JC to look into the issue of 2-hour response timing @ Tier-2 sites
and understand the problem in greater detail - sites also need to
understand what the two-hour response time actually means. This may come
up at the next Board meeting. Ongoing.
261.11 SL to progress receipt of final site documents from SouthGrid and
London T2 which were still outstanding. It was noted that SL was still
awaiting information.
261.13 DK to progress receipt of ScotGrid feedback. Ongoing.
261.14 RM to progress receipt of LT2 feedback. Ongoing.
261.16 JG to progress the issue of someone getting involved in the SLA
(ROC) working group.
261.17 JC to assess the general effectiveness of RSS feeds and
subscription-based updates, in relation to GridPP blogs. It was noted
that blogs are aggregated: PlanetGridPP is the mechanism, but RSS-feeds
that can be subscribed to don't exist. JC will bring this up at the
Deployment Board meeting.
262.2 SL to clarify GridPP contribution (what is accounted rather than
what is available) with the Tier-2 Board. Ongoing.
262.3 DK to raise items (12) [re accounted GridPP contribution] and (22)
[re site availability via SAM tests] at the Deployment Board in two weeks'
time. This was on the Agenda for discussion at the DB. Done, item
closed.
262.4 JC to ascertain the specific problems in relation to Condor support
issues. JC awaiting feedback. Ongoing.
262.5 Regarding poor response time of middleware developers: DK to
propose the following recommendation to the Deployment Board: to recommend
that if specific issues were involved, GGUS should be used. If issues were
general, the TCG representative at the Tier-2 site should be informed.
The TCG rep in turn should raise the issue as appropriate at the TCG
meetings. This was on the DB Agenda for discussion. Ongoing.
262.6 JC to raise the issue of PPS feedback information relating to
upgrade issues with the relevant individual(s) on the PPS, and ask if
there was anything else that could be done. Ongoing.
262.7 AS to speak to procurement and warn them that sites might want to
make parallel purchases - a sentence could be added to the tender
document. AS still to talk to procurement - ongoing.
262.9 non-Grid access relating to VOs. A document is to be produced detailing
this issue as VOs need a mechanism 'in'. AS to detail the issue in a
separate report and circulate to the PMB. What can and can't be offered
to non-Grid users: detail is required - AS still to do. Ongoing.
262.10 Regarding user communication/info provision, JC suggested amending
the emphasis of the UB to be more in touch with users generally - it was
agreed that he would raise this with Glen. Glen will be there on
Thursday, JC will speak to him then.
262.11 SB to add a new Document to the PMB Documents, No 114, relating to
a documentation report overview on current status. Ongoing.
263.1 Robin Tasker to re-circulate his paper regarding the RAL-CERN OPN
link, once further information was available. What is the timescale for
this? PC to review the Minutes and discuss with Robin Tasker.
263.2 JG to further investigate the lack of ability to pass job
requirements to the batch system and report-back (Tier-2 review issue).
JG will raise this through the GDB. Ongoing.
ACTIONS AS AT 09.07.07
======================
250.4 RJ, DN, GP, TD to meet to integrate experiment requirements of
Tier-2s going to Tier-1 - sites are aware of requirements but discussion
still has to take place. It was noted that this issue is not high
priority. A meeting is to take place with Barney Garrett - this is
ongoing and still to be arranged.
252.3 RM has now received inputs for his one-page summary regarding the
transition of each of the existing Middleware areas from GridPP2 to
GridPP2+ to GridPP3 - this to go to DB. This was to be done by Friday 8th
June but is still ongoing.
254.2 ALL PMB members have now signed up to EVO. Tests were ongoing but
this action is on hold due to H323 requirements which must be resolved.
JG/RM will resolve EVO issues. RJ reported that he had joined an
evaluation group on EVO and asked that all information should be sent to
him to enable him to document the problems involved. It was agreed that
an EVO test would take place the week after next (PMB) as next week's
meeting was a short one due to the CB meeting at 2.00 pm.
259.5 JC to provide recommendations to the PMB on PPS testing and a
summary of what is currently available on the system.
260.1 RM to provide final feedback for site reviews to SL for
https://www.gridpp.ac.uk/tier2/Readiness_Reviews/index.html.
260.3 RM, NG, TD, DK to inform SL which site-review information is
public/private.
261.1 TD and JG to prepare a PMB statement for the MB
regarding SL4 releases of basic middleware, which were still awaited and
were an issue at sites. JG reported that he would be doing this for
tomorrow. Sites should be encouraged to proceed with SL4 upgrades which
are to be tracked by JC. JG will give a summary statement to the MB as to
what we believe the current situation is - this will include 'SL5 on
hold'.
261.2 DN, RJ, GP: An action on the experiments to define the future
outlook for 64-bit applications and resultant effects on hardware
purchasing. Experiment reps to define the outlook. It was noted that the
priority is 32-bit at the moment; there is no advantage to 64-bit. A
short statement is required.
261.4 DB to look through the input in detail in relation to GGUS problems.
261.5 JC and dTeam to carry out a survey on sites' experiences of GGUS,
when possible to organise.
261.6 JC to look into the issue of 2-hour response timing @ Tier-2 sites
and understand the problem in greater detail - sites also need to
understand what the two-hour response time actually means.
261.11 SL to progress receipt of final site documents from SouthGrid and
London T2 which were still outstanding. It was noted that SL was still
awaiting information.
261.13 DK to progress receipt of ScotGrid feedback.
261.14 RM to progress receipt of LT2 feedback.
261.16 JG to progress the issue of someone getting involved in the SLA
(ROC) working group.
261.17 JC to assess the general effectiveness of RSS feeds and
subscription-based updates, in relation to GridPP blogs. It was noted
that blogs are aggregated: PlanetGridPP is the mechanism, but RSS-feeds
that can be subscribed to don't exist. JC will bring this up at the
Deployment Board meeting.
262.2 SL to clarify GridPP contribution (what is accounted rather than
what is available) with the Tier-2 Board.
262.4 JC to ascertain the specific problems in relation to Condor support
issues.
262.5 Regarding poor response time of middleware developers: DK to
propose the following recommendation to the Deployment Board: to recommend
that if specific issues were involved, GGUS should be used. If issues were
general, the TCG representative at the Tier-2 site should be informed.
The TCG rep in turn should raise the issue as appropriate at the TCG
meetings.
262.6 JC to raise the issue of PPS feedback information relating to
upgrade issues with the relevant individual(s) on the PPS, and ask if
there was anything else that could be done.
262.7 AS to speak to procurement and warn them that sites might want to
make parallel purchases - a sentence could be added to the tender
document.
262.9 non-Grid access relating to VOs. A document is to be produced detailing
this issue as VOs need a mechanism 'in'. AS to detail the issue in a
separate report and circulate to the PMB. What can and can't be offered
to non-Grid users: detail is required - AS still to do.
262.10 Regarding user communication/info provision, JC suggested amending
the emphasis of the UB to be more in touch with users generally - it was
agreed that he would raise this with Glen.
262.11 SB to add a new Document to the PMB Documents, No 114, relating to
a documentation report overview on current status.
263.1 Robin Tasker to re-circulate his paper regarding the RAL-CERN OPN
link, once further information was available. What is the timescale for
this? PC to review the Minutes and discuss with Robin Tasker.
263.2 JG to further investigate the lack of ability to pass job
requirements to the batch system and report-back (Tier-2 review issue).
JG will raise this through the GDB. Ongoing.
264.1 DB to extract the issues and actions generated from the Tier-2
Review as discussed at the PMB and put these on the Tier-2 site.
264.2 DB to contact Janet again and remind her about the forthcoming CB
meeting and ask her advice about a formal statement re the plan V2.
264.3 JC noted that the Alice RB problems had not been their fault - he
would re-check the BDII timeouts, as the reports are not currently working
and so provide no information.
264.4 Regarding policy of killing jobs, statistics are required from
Tier-1, but rather than average efficiency we need profiled jobs. AS to
speak to Matt Hodges.
INACTIVE CATEGORY AS AT 09.07.07
================================
247.2 RJ to get further information from ATLAS regarding use of Grid for
testing of PANDA, and report-back.
251.1 TD to raise the issue of memory vs CPU cost at the MB [in order to
work out what the requirement was between 1GB and 2GB memory per core].
253.1 AS has commenced work on the report on data integrity at Tier-1, in
relation to implementation of checksums. Ongoing, AS hopes to complete
this by end August.
Next week's PMB (16.07.07) would be for 1 hour only due to the CB meeting
at 2.00 pm. EVO test the following week (23.07.07).
GridPP PMB Minutes 263 - 2nd July 2007
======================================
Present: Roger Jones, David Britton, David Kelsey, Dave Newbold, Tony Cass,
Robin Middleton, John Gordon, Glenn Patrick, Robin Tasker, Suzanne Scott
(Minutes)
Apologies: Tony Doyle, Sarah Pearce, Stephen Burke, Steve Lloyd,
Jeremy Coles, Peter Clarke, Andrew Sansum, Neil Geddes
1. UK Position on Resilience of the RAL-CERN Line
=================================================
Robin Tasker had produced a paper regarding the RAL-CERN OPN link. There
had been an outage in June - it was reported that French road repair men
had dug up the fibre and it was 48 hours before it was repaired. What
resilience was required to protect against outage? The lightpath from RAL
to CERN was summarised in RT's paper in terms of the problems involved,
but overall the link was fairly reliable. The paper addressed issues of
fibre infrastructure, with feasibility and costing confirmation awaited
from UKERNA. It was understood that outages could be infrequent, and
protecting the link would involve a large cost which might not be justified
if such protection was not generally required. RT was currently awaiting a
risk assessment of such a catastrophic fibre break - it was a question of
balancing risk and cost, and of how long an outage was likely
to last - how significant was an outage of 48 hours in June? JG noted
that breaks in the Tier-1 do result in dataflow issues to the other
Tier-1s. There was a discussion regarding steering data and storage. It
was agreed that the links need to be as reliable as possible within
reason. An outage of 1-2 hours or one day was acceptable, but for two
weeks, no. It was noted that the lightpath cannot be re-routed: if the
fibre breaks then the connection is lost. It was noted that bandwidth
might be an issue for the future. There was a discussion of routes into
CERN and cross-border fibres.
It was reported that JANET (UK) were providing figures to RT for a diverse
route by the end of the week. NetNorthWest and JANET will be able to give
a realistic assessment of risk. It was agreed that a decision should be
deferred until further information was available. RT will update his
paper with fuller information when it was available, and re-circulate.
2. Ongoing Review of Tier-2 Issues
==================================
In absentia, JC had submitted comments on the remaining issues.
18) Lack of ability to pass job requirements to the batch system - JG
noted that the gLite CE can pass information. The RB looks at the
user requirement and matches it to a queue. It was noted that the
system fills with jobs that can't be optimised. JG would investigate
this issue further and report-back.
19) Virtualisation - UCL had wanted to know GridPP direction/support in
this area. JC noted that Marian had started looking at
virtualisation. He currently has some nodes on the PPS which are on
virtual machines - his intention was to put the PPS SAM client in such
an environment. It was noted that Grid-Ireland also had a lot of
experience in this area which GridPP could draw upon. JC reported that
there might be some support available via the TB-SUPPORT list and
helpdesk, but at the moment we are still looking at this area and do
not have a definite direction. It was agreed that this is largely
uncharted territory for GridPP and a diversion away from the standard
GridPP environment. In abeyance at present.
20) Changing Experiment requirements - JC noted that this might relate to
such things as the ATLAS ACL change requests. Some sites thought there
needed to be more structure to change requests. VO views might be
cited as another area where difficulties have been encountered. There
was also the difficulty of consistency of feedback - on SL4 JC has
heard different positions depending on who he talks to within an
experiment. It was reported that the 39 Tier-2s in CMS are in regular
contact. JG summarised that this was an issue more for the Experiments
to deal with.
21) Level of noise for site problems - JC noted that this covered things
like false-positive problems in the site SAM results. It was agreed
that people are paying more attention now to the SAM results.
Issues should be raised in the weekly Ops reports meetings.
22) Definition of 'what is available' - JC asked whether, if sites are
going to be measured against a single measure of availability, that should
be the number coming from GridView (even though there are many questions
about how accurately it measures availability for the experiments). It was
agreed that, yes, GridView and the SAM reports come from the same
database, but if there is not a consistent query then you won't get
the same number out of the same data.
23) Enforcement of MoUs/SLAs - JC noted that the process is known but
other than getting less funding in the future, were there any other
enforcement options? It was agreed that this issue was not for public
debate at present.
3. Killing Jobs
================
It was reported that TD had sent a draft policy to the WLCG Management
Board. It was noted that killing stalled jobs was treating the symptom
rather than the problem. Some feedback had been received, it was
understood that the policy intention was to try to improve efficiency at
sites. It was noted that Tier-2s have fewer staff and VOs send jobs in.
The issue would be discussed at the face-to-face MB meeting tomorrow. It
was noted that the dashboard was an answer to cross-VO problems but the
Experiments don't know who is running jobs. It was agreed that it was not
right if it became the normal procedure to kill off jobs as a matter of
course.
4. AOB
=======
RJ reported that Liverpool had asked for some GridPP funding for pre-spending.
DB noted that this was not possible as no official word had been received from
STFC with regard to allocations. It was agreed that nothing could be done
until GridPP know officially what the scale of expenditure is.
STANDING ITEMS
==============
SI-1 Dissemination Officer's Report
------------------------------------
It was noted that SP was not present.
SI-2 Tier-1 Manager's Report
-----------------------------
In absentia, AS had sent in the following report on Friday 29th:
Hardware - Regarding the 10Gb path from the Tier-1 to SJ5, it was
reported that they were currently waiting for the network group to
finish testing. They were
currently working on implementing the firewall configuration as a set of
router filters.
The RAL networking group were in the process of obtaining a public AS
number so that the Tier-1 could route Tier-1 -> Tier-1 traffic over
the OPN. The Tier-1 was still waiting for the networking group to
complete this work.
The pre-qualification stage of the disk and CPU tenders closed on Friday
29th. Evaluation will start w/c 2nd July.
The Tape service was down last Tuesday for a firmware update.
Service - SAM availability for the last 7 days was 93% (some overlap with
previous 7 days reported).
Regarding CASTOR: A stand-alone 2.1.3 release of CASTOR for CMS had been
implemented and is undergoing testing. Results were very encouraging with
high rates achieved (400MB/s writing - concurrent with 300MB/s to tape
followed by >700MB/s reading). Reliability has been excellent, far better
than any previous tests with CMS. However, so far only native rfio load
tests have been tried, and we need to see good results with gridftp/srm/fts
before feeling confident that we have a good working production-ready
release.
A standalone 2.1.3 release for ATLAS is currently being worked on. This
was delayed by technical problems but is now nearly complete and will be
tested soon.
We have reviewed the hardware capacity available to deploy a 2.1.3
stand-alone instance for LHCb. Tier-1 batch workers will be
redeployed temporarily. Work on this will commence once the ATLAS instance
is complete. It is expected to go faster as documentation and processes
have now been improved.
Regarding dCache: all is OK - but it is apparently not being used by ATLAS
production. We are following this up.
BDII: We have seen some timeouts on the top-level BDII. These are load
related, probably caused by the LHCb VO box. One BDII has been updated to
the latest release and has seen a significant reduction in CPU load. If it
remains stable then the two remaining hosts will be updated shortly.
RB: rb01 is currently under sysdev having its database cleaned. rb02 is
struggling to cope with the load on its own. rb03 is deployed and is
currently being tested. Once testing is complete, ALICE production will
be moved to it. We may also move LHCb production.
LFC: Problems reported on Monday were resolved (on Monday). Cause was a
faulty gLite update.
SI-3 Production Manager's Report
---------------------------------
In absentia, JC sent in the following report:
1) We are pursuing two security-related matters raised in the UK. The
submitters are concerned that there has been no result (patch) for one
and a lack of discussion of the other. There has actually been some
progress on both, but this particular problem has highlighted a need
to review procedures and communication in this area.
Another issue being faced generally is how we are supposed to deal with
vulnerabilities in VO/experiment code.
2) BDII timeouts appear to be affecting UK sites again (causing lcg-rm
tests to fail for several sites).
3) The main things to note from the UKI monthly meeting last week
(http://indico.cern.ch/conferenceDisplay.py?confId=17879) are that the
UK helpdesk will now move to chase/close tickets where the ticket
submitter has not responded to the agent's response (e.g. after a site
has been waiting on a user to confirm a fix), and that generally sites are
finding it difficult to keep up with constant changes in YAIM and the
middleware. Sites have been encouraged to check their storage data
being published to the storage accounting portal
(http://goc02.grid-support.ac.uk/accountingDisplay/view.php?queryType=storage)
and report any problems.
4) GOCDB3 (https://goc.gridops.org/) went live last week on Wednesday. We
have seen an increase in tickets to the UKI ROC as users point out
minor issues but so far the release seems to have been well planned and
has gone smoothly.
5) There are two monthly grid-deployment-related meetings at CERN this
week. A storage workshop runs Monday and Tuesday
(http://indico.cern.ch/conferenceDisplay.py?confId=16456) with both SRM
developers and representatives from the experiments present. Grieg
Cowan will present on "GridPP sites: experience running dCache, DPM,
and StoRM". Then on Wednesday is the July Grid Deployment Board meeting
(http://indico.cern.ch/conferenceDisplay.py?confId=8485) with a focus
on accounting and security. There will be surrounding discussions on WN
utilisation, the OPN and a summary from the storage workshop.
SI-4 LCG Management Board Report
---------------------------------
See https://twiki.cern.ch/twiki/bin/view/LCG/MbMeetingsMinutes
SI-5 Documentation Officer's Report
------------------------------------
It was noted that SB was not present.
REVIEW OF ACTIONS
=================
247.2 RJ to get further information from ATLAS regarding use of Grid for
testing of PANDA, and report-back. RJ reported that this was ongoing and
nothing would be happening regarding it in the near future.
250.4 RJ, DN, GP, TD and TC to meet to integrate experiment requirements
and work on Tier-2 networks - sites are aware of requirements but
discussion still has to take place. Ongoing when convenient to arrange.
It was noted that this issue is not high priority.
251.1 TD to raise the issue of memory vs CPU cost at the MB [in order to
work out what the requirement was between 1GB and 2GB memory per core].
Ongoing.
252.3 RM has now received inputs for his one-page summary regarding the
transition of each of the existing Middleware areas from GridPP2 to
GridPP2+ to GridPP3 - this to go to DB. Ongoing.
253.1 AS has commenced work on the report on data integrity at Tier-1, in
relation to implementation of checksums. Ongoing.
254.2 ALL PMB members have now signed-up to EVO. Tests were ongoing but
this action is on hold due to H.323 requirements which must be resolved.
JG has resolved EVO H.323 issues at RAL. It was noted that there had been
a further EVO test today (2/7) but JG was the only one to join.
255.3 DK to get approval from groups regarding Grid Site Operations policy
and report-back. Obligations are on the site to carry forward issues.
It was reported that all sites had now been consulted. Final project
approval was currently happening. Done, item closed.
256.1 NG to review the draft of the new Grid Security Policy from NGS
perspective, and SL from Tier-2, and report-back. NG had reported at the
F2F. Done, item closed.
258.6 JC to discuss RAL RB issues with Catalin Condurache and bring
conclusions back to the PMB. In absentia JC reported that the recent RB
problems are thought to be due to ALICE hammering the RB until it fails.
It is proving difficult to validate this due to poor RB VO monitoring. The
urgency to fix problems seen by users is now recognised and the T1
procedure will not always be to wait until queues are empty if a component
is being problematic. Another issue here is that UIs are not being
configured properly to take account of the load balanced nature of the
RBs. ALICE and LHCb are having their own RBs installed. This is now
closed.
259.5 JC to provide recommendations to the PMB on PPS testing and a
summary of what is currently available on the system. JC will also
forward the chat window location to the PMB via email. The link that was
circulated is
http://egee-pre-production-service.web.cern.ch/egee-pre-production-service/.
Ongoing.
260.1 RM, NG to provide final feedback for site reviews to SL for
https://www.gridpp.ac.uk/tier2/Readiness_Reviews/index.html. This was 'in
progress' - NG action done; RM ongoing.
260.3 RM, NG, TD, DK to inform SL which site-review information is
public/private. Ongoing.
260.4 JG (not JC) to re-start Castor Strategy meetings. Done, item
closed.
261.1 TD and JG to prepare a PMB statement for the MB
regarding SL4 releases of basic middleware, which were still awaited and
were an issue at sites. Ongoing.
261.2 DN, RJ, GP: An action on the experiments to define the future
outlook for 64-bit applications and resultant effects on hardware
purchasing. Experiment reps to define the outlook. There was a
discussion re SL4 & SL5 - ongoing.
261.4 DB to look through the input in detail in relation to GGUS problems.
Ongoing.
261.5 JC and dTeam to carry out a survey on sites' experiences of GGUS,
when possible to organise. In absentia JC reported that a dialogue has
been started but it will take a few weeks to close this action. Ongoing.
261.6 JC to look into the issue of 2-hour response timing @ Tier-2 sites
and understand the problem in greater detail - sites also need to
understand what the two-hour response time actually means. Ongoing.
261.7 DK to ask Mingchao Ma, the new GridPP Security Officer, to contact
sites and check they have security incident response systems in place.
It was understood that this would happen naturally
in due course. Item closed.
261.8 JC to talk to Pete Gronbech and Alessandra Forti regarding
Monitoring/Nagios/Ganglia training, to include someone from GridView. In
absentia JC reported that this had been discussed with Pete and Alessandra
and also at the UKI meeting. There is support for this around the next
HEPSYSMAN meeting. We will start working on the agenda. Action can be
closed.
261.11 SL to progress receipt of final site documents from SouthGrid and
London T2 which were still outstanding. It was noted that this was a
duplicate of an earlier action, but was still ongoing.
261.12 NG to progress receipt of SouthGrid feedback. Done, item closed.
261.13 DK to progress receipt of ScotGrid feedback. Ongoing.
261.14 RM to progress receipt of LT2 feedback. Ongoing.
261.15 SL to send an email to sites who still had to provide final
versions of the Questionnaire response (list above), informing them that
the current version would be considered final unless a revised one was
provided by Friday 22nd June. Done, item closed.
261.16 JC to speak to Steve McAllister about getting involved in the SLA
(ROC) working group. In absentia JC reported that he had spent an hour
with Steve last week but it is not clear that he is the right person to
work on SLA issues for the ROC. This should be the ROC manager. It was
agreed that JG would progress this.
261.17 JC to assess the general effectiveness of RSS feeds and
subscription-based updates, in relation to GridPP blogs. Ongoing.
262.1 RM to draft an extra line for the Travel Policy relating to Tier-2
staff/Experiment contact. Done, item closed.
262.2 SL to clarify GridPP contribution (what is accounted rather than
what is available) with the Tier-2 Board. Ongoing.
262.3 DK to raise items (12) [re accounted GridPP contribution] and (22)
[re site availability via SAM tests] at the Deployment Board in two weeks'
time. Still to be done.
262.4 JC to ascertain the specific problems in relation to Condor support
issues. In absentia JC reported that he was still working on this. So far
he had contacted two other EGEE sites that are using or trying to use
Condor, and had asked Santanu to distill the main issues Cambridge is
having with Condor as a batch system. Ongoing.
262.5 Regarding poor response time of middleware developers: DK to
propose the following recommendation to the Deployment Board: to recommend
that if specific issues were involved, GGUS should be used. If issues
were general, the TCG representative at the Tier-2 site should be
informed. The TCG rep in turn should raise the issue as appropriate at
the TCG meetings. Ongoing.
262.6 JC to raise the issue of PPS feedback information relating to
upgrades issues with Pete on the PPS, and ask if there was anything else
that could be done. In absentia JC reported that he had talked with Yves
and Marian but there was nothing conclusive yet about how to take this
forward. Marian reinstalls each time and Yves is already inputting
experiences into the wiki (such as with DNS style VO configuration).
Ongoing.
262.7 AS to speak to procurement and warn them that sites might want to
make parallel purchases - a sentence could be added to the tender
document. Ongoing.
262.8 A statement is to be prepared for the MB relating to SAM
availability for the last 7 days (62%) - AS to send an email to JG, JC and
TD. [This was mainly caused by the failure of the RAL-CERN line, which
was down in excess of 48 hrs from 20/06/2007 10:17:54 to approximately
22/06/2007 15:00:00.] Done, item closed.
262.9 Grid access relating to VOs. A document is to be produced detailing
this issue, as VOs need a mechanism 'in'. AS to detail the issue in a
separate report and circulate to the PMB. Ongoing.
262.10 Regarding user communication/info provision, JC suggested amending
the emphasis of the UB to be more in touch with users generally - it was
agreed that he would raise this with Glenn. In absentia JC reported that
he would talk with Glenn next week when at RAL. Ongoing.
262.11 SB to add a new Document to the PMB Documents, No 114, relating to
a documentation report overview on current status. Ongoing.
ACTIONS AS AT 09.07.07
======================
247.2 RJ to get further information from ATLAS regarding use of Grid for
testing of PANDA, and report-back.
250.4 RJ, DN, GP, TD and TC to meet to integrate experiment requirements
and work on Tier-2 networks - sites are aware of requirements but
discussion still has to take place. Ongoing when convenient to arrange.
It was noted that this issue is not high priority.
251.1 TD to raise the issue of memory vs CPU cost at the MB [in order to
work out what the requirement was between 1GB and 2GB memory per core].
252.3 RM has now received inputs for his one-page summary regarding the
transition of each of the existing Middleware areas from GridPP2 to
GridPP2+ to GridPP3 - this to go to DB. This will be done by Friday 8th
June.
253.1 AS has commenced work on the report on data integrity at Tier-1, in
relation to implementation of checksums.
254.2 ALL PMB members have now signed-up to EVO. Tests were ongoing but
this action is on hold due to H.323 requirements which must be resolved.
JG/RM will resolve EVO issues.
259.5 JC to provide recommendations to the PMB on PPS testing and a
summary of what is currently available on the system.
260.1 RM to provide final feedback for site reviews to SL for
https://www.gridpp.ac.uk/tier2/Readiness_Reviews/index.html.
260.3 RM, NG, TD, DK to inform SL which site-review information is
public/private.
261.1 TD and JG to prepare a PMB statement for the MB
regarding SL4 releases of basic middleware, which were still awaited and
were an issue at sites.
261.2 DN, RJ, GP: An action on the experiments to define the future
outlook for 64-bit applications and resultant effects on hardware
purchasing. Experiment reps to define the outlook.
261.4 DB to look through the input in detail in relation to GGUS problems.
261.5 JC and dTeam to carry out a survey on sites' experiences of GGUS,
when possible to organise.
261.6 JC to look into the issue of 2-hour response timing @ Tier-2 sites
and understand the problem in greater detail - sites also need to
understand what the two-hour response time actually means.
261.11 SL to progress receipt of final site documents from SouthGrid and
London T2 which were still outstanding.
261.13 DK to progress receipt of ScotGrid feedback.
261.14 RM to progress receipt of LT2 feedback.
261.16 JG to progress the issue of (someone, not Steve McAllister - the
ROC manager?) getting involved in the SLA (ROC) working group.
261.17 JC to assess the general effectiveness of RSS feeds and
subscription-based updates, in relation to GridPP blogs.
262.2 SL to clarify GridPP contribution (what is accounted rather than
what is available) with the Tier-2 Board.
262.3 DK to raise items (12) [re accounted GridPP contribution] and (22)
[re site availability via SAM tests] at the Deployment Board in two weeks'
time.
262.4 JC to ascertain the specific problems in relation to Condor support
issues.
262.5 Regarding poor response time of middleware developers: DK to
propose the following recommendation to the Deployment Board: to recommend
that if specific issues were involved, GGUS should be used. If issues
were general, the TCG representative (Alessandra Forti) should be
informed. The TCG rep in turn should raise the issue as appropriate at
the TCG meetings.
262.6 JC to raise the issue of PPS feedback information relating to
upgrades issues with the relevant individual(s) on the PPS, and ask if
there was anything else that could be done.
262.7 AS to speak to procurement and warn them that sites might want to
make parallel purchases - a sentence could be added to the tender
document.
262.9 Grid access relating to VOs. A document is to be produced detailing
this issue, as VOs need a mechanism 'in'. AS to detail the issue in a
separate report and circulate to the PMB.
262.10 Regarding user communication/info provision, JC suggested amending
the emphasis of the UB to be more in touch with users generally - it was
agreed that he would raise this with Glenn.
262.11 SB to add a new Document to the PMB Documents, No 114, relating to
a documentation report overview on current status.
263.1 Robin Tasker to re-circulate his paper regarding the RAL-CERN OPN
link, once further information was available.
263.2 JG to investigate further the lack of ability to pass job
requirements to the batch system and report-back (Tier-2 review issue).
The next PMB would take place on Monday 16th July. The meeting closed at
2.00 pm.