Dear All,
Please find attached the latest weekly GridPP Project Management
Board Meeting minutes. The latest minutes can be found each week in:
http://www.gridpp.ac.uk/php/pmb/minutes.php?latest
as well as being listed with other minutes at:
http://www.gridpp.ac.uk/php/pmb/minutes.php
The previous minutes are at:
http://www.gridpp.ac.uk/pmb/minutes/070702.txt
Cheers, Tony
________________________________________________________________________
Prof. A T Doyle, FInstP FRSE GridPP Project Leader
Rm 478, Kelvin Building Telephone: +44-141-330 5899
Dept of Physics and Astronomy Telefax: +44-141-330 5881
University of Glasgow EMail: [log in to unmask]
G12 8QQ, UK Web: http://ppewww.physics.gla.ac.uk/~doyle/
________________________________________________________________________
GridPP PMB Minutes 264 - 9th July 2007
======================================
Present: Tony Doyle, Sarah Pearce, Roger Jones, Stephen Burke, David Britton,
David Kelsey, Steve Lloyd, Tony Cass, Robin Middleton, John Gordon,
Jeremy Coles, Peter Clarke, Andrew Sansum, Neil Geddes, Suzanne Scott (Minutes)
Apologies: Dave Newbold, Glenn Patrick
0. Approval of Previous Minutes
================================
It was agreed to send any amendments to SS by email, preferably by noon
tomorrow (Tue).
1. EGEE III Proposal
=====================
NG had circulated an email to UK/I EGEE partners. A workplan had been
refined by the PEB but the bids from federations were still in excess of it.
Two issues were involved: 1. trim the bids to reflect the programme of work;
2. trim the programme of work itself. The EGEE PMB had met last Friday
(6th July) in closed session to discuss the bids. For SA1, the conclusion
was to approve all of the work proposed by the activity leader - this
would be translated into euros which would provide the approved budget for
each bid. The meeting also discussed the Applications Support area.
Bids had been sent in which were not in the programme of work, and were
not well defended. For some of the other bids it was agreed that they
needed to be combined into one bid. Further discussion of this area will
happen this week. There had been a discussion on testbeds and other
non-(full)-production services which is likely to result in a
consolidation of these activities. The final budget table would be
discussed this week, and the next EGEE PMB meeting was scheduled for 16th
July.
2. Review of Tier-2 Issues
===========================
It was agreed that DB's list had been gone through and actions generated.
DB noted that JC had not been present at last week's meeting but his
comments had been incorporated in the Minutes. It was agreed that DB
would extract the issues and actions generated from the Review and put
these on the Tier-2 site.
Note: done, see
http://www.gridpp.ac.uk/tier2/Tier-2_Review_Issues_2007.doc (.pdf)
3. GridPP3 Planning
====================
DB had circulated an email. The indication was that no further formal
input from GridPP was required by STFC at this point. It was
understood that all of the money had been approved by PPRP and other
Committees but that the carry-forward of GridPP2 funds was not yet
quite confirmed. It was noted that a CB meeting was happening next
week and the funding issue would be raised with Group Leaders.
Everyone was aware that we have grants awaiting issue in 7 weeks' time.
It was agreed that DB would contact Janet Seed again to ask her advice
about a formal statement re the plan.
4. AOCB
========
None.
STANDING ITEMS
==============
SI-1 Dissemination Officer's Report
------------------------------------
SP reported a news article on blogs and the new PlanetGridPP blog. SP
asked about the situation relating to an article on the Site Reviews.
Information generally was not yet available for release. It was agreed
that SP would not be able to point to all detailed feedback; DB's summary
of outcomes could be the basis for a news item. It was noted that not all
of the positive outcomes had been documented. SP will draft an item and draw
together the positive aspects of the Review, using some specific examples
- but release of information would be checked with sites. It had been
agreed that there would be a joint NGS/STFC stand at EGEE07. Neasan
O'Neill had produced a new website for LHC@Home, and the statistics were
also working now. Last Monday there had been a meeting of the LHC
Promotion Group regarding Grid promotion - a strategy document will be
drawn up with key messages. The Parliamentary POSTnote had been published
last week and there will be a link on the 'documents' page. An article is
being done for GridPP news and iSGTW.
SI-2 Tier-1 Manager's Report
-----------------------------
AS provided the following report:
Hardware: Regarding the 10Gb path from Tier-1 to SJ5, they were currently
waiting for the network group to finish testing.
The RAL networking group are still in the process of obtaining a public AS
number in order that the Tier-1 can route Tier-1 -> Tier-1 traffic via the
OPN. This would be raised at the meeting on Wednesday (11th July).
The pre-qualification stage of the disk and CPU tenders closed Friday 29th
June. Evaluation is underway. AS reported three issues: 1) state of
evaluation; 2) tape planning; 3) input from the Tier-1 Board regarding
Tender Documents. It was noted that there is a Tier-1 Procurement Team
Meeting on Tuesday afternoon (10th July).
A tender to set up a Framework Purchasing agreement for tape media has now
commenced. This is expected to be able to deliver media in 2007Q4. 50% of
an interim purchase of 300TB of tape media has now been received and the
remainder is expected this week.
Service: SAM availability for the last 7 days was 96% (94%?). Reliability
for June (as measured by WLCG) was 87% - the average for the best 8 sites
was also 87%. Main impact was caused by the network outage in the middle
of the month - load related problems on the CE also contributed.
Regarding CASTOR:
The CMS CASTOR instance had some problems under the highest CMS load tests
of a week ago. However it has subsequently been stable and we are now
working to understand throughput rates, which CMS believe are still
insufficient to meet their CSA07 objectives. Further load testing is
scheduled. The standalone CASTOR for ATLAS is being tested by ATLAS. The
standalone CASTOR for LHCB is built and has had basic functionality tests
completed by the CASTOR team. Further load tests will be carried out by
the CASTOR team and it will then be released to LHCB for testing.
BDII: All 3 top-level BDII servers have now been upgraded to the latest
release. Load on the BDII servers appears to be low and there do not
appear to be timeout problems at the Tier-1 since the upgrade.
RB: Both rb01 and rb02 were back in production last week. rb03 was
brought online for Alice. Over the weekend rb01 broke again and we are now
looking to move LHCB production work off these servers to rb03 to reduce
the load further. We also note that this morning both rb01 and rb02 are
flagged as OK by SAM but marked as Bad by SL's tests; this discrepancy is
not yet understood. The current strategy is to spread the load and keep
things going until WMS is available. SL4 is running and is available
externally - testing is commencing.
SI-3 Production Manager's Report
---------------------------------
JC commented on AS's report (above) by noting that the Alice RB problems
had not been their fault - JC would re-check the BDII timeouts, as the
reports are not currently working and so provide no information.
JC reported as follows:
1) The issue of SL4 rollout was discussed at the GDB last week. The
experiments all claimed to be ready but the holding point on sites
deploying SL4 is confirmation of additional dependencies the
experiments may have on the OS over what is required for the gLite
middleware (in earlier middleware, additional packages were included in
a release to ensure that the experiment software computing environment
requirements were met). There is particular concern about circular
dependencies which may lead to incompatible requirements. To make
progress a series of SL4 WNs have been set up for the experiments to
test against - this is being done at LAL and RAL Tier-1 (Birmingham
will join this week). Experiments were asked to upload known
dependencies to their CIC portal ID card but so far only LHCb has done
it.
There was a discussion of Experiment requirements - a list from ATLAS had
been provided showing all of the libraries and links that they needed.
LHCb had also sent in a requirements list. It was noted that SL4 is
currently meeting ATLAS requirements and many sites have already installed
SL4. JC noted that he was not confident about the non-LHC Experiments.
TD noted that we need to push ahead anyway now. JC noted that the phased
transition would be discussed at the Deployment Board meeting on Thursday
(12th July).
Status for RAL WNs: ALICE added the queue to their production system. LHCb
agreed to run dedicated tests when production staff return from holiday.
Without dedicated testing we cannot be sure that the jobs currently running
exercise all classes of job (they may simply be whatever the matchmaking
assigns). This morning
200+ jobs were queued for 6 job slots. CMS have not communicated any
specific requirements. Before any migration can happen for the Tier-1 it
needs to be confirmed that the other non-LHC experiments work without
problem on SL4.
2) glexec on WNs is the subject of a lot of discussion at the moment. We
are trying to understand the principal objections. The real sticking
point appears to be whether glexec can easily (i.e. as a default) be
installed in non-SUID mode. SUID mode allows UID switching and is
frowned upon especially at non-HEP dedicated sites. In contrast other
sites in WLCG/EGEE require the job to always run under the ID of the
person whose work is being run. This issue was to be discussed at the
Deployment Board meeting on Thursday (12th July).
3) Since the move to GOCDB3 there have been problems creating the UKI tree
structure needed for the ROC reports. The accounting data for most/all
sites also seems to have stopped updating as seen in the site charts in
the portal.
4) As reported previously Glasgow has encouraged a number of groups to
join the gridpp VO to test the infrastructure. A significant amount of
work now seen at Glasgow is from this VO - the site remains full while
most other UK sites have plenty of spare capacity. Last week Graeme
Stewart managed to get MPI jobs running (required by engineers) at
Glasgow which is likely to further increase usage.
5) The question of specInt ratings is being raised once again as the T2
Co-ordinators fill out the Q2 report. The values being used by the T2s
differ, and this clearly impacts the overall site and Tier-2 KSI2K. If
the KSI2K figures are being used for Tier-2 hardware allocations then
do we need to do better benchmarking?
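The scale of the problem can be sketched with a small example (the per-core SpecInt2000 ratings below are purely illustrative, not the figures any T2 actually reported):

```python
def ksi2k(cores, si2k_per_core):
    """Aggregate capacity in KSI2K: cores x (SpecInt2000 rating / 1000)."""
    return cores * si2k_per_core / 1000.0

# Two sites with identical hardware (500 cores), whose co-ordinators
# assume different per-core SI2K ratings for the same CPU model:
site_a = ksi2k(500, 1500)   # 750.0 KSI2K
site_b = ksi2k(500, 1800)   # 900.0 KSI2K
print(site_a, site_b)       # same kit, a 20% spread from the rating alone
```

If allocations are driven by these totals, the choice of rating matters as much as the hardware itself, which is the case for better benchmarking.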
6) The introduction of faster cores means that historical batch queue
limits need revisiting. TD noted that the given default time should be
retained - downstream the problem was concatenating files. JC to feed
this back to Graeme - and this was being discussed at the DTeam meeting
as well. TD noted that it should not require revisiting as the defaults
should remain unchanged.
7) The RAL-PPS instance of the PPS SAM testing framework is now up and
running.
8) SL joined the dteam VO to run his jobs outside of the ATLAS
environment. This led to the discovery of various problems, including
some with the use of VOMS/Gridmap files and edg-job-submit. There is one
remaining problem with use of the Glasgow RB that needs further
investigation.
9) There is a deployment board meeting in London this Thursday. The agenda
is here: http://indico.cern.ch/conferenceDisplay.py?confId=18446
10) There were FTS problems (~24hrs) last week. The CERN grid service
operators did not notice that a host certificate for the production
service was about to expire; it duly did, with obvious repercussions for
the MyProxy service. JG noted that it is better to have unwanted tickets
than to have these problems.
11) Finally JC has received several questions from people involved in
deployment roles who are still unsure where they stand with GridPP3
continuation of their posts. [see item 3, above]
SI-4 LCG Management Board Report
---------------------------------
JG reported that he had presented a document regarding the policy of
killing jobs. The feedback was that the VOs wanted to know what was going
wrong so that they could fix it, rather than the jobs simply being killed.
The VOs want to work with GridPP to resolve these issues. It was noted
that we need to flag when jobs are cancelled otherwise the Experiments
don't know why jobs have been cancelled. TD noted that we can get
statistics from Tier-1 regarding jobs, but rather than average efficiency,
we need profiled jobs. TD noted that the cut is on 2.7% efficiency, and
all that is required is a histogram to be inserted into the document. It
was agreed that AS would speak to Matt Hodges. DK noted that this issue
would also be discussed at the Deployment Board - but it was noted that it
was a User Board issue too.
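The point about profiled jobs rather than average efficiency can be illustrated with a small sketch on synthetic data (the numbers are invented for illustration and do not come from the Tier-1):

```python
# Synthetic per-job CPU/wall-time efficiencies: a mix of healthy jobs
# (~90%) and stalled jobs (~3%). The mean alone hides the stalled cluster.
effs = [0.90] * 80 + [0.03] * 20

mean = sum(effs) / len(effs)        # ~0.726: looks tolerable in isolation
# A histogram in 10%-wide bins exposes the stalled population directly.
bins = [0] * 10
for e in effs:
    bins[min(int(e * 10), 9)] += 1

print(f"mean={mean:.3f}", bins)     # bin 0 holds the 20 stalled jobs
```

A bimodal distribution like this is exactly what a single average conceals and what the requested histogram in the document would make visible.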
JG reported on an action to set up SLAs to run VO boxes. A presentation
had been given regarding security etc. JG asked whether all of the
Tier-1s have SLAs. The issue for the future would be to have a generic
one.
JG reported that there had been a talk on OSG site validation; and SRM2.2
issues/options had also been discussed.
SI-5 Documentation Officer's Report
------------------------------------
It was noted that SB had been away at CERN.
REVIEW OF ACTIONS
=================
247.2 RJ to get further information from ATLAS regarding use of Grid for
testing of PANDA, and report-back. This is not a live topic and it was
agreed to initiate a new listing of 'Inactive' items. This to be moved to
that category.
250.4 RJ, DN, GP, TD to meet to integrate experiment requirements of
Tier-2s going to Tier-1 - sites are aware of requirements but discussion
still has to take place. It was noted that this issue is not high
priority. A meeting is to take place with Barney Garrett - this is
ongoing and still to be arranged.
251.1 TD to raise the issue of memory vs CPU cost at the MB [in order to
work out what the requirement was between 1GB and 2GB memory per core].
It was agreed this to be moved to 'Inactive' category.
252.3 RM has now received inputs for his one-page summary regarding the
transition of each of the existing Middleware areas from GridPP2 to
GridPP2+ to GridPP3 - this to go to DB. Ongoing.
253.1 AS has commenced work on the report on data integrity at Tier-1, in
relation to implementation of checksums. AS is still working on this and
it will take a further couple of weeks to complete. This is ongoing, and
AS hopes it will be finished by the end of August. It was agreed to move
this to 'Inactive' category.
254.2 ALL PMB members have now signed up to EVO. Tests were ongoing but
this action is on hold due to H323 requirements which must be resolved.
JG/RM will resolve EVO issues. RJ reported that he had joined an
evaluation group on EVO and asked that all information should be sent to
him to enable him to document the problems involved. It was agreed that
an EVO test would take place the week after next (PMB) as next week's
meeting was a short one due to the CB meeting at 2.00 pm.
259.5 JC to provide recommendations to the PMB on PPS testing and a
summary of what is currently available on the system. Ongoing.
260.1 RM to provide final feedback for site reviews to SL for
https://www.gridpp.ac.uk/tier2/Readiness_Reviews/index.html. Ongoing.
260.3 RM, NG, TD, DK to inform SL which site-review information is
public/private. Ongoing.
261.1 TD and JG to prepare a PMB statement for the MB
regarding SL4 releases of basic middleware, which were still awaited and
were an issue at sites. JG reported that he would be doing this for
tomorrow. Sites should be encouraged to proceed with SL4 upgrades which
are to be tracked by JC. JG will give a summary statement to the MB as to
what we believe the current situation is - this will include 'SL5 on
hold'.
261.2 DN, RJ, GP: An action on the experiments to define the future
outlook for 64-bit applications and resultant effects on hardware
purchasing. Experiment reps to define the outlook. It was noted that the
priority is 32-bit at the moment; there is no advantage to 64-bit. A
short statement is required.
261.4 DB to look through the input in detail in relation to GGUS problems.
Ongoing.
261.5 JC and dTeam to carry out a survey on sites' experiences of GGUS,
when possible to organise. Ongoing.
261.6 JC to look into the issue of 2-hour response timing @ Tier-2 sites
and understand the problem in greater detail - sites also need to
understand what the two-hour response time actually means. This may come
up at the next Board meeting. Ongoing.
261.11 SL to progress receipt of final site documents from SouthGrid and
London T2 which were still outstanding. It was noted that SL was still
awaiting information.
261.13 DK to progress receipt of ScotGrid feedback. Ongoing.
261.14 RM to progress receipt of LT2 feedback. Ongoing.
261.16 JG to progress the issue of someone getting involved in the SLA
(ROC) working group.
261.17 JC to assess the general effectiveness of RSS feeds and
subscription-based updates, in relation to GridPP blogs. It was noted
that blogs are aggregated: PlanetGridPP is the mechanism, but RSS-feeds
that can be subscribed to don't exist. JC will bring this up at the
Deployment Board meeting.
262.2 SL to clarify GridPP contribution (what is accounted rather than
what is available) with the Tier-2 Board. Ongoing.
262.3 DK to raise items (12) [re accounted GridPP contribution] and (22)
[re site availability via SAM tests] at the Deployment Board in two weeks'
time. This was on the Agenda for discussion at the DB. Done, item
closed.
262.4 JC to ascertain the specific problems in relation to Condor support
issues. JC awaiting feedback. Ongoing.
262.5 Regarding poor response time of middleware developers: DK to
propose the following recommendation to the Deployment Board: to recommend
that if specific issues were involved, GGUS should be used. If issues were
general, the TCG representative at the Tier-2 site should be informed.
The TCG rep in turn should raise the issue as appropriate at the TCG
meetings. This was on the DB Agenda for discussion. Ongoing.
262.6 JC to raise the issue of PPS feedback information relating to
upgrade issues with the relevant individual(s) on the PPS, and ask if
there was anything else that could be done. Ongoing.
262.7 AS to speak to procurement and warn them that sites might want to
make parallel purchases - a sentence could be added to the tender
document. AS still to talk to procurement - ongoing.
262.9 non-Grid access relating to VOs. A document is to be produced detailing
this issue as VOs need a mechanism 'in'. AS to detail the issue in a
separate report and circulate to the PMB. What can and can't be offered
to non-Grid users: detail is required - AS still to do. Ongoing.
262.10 Regarding user communication/info provision, JC suggested amending
the emphasis of the UB to be more in touch with users generally - it was
agreed that he would raise this with Glen. Glen will be there on
Thursday, JC will speak to him then.
262.11 SB to add a new Document to the PMB Documents, No 114, relating to
a documentation report overview on current status. Ongoing.
263.1 Robin Tasker to re-circulate his paper regarding the RAL-CERN OPN
link, once further information was available. What is the timescale for
this? PC to review the Minutes and discuss with Robin Tasker.
263.2 JG to further investigate the lack of ability to pass job
requirements to the batch system and report-back (Tier-2 review issue).
JG will raise this through the GDB. Ongoing.
ACTIONS AS AT 09.07.07
======================
250.4 RJ, DN, GP, TD to meet to integrate experiment requirements of
Tier-2s going to Tier-1 - sites are aware of requirements but discussion
still has to take place. It was noted that this issue is not high
priority. A meeting is to take place with Barney Garrett - this is
ongoing and still to be arranged.
252.3 RM has now received inputs for his one-page summary regarding the
transition of each of the existing Middleware areas from GridPP2 to
GridPP2+ to GridPP3 - this to go to DB. This was to be done by Friday 8th
June but is still ongoing.
254.2 ALL PMB members have now signed up to EVO. Tests were ongoing but
this action is on hold due to H323 requirements which must be resolved.
JG/RM will resolve EVO issues. RJ reported that he had joined an
evaluation group on EVO and asked that all information should be sent to
him to enable him to document the problems involved. It was agreed that
an EVO test would take place the week after next (PMB) as next week's
meeting was a short one due to the CB meeting at 2.00 pm.
259.5 JC to provide recommendations to the PMB on PPS testing and a
summary of what is currently available on the system.
260.1 RM to provide final feedback for site reviews to SL for
https://www.gridpp.ac.uk/tier2/Readiness_Reviews/index.html.
260.3 RM, NG, TD, DK to inform SL which site-review information is
public/private.
261.1 TD and JG to prepare a PMB statement for the MB
regarding SL4 releases of basic middleware, which were still awaited and
were an issue at sites. JG reported that he would be doing this for
tomorrow. Sites should be encouraged to proceed with SL4 upgrades which
are to be tracked by JC. JG will give a summary statement to the MB as to
what we believe the current situation is - this will include 'SL5 on
hold'.
261.2 DN, RJ, GP: An action on the experiments to define the future
outlook for 64-bit applications and resultant effects on hardware
purchasing. Experiment reps to define the outlook. It was noted that the
priority is 32-bit at the moment; there is no advantage to 64-bit. A
short statement is required.
261.4 DB to look through the input in detail in relation to GGUS problems.
261.5 JC and dTeam to carry out a survey on sites' experiences of GGUS,
when possible to organise.
261.6 JC to look into the issue of 2-hour response timing @ Tier-2 sites
and understand the problem in greater detail - sites also need to
understand what the two-hour response time actually means.
261.11 SL to progress receipt of final site documents from SouthGrid and
London T2 which were still outstanding. It was noted that SL was still
awaiting information.
261.13 DK to progress receipt of ScotGrid feedback.
261.14 RM to progress receipt of LT2 feedback.
261.16 JG to progress the issue of someone getting involved in the SLA
(ROC) working group.
261.17 JC to assess the general effectiveness of RSS feeds and
subscription-based updates, in relation to GridPP blogs. It was noted
that blogs are aggregated: PlanetGridPP is the mechanism, but RSS-feeds
that can be subscribed to don't exist. JC will bring this up at the
Deployment Board meeting.
262.2 SL to clarify GridPP contribution (what is accounted rather than
what is available) with the Tier-2 Board.
262.4 JC to ascertain the specific problems in relation to Condor support
issues.
262.5 Regarding poor response time of middleware developers: DK to
propose the following recommendation to the Deployment Board: to recommend
that if specific issues were involved, GGUS should be used. If issues were
general, the TCG representative at the Tier-2 site should be informed.
The TCG rep in turn should raise the issue as appropriate at the TCG
meetings.
262.6 JC to raise the issue of PPS feedback information relating to
upgrade issues with the relevant individual(s) on the PPS, and ask if
there was anything else that could be done.
262.7 AS to speak to procurement and warn them that sites might want to
make parallel purchases - a sentence could be added to the tender
document.
262.9 non-Grid access relating to VOs. A document is to be produced detailing
this issue as VOs need a mechanism 'in'. AS to detail the issue in a
separate report and circulate to the PMB. What can and can't be offered
to non-Grid users: detail is required - AS still to do.
262.10 Regarding user communication/info provision, JC suggested amending
the emphasis of the UB to be more in touch with users generally - it was
agreed that he would raise this with Glen.
262.11 SB to add a new Document to the PMB Documents, No 114, relating to
a documentation report overview on current status.
263.1 Robin Tasker to re-circulate his paper regarding the RAL-CERN OPN
link, once further information was available. What is the timescale for
this? PC to review the Minutes and discuss with Robin Tasker.
263.2 JG to further investigate the lack of ability to pass job
requirements to the batch system and report-back (Tier-2 review issue).
JG will raise this through the GDB. Ongoing.
264.1 DB to extract the issues and actions generated from the Tier-2
Review as discussed at the PMB and put these on the Tier-2 site.
264.2 DB to contact Janet again and remind her about the forthcoming CB
meeting and ask her advice about a formal statement re the plan V2.
264.3 JC noted that the Alice RB problems had not been their fault - he
would re-check the BDII timeouts, as the reports are not currently working
and so provide no information.
264.4 Regarding policy of killing jobs, statistics are required from
Tier-1, but rather than average efficiency we need profiled jobs. AS to
speak to Matt Hodges.
INACTIVE CATEGORY AS AT 09.07.07
================================
247.2 RJ to get further information from ATLAS regarding use of Grid for
testing of PANDA, and report-back.
251.1 TD to raise the issue of memory vs CPU cost at the MB [in order to
work out what the requirement was between 1GB and 2GB memory per core].
253.1 AS has commenced work on the report on data integrity at Tier-1, in
relation to implementation of checksums. Ongoing, AS hopes to complete
this by end August.
Next week's PMB (16.07.07) would be for 1 hour only due to the CB meeting
at 2.00 pm. EVO test the following week (23.07.07).
GridPP PMB Minutes 263 - 2nd July 2007
======================================
Present: Roger Jones, David Britton, David Kelsey, Dave Newbold, Tony Cass,
Robin Middleton, John Gordon, Glenn Patrick, Robin Tasker, Suzanne Scott
(Minutes)
Apologies: Tony Doyle, Sarah Pearce, Stephen Burke, Steve Lloyd,
Jeremy Coles, Peter Clarke, Andrew Sansum, Neil Geddes
1. UK Position on Resilience of the RAL-CERN Line
=================================================
Robin Tasker had produced a paper regarding the RAL-CERN OPN link. There
had been an outage in June - it was reported that French road repair men
had dug up the fibre and it was 48 hours before it was repaired. What
resilience was required to protect against outage? The lightpath from RAL
to CERN was summarised in RT's paper in terms of the problems involved,
but overall the link was fairly reliable. The paper addressed issues of
fibre infrastructure, with feasibility and costing confirmation awaited
from UKERNA. It was understood that outages could be infrequent, and
protecting the link would involve a large cost which might not be justified
if such protection was not generally required. RT was currently awaiting a
risk assessment of such a catastrophic fibre break - it was a question of
balancing risk and cost, and of how long an outage was likely
to last - how significant was an outage of 48 hours in June? JG noted
that breaks in the Tier-1 do result in dataflow issues to the other
Tier-1s. There was a discussion regarding steering data and storage. It
was agreed that the links need to be as reliable as possible within
reason. An outage of 1-2 hours or one day was acceptable, but for two
weeks, no. It was noted that the lightpath cannot be re-routed: if the
fibre breaks then the connection is lost. It was noted that bandwidth
might be an issue for the future. There was a discussion of routes into
CERN and cross-border fibres.
It was reported that JANET (UK) were providing figures to RT for a diverse
route by the end of the week. NetNorthWest and JANET will be able to give
a realistic assessment of risk. It was agreed that a decision should be
deferred until further information was available. RT will update his
paper with fuller information when it was available, and re-circulate.
2. Ongoing Review of Tier-2 Issues
==================================
In absentia, JC had submitted comments on the remaining issues.
18) Lack of ability to pass job requirements to the batch system - JG
noted that the gLite CE can pass information. The RB looks at the
user requirement and matches it to a queue. It was noted that the
system fills with jobs that can't be optimised. JG would investigate
this issue further and report-back.
19) Virtualisation - UCL had wanted to know GridPP direction/support in
this area. JC noted that Marian had started looking at
virtualisation. He currently has some nodes on the PPS which are on
virtual machines - his intention was to put the PPS SAM client in such
an environment. It was noted that Grid-Ireland also had a lot of
experience in this area which GridPP could draw upon. JC reported that
there might be some support available via the TB-SUPPORT list and
helpdesk, but at the moment we are still looking at this area and do
not have a definite direction. It was agreed that this is largely
uncharted territory for GridPP and a diversion away from the standard
GridPP environment. In abeyance at present.
20) Changing Experiment requirements - JC noted that this might relate to
such things as the ATLAS ACL change requests. Some sites thought there
needed to be more structure to change requests. VO views might be
cited as another area where difficulties have been encountered. There
was also the difficulty of consistency of feedback - on SL4 JC has
heard different positions depending on who he talks to within an
experiment. It was reported that the 39 Tier-2s in CMS are in regular
contact. JG summarised that this was an issue more for the Experiments
to deal with.
21) Level of noise for site problems - JC noted that this covered things
like false-positive problems in the site SAM results. It was agreed
that people are paying more attention now to the SAM results.
Issues should be raised in the weekly Ops reports meetings.
22) Definition of 'what is available' - JC asked whether, if sites are
going to be measured against a single measure of availability, that should
be the number coming from GridView (even though there are many questions
about how accurately it measures availability for the experiments). It was
agreed that, yes, GridView and the SAM reports come from the same
database, but if there is not a consistent query then you won't get
the same number out of the same data.
23) Enforcement of MoUs/SLAs - JC noted that the process is known but
other than getting less funding in the future, were there any other
enforcement options? It was agreed that this issue was not for public
debate at present.
3. Killing Jobs
================
It was reported that TD had sent a draft policy to the WLCG Management
Board. It was noted that killing stalled jobs was treating the symptom
rather than the problem. Some feedback had been received, it was
understood that the policy intention was to try to improve efficiency at
sites. It was noted that Tier-2s have fewer staff and VOs send jobs in.
The issue would be discussed at the face-to-face MB meeting tomorrow. It
was noted that the dashboard was an answer to cross-VO problems but the
Experiments don't know who is running jobs. It was agreed that it was not
right if it became the normal procedure to kill off jobs as a matter of
course.
4. AOB
=======
RJ reported that Liverpool had asked for some GridPP funding for pre-spending.
DB noted that this was not possible as no official word had been received from
STFC with regard to allocations. It was agreed that nothing could be done
until GridPP know officially what the scale of expenditure is.
STANDING ITEMS
==============
SI-1 Dissemination Officer's Report
------------------------------------
It was noted that SP was not present.
SI-2 Tier-1 Manager's Report
-----------------------------
In absentia, AS had sent in the following report on Friday 29th:
Hardware - Regarding the 10Gb path from the Tier-1 to SJ5, it was
reported that they were currently waiting for the network group to
finish testing. They were
currently working on implementing the firewall configuration as a set of
router filters.
The RAL networking group were in the process of obtaining a public AS
number so that the Tier-1 could route Tier-1 -> Tier-1 traffic over
the OPN. The Tier-1 was still waiting for the networking group to
complete this work.
The pre-qualification stage of the disk and CPU tenders closed on Friday
29th. Evaluation will start w/c 2nd July.
The Tape service was down last Tuesday for a firmware update.
Service - SAM availability for the last 7 days was 93% (some overlap with
previous 7 days reported).
Regarding CASTOR: A stand-alone 2.1.3 release of CASTOR for CMS had been
implemented and is undergoing testing. Results were very encouraging with
high rates achieved (400MB/s writing - concurrent with 300MB/s to tape
followed by >700MB/s reading). Reliability has been excellent, far better
than any previous tests with CMS. However, so far only native rfio load
tests have been tried, and we need to see good results with gridftp/srm/fts
before feeling confident that we have a good working production-ready
release.
A standalone 2.1.3 release for ATLAS is currently being worked on. This
was delayed by technical problems but is now nearly complete and will be
tested soon.
We have reviewed the hardware capacity available to deploy a 2.1.3
stand-alone instance for LHCb. Tier-1 batch workers will be
redeployed temporarily. Work on this will commence once the ATLAS instance
is complete. It is expected to go faster as documentation and processes
have now been improved.
Regarding dCache: all is OK - but it is apparently not being used by ATLAS
production. We are following this up.
BDII: We have seen some timeouts on the top-level BDII. These are load
related, probably caused by the LHCb VO box. One BDII has been updated to
the latest release and has seen a significant reduction in CPU load. If it
remains stable then the two remaining hosts will be updated shortly.
RB: rb01 is currently under sysdev having its database cleaned. rb02 is
struggling to cope with the load on its own. rb03 is deployed and is
currently being tested. Once testing is complete, ALICE production will
be moved to it. We may also move LHCb production.
LFC: Problems reported on Monday were resolved (on Monday). Cause was a
faulty gLite update.
SI-3 Production Manager's Report
---------------------------------
In absentia, JC sent in the following report:
1) We are pursuing two security-related matters raised in the UK. The
submitters are concerned that there has been no result (patch) for one
and a lack of discussion of the other. There has actually been some
progress on both, but this particular problem has highlighted a need
to review procedures and communication in this area.
Another issue being faced generally is how we are supposed to deal with
vulnerabilities in VO/experiment code.
2) BDII timeouts appear to be affecting UK sites again (causing lcg-rm
tests to fail for several sites).
3) The main things to note from the UKI monthly meeting last week
(http://indico.cern.ch/conferenceDisplay.py?confId=17879) are that the
UK helpdesk will now move to chase/close tickets where the ticket
submitter has not responded to the agent's response (e.g. after a site
has been waiting on a user to confirm a fix), and that generally sites are
finding it difficult to keep up with constant changes in YAIM and the
middleware. Sites have been encouraged to check their storage data
being published to the storage accounting portal
(http://goc02.grid-support.ac.uk/accountingDisplay/view.php?queryType=storage)
and report any problems.
4) GOCDB3 (https://goc.gridops.org/) went live last week on Wednesday. We
have seen an increase in tickets to the UKI ROC as users point out
minor issues but so far the release seems to have been well planned and
has gone smoothly.
5) There are two monthly grid-deployment-related meetings at CERN this
week. A storage workshop runs Monday and Tuesday
(http://indico.cern.ch/conferenceDisplay.py?confId=16456) with both SRM
developers and representatives from the experiments present. Grieg
Cowan will present on "GridPP sites: experience running dCache, DPM,
and StoRM". Then on Wednesday is the July Grid Deployment Board meeting
(http://indico.cern.ch/conferenceDisplay.py?confId=8485) with a focus
on accounting and security. There will be surrounding discussions on WN
utilisation, the OPN and a summary from the storage workshop.
SI-4 LCG Management Board Report
---------------------------------
See https://twiki.cern.ch/twiki/bin/view/LCG/MbMeetingsMinutes
SI-5 Documentation Officer's Report
------------------------------------
It was noted that SB was not present.
REVIEW OF ACTIONS
=================
247.2 RJ to get further information from ATLAS regarding use of Grid for
testing of PANDA, and report-back. RJ reported that this was ongoing and
nothing would be happening regarding it in the near future.
250.4 RJ, DN, GP, TD and TC to meet to integrate experiment requirements
and work on Tier-2 networks - sites are aware of requirements but
discussion still has to take place. Ongoing when convenient to arrange.
It was noted that this issue is not high priority.
251.1 TD to raise the issue of memory vs CPU cost at the MB [in order to
work out what the requirement was between 1GB and 2GB memory per core].
Ongoing.
252.3 RM has now received inputs for his one-page summary regarding the
transition of each of the existing Middleware areas from GridPP2 to
GridPP2+ to GridPP3 - this to go to DB. Ongoing.
253.1 AS has commenced work on the report on data integrity at Tier-1, in
relation to implementation of checksums. Ongoing.
254.2 ALL PMB members have now signed-up to EVO. Tests were ongoing but
this action is on hold due to H.323 requirements which must be resolved.
JG has resolved EVO H.323 issues at RAL. It was noted that there had been
a further EVO test today (2/7) but JG was the only one to join.
255.3 DK to get approval from groups regarding Grid Site Operations policy
and report-back. Obligations are on the site to carry forward issues.
It was reported that all sites had now been consulted. Final project
approval was currently happening. Done, item closed.
256.1 NG to review the draft of the new Grid Security Policy from NGS
perspective, and SL from Tier-2, and report-back. NG had reported at the
F2F. Done, item closed.
258.6 JC to discuss RAL RB issues with Catalin Condurache and bring
conclusions back to the PMB. In absentia JC reported that the recent RB
problems are thought to be due to ALICE hammering the RB until it fails.
It is proving difficult to validate this due to poor RB VO monitoring. The
urgency to fix problems seen by users is now recognised and the T1
procedure will not always be to wait until queues are empty if a component
is being problematic. Another issue here is that UIs are not being
configured properly to take account of the load balanced nature of the
RBs. ALICE and LHCb are having their own RBs installed. This is now
closed.
259.5 JC to provide recommendations to the PMB on PPS testing and a
summary of what is currently available on the system. JC will also
forward the chat window location to the PMB via email. The link that was
circulated is
http://egee-pre-production-service.web.cern.ch/egee-pre-production-service/.
Ongoing.
260.1 RM, NG to provide final feedback for site reviews to SL for
https://www.gridpp.ac.uk/tier2/Readiness_Reviews/index.html. This was 'in
progress' - NG action done; RM ongoing.
260.3 RM, NG, TD, DK to inform SL which site-review information is
public/private. Ongoing.
260.4 JG (not JC) to re-start Castor Strategy meetings. Done, item
closed.
261.1 TD and JG to prepare a PMB statement for the MB
regarding SL4 releases of basic middleware, which were still awaited and
were an issue at sites. Ongoing.
261.2 DN, RJ, GP: An action on the experiments to define the future
outlook for 64-bit applications and resultant effects on hardware
purchasing. Experiment reps to define the outlook. There was a
discussion re SL4 & SL5 - ongoing.
261.4 DB to look through the input in detail in relation to GGUS problems.
Ongoing.
261.5 JC and dTeam to carry out a survey on sites' experiences of GGUS,
when possible to organise. In absentia JC reported that a dialogue has
been started but it will take a few weeks to close this action. Ongoing.
261.6 JC to look into the issue of 2-hour response timing @ Tier-2 sites
and understand the problem in greater detail - sites also need to
understand what the two-hour response time actually means. Ongoing.
261.7 DK to ask Mingchao Ma, the new GridPP Security Officer, to contact
sites and check they have security incident response systems in place.
It was understood that this would happen naturally
in due course. Item closed.
261.8 JC to talk to Pete Gronbech and Alessandra Forti regarding
Monitoring/Nagios/Ganglia training, to include someone from GridView. In
absentia JC reported that this had been discussed with Pete and Alessandra
and also at the UKI meeting. There is support for this around the next
HEPSYSMAN meeting. We will start working on the agenda. Action can be
closed.
261.11 SL to progress receipt of final site documents from SouthGrid and
London T2 which were still outstanding. It was noted that this was a
duplicate of an earlier action, but was still ongoing.
261.12 NG to progress receipt of SouthGrid feedback. Done, item closed.
261.13 DK to progress receipt of ScotGrid feedback. Ongoing.
261.14 RM to progress receipt of LT2 feedback. Ongoing.
261.15 SL to send an email to sites who still had to provide final
versions of the Questionnaire response (list above), informing them that
the current version would be considered final unless a revised one was
provided by Friday 22nd June. Done, item closed.
261.16 JC to speak to Steve McAllister about getting involved in the SLA
(ROC) working group. In absentia JC reported that he had spent an hour
with Steve last week but it is not clear that he is the right person to
work on SLA issues for the ROC. This should be the ROC manager. It was
agreed that JG would progress this.
261.17 JC to assess the general effectiveness of RSS feeds and
subscription-based updates, in relation to GridPP blogs. Ongoing.
262.1 RM to draft an extra line for the Travel Policy relating to Tier-2
staff/Experiment contact. Done, item closed.
262.2 SL to clarify GridPP contribution (what is accounted rather than
what is available) with the Tier-2 Board. Ongoing.
262.3 DK to raise items (12) [re accounted GridPP contribution] and (22)
[re site availability via SAM tests] at the Deployment Board in two weeks'
time. Still to be done.
262.4 JC to ascertain the specific problems in relation to Condor support
issues. In absentia JC reported that he was still working on this. So far
he had contacted two other EGEE sites that are using or trying to use
Condor, and had asked Santanu to distill the main issues Cambridge is
having with Condor as a batch system. Ongoing.
262.5 Regarding poor response time of middleware developers: DK to
propose the following recommendation to the Deployment Board: to recommend
that if specific issues were involved, GGUS should be used. If issues
were general, the TCG representative at the Tier-2 site should be
informed. The TCG rep in turn should raise the issue as appropriate at
the TCG meetings. Ongoing.
262.6 JC to raise the issue of PPS feedback information relating to
upgrades issues with Pete on the PPS, and ask if there was anything else
that could be done. In absentia JC reported that he had talked with Yves
and Marian but there was nothing conclusive yet about how to take this
forward. Marian reinstalls each time and Yves is already inputting
experiences into the wiki (such as with DNS style VO configuration).
Ongoing.
262.7 AS to speak to procurement and warn them that sites might want to
make parallel purchases - a sentence could be added to the tender
document. Ongoing.
262.8 A statement is to be prepared for the MB relating to SAM
availability for the last 7 days (62%) - AS to send an email to JG, JC and
TD. [This was mainly caused by the failure of the RAL-CERN line, which
was down in excess of 48 hrs from 20/06/2007 10:17:54 to approximately
22/06/2007 15:00:00.] Done, item closed.
262.9 Grid access relating to VOs. A document is to be produced detailing
this issue, as VOs need a mechanism 'in'. AS to detail the issue in a
separate report and circulate to the PMB. Ongoing.
262.10 Regarding user communication/info provision, JC suggested amending
the emphasis of the UB to be more in touch with users generally - it was
agreed that he would raise this with Glenn. In absentia JC reported that
he would talk with Glenn next week when at RAL. Ongoing.
262.11 SB to add a new Document to the PMB Documents, No 114, relating to
a documentation report overview on current status. Ongoing.
ACTIONS AS AT 09.07.07
======================
247.2 RJ to get further information from ATLAS regarding use of Grid for
testing of PANDA, and report-back.
250.4 RJ, DN, GP, TD and TC to meet to integrate experiment requirements
and work on Tier-2 networks - sites are aware of requirements but
discussion still has to take place. Ongoing when convenient to arrange.
It was noted that this issue is not high priority.
251.1 TD to raise the issue of memory vs CPU cost at the MB [in order to
work out what the requirement was between 1GB and 2GB memory per core].
252.3 RM has now received inputs for his one-page summary regarding the
transition of each of the existing Middleware areas from GridPP2 to
GridPP2+ to GridPP3 - this to go to DB. This will be done by Friday 8th
June.
253.1 AS has commenced work on the report on data integrity at Tier-1, in
relation to implementation of checksums.
254.2 ALL PMB members have now signed-up to EVO. Tests were ongoing but
this action is on hold due to H.323 requirements which must be resolved.
JG/RM will resolve EVO issues.
259.5 JC to provide recommendations to the PMB on PPS testing and a
summary of what is currently available on the system.
260.1 RM to provide final feedback for site reviews to SL for
https://www.gridpp.ac.uk/tier2/Readiness_Reviews/index.html.
260.3 RM, NG, TD, DK to inform SL which site-review information is
public/private.
261.1 TD and JG to prepare a PMB statement for the MB
regarding SL4 releases of basic middleware, which were still awaited and
were an issue at sites.
261.2 DN, RJ, GP: An action on the experiments to define the future
outlook for 64-bit applications and resultant effects on hardware
purchasing. Experiment reps to define the outlook.
261.4 DB to look through the input in detail in relation to GGUS problems.
261.5 JC and dTeam to carry out a survey on sites' experiences of GGUS,
when possible to organise.
261.6 JC to look into the issue of 2-hour response timing @ Tier-2 sites
and understand the problem in greater detail - sites also need to
understand what the two-hour response time actually means.
261.11 SL to progress receipt of final site documents from SouthGrid and
London T2 which were still outstanding.
261.13 DK to progress receipt of ScotGrid feedback.
261.14 RM to progress receipt of LT2 feedback.
261.16 JG to progress the issue of (someone, not Steve McAllister - the
ROC manager?) getting involved in the SLA (ROC) working group.
261.17 JC to assess the general effectiveness of RSS feeds and
subscription-based updates, in relation to GridPP blogs.
262.2 SL to clarify GridPP contribution (what is accounted rather than
what is available) with the Tier-2 Board.
262.3 DK to raise items (12) [re accounted GridPP contribution] and (22)
[re site availability via SAM tests] at the Deployment Board in two weeks'
time.
262.4 JC to ascertain the specific problems in relation to Condor support
issues.
262.5 Regarding poor response time of middleware developers: DK to
propose the following recommendation to the Deployment Board: to recommend
that if specific issues were involved, GGUS should be used. If issues
were general, the TCG representative (Alessandra Forti) should be
informed. The TCG rep in turn should raise the issue as appropriate at
the TCG meetings.
262.6 JC to raise the issue of PPS feedback information relating to
upgrades issues with the relevant individual(s) on the PPS, and ask if
there was anything else that could be done.
262.7 AS to speak to procurement and warn them that sites might want to
make parallel purchases - a sentence could be added to the tender
document.
262.9 Grid access relating to VOs. A document is to be produced detailing
this issue, as VOs need a mechanism 'in'. AS to detail the issue in a
separate report and circulate to the PMB.
262.10 Regarding user communication/info provision, JC suggested amending
the emphasis of the UB to be more in touch with users generally - it was
agreed that he would raise this with Glenn.
262.11 SB to add a new Document to the PMB Documents, No 114, relating to
a documentation report overview on current status.
263.1 Robin Tasker to re-circulate his paper regarding the RAL-CERN OPN
link, once further information was available.
263.2 JG to investigate further the lack of ability to pass job
requirements to the batch system and report-back (Tier-2 review issue).
The next PMB would take place on Monday 16th July. The meeting closed at
2.00 pm.