
UKHEPGRID Archives


UKHEPGRID@JISCMAIL.AC.UK


UKHEPGRID February 2008

Subject:

Minutes of the 290th GridPP PMB meeting

From:

Tony Doyle <[log in to unmask]>

Reply-To:

Tony Doyle <[log in to unmask]>

Date:

Mon, 11 Feb 2008 12:30:19 +0000

Content-Type:

MULTIPART/MIXED

Parts/Attachments:

TEXT/PLAIN (20 lines) , 080201.txt (1 lines)

Dear All,

     Please find attached the latest F2F GridPP Project Management 
Board Meeting minutes. The latest minutes can be found each week in:

http://www.gridpp.ac.uk/php/pmb/minutes.php?latest

as well as being listed with other minutes at:

http://www.gridpp.ac.uk/php/pmb/minutes.php

Cheers, Tony
________________________________________________________________________
Prof. A T Doyle, FInstP FRSE                       GridPP Project Leader
Rm 478, Kelvin Building                      Telephone: +44-141-330 5899
Dept of Physics and Astronomy                  Telefax: +44-141-330 5881
University of Glasgow                   EMail: [log in to unmask]
G12 8QQ, UK                 Web: http://ppewww.physics.gla.ac.uk/~doyle/
________________________________________________________________________


GridPP PMB Minutes 290 - 1st February 2008
==========================================

Face-to-face meeting at Glasgow.

Present: Tony Doyle, Roger Jones, Stephen Burke, David Britton, David Kelsey, Steve Lloyd, Robin Middleton, John Gordon, Jeremy Coles, Glenn Patrick, Andrew Sansum, Neil Geddes, Dave Colling, Suzanne Scott (Minutes)
By Phone: Sarah Pearce, Tony Cass (pm only)
In attendance: Trish Mullins (STFC)
Apologies: Peter Clarke

1. GridPP3 Board Compositions
==============================

DB had circulated documents and now invited comments. DK suggested adding a Security Officer as a standing member to assist with security policy advice. TD suggested that this was more an issue for the more technically-focussed DTeam, and that a route already existed for high-level inputs. SL agreed that there was a danger of technical issues overtaking business and it was noted that DK already wore a security 'hat' as representative on the PMB and, for coherence, should sit on the Deployment Board in a similar capacity. DK noted that the Deployment Board should ultimately sign off on security issues, but that DTeam was much more of an 'open' meeting where technical details could be debated.

DB advised that membership of DTeam needed to be more formalised and written down; in particular, someone was needed to replace Steve Traylen from the Tier-1. AS noted that this was likely to be Derek Ross. It was noted that more Tier-1 input to the Deployment Board might be required, and possibly a Technical Co-ordinator from Tier-1 should attend. DB asked about the experiment representation and requested that individuals should be nominated.

Actions:
JC to write down membership of DTeam.
RJ, DC and GP to nominate experiment user representatives for the Deployment Board.

Decision: It was agreed that DK would sit on the Deployment Board as Security Policy liaison.

DB noted that the other Board under discussion was the Tier-1 Board.
This had been set up originally at the request of RAL as an oversight-type requirement. DK noted that some kind of Board was required. DB also noted that the Terms of Reference overlapped with some other boards. TD advised that the current structure also resulted in some reporting direct to Janet Seed (STFC). The Tier-1 Board meeting that had been scheduled for the beginning of March had been cancelled due to the timing. TD asked how much of the Tier-1 remit the Deployment Board could incorporate?

Decision: It was agreed that the Tier-1 Board would be suspended for 6 months, and it would be seen to what extent the Deployment Board could cover the remit.

Action:
SL and DB to review the Tier-1 Board Terms of Reference and see what could be formally incorporated into the new Deployment Board Terms of Reference.
AS and JG to discuss what could replace the Tier-1 Board at RAL.

The discussion turned to STFC input. Should there be STFC representation on the PMB, or at the F2F? DB asked whether Swindon wanted to be represented on the Board, with Trish Mullins as member? observer? It was agreed that TM would be 'in attendance' at Face-to-Face PMB meetings.

SL asked whether the Collaboration Board would remain unchanged? It was felt that the experiments were well represented there and it should remain as is. It was agreed that SP would be on the Collaboration Board as Project Manager 'ex officio'. SP asked whether the CB could be arranged to coincide with the Oversight Committee which would mean she could attend in person - this was agreed.

Regarding the User Board, DB asked what worked best? GP noted that it was really a 'user forum', and better identified when it was an 'experiments' board'. The core was experiments' contacts but anyone could join. More generally it was noted that all users should be encouraged to have access. DB asked if the Terms of Reference were OK?
GP advised yes, but there was an overlap with DTeam and that should be clarified; the Terms were fine. DB asked about the meeting schedule? GP noted every Quarter. DB asked if there were any further points/questions?

On the related issue of roles, TD was noted as the new Technical Director and would iterate with JC as this was a new role for inclusion in the PMB. The Project Manager's role would be taken on by SP and SP/DB would need to iterate this week. Regarding the role of Production Manager, JC needs to check this. It was noted that everyone should look at their new roles and advise DB of any changes required.

Regarding the LCG GDB and the MB - DB noted that although the GDB was part of JC's role, being deputy should be included as part of JC's role. The Project Leader is by default the MB representative but this needs to be discussed with Ian Bird as the default is actually a Tier-1 representative from each country.

Action: TD to contact Ian Bird and suggest that the MB UK membership change as at 1st April 2008.

2. GridPP3 Reporting and Reporting Routes
==========================================

DB gave a presentation on reporting routes covering the following areas: Tier-1 staff, Tier-2 staff, Technical staff and others.

Tier-1 staff
------------

Regarding the Quarterly Reporting - SP/AS to discuss; to be aligned with the reporting which AS already carries out. This will contain the effort figures reported against the service areas; a summary of service levels achieved; and an update of milestones and metrics related to the project map. SL noted that CPU and disk figures in the same format as those provided by the Tier-2 would be useful - this would give unified information. DB noted that in the Tier-1 Quarterly Report there should be an appropriate set of financial numbers (details to be discussed).

Action:
AS to provide CPU/Disk usage numbers in the Quarterly Report for the Tier-1 as per the ones provided for Tier-2.
AS/SP to iterate regarding the financial summary in the Quarterly Reporting (e.g. Outturn figures).

Tier-2 staff
------------

It was agreed that the Quarterly Report should be compiled by the Production Manager, to include:
- effort per Institute (for grant monitoring)
- CPU/disk availability (for MoU verification)
- milestones & metrics update (for Project Map updates)

There was a discussion on unfunded effort and reporting. It was agreed to encourage institutes to continue to report "Unfunded" effort if they wished.

Technical staff
---------------

TD as Technical Director is to compile the Quarterly Report containing the following:
- effort figures (for grant monitoring)
- milestones & metrics (to update Project Map)

It was noted that GOC posts come under the Technical Director's brief in relation to overview, but are also in the EGEE slot in the Project Map. There was a discussion on the dove-tailing of GridPP2 - GridPP3 posts at Manchester - DB to progress.

User-related staff
------------------

It was noted that the Portal post best fits under the overview of the Chair of the User Board. DC noted that the other half post was related to RTM - the UB gets 50% of this post. It was agreed that DC and GP would iterate each quarter on the direction of this work.

DB asked in relation to the Technical Documentation 50% FTE post - what exactly will the post address? SB advised that up until now, it related to user documentation. GP noted that this would only be of use if it was closely linked to user requirements, or else the post was redundant - many users don't know that it exists at all. SL suggested that it was not a useful post - VO support is what is needed and support for system managers - it was up to the experiments to support users effectively. TD noted that the post needs to be more visible - would a blog by SB help? SB advised that individuals prefer dialogue/email, which results in an immediate response. SL suggested that use cases need to be documented.
JG advised that everyone should know that the Grid Documentation page exists, and user support should be able to respond appropriately and give links; however a documentation person needs to ensure that these links actually exist, and should let all the managers know where to access them.

DB summarised that this issue needs to be better defined - the post should be called 'User Support' and should complement the portal post: the Portal Post provides technical help to new users/VOs interfacing their proprietary software to the Grid middleware; the Document Post supports users/VOs by leading them through the process of getting engaged and pointing them at the right documentation (and ensuring that documentation exists in an up-to-date form).

Action: GP/SB/DC to define these Support posts and ensure they form a comprehensive basis for user support (both documentation and Grid access assistance), overseen by the UB Chair.

Regarding the new experiment posts, DB noted that they would naturally be overseen by the UB chair but this did raise the question as to whether the Tier-1 user experiment support posts also fell under this oversight. It was felt important to try and make the experimental support posts more visible outside with Tier-1. AS asked how the metrics would be monitored? DB advised that these staff members should stay under the Tier-1 reporting line for practical reasons but the User Board should monitor them. DB noted that the danger was that nothing would be achieved, however AS should provide a report to GP. AS advised that staff should also report to the experiments. DB suggested that they be included on the User Board. DC suggested that the experiments and the Tier-1 were already well-aligned, and he did not see too much danger of lack of communication or reporting.

Decision: DB summarised that the consensus was that the Tier-1 experiment support posts should report via the Tier-1 but that GP as Chair of the User Board should be aware of this effort.
Other staff
-----------

There was an outreach post at 50% FTE that would report direct to SP.

DB would complete the document and circulate it, thereafter it would be posted on the website as a record. It was noted that there were no other dissemination issues that required discussion at present.

TD, SL and RJ broke for the Group Leaders' phone conference. DB proposed bringing forward the items on EGEE/EGI - RM and JG to report.

3. EGI Workshop Report
=======================

RM reported that the workshop was in relation to EGI Work Package 3. Around 30 people attended to discuss this issue. The workshop was a brainstorming and consensus session driven by the deliverables schedule for EGI which was due in March. RM advised that it was very much a 'closed' project and it was difficult to find out what was going on and who exactly was involved - no list had been published of any international collaborators. RM would circulate the Agenda for info.

RM gave a presentation on the various discussions. It was noted that Malcolm Atkinson and NG had been invited to the meeting in March, but this clashed with GridPP20 and MA would not be able to attend - also, NG might not be able to attend and he advised that he had invited Andy Richards to go in his stead. There was therefore a place available if anyone else wished to go.

RM reported that Work Package 2 related to use cases and there had been discussions regarding middleware, ramp-up, and ways of handling the projects and the facilities. RM reported that Day Two had comprised the operations model (which needed clarification) and more research members of the Grid were hoped for; services for VOs had been discussed and a sub-group had been convened to progress issues. The workshop had addressed middleware issues, the building and testing of systems, legal issues and MoUs. Types of management had been addressed, also resource provisioning, VO entry strategy, standardization, and policies.
Day Three had concentrated on industry take-up, application support (with a possible taskforce being convened to progress this), training, outreach and dissemination, security, and then Agenda and actions for the forthcoming Rome meeting in March.

DB asked about cost and distribution of effort. RM said that not enough information was available as yet. JG asked about the transition between EGEE III and EGI, but again, RM noted that not enough information was available. DB suggested that the workshop had not really been a useful meeting but that if a 'strawman' existed by March then it could be helpful to attend and GridPP should have an opinion - it would be good if JG could go. JG would check with Malcolm Atkinson.

NG asked what the fallback plan was if EGI does not happen after EGEE III? It was advised that GridPP would cover their commitments to the Particle Physics Community as our dependency on EGEE is limited in this respect. NG noted that he had provided a draft paper but would add information that addressed the period beyond 2011 and re-circulate.

4. How ready are we for EGEE III?
=================================

JG reported on SA1 posts and staffing. He noted that assurances had been given that things would go ahead, but more information might be forthcoming in February. JG gave an overview of status as follows:

NA2   Dissemination                        Manchester, QMUL, Imperial (RTM funding shortfall)
NA3   Training                             Edinburgh (not yet appointed)
NA4   User Support                         Glasgow (MBL)
NA5   Policy/strategy/external relations   STFC
JRA1  GridSite Security                    Manchester
SA3   R-GMA                                STFC

JG then reported on SA1 staffing, work, and duties overview. Additional personnel comprised Linda Cornwall (Vulnerability), Dave Kelsey (Security), Dave Kant (APEL) and Andy Newton (GOCDB). SA1 personnel would be based at Glasgow, Imperial, Manchester, Oxford, and STFC. The SA1 summary required more effort at the Tier-2s and stricter oversight of effort overall.
DB noted that the Tier-2s need to think about booking much earlier than in EGEE. It was advised that the Tier-2 Co-ordinators would assist with this. JG noted that JC and DTeam plus NGS effort would help to deliver requirements.

RM noted that on 6-7 May there would be an EGEE transition meeting at CERN relating to the transition of activities, if anyone was interested in attending.

The meeting broke for lunch and re-convened at 1:00 pm.

5. GridPP3 Milestones & Metrics
================================

SP had circulated the Project Map. SP confirmed that she had iterated with owners of some of the boxes: the experiments, SL, RM re NGI, dissemination, Tier-1 and Ops with JC and AS. All metrics thus far had been incorporated and older ones that were no longer relevant had been deleted. SP advised that a few milestones were still unfinished. SP suggested that there were too many metrics in some areas and we needed to define more clearly what it was important to measure - this might become easier after the dress rehearsal.

SP reported that the Tier-2 section was different in overview - it had been restructured so that the top level had a box for each of the Tier-2s - you could then drill down to give metrics for each individual Tier-2. SP asked whether the PMB wished performance to be at the top level, or did they want the Tier-2 aggregated, or shown by different sites? DB advised that the performance of the Tier-2 should reflect the management structure and the devolution of responsibility - for example, red boxes caused by a single bad site should not be visible here - rather, a green box that shows that the Tier-2 as a whole is performing as defined by the MoU - it needs to reflect the structure that has been set up. The PMB agreed.

SP raised the issue of metrics which SL suggested for the Tier-2s - should SL's tests be included along with SAM availability?
DC thought no, SAM availability was important to include - utilisation could be measured separately, therefore a combination of the SAM tests + utilisation would show how the Tier-2 is performing. DB advised that at the top level the boxes need to monitor the MoU which would provide parameters for the next hardware allocation, but SL's tests were useful too for additional information.

AS asked about response times - were they part of the MoU? This was covered by No 11 on the list: 4.x11. DC thought that having 12 metrics quarterly was a lot. SP noted that as many as possible should be automatic but she asked whether there were any here that were difficult to collect? DC advised that each one seemed reasonable, but SL's tests should remain unofficial. JC asked about response time of sites to problems? This was dealt with by No 11. DC noted that No 11 was difficult to quantify - a yes/no answer was not sufficient - a percentage was needed.

DB advised that re the ATLAS tests, ATLAS was the most important experiment to some of the Tier-2s but for London, CMS might want to have performance metrics. DC suggested that it was fairly well balanced between ATLAS and CMS in London. SP noted that the number of metrics in each experiment were looking at service levels etc; on the Project Map these will not be aggregated. RJ advised that the experiments were interested in sites, not aggregation, and that there existed a tension between the MoU and the conditions of grant. DB advised that the Project Map should monitor the Tier-2s.

SP asked about milestones and measurements for EGI? RM advised that GridPP were not doing this alone and that we needed to participate, therefore meaningful milestones were difficult at present. TD noted that this was a bigger issue for NGS, but that the European integration of GridPP was OK. JG asked if it could be considered an internal target?
TD noted that using the EGEE infrastructure was a key metric and that a higher-level box for SA1 and NA4 would be useful. DB agreed, noting one box for each of the SA and NA areas, showing them failing or succeeding. RM confirmed it was better to wait and see how things crystallise within EGEE/EGI. DB noted that we needed something that measures our involvement. RM said that for EGEE this was clearer but for EGI not so - we would need to wait and revisit the Project Map in light of subsequent events.

Action: SP would look at the EGI wiki, and NG would consider more inputs relating to box 6.2.

SP then asked about a separate box for the GOC post? JG asked whether this was within EGEE SA1? TD noted that it should be under Grid Services. DB advised that the development aspect could go under EGEE. SP noted that it would be easier to put it under EGEE as it was a GOC activity.

SP queried the LCG box - did we need one? If yes, content? TD noted a Tier-1 overlap and that TC would have a view. SP assumed that we did not want service delivery metrics in this box? DB advised that it should measure our relationship with LCG, but any high-level LCG requirements should go through the Tier-1.

Action: SP to iterate with TC and bring this issue back to the PMB.

It was agreed that DB/SP need to progress the details of the Map over the next few months. DB noted that a cross-check would need to be done, checking all elements are included in the Project Map, including strategic priorities and staffing. This needed to be completed before the next Oversight Committee. SP hoped that it would be complete by GridPP20.

6. GridPP3 Meetings and Travel Policy
======================================

RM presented an outline of retrospective travel costs and suggested a review of the Travel Policy, with queries as to allocation of funds. Trish Mullins confirmed that the pressure year would be next year, and anything that can be done to save this year will help.
TM noted that GridPP will be given flexibility to carry forward any unspent travel. DB noted that the travel budget would reduce by 100k to 188k + inflation in the subsequent years. RM noted the issues as follows:
- continue support of non-GridPP (experiment) people?
- what level of support for conferences/workshops?
- what level of support to sysadmins and WLCG?

DB asked what we spent in GridPP2 for Collaboration Meetings etc - a broad-brush comparison would be useful if approximate percentages could be given? SL thought that we should not rule out any categories. DB noted that how much we have spent would help the decision for next year - and that at this stage, categories should not be ruled out. DB advised that if there are problems with STFC funding, GridPP would need to cut back and this was a wider issue which would need to be revisited - and could not be unequivocally decided just now.

GP asked about travel for Ganga? TD advised that that would be looked at as part of a larger discussion. SL pointed out that there was no budget for dissemination consumables in GridPP3 (posters, brochures, equipment hire etc) and these would also have to come out of the travel budget.

DB concluded that there was no obvious need for a written change in the travel policy at the moment, but he wished to understand the numbers better and in more detail. If there were further cuts from STFC then travel would form part of a larger discussion, but little flexibility would be anticipated.

JC asked for a concrete decision now regarding the WLCG in April. TD noted that this was core business and could not be compromised - the PMB agreed. It was advisable to book early to keep costs down.

Action: RM to provide more detailed figures on travel expenditure - broad-brush percentages would assist with decisions re travel in GridPP3.

SS was asked to hand out travel forms at Dublin ('overseas' claim on web would be submitted as 'actuals' and should be submitted before the end of March 2008).

7. Tier-1 Review/progress towards recommendations
==================================================

AS submitted the following report:

AS reported that the reviewers had recommended actions in the following areas:

3.1 The CASTOR level of effort is appropriate for steady-state operation, but given the current status, it needs to be monitored. Based on current input, we do not believe that a long-term redistribution of manpower in this area would lead to an optimum overall plan. In the short term, it is recognised that dedicated effort is required for testing. This should be regarded as transitionary. (Point-2.1)

Because of the current financial position within e-Science no decision can be made regarding additional support for CASTOR in FY08. We expect our CASE student to work full time on CASTOR until mid-summer and we have some money agreed to be available for funding a contractor for most likely at least 3 months in 2Q08. Beyond this the situation is uncertain.

DB asked if there was sufficient effort on CASTOR at present? In the short term funding was available (within FY07). DB asked if everything possible had been done to sort out CASTOR? AS advised yes, the issues list was much smaller and now general service was more of an issue, not CASTOR per se.

3.2 Details of the operational arrangements to meet the 24x7 requirements are required, which details the people and the systems that will be employed. (Point-2.2)

A plan and schedule had been drawn up before the review took place and could have been made available if we had realized it was required. It is available at:

http://www.gridpp.ac.uk/wiki

In summary: Nagios will be used as the exception monitoring system; this in turn will pass alarms to the Automate/SURE system which will (for now) be used to generate the callout to bleeper. Critical systems and responses have already been documented.
We plan to run a small team of first line on-call experts who will be able to resolve straightforward problems and carry out hardware repair. Several second line on-call experts will be available on any given day to provide cover for CASTOR, Databases and Tier-1 infrastructure. These staff all guarantee to respond within 2 hours. Most other Tier-1 staff have agreed to provide reasonable efforts response and will handle alarms when they are able but do not guarantee to do so.

3.3 The UK is seen as decoupled from the real user requirements. This particularly affects the Tier-1. The GridPP PMB should address this by reviewing the current requirements and providing the necessary requirements summary. (Point-2.3)

A meeting with experiments was held on 14th December. Information was collected from some experiments and the CCRC planning has exposed more.

3.4 The Tier-1 needs an overarching experiments coordinator to manage the interactions with the experiments and the User Board. (Point-2.3)

We agree that this would be a big improvement on the current situation. Success here depends on getting the right person for the role. We do not plan to name a person for this role until the recruitment described in 1.7 below. This recommendation could be implemented in a matter of days once 1.7 is resolved.

3.5 Given the user requirements summary in 3.3, quantify and implement the implications of the user requirements for UK Tier-1 services more fully. (Point-2.3)

We plan to complete by April. An initial attempt was made by Andrew Sansum in mid January to collate information based on the CCRC planning information then available but major holes still existed at that time and the experiments were asked to provide more detail. It is likely that sufficient information is now available for another attempt.

3.6 User support (this distinguishes the Tier-1 from the Tier-2s). The Service Delivery Plan should address potential detachment issues.
(Points 2.3 and 2.7)

The recruitment of additional staff part funded by the experiments and part funded by the Tier-1 should make a big difference in this area. The designation of an experiments coordinator will also help. The service delivery plan will address this (but does not do so yet in its current draft).

3.7 The 1.5 FTEs of GridPP-funded experiment support is not optimally deployed. Further effort is envisaged and a review of overall effectiveness in this area should be undertaken in conjunction with the experiments. It is important to recruit the right person/people with the right experimental background. (Point-2.3)

Recruitment is held up by the current STFC financial situation. PPD (who are running this recruitment) do not expect to be in a position to move this forward for perhaps 3 months while details of the program are finalized.

3.8 A dedicated formal weekly management meeting is suggested. Regular weekly meetings should be reduced in order to allow for a more formal weekly Tier-1 Project management meeting that provides coordinated input to the Deployment Team. This single body should manage the week-to-week problems and review progress on the Service Delivery Plan. (Points-2.4, 2.6 and 2.7)

This meeting is presently scheduled for 2pm on Tuesdays. The first meeting has been held. Membership is: Neil Geddes (Dept Head), John Gordon (Div Head), Andrew Sansum (Tier-1 Manager), David Corney (Castor GL), David Kelsey (Finance), Jeremy Coles (Production Manager), Robin Tasker (Network GL), Gordon Brown (Database GL), Martin Bly (Fabric Manager), Bonny Strong (Castor Service Manager), Matt Hodges (Resource Manager). A formal agenda is in the process of being drafted, but the meeting will review the changing experiment and project requirements, maintain the Service Delivery Plan (with its milestones and metrics), Staffing and Finance and set work priorities.
Incidents/Operational problems will be reviewed in detail elsewhere but this body will ensure they are being prioritized/handled/progressed satisfactorily.

3.9 The operational responses on the questionnaire were not developed sufficiently in areas of strategic importance. For example, network contention and data rates within the Tier-1 should be fully understood and not left until they become a problem. (Point-2.4)

We have commenced work on a full I/O model of the Tier-1 which will simulate hardware performance and I/O rates into individual CASTOR disk pools. The model will include load from both the wide area network and local batch farm. It has already been prototyped in Python and will be implemented in C++. Inputs will be the experiment requirements as defined in 3.3 and measured hardware performance (in the process of profiling disks and tape drives). We plan to be in a position to use the results from the model by May, although work is expected to continue on it through until August. If it works satisfactorily, the model will allow us to quantify I/O rates to different disk pools and parts of the batch farm. This in turn should allow us to validate planned configuration changes and new use cases and even help with planned hardware purchases.

3.10 Network planning needs to be quantified: the planning and management is evident in sub-areas, but the overall planning is not clear. We recommend a networking task force be set up prior to data-taking c.f. CASTOR team approach as part of an overall management plan. (Point-2.5).

By April we will hold a planning meeting to review experiment bandwidth requirements to be used as input to the architectural and capacity planning of the Tier-1 and site network. We have considered how we could set up a Networking Task Force to address performance issues as they are identified on the RAL network or external links. However, funded effort and staff expertise in this area is limited.
We recently invested money in training a member of the Tier-1 team in network performance tuning; unfortunately they have subsequently resigned. We plan to work to raise the overall expertise of various members of staff in this area (both Networking Group and e-Science) as it is clear that the current level of expertise is too low. Richard Hughes-Jones (now Dante) has agreed to come to RAL for a couple of days to provide training and work on any current problems. We expect that with the deployment of PerfSonar on the OPN Dante will become more involved in end to end network performance issues in that area.

3.11 There is too much emphasis on internal firefighting of issues that do not address the high-level user requirements. The Tier-1 management team (AS, JG and NG) should address this together with the Production Manager (JC). (Point-2.5)

The service delivery plan will ensure high-level requirements are prioritized. The Production Manager will attend the weekly management meeting.

3.12 An analysis strategy needs to be developed at the Tier-1 that meets the requirements of LHCb (since LHCb-UK relies on this service). (Point-2.5 and 2.3).

Discussions have commenced between Andrew Sansum, Raja Nandakumar and Nick Brook. A face to face meeting will happen in March.

3.13 The Tier-1 and the Production Manager should investigate ways to strengthen the integration of the Tier-1 with the wider deployment team. Greater involvement is required at this level. A dedicated meeting is suggested. This needs to be addressed as a first step towards a Tier-1 Service Delivery Plan. (Points-2.6 and 2.7).

Meetings and discussions have taken place between the Tier-1 manager and the Production Manager. UK site admins were asked where they consider more Tier-1 information sharing would be useful. The general feeling was that it is difficult to suggest anything specific as little was known about the internal functioning and work activities of the Tier-1.
In the first instance site admins suggested that exposing some sort of Tier-1 diary or blog would be useful and this prompted the idea of a bulletin. Increasing awareness of the public Tier-1 actions page would also help in this area. Finally, having a standing item to talk about current Tier-1 work at the UKI monthly operations meeting (and dteam) would also improve transparency. It was hoped that introducing standing items into current scheduled meetings will meet this need as there is already a general feeling within the dteam that there are too many meetings. By the end of February, the Tier-1 Grid team will trial a blog where key items of interest are identified each week for publication on the blog. The management meeting may also identify items (to be discussed at the next meeting). These will ideally be pointers to existing updated material but may also include short stand-alone items. If the dteam find this useful, we will expand the blog to other parts of the Tier-1. A draft of the service delivery plan will be made available to the dteam once there is sufficient content to enable the dteam to consider and suggest modifications.

3.14 The UK deployment team needs to be informed of the Tier-1 deployment requirements. (Point-2.6)

The Production Manager will attend the weekly management meeting. A regular slot will be available at the dteam meeting where Tier-1 deployment requirements and information of interest can be discussed.

3.15 Quantification is required (at back of the envelope level) prior to defining the Service Delivery Plan. (Point-2.7)

The Service Delivery Plan will take account of quantitative requirements as far as possible. However, not all of these will be available by the time the preliminary draft of the Service Delivery Plan is scheduled to be completed (end of February).

3.16 RAL runs important UK-wide services e.g. the BDII and RBs.
Improved procedures for monitoring, announcing down-time and reporting problems should be established as part of the Service Delivery Plan. (Point-2.7)

There is work ongoing at CERN to allow SAM tests to be propagated to remote sites via a push from Nagios to our own internal monitoring system. We have asked to be involved in early testing of this when it becomes available. This should improve our response time for critical alarms. There is clearly a problem with the current process of announcing downtimes. The Tier-1 has a standard process to ensure that downtimes are registered properly in the GOCDB and broadcast via EGEE. Unfortunately, UK production staff find themselves swamped by announcements from the many sites that they use and don't necessarily notice announcements. The Tier-1 will raise the issue with the dteam, but preliminary discussions with various people indicate that a further mailing list will be unhelpful. A Tier-1 dashboard will help but it may be that better tools are required from EGEE to allow better visualization of scheduled/unscheduled downtimes.

3.17 Improved procedures for documenting and replicating the above services as part of a resilient Grid should be established via the Production Manager and together with the deployment team. (Point-2.7)

The Tier-1 manager and Production Manager have discussed this but a schedule/plan has not yet been agreed. We have also started to explore the possibility of locating some critical stateful services at Daresbury Laboratory and managing them remotely if required. This has the advantage that IP addresses/names are managed by the same set of DNS servers for both sites.

3.18 All current Tier-1 metrics should be reviewed with an emphasis on services and user-impact, incorporating input from the GridPP Project Manager. (Point-2.9)

A preliminary set of metrics has been incorporated into the draft Service Delivery Plan. These have included those proposed by the GridPP Project Manager.
Metrics will be established not only to measure the quality of service delivered to the users but also to ensure that internal systems and processes are working well.

3.19 STFC Networking and the Tier-1 should jointly plan to ensure that the future site network connectivity and performance aim to meet the Tier-1 networking requirements in a timely fashion. (Point-2.10)

A number of changes will take place to the way STFC and the Tier-1 manage the site network. A proposal has been made by Networking Group to set up a Technical Design Authority which will be responsible for establishing and maintaining the overall technical architecture of the RAL core network. It will have representation from the internal and external science communities and will contain external network experts to provide peer review of its program. The Tier-1 will wish to provide input of requirements to this authority. A RAL Network User Group will be set up which will provide accountability and review of day-to-day operation of the site network. Representation on this body is not yet agreed but e-Science would expect to be represented. The twice monthly meeting between Tier-1 Network experts and Networking Group will continue to consider implementation of planned changes to the network. Robin Tasker will attend the weekly Tier-1 management meeting.

3.20 We recommend the development and dissemination of a Tier-1 dashboard that presents a complete view of the monitoring for the centre. (Point-2.11)

We will develop a dashboard, although WLCG is also planning work in this area to present Tier-1 Dashboards.

DB noted that good progress was being made in some areas, but that this would need to be revisited at Dublin, although possibly the new Deployment Board would be the right forum.

8. Disaster Recovery Plan
==========================
JC had circulated a template and had received comments from the last PMB, and had now mapped the template to AS's documents.
JC's report provided an associated reference and a description of the disaster/failure, with a bulleted list of services and areas affected. Issues to be considered included: classification of the severity/impact (to help with prioritisation); an action plan; an individual to be allocated as 'lead' with a Deputy; contacts and timescales. For each failure a one-page summary sheet was to be completed. AS advised that a lot of the 'disasters' experienced were standard things which happened, and that the communication list would be long. TD agreed, noting that the report should be short and readable. JC advised that this sheet would document the disaster; the actual plan would be elsewhere. DB asked how many disaster templates would need to be completed. AS thought that they would number in the hundreds. JC noted that it depended on the level of the disaster itself. JG noted that it would not be maintainable if it included every smaller issue. DB agreed, noting that it would not converge. AS suggested that it needed to be deliverable, maintainable, and useful.

Action: TD noted that AS/JC should iterate and remove capturable items that were considered to be minor.

AS noted that for number 7, response co-ordination, there were various stages to this. TD noted that someone needed to be defined as 'lead' and someone as 'deputy'. AS suggested this be the project's decision - and it would depend on the level of the crisis. TD noted the following criteria:
- category of disaster
- who to inform
- who is in charge

The document needed to be fairly high-level. TD and DB noted that the template looked good but it was not to be issued in the hundreds. RJ noted that for experiments, such a 'disaster' would be loss of data. AS noted that for the Tier-1, it would be loss of CASTOR metadata. It was agreed to go with the template as provided by JC and see how it goes when it is actually used - then the issue can be revisited if necessary. JC to progress this.

9. AOCB
========
RJ initiated a discussion on job slots on the Tier-1. AS would get back to RJ, and it was noted that an occasional forum was required. It was understood that weekly CASTOR meetings took up a lot of time and might need to be reconfigured.

ACTIONS AS AT 01.02.08
======================
272.4 AS to check the current Tier-1 disaster recovery plan and circulate the existing version to the PMB. It was reported that this document does not exist, but it was planned to have one in the longer term. TD would incorporate in v0.4 anything that AS considered relevant. AS will check and advise additions.
277.2 DN to provide an update and re-evaluation of CMS/CASTOR deliverables. TD advised that there was a CMS/CASTOR document on deliverables which should be revised in light of the December '07 tests. DC to take the token for this now and iterate with DN.
277.5 Disaster Recovery 'Team B': SB, JC, TD, SP, DB to analyse the wider issues of disaster planning, mapped to the experiments' lists, and this work would include Project Management. A Recovery Plan was required. It was agreed that JC was in charge of this and the experiment input relating to subsets of the disaster plan. SB/JC to progress.
277.8 User Experience 'Team C': SB, SP, SL, with input from JC to deal with the issue of user experience and design of an easily-found lookup facility for grid error messages. SL reported that he had started the ATLAS wiki page and would circulate the url. SB was leading this with inputs from SP, SL and JC where needed. A new simple summary was required of all areas available plus a lookup/links facility, for the OC to review. This would include a list of most recent types of problems (possibly a 'top 12' for users - what the error means and the course of action to follow). SB to progress this.
280.7 JC to mention the issues (when approached by a VO with regard to joining) of the 'standard' 6-month introduction period, following which the VO must set up something specific to them, if appropriate. This was discussed at DTeam. JC to email GridPP VO members if possible - ongoing. This was a standing action - JC had discussed it with the Tier-2 Co-ordinators in relation to VO members. JC to send email.
280.8 JG to investigate the UKI ROC website - any change/progress, and report back. SB to iterate with JG in order to sign-off this item next week. Ongoing.
282.2 SP to progress the Project Map using the T1 service areas and input from the meeting.
282.6 JC and SB to progress existing 'disaster planning' template for next F2F meeting on 1st Feb. Involve experiments as necessary. This was a follow-up from the last F2F, and was to be distinguished from the 277.5 action, which is a longer-term one relating to the OC.
289.1 AS to provide an analysis of the ATLAS disk server failures on the RAID controller.
289.2 DC to check current situation regarding gLite WMS and SL4 - current status to be conveyed to DTeam.
289.3 JC to check the VOMS/-skipcacheck issue (in relation to UK CA certificate change) with Jens Jensen and raise the issue at an Operations meeting.
289.4 SP to speak to the KT person at STFC who assisted with the PIPSS case, to help with the post-competitive phase (in relation to EGEE only providing support to pre-competitive startup). SP to involve NG.
290.1 JC to write down membership of DTeam.
290.2 RJ, DC and GP to nominate experiment user representatives for the Deployment Board.
290.3 SL and DB to review the Tier-1 Board Terms of Reference and see what could be formally incorporated into the new Deployment Board Terms of Reference.
290.4 AS and JG to iterate regarding what could replace the Tier-1 Board.
290.5 All: to check their individual roles as outlined and advise DB of any required changes.
290.6 TD to contact Ian Bird and suggest that the GDB and MB UK membership changes as at 1st April 2008.
290.7 AS to provide numbers in the Quarterly Report for the Tier-1 as per the ones provided for Tier-2.
290.8 AS/SP to iterate regarding the financial summary in the Quarterly Reporting (eg: Outturn figures).
290.9 Quarterly Report for Tier-2 staff to be compiled by the Production Manager.
290.10 TD as Technical Director to provide a report showing effort figures; milestones & metrics; and a table of posts showing Technical Support.
290.11 DB to progress the situation at Manchester.
290.12 GP/SB/DC to define these Support posts and ensure they form a comprehensive basis for user support (both documentation and Grid access assistance), overseen by the UB Chair.
290.13 DB to complete the document re Reporting and Reporting Routes relating to staff, and circulate it; thereafter it would be posted on the website as a record.
290.14 RM to circulate the EGI Workshop Agenda.
290.15 JG to check with Malcolm Atkinson re attending the next EGI workshop in Rome (March).
290.16 NG noted that he had provided a draft paper relating to the end of EGEE III but would add information that addressed the period beyond 2011 and re-circulate.
290.17 Re the Project Map, SP would look at the EGI wiki, and NG would consider more inputs relating to box 6.2.
290.18 Regarding the LCG box on the Project Map, SP to iterate with TC and bring this issue back to the PMB.
290.19 DB/SP to progress the details of the Project Map over the next few months, cross-checking that all elements are incorporated, including strategic priorities and staffing. To be completed before the next Oversight Committee.
290.20 RM to provide more detailed figures on travel expenditure - broad-brush percentages would assist with decisions re travel in GridPP3.
290.21 SS to hand out travel forms at Dublin ('overseas' claim on web to be submitted as 'actuals' and should be submitted before the end of March 2008).
290.22 AS to get back to RJ regarding job slots at the Tier-1.
290.23 AS/JC to iterate on the Disaster Recovery template and remove capturable items that were considered to be minor.
290.24 JC to progress his suggested template to use when a crisis occurs - to be revisited subsequently at a PMB.

INACTIVE CATEGORY
=================
271.1 PMB to examine the issue of fibre breakage and outages, CERN-RAL OPN link, in one year's time, when actual data on breakages is available. Due date would be September '08.
271.3 Re CERN-RAL OPN link breakage and backup generally, PC to oversee the issue and collate info so that the PMB have something to revisit in one year's time. Due date September '08. It was noted that PC would circulate a revised document after discussion with ATLAS (RJ/PC/DN to iterate).
282.8 RM to monitor how R-GMA and networking issues impact on GridPP as matters progress. RM advised that this item should be moved to the 'inactive' category as it will develop over the coming months. RM discussed the issue with Steve Fisher and advised that support of R-GMA is required whilst APEL is dependent on it. RM reported that he has spoken to SF and there is currently no change to the R-GMA situation - process ongoing.

There was no other business, and the meeting closed at 3:45 pm. The next PMB would be at 1:00 pm on Monday 11 February.
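
Illustrative note on item 3.9: the Tier-1 response describes an I/O model that combines wide-area and batch-farm load on individual CASTOR disk pools, but the minutes give no detail of the Python prototype. The following is only a minimal sketch of that kind of calculation, assuming a simple additive load and linear scaling with the number of disk servers. The class and function names, the pool name and all figures are hypothetical and are not taken from the Tier-1 model.

```python
from dataclasses import dataclass


@dataclass
class DiskPool:
    """A hypothetical disk pool; all fields are illustrative assumptions."""
    name: str
    servers: int
    server_rate_mb_s: float  # assumed measured throughput per disk server

    def capacity_mb_s(self) -> float:
        # Assumes throughput scales linearly with server count,
        # which real hardware may not do.
        return self.servers * self.server_rate_mb_s


def pool_utilisation(pool: DiskPool, wan_mb_s: float, batch_mb_s: float) -> float:
    """Fraction of pool capacity used by combined WAN and batch-farm load."""
    return (wan_mb_s + batch_mb_s) / pool.capacity_mb_s()


# Invented numbers, for illustration only.
example_pool = DiskPool(name="examplePool", servers=10, server_rate_mb_s=50.0)
example_utilisation = pool_utilisation(example_pool, wan_mb_s=200.0, batch_mb_s=150.0)
print(f"{example_pool.name}: {example_utilisation:.0%} of capacity")
```

In this sketch a utilisation above 1.0 would indicate that the assumed load exceeds the pool's capacity, which is the kind of check that could be applied to a planned configuration change.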
