GridPP PMB Minutes 329 - F2F meeting @ RAL_Storage Review I - Fri 21st November 2008
====================================================================================

Present: David Britton (Chair), Tony Doyle, Sarah Pearce (remote), Andrew Sansum, Roger Jones, Jeremy Coles, Steve Lloyd, Robin Middleton, John Gordon, Tony Cass, Neil Geddes, Dave Colling (remote), David Kelsey, Pete Clarke, Glenn Patrick

Invited speakers & reviewers: Chris Brew, Gordon Brown, David Corney, James Jackson, Norman McCubbin, Raja Nandakumar, Jamie Shiers, Graeme Stewart

Other members of the RAL Tier-1, Castor, Database and Storage teams: Martin Bly, Shaun De Witt, Keir Hawker, Jens Jensen, Bonny Strong, Matthew Viljoen

0. Overview
============
The GridPP PMB would like to maintain oversight of the RAL Tier-1 by meeting face-to-face at RAL on a biannual basis. This replaces, in part, the function of the old Tier-1 Board. These meetings are intended to concentrate on the most topical issue, rather than being just a general overview, with a view to agreeing strategy for the following period. Where appropriate, external experts will be invited to attend. This structure is intended to complement any internal oversight of the Tier-1 within RAL. The first of these meetings took place on 21st November 2008, with a focus on CASTOR and storage.

1. Context
===========
DB welcomed all and round-table introductions were made. DB explained the aims of the meeting: there had been a full Tier-1 review this time last year; at that time data was expected in 2008, and CASTOR had been broken in the run-up to the GridPP Oversight Committee meeting in October 2007. The next Oversight Committee meeting was coming up in December, and data was now expected in six months' time. The aim today was to review the current status and future plans for storage at the RAL Tier-1. Is the current approach the correct one? We need a structured and agreed process to move from today to data taking next year.
DB reported that a range of views had been expressed by the experiments on storage issues at RAL.

JamesJ presented the view of CMS. At one end of the scale, CMS see no significant problems. They are basically happy with CASTOR 2.1.7 and are satisfied that they have tested it to the required level for data-taking. They do not wish to make changes.

RN presented the view of LHCb. Although LHCb has suffered data loss with CASTOR, overall the experiment feels that the UK Tier-1 has responded well to its production demands. However, the demands and access patterns of large-scale user analysis have yet to be fully tested.

GS presented the view of ATLAS. ATLAS have experienced a considerable number of software and hardware problems, and do not consider the current performance satisfactory. The failure of the scheduler during "red-button" day (10th September) was particularly unfortunate. There have been periods of good performance, but overall the performance is patchy and the impression is one of fragility.

The "other experiments" have little experience with CASTOR, and the main factor for them is the support and documentation available to assist migration. The effort needed to support the main LHC experiments has meant that the smaller experiments (including ALICE) have necessarily received a lower priority.

In addition, there was User Board input from GP, and JG presented on Tier-1 staffing and communication.

2. Outputs
===========
Following the presentations and round-table discussion, the following consensus was reached:

1) All experiments would prefer, or are prepared, to remain with CASTOR 2.1.7 until we can be very sure that any future release is not going to cause problems.

2) The current process of testing releases is good (several problems were found in 2.1.7) but not sufficient. Functionality can be, and has been, tested prior to deployment, but we also need to test future releases at the load levels expected.
Given that CERN has mechanisms for generating loads, we recommend looking into the possibility of a final (post-RAL) certification at CERN, possibly involving people from the UK. That is, we would like to investigate the possibility of some kind of Tier-1 test bed at CERN to complement the test bed at RAL.

3) The Oracle database is problematic. Dedicated effort is required, and the current effort (0.8 FTE) is probably a little low. A key consideration is whether we can either reduce the load (which is much higher than that at CERN) or enhance the database to enable it to cope (perhaps by adding an additional RAC?).

4) More work needs to be done to try to expedite the resolution of Oracle issues, e.g. a WLCG-wide meeting with Oracle to address long-standing issues, and possibly a RAL DBA going to CERN to shadow experts there for a short period.

5) The different perspectives and experiences of ATLAS and CMS appear to originate from a combination of factors: the lack of embedded ATLAS technical effort at the Tier-1; possibly the much smaller file size used by ATLAS in some activities and the resultant load on the CASTOR infrastructure (databases etc.); ATLAS' greater sensitivity to disk failure (D1T0, D0T1 etc.); and ATLAS' greater reliance on the LFC. It is recommended that ATLAS consider these points.

6) Staffing effort is problematic. The Tier-1 needs to fill the posts that have been vacant for over a year and to ensure that staff are agile and responsive. In particular, the Production Team needs to be able to take up some of the day-to-day load from the CASTOR team, to allow the experts to focus on troubleshooting.

7) On the timescale of the end of GridPP3 we need to know whether CERN has been successful in making CASTOR more suitable for disk-only use, and we need to consider the long-term scenario for tape.

8) Nobody is pushing for effort to be invested in developing a short-term (i.e. deployable in ~9 months) "Plan B" for first data.
Alternatives have been considered, and all are felt to be most unlikely to produce a usable system on a short time-scale.

3. Conclusions
===============
3.1 The review suggests that there are issues on both the Tier-1 side and the ATLAS side.

3.2 At the Tier-1 we should stay with CASTOR 2.1.7 for now and try to ensure that it works reliably. We need to ensure continued support for this release from CERN, and we need to develop a more robust certification process for future releases. CASTOR upgrades will need to be reconsidered in February, when the final running schedule is known.

3.3 Attention needs to be concentrated on the Oracle database system, to ensure that it operates at an appropriate load and to engage Oracle better. The staffing at the Tier-1 needs to be brought up to the funded level, and careful consideration needs to be given to making the staff agile across the various domains in order to be responsive.

3.4 The ATLAS approach has proven less successful than that of LHCb or CMS. A 'black-box' Tier-1 service is not achievable while the requirements continue to evolve and the complex MSS system is still being developed. Although there is already quite extensive contact between ATLAS and the CASTOR/Tier-1 team, the experience of CMS, and to some extent LHCb, suggests that ATLAS would benefit enormously from embedding a clearly identified expert to work at the Tier-1 alongside the CASTOR and database teams. In addition, ATLAS should work, where possible, to moderate how the ATLAS computing model uses CASTOR.

Slides shown during the presentations are available at (password from David Britton or David Kelsey):
http://indico.cern.ch/conferenceDisplay.py?confId=45952

ACTIONS AS AT 21.11.08
======================
329.1 AS to initiate discussion on an improved testing strategy for future CASTOR releases.
329.2 AS to request that the database team look at the cost/benefit of reducing the loads/aligning more closely with CERN.