GridPP PMB Minutes 329 - F2F meeting @ RAL_Storage Review I - Fri 21st November 2008
====================================================================================

Present: David Britton (Chair), Tony Doyle, Sarah Pearce (remote), Andrew Sansum, Roger Jones, Jeremy Coles, Steve Lloyd, Robin Middleton, John Gordon, Tony Cass, Neil Geddes, Dave Colling (remote), David Kelsey, Pete Clarke, Glenn Patrick

Invited speakers & reviewers: Chris Brew, Gordon Brown, David Corney, James Jackson, Norman McCubbin, Raja Nandakumar, Jamie Shiers, Graeme Stewart

Other members of the RAL Tier-1, Castor, Database and Storage teams: Martin Bly, Shaun De Witt, Keir Hawker, Jens Jensen, Bonny Strong, Matthew Viljoen

0. Overview
============
The GridPP PMB would like to maintain oversight of the RAL Tier-1 by meeting face-to-face at RAL on a biannual basis. This replaces, in part, the function of the old Tier-1 Board. These meetings are intended to concentrate on the most topical issue, rather than being just a general overview, with a view to agreeing strategy for the following period. Where appropriate, external experts will be invited to attend. This structure is intended to complement any internal oversight of the Tier-1 within RAL. The first of these meetings took place on 21st November 2008, with a focus on CASTOR and storage.

1. Context
===========
DB welcomed all and round-table introductions were made. DB explained the aims of the meeting: there had been a full Tier-1 review this time last year; at that time data was expected in 2008, and CASTOR had been broken in the run-up to the GridPP Oversight Committee meeting in October 2007. The next Oversight Committee meeting was coming up in December, and data was now expected in six months' time. The aim today was to review the current status and future plans for storage at the RAL Tier-1. Is the current approach the correct one? We need a structured and agreed process to move from today to data taking next year.
DB reported that a range of views had been expressed by the experiments on storage issues at RAL.

JamesJ presented the view of CMS. At one end of the scale, CMS see no significant problems. They are basically happy with CASTOR 2.1.7 and are satisfied that they have tested it to the required level for data-taking. They do not wish to make changes.

RN presented the view of LHCb. Although LHCb has suffered data loss with CASTOR, overall the experiment feels that the UK Tier-1 has responded well to its production demands. However, the demands and access patterns of large-scale user analysis have yet to be fully tested.

GS presented the view of ATLAS. ATLAS have experienced a considerable number of software and hardware problems, and do not consider the current performance satisfactory. The failure of the scheduler during "red-button" day (10th September) was particularly unfortunate. There have been periods of good performance, but overall the performance is patchy and the impression is one of fragility.

The "other experiments" have little experience with CASTOR, and the main factor for them is the support and documentation available to assist migration. The effort needed to support the main LHC experiments has meant that the smaller experiments (including ALICE) have necessarily received a lower priority.

In addition, there was User Board input from GP, and JG presented on Tier-1 staffing and communication.

2. Outputs
===========
Following the presentations and round-table discussion, the following consensus was reached:

1) All experiments would prefer, or are prepared, to remain with CASTOR 2.1.7 until we can be very sure that any future release is not going to cause problems.

2) The current process of testing releases is good (several problems were found in 2.1.7) but not sufficient. Functionality can be, and has been, tested prior to deployment, but we also need to test future releases at the load levels expected.
Given that CERN has mechanisms for generating loads, we recommend looking into the possibility of a final (post-RAL) certification at CERN, possibly involving people from the UK. That is, we would like to investigate the possibility of some kind of Tier-1 test bed at CERN to complement the test bed at RAL.

3) The Oracle database is problematic. Dedicated effort is required, and the current effort (0.8 FTE) is probably a little low. A key consideration is whether we can either reduce the load (which is much higher than that at CERN) or enhance the database to enable it to cope (perhaps by adding an additional RAC?).

4) More work needs to be done to try to expedite the resolution of Oracle issues, e.g. a WLCG-wide meeting with Oracle to address long-standing issues, and possibly a RAL DBA going to CERN to shadow experts there for a short period.

5) The different perspectives and experiences of ATLAS and CMS appear to originate from a combination of factors: the lack of embedded ATLAS technical effort at the Tier-1; possibly the much smaller file size used by ATLAS in some activities and the resultant load on the CASTOR infrastructure (databases etc.); ATLAS' greater sensitivity to disk failure (D1T0, D0T1 etc.); and ATLAS' greater reliance on the LFC. It is recommended that ATLAS consider these points.

6) Staffing effort is problematic. The Tier-1 needs to fill the posts that have been vacant for over a year and to ensure that staff are agile and responsive. In particular, the Production Team needs to be able to take up some of the day-to-day load from the CASTOR team, to allow the experts to focus on troubleshooting.

7) On the timescale of the end of GridPP3 we need to know whether CERN has been successful in making CASTOR more suitable for disk-only use, and we need to consider the long-term scenario for tape.

8) Nobody is pushing for effort to be invested in developing a short-term (i.e. deployable in ~9 months) "Plan B" for first data.
Alternatives have been considered, and all are felt to be most unlikely to produce a usable system on a short time-scale.

3. Conclusions
===============
3.1 The review suggests that there are issues on both the Tier-1 side and the ATLAS side.

3.2 At the Tier-1 we should stay with CASTOR 2.1.7 for now and try to ensure that it works reliably. We need to ensure continued support for this release from CERN, and we need to develop a more robust certification process for future releases. CASTOR upgrades will need to be reconsidered in February, when the final running schedule is known.

3.3 Attention needs to be concentrated on the Oracle database system, to ensure that it operates at an appropriate load and to engage Oracle better. The staffing at the Tier-1 needs to be brought up to the funded level, and careful consideration needs to be given to making the staff agile across the various domains in order to be responsive.

3.4 The ATLAS approach has proven less successful than that of LHCb or CMS. A 'black-box' Tier-1 service is not achievable while the requirements continue to evolve and the complex MSS system is still being developed. Although there is already quite extensive contact between ATLAS and the CASTOR/Tier-1 team, the experience of CMS, and to some extent LHCb, suggests that ATLAS would benefit enormously from embedding a clearly identified expert to work at the Tier-1 alongside the CASTOR and database teams. In addition, ATLAS should work, where possible, to moderate how the ATLAS computing model uses CASTOR.

Slides shown during the presentations are available at (password from David Britton or David Kelsey):
http://indico.cern.ch/conferenceDisplay.py?confId=45952

ACTIONS AS AT 21.11.08
======================
329.1 AS to initiate discussion on an improved testing strategy for future CASTOR releases.
329.2 AS to request that the database team look at the cost/benefit of reducing the loads/aligning more closely with CERN.