GridPP PMB Minutes 246 - 12th February 2007
===========================================

Present: John Gordon, Sarah Pearce, Roger Jones, Stephen Burke, David Britton, David Kelsey, Dave Newbold, Steve Lloyd, Tony Cass, Jeremy Coles, Andrew Sansum

Apologies: Peter Clarke, Tony Doyle, Robin Middleton, Glenn Patrick

1. Site Readiness Review
=========================

Members reported on the preparation for the reviews:

ScotGrid - No dates have been fixed yet. The original plan to append to GridPP18 has been abandoned because key people are unavailable due to DTeam/User Board meetings that week.

NorthGrid - Still hoping to find dates in April.

SouthGrid - The team has agreed on March 9, 13, and 14 but has not yet got the agreement of all sites.

London - Agreed dates in late April. Still checking with sites on the order.

DB said he would have a draft of the review announcement and questionnaire within a few days.

ACTION DB Produce draft of review documents.

2. Oversight Committee
=======================

The meeting planned for 8th February was cancelled on the day after several members withdrew because of weather and related childcare problems. No date has yet been fixed for the replacement meeting.

STANDING ITEMS
==============

SI.1 Dissemination Officer's Report
-----------------------------------

There had been one news item in the last week, on the last CPU upgrade. Several more were in preparation:
- Hannah Cumming on the Total VO
- RJ on ATLAS
- SB on gLite User Guide
- JET
- WLCG Workshop - JC coordinating input from several people.
- DN on CMS

A report on the recent EUGrid PMA meeting at RAL will be in ISGW next week.

Neasan O'Neill will be visiting the Science Museum this week to discuss their LHC exhibit.

SI.2 Tier-1 Manager's Report
----------------------------

AS provided the following report:

Hardware:

1) Supplier One delivery - Integration into dCache is complete.
The CASTOR team have all 15 servers planned for deployment but have a problem integrating them into CASTOR. The difficulty is that "garbage collection" cannot be made to work (yet), despite the fact that it works fine on identical hardware already deployed into CSA06.

2) Supplier Two Delivery (I) - Acceptance testing completed. Servers will need to be deployed into capacity to meet the March UB allocations.

3) Supplier Two Delivery (II) - Acceptance tests have started and should finish by the 9th March.

4) Tape Purchase - 350TB of media has been ordered - delivery in 1 week's time.

5) Tape drive purchase - 3 drives have been ordered - delivery in about 2 weeks' time.

6) Tape drive servers - Ordered - delivery in about 4 weeks' time (estimate).

Service:

The top level BDII will be replicated onto two new servers by Wednesday.

The FTS was unavailable for much of the weekend following multiple core dumps filling up a partition. It is scheduled to be down on Tuesday.

In the absence of Derek Ross (leave) we were unable to meet last week to discuss the SL4 rollout. However, I have informed CMS that we will definitely not be able to provide SL4 within 1 month.

Job CPU efficiency for January fell to 64%. This appears to be dominated by LHCb, who suffered 38% efficiency for a large share of total resources. LHCb believe that this is caused by performance and reliability problems in RAL's dCache - we are investigating. Testing of the dCache 1.7 upgrade has been completed successfully and this is planned to be deployed ASAP - it may help resolve this issue (although not specifically addressed in the revision history).

JC asked if the delay in deploying disk to Castor was affecting CMS. Yes, but ATLAS is affected more.

JC asked when the second RB would be deployed. This happened before Christmas.

SI.3 Production Manager's Report
--------------------------------

JC provided the following report:

1) The WLCG GDB was held last Wednesday (http://indico.cern.ch/conferenceDisplay.py?confId=8469). John Gordon was elected as the next GDB chairperson. Markus Schulz gave an update on gLite on SL4 (we do not expect anything in production until at least April, so sites are going to need to use published workarounds ahead of the experiment Full Dress Rehearsals (FDRs)). AS raised worries that the inevitable partial deployment of SL would fragment the CPU cluster and reduce efficiency still further.

2) Utilisation of CPU has been low across the UK (<40%) for about the last week as LHCb have stopped running jobs while a bug in their production code is fixed. Meanwhile, enablement of camont.gridpp.ac.uk and total.vo.gridpp.ac.uk on the GridPP RBs and CEs is taking longer than expected.

3) We continue to see a lot of SE problems as they become more widely tested and used (firewall, gridftp doors in dCache, SE information publishing, full disks and implementation of ATLAS ACLs are some recent issues). The continued instability in the top-level BDII is not helping the situation.

4) A Tier-2 board (VRVS or phone) meeting is scheduled for this Friday 10:00-13:00. Discussion topics include: meeting MoU disk requirements, cover at sites, experiment-site interaction and OS policies.

5) The Deployment Team are trialling an Operations Blog in the hope that it will provide a consistent view on problems being resolved each day (http://gridpp-ops.blogspot.com/). We already have Tier-2 Blogs and many other information sources, so there are concerns about whether this new blog is worthwhile.

SI.4 LCG Management Board Report
--------------------------------

JG reported that SRM2.2 was not expected to be tested at Tier1s until March. This would impact experiments' plans to use it.

Most of the rest of last week's MB was discussion of the Harry Tables, which document Tier1 deployment plans in finer detail than the MoU and match them against experiment requirements. They duplicate information also gathered through Quarterly Progress Reports. A document clarifying who reports where and which information is regarded as definitive would be prepared to focus this discussion.

SI.5 Documentation Officer's Report
-----------------------------------

SB reported that the most significant recent issue was the name change of the CERN wiki, which had broken documentation links across the world. (Note: CERN have since added a redirection link from the old wiki address.)

REVIEW OF ACTIONS
=================

236.6 GP to summarise and circulate the LHCb model as a basis for discussion. GP would now focus on this as two models were now available. Ongoing.

245.1 DB to report on the user statistics. DB had prepared a transparency for the OC. Item closed.

245.2 JC to forward information on WLCG meeting to SP. Ongoing.

245.3 JC to contact the Tier-2 co-ordinators regarding reports for the Manchester EGEE User Forum. Done. Item closed.

245.4 SP to send out an All Hands and EGEE User Forum roundup to UKHEPGRID. EGEE UF done. Ongoing.

ACTIONS AS AT 12.02.07
======================

236.6 GP to summarise and circulate the LHCb model as a basis for discussion.

245.2 JC to forward information on WLCG meeting to SP.

245.4 SP to send out an All Hands reminder to UKHEPGRID.

246.1 DB to produce draft of review documents.

The meeting closed at 13.55.