UKHEPGRID Archives

UKHEPGRID@JISCMAIL.AC.UK


Subject: Minutes of the 289th GridPP PMB meeting
From: Tony Doyle <[log in to unmask]>
Reply-To: Tony Doyle <[log in to unmask]>
Date: Wed, 30 Jan 2008 14:48:26 +0000
Content-Type: MULTIPART/MIXED
Parts/Attachments: TEXT/PLAIN (20 lines), 080128.txt (1 lines)

Dear All,

     Please find attached the latest weekly GridPP Project Management 
Board Meeting minutes. The latest minutes can be found each week in:

http://www.gridpp.ac.uk/php/pmb/minutes.php?latest

as well as being listed with other minutes at:

http://www.gridpp.ac.uk/php/pmb/minutes.php

Cheers, Tony
________________________________________________________________________
Prof. A T Doyle, FInstP FRSE                       GridPP Project Leader
Rm 478, Kelvin Building                      Telephone: +44-141-330 5899
Dept of Physics and Astronomy                  Telefax: +44-141-330 5881
University of Glasgow                   EMail: [log in to unmask]
G12 8QQ, UK                 Web: http://ppewww.physics.gla.ac.uk/~doyle/
________________________________________________________________________


GridPP PMB Minutes 289 - 28th January 2008
==========================================

Present: Tony Doyle, Sarah Pearce, Roger Jones, Stephen Burke, David Britton, Steve Lloyd, Tony Cass, John Gordon, Jeremy Coles, Peter Clarke, Glenn Patrick, Andrew Sansum, Dave Colling, Suzanne Scott (Minutes)

Apologies: David Kelsey, Robin Middleton, Neil Geddes

1. F2F Agenda
==============

DB reported that he had circulated a draft Agenda but had received no comments. Items to be discussed included: the composition of the Board and the DB, Quarterly Reporting, a brief item on Dissemination, progress of milestones and metrics, travel policy, how the Tier-1 responded to the Tier-1 Site Review and implementation plans required, disaster recovery planning, EGEE/EGI. There would be a break in the middle of the meeting for a Group Leaders' phone conference in order to report issues (to include TD, RJ and SL). The PMB agreed the Agenda. DB noted that if there were any other items, to let him know. DC advised that he might be delayed due to his flight, but hoped to arrive around 09:30.

2. LHCb Funding
================

GP had circulated an email summary of LHCb requirements in relation to another software course on distributed analysis and Grid computing. GP noted that the funding requirement was similar to last year but new people were involved - 6k was requested from GridPP for accommodation and room costs, LHCb would cover travel and meal costs. The course was likely to take place sometime in March. DB asked how many people would be involved? Around 30. JG asked how much money had been awarded last time? A similar sum. TD noted the requirement for an updated version of the LHCb proposal from last time, which would be dealt with by TD/RM (similar to the recent ATLAS request). DB recommended support for the LHCb request. TD noted that the PMB approved the request in outline, subject to receipt of a one-page proposal case. GP would submit this.

3. GridPP20
============

TD advised that the registrations deadline was this Wednesday. It was noted that numbers were down this year for a variety of reasons, but probably due to the requirement to book early. TD would put out another reminder tomorrow (Tue). TD requested that PMB members ensure that people within their groups have thought about attendance; TD had already spoken about this to AS. TD advised that accommodation could not be held following the end of January but that GridPP registrations could remain open following further negotiations. SS would find out if the Harcourt Hotel might allow more time [emailed following the meeting]. TD noted that experiments' section names were not yet there. It was advised that they were still to come. Raja Nandakumar and Mingchao Ma were awaiting visas.

4. AOCB
========

None.

STANDING ITEMS
==============

SI-1 Dissemination Officer's Report
------------------------------------

SP reported two news items in preparation, 1) Neasan O'Neill was attending the All Activities meeting; 2) SP was awaiting final approval for a report on ELSSI at Glasgow. SP asked whether a press release was required for CCRC08. TD advised that afterwards, once achievements were known, would be more appropriate. SP would go forward on the basis that there may be a release in due course. TD advised that this could be either March or May. SP reported that the LHC@Home bid was due to be submitted to STFC on Thursday. SP advised that she had been in contact with BBC Radio 4 who were planning a day of programmes about the LHC and may want someone in the studio who knew about data processing and the Grid. SP was going to meet them in March regarding Grid input to the day. SP advised that she was working on the Project Map and hoped to be able to remit this to the PMB F2F on Friday morning.

SI-2 Tier-1 Manager's Report
-----------------------------

AS provided the following report:

1) Tenders
   a) Disk tender - installation is complete and supplier load test is running. This is scheduled to end on Friday after which our own 29 day load test will commence.
   b) CPU tender - Order placed and scheduled for delivery 28 February.
   c) Tape drive purchase - Order for 6 drives has been placed.
   d) Non-Capacity hardware order has been placed.
   e) Oracle server hardware upgrade order has been placed.
   f) An order for a 32 port non-blocking 10Gb switch is expected to be placed shortly (e-Science funded). This will be the new core switch of the Tier-1 network.

2) Work on the power supply is proceeding - work on the first transformer is now complete and work on the second transformer has started and will continue for 2-3 further weeks.

3) Disk server failures
   Last week we suffered a severe filesystem corruption on an ATLAS disk server after the RAID controller took an already offlined drive back into the RAID set. This led to corruption of filesystem metadata and the filesystem being marked as read only. On investigation many bad data blocks were found and as it was not possible to identify which files had been corrupted the filesystem was written off and over 2600 ATLAS files lost. This is the second incident of this kind. We do not believe that the drive should have been accepted back into the array and will contact the supplier this week to escalate the matter with the controller manufacturer.

There was a discussion on investigation and recovery of data. TD noted that it was a reasonably manageable problem but asked AS if he expected this would happen again? Were any statistics available? AS noted that this issue could be a systematic problem rather than a straight disk failure, so it was hard to tell at this stage. AS noted that with RAID6 it would be less likely to experience such issues.
TD asked whether the next disks were RAID6, and if so would that assist at controller-level error? AS confirmed yes, but it would be hard to say completely. TD asked if AS could assess the level of the problem and provide a review? AS confirmed yes.

Service

1) SAM availability for last week was 99%.

2) CASTOR
   c) Preparations for CCRC08
      i) ATLAS: Service Classes/Space tokens set up. Disk pools fully configured and ready.
      ii) LHCB: Service classes/Space tokens set up. Disk pools fully configured and ready.
      iii) CMS have now decided how to allocate disk pools and deployment of the last disk servers will commence.
      iv) Alice have now made a request to the UB for disk space. It is not likely that we will be able to iterate the allocation changes, redeploy the disk and implement a CASTOR instance (with an untested xrootd interface) in time for CCRC08.

3) SL4 Migration
   The SL4 UI build is now working and a test system has been installed. One of the production UIs will be retasked to provide an SL4 service this week.

4) On-call: We have now completed our analysis and documented critical systems and alarms for on-call. We have also tested the full callout chain from alarm to pager.

Progress to Grid Only Access
============================

This standing item documents the status of work towards achieving GRIDPP milestone 0.18 "Access to Tier-1 resources by Grid Interfaces Only"

1) We are still finalising the list of users to be allowed to continue qsub. This was scheduled to be completed last week. Once we have the list we will terminate qsub with no notice.

2) Message of the day is in place for the termination of the interactive service at the end of February.

SI-3 Production Manager's Report
---------------------------------

JC provided the following report:

1) UKI is the TPM backup team this week for EGEE. We are next on operations duty (backup) from 2nd February. Some thought needs to be given to who provides the support from April.

2) There is a little more clarity about sites that are expected to form a full part in the CCRC08 run in February, though it has not been straightforward to get required information about T2 space token requirements (the publishing of the space token (needed for accounting systems) requires GLUE 1.3 currently only used at the T1) and hardware to be committed. Several sites are scheduled to undertake SE upgrades this week. UK SRM 2.2 instances can be seen here: http://tinyurl.com/348t52.

3) The shared cluster at Edinburgh (ECDF) took a while to become operational (some lessons on trying to integrate shared clusters) but now publishes just over 500 job slots to the Grid. The site is now entering its installation Phase 2.

4) The RB is still being seen as a weak point in the UK service. Checking the tests this morning revealed that all the UK production RBs appear to have problems (http://wn3.epcc.ed.ac.uk/srm/xml/srm_versions_bar). We need a strategy for moving to the gLite-WMS or to revisit the problems users are having with the RBs. Has the T1 had user feedback on the gLite-WMS installed at the end of last year? DC advised that the gLite WMS was not certified for SL4 and problems were being worked on at present. The situation would be continuing meantime and was being monitored, and would be revisited once the SL4 version was certified. Once certified there should be a quick introduction. TD noted that this information should be circulated to DTeam. DC will check the current situation. SL will investigate advertising his RB switches in the interim.

5) The UK CA certificate change (carried out as a precaution in case the private key was compromised late last year) appears not to be as transparent as was hoped. The first case of a new certificate not being recognized by various VOMS servers has been reported. Instructions to rectify the situation were created here http://www.gridpp.ac.uk/wiki/Instruction_for_VO_administrators under CA Rollover.
A fix for the previously reported bug (https://savannah.cern.ch/bugs/?func=detailitem&item_id=20789) has been implemented but requires VOMS to be run with the option skipcacheck enabled. It is suspected that the CERN instance did not have this enabled. It was noted that this was an issue to be brought up at an Operations meeting. JC will check with Jens Jensen and raise the issue at Ops.

6) Imense Ltd (the company behind Cambridge ontology) are starting to examine options for expanding their access to resources. In following up within EGEE I received the following from Gabriel Zaquine:

   After examination of each business project, EGEE will support some of them, only during the pre-competitive phase. Due to GEANT, EGEE can't be used for business purposes. However, EGEE will encourage and support companies willing to exploit the EGEE technology (e.g. providing services or applications based on gLite on their own infrastructure).

   Is the GridPP/UK message back to camont the same, i.e. there is no pay for use option available once their project moves to a competitive stage? It was asked whether NGS could take over this issue? TD noted that our response would be the same as the EGEE response, even NGS would find it difficult to support 'real' use over the JANET network. SP advised that she could speak to the KT person at STFC who assisted with the PIPSS case, to help with the post-competitive phase. TD noted that NG could also be involved.

Meetings:

A) There will be a pre-GDB CCRC08 F2F meeting on Tuesday 5th February at CERN: http://indico.cern.ch/conferenceDisplay.py?confId=26922

B) There is a GDB next Wednesday: http://indico.cern.ch/conferenceDisplay.py?confId=20226.

C) There was an ATLAS UK operations meeting on 17th January (omitted from last week's summary): http://indico.cern.ch/conferenceDisplay.py?confId=26907.

SI-4 LCG Management Board Report
---------------------------------

TD noted that there was nothing more to say at present, although the Alice issue might come up in relation to UK support.

SI-5 Documentation Officer's Report
------------------------------------

SB noted that there was nothing to add to last week's report. Issues were ongoing.

REVIEW OF ACTIONS
=================

272.4 AS to check the current Tier-1 disaster recovery plan and circulate the existing version to the PMB. It was reported that this document does not exist, but it was planned to have one in the longer term. TD would incorporate in v0.4 anything that AS considered relevant. AS will check and advise additions. AS noted that he had the basic document laid out, and had completed one section in detail. Ongoing.

277.2 DN to provide an update and re-evaluation of CMS/CASTOR deliverables. TD advised that there was a CMS/CASTOR document on deliverables which should be revised in light of the December '07 tests. DC to take the token for this now and iterate with DN. DC reported that he had discussed this with DN. Ongoing.

277.5 Disaster Recovery 'Team B': SB, JC, TD, SP, DB to analyse the wider issues of disaster planning, mapped to the experiments' lists, and this work would include Project Management. A Recovery Plan was required. It was agreed that JC was in charge of this and the experiment input relating to subsets of the disaster plan. SB/JC to progress. Ongoing.

277.8 User Experience 'Team C': SB, SP, SL, with input from JC to deal with the issue of user experience and design of an easily-found lookup facility for grid error messages. SL reported that he had started the ATLAS wiki page and would circulate the url. SB was leading this with inputs from SP, SL and JC where needed. A new simple summary was required of all areas available plus a lookup/links facility, for the OC to review.
This would include a list of most recent types of problems (possibly a 'top 12' for users - what the error means and the course of action to follow). SB to progress this. Ongoing.

280.6 JG brought up the issue of the biomed VO and 'sieving' at the ROC Manager's meeting - a broadcast is to go out from EGEE which will be helpful in underlining acceptable use of Grid resources and would act as a reminder to VOs about the policy they have signed-up to in relation to their users. JC had now emailed the Chair to have this discussed. JG reported that a new VO was now set up but there were few resources allocated to it as yet, although the home Institute may be giving funds. Pending further info from JC. EGEE broadcast action ongoing - JG will bring-up the broadcast action at the ROC VO meeting tomorrow (Tue 15). JG reported that Heinz may bring up the issue of being banned. JG will provide an update at the next meeting, probably the F2F. Done, item closed.

280.7 JC to mention the issues (when approached by a VO with regard to joining) of the 'standard' 6-month introduction period, following which the VO must set-up something specific to them, if appropriate. This was discussed at DTeam. JC to email GridPP VO members if possible - ongoing. This was a standing action - JC had discussed it with the Tier-2 Co-ordinators in relation to VO members. JC to send email. JC reported that he has brought this up but we do not have stable regional VOs as yet, to which people can migrate. VOs have been set-up at VOMS but not at sites. Ongoing.

280.8 JG to investigate the UKI ROC website - any change/progress, and report-back. SB to iterate with JG in order to sign-off this item next week. Ongoing.

282.2 SP to progress the Project Map using the T1 service areas and input from the meeting. Ongoing.

282.6 JC and SB to progress existing 'disaster planning' template for next F2F meeting on 1st Feb. Involve experiments as necessary.
This was a follow-up from the last F2F, and was to be distinguished from 277.5 action which is a longer-term one relating to the OC.

286.5 AS to organise a service message at login relating to non-Grid access being withdrawn. Done, item closed.

287.3 url to be sent to FP, RJ, DC, relating to CCRC08 with planning meeting details, so that the summary of experiment requirements can be checked to ensure no major mismatch. This was circulated, no response received. RM to re-circulate urls with deadline. [Done following the meeting]. Done, item closed.

288.1 All: to email DB if planning not to be in Glasgow the night before the F2F meeting. Done, item closed.

ACTIONS AS AT 28.01.08
======================

272.4 AS to check the current Tier-1 disaster recovery plan and circulate the existing version to the PMB. It was reported that this document does not exist, but it was planned to have one in the longer term. TD would incorporate in v0.4 anything that AS considered relevant. AS will check and advise additions.

277.2 DN to provide an update and re-evaluation of CMS/CASTOR deliverables. TD advised that there was a CMS/CASTOR document on deliverables which should be revised in light of the December '07 tests. DC to take the token for this now and iterate with DN.

277.5 Disaster Recovery 'Team B': SB, JC, TD, SP, DB to analyse the wider issues of disaster planning, mapped to the experiments' lists, and this work would include Project Management. A Recovery Plan was required. It was agreed that JC was in charge of this and the experiment input relating to subsets of the disaster plan. SB/JC to progress.

277.8 User Experience 'Team C': SB, SP, SL, with input from JC to deal with the issue of user experience and design of an easily-found lookup facility for grid error messages. SL reported that he had started the ATLAS wiki page and would circulate the url. SB was leading this with inputs from SP, SL and JC where needed.
A new simple summary was required of all areas available plus a lookup/links facility, for the OC to review. This would include a list of most recent types of problems (possibly a 'top 12' for users - what the error means and the course of action to follow). SB to progress this.

280.7 JC to mention the issues (when approached by a VO with regard to joining) of the 'standard' 6-month introduction period, following which the VO must set-up something specific to them, if appropriate. This was discussed at DTeam. JC to email GridPP VO members if possible - ongoing. This was a standing action - JC had discussed it with the Tier-2 Co-ordinators in relation to VO members. JC to send email.

280.8 JG to investigate the UKI ROC website - any change/progress, and report-back. SB to iterate with JG in order to sign-off this item next week. Ongoing.

282.2 SP to progress the Project Map using the T1 service areas and input from the meeting.

282.6 JC and SB to progress existing 'disaster planning' template for next F2F meeting on 1st Feb. Involve experiments as necessary. This was a follow-up from the last F2F, and was to be distinguished from 277.5 action which is a longer-term one relating to the OC.

289.1 AS to provide an analysis of the ATLAS disk server failures on the RAID controller.

289.2 DC to check current situation regarding gLite WMS and SL4 - current status to be conveyed to DTeam.

289.3 JC to check the VOMS/-skipcacheck issue (in relation to UK CA certificate change) with Jens Jensen and raise the issue at an Operations meeting.

289.4 SP to speak to the KT person at STFC who assisted with the PIPSS case, to help with the post-competitive phase (in relation to EGEE only providing support to pre-competitive startup). SP to involve NG.

INACTIVE CATEGORY
=================

271.1 PMB to examine the issue of fibre breakage and outages, CERN-RAL OPN link, in one year's time, when actual data on breakages is available. Due date would be September '08.

271.3 Re CERN-RAL OPN link breakage and backup generally, PC to oversee the issue and collate info so that the PMB have something to revisit in one year's time. Due date September '08. It was noted that PC would circulate a revised document after discussion with ATLAS (RJ/PC/DN to iterate).

282.8 RM to monitor how R-GMA and networking issues impact on GridPP as matters progress. RM advised that this item should be moved to the 'inactive' category as it will develop over the coming months. RM discussed the issue with Steve Fisher and advised that support of R-GMA is required whilst APEL is dependent on it. RM reported that he has spoken to SF and there is currently no change to the R-GMA situation - process ongoing.

The next PMB would be a F2F meeting on Friday 1st Feb in Glasgow.
