UKHEPGRID Archives

UKHEPGRID@JISCMAIL.AC.UK



Subject: Minutes of the 423rd GridPP PMB meeting
From: David Britton <[log in to unmask]>
Reply-To: David Britton <[log in to unmask]>
Date: Wed, 27 Apr 2011 13:46:23 +0100
Content-Type: multipart/mixed
Parts/Attachments: text/plain (80 lines), 110418.txt (397 lines)

Dear All,

Please find attached the minutes of the 423rd GridPP Project Management
Board meeting.

The latest minutes can be found each week at:

http://www.gridpp.ac.uk/php/pmb/minutes.php?latest

and are listed together with earlier minutes at:

http://www.gridpp.ac.uk/php/pmb/minutes.php

Cheers, Dave.

-- 
________________________________________________________________________
Prof. David Britton                          GridPP Project Leader
Rm 480, Kelvin Building                      Telephone: +44 141 330 5454
School of Physics and Astronomy              Telefax: +44-141-330 5881
University of Glasgow                 EMail: [log in to unmask]
G12 8QQ, UK
________________________________________________________________________
GridPP PMB Minutes 423 (18.04.11)
=================================

Present: Dave Britton (Chair), Dave Colling, Robin Middleton, Dave Kelsey,
Jeremy Coles, Glenn Patrick, Steve Lloyd, John Gordon, Pete Clarke,
Roger Jones, Andrew Sansum.

Apologies: Tony Doyle, Pete Gronbech, Tony Cass, Neil Geddes.

1. Security Document
=====================

DK had circulated the updated security document, the production of which had
been a milestone for the last quarter of GridPP3. It would be a new PMB
document for the list. DB advised that he had read it and could not see
anything that would be objected to. DK noted that the thrust of the milestone
was to ensure that we knew what was happening in relation to operational
policy and security for GridPP4. DB thought it was a useful document - it
documented security status and was a completed milestone. He noted that it
could be referred to if there were a major security incident - would the
document be adequate in such a circumstance? DK confirmed that it would:
links to all procedural databases in EGI were provided, so all documents
were there and available. It was agreed to accept the updated version as a
PMB document. DK would forward it to SS for upload to the GridPP website.

2. Input to the Oracle meeting
===============================

It was understood that this was an opportunity for JG to give input to
Frederic. JG noted he had received feedback from the tape and database
people - the experience was that this was not a great service, yet cost a
lot of money. There ensued a discussion on tape drives and the maintenance
contract. JG advised that one-off problems were worth documenting. DB noted
they had experienced problems with their logistics - the issue was value for
money, and on occasion we don't get the value we should out of the
relationship. JG advised that TB support would agree with that point of
view. AS noted that since the move to Sun, a lot of their systems didn't
seem to work so well - for example, there had been migration issues and
serial numbers had been corrupted. DC asked if we had an alternative. AS
noted no, not for the hardware. DB advised that it was the business process
system that seemed to be at fault. AS added that we had lost contact with
individuals as well.

STANDING ITEMS
==============

SI-1 Tier-1 Manager's Report
-----------------------------

AS reported as follows:

Fabric:

1) FY10 procurements:
   - CPU tender - deployed.
   - Disk - deployed.
   - Tape drives - delivered (4th April).
   - Tape media - delivered.

2) SL08 remains out of production - a load test is underway and has been
   fault-free so far (four successful drive ejects). 2-4 weeks of further
   testing is likely to yield the last two drive ejects we require for
   assurance that the problem is resolved. DB asked whether the time out of
   service would be about 9 months in total; AS noted possibly 6 months. DB
   commented that this testing takes 6 months, which is one-eighth of the
   drives' lifetime - was it worth continuing, and what was the motivation
   for continuing to test? The hardware had a finite lifetime with a
   significant period under test - we should simply use it for non-critical
   work.

3) FY11 procurements:
   - The EU tender for the disk framework agreement is just about to go out.
   - The CPU framework is about one week behind.
   - Frameworks need to be renewed for tape drives and media this year.

4) Various network issues:
   - Time-varying (but 5-10% at peak) packet loss on the production route to
     SJ5. Site networking is working to find the cause; a protocol/load
     problem is possible. Networking is working to address an identified
     issue.
   - Internal Tier-1 network stack problems:
     * Short breaks in connectivity (< 1 min) to some services - the suspect
       stack supports some critical services. An emergency intervention
       planned for Tuesday will cause a short network break to some
       services; an announcement will follow.
     * A second stack (stack 15) is unstable and splitting into two. This
       has caused out-of-hours callouts. A possible overheating problem has
       been addressed; waiting to see if it is fixed.

Service:

1) A summary of operational issues is at:
   http://www.gridpp.ac.uk/wiki/Tier1_Operations_Report_2011-04-13

2) CASTOR:
   - The upgrade to CASTOR 2.1.10-0 was successfully completed in March.
   - The CMS and LHCb upgrades to SRM 2.10 went well, but problems were
     encountered following the ATLAS upgrade on Thursday 14th April.
     SRM-ATLAS stopped responding at 01:00 UTC on Friday 15th. An alarm
     ticket was received from ATLAS at 07:20 UTC. The ATLAS SRM was taken
     down for 15 minutes for investigation. Transfers were throttled and the
     job load reduced to 200 jobs over the weekend. The problem was traced
     to Oracle statistics for the search path rapidly being invalidated -
     cause unknown; work is ongoing. The job limit is now at 2000.

3) LHCb batch work has been switched to use CVMFS for obtaining the LHCb
   software. CVMFS is still not a CERN-supported production service;
   however, LHCb are using it at several Tier-1 sites now. RAL hosts a
   production mirror of CVMFS, which reduces the risk somewhat.

Staff:

1) James Thorne and Richard Hellier have now left.
2) Matt Hodges (Grid team leader) leaves on Wednesday 20th April.
3) Derek Ross has accepted another job in e-Science and will leave on 12th
   July. The Grid team in particular will be severely under-staffed until
   new starters begin. We are looking at temporarily offloading work.
   - A vacancy notice for the Grid team leader is expected out in 5-10 days.
   - Paperwork for other Fabric team vacancies is in draft.

DB asked that the posts be expedited as soon as possible - it was a
high-level concern that a number of people were leaving right at the
beginning of a long data-taking phase, and it meant erosion of expertise. AS
advised that it was probably a result of the long uncertainty over funding
at STFC, and the pay freeze was probably also an issue.

SI-2 Production Manager's Report
---------------------------------

JC reported as follows:

1) There is a new glibc vulnerability to be addressed by sites. Most kernels
   had patches available last week. There is currently no public exploit, so
   the EGI rating is high-risk rather than critical.

2) At last week's ops meeting some concern was expressed about the impact of
   site spacetokens becoming full and, as a consequence, the site receiving
   less work, since this was under experiment control. "A missing release
   has caused reco jobs to go to T2s. A number of sites had to increase
   their space in PRODDISK. Sites get blacklisted in DDM automatically if
   the space is completely filled." We need to remain aware of these issues,
   but can we do anything more? Should we maintain a table of events
   impacting site performance? It was noted that the request to increase
   PRODDISK space was made by the experiment. DB noted that there was a
   higher-level picture: if the sites had more disk, PRODDISK would be
   increased and space would be sufficient. Sites should also be monitoring
   what was happening and proactively ask the experiment if they needed a
   PRODDISK increase. JC advised that the sites had been blacklisted before
   they could do this. DB advised that SL needed to be the owner of the
   issue of correcting the accounting, and he should apply judgement in
   conjunction with the Ops Team.
   We probably do need to keep a list. This should be a standing item at the
   Ops Team: if there were any issues during the previous week, there should
   be a mechanism to request a correction to the accounting. It was
   suggested that we keep a record and correct at the end of the year. If a
   large issue became apparent, it could be dealt with at that time; smaller
   issues would average out. DB asked that JC keep a list but not assume
   that little corrections would be made - it needed to be monitored, and we
   would correct at the end of the year if necessary. Smaller issues would
   average out and we should not try to correct things too much.

3) There was an intention to run Security Service Challenge 5 at the end of
   May. The challenge would involve a subset of sites in each NGI, to help
   understand how sites would respond to a major distributed incident.
   Unfortunately this would happen during the GridPP T2 accounting period.
   The accounting period starts on 1st May and SSC5 cannot start earlier,
   which leaves only two further possibilities: we do not take part, or
   accounting for that one-week period is not counted. There is an
   assumption here that sites would take nodes offline for the response, but
   they may not have to take such action. DB noted that we should take part
   in the Security Challenge as scheduled; if sites close queues etc. then
   we will correct for that. JC asked whether this should go to sites with
   extra staff effort. DB thought not - it should be entirely random - but
   if volunteers were required then all eight sites with two people should
   volunteer. DC agreed: completely random was the only way the test made
   sense. AS advised that sites should not consider this a bad thing - sites
   get huge benefit from working through procedures etc.; it was a good
   learning tool.

4) The March WLCG availability/reliability figures were released a few weeks
   ago. No GridPP sites have reported any concerns.
   http://gvdev.cern.ch/GRIDVIEW/downloads/Reports/201103/wlcg/WLCG_Tier2_Mar2011.pdf
   Four sites have been flagged:
   - QMUL - availability 64%: the site had scheduled downtime due to
     air-conditioning upgrade work. During the month there were also
     reported problems with the storage and packet loss on the WAN.
   - RHUL - reliability 85%, availability 81%: there was a mixture of
     scheduled and unscheduled periods to resolve network problems.
   - Oxford - reliability and availability 85%: storage related?
   - RALPP - reliability 84%, availability 83%: the availability was down
     due to scheduled networking outages. Reliability was affected by
     problems experienced with the site dCache database.

5) EGI have asked for priorities in the certification and release of EMI-1
   components. The SL5 WMS heads the UK list, followed by SE releases and
   SL5 MyProxy.

SI-3 ATLAS weekly review & plans
---------------------------------

RJ was not present at this point.

SI-4 CMS weekly review & plans
-------------------------------

DC reported that they were taking data. The Tier-1 was below 80% on CMS
readiness; AS confirmed he would get back to DC on this. RALPP were at about
70% readiness.

SI-5 LHCb weekly review & plans
--------------------------------

GP noted there wasn't much to report. AS had already mentioned a few issues.
They had a memory footprint problem on stripping jobs.

SI-6 User Co-ordination issues
-------------------------------

GP had nothing to report.

SI-7 LCG Management Board Report
---------------------------------

The last meeting had been before the F2F at Brighton; there were no major
issues to report.
SI-8 Dissemination Report
--------------------------

Neasan O'Neill had attended EGI at Vilnius. GridPP had a good stand
location, joint with NGS, under the NGI banner.

AOB
===

DB brought up the issue of forthcoming PMB meetings. The suggestion was as
follows:
- Thursday 28th April @ 12.55 pm
- there would be NO meeting on May 2nd
- Monday 9th May @ 12.55 pm (there was an STFC visit to GU on 16th May)
- Wednesday 18th May @ 11.00 am
- Tuesday 31st May @ 12.55 pm

ACTION 423.1 DB to do a doodle poll proposing PMB meetings during May.

It was noted that we might need a special meeting in order to discuss the
Tier-2 algorithm, as time was short. The OC documents had to be ready by
18th May.

3. Status of Tier-2 Algorithm
==============================

SL reported on progress. The issue had started because people wanted to use
CPU rather than jobs, and then corrected CPU rather than raw CPU. There had
been discussion at Brighton, following which SL had tried to measure
outputs. SL had circulated a spreadsheet for discussion. On the table, March
looked greener than the other months. The algorithm proposed was based on
the HEPSPEC numbers from APEL divided by the ATLAS figures. The issue,
however, was that ATLAS and APEL don't see the same number of seconds. You
would expect the CPU totals to be the same; APEL divided by ATLAS was OK in
most places, but four sites in particular seemed wrong: Cambridge and QMUL
for known reasons, while Lancaster and others were not understood. The
crosscheck was the 'PROD in seconds for the event' column - the amount of
HEPSPEC per production event should be constant at all sites, if correctly
done. This is then multiplied by 8.3. In this crosscheck, green shows
agreement within 10%, and this gives a consistent answer for around half of
the sites. The last column in the spreadsheet showed what was actually being
published, by range. In summary: there was an HS06 APEL-to-seconds ratio;
green sites agreed well; red sites were disagreements. This was worse in
April, as the APEL numbers seemed incorrect. JG noted that it could take a
day or so for numbers to reach the portal; SL advised that one day didn't
explain the discrepancy. SL reported that he had been sending full-chain,
single-generation-event jobs to sites at the weekend and checking how long
they take, and also what CPU they run on.

DB summarised the issues. First, how did we make progress on this - was it
possible to understand why these sites were red? There were four red sites
in March: QMUL, Lancaster, Cambridge and PPD. DK would check with RAL PPD.
For PPD in April they had green in the last column (SL noted that they
agree). DB asked how this would converge in time, in order to publish the
algorithm to sites. JC would discuss with sites at the Ops meeting. The
second issue DB raised was where we were with CMS and the other experiments.
SL noted he could pull numbers from the same place for 'other' experiments,
but there was no crosscheck. DC commented that most sites publish and have a
weighted average. JG suggested he could check the number of jobs with APEL.
DB suggested checking the total number of jobs at ATLAS and APEL - it was in
the database. DB suggested that we view the four 'red' sites as being
resolved. DK would look into RAL PPD. SL would look at the QMUL situation.
RJ noted it should be consistent month to month. SL advised that we could
investigate the red sites and he could put another column in the
spreadsheet. JC noted there was an Ops Team meeting at 11.00 am tomorrow and
he could get some insight from the sites directly.
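[Editorial note: the crosscheck described above can be illustrated with a
short sketch. This is not SL's actual spreadsheet: the field names, the
example figures and the use of the cross-site median as the reference value
are all assumptions made for illustration; only the 10% tolerance and the
green/red labelling come from the minutes. The role of the 8.3 factor is not
spelled out here, though in this toy data the APEL/ATLAS seconds ratio for
the consistent sites comes out close to that value.]

# Illustrative sketch only: the input fields and numbers below are invented,
# not taken from the APEL portal or the ATLAS dashboard.
from statistics import median

# Per-site accounting inputs for one month (hypothetical):
#   apel_hs06_sec - work published to APEL, in HS06-seconds
#   atlas_cpu_sec - CPU seconds seen by ATLAS for the same period
#   prod_events   - ATLAS production events processed at the site
sites = {
    "SiteA": {"apel_hs06_sec": 9.1e9, "atlas_cpu_sec": 1.1e9, "prod_events": 4.2e7},
    "SiteB": {"apel_hs06_sec": 4.8e9, "atlas_cpu_sec": 5.8e8, "prod_events": 2.2e7},
    "SiteC": {"apel_hs06_sec": 2.0e9, "atlas_cpu_sec": 1.6e8, "prod_events": 6.0e6},
}

TOLERANCE = 0.10  # "green shows agreement within 10%"

def crosscheck(sites):
    """Flag sites whose HS06 per production event deviates from the
    cross-site median by more than TOLERANCE (using the median as the
    reference is an assumption; the minutes do not say what was used)."""
    hs06_per_event = {name: d["apel_hs06_sec"] / d["prod_events"]
                      for name, d in sites.items()}
    reference = median(hs06_per_event.values())
    result = {}
    for name, value in hs06_per_event.items():
        deviation = abs(value - reference) / reference
        result[name] = ("green" if deviation <= TOLERANCE else "red", deviation)
    return result

for name, (flag, deviation) in crosscheck(sites).items():
    ratio = sites[name]["apel_hs06_sec"] / sites[name]["atlas_cpu_sec"]
    print(f"{name}: APEL/ATLAS seconds ratio {ratio:.1f}, "
          f"HS06-per-event deviation {deviation:.0%} -> {flag}")

[In this sketch a site whose HS06-per-event figure sits far from the bulk of
the sites is flagged red, signalling a publishing problem rather than a real
performance difference; the actual reference value and tolerance would be
whatever SL's spreadsheet uses.]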
DB advised that we could delay the start of the accounting period for a few
weeks. DB asked whether we believed that this method fundamentally worked;
RJ said it had to work everywhere, at all sites. DB asked whether there
would be a wiki page on the web to let people see it; DC noted that a wiki
page was possible. SL would also contact LHCb and others. In conclusion:
(1) there was the Ops Team meeting tomorrow; (2) there were 10 days until
the PMB on the 28th to sort this out; (3) was a meeting required to review
this beforehand? There was no time, as holidays intervened.

REVIEW OF ACTIONS
=================

400.4 SL to co-ordinate changing the current GridPP MoU towards an MoU for
      GridPP4. In progress - the document had been circulated. Any
      corrections to be sent to SL. Ongoing.

409.1 JC to revisit the document with a GridPP-NGI-NGS structure, not Dave
      Wallom's. JG will provide input. Visions for today and for the future.
      Ongoing.

416.5 PG to establish a process to generate a final project map in
      conjunction with work package and task owners. Ongoing.

416.8 RJ/DB to establish an ATLAS networking test programme to investigate
      Tier-2 connectivity using Glasgow as an example. Done, item closed.

417.3 JC to follow up with the Tier-2s, on a site-by-site basis, regarding
      deployment of glexec/Argus and tarball installation packages - site
      readiness/difficulty to be reported back to the PMB. JC was working on
      this - the glexec deadline of May would not be met in all cases, but
      this was now an open action in the Ops Team. Done, item closed.

419.1 SL to contact the Tier-2 sites, by the best route possible, and ask
      two questions relating to hardware status: 1. what is the minimum
      available at present; and 2. what are they likely to be able to pledge
      in April 2012 and April 2013? SL reported: "I contacted the sites and
      we got the answers for 2011 as presented at Brighton. There is little
      info on 12/13." Done, item closed.

419.2 SL to respond to Amazon Web Services' invitation to attend the cloud
      computing summit at Oxford, and nominate PG to attend. SL reported: "I
      did this and nominated Pete but never got any reply." Done, item
      closed.

419.3 PG to provide feedback on the Amazon Web Services (AWS) Academic
      Research Summit on cloud computing at Oxford on April 12th. PG
      reported: "Despite me asking Amazon directly for details I never heard
      anything back from them re an invite." Done, item closed.

419.4 PG to take the mandatory issue of glexec back to sites, and get
      clarification of status. This is mandatory by June 2011 or sites will
      not get default analysis jobs. PG reported: "This has been discussed
      at dteam meetings and I suggested sites update their status on the
      wiki at http://www.gridpp.ac.uk/wiki/Site_status_and_plans. Jeremy has
      asked again at this week's meeting for sites to update the page, as
      many entries are out of date." Done, item closed.

419.5 PG to ask at dTeam whether sites had any issues/experience with
      storage servers from Supermicro. PG reported: "The Supermicro issues
      are currently affecting Oxford and UCL; Cambridge have recently
      ordered a server of the same type. Others have very similar kit, so
      far we think it's just these three sites that have the actual
      combination of Adaptec 5805 controllers and Western Digital Green 2TB
      drives. Viglen have upgraded the backplane firmware in the servers at
      UCL and Oxford and are now suggesting a particular version of firmware
      for the Adaptecs that is being used on over 100 servers at CERN.
      Needless to say we are not very happy about the situation and (at
      Oxford) have ordered up ~100TB from an alternative vendor to mitigate
      our lack of storage." Done, item closed.

420.1 DB to explain the new ops structure to the Ops Team and Collaboration,
      including some clarity relating to the personnel who are explicitly
      expected to take on national/ops team roles.

420.2 PG and JC to establish details of the Ops Team work remit.

420.3 JC to advertise the 'open' nature of the Ops Team meetings and
      encourage site attendance. JC to ensure that (as agreed at Lancaster)
      the managers of the T2s identify a person for reporting on T2
      deliverables and metrics.

420.4 PG to ensure that each metric/deliverable has an owner identified.

422.1 DB to email the CB with GridPP input to the e-VAL questionnaire/survey
      and elicit some guidance.

422.2 DB to prepare e-VAL input for GridPP, including information on Roles
      appended to the GridPP3 proposal.

ACTIONS AS AT 18.04.11
======================

400.4 SL to co-ordinate changing the current GridPP MoU towards an MoU for
      GridPP4. In progress - the document had been circulated. Any
      corrections to be sent to SL.

409.1 JC to revisit the document with a GridPP-NGI-NGS structure, not Dave
      Wallom's. JG will provide input. Visions for today and for the future.

416.5 PG to establish a process to generate a final project map in
      conjunction with work package and task owners. PG reported: "I have
      had some input from some of the areas but will need to tackle the
      Tier 1 and the experiments asap."

420.1 DB to explain the new ops structure to the Ops Team and Collaboration,
      including some clarity relating to the personnel who are explicitly
      expected to take on national/ops team roles.

420.2 PG and JC to establish details of the Ops Team work remit.

420.3 JC to advertise the 'open' nature of the Ops Team meetings and
      encourage site attendance. JC to ensure that (as agreed at Lancaster)
      the managers of the T2s identify a person for reporting on T2
      deliverables and metrics.

420.4 PG to ensure that each metric/deliverable has an owner identified.

422.1 DB to email the CB with GridPP input to the e-VAL questionnaire/survey
      and elicit some guidance.

422.2 DB to prepare e-VAL input for GridPP, including information on Roles
      appended to the GridPP3 proposal.

423.1 DB to do a doodle poll proposing PMB meetings during May.

The next PMB would take place on Thursday 28th April @ 12.55 pm. DB would do
a doodle poll to establish meetings in May.
