UKHEPGRID Archives

UKHEPGRID@JISCMAIL.AC.UK



UKHEPGRID July 2007

Subject:

Minutes of the 263rd and 264th GridPP PMB meetings

From:

Tony Doyle <[log in to unmask]>

Reply-To:

Tony Doyle <[log in to unmask]>

Date:

Fri, 13 Jul 2007 15:29:13 +0100

Content-Type:

MULTIPART/MIXED

Parts/Attachments:

TEXT/PLAIN (24 lines), 070709.txt (1 lines), 070702.txt (1 lines)

Dear All,

     Please find attached the latest weekly GridPP Project Management 
Board Meeting minutes. The latest minutes can be found each week in:

http://www.gridpp.ac.uk/php/pmb/minutes.php?latest

as well as being listed with other minutes at:

http://www.gridpp.ac.uk/php/pmb/minutes.php

The previous minutes are at:

http://www.gridpp.ac.uk/pmb/minutes/070702.txt

Cheers, Tony
________________________________________________________________________
Prof. A T Doyle, FInstP FRSE                       GridPP Project Leader 
Rm 478, Kelvin Building                      Telephone: +44-141-330 5899
Dept of Physics and Astronomy                  Telefax: +44-141-330 5881
University of Glasgow                   EMail: [log in to unmask]
G12 8QQ, UK                 Web: http://ppewww.physics.gla.ac.uk/~doyle/
________________________________________________________________________


GridPP PMB Minutes 264 - 9th July 2007
======================================

Present: Tony Doyle, Sarah Pearce, Roger Jones, Stephen Burke, David Britton, David Kelsey, Steve Lloyd, Tony Cass, Robin Middleton, John Gordon, Jeremy Coles, Peter Clarke, Andrew Sansum, Neil Geddes, Suzanne Scott (Minutes)

Apologies: Dave Newbold, Glenn Patrick

0. Approval of Previous Minutes
================================

It was agreed to send any amendments to SS by email, preferably by noon tomorrow (Tue).

1. EGEE III Proposal
=====================

NG had circulated an email to UK/I EGEE partners. A workplan had been refined by the PEB but bids from federations were still in excess. Two issues were involved: 1. trim the bids to reflect the programme of work; 2. trim the programme of work itself.

The EGEE PMB had met last Friday (6th July) in closed session to discuss the bids. For SA1, the conclusion was to approve all of the work proposed by the activity leader - this would be translated into euros which would provide the approved budget for each bid.

The meeting also discussed the Applications Support area. Bids had been sent in which were not in the programme of work, and were not well defended. For some of the other bids it was agreed that they need to be combined into one bid. Further discussion of this area will happen this week.

There had been a discussion on testbeds and other non-(full)-production services which is likely to result in a consolidation of these activities. The final budget table would be discussed this week, and the next EGEE PMB meeting was scheduled for 16th July.

2. Review of Tier-2 Issues
===========================

It was agreed that DB's list had been gone through and actions generated. DB noted that JC had not been present at last week's meeting but his comments had been incorporated in the Minutes. It was agreed that DB would extract the issues and actions generated from the Review and put these on the Tier-2 site.
Note: done, see http://www.gridpp.ac.uk/tier2/Tier-2_Review_Issues_2007.doc (.pdf)

3. GridPP3 Planning
====================

DB had circulated an email. The indication was that no further formal input from GridPP was required by STFC at this point. It was understood that all of the money had been approved by PPRP and other Committees but that the carry-forward of GridPP2 funds was not yet quite confirmed. It was noted that a CB meeting was happening next week and the funding issue would be raised with Group Leaders. Everyone was aware that we have grants awaiting issue in 7 weeks' time. It was agreed that DB would contact Janet Seed again to ask her advice about a formal statement re the plan.

4. AOCB
========

None.

STANDING ITEMS
==============

SI-1 Dissemination Officer's Report
------------------------------------

SP reported a news article on blogs and the new PlanetGridPP blog.

SP asked about the situation relating to an article on the Site Reviews. Information generally was not yet available for release. It was agreed that SP would not be able to point to all detailed feedback; DB's summary of outcomes could be the basis for a news item. It was noted that all of the positive issues were not documented. SP will draft an item and draw together the positive aspects of the Review, using some specific examples - but release of information would be checked with sites.

It had been agreed that there would be a joint NGS/STFC stand at EGEE07.

Neasan O'Neill had produced a new website for LHC@Home, and the statistics were also working now.

Last Monday there had been a meeting of the LHC Promotion Group regarding Grid promotion - a strategy document will be drawn up with key messages.

The Parliamentary POSTnote had been published last week and there will be a link on the 'documents' page. An article is being done for GridPP news and iSGTW.

SI-2 Tier-1 Manager's Report
-----------------------------

AS provided the following report:

Hardware: Regarding the 10Gb path from Tier-1 to SJ5, they were currently waiting for network group to finish testing. The RAL networking group are still in the process of obtaining a public AS number in order that the Tier-1 can route Tier-1 -> Tier-1 traffic by the OPN. This would be raised at the meeting on Wednesday (11th July).

The pre-qualification stage of the disk and CPU tenders closed Friday 29th June. Evaluation is underway. AS reported three issues: 1) state of evaluation; 2) tape planning; 3) input from the Tier-1 Board regarding Tender Documents. It was noted that there is a Tier-1 Procurement Team Meeting on Tuesday afternoon (10th July).

A tender to set up a Framework Purchasing agreement for tape media has now commenced. This is expected to be able to deliver media in 2007Q4. 50% of an interim purchase of 300TB of tape media has now been received and the remainder is expected this week.

Service: SAM availability for the last 7 days was 96% (94%?). Reliability for June (as measured by WLCG) was 87% - the average for the best 8 sites was also 87%. Main impact was caused by the network outage in the middle of the month - load related problems on the CE also contributed.

Regarding CASTOR: The CMS CASTOR instance had some problems under the highest CMS load tests of a week ago. However it has subsequently been stable and we are now working to understand throughput rates, which CMS believe are still insufficient to meet their CSA07 objectives. Further load testing is scheduled. The standalone CASTOR for ATLAS is being tested by ATLAS. The standalone CASTOR for LHCB is built and has had basic functionality tests completed by the CASTOR team. Further load tests will be carried out by the CASTOR team and it will then be released to LHCB for testing.

BDII: All 3 top-level BDII servers have now been upgraded to the latest release.
Load on the BDII servers appears to be low and there do not appear to be timeout problems at the Tier-1 since the upgrade.

RB: Both rb01 and rb02 were back in production last week. rb03 was brought online for Alice. Over the weekend rb01 broke again and we are now looking to move LHCB production work off these servers to rb03 to reduce the load further. We also note that this morning both rb01 and rb02 are flagged as OK by SAM but marked as Bad by SL's tests; this discrepancy is not yet understood. Current strategy is to spread the load and keep things going until WMS is available.

SL4 is running and is available externally - testing is commencing.

SI-3 Production Manager's Report
---------------------------------

JC commented on AS's report (above) by noting that the Alice RB problems had not been their fault - JC would re-check the BDII timeouts, as the reports are not working at present and provide no information.

JC reported as follows:

1) The issue of SL4 rollout was discussed at the GDB last week. The experiments all claimed to be ready but the holding point on sites deploying SL4 is confirmation of additional dependencies the experiments may have on the OS over what is required for the gLite middleware (in earlier middleware, additional packages were included in a release to ensure that the experiment software computing environment requirements were met). There is particular concern about circular dependencies which may lead to incompatible requirements. To make progress a series of SL4 WNs have been set up for the experiments to test against - this is being done at LAL and RAL Tier-1 (Birmingham will join this week). Experiments were asked to upload known dependencies to their CIC portal ID card but so far only LHCb has done it. There was a discussion of Experiment requirements - a list from ATLAS had been provided showing all of the libraries and links that they needed. LHCb had also sent in a requirements list.
It was noted that SL4 is currently meeting ATLAS requirements and many sites have already installed SL4. JC noted that he was not confident about the non-LHC Experiments. TD noted that we need to push ahead anyway now. JC noted that the phased transition would be discussed at the Deployment Board meeting on Thursday (12th July).

Status for RAL WNs: ALICE added the queue to their production system. LHCb agreed to run dedicated tests when production staff return from holiday. Without dedicated testing we do not know that the jobs running test all classes of jobs (they may be random from the matchmaking). This morning 200+ jobs were queued for 6 job slots. CMS have not communicated any specific requirements. Before any migration can happen for the Tier-1 it needs to be confirmed that the other non-LHC experiments work without problem on SL4.

2) glexec on WNs is the subject of a lot of discussion at the moment. We are trying to understand the principal objections. The real sticking point appears to be whether glexec can easily (i.e. as a default) be installed in non-SUID mode. SUID mode allows UID switching and is frowned upon especially at non-HEP dedicated sites. In contrast other sites in WLCG/EGEE require the job to always run under the ID of the person whose work is being run. This issue was to be discussed at the Deployment Board meeting on Thursday (12th July).

3) Since the move to GOCDB3 there have been problems creating the UKI tree structure needed for the ROC reports. The accounting data for most/all sites also seems to have stopped updating as seen in the site charts in the portal.

4) As reported previously Glasgow has encouraged a number of groups to join the gridpp VO to test the infrastructure. A significant amount of work now seen at Glasgow is from this VO - the site remains full while most other UK sites have plenty of spare capacity.
Last week Graeme Stewart managed to get MPI jobs running (required by engineers) at Glasgow which is likely to further increase usage.

5) The question of specInt ratings is being raised once again as the T2 Co-ordinators fill out the Q2 report. The value being used by the T2s differs and this clearly impacts the overall site and Tier-2 KSI2K. If the KSI2K figures are being used for Tier-2 hardware allocations then do we need to do better benchmarking?

6) The introduction of faster cores means that historical batch queue limits need revisiting. TD noted that the given default time should be retained - downstream the problem was concatenating files. JC to feed this back to Graeme - and this was being discussed at the DTeam meeting as well. TD noted that it should not require revisiting as the defaults should remain unchanged.

7) The RAL-PPS instance of the PPS SAM testing framework is now up and running.

8) SL joined the dteam VO to run his jobs outside of the ATLAS environment. This led to the discovery of various problems including with use of VOMS/Gridmap files and edg-job-submit. There is one remaining problem with use of the Glasgow RB that needs further investigation.

9) There is a deployment board meeting in London this Thursday. The agenda is here: http://indico.cern.ch/conferenceDisplay.py?confId=18446

10) There were FTS problems (~24hrs) last week. The CERN grid service operators did not notice that a host certificate for the production service was about to expire; it did expire, with obvious repercussions for the MyProxy service. JG noted that it is better to have unwanted tickets rather than have these problems.

11) Finally JC has received several questions from people involved in deployment roles who are still unsure where they stand with GridPP3 continuation of their posts. [see item 3, above]

SI-4 LCG Management Board Report
---------------------------------

JG reported that he had presented a document regarding the policy of killing jobs.
The feedback was that the VOs wanted to know what was going wrong so that they could fix it, rather than the jobs simply being killed. The VOs want to work with GridPP to resolve these issues. It was noted that we need to flag when jobs are cancelled otherwise the Experiments don't know why jobs have been cancelled. TD noted that we can get statistics from Tier-1 regarding jobs, but rather than average efficiency, we need profiled jobs. TD noted that the cut is on 2.7% efficiency, and all that is required is a histogram to be inserted into the document. It was agreed that AS would speak to Matt Hodges. DK noted that this issue would also be discussed at the Deployment Board - but it was noted that it was a User Board issue too.

JG reported on an action to set up SLAs to run VO boxes. A presentation had been given regarding security etc. JG asked whether all of the Tier-1s have SLAs? The issue for the future would be to have a generic one.

JG reported that there had been a talk on OSG site validation; and SRM2.2 issues/options had also been discussed.

SI-5 Documentation Officer's Report
------------------------------------

It was noted that SB had been away at CERN.

REVIEW OF ACTIONS
=================

247.2 RJ to get further information from ATLAS regarding use of Grid for testing of PANDA, and report-back. This is not a live topic and it was agreed to initiate a new listing of 'Inactive' items. This to be moved to that category.

250.4 RJ, DN, GP, TD to meet to integrate experiment requirements of Tier-2s going to Tier-1 - sites are aware of requirements but discussion still has to take place. It was noted that this issue is not high priority. A meeting is to take place with Barney Garrett - this is ongoing and still to be arranged.

251.1 TD to raise the issue of memory vs CPU cost at the MB [in order to work out what the requirement was between 1GB and 2GB memory per core]. It was agreed this to be moved to 'Inactive' category.

252.3 RM has now received inputs for his one-page summary regarding the transition of each of the existing Middleware areas from GridPP2 to GridPP2+ to GridPP3 - this to go to DB. Ongoing.

253.1 AS has commenced work on the report on data integrity at Tier-1, in relation to implementation of checksums. AS is still working on this and it will take a further couple of weeks to complete. This is ongoing, and AS hopes it will be finished by the end of August. It was agreed to move this to 'Inactive' category.

254.2 ALL PMB members have now signed-up to EVO. Tests were ongoing but this action is on hold due to H323 requirements which must be resolved. JG/RM will resolve EVO issues. RJ reported that he had joined an evaluation group on EVO and asked that all information should be sent to him to enable him to document the problems involved. It was agreed that an EVO test would take place the week after next (PMB) as next week's meeting was a short one due to the CB meeting at 2.00 pm.

259.5 JC to provide recommendations to the PMB on PPS testing and a summary of what is currently available on the system. Ongoing.

260.1 RM to provide final feedback for site reviews to SL for https://www.gridpp.ac.uk/tier2/Readiness_Reviews/index.html. Ongoing.

260.3 RM, NG, TD, DK to inform SL which site-review information is public/private. Ongoing.

261.1 TD and JG to prepare a PMB statement to be prepared for the MB regarding SL4 releases of basic middleware, which were still awaited and were an issue at sites. JG reported that he would be doing this for tomorrow. Sites should be encouraged to proceed with SL4 upgrades which are to be tracked by JC. JG will give a summary statement to the MB as to what we believe the current situation is - this will include 'SL5 on hold'.

261.2 DN, RJ, GP: An action on the experiments to define the future outlook for 64-bit applications and resultant effects on hardware purchasing. Experiment reps to define the outlook.
It was noted that the priority is 32-bit at the moment; there is no advantage to 64-bit. A short statement is required.

261.4 DB to look through the input in detail in relation to GGUS problems. Ongoing.

261.5 JC and dTeam to carry out a survey on sites' experiences of GGUS, when possible to organise. Ongoing.

261.6 JC to look into the issue of 2-hour response timing @ Tier-2 sites and understand the problem in greater detail - sites also need to understand what the two-hour response time actually means. This may come up at the next Board meeting. Ongoing.

261.11 SL to progress receipt of final site documents from SouthGrid and London T2 which were still outstanding. It was noted that SL was still awaiting information.

261.13 DK to progress receipt of ScotGrid feedback. Ongoing.

261.14 RM to progress receipt of LT2 feedback. Ongoing.

261.16 JG to progress the issue of someone getting involved in the SLA (ROC) working group.

261.17 JC to assess the general effectiveness of RSS feeds and subscription-based updates, in relation to GridPP blogs. It was noted that blogs are aggregated: PlanetGridPP is the mechanism, but RSS-feeds that can be subscribed to don't exist. JC will bring this up at the Deployment Board meeting.

262.2 SL to clarify GridPP contribution (what is accounted rather than what is available) with the Tier-2 Board. Ongoing.

262.3 DK to raise items (12) [re accounted GridPP contribution] and (22) [re site availability via SAM tests] at the Deployment Board in two weeks' time. This was on the Agenda for discussion at the DB. Done, item closed.

262.4 JC to ascertain the specific problems in relation to Condor support issues. JC awaiting feedback. Ongoing.

262.5 Regarding poor response time of middleware developers: DK to propose the following recommendation to the Deployment Board: to recommend that if specific issues were involved, GGUS should be used. If issues were general, the TCG representative at the Tier-2 site should be informed.
The TCG rep in turn should raise the issue as appropriate at the TCG meetings. This was on the DB Agenda for discussion. Ongoing.

262.6 JC to raise the issue of PPS feedback information relating to upgrades issues with the relevant individual(s) on the PPS, and ask if there was anything else that could be done. Ongoing.

262.7 AS to speak to procurement and warn them that sites might want to make parallel purchases - a sentence could be added to the tender document. AS still to talk to procurement - ongoing.

262.9 non-Grid access relating to VOs. A document is to be done detailing this issue as VOs need a mechanism 'in'. AS to detail the issue in a separate report and circulate to the PMB. What can and can't be offered to non-Grid users: detail is required - AS still to do. Ongoing.

262.10 Regarding user communication/info provision, JC suggested amending the emphasis of the UB to be more in touch with users generally - it was agreed that he would raise this with Glen. Glen will be there on Thursday, JC will speak to him then.

262.11 SB to add a new Document to the PMB Documents, No 114, relating to a documentation report overview on current status. Ongoing.

263.1 Robin Tasker to re-circulate his paper regarding the RAL-CERN OPN link, once further information was available. What is the timescale for this? PC to review the Minutes and discuss with Robin Tasker.

263.2 JG to further investigate the lack of ability to pass job requirements to the batch system and report-back (Tier-2 review issue). JG will raise this through the GDB. Ongoing.

ACTIONS AS AT 09.07.07
======================

250.4 RJ, DN, GP, TD to meet to integrate experiment requirements of Tier-2s going to Tier-1 - sites are aware of requirements but discussion still has to take place. It was noted that this issue is not high priority. A meeting is to take place with Barney Garrett - this is ongoing and still to be arranged.

252.3 RM has now received inputs for his one-page summary regarding the transition of each of the existing Middleware areas from GridPP2 to GridPP2+ to GridPP3 - this to go to DB. This was to be done by Friday 8th June but is still ongoing.

254.2 ALL PMB members have now signed-up to EVO. Tests were ongoing but this action is on hold due to H323 requirements which must be resolved. JG/RM will resolve EVO issues. RJ reported that he had joined an evaluation group on EVO and asked that all information should be sent to him to enable him to document the problems involved. It was agreed that an EVO test would take place the week after next (PMB) as next week's meeting was a short one due to the CB meeting at 2.00 pm.

259.5 JC to provide recommendations to the PMB on PPS testing and a summary of what is currently available on the system.

260.1 RM to provide final feedback for site reviews to SL for https://www.gridpp.ac.uk/tier2/Readiness_Reviews/index.html.

260.3 RM, NG, TD, DK to inform SL which site-review information is public/private.

261.1 TD and JG to prepare a PMB statement to be prepared for the MB regarding SL4 releases of basic middleware, which were still awaited and were an issue at sites. JG reported that he would be doing this for tomorrow. Sites should be encouraged to proceed with SL4 upgrades which are to be tracked by JC. JG will give a summary statement to the MB as to what we believe the current situation is - this will include 'SL5 on hold'.

261.2 DN, RJ, GP: An action on the experiments to define the future outlook for 64-bit applications and resultant effects on hardware purchasing. Experiment reps to define the outlook. It was noted that the priority is 32-bit at the moment; there is no advantage to 64-bit. A short statement is required.

261.4 DB to look through the input in detail in relation to GGUS problems.

261.5 JC and dTeam to carry out a survey on sites' experiences of GGUS, when possible to organise.

261.6 JC to look into the issue of 2-hour response timing @ Tier-2 sites and understand the problem in greater detail - sites also need to understand what the two-hour response time actually means.

261.11 SL to progress receipt of final site documents from SouthGrid and London T2 which were still outstanding. It was noted that SL was still awaiting information.

261.13 DK to progress receipt of ScotGrid feedback.

261.14 RM to progress receipt of LT2 feedback.

261.16 JG to progress the issue of someone getting involved in the SLA (ROC) working group.

261.17 JC to assess the general effectiveness of RSS feeds and subscription-based updates, in relation to GridPP blogs. It was noted that blogs are aggregated: PlanetGridPP is the mechanism, but RSS-feeds that can be subscribed to don't exist. JC will bring this up at the Deployment Board meeting.

262.2 SL to clarify GridPP contribution (what is accounted rather than what is available) with the Tier-2 Board.

262.4 JC to ascertain the specific problems in relation to Condor support issues.

262.5 Regarding poor response time of middleware developers: DK to propose the following recommendation to the Deployment Board: to recommend that if specific issues were involved, GGUS should be used. If issues were general, the TCG representative at the Tier-2 site should be informed. The TCG rep in turn should raise the issue as appropriate at the TCG meetings.

262.6 JC to raise the issue of PPS feedback information relating to upgrades issues with the relevant individual(s) on the PPS, and ask if there was anything else that could be done.

262.7 AS to speak to procurement and warn them that sites might want to make parallel purchases - a sentence could be added to the tender document.

262.9 non-Grid access relating to VOs. A document is to be done detailing this issue as VOs need a mechanism 'in'. AS to detail the issue in a separate report and circulate to the PMB.
What can and can't be offered to non-Grid users: detail is required - AS still to do.

262.10 Regarding user communication/info provision, JC suggested amending the emphasis of the UB to be more in touch with users generally - it was agreed that he would raise this with Glen.

262.11 SB to add a new Document to the PMB Documents, No 114, relating to a documentation report overview on current status.

263.1 Robin Tasker to re-circulate his paper regarding the RAL-CERN OPN link, once further information was available. What is the timescale for this? PC to review the Minutes and discuss with Robin Tasker.

263.2 JG to further investigate the lack of ability to pass job requirements to the batch system and report-back (Tier-2 review issue). JG will raise this through the GDB. Ongoing.

264.1 DB to extract the issues and actions generated from the Tier-2 Review as discussed at the PMB and put these on the Tier-2 site.

264.2 DB to contact Janet again and remind her about the forthcoming CB meeting and ask her advice about a formal statement re the plan V2.

264.3 JC noted that the Alice RB problems had not been their fault - he would re-check the BDII timeouts, as the reports are not working at present and provide no information.

264.4 Regarding policy of killing jobs, statistics are required from Tier-1, but rather than average efficiency we need profiled jobs. AS to speak to Matt Hodges.

INACTIVE CATEGORY AS AT 09.07.07
================================

247.2 RJ to get further information from ATLAS regarding use of Grid for testing of PANDA, and report-back.

251.1 TD to raise the issue of memory vs CPU cost at the MB [in order to work out what the requirement was between 1GB and 2GB memory per core].

253.1 AS has commenced work on the report on data integrity at Tier-1, in relation to implementation of checksums. Ongoing, AS hopes to complete this by end August.

Next week's PMB (16.07.07) would be for 1 hour only due to the CB meeting at 2.00 pm.
EVO test the following week (23.07.07).

GridPP PMB Minutes 263 - 2nd July 2007
======================================

Present: Roger Jones, David Britton, David Kelsey, Dave Newbold, Tony Cass, Robin Middleton, John Gordon, Glenn Patrick, Robin Tasker, Suzanne Scott (Minutes)

Apologies: Tony Doyle, Sarah Pearce, Stephen Burke, Steve Lloyd, Jeremy Coles, Peter Clarke, Andrew Sansum, Neil Geddes

1. UK Position on Resilience of the RAL-CERN Line
=================================================

Robin Tasker had produced a paper regarding the RAL-CERN OPN link. There had been an outage in June - it was reported that French road repair men had dug up the fibre and it was 48 hours before it was repaired. What resilience was required to protect against outage?

The lightpath from RAL to CERN was summarised in RT's paper in terms of the problems involved, but overall the link was fairly reliable. The paper addressed issues of fibre infrastructure, with feasibility and costing confirmation awaited from UKERNA. It was understood that outage could be infrequent and a large cost was involved in protecting the link if such protection was not generally required. RT was currently awaiting a risk assessment in relation to the break in fibre in such a catastrophic way - it was a question of balance of risk and cost, and of how long an outage was likely to last - how significant was an outage of 48 hours in June?

JG noted that breaks in the Tier-1 do result in dataflow issues to the other Tier-1s. There was a discussion regarding steering data and storage. It was agreed that the links need to be as reliable as possible within reason. An outage of 1-2 hours or one day was acceptable, but for two weeks, no. It was noted that the lightpath cannot be re-routed: if the fibre breaks then the connection is lost. It was noted that bandwidth might be an issue for the future. There was a discussion of routes into CERN and cross-border fibres.
It was reported that JANET (UK) were providing figures to RT for a diverse route by the end of the week. NetNorthWest and JANET will be able to give a realistic assessment of risk. It was agreed that a decision should be deferred until further information was available. RT will update his paper with fuller information when it was available, and re-circulate.

2. Ongoing Review of Tier-2 Issues
==================================

In absentia, JC had submitted comments on the remaining issues.

18) Lack of ability to pass job requirements to the batch system - JG noted that the gLite CE can pass information. The RB looks at the user requirement and matches it to a queue. It was noted that the system fills with jobs that can't be optimised. JG would investigate this issue further and report-back.

19) Virtualisation - UCL had wanted to know GridPP direction/support in this area. JC noted that Marian had started looking at virtualization. He currently has some nodes on the PPS which are on virtual machines - his intention was to put the PPS SAM client in such an environment. It was noted that Grid-Ireland also had a lot of experience in this area which GridPP could draw upon. JC reported that there might be some support available via the TB-SUPPORT list and helpdesk, but at the moment we are still looking at this area and do not have a definite direction. It was agreed that this is largely uncharted territory for GridPP and a diversion away from the standard GridPP environment. In abeyance at present.

20) Changing Experiment requirements - JC noted that this might relate to such things as the ATLAS ACL change requests. Some sites thought there needed to be more structure to change requests. VO views might be cited as another area where difficulties have been encountered. There was also the difficulty of consistency of feedback - on SL4 JC has heard different positions depending who he talks to within an experiment.
It was reported that the 39 Tier-2s in CMS are in regular contact. JG summarised that this was an issue more for the Experiments to deal with.

21) Level of noise for site problems - JC noted that this covered things like false-positive problems in the site SAM results. It was agreed that people are paying more attention now to the SAM results. Issues should be raised in the weekly Ops reports meetings.

22) Definition of 'what is available' - JC noted that if sites are going to be measured against one measure of availability, is it the number coming from GridView (even if there are (many) questions about how accurate it is in measuring availability for the experiments)? It was agreed that, yes, GridView and the SAM reports come from the same database, but if there is not a consistent query then you won't get the same number out of the same data.

23) Enforcement of MoUs/SLAs - JC noted that the process is known but other than getting less funding in the future, were there any other enforcement options? It was agreed that this issue was not for public debate at present.

3. Killing Jobs
================

It was reported that TD had sent a draft policy to the WLCG Management Board. It was noted that killing stalled jobs was treating the symptom rather than the problem. Some feedback had been received; it was understood that the policy intention was to try to improve efficiency at sites. It was noted that the Tier-2s have fewer staff and VOs send jobs in. The issue would be discussed at the face-to-face MB meeting tomorrow. It was noted that the dashboard was an answer to cross-VO problems but the Experiments don't know who is running jobs. It was agreed that it was not right if it became the normal procedure to kill-off jobs as a matter of course.

4. AOB
=======

RJ reported that Liverpool had asked for some GridPP funding for pre-spending. DB noted that this was not possible as no official word had been received from STFC with regard to allocations.
It was agreed that nothing could be done until GridPP know officially what the scale of expenditure is.

STANDING ITEMS
==============

SI-1 Dissemination Officer's Report
------------------------------------

It was noted that SP was not present.

SI-2 Tier-1 Manager's Report
-----------------------------

In absentia, AS had sent in the following report on Friday 29th:

Hardware - Re the 10Gb path from Tier-1 to SJ5, it was reported that they were currently waiting for the network group to finish testing. They were currently working on implementing the firewall configuration as a set of router filters. The RAL networking group were in the process of obtaining a public AS number in order that the Tier-1 could route Tier-1 -> Tier-1 traffic by the OPN. Still waiting for the RAL networking group to complete this work. The pre-qualification stage of the disk and CPU tenders closed on Friday 29th. Evaluation will start w/c 2nd July. The Tape service was down last Tuesday for a firmware update.

Service - SAM availability for the last 7 days was 93% (some overlap with previous 7 days reported).

Regarding CASTOR: A stand-alone 2.1.3 release of CASTOR for CMS had been implemented and is undergoing testing. Results were very encouraging with high rates achieved (400MB/s writing - concurrent with 300MB/s to tape followed by >700MB/s reading). Reliability has been excellent, far better than any previous tests with CMS. However, so far only native rfio load tests have been tried and we need to see good results with gridftp/srm/fts before feeling confident that we have a good working production-ready release. A stand-alone 2.1.3 release for ATLAS is currently being worked on. This was delayed by technical problems but is now nearly complete and will be tested soon. We have reviewed hardware capacity available to implement a 2.1.3 stand-alone implementation for LHCb. Tier-1 batch workers will be redeployed temporarily. Work on this will commence once the ATLAS instance is complete.
It is expected to go faster as documentation and processes have now been improved.

Regarding dCache: all is OK, but it is apparently not being used by ATLAS production. We are following this up.

BDII: We have seen some timeouts on the top-level BDII. These are load related, probably caused by the LHCb VO box. One BDII has been updated to the latest release and has seen a significant reduction in CPU load. If it remains stable then the two remaining hosts will be updated shortly.

RB: rb01 is currently under sysdev having its database cleaned. rb02 is struggling to cope with the load on its own. rb03 is deployed and is currently being tested. Once testing is completed we will arrange for ALICE production to move to it. We may also move the LHCb production.

LFC: Problems reported on Monday were resolved (on Monday). Cause was a faulty gLite update.

SI-3 Production Manager's Report
---------------------------------

In absentia, JC sent in the following report:

1) We are pursuing two security-related matters raised in the UK; the submitters are concerned that there has been no result (patch) for one and a lack of discussion of the other. There has actually been some progress on both but this particular problem has highlighted a need to review procedures and communication in this area. Another issue being faced generally is how we are supposed to deal with vulnerabilities in VO/experiment code.

2) BDII timeouts appear to be affecting UK sites again (causing lcg-rm tests to fail for several sites).

3) The main things to note from the UKI monthly meeting last week (http://indico.cern.ch/conferenceDisplay.py?confId=17879) are that the UK helpdesk will now move to chase/close tickets where the ticket submitter has not responded to the agent's response (after a site waiting on a user to confirm a fix), and that generally sites are finding it difficult to keep up with constant changes in YAIM and the middleware.
Sites have been encouraged to check their storage data being published to the storage accounting portal (http://goc02.grid-support.ac.uk/accountingDisplay/view.php?queryType=storage) and report any problems.

4) GOCDB3 (https://goc.gridops.org/) went live last week on Wednesday. We have seen an increase in tickets to the UKI ROC as users point out minor issues but so far the release seems to have been well planned and has gone smoothly.

5) There are two monthly grid deployment related meetings at CERN this week. A storage workshop runs Monday and Tuesday (http://indico.cern.ch/conferenceDisplay.py?confId=16456) with both SRM developers present and representatives from the experiments. Greig Cowan will present on "GridPP sites: experience running dCache, DPM, and StoRM". Then on Wednesday is the July Grid Deployment Board meeting (http://indico.cern.ch/conferenceDisplay.py?confId=8485) with a focus on accounting and security. There will be surrounding discussions on WN utilisation, the OPN and a summary from the storage workshop.

SI-4 LCG Management Board Report
---------------------------------

See https://twiki.cern.ch/twiki/bin/view/LCG/MbMeetingsMinutes

SI-5 Documentation Officer's Report
------------------------------------

It was noted that SB was not present.

REVIEW OF ACTIONS
=================

247.2 RJ to get further information from ATLAS regarding use of Grid for testing of PANDA, and report back. RJ reported that this was ongoing and nothing would be happening regarding it in the near future.
250.4 RJ, DN, GP, TD and TC to meet to integrate experiment requirements and work on Tier-2 networks - sites are aware of requirements but discussion still has to take place. Ongoing when convenient to arrange. It was noted that this issue is not high priority.
251.1 TD to raise the issue of memory vs CPU cost at the MB [in order to work out what the requirement was between 1GB and 2GB memory per core]. Ongoing.
252.3 RM has now received inputs for his one-page summary regarding the transition of each of the existing Middleware areas from GridPP2 to GridPP2+ to GridPP3 - this to go to DB. Ongoing.
253.1 AS has commenced work on the report on data integrity at Tier-1, in relation to implementation of checksums. Ongoing.
254.2 ALL PMB members have now signed up to EVO. Tests were ongoing but this action is on hold due to H.323 requirements which must be resolved. JG has resolved EVO H.323 issues at RAL. It was noted that there had been a further EVO test today (2/7) but JG was the only one to join.
255.3 DK to get approval from groups regarding Grid Site Operations policy and report back. Obligations are on the site to carry forward issues. It was reported that all sites had now been consulted. Final project approval was currently happening. Done, item closed.
256.1 NG to review the draft of the new Grid Security Policy from NGS perspective, and SL from Tier-2, and report back. NG had reported at the F2F. Done, item closed.
258.6 JC to discuss RAL RB issues with Catalin Condurache and bring conclusions back to the PMB. In absentia JC reported that the recent RB problems are thought to be due to ALICE hammering the RB until it fails. It is proving difficult to validate this due to poor RB VO monitoring. The urgency to fix problems seen by users is now recognised and the T1 procedure will not always be to wait until queues are empty if a component is being problematic. Another issue here is that UIs are not being configured properly to take account of the load-balanced nature of the RBs. ALICE and LHCb are having their own RBs installed. This is now closed.
259.5 JC to provide recommendations to the PMB on PPS testing and a summary of what is currently available on the system. JC will also forward the chat window location to the PMB via email. The link that was circulated is http://egee-pre-production-service.web.cern.ch/egee-pre-production-service/. Ongoing.
260.1 RM, NG to provide final feedback for site reviews to SL for https://www.gridpp.ac.uk/tier2/Readiness_Reviews/index.html. This was 'in progress' - NG action done; RM ongoing.
260.3 RM, NG, TD, DK to inform SL which site-review information is public/private. Ongoing.
260.4 JG (not JC) to re-start Castor Strategy meetings. Done, item closed.
261.1 TD and JG to prepare a PMB statement for the MB regarding SL4 releases of basic middleware, which were still awaited and were an issue at sites. Ongoing.
261.2 DN, RJ, GP: An action on the experiments to define the future outlook for 64-bit applications and resultant effects on hardware purchasing. Experiment reps to define the outlook. There was a discussion re SL4 & SL5 - ongoing.
261.4 DB to look through the input in detail in relation to GGUS problems. Ongoing.
261.5 JC and dTeam to carry out a survey on sites' experiences of GGUS, when possible to organise. In absentia JC reported that a dialogue has been started but it will take a few weeks to close this action. Ongoing.
261.6 JC to look into the issue of 2-hour response timing @ Tier-2 sites and understand the problem in greater detail - sites also need to understand what the two-hour response time actually means. Ongoing.
261.7 DK to ask Mingchao Ma, the new GridPP Security Officer, to contact sites and check they have security incident response systems in place. The 'climate' of this item was understood to be that this would happen naturally in due course. Item closed.
261.8 JC to talk to Pete Gronbech and Alessandra Forti regarding Monitoring/Nagios/Ganglia training, to include someone from GridView. In absentia JC reported that this had been discussed with Pete and Alessandra and also at the UKI meeting. There is support for this around the next HEPSYSMAN meeting. We will start working on the agenda. Action can be closed.
261.11 SL to progress receipt of final site documents from SouthGrid and London T2 which were still outstanding.
It was noted that this was a duplicate of an earlier action, but was still ongoing.
261.12 NG to progress receipt of SouthGrid feedback. Done, item closed.
261.13 DK to progress receipt of ScotGrid feedback. Ongoing.
261.14 RM to progress receipt of LT2 feedback. Ongoing.
261.15 SL to send an email to sites who still had to provide final versions of the Questionnaire response (list above), informing them that the current version would be considered final unless a revised one was provided by Friday 22nd June. Done, item closed.
261.16 JC to speak to Steve McAllister about getting involved in the SLA (ROC) working group. In absentia JC reported that he had spent an hour with Steve last week but it is not clear that he is the right person to work on SLA issues for the ROC. This should be the ROC manager. It was agreed that JG would progress this.
261.17 JC to assess the general effectiveness of RSS feeds and subscription-based updates, in relation to GridPP blogs. Ongoing.
262.1 RM to draft an extra line for the Travel Policy relating to Tier-2 staff/Experiment contact. Done, item closed.
262.2 SL to clarify GridPP contribution (what is accounted rather than what is available) with the Tier-2 Board. Ongoing.
262.3 DK to raise items (12) [re accounted GridPP contribution] and (22) [re site availability via SAM tests] at the Deployment Board in two weeks' time. Still to be done.
262.4 JC to ascertain the specific problems in relation to Condor support issues. In absentia JC reported that he was still working on this. So far he had contacted two other EGEE sites that are using or trying to use Condor and had asked Santanu to distill the main issues Cambridge is having with Condor as a batch system. Ongoing.
262.5 Regarding poor response time of middleware developers: DK to propose the following recommendation to the Deployment Board: to recommend that if specific issues were involved, GGUS should be used.
If issues were general, the TCG representative at the Tier-2 site should be informed. The TCG rep in turn should raise the issue as appropriate at the TCG meetings. Ongoing.
262.6 JC to raise the issue of PPS feedback information relating to upgrade issues with Pete on the PPS, and ask if there was anything else that could be done. In absentia JC reported that he had talked with Yves and Marian but there was nothing conclusive yet about how to take this forward. Marian reinstalls each time and Yves is already inputting experiences into the wiki (such as with DNS style VO configuration). Ongoing.
262.7 AS to speak to procurement and warn them that sites might want to make parallel purchases - a sentence could be added to the tender document. Ongoing.
262.8 A statement is to be prepared for the MB relating to SAM availability for the last 7 days (62%) - AS to send an email to JG, JC and TD. [This was mainly caused by the failure of the RAL-CERN line, which was down in excess of 48 hrs from 20/06/2007 10:17:54 to approximately 22/06/2007 15:00:00.] Done, item closed.
262.9 Grid access relating to VOs. A document is to be written detailing this issue as VOs need a mechanism 'in'. AS to detail the issue in a separate report and circulate to the PMB. Ongoing.
262.10 Regarding user communication/info provision, JC suggested amending the emphasis of the UB to be more in touch with users generally - it was agreed that he would raise this with Glenn. In absentia JC reported that he would talk with Glenn next week when at RAL. Ongoing.
262.11 SB to add a new Document to the PMB Documents, No 114, relating to a documentation report overview on current status. Ongoing.

ACTIONS AS AT 02.07.07
======================

247.2 RJ to get further information from ATLAS regarding use of Grid for testing of PANDA, and report back.
250.4 RJ, DN, GP, TD and TC to meet to integrate experiment requirements and work on Tier-2 networks - sites are aware of requirements but discussion still has to take place. Ongoing when convenient to arrange. It was noted that this issue is not high priority.
251.1 TD to raise the issue of memory vs CPU cost at the MB [in order to work out what the requirement was between 1GB and 2GB memory per core].
252.3 RM has now received inputs for his one-page summary regarding the transition of each of the existing Middleware areas from GridPP2 to GridPP2+ to GridPP3 - this to go to DB. This will be done by Friday 8th June.
253.1 AS has commenced work on the report on data integrity at Tier-1, in relation to implementation of checksums.
254.2 ALL PMB members have now signed up to EVO. Tests were ongoing but this action is on hold due to H.323 requirements which must be resolved. JG/RM will resolve EVO issues.
259.5 JC to provide recommendations to the PMB on PPS testing and a summary of what is currently available on the system.
260.1 RM to provide final feedback for site reviews to SL for https://www.gridpp.ac.uk/tier2/Readiness_Reviews/index.html.
260.3 RM, NG, TD, DK to inform SL which site-review information is public/private.
261.1 TD and JG to prepare a PMB statement for the MB regarding SL4 releases of basic middleware, which were still awaited and were an issue at sites.
261.2 DN, RJ, GP: An action on the experiments to define the future outlook for 64-bit applications and resultant effects on hardware purchasing. Experiment reps to define the outlook.
261.4 DB to look through the input in detail in relation to GGUS problems.
261.5 JC and dTeam to carry out a survey on sites' experiences of GGUS, when possible to organise.
261.6 JC to look into the issue of 2-hour response timing @ Tier-2 sites and understand the problem in greater detail - sites also need to understand what the two-hour response time actually means.
261.11 SL to progress receipt of final site documents from SouthGrid and London T2 which were still outstanding.
261.13 DK to progress receipt of ScotGrid feedback.
261.14 RM to progress receipt of LT2 feedback.
261.16 JG to progress the issue of (someone, not Steve McAllister - the ROC manager?) getting involved in the SLA (ROC) working group.
261.17 JC to assess the general effectiveness of RSS feeds and subscription-based updates, in relation to GridPP blogs.
262.2 SL to clarify GridPP contribution (what is accounted rather than what is available) with the Tier-2 Board.
262.3 DK to raise items (12) [re accounted GridPP contribution] and (22) [re site availability via SAM tests] at the Deployment Board in two weeks' time.
262.4 JC to ascertain the specific problems in relation to Condor support issues.
262.5 Regarding poor response time of middleware developers: DK to propose the following recommendation to the Deployment Board: to recommend that if specific issues were involved, GGUS should be used. If issues were general, the TCG representative (Alessandra Forti) should be informed. The TCG rep in turn should raise the issue as appropriate at the TCG meetings.
262.6 JC to raise the issue of PPS feedback information relating to upgrade issues with the relevant individual(s) on the PPS, and ask if there was anything else that could be done.
262.7 AS to speak to procurement and warn them that sites might want to make parallel purchases - a sentence could be added to the tender document.
262.9 Grid access relating to VOs. A document is to be written detailing this issue as VOs need a mechanism 'in'. AS to detail the issue in a separate report and circulate to the PMB.
262.10 Regarding user communication/info provision, JC suggested amending the emphasis of the UB to be more in touch with users generally - it was agreed that he would raise this with Glenn.
262.11 SB to add a new Document to the PMB Documents, No 114, relating to a documentation report overview on current status.
263.1 Robin Tasker to re-circulate his paper regarding the RAL-CERN OPN link, once further information was available.
263.2 JG to investigate further the lack of ability to pass job requirements to the batch system and report back (Tier-2 review issue).

The next PMB would take place on Monday 9th July. The meeting closed at 2.00 pm.
