Dear All,
****REMINDER****
Abstract submission deadline for ACAT is tomorrow
http://acat2011.cern.ch/
Please find attached the GridPP Project Management Board Meeting minutes
for the 430th meeting.
The latest minutes can be found each week in:
http://www.gridpp.ac.uk/php/pmb/minutes.php?latest
as well as being listed with other minutes at:
http://www.gridpp.ac.uk/php/pmb/minutes.php
Cheers, Dave.
--
________________________________________________________________________
Prof. David Britton GridPP Project Leader
Rm 480, Kelvin Building Telephone: +44 141 330 5454
School of Physics and Astronomy Telefax: +44-141-330 5881
University of Glasgow EMail: [log in to unmask]
G12 8QQ, UK
________________________________________________________________________
GridPP PMB Minutes 430 (27.06.11)
=================================
Present: Dave Britton (Chair), Jeremy Coles, Pete Gronbech, Dave Kelsey, Steve Lloyd, John
Gordon, Roger Jones, Andrew Sansum, Tony Cass, Neil Geddes (Suzanne Scott - Minutes)
Apologies: Tony Doyle, Robin Middleton, Pete Clarke, Glenn Patrick, Dave Colling
1. Input to 'Future of Research' draft
=======================================
DB asked for inputs to the discussion on the UK Research Computing Ecosystem document which
Peter Coveney had produced. DB had circulated various drafts of the GridPP response. It was
noted that the document described neither HEP nor GridPP in the UK. It had been written from an
HPC perspective. DB noted an implicit danger as the document appeared to apply to the whole of
UK Research Computing, and in the long run this would be problematic if high-level discussions
did not include what the HEP community had achieved and what it currently did. DB noted that it
was possible to be helpful to them, as they did have a problem to solve. DB asked if we wanted to
be included or excluded from this paper. Comments?
NG advised that the document had grown out of a number of different themes which were
running, partly due to Malcolm Atkinson resigning. At the e-Science Directors' meetings,
Edinburgh had been keen to be involved, and the UK was not in PRACE, therefore the community
meetings had proposed this course of action for a Town Meeting to discuss UK e-Science. In
parallel, involved in Collaborative Computational Projects (CCP), Peter Coveney had acted to
produce a strategy document which had been discussed at UCL, and several people had been
tasked with writing different sections of this document. It was felt that they had a stronger case if
there was community-wide buy-in. NG noted that HEP references had been included in the first
draft but had been removed in subsequent drafts. The Recommendations hadn't been discussed
at all.
JG thought that some of the Recommendations were non-starters, especially the funding idea of a
'central pot'. DB agreed, noting that if this were propagated through the system it could affect
our funding. NG noted yes, it could affect research, especially people on the boundaries of
different research projects which were funded by different research councils.
DB noted that he had moved through various drafts of his response letter, but that we should be
supportive if we could. NG thought that the document had to be inclusive in order to be
successful. SL considered that we should hold ourselves up as an example of how things do work.
JG thought this should include NGS. SL disagreed, noting that we should answer this from a
GridPP point of view. DB agreed, advising that we should not make things too diffuse. Their
document needed to be clear it wasn't talking about HEP. DB asked if the PMB were comfortable
with both direction and tone of his response? Yes. SL advised that we could have an Annexe
document that summarised GridPP, and possibly add the List of Roles into that? DB suggested
putting the Appendices into a separate document? SL thought it didn't matter very much. DB
considered that the background information gave the strength and breadth of GridPP. DB would
do a final draft today and send it to Peter Coveney. Any more comments were welcome.
2. Accounting - new metrics from Manchester
============================================
SL reported that he had discussed the Accounting with Mike, who had been going to suggest using
different metrics. SL had explained everything to him in detail, from the beginning, explaining
why the current method being used was not the best. Mike had seemed reasonably happy and had
understood why we were doing what we were doing. SL had subsequently received an email from
him saying that they were not against our methods but that they were looking at the ATLAS
numbers for consistency. They believed that they could come up with a better metric. SL noted
that if they were to do this, he needed it urgently. There ensued a discussion on normalised CPU.
DB advised that we would consider a suggestion from Manchester but that it would be needed
before the end of the month. PG thought that there should be a heavier weighting on analysis and
production work actually done at sites rather than what CPU was available. DB noted that we had
already discussed these points and it was ultimately an ATLAS choice as to how and why they
distributed the funds - it wasn't something that the PMB should be involved in. PG would speak to
RJ offline. DB asked SL to contact Mike and ask for any input from them by the end of June (this
Thursday). SL advised that he was only proposing to change the HEPSPEC of used CPU, not that
advertised.
3. SNO+ Resource Request
=========================
DB reported that there had been an email request for resources for SNO+. DB didn't think it
looked too unreasonable. AS noted he was re-doing the tape planning anyway - we had 1-1.5 PB
of tape for 'other' experiments for that period therefore the 300TB SNO request seemed
manageable. In general, SL advised that there were two models we currently used: (a) we asked
explicitly for support for 'others' in each GridPP proposal, using whatever numbers the 'others'
come up with and then they live within this (currently 10%); (b) other communities request funds
from PPRP for computing which GridPP then administers and gives them a guaranteed share. DB
advised that SNO should request the computing they wanted in their grant application, then we
could include a resource request line for that experiment. This was the best model, cf UKQCD/LHC
'others'. SNO+ could simply be a new line item.
ACTION
430.1 Re the request for resources from SNO+, DB to draft something for GP to respond and
feed in.
4. wLCG Technology Evolution Group
===================================
DB reported that at the last GDB it had been agreed to start a working group to understand the
technical evolution of wLCG. The suggested format was a forum on Tuesdays, before the GDBs,
where detailed technical discussions could take place. JG advised that some discussions were too
big and complex for the GDB, eg: multi-user pilot jobs framework. There was a need for another
forum which was a smaller group with site representation. JG was not convinced that it would
work in the model proposed, that of a core of people discussing all issues. The Tier-1 could decide
for themselves regarding their representative; for the operations part perhaps JC and another for
the site delegate? JG noted he wasn't on this group. DB agreed that it probably wouldn't work, but
if it did, for the UK we needed one person there as it would be good to have someone in the room.
JG asked if we could get one person from the Tier-1/Tier-2? DB thought that Romain was a good
candidate for security, but DK was also required for policy issues. DB suggested nominating JC for
operations and/or PG for the Tier-2? It was agreed to nominate both.
ACTION
430.2 DB to nominate both JC and PG for membership of the wLCG technical evolution working
group, to ensure UK representation.
5. AOCB
========
a) DB reported that, regarding GridPP28 in 2012, EGI had chosen the same week for their
meeting. What constraints were there from our side regarding different dates? 19-23 March was
out. The IoP was the week after 26th March, then it was Easter. DB thought it would need to be
either 11-13 April, or the following week, 16-20 April. RJ noted that 23rd was term time and he
would likely be teaching.
ACTION
430.3 DB to contact Mike Seymour at Manchester and find out what dates were possible from
their point of view - possibly w/c 16 or 23 April 2012.
b) Re Capital Expenditure for FY10 - the message was that we can bill the tape drive
infrastructure to FY10 (GridPP3). This was £200k, but the maintenance could not be accrued,
therefore for the drives only it was £184k. AS advised that this was a 'done deal' now unless the
auditors rejected it. The one complication was that the credit would not show at project level,
only at cost centre level, which meant that it wasn't visible. We would need a letter that
documents this. DB noted that this was potentially good news, we could change our accounting to
register the credit and we would tell STFC that we spent on that budget. AS asked if we would
spend the credit in this financial year? DB noted probably not, but it might be required at some
point in the future. DK noted it gave us potential flexibility.
STANDING ITEMS
==============
SI-1 Tier-1 Manager's Report
-----------------------------
AS reported as follows:
Fabric:
1) FY11 procurements
- EU tender for disk framework PQQ evaluation complete and supplier shortlist agreed. Expect ITT
to go out late this week or early next week.
- CPU framework PQQ ready to go out.
2) SL08 considered deployable. Plan to redeploy as required into T1D0 service classes.
There ensued a discussion on tape buffer and LHCb requirements.
3) FY10 Tape drive purchase - update on delivery and financial profile available.
4) Probable intervention on OPN router on 5th July 8-10am (TBC) is likely to cause a break in
connectivity from the WAN to our disk servers.
Service:
1) Summary of operational issues is at:
http://www.gridpp.ac.uk/wiki/Tier1_Operations_Report_2011-06-15
http://www.gridpp.ac.uk/wiki/Tier1_Operations_Report_2011-06-22
2) CASTOR
* CASTOR outage (two periods of about 6 hours) over the weekend owing to database problems.
Still under investigation but probably caused by database problems on the Neptune RAC.
* High load on tape recalls for LHCB coupled with a number of issues (size of service class, disk
server read/write contention/performance, migration policy, poor localisation of data on tape)
has led to delayed tape access for LHCB. We are working on a number of these issues.
* Expect to upgrade CASTOR tape servers to 2.1.10-1 to enable T10KC - expected 5th July. Will
need downtime (probably co-scheduled with the network intervention).
* Preparing T10KC migration plan. Most of the pieces are already in place and we now need to
agree which VOs we will migrate and when.
Staff:
1) Grid team leader post internal recruitment unsuccessful (late last week). Considering
alternatives.
2) Paperwork for four other vacancies has been approved! Expect to submit to SSC in next day.
* Two system admins for Fabric team
* One CASTOR admin
* One Grid Team member
SI-2 Production Manager's Report
---------------------------------
JC reported as follows:
1) There are now 11 GridPP sites with glexec enabled and passing the ops VO tests on at least one
cluster (RHUL; Birmingham; Brunel; Bristol?; Liverpool; RALPP; RAL Tier-1; Glasgow; Oxford and
Sheffield). A couple of sites are still enabling it and may be ready this week. 6 sites are waiting for
a form of relocatable installation (we have not yet got any specific dates back on this, but if it
looks too far away we will look again at building from source).
2) There have been some problems with APEL publishing for most sites during the last week. This
now looks to be resolving and may have been due to the Spanish Tier-1 republishing a lot of data
leading to timeouts for others trying to upload data.
3) Grid Ireland has finished the process of creating NGI_IE. This should mean that we begin the
move to “NGI_UK” very soon.
4) Some sites have been setting up iperf servers to help understand issues being found with the
perf-sonar tests: http://tinyurl.com/6a7dshg. Some WLCG Tier-1s have agreed to provide a
service too but with mixed feelings. There was a discussion on the difficulty of eliciting details on
the GridMon setup to enable operation at Glasgow.
5) An authentication bypass vulnerability in torque, which if exploited allows unauthorized users
to submit jobs, has required some sites to update their torque configuration settings and revise
their firewall rules.
6) Pete Gronbech observed a problem with the REBUS updater that meant site CPU values were
not updated correctly. This has now been fixed. This was noticed because the GridPP accounting
table did not update the site available CPU resources after additional nodes were put online.
7) WLCG has now released an updated version of the monthly availability and reliability figures
for Tier-2 sites with CREAM now correctly accounted. This update shows some improvements in
the GridPP site figures but does not introduce new site issues to discuss today (see the
explanations given at the last PMB).
A) The summer HEPSYSMAN meeting takes place later this week at RAL
http://hepwww.rl.ac.uk/sysman/June2011/agenda.html. In addition to site updates and a
security workshop on the last day, those in the ops team will try to fit in discussions about the
(individual) ops team tasks.
B) There will be a Lustre workshop at QMUL on 14th July http://www.lustreusergroup.org/.
SI-3 ATLAS weekly review & plans
---------------------------------
RJ reported that they were doing network testing; there was an issue of load going through the
Tier-1 which they were investigating; ATLAS production worldwide crashed on Friday morning
last, queues still existed (this was not a UK issue, it was global).
SI-4 CMS weekly review & plans
-------------------------------
DC was not present.
SI-5 LHCb weekly review & plans
--------------------------------
In absentia GP reported:
1) LHCb has had a few problems with “input data resolution” failures. Usually, this is due to input
data not found on SE. Also, a rise in the number of jobs with “Watchdog identified job as stalled” –
usually due to problems accessing/streaming data at worker node. Some problems also with DIRAC
staging and SRM unresponsiveness.
2) From RAL Tier 1 side, a number of problems with staging data (stuck tapes, daemons, etc). Also,
some long delays between staging and being able to access data. Castor then went down due to
database issues over the weekend (I think this only affected LHCb and ATLAS).
SI-6 User Co-ordination issues
-------------------------------
GP was not present. Please see agenda item 3 for discussion of SNO+ resources.
SI-7 LCG Management Board Report
---------------------------------
There had been no MB.
SI-8 Dissemination
-------------------
SL reported that he had started putting the weekly minutes onto the GridPP website in
docs/Minutes. This would give an idea of issues currently being covered. DB advised that this
was a reminder that the document page should be re-organised - a higher-level front page was
required to facilitate ease of access to the various documents. This was on Neasan's 'to do' list.
AOB
===
PG reminded the meeting about the Quarterly Reports. He would send out template reports to the
different groups, but he needed target values for metrics. Users to reply please. RJ noted he could
work on this on 1st July.
RJ reported on issues with the Cream CE and Condor, which were currently being investigated.
REVIEW OF ACTIONS
=================
400.4 SL to co-ordinate changing the current GridPP MoU towards an MoU for GridPP4. In
progress - document had been circulated. Any corrections to be sent to SL. Ongoing.
424.3: DB to contact ALICE-UK about Tier-2 resources. Ongoing.
425.7 DC to have an internal discussion within CMS relating to use of future technology and
evolution of the computing model, from September to the next couple of years. DC to come up
with possible suggestion of theme/topics for GridPP27 at CERN. Ongoing.
425.8 AS to consider any longer-term issues relating to storage, DPM, databases etc, and come
back to DB with any ideas for sessions at GridPP27. Ongoing.
428.2 DC to check at Imperial regarding the new person dealing with ganga, in relation to a talk at
ACAT. Ongoing.
428.3 JC to compile an info list relating to sub-nets at sites. Ongoing.
428.6 AS to come up with a proposal for how to use the current disk buffer at the Tier-1. Ongoing.
ACTIONS AS AT 27.06.11
======================
400.4 SL to co-ordinate changing the current GridPP MoU towards an MoU for GridPP4. In
progress - document had been circulated. Any corrections to be sent to SL.
424.3: DB to contact ALICE-UK about Tier-2 resources.
425.7 DC to have an internal discussion within CMS relating to use of future technology and
evolution of the computing model, from September to the next couple of years. DC to come up
with possible suggestion of theme/topics for GridPP27 at CERN.
425.8 AS to consider any longer-term issues relating to storage, DPM, databases etc, and come
back to DB with any ideas for sessions at GridPP27.
428.2 DC to check at Imperial regarding the new person dealing with ganga, in relation to a talk at
ACAT.
428.3 JC to compile an info list relating to sub-nets at sites.
428.6 AS to come up with a proposal for how to use the current disk buffer at the Tier-1.
430.1 Re the request for resources from SNO+, DB to draft something for GP to respond and
feed in.
430.2 DB to nominate both JC and PG for membership of the wLCG technical evolution working
group, to ensure UK representation.
430.3 Re GridPP28, DB to contact Mike Seymour at Manchester and find out what dates were
possible from their point of view - possibly w/c 16 or 23 April 2012.
Forthcoming PMB meetings would take place on the following dates:
**** Fri July 15th ****
Mon July 25th
Mon Aug 8th
Mon Aug 22nd
Mon Sep 5th
Tue Sep 13th F2F@CERN
Mon Sep 26th