I think I added them to the agenda page but cannot easily find
where they went, so I attach them to this email as well.
Share and enjoy.
-j
UKI-ROC meeting.
Present: Yves Coppens, Duncan Rand (DTR), Winnie Lacesso, Phil Roffe,
Andrew Elwell, Matt Doidge, Peter Love, Brian Davies, Pete
Gronbech, Ewan Mac Mahon, Chris Brew, Derek Ross (DR), Sam
Skipsey, Greig Cowan, Simon George, Barney Garrett, Graeme
Stewart, Jeremy Coles (chair), Jens Jensen (mins)
Site status review and issues (20')
- Accounting
- These sites have accounting issues
-- cpDIASie; csTCDie (Nov); mpUCDie
-- EFDA-JET (late Nov); BRIS-HEP; OX-HEP
Bristol should be OK now.
- Monitoring status
-- RAL-PPD
CB: CE offline, the rest available.
- WLCG MB tracking
WLCG MB beginning to take an interest in T2 availability. Sites not
meeting their 95% availability targets may be asked to account for
their performance.
Sites like QM and RHUL are currently low.
- http://lcg-sam.cern.ch:8080/reports/t2/site_avail.xsql
Worked for Graeme, not for Jeremy or Jens.
- Steve's tests
Pages not available for inspection. Lancaster was greyed out for
some reason. QM appeared to have low utilisation. Cambridge had a
problem but Frédéric is (believed to be) working with Santanu. An
Atlas file went missing at Edinburgh, Greig will investigate.
Atlas problems at QM: took 10 days to upgrade a file; gLite out of
date. Storage unreliable: intentions to run DPM on Lustre (like
Cambridge and UCL) but not making progress.
- What's going on with cfengine at sites?
Alessandra working on streamlining cfengine config at Manchester;
complicated. Apparently nothing else is going on!
10:50 Current experiment/VO activities and issues (05')
- ATLAS
Atlas more recently ran 2000+ jobs with >90% success rate. However,
file corruption was seen on a CASTOR disk server at RAL (believed by
the CASTOR team to be caused by a faulty network card), which is
particularly bad when the corrupted data is subsequently distributed
to T2s for analysis. Also problems with CASTOR access, resolved when
the Gridmap file was updated earlier today. Also serious problems
with disk server availability: capacity is running out, particularly
with one server (possibly) defective. This problem will be raised
with the PMB on Monday. There is a data distribution backlog.
RHUL should be moving closer to validation.
- CMS
Workshop at RAL. A problem: the CMS software was not installed.
Inactive sites in London, QM and UCL.
- LHCb
- Regional VOs
ScotGrid and NorthGrid VOs now active. London has one already.
SouthGrid nearly has one - it is not registered with EGEE yet.
Does the WMS support regional VOs? Jeremy should check with Catalin
(who is going on leave today) - ACTION JC.
- Other such as biomed
-- Sites encouraged to support supernemo, gridpp and ngs.ac.uk.
Concern about smaller VOs not using the helpdesk much, or
properly. E.g. one pheno ticket went unanswered for a few hours;
they found a workaround and closed the ticket themselves - but then
the original problem was never resolved.
New "generic" VO, gear. Also a Crypto VO set up specifically to
work on crypto challenges. RSA768 has now been estimated to take
1800 CPU years - is that worthwhile? Can we review whether each
proposal is worthwhile - not via allocations but perhaps by
coordinating more closely with VO managers.
GDB: There was a GDB last week (http://indico.cern.ch/conferenceDisplay.py?confId=8508)
- Monitoring
Some problems with Ian Neilson's monitoring prototype; requires
separate Nagios config (AE). Aimed originally at sites not
currently running Nagios, but these sites - if there are any - have
not expressed much interest in this meeting.
- Some discussion on reliability workshop outcomes (held the week
  before) - main points covered in Jamie Shiers' paper
  http://tinyurl.com/ywx6v7.
- Short report on HEPiX. The membership is now expanding with groups
like those from Genomics interested. Michel Jouvin is now the
European chair. Next meeting 5th-9th May at CERN:
http://www.hepix.org.
- Pilot jobs - still looking for volunteers to test glexec with batch
systems (also looking for testers for job priority working group
prototype).
Does anyone with Condor or SGE volunteer to test?
WLCG management board
-- Main focus has been the pledges to 2012 which were due at the end
of November.
-- Experiment specific SAM tests
-- HEP benchmarking - moving on from KSI2K
Purpose is to decide the relevant benchmarks. Partly that we don't
always trust the vendors, partly that their benchmarks aren't
always relevant for our type of work. Some of the benchmarks in use
are proprietary, with a licence fee - e.g. $200 (was $170) for
"educational" users.
-- CCRC (Common Computing Readiness Challenge) preparations are
ongoing. What needs to be in place... timelines etc. Main thing for
T2s to be concerned about is having SRMv2.2 available and stable.
There are weekly meetings (Mondays). For GridPP, Andrew Sansum
attends.
DTEAM
-- "Areas that DTEAM could benefit from more Tier-1 input" - result
of the recent GridPP Tier-1 reviews. Many areas could be
covered: CEs; dCache; CASTOR; WMS; FTS; RBs; MON box; R-GMA;
monitoring; SRB; information system; LSF; storage services; machine
room management; security; Oracle.
Suggestion that we would have more clout combined (T1 + T2s).
Suggestion that a T1 blog would be useful.
Suggestion that T2s could make joint purchases, piggybacking on T1
procurements.
-- Following up on tickets
Instances of slow turnaround once the ticket has reached the ROC.
This needs follow-up. (ACTION JC)
-- Feedback on incident response procedure (simple form):
http://www.gridpp.ac.uk/deployment/security/inchand/index.html
-- Ops meeting news
11:15 Regional planning (10')
- Plans for deployment of gLite-WMS
Discussion again about local VOs and local WMS. ScotGrid will
install local WMS. SouthGrid have not discussed it yet. NorthGrid
needs another meeting to discuss it.
London has a regional LFC, and Imperial is of course heavily
involved in WMS but is currently doing performance testing. Will
have a local WMS for VOs soon.
NGS: DNS-style VOs need enabling, and pool accounts too. A problem
was found (ScotGrid) with a short timeout on SAM - thus SAM requires
special privileged access. Also some questions about the information
schema.
NGS uses GSI authentication to log in, giving shell access to UIs
from where jobs are run. This is necessary because NGS are (in
general) running Globus/VDT, not gLite - simpler software stack.
Portals exist and are maturing all the time, e.g. portal.ngs.ac.uk.
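To illustrate the GSI login route described above, a session might
look roughly like the following sketch (the hostnames are
hypothetical; GSI-OpenSSH commonly listens on port 2222, though NGS
sites may differ):

```shell
# Create a short-lived proxy from the user's grid certificate
grid-proxy-init

# Log in to an NGS UI over GSI-OpenSSH (hostname is illustrative)
gsissh -p 2222 ui.example.ngs.ac.uk

# From the UI shell, jobs can then be submitted via Globus GRAM,
# e.g. (target host again illustrative):
globus-job-submit gram.example.ngs.ac.uk /bin/hostname
```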
11:25 AOB (05')
Oxford have DPM on 64 bit (pool nodes). This issue was also
discussed in the storage meeting Wed.
PG specifically mentioned problems with DNS-style VOs, particularly
on UIs. The VOMS file to match the VO (in /opt/glite/etc/vomses)
does not get generated correctly by YAIM; creating it manually
works. However, ScotGrid enabled them two months ago with no
problems?
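For reference, each vomses file is expected to hold a single
five-field line: VO alias, VOMS server host, port, server DN, and VO
name. A hand-written entry for a DNS-style VO might look like the
sketch below (the VO name, host, port and DN are illustrative
values, not the actual configuration):

```
"vo.southgrid.ac.uk" "voms.example.ac.uk" "15000" "/C=UK/O=eScience/OU=Example/CN=voms.example.ac.uk" "vo.southgrid.ac.uk"
```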