> -----Original Message-----
> From: Testbed Support for GridPP member institutes [mailto:TB-
> [log in to unmask]] On Behalf Of Jeremy Coles
> Sent: 24 February 2015 09:43
> To: [log in to unmask]
> Subject: Ops @ 11am
>
> Dear All,
>
> The agenda for today’s ops meeting can be found at
> http://indico.cern.ch/event/376287/.
Hi all,
Draft minutes of this morning's meeting are attached; comments and corrections are welcome as always. Can I particularly encourage everyone to have a look at their site's section in the round table - I think it's basically good, but some bits did go past fairly rapidly.
Ewan
GridPP Weekly Operations meeting 2015 02 24
===========================================
--- https://indico.cern.ch/event/376287/ ---
Experiments
============
LHCb
-----
Andrew McNab reported that LHCb have moved from using the LFC file
catalogue to the DIRAC File Catalogue (DFC). This seems to have happened
pretty smoothly, after a great deal of preparation. LHCb have reduced the
number of jobs they're running to help, so there are currently only some
user jobs running, but essentially no Monte Carlo.
CMS
----
Nothing much to report; an availability monitoring problem at RALPPD has
been fixed and the history re-done. A downtime failed to end correctly in
the SAM dashboard, which led the CMS site readiness tools to think the
site was still in downtime when it wasn't. Jeremy Coles queried the fact
that the site still appears to be in downtime now, and it was explained
that this is a new, and real, one for a dCache upgrade. Federico later
clarified that the current downtime was started on Friday, so as to allow
the cluster to drain of jobs over the weekend.
Daniela noted that Imperial have received a CMS ticket about glExec test
failures on one of their ARC CEs - test jobs seem to be submitted and run
fine, but the glExec-specific component complains. Chris Brew noted that
both RAL PPD and the Tier 1 have got tickets for a possibly similar
problem but from ATLAS.
ATLAS
------
Elena has attached a written ATLAS status report to the meeting agenda:
https://indico.cern.ch/event/376287/contribution/0/material/slides/0.pdf
She went on to discuss in particular somewhat conflicting reports on
ATLAS' ability to effectively utilise multicore resources: firstly, at
last week's WLCG Operations coordination meeting
(https://indico.cern.ch/event/372875/) it was reported that ATLAS are
using almost all resources available to them, both single and multicore.
However, at the ADC weekly meeting (http://indico.cern.ch/e/366304), it
was reported that it is still proving difficult to fill all the capacity.
There is an issue with multicore job lengths - on dedicated resources
submitting longer jobs (e.g. 8 hours) gives better utilisation, but ATLAS
has access to opportunistic resources that prefer shorter jobs (e.g. one
hour). Work is being done to allow job lengths to be selected/adjusted
after pilots land on resources.
ATLAS have some issues with FTS3 for which they are currently using a
work-around, but are collaborating with the FTS developers to try to find
a clean solution.
It was also noted that Matt Doidge had reported a problem with software
release validation jobs at Lancaster; this has also been seen at other
sites, and is currently under investigation.
Other VOs
----------
There is an issue with the LIGO VO; they previously used OSG resources
and had a VOMS infrastructure, but it no longer exists. Discussions
are ongoing with OSG to enable the VO without duplicating effort.
LSST is being rolled out on some NorthGrid resources as a new VO, but
their software is being distributed via the NorthGrid cvmfs repository as
an interim solution until they have a cvmfs repository of their own set up.
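For any site wanting to mount the interim repository locally, the
client-side change should just be a couple of lines of cvmfs config plus
a probe to check the mount. A minimal sketch, assuming the site already
trusts the gridpp.ac.uk domain keys, and with an assumed repository name
(check with NorthGrid for the real one):

    # /etc/cvmfs/default.local - add the repository to the client mount list
    # (the northgrid.gridpp.ac.uk name is illustrative, not confirmed)
    CVMFS_REPOSITORIES=atlas.cern.ch,lhcb.cern.ch,northgrid.gridpp.ac.uk
    CVMFS_HTTP_PROXY="http://squid.example.ac.uk:3128"

    # reload the configuration and verify the repository mounts
    cvmfs_config reload
    cvmfs_config probe northgrid.gridpp.ac.uk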
Tom Whyntie reported that the UCLan Galaxy Dynamics group have
successfully used CernVM systems to compile their code. The likely next
step is to upload the result to cvmfs, but it's not quite ready. After
that, they'll
probably need their own VO (and corresponding cvmfs area), but there is
some work to do first to ensure that we understand how to create a new VO
- some of our tools and documentation have bitrotted a bit. The proteomics
group at QMUL is in a similar state. Ewan queried whether these groups
have really outgrown the regional incubator VOs, and Tom explained that
while they arguably haven't, in these specific cases they're being very
useful to help shake out some of our VO creation procedures, so we're
advancing them through the process arguably a little faster than we
otherwise might.
GridPP Dirac service
---------------------
Janusz and Duncan are both on holiday at the moment. Daniela reported
that things are slightly behind schedule at present due to a problem with
support for importing user lists automatically from VOMS servers in the
multi-VO case. The current target is to have the DIRAC service in a
production state by April, which was described as 'not impossible'.
Andrew McNab noted that the GridPP Dirac appears not to be sending pilot
jobs - tests are running on the VAC type sites, which pull in work on
their own, but no pilots appear to be being sent to CREAM/ARC CEs.
Daniela said that she'd look into it; there was a suggestion that, given
people being on holiday, it may simply be an expired proxy somewhere.
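If it is a proxy problem it should be quick to confirm; as a hedged
sketch, something along these lines (the proxy path is an assumption -
it depends on the local DIRAC installation):

    # check the remaining lifetime and FQANs of the pilot-submission proxy
    voms-proxy-info -file /opt/dirac/work/proxies/pilot.proxy -timeleft -fqan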
There was some discussion about whether this would block anyone from
running work, or just restrict them to VAC type sites. Andrew explained
that a normal GridPP Dirac VAC VM will only pull work for the GridPP VO,
not any VO on the GridPP Dirac server, and that new VM types needed to be
created for each VO (even though they'd be almost completely identical).
Ewan asked whether it was possible to set that VO filter to 'any' to
enable a VAC resource to pick up work for any VO supported on the Dirac
server; Andrew thought not, but said he'd check.
Bulletin updates
=================
In order to leave time for a site round table, there was a rapid review of
the Operations Bulletin updates with only a few items attracting any
discussion. For everything else, see the bulletin itself and links
therein:
https://www.gridpp.ac.uk/w/index.php?title=Operations_Bulletin_Latest&oldid=7611
HTTP task force - experiments have two weeks to decide whether they are
actually interested in HTTP before it's decided whether to push ahead
with the task force proper.
There was a discussion about testing the Middleware package reporter tool
and whether its installation was documented anywhere; it was described as
being very simple, but not well documented. The discussion was slightly
complicated by confusion with the Machine/Job Features work, which is in
a similar, but possibly even worse, state.
It was noted that the WLCG baseline version of DPM now seems to be 1.8.9;
this was thought not to be a good idea, and should not be required.
Tickets
--------
We're now down to only 15 tickets in the UK, which is good.
Two tickets were talked about:
- Sussex's perfSonar ticket:
https://ggus.eu/?mode=ticket_info&ticket_id=110389
Matt-RB reports that he's taken the ticket out of the 'on hold' status,
and believes that the Sussex perfSonar is now fully working.
- SNO+ file copying at the Tier 1:
https://ggus.eu/?mode=ticket_info&ticket_id=109694
There is a new version of the GFAL2 client tools which has been tested
and shown to solve the problem; Brian noted that it is currently in the
epel-testing repository.
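For anyone wanting to test ahead of the stable release, pulling the
update from epel-testing for a single transaction is straightforward
(the exact package set named here is my assumption):

    # take the updated GFAL2 tools from epel-testing just for this transaction
    yum --enablerepo=epel-testing update gfal2 gfal2-util gfal2-all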
Site round-table
-----------------
Manchester: Alessandra has been very busy with getting LSST going. The
full cluster is now multicore enabled. It is still running Torque, but
there are plans to begin an HTCondor deployment very soon. Andrew McNab
said that the VM system is running solidly, and Robert indicated an
intention to renew efforts towards IPv6, but that this requires
conversations with NetNorthWest.
ECDF: No major issues to report; for the future, there's a shared cluster
hardware refresh coming up in the next few months, and there's a
possibility that this may offer some cloud interface services.
Tier 1: Andrew Lahiff has been looking at using cloud resources on demand
to provide worker nodes to the batch system, and mentioned a plan to run
SL7 on the worker node hardware but run jobs in SL6 containers (see the
sketch below). Brian reported some recent moves on storage - it should
soon be possible to allow more frequent namespace dumps of Castor, and an
interesting result of the recent ATLAS deletion campaign was to uncover a
number of files that the VO thought RAL had, but that it actually did
not.
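The meeting didn't go into the container technology for the SL6-on-SL7
plan; purely as an illustration, assuming a Docker-style approach with a
Scientific Linux 6 base image, the shape of it would be something like:

    # illustrative only: run a payload under an SL6 userland on an SL7 host,
    # with the host's cvmfs mounts visible read-only inside the container
    docker run --rm -v /cvmfs:/cvmfs:ro sl:6 cat /etc/redhat-release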
RALPPD: Currently in downtime to upgrade to dCache 2.10, which appears to
have gone mostly OK, but there is a problem preventing the SRM service
from starting. RALPPD is investigating the possibility of cloud services,
including potentially running RALPPD worker nodes on the Tier 1 cloud
infrastructure.
Imperial: Effort going mostly into Dirac. Jeremy noted that Imperial is
well ahead on things like cloud/IPv6 etc.
Glasgow: Gareth reported that new hardware is currently being commissioned
and new storage should be online in a couple of weeks or so. New CPU is
already online. Approximately 40% of the batch system is now running as
condor/arc. The perfSonar is all up to date and fine, though Jeremy did
point out that there was a warning showing in the dashboard:
https://perfsonar-itb.grid.iu.edu/WLCGperfSONAR/check_mk/index.py?start_url=%2FWLCGperfSONAR%2Fcheck_mk%2Fview.py%3Fview_name%3Dhosts%26host%3Dgla
VM/cloud activities are at the initial stages; there is a very new
OpenStack instance on three machines that's currently focussed on local
users but is hoped to expand to GridPP use too. Glasgow has had basic
IPv6 support for a long time, but this is provisioned on a non-production
capable network, so is only suitable for dedicated test systems - rolling
out IPv6 support on production nodes would actually degrade service.
Sheffield: There is a new condor/arc system currently being tested, which
is planned to be in production by the end of next week. On IPv6, addresses have
been allocated and are being used on perfSonar boxes. There are no plans
to enable IPv6 on other services due to a lack of effort, but the site
expects to be able to follow along once IPv6 is out of the experimental
phase. There is an ongoing firewall/port opening problem with
the Sheffield perfSonars, Elena is planning to email Duncan for help.
Elena is planning a DPM upgrade to 1.8.9.
Oxford: A majority of the cluster is running under condor/arc, but with a
legacy CREAM/torque system essentially for the benefit of ALICE who are
unable to submit jobs to ARC CEs. Brian mentioned that Catalin may have a
solution for ALICE being able to submit to ARC CEs at the Tier 1 that
would be worth investigating. IPv6 is in a similar state to Glasgow, with
an established 'test' service, but not yet a production capable one. The
University is making progress on replacing the main backbone network, so
it is hoped that this will change within a year. Oxford has both VAC and
OpenStack VM systems, the VAC estate has recently been expanded and
brought up to modern standards (and with working accounting), and the
OpenStack has been upgraded to new hardware and latest versions of
software, and is currently being tested by Peter Love for ATLAS; when
that's known good the CPU resources allocated to it can be increased.
It was also noted that the new ARC/Condor system seems to get less
'random' VO submission than the old system, making it strongly dependent
on ATLAS work to fill it; this could/should be investigated via the APEL
stats, but at the moment it's not clear whether this is a result of job
brokering, or simply that the older cluster's CE is still present in some
VOs' static configurations while the newer one is not.
Liverpool: IPv6 is still waiting on the University, who have promised an
IPv6 allocation 'this year'; they have recently been re-prodded. Steve has
been working on an attempt to improve the handling of multicore jobs on the
condor system, and has blogged about progress so far:
http://northgrid-tech.blogspot.co.uk/2015/02/replacing-condor-defrag-daemon.html
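For context, the stock condor_defrag daemon that the blog post describes
replacing is driven by a handful of configuration knobs along these lines
(the values shown are illustrative, not Liverpool's settings):

    # enable the defrag daemon and tune its draining behaviour
    DAEMON_LIST = $(DAEMON_LIST) DEFRAG
    DEFRAG_INTERVAL = 600
    DEFRAG_MAX_CONCURRENT_DRAINING = 4
    DEFRAG_DRAINING_MACHINES_PER_HOUR = 2.0
    # only consider multicore-capable partitionable slots for draining
    DEFRAG_REQUIREMENTS = PartitionableSlot && TotalCpus >= 8
    # a machine counts as 'whole' again once it can host an 8-core job
    DEFRAG_WHOLE_MACHINE_EXPR = Cpus >= 8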
There is interest in VM/cloud technologies, but nothing actually deployed
yet. Rob reported that they've been talking to Andrew McNab about VAC.
Cambridge: Little to report, just chugging along. About two thirds of
resources are multicore enabled; there's no IPv6 deployment yet, just
waiting on time availability (there is already a notional allocation from
the university, but it's not provisioned to the cluster). There was some
discussion about how incredibly easy ('falling off a log' easy) it is to
IPv6-enable perfSonar boxes.
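For the record, on an EL-based perfSonar host the dual-stacking part
really is just a few lines of static network configuration (the addresses
below are documentation-prefix placeholders):

    # /etc/sysconfig/network - enable IPv6 globally (EL6)
    NETWORKING_IPV6=yes

    # /etc/sysconfig/network-scripts/ifcfg-eth0 - additions
    IPV6INIT=yes
    IPV6ADDR=2001:db8:100::10/64
    IPV6_DEFAULTGW=2001:db8:100::1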
Lancaster: IPv6 waiting on the University networking team who have been
losing staff and so are very busy, but Matt will chase them up again.
Work has been going on re-arranging the machine room to remove old kit,
and as part of this the site is currently down to a single CE, but with
plans to (re)add more for some redundancy. On VM deployment, some of the
old kit is likely to be redeployed as VAC factory nodes. Lancaster have
new kit expected to arrive next Tuesday from Viglen. Jeremy noted that
several people seem to have been having delays getting kit from Viglen.
Gareth Roy suggested that the problem may actually be upstream - Glasgow
have had delays with another vendor having trouble sourcing kit from
SuperMicro.
Sussex: No immediate plans for IPv6; this will require a move of the grid
kit to behind the new site firewall, which isn't likely to happen until
the summer. The batch system is still Univa Grid Engine, and this has been
the source of the historical problems with APEL accounting, though there
is hope on the horizon. The grid storage is problematic since the general
storage is going to be moving to Lustre 2.x, which StoRM currently cannot
run on. It is uncertain whether the grid storage will be kept on a legacy
Lustre 1.8 install, or whether it will be possible to deploy a patch to
Lustre 2.x that will allow it to support StoRM on a single install.
RHUL: Govind was having problems with his Vidyo audio, but reported via
the chat that their current priority is to move to their recently acquired
10Gbit link, and to logically relocate their grid storage outside the site
firewall. This involves the University networking team, and work on IPv6
is not expected to begin until after this is completed. Work is currently
underway to evaluate both Condor and Son of Gridengine as possible future
batch systems, and the site has no current plans for cloud or VM services.
QMUL: Dan was not able to be present at the meeting, and had indicated in
advance an intention to submit a written report.
Chat log
===========
Daniela Bauer: (24/02/2015 11:01:14)
https://cms-site-readiness.web.cern.ch/cms-site-readiness/SiteReadiness/HTML/SiteReadinessReport.html#T2_UK_London_Brunel
Maybe I should add that link as a default to the agenda.
Jeremy Coles: (11:01 AM)
I can do that.
Tom Whyntie: (11:03 AM)
Great news, congrats.
Andrew McNab: (11:04 AM)
Relief all round, Tom!
Federico Melaccio: (11:07 AM)
we had downtime starting from Friday to drain the farm before the dcache upgrade
Jeremy Coles: (11:08 AM)
Thanks for confirming Federico.
ATLAS update: https://indico.cern.ch/event/376287/contribution/0/material/slides/0.pdf
Matt Doidge: (11:14 AM)
Thanks for clearing that up Elena
Jeremy Coles: (11:24 AM)
DIRAC proposal: https://indico.cern.ch/event/376287/contribution/0/material/slides/1.pdf
https://www.gridpp.ac.uk/wiki/Operations_Bulletin_Latest
wahid: (11:42 AM)
https://twiki.cern.ch/twiki/bin/view/LCG/MiddlewarePackageReporter#Installation
raul: (11:42 AM)
I've installed it
But they don't have any monitoring/documentation/anything to see what's going on
wahid: (11:45 AM)
there's nothing wrong with 1.8.9 I think - we have it at Edinburgh.. don't think all sites need be forced to move to it though - depends what 'baseline' means
Matt Raso-Barnett: (11:46 AM)
there has been a bit of movement on our apel reporting tickets, perhaps due to the escalation. Hoping for a patched version to test soon
raul: (11:48 AM)
sorry! I've got to leave. Local meeting/lunch with hardware vendor.
Ewan Mac Mahon: (12:02 PM)
Mostly successful, except doesn't actually run?
other than that, fine though?
Alessandra Forti: (12:07 PM)
depends what baseline is. The WLCG one is an attempt to have sites all at the same level.
of software
Matt Doidge: (12:08 PM)
Is it squid 3.5 that you chaps are running?
(at Glasgow)
David Crooks: (12:09 PM)
No, 2.7
Matt Doidge: (12:10 PM)
On SL5 or 6?
Steve Jones: (12:10 PM)
Just popping out; back in 5 mins.
David Crooks: (12:10 PM)
SL6
Matt Doidge: (12:11 PM)
Interesting - can I ask where you got the rpms please? Our new squid is running 3.1, which is kinda the not-advised version.
David Crooks: (12:12 PM)
We're using the cern-frontier repo
Steve Jones: (12:12 PM)
Back Now!
Matt Doidge: (12:13 PM)
ah, is that frontier flavoured squid rather than plain squid? (sorry for all the questions)
Govind: (12:18 PM)
sorry i lost sound.. trying to fix it..
John Bland: (12:21 PM)
on ipv6 we're a little stuck because an old bit of network kit is between us and the new ipv6/10g uni network
Steve Jones: (12:21 PM)
http://northgrid-tech.blogspot.co.uk/2015/02/replacing-condor-defrag-daemon.html
Govind: (12:23 PM)
My headphone looks OK.. but vidyo has lost sound.. any suggestion..
I will try to give short update here..
Current priority to switch to 10gb link and then move storage nodes outside firewall..
Network guy will deal with IpV6 after moving to 10GB link..
Cloud- not planned at the moment..
Batch system - I am setting up HTCondor and SoG and then evaluate
Thats all for now..
John Hill: (12:29 PM)
XMA
John Bland: (12:30 PM)
Supermicro UK support is absolutely rubbish. Shame their kit's so attractive.
Gareth Douglas Roy: (12:30 PM)
particularly if they are the 36 bay chassis
David Crooks: (12:31 PM)
Matt: Sorry, I missed your comment, yes that's frontier-squid
Matt Doidge: (12:31 PM)
Thanks!
Brian Davies @RAL-LCG2: (12:35 PM)
apparently US dock action finished
www.usatoday.com/story/news/2015/02/20/west-coast-ports-dispute-union-labor-secretary-tom-perez/23744299/
Federico Melaccio: (12:35 PM)
thanks