TB-SUPPORT Archives - TB-SUPPORT@JISCMAIL.AC.UK - February 2015

Subject: Re: Ops @ 11am
From: Ewan MacMahon <[log in to unmask]>
Reply-To: Testbed Support for GridPP member institutes <[log in to unmask]>
Date: Tue, 24 Feb 2015 17:15:40 +0000
Content-Type: multipart/mixed
Parts/Attachments: text/plain (1 lines), ops-minutes-20150224.txt (1 lines)

> -----Original Message-----
> From: Testbed Support for GridPP member institutes [mailto:TB-
> [log in to unmask]] On Behalf Of Jeremy Coles
> Sent: 24 February 2015 09:43
> To: [log in to unmask]
> Subject: Ops @ 11am
>
> Dear All,
>
> The agenda for today's ops meeting can be found at
> http://indico.cern.ch/event/376287/.



Hi all,



Draft minutes of this morning's meeting are attached; comments and corrections are welcome as always. Can I particularly encourage everyone to have a look at their site's section in the round table - I think it's basically good, but some bits did go past fairly rapidly.



Ewan





GridPP Weekly Operations meeting 2015 02 24
===========================================
--- https://indico.cern.ch/event/376287/ ---

Experiments
===========

LHCb
----
Andrew McNab reported that LHCb have moved from using an LFC file catalogue to a Dirac File Catalogue (DFC). This seems to have happened pretty smoothly, after a great deal of preparation. LHCb have reduced the number of jobs they're running to help, so there are currently only some user jobs running, but essentially no Monte Carlo.

CMS
---
Nothing much to report; an availability monitoring problem at RALPPD has been fixed and the history re-done. A downtime failed to end correctly in the SAM dashboard, which led the CMS site readiness tools to think the site was still in downtime when it wasn't. Jeremy Coles queried the fact that the site still appears to be in downtime now, and it was explained that this is a new, and real, downtime for a dCache upgrade. Federico later clarified that the current downtime was started on Friday, so as to allow the cluster to drain of jobs over the weekend. Daniela noted that Imperial have received a CMS ticket about glExec test failures on one of their ARC CEs - test jobs seem to be submitted and run fine, but the glExec-specific component complains. Chris Brew noted that both RAL PPD and the Tier 1 have got tickets for a possibly similar problem, but from ATLAS.

ATLAS
-----
Elena has attached a written ATLAS status report to the meeting agenda: https://indico.cern.ch/event/376287/contribution/0/material/slides/0.pdf
She went on to discuss in particular somewhat conflicting reports on ATLAS' ability to effectively utilise multicore resources: at last week's WLCG Operations coordination meeting (https://indico.cern.ch/event/372875/) it was reported that ATLAS are using almost all resources available to them, both single and multicore; however, at the ADC weekly meeting (http://indico.cern.ch/e/366304) it was reported that it is still proving difficult to fill all the capacity. There is an issue with multicore job lengths - on dedicated resources, submitting longer jobs (e.g. 8 hours) gives better utilisation, but ATLAS has access to opportunistic resources that prefer shorter jobs (e.g. one hour). Work is being done to allow job lengths to be selected/adjusted after pilots land on resources. ATLAS have some issues with FTS3 for which they are currently using a work-around, but are collaborating with the FTS developers to try to find a clean solution. It was also noted that Matt Doidge had reported a problem with software release validation jobs at Lancaster; this has also been seen at other sites, and is currently under investigation.

Other VOs
---------
There is an issue with the LIGO VO; they've previously used OSG resources and have had a VOMS infrastructure, but it no longer exists. Discussions are ongoing with OSG to enable the VO without duplicating effort. LSST is being rolled out on some NorthGrid resources as a new VO, but their software is being distributed via the NorthGrid cvmfs repository as an interim solution until they have a cvmfs repository of their own set up. Tom Whyntie reported that the UCLan Galaxy Dynamics group have successfully used CernVM systems to compile their code. The likely next step is to upload the result to cvmfs, but it's not quite ready. After that, they'll probably need their own VO (and corresponding cvmfs area), but there is some work to do first to ensure that we understand how to create a new VO - some of our tools and documentation have bitrotted a bit. The proteomics group at QMUL is in a similar state. Ewan queried whether these groups have really outgrown the regional incubator VOs, and Tom explained that while they arguably haven't, in these specific cases they're being very useful to help shake out some of our VO creation procedures, so we're advancing them through the process arguably a little faster than we otherwise might.
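
For illustration of the interim arrangement above - a VO's software served from an existing cvmfs repository while its own is set up - a minimal sketch of the client-side configuration a worker node would carry; the repository name and proxy below are placeholders, not the actual NorthGrid values:

    # /etc/cvmfs/default.local (sketch; repository and proxy names are illustrative)
    CVMFS_REPOSITORIES=northgrid.example.org
    CVMFS_HTTP_PROXY="http://squid.example.org:3128"
    CVMFS_QUOTA_LIMIT=20000        # local cache limit, in MB
    # After editing, reload and verify that the repository mounts:
    #   cvmfs_config reload && cvmfs_config probe northgrid.example.org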
GridPP Dirac service
--------------------
Janusz and Duncan are both on holiday at the moment. Daniela reported that things are slightly behind schedule at present due to a problem with support for importing user lists automatically from VOMS servers in the multi-VO case. The current target is to have the Dirac service in a production state by April, which was described as 'not impossible'. Andrew McNab noted that the GridPP Dirac appears not to be sending pilot jobs - tests are running on the VAC-type sites that pull in work on their own, but no pilots appear to be being sent to CREAM/ARC CEs. Daniela said that she'd look into it; there was a suggestion that, given people being on holiday, it may simply be an expired proxy somewhere. There was some discussion about whether this would block anyone from running work, or just restrict them to VAC-type sites. Andrew explained that a normal GridPP Dirac VAC VM will only pull work for the GridPP VO, not any VO on the GridPP Dirac server, and that new VM types need to be created for each VO (even though they'd be almost completely identical). Ewan asked whether it was possible to set that VO filter to 'any' to enable a VAC resource to pick up work for any VO supported on the Dirac server; Andrew thought not, but said he'd check.
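
For illustration of the 'expired proxy' suggestion above: a quick check of a stored VOMS proxy with the standard client tools - the proxy path is a placeholder, not the real location on the GridPP Dirac server:

    # Check the lifetime of a stored proxy (path is illustrative)
    voms-proxy-info -file /opt/dirac/proxies/gridpp.pem -timeleft      # seconds left on the proxy; 0 means expired
    voms-proxy-info -file /opt/dirac/proxies/gridpp.pem -actimeleft    # seconds left on the VOMS attributes, which can expire first
    voms-proxy-info -file /opt/dirac/proxies/gridpp.pem -all           # full details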
Bulletin updates
================
In order to leave time for a site round table, there was a rapid review of the Operations Bulletin updates, with only a few items attracting any discussion. For everything else, see the bulletin itself and the links therein: https://www.gridpp.ac.uk/w/index.php?title=Operations_Bulletin_Latest&oldid=7611

HTTP task force - experiments have two weeks to decide whether they are actually interested in HTTP before it is decided whether to push ahead with the task force proper. There was a discussion about testing the Middleware Package Reporter tool and whether its installation was documented anywhere; it was described as being very simple, but not well documented. The discussion was slightly complicated by confusion with the Machine/Job Features work, which is in a similar, but possibly even worse, state. It was noted that the WLCG baseline version of DPM now seems to be 1.8.9, which was thought not to be a good idea, and should not be required.

Tickets
-------
We're now down to only 15 tickets in the UK, which is good. Two tickets were discussed:

- Sussex's perfSONAR ticket: https://ggus.eu/?mode=ticket_info&ticket_id=110389
  Matt-RB reports that he has taken the ticket out of the 'on hold' status, and believes that the Sussex perfSONAR is now fully working.

- SNO+ file copying at the Tier 1: https://ggus.eu/?mode=ticket_info&ticket_id=109694
  There is a new version of the GFAL2 client tools which has been tested and shown to solve the problem; Brian noted that it is currently in the epel-testing repository.
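
For illustration of trying the fixed client tools mentioned in the SNO+ ticket while they are still in epel-testing - the package list and test URL are indicative only, not taken from the ticket:

    # Pull the updated GFAL2 client tools from epel-testing for evaluation
    yum --enablerepo=epel-testing update gfal2 gfal2-util
    # Quick functional check against the affected endpoint (URL is illustrative)
    gfal-copy file:///tmp/testfile srm://se.example.ac.uk/dpm/example.ac.uk/home/snoplus/testfile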
Site round-table
----------------
Manchester: Alessandra has been very busy with getting LSST going. The full cluster is now multicore-enabled. It is still running Torque, but there are plans to begin an HTCondor deployment very soon. Andrew McNab said that the VM system is running solidly, and Robert indicated an intention to renew efforts towards IPv6, but that this requires conversations with NetNorthWest.

ECDF: No major issues to report; for the future, there is a shared cluster hardware refresh coming up in the next few months, and there is a possibility that this may offer some cloud interface services.

Tier 1: Andrew Lahiff has been looking at using cloud resources on demand to provide worker nodes to the batch system, and mentioned a plan to allow worker nodes to run SL7 on the hardware, but to run jobs in SL6 containers. Brian reported some recent moves on storage - it should soon be possible to allow more frequent namespace dumps of Castor, and an interesting result of the recent ATLAS deletion campaign was to uncover a number of files that the VO thought RAL had, but that it actually did not.

RALPPD: Currently in downtime to upgrade to dCache 2.10, which appears to have gone mostly OK, but there is a problem preventing the SRM service from starting. RALPPD is investigating the possibility of cloud services, including potentially running RALPPD worker nodes on the Tier 1 cloud infrastructure.

Imperial: Effort going mostly into Dirac. Jeremy noted that Imperial is well ahead on things like cloud, IPv6, etc.

Glasgow: Gareth reported that new hardware is currently being commissioned and new storage should be online in a couple of weeks or so; new CPU is already online. Approximately 40% of the batch system is now running as Condor/ARC. The perfSONAR is all up to date and fine, though Jeremy did point out that there was a warning showing in the dashboard: https://perfsonar-itb.grid.iu.edu/WLCGperfSONAR/check_mk/index.py?start_url=%2FWLCGperfSONAR%2Fcheck_mk%2Fview.py%3Fview_name%3Dhosts%26host%3Dgla
VM/cloud activities are at the initial stages; there is a very new OpenStack instance on three machines that is currently focussed on local users but is hoped to expand to GridPP uses too. Glasgow has had basic IPv6 support for a long time, but this is provisioned on a non-production-capable network, so it is only suitable for dedicated test systems - rolling out IPv6 support on production nodes would actually degrade service.

Sheffield: There is a new Condor/ARC system currently being tested, which is planned to be in production by the end of next week. On IPv6, addresses have been allocated and are being used on the perfSONAR boxes. There are no plans to enable IPv6 on other services due to a lack of effort, but the site expects to be able to follow along once IPv6 is out of the experimental phase. There is an ongoing firewall/port-opening problem with the Sheffield perfSONARs; Elena is planning to email Duncan for help. Elena is planning a DPM upgrade to 1.8.9.

Oxford: A majority of the cluster is running under Condor/ARC, but with a legacy CREAM/Torque system essentially for the benefit of ALICE, who are unable to submit jobs to ARC CEs. Brian mentioned that Catalin may have a solution for ALICE being able to submit to ARC CEs at the Tier 1 that would be worth investigating. IPv6 is in a similar state to Glasgow, with an established 'test' service, but not yet a production-capable one. The University is making progress on replacing the main backbone network, so it is hoped that this will change within a year. Oxford has both VAC and OpenStack VM systems; the VAC estate has recently been expanded and brought up to modern standards (and with working accounting), and the OpenStack has been upgraded to new hardware and the latest versions of the software, and is currently being tested by Peter Love for ATLAS; when that is known good, the CPU resources allocated to it can be increased. It was also noted that the new ARC/Condor system seems to get less 'random' VO submission than the old system, making it strongly dependent on ATLAS work to fill it; this could/should be investigated via the APEL stats, but at the moment it is not clear whether this is a result of job brokering, or simply that the older cluster's CE is still in some VOs' static configurations while the newer one is not.

Liverpool: IPv6 is still waiting on the University, who have promised an IPv6 allocation 'this year'; they have recently been re-prodded. Steve has been working on attempts to improve the handling of multicore jobs on the Condor system, and has blogged about progress so far (see the sketch after this round table): http://northgrid-tech.blogspot.co.uk/2015/02/replacing-condor-defrag-daemon.html
There is interest in VM/cloud technologies, but nothing actually deployed yet. Rob reported that they've been talking to Andrew McNab about VAC.

Cambridge: Little to report, just chugging along. About two thirds of resources are multicore-enabled; there is no IPv6 deployment yet, just waiting on time availability (there is already a notional allocation from the University, but it is not provisioned to the cluster). There was some discussion about how incredibly falling-off-a-log easy it is to IPv6-enable perfSONAR boxes.

Lancaster: IPv6 is waiting on the University networking team, who have been losing staff and so are very busy, but Matt will chase them up again. Work has been going on re-arranging the machine room to remove old kit, and as part of this the site is currently down to a single CE, but with plans to (re)add more for some redundancy. On VM deployment, some of the old kit is likely to be redeployed as VAC factory nodes. Lancaster have new kit expected to arrive next Tuesday from Viglen. Jeremy noted that several people seem to have been having delays getting kit from Viglen; Gareth Roy suggested that the problem may actually be upstream - Glasgow have had delays with another vendor having trouble sourcing kit from SuperMicro.

Sussex: No immediate plans for IPv6; this will require a move of the grid kit to behind the new site firewall, which is not likely to happen until the summer. The batch system is still Univa Grid Engine, and this has been the source of the historical problems with APEL accounting, though there is hope on the horizon. The grid storage is problematic, since the general storage is going to be moving to Lustre 2.x, which StoRM currently cannot run on. It is uncertain whether the grid storage will be kept on a legacy Lustre 1.8 install, or whether it will be possible to deploy a patch to Lustre 2.x that will allow it to support StoRM on a single install.

RHUL: Govind was having problems with his Vidyo audio, but reported via the chat that their current priority is to move to their recently acquired 10Gbit link, and to logically relocate their grid storage outside the site firewall. This involves the University networking team, and work on IPv6 is not expected to begin until after this is completed. Work is currently under way to evaluate both Condor and Son of Grid Engine as possible future batch systems, and the site has no current plans for cloud or VM services.

QMUL: Dan was not able to be present at the meeting, and had indicated in advance an intention to submit a written report.
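
The Liverpool item above concerns draining nodes to make room for multicore jobs; Steve's blog describes replacing the stock HTCondor defrag daemon, and the sketch below shows the standard defrag knobs that this kind of work starts from - the values are illustrative, not Liverpool's configuration:

    # condor_config on the central manager (sketch; values are illustrative)
    DAEMON_LIST = $(DAEMON_LIST) DEFRAG
    DEFRAG_INTERVAL = 600                      # how often to consider draining, in seconds
    DEFRAG_DRAINING_MACHINES_PER_HOUR = 1.0    # rate at which nodes are put into draining
    DEFRAG_MAX_CONCURRENT_DRAINING = 4         # cap on nodes draining at once
    DEFRAG_MAX_WHOLE_MACHINES = 8              # stop draining once this many whole nodes are free
    DEFRAG_WHOLE_MACHINE_EXPR = Cpus == TotalCpus
    DEFRAG_REQUIREMENTS = PartitionableSlot && Offline =!= True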
Chat log
========
Daniela Bauer: (24/02/2015 11:01:14) https://cms-site-readiness.web.cern.ch/cms-site-readiness/SiteReadiness/HTML/SiteReadinessReport.html#T2_UK_London_Brunel Maybe I should add that link as a default to the agenda.
Jeremy Coles: (11:01 AM) I can do that.
Tom Whyntie: (11:03 AM) Great news, congrats.
Andrew McNab: (11:04 AM) Relief all round, Tom!
Federico Melaccio: (11:07 AM) we had downtime starting from Friday to drain the farm before the dcache upgrade
Jeremy Coles: (11:08 AM) Thanks for confirming Federico. ATLAS update: https://indico.cern.ch/event/376287/contribution/0/material/slides/0.pdf
Matt Doidge: (11:14 AM) Thanks for clearing that up Elena
Jeremy Coles: (11:24 AM) DIRAC proposal: https://indico.cern.ch/event/376287/contribution/0/material/slides/1.pdf https://www.gridpp.ac.uk/wiki/Operations_Bulletin_Latest
wahid: (11:42 AM) https://twiki.cern.ch/twiki/bin/view/LCG/MiddlewarePackageReporter#Installation
raul: (11:42 AM) I've installed it But they don't have any monitoring/documentation/anything to see what's going on
wahid: (11:45 AM) there's nothing wrong with 1.8.9 I think - we have at edinburgh.. don't think all sites need be forced to move to it though - depends what 'baseline' means
Matt Raso-Barnett: (11:46 AM) there has been a bit of movement on our apel reporting tickets, perhaps due to the escalation. Hoping for a patched version to test soon
raul: (11:48 AM) sorry! I've got to leave. Local meeting/lunch with hardware vendor.
Ewan Mac Mahon: (12:02 PM) Mostly successful, except doesn't actually run? other than that, fine though?
Alessandra Forti: (12:07 PM) depends what baseline is. The WLCG one is an attempt to have sites all at the same level of software
Matt Doidge: (12:08 PM) Is it squid 3.5 that you chaps are running? (at Glasgow)
David Crooks: (12:09 PM) No, 2.7
Matt Doidge: (12:10 PM) On SL5 or 6?
Steve Jones: (12:10 PM) Just popping out; back in 5 mins.
David Crooks: (12:10 PM) SL6
Matt Doidge: (12:11 PM) Interesting - can I ask where you got the rpms please? Our new squid is running 3.1, which is kinda the not-advised version.
David Crooks: (12:12 PM) We're using the cern-frontier repo
Steve Jones: (12:12 PM) Back Now!
Matt Doidge: (12:13 PM) ah, is that frontier-flavoured squid rather than plain squid? (sorry for all the questions)
Govind: (12:18 PM) sorry i lost sound.. trying to fix it..
John Bland: (12:21 PM) on ipv6 we're a little stuck because an old bit of network kit is between us and the new ipv6/10g uni network
Steve Jones: (12:21 PM) http://northgrid-tech.blogspot.co.uk/2015/02/replacing-condor-defrag-daemon.html
Govind: (12:23 PM) My headphone looks OK.. but vidyo has lost sound.. any suggestion.. I will try to give short update here.. Current priority to switch to 10gb link and then move storage nodes outside firewall.. Network guy will deal with IPv6 after moving to 10GB link.. Cloud - not planned at the moment.. Batch system - I am setting up HTCondor and SOG and then evaluate. That's all for now..
John Hill: (12:29 PM) XMA
John Bland: (12:30 PM) supermicro UK support is absolutely rubbish. Shame their kit's so attractive.
Gareth Douglas Roy: (12:30 PM) particularly if they are the 36 bay chassis
David Crooks: (12:31 PM) Matt: Sorry, I missed your comment, yes that's frontier-squid
Matt Doidge: (12:31 PM) Thanks!
Brian Davies @RAL-LCG2: (12:35 PM) apparently us dock action finished ww.usatoday.com/story/news/2015/02/20/west-coast-ports-dispute-union-labor-secretary-tom-perez/23744299/
Federico Melaccio: (12:35 PM) thanks
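
The squid exchange in the chat refers to the frontier-squid packaging; as a rough sketch, and assuming the cern-frontier yum repository mentioned by David is already configured on an SL6 host, deployment looks roughly like this (the customisation file location is from the frontier-squid documentation, not from the meeting):

    # Install and start frontier-squid on SL6 (sketch; assumes the cern-frontier repo is configured)
    yum install frontier-squid
    service frontier-squid start
    chkconfig frontier-squid on
    # Site-specific ACLs and cache sizes are set via /etc/squid/customize.sh rather than editing squid.conf directly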
