Hello!
Please find attached the minutes (which I was unable to upload to
Indico). Thanks to Pete and David for collating the attendees
list. I hope I didn't miss anything important, or paraphrase anyone too
imaginatively - please let me know if I did.
Actions were for sites to consider supporting vo.moedal.org [1][2], and
on Sam to liaise with Oxford and Atlas to look at the diskless site
testing again.
A quasi-action is on anyone wishing for a CERN External Account to
contact Jeremy today if you haven't already done so.
Cheers all,
Matt
[1] https://operations-portal.egi.eu/vo/view/voname/vo.moedal.org
[2] moedal.org
On 20/09/16 10:44, Jeremy Coles wrote:
> Dear All,
>
> A reminder that our ops meeting is at 11am. The agenda is
> at https://indico.cern.ch/event/570325/. There are several updates in
> the bulletin that we will
> review https://www.gridpp.ac.uk/wiki/Operations_Bulletin_Latest.
>
> Discussion this week will revolve around the GDB updates from last week
> (and the pre-GDB) and a continuation of our lightweight sites theme from
> GridPP37.
>
> Regards,
> Jeremy
>
>
Chair: Jeremy C
Minutes: Matt D
Attending:
Alessandra Forti, Andrew Washbrook, Andrew Lahiff, Brian Davies, Chris Brew, Dan Traynor, Daniela Bauer, David Crooks, John Bland, John Hill, Marcus Ebert, Oliver Smith, Winnie Lacesso, Pete Gronbech, Raul Lopes, Robert Frank, Sam Skipsey, Steve Jones, Tom Whyntie, Vip Davda, Matt Williams, Ian Loader.
Apologies: Andrew McNab, Ian Neilson, Raja, Duncan.
*Experiment problems/issues
LHCB - no one could attend
CMS - Daniela - CERN has intermittent connectivity problems affecting the xrootd redirectors. Not much can be done. Brunel has a CMS ticket that's being actively debugged.
ATLAS - No one had any problems to report.
Alessandra confirms this.
Other VO updates.
No other VO updates
New VO news from Tom
moedal - monopole-searching VO, would like to use the grid for simulation. Technically an LHC experiment, but would like support as a small VO. Linked to cernatschool work.
Infrastructure (voms, cvmfs) setup already.
Would anyone like to support them? moedal.org is their homepage, piggybacks on lhcb gauss infrastructure.
Jeremy - are we targeting specific sites?
Tom - QM, cernatschool supporters (Glasgow, Liverpool, Birmingham), will be using ganga for job submission stuff.
No current UK sites support it, possibly no other EGI or OSG sites either; most simulation has been run "locally" so far.
Website:
https://operations-portal.egi.eu/vo/view/voname/vo.moedal.org
Any interest among sites? Create an action to return to this - consider supporting the VO. Not a heavy load expected.
Chris B - might be happy to do it at RALPP, possibly include it in a batch of VO additions along with cernatschool to reduce workload ("not difficult, just intricate")
No GridPP DIRAC "news", but...
http://bugzilla.nordugrid.org/show_bug.cgi?id=3600
Observations - Daniela: Hit by a bug in the ARC CE; new versions don't report max CPU/wall time, which DIRAC uses for queue matching. Tried some hacking which didn't work; another go is planned this afternoon. Currently no way of doing queue matching.
Raul comments that ARC never did it properly; Daniela replies that the problem comes from the new release reporting "zero".
Andrew L - this has been a long term problem with Condor, which didn't have the concept of it.
Brunel and ECDF have forced a correction in ARC.
Andrew L - It appears batch system dependent.
Daniela - It (wall/cpu time) can be set, but not sure what that will do. SGE seems to be broken too.
Raul will double check and rehack if needed, asks Daniela to poke him if it still doesn't work.
Steve - https://www.gridpp.ac.uk/wiki/Example_Build_of_an_ARC/Condor_Cluster#Patch_for_Extra_BDII_Fields
-See chat for a bit more on this.
Jeremy - attached a slide (from the MB) to the agenda. It shows GGUS ticket statistics, for information and interest only.
To the Bulletin!
*Meetings and Updates
International Symposium on Grids and Clouds (ISGC) 2017 call for papers closes at the end of October.
http://event.twgrid.org/isgc2017
August WLCG T2 Availability:
ALICE. All okay
http://wlcg-sam.cern.ch/reports/2016/201608/wlcg/WLCG_All_Sites_ALICE_Aug2016.pdf
ATLAS. Glasgow: 86%:97% | Oxford: 82%:82%
http://wlcg-sam.cern.ch/reports/2016/201608/wlcg/WLCG_All_Sites_ATLAS_Aug2016.pdf
Glasgow availability was down due to a power cut in their machine room at the beginning of the month. It took a few days to recover from it.
Oxford was down for a few days due to an A/C failure on Friday 12th August. The cluster was shut down and restored on Monday 15th.
CMS. All okay
http://wlcg-sam.cern.ch/reports/2016/201608/wlcg/WLCG_All_Sites_CMS_Aug2016.pdf
LHCb. All okay (but note ECDF as N/A).
http://wlcg-sam.cern.ch/reports/2016/201608/wlcg/WLCG_All_Sites_LHCB_Aug2016.pdf
There was a GDB last week. Minutes will appear here.
https://twiki.cern.ch/twiki/bin/view/LCG/WLCGGDBDocs#2016
Notes from Thursday's EGI OMB.
https://indico.egi.eu/indico/event/2810/material/minutes/minutes.html
https://indico.egi.eu/indico/event/2810/
Actions:
NGIs using the GOCDB API should assess if their use is compatible with the new developments available in the test instance.
Gather information about best practices for the users who are transitioning from WMS to DIRAC.
Jeremy: Any feedback? Do we have anything to help with this?
Tom: Happy using Ganga.
Daniela: I could try and dig up my talk for the dirac workshop. it's from May, but we did do a little survey on how VOs use dirac (see chat for more)
Discuss the CSIRT proposal with sites and ROD staff.
-The meat of it is that sites will need to add the pakiti client to (a) worker node(s).
The ARGO proposal for GOCDB has an impact on site managers, and therefore NGIs should discuss it with their sites and staff.
Notes from Monday's WLCG ops meeting.
https://twiki.cern.ch/twiki/bin/view/LCG/WLCGDailyMeetingsWeek160919
-Intermittent connectivity problems mentioned again, particularly a problem for CMS.
Jeremy C will follow up on External Accounts this week.
-Has a list of 5 names; anyone else who wants to be added should contact Jeremy today.
Alastair mentions "ARC Camp!" for an interested person (TB-SUPPORT 14th Sept).
-Useful for a technical person to attend to represent the work in the UK
Andy W - might be able to do it.
Steve: Where will it be?
Andy L - Undecided; somewhere cheap, probably not in the UK.
Decommissioning of the old downtime notification system took place last week. From now on use the new system: https://operations-portal.egi.eu/downtimes/subscription
-Probably the cause of any odd messages seen last week.
You have to select the targets of your subscription, then a channel of communication (RSS, iCal or email). Don't forget to fill in your email address if you have selected the email channel!
VAPOR application v2.1 is now online. Various changes including integration of Gstat features.
https://operations-portal.egi.eu/vapor
-Overview of data, a lot of stuff previously in gstat. Worth a look if you haven't already.
APEL Tests Paused today - There is a temporary problem with the APEL Pub and Sync tests. They are not reflecting recent data received by the APEL repository.
-No comments.
*WLCG Operations Coordination
There was a WLCG Throughput call on 15th.
https://indico.cern.ch/event/562629/
-Duncan was set to make it, but wasn't in the meeting today. Jeremy couldn't make it.
The next ops meeting is on 29th. Theme suggestions welcome.
https://indico.cern.ch/event/540422/
-Please let Jeremy know if you have any suggestions.
*Tier 1
A reminder that there is a weekly Tier-1 experiment liaison meeting. Notes from the last meeting here
http://www.gridpp.ac.uk/wiki/RAL_Tier1_Experiments_Liaison_Meeting
https://www.gridpp.ac.uk/wiki/Tier1_Operations_Report_2016-09-14
The use of both OPN links giving a maximum of 20Gbit connection to CERN and other Tier1s continues to run OK with use being made of the extra bandwidth.
In the last report (a couple of weeks ago) I mentioned some intermittent periods of high packet loss within the Tier1 network. This was resolved by replacing a network transceiver.
The first 100Gbit link within our internal Tier1 network has been put in place.
There was preventative maintenance on the tape libraries last week: a general check-over of the hardware plus a firmware update. This went OK. Oracle wish to make an intervention on the libraries to improve some of the mechanics. We are scheduling this for the first week of November; it is expected to mean a day's downtime for each of the libraries.
We are in the process of moving services from the old Windows Hyper-V 2008 virtual infrastructure to one based on the 2012 version.
-No Tier 1 related issues raised.
*Storage & Data Management
Sam - preGDB and GDB happened. Will come back under discussion.
*Tier-2 Evolution
-Jeremy noted quiet since June. No open issues in JIRA.
*Accounting
-Some discussion at GDB, will come back to it later
*Documentation
GridPP Approved VOs now has a link to RPM versions of the VOMS records. They are available for now via the VOMS RPMS Yum Repository. The latest version, which is consistent with the YAIM records in the Approved VOs doc, is 1.0-1. The plan is that when VO records change, the Approved VOs doc version will be incremented, and RPMs of the changed VOs (only those) will be released carrying the same version stamp as the document. Thus a site that upgrades to "latest" will get the records compatible with the newest version of the GridPP Approved VOs document.
Note: A typical RPM's contents look like this:
[sjones@hep169]$ rpm -qlp gridpp-voms-dteam-1.0-1.noarch.rpm
/etc/grid-security/vomsdir/dteam
/etc/grid-security/vomsdir/dteam/voms.hellasgrid.gr.lsc
/etc/grid-security/vomsdir/dteam/voms2.hellasgrid.gr.lsc
/etc/vomses/dteam-voms.hellasgrid.gr
/etc/vomses/dteam-voms2.hellasgrid.gr
/root/vo_xml/dteam.xml
The vomsdir (lsc) files (which list the DNs and CA DNs of acceptable certificates) and the vomses files (which give the coordinates of VOMS servers of various VOs) are provided, as if they were created by YAIM in the normal locations. No other features of YAIM are facilitated by these RPMs. Thus they are useful for migrating from YAIM, but do not provide all of YAIM's functions, such as setting SW dirs or other ENV vars etc.
http://hep.ph.liv.ac.uk/~sjones/RPMS.voms/
https://www.gridpp.ac.uk/wiki/GridPP_approved_VOs
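As an illustration of what the packaged files contain (a sketch only - the DNs and port below are placeholders, not the real HellasGrid values; check the RPM contents for the actual records):

```
# /etc/grid-security/vomsdir/dteam/voms.hellasgrid.gr.lsc
# Line 1: DN of the VOMS server's host certificate; line 2: DN of its issuing CA.
/C=GR/O=ExampleOrg/CN=voms.hellasgrid.gr
/C=GR/O=ExampleOrg/CN=Example CA

# /etc/vomses/dteam-voms.hellasgrid.gr
# Fields: "VO nickname" "server host" "vomsd port" "server DN" "VO name"
"dteam" "voms.hellasgrid.gr" "15004" "/C=GR/O=ExampleOrg/CN=voms.hellasgrid.gr" "dteam"
```

Clients use the vomses entry to find the server and the lsc file to validate the attribute certificate it signs, which is why both must stay in step with the VO's real VOMS server details.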
-Steve will keep the document up to date, and the RPMS too. Notes this doesn't do everything that YAIM does/did.
-Steve waiting to hear from Marcus and Gareth to see how this works.
*Interoperation
The next EGI ops meeting is on 12th October.
*On Duty
Jeremy is setting up the rota.
*Rollout
A lot of SL7 work in the UK, worth looking at and collating this.
*Security
Changes to site (re-)certification procedure proposed at OMB to enable security vulnerability checks which are currently blocked due to move to Argo monitoring. [1]
IGTF & EUGridPMA (certificate issuing authorities) meeting [2]
Summaries of issues exploiting federated identity management (e.g. eduGAIN) and social IDs (e.g. Facebook) on Monday [3][4].
[1] https://indico.egi.eu/indico/event/2810/
[2] https://indico.nikhef.nl/conferenceDisplay.py?confId=500
[3] https://indico.nikhef.nl/materialDisplay.py?contribId=1&materialId=slides&confId=500
[4] https://indico.nikhef.nl/materialDisplay.py?contribId=4&materialId=slides&confId=500
-Ian is likely at the meeting alongside Dave Kelsey, so it would be good to hear back.
*Services
UK eScience CA - certificate issuance problems. Jens reported that on 15th a partial but significant database corruption occurred on the signing system for the CA. Data was restored from (offline) backups but the rebuild was not correctly configured.
-Hopefully hear back from Jens about this in the near future.
A large number of site admins and other GridPP supporters appeared to be suspended from the dteam VO last week. “During a planned upgrade operation of VOMS service, a system malfunction occurred. As a result, some users received false notification about membership expiration. We are in contact with the software development team in order to identify the cause.”
Jeremy - everyone should be unsuspended now, but check if your AUP signing comes up.
Anyone still suspended? No response.
*Tickets.
Were discussed. Steve will point Biomed to the Spacetoken documentation.
*Other Bits
Site round table will be needed soon.
*GDB update
Summary of GDB talks by Jeremy:
https://indico.cern.ch/event/570325/contributions/2306936/attachments/1338612/2015890/September-GDB-2016.pdf
Talk 1- WLCG workshop.
Talk 2- IPv6.
Atlas Canada - would like/are interested in pure IPv6, but not going to get it yet.
1st April 2017 is the earliest date for being able to provide IPv6-only compute.
Some reckon that this is too soon; the stated aim is a "reasonable fraction on IPv6 by end of LS2".
Talk 3 - Review of the Nordic Tier 1, with a view to improving efficiency. The conclusion is that consolidation loses leverage, which increases cost in other areas, as seen in other studies.
Talk 4 - Malware information sharing platforms. "Threat Intelligence".
CERN MISP - access requiring egroup.
Dave C - testing between Glasgow and RAL, with Jo, a summer student. The interesting thing is the technical aspect of the sharing platform, but the meat is in the semantics of sharing this information. Big challenge in false positives. Maximising trust is the bulk of the work.
Data PreGDB.
Brian - good summary on Jeremy's slides.
Still trying to work out how to do storage accounting in an SRM-less setup. Caveat that SRM-less tape is on the backburner.
Site perspectives sought to provide development in this area.
IPv6 wasn't mentioned in the preGDB, oddly enough.
Different VOs have different pushes on which protocols to use. GridFTP is big as it's usable at all sites (the other options being xroot and http).
One possible way of "providing" IPv6 is dual-homed xrootd proxies.
More focus on xroot and GridFTP.
Brian and Alastair's talk went down well; interesting analysis from IPNL3, studying access and creation times of files on disk servers, noting differences in patterns between VOs.
Updates from the various storage providers, including timelines and roadmaps. Worth looking at for each site.
No questions.
GDB Fast Benchmarking.
Update from each VO. The slides tell all.
LHCb - DIRAC benchmarking gives a much clearer result than the other benchmarks.
ATLAS -
Alessandra - plan to add the fast benchmark to pilots and feed results into an elasticsearch cluster. The aim is to simplify the effort of comparing things.
No update from CMS on this at the GDB. No one present knows what CMS are doing on this.
Jeremy will circulate anything that comes out of these talks.
Discussion:
Continuing discussion from gridpp37 about lightweight sites.
https://indico.cern.ch/event/556609/sessions/204093/attachments/1330334/1998927/Lightweight_sites_-_notes.pdf
Starting halfway through the storage section.
Sam - xroot support, globus connect.
Jeremy - what can we do to aid this?
Sam - ARC caching testing at Durham; Sam has some stats on cache growth with ATLAS work. Aiming to work with Rucio, but quiet on that front. Also, work on the network-only site has been slowed, but partly reported on. HC infrastructure work was a blocker, but that should be done with as of yesterday. Progress, just not as much as we'd like.
Is it workable to run sites without storage?
UCL works, but is in London. The network topology there is very different, not applicable to the rest of the UK.
Potentially not scalable outside London with JANET in its current state (and JANET is not just used by us). Alastair is having a look at this, getting a feel for network use for a certain cluster size. The loss of Ewan slowed this work.
Brian - the preGDB on this was a theoretical analysis of what would happen if we lost the smaller sites - wrt loss of job slots, increased network load.
Can a diskless site cope with the connections out?
Jeremy - is there a timeline for some conclusions on this?
Sam - wanted to be at that point now,
Pete - anything at Oxford that we can do to help?
Sam - possibly nothing at Oxford; the work was on the ATLAS infrastructure. Should be there.
Pete - we're staffed as well as we're used to.
Sam - what's useful is to know what the monitoring is like to understand what's going on as well as possible. Looking at Network and Job monitoring.
Sam will send email to Kashif and Alastair about it - putting it into the actions.
Once you offload site services, such as storage, how do you monitor a site with dependencies at the other site? A wider discussion we need to have.
Potential issue talked about in the storage evolution document. Who do we ticket? See this with CMS now.
Would ticket reassignment be a job of the site?
Low efficiency might be due to job types.
Global redirector and DIRAC incompatibility: DIRAC can't match a job to a site efficiently if the data is "everywhere".
Sam - A no-win situation here. A trade-off we cannot avoid.
How would this work with Dirac?
Sam - LHCB manage it already, so we should check how they do it.
Funding policy for these new types of storage?
Brian - Assume sites move to being T2C? Degrading cache as existing storage ages and isn't replaced? Continued funding for continued storage provision?
Jeremy - A list of high level questions, could these be written down?
What do experiments themselves want?
Sam - this is better understood after the preGDB.
xrootd federation pilot
Sam - we can have higher levels of xroot redirectors: a UK-level one could redirect to a subset of exposed UK xrootd endpoints, and act as the top-level interface to that storage, so we present only "one" endpoint. If you go all-in with xroot you can do cache layers, and reliability via redirection. Need to do a pilot of this first.
May or may not interact with experiment plans for GridFTP; there is a plugin, but it might not work very well.
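A rough sketch of the redirector-hierarchy idea, using xrootd/cmsd cluster configuration (hostnames are invented and the setup is illustrative only - a real pilot would need cmsd running alongside xrootd at each layer, plus auth and export policy):

```
# Hypothetical top-level UK redirector (uk-xrootd.example.ac.uk).
# It holds no data itself; it redirects clients to whichever
# subscribed site endpoint reports the requested file.
all.role manager
all.manager uk-xrootd.example.ac.uk:1213
all.export /

# On each participating site endpoint (e.g. se01.site.example.ac.uk),
# the corresponding sketch would be:
#   all.role server
#   all.manager uk-xrootd.example.ac.uk:1213
#   all.export /
```

The appeal is that the hierarchy composes: site redirectors can subscribe to the UK one just as data servers subscribe to a site redirector, which is what makes the "one endpoint" and redirection-based reliability ideas possible.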
Come back to the rest of this another day.
Actions - minutes.
Make sure you upload the minutes! Jeremy will continue setting up egroups.
AOB?
None.
Reiterated that we will need to do a Tier 2 review at some point soon.
Chat Window
Alessandra Forti: (11:08 AM)
For ATLAS there isn't much to report.
same tickets as last week
Tom Whyntie: (11:08 AM)
moedal.org
https://operations-portal.egi.eu/vo/view/voname/vo.moedal.org
Daniela Bauer: (11:16 AM)
http://bugzilla.nordugrid.org/show_bug.cgi?id=3600
raul: (11:16 AM)
ArcCEs have always reported that incorrectly.
I've forced a correction for Brunel
Andrew John Washbrook: (11:17 AM)
us too (ECDF)
raul: (11:17 AM)
I'll check and hack it. Daniela could email me tomorrow if I don't
Steve Late: (11:19 AM)
https://www.gridpp.ac.uk/wiki/Example_Build_of_an_ARC/Condor_Cluster#Patch_for_Extra_BDII_Fields
Daniela Bauer: (11:20 AM)
@Raul Sure, will do. But it seems endemic, it's definitely not just you.
Steve Late: (11:20 AM)
Patch for Extra BDII Fields
To set the GlueCEPolicyMaxCPUTime and GlueCEPolicyMaxWallClockTime bdii publishing values, you need to change the lines involving GlueCEPolicyMaxCPUTime and GlueCEPolicyMaxWallClockTime in /usr/share/arc/glue-generator.pl. For example:
GlueCEPolicyMaxCPUTime: 4320
GlueCEPolicyMaxWallClockTime: 4320
I was only late once; but it never forgets for some reason!
Daniela Bauer: (11:23 AM)
I could try and dig up my talk for the dirac workshop
it's from May, but we did do a little survey on how VOs use dirac
raul: (11:23 AM)
hacking glue-generator.pl has always been my option. However, I've upgraded all CEs recently and forgot about it.
Andrew Lahiff: (11:24 AM)
Can't your configuration management system take care of that for you?
Daniela Bauer: (11:25 AM)
@Jeremy: Maybe this is useful:
https://indico.cern.ch/event/477578/contributions/2168288/
Jeremy Coles: (11:40 AM)
Yes. Thanks Daniela.
raul: (11:41 AM)
@Andrew: If the configuration system can take care of glue in Arc? yes
I keep postponing as a minor problem that Arc would solve in the "next" version
Chris Brew: (11:51 AM)
raul - I think I saw some official statement from Arc that Glue 1 is obsolete and they will no longer fix any issues with it.
raul: (11:52 AM)
Yes, I think I saw it in their list, but really it was not clear to me what to do
Jeremy Coles: (12:06 PM)
https://indico.cern.ch/event/556609/sessions/204093/attachments/1330334/1998927/Lightweight_sites_-_notes.pdf
Paige Winslowe Lacesso: (12:11 PM)
Apologies, I have to leave now.
Daniela Bauer: (12:12 PM)
@Chris: If this is something that needs to be set by hand again every time you upgrade, it should be in the arc.conf
And the information doesn't seem to be present in glue2 either
raul: (12:17 PM)
That's what got me confused. glue1 is out, glue2 doesn't have it. Yet, I seem to have seen a discussion in the nordugrid list about support for some glue stuff. Confused again
David Crooks: (12:29 PM)
Cheers