Thanks for this Tim
See you shortly in Ambleside
Pete
On 27 Aug 2019, at 14:22, Tim Chown <[log in to unmask]> wrote:
Hi,
[Apologies this is a bit long, but if you’re at GridPP this week and interested in the future of the perfSONAR mesh, read on….]
Those of you who take an interest in the perfSONAR measurements between the GridPP sites will be familiar with Duncan’s work in maintaining a UK-based version of the MaDDash mesh between the sites. You can view the mesh, which is hosted on a Jisc VM, at:
https://ps-dash.dev.ja.net/maddash-webui//index.cgi?dashboard=UK%20Mesh%20Config
Most of the mesh is active, with a small number of ongoing issues. But in general, there is very useful data to be drawn from the mesh, and agreement that we should maintain this capability (and possibly extend it) in GridPP6.
Recent discussions, in particular with Pete Clark and on a recent TB-SUPPORT call, suggest that it would be timely to plan a refresh of the perfSONAR infrastructure across the GridPP sites. Jisc has been active with perfSONAR in recent years: we have two 10G perfSONAR nodes on our backbone, are planning a 100G node, and are now responsible for the European side of perfSONAR software development in the GÉANT GN4-3 project. Pete therefore suggested we present some thoughts on the refresh to help seed the activity, and provide assistance where useful. To that end, I have a brief slot at GridPP43, so I'm floating some of those thoughts here on the list beforehand, giving those interested a chance to think about them before the session on Friday.
I hadn’t realised quite how long ago the original perfSONAR infrastructure was established. Duncan pointed me at a mail thread from 2012 in which the original deployment was discussed, and a specification agreed for the nodes. For the nostalgic, the thread can be found at https://www.jiscmail.ac.uk/cgi-bin/webadmin?A2=ind1201&L=TB-SUPPORT&O=D&P=5928. There was lengthy consideration of what was required, and many of the points made are still relevant today.
As far as I can tell from the thread, the agreed spec at the time was:
- Pair of E5620 CPUs,
- 12 GB memory,
- RAID 1,
- PERC H200,
- 2x 500 GB disks,
- Redundant PSU,
- Power leads,
- a 10Gb SFP+ interface,
- iDRAC6 Enterprise,
- 5yr basic warranty.
And again, as far as I can tell, no specific hardware was mandated; it was agreed that all sites would buy to that spec, whether Dell or otherwise. I think the above corresponds to a Dell R610.
Duncan has looked at the specs of the current nodes, and has seen examples of some nodes that have been upgraded since the original 2012 deployment; this information is available via each node’s web interface (under the /toolkit area).
As we approach the refresh, the hardware will be important. Unlike 2012, it is now considered perfectly acceptable to run both throughput and loss/latency measurements off a single system via two interfaces, rather than having separate hardware for each. We could debate the potential for virtualised measurement points, but physical hardware is still generally recommended. We should also reflect on the nature of each GridPP site; if there are to be five “larger” sites, we might deploy a higher spec for those sites, running 10G now but capable of 100G in the future, and run lower spec servers at 10G at the other locations. Regardless, these should probably be bought with a 5-year view in mind.
There are other things to consider, including:
- Automation of the perfSONAR node management. It was clear from the TB-SUPPORT call that every site should, if they choose, be able to manage their nodes themselves, but more broadly the option to have them managed from one point may be useful. There has been work in the GN4-3 project on Ansible for perfSONAR, and the upcoming 4.2 release has improved support.
- perfSONAR mesh views and data archiving. Jisc hosts the UK-based version of the mesh mentioned above, and the WLCG also hosts a UK mesh view at https://psmad.opensciencegrid.org/maddash-webui/index.cgi?dashboard=UK%20Mesh%20Config. We do not currently archive the data, but we (or the GridPP project) could do so; or perhaps we’re happy simply to draw on the WLCG’s archive. Either way, there are now increasingly good APIs available to mine the measurement data and export it to other processes, for example.
- Monitoring the perfSONAR servers. It could be useful to have a single management view of the status of the servers. Marian Babik has worked on using check_mk for this, with a setup “tuned” for perfSONAR, e.g., as described at https://slideplayer.com/slide/10427717/. While it’s perfectly OK for local sites to monitor their own systems, a community view may be useful to have.
- Improved visualisations. The perfSONAR visualisations haven’t changed for quite some time, and are now a little dated, though still useful. Work is ongoing in the dev team and elsewhere on improved tools, e.g., using Grafana. We might want to apply some of that here.
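On the point about mining the measurement data: as a purely illustrative sketch (not a recommendation of any particular tooling), the measurement archive behind perfSONAR returns JSON time series, and a few lines of Python can summarise them. The payload shape and the archive URL in the comment below are hypothetical placeholders, modelled loosely on esmond-style output; any real use would need checking against the archive’s actual API.

```python
# Illustrative sketch only: summarise an esmond-style JSON time series of
# throughput results. The payload shape ({"ts": ..., "val": ...}) and the
# archive URL in the comment are assumptions, not a documented interface.
import json

def summarise_throughput(points):
    """Return (min, mean, max) throughput in Gbit/s from a list of data points.

    Assumes each point has a "val" field in bits/s, as throughput archives
    commonly report.
    """
    gbps = [p["val"] / 1e9 for p in points]
    return min(gbps), sum(gbps) / len(gbps), max(gbps)

# Example payload, as it might be fetched from a (hypothetical) archive URL like
# https://<archive-host>/esmond/perfsonar/archive/<metadata-key>/throughput/base
sample = json.loads(
    '[{"ts": 1566900000, "val": 9.1e9},'
    ' {"ts": 1566903600, "val": 8.7e9},'
    ' {"ts": 1566907200, "val": 9.4e9}]'
)

lo, mean, hi = summarise_throughput(sample)
print(f"min {lo:.1f} / mean {mean:.1f} / max {hi:.1f} Gbit/s")
```

Exporting summaries like this into a site dashboard, or into whatever archive the GridPP project settles on, is the kind of thing the newer APIs make straightforward.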
There are also some other things we could look at, e.g., the ideas being explored in the SAND project on correlating measurement data from multiple sources, including FTS logs and router interface utilisation; there’s a recent talk on this at https://wiki.geant.org/display/PMV/6th+SIG-PMV+Meeting+@+Dublin (see the talk at 16:00 on the first day).
And finally, we do need to be sure that whatever we do is compatible with the WLCG’s broader measurement and monitoring requirements, and that we’re not duplicating effort. I’ve had some brief exchanges with Shawn about Friday’s talk, and Duncan and I will look to have a more detailed chat with Shawn and Marian after the GridPP meeting.
Again, apologies this is a bit longer than planned, but hopefully is useful and captures some of the topics that could be discussed and agreed.
Best wishes,
Tim
To unsubscribe from the TB-SUPPORT list, click the following link:
https://www.jiscmail.ac.uk/cgi-bin/webadmin?SUBED1=TB-SUPPORT&A=1
--
The University of Edinburgh is a charitable body, registered in
Scotland, with registration number SC005336.