Posted by Chris Northwood
This blog post was written by team members on the Trial Platform project.
We’ve previously blogged about our Trial Platform project, where we take our IP Studio production technology out of the lab and work with BBC production teams to test it. To support outside broadcasts, we designed the Trial Platform around commercially available fibre Internet connections, so we can operate from any venue offering upload speeds of 100Mb/s or more. The Trial Platform uses a site-to-site VPN (Virtual Private Network) to provide a secure, encrypted tunnel between our on-site Outside Broadcast (OB) kit and the MediaCityUK datacentre, via the public internet.
Our use of the VPN tunnel has two main purposes. The first is to enable remote rendering of our output, which is captured by UHD cameras connected to the OB kit. Rendering is performed back at the MediaCityUK datacentre, where IP Studio nodes pull video streams from the recorded content stored on the on-site OB kit. The second is to support remote production: lower-resolution content is streamed at low latency to the SOMA interface, allowing an operator to view the stream and make edits to the outgoing broadcast. The bandwidth requirements into SOMA are relatively low, and a home wi-fi connection is all that is needed to control the editing system across the public internet.
In its first iteration, the Trial Platform used the widely deployed open source OpenVPN solution for our VPN. In lab testing, standard Linux network measurement tools such as iperf consistently reported speeds of 450Mb/s across the Trial Platform’s VPN tunnel, close to our theoretical maximum throughput of 500Mb/s through our firewall. However, during the Trial Platform’s first outing, at the Great Exhibition of the North 2018 at Sage, Gateshead, we experienced the pitfalls of working outside the confines of the BBC R&D lab environment. Real-world networks, including the network at the Sage, are subject to a modest amount of latency: the delays incurred in processing network packets, and in moving data along fibre optics at the speed of light.
Network latency can create bottlenecks that prevent data from filling a network pipe, decreasing throughput and limiting the maximum effective bandwidth of a connection. The impact of latency on network throughput can be temporary or persistent depending on the source of delays.
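One way to see why is the bandwidth-delay product: a single TCP stream with a fixed window can never carry more than the window size divided by the round-trip time, no matter how fat the pipe is. A small illustrative sketch (the 64 KiB window is a hypothetical figure for illustration, not a measurement from our kit):

```python
# Illustrative: maximum single-stream TCP throughput is bounded by
# window size / round-trip time (the bandwidth-delay product relation).

def max_throughput_mbps(window_bytes: int, rtt_ms: float) -> float:
    """Upper bound on throughput (Mb/s) for one TCP stream."""
    return (window_bytes * 8) / (rtt_ms / 1000) / 1_000_000

# With a hypothetical 64 KiB window and no window scaling:
print(max_throughput_mbps(64 * 1024, 1))   # ~1ms lab RTT: ≈ 524 Mb/s
print(max_throughput_mbps(64 * 1024, 28))  # ~28ms RTT: ≈ 19 Mb/s
```

The same pipe that looks comfortably fast at sub-millisecond lab latency is throttled hard once tens of milliseconds of round-trip time are added.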
Unfortunately, as we discovered at Sage, latency had a dramatic impact on the performance of our VPN. The 28ms of latency we saw at Sage reduced our effective throughput from 450Mb/s to around 50Mb/s. For our trial, this left the rendering nodes in the MediaCityUK datacentre unable to pull back data streams quickly enough, resulting in ‘null grains’: choppy, broken video at best, or none at all. These null grains also caused issues when we tried to pass the content on to the BBC’s video streaming infrastructure.
Back in the lab, we set about investigating the relationship between throughput and latency and its impact on our VPN’s performance. Initial investigations focussed on our networking setup, in particular our firewall and routing configuration, which we felt may have caused the degraded performance in conjunction with network latency.
Our first challenge in testing this assertion was to introduce varying degrees of latency and packet loss without having to leave the lab. We achieved this using netem, the Linux kernel’s network emulation facility, which can introduce network impairments such as latency and packet loss and thereby emulate wide area network behaviour.
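For instance, a fixed delay can be applied to an interface using tc. This is a sketch only: the interface name eth0 is illustrative, and the commands require root privileges.

```shell
# Apply 28ms of fixed delay to packets leaving eth0 (illustrative interface)
tc qdisc add dev eth0 root netem delay 28ms

# netem can also emulate jitter and random packet loss, e.g. 28ms ± 5ms
# of delay with 0.1% loss:
tc qdisc change dev eth0 root netem delay 28ms 5ms loss 0.1%

# Remove the emulated impairment when finished
tc qdisc del dev eth0 root
```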
We were now in a position to chart latency against throughput, using netem to apply delays and the iperf tool to measure throughput. We applied latency to our VPN interface (this being the route network traffic takes between our remote OB kit and Dock House) and were able to successfully replicate the reduction in throughput seen on site in Gateshead.
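The charting workflow can be sketched as a simple loop, stepping the emulated delay and running iperf at each level. Again, the interface name, server address and step size below are illustrative, and the commands need root:

```shell
#!/bin/sh
for delay_ms in 0 10 20 30 40; do
    # Replace the current qdisc with the new emulated delay
    tc qdisc replace dev eth0 root netem delay "${delay_ms}ms"
    # 10-second TCP throughput test against the far end of the tunnel
    iperf -c 10.8.0.1 -t 10
done
# Clear the emulated impairment when done
tc qdisc del dev eth0 root
```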
Buoyed by this success, we set about simplifying our routing rules to avoid crossing firewalls repeatedly, believing this would address the performance issue as latency increased. Subsequent testing with ‘faked’ latency showed it did not: throughput still degraded by the same margins, and disappointment followed.
With our routing now excluded, we felt the VPN itself warranted investigation. Having replicated the decreased performance on the physical Trial Platform kit, we set about testing the VPN software independently of the hardware, to see once again whether we could mimic the degradation in performance.
We use Vagrant and Ansible extensively as part of the Trial Platform; these tools allow us to automate the configuration and management of our physical hardware and to verify that automation in a virtualised environment. Using them, we set up a virtualised VPN server and client, employing the same scripts used to deploy the VPN on hardware. We also automated the process of applying varying latency to the network interface and charting throughput, allowing us to quickly test different configurations.
Using this approach we were able to successfully replicate the reduction in throughput seen on site in Gateshead and in lab testing of the physical hardware. This allowed us to confirm, in a repeatable, isolated manner, that it was the VPN, rather than another part of the Trial Platform, that was the root cause. Having identified the VPN link as the bottleneck in our application in the presence of network latency, we began testing and comparing alternatives to the OpenVPN solution we were using.
First into the mix was StrongSwan, a well-established open source IPsec-based VPN solution. Using the same approach we had used to test OpenVPN independently of the Trial Platform, we spiked a virtualised VPN server and client setup, this time deploying StrongSwan.
Next, we spiked a WireGuard virtualised VPN server and client setup, again deployed in the same manner as the StrongSwan and OpenVPN spikes. WireGuard is a relative newcomer which purports to be leaner and faster than alternative solutions, using state-of-the-art cryptography and implemented as a Linux kernel module.
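WireGuard’s configuration is also notably compact. A minimal site-to-site setup in wg-quick format might look like the sketch below; the keys, addresses and port are placeholders, not our real values:

```ini
[Interface]
# Placeholder key and tunnel address for the server end
PrivateKey = <server-private-key>
Address = 10.0.0.1/24
ListenPort = 51820

[Peer]
# Placeholder peer key; AllowedIPs routes tunnel traffic to this peer
PublicKey = <client-public-key>
AllowedIPs = 10.0.0.2/32
```

A configuration like this is typically brought up with wg-quick, which creates the interface and installs the routes implied by AllowedIPs.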
We tested all spiked configurations using the same criteria, applying 0 to 40ms of latency in increments of 1ms and charting network throughput in kb/s. At each level of latency, throughput was determined using the iperf tool, running three 10-second tests between the client and server; the results of the three tests were used to derive a mean throughput speed.
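The averaging step is straightforward to automate. As a sketch, assuming iperf3’s JSON output mode (the post’s tests used iperf; the field names and example figures below are illustrative, not our measured results):

```python
import json
from statistics import mean

def mean_throughput_mbps(json_reports):
    """Mean received throughput (Mb/s) across repeated iperf3 --json runs."""
    speeds = [r["end"]["sum_received"]["bits_per_second"] / 1e6
              for r in (json.loads(s) for s in json_reports)]
    return mean(speeds)

# Three hypothetical 10-second runs at one latency setting:
runs = [json.dumps({"end": {"sum_received": {"bits_per_second": b}}})
        for b in (98.4e6, 101.2e6, 100.1e6)]
print(mean_throughput_mbps(runs))  # ≈ 99.9 Mb/s
```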
From our testing, WireGuard met the performance characteristics needed for our use cases when latency was applied to the network interface: applying 40ms of latency reduced throughput to approximately 100Mb/s. This is a significant improvement over the previous solution, providing almost twice the throughput at the higher end of our latency range.
On the basis of our findings, WireGuard has now been incorporated into the Trial Platform. The WireGuard solution offers us improved resilience against network conditions beyond our control when we next venture outside the R&D lab, and gives us greater confidence in our ability to render video between sites.