Below is a summary of our efforts to optimize flow control in the Rust Yamux implementation. While not a novel approach, I still find the end result worth sharing, hence this forum entry.
What is Yamux?
Yamux is a stream-oriented multiplexer operating on top of a connection-oriented, reliable transport (e.g. TCP). It was initially specified and implemented by HashiCorp. It is widely used across the various libp2p implementations as the multiplexing layer on top of TCP.
Flow control in Yamux
First off, what is flow control again? Wikipedia:
Flow control is the process of managing the rate of data transmission between two nodes to prevent a fast sender from overwhelming a slow receiver.
Yamux provides flow control via the concept of a flow control window and window update frames. A stream between a sender and a receiver starts off with a 256 KiB window. Each time the sender sends a frame, the window is reduced by the size of that frame. Once the window reaches 0, the sender is no longer allowed to send data. To prevent the sender from stalling indefinitely, the receiver sends window update frames back to the sender, each increasing the window again.
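To make the accounting concrete, here is a minimal sketch of sender-side window bookkeeping. The names (`SendWindow`, `consume`, `on_window_update`) are hypothetical and for illustration only, not the actual rust-yamux API:

```rust
/// Initial per-stream window as per the Yamux specification: 256 KiB.
const INITIAL_WINDOW: u32 = 256 * 1024;

/// Sender-side view of the flow control window (illustrative sketch).
struct SendWindow {
    credit: u32,
}

impl SendWindow {
    fn new() -> Self {
        SendWindow { credit: INITIAL_WINDOW }
    }

    /// Returns how many of `len` bytes may be sent right now and
    /// deducts them from the window. A return value of 0 means the
    /// sender is blocked until a window update frame arrives.
    fn consume(&mut self, len: u32) -> u32 {
        let sendable = len.min(self.credit);
        self.credit -= sendable;
        sendable
    }

    /// Called when a window update frame crediting `delta` bytes
    /// arrives from the receiver.
    fn on_window_update(&mut self, delta: u32) {
        self.credit += delta;
    }
}
```

The key point is that `consume` returning 0 is exactly the "sender is blocked" state discussed below.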
For those of you familiar with HTTP/2 or SPDY the above might ring a bell. Indeed Yamux is inspired by the early SPDY design and thus shares the flow control mechanism.
When To Send Window Update Frames
The easiest strategy is the following: both the sender and the receiver track the flow control window. Once 256 KiB of data has arrived at the receiver's end, fully depleting the window, the receiver sends a window update frame back to the sender, allowing the sender to send additional data once it receives the window update frame.
This sounds ideal, right? Not quite. Let’s take one step back:
allowing the sender to send additional data ONCE it receives the window update frame.
Rephrasing the above: with the easy strategy, the sender is blocked from the moment it sends the frame that fully depletes the window until it receives the window update frame from the receiver. Now you might be thinking that (a) this is only a small time period and (b) it only occurs when the window is depleted, but early experiments showed that this small time period has a large impact on the overall throughput of the stream. This is especially true for long fat pipes, i.e. connections with high latency and high bandwidth: the former increases the time the sender is blocked, the latter increases the number of bytes one could have sent in the meantime. Simulating on top of an ADSL2+ network, repeatedly sending 4 KiB of data on a stream, one would at most achieve a throughput of ~1.7 MB/sec, whereas with an initial window large enough not to be depleted during the test run one could achieve up to 2.3 MB/sec.
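The ~1.7 MB/sec ceiling is consistent with the classic bound for a window-limited stream: throughput cannot exceed the window size divided by the round-trip time. A quick sketch of that arithmetic (the 150 ms RTT here is an assumed, illustrative value, not one measured in the benchmarks):

```rust
/// Back-of-the-envelope ceiling for a window-limited stream:
/// throughput <= window / RTT.
fn throughput_ceiling(window_bytes: f64, rtt_secs: f64) -> f64 {
    window_bytes / rtt_secs
}
```

With a 256 KiB window and an assumed 150 ms round trip, the ceiling comes out at roughly 1.75 MB/sec, the same order of magnitude as the throughput observed above.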
So is there room for improvements by removing this small time period?
Sending Window Update Frames early
Another strategy to prevent the sender-stalling issue above would be to simply choose a very large initial window, hoping it is never depleted, or at least not very often. But that would defeat the purpose of doing flow control at the multiplexed-stream level in the first place.
Both the Yamux specification as well as the SPDY and HTTP/2 specifications leave it up to the receiver when to send window update frames. So instead of sending the window update frame only once the entire window is depleted, the receiver could just as well send it once a certain fraction of the window is depleted. If the window update frame arrives at the sender before the window is fully depleted, the sender is never blocked and can send data to the receiver continuously.
Without having to implement an advanced flow control algorithm, like glorious BBR or good old New Reno, what would be a reasonable constant fraction at which the receiver should send a window update frame back to the sender?
We simulated different networks, from a low-bandwidth high-latency mobile network all the way to a high-bandwidth low-latency LAN. In most cases, the basic strategy of sending a window update frame once half of the window is depleted showed results close to those achieved when letting the sender go full steam, effectively ignoring the flow control window.
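Sketched on the receiver side, the half-window strategy might look like this (a hypothetical helper for illustration, not the rust-yamux internals):

```rust
/// Initial per-stream window: 256 KiB.
const WINDOW: u32 = 256 * 1024;

/// Receiver-side sketch: send a window update once at least half of
/// the window has been consumed, instead of waiting for full depletion.
struct RecvWindow {
    remaining: u32,
}

impl RecvWindow {
    fn new() -> Self {
        RecvWindow { remaining: WINDOW }
    }

    /// Account for `len` received bytes. Returns `Some(delta)` when a
    /// window update frame crediting `delta` bytes should be sent now,
    /// `None` if less than half the window has been consumed so far.
    fn on_data(&mut self, len: u32) -> Option<u32> {
        self.remaining = self.remaining.saturating_sub(len);
        let consumed = WINDOW - self.remaining;
        if consumed >= WINDOW / 2 {
            // Credit everything consumed so far and reset the window.
            self.remaining = WINDOW;
            Some(consumed)
        } else {
            None
        }
    }
}
```

In the ideal case the update is in flight while the sender still has half a window of credit left, so the sender never stalls.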
Split Large Frames
One last thing before revealing the impact our changes had: when sending very large frames, close to or equal to the maximum window size, (a) the sender cannot interleave control messages and (b) the receiver can only send back window update frames once it has read one of these large frames in full, so the sender ends up blocked despite the early window update strategy. (Unless one wants to do some fancy header inspection ahead of time.)
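Splitting an outgoing payload into bounded frames is straightforward; a sketch, assuming a 16 KiB payload limit (`split_frames` is a hypothetical name, not the rust-yamux API):

```rust
/// Maximum payload per data frame: 16 KiB.
const MAX_PAYLOAD: usize = 16 * 1024;

/// Split an outgoing payload into frame-sized chunks. Each chunk would
/// be sent as its own data frame, allowing control messages (and the
/// receiver's window updates) to interleave between them.
fn split_frames(payload: &[u8]) -> Vec<&[u8]> {
    payload.chunks(MAX_PAYLOAD).collect()
}
```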
With that in mind, we limited the data frame payload size to 16 KiB by default. But wait… does that not introduce a large overhead? Not quite:
Limiting the frame size to 16 KiB does not introduce a large overhead. A Yamux frame header is 12 bytes, so this change introduces an overhead of only ~0.07%.
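For reference, the overhead arithmetic is simply 12 header bytes per 16 KiB (16,384 bytes) of payload:

```rust
/// Per-frame header overhead as a percentage of the payload size.
fn header_overhead_percent(header_bytes: f64, payload_bytes: f64) -> f64 {
    header_bytes / payload_bytes * 100.0
}
```

12 / 16384 ≈ 0.073%, hence the ~0.07% figure above.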
After having spent many hours measuring the throughput of our Yamux library on simulated network connections, it was about time to test the impact of our changes in the real world. For that we used two machines, one in Berlin and one in Helsinki. First we ran a good old iperf test, giving us a bandwidth of 11.2 Mbit/s. Next we used libp2p-perf to run a similar test on top of the rust-libp2p stack, once without and once with the optimizations described above (early window update frames and frame splitting).
Using the most recent release of our Yamux library (without the optimizations) as well as the latest rust-libp2p library, we achieved a bandwidth of 7.01 Mbit/s. Using a patched Yamux library with the optimizations described above, again with the latest rust-libp2p release, we achieved up to 8.76 Mbit/s, a ~25% improvement in bandwidth.
Resources to dig deeper:
- You can find the initial proposal as well as all benchmarking results in Send WindowUpdate message early · Issue #100 · paritytech/yamux · GitHub.
- Code to replicate all benchmarks was added via benches: Benchmark with constrained connections by mxinden · Pull Request #102 · paritytech/yamux · GitHub.
- Sending window update frames early was implemented in src/connection: Send WindowUpdate message early by mxinden · Pull Request #109 · paritytech/yamux · GitHub.
- Splitting large frames is achieved with src: Limit data frame payload size by mxinden · Pull Request #111 · paritytech/yamux · GitHub.
As always, comments, feedback, questions, critical remarks, … are most welcome.