# JMUX proxy performance

This document explains how we evaluated and improved the performance of our JMUX proxy implementation.

## Measurement procedure

Throughput and performance are measured locally on a Linux machine.
[`iperf`](https://en.wikipedia.org/wiki/Iperf) is used for measuring the network performance.
Wide area network delays are emulated using [`netem`](https://wiki.linuxfoundation.org/networking/netem).
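
An `iperf` server must be listening on the target port for all the client runs below. A minimal sketch (assuming `iperf` 2.x, whose default port 5001 matches the redirection target used later):

```shell
# Start the iperf server that both the direct and proxied runs connect to.
iperf -s -p 5001
```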

Six measurements are performed:

- 1 connection with an emulated delay of 50 ms
- 2 connections with an emulated delay of 50 ms
- 10 connections with an emulated delay of 50 ms
- 1 connection without delay
- 2 connections without delay
- 10 connections without delay

Jetsocat is built using the `profiling` profile, and two instances are run:

```shell
jetsocat jmux-proxy tcp-listen://127.0.0.1:5009 --allow-all
```

```shell
jetsocat jmux-proxy tcp://127.0.0.1:5009 tcp-listen://127.0.0.1:5000/127.0.0.1:5001
```
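
The first instance is the JMUX proxy server accepting the multiplexed stream on port 5009; the second listens on port 5000 and forwards each accepted TCP connection through the JMUX proxy to `127.0.0.1:5001`, where the `iperf` server listens. For reference, a `profiling` build would typically be produced like this (a sketch assuming a `profiling` profile is declared in the workspace `Cargo.toml`; the exact invocation depends on the workspace configuration):

```shell
# Build jetsocat with the custom `profiling` profile.
cargo build --profile profiling -p jetsocat
```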

`iperf` is then run 6 times using the following script:

```bash
#!/bin/bash

PORT="$1"
ADDR="127.0.0.1"

echo "PORT=$PORT"
echo "ADDR=$ADDR"

# Emulate a 50 msec WAN-like delay on the loopback interface (requires root).
tc qdisc add dev lo root handle 1:0 netem delay 50msec
echo "==> Enabled delay of 50msec"

echo "==> 1 connection"
iperf -c "$ADDR" -p "$PORT" -P 1 -t 600

sleep 5
echo "==> 2 connections"
iperf -c "$ADDR" -p "$PORT" -P 2 -t 600

sleep 5
echo "==> 10 connections"
iperf -c "$ADDR" -p "$PORT" -P 10 -t 600

sleep 5
# Remove the emulated delay.
tc qdisc del dev lo root
echo "==> Disabled delay"

echo "==> 1 connection"
iperf -c "$ADDR" -p "$PORT" -P 1 -t 600

sleep 5
echo "==> 2 connections"
iperf -c "$ADDR" -p "$PORT" -P 2 -t 600

sleep 5
echo "==> 10 connections"
iperf -c "$ADDR" -p "$PORT" -P 10 -t 600
```

Let’s assume the script is in a file named `run_iperf.sh`.
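
Note that the `tc qdisc` commands require elevated privileges, so the script typically needs to run as root (an assumption about the local setup; adjust to your environment):

```shell
# Make the script executable, then point it at the desired target port
# (5001 for the direct baseline, 5000 for the proxied run).
chmod +x run_iperf.sh
sudo ./run_iperf.sh 5001
```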

Running `iperf` for long enough is important to ensure that the buffering happening at the socket level does not influence the numbers too much.
When running for less than a minute, we end up measuring the rate at which `iperf` enqueues bytes into the socket’s buffer.
Filling the buffer can be done very quickly and can have a significant impact on the measured average speed.
Ten minutes is long enough to obtain convergent results.

## Applied optimizations

- <https://github.com/Devolutions/devolutions-gateway/pull/973>
- <https://github.com/Devolutions/devolutions-gateway/pull/975>
- <https://github.com/Devolutions/devolutions-gateway/pull/976>
- <https://github.com/Devolutions/devolutions-gateway/pull/977>
- <https://github.com/Devolutions/devolutions-gateway/pull/980>

## Measures

Results obtained following the above procedure.

### Direct (no JMUX proxy)

The `iperf` client is run directly against the server, without the JMUX proxy in between.

```shell
./run_iperf.sh 5001
```

The most interesting metric is the 1-connection result, as it is the best we can hope to achieve.
The JMUX proxy multiplexes many connections into a single one.
In other words, the maximum overall throughput we can hope to achieve through the JMUX proxy is the same as the direct 1-connection throughput.

#### With 50ms delay on loopback

1 connection:

```
[ 1] 0.0000-600.2051 sec 16.1 GBytes 230 Mbits/sec
```

#### Without delay

1 connection:

```
[ 1] 0.0000-600.0059 sec 6.84 TBytes 100 Gbits/sec
```

### Old unoptimized JMUX proxy (up to 2024.3.1)

This time, the `iperf` client is run against the JMUX proxy and redirected to the server.

```shell
./run_iperf.sh 5000
```

#### With 50ms delay on loopback

1 connection:

```
[ 1] 0.0000-637.5385 sec 66.2 MBytes 871 Kbits/sec
```

2 connections:

```
[ 2] 0.0000-637.1529 sec 66.4 MBytes 874 Kbits/sec
[ 1] 0.0000-637.4966 sec 66.4 MBytes 874 Kbits/sec
[SUM] 0.0000-637.4967 sec 133 MBytes 1.75 Mbits/sec
```

10 connections:

```
[ 6] 0.0000-627.8686 sec 85.9 MBytes 1.15 Mbits/sec
[ 4] 0.0000-627.8686 sec 86.5 MBytes 1.16 Mbits/sec
[ 2] 0.0000-627.9682 sec 86.3 MBytes 1.15 Mbits/sec
[ 8] 0.0000-628.0679 sec 86.5 MBytes 1.15 Mbits/sec
[ 1] 0.0000-628.0678 sec 86.5 MBytes 1.16 Mbits/sec
[ 10] 0.0000-628.0682 sec 86.6 MBytes 1.16 Mbits/sec
[ 7] 0.0000-628.1684 sec 86.2 MBytes 1.15 Mbits/sec
[ 9] 0.0000-628.1675 sec 87.0 MBytes 1.16 Mbits/sec
[ 5] 0.0000-628.2687 sec 86.6 MBytes 1.16 Mbits/sec
[ 3] 0.0000-628.3688 sec 86.4 MBytes 1.15 Mbits/sec
[SUM] 0.0000-628.3700 sec 865 MBytes 11.5 Mbits/sec
```

The more connections we open, the higher the overall throughput.
This shows that our flow control algorithm is not efficient.
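
For context, a window-limited flow moves roughly one window of data per round trip, so the measured ~871 Kbits/sec per connection hints at a very small effective window (a rough estimate, assuming the emulated 50 msec delay dominates the round trip):

```shell
# Effective window ~= throughput * RTT:
# 871 Kbit/s ~= 109 KB/s, and 109 KB/s * 0.05 s ~= 5.4 KB per round trip.
echo $(( 871 * 1000 / 8 * 50 / 1000 ))   # ~5443 bytes
```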

#### Without delay

1 connection:

```
[ 1] 0.0000-600.0517 sec 468 GBytes 6.70 Gbits/sec
```

2 connections:

```
[ 2] 0.0000-600.0294 sec 152 GBytes 2.18 Gbits/sec
[ 1] 0.0000-600.0747 sec 152 GBytes 2.18 Gbits/sec
[SUM] 0.0000-600.0747 sec 305 GBytes 4.36 Gbits/sec
```

10 connections:

```
[ 6] 0.0000-600.1632 sec 32.7 GBytes 467 Mbits/sec
[ 10] 0.0000-600.1636 sec 32.7 GBytes 467 Mbits/sec
[ 3] 0.0000-600.1635 sec 32.7 GBytes 467 Mbits/sec
[ 7] 0.0000-600.1633 sec 32.7 GBytes 467 Mbits/sec
[ 4] 0.0000-600.1633 sec 32.7 GBytes 467 Mbits/sec
[ 5] 0.0000-600.1641 sec 32.7 GBytes 467 Mbits/sec
[ 8] 0.0000-600.1635 sec 32.7 GBytes 467 Mbits/sec
[ 2] 0.0000-600.1634 sec 32.7 GBytes 467 Mbits/sec
[ 9] 0.0000-600.1633 sec 32.7 GBytes 467 Mbits/sec
[ 1] 0.0000-600.1632 sec 32.7 GBytes 467 Mbits/sec
[SUM] 0.0000-600.1641 sec 327 GBytes 4.67 Gbits/sec
```

### New optimized JMUX proxy (starting 2024.3.2)

Again, the `iperf` client is run against the JMUX proxy and redirected to the server.

```shell
./run_iperf.sh 5000
```

#### With 50ms delay on loopback

1 connection:

```
[ 1] 0.0000-600.4197 sec 16.1 GBytes 230 Mbits/sec
```

2 connections:

```
[ 1] 0.0000-605.0387 sec 8.19 GBytes 116 Mbits/sec
[ 2] 0.0000-605.1395 sec 8.19 GBytes 116 Mbits/sec
[SUM] 0.0000-605.1395 sec 16.4 GBytes 233 Mbits/sec
```

10 connections:

```
[ 3] 0.0000-625.7966 sec 1.69 GBytes 23.2 Mbits/sec
[ 8] 0.0000-625.9956 sec 1.69 GBytes 23.2 Mbits/sec
[ 1] 0.0000-626.0966 sec 1.69 GBytes 23.2 Mbits/sec
[ 5] 0.0000-626.0964 sec 1.69 GBytes 23.2 Mbits/sec
[ 2] 0.0000-626.1983 sec 1.69 GBytes 23.2 Mbits/sec
[ 7] 0.0000-626.1964 sec 1.69 GBytes 23.2 Mbits/sec
[ 6] 0.0000-626.1964 sec 1.69 GBytes 23.2 Mbits/sec
[ 9] 0.0000-626.1981 sec 1.69 GBytes 23.2 Mbits/sec
[ 10] 0.0000-626.2973 sec 1.69 GBytes 23.2 Mbits/sec
[ 4] 0.0000-626.3984 sec 1.69 GBytes 23.2 Mbits/sec
[SUM] 0.0000-626.3986 sec 16.9 GBytes 232 Mbits/sec
```

We are able to reach the same throughput as our "direct" baseline.
This shows that the flow control algorithm is not getting in the way anymore.

#### Without delay

1 connection:

```
[ 1] 0.0000-600.0518 sec 1.33 TBytes 19.4 Gbits/sec
```

2 connections:

```
[ 2] 0.0000-600.0706 sec 681 GBytes 9.75 Gbits/sec
[ 1] 0.0000-600.0705 sec 681 GBytes 9.75 Gbits/sec
[SUM] 0.0000-600.0705 sec 1.33 TBytes 19.5 Gbits/sec
```

10 connections:

```
[ 3] 0.0000-600.3608 sec 112 GBytes 1.60 Gbits/sec
[ 5] 0.0000-600.3606 sec 112 GBytes 1.60 Gbits/sec
[ 6] 0.0000-600.3605 sec 112 GBytes 1.60 Gbits/sec
[ 8] 0.0000-600.3598 sec 112 GBytes 1.60 Gbits/sec
[ 7] 0.0000-600.3594 sec 112 GBytes 1.60 Gbits/sec
[ 1] 0.0000-600.3606 sec 112 GBytes 1.60 Gbits/sec
[ 9] 0.0000-600.3597 sec 112 GBytes 1.60 Gbits/sec
[ 10] 0.0000-600.3606 sec 112 GBytes 1.60 Gbits/sec
[ 2] 0.0000-600.3602 sec 112 GBytes 1.60 Gbits/sec
[ 4] 0.0000-600.3719 sec 112 GBytes 1.60 Gbits/sec
[SUM] 0.0000-600.3721 sec 1.09 TBytes 16.0 Gbits/sec
```

Even without delay, the throughput is greatly improved over the unoptimized version.
Improved CPU usage allows more bytes to be processed in the same amount of time.

## Analysis

The flow control algorithm, particularly the window size, is a critical parameter for maintaining good throughput, especially when wide area network delays are present.
Since such delays are common in almost all practical setups, it’s safe to say that this is the most important metric to optimize.
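
The intuition is the bandwidth-delay product: the window must cover all the bytes in flight during one round trip, otherwise the sender stalls while waiting for acknowledgements. A back-of-the-envelope sketch for the delayed scenario above (treating the emulated 50 msec as the round-trip time; the effective RTT on loopback may differ):

```shell
# Bytes in flight needed to sustain 230 Mbit/s over a 50 ms round trip:
# 230 Mbit/s = 28.75 MB/s, and 28.75 MB/s * 0.05 s ~= 1.4 MB.
echo $(( 230 * 1000 * 1000 / 8 * 50 / 1000 ))   # prints 1437500 (bytes)
```

A window much smaller than this caps per-connection throughput, which is consistent with the measurements of the old implementation above.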

Other optimizations, while beneficial, primarily serve to reduce CPU usage and increase throughput on very high-speed networks.
A speed of 30 Mbits/s is already considered high, but networks with throughput exceeding 1 Gbits/s also exist.
Enhancing performance for these networks is valuable, particularly in reducing CPU usage as the volume of data processed increases.

Measurements indicate that our JMUX proxy should perform well, even on high-speed networks.
It is capable of matching the throughput of a direct connection, even at speeds of 230 Mbits/s.
At this rate, network overhead remains a more significant factor than the speed at which we can reframe for (de)multiplexing.

Of course, this benchmark has some limitations: for the sake of reproducibility, it assumes a perfect network where no packets are lost.
In real-world wide-area networks, packet loss will inevitably occur.

Nevertheless, these results provide valuable data, confirming that our optimizations are effective with a high degree of confidence.
While further optimization could be pursued to address more specific scenarios, the current implementation is likely sufficient for most practical purposes.