# JMUX proxy performance

This document explains how we evaluated and improved the performance of our JMUX proxy implementation.

## Measurement procedure

Throughput and performance are measured locally on a Linux machine.
[`iperf`](https://en.wikipedia.org/wiki/Iperf) is used for measuring the network performance.
Wide area network delays are emulated using [`netem`](https://wiki.linuxfoundation.org/networking/netem).
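
As a quick sanity check (not part of the measurement procedure itself), the emulated delay can be confirmed with a plain `ping` once the `netem` qdisc is installed; the 50 msec delay applies to each traversal of the loopback interface, so the round-trip time should be roughly double that:

```shell
tc qdisc add dev lo root handle 1:0 netem delay 50msec
ping -c 4 127.0.0.1   # expect an RTT of roughly 100 ms
tc qdisc del dev lo root
```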

Six measurements are performed:

- 1 connection with an emulated delay of 50 msec
- 2 connections with an emulated delay of 50 msec
- 10 connections with an emulated delay of 50 msec
- 1 connection without delay
- 2 connections without delay
- 10 connections without delay

Jetsocat is built using the `profiling` profile, and two instances are run:

```shell
jetsocat jmux-proxy tcp-listen://127.0.0.1:5009 --allow-all
```

```shell
jetsocat jmux-proxy tcp://127.0.0.1:5009 tcp-listen://127.0.0.1:5000/127.0.0.1:5001
```
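
`profiling` is not one of Cargo's built-in profiles, so the exact build invocation depends on how the profile is declared in the repository's `Cargo.toml`. Assuming a custom profile with that name and a `jetsocat` package, the build step would look roughly like this:

```shell
cargo build --profile profiling -p jetsocat
```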

`iperf` is then run 6 times using the following script:

```bash
#!/bin/bash

PORT="$1"
ADDR="127.0.0.1"

echo "PORT=$PORT"
echo "ADDR=$ADDR"

# Requires root: add a 50 msec delay to every packet going through the loopback interface.
tc qdisc add dev lo root handle 1:0 netem delay 50msec
echo "==> Enabled delay of 50msec"

echo "==> 1 connection"
iperf -c "$ADDR" -p "$PORT" -P 1 -t 600

sleep 5
echo "==> 2 connections"
iperf -c "$ADDR" -p "$PORT" -P 2 -t 600

sleep 5
echo "==> 10 connections"
iperf -c "$ADDR" -p "$PORT" -P 10 -t 600

sleep 5
# Remove the netem qdisc to restore the normal loopback behavior.
tc qdisc del dev lo root
echo "==> Disabled delay"

echo "==> 1 connection"
iperf -c "$ADDR" -p "$PORT" -P 1 -t 600

sleep 5
echo "==> 2 connections"
iperf -c "$ADDR" -p "$PORT" -P 2 -t 600

sleep 5
echo "==> 10 connections"
iperf -c "$ADDR" -p "$PORT" -P 10 -t 600
```

Let’s assume the script is in a file named `run_iperf.sh`.
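
The setup above does not show it explicitly, but the procedure also assumes an `iperf` server listening on 127.0.0.1:5001, which is both the target of the direct runs and the destination of the jetsocat forward. With a default install, starting it would look like:

```shell
iperf -s -p 5001
```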

Running `iperf` for long enough is important to ensure that the buffering happening at the socket level does not influence the numbers too much.
When running for less than a minute, we end up measuring the rate at which `iperf` enqueues bytes into the socket’s buffer.
Filling the buffer can be done very quickly and can have a significant impact on the measured average speed.
Ten minutes is long enough to obtain convergent results.
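
To get an idea of how much data the kernel is willing to buffer per socket, the default and maximum send-buffer sizes can be inspected with `sysctl` (standard Linux keys; the exact values vary between distributions):

```shell
sysctl net.core.wmem_default net.core.wmem_max net.ipv4.tcp_wmem
```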

## Applied optimizations

- <https://github.com/Devolutions/devolutions-gateway/pull/973>
- <https://github.com/Devolutions/devolutions-gateway/pull/975>
- <https://github.com/Devolutions/devolutions-gateway/pull/976>
- <https://github.com/Devolutions/devolutions-gateway/pull/977>
- <https://github.com/Devolutions/devolutions-gateway/pull/980>

## Measurements

The results below were obtained by following the procedure described above.

### Direct (no JMUX proxy)

The `iperf` client is run against the server directly, without using the JMUX proxy in-between.

```shell
./run_iperf.sh 5001
```

The most interesting metric is the 1-connection one, which is the best we can hope to achieve.
The JMUX proxy multiplexes many connections into a single one.
In other words, the maximum overall throughput we can hope to achieve using the JMUX proxy is the same as the direct 1-connection throughput.

#### With 50ms delay on loopback

1 connection:

```
[ 1] 0.0000-600.2051 sec 16.1 GBytes 230 Mbits/sec
```

#### Without delay

1 connection:

```
[ 1] 0.0000-600.0059 sec 6.84 TBytes 100 Gbits/sec
```

### Old unoptimized JMUX proxy up to 2024.3.1

This time, the `iperf` client is run against the JMUX proxy and redirected to the server.

```shell
./run_iperf.sh 5000
```

#### With 50ms delay on loopback

1 connection:

```
[ 1] 0.0000-637.5385 sec 66.2 MBytes 871 Kbits/sec
```

2 connections:

```
[ 2] 0.0000-637.1529 sec 66.4 MBytes 874 Kbits/sec
[ 1] 0.0000-637.4966 sec 66.4 MBytes 874 Kbits/sec
[SUM] 0.0000-637.4967 sec 133 MBytes 1.75 Mbits/sec
```

10 connections:

```
[ 6] 0.0000-627.8686 sec 85.9 MBytes 1.15 Mbits/sec
[ 4] 0.0000-627.8686 sec 86.5 MBytes 1.16 Mbits/sec
[ 2] 0.0000-627.9682 sec 86.3 MBytes 1.15 Mbits/sec
[ 8] 0.0000-628.0679 sec 86.5 MBytes 1.15 Mbits/sec
[ 1] 0.0000-628.0678 sec 86.5 MBytes 1.16 Mbits/sec
[ 10] 0.0000-628.0682 sec 86.6 MBytes 1.16 Mbits/sec
[ 7] 0.0000-628.1684 sec 86.2 MBytes 1.15 Mbits/sec
[ 9] 0.0000-628.1675 sec 87.0 MBytes 1.16 Mbits/sec
[ 5] 0.0000-628.2687 sec 86.6 MBytes 1.16 Mbits/sec
[ 3] 0.0000-628.3688 sec 86.4 MBytes 1.15 Mbits/sec
[SUM] 0.0000-628.3700 sec 865 MBytes 11.5 Mbits/sec
```

The more connections we open, the higher the overall throughput gets.
This shows that our flow control algorithm is not efficient: each connection is capped far below what the link can carry, so the aggregate throughput grows roughly linearly with the number of connections instead of saturating the available bandwidth.
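
As a rough back-of-the-envelope check (assuming the 50 msec `netem` delay applies in each direction, giving an RTT of about 100 msec), the observed per-connection rate corresponds to an effective flow control window on the order of only ~10 KB:

$$
\text{window} \approx \text{throughput} \times \text{RTT} \approx 871\ \text{Kbit/s} \times 0.1\ \text{s} \approx 87\ \text{Kbit} \approx 11\ \text{KB}
$$
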
#### Without delay

1 connection:

```
[ 1] 0.0000-600.0517 sec 468 GBytes 6.70 Gbits/sec
```

2 connections:

```
[ 2] 0.0000-600.0294 sec 152 GBytes 2.18 Gbits/sec
[ 1] 0.0000-600.0747 sec 152 GBytes 2.18 Gbits/sec
[SUM] 0.0000-600.0747 sec 305 GBytes 4.36 Gbits/sec
```

10 connections:

```
[ 6] 0.0000-600.1632 sec 32.7 GBytes 467 Mbits/sec
[ 10] 0.0000-600.1636 sec 32.7 GBytes 467 Mbits/sec
[ 3] 0.0000-600.1635 sec 32.7 GBytes 467 Mbits/sec
[ 7] 0.0000-600.1633 sec 32.7 GBytes 467 Mbits/sec
[ 4] 0.0000-600.1633 sec 32.7 GBytes 467 Mbits/sec
[ 5] 0.0000-600.1641 sec 32.7 GBytes 467 Mbits/sec
[ 8] 0.0000-600.1635 sec 32.7 GBytes 467 Mbits/sec
[ 2] 0.0000-600.1634 sec 32.7 GBytes 467 Mbits/sec
[ 9] 0.0000-600.1633 sec 32.7 GBytes 467 Mbits/sec
[ 1] 0.0000-600.1632 sec 32.7 GBytes 467 Mbits/sec
[SUM] 0.0000-600.1641 sec 327 GBytes 4.67 Gbits/sec
```

### New optimized JMUX proxy starting 2024.3.2

Again, the `iperf` client is run against the JMUX proxy and redirected to the server.

```shell
./run_iperf.sh 5000
```

#### With 50ms delay on loopback

1 connection:

```
[ 1] 0.0000-600.4197 sec 16.1 GBytes 230 Mbits/sec
```

2 connections:

```
[ 1] 0.0000-605.0387 sec 8.19 GBytes 116 Mbits/sec
[ 2] 0.0000-605.1395 sec 8.19 GBytes 116 Mbits/sec
[SUM] 0.0000-605.1395 sec 16.4 GBytes 233 Mbits/sec
```

10 connections:

```
[ 3] 0.0000-625.7966 sec 1.69 GBytes 23.2 Mbits/sec
[ 8] 0.0000-625.9956 sec 1.69 GBytes 23.2 Mbits/sec
[ 1] 0.0000-626.0966 sec 1.69 GBytes 23.2 Mbits/sec
[ 5] 0.0000-626.0964 sec 1.69 GBytes 23.2 Mbits/sec
[ 2] 0.0000-626.1983 sec 1.69 GBytes 23.2 Mbits/sec
[ 7] 0.0000-626.1964 sec 1.69 GBytes 23.2 Mbits/sec
[ 6] 0.0000-626.1964 sec 1.69 GBytes 23.2 Mbits/sec
[ 9] 0.0000-626.1981 sec 1.69 GBytes 23.2 Mbits/sec
[ 10] 0.0000-626.2973 sec 1.69 GBytes 23.2 Mbits/sec
[ 4] 0.0000-626.3984 sec 1.69 GBytes 23.2 Mbits/sec
[SUM] 0.0000-626.3986 sec 16.9 GBytes 232 Mbits/sec
```

We are able to reach the same throughput as our "direct" baseline.
This shows that the flow control algorithm is not getting in the way anymore.

#### Without delay

1 connection:

```
[ 1] 0.0000-600.0518 sec 1.33 TBytes 19.4 Gbits/sec
```

2 connections:

```
[ 2] 0.0000-600.0706 sec 681 GBytes 9.75 Gbits/sec
[ 1] 0.0000-600.0705 sec 681 GBytes 9.75 Gbits/sec
[SUM] 0.0000-600.0705 sec 1.33 TBytes 19.5 Gbits/sec
```

10 connections:

```
[ 3] 0.0000-600.3608 sec 112 GBytes 1.60 Gbits/sec
[ 5] 0.0000-600.3606 sec 112 GBytes 1.60 Gbits/sec
[ 6] 0.0000-600.3605 sec 112 GBytes 1.60 Gbits/sec
[ 8] 0.0000-600.3598 sec 112 GBytes 1.60 Gbits/sec
[ 7] 0.0000-600.3594 sec 112 GBytes 1.60 Gbits/sec
[ 1] 0.0000-600.3606 sec 112 GBytes 1.60 Gbits/sec
[ 9] 0.0000-600.3597 sec 112 GBytes 1.60 Gbits/sec
[ 10] 0.0000-600.3606 sec 112 GBytes 1.60 Gbits/sec
[ 2] 0.0000-600.3602 sec 112 GBytes 1.60 Gbits/sec
[ 4] 0.0000-600.3719 sec 112 GBytes 1.60 Gbits/sec
[SUM] 0.0000-600.3721 sec 1.09 TBytes 16.0 Gbits/sec
```

Even without delay, the throughput is greatly improved over the unoptimized version.
Improved CPU usage allows more bytes to be processed in the same amount of time.

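Although CPU usage was not recorded as part of this benchmark, it is easy to observe during a run with standard tools, for example (assuming both instances run as processes named `jetsocat`):

```shell
pidstat -u -p "$(pgrep -d, jetsocat)" 1
```
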
## Analysis

The flow control algorithm, particularly the window size, is a critical parameter for maintaining good throughput, especially when wide area network delays are present.
Since such delays are common in almost all practical setups, it’s safe to say that this is the most important parameter to optimize.
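
To put a number on it (again assuming the emulated 50 msec delay in each direction, i.e. an RTT of roughly 100 msec), matching the direct baseline of 230 Mbits/s requires keeping about 3 MB of data in flight, so the window must be at least in that range:

$$
\text{bandwidth-delay product} \approx 230\ \text{Mbit/s} \times 0.1\ \text{s} \approx 23\ \text{Mbit} \approx 2.9\ \text{MB}
$$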

Other optimizations, while beneficial, primarily serve to reduce CPU usage and increase throughput on very high-speed networks.
A speed of 30 Mbits/s is already considered high, but networks with throughput exceeding 1 Gbits/s also exist.
Enhancing performance for these networks is valuable, particularly in reducing CPU usage as the volume of data processed increases.

Measurements indicate that our JMUX proxy should perform well, even on high-speed networks.
It is capable of matching the throughput of a direct connection, even at speeds of 230 Mbits/s.
At this rate, network overhead remains a more significant factor than the speed at which we can reframe for (de)multiplexing.

Of course, this benchmark has some limitations: for the sake of reproducibility, it assumes a perfect network where no packets are lost.
In real-world wide-area networks, packet loss will inevitably occur.

Nevertheless, these results provide valuable data, confirming that our optimizations are effective with a high degree of confidence.
While further optimization could be pursued to address more specific scenarios, the current implementation is likely sufficient for most practical purposes.
