mirror of https://github.com/coturn/coturn
# Performance iteration log

Running notes for the multi-iteration performance work on the UDP relay
data path. Pick this up to continue without re-deriving everything.

The harness, baseline command, and droplet topology are documented in
[CLAUDE.md](../CLAUDE.md) under "Load Test on DigitalOcean" — this file
captures the *deltas*: what was measured, what landed, what didn't, and
where the next round should go.

## Cumulative result

Five commits on `claude/beautiful-black-c3b741` between `727ec2ab`
("loadgen") and `321a2d18`:

| # | Commit | Optimization |
|---|---|---|
| 1 | `ce7e7e53` | Hoist `turn_server_get_engine()` out of per-packet hot path |
| 2 | `8e28491a` | `ioa_socket_check_bandwidth` early fast-exit; drop dead `if (!(s->done \|\| s->fd==-1))` in `send_data_from_ioa_socket_nbh` |
| 3 | `344360f6` | Cache `get_relay_socket_ss()` and `ioa_network_buffer_get_size()` in `write_to_peerchannel`, `handle_turn_send`, `read_client_connection` |
| 4 | `a6f6767f` | Inline `get_ioa_addr_len()` via `ns_turn_ioaddr.h` |
| 5 | `321a2d18` | Inline `addr_cpy()` via `ns_turn_ioaddr.h` |

Current `relay-recvmmsg` follow-up:

| # | Commit | Optimization |
|---|---|---|
| 6 | `54c589d0` / `4b1a8d71` | Initial Linux `recvmmsg` batching for UDP listener and connected relay sockets |
| 7 | `8d9a7292` | Share the existing `--udp-recvmmsg` flag across listener and relay UDP paths; remove separate relay flag; use the shared ancillary-data parser in `dtls_listener` |
| 8 | `d48686b7` | Reduce relay per-socket `recvmmsg` state from 16 x 64 KiB cmsg buffers to TTL/TOS-sized buffers, avoid an extra would-block fallback `recvmsg`, and clean up all preallocated buffers after partial batches |
| 9 | `ad81705e` | Add per-engine `recvmmsg` occupancy counters and 10 s log summaries (`calls`, `packets`, `avg_batch`, `wouldblock`, `unavailable`, `no_buffer`, batch-size histogram) |
| 10 | `388b15d4` | Move connected relay UDP `recvmmsg` scratch from per-socket state to per-engine/per-thread state |
| 11 | `4c4fd67e` | Make the occupancy summaries opt-in behind `--udp-recvmmsg-log`, so `--udp-recvmmsg` can ship without periodic stats logs |

Validation after #7-#11:

- Local `cmake -S . -B build -DBUILD_TESTING=ON` passed.
- Local `cmake --build build --parallel 8` passed.
- Local `ctest --test-dir build --output-on-failure` passed 3/3.
- Local `build/bin/turnserver --udp-recvmmsg --udp-recvmmsg-log --version`
  parsed both flags and printed `4.11.0`.
- Linux Docker `turnserver` build passed after #7, after #8, and after #10.

Shipping cleanup learning: keep the occupancy counters in place because they
are low overhead and useful for DigitalOcean diagnostics, but keep the periodic
summaries off by default. Use `--udp-recvmmsg-log` only during measured runs
where the log stream is part of the observation.

DigitalOcean check on 2026-05-09:

- Reused the existing `c-4` droplets in `nyc1`: turnserver public
  `157.230.3.102`, private `10.116.0.2`; loadgen public `167.99.153.216`,
  private `10.116.0.3`. Droplets were left running between steps.
- Built fresh current artifacts from `d48686b7` on both droplets under
  `/root/coturn_recvmmsg_current`.
- Same-binary `--udp-recvmmsg` off/on, `-Y packet -m 1 -l 120`, 5 alternating
  30 s rounds each:
  - off mean 154,527, median 154,596, stdev 3,467
  - on mean 149,994, median 153,011, stdev 7,174
  - on was -2.9 % by mean and -1.0 % by median
- Same-binary `--udp-recvmmsg` off/on, `-Y packet -m 100 -l 120`, 5 alternating
  rounds each. The client completed before the 30 s timeout and landed in two
  send-volume buckets, so treat this as a coarse many-connection signal:
  - off mean 59,432, median 65,071, stdev 7,952
  - on mean 59,640, median 65,421, stdev 7,963
  - on was +0.3 % by mean and +0.5 % by median
- Follow-up `m=100 -n 1000` run, 3 alternating rounds each, with the receive
  count derived from `tot_recv_bytes / 120` because this log format omits
  `tot_recv_msgs`:
  - off mean 8,540, median 8,990, stdev 1,004
  - on mean 8,857, median 8,749, stdev 759
  - on was +3.7 % by mean and -2.7 % by median

Learning: the corrected relay `recvmmsg` implementation is now buildable and
much safer for many connections, but these droplet runs still do not show a
clear throughput win. Keep `--udp-recvmmsg` opt-in for now. The next useful
step is to instrument actual batch occupancy on connected relay sockets; if
most readiness events return one datagram, `recvmmsg` will mostly add setup
work without reducing syscalls.

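For orientation, a single batched read of the kind discussed above might look roughly like this. It is a minimal sketch, not coturn's code: `rx_scratch`, `rx_batch`, `BATCH_MAX`, and `DGRAM_MAX` are hypothetical names, the buffer size is illustrative, and ancillary-data (TTL/TOS) handling is left out. The return value is the per-call batch occupancy that the learning above proposes to measure.

```c
#define _GNU_SOURCE /* recvmmsg() and struct mmsghdr are GNU/Linux extensions */
#include <assert.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>

#define BATCH_MAX 16   /* 16-datagram batch, as used in this log */
#define DGRAM_MAX 2048 /* illustrative buffer size (assumption) */

/* Caller-owned scratch, e.g. one instance per relay thread. */
struct rx_scratch {
  struct mmsghdr msgs[BATCH_MAX];
  struct iovec iov[BATCH_MAX];
  struct sockaddr_storage from[BATCH_MAX];
  char buf[BATCH_MAX][DGRAM_MAX];
};

/*
 * Receive up to BATCH_MAX datagrams with one non-blocking recvmmsg() call.
 * Returns the number of datagrams received (lengths in msgs[i].msg_len),
 * or -1 with errno set (EAGAIN/EWOULDBLOCK when nothing is queued).
 */
static int rx_batch(int fd, struct rx_scratch *s) {
  memset(s->msgs, 0, sizeof(s->msgs));
  for (int i = 0; i < BATCH_MAX; i++) {
    s->iov[i].iov_base = s->buf[i];
    s->iov[i].iov_len = DGRAM_MAX;
    s->msgs[i].msg_hdr.msg_iov = &s->iov[i];
    s->msgs[i].msg_hdr.msg_iovlen = 1;
    s->msgs[i].msg_hdr.msg_name = &s->from[i];
    s->msgs[i].msg_hdr.msg_namelen = sizeof(s->from[i]);
  }
  return recvmmsg(fd, s->msgs, BATCH_MAX, MSG_DONTWAIT, NULL);
}
```

If `rx_batch` mostly returns 1, the header setup in the loop is pure overhead compared with a plain `recvfrom`; if it mostly returns values near `BATCH_MAX`, one syscall is replacing many.
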
DigitalOcean occupancy check on 2026-05-09:

- Built fresh current artifacts from `388b15d4` on both droplets under
  `/root/coturn_recvmmsg_current`.
- Same-binary `--udp-recvmmsg` off/on, `-Y packet -m 1 -l 120`, 3 alternating
  30 s rounds each:
  - off mean 153,133, median 153,608, stdev 4,383
  - on mean 148,452, median 149,711, stdev 10,833
  - on was -3.1 % by mean and -2.5 % by median
- `m=1` occupancy from the on runs: 1,129,427 `recvmmsg` calls returned
  17,660,300 packets, average batch 15.64. Histogram buckets:
  `hist_1=1,353`, `hist_2=1,496`, `hist_3_4=3,707`,
  `hist_5_8=14,817`, `hist_9_16=1,108,057`; 98.1 % of calls were in the
  `9..16` bucket.
- Same-binary `--udp-recvmmsg` off/on, `-Y packet -m 100 -l 120`, 3 alternating
  runs each:
  - off mean 55,443, median 50,679, stdev 8,369
  - on mean 60,596, median 65,404, stdev 8,383
  - on was +9.3 % by mean and +29.1 % by median, but the client again landed
    in two send-volume buckets, so treat the throughput delta as noisy.
- `m=100` occupancy from the on runs across all relay threads: 1,426,401
  `recvmmsg` calls returned 16,188,946 packets, average batch 11.35.
  Histogram buckets: `hist_1=83,057`, `hist_2=79,781`,
  `hist_3_4=130,066`, `hist_5_8=188,259`, `hist_9_16=945,238`; 66.3 %
  of calls were in the `9..16` bucket.

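The occupancy figures above (`calls`, `packets`, `avg_batch`, and the `hist_*` buckets) can be reproduced with very little bookkeeping. The sketch below shows one way to do it; the struct layout and function names are hypothetical, and treating any non-positive return as a would-block is a simplifying assumption rather than how the real counters classify errors.

```c
#include <assert.h>
#include <stdint.h>

/* Bucket names mirror the hist_* fields quoted in this log. */
enum { HIST_1, HIST_2, HIST_3_4, HIST_5_8, HIST_9_16, HIST_BUCKETS };

/* Hypothetical counter layout; the real per-engine struct may differ. */
struct recvmmsg_stats {
  uint64_t calls;      /* recvmmsg() calls that returned >= 1 datagram */
  uint64_t packets;    /* datagrams returned across those calls */
  uint64_t wouldblock; /* calls that returned nothing (assumption: any n <= 0) */
  uint64_t hist[HIST_BUCKETS];
};

/* Map a batch size in 1..16 to its histogram bucket. */
static int batch_bucket(int n) {
  assert(n >= 1 && n <= 16);
  if (n == 1) return HIST_1;
  if (n == 2) return HIST_2;
  if (n <= 4) return HIST_3_4;
  if (n <= 8) return HIST_5_8;
  return HIST_9_16;
}

/* Record the return value of one recvmmsg() call. */
static void stats_record(struct recvmmsg_stats *st, int n) {
  if (n <= 0) {
    st->wouldblock++;
    return;
  }
  st->calls++;
  st->packets += (uint64_t)n;
  st->hist[batch_bucket(n)]++;
}

/* Average datagrams per successful call (the avg_batch figure). */
static double stats_avg_batch(const struct recvmmsg_stats *st) {
  return st->calls ? (double)st->packets / (double)st->calls : 0.0;
}

/* Percentage of successful calls that fell into one bucket. */
static double stats_bucket_share(const struct recvmmsg_stats *st, int bucket) {
  return st->calls ? 100.0 * (double)st->hist[bucket] / (double)st->calls : 0.0;
}
```

Applied to the `m=1` numbers above, `packets / calls` is 17,660,300 / 1,129,427, which is the 15.64 average batch, and the `9..16` share is 1,108,057 / 1,129,427, which is the 98.1 % figure.
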
Learning: receive-side occupancy is high. The earlier hypothesis that
`recvmmsg` was mostly returning one packet is wrong for this harness. The
remaining bottleneck is after receive: per-packet callbacks, TURN processing,
and especially one `sendto` per relayed packet. The per-thread scratch change
is still worth keeping for memory/cache behavior with thousands of sockets,
but the next performance lever should be send-side batching or a design that
passes batches deeper instead of immediately decomposing them back into
single-packet callbacks.

Alternating A/B run on the same droplet pair, m=1 packet flood, 30 s
per run, with a 4 s warm-up between binary swaps:

- Baseline (clean `master` binary): mean 146,984 round-trips / 30 s
- Cumulative (all 5 iters): mean 155,468 round-trips / 30 s
- **+5.8 % throughput**

Per-iteration deltas were within run-to-run noise (~5–10 % variance).
The cumulative effect is what's visible.

## Test setup that was used

Two `c-4` Ubuntu 24.04 droplets in `nyc1`, same VPC `default-nyc1`.
Current active pair:

- `coturn-turnserver` — public `157.230.3.102`, private `10.116.0.2`
- `coturn-loadgen` — public `167.99.153.216`, private `10.116.0.3`

Older pair used for the iter 5 cumulative run:

- `coturn-turnserver` — public `68.183.121.197`, private `10.116.0.2`
- `coturn-loadgen` — public `68.183.132.220`, private `10.116.0.3`

Created via the DigitalOcean v2 API (`doctl` is *not* installed; use
`curl` + `$DIGITALOCEAN_TOKEN` from the user's `~/.zshrc`). SSH via
`~/.ssh/id_rsa` (matches DO ssh key id `23704483`, fingerprint
`37:3a:9b:e3:1e:1a:9b:42:a0:6f:58:f5:5a:3a:6a:2c`).

State on the turnserver droplet (kept across iterations):

- `/root/coturn_clean.tar` — `git archive HEAD` of master at start of run.
  Re-extract this before applying any new patch.
- `/root/coturn_baseline/build/bin/turnserver` — clean baseline binary,
  used as the "B" in every A/B round. **Don't overwrite.**
- `/root/coturn/build/bin/turnserver` — current iteration binary.
- `/root/start_turnserver.sh`, `/root/baseline_run.sh` — helper scripts.

State on the loadgen droplet:

- `/root/coturn/build/bin/turnutils_uclient`, `turnutils_peer`.
- `turnutils_peer` runs as a daemon on `10.116.0.3:3480`
  (`pid` in `/root/peer.pid`).

A small env file was written to `/tmp/coturn_perf_env.sh` on the local
machine with the IPs / droplet IDs — recreate it from the current
state of the DO account if it gets lost.

The standard packet-flood command (matches CLAUDE.md baseline, runs without
`--udp-recvmmsg`; add `--udp-recvmmsg` to `turnserver`, not the client, for the
batched listener/relay receive path):

```bash
timeout -s INT 30s /root/coturn/build/bin/turnutils_uclient \
  -Y packet -m 1 -l 120 \
  -e 10.116.0.3 -r 3480 -X -g \
  -u user -W secret \
  10.116.0.2
```

Metric: the `tot_recv_msgs` field on the last `start_mclient:` log
line. (This is round-trips through the relay over the test window;
`send_pps` is loadgen-side only and can hit 262 K even when the relay
is dropping most of them, so it's not a useful proxy for relay
throughput.)

## Hot-path map at the end of iter 5

`perf record -F 99 -g` on the turnserver during a 12 s `-Y packet -m 1`
run, sorted by user-space self-time:

```
0.80 %  send_data_from_ioa_socket_nbh
0.76 %  socket_input_worker
0.69 %  read_client_connection.isra.0
0.60 %  turn_report_session_usage
0.53 %  peer_input_handler
0.51 %  udp_server_input_handler
0.35 %  udp_recvfrom                  # was 0.76 % at iter 1
0.34 %  lm_map_get
0.27 %  stun_is_channel_message_str
0.27 %  get_relay_socket
0.26 %  ioa_socket_check_bandwidth    # was 0.33 % at iter 1
0.26 %  udp_send                      # was 0.60 % at iter 1
0.18 %  ioa_network_buffer_get_size
```

Total user-space coturn cycles: ~5–7 % of the relay thread.
The relay thread sits at ~100 % CPU pinned to one core; the 4 relay
threads aren't parallelised by the m=1 single-flow test (one 5-tuple
hashes to one SO_REUSEPORT worker).

Kernel side (children-aggregated) is the real cost:

```
 36 %  udp_sendmsg (sendto path)
 14 %  udp_recvmsg
 17 %  ip_finish_output / ip_output / __dev_queue_xmit
~23 %  syscall enter / exit machinery (sysret, SYSRETQ, SYSCALL_64*)
```

That ~23 % syscall overhead is the next big lever. Halving it
(via batching) is worth ~10 % wall-clock CPU.

## What didn't work

### Default `--udp-recvmmsg=true` on Linux — opt-in in iters 1–11, **shipped default-on later**

Original finding (iters 1–11): the flag applied the 16-buffer batch path to
**every connected per-session relay socket**, which only ever receives one
flow. Throughput parity or slight negative results were confirmed across
multiple A/B rounds on `m=1` and `m=100` — the per-session prealloc churn ate
the listener-side win — so the flag was kept opt-in.

Resolution: a later change scoped `recvmmsg` to **shared fan-in sockets only**
(the `udp_recvmmsg_eligible` flag in
[ns_ioalib_impl.h](../src/apps/relay/ns_ioalib_impl.h) — set on the client
listener and, under `--multiplex-peer`, the per-thread shared relay socket).
Per-session relay sockets now stay on the single-recv path regardless of the
flag, so the per-session tax that motivated keeping it opt-in is gone. With the
cost surface reduced to the one genuine fan-in point (the listener), the flag
ships **on by default on Linux**; operators opt out with `--udp-recvmmsg=false`.
DTLS session sockets remain on the SSL read path and are never batched.

### Caching `get_relay_socket_ss` (iter 3) — no measurable wall-clock win

The function is `static inline` already and the underlying
`get_relay_socket()` is a four-line accessor. Caching the result
*does* save a cross-TU function call per packet (the compiler can't
prove `get_relay_socket` pure across the
`set_df_on_ioa_socket` / `ioa_network_buffer_*` calls in between),
which the perf profile picked up as a small redistribution, but
throughput stayed in the noise band. Kept anyway: the cleanup is
defensible and matches the iter 4/5 inlining direction.

## Methodology lessons

- **Always alternate A/B per round** rather than running 5×B then 5×I.
  The droplet pair has noticeable environmental drift over a few
  minutes (other tenants on the hypervisor, NIC ring backpressure,
  whatever); sequential blocks bias whichever binary ran on the worse
  half of the run.
- **Discard the first run after a turnserver restart.** The loadgen's
  first run after a server restart is consistently 30–80 % slower
  than steady-state — looks like channel/permission state in the
  client side warming up, not the server. A 4 s "throwaway" run
  before the measured 30 s run is enough.
- **Run-to-run variance is ~5–10 %** even with alternation. Plan on
  6–8 rounds (≈ 8 minutes wall-clock) before claiming a sub-10 % win.
  A single 3-round A/B will lie to you.
- **Use the `tot_recv_msgs` field, not `send_pps`**. Loadgen send rate
  saturates at ~262 K pps regardless of relay capacity — it's
  whatever the loadgen kernel will accept into its UDP send buffer.
  The receive count is what made it round-trip through the relay.
- **The relay is kernel-bound.** User-space coturn is ~5 % of cycles.
  Halving it gives at most ~2.5 % wall-clock — usually undetectable
  per-iteration, only visible cumulatively. Don't expect a 10 % jump
  from a CSE.
- **Single-flow tests pin one core.** With `SO_REUSEPORT` the kernel
  hashes 5-tuples to worker sockets; one client → one tuple → one
  worker thread. The other 3 cores sit idle. To exercise all 4 relay
  threads you'd need m≥4 *with distinct source ports* — ours don't
  spread cleanly because the loadgen reuses ports.
- **Don't re-extract `/root/coturn` between iterations** if you want
  to keep `git apply`-style incremental patches working. The droplet
  copy is *not* a git checkout (it's the `git archive` tar). Use
  `patch -p1`. Each iteration instead uploaded a *cumulative* diff
  (current branch vs `master`) and re-extracted from
  `/root/coturn_clean.tar` first to get a clean apply.

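The per-round summaries in this log (mean, median, relative delta) are simple to recompute from raw `tot_recv_msgs` values. The helpers below are a minimal sketch with hypothetical names; they are not part of the harness, and standard deviation is left out to keep the example free of `libm`.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* qsort comparator for doubles, ascending. */
static int ab_cmp_double(const void *a, const void *b) {
  double x = *(const double *)a;
  double y = *(const double *)b;
  return (x > y) - (x < y);
}

/* Arithmetic mean of n round results (n must be > 0). */
static double ab_mean(const double *v, size_t n) {
  double sum = 0.0;
  for (size_t i = 0; i < n; i++) {
    sum += v[i];
  }
  return sum / (double)n;
}

/* Median of n round results (n must be > 0). Sorts v in place. */
static double ab_median(double *v, size_t n) {
  qsort(v, n, sizeof(*v), ab_cmp_double);
  return (n % 2) ? v[n / 2] : 0.5 * (v[n / 2 - 1] + v[n / 2]);
}

/* Relative change of `candidate` against `baseline`, in percent. */
static double ab_pct_delta(double candidate, double baseline) {
  return 100.0 * (candidate - baseline) / baseline;
}
```

With the cumulative means reported earlier (146,984 baseline, 155,468 after five iterations), `ab_pct_delta(155468, 146984)` evaluates to roughly 5.8, which sits inside the ~5–10 % run-to-run variance and is why several alternating rounds are needed before trusting it.
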
## Optimization backlog (bigger fish for next session)

Ordered by expected impact for the m=1 packet-flood metric:

1. **Batch the send side (`sendmmsg`) or pass receive batches deeper.** The
   occupancy counters show receive batching is already working: `m=1` averaged
   15.6 packets per call and `m=100` averaged 11.4. The code immediately
   invokes the existing per-packet callback for each received datagram, and
   each forwarded packet still pays a separate send syscall. The next
   measurable lever is to queue per-thread outbound datagrams during a receive
   batch and flush them with `sendmmsg`, or introduce a batch-aware callback
   path for the hot UDP relay case.

2. **Keep `recvmmsg` occupancy counters available while developing send
   batching.** They are cheap enough for targeted performance builds and make
   it obvious whether a benchmark is exercising one relay thread or all relay
   threads. Consider hiding periodic logs behind a verbose/debug option before
   shipping broadly.

3. **GSO (`UDP_SEGMENT`)** on the send path. Linux can take one
   "large" datagram and segment it in the kernel for back-to-back
   packets to the same destination. Our channel-data flood IS
   same-destination. Setting `UDP_SEGMENT` and submitting a single
   `sendmsg` of N×packet_size cuts skb-alloc / `__dev_queue_xmit`
   work substantially. Needs careful handling for short tails and
   non-uniform sizes; complementary to (1).

4. **Inline more cross-TU per-packet accessors.** Pattern from iter
   4/5 still applies: `addr_eq` (called per channel-data packet for
   permission lookup), `ioa_network_buffer_get_size`,
   `get_ioa_socket_type` / `_app_type`. Each is small enough; the
   only reason to be cautious is they're declared in `ns_turn_ioalib.h`
   which is part of the public-ish server library API — moving the
   body inline doesn't break ABI but does require a recompile of all
   consumers. Likely <1 % each but cheap to do.

5. **Re-evaluate `--udp-recvmmsg` default after instrumentation.** *(Done.)*
   Scoping `recvmmsg` to shared fan-in sockets removed the per-session-relay
   tax that blocked default-on; the listener is a genuine fan-in point that
   benefits whenever client concurrency is non-trivial and costs little when
   idle (few packets ⇒ few prealloc cycles). Now ships on by default on Linux,
   with `--udp-recvmmsg=false` as the opt-out. See the resolved entry under
   "What didn't work" above.

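The inlining pattern from iter 4/5 that backlog item 4 refers to amounts to moving a tiny accessor's body into a header as `static inline`, so per-packet callers in other translation units no longer pay a call. The sketch below illustrates the shape only: `demo_addr`, `demo_addr_len`, and `demo_addr_cpy` are hypothetical stand-ins, and coturn's real address type and helpers may differ in detail.

```c
#include <assert.h>
#include <netinet/in.h>
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>

/* Hypothetical address union, standing in for the real address type. */
typedef union {
  struct sockaddr ss;
  struct sockaddr_in s4;
  struct sockaddr_in6 s6;
} demo_addr;

/*
 * Header-defined accessor: because the body is visible at every call site,
 * the compiler can inline it into per-packet code in other TUs.
 */
static inline uint32_t demo_addr_len(const demo_addr *a) {
  if (a->ss.sa_family == AF_INET) {
    return (uint32_t)sizeof(struct sockaddr_in);
  }
  if (a->ss.sa_family == AF_INET6) {
    return (uint32_t)sizeof(struct sockaddr_in6);
  }
  return 0; /* unknown family */
}

/* Header-defined copy helper; a fixed-size memcpy the compiler can expand. */
static inline void demo_addr_cpy(demo_addr *dst, const demo_addr *src) {
  memcpy(dst, src, sizeof(*dst));
}
```

The trade-off is the one noted in item 4: the change does not alter struct layout, but every consumer of the header has to be recompiled to pick up the inline body.
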
## Things investigated and ruled out (don't redo)

- `set_socket_ttl` / `set_socket_tos` already short-circuit on
  no-change via `s->current_ttl != ttl` / `s->current_tos != tos`.
  In a steady-state flood the per-packet call returns immediately
  without `setsockopt`. Already optimized.
- `set_df_on_ioa_socket` similarly guarded
  ([ns_ioalib_engine_impl.c:242](../src/apps/relay/ns_ioalib_engine_impl.c#L242)).
- `turn_report_session_usage` slow path runs once per 4096 packets
  (see iter 1 commit); the per-call overhead is now ~3 reads + 1
  bitmask test + 1 conditional return.
- `MSG_CONFIRM` in `sendto` would skip ARP refresh, but
  `neigh_resolve_output` + `neigh_hh_output` show ~17 % combined in
  perf only because we're sending *that many* packets — per-packet
  it's the normal cached neighbor path, not a refresh.
- Increasing `MAX_TRIES` from 16 to 64 in `socket_input_worker`
  doesn't change syscall count; it only delays returning to libevent.
  Useless without (1) above.

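The "once per 4096 packets" gate mentioned for `turn_report_session_usage` is a common power-of-two bitmask pattern. The following is a generic sketch of that pattern, not the function's actual body; `session_usage`, `usage_on_packet`, and `USAGE_REPORT_MASK` are hypothetical names.

```c
#include <assert.h>
#include <stdint.h>

#define USAGE_REPORT_MASK 4095u /* 4096 - 1; valid because 4096 is a power of two */

/* Hypothetical per-session counters. */
struct session_usage {
  uint64_t packets; /* packets seen so far */
  uint64_t reports; /* how many times the slow path ran */
};

/*
 * Per-packet hook. The fast path is one increment, one bitmask test, and a
 * return; the slow path runs only when the packet count is a multiple of 4096.
 * Returns 1 if the slow path ran for this packet, 0 otherwise.
 */
static int usage_on_packet(struct session_usage *u) {
  u->packets++;
  if ((u->packets & USAGE_REPORT_MASK) != 0) {
    return 0;
  }
  u->reports++; /* placeholder for the real reporting work */
  return 1;
}
```

Feeding 8192 packets through `usage_on_packet` runs the slow path twice, at packets 4096 and 8192, which is why further tuning of the fast path was ruled out.
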
## How to resume

1. Verify the droplets are still up (the IPs above). If they were
   destroyed, re-create with `c-4` / `nyc1` / `default-nyc1` VPC and
   the `pavel` SSH key (id 23704483).
2. Re-upload `/tmp/coturn_clean.tar` from `git archive master` and
   rebuild `/root/coturn_baseline/build/bin/turnserver` if the
   baseline binary is gone. The A/B harness depends on having both
   binaries side-by-side on the turnserver droplet.
3. Run a 6-round alternating A/B as a sanity check that the current
   tip-of-branch still beats `master` by ~5 %. If it doesn't, the
   environment drifted and the baseline needs re-anchoring.
4. Pick the next item from the backlog. Item (1), send-side batching
   (`sendmmsg`) or passing receive batches deeper, is where the next
   material gain is expected.

## 2026-05-03 sendmmsg follow-up

A later run on two DigitalOcean CPU-optimized `c-4` droplets in `sfo3`
(`10.124.0.2` turnserver, `10.124.0.3` loadgen) tested an experimental
Linux-only `--udp-sendmmsg` flag with `--udp-recvmmsg`.

| Run | Code/flags | Generator max pps | Generator avg pps | Server RX avg pps | Server TX avg pps | Server TX peak pps | CPU avg | Perf conclusion |
| --- | --- | ---: | ---: | ---: | ---: | ---: | ---: | --- |
| iter0 | baseline, `--udp-recvmmsg` | 335,872 | 286,721 | 360,900 | 257,357 | 323,488 | 97.8% | `sendto`/`udp_sendmsg` dominates |
| iter1 | `--udp-sendmmsg`, both directions | 409,600 | 312,662 | 428,184 | 197,300 | 260,453 | 99.8% | `sendmmsg` path dominates; TX regressed |
| iter2 | `sendmmsg` only for batches >= 4 | 393,216 | 315,393 | 398,121 | 163,626 | 215,068 | 98.9% | Threshold did not recover TX |
| iter3 | listener-side batching only | 425,984 | 286,038 | 376,444 | 210,050 | 332,417 | 97.4% | Peak ingress/TX improved, average TX still below baseline |

Validation result: `sendmmsg()` is not a proven general win for this workload.
It can increase generator max pps and peak server TX, but average delivered
server TX stayed below the `--udp-recvmmsg` baseline. Keep it opt-in until a
follow-up change proves better end-to-end relay throughput.

Perf still points at per-datagram kernel transmit cost:

- baseline: `udp_send -> sendto -> __sys_sendto -> udp_sendmsg -> udp_send_skb -> ip_output`
- sendmmsg variants: `udp_sendmmsg_flush -> __sendmmsg -> __sys_sendmmsg -> ___sys_sendmsg -> udp_sendmsg -> ip_output`

The key observation is that `sendmmsg()` reduces syscall entry count but still
walks `udp_sendmsg` and the IP output path once per datagram. On this workload,
the extra `mmsghdr` copy/looping overhead can offset the syscall savings.

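To make the trade-off concrete, a queue-then-flush send path of the kind tested here could be sketched as follows. This is an illustration under simplifying assumptions, not the `--udp-sendmmsg` implementation: `tx_queue`, `txq_enqueue`, and `txq_flush` are hypothetical names, the socket is assumed to be connected (so no per-message destination), and datagrams left unsent after an error are simply dropped.

```c
#define _GNU_SOURCE /* sendmmsg() and struct mmsghdr are GNU/Linux extensions */
#include <assert.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>

#define TXQ_MAX 16

/* Hypothetical per-thread outbound queue. Payloads must stay valid until flush. */
struct tx_queue {
  struct mmsghdr msgs[TXQ_MAX];
  struct iovec iov[TXQ_MAX];
  unsigned int count;
};

/* Queue one datagram. Returns 0 on success, -1 if the queue is full. */
static int txq_enqueue(struct tx_queue *q, const void *data, size_t len) {
  if (q->count >= TXQ_MAX) {
    return -1;
  }
  unsigned int i = q->count++;
  memset(&q->msgs[i], 0, sizeof(q->msgs[i]));
  q->iov[i].iov_base = (void *)data;
  q->iov[i].iov_len = len;
  q->msgs[i].msg_hdr.msg_iov = &q->iov[i];
  q->msgs[i].msg_hdr.msg_iovlen = 1;
  return 0;
}

/*
 * Flush the queue with as few sendmmsg() calls as possible.
 * Returns the number of datagrams handed to the kernel; stops at the first
 * error and discards whatever was still queued.
 */
static int txq_flush(int fd, struct tx_queue *q) {
  unsigned int sent = 0;
  while (sent < q->count) {
    int n = sendmmsg(fd, q->msgs + sent, q->count - sent, 0);
    if (n <= 0) {
      break;
    }
    sent += (unsigned int)n;
  }
  q->count = 0;
  return (int)sent;
}
```

The sketch saves syscall entries only: inside the kernel each `mmsghdr` entry is still sent as its own datagram, which matches the perf chains listed above.
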
Deferred bigger refactors from this run:

- Per-peer connected UDP relay sockets or a destination cache could reduce
  address handling and route lookup for repeated peer sends, but it changes
  relay socket semantics and receive filtering.
- Shard a single hot allocation/flow across multiple relay workers only with a
  careful design for ordering, session accounting, socket ownership, and lock
  contention.
- Investigate `io_uring` send batching or kernel-bypass style transmit only as
  a larger architecture experiment.
- Consider a purpose-built benchmark mode that measures delivered relay pps at a
  controlled input rate. The current saturated packet flood is useful for
  finding hot functions but can obscure end-to-end delivery changes.

## 2026-05-09 UDP-GSO send path (`--udp-gso`)

Realizes the GSO backlog item from the iter-5 backlog above. The recvmmsg /
sendmmsg follow-ups confirmed that on this workload the dominant cost is the
per-datagram kernel TX path (`udp_sendmsg → ip_finish_output → __dev_queue_xmit
→ start_xmit`), which mmsg-style batching does not collapse. UDP-GSO (Linux
`UDP_SEGMENT` cmsg) does collapse it: N same-destination, same-size datagrams
are submitted as one `sendmsg` carrying an iovec; the kernel allocates one
super-skb that traverses the network stack once and is split at egress (NIC).

Implementation lives in [src/apps/relay/ns_ioalib_engine_impl.c](../src/apps/relay/ns_ioalib_engine_impl.c)
and reuses the existing `--udp-sendmmsg` batch state. Eligibility (same fd,
same dest, same size, ≤ 1472 B per datagram) is tracked on every
`udp_sendmmsg_enqueue`; eligible flushes go through `udp_gso_attempt_flush`
ahead of the `sendmmsg` loop, with an automatic sticky disable on
`EINVAL/ENOPROTOOPT` so a kernel/NIC without GSO support gracefully falls back.
The relay-side `socket_udp_read_batch_recvmmsg` now wraps its callback loop
in `udp_sendmmsg_batch_begin/end` so peer→client sends triggered inside a
recvmmsg batch can also coalesce — without that wrapping, the relay path
issues one `sendto` per delivered datagram.

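A rough sketch of the mechanism described above follows: an eligibility check over a batch, one `sendmsg` carrying a `UDP_SEGMENT` control message, and a sticky disable on `EINVAL`/`ENOPROTOOPT`. The names `gso_eligible`, `gso_send`, `gso_disabled`, and `GSO_MAX_DGRAM` are hypothetical and do not mirror `udp_gso_attempt_flush`; the sketch also assumes a connected socket (so same-destination is implicit) and a contiguous payload buffer rather than a multi-entry iovec.

```c
#include <assert.h>
#include <errno.h>
#include <netinet/in.h>
#include <netinet/udp.h>
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>

#ifndef SOL_UDP
#define SOL_UDP 17
#endif
#ifndef UDP_SEGMENT
#define UDP_SEGMENT 103 /* available since Linux 4.18 */
#endif

#define GSO_MAX_DGRAM 1472 /* per-datagram cap quoted in this log */

static int gso_disabled; /* sticky: set once the kernel rejects UDP_SEGMENT */

/* Eligible only if GSO is still enabled and all n >= 2 datagrams share one size within the cap. */
static int gso_eligible(const size_t *lens, size_t n) {
  if (gso_disabled || n < 2 || lens[0] == 0 || lens[0] > GSO_MAX_DGRAM) {
    return 0;
  }
  for (size_t i = 1; i < n; i++) {
    if (lens[i] != lens[0]) {
      return 0;
    }
  }
  return 1;
}

/*
 * Submit n equal-size payloads stored back to back in buf as one GSO sendmsg().
 * Returns bytes sent, or -1 with errno set. On EINVAL/ENOPROTOOPT the sticky
 * flag is set so callers fall back to their non-GSO path from then on.
 */
static ssize_t gso_send(int fd, const void *buf, size_t seg_len, size_t n) {
  union {
    char raw[CMSG_SPACE(sizeof(uint16_t))];
    struct cmsghdr align; /* forces correct alignment of the control buffer */
  } ctl;
  struct iovec iov = {.iov_base = (void *)buf, .iov_len = seg_len * n};
  struct msghdr mh;

  memset(&mh, 0, sizeof(mh));
  memset(&ctl, 0, sizeof(ctl));
  mh.msg_iov = &iov;
  mh.msg_iovlen = 1;
  mh.msg_control = ctl.raw;
  mh.msg_controllen = sizeof(ctl.raw);

  struct cmsghdr *cm = CMSG_FIRSTHDR(&mh);
  cm->cmsg_level = SOL_UDP;
  cm->cmsg_type = UDP_SEGMENT;
  cm->cmsg_len = CMSG_LEN(sizeof(uint16_t));
  uint16_t seg = (uint16_t)seg_len; /* segment size the kernel splits on */
  memcpy(CMSG_DATA(cm), &seg, sizeof(seg));

  ssize_t rc = sendmsg(fd, &mh, 0);
  if (rc < 0 && (errno == EINVAL || errno == ENOPROTOOPT)) {
    gso_disabled = 1;
  }
  return rc;
}
```

A caller would check `gso_eligible` at flush time, try `gso_send` when it returns 1, and otherwise (or after a failure) send the same batch through its existing per-datagram or `sendmmsg` path.
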
DigitalOcean validation on 2026-05-09 — fresh nyc1 `c-4` droplets (turn
`10.116.0.4`, load `10.116.0.5`), all variants built from the same source tree
under `/root/coturn/build`, `-Y packet -m 1 -l 120`, monitor window via `sar
-n DEV` for `eth1`, `mpstat`, `pidstat`. The 12 s sweep first established the
ordering, then a 30 s alternating A/B (`baseline → gso → baseline → gso`)
confirmed the magnitude of the delta:

| Variant | eth1 RX pps | eth1 TX pps | sys CPU | idle CPU |
|---|---:|---:|---:|---:|
| baseline_r1 | 322,091 | 127,445 | 22.9% | 67.5% |
| `--udp-recvmmsg --udp-sendmmsg --udp-gso` (gso_r1) | 266,068 | **257,996** | 15.0% | 78.7% |
| baseline_r2 | 309,475 | 125,573 | 20.9% | 70.7% |
| gso_r2 | 275,992 | **225,366** | 14.9% | 74.3% |

Mean server forwarding rate (eth1 TX): baseline 126,509 pps → GSO 241,681 pps,
**+91 % (1.91×)**, with mean system CPU dropping from 21.9 % to 14.9 % — about
**2.8× CPU efficiency** in TX pps per system-CPU-%.

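The headline figures above follow directly from the four table rows. As a check, the small sketch below recomputes them; the struct and function names are hypothetical and exist only for this calculation.

```c
#include <assert.h>
#include <stddef.h>

/* One monitored run: eth1 TX rate and system CPU share. */
struct run_sample {
  double tx_pps;      /* eth1 TX packets per second */
  double sys_cpu_pct; /* system CPU, percent */
};

/* Mean eth1 TX rate over n runs (n must be > 0). */
static double mean_tx(const struct run_sample *r, size_t n) {
  double sum = 0.0;
  for (size_t i = 0; i < n; i++) {
    sum += r[i].tx_pps;
  }
  return sum / (double)n;
}

/* Mean system CPU percentage over n runs (n must be > 0). */
static double mean_sys(const struct run_sample *r, size_t n) {
  double sum = 0.0;
  for (size_t i = 0; i < n; i++) {
    sum += r[i].sys_cpu_pct;
  }
  return sum / (double)n;
}

/* TX packets per second delivered per percentage point of system CPU. */
static double tx_per_cpu_pct(const struct run_sample *r, size_t n) {
  return mean_tx(r, n) / mean_sys(r, n);
}
```

Using the baseline rows (127,445 at 22.9 % and 125,573 at 20.9 %) and the GSO rows (257,996 at 15.0 % and 225,366 at 14.9 %), the ratio of `mean_tx` values is about 1.91 and the ratio of `tx_per_cpu_pct` values is about 2.8.
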
12 s packet sweep, all four variants, mean send_pps reported by uclient (used
only for ordering — for absolute throughput trust eth1 TX above):

| Variant | m=1 | m=2 | m=4 | m=8 | m=16 | m=32 |
|---|---:|---:|---:|---:|---:|---:|
| baseline | 230,401 | 150,189 | 187,055 | 174,771 | 160,871 | 167,789 |
| `--udp-recvmmsg` | 255,660 | 148,824 | 174,767 | 142,997 | 150,743 | 144,200 |
| `--udp-recvmmsg --udp-sendmmsg` | 231,766 | 146,776 | 148,826 | 136,542 | 148,955 | 143,575 |
| `--udp-recvmmsg --udp-sendmmsg --udp-gso` | 136,876 | 147,458 | 124,250 | 131,081 | 137,636 | 114,714 |

The uclient generator reports its own send rate, which drops with GSO because
the loadgen droplet's `turnutils_peer` becomes the new bottleneck — it is
single-threaded and cannot reflect 240 k pps. The 30 s `eth1` capture is the
authoritative server-side metric; `sweep_m1` is retained only to show that
GSO does not regress in the moderately-loaded `m=2..32` range relative to
`recvmmsg+sendmmsg`.

Perf children share, m=1 12 s perf record on the turnserver process:

| Symbol | baseline | recvmmsg | recvsendmmsg | gso |
|---|---:|---:|---:|---:|
| `__x64_sys_sendto` (children) | 43.6 % | 47.6 % | 22.8 % | 0.0 % |
| `__x64_sys_sendmsg` (children) | — | — | — | **38.1 %** |
| `__x64_sys_sendmmsg` (children) | — | — | 27.0 % | 0.0 % |
| `udp_sendmsg` | 38.8 % | 41.9 % | 20.6 % | 35.9 % |
| `__dev_queue_xmit` | 18.5 % | — | — | 29.3 % |
| `skb_segment` (egress GSO split) | absent | absent | absent | 2.2 % |
| `syscall_return_via_sysret` (self) | 7.2 % | 4.7 % | 4.4 % | 2.4 % |
| `entry_SYSCALL_64_after_hwframe` (self) | 4.1 % | 3.6 % | 2.6 % | 1.8 % |

In the GSO column the per-packet kernel-stack cost is now amortized across
the segments of a single super-skb. The proportional rise of
`__dev_queue_xmit` is misleading on its own — it reflects a smaller
denominator (CPU usage dropped) while the per-packet absolute cost dropped.

Operational notes:

- Flag is opt-in. `--udp-gso` requires `--udp-sendmmsg`; without that flag
  the batch state never accumulates and GSO has nothing to flush. The
  `--help` text states the dependency.
- GSO eligibility resets on every `_begin/_end`. Mixed-destination or
  mixed-size workloads transparently fall back through the existing
  `sendmmsg` and `udp_send` paths.
- Sticky disable on `EINVAL/ENOPROTOOPT` keeps a process running on a
  non-virtio host or older kernel from retrying the failing GSO path on
  every flush. A WARNING line is logged once.
- Tested on Linux 6.8 + virtio-net (DO `c-4`), `gso_max_segs=65535`. Older
  hosts (kernel <4.18) lack `UDP_SEGMENT` entirely; the sticky-disable
  path covers them.

Suggested next levers if more relay throughput is needed:

1. **Move loadgen off turnutils_peer.** The 240 k → 90 k tot_recv_msgs/30 s
   gap at GSO is dominated by single-threaded peer reflection, not the TURN
   server. A multi-thread peer or `pktgen`-style reflector would let us
   measure the real ceiling.
2. **Per-peer connected relay sockets.** Same-destination is the GSO
   eligibility predicate; a connected relay socket would always be
   same-dest and would also save `route_lookup` per send.
3. **`MSG_ZEROCOPY` on the GSO sendmsg.** `rep_movs_alternative` is still
   3 % self in GSO, and zerocopy avoids the userspace→kernel copy.
   Probably small for 32-B STUN packets; revisit when payloads are larger.

Artifacts (perf.data, sar/mpstat/pidstat, sweep logs, AB logs) are saved at
`perf-results-20260508-213056/` in the worktree.