# Performance iteration log
Running notes for the multi-iteration performance work on the UDP relay
data path. Pick this up to continue without re-deriving everything.
The harness, baseline command, and droplet topology are documented in
[CLAUDE.md ](../CLAUDE.md ) under "Load Test on DigitalOcean" — this file
captures the *deltas* : what was measured, what landed, what didn't, and
where the next round should go.
## Cumulative result
Five commits on `claude/beautiful-black-c3b741` between `727ec2ab`
("loadgen") and `321a2d18` :
| # | Commit | Optimization |
|---|---|---|
| 1 | `ce7e7e53` | Hoist `turn_server_get_engine()` out of per-packet hot path |
| 2 | `8e28491a` | `ioa_socket_check_bandwidth` early fast-exit; drop dead `if (!(s->done \|\| s->fd==-1))` in `send_data_from_ioa_socket_nbh` |
| 3 | `344360f6` | Cache `get_relay_socket_ss()` and `ioa_network_buffer_get_size()` in `write_to_peerchannel` , `handle_turn_send` , `read_client_connection` |
| 4 | `a6f6767f` | Inline `get_ioa_addr_len()` via `ns_turn_ioaddr.h` |
| 5 | `321a2d18` | Inline `addr_cpy()` via `ns_turn_ioaddr.h` |
Relay recvmmsg (#1906)
## Summary
Extends the existing Linux-only `--udp-recvmmsg` flag from the UDP
listener socket to also cover **connected per-session UDP relay
sockets**, so steady-state client→relay and peer→relay traffic on plain
UDP is read in batches of up to 16 datagrams per `recvmmsg(2)` instead
of one `recvmsg` per packet. DTLS sessions still go through the SSL read
path and are unchanged.
The flag stays **opt-in**: receive-side batching works correctly, but on
the current `m=1` / `m=100` benchmarks throughput is flat to slightly
negative — the bottleneck has moved past receive (see results below).
## What's in the change
- **Shared receive helpers** (`src/apps/relay/ns_ioalib_engine_impl.c`,
`src/apps/relay/ns_ioalib_impl.h`):
- `ioa_parse_udp_recvmsg_cmsg()` — single TTL/TOS/`IP_RECVERR` cmsg
parser used by both `udp_recvfrom()` and the new batch path. Replaces
the duplicated parser previously inlined in `dtls_listener.c` and
`udp_recvfrom()`.
- `ioa_init_recvmmsg_hdr()` — single initializer for
`mmsghdr`/`iovec`/cmsg/source-address fields, also used by the listener.
- New `IOA_UDP_RECVMMSG_MAX_BATCH = 16` constant; both listener and
relay paths now share it.
- **Connected relay batch read** (`socket_udp_read_batch_recvmmsg` in
`ns_ioalib_engine_impl.c`): called from `socket_input_worker` for
non-SSL UDP sockets when `--udp-recvmmsg` is on. Allocates per-message
`stun_buffer_list_elem`s, calls `recvmmsg(MSG_DONTWAIT)`, dispatches
each datagram through the existing `read_cb` path, and falls back
cleanly on `ENOSYS`/`EINVAL`/`EOPNOTSUPP` (auto-disables the flag) and
on `EAGAIN`/short-batch (releases unused buffers).
- **Per-engine scratch state**: the `mmsghdr[16]` / `iovec[16]` / cmsg /
src-addr arrays live on `ioa_engine`, not on every socket — keeps memory
flat at thousands of allocations.
- **TTL/TOS-sized cmsg buffers** in the listener: the listener
previously over-allocated `64 KiB` per slot; it now uses the same
TTL+TOS sizing as the relay path.
- **Opt-in occupancy stats** behind a new `--udp-recvmmsg-log` flag:
every 10 s the relay logs `udp-recvmmsg stats: calls=… packets=…
avg_batch=… wouldblock=… unavailable=… no_buffer=… hist_1=… hist_2=…
hist_3_4=… hist_5_8=… hist_9_16=…`. Counters are always tracked (cheap);
the periodic log is gated by the new flag so default operation is
silent.
- **CLI plumbing**: `--udp-recvmmsg-log` long option in
`mainrelay.c`/`mainrelay.h`, `cli_print_flag` entry in
`turn_admin_server.c`, doc updates in `README.turnserver`.
- **Docs**: `docs/PerformanceIterationLog.md` records the iteration
steps, validation, and two rounds of DigitalOcean A/B numbers.
`CLAUDE.md` load-test instructions updated to mention the new flag and
the `tot_recv_msgs` / `tot_recv_bytes` workaround.
1 month ago
Current `relay-recvmmsg` follow-up:
| # | Commit | Optimization |
|---|---|---|
| 6 | `54c589d0` / `4b1a8d71` | Initial Linux `recvmmsg` batching for UDP listener and connected relay sockets |
| 7 | `8d9a7292` | Share the existing `--udp-recvmmsg` flag across listener and relay UDP paths; remove separate relay flag; use the shared ancillary-data parser in `dtls_listener` |
| 8 | `d48686b7` | Reduce relay per-socket `recvmmsg` state from 16 x 64 KiB cmsg buffers to TTL/TOS-sized buffers, avoid an extra would-block fallback `recvmsg` , and clean up all preallocated buffers after partial batches |
| 9 | `ad81705e` | Add per-engine `recvmmsg` occupancy counters and 10 s log summaries (`calls`, `packets` , `avg_batch` , `wouldblock` , `unavailable` , `no_buffer` , batch-size histogram) |
| 10 | `388b15d4` | Move connected relay UDP `recvmmsg` scratch from per-socket state to per-engine/per-thread state |
| 11 | `4c4fd67e` | Make the occupancy summaries opt-in behind `--udp-recvmmsg-log` , so `--udp-recvmmsg` can ship without periodic stats logs |
Validation after #7 -#11:
- Local `cmake -S . -B build -DBUILD_TESTING=ON` passed.
- Local `cmake --build build --parallel 8` passed.
- Local `ctest --test-dir build --output-on-failure` passed 3/3.
- Local `build/bin/turnserver --udp-recvmmsg --udp-recvmmsg-log --version`
parsed both flags and printed `4.11.0` .
- Linux Docker `turnserver` build passed after #7 , after #8 , and after #10 .
Shipping cleanup learning: keep the occupancy counters in place because they
are low overhead and useful for DigitalOcean diagnostics, but keep the periodic
summaries off by default. Use `--udp-recvmmsg-log` only during measured runs
where the log stream is part of the observation.
DigitalOcean check on 2026-05-09:
- Reused the existing `c-4` droplets in `nyc1` : turnserver public
`157.230.3.102` , private `10.116.0.2` ; loadgen public `167.99.153.216` ,
private `10.116.0.3` . Droplets were left running between steps.
- Built fresh current artifacts from `d48686b7` on both droplets under
`/root/coturn_recvmmsg_current` .
- Same-binary `--udp-recvmmsg` off/on, `-Y packet -m 1 -l 120` , 5 alternating
30 s rounds each:
- off mean 154,527, median 154,596, stdev 3,467
- on mean 149,994, median 153,011, stdev 7,174
- on was -2.9 % by mean and -1.0 % by median
- Same-binary `--udp-recvmmsg` off/on, `-Y packet -m 100 -l 120` , 5 alternating
rounds each. The client completed before the 30 s timeout and landed in two
send-volume buckets, so treat this as a coarse many-connection signal:
- off mean 59,432, median 65,071, stdev 7,952
- on mean 59,640, median 65,421, stdev 7,963
- on was +0.3 % by mean and +0.5 % by median
- Follow-up `m=100 -n 1000` run, 3 alternating rounds each, derived receive
count from `tot_recv_bytes / 120` because this log format omits
`tot_recv_msgs` :
- off mean 8,540, median 8,990, stdev 1,004
- on mean 8,857, median 8,749, stdev 759
- on was +3.7 % by mean and -2.7 % by median
Learning: the corrected relay `recvmmsg` implementation is now buildable and
much safer for many connections, but these droplet runs still do not show a
clear throughput win. Keep `--udp-recvmmsg` opt-in for now. The next useful
step is to instrument actual batch occupancy on connected relay sockets; if
most readiness events return one datagram, `recvmmsg` will mostly add setup
work without reducing syscalls.
DigitalOcean occupancy check on 2026-05-09:
- Built fresh current artifacts from `388b15d4` on both droplets under
`/root/coturn_recvmmsg_current` .
- Same-binary `--udp-recvmmsg` off/on, `-Y packet -m 1 -l 120` , 3 alternating
30 s rounds each:
- off mean 153,133, median 153,608, stdev 4,383
- on mean 148,452, median 149,711, stdev 10,833
- on was -3.1 % by mean and -2.5 % by median
- `m=1` occupancy from the on runs: 1,129,427 `recvmmsg` calls returned
17,660,300 packets, average batch 15.64. Histogram buckets:
`hist_1=1,353` , `hist_2=1,496` , `hist_3_4=3,707` ,
`hist_5_8=14,817` , `hist_9_16=1,108,057` ; 98.1 % of calls were in the
`9..16` bucket.
- Same-binary `--udp-recvmmsg` off/on, `-Y packet -m 100 -l 120` , 3 alternating
runs each:
- off mean 55,443, median 50,679, stdev 8,369
- on mean 60,596, median 65,404, stdev 8,383
- on was +9.3 % by mean and +29.1 % by median, but the client again landed
in two send-volume buckets, so treat the throughput delta as noisy.
- `m=100` occupancy from the on runs across all relay threads: 1,426,401
`recvmmsg` calls returned 16,188,946 packets, average batch 11.35.
Histogram buckets: `hist_1=83,057` , `hist_2=79,781` ,
`hist_3_4=130,066` , `hist_5_8=188,259` , `hist_9_16=945,238` ; 66.3 %
of calls were in the `9..16` bucket.
Learning: receive-side occupancy is high. The earlier hypothesis that
`recvmmsg` was mostly returning one packet is wrong for this harness. The
remaining bottleneck is after receive: per-packet callbacks, TURN processing,
and especially one `sendto` per relayed packet. The per-thread scratch change
is still worth keeping for memory/cache behavior with thousands of sockets,
but the next performance lever should be send-side batching or a design that
passes batches deeper instead of immediately decomposing them back into
single-packet callbacks.
Alternating A/B run on the same droplet pair, m=1 packet flood, 30 s
per run, with a 4 s warm-up between binary swaps:
- Baseline (clean `master` binary): mean 146,984 round-trips / 30 s
- Cumulative (all 5 iters): mean 155,468 round-trips / 30 s
- **+5.8 % throughput**
Per-iteration deltas were within run-to-run noise (~5–10 % variance).
The cumulative effect is what's visible.
## Test setup that was used
Relay recvmmsg (#1906)
## Summary
Extends the existing Linux-only `--udp-recvmmsg` flag from the UDP
listener socket to also cover **connected per-session UDP relay
sockets**, so steady-state client→relay and peer→relay traffic on plain
UDP is read in batches of up to 16 datagrams per `recvmmsg(2)` instead
of one `recvmsg` per packet. DTLS sessions still go through the SSL read
path and are unchanged.
The flag stays **opt-in**: receive-side batching works correctly, but on
the current `m=1` / `m=100` benchmarks throughput is flat to slightly
negative — the bottleneck has moved past receive (see results below).
## What's in the change
- **Shared receive helpers** (`src/apps/relay/ns_ioalib_engine_impl.c`,
`src/apps/relay/ns_ioalib_impl.h`):
- `ioa_parse_udp_recvmsg_cmsg()` — single TTL/TOS/`IP_RECVERR` cmsg
parser used by both `udp_recvfrom()` and the new batch path. Replaces
the duplicated parser previously inlined in `dtls_listener.c` and
`udp_recvfrom()`.
- `ioa_init_recvmmsg_hdr()` — single initializer for
`mmsghdr`/`iovec`/cmsg/source-address fields, also used by the listener.
- New `IOA_UDP_RECVMMSG_MAX_BATCH = 16` constant; both listener and
relay paths now share it.
- **Connected relay batch read** (`socket_udp_read_batch_recvmmsg` in
`ns_ioalib_engine_impl.c`): called from `socket_input_worker` for
non-SSL UDP sockets when `--udp-recvmmsg` is on. Allocates per-message
`stun_buffer_list_elem`s, calls `recvmmsg(MSG_DONTWAIT)`, dispatches
each datagram through the existing `read_cb` path, and falls back
cleanly on `ENOSYS`/`EINVAL`/`EOPNOTSUPP` (auto-disables the flag) and
on `EAGAIN`/short-batch (releases unused buffers).
- **Per-engine scratch state**: the `mmsghdr[16]` / `iovec[16]` / cmsg /
src-addr arrays live on `ioa_engine`, not on every socket — keeps memory
flat at thousands of allocations.
- **TTL/TOS-sized cmsg buffers** in the listener: the listener
previously over-allocated `64 KiB` per slot; it now uses the same
TTL+TOS sizing as the relay path.
- **Opt-in occupancy stats** behind a new `--udp-recvmmsg-log` flag:
every 10 s the relay logs `udp-recvmmsg stats: calls=… packets=…
avg_batch=… wouldblock=… unavailable=… no_buffer=… hist_1=… hist_2=…
hist_3_4=… hist_5_8=… hist_9_16=…`. Counters are always tracked (cheap);
the periodic log is gated by the new flag so default operation is
silent.
- **CLI plumbing**: `--udp-recvmmsg-log` long option in
`mainrelay.c`/`mainrelay.h`, `cli_print_flag` entry in
`turn_admin_server.c`, doc updates in `README.turnserver`.
- **Docs**: `docs/PerformanceIterationLog.md` records the iteration
steps, validation, and two rounds of DigitalOcean A/B numbers.
`CLAUDE.md` load-test instructions updated to mention the new flag and
the `tot_recv_msgs` / `tot_recv_bytes` workaround.
1 month ago
Two `c-4` Ubuntu 24.04 droplets in `nyc1` , same VPC `default-nyc1` .
Current active pair:
- `coturn-turnserver` — public `157.230.3.102` , private `10.116.0.2`
- `coturn-loadgen` — public `167.99.153.216` , private `10.116.0.3`
Older pair used for the iter 5 cumulative run:
- `coturn-turnserver` — public `68.183.121.197` , private `10.116.0.2`
- `coturn-loadgen` — public `68.183.132.220` , private `10.116.0.3`
Created via the DigitalOcean v2 API (`doctl` is *not* installed; use
`curl` + `$DIGITALOCEAN_TOKEN` from the user's `~/.zshrc` ). SSH via
`~/.ssh/id_rsa` (matches DO ssh key id `23704483` , fingerprint
`37:3a:9b:e3:1e:1a:9b:42:a0:6f:58:f5:5a:3a:6a:2c` ).
State on the turnserver droplet (kept across iterations):
- `/root/coturn_clean.tar` — `git archive HEAD` of master at start of run.
Re-extract this before applying any new patch.
- `/root/coturn_baseline/build/bin/turnserver` — clean baseline binary,
used as the "B" in every A/B round. **Don't overwrite.**
- `/root/coturn/build/bin/turnserver` — current iteration binary.
- `/root/start_turnserver.sh` , `/root/baseline_run.sh` — helper scripts.
State on the loadgen droplet:
- `/root/coturn/build/bin/turnutils_uclient` , `turnutils_peer` .
- `turnutils_peer` runs as a daemon on `10.116.0.3:3480`
(`pid` in `/root/peer.pid` ).
A small env file was written to `/tmp/coturn_perf_env.sh` on the local
machine with the IPs / droplet IDs — recreate it from the current
state of the DO account if it gets lost.
Relay recvmmsg (#1906)
## Summary
Extends the existing Linux-only `--udp-recvmmsg` flag from the UDP
listener socket to also cover **connected per-session UDP relay
sockets**, so steady-state client→relay and peer→relay traffic on plain
UDP is read in batches of up to 16 datagrams per `recvmmsg(2)` instead
of one `recvmsg` per packet. DTLS sessions still go through the SSL read
path and are unchanged.
The flag stays **opt-in**: receive-side batching works correctly, but on
the current `m=1` / `m=100` benchmarks throughput is flat to slightly
negative — the bottleneck has moved past receive (see results below).
## What's in the change
- **Shared receive helpers** (`src/apps/relay/ns_ioalib_engine_impl.c`,
`src/apps/relay/ns_ioalib_impl.h`):
- `ioa_parse_udp_recvmsg_cmsg()` — single TTL/TOS/`IP_RECVERR` cmsg
parser used by both `udp_recvfrom()` and the new batch path. Replaces
the duplicated parser previously inlined in `dtls_listener.c` and
`udp_recvfrom()`.
- `ioa_init_recvmmsg_hdr()` — single initializer for
`mmsghdr`/`iovec`/cmsg/source-address fields, also used by the listener.
- New `IOA_UDP_RECVMMSG_MAX_BATCH = 16` constant; both listener and
relay paths now share it.
- **Connected relay batch read** (`socket_udp_read_batch_recvmmsg` in
`ns_ioalib_engine_impl.c`): called from `socket_input_worker` for
non-SSL UDP sockets when `--udp-recvmmsg` is on. Allocates per-message
`stun_buffer_list_elem`s, calls `recvmmsg(MSG_DONTWAIT)`, dispatches
each datagram through the existing `read_cb` path, and falls back
cleanly on `ENOSYS`/`EINVAL`/`EOPNOTSUPP` (auto-disables the flag) and
on `EAGAIN`/short-batch (releases unused buffers).
- **Per-engine scratch state**: the `mmsghdr[16]` / `iovec[16]` / cmsg /
src-addr arrays live on `ioa_engine`, not on every socket — keeps memory
flat at thousands of allocations.
- **TTL/TOS-sized cmsg buffers** in the listener: the listener
previously over-allocated `64 KiB` per slot; it now uses the same
TTL+TOS sizing as the relay path.
- **Opt-in occupancy stats** behind a new `--udp-recvmmsg-log` flag:
every 10 s the relay logs `udp-recvmmsg stats: calls=… packets=…
avg_batch=… wouldblock=… unavailable=… no_buffer=… hist_1=… hist_2=…
hist_3_4=… hist_5_8=… hist_9_16=…`. Counters are always tracked (cheap);
the periodic log is gated by the new flag so default operation is
silent.
- **CLI plumbing**: `--udp-recvmmsg-log` long option in
`mainrelay.c`/`mainrelay.h`, `cli_print_flag` entry in
`turn_admin_server.c`, doc updates in `README.turnserver`.
- **Docs**: `docs/PerformanceIterationLog.md` records the iteration
steps, validation, and two rounds of DigitalOcean A/B numbers.
`CLAUDE.md` load-test instructions updated to mention the new flag and
the `tot_recv_msgs` / `tot_recv_bytes` workaround.
1 month ago
The standard packet-flood command (matches CLAUDE.md baseline, runs without
`--udp-recvmmsg` ; add `--udp-recvmmsg` to `turnserver` , not the client, for the
batched listener/relay receive path):
```bash
timeout -s INT 30s /root/coturn/build/bin/turnutils_uclient \
-Y packet -m 1 -l 120 \
-e 10.116.0.3 -r 3480 -X -g \
-u user -W secret \
10.116.0.2
```
Metric: the `tot_recv_msgs` field on the last `start_mclient:` log
line. (This is round-trips through the relay over the test window;
`send_pps` is loadgen-side only and can hit 262 K even when the relay
is dropping most of them, so it's not a useful proxy for relay
throughput.)
## Hot-path map at the end of iter 5
`perf record -F 99 -g` on the turnserver during a 12 s `-Y packet -m 1`
run, sorted by user-space self-time:
```
0.80 % send_data_from_ioa_socket_nbh
0.76 % socket_input_worker
0.69 % read_client_connection.isra.0
0.60 % turn_report_session_usage
0.53 % peer_input_handler
0.51 % udp_server_input_handler
0.35 % udp_recvfrom # was 0.76 % at iter 1
0.34 % lm_map_get
0.27 % stun_is_channel_message_str
0.27 % get_relay_socket
0.26 % ioa_socket_check_bandwidth # was 0.33 % at iter 1
0.26 % udp_send # was 0.60 % at iter 1
0.18 % ioa_network_buffer_get_size
```
Total user-space coturn cycles: ~5–7 % of the relay thread.
The relay thread sits at ~100 % CPU pinned to one core; the 4 relay
threads aren't parallelised by the m=1 single-flow test (one 5-tuple
hashes to one SO_REUSEPORT worker).
Kernel side (children-aggregated) is the real cost:
```
36 % udp_sendmsg (sendto path)
14 % udp_recvmsg
17 % ip_finish_output / ip_output / __dev_queue_xmit
~23 % syscall enter / exit machinery (sysret, SYSRETQ, SYSCALL_64*)
```
That ~23 % syscall overhead is the next big lever. Halving it
(via batching) is worth ~10 % wall-clock CPU.
## What didn't work
Enable --udp-recvmmsg by default on Linux (#1930)
## Summary
Flips the Linux default for `--udp-recvmmsg` from **off** to **on**.
Operators opt out with `--udp-recvmmsg=false` (or `=0`).
> **Stacked on #1929.** This depends on the recvmmsg-scoping change in
#1929 and is based on that branch, so the diff shows only the default-on
change. GitHub will auto-retarget the base to `master` once #1929
merges. Merge #1929 first.
## Why this is now safe
The original objection to default-on (recorded in
`docs/PerformanceIterationLog.md`) was the **per-session-relay-socket
prealloc tax**: `--udp-recvmmsg` applied the 16-buffer batch path to
every connected relay socket, which only ever carries one flow, so the
churn ate the listener-side win.
#1929 scoped recvmmsg to **shared fan-in sockets only**
(`udp_recvmmsg_eligible`: the client listener, plus the per-thread
shared relay socket under `--multiplex-peer`). Per-session relay sockets
now stay on the single-recv path regardless of the flag, so that tax is
gone. The one socket touched by default — the client listener — is a
genuine fan-in point:
- batches whenever client concurrency is non-trivial (measured
`avg_batch ≈ 16` under load), and
- costs little when idle (few packets ⇒ few prealloc cycles).
## What changed
- `mainrelay.c`: `turn_params.udp_recvmmsg` default `false → true`
(Linux only).
- Removed the now-dead `--multiplex-peer` auto-enable block and the
`udp_recvmmsg_set_explicitly` tracking it relied on; multiplex-peer gets
its recvmmsg window from the default. The opt-out flows through the
normal `get_bool_value` path.
- Help text, `man/man1/turnserver.1`, `examples/etc/turnserver.conf`,
`CLAUDE.md`, and `docs/PerformanceIterationLog.md` updated for the new
default + opt-out.
Per-session relay sockets and DTLS session sockets are unchanged.
## Validation
- **Format:** clang-format 15.0.7 clean.
- **macOS:** build + ctest 6/6 + `run_tests.sh` pass.
- **Linux (Docker, clean build):** ctest 5/5; `run_tests.sh`,
`run_tests_conf.sh`, `run_tests_multiplex_peer.sh` all pass (no FAIL).
- **Runtime proof (loopback, `--udp-recvmmsg-log`):**
- Default, no flag: recvmmsg active, `calls=13714 packets=219306
avg_batch=15.99`.
- `--udp-recvmmsg=false`: zero recvmmsg activity — opt-out confirmed.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
---------
Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2 weeks ago
### Default `--udp-recvmmsg=true` on Linux — opt-in in iters 1–11, **shipped default-on later**
Original finding (iters 1–11): the flag applied the 16-buffer batch path to
**every connected per-session relay socket**, which only ever receives one
flow. Throughput parity or slight negative results were confirmed across
multiple A/B rounds on `m=1` and `m=100` — the per-session prealloc churn ate
the listener-side win — so the flag was kept opt-in.
Resolution: a later change scoped `recvmmsg` to **shared fan-in sockets only**
(the `udp_recvmmsg_eligible` flag in
[ns_ioalib_impl.h ](../src/apps/relay/ns_ioalib_impl.h ) — set on the client
listener and, under `--multiplex-peer` , the per-thread shared relay socket).
Per-session relay sockets now stay on the single-recv path regardless of the
flag, so the per-session tax that motivated keeping it opt-in is gone. With the
cost surface reduced to the one genuine fan-in point (the listener), the flag
ships **on by default on Linux** ; operators opt out with `--udp-recvmmsg=false` .
DTLS session sockets remain on the SSL read path and are never batched.
### Caching `get_relay_socket_ss` (iter 3) — no measurable wall-clock win
The function is `static inline` already and the underlying
`get_relay_socket()` is a four-line accessor. Caching the result
*does* save a cross-TU function call per packet (the compiler can't
prove `get_relay_socket` pure across the
`set_df_on_ioa_socket` / `ioa_network_buffer_*` calls in between),
which the perf profile picked up as a small redistribution, but
throughput stayed in the noise band. Kept anyway: the cleanup is
defensible and matches the iter 4/5 inlining direction.
## Methodology lessons
- **Always alternate A/B per round** rather than running 5×B then 5×I.
The droplet pair has noticeable environmental drift over a few
minutes (other tenants on the hypervisor, NIC ring backpressure,
whatever); sequential blocks bias whichever binary ran on the worse
half of the run.
- **Discard the first run after a turnserver restart.** The loadgen's
first run after a server restart is consistently 30–80 % slower
than steady-state — looks like channel/permission state in the
client side warming up, not the server. A 4 s "throwaway" run
before the measured 30 s run is enough.
- **Run-to-run variance is ~5–10 %** even with alternation. Plan on
6–8 rounds (≈ 8 minutes wall-clock) before claiming a sub-10 % win.
A single 3-round A/B will lie to you.
- **Use the `tot_recv_msgs` field, not `send_pps` **. Loadgen send rate
saturates at ~262 K pps regardless of relay capacity — it's
whatever the loadgen kernel will accept into its UDP send buffer.
The receive count is what made it round-trip through the relay.
- **The relay is kernel-bound.** User-space coturn is ~5 % of cycles.
Halving it gives at most ~2.5 % wall-clock — usually undetectable
per-iteration, only visible cumulatively. Don't expect a 10 % jump
from a CSE.
- **Single-flow tests pin one core.** With `SO_REUSEPORT` the kernel
hashes 5-tuples to worker sockets; one client → one tuple → one
worker thread. The other 3 cores sit idle. To exercise all 4 relay
threads you'd need m≥4 *with distinct source ports* — ours don't
spread cleanly because the loadgen reuses ports.
- **Don't re-extract `/root/coturn` between iterations** if you want
to keep `git apply` -style patches working. The droplet copy is *not*
a git checkout (it's the `git archive` tar). Use `patch -p1` . Each
iteration uploaded a *cumulative* diff (current branch vs `master` )
and re-extracted from `/root/coturn_clean.tar` first to get a clean
apply.
## Optimization backlog (bigger fish for next session)
Ordered by expected impact for the m=1 packet-flood metric:
Relay recvmmsg (#1906)
## Summary
Extends the existing Linux-only `--udp-recvmmsg` flag from the UDP
listener socket to also cover **connected per-session UDP relay
sockets**, so steady-state client→relay and peer→relay traffic on plain
UDP is read in batches of up to 16 datagrams per `recvmmsg(2)` instead
of one `recvmsg` per packet. DTLS sessions still go through the SSL read
path and are unchanged.
The flag stays **opt-in**: receive-side batching works correctly, but on
the current `m=1` / `m=100` benchmarks throughput is flat to slightly
negative — the bottleneck has moved past receive (see results below).
## What's in the change
- **Shared receive helpers** (`src/apps/relay/ns_ioalib_engine_impl.c`,
`src/apps/relay/ns_ioalib_impl.h`):
- `ioa_parse_udp_recvmsg_cmsg()` — single TTL/TOS/`IP_RECVERR` cmsg
parser used by both `udp_recvfrom()` and the new batch path. Replaces
the duplicated parser previously inlined in `dtls_listener.c` and
`udp_recvfrom()`.
- `ioa_init_recvmmsg_hdr()` — single initializer for
`mmsghdr`/`iovec`/cmsg/source-address fields, also used by the listener.
- New `IOA_UDP_RECVMMSG_MAX_BATCH = 16` constant; both listener and
relay paths now share it.
- **Connected relay batch read** (`socket_udp_read_batch_recvmmsg` in
`ns_ioalib_engine_impl.c`): called from `socket_input_worker` for
non-SSL UDP sockets when `--udp-recvmmsg` is on. Allocates per-message
`stun_buffer_list_elem`s, calls `recvmmsg(MSG_DONTWAIT)`, dispatches
each datagram through the existing `read_cb` path, and falls back
cleanly on `ENOSYS`/`EINVAL`/`EOPNOTSUPP` (auto-disables the flag) and
on `EAGAIN`/short-batch (releases unused buffers).
- **Per-engine scratch state**: the `mmsghdr[16]` / `iovec[16]` / cmsg /
src-addr arrays live on `ioa_engine`, not on every socket — keeps memory
flat at thousands of allocations.
- **TTL/TOS-sized cmsg buffers** in the listener: the listener
previously over-allocated `64 KiB` per slot; it now uses the same
TTL+TOS sizing as the relay path.
- **Opt-in occupancy stats** behind a new `--udp-recvmmsg-log` flag:
every 10 s the relay logs `udp-recvmmsg stats: calls=… packets=…
avg_batch=… wouldblock=… unavailable=… no_buffer=… hist_1=… hist_2=…
hist_3_4=… hist_5_8=… hist_9_16=…`. Counters are always tracked (cheap);
the periodic log is gated by the new flag so default operation is
silent.
- **CLI plumbing**: `--udp-recvmmsg-log` long option in
`mainrelay.c`/`mainrelay.h`, `cli_print_flag` entry in
`turn_admin_server.c`, doc updates in `README.turnserver`.
- **Docs**: `docs/PerformanceIterationLog.md` records the iteration
steps, validation, and two rounds of DigitalOcean A/B numbers.
`CLAUDE.md` load-test instructions updated to mention the new flag and
the `tot_recv_msgs` / `tot_recv_bytes` workaround.
1 month ago
1. **Batch the send side (`sendmmsg`) or pass receive batches deeper.** The
occupancy counters show receive batching is already working: `m=1` averaged
15.6 packets per call and `m=100` averaged 11.4. The code immediately
invokes the existing per-packet callback for each received datagram, and
each forwarded packet still pays a separate send syscall. The next
measurable lever is to queue per-thread outbound datagrams during a receive
batch and flush them with `sendmmsg` , or introduce a batch-aware callback
path for the hot UDP relay case.
2. **Keep `recvmmsg` occupancy counters available while developing send
batching.** They are cheap enough for targeted performance builds and make
it obvious whether a benchmark is exercising one relay thread or all relay
threads. Consider hiding periodic logs behind a verbose/debug option before
shipping broadly.
3. **GSO (`UDP_SEGMENT`)** on the send path. Linux can take one
"large" datagram and segment it in the kernel for back-to-back
packets to the same destination. Our channel-data flood IS
same-destination. Setting `UDP_SEGMENT` and submitting a single
`sendmsg` of N×packet_size cuts skb-alloc / `__dev_queue_xmit`
work substantially. Needs careful handling for short tails and
non-uniform sizes; complementary to (2).
4. **Inline more cross-TU per-packet accessors.** Pattern from iter
4/5 still applies: `addr_eq` (called per channel-data packet for
permission lookup), `ioa_network_buffer_get_size` ,
`get_ioa_socket_type` / `_app_type` . Each is small enough; the
only reason to be cautious is they're declared in `ns_turn_ioalib.h`
which is part of the public-ish server library API — moving the
body inline doesn't break ABI but does require a recompile of all
consumers. Likely < 1 % each but cheap to do .
Enable --udp-recvmmsg by default on Linux (#1930)
## Summary
Flips the Linux default for `--udp-recvmmsg` from **off** to **on**.
Operators opt out with `--udp-recvmmsg=false` (or `=0`).
> **Stacked on #1929.** This depends on the recvmmsg-scoping change in
#1929 and is based on that branch, so the diff shows only the default-on
change. GitHub will auto-retarget the base to `master` once #1929
merges. Merge #1929 first.
## Why this is now safe
The original objection to default-on (recorded in
`docs/PerformanceIterationLog.md`) was the **per-session-relay-socket
prealloc tax**: `--udp-recvmmsg` applied the 16-buffer batch path to
every connected relay socket, which only ever carries one flow, so the
churn ate the listener-side win.
#1929 scoped recvmmsg to **shared fan-in sockets only**
(`udp_recvmmsg_eligible`: the client listener, plus the per-thread
shared relay socket under `--multiplex-peer`). Per-session relay sockets
now stay on the single-recv path regardless of the flag, so that tax is
gone. The one socket touched by default — the client listener — is a
genuine fan-in point:
- batches whenever client concurrency is non-trivial (measured
`avg_batch ≈ 16` under load), and
- costs little when idle (few packets ⇒ few prealloc cycles).
## What changed
- `mainrelay.c`: `turn_params.udp_recvmmsg` default `false → true`
(Linux only).
- Removed the now-dead `--multiplex-peer` auto-enable block and the
`udp_recvmmsg_set_explicitly` tracking it relied on; multiplex-peer gets
its recvmmsg window from the default. The opt-out flows through the
normal `get_bool_value` path.
- Help text, `man/man1/turnserver.1`, `examples/etc/turnserver.conf`,
`CLAUDE.md`, and `docs/PerformanceIterationLog.md` updated for the new
default + opt-out.
Per-session relay sockets and DTLS session sockets are unchanged.
## Validation
- **Format:** clang-format 15.0.7 clean.
- **macOS:** build + ctest 6/6 + `run_tests.sh` pass.
- **Linux (Docker, clean build):** ctest 5/5; `run_tests.sh`,
`run_tests_conf.sh`, `run_tests_multiplex_peer.sh` all pass (no FAIL).
- **Runtime proof (loopback, `--udp-recvmmsg-log`):**
- Default, no flag: recvmmsg active, `calls=13714 packets=219306
avg_batch=15.99`.
- `--udp-recvmmsg=false`: zero recvmmsg activity — opt-out confirmed.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
---------
Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2 weeks ago
5. **Re-evaluate `--udp-recvmmsg` default after instrumentation.** *(Done.)*
Scoping `recvmmsg` to shared fan-in sockets removed the per-session-relay
tax that blocked default-on; the listener is a genuine fan-in point that
benefits whenever client concurrency is non-trivial and costs little when
idle (few packets ⇒ few prealloc cycles). Now ships on by default on Linux,
with `--udp-recvmmsg=false` as the opt-out. See the resolved entry under
"What didn't work" above.
## Things investigated and ruled out (don't redo)
- `set_socket_ttl` / `set_socket_tos` already short-circuit on
no-change via `s->current_ttl != ttl` / `s->current_tos != tos` .
In a steady-state flood the per-packet call returns immediately
without `setsockopt` . Already optimized.
- `set_df_on_ioa_socket` similarly guarded
([ns_ioalib_engine_impl.c:242](../src/apps/relay/ns_ioalib_engine_impl.c#L242)).
- `turn_report_session_usage` slow path runs once per 4096 packets
(see iter 1 commit); the per-call overhead is now ~3 reads + 1
bitmask test + 1 conditional return.
- `MSG_CONFIRM` in `sendto` would skip ARP refresh, but
`neigh_resolve_output` + `neigh_hh_output` show ~17 % combined in
perf only because we're sending *that many* packets — per-packet
it's the normal cached neighbor path, not a refresh.
- Increasing `MAX_TRIES` from 16 to 64 in `socket_input_worker`
doesn't change syscall count; it only delays returning to libevent.
Useless without (1) above.
## How to resume
1. Verify the droplets are still up (the IPs above). If they were
destroyed, re-create with `c-4` / `nyc1` / `default-nyc1` VPC and
the `pavel` SSH key (id 23704483).
2. Re-upload `/tmp/coturn_clean.tar` from `git archive master` and
rebuild `/root/coturn_baseline/build/bin/turnserver` if the
baseline binary is gone. The A/B harness depends on having both
binaries side-by-side on the turnserver droplet.
3. Run a 6-round alternating A/B as a sanity check that the current
tip-of-branch still beats `master` by ~5 %. If it doesn't, the
environment drifted and the baseline needs re-anchoring.
4. Pick the next item from the backlog. Item (1) — `recvmmsg` into
`socket_input_worker` — is where the next material gain lives.
Add UDP-GSO send path (--udp-gso) (#1907)
## Summary
- New `--udp-gso` flag (Linux, requires `--udp-sendmmsg`) collapses
same-destination, same-size sendmmsg batches into a single `sendmsg`
with a `UDP_SEGMENT` cmsg, so the kernel allocates one super-skb that
traverses the network stack once and is segmented at egress instead of
running `udp_sendmsg → ip_finish_output → __dev_queue_xmit` per
datagram.
- Also wraps the relay-side `recvmmsg` callback loop in
`udp_sendmmsg_batch_begin/end` so peer→client sends triggered inside a
recv batch can also coalesce — without that wrapping the relay path
issues one `sendto` per delivered datagram.
- Sticky-disable on `EINVAL/ENOPROTOOPT` for older kernels/NICs that
lack UDP-GSO; one warning logged, then transparent fallback to the
existing `sendmmsg` and `udp_send` paths.
## Why
The `--udp-recvmmsg` and `--udp-sendmmsg` follow-ups confirmed (see
[docs/PerformanceIterationLog.md](docs/PerformanceIterationLog.md)) that
on the relay flood workload the dominant cost is the per-datagram kernel
TX path. mmsg-style batching reduces only the syscall entry/exit, not
the per-skb stack traversal — UDP-GSO collapses both.
## Result
DigitalOcean nyc1 c-4, 30 s alternating A/B, `-Y packet -m 1`, eth1 TX
as the authoritative server forwarding metric:
| Variant | eth1 RX | eth1 TX | sys CPU | idle CPU |
|---|---:|---:|---:|---:|
| baseline (no flags) | 322,091 | 127,445 | 22.9 % | 67.5 % |
| `--udp-recvmmsg --udp-sendmmsg --udp-gso` | 266,068 | **257,996** |
15.0 % | 78.7 % |
| baseline (no flags) | 309,475 | 125,573 | 20.9 % | 70.7 % |
| `--udp-recvmmsg --udp-sendmmsg --udp-gso` | 275,992 | **225,366** |
14.9 % | 74.3 % |
Mean server forwarding rate: **126.5 k → 241.7 k pps (+91 %, 1.91×)**,
mean system CPU **21.9 % → 14.9 %** — about **2.8× CPU efficiency** (TX
pps per system-CPU-%). Full perf-children comparison and methodology in
the new section of
[docs/PerformanceIterationLog.md](docs/PerformanceIterationLog.md).
## Notes for reviewers
- `--udp-gso` is opt-in and requires `--udp-sendmmsg` (the help text
states the dependency). Without `--udp-sendmmsg` the batch state never
accumulates and GSO has nothing to flush.
- GSO eligibility resets on every `_begin/_end`. Mixed-destination,
mixed-size, or oversize batches transparently fall back through
`sendmmsg` / `udp_send`.
- Rebased onto current `master`; the recvmmsg dependency is already
merged via #1906.
## Test plan
- [x] `cmake --build build --target turnserver` (RelWithDebInfo + ASan
local builds clean)
- [x] `ctest --test-dir build --output-on-failure` — 3/3 unit tests pass
- [x] `examples/run_tests.sh` — TCP/TLS/UDP pass; DTLS pre-existing
failure on macOS environment, unrelated to this change
- [x] DigitalOcean A/B perf validation captured above
- [ ] Reviewer to confirm CI green on Linux build/test/CodeQL
---------
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
1 month ago
## 2026-05-03 sendmmsg follow-up
A later run on two DigitalOcean CPU-optimized `c-4` droplets in `sfo3`
(`10.124.0.2` turnserver, `10.124.0.3` loadgen) tested an experimental
Linux-only `--udp-sendmmsg` flag with `--udp-recvmmsg` .
| Run | Code/flags | Generator max pps | Generator avg pps | Server RX avg pps | Server TX avg pps | Server TX peak pps | CPU avg | Perf conclusion |
| --- | --- | ---: | ---: | ---: | ---: | ---: | ---: | --- |
| iter0 | baseline, `--udp-recvmmsg` | 335,872 | 286,721 | 360,900 | 257,357 | 323,488 | 97.8% | `sendto` /`udp_sendmsg` dominates |
| iter1 | `--udp-sendmmsg` , both directions | 409,600 | 312,662 | 428,184 | 197,300 | 260,453 | 99.8% | `sendmmsg` path dominates; TX regressed |
| iter2 | `sendmmsg` only for batches >= 4 | 393,216 | 315,393 | 398,121 | 163,626 | 215,068 | 98.9% | Threshold did not recover TX |
| iter3 | listener-side batching only | 425,984 | 286,038 | 376,444 | 210,050 | 332,417 | 97.4% | Peak ingress/TX improved, average TX still below baseline |
Validation result: `sendmmsg()` is not a proven general win for this workload.
It can increase generator max pps and peak server TX, but average delivered
server TX stayed below the `--udp-recvmmsg` baseline. Keep it opt-in until a
follow-up change proves better end-to-end relay throughput.
Perf still points at per-datagram kernel transmit cost:
- baseline: `udp_send -> sendto -> __sys_sendto -> udp_sendmsg -> udp_send_skb -> ip_output`
- sendmmsg variants: `udp_sendmmsg_flush -> __sendmmsg -> __sys_sendmmsg -> ___sys_sendmsg -> udp_sendmsg -> ip_output`
The key observation is that `sendmmsg()` reduces syscall entry count but still
walks `udp_sendmsg` and the IP output path once per datagram. On this workload,
the extra `mmsghdr` copy/looping overhead can offset the syscall savings.
Deferred bigger refactors from this run:
- Per-peer connected UDP relay sockets or a destination cache could reduce
address handling and route lookup for repeated peer sends, but it changes
relay socket semantics and receive filtering.
- Shard a single hot allocation/flow across multiple relay workers only with a
careful design for ordering, session accounting, socket ownership, and lock
contention.
- Investigate `io_uring` send batching or kernel-bypass style transmit only as
a larger architecture experiment.
- Consider a purpose-built benchmark mode that measures delivered relay pps at a
controlled input rate. The current saturated packet flood is useful for
finding hot functions but can obscure end-to-end delivery changes.
## 2026-05-09 UDP-GSO send path (`--udp-gso`)
Realizes the GSO backlog item from the iter-5 backlog above. The recvmmsg /
sendmmsg follow-ups confirmed that on this workload the dominant cost is the
per-datagram kernel TX path (`udp_sendmsg → ip_finish_output → __dev_queue_xmit
→ start_xmit`), which mmsg-style batching does not collapse. UDP-GSO (Linux
`UDP_SEGMENT` cmsg) does collapse it: N same-destination, same-size datagrams
are submitted as one `sendmsg` carrying an iovec; the kernel allocates one
super-skb that traverses the network stack once and is split at egress (NIC).
Implementation lives in [src/apps/relay/ns_ioalib_engine_impl.c ](../src/apps/relay/ns_ioalib_engine_impl.c )
and reuses the existing `--udp-sendmmsg` batch state. Eligibility (same fd,
same dest, same size, ≤ 1472 B per datagram) is tracked on every
`udp_sendmmsg_enqueue` ; eligible flushes go through `udp_gso_attempt_flush`
ahead of the `sendmmsg` loop, with an automatic sticky disable on
`EINVAL/ENOPROTOOPT` so a kernel/NIC without GSO support gracefully falls back.
The relay-side `socket_udp_read_batch_recvmmsg` now wraps its callback loop
in `udp_sendmmsg_batch_begin/end` so peer→client sends triggered inside a
recvmmsg batch can also coalesce — without that wrapping, the relay path
issues one `sendto` per delivered datagram.
DigitalOcean validation on 2026-05-09 — fresh nyc1 `c-4` droplets (turn
`10.116.0.4` , load `10.116.0.5` ), all variants built from the same source tree
under `/root/coturn/build` , `-Y packet -m 1 -l 120` , monitor window via `sar
-n DEV` for `eth1` , `mpstat` , `pidstat` . The 12 s sweep first established the
ordering, then a 30 s alternating A/B (`baseline → gso → baseline → gso`)
confirmed the magnitude of the delta:
| Variant | eth1 RX pps | eth1 TX pps | sys CPU | idle CPU |
|---|---:|---:|---:|---:|
| baseline_r1 | 322,091 | 127,445 | 22.9% | 67.5% |
| `--udp-recvmmsg --udp-sendmmsg --udp-gso` (gso_r1) | 266,068 | **257,996** | 15.0% | 78.7% |
| baseline_r2 | 309,475 | 125,573 | 20.9% | 70.7% |
| gso_r2 | 275,992 | **225,366** | 14.9% | 74.3% |
Mean server forwarding rate (eth1 TX): baseline 126,509 pps → GSO 241,681 pps,
**+91 % (1.91×)**, with mean system CPU dropping from 21.9 % to 14.9 % — about
**2.8× CPU efficiency** in TX pps per system-CPU-%.
12 s packet sweep, all four variants, mean send_pps reported by uclient (used
only for ordering — for absolute throughput trust eth1 TX above):
| Variant | m=1 | m=2 | m=4 | m=8 | m=16 | m=32 |
|---|---:|---:|---:|---:|---:|---:|
| baseline | 230,401 | 150,189 | 187,055 | 174,771 | 160,871 | 167,789 |
| `--udp-recvmmsg` | 255,660 | 148,824 | 174,767 | 142,997 | 150,743 | 144,200 |
| `--udp-recvmmsg --udp-sendmmsg` | 231,766 | 146,776 | 148,826 | 136,542 | 148,955 | 143,575 |
| `--udp-recvmmsg --udp-sendmmsg --udp-gso` | 136,876 | 147,458 | 124,250 | 131,081 | 137,636 | 114,714 |
The uclient generator reports its own send rate, which drops with GSO because
the loadgen droplet's `turnutils_peer` becomes the new bottleneck — it is
single-threaded and cannot reflect 240 k pps. The 30 s `eth1` capture is the
authoritative server-side metric; `sweep_m1` is retained only to show that
GSO does not regress in the moderately-loaded `m=2..32` range relative to
`recvmmsg+sendmmsg` .
Perf children share, m=1 12 s perf record on the turnserver process:
| Symbol | baseline | recvmmsg | recvsendmmsg | gso |
|---|---:|---:|---:|---:|
| `__x64_sys_sendto` (children) | 43.6 % | 47.6 % | 22.8 % | 0.0 % |
| `__x64_sys_sendmsg` (children) | — | — | — | **38.1 %** |
| `__x64_sys_sendmmsg` (children) | — | — | 27.0 % | 0.0 % |
| `udp_sendmsg` | 38.8 % | 41.9 % | 20.6 % | 35.9 % |
| `__dev_queue_xmit` | 18.5 % | — | — | 29.3 % |
| `skb_segment` (egress GSO split) | absent | absent | absent | 2.2 % |
| `syscall_return_via_sysret` (self) | 7.2 % | 4.7 % | 4.4 % | 2.4 % |
| `entry_SYSCALL_64_after_hwframe` (self) | 4.1 % | 3.6 % | 2.6 % | 1.8 % |
In the GSO column the per-packet kernel-stack cost is now amortized across
the segments of a single super-skb. The proportional rise of
`__dev_queue_xmit` is misleading on its own — it reflects a smaller
denominator (CPU usage dropped) while the per-packet absolute cost dropped.
Operational notes:
- Flag is opt-in. `--udp-gso` requires `--udp-sendmmsg` ; without that flag
the batch state never accumulates and GSO has nothing to flush. The
`--help` text states the dependency.
- GSO eligibility resets on every `_begin/_end` . Mixed-destination or
mixed-size workloads transparently fall back through the existing
`sendmmsg` and `udp_send` paths.
- Sticky disable on `EINVAL/ENOPROTOOPT` keeps a process running on an
un-virtio host or older kernel from hot-looping in the sticky failure
path. A WARNING line is logged once.
- Tested on Linux 6.8 + virtio-net (DO `c-4` ), `gso_max_segs=65535` . Older
hosts (kernel < 4.18 ) lack `UDP_SEGMENT` entirely ; the sticky-disable
path covers them.
Suggested next levers if more relay throughput is needed:
1. **Move loadgen off turnutils_peer.** The 240 k → 90 k tot_recv_msgs/30 s
gap at GSO is dominated by single-threaded peer reflection, not the TURN
server. A multi-thread peer or `pktgen` -style reflector would let us
measure the real ceiling.
2. **Per-peer connected relay sockets.** Same-destination is the GSO
eligibility predicate; a connected relay socket would always be
same-dest and would also save `route_lookup` per send.
3. ** `MSG_ZEROCOPY` on the GSO sendmsg.** `rep_movs_alternative` is still
3 % self in GSO, and zerocopy avoids the userspace→kernel copy.
Probably small for 32-B STUN packets; revisit when payloads are larger.
Artifacts (perf.data, sar/mpstat/pidstat, sweep logs, AB logs) are saved at
`perf-results-20260508-213056/` in the worktree.