# Performance iteration log

Running notes for the multi-iteration performance work on the UDP relay
data path. Pick this up to continue without re-deriving everything.

The harness, baseline command, and droplet topology are documented in
[CLAUDE.md](../CLAUDE.md) under "Load Test on DigitalOcean" — this file
captures the *deltas*: what was measured, what landed, what didn't, and
where the next round should go.

## Cumulative result

Five commits on `claude/beautiful-black-c3b741` between `727ec2ab`
("loadgen") and `321a2d18`:

| # | Commit | Optimization |
|---|---|---|
| 1 | `ce7e7e53` | Hoist `turn_server_get_engine()` out of per-packet hot path |
| 2 | `8e28491a` | `ioa_socket_check_bandwidth` early fast-exit; drop dead `if (!(s->done \|\| s->fd==-1))` in `send_data_from_ioa_socket_nbh` |
| 3 | `344360f6` | Cache `get_relay_socket_ss()` and `ioa_network_buffer_get_size()` in `write_to_peerchannel`, `handle_turn_send`, `read_client_connection` |
| 4 | `a6f6767f` | Inline `get_ioa_addr_len()` via `ns_turn_ioaddr.h` |
| 5 | `321a2d18` | Inline `addr_cpy()` via `ns_turn_ioaddr.h` |

Current `relay-recvmmsg` follow-up:

| # | Commit | Optimization |
|---|---|---|
| 6 | `54c589d0` / `4b1a8d71` | Initial Linux `recvmmsg` batching for UDP listener and connected relay sockets |
| 7 | `8d9a7292` | Share the existing `--udp-recvmmsg` flag across listener and relay UDP paths; remove separate relay flag; use the shared ancillary-data parser in `dtls_listener` |
| 8 | `d48686b7` | Reduce relay per-socket `recvmmsg` state from 16 x 64 KiB cmsg buffers to TTL/TOS-sized buffers, avoid an extra would-block fallback `recvmsg`, and clean up all preallocated buffers after partial batches |
| 9 | `ad81705e` | Add per-engine `recvmmsg` occupancy counters and 10 s log summaries (`calls`, `packets`, `avg_batch`, `wouldblock`, `unavailable`, `no_buffer`, batch-size histogram) |
| 10 | `388b15d4` | Move connected relay UDP `recvmmsg` scratch from per-socket state to per-engine/per-thread state |
| 11 | `4c4fd67e` | Make the occupancy summaries opt-in behind `--udp-recvmmsg-log`, so `--udp-recvmmsg` can ship without periodic stats logs |

Validation after #7-#11:

- Local `cmake -S . -B build -DBUILD_TESTING=ON` passed.
- Local `cmake --build build --parallel 8` passed.
- Local `ctest --test-dir build --output-on-failure` passed 3/3.
- Local `build/bin/turnserver --udp-recvmmsg --udp-recvmmsg-log --version`
  parsed both flags and printed `4.11.0`.
- Linux Docker `turnserver` build passed after #7, after #8, and after #10.

Shipping cleanup learning: keep the occupancy counters in place because they
are low overhead and useful for DigitalOcean diagnostics, but keep the periodic
summaries off by default. Use `--udp-recvmmsg-log` only during measured runs
where the log stream is part of the observation.

DigitalOcean check on 2026-05-09:

- Reused the existing `c-4` droplets in `nyc1`: turnserver public
  `157.230.3.102`, private `10.116.0.2`; loadgen public `167.99.153.216`,
  private `10.116.0.3`. Droplets were left running between steps.
- Built fresh current artifacts from `d48686b7` on both droplets under
  `/root/coturn_recvmmsg_current`.
- Same-binary `--udp-recvmmsg` off/on, `-Y packet -m 1 -l 120`, 5 alternating
  30 s rounds each:
  - off mean 154,527, median 154,596, stdev 3,467
  - on mean 149,994, median 153,011, stdev 7,174
  - on was -2.9 % by mean and -1.0 % by median
- Same-binary `--udp-recvmmsg` off/on, `-Y packet -m 100 -l 120`, 5 alternating
  rounds each. The client completed before the 30 s timeout and landed in two
  send-volume buckets, so treat this as a coarse many-connection signal:
  - off mean 59,432, median 65,071, stdev 7,952
  - on mean 59,640, median 65,421, stdev 7,963
  - on was +0.3 % by mean and +0.5 % by median
- Follow-up `m=100 -n 1000` run, 3 alternating rounds each, derived receive
  count from `tot_recv_bytes / 120` because this log format omits
  `tot_recv_msgs`:
  - off mean 8,540, median 8,990, stdev 1,004
  - on mean 8,857, median 8,749, stdev 759
  - on was +3.7 % by mean and -2.7 % by median
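
The percentage lines above are plain relative deltas of "on" against "off". A minimal sketch that recomputes them from the reported means and medians (the helper name `pct_delta` is illustrative, not part of the harness):

```python
def pct_delta(off: float, on: float) -> float:
    """Relative change of `on` versus `off`, in percent."""
    return (on - off) / off * 100.0


# (label, off, on) triples copied from the summaries above.
reported = [
    ("m=1 mean", 154_527, 149_994),
    ("m=1 median", 154_596, 153_011),
    ("m=100 mean", 59_432, 59_640),
    ("m=100 median", 65_071, 65_421),
    ("m=100 -n 1000 mean", 8_540, 8_857),
    ("m=100 -n 1000 median", 8_990, 8_749),
]

# Round to one decimal, matching the precision used in the log.
deltas = {label: round(pct_delta(off, on), 1) for label, off, on in reported}

for label, delta in deltas.items():
    print(f"{label}: {delta:+.1f} %")
```

Every delta is smaller than the stdev of its own series, which is why none of these runs counts as a clear win or loss.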

Learning: the corrected relay `recvmmsg` implementation is now buildable and
much safer for many connections, but these droplet runs still do not show a
clear throughput win. Keep `--udp-recvmmsg` opt-in for now. The next useful
step is to instrument actual batch occupancy on connected relay sockets; if
most readiness events return one datagram, `recvmmsg` will mostly add setup
work without reducing syscalls.

DigitalOcean occupancy check on 2026-05-09:

- Built fresh current artifacts from `388b15d4` on both droplets under
  `/root/coturn_recvmmsg_current`.
- Same-binary `--udp-recvmmsg` off/on, `-Y packet -m 1 -l 120`, 3 alternating
  30 s rounds each:
  - off mean 153,133, median 153,608, stdev 4,383
  - on mean 148,452, median 149,711, stdev 10,833
  - on was -3.1 % by mean and -2.5 % by median
- `m=1` occupancy from the on runs: 1,129,427 `recvmmsg` calls returned
  17,660,300 packets, average batch 15.64. Histogram buckets:
  `hist_1=1,353`, `hist_2=1,496`, `hist_3_4=3,707`,
  `hist_5_8=14,817`, `hist_9_16=1,108,057`; 98.1 % of calls were in the
  `9..16` bucket.
- Same-binary `--udp-recvmmsg` off/on, `-Y packet -m 100 -l 120`, 3 alternating
  runs each:
  - off mean 55,443, median 50,679, stdev 8,369
  - on mean 60,596, median 65,404, stdev 8,383
  - on was +9.3 % by mean and +29.1 % by median, but the client again landed
    in two send-volume buckets, so treat the throughput delta as noisy.
- `m=100` occupancy from the on runs across all relay threads: 1,426,401
  `recvmmsg` calls returned 16,188,946 packets, average batch 11.35.
  Histogram buckets: `hist_1=83,057`, `hist_2=79,781`,
  `hist_3_4=130,066`, `hist_5_8=188,259`, `hist_9_16=945,238`; 66.3 %
  of calls were in the `9..16` bucket.
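
The occupancy figures follow directly from the counters: average batch is `packets / calls`, and the bucket share is the `hist_9_16` count over `calls`. A small sketch using the numbers above (the function name `occupancy` and the dict layout are illustrative, not the server's log format):

```python
def occupancy(calls: int, packets: int, hist_9_16: int) -> dict:
    """Derive average batch size and the share of calls in the 9..16 bucket."""
    return {
        "avg_batch": round(packets / calls, 2),
        "top_bucket_pct": round(100.0 * hist_9_16 / calls, 1),
    }


# Counter values copied from the m=1 and m=100 "on" runs above.
m1 = occupancy(calls=1_129_427, packets=17_660_300, hist_9_16=1_108_057)
m100 = occupancy(calls=1_426_401, packets=16_188_946, hist_9_16=945_238)

print("m=1  ", m1)
print("m=100", m100)
```

An average batch near 16 at `m=1` means the receive side already pays roughly one syscall per 15 to 16 datagrams, so little receive-side syscall overhead is left to remove.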

Learning: receive-side occupancy is high. The earlier hypothesis that
`recvmmsg` was mostly returning one packet is wrong for this harness. The
remaining bottleneck is after receive: per-packet callbacks, TURN processing,
and especially one `sendto` per relayed packet. The per-thread scratch change
is still worth keeping for memory/cache behavior with thousands of sockets,
but the next performance lever should be send-side batching or a design that
passes batches deeper instead of immediately decomposing them back into
single-packet callbacks.

Result for commits #1-#5 (the original cumulative run, on the older droplet
pair listed under "Test setup that was used"): alternating A/B, m=1 packet
flood, 30 s per run, with a 4 s warm-up between binary swaps:

- Baseline (clean `master` binary): mean 146,984 round-trips / 30 s
- Cumulative (all 5 iters): mean 155,468 round-trips / 30 s
- **+5.8 % throughput**

Per-iteration deltas were within run-to-run noise (~5–10 % variance).
The cumulative effect is what's visible.

## Test setup that was used

Two `c-4` Ubuntu 24.04 droplets in `nyc1`, same VPC `default-nyc1`.
Current active pair:

- `coturn-turnserver` — public `157.230.3.102`, private `10.116.0.2`
- `coturn-loadgen`    — public `167.99.153.216`, private `10.116.0.3`

Older pair used for the iter 5 cumulative run:

- `coturn-turnserver` — public `68.183.121.197`, private `10.116.0.2`
- `coturn-loadgen`    — public `68.183.132.220`, private `10.116.0.3`

Created via the DigitalOcean v2 API (`doctl` is *not* installed; use
`curl` + `$DIGITALOCEAN_TOKEN` from the user's `~/.zshrc`). SSH via
`~/.ssh/id_rsa` (matches DO ssh key id `23704483`, fingerprint
`37:3a:9b:e3:1e:1a:9b:42:a0:6f:58:f5:5a:3a:6a:2c`).

State on the turnserver droplet (kept across iterations):

- `/root/coturn_clean.tar` — `git archive HEAD` of master at start of run.
  Re-extract this before applying any new patch.
- `/root/coturn_baseline/build/bin/turnserver` — clean baseline binary,
  used as the "B" in every A/B round. **Don't overwrite.**
- `/root/coturn/build/bin/turnserver` — current iteration binary.
- `/root/start_turnserver.sh`, `/root/baseline_run.sh` — helper scripts.

State on the loadgen droplet:

- `/root/coturn/build/bin/turnutils_uclient`, `turnutils_peer`.
- `turnutils_peer` runs as a daemon on `10.116.0.3:3480`
  (`pid` in `/root/peer.pid`).

A small env file was written to `/tmp/coturn_perf_env.sh` on the local
machine with the IPs / droplet IDs — recreate it from the current
state of the DO account if it gets lost.

The standard packet-flood command (matches CLAUDE.md baseline, runs without
`--udp-recvmmsg`; add `--udp-recvmmsg` to `turnserver`, not the client, for the
batched listener/relay receive path):

```bash
timeout -s INT 30s /root/coturn/build/bin/turnutils_uclient \
  -Y packet -m 1 -l 120 \
  -e 10.116.0.3 -r 3480 -X -g \
  -u user -W secret \
  10.116.0.2
```

Metric: the `tot_recv_msgs` field on the last `start_mclient:` log
line. (This is round-trips through the relay over the test window;
`send_pps` is loadgen-side only and can hit 262 K even when the relay
is dropping most of them, so it's not a useful proxy for relay
throughput.)
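
A minimal sketch for pulling that metric out of a captured client log. It assumes the field is printed as `tot_recv_msgs=<integer>` on lines containing `start_mclient:`; check that against real `turnutils_uclient` output before relying on it. The sample text is made up for illustration.

```python
import re
from typing import Optional

# Assumed field format; adjust the pattern if the real log differs.
_FIELD = re.compile(r"tot_recv_msgs=(\d+)")


def last_tot_recv_msgs(log_text: str) -> Optional[int]:
    """Return tot_recv_msgs from the last start_mclient: line, or None."""
    value = None
    for line in log_text.splitlines():
        if "start_mclient:" not in line:
            continue
        match = _FIELD.search(line)
        if match:
            value = int(match.group(1))  # later lines overwrite earlier ones
    return value


# Hypothetical log excerpt, not captured from a real run.
sample = """\
0: start_mclient: tot_send_msgs=100, tot_recv_msgs=90
1: start_mclient: tot_send_msgs=200, tot_recv_msgs=180
"""

print(last_tot_recv_msgs(sample))
```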

## Hot-path map at the end of iter 5

`perf record -F 99 -g` on the turnserver during a 12 s `-Y packet -m 1`
run, sorted by user-space self-time:

```
0.80 % send_data_from_ioa_socket_nbh
0.76 % socket_input_worker
0.69 % read_client_connection.isra.0
0.60 % turn_report_session_usage
0.53 % peer_input_handler
0.51 % udp_server_input_handler
0.35 % udp_recvfrom               # was 0.76 % at iter 1
0.34 % lm_map_get
0.27 % stun_is_channel_message_str
0.27 % get_relay_socket
0.26 % ioa_socket_check_bandwidth # was 0.33 % at iter 1
0.26 % udp_send                   # was 0.60 % at iter 1
0.18 % ioa_network_buffer_get_size
```

Total user-space coturn cycles: ~5–7 % of the relay thread.
The relay thread sits at ~100 % CPU pinned to one core; the 4 relay
threads aren't parallelised by the m=1 single-flow test (one 5-tuple
hashes to one SO_REUSEPORT worker).
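
The single-core pinning can be illustrated with a toy model of tuple-hash worker selection. This is only a sketch: the kernel uses its own hash, not CRC32, and the source port and listener port below are hypothetical values.

```python
import zlib


def pick_worker(src_ip: str, src_port: int, dst_ip: str, dst_port: int,
                n_workers: int, proto: str = "udp") -> int:
    """Map a flow's 5-tuple to a worker index with a deterministic hash."""
    key = f"{proto}|{src_ip}:{src_port}|{dst_ip}:{dst_port}".encode()
    return zlib.crc32(key) % n_workers


# One client flow (hypothetical ports) sending many packets to 4 workers.
flow = ("10.116.0.3", 40000, "10.116.0.2", 3478)
workers_hit = {pick_worker(*flow, n_workers=4) for _ in range(10_000)}

# Every packet of the flow carries the same tuple, so one worker sees them all.
print(len(workers_hit))
```

Whatever the hash function, identical tuples give identical results, so a single-flow test cannot spread across relay threads.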

Kernel side (children-aggregated) is the real cost:

```
36 % udp_sendmsg (sendto path)
14 % udp_recvmsg
17 % ip_finish_output / ip_output / __dev_queue_xmit
~23 % syscall enter / exit machinery (sysret, SYSRETQ, SYSCALL_64*)
```

That ~23 % syscall overhead is the next big lever. Halving it
(via batching) is worth ~10 % wall-clock CPU.
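
The estimate is simple proportional arithmetic: if a component accounts for a fraction of total cycles and becomes `k` times cheaper, the saving is `fraction * (1 - 1/k)`. A sketch under that assumption (it ignores second-order effects such as cache behavior):

```python
def cpu_saved(fraction: float, speedup: float) -> float:
    """Share of total cycles saved when a component that uses `fraction`
    of them becomes `speedup` times cheaper."""
    return fraction * (1.0 - 1.0 / speedup)


# ~23 % syscall enter/exit machinery, halved by batching.
syscall_saving = cpu_saved(0.23, 2.0)

# For comparison: the ~5 % user-space coturn share, halved.
userspace_saving = cpu_saved(0.05, 2.0)

print(f"syscall machinery halved: {syscall_saving:.1%}")
print(f"user-space halved:        {userspace_saving:.1%}")
```

The same formula shows why user-space micro-optimizations are hard to see per iteration: the best case for halving a 5 % share is 2.5 %.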

## What didn't work

### Default `--udp-recvmmsg=true` on Linux — opt-in in iters 1–11, **shipped default-on later**

Original finding (iters 1–11): the flag applied the 16-buffer batch path to
**every connected per-session relay socket**, which only ever receives one
flow. Throughput parity or slight negative results were confirmed across
multiple A/B rounds on `m=1` and `m=100` — the per-session prealloc churn ate
the listener-side win — so the flag was kept opt-in.

Resolution: a later change scoped `recvmmsg` to **shared fan-in sockets only**
(the `udp_recvmmsg_eligible` flag in
[ns_ioalib_impl.h](../src/apps/relay/ns_ioalib_impl.h) — set on the client
listener and, under `--multiplex-peer`, the per-thread shared relay socket).
Per-session relay sockets now stay on the single-recv path regardless of the
flag, so the per-session tax that motivated keeping it opt-in is gone. With the
cost surface reduced to the one genuine fan-in point (the listener), the flag
ships **on by default on Linux**; operators opt out with `--udp-recvmmsg=false`.
DTLS session sockets remain on the SSL read path and are never batched.

### Caching `get_relay_socket_ss` (iter 3) — no measurable wall-clock win

The function is `static inline` already and the underlying
`get_relay_socket()` is a four-line accessor. Caching the result
*does* save a cross-TU function call per packet (the compiler can't
prove `get_relay_socket` pure across the
`set_df_on_ioa_socket` / `ioa_network_buffer_*` calls in between),
which the perf profile picked up as a small redistribution, but
throughput stayed in the noise band. Kept anyway: the cleanup is
defensible and matches the iter 4/5 inlining direction.

## Methodology lessons

- **Always alternate A/B per round** rather than running 5×B then 5×I.
  The droplet pair has noticeable environmental drift over a few
  minutes (other tenants on the hypervisor, NIC ring backpressure,
  whatever); sequential blocks bias whichever binary ran on the worse
  half of the run.
- **Discard the first run after a turnserver restart.** The loadgen's
  first run after a server restart is consistently 30–80 % slower
  than steady-state — looks like channel/permission state in the
  client side warming up, not the server. A 4 s "throwaway" run
  before the measured 30 s run is enough.
- **Run-to-run variance is ~5–10 %** even with alternation. Plan on
  6–8 rounds (≈ 8 minutes wall-clock) before claiming a sub-10 % win.
  A single 3-round A/B will lie to you.
- **Use the `tot_recv_msgs` field, not `send_pps`**. Loadgen send rate
  saturates at ~262 K pps regardless of relay capacity — it's
  whatever the loadgen kernel will accept into its UDP send buffer.
  The receive count is what made it round-trip through the relay.
- **The relay is kernel-bound.** User-space coturn is ~5 % of cycles.
  Halving it gives at most ~2.5 % wall-clock — usually undetectable
  per-iteration, only visible cumulatively. Don't expect a 10 % jump
  from a CSE.
- **Single-flow tests pin one core.** With `SO_REUSEPORT` the kernel
  hashes 5-tuples to worker sockets; one client → one tuple → one
  worker thread. The other 3 cores sit idle. To exercise all 4 relay
  threads you'd need m≥4 *with distinct source ports* — ours don't
  spread cleanly because the loadgen reuses ports.
- **Re-extract `/root/coturn` from `/root/coturn_clean.tar` before every
  iteration, and apply with `patch -p1`.** The droplet copy is *not* a git
  checkout (it's the `git archive` tar), so don't rely on `git apply`-style
  workflows there. Each iteration uploaded a *cumulative* diff (current
  branch vs `master`), which only applies cleanly to the pristine tree.
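
The alternation and warm-up rules above translate into a simple schedule and time budget. A minimal sketch, assuming every measured run is preceded by its own 4 s throwaway run and ignoring the time spent swapping binaries (the function names are illustrative):

```python
from typing import List, Tuple

WARMUP_S = 4    # throwaway run after each binary swap
MEASURE_S = 30  # measured run


def alternating_schedule(rounds: int) -> List[Tuple[int, str]]:
    """Interleave baseline and iteration runs within every round."""
    return [(r, binary)
            for r in range(1, rounds + 1)
            for binary in ("baseline", "iteration")]


def wall_clock_s(rounds: int) -> int:
    """Lower bound on wall-clock time for the whole A/B session."""
    return len(alternating_schedule(rounds)) * (WARMUP_S + MEASURE_S)


for rounds in (6, 8):
    print(rounds, "rounds:", wall_clock_s(rounds) / 60, "min")
```

Six to eight rounds come out at roughly 7 to 9 minutes before restart overhead, in line with the "≈ 8 minutes" planning figure.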

## Optimization backlog (bigger fish for next session)

Ordered by expected impact for the m=1 packet-flood metric:

1. **Batch the send side (`sendmmsg`) or pass receive batches deeper.** The
   occupancy counters show receive batching is already working: `m=1` averaged
   15.6 packets per call and `m=100` averaged 11.4. The code immediately
   invokes the existing per-packet callback for each received datagram, and
   each forwarded packet still pays a separate send syscall. The next
   measurable lever is to queue per-thread outbound datagrams during a receive
   batch and flush them with `sendmmsg`, or introduce a batch-aware callback
   path for the hot UDP relay case.

2. **Keep `recvmmsg` occupancy counters available while developing send
   batching.** They are cheap enough for targeted performance builds and make
   it obvious whether a benchmark is exercising one relay thread or all relay
   threads. The periodic logs are already opt-in behind `--udp-recvmmsg-log`
   (#11), so the counters themselves can stay in shipped builds.

3. **GSO (`UDP_SEGMENT`)** on the send path. Linux can take one
   "large" datagram and segment it in the kernel for back-to-back
   packets to the same destination. Our channel-data flood IS
   same-destination. Setting `UDP_SEGMENT` and submitting a single
   `sendmsg` of N×packet_size cuts skb-alloc / `__dev_queue_xmit`
   work substantially. Needs careful handling for short tails and
   non-uniform sizes; complementary to (1).

4. **Inline more cross-TU per-packet accessors.** Pattern from iter
   4/5 still applies: `addr_eq` (called per channel-data packet for
   permission lookup), `ioa_network_buffer_get_size`,
   `get_ioa_socket_type` / `_app_type`. Each is small enough; the
   only reason to be cautious is they're declared in `ns_turn_ioalib.h`
   which is part of the public-ish server library API — moving the
   body inline doesn't break ABI but does require a recompile of all
   consumers. Likely <1 % each but cheap to do.

5. **Re-evaluate `--udp-recvmmsg` default after instrumentation.** *(Done.)*
   Scoping `recvmmsg` to shared fan-in sockets removed the per-session-relay
   tax that blocked default-on; the listener is a genuine fan-in point that
   benefits whenever client concurrency is non-trivial and costs little when
   idle (few packets ⇒ few prealloc cycles). Now ships on by default on Linux,
   with `--udp-recvmmsg=false` as the opt-out. See the resolved entry under
   "What didn't work" above.

## Things investigated and ruled out (don't redo)

- `set_socket_ttl` / `set_socket_tos` already short-circuit on
  no-change via `s->current_ttl != ttl` / `s->current_tos != tos`.
  In a steady-state flood the per-packet call returns immediately
  without `setsockopt`. Already optimized.
- `set_df_on_ioa_socket` similarly guarded
  ([ns_ioalib_engine_impl.c:242](../src/apps/relay/ns_ioalib_engine_impl.c#L242)).
- `turn_report_session_usage` slow path runs once per 4096 packets
  (see iter 1 commit); the per-call overhead is now ~3 reads + 1
  bitmask test + 1 conditional return.
- `MSG_CONFIRM` in `sendto` would skip ARP refresh, but
  `neigh_resolve_output` + `neigh_hh_output` show ~17 % combined in
  perf only because we're sending *that many* packets — per-packet
  it's the normal cached neighbor path, not a refresh.
- Increasing `MAX_TRIES` from 16 to 64 in `socket_input_worker`
  doesn't change syscall count; it only delays returning to libevent.
  Useless without (1) above.
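
The "once per 4096 packets" pattern mentioned for `turn_report_session_usage` can be sketched as a counter with a power-of-two mask. This is a model of the idea only; the class and field names are hypothetical and the real function's bookkeeping is not reproduced here.

```python
REPORT_INTERVAL = 4096  # power of two, so a mask test can replace a modulo


class SessionUsage:
    """Toy per-session counter with a rarely taken slow path."""

    def __init__(self) -> None:
        self.packets = 0
        self.slow_path_runs = 0

    def on_packet(self) -> None:
        self.packets += 1
        if self.packets & (REPORT_INTERVAL - 1):
            return  # fast path: counter is not a multiple of 4096
        self.slow_path_runs += 1  # stand-in for the real reporting work


usage = SessionUsage()
for _ in range(10 * REPORT_INTERVAL):
    usage.on_packet()

print(usage.slow_path_runs)
```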

## How to resume

1. Verify the droplets are still up (the IPs above). If they were
   destroyed, re-create with `c-4` / `nyc1` / `default-nyc1` VPC and
   the `pavel` SSH key (id 23704483).
2. Re-upload `/tmp/coturn_clean.tar` from `git archive master` and
   rebuild `/root/coturn_baseline/build/bin/turnserver` if the
   baseline binary is gone. The A/B harness depends on having both
   binaries side-by-side on the turnserver droplet.
3. Run a 6-round alternating A/B as a sanity check that the current
   tip-of-branch still beats `master` by ~5 %. If it doesn't, the
   environment drifted and the baseline needs re-anchoring.
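4. Pick the next item from the backlog. Item (1), send-side batching or
   passing receive batches deeper, was the next candidate after the occupancy
   check; the `sendmmsg` and `--udp-gso` sections below record how those
   experiments went.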

## 2026-05-03 sendmmsg follow-up

A later run on two DigitalOcean CPU-optimized `c-4` droplets in `sfo3`
(`10.124.0.2` turnserver, `10.124.0.3` loadgen) tested an experimental
Linux-only `--udp-sendmmsg` flag with `--udp-recvmmsg`.

| Run | Code/flags | Generator max pps | Generator avg pps | Server RX avg pps | Server TX avg pps | Server TX peak pps | CPU avg | Perf conclusion |
| --- | --- | ---: | ---: | ---: | ---: | ---: | ---: | --- |
| iter0 | baseline, `--udp-recvmmsg` | 335,872 | 286,721 | 360,900 | 257,357 | 323,488 | 97.8% | `sendto`/`udp_sendmsg` dominates |
| iter1 | `--udp-sendmmsg`, both directions | 409,600 | 312,662 | 428,184 | 197,300 | 260,453 | 99.8% | `sendmmsg` path dominates; TX regressed |
| iter2 | `sendmmsg` only for batches >= 4 | 393,216 | 315,393 | 398,121 | 163,626 | 215,068 | 98.9% | Threshold did not recover TX |
| iter3 | listener-side batching only | 425,984 | 286,038 | 376,444 | 210,050 | 332,417 | 97.4% | Peak ingress/TX improved, average TX still below baseline |

Validation result: `sendmmsg()` is not a proven general win for this workload.
It can increase generator max pps and peak server TX, but average delivered
server TX stayed below the `--udp-recvmmsg` baseline. Keep it opt-in until a
follow-up change proves better end-to-end relay throughput.

Perf still points at per-datagram kernel transmit cost:

- baseline: `udp_send -> sendto -> __sys_sendto -> udp_sendmsg -> udp_send_skb -> ip_output`
- sendmmsg variants: `udp_sendmmsg_flush -> __sendmmsg -> __sys_sendmmsg -> ___sys_sendmsg -> udp_sendmsg -> ip_output`

The key observation is that `sendmmsg()` reduces syscall entry count but still
walks `udp_sendmsg` and the IP output path once per datagram. On this workload,
the extra `mmsghdr` copy/looping overhead can offset the syscall savings.

Deferred bigger refactors from this run:

- Per-peer connected UDP relay sockets or a destination cache could reduce
  address handling and route lookup for repeated peer sends, but it changes
  relay socket semantics and receive filtering.
- Shard a single hot allocation/flow across multiple relay workers only with a
  careful design for ordering, session accounting, socket ownership, and lock
  contention.
- Investigate `io_uring` send batching or kernel-bypass style transmit only as
  a larger architecture experiment.
- Consider a purpose-built benchmark mode that measures delivered relay pps at a
  controlled input rate. The current saturated packet flood is useful for
  finding hot functions but can obscure end-to-end delivery changes.

## 2026-05-09 UDP-GSO send path (`--udp-gso`)

Realizes the GSO backlog item from the iter-5 backlog above. The recvmmsg /
sendmmsg follow-ups confirmed that on this workload the dominant cost is the
per-datagram kernel TX path (`udp_sendmsg → ip_finish_output → __dev_queue_xmit
→ start_xmit`), which mmsg-style batching does not collapse. UDP-GSO (Linux
`UDP_SEGMENT` cmsg) does collapse it: N same-destination, same-size datagrams
are submitted as one `sendmsg` carrying an iovec; the kernel allocates one
super-skb that traverses the network stack once and is split at egress (NIC).

Implementation lives in [src/apps/relay/ns_ioalib_engine_impl.c](../src/apps/relay/ns_ioalib_engine_impl.c)
and reuses the existing `--udp-sendmmsg` batch state. Eligibility (same fd,
same dest, same size, ≤ 1472 B per datagram) is tracked on every
`udp_sendmmsg_enqueue`; eligible flushes go through `udp_gso_attempt_flush`
ahead of the `sendmmsg` loop, with an automatic sticky disable on
`EINVAL/ENOPROTOOPT` so a kernel/NIC without GSO support gracefully falls back.
The relay-side `socket_udp_read_batch_recvmmsg` now wraps its callback loop
in `udp_sendmmsg_batch_begin/end` so peer→client sends triggered inside a
recvmmsg batch can also coalesce — without that wrapping, the relay path
issues one `sendto` per delivered datagram.
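
The eligibility rule (same fd, same destination, same size, at most 1472 B per datagram) can be modeled as state updated on every enqueue. The sketch below is a Python model, not the C implementation; `SendBatch`, `enqueue`, and `flush_path` are hypothetical names, and the fd and addresses are placeholders.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

MAX_GSO_DATAGRAM = 1472  # per-datagram byte limit quoted in the text

Dest = Tuple[str, int]


@dataclass
class SendBatch:
    """Hypothetical stand-in for the per-thread send batch state."""
    fd: Optional[int] = None
    dest: Optional[Dest] = None
    size: Optional[int] = None
    gso_eligible: bool = True
    datagrams: List[bytes] = field(default_factory=list)

    def enqueue(self, fd: int, dest: Dest, payload: bytes) -> None:
        if not self.datagrams:
            # The first datagram defines what the rest must match.
            self.fd, self.dest, self.size = fd, dest, len(payload)
            self.gso_eligible = len(payload) <= MAX_GSO_DATAGRAM
        elif (fd, dest, len(payload)) != (self.fd, self.dest, self.size):
            # One mismatch disqualifies the whole batch.
            self.gso_eligible = False
        self.datagrams.append(payload)

    def flush_path(self) -> str:
        if not self.datagrams:
            return "empty"
        return "gso" if self.gso_eligible else "sendmmsg"


uniform = SendBatch()
for _ in range(4):
    uniform.enqueue(7, ("10.116.0.5", 3480), b"x" * 120)

mixed = SendBatch()
mixed.enqueue(7, ("10.116.0.5", 3480), b"x" * 120)
mixed.enqueue(7, ("10.116.0.5", 3481), b"x" * 120)  # different destination

print(uniform.flush_path(), mixed.flush_path())
```

A fresh `SendBatch` per receive batch mirrors the reset on every `_begin/_end`; whether a one-datagram batch should bypass GSO is left out of this model.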

DigitalOcean validation on 2026-05-09 — fresh nyc1 `c-4` droplets (turn
`10.116.0.4`, load `10.116.0.5`), all variants built from the same source tree
under `/root/coturn/build`, `-Y packet -m 1 -l 120`, monitor window via `sar
-n DEV` for `eth1`, `mpstat`, `pidstat`. The 12 s sweep first established the
ordering, then a 30 s alternating A/B (`baseline → gso → baseline → gso`)
confirmed the magnitude of the delta:

| Variant | eth1 RX pps | eth1 TX pps | sys CPU | idle CPU |
|---|---:|---:|---:|---:|
| baseline_r1 | 322,091 | 127,445 | 22.9% | 67.5% |
| `--udp-recvmmsg --udp-sendmmsg --udp-gso` (gso_r1) | 266,068 | **257,996** | 15.0% | 78.7% |
| baseline_r2 | 309,475 | 125,573 | 20.9% | 70.7% |
| gso_r2 | 275,992 | **225,366** | 14.9% | 74.3% |

Mean server forwarding rate (eth1 TX): baseline 126,509 pps → GSO 241,681 pps,
**+91 % (1.91×)**, with mean system CPU dropping from 21.9 % to 14.9 % — about
**2.8× CPU efficiency** in TX pps per system-CPU-%.
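
The headline figures can be re-derived from the four A/B rows. A short sketch, where "efficiency" is taken to mean eth1 TX pps divided by system-CPU percent, as in the sentence above (the helper name is illustrative):

```python
# (eth1 TX pps, sys CPU %) per round, copied from the table above.
baseline_rounds = [(127_445, 22.9), (125_573, 20.9)]
gso_rounds = [(257_996, 15.0), (225_366, 14.9)]


def mean_tx_and_cpu(rounds):
    """Average TX pps and system CPU % across rounds."""
    tx = sum(r[0] for r in rounds) / len(rounds)
    cpu = sum(r[1] for r in rounds) / len(rounds)
    return tx, cpu


base_tx, base_cpu = mean_tx_and_cpu(baseline_rounds)
gso_tx, gso_cpu = mean_tx_and_cpu(gso_rounds)

tx_ratio = gso_tx / base_tx
efficiency_ratio = (gso_tx / gso_cpu) / (base_tx / base_cpu)

print(f"TX: {base_tx:,.0f} -> {gso_tx:,.0f} pps ({tx_ratio:.2f}x)")
print(f"TX pps per sys-CPU-%: {efficiency_ratio:.1f}x")
```

With only two rounds per variant this is a magnitude check rather than a variance estimate.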

12 s packet sweep, all four variants, mean send_pps reported by uclient (used
only for ordering — for absolute throughput trust eth1 TX above):

| Variant | m=1 | m=2 | m=4 | m=8 | m=16 | m=32 |
|---|---:|---:|---:|---:|---:|---:|
| baseline | 230,401 | 150,189 | 187,055 | 174,771 | 160,871 | 167,789 |
| `--udp-recvmmsg` | 255,660 | 148,824 | 174,767 | 142,997 | 150,743 | 144,200 |
| `--udp-recvmmsg --udp-sendmmsg` | 231,766 | 146,776 | 148,826 | 136,542 | 148,955 | 143,575 |
| `--udp-recvmmsg --udp-sendmmsg --udp-gso` | 136,876 | 147,458 | 124,250 | 131,081 | 137,636 | 114,714 |

The uclient generator reports its own send rate, which drops with GSO because
the loadgen droplet's `turnutils_peer` becomes the new bottleneck: it is
single-threaded and cannot reflect 240 k pps. The 30 s `eth1` capture is the
authoritative server-side metric. The sweep is retained only for ordering: its
GSO row is below the `recvmmsg+sendmmsg` row at every `m` except `m=2`, so it
cannot show on its own whether GSO regresses server-side in the `m=2..32`
range.

Perf children share, m=1 12 s perf record on the turnserver process:

| Symbol | baseline | recvmmsg | recvsendmmsg | gso |
|---|---:|---:|---:|---:|
| `__x64_sys_sendto` (children) | 43.6 % | 47.6 % | 22.8 % | 0.0 % |
| `__x64_sys_sendmsg` (children) | — | — | — | **38.1 %** |
| `__x64_sys_sendmmsg` (children) | — | — | 27.0 % | 0.0 % |
| `udp_sendmsg` | 38.8 % | 41.9 % | 20.6 % | 35.9 % |
| `__dev_queue_xmit` | 18.5 % | — | — | 29.3 % |
| `skb_segment` (egress GSO split) | absent | absent | absent | 2.2 % |
| `syscall_return_via_sysret` (self) | 7.2 % | 4.7 % | 4.4 % | 2.4 % |
| `entry_SYSCALL_64_after_hwframe` (self) | 4.1 % | 3.6 % | 2.6 % | 1.8 % |

In the GSO column the per-packet kernel-stack cost is now amortized across
the segments of a single super-skb. The proportional rise of
`__dev_queue_xmit` is misleading on its own — it reflects a smaller
denominator (CPU usage dropped) while the per-packet absolute cost dropped.

Operational notes:

- Flag is opt-in. `--udp-gso` requires `--udp-sendmmsg`; without that flag
  the batch state never accumulates and GSO has nothing to flush. The
  `--help` text states the dependency.
- GSO eligibility resets on every `_begin/_end`. Mixed-destination or
  mixed-size workloads transparently fall back through the existing
  `sendmmsg` and `udp_send` paths.
- Sticky disable on `EINVAL/ENOPROTOOPT` keeps a process on a non-virtio host
  or older kernel from retrying the failing GSO path on every flush. A
  WARNING line is logged once.
- Tested on Linux 6.8 + virtio-net (DO `c-4`), `gso_max_segs=65535`. Older
  hosts (kernel <4.18) lack `UDP_SEGMENT` entirely; the sticky-disable
  path covers them.
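
The sticky-disable behavior in the notes above amounts to a small state machine: try GSO until it fails with `EINVAL` or `ENOPROTOOPT`, then warn once and use the fallback from then on. A Python model under those assumptions (`GsoFlusher` and the fake senders are hypothetical, not the C code):

```python
import errno


class GsoFlusher:
    """Try GSO first; disable it permanently after an 'unsupported' errno."""

    STICKY_ERRNOS = (errno.EINVAL, errno.ENOPROTOOPT)

    def __init__(self, gso_send, fallback_send, log):
        self._gso_send = gso_send
        self._fallback_send = fallback_send
        self._log = log
        self.gso_disabled = False

    def flush(self, batch):
        if not self.gso_disabled:
            try:
                return self._gso_send(batch)
            except OSError as exc:
                if exc.errno not in self.STICKY_ERRNOS:
                    raise  # unrelated errors are not treated as "no GSO"
                self.gso_disabled = True
                self._log("WARNING: UDP GSO unsupported, falling back")
        return self._fallback_send(batch)


gso_attempts = []
warnings = []


def unsupported_gso(batch):
    # Fake sender that behaves like a kernel without UDP_SEGMENT.
    gso_attempts.append(batch)
    raise OSError(errno.EINVAL, "UDP_SEGMENT rejected")


def fallback(batch):
    # Fake fallback that reports how many datagrams it "sent".
    return len(batch)


flusher = GsoFlusher(unsupported_gso, fallback, warnings.append)
results = [flusher.flush([b"a", b"b"]) for _ in range(3)]

print(results, len(gso_attempts), len(warnings))
```

The failed batch is re-sent through the fallback in the same `flush` call, so no datagrams are dropped on the transition.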

Suggested next levers if more relay throughput is needed:

1. **Move loadgen off turnutils_peer.** The 240 k → 90 k tot_recv_msgs/30 s
   gap at GSO is dominated by single-threaded peer reflection, not the TURN
   server. A multi-threaded peer or `pktgen`-style reflector would let us
   measure the real ceiling.
2. **Per-peer connected relay sockets.** Same-destination is the GSO
   eligibility predicate; a connected relay socket would always be
   same-dest and would also save `route_lookup` per send.
3. **`MSG_ZEROCOPY` on the GSO sendmsg.** `rep_movs_alternative` is still
   3 % self in GSO, and zerocopy avoids the userspace→kernel copy.
   Probably small for 32-B STUN packets; revisit when payloads are larger.

Artifacts (perf.data, sar/mpstat/pidstat, sweep logs, AB logs) are saved at
`perf-results-20260508-213056/` in the worktree.