## Summary
- New `--udp-gso` flag (Linux, requires `--udp-sendmmsg`) collapses
same-destination, same-size sendmmsg batches into a single `sendmsg`
with a `UDP_SEGMENT` cmsg, so the kernel allocates one super-skb that
traverses the network stack once and is segmented at egress instead of
running `udp_sendmsg → ip_finish_output → __dev_queue_xmit` per
datagram.
- Also wraps the relay-side `recvmmsg` callback loop in
`udp_sendmmsg_batch_begin/end` so peer→client sends triggered inside a
recv batch can also coalesce — without that wrapping the relay path
issues one `sendto` per delivered datagram.
- Sticky-disable on `EINVAL/ENOPROTOOPT` for older kernels/NICs that
lack UDP-GSO; one warning logged, then transparent fallback to the
existing `sendmmsg` and `udp_send` paths.
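
For reviewers unfamiliar with UDP-GSO, the single-`sendmsg` shape described above can be sketched as follows. This is a minimal illustration, not the code in this PR: `gso_prepare_msg` and its parameters are hypothetical names, and the fallback `UDP_SEGMENT` / `SOL_UDP` defines are only there for older libc headers.

```c
#include <assert.h>
#include <netinet/in.h>
#include <netinet/udp.h>
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>

#ifndef SOL_UDP
#define SOL_UDP 17
#endif
#ifndef UDP_SEGMENT
#define UDP_SEGMENT 103 /* value from linux/udp.h, for older libc headers */
#endif

/*
 * Hypothetical helper: describe `total_len` bytes of back-to-back payloads
 * as one message that the kernel splits into `seg_size`-byte datagrams.
 * `ctrl` must be suitably aligned for a struct cmsghdr.
 */
static void gso_prepare_msg(struct msghdr *msg, struct iovec *iov,
                            void *ctrl, size_t ctrl_len,
                            void *payload, size_t total_len, uint16_t seg_size,
                            const struct sockaddr *dst, socklen_t dst_len) {
  assert(ctrl_len >= CMSG_SPACE(sizeof(uint16_t)));

  memset(msg, 0, sizeof(*msg));
  memset(ctrl, 0, ctrl_len);

  /* One iovec covering every coalesced datagram. */
  iov->iov_base = payload;
  iov->iov_len = total_len;

  msg->msg_name = (void *)dst;
  msg->msg_namelen = dst_len;
  msg->msg_iov = iov;
  msg->msg_iovlen = 1;
  msg->msg_control = ctrl;
  msg->msg_controllen = CMSG_SPACE(sizeof(uint16_t));

  /* The UDP_SEGMENT cmsg carries the per-datagram size as a 16-bit value. */
  struct cmsghdr *cm = CMSG_FIRSTHDR(msg);
  cm->cmsg_level = SOL_UDP;
  cm->cmsg_type = UDP_SEGMENT;
  cm->cmsg_len = CMSG_LEN(sizeof(uint16_t));
  memcpy(CMSG_DATA(cm), &seg_size, sizeof(seg_size));
}
```

A caller would then pass the prepared header to `sendmsg(fd, &msg, 0)` once per batch, and treat `EINVAL` / `ENOPROTOOPT` as the signal to fall back.
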
## Why
The `--udp-recvmmsg` and `--udp-sendmmsg` follow-ups confirmed (see
[docs/PerformanceIterationLog.md](docs/PerformanceIterationLog.md)) that
on the relay flood workload the dominant cost is the per-datagram kernel
TX path. mmsg-style batching reduces only the syscall entry/exit, not
the per-skb stack traversal — UDP-GSO collapses both.
## Result
DigitalOcean nyc1 c-4, 30 s alternating A/B, `-Y packet -m 1`, eth1 TX
as the authoritative server forwarding metric:

| Variant | eth1 RX (pps) | eth1 TX (pps) | sys CPU | idle CPU |
|---|---:|---:|---:|---:|
| baseline (no flags) | 322,091 | 127,445 | 22.9 % | 67.5 % |
| `--udp-recvmmsg --udp-sendmmsg --udp-gso` | 266,068 | **257,996** | 15.0 % | 78.7 % |
| baseline (no flags) | 309,475 | 125,573 | 20.9 % | 70.7 % |
| `--udp-recvmmsg --udp-sendmmsg --udp-gso` | 275,992 | **225,366** | 14.9 % | 74.3 % |

Mean server forwarding rate: **126.5 k → 241.7 k pps (+91 %, 1.91×)**,
mean system CPU **21.9 % → 14.9 %** — about **2.8× CPU efficiency** (TX
pps per system-CPU-%). Full perf-children comparison and methodology in
the new section of
[docs/PerformanceIterationLog.md](docs/PerformanceIterationLog.md).
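
The headline figures are plain means over the two runs of each variant. The sketch below shows that arithmetic; the helper names `mean` and `pps_per_cpu_pct` are illustrative only and do not exist in the tree.

```c
#include <assert.h>
#include <stddef.h>

/* Arithmetic mean of n samples. */
static double mean(const double *v, size_t n) {
  assert(n > 0);
  double sum = 0.0;
  for (size_t i = 0; i < n; i++) {
    sum += v[i];
  }
  return sum / (double)n;
}

/* "CPU efficiency" as used above: forwarded pps per percentage point of system CPU. */
static double pps_per_cpu_pct(double tx_pps, double sys_cpu_pct) {
  assert(sys_cpu_pct > 0.0);
  return tx_pps / sys_cpu_pct;
}
```

Feeding in the eth1 TX samples (127,445 and 125,573 for baseline; 257,996 and 225,366 with the flags) gives the 126.5 k and 241.7 k pps means, and dividing each by its mean system CPU share gives the roughly 2.8× efficiency ratio.
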
## Notes for reviewers
- `--udp-gso` is opt-in and requires `--udp-sendmmsg` (the help text
states the dependency). Without `--udp-sendmmsg` the batch state never
accumulates and GSO has nothing to flush.
- GSO eligibility resets on every `_begin/_end`. Mixed-destination,
mixed-size, or oversize batches transparently fall back through
`sendmmsg` / `udp_send`.
- Rebased onto current `master`; the recvmmsg dependency is already
merged via #1906.
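
The eligibility and sticky-disable rules above can be pictured roughly as follows. This is a simplified sketch under stated assumptions, not the PR's implementation: `queued_dgram`, `gso_batch_eligible`, and `gso_should_disable` are hypothetical names, the sketch is IPv4-only, and the two limits are assumed values (the kernel's segment cap varies by version).

```c
#include <assert.h>
#include <errno.h>
#include <netinet/in.h>
#include <stdbool.h>
#include <stddef.h>

#define GSO_MAX_SEGMENTS 64 /* assumption: kernel per-send segment cap, may differ */
#define GSO_MAX_BYTES 65507 /* assumption: largest IPv4 UDP payload */

/* Hypothetical record for one datagram waiting in the send batch. */
struct queued_dgram {
  struct sockaddr_in dst;
  size_t len;
};

static bool same_dst(const struct sockaddr_in *a, const struct sockaddr_in *b) {
  return a->sin_family == b->sin_family && a->sin_port == b->sin_port &&
         a->sin_addr.s_addr == b->sin_addr.s_addr;
}

/* True only for same-destination, same-size batches that fit in one GSO send. */
static bool gso_batch_eligible(const struct queued_dgram *q, size_t n) {
  assert(q != NULL);
  if (n < 2 || n > GSO_MAX_SEGMENTS) {
    return false; /* nothing to coalesce, or too many segments */
  }
  for (size_t i = 1; i < n; i++) {
    if (q[i].len != q[0].len || !same_dst(&q[i].dst, &q[0].dst)) {
      return false; /* mixed size or mixed destination */
    }
  }
  return q[0].len * n <= GSO_MAX_BYTES; /* reject oversize batches */
}

static bool gso_disabled = false;

/* Sticky disable: once the kernel rejects UDP_SEGMENT, never try it again. */
static bool gso_should_disable(int send_errno) {
  if (send_errno == EINVAL || send_errno == ENOPROTOOPT) {
    gso_disabled = true;
  }
  return gso_disabled;
}
```

Any batch that fails the check would simply go through the existing `sendmmsg` / `udp_send` path, which is why the fallback is transparent to callers.
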
## Test plan
- [x] `cmake --build build --target turnserver` (RelWithDebInfo + ASan
local builds clean)
- [x] `ctest --test-dir build --output-on-failure` — 3/3 unit tests pass
- [x] `examples/run_tests.sh` — TCP/TLS/UDP pass; DTLS pre-existing
failure on macOS environment, unrelated to this change
- [x] DigitalOcean A/B perf validation captured above
- [ ] Reviewer to confirm CI green on Linux build/test/CodeQL
---------
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
## Summary
Extends the existing Linux-only `--udp-recvmmsg` flag from the UDP
listener socket to also cover **connected per-session UDP relay
sockets**, so steady-state client→relay and peer→relay traffic on plain
UDP is read in batches of up to 16 datagrams per `recvmmsg(2)` instead
of one `recvmsg` per packet. DTLS sessions still go through the SSL read
path and are unchanged.
The flag stays **opt-in**: receive-side batching works correctly, but on
the current `m=1` / `m=100` benchmarks throughput is flat to slightly
negative — the bottleneck has moved past receive (see results below).
## What's in the change
- **Shared receive helpers** (`src/apps/relay/ns_ioalib_engine_impl.c`,
`src/apps/relay/ns_ioalib_impl.h`):
- `ioa_parse_udp_recvmsg_cmsg()` — single TTL/TOS/`IP_RECVERR` cmsg
parser used by both `udp_recvfrom()` and the new batch path. Replaces
the duplicated parser previously inlined in `dtls_listener.c` and
`udp_recvfrom()`.
- `ioa_init_recvmmsg_hdr()` — single initializer for
`mmsghdr`/`iovec`/cmsg/source-address fields, also used by the listener.
- New `IOA_UDP_RECVMMSG_MAX_BATCH = 16` constant; both listener and
relay paths now share it.
- **Connected relay batch read** (`socket_udp_read_batch_recvmmsg` in
`ns_ioalib_engine_impl.c`): called from `socket_input_worker` for
non-SSL UDP sockets when `--udp-recvmmsg` is on. Allocates per-message
`stun_buffer_list_elem`s, calls `recvmmsg(MSG_DONTWAIT)`, dispatches
each datagram through the existing `read_cb` path, and falls back
cleanly on `ENOSYS`/`EINVAL`/`EOPNOTSUPP` (auto-disables the flag) and
on `EAGAIN`/short-batch (releases unused buffers).
- **Per-engine scratch state**: the `mmsghdr[16]` / `iovec[16]` / cmsg /
src-addr arrays live on `ioa_engine`, not on every socket — keeps memory
flat at thousands of allocations.
- **TTL/TOS-sized cmsg buffers** in the listener: the listener
previously over-allocated `64 KiB` per slot; it now uses the same
TTL+TOS sizing as the relay path.
- **Opt-in occupancy stats** behind a new `--udp-recvmmsg-log` flag:
every 10 s the relay logs `udp-recvmmsg stats: calls=… packets=…
avg_batch=… wouldblock=… unavailable=… no_buffer=… hist_1=… hist_2=…
hist_3_4=… hist_5_8=… hist_9_16=…`. Counters are always tracked (cheap);
the periodic log is gated by the new flag so default operation is
silent.
- **CLI plumbing**: `--udp-recvmmsg-log` long option in
`mainrelay.c`/`mainrelay.h`, `cli_print_flag` entry in
`turn_admin_server.c`, doc updates in `README.turnserver`.
- **Docs**: `docs/PerformanceIterationLog.md` records the iteration
steps, validation, and two rounds of DigitalOcean A/B numbers.
`CLAUDE.md` load-test instructions updated to mention the new flag and
the `tot_recv_msgs` / `tot_recv_bytes` workaround.
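
As a rough picture of the batch read and the occupancy histogram described above, here is a minimal sketch. It is not the code in `ns_ioalib_engine_impl.c`: `batch_scratch`, `read_batch`, `batch_hist_bucket`, and the 2048-byte buffer size are hypothetical, and the real path also handles cmsg data, source addresses, and buffer-pool allocation.

```c
#define _GNU_SOURCE /* recvmmsg(2) and struct mmsghdr */
#include <assert.h>
#include <errno.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>

#define MAX_BATCH 16   /* mirrors IOA_UDP_RECVMMSG_MAX_BATCH */
#define MAX_DGRAM 2048 /* assumption: per-datagram buffer size */

/* Scratch arrays kept once (per engine in the PR), not per socket. */
struct batch_scratch {
  struct mmsghdr msgs[MAX_BATCH];
  struct iovec iov[MAX_BATCH];
  char bufs[MAX_BATCH][MAX_DGRAM];
};

/* Returns datagrams read, 0 if the socket would block, -1 on other errors. */
static int read_batch(int fd, struct batch_scratch *s) {
  memset(s->msgs, 0, sizeof(s->msgs));
  for (int i = 0; i < MAX_BATCH; i++) {
    s->iov[i].iov_base = s->bufs[i];
    s->iov[i].iov_len = MAX_DGRAM;
    s->msgs[i].msg_hdr.msg_iov = &s->iov[i];
    s->msgs[i].msg_hdr.msg_iovlen = 1;
  }
  int n = recvmmsg(fd, s->msgs, MAX_BATCH, MSG_DONTWAIT, NULL);
  if (n < 0 && (errno == EAGAIN || errno == EWOULDBLOCK)) {
    return 0;
  }
  return n; /* msgs[i].msg_len holds the length of datagram i */
}

/* Histogram bucket: 0 = 1, 1 = 2, 2 = 3..4, 3 = 5..8, 4 = 9..16 datagrams. */
static int batch_hist_bucket(int n) {
  assert(n >= 1 && n <= MAX_BATCH);
  if (n == 1) return 0;
  if (n == 2) return 1;
  if (n <= 4) return 2;
  if (n <= 8) return 3;
  return 4;
}
```

The function works on any datagram socket, so it can be exercised without a network by sending a few messages over an `AF_UNIX` / `SOCK_DGRAM` socketpair and reading them back in one call.
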
Remove the two engine implementations (NEV_UDP_SOCKET_PER_SESSION and
NEV_UDP_SOCKET_PER_ENDPOINT) and all the dispatch/selection logic around
them. NEV_UDP_SOCKET_PER_THREAD is now the sole, unconditional
implementation.
- mainrelay.h: removed _NET_ENG_VERSION enum, typedef, and
net_engine_version / net_engine_version_txt struct fields
- mainrelay.c: removed NE_TYPE_OPT CLI option, set_network_engine(),
per-endpoint branch in print_features(), and all remaining
net_engine_version references
- netengine.c: removed run_udp_listener_thread(),
setup_socket_per_endpoint_udp_listener_servers() (~190 lines),
setup_socket_per_session_udp_listener_servers() (~90 lines); simplified
setup_barriers(), setup_relay_server(), run_general_relay_thread(),
setup_general_relay_servers(), and setup_server() by eliminating all
engine-type conditionals
- turn_admin_server.c: replaced dynamic engine version lookups with
hardcoded values (3 / "UDP thread per CPU core") in CLI and HTTPS status
handlers
---------
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
A rewrite and reformat of the PostgreSQL.md file:
- Applies some minor formatting to the document.
- Makes the 'turn' user and 'coturn' database names consistent
throughout.
- Adds warnings about issues I encountered during my own
installation (the public schema error).
- Adds a short example of restarting a service with systemd.
Add a basic diagram of coturn's modules to Developer.md.
I also wanted to move the existing flow chart to mermaid so that our docs
don't depend on external tools, but I couldn't find the time to do it
properly (horizontal arrows are hard to represent in mermaid), so I
postponed that part.
This is the result of the mermaid diagram:

The flowchart will be kept updated:
- Use [drawio](https://app.diagrams.net/) to edit.
  1. Open [drawio](https://app.diagrams.net/) in a browser
  2. Menu → File → Open from ... → Device:
     select [FlowChart.html](FlowChart.html)
  3. Edit the flow chart
  4. Export to SVG:
     Menu → File → Export as... → SVG...
This is not long-term documentation work; it is a temporary proposal to
modernize the existing documentation.
- Create a docs folder
- Move existing documentation files into docs
- Split the one massive INSTALL file into smaller files (no text editing
was done except small markdown header formatting)
- Rename files to markdown for nicer rendering
- Bad md rendering is expected in many places