* Parameterize SQLite user-DB driver queries (eliminate SQL injection class)
The SQLite driver built every statement with snprintf, interpolating
caller-supplied values directly into the SQL text before sqlite3_prepare.
That left every value-carrying query injectable (the admin-panel delete
paths fixed in GHSA-v8hj-2xx7-xmp5 were the reachable instances; this
removes the underlying class for this backend).
Add a sqlite_prepare_bind() helper that prepares a statement using '?'
placeholders and binds the values as text parameters, and route all
value-carrying statements through it: users, secrets, origins, realm
options, oauth keys, permission IPs, and admin users. Values never enter
the SQL text again. The only remaining interpolation is the peer-ip
table name (allowed/denied), which cannot be a bound parameter and is
already validated against that whitelist by the caller; its realm/ip
values are now bound. Static, parameter-free statements are unchanged.
oauth timestamp/lifetime were previously rendered as numeric literals;
they are now formatted to decimal text and bound. The oauth_key columns
have integer affinity, so SQLite stores the bound text as the same
integer and reads it back identically (verified).
Add tests/test_sqlite_dbd.c: an interface test that drives the driver's
public vtable against a throwaway database, covering every converted
entry point plus a SQL-injection regression test. Run against the old
string-interpolated driver the seven behavioral tests pass unchanged
(parity) while the injection test fails (a boolean payload in del_user
deletes a whole realm's users); against the new driver all eight pass.
The driver is compiled in isolation via a small support/stub layer
(tests/test_sqlite_support.*) so the suite needs no running server.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* Parameterize PostgreSQL user-DB driver queries (eliminate SQL injection class)
Like the SQLite driver, dbd_pgsql.c built every statement with snprintf,
interpolating caller-supplied values into the SQL text and running it via
PQexec -- which on PostgreSQL also permits stacked queries, the highest-
impact instance of GHSA-v8hj-2xx7-xmp5.
Add a pq_exec_params() helper around PQexecParams and route all
value-carrying statements through it with $1,$2,... placeholders and the
values passed as out-of-band text parameters: users, secrets, origins,
realm options, oauth keys, permission IPs, and admin users. Values never
enter the SQL text. The only remaining interpolation is the peer-ip table
name (allowed/denied), which cannot be a bound parameter and is already
whitelisted by the caller; its realm/ip values are now bound. Static,
parameter-free SELECTs keep using PQexec.
oauth timestamp/lifetime and the realm-option value were numeric literals;
they are now formatted to decimal text and bound (PostgreSQL casts them to
the column's integer type, preserving behavior).
Add tests/test_pgsql_dbd.c with a link-seam libpq mock (test_pgsql_stub.c)
that captures each emitted command, whether it was parameterized, and the
bound parameter values -- so the driver is unit-tested with no server. The
tests assert every operation emits a parameterized query with caller values
bound out-of-band, and test_sql_injection_neutralized asserts a boolean
payload travels as a bound value, never as SQL text. Run against the old
driver all of these fail (it calls PQexec with values interpolated and binds
nothing); against the new driver all pass. Relay-side stubs are reused from
test_sqlite_support.c.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* Parameterize MySQL user-DB driver with prepared statements (eliminate SQL injection class)
dbd_mysql.c built every statement with snprintf and ran it via mysql_query(),
interpolating caller-supplied values into the SQL text -- the same SQL injection
class fixed for SQLite and PostgreSQL, now closed for the MySQL backend.
Convert the whole driver to the prepared-statement API. Two helpers carry the
risk:
- my_exec() prepares an INSERT/UPDATE/DELETE, binds text params, executes;
- my_query_rows() prepares a SELECT, binds text params and text result
columns, and invokes a per-row callback (replacing mysql_query +
mysql_store_result + mysql_fetch_row).
Every value-carrying statement now uses '?' placeholders with values bound
out-of-band; the read paths get small row callbacks. The only remaining
interpolation is the peer-ip table name (allowed/denied), which cannot be a
bound parameter and is already whitelisted by the caller; its realm/ip values
are bound. oauth timestamp/lifetime and the realm-option value are formatted to
decimal text and bound (MySQL casts to the column's integer type).
Add tests/test_mysql_dbd.c with a link-seam libmysqlclient mock
(test_mysql_stub.c) that captures the prepared SQL and bound parameters, and can
feed one canned result row so the read paths' bind-result/fetch handling is
exercised (get_user_key, get_oauth_key, get_admin_user, get_auth_secrets all
round-trip). The mock also implements the classic mysql_query/store_result API
so the suite links and runs against the old driver: there every test fails ("got
mysql_query()") while all pass against the new one, and
test_sql_injection_neutralized asserts a boolean payload travels as a bound
value. Relay-side stubs are reused from test_sqlite_support.c (with
ur_string_map_free added there).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
---------
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
I attempted to use `make-man.sh` and it added a ton of man macro
formatting back in,
since it was a minor targeted fix I just updated `turnserver.1`
manually.
Fixes: https://github.com/coturn/coturn/issues/1917
---------
Co-authored-by: Corey Siltala <csiltala@atcorp.com>
Co-authored-by: Pavel Punsky <eakraly@users.noreply.github.com>
## Summary
Adds **`--multiplex-peer`**, a non-standard relay mode that replaces the
per-allocation peer-side port bind with **one shared IPv4+IPv6 UDP
socket pair per relay thread**. Sessions are demultiplexed by exact peer
IP:port in a per-thread `mp_table`. This lifts the ~16 k allocation cap
that the default 49152-65535 relay port range imposes, and dramatically
reduces kernel-level UDP receive-buffer drops under high pps.
Design and trade-offs: [docs/multiplex-peer.md](docs/multiplex-peer.md).
## What changes
### Server (turnserver)
- **`--multiplex-peer`** (cross-platform) — enable the shared per-thread
relay sockets. Replaces the per-session port bind. Implies sendmmsg
batching on Linux and default-enables `--udp-recvmmsg` (override with
`--udp-recvmmsg=0`). Incompatible with EVEN-PORT — those Allocates are
rejected with 400.
- **`--multiplex-peer-port <port>`** (cross-platform, default 3480) —
base port; thread `i` binds `<base>+2i` (IPv4) and `<base>+2i+1` (IPv6).
A 4-thread server consumes 8 ports.
- **`--udp-gso`** (Linux-only CLI) — UDP-GSO (`UDP_SEGMENT` cmsg) on the
relay send path. Requires `--multiplex-peer` (which is what enables the
sendmmsg batching GSO piggybacks on); passing `--udp-gso` alone is a
silent no-op.
- **CLI surface tightened**: `--udp-recvmmsg`, `--udp-recvmmsg-log`,
`--udp-gso` and their fields are now `#if defined(__linux__)` — absent
from `--help`, rejected with `unrecognized option`, and the code paths
compile out on macOS/Windows.
- **Windows portability**: `SO_REUSEPORT` in `mp_open_socket` wrapped in
`#ifdef` (MSVC's Winsock doesn't define it; REUSEPORT was defensive
anyway because the per-thread port layout is unique by construction).
- **`--sock-buf-size` honoured at startup**: the shared multiplex-peer
relay socket now calls `set_ioa_socket_buf_size` in `mp_open_socket` so
the configured rcvbuf is in effect from the moment the socket exists,
not deferred to the first Allocate.
### turnutils_uclient (loadgen)
- **`--no-even-port`** — force `ep = -1` on Allocate. The default path
randomly attaches EVEN-PORT (with no-R bit) even under `-c`, which
`--multiplex-peer` strictly rejects with 400; this flag makes
alloc-flood runs against multiplex-peer deterministic.
- **Legacy `timer_handler` now wraps the per-tick send batch with
`uclient_send_batch_begin/_end`** — without this, runs with
`--sender-threads 0` (the default for `-m < 4`) silently fell through
every send to plain `send(2)`. strace A/B: 205 k `sendto` → 61 k
`sendmsg` (GSO) + 4 k `sendmmsg` + small `sendto` residual for control.
## Measured impact (3-droplet DigitalOcean, c-4 / 4 vCPU, 8 concurrent
UDP streams, 45 s)
| | baseline | `--udp-recvmmsg` | `--multiplex-peer` | `--multiplex-peer
--udp-gso` |
|---|---:|---:|---:|---:|
| Server NIC rx pps (UDP relay both legs) | 350 k | 334 k | 326 k | 294
k |
| Server `UdpInDatagrams` pps | 279 k | 292 k | 300 k | 294 k |
| **Server `UdpRcvbufErrors` pps** | **71 k** | 42 k | 26 k | **0.3 k
(−99.6 %)** |
| **`turnserver` process CPU** | **387 %** | 205 % | 283 % | **133 %
(−65 %)** |
| Server host idle | 22 % | 49 % | 41 % | **68 %** |
Same loadgen-side packet rate (~2 M pps reported by uclient `send_pps`
after the legacy-path batching fix). Iteration log:
[docs/PerformanceIterationLog.md](docs/PerformanceIterationLog.md).
## Test plan
- [x] `ctest --test-dir build` — 3/3 pass (test_ioaddr, test_stun_msg,
test_http_server) on macOS + Linux.
- [x] `examples/run_tests.sh` — 4 protocols + 4 threaded + load-gen
smoke on Linux; 4 protocols on macOS.
- [x] `examples/run_tests_conf.sh` — same coverage, conf-driven.
- [x] `examples/run_tests_multiplex_peer.sh` — UDP/TCP/TLS/DTLS via
`--multiplex-peer --multiplex-peer-port=35000` on macOS + Linux.
- [x] Flag matrix smoke on macOS: `--multiplex-peer`,
`--multiplex-peer-port=42000`, `--multiplex-peer --udp-gso` (no-op),
`uclient --no-even-port`, `uclient --listener-threads N --sender-threads
M` — all pass; `--udp-recvmmsg` / `--udp-gso` correctly rejected with
`unrecognized option`.
- [x] Flag matrix smoke on Linux (Docker): same + `--udp-recvmmsg`
accepted, `--multiplex-peer` auto-enables `--udp-recvmmsg`,
`--udp-recvmmsg=0` overrides the auto-enable.
- [x] Windows compile fix verified — `SO_REUSEPORT` no longer referenced
unconditionally.
- [x] 3-droplet perf matrix completed; per-hop UDP counters captured.
## Docs updated
- New: [docs/multiplex-peer.md](docs/multiplex-peer.md)
- [README.turnserver](README.turnserver): full entries for
`--multiplex-peer`, `--multiplex-peer-port`, `--udp-gso`; clarified
`--udp-recvmmsg` auto-enable semantics.
- [README.turnutils](README.turnutils): added `--no-even-port`, plus
previously-undocumented `--listener-threads` / `--sender-threads`
loadgen pool flags.
- [examples/etc/turnserver.conf](examples/etc/turnserver.conf):
commented `udp-recvmmsg`, `udp-recvmmsg-log`, `udp-gso`,
`multiplex-peer`, `multiplex-peer-port` keys with one-paragraph
descriptions and pointer to `docs/multiplex-peer.md`.
- Man pages regenerated via `./make-man.sh`.
---------
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
## Summary
`examples/run_tests.sh` and `examples/run_tests_conf.sh` previously
covered only the legacy single-threaded uclient path. After recent PRs
they were no longer exercising:
- `--listener-threads` from #1911 (recv thread pool)
- `--sender-threads` from #1913 (send thread pool)
- `--udp-gso` from #1907 (server-side GSO send)
- the `recv_pps` metric introduced in #1913
A regression in any of those would have shipped silently. This PR
expands both scripts so every CI cycle hits all four features
end-to-end.
## What changes
**Per-protocol coverage doubled.** The four protocol tests (TCP / TLS /
UDP / DTLS) are factored into a shell function and each runs twice:
1. Default uclient flags — legacy single-thread paths.
2. `--listener-threads 1 --sender-threads 1` — engages both pools at
minimum non-zero size, so the slab counters, per-thread libevent base,
and pick_listener_base / pick_sender_id paths all fire.
The `grep "tot_send_bytes ~ 1000, tot_recv_bytes ~ 1000"` target is
stable across both variants because the workload (`-m 1 -n default=5 -l
200`) totals 1000 bytes either way — threading doesn't change the byte
count, only the machinery that produces it.
**Server-side GSO enabled on Linux.** `run_tests.sh` adds `--udp-gso`
next to the existing `--udp-recvmmsg`. `run_tests_conf.sh` writes both
keys into the generated `turnserver.conf` on Linux. The conf-file parser
uses the same long-options table as the CLI (see `mainrelay.c`
`long_options`), so the keys map 1:1.
**Load-gen smoke on Linux.** A short `-Y packet -m 4 -l 100 -c` run with
`--sender-threads 2` ends each script:
```bash
timeout -s INT 6s turnutils_uclient \
-Y packet -m 4 -l 100 -c -e 127.0.0.1 -g \
--listener-threads 1 --sender-threads 2 \
-u user -W secret 127.0.0.1
```
The grep asserts the progress print contains a **non-zero** `send_pps`
AND a **non-zero** `recv_pps`. Non-zero `recv_pps` is the canonical
signal that the listener-pool slab reduction (`recv_count_snapshot`) is
feeding back into the load-rate reporter — the codepath most likely to
silently drop output if the thread pool's stop ordering or slab plumbing
regresses.
`timeout -s INT 6s` triggers a clean SIGINT-driven exit; exit codes 0,
124, and 130 all count as success because we just want a fixed-duration
run.
## Test plan
Linux Docker (authoritative):
```
=== run_tests.sh ===
Using TURNSERVER_EXTRA_ARGS="--udp-recvmmsg --udp-gso"
Running turn client TCP OK
Running turn client TLS OK
Running turn client UDP OK
Running turn client DTLS OK
Running turn client TCP (threaded) OK
Running turn client TLS (threaded) OK
Running turn client UDP (threaded) OK
Running turn client DTLS (threaded) OK
Running turn client UDP load-gen smoke OK
=== run_tests_conf.sh ===
Running turn client TCP OK
Running turn client TLS OK
Running turn client UDP OK
Running turn client DTLS OK
Running turn client TCP (threaded) OK
Running turn client TLS (threaded) OK
Running turn client UDP (threaded) OK
Running turn client DTLS (threaded) OK
Running turn client UDP load-gen smoke OK
```
9/9 in each. macOS local TCP test fails the same way it did before this
change (pre-existing flake on Darwin loopback TCP, unrelated to
threading or GSO).
- [x] Both scripts run end-to-end in Linux Docker against a fresh build.
- [x] `--help` flag list verified: `--listener-threads`,
`--sender-threads`, `-Y` all present.
- [x] No new dependencies; only uses `timeout` (already in coreutils on
Linux base images).
Fix UB in udp_recvfrom: recv_ttl/recv_tos were unsigned char locals whose addresses were cast to int * and passed to ioa_parse_udp_recvmsg_cmsg, which writes a full int through that pointer — clobbering 3 bytes of adjacent stack on every UDP receive. Manifested as a TTL/TOS bug on macOS.
Change the locals in udp_recvfrom to int so no pointer cast is needed.
Hoist the recv_ttl_t / recv_tos_t typedefs out of the two function bodies and into ns_ioalib_impl.h so they remain shared by the cmsg parser.
## Test plan
[x] cmake -S . -B build -DBUILD_TESTING=ON && cmake --build build -j && ctest --test-dir build --output-on-failure
[x] cd examples && ./run_tests.sh && ./run_tests_conf.sh
[x] Verify relayed UDP carries the expected TTL/TOS on macOS (previously corrupted)
[x] Linux build + system tests in Docker per CLAUDE.md
## Summary
Three related changes to `turnutils_uclient` that together unblock the
loadgen from being the bottleneck when benchmarking the relay:
1. **Sender thread pool** (`--sender-threads <N>`, max 4, auto-bumped to
2 at `-m >= 4`). Mirrors the listener pool that landed in #1911. Each
sender thread owns its own libevent base, a session shard (round-robin
assigned at allocation time via `elem->sender_id`), and a 100 µs timer
that runs the burst loop just like the legacy main-thread
`timer_handler` did. Send-side counters (`tot_send_messages`,
`tot_send_bytes`, `tot_send_dropped`, `load_sent_packets`) and the
completion accumulators in `client_timer_handler` (`total_loss` /
`total_latency` / `total_jitter`) are written into per-thread
cache-line-aligned slabs and reduced into the globals after
`pthread_join`. This avoids the cross-core atomic-counter contention
that the listener-pool work already documented.
2. **UDP-GSO send batching** in `send_buffer` for the plain-UDP path.
The sender pool opens a thread-local batch window around its per-tick
iteration; within the window, `send_buffer` copies the payload into a
per-thread slot and appends to a scatter-gather `iov[]`. On flush:
- **If `count > 1` and all segments share the same size** → one
`sendmsg(2)` with a `UDP_SEGMENT` cmsg.
- **If GSO is unavailable** (kernel returns
`EINVAL`/`ENOPROTOOPT`/`EOPNOTSUPP`) → sticky-disable per thread, fall
back to `sendmmsg(2)` over the same iov array.
- **Per-entry `send(2)`** as the final fallback for whatever sendmmsg
refused (EAGAIN tail, etc.).
Auto-flush triggers: different fd (next session in iteration), different
segment size, batch capacity (64), or end of iteration.
3. **`recv_pps` in `print_load_generator_rate`**, alongside the existing
`send_pps`. Once the sender pool + GSO let uclient push >>1 Mpps of UDP,
the meaningful end-to-end metric is the round-trip count, not the
send-side count — the relay/peer pipeline drops 95+% of packets when
uclient outpaces it. The progress line now reads:
send_pps=6012928.00, recv_pps=101486.00, total_sent=112975924,
total_recv=1853369
## Why
Benchmarking `--multiplex-client` / `--multiplex-peer` on a c-4
DigitalOcean droplet, the loadgen's single-threaded `timer_handler`
saturated one CPU around 300 kpps regardless of `-m`. The relay was
never put under real pressure, so the multiplex paths' value couldn't be
measured. With this patch the loadgen can produce >6 Mpps from a single
c-4 droplet, far above the relay's per-thread saturation point, so the
bottleneck moves to the server where it belongs.
## Benchmark — multiplex-client turnserver, c-4 loadgen, m=4, 20 s
| Round | OLD (master) | NEW (this PR) | Lift |
|-------|--------------|---------------|------|
| 1 | 246k send_pps | 7.48M | 30.4× |
| 2 | 459k | 6.06M | 13.2× |
| 3 | 360k | 5.07M | 14.1× |
| **avg** | **355k** | **6.20M** | **17.5×** |
Throughput cap shifts from loadgen to relay. End-to-end recv_pps (which
is now first-class in the progress line) is ~100 kpps in this
configuration — limited by the relay, not uclient.
## Design notes
- **Cache-line alignment** on `uclient_sender` mirrors the
listener-pool's slab pattern. Same false-sharing trap, same fix.
- **Main-thread timer slows to 10 ms** when the sender pool is engaged.
The main timer still fires for lifecycle / `__turn_getMSTime` refresh,
but `timer_handler` early-returns when `num_sender_threads > 0` so we
don't burn a core on no-op 100 µs ticks.
- **Stop ordering**: `stop_sender_threads()` runs before
`stop_listener_threads()` — the senders own session mutation (wmsgnum,
to_send_timems, shutdown), so joining them first prevents a race where a
listener accumulates a stat into a session whose owning sender is still
iterating it.
- **UDP-GSO copy**: the per-slot memcpy is intentional. The caller
(`client_write`) reuses `elem->out_buffer` across burst iterations, so
pointing `iov[i]` at the session buffer would alias all entries to the
most recent payload. A rotating per-session output ring would eliminate
the copy — left out of this PR because the kernel-side savings from
collapsing N sendmsg into one GSO sendmsg dominate the per-packet copy
cost at the rates we measured.
- **Linux-only**: send-side batching machinery is gated by `#if
defined(__linux__)`. Non-Linux builds get no-op
`uclient_send_batch_begin`/`_end` and `uclient_tx_enqueue` returns
false, falling through to the legacy `send(2)` loop.
## Test plan
- [x] macOS local build (Apple Silicon, AppleClang). Sender-pool code
paths compile under both Linux and non-Linux gates.
- [x] `clang-format-15 --dry-run --Werror` clean.
- [x] Linux build on a c-4 Ubuntu 24.04 droplet (`cmake
-DCMAKE_BUILD_TYPE=Release`).
- [x] `--help` includes the new `--sender-threads` option with
valid-range hint; out-of-range values rejected.
- [x] Benchmark on two c-4 droplets in nyc1 against `turnserver
--multiplex-client`: 3 alternating rounds OLD vs NEW, +17.5× average
send-side lift (data table above).
- [x] `print_load_generator_rate` output verified — `send_pps`,
`recv_pps`, `total_sent`, `total_recv` all populated and consistent
across listener slab reductions.
## Limitations
- `--multiplex-peer` is not driven by this PR. uclient's pattern (each
`-m N` opens two internal sessions per client that share the same peer
port) hits the multiplex-peer "one allocation per peer endpoint" rule;
benchmarking that flag at high concurrency requires a separate small
change (per-session secondary peer port) — not in scope here.
- The wider per-round variance under the sender pool (rounds in our
bench ranged 13×–30× lift) is timing/scheduler noise at small per-thread
shards. Smoothens out as `-m` and per-thread session counts grow.
> Updated 2026-05-11 — three follow-up improvements applied on top of
the original draft, with a single-trial real-Linux confirmation run.
Auto threshold lowered, per-listener counter slabs added (eliminating
the K=2/K=4 cache-line regression), per-elem stats atomicised.
## What
Adds an N-thread receive pool to `turnutils_uclient`. Main thread keeps
owning the sender timer, the lifecycle, and the control plane; `EV_READ`
events for client UDP sockets are routed to one of N listener threads,
each with its own libevent base. Sessions sharded round-robin across
listeners at allocation time and pinned to their owning listener for the
lifetime of the test.
### State ownership / synchronisation
- **Per-session bookkeeping** (`recvmsgnum`, `recvtimems`, `rmsgnum`):
mutated only by the owning listener thread — no locking, no atomics.
- **Per-session running stats** (`elem->loss/latency/jitter`): listener
does `__atomic_fetch_add`, the timer on main harvests with
`__atomic_exchange_n` (read-and-zero atomically). Closes the race that
existed in the original draft where a listener increment landing between
the timer's read and its zero-store would be lost.
- **Per-test totals** (`tot_recv_messages`, `tot_recv_bytes`, min/max
latency/jitter): each listener has its own cache-line-aligned slab
inside `uclient_listener`. The listener writes into its own slab (no
atomic); main reads on-demand via snapshot helpers (atomic loads, no
contention with writers); `stop_listener_threads()` folds slabs back
into the global totals before reporting. An early draft used a single
shared atomic counter and the K=2/K=4 path regressed ~19× on `-m {4,
8}`; the slabs close that gap.
Sits on top of the recvmmsg path landed in #1910.
## CLI
```
-K N, --listener-threads N
Number of receive threads. Default auto:
-m < 2 -> K=0 (no worker thread, recv on main event_base)
-m >= 2 -> K=1 (one worker thread, clean send/recv split)
-K overrides the auto rule. Capped at 4.
```
`start_listener_threads()` logs the resolved count and whether it came
from `-K` or the auto rule, so an operator can confirm what actually
ran.
## Real-Linux bench (DigitalOcean, 2× c-4 / 4 vCPU, private VPC)
**Single-trial confirmation** of this PR's follow-ups vs the previous
3-trial baseline.
### After follow-ups (this PR head, 1 trial, recv pps)
| `-m` | **K=0** | **K=1** | K=2 | K=4 |
|---:|---:|---:|---:|---:|
| 1 | **4545** | 3226 | 3226 | 3226 |
| 2 | 6250 | **4878** | 4878 | 153 |
| 4 | **7692** | 6557 | 371 | 350 |
| 8 | **8602** | 689 † | 708 | 199 |
| 16 | 1174 | **1339** | 1281 | 1283 |
| 32 | 1445 | 1517 | **2149** | 1421 |
† K=1 `-m=8` = 689 looks like a single-trial outlier (the matching K=0
row finishes in 0.93 s sending all 8000 packets; K=1 hit the 14 s
timeout with similar throughput). Re-running would clarify, but per the
user's "1 iteration" instruction this is the data we have.
### Comparison vs the previous 3-trial averages
| `-m` | metric | before follow-ups | after follow-ups | delta |
|---:|:---|---:|---:|:---|
| 4 | K=0 recv pps | 5123 | **7692** | **+50 %** |
| 8 | K=0 recv pps | 3967 | **8602** | **+117 %** |
| 2 | K=2 recv pps | 1749 | **4878** | **+179 %** |
| 4 | K=2 status | 344 pps, ~91 % loss | 371 pps, **2.1 %** loss |
regression unblocked |
| 8 | K=2 recv pps | 665 | **708** | +6 % (loss 11 % → 3.2 %) |
| 32 | K=2 recv pps | 1763 | **2149** | +22 % |
The K=0 hot path improved noticeably at mid-`m`. The K=2
cache-line-bouncing regression that originally made the auto-threshold
conservative is gone — K=2 at `-m=4` went from "9× worse than K=1" (344
pps) to "comparable to K=0 with low loss" (371 pps, 2.1 % loss). K=2 at
`-m=32` reached **2149 pps**, the new high-concurrency winner.
### Auto-rule change
With the K=2 regression unblocked, `UCLIENT_AUTO_LISTENERS_THRESHOLD`
lowered from 4 to 2. The auto-default now goes:
- `-m=1` → K=0 (still the lowest-overhead config; ~30 % faster than K=1
at this concurrency)
- `-m>=2` → K=1 (clean send/recv split; consistently strong from m=4
upward)
Explicit `-K` still overrides, so anyone curious about K=2/K=4 on
different hardware can dial them up — they're no longer landmines.
## Combined effect on top of #1910
vs `master` *before* either improvement, at `-m=8`: 148 → **8602 pps** =
**58× speedup**.
## Test plan
- [x] Builds clean on macOS (clang) and Linux (gcc, Debian Trixie via
container)
- [x] `make lint` clean for the modified files
- [x] CLI parsing smoke (`-K 99` rejected, `-K 0` works,
`--listener-threads` long form works)
- [x] Real-Linux single-trial bench across `K ∈ {0,1,2,4}` × `m ∈
{1,2,4,8,16,32}`
- [x] Auto-bump log line confirmed (`uclient: started 1 listener
thread(s) (auto)`)
- [x] Per-listener slab fold-back verified end-to-end (totals match
after `pthread_join`)
- [ ] Functional smoke: `examples/scripts/basic/relay.sh` +
`examples/scripts/basic/udp_c2c_client.sh`
- [ ] CI
## CMake / portability
- pthreads pulled in transitively via `turnclient` → `common`
(`find_package(Threads REQUIRED)`). No link-line change needed.
- `getopt_long()` requires `<getopt.h>` — included unconditionally on
POSIX so the long-option table compiles on macOS as well as glibc.
- `_Thread_local`, `__atomic_*` builtins, struct `aligned(64)`
attribute: C11 / GCC builtins, available on glibc + clang.
## Summary
Two improvements to `turnutils_uclient`'s receive path. With `-Y packet
-m N` workloads the loadgen was previously hitting client-side queue
overflow before the server was anywhere near saturated, so reported
"lost packets" reflected uclient's own kernel-buffer drops rather than
real server loss. Mirrors the recvmmsg work already landed in
`turnutils_peer` (#1908) and the relay (#1906) — this closes the loop on
the loadgen side.
- **`SO_RCVBUF`/`SO_SNDBUF` 64 KB → 4 MB.** New `UCLIENT_SOCK_BUF_SIZE`
constant in `uclient.h`, applied at the 3 socket-creation sites in
`startuclient.c`. `set_sock_buf_size()` already halves on `EPERM/EINVAL`
so this is safe under any `net.core.rmem_max`.
- **Linux-only `recvmmsg(2)` batched receive in
`client_input_handler`.** Refactored `client_read()` → extracted
`process_received_buffer()` so the per-packet processing
(channel/Indication/Data parsing, latency/jitter accounting) can run
after either a single `recv()` or a batched `recvmmsg()`. New
`client_read_batch_udp()` drains up to 32 datagrams per syscall;
`client_input_handler` dispatches to it for plain UDP only (no SSL/DTLS,
no TCP, no TCP-relay sub-connections). All other paths fall through to
the legacy single-`recv()` loop unchanged. Behind `#if
defined(__linux__)` with `_GNU_SOURCE`; static scratch buffers (32 ×
2048 B = 64 KB).
## Bench (local Docker, 3 trials per concurrency, identical server, only
uclient binary differs)
| `-m` | OLD recv pps | NEW recv pps | speedup | OLD loss | NEW loss |
|---:|---:|---:|---:|---:|---:|
| 1 | 58 | **71** | **+22%** | 42% | 27% |
| 2 | 78 | **124** | **+59%** | 60% | 33% |
| 4 | 135 | **211** | **+56%** | 64% | 45% |
| 8 | 145 | **359** | **+148%** | 80% | 50% |
| 16 | 224 | **591** | **+164%** | 84% | 54% |
Send PPS is essentially unchanged (uclient was never send-bound). Server
CPU stayed below 1% across all runs — confirming the bottleneck was the
loadgen, not the server.
## Notes
- macOS / *BSD keep the legacy single-`recv()` loop. The SO_RCVBUF bump
still applies and helps somewhat.
- The remaining loss at higher concurrency is loadgen single-event-loop
saturation, which is a larger restructuring (worker-pool uclient) and
out of scope for this PR.
- DTLS / TCP / SSL paths are deliberately left on the legacy code path —
`recvmmsg` doesn't apply.
## Test plan
- [x] Builds clean on macOS (clang) and Linux (gcc, Debian Trixie via
container)
- [x] `make lint` clean for the modified files
- [x] Functional smoke: `examples/scripts/basic/relay.sh` +
`examples/scripts/basic/udp_c2c_client.sh` still pass
- [x] Bench above (3 trials per `-m {1,2,4,8,16}`)
## Summary
- New `--udp-gso` flag (Linux, requires `--udp-sendmmsg`) collapses
same-destination, same-size sendmmsg batches into a single `sendmsg`
with a `UDP_SEGMENT` cmsg, so the kernel allocates one super-skb that
traverses the network stack once and is segmented at egress instead of
running `udp_sendmsg → ip_finish_output → __dev_queue_xmit` per
datagram.
- Also wraps the relay-side `recvmmsg` callback loop in
`udp_sendmmsg_batch_begin/end` so peer→client sends triggered inside a
recv batch can also coalesce — without that wrapping the relay path
issues one `sendto` per delivered datagram.
- Sticky-disable on `EINVAL/ENOPROTOOPT` for older kernels/NICs that
lack UDP-GSO; one warning logged, then transparent fallback to the
existing `sendmmsg` and `udp_send` paths.
## Why
The `--udp-recvmmsg` and `--udp-sendmmsg` follow-ups confirmed (see
[docs/PerformanceIterationLog.md](docs/PerformanceIterationLog.md)) that
on the relay flood workload the dominant cost is the per-datagram kernel
TX path. mmsg-style batching reduces only the syscall entry/exit, not
the per-skb stack traversal — UDP-GSO collapses both.
## Result
DigitalOcean nyc1 c-4, 30 s alternating A/B, `-Y packet -m 1`, eth1 TX
as the authoritative server forwarding metric:
| Variant | eth1 RX | eth1 TX | sys CPU | idle CPU |
|---|---:|---:|---:|---:|
| baseline (no flags) | 322,091 | 127,445 | 22.9 % | 67.5 % |
| `--udp-recvmmsg --udp-sendmmsg --udp-gso` | 266,068 | **257,996** |
15.0 % | 78.7 % |
| baseline (no flags) | 309,475 | 125,573 | 20.9 % | 70.7 % |
| `--udp-recvmmsg --udp-sendmmsg --udp-gso` | 275,992 | **225,366** |
14.9 % | 74.3 % |
Mean server forwarding rate: **126.5 k → 241.7 k pps (+91 %, 1.91×)**,
mean system CPU **21.9 % → 14.9 %** — about **2.8× CPU efficiency** (TX
pps per system-CPU-%). Full perf-children comparison and methodology in
the new section of
[docs/PerformanceIterationLog.md](docs/PerformanceIterationLog.md).
## Notes for reviewers
- `--udp-gso` is opt-in and requires `--udp-sendmmsg` (the help text
states the dependency). Without `--udp-sendmmsg` the batch state never
accumulates and GSO has nothing to flush.
- GSO eligibility resets on every `_begin/_end`. Mixed-destination,
mixed-size, or oversize batches transparently fall back through
`sendmmsg` / `udp_send`.
- Rebased onto current `master`; the recvmmsg dependency is already
merged via #1906.
## Test plan
- [x] `cmake --build build --target turnserver` (RelWithDebInfo + ASan
local builds clean)
- [x] `ctest --test-dir build --output-on-failure` — 3/3 unit tests pass
- [x] `examples/run_tests.sh` — TCP/TLS/UDP pass; DTLS pre-existing
failure on macOS environment, unrelated to this change
- [x] DigitalOcean A/B perf validation captured above
- [ ] Reviewer to confirm CI green on Linux build/test/CodeQL
---------
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…DP-GSO
The libevent EV_READ handler used to do one recvfrom + one sendto per
ready event, so a packet flood through the relay generated O(N) libevent
re-entries and 2N syscalls per N relayed datagrams — saturating one core
on the loadgen-side peer well below modern relay throughput.
On Linux, replace the handler with:
* a drain loop: keep recvmmsg'ing in MSG_DONTWAIT until the queue
returns less than a full batch, bounded by MAX_DRAIN_ROUNDS so a flood
can't starve the rest of the event loop;
* recvmmsg into a static mmsghdr[32] (peer is single-threaded) and reuse
the same mmsghdr array for sendmmsg back — each entry already has
msg_name pointing at the source (the echo destination) and the iovec
pointing at the received bytes, so no userspace copy;
* UDP-GSO: when the recvmmsg batch is homogeneous (≥2 entries, same
source, same size, ≤1472 B), echo it as one sendmsg with UDP_SEGMENT
cmsg so the kernel allocates one super-skb that traverses the network
stack once.
The non-Linux build keeps the original recvfrom/sendto handler.
DigitalOcean nyc1 c-4 30 s alternating A/B paired with the GSO
turnserver (-Y packet -m 1):
old peer: turn TX mean 228 k pps, peer CPU mean 91.0 % (saturated)
new peer: turn TX mean 255 k pps, peer CPU mean 28.8 %
Peer CPU drops 3.2× while turn-side throughput climbs ~12 % because the
old peer was no longer fully reflecting at the GSO turnserver's rate.
The peer is no longer the loadgen-side bottleneck, freeing CPU for
multi-flow tests.
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
## Summary
Extends the existing Linux-only `--udp-recvmmsg` flag from the UDP
listener socket to also cover **connected per-session UDP relay
sockets**, so steady-state client→relay and peer→relay traffic on plain
UDP is read in batches of up to 16 datagrams per `recvmmsg(2)` instead
of one `recvmsg` per packet. DTLS sessions still go through the SSL read
path and are unchanged.
The flag stays **opt-in**: receive-side batching works correctly, but on
the current `m=1` / `m=100` benchmarks throughput is flat to slightly
negative — the bottleneck has moved past receive (see results below).
## What's in the change
- **Shared receive helpers** (`src/apps/relay/ns_ioalib_engine_impl.c`,
`src/apps/relay/ns_ioalib_impl.h`):
- `ioa_parse_udp_recvmsg_cmsg()` — single TTL/TOS/`IP_RECVERR` cmsg
parser used by both `udp_recvfrom()` and the new batch path. Replaces
the duplicated parser previously inlined in `dtls_listener.c` and
`udp_recvfrom()`.
- `ioa_init_recvmmsg_hdr()` — single initializer for
`mmsghdr`/`iovec`/cmsg/source-address fields, also used by the listener.
- New `IOA_UDP_RECVMMSG_MAX_BATCH = 16` constant; both listener and
relay paths now share it.
- **Connected relay batch read** (`socket_udp_read_batch_recvmmsg` in
`ns_ioalib_engine_impl.c`): called from `socket_input_worker` for
non-SSL UDP sockets when `--udp-recvmmsg` is on. Allocates per-message
`stun_buffer_list_elem`s, calls `recvmmsg(MSG_DONTWAIT)`, dispatches
each datagram through the existing `read_cb` path, and falls back
cleanly on `ENOSYS`/`EINVAL`/`EOPNOTSUPP` (auto-disables the flag) and
on `EAGAIN`/short-batch (releases unused buffers).
- **Per-engine scratch state**: the `mmsghdr[16]` / `iovec[16]` / cmsg /
src-addr arrays live on `ioa_engine`, not on every socket — keeps memory
flat at thousands of allocations.
- **TTL/TOS-sized cmsg buffers** in the listener: the listener
previously over-allocated `64 KiB` per slot; it now uses the same
TTL+TOS sizing as the relay path.
- **Opt-in occupancy stats** behind a new `--udp-recvmmsg-log` flag:
every 10 s the relay logs `udp-recvmmsg stats: calls=… packets=…
avg_batch=… wouldblock=… unavailable=… no_buffer=… hist_1=… hist_2=…
hist_3_4=… hist_5_8=… hist_9_16=…`. Counters are always tracked (cheap);
the periodic log is gated by the new flag so default operation is
silent.
- **CLI plumbing**: `--udp-recvmmsg-log` long option in
`mainrelay.c`/`mainrelay.h`, `cli_print_flag` entry in
`turn_admin_server.c`, doc updates in `README.turnserver`.
- **Docs**: `docs/PerformanceIterationLog.md` records the iteration
steps, validation, and two rounds of DigitalOcean A/B numbers.
`CLAUDE.md` load-test instructions updated to mention the new flag and
the `tot_recv_msgs` / `tot_recv_bytes` workaround.
## Summary
`fuzzing/stun.dict` line 147 used C-style `\r\n\r\n` for the HTTP
end-of-headers keyword:
```
kw_http_eoh="\r\n\r\n"
```
libFuzzer's `ParseDictionaryFile` only accepts three escape sequences
inside quoted entries: `\\`, `\"`, and `\xAB` hex. `\r` / `\n` are
unrecognized, so a local fuzz run aborts dictionary load with:
```
ParseDictionaryFile: error in line 147
kw_http_eoh="\r\n\r\n"
```
Replace with the hex form used by the other 111 entries in the file:
```
kw_http_eoh="\x0d\x0a\x0d\x0a"
```
The turnserver man page had drifted from the actual CLI options the
binary accepts. The shipped `man/man1/turnserver.1` was last regenerated
on 05 June 2021, so several options added since then were missing and
one removed option was still documented.
The man page is auto-generated from `README.turnserver` via
`make-man.sh` (txt2man), so the source-of-truth edit is in the README;
the `.1` files are then regenerated.
In `README.turnserver`:
- Add 13 options that exist in `mainrelay.c` long_options[] but were
undocumented: --include-reason-string, --syslog-facility,
--drop-invalid-packets, --drop-invalid-packets-log, --udp-recvmmsg,
--respond-http-unsupported, --prometheus-address, --prometheus-path,
--version, --cpus, --no-cli, --no-rfc5780,
--response-origin-only-with-rfc5780.
- Document --sql-userdb as an alias on the existing --psql-userdb line.
- Remove the stale --ne=[1|2|3] entry (no longer parsed by the binary).
The regenerated `man/man1/turnserver.1` also picks up a backlog of
options that were already in the README but never reached the shipped
page (--software-attribute, --cli, --sock-buf-size, --raw-public-keys,
--stun-backward-compatibility, and the corrected --no-tlsv1_2 wording).
`man/man1/turnadmin.1` and `man/man1/turnutils.1` are regenerated as a
side-effect of `make-man.sh` running over all three READMEs; their
content was similarly stale relative to README.turnadmin /
README.turnutils.
## Summary
- The `--ne=[1|2|3]` option was already removed from `long_options[]`
and the option parser, so `turnserver` rejects it at runtime, but the
help text printed by `turnserver --help` still advertised it.
PR #1517 (Jun 2024) simplified codeql.yml in ways that left scans
incomplete: it dropped the actions:read / contents:read permissions and
the analyze category, both of which CodeQL Action requires for results
to land under the existing language category. Combined with the later
cpp -> c-cpp rename and v3 -> v4 upgrade, scheduled scans have not
refreshed the Security tab since Jun 1, 2024.
- Add actions:read and contents:read back to job permissions
- Set build-mode: manual on init (required for v3+/v4 manual builds)
- Pass category "/language:c-cpp" on analyze so SARIF de-duplicates
against the configured language
- Build with --parallel so the tracer keeps up on default runners
## Summary
- Add a self-contained Fil-C build/test harness under `filc/` that
mirrors the existing `fuzzing/` pattern: one host script
(`filc/run-local.sh`) builds an Ubuntu 24.04 image with the
[Fil-C](https://github.com/pizlonator/fil-c) optfil 0.678 toolchain,
builds turnserver with `CC=filcc`/`CXX=fil++`, runs unit tests + system
tests, and drops a per-run timestamped log directory with `SUMMARY.txt`
+ `ISSUES.txt`.
- Fix the two real Fil-C compatibility bugs the harness surfaces by
changing `ur_map_value_type` and `ur_addr_map_value_type` from
`uintptr_t` to `void *` in `src/server/ns_turn_maps.h`.
## Why
[Fil-C](https://fil-c.org) is a memory-safe C/C++ compiler (Clang 20
fork) that pairs every pointer with an "InvisiCap" capability and turns
UB into deterministic panics with no `unsafe` escape hatch. Putting
coturn through it answers two questions: (a) does it compile unmodified,
and (b) does it run correctly under capability-enforced memory safety.
After this PR, the answer is **yes** for both — turnserver,
turnutils_peer, and turnutils_uclient relay TCP/TLS/UDP/DTLS traffic
with full Fil-C enforcement, all unit tests pass, and
`examples/run_tests_conf.sh` runs end-to-end.
## What's in the PR
### `filc/` harness (commit 1)
| File | Purpose |
|---|---|
| `filc/Dockerfile` | Ubuntu 24.04 + Fil-C optfil 0.678 (extracts the
nested `fil.tar.xz` to `/opt/fil`); `--platform linux/amd64` so it works
on Apple Silicon under emulation. |
| `filc/run-local.sh` | Host-side: build image, create
`filc/logs/<UTC-ts>/`, run container with source mounted read-only and
log dir mounted r/w. |
| `filc/docker-entrypoint.sh` | In-container orchestrator. Phases: env /
source-copy / build / unit-tests / system-cli / system-conf. Runs every
phase even when a prior one fails (no aborting mid-pass). Captures
per-phase logs + a combined `all.log` + JUnit XML for ctest. Greps
panics/errors into `ISSUES.txt`. Downgrades `system-*` phases to FAIL
when `examples/run_tests*.sh` prints `FAIL` despite exiting 0 (existing
fragility in those scripts). |
| `filc/build.sh` | `cmake … -DBUILD_TESTING=ON -DCMAKE_C_COMPILER=filcc
-DCMAKE_CXX_COMPILER=fil++ -DCMAKE_BUILD_TYPE=RelWithDebInfo`, then
build. |
| `filc/.gitignore` | Ignore the on-host `logs/` dir. |
The harness also bumps the post-launch sleep in `examples/run_tests.sh`
from 2s to 6s **only inside the container** (sed-in-place on the copied
source; upstream is untouched). Under linux/amd64 emulation the
Fil-C-built turnserver isn't accepting TCP at 2s, so the first sub-test
races and prints `FAIL`. Matches the 5s sleep already used by
`run_tests_conf.sh`.
### Pointer-typedef fixes (commit 2)
`src/server/ns_turn_maps.h`:
```diff
-typedef uintptr_t ur_map_value_type;
+typedef void *ur_map_value_type;
...
-typedef uintptr_t ur_addr_map_value_type;
+typedef void *ur_addr_map_value_type;
```
**Why this is necessary.** Both maps store pointers, but their value
slot is integer-typed. Every existing `_put` site casts a pointer
through `(ur_*_value_type)` to store, and every `_get` site casts back.
Under standard C this is a well-defined no-op. Under Fil-C, casting a
pointer to `uintptr_t` discards its InvisiCap; casting back yields a
pointer with a non-null address but a NULL Fil-C object — the next
dereference panics with `cannot read pointer with null object`.
The harness caught two such panics, both in the auth-resume /
relay-allocate flow:
1. `src/server/ns_turn_server.c:3248` — `ss->client_socket`, where `ss`
came from `sessions_map` (a `ur_map`).
2. `src/apps/relay/turn_ports.c:225` — `tp->mutex`, where `tp` came from
`ip_to_turnports_*` (a `ur_addr_map`) via `turnipports_add`.
**Why this is also a correctness improvement on a normal build.** The
new typedef makes the API strictly more type-safe — the compiler now
enforces "you put a pointer in." It eliminates a class of accidental
misuse (storing a non-pointer integer where a pointer was expected) that
the integer typedef silently allowed. Same generated code on a normal
build; different (correct) Fil-C semantics.
**Audit.** Verified before changing:
- All `ur_map_put` / `lm_map_put` / `ur_addr_map_put` callers store
pointer-typed values exclusively (no callers store raw integers).
- No internal arithmetic on the value type anywhere in `ns_turn_maps.c`.
- `ur_map_del_func` / `ur_addr_map_func` implementations either don't
exist (all `_del` callers pass `NULL`) or immediately cast their
parameter to a real pointer type — no source change needed.
- `KHASH_MAP_INIT_INT64(3, ur_map_value_type)` works identically with
`void *`.
- `ur_addr_map`'s `addr_elem.value` is assigned, read, compared for
truthiness, and cleared with `= 0` — all valid for `void *`.
## Test plan
- [ ] `filc/run-local.sh` reports all six phases PASS (env / source-copy
/ build / unit-tests / system-cli / system-conf), `ISSUES.txt` carries
no Fil-C panic / safety / sanitizer entries.
- [ ] Local `cmake -S . -B build -DBUILD_TESTING=ON && cmake --build
build && ctest --test-dir build --output-on-failure` is green (no
regression on the regular build).
- [ ] `examples/run_tests.sh` and `examples/run_tests_conf.sh` are green
on Linux per `CLAUDE.md`.
- [ ] Existing `fuzzing/run-local.sh ASan 0 -runs=1` still passes (the
new `filc/` directory is independent and shouldn't perturb anything).
write_to_peerchannel(): get_relay_socket_ss() and
ioa_network_buffer_get_size() were each called twice per channel-data
packet. The compiler can't CSE the calls (cross-TU through a
get_relay_socket() accessor in ns_turn_allocation.c that it can't prove
pure), so cache the relay socket and the inbound size once.
handle_turn_send(): same get_relay_socket_ss() duplication on the
STUN_SEND path.
read_client_connection(): the inbound size was fetched four times
(received_bytes accumulator, verbose log, blen seed, ret check). Reuse
ret as orig_blen.
No behavior change. Targets the ~0.4% per-packet overhead these helpers
were contributing in the m=1 packet-flood profile.
This is a four-instruction accessor (read sa_family, return struct
sockaddr_{in,in6} size) that gets called from every per-packet sendto(),
recvmsg(), and addr-map lookup. Cross-TU it stays a real function call;
moving the body into ns_turn_ioaddr.h as static inline lets each call
site fold the family branch directly into the syscall setup.
perf record on the m=1 packet flood (c-4 nyc1) confirms the win:
- udp_recvfrom self-time: 0.76% -> 0.35% (-54%)
- udp_send self-time: 0.60% -> 0.26% (-57%)
End-to-end throughput stays in the run-to-run noise band, as
expected for a kernel-bound workload, but the released CPU is real.
ioa_socket_check_bandwidth(): hoist the "no bps limit configured"
fast-exit before the multi-condition socket-state check. The vast
majority of sessions have max_bps == 0, so the existing path was running
5+ pointer dereferences and equality tests just to land on the same
return-1.
send_data_from_ioa_socket_nbh(): drop the redundant inner "if (!(s->done
|| s->fd == -1))" gate. The outer if/else-if branch already filtered
those, and ioa_socket_tobeclosed() rechecks both, so the inner test was
dead code on every successful send.
perf record on the c-4 nyc1 droplet (m=1 packet flood, 12s) shows
send_data_from_ioa_socket_nbh self-time drop from 0.91% to 0.54% and
ioa_socket_check_bandwidth fall out of the top-25 user-space symbols
(was 0.33%). Throughput is within run-to-run noise — the relay is
syscall-bound, so user-space wins don't translate 1:1 — but the released
CPU is real.
Same pattern as the get_ioa_addr_len() inline: addr_cpy() is a
single-memcpy helper that fires on every receive (each packet dispatch
copies the source address into ioa_net_data, plus allocation/permission
map-key copies). Cross-TU it stays a real function call.
Combined with the previous four iterations (turn_server_get_engine
hoist, bandwidth fast-exit + dead-check removal, cached relay-socket and
buffer-size lookups, get_ioa_addr_len inline), the alternating A/B run
on the same c-4 nyc1 droplet now shows a consistent +5% throughput on
the m=1 packet flood test (recv_msgs/30s mean over 6 rounds: B=146984 /
I=155468).
turn_report_session_usage() runs on every packet but only does real
reporting work once per 4096 packets. Re-order the early returns so the
bitmask fast-exit fires before the cross-TU
turn_server_get_engine() call, and flatten the nested if-blocks into
guard clauses for readability.
No behavior change. A/B testing on a c-4 nyc1 droplet shows the
single-client packet-flood throughput within noise (alternating B/I
rounds: B=149317 / I=153844 mean recv_msgs over 30s, ~3% in iter1's
favor with ~10% run-to-run variance).
stun_is_challenge_response_str in src/client/ns_turn_msg.c only descends
into its three inner stun_attr_get_first_by_type_str calls when the
input is an error response with err_code 401 or 438 *and* a REALM
attribute *and* a NONCE attribute. The OAuth branch additionally
requires STUN_ATTRIBUTE_THIRD_PARTY_AUTHORIZATION.
The fuzzer-driven path in harness_attr_iter calls the predicate every
iteration but the conjunction of conditions is too specific for
libFuzzer to discover from binary mutation alone — OSS-Fuzz introspector
flags 9 unreached callsites on
stun_attr_get_first_by_type_str gated on this function.
Add harness_challenge_response_builder that constructs six deterministic
message variants on every iteration and runs each through the predicate:
- 401 with REALM + NONCE (canonical success)
- 401 with REALM + NONCE + THIRD-PARTY-AUTH (OAuth branch)
- 438 with REALM + NONCE (438 disjunct)
- 401 with REALM only (NONCE-missing path)
- 401 with no REALM (REALM-missing path)
- 400 with REALM + NONCE (wrong err_code path)
Each variant runs once with a non-NULL oauth pointer and once with NULL
to cover both branches of the optional output. Realm / nonce /
server-name lengths and the transaction id are derived from fuzz bytes
so iterations stay meaningfully distinct.
Verified by stand-alone harness:
- 401+REALM+NONCE returns true with attrs copied out, oauth=false
- 401+REALM+NONCE+TPA returns true with oauth=true and server_name
populated — confirming all three inner get_first_by_type_str callsites
and the OAuth disjunct are now exercised.
map_addr_from_public_to_private and map_addr_from_private_to_public in
src/client/ns_turn_ioaddr.c walk a static public_addrs[] table of size
mcount. Without an explicit ioa_addr_add_mapping call mcount stays 0 for
the entire fuzz process, so the loop body — including the
addr_eq_no_port call it gates — is dead code in every fuzz iteration.
OSS-Fuzz introspector flags this as 19 unreached callsites under
map_addr_from_public_to_private and 4+4 under addr_eq_no_port.
Extend the existing shared LLVMFuzzerInitialize in FuzzOpenSSLInit.c
(linked into both FuzzStun and FuzzStunClient via FUZZ_COMMON_SOURCES)
to register two synthetic public<->private mapping pairs — one v4
(192.0.2.1 <-> 10.0.0.1) and one v6 (2001:db8::1 <-> fd00::1) — once at
startup. The header comment for ioa_addr_add_mapping requires
single-threaded init before fuzzing begins, which matches exactly when
LLVMFuzzerInitialize runs.
Verified by stand-alone harness: after init,
stun_attr_get_first_addr_str on a XOR-MAPPED-ADDRESS attribute holding
192.0.2.1:443 returns 10.0.0.1:443, and the v6 equivalent returns
[fd00::1]:8080 — confirming addr_eq_no_port is now called inside the
loop body in both helpers.
OSS-Fuzz introspector flags three blockers the fuzzer cannot reach on
its own:
1. findstr() in src/client/ns_turn_msg.c is gated by is_http(), which
requires GET/POST/PUT/DELETE prefix + " HTTP/" + "\r\n\r\n". The
fuzzer's binary STUN seeds never synthesize a valid HTTP frame.
2. stun_attr_get_reservation_token_value() and
stun_attr_get_response_port_str() are called from harness_attr_iter only
when the input contains the matching attribute type. Neither appears in
the existing seed corpus.
Add HTTP framing keywords to fuzzing/stun.dict and four new seed files
covering both gaps:
- seed_http_get.raw: minimal "GET / HTTP/1.1\r\nHost: x\r\n\r\n"
- seed_http_post_clen.raw: POST with Content-Length to drive the strtoul
branch in is_http
- seed_reservation_token.raw: STUN allocate response with an 8-byte
RESERVATION-TOKEN attribute
- seed_response_port.raw: STUN binding request with a 4-byte
RESPONSE-PORT attribute
Each new STUN seed validated against the real parsers
(stun_get_message_len_str, stun_attr_get_first_by_type_str, is_http) to
confirm it reaches the targeted branch.
The corpus zips also drop pre-existing __MACOSX/ and .DS_Store entries
that had snuck in during a prior macOS zip step; net file count rises
(24 -> 28 in FuzzStun, 4 -> 8 in FuzzStunClient) while archive size
shrinks because of the junk removal.
Add harness_stun_buffer_api to FuzzStunClient.c that exercises every
public wrapper in src/apps/common/stun_buffer.c not already reached by
the existing harnesses: stun_get_size (NULL/non-NULL), the init_request
/ init_indication / init_success_response builders, the tid accessors,
the stun_is_indication wrapper (which gates the static is_channel_msg),
the attr_add / attr_add_channel_number / attr_add_addr /
attr_add_even_port (both branches) / attr_get_first_by_type accessors,
stun_set_allocate_request (rt NULL and non-NULL paths),
stun_set_binding_request /
stun_prepare_binding_request, and the channel-message wrappers.
Each builder call is followed by inspect_buffer_message so the resulting
serialized message is also walked by the parser predicates. A tail block
also pumps raw fuzzer bytes through the wrapper-form predicates
(stun_is_indication, stun_is_channel_message, stun_tid_from_message,
stun_attr_get_first_by_type) so they see malformed inputs the serializer
paths cannot produce.
## Summary
Introduces an opt-in unit test layer for coturn using
[Unity](https://github.com/ThrowTheSwitch/Unity) — a single-header
pure-C test framework that matches coturn's C11 toolchain, portability
bar, and zero-C++ production tree.
- Unity v2.6.0 is fetched on demand via CMake `FetchContent` (nothing
vendored).
- Tests are gated behind `-DBUILD_TESTING=ON` (off by default), so the
standard build and OSS-Fuzz pipeline are unaffected.
- Two test binaries cover pure C-callable code in `libturnclient`:
- `test_ioaddr` (6 cases) — `make_ioa_addr`,
`addr_get_port`/`addr_set_port`, `addr_eq` variants, `addr_to_string`,
IPv4/IPv6/garbage input
- `test_stun_msg` (7 cases) — STUN header construction,
request/indication/success/error response classification, transaction-ID
round-trip, channel message parsing, truncated/zeroed buffer rejection
- New `check` cmake target builds tests before running ctest (avoids the
`make test` footgun where the auto-generated `test` target only runs
already-built binaries).
- Legacy `Makefile.in` gets a `unit-tests` target that bootstraps
`build/unit-tests/` and delegates to the cmake `check` target. `make
check` and `make test` now run the RFC 5769 conformance suite **plus**
the Unity unit tests.
- CLAUDE.md documents the new workflow plus the one-liner for adding a
new `test_<name>.c`.
## Why
The existing test story is shell-script integration suites under
`examples/scripts/` — they exercise the binary end-to-end but can't pin
down behavior of individual functions, can't run without a full build
environment, and don't fail loudly when a unit-level invariant breaks. A
lightweight unit layer gives us:
- Targeted regression coverage for protocol parsing/encoding (the
highest bug-yield area).
- A natural home for tests of the kinds of subtle invariants already
documented in CLAUDE.md (port-counter overflow safety, port-bounds
inclusivity, HMAC buffer initialization).
- Sub-second feedback for contributors.
## Usage
```bash
# CMake direct
cmake -S . -B build -DBUILD_TESTING=ON
cmake --build build -j --target check # build + run all unit tests
ctest --test-dir build --output-on-failure # run already-built tests
# Legacy Makefile bridge (after ./configure)
make unit-tests # bootstraps build/unit-tests/, builds + runs Unity tests
make check # RFC 5769 conformance + unit tests
```
Adding a new test:
1. Drop `tests/test_<name>.c`
2. Append `coturn_add_test(test_<name>)` in `tests/CMakeLists.txt`
3. The `check` target picks it up automatically.
## Test plan
- [x] Clean cmake build with `-DBUILD_TESTING=ON` succeeds; full source
tree (turnserver, turnadmin, turnclient, turn_server, all turnutils)
still builds
- [x] `cmake --build build --target check` builds and runs both test
binaries — 13/13 cases pass
- [x] `ctest --verbose` shows per-case PASS lines for all 13 cases
- [x] Default build (`-DBUILD_TESTING` unset) does not fetch Unity or
build any test binary
## Notes for reviewers
- Why Unity over GoogleTest/Catch2: pure C, single source file, no C++
toolchain dependency, runs anywhere coturn does (incl. exotic CMake
targets like Solaris/AIX). GoogleTest would force `extern "C"` wrappers
and a C++ compiler everywhere.
## Summary
- Remove the `udp-relay-servers` config knob and the
`udp_relay_servers_number` field — after #1849 left only the
`PER_THREAD` UDP engine, this knob is no longer wired to anything that
creates extra UDP relay threads.
- Delete the now-orphaned `udp_relay_servers[]` array, the
`TURNSERVER_ID_BOUNDARY_BETWEEN_TCP_AND_UDP` / `..._UDP_AND_TCP` macros,
and the `id >= boundary` branch in `get_relay_server()`. The array was
read but never written anywhere in the tree, so the branch was
unreachable dead code.
- Drop a stray unused `char s[257]` in
`dbd_redis.c::redis_list_secrets`.
- Adjust the startup banner to log "Total relay threads" (was "Total
General servers" + a never-fired "Total UDP servers" block).
## Test plan
- [x] `cmake .. && make -j8` clean build
- [x] `examples/scripts/rfc5769.sh` — all RFC 5769 conformance vectors
pass
- [x] `examples/scripts/basic/relay.sh` + `udp_c2c_client.sh` —
12000/12000 msgs, 0 lost, 0 dropped
Upstream OSS-Fuzz build recipe
(google/oss-fuzz/projects/coturn/build.sh) only copies two fuzzer
binaries -- FuzzStun and FuzzStunClient -- and their seed corpora into
$OUT. The eight additional fuzz targets added later never ran on
oss-fuzz.com, which is why the introspector profile reports "fuzzer no
longer available" for them.
Rather than patching the Google-owned build recipe, fold all fuzzers
into the two binaries OSS-Fuzz actually ships. Each target now begins
with a single-byte selector (Data[0] mod 5) that dispatches to one of
five sub-harnesses:
FuzzStun - integrity (SHA1/multi-SHA), attr_iter, attr_add,
old_stun
FuzzStunClient - stun_client, channel_data, addr_codec, oauth_token,
oauth_roundtrip
No upstream OSS-Fuzz changes are required.
## Summary
- Fixes compilation error on Linux when `_GNU_SOURCE` is not defined by
the toolchain: `struct mmsghdr` has incomplete type and `recvmmsg()` is
implicitly declared
- Defines `_GNU_SOURCE` in three places for full coverage across build
systems:
- `dtls_listener.c` — before includes, guarded by `#if
defined(__linux__)`
- `configure` — adds `-D_GNU_SOURCE` to `OSCFLAGS` on Linux for the
legacy build path
- `CMakeLists.txt` — adds `-D_GNU_SOURCE` on Linux for the CMake build
## Context
The `recvmmsg()` batched receive path added in #1852 uses `struct
mmsghdr` and `recvmmsg()`, which are glibc extensions requiring
`_GNU_SOURCE`. Some Linux distros/toolchains don't define this
implicitly, causing:
```
src/apps/relay/dtls_listener.c:129:18: error: array type has incomplete element type 'struct mmsghdr'
src/apps/relay/dtls_listener.c:748:18: warning: implicit declaration of function 'recvmmsg'
```
Fixes#1867
## Test plan
- [x] Verified CMake build succeeds on macOS (recvmmsg code is `#if
defined(__linux__)` guarded — no effect on non-Linux)
- [x] Verify build succeeds on Linux with and without `_GNU_SOURCE` in
the environment
- [x] Verify both `cmake` and `./configure && make` build paths work
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The first ALLOCATE set ss->origin_set=1 before check_stun_auth ran, so
an unauthenticated attacker could lock the session into a realm of their
choice by forging the ORIGIN attribute on the first packet. If per-realm
ACLs differ, this lets the attacker pick the most permissive realm for
that session.
Defer the commit of ss->origin_set until check_stun_auth succeeds with a
valid MESSAGE-INTEGRITY. Until auth passes, every request re-parses
ORIGIN, so the 401 challenge still carries the correct realm derived
from the current ORIGIN attribute.