The libevent EV_READ handler used to do one recvfrom + one sendto per
ready event, so a packet flood through the relay generated N libevent
re-entries and 2N syscalls for N relayed datagrams, saturating one core
on the loadgen-side peer well below modern relay throughput.
On Linux, replace the handler with:
* a drain loop: keep calling recvmmsg with MSG_DONTWAIT until it
  returns fewer than a full batch, bounded by MAX_DRAIN_ROUNDS so a
  flood can't starve the rest of the event loop;
* recvmmsg into a static mmsghdr[32] (peer is single-threaded) and
reuse the same mmsghdr array for sendmmsg back — each entry already
has msg_name pointing at the source (the echo destination) and the
iovec pointing at the received bytes, so no userspace copy;
* UDP-GSO: when the recvmmsg batch is homogeneous (≥2 entries, same
source, same size, ≤1472 B), echo it as one sendmsg with UDP_SEGMENT
cmsg so the kernel allocates one super-skb that traverses the
network stack once.
The non-Linux build keeps the original recvfrom/sendto handler.
Benchmark: DigitalOcean nyc1 c-4, 30 s runs, alternating A/B, paired
with the GSO turnserver (-Y packet -m 1):
old peer: turn TX mean 228 k pps, peer CPU mean 91.0 % (saturated)
new peer: turn TX mean 255 k pps, peer CPU mean 28.8 %
Peer CPU drops 3.2× while turn-side throughput climbs ~12 %, because
the saturated old peer could no longer reflect at the GSO turnserver's
full rate.
The peer is no longer the loadgen-side bottleneck, freeing CPU for
multi-flow tests.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>