AF_XDP / XSK

Since 1.9.0, dnsdist can use AF_XDP for high performance UDP packet processing recent Linux kernels (4.18+). It requires dnsdist to have the CAP_NET_ADMIN, CAP_SYS_ADMIN and CAP_NET_RAW capabilities at startup, and to have been compiled with the --with-xsk configure option.

Note

To retain the required capabilities it is necessary to call addCapabilitiesToRetain() during startup, as dnsdist drops capabilities after startup.

Note

AppArmor users might need to update their policy to allow dnsdist to keep the capabilities. Adding capability sys_admin, (for CAP_SYS_ADMIN) and capability net_admin, (for CAP_NET_ADMIN) lines to the policy file is usually enough.

Warning

DNSdist’s AF_XDP implementation comes with several limitations:

  • Asymmetrical network setups where the DNS query and its response do not go through the same network device are not supported
  • Ethernet packets larger than 2048 bytes are not supported
  • IP and UDP-level checksums are not verified on incoming DNS messages
  • IP options in incoming packets are not supported

The way AF_XDP works is that dnsdist allocates a number of frames in a memory area called a UMEM, which is accessible both by the program, in userspace, and by the kernel. Using in-memory ring buffers, the receive (RX), transmit (TX), completion (cq) and fill (fq) rings, the kernel can very efficiently pass raw incoming packets to dnsdist, which can in return pass raw outgoing packets to the kernel. In addition to these, an eBPF XDP program needs to be loaded to decide which packets to distribute via the AF_XDP socket (and to which, as there are usually more than one). This program uses a BPF map of type XSKMAP (located at /sys/fs/bpf/dnsdist/xskmap by default) that is populated by dnsdist at startup to locate the AF_XDP socket to use. dnsdist also sets up two additional BPF maps (located at /sys/fs/bpf/dnsdist/xsk-destinations-v4 and /sys/fs/bpf/dnsdist/xsk-destinations-v6) to let the XDP program know which IP destinations are to be routed to the AF_XDP sockets and which are to be passed to the regular network stack (health-checks queries and responses, for example). A ready-to-use XDP program can be found in the contrib directory of the PowerDNS Git repository:

$ python xdp.py --xsk --interface eth0

Then dnsdist needs to be configured to use AF_XDP, first by creating a XskSocket object that are tied to a specific queue of a specific network interface:

xsk = newXsk({ifName="enp1s0", NIC_queue_id=0, frameNums=65536, xskMapPath="/sys/fs/bpf/dnsdist/xskmap"})

This ties the new object to the first receive queue on enp1s0, allocating 65536 frames and populating the map located at /sys/fs/bpf/dnsdist/xskmap.

Then we can tell dnsdist to listen for AF_XDP packets to 192.0.2.1:53, in addition to packets coming via the regular network stack:

addLocal("192.0.2.1:53", {xskSocket=xsk})

In practice most high-speed (>= 10 Gbps) network interfaces support multiple queues to offer better performance, so we need to allocate one XskSocket per queue. We can retrieve the number of queues for a given interface via:

$ sudo ethtool -l enp1s0
Channel parameters for enp1s0:
Pre-set maximums:
RX:           n/a
TX:           n/a
Other:                1
Combined:     8
Current hardware settings:
RX:           n/a
TX:           n/a
Other:                1
Combined:     8

The Combined lines tell us that the interface supports 8 queues, so we can do something like this:

for i=1,8 do
  xsk = newXsk({ifName="enp1s0", NIC_queue_id=i-1, frameNums=65536, xskMapPath="/sys/fs/bpf/dnsdist/xskmap"})
  addLocal("192.0.2.1:53", {xskSocket=xsk, reusePort=true})
end

This will start one router thread per XskSocket object, plus one worker thread per addLocal() using that XskSocket object.

We can instructs dnsdist to use AF_XDP to send and receive UDP packets to a backend in addition to packets from clients:

local sockets = {}
for i=1,8 do
  xsk = newXsk({ifName="enp1s0", NIC_queue_id=i-1, frameNums=65536, xskMapPath="/sys/fs/bpf/dnsdist/xskmap"})
  table.insert(sockets, xsk)
  addLocal("192.0.2.1:53", {xskSocket=xsk, reusePort=true})
end

newServer("192.0.2.2:53", {xskSocket=sockets})

This will start one router thread per XskSocket object, plus one worker thread per addLocal()/newServer() using that XskSocket object.

We are not passing the MAC address of the backend (or the gateway to reach it) directly, so dnsdist will try to fetch it from the system MAC address cache. This may not work, in which case we might need to pass explicitly:

newServer("192.0.2.2:53", {xskSocket=sockets, MACAddr='00:11:22:33:44:55'})

Performance

Using kxdpgun, we can compare the performance of dnsdist using the regular network stack and AF_XDP.

This test was realized using two Intel E3-1270 with 4 cores (8 threads) running at 3.8 Ghz, using Intel 82599 10 Gbps network cards. On both the injector running kxdpgun and the box running dnsdist there was no firewall, the governor was set to performance, the UDP buffers were raised to 16777216 and the receive queue hash policy set to use the IP addresses and ports (see Performance Tuning).

dnsdist was configured to immediately respond to incoming queries with REFUSED:

addAction(AllRule(), RCodeAction(DNSRCode.REFUSED))

On the injector box we executed:

$ sudo kxdpgun -Q 2500000 -p 53 -i random_1M 192.0.2.1 -t 60
using interface enp1s0, XDP threads 8, UDP, native mode
[...]

We first ran without AF_XDP:

for i=1,8 do
  addLocal("192.0.2.1:53", {reusePort=true})
end

then with:

for i=1,8 do
  xsk = newXsk({ifName="enp1s0", NIC_queue_id=i-1, frameNums=65536, xskMapPath="/sys/fs/bpf/dnsdist/xskmap"})
  addLocal("192.0.2.1:53", {xskSocket=xsk, reusePort=true})
end
AF_XDP QPS
AF_XDP CPU

The first run handled roughly 1 million QPS, the second run 2.5 millions, with the CPU usage being much lower in the AF_XDP case.

Running under systemd

dnsdist needs quite a few more additional permissions to use AF_XDP:

  • to access the BPF maps directory, it needs to be able to go into the /sys/fs/bpf directory: one option is to chmod o+x /sys/fs/bpf, a safer one is to restrict that to the dnsdist user instead via chgrp dnsdist /sys/fs/bpf && chmod g+x /sys/fs/bpf
  • to read the BPF maps themselves, they need to be readable by the dnsdist user: chown -R dnsdist:dnsdist /sys/fs/bpf/dnsdist/
  • to create AF_XDP sockets: add AF_XDP to RestrictAddressFamilies in the systemd unit file
  • to load a BPF program: add CAP_SYS_ADMIN to CapabilityBoundingSet and AmbientCapabilities in the systemd unit file
  • to create raw network sockets: add CAP_NET_RAW to CapabilityBoundingSet and AmbientCapabilities in the systemd unit file
  • and finally to lock enough memory: ensure that LimitMEMLOCK=infinity is set in the systemd unit file