How to receive 1 million packets per second

In a casual conversation last week, I overheard a colleague say: Linux's network stack is too slow! You can't expect it to handle more than 50,000 packets per second per core!

This got me thinking. While I agree that 50kpps per core is probably the limit for any practical application, what can a Linux networking stack do? Let's put it another way and make it more interesting:

How hard is it to write a program that receives 1 million UDP packets per second on Linux?

Hopefully, answering this question sheds a good light on modern network stack design.

First, let's assume:

  • Measuring packets per second (pps) is much more interesting than measuring bytes per second (Bps). You can achieve higher Bps with better pipelining and sending longer packets. However improving pps is much more difficult.
  • Since we are interested in pps, our experiments will use short UDP messages. To be precise: 32 bytes of UDP payload. That means 74 bytes on the Ethernet layer (32 bytes of payload + 8 bytes of UDP header + 20 bytes of IPv4 header + 14 bytes of Ethernet header).
  • For the experiments, we will use two physical servers: the receiver and the sender.
  • They both have two six-core 2GHz Xeon processors. With Hyper-Threading (HT) enabled, that adds up to 24 logical processors per box. Each box has a multi-queue 10G network card from Solarflare, configured with 11 receive queues. More on that later.
  • The source code of the test program can be found here: https://github.com/majek/dump/tree/master/how-to-receive-a-million-packets

prerequisites

Let's use port 4321 as the destination UDP port. Before we start, we must make sure the traffic won't be interfered with by iptables:

receiver$ iptables -I INPUT 1 -p udp --dport 4321 -j ACCEPT
receiver$ iptables -t raw -I PREROUTING 1 -p udp --dport 4321 -j NOTRACK

To make things easier, a few IP addresses are defined explicitly:

receiver$ for i in `seq 1 20`; do \
              ip addr add 192.168.254.$i/24 dev eth2; \
          done
sender$ ip addr add 192.168.254.30/24 dev eth3

naive approach

First, let's do the simplest experiment: how many packets will be delivered with a naive send and receive?

Sender pseudocode:

fd = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
fd.bind(("0.0.0.0", 65400)) # select source port to reduce nondeterminism
fd.connect(("192.168.254.1", 4321))
while True:
    fd.sendmmsg(["\x00" * 32] * 1024)

While we could use the usual send system call, it's not efficient. Context switching to the kernel has a cost, and it's best to avoid it. Fortunately, Linux recently added a handy system call: sendmmsg (http://man7.org/linux/man-pages/man2/sendmmsg.2.html). It allows us to send multiple packets at once. Let's send 1024 packets at a time.
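The pseudocode above glosses over the actual API. Purely as an illustration (not the exact code from the linked repository; the helper name and constants are made up), a batched send with sendmmsg might look roughly like this in C:

/* Minimal sketch: push a batch of identical 32-byte datagrams into the
 * kernel with a single sendmmsg() call. Assumes fd is a UDP socket that
 * has already been bound and connected as in the pseudocode above. */
#define _GNU_SOURCE
#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>

#define BATCH   1024
#define PAYLOAD 32

static int send_batch(int fd)
{
    static char buf[PAYLOAD];            /* 32 bytes of zeros */
    struct iovec iov[BATCH];
    struct mmsghdr msgs[BATCH];

    memset(msgs, 0, sizeof(msgs));
    for (int i = 0; i < BATCH; i++) {
        iov[i].iov_base = buf;
        iov[i].iov_len  = sizeof(buf);
        msgs[i].msg_hdr.msg_iov    = &iov[i];
        msgs[i].msg_hdr.msg_iovlen = 1;
    }
    /* One syscall hands up to 1024 datagrams to the kernel;
     * the return value is the number actually accepted. */
    return sendmmsg(fd, msgs, BATCH, 0);
}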

Receiver pseudocode:

fd = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
fd.bind(("0.0.0.0", 4321))
while True:
    packets = [None] * 1024
    fd.recvmmsg(packets, MSG_WAITFORONE)

Similarly, recvmmsg is a more efficient version of the common recv system call.
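Again as a rough sketch (not the author's exact implementation; names and constants are illustrative), the batched receive could look like this in C. MSG_WAITFORONE blocks until at least one datagram arrives and then also returns whatever else is already queued:

/* Minimal sketch: receive up to BATCH datagrams with one recvmmsg() call. */
#define _GNU_SOURCE
#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>

#define BATCH   1024
#define PAYLOAD 32

static int recv_batch(int fd)
{
    static char bufs[BATCH][PAYLOAD];
    struct iovec iov[BATCH];
    struct mmsghdr msgs[BATCH];

    memset(msgs, 0, sizeof(msgs));
    for (int i = 0; i < BATCH; i++) {
        iov[i].iov_base = bufs[i];
        iov[i].iov_len  = PAYLOAD;
        msgs[i].msg_hdr.msg_iov    = &iov[i];
        msgs[i].msg_hdr.msg_iovlen = 1;
    }
    /* Returns how many datagrams were actually received in this batch. */
    return recvmmsg(fd, msgs, BATCH, MSG_WAITFORONE, NULL);
}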

Let's try it:

sender$ ./udpsender 192.168.254.1:4321
receiver$ ./udpreceiver1 0.0.0.0:4321
  0.352M pps  10.730MiB /  90.010Mb
  0.284M pps   8.655MiB /  72.603Mb
  0.262M pps   7.991MiB /  67.033Mb
  0.199M pps   6.081MiB /  51.013Mb
  0.195M pps   5.956MiB /  49.966Mb
  0.199M pps   6.060MiB /  50.836Mb
  0.200M pps   6.097MiB /  51.147Mb
  0.197M pps   6.021MiB /  50.509Mb

With this naive approach we can do between 197k and 350k pps. Not too bad. Unfortunately, there is quite a bit of variability. It is caused by the kernel shuffling our programs between cores. It helps to pin each process to a specific CPU:

sender$ taskset -c 1 ./udpsender 192.168.254.1:4321
receiver$ taskset -c 1 ./udpreceiver1 0.0.0.0:4321
  0.362M pps  11.058MiB /  92.760Mb
  0.374M pps  11.411MiB /  95.723Mb
  0.369M pps  11.252MiB /  94.389Mb
  0.370M pps  11.289MiB /  94.696Mb
  0.365M pps  11.152MiB /  93.552Mb
  0.360M pps  10.971MiB /  92.033Mb

Now the kernel scheduler keeps each process on its assigned CPU. This improves processor cache locality and makes the numbers more consistent, which is exactly what we want.
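Instead of relying on taskset, a program can also pin itself from inside. As an illustration only, a hypothetical helper using sched_setaffinity(2) might look like this (the CPU number is just an example):

/* Sketch: pin the calling thread to a single CPU, e.g. pin_to_cpu(1). */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

static int pin_to_cpu(int cpu)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    /* A pid of 0 means "the calling thread". */
    if (sched_setaffinity(0, sizeof(set), &set) != 0) {
        perror("sched_setaffinity");
        return -1;
    }
    return 0;
}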

send more packets

While 370k pps is decent for a simple program, it's still far from the 1Mpps goal. To receive more packets, first we have to send more packets. Let's send independently from two threads:

sender$ taskset -c 1,2 ./udpsender \
            192.168.254.1:4321 192.168.254.1:4321
receiver$ taskset -c 1 ./udpreceiver1 0.0.0.0:4321
  0.349M pps  10.651MiB /  89.343Mb
  0.354M pps  10.815MiB /  90.724Mb
  0.354M pps  10.806MiB /  90.646Mb
  0.354M pps  10.811MiB /  90.690Mb

The numbers on the receiving side didn't increase. ethtool -S will show where the packets actually went:

receiver$ watch 'sudo ethtool -S eth2 |grep rx'
     rx_nodesc_drop_cnt:    451.3k/s
     rx-0.rx_packets:     8.0/s
     rx-1.rx_packets:     0.0/s
     rx-2.rx_packets:     0.0/s
     rx-3.rx_packets:     0.5/s
     rx-4.rx_packets:  355.2k/s
     rx-5.rx_packets:     0.0/s
     rx-6.rx_packets:     0.0/s
     rx-7.rx_packets:     0.5/s
     rx-8.rx_packets:     0.0/s
     rx-9.rx_packets:     0.0/s
     rx-10.rx_packets:    0.0/s

Through these stats, the NIC reports that it successfully delivered about 350kpps to RX queue #4. rx_nodesc_drop_cnt is a Solarflare-specific counter indicating that the NIC was unable to deliver the remaining ~450kpps to the kernel.

Sometimes it's not obvious why packets weren't delivered. In our case, though, it's very clear: RX queue #4 delivers packets to CPU #4, and CPU #4 can't do any more work - it's completely busy just reading the 350kpps. In htop this shows up as one CPU maxed out while the others sit mostly idle.

multi-queue NIC

Historically, network cards had a single RX queue that was used to pass packets between the hardware and the kernel. This design has an obvious limitation: it is impossible to deliver more packets than a single CPU can handle.

To take advantage of multi-core systems, NICs started supporting multiple RX queues. The design is simple: each RX queue is pinned to a separate CPU, so by sending packets to all RX queues, one network card can utilize all CPUs. But it raises a question: given a packet, how does the NIC decide which RX queue to push it to?

A round-robin algorithm is unacceptable because it could introduce reordering of packets within a single connection. Another approach is to use a hash of the packet to determine the RX queue number. The hash is usually computed from the tuple (src IP, dst IP, src port, dst port). This guarantees that packets for a single flow always land on exactly the same RX queue, so no reordering within a single flow can occur.

In our case the hash can be used like this:

RX_queue_number = hash('192.168.254.30', '192.168.254.1', 65400, 4321) % number_of_queues

Multi-queue hash algorithm

The hash algorithm can be configured through ethtool. In our setup it is:

receiver$ ethtool -n eth2 rx-flow-hash udp4
UDP over IPV4 flows use these fields for computing Hash flow key:
IP SA
IP DA

This reads as: for IPv4 UDP packets, the NIC will hash only the (src IP, dst IP) addresses. For example:

RX_queue_number = hash('192.168.254.30', '192.168.254.1') % number_of_queues

This is pretty limited, as it ignores the port numbers. Many NICs allow the hash to be customized. Again using ethtool, we can try to select the full tuple (src IP, dst IP, src port, dst port) for hashing:

receiver$ ethtool -N eth2 rx-flow-hash udp4 sdfn
Cannot change RX network flow hashing options: Operation not supported

Unfortunately, our NIC doesn't support it - it's limited to (src IP, dst IP) hashes.

A note on NUMA performance

So far, all our packets flow to only one RX queue and hit only one CPU. Let's use this as an opportunity to benchmark the performance of different CPUs. In our setup, the receiving host has two separate processor banks, each a different NUMA node.

We can pin the single-threaded receiver to one of four interesting CPUs in our setup. The four options are:

  • Run the receiver on another CPU, but on the same NUMA node as the RX queue. The performance we saw above was around 360kpps.
  • If the receiver and RX queue are on the same CPU, we can achieve 430kpps. But it creates high variability. If the network card is overloaded, performance drops to zero.
  • When the receiver is running on the HT sibling of the CPU handling the RX queue, the throughput is about half the usual number, around 200kpps.
  • With the receiver on a CPU on a different NUMA node than the RX queue, we get ~330k pps. However, the numbers are not quite consistent.

While a 10% penalty for running on a different NUMA node may not sound too bad, the problem only gets worse with scale. In some tests I was able to squeeze out only 250kpps per core, and all cross-NUMA tests showed high variability. The performance penalty across NUMA nodes is even more visible at higher throughput: in one test I saw a 4x penalty when running the receiver on the wrong NUMA node.
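To check whether two CPUs actually share a NUMA node before pinning, one option is libnuma. This is only a sketch under the assumption that libnuma is installed (compile with -lnuma); the CPU numbers are example values, not taken from the setup above:

/* Sketch: report which NUMA node the RX-queue CPU and the receiver CPU
 * belong to, so we can keep them on the same node. */
#include <numa.h>
#include <stdio.h>

int main(void)
{
    if (numa_available() < 0) {
        fprintf(stderr, "NUMA is not supported on this system\n");
        return 1;
    }
    int rx_cpu  = 4;   /* CPU servicing the RX queue (example value) */
    int app_cpu = 1;   /* CPU we plan to run the receiver on (example value) */
    printf("RX queue CPU %d is on node %d, receiver CPU %d is on node %d\n",
           rx_cpu, numa_node_of_cpu(rx_cpu),
           app_cpu, numa_node_of_cpu(app_cpu));
    return 0;
}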

Multiple receive IPs

Since the hashing algorithm on our NIC is quite limited, the only way to distribute packets across the RX queues is to use multiple IP addresses. Here is how to send packets to two different destination IPs:

sender$ taskset -c 1,2 ./udpsender 192.168.254.1:4321 192.168.254.2:4321

ethtool confirms that packets go to different RX queues:

receiver$ watch 'sudo ethtool -S eth2 |grep rx'
     rx-0.rx_packets:     8.0/s
     rx-1.rx_packets:     0.0/s
     rx-2.rx_packets:     0.0/s
     rx-3.rx_packets:  355.2k/s
     rx-4.rx_packets:     0.5/s
     rx-5.rx_packets:  297.0k/s
     rx-6.rx_packets:     0.0/s
     rx-7.rx_packets:     0.5/s
     rx-8.rx_packets:     0.0/s
     rx-9.rx_packets:     0.0/s
     rx-10.rx_packets:    0.0/s

The receiving side:

receiver$ taskset -c 1 ./udpreceiver1 0.0.0.0:4321
  0.609M pps  18.599MiB / 156.019Mb
  0.657M pps  20.039MiB / 168.102Mb
  0.649M pps  19.803MiB / 166.120Mb

Hurray! With two cores busy handling the RX queues and a third running the application, it's possible to get ~650k pps!

We can increase this number further by sending traffic to three or four RX queues, but soon the application will hit another limit. This time rx_nodesc_drop_cnt is not growing, but the netstat "receive errors" counters are:

receiver$ watch 'netstat -s --udp'
Udp:
      437.0k/s packets received
        0.0/s packets to unknown port received.
      386.9k/s packet receive errors
        0.0/s packets sent
    RcvbufErrors:  123.8k/s
    SndbufErrors: 0
    InCsumErrors: 0

This means that while the NIC is able to deliver packets to the kernel, the kernel cannot deliver them to the application. In our case it manages to deliver only 440kpps; the remaining 390kpps + 123kpps are dropped because the application isn't receiving them fast enough.

Receive with multiple threads

We need to scale out the receiver application. The naive approach of receiving from multiple threads won't work well:

sender$ taskset -c 1,2 ./udpsender 192.168.254.1:4321 192.168.254.2:4321
receiver$ taskset -c 1,2 ./udpreceiver1 0.0.0.0:4321 2
  0.495M pps  15.108MiB / 126.733Mb
  0.480M pps  14.636MiB / 122.775Mb
  0.461M pps  14.071MiB / 118.038Mb
  0.486M pps  14.820MiB / 124.322Mb

Receiving performance is down compared to the single-threaded program. That's caused by lock contention on the UDP receive buffer. Since both threads are using the same socket descriptor, they spend a disproportionate amount of time fighting over the lock around the UDP receive buffer. This article describes the problem in more detail.

Using multiple threads to receive data from a socket is not optimal.

SO_REUSEPORT

Fortunately, a solution was recently added to Linux: the SO_REUSEPORT flag. When this flag is set on a socket, Linux will allow multiple processes to bind to the same port. In fact, it will allow any number of processes to bind and spread the load evenly.

With SO_REUSEPORT each process will have a separate socket. Therefore, each process will have a dedicated UDP receive buffer. This avoids the contention problem encountered before:

receiver$ taskset -c 1,2,3,4 ./udpreceiver1 0.0.0.0:4321 4 1
  1.114M pps  34.007MiB / 285.271Mb
  1.147M pps  34.990MiB / 293.518Mb
  1.126M pps  34.374MiB / 288.354Mb

That's it, the throughput is pretty good now!
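For illustration (this is not the exact code from udpreceiver1; the helper name is made up), each receiving process or thread can open its own SO_REUSEPORT socket roughly like this, assuming Linux 3.9 or newer:

/* Sketch: every worker calls this independently; the kernel then spreads
 * incoming flows across all sockets bound to the same address and port. */
#define _GNU_SOURCE
#include <arpa/inet.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

static int reuseport_socket(const char *addr, int port)
{
    int fd = socket(AF_INET, SOCK_DGRAM, 0);
    if (fd < 0)
        return -1;

    int one = 1;
    if (setsockopt(fd, SOL_SOCKET, SO_REUSEPORT, &one, sizeof(one)) < 0) {
        close(fd);
        return -1;
    }

    struct sockaddr_in sin;
    memset(&sin, 0, sizeof(sin));
    sin.sin_family = AF_INET;
    sin.sin_port = htons(port);
    inet_pton(AF_INET, addr, &sin.sin_addr);

    if (bind(fd, (struct sockaddr *)&sin, sizeof(sin)) < 0) {
        close(fd);
        return -1;
    }
    return fd;   /* each socket gets its own receive buffer */
}

Because every worker owns a separate socket, each one also gets a dedicated UDP receive buffer, which is exactly what removes the lock contention seen earlier.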

Further experiments reveal room for improvement. Even though we start four receiving threads, the load is not spread evenly among them: two threads receive all the work, and the other two get no packets at all. This is caused by a hash collision, but this time at the SO_REUSEPORT layer.

Summary

I did some further testing, and with perfectly aligned RX queues and receiving threads on a single NUMA node it is possible to get 1.4Mpps. Running the receiver on a different NUMA node caused the numbers to drop to at best 1Mpps.

To sum up, if you want perfect performance, you need to:

  • Ensure traffic is distributed evenly across many RX queues and SO_REUSEPORT processes. In practice, the load is usually well distributed as long as there are a large number of connections (or flows).
  • Have enough spare CPU capacity to actually pick the packets up from the kernel.
  • To make things harder, keep both the RX queues and the receiving processes on a single NUMA node.

While we've shown that it's technically possible to receive 1Mpps on a Linux box, the application above doesn't do any actual processing of the received packets - it doesn't even look at the contents of the traffic. Don't expect this kind of performance from any practical application without a lot more work.

