nanog mailing list archives

Re: scaling linux-based router hardware recommendations


From: Jim Shankland <nanog () shankland org>
Date: Tue, 27 Jan 2015 08:31:09 -0800

On 1/26/15 11:33 PM, Pavel Odintsov wrote:
Hello!

Looks like somebody want to build Linux soft router!) Nice idea for
routing 10-30 GBps. I route about 5+ Gbps in Xeon E5-2620v2 with 4
10GE cards Intel 82599 and Debian Wheezy 3.2 (but it's really terrible
kernel, everyone should use modern kernels since 3.16 because "buggy
linux route cache"). My current processor load on server is about:
15%, thus I can route about 15 GE on my Linux server.


I looked into the promise and limits of this approach pretty intensively a few years back before abandoning the effort abruptly due to other constraints. Underscoring what others have said: it's all about pps, not aggregate throughput. Modern NICs can inject packets at line rate into the kernel, and distribute them across per-processor queues, etc. Payloads end up getting DMA-ed from NIC to RAM to NIC. There's really no reason you shouldn't be able to push 80 Gb/s of traffic, or more, through these boxes. As for routing protocol performance (BGP convergence time, ability to handle multiple full tables, etc.): that's just CPU and RAM.

The part that's hard (as in "can't be fixed without rethinking this approach") is the per-packet routing overhead: the cost of reading the packet header, looking up the destination in the routing table, decrementing the TTL, and enqueueing the packet on the correct outbound interface. At the time, I was able to convince myself that being able to do this in 4 us, average, in the Linux kernel, was within reach. That's not really very much time: you start asking things like "will the entire routing table fit into the L2 cache?"

4 us to "think about" each packet comes out to 250Kpps per processor; with 24 processors, it's 6Mpps (assuming zero concurrency/locking overhead, which might be a little bit of an ... assumption). With 1500-byte packets, 6Mpps is 72 Gb/s of throughput -- not too shabby. But with 40-byte packets, it's less than 2 Gb/s. Which means that your Xeon ES-2620v2 will not cope well with a DDoS of 40-byte packets. That's not necessarily a reason not to use this approach, depending on your situation; but it's something to be aware of.

I ended up convincing myself that OpenFlow was the right general idea: marry fast, dumb, and cheap switching hardware with fast, smart, and cheap generic CPU for the complicated stuff.

My expertise, such as it ever was, is a bit stale at this point, and my figures might be a little off. But I think the general principle applies: think about the minimum number of x86 instructions, and the minimum number of main memory accesses, to inspect a packet header, do a routing table lookup, and enqueue the packet on an outbound interface. I can't see that ever getting reduced to the point where a generic server can handle 40-byte packets at line rate (for that matter, "line rate" is increasing a lot faster than "speed of generic server" these days).

Jim




Current thread: