nanog mailing list archives

Re: 400G forwarding - how does it work?


From: Jeff Tantsura <jefftant.ietf () gmail com>
Date: Tue, 26 Jul 2022 13:11:52 -0700

As Lincoln said - all of us directly working with BCM/other silicon vendors have signed numerous NDAs.
However if you ask a well crafted question - there’s always a way to talk about it ;-)

In general, if we look at the whole spectrum, on one side there are massively parallelized "many core" RTC ASICs, such 
as Trio, Lightspeed, and similar (as the last gasp of the Redback/Ericsson venture we built a 1400-hardware-thread 
ASIC, Spider).
On the other side of the spectrum are fixed-pipeline ASICs, from BCM Tomahawk at its extreme (max speed/radix, min 
features), moving through BCM Trident, Innovium, Barefoot (quite a different animal wrt programmability), etc - 
usually with a shallow on-chip buffer only (100-200M).

In between we have the so-called programmable-pipeline silicon; BCM DNX and Juniper Express are in this category. 
These are usually a combo of OCB + off-chip memory (most often HBM) (2-6G), and usually have line-rate/high-scale 
security/overlay encap/decap capabilities. They usually have highly optimized RTC blocks within a pipeline (RTC 
within a macro). The way and speed with which DBs and memories are accessed is evolving with each generation, and the 
number/speed of non-networking cores (usually ARM) keeps growing - OAM, INT, and local optimizations are the primary 
users of them.

Cheers,
Jeff

On Jul 25, 2022, at 15:59, Lincoln Dale <ltd () interlink com au> wrote:


On Mon, Jul 25, 2022 at 11:58 AM James Bensley <jwbensley+nanog () gmail com> wrote:

On Mon, 25 Jul 2022 at 15:34, Lawrence Wobker <ljwobker () gmail com> wrote:
This is the parallelism part.  I can take multiple instances of these memory/logic pipelines, and run them in 
parallel to increase the throughput.
...
I work on/with a chip that can forward about 10B packets per second… so if we go back to the order-of-magnitude 
number that I’m doing about “tens” of memory lookups for every one of those packets, we’re talking about something 
like a hundred BILLION total memory lookups… and since memory does NOT give me answers in 1 picosecond… we get 
back to pipelining and parallelism.

What level of parallelism is required to forward 10Bpps? Or 2Bpps like
my J2 example :)

I suspect many folks know the exact answer for J2, but it's likely under NDA to talk about the specific answer for 
any given device.

Without being platform- or device-specific, the core clock rate of many network devices is often in a "goldilocks" 
zone of (today) 1 to 1.5GHz, with a goal of 1 packet forwarded 'per clock'. As LJ described the pipeline, that doesn't 
mean a latency of 1 clock ingress-to-egress, but rather that every clock there is a forwarding decision from one 
'pipeline', and the MPPS/BPPS packet rate is achieved by having enough pipelines in parallel to achieve that.
The number here is often "1" or "0.5", so you can work the number backwards (e.g. it emits a packet every clock, or 
every 2nd clock).
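
To put rough numbers on that, here's a quick back-of-envelope in Python. The clock rates and packets-per-clock 
figures are illustrative assumptions, not any particular device:

    # How many parallel pipelines are needed for a target packet rate,
    # given a core clock and packets forwarded per clock per pipeline.
    # All numbers are illustrative, not any specific ASIC.

    def pipelines_needed(target_pps, clock_hz, packets_per_clock):
        per_pipeline_pps = clock_hz * packets_per_clock
        return target_pps / per_pipeline_pps

    print(pipelines_needed(10e9, 1.25e9, 1.0))   # 8.0 pipelines for ~10 Bpps
    print(pipelines_needed(2e9, 1.0e9, 0.5))     # 4.0 pipelines for the 2 Bpps example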

It's possible to build an ASIC/NPU to run at a faster clock rate, but that gets back to what I'm hand-waving 
describing as "goldilocks". Look up power vs frequency and you'll see it's non-linear.
Just as CPUs can scale by adding more cores (vs increasing frequency), much the same holds true for network silicon: 
you can go wider, with multiple pipelines. But it's not 10K parallel slices; there are some parallel parts, but there 
are multiple 'stages' on each doing different things.
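
A crude sketch of why "wider" tends to beat "faster" on power, using the standard dynamic-power relation 
P ~ a*C*V^2*f. The capacitance, voltage and activity numbers below are made up for illustration; the only real point 
is that a higher clock generally needs a higher supply voltage, so power grows much faster than linearly:

    # Dynamic power scales roughly as activity * C * V^2 * f, and a higher
    # frequency generally requires a higher supply voltage.
    # All numbers below are illustrative assumptions.

    def dynamic_power(cap_farads, volts, freq_hz, activity=0.5):
        return activity * cap_farads * volts**2 * freq_hz

    base = dynamic_power(1e-9, 0.80, 1.0e9)     # baseline clock
    fast = dynamic_power(1e-9, 0.95, 1.5e9)     # 1.5x clock, bumped voltage
    print(fast / base)                          # ~2.1x the power for 1.5x the rate
    # vs going "wider": ~1.5x the power for 1.5x the rate by adding half
    # again as many pipelines at the unchanged clock and voltage.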

Using your CPU comparison, there are some analogies here that do work:
 - you have multiple cpu cores that can do things in parallel -- analogous to pipelines
 - they often share some common I/O (e.g. CPUs have PCIe, maybe sharing some DRAM or LLC)  -- maybe some lookup 
engines, or centralized buffer/memory
 - most modern CPUs are out-of-order execution, where under-the-covers a cache miss or DRAM fetch has a 
disproportionate hit on performance, so it's hidden away from you as much as possible by speculative, out-of-order 
execution
    -- no direct analogy to this one - it's unlikely most forwarding pipelines do speculative execution like a 
general purpose CPU does - but they definitely do 'other work' while waiting for a lookup to happen (a toy sketch of 
that interleaving is below)
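
Here's a toy model of that last point - keep several packets in flight per engine and switch between them, so an 
outstanding table lookup never idles the whole engine. The latency and packet counts are invented purely for 
illustration:

    import collections

    LOOKUP_LATENCY = 4          # cycles until a table-access result comes back (made up)

    def cycles_to_forward(n_packets, in_flight):
        todo = collections.deque(range(n_packets))
        pending = collections.deque()        # ready-cycles of outstanding lookups
        cycle = 0
        while todo or pending:
            if todo and len(pending) < in_flight:
                pending.append(cycle + LOOKUP_LATENCY)   # issue the next packet's lookup
            if pending and pending[0] <= cycle:
                pending.popleft()                        # result is back, packet done
            cycle += 1
        return cycle

    print(cycles_to_forward(16, in_flight=1))   # 80 cycles: every packet eats the full latency
    print(cycles_to_forward(16, in_flight=4))   # 23 cycles: lookups overlap with other work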

A common-or-garden x86 is unlikely to achieve such a rate for a few different reasons:
 - if packets-in or packets-out go via DRAM, then you need sufficient DRAM (page opens/sec, DRAM bandwidth) to 
sustain at least one write and one read per packet. Look closer at DRAM and see its speed; pay attention to page 
opens/sec and what that consumes (some rough numbers are sketched after this list).
 - one 'trick' is to not DMA packets to DRAM but instead have them go into SRAM of some form - e.g. Intel DDIO, ARM 
Cache Stashing - which at least potentially saves you that DRAM write+read per packet
   - ... but then do e.g. an LPM lookup, and best case that is back to one memory access per packet. Maybe it's in 
L1/L2/L3 cache, but at large table sizes it likely isn't.
 - ... do more things to the packet (uRPF lookups, counters) and it's yet more lookups.
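
Some rough numbers for the DRAM point above. Every figure here (packet rate, packet size, cache-miss count) is an 
assumption picked for illustration, not a measurement:

    # Rough feel for why DRAM is the limiter for a software forwarder.

    PKT_RATE  = 100e6      # assumed target: 100 Mpps
    PKT_BYTES = 300        # assumed average packet size
    MISSES    = 10         # assumed table/counter accesses per packet that miss the caches

    pkt_dram_bw = PKT_RATE * PKT_BYTES * 2     # one DMA write + one read per packet
    lookup_rate = PKT_RATE * MISSES            # random DRAM accesses per second

    print(pkt_dram_bw / 1e9, "GB/s of packet data through DRAM")   # 60.0 GB/s
    print(lookup_rate / 1e6, "M random accesses per second")       # 1000.0 M/s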

Software can achieve high rates, but note that a typical ASIC/NPU does on the order of >100 separate lookups per 
packet, and 100 counter updates per packet.
Just as forwarding in an ASIC or NPU is a series of tradeoffs, forwarding in software on generic CPUs is also a 
series of tradeoffs.
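
For a sense of scale, take LJ's ~10B packets/sec figure together with those per-packet lookup and counter counts 
(order-of-magnitude numbers only):

    pps      = 10e9     # packets per second (LJ's figure above)
    lookups  = 100      # separate lookups per packet (order of magnitude above)
    counters = 100      # counter updates per packet

    print(pps * (lookups + counters) / 1e12)   # ~2.0 trillion memory operations per second
    # which is why those accesses get spread across many pipeline stages,
    # each with its own local memories, rather than one shared memory.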


cheers,

lincoln.

