nanog mailing list archives

RE: 400G forwarding - how does it work?


From: <ljwobker () gmail com>
Date: Sun, 7 Aug 2022 10:24:25 -0400

You're getting to the core of the question (sorry, I could not resist...) -- and again the complexity is as much in the 
terminology as anything else.

In EZChip, at least as we used it on the ASR9k, the chip had a bunch of processing cores, and each core performed some 
of the work on each packet.  I honestly don't know if the cores themselves were different or if they were all the same 
physical design, but they were definitely attached to different memories, and they definitely ran different microcode.  
These cores were allocated to separate stages, and had names along the lines of {parse, search, encap, transmit} etc.  
I'm sure these aren't 100% correct but you get the point.  Importantly, there was NOT the same number of cores for 
each stage, so when a packet went from stage A to stage B there was some kind of mux in between.  If you knew precisely 
that each stage had the same number of cores, you could choose to arrange it such that the packet always followed a 
"straight-line" through the processing pipe, which would make some parts of the implementation cheaper/easier.  

You're correct that the instruction set for stuff like this is definitely not ARM (nor x86 nor anything else standard) 
because the problem space you're optimizing for is a lot smaller than what you'd have on a more general purpose CPU.

The (enormous) challenge for running the same ucode on multiple targets is that networking has exceptionally high 
performance requirements  -- billions of packets per second is where this whole thread started!  Fortunately, we also 
have a much smaller problem space to solve than general purpose compute, although in a lot of places that's because we 
vendors have told operators "Look, if you want something that can forward a couple hundred terabits in a single system, 
you're going to have to constrain what features you need, because otherwise the current hardware just can't do it".  

To get that kind of performance without breaking the bank requires -- or at least has required up until this point in 
time -- some very tight integration between the hardware forwarding design and the microcode.  I was at Barefoot when 
P4 was released, and Tofino was the closest thing I've seen to a "general purpose network ucode machine" -- and even 
that was still very much optimized in terms of how the hardware was designed and built, and it VERY much required the 
P4 programmer to have a deep understanding of what hardware resources were available.  When you write a P4 program and 
compile it for an x86 machine, you can basically create as many tables and lookup stages as you want -- you just have 
to eat more CPU and memory accesses for more complex programs and they run slower.  But on a chip like Tofino (or any 
other NPU-like target) you're going to have finite limits on how many processing stages and memory tables exist... so 
it's more the case that when your program gets bigger it no longer "just runs slower" but rather it "doesn't run at 
all". 

The industry would greatly benefit from some magical abstraction layer that would let people write forwarding code 
that's both target-independent AND high-performance, but at least so far the performance penalty for making such code 
target independent has been waaaaay more than the market is willing to bear.

--lj

-----Original Message-----
From: Saku Ytti <saku () ytti fi> 
Sent: Sunday, August 7, 2022 4:44 AM
To: ljwobker () gmail com
Cc: Jeff Tantsura <jefftant.ietf () gmail com>; NANOG <nanog () nanog org>; Jeff Doyle <jdoyle () juniper net>
Subject: Re: 400G forwarding - how does it work?

On Sat, 6 Aug 2022 at 17:08, <ljwobker () gmail com> wrote:


For a while, GSR and CRS type systems had linecards where each card had a bunch of chips that together built the 
forwarding pipeline.  You had chips for the L1/L2 interfaces, chips for the packet lookups, chips for the 
QoS/queueing math, and chips for the fabric interfaces.  Over time, we integrated more and more of these things 
together until you (more or less) had a linecard where everything was done on one or two chips, instead of a half 
dozen or more.  Once we got here, the next step was to build linecards where you actually had multiple independent 
things doing forwarding -- on the ASR9k we called these "slices".  This again multiplies the performance you can get, 
but now both the software and the operators have to deal with the complexity of having multiple things running code 
where you used to only have one.  Now let's jump into the 2010s, where silicon integration allows you to put down 
multiple cores or pipelines on a single chip; each of these is now (more or less) its own forwarding entity.  So now 
you've got yet ANOTHER layer of abstraction.  If I can attempt to draw out the tree, it looks like this now:

1) you have a chassis or a system, which has a bunch of linecards.
2) each of those linecards has a bunch of NPUs/ASICs
3) each of those NPUs has a bunch of cores/pipelines

Thank you for this. I think we may have some ambiguity here. For now I'll ignore multichassis designs, as those went 
out of fashion, and describe only the 'NPU' model, not the express/brcm style pipeline.

1) you have a chassis with multiple linecards
2) each linecard has 1 or more forwarding packages
3) each package has 1 or more NPUs (Juniper calls these slices; unsure if the EZchip vocabulary is the same here)
4) each NPU has 1 or more identical cores (well, I can't really name any with just one core; an NPU, like a GPU, pretty 
inherently has many, many cores. And unlike some in this thread, I don't think they ever use the ARM instruction set; 
that makes no sense, since you create an instruction set targeting the application at hand, which ARM is not. But maybe 
some day we'll have some forwarding-IA, allowing customers to provide ucode that runs on multiple targets, though this 
would reduce the pace of innovation)
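For what it's worth, the hierarchy in that list nests cleanly; a quick sketch with arbitrary counts (not any particular 
vendor's box):

# chassis -> linecards -> forwarding packages -> NPUs -> identical cores
# All counts here are arbitrary examples.
chassis = {
    "linecards": [
        {
            "packages": [
                {
                    "npus": [            # the "slices" mentioned above
                        {"cores": 32},   # identical cores within one NPU
                        {"cores": 32},
                    ],
                },
            ],
        }
        for _ in range(8)                # 8 linecards in this example box
    ]
}

total_cores = sum(
    npu["cores"]
    for lc in chassis["linecards"]
    for pkg in lc["packages"]
    for npu in pkg["npus"]
)
print("forwarding cores in the box:", total_cores)   # 8 * 1 * 2 * 32 = 512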

Some of those NPU core architectures are flat, like Trio, where a single core handles the entire packet. Other core 
architectures, like FP, are matrices, where you have multiple lines and a packet picks one of the lines and traverses 
each core in that line. (FP has many more cores per line, compared to the leaba/pacific stuff.)
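A rough way to show the two dispatch models side by side -- a sketch only, not actual Trio or FP internals, and the 
step names and counts are invented: in the flat model one core runs every step for a packet, while in the matrix model 
the packet picks a line and each core in that line runs one step.

STEPS = ["parse", "lookup", "rewrite", "queue"]   # invented step names

def flat_dispatch(packet_id, num_cores):
    """Run-to-completion: a single core executes all steps for the packet."""
    core = packet_id % num_cores              # any available core takes it
    return [f"core{core}:{step}" for step in STEPS]

def matrix_dispatch(packet_id, num_lines):
    """Matrix: the packet picks one line, then traverses one core per step."""
    line = packet_id % num_lines               # pick a line...
    return [f"line{line}.core{i}:{step}"       # ...and walk its cores in order
            for i, step in enumerate(STEPS)]

print(flat_dispatch(7, num_cores=16))    # one core does everything
print(matrix_dispatch(7, num_lines=4))   # four cores each do one step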

--
  ++ytti

