nanog mailing list archives

RE: 400G forwarding - how does it work?


From: <ljwobker () gmail com>
Date: Sat, 6 Aug 2022 10:08:35 -0400

I don't think I can add much here to the FP and Trio specific questions, for obvious reasons... but ultimately it comes 
down to a set of tradeoffs where some of the big concerns are things like "how do I get the forwarding state I need 
back and forth to the things doing the processing work" -- that's an insane level of oversimplification, as a huge 
amount of engineering time goes into those choices.

I think the "revolutionary-ness" (to vocabulate a useful word?) of putting multiple cores or whatever onto a single 
package is somewhat in the eye of the beholder.  The vast majority of customers would never know nor care whether a 
chip on the inside was implemented as two parallel "cores" or whether it was just one bigger "core" that does twice the 
amount of work in the same time.  But to the silicon designer, and to a somewhat lesser extent the people writing the 
forwarding and associated chip-management code, it's definitely a big big deal.  Also, having the ability to put two 
cores down on a given chip opens the door to eventually doing MORE than two cores, and if you really stretch your brain 
you get to where you might be able to put down "N" pipelines.

This is the story of integration: back in the day we built systems where everything was forwarded on a single CPU.  
From a performance standpoint all we cared about was the clock rate and how much work was required to forward a packet. 
 Divide the second number by the first, and you get your answer.  In the late 90's we built systems (the 7500 for me) 
that were distributed, so now we had a bunch of CPUs on linecards running that code.  Horizontal scaling -- sort of.  
In the early 2000's the GSR came along and now we're doing forwarding in hardware, which is an order of magnitude or two faster, but 
a whole bunch of features are now too complex to do in hardware, so they go over the side and people have to adapt.  To 
the best of my knowledge, TCP intercept has never come back...
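
(A back-of-the-envelope illustration of that "divide the second number by the first" arithmetic, in Python, with 
made-up numbers -- neither figure describes any real box:)

  clock_hz = 200e6            # first number: a hypothetical 200 MHz forwarding CPU
  cycles_per_packet = 2000    # second number: hypothetical work to forward one packet
  seconds_per_packet = cycles_per_packet / clock_hz   # second number / first number
  print(f"{1 / seconds_per_packet:,.0f} pps")         # -> 100,000 pps
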
For a while, GSR and CRS type systems had linecards where each card had a bunch of chips that together built the 
forwarding pipeline.  You had chips for the L1/L2 interfaces, chips for the packet lookups, chips for the QoS/queueing 
math, and chips for the fabric interfaces.  Over time, we integrated more and more of these things together until you 
(more or less) had a linecard where everything was done on one or two chips, instead of a half dozen or more.  Once we 
got here, the next step was to build linecards where you actually had multiple independent things doing forwarding -- 
on the ASR9k we called these "slices".  This again multiplies the performance you can get, but now both the software 
and the operators have to deal with the complexity of having multiple things running code where you used to only have 
one.  Now let's jump into the 2010's where the silicon integration allows you to put down multiple cores or pipelines 
on a single chip, each of which is now (more or less) its own forwarding entity.  So now you've got yet ANOTHER layer 
of abstraction.  If I can attempt to draw out the tree, it looks like this now:
1) you have a chassis or a system, which has a bunch of linecards.
2) each of those linecards has a bunch of NPUs/ASICs
3) each of those NPUs has a bunch of cores/pipelines
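
(As a trivial sketch, in Python, of how that tree multiplies out -- the numbers are the ones from the example in the 
next paragraph:)

  linecards = 16           # per chassis
  npus_per_linecard = 4
  cores_per_npu = 4
  # every core/pipeline is an independent forwarding entity that the
  # software has to program and track
  print(linecards * npus_per_linecard * cores_per_npu)   # -> 256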

And all of this stuff has to be managed and tracked by the software.  If I've got a system with 16 linecards, and each 
of those has 4 NPUs, and each of THOSE has 4 cores - I've got over *two hundred and fifty* separate things forwarding 
packets at the same time.  Now a lot of the info they're using is common (the FIB is probably the same for all these 
entities...) but some of it is NOT.  There's no value in wasting memory for the encapsulation data to host XXX if I 
know that none of the ports on my given NPU/core are going to talk to that host, right?  So - figuring out how to 
manage the *state locality* becomes super important.  And yes, this code breaks like all code, but no one has figured 
out any better way to scale up the performance.  If you have a brilliant idea here that will get me the performance of 
250+ things running in parallel but the simplicity of it looking and acting like a single thing to the rest of the 
world, please find an angel investor and we'll get phenomenally rich together.
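
(To make the state-locality idea concrete, a minimal Python sketch -- the table names and values here are invented for 
illustration, not anyone's actual data structures:)

  # global state: every forwarding entity holds the same FIB
  fib = {"192.0.2.0/24": "nh-A", "198.51.100.0/24": "nh-B"}

  # encap/adjacency state only matters to the NPU whose ports can
  # actually reach a given next-hop, so don't replicate it everywhere
  encap = {"nh-A": "rewrite-for-A", "nh-B": "rewrite-for-B"}
  reachable = {"npu0": {"nh-A"}, "npu1": {"nh-B"}}

  per_npu_encap = {npu: {nh: encap[nh] for nh in nhs}
                   for npu, nhs in reachable.items()}
  # npu0 stores only nh-A's rewrite; npu1 stores only nh-B's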


--lj

-----Original Message-----
From: Saku Ytti <saku () ytti fi> 
Sent: Saturday, August 6, 2022 1:38 AM
To: ljwobker () gmail com
Cc: Jeff Tantsura <jefftant.ietf () gmail com>; NANOG <nanog () nanog org>; Jeff Doyle <jdoyle () juniper net>
Subject: Re: 400G forwarding - how does it work?

On Fri, 5 Aug 2022 at 20:31, <ljwobker () gmail com> wrote:

Hey LJ,

> Disclaimer:  I work for Cisco on a bunch of silicon.  I'm not intimately familiar with any of these devices, but I'm 
> familiar with the high level tradeoffs.  There are also exceptions to almost EVERYTHING I'm about to say, especially 
> once you get into the second- and third-order implementation details.  Your mileage will vary...   ;-)

I expect it may come to this; my question may be too specific to be answered without violating some NDA.

> If you have a model where one core/block does ALL of the processing, you generally benefit from lower latency, 
> simpler programming, etc.  A major downside is that to do this, all of these cores have to have access to all of the 
> different memories used to forward said packet.  Conversely, if you break up the processing into stages, you can only 
> connect the FIB lookup memory to the cores that are going to be doing the FIB lookup, and only connect the encap 
> memories to the cores/blocks that are doing the encapsulation work.  Those interconnects take up silicon space, which 
> equates to higher cost and power.
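
(A schematic contrast of the two models described above, as a Python sketch -- the stage names and memories are 
invented for illustration:)

  # run-to-completion: one core does everything, so every core needs a
  # path to every memory
  def run_to_completion(pkt, fib_mem, encap_mem):
      nh = fib_mem[pkt["dst"]]         # touches FIB memory
      pkt["rewrite"] = encap_mem[nh]   # touches encap memory
      return pkt

  # pipelined: each stage's cores are wired only to the one memory that
  # stage needs, saving interconnect area and power
  def stage_lookup(pkt, fib_mem):
      pkt["nh"] = fib_mem[pkt["dst"]]
      return pkt

  def stage_encap(pkt, encap_mem):
      pkt["rewrite"] = encap_mem[pkt["nh"]]
      return pkt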

While that's an interesting answer -- that is, the statement that the cost of giving all cores access to memory versus 
having a harder-to-program pipeline of cores is a balanced tradeoff -- I don't think it applies to my specific 
question, though it may apply to the generic one. We can roughly think of FP as having a similar number of lines as 
Trio has PPEs; therefore a similar number of cores need access to memory, and possibly a higher number, as more than 
one core per line will need memory access.
So the question is more: why many lower-performance cores, where performance is achieved by arranging them in a 
pipeline, rather than fewer high-performance cores, where an individual core works on a packet to completion -- given 
that the former has a similar number of lines of cores as the latter has cores?

> Packaging two cores on a single device is beneficial in that you only 
> have one physical chip to work with instead of two.  This often 
> simplifies the board designers' job, and is often lower power than two 
> separate chips.  This starts to break down as you get to exceptionally 
> large chips as you bump into the various physical/reticle limitations 
> of how large a chip you can actually build.  With newer packaging 
> technology (2.5D chips, HBM and similar memories, chiplets down the 
> road, etc) this becomes even more complicated, but the answer to "why 
> would you put two XYZs on a package?" is that it's just cheaper and 
> lower power from a system standpoint (and often also from a pure 
> silicon standpoint...)

Thank you for this; it does confirm that the benefits perhaps aren't as revolutionary as the presentation in this 
thread proposed. That presentation divided Trio's evolution into 3 phases, and multiple Trios on a package was 
presented as one of those big evolutions; perhaps some other division of the generations would have been more 
communicative.

> Lots and lots of Smart People Time has gone into different memory designs that attempt to optimize this problem, and 
> it's a major part of the intellectual property of various chip designs.

I choose to read this as 'where a lot of innovation happens, a lot of mistakes happen'. Hopefully we'll figure out a 
good answer here soon, as the answers vendors are ending up with are becoming increasingly visible compromises in the 
field. I suspect a large part of this is that the cloudy shops represent, if not disproportionate revenue, then 
disproportionate focus, and their networks tend to be a lot more static in config and traffic than access/SP networks. 
When you have that quality, you can make increasingly broad assumptions -- assumptions which don't play as well in SP 
networks.

--
  ++ytti

