nanog mailing list archives

RE: 400G forwarding - how does it work?


From: <ljwobker () gmail com>
Date: Sun, 7 Aug 2022 10:37:29 -0400

Buffering is a near-religious topic across a large swath of the network industry, but here are some opinions of mine:

A LOT of operators/providers need more buffering than you can realistically put directly onto the ASIC die.  Fast chips 
without external buffers measure buffer capacity in tens of microseconds, which is nowhere near enough for a lot of the 
market.  We can (and do) argue about exactly which network roles can be met by this amount of buffering, but 
they're absolutely not a large enough part of the market to totally go away from "big" external buffers.
Once you "jump off the cliff" of needing something more than on-chip SRAM, you're in this weird area where nothing 
exists in the technology space that *really* solves the problem, because you really need access rate and bandwidth more 
than you need capacity.   HBM is currently the best (or at least the most popular) combination of capacity, power, 
access rate, and bandwidth... but it's still nowhere near perfect.  A common HBM2 implementation gives you 8GB of 
buffer space and about 2Tb/s of raw bandwidth, and a few hundred million IOPS.  (A lot of that gets gobbled up by various 
overheads....)
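
To put rough numbers on "tens of microseconds" versus an HBM stack, here is a minimal sketch of the buffer-depth arithmetic. The 12.8Tb/s forwarding rate and the 64MB on-die figure are assumptions picked for illustration (not numbers for any specific chip); the 8GB comes from the HBM2 example above, and the function name is made up.

```python
def buffer_depth_us(buffer_bytes: float, drain_rate_bps: float) -> float:
    """Microseconds needed to drain a full buffer at the given rate."""
    return buffer_bytes * 8 / drain_rate_bps * 1e6


# ASSUMPTION: hypothetical chip forwarding 12.8 Tb/s in aggregate.
RATE_BPS = 12.8e12

# ASSUMPTION: 64 MB of on-die SRAM used as packet buffer.
on_chip_us = buffer_depth_us(64e6, RATE_BPS)

# 8 GB of HBM2, ignoring the overheads that eat into usable capacity.
hbm_us = buffer_depth_us(8e9, RATE_BPS)

print(f"on-chip: {on_chip_us:.0f} us, HBM: {hbm_us / 1000:.1f} ms")
```

Under these assumptions the on-die buffer holds about 40 microseconds of traffic and the HBM stack about 5 milliseconds, which is the gap the argument is about. It says nothing about whether the HBM can actually be written and read at that rate, which is the access-rate and bandwidth problem.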

These values are a function of two things:
1) memory physics - I don't know enough about how these things are Like Really Actually Built to talk about this part.
2) market forces... the market for this stuff is really GPUs, ML/AI applications, etc.  The networking silicon market 
is a drop in the ocean compared to the rest of compute, so the specific needs of my router aren't going to ever drive 
enough volume to get big memory makers to do exactly what **I** want.  I'm at the mercy of what they build for the 
gigantic players in the rest of the market.  

If you told me that someone had a memory technology that was something like "one-fourth the capacity of HBM, but four 
times the bandwidth and four times the access rate" I would do backflips and buy a lot of it, because it's a way better 
fit for the specific performance dimensions I need for A Really Fast Router.  But nothing remotely along these lines 
exists... so like a lot of other people I just have to order off the menu.   ;-)


--lj

-----Original Message-----
From: NANOG <nanog-bounces+ljwobker=gmail.com () nanog org> On Behalf Of Masataka Ohta
Sent: Sunday, August 7, 2022 5:13 AM
To: nanog () nanog org
Subject: Re: 400G forwarding - how does it work?

ljwobker () gmail com wrote:

Buffer designs are *really* hard in modern high speed chips, and there 
are always lots and lots of tradeoffs.  The "ideal" answer is an 
extremely large block of memory that ALL of the forwarding/queueing 
elements have fair/equal access to... but this physically looks more 
or less like a full mesh between the memory/buffering subsystem and 
all the forwarding engines, which becomes really unwieldy 
(expensive!) from a design standpoint.  The amount of memory you can 
practically put on the main NPU die is on the order of 20-200 **mega** 
bytes, where a single stack of HBM memory comes in at 4GB -- it's 
literally 100x the size.

I'm afraid you are implying far too much buffering, which only causes buffer bloat and unnecessary, unpleasant delay.

For an M/M/1 queue at 99% load, a buffer of 500 packets (750kB with a 1500B MTU) is enough to keep the packet drop 
probability below 1%. At 98% load, the probability is 0.0041%.
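
These figures are consistent with the tail probability rho^K of an infinite M/M/1 queue, that is, the chance of finding at least K packets queued. A minimal sketch, assuming that is the approximation behind them (the function names are illustrative); the finite-buffer M/M/1/K blocking probability is included for comparison:

```python
def tail_probability(load: float, buffer_packets: int) -> float:
    """P(queue holds >= buffer_packets) for an infinite M/M/1 queue: rho^K."""
    return load ** buffer_packets


def blocking_probability(load: float, buffer_packets: int) -> float:
    """Drop probability of a finite M/M/1/K queue (valid for load != 1)."""
    k = buffer_packets
    return (1 - load) * load ** k / (1 - load ** (k + 1))


BUFFER_PACKETS = 500
MTU_BYTES = 1500
buffer_bytes = BUFFER_PACKETS * MTU_BYTES  # 750 kB

for load in (0.99, 0.98):
    tail = tail_probability(load, BUFFER_PACKETS)
    block = blocking_probability(load, BUFFER_PACKETS)
    print(f"load {load:.0%}: tail {tail:.4%}, M/M/1/K blocking {block:.6%}")
```

The tail probability comes out at roughly 0.66% for 99% load and 0.0041% for 98% load. The M/M/1/K blocking probability is smaller still, so rho^K is the conservative number of the two. Both rest on the M/M/1 assumptions of Poisson arrivals and exponentially distributed service times.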

But many router engineers think that, with a bloated buffer, the packet drop probability can be zero, which is 
wrong.

For example,

        https://www.broadcom.com/products/ethernet-connectivity/switching/stratadnx/bcm88690
        Jericho2 delivers a complete set of advanced features for
        the most demanding carrier, campus and cloud environments.
        The device supports low power, high bandwidth HBM packet
        memory offering up to 160X more traffic buffering compared
        with on-chip memory, enabling zero-packet-loss in heavily
        congested networks.

                                        Masataka Ohta

