nanog mailing list archives

RE: few big monolithic PEs vs many small PEs


From: <adamv0025 () netconsultings com>
Date: Fri, 28 Jun 2019 09:01:20 +0100

Hi James,

From: James Bensley <jwbensley+nanog () gmail com>
Sent: Thursday, June 27, 2019 1:48 PM

On Thu, 27 Jun 2019 at 12:46, <adamv0025 () netconsultings com> wrote:

From: James Bensley <jwbensley () gmail com>
Sent: Thursday, June 27, 2019 9:56 AM

One experience I have had is that when there is an outage on a
large PE, even when it still has spare capacity, the business
impact can be too much to handle (the support desk is
overwhelmed, customers become irate if you can't quickly tell them
what all the impacted services are and when service will be restored,
the NMS has so many alarms it's not clear what the problem is or where
it's coming from, etc.).

I see what you mean. My hope is to address these challenges with a
"single source of truth" provisioning system that will hold, among other
things, the HW-to-customer/service mapping, so the Ops team will be able to say
that if a particular LC X fails then customers/services X, Y, Z will be affected.
But yes, I agree that with smaller PEs any failure fallout is proportionally
smaller.

Hi Adam,

My experience is that it is much more complex than that (although it also
depends on what sort of service you're offering): one can't easily model the
inter-dependencies between multiple physical assets like links, interfaces, line
cards, racks, DCs, etc. and logical services such as VRFs/L3VPNs, cloud-hosted
proxies and the P&T edge.

Consider this, in my opinion, relatively simple example:
Three PEs in a triangle. Customer is dual-homed to PE1 and PE2 and their link
to PE1 is their primary/active link. Transit is dual-homed to PE2 and PE3 and
your hosted filtering service cluster is also dual-homed to PE2 and PE3 to be
near the Internet connectivity.

I agree, the scenario you proposed is perfectly valid: it seems simple but can hide a high degree of complexity
in terms of traffic patterns.
Thinking about this, I'd propose separating the problem into two parts.

The simpler one to solve is the physical resource allocation part of the problem.
This is where a hierarchical record of physical assets could give us the right answer to "what happens if this
card fails?"
(example hierarchy: POP->PE->LineCard->PhysicalPort(s)->
PhysicalPort(s)->Aggregation-SW->PhysicalPort(s)->Customer/Service)
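
As a rough illustration of that first part (purely a sketch with made-up inventory names, not any real
provisioning system), a hierarchical record like the one above can simply be walked top-down to answer
"which customers/services does line card X take out":

# Hypothetical inventory tree: POP -> PE -> LineCard -> Port -> Agg-SW -> Port -> Customer.
# Dicts are containers, lists are the customer/service leaves hanging off a port.
INVENTORY = {
    "POP-LON1": {
        "PE1": {
            "LC-0": {
                "xe-0/0/0": {"AGG-SW-1": {"ge-0/0/1": ["CUST-A:l3vpn"],
                                          "ge-0/0/2": ["CUST-B:internet"]}},
                "xe-0/0/1": ["CUST-C:l2vpn"],   # customer connected directly to the PE
            },
            "LC-1": {
                "xe-1/0/0": ["TRANSIT-1:internet"],
            },
        },
    },
}

def find(node, name):
    """Depth-first search for the subtree keyed by 'name' (e.g. a line card)."""
    if isinstance(node, dict):
        if name in node:
            return node[name]
        for child in node.values():
            hit = find(child, name)
            if hit is not None:
                return hit
    return None

def leaves(node):
    """Collect every customer/service string below a node."""
    if isinstance(node, list):
        yield from node
    else:
        for child in node.values():
            yield from leaves(child)

if __name__ == "__main__":
    failed = "LC-0"
    print(failed, "failure impacts:", sorted(leaves(find(INVENTORY, failed))))
    # -> LC-0 failure impacts: ['CUST-A:l3vpn', 'CUST-B:internet', 'CUST-C:l2vpn']

The same walk works for any node in the hierarchy (a port, an agg-SW, a whole PE), which is essentially the
lookup I meant the provisioning system to provide.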

The other part of the problem is much harder and has two sub-parts:
-The first sub-part is to model the interactions between a number of protocols in order to accurately predict
traffic patterns under various failure conditions (a rough sketch of this follows below).
(I'd argue that this should, to some extent, be part of the design documentation and be well understood and
tested during POC testing for a new design -although entropy...)
-The trickier sub-part is to be able to map individual customer->service / service->customer traffic flows onto
the first sub-part.
(This sub-part I haven't given much thought to, so I can't really comment.)
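
To make the first sub-part a bit more concrete (again only a toy, and an assumption-laden one: a single IGP
metric, no BGP/policy interaction, invented link costs, and your triangle reduced to a plain weighted graph),
re-running shortest-path with a node removed at least shows where a given customer->transit flow lands after
a failure:

import heapq

def shortest_path(adj, src, dst):
    """Plain Dijkstra; returns the node list of the best path, or None if unreachable."""
    dist, prev, seen = {src: 0}, {}, set()
    heap = [(0, src)]
    while heap:
        d, u = heapq.heappop(heap)
        if u in seen:
            continue
        seen.add(u)
        if u == dst:
            break
        for v, w in adj.get(u, {}).items():
            if d + w < dist.get(v, float("inf")):
                dist[v], prev[v] = d + w, u
                heapq.heappush(heap, (d + w, v))
    if dst not in dist:
        return None
    path, node = [dst], dst
    while node != src:
        node = prev[node]
        path.append(node)
    return path[::-1]

# The triangle from your example: customer primary on PE1, transit (and the filtering
# cluster, not modelled here) dual-homed to PE2/PE3.  Costs are made up.
LINKS = {("PE1", "PE2"): 10, ("PE2", "PE3"): 10, ("PE1", "PE3"): 10,
         ("CUST", "PE1"): 1, ("PE2", "TRANSIT"): 1, ("PE3", "TRANSIT"): 1}

def build_adj(links, failed=frozenset()):
    """Turn the link list into an adjacency map, skipping anything touching a failed node."""
    adj = {}
    for (a, b), cost in links.items():
        if a in failed or b in failed:
            continue
        adj.setdefault(a, {})[b] = cost
        adj.setdefault(b, {})[a] = cost
    return adj

if __name__ == "__main__":
    print("normal:  ", shortest_path(build_adj(LINKS), "CUST", "TRANSIT"))
    print("PE2 down:", shortest_path(build_adj(LINKS, {"PE2"}), "CUST", "TRANSIT"))

Real networks obviously layer BGP, TE/LFA, ECMP and policy on top of that, which is exactly where the entropy
I mentioned comes in.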

adam       

