nanog mailing list archives

Re: Soliciting your opinions on Internet routing: A survey on BGP convergence


From: Baldur Norddahl <baldur.norddahl () gmail com>
Date: Tue, 10 Jan 2017 03:51:04 +0100

Hello

I find that the type of outage that affects our network the most is neither of the two options you describe. As is probably typical for smaller networks, we do not have redundant uplinks to all of our transits. If a transit link goes down, for example because we had to reboot a router, traffic is supposed to reroute to the remaining transit links. Internally our network handles this fairly fast for egress traffic.

However, the problem is the ingress traffic: it can take 5 to 15 minutes before everything has settled down. This is the time it takes for everyone else on the internet to process that they have to switch to your alternate transit.

The only solution I know of is to have redundant links to all transits. Going forward I will make sure we have this, because it is a huge disadvantage not being able to take a router out of service without causing downtime for all users. Not to mention that a router crash or link failure, which should take seconds at most to reroute around, instead causes at least 5 minutes of unstable internet.

Regards,

Baldur


On 09/01/2017 at 23:56, Laurent Vanbever wrote:
Hi NANOG,

We often read that the Internet (i.e. BGP) is "slow to converge". But how slow
is it really? Do you care anyway? And can we (researchers) do anything about it?
Please help us find out by answering our short anonymous survey
(<10 minutes).

Survey URL: https://goo.gl/forms/JZd2CK0EFpCk0c272 <https://goo.gl/forms/WW7KX5kT45m6UUM82>


** Background:

While existing fast-reroute mechanisms enable sub-second convergence upon
local outages (planned or not), they do not apply to remote outages happening
further away from your AS as their detection and protection mechanisms only
work locally.

Remote outages therefore mandate a "BGP-only" convergence which tends to be
slow, as long streams of BGP UPDATEs (up to hundreds of thousands of them) must
be propagated router-by-router. Our initial measurements indicate that it can
take state-of-the-art BGP routers dozens of seconds to process and propagate
these large streams of BGP UPDATEs. During this time, traffic for important
destinations can be lost.
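
As a rough illustration of why router-by-router propagation adds up, the sketch below bounds the time for a burst of UPDATEs to cross a chain of routers. It is a back-of-the-envelope model, not a measurement: the function names, the processing rate, and the hop count are all hypothetical, and real convergence also depends on timers, path exploration, and policy.

```python
def store_and_forward_seconds(num_updates, updates_per_second, hops):
    """Pessimistic bound: each router processes the whole burst
    before the next router on the path starts receiving it."""
    return hops * num_updates / updates_per_second


def pipelined_seconds(num_updates, updates_per_second, hops):
    """Optimistic bound: each router forwards an UPDATE as soon as
    it has processed it, so the burst streams through the chain."""
    return (num_updates + hops - 1) / updates_per_second


if __name__ == "__main__":
    # Hypothetical inputs: 500,000 UPDATEs, 20,000 UPDATEs/s per router, 4 hops.
    burst, rate, hops = 500_000, 20_000, 4
    print(f"pipelined:         {pipelined_seconds(burst, rate, hops):.1f} s")
    print(f"store-and-forward: {store_and_forward_seconds(burst, rate, hops):.1f} s")
```

Under these assumed numbers the burst takes somewhere between roughly 25 and 100 seconds to reach the last router, which is the same order of magnitude as the "dozens of seconds" mentioned above.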


** This survey:

This survey aims at evaluating the impact of slow BGP convergence on
operational practices. We expect the findings to increase the understanding of
the perceived BGP convergence in the Internet, which could then help
researchers to design better fast-reroute mechanisms.

We expect the questionnaire to be filled out by network operators whose job relates
to BGP operations. It has a total of 17 questions and should take less than 10 minutes
to answer. The survey and the collected data are anonymous (so please do *not*
include information that may help to identify you or your organization).
All questions are optional, so if you don't like a question or don't know the answer,
please skip it.

A summary of the aggregate results will be published as a part of a scientific
article later this year.

Thank you so much in advance, and we look forward to reading your responses!


Laurent Vanbever (ETH Zürich, Switzerland)


PS: It goes without saying that we would also be extremely grateful if you could
forward this email to any operator you might know who may not read NANOG.

