nanog mailing list archives

Re: DNS pulling BGP routes?


From: Owen DeLong via NANOG <nanog () nanog org>
Date: Sun, 10 Oct 2021 17:28:59 -0700



On Oct 7, 2021, at 06:49 , Masataka Ohta <mohta () necom830 hpcl titech ac jp> wrote:

William Herrin wrote:

This is quite common to tie an underlying service announcement to BGP
announcements in an Anycast or similar environment.

Yes, that is a commonly seen mistake with anycast.
You don't know what you're talking about.

I do but you don't.

If your anycast node stops
receiving updated data and you can't reach any of the other nodes to
check whether they're online, 99 times out of 100 this means a local
failure of some sort.

Yes. In the case of DNS, if the expiration period of a zone passes
without a successful check of the most current zone version,
unicast or anycast name servers stop responding to requests for
the zone.
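
For concreteness, the expiry rule described above can be sketched in a few lines of Python. This is only an illustration, assuming the usual SOA EXPIRE semantics (the zone stops being authoritative once the EXPIRE interval has elapsed without a successful refresh). The function name and the sample values are hypothetical and not taken from any particular name server.

```python
def zone_is_expired(last_successful_refresh: float, soa_expire: int, now: float) -> bool:
    """Return True once more than soa_expire seconds have passed since
    the last successful check against the primary (all times in seconds)."""
    return now - last_successful_refresh > soa_expire


DAY = 86400  # seconds

# Hypothetical zone with a 7-day EXPIRE value.
# Last good refresh 8 days ago: the secondary should stop answering.
print(zone_is_expired(last_successful_refresh=0, soa_expire=7 * DAY, now=8 * DAY))

# Last good refresh 1 day ago: the zone is still considered valid.
print(zone_is_expired(last_successful_refresh=0, soa_expire=7 * DAY, now=1 * DAY))
```

Note that this check is the same whether the server is reached over a unicast or an anycast address.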

But, it has nothing specifically to do with anycast. As there
are other name servers with different IP addresses, there is
no reason to withdraw routes. So?

WRONG.

First, assuming that there are non-anycast name servers assumes
facts not in evidence.

Second, if you are a participant in an anycast name server network,
there are good reasons to withdraw your announcement of that
prefix in order to avoid users having to wait for timeouts (which in
some cases might be even worse than serving stale data).

You withdraw the node's announcement so that you
don't serve bad data to the end user.

That will only introduce new failure modes of mismatches between
server availability and server reachability and is a bad idea.

No, if the server is available, it should announce the anycast prefix.
If it is not available, it should withdraw it. That’s the best way to make
anycast work and it’s what virtually every anycast DNS server network
does.

If the server is unavailable, but doesn’t withdraw, then you have the
failure mode of the server being reachable, but unavailable and it
becomes a black hole for traffic that should otherwise flow to other
available anycast nodes.
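
As a rough illustration of the rule above (announce while available, withdraw when not), here is a minimal Python sketch of the decision a health-check loop might make. The function name and the "announce"/"withdraw" strings are placeholders chosen for this example. How the action is actually handed to a BGP speaker depends on the routing software and is not shown.

```python
from typing import Optional


def next_bgp_action(server_available: bool, prefix_announced: bool) -> Optional[str]:
    """Decide what to do with the anycast prefix on this node.

    Returns "announce", "withdraw", or None when nothing needs to change.
    """
    if server_available and not prefix_announced:
        return "announce"   # healthy but not attracting traffic yet
    if not server_available and prefix_announced:
        return "withdraw"   # reachable but unavailable: avoid the black hole
    return None             # announcement already matches availability


# The black-hole case: the server is down but the prefix is still announced.
print(next_bgp_action(server_available=False, prefix_announced=True))
```

The case that returns "withdraw" is the one that matters here: without it, the node keeps attracting queries it cannot answer.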

That's what happened here -

Yes, facebook did wrong thing to actively withdraw routes.

No, facebook did the right thing for 99+% of the situations that would trigger this
withdrawal. The problem was that they withdrew EVERY server when the failure
wasn’t local, instead of having some way to recognize the failure for what it was
(global in nature) and continue serving DNS.
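
The difference between the two cases can be shown with a toy Python model. It assumes, purely for illustration, that every node applies the same local rule (keep announcing only while it can reach the backbone to verify its data). The site names are made up, and this does not describe facebook's actual implementation.

```python
from typing import Dict, List


def announcing_nodes(backbone_reachable: Dict[str, bool]) -> List[str]:
    """Return the nodes that keep announcing the anycast prefix when each
    node independently withdraws as soon as it cannot reach the backbone."""
    return [node for node, ok in backbone_reachable.items() if ok]


# Local failure: one site loses its backbone link, the others keep serving.
local_failure = {"site-a": True, "site-b": False, "site-c": True}
print(announcing_nodes(local_failure))    # ['site-a', 'site-c']

# Global failure: the backbone itself is down, so every site withdraws.
global_failure = {"site-a": False, "site-b": False, "site-c": False}
print(announcing_nodes(global_failure))   # []
```

The per-node rule is identical in both runs. Only in the second does it leave no node announcing, because no node can tell that the failure is not local.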

Simply
turning themselves off, instead of withdrawing the routes, would
result in suboptimal performance.

This time, facebook is saying that they could not reach their
name servers even though the servers were perfectly working.

Because their servers couldn’t verify that they were working and thus
thought that they had stale data. So the servers were “perfectly
working”, but possibly with stale data, and if you can’t confirm
that your reason for believing you have stale data is erroneous,
the safe thing to do is to stop serving what you have. If you’re not going
to serve what you have, then you shouldn’t announce the anycast prefix, either.

How much performance, do you think, facebook enjoyed? A lot
less than "suboptimal", I'm afraid.

As noted, this was the 1% failure that wasn’t anticipated. The behavior of
the system was correct for 99% of failures, and the number of years facebook
has operated without a significant or noticeable DNS outage is a testament to
that fact.


And 99 times out of 100, not doing
one or the other would cause rather than prevent an outage.

That is a commonly seen misconception wrongly assuming that
server routes were withdrawn if and only if the server is
unavailable.

The servers withdrew their routes because the servers had no ability to
verify that they were serving valid data. If you can’t verify your data is valid,
it’s better (in most cases) to not serve the data you have. If you’re not going
to serve, the best thing to do is withdraw the anycast prefix that claims you
are a server for the data.

But, the reality is that it is impossible to correctly
recognize server is unavailable or to correctly withdraw
routes only when server is unavailable.

Yes… So you go with something that works 99% of the time, and you get
an event like this in the 1% of cases where the failure in question was not
one of the failure modes that was previously anticipated. I’m betting that
facebook is quickly figuring out changes that will mitigate this type of failure
in the future, and their DNS will likely stay up until the next 1 in 100 (or will it
be 1 in 10,000 this time?) event pops up and surprises them again.

That’s the nature of operations.

Owen

