nanog mailing list archives

Re: Better description of what happened


From: Tom Beecher <beecher () beecher cc>
Date: Wed, 6 Oct 2021 11:44:32 -0400

Based on what they have said publicly, the initial trigger was that all of
their datacenters were disconnected from their internal backbone, and thus
unreachable.

Once that occurs, nothing else really matters. Even if the external
announcements had not been withdrawn and the edge DNS servers could have
provided stale answers, the IPs in those answers wouldn't have actually been
reachable, and we wouldn't have had three days of red-herring conversations
about DNS design.

No DNS design exists that can help people reach resources that aren't
reachable on the network. /shrug


On Tue, Oct 5, 2021 at 6:30 PM Hugo Slabbert <hugo () slabnet com> wrote:

Had some chats with other folks:
Arguably you could change the nameserver isolation check failure action to
be "depref your exports" rather than "yank it all".  Basically a tiered
setup: the boxes passing those additional health checks, which should have
correct entries, are your primary destination, and failing nodes don't
receive query traffic since they're depref'd in your internal routing.  But
if all nodes fail that check simultaneously, the nodes failing the isolation
check attract traffic again since no better paths remain.  Better to serve
stale data than none at all; CAP theorem trade-offs at work?
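
For concreteness, a rough sketch of what the "depref rather than withdraw"
loop might look like on each node, assuming an ExaBGP-style text API on
stdout; the prefix, MED values, and the control-plane address used in the
isolation check are placeholders, not anything Facebook has described:

    #!/usr/bin/env python3
    # Sketch only: announce the anycast prefix with a worse MED when the
    # isolation check fails, instead of withdrawing it outright.
    import socket
    import sys
    import time

    ANYCAST = "192.0.2.53/32"   # placeholder anycast service prefix
    PREFERRED_MED = 100         # healthy node: best path
    DEPREF_MED = 1000           # failing node: still announced, worst path

    def isolation_check(host="controller.example.net", port=443) -> bool:
        # Hypothetical stand-in for "can I reach whatever keeps my data
        # fresh?" -- only the action taken on failure changes below.
        try:
            with socket.create_connection((host, port), timeout=2):
                return True
        except OSError:
            return False

    def announce(med: int) -> None:
        # The route stays in the table either way; only its preference moves.
        sys.stdout.write(f"announce route {ANYCAST} next-hop self med {med}\n")
        sys.stdout.flush()

    while True:
        announce(PREFERRED_MED if isolation_check() else DEPREF_MED)
        time.sleep(10)

If every node fails the check at once, the depref'd announcements are the
only paths left, so queries still land somewhere that can serve stale data.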

--
Hugo Slabbert


On Tue, Oct 5, 2021 at 3:22 PM Michael Thomas <mike () mtcc com> wrote:


On 10/5/21 3:09 PM, Andy Brezinsky wrote:

It's a few years old, but Facebook has talked a little bit about their
DNS infrastructure before.  Here's a little clip that talks about
Cartographer: https://youtu.be/bxhYNfFeVF4?t=2073

From their outage report, it sounds like their authoritative DNS servers
withdraw their anycast announcements when they're unhealthy.  The health
check from those servers must have relied on something upstream.  Maybe
they couldn't talk to Cartographer for a few minutes, so they thought they
might be isolated from the rest of the network and decided to withdraw their
routes instead of serving stale data.  Makes sense when a single node does
it, not so much when the entire fleet thinks it's out on its own.
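
In other words, the per-node logic described would be roughly the following
(again assuming an ExaBGP-style text API on stdout; the Cartographer-ish
hostname and the prefix are made up for illustration):

    # Sketch of the described behavior: withdraw the anycast route when the
    # node can't confirm it is healthy/connected.
    import socket
    import sys
    import time

    ANYCAST = "192.0.2.53/32"   # placeholder DNS anycast prefix

    def can_reach_control_plane(host="cartographer.example.net",
                                port=443) -> bool:
        # Hypothetical proxy for "am I isolated?": reachability of whatever
        # pushes fresh maps/records to this node.
        try:
            with socket.create_connection((host, port), timeout=2):
                return True
        except OSError:
            return False

    while True:
        if can_reach_control_plane():
            sys.stdout.write(f"announce route {ANYCAST} next-hop self\n")
        else:
            # Fine when one node takes this branch; if the whole fleet does
            # it at once (say, the backbone itself is down), the service
            # vanishes from DNS entirely.
            sys.stdout.write(f"withdraw route {ANYCAST} next-hop self\n")
        sys.stdout.flush()
        time.sleep(10)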

A performance issue in Cartographer (or whatever manages this fleet these
days) could have been the ticking time bomb that set the whole thing in
motion.

Rereading it, they said that their internal (?) backbone went down, so
pulling the routes was arguably the right thing to do, or at least not
flat-out wrong. Taking out their nameserver subnets was clearly a problem,
though a fix is probably tricky since you clearly want to be able to take
down errant nameservers too.


Mike





