nanog mailing list archives
Re: Better description of what happened
From: Michael Thomas <mike () mtcc com>
Date: Tue, 5 Oct 2021 15:18:54 -0700
On 10/5/21 3:09 PM, Andy Brezinsky wrote:
Rereading it is said that their internal (?) backbone went down so pulling the routes was arguably the right thing to do. Or at least not flat out wrong. Taking out their nameserver subnets was clearly a problem though, though a fix is probably tricky since you clearly want to take down errant nameservers too.

It's a few years old, but Facebook has talked a little bit about their DNS infrastructure before. Here's a little clip that talks about Cartographer: https://youtu.be/bxhYNfFeVF4?t=2073

From their outage report, it sounds like their authoritative DNS servers withdraw their anycast announcements when they're unhealthy. The health check from those servers must have relied on something upstream. Maybe they couldn't talk to Cartographer for a few minutes so they thought they might be isolated from the rest of the network and they decided to withdraw their routes instead of serving stale data. Makes sense when a single node does it, not so much when the entire fleet thinks that they're out on their own.

A performance issue in Cartographer (or whatever manages this fleet these days) could have been the ticking time bomb that set the whole thing in motion.
Mike
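
The failure mode speculated about above (every node withdrawing its anycast route because a shared upstream dependency looks unreachable) can be illustrated with a small sketch. This is not Facebook's implementation: the `Node` fields, both policy functions, and the two example fleets are hypothetical names invented here. The sketch only contrasts a rule that withdraws on any failed check with one that withdraws only on a local DNS fault.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    dns_ok: bool       # local DNS daemon answers queries (hypothetical check)
    upstream_ok: bool  # node can reach the upstream system (hypothetical check)


def naive_should_announce(node: Node) -> bool:
    # Announce only if both checks pass; losing the upstream means withdrawing.
    return node.dns_ok and node.upstream_ok


def guarded_should_announce(node: Node) -> bool:
    # Withdraw only on a local fault; keep announcing (and possibly serving
    # stale data) when just the upstream is unreachable.
    return node.dns_ok


def announcing(fleet, policy):
    # Names of the nodes that would still announce the anycast prefix.
    return [n.name for n in fleet if policy(n)]


# One isolated node and one broken node: the naive rule behaves sensibly.
fleet_one_bad = [
    Node("a", dns_ok=True, upstream_ok=True),
    Node("b", dns_ok=True, upstream_ok=False),
    Node("c", dns_ok=False, upstream_ok=True),
]

# Shared upstream unreachable from everywhere: the naive rule empties the fleet.
fleet_upstream_down = [Node(n, dns_ok=True, upstream_ok=False) for n in "abc"]

if __name__ == "__main__":
    for label, fleet in [("one bad", fleet_one_bad),
                         ("upstream down", fleet_upstream_down)]:
        print(label,
              "naive:", announcing(fleet, naive_should_announce),
              "guarded:", announcing(fleet, guarded_should_announce))
```

The guarded rule still takes an errant nameserver out of service, but it trades away the protection against stale data that the withdrawal was presumably meant to provide, so it is a design choice rather than an obvious fix.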