nanog mailing list archives
Better description of what happened
From: Michael Thomas <mike () mtcc com>
Date: Tue, 5 Oct 2021 13:39:21 -0700
This bit posted by Randy might get lost in the other thread, but it appears that their DNS servers withdraw BGP routes for prefixes that they can't reach or that are flaky. Apparently that goes for the prefixes that the name servers themselves are on too. This caused internal outages as well, since it seems they use their front-facing DNS just like everybody else.
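To make the failure mode concrete, here is a minimal sketch of that behavior as I understand it from the write-up. It is not Facebook's actual implementation: the class name, the `backbone_reachable` flag, and the documentation prefixes are all made up for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class AnycastDnsNode:
    """Hypothetical DNS node that announces its own prefix only while healthy."""

    prefix: str
    announced: bool = True
    log: list = field(default_factory=list)

    def evaluate(self, backbone_reachable: bool) -> None:
        # Withdraw the route when the health check fails, re-announce on recovery.
        if not backbone_reachable and self.announced:
            self.announced = False
            self.log.append(f"withdraw {self.prefix}")
        elif backbone_reachable and not self.announced:
            self.announced = True
            self.log.append(f"announce {self.prefix}")


def reachable_prefixes(nodes):
    """Prefixes that are still announced, i.e. name servers anyone can reach."""
    return [n.prefix for n in nodes if n.announced]


# Example prefixes from the documentation ranges, not real ones.
nodes = [AnycastDnsNode("192.0.2.0/24"), AnycastDnsNode("198.51.100.0/24")]

# A backbone-wide failure makes every node fail its check at the same time.
for node in nodes:
    node.evaluate(backbone_reachable=False)

print(reachable_prefixes(nodes))
```

When every node sees the same failure at once, each one independently does the "safe" thing and the whole name server fleet disappears, for internal clients as much as external ones.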
Sounds like they might consider having at least one split-horizon server internally. Lots of fodder here.
Mike

On 10/5/21 11:11 AM, Randy Monroe wrote:
Updated: https://engineering.fb.com/2021/10/05/networking-traffic/outage-details/

On Tue, Oct 5, 2021 at 1:26 PM Michael Thomas <mike () mtcc com> wrote:

On 10/5/21 12:17 AM, Carsten Bormann wrote:
> On 5. Oct 2021, at 07:42, William Herrin <bill () herrin us> wrote:
>> On Mon, Oct 4, 2021 at 6:15 PM Michael Thomas <mike () mtcc com> wrote:
>>> They have a monkey patch subsystem. Lol.
>> Yes, actually, they do. They use Chef extensively to configure
>> operating systems. Chef is written in Ruby. Ruby has something called
>> Monkey Patches.
> While Ruby indeed has a chain-saw (read: powerful, dangerous, still the tool of choice in certain cases) in its toolkit that is generally called “monkey-patching”, I think Michael was actually thinking about the “chaos monkey”,
> https://en.wikipedia.org/wiki/Chaos_engineering#Chaos_Monkey
> https://netflix.github.io/chaosmonkey/

No, chaos monkey is a purposeful thing to induce corner case errors so they can be fixed. The earlier outage involved a config sanitizer that screwed up and then pushed it out. I can't get my head around why anybody thought that was a good idea vs rejecting it and making somebody fix the config.

Mike

--
Randy Monroe
Network Engineering
Uber <https://uber.com/>