nanog mailing list archives

Re: Facebook post-mortems...


From: Mark Tinka <mark@tinka.africa>
Date: Wed, 6 Oct 2021 07:08:09 +0200



On 10/6/21 06:51, Hank Nussbacher wrote:


- "During one of these routine maintenance jobs, a command was issued with the intention to assess the availability of global backbone capacity, which unintentionally took down all the connections in our backbone network"

Can anyone guess as to what command FB issued that would cause them to withdraw all those prefixes?

Hard to say, as the command itself seems to have been innocent enough, perhaps running a batch of sub-commands to check port status, bandwidth utilization, MPLS-TE values, etc. However, it sounds like an unforeseen bug in the command ran other things, or the cascade of how the sub-commands were run caused unforeseen problems.
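Purely to illustrate what I imagine such a wrapper looks like (this is a guess, not Facebook's tooling; the device names, the command list and the run_on_router() helper below are all invented):

    # Hypothetical sketch only -- not Facebook's actual tooling. The
    # run_on_router() helper, the device list and the command strings
    # are all made up for illustration.

    AUDIT_COMMANDS = [
        "show interfaces summary",        # port status
        "show interfaces statistics",     # bandwidth utilisation
        "show mpls traffic-eng tunnels",  # MPLS-TE values
    ]

    def run_on_router(router: str, command: str) -> str:
        """Stand-in for however the real tool reaches a device (SSH, API, etc.)."""
        return f"(output of '{command}' on {router})"

    def audit_backbone(routers: list[str]) -> dict[str, dict[str, str]]:
        """Run the read-only audit commands against every backbone router."""
        results: dict[str, dict[str, str]] = {}
        for router in routers:
            results[router] = {
                command: run_on_router(router, command)
                for command in AUDIT_COMMANDS
            }
        return results

    if __name__ == "__main__":
        # A read-only sweep like this should be harmless; the danger is a bug
        # in the wrapper, or in whatever later consumes these results, turning
        # an audit into a change that cascades across the backbone.
        print(audit_backbone(["bb1.example", "bb2.example"]))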

We shall guess this one forever, as I doubt Facebook will go into that much detail.

What I can tell you is that all the major content providers spend a lot of time, money and effort automating both capacity planning and capacity auditing. It's a bit more complex for them, because their variables aren't just links and utilization, but also locations, fibre availability, fibre pricing, capacity lease pricing, the presence of carrier-neutral data centres, the presence of exchange points, current vendor equipment models and pricing, projection of future fibre and capacity pricing, etc.
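To give a rough, made-up idea of the inputs such a tool has to juggle (the field names below are invented for illustration, not anyone's real data model):

    # Invented for illustration only -- not anyone's real data model.
    from dataclasses import dataclass

    @dataclass
    class BackboneSegmentCandidate:
        location: str                          # metro / PoP
        link_utilisation: float                # the classic ISP input, 0.0 - 1.0
        dark_fibre_available: bool
        fibre_price_per_km_per_month: float
        leased_capacity_price_per_gbps: float
        carrier_neutral_dc_present: bool
        exchange_point_present: bool
        vendor_platform: str                   # current equipment model
        projected_annual_price_change: float   # e.g. -0.10 for a 10% drop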

It's a totally different world from normal ISP-land.



- "it was not possible to access our data centers through our normal means because their networks were down, and second, the total loss of DNS broke many of the internal tools we’d normally use to investigate and resolve outages like this.  Our primary and out-of-band network access was down..."

Does this mean that FB acknowledges that the loss of DNS broke their OOB access?

I need to put my thinking cap on, but I'm not sure whether running DNS in the IGP would have been better in this instance.

We run our Anycast DNS network in our IGP, mainly to guarantee latency-based routing at all times, but also to ensure that the failure of a higher-level protocol like BGP does not disconnect internal access that is needed for troubleshooting and repair. Given that the IGP is a much lower-level routing protocol, it is less likely (though not impossible) to go down together with BGP.

In the past, we have, indeed, had BGP issues that allowed us to maintain DNS access internally as the IGP was unaffected.
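For anyone curious, the usual shape of that pattern is a per-host health check that pulls the anycast service address when the local resolver dies, so queries fail over to the next-closest site. A minimal sketch, assuming dnspython, iproute2 and an IGP daemon that advertises connected loopback routes (the address and record below are placeholders, not ours):

    # Minimal sketch of an anycast-DNS health check. Assumes the host's IGP
    # daemon advertises connected routes, so deleting the /32 from the
    # loopback withdraws it from the IGP. Address and record are placeholders.
    import subprocess
    import dns.resolver  # dnspython

    ANYCAST_ADDR = "192.0.2.53/32"   # documentation prefix, not our real one

    def resolver_healthy() -> bool:
        """Ask the local daemon for a known record; any answer means it's alive."""
        try:
            r = dns.resolver.Resolver(configure=False)
            r.nameservers = ["127.0.0.1"]
            r.lifetime = 2
            r.resolve("example.com", "A")
            return True
        except Exception:
            return False

    def withdraw_anycast() -> None:
        # Removing the address stops the IGP advertising the /32, which is
        # what steers queries to another anycast site.
        subprocess.run(["ip", "addr", "del", ANYCAST_ADDR, "dev", "lo"],
                       check=False)

    if __name__ == "__main__":
        if not resolver_healthy():
            withdraw_anycast()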

The final statement from that report is interesting:

    "From here on out, our job is to strengthen our testing,
    drills, and overall resilience to make sure events like this
    happen as rarely as possible."

... which, in my rudimentary translation, means that:

    "There are no guarantees that our automation software will not
    poop cows again, but we hope that when that does happen, we
    shall be able to send our guys out to site much more quickly."

... which, to be fair, is totally understandable. These automation tools, especially in networks as large as BigContent's, become significantly more fragile the more complex they get and the more batch tasks they need to perform across various parts of a network of this size and scope. It's a pity these automation tools are all homegrown and can't be bought "pre-packaged and pre-approved to never fail" from the IT software store down the road. But it's the only way for networks of this scale to operate, and a risk they will always sit with for being that large.

Mark.

