nanog mailing list archives

Re: massive facebook outage presently


From: Baldur Norddahl <baldur.norddahl () gmail com>
Date: Mon, 4 Oct 2021 22:12:06 +0200

On Mon, 4 Oct 2021 at 21:58, Michael Thomas <mike () mtcc com> wrote:


On 10/4/21 11:48 AM, Luke Guillory wrote:


I believe the original change was 'automatic' (as in configuration done
via a web interface). However, now that connection to the outside world is
down, remote access to those tools doesn't exist anymore, so the emergency
procedure is to gain physical access to the peering routers and do all the
configuration locally.

Assuming that this is what actually happened, what should fb have done
differently (beyond the obvious of not screwing up the immediate issue)? This
seems like it's a single point of failure. Should all of the BGP speakers
have been dual-homed or something like that? Or should they not have been
mixing ops and production networks? Sorry if this sounds dumb.


Facebook is a huge network. It is doubtful that what is going on is this
simple. So I will make no guesses as to what Facebook is or should be doing.

However, the traditional way for us small timers is to have a backdoor using
someone else's network. Nowadays this could be a simple 4G/5G router with a
VPN to a terminal server that allows the operator to configure the
equipment through the console port even when the config is completely
destroyed.
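
As a rough sketch of the idea (not a description of anyone's real setup), the
operator side could try the normal in-band management address first and fall
back to the out-of-band terminal server when that fails. Everything named here
is an assumption made up for illustration: the addresses are documentation
ranges, SSH on port 22 is just a guess at the access method, and
pick_management_path is a hypothetical helper.

```python
import socket
from typing import Callable, Optional, Sequence, Tuple

# (label, host, port); all values below are hypothetical placeholders.
Path = Tuple[str, str, int]


def tcp_reachable(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable networks.
        return False


def pick_management_path(
    paths: Sequence[Path],
    probe: Callable[[str, int], bool] = tcp_reachable,
) -> Optional[Path]:
    """Return the first path whose endpoint answers, in priority order."""
    for path in paths:
        _label, host, port = path
        if probe(host, port):
            return path
    return None


# Priority order: production network first, then the backdoor that rides on
# someone else's network (e.g. a 4G/5G router with a VPN).
PATHS = [
    ("in-band", "192.0.2.10", 22),
    ("oob-terminal-server", "198.51.100.20", 22),
]
```

Calling pick_management_path(PATHS) returns the first entry that answers, or
None if neither does. The point of the ordering is that the out-of-band path
shares nothing with the production network, so it should still answer when the
in-band one is gone.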

Regards,

Baldur
