Politech mailing list archives

FC: Why all Microsoft sites went offline today: Some theories

From: Declan McCullagh <declan () well com>
Date: Wed, 24 Jan 2001 18:13:06 -0500

Here's one more development... Less than an hour ago, ping requests sent to the four Microsoft-wide name servers stopped working:

57 packets transmitted, 0 packets received, +1 errors, 100% packet loss

Pings throughout the day showed high -- but not 100 percent -- packet loss. At
11:20 ET I was getting zero packet loss. At 5 pm ET:
59 packets transmitted, 19 packets received, 67% packet loss
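
For reference, the percentage that ping prints is just lost packets over packets sent. A minimal sketch, assuming the result is truncated to a whole number (which matches the 67% above); the function name is made up for illustration:

```python
def packet_loss_percent(transmitted: int, received: int) -> int:
    """Return packet loss as a whole percentage, truncated toward zero."""
    if transmitted <= 0:
        raise ValueError("transmitted must be positive")
    lost = transmitted - received
    # Integer division truncates, an assumption about how ping rounds.
    return (lost * 100) // transmitted


if __name__ == "__main__":
    print(packet_loss_percent(59, 19))  # the 5 pm ET sample
    print(packet_loss_percent(57, 0))   # the sample after pings stopped
```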

I also can no longer get through to the microsoft.com web site -- even though I could earlier -- using its 207.46.230.218 address:

22 packets transmitted, 0 packets received, 100% packet loss

This could be evidence of a DDoS attack, Microsoft disabling ping responses, a natural result of caches expiring and additional traffic, or unrelated problems elsewhere in the network not under MS control.

Background:
http://www.politechbot.com/p-01662.html

-Declan

*********

Date: Wed, 24 Jan 2001 17:41:36 -0500 (EST)
From: BMM <bmm () minder net>
To: Declan McCullagh <declan () well com>
Subject: Re: FC: Microsoft websites blacked out -- but what happened?

Declan,

This is a simple but powerful example of poor infrastructure design.
Somehow they felt they needed four authoritative nameservers in order to
provide reliable name service, but then placed all of them on the same
network and in close proximity.  Microsoft's down because of a rookie
mistake.

On Wed, 24 Jan 2001, Declan McCullagh wrote:

> Millions of people have been prevented from visiting dozens of Microsoft
> websites today.
>
> Here's my notes on what happened. Briefly, four Microsoft computers
> somewhere in Redmond aren't working properly:
>
> * a.root-servers.net for microsoft.com, msnbc.com and others points to four
> DNS servers
> * those DNS servers are dns4.cp.msft.net through dns7.cp.msft.net
> * all four are alive: they respond to ping requests

This part seems strange.  If the hosts are up, why not the nameservers?
Didn't they have backups?  I'm not sure what flavor of DNS server they're
using (I'm mostly familiar with BIND) but maybe it keeps its zones in
the kind of proprietary database that MS is famous for, which is very
difficult to back up live.

> * that netblock appears to be owned by microsoft, so this is almost
> certainly not a hacker attack
> * the DNS servers seem to be physically close together, a terrible design
> decision, with IP addresses from just 207.46.138.11 to 207.46.138.21. they
> could even be in the same machine room.

No kidding.  DNS isn't all that hard to do well.  The ability to
load-balance and failover is built into the system, you just have to take
advantage of it. I'm impressed.

> * those DNS servers don't respond to dns lookup requests
> * therefore, things are screwed and people can't get through.
> * other affected sites: expedia.com, slate.com, encarta.com, passport.com
> * that is, unless your computer knows the ip address to microsoft.com etc.
> since your isp/corporation/university has it cached
> * but caches expire, so microsoft properties have been fading from the web
> all day

And since no one has it cached by now I'll bet the root servers are getting
hammered.  Both by the normal traffic that moves through those sites in a
day and by those watching the whole ordeal with morbid fascination.

> * the web servers are working fine; microsoft.com is at http://207.46.230.218/
> * the first person to identify the problem seems to be sean donelan at
> 11:05 pm PT last night
> * even though hotmail.com uses other DNS servers, it's still affected.
> reason: it redirects to http://lc1.law13.hotmail.passport.com/cgi-bin/login
> (per my attempt to connect to port 80)
> * my mail to microsoft.com addresses goes through fine, except to
> exchange.microsoft.com addresses, which had intermittent errors. that seems
> to be working because the DNS servers are still responding to requests for
> MX records.

MS's DNS servers aren't responding to MX lookups right now but some others
are. Maybe this is still cached data with a higher TTL than the A records
for some reason.  Sites I've found that have it cached have less than an
hour of cache-time left, others have lost it.
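
The fading described here can be pictured as per-record timers in a resolver cache. What follows is a rough sketch, not how any particular resolver is implemented: the TTL values and the MX host name are hypothetical, and only the 207.46.230.218 address comes from this thread.

```python
from dataclasses import dataclass


@dataclass
class CachedRecord:
    rtype: str          # "A", "MX", ...
    value: str
    ttl_seconds: int    # TTL handed out by the authoritative server
    cached_at: float    # time the answer was cached, in seconds

    def remaining(self, now: float) -> float:
        """Seconds of cache life left at time `now` (never negative)."""
        return max(0.0, self.cached_at + self.ttl_seconds - now)

    def is_fresh(self, now: float) -> bool:
        return self.remaining(now) > 0


def answerable(cache, now):
    """Record types a resolver could still answer from cache at `now`."""
    return sorted(r.rtype for r in cache if r.is_fresh(now))


# Hypothetical TTLs: a short-lived A record and a longer-lived MX record.
cache = [
    CachedRecord("A", "207.46.230.218", ttl_seconds=3600, cached_at=0.0),
    CachedRecord("MX", "mail.example.invalid", ttl_seconds=86400, cached_at=0.0),
]

if __name__ == "__main__":
    for hours in (0, 2, 25):
        print(hours, answerable(cache, now=hours * 3600.0))
```

With those made-up numbers, a resolver that cached both answers at the same moment would lose the A record after an hour but keep serving the MX record for a day, which would look from the outside like "MX works, A doesn't."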

> * normally when a website can't be reached, internet explorer defaults to
> auto.search.msn.com, which, ironically, is also offline. talk about a
> catastrophic failure. (this is one of the risks of moving services, like
> error messages and search functionality, to the net.)
> * at 4:26 pm ET, microsoft.com was still offline for me.

It's also a risk of acting like you ARE the web when your engineers don't
really understand how this whole Internet thing works.

>
> One Microsoft representative blamed ICANN, which as we can tell from the
> above has nothing to do with the problems:
>    http://www.idg.net/ic_386962_1793_1-1681.html
>    Microsoft has yet to pin down the cause of the DNS error. "It can
>    be a system or human error, but somebody could also have done this
>    intentionally," De Jonge said. "We don't manage the DNS ourselves,
>    it is a system controlled by the Internet Corporation for Assigned
>    Names and Numbers (ICANN) with worldwide replicas."
>

Whaaa?  It sounds like he's talking about ALL of DNS, from root to leaf,
here.  Global DNS is working just fine, and if MS actually did have
worldwide replicas they wouldn't be in the mess they're in.

> That said, this remains a mystery. Why would it take so long to get even
> one of those computers back online? Any network admins want to speculate?
>

The sheer length of time that's passed leads me to believe there's been
some corruption of the zones and they don't have good backups of the data.
Maybe the system administrators there never thought to actually run
through a disaster recovery scenario.  The inexperience that seems to have
brought this on makes this all too plausible.

Maybe Microsoft needs a US$300 an hour consultant to redesign their
network infrastructure?  I think I know where to find one...

Thanks,

-Brian

--
bmm () minder net
1024/8C7C4DE9

***********

From a politechnical who does not want to be identified:

Sounds to me like somebody nuked their zone files. The MX records coming
back, but nothing else, is just plain weird - it's the same lookup, into
the same database, by the same engine. The only explanation I can come up
with is that there isn't anything *but* the MX records in the db. Nasty,
if true, and indicative of a major compromise to their DNS system. Must be
"hackers!" (grin)

The "blame ICANN" bit is cute, but sorry - no cookie. ICANN doesn't manage
your zone info, *you* do - that's the whole point of the "distributed
administration" of the DNS hierarchy...

***********

To: declan () well com
Subject: Re: FC: Microsoft websites blacked out -- but what happened?
From: Jered Floyd <jered () MIT EDU>
X-Mailer: Gnus v5.6.45/XEmacs 21.1 - "Arches"

Declan McCullagh <declan () well com> writes:

> That said, this remains a mystery. Why would it take so long to get even
> one of those computers back online? Any network admins want to speculate?

Perhaps it's a cascade failure?  Given the standard quality of
Microsoft Engineering(tm), I would assume that the Windows-based DNS
servers they use are barely able to stand up to the normal load they
encounter.  Perhaps someone at Microsoft screwed up, rendering them
inaccessible for a period of time. As soon as they came back online,
they were so inundated with requests that they immediately crashed,
unable to handle the load? That's the closest to likely suggestion I
can come up with.
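
The cascade idea can be put into a toy model. This is only a sketch under invented assumptions: every number is hypothetical, "capacity" and "backlog" are illustrative names, and nothing here is measured from Microsoft's servers.

```python
def simulate_restarts(capacity, base_load, backlog, retry_fraction, rounds):
    """Toy model: return a list of (offered_load, state) per restart attempt.

    All quantities are queries per round and entirely hypothetical.
    """
    history = []
    for _ in range(rounds):
        retries = backlog * retry_fraction
        offered = base_load + retries
        if offered > capacity:
            # Overloaded: the server falls over and nothing is answered,
            # so this round's new queries join the backlog.
            backlog += base_load
            history.append((offered, "crash"))
        else:
            # The server keeps up: the retried part of the backlog is cleared.
            backlog -= retries
            history.append((offered, "up"))
    return history


if __name__ == "__main__":
    # A server sized just above normal load never recovers in this model...
    print(simulate_restarts(100, 80, 200, 0.5, 3))
    # ...while one with plenty of headroom drains the backlog.
    print(simulate_restarts(500, 80, 200, 0.5, 3))
```

In this model a server with little headroom sees a larger load on each restart, because every failed round adds to the pile of resolvers waiting to retry.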

--Jered

**********

Date: Wed, 24 Jan 2001 12:57:31 -0900 (AKST)
From: B Potter <gdead () shmoo com>
To: Declan McCullagh <declan () well com>
Subject: Re: FC: Microsoft websites blacked out -- but what happened?

Howdy,

> * the DNS servers seem to be physically close together, a terrible design
> decision, with IP addresses from just 207.46.138.11 to 207.46.138.21. they
> could even be in the same machine room.

This is just insane.  If you take a look at the bgp announcement for that
netblock at nitrous.digex.net (click MAE-west looking glass, then click
BGP and put in the address) you'll see the routes going to this block:
BGP routing table entry for 207.46.128.0/18, version 23510879
Paths: (18 available, best #3)
  Advertised to peer-groups:
     rr-pop
  3561 8070
    165.117.59.14 (metric 355700) from 165.117.1.57 (165.117.1.57)
      Origin IGP, metric 4294967294, localpref 100, valid, internal
      Community: 2548:239 2548:666 3706:120
  3356 8070
    209.244.219.161 (metric 440700) from 165.117.1.195 (165.117.1.195)
      Origin IGP, metric 4294967294, localpref 100, valid, internal
      Community: 2548:172 2548:264 2548:666 3706:115
  3356 8070
    166.90.50.141 (metric 80200) from 165.117.1.204 (165.117.1.204)
      Origin IGP, metric 4294967294, localpref 100, valid, internal, best
      Community: 2548:172 2548:263 2548:666 3706:170

etc...  all of them terminate in AS 8070.  Query ARIN to see who owns AS
8070:
bash-2.04$ whois -h whois.arin.net 8070
Microsoft Corporation (ASNBLK-MICROSOFT-AS-BLOCK)
   One Microsoft Way
   Redmond, WA 98008
   US

   Autonomous System Name: MICROSOFT-AS-BLOCK
   Autonomous System Block: 8068 - 8075
 etc...

It's typically considered best practice to put at least one name server in
a different AS, so that if part of the global routing table goes nuts,
you have a fighting chance of keeping nameservice alive.  (Remember that
if an MX record for a domain doesn't exist, the mail will bounce back from
most MTAs rather than be queued.)

So physical closeness is not the only consideration here.
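
A quick way to see how tightly packed those addresses are is to compute the smallest prefix that covers them and compare it with the announced route. This is a sketch using Python's standard ipaddress module. Only the two endpoints of the quoted range are used, since the four individual server addresses aren't listed here, and the helper name is made up.

```python
import ipaddress


def covering_prefix(addresses):
    """Smallest single network containing every address (same IP version)."""
    addrs = [ipaddress.ip_address(a) for a in addresses]
    if len({a.version for a in addrs}) != 1:
        raise ValueError("addresses must all be IPv4 or all IPv6")
    # Start from a host route for the first address and widen one bit at a time.
    net = ipaddress.ip_network(f"{addrs[0]}/{addrs[0].max_prefixlen}")
    while not all(a in net for a in addrs):
        net = net.supernet()
    return net


# The /18 from the BGP table entry above and the endpoints of the quoted range.
announced = ipaddress.ip_network("207.46.128.0/18")
endpoints = ["207.46.138.11", "207.46.138.21"]

if __name__ == "__main__":
    block = covering_prefix(endpoints)
    print(block, block.subnet_of(announced))
```

If every nameserver sits inside one small block under a single announcement, one routing problem with that announcement takes all of them out at once.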

> * those DNS servers don't respond to dns lookup requests

ditto from my view of the world

> That said, this remains a mystery. Why would it take so long to get even
> one of those computers back online? Any network admins want to speculate?

I have no idea.  I _assume_ they would be doing
regular backups, and they could restore from a last known good backup
without much trouble.  My guess would be someone found a vulnerability
in MS's nameserver and keeps hitting them with it every time they get the
box back up.  And since DNS queries are UDP based, they're REALLY easy to
spoof.  Therefore as long as those boxes are up on the Net, the hacker
could keep pummelling them with the evil packet and there would be very
little way to stop them.  So it becomes a race to patch the hole in code
before everyone's cache expires, OR switch to a different brand of
nameserver.
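
To make the UDP point concrete: a DNS query is one small datagram with no connection setup, so the server has nothing but the packet's own source field to tell it who is asking. The sketch below builds an ordinary query by hand following the standard DNS message layout; the function names, the fixed query ID and the 3-second timeout are illustrative choices, not anyone's production tooling.

```python
import socket
import struct

QTYPE_A = 1    # host address
QTYPE_MX = 15  # mail exchanger


def build_query(name, qtype=QTYPE_A, query_id=0x1234):
    """Return the bytes of a one-question DNS query for `name`."""
    # Header: ID, flags (0 = plain query, no recursion requested),
    # then QDCOUNT=1, ANCOUNT=0, NSCOUNT=0, ARCOUNT=0.
    header = struct.pack("!HHHHHH", query_id, 0, 1, 0, 0, 0)
    # QNAME: each label is length-prefixed; a zero byte ends the name.
    qname = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in name.rstrip(".").split(".")
    ) + b"\x00"
    # QTYPE followed by QCLASS=1 (IN).
    return header + qname + struct.pack("!HH", qtype, 1)


def ask(server_ip, name, qtype=QTYPE_A, timeout=3.0):
    """Send the query over UDP; return the raw reply, or None on timeout."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        sock.sendto(build_query(name, qtype), (server_ip, 53))
        try:
            reply, _ = sock.recvfrom(512)
        except socket.timeout:
            return None
    return reply


if __name__ == "__main__":
    print(len(build_query("microsoft.com")), "bytes in the whole query")
```

Calling ask() against a nameserver's address with QTYPE_A and then QTYPE_MX is also a simple way to repeat the "A fails, MX answers" comparison made earlier in this thread.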

Which is worse?  MS going offline, or them running solaris boxes using
BIND? ;)  Now that's a PR decision I wouldn't want to have to make.

later

bruce
The Shmoo Group

*********

Date: Wed, 24 Jan 2001 18:09:35 -0500
From: Rich Kulawiec <rsk () gsp org>
To: Declan McCullagh <declan () well com>
Subject: Re: FC: Microsoft websites blacked out -- but what happened?

On Wed, Jan 24, 2001 at 04:30:58PM -0500, Declan McCullagh wrote:
> That said, this remains a mystery. Why would it take so long to get even
> one of those computers back online? Any network admins want to speculate?

Sure, I'll bite.  In fact, I'll give you two independent explanations,
neither one of which may have anything to do with reality.

1. Incompetence/malfeasance.  We see some evidence of this already:
sure, small organizations have their DNS servers located together
and there's nothing wrong with that.  But national/global organizations
should, as SOP, have their DNS servers on different networks served
by different ISPs and (IMHO) running on different operating systems
(e.g. Solaris and FreeBSD, or Linux and HPUX) so as to minimize the
threats for DoS attacks, known OS vulnerabilities, and connectivity
issues.  It's hard to imagine a reason why MS hasn't done this...but
this is apparently the case.

Given that we know that, it's not hard to guess at the next possibility:
no backups and an auto-updated configuration.  The latter is pretty common:
it's desirable to update multiple DNS servers in lock-step, and so
some kind of mirroring software is often used...which is great until
you do something ill-advised to the master and the mirroring software
quickly replicates it on all the copies.  (You'd still be okay, albeit
temporarily inconvenienced, if you were using some kind of revision control
or had frequently-refreshed backups: just undo the changes, re-distribute,
and then take a break to think through what went wrong the first time.)
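
The mirror-then-undo scenario can be sketched in a few lines. This is an illustration only: the class and function names are invented, zones are reduced to plain dictionaries, and the MX host name is a placeholder; it is not a description of Microsoft's setup or of any real mirroring product.

```python
class ZoneStore:
    """Keeps every published zone revision so a bad push can be undone."""

    def __init__(self, initial):
        self.revisions = [dict(initial)]

    @property
    def current(self):
        return self.revisions[-1]

    def publish(self, records):
        """Record a new revision; older ones stay available."""
        self.revisions.append(dict(records))

    def rollback(self):
        """Drop the newest revision and return the one before it."""
        if len(self.revisions) < 2:
            raise RuntimeError("no earlier revision to roll back to")
        self.revisions.pop()
        return self.current


def mirror(master, replicas):
    """Blindly copy the master's current zone onto every replica."""
    for replica in replicas:
        replica.clear()
        replica.update(master.current)


GOOD_ZONE = {
    ("microsoft.com", "A"): "207.46.230.218",
    ("microsoft.com", "MX"): "mail.example.invalid",  # placeholder value
}

if __name__ == "__main__":
    master = ZoneStore(GOOD_ZONE)
    replicas = [{}, {}, {}, {}]
    mirror(master, replicas)
    master.publish({})         # the ill-advised change
    mirror(master, replicas)   # ...replicated everywhere in lock-step
    print([len(r) for r in replicas])
    master.rollback()          # undo, then re-distribute
    mirror(master, replicas)
    print([len(r) for r in replicas])
```

Without the revision list, the second mirror() call would leave no copy of the good zone anywhere, which is the situation guessed at in the next paragraph.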

My guess is that there are no backups, and so someone (many someones)
is/are cobbling together DNS tables by hand, from old hosts files,
from pings of the network, from whatever.  Very ugly.  But it's hard
to imagine what other sort of process would take so long -- by now
they could have easily installed an OS from scratch, including BIND
(I'm assuming they're using BIND for DNS, but I don't know that) and
reloaded everything -- heck, that would take a couple of hours, tops.
So now that this is closing in on 24 hours, I become more suspicious
that there's a lot of frantic typing going on.

2. Self-inflicted wound for propaganda reasons.  (This is obviously
a much more far-fetched explanation, but you *did* invite "speculation"!)

MS does not like the current DNS system.  They especially do not like
BIND, which is so incredibly pervasive that it makes sendmail's market
penetration look weak by comparison.  MS wants us to use *their* product
(of course) and one way to make their product seem superior to one
which is (a) open-source (b) mature (c) portable (d) scalable
(e) heavily peer-reviewed (f) frequently updated (g) a de facto standard
is to show how much despair and hand-wringing they can exhibit
when it "fails" on them.  "Oh, look", they'll whine, "even we who
are like unto Godz of Computing are rendered helpless by DNS...
bad DNS...inadequate DNS...non-MS DNS..."

Maybe.  Maybe not.  See attached recent thread from Usenet's
comp.protocols.tcp-ip.domains for more illumination including
a pointer to an "interesting" article at ZDnet.

---Rsk
Rich Kulawiec
rsk () gsp org

***********

-------------------------------------------------------------------------
POLITECH -- Declan McCullagh's politics and technology mailing list
You may redistribute this message freely if it remains intact.
To subscribe, visit http://www.politechbot.com/info/subscribe.html
This message is archived at http://www.politechbot.com/
-------------------------------------------------------------------------
