nanog mailing list archives

Re: Journal of Internet Disasters


From: Paul Vixie <paul () vix com>
Date: 13 Nov 1998 09:03:34 -0800

This should go on the name-droppers list, but here goes....

these days it's not clear whether namedroppers is an operations list
or a protocol list or still both.  i think nanog is a fine forum for this:

What do we know about the events with the name servers

   - f.root-servers.net was not able to transfer a copy of some of
      the zone files from a.root-servers.net
   - f.root-servers.net became lame for some zones

just COM.

   - tcpdump showed odd AXFR from a.root-servers.net

just a lot of missed/retransmitted ACKs.

   - [fjk].gtld-servers.net have been reported answering NXDOMAIN to
      some valid domains, NSI denies any problem

the nanog archives include some dig results that are hard for NSI to deny.

Other events which may or may not have been related
    - BGP routing bug disrupted connectivity for some backbones in the
      preceding days

this turned up a performance problem in BIND's retry code, btw, but was
not otherwise related to the COM lossage of yesterday (as far as i know).

    - Last month the .GOV domain was missing on a.root-servers.net due
      to a 'known bug' affecting zone transfers from GOV-NIC

different bug.  that one causes truncated zone transfers; the secondary
zone files on [fjk].gtld-servers.net yesterday were not truncated and it
just took a restart to make them stop behaving badly.

    - Someone has been probing DNS ports for an unknown reason

Things I don't know
    - f.root-servers.net and NSI's servers reacted differently.  What
      are the differences between them (BIND versions, in-house source
      code changes, operating systems/run-time libraries/compilers)

they are completely different systems (solaris vs. digital unix) running
the same (unmodified) bind 8.1.2 sources, which had completely different
failure modes for completely different reasons.

    - how long were servers unable to transfer the zone?  The SOA says
      a zone is good for 7 days.  Why did they expire/corrupt the old
      zone before getting a new copy?

damn good question.  i'll look into that.  shouldn't've happened.
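
For illustration, the behavior the question expects can be sketched as below. This is a minimal sketch of the SOA EXPIRE rule, not BIND's actual code; the class and field names are hypothetical, and the 7-day value is the figure quoted above.

```python
from dataclasses import dataclass

SEVEN_DAYS = 7 * 24 * 3600  # SOA EXPIRE quoted in the thread, in seconds


@dataclass
class SecondaryZone:
    """Hypothetical model of a secondary's view of one zone."""
    expire: int             # SOA EXPIRE field, in seconds
    last_refresh_ok: float  # time of the last successful refresh or transfer

    def is_authoritative(self, now: float) -> bool:
        # Failed transfers do not invalidate the copy already held.
        # The old copy stays usable until EXPIRE seconds have passed
        # since the last successful refresh.
        return (now - self.last_refresh_ok) < self.expire


zone = SecondaryZone(expire=SEVEN_DAYS, last_refresh_ok=0.0)
still_good_after_two_days = zone.is_authoritative(2 * 86400)
still_good_after_eight_days = zone.is_authoritative(8 * 86400)
```

Under this rule a secondary that cannot reach its master for a day or two should keep answering from the old copy, which is why going lame that quickly looks wrong.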

    - Routing between ISC and NSI in the period preceding the discovery
      of the problem

there was asymmetry (they reached me via bbnplanet, i reached them via
alternet).  they are now preferring alternet to reach me, so we have
better path symmetry now.  but their first mile is still congested and
i am still retransmitting a lot of ACKs.

Theories
    - Network connectivity was insufficient between NSI and ISC for long
      enough that the zones timed out (why were other servers affected?)

other servers are more conservative, and had switched to manual daily FTP
of the COM zone longer ago than F has done.  (with manual daily FTP you
get the advantages of gzip, and of the pretense of "zone master" status
while you manually retry after timeouts.  AXFR needs those properties.)
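
A rough sketch of those two properties (compression, and retrying after a timeout), for illustration only. The download is simulated: `make_flaky_fetch` and `fetch_zone_with_retry` are hypothetical names, the zone data is made up, and nothing here reflects the procedure any root or gTLD operator actually used.

```python
import gzip


def fetch_zone_with_retry(fetch, max_attempts=3):
    """Call fetch() until it returns gzip bytes; return the decompressed text."""
    last_error = None
    for _ in range(max_attempts):
        try:
            return gzip.decompress(fetch()).decode("ascii")
        except TimeoutError as exc:
            # Assumed failure mode: the transfer timed out, so try again.
            last_error = exc
    raise last_error


def make_flaky_fetch(payload, failures=1):
    """Hypothetical stand-in for the download: times out, then succeeds."""
    state = {"left": failures}

    def fetch():
        if state["left"] > 0:
            state["left"] -= 1
            raise TimeoutError("transfer timed out")
        return payload

    return fetch


# Made-up, highly repetitive zone data; real zone files are also repetitive.
zone_text = "example.com. 172800 IN NS ns1.example.net.\n" * 1000
payload = gzip.compress(zone_text.encode("ascii"))
result = fetch_zone_with_retry(make_flaky_fetch(payload))
```

The compressed payload is much smaller than the raw text, so less data crosses the congested path, and a timeout costs only another attempt rather than the zone itself.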

    - Bug in BIND (or an in-house modified version) (why did vixie's and
      NSI's servers return different responses?)

there's definitely a bug in BIND if [fjk].gtld-servers.net were able to
return different answers after restarts with no new zone transfers.  (i'm
sitting here wishing i had core dumps.)

    - Bug in a support system (O/S, RTL, Compiler, etc) or its installation
    - Operator error (erroneous reports of failure)
    - Other malicious activity?

i think there were a goodly number of procedural errors.
-- 
Paul Vixie <paul () vix com>