nanog mailing list archives

Re: ISC DHCP server failover

From: Mike <mike-nanog () tiedyenetworks com>
Date: Fri, 19 Mar 2010 17:10:04 -0700

David W. Hankins wrote:

On Wed, Mar 17, 2010 at 09:22:06AM -0500, Dan White wrote:

  The servers stop balancing their addresses, and one server starts to
exhibit 'peer holds all free leases' in its logs, in which case we need to
restart the dhcpd process(es) to force a rebalance.


If restarting one or both dhcpd processes corrects a pool balancing
problem, then I suspect what you're looking at is a bug where the
servers would fail to schedule a reconnection if the failover socket
is lost in a particular way.  Because the protocol also uses a message
exchange inside the TCP channel to determine if the socket is up
(rather than just TCP keepalives) this can sometimes happen even
without a network outage during load spikes or other brief hiccups on

<long explanation snipped>

With all due respect and acknowledgment of the tremendous contributions of ISC and you yourself Mr. Hankins, I have to comment that failover in isc-dhcp is broken by design because it requires the amount of handholding and operator thinking in the event of a failure that you explained to us at length is required. Failure needs to be handled automatically and without any intervention at all, otherwise you might as well not have it and I think most network operators would agree.

I am certainly not prepared to develop proof of concept code or go the full route of developing such a server myself, however, I believe firmly that a failover implementation in dhcp could be designed as a counterpoint to the current implementation that is reliable, simple, scalable and requires no special procedures once a 'break' occurs. The method used by isc-dhcpd, I think, creates the potential for unreliable failover because it's not designed for the 'right' problem. But there are example implementations - such as vrrp/carp - that would form the basis of a trustworthy dhcp failover protocol. Your key issues are a) broadcast discovery packets, which every listening host on the lan segment (such as 1 or more slaves) can easily respond to, and b) unicast frames from relay agents and others, which could easily be handled by a virtual mac/shared ip address held by a group of slaves. This means that redundancy of more than 2 hosts is already possible. The last pieces are a protocol for servers to join and leave the pool of hosts serving dhcp, a master election protocol that pre-determines the order of slaves to fail over to in order to avoid the half-brain syndrome, a sanity checking protocol to ensure the elected master is sane and kicking (eg: the slaves all hit the master with, what else, dhcp requests), and a well defined group database update protocol over the network so that leases hit some fixed storage somewhere, sometime.
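
To illustrate only the election piece, here is a minimal sketch of a pre-determined failover order in the vrrp style, where the highest-priority healthy host becomes master. It is not taken from isc-dhcp or from any vrrp/carp implementation. The Peer class, the priority and healthy fields, and both function names are hypothetical, and the health flag stands in for whatever sanity check (such as a test dhcp request) the group would actually run.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Peer:
    """One dhcp server in the failover group (hypothetical model)."""
    name: str
    priority: int         # higher wins, as in vrrp
    healthy: bool = True  # assumed result of some external sanity check


def failover_order(peers):
    """Return peers in the fixed order in which they should take over."""
    # Sort by priority (descending) and break ties by name, so every
    # peer computes the same order independently from the same inputs.
    return sorted(peers, key=lambda p: (-p.priority, p.name))


def elect_master(peers):
    """Return the first healthy peer in the failover order, or None."""
    for peer in failover_order(peers):
        if peer.healthy:
            return peer
    return None


if __name__ == "__main__":
    group = [Peer("dhcp-c", 100), Peer("dhcp-a", 200), Peer("dhcp-b", 150)]
    print(elect_master(group).name)

    # Simulate the master failing its sanity check: the next host in the
    # pre-determined order takes over with no operator intervention.
    degraded = [Peer(p.name, p.priority, healthy=(p.name != "dhcp-a"))
                for p in group]
    print(elect_master(degraded).name)
```

Because the order is a pure function of the configured priorities and names, any two hosts that agree on group membership and health pick the same master. Reaching that agreement is the hard part, and the sketch does not address it.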



Just my $0.02 worth.

Mike-
