nanog mailing list archives

Re: ISC DHCP server failover


From: Dan White <dwhite () olp net>
Date: Sat, 20 Mar 2010 13:20:09 -0500

On 19/03/10 17:10 -0700, Mike wrote:
David W. Hankins wrote:
On Wed, Mar 17, 2010 at 09:22:06AM -0500, Dan White wrote:
  The servers stop balancing their addresses, and one server starts to
exhibit 'peer holds all free leases' in its logs, in which case we need to
restart the dhcpd process(es) to force a rebalance.

If restarting one or both dhcpd processes corrects a pool balancing
problem, then I suspect what you're looking at is a bug where the
servers would fail to schedule a reconnection if the failover socket
is lost in a particular way.  Because the protocol also uses a message
exchange inside the TCP channel to determine if the socket is up
(rather than just TCP keepalives), this can sometimes happen even
without a network outage during load spikes or other brief hiccups on
<long explanation snipped>
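
For context, the failover channel being discussed is set up with a peer declaration in dhcpd.conf along the lines of the sketch below; the peer name, addresses, and timer values here are illustrative only, not anyone's actual configuration. The max-response-delay setting bounds how long a server will go without hearing from its peer before it treats the connection as failed and (normally) schedules a reconnect:

    failover peer "dhcp-failover" {      # illustrative peer name
        primary;                         # the partner declares "secondary;"
        address 192.0.2.10;              # this server's failover endpoint
        port 647;
        peer address 192.0.2.11;         # the partner's failover endpoint
        peer port 647;
        max-response-delay 60;           # seconds of silence before the link is considered dead
        max-unacked-updates 10;
        mclt 3600;                       # maximum client lead time (primary only)
        split 128;                       # division of the client hash space (primary only)
        load balance max seconds 3;
    }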

With all due respect, and acknowledging the tremendous contributions of ISC and of you yourself, Mr. Hankins, I have to comment that failover in isc-dhcp is broken by design, because it requires the amount of handholding and operator thinking in the event of a failure that you explained to us at length. Failure needs to be handled automatically, without any intervention at all; otherwise you might as well not have it, and I think most network operators would agree.

I don't want to defend bad code where it may exist, but I view the problems
we've encountered with ISC DHCP as minor compared to the benefits.

It may not be fair to compare DHCP failover to redundancy in a routing
scenario.  In a routing failure, I'd be highly motivated to find the root
cause, open tickets, and get the problem fixed.

In a scenario where a couple of customers are unable to pull an IP address
every few months, I'm OK with manual intervention as long as state is
maintained. I'd argue that it's more important to maintain data integrity
(no two servers thinking they own the same IP) than availability (where one
server, too aggressive about serving on its own, ends up corrupting data).
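
To put that in concrete terms: each shared pool is assigned to the failover peer, the free addresses in it are divided between the two servers, and the mclt setting bounds how long either server can extend a lease beyond what its partner has acknowledged; together that is what keeps the two servers from handing out the same address. A minimal sketch, with purely illustrative subnet, range, and peer name:

    subnet 192.0.2.0 netmask 255.255.255.0 {
        pool {
            failover peer "dhcp-failover";   # must match the failover declaration's name
            deny dynamic bootp clients;      # failover pools must not serve dynamic BOOTP
            range 192.0.2.100 192.0.2.200;
        }
    }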

That preference holds for much of the open source software I use, such as
Cyrus (email) replication and OpenLDAP synchronization.

Given the resources that I and others in my company have for dealing with
issues, it's always a matter of putting out the biggest fire. If/when problems
with DHCP failover become a big enough issue, we'll spend the time to find out
what in our network is causing it and fix it, or find the bug in the software
and open a bug report.

All problems are fixable given enough resources, and enough motivation.

--
Dan White

