nanog mailing list archives

Re: ISC DHCP server failover


From: Dan White <dwhite () olp net>
Date: Sat, 20 Mar 2010 13:20:09 -0500

On 19/03/10 17:10 -0700, Mike wrote:
David W. Hankins wrote:
On Wed, Mar 17, 2010 at 09:22:06AM -0500, Dan White wrote:
  The servers stop balancing their addresses, and one server starts to
exhibit 'peer holds all free leases' in its logs, in which case we need to
restart the dhcpd process(es) to force a rebalance.

If restarting one or both dhcpd processes corrects a pool balancing
problem, then I suspect what you're looking at is a bug where the
servers would fail to schedule a reconnection if the failover socket
is lost in a particular way.  Because the protocol also uses a message
exchange inside the TCP channel to determine if the socket is up
(rather than just TCP keepalives), this can sometimes happen even
without a network outage during load spikes or other brief hiccups on
<long explanation snipped>
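
For context, the failover channel being discussed is set up with a peer declaration in dhcpd.conf along the lines of the sketch below; the peer name, addresses, and timer values here are illustrative only, not anyone's actual configuration. The max-response-delay setting bounds how long a server will go without hearing from its peer before it treats the connection as failed and (normally) schedules a reconnect:

    failover peer "dhcp-failover" {      # illustrative peer name
        primary;                         # the partner declares "secondary;"
        address 192.0.2.10;              # this server's failover endpoint
        port 647;
        peer address 192.0.2.11;         # the partner's failover endpoint
        peer port 647;
        max-response-delay 60;           # seconds of silence before the link is considered dead
        max-unacked-updates 10;
        mclt 3600;                       # maximum client lead time (primary only)
        split 128;                       # division of the client hash space (primary only)
        load balance max seconds 3;
    }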

With all due respect, and acknowledging the tremendous contributions of ISC and of you yourself, Mr. Hankins, I have to comment that failover in isc-dhcp is broken by design, because it requires the amount of handholding and operator thinking in the event of a failure that you explained to us at length. Failure needs to be handled automatically, without any intervention at all; otherwise you might as well not have it, and I think most network operators would agree.

I don't want to defend bad code where it may exist, but I view the problems
we've encountered with ISC DHCP as minor compared to the benefits.

It may not be fair to compare DHCP failover to redundancy in a routing
scenario.  In a routing failure, I'd be highly motivated to find the root
cause, open tickets, and get the problem fixed.

In a scenario where a couple of customers are unable to pull an IP address
every few months, I'm OK with manual intervention as long as state is
maintained. I'd argue that it's more important to maintain data integrity
(no two servers thinking they own the same IP) than availability (where one
server, too aggressive about serving on its own, ends up corrupting data).
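
To put that in concrete terms: each shared pool is assigned to the failover peer, the free addresses in it are divided between the two servers, and the mclt setting bounds how long either server can extend a lease beyond what its partner has acknowledged; together that is what keeps the two servers from handing out the same address. A minimal sketch, with purely illustrative subnet, range, and peer name:

    subnet 192.0.2.0 netmask 255.255.255.0 {
        pool {
            failover peer "dhcp-failover";   # must match the failover declaration's name
            deny dynamic bootp clients;      # failover pools must not serve dynamic BOOTP
            range 192.0.2.100 192.0.2.200;
        }
    }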

That preference holds for much of the open source software I use, such as
Cyrus (email) replication and OpenLDAP synchronization.

Given the resources that I and others in my company have for dealing with
issues, it's always a matter of putting out the biggest fire. If/when problems
with DHCP failover become a big enough issue, we'll spend the time to find out
what in our network is causing it and fix it, or find the bug in the software
and open a bug report.

All problems are fixable given enough resources, and enough motivation.

--
Dan White

