nanog mailing list archives

Re: FYI Netflix is down


From: "steve pirk [egrep]" <steve () pirk com>
Date: Wed, 11 Jul 2012 10:00:41 -0700

On Mon, Jul 9, 2012 at 10:20 AM, Dave Hart <davehart () gmail com> wrote:

"We continue to investigate why these connections were timing out
during connect, rather than quickly determining that there was no
route to the unavailable hosts and failing quickly."

potential translation:

"We continue to shoot ourselves in the foot by filtering all ICMP
without understanding the implications."
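
To illustrate the difference being described: when an ICMP "destination unreachable" message reaches the client, connect() can fail right away with a "no route" error; when that ICMP is filtered, the client only learns of the failure once its own timeout expires. Below is a minimal Python sketch of telling those cases apart on the client side. The function names and the 3-second default are assumptions made for illustration, not anything taken from the incident report.

```python
import errno
import socket


def classify_connect_error(exc: OSError) -> str:
    """Map a failed connect() to a coarse failure mode."""
    # Check timeouts first: on newer Pythons socket.timeout is TimeoutError.
    if isinstance(exc, (socket.timeout, TimeoutError)):
        return "timeout"        # nothing came back; we waited the full deadline
    if isinstance(exc, ConnectionRefusedError):
        return "refused"        # host answered with a TCP reset
    if exc.errno in (errno.EHOSTUNREACH, errno.ENETUNREACH):
        return "unreachable"    # an ICMP unreachable (or local routing) said no
    return "other"


def try_connect(host: str, port: int, timeout_s: float = 3.0) -> str:
    """Attempt a TCP connect and report how it ended (hypothetical helper)."""
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            return "connected"
    except OSError as exc:
        return classify_connect_error(exc)
```

A client that sees "unreachable" can fail over immediately; one that only ever sees "timeout" pays the full deadline on every attempt, which matches the behaviour the quoted statement describes.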


Sorry to mention my favorite hardware vendor again, but that is what I
liked about using F5 BIG-IP devices as load balancers: they did Layer 7
URL checks to see whether the service was really responding (instead of just
pinging or opening a connection to the IP).
We ran tests that performed a complete LDAP-over-SSL query to verify that a
directory server could actually look up a person. If it failed to answer
within a set time frame, it was taken out of rotation.

I do not know whether the LDAP check was ever implemented in production, but
we did verify that it worked.
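
The check described above (run a real application-level query, and pull the server from rotation if it fails or answers too slowly) can be sketched in Python roughly as follows. This is not how BIG-IP implements its monitors; the probe is a hypothetical callable standing in for the LDAP lookup, and the host names and timeout values are made up for the example.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List

# Hypothetical probe type: takes a host name and returns True only if a real
# application-level lookup succeeded (for example an LDAP search over SSL).
Probe = Callable[[str], bool]


def is_healthy(probe: Probe, host: str, timeout_s: float) -> bool:
    """Run one probe; a slow answer, an error, or a False all mean unhealthy."""
    executor = ThreadPoolExecutor(max_workers=1)
    future = executor.submit(probe, host)
    try:
        return bool(future.result(timeout=timeout_s))
    except Exception:
        # Covers both a probe that raised and a probe that missed the deadline.
        return False
    finally:
        # Do not block on a probe that is still hanging.
        executor.shutdown(wait=False)


def in_rotation(backends: List[str], probe: Probe, timeout_s: float) -> List[str]:
    """Return only the backends that passed the application-level check."""
    return [host for host in backends if is_healthy(probe, host, timeout_s)]


def fake_probe(host: str) -> bool:
    """Stand-in for a real directory lookup, for demonstration only."""
    if host == "ldap-slow":
        time.sleep(0.6)          # answers, but far too late
        return True
    if host == "ldap-broken":
        raise ConnectionError("bind failed")
    return host == "ldap-ok"     # "ldap-empty" answers but finds nobody


pool = ["ldap-ok", "ldap-slow", "ldap-broken", "ldap-empty"]
print(in_rotation(pool, fake_probe, timeout_s=0.2))
```

With these made-up hosts, only "ldap-ok" should stay in rotation: a server that accepts connections but answers late, errors out, or returns no result is treated the same as one that is down.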

On the "software in the hardware can fail" point, my only defense is to
test the watcher devices redundantly and to have enough of them to vote
misbehaving ones out of service. Oh, and it is best if the global load
balancing hardware/software is located somewhere other than the data
centers being monitored.
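
One way to picture the voting idea is below, as a rough Python sketch. It assumes each watcher reports up/down for every backend it monitors, takes the majority view as the truth, and flags any watcher that disagrees with the majority too often. The data layout, the tie handling, and the 50% threshold are all assumptions for illustration.

```python
from typing import Dict, List

# watcher name -> backend name -> "is this backend up?"
Reports = Dict[str, Dict[str, bool]]


def majority_verdicts(reports: Reports) -> Dict[str, bool]:
    """Decide each backend's state by strict majority of the watchers."""
    backends = sorted({b for seen in reports.values() for b in seen})
    verdicts = {}
    for backend in backends:
        votes = [seen[backend] for seen in reports.values() if backend in seen]
        # Assumption: a tie counts as "down", the conservative choice.
        verdicts[backend] = sum(votes) * 2 > len(votes)
    return verdicts


def misbehaving_watchers(reports: Reports, max_disagreement: float = 0.5) -> List[str]:
    """Flag watchers that contradict the majority on too many backends."""
    verdicts = majority_verdicts(reports)
    flagged = []
    for watcher, seen in sorted(reports.items()):
        if not seen:
            # Assumption: a watcher that reports nothing is itself suspect.
            flagged.append(watcher)
            continue
        wrong = sum(1 for backend, up in seen.items() if up != verdicts[backend])
        if wrong / len(seen) > max_disagreement:
            flagged.append(watcher)
    return flagged


reports = {
    "watcher-1": {"web-a": True, "web-b": False, "web-c": True},
    "watcher-2": {"web-a": True, "web-b": False, "web-c": True},
    "watcher-3": {"web-a": False, "web-b": True, "web-c": False},
}
print(majority_verdicts(reports))
print(misbehaving_watchers(reports))
```

Here watcher-3 contradicts the other two on every backend, so it is the one voted out. Note that this only works with at least three watchers; with two, a disagreement cannot be resolved by majority.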

-- 
steve pirk

