nanog mailing list archives

Re: Human Factors and Accident reduction/mitigation


From: Anton Kapela <tkapela () gmail com>
Date: Sun, 8 Nov 2009 11:26:40 -0500

Owen,

We could learn a lot about this from aviation.  Nowhere in human history
has more research, care, training, and discipline been applied to
accident prevention, mitigation, and analysis than in aviation.  A few
examples:

Others later in this thread duly noted the associated costs, which are
clearly "worth it" given the particular application of these methods
[snipped]. However, I assert this is warranted because of the specific
public trust that commercial aviation must be given. Additionally, this
form of professional or industry "standard" isn't unique; you can find
(albeit small) parallels in most states' PE certification tracks and the
like.

In the case of the big-I internet, I assert we can't (yet) successfully
argue that it's deserving of similar public trust. In short, I'm arguing
that the big-I internet deserves special-pleading status in these sorts
of "instrument -> record -> improve" strawmen, and that we shouldn't
apply similar concepts or regulation.

(Robert B. then responded):

All,
The real problem is the same human factors that cause most accidents in
aviation. Look at the list below and replace the word "pilot" with
"network engineer," "support tech," "programmer," or whatever, and think
about all the problems where something didn't work out right. It's
because someone circumvented the rules, processes, and cross-checks put
in place to prevent the problem in the first place. Nothing can be made
idiot-proof because idiots are so creative.

I'd like to suggest we also swap "bug" for "software defect" or
"hardware defect" - perhaps if operators started talking about
problems the way engineers do, we'd get more global buy-in for a
process-based solution.

I certainly like the idea of improving the state of affairs where
possible - especially in the operator->device direction (i.e.,
fat-fingering an ACL, prefix list, community list, etc.). When people
make mistakes, it seems very wise to accurately record the entrance
criteria, the results of their actions, and ways to avoid a repeat -
and then share that with all operators (like at NANOG meetings!). The
part I don't like is being ultimately responsible for, or having to
"design around," a class of systemic problems which are entirely
outside an operator's sphere of control.
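The operator->device direction is at least amenable to mechanical
cross-checks of the kind Robert describes. A minimal sketch of the idea
(assuming a hypothetical "prefix le/ge" entry format for illustration,
not any particular vendor's syntax): lint a candidate prefix-list line
before it is ever committed, so the fat-finger is caught at the keyboard
rather than in the routing table.

```python
import ipaddress

def validate_prefix_entry(entry: str) -> list[str]:
    """Return a list of problems found in a candidate prefix-list entry.

    Hypothetical entry format: "<prefix>/<len> [le <n>] [ge <n>]",
    e.g. "203.0.113.0/24 le 28". An empty result means the entry passed.
    """
    problems: list[str] = []
    tokens = entry.split()
    if not tokens:
        return ["empty entry"]
    # Reject prefixes with host bits set (a classic fat-finger).
    try:
        net = ipaddress.ip_network(tokens[0], strict=True)
    except ValueError as exc:
        return [f"bad prefix {tokens[0]!r}: {exc}"]
    # Pair up any "le N" / "ge N" modifiers following the prefix.
    mods = dict(zip(tokens[1::2], tokens[2::2]))
    for key, val in mods.items():
        if key not in ("le", "ge"):
            problems.append(f"unknown modifier {key!r}")
            continue
        # Modifier length must lie between the prefix length and the
        # maximum length for the address family.
        if not val.isdigit() or not (net.prefixlen <= int(val) <= net.max_prefixlen):
            problems.append(f"{key} {val} out of range for /{net.prefixlen}")
    return problems
```

A pre-commit hook or config-push wrapper could refuse any change whose
entries return a non-empty problem list - recording both the rejected
input and the reason, which is exactly the "entrance criteria and
results" record worth sharing afterward.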

What curve must we shift to get routers with hardware and software
that's a) fast, b) reliable, and c) cheap -- in the hopes that the
only problems left to solve are indeed human ones?

-Tk
