nanog mailing list archives

Re: Ph.D. student looking for data on network failure causes


From: "David L. Oppenheimer" <davidopp () cs berkeley edu>
Date: Tue, 9 Apr 2002 15:20:05 -0700


By the way, just a clarification of my original message: the results of this
study will (eventually) be published in some academic forum. I'll post to
the NANOG mailing list a pointer to any results when they are available, so
there's no need to email me separately to indicate your interest in the
results. (Interest in providing data, on the other hand, would be most
welcome!)

Thanks,
David


Hello network operators! I'm a Ph.D. student at UC Berkeley working for
Dave Patterson on the ROC project, which is investigating techniques for
improving the availability and manageability of large-scale Internet
services and systems. I'm currently conducting a study of the root causes
(hardware, software, human, etc.) and durations of failures in such
systems. To do this, I have been examining the operations trouble ticket
databases from several large-scale Internet services (of the Hotmail, eBay,
Yahoo!, etc. type).

In doing this research, it has become apparent that for many services
(especially geographically distributed ones, e.g. those that use multiple
colocation facilities), a major cause of problems is failures of various
types in the Internet. Thus I've become interested in finding out the types
and root causes of problems in wide-area networks, e.g. within the kinds of
large-scale ASes that are administered by the folks on this list. I'm not
sure how your services track failures and problems; the problem tracking
databases at the services I've examined have been a great source of data
about problem scale, symptoms, root causes, durations, steps (and missteps!)
taken in diagnosing and fixing problems, etc.

I'm writing to the list because I'm very interested in working with network
operators to study the causes of failures in large networks. I realize this
type of data is very sensitive to your organizations. If you are interested
in the possibility of sharing data, I would be happy to talk offline about
how I've overcome the many objections raised by folks I have solicited for
data (protecting their customers' privacy, securing datasets when they are
not examined on the premises of the services, anonymizing and aggregating
data in reporting, etc.). I'm interested in the relative causes of failures,
*not* overall availability numbers. As a result of the precautions we've
taken, several household-name Internet services have allowed me to examine
and report on the problems their services have experienced.

If you're interested in discussing the possibility of sharing access to this
kind of data about your service, please contact me. I'm willing to examine
data on the premises of your service, to anonymize it fully, to submit any
results I want to publish to your organization prior to publication, to sign
any necessary NDAs, etc. In return, I'm happy to share with you any insights
I have about the problems your service experiences, and you'll contribute to
the world's knowledge of why bad things happen to good networks. :-)

If you're not the right person in your organization to contact with this
request, but you think your organization might be interested in
participating in this study, perhaps you could forward this email to the
appropriate person or let me know who the right person to contact in your
organization would be.


