nanog mailing list archives

Re: FYI - unproven technology


From: bmanning () ISI EDU
Date: Wed, 19 Oct 1994 15:31:27 -0700 (PDT)


Ken, I don't think it was Bob Metcalfe... Note this telling extract from
USENET of just over a year back.

------- Start of forwarded message -------
Newsgroups: comp.dcom.lans.ethernet
Path: cronkite.cisco.com!decwrl!parc!wirish
From: wirish () parc xerox com (Wes Irish)
Subject: Performance problems on high utilization Ethernets
Message-ID: <wirish.750731783@misty>
Summary: High utilization Ethernet performance problems traced to controller implementation bugs
Keywords: Ethernet, communications, interframe gap, IFG, collisions, controller, interface, packet loss, data link
Sender: news () parc xerox com
Organization: Xerox PARC
Date: 16 Oct 93 00:36:23 GMT
Lines: 115

For the past year or so I have been investigating performance problems
on the Ethernets here at PARC.  This work has uncovered problems with a
number of Ethernet controllers in common use today.  These low-level
controller problems can lead to serious performance problems for many of
the systems involved.  A full paper on this work, "Investigations into
observed performance problems on high utilization Ethernet networks",
will be released soon (initially as a PARC Blue & White report).  But,
since I have been giving talks on this work and news of it has begun to
hit the Internet, I feel that I should post a preliminary report in
order to reduce speculation and to make sure that the facts are
correctly stated.  Below is a short summary of some of the key facts and
issues.

The Ethernet specifications talk about making sure that transmitters
enforce a 9.6 microsecond gap (IFG) between frames (packets).  This is
straightforward in the case of a gap following a just completed good
packet.  But, gaps following collision events are less straightforward.
I do not want to debate the details of what is and is not "correct" in
this case -- that is a discussion for another time and place.  The
reality of the situation is that there are a number of controllers in
wide-spread use on networks today that do not interoperate very well in
the face of collisions.

In general, the problems arise when the gap following a collision is too
short for a particular implementation of a receiver.  In addition to
uncovering controllers that simply generate short IFGs I have also
uncovered a major implementation bug in a particular chip that injects
short signal bursts onto the network.  These bursts can damage the IFG
"enforced" by other machines.  Either way, the result is the same -- a
short IFG preceding a packet which can result in a missed packet.

It is important to note that when a controller misses a packet due to a
short IFG THE FACT THAT THE PACKET WAS MISSED IS NOT DETECTED NOR
REPORTED TO THE SYSTEM.  System and driver statistics will claim no
packets lost (unless some are lost for other reasons).  Even most
network analyzers are subject to the same undetected and therefore
unreported packet loss.  I have resorted to using a digital oscilloscope
to capture and analyze these events.

Let me emphasize that these problems are almost exclusively related to
dealing with collision events.  On a lightly loaded network, where
collisions are few and far between, these problems are virtually
non-existent.  But these problems do indeed come into play on moderate
to heavily loaded networks.  Based on my observations a VERY ROUGH
network load dividing line is about 25% load (using 0.1 or 1.0 sec
samples).

Here is an enumeration of some of the facts related to particular
controllers that I have uncovered so far.  There may be problems with
other controllers but they may not appear on the networks that I have
inspected.

Controller: Intel 82586
Commonly found in: SUN 3's and SUN 4's (ie interfaces), many other
machines
Problem: Can generate a short IFG following a collision
Cause: starts IFG timer on CS dropout

Controller: Intel 82596
Commonly found in: Network General Sniffer using Cogent interface card
Problem: Will not hear packet unless preceding IFG is 4.6 usec or larger

Controller: SEEQ 8003
Commonly found in: Cisco MEC and MCI interfaces, older SGI (Silicon
Graphics) including 4D/35 and Indigo (but not Indigo2)
Problem: Can generate a short IFG following a collision
Cause: Starts 9.6 usec timer at end of its own jam and not end of
collision
Problem: Generates 24 bit signal burst onto network following some
collisions.  This burst lands in the IFG following the collision and
will often result in two short IFGs resulting in other controllers
missing the packet.  NB: this can happen even if the chip has nothing to
transmit!

Controller: AMD 7990 "LANCE"
Commonly found in: SUN SPARCStation machines (SS-1, SS-1+, SS-2, SS-10,
...), many DEC machines, Cisco/SynOptics routers, Cisco IGS, many other
machines
Problem: Will not hear packet unless preceding IFG is 4.1 usec or larger
Cause: implementation state machine
Problem: many other problems including lock-up, transmit gaps greater
than 9.6 usec under load, etc.
Fix: A new version of the controller, the 79C90 CLANCE, fixes many of
these problems but is not in common use like the LANCE.

Interface chip: AT&T T7213
Commonly found in: SUN SPARCStation 10 and other newer SUN machines
Problem: Will hold the collision (and kill data) sent to the controller
chip across IFGs of roughly 1.0 usec or less.  It will also do this if a
"manchester coding violation" is detected in a packet -- a job that
should be left to the controller.
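
The gap figures above can be put on a common scale. The following is a minimal sketch, assuming 10 Mbit/s Ethernet, where one bit time is 0.1 usec (so the 9.6 usec IFG equals 96 bit times). The function names are illustrative only and do not come from the original report.

```python
# Assumption: 10 Mbit/s Ethernet, so one bit time lasts 100 ns (0.1 usec).
BIT_TIME_NS = 100


def usec_to_bit_times(usec):
    """Convert a duration in microseconds to 10 Mbit/s bit times."""
    # Work in whole nanoseconds to avoid floating-point drift.
    nanoseconds = round(usec * 1000)
    return nanoseconds / BIT_TIME_NS


def bit_times_to_usec(bits):
    """Convert a number of 10 Mbit/s bit times to microseconds."""
    return (bits * BIT_TIME_NS) / 1000


if __name__ == "__main__":
    # Figures quoted in the controller list above.
    print("Specified IFG (9.6 usec):     ", usec_to_bit_times(9.6), "bit times")
    print("82596 minimum IFG (4.6 usec): ", usec_to_bit_times(4.6), "bit times")
    print("LANCE minimum IFG (4.1 usec): ", usec_to_bit_times(4.1), "bit times")
    print("SEEQ 8003 burst (24 bits):    ", bit_times_to_usec(24), "usec")
```

Seen this way, the 24 bit burst occupies a quarter of a nominal 96 bit gap, which is consistent with the report that it can split one gap into two short ones.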


The result of all of these implementation details is that it is very
possible, even probable, to put together a network that results in
"undetected" packet loss.  Packet loss rates of even less than 1% can
result in performance hits as high as 80%, depending on a multitude of
factors including the protocols and implementations being used.  I have
clocked the potential packet drop rate at PARC due to these problems to
be in the 1% - 5% range at times.

I have been working with many of these vendors for a number of months
now in an attempt to get these various bugs fixed so that different
equipment interoperates properly.  Most of the vendors have been very
receptive to making things work now that they know there is a problem.
Some have already identified solutions while others are still working on
them.


Wesley Irish
Network Scientist
Xerox PARC
wirish () parc xerox com

[Please send any replies via e-mail as I do not normally read netnews]
------- End of forwarded message -------



------------------------------------------------------------------------

Curtis I'm not sure I understand your use of the term "unproven."

In LAN circles we've been discussing this exact same phenomenon for the
last 9 months (I raised it with Jessica as a potential explanation
of some of the problems we were seeing in our early testing).

Bob Metcalfe (coinventor of ethernet) discovered that some ethernet chip
sets were also violating the inter-packet gap spec. A particular problem
was that many of the devices used for sniffing themselves had the same
chip sets and simply couldn't see what was happening to the packet
stream (silent discards without errors signalled at the receiving end).

He needed very expensive signal analysis hardware before
the cause could be isolated.

Ken Latta, Merit Network, Inc.
NSFNET Project, Internet Engineering Group
1071 Beal, Ann Arbor, MI 48109-2103
313.936.2115 voice,  313.747.3745 fax
klatta () umich edu, USERLFQF@umichum.bitnet

From:    Curtis Villamizar <curtis () ans net>
To:      nanog () merit edu


FYI-

For those that don't appreciate the consequences of using unproven
technology.  The good news on Mae-East is packet loss is down to 15%
from 40%?  :-(

Congratulations to Sprint for picking a technology that is known to
work for the Sprint NAP.  FDDI works.  We'll see how the other NAPs
do, though I'm not encouraged by test results so far.

Curtis

BTW - this is Mae-East (the MFS bridged ethernet), not Mae-East+ (the
bridged FDDI).

------- Forwarded Message

From: Sean Doran <smd () sprint net>
Reply-To: smd () sprint net
To: mae-east () uunet uu net
Subject: Moderately urgent: getting rid of annoying packet losses
Date: Wed, 19 Oct 1994 02:07:06 -0400
Sender: smd () tiny sprintlink net


The Magnum boxes are *very* unhappy with inter-packet gaps of less
than about 23 microseconds, and drop back-to-back packets like
superheated rocks.

We have a kludge which will help until the MFS hardware gets fixed.

Those of you running one Cisco with EIP 10-0 microcode or better should
set the transmitter-delay of your MAE-EAST interface to 96 (0x60).
This will dramatically reduce the packet loss across MAE-EAST.

IMPORTANT: Those of you who have more than one box on your ethernet
drop to MFS will need to a/ acquire EIP 171-1 from Cisco and load
it in then b/ set the transmitter-delay of each of your MAE-EAST
interfaces to 0x360 (864).

The new microcode has apparently been well tested, and is doing the
right thing for icm-dc-1.icp.net and sl-dc-6.sprintlink.net (drops
to most of you have fallen from 40% to much less than 15%).  It
works by assigning new meanings to the upper 8 bits of the transmitter-
delay value; this particular setting will delay the transfer of
the packet to the datalink controller when there is traffic
on the wire, then require an additional quiet time of 30usec, 
after which there will be the standard 9.6 usec IEEE 802.3 delay.

(The original intent apparently was to avoid drops when bursting
ethernet traffic encounters collisions by backing off on handing
the packet to the datalink layer; the application here is not quite
exactly what was intended, but definitely helps us).
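
The numbers in the two settings can be checked with a short sketch. It assumes the transmitter-delay is a 16-bit value whose upper 8 bits carry the new meaning and whose lower 8 bits carry the plain delay. The message does not say how the upper bits are encoded, so the code only separates the two bytes and adds up the quiet times quoted above. All names are hypothetical.

```python
def split_transmitter_delay(value):
    """Split a transmitter-delay value into (upper 8 bits, lower 8 bits).

    Assumption: the value is a 16-bit field. The meaning of the upper
    byte is defined by the microcode and is not interpreted here.
    """
    upper = (value >> 8) & 0xFF
    lower = value & 0xFF
    return upper, lower


# Times are kept in tenths of a microsecond so the arithmetic stays exact.
EXTRA_QUIET_TENTHS = 300       # additional 30 usec quiet time (from the message)
STANDARD_IFG_TENTHS = 96       # standard 9.6 usec IEEE 802.3 delay
MAGNUM_MIN_GAP_TENTHS = 230    # "about 23 microseconds" (from the message)

if __name__ == "__main__":
    print(split_transmitter_delay(0x60))    # single-box setting, decimal 96
    print(split_transmitter_delay(0x360))   # multi-box setting, decimal 864
    total = EXTRA_QUIET_TENTHS + STANDARD_IFG_TENTHS
    print("quiet time before transmit:", total / 10, "usec")
    print("exceeds Magnum threshold:", total > MAGNUM_MIN_GAP_TENTHS)
```

Under that assumption both settings share the same lower byte (0x60, decimal 96), and the 0x360 setting differs only in the upper byte. The combined quiet time of 30 usec plus 9.6 usec is well above the roughly 23 usec gap the Magnum boxes are said to need.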

Each of your routers attached to MAE-EAST must run the new EIP 171-6
microcode and have the 0x360 transmitter-delay setting.

Thanks to Robert M. Broberg of Cisco for the code.

Those of you without Ciscos will have to come up with a similar hack 
somehow.

    Sean.

P.S.: We are *very* keen on PSI, NETCOM, and MCI to implement the
      change, especially PSI.  We aren't having problems with anyone
      else we exchange traffic with at MAE-EAST (other than Dante
      AS1133, but that's not a Cisco) but everyone would probably 
      benefit from the upgrade anyway.  Try pinging each of your peers
      in 192.41.177 a hundred or so times.

- - --
Sean Doran <smd () sprint net>  SprintLink/ICM engineering   +1 703 904 2089

------- End of Forwarded Message




-- 
--bill
- - - - - - - - - - - - - - - - -

