nanog mailing list archives

RE: tools and techniques to pinpoint and respond to loss on a path


From: Andy Litzinger <Andy.Litzinger () theplatform com>
Date: Tue, 16 Jul 2013 18:12:32 +0000

From: Blake Dunlap [mailto:ikiris () gmail com]
While any provider will attempt to fix peer / upstream issues as they can, any
SLA you would have is between two points on their private network, not
from point A to point Z that they have no control over across multiple peers
and the public internet itself.

makes sense- thanks for confirming

The much more common design is using a single
provider for each thread between sites. Then at least you have an end-to-
end SLA in effect, as well as a single entity that is responsible for the entire
link in question.

This sounds like you're trying to achieve private link IGP / FRR level site to site
failover/convergence across the public internet. Perhaps you should rethink
your goals here or your design?

Kind of. I can actually tolerate the blips, but I want to be able to measure and track
them in such a way that I know where the loss is occurring.  If a particular path
is reconverging more often than should be reasonably expected I want to be able to
prove it within reason.
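
One way to put numbers on "more often than should be reasonably expected" is to collapse a log of periodic probe results into discrete blackout events. The following is only a minimal sketch: the function name `find_outages`, the `(unix_time, ok)` sample format and the `min_lost` cutoff are assumptions made for illustration, not something taken from this thread.

```python
def find_outages(samples, min_lost=2):
    """Group consecutive lost probes into outage events.

    samples:  iterable of (unix_time, ok) tuples in time order
              (hypothetical log format, one entry per probe).
    min_lost: ignore runs shorter than this many lost probes, so a
              single dropped packet is not counted as a blackout.
    Returns a list of (first_lost_time, last_lost_time, lost_count).
    """
    outages = []
    start = None      # time of the first lost probe in the current run
    last_lost = None  # time of the most recent lost probe
    lost = 0          # number of lost probes in the current run

    for ts, ok in samples:
        if not ok:
            if start is None:
                start = ts
            last_lost = ts
            lost += 1
        else:
            # A successful probe ends the current run, if there is one.
            if start is not None and lost >= min_lost:
                outages.append((start, last_lost, lost))
            start, lost = None, 0

    # The log may end in the middle of an outage.
    if start is not None and lost >= min_lost:
        outages.append((start, last_lost, lost))
    return outages
```

Counting the returned events per week, per path, gives a reconvergence frequency that can be put in front of a provider or a customer.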

We also have a customer who happens to host at DC B with the same connectivity.
Every time there is one of these blips their alerting fires off a thousand messages
and they open a ticket with us.  I'd like to be able to show them some good data
on the path during the blip so we can back a discussion along the lines
of "live with it, or pay to privately connect to us".

-andy

-Blake

On Mon, Jul 15, 2013 at 4:18 PM, Andy Litzinger
<Andy.Litzinger () theplatform com> wrote:
Hi,

Does anyone have any recommendations on how to pinpoint and react to
packet loss across the internet, preferably in an automated fashion?  For
detection I'm currently looking at trying smoketrace to run from inside my
network, but I'd love to be able to run traceroutes from my edge routers
triggered during periods of loss.  I have Juniper MX80s on one end, where I'm
hopeful I'll be able to cobble together some combo of RPM and event
scripting to kick off a traceroute.  We have Cisco 4900Ms on the other end and
maybe the same thing is possible but I'm not so sure.
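
The same "trigger a traceroute on loss" idea can be sketched on an ordinary Linux host next to the routers. This is not the RPM / event-script approach mentioned above and does not run on the MX80 or the 4900M; it is a host-based stand-in. It assumes the Linux iputils `ping` (`-c`, `-W`) and the standard `traceroute` (`-n`, `-q`, `-w`) are installed. The names `LossTrigger`, `probe`, `trace` and `watch`, the threshold of 3 and the log file name are all hypothetical.

```python
import subprocess
import time
from datetime import datetime, timezone


class LossTrigger:
    """Fires once when `threshold` consecutive probes have failed."""

    def __init__(self, threshold=3):
        self.threshold = threshold
        self.failures = 0

    def update(self, probe_ok):
        """Record one probe result; return True when a trace should start."""
        if probe_ok:
            self.failures = 0
            return False
        self.failures += 1
        # Fire only on the probe that reaches the threshold, so one
        # blackout produces one trace rather than one per lost probe.
        return self.failures == self.threshold


def probe(host, timeout_s=1):
    """Send one ICMP echo with the system ping (Linux iputils flags assumed)."""
    result = subprocess.run(
        ["ping", "-c", "1", "-W", str(timeout_s), host],
        stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


def trace(host):
    """Run a numeric traceroute, one probe per hop, 1 s wait per probe."""
    result = subprocess.run(
        ["traceroute", "-n", "-q", "1", "-w", "1", host],
        capture_output=True, text=True,
    )
    return result.stdout


def watch(host, interval_s=1.0, log_path="loss-traces.log"):
    """Probe forever; append a timestamped traceroute when loss is detected."""
    trigger = LossTrigger(threshold=3)
    while True:
        if trigger.update(probe(host)):
            stamp = datetime.now(timezone.utc).isoformat()
            with open(log_path, "a") as log:
                log.write(f"--- {stamp} loss toward {host}\n{trace(host)}\n")
        time.sleep(interval_s)
```

Calling `watch()` with an address at the far data center starts the loop. With a one-second interval and a threshold of 3, the trace starts a few seconds into a blackout, which leaves most of a 30 to 60 second event to capture. Running it at both ends covers both directions of the path.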

I'd love to hear other suggestions and experience for detection and also for
options on what I might be able to do when loss is detected on a path.

In my specific situation I control equipment on both ends of the path that I
care about with details below.

We are a hosted service company and we currently have two data centers,
DC A and DC B.  DC A uses Juniper MX routers, advertises our own IP space
and takes full BGP feeds from two providers, ISPs A1 and A2.  At DC B we
have a smaller installation and instead take redundant drops (and IP space)
from a single provider, ISP B1, who then peers upstream with two providers,
B2 and B3.

We have a fairly consistent bi-directional stream of traffic between DC A and
DC B.  Both ISP A1 and A2 have good peering with ISP B2, so under normal
network conditions traffic flows across ISP B1 to B2 and then to either ISP A1
or A2.

Oversimplified ASCII pic showing only the normal best paths:

        |-- ISP A1 ---------------------- ISP B2 --|
DC A ---|                                          |--- ISP B1 ----- DC B
        |-- ISP A2 ---------------------- ISP B2 --|


With increasing frequency we've been experiencing packet loss along the
path from DC A to DC B.  Usually the periods of loss are brief, 30 seconds to a
minute, but they are total blackouts.

I'd like to be able to collect enough relevant data to pinpoint the trouble
spot as much as possible so I can take it to the ISPs and request a
solution.  The blackouts are so quick that it's impossible to log in and get a
trace, hence the desire to automate it.
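
Once traces are being captured automatically, a trace taken during a blackout can be compared against a known-good baseline to narrow down the trouble spot. The sketch below assumes the text output of Linux `traceroute -n -q 1` (one address or `*` per hop); the function names are hypothetical. A silent hop is only a hint: routers commonly rate-limit or deprioritize ICMP replies, and a forward trace says nothing about the return path, so traces from both ends are needed before pointing at a provider.

```python
def parse_hops(trace_text):
    """Parse `traceroute -n -q 1` output into a list of (ttl, address).

    address is None for hops that did not answer ("*").
    Lines that do not start with a hop number (the header) are skipped.
    """
    hops = []
    for line in trace_text.splitlines():
        fields = line.split()
        if len(fields) < 2 or not fields[0].isdigit():
            continue
        addr = None if fields[1] == "*" else fields[1]
        hops.append((int(fields[0]), addr))
    return hops


def last_responding_hop(hops):
    """Return the (ttl, address) of the deepest hop that answered, or None."""
    answered = [(ttl, addr) for ttl, addr in hops if addr is not None]
    return answered[-1] if answered else None


def first_divergence(baseline, during_loss):
    """Return (ttl, baseline_addr, loss_addr) for the first hop that differs.

    Compares hop by hop up to the length of the shorter trace; returns
    None if the common hops are identical.
    """
    for (ttl, good), (_, bad) in zip(baseline, during_loss):
        if good != bad:
            return ttl, good, bad
    return None
```

For example, if the baseline answers at hops 1 through 3 and the blackout trace answers only at hops 1 and 2, `first_divergence` reports hop 3 and `last_responding_hop` reports hop 2, which places the problem at or beyond the boundary between those two hops in the forward direction.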

I can provide more details off list if helpful- I'm trying not to vilify anyone-
especially without copious amounts of data points.

As a side question, what should my expectation be regarding packet loss
when sending packets from point A to point B across multiple providers
across the internet?  Is 30 seconds to a minute of blackout between two
destinations every couple of weeks par for the course?  My directly
connected ISPs offer me an SLA, but what should I reasonably expect from
them when one of their upstream peers (or a peer of their peers) has
issues?  If this turns out to be BGP reconvergence or similar do I have any
options?

many thanks,
-andy


