nanog mailing list archives

Re: Lossy cogent p2p experiences?


From: David Hubbard <dhubbard () dino hostasaurus com>
Date: Thu, 31 Aug 2023 21:42:07 +0000

That’s not what I’m trying to do; that’s just what I’m using during testing to demonstrate the loss to them.  The circuit is intended to bridge a number of networks with hundreds of flows, including inbound internet sources, but any new TCP flow is subject to numerous dropped packets at establishment and then ongoing loss every five to ten seconds.  The initial loss and the recurring bursts of loss shrink the TCP window so much that any single flow, between systems whose stacks can’t be tuned, ends up varying from 50 Mbit/sec to something far short of a gigabit.  The circuit was also fine for six months before this miserable behavior began in late June.
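For a rough sense of why this hurts so much at 52 ms, here’s a back-of-envelope using the Mathis et al. approximation for a loss-limited TCP flow (the loss rates below are assumed for illustration, not measured on the circuit):

  # Mathis et al. approximation for a single loss-limited TCP flow:
  #   throughput ~= (MSS / RTT) * (C / sqrt(p)),  C ~ sqrt(3/2)
  # The loss probabilities are assumed for illustration, not measured.
  from math import sqrt

  MSS = 1460 * 8      # bits per segment (standard 1500-byte MTU payload)
  RTT = 0.052         # seconds, the circuit's ~52 ms round trip
  C = 1.22            # ~sqrt(3/2)

  for p in (1e-3, 1e-4, 1e-5, 1e-6):
      bps = (MSS / RTT) * (C / sqrt(p))
      print(f"loss {p:.0e} -> ~{bps / 1e6:7.1f} Mbit/s per flow")

Even a one-in-a-million loss rate caps a standard-MSS flow below roughly 300 Mbit/s at this RTT, which is why the periodic bursts are so damaging.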


From: Eric Kuhnke <eric.kuhnke () gmail com>
Date: Thursday, August 31, 2023 at 4:51 PM
To: David Hubbard <dhubbard () dino hostasaurus com>
Cc: Nanog () nanog org <nanog () nanog org>
Subject: Re: Lossy cogent p2p experiences?
Cogent has asked many people NOT to purchase their ethernet private circuit point-to-point service unless they can 
guarantee they won't move any single flow of greater than 2 Gbps. The service works fine as long as it carries 
mostly mixed IP traffic, like the aggregate of a bunch of random customers' flows.

What you are trying to do is probably against the guidelines their engineering group has given them for what they can 
sell now.

This is a known weird limitation with Cogent's private circuit service.

The best working theory several people I know in the neteng community have come up with is that Cogent does not 
want one customer's traffic to adversely impact all other customers on a router at sites where the upstreams and 
links to neighboring POPs are implemented as something like 4 x 10 Gbps, i.e. where they have not yet upgraded that 
specific router to a full 100 Gbps upstream. Because a single flow hashes onto just one member of such a bundle 
(see the sketch below), moving flows larger than 2 Gbps could flat-top the traffic chart on one of those 10 Gbps 
circuits.
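A minimal sketch of that pinning behavior (the hash fields and the 4 x 10G member count are assumptions for illustration, not Cogent's actual configuration):

  # Why one large flow stays on one LAG/ECMP member link: the per-flow hash is
  # constant, so every packet of that flow picks the same member.
  # Hash fields and member count are illustrative assumptions.
  import hashlib

  MEMBERS = 4  # e.g. a 4 x 10 Gbps bundle

  def member_for_flow(src_ip, dst_ip, src_port, dst_port, proto="tcp"):
      key = f"{src_ip}|{dst_ip}|{src_port}|{dst_port}|{proto}".encode()
      return int.from_bytes(hashlib.sha256(key).digest()[:4], "big") % MEMBERS

  # One elephant flow: every packet lands on the same member, capped at ~10 Gbps.
  print(member_for_flow("10.0.0.1", "10.0.1.1", 49152, 5201))

  # Many small flows: traffic spreads across all members.
  print([member_for_flow("10.0.0.1", "10.0.1.1", p, 5201) for p in range(49152, 49168)])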



On Thu, Aug 31, 2023 at 10:04 AM David Hubbard <dhubbard () dino hostasaurus com> wrote:
Hi all, curious if anyone who has used Cogent as a point-to-point provider has gone through packet loss issues with 
them and was able to successfully resolve them?  I’ve got a non-rate-limited 10gig circuit between two geographic 
locations with about 52 ms of latency.  Mine is set up to support both jumbo frames and VLAN tagging.  I do know 
Cogent packetizes these circuits, so they’re not like waves, and that expected single-session TCP performance may be 
limited to a few gbit/sec, but I should otherwise be able to fully utilize the circuit given enough flows.
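For context on that single-flow expectation, the bandwidth-delay product at this latency is what matters; a quick calculation (the receive-window sizes are assumed typical values, for illustration):

  # Bandwidth-delay product at ~52 ms: how much window a single TCP flow needs.
  # The example window sizes are assumed typical values, not measurements.
  RTT = 0.052                      # seconds
  LINE_RATE = 10e9                 # bits/s, the 10 Gbps circuit

  bdp_bytes = LINE_RATE * RTT / 8
  print(f"BDP: {bdp_bytes / 1e6:.0f} MB in flight to fill 10 Gbps at 52 ms")

  # Conversely, achievable rate for a given (untuned) receive window:
  for window in (64 * 1024, 4 * 1024 * 1024, 16 * 1024 * 1024):   # bytes
      rate = window * 8 / RTT
      print(f"window {window // 1024:>6} KiB -> ~{rate / 1e6:7.1f} Mbit/s")

Filling 10 Gbps at 52 ms needs roughly 65 MB in flight, so even a host with a 16 MiB window tops out around 2.5 Gbit/s with zero loss.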

The circuit went live earlier this year and had zero issues at first.  Testing with common tools like iperf would pass 
several gbit/sec of TCP traffic on a single flow, even without an optimized TCP stack.  Using parallel flows or UDP we 
could easily get close to wire speed.  Starting about ten weeks ago we had a significant slowdown, and in some cases 
complete failure, of bursty data replication tasks between equipment using this circuit.  Rounds of testing demonstrate 
that new flows often experience significant initial packet loss of several thousand packets, and then ongoing lesser 
packet loss every five to ten seconds after that.  There are times we can’t do better than 50 Mbit/sec, and it’s rare 
to achieve even a gigabit unless we run a bunch of streams with a lot of tuning.  With UDP we also see the loss, but 
can still push many gigabits through with one sender, or wire speed with several nodes.

For equipment that doesn’t have a tunable TCP stack, such as storage arrays or VMware hosts, the retransmits completely 
ruin performance or result in ongoing failures we can’t overcome.

Cogent support has been about as bad as you can get: everything looks great on their end, clean your fiber, iperf isn’t 
a good test, install a physical loop (oh wait, we don’t want that, go pull it back off), with new updates arriving at 
three- to seven-day intervals, etc.  If the performance had never been good to begin with I’d have just attributed this 
to their circuits, but since it worked until late June, I know something has changed.  I’m hoping someone else has run 
into this and maybe knows of some hints I could give them to investigate.  To me it sounds like there’s a rate limiter / 
policer defined somewhere in the circuit, or an overloaded interface or device we’re forced to traverse, but they 
assure me this is not the case and claim to have destroyed and rebuilt the logical circuit.
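What we see looks a lot like a shallow token-bucket policer: a burst above the committed rate drains the bucket and then gets trimmed at line rate, which shows up as loss every few seconds.  A toy model of that pattern (the committed rate and bucket depth are pure guesses for illustration, not anything Cogent has confirmed):

  # Toy token-bucket policer: bursts above the committed rate get dropped once
  # the shallow bucket empties, which looks like periodic loss on a graph.
  # The committed rate and bucket depth are guesses for illustration only.
  CIR = 2e9            # committed information rate, bits/s (assumed)
  BUCKET = 10e6        # bucket depth, bits (assumed shallow burst allowance)
  SEND_RATE = 9e9      # offered load during a burst, bits/s
  TICK = 0.001         # simulation step, seconds

  tokens = BUCKET
  dropped = sent = 0.0
  for _ in range(200):                     # simulate a 200 ms burst in 1 ms steps
      tokens = min(BUCKET, tokens + CIR * TICK)
      offered = SEND_RATE * TICK
      if offered <= tokens:
          tokens -= offered
      else:
          dropped += offered - tokens
          tokens = 0.0
      sent += offered

  print(f"dropped {100 * dropped / sent:.1f}% of a 9 Gbps burst against a 2 Gbps policer")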

Thanks!
