nanog mailing list archives

RE: Telia Not Withdrawing v6 Routes


From: <adamv0025 () netconsultings com>
Date: Wed, 18 Nov 2020 12:58:32 -0000

Saku Ytti
Sent: Tuesday, November 17, 2020 6:55 AM

On Tue, 17 Nov 2020 at 03:40, Sabri Berisha <sabri () cluecentral net> wrote:

Hey Sabri,

Also, in the case that I described it wasn't a Junos device. Makes me
wonder how bugs like that get introduced. One would expect that after
20+ years of writing BGP code, handling a withdrawal would be easy-peasy.

I don't think this is related to skill, or that there was some hard programming
problem that DE couldn't solve. These are honest mistakes.
In my tenure I've not seen the frequency of these bugs change at all;
NOS bugs are as common now as they were in the 90s.

I put most of the blame on the market. We've modelled the commercial router
market so that a poor quality NOS is good for business and a good quality NOS is
bad for business. I don't think this is in anyone's formal business plan, or that
companies even realise they are not trying to make a good NOS. I think it's
emergent behaviour due to the market, and people follow that market demand
unknowingly.
If we suddenly had one commercial NOS which was 100% bug free, many of its
customers would stop buying support and would rely on spare HW and Internet
forums for configuration help. A lot of us only need contracts to deal with the novel
bugs all of us find on a regular basis, so a good NOS would immediately reduce
revenue. For some reason Windows, macOS or Linux almost never have novel
bugs that the end user finds, and when those are found, it's big news. Meanwhile we
don't go a month without hitting a novel bug in one of our NOS, and no one
cares about it; it's business as usual.

I also put a lot of blame on C. It was a terrific language when compiling had to
be fast; it is basically a macro assembler. Now the utility of being 'close to HW' is
gone, as the CPU does so much that the C compiler has no control over; it's not really
even executing the same code as written anymore. MSFT estimated that around 70% of
their security vulnerabilities are related to memory safety. We could accomplish significant
improvements in software quality if we'd ditch C, allow the computer to do
more formal correctness checks at compile time, and design languages which
lend themselves to this.


We constantly misattribute problems (like in this post) to config or HW, while
the most common reasons for outages are pilot error and SW defects, and very little
engineering time is spent on those. And often the time spent improving the first
two increases the risk of the latter two, reducing mean availability over time.

I agree with everything but the last statement.
From my experience, most SPs spend considerable time testing for SW defects in the features (and combinations of
features) that will be used, at the intended scale; that's how you identify most of the bugs. What you're left with
afterwards are special packets of death or some slow memory leaks (basically the more exotic stuff).
 
adam
 

