nanog mailing list archives

What is going on with BGP


From: Ignas Bagdonas <ibagdona.nog () gmail com>
Date: Tue, 13 Jun 2023 02:42:47 +0100

A brief overview of developments happening in the IETF working groups
related to BGP evolution. The view is current as of mid-2023, in the
timeframe between IETF meetings 116 and 117, and looking back several years
to cover the recently published documents. The overview is given from the
perspective of development of the protocol mechanics and recommended
operational considerations, and is not directly related to specific
implementation aspects of specific platforms – for that you would need to
consult your vendors’ documentation. Not all of the functionality described
here is expected to be universally productized, and vendors will implement
their own deviations and extensions as the market requires. This is not an
end-to-end overview of BGP; it focuses on specific protocol changes and
assumes that the reader has a sufficient understanding of the foundations
of BGP and its supporting machinery. It is a high-level overview that does
not go deep into the specifics; pointers to documents are provided for a
more detailed view of the topics under discussion. This part covers the
core protocol and mechanisms specific to the IPv4 unicast and IPv6 unicast
AFs.



Deprecation of AS path aggregation sets
(draft-ietf-idr-deprecate-as-set-confed-set). When aggregating multiple
prefixes with different ASNs into a shorter covering prefix, besides other
aggregation related path attributes, the ASN identifiers of component
prefixes are contained in an unordered structure which has a different type
than the ordered ASN sequence of the AS path attribute. The need for such a
structure is to avoid possible propagation loops due to missing information
on which ASNs the update has traversed previously. However such an approach
obfuscates the real origin and prefix lengths of component prefixes and as
a result is directly incompatible with the developments in global routing
security mechanisms. It also adds to possible attribute packing conflicts
caused by differing interpretations of what an unordered set of ASNs in
fact means. Therefore aggregation that generates ASN sets (AS_SET and
AS_CONFED_SET segments in the AS path attribute) is deprecated and should
not be used. Receipt of an update carrying such segments should be treated
as a withdraw, as for a recoverable error (RFC7606), and announcements
carrying ASN set segments must not be advertised. This does not deprecate
the aggregation of component
prefixes as such, but only the generation of ASN set segments. Both
aggregation within an AS and proxy aggregation can be deployed as designed,
and the fact of aggregation is indicated by the AGGREGATOR path attribute
carrying the ASN and RID of the node that performed the aggregation – same
as before, just omitting the addition of ASN set segment. The loop
avoidance is ensured by controlling the advertisement of the resulting
covering aggregate prefix – it must not be advertised to any of the origins
of component prefixes.

Overall this is not a new concept: RFC6472 recommended against advertising
ASN sets but did not change the behaviour of the receiving side. The number
of ASN sets seen in the global routing system is small enough (such cases
do exist, but they are a clear exception or simply a failure to clean up
the configuration) to justify a stricter set of rules that removes the
ambiguity in interpreting prefix origin.
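
The receive-side rule described above can be sketched in a few lines. This
is a minimal illustration and not taken from any implementation: the AS path
is assumed to be already parsed into (segment type, ASN list) pairs, and the
function names are hypothetical. The segment type codes are the ones
assigned in RFC 4271 and RFC 5065.

```python
from typing import List, Tuple

# AS path segment type codes (RFC 4271, RFC 5065).
AS_SET = 1
AS_SEQUENCE = 2
AS_CONFED_SEQUENCE = 3
AS_CONFED_SET = 4

# Hypothetical in-memory form of a parsed AS path.
Segment = Tuple[int, List[int]]


def has_set_segment(as_path: List[Segment]) -> bool:
    """Return True if any segment is an AS_SET or AS_CONFED_SET."""
    return any(seg_type in (AS_SET, AS_CONFED_SET) for seg_type, _ in as_path)


def receive_action(as_path: List[Segment]) -> str:
    """Illustrative receive-side decision for an update."""
    return "treat-as-withdraw" if has_set_segment(as_path) else "accept"


plain = [(AS_SEQUENCE, [64500, 64501])]
aggregated = [(AS_SEQUENCE, [64500]), (AS_SET, [64510, 64511])]
print(receive_action(plain))       # accept
print(receive_action(aggregated))  # treat-as-withdraw
```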



Extended messages (RFC 8654). BGP message size is limited to 4096 octets
(PDU size should not be confused with link and packet layer MTU sizes and
transport window size), and that might not be enough in some cases. BGP
PDUs do not have any mechanism for fragmentation, and therefore a set of
path attributes that does not fit into the message cannot be advertised at
all. In addition, attribute packing is an efficient way of speeding up the
convergence of BGP, and the PDU size limitation puts an upper bound on the
balance of how many NLRIs can be advertised together with the path
attributes. New address families may carry larger NLRI elements and contain
more or larger attributes, and therefore the 4K octet limit may not be
enough. BGP message encoding allows for a larger size (the length field is
two octets); 4096 is a historical limit that is now being lifted. The
extended messages mechanism defines a new capability that needs to be
configured and exchanged between the peers, and if both sides agree, they
can use messages up to 65535 octets for BGP signalling. The Open message is
excluded for backwards compatibility reasons, and a large keepalive message
just does not make practical sense. Update, notification, refresh
signalling, and potentially other newly defined messages may use this
mechanism for exchanging both larger objects and larger numbers of objects.
Of particular note is the handling of the notification message: since the
overall message may now be larger, so may the notification data, but the
message as a whole still cannot exceed the negotiated limit of 4096 or
65535 octets.
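
The size rules can be written down as a small calculation. This is a sketch
under stated assumptions: the function names are made up for illustration,
the message type codes and the 19-octet header are those of RFC 4271, and
the notification body is taken to be one octet of error code, one octet of
subcode, and the data.

```python
HEADER_LEN = 19        # marker (16) + length (2) + type (1)
CLASSIC_MAX = 4096
EXTENDED_MAX = 65535

# BGP message type codes.
OPEN, UPDATE, NOTIFICATION, KEEPALIVE, ROUTE_REFRESH = 1, 2, 3, 4, 5


def max_message_len(msg_type: int, extended_negotiated: bool) -> int:
    """Largest allowed message of the given type on this session."""
    if msg_type in (OPEN, KEEPALIVE):
        # These two never grow, whatever was negotiated.
        return CLASSIC_MAX
    return EXTENDED_MAX if extended_negotiated else CLASSIC_MAX


def max_notification_data(extended_negotiated: bool) -> int:
    """Room left for notification data after header, code and subcode."""
    return max_message_len(NOTIFICATION, extended_negotiated) - HEADER_LEN - 2


print(max_message_len(UPDATE, True))   # 65535
print(max_message_len(OPEN, True))     # 4096
print(max_notification_data(False))    # 4075
print(max_notification_data(True))     # 65514
```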



Extended optional parameters (RFC9072). BGP Open message carries
capabilities – a set of parameters that are exchanged between peers for
finding out common operational modes and their corresponding parameters.
The encoding used historically had a single octet length field for both
total length of all parameters and for a length of each individual
parameter, and therefore practically limited the amount of capabilities and
their parameters that can be exchanged. Given the trends in increasing the
amount of address families and their configuration parameters, the
usability features carrying human readable information such as host names
and version information, and supported BGP feature indicators, a mechanism
that would allow for a larger capability size is needed. The approach of
this extension is to reserve an optional parameter type code (255) that
acts as an indicator that a new container format is used for encoding
individual optional parameters (a BGP capability is a type of optional
parameter). No changes are made to the actual optional parameters carried;
it is just a container that is larger. This change is unambiguous to a
speaker that does not understand the new encoding: it will result in an
error indicating the presence of an unsupported optional parameter. Since
the Open message itself remains limited to 4096 octets, this mechanism
allows for slightly under 4K octets of usable space for optional open
parameters, which should be enough for everyone.
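
A rough sketch of the two container layouts follows. It assumes the RFC
9072 layout as I understand it: in the extended form the classic length
octet is set to 255, the next octet carries the reserved type code 255, a
two-octet total length follows, and every parameter then uses a two-octet
length. The function name and the (type, value) input format are
hypothetical, and the classic form is used whenever everything still fits.

```python
import struct
from typing import List, Tuple

EXTENDED_MARKER = 255  # reserved type code marking the extended container


def encode_optional_parameters(params: List[Tuple[int, bytes]]) -> bytes:
    """Encode the tail of an OPEN message, starting at the length octet."""
    fits_classic = (all(len(v) <= 255 for _, v in params)
                    and sum(2 + len(v) for _, v in params) <= 255)
    if fits_classic:
        # Classic form: one-octet total length, one-octet parameter lengths.
        body = b"".join(struct.pack("!BB", t, len(v)) + v for t, v in params)
        return struct.pack("!B", len(body)) + body
    # Extended form: marker octets, two-octet total and parameter lengths.
    body = b"".join(struct.pack("!BH", t, len(v)) + v for t, v in params)
    return struct.pack("!BBH", 255, EXTENDED_MARKER, len(body)) + body


small = encode_optional_parameters([(2, b"\x01\x04\x00\x01\x00\x01")])
large = encode_optional_parameters([(2, b"\x00" * 300)])
print(small.hex())      # 0802060104000100 01 without the space
print(large[:7].hex())  # ffff012f02012c
```

The container itself does not lift the 4096-octet limit on the Open
message, so a real encoder would also need to check the total size.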



Dynamic capabilities (draft-ietf-idr-dynamic-cap). Capabilities provide an
ability for parametrizing a set of BGP operational parameters during
initial session startup. Once negotiated, capabilities and their
corresponding parameters stay constant for the duration of the BGP session
without any protocol-level ability to adjust them if needed. Dynamic
capabilities introduce a two-way capability parameter synchronization
mechanism without requiring a session bringdown. This is implemented by
means of a new protocol message that carries a list of potentially
negotiable capabilities and a request-response type of negotiation between
the peers. While this mechanism by itself allows for renegotiation of any
BGP capabilities after the initial session establishment, not all of such
negotiations would be practical or even technically feasible. Enabling a
new address family without bringing down an already established session
might be both practical and easy to implement, while disabling an already
established address family may result in logical dependency conflicts that
would render the remaining address families unusable. Changing already
negotiated timer parameters is easy, while enabling functionality such as
additional paths may not be technically feasible. These are all
implementation-dependent aspects and would limit the practical breadth of
applicability of dynamic capabilities.



Send hold timer (draft-ietf-idr-bgp-sendholdtimer). BGP transport session
termination results in withdrawal of all received updates – which is a
practical way of clearing out the state that has become stale. However,
transport session liveness is controlled by the operation of the receive
side of the connection: a timeout in receiving any message from the remote
peer is treated as a transport failure. This is not the case for the
transmit side of the connection though, and a scenario where a remote peer
happily sends out periodic keepalives but fails to process any incoming
messages from the local peer would result in potentially stale received
information being kept on the local peer. The send hold timer tracks the
local transport endpoint activity on the transmit side, and if no locally
generated messages can be sent for a timer interval, the local peer will
initiate a session teardown and clear out all the state received from the
remote peer.
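
The timer logic itself is small. The sketch below is only an illustration:
the class and method names are hypothetical, time is passed in explicitly
to keep the example deterministic, and the 480 second interval is an
arbitrary value chosen for the example rather than one taken from the
draft.

```python
class SendHoldTimer:
    """Tracks when the local speaker last managed to write to the transport."""

    def __init__(self, interval: float, now: float) -> None:
        self.interval = interval
        self.last_send = now

    def on_send_progress(self, now: float) -> None:
        """Call whenever a message is successfully handed to the transport."""
        self.last_send = now

    def expired(self, now: float) -> bool:
        """True when nothing could be sent for the whole interval."""
        return now - self.last_send >= self.interval


timer = SendHoldTimer(interval=480.0, now=0.0)
timer.on_send_progress(now=100.0)
print(timer.expired(now=300.0))  # False
print(timer.expired(now=600.0))  # True, the session would be torn down
```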



Optimal route reflection (RFC 9107). Reflectors are a universally used
mechanism for reducing the amount of state to be transferred throughout the
AS, and both their usage patterns and their limitations are well
understood. This proposal addresses one of the limitations: the path
selected by the reflector is not necessarily the one that the reflector
client would have selected itself. If BGP and forwarding topologies are not
congruent (which is the case in many reflector deployments), the path
selection will be influenced by the metrics relevant to and observed by the
reflector itself and not by the reflector client. Given that the number of
paths to select from is limited by the reflector, a client has no means of
choosing what would have been a more optimal path from its own perspective.
From the perspective of
a reflector the differences would lie in the per-client path selection
(which may be different due to different policies being used for different
clients), and interpretation of IGP metric from the perspective of a
particular client and not of a reflector itself. There are no changes
required to be performed on the client side for this mechanism to work,
while a reflector would need to have a more detailed visibility into the
IGP topology for deriving a proper next hop cost for a prefix from a
perspective of a reflector client. In addition a path selection process on
a reflector would need to be performed individually per client or per group
of clients sharing the same reflection policy.
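
The effect of the viewpoint on the IGP metric step can be shown on a toy
topology. This is a sketch only: the topology, node names and metrics are
invented, the function names are hypothetical, and the whole of best path
selection is reduced to the single "lowest IGP cost to next hop" step.

```python
import heapq
from typing import Dict, List

Graph = Dict[str, Dict[str, int]]


def igp_costs(graph: Graph, source: str) -> Dict[str, int]:
    """Dijkstra shortest-path costs from source to every reachable node."""
    costs = {source: 0}
    queue = [(0, source)]
    while queue:
        cost, node = heapq.heappop(queue)
        if cost > costs.get(node, float("inf")):
            continue  # stale queue entry
        for neighbour, metric in graph[node].items():
            new_cost = cost + metric
            if new_cost < costs.get(neighbour, float("inf")):
                costs[neighbour] = new_cost
                heapq.heappush(queue, (new_cost, neighbour))
    return costs


def best_next_hop(graph: Graph, viewpoint: str, next_hops: List[str]) -> str:
    """Pick the next hop with the lowest IGP cost as seen from viewpoint."""
    costs = igp_costs(graph, viewpoint)
    return min(next_hops, key=lambda nh: costs[nh])


# RR is the reflector, C a client, E1 and E2 two exit routers.
graph = {
    "RR": {"E1": 1, "E2": 10, "C": 10},
    "E1": {"RR": 1},
    "E2": {"RR": 10, "C": 1},
    "C": {"RR": 10, "E2": 1},
}
print(best_next_hop(graph, "RR", ["E1", "E2"]))  # E1, the reflector's view
print(best_next_hop(graph, "C", ["E1", "E2"]))   # E2, the client's view
```

A reflector doing the selection from its own position would hand E1 to the
client, while running the same step from the client's position gives E2.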



Wide BGP communities (draft-ietf-idr-wide-bgp-communities). Standard
communities, extended communities, large communities – haven’t we got
enough of different flavours of communities for expressing the policy
constructs? It appears that we in fact haven’t. The limiting factor appears
to be the ability to express actions and parameters for those actions in a
reasonably scalable and extensible way. Standard and large communities form
a functionally equivalent pair, with large communities primarily addressing
the 32 bit ASN clean signalling; extended communities mostly deal with
address families other than global unicast and also lack sufficient
flexibility for carrying information elements that have semantic
interpretation differing from a single plain 32 bit field.

The underlying base for wide communities is a new BGP path attribute that
acts as a container for community sub-objects, and wide communities are the
first actual user of this container infrastructure. Container attribute
itself is optional transitive, and individual sub-objects may have a finer
level of transitivity control, thus allowing for more controllable
attribute propagation.

The overall format of wide communities is <Community>:<Source ASN>:<Target
ASN>:<variable length parameters>, with Community being the actual
community value, Source ASN indicating the AS that is originating this
community, Target ASN indicating the AS that is supposed to interpret and
react on the community, and then followed by a variable length and format
set of parameters that are interpreted in the context of a community and
target ASN namespaces. Wide communities also define a parameterized
matching mechanism for indicating whether the community should or should
not be acted upon based on a set of criteria matching or not matching upon
the specific parameters. The actual parameters used for wide community
policy include ASN lists, IPv4 and IPv6 prefix lists, uint32 and IEEE 754
fp32 number lists, neighbour class list, a UTF-8 string, and also a
user-defined binary object having a free interpretation.



Extended admin shutdown communication (RFC 9003). RFC8203 defined a
mechanism for sending a human-readable message with several Notification
Cease subcodes, as a form of “in-band” message channel for BGP session
shutdown. The initial message length limit of 128 octets proved too
restrictive in the context of multibyte encodings, therefore this extension
raises the limit to 255 octets (but not necessarily characters!). The
message must be carried in a valid UTF-8 encoding, and it is not for the
receiving router to try to make sense of it; it is to be presented to the
operator as is.
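
The difference between octets and characters matters when a message has to
be cut to fit. A minimal sketch, assuming the data field is one length
octet followed by the UTF-8 text; the function name is hypothetical.

```python
MAX_OCTETS = 255


def encode_shutdown_communication(message: str) -> bytes:
    """Return length octet + UTF-8 text, cut on a character boundary."""
    raw = message.encode("utf-8")[:MAX_OCTETS]
    # A byte-level cut may split a multibyte character; dropping the
    # partial trailing sequence keeps the result valid UTF-8.
    safe = raw.decode("utf-8", errors="ignore").encode("utf-8")
    return bytes([len(safe)]) + safe


short = encode_shutdown_communication("maintenance")
long_ = encode_shutdown_communication("é" * 200)  # 400 octets of UTF-8
print(short)     # b'\x0bmaintenance'
print(long_[0])  # 254, i.e. 127 two-octet characters
```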



Cease notification due to BFD session going down (RFC9384). BFD is a
universally deployed mechanism for tracking data plane liveness, and BFD
session state can also be propagated into control plane protocols for
speeding up their reaction to data plane failures. BGP is not an exception
to this mechanism, and it just works. When a BGP session goes down for some
reason, the peer sends out a Notification message with a Cease error code
and a subcode that provides some information on why the session is going
down. This extension defines a subcode indicating that the session was
brought down because the underlying BFD session went down. It may well be
the case that a remote peer will not be able to receive this notification,
but the local peer will still record an indication that the session was
brought down due to BFD failure.



BGP graceful restart for notification messages (RFC 8538). BGP graceful
restart (RFC4724) allows for retaining the forwarding state while the
actual BGP session is restarting – except for when the session was brought
down due to a notification. Having an ability to retain the state in case
of protocol errors that are recoverable in some way appears to be of value
for maintaining the stability and reducing the amount of state that needs
to be propagated around. The core of this extension is exactly this – for
some error scenarios GR retains the forwarding state while BGP sessions
recover after the error that resulted in notification being sent. Not all
errors can be treated this way – only those that are of a temporary nature,
such as a remote peer running out of resources or transport having
intermittent problems.



Long-lived BGP graceful restart (draft-ietf-idr-long-lived-gr). BGP
graceful restart functionality (RFC4724) has been around for a while and is
widely deployed. There is a limit on the time for which stale routing
information can be retained by the BGP speaker, and that limit has an
upper bound of 4095 seconds, for the simple reason that the field used for
signalling that value is 12 bits wide. While an interval of more than an
hour for a remote peer to come back up might seem reasonably long, there
are use cases where it would be beneficial to retain stale routing
information for a longer time, primarily to avoid propagating the
withdrawals for the prefixes affected by the GR, at the cost and risk of
potential blackholing of traffic destined to those prefixes. Extending
the timer value namespace is trivial, and that is precisely what is done by
this specification. Keeping stale routing information for extended periods
of time might be a bit more dangerous, therefore prefixes covered by the
long lived GR are also marked by a specific community value indicating that
they should be treated as a last resort for best path selection and also
can be propagated only to peers that also support long lived GR.
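
The arithmetic behind the two timers, and the "last resort" ranking, can be
illustrated as follows. Two details here do not come from the text above
and are taken from my reading of the long-lived GR specification: the
24-bit width of the long-lived stale time field and the LLGR_STALE
community value 65535:6. The route dictionaries and the ranking key are
hypothetical and collapse best path selection to AS path length only.

```python
GR_RESTART_TIME_BITS = 12
LLGR_STALE_TIME_BITS = 24   # assumption: field width from the LLGR draft

max_gr_seconds = 2 ** GR_RESTART_TIME_BITS - 1
max_llgr_seconds = 2 ** LLGR_STALE_TIME_BITS - 1
print(max_gr_seconds)                    # 4095, a little over 68 minutes
print(max_llgr_seconds)                  # 16777215
print(int(max_llgr_seconds / 86400))     # 194 whole days

LLGR_STALE = (65535, 6)     # assumption: well-known community value


def preference_key(route: dict) -> tuple:
    """Sort key: stale routes rank after every non-stale route."""
    return (LLGR_STALE in route["communities"], route["as_path_len"])


routes = [
    {"name": "stale-short", "communities": [LLGR_STALE], "as_path_len": 1},
    {"name": "fresh-long", "communities": [], "as_path_len": 5},
]
best = min(routes, key=preference_key)
print(best["name"])  # fresh-long
```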



Peer roles (RFC9234). BGP does not define a semantic relationship between
peers – that is just a session over which prefixes can be advertised. What
is the relationship of those peers and whether they in fact are allowed to
exchange any routes learned from various sources historically was not tied
to the actual BGP session and was left to the domain of policy
implementation. This extension defines a set of mechanisms for specifying
and validating the roles of BGP peers and the prefixes they are allowed to
announce. Peers have roles defined based on their type and place in the
network (a customer, a provider, a route server, a generic peer among the
others) which get exchanged during the session establishment. If roles
disagree the session is not allowed to come up in the first place; if roles
allow for session to be established then depending on the actual role
routes advertised get an additional path attribute that indicates the class
of an advertising peer. A similar mechanism for route leak detection can be
implemented by using community tagging, and while the overall logic is the
same as that of peer roles, the implementation is different: a community
needs to be acted upon by the policy, and if the policy is not configured
properly or simply does not exist, such a method will not work at all. A
dedicated path attribute can be advertised and acted upon by the
implementation independently of user-configurable policy, and is active all
the time regardless of the policy configuration. Policy can be applied on
top of what is allowed in based on the peer roles, but peer roles have
priority. This mechanism is implemented as a capability for session
establishment with a set of role pairs that are allowed, and also as an
additional path attribute that gets added to advertised routes and can be
acted upon by the policy of the receiving peer.
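
The role check at session establishment amounts to a lookup in a small
table of allowed pairs. The sketch below uses the role code points and
pairings from RFC 9234 as I understand them; the function name is
hypothetical, and the behaviour when a peer does not send a role at all is
left out.

```python
# Role values carried in the BGP Role capability (RFC 9234).
PROVIDER, RS, RS_CLIENT, CUSTOMER, PEER = 0, 1, 2, 3, 4

# (local role, remote role) pairs for which the session may come up.
ALLOWED_PAIRS = {
    (PROVIDER, CUSTOMER),
    (CUSTOMER, PROVIDER),
    (RS, RS_CLIENT),
    (RS_CLIENT, RS),
    (PEER, PEER),
}


def roles_compatible(local_role: int, remote_role: int) -> bool:
    """True if the two advertised roles form an allowed pair."""
    return (local_role, remote_role) in ALLOWED_PAIRS


print(roles_compatible(PROVIDER, CUSTOMER))  # True
print(roles_compatible(PROVIDER, PROVIDER))  # False, session is refused
print(roles_compatible(PEER, PEER))          # True
```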



Well-known large communities (draft-heitz-idr-wklc). Large communities
enjoy a wide deployment, and that same wide deployment starts to bring in
some requirements for an additional functionality, specifically limiting
propagation scope and avoiding conflicts in assigning function identifiers.
Large communities are a direct replacement of standard BGP communities and
as a result they do not have any structure of interpretation of the values
carried – it is just an AS number in the Global Administrator field. The
proposal is to reserve a set of ASNs for the purpose of being used as
well-known large community identifiers and also for control of propagation
scope, leaving a total of 10 octets for use within a context of a
particular WKLC. This appears to be quite controversial, both on the aspect
of reserving a significant portion of 32 bit ASN space, and also trying to
control the propagation of a transitive path attribute based on a value of
that same attribute. However, the functionality appears to be needed, and
it might be the right time to restart a discussion on defining an equivalent
of extended communities for 32 bit ASN clean operation.



AS path prepending (draft-ietf-grow-as-path-prepending). More does not
necessarily mean better. AS path prepending as a policy mechanism to
influence path selection is well known and is universally deployed.
Increasing the length of AS path by prepending ASN more than once decreases
the probability of such path being selected as best. The whole question is
how much to prepend in order to achieve the intended depreference of a
path, especially in the conditions of everyone else effectively doing the
same, and not to become too vulnerable to intentionally crafted shorter AS
paths. It is not uncommon to see paths with tens of prepends in the global
routing table, and they persist because AS path is a transitive attribute
and removing ASNs from the received path is strictly not allowed. This can
lead to a situation where any other AS path, including an intentionally
crafted one, is seen as more preferred due to having a shorter length, and
prepending yet more times only makes such types of attacks easier.
Signalling of the policy intent to the remote peers is
recommended to be implemented via specifically allocated communities or
other path attributes, and prepending should be used only when there are no
other alternatives available.



Maximum prefix limits (draft-sas-idr-maxprefix-inbound,
draft-sas-idr-maxprefix-outbound). The number of prefixes received by the
peer and advertised to other peers depends on the role and the place in the
network of a particular BGP speaker. Generally those numbers are quite
specifically bounded – a leaf site is not supposed to advertise a full
global table, and a peer in the exchange should not advertise substantially
more prefixes than it originates. A counter-based
mechanism for controlling the number of received prefixes is supported by
virtually all BGP implementations. There are some deficiencies though – it
may be that the large amount of prefixes received from the remote peer will
be rejected by the inbound policy, and while keeping such rejected prefixes
is a handy optimization for speeding up the convergence in case of inbound
policy changes, the storage and processing of such rejected prefixes does
not come for free. In addition, refresh signalling could be used for
requesting only a specific set of prefixes to be readvertised if needed.
Another aspect is the number of prefixes accepted by the inbound policy –
for various reasons, including operator errors, the number of such prefixes
may end up being larger than it should be. On the sending side, there is
also a need for a mechanism to limit the number of prefixes past the
outbound policy that will get advertised to a remote peer. This results in
three separate counters (inbound pre-policy, inbound post-policy, and
outbound post-policy) that together can control the tolerable number of
prefixes at various points of the BGP speaker processing. A twin set of
documents defines the recommended behaviour when a limit is exceeded:
session shutdown and notification of the remote side with an appropriate
cease error to indicate the specific reason.
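
The three counters can be modelled as three independent thresholds. This is
an illustrative sketch: the class, field and function names and the limit
values are invented, and what happens after a limit is exceeded (the
session shutdown and the notification) is not modelled.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class PrefixLimits:
    inbound_pre_policy: int    # everything received from the peer
    inbound_post_policy: int   # what survived the inbound policy
    outbound_post_policy: int  # what would be advertised to the peer


def exceeded(limits: PrefixLimits, received: int, accepted: int,
             advertised: int) -> List[str]:
    """Return the names of all limits that the current counts exceed."""
    checks = [
        ("inbound pre-policy", received, limits.inbound_pre_policy),
        ("inbound post-policy", accepted, limits.inbound_post_policy),
        ("outbound post-policy", advertised, limits.outbound_post_policy),
    ]
    return [name for name, count, limit in checks if count > limit]


limits = PrefixLimits(inbound_pre_policy=1000, inbound_post_policy=100,
                      outbound_post_policy=50)
print(exceeded(limits, received=900, accepted=80, advertised=40))
# []
print(exceeded(limits, received=5000, accepted=80, advertised=60))
# ['inbound pre-policy', 'outbound post-policy']
```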



BMP support for local RIB monitoring (RFC 9069, formerly
draft-ietf-grow-bmp-local-rib). BMP provides a view into input and output
BGP RIBs but lacks a mechanism to
transport all local RIB routes (all local routes meaning the full view of
the local RIB after the path selection process, not only the routes
imported into BGP context from other sources). Portions of such view can be
derived from the information available via input RIB monitoring (likely
requiring a coordinated monitoring of multiple nodes), however that may
result in a notable amount of data to be filtered through, would require
access to entities and state outside of BGP context, and may still fail to
provide the specific sequence of events that happened during topology
convergence. A dedicated mechanism for exporting information about all the
prefixes contained in the local RIB is defined by means of a new type of
BMP peer logically representing the contents of a local RIB. From the
perspective of BMP protocol processing there are no logical changes to the
usual operation of BMP route monitoring functionality – local RIB routes
would be represented as being received from an emulated peer bound to a
specific instance of a local RIB on a node.



More BMP TLVs (draft-ietf-grow-bmp-tlv). This is a framework document for
defining an extensibility mechanism for BMP route monitoring messages in
order to be able to carry additional and structured information within BMP.
Initially BMP defined a fixed packet format for route monitoring messages,
and deployment experience and evolving uses of BMP have indicated a need to
convey additional information elements that were not thought of at the
initial design time of BMP or may be specific to a particular
implementation. BMP has had a TLV-based encoding mechanism from the start
for most of its messages, but not for route monitoring. This simple
extension mechanism therefore allows TLV-based encoding to be used for all
BMP messages.



Autonomous System Provider Authorization (draft-ietf-sidrops-aspa-profile,
draft-ietf-sidrops-aspa-verification). Origin validation provides a
practical and reasonable level of verification of origination of prefixes,
but the propagation path of those prefixes once originated is difficult to
validate and protect from both unintended errors and malicious attacks. The
tree-like nature of BGP peerings may be used for building sets of adjacency
lists, treating one AS as a customer, and its peers as providers. This
provides a 1:n relationship between a particular AS and its peers, and a
distributed collection of such relationships forms a foundation for AS path
validation in terms of checking whether adjacent pairs of AS numbers in
fact have a customer-provider relationship. The ASPA object is yet another
object type in the RPKI infrastructure, and therefore the existing
mechanisms of RPKI used for origin validation would require only modest
extensions in order to distribute path validation information. From the
perspective of routing policy, the validation outcome is one of the already
familiar states of “valid”, “invalid”, and “unknown”, applied to the AS
path attribute contained in received announcements.
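
The pairwise check can be sketched as below. This is a heavy
simplification and should not be read as the algorithm from the
verification draft: it only covers a path that is expected to go strictly
"up" from customer to provider at every hop, it assumes prepends are
already collapsed, and the data structures and function names are
hypothetical.

```python
from typing import Dict, List, Set

# Hypothetical validated ASPA data: customer ASN -> set of provider ASNs.
AspaTable = Dict[int, Set[int]]


def hop_check(aspa: AspaTable, customer: int, provider: int) -> str:
    """Classify a single adjacent AS pair."""
    providers = aspa.get(customer)
    if providers is None:
        return "no attestation"
    return "provider" if provider in providers else "not provider"


def verify_upstream(aspa: AspaTable, as_path: List[int]) -> str:
    """as_path is ordered from the neighbour (first) to the origin (last)."""
    results = [hop_check(aspa, as_path[i + 1], as_path[i])
               for i in range(len(as_path) - 1)]
    if "not provider" in results:
        return "invalid"
    if "no attestation" in results:
        return "unknown"
    return "valid"


aspa = {64500: {64501}, 64501: {64502}}
print(verify_upstream(aspa, [64502, 64501, 64500]))  # valid
print(verify_upstream(aspa, [64503, 64501, 64500]))  # invalid
print(verify_upstream(aspa, [64510, 64502, 64501]))  # unknown
```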



RTR extensions (draft-ietf-sidrops-8210bis). RTR is a protocol used between
a router and a cache for distribution of RPKI data objects. ASPA brings in
a requirement for distributing AS path attestation objects to routers, and
it is a simple extension to RTR for carrying yet another object type – and
this set of extensions defines RTR protocol version 2. The overall protocol
mechanics are equivalent to those of the lower versions (version 0 is
specified in RFC6810, and version 1 in RFC8210), and are based on the
concept of a router controlling the pull of the validation information
instead of a cache pushing it out towards the router. Operation starts with
the maximum supported version, can negotiate a lower backwards-compatible
protocol version if required during the session startup, and stays at the
negotiated version for the lifetime of the session. Deployment experience has
also identified several synchronization corner cases in the cache content
transfer in the presence of dynamic changes of that content. ROAs for
longer prefixes should be advertised before ROAs for corresponding shorter
covering prefixes, and multiple ROAs for the same prefix should be
advertised consecutively.
