By Hirotaka Yamamoto (@ymmt2005)
This is supplementary material for Modular, Pure Layer 3 Network for Kubernetes: The Implementation. It covers the basics of Border Gateway Protocol (BGP) as used inside data centers.
The intended readers are software engineers who have only heard of BGP.
Table of contents:
- What is a routing protocol?
- BGP concepts
- BGP characteristics
- How iBGP works
- BGP inside a data center
What is a routing protocol?
A routing protocol is a specification for exchanging network routes between network devices (routers). A network route is a piece of information describing how a destination network can be reached.
Border Gateway Protocol (BGP) is the routing protocol of the Internet. BGP is also used in local networks, as described in a later section. Other routing protocols, such as RIP or OSPF, are collectively called Interior Gateway Protocols (IGP).
Static routing
One can configure network routes manually without using routing protocols. This is called static routing.
Static routing is often used to configure very trivial and rarely changed routes, such as the default gateway.
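As a rough illustration of what a static route table holds and how a router might consult it, here is a minimal sketch. The table contents and the `lookup` helper are hypothetical, and the addresses come from documentation and private ranges. The sketch assumes the usual rule that the most specific matching route wins, so the default gateway is used only when nothing else matches.

```python
import ipaddress

# Hypothetical static routes: (destination network, next hop address).
STATIC_ROUTES = [
    (ipaddress.ip_network("0.0.0.0/0"), "192.0.2.1"),      # default gateway
    (ipaddress.ip_network("10.0.0.0/8"), "192.0.2.254"),
    (ipaddress.ip_network("10.1.0.0/16"), "192.0.2.253"),
]


def lookup(destination, routes=STATIC_ROUTES):
    """Return the next hop of the most specific route containing destination."""
    addr = ipaddress.ip_address(destination)
    matches = [(net, hop) for net, hop in routes if addr in net]
    if not matches:
        return None
    # The route with the longest prefix is the most specific one.
    _, hop = max(matches, key=lambda m: m[0].prefixlen)
    return hop
```

For example, `lookup("10.1.2.3")` picks the `10.1.0.0/16` route, while an address outside `10.0.0.0/8` falls back to the default gateway.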
When should IGP be used?
BGP can scale up to the Internet-size and is highly fault-tolerant. Nevertheless, IGP is often used in local networks because each routing protocol differs in various characteristics such as route convergence time, resource consumption, and operational costs.
The following material is a good read to compare routing protocols:
BGP concepts
AS / ASN
The fundamental design of BGP is that independently operated networks called autonomous systems (AS) exchange routing information with each other at their own discretion. Each AS is assigned a unique number called an autonomous system number (ASN) by a regional Internet registry (RIR) such as ARIN, APNIC, or RIPE.
Just like private IP addresses, there are private ASNs reserved for local network usage.
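The private ranges can be checked with a few lines of code. The sketch below uses the ranges reserved for private use by RFC 6996. The function name `is_private_asn` is only an illustrative choice.

```python
# Private-use ASN ranges reserved by RFC 6996 (inclusive bounds).
PRIVATE_ASN_RANGES = [
    (64512, 65534),            # 16-bit private range
    (4200000000, 4294967294),  # 32-bit private range
]


def is_private_asn(asn):
    """Return True if asn falls in a range reserved for private use."""
    return any(low <= asn <= high for low, high in PRIVATE_ASN_RANGES)
```

A data center network that never advertises its ASNs to the Internet would typically pick numbers for which `is_private_asn` returns `True`.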
eBGP, iBGP
BGP comes in two flavors: eBGP and iBGP.
- iBGP is used between routers in an autonomous system.
- eBGP is used between routers of different autonomous systems.
They are different in many aspects, as described below.
Peering and transit
Each autonomous system sets its own policy for which other autonomous systems it exchanges routes with and whether it advertises received routes to others. This self-determination is the key that makes the Internet a distributed, autonomous system.
Autonomous systems that exchange routes are called peers, and becoming peers is called peering.
An autonomous system is called a transit AS if it allows traffic destined for other networks to pass through it.
BGP characteristics
Path vector protocol
A route exchanged in BGP has various attributes. One such attribute is the path vector, a list of the ASNs that packets pass through on the way to the destination.
Path vectors are used to detect loops. To avoid loops, an autonomous system does not accept routes whose path vector already contains its own ASN.
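The loop check itself is small enough to sketch. In the following, both function names are hypothetical, and the AS path is modeled as a plain list of private ASNs with the most recently added AS first.

```python
def advertise_to_ebgp_peer(local_asn, as_path):
    """Return the path vector sent to an eBGP peer: our ASN is prepended."""
    return [local_asn] + list(as_path)


def accept_route(local_asn, as_path):
    """Reject a route whose path vector already contains our own ASN."""
    return local_asn not in as_path
```

If AS 64513 receives a path such as `[64512, 64513, 64514]`, its own ASN is already in the list, so the route has looped back and `accept_route` returns `False`.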
Attribute-based route selection
Autonomous systems can select a route for each destination based on various attributes attached to routes. BGP defines mandatory and optional attributes. Important attributes are:
NEXT_HOP
The router address for this route. In iBGP, this is not necessarily an address of the peer router.
AS_PATH
The aforementioned path vector. Routes with a shorter path vector are often preferred.
LOCAL_PREF
An arbitrary preference value assigned by the network administrator.
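To make attribute-based selection concrete, here is a deliberately simplified sketch. It compares only two attributes: a higher LOCAL_PREF wins, and a shorter AS_PATH breaks ties. Real BGP implementations apply a longer list of tie-breakers. The `Route` class, the default `local_pref` of 100, and `best_route` are assumptions made for this illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Route:
    prefix: str
    next_hop: str
    as_path: tuple
    local_pref: int = 100  # assumed default for this sketch


def best_route(candidates):
    """Pick one route per destination: highest LOCAL_PREF, then shortest AS_PATH."""
    return min(candidates, key=lambda r: (-r.local_pref, len(r.as_path)))
```

With equal LOCAL_PREF, the route with the shortest AS_PATH is chosen. Raising LOCAL_PREF on a route lets the administrator override that preference.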
Stateful
Because BGP is designed to manage the full routing table of the Internet, the protocol uses TCP for reliability and exchanges only changes to the routing table.
There is no periodic exchange of routes.
Slow convergence
To detect a connection failure with a peer router, BGP sends keep-alive messages every 60 seconds. Usually, BGP routers are configured to declare a peer failed when it misses three successive keep-alive messages, that is, after 180 seconds. This leads to a fairly long time for route convergence.
The convergence time can be reduced by tuning keep-alive configurations or using BFD for faster failure detection.
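The arithmetic behind failure detection can be sketched as follows. The BGP numbers are the ones given above. The BFD interval and multiplier are assumed values chosen only for illustration, since actual BFD timers depend on the deployment.

```python
def detection_time_ms(interval_ms, missed_count):
    """Worst-case time to declare a peer dead, in milliseconds."""
    return interval_ms * missed_count


# BGP keep-alive: one message every 60 seconds, three misses allowed.
bgp_ms = detection_time_ms(60_000, 3)

# BFD: assumed 300 ms interval with a multiplier of 3 (illustrative only).
bfd_ms = detection_time_ms(300, 3)

print(f"BGP keep-alive detection: {bgp_ms / 1000:.1f} s")
print(f"BFD detection (assumed timers): {bfd_ms / 1000:.1f} s")
```

Under these assumptions, detection drops from minutes to under a second.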
How iBGP works
Split horizon
iBGP is a variant of BGP to be used between routers in the same autonomous system.
Because both sides have the same ASN, iBGP does not add the ASN to the AS_PATH attribute. This prevents loop detection using AS_PATH.
Instead, iBGP does not re-advertise routes from an iBGP peer to other iBGP peers. This is called split horizon.
Full mesh interconnect
The introduction of split horizon means that each iBGP router needs to peer with all other iBGP routers to have all routes from different autonomous systems. This requirement is often referred to as a full mesh.
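A tiny model shows why split horizon forces a full mesh. In this sketch, `propagate` is a hypothetical helper that returns the iBGP routers that end up knowing a route first learned by `origin` from an eBGP peer. Sessions are unordered pairs of router names.

```python
def propagate(origin, sessions):
    """Return the routers that learn a route injected at origin.

    Under split horizon, a router that learned the route over iBGP does not
    re-advertise it to other iBGP peers, so only the origin's direct peers
    ever receive it.
    """
    neighbors = {}
    for a, b in sessions:
        neighbors.setdefault(a, set()).add(b)
        neighbors.setdefault(b, set()).add(a)
    return {origin} | neighbors.get(origin, set())


chain = [("r1", "r2"), ("r2", "r3")]
full_mesh = [("r1", "r2"), ("r1", "r3"), ("r2", "r3")]
```

In the chain topology, `r3` never learns the route injected at `r1` because `r2` does not pass it on. In the full mesh, every router hears it directly from `r1`.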
Route reflector
A full mesh of n iBGP routers requires n * (n-1) / 2 BGP sessions, that is, O(n^2). For a large network, the full mesh can degrade router performance due to increased BGP sessions.
The configuration of the full mesh also gets too large and cumbersome.
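To get a feel for the growth, the session count can be computed directly from the formula above. The function name is an illustrative choice.

```python
def full_mesh_sessions(n):
    """Number of BGP sessions in a full mesh of n iBGP routers."""
    return n * (n - 1) // 2


for n in (10, 100, 1000):
    print(n, full_mesh_sessions(n))
```

Ten routers need 45 sessions, a hundred need 4950, and a thousand need 499500.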
A workaround for this problem is to configure some routers as route reflectors. A route reflector marks some of its iBGP peers as route reflector clients and advertises routes learned from other iBGP routers to them, ignoring split horizon.
This way, a route reflector client does not need to peer with all other iBGP routers; it only needs to peer with the route reflectors.
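The reflection behavior and its effect on the session count can be sketched as below. The reflection rules follow the usual ones from RFC 4456. The function names are hypothetical. The session count assumes that every client peers with every reflector and that the reflectors form a full mesh among themselves.

```python
def reflect_targets(learned_from, clients, non_clients):
    """Peers to which a route reflector re-advertises an iBGP-learned route."""
    if learned_from in clients:
        # From a client: reflect to all other clients and to non-client peers.
        return (set(clients) | set(non_clients)) - {learned_from}
    # From a non-client iBGP peer: reflect to clients only.
    return set(clients)


def route_reflector_sessions(n, reflectors):
    """Sessions needed for n routers, of which `reflectors` are route reflectors."""
    clients = n - reflectors
    reflector_mesh = reflectors * (reflectors - 1) // 2
    return clients * reflectors + reflector_mesh
```

Under these assumptions, 100 routers with 2 route reflectors need 197 sessions, compared with 4950 for a full mesh of the same size.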
Keeping NEXT_HOP
In eBGP, a router rewrites the NEXT_HOP attribute of routes when advertising them to peers.
In iBGP, NEXT_HOP is kept untouched.
A fresh look at BGP’s NEXT_HOP is a good read about this.
When an iBGP router receives a route whose NEXT_HOP is unreachable, it rejects and ignores the route.
There are two possible workarounds for this problem:
- Pre-register routes to possible NEXT_HOP
  Use IGP or static routing to pre-register routes to all possible NEXT_HOP values.
- Rewrite NEXT_HOP with the address of the advertising router
  Most routers can do this with next hop self or similar configurations. This is handy but suboptimal because it affects not only routes from eBGP peers but also routes from iBGP peers when the router is a route reflector.
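The NEXT_HOP handling described in this section can be modeled in a few lines. This is a sketch with hypothetical function names and example addresses, not the configuration syntax of any particular router. The `next_hop_self` flag stands in for the "next hop self" style option mentioned above.

```python
import ipaddress


def next_hop_to_advertise(route_next_hop, my_address, peer_is_ebgp,
                          next_hop_self=False):
    """Return the NEXT_HOP value a router puts in an advertisement."""
    if peer_is_ebgp or next_hop_self:
        # eBGP, or an explicit rewrite: use the advertising router's address.
        return my_address
    # Plain iBGP: NEXT_HOP is kept untouched.
    return route_next_hop


def is_usable(next_hop, reachable_networks):
    """A receiver keeps the route only if it can reach NEXT_HOP."""
    addr = ipaddress.ip_address(next_hop)
    return any(addr in ipaddress.ip_network(net) for net in reachable_networks)
```

For instance, if an iBGP router only knows how to reach `10.0.0.0/8`, a route whose NEXT_HOP is still the external address `203.0.113.1` is unusable. It becomes usable either when a route to `203.0.113.1` is pre-registered or when the advertising router rewrites NEXT_HOP to its own address inside `10.0.0.0/8`.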
BGP inside a data center
Traditionally, BGP was either not used inside data centers at all or was used together with IGP to compensate for its shortcomings, such as slow convergence.
But today, more and more companies are using BGP rather than IGP to route traffic inside data centers because:
- Shortcomings of BGP can now be covered with technologies like BFD or the ADD-PATH extension.
- Network switches with BGP capability have become cheaper.
- Software such as Calico or MetalLB speaks BGP for the best performance.