By Hirotaka Yamamoto (@ymmt2005)
This is supplementary material for Modular, Pure Layer 3 Network for Kubernetes: The Implementation. It covers the basics of Border Gateway Protocol (BGP) as used inside data centers.
The intended readers are software engineers who have only heard of BGP.
Table of contents:
- What is a routing protocol?
- BGP concepts
- BGP characteristics
- How iBGP works
- BGP inside a data center
What is a routing protocol?
A routing protocol is a specification for exchanging network routes between network devices (routers). A network route is information describing how a destination network can be reached.
Border Gateway Protocol (BGP) is the routing protocol of the Internet. BGP is also used in local networks, as described in a later section. The other routing protocols, such as RIP or OSPF, are collectively called Interior Gateway Protocols (IGPs).
One can configure network routes manually without using routing protocols. This is called static routing.
Static routing is often used to configure very trivial and rarely changed routes, such as the default gateway.
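To make the idea of a route concrete, here is a minimal sketch of a static routing table in Python. The addresses and the `STATIC_ROUTES` and `lookup` names are made up for illustration. The lookup uses longest prefix match, which is how IP forwarding normally picks among overlapping routes.

```python
import ipaddress

# Hypothetical static routes: (destination network, next hop).
STATIC_ROUTES = [
    (ipaddress.ip_network("0.0.0.0/0"), "192.168.1.1"),      # default gateway
    (ipaddress.ip_network("10.1.0.0/16"), "192.168.1.254"),  # one internal network
]


def lookup(destination):
    """Return the next hop of the most specific matching route, or None."""
    addr = ipaddress.ip_address(destination)
    matches = [(net, hop) for net, hop in STATIC_ROUTES if addr in net]
    if not matches:
        return None
    # Longest prefix match: the route with the longest prefix wins.
    _, hop = max(matches, key=lambda m: m[0].prefixlen)
    return hop
```

With these routes, `lookup("10.1.2.3")` selects the more specific `10.1.0.0/16` route, while any other destination falls back to the default gateway.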
When should IGP be used?
BGP can scale up to the size of the Internet and is highly fault-tolerant. Nevertheless, IGP is often used in local networks because routing protocols differ in characteristics such as route convergence time, resource consumption, and operational costs.
The following material is a good read to compare routing protocols:
BGP concepts
AS / ASN
The fundamental design of BGP is that independently operated networks called autonomous systems (ASes) exchange routing information with each other at will. Each AS is assigned a unique number called an autonomous system number (ASN) by a regional Internet registry (RIR) such as ARIN, APNIC, or RIPE NCC.
Just like private IP addresses, there are private ASNs reserved for local network usage.
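As a small illustration, the private-use ASN ranges reserved by RFC 6996 can be checked as follows. The `is_private_asn` helper is a hypothetical name used only for this sketch.

```python
# Private-use ASN ranges reserved by RFC 6996.
PRIVATE_ASN_RANGES = [
    (64512, 65534),            # 16-bit private ASNs
    (4200000000, 4294967294),  # 32-bit private ASNs
]


def is_private_asn(asn):
    """Return True if asn falls inside a private-use range."""
    return any(low <= asn <= high for low, high in PRIVATE_ASN_RANGES)
```

A data center that never advertises its ASNs to the Internet can pick numbers from these ranges freely, much like picking addresses from `10.0.0.0/8`.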
BGP comes in two flavors: eBGP and iBGP.
- iBGP is used between routers in an autonomous system.
- eBGP is used between routers of different autonomous systems.
They are different in many aspects, as described below.
Peering and transit
Each autonomous system sets its own policy for which autonomous systems it exchanges routes with and which received routes it advertises to others. This self-determination is the key that makes the Internet a distributed, autonomous system.
Autonomous systems that exchange routes are called peers, and becoming peers is called peering.
An autonomous system is called a transit AS if it allows traffic destined for other networks to pass through it.
Path vector protocol
A route exchanged in BGP has various attributes. One such attribute is the path vector, a list of the ASNs that packets pass through on the way to the destination.
Path vectors are used to detect loops. To avoid loops, an autonomous system does not accept routes whose path vector already contains its own ASN.
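The loop detection described above can be sketched in a few lines of Python. The function names and the private ASNs are illustrative assumptions, and real BGP implementations handle many more details.

```python
def advertise(local_asn, as_path):
    """Prepend the local ASN when advertising a route to another AS."""
    return [local_asn] + as_path


def accept(local_asn, as_path):
    """Reject a route whose path vector already contains the local ASN."""
    return local_asn not in as_path


# Hypothetical private ASNs: AS 65001 originates the route,
# which is then passed on by AS 65002 and AS 65003.
path = advertise(65001, [])
path = advertise(65002, path)
path = advertise(65003, path)
```

Here `accept(65004, path)` succeeds because AS 65004 has not seen the route before, while `accept(65001, path)` fails because the route would loop back to its origin.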
Attribute-based route selection
Autonomous systems can select a route for each destination based on various attributes attached to routes. BGP defines mandatory and optional attributes. Important attributes are:
- NEXT_HOP: The router address for this route. In iBGP, this is not necessarily an address of the peer router.
- AS_PATH: The aforementioned path vector. Routes with a shorter path vector are often preferred.
- LOCAL_PREF: An arbitrary preference value assigned by the network administrator.
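A simplified sketch of attribute-based selection follows. It assumes only two criteria: a higher administrator-assigned preference value (known as LOCAL_PREF in BGP) wins first, then a shorter path vector. Real routers apply a longer list of tie-breakers, and the `Route` class, its field names, and the default of 100 (a common vendor default) are assumptions for this example.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Route:
    prefix: str
    next_hop: str
    as_path: List[int]
    local_pref: int = 100  # assumed default; many implementations use 100


def best_route(candidates):
    """Pick the route with the highest preference, then the shortest path."""
    return min(candidates, key=lambda r: (-r.local_pref, len(r.as_path)))


# Hypothetical candidate routes to the same destination.
via_a = Route("203.0.113.0/24", "192.0.2.1", [65001, 65002, 65003])
via_b = Route("203.0.113.0/24", "192.0.2.2", [65004, 65003])
via_c = Route("203.0.113.0/24", "192.0.2.3", [65005, 65006, 65007, 65003], local_pref=200)
```

Between `via_a` and `via_b` the shorter path wins. Once `via_c` is a candidate, its higher preference value overrides path length.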
BGP characteristics
Because BGP is designed to handle the full routing table of the Internet, the protocol uses TCP for reliability and exchanges only changes to the routing table.
There is no periodic exchange of routes.
To detect connection failure for a peer router, BGP sends keep-alive messages every 60 seconds. Usually, BGP routers are configured to declare a peer failed when it misses three consecutive keep-alive messages, so detecting a failure can take about 180 seconds. This leads to a fairly long route convergence time.
The convergence time can be reduced by tuning the keep-alive configuration or by using Bidirectional Forwarding Detection (BFD) for faster failure detection.
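The arithmetic behind this is simple enough to sketch. The 60-second interval and the limit of three missed messages come from the text above. The 300 ms BFD interval is a hypothetical value chosen only for comparison, not a recommendation.

```python
def detection_time(interval_seconds, missed_limit=3):
    """Seconds of silence before a peer is declared down."""
    return interval_seconds * missed_limit


bgp_keepalive = detection_time(60)  # keep-alive every 60 seconds
bfd_example = detection_time(0.3)   # hypothetical 300 ms BFD interval
```

Under these assumptions, failure detection drops from minutes to under a second, which is why BFD is a popular companion to BGP.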
How iBGP works
iBGP is a variant of BGP to be used between routers in the same autonomous system.
Because both sides have the same ASN, iBGP does not add the ASN to the AS_PATH attribute. This prevents loop detection using AS_PATH.
Instead, iBGP does not re-advertise routes learned from an iBGP peer to other iBGP peers. This is called split horizon.
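The effect of split horizon can be seen in a toy model. The router names and the `routers_that_learn` function are hypothetical. The model covers only one route that a single router learned over eBGP.

```python
def routers_that_learn(sessions, origin):
    """Routers that end up with a route that origin learned over eBGP.

    sessions is a list of (router, router) pairs, one per iBGP session.
    """
    learned = {origin}
    for a, b in sessions:
        # The origin advertises its eBGP-learned route to direct iBGP peers.
        if a == origin:
            learned.add(b)
        elif b == origin:
            learned.add(a)
    # Split horizon: those peers do not pass the route on to other iBGP
    # peers, so there is no further propagation step.
    return learned


chain = [("r1", "r2"), ("r2", "r3")]
all_pairs = chain + [("r1", "r3")]
```

With the `chain` sessions, `r3` never learns the route from `r1` because `r2` will not re-advertise it. Once every pair of routers has a session, as in `all_pairs`, all three routers learn it.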
Full mesh interconnect
Because of split horizon, each iBGP router needs to peer with every other iBGP router in order to learn all the routes received from other autonomous systems. This requirement is often referred to as a full mesh.
A full mesh of n iBGP routers requires n * (n-1) / 2 BGP sessions, that is, O(n^2). For a large network, the full mesh can degrade router performance due to the increased number of BGP sessions.
The configuration of the full mesh also gets too large and cumbersome.
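To get a feel for the quadratic growth, the session count can be computed directly. The function name is illustrative.

```python
def full_mesh_sessions(n):
    """Number of iBGP sessions in a full mesh of n routers."""
    return n * (n - 1) // 2


# Session counts for a few network sizes.
counts = {n: full_mesh_sessions(n) for n in (4, 10, 100)}
```

Going from 10 to 100 routers multiplies the number of routers by 10 but the number of sessions by 110.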
A workaround for this problem is to configure some routers as route reflectors. A route reflector marks some of its iBGP peers as route reflector clients and advertises routes learned from other routers to them, ignoring split horizon.
This way, a route reflector client does not need to peer with all other iBGP routers; it only needs to peer with route reflectors.
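The following sketch illustrates both the saving in sessions and a simplified version of the reflection rules from RFC 4456. It assumes that route reflectors are fully meshed among themselves and that every client peers with every reflector. All names are hypothetical.

```python
def reflector_sessions(reflectors, clients):
    """iBGP sessions when each client peers only with the route reflectors."""
    mesh_between_reflectors = reflectors * (reflectors - 1) // 2
    return mesh_between_reflectors + reflectors * clients


def reflect_to(source, peers):
    """Peers to which a reflector re-advertises a route learned from source.

    peers maps each iBGP peer name to "client" or "non-client".
    """
    if peers[source] == "client":
        # Learned from a client: reflect to every other iBGP peer.
        targets = [p for p in peers if p != source]
    else:
        # Learned from a non-client: reflect to clients only.
        targets = [p for p, role in peers.items() if role == "client"]
    return sorted(targets)


# Hypothetical peers of one route reflector.
peers = {"c1": "client", "c2": "client", "rr2": "non-client"}
```

For 100 routers, 2 reflectors with 98 clients need `reflector_sessions(2, 98)` sessions, which is 197 instead of the 4950 sessions of a full mesh.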
In eBGP, a router rewrites the NEXT_HOP attribute of routes when advertising them to peers. In iBGP, NEXT_HOP is kept untouched.
A fresh look at BGP’s NEXT_HOP is a good read about this.
When an iBGP router receives a route whose NEXT_HOP is unreachable, it rejects and ignores the route.
There are two possible workarounds for this problem:
- Pre-register routes to possible NEXT_HOP addresses. Use IGP or static routing to pre-register routes to all possible NEXT_HOP addresses.
- Rewrite NEXT_HOP with the address of the advertising router. Most routers can do this with next hop self or similar configurations. This is handy but suboptimal because it affects not only routes from eBGP peers but also routes from iBGP peers when the router is a route reflector.
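Both workarounds can be illustrated with a small sketch. It assumes a border router at 10.0.0.1 that learned a route from an eBGP peer at 192.0.2.1, and an internal router that initially knows only the 10.0.0.0/24 network. All addresses, function names, and the dictionary layout are hypothetical.

```python
import ipaddress


def next_hop_reachable(next_hop, known_prefixes):
    """True if some known route covers the NEXT_HOP address."""
    addr = ipaddress.ip_address(next_hop)
    return any(addr in ipaddress.ip_network(p) for p in known_prefixes)


def advertise_to_ibgp(route, self_address, next_hop_self):
    """Return the route as sent to an iBGP peer."""
    advertised = dict(route)
    if next_hop_self:
        # Rewrite NEXT_HOP with the address of the advertising router.
        advertised["next_hop"] = self_address
    return advertised


learned = {"prefix": "203.0.113.0/24", "next_hop": "192.0.2.1"}
internal_routes = ["10.0.0.0/24"]

untouched = advertise_to_ibgp(learned, "10.0.0.1", next_hop_self=False)
rewritten = advertise_to_ibgp(learned, "10.0.0.1", next_hop_self=True)
```

The internal router would reject `untouched` because 192.0.2.1 is not covered by `internal_routes`. It accepts `rewritten`, and it would also accept `untouched` after a route such as `192.0.2.0/30` is pre-registered.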
BGP inside a data center
Traditionally, BGP has either not been used inside data centers at all or has been used together with IGP to compensate for shortcomings such as slow convergence.
But today, more and more companies are using BGP rather than IGP to route traffic inside data centers because: