Kintone Engineering Blog

Learn about Kintone's engineering efforts. Kintone is provided by Cybozu Inc., a Tokyo-based public company founded in 1997.

Modular, Pure Layer 3 Network for Kubernetes: The Implementation

By Hirotaka Yamamoto (@ymmt2005)

As part of Project Neco, we are building a highly scalable data center network for large Kubernetes clusters. In Modular, Pure Layer 3 Network for Kubernetes: The Design, we described the challenges of building a pure layer 3 data center network and our design to overcome them.


The main point of the design is that each server works as a BGP router, and that redundant network connectivity can be achieved with BFD and ECMP.

In this article, we describe how to implement the design using the BIRD routing software on Linux. Since a lot of prerequisite knowledge is required, we have prepared two tutorials covering BGP and BIRD. If you are not familiar with them, please read the tutorials first.

Table of contents:

  • Autonomous system design
  • Network inside a rack
  • Management IP address
  • Avoiding ICMP redirects
  • Implementing the Kubernetes network model
  • Summary

Autonomous system design

In Neco, every server is a BGP speaker. If each server has a unique autonomous system number (ASN), every peering session is eBGP. This model is called AS per Compute Server.

Alternatively, a model that uses a single ASN for everything in a rack (top-of-rack switches and servers) is called AS per Rack. In this model, sessions inside a rack are iBGP.

Other variations are possible, but basically one of these two should be chosen. We chose the AS per Rack model mainly because our network engineers asked for it; from their point of view, AS numbers assigned to servers look alien.

The diagram below shows an example of the AS per Rack model with a spine-leaf topology.

AS per Rack model

Network inside a rack

As described in the design article, each rack has two top-of-rack (ToR) switches for redundancy. Since we do not use MC-LAG, each ToR switch has its own layer-2 subnet.

To reduce peering connections, each ToR switch is configured as a route reflector. This way, each server needs to establish connections only with two ToR switches.

Note that the two ToR switches need not peer with each other, because each has all the routes to the servers in the rack and connects with all the spine routers. This is an example of an iBGP configuration that does not require a full mesh.
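
Our ToR switches are hardware switches with their own configuration language, but for illustration, the reflector side of such a session expressed in BIRD syntax would look roughly like this (the ASN and the server address are made-up examples):

# Illustration only: a route reflector session from a ToR switch to one server.
protocol bgp server1 {
    local as 64601;                # the rack-wide ASN
    neighbor 10.1.0.65 as 64601;   # a server in the rack; same ASN, so iBGP
    rr client;                     # reflect this server's routes to the other iBGP peers
    ipv4 {
        import all;
        export all;
    };
}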

The diagram below shows a server in a rack establishing connections with two ToR switches.

a server in a rack establishing connections with ToR switches

The server has two network interfaces and assigns different IP addresses to them. These addresses are used only for communications between the server and the ToR switches. All other communication uses another IP address, called the management IP address, which is described in the next section.
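
On the server side, a bird.conf sketch for the two ToR peerings could look like the following. The ASN, addresses, and interface names are assumptions for illustration; the important points are that both sessions use the rack-wide ASN (hence iBGP) and that BFD detects link failures quickly:

# A minimal sketch of the server-side peering (ASN, addresses, and interface names are examples).
protocol bfd {
    interface "eno*" {
        min rx interval 100 ms;
        min tx interval 100 ms;
    };
}

template bgp tor {
    local as 64601;    # the rack-wide ASN; equal to the neighbor's ASN, so the session is iBGP
    direct;            # the ToR switches are directly connected
    bfd on;            # tear the session down quickly when BFD detects a link failure
    ipv4 {
        import all;
        export all;
    };
}

protocol bgp tor1 from tor {
    neighbor 10.1.0.66 as 64601;    # ToR switch 1, reachable via eno1
}

protocol bgp tor2 from tor {
    neighbor 10.1.0.130 as 64601;   # ToR switch 2, reachable via eno2
}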

Management IP address

Using addresses assigned to physical network interfaces has several problems:

  1. When one ToR switch fails, the server's address in that switch's subnet becomes unreachable.
  2. A Kubernetes Node is expected to have a single representative IP address.

An example of the latter is that a Pod sees only one IP address of its running node, in status.hostIP.
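
You can observe this from outside the Pod as well; the following command prints the single node address recorded for a Pod (the Pod name is a placeholder):

$ kubectl get pod mypod -o jsonpath='{.status.hostIP}'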

To avoid these problems, another IP address called management IP address is assigned to each server. The management IP address does not belong to any layer-2 subnet and therefore has a full-size netmask (/32 for IPv4). Each server advertises its management IP address via BGP.

Let's see how to implement the management IP address on Linux.

Dummy device

Linux has a virtual network interface called dummy. The server can receive packets to the IP addresses assigned to its dummy device.

To create a dummy device and assign an IP address, execute the following commands:

$ sudo ip link add node0 type dummy
$ sudo ip link set node0 up
$ sudo ip address add 10.1.0.1/32 dev node0

This creates node0 as a new dummy device and assigns 10.1.0.1/32 to it. To advertise the IP address of node0, add the following configurations to bird.conf:

# Import IP addresses assigned to node0 into master4 routing table.
protocol direct {
    ipv4;
    interface "node0";
}

# Prevent exporting IP addresses from node0 to the kernel routing table.
protocol kernel {
    merge paths;
    ipv4 {
        export filter {
            if source = RTS_DEVICE then reject;
            accept;
        };
    };
}
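
The ip commands above do not survive a reboot. If you manage interfaces with systemd-networkd (as in the next section), the same dummy device can be created declaratively; the following is a sketch, with the file names chosen arbitrarily:

# /etc/systemd/network/10-node0.netdev
[NetDev]
Name=node0
Kind=dummy

# /etc/systemd/network/10-node0.network
[Match]
Name=node0

[Address]
Address=10.1.0.1/32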

Avoiding indefinite source address selection

The Linux kernel selects an arbitrary IP address with global scope as the source address when it sends packets through unbound sockets. Therefore, IP addresses other than those assigned to node0 must have link scope; otherwise, TCP connections could be established using non-management IP addresses.

With systemd-networkd, the scope of an address can be specified as follows:

[Match]
Name=eno1

[Address]
Address=10.1.0.65/26
Scope=link
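
You can verify the selection with ip route get: once the physical addresses have link scope, the kernel should pick the management IP address on node0 as the source for off-link destinations. The addresses below are the examples used in this article, and the output is abbreviated:

$ ip route get 10.2.0.1
10.2.0.1 via 10.1.0.66 dev eno1 src 10.1.0.1 ...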

By the way, when I encountered this problem, I thought it might be a bug in the Linux kernel. So I sent a question with a reproducer to netdev (the mailing list for network-related Linux kernel development) and got a reply that this behavior is intended and valid, along with the hint to limit the scope of the addresses.

If you are interested, here is the thread in the mailing list.

Avoiding ICMP redirects

Since iBGP does not change the NEXT_HOP attribute when redistributing routes from eBGP peers to iBGP peers, such routes may be rejected by iBGP peers if the NEXT_HOP address is unreachable. To avoid this problem, there are several options, as described in Getting Started with BIRD Routing Software.

An easy and often-used workaround is to enable next hop self feature in iBGP routers. However, this is not the best option in our network model because next hop self would involve ICMP redirects.

An ICMP redirect message is typically sent when a router receives a packet and forwards it to another router in the same layer-2 network. This notifies the server that sent the packet that there is a more direct route to the destination. Sounds good? Unfortunately, there are many reasons to avoid network configurations that involve ICMP redirects.

Why do ICMP redirects occur?

Because all servers in the same layer-2 network are BGP routers that communicate through ToR switches acting as iBGP route reflectors. If a ToR switch enabled next hop self, it would send ICMP redirects as follows:

  1. A server advertises its management IP address to the ToR switch.
  2. The switch receives the route to the management IP and rewrites its NEXT_HOP attribute because of next hop self.
  3. The switch sends the rewritten route to other servers in the same network because it is a route reflector.
  4. Other servers receive the rewritten route.
  5. When one of these servers sends a packet to the management IP address, it forwards the packet to the switch because the route's NEXT_HOP is now the switch address.
  6. Because the destination server is in the same layer-2 network as the sender, the switch sends back an ICMP redirect message.

The problem is that next hop self cannot distinguish routes learned from routers in the same layer-2 network, whose NEXT_HOP should be left as is.

Rewriting NEXT_HOP on the receiving end

Our solution to this problem is to configure BIRD to accept routes with unreachable NEXT_HOP. Think of this as next hop self on the receiving end.

The following setting rewrites the NEXT_HOP attribute of routes whose NEXT_HOP is outside the local layer-2 network to the address of the ToR switch that sent them.

# Prevent BIRD from rejecting routes with unreachable NEXT_HOP attribute
# by preparing an IGP table having a dummy default gateway.
ipv4 table dummytab;
protocol static dummystatic {
    ipv4 { table dummytab; };
    route 0.0.0.0/0 via "lo";
}

protocol bgp {
    ...

    ipv4 {
        # Use the IGP table to accept any routes.  "gateway recursive" is also necessary.
        igp table dummytab;
        gateway recursive;

        import filter {
            # if NEXT_HOP is in the same network, keep it as is.  /26 is just an example netmask.
            if bgp_next_hop.mask(26) = from.mask(26) then {
                gw = bgp_next_hop;
                accept;
            }

            # Otherwise, change NEXT_HOP to the address of the sender route (ToR switch).
            gw = from;
            accept;
        };
    };
}

Another solution: full mesh

You: Wait, why don't you give up route reflection? That would ...

Me: You know too much, so you have to 🎉

Implementing the Kubernetes network model

Several Kubernetes network plugins, such as Calico and kube-router, use BGP to advertise Pod addresses.

We decided to run BIRD ourselves rather than rely on the BGP features of these plugins because, as we have seen so far, we need substantial BIRD configuration to implement the management IP address and to cope with ICMP redirects.

Also, we want to use MetalLB to implement the LoadBalancer service type for Kubernetes. However, Calico and MetalLB have compatibility issues when used together.

To achieve these goals, we created a new network plugin called Coil. It does not speak routing protocols by itself; it only exports routing information to an unused Linux kernel routing table. BIRD imports the routing information from that kernel routing table and advertises it via BGP.
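
For example, if Coil writes Pod routes into an unused kernel routing table, BIRD can pick them up with an additional kernel protocol like the sketch below (the table number 119 and the protocol name are arbitrary examples, not Coil's actual defaults):

# Learn routes that Coil has written into kernel routing table 119
# and put them into master4 so that the BGP protocols can advertise them.
protocol kernel coilroutes {
    kernel table 119;   # the routing table Coil exports to (example number)
    learn;              # import routes installed by programs other than BIRD
    scan time 10;
    ipv4 {
        import all;
        export none;    # never write anything back into this table
    };
}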

Our current networking stack consists of:

  • BIRD for BGP, BFD, and ECMP redundant connectivity
  • Coil for IP address management (IPAM) and intra-node routing of Kubernetes Pods
  • MetalLB to implement LoadBalancer service
  • Calico for policy-enforcement

We will introduce Coil in another article.

Summary

We built a highly scalable software-defined network using pure layer-3 technologies such as BGP and BFD.

Thanks to our new modular network plugin Coil, we can combine MetalLB, Calico, and BIRD to build a fully-featured Kubernetes network.