By Hirotaka Yamamoto (@ymmt2005)
In the previous article, we introduced Coil v2 and its opt-in egress NAT feature. The feature allows any Kubernetes Pods to initiate outgoing TCP/UDP communication with external networks.
This article dives into how it is implemented. For the impatient, here is the summary:
- Coil reconciles the `Egress` custom resource with a Deployment and a Service.
- To route packets from a client Pod, Coil establishes a bidirectional tunnel between the client Pod and the egress Pods.
- Foo over UDP (FoU) is chosen as the tunnel protocol to make the NAT service highly available.
Table of contents
- The problems
- Anatomy of the egress NAT feature
- How Foo over UDP tunnel works
- Bidirectional tunneling
- Session persistence
- Routing rules
- The implementation
- Summary
The problems
Implementing a NAT gateway on Kubernetes has several problems that need to be resolved.
The first problem is how to route packets from clients to a NAT gateway. Because of its transparent nature, application programs are not aware of NAT gateways. So, packets from applications carry destination addresses in external networks. As the underlying network is also not aware of NAT gateways built on Kubernetes, such packets would not be delivered to NAT gateway Pods.
A general solution to this problem is tunneling. Tunneling is a technology that encapsulates packets, sends them to an arbitrary destination, and decapsulates them at the destination where they arrive. We will establish tunnels between client Pods and NAT gateway Pods.
Another problem is how to make NAT gateways highly available. Because Pods are ephemeral, they are usually hidden behind a Service. A Service has a stable virtual IP address and distributes client requests to the target Pods. So, the problem can be translated into how to establish tunnels between client Pods and NAT gateway Pods over Services.
Anatomy of the egress NAT feature
The above picture illustrates how egress NAT works.
Suppose that Pods in the `internet-egress` namespace can communicate with services on the Internet. Likewise, suppose that Pods in the `domestic-egress` namespace can communicate with services on a domestic network, say, `192.168.10.0/24`.
This can be achieved by using Coil's `AddressPool` feature.
Let's see how Pods in the `default` namespace communicate with external networks.
`Egress` custom resource
`Egress` is a custom resource to define a NAT gateway service in a namespace. The following `Egress` is for the Internet:
```yaml
apiVersion: coil.cybozu.com/v2
kind: Egress
metadata:
  namespace: internet-egress
  name: nat
spec:
  replicas: 2
  destinations:
    - 0.0.0.0/0
    - ::/0
```
whereas the following is for a domestic network `192.168.10.0/24`:
```yaml
apiVersion: coil.cybozu.com/v2
kind: Egress
metadata:
  namespace: domestic-egress
  name: nat
spec:
  replicas: 2
  destinations:
    - 192.168.10.0/24
```
Coil creates a Deployment and a Service from an `Egress` in the same namespace.
The Deployment then creates Pods that run the `coil-egress` container.
These Pods are called egress Pods.
The Service works as a load balancer for the egress Pods.
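For illustration, the generated Service might look roughly like the following. This is a hypothetical sketch: the selector label and the port number are assumptions, not necessarily what Coil actually generates.

```yaml
# Sketch of a Service that could sit in front of the egress Pods.
# The selector label and port 5555 are illustrative assumptions.
apiVersion: v1
kind: Service
metadata:
  namespace: internet-egress
  name: nat
spec:
  selector:
    app.kubernetes.io/name: nat   # hypothetical label on egress Pods
  ports:
    - protocol: UDP               # FoU tunnels run over UDP
      port: 5555
      targetPort: 5555
  sessionAffinity: ClientIP       # see "Session persistence" below
```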
Service and tunnel protocol
Coil puts a Service in front of the egress Pods to make the NAT service highly available. Since Service supports only TCP and UDP, we need to choose a tunneling protocol that encapsulates packets in UDP.
Among others, we chose Foo over UDP (FoU) as the tunneling protocol. FoU encapsulates packets with a UDP header and supports both IPv4 and IPv6.
See An introduction to Linux virtual interfaces: Tunnels if you are interested in other options.
Client Pod
Not all Pods need to communicate with external networks. In fact, allowing all Pods to do so could introduce certain security risks.
Therefore, Coil sets up tunnels for egress NAT only in Pods that have opted in.
To set up a tunnel for an `Egress` named `nat` in the `domestic-egress` namespace, set an annotation like this:
```yaml
apiVersion: v1
kind: Pod
metadata:
  annotations:
    egress.coil.cybozu.com/domestic-egress: nat
...
```
This design also allows you to use different sets of source IP addresses depending on the workload, as follows (a sketch of the manifests appears after the list):
- Create two `AddressPool` resources for two different sets of IPv4 global addresses
- Create two Namespaces and annotate them to use different IPv4 global pools
- Create an `Egress` for each Namespace
- Specify either one of the `Egress` resources depending on the workload
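Here is a minimal sketch of that setup. All names, the address range, and the pool-selection annotation (`coil.cybozu.com/pool`) are assumptions for illustration; only one of the two pool/Namespace/Egress triples is shown.

```yaml
# Hypothetical AddressPool holding one set of global IPv4 addresses.
apiVersion: coil.cybozu.com/v2
kind: AddressPool
metadata:
  name: global1
spec:
  subnets:
    - ipv4: 203.0.113.0/28    # illustrative address range
---
# Namespace annotated to allocate Pod addresses from that pool.
apiVersion: v1
kind: Namespace
metadata:
  name: egress1
  annotations:
    coil.cybozu.com/pool: global1
---
# Egress in that Namespace; client Pods opt in to either this
# Egress or its counterpart in the second Namespace.
apiVersion: coil.cybozu.com/v2
kind: Egress
metadata:
  namespace: egress1
  name: nat
spec:
  destinations:
    - 0.0.0.0/0
```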
Egress Pod
Inside an egress Pod, a program called `coil-egress` watches `kube-apiserver` to detect new client Pods and their deletion.
For each client Pod, `coil-egress` sets up an FoU tunnel and routing rules.
It also configures `iptables` for source network address translation (SNAT).
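A typical SNAT rule of this kind would masquerade decapsulated packets as they leave the egress Pod. This is only a sketch; the actual rules `coil-egress` installs may be more specific.

```console
# Sketch: rewrite the source address of packets leaving through eth0
# to the egress Pod's own address (which comes from the global pool).
$ iptables -t nat -A POSTROUTING -o eth0 -j MASQUERADE
```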
How Foo over UDP tunnel works
Let's take a quick detour and look at how to set up a Foo over UDP tunnel.
FoU acts as a wrapper for existing IPv4 or IPv6 tunnels. The simplest FoU encapsulations are those that wrap IPv4-in-IPv4 (IPIP) and IPv6-in-IPv6 (IP6IP6).
To send IPv4-in-IPv4 packets encapsulated with FoU headers to an external network `192.168.10.0/24` via a NAT gateway `10.1.2.3`, create an `ipip` (protocol number 4) link with the `encap fou` option and add a routing rule as follows:
```console
$ ip link add fou1 type ipip \
    remote 10.1.2.3 local 10.20.30.40 ttl 225 \
    encap fou encap-sport 5555 encap-dport 5555
$ ip link set fou1 up
$ ip route add 192.168.10.0/24 dev fou1
```
To receive IPv4-in-IPv4 packets encapsulated with FoU headers from a NAT gateway `10.1.2.3`, configure a UDP listening port and create a matching `ipip` link as follows:
```console
$ ip fou add port 5555 ipproto 4
$ ip link add fou1 type ipip \
    remote 10.1.2.3 local 10.20.30.40 ttl 225 \
    encap fou encap-sport 5555 encap-dport 5555
$ ip link set fou1 up
```
The same FoU link can be used for both sending and receiving packets.
FoU for IPv6 is basically the same. Create an `ip6tnl` link with `mode ip6ip6`.
The listening port needs to be created for `ip6ip6` (protocol number 41) with the `-6` flag.
```console
$ ip link add fou2 type ip6tnl mode ip6ip6 \
    remote fd04::1 local fd02::1 ttl 225 \
    encap fou encap-sport 5555 encap-dport 5555
$ ip fou add port 5555 ipproto 41 -6
```
Bidirectional tunneling
Although an egress Pod can directly send packets to its client Pods, the tunnel between client and egress Pods needs to be bidirectional. Otherwise, packets coming back through the egress Pod from the external network may not reach the client Pod for the following reasons:
- The source address of the returned packet can be a global IP address, and such packets are often rejected by NetworkPolicy.
- If the packet is TCP's SYN-ACK, it is often dropped by underlying firewalls because there seems to be no corresponding SYN packet. Note that the SYN packet was sent through the FoU tunnel.
Therefore, Coil sets up FoU links and routing rules for sending and receiving in both egress Pods and client Pods.
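Conceptually, both ends run the same kind of configuration with `local` and `remote` swapped. The following sketch reuses the hypothetical addresses from above (client `10.20.30.40`, egress Pod `10.1.2.3`); the actual link names and commands Coil issues may differ.

```console
# On the client Pod (sketch): send packets via the tunnel and also
# decapsulate return packets arriving from the egress Pod.
$ ip fou add port 5555 ipproto 4
$ ip link add fou1 type ipip \
    remote 10.1.2.3 local 10.20.30.40 ttl 225 \
    encap fou encap-sport 5555 encap-dport 5555

# On the egress Pod (sketch): the mirror image, so that SYN-ACKs and
# other return packets travel back through the same tunnel.
$ ip fou add port 5555 ipproto 4
$ ip link add fou1 type ipip \
    remote 10.20.30.40 local 10.1.2.3 ttl 225 \
    encap fou encap-sport 5555 encap-dport 5555
```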
Session persistence
In order to keep a long TCP session over an FoU tunnel, all packets of the session must be sent to the same egress Pod. For this reason, Coil enables session affinity on the Services in front of egress Pods.
The affinity information of idle sessions is dropped after no packets are received or sent for a certain time. By default, the idle timeout duration is 3 hours. To keep a session longer, there are two options:
- Enable TCP keepalive. This may require modification of the application.
- Configure a longer idle timeout with `.spec.sessionAffinityConfig.clientIP.timeoutSeconds`. See below.
```yaml
apiVersion: coil.cybozu.com/v2
kind: Egress
metadata:
  namespace: other-network
  name: egress
spec:
  # snip
  sessionAffinityConfig:
    clientIP:
      timeoutSeconds: 43200
```
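Presumably, this setting is carried through to the standard session-affinity fields of the generated Service; the relevant part would look like this (a sketch, not the full Service Coil creates):

```yaml
# Sketch: session-affinity fields of the Service in front of the egress Pods.
apiVersion: v1
kind: Service
spec:
  sessionAffinity: ClientIP
  sessionAffinityConfig:
    clientIP:
      timeoutSeconds: 43200   # 12 hours
```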
Routing rules
Coil sets up the default route via `eth0` for each Pod.
This is reasonable because, in general, Coil does not know the subnet addresses used for Nodes and Services.
Now, suppose that a client Pod wants to send packets to the Internet via an `Egress` like this:
```yaml
apiVersion: coil.cybozu.com/v2
kind: Egress
spec:
  destinations:
    - 0.0.0.0/0
...
```
Coil cannot add a route for this via the FoU tunnel link because it conflicts with the existing default route.
Likewise, in an egress Pod, Coil cannot add a route to a client Pod via the FoU tunnel link because doing so would prevent sending FoU-encapsulated packets to the client. Coil should send a packet to the FoU tunnel link only when the packet is not yet encapsulated. The following picture illustrates what should be done for the returned packets.
To resolve these issues, Coil uses policy routing.
Policy routing is a Linux kernel feature that allows routing decisions based on more than just the destination address.
For more information about policy routing, please refer to the `ip rule` man page.
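As a rough illustration, the following commands (with illustrative table number 117 and priority 1900, matching the dumps below) route one external network through an FoU link via a dedicated table. When a table selected by a rule has no matching route, the kernel falls through to the next rule, which is what makes the per-`Egress` tables below work.

```console
# Sketch: put the route to the external network in table 117, then
# make rule 1900 consult that table for every packet. Packets that
# do not match any route in table 117 fall through to later rules.
$ ip route add 192.168.10.0/24 dev fou1 table 117
$ ip rule add from all lookup 117 priority 1900
```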
Rules in a client Pod
The following is an example of the routing rules and routing table contents in a client Pod.
The client Pod has opted in to two `Egress` gateways, one for `0.0.0.0/0` and another for `192.168.10.0/24`.
```console
# ip rule
0:      from all lookup local
1800:   from all to 169.254.0.0/16 lookup main
1900:   from all lookup 117
2000:   from all to 10.0.0.0/8 lookup main
2001:   from all to 172.16.0.0/12 lookup main
2002:   from all to 192.168.0.0/16 lookup main
2100:   from all lookup 118
32766:  from all lookup main
32767:  from all lookup default
```
Rules inserted by Coil are:
- 1800: prioritize packets to be sent to a link-local peer
- 1900: send packets to FoU links for private external networks
- 2000, 2001, 2002: send packets to IPv4 private networks using the main table (for Nodes, Pods, and Cluster IPs)
- 2100: send packets to FoU links for the public external network (the Internet)
Rules 1900 and 2100 use dedicated routing tables, which would look like this:
```console
# ip route show table 117
192.168.10.0/24 dev fou4_0a440c31 proto 30
# ip route show table 118
default dev fou4_0a447c91 proto 30
```
Rules in an egress Pod
To send only non-encapsulated packets to FoU links, Coil inserts a rule with `iif eth0 lookup 118` (`iif` stands for incoming interface) as follows:
```console
# ip rule
0:      from all lookup local
2000:   from all iif eth0 lookup 118
32766:  from all lookup main
32767:  from all lookup default
```
Routing table 118 would look like this:
```console
# ip route show table 118
10.20.30.40 dev fou4_0a141e28 proto 30
```
The implementation
The core of this feature is available as a reusable package; `nat_test.go` is a complete example of how to use the package.
Summary
Coil implements a highly available egress NAT service on Kubernetes using Foo over UDP tunnels and standard Kubernetes resources. The implementation supports both IPv4 and IPv6 and is available as a reusable package on GitHub.
That's it. Thank you for reading!