By Hirotaka Yamamoto (@ymmt2005)
In the previous article, we introduced Coil v2 and its opt-in egress NAT feature, which allows Kubernetes Pods to initiate outgoing TCP/UDP communication with external networks.
This article dives into how it is implemented. For the impatient, here is the summary:
- Coil reconciles the `Egress` custom resource with a Deployment and a Service.
- To route packets from a client Pod, Coil establishes a bidirectional tunnel between the client Pod and the egress Pods.
- Foo over UDP (FoU) is chosen as the tunnel protocol to make the NAT service highly available.
Table of contents
- The problems
- Anatomy of the egress NAT feature
- How Foo over UDP tunnel works
- Bidirectional tunneling
- Session persistence
- Routing rules
- The implementation
The problems

Implementing a NAT gateway on Kubernetes raises several problems that need to be resolved.
The first problem is how to route packets from clients to a NAT gateway. Because of its transparent nature, application programs are not aware of NAT gateways. So, packets from applications have the destination address of external networks. As the underlying network is also not aware of NAT gateways built on Kubernetes, the packets would not be sent to NAT gateway Pods.
A general solution to this problem is tunneling. Tunneling is a technology that encapsulates packets, sends them to an arbitrary destination, and decapsulates them at the destination where they arrive. We will establish tunnels between client Pods and NAT gateway Pods.
Another problem is how to make NAT gateways highly available. Because Pods are ephemeral resources, they are usually placed behind a Service. A Service has a stable virtual IP address and distributes client requests to the target Pods. So, the problem can be translated into how to establish tunnels between client Pods and NAT gateway Pods over Services.
Anatomy of the egress NAT feature
The above picture illustrates how egress NAT works.
Suppose that Pods in the `internet-egress` namespace can communicate with services on the Internet. Likewise, suppose that Pods in the `domestic-egress` namespace can communicate with services on a domestic network, say, `192.168.10.0/24`. This can be achieved with Coil's egress NAT feature.
Let's see how Pods in the `default` namespace communicate with external networks.
Egress custom resource
Egress is a custom resource to define a NAT gateway service in a namespace.
The following `Egress` is for the Internet:
```yaml
apiVersion: coil.cybozu.com/v2
kind: Egress
metadata:
  namespace: internet-egress
  name: nat
spec:
  replicas: 2
  destinations:
    - 0.0.0.0/0
    - ::/0
```
whereas the following is for a domestic network:
```yaml
apiVersion: coil.cybozu.com/v2
kind: Egress
metadata:
  namespace: domestic-egress
  name: nat
spec:
  replicas: 2
  destinations:
    - 192.168.10.0/24
```
Coil creates a Deployment and a Service from an `Egress` in the same namespace.
The Deployment then creates Pods that run the NAT gateway. These Pods are called egress Pods.
The Service works as a load balancer of the egress Pods.
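Coil generates this Service itself, but a hand-written sketch helps to see its shape. The following is an illustrative assumption, not Coil's actual manifest: the selector label is made up, and the UDP port merely matches the FoU examples later in this article.

```yaml
# Hypothetical sketch: a Service load-balancing FoU (UDP) traffic to egress Pods.
# The label selector and port are illustrative assumptions.
apiVersion: v1
kind: Service
metadata:
  namespace: internet-egress
  name: nat
spec:
  type: ClusterIP
  selector:
    app.kubernetes.io/name: nat   # assumed label on the egress Pods
  ports:
    - protocol: UDP
      port: 5555                  # FoU encapsulation port used in the examples below
```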
Service and tunnel protocol
Coil uses a Service in front of egress Pods to make the NAT service highly available. Since Service supports only TCP and UDP, we need to choose a tunneling protocol that encapsulates packets in UDP.
Among others, we chose Foo over UDP (FoU) as the tunneling protocol. FoU encapsulates packets with a UDP header and supports both IPv4 and IPv6.
See An introduction to Linux virtual interfaces: Tunnels if you are interested in other options.
Not all Pods need to communicate with external networks. In fact, allowing all Pods to do so could introduce certain security risks.
Therefore, Coil sets up tunnels for egress NAT only on Pods that have opted in.
To set up a tunnel for an `Egress` named `nat` in the namespace `domestic-egress`, set an annotation like this:
```yaml
apiVersion: v1
kind: Pod
metadata:
  annotations:
    egress.coil.cybozu.com/domestic-egress: nat
  ...
```
This design also allows you to use different sets of source IP addresses depending on the workload as follows:
- Create two `AddressPool` resources for two different sets of IPv4 global addresses
- Create two Namespaces and annotate them to use different IPv4 global pools
- Create an `Egress` for each Namespace
- Specify either one of the `Egress` resources depending on the workload
Inside an egress Pod, a program called `coil-egress` watches `kube-apiserver` to detect new client Pods and their deletions.
For each client Pod, `coil-egress` sets up a FoU tunnel and routing rules.
It also configures `iptables` for source network address translation (SNAT).
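The essence of the SNAT configuration can be sketched with a single iptables rule. This is an illustration of the technique, not Coil's exact rule set:

```
# Illustrative sketch, not Coil's actual rules: rewrite the source address of
# packets leaving the egress Pod via eth0 to the Pod's own address (masquerade).
# Must be run as root inside the egress Pod's network namespace.
iptables -t nat -A POSTROUTING -o eth0 -j MASQUERADE
```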
How Foo over UDP tunnel works
Let's take a quick detour and look at how to set up a Foo over UDP tunnel.
FoU acts as a wrapper for existing IPv4 or IPv6 tunnels. The simplest FoU encapsulations are those that wrap IPv4-in-IPv4 (IPIP) and IPv6-in-IPv6 (IP6IP6).
To send IPv4-in-IPv4 packets encapsulated with FoU headers to an external network `192.168.10.0/24` via a NAT gateway `10.1.2.3`, create an `ipip` (protocol number 4) link with the `encap fou` option and add a routing rule as follows:
```console
$ ip link add fou1 type ipip \
    remote 10.1.2.3 local 10.20.30.40 ttl 225 \
    encap fou encap-sport 5555 encap-dport 5555
$ ip link set fou1 up
$ ip route add 192.168.10.0/24 dev fou1
```
To receive IPv4-in-IPv4 packets encapsulated with FoU headers from a NAT gateway `10.1.2.3`, configure a UDP listening port and create a matching `ipip` link as follows:
```console
$ ip fou add port 5555 ipproto 4
$ ip link add fou1 type ipip \
    remote 10.1.2.3 local 10.20.30.40 ttl 225 \
    encap fou encap-sport 5555 encap-dport 5555
$ ip link set fou1 up
```
The same FoU link can be used for both sending and receiving packets.
FoU for IPv6 is basically the same. Create an `ip6tnl` link with `mode ip6ip6` and `encap fou`. The listening port needs to be created for `ip6ip6` (protocol number 41) with the `-6` flag:
```console
$ ip link add fou2 type ip6tnl mode ip6ip6 \
    remote fd04::1 local fd02::1 ttl 225 \
    encap fou encap-sport 5555 encap-dport 5555
$ ip fou add port 5555 ipproto 41 -6
```
Bidirectional tunneling

Although an egress Pod can directly send packets to its client Pods, the tunnel between client and egress Pods needs to be bidirectional. Otherwise, packets coming back through the egress Pod from the external network may not reach the client Pod for the following reasons:
- The source address of the returned packet can be a global IP address, and such packets are often rejected by NetworkPolicy.
- If the packet is TCP's SYN-ACK, it is often dropped by underlying firewalls because there seems to be no corresponding SYN packet; the SYN was sent through the FoU tunnel.
Therefore, Coil sets up FoU links and routing rules for sending and receiving in both egress Pods and client Pods.
Session persistence

Because TCP is a stateful protocol, all packets in a TCP session must be sent to the same egress Pod. For this reason, Coil usually enables session affinity of the Services in front of egress Pods.
One subtle problem of this is that the affinity can't be kept forever.
A Service is usually implemented by `kube-proxy` using iptables or IPVS. Both iptables and IPVS have a certain maximum duration to keep the affinity of a session.
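Kubernetes exposes this duration through the Service's `sessionAffinityConfig`. For example, a Service with ClientIP affinity keeps a session on the same Pod for at most `timeoutSeconds`:

```yaml
# Fragment of a Service spec with session affinity enabled.
apiVersion: v1
kind: Service
spec:
  sessionAffinity: ClientIP
  sessionAffinityConfig:
    clientIP:
      timeoutSeconds: 10800  # the default (3 hours); affinity is dropped after this
```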
Alternatively, you may run `kube-proxy` in IPVS mode with the `sh` scheduler.
`sh` (source hashing) is an algorithm that determines the destination Pod based on the source address of packets.
This means that the session can be maintained as long as the set of destination Pods remains unchanged.
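For reference, IPVS mode and the source-hashing scheduler are selected through kube-proxy's configuration file. A minimal fragment might look like this (assuming the `v1alpha1` config API):

```yaml
apiVersion: kubeproxy.config.k8s.io/v1alpha1
kind: KubeProxyConfiguration
mode: "ipvs"
ipvs:
  scheduler: "sh"  # source hashing: pick the destination Pod from the packet's source address
```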
For this case, Coil gives an option to disable Service session affinity in `Egress` as follows:
```yaml
apiVersion: coil.cybozu.com/v2
kind: Egress
spec:
  sessionAffinity: None
  ...
```
Routing rules

Coil sets up the default route via `eth0` for each Pod.
This is reasonable because, in general, Coil does not know the subnet addresses used for Nodes and Services.
Now, suppose that a client Pod wants to send packets to the Internet via an `Egress` like this:
```yaml
apiVersion: coil.cybozu.com/v2
kind: Egress
spec:
  destinations:
    - 0.0.0.0/0
  ...
```
Coil cannot add a route for this via the FoU tunnel link because it conflicts with the existing default route.
Likewise, in an egress Pod, Coil cannot add a route to a client Pod via the FoU tunnel link because doing so would prevent sending FoU-encapsulated packets to the client. Coil should send a packet to the FoU tunnel link only when the packet is not yet encapsulated. The following picture illustrates what should be done for the returned packets.
To resolve these issues, Coil uses policy routing.
Policy routing is a feature of the Linux kernel that allows routing decisions based on criteria other than the destination address alone.
For more information about policy routing, please refer to the `ip rule` man page.
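As a minimal illustration of policy routing (not Coil's exact commands), the following tells the kernel to consult a dedicated routing table only for packets that arrive on `eth0`, assuming the `fou1` link from the earlier examples:

```
# Illustrative sketch: packets arriving on eth0 are routed with table 118;
# everything else keeps using the normal tables. Requires root.
ip rule add iif eth0 lookup 118 priority 2000
ip route add 192.168.10.0/24 dev fou1 table 118
```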
Rules in a client Pod
The following is an example of the routing rules and routing table contents in a client Pod.
The client Pod has opted in to two `Egress` gateways, one for `0.0.0.0/0` and another for `192.168.10.0/24`.
```console
# ip rule
0:      from all lookup local
1800:   from all to 169.254.0.0/16 lookup main
1900:   from all lookup 117
2000:   from all to 10.0.0.0/8 lookup main
2001:   from all to 172.16.0.0/12 lookup main
2002:   from all to 192.168.0.0/16 lookup main
2100:   from all lookup 118
32766:  from all lookup main
32767:  from all lookup default
```
Rules inserted by Coil are:
- 1800: prioritize packets to be sent to a link-local peer
- 1900: send packets to FoU links for private external networks
- 2000, 2001, 2002: send packets to IPv4 private networks using the main table (for Nodes, Pods, and Cluster IPs)
- 2100: send packets to FoU links for the public external network (the Internet)
Rules 1900 and 2100 use special routing tables. They would look like this:
```console
# ip route show table 117
192.168.10.0/24 dev fou4_0a440c31 proto 30
# ip route show table 118
default dev fou4_0a447c91 proto 30
```
Rules in an egress Pod
To send only non-encapsulated packets to FoU links, Coil inserts a rule with `iif eth0 lookup 118` (`iif` stands for incoming interface) as follows:
```console
# ip rule
0:      from all lookup local
2000:   from all iif eth0 lookup 118
32766:  from all lookup main
32767:  from all lookup default
```
The routing table 118 would look like:
```console
# ip route show table 118
10.20.30.40 dev fou4_0a141e28 proto 30
```
The implementation

The core of this feature is available as a reusable package.
`nat_test.go` is a complete example of how to use the package.
Coil implements a highly available egress NAT service on Kubernetes using Foo over UDP tunnels and standard Kubernetes resources. The implementation supports both IPv4 and IPv6 and is available as a reusable package on GitHub.
That's it. Thank you for reading!