By Hirotaka Yamamoto (@ymmt2005)
In the previous article, we introduced Coil v2 and its opt-in egress NAT feature. The feature allows any Kubernetes Pods to initiate outgoing TCP/UDP communication with external networks.
This article dives into how it is implemented. For the impatient, here is the summary:
- Coil reconciles the `Egress` custom resource with a Deployment and a Service.
- To route packets from a client Pod, Coil establishes a bidirectional tunnel between the client Pod and the egress Pods.
- Foo over UDP (FoU) is chosen as the tunnel protocol to make the NAT service highly available.
Table of contents
- The problems
- Anatomy of the egress NAT feature
- How Foo over UDP tunnel works
- Bidirectional tunneling
- Session persistence
- Routing rules
- The implementation
- Summary
The problems
Implementing a NAT gateway on Kubernetes has several problems that need to be resolved.
The first problem is how to route packets from clients to a NAT gateway. Because of its transparent nature, application programs are not aware of NAT gateways. So, packets from applications carry destination addresses in external networks. As the underlying network is also not aware of NAT gateways built on Kubernetes, such packets would not be delivered to NAT gateway Pods.
A general solution to this problem is tunneling. Tunneling is a technology that encapsulates packets, sends them to an arbitrary destination, and decapsulates them at the destination where they arrive. We will establish tunnels between client Pods and NAT gateway Pods.
Another problem is how to make NAT gateways highly available. Because Pods are ephemeral, they are usually hidden behind a Service. A Service has a stable virtual IP address and distributes client requests to the target Pods. So, the problem can be translated into how to establish tunnels between client Pods and NAT gateway Pods over Services.
Anatomy of the egress NAT feature
The above picture illustrates how egress NAT works.
Suppose that Pods in the `internet-egress` namespace can communicate with services on the Internet. Likewise, suppose that Pods in the `domestic-egress` namespace can communicate with services on a domestic network, say, `192.168.10.0/24`.
This can be achieved by using Coil's `AddressPool` feature.
Let's see how Pods in the `default` namespace communicate with external networks.
`Egress` custom resource
`Egress` is a custom resource to define a NAT gateway service in a namespace. The following `Egress` is for the Internet:
```yaml
apiVersion: coil.cybozu.com/v2
kind: Egress
metadata:
  namespace: internet-egress
  name: nat
spec:
  replicas: 2
  destinations:
    - 0.0.0.0/0
    - ::/0
```
whereas the following is for a domestic network `192.168.10.0/24`:
```yaml
apiVersion: coil.cybozu.com/v2
kind: Egress
metadata:
  namespace: domestic-egress
  name: nat
spec:
  replicas: 2
  destinations:
    - 192.168.10.0/24
```
Coil creates a Deployment and a Service from an `Egress` in the same namespace.
The Deployment then creates Pods that run the `coil-egress` container.
These Pods are called egress Pods.
The Service works as a load balancer for the egress Pods.
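For illustration, the generated Service might look roughly like the following. This is a hypothetical sketch: the selector label and the port number are assumptions, not necessarily what Coil actually generates.

```yaml
# Sketch of a Service that could sit in front of the egress Pods.
# The selector label and port 5555 are illustrative assumptions.
apiVersion: v1
kind: Service
metadata:
  namespace: internet-egress
  name: nat
spec:
  selector:
    app.kubernetes.io/name: nat   # hypothetical label on egress Pods
  ports:
    - protocol: UDP               # FoU tunnels run over UDP
      port: 5555
      targetPort: 5555
  sessionAffinity: ClientIP       # see "Session persistence" below
```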
Service and tunnel protocol
Coil puts a Service in front of the egress Pods to make the NAT service highly available. Since Service supports only TCP and UDP, we need to choose a tunneling protocol that encapsulates packets in UDP.
Among others, we chose Foo over UDP (FoU) as the tunneling protocol. FoU encapsulates packets with a UDP header and supports both IPv4 and IPv6.
See An introduction to Linux virtual interfaces: Tunnels if you are interested in other options.
Client Pod
Not all Pods need to communicate with external networks. In fact, allowing all Pods to do so could introduce certain security risks.
Therefore, Coil sets up tunnels for egress NAT only in Pods that have opted in.
To set up a tunnel for an `Egress` named `nat` in the `domestic-egress` namespace, set an annotation like this:
```yaml
apiVersion: v1
kind: Pod
metadata:
  annotations:
    egress.coil.cybozu.com/domestic-egress: nat
...
```
This design also allows you to use different sets of source IP addresses depending on the workload, as follows (a sketch of the manifests appears after the list):
- Create two `AddressPool` resources for two different sets of IPv4 global addresses
- Create two Namespaces and annotate them to use different IPv4 global pools
- Create an `Egress` for each Namespace
- Specify either one of the `Egress` resources depending on the workload
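Here is a minimal sketch of that setup. All names, the address range, and the pool-selection annotation (`coil.cybozu.com/pool`) are assumptions for illustration; only one of the two pool/Namespace/Egress triples is shown.

```yaml
# Hypothetical AddressPool holding one set of global IPv4 addresses.
apiVersion: coil.cybozu.com/v2
kind: AddressPool
metadata:
  name: global1
spec:
  subnets:
    - ipv4: 203.0.113.0/28    # illustrative address range
---
# Namespace annotated to allocate Pod addresses from that pool.
apiVersion: v1
kind: Namespace
metadata:
  name: egress1
  annotations:
    coil.cybozu.com/pool: global1
---
# Egress in that Namespace; client Pods opt in to either this
# Egress or its counterpart in the second Namespace.
apiVersion: coil.cybozu.com/v2
kind: Egress
metadata:
  namespace: egress1
  name: nat
spec:
  destinations:
    - 0.0.0.0/0
```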
Egress Pod
Inside an egress Pod, a program called `coil-egress` watches `kube-apiserver` to detect new client Pods and their deletion.
For each client Pod, `coil-egress` sets up an FoU tunnel and routing rules.
It also configures `iptables` for source network address translation (SNAT).
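A typical SNAT rule of this kind would masquerade decapsulated packets as they leave the egress Pod. This is only a sketch; the actual rules `coil-egress` installs may be more specific.

```console
# Sketch: rewrite the source address of packets leaving through eth0
# to the egress Pod's own address (which comes from the global pool).
$ iptables -t nat -A POSTROUTING -o eth0 -j MASQUERADE
```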
How Foo over UDP tunnel works
Let's take a quick detour and look at how to set up a Foo over UDP tunnel.
FoU acts as a wrapper for existing IPv4 or IPv6 tunnels. The simplest FoU encapsulations are those that wrap IPv4-in-IPv4 (IPIP) and IPv6-in-IPv6 (IP6IP6).
To send IPv4-in-IPv4 packets encapsulated with FoU headers to an external network `192.168.10.0/24` via a NAT gateway `10.1.2.3`, create an `ipip` (protocol number 4) link with the `encap fou` option and add a routing rule as follows:
```console
$ ip link add fou1 type ipip \
    remote 10.1.2.3 local 10.20.30.40 ttl 225 \
    encap fou encap-sport 5555 encap-dport 5555
$ ip link set fou1 up
$ ip route add 192.168.10.0/24 dev fou1
```
To receive IPv4-in-IPv4 packets encapsulated with FoU headers from a NAT gateway `10.1.2.3`, configure a UDP listening port and create a matching `ipip` link as follows:
```console
$ ip fou add port 5555 ipproto 4
$ ip link add fou1 type ipip \
    remote 10.1.2.3 local 10.20.30.40 ttl 225 \
    encap fou encap-sport 5555 encap-dport 5555
$ ip link set fou1 up
```
The same FoU link can be used for both sending and receiving packets.
FoU for IPv6 is basically the same. Create an `ip6tnl` link with `mode ip6ip6`.
The listening port needs to be created for `ip6ip6` (protocol number 41) with the `-6` flag.
```console
$ ip link add fou2 type ip6tnl mode ip6ip6 \
    remote fd04::1 local fd02::1 ttl 225 \
    encap fou encap-sport 5555 encap-dport 5555
$ ip fou add port 5555 ipproto 41 -6
```
Bidirectional tunneling
Although an egress Pod can directly send packets to its client Pods, the tunnel between client and egress Pods needs to be bidirectional. Otherwise, packets coming back through the egress Pod from the external network may not reach the client Pod for the following reasons:
- The source address of the returned packet can be a global IP address, and such packets are often rejected by NetworkPolicy.
- If the packet is TCP's SYN-ACK, it is often dropped by underlying firewalls because there seems to be no corresponding SYN packet. Note that the SYN packet was sent through the FoU tunnel.
Therefore, Coil sets up FoU links and routing rules for sending and receiving in both egress Pods and client Pods.
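Conceptually, both ends run the same kind of configuration with `local` and `remote` swapped. The following sketch reuses the hypothetical addresses from above (client `10.20.30.40`, egress Pod `10.1.2.3`); the actual link names and commands Coil issues may differ.

```console
# On the client Pod (sketch): send packets via the tunnel and also
# decapsulate return packets arriving from the egress Pod.
$ ip fou add port 5555 ipproto 4
$ ip link add fou1 type ipip \
    remote 10.1.2.3 local 10.20.30.40 ttl 225 \
    encap fou encap-sport 5555 encap-dport 5555

# On the egress Pod (sketch): the mirror image, so that SYN-ACKs and
# other return packets travel back through the same tunnel.
$ ip fou add port 5555 ipproto 4
$ ip link add fou1 type ipip \
    remote 10.20.30.40 local 10.1.2.3 ttl 225 \
    encap fou encap-sport 5555 encap-dport 5555
```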
Session persistence
In order to keep a long TCP session over an FoU tunnel, all packets of the session must be sent to the same egress Pod. For this reason, Coil enables session affinity on the Services in front of egress Pods.
The affinity information of idle sessions is dropped after no packets are received or sent for a certain time. By default, the idle timeout duration is 3 hours. To keep a session longer, there are two options:
- Enable TCP keepalive. This may require modification of the application.
- Configure a longer idle timeout with `.spec.sessionAffinityConfig.clientIP.timeoutSeconds`. See below.
```yaml
apiVersion: coil.cybozu.com/v2
kind: Egress
metadata:
  namespace: other-network
  name: egress
spec:
  # snip
  sessionAffinityConfig:
    clientIP:
      timeoutSeconds: 43200
```
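Presumably, this setting is carried through to the standard session-affinity fields of the generated Service; the relevant part would look like this (a sketch, not the full Service Coil creates):

```yaml
# Sketch: session-affinity fields of the Service in front of the egress Pods.
apiVersion: v1
kind: Service
spec:
  sessionAffinity: ClientIP
  sessionAffinityConfig:
    clientIP:
      timeoutSeconds: 43200   # 12 hours
```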
Routing rules
Coil sets up the default route via `eth0` for each Pod.
This is reasonable because, in general, Coil does not know the subnet addresses used for Nodes and Services.
Now, suppose that a client Pod wants to send packets to the Internet via an `Egress` like this:
```yaml
apiVersion: coil.cybozu.com/v2
kind: Egress
spec:
  destinations:
    - 0.0.0.0/0
...
```
Coil cannot add a route for this via the FoU tunnel link because it conflicts with the existing default route.
Likewise, in an egress Pod, Coil cannot add a route to a client Pod via the FoU tunnel link because doing so would prevent sending FoU-encapsulated packets to the client. Coil should send a packet to the FoU tunnel link only when the packet is not yet encapsulated. The following picture illustrates what should be done for the returned packets.
To resolve these issues, Coil uses policy routing.
Policy routing is a Linux kernel feature that allows routing decisions based on more than just the destination address.
For more information about policy routing, please refer to the `ip rule` man page.
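As a rough illustration, the following commands (with illustrative table number 117 and priority 1900, matching the dumps below) route one external network through an FoU link via a dedicated table. When a table selected by a rule has no matching route, the kernel falls through to the next rule, which is what makes the per-`Egress` tables below work.

```console
# Sketch: put the route to the external network in table 117, then
# make rule 1900 consult that table for every packet. Packets that
# do not match any route in table 117 fall through to later rules.
$ ip route add 192.168.10.0/24 dev fou1 table 117
$ ip rule add from all lookup 117 priority 1900
```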
Rules in a client Pod
The following is an example of the routing rules and routing table contents in a client Pod.
The client Pod has opted in to two `Egress` gateways, one for `0.0.0.0/0` and another for `192.168.10.0/24`.
```console
# ip rule
0:      from all lookup local
1800:   from all to 169.254.0.0/16 lookup main
1900:   from all lookup 117
2000:   from all to 10.0.0.0/8 lookup main
2001:   from all to 172.16.0.0/12 lookup main
2002:   from all to 192.168.0.0/16 lookup main
2100:   from all lookup 118
32766:  from all lookup main
32767:  from all lookup default
```
Rules inserted by Coil are:
- 1800: prioritize packets to be sent to a link-local peer
- 1900: send packets to FoU links for private external networks
- 2000, 2001, 2002: send packets to IPv4 private networks using the main table (for Nodes, Pods, and Cluster IPs)
- 2100: send packets to FoU links for the public external network (the Internet)
Rules 1900 and 2100 use dedicated routing tables, which would look like this:
```console
# ip route show table 117
192.168.10.0/24 dev fou4_0a440c31 proto 30
# ip route show table 118
default dev fou4_0a447c91 proto 30
```
Rules in an egress Pod
To send only non-encapsulated packets to FoU links, Coil inserts a rule with `iif eth0 lookup 118` (`iif` stands for incoming interface) as follows:
```console
# ip rule
0:      from all lookup local
2000:   from all iif eth0 lookup 118
32766:  from all lookup main
32767:  from all lookup default
```
Routing table 118 would look like this:
```console
# ip route show table 118
10.20.30.40 dev fou4_0a141e28 proto 30
```
The implementation
The core of this feature is available as a reusable package; `nat_test.go` is a complete example of how to use the package.
Summary
Coil implements a highly available egress NAT service on Kubernetes using Foo over UDP tunnels and standard Kubernetes resources. The implementation supports both IPv4 and IPv6 and is available as a reusable package on GitHub.
That's it. Thank you for reading!