By Hirotaka Yamamoto (@ymmt2005)
kintone.com, a low-code application platform service, is branded as cybozu.com in Japan. Today, cybozu.com is used by more than 30,000 customer companies. To provide a reliable and cost-effective service, we have several geographically distributed data centers.
Project Neco is an effort to re-architect our data centers to make them resilient, manageable, and observable with robust automation. In short, we are building Cloud Native data centers.
In this article, we introduce the project: its motivation, scope, design, current status, and open-source deliverables.
The rise of Docker and Kubernetes brought us core concepts for building resilient, self-healing, and scalable applications. Key concepts include declarative APIs, loosely coupled components, and stateless containers.
What if Kubernetes itself and the underlying servers could be managed in the same way as applications on Kubernetes?
Yes, this is what we wanted to do. The main challenges are:
- CI/CD: How to test systems for on-premise data centers?
- Self-Healing: How to detect and replace failed servers automatically?
- Networking: How to implement a software-defined, scalable, and fault-tolerant network stack?
- Storage: How to provide scalable and fault-tolerant storage to Kubernetes applications?
Neco started in January 2018 and is expected to be completed by the end of 2020. The project is led by Hirotaka Yamamoto (me).
Scopes and goals
- Continuously deliver and update the data center system including Kubernetes.
- Design and implement a software-defined network for large Kubernetes clusters.
- Run stateful workloads such as Elasticsearch or MySQL on local storage devices.
At the very beginning of the project, we defined the following design principles. These principles are intended to guide project members to make the right technical decisions.
Just like Kubernetes, all software developed in Neco should have a declarative API or declarative interface.
Define by software
In other words, Neco does not depend on special-purpose networks or hardware; they are sources of toil and rigidity.
We do our best to make everything, including servers and networks, testable.
These principles have been working very well at every decision in Neco development.
As of September 2019, we have set up and are running three real data centers based on Neco.
Currently, we are adding more features and implementing a soft multi-tenancy environment.
A Neco data center consists of a few (three to five) management servers called boot servers and thousands of servers for a Kubernetes cluster.
Boot servers are identical to one another except for their host names and IP addresses.
All persistent data are stored in an etcd cluster. Thanks to etcd, adding or removing a boot server can be done easily.
Sabakan is a network boot server supporting DHCP and HTTP. It also manages an inventory of Node servers and assigns IP addresses automatically.
CKE, or Cybozu Kubernetes Engine, is a CNCF-certified Kubernetes installer. It queries Sabakan for Node server status and chooses Nodes for the Kubernetes control plane and worker pool. The TLS certificates required for Kubernetes are issued by Vault.
CKE is not just an installer: it can also upgrade the Kubernetes cluster in place and keep Kubernetes components and nodes healthy.
Neco Continuous Delivery System (Neco CD)
To bootstrap and maintain boot servers, we have created a simple continuous delivery system called "Neco CD". All software in Neco can be delivered through this system in GitOps fashion.
A Kubernetes cluster created with CKE is a vanilla installation; it has almost nothing useful installed.
We have selected a set of applications such as MetalLB, Calico, and Teleport to make the vanilla Kubernetes fully featured. The manifests of these applications are maintained in github.com/cybozu-go/neco-apps and continuously delivered by Argo CD.
This article is just a brief introduction to Project Neco. We will post follow-up articles diving into details such as:
- L2-less BGP oriented data center networking
- The strategy of continuous integration and delivery
- CKE: our certified Kubernetes installer
- TopoLVM: a CSI plugin to dynamically schedule and create LVM volumes
- Soft multi-tenancy implementation using Argo CD and custom admission webhooks
Please visit the main repository if you want to learn more.