Kintone Engineering Blog

Production-grade Deployment of PVC-based Rook/Ceph Cluster

By Satoru Takeuchi (@satoru-takeuchi)

Introduction

Rook/Ceph supports two types of clusters: the "host-based cluster" and the "PVC-based cluster". The former specifies host paths and raw devices to create OSDs, while the latter specifies a storage class and volumeClaimTemplate that Rook uses to consume storage via PVCs. This article introduces a production-grade deployment of a PVC-based Rook/Ceph cluster by referring to my team's cluster.

The most important benefit of a PVC-based cluster is that you don't need to mind the topology of your Kubernetes cluster. You only need to specify the storageClassDeviceSets field as follows.

apiVersion: ceph.rook.io/v1
kind: CephCluster
...
spec:
...
  storage:
    storageClassDeviceSets:
      - name: set1
        count: 10 # The number of OSDs
...
        volumeClaimTemplates:
          - metadata:
              name: data
            spec:
              resources:
                requests:
                  storage: 10Gi # The size of OSDs
              storageClassName: local-storage
              volumeMode: Block
...

There are no host names or device names in the above configuration. Kubernetes decides the proper placement of OSDs with the TopologySpreadConstraints feature, as described later.

Provisioning OSDs

It's very convenient to use CSI drivers that support dynamic volume provisioning. With such a driver, you don't need to provision PVs beforehand and can add OSDs just by incrementing the storageClassDeviceSets->count field. We use TopoLVM for this purpose. Please refer to this article and this presentation if you are interested in TopoLVM.

Here is our cluster's configuration.

...
  storage:
    storageClassDeviceSets:
      - name: set1
        count: 20
...
        volumeClaimTemplates:
          - metadata:
              name: data
            spec:
              resources:
                requests:
                  storage: 1Ti
...
              storageClassName: topolvm-provisioner
...
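
For reference, the topolvm-provisioner storage class referenced above might look like the following minimal sketch. This is an assumption based on TopoLVM's documentation rather than a copy of our manifest; in particular, topolvm.cybozu.com is the CSI driver name TopoLVM used at the time of writing.

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: topolvm-provisioner
provisioner: topolvm.cybozu.com # TopoLVM's CSI driver name (assumed)
volumeBindingMode: WaitForFirstConsumer # let scheduling pick a node with enough LVM capacity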

If no such CSI driver is available in your environment, you need to provision PVs beforehand. It's a fair amount of work, but you can mitigate it by using automatic PV provisioners like local-static-provisioner and our local-pv-provisioner.
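
For example, a statically provisioned PV and its storage class for a single raw device might look like the following sketch. The storage class name matches the first example above, while the node name worker1 and the device path /dev/sdb are hypothetical.

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: local-storage
provisioner: kubernetes.io/no-provisioner # PVs are created by hand or by a provisioner tool
volumeBindingMode: WaitForFirstConsumer
---
apiVersion: v1
kind: PersistentVolume
metadata:
  name: local-pv-worker1-sdb
spec:
  capacity:
    storage: 10Gi
  volumeMode: Block # OSDs consume the device as a raw block volume
  accessModes:
    - ReadWriteOnce
  persistentVolumeReclaimPolicy: Retain
  storageClassName: local-storage
  local:
    path: /dev/sdb # hypothetical raw device
  nodeAffinity: # a local PV must be pinned to the node that has the device
    required:
      nodeSelectorTerms:
        - matchExpressions:
            - key: kubernetes.io/hostname
              operator: In
              values:
                - worker1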

Spreading OSD Pods Evenly

By default, Kubernetes doesn't take the topology of storage devices into account, so OSD pods can become imbalanced among nodes and other failure domains like racks. Kubernetes supports the concept of failure domains with the TopologySpreadConstraints feature.

In our case, we use two TopologySpreadConstraints: one to spread OSD pods evenly over all racks*1, and the other to spread them over all nodes, as follows.

...
  storage:
    storageClassDeviceSets:
 ...
        placement:
          topologySpreadConstraints:
            - maxSkew: 1
              topologyKey: topology.kubernetes.io/zone # The nodes in the same rack have the same topology.kubernetes.io/zone label.
              whenUnsatisfiable: ScheduleAnyway
              labelSelector:
                matchExpressions:
                  - key: app
                    operator: In
                    values:
                      - rook-ceph-osd
                      - rook-ceph-osd-prepare
            - maxSkew: 1
              topologyKey: kubernetes.io/hostname
              whenUnsatisfiable: ScheduleAnyway
              labelSelector:
                matchExpressions:
                  - key: app
                    operator: In
                    values:
                      - rook-ceph-osd
                      - rook-ceph-osd-prepare
...

However, it's virtually impossible to keep OSD pod spreading completely even. For example, the number of storage devices in each node may differ greatly in a long-running cluster. At first glance, there are two ways to mitigate this problem. One is to set the maxSkew field to a value larger than one, which allows the number of pods to differ by that much between any two topology domains. However, it's very hard to estimate the proper value of maxSkew, so we didn't adopt this approach.
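
For illustration, such a relaxed constraint might look like the following sketch (not our setting). With maxSkew: 2, the number of OSD pods may differ by up to two between any two racks.

        placement:
          topologySpreadConstraints:
            - maxSkew: 2 # allow the OSD pod count to differ by up to two between any two racks
              topologyKey: topology.kubernetes.io/zone
              whenUnsatisfiable: DoNotSchedule
...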

The other is to set the whenUnsatisfiable field to ScheduleAnyway. The official documentation says "ScheduleAnyway tells the scheduler to still schedule it while prioritizing nodes that minimize the skew". In fact, however, Kubernetes does not try to minimize the skew if the spread constraint is not satisfiable. In addition, Kubernetes developers have stated that this is intended behavior, although they admit it is not well-documented.

The last resort is to maximize the scoring weight of even OSD pod spreading in pod scheduling. Because the default scoring weight of even pod spreading is very low, imbalance of OSD pods happens easily, for example, when there is high CPU load on some nodes. To mitigate this problem, we tuned the priorities field of kube-scheduler to increase the scoring weight of even pod spreading as follows.

  kube-scheduler:
...
    priorities:
    - |
      {
        "name": "SelectorSpreadPriority",
        "weight": 1
      }
    - |
      {
        "name": "InterPodAffinityPriority",
        "weight": 1
      }
    - |
      {
        "name": "LeastRequestedPriority",
        "weight": 1
      }
    - |
      {
        "name": "BalancedResourceAllocation",
        "weight": 1
      }
    - |
      {
        "name": "NodePreferAvoidPodsPriority",
        "weight": 100000
      }
    - |
      {
        "name": "NodeAffinityPriority",
        "weight": 1
      }
    - |
      {
        "name": "TaintTolerationPriority",
        "weight": 1
      }
    - |
      {
        "name": "ImageLocalityPriority",
        "weight": 1
      }
    - |
      {
        "name": "EvenPodsSpreadPriority",
        "weight": 500
      }
...

We have to change the global setting of kube-scheduler because we use Kubernetes 1.17 and there is no other way in this version. However, Kubernetes 1.18 introduced the scheduling profile feature, with which we can change the scoring weight only for OSD pods. So we plan to change the above configuration to use a scheduling profile.
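
The following is a minimal sketch of what such a configuration might look like with the kubescheduler.config.k8s.io/v1alpha2 API introduced in Kubernetes 1.18; it is not our actual setting. The profile name osd-scheduler and the weight 500 are placeholders, and OSD pods would have to set spec.schedulerName to that profile name.

apiVersion: kubescheduler.config.k8s.io/v1alpha2
kind: KubeSchedulerConfiguration
profiles:
  # The default profile keeps the default weights for ordinary pods.
  - schedulerName: default-scheduler
  # A dedicated profile that weights even pod spreading much higher.
  - schedulerName: osd-scheduler
    plugins:
      score:
        disabled:
          - name: PodTopologySpread
        enabled:
          - name: PodTopologySpread
            weight: 500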

Summary

This article introduced a production-grade deployment of a PVC-based Rook/Ceph cluster. Using CSI drivers that support dynamic volume provisioning is very convenient, and there are two other key factors:

  • TopologySpreadConstraints feature of Kubernetes
  • Tuning the scoring weight of even pod spreading in kube-scheduler

If you can't use such CSI drivers, there are automatic PV provisioners that can help.

*1: We regard a rack as a failure domain.