
kubespray restore strategies

Introduction

In our previous article we installed and configured a Kubernetes cluster on the Hetzner cloud. We used kubespray to run the installation, which relies on Ansible playbooks as opposed to the bash scripts we used in our first article. Even though they are harder to read, I prefer playbooks because of idempotency, error handling, rollbacks and roles.

Idempotency ensures I'm working in a declarative manner instead of imperatively issuing commands that might be redundant. Let's say you're adding a user to a Linux system. The first time the command runs fine because the user does not exist; the second time it fails with a non-zero exit code because the user already exists, whereas the Ansible user module simply reports that nothing changed. To give an example of error handling and rollback, imagine the following situation: you want to stop an nginx webserver, apply some configuration changes and restart it. With Ansible you can define a "block" of tasks that applies the changes and a "rescue" section that automatically rolls back to the last working configuration if any task in the block fails. And finally, Ansible roles: they are essentially streamlined playbooks that can be reused in other projects, keeping your automation simple and modular, which is in line with the UNIX philosophy.
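
To make that concrete, here is a minimal sketch of a block/rescue play for the nginx scenario above (the host group, template name and file paths are illustrative, not taken from my actual setup):

---
- name: Update nginx configuration with automatic rollback
  hosts: webservers
  become: true

  tasks:
    # idempotent: reports "ok" instead of failing when the user already exists
    - name: Ensure the admin user exists
      ansible.builtin.user:
        name: admin
        state: present

    - name: Apply config changes, roll back on failure
      block:
        - name: Back up the working configuration
          ansible.builtin.copy:
            src: /etc/nginx/nginx.conf
            dest: /etc/nginx/nginx.conf.bak
            remote_src: true

        - name: Deploy the new configuration
          ansible.builtin.template:
            src: nginx.conf.j2
            dest: /etc/nginx/nginx.conf

        - name: Validate the new configuration
          ansible.builtin.command: nginx -t
          changed_when: false

      rescue:
        - name: Restore the last working configuration
          ansible.builtin.copy:
            src: /etc/nginx/nginx.conf.bak
            dest: /etc/nginx/nginx.conf
            remote_src: true

      always:
        - name: Restart nginx
          ansible.builtin.service:
            name: nginx
            state: restarted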

Working with Kubernetes on bare metal brings some advantages on the licensing side compared to VMware, or to Proxmox paired with a Proxmox Backup Server, but on the other hand it is not as straightforward to rebuild after a catastrophic event. Our RTO (Recovery Time Objective) on a hypervisor is well defined and controllable, but Kubernetes does not offer that out of the box. To build the same level of resilience I wrote a bash script that periodically "pulls" the storage into compressed archives, but I haven't tried rebuilding a cluster from scratch or restoring one from an etcd backup. This article aims to fill that gap and explores different restore strategies.
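
For the etcd piece, the backup itself boils down to taking a snapshot on one of the control plane nodes. A minimal sketch, assuming the certificates live under /etc/ssl/etcd/ssl/ as in a default kubespray install (adjust the paths and endpoint to your own cluster):

# take a timestamped snapshot of etcd; cert paths are assumptions, not my exact setup
ETCDCTL_API=3 etcdctl snapshot save /var/backups/etcd-$(date +%F).db \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/ssl/etcd/ssl/ca.pem \
  --cert=/etc/ssl/etcd/ssl/node-$(hostname).pem \
  --key=/etc/ssl/etcd/ssl/node-$(hostname)-key.pem

# sanity-check the snapshot
ETCDCTL_API=3 etcdctl snapshot status /var/backups/etcd-$(date +%F).db --write-out=table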


A simple nginx deployment

We installed Ansible and the ansible-galaxy tool on the bare metal setup earlier, so we'll just activate the Python environment to use them:

[root@kube-node1 kubespray]$ source env/bin/activate
[root@kube-node1 kubespray]$ ansible-galaxy collection list | grep kubernetes.core
kubernetes.core                          2.4.2  
kubernetes.core                          2.4.2  
[root@kube-node1 kubespray]$ pip install --upgrade pip
[root@kube-node1 kubespray]$ pip install kubernetes
[root@kube-node1 kubespray]$ mkdir -p homelab_playbooks/files
[root@kube-node1 kubespray]$ nano homelab_playbooks/files/nginx-svc.yaml
apiVersion: v1
kind: Service
metadata:
  name: nginx
  namespace: default
spec:
  selector:
    app: nginx
  ports:
    - port: 80
      targetPort: 80
      protocol: TCP
      name: http
  type: ClusterIP
[root@kube-node1 kubespray]$ nano homelab_playbooks/files/nginx-deploy.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx
  namespace: default
spec:
  replicas: 1
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
      - name: nginx
        image: nginx:stable
        ports:
        - containerPort: 80
[root@kube-node1 kubespray]$ nano homelab_playbooks/deploy-nginx.yaml
---
- name: Deploy simple Nginx webserver
#  hosts: kube_control_plane[0]
  hosts: localhost
  become: true
  gather_facts: false

  vars:
    kubeconfig_path: /root/.kube/config

  tasks:
    - name: Apply Nginx manifests
      kubernetes.core.k8s:
        state: present
        kubeconfig: "{{ kubeconfig_path }}"
        src: "{{ item }}"
      loop:
        - nginx-deploy.yaml
        - nginx-svc.yaml

    - name: Wait for Nginx pod to become Ready
      ansible.builtin.shell: |
        kubectl get pods -l app=nginx -n default --no-headers | \
        awk '{print $3}' | grep -vE 'Running|Completed' | wc -l
      register: nginx_status
      retries: 30
      delay: 10
      until: nginx_status.stdout|int == 0

    - name: Confirm Nginx is running
      ansible.builtin.shell: kubectl get pods -l app=nginx -n default
      register: nginx_pods

    - name: Print success message
      ansible.builtin.debug:
        msg: |
          ✅ Nginx successfully deployed and running!
          {{ nginx_pods.stdout }}

[root@kube-node1 kubespray]$ ansible-playbook -i inventory/mycluster/inventory.ini homelab_playbooks/deploy-nginx.yaml
[WARNING]: Skipping callback plugin 'ara_default', unable to load

PLAY [Deploy simple Nginx webserver] ***************************************************************************************************************************************
Monday 03 November 2025  11:46:26 +0100 (0:00:00.015)       0:00:00.015 ******* 

TASK [Apply Nginx manifests] ***********************************************************************************************************************************************
changed: [localhost] => (item=nginx-deploy.yaml)
changed: [localhost] => (item=nginx-svc.yaml)
Monday 03 November 2025  11:46:31 +0100 (0:00:05.011)       0:00:05.027 ******* 
FAILED - RETRYING: [localhost]: Wait for Nginx pod to become Ready (30 retries left).

TASK [Wait for Nginx pod to become Ready] **********************************************************************************************************************************
changed: [localhost]
Monday 03 November 2025  11:46:44 +0100 (0:00:12.596)       0:00:17.623 ******* 

TASK [Confirm Nginx is running] ********************************************************************************************************************************************
changed: [localhost]
Monday 03 November 2025  11:46:44 +0100 (0:00:00.753)       0:00:18.377 ******* 

TASK [Print success message] ***********************************************************************************************************************************************
ok: [localhost] => {
    "msg": "✅ Nginx successfully deployed and running!\nNAME                     READY   STATUS    RESTARTS   AGE\nnginx-5654587fb9-9bcrk   1/1     Running   0          14s\n"
}

PLAY RECAP *****************************************************************************************************************************************************************
localhost                  : ok=4    changed=3    unreachable=0    failed=0    skipped=0    rescued=0    ignored=0   

Monday 03 November 2025  11:46:44 +0100 (0:00:00.031)       0:00:18.409 ******* 
=============================================================================== 
Wait for Nginx pod to become Ready --------------------------------------------------------------------------------------------------------------------------------- 12.60s
Apply Nginx manifests ----------------------------------------------------------------------------------------------------------------------------------------------- 5.01s
Confirm Nginx is running -------------------------------------------------------------------------------------------------------------------------------------------- 0.75s
Print success message ----------------------------------------------------------------------------------------------------------------------------------------------- 0.03s

You can then verify that the pod is reachable by running kubectl port-forward svc/nginx 8080:80 and then simply opening a browser at localhost:8080.
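
For example, from the node holding the kubeconfig:

kubectl port-forward svc/nginx 8080:80 -n default &
sleep 2   # give the port-forward a moment to establish
curl -I http://localhost:8080   # expect HTTP/1.1 200 OK with the nginx Server header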

Once you've verified it works, you can trigger the deployment through a wrapper instead:

[root@kube-node1 kubespray]$ nano homelab_playbooks/wrapper.yaml 
---
- import_playbook: deploy-service1.yaml
- import_playbook: deploy-service2.yaml
- import_playbook: deploy-service3.yaml
...

To remove the installed resources, we can run the following instead:

---
- name: Uninstall Nginx webserver
  hosts: localhost
  gather_facts: false

  vars:
    kubeconfig_path: /root/.kube/config
    nginx_namespace: default
    nginx_labels:
      app: nginx
    nginx_manifests:
      - nginx-deploy.yaml
      - nginx-svc.yaml

  tasks:
    - name: Delete Nginx manifests
      kubernetes.core.k8s:
        state: absent
        kubeconfig: "{{ kubeconfig_path }}"
        src: "{{ item }}"
      loop: "{{ nginx_manifests }}"

    - name: Wait for Nginx pods to be fully deleted
      kubernetes.core.k8s_info:
        kind: Pod
        namespace: "{{ nginx_namespace }}"
        label_selectors:
          - "app=nginx"
      register: nginx_pods_info
      retries: 30
      delay: 5
      until: nginx_pods_info.resources | length == 0

    - name: Print success message
      ansible.builtin.debug:
        msg: " Nginx resources fully deleted!"

Disaster recovery

We now have all that's needed to write the Ansible playbooks for our deployments, StatefulSets and so on. The kubespray playbook took ~20 minutes to run, which can be avoided by keeping a couple of hot-spare VMs with a pre-installed cluster.

This is my rook-ceph deployment playbook:

ansible-galaxy collection install -r requirements.yml
pip install -r requirements.txt
ansible-playbook -i ../inventory/mycluster/inventory.ini deploy-storage.yaml
- name: Deploy Ceph Rook cluster
  hosts: localhost
  connection: local
  gather_facts: false

  # define some variables that you can use in the playbook
  vars:
    rook_repo: "git@gitlab.thekor.eu:kube/rook.git"
    rook_dir: "/root/rook"
    kubeconfig_path: "/root/.kube/config"

  tasks:
    
    # do a git clone
    - name: Clone Rook repository
      ansible.builtin.git:
        repo: "{{ rook_repo }}"
        dest: "{{ rook_dir }}"
        version: master
        update: yes
        accept_hostkey: yes
    
    # this waits for me to checkout the right branch for the project i'm working on
    # also allows me to configure if monitoring should be enabled or not
    # the storage targets and other configurations
    - name: Configure the deploy/examples/ceph-cluster.yml
      pause:
        prompt: "Configure the deploy/examples/ceph-cluster.yml before continuing. Press Enter to proceed or Ctrl+C to abort."

    - name: Create rook-ceph namespace
      kubernetes.core.k8s:
        api_version: v1
        kind: Namespace
        name: rook-ceph
        state: present

    - name: Create monitoring namespace
      kubernetes.core.k8s:
        api_version: v1
        kind: Namespace
        name: monitoring
        state: present

    # helm add the dependency
    - name: Add prometheus-community Helm repo
      community.kubernetes.helm_repository:
        name: prometheus-community
        repo_url: https://prometheus-community.github.io/helm-charts
        state: present

    # helm update ...
    - name: Update Helm repo cache
      command: helm repo update

    # helm install prometheus
    - name: Install Prometheus stack
      community.kubernetes.helm:
        name: prometheus
        chart_ref: prometheus-community/kube-prometheus-stack
        release_namespace: monitoring
        kubeconfig: "{{ kubeconfig_path }}"
        create_namespace: false
        update_repo_cache: true
        wait: true
        values:
          grafana:
            enabled: true
          prometheus:
            prometheusSpec:
              serviceMonitorSelectorNilUsesHelmValues: false

    - name: Wait for Prometheus pods to be ready
      kubernetes.core.k8s_info:
        api_version: v1
        kind: Pod
        namespace: monitoring
        label_selectors:
          - "app.kubernetes.io/instance=prometheus"
      register: prom_pods
      until: prom_pods.resources | selectattr('status.phase', 'equalto', 'Running') | list | length == prom_pods.resources | length
      retries: 30
      delay: 20

    - name: Apply Rook operator
      kubernetes.core.k8s:
        state: present
        kubeconfig: "{{ kubeconfig_path }}"
        src: "{{ rook_dir }}/deploy/examples/operator.yaml"

    - name: Apply Rook common resources
      kubernetes.core.k8s:
        state: present
        kubeconfig: "{{ kubeconfig_path }}"
        src: "{{ rook_dir }}/deploy/examples/common.yaml"

    - name: Apply Rook crds
      kubernetes.core.k8s:
        state: present
        kubeconfig: "{{ kubeconfig_path }}"
        src: "{{ rook_dir }}/deploy/examples/crds.yaml"
        
    - name: Wait for Ceph CRDs to be registered
      shell: until kubectl get crd cephclusters.ceph.rook.io >/dev/null 2>&1; do sleep 2; done
      changed_when: false

    - name: Deploy Ceph cluster
      kubernetes.core.k8s:
        state: present
        kubeconfig: "{{ kubeconfig_path }}"
        src: "{{ rook_dir }}/deploy/examples/ceph-cluster.yml"

    - name: Apply service monitor + rbac 
      kubernetes.core.k8s:
        state: present
        kubeconfig: "{{ kubeconfig_path }}"
        src: "{{ item }}"
      loop:
        - rook-ceph-servicemonitor-rbac.yml
        - rook-ceph-mgr-servicemonitor.yml

    - name: Wait for CephCluster to be created
      kubernetes.core.k8s_info:
        api_version: ceph.rook.io/v1
        kind: CephCluster
        namespace: rook-ceph
      register: ceph_clusters
      until: ceph_clusters.resources | length > 0
      retries: 10
      delay: 60

    - name: Display CephCluster status
      debug:
        var: ceph_clusters.resources

    - name: Apply RBD StorageClass
      kubernetes.core.k8s:
        state: present
        kubeconfig: "{{ kubeconfig_path }}"
        src: "{{ rook_dir }}/deploy/examples/csi/rbd/storageclass.yaml"
      
    - name: Apply the cephfs storage class
      kubernetes.core.k8s:
        state: present
        kubeconfig: "{{ kubeconfig_path }}"
        src: "{{ rook_dir }}/deploy/examples/csi/cephfs/storageclass.yaml"

    - name: Apply the toolbox
      kubernetes.core.k8s:
        state: present
        kubeconfig: "{{ kubeconfig_path }}"
        src: "{{ rook_dir }}/deploy/examples/toolbox.yaml"

    - name: Wait for rook-ceph-tools pod to be ready
      ansible.builtin.shell: kubectl -n rook-ceph wait pod -l app=rook-ceph-tools --for=condition=Ready --timeout=180s
      register: toolbox_ready
      retries: 30
      delay: 20
      until: toolbox_ready.rc == 0
      changed_when: false

    - name: Wait for Ceph cluster to reach HEALTH_OK
      ansible.builtin.shell: kubectl -n rook-ceph exec deploy/rook-ceph-tools -- ceph -s
      retries: 30
      delay: 20
      register: ceph_status
      until: "'HEALTH_OK' in ceph_status.stdout"
      changed_when: false

    - name: Show last Ceph status
      ansible.builtin.debug:
        var: ceph_status.stdout

Let's deploy a test web app together with MongoDB on the Ceph block storage. We'll start off by creating a vault:

ansible-vault create group_vars/all/vault.yml

mongo_user: XXXXXXXXX
mongo_password: XXXXXXXXXXXXXXXXXXXXXXXXX

ansible-playbook -i inventory/mycluster/inventory.ini homelab_playbooks/deploy-webapp.yaml --ask-vault-pass

Or, if you're planning to run it from GitLab or any other script:

export WEBAPP_PASSWORD=supersecret123
ansible-playbook -i inventory/mycluster/inventory.ini homelab_playbooks/deploy-webapp.yaml -e webapp_password=$WEBAPP_PASSWORD

To view or edit the vault afterwards run:

ansible-vault view group_vars/all/vault.yml
ansible-vault decrypt group_vars/all/vault.yml
ansible-vault encrypt group_vars/all/vault.yml

And deploy it:

ansible-playbook -i ../inventory/mycluster/inventory.ini deploy-notls.yaml --ask-vault-pass
---
- name: Deploy Mongo Express namespace and PVC
  hosts: localhost
  gather_facts: false
  vars:
    kubespray_repo: "git@gitlab.thekor.eu:kube/kubespray.git"
    kubespray_dir: "/root/kubespray"
    kubeconfig_path: "/root/.kube/config"

  tasks:
    
    # do a git clone
    - name: Clone the custom kubespray repository
      ansible.builtin.git:
        repo: "{{ kubespray_repo }}"
        dest: "{{ kubespray_dir }}"
        version: home_staging
        update: yes
        accept_hostkey: yes
        
    - name: Apply manifests
      kubernetes.core.k8s:
        state: present
        kubeconfig: "{{ kubeconfig_path }}"
        # this looks at the contents of the ./files directory by default
        src: "{{ item }}"
      loop:
        - 0-namespaces.yml
        - 1-mongo-pvc.yml
    - name: Wait for mongoexpress-pvc to be bound
      kubernetes.core.k8s_info:
        api_version: v1
        kind: PersistentVolumeClaim
        name: mongoexpress-pvc
        namespace: mongo-express
      register: pvc_info
      until: pvc_info.resources | length > 0 and pvc_info.resources[0].status.phase == "Bound"
      retries: 30     # retry 30 times
      delay: 5        # wait 5 seconds between retries
- name: Create Kubernetes Secret from Ansible Vault
  hosts: localhost
  gather_facts: false
  vars_files:
    - group_vars/all/vault.yml
  vars:
    kubeconfig_path: /root/.kube/config
  tasks:
    - name: Create DB credentials Secret
      kubernetes.core.k8s:
        state: present
        kubeconfig: "{{ kubeconfig_path }}"
        definition:
          apiVersion: v1
          kind: Secret
          metadata:
            name: mongodb-secret
            namespace: mongo-express
          type: Opaque
          stringData:   # automatically base64-encoded by k8s
            mongo-root-username: "{{ mongo_user }}"
            mongo-root-password: "{{ mongo_password }}"
    
- name: Deploy test Mongodb
  hosts: localhost
  become: true
  gather_facts: false
  vars:
    kubeconfig_path: /root/.kube/config

  tasks:
    - name: Apply deployment, configmap and svc manifests
      kubernetes.core.k8s:
        state: present
        kubeconfig: "{{ kubeconfig_path }}"
        src: "{{ item }}"
      loop:
        - 2-mongo-database.yml

    - name: Wait for mongodb-deployment to be ready
      kubernetes.core.k8s_info:
        api_version: apps/v1
        kind: Deployment
        name: mongodb-deployment
        namespace: mongo-express
      register: deploy_info
      until:
        - deploy_info.resources | length > 0
        - deploy_info.resources[0].status.readyReplicas is defined
        - deploy_info.resources[0].status.readyReplicas == deploy_info.resources[0].status.replicas
      retries: 40   # e.g. wait up to 200s total
      delay: 5

- name: Deploy Mongo Express
  hosts: localhost
  gather_facts: false
  vars:
    kubeconfig_path: /root/.kube/config

  tasks:
    - name: Apply mongoexpress deployment, svc, haproxy, rbac for haproxy and the ingress for mongoexpress
      kubernetes.core.k8s:
        state: present
        kubeconfig: "{{ kubeconfig_path }}"
        src: "{{ item }}"
      loop:
        - 3-mongo-express.yml
        - 4-ingress-controller.yml
        - 5-ingress-rbac.yml
        - 6-ingress-mongoexpress.yml
        - 7-nginx-fallback.yml
    - name: Wait for mongodb-deployment to be ready
      kubernetes.core.k8s_info:
        api_version: apps/v1
        kind: Deployment
        name: mongo-express
        namespace: mongo-express
      register: deploy_info
      until:
        - deploy_info.resources | length > 0
        - deploy_info.resources[0].status.readyReplicas is defined
        - deploy_info.resources[0].status.readyReplicas == deploy_info.resources[0].status.replicas
      retries: 40   # e.g. wait up to 200s total
      delay: 5

All in all, I haven't figured out what my Recovery Time Objective is because I haven't finished building the cluster yet, but I'd say it would take about 1 hour provided there is a barebone cluster with Kubernetes already running on it. The CephFS filesystems and RBD images used in my deployments are rather small (< 10 GB), so storage won't be much of a bottleneck during disaster recovery.

Conclusion

I love the freedom that comes from abstracting away complicated bash syntax and storing it in a human-readable format. Writing and testing playbooks is very intuitive as well, because Ansible skips whatever it sees as already installed or already configured, which is very handy.

This post is licensed under CC BY 4.0 by the author.