Introduction
In our previous article we installed and configured a Kubernetes cluster on the Hetzner Cloud. We used Kubespray to run the installation, which relies on Ansible playbooks as opposed to the bash scripts we used in our first article. Even though they are harder to read, I prefer to rely on playbooks because of idempotency, error handling, rollbacks and roles.
Idempotency ensures I’m working in a declarative manner instead of imperatively issuing commands that might be redundant. Let’s say you’re adding a user to a Linux system. The first time, the command runs fine because the user does not exist; the second time it fails with a non-zero exit code because the user already exists, whereas an idempotent module simply reports that nothing changed. To give an example of error handling and rollback, imagine the following situation: you want to stop an nginx webserver, apply some configuration changes and restart it. Using Ansible you can define a “try” block (a block with a rescue section) that applies the changes and automatically rolls back to the latest working configuration if any task in the block fails, as sketched below. And finally, Ansible roles: they are simply streamlined playbooks that can be reused in other projects, keeping your automation simple and modular, which is in line with the UNIX philosophy.
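To make those two ideas concrete, here is a minimal sketch; the user name, nginx paths and nginx.conf.j2 template are purely illustrative. The user task shows idempotency, and the block/rescue pair is Ansible’s equivalent of a try/catch with rollback.
# Idempotency: the user module only changes something when needed.
# Running this twice reports "ok" the second time instead of failing.
- name: Ensure the user exists
  ansible.builtin.user:
    name: alice
    state: present

# Error handling and rollback with block/rescue (paths and template are placeholders)
- name: Apply a new nginx configuration with automatic rollback
  block:
    - name: Back up the current configuration
      ansible.builtin.copy:
        src: /etc/nginx/nginx.conf
        dest: /etc/nginx/nginx.conf.bak
        remote_src: true
    - name: Apply the new configuration
      ansible.builtin.template:
        src: nginx.conf.j2
        dest: /etc/nginx/nginx.conf
    - name: Validate the configuration
      ansible.builtin.command: nginx -t
    - name: Restart nginx
      ansible.builtin.service:
        name: nginx
        state: restarted
  rescue:
    - name: Restore the last working configuration
      ansible.builtin.copy:
        src: /etc/nginx/nginx.conf.bak
        dest: /etc/nginx/nginx.conf
        remote_src: true
    - name: Restart nginx with the restored configuration
      ansible.builtin.service:
        name: nginx
        state: restarted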
Working with Kubernetes on bare metal brings some advantages on the licensing-fee side compared to VMware, or to Proxmox paired with a Proxmox Backup Server, but on the other hand it is not as straightforward to rebuild after a catastrophic event. Our RTO (Recovery Time Objective) using a hypervisor is well defined and controllable, but Kubernetes does not offer that out of the box. To build the same level of resilience I wrote a bash script that periodically “pulls” the storage into compressed archives, but I haven’t tried rebuilding a cluster from scratch or restoring it from an etcd backup. This article aims to fill that gap and explores different restore strategies.
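Capturing etcd itself is the other half of such a restore strategy. As a rough sketch, this is the kind of snapshot you would take on a control-plane node; the certificate paths assume Kubespray’s defaults under /etc/ssl/etcd/ssl/ and will differ on other setups.
# Take a snapshot of etcd; adjust the endpoint and certificate paths to your cluster
ETCDCTL_API=3 etcdctl snapshot save /var/backups/etcd-$(date +%F).db \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/ssl/etcd/ssl/ca.pem \
  --cert=/etc/ssl/etcd/ssl/admin-$(hostname).pem \
  --key=/etc/ssl/etcd/ssl/admin-$(hostname)-key.pem

# Sanity-check the snapshot before shipping it off the node
ETCDCTL_API=3 etcdctl snapshot status /var/backups/etcd-$(date +%F).db --write-out=table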

A simple nginx deployment
We installed Ansible (and with it the ansible-galaxy tool) on the bare-metal setup earlier, so we’ll just activate the Python environment to use them:
[root@kube-node1 kubespray]$ source env/bin/activate
[root@kube-node1 kubespray]$ ansible-galaxy collection list | grep kubernetes.core
kubernetes.core 2.4.2
kubernetes.core 2.4.2
[root@kube-node1 kubespray]$ pip install --upgrade pip
[root@kube-node1 kubespray]$ pip install kubernetes
[root@kube-node1 kubespray]$ mkdir -p homelab_playbooks/files
[root@kube-node1 kubespray]$ nano homelab_playbooks/files/nginx-svc.yaml
apiVersion: v1
kind: Service
metadata:
  name: nginx
  namespace: default
spec:
  selector:
    app: nginx
  ports:
    - port: 80
      targetPort: 80
      protocol: TCP
      name: http
  type: ClusterIP
[root@kube-node1 kubespray]$ nano homelab_playbooks/files/nginx-deploy.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx
  namespace: default
spec:
  replicas: 1
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
        - name: nginx
          image: nginx:stable
          ports:
            - containerPort: 80
[root@kube-node1 kubespray]$ nano homelab_playbooks/deploy-nginx.yaml
---
- name: Deploy simple Nginx webserver
  # hosts: kube_control_plane[0]
  hosts: localhost
  become: true
  gather_facts: false
  vars:
    kubeconfig_path: /root/.kube/config
  tasks:
    - name: Apply Nginx manifests
      kubernetes.core.k8s:
        state: present
        kubeconfig: "{{ kubeconfig_path }}"
        src: "{{ item }}"
      loop:
        - nginx-deploy.yaml
        - nginx-svc.yaml
    - name: Wait for Nginx pod to become Ready
      ansible.builtin.shell: |
        kubectl get pods -l app=nginx -n default --no-headers | \
        awk '{print $3}' | grep -vE 'Running|Completed' | wc -l
      register: nginx_status
      retries: 30
      delay: 10
      until: nginx_status.stdout|int == 0
    - name: Confirm Nginx is running
      ansible.builtin.shell: kubectl get pods -l app=nginx -n default
      register: nginx_pods
    - name: Print success message
      ansible.builtin.debug:
        msg: |
          ✅ Nginx successfully deployed and running!
          {{ nginx_pods.stdout }}
[root@kube-node1 kubespray]$ ansible-playbook -i inventory/mycluster/inventory.ini homelab_playbooks/deploy-nginx.yaml
[WARNING]: Skipping callback plugin 'ara_default', unable to load
PLAY [Deploy simple Nginx webserver] ***************************************************************************************************************************************
Monday 03 November 2025 11:46:26 +0100 (0:00:00.015) 0:00:00.015 *******
TASK [Apply Nginx manifests] ***********************************************************************************************************************************************
changed: [localhost] => (item=nginx-deploy.yaml)
changed: [localhost] => (item=nginx-svc.yaml)
Monday 03 November 2025 11:46:31 +0100 (0:00:05.011) 0:00:05.027 *******
FAILED - RETRYING: [localhost]: Wait for Nginx pod to become Ready (30 retries left).
TASK [Wait for Nginx pod to become Ready] **********************************************************************************************************************************
changed: [localhost]
Monday 03 November 2025 11:46:44 +0100 (0:00:12.596) 0:00:17.623 *******
TASK [Confirm Nginx is running] ********************************************************************************************************************************************
changed: [localhost]
Monday 03 November 2025 11:46:44 +0100 (0:00:00.753) 0:00:18.377 *******
TASK [Print success message] ***********************************************************************************************************************************************
ok: [localhost] => {
"msg": "✅ Nginx successfully deployed and running!\nNAME READY STATUS RESTARTS AGE\nnginx-5654587fb9-9bcrk 1/1 Running 0 14s\n"
}
PLAY RECAP *****************************************************************************************************************************************************************
localhost : ok=4 changed=3 unreachable=0 failed=0 skipped=0 rescued=0 ignored=0
Monday 03 November 2025 11:46:44 +0100 (0:00:00.031) 0:00:18.409 *******
===============================================================================
Wait for Nginx pod to become Ready --------------------------------------------------------------------------------------------------------------------------------- 12.60s
Apply Nginx manifests ----------------------------------------------------------------------------------------------------------------------------------------------- 5.01s
Confirm Nginx is running -------------------------------------------------------------------------------------------------------------------------------------------- 0.75s
Print success message ----------------------------------------------------------------------------------------------------------------------------------------------- 0.03s
You can then verify that the pod is reachable by running kubectl port-forward svc/nginx 8080:80 and opening a browser at localhost:8080, or from the shell as shown below.
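For a quick check from the terminal, curl is enough:
[root@kube-node1 kubespray]$ kubectl port-forward svc/nginx 8080:80 &
[root@kube-node1 kubespray]$ curl -I http://localhost:8080    # expect HTTP/1.1 200 OK from nginx
[root@kube-node1 kubespray]$ kill %1                           # stop the port-forward again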
Once you’ve verified it works, you can trigger the deployment with a wrapper instead:
[root@kube-node1 kubespray]$ nano homelab_playbooks/wrapper.yaml
---
- import_playbook: deploy-service1.yaml
- import_playbook: deploy-service2.yaml
- import_playbook: deploy-service3.yaml
...
In order to remove the installed resources we could run this instead:
---
- name: Uninstall Nginx webserver
  hosts: localhost
  gather_facts: false
  vars:
    kubeconfig_path: /root/.kube/config
    nginx_namespace: default
    nginx_labels:
      app: nginx
    nginx_manifests:
      - nginx-deploy.yaml
      - nginx-svc.yaml
  tasks:
    - name: Delete Nginx manifests
      kubernetes.core.k8s:
        state: absent
        kubeconfig: "{{ kubeconfig_path }}"
        src: "{{ item }}"
      loop: "{{ nginx_manifests }}"
    - name: Wait for Nginx pods to be fully deleted
      kubernetes.core.k8s_info:
        kind: Pod
        namespace: "{{ nginx_namespace }}"
        label_selectors:
          - "app=nginx"
      register: nginx_pods_info
      retries: 30
      delay: 5
      until: nginx_pods_info.resources | length == 0
    - name: Print success message
      ansible.builtin.debug:
        msg: "✅ Nginx resources fully deleted!"
Disaster recovery
We now have all that’s needed to write the Ansible playbooks for our deployments, statefulsets and so on. The Kubespray playbook took ~20 minutes to run, which can be avoided by keeping a couple of hot-spare VMs with a pre-installed cluster.
This is my Rook-Ceph deployment playbook, preceded by the commands that install its dependencies and run it:
ansible-galaxy collection install -r requirements.yml
pip install -r requirements.txt
ansible-playbook -i ../inventory/mycluster/inventory.ini deploy-storage.yaml
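The requirements.yml and requirements.txt referenced above are not reproduced in this article; a minimal version covering the collections and Python modules used by the playbook below would look roughly like this:
# requirements.yml
collections:
  - name: kubernetes.core
  - name: community.kubernetes   # deprecated alias, only needed for the community.* module names below

# requirements.txt
kubernetes
PyYAML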
- name: Deploy Ceph Rook cluster
  hosts: localhost
  connection: local
  gather_facts: false
  # define some variables that you can use in the playbook
  vars:
    rook_repo: "git@gitlab.thekor.eu:kube/rook.git"
    rook_dir: "/root/rook"
    kubeconfig_path: "/root/.kube/config"
  tasks:
    # do a git clone
    - name: Clone Rook repository
      ansible.builtin.git:
        repo: "{{ rook_repo }}"
        dest: "{{ rook_dir }}"
        version: master
        update: yes
        accept_hostkey: yes
    # this waits for me to check out the right branch for the project I'm working on
    # and also allows me to configure whether monitoring should be enabled,
    # the storage targets and other settings
    - name: Configure the deploy/examples/ceph-cluster.yml
      ansible.builtin.pause:
        prompt: "Configure the deploy/examples/ceph-cluster.yml before continuing. Press Enter to proceed or Ctrl+C to abort."
    - name: Create rook-ceph namespace
      kubernetes.core.k8s:
        api_version: v1
        kind: Namespace
        name: rook-ceph
        state: present
    - name: Create monitoring namespace
      kubernetes.core.k8s:
        api_version: v1
        kind: Namespace
        name: monitoring
        state: present
    # helm add the dependency
    - name: Add prometheus-community Helm repo
      community.kubernetes.helm_repository:
        name: prometheus-community
        repo_url: https://prometheus-community.github.io/helm-charts
        state: present
    # helm update ...
    - name: Update Helm repo cache
      ansible.builtin.command: helm repo update
    # helm install prometheus
    - name: Install Prometheus stack
      community.kubernetes.helm:
        name: prometheus
        chart_ref: prometheus-community/kube-prometheus-stack
        release_namespace: monitoring
        kubeconfig: "{{ kubeconfig_path }}"
        create_namespace: false
        update_repo_cache: true
        wait: true
        values:
          grafana:
            enabled: true
          prometheus:
            prometheusSpec:
              serviceMonitorSelectorNilUsesHelmValues: false
    - name: Wait for Prometheus pods to be ready
      kubernetes.core.k8s_info:
        api_version: v1
        kind: Pod
        namespace: monitoring
        label_selectors:
          - "app.kubernetes.io/instance=prometheus"
      register: prom_pods
      until: prom_pods.resources | selectattr('status.phase', 'equalto', 'Running') | list | length == prom_pods.resources | length
      retries: 30
      delay: 20
    - name: Apply Rook operator
      kubernetes.core.k8s:
        state: present
        kubeconfig: "{{ kubeconfig_path }}"
        src: "{{ rook_dir }}/deploy/examples/operator.yaml"
    - name: Apply Rook common resources
      kubernetes.core.k8s:
        state: present
        kubeconfig: "{{ kubeconfig_path }}"
        src: "{{ rook_dir }}/deploy/examples/common.yaml"
    - name: Apply Rook CRDs
      kubernetes.core.k8s:
        state: present
        kubeconfig: "{{ kubeconfig_path }}"
        src: "{{ rook_dir }}/deploy/examples/crds.yaml"
    - name: Wait for Ceph CRDs to be registered
      ansible.builtin.shell: until kubectl get crd cephclusters.ceph.rook.io >/dev/null 2>&1; do sleep 2; done
      changed_when: false
    - name: Deploy Ceph cluster
      kubernetes.core.k8s:
        state: present
        kubeconfig: "{{ kubeconfig_path }}"
        src: "{{ rook_dir }}/deploy/examples/ceph-cluster.yml"
    - name: Apply service monitor + rbac
      kubernetes.core.k8s:
        state: present
        kubeconfig: "{{ kubeconfig_path }}"
        src: "{{ item }}"
      loop:
        - rook-ceph-servicemonitor-rbac.yml
        - rook-ceph-mgr-servicemonitor.yml
    - name: Wait for CephCluster to be created
      kubernetes.core.k8s_info:
        api_version: ceph.rook.io/v1
        kind: CephCluster
        namespace: rook-ceph
      register: ceph_clusters
      until: ceph_clusters.resources | length > 0
      retries: 10
      delay: 60
    - name: Display CephCluster status
      ansible.builtin.debug:
        var: ceph_clusters.resources
    - name: Apply RBD StorageClass
      kubernetes.core.k8s:
        state: present
        kubeconfig: "{{ kubeconfig_path }}"
        src: "{{ rook_dir }}/deploy/examples/csi/rbd/storageclass.yaml"
    - name: Apply the CephFS StorageClass
      kubernetes.core.k8s:
        state: present
        kubeconfig: "{{ kubeconfig_path }}"
        src: "{{ rook_dir }}/deploy/examples/csi/cephfs/storageclass.yaml"
    - name: Apply the toolbox
      kubernetes.core.k8s:
        state: present
        kubeconfig: "{{ kubeconfig_path }}"
        src: "{{ rook_dir }}/deploy/examples/toolbox.yaml"
    - name: Wait for rook-ceph-tools pod to be ready
      ansible.builtin.shell: kubectl -n rook-ceph wait pod -l app=rook-ceph-tools --for=condition=Ready --timeout=180s
      register: toolbox_ready
      retries: 30
      delay: 20
      until: toolbox_ready.rc == 0
      changed_when: false
    - name: Wait for Ceph cluster to reach HEALTH_OK
      # exec into the toolbox deployment (the pod name carries a random suffix)
      ansible.builtin.shell: kubectl -n rook-ceph exec deploy/rook-ceph-tools -- ceph -s | grep -q 'HEALTH_OK'
      retries: 30
      delay: 20
      register: ceph_health
      until: ceph_health.rc == 0
    - name: Show last Ceph status
      ansible.builtin.debug:
        var: ceph_health.stdout
Let’s deploy a test web app together with a MongoDB instance on the Ceph block storage. We’ll start off by creating a vault:
ansible-vault create group_vars/all/vault.yml
mongo_user: XXXXXXXXX
mongo_password: XXXXXXXXXXXXXXXXXXXXXXXXX
ansible-playbook -i inventory/mycluster/inventory.ini homelab_playbooks/deploy-webapp.yaml --ask-vault-pass
Or, if you’re planning on running it from GitLab CI or any other script:
export WEBAPP_PASSWORD=supersecret123
ansible-playbook -i inventory/mycluster/inventory.ini homelab_playbooks/deploy-webapp.yaml -e webapp_password=$WEBAPP_PASSWORD
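Inside the playbook both approaches can coexist; one pattern (the variable names here are illustrative, not taken from my repository) is to let the extra var override the vaulted value:
vars:
  # use -e webapp_password=... when it is set, otherwise fall back to the vaulted value
  effective_mongo_password: "{{ webapp_password | default(mongo_password) }}"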
To view or edit the vault afterwards run:
ansible-vault view group_vars/all/vault.yml
ansible-vault decrypt group_vars/all/vault.yml
ansible-vault encrypt group_vars/all/vault.yml
And deploy it:
ansible-playbook -i ../inventory/mycluster/inventory.ini deploy-notls.yaml --ask-vault-pass
---
- name: Deploy Mongo Express namespace and PVC
  hosts: localhost
  gather_facts: false
  vars:
    kubespray_repo: "git@gitlab.thekor.eu:kube/kubespray.git"
    kubespray_dir: "/root/kubespray"
    kubeconfig_path: "/root/.kube/config"
  tasks:
    # do a git clone
    - name: Clone the custom kubespray repository
      ansible.builtin.git:
        repo: "{{ kubespray_repo }}"
        dest: "{{ kubespray_dir }}"
        version: home_staging
        update: yes
        accept_hostkey: yes
    - name: Apply manifests
      kubernetes.core.k8s:
        state: present
        kubeconfig: "{{ kubeconfig_path }}"
        # this looks at the contents of the ./files directory by default
        src: "{{ item }}"
      loop:
        - 0-namespaces.yml
        - 1-mongo-pvc.yml
    - name: Wait for mongoexpress-pvc to be bound
      kubernetes.core.k8s_info:
        api_version: v1
        kind: PersistentVolumeClaim
        name: mongoexpress-pvc
        namespace: mongo-express
      register: pvc_info
      until: pvc_info.resources | length > 0 and pvc_info.resources[0].status.phase == "Bound"
      retries: 30   # retry 30 times
      delay: 5      # wait 5 seconds between retries

- name: Create Kubernetes Secret from Ansible Vault
  hosts: localhost
  gather_facts: false
  vars_files:
    - group_vars/all/vault.yml
  vars:
    kubeconfig_path: /root/.kube/config
  tasks:
    - name: Create DB credentials Secret
      kubernetes.core.k8s:
        state: present
        kubeconfig: "{{ kubeconfig_path }}"
        definition:
          apiVersion: v1
          kind: Secret
          metadata:
            name: mongodb-secret
            namespace: mongo-express
          type: Opaque
          stringData:   # automatically base64-encoded by k8s
            mongo-root-username: "{{ mongo_user }}"
            mongo-root-password: "{{ mongo_password }}"

- name: Deploy test Mongodb
  hosts: localhost
  become: true
  gather_facts: false
  vars:
    kubeconfig_path: /root/.kube/config
  tasks:
    - name: Apply deployment, configmap and svc manifests
      kubernetes.core.k8s:
        state: present
        kubeconfig: "{{ kubeconfig_path }}"
        src: "{{ item }}"
      loop:
        - 2-mongo-database.yml
    - name: Wait for mongodb-deployment to be ready
      kubernetes.core.k8s_info:
        api_version: apps/v1
        kind: Deployment
        name: mongodb-deployment
        namespace: mongo-express
      register: deploy_info
      until:
        - deploy_info.resources | length > 0
        - deploy_info.resources[0].status.readyReplicas is defined
        - deploy_info.resources[0].status.readyReplicas == deploy_info.resources[0].status.replicas
      retries: 40   # e.g. wait up to 200s total
      delay: 5

- name: Deploy Mongo Express
  hosts: localhost
  gather_facts: false
  vars:
    kubeconfig_path: /root/.kube/config
  tasks:
    - name: Apply mongoexpress deployment, svc, haproxy, rbac for haproxy and the ingress for mongoexpress
      kubernetes.core.k8s:
        state: present
        kubeconfig: "{{ kubeconfig_path }}"
        src: "{{ item }}"
      loop:
        - 3-mongo-express.yml
        - 4-ingress-controller.yml
        - 5-ingress-rbac.yml
        - 6-ingress-mongoexpress.yml
        - 7-nginx-fallback.yml
    - name: Wait for mongo-express deployment to be ready
      kubernetes.core.k8s_info:
        api_version: apps/v1
        kind: Deployment
        name: mongo-express
        namespace: mongo-express
      register: deploy_info
      until:
        - deploy_info.resources | length > 0
        - deploy_info.resources[0].status.readyReplicas is defined
        - deploy_info.resources[0].status.readyReplicas == deploy_info.resources[0].status.replicas
      retries: 40   # e.g. wait up to 200s total
      delay: 5
All in all, I haven’t figured out what my Recovery Time Objective is because I haven’t finished building the cluster yet, but I’d say it would take about one hour provided there is a barebones cluster with Kubernetes already running on it. The CephFS filesystems and RBD images used in my deployments are rather small (< 10 GB), so they won’t be much of a bottleneck for disaster recovery.
Conclusion
I love the freedom that comes from abstracting away complicated bash syntax and storing it in a human-readable format. Writing and testing playbooks is very intuitive as well, because Ansible skips whatever it sees as already installed or already configured, which is very handy.