Your Cluster Is Not Safe: The Dark Side of Backups
We all want to feel safe. That’s why we create backups. We want to know that our systems will survive no matter what happens.
The feeling of safety is very important, and I need to apologize in advance for what I’m about to say.
You are NOT safe. If the disaster happens, you might not be able to survive it.
Here’s why.
Backups alone are not enough, especially if we rely on only a single type of backup.
When the day of reckoning comes, you will NOT be saved!
Let me prove that to you.
Setup
rm -rf velero-demo
Watch the GitHub CLI (gh) - How to manage repositories more efficiently video if you are not familiar with GitHub CLI.
gh repo fork vfarcic/velero-demo --clone --remote
cd velero-demo
gh repo set-default
Watch Nix for Everyone: Unleash Devbox for Simplified Development if you are not familiar with Devbox. Alternatively, you can skip Devbox and install all the tools listed in
devbox.json
yourself.
devbox shell
Please watch The Future of Shells with Nushell! Shell + Data + Programming Language if you are not familiar with Nushell. Alternatively, you can inspect the
setup.nu
script and transform the instructions in it to Bash or ZShell if you prefer not to use that Nushell script.
chmod +x setup-cnpg-crossplane.nu
./setup-cnpg-crossplane.nu
source .env
Before The Disaster
I already have an application running in my cluster.
kubectl --namespace a-team get all,ingresses
The output is as follows.
NAME READY STATUS RESTARTS AGE
pod/silly-demo-1 1/1 Running 0 4m24s
pod/silly-demo-595c89b567-gj4fj 1/1 Running 0 5m19s
pod/silly-demo-videos-atlas-dev-db-678f49ffb9-r658p 1/1 Running 0 3m29s
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
service/silly-demo ClusterIP 10.100.4.103 <none> 8080/TCP 5m20s
service/silly-demo-r ClusterIP 10.100.49.95 <none> 5432/TCP 5m19s
service/silly-demo-ro ClusterIP 10.100.141.126 <none> 5432/TCP 5m19s
service/silly-demo-rw ClusterIP 10.100.142.203 <none> 5432/TCP 5m19s
NAME READY UP-TO-DATE AVAILABLE AGE
deployment.apps/silly-demo 1/1 1 1 5m19s
deployment.apps/silly-demo-videos-atlas-dev-db 1/1 1 1 5m19s
NAME DESIRED CURRENT READY AGE
replicaset.apps/silly-demo-595c89b567 1 1 1 5m19s
replicaset.apps/silly-demo-videos-atlas-dev-db-678f49ffb9 1 1 1 5m19s
NAME CLASS HOSTS ADDRESS PORTS AGE
ingress.networking.k8s.io/silly-demo traefik silly-demo.52.86.219.243.nip.io a00117ef59e2f48aa8dacfd39899e1fd-1803699707.us-east-1.elb.amazonaws.com 80 5m20s
It’s a stateless app that uses PostgreSQL to store data. There is a Deployment of the app (silly-demo), and a Deployment of Atlas (silly-demo-videos-atlas-dev-db) that manages PostgreSQL schemas.
There is a persistentvolume
…
kubectl --namespace a-team get persistentvolumes
The output is as follows.
NAME CAPACITY ACCESS MODES RECLAIM POLICY STATUS CLAIM STORAGECLASS VOLUMEATTRIBUTESCLASS REASON AGE
persistentvolume/pvc-ec0cc2c3-d41a-4fa8-87e5-48b2f3ffbf07 1Gi RWO Delete Bound a-team/silly-demo-1 gp2 <unset> 5m52s
…that contains data from the database.
There are also custom resources clusters
, and atlasschemas
.
kubectl --namespace a-team get clusters,atlasschemas
The output is as follows.
NAME AGE INSTANCES READY STATUS PRIMARY
cluster.postgresql.cnpg.io/silly-demo 5m87s Cluster in healthy state silly-demo-1
NAME READY REASON
atlasschema.db.atlasgo.io/silly-demo-videos False VerifyingFirstRun
Those are not special either. Cluster resources come from CNPG, which is designed to run PostgreSQL in a cloud-native way. AtlasSchema is a custom resource from the Atlas Operator that, as we mentioned earlier, is in charge of managing database schemas.
If you are not familiar with CNPG and the Atlas Operator, you might want to check out Should We Run Databases In Kubernetes? CloudNativePG (CNPG) PostgreSQL and Kubernetes? Database Schema? Schema Management with Atlas Operator.
There is nothing truly special about those resources. They are a very simple example; so simple that I won’t even dive into the details.
So far, I don’t see a problem. Velero can handle those without any issues. “Right?” Well… “Wrong!”, but we’ll get to that.
Now, to make the situation just marginally more complex, I have a Crossplane claim running in my cluster. Here it goes…
kubectl --namespace infra get sqlclaims
The output is as follows.
NAME SYNCED READY CONNECTION-SECRET AGE
my-db True False 3m47s
That claim represents a database server, and consists of a bunch of other resources.
kubectl get managed
The output is as follows.
NAME SYNCED READY EXTERNAL-NAME AGE
route.ec2.aws.upbound.io/my-db True True r-rtb-00161d9b28292789d1080289494 4m3s
NAME SYNCED READY EXTERNAL-NAME AGE
internetgateway.ec2.aws.upbound.io/my-db True True igw-06f07944db89c5883 4m7s
NAME SYNCED READY EXTERNAL-NAME AGE
mainroutetableassociation.ec2.aws.upbound.io/my-db True True rtbassoc-0ed65b23c7197c10c 4m7s
NAME SYNCED READY EXTERNAL-NAME AGE
routetableassociation.ec2.aws.upbound.io/my-db-1a True True rtbassoc-055b508b098ecdd4c 4m8s
routetableassociation.ec2.aws.upbound.io/my-db-1b True True rtbassoc-07422d44457f01f73 4m8s
routetableassociation.ec2.aws.upbound.io/my-db-1c True True rtbassoc-0d90c51759baa34fb 4m8s
NAME SYNCED READY EXTERNAL-NAME AGE
routetable.ec2.aws.upbound.io/my-db True True rtb-00161d9b28292789d 4m8s
NAME SYNCED READY EXTERNAL-NAME AGE
securitygrouprule.ec2.aws.upbound.io/my-db True True sgrule-1364994703 4m9s
NAME SYNCED READY EXTERNAL-NAME AGE
securitygroup.ec2.aws.upbound.io/my-db True True sg-0839bca02e3629a18 4m9s
NAME SYNCED READY EXTERNAL-NAME AGE
subnet.ec2.aws.upbound.io/my-db-a True True subnet-00a7554e7a87104da 4m10s
subnet.ec2.aws.upbound.io/my-db-b True True subnet-063055c491be14340 4m10s
subnet.ec2.aws.upbound.io/my-db-c True True subnet-0c8f8916b9b1fb4ca 4m10s
NAME SYNCED READY EXTERNAL-NAME AGE
vpc.ec2.aws.upbound.io/my-db True True vpc-09f9e3df7626f19e0 4m14s
NAME KIND PROVIDERCONFIG SYNCED READY AGE
object.kubernetes.crossplane.io/my-db-secret Secret my-db-sql False 4m14s
NAME READY SYNCED AGE
database.postgresql.sql.crossplane.io/my-db-db-01 False 4m15s
database.postgresql.sql.crossplane.io/my-db-db-02 False 4m15s
NAME SYNCED READY EXTERNAL-NAME AGE
instance.rds.aws.upbound.io/my-db False False 4m16s
NAME SYNCED READY EXTERNAL-NAME AGE
subnetgroup.rds.aws.upbound.io/my-db False True my-db 4m18s
That output shows the managed resources from the AWS provider. Depending on your setup choices, the output in your case might differ.
Still, there is nothing alarming. There is still no reason to think that we are not safe with cluster backups created with, let’s say, Velero. “Right?” Well… “Wrong!”, but we’ll get to that.
Disaster Recovery with Kubernetes Backups (Velero)
We are very fortunate that Joe is working with us. He has a special skill: he is clairvoyant. He can see the future and he is always right. He’s the one who predicted the fall of blockchain, and now he’s predicting that our cluster is about to get busted. That’s why Joe is on our payroll. He can tell us when a cluster is going to go down, mostly because he’s the one messing it up. Let’s make a backup while we still can.
velero backup create pre-disaster
I won’t go into the details of how Velero works. I already explained it in Master Kubernetes Backups with Velero: Step-by-Step Guide. That was the one with rainbows and unicorns, while this one is mostly doom and gloom.
Now, let’s say that a new cluster was auto-magically created and that we would like to restore the backup in hopes that everything will continue working on that new cluster as if nothing happened to the old one. If we accomplish that, we can avoid firing Joe for messing it up again.
Let’s do it. Let’s restore
the pre-disaster
backup or, to be more precise, whatever we have in that backup associated with the a-team
Namespace.
velero --kubeconfig kubeconfig-dot2.yaml restore create \
--from-backup pre-disaster --include-namespaces a-team
Everything should be working. Right?
If we take a look at all the resources in that Namespace,…
kubectl --kubeconfig kubeconfig-dot2.yaml --namespace a-team \
get all
The output is as follows.
NAME READY STATUS RESTARTS AGE
pod/silly-demo-595c89b567-gj4fj 1/1 Running 0 24s
pod/silly-demo-videos-atlas-dev-db-678f49ffb9-r658p 1/1 Running 0 24s
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
service/silly-demo ClusterIP 10.100.234.248 <none> 8080/TCP 25s
service/silly-demo-r ClusterIP 10.100.8.142 <none> 5432/TCP 25s
service/silly-demo-ro ClusterIP 10.100.134.203 <none> 5432/TCP 25s
service/silly-demo-rw ClusterIP 10.100.234.202 <none> 5432/TCP 25s
NAME READY UP-TO-DATE AVAILABLE AGE
deployment.apps/silly-demo 1/1 1 1 25s
deployment.apps/silly-demo-videos-atlas-dev-db 1/1 1 1 25s
NAME DESIRED CURRENT READY AGE
replicaset.apps/silly-demo-595c89b567 1 1 1 25s
replicaset.apps/silly-demo-videos-atlas-dev-db-678f49ffb9 1 1 1 25s
Everything seems to be just peachy.
Did it restore persistent volumes with data as well?
kubectl --kubeconfig kubeconfig-dot2.yaml --namespace a-team \
get persistentvolumes
The output is as follows (truncated for brevity).
NAME CAPACITY ACCESS MODES RECLAIM POLICY STATUS CLAIM STORAGECLASS VOLUMEATTRIBUTESCLASS REASON AGE
pvc-ec0... 1Gi RWO Delete Bound a-team/silly-demo-1 gp2 <unset> 65s
It did indeed. Well done, Velero. You’re amazing.
How about custom resources like CNPG clusters and Atlas schemas?
kubectl --kubeconfig kubeconfig-dot2.yaml --namespace a-team \
get clusters,atlasschemas
The output is as follows.
NAME AGE INSTANCES READY STATUS PRIMARY
cluster.postgresql.cnpg.io/silly-demo 87s Unable to create required cluster objects
NAME READY REASON
atlasschema.db.atlasgo.io/silly-demo-videos False VerifyingFirstRun
F**k! It’s not working.
Let’s see what’s wrong.
kubectl --kubeconfig kubeconfig-dot2.yaml --namespace a-team \
describe cluster silly-demo
The output is as follows (truncated for brevity).
...
Status:
Conditions:
Last Transition Time: 2024-11-08T23:22:55Z
Message: Cluster Is Not Ready
Reason: ClusterIsNotReady
Status: False
Type: Ready
Image: ghcr.io/cloudnative-pg/postgresql:17.0
Latest Generated Node: 1
Phase: Unable to create required cluster objects
Phase Reason: refusing to reconcile service: silly-demo-r, not owned by the cluster
Target Primary: silly-demo-1
Events: <none>
The problem is in Kubernetes itself. We often have resources owned by other resources. A Deployment creates a ReplicaSet, which creates Pods. As a result, a Pod is owned by a ReplicaSet, which is owned by a Deployment. In some cases, that ownership is a problem where backups are concerned, and in others it isn’t. In this case, we do have a problem.
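We can see that ownership chain by inspecting a resource’s ownerReferences. For example, on the original cluster, a quick check like the sketch below (output not shown) would likely reveal that the silly-demo-r Service is owned by the CNPG Cluster resource.
kubectl --namespace a-team get service silly-demo-r \
    --output jsonpath='{.metadata.ownerReferences[*].kind}'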
When we restored the backup, among other resources we restored the silly-demo-r Service as well as the CNPG Cluster resource. As a result, the Cluster resource did not create that Service itself; instead, it found a Service that already exists but is not owned by it, and it refuses to reconcile it. The end result is a failing cluster.
One solution could be to apply filters to the restore operation. We could, for example, say that we do NOT want to restore Services, but that would not work either because there are other Services in that Namespace that are not created by other resources and, hence, not owned by them. We could define even more elaborate filters that exclude Services with specific labels, but then we would quickly end up with very complex setups and a potentially infinite number of filters.
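For illustration, such a filtered restore might look like the sketch below, which excludes Services altogether; as explained above, that particular filter would leave us without the Services that nothing else recreates, so it is not a real solution.
velero --kubeconfig kubeconfig-dot2.yaml restore create \
    --from-backup pre-disaster --include-namespaces a-team \
    --exclude-resources services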
On top of all that, the resources we restored with Velero are likely not up-to-date. They are not the resources in the state they were in just before the crash, but in the state they were in when we created the last backup.
Even if we managed to solve the filtering conundrum and started making backups much more frequently, we would still face a potential issue with data.
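If we did decide to back up more frequently, Velero schedules would be the mechanism. A minimal sketch follows (the name and interval are arbitrary); still, more frequent snapshots do not make the data problem go away.
velero schedule create every-30-min --schedule='*/30 * * * *'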
Do we trust Velero alone to back up database data? I think we shouldn’t. Databases have dedicated backup tools that are much safer than generic backup solutions like Velero. If we rely on Velero for data, there is a chance that backups of database data will be corrupted or incomplete. On top of that, data restoration needs to be coordinated with the initialization of database servers.
Now, most of those problems can be solved with Velero, but they are likely already solved in database operators. The CNPG Cluster spec, for example, already has entries that allow us to configure how its data should be backed up and restored safely, without the need for any “special” operations.
The problem, however, is that we did not specify that we want CNPG to create and restore backups, so we cannot leverage it in the current situation since the cluster where PostgreSQL was running is now dead or, to be more precise, we are pretending it’s dead.
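For context, here is a minimal sketch of what that configuration might have looked like in the Cluster spec, assuming an S3 bucket (my-backups is a hypothetical bucket) and a Secret named aws-creds holding the credentials. On a new cluster, a bootstrap.recovery section pointing at the same object store would restore the data during initialization.
apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
  name: silly-demo
spec:
  instances: 1
  backup:
    retentionPolicy: 30d
    barmanObjectStore:
      destinationPath: s3://my-backups/silly-demo
      s3Credentials:
        accessKeyId:
          name: aws-creds
          key: ACCESS_KEY_ID
        secretAccessKey:
          name: aws-creds
          key: SECRET_ACCESS_KEY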
All in all, Velero poses problems when restoring Kubernetes resources as well as when working with database data. If it were the only tool at our disposal, we might try to solve those problems, but that’s not the case. We have alternatives and, for those two problems, they are GitOps and database-specific backups. We won’t be able to demonstrate the latter since we did not create a CNPG backup while we still could, but we certainly can switch to GitOps. But, before we do that, let me give you an advance warning: GitOps might not solve our issues either.
Let’s start over and delete the Namespace with the resources we restored.
kubectl --kubeconfig kubeconfig-dot2.yaml delete namespace a-team
Disaster Recovery with GitOps (Argo CD)
Assuming that databases are creating their own data backups which are restored automatically whenever we spin them up, we are left with the issue of backups for Kubernetes resources not being up-to-date and causing potential problems like those with ownership we saw earlier.
Let’s see whether we can solve that one with GitOps.
Here’s an Argo CD Application definition.
cat apps/silly-demo.yaml
The output is as follows.
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: silly-demo
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/vfarcic/velero-demo
    targetRevision: HEAD
    path: app/overlays/full
  destination:
    server: https://kubernetes.default.svc
    namespace: a-team
  syncPolicy:
    automated:
      selfHeal: true
      prune: true
      allowEmpty: true
Typically, we would be using the App of Apps model, where a single Argo CD Application synchronizes everything in a cluster. Today, however, I want to keep it simple, so we have an Application that will synchronize everything in the app/overlays/full directory of the velero-demo repo.
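For reference, the App of Apps parent is itself just another Application pointing at a directory of Application manifests. A minimal sketch, assuming those manifests live in the apps directory of this repo, could look like the one below. Today, though, we will stick with the single silly-demo Application.
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: apps
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/vfarcic/velero-demo
    targetRevision: HEAD
    path: apps
  destination:
    server: https://kubernetes.default.svc
    namespace: argocd
  syncPolicy:
    automated:
      selfHeal: true
      prune: true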
Instead of restoring a backup, we can just apply that resource and let Argo CD synchronize whatever is in that repo into the new cluster.
So, let’s apply it, and…
kubectl --kubeconfig kubeconfig-dot2.yaml apply \
--filename apps/silly-demo.yaml
…wait for a while for Argo CD to do the work.
A while later, we can retrieve all
Kubernetes core resources, and persistent volumes, and CNPG clusters, and Atlas schemas.
kubectl --kubeconfig kubeconfig-dot2.yaml --namespace a-team \
get all,ingresses,persistentvolumes,clusters,atlasschemas
The output is as follows (truncated for brevity).
NAME READY STATUS RESTARTS AGE
pod/silly-demo-1 1/1 Running 0 2m18s
pod/silly-demo-595c89b567-bxlv5 1/1 Running 0 2m49s
pod/silly-demo-videos-atlas-dev-db-... 1/1 Running 0 117s
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
service/silly-demo ClusterIP 10.100.71.12 <none> 8080/TCP 2m50s
service/silly-demo-r ClusterIP 10.100.248.163 <none> 5432/TCP 2m50s
service/silly-demo-ro ClusterIP 10.100.62.94 <none> 5432/TCP 2m50s
service/silly-demo-rw ClusterIP 10.100.90.155 <none> 5432/TCP 2m50s
NAME READY UP-TO-DATE AVAILABLE AGE
deployment.apps/silly-demo 1/1 1 1 2m50s
deployment.apps/silly-demo-videos-atlas-dev-db 1/1 1 1 2m50s
NAME DESIRED CURRENT READY AGE
replicaset.apps/silly-demo-595c89b567 1 1 1 2m50s
replicaset.apps/silly-demo-videos-atlas-dev-db-... 1 1 1 2m50s
NAME CLASS HOSTS ADDRESS PORTS AGE
ingress.networking.k8s.io/silly-demo traefik silly-demo.52.86.219.243.nip.io aa81601... 80 2m51s
NAME CAPACITY ACCESS MODES RECLAIM POLICY STATUS CLAIM STORAGECLASS VOLUMEATTRIBUTESCLASS REASON AGE
persistentvolume/pvc-11f... 1Gi RWO Delete Bound a-team/silly-demo-1 gp2 <unset> 2m48s
NAME AGE INSTANCES READY STATUS PRIMARY
cluster.postgresql.cnpg.io/silly-demo 2m51s 1 1 Cluster in healthy state silly-demo-1
NAME READY REASON
atlasschema.db.atlasgo.io/silly-demo-videos True Applied
This time, everything seems to be working correctly, assuming that we had configured CNPG to create and restore data backups using its own backup and restore mechanism.
However, “everything is working” is only an illusion. Everything works in such a simple setup because we’re looking at a single application. If we instructed Argo CD to synchronize everything that should be running in that cluster at once, it would almost certainly fail. Argo CD assumes that we do things in a certain order, and that order is specified through Sync Waves. Without them, it would try to synchronize something that depends on something else, fail to do so, and stop the process, resulting in only a fraction of the resources that should be running in that cluster.
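For those who haven’t used them, Sync Waves are just annotations on resources; lower waves must be synced and healthy before higher ones start. A minimal sketch: a Namespace that everything else depends on could be placed into an earlier wave. Maintaining such annotations across hundreds of interdependent resources is precisely the kind of busywork we were hoping to avoid.
apiVersion: v1
kind: Namespace
metadata:
  name: a-team
  annotations:
    argocd.argoproj.io/sync-wave: "-1"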
I won’t go deeper into the issues with Argo CD synchronization since I already did that in Argo CD Synchronization is BROKEN! It Should Switch to Eventual Consistency!. Watch it if you haven’t already.
Still, Argo CD synchronization issues can be solved with a lot of work, yet that is what we’re trying to avoid in the first place. So, Velero can restore all resources if we spend the time defining all the filters, and Argo CD can do the same if we define an infinite number of sync waves. Neither is likely to work on the new cluster on the first attempt but, if we practice a lot and are persistent enough, we might get there with both.
Argo CD, however, has an advantage over backups. Its source of the desired state is most likely up-to-date. If we are practicing GitOps, we always commit changes to the desired state to Git, so Git always has the latest desired state. Backups, on the other hand, make periodic snapshots of the actual state. As a result, GitOps synchronization into a new cluster is more likely to result in that cluster being as close as possible to the old one that was destroyed (assuming that data is backed up by whichever database we’re using).
However, even if we ignore the issue with Argo CD sync waves and we do restore all the resources from Git, it will still not work.
Let me demonstrate that through Crossplane.
Disaster Recovery of Mutated Resources
I already have a claim in my cluster that expanded into a bunch of managed resources. To be more precise, I have a Crossplane claim in the old cluster, the one we are pretending was destroyed. Still, let’s stop pretending for a second and take a look at it.
crossplane beta trace sqlclaim my-db --namespace infra
The output is as follows (truncated for brevity).
NAME SYNCED READY STATUS
SQLClaim/my-db (infra) True True Available
└─ SQL/my-db-sprhz True True Available
├─ InternetGateway/my-db True True Available
├─ MainRouteTableAssociation/my-db True True Available
├─ RouteTableAssociation/my-db-1a True True Available
├─ RouteTableAssociation/my-db-1b True True Available
├─ RouteTableAssociation/my-db-1c True True Available
├─ RouteTable/my-db True True Available
├─ Route/my-db True True Available
├─ SecurityGroupRule/my-db True True Available
├─ SecurityGroup/my-db True True Available
├─ Subnet/my-db-a True True Available
├─ Subnet/my-db-b True True Available
├─ Subnet/my-db-c True True Available
├─ VPC/my-db True True Available
├─ ProviderConfig/my-db-sql - -
├─ ProviderConfig/my-db-sql - -
├─ Object/my-db-secret False - Available
├─ Database/my-db-db-01 False - Available
├─ Database/my-db-db-02 False - Available
├─ ProviderConfig/my-db - -
├─ SubnetGroup/my-db False True Available
└─ Instance/my-db False True Available
Everything seems to be working well… because it is working. If that cluster were destroyed, we could restore those resources using Argo CD. Right? Wrong!
One thing we might easily overlook are mutations. Almost all Kubernetes resources mutate over time. Their specs are modified by the scheduler, their statuses are updated, and events are generated. Some of those mutations are not important and the resources will behave correctly even if we apply those same resources into a new cluster without those specific mutations. That’s what we do with Argo CD, and many people do not understand that the desired state stored in Git is not the same as the actual state in a cluster.
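A benign example: the Deployment controller adds annotations such as deployment.kubernetes.io/revision that are not in Git, and nothing breaks if they are lost. A quick way to see them (a sketch) is below.
kubectl --namespace a-team get deployment silly-demo \
    --output jsonpath='{.metadata.annotations}'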
Still, we can ignore many of those mutations, but not all.
Here’s an example.
kubectl get managed --output yaml
The output is as follows.
...
- apiVersion: ec2.aws.upbound.io/v1beta1
kind: SecurityGroupRule
metadata:
annotations:
crossplane.io/composition-resource-name: securityGroupRule
crossplane.io/external-create-pending: "2024-11-20T12:47:45Z"
crossplane.io/external-create-succeeded: "2024-11-20T12:47:45Z"
crossplane.io/external-name: sgrule-1730703853
...
Before we continue, let me stress that the example comes from Crossplane, but we can observe issues that originate from mutations in many other tools as well.
Let’s, for example, take a look at the external-name annotation. That annotation was not defined as part of the desired state; we won’t find it in Git. It was added by the controller. While mutations often do not matter, this one does, at least in some cases.
That specific resource manages a security group rule in AWS. Since AWS autogenerates the IDs of some (but not all) of its resources, Crossplane mutated the resource by, among other things, injecting that annotation. Through it, Crossplane is able to establish the relation between the Kubernetes resource and the actual resource in AWS and do whatever needs to be done.
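If we wanted to see only that annotation, a jsonpath query like the sketch below (with the dots in the annotation key escaped) would print the AWS-side identifier of that security group rule.
kubectl get securitygrouprule.ec2.aws.upbound.io my-db \
    --output jsonpath='{.metadata.annotations.crossplane\.io/external-name}'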
Here’s the question.
What would happen if we instruct Argo CD to synchronize the claim from Git into the new cluster?
Since that annotation is added at runtime and is NOT part of the desired state in Git, Argo CD would create that resource without it. Crossplane, in turn, would not be able to find the existing resource in AWS and would create a new one. That’s probably not what we wanted. The intention should be for Crossplane in the new cluster to continue managing that same AWS resource.
So, in that specific scenario, Argo CD is not an option, at least not right away, and we are forced to go back to Velero. Since some of the resources are cluster-scoped, while others are in a specific Namespace, we’ll restore two parts of the backup.
We’ll restore all the resources with the label crossplane.io/claim-name=my-db
,…
velero --kubeconfig kubeconfig-dot2.yaml restore create \
--from-backup pre-disaster \
--selector "crossplane.io/claim-name=my-db"
…and all those inside the Namespace infra
.
velero --kubeconfig kubeconfig-dot2.yaml restore create \
--from-backup pre-disaster \
--include-namespaces infra
Let’s double check that worked by retrieving all Crossplane managed resources.
kubectl --kubeconfig kubeconfig-dot2.yaml get managed
The output is as follows.
NAME SYNCED READY EXTERNAL-NAME AGE
route.ec2.aws.upbound.io/my-db True True rtb-00161d9b28292789d_0.0.0.0/0 37s
NAME SYNCED READY EXTERNAL-NAME AGE
internetgateway.ec2.aws.upbound.io/my-db True True igw-06f07944db89c5883 41s
NAME SYNCED READY EXTERNAL-NAME AGE
mainroutetableassociation.ec2.aws.upbound.io/my-db True True rtbassoc-0ed65b23c7197c10c 41s
NAME SYNCED READY EXTERNAL-NAME AGE
routetableassociation.ec2.aws.upbound.io/my-db-1a True True rtbassoc-055b508b098ecdd4c 43s
routetableassociation.ec2.aws.upbound.io/my-db-1b True True rtbassoc-07422d44457f01f73 43s
routetableassociation.ec2.aws.upbound.io/my-db-1c True True rtbassoc-0d90c51759baa34fb 43s
NAME SYNCED READY EXTERNAL-NAME AGE
routetable.ec2.aws.upbound.io/my-db True True rtb-00161d9b28292789d 43s
NAME SYNCED READY EXTERNAL-NAME AGE
securitygrouprule.ec2.aws.upbound.io/my-db True True sgrule-1364994703 43s
NAME SYNCED READY EXTERNAL-NAME AGE
securitygroup.ec2.aws.upbound.io/my-db True True sg-0839bca02e3629a18 43s
NAME SYNCED READY EXTERNAL-NAME AGE
subnet.ec2.aws.upbound.io/my-db-a True True subnet-00a7554e7a87104da 44s
subnet.ec2.aws.upbound.io/my-db-b True True subnet-063055c491be14340 44s
subnet.ec2.aws.upbound.io/my-db-c True True subnet-0c8f8916b9b1fb4ca 44s
NAME SYNCED READY EXTERNAL-NAME AGE
vpc.ec2.aws.upbound.io/my-db True True vpc-09f9e3df7626f19e0 48s
NAME KIND PROVIDERCONFIG SYNCED READY AGE
object.kubernetes.crossplane.io/my-db-secret Secret my-db-sql True 48s
NAME READY SYNCED AGE
database.postgresql.sql.crossplane.io/my-db-db-01 True 50s
database.postgresql.sql.crossplane.io/my-db-db-02 True 49s
NAME SYNCED READY EXTERNAL-NAME AGE
instance.rds.aws.upbound.io/my-db True True 50s
NAME SYNCED READY EXTERNAL-NAME AGE
subnetgroup.rds.aws.upbound.io/my-db True True my-db 52s
It’s working, and the fact that all the resources became READY right away signals that Crossplane correctly identified, through the external-name annotation, the AWS resources it should manage. Otherwise, some of them would not be READY yet, since it takes a while to create some of those resources in AWS.
Similarly, we can confirm that the claim itself, the resource that manages all those managed resources, was created as well.
kubectl --kubeconfig kubeconfig-dot2.yaml --namespace infra \
get sqlclaims
The output is as follows.
NAME SYNCED READY CONNECTION-SECRET AGE
my-db True True 3m20s
It’s there as well. Its status is set to READY only after all those managed resources are ready, so that’s yet another confirmation that everything worked as expected.
Now we should be able to point Argo CD to the Git repo where the desired state of that claim is stored. Argo CD would not do anything right away since the actual state is the same as the desired state. However, future updates to the desired state will be synced into the cluster.
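That Application could be as simple as the sketch below; the infra path is an assumption, since the exact location of the claim’s manifest in the repo may differ.
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: my-db
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/vfarcic/velero-demo
    targetRevision: HEAD
    path: infra
  destination:
    server: https://kubernetes.default.svc
    namespace: infra
  syncPolicy:
    automated:
      selfHeal: true
      prune: true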
So, what did we learn today?
What Did We Learn?
The main outcome of today’s exercises is the depressing realization that, if a disaster occurs, there is no single tool, and maybe not even a combination of tools, that will enable us to recover easily.
There is no simple and fast disaster recovery, especially if we wait for the disaster to happen to validate whether our processes work.
If we want to take disaster recovery seriously, we have to practice it all the time, and the best way to do that is to enforce the destruction of clusters. My recommendation is to never upgrade clusters but to always create new ones when we want to jump to the next Kubernetes release. If we do that, we will force ourselves to practice disaster recovery a couple of times a year, given that’s the cadence of Kubernetes releases. It will be painful at first, but, over time, we might manage to strike the right balance between Kubernetes backups, database-specific backups, GitOps, and whatever else we might be using.
The problem is that it will never end. Even if we do manage to get to the point that we can reliably restore a cluster with everything in it, the situation will change. New applications are added, existing resources are evolving, and so on and so forth. Disaster recovery is a practice that needs to be exercised often if we want to be sure it is actually working.
As for tools… I recommend creating and restoring database backups using database-specific processes. Use Velero for resources that contain critical mutations missing from the desired state. Use Argo CD (or Flux) for all the other resources, and make sure you set it up so that resources are synced no matter the current state of the cluster.
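As a sketch of that last point, Argo CD’s syncPolicy in an Application can be told to keep retrying failed syncs so the cluster eventually converges even when resources are applied out of order; the numbers below are arbitrary.
syncPolicy:
  automated:
    selfHeal: true
    prune: true
  retry:
    limit: 10
    backoff:
      duration: 30s
      factor: 2
      maxDuration: 5m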
Disaster recovery is painful and it never gets easy. The best we can do is make sure it is reliable.
Destroy
chmod +x destroy.nu
./destroy.nu
exit