Internal Developer Platform Day 2 Operations Solved with Kubernetes and Crossplane
I did it. No! We did it. Actually, that's not correct either. Someone else did it. It doesn't matter who did it. What matters is that one of the, in my opinion, big problems has been solved. Developers can now not only create, update, and remove their applications and infrastructure, but they can also see what's happening with the resources they are managing.
So far, in most cases, if we worked very, very hard, we could enable developers to create something meaningful. Most of us do that by creating abstractions. Instead of forcing developers to spend years trying to understand the intricacies of Kubernetes, we create Helm charts that enable developers to modify a simple YAML values file and apply charts to the cluster. If we are advanced, we create CRDs and controllers that result in even better abstractions and start resembling services that developers can consume. We progress further from there by creating services not only for applications but also for databases, clusters, or anything else. We, platform engineers, become service providers, and developers start consuming those services.
Here's the problem though. All of that tends to help only with day zero operations. Developers can create or update something in a very easy way, but when it comes to operating and observing that something, they often end up just as confused as they were before. Our services do not show the information they need, so they have to dig into lower-level resources to find out what's going on. That negates some of the main reasons we started offering them services. That's horrible. We're building something that looks useful when someone starts using it, but turns useless afterwards.
We need to change that and now we can do just that. Let me explain.
Setup
git clone https://github.com/vfarcic/crossplane-sql
cd crossplane-sql
git pull
git fetch
git checkout status-transformer
Make sure that Docker is up-and-running. We’ll use it to create a KinD cluster.
Watch Nix for Everyone: Unleash Devbox for Simplified Development if you are not familiar with Devbox. Alternatively, you can skip Devbox and install all the tools listed in devbox.json yourself.
devbox shell
chmod +x examples/setup.nu
./examples/setup.nu
source .env
kubectl delete \
--filename examples/provider-config-$HYPERSCALER.yaml
kubectl --namespace infra apply \
--filename examples/$HYPERSCALER-secret.yaml
Execute the command that follows only if you are using AWS
export MANAGED_RESOURCE=vpc.ec2.aws.upbound.io
Execute the command that follows only if you are using Google Cloud
export MANAGED_RESOURCE=databaseinstance.sql.gcp.upbound.io
Execute the command that follows only if you are using Azure
export MANAGED_RESOURCE=resourcegroup.azure.upbound.io
The Problem In Kubernetes
Let me show you the problem through a relatively simple definition of an application that should be running in Kubernetes.
Before we proceed, if you saw Kubernetes Events Are Broken (If You Are Building a Developer Portal), you might think that it's déjà vu. You might think you already heard me talking about this subject. That's partly true. This is a continuation. Back then, I had complaints. Today, I'll show the solution but, to get there, I might need to explain the problem again. I'll be brief. I promise.
Here’s the definition.
cat app/error.yaml
The output is as follows.
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: silly-demo
  labels:
    app.kubernetes.io/name: silly-demo
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: silly-demo
  template:
    metadata:
      labels:
        app.kubernetes.io/name: silly-demo
    spec:
      containers:
        - name: main
          image: "ghcr.io/vfarcic/silly-demo:this-tag-does-not-exist"
          ports:
            - containerPort: 8080
          resources:
            limits:
              cpu: 500m
              memory: 512Mi
            requests:
              cpu: 250m
              memory: 256Mi
          livenessProbe:
            httpGet:
              path: /
              port: 8080
          readinessProbe:
            httpGet:
              path: /
              port: 8080
---
apiVersion: v1
kind: Service
metadata:
  name: silly-demo
  labels:
    app.kubernetes.io/name: silly-demo
spec:
  type: ClusterIP
  ports:
    - port: 8080
      targetPort: 8080
      protocol: TCP
      name: http
  selector:
    app.kubernetes.io/name: silly-demo
---
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: silly-demo
  labels:
    app.kubernetes.io/name: silly-demo
  annotations:
    ingress.kubernetes.io/ssl-redirect: "false"
spec:
  ingressClassName: nginx
  rules:
    - http:
        paths:
          - path: /
            pathType: ImplementationSpecific
            backend:
              service:
                name: silly-demo
                port:
                  number: 8080
      host: silly-demo.127.0.0.1.nip.io
This is a very simple definition of an application which, with a little patience, we can teach developers how to write themselves.
We have a Deployment that defines our application. That's where we define which container image should run, which ports are exposed, how much cpu and memory it should consume, and so on and so forth. Then there is the Service, which defines how other applications can communicate with this one, and, finally, there is the Ingress that specifies the host through which our application is exposed.
Even though those resource definitions might be a bit verbose, almost anyone can learn how to write them. That part is not the issue. However, there are at least two problems with this approach.
To begin with, our applications are often more complicated than that. There could be horizontal or vertical Pod autoscalers. There could be virtual services from a service mesh. There could be network policies. There could be… You get the point. It could be simple, or it could be complex to define and manage all those resources.
That problem can be solved in many different ways. We could create Helm charts so that people can focus only on the things that matter, with those "things" being defined in values.yaml. Alternatively, we could accomplish a similar result with CUE through Timoni, Carvel ytt, KCL, or many other tools that all serve the same purpose. They all allow us to create templates, of one sort or another, and focus on customizations by changing parameters, values, or some other type of file, as in the hypothetical example below.
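As an illustration only, a hypothetical values.yaml for such a chart might expose nothing but the handful of knobs developers actually care about. None of the field names below come from a real chart; they are made up for the example.

image:
  repository: ghcr.io/vfarcic/silly-demo   # hypothetical field names
  tag: v1.4.114                             # any existing tag would do
host: silly-demo.127.0.0.1.nip.io
resources:
  cpu: 250m
  memory: 256Mi
scaling:
  enabled: true
  minReplicas: 2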
That’s not the problem we’re solving today, especially since I believe that they are all wrong because we should be creating Custom Resource Definitions (CRDs) and controllers.
The second problem, the one we’re exploring today, can be described as “What the heck do we do after day zero operations?” It’s relatively easy to change a few manifests or a Helm values.yaml file or a custom resource or whatever else you might be using. What is much harder is to deduce the status of something running in or being managed by Kubernetes. It’s much harder to understand “what’s the cause of an issue and how to fix it?” Whatever we do during day zero is much easier than what we do afterwards.
Let me apply those resources as a way to demonstrate the issues I’m talking about.
kubectl --namespace infra apply --filename app/error.yaml
Now, imagine that you are not a person who neglected their family so that you can spend an infinite amount of time learning Kubernetes. Imagine that you were told to define a Kubernetes Deployment (and a few other simple things) or that you changed a few Helm values or something similar. Imagine that you are not a Kubernetes expert but that you just followed instructions from a platform engineer who decided to enable you to do it all by yourself. In this example, you defined a Deployment to the best of your abilities and applied it to the cluster. It’s only natural to take a look at it to deduce whether it’s working as expected, so let’s do just that.
kubectl --namespace infra get deployments
The output is as follows.
NAME         READY   UP-TO-DATE   AVAILABLE   AGE
silly-demo   0/1     1            0           52s
Okay. That clearly shows that something is not ready. Since you're a fast learner, you figured out, or were told, to describe the resource to get more information.
kubectl --namespace infra describe deployment silly-demo
The output is as follows (truncated for brevity).
...
Conditions:
  Type           Status  Reason
  ----           ------  ------
  Available      False   MinimumReplicasUnavailable
  Progressing    True    ReplicaSetUpdated
OldReplicaSets:  <none>
NewReplicaSet:   silly-demo-7f8c98455b (1/1 replicas created)
Events:
  Type    Reason             Age  From                   Message
  ----    ------             ---- ----                   -------
  Normal  ScalingReplicaSet  70s  deployment-controller  Scaled up replica set silly-demo-7f8c98455b to 1
Most of the output repeats the information we defined, but you'll notice that there are Conditions and Events. Based on that information, you could easily conclude that everything is okay. There's nothing to worry about, except that the Available condition is set to False, indicating that the minimum number of replicas is not available (MinimumReplicasUnavailable).
What the heck! Only Kubernetes can claim that everything is okay while, at the same time, something might be wrong, without giving any indication of what that something might be.
The reality is that you would need to know Kubernetes much better than a person who did not choose to get divorced because spending endless hours learning Kubernetes, on top of doing the "real" job of writing NodeJS code, is apparently better than spending time with their spouse. You would need to know that a Deployment creates a ReplicaSet, which will also not give you any meaningful information, because it created Pods, which tried to create containers which, in this case, are not working.
Here’s the proof.
kubectl --namespace infra get pods
The output is as follows.
NAME                          READY   STATUS             RESTARTS   AGE
silly-demo-7f8c98455b-djnfb   0/1     ImagePullBackOff   0          89s
That Pod, which was created by a ReplicaSet, which was created by the Deployment you wrote, could not pull the image (ImagePullBackOff).
We can confirm that by describing that specific Pod.
kubectl --namespace infra describe pod \
--selector app.kubernetes.io/name=silly-demo
The output is as follows (truncated for brevity).
...
Conditions:
  Type                        Status
  PodReadyToStartContainers   True
  Initialized                 True
  Ready                       False
  ContainersReady             False
  PodScheduled                True
Volumes:
  kube-api-access-7mhnt:
    Type:                     Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:   3607
    ConfigMapName:            kube-root-ca.crt
    ConfigMapOptional:        <nil>
    DownwardAPI:              true
QoS Class:                    Burstable
Node-Selectors:               <none>
Tolerations:                  node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                              node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type     Reason     Age                 From               Message
  ----     ------     ----                ----               -------
  Normal   Scheduled  110s                default-scheduler  Successfully assigned infra/silly-demo-7f8c98455b-djnfb to kind-control-plane
  Normal   Pulling    15s (x4 over 109s)  kubelet            Pulling image "ghcr.io/vfarcic/silly-demo:this-tag-does-not-exist"
  Warning  Failed     15s (x4 over 109s)  kubelet            Failed to pull image "ghcr.io/vfarcic/silly-demo:this-tag-does-not-exist": rpc error: code = NotFound desc = failed to pull and unpack image "ghcr.io/vfarcic/silly-demo:this-tag-does-not-exist": failed to resolve reference "ghcr.io/vfarcic/silly-demo:this-tag-does-not-exist": ghcr.io/vfarcic/silly-demo:this-tag-does-not-exist: not found
  Warning  Failed     15s (x4 over 109s)  kubelet            Error: ErrImagePull
  Normal   BackOff    3s (x6 over 108s)   kubelet            Back-off pulling image "ghcr.io/vfarcic/silly-demo:this-tag-does-not-exist"
  Warning  Failed     3s (x6 over 108s)   kubelet            Error: ImagePullBackOff
We can see, in the Conditions, that the containers are not ready (ContainersReady set to False). Further down, in the Events, we can see that it could not pull the image silly-demo with the tag this-tag-does-not-exist.
If you are a “Kubernetes ninja” it is easy to deduce all that, mostly because that was a simple scenario. Even those experienced with Kubernetes can easily get lost when faced with more complicated situations. A person who was given instructions to write a simple Deployment or who was told to modify Helm values or who was instructed to write a Custom Resource might not be able to do anything but cry.
What we need is propagation of specific events, statuses, and logs to the parent resource. Whoever applied that Deployment should be able to see that there is something wrong with the image. That person should be able to see what matters to them, and nothing else. A platform engineer in charge of internal services should make that possible, and the only question is "How?"
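To make that more concrete, here is a hedged sketch of what we would like to see when describing the parent resource itself, without digging into ReplicaSets and Pods. Kubernetes does not do this for Deployments today; the condition type, reason, and message below are made up purely for illustration.

status:
  conditions:
    # Hypothetical developer-facing condition propagated up from the failing Pod
    - type: Developer
      status: "False"
      reason: ImagePullFailed
      message: Image ghcr.io/vfarcic/silly-demo:this-tag-does-not-exist cannot be pulled. Double check the image tag.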
Now, this is a good news, bad news type of situation. I do not have a solution for Kubernetes in general, but I do have a solution for those writing their own Custom Resource Definitions (CRDs) and controllers or, even better, for those using Crossplane to do that. Let's take a look at the problem from a different angle.
The Problem With Custom Resources
Here’s yet another resource but, this time, based on a Custom Resource Definition I created through Crossplane Compositions.
cat examples/$HYPERSCALER-error.yaml
The output is as follows.
apiVersion: devopstoolkitseries.com/v1alpha1
kind: SQLClaim
metadata:
  name: my-db
spec:
  id: my-db-20240925155601
  compositionSelector:
    matchLabels:
      provider: azure
      db: postgresql
  parameters:
    version: '11'
    size: small
    region: otherregion
    databases:
      - db-01
      - db-02
That's a very simple YAML definition based on a CRD I made with Crossplane. It enables developers to manage postgresql servers, databases, schemas, and quite a few other things in AWS, Google Cloud, and, as in this case, azure.
I won’t go into details since I already explored Crossplane in quite a few videos in this channel. The only thing relevant for today’s discussion is that developers can easily create, update, or delete PostgreSQL, but we are yet to discover whether they can do day 2 operations for which they might need to see the status, relevant events, and a few other things without being overwhelmed with low-level details. After all, one of the main reasons for creating such abstractions is to surface things that matter and hide those that don’t.
Before we proceed, let me stress that today I am using Azure, but the instructions work with any of the "big three" hyperscalers. So, if you're following along, you should be able to get a similar result in AWS and Google Cloud as well.
Now, let’s apply that resource and…
kubectl --namespace infra apply \
--filename examples/$HYPERSCALER-error.yaml
…see what we’ll get.
kubectl --namespace infra get sqlclaims
The output is as follows.
NAME    SYNCED   READY   CONNECTION-SECRET   AGE
my-db   True     False                       17s
So far, everything seems to be working correctly. The database server and other resources are not yet ready, but that's to be expected since it might take a while until everything is done.
Before we proceed, let me stress that I am fully aware that not all developers like using kubectl. Some might prefer a Web UI like Backstage or Port. Others might stick with plugins in VS Code. There is an infinite number of ways people might want to interact with control planes. That should not matter for today's story since the question is whether the information people need is available at the right level of abstraction. It should be relatively easy to display the same info anywhere, as long as the info exists in the right place.
Okay. Since listing the resources seems to show that everything is working, but not necessarily ready, let's dive deeper and describe it.
kubectl --namespace infra describe sqlclaim my-db
The output is as follows (truncated for brevity).
...
Status:
  Conditions:
    Last Transition Time:  2024-09-25T14:08:10Z
    Reason:                ReconcileSuccess
    Status:                True
    Type:                  Synced
    Last Transition Time:  2024-09-25T14:08:10Z
    Message:               Claim is waiting for composite resource to become Ready
    Reason:                Waiting
    Status:                False
    Type:                  Ready
Events:
  Type    Reason                 Age                From                                                              Message
  ----    ------                 ----               ----                                                              -------
  Normal  BindCompositeResource  44s                offered/compositeresourcedefinition.apiextensions.crossplane.io  Successfully bound composite resource
  Normal  BindCompositeResource  43s (x8 over 44s)  offered/compositeresourcedefinition.apiextensions.crossplane.io  Composite resource is not yet ready
Everything seems to be working correctly, so far. It's waiting for the composite resource to become Ready but, as I already mentioned, that is not necessarily a problem since it might take a while. Even if it is an issue, how would a developer know what the composite resource is, what is not Ready, and why it's not ready? Based on the current information, the only conclusion a developer could make is that, so far, all is good and the only thing missing is a bit of patience.
Now, let's change roles and assume that we are the person who created that Composition; a person who understands Kubernetes, Crossplane, Azure (or whichever hyperscaler is used), and everything in between.
Such a person might execute the following command to see what’s going on with the resources created from that claim.
crossplane beta trace sqlclaim my-db --namespace infra
The output is as follows.
NAME                                            SYNCED   READY   STATUS
SQLClaim/my-db (infra)                          True     False   Waiting: Claim is waiting for composite resource to become Ready
└─ SQL/my-db-fmfd2                              True     False   Creating: Unready resources: firewall-rule, resourcegroup, and server
   ├─ ResourceGroup/my-db-20240925155601        False    -       ReconcileError: ...viderConfig: ProviderConfig.azure.upbound.io "default" not found
   ├─ FirewallRule/my-db-20240925155601         False    -       ReconcileError: ...viderConfig: ProviderConfig.azure.upbound.io "default" not found
   ├─ Server/my-db-20240925155601               False    -       ReconcileError: ...viderConfig: ProviderConfig.azure.upbound.io "default" not found
   ├─ ProviderConfig/my-db-20240925155601-sql   -        -
   ├─ ProviderConfig/my-db-20240925155601-sql   -        -
   ├─ Database/my-db-20240925155601-db-01       False    False   ...
   ├─ Database/my-db-20240925155601-db-02       False    False   ...
   └─ ProviderConfig/my-db-20240925155601       -        -
Look at that! The SQLClaim created by the developer created the SQL composite, which created a ResourceGroup, a FirewallRule, a Server, a few ProviderConfig resources, and Database resources. The "expert" understands what those are, how they work, why they are being created, and everything else related to running a PostgreSQL server, databases, and other resources in Azure and beyond. That cannot be said for the developer. They created a claim that is an abstraction, and it would NOT be reasonable to expect the same level of understanding of the low-level resources created from it.
The "expert" can immediately see that there is a ReconcileError. Something's wrong with the ProviderConfig. It's nowhere to be found.
The "expert" would probably describe one of the resources created by that claim.
kubectl describe $MANAGED_RESOURCE my-db
The output is as follows (truncated for brevity).
...
API Version:  azure.upbound.io/v1beta1
Kind:         ResourceGroup
...
Status:
  At Provider:
  Conditions:
    Last Transition Time:  2024-09-25T14:08:10Z
    Message:               connect failed: cannot initialize the Terraform plugin SDK async external client: cannot get terraform setup: cannot get referenced ProviderConfig: ProviderConfig.azure.upbound.io "default" not found
    Reason:                ReconcileError
    Status:                False
    Type:                  Synced
Events:
  Type     Reason                   Age                 From                                                   Message
  ----     ------                   ----                ----                                                   -------
  Warning  CannotConnectToProvider  45s (x7 over 106s)  managed/azure.upbound.io/v1beta1, kind=resourcegroup  cannot initialize the Terraform plugin SDK async external client: cannot get terraform setup: cannot get referenced ProviderConfig: ProviderConfig.azure.upbound.io "default" not found
We can see that it cannot get referenced ProviderConfig. A person experienced with Crossplane would know immediately that the default ProviderConfig is missing and that's the reason why the ResourceGroup cannot be created.
A developer might not even know that there is such a thing as a resource group.
We're done with the depressing part. Let's take a look at the solution.
Status Propagation
Before we proceed, let me stress that what you are about to see is a solution in Crossplane. If you are building CRDs and controllers or operators yourself, you should be able to replicate the same behavior. If you are using a third-party tool to do that, you might want to yell at the maintainers of the project you chose and ask them to implement something similar. In any case, think of what follows not only as a showcase of a solution in Crossplane but also as a path forward for other projects.
Let’s update the Composition to a newer version in which I implemented the solution.
yq --inplace \
'.spec.package = "xpkg.upbound.io/devops-toolkit/dot-sql:v0.8.138"' \
config.yaml
kubectl apply --filename config.yaml
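For context, config.yaml is a Crossplane Configuration that references the dot-sql package; roughly, it looks something like the sketch below. The metadata.name is an assumption; spec.package is the only field the yq command above actually changed.

apiVersion: pkg.crossplane.io/v1
kind: Configuration
metadata:
  name: dot-sql                # assumption; the actual name is whatever the repo uses
spec:
  package: xpkg.upbound.io/devops-toolkit/dot-sql:v0.8.138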
Now we’re changing the role again. From now on, we are a developer who created that claim.
Let’s describe it, again.
kubectl --namespace infra describe sqlclaim my-db
The output is as follows (truncated for brevity).
...
Status:
  Conditions:
    Last Transition Time:  2024-09-25T14:08:10Z
    Reason:                ReconcileSuccess
    Status:                True
    Type:                  Synced
    Last Transition Time:  2024-09-25T14:08:10Z
    Message:               Claim is waiting for composite resource to become Ready
    Reason:                Waiting
    Status:                False
    Type:                  Ready
    Last Transition Time:  2024-09-25T14:10:55Z
    Message:               providerConfig is missing. Contact service owner.
    Reason:                FailedToConnect
    Status:                False
    Type:                  Developer
...
Look at that!
A new Status appeared automagically. Besides the message that it is waiting for composite resource to become Ready, which we saw before, now we have a new one. The type is Developer, so it's easy to see for whom it is. It says that it FailedToConnect because the providerConfig is missing. It instructs the developer to Contact service owner.
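Since that information lives in the claim itself, any interface can pick it up. For example, a script or a portal backend could extract just the Developer condition with plain kubectl; the jsonpath below simply filters the conditions array.

kubectl --namespace infra get sqlclaim my-db \
    --output jsonpath='{.status.conditions[?(@.type=="Developer")].message}'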
If you do not see the providerConfig is missing. Contact service owner message, Crossplane probably did not yet do a new round of reconciliation. Wait for a few moments and describe the claim again.
Typically, there are two types of issues when consuming services created by others, even when "others" are platform engineers in your company. In some cases, the service itself is not working correctly and only the service owner can fix it. Only the person or the team in charge of it can make it work. In other cases, an error can be fixed by the service consumer, in this case the developer. That's similar to, let's say, consuming services from hyperscalers like AWS, Azure, and Google Cloud. If a whole zone goes down, they will let us know that they need to fix it. There's nothing we can do. On the other hand, if we do something wrong ourselves, like, for example, specify a wrong region, they will not do anything. It's up to us to specify the correct one. More often than not, there is a clear division between the responsibilities of the service provider and us.
This is a similar situation. There’s nothing the developer can do to fix the issue so the only thing left is to contact whomever is in charge of it to fix it. We’ll see the other case soon.
Bear in mind that the status we just saw does not come out of the box. It’s not something baked into Crossplane but a conscious choice of the person who wrote the Composition, the service. The service owner.
For now, let’s change the role again. Now we are the person in charge of it, the service owner. I just got an email from an angry developer saying that there is something wrong with the providerConfig and that we should fix it. There is also a footnote in that email stating that we should have a better alerting system and not wait for angry emails. They’re right, but that would be a subject of a different post.
Since this is not a tutorial about Crossplane, I'll skip the explanation of what is needed and we'll just apply the fix.
kubectl apply \
--filename examples/provider-config-$HYPERSCALER.yaml
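I won't dig into that file here but, as a rough sketch, a ProviderConfig for the Upbound Azure provider typically looks something like the following. The Secret namespace, name, and key below are assumptions; the name default matches the one the managed resources were complaining about.

apiVersion: azure.upbound.io/v1beta1
kind: ProviderConfig
metadata:
  name: default
spec:
  credentials:
    source: Secret
    secretRef:
      namespace: infra      # assumption
      name: azure-creds     # assumption
      key: creds            # assumption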
The provider config was created and the claim should, supposedly, maybe, hopefully, work. We'll see.
We reply back to the developer saying “It’s fixed. You’re good to go. Sorry for the inconvenience.”
With that, we’re switching the role back to the developer. What would that person do after receiving the email?
The logical course of action would be to verify whether the claim now works by describing it again, while fantasising about how much nicer it would be if that information were in Backstage or Port. We should send an email to the platform engineering team asking them to incorporate this into the portal, and they might write a comment on this video asking Viktor to use that as the next subject.
kubectl --namespace infra describe sqlclaim my-db
The output is as follows (truncated for brevity).
...
Status:
  Conditions:
    Last Transition Time:  2024-09-25T14:08:10Z
    Reason:                ReconcileSuccess
    Status:                True
    Type:                  Synced
    Last Transition Time:  2024-09-25T14:08:10Z
    Message:               Claim is waiting for composite resource to become Ready
    Reason:                Waiting
    Status:                False
    Type:                  Ready
    Last Transition Time:  2024-09-25T14:12:28Z
    Message:               selected region otherregion is not available. Double check the `spec.parameters.region` value.
    Reason:                FailedToConnect
    Status:                False
    Type:                  Developer
...
If you do not see the selected region otherregion is not available… message, Crossplane probably did not yet do a new round of reconciliation. Wait for a few moments and describe the claim again.
The previous message disappeared or, to be more precise, it was replaced with a new one. Now it says that the selected region otherregion is not available. That makes sense. Even if we are not proficient in Azure, it is highly unlikely that there is a region called otherregion. More importantly, this time, the message does not say that we should contact the service owner but instructs us to Double check the spec.parameters.region value. This is the case when it's not the fault of the service owner. The service or, in this case, the Composition works correctly. We made a mistake. There is no need to yell at anyone. We should own this one and fix it by changing the region to a correct one.
Here’s a modified version of the manifest.
cat examples/$HYPERSCALER.yaml
The output is as follows.
apiVersion: devopstoolkitseries.com/v1alpha1
kind: SQLClaim
metadata:
  name: my-db
spec:
  id: my-db-20240925155601
  compositionSelector:
    matchLabels:
      provider: azure
      db: postgresql
  parameters:
    version: '11'
    size: small
    region: eastus
    databases:
      - db-01
      - db-02
The only difference is that, this time, we have eastus as the region. That sounds like a correct one and, at the same time, makes us want a Backstage or Port integration even more, since we could have a drop-down list of available regions instead of guessing it ourselves. Someone should definitely tell Viktor to explore that in one of the next videos.
Let’s apply that change,…
kubectl --namespace infra apply \
--filename examples/$HYPERSCALER.yaml
…and describe the claim again.
kubectl --namespace infra describe sqlclaim my-db
The output is as follows (truncated for brevity).
...
Status:
  Conditions:
    Last Transition Time:  2024-09-25T14:08:10Z
    Reason:                ReconcileSuccess
    Status:                True
    Type:                  Synced
    Last Transition Time:  2024-09-25T14:08:10Z
    Message:               Claim is waiting for composite resource to become Ready
    Reason:                Waiting
    Status:                False
    Type:                  Ready
    Last Transition Time:  2024-09-25T14:13:59Z
    Message:               So far so good
    Reason:
    Status:                True
    Type:                  Developer
...
If you do not see the So far so good message, Crossplane probably did not yet do a new round of reconciliation. Wait for a few moments and describe the claim again.
This time, we are greeted with the So far so good message, indicating that everything seems to be working correctly, for now. Crossplane is, probably, creating the resource group, the firewall rule, the server, and all the other resources we, the developers, should not care about.
A while later, if we list all the sqlclaims…
kubectl --namespace infra get sqlclaims
The output is as follows.
NAME    SYNCED   READY   CONNECTION-SECRET   AGE
my-db   True     True                        18m
…we can see that it is now READY.
If the READY column is not True, some of the resources are still being created. It might take anything between a minute and ten minutes or more to create all the resources, depending on which hyperscaler you're using.
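If you'd rather not keep re-running that command, something like kubectl wait should work here as well, since the claim exposes a Ready condition; the timeout below is an arbitrary guess.

kubectl --namespace infra wait sqlclaim my-db \
    --for=condition=Ready --timeout=30m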
Hurray!
The database is up-and-running. It was easy for the developer not only to create it, but also to see whether it's working and, if it's not, what the problem is and whether it is something they can fix themselves or something that requires contacting the service owner. The service owner managed to propagate the information that matters, and keep the details that don't where they belong.
They lived happily ever after.
We saw the result of propagation of statuses to parent resources. Now it’s time to see how it’s done.
How It’s Done
The whole logic that propagates statuses from children to parent or root resources is done through the Status Transformer Crossplane Function. I’ll show you how it works in a second, right after we go through a bit of history.
A while ago, I complained about Kubernetes event and status propagation. Those complaints resulted in the Kubernetes Events Are Broken (If You Are Building a Developer Portal) post which, essentially, claimed that without a mechanism to propagate events and statuses we cannot build a developer platform on top of Kubernetes without limiting ourselves to day zero operations. Long story short, that resulted in this proposal, which was eventually picked up by the community, and the end result is the function that enables us to filter, transform, and propagate statuses all the way to the top of the hierarchy tree, to composite resources and claims.
Let's take a look at the Composition that made it possible to create claims.
cat package/compositions.yaml
The output is as follows (truncated for brevity).
...
    - step: statuses
      functionRef:
        name: crossplane-contrib-function-status-transformer
      input:
        apiVersion: function-status-transformer.fn.crossplane.io/v1beta1
        kind: StatusTransformation
        statusConditionHooks:
          - matchers:
              - resources:
                  - name: resourcegroup
                conditions:
                  - type: Synced
            setConditions:
              - target: CompositeAndClaim
                force: true
                condition:
                  type: Developer
                  status: 'True'
                  message: So far so good
          - matchers:
              - resources:
                  - name: resourcegroup
                conditions:
                  - type: Synced
                    status: 'False'
                    reason: ReconcileError
                    message: (.*)cannot get referenced ProviderConfig(.*)
            setConditions:
              - target: CompositeAndClaim
                force: true
                condition:
                  type: Developer
                  status: 'False'
                  reason: FailedToConnect
                  message: providerConfig is missing. Contact service owner.
          - matchers:
              - resources:
                  - name: resourcegroup
                conditions:
                  - type: Synced
                    status: 'False'
                    reason: ReconcileError
                    message: (.*)cannot get referenced ProviderConfig(.*)
            setConditions:
              - target: CompositeAndClaim
                force: true
                condition:
                  type: Developer
                  status: 'False'
                  reason: FailedToConnect
                  message: providerConfig is missing. Contact service owner.
          - matchers:
              - resources:
                  - name: resourcegroup
                conditions:
                  - type: Synced
                    status: 'False'
                    reason: ReconcileError
                    message: (.*)The specified location '(?P<Region>.*)' is invalid(.*)
            setConditions:
              - target: CompositeAndClaim
                force: true
                condition:
                  type: Developer
                  status: 'False'
                  reason: FailedToConnect
                  message: selected region {{ .Region }} is not available. Double check the `spec.parameters.region` value.
          - matchers:
              - resources:
                  - name: resourcegroup
                conditions:
                  - type: Synced
                    status: 'False'
                    reason: ReconcileError
                    message: (.*)The provided location '(?P<Region>.*)' is not available for resource group(.*)
            setConditions:
              - target: CompositeAndClaim
                force: true
                condition:
                  type: Developer
                  status: 'False'
                  reason: FailedToConnect
                  message: selected region {{ .Region }} is not available. Double check the `spec.parameters.region` value.
    - step: automatically-detect-ready-composed-resources
      functionRef:
        name: crossplane-contrib-function-auto-ready
  writeConnectionSecretsToNamespace: crossplane-system
As I mentioned earlier, I won't go into the details of how Crossplane Compositions are built or how Crossplane functions work. I already explored that in quite a few videos on this channel. Instead, I'll just explain the logic behind the Status Transformer function.
There are a number of statusConditionHooks. They all rely on matchers to find conditions in specific resources and on setConditions to create or update conditions in the composite resource and the Claim.
The first one, in this case, is a catch-all matcher. If the Azure resourcegroup resource has a condition of the type Synced, it sets the condition of the type Developer to True with the message So far so good. Since conditions can be overwritten, that one will be displayed only if none of the other matchers are met. In other words, if we do not find any issue in the managed resources, that is the message developers will see.
Further on, we have the second matcher that looks for the reason ReconcileError and a message that contains cannot get referenced ProviderConfig. That one uses RegEx so that we don't need to look for the exact message. If that condition is met, it will overwrite the previous status of the type Developer with the message providerConfig is missing. Contact service owner.
The rest of the matchers follow the same pattern.
The one with The specified location message is an interesting one since it extracts the Region, which is later used to generate a dynamic message for end users. That message contains the value of the Region extracted earlier.
There are quite a few other ways we can propagate statuses and even events. I invite you to check the documentation.
For now, what matters is that we can design services in Kubernetes based on CRDs and controllers or operators, and that we need to think not only about day zero operations but about the whole experience which, among other things, includes custom developer-friendly statuses and events. Those can be visualized in many different ways, be it through kubectl as we did today, through Web UIs, or through any other interface. As long as the data is available in the top-level resources, it should not be a problem to present it.
If you’re creating CRDs and operators yourself from scratch, you might want to implement a similar logic. If you’re using Crossplane, the Status Transformer Function does the heavy lifting and all you have to do is define matchers and set conditions that should be propagated to the top resource.
Thank you for watching. See you in the next one. Cheers.
Destroy
chmod +x examples/destroy.nu
./examples/destroy.nu
yq --inplace \
'.spec.package = "xpkg.upbound.io/devops-toolkit/dot-sql:v0.8.132"' \
config.yaml