GitOps Broke CI/CD! Here is How to Fix It With Argo Events
GitOps is amazing, especially for those using Kubernetes. Yet, GitOps poses a significant problem when it comes to the execution of tasks that should be performed after deployments. For example, we cannot run functional tests through workflows or pipelines like GitHub Actions or Jenkins. GitOps broke CI/CD processes, and we need to fix that.
Let me explain.
Imagine the following workflow steps.
- Checkout code
- Run unit tests
- Build and push a new container image
- Deploy the new release
- Run functional tests
The whole process is often much more complicated than that. Still, those few steps should be enough for me to make the point I’m about to make.
Now, if we are practicing GitOps, the step that deploys the code is actually not deploying anything but, instead, it is making changes to manifests and pushing them to a repo. From there on, Argo CD or Flux would realize that there is a change in the desired state and make changes to the cluster so that the actual state is the same.
GitOps requires rethinking CI/CD to accommodate decoupled deployment processes.
The problem is in whichever steps we might have in our workflows after the deployment. How do we, for example, run functional tests or do whatever else needs to be done once a new release is running?
We cannot let the workflow run continue its course after changes are pushed to Git since there is a decoupled process that does the deployment or, to be more precise, the reconciliation between the desired state (what is in Git) and the actual state (what is in a cluster). We cannot know for sure when that process will start nor when it will end. So, we have to remove those steps from that specific workflow run and, probably, execute them separately. Hence, we have two, if not more, workflows with decoupled and asynchronous processes in between. We need a way to trigger that second workflow run, and the only way to do that is by monitoring the state of resources and executing some kind of trigger when those we're interested in are created or updated.
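In the case of GitHub Actions, such a trigger can be as simple as an HTTP request. Here's a minimal sketch of the mechanism we'll rely on later in this post; OWNER and REPO are placeholders, and the functional-tests event type is just an example.

# A sketch of the trigger we're after: GitHub's repository_dispatch
# endpoint starts workflow runs without any push to Git.
curl --request POST \
    --header "Accept: application/vnd.github.everest-preview+json" \
    --header "Authorization: token $GITHUB_TOKEN" \
    --data '{"event_type": "functional-tests"}' \
    "https://api.github.com/repos/OWNER/REPO/dispatches"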
Here’s an example.
Do not run the commands just yet. This is only a preview. We’ll go through all the commands starting with the setup later.
I’ll change the source code of the app and push changes to the Git repo.
git add .
git commit -m "Silly demo"
git push
If we open GitHub Actions…
gh repo view --web
…we can see that the ci workflow run was executed.
That workflow ended up pushing changes to a manifest and stopped. There was nothing else for it to do since that’s where Argo CD kicks in. It detected changes to the repo and synchronized them into the cluster.
However, a few moments later, a new workflow run was executed even though we did not push anything to any Git repo.
How did that happen? What executed that workflow run? How did it know that a new release of the application was deployed to the cluster?
The answer to those and a few other questions lies in event-based workflows and Argo Events.
What Is Argo Events?
A simple way to describe Argo Events would be to say that it orchestrates workflow or pipeline runs. It does that orchestration by monitoring specific events and triggering processes.
Almost all of us are already using such an architecture. When we push a change to a Git repo, that change is an event that triggers workflow runs. That’s how most CI/CD processes are executed. That’s what we’ve been doing for decades, except that it was SVN or CVS back in the day when I was young.
However, changes to Git repositories are only a fraction of events our systems are generating. There are events generated in queues and pub/subs like AWS SQS, Azure Service Bus, NATS, Kafka, Redis and others. Then there are Kubernetes events, file events, and many others. Our systems are generating events all the time. We can monitor those and use them to trigger some processes.
Now, I won’t go deeper into Argo Events. Actually, I won’t explain Argo Events at all. I already did that in the Argo Events - Event-Based Dependency Manager for Kubernetes video and I don’t like repeating myself.
Today I want to focus on solving the GitOps problem and figure out how to connect broken workflow runs. More specifically, we’ll explore how to execute whatever needs to be executed after GitOps tools create or update something in our clusters.
Setup
Please watch the GitHub CLI (gh) - How to manage repositories more efficiently video if you are not familiar with GitHub CLI.
gh repo fork vfarcic/argo-events-gh-demo --clone --remote
cd argo-events-gh-demo
gh repo set-default
Select the fork as the default repository
Make sure that Docker is up-and-running. We’ll use it to create a KinD cluster.
Watch Nix for Everyone: Unleash Devbox for Simplified Development if you are not familiar with Devbox. Alternatively, you can skip Devbox and install all the tools listed in devbox.json yourself.
devbox shell
kind create cluster
kubectl create namespace a-team
helm upgrade --install argocd argo-cd \
--repo https://argoproj.github.io/argo-helm \
--namespace argocd --create-namespace \
--values values-argocd.yaml --wait
REPO_URL=$(git config --get remote.origin.url)
yq --inplace ".spec.source.repoURL = \"$REPO_URL\"" \
apps-argocd.yaml
yq --inplace ".spec.source.repoURL = \"$REPO_URL\"" \
apps/silly-demo.yaml
git add .
git commit -m "Make it my own"
git push
kubectl apply --filename apps-argocd.yaml
helm upgrade --install argo-events argo-events \
--repo https://argoproj.github.io/argo-helm \
--namespace argo-events --create-namespace \
--set webhook.enabled=true --wait
kubectl --namespace argo-events apply \
--filename https://raw.githubusercontent.com/argoproj/argo-events/stable/examples/eventbus/native.yaml
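Optionally, before proceeding, we can confirm that the event bus is up. This is an extra sanity check on my part, not part of the original setup.

# Optional: the event bus and its pods should be up before we continue
kubectl --namespace argo-events get eventbus
kubectl --namespace argo-events get pods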
Replace [...] with the GitHub token in the command that follows.
export GITHUB_TOKEN=[...]
kubectl --namespace argo-events \
create secret generic github \
--from-literal token="token $GITHUB_TOKEN"
Replace [...] with the GitHub user in the command that follows.
export GITHUB_USER=[...]
yq --inplace \
".spec.triggers[0].template.http.url = \"https://api.github.com/repos/$GITHUB_USER/argo-events-gh-demo/dispatches\"" \
sensor-deployment.yaml
gh repo view --web
Click the Actions button from the top menu and click the I understand my workflows, go ahead and enable them button.
Click the Settings button, choose Actions followed by General in the left-hand menu, select the Read and write permissions checkbox in the Workflow permissions section, and click the Save button.
Argo Events Setup
Once we have Argo Events installed we still need to set up a few things specific to the events we will be watching and triggering.
To begin with, we need to create a service account, a role, and a role binding so that we have the permissions to observe or operate resources we’re interested in.
Here’s an example.
cat sa.yaml
The output is as follows.
apiVersion: v1
kind: ServiceAccount
metadata:
name: argo-events
namespace: argo-events
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
name: argo-events
rules:
- apiGroups:
- "apps"
verbs:
- "*"
resources:
- deployments
- replicasets
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
name: argo-events
roleRef:
apiGroup: rbac.authorization.k8s.io
kind: ClusterRole
name: argo-events
subjects:
- kind: ServiceAccount
name: argo-events
namespace: argo-events
If you are familiar with Kubernetes, there should be no need for an explanation. We are creating a service account argo-events that will have permissions to do anything with deployments and replicasets. Now, if that is confusing and you need a more detailed explanation, all I will say is that you won’t find it here. I’m assuming that you are not new to Kubernetes. Otherwise, go away.
Typically, we should push those manifests to Git and let Argo CD take care of the rest, but we won’t. In the interest of getting to the important part as fast as possible, I’ll commit the cardinal sin of applying them directly with kubectl.
kubectl apply --filename sa.yaml
Next, we need an Event Source.
cat event-source-deployment.yaml
The output is as follows.
apiVersion: argoproj.io/v1alpha1
kind: EventSource
metadata:
name: app
spec:
template:
serviceAccountName: argo-events
resource:
app:
namespace: a-team
group: apps
version: v1
resource: deployments
eventTypes:
- ADD
- UPDATE
filter:
afterStart: true
EventSource resources are essentially configurations that define how to consume events, transform them into cloud events, and send them to the Argo Events Event Bus.
In this specific case, we are using the argo-events Service Account we created earlier so that the EventSource can observe the resources we’re interested in.
There are many possible event sources. We could consume events from a bunch of them. Here comes the list: AMQP, AWS SNS, AWS SQS, Azure Service Bus, Azure Queue Storage, Calendar, Emitter, File, GCP Pub/Sub, GitHub, GitLab, Bitbucket (Cloud), Bitbucket Server, Kafka, Minio, MQTT, NATS, NSQ, Redis, Redis Streams, Resource, Webhook, and Pulsar.
I got tired just by reading those and I’m certainly not going to go through all of them in more detail, mostly because only one of those matters for today’s story.
We want to observe specific Kubernetes resources, so that manifest defines a resource named app. That’s an arbitrary name that does not matter much. What does matter is that it will watch deployments. Specifically, it will generate a cloud event every time a deployment is created (ADD) or updated (UPDATE) inside the a-team Namespace.
In other words, we’re generating cloud events whenever specific resources in Kubernetes are created or updated. We are not specifying what to do with those events just yet. We’ll do that soon. For now, let’s apply that manifest.
kubectl --namespace argo-events apply \
--filename event-source-deployment.yaml
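If you'd like to confirm that the manifest was processed, listing the EventSources should show it; the controller should also spin up a dedicated event source Pod. This is an optional check, not part of the original instructions.

# Optional: confirm that the EventSource was registered
kubectl --namespace argo-events get eventsources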
Next there are sensors, like this one.
cat sensor-deployment.yaml
The output is as follows.
apiVersion: argoproj.io/v1alpha1
kind: Sensor
metadata:
name: app
spec:
template:
serviceAccountName: argo-events
dependencies:
- name: app
eventSourceName: app
eventName: app
triggers:
- template:
name: github-actions
http:
url: https://api.github.com/repos/vfarcic/argo-events-gh-demo/dispatches
method: POST
payload:
- src:
dependencyName: app
dataKey: body.metadata.labels.actions\.github\.com\/name
dest: event_type
- src:
dependencyName: app
dataKey: body
dest: client_payload.resource
headers:
Accept: "application/vnd.github.everest-preview+json"
secureHeaders:
- name: Authorization
valueFrom:
secretKeyRef:
name: github
key: token
retryStrategy:
steps: 3
policy:
status:
allow:
- 200
- 201
- 204
Sensors are essentially triggers we can use to specify what to do when an event is detected. They contain two major sections.
First, there are dependencies. That’s where we’re saying that it depends on the app EventSource we applied earlier.
Most of the action is in triggers. We could have specified that we’d like to start an Argo Workflows run, or to invoke an AWS Lambda, or to send a message to NATS or Kafka, or to create a Kubernetes resource, or to send it to logs, or to invoke a function in OpenWhisk, or to send a message on Slack, or an Azure event, or to create a Pulsar topic. We could also build our own HTTP trigger.
We won’t use any of those since all we need is to send an HTTP request to GitHub Actions, so we’re using the http trigger.
The url and method fields should be self-explanatory.
The payload section is an array of instructions that, essentially, say “get data from there and add it as payload”. Since triggering GitHub Actions workflows requires an event_type, and given that I like providing flexibility, instead of hard-coding it we’re getting it from one of the labels of the Deployment.
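For that to work, the Deployment needs to carry the label the Sensor's dataKey reads. Here's a sketch of what I'm assuming this demo's manifest looks like; the value must match the repository_dispatch type the functional-tests workflow listens for.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: silly-demo
  labels:
    # Read by the Sensor's dataKey and used as the event_type
    actions.github.com/name: functional-tests
spec:
  template:
    metadata:
      labels:
        # ReplicaSets inherit the Pod template's labels, which will matter
        # once we switch to watching ReplicaSets later on
        actions.github.com/name: functional-tests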
Also, given that it might be useful to have the whole definition of the resource that triggered the actions, we’re passing it (body) as the client_payload.resource field. We’ll see how to use that one soon.
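Putting those payload instructions together, the request body the Sensor sends to GitHub should look roughly like this. This is my sketch based on the mappings above; the real resource field contains the entire object, which I've elided.

{
  "event_type": "functional-tests",
  "client_payload": {
    "resource": {
      "apiVersion": "apps/v1",
      "kind": "Deployment",
      "metadata": {"...": "..."}
    }
  }
}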
Further on, we’re adding the Accept and Authorization headers, with the latter retrieved from the github secret for obvious reasons.
Finally, the trigger is considered successful if the destination returns a 200, 201, or 204 response.
It’s all pretty straightforward, yet, as you will see soon, it does not work. It fails miserably. Nevertheless, I was not supposed to reveal the “disaster” just yet, so please ignore what I said, and let’s just apply that manifest.
kubectl --namespace argo-events apply \
--filename sensor-deployment.yaml
That’s it. We’re ready to see it in action and experience the disaster it will create.
Combining Workflow Runs and GitOps
Modify main.go or any other file so that there is a change that can be pushed to Git.
I’ll simulate development by changing something in the source code. It’s a silly demo so there’s not much to it, yet enough to demonstrate the “glue” required to run workflows or pipelines.
Once the code is changed, all we have to do is push it to Git.
git add .
git commit -m "Silly demo"
git push
That’s it. That’s where humans stop and machines take over. The action starts with GitHub Actions, even though we’d get the same result with Jenkins, Argo Workflows, Tekton, or any other similar solution.
gh repo view --web
Open the Actions tab and observe that the ci workflow is running.
We can see that the ci workflow was executed.
While it’s running, let’s take a quick look at its definition.
cat .github/workflows/ci.yaml
The output is as follows.
name: ci
run-name: ci
on:
push:
branches:
- main
jobs:
build-container-image:
runs-on: ubuntu-latest
env:
IMAGE: ttl.sh/silly-demo-${{github.triggering_actor}}:${{github.run_number}}
FORCE_COLOR: 1
steps:
- uses: earthly/actions-setup@v1
with:
version: v0.8.0
- name: Checkout
uses: actions/checkout@v4
- name: Set up QEMU
uses: docker/setup-qemu-action@v3
- name: Set up Docker Buildx
uses: docker/setup-buildx-action@v3
- name: Run unit tests
run: |
echo "Running unit tests..."
- name: Build and push
uses: docker/build-push-action@v5
with:
push: true
tags: ${{ env.IMAGE }}
platforms: linux/amd64,linux/arm64
- name: Modify image
run: |
yq --inplace \
".spec.template.spec.containers[0].image = \"${{ env.IMAGE }}\"" \
silly-demo/deployment.yaml
- name: Commit changes
run: |
git config --local user.email "41898282+github-actions[bot]@users.noreply.github.com"
git config --local user.name "github-actions[bot]"
git add .
git commit -m "Release ${{ env.IMAGE }} [skip ci]"
- name: Push changes
uses: ad-m/github-push-action@master
with:
github_token: ${{ secrets.GITHUB_TOKEN }}
branch: ${{ github.ref }}
It checks out the code (Checkout), prepares Docker to build images (Set up QEMU, Set up Docker Buildx), runs “fake” unit tests (Run unit tests), and builds and pushes a new container image (Build and push).
What comes now is the important part that differs from what we were typically doing in the past. Instead of deploying the new container image to the cluster, we are modifying the Deployment manifest by changing the image to the one we just built (Modify image) and, after that, committing (Commit changes) and pushing (Push changes) the changes back to the repo.
In other words, that workflow is not deploying anything but updating manifests in the Git repo. After that, it cannot do anything else simply because it has no idea who or what will apply those changes nor when those changes will be applied. That workflow is completely decoupled from the deployment process which, in our case, will be done by Argo CD.
We can confirm that’s what really happened by pulling the latest changes from Git.
git pull
We should see from the output that silly-demo/deployment.yaml was modified, thus confirming that the first workflow run was successful.
From now on, Argo CD should detect the change and we should see new Pods deployed to the cluster.
kubectl --namespace a-team get pods
The output is as follows.
NAME READY STATUS RESTARTS AGE
silly-demo-5c444b879b-48pq2 1/1 Running 0 3s
silly-demo-5c444b879b-jhrpq 1/1 Running 0 7s
Judging by the AGE, those are indeed new Pods created by Argo CD as a result of changing the image in the Deployment manifest. The second process worked as expected.
Chances are that, so far, you haven’t seen anything new. You probably already know how to trigger workflow or pipeline runs through commits to Git, and you are likely already using Argo CD or Flux to sync the desired state stored in Git and the actual state in your Kubernetes clusters.
What comes next is new, and we can see it by going back to the GitHub repo…
gh repo view --web
…and opening the Actions tab.
We can see that a new workflow, functional-tests, was executed. That’s the workflow triggered by Argo Events once it detected that a Deployment was created or updated. That is also the disaster I mentioned earlier.
Instead of triggering one workflow run that executes functional tests, we got plenty. What we did is pointless. There is no good reason why we would run functional tests many times for a single change, yet, Argo Events did not do anything wrong. As a matter of fact, Argo Events triggered all those workflows correctly.
We said that we’d like to trigger a workflow every time a Deployment is created (added) or updated, so that’s what Argo Events did, even though it might look like it went berserk.
You see, it’s not only us who modify Kubernetes resources. Kubernetes itself modifies them as well. It adds annotations, statuses, and quite a few other things to resources. Otherwise, without statuses, we would not know what’s going on.
We can blame the Deployment itself for not having a clear indication of when its child resources are operational. On the other hand, the filters offered by Argo Events are terrible, which complicates the situation as well.
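To be fair, Sensors do support filters. Here's a sketch of the data filter syntax, assuming we wanted to react only to events from the a-team Namespace; the trouble is that no field on the Deployment cleanly distinguishes our changes from Kubernetes' own status updates, so a filter like this would not solve the problem on its own.

dependencies:
  - name: app
    eventSourceName: app
    eventName: app
    filters:
      data:
        # Keep only events whose resource is in the a-team Namespace
        - path: body.metadata.namespace
          type: string
          value:
            - a-team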
Nevertheless, I do have a solution that prevents multiple events from firing due to Kubernetes updating resources. We’ll get to it soon. For now, let’s take a quick look at the definition of the workflow that was executed.
cat .github/workflows/functional-tests.yaml
The output is as follows.
name: functional-tests
run-name: functional-tests
on:
repository_dispatch:
types:
- functional-tests
jobs:
run-functional-tests:
runs-on: ubuntu-latest
steps:
- name: Functional tests
run: |
echo "The following resource was created or updated:"
echo
echo '${{ github.event.client_payload.resource }}' | jq .
echo "Running functional tests..."
There’s not much going on there. We’re outputting client_payload.resource. We’re not doing anything with it, but we could. If that run needed some additional information, it could extract it from the resource passed to it. I just wanted to demonstrate that Argo Events can pass any information to a workflow run, including the resource that initiated it.
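For instance, a step along these lines (a hypothetical addition, not part of the demo) could extract the image of the resource that triggered the run. It works for both the Deployment and the ReplicaSet variants we'll see, since both carry a Pod template.

- name: Functional tests
  run: |
    # Hypothetical: pull the image out of the resource that triggered the run
    echo '${{ github.event.client_payload.resource }}' \
        | jq -r '.spec.template.spec.containers[0].image'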
Below it, we can see the Running functional tests... message. I was too lazy to add real tests. I’m sure that you know how to test your releases. Also, please note that it does not have to run only tests. That workflow should execute whichever tasks you need to run after each new deployment.
Let’s get back to solving the problem with too many events being triggered.
As you hopefully know, Kubernetes Deployments create ReplicaSets which, in turn, manage Pods. Every time we make a change to a Deployment, a new ReplicaSet is created. That means we can watch for events that create ReplicaSets without the need to watch for updates. As a result, if everything turns out as expected, Argo Events should trigger a single GitHub Actions workflow run every time we create or update a Deployment. It’s an indirect route towards solving the issue we’re having.
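If you'd like to see that behavior for yourself, listing the ReplicaSets after a change to the Deployment should show a new one next to the old, scaled-down ones. This is an optional check, not part of the original instructions.

# Optional: each Deployment change produces a new ReplicaSet,
# while status updates do not
kubectl --namespace a-team get replicasets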
Here’s the diff between the current and the new version of the EventSource.
diff event-source-deployment.yaml event-source-replicaset.yaml
The output is as follows.
< resource: deployments
---
> resource: replicasets
16d15
< - UPDATE
As we can see, the only change is that we are now watching replicasets instead of deployments and that updates (UPDATE) should no longer trigger events.
Let’s apply the new version of the EventSource.
kubectl --namespace argo-events apply \
--filename event-source-replicaset.yaml
Next, we’ll modify the source code.
Modify main.go or any other file so that there is a change that can be pushed to Git.
Finally, we’ll push the changes to the repo.
git add .
git commit -m "Silly demo"
git push
Now we need to wait until the ci workflow run is executed and Argo CD detects and synchronizes the changes to the cluster. Once that happens, we should get a new functional-tests workflow run.
A few minutes later…
gh repo view --web
If we take a look at the functional-tests workflow runs, we can see that only one was executed.
Mission accomplished. To quote Borat, one of the biggest philosophers of our time: “Great success!”
Here’s what happened.
What Happened?
One of the GitHub Actions workflows is configured to run on every commit to the mainline, so a run was executed. It checked out the code, ran unit tests, built an image and pushed it to a registry, modified the image in the Deployment manifest, and pushed the changes back to the repo. That’s the end of the workflow run. It did not deploy anything, nor did it execute any tasks we might want to run after deployments.
Argo CD is unaware of what we do in workflows. It does not care about any of them. It’s decoupled from other processes. All it cares about is the drift between the desired and the actual state. So, once changes to the manifests were pushed to the repo, it detected drift between the desired state (the manifests) and the actual state (the Kubernetes resources). It did whatever it needed to do to remove the drift by making changes to the actual resources. That’s the GitOps equivalent of a “traditional” deployment. Just as Argo CD was unaware of the workflow run that made changes to the manifests stored in Git, it is also unaware of what needs to be done afterwards. It does not care. It has a single job and it does not care about jobs done by anyone else.
The EventSource, an Argo Events component that is also unaware of GitHub Actions and Argo CD, detected a new ReplicaSet created by the Deployment, and the Sensor triggered an HTTP request to the GitHub Actions functional-tests workflow. That resulted in a run of that workflow that executed functional tests.
That’s a simple version of the process and I’ll leave it to you to expand it to your own needs.
Here’s the key takeaway for today: Argo Events effectively solves GitOps’ post-deployment challenges by intelligently triggering workflows.
Destroy
kind delete cluster
yq --inplace \
".spec.template.spec.containers[0].image = \"ghcr.io/vfarcic/silly-demo:1.4.171\"" \
silly-demo/deployment.yaml
git add .
git commit -m "Revert [skip ci]"
git push