Testing in Production! Progressive Delivery with Canary Deployments Explained!
Today I will make an outrageous claim. Ready? Here it goes… The only testing that truly matters is testing in production. The only way to truly verify that a release is working as expected is to run it in production with “real” users and “real” workload. Testing a release before it reaches production is helpful and I am certainly not going to tell you to stop writing and running your unit tests, and functional tests, and integration tests, and whichever other type of testing you might normally do. What I am going to tell you is that you have to test your releases in production. Confirmation that “real” users got what they expected is the only thing that truly matters.
Intro
Here’s the thing. We cannot truly know that a new release will be working correctly before we deploy it to production. We can certainly be more confident by running all sorts of tests, but we cannot be sure. As a result of that uncertainty, we have to test a release after it’s been released to production.
Here comes the important note. Almost everyone is testing releases after they’ve been deployed to production, but not everyone is aware that’s what they’re doing. You see, every single company that has some sort of observability is effectively testing in production. But, before I back up that claim, let me quickly explain what testing is.
In a nutshell, testing is all about performing some actions and verifying that outcomes of those actions are as expected. We click a link and we confirm that it led us to a specific page. We invoke a function with specific parameters and we test that the output of that function is as expected. We fill in some fields and click a submit button and we confirm that data was stored in a database. You get the point. Right? We perform some actions and we verify outcomes of those actions.
Now, let’s get back to observability and testing in production.
When running something in production, that something is exposed to real users. They are interacting with our applications. They are performing some actions, and they are receiving some outcomes. If we don’t do anything, our users are the only testers of applications running in production. They are even filing issues by opening support tickets.
We can do much better than that. We can let “real” users perform actions, and do the verifications ourselves. Instead of waiting for support tickets, we can be proactive and try to figure out that something is wrong ourselves. We can find out that something is wrong while only a small subset of our users is exposed to it.
Setup
git clone https://github.com/vfarcic/argo-rollouts-demo
cd argo-rollouts-demo
Please watch https://youtu.be/WiFLtcBvGMU if you are not familiar with Devbox. Alternatively, you can skip Devbox and install all the tools listed in devbox.json yourself.
devbox shell
Please watch The Future of Shells with Nushell! Shell + Data + Programming Language if you are not familiar with Nushell. Alternatively, you can inspect the setup.nu script and transform the instructions in it to Bash or ZShell if you prefer not to use that Nushell script.
chmod +x setup.nu
./setup.nu
Open two new terminal sessions and make sure that they are in the same directory as the first. All commands should be executed in the first terminal session unless specified otherwise.
Execute the command that follows in all terminal sessions.
source .env
Manual Testing Of Progressive Delivery (Canary Deployments)
I’ll make an assumption that it is impossible to ensure that every single release to production is without bugs and based on features that our users actually want. If that’s the case, if issues are unavoidable, the only thing we can do is limit the blast radius. Instead of exposing all our users to a new release, we can limit it to a subset of them and see how they react to it. If it seems to be working correctly for that subset of users, we can choose to increase the reach. If, let’s say, we started by exposing that release to ten percent of the traffic, after we confirm that our users are having a good experience with it, we can increase the exposure to twenty, then to thirty, and so on and so forth, all the way up to a hundred percent, at which point the new release is fully rolled out.
Alternatively, if, at any point, we discover that the new release is not working as expected, we can roll it back. As a result, only a subset of users will be negatively affected. That’s certainly not ideal, but it’s still much better than letting it blow up in everyone’s faces.
Such a gradual rollout of a release is called a canary deployment, which is one of a wider range of techniques called progressive delivery.
Progressive delivery is all about gradually or progressively rolling out releases. Besides canary deployments, we could employ rolling updates, blue-green, or any other similar strategy. As a matter of fact, you might already be using one of those strategies without even knowing that’s what you’re doing. For example, Kubernetes tends to rely on rolling updates. A change to a Deployment initiates a rolling update where Pods with the old release are gradually phased out while, at the same time, Pods with the new release are brought up. Similarly, most managed Kubernetes services tend to use rolling updates when upgrading nodes.
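If you want to see or tune that behavior in your own Deployments, it is controlled through the strategy field. Here is a minimal sketch; the app name matches this demo, but the surge and unavailability values are purely illustrative.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: silly-demo
spec:
  replicas: 5
  selector:
    matchLabels:
      app: silly-demo
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1 # at most one extra Pod above the desired count during an update
      maxUnavailable: 1 # at most one Pod below the desired count during an update
  template:
    metadata:
      labels:
        app: silly-demo
    spec:
      containers:
      - name: silly-demo
        image: ghcr.io/vfarcic/silly-demo:1.4.126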
All those techniques are based on a simple principle: we should release something progressively instead of the “big bang, here you go, it’s all or nothing” type of approach.
Here’s how canary deployments look in action.
cat kustomize/base/deployment.yaml
The output is as follows (truncated for brevity).
apiVersion: apps/v1
kind: Deployment
metadata:
labels:
app: silly-demo
name: silly-demo
spec:
replicas: 0
...
template:
...
spec:
containers:
- image: ghcr.io/vfarcic/silly-demo:1.4.126
...
That is a very simple Kubernetes Deployment that, under normal circumstances, would perform rolling updates of the release 1.4.126 of the silly-demo application. The only thing that makes that Deployment a bit “special” is that the number of replicas is set to 0. If we applied that Deployment alone, nothing would be running.
I set it to zero replicas since it will not be me who decides how many replicas should run. That decision will be made by Argo Rollouts which, by the way, I’m using today only to demonstrate certain principles rather than to convince you that it is better than other similar tools like, for example, Flagger.
cat kustomize/overlays/simple/rollout.yaml
The output is as follows.
---
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
name: silly-demo
spec:
replicas: 5
strategy:
canary:
steps:
- setWeight: 20
- pause: {}
- setWeight: 40
- pause: {duration: 10}
- setWeight: 60
- pause: {duration: 10}
- setWeight: 80
- pause: {duration: 10}
revisionHistoryLimit: 2
selector:
matchLabels:
app: silly-demo
workloadRef:
apiVersion: apps/v1
kind: Deployment
name: silly-demo
That is an Argo Rollouts Rollout definition. The spec starts with the number of replicas set to 5. As you’ll see later, a release of our application might not always run as five replicas. Think of that number as the final number of replicas. Also, please note that, more often than not, we should not specify a static number of replicas but, rather, use a HorizontalPodAutoscaler which, by the way, is supported by Argo Rollouts. Still, for the sake of simplicity, we’ll use a static number of replicas today.
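For reference, here is a minimal sketch of what that could look like. The HorizontalPodAutoscaler points its scaleTargetRef at the Rollout instead of at a Deployment; the name matches this demo, but the replica limits and the CPU target are purely illustrative.

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: silly-demo
spec:
  scaleTargetRef:
    apiVersion: argoproj.io/v1alpha1
    kind: Rollout # the HPA scales the Rollout, and the Rollout scales the Pods
    name: silly-demo
  minReplicas: 2
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 80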
Next, we have the strategy set to canary. There are others but, in my opinion, that’s, more often than not, the one we want.
Inside the canary strategy, we have an array of steps. In this scenario, we start by telling it to setWeight to 20. That means that approximately twenty percent of the requests will go to the new release. As a result, only twenty percent of users accessing our application will see the new release, at least during the first step. As such, we can validate whether it’s working correctly before we proceed to the next step. For now, I will assume that validation is done manually. You might be testing it yourself, or you might be watching some observability dashboards to see whether there are any anomalies, or you might be waiting for support issues from your users letting you know that they are very disappointed. The only one of those options we should be doing is watching observability metrics since they show us the information we need to deduce whether it’s working as expected. That could be error rates, latency, or anything else we might be observing today. There is a hidden motive for saying that observability is the key. I’ll get to it later.
The second step instructs Argo Rollouts to pause progression indefinitely ({}). That should give us ample time to test, observe, or collect support tickets related to the new release. The important thing is that, at this point, the release is rolled out to only a limited number of users so that the blast radius is limited. Once we decide whether it’s a good one or not, we can proceed to move forward, or roll back.
The rest of the steps keep increasing the weight to 40, then 60, and, finally, to 80 percent. No matter the steps, once the rollout is finished, it will reach a hundred percent, so there’s no need to set that explicitly.
In between those steps are pause instructions but, unlike the first one, those are not waiting for us indefinitely. Instead, they are pausing, in this case, for only 10 seconds. Since we don’t have any automated validation, that means that we have ten seconds to validate before the rollout continues to the next weight. That’s an unreasonably short period to do anything, yet it should be more than enough for our demo.
Further on, we have the selector that tells Argo Rollouts how to find the app which, in this case, is through the label app set to silly-demo.
Finally, there is workloadRef. In the past, we had to replace the typical Deployment with a Rollout. In those cases, the Rollout would manage ReplicaSets just as a Deployment does. That was, in my opinion, a bad idea. It makes much more sense to keep our application definitions independent of Argo Rollouts instead of rewriting what we have. On top of that, if we wanted our application to use custom resources, we would need to look elsewhere. To be honest, that was one of the biggest grudges I had with Argo Rollouts.
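To illustrate, a sketch of that “old” approach would embed the Pod template directly inside the Rollout, roughly like this (a simplified example, not something we will apply today).

apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: silly-demo
spec:
  replicas: 5
  selector:
    matchLabels:
      app: silly-demo
  strategy:
    canary:
      steps:
      - setWeight: 20
      - pause: {}
  template: # the Pod spec lives inside the Rollout instead of a separate Deployment
    metadata:
      labels:
        app: silly-demo
    spec:
      containers:
      - name: silly-demo
        image: ghcr.io/vfarcic/silly-demo:1.4.126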
Fortunately, that was fixed sometime in 2022 with the introduction of workloadRef. Now we can instruct Argo Rollouts to “control” any type of resource. In this case, we’re referencing the Deployment silly-demo.
That’s it. Let’s see it in action by applying the resources.
kubectl --namespace a-team apply \
--kustomize kustomize/overlays/simple
We can see what’s going on through the argo rollouts plugin for kubectl.
Execute the command that follows in the second terminal session
kubectl argo rollouts --namespace a-team get rollout silly-demo \
--watch
The output of Argo Rollouts is as follows.
...
NAME KIND STATUS AGE INFO
⟳ silly-demo Rollout ✔ Healthy 118s
└──# revision:1
└──⧉ silly-demo-5c5547db68 ReplicaSet ✔ Healthy 118s stable
├──□ silly-demo-5c5547db68-2rn84 Pod ✔ Running 118s ready:2/2
├──□ silly-demo-5c5547db68-492kk Pod ✔ Running 118s ready:2/2
├──□ silly-demo-5c5547db68-bm9vt Pod ✔ Running 118s ready:2/2
├──□ silly-demo-5c5547db68-fkpp2 Pod ✔ Running 118s ready:2/2
└──□ silly-demo-5c5547db68-zh8ck Pod ✔ Running 118s ready:2/2
Actually, there’s not much to look at right now. When deploying the first release, it does not make sense to do anything but roll it out as fast as possible. So, that’s what it did. It deployed five replicas of the first revision.
But… If we modify the image,…
cd kustomize/overlays/simple
kustomize edit set image \
ghcr.io/vfarcic/silly-demo=ghcr.io/vfarcic/silly-demo:1.4.127
cd ../../../
…and apply the manifests again, we should see a very different outcome.
kubectl --namespace a-team apply \
--kustomize kustomize/overlays/simple
The output of Argo Rollouts is as follows (truncated for brevity).
...
NAME KIND STATUS AGE INFO
⟳ silly-demo Rollout ॥ Paused 2m15s
├──# revision:2
│ └──⧉ silly-demo-5d574b5f4f ReplicaSet ✔ Healthy 50s canary
│ ├──□ silly-demo-5d574b5f4f-4f77d Pod ✔ Running 50s ready:2/2
└──# revision:1
└──⧉ silly-demo-5c5547db68 ReplicaSet ✔ Healthy 2m15s stable
├──□ silly-demo-5c5547db68-mk6p9 Pod ✔ Running 2m15s ready:2/2
├──□ silly-demo-5c5547db68-t457w Pod ✔ Running 2m15s ready:2/2
...
There are two things we should note.
First, we can see that only one Pod is running as the revision 2. Since there are five Pods in total and we specified that we want to start by rolling out to twenty percent of users, we got one Pod, or a fifth of the total.
So, right now, approximately one fifth of the requests are being sent to the new release while all the others are still being served by the old release. We’ll confirm that later. For now, you need to trust me on that one.
The second important note is that the STATUS is now set to Paused. We specified that we would like it to wait at this point indefinitely so that we can have all the time in the world to test the new release while affecting only a fraction of users.
Now, let’s say that we finished testing, or observing, or collecting support issues, or whatever we might be doing. Also, let’s say that we did not discover any major issue that would prevent us from rolling out the new release to an even bigger number of users.
To proceed with the rollout, we can simply promote silly-demo. That action, from the Argo Rollouts perspective, can be translated to “continue into the next step.”
kubectl argo rollouts --namespace a-team promote silly-demo
The output of Argo Rollouts is as follows.
Name: silly-demo
Namespace: a-team
Status: ॥ Paused
Message: CanaryPauseStep
Strategy: Canary
Step: 3/8
SetWeight: 40
ActualWeight: 40
Images: ghcr.io/vfarcic/silly-demo:1.4.126 (stable)
ghcr.io/vfarcic/silly-demo:1.4.127 (canary)
Replicas:
Desired: 5
Current: 5
Updated: 2
Ready: 5
Available: 5
NAME KIND STATUS AGE INFO
⟳ silly-demo Rollout ॥ Paused 6m6s
├──# revision:2
│ └──⧉ silly-demo-5d574b5f4f ReplicaSet ✔ Healthy 2m57s canary
│ ├──□ silly-demo-5d574b5f4f-jq459 Pod ✔ Running 2m56s ready:2/2
│ └──□ silly-demo-5d574b5f4f-ffmxr Pod ✔ Running 10s ready:2/2
└──# revision:1
└──⧉ silly-demo-5c5547db68 ReplicaSet ✔ Healthy 6m6s stable
├──□ silly-demo-5c5547db68-492kk Pod ✔ Running 6m6s ready:2/2
├──□ silly-demo-5c5547db68-fkpp2 Pod ✔ Running 6m6s ready:2/2
└──□ silly-demo-5c5547db68-zh8ck Pod ✔ Running 6m6s ready:2/2
We can see that the rollout continues to 40 percent, then it waits for ten seconds. After that, it continues to sixty percent, waits again, continues to eighty, waits again, and, finally, it rolls out to a hundred percent.
Name: silly-demo
Namespace: a-team
Status: ✔ Healthy
Strategy: Canary
Step: 8/8
SetWeight: 100
ActualWeight: 100
Images: ghcr.io/vfarcic/silly-demo:1.4.127 (stable)
Replicas:
Desired: 5
Current: 5
Updated: 5
Ready: 5
Available: 5
NAME KIND STATUS AGE INFO
⟳ silly-demo Rollout ✔ Healthy 42m
├──# revision:2
│ └──⧉ silly-demo-5d574b5f4f ReplicaSet ✔ Healthy 39m stable
│ ├──□ silly-demo-5d574b5f4f-jq459 Pod ✔ Running 39m ready:2/2
│ ├──□ silly-demo-5d574b5f4f-ffmxr Pod ✔ Running 36m ready:2/2
│ ├──□ silly-demo-5d574b5f4f-xxfrp Pod ✔ Running 36m ready:2/2
│ ├──□ silly-demo-5d574b5f4f-8tqtl Pod ✔ Running 36m ready:2/2
│ └──□ silly-demo-5d574b5f4f-t2tdf Pod ✔ Running 35m ready:2/2
└──# revision:1
└──⧉ silly-demo-5c5547db68 ReplicaSet • ScaledDown 42m
While the rollout of the new release was happening, it was, at the same time, decreasing the number of replicas of the old release and, by doing that, maintaining the desired weights. Once it finished, we ended up with five replicas of the new release and zero replicas of the old. The old one has been ScaledDown and now all the traffic is going to the new one.
Let’s do another release by changing the image tag one more time,…
cd kustomize/overlays/simple
kustomize edit set image \
ghcr.io/vfarcic/silly-demo=ghcr.io/vfarcic/silly-demo:1.4.128
cd ../../../
…and re-applying the manifests.
kubectl --namespace a-team apply \
--kustomize kustomize/overlays/simple
The output of Argo Rollouts is as follows.
Name: silly-demo
Namespace: a-team
Status: ॥ Paused
Message: CanaryPauseStep
Strategy: Canary
Step: 1/8
SetWeight: 20
ActualWeight: 20
Images: ghcr.io/vfarcic/silly-demo:1.4.127 (stable)
ghcr.io/vfarcic/silly-demo:1.4.128 (canary)
Replicas:
Desired: 5
Current: 5
Updated: 1
Ready: 5
Available: 5
NAME KIND STATUS AGE INFO
⟳ silly-demo Rollout ॥ Paused 103m
├──# revision:3
│ └──⧉ silly-demo-6b8dbddd4b ReplicaSet ✔ Healthy 24m canary
│ └──□ silly-demo-6b8dbddd4b-qqppw Pod ✔ Running 24m ready:2/2
├──# revision:2
│ └──⧉ silly-demo-5d574b5f4f ReplicaSet ✔ Healthy 100m stable
│ ├──□ silly-demo-5d574b5f4f-jq459 Pod ✔ Running 100m ready:2/2
│ ├──□ silly-demo-5d574b5f4f-ffmxr Pod ✔ Running 97m ready:2/2
│ ├──□ silly-demo-5d574b5f4f-xxfrp Pod ✔ Running 97m ready:2/2
│ └──□ silly-demo-5d574b5f4f-8tqtl Pod ✔ Running 96m ready:2/2
└──# revision:1
└──⧉ silly-demo-5c5547db68 ReplicaSet • ScaledDown 103m
A few moments later, it stopped after the first step that set the weight to twenty percent, giving us one Pod of the new release and reducing the number of Pods of the old release to four.
Now, let’s assume that we discovered a major issue and made a decision to roll back to the previous release.
All we have to do is execute the abort command.
kubectl argo rollouts --namespace a-team abort silly-demo
The output of Argo Rollouts is as follows.
Name: silly-demo
Namespace: a-team
Status: ✖ Degraded
Message: RolloutAborted: Rollout aborted update to revision 3
Strategy: Canary
Step: 0/8
SetWeight: 0
ActualWeight: 0
Images: ghcr.io/vfarcic/silly-demo:1.4.127 (stable)
Replicas:
Desired: 5
Current: 5
Updated: 0
Ready: 5
Available: 5
NAME KIND STATUS AGE INFO
⟳ silly-demo Rollout ✖ Degraded 103m
├──# revision:3
│ └──⧉ silly-demo-6b8dbddd4b ReplicaSet • ScaledDown 24m canary
├──# revision:2
│ └──⧉ silly-demo-5d574b5f4f ReplicaSet ✔ Healthy 100m stable
│ ├──□ silly-demo-5d574b5f4f-jq459 Pod ✔ Running 100m ready:2/2
│ ├──□ silly-demo-5d574b5f4f-ffmxr Pod ✔ Running 97m ready:2/2
│ ├──□ silly-demo-5d574b5f4f-xxfrp Pod ✔ Running 97m ready:2/2
│ ├──□ silly-demo-5d574b5f4f-8tqtl Pod ✔ Running 97m ready:2/2
│ └──□ silly-demo-5d574b5f4f-f8rhm Pod ✔ Running 13s ready:2/2
└──# revision:1
└──⧉ silly-demo-5c5547db68 ReplicaSet • ScaledDown 103m
After a while, we can see that the Status is Degraded. We aborted the process, so Argo Rollouts reverted all the changes. It scaled the old release back to five replicas and it scaled the new one to zero. As a result, all the traffic is now being served by the old release. We’re, more or less, safe. Only a fraction of the users experienced issues.
Here comes an important note.
Do NOT do what I just did. We can do so much better than that.
Before I explain why I said that, let’s delete the app so that we can start over.
kubectl --namespace a-team delete \
--kustomize kustomize/overlays/simple
Controlling the Traffic Through a Service Mesh (Istio)
Controlling the traffic by increasing and decreasing the number of Pods is silly. Imagine that we would like to start with ten percent of the traffic going to the new release. To do that, we’d need to have nine replicas of the old release and one of the new. While that might, somehow, work if we truly need ten replicas in total, more often than not, the number of replicas we need to run an application will not match the rollout increments.
We can solve that problem with Ingress controllers. Most of them can be configured to send a specific percentage of requests to specific services. That’s not a great idea either since not all applications might be exposed through Ingress. We might have one backend application that talks to another directly, without going through Ingress.
A better way to solve that issue is through a Service Mesh.
We’ll use Istio today, but the logic should be the same for any other Service Mesh, and even for Ingress in case you do not want to use a Service Mesh.
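For example, if you would rather rely on an Ingress controller, Argo Rollouts can manipulate, among others, an NGINX Ingress. A sketch of what that might look like follows; it assumes the same silly-demo-stable and silly-demo-canary Services we’ll use later in this demo, plus an existing Ingress named silly-demo-stable that routes to the stable Service.

apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: silly-demo
spec:
  replicas: 5
  selector:
    matchLabels:
      app: silly-demo
  workloadRef:
    apiVersion: apps/v1
    kind: Deployment
    name: silly-demo
  strategy:
    canary:
      canaryService: silly-demo-canary
      stableService: silly-demo-stable
      trafficRouting:
        nginx:
          stableIngress: silly-demo-stable # an existing Ingress pointing at the stable Service
      steps:
      - setWeight: 20
      - pause: {}

We will stick with Istio, though.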
Here’s an example of Istio VirtualService.
cat kustomize/overlays/istio/virtualservice-01.yaml
The output is as follows.
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
name: silly-demo-1
spec:
gateways:
- silly-demo-gateway
hosts:
- silly-demo.34.139.252.110.nip.io
http:
- name: primary
route:
- destination:
host: silly-demo-stable
port:
number: 8080
weight: 100
- destination:
host: silly-demo-canary
port:
number: 8080
weight: 0
The only thing that matters in that definition is that all the traffic will be sent to the host silly-demo-stable. We can see that through the weight set to 100 percent. The second destination is silly-demo-canary with the weight set to 0, meaning that no traffic will be sent to it.
You can probably guess where this is going.
cat kustomize/overlays/istio/rollout.yaml
The output is as follows.
---
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
name: silly-demo
spec:
replicas: 5
strategy:
canary:
canaryService: silly-demo-canary
stableService: silly-demo-stable
trafficRouting:
istio:
virtualServices:
- name: silly-demo-1
routes:
- primary
- name: silly-demo-2
routes:
- secondary
steps:
- setWeight: 20
- pause: {}
- setWeight: 40
- pause: {duration: 10}
- setWeight: 60
- pause: {duration: 10}
- setWeight: 80
- pause: {duration: 10}
revisionHistoryLimit: 2
selector:
matchLabels:
app: silly-demo
workloadRef:
apiVersion: apps/v1
kind: Deployment
name: silly-demo
This time, we are instructing Argo Rollouts to use the service silly-demo-canary for canary releases and silly-demo-stable for, as the name says, stable releases. Further on, we’re telling it to do trafficRouting using istio virtualServices.
The rest of the manifest is exactly the same as before. The only important difference is that, this time, it should not manipulate the number of Pods but, instead, use Istio to set weight on the virtual services.
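Those two Services are not shown here, but they are ordinary Kubernetes Services that both select the application’s Pods. A sketch of what they might look like follows (assuming the app listens on port 8080, as in this demo); during a rollout, Argo Rollouts adjusts their selectors so that one always points at the stable ReplicaSet and the other at the canary.

apiVersion: v1
kind: Service
metadata:
  name: silly-demo-stable
spec:
  selector:
    app: silly-demo # Argo Rollouts adds a pod-template-hash selector for the stable ReplicaSet
  ports:
  - port: 8080
    targetPort: 8080
---
apiVersion: v1
kind: Service
metadata:
  name: silly-demo-canary
spec:
  selector:
    app: silly-demo # Argo Rollouts adds a pod-template-hash selector for the canary ReplicaSet
  ports:
  - port: 8080
    targetPort: 8080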
Let’s apply it.
kubectl --namespace a-team apply \
--kustomize kustomize/overlays/istio
The output of Argo Rollouts is as follows.
Name: silly-demo
Namespace: a-team
Status: ✔ Healthy
Strategy: Canary
Step: 8/8
SetWeight: 100
ActualWeight: 100
Images: ghcr.io/vfarcic/silly-demo:1.4.126 (stable)
Replicas:
Desired: 5
Current: 5
Updated: 5
Ready: 5
Available: 5
NAME KIND STATUS AGE INFO
⟳ silly-demo Rollout ✔ Healthy 12s
└──# revision:1
└──⧉ silly-demo-5c5547db68 ReplicaSet ✔ Healthy 11s stable
├──□ silly-demo-5c5547db68-4fjw2 Pod ✔ Running 11s ready:2/2
├──□ silly-demo-5c5547db68-7bfx8 Pod ✔ Running 11s ready:2/2
├──□ silly-demo-5c5547db68-9r6lx Pod ✔ Running 11s ready:2/2
├──□ silly-demo-5c5547db68-hbjvl Pod ✔ Running 11s ready:2/2
└──□ silly-demo-5c5547db68-qcbc8 Pod ✔ Running 11s ready:2/2
This is the first release so there’s not much to look at since first releases are always rolled out right away.
So, let’s make a second release by changing the tag of the image,…
cd kustomize/overlays/istio
kustomize edit set image \
ghcr.io/vfarcic/silly-demo=ghcr.io/vfarcic/silly-demo:1.4.127
cd ../../../
…and re-applying the manifests.
kubectl --namespace a-team apply \
--kustomize kustomize/overlays/istio
The output of Argo Rollouts is as follows.
Name: silly-demo
Namespace: a-team
Status: ॥ Paused
Message: CanaryPauseStep
Strategy: Canary
Step: 1/8
SetWeight: 20
ActualWeight: 20
Images: ghcr.io/vfarcic/silly-demo:1.4.126 (stable)
ghcr.io/vfarcic/silly-demo:1.4.127 (canary)
Replicas:
Desired: 5
Current: 6
Updated: 1
Ready: 6
Available: 6
NAME KIND STATUS AGE INFO
⟳ silly-demo Rollout ॥ Paused 2m24s
├──# revision:2
│ └──⧉ silly-demo-5d574b5f4f ReplicaSet ✔ Healthy 15s canary
│ └──□ silly-demo-5d574b5f4f-cxwnl Pod ✔ Running 15s ready:2/2
└──# revision:1
└──⧉ silly-demo-5c5547db68 ReplicaSet ✔ Healthy 2m23s stable
├──□ silly-demo-5c5547db68-4fjw2 Pod ✔ Running 2m23s ready:2/2
├──□ silly-demo-5c5547db68-7bfx8 Pod ✔ Running 2m23s ready:2/2
├──□ silly-demo-5c5547db68-9r6lx Pod ✔ Running 2m23s ready:2/2
├──□ silly-demo-5c5547db68-hbjvl Pod ✔ Running 2m23s ready:2/2
└──□ silly-demo-5c5547db68-qcbc8 Pod ✔ Running 2m23s ready:2/2
At first glance, the outcome is exactly the same. It rolled out to twenty percent of requests and it is now waiting for us to do whatever we need to do to validate it before promoting it to the next step.
The change, however, is in the virtual service we applied, so let’s take a look at it.
kubectl --namespace a-team get virtualservice silly-demo-1 \
--output yaml
The output is as follows (truncated for brevity).
apiVersion: networking.istio.io/v1
kind: VirtualService
metadata:
...
name: silly-demo-1
...
spec:
...
http:
- name: primary
route:
- destination:
host: silly-demo-stable
port:
number: 8080
weight: 80
- destination:
host: silly-demo-canary
port:
number: 8080
weight: 20
We can see that Argo Rollouts changed the weights. The silly-demo-stable destination weight has been changed from 100 to 80 while the other one was increased to 20. As a result, it no longer matters how many replicas we are running since that exact percentage of requests will be forwarded to one release or the other.
We can confirm that by, let’s say, sending twenty requests to the application.
for i in {1..20}; do
curl "http://silly-demo.$ISTIO_HOST"
done
The output is as follows.
This is a silly demo version 1.4.126
This is a silly demo version 1.4.126
This is a silly demo version 1.4.126
This is a silly demo version 1.4.126
This is a silly demo version 1.4.127
This is a silly demo version 1.4.126
This is a silly demo version 1.4.127
This is a silly demo version 1.4.126
This is a silly demo version 1.4.126
This is a silly demo version 1.4.127
This is a silly demo version 1.4.126
This is a silly demo version 1.4.126
This is a silly demo version 1.4.127
This is a silly demo version 1.4.126
This is a silly demo version 1.4.126
This is a silly demo version 1.4.126
This is a silly demo version 1.4.126
This is a silly demo version 1.4.126
This is a silly demo version 1.4.126
This is a silly demo version 1.4.126
We can see that, approximately, twenty percent of those twenty requests are coming from the release 127 while the rest keeps coming from the previous release 126.
Please note that I said “approximately” instead of exactly since those releases might be receiving requests from other places so our final number might be skewed.
The rest follows the same pattern as before.
We can promote the canary to the next step,…
kubectl argo rollouts --namespace a-team promote silly-demo
…and the virtual services will be automatically updated with new weights.
To be on the safe side, let’s send another round of requests.
for i in {1..20}; do
curl "http://silly-demo.$ISTIO_HOST"
done
The output is as follows.
This is a silly demo version 1.4.126
This is a silly demo version 1.4.126
This is a silly demo version 1.4.126
This is a silly demo version 1.4.127
This is a silly demo version 1.4.126
This is a silly demo version 1.4.126
This is a silly demo version 1.4.126
This is a silly demo version 1.4.127
This is a silly demo version 1.4.127
This is a silly demo version 1.4.127
This is a silly demo version 1.4.126
This is a silly demo version 1.4.127
This is a silly demo version 1.4.126
This is a silly demo version 1.4.126
This is a silly demo version 1.4.126
This is a silly demo version 1.4.126
This is a silly demo version 1.4.127
This is a silly demo version 1.4.127
This is a silly demo version 1.4.126
This is a silly demo version 1.4.127
We can see that, this time, the number of responses from the 1.4.127 release has increased.
If we wait long enough… and send another round of requests,…
for i in {1..20}; do
curl "http://silly-demo.$ISTIO_HOST"
done
The output is as follows.
This is a silly demo version 1.4.127
This is a silly demo version 1.4.127
This is a silly demo version 1.4.127
This is a silly demo version 1.4.127
This is a silly demo version 1.4.127
This is a silly demo version 1.4.127
This is a silly demo version 1.4.127
This is a silly demo version 1.4.127
This is a silly demo version 1.4.127
This is a silly demo version 1.4.127
This is a silly demo version 1.4.127
This is a silly demo version 1.4.127
This is a silly demo version 1.4.127
This is a silly demo version 1.4.127
This is a silly demo version 1.4.127
This is a silly demo version 1.4.127
This is a silly demo version 1.4.127
This is a silly demo version 1.4.127
This is a silly demo version 1.4.127
This is a silly demo version 1.4.127
We can see that all the requests are now coming from 1.4.127.
Now, to be clear, we are currently using the simplest setup of Istio virtual services. There are many more sophisticated ways we could decide who sees the new and who sees the old release. We could be rolling it out to selected users, or to users in a specific location, or to users filtered through some other criteria. The only limitation is the capabilities of the Service Mesh or Ingress we chose. Nevertheless, I feel that we made the point that a Service Mesh or Ingress is a valuable addition to our progressive delivery setup.
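As an illustration only (not part of this demo), an Istio VirtualService can match on a request header and send only those requests to the canary destination, which is one way to expose a release to selected users. Both the host and the x-canary header below are hypothetical.

apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: silly-demo-by-header
spec:
  hosts:
  - silly-demo.example.com # hypothetical host
  http:
  - match:
    - headers:
        x-canary: # hypothetical header used to opt users into the canary
          exact: "true"
    route:
    - destination:
        host: silly-demo-canary
        port:
          number: 8080
  - route: # everyone else keeps hitting the stable release
    - destination:
        host: silly-demo-stable
        port:
          number: 8080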
While we made an improvement, that is still not what we should do. We can do better, much better than just adding a Service Mesh to the mix.
Testing by Observing Metrics
You or someone in your company is almost certainly observing the state of production and reacting when bad things are spotted. Users are performing some actions and we are, through metrics, traces, and logs, validating whether the system works as expected. That is, by definition, testing, even though we might not call it that.
What matters, for today’s subject, is that we should add metrics to the mix.
We’ll start by executing hey to send traffic to our application. That won’t give us “real user” behavior but, since this is a demo, we can use our imagination and pretend that’s the “real” traffic.
Execute the command that follows in the third terminal session
hey -z 60m "http://silly-demo.$ISTIO_HOST"
If this were a “real” production, we would be observing the state of our system in, let’s say, Grafana. We would have dashboards that execute queries against, let’s say, Prometheus, and “paint” the results as graphs. We’ll skip Grafana today and go straight into Prometheus.
echo "http://prometheus.$INGRESS_HOST"
Open the URL from the output in a browser.
One way to deduce whether an application is behaving as expected could be to retrieve a sum of the rate of requests going into it. We can do that with the following query.
Execute the query that follows in the Prometheus Web UI.
sum(irate(istio_requests_total{reporter="source",destination_service=~"silly-demo-stable.a-team.svc.cluster.local"}[5m]))
That is the sum of the rate of requests generated over the last five minutes.
However, that alone is not enough. We should also be able to filter requests based on response code so that, let’s say, we retrieve only those that are NOT in the five hundred range.
Execute the query that follows in the Prometheus Web UI.
sum(irate(istio_requests_total{reporter="source",destination_service=~"silly-demo-stable.a-team.svc.cluster.local",response_code!~"5.*"}[5m]))
If we combine those two, we should be able to get the percentage of successful requests.
Execute the query that follows in the Prometheus Web UI.
sum(irate(istio_requests_total{reporter="source",destination_service=~"silly-demo-stable.a-team.svc.cluster.local",response_code!~"5.*"}[5m])) / sum(irate(istio_requests_total{reporter="source",destination_service=~"silly-demo-stable.a-team.svc.cluster.local"}[5m]))
Right now, the output is 1, meaning that a hundred percent of requests are successful.
There are many other queries we could execute to find out whether a release running in production is behaving as expected, and that’s probably what we should be doing during progressive delivery. That sounds like a much better method than just letting our users discover that there’s something wrong and send us “angry” support tickets. It would be a proactive way to deduce whether to continue rolling forward or to roll back.
Still, we can do much better than that.
Automating Progressive Delivery Decision Making
Once we figure out what we are observing and which queries give us confidence that our applications are behaving as expected, we are likely to realize that’s too much work. Who wants to sit in front of a laptop constantly executing queries or watching dashboards? A rollout can last minutes, hours, or, in some cases, even days. It would be ridiculous to work in shifts to deduce whether a release is working as expected while knowing that all that time is spent on repetitive tasks. When something is repetitive, it can be automated.
We should be able to instruct Argo Rollouts to run the same queries we would normally run and use the results to decide whether to roll forward or to roll back.
We can do that by modifying the Rollout definition or, even better, by separating analyses from rollouts, given that different applications are likely to use some, if not all, of the same analyses.
Here’s an example of a ClusterAnalysisTemplate.
cat cluster-analysis-template.yaml
The output is as follows.
apiVersion: argoproj.io/v1alpha1
kind: ClusterAnalysisTemplate
metadata:
name: success-rate
spec:
args:
- name: service-name
- name: prometheus-addr
value: http://kube-prometheus-stack-prometheus.monitoring
- name: prometheus-port
value: "9090"
metrics:
- name: success-rate
successCondition: result[0] >= 0.95
provider:
prometheus:
address: "{{args.prometheus-addr}}:{{args.prometheus-port}}"
query: |
sum(irate(
istio_requests_total{reporter="source",destination_service="{{args.service-name}}",response_code!~"5.*"}[5m]
)) /
sum(irate(
istio_requests_total{reporter="source",destination_service="{{args.service-name}}"}[5m]
))
The name is set to success-rate. Remember it, since we’ll have to reference it in our Rollout.
Inside the spec, we are defining a few arguments like the service-name. That one does not have a value since we’ll set it when we reference that analysis in the Rollout. There are also prometheus-addr and prometheus-port, which are used to point the analysis to our Prometheus instance.
The real action is happening in the metrics. We can have many but, for today, one should be enough to demonstrate how it all works.
We are setting the successCondition to be equal to or greater than 0.95, or ninety-five percent. That means that if the result of the query is within that threshold, it will be considered a success and the rollout will be able to continue. Otherwise, if it’s below that threshold, it will be considered a failure and eligible for a rollback.
The rest should be self-explanatory. We’re using prometheus as the provider (there are many others), and the query is almost the same as the one we executed earlier. The only difference is that, instead of a hard-coded value for the destination_service, we are using the service-name argument. That way, the same analysis template can be used for multiple applications.
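If we wanted more signals, we could append additional entries to metrics following the same pattern. A purely illustrative sketch of a latency check follows; it assumes Istio’s istio_request_duration_milliseconds_bucket histogram is available in Prometheus, and the 500ms threshold is made up.

- name: p95-latency
  successCondition: result[0] <= 500 # fail the analysis if the 95th percentile latency exceeds 500ms
  provider:
    prometheus:
      address: "{{args.prometheus-addr}}:{{args.prometheus-port}}"
      query: |
        histogram_quantile(0.95, sum(irate(
          istio_request_duration_milliseconds_bucket{reporter="source",destination_service="{{args.service-name}}"}[5m]
        )) by (le))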
Let’s apply it,…
kubectl apply --filename cluster-analysis-template.yaml
…and take a look at a modified version of the rollout.
cat kustomize/overlays/istio-prometheus/rollout.yaml
The output is as follows (truncated for brevity).
---
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
name: silly-demo
spec:
replicas: 5
strategy:
canary:
...
analysis:
templates:
- templateName: success-rate
clusterScope: true
startingStep: 2
args:
- name: service-name
value: silly-demo-canary.a-team.svc.cluster.local
steps:
- setWeight: 20
- pause: {duration: 300}
- setWeight: 40
- pause: {duration: 300}
...
This time, we added the analysis section that references the success-rate template we just applied. The startingStep is set to 2, meaning that the analysis will start only after the rollout reaches the second step, which is the pause right after the weight is set to 20 percent. As a result, the analysis will start only after twenty percent of users start seeing the new release.
Finally, we changed the pause duration to 300 seconds, or five minutes. Since the Prometheus query is set to take five minutes into account, anything less than that could produce a false sense of security. You’ll probably want to set that value to a much higher number given that five minutes of interaction with the application might not be enough to convince you that it’s working as expected.
Let’s apply that rollout.
kubectl --namespace a-team apply \
--kustomize kustomize/overlays/istio-prometheus
The output of Argo Rollouts is as follows (truncated for brevity).
NAME KIND STATUS AGE INFO
⟳ silly-demo Rollout ॥ Paused 8m15s
├──# revision:3
│ ├──⧉ silly-demo-5c5547db68 ReplicaSet ✔ Healthy 8m14s canary
│ │ ├──□ silly-demo-5c5547db68-lqbdb Pod ✔ Running 5m69s ready:2/2
│ │ └──□ silly-demo-5c5547db68-w5wff Pod ✔ Running 5m5s ready:2/2
│ └──α silly-demo-5c5547db68-3 AnalysisRun ✔ Successful 5s ✔ 1
└──# revision:2
└──⧉ silly-demo-5d574b5f4f ReplicaSet ✔ Healthy 8m stable
├──□ silly-demo-5d574b5f4f-vl7ft Pod ✔ Running 7m59s ready:2/2
├──□ silly-demo-5d574b5f4f-7997c Pod ✔ Running 7m43s ready:2/2
├──□ silly-demo-5d574b5f4f-bwwx7 Pod ✔ Running 7m30s ready:2/2
├──□ silly-demo-5d574b5f4f-qnzck Pod ✔ Running 7m16s ready:2/2
└──□ silly-demo-5d574b5f4f-6mr6p Pod ✔ Running 7m3s ready:2/2
If we wait for five minutes or more, the output is almost the same as what we saw earlier when we were running it without the analysis. The only tangible difference, judging from the output, is that it does not pause indefinitely waiting for our blessing to progress to the next step. Instead, the pause lasts for five minutes before progressing to the next step that increments the weight.
The key difference is in the AnalysisRun entry. It shows that it was Successful. So far, the application is passing all verifications and, if we give it enough time, it will, eventually, be rolled out completely.
Let’s spice it up a bit by applying a new release but, this time, we’ll do it with a twist.
So, let’s change the tag of the image,…
cd kustomize/overlays/istio-prometheus
kustomize edit set image \
ghcr.io/vfarcic/silly-demo=ghcr.io/vfarcic/silly-demo:1.4.129
cd ../../../
…and apply the manifests.
kubectl --namespace a-team apply \
--kustomize kustomize/overlays/istio-prometheus
Next, we’ll stop hey that is generating traffic that should normally be generated by “real” users.
Press ctrl+c to stop hey in the third terminal session
Finally, we’ll execute hey again, but, this time, with fail=true. The application we’re using is configured to simulate failure if that query parameter is in a request.
Execute the command that follows in the third terminal session
hey -z 60m "http://silly-demo.$ISTIO_HOST?fail=true"
If we pay attention to the Argo Rollouts output, it starts in the same way as before, by increasing the reach of the new release to 20 percent and, after that, it starts running the analysis.
The output of Argo Rollouts is as follows (truncated for brevity).
Name: silly-demo
Namespace: a-team
Status: ✖ Degraded
Message: RolloutAborted: Rollout aborted update to revision 4: Metric "success-rate" assessed Failed due to failed (1) > failureLimit (0)
Strategy: Canary
Step: 0/8
SetWeight: 0
ActualWeight: 0
Images: ghcr.io/vfarcic/silly-demo:1.4.126 (stable)
Replicas:
Desired: 5
Current: 5
Updated: 0
Ready: 5
Available: 5
NAME KIND STATUS AGE INFO
⟳ silly-demo Rollout ✖ Degraded 18m
├──# revision:4
│ ├──⧉ silly-demo-64b5875f59 ReplicaSet • ScaledDown 9m30s canary,delay:passed
│ └──α silly-demo-64b5875f59-4 AnalysisRun ✖ Failed 8m26s ✖ 1
├──# revision:3
│ ├──⧉ silly-demo-5c5547db68 ReplicaSet ✔ Healthy 18m stable
│ │ ├──□ silly-demo-5c5547db68-lqbdb Pod ✔ Running 16m ready:2/2
│ │ ├──□ silly-demo-5c5547db68-w5wff Pod ✔ Running 15m ready:2/2
│ │ ├──□ silly-demo-5c5547db68-5bln2 Pod ✔ Running 14m24s ready:2/2
│ │ ├──□ silly-demo-5c5547db68-tb4vh Pod ✔ Running 13m21s ready:2/2
│ │ └──□ silly-demo-5c5547db68-5scvp Pod ✔ Running 12m17s ready:2/2
│ └──α silly-demo-5c5547db68-3 AnalysisRun ✔ Successful 10m ✔ 1
└──# revision:2
└──⧉ silly-demo-5d574b5f4f ReplicaSet • ScaledDown 13m
After a while (five minutes or more), the AnalysisRun changed to Failed. The percentage of failed requests reached a number higher than five percent and, as we already saw, that’s the threshold we defined. As a result, it scaled down the new release to zero replicas, and changed the weight of the virtual service to redirect all the traffic back to the old release.
We succeeded.
We are now testing in production by performing canary deployments with Argo Rollouts, Istio, and Prometheus metrics.
That is how fully automated testing in production is done.
Thank you for watching. See you in the next one. Cheers.
Destroy
Stop the processes in the second and the third terminal session by pressing ctrl+c.
chmod +x destroy.nu
./destroy.nu
Execute the command that follows in all terminal sessions.
exit