I Stopped Staring at Dashboards. AI Reads My Grafana Metrics Now.

Something just broke in production. An alert fired. You open Grafana, click through three dashboards trying to find one that matches what’s actually happening, and twenty minutes later you’re still squinting at panels and switching tabs to grep logs.

That entire workflow is about to disappear.

AI agents can now read your metrics, your logs, and your traces directly. They draw conclusions. They build custom dashboards on the spot. They tell you what’s wrong while you’re still typing the question. And the best part: you don’t have to leave your terminal to do any of it.

In this video, I’ll show you how. We’ll start with Grafana Assistant inside the Grafana UI, then move to Claude Code wired up to the Grafana MCP server. By the end, you’ll see how to query observability data from a chat prompt, build dashboards fitted to whatever you’re investigating, and connect your own agents to live runtime data, all without touching a browser.

Setup

This specific demo cannot be replicated as-is, but you should be able to do similar operations against your system.

What Is Grafana Cloud and Grafana Assistant?

If you’re going to do observability, you need somewhere to ship the metrics, the logs, and the traces, and somewhere to query them. You can build that yourself. You can run Prometheus, Loki, Tempo, and Grafana on your own clusters and operate them as another set of services that need babysitting. Or you can let someone else operate that stack for you. Grafana Cloud is the hosted version of exactly that. Same Grafana, same Loki, same Tempo, with Mimir as the Prometheus-compatible metrics backend. The difference is that you don’t have to keep the lights on. There’s a forever-free tier that’s generous enough to cover small workloads, and it scales from there. For this video, that matters because the things I want to show next are easiest to demo against Grafana Cloud, but the same patterns apply if you’re running Grafana yourself.

The piece I want to focus on is Grafana Assistant. It’s an AI agent that lives inside Grafana, and it knows about your data. Ask it about CPU on a workload, ask it about errors in your logs, ask it about traces, and it figures out which datasources to query, builds the queries, runs them, and visualizes the results. It can also explain panels, write PromQL or LogQL for you, and even generate dashboards from a description. It’s the kind of thing that, a couple of years ago, would have been a research demo. Today it’s a button in the corner of every Grafana page.

Let me give you a quick taste of how it looks inside the Grafana UI before we get into the more interesting part.

Grafana Assistant is available everywhere Grafana runs. Grafana Cloud, of course. But also self-managed Grafana Enterprise, and even self-managed OSS. Everything I’ll show requires a Grafana Cloud account, but the forever-free tier covers it. There’s one important nuance though: the hosted MCP server at mcp.grafana.com/mcp, which is the part that lets external agents like Claude Code talk to Grafana, is Cloud-only. If you’re self-managing, you still get the Assistant inside your Grafana UI, but for external agents you’d run the open-source Grafana MCP server locally instead. Same idea, slightly different setup.

Now, before we go further, I want to set the stakes.

Pre-defined dashboards used to be our YouTube. We’d stare at them. Hours per day, in some teams. We did that not because dashboards are entertainment, but because we, the humans, the carbon-based meat puppets, are bad at digesting raw metrics, raw logs, and raw traces. The real work, the one that matters, happens after an alert fires. Something is wrong, you need to figure out what is wrong, and that means querying metrics, logs, and traces until you find the cause. That part has always been tedious and slow. Agentic AI makes it fast. And the moment that happens, the reason we’ve been staring at pre-defined dashboards starts to disappear.

Grafana Assistant Demo: Metrics, Logs, and Traces

Here’s the Grafana UI. The Assistant lives in that little sparkle icon in the top-right corner. Click it, get a chat panel, type a question. That’s the whole interface.

Let me throw a few obvious questions at it from the browser. Nothing fancy. I just want to see what it does with my workloads.

Show me CPU and memory usage for dot-ai and youtube workloads.

The Assistant figures out which Prometheus metrics map to “CPU and memory,” builds a query for each container in those namespaces, renders a panel right inside the chat, and then writes a few sentences summarizing what it found. The panel here shows CPU and memory by container, with the mcp-server slowly growing, the manager stable, and youtube-automation sitting around forty-five megabytes. Below the chart, there’s a written summary calling out the same things.

Same trick, different signal.

Show the logs of those workloads.

This time it picks Loki as the datasource, queries logs for both namespaces, and gives me a written summary at the top — the manager is reconciling solutions, one of them is degraded because a Pod called silly-demo doesn’t exist, and the mcp-server is doing normal sync activity. Underneath the summary, there’s a logs panel with the actual log lines so I can scroll through and verify.

And once more, this time for traces.

Show the traces of those workloads.

Tempo this time. The Assistant lists traces from dot-ai-mcp, the only workload here that’s actually instrumented, with trace IDs, timestamps, and the service name. If I wanted to drill into one of them, I’d click and Grafana would open the trace view.

Now, in real life, nobody would actually ask the Assistant these questions. Nobody opens a chat panel and types “show me CPU.” That’s what dashboards are for. The Assistant earns its keep when something is wrong, and you ask it to analyze — not show, analyze. “Why is this pod restarting.” “What changed in the last hour.” “Where is this latency coming from.” Unfortunately for the demo, and fortunately for me, I don’t have any real issues in this cluster right now, so I can’t show that. You’ll have to trust me that it works.

Now here’s the thing. The Assistant in a browser is great. Genuinely. The visualizations are gorgeous, and they’re things a terminal can’t easily replicate. But I don’t want to be in the browser. No browser-based agent matches Claude Code, OpenCode, or Cursor. Anyone who has spent serious time with one of those agents knows the gap. And even if the Grafana Assistant were as good — and it’s pretty good — I still wouldn’t want to switch agents depending on the task. I don’t want to leave my terminal agent for analysis and then come back for everything else. I want to extend the agent I’m already using, not collect a fleet of specialized agents and rotate through them.

There’s a deeper reason too, and this one matters more.

Analysis without remediation is pointless. If you can only analyze, you’ve solved half the problem. The Web UI Assistant does only analysis — it can tell you what’s wrong, but it can’t fix it. Remediation happens in the terminal. You edit code. You push to Git. GitOps syncs the change into the cluster. Even when people skip the GitOps part and fix things directly, they’re still in a terminal. The browser isn’t where fixes live, and if you are clicking buttons to fix production, those buttons aren’t in Grafana anyway, and you should probably be doing literally anything else with your career. So if I have to be in a terminal for the fix, I might as well be in a terminal for the analysis too.

Which means switching to Claude Code.

Claude Code with the Grafana MCP Server

I’ll start Claude Code from the same project where my infrastructure lives.

claude

The first thing I want to do is confirm that Claude has access to the Grafana MCP server. That’s the bridge that gives it direct access to Grafana’s tools — Prometheus queries, Loki, Tempo, dashboards, alerts, all of it. I’ll check the MCP list.

[user]

/mcp

Here’s what we get.

[agent]

  Manage MCP servers
  12 servers

    Project MCPs (/Users/viktorfarcic/code/dot-ai-infra/.mcp.json)
  ❯ dot-ai · ✔ connected
    grafana · ✔ connected

Twelve MCP servers wired up in this project, and the one I care about right now is grafana. Connected. In my case that’s the hosted MCP at mcp.grafana.com/mcp because I’m using Grafana Cloud, but if you’re running self-hosted Grafana you’d point Claude at the open-source Grafana MCP instead and the rest of the demo would be the same.

Same questions as before, just from the terminal this time. Starting with the same CPU and memory query.

[user]

Show me CPU and memory usage for dot-ai and youtube workloads.

Here’s what we got.

[agent]

  Called grafana 4 times (ctrl+o to expand)

⏺ Current per-pod usage (5m CPU rate, working-set memory):

  namespace: dot-ai

  ┌─────────────────────────────────────────────┬─────────────┬──────────┐
  │                     Pod                     │ CPU (cores) │ Memory   │
  ├─────────────────────────────────────────────┼─────────────┼──────────┤
  │ dot-ai-controller-manager-69959496b4-d9wrd  │      0.0103 │ 188 MiB  │
  ├─────────────────────────────────────────────┼─────────────┼──────────┤
  │ dot-ai-68dd94677-wqpj2                      │      0.0030 │  265 MiB │
  ├─────────────────────────────────────────────┼─────────────┼──────────┤
  │ dot-ai-stack-qdrant-0                       │      0.0036 │   41 MiB │
  ├─────────────────────────────────────────────┼─────────────┼──────────┤
  │ dot-ai-stack-agentic-tools-6cfb9b77c7-dsxx2 │      0.0019 │   78 MiB │
  ├─────────────────────────────────────────────┼─────────────┼──────────┤
  │ dot-ai-ui-fbd6f6b77-dvbtz                   │      0.0011 │   28 MiB │
  ├─────────────────────────────────────────────┼─────────────┼──────────┤
  │ dot-ai-stack-dex-67544949-vw77p             │      0.0006 │   13 MiB │
  ├─────────────────────────────────────────────┼─────────────┼──────────┤
  │ dot-agent-deck-docs-7c5475dcf4-qkbhh        │      0.0003 │    5 MiB │
  ├─────────────────────────────────────────────┼─────────────┼──────────┤
  │ dot-ai-website-8694d89b9-7rzhb              │      0.0002 │   20 MiB │
  ├─────────────────────────────────────────────┼─────────────┼──────────┤
  │ Total                                       │      ~0.021 │ ~638 MiB │
  └─────────────────────────────────────────────┴─────────────┴──────────┘

  namespace: youtube-automation

  ┌─────────────────────────────────────┬─────────────┬────────┐
  │                 Pod                 │ CPU (cores) │ Memory │
  ├─────────────────────────────────────┼─────────────┼────────┤
  │ youtube-automation-5c7d4f995f-7gtfx │      0.0007 │ 44 MiB │
  └─────────────────────────────────────┴─────────────┴────────┘

  All workloads are essentially idle — CPU well under 0.05 cores total, memory
  dominated by dot-ai (265 MiB) and dot-ai-controller-manager (188 MiB). Let me know if
   you want a range query (e.g. last 6h trends) or to include limits/requests for a
  utilization-vs-capacity view.

Two ASCII tables, one per namespace, with per-pod CPU and memory. Then a written summary saying everything is idle. Same data the browser showed me, just in a form I can paste, grep, diff, and feed into the next question.

Now logs.

[user]

Show the logs of those workloads.

And the response.

[agent]

  Called grafana 3 times (ctrl+o to expand)

⏺ dot-ai namespace (last hour, 25 most recent)

  Logs are dominated by dot-ai-controller-manager running its Solution reconcile loop.
  Two solutions show up repeatedly:

  - dot-ai-website (dot-ai ns) — state: deployed, 3/3 resources ready (Deployment,
  Service, HTTPRoute).
  - solution-sol-1773699448783-4be83c06 (examples ns) — state: degraded. Reconciler
  logs Resource not found, counting as failed for Pod/silly-demo — the Solution
  declares Deploy ghcr.io/vfarcic/silly-demo:v1.5.235 to the examples namespace but the
   Pod doesn't exist.

  Other pods in the namespace (last hour):
  - dot-agent-deck-docs — only nginx access logs from kube-probe and GoogleHC health
  checks.
  - dot-ai, dot-ai-ui, dot-ai-website, dot-ai-stack-qdrant-0,
  dot-ai-stack-agentic-tools, dot-ai-stack-dex — no log lines surfaced in the top 25
  (drowned out by the controller's debug stream).

  youtube-automation namespace

  A single pod (youtube-automation-5c7d4f995f-7gtfx). Two patterns:

  1. Healthy traffic — periodic GET /api/categories, /api/videos/list,
  /api/videos/phases returning 200 in 70–130 ms.
  2. WordPress vulnerability scan — a burst at 00:51:29–00:51:31 hitting
  /2018/wp-includes/wlwmanifest.xml, /wp1/..., /shop/..., /cms/..., /sito/..., etc. All
   returned 200 with ~30 µs latency, which is suspicious — most apps return 404 for
  these. Worth checking whether the app is actually serving content there or just
  default-200'ing.

  Two things worth a closer look:
  - The solution-sol-1773699448783-4be83c06 Solution is stuck degraded — want me to
  investigate why the silly-demo Pod is missing?
  - The youtube-automation app returning 200 to WordPress probe paths.

Notice what just happened. It pulled logs, summarized them, and on its own caught two things worth investigating: a stuck Solution reconciliation, and a WordPress vulnerability scan that’s hitting suspicious endpoints. I didn’t ask it to find problems. I asked it to show me logs. It found problems anyway. That is the analysis layer the browser Assistant gives you too — but here it lands directly in the same agent I’d use to fix any of these things.

Last one. Traces.

[user]

Show the traces of those workloads.

The output.

[agent]

  Called grafana 7 times (ctrl+o to expand)

⏺ Trace coverage

  Only one workload is instrumented for tracing: the dot-ai-mcp process running in pod
  dot-ai-68dd94677-wqpj2 (Node.js 22.22.2, OpenTelemetry SDK 2.7.0, app version
  1.16.3). The youtube-automation pod and the rest of the dot-ai pods
  (controller-manager, qdrant, agentic-tools, dex, ui, website, deck-docs) emit no
  traces — query {resource.service.name="youtube-automation"} returned 0 results.

  dot-ai-mcp — span counts (last hour)

  ┌───────────────────────────────────────────────────────────────┬────────────────┐
  │                          Span / Endpoint                      │ Count          │
  ├───────────────────────────────────────────────────────────────┼────────────────┤
  │ GET /healthz                                                  │ 3,214          │
  ├───────────────────────────────────────────────────────────────┼────────────────┤
  │ embeddings text-embedding-3-small                             │          1,862 │
  ├───────────────────────────────────────────────────────────────┼────────────────┤
  │ POST /api/v1/resources/sync                                   │            239 │
  ├───────────────────────────────────────────────────────────────┼────────────────┤
  │ MCP /                                                         │             27 │
  ├───────────────────────────────────────────────────────────────┼────────────────┤
  │ POST /api/v1/tools/manageKnowledge / execute_tool             │          4 / 4 │
  │ manageKnowledge                                               │                │
  ├───────────────────────────────────────────────────────────────┼────────────────┤
  │ GET /.well-known/oauth-authorization-server                   │              4 │
  ├───────────────────────────────────────────────────────────────┼────────────────┤
  │ GET /.well-known/oauth-protected-resource                     │              4 │
  ├───────────────────────────────────────────────────────────────┼────────────────┤
  │ GET /api/v1/prompts                                           │              3 │
  ├───────────────────────────────────────────────────────────────┼────────────────┤
  │ POST /api/v1/prompts/* (prd-create, prd-next, generate-cicd,  │    3 each (16  │
  │ write-docs, query-dot-ai, etc.)                               │     endpoints) │
  ├───────────────────────────────────────────────────────────────┼────────────────┤
  │ GET /authorize, GET /callback, POST /register, POST /token    │   OAuth flow,  │
  │                                                               │       1–2 each │
  ├───────────────────────────────────────────────────────────────┼────────────────┤
  │ GET /api/v1/tools                                             │              1 │
  └───────────────────────────────────────────────────────────────┴────────────────┘

  Read: the service is essentially idle — health probes (~1/s) plus periodic resource
  syncs from the dot-ai controller. Real user activity in the last hour: ~4
  manageKnowledge tool executions (each fanning out to ~465 OpenAI embedding calls —
  that's where the 1,862 embedding spans come from), one OAuth login flow, and a small
  batch of MCP prompt fetches.

  Sample trace — POST /api/v1/resources/sync (147 ms)

  service:  dot-ai-mcp  (host: dot-ai-68dd94677-wqpj2)
  kind:     SERVER
  status:   200 OK
  client:   10.96.2.15  (user-agent: dot-ai-controller/v1.0.0)
  route:    /api/v1/resources/sync

  The controller-manager calls this endpoint roughly every 10 s — matches the 239
  syncs/hour.

  If you want, I can pull a manageKnowledge trace to see the embedding call breakdown,
  or tell me which specific call you want drilled into. To get traces from
  youtube-automation and the other dot-ai pods, they'd need OTel instrumentation —
  they're currently not exporting any spans.

A span-count table, a written read on what’s actually traffic versus background noise, and a sample trace breakdown. Same as before, it noticed something I didn’t ask about — most of my workloads aren’t instrumented and aren’t exporting any spans, and it called that out as something I should fix.

Now, is reading these tables easier than looking at the browser visualizations? No. Of course not. The browser is prettier. But that’s not the point. We’re not supposed to be digesting this data. That’s the whole point. The data is for the LLM. The LLM digests it, draws conclusions, and tells me what matters. The human-readable visualizations are a courtesy, not the substance. And on the rare occasions when I do want to actually look at the data myself, in a graphically rich form, I can ask for that too.

Generate Grafana Dashboards From Natural Language

[user]

Can you visualize all those in the `tmp` dir in Grafana itself?

[agent]

  Listed 1 directory (ctrl+o to expand)

⏺ I'll build a dashboard with CPU, memory, logs, and traces panels, push it to Grafana,
   and save the JSON to tmp/.

⏺ Write(tmp/dot-ai-yt-overview.dashboard.json)
  ⎿  Wrote 176 lines to tmp/dot-ai-yt-overview.dashboard.json
       1 {
       2   "title": "dot-ai & youtube-automation overview",
       3   "uid": "dot-ai-yt-overview",
       4   "tags": ["dot-ai", "youtube-automation"],
       5   "timezone": "browser",
       6   "schemaVersion": 38,
       7   "version": 1,
       8   "refresh": "30s",
       9   "time": {"from": "now-1h", "to": "now"},
      10   "templating": {"list": []},
     … +178 lines (ctrl+o to expand)

  Read 1 file (ctrl+o to expand)

⏺ Now pushing it to Grafana.

  Called grafana 2 times (ctrl+o to expand)

⏺ Dashboard created and pushed to Grafana, JSON saved locally.

  Grafana: https://dot.grafana.net/d/dot-ai-yt-overview?from=now-1h&to=now
  Local: tmp/dot-ai-yt-overview.dashboard.json

  Layout (3 rows, 9 panels):

  ┌───────────┬────────────────────────────────────────────────────────────────────┐
  │    Row    │                               Panels                               │
  ├───────────┼────────────────────────────────────────────────────────────────────┤
  │ Resource  │ CPU usage by pod (cores) · Memory working set by pod (bytes) —     │
  │ usage     │ both filtered to namespace=~"dot-ai|youtube-automation"            │
  ├───────────┼────────────────────────────────────────────────────────────────────┤
  │ Logs      │ {namespace="dot-ai"} · {namespace="youtube-automation"}            │
  ├───────────┼────────────────────────────────────────────────────────────────────┤
  │           │ dot-ai-mcp span rate by endpoint (TraceQL metrics) · Trace search  │
  │ Traces    │ table for both services (youtube-automation will stay empty until  │
  │           │ it's instrumented)                                                 │
  └───────────┴────────────────────────────────────────────────────────────────────┘

  Refresh: 30s · default range: last 1h · UID: dot-ai-yt-overview. Open the link and
  let me know if you want different panels, thresholds, or a namespace dropdown
  variable.

It built the dashboard JSON from scratch, saved a copy locally, pushed it to Grafana, and gave me a link. Let’s open that link and see what we got.

A real Grafana dashboard. Three rows. Top row: CPU and memory by pod, filtered to just the namespaces I care about. Middle row: logs from dot-ai and youtube-automation side by side. Bottom row has trace panels. This dashboard didn’t exist five seconds ago. I asked for it, and now it exists, fitted exactly to the workloads I was just analyzing.

This is the thing I find genuinely exciting. The ability to generate dashboards from the specific data we’re investigating, instead of squinting at pre-defined dashboards that almost-but-not-quite match what we’re trying to figure out, is a killer feature. We’ve always had the option to build ad-hoc dashboards, but the friction was real. Open Grafana, click new dashboard, pick a datasource, write a query, pick a visualization, repeat for every panel, set up the layout, save. By the time you’ve done that, the incident is over and the dashboard is useless. Now? It’s close to instant.

Of course, sometimes we don’t want to build a brand new dashboard. We want to look at one of the existing ones — community dashboards, the ones the platform team set up, or whatever else is already there. So let’s see what we have.

[user]

Which Grafana dashboards do we have?

[agent]

  Called grafana (ctrl+o to expand)

⏺ 18 dashboards across 3 folders.

  Folders
  - GrafanaCloud (uid: dfjv4b2o4vkzkd)
  - Kubernetes (uid: kubernetes)
  - tmp (uid: tmp) — empty

  General (no folder)
  - dot-ai & youtube-automation overview — /d/dot-ai-yt-overview (the one we just
  created)
  - Alert Groups Insights — /d/c4031170-df55-47af-be9c-89d3c1b7a3fd
  - Incident Insights — /d/f932b5d5-4fc4-4db9-ae90-b09f547deff1

  GrafanaCloud folder
  - Billing/Usage
  - Cardinality management - 1 - overview
  - Cardinality management - 2 - metrics
  - Cardinality management - 3 - labels
  - Cloud Logs Export Insights
  - Usage Insights - 1 - Overview
  - Usage Insights - 2 - Data sources
  - Usage Insights - 3 - Query Errors
  - Usage Insights - 4 - Alertmanager
  - Usage Insights - 5 - Metrics Ingestion
  - Usage Insights - 6 - Loki Query Fair Usage Drilldown

  Kubernetes folder
  - kube-state-metrics-v2

  Note: there's a tmp folder in Grafana that's empty — when I created the overview
  dashboard I didn't put it there. Want me to move it into the tmp folder?

Eighteen dashboards across three folders. Most of them are Grafana Cloud’s own usage and billing dashboards. The interesting one for me is kube-state-metrics-v2, the standard Kubernetes dashboard. Let me grab the link to it.

[user]

Show me the link to the `kube-state-metrics-v2` dashboard.

The output.

[agent]

  Called grafana (ctrl+o to expand)

⏺ https://dot.grafana.net/d/k8s-kube-state-metrics-v2

That’s the standard kube-state-metrics dashboard — cluster pod requested, CPU requested, memory requested at the top, then node count, node availability, and a long stack of pod, deployment, and statefulset panels below. Useful, but generic. So the question becomes: can the agent help me adapt it?

First step: get a copy I can mess with without breaking the original. While I’m at it, I’ll ask it to favorite the new copy.

[user]

Create a copy of the `kube-state-metrics-v2` dashboard, name it `dot`, and make it a favorite.

Watch what happens.

[agent]

  Read 5 files, called grafana (ctrl+o to expand)

⏺ Bash(env | grep -i grafana 2>&1 | sed
      's/=.*/=<redacted>/')
  ⎿  (No output)
  ⎿  Allowed by auto mode classifier

  Read 1 file, listed 1 directory (ctrl+o to expand)

⏺ Bash(vals eval -f /Users/viktorfarcic/cod
      e/dot-ai-infra/.env.vals.yaml
      2>/dev/null | grep
      GRAFANA_SERVICE_ACCOUNT_TOKEN | head
      -1 | wc -c)
  ⎿  Denied by auto mode classifier ∙

  Read 2 files (ctrl+o to expand)

⏺ Given the dashboard's size, I'll delegate the copy to a sub-agent so the JSON doesn't
   bloat the main context.

⏺ Agent(Push dot dashboard to Grafana)
  ⎿  Read(tmp/dot.dashboard.json)
     Read(tmp/dot.dashboard.json)
     grafana - Create or update dashboard (MCP)
...
  ⎿  Denied by auto mode classifier ∙

  Read 2 files (ctrl+o to expand)

⏺ Given the dashboard's size, I'll delegate the copy to a sub-agent so the JSON doesn't
   bloat the main context.

⏺ Agent(Push dot dashboard to Grafana)
  ⎿  Done (8 tool uses · 101.6k tokens · 3m 55s)
  (ctrl+o to expand)

⏺ Dashboard pushed as UID dot. Now checking whether there's an MCP path to
  favorite/star it.

⏺ Dashboard copy complete; favoriting needs your click.

  Copy created
  - Title: dot
  - UID: dot
  - URL: https://dot.grafana.net/d/dot
  - Local JSON: tmp/dot.dashboard.json

  Favoriting — there's no MCP tool for starring dashboards (Grafana exposes it via POST
   /api/user/stars/dashboard/uid/dot, but the Grafana MCP server doesn't surface that
  endpoint). Options:

  1. Click the ★ next to the title at https://dot.grafana.net/d/dot — one click, done.
  2. If you want me to do it programmatically, I'd need to call
  https://dot.grafana.net/api/user/stars/dashboard/uid/dot directly with
  GRAFANA_SERVICE_ACCOUNT_TOKEN. That requires (a) you authorizing me to extract that
  secret from vals/GCP Secret Manager and (b) the token having users:write scope. Let
  me know and I'll proceed.

✻ Cogitated for 6m 3s

A few interesting things in there. It pulled the source dashboard, saved it locally, pushed a new copy under the UID dot, and reported back. Favoriting hit a small wall — the Grafana MCP doesn’t expose a star/favorite endpoint, so it offered me the click-it-yourself path or a programmatic fallback that needs a token. Not magical, just honest about what it can and can’t do.

The bigger thing happening behind the scenes is the dashboard JSON itself. Grafana’s modern dashboard format is workable. Their old format — which is what most community dashboards still ship with — is a sprawling, deeply-nested wall of JSON that nobody enjoys editing. I would struggle with it. The agent struggles with it too, but it struggles faster than I would, and I refuse to manage anything by clicking buttons anyway. Buttons are evil — except if it’s an arcade machine. So even when this is a bit messy under the hood, I’d rather have a small mess in code than a tidy mess in a browser.

Here’s the new dashboard.

The dot dashboard, copied straight from kube-state-metrics-v2. Same Cluster row at the top — Pod Requested, CPU Requested, Memory Requested, then capacity charts underneath. Below that, a stack of collapsed rows: Node, Deployments, Statefulset, Pods, Containers, HPA. Exact replica of the original, just under a new name.

All those collapsed rows annoy me. I want everything open by default. So I’ll ask for that.

[user]

Can we make all the panes in the `dot` dashboard be expanded by default?

[agent]

  Read 1 file (ctrl+o to expand)

⏺ Flipping collapsed: true to false isn't enough by itself — Grafana stores child
  panels inside collapsed rows, and they need to be hoisted to the top-level panels
  array when the row is expanded. Let me transform the JSON properly.

⏺ Reading 1 file… (ctrl+o to expand)
  ⎿  $ jq '
     .panels |= (
     [ .[] |
     if .type == "row" and .collapsed == true then

  Read 1 file (ctrl+o to expand)

⏺ Flipping collapsed: true to false isn't enough by itself — Grafana stores child
  panels inside collapsed rows, and they need to be hoisted to the top-level panels
  array when the row is expanded. Let me transform the JSON properly.

  Read 2 files (ctrl+o to expand)

⏺ Now pushing the expanded version to Grafana via sub-agent.

⏺ Agent(Push expanded dot dashboard)
  ⎿  Read(tmp/dot.dashboard.expanded.json)
     Bash(jq -e '.uid, .title, (.panels | length)'
         /Users/viktorfarcic/code/dot-ai-infra/tmp/dot.dashboard.expanded.json)
     Running…
     grafana - Create or update dashboard (MCP)
     ...
⏺ Now pushing the expanded version to Grafana via sub-agent.

⏺ Agent(Push expanded dot dashboard)
  ⎿  Done (6 tool uses · 101.1k tokens · 3m 53s)
  (ctrl+o to expand)

⏺ Dashboard updated to v2 with all 9 rows expanded by default.

  - URL: https://dot.grafana.net/d/dot
  - Local: tmp/dot.dashboard.expanded.json (51 top-level panels)

  Reload the dashboard and every row (Cluster, Node, Deployments, Statefuleset, Pods,
  Containers, HPA, Jobs, PVC) should be open on first load.

✻ Worked for 4m 38s

It tried, but it got it wrong.

The dashboard is broken. Panels in the wrong places, the layout is jumbled. And honestly, I’m not surprised. The agent is wrestling with Grafana’s old dashboard format, and that format is the kind of thing I would not wish on my worst enemy. Working with it by hand is punishment.

The agent screwed it up. I would have screwed it up. The difference is that the agent screwed it up in three minutes and I get to ask for a redo, instead of me screwing it up over an hour. I’m outsourcing the worst part of this work to my agent, and I’m genuinely happy about it. Let me ask it to start over.

[user]

I think the dashboard is messed up. It should be the same as before, but with panes expanded by default. No other changes. Would it make to create a copy of original as a file here in this project (start fresh) and then update it and push it to Grafana.

The response.

[agent]

⏺ Yes, that's the right approach. Let me start fresh: re-fetch the original from
  Grafana, save it as a clean file in this project, then apply only the minimal change.

⏺ Agent(Fetch original kube-state-metrics-v2 JSON)
  ⎿  Done (5 tool uses · 19.7k tokens · 18s)
  (ctrl+o to expand)

⏺ Bash(jq '
        del(.id, .uid, .version)…)
  ⎿  diff vs original (excluding title/uid/id/version/collapsed):
     <<< no diff means only intended changes >>>
     row summary:
     … +49 lines (ctrl+o to expand)
  ⎿  Allowed by auto mode classifier

⏺ Clean diff — only intended changes (title/uid + 8 collapsed flags). Pushing now.

⏺ Agent(Push fresh dot dashboard)
  ⎿  Bash(ls -la /Users/viktorfarcic/code/dot-ai-infra/tmp/dot.dashboard.json && jq -c
⏺ Agent(Push fresh dot dashboard)
  ⎿  Bash(wc -l /Users/viktorfarcic/code/dot-ai-infra/tmp/dot.dashboard.compact.json &&
          jq -c '. | {title, uid, panels_count: (.panels | length)}'
⏺ Agent(Push fresh dot dashboard)
  ⎿  Read(tmp/dot.dashboard.folded.json)
     Read(tmp/dot.dashboard.folded.json)
     grafana - Create or update dashboard (MCP)
     ...
⏺ Clean diff — only intended changes (title/uid + 8 collapsed flags). Pushing now.

⏺ Agent(Push fresh dot dashboard)
  ⎿  Done (9 tool uses · 81.1k tokens · 4m 1s)
  (ctrl+o to expand)

⏺ Pushed as version 3.

  - URL: https://dot.grafana.net/d/dot
  - Local source of truth: tmp/dot.dashboard.json (built from
  tmp/kube-state-metrics-v2.original.json with only title/uid swapped and 8 collapsed:
  true → false)

  Reload the dashboard. If Grafana renders the rows correctly with children visible
  inline, we're done. If a row shows as expanded but its panels still appear hidden
  (some Grafana versions auto-hoist on save and may have quirks), let me know and I'll
  switch to a layout-recomputing approach instead of the minimal flip.

✻ Baked for 5m 8s

It re-fetched the original from Grafana, saved a clean copy as a local file, applied only the minimal change — flip eight collapsed flags from true to false — and pushed it back. Treating the dashboard JSON like source code: clean baseline file, minimal diff, push. Much better than mutating an in-flight version.

Back to the original layout, just with rows expanded. Now the dashboard reads top to bottom without me having to click anything to reveal panels.

Now that I have something working, let me actually customize it. The default kube-state-metrics-v2 has a lot of detail I don’t care about, and it’s missing the things I do care about. So I’ll prune and add at the same time.

[user]

Remove "Deployments" and "Statefulsets" rows and create a new row with "Deployment Replicas - Unavailable" and "Statefulset Replicas - Unavailable" panes instead. Make both red if there is more than 0.

Remove the `Containers` and `Jobs` rows.

The agent grinds through it.

[agent]

⏺ Reading 3 files… (ctrl+o to expand)
  ⎿  $ jq --argjson newRow '
     {
     "id": 1001,
     "type": "row",
     "title": "Replica Health",
     "collapsed": false,
     "gridPos": {"h": 1, "w": 24, "x": 0, "y": 11},
     "datasource": {"type": "prometheus", "uid": "grafanacloud-prom"},
   both red if there is more than 0.

  Remove the `Containers` and `Jobs` rows.

  Read 3 files (ctrl+o to expand)

⏺ Bash(mv /Users/viktorfarcic/code/dot-ai-infra/tmp/dot.dashboard.next.json
      /Users/viktorfarcic/code/dot-ai-infra/tmp/dot.dashboard.json && wc -c
      /Users/viktorfarcic/c…)
  ⎿     70733 /Users/viktorfarcic/code/dot-ai-infra/tmp/dot.dashboard.json

⏺ Agent(Push updated dot dashboard)
  ⎿  Bash(ls -la /Users/viktorfarcic/code/dot-ai-infra/tmp/dot.dashboard.json && jq -c
         '.' /Users/viktorfarcic/code/dot-ai-infra/tmp/dot.dashboard.json >
         /Users/viktorfar…)
     Running…
     Read(tmp/dot.dashboard.compact.json)
     grafana - Create or update dashboard (MCP)
     ...
⏺ Bash(mv /Users/viktorfarcic/code/dot-ai-infra/tmp/dot.dashboard.next.json
      /Users/viktorfarcic/code/dot-ai-infra/tmp/dot.dashboard.json && wc -c
      /Users/viktorfarcic/c…)
  ⎿     70733 /Users/viktorfarcic/code/dot-ai-infra/tmp/dot.dashboard.json

⏺ Agent(Push updated dot dashboard)
  ⎿  Done (4 tool uses · 57.9k tokens · 2m 31s)
  (ctrl+o to expand)

⏺ Pushed as version 4.

  - URL: https://dot.grafana.net/d/dot
  - Local: tmp/dot.dashboard.json (70 KB)

  Final row layout (top → bottom): Cluster · Node · Replica Health · Pods · HPA · PVC.

  The new "Replica Health" row has two stat panels:
  - Deployment Replicas - Unavailable — sum(kube_deployment_status_replicas_unavailable
  {namespace=~"$namespace",cluster=~"$cluster"})
  - Statefulset Replicas - Unavailable — sum(kube_statefulset_replicas{...}) -
  sum(kube_statefulset_status_replicas_ready{...}) (kube-state-metrics doesn't expose a
   direct _unavailable series for statefulsets)

  Both use thresholds: green at 0, red background at ≥ 1.

✻ Cogitated for 3m 46s

It removed the rows I asked to remove, added a new Replica Health row with two stat panels — one summing unavailable Deployment replicas, one computing unavailable Statefulset replicas from the difference between desired and ready, since kube-state-metrics doesn’t expose a direct unavailable-statefulset series. And it set thresholds so both panels turn red when the value goes above zero. Here’s what that looks like.

The new Replica Health row, sitting in the middle of the dashboard. Two big stat panels — Deployment Replicas Unavailable and Statefulset Replicas Unavailable — both showing zero, both green. The moment something breaks, they turn red. Below them, the Pods row I asked to keep. The Containers and Jobs rows are gone.

That’s the demo. Now let me step back and lay out where the Grafana Assistant and the Grafana MCP shine, and where they still have rough edges.

Grafana Assistant: My Verdict

Now think about what just happened, end to end. I have a generic agent — Claude Code — that knows my whole stack, lives in my terminal, and can fix code, push to Git, talk to my cluster. I gave it access to Grafana through an MCP. From there, it can pull metrics, logs, and traces directly. It can build me a fresh dashboard fitted to whatever I’m investigating right now. It can copy, modify, and prune existing dashboards. And when I want the specialized Grafana Assistant — the one with the deepest Grafana-specific knowledge — I can have my generic agent call that too, through the same MCP, via ask_assistant. That combination — a generic agent I already use, plus direct access to all my data, plus the specialized Grafana Assistant on tap, plus on-demand custom dashboards — is, for me, the dream come true. I don’t have to switch agents. I don’t have to leave the terminal. Analysis and remediation live in the same place. And the moment I need to look at something graphically, I get a dashboard built for that exact thing.

If you have your own custom agent — for analysis, for remediation, for incident response, for anything that touches running systems — connecting it to the Grafana MCP gives it super-powers. It did exactly that to mine.

The agents I’ve built for my own infrastructure used to be effectively blind to runtime state. They could read code, they could talk to my cluster’s control plane, they could write and apply fixes, but they couldn’t see what was actually happening in production. They were reasoning in the dark. The moment I plugged them into the Grafana MCP, that changed. They now ground their reasoning in actual metrics, actual logs, actual traces. They verify whether a fix worked by checking the data after the deploy, not by hoping. They notice things I didn’t ask them to look for, because they’re scanning observability data on the way to doing whatever else I asked them to do.

The MCP is sitting right there. The data is sitting right there. The moment you connect them, your agent stops being a code-aware system and starts being a code-and-runtime-aware system. That’s a different category of useful.

Now, normally this is the part of the video where I list the rough edges. Here’s the honest version: I could not find anything wrong with the Assistant or the Grafana MCP themselves. Both work. They do what they say they do. So if you want me to complain about something, I’m going to have to pick on an old Grafana wound that AI just makes more visible.

That wound is the dashboard JSON format. The modern format is workable, but most community dashboards still ship with the old format, and that thing is a sprawling, deeply-nested wall that’s hard to edit by hand and hard for agents too. You saw it earlier. The agent got it wrong on the first try and needed a clean retry to flip eight collapsed flags without scrambling the layout. That isn’t a bug in the MCP, and it isn’t a bug in Claude Code. It’s the format. It was painful before AI, and it’s still painful now. The agent absorbs the pain instead of you, which is a real win, but the format itself could use a serious cleanup.

Bottom line: this is the dream come true. If you live in Grafana, the Assistant is already there in the UI. If you’ve moved to a coding agent like Claude Code, OpenCode, or Cursor, the Grafana MCP is exactly what you needed, including the Assistant on tap via ask_assistant. Generic agents, plus direct access to your data, plus specialized agents, is how observability is going to work from now on. Strongly recommended.