Replacing OpsGenie with GoAlert: GitOps Incident Management
The Challenge
Atlassian is retiring OpsGenie as a standalone product, absorbing it into the Jira Service Management monolith. For a team running a multi-tenant European observability platform — Grafana, Mimir, Loki, Tempo on Scaleway in France — this forced a choice: buy into an ever-growing bundle of non-EU SaaS services just to keep on-call schedules running, or replace OpsGenie entirely.
Three problems made the status quo untenable:
- Bundling lock-in. OpsGenie was a focused tool. Now it is becoming a feature inside Jira Service Management. The bundle grows, the price grows, the data gravity increases. Leaving gets harder every quarter.
- Data residency. Alert data, on-call schedules, notification logs — all on Atlassian's US infrastructure. For a platform that stores everything else in France, this was an unjustifiable gap.
- Operational mismatch. Our entire stack — infrastructure, observability, application configuration — is managed through Git and reconciled by Flux. OpsGenie's web UI and API-only configuration broke the GitOps model.
The Approach
We evaluated OpsGenie alternatives against three criteria: open source (Apache 2.0 or equivalent), self-hostable on European infrastructure, and simple enough to operate without a dedicated team.
GoAlert — originally built by Target Corporation — met all three. It is a single Go binary backed by PostgreSQL. It handles on-call scheduling, escalation policies, and SMS/voice notifications with two-way acknowledgment. It has a clean GraphQL API and supports generic webhook ingestion.
But GoAlert had a gap: no GitOps provisioning. Services, escalation policies, schedules, and user notification rules are configured through a web UI or direct API calls. For a team that manages everything through Git, that was not going to work.
So we built two tools to close the gap.
The Solution
GoAlert Deployment
The alert flow is straightforward:
Grafana Alert Rules → Grafana Alertmanager → GoAlert Webhook → Notifications
                                                    ↑
                                             SMS ack (1a/1c)
Grafana fires alerts, GoAlert routes them to whoever is on call, Twilio delivers SMS and voice. Engineers acknowledge by replying 1a to the SMS. When Grafana resolves the alert, GoAlert auto-closes it.
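On the Grafana side, the webhook target can be declared as a provisioned contact point. A minimal sketch, assuming a GoAlert instance at goalert.example.internal: the hostname, uid, and the GOALERT_INTEGRATION_KEY variable are placeholders, and the Grafana-type integration path should be verified against your GoAlert version.

```yaml
# Grafana provisioning file (contact point). Hostname, uid, and the
# key variable are illustrative placeholders.
apiVersion: 1
contactPoints:
  - orgId: 1
    name: goalert-critical
    receivers:
      - uid: goalert-critical
        type: webhook
        settings:
          # GoAlert's Grafana integration endpoint; the token is the
          # integration key the operator writes to a Kubernetes Secret.
          url: https://goalert.example.internal/api/v2/grafana/incoming?token=${GOALERT_INTEGRATION_KEY}
          httpMethod: POST
```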
GoAlert's generic webhook ingestion also serves as a transitional bridge: customers who have not migrated their monitoring yet can forward Datadog and CloudWatch alerts into GoAlert via HTTP POST. No additional integration work required.
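Such a bridge can be a single HTTP POST. A hedged sketch: the host, integration key, and the forward_alert helper are placeholders, while the generic endpoint path and its summary/dedup parameters follow GoAlert's generic API as we understand it.

```shell
# Placeholders: host and integration key come from the tenant's Secret.
GOALERT="https://goalert.example.internal"
TOKEN="replace-with-integration-key"

# forward_alert is a hypothetical helper: POST an external alert
# (e.g. from a Datadog or CloudWatch webhook) into GoAlert's generic
# ingestion endpoint. "dedup" lets repeated posts update one alert.
forward_alert() {
  local summary="$1" dedup="$2"
  curl -sX POST "${GOALERT}/api/v2/generic/incoming" \
    --data-urlencode "token=${TOKEN}" \
    --data-urlencode "summary=${summary}" \
    --data-urlencode "dedup=${dedup}"
}

# Example: forward_alert "Datadog: high CPU on web-01" "datadog-1234"
```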
goalert-provisioning: A Kubernetes Operator
We built a Kubernetes operator that reconciles GoAlert configuration from Custom Resource Definitions:
apiVersion: goalert.heystaq.com/v1alpha1
kind: GoAlertService
metadata:
  name: atlas-critical
  namespace: org-atlas
spec:
  serviceName: "Atlas CRITICAL"
  escalationPolicyRef:
    name: aknostic-critical
    namespace: aknostic
  integrationKeys:
    - name: datadog-critical
      secretRef:
        name: atlas-integration-keys
        key: datadog-critical

Push to Git, Flux syncs to the cluster, the operator creates the service in GoAlert and writes the integration key token to a Kubernetes Secret.
The operator covers the full lifecycle:
- Services and integration keys
- Escalation policies
- On-call schedules and rotations
- User accounts and contact methods
- System-level admin configuration
This mirrors the pattern we already use for Grafana resources through the Grafana Operator. The entire incident management stack — from metric collection to alert routing to on-call notification — lives in version control. One repository. One review process. One audit log.
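For illustration, the escalation policy referenced by the service above might be expressed as a sibling resource. This is a sketch patterned on the GoAlertService example, not the operator's published schema: the kind name and every spec field below are assumptions.

```yaml
# Hypothetical resource shape; field names are illustrative only.
apiVersion: goalert.heystaq.com/v1alpha1
kind: GoAlertEscalationPolicy
metadata:
  name: aknostic-critical
  namespace: aknostic
spec:
  description: "Critical escalation for the platform team"
  steps:
    - delayMinutes: 5
      targets:
        - scheduleRef:
            name: platform-oncall
            namespace: aknostic
```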
gctl: A CLI for Model-Based Root Cause Analysis
The operator solved provisioning. But we also wanted to bring large language models into incident response. That required a programmatic interface to GoAlert.
gctl oncall
gctl alert list
gctl query '{ services { nodes { name } } }'

gctl is a thin GraphQL client. We built it because we expose GoAlert data to our AI agents. An RCA investigation usually starts at the end: with the alert. The model needs to find patterns in incident response, just as it needs access to the other components of the observability stack.
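An agent's first step is often pulling the open alerts for context. A sketch of such a query, run through gctl query; the field names follow GoAlert's GraphQL schema as we understand it, so verify them with schema introspection on your instance.

```graphql
# List unresolved alerts with their owning service, as RCA input.
{
  alerts(input: { filterByStatus: [StatusUnacknowledged, StatusAcknowledged] }) {
    nodes {
      alertID
      summary
      status
      service {
        name
      }
    }
  }
}
```

The result is plain JSON, which an agent can consume directly.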
The model gets access to alert context, related metrics in Mimir, logs in Loki, and traces in Tempo. It correlates signals across these sources and proposes a root cause. It does not replace the engineer's judgment, but it handles the tedious part: cross-referencing dashboards, crafting log queries, chasing through trace spans.
The model operates within the same tenant boundaries enforced by the platform, querying Mimir and Loki with the correct X-Scope-OrgID header. Multi-tenancy is not an afterthought — it is the same isolation model that governs everything else.
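In practice, every query helper the agent uses takes the tenant ID explicitly. A minimal sketch: mimir_query and the hostname are placeholders, while /prometheus/api/v1/query is Mimir's standard Prometheus-compatible endpoint.

```shell
# mimir_query is a hypothetical helper: run an instant PromQL query
# scoped to one tenant. The X-Scope-OrgID header is what enforces
# tenant isolation in Mimir (Loki uses the same header).
mimir_query() {
  local tenant="$1" promql="$2"
  curl -sG "https://mimir.example.internal/prometheus/api/v1/query" \
    -H "X-Scope-OrgID: ${tenant}" \
    --data-urlencode "query=${promql}"
}

# Example: mimir_query org-atlas 'sum(rate(http_requests_total[5m]))'
```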
Multi-Tenant Onboarding
Each customer gets a Kubernetes namespace containing their GoAlert CRDs, Grafana CRDs, and SOPS-encrypted secrets. Onboarding a new tenant is a directory copy with find-and-replace:
- Copy the namespace template directory
- Replace tenant name, escalation policy references, and integration key names
- Open a pull request
- Flux reconciles, operator provisions GoAlert, secrets are written
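The steps above can be sketched as a few shell commands. Assumptions: a tenants/_template directory holding manifests with a TENANT placeholder; the layout and placeholder name are illustrative, not our actual repository.

```shell
# Stand-in for the template directory (the real one holds the full
# set of GoAlert CRDs, Grafana CRDs, and SOPS-encrypted secrets).
mkdir -p tenants/_template
cat > tenants/_template/goalert-service.yaml <<'EOF'
apiVersion: goalert.heystaq.com/v1alpha1
kind: GoAlertService
metadata:
  name: TENANT-critical
  namespace: org-TENANT
spec:
  serviceName: "TENANT CRITICAL"
EOF

# Onboard a tenant: copy the template, then replace the placeholder
# in every file (-i.bak keeps GNU and BSD sed both happy).
TENANT=acme
cp -r tenants/_template "tenants/$TENANT"
find "tenants/$TENANT" -type f -exec sed -i.bak "s/TENANT/$TENANT/g" {} +
find "tenants/$TENANT" -name '*.bak' -delete
```

From there it is an ordinary pull request; Flux and the operator do the rest.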
We onboarded five tenants this way, each with critical and non-critical alert routing, in under an hour per tenant.
Outcomes
| Aspect | OpsGenie | GoAlert + Operator |
|--------|----------|--------------------|
| Licensing | Proprietary (Atlassian bundle) | Apache 2.0 (open source) |
| Data residency | US (Atlassian infrastructure) | France (Scaleway PostgreSQL) |
| Configuration model | Web UI / API calls | GitOps (CRDs + Flux) |
| Tenant onboarding | Manual per-tenant setup | Pull request, < 1 hour |
| Audit trail | Vendor logs | Git commits |
| AI/programmatic access | Limited API | Full GraphQL via gctl |
| Vendor dependency | Atlassian roadmap | Community + self-maintained |
What we gained beyond independence:
- Full GitOps lifecycle. When someone asks why an alert went to a particular person, the answer is a Git commit.
- AI-assisted RCA. gctl gives language models access to alert context alongside Mimir, Loki, and Tempo — enabling cross-signal root cause analysis within tenant boundaries.
- Transitional bridge. Generic webhook ingestion means customers can forward existing Datadog/CloudWatch alerts without migrating their monitoring first.
One remaining US dependency: Twilio for SMS and voice delivery. We will likely replace it, but there are more important workloads to repatriate first. Prioritisation matters.
Lessons Learned
- Build the operator, not the integration. GoAlert's GraphQL API was clean enough that building a Kubernetes operator was straightforward. The operator eliminated an entire category of manual work and made the tool fit our existing GitOps workflow instead of creating an exception.
- Design for AI access from the start. Building gctl as a CLI with structured GraphQL output meant language models could consume GoAlert data immediately. Retrofitting programmatic access to a web-UI-first tool is always harder.
- Multi-tenant onboarding should be boring. A directory copy with find-and-replace is not elegant. It is fast, auditable, and requires no custom tooling. Five tenants in five hours.
- Do not let perfect block good. Twilio is a US dependency. We acknowledged it, documented it, and moved on to higher-priority repatriation work. Sovereignty is a direction, not a binary state.
The goalert-provisioning operator and gctl are open source. The operator installs via Helm and works with any GoAlert instance. gctl installs with go install.