Replacing OpsGenie with GoAlert: GitOps Incident Management
The Challenge
Atlassian is retiring OpsGenie as a standalone product, absorbing it into the Jira Service Management monolith. For a team running a multi-tenant European observability platform — Grafana, Mimir, Loki, Tempo on Scaleway in France — this forced a choice: buy into an ever-growing bundle of non-EU SaaS services just to keep on-call schedules running, or replace OpsGenie entirely.
Three problems made the status quo untenable:
- Bundling lock-in. OpsGenie was a focused tool. Now it is becoming a feature inside Jira Service Management. The bundle grows, the price grows, the data gravity increases. Leaving gets harder every quarter.
- Data residency. Alert data, on-call schedules, notification logs — all on Atlassian's US infrastructure. For a platform that stores everything else in France, this was an unjustifiable gap.
- Operational mismatch. Our entire stack — infrastructure, observability, application configuration — is managed through Git and reconciled by Flux. OpsGenie's web UI and API-only configuration broke the GitOps model.
The Approach
We evaluated OpsGenie alternatives against three criteria: open source (Apache 2.0 or equivalent), self-hostable on European infrastructure, and simple enough to operate without a dedicated team.
GoAlert — originally built by Target Corporation — met all three. It is a single Go binary backed by PostgreSQL. It handles on-call scheduling, escalation policies, and SMS/voice notifications with two-way acknowledgment. It has a clean GraphQL API and supports generic webhook ingestion.
But GoAlert had a gap: no GitOps provisioning. Services, escalation policies, schedules, and user notification rules are configured through a web UI or direct API calls. For a team that manages everything through Git, that was not going to work.
So we built two tools to close the gap.
The Solution
GoAlert Deployment
The alert flow is straightforward:
Grafana Alert Rules → Grafana Alertmanager → GoAlert Webhook → Notifications
                                                    ↑
                                             SMS ack (1a/1c)
Grafana fires alerts, GoAlert routes them to whoever is on call, Twilio delivers SMS and voice. Engineers acknowledge by replying 1a to the SMS. When Grafana resolves the alert, GoAlert auto-closes it.
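On the Grafana side, the webhook target can be declared as a provisioned contact point. A minimal sketch, assuming a GoAlert instance at goalert.example.internal: the hostname, uid, and the GOALERT_INTEGRATION_KEY variable are placeholders, and the Grafana-type integration path should be verified against your GoAlert version.

```yaml
# Grafana provisioning file (contact point). Hostname, uid, and the
# key variable are illustrative placeholders.
apiVersion: 1
contactPoints:
  - orgId: 1
    name: goalert-critical
    receivers:
      - uid: goalert-critical
        type: webhook
        settings:
          # GoAlert's Grafana integration endpoint; the token is the
          # integration key the operator writes to a Kubernetes Secret.
          url: https://goalert.example.internal/api/v2/grafana/incoming?token=${GOALERT_INTEGRATION_KEY}
          httpMethod: POST
```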
GoAlert's generic webhook ingestion also serves as a transitional bridge: customers who have not migrated their monitoring yet can forward Datadog and CloudWatch alerts into GoAlert via HTTP POST. No additional integration work required.
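Such a bridge can be a single HTTP POST. A hedged sketch: the host, integration key, and the forward_alert helper are placeholders, while the generic endpoint path and its summary/dedup parameters follow GoAlert's generic API as we understand it.

```shell
# Placeholders: host and integration key come from the tenant's Secret.
GOALERT="https://goalert.example.internal"
TOKEN="replace-with-integration-key"

# forward_alert is a hypothetical helper: POST an external alert
# (e.g. from a Datadog or CloudWatch webhook) into GoAlert's generic
# ingestion endpoint. "dedup" lets repeated posts update one alert.
forward_alert() {
  local summary="$1" dedup="$2"
  curl -sX POST "${GOALERT}/api/v2/generic/incoming" \
    --data-urlencode "token=${TOKEN}" \
    --data-urlencode "summary=${summary}" \
    --data-urlencode "dedup=${dedup}"
}

# Example: forward_alert "Datadog: high CPU on web-01" "datadog-1234"
```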
goalert-provisioning: A Kubernetes Operator
We built a Kubernetes operator that reconciles GoAlert configuration from Custom Resource Definitions:
apiVersion: goalert.heystaq.com/v1alpha1
kind: GoAlertService
metadata:
  name: atlas-critical
  namespace: org-atlas
spec:
  serviceName: "Atlas CRITICAL"
  escalationPolicyRef:
    name: aknostic-critical
    namespace: aknostic
  integrationKeys:
    - name: datadog-critical
      secretRef:
        name: atlas-integration-keys
        key: datadog-critical

Push to Git, Flux syncs to the cluster, the operator creates the service in GoAlert and writes the integration key token to a Kubernetes Secret.
The operator covers the full lifecycle:
- Services and integration keys
- Escalation policies
- On-call schedules and rotations
- User accounts and contact methods
- System-level admin configuration
This mirrors the pattern we already use for Grafana resources through the Grafana Operator. The entire incident management stack — from metric collection to alert routing to on-call notification — lives in version control. One repository. One review process. One audit log.
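For illustration, the escalation policy referenced by the service above might be expressed as a sibling resource. This is a sketch patterned on the GoAlertService example, not the operator's published schema: the kind name and every spec field below are assumptions.

```yaml
# Hypothetical resource shape; field names are illustrative only.
apiVersion: goalert.heystaq.com/v1alpha1
kind: GoAlertEscalationPolicy
metadata:
  name: aknostic-critical
  namespace: aknostic
spec:
  description: "Critical escalation for the platform team"
  steps:
    - delayMinutes: 5
      targets:
        - scheduleRef:
            name: platform-oncall
            namespace: aknostic
```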
gctl: A CLI for Model-Based Root Cause Analysis
The operator solved provisioning. But we also wanted to bring large language models into incident response. That required a programmatic interface to GoAlert.
gctl oncall
gctl alert list
gctl query '{ services { nodes { name } } }'

gctl is a thin GraphQL client. We built it because we expose GoAlert data to our AI agents. An RCA investigation usually starts at the end: with the alert. The model needs to find patterns in incident response, just as it needs access to the other components of the observability stack.
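An agent's first step is often pulling the open alerts for context. A sketch of such a query, run through gctl query; the field names follow GoAlert's GraphQL schema as we understand it, so verify them with schema introspection on your instance.

```graphql
# List unresolved alerts with their owning service, as RCA input.
{
  alerts(input: { filterByStatus: [StatusUnacknowledged, StatusAcknowledged] }) {
    nodes {
      alertID
      summary
      status
      service {
        name
      }
    }
  }
}
```

The result is plain JSON, which an agent can consume directly.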
The model gets access to alert context, related metrics in Mimir, logs in Loki, and traces in Tempo. It correlates signals across these sources and proposes a root cause. It does not replace the engineer's judgment, but it handles the tedious part: cross-referencing dashboards, crafting log queries, chasing through trace spans.
The model operates within the same tenant boundaries enforced by the platform, querying Mimir and Loki with the correct X-Scope-OrgID header. Multi-tenancy is not an afterthought — it is the same isolation model that governs everything else.
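In practice, every query helper the agent uses takes the tenant ID explicitly. A minimal sketch: mimir_query and the hostname are placeholders, while /prometheus/api/v1/query is Mimir's standard Prometheus-compatible endpoint.

```shell
# mimir_query is a hypothetical helper: run an instant PromQL query
# scoped to one tenant. The X-Scope-OrgID header is what enforces
# tenant isolation in Mimir (Loki uses the same header).
mimir_query() {
  local tenant="$1" promql="$2"
  curl -sG "https://mimir.example.internal/prometheus/api/v1/query" \
    -H "X-Scope-OrgID: ${tenant}" \
    --data-urlencode "query=${promql}"
}

# Example: mimir_query org-atlas 'sum(rate(http_requests_total[5m]))'
```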
Multi-Tenant Onboarding
Each customer gets a Kubernetes namespace containing their GoAlert CRDs, Grafana CRDs, and SOPS-encrypted secrets. Onboarding a new tenant is a directory copy with find-and-replace:
- Copy the namespace template directory
- Replace tenant name, escalation policy references, and integration key names
- Open a pull request
- Flux reconciles, operator provisions GoAlert, secrets are written
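The steps above can be sketched as a few shell commands. Assumptions: a tenants/_template directory holding manifests with a TENANT placeholder; the layout and placeholder name are illustrative, not our actual repository.

```shell
# Stand-in for the template directory (the real one holds the full
# set of GoAlert CRDs, Grafana CRDs, and SOPS-encrypted secrets).
mkdir -p tenants/_template
cat > tenants/_template/goalert-service.yaml <<'EOF'
apiVersion: goalert.heystaq.com/v1alpha1
kind: GoAlertService
metadata:
  name: TENANT-critical
  namespace: org-TENANT
spec:
  serviceName: "TENANT CRITICAL"
EOF

# Onboard a tenant: copy the template, then replace the placeholder
# in every file (-i.bak keeps GNU and BSD sed both happy).
TENANT=acme
cp -r tenants/_template "tenants/$TENANT"
find "tenants/$TENANT" -type f -exec sed -i.bak "s/TENANT/$TENANT/g" {} +
find "tenants/$TENANT" -name '*.bak' -delete
```

From there it is an ordinary pull request; Flux and the operator do the rest.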
We onboarded five tenants this way, each with critical and non-critical alert routing, in under an hour per tenant.
Outcomes
| Aspect | OpsGenie | GoAlert + Operator |
|--------|----------|--------------------|
| Licensing | Proprietary (Atlassian bundle) | Apache 2.0 (open source) |
| Data residency | US (Atlassian infrastructure) | France (Scaleway PostgreSQL) |
| Configuration model | Web UI / API calls | GitOps (CRDs + Flux) |
| Tenant onboarding | Manual per-tenant setup | Pull request, < 1 hour |
| Audit trail | Vendor logs | Git commits |
| AI/programmatic access | Limited API | Full GraphQL via gctl |
| Vendor dependency | Atlassian roadmap | Community + self-maintained |
What we gained beyond independence:
- Full GitOps lifecycle. When someone asks why an alert went to a particular person, the answer is a Git commit.
- AI-assisted RCA. gctl gives language models access to alert context alongside Mimir, Loki, and Tempo — enabling cross-signal root cause analysis within tenant boundaries.
- Transitional bridge. Generic webhook ingestion means customers can forward existing Datadog/CloudWatch alerts without migrating their monitoring first.
One remaining US dependency: Twilio for SMS and voice delivery. We will likely replace it, but there are more important workloads to repatriate first. Prioritisation matters.
Lessons Learned
- Build the operator, not the integration. GoAlert's GraphQL API was clean enough that building a Kubernetes operator was straightforward. The operator eliminated an entire category of manual work and made the tool fit our existing GitOps workflow instead of creating an exception.
- Design for AI access from the start. Building gctl as a CLI with structured GraphQL output meant language models could consume GoAlert data immediately. Retrofitting programmatic access to a web-UI-first tool is always harder.
- Multi-tenant onboarding should be boring. A directory copy with find-and-replace is not elegant. It is fast, auditable, and requires no custom tooling. Five tenants in five hours.
- Do not let perfect block good. Twilio is a US dependency. We acknowledged it, documented it, and moved on to higher-priority repatriation work. Sovereignty is a direction, not a binary state.
The goalert-provisioning operator and gctl are open source. The operator installs via Helm and works with any GoAlert instance. gctl installs with go install.