Incident Management with Grafana OnCall

Disclaimer: This post was originally written for my company’s blog. I’ve translated it and added a personal touch for this space.

Grafana has become a central building block for many observability setups - bringing metrics, logs, and dashboards together in one place. But once alerts start firing, visibility alone is no longer enough - especially when the team grows: Who is actually responsible right now? Has someone already reacted - or are even multiple people working on the same issue without knowing about each other?

This is where Grafana OnCall comes into play. It adds structured incident management on top of Grafana’s alerting by introducing clear ownership, on-call schedules, and escalation rules. In this article, I’ll take a practical look at how OnCall works and how to set it up to manage alerts and responsibilities - without adding unnecessary complexity.

What is Grafana OnCall?

Grafana OnCall is an incident management tool that integrates directly into your existing Grafana setup. It was designed to address common weaknesses in traditional alerting setups - such as unclear responsibilities, notifications spread across multiple tools, or missing escalation logic.

Its core features include:

  • Escalation rules and escalation chains
  • On-call schedules (including rotations and absences)
  • Acknowledgement and escalation of alerts
  • Integration with Alertmanager, Loki, and more
  • Notifications via Slack, Telegram, SMS, mobile app, etc.

Prerequisites

To use Grafana OnCall in your environment, the following should already be in place:

  • A Kubernetes cluster with appropriate access
  • A running Grafana instance (including access to the admin interface)
  • Prometheus configured as a data source
  • Initial alerts defined either in Prometheus or via Grafana Alerting

Enabling and setting up OnCall

Starting with Grafana version 9.4, the OnCall plugin is usually already installed and just needs to be enabled.

To do so, open the Alerts & IRM section in the left sidebar of your Grafana instance (IRM stands for Incident Response & Management). You should see an OnCall entry there. If it’s missing, you can install the plugin via the plugin marketplace (found under Administration). Make sure to choose the official plugin from Grafana Labs, as there are also community plugins with similar names.

Next, you’ll connect OnCall to your existing alerting infrastructure. Within the OnCall UI, go to the Integrations tab and create a new integration - for example Prometheus or Alertmanager, depending on what you’re using.

Once the integration is set up, OnCall automatically creates a default routing rule: alerts without any additional routing information are handled by this rule. This is a good starting point, but it can be refined easily. For example, you can use labels like team=platform in your alerts to explicitly control which team should receive them. Label-based routing becomes especially useful when multiple teams with different responsibilities are involved.
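To make the idea of label-based routing with a default fallback more concrete, here is a minimal conceptual sketch in Python. It is not OnCall’s API (routes are configured in the OnCall UI); the function name pick_route, the route table, and the destination names are assumptions made up for illustration.

```python
from typing import Mapping, Sequence, Tuple

# Hypothetical route table entry: (label name, label value, destination).
# These names are illustrative only and not part of OnCall's API.
Route = Tuple[str, str, str]


def pick_route(labels: Mapping[str, str],
               routes: Sequence[Route],
               default: str = "default") -> str:
    """Return the destination of the first matching route, else the default."""
    for key, value, destination in routes:
        if labels.get(key) == value:
            return destination
    return default


ROUTES = [
    ("team", "platform", "platform-escalation"),
    ("team", "payments", "payments-escalation"),
]

# An alert labeled team=platform goes to the platform team's chain.
print(pick_route({"team": "platform", "severity": "critical"}, ROUTES))
# An alert without a team label falls through to the default route.
print(pick_route({"severity": "warning"}, ROUTES))
```

The first match wins in this sketch, so the order of the route table matters, and anything unmatched ends up at the default destination.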

Teams, schedules, and escalation rules

A central concept in Grafana OnCall is clearly defined ownership - and that starts with teams. In the OnCall interface, you can create a team for each group responsible for handling alerts, such as a platform team or a feature team. Members can be added either via email or through single sign-on, if available.

Next, you can define on-call schedules: Who is on duty and when. A common setup is a weekly rotation starting on Monday at 08:00. Responsibility is then automatically rotated between team members. A particularly useful feature is absence management: planned time off can be entered directly, allowing OnCall to automatically assign a replacement if needed. More complex setups, such as first- and second-level on-call rotations, are also supported.
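The weekly rotation described above can be sketched in a few lines of Python. This is only a model of the idea, not how OnCall implements schedules: the function on_call_at, the member names, and the rule "an absent person is replaced by the next member in the list" are assumptions for illustration.

```python
from datetime import date, datetime, timedelta
from typing import Mapping, Optional, Sequence, Set


def on_call_at(moment: datetime,
               members: Sequence[str],
               rotation_start: datetime,
               absences: Optional[Mapping[str, Set[date]]] = None) -> str:
    """Return who is on call at `moment` in a weekly rotation.

    `rotation_start` is the first handover (e.g. a Monday at 08:00).
    `absences` maps a member to the dates on which they are unavailable.
    """
    if not members:
        raise ValueError("the rotation needs at least one member")
    if moment < rotation_start:
        raise ValueError("moment lies before the start of the rotation")
    absences = absences or {}
    # Number of completed weekly shifts since the rotation started.
    weeks = (moment - rotation_start) // timedelta(weeks=1)
    # Try the scheduled person first, then the following members as replacements.
    for offset in range(len(members)):
        candidate = members[(weeks + offset) % len(members)]
        if moment.date() not in absences.get(candidate, set()):
            return candidate
    raise LookupError("nobody is available at this time")


TEAM = ["alice", "bob", "carol"]       # hypothetical team members
START = datetime(2024, 1, 1, 8, 0)     # Monday, 1 January 2024, 08:00

print(on_call_at(datetime(2024, 1, 3, 14, 0), TEAM, START))
print(on_call_at(datetime(2024, 1, 9, 12, 0), TEAM, START,
                 absences={"bob": {date(2024, 1, 9)}}))
```

In the second call, bob would normally be on duty in the second week, but because he is marked absent on that day, the sketch hands the shift to carol.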

To ensure no alert gets lost, you can also configure an escalation chain. A typical example: If the first notified person does not acknowledge the alert within ten minutes, the next person in the chain is notified automatically. Escalations can be multi-level and repeated at fixed intervals (e.g. every 15 minutes) until someone takes ownership of the incident. This ensures that critical alerts do not go unnoticed just because one person is unavailable.
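To illustrate the timing, the following Python sketch computes who would be notified at which minute for the example above (ten minutes per level, a repeat after 15 minutes). It is a simplified model under my own assumptions, not OnCall’s actual behavior: the function escalation_timeline, its parameters, and the rule that the chain restarts from the first person after the repeat interval are hypothetical.

```python
from typing import List, Optional, Sequence, Tuple


def escalation_timeline(chain: Sequence[str],
                        wait_minutes: int = 10,
                        repeat_minutes: int = 15,
                        acked_at: Optional[int] = None,
                        horizon_minutes: int = 60) -> List[Tuple[int, str]]:
    """Return (minute, person) notifications until the alert is acknowledged.

    `acked_at` is the minute at which someone acknowledges (None = never).
    `horizon_minutes` only bounds the simulation so that it terminates.
    """
    if not chain:
        raise ValueError("the escalation chain must not be empty")
    if wait_minutes <= 0 or repeat_minutes <= 0:
        raise ValueError("intervals must be positive")
    timeline: List[Tuple[int, str]] = []
    minute = 0
    step = 0
    while minute <= horizon_minutes and (acked_at is None or minute < acked_at):
        timeline.append((minute, chain[step % len(chain)]))
        step += 1
        # After the last person, pause for the repeat interval before restarting.
        at_end_of_chain = step % len(chain) == 0
        minute += repeat_minutes if at_end_of_chain else wait_minutes
    return timeline


# Nobody reacts within the first 30 minutes: the chain runs once and repeats.
print(escalation_timeline(["alice", "bob"], horizon_minutes=30))
# bob acknowledges at minute 12, so the escalation stops after his notification.
print(escalation_timeline(["alice", "bob"], acked_at=12))
```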

Connecting alerts to OnCall

Once teams, schedules, and escalation chains are in place, the next step is routing alerts to the right destination - and this is where OnCall relies on a simple but effective mechanism: labels.

In your alerting rules (e.g. in a Prometheus rule file, whose alerts are then forwarded by Alertmanager), you can define labels that OnCall uses as routing criteria - e.g. the label team=platform mentioned above. As soon as an alert carries this label, OnCall can match it to the platform team’s route and trigger the corresponding escalation chain.

Here’s a simple Prometheus alerting rule as an example:

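- alert: HighCPUUsage
  # node_cpu_seconds_total is a counter, so derive a busy percentage with rate()
  expr: 100 * (1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m]))) > 90
  for: 5m  # only fire if the condition holds for five minutes
  labels:
    severity: critical
    team: platform  # label used for routing in OnCall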

This ensures that the alert doesn’t just land somewhere in a generic Slack channel, but is routed directly to the responsible team.

Receiving and responding to alerts

When an alert is triggered, the responsible person receives it through the configured channel - for example via Slack, the Grafana mobile app, or SMS. With a single click, the alert can be acknowledged, signaling that someone is actively working on it. Optionally, comments can be added or the alert can be reassigned to another team or individual. Once the issue is resolved, the alert can be marked as resolved.
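The lifecycle described here (firing, acknowledged, resolved) can be pictured as a small state machine. The sketch below is a deliberately simplified model of that flow and covers only the transitions mentioned above; the names State, ALLOWED, and transition are my own and not taken from OnCall.

```python
from enum import Enum


class State(Enum):
    FIRING = "firing"
    ACKNOWLEDGED = "acknowledged"
    RESOLVED = "resolved"


# Simplified assumption: only the transitions described in the text are allowed.
ALLOWED = {
    State.FIRING: {State.ACKNOWLEDGED, State.RESOLVED},
    State.ACKNOWLEDGED: {State.RESOLVED},
    State.RESOLVED: set(),
}


def transition(current: State, target: State) -> State:
    """Return the new state, or raise ValueError if the step is not allowed."""
    if target not in ALLOWED[current]:
        raise ValueError(f"cannot go from {current.value} to {target.value}")
    return target


state = State.FIRING
state = transition(state, State.ACKNOWLEDGED)  # someone takes ownership
state = transition(state, State.RESOLVED)      # the issue is fixed
print(state.value)
```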

If no one responds within the defined time window, the escalation rules automatically take effect and notify the next person in the chain - continuing until the alert is acknowledged.

This way, you not only keep a clear overview of ongoing incidents, but also ensure that critical alerts are always handled - without manual follow-ups or uncertainty within the team.

Summing up

Whether you’re a small team with a rotating on-call schedule or a larger setup with clearly distributed responsibilities, Grafana OnCall can help organize ownership in a transparent and reliable way.

For smaller teams, OnCall offers an easy way to model schedules and escalations directly within the familiar Grafana environment. In larger or distributed teams, label-based routing and escalation chains make it possible to handle more complex responsibility models without losing clarity.

Grafana OnCall extends classic alert notifications with exactly what matters in real incidents: clarity about who takes over - and confidence that nothing falls through the cracks.