* docs: edits for alerting learning content (#105500)
* docs: edits for alerting learning content
edits for alerting learning content
* vale'd
* left nav change
* final adjustments
link fixes and the like
* Update _index.md
(cherry picked from commit bf87c6f774)
description: This section provides a set of guides for useful alerting practices and recommendations
keywords:
- grafana
labels:
products:
- cloud
- enterprise
- oss
menuTitle: Best Practices
title: Grafana Alerting best practices
weight: 170
---
# Grafana Alerting best practices
This section provides a set of guides and examples of best practices for Grafana Alerting. Here you can learn how to handle common alert management problems and see examples of more advanced usage of Grafana Alerting.
Connectivity issues are a common cause of misleading alerts or unnoticed failures.
There could be a number of reasons for these errors. Maybe your target went offline, or Prometheus couldn't scrape it. Or maybe your alert query failed because its target timed out or the network went down. These situations might look similar, but require different considerations in your alerting setup.
This guide walks through how to detect and handle these types of failures, whether you're writing alert rules in Prometheus, using Grafana Alerting, or combining both. It covers both availability monitoring and alert query failures, and outlines strategies to improve the reliability of your alerts.
## Understand connectivity issues in alerts
Typically, connectivity issues fall into a few common scenarios:
Keep in mind that most alert rules don’t hit the target directly. They query metrics that Prometheus has already scraped and stored.
In this second setup, you can run into connectivity issues on either side. If Prometheus fails to scrape the target, your alert rule might not fire, even though something is likely wrong.
## Detect target availability with the Prometheus `up` metric
Prometheus scrapes metrics from its targets regularly, following the [`scrape_interval`](https://prometheus.io/docs/prometheus/latest/configuration/configuration/#scrape_config) period. The default scrape interval is 60 seconds, which is common practice.
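For reference, the interval is set in the Prometheus configuration. A minimal sketch, with placeholder job and target names:

```yaml
# prometheus.yml (sketch): 60s is also the default when scrape_interval is omitted
global:
  scrape_interval: 60s

scrape_configs:
  - job_name: my-app                          # placeholder job name
    static_configs:
      - targets: ['app.example.com:9100']     # placeholder target address
```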
A typical PromQL expression for an alert rule to detect when a target becomes unreachable is:
`up == 0`
But this alert rule might result in noisy alerts, because a single scrape failure will fire the alert. To reduce noise, you should add a delay:
`up == 0 for: 5m`
The `for` option in Prometheus (or [pending period](ref:pending-period) in Grafana) delays the alert until the condition has been true for the full duration.
In this example, waiting for 5 minutes means a single scrape error won't fire the alert. Since Prometheus scrapes metrics every minute by default, the alert only fires after five consecutive failures.
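Written as a Prometheus alerting rule, a minimal sketch might look like the following. The group and alert names are illustrative:

```yaml
groups:
  - name: availability            # illustrative group name
    rules:
      - alert: TargetDown         # illustrative alert name
        expr: up == 0             # one alert per instance whose scrape fails
        for: 5m                   # with a 1m scrape interval, roughly five consecutive failed scrapes
        labels:
          severity: critical
```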
However, this kind of `up` alert has a few potential pitfalls:
- **Failures can slip between scrape intervals**: An outage that starts and ends between two evaluations goes undetected. You could shorten the `for` duration, but then brief scrape failures might trigger false alarms.
- **Intermittent recoveries reset the `for` timer**: A single successful scrape resets the alert timer, which masks intermittent outages.
Brief connectivity drops are common in real-world environments, so expect some flakiness in `up` alerts. For example:
| Time and scrape result | Alert evaluation |
| --- | --- |
| 05:00 `up == 0` | No alert yet; timer hasn’t reached the `for` duration |
The longer the period, the more likely this is to happen.
A single recovery resets the alert, which is why `up == 0 for: 5m` can sometimes be unreliable. Even if the target is down most of the time, the alert never fires, leaving you unaware of a persistent issue.
### Use `avg_over_time` to smooth the signal
One way to work around these issues is to smooth the signal by averaging the `up` metric over a similar or longer period:
`avg_over_time(up[10m]) < 0.8`
This alert rule fires when the target is unreachable for more than 20% of the last 10 minutes, rather than looking for consecutive scrape failures. With a one-minute scrape interval, three or more failed scrapes within the last 10 minutes now trigger the alert.
Since this query uses a threshold and time window to control accuracy, you can now lower the `for` duration (or [pending period](ref:pending-period) in Grafana) to something shorter—`0m` or `1m`—so the alert fires faster.
This approach gives you more flexibility in detecting real crashes or network issues. As always, adjust the threshold and period based on your noise tolerance and how critical the target is.
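As a sketch, the smoothed version of the earlier rule could look like this, with an illustrative name and the thresholds from the example above:

```yaml
- alert: TargetMostlyDown                 # illustrative alert name
  expr: avg_over_time(up[10m]) < 0.8      # more than 20% failed scrapes over the last 10 minutes
  for: 1m                                 # short pending period; the window already smooths the signal
```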
### Use synthetic checks to monitor external availability
Prometheus often runs inside the same network as the target it monitors. That means Prometheus might be able to reach the target, but it doesn't ensure the target is reachable to users on the outside.
Firewalls, DNS misconfigurations, or other network issues might block public traffic while Prometheus continues to scrape the target successfully and report `up == 1`.
This is where synthetic monitoring helps. Tools like the [Blackbox Exporter](https://github.com/prometheus/blackbox_exporter) let you continuously verify whether a service is available and reachable from outside your network—not just internally.
As with the `up` metric, you might want to smooth this out using `avg_over_time()`:
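A sketch of the combined expression joins both smoothed signals with the `or` operator. The job label matchers are illustrative and depend on your scrape configuration:

```yaml
# Fires when either availability signal drops below 80% over the last 10 minutes.
expr: |
  avg_over_time(up{job="my-app"}[10m]) < 0.8
  or
  avg_over_time(probe_success{job="blackbox-http"}[10m]) < 0.8
```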
This alert fires when Prometheus couldn't scrape the target successfully for more than 20% of the past 10 minutes, or when the external probes have been failing more than 20% of the time. This smoothing technique can be applied to any binary availability signal.
## Manage offline hosts
In many setups, Prometheus scrapes multiple hosts under the same target, such as a fleet of servers or containers behind a common job label. It’s common for one host to go offline while the others continue to report metrics normally.
If your alert only checks the general `up` metric without breaking it down by labels (like `instance`, `host`, or `pod`), you might miss when a host stops reporting. For example, an alert that looks only at the aggregated status of all instances will likely fail to catch when individual instances go missing.
For these cases, see the complementary [guide on handling missing data](ref:missing-data-guide) — it covers common scenarios where the alert queries return no data at all, or where only some targets stop reporting. These aren't full availability failures or execution errors, but they can still lead to blind spots in alert detection.
## Handle query errors in Grafana Alerting
Not all connectivity issues come from targets going offline. Sometimes, the alert rule fails when querying its target. These aren’t availability problems—they’re query execution errors: maybe the data source timed out, the network dropped, or the query was invalid.
These errors lead to broken alerts. But they come from a different part of the stack: between the alert rule and the data source, not between the data source (for example, Prometheus) and its target.
This difference matters. Availability issues are typically handled using metrics like `up` or `probe_success`, but execution errors require a different setup.
Grafana Alerting has built-in handling for execution errors, regardless of the data source. That includes Prometheus, and others like Graphite, InfluxDB, PostgreSQL, etc. By default, Grafana Alerting automatically handles query errors so you don’t miss critical failures. When an alert rule fails to execute, Grafana fires a special `DatasourceError` alert.
You can configure this behavior depending on how critical the alert is and on whether you already have other alerts detecting the issue. In [**Configure no data and error handling**](ref:configure-nodata-and-error-handling), click **Alert state if execution error or timeout**, and choose the desired option for the alert:
- **Error (default)**: Triggers a separate `DatasourceError` alert. This default ensures alert rules always inform about query errors but can create noise.
- **Alerting**: Treats the error as if the alert condition is firing. Grafana transitions all existing instances for that rule to the `Alerting` state.
This applies even when alert rules query Prometheus itself—not just external data sources.
### Design alerts for connectivity errors
In practice, start by deciding if you want to create explicit alert rules (for example, using `up` or `probe_success`) to detect when a target is down or has connectivity issues.
Then, for each alert rule, choose the error-handling behavior based on whether you already have dedicated connectivity alerts, the stability of the target, and how critical the alert is. Prioritize alerts based on symptom severity rather than just infrastructure signals that might not impact users.
### Reduce redundant error notifications
A single data source error can lead to multiple alerts firing simultaneously, sometimes bombarding you with many alerts and generating too much noise.
Consider not treating these alerts in the same way as the original alerts, and instead group and route them separately.
For details on how to configure grouping and routing, refer to the [handling notifications](ref:notifications) and [`No Data` and `Error` alerts](ref:no-data-and-error-alerts) documentation.
## Conclusion
Connectivity issues are a common cause of noisy or misleading alerts. This guide covered two distinct types: availability issues, where a target goes down or becomes unreachable, and query execution errors, where the alert rule fails to query its data source.
Missing data, when a target stops reporting metric data, is one of the most common issues when troubleshooting alerts. In cloud-native environments, this happens all the time. Pods or nodes scale down to match demand, or an entire job quietly disappears.
When this happens, alerts won’t fire, and you might not notice the system has stopped reporting.
Sometimes it's just a lack of data from a few instances. Other times, it's a connectivity issue where the entire target is unreachable.
This guide covers different scenarios where the underlying data is missing and shows how to design your alerts to act on those cases. If you're troubleshooting an unreachable host or a network failure, see the [Handle connectivity errors documentation](ref:connectivity-errors-guide) as well.
## No Data vs. Missing Series
For example, imagine a recorded metric, `http_request_latency_seconds`, that reports request latency per region.
In both _No Data_ and _Missing Series_ cases, the query still technically "works", but the alert won’t fire unless you explicitly configure it to handle these situations.
The following tables illustrate both scenarios using the previous example, with an alert that triggers if the latency exceeds 2 seconds in any region: `avg_over_time(http_request_latency_seconds[5m]) > 2`.
**No Data Scenario:** The query returns no data for any series:
In both cases, something broke silently.
## Detect missing data in Prometheus
Prometheus doesn't fire alerts when the query returns no data. It simply assumes there was nothing to report, like with query errors. Missing data won’t trigger existing alerts unless you explicitly check for it.
In Prometheus, a common way to catch missing data is to use the `absent_over_time` function.
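A minimal sketch, reusing the metric from the earlier latency example:

```yaml
# Returns a result (and can fire) only when no series for the metric
# has reported samples during the last 5 minutes.
expr: absent_over_time(http_request_latency_seconds[5m])
```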
This triggers when all series for `http_request_latency_seconds` are absent for 5 minutes — catching the _No Data_ case when the entire metric disappears.
However, `absent_over_time()` can’t detect which specific series are missing since it doesn’t preserve labels. The alert won’t tell you which series stopped reporting, only that the query returns no data.
If you want to check for missing data per-region or label, you can specify the label in the alert query as follows:
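A sketch with an illustrative `region` value; one such query is needed per label value you want to track:

```yaml
expr: absent_over_time(http_request_latency_seconds{region="us-central"}[5m])   # "us-central" is illustrative
```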
But this doesn't scale well. Hard-coded queries for each label set are unreliable, especially in dynamic cloud environments where instances can appear or disappear at any time.
## Manage No Data issues in Grafana alerts
While Prometheus provides functions like `absent_over_time()` to detect missing data, not all data sources available to Grafana alerts (such as Graphite, InfluxDB, and PostgreSQL) support a similar function.
To handle this, Grafana Alerting implements a built-in `No Data` state logic, so you don’t need to detect missing data with `absent_*` queries. Instead, you can configure in the alert rule settings how alerts behave when no data is returned.
Similar to error handling, Grafana triggers a special _No data_ alert by default and lets you control this behavior. In [**Configure no data and error handling**](ref:configure-nodata-and-error-handling), click **Alert state if no data or all values are null**, and choose one of the following options:
- **No Data (default):** Triggers a new `DatasourceNoData` alert, treating _No data_ as a specific problem.
- **Alerting:** Transitions each existing alert instance to the `Alerting` state when data disappears.
{{< figure src="/media/docs/alerting/alert-rule-configure-no-data.png" alt="A screenshot of the `Configure no data handling` option in Grafana Alerting." max-width="500px" >}}
### Manage DatasourceNoData notifications
When Grafana triggers a [NoData alert](ref:no-data-and-error-alerts), it creates a distinct alert instance, separate from the original alert instance. These alerts behave differently:
Because of this, `DatasourceNoData` alerts might require a dedicated setup to handle their notifications. For general recommendations, see [Reduce redundant DatasourceError alerts](ref:connectivity-errors-reduce-alert-fatigue) — similar practices can apply to _NoData_ alerts.
## Evict alert instances for missing series
_MissingSeries_ occurs when only some series disappear but not all. This case is subtle, but important.
By default, Grafana marks missing series as [**stale**](ref:stale-alert-instances) after two consecutive evaluation intervals without data and triggers the alert instance eviction process. Here’s what happens under the hood:
- Alert instances with missing data keep their last state for a number of consecutive evaluation intervals, defined by the **Missing series evaluations to resolve** option (default: 2).
- If still missing after that:
  - Grafana adds the annotation `grafana_state_reason: MissingSeries`.
  - The alert instance transitions to the `Normal` state.
  - A **resolved notification** is sent if the alert was previously firing.
  - The **alert instance is removed** from the Grafana UI.
If an alert instance becomes stale, you’ll find it in the [alert history](ref:alert-history) as `Normal (Missing Series)` before it disappears. This table shows the eviction process from the previous example:
### Why doesn’t MissingSeries match No Data behavior?
In dynamic environments, such as autoscaling groups, ephemeral pods, or spot instances, series naturally come and go. **MissingSeries** normally signals infrastructure or deployment changes.
By default, **No Data** triggers an alert to indicate a potential problem.
The eviction process for **MissingSeries** is designed to prevent alert flapping.
In environments with frequent scale events, prioritize symptom-based alerts over individual infrastructure signals and use aggregate alerts unless you explicitly need to track individual instances.
### Handle MissingSeries notifications
A stale alert instance triggers a **resolved notification** if it transitions from a firing state (such as `Alerting`, `No Data`, or `Error`) to `Normal`.
Review these notifications to confirm whether something broke or if the alert was expected to resolve. To reduce noise from these notifications:
- Silence or mute alerts during planned maintenance or rollouts.
- Adjust alert rules to avoid triggering on series you expect to come and go, and use aggregated alerts instead.
## Conclusion
Missing data isn’t always a failure. It’s a common scenario in dynamic environments when certain targets stop reporting.
Grafana Alerting handles distinct scenarios automatically. Here’s how to think about them:
- Understand `DatasourceNoData` and `MissingSeries` notifications, since they don’t behave like regular alerts.
- Use `absent()` or `absent_over_time()` in Prometheus for fine-grained detection when a metric or label disappears entirely.
- Don’t alert on every instance by default. In dynamic environments, it’s better to aggregate and alert on symptoms — unless a missing individual instance directly impacts users.
- If you’re getting too much noise from disappearing data, consider adjusting alerts, using `Keep Last State` and the `Missing series evaluations to resolve` option, or routing those alerts differently.
- For connectivity issues involving alert query failures, see the sibling guide: [Handling connectivity errors in Grafana Alerting](ref:connectivity-errors-guide).
description: This section provides practical examples of alert rules for common monitoring scenarios.
keywords:
- grafana
labels:
products:
- cloud
- enterprise
- oss
menuTitle: Examples
title: Grafana Alerting Examples
weight: 1100
---
# Grafana Alerting Examples
This section provides practical examples of alert rules for common monitoring scenarios. Each example focuses on a specific use case, showing how to structure queries, evaluate conditions, and understand how Grafana generates alert instances.