Alerting docs: Update `State and health of alerts` docs (#87846)

* Update `View alert state and history`

* first edit `State and Healths of alerts`

* Complete `State and health of alerts`

* Update docs/sources/alerting/fundamentals/alert-rule-evaluation/_index.md

Co-authored-by: brendamuir <100768211+brendamuir@users.noreply.github.com>

---------

Co-authored-by: brendamuir <100768211+brendamuir@users.noreply.github.com>
Pepe Cano 1 year ago committed by GitHub
parent 28bf5a4577
commit 37d1d1c0a0
1. docs/sources/alerting/fundamentals/alert-rule-evaluation/_index.md (26 changes)
2. docs/sources/alerting/fundamentals/alert-rule-evaluation/state-and-health.md (79 changes)
3. docs/sources/alerting/manage-notifications/view-state-health.md (92 changes)

@@ -49,9 +49,9 @@ Keep in mind:
- One alert rule can generate multiple alert instances - one for each time series produced by the alert rule's query.
- Alert instances from the same alert rule may be in different states. For instance, only one observed machine might start firing.
- Only firing and resolved alert instances are routed to manage their notifications.
- Only **Alerting** and **Resolved** alert instances are routed to manage their notifications.
{{< figure src="/media/docs/alerting/alert-rule-evaluation-overview-statediagram.png" max-width="750px" >}}
{{< figure src="/media/docs/alerting/alert-rule-evaluation-overview-statediagram-v2.png" max-width="750px" >}}
<!--
Remove ///
@@ -62,23 +62,23 @@ stateDiagram-v2
Route "Resolved" alert instances
for notifications
end note
Pending --///> Firing
Firing --///> Normal: Resolved
note right of Firing
Route "Firing" alert instances
Pending --///> Alerting
Alerting --///> Normal: Resolved
note right of Alerting
Route "Alerting" alert instances
for notifications
end note
-->
Consider an alert rule with an **evaluation interval** set at every 30 seconds and a **pending period** of 90 seconds. The evaluation occurs as follows:
| Time | Condition | Alert instance state | Pending counter |
| ------------------------- | --------- | -------------------- | --------------- |
| 00:30 (first evaluation) | Not met | Normal | - |
| 01:00 (second evaluation) | Breached | Pending | 0s |
| 01:30 (third evaluation) | Breached | Pending | 30s |
| 02:00 (fourth evaluation) | Breached | Pending | 60s |
| 02:30 (fifth evaluation) | Breached | Firing<sup>\*</sup> | 90s |
| Time | Condition | Alert instance state | Pending counter |
| ------------------------- | --------- | --------------------- | --------------- |
| 00:30 (first evaluation) | Not met | Normal | - |
| 01:00 (second evaluation) | Breached | Pending | 0s |
| 01:30 (third evaluation) | Breached | Pending | 30s |
| 02:00 (fourth evaluation) | Breached | Pending | 60s |
| 02:30 (fifth evaluation) | Breached | Alerting<sup>\*</sup> | 90s |
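The same walk-through can be expressed as a tiny, illustrative simulation. This is not Grafana's implementation; it only reproduces the arithmetic in the table above (a 30-second evaluation interval, a 90-second pending period, and a pending counter that tracks the time elapsed since the condition was first breached):

```go
package main

import "fmt"

// Illustrative only: a toy walk-through of the table above, not Grafana's code.
func main() {
	const (
		evalInterval  = 30 // seconds between evaluations
		pendingPeriod = 90 // seconds the condition must hold before firing
	)

	// Condition results for the evaluations at 00:30, 01:00, 01:30, 02:00, 02:30.
	breached := []bool{false, true, true, true, true}

	state := "Normal"
	firstBreach := -1 // time (s) of the first consecutive breach

	for i, b := range breached {
		t := (i + 1) * evalInterval // 30, 60, 90, ...
		switch {
		case !b:
			state, firstBreach = "Normal", -1
		case firstBreach < 0:
			state, firstBreach = "Pending", t
		case t-firstBreach >= pendingPeriod:
			state = "Alerting" // 90s after the first breach
		default:
			state = "Pending"
		}
		pendingFor := "-"
		if firstBreach >= 0 {
			pendingFor = fmt.Sprintf("%ds", t-firstBreach)
		}
		fmt.Printf("t=%02d:%02d breached=%-5v state=%-8s pending counter=%s\n",
			t/60, t%60, b, state, pendingFor)
	}
}
```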
An alert instance is resolved when it transitions from the `Firing` to the `Normal` state. For instance, in the previous example:

@@ -16,47 +16,62 @@ labels:
- cloud
- enterprise
- oss
title: State and health of alert rules
title: State and health of alerts
weight: 109
---
# State and health of alert rules
# State and health of alerts
The state and health of alert rules help you understand several key status indicators about your alerts.
There are three key components that help you understand how your alerts behave during their evaluation: [alert instance state](#alert-instance-state), [alert rule state](#alert-rule-state), and [alert rule health](#alert-rule-health). Although related, each component conveys subtly different information.
There are three key components: [alert rule state](#alert-rule-state), [alert instance state](#alert-instance-state), and [alert rule health](#alert-rule-health). Although related, each component conveys subtly different information.
## Alert instance state
## Alert rule state
An alert instance can be in either of the following states:
An alert rule can be in either of the following states:
| State | Description |
| ------------ | ------------------------------------------------------------------------------------------- |
| **Normal** | The state of an alert when the condition (threshold) is not met. |
| **Pending** | The state of an alert that has breached the threshold but for less than the pending period. |
| **Alerting** | The state of an alert that has breached the threshold for longer than the pending period. |
| **NoData** | The state of an alert whose query returns no data or all values are null. |
| **Error** | The state of an alert when an error or timeout occurred evaluating the alert rule. |
| State | Description |
| ----------- | -------------------------------------------------------------------------------------------------- |
| **Normal** | None of the alert instances returned by the evaluation engine is in a `Pending` or `Firing` state. |
| **Pending** | At least one alert instance returned by the evaluation engine is `Pending`. |
| **Firing** | At least one alert instance returned by the evaluation engine is `Firing`. |
{{< figure src="/media/docs/alerting/alert-instance-states-v3.png" caption="Alert instance state diagram" alt="Alert instance state diagram" max-width="750px" >}}
The alert rule state is determined by the “worst case” state of the alert instances produced. For example, if one alert instance is firing, the alert rule state is also firing.
### Notifications
{{% admonition type="note" %}}
Alerts transition first to `pending` and then to `firing`, so it takes at least two evaluation cycles before an alert is fired.
{{% /admonition %}}
Alert instances are routed for [notifications][notifications] when they are in the `Alerting` state or have been `Resolved`, transitioning from the `Alerting` to the `Normal` state.
## Alert instance state
{{< figure src="/media/docs/alerting/alert-rule-evaluation-overview-statediagram-v2.png" max-width="750px" >}}
An alert instance can be in either of the following states:
### Keep last state
In the alert rule settings, you can configure the rule to keep the last state of the alert instance when a `NoData` and/or `Error` state is encountered.
The "Keep Last State" option can prevent unintentional alerts from firing, and from resolving and re-firing. Just like normal evaluation, the alert instance transitions from `Pending` to `Alerting` after the pending period has elapsed.
{{< figure src="/media/docs/alerting/alert-rule-configure-no-data-and-error.png" max-width="500px" >}}
| State | Description |
| ------------ | --------------------------------------------------------------------------------------------- |
| **Normal** | The state of an alert that is neither firing nor pending; everything is working correctly. |
| **Pending** | The state of an alert that has been active for less than the configured threshold duration. |
| **Alerting** | The state of an alert that has been active for longer than the configured threshold duration. |
| **NoData** | No data has been received for the configured time window. |
| **Error** | The error that occurred when attempting to evaluate an alert rule. |
### Labels for `NoData` and `Error`
## Keep last state
When an alert instance is in the `NoData` or `Error` state, Grafana Alerting includes the following additional labels:
An alert rule can be configured to keep the last state when a `NoData` and/or `Error` state is encountered. This both prevents alerts from firing, and from resolving and re-firing. Just like normal evaluation, the alert rule transitions from `Pending` to `Firing` after the pending period has elapsed.
- `alertname`: Either `DatasourceNoData` or `DatasourceError` depending on the state.
- `datasource_uid`: The UID of the data source that caused the state.
You can manage these alerts like regular ones by using their labels to apply actions such as adding a silence, routing via notification policies, and more.
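As an illustration only (the UID value is made up, and the equality check is a toy matcher, not Grafana's matcher syntax), this is roughly the label set a `DatasourceError` instance carries and how a silence or notification policy conceptually matches on it:

```go
package main

import "fmt"

// Illustrative only: the documented labels a DatasourceError instance carries,
// and a toy equality matcher of the kind a silence or policy would apply.
func main() {
	labels := map[string]string{
		"alertname":      "DatasourceError",
		"datasource_uid": "PD8C576611E62080A", // hypothetical data source UID
	}

	// A simple equality matcher: alertname = DatasourceError.
	matches := labels["alertname"] == "DatasourceError"
	fmt.Printf("labels=%v matches alertname=DatasourceError: %v\n", labels, matches)
}
```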
## Alert rule state
The alert rule state is determined by the “worst case” state of the alert instances produced. For example, if one alert instance is `Alerting`, the alert rule state is firing.
An alert rule can be in either of the following states:
| State | Description |
| ----------- | ---------------------------------------------------------------------------------------------------- |
| **Normal** | None of the alert instances returned by the evaluation engine is in a `Pending` or `Alerting` state. |
| **Pending** | At least one alert instance returned by the evaluation engine is `Pending`. |
| **Firing** | At least one alert instance returned by the evaluation engine is `Alerting`. |
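A minimal sketch of the "worst case" aggregation described above, assuming the ordering Normal < Pending < Alerting (illustrative only, not Grafana's actual code):

```go
package main

import "fmt"

// Illustrative only: derive the alert rule state from its instance states,
// using the "worst case" ordering Normal < Pending < Alerting.
var severity = map[string]int{"Normal": 0, "Pending": 1, "Alerting": 2}

// Rule-level names for the aggregated result.
var ruleState = [...]string{"Normal", "Pending", "Firing"}

func aggregate(instances []string) string {
	worst := 0
	for _, s := range instances {
		if severity[s] > worst {
			worst = severity[s]
		}
	}
	return ruleState[worst]
}

func main() {
	fmt.Println(aggregate([]string{"Normal", "Normal"}))              // Normal
	fmt.Println(aggregate([]string{"Normal", "Pending"}))             // Pending
	fmt.Println(aggregate([]string{"Pending", "Alerting", "Normal"})) // Firing
}
```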
## Alert rule health
@@ -69,13 +84,9 @@ An alert rule can have one of the following health statuses:
| **NoData** | The absence of data in at least one time series returned during a rule evaluation. |
| **{status}, KeepLast** | The rule would have received another status but was configured to keep the last state of the alert rule. |
## Special alerts for `NoData` and `Error`
When evaluation of an alert rule produces state `NoData` or `Error`, Grafana Alerting generates alert instances that have the following additional labels:
{{% docs/reference %}}
| Label | Description |
| ------------------ | ---------------------------------------------------------------------- |
| **alertname** | Either `DatasourceNoData` or `DatasourceError` depending on the state. |
| **datasource_uid** | The UID of the data source that caused the state. |
[notifications]: "/docs/grafana/ -> /docs/grafana/<GRAFANA_VERSION>/alerting/fundamentals/notifications"
[notifications]: "/docs/grafana-cloud/ -> /docs/grafana-cloud/alerting-and-irm/alerting/fundamentals/notifications"
You can handle these alerts the same way as regular alerts by adding a silence, routing to a contact point, and so on.
{{% /docs/reference %}}

@@ -15,15 +15,17 @@ labels:
- cloud
- enterprise
- oss
title: View the state and health of alert rules
title: View alert state and history
weight: 420
---
# View the state and health of alert rules
# View alert state and history
The state and health of alert rules helps you understand several key status indicators about your alerts.
An alert rule and its corresponding alert instances can transition through distinct states during their [evaluation][alert-rule-evaluation]. There are three key components that help you understand the behavior of your alerts:
There are three key components: [alert rule state](#alert-rule-state), [alert instance state](#alert-instance-state), and [alert rule health](#alert-rule-health). Although related, each component conveys subtly different information.
- [Alert Instance State][alert-instance-state]: Refers to the state of the individual alert instances.
- [Alert Rule State][alert-rule-state]: Determined by the "worst state" among its alert instances.
- [Alert Rule Health][alert-rule-health]: Indicates the status in cases of `Error` or `NoData` events.
To view the state and health of your alert rules:
@@ -31,73 +33,15 @@ To view the state and health of your alert rules:
1. Click **Alert rules** to view the list of existing alerts.
1. Click an alert rule to view its state, health, and state history.
## Alert rule state
An alert rule can be in either of the following states:
| State | Description |
| ----------- | ---------------------------------------------------------------------------------------------- |
| **Normal** | None of the time series returned by the evaluation engine is in a `Pending` or `Firing` state. |
| **Pending** | At least one time series returned by the evaluation engine is `Pending`. |
| **Firing** | At least one time series returned by the evaluation engine is `Firing`. |
{{% admonition type="note" %}}
Alerts will transition first to `pending` and then `firing`, thus it will take at least two evaluation cycles before an alert is fired.
{{% /admonition %}}
## Alert instance state
An alert instance can be in either of the following states:
| State | Description |
| ------------ | --------------------------------------------------------------------------------------------- |
| **Normal** | The state of an alert that is neither firing nor pending; everything is working correctly. |
| **Pending** | The state of an alert that has been active for less than the configured threshold duration. |
| **Alerting** | The state of an alert that has been active for longer than the configured threshold duration. |
| **NoData** | No data has been received for the configured time window. |
| **Error** | The error that occurred when attempting to evaluate an alerting rule. |
## Keep last state
An alert rule can be configured to keep the last state when a `NoData` and/or `Error` state is encountered. This will both prevent alerts from firing, and from resolving and re-firing. Just like normal evaluation, the alert rule will transition from `Pending` to `Firing` after the pending period has elapsed.
## Alert rule health
An alert rule can have one of the following health statuses:
| State | Description |
| ---------------------- | -------------------------------------------------------------------------------------------------------- |
| **Ok** | No error when evaluating an alerting rule. |
| **Error** | An error occurred when evaluating an alerting rule. |
| **NoData** | The absence of data in at least one time series returned during a rule evaluation. |
| **{status}, KeepLast** | The rule would have received another status but was configured to keep the last state of the alert rule. |
## Special alerts for `NoData` and `Error`
When evaluation of an alerting rule produces state `NoData` or `Error`, Grafana Alerting will generate alert instances that have the following additional labels:
| Label | Description |
| ------------------ | ---------------------------------------------------------------------- |
| **alertname** | Either `DatasourceNoData` or `DatasourceError` depending on the state. |
| **datasource_uid** | The UID of the data source that caused the state. |
{{% admonition type="note" %}}
To generate the additional labels, you need to set No Data and Error Handling to `No Data` or `Error` in the alert rule, as described in <https://grafana.com/docs/grafana/latest/alerting/alerting-rules/create-grafana-managed-rule/#configure-no-data-and-error-handling>.
{{% /admonition %}}
You can handle these alerts the same way as regular alerts by adding a silence, routing to a contact point, and so on.
## State history view
## View state history
Use the State history view to get insight into how your alert instances behave over time. It shows when a state change occurred, what the previous and current states are, which other alert instances changed their state at the same time, and the query value that triggered the change.
{{% admonition type="note" %}}
Open source users must configure alert state history in order to be able to access the view.
Open source users must [configure alert state history](/docs/grafana/latest/alerting/set-up/configure-alert-state-history/) in order to be able to access the view.
{{% /admonition %}}
### View state history
To use the State history view, complete the following steps.
To access the State history view, complete the following steps.
1. Navigate to **Alerts & IRM** -> **Alerting** -> **Alert rules**.
1. Click an alert rule.
@@ -118,3 +62,21 @@ To use the State history view, complete the following steps.
The value shown for each instance corresponds to each part of the expression that was evaluated.
1. Click the labels to filter and narrow down the results.
{{< figure src="/media/docs/alerting/state-history.png" max-width="750px" >}}
{{% docs/reference %}}
[alert-rule-evaluation]: "/docs/grafana/ -> /docs/grafana/<GRAFANA_VERSION>/alerting/fundamentals/alert-rule-evaluation"
[alert-rule-evaluation]: "/docs/grafana-cloud/ -> /docs/grafana-cloud/alerting-and-irm/alerting/fundamentals/alert-rule-evaluation"
[alert-rule-state]: "/docs/grafana/ -> /docs/grafana/<GRAFANA_VERSION>/alerting/fundamentals/alert-rule-evaluation/state-and-health#alert-rule-state"
[alert-rule-state]: "/docs/grafana-cloud/ -> /docs/grafana-cloud/alerting-and-irm/alerting/fundamentals/alert-rule-evaluation/state-and-health#alert-rule-state"
[alert-instance-state]: "/docs/grafana/ -> /docs/grafana/<GRAFANA_VERSION>/alerting/fundamentals/alert-rule-evaluation/state-and-health#alert-instance-state"
[alert-instance-state]: "/docs/grafana-cloud/ -> /docs/grafana-cloud/alerting-and-irm/alerting/fundamentals/alert-rule-evaluation/state-and-health#alert-instance-state"
[alert-rule-health]: "/docs/grafana/ -> /docs/grafana/<GRAFANA_VERSION>/alerting/fundamentals/alert-rule-evaluation/state-and-health#alert-rule-health"
[alert-rule-health]: "/docs/grafana-cloud/ -> /docs/grafana-cloud/alerting-and-irm/alerting/fundamentals/alert-rule-evaluation/state-and-health#alert-rule-health"
{{% /docs/reference %}}
