When Alertmanager alerts

When to alert

Many places in the Prometheus and Alertmanager configuration affect alert timing. So when, exactly, will my alerting rules fire?

First, there is the evaluation_interval attribute in Prometheus's global configuration, which controls how often alerting rules are evaluated. For example, if we configure 30s here, Prometheus checks every 30 seconds whether our alerting rules have crossed their thresholds.

    global:
      scrape_interval: 15s
      scrape_timeout: 10s
      evaluation_interval: 1m

Initially, an alert is in the inactive state. If an evaluation finds that the threshold has been crossed, the alert becomes pending; at this point nothing has been sent to Alertmanager yet. When the alert actually fires is governed by the for attribute in the alerting rule. For example, with for: 1m, the alert stays pending as long as subsequent evaluations within that minute keep crossing the threshold; once the condition has held for a full minute, the alert becomes firing, it is sent to Alertmanager, and everything after that is handled by Alertmanager.
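As a minimal sketch of this state machine (the metric and threshold here are illustrative, not from the rules later in this post), a rule with for: 1m only fires after its expression has been true continuously for one minute:

    groups:
      - name: example
        rules:
          - alert: ExampleHighLoad       # hypothetical rule name
            # inactive  while the expression is false
            # pending   from the first evaluation where it is true
            # firing    once it has been true for the full "for"
            #           duration, at which point it is sent to
            #           Alertmanager
            expr: node_load1 > 4         # illustrative threshold
            for: 1m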

Therefore, in some scenarios a metric on our monitoring charts can reach the alert threshold without the alerting rule ever firing. With the rule above, which sets a pending duration of 1 minute, a spike that stays above the threshold for less than a minute never leaves the pending state, so no alert is triggered.

When an alert notification is actually sent depends on how our alert routing rules are configured. The core properties are the following:

    group_by: [instance]   # group alerts by the instance label
    group_wait: 30s        # how long a new group waits before its first notification, so alerts arriving within 30s are batched together
    group_interval: 5m     # how long to wait before notifying about new alerts added to an existing group
    repeat_interval: 1h    # how long to wait before re-sending a notification if the alert has not been resolved

Alertmanager groups the alerts it receives according to group_by. When an alert arrives and no matching group exists, a new group is created. The group then waits for group_wait before its first notification is sent; nothing goes out immediately, which lets related alerts accumulate and prevents a flood of individual notifications from turning into an alert storm.

If an alert joins an existing group whose group_wait has already elapsed and whose first notification has already been sent, Alertmanager waits group_interval before sending an updated notification for the group.

Therefore, once Prometheus has handed an alert to Alertmanager, the notification can be delayed by at most max(group_wait, group_interval).

If the same alert keeps firing, Alertmanager does not notify continuously; it waits for repeat_interval before sending the same notification again. Naturally, different kinds of alerts can be given different repeat frequencies.
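To make the interplay of these three timers concrete, here is an illustrative timeline for the settings above (group_wait: 30s, group_interval: 5m, repeat_interval: 1h); the alerts and timestamps are hypothetical:

    # t=0      alert A fires; no matching group exists, so one is created
    # t=30s    group_wait elapses; first notification (A) is sent
    # t=2m     alert B joins the same group; nothing is sent yet
    # t=5m30s  group_interval elapses; updated notification (A, B) is sent
    # t≈1h5m   nothing has changed and A, B are still firing; once
    #          repeat_interval has elapsed since the last notification,
    #          it is sent again (aligned to a group_interval tick)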

For example, let's add two alerting rules:

    groups:
      - name: test-node-mem
        rules: # the alerting rules in this group
          - alert: NodeMemoryUsage # name of the alerting rule
            expr: (node_memory_MemTotal_bytes - (node_memory_MemFree_bytes + node_memory_Buffers_bytes + node_memory_Cached_bytes)) / node_memory_MemTotal_bytes * 100 > 30
            for: 1m
            labels:
              team: node
            annotations:
              summary: "{{$labels.instance}}: High Memory usage detected"
              description: "{{$labels.instance}}: Memory usage is above 30% (current value is: {{ $value }})"
      - name: test-node-cpu
        rules:
          - alert: NodeCpuUsage
            expr: ((1 - sum(increase(node_cpu_seconds_total{mode="idle"}[1m])) by (instance) / sum(increase(node_cpu_seconds_total[1m])) by (instance) ) * 100) > 5
            for: 2m
            labels:
              team: node
            annotations:
              summary: "{{$labels.instance}}: High CPU usage detected"
              description: "{{$labels.instance}}: CPU usage is above 5% (current value is: {{ $value }})"
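For Prometheus to evaluate these rules, the file containing them must be referenced from the main configuration; a minimal sketch, assuming the rules above are saved as alert-rules.yml (the filename is hypothetical):

    rule_files:
      - "alert-rules.yml"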

When configuring the route in Alertmanager, we group alerts by the instance label:

    routes:
      - receiver: email
        group_wait: 10s
        group_by: ["instance"]
        match:
          team: node

Alerts in the same group are notified together. If a new alert appears in an existing group, Alertmanager waits group_interval before sending again. If nothing changes and the alerts are still firing, the next notification goes out only after repeat_interval.
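For completeness, the routes block above lives under the top-level route in alertmanager.yml, and the email receiver it references must be defined; a minimal sketch with placeholder SMTP and address values:

    global:
      smtp_smarthost: "smtp.example.com:587"  # placeholder SMTP server
      smtp_from: "alertmanager@example.com"   # placeholder sender
    route:
      receiver: email        # default receiver for unmatched alerts
      routes:
        - receiver: email
          group_wait: 10s
          group_by: ["instance"]
          match:
            team: node
    receivers:
      - name: email
        email_configs:
          - to: "oncall@example.com"          # placeholder recipient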
