RCFMARTIN

This guide assumes a working Zabbix server with at least a handful of hosts and triggers (the architecture post sketches the bigger picture).

A Zabbix that pages on every blip stops being a monitoring tool and becomes background noise. Within a quarter, every operator has built a mental filter and the one real outage that should have woken them up at 3 AM gets filtered along with the 200 false positives. Quieting Zabbix isn't optional; it's the difference between a tool people trust and a tool they ignore.

This post covers the four mechanisms that, used together, turn signal-to-noise around.

1. Trigger Dependencies

When the upstream switch is down, every host behind it is "unreachable". Without dependencies, you get one trigger per downstream host plus the switch, which can mean hundreds of pages from one outage.

In Data Collection -> Hosts -> {host} -> Triggers -> {trigger} -> Dependencies:

Add the upstream trigger that this one depends on.

Trigger:    "Web01: Service down"
Depends on: "Switch01: Interface eth0/1 down"
            "Firewall01: ICMP loss > 50%"

When any dependency is in PROBLEM state, the dependent trigger does not fire. The check still runs and the value still flows in; you just don't get woken up. When the upstream recovers, normal alerting resumes.

Build dependencies up your topology, not down. Hosts depend on switches; switches on the firewall; the firewall on the ISP link. Map this once and import it via the API; doing it by hand on a 500-host fleet is unmaintainable.
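As a rough sketch of the "map once, import via the API" step, the snippet below turns a topology map into trigger.update request bodies. The trigger IDs and the UPSTREAM map are made-up placeholders (real IDs would come from trigger.get), and sending the requests, including authentication, is left out.

```python
# Hypothetical topology map: trigger ID -> IDs of the triggers directly upstream of it.
# All IDs are placeholders for illustration; look up real ones with trigger.get.
UPSTREAM = {
    "1001": ["2001"],  # Web01: Service down      -> Switch01: Interface eth0/1 down
    "2001": ["3001"],  # Switch01 interface down  -> Firewall01: ICMP loss > 50%
    "3001": [],        # the firewall trigger has nothing upstream in this map
}

def dependency_requests(upstream):
    """Build one trigger.update JSON-RPC body per trigger that has upstream triggers."""
    requests = []
    for trigger_id, parents in sorted(upstream.items()):
        if not parents:
            continue  # nothing to depend on, so no request is needed
        requests.append({
            "jsonrpc": "2.0",
            "method": "trigger.update",
            "params": {
                "triggerid": trigger_id,
                # each dependency is an object carrying only the upstream triggerid
                "dependencies": [{"triggerid": parent} for parent in parents],
            },
            "id": len(requests) + 1,
        })
    return requests

requests = dependency_requests(UPSTREAM)
```

Keeping the map in version control means the dependency tree can be re-applied after a rebuild instead of being re-clicked.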

2. Maintenance Windows

Two flavors:

No data collection: Zabbix stops polling. Use during invasive maintenance.
With data collection: Zabbix keeps collecting, but suppresses notifications. Use for almost everything else.

Data Collection -> Maintenance -> Create maintenance period:

Name: Friday deploy window
Maintenance type: With data collection
Active since/Active till: the window
Periods: one-off or recurring (Every Friday 02:00–03:00)
Hosts and host groups: scope it down; never blanket-suppress the whole environment

"With data collection" is the default for a reason. You almost always want graphs to keep updating during a deploy so you can see the effect; you just don't want a page for every brief downtime.
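The recurring "Every Friday 02:00–03:00" period can also be written as an API time period object. The sketch below builds it as a plain dictionary. The helper names are my own, and the field values follow my reading of the maintenance time period object (type 3 for weekly, days of the week as a bitmask starting at Monday), so check them against the API reference for your version.

```python
FRIDAY = 1 << 4  # days are a bitmask: Monday is bit 0, so Friday is bit 4 (value 16)

def weekly_period(day_bits, start_hour, duration_minutes):
    """Time period object for a window that repeats every week."""
    return {
        "timeperiod_type": 3,               # 3 = weekly
        "every": 1,                         # repeat every 1 week
        "dayofweek": day_bits,              # which weekdays, as a bitmask
        "start_time": start_hour * 3600,    # seconds after midnight
        "period": duration_minutes * 60,    # window length in seconds
    }

def maintenance_params(name, active_since, active_till, group_ids, period):
    """Parameters for maintenance.create, scoped to specific host groups."""
    return {
        "name": name,
        "active_since": active_since,       # Unix timestamps bounding the whole schedule
        "active_till": active_till,
        "maintenance_type": 0,              # 0 = with data collection
        "groups": [{"groupid": gid} for gid in group_ids],
        "timeperiods": [period],
    }

friday_window = weekly_period(FRIDAY, start_hour=2, duration_minutes=60)
```

Passing group IDs rather than "everything" keeps the scoping rule from the list above enforced in code.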

Maintenance via the API

For deploy-time suppression, scripts beat the UI. Schedule maintenance from your CD pipeline:

# Reuse the API wrapper from the proxy load-balancing post
$start = [DateTimeOffset]::UtcNow.ToUnixTimeSeconds()
$stop  = $start + 1800             # 30 minutes

Invoke-Zabbix -Session $s -Method 'maintenance.create' -Params @{
    name              = "deploy-$(Get-Date -Format 'yyyyMMddHHmmss')"
    active_since      = $start
    active_till       = $stop
    maintenance_type  = 0          # with data collection
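    hosts             = @( @{ hostid = $targetHostId } )   # host objects; older API releases took hostids instead
    timeperiods       = @( @{ timeperiod_type = 0; start_date = $start; period = 1800 } )   # one-time period starting now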
}

Pair with a teardown call in your deploy's success/failure hooks. The maintenance window auto-expires anyway, but a clean teardown means alerts come back the moment the deploy finishes.
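One way to make sure the teardown runs on both success and failure is to wrap the deploy in a context manager. This is a sketch only: `call` is a hypothetical helper that sends one JSON-RPC method with its parameters and returns the `result` field, standing in for whatever API wrapper you already use.

```python
from contextlib import contextmanager

@contextmanager
def maintenance_window(call, params):
    """Create a maintenance period, yield its ID, and delete it when the block exits.

    `call(method, params)` is a hypothetical API helper, not part of any library.
    """
    result = call("maintenance.create", params)
    maintenance_id = result["maintenanceids"][0]
    try:
        yield maintenance_id
    finally:
        # runs whether the deploy succeeded or raised, so alerting resumes right away
        call("maintenance.delete", [maintenance_id])
```

Usage would look like `with maintenance_window(call, params): run_deploy()`, where `run_deploy` is whatever your pipeline step does.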

3. Hysteresis: Stop the Flap

A trigger like last(/host/cpu.util) > 90 that recovers at < 90 can flap on every poll when the metric oscillates around the threshold. Two distinct thresholds (fire at one, recover at another) kill the flap:

Problem:  last(/host/cpu.util) > 90
Recovery: last(/host/cpu.util) < 70

The trigger fires when CPU crosses 90% and recovers only when it drops below 70%. A value bouncing between 85–95% stays in PROBLEM until it genuinely cools off.
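To see the effect, here is a small simulation, not Zabbix code: it replays a series of CPU samples through a two-threshold state machine and counts how many PROBLEM events open. The sample values are invented for illustration.

```python
def count_problem_events(values, problem_above, recover_below):
    """Count how many times a trigger would open a PROBLEM over a series of samples."""
    in_problem = False
    events = 0
    for value in values:
        if not in_problem and value > problem_above:
            in_problem = True      # problem expression became true
            events += 1
        elif in_problem and value < recover_below:
            in_problem = False     # recovery expression became true
    return events

# Invented samples oscillating around 90%, with one genuine cool-down to 60%
samples = [85, 95, 88, 93, 86, 94, 60, 92]

single_threshold = count_problem_events(samples, problem_above=90, recover_below=90)
with_hysteresis = count_problem_events(samples, problem_above=90, recover_below=70)
```

On these samples the single threshold opens four problems and the 90/70 pair opens two, one per real excursion.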

When a separate recovery threshold doesn't fit the metric, use a time-window function instead:

min(/host/cpu.util,5m) > 90

Fire only when the minimum over the last 5 minutes exceeds 90, i.e. CPU stayed above 90% for the entire window. A single 95% spike won't trip it.
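The same idea as a simulation, assuming one sample per minute: for each sample, take every value from the preceding 5 minutes and fire only if the smallest of them is above the threshold. The function name and data are illustrative.

```python
def sustained_breach(samples, window_seconds, threshold):
    """For each (timestamp, value) sample, report whether min over the window exceeds threshold."""
    results = []
    for index, (now, _) in enumerate(samples):
        # values whose timestamp falls inside the window ending at `now`
        in_window = [v for t, v in samples[: index + 1] if t > now - window_seconds]
        results.append(min(in_window) > threshold)
    return results

# One sample every 60 seconds: a lone spike early on, then a sustained run above 90
values = [50, 95, 60, 92, 93, 94, 95, 96]
samples = [(i * 60, v) for i, v in enumerate(values)]

fired = sustained_breach(samples, window_seconds=300, threshold=90)
```

The spike to 95 at the second sample never fires; only the last sample does, once five consecutive values are above 90.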

Pick one anti-flap technique per trigger, not both. Hysteresis plus a 5-minute window stacks to "fires after 5 minutes of pain, recovers after another 5 minutes of cool", which is sometimes what you want but often surprising. Document the choice.

4. Escalations and Repeat-Ahead

In Alerts -> Actions -> Trigger actions -> Create action -> Operations:

Step 1, after 0  min: notify primary on-call (email + chat)
Step 2, after 15 min: notify secondary on-call (chat)
Step 3, after 30 min: page the team (PagerDuty)
Step 4, after 60 min: page the manager

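Each step runs only while the problem is still open, so resolving it stops the escalation. An acknowledgement stops later steps only if those operations carry the "Event is not acknowledged" condition described under Acknowledgement Discipline; without it, Zabbix keeps escalating an acknowledged problem.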

Pair with recovery operations that send "all clear" to the same channels; silence is more confusing than a "resolved" message.

Don't escalate too aggressively. Step delays of less than 5 minutes mean the on-call hasn't even read the first message before the second arrives. Pick delays based on actual P1 response SLAs.
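A quick way to sanity-check a schedule before clicking it into an action is to model it as data. The sketch below is a model of the step timings only, not of Zabbix internals; the 5-minute minimum gap is the rule of thumb from the paragraph above.

```python
# Escalation steps from the example: (minutes after the problem opens, who is notified)
STEPS = [
    (0, "primary on-call"),
    (15, "secondary on-call"),
    (30, "team"),
    (60, "manager"),
]

def notified_so_far(steps, minutes_open):
    """Who has been notified for a problem that has stayed open this many minutes."""
    return [who for delay, who in steps if delay <= minutes_open]

def too_tight(steps, min_gap=5):
    """Consecutive step delays that are closer together than min_gap minutes."""
    delays = [delay for delay, _ in steps]
    return [(a, b) for a, b in zip(delays, delays[1:]) if b - a < min_gap]
```

A problem that clears after 20 minutes reaches only the primary and secondary on-call, and `too_tight(STEPS)` comes back empty for this schedule.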

Combining Them: A Real-World Pattern

A web service trigger:

Trigger:        avg(/web01/web.test.fail[Login],3m) > 0
Severity:       High
Hysteresis:     recover when avg(...,5m) = 0
Dependency:     "Firewall: ICMP loss > 50%"
Maintenance:    "Friday deploy window" applies to web01

Action: Web service down
- Step 1 (0 min):  email + Slack #web-oncall
- Step 2 (10 min): page primary
- Step 3 (25 min): page secondary
Recovery: post "RESOLVED" to the same Slack thread

What this composition gives you:

Doesn't fire during the deploy window (maintenance suppression).
Doesn't fire if the firewall is the actual problem (dependency).
Doesn't flap on transient blips (3-minute average + recovery on 5-minute zero).
Doesn't blanket-page the whole team at minute one (escalation steps).
Doesn't leave the channel guessing what happened (recovery message).

Each individual mechanism quiets a category of false positive. Stacked, you get a trigger that genuinely means something the moment it fires.

Acknowledgement Discipline

Zabbix supports per-event acknowledgement with a comment. Use it. Configure the action to not re-page someone who has acknowledged:

Operation condition: Event is not acknowledged

Now an "I'm on it" ack stops further escalation. The trigger stays open, but the noise stops.
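Acks can also come from chat bots or runbooks rather than the UI. A minimal sketch of the request body for event.acknowledge follows. The event ID and message are placeholders, and authentication is left out because it differs between API versions.

```python
# event.acknowledge takes a bitmask of actions; these two are combined below
ACKNOWLEDGE = 2
ADD_MESSAGE = 4

def ack_request(event_id, message, request_id=1):
    """Build a JSON-RPC body that acknowledges one event and attaches a comment."""
    return {
        "jsonrpc": "2.0",
        "method": "event.acknowledge",
        "params": {
            "eventids": [event_id],
            "action": ACKNOWLEDGE | ADD_MESSAGE,  # 6 = acknowledge and add a message
            "message": message,
        },
        "id": request_id,
    }

body = ack_request("12345", "investigating, suspect the 02:00 deploy")
```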

Make acks part of the on-call ritual. The first thing the responder does is ack with "investigating + initial guess". The last thing they do is resolve with a one-line postmortem. These comments become your incident timeline for free.

Triggers You Should Not Have

Some patterns are wrong by construction:

"X is up" triggers without context. A "host up" trigger on a thousand hosts is a thousand false positives waiting to happen. Trigger on the absence of function, not the presence of life.
Triggers with no recovery condition. They never resolve, fill the dashboard, and train operators to ignore the column.
Triggers per-CPU-core, per-disk, per-interface without aggregation. The unit of alerting should be "the host has a problem", not "core 7 is busy".
Severity == High everywhere. When everything is high, nothing is. Reserve High for "wake someone up", Average for "deal with it tomorrow", Information for graphs.

Auditing What's Actually Firing

Reports -> Action log
Reports -> Top 100 triggers

The "Top 100 triggers" report shows which triggers changed state most often over the period you select, for example the last week. The top of that list is your noise. Either fix the trigger or delete it; leaving it there is the slowest, most expensive bug in any monitoring system.

For programmatic audit:

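# Count PROBLEM events per trigger over the last 7 days
Invoke-Zabbix -Session $s -Method 'event.get' -Params @{
    output       = 'extend'
    source       = 0           # events created by triggers
    object       = 0           # the event object is a trigger
    value        = 1           # PROBLEM events only, so recoveries are not counted
    time_from    = [DateTimeOffset]::UtcNow.AddDays(-7).ToUnixTimeSeconds()   # integer Unix time
    selectAcknowledges = 'extend'
    sortfield    = 'clock'
} | Group-Object objectid | Sort-Object Count -Descending | Select-Object -First 20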

Top 20 triggers by event count over the last 7 days. Re-run weekly. Your job is to drive that list down, not let it grow.
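If the audit runs somewhere without PowerShell, the grouping step is easy to reproduce. This sketch assumes you already have the event.get result as a list of dictionaries and only does the counting; the sample events are invented.

```python
from collections import Counter

def noisiest_triggers(events, top=20):
    """Rank trigger IDs by how many PROBLEM events they produced."""
    counts = Counter(
        event["objectid"]
        for event in events
        if str(event.get("value")) == "1"   # the API returns numbers as strings; 1 = PROBLEM
    )
    return counts.most_common(top)

# Invented sample in the shape event.get returns
events = [
    {"eventid": "1", "objectid": "501", "value": "1"},
    {"eventid": "2", "objectid": "501", "value": "0"},   # recovery, ignored
    {"eventid": "3", "objectid": "501", "value": "1"},
    {"eventid": "4", "objectid": "502", "value": "1"},
]

ranking = noisiest_triggers(events)
```

Storing each week's ranking alongside the previous one makes it obvious whether the list is shrinking.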

What to Do Next

Dependencies suppress cascades. Maintenance suppresses scheduled noise. Hysteresis kills flap. Escalations control who gets woken up and when. Combined, they turn a noisy Zabbix into one operators trust, the kind where, when the phone rings at 3 AM, the first reaction is "oh, that's real" instead of "ugh, again". Quiet by default, loud when it matters.

Three concrete moves to drive your alert volume down this week:

Run the top-20-noisy-triggers query. Sort by event count over the last 7 days. The top 5 are usually responsible for 80% of your noise; rewriting them with hysteresis or a dependency is the highest-leverage hour you'll spend on alerting all month.
Add a recurring maintenance window for your deploy windows, scoped to the hosts being deployed. Most "false positive" alerts during deploys are a configuration gap, not a Zabbix limitation. Schedule a maintenance window matching your deploy cadence and the noise disappears.
Move at least one trigger from immediate-page to escalation. If a trigger fires for 5 minutes and self-clears, nobody needs to wake up. Escalation steps (notify channel at 0 min, page at 10 min, page manager at 30 min) turn the same trigger from a 3 AM phone call into a Slack ping nobody had to take.

Pairs naturally with the log monitoring post (because log-based triggers are the most common storm source and benefit most from these patterns) and the web scenarios post (synthetic checks need scheduled mutes during deploys to stay credible).

Quieting Zabbix Alerts: Maintenance, Dependencies, Hysteresis, Escalations