
Alerting and Escalation

Alerting and escalation form the bridge between monitoring systems that detect conditions and human responders who resolve them. An alert is a notification generated when a monitored metric crosses a defined threshold or a specific event occurs. Escalation is the structured progression of an unacknowledged or unresolved alert through increasingly senior or specialised responders until someone takes ownership. Together, these mechanisms determine whether a disk filling at 3am wakes the right person within minutes or triggers a cascade of missed notifications that becomes a major incident by morning.

The effectiveness of alerting depends on three properties: alerts must be actionable, meaning someone can and should do something in response; alerts must be appropriately targeted, reaching people with both the authority and capability to respond; and alerts must be accurate, firing when genuine problems exist and remaining silent otherwise. Failures in any property create distinct pathologies. Non-actionable alerts train responders to ignore notifications. Poorly targeted alerts waste time as messages bounce between teams. Inaccurate alerts erode trust in the monitoring system itself, leading to disabled notifications and blind spots.

Alert: A notification generated when monitoring detects a condition requiring human attention, characterised by severity, source, and required response.
Threshold: A boundary value that triggers an alert when crossed. Static thresholds use fixed values; dynamic thresholds adjust based on historical patterns.
Escalation path: A defined sequence of responders and timing intervals that determines who receives an alert and when, progressing until acknowledgement occurs.
On-call: A scheduled responsibility period during which a designated person must be reachable and capable of responding to alerts within defined timeframes.
Acknowledgement: An explicit action by a responder indicating they have received an alert and taken ownership of the response, halting further escalation.
Alert fatigue: A condition where excessive alert volume causes responders to ignore or inadequately respond to notifications, including genuine incidents.
Runbook: A documented procedure linked to a specific alert type that guides the responder through diagnosis and resolution steps.

Alert severity classification

Alert severity determines notification urgency, escalation speed, and expected response time. A severity model creates shared understanding between monitoring systems, responders, and service level commitments. Most organisations use three to five severity levels, with more granular models adding complexity without proportional benefit.

A four-level model provides sufficient differentiation for most mission-driven organisations:

Critical alerts indicate service unavailability or data loss affecting production systems. The primary beneficiary database being unreachable, payment processing failing, or authentication services down all qualify as critical. These alerts require immediate response regardless of time, with acknowledgement expected within 15 minutes and resolution efforts beginning immediately. Critical alerts always trigger phone calls or high-priority push notifications.

High alerts indicate significant degradation that will become critical without intervention. A database server at 95% disk capacity, response times exceeding SLA thresholds, or backup failures fall into this category. Response within 30 minutes during business hours and within 1 hour outside business hours represents appropriate urgency. High alerts use push notifications and escalate to phone calls if unacknowledged.

Medium alerts indicate conditions requiring attention within the business day. A certificate expiring in 14 days, elevated but non-critical error rates, or capacity trending toward thresholds warrant medium severity. Response within 4 business hours allows planned intervention before degradation. Email and chat notifications suffice for medium alerts.

Low alerts indicate informational conditions for awareness rather than immediate action. Scheduled maintenance completing, daily backup success, or routine threshold warnings belong here. These alerts appear in dashboards and daily summaries but do not generate real-time notifications.

+-------------------------------------------------------------------+
| ALERT SEVERITY MODEL |
+-------------------------------------------------------------------+
| |
| CRITICAL HIGH MEDIUM LOW |
| +-----------+ +-----------+ +-----------+ +-----------+ |
| | Service | | Degraded | | Attention | | Info | |
| | down or | | service | | needed | | only | |
| | data loss | | or risk | | today | | | |
| +-----------+ +-----------+ +-----------+ +-----------+ |
| | | | | |
| v v v v |
| +-----------+ +-----------+ +-----------+ +-----------+ |
| | Phone + | | Push + | | Email + | | Dashboard | |
| | Push + | | Email | | Chat | | + Daily | |
| | Email | | | | | | summary | |
| +-----------+ +-----------+ +-----------+ +-----------+ |
| | | | | |
| v v v v |
| +-----------+ +-----------+ +-----------+ +-----------+ |
| | ACK: 15m | | ACK: 30m | | ACK: 4h | | No ACK | |
| | Response: | | Response: | | Response: | | required | |
| | Immediate | | 1 hour | | Same day | | | |
| +-----------+ +-----------+ +-----------+ +-----------+ |
| |
+-------------------------------------------------------------------+

Figure 1: Four-level severity model showing notification channels and response expectations

Severity assignment occurs through alert rules, not human judgement during an incident. Each alert definition specifies its severity based on the condition being detected. A rule monitoring database connectivity assigns critical severity; the same monitoring system watching disk space might assign high severity at 90% and medium at 80%. This pre-classification ensures consistent handling regardless of which responder receives the alert or what time it arrives.
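The pre-classification idea can be sketched in a few lines: each rule carries its severity and acknowledgement window, so classification never depends on responder judgement. The rule names, thresholds, and ack windows below are illustrative assumptions, not a real configuration.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class Rule:
    condition: str
    severity: str            # critical / high / medium / low
    ack_minutes: Optional[int]

def classify_disk_usage(pct_used: float) -> Optional[Rule]:
    """Severity for a disk-usage reading, per pre-defined thresholds."""
    if pct_used >= 90:
        return Rule("disk >= 90%", "high", 30)
    if pct_used >= 80:
        return Rule("disk >= 80%", "medium", 240)
    return None  # below threshold: no alert fires
```

Because the rule, not the responder, decides severity, a reading of 92% is handled identically at 3am on a weekend and at 10am on a Tuesday.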

Alert rule design

Alert rules translate monitoring data into actionable notifications. A rule consists of a condition (what triggers the alert), a threshold (the boundary that activates the condition), and metadata (severity, description, target responders, linked runbook). Effective rules fire precisely when human intervention adds value and remain silent when automated systems handle the situation or when no action is possible.

The condition defines what the monitoring system evaluates. Conditions range from simple comparisons (CPU utilisation exceeds 90%) to complex queries combining multiple metrics (error rate exceeds 1% AND request volume exceeds 100 per minute). Simple conditions are easier to understand and troubleshoot but may generate false positives in edge cases. Complex conditions reduce noise but obscure the triggering factors when alerts fire.

Threshold selection requires balancing sensitivity against noise. Setting a disk space alert at 70% capacity generates warnings weeks before problems occur but fires constantly on systems that routinely operate at 75%. Setting the threshold at 95% provides less warning time but eliminates noise from stable systems. The appropriate threshold depends on how quickly the condition changes and how long remediation takes. A disk that fills by 1% daily warrants alerting at 85% to provide two weeks of runway. A disk that can fill within hours from a runaway process needs alerting at 70% to ensure response time.
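The runway reasoning above reduces to a small calculation; the function and its parameters are an illustrative sketch, not a standard formula.

```python
def disk_alert_threshold(fill_rate_pct_per_day: float,
                         runway_days: float,
                         ceiling_pct: float = 100.0) -> float:
    """Threshold (% used) that still leaves `runway_days` of remediation
    time at the observed fill rate."""
    return max(0.0, ceiling_pct - fill_rate_pct_per_day * runway_days)
```

A disk filling by 1% daily with 15 days of runway wanted gives a threshold of 85%, matching the example in the text; a faster-filling disk pushes the threshold lower.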

Static thresholds use fixed values determined through analysis or experience. A web server responding in under 200ms during normal operation might alert at 500ms, representing clear degradation. Static thresholds work well for metrics with consistent baselines and predictable variation.

Dynamic thresholds adjust based on historical patterns, accounting for expected variation. A system processing payroll shows predictably higher load on the last day of each month. A static threshold accommodating this peak would be too lax during normal periods; a threshold tuned for normal periods would generate monthly false alarms. Dynamic thresholds learn these patterns and alert only when metrics deviate from expected values at that time.
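A minimal sketch of a learned threshold: build a per-bucket baseline from history, then flag only readings that deviate from that bucket's expectation. Bucketing by day-of-month crudely approximates the payroll pattern described above; production systems use richer seasonal models.

```python
from collections import defaultdict
from statistics import mean, stdev

def learn_baseline(history):
    """history: iterable of (day_of_month, load) observations."""
    buckets = defaultdict(list)
    for day, load in history:
        buckets[day].append(load)
    return {day: (mean(vals), stdev(vals) if len(vals) > 1 else 0.0)
            for day, vals in buckets.items()}

def is_anomalous(baseline, day, load, k=3.0):
    """Alert only when load exceeds the bucket's mean by k deviations."""
    mu, sigma = baseline.get(day, (0.0, 0.0))
    return load > mu + k * max(sigma, 1.0)  # floor sigma: no zero-width bands
```

With this model, 92% load on the 28th (a known payroll peak) stays silent while the same load mid-month alerts.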

+------------------------------------------------------------------+
| THRESHOLD COMPARISON |
+------------------------------------------------------------------+
| |
| 100% + |
| | Static threshold: 85% |
| 90% + - - - - - - - - - - - - - - - - - - - - - - - - - - |
| | /\ /\ |
| 80% + / \ ALERT / \ FALSE |
| | / \ (genuine) / \ POSITIVE |
| 70% + / \ / \ |
| | / \ / \ (monthly |
| 60% + / \ / \ payroll) |
| | / \ / \ |
| 50% +--+ +---+-------+ +-------- |
| | |
| +--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+ |
| Week 1 Week 2 Week 3 Week 4 |
| |
| 100% + |
| | Dynamic threshold (adjusts) |
| 90% + . . . . . . |
| | /\ . . |
| 80% + / \ ALERT . . NO ALERT |
| | / \ ~~~~~~ . . (expected) |
| 70% + / \ / \ /\ |
| | / X \ / \ |
| 60% + / / \ \ / \ |
| | / / \ \ / \ |
| 50% +--+--------+ +--------+-------+ +-------- |
| | |
| +--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+ |
| Week 1 Week 2 Week 3 Week 4 |
| |
+------------------------------------------------------------------+

Figure 2: Static versus dynamic thresholds showing false positive elimination for predictable patterns

Rule metadata provides context that makes alerts actionable. The description explains what condition triggered the alert and why it matters, written for a responder who may be unfamiliar with the specific system. Including the current value, threshold, and affected system in the alert message eliminates the need for immediate investigation just to understand the situation. A poorly constructed alert reading “Disk space warning” requires the responder to log in and investigate before knowing whether this is urgent. A well-constructed alert reading “Server db-primary-01 disk /var/data at 92% (threshold: 90%), increasing 2% daily, estimated full in 4 days” provides immediate context for prioritisation.
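Composing that well-constructed message is straightforward once the rule carries current value, threshold, and growth rate; this sketch uses an illustrative format, not a standard one.

```python
def alert_message(host, mount, pct_used, threshold, daily_growth_pct):
    """Compose an alert body with enough context to prioritise without
    logging in first."""
    days_left = ((100 - pct_used) / daily_growth_pct
                 if daily_growth_pct > 0 else float("inf"))
    return (f"Server {host} disk {mount} at {pct_used}% "
            f"(threshold: {threshold}%), increasing {daily_growth_pct}% daily, "
            f"estimated full in {days_left:.0f} days")
```

For the example in the text, `alert_message("db-primary-01", "/var/data", 92, 90, 2)` reproduces the "estimated full in 4 days" wording.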

Notification channels

Notification channels determine how alerts reach responders. Channel selection balances reliability, urgency, and responder burden. Using aggressive channels for low-severity alerts creates fatigue; using passive channels for critical alerts delays response.

Email provides reliable delivery with low intrusion, making it suitable for medium and low severity alerts. Email’s asynchronous nature means responders may not see messages for hours, making it inappropriate for time-sensitive conditions. Email also suffers from volume problems; a responder receiving 50 alert emails daily will struggle to identify the important ones. Filtering and aggregation help, but email works best as a record-keeping channel supplementing more immediate notifications.

Chat platforms (Slack, Teams, Mattermost) offer higher visibility than email with lower intrusion than phone calls. Dedicated alert channels concentrate notifications where responders expect them, and threading allows discussion of specific alerts. Chat works well for high and medium severity alerts during business hours. The limitation is that chat notifications may go unnoticed when responders are away from their computers or have notifications muted.

Push notifications to mobile devices provide high visibility without requiring responders to be at their desks. Most alerting systems support mobile applications that deliver push notifications, vibrate devices, and override do-not-disturb settings for critical alerts. Push notifications suit high severity alerts and critical alerts as a first-tier notification before phone escalation.

Phone calls and SMS provide the highest urgency and reliability for critical alerts. Phone calls are difficult to ignore and confirm delivery when answered. SMS provides near-immediate delivery to any mobile phone without requiring application installation or internet connectivity, valuable in field contexts with unreliable data connections. The intrusion of phone calls makes them inappropriate for anything below critical severity; waking someone at 3am for a medium-severity alert destroys trust in the alerting system.

Multi-channel notification sends the same alert through several channels simultaneously or in sequence. A critical alert might immediately send a push notification and email, then place a phone call after 5 minutes without acknowledgement. This layered approach increases the probability of reaching someone while allowing less intrusive channels first.
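The layered sequence can be sketched as a plan of (delay, channel) pairs, with acknowledgement cutting the sequence short. Channel names and delays below mirror the example above and are assumptions.

```python
PLAN = [(0, "push"), (0, "email"), (5, "phone")]  # (minutes offset, channel)

def channels_fired(minutes_elapsed, acknowledged_at=None):
    """Channels triggered so far, stopping once acknowledgement lands."""
    fired = []
    for delay, channel in PLAN:
        if acknowledged_at is not None and delay >= acknowledged_at:
            break  # acknowledgement halts further escalation
        if delay <= minutes_elapsed:
            fired.append(channel)
    return fired
```

An alert acknowledged at minute 3 never reaches the phone tier; one ignored past minute 5 triggers all three channels.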

On-call scheduling

On-call scheduling assigns responsibility for alert response to specific individuals during defined time periods. Effective on-call structures ensure coverage without burning out participants and provide clear accountability for every incoming alert.

A rotation distributes on-call responsibility across a team over time. Weekly rotations assign one person as primary responder for a full week, providing continuity but concentrating burden. Daily rotations distribute the load more evenly but create handoff overhead each day. Most teams find weekly rotations with weekend splits (one person covers weekdays, another covers the weekend) provide a reasonable balance.

Primary and secondary (backup) on-call roles provide redundancy. The primary responder receives alerts first. If no acknowledgement occurs within the defined window, the secondary responder receives the same alert. This layered approach handles situations where the primary is temporarily unavailable (in a meeting, travelling, or asleep despite notifications). Tertiary escalation typically reaches a manager or team lead rather than another individual contributor.

+------------------------------------------------------------------+
| ON-CALL ROTATION MODEL |
+------------------------------------------------------------------+
| |
| WEEK 1 WEEK 2 WEEK 3 WEEK 4 |
| +----------+ +----------+ +----------+ +----------+ |
| | PRIMARY | | PRIMARY | | PRIMARY | | PRIMARY | |
| | Alice | | Bob | | Carol | | David | |
| +----+-----+ +----+-----+ +----+-----+ +----+-----+ |
| | | | | |
| +----v-----+ +----v-----+ +----v-----+ +----v-----+ |
| | SECONDARY| | SECONDARY| | SECONDARY| | SECONDARY| |
| | Bob | | Carol | | David | | Alice | |
| +----+-----+ +----+-----+ +----+-----+ +----+-----+ |
| | | | | |
| +----v-----+ +----v-----+ +----v-----+ +----v-----+ |
| | TERTIARY | | TERTIARY | | TERTIARY | | TERTIARY | |
| | (Manager)| | (Manager)| | (Manager)| | (Manager)| |
| +----------+ +----------+ +----------+ +----------+ |
| |
| Timeline for single alert (no acknowledgement): |
| |
| 0 min 5 min 15 min 30 min |
| | | | | |
| v v v v |
| +---------+ +---------+ +---------+ +---------+ |
| |Push to | |Call to | |Secondary| |Tertiary | |
| |Primary |--->|Primary |--->|called |--->|notified | |
| | | | | | | | | |
| +---------+ +---------+ +---------+ +---------+ |
| |
+------------------------------------------------------------------+

Figure 3: Weekly rotation with escalation timeline for unacknowledged alerts
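The rotation in Figure 3 is a simple modular walk over the team list: each person is primary for a week, the next person is their secondary, and a fixed manager tier sits above both. A sketch, with the team names from the figure:

```python
TEAM = ["Alice", "Bob", "Carol", "David"]

def rotation(week: int, team=TEAM, tertiary="Manager"):
    """On-call assignments for a 0-based week number."""
    return {
        "primary": team[week % len(team)],
        "secondary": team[(week + 1) % len(team)],
        "tertiary": tertiary,
    }
```

In week 3 David is primary and the rotation wraps, making Alice his secondary, exactly as the figure shows.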

On-call compensation recognises the burden of availability requirements. Whether through additional pay, time off in lieu, or reduced regular duties during on-call weeks, compensation acknowledges that being on-call restricts personal time even when no alerts fire. Organisations that treat on-call as an uncompensated expectation find participation reluctant and response quality declining.

Schedule management requires handling holidays, illness, and other absences. A swap system allows on-call individuals to trade shifts with colleagues. Override capabilities let managers reassign on-call when someone becomes unavailable unexpectedly. Scheduled overrides handle known absences like annual leave.

Follow-the-sun models distribute on-call across geographic locations, limiting out-of-hours calls. An organisation with staff in London, Nairobi, and Manila can route alerts to whoever is in business hours, eliminating middle-of-the-night calls for most alerts. This model requires sufficient expertise in each location and clear handoff procedures between shifts.

Escalation path design

Escalation paths define who receives an alert and when, progressing through increasingly senior or specialised responders until someone acknowledges. A well-designed escalation path reaches the right person quickly without unnecessary notifications to others.

Time-based escalation advances through responders based on elapsed time without acknowledgement. A critical alert reaches the primary on-call immediately. After 5 minutes without acknowledgement, it reaches the secondary. After 10 more minutes, it reaches the tertiary. Time-based escalation suits alerts where any skilled responder can handle the issue.
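Time-based escalation reduces to looking up which tiers have already been reached; the intervals below follow the example above (primary immediately, secondary at 5 minutes, tertiary at 15).

```python
TIERS = [(0, "primary"), (5, "secondary"), (15, "tertiary")]

def current_tier(minutes_unacknowledged: float) -> str:
    """Most senior tier already notified for an unacknowledged alert."""
    return [name for start, name in TIERS
            if minutes_unacknowledged >= start][-1]
```

At 7 minutes unacknowledged the secondary has been paged; at 20 minutes the alert has reached the tertiary tier.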

Severity-based escalation routes different severities through different paths. Critical alerts might go directly to senior engineers and managers simultaneously, while medium alerts follow the standard on-call path. Severity-based routing ensures serious issues receive appropriate attention without flooding senior staff with routine matters.

Functional escalation routes alerts to teams with specific expertise. A database alert escalates through the database team’s on-call rotation. A network alert escalates through the network team. Functional escalation ensures responders have relevant skills but requires maintaining separate rotations for each function.

Hierarchical escalation adds management notification for alerts meeting certain criteria or duration thresholds. An alert unresolved after 1 hour might notify the IT manager. An alert affecting a critical system might notify both technical responders and management from the start. Hierarchical escalation keeps leadership informed without requiring them to acknowledge or respond technically.

+------------------------------------------------------------------+
| ESCALATION PATH ARCHITECTURE |
+------------------------------------------------------------------+
| |
| +-------------------+ |
| | ALERT FIRES | |
| +--------+----------+ |
| | |
| +--------v----------+ |
| | Severity routing | |
| +--------+----------+ |
| | |
| +--------------------+--------------------+ |
| | | | |
| v v v |
| +-------+-------+ +-------+-------+ +-------+-------+ |
| | CRITICAL | | HIGH | | MEDIUM/LOW | |
| +-------+-------+ +-------+-------+ +-------+-------+ |
| | | | |
| v v v |
| +-------+-------+ +-------+-------+ +-------+-------+ |
| | Primary ON | | Primary ON | | Email to team | |
| | CALL + Manager| | CALL | | channel | |
| | (phone) | | (push) | | | |
| +-------+-------+ +-------+-------+ +-------+-------+ |
| | | | |
| 5 min | No ACK 15 min| No ACK 4 hrs | No ACK |
| v v v |
| +-------+-------+ +-------+-------+ +-------+-------+ |
| | Secondary ON | | Secondary ON | | Primary ON | |
| | CALL + Sr Mgr | | CALL | | CALL | |
| | (phone) | | (phone) | | (email) | |
| +-------+-------+ +-------+-------+ +-------+-------+ |
| | | | |
| 10 min | No ACK 30 min| No ACK 8 hrs | No ACK |
| v v v |
| +-------+-------+ +-------+-------+ +-------+-------+ |
| | Director + | | Manager | | Manager | |
| | Incident | | (phone) | | (email) | |
| | bridge | | | | | |
| +---------------+ +---------------+ +---------------+ |
| |
+------------------------------------------------------------------+

Figure 4: Multi-path escalation showing different flows by severity

Escalation timing balances speed against unnecessary interruption. Aggressive escalation (5-minute intervals) quickly reaches backup responders but may page someone just as the primary is acknowledging. Conservative escalation (30-minute intervals) gives responders time to acknowledge but delays response when someone genuinely misses the alert. Critical alerts warrant aggressive escalation; the cost of delayed response exceeds the cost of occasional redundant notifications. Lower severities can use longer intervals.

Worked example: A database server becomes unreachable at 02:15 on a Tuesday. The monitoring system generates a critical alert. At 02:15, the alerting system sends a push notification and places a phone call to Alice (primary on-call). Alice's phone is on silent and the notifications go unanswered. At 02:20, having received no acknowledgement, the system phones Bob (secondary) and sends the IT manager an SMS notification. Bob answers at 02:21, acknowledges the alert, and begins investigation. The escalation stops; no further notifications go out. Bob resolves the issue by 02:45 and closes the alert.

Alert correlation and suppression

Alert storms occur when a single root cause triggers multiple alerts simultaneously. A network switch failure might generate separate alerts for every server behind that switch becoming unreachable. Without correlation, responders receive dozens or hundreds of alerts for what is actually one problem, obscuring the root cause in noise.

Alert correlation groups related alerts under a common parent. When the monitoring system detects that multiple alerts share a probable cause (all affected servers on the same network segment, all errors starting at the same timestamp), it creates a single summary alert referencing the underlying alerts. Responders see “Network segment failure affecting 15 servers” rather than 15 individual server unreachable alerts.

Correlation rules define relationships between alerts. Topological correlation uses known infrastructure relationships: if a router fails, suppress downstream device alerts. Temporal correlation groups alerts firing within a narrow time window. Pattern correlation identifies alert sequences that historically indicate specific root causes.
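Topological correlation can be sketched over a direct parent/child map: any alert whose parent is also alerting is treated as a downstream symptom and grouped under the root cause. The topology below is invented for illustration, and the sketch only considers one level of ancestry.

```python
PARENT = {"server-01": "switch-A", "server-02": "switch-A", "switch-A": None}

def correlate(alerting):
    """Map each root-cause alert to the downstream alerts it explains."""
    roots = [n for n in alerting if PARENT.get(n) not in alerting]
    return {root: sorted(n for n in alerting if PARENT.get(n) == root)
            for root in roots}
```

A failing switch with two unreachable servers behind it collapses into one summary entry: the switch as root cause, the servers as explained symptoms.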

Suppression prevents redundant alerts from firing during known conditions. A scheduled maintenance window should suppress alerts for systems under maintenance. A known issue should suppress repeated alerts until resolved. Suppression differs from disabling: suppressed alerts still log for audit purposes but do not notify.

Deduplication prevents the same alert from generating multiple notifications. If disk space on server-01 has triggered an alert and remains above threshold, subsequent checks should not fire new alerts until the condition clears and recurs. Deduplication typically works by maintaining state: once an alert fires, the rule suppresses re-firing until the condition returns to normal.
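The state-keeping described above is a small edge-triggered check: fire on the transition into the alerting state, stay silent until the condition clears and recurs. A minimal sketch:

```python
class DedupRule:
    def __init__(self, threshold: float):
        self.threshold = threshold
        self.firing = False

    def check(self, value: float) -> bool:
        """True only on the transition into the alerting state."""
        if value >= self.threshold:
            newly_fired = not self.firing
            self.firing = True
            return newly_fired
        self.firing = False
        return False
```

A reading sequence of 85, 92, 93, 94, 80, 91 against a threshold of 90 fires exactly twice: once when the condition first appears and once after it clears and recurs.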

Flapping detection identifies conditions that rapidly oscillate between alerting and non-alerting states. A network link with intermittent connectivity might generate dozens of up/down alerts per hour. Flapping detection recognises this pattern and consolidates notifications, typically sending one “flapping” alert rather than continuous individual state changes.
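One common approach, sketched here, counts state transitions within a sliding window and reports "flapping" instead of individual up/down changes. The window size and transition limit are assumptions to tune per link.

```python
from collections import deque

class FlapDetector:
    def __init__(self, window=10, max_transitions=4):
        self.states = deque(maxlen=window)
        self.max_transitions = max_transitions

    def observe(self, up: bool) -> str:
        self.states.append(up)
        s = list(self.states)
        transitions = sum(a != b for a, b in zip(s, s[1:]))
        if transitions >= self.max_transitions:
            return "flapping"
        return "ok" if up else "down"
```

A link oscillating up/down every check crosses the transition limit within a few observations and produces a single consolidated "flapping" state instead of a stream of alerts.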

Alert fatigue prevention

Alert fatigue occurs when responders become desensitised to notifications due to volume, leading to slow response or missed critical alerts. Studies in healthcare monitoring found that when ICU staff receive more than 350 alerts per patient per day, response to genuine critical conditions degrades significantly. IT operations face similar dynamics: an engineer receiving 100 alerts daily will eventually start ignoring them.

The primary cause of alert fatigue is non-actionable alerts. Every alert should have a clear action the responder can take. An alert firing for a condition that cannot be fixed, does not matter, or resolves automatically without intervention trains responders to ignore alerts. Auditing alerts quarterly to identify non-actionable patterns and either fixing the underlying condition, adjusting thresholds, or removing the alert reduces noise.

Threshold tuning addresses alerts that fire too frequently for conditions that do not require response. If CPU utilisation alerts at 80% fire dozens of times monthly but investigation always finds normal operation, the threshold is wrong. Raising the threshold to 90% might reduce alerts by 80% while still catching genuine problems. Each alert should fire rarely enough that responders treat it seriously.

Alert ownership assigns responsibility for maintaining specific alerts. Without ownership, alerts accumulate without review. The person owning an alert should receive its notifications and is responsible for tuning thresholds, updating runbooks, and retiring the alert if it no longer adds value.

Metrics for alert health provide visibility into fatigue risk. Track alerts per week by severity, acknowledgement times, alerts closed without action, and alerts that recur within 24 hours of closure. Rising volumes, increasing acknowledgement delays, and high close-without-action rates indicate fatigue developing.

A target for sustainable alerting is fewer than 10 alerts per on-call shift that require investigation. At this volume, each alert receives appropriate attention. Exceeding 50 alerts per shift indicates systemic problems requiring alert review, threshold adjustment, or automation of responses.
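The health metrics and shift targets above can be sketched as two small functions. The record fields are invented for illustration; real alerting platforms expose similar data through their APIs, and the "review" band between the two published targets is an assumption.

```python
def alert_health(alerts):
    """alerts: list of dicts with 'ack_minutes' and 'actioned' keys."""
    n = len(alerts)
    return {
        "volume": n,
        "median_ack_minutes": sorted(a["ack_minutes"] for a in alerts)[n // 2],
        "closed_without_action_pct": round(
            100 * sum(not a["actioned"] for a in alerts) / n, 1),
    }

def shift_verdict(alerts_per_shift: int) -> str:
    """Apply the sustainability targets from the text."""
    if alerts_per_shift < 10:
        return "sustainable"
    if alerts_per_shift <= 50:
        return "review needed"
    return "systemic problem"
```

Tracking these numbers weekly makes fatigue visible before it shows up as a missed critical alert.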

Runbook integration

Runbooks connect alerts to resolution procedures, ensuring consistent response regardless of which responder receives the alert. Each alert type should link to a runbook providing diagnostic steps, common causes, and resolution procedures.

Alert metadata includes a runbook link that the responder can access directly from the notification. When Alice receives an alert about database replication lag, clicking the runbook link opens documentation explaining how to diagnose replication issues, common causes (network latency, disk I/O, lock contention), and step-by-step resolution for each cause.

Runbook structure for alert response differs from general documentation. Alert runbooks assume the reader is responding to a live issue and needs to act quickly. They begin with immediate diagnostic commands to run, followed by decision trees based on results, with resolution steps for each branch. Background explanation and theory belong in concept documentation, not alert runbooks.

Automated response extends runbook logic to execute without human intervention. If the runbook for a full disk always starts with “clear temp files older than 7 days,” that step can run automatically when the alert fires. The alert then notifies humans only if automated remediation fails. Automation suits predictable issues with safe, well-understood resolutions. Issues requiring judgement or carrying risk of making things worse remain manual.
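The temp-file example could be automated along these lines; the path, age limit, and dry-run default are assumptions, and a real implementation would re-check disk usage afterwards and page a human only if the condition persists.

```python
import subprocess

def remediate_full_disk(temp_dir="/tmp", max_age_days=7, dry_run=True):
    """Delete old temp files; returns the find(1) command used."""
    cmd = ["find", temp_dir, "-type", "f",
           "-mtime", f"+{max_age_days}", "-delete"]
    if not dry_run:
        subprocess.run(cmd, check=False)  # failure falls through to a page
    return " ".join(cmd)
```

Keeping the deletion behind a dry-run flag reflects the caution the text advises: automation suits only steps that are safe to run unattended.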

Runbook maintenance keeps documentation accurate as systems change. A runbook referencing commands for an old system version misleads responders. Including runbook review in change processes ensures documentation updates alongside system changes. Alert owners are typically responsible for runbook accuracy.

Implementation considerations

Organisational context shapes alerting implementation significantly. A single IT person monitoring 50 systems faces different challenges than a 20-person team supporting 2,000 systems across multiple countries.

For organisations with minimal IT capacity, start with critical alerts only. Configure alerts for conditions that cause immediate service impact: system down, security breach indicators, backup failures. Accept that medium and low severity conditions will not generate alerts; address them through periodic review instead. Use email and mobile push notifications to a single person rather than complex rotation systems. Document escalation to an external contact (managed service provider, volunteer technical advisor) for situations beyond internal capability. A realistic target is 5-10 well-tuned critical alerts rather than comprehensive monitoring.

For organisations with small IT teams (2-5 people), implement a basic on-call rotation with weekly primary assignment. Add high-severity alerts for degradation conditions alongside critical alerts. Configure chat channel integration for team visibility. Establish monthly alert review to retire non-actionable alerts and adjust thresholds. Cross-train team members so anyone on call can handle common alerts. A realistic target is 20-50 alerts across the estate with clear runbooks for each.

For established IT functions, implement comprehensive alerting across severity levels with formal on-call scheduling including compensation. Deploy alert correlation to manage complexity. Integrate alerting with incident management systems so acknowledged alerts create incident tickets automatically. Establish alert ownership across teams. Implement dynamic thresholds for systems with variable baselines. Target fewer than 10 actionable alerts per on-call shift through rigorous noise reduction.

Field and distributed contexts require additional consideration. Staff in field offices may have unreliable internet connectivity, making push notifications and chat channels unreliable. SMS provides more reliable delivery where mobile networks function but internet does not. Time zone distribution across offices enables follow-the-sun on-call, reducing out-of-hours burden for any single location. Field infrastructure (solar power, satellite connectivity) may require different thresholds than data centre equipment, with longer response windows reflecting physical access constraints.

Technology options

Alerting capabilities exist in three categories: built into monitoring tools, standalone alerting platforms, and integrated platforms combining monitoring and alerting.

Monitoring tools with built-in alerting include Prometheus with Alertmanager (open source, widely deployed, requires operational skill), Zabbix (open source, comprehensive, traditional architecture), and Grafana with its alerting module (open source, integrates with multiple data sources). These tools suit organisations already using the monitoring platform and wanting consistent configuration.

Standalone alerting platforms focus specifically on notification routing and on-call management. PagerDuty (commercial, SaaS), Opsgenie (commercial, Atlassian), and Grafana OnCall (open source) receive alerts from any monitoring source and handle escalation, scheduling, and notification. These platforms suit organisations using multiple monitoring tools or wanting specialised on-call management features.

Open source options provide capable alerting without licensing costs. Alertmanager (part of Prometheus ecosystem) handles deduplication, grouping, and routing. Grafana OnCall provides on-call scheduling and escalation. Zabbix includes comprehensive alerting alongside monitoring. These tools require operational investment to deploy and maintain but avoid vendor lock-in and subscription costs.

Commercial options reduce operational burden through managed services. PagerDuty offers sophisticated escalation, mobile applications, and integrations with hundreds of tools. Many commercial options provide nonprofit pricing: PagerDuty’s nonprofit programme offers significant discounts; Opsgenie through Atlassian’s Community licence provides free access for eligible organisations.

Selection criteria include existing monitoring tools (integration matters), team size (complex scheduling needs increase with team size), budget (commercial tools cost but reduce operational burden), and technical capacity (self-hosted tools require administration skill).

See also