Network Monitoring
Network monitoring is the continuous collection, aggregation, and analysis of data from network infrastructure to measure health, performance, and availability. Monitoring systems collect metrics through standardised protocols, aggregate data at collection points, and present information through dashboards and alerts that enable operational response. For organisations operating across headquarters, regional offices, and field locations with varying connectivity quality, network monitoring provides visibility into infrastructure that staff depend upon but cannot directly observe.
The monitoring discipline divides into four functional areas: availability monitoring confirms that devices and links are reachable; performance monitoring measures latency, jitter, and throughput; utilisation monitoring tracks bandwidth consumption and capacity; and fault monitoring detects and diagnoses error conditions. Each area employs distinct collection methods and produces different metric types, but all feed into a unified monitoring architecture that correlates data across sources.
- Polling
- Active collection method where the monitoring system queries devices at regular intervals to retrieve current metric values. Polling intervals balance data freshness against device load and network overhead.
- Flow data
- Traffic metadata exported by network devices describing connections passing through them, including source, destination, protocol, byte counts, and timing. Flow protocols include NetFlow, sFlow, and IPFIX.
- Trap
- Asynchronous notification sent by a network device to the monitoring system when a defined condition occurs, such as an interface going down or a threshold being exceeded.
- Baseline
- Statistical profile of normal network behaviour derived from historical data, against which current measurements are compared to detect anomalies.
Collection methods
Network monitoring relies on three primary collection methods: protocol-based polling, flow export, and packet capture. Each method provides different information at different costs, and production monitoring architectures combine all three to achieve comprehensive visibility.
Simple Network Management Protocol (SNMP) remains the foundation of device monitoring despite its age. Network devices expose operational data through a Management Information Base (MIB), a hierarchical namespace of object identifiers (OIDs) that map to specific metrics. The monitoring system polls devices by requesting OID values, and devices respond with current readings. SNMP version 2c uses community strings for authentication, while SNMPv3 adds encryption and stronger authentication suitable for production use.
A typical SNMP poll retrieves interface statistics including bytes transmitted and received, error counts, and operational status. The monitoring system stores these values with timestamps, calculates rates of change between polls, and derives metrics such as bits per second and error rates. Poll intervals of 60 seconds suit most environments; shorter intervals increase precision but increase polling load proportionally. A monitoring server polling 500 devices across 20 interfaces each at 60-second intervals generates approximately 167 polls per second, a load that commodity hardware handles comfortably and that scales linearly as device and interface counts grow.
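The polling-load arithmetic above can be sketched as a quick estimate (a minimal illustration, not tied to any particular monitoring product):

```python
# Rough polling-load estimate for an SNMP monitoring server.
# Figures match the example in the text: 500 devices, 20 interfaces
# each, polled at 60-second intervals.

def polls_per_second(devices: int, interfaces_per_device: int,
                     interval_s: float) -> float:
    """One poll per interface per interval; load grows linearly
    with device and interface counts."""
    return devices * interfaces_per_device / interval_s

print(polls_per_second(500, 20, 60))   # ~166.7 polls per second
```

Doubling the device count doubles the load; halving the interval does the same, which is why interval choice is the main tuning knob.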
+---------------------------------------------------+
|              SNMP POLLING ARCHITECTURE            |
+---------------------------------------------------+

  +-----------------+        +-----------------+
  | Network Device  |        | Network Device  |
  |    (Router)     |        |    (Switch)     |
  |                 |        |                 |
  | MIB: IF-MIB     |        | MIB: IF-MIB     |
  |      IP-MIB     |        |      BRIDGE-MIB |
  +--------+--------+        +--------+--------+
           |                          |
           | SNMP GET/GETNEXT         | SNMP GET/GETNEXT
           | (UDP 161)                | (UDP 161)
           v                          v
  +--------+--------------------------+--------+
  |                SNMP Poller                 |
  |                                            |
  |  +---------------+     +---------------+   |
  |  |  Poll Queue   |---->|   Response    |   |
  |  |  (scheduled)  |     |    Parser     |   |
  |  +---------------+     +-------+-------+   |
  |                                |           |
  +--------------------------------+-----------+
                                   |
                                   v
                  +---------------------------+
                  |   Time-Series Database    |
                  |     (metrics storage)     |
                  +---------------------------+

Figure 1: SNMP polling architecture showing device query and metric storage flow
Flow monitoring captures traffic metadata without inspecting packet contents. When enabled on a router or switch, the device maintains a flow cache that tracks active connections. Each flow record contains source and destination addresses, ports, protocol, byte count, packet count, and timestamps. The device exports completed flow records to a collector at configurable intervals or upon flow termination.
NetFlow v5, the most widely deployed format, uses fixed 48-byte records containing IPv4 address pairs, port numbers, protocol, TCP flags, and counters. NetFlow v9 and IPFIX use template-based formats that support IPv6, MPLS labels, and custom fields. sFlow differs fundamentally by sampling packets rather than tracking flows; it captures 1-in-N packets (configurable, commonly 1-in-1000 or 1-in-2000) and extracts headers, providing statistical traffic visibility with lower device overhead than full flow tracking.
Flow data answers questions that SNMP cannot: which hosts generate the most traffic, what applications consume bandwidth, and where traffic originates and terminates. A flow collector receiving exports from edge routers can identify that 40% of bandwidth serves video conferencing, 25% serves file synchronisation, and 15% serves email, enabling capacity planning and policy decisions that raw interface counters cannot inform.
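A traffic breakdown like the one above can be computed by summing bytes per application label across flow records. A minimal sketch (the record fields here are illustrative, not a specific NetFlow library's schema):

```python
# Aggregate flow records into per-application bandwidth shares.
from collections import defaultdict

def traffic_by_app(flows):
    """Sum bytes per application label and return each share of the total."""
    totals = defaultdict(int)
    for flow in flows:
        totals[flow["app"]] += flow["bytes"]
    grand_total = sum(totals.values())
    return {app: b / grand_total for app, b in totals.items()}

flows = [
    {"app": "video", "bytes": 400},
    {"app": "sync",  "bytes": 250},
    {"app": "web",   "bytes": 150},
    {"app": "other", "bytes": 200},
]
shares = traffic_by_app(flows)   # e.g. shares["video"] == 0.4
```

Real collectors classify flows into applications by port, protocol, or deep packet inspection before this aggregation step.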
Packet capture provides complete visibility into network traffic at the cost of storage and processing overhead. Capture occurs at strategic points using network TAPs (Test Access Points) or switch mirror ports that copy traffic to analysis systems. Full packet capture stores entire frames including payloads, while header-only capture reduces storage requirements by 90% or more while retaining addressing and protocol information.
Continuous full packet capture at 1 Gbps generates approximately 450 GB per hour at full line rate, or roughly 225 GB per hour at 50% link utilisation. Organisations deploy packet capture selectively: at internet egress points for security analysis, at data centre boundaries for troubleshooting, and on-demand at specific segments during incident investigation. Packet capture complements rather than replaces polling and flow monitoring.
Monitoring architecture
Production monitoring systems follow a hierarchical architecture with collection tiers that aggregate data before central processing. This architecture reduces bandwidth consumption, provides resilience against collector failures, and enables geographic distribution of collection points.
The collection tier consists of pollers and collectors deployed near the devices they monitor. A poller queries devices via SNMP and forwards metrics to storage. A flow collector receives NetFlow or sFlow exports and processes them into aggregated records. Collection components run on modest hardware; a server with 4 CPU cores and 8 GB RAM handles SNMP polling for 2,000 devices or flow collection from 50 exporters generating 10,000 flows per second.
The storage tier persists metrics in time-series databases optimised for write-heavy workloads with time-based queries. Time-series databases store data points as timestamp-value pairs organised by metric name and dimensional tags. Storage requirements depend on metric cardinality (number of unique metric series), retention period, and collection interval. An environment with 1,000 devices, 50 metrics per device, 60-second collection, and 90-day retention requires approximately 500 GB of storage with typical compression ratios.
The presentation tier provides dashboards, alerting, and query interfaces. Dashboards display current and historical metrics through graphs, gauges, and status indicators. Alert rules evaluate metric values against thresholds or statistical models and trigger notifications when conditions match. Query interfaces allow ad-hoc investigation of historical data for troubleshooting and capacity planning.
+--------------------------------------------------------------+
|                 MONITORING ARCHITECTURE TIERS                |
+--------------------------------------------------------------+

COLLECTION TIER
+------------------+  +------------------+  +------------------+
| Regional Poller  |  | Regional Poller  |  | Regional Poller  |
| (HQ)             |  | (East Africa)    |  | (Southeast Asia) |
|                  |  |                  |  |                  |
| - SNMP polling   |  | - SNMP polling   |  | - SNMP polling   |
| - Flow collector |  | - Flow collector |  | - Flow collector |
| - Local buffer   |  | - Local buffer   |  | - Local buffer   |
+--------+---------+  +--------+---------+  +--------+---------+
         |                     |                     |
         +---------------------+---------------------+
                               |
                               v
AGGREGATION TIER
+--------------------------------------------------------------+
|                      Central Collector                       |
|                                                              |
|  +----------------+  +----------------+  +----------------+  |
|  |     Metric     |  |      Flow      |  |     Event      |  |
|  |   Aggregator   |  |   Aggregator   |  |   Correlator   |  |
|  +----------------+  +----------------+  +----------------+  |
+------------------------------+-------------------------------+
                               |
                               v
STORAGE TIER
+--------------------------------------------------------------+
|   +-------------------+        +-------------------+         |
|   |  Time-Series DB   |        |   Flow Database   |         |
|   |  (Prometheus/     |        |   (ClickHouse/    |         |
|   |  VictoriaMetrics) |        |   PostgreSQL)     |         |
|   +-------------------+        +-------------------+         |
+------------------------------+-------------------------------+
                               |
                               v
PRESENTATION TIER
+--------------------------------------------------------------+
|   +------------+      +------------+      +------------+     |
|   | Dashboards |      |  Alerting  |      |   Query    |     |
|   | (Grafana)  |      |   Engine   |      | Interface  |     |
|   +------------+      +------------+      +------------+     |
+--------------------------------------------------------------+

Figure 2: Four-tier monitoring architecture with regional collection, central aggregation, time-series storage, and presentation
For geographically distributed organisations, regional collectors reduce WAN bandwidth consumption and provide monitoring continuity during connectivity outages. A regional collector in Nairobi polling 200 devices across East African offices generates monitoring traffic within the region rather than across intercontinental links. The collector buffers metrics locally and forwards aggregated data to central storage, consuming 10-20 KB/s of WAN bandwidth rather than the 200-400 KB/s that direct polling from headquarters would require.
Regional collectors also provide monitoring resilience. When WAN connectivity fails, the regional collector continues polling local devices and buffering metrics. Upon reconnection, it forwards buffered data to central storage, maintaining continuity in historical records. Buffer sizing determines the maximum outage duration without data loss; a 10 GB buffer holds approximately 7 days of metrics for a 200-device regional deployment.
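Buffer capacity in days can be estimated from the metric rate and an assumed on-disk cost per data point. A sketch reproducing the 10 GB / 200-device example, where the per-device metric count and per-point size are illustrative assumptions:

```python
# Estimate how long a regional collector's local buffer lasts.
# metrics_per_device and bytes_per_point are illustrative assumptions,
# not figures from any particular time-series database.

def buffer_days(buffer_bytes, devices, metrics_per_device=50,
                interval_s=60, bytes_per_point=100):
    """Days of metrics a local buffer holds before overflowing."""
    rate_bytes_per_s = devices * metrics_per_device / interval_s * bytes_per_point
    return buffer_bytes / rate_bytes_per_s / 86_400

print(buffer_days(10 * 10**9, 200))   # ~6.9 days for the 200-device example
```

The result is sensitive to bytes_per_point, which depends heavily on the storage format's compression; measuring actual buffer growth in production is the reliable way to size it.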
Performance metrics
Network performance monitoring focuses on three primary metrics: latency, jitter, and packet loss. These metrics directly affect application performance and user experience, making them essential for service level management.
Latency measures the time required for a packet to traverse from source to destination. Round-trip time (RTT) captures the complete journey including the return path and is the most commonly measured latency metric. One-way latency provides more precise measurements but requires clock synchronisation between endpoints, typically achieved through GPS or the Precision Time Protocol (PTP).
Latency comprises multiple components: serialisation delay (time to place bits on the wire, inversely proportional to link speed), propagation delay (time for signals to traverse physical media, approximately 5 microseconds per kilometre in fibre), processing delay (time for devices to examine and forward packets), and queuing delay (time spent waiting in device buffers during congestion). For a packet traversing a 5,000 km path through 10 network devices, propagation delay contributes approximately 25 ms, while processing and queuing delays vary from microseconds during light load to tens of milliseconds during congestion.
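The component arithmetic can be captured in a small estimator (the per-hop delays are illustrative assumptions; queuing delay in particular varies with load, as the text notes):

```python
# One-way latency estimate from the components described above.
# per_hop_processing_us and per_hop_queuing_us are illustrative defaults.

def path_latency_ms(distance_km, hops,
                    per_hop_processing_us=50, per_hop_queuing_us=0):
    """Propagation (~5 us/km in fibre) plus per-device delays, in ms."""
    propagation_ms = distance_km * 5 / 1000
    device_ms = hops * (per_hop_processing_us + per_hop_queuing_us) / 1000
    return propagation_ms + device_ms

# 5,000 km through 10 devices: propagation alone contributes ~25 ms.
print(path_latency_ms(5000, 10))
```

During congestion, raising per_hop_queuing_us to a few thousand microseconds shows how queuing quickly dominates the fixed propagation floor.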
Monitoring systems measure latency through synthetic probes that send test packets between measurement points and record response times. ICMP echo (ping) provides basic reachability and latency measurement. TCP and UDP probes simulate application traffic patterns more accurately. Measurement frequency affects statistical validity; collecting 60 measurements per minute provides sufficient samples for calculating meaningful percentiles and detecting short-duration anomalies.
Jitter quantifies variation in latency over time. Consistent 50 ms latency affects real-time applications less severely than latency varying between 20 ms and 100 ms, even though the average equals 50 ms in both cases. Jitter calculation uses the difference between consecutive latency measurements; a series of measurements [45, 52, 48, 55, 47] yields jitter values of [7, 4, 7, 8] with mean jitter of 6.5 ms.
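The jitter calculation from the example can be written directly:

```python
def jitter_series(latencies_ms):
    """Absolute differences between consecutive latency samples."""
    return [abs(b - a) for a, b in zip(latencies_ms, latencies_ms[1:])]

samples = [45, 52, 48, 55, 47]
diffs = jitter_series(samples)            # [7, 4, 7, 8]
mean_jitter = sum(diffs) / len(diffs)     # 6.5 ms
```

Note that both series in the text's comparison average 50 ms latency, but their jitter series differ sharply, which is exactly what this metric exists to expose.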
Real-time applications including voice and video require jitter below specific thresholds to maintain quality. Voice over IP tolerates jitter up to 30 ms with appropriate buffering; video conferencing tolerates up to 50 ms. Jitter exceeding these thresholds causes audible artifacts in voice and visible stuttering in video. Monitoring dashboards should display jitter alongside latency to provide complete performance visibility.
Packet loss occurs when packets fail to reach their destination due to congestion, transmission errors, or device failures. Loss percentage calculation divides lost packets by total packets transmitted over a measurement period. Even small loss percentages significantly affect TCP-based applications because TCP interprets loss as congestion and reduces transmission rate; 1% packet loss can reduce TCP throughput by 75% under certain conditions.
Synthetic monitoring measures packet loss by transmitting known packet sequences and counting responses. The measurement period must be long enough to capture statistically meaningful loss rates; measuring 100 packets detects loss only when it exceeds 1%, while measuring 10,000 packets detects loss at 0.01% granularity. Production monitoring typically uses 5-minute measurement windows with 1000+ probe packets per window.
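The relationship between probe count and measurement granularity follows from a single lost probe being the smallest observable event:

```python
def loss_percent(sent: int, received: int) -> float:
    """Packet loss over a measurement window, as a percentage."""
    return 100.0 * (sent - received) / sent

def smallest_detectable_loss_pct(probes: int) -> float:
    """Finest non-zero loss rate a window of N probes can register."""
    return 100.0 / probes

print(smallest_detectable_loss_pct(100))      # 1.0  -> 1% granularity
print(smallest_detectable_loss_pct(10_000))   # 0.01 -> 0.01% granularity
```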
Bandwidth and utilisation
Bandwidth utilisation monitoring tracks how much of available capacity is in use, enabling capacity planning and congestion identification. SNMP polling provides interface counters from which utilisation derives through rate calculation.
Interface MIBs expose octet counters (IF-MIB::ifHCInOctets and IF-MIB::ifHCOutOctets for 64-bit counters) that increment with each byte transmitted or received. The monitoring system polls these counters at regular intervals, calculates the delta between consecutive readings, and divides by the interval duration to derive bytes per second. Multiplying by 8 converts to bits per second, the standard unit for network capacity.
For a gigabit interface polled at 60-second intervals, the calculation proceeds as follows: if ifHCInOctets reads 1,500,000,000 at time T and 1,875,000,000 at time T+60s, the delta equals 375,000,000 bytes. Dividing by 60 seconds yields 6,250,000 bytes per second, which equals 50,000,000 bits per second or 50 Mbps. On a 1 Gbps interface, this represents 5% utilisation.
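The same calculation in code, with the counter-wraparound handling a production poller needs (a sketch; 64-bit ifHC counters wrap rarely, but 32-bit counters on fast links wrap within minutes):

```python
COUNTER64_MAX = 2**64

def utilisation_pct(prev_octets, curr_octets, interval_s, link_bps,
                    counter_max=COUNTER64_MAX):
    """Interface utilisation from two SNMP octet-counter readings."""
    delta = (curr_octets - prev_octets) % counter_max   # handles wraparound
    bps = delta * 8 / interval_s
    return 100.0 * bps / link_bps

# Worked example from the text: 1 Gbps interface, 60-second interval.
print(utilisation_pct(1_500_000_000, 1_875_000_000, 60, 10**9))   # 5.0
```

The modulo operation makes a post-wrap reading smaller than its predecessor produce the correct positive delta, provided at most one wrap occurred during the interval.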
Utilisation thresholds trigger alerts when capacity constraints approach. Common thresholds set warnings at 70% utilisation and critical alerts at 85% utilisation, though appropriate values depend on traffic patterns. Bursty traffic on a link averaging 60% utilisation may experience congestion during peaks even though average utilisation appears acceptable. Monitoring systems should track peak utilisation (maximum reading in each measurement period) alongside average utilisation to detect this condition.
Flow data enriches utilisation monitoring by revealing traffic composition. While SNMP counters indicate that an interface carries 500 Mbps, flow analysis reveals that 200 Mbps serves video conferencing, 150 Mbps serves file synchronisation, 100 Mbps serves web browsing, and 50 Mbps serves other applications. This breakdown informs capacity planning decisions: adding capacity benefits all traffic, while optimising video conferencing or implementing quality of service policies targets specific consumption patterns.
+--------------------------------------------------------------+
|                 FLOW COLLECTION ARCHITECTURE                 |
+--------------------------------------------------------------+

  +------------------+            +------------------+
  |  Edge Router 1   |            |  Edge Router 2   |
  |                  |            |                  |
  |  NetFlow v9      |            |  NetFlow v9      |
  |  export enabled  |            |  export enabled  |
  +--------+---------+            +--------+---------+
           |                               |
           | UDP 2055                      | UDP 2055
           v                               v
  +--------+-------------------------------+--------+
  |                 Flow Collector                  |
  |                (nfdump/GoFlow2)                 |
  |                                                 |
  |  +------------------+       +--------------+    |
  |  |  Template Cache  |       |  Flow Cache  |    |
  |  | (decode formats) |       |  (aggregate) |    |
  |  +------------------+       +------+-------+    |
  |                                    |            |
  +------------------------------------+------------+
                                       |
         +-----------------------------+----------------------+
         |                             |                      |
         v                             v                      v
+------------------+       +-------------------+    +------------------+
|    Flow Files    |       |   Flow Database   |    |    Real-time     |
| (nfcapd format)  |       |   (ClickHouse)    |    |      Stream      |
|                  |       |                   |    |     (Kafka)      |
| - 5-min rotated  |       | - Queryable       |    | - Analytics      |
| - Compressed     |       | - Long retention  |    | - Anomaly detect |
+------------------+       +-------------------+    +------------------+

Figure 3: Flow collection architecture showing export from routers, collection, and multiple storage backends
Availability monitoring
Availability monitoring confirms that network devices and paths are operational and reachable. The fundamental availability measurement is reachability: can a probe packet successfully traverse to the target and return? ICMP echo (ping) serves as the standard reachability test, though some environments filter ICMP, requiring TCP or UDP-based alternatives.
Availability percentage is calculated as successful probes divided by total probes over a measurement period. For a device probed every 60 seconds over a 30-day month (approximately 43,200 probes), 99.9% availability permits 43 failed probes, or approximately 43 minutes of downtime. Service level agreements commonly specify availability targets: over the same month, 99% allows 7.2 hours of downtime, 99.9% allows 43 minutes, and 99.99% allows 4.3 minutes.
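The downtime budgets follow directly from the availability target (using the same 30-day month as the probe arithmetic above):

```python
def allowed_downtime_minutes(availability_pct, period_minutes=43_200):
    """Downtime budget for an availability target over a 30-day
    month (43,200 minutes)."""
    return period_minutes * (100.0 - availability_pct) / 100.0

print(allowed_downtime_minutes(99.0))    # 432.0 min (~7.2 hours)
print(allowed_downtime_minutes(99.9))    # ~43.2 min
print(allowed_downtime_minutes(99.99))   # ~4.3 min
```

Each added "nine" shrinks the budget tenfold, which is why the engineering cost of availability rises so steeply.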
Device availability and path availability measure different things. A router may remain operational while a specific path through it fails due to interface or routing problems. Comprehensive availability monitoring tests both device reachability (probe to the device itself) and path reachability (probe through the device to destinations beyond it). Testing paths end-to-end from source to destination reveals failures that device-focused monitoring misses.
SNMP traps provide immediate failure notification without waiting for the next poll cycle. When a router interface goes down, the router generates an SNMP trap (IF-MIB::linkDown) containing the interface index and status. The monitoring system receives this trap and immediately updates device status, enabling alert generation within seconds of failure rather than waiting up to 60 seconds for the next poll to detect the condition.
Trap-based monitoring requires reliable trap delivery and processing. SNMP traps use UDP, which provides no delivery guarantee; traps can be lost during network congestion or collector unavailability. Production deployments configure devices to send traps to multiple collectors and implement trap confirmation through inform requests (SNMPv2c/v3), which require acknowledgement and retry on failure.
Baselining and anomaly detection
Static thresholds fail to capture the dynamic nature of network behaviour. Traffic patterns that indicate problems during business hours may be normal during overnight batch processing. Baseline-based monitoring learns normal patterns and alerts on deviations, reducing false positives while catching anomalies that static thresholds miss.
Baseline construction analyses historical data to establish expected values for each metric at each time period. The simplest approach calculates hourly or daily averages over several weeks, establishing that a link typically carries 200 Mbps during business hours, 50 Mbps overnight, and 75 Mbps on weekends. More sophisticated approaches use seasonal decomposition to separate trend, seasonal, and residual components, or machine learning models that capture complex patterns.
Anomaly detection compares current values against baseline predictions. The deviation calculation subtracts the baseline value from the current value and divides by historical standard deviation to produce a z-score. A z-score exceeding 3 indicates the current value lies more than three standard deviations from expected, occurring by chance less than 0.3% of the time. Alerting on z-scores rather than absolute values adapts automatically to changing baselines.
Consider a WAN link with the following baseline profile for Monday 10:00-11:00: mean utilisation 180 Mbps, standard deviation 25 Mbps. Current utilisation of 250 Mbps produces z-score = (250 - 180) / 25 = 2.8, within normal variation. Current utilisation of 280 Mbps produces z-score = (280 - 180) / 25 = 4.0, indicating a statistically significant anomaly warranting investigation.
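The z-score check is a few lines of code (a minimal sketch; a production system would maintain per-metric, per-time-slot baseline statistics rather than hard-coded values):

```python
def z_score(current, baseline_mean, baseline_std):
    """Standard deviations between the current value and the baseline mean."""
    return (current - baseline_mean) / baseline_std

def is_anomaly(current, baseline_mean, baseline_std, threshold=3.0):
    return abs(z_score(current, baseline_mean, baseline_std)) > threshold

# Worked example from the text: Monday 10:00-11:00 baseline,
# mean 180 Mbps, standard deviation 25 Mbps.
print(z_score(250, 180, 25))   # 2.8 -> within normal variation
print(z_score(280, 180, 25))   # 4.0 -> statistically significant anomaly
```

Using abs() means the check also flags unusually low values, such as traffic dropping to near zero during business hours.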
Baseline periods must span sufficient history to capture normal variation. A two-week baseline captures weekly patterns but misses monthly patterns like end-of-month reporting spikes. A rolling 90-day baseline with weekly seasonality captures most recurring patterns while adapting to gradual trends. Initial deployment should run in learning mode for one full baseline period before enabling anomaly-based alerting.
Alerting architecture
Alert configuration transforms monitoring data into actionable notifications. The alerting pipeline evaluates conditions, applies filters to reduce noise, and routes notifications to appropriate recipients through configured channels.
Alert rules define conditions that trigger notifications. A simple rule fires when a metric crosses a threshold: “alert when interface utilisation exceeds 85% for 5 minutes.” The duration clause prevents alerting on momentary spikes that self-resolve. Complex rules combine multiple conditions: “alert when latency exceeds 100 ms AND packet loss exceeds 1% AND the condition persists for 10 minutes.”
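The duration clause amounts to a small state machine tracking how long the condition has held. A minimal sketch of "fire when utilisation exceeds 85% for 5 minutes" (class and method names are illustrative, not any alerting engine's API):

```python
class ThresholdRule:
    """Fires only after the condition holds continuously for `for_seconds`."""

    def __init__(self, threshold, for_seconds):
        self.threshold = threshold
        self.for_seconds = for_seconds
        self.pending_since = None   # time the condition first became true

    def evaluate(self, value, now):
        """Return 'ok', 'pending', or 'firing' for a sample at time `now`."""
        if value <= self.threshold:
            self.pending_since = None   # momentary spikes reset the clock
            return "ok"
        if self.pending_since is None:
            self.pending_since = now
        if now - self.pending_since >= self.for_seconds:
            return "firing"
        return "pending"

rule = ThresholdRule(threshold=85, for_seconds=300)
rule.evaluate(87, now=0)     # 'pending'
rule.evaluate(87, now=150)   # 'pending' (2m 30s elapsed)
rule.evaluate(87, now=300)   # 'firing'
```

A single below-threshold sample returns the rule to 'ok' and clears the pending timer, which is precisely how the duration clause suppresses self-resolving spikes.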
Alert severity classification determines response urgency and notification routing. A common scheme uses four levels: critical alerts indicate service-affecting failures requiring immediate response (device down, link failure); warning alerts indicate degraded conditions requiring prompt attention (high utilisation, elevated latency); informational alerts note conditions worth awareness but not immediate action (threshold approached, unusual pattern); and cleared alerts confirm that previously alerting conditions have resolved.
+----------------------------------------------------------+
|                 ALERT PROCESSING PIPELINE                |
+----------------------------------------------------------+

METRIC STREAM
+----------------------------------------------------------+
| interface_utilisation{device="rtr-hq-01",if="ge-0/0/1"}  |
|   = 87% (current)                                        |
+----------------------------+-----------------------------+
                             |
                             v
RULE EVALUATION
+----------------------------------------------------------+
| Rule:      utilisation_high                              |
| Condition: interface_utilisation > 85                    |
| Duration:  5m                                            |
| Status:    PENDING (2m 30s elapsed)                      |
+----------------------------+-----------------------------+
                             |
                             v  (after 5m)
ALERT FIRING
+----------------------------------------------------------+
| Alert:       HighInterfaceUtilisation                    |
| Severity:    warning                                     |
| Labels:      device=rtr-hq-01, interface=ge-0/0/1        |
| Annotations: "Interface at 87% utilisation"              |
+----------------------------+-----------------------------+
                             |
                             v
GROUPING AND DEDUPLICATION
+----------------------------------------------------------+
| Group by:    device                                      |
| Wait:        30s (collect related alerts)                |
| Deduplicate: suppress repeats within 4h                  |
+----------------------------+-----------------------------+
                             |
                             v
ROUTING
+----------------------------------------------------------+
| Match:          severity=warning AND device=~"rtr-.*"    |
| Route to:       network-team channel                     |
| Method:         Slack + email                            |
| Escalate after: 30m if unacknowledged                    |
+----------------------------------------------------------+

Figure 4: Alert processing pipeline from metric evaluation through notification routing
Alert grouping reduces notification volume by combining related alerts into single notifications. When multiple interfaces on the same device exceed thresholds simultaneously, grouping by device produces one notification listing all affected interfaces rather than separate notifications for each. Grouping windows of 30-60 seconds allow related alerts to accumulate before notification dispatch.
Alert inhibition suppresses lower-severity alerts when higher-severity alerts indicate root cause. When a router becomes unreachable (critical alert), all interface-level alerts for that router provide no additional information and should be inhibited. Inhibition rules specify that alerts matching certain labels suppress alerts matching other labels, reducing notification storms during major incidents.
Notification routing directs alerts to appropriate channels based on severity, time, and affected systems. Business hours warnings route to team channels; after-hours critical alerts route to on-call personnel via paging systems. Routing rules support escalation: if a critical alert remains unacknowledged for 15 minutes, escalate to secondary on-call; if still unacknowledged after 30 minutes, escalate to management.
Field network monitoring
Monitoring networks in field locations presents distinct challenges: limited bandwidth constrains monitoring traffic, intermittent connectivity interrupts data collection, and remote locations may lack local technical staff to investigate alerts.
Bandwidth constraints require monitoring efficiency. A field office with 512 Kbps satellite connectivity cannot sustain the monitoring traffic volumes appropriate for well-connected offices. Optimisation techniques include extending poll intervals (300 seconds rather than 60 seconds), reducing the polled metrics to an essential subset, using SNMP GETBULK requests to retrieve multiple values per round trip, and deploying local collectors that aggregate data before WAN transmission.
A field monitoring profile for bandwidth-constrained sites might poll device availability every 60 seconds (minimal traffic, essential for alerting), interface utilisation every 300 seconds (acceptable latency for capacity monitoring), and detailed device metrics every 900 seconds (sufficient for troubleshooting). This profile generates approximately 2-3 KB/s of monitoring traffic compared to 10-15 KB/s for standard polling intervals.
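A rough estimate of the profile's monitoring traffic can be built by summing per-poll payloads over their intervals (the device count and per-poll payload sizes below are illustrative assumptions, not measured figures):

```python
# Estimate site monitoring traffic from a tiered polling profile.
# Each tuple is (poll interval in seconds, bytes per poll per device);
# payload sizes here are assumed, not measured.

def site_monitoring_kBps(devices, polls):
    """Aggregate monitoring traffic for a site, in kilobytes per second."""
    per_device_Bps = sum(size / interval for interval, size in polls)
    return devices * per_device_Bps / 1000

field_profile = [
    (60, 500),      # availability probes
    (300, 5_000),   # interface utilisation
    (900, 20_000),  # detailed device metrics
]
print(site_monitoring_kBps(50, field_profile))   # on the order of 2-3 kB/s
```

Re-running the estimate with standard 60-second intervals for every tier shows why the tiered profile matters on a 512 Kbps link.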
Intermittent connectivity requires local buffering and store-and-forward capability. A regional collector deployed at a field hub maintains monitoring continuity during WAN outages, buffering metrics locally until connectivity restores. Buffer sizing must accommodate maximum expected outage duration; a site experiencing daily 4-hour connectivity windows requires minimum 20-hour buffer capacity to avoid data gaps.
Local alerting provides autonomous response capability during connectivity loss. The regional collector evaluates alert rules locally and can notify in-country staff through local channels (SMS, local email) even when central systems are unreachable. This architecture ensures that critical alerts reach responders regardless of WAN status.
Synthetic monitoring from central locations measures field connectivity quality from the user perspective. Probes from headquarters to field office endpoints measure the latency and packet loss that field staff experience when accessing central applications. These measurements complement device-level monitoring by capturing end-to-end path performance including internet segments outside organisational control.
+----------------------------------------------------------------+
|           DISTRIBUTED FIELD MONITORING ARCHITECTURE            |
+----------------------------------------------------------------+

HEADQUARTERS
+----------------------------------------------------------+
|  +------------------+          +------------------+      |
|  |     Central      |          |    Dashboard/    |      |
|  |    Monitoring    |<-------->|     Alerting     |      |
|  |  (aggregation)   |          |  (presentation)  |      |
|  +--------+---------+          +------------------+      |
+-----------+----------------------------------------------+
            |
            |  Aggregated metrics (compressed, batched)
            |
     +------+-----------------------------------+
     |                                          |
     v                                          v
REGIONAL HUB (East Africa)         REGIONAL HUB (Southeast Asia)
+------------------------+         +------------------------+
|  Regional Collector    |         |  Regional Collector    |
|                        |         |                        |
|  - Local polling       |         |  - Local polling       |
|  - Local alerting      |         |  - Local alerting      |
|  - 7-day buffer        |         |  - 7-day buffer        |
+-----------+------------+         +-----------+------------+
            |                                  |
    +-------+-------+                  +-------+-------+
    |       |       |                  |       |       |
    v       v       v                  v       v       v
FIELD SITES
+-------+ +-------+ +-------+      +-------+ +-------+ +-------+
|Site A | |Site B | |Site C |      |Site D | |Site E | |Site F |
|       | |       | |       |      |       | |       | |       |
|Router | |Router | |Router |      |Router | |Router | |Router |
|Switch | |Switch | |Switch |      |Switch | |Switch | |Switch |
|AP     | |AP     | |AP     |      |AP     | |AP     | |AP     |
+-------+ +-------+ +-------+      +-------+ +-------+ +-------+

Figure 5: Distributed monitoring architecture for field operations with regional collectors and local buffering
Service management integration
Monitoring systems generate data that informs service management processes. Integration between monitoring and service management platforms enables automated incident creation, enriches incidents with diagnostic data, and supports service level reporting.
Incident management integration automatically creates tickets when alerts fire. The integration maps alert severity to incident priority, populates incident descriptions with alert details and diagnostic links, and assigns incidents to appropriate resolver groups based on affected systems. When alerts clear, the integration can automatically resolve associated incidents or add resolution notes.
Configuration Management Database (CMDB) integration enriches alerts with business context. A raw alert stating “device rtr-nairobi-01 unreachable” provides technical information; CMDB integration adds that this device serves the Kenya country office, supports 45 users, and connects to the finance and programme management applications. This context helps responders assess impact and prioritise response.
Service level reporting draws on monitoring data to calculate achieved service levels against targets. Availability SLAs require uptime measurements from availability monitoring. Performance SLAs require latency and jitter measurements from synthetic monitoring. Automated reporting extracts these measurements, calculates achievement percentages, and generates reports for service review meetings.
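The availability calculation reduces to counting successful polls; a sketch, assuming each sample records whether the device answered at that poll and that polls occur at a fixed interval:

```python
# Sketch of availability SLA calculation from availability-monitoring
# samples. Each sample is (timestamp, up) at a fixed polling interval;
# the 99.5% target is an illustrative assumption.

def availability_percent(samples: list[tuple[float, bool]]) -> float:
    """Achieved availability as a percentage of polled intervals."""
    if not samples:
        return 0.0
    up = sum(1 for _, ok in samples if ok)
    return 100.0 * up / len(samples)

def sla_met(samples: list[tuple[float, bool]], target: float = 99.5) -> bool:
    """Compare achieved availability against the contracted target."""
    return availability_percent(samples) >= target
```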
Technology options
Network monitoring platforms range from comprehensive commercial suites to composable open source stacks. Selection depends on operational capacity, integration requirements, and budget constraints.
Open source monitoring stacks provide full functionality without licensing costs but require technical expertise for deployment and maintenance. A reference open source stack combines Prometheus for metric collection and storage, Grafana for visualisation and dashboards, Alertmanager for alert routing, and nfsen/nfdump or GoFlow2 for flow collection. This stack monitors thousands of devices on commodity hardware, scales horizontally for larger deployments, and integrates through standard APIs.
Prometheus uses a pull model where the monitoring server scrapes metrics from exporters running on or near monitored systems. The SNMP Exporter translates SNMP polling into Prometheus metrics. Configuration defines scrape targets and intervals; a typical configuration scrapes network device metrics every 60 seconds and synthetic probe results every 15 seconds. Prometheus stores metrics in its time-series database with configurable retention, commonly 15-90 days for detailed data with downsampled long-term storage in systems like Thanos or VictoriaMetrics.
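A prometheus.yml fragment illustrating this pattern might look like the following; the job name, device names, and exporter address are assumptions, while the relabelling idiom (pointing the scrape at the exporter and passing the device as a parameter) follows the SNMP Exporter's documented usage. Synthetic probe jobs use the same pattern at a shorter interval:

```yaml
# Sketch of a Prometheus scrape job for SNMP polling via snmp_exporter.
# Device names, the exporter address, and the job name are illustrative.
scrape_configs:
  - job_name: snmp-network-devices
    scrape_interval: 60s            # device metrics every 60 seconds
    metrics_path: /snmp
    params:
      module: [if_mib]              # interface metrics via IF-MIB
    static_configs:
      - targets:                    # the devices to poll, not the exporter
          - rtr-nairobi-01
          - sw-nairobi-01
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target    # device becomes ?target= parameter
      - source_labels: [__param_target]
        target_label: instance          # keep the device name as the label
      - target_label: __address__
        replacement: snmp-exporter:9116 # actually scrape the exporter
```

The relabelling is the non-obvious step: Prometheus scrapes the exporter once per device, with the device address carried as a query parameter.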
LibreNMS and Zabbix provide integrated monitoring platforms with web interfaces, auto-discovery, and built-in alerting. These platforms suit organisations preferring turnkey solutions over composable stacks. LibreNMS focuses specifically on network monitoring with extensive device support and network-centric features. Zabbix provides broader infrastructure monitoring capabilities including network devices, servers, and applications in a unified platform.
Commercial platforms including PRTG, LogicMonitor, and Datadog offer managed services with reduced operational overhead. These platforms handle infrastructure, scaling, and maintenance, charging subscription fees based on monitored device counts or metric volume. Commercial platforms suit organisations with limited technical capacity for self-hosted monitoring or those preferring operational expenditure over capital investment in monitoring infrastructure.
For organisations with constrained IT capacity, cloud-hosted open source options provide middle ground: platforms like Grafana Cloud offer managed Prometheus and Grafana with free tiers covering small deployments (up to 10,000 active series) and predictable pricing for larger environments. This approach provides open source flexibility with reduced operational burden.
Implementation considerations
Deployment complexity scales with organisational distribution and monitoring requirements. A single-office organisation with 50 network devices requires only a single monitoring server; a globally distributed organisation with 500 devices across 30 locations requires regional collectors, central aggregation, and careful attention to WAN bandwidth consumption.
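A back-of-envelope calculation helps size that WAN consumption before deployment; the bytes-per-sample figure and device counts below are illustrative assumptions, not measurements:

```python
# Rough sizing of upstream WAN bandwidth for forwarding aggregated
# metrics from a regional collector to headquarters. All inputs are
# illustrative assumptions; real costs depend on compression and batching.

def wan_kbps(devices: int, series_per_device: int,
             interval_s: int, bytes_per_sample: int = 50) -> float:
    """Average upstream bandwidth in kbit/s for metric forwarding."""
    samples_per_second = devices * series_per_device / interval_s
    return samples_per_second * bytes_per_sample * 8 / 1000

# e.g. 100 devices x 200 series at a 60-second interval
```

Even rough numbers like these show whether metric forwarding will be negligible or a meaningful fraction of a constrained field link.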
Initial deployment should establish baseline visibility before optimising. Deploy core polling and availability monitoring first, establish dashboards showing device status and utilisation, and operate for 2-4 weeks to establish baselines before configuring anomaly-based alerting. This approach prevents alert storms from thresholds set without understanding normal behaviour.
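Once a baseline exists, alert thresholds can be derived from it rather than guessed; a minimal sketch using mean plus three standard deviations (the three-sigma choice is an illustrative convention, not a universal rule):

```python
# Sketch of baseline-derived alerting: after weeks of samples, alert
# only when a value departs from the historical norm. The three-sigma
# multiplier is an illustrative default.
from statistics import mean, stdev

def baseline_threshold(history: list[float], sigmas: float = 3.0) -> float:
    """Upper alert threshold as mean + N sample standard deviations."""
    return mean(history) + sigmas * stdev(history)

def is_anomalous(value: float, history: list[float]) -> bool:
    """True when the current measurement exceeds the baseline threshold."""
    return value > baseline_threshold(history)
```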
Monitoring credential management requires attention. SNMPv3 credentials, flow export configurations, and API tokens represent sensitive access that must be protected and rotated. Store credentials in secrets management systems rather than plain-text configuration files. Implement separate read-only monitoring credentials distinct from administrative credentials to limit exposure if monitoring systems are compromised.
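A sketch of keeping SNMPv3 credentials out of configuration files, using environment variables as a stand-in for a proper secrets manager (the variable names are illustrative):

```python
# Sketch of loading read-only SNMPv3 monitoring credentials from the
# environment, a stand-in for a secrets manager, instead of plain-text
# config files. Variable names are illustrative assumptions.
import os

def load_snmpv3_credentials() -> dict:
    """Read-only SNMPv3 credentials, sourced outside version control."""
    creds = {
        "user": os.environ.get("SNMPV3_MONITOR_USER"),
        "auth_passphrase": os.environ.get("SNMPV3_AUTH_PASS"),
        "priv_passphrase": os.environ.get("SNMPV3_PRIV_PASS"),
    }
    # Fail loudly at startup rather than polling with empty credentials
    missing = [k for k, v in creds.items() if not v]
    if missing:
        raise RuntimeError(f"missing monitoring secrets: {missing}")
    return creds
```

Failing at startup when a secret is absent is deliberate: a collector silently polling with empty credentials looks like a network outage rather than a configuration error.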
Documentation should capture monitoring architecture decisions: what is monitored, collection intervals, retention periods, alerting thresholds, and escalation procedures. Runbooks for common alert responses reduce mean time to resolution and enable consistent handling across team members. Dashboard documentation explains what visualisations show and how to interpret them.
For organisations with minimal IT capacity, prioritise availability monitoring (know when things are down), utilisation monitoring for internet links (know when capacity constrains work), and basic alerting to email or messaging platforms. This minimal configuration provides essential visibility with modest implementation effort. Expand to performance monitoring, flow analysis, and sophisticated alerting as capacity allows.
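For that minimal configuration, availability monitoring can be as simple as a scheduled reachability probe; a sketch using TCP connection tests (the port and timeout are illustrative, and the notification step is left to whatever email or messaging integration is available):

```python
# Minimal availability check for small deployments: probe TCP
# reachability of key devices and return the failures for alerting.
# Port and timeout defaults are illustrative assumptions.
import socket

def is_reachable(host: str, port: int = 22, timeout: float = 3.0) -> bool:
    """TCP connect test as a simple up/down probe."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

def check_sites(hosts: list[str]) -> list[str]:
    """Return the hosts that failed the probe, for email/chat alerting."""
    return [h for h in hosts if not is_reachable(h)]
```

Run from cron every few minutes with the failing-host list piped to email or a messaging webhook, this covers the essential know-when-things-are-down requirement with a few lines of code.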