# Infrastructure Monitoring
Infrastructure monitoring collects and analyses metrics from servers, network devices, storage systems, and cloud resources to detect degradation before service impact occurs. This task establishes the collectors, agents, and integrations that feed your monitoring platform with the raw data required for alerting and capacity planning.
## Prerequisites
Before implementing infrastructure monitoring, verify that the following requirements are satisfied.
**Monitoring platform deployed.** A functioning monitoring system must be operational and accessible. This procedure assumes one of the following platforms:
| Platform | Deployment model | Agent protocol | Minimum version |
|---|---|---|---|
| Prometheus + Grafana | Self-hosted | HTTP pull (scrape) | Prometheus 2.45+, Grafana 10+ |
| Zabbix | Self-hosted | Zabbix agent, SNMP, IPMI | 6.4+ |
| Checkmk | Self-hosted or SaaS | Checkmk agent, SNMP | 2.2+ |
| Datadog | SaaS | Datadog agent | Agent 7+ |
**Network connectivity established.** Monitoring traffic must traverse your network without obstruction. For agent-based collection, outbound connectivity from monitored hosts to the monitoring server is required on the agent's port. For SNMP-based collection, inbound UDP 161 access to network devices from the monitoring server is required. For cloud API collection, outbound HTTPS to provider API endpoints is required.
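A few quick reachability checks before deployment can save debugging later. This is a sketch using hostnames from the examples later in this procedure; substitute your own hosts and ports:

```sh
# Pull-based collection: from the monitoring server, check the exporter endpoint
curl -s --max-time 5 http://webserver01.example.org:9100/metrics > /dev/null && echo "exporter reachable"

# Agent-based collection: from a monitored host, check the monitoring server port
# (10051 shown here for Zabbix active checks; substitute your platform's agent port)
nc -zv monitoring.example.org 10051

# SNMP collection: from the monitoring server, query sysDescr over UDP 161
snmpget -v2c -c readonly-community switch01.example.org 1.3.6.1.2.1.1.1.0
```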
**Credentials and access prepared.** Assemble the following before beginning:
- SSH access or local administrator rights on servers receiving agents
- SNMP community strings or SNMPv3 credentials for network devices
- Read-only API credentials for cloud providers (AWS IAM user with CloudWatch read access, Azure service principal with Monitoring Reader role, GCP service account with Monitoring Viewer role)
- Service account for the monitoring platform with appropriate permissions
**Baseline data available.** Normal operating ranges cannot be established without historical context. If this is a new deployment, plan to collect data for 14 days before setting thresholds. If migrating from another monitoring system, export historical baselines for reference.
**Target inventory documented.** List all infrastructure components to be monitored with their hostnames, IP addresses, operating systems, and roles. A spreadsheet or CMDB export suffices. For network devices, include model numbers and firmware versions to verify SNMP MIB compatibility.
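For example, a minimal CSV inventory might look like the following (hostnames match the examples used later in this procedure; addresses, models, and versions are placeholders):

```
hostname,ip_address,os_or_firmware,role,model
webserver01.example.org,10.0.2.11,Ubuntu 22.04,web,
dbserver01.example.org,10.0.2.21,Rocky Linux 9,database,
switch01.example.org,10.0.1.2,IOS 15.2(7)E,access-switch,Cisco C2960X
```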
## Procedure
Infrastructure monitoring implementation proceeds through six phases: server monitoring, network monitoring, storage monitoring, cloud infrastructure monitoring, threshold configuration, and dashboard creation. Complete each phase for your relevant infrastructure before proceeding to the next.
### Phase 1: Server monitoring
Server monitoring captures compute resource utilisation, system health indicators, and process states. The collection mechanism varies by operating system and monitoring platform.
**Deploy the monitoring agent to Linux servers.**
For Prometheus-based monitoring, install Node Exporter on each Linux server. Node Exporter exposes system metrics on an HTTP endpoint that Prometheus scrapes at configured intervals.
On Debian/Ubuntu systems:
```sh
sudo apt update
sudo apt install prometheus-node-exporter
sudo systemctl enable prometheus-node-exporter
sudo systemctl start prometheus-node-exporter
```
On RHEL/Rocky/AlmaLinux systems:
```sh
sudo dnf install node_exporter
sudo systemctl enable node_exporter
sudo systemctl start node_exporter
```
Verify the exporter is running and accessible:
```sh
curl http://localhost:9100/metrics | head -20
```
Expected output shows metric lines beginning with `node_`:
```
# HELP node_cpu_seconds_total Seconds the CPUs spent in each mode.
# TYPE node_cpu_seconds_total counter
node_cpu_seconds_total{cpu="0",mode="idle"} 258459.92
node_cpu_seconds_total{cpu="0",mode="iowait"} 1029.36
```
For Zabbix-based monitoring, install the Zabbix agent:
```sh
# Add Zabbix repository first (Debian/Ubuntu example)
wget https://repo.zabbix.com/zabbix/6.4/ubuntu/pool/main/z/zabbix-release/zabbix-release_6.4-1+ubuntu22.04_all.deb
sudo dpkg -i zabbix-release_6.4-1+ubuntu22.04_all.deb
sudo apt update
sudo apt install zabbix-agent2
```
Configure the agent to connect to your Zabbix server by editing `/etc/zabbix/zabbix_agent2.conf`:
```
Server=monitoring.example.org
ServerActive=monitoring.example.org
Hostname=webserver01.example.org
```
Start the agent:
```sh
sudo systemctl enable zabbix-agent2
sudo systemctl start zabbix-agent2
```
**Deploy the monitoring agent to Windows servers.**
For Prometheus-based monitoring, download Windows Exporter from the project’s GitHub releases page. Install using the MSI package with default options, which registers the service to start automatically:
```powershell
msiexec /i windows_exporter-0.25.1-amd64.msi
```
Verify the exporter is accessible:
```powershell
# Fetch the metrics page and show the first 20 lines
(Invoke-WebRequest -Uri http://localhost:9182/metrics).Content -split "`n" | Select-Object -First 20
```
For Zabbix-based monitoring, download the Zabbix agent MSI from the Zabbix website. During installation, specify your Zabbix server hostname and the local hostname for this server.
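For unattended rollout, the Zabbix agent MSI accepts its connection settings as installer properties. A sketch with a placeholder filename; verify the property names against the Zabbix documentation for your agent version:

```powershell
# Silent install, passing server and hostname as MSI properties
msiexec /i zabbix_agent2-VERSION-amd64-openssl.msi /qn `
  SERVER=monitoring.example.org `
  SERVERACTIVE=monitoring.example.org `
  HOSTNAME=fileserver01.example.org
```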
**Register monitored servers with the monitoring platform.**
For Prometheus, add scrape targets to your `prometheus.yml` configuration file. Each target specifies the host and port where metrics are exposed:
```yaml
scrape_configs:
  - job_name: 'linux-servers'
    static_configs:
      - targets:
          - 'webserver01.example.org:9100'
          - 'webserver02.example.org:9100'
          - 'dbserver01.example.org:9100'
    relabel_configs:
      - source_labels: [__address__]
        target_label: instance
        regex: '([^:]+):\d+'
        replacement: '${1}'
  - job_name: 'windows-servers'
    static_configs:
      - targets:
          - 'fileserver01.example.org:9182'
          - 'appserver01.example.org:9182'
```
For deployments exceeding 50 servers, use file-based service discovery instead of static configuration. Create a JSON file listing targets:
[{"targets": ["webserver01.example.org:9100", "webserver02.example.org:9100"],"labels": {"env": "production", "role": "web"}},{"targets": ["dbserver01.example.org:9100"],"labels": {"env": "production", "role": "database"}}]Reference the file in your Prometheus configuration:
```yaml
scrape_configs:
  - job_name: 'linux-servers'
    file_sd_configs:
      - files:
          - '/etc/prometheus/targets/linux-servers.json'
        refresh_interval: 5m
```
Reload Prometheus configuration:
```sh
curl -X POST http://localhost:9090/-/reload
```
The reload endpoint is only available when Prometheus is started with `--web.enable-lifecycle`.
For Zabbix, navigate to Configuration → Hosts → Create host in the web interface. Assign the appropriate template (Linux by Zabbix agent for Linux servers, Windows by Zabbix agent for Windows servers) to enable standard metric collection.
**Configure essential server metrics.**
The following metrics form the baseline for server health monitoring. All values should be collected at 60-second intervals for operational monitoring. Longer intervals (300 seconds) are acceptable for capacity planning metrics where real-time visibility is unnecessary.
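In Prometheus, the collection interval is set globally and can be overridden per scrape job. A minimal sketch; the capacity-planning job and its target are illustrative, not part of the earlier configuration:

```yaml
global:
  scrape_interval: 60s       # operational monitoring cadence
  evaluation_interval: 60s   # how often recording and alerting rules run

scrape_configs:
  - job_name: 'capacity-planning'
    scrape_interval: 300s    # slower cadence where real-time visibility is unnecessary
    static_configs:
      - targets: ['archive01.example.org:9100']  # illustrative target
```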
For Prometheus with Node Exporter, the metrics are collected automatically. Create recording rules to pre-calculate commonly used aggregations. Add to `/etc/prometheus/rules/server-rules.yml`:
```yaml
groups:
  - name: server-metrics
    rules:
      - record: instance:node_cpu_utilisation:ratio
        expr: 1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m]))
      - record: instance:node_memory_utilisation:ratio
        expr: 1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)
      - record: instance:node_filesystem_utilisation:ratio
        expr: 1 - (node_filesystem_avail_bytes{fstype=~"ext4|xfs"} / node_filesystem_size_bytes{fstype=~"ext4|xfs"})
      - record: instance:node_disk_io_utilisation:ratio
        expr: rate(node_disk_io_time_seconds_total[5m])
```
For Windows Exporter, equivalent metrics use different names:
```yaml
groups:
  - name: windows-metrics
    rules:
      - record: instance:windows_cpu_utilisation:ratio
        expr: 1 - avg by (instance) (rate(windows_cpu_time_total{mode="idle"}[5m]))
      - record: instance:windows_memory_utilisation:ratio
        expr: 1 - (windows_os_physical_memory_free_bytes / windows_cs_physical_memory_bytes)
      - record: instance:windows_disk_utilisation:ratio
        expr: 1 - (windows_logical_disk_free_bytes / windows_logical_disk_size_bytes)
```
### Phase 2: Network monitoring
Network monitoring tracks device availability, interface utilisation, error rates, and traffic patterns across switches, routers, firewalls, and wireless access points.
**Enable SNMP on network devices.**
SNMP (Simple Network Management Protocol) remains the standard mechanism for network device monitoring. SNMPv3 provides authentication and encryption; use SNMPv2c only when devices lack SNMPv3 support.
Configuration syntax varies by vendor. For Cisco IOS devices:
```
snmp-server community readonly-community RO
snmp-server location "Headquarters DC Rack A3"
snmp-server contact "it-operations@example.org"
snmp-server enable traps
snmp-server host 10.0.1.50 version 2c readonly-community
```
For Juniper Junos devices:
```
set snmp community readonly-community authorization read-only
set snmp location "Headquarters DC Rack A3"
set snmp contact "it-operations@example.org"
set snmp trap-group monitoring-traps targets 10.0.1.50
```
For SNMPv3 (recommended), configure authentication and privacy:
```
# Cisco IOS SNMPv3
snmp-server group monitoring-group v3 priv
snmp-server user monitoring-user monitoring-group v3 auth sha AuthPassword priv aes 128 PrivPassword
```
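Before adding devices to the monitoring platform, confirm the credentials work from the monitoring server using the net-snmp command-line tools (credentials here match the examples above):

```sh
# SNMPv2c: walk the system subtree (sysDescr, sysName, ...)
snmpwalk -v2c -c readonly-community switch01.example.org 1.3.6.1.2.1.1

# SNMPv3: equivalent walk using the user configured above
snmpwalk -v3 -l authPriv -u monitoring-user -a SHA -A AuthPassword -x AES -X PrivPassword \
  switch01.example.org 1.3.6.1.2.1.1
```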
**Configure SNMP polling in your monitoring platform.**
For Prometheus-based monitoring, deploy the SNMP Exporter. This component translates SNMP OIDs into Prometheus metrics using generator-produced configuration.
Install SNMP Exporter:
```sh
wget https://github.com/prometheus/snmp_exporter/releases/download/v0.24.1/snmp_exporter-0.24.1.linux-amd64.tar.gz
tar xzf snmp_exporter-0.24.1.linux-amd64.tar.gz
sudo mv snmp_exporter-0.24.1.linux-amd64/snmp_exporter /usr/local/bin/
```
The default `snmp.yml` configuration supports standard MIBs. For vendor-specific MIBs, use the generator tool to create custom configurations.
Create a systemd service file at `/etc/systemd/system/snmp-exporter.service`:
```ini
[Unit]
Description=Prometheus SNMP Exporter
After=network.target

[Service]
User=prometheus
ExecStart=/usr/local/bin/snmp_exporter --config.file=/etc/prometheus/snmp.yml
Restart=always

[Install]
WantedBy=multi-user.target
```
Start the exporter:
```sh
sudo systemctl daemon-reload
sudo systemctl enable snmp-exporter
sudo systemctl start snmp-exporter
```
Add network device targets to your Prometheus configuration:
```yaml
scrape_configs:
  - job_name: 'network-devices'
    static_configs:
      - targets:
          - 'switch01.example.org'
          - 'switch02.example.org'
          - 'router01.example.org'
          - 'firewall01.example.org'
    metrics_path: /snmp
    params:
      module: [if_mib]
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: localhost:9116  # SNMP Exporter address
```
For Zabbix, add each network device as a host and assign the appropriate SNMP template. Zabbix includes templates for common vendors (Cisco, Juniper, HP/Aruba, Ubiquiti) that auto-discover interfaces and apply standard items.
**Implement interface monitoring.**
Interface metrics reveal bandwidth utilisation, error rates, and packet loss. The IF-MIB standard provides these metrics across vendors.
Key interface metrics to monitor:
- `ifHCInOctets` - Bytes received (64-bit counter)
- `ifHCOutOctets` - Bytes transmitted (64-bit counter)
- `ifInErrors` - Inbound packet errors
- `ifOutErrors` - Outbound packet errors
- `ifInDiscards` - Inbound packets discarded
- `ifOutDiscards` - Outbound packets discarded
- `ifOperStatus` - Interface operational state (1=up, 2=down)
- `ifSpeed` - Interface speed in bits per second

Calculate utilisation as a percentage of interface capacity. For a 1 Gbps interface with 5-minute average traffic of 400 Mbps inbound:
```
Utilisation = (400,000,000 / 1,000,000,000) × 100 = 40%
```
In Prometheus, express this as a query:
```promql
rate(ifHCInOctets{ifDescr="GigabitEthernet0/1"}[5m]) * 8 / ifSpeed * 100
```
**Configure network device availability monitoring.**
ICMP ping provides basic reachability verification. For Prometheus, deploy the Blackbox Exporter:
```sh
wget https://github.com/prometheus/blackbox_exporter/releases/download/v0.24.0/blackbox_exporter-0.24.0.linux-amd64.tar.gz
tar xzf blackbox_exporter-0.24.0.linux-amd64.tar.gz
sudo mv blackbox_exporter-0.24.0.linux-amd64/blackbox_exporter /usr/local/bin/
```
Configure ICMP probing in `/etc/prometheus/blackbox.yml`:
```yaml
modules:
  icmp:
    prober: icmp
    timeout: 5s
    icmp:
      preferred_ip_protocol: ip4
```
Add ping targets to Prometheus:
```yaml
scrape_configs:
  - job_name: 'network-ping'
    metrics_path: /probe
    params:
      module: [icmp]
    static_configs:
      - targets:
          - 'switch01.example.org'
          - 'switch02.example.org'
          - 'router01.example.org'
          - '192.168.1.1'  # ISP gateway
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: localhost:9115  # Blackbox Exporter address
```
The following diagram illustrates the network monitoring data flow from devices through collectors to the monitoring platform:
```
+---------------------------------------------------------------+
|                    NETWORK INFRASTRUCTURE                     |
|                                                               |
|  +----------+   +----------+   +----------+   +----------+    |
|  |   Core   |   |  Access  |   | Firewall |   | Wireless |    |
|  |  Switch  |   | Switches |   |          |   |   APs    |    |
|  | SNMP:161 |   | SNMP:161 |   | SNMP:161 |   | SNMP:161 |    |
|  +----+-----+   +----+-----+   +----+-----+   +----+-----+    |
+-------|--------------|--------------|--------------|----------+
        |              |              |              |
        +-----+--------+--------------+--------+-----+
              |                                |
              v                                v
+---------------------------------------------------------------+
|                      MONITORING NETWORK                       |
|                                                               |
|    +------------------+          +-------------------+        |
|    |  SNMP Exporter   |          | Blackbox Exporter |        |
|    |      :9116       |          |       :9115       |        |
|    |  SNMP polling    |          |   ICMP probes     |        |
|    |   every 60s      |          |    every 30s      |        |
|    +---------+--------+          +---------+---------+        |
|              |                             |                  |
|              +--------------+--------------+                  |
|                             |                                 |
|                             v                                 |
|                  +---------------------+                      |
|                  |      Prometheus     |                      |
|                  |        :9090        |                      |
|                  |  Scrapes exporters  |                      |
|                  |  Stores time series |                      |
|                  +----------+----------+                      |
|                             |                                 |
|                             v                                 |
|                  +---------------------+                      |
|                  |       Grafana       |                      |
|                  |        :3000        |                      |
|                  |      Dashboards     |                      |
|                  |    Visualisation    |                      |
|                  +---------------------+                      |
+---------------------------------------------------------------+
```
Figure 1: Network monitoring architecture showing SNMP collection and availability probing
### Phase 3: Storage monitoring
Storage monitoring tracks capacity consumption, performance characteristics, and health indicators across local disks, network storage, and storage area networks.
**Configure local disk monitoring.**
Node Exporter and Windows Exporter include disk metrics by default. The relevant metrics for capacity monitoring are:
```promql
# Linux filesystem capacity
node_filesystem_size_bytes{fstype=~"ext4|xfs|btrfs"}
node_filesystem_avail_bytes{fstype=~"ext4|xfs|btrfs"}

# Windows disk capacity
windows_logical_disk_size_bytes
windows_logical_disk_free_bytes
```
For disk performance, monitor I/O operations and latency:
```promql
# Linux disk I/O
rate(node_disk_reads_completed_total[5m])    # Read IOPS
rate(node_disk_writes_completed_total[5m])   # Write IOPS
rate(node_disk_read_bytes_total[5m])         # Read throughput
rate(node_disk_written_bytes_total[5m])      # Write throughput
rate(node_disk_read_time_seconds_total[5m]) / rate(node_disk_reads_completed_total[5m])    # Read latency
rate(node_disk_write_time_seconds_total[5m]) / rate(node_disk_writes_completed_total[5m])  # Write latency

# Windows disk I/O
windows_logical_disk_reads_total
windows_logical_disk_writes_total
windows_logical_disk_read_seconds_total
windows_logical_disk_write_seconds_total
```
Create recording rules for disk health metrics in `/etc/prometheus/rules/storage-rules.yml`:
```yaml
groups:
  - name: storage-metrics
    rules:
      - record: instance:node_disk_read_latency:avg5m
        expr: rate(node_disk_read_time_seconds_total[5m]) / rate(node_disk_reads_completed_total[5m])
      - record: instance:node_disk_write_latency:avg5m
        expr: rate(node_disk_write_time_seconds_total[5m]) / rate(node_disk_writes_completed_total[5m])
      - record: instance:node_disk_iops:rate5m
        expr: rate(node_disk_reads_completed_total[5m]) + rate(node_disk_writes_completed_total[5m])
```
**Configure network storage monitoring.**
For NFS servers, monitor export availability and client connections through the NFS exporter or host-level metrics. For NFS clients, Node Exporter exposes NFS operation statistics through the `node_nfs_*` metrics when the `nfs` collector is enabled.
For SMB/CIFS shares on Windows, use Windows Exporter with the `smb` collector enabled. Edit the Windows Exporter configuration or pass collector flags:
```powershell
# Enable the SMB collector alongside the default set
windows_exporter.exe --collectors.enabled "cpu,cs,logical_disk,memory,net,os,process,smb"
```
For dedicated NAS appliances (Synology, QNAP, NetApp), use SNMP monitoring with vendor-specific MIBs. Many NAS vendors also provide Prometheus exporters or API integrations.
**Configure SAN monitoring.**
Storage area network monitoring requires vendor-specific approaches. For Fibre Channel SANs, monitor switch port statistics through SNMP. For iSCSI, monitor target availability and session state.
Common SAN metrics to collect:
- Storage array capacity (total, used, available)
- Volume/LUN capacity utilisation
- Array controller CPU and cache utilisation
- Port throughput and errors
- Disk health and predictive failure indicators
- Replication lag (if applicable)

For open-source storage platforms like TrueNAS, enable the Prometheus endpoint in the web interface. For commercial arrays, consult vendor documentation for monitoring integration options. Most enterprise arrays (NetApp, Dell EMC, Pure Storage, HPE) provide REST APIs that can be scraped with custom exporters or vendor-provided integrations.
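Whichever integration you choose, the exporter ends up as one more scrape job. A sketch assuming a hypothetical array exporter listening on port 9108; the hostname, port, and label are placeholders:

```yaml
scrape_configs:
  - job_name: 'storage-arrays'
    static_configs:
      - targets: ['san-exporter.example.org:9108']  # hypothetical exporter host:port
        labels:
          array: 'primary-san'
```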
### Phase 4: Cloud infrastructure monitoring
Cloud infrastructure monitoring extends visibility to resources deployed in public cloud environments. Each provider offers native monitoring that can be integrated with your existing monitoring platform.
**Configure AWS CloudWatch integration.**
For Prometheus-based monitoring, deploy the CloudWatch Exporter, or use YACE (Yet Another CloudWatch Exporter) for more flexible configuration.
Create an IAM user or role with the following policy:
{"Version": "2012-10-17","Statement": [{"Effect": "Allow","Action": ["cloudwatch:GetMetricData","cloudwatch:GetMetricStatistics","cloudwatch:ListMetrics","ec2:DescribeInstances","ec2:DescribeVolumes","rds:DescribeDBInstances","elasticloadbalancing:DescribeLoadBalancers","tag:GetResources"],"Resource": "*"}]}Configure YACE in
/etc/yace/config.yml:discovery:jobs:- type: AWS/EC2regions:- eu-west-1period: 300length: 300metrics:- name: CPUUtilizationstatistics: [Average, Maximum]- name: NetworkInstatistics: [Sum]- name: NetworkOutstatistics: [Sum]- name: DiskReadOpsstatistics: [Sum]- name: DiskWriteOpsstatistics: [Sum]- type: AWS/RDSregions:- eu-west-1period: 300length: 300metrics:- name: CPUUtilizationstatistics: [Average]- name: FreeStorageSpacestatistics: [Average]- name: DatabaseConnectionsstatistics: [Average]- name: ReadIOPSstatistics: [Average]- name: WriteIOPSstatistics: [Average]Add the exporter to your Prometheus configuration:
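YACE itself can run as a container alongside Prometheus. The sketch below follows the project's README conventions; the image path, mount points, and tag are assumptions and should be verified against the YACE release you deploy:

```sh
# Run YACE, mounting the config file and read-only AWS credentials
docker run -d --name yace -p 5000:5000 \
  -v /etc/yace/config.yml:/tmp/config.yml \
  -v ~/.aws/credentials:/exporter/.aws/credentials:ro \
  ghcr.io/nerdswords/yet-another-cloudwatch-exporter:latest  # pin a specific tag in production
```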
Add the exporter to your Prometheus configuration:
```yaml
scrape_configs:
  - job_name: 'aws-cloudwatch'
    static_configs:
      - targets: ['localhost:5000']
```
**Configure Azure Monitor integration.**
For Azure, use the Azure Monitor Exporter for Prometheus. Create a service principal with the Monitoring Reader role:
```sh
az ad sp create-for-rbac --name "prometheus-monitor" --role "Monitoring Reader" --scopes /subscriptions/YOUR_SUBSCRIPTION_ID
```
Configure the exporter with subscription and resource group filters to avoid excessive API calls and costs:
```yaml
active_directory_authority_url: "https://login.microsoftonline.com/"
resource_manager_url: "https://management.azure.com/"
credentials:
  subscription_id: "your-subscription-id"
  client_id: "your-client-id"
  client_secret: "your-client-secret"
  tenant_id: "your-tenant-id"
targets:
  - resource: "/subscriptions/xxx/resourceGroups/production/providers/Microsoft.Compute/virtualMachines/webserver01"
    metrics:
      - name: "Percentage CPU"
        aggregations: ["Average", "Maximum"]
      - name: "Network In Total"
        aggregations: ["Total"]
      - name: "Network Out Total"
        aggregations: ["Total"]
```
**Configure GCP Cloud Monitoring integration.**
For Google Cloud Platform, use the Stackdriver Exporter (now Cloud Monitoring Exporter). Create a service account with the Monitoring Viewer role:
```sh
gcloud iam service-accounts create prometheus-monitor --display-name="Prometheus Monitoring"
gcloud projects add-iam-policy-binding YOUR_PROJECT_ID \
  --member="serviceAccount:prometheus-monitor@YOUR_PROJECT_ID.iam.gserviceaccount.com" \
  --role="roles/monitoring.viewer"
gcloud iam service-accounts keys create ~/prometheus-monitor-key.json \
  --iam-account=prometheus-monitor@YOUR_PROJECT_ID.iam.gserviceaccount.com
```
Configure the exporter to collect Compute Engine metrics:
```yaml
google:
  project_id: "your-project-id"
  metrics_type_prefixes:
    - "compute.googleapis.com/instance/cpu"
    - "compute.googleapis.com/instance/disk"
    - "compute.googleapis.com/instance/network"
```
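The exporter authenticates through Google application default credentials, so point it at the key created above. The flag names below follow the prometheus-community stackdriver_exporter and are an assumption; confirm them with `--help` on your installed build:

```sh
# Use the service-account key created in the previous step
export GOOGLE_APPLICATION_CREDENTIALS=~/prometheus-monitor-key.json
stackdriver_exporter \
  --google.project-id=your-project-id \
  --monitoring.metrics-type-prefixes="compute.googleapis.com/instance/cpu,compute.googleapis.com/instance/disk"
```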
**Deploy cloud-native agents for detailed visibility.**
Cloud provider monitoring APIs provide aggregate metrics with 1-5 minute granularity. For detailed system-level metrics equivalent to on-premises monitoring, deploy agents to cloud instances using the same procedures as Phase 1.
For autoscaling groups and containerised workloads, use configuration management or cloud-init to deploy agents automatically:
```yaml
#cloud-config
# cloud-init user data for EC2 instances; the "#cloud-config" marker must be the first line
packages:
  - prometheus-node-exporter
runcmd:
  - systemctl enable prometheus-node-exporter
  - systemctl start prometheus-node-exporter
```
Use EC2 service discovery in Prometheus to automatically find instances:
```yaml
scrape_configs:
  - job_name: 'ec2-instances'
    ec2_sd_configs:
      - region: eu-west-1
        port: 9100
        filters:
          - name: tag:Environment
            values: [production]
    relabel_configs:
      - source_labels: [__meta_ec2_tag_Name]
        target_label: instance_name
      - source_labels: [__meta_ec2_instance_id]
        target_label: instance_id
      - source_labels: [__meta_ec2_availability_zone]
        target_label: availability_zone
```
### Phase 5: Threshold configuration
Thresholds define the boundaries between normal operation and conditions requiring attention. Effective thresholds balance sensitivity (detecting real issues) against noise (avoiding false positives).
**Establish baselines from collected data.**
After collecting data for 14 days, query your monitoring platform to understand normal operating ranges. For CPU utilisation:
```promql
# Average CPU utilisation by server over 14 days
avg_over_time(instance:node_cpu_utilisation:ratio[14d]) * 100

# Maximum CPU utilisation peaks
max_over_time(instance:node_cpu_utilisation:ratio[14d]) * 100

# 95th percentile CPU utilisation
quantile_over_time(0.95, instance:node_cpu_utilisation:ratio[14d]) * 100
```
Document baseline values for each server class. A web server with average CPU of 25%, 95th percentile of 60%, and peaks of 85% has headroom for growth. A database server averaging 70% with peaks at 98% requires immediate capacity attention.
**Configure static thresholds.**
Static thresholds apply fixed limits appropriate for the resource type. Start with conservative thresholds and refine based on operational experience.
For Prometheus alerting rules, create `/etc/prometheus/rules/infrastructure-alerts.yml`:
```yaml
groups:
  - name: infrastructure-alerts
    rules:
      # CPU alerts
      - alert: HighCPUUtilisation
        expr: instance:node_cpu_utilisation:ratio > 0.85
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "High CPU utilisation on {{ $labels.instance }}"
          description: "CPU utilisation is {{ $value | humanizePercentage }} for 15 minutes"
      - alert: CriticalCPUUtilisation
        expr: instance:node_cpu_utilisation:ratio > 0.95
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Critical CPU utilisation on {{ $labels.instance }}"
          description: "CPU utilisation is {{ $value | humanizePercentage }} for 5 minutes"
      # Memory alerts
      - alert: HighMemoryUtilisation
        expr: instance:node_memory_utilisation:ratio > 0.85
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "High memory utilisation on {{ $labels.instance }}"
          description: "Memory utilisation is {{ $value | humanizePercentage }}"
      - alert: CriticalMemoryUtilisation
        expr: instance:node_memory_utilisation:ratio > 0.95
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Critical memory utilisation on {{ $labels.instance }}"
          description: "Memory utilisation is {{ $value | humanizePercentage }}"
      # Disk space alerts
      - alert: DiskSpaceWarning
        expr: instance:node_filesystem_utilisation:ratio > 0.80
        for: 30m
        labels:
          severity: warning
        annotations:
          summary: "Disk space low on {{ $labels.instance }}"
          description: "Filesystem {{ $labels.mountpoint }} is {{ $value | humanizePercentage }} full"
      - alert: DiskSpaceCritical
        expr: instance:node_filesystem_utilisation:ratio > 0.90
        for: 15m
        labels:
          severity: critical
        annotations:
          summary: "Disk space critical on {{ $labels.instance }}"
          description: "Filesystem {{ $labels.mountpoint }} is {{ $value | humanizePercentage }} full"
      # Disk I/O alerts
      - alert: HighDiskLatency
        expr: instance:node_disk_read_latency:avg5m > 0.1 or instance:node_disk_write_latency:avg5m > 0.1
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "High disk latency on {{ $labels.instance }}"
          description: "Disk latency exceeds 100ms"
      # Network device alerts
      - alert: NetworkDeviceDown
        expr: probe_success{job="network-ping"} == 0
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Network device unreachable: {{ $labels.instance }}"
      - alert: InterfaceDown
        expr: ifOperStatus == 2
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Interface down: {{ $labels.ifDescr }} on {{ $labels.instance }}"
      - alert: HighInterfaceUtilisation
        expr: (rate(ifHCInOctets[5m]) * 8 / ifSpeed) > 0.80
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "High interface utilisation on {{ $labels.instance }}"
          description: "Interface {{ $labels.ifDescr }} is at {{ $value | humanizePercentage }} utilisation"
```
**Configure predictive thresholds for capacity.**
Predictive thresholds anticipate exhaustion before it occurs. Linear regression across historical data projects when a resource will reach capacity.
For disk space exhaustion prediction:
```promql
# Predict when disk will be full based on 7-day trend
predict_linear(node_filesystem_avail_bytes{fstype=~"ext4|xfs"}[7d], 86400 * 14) < 0
```
This alert fires when extrapolation predicts disk exhaustion within 14 days:
```yaml
- alert: DiskSpaceExhaustionPredicted
  expr: predict_linear(node_filesystem_avail_bytes{fstype=~"ext4|xfs"}[7d], 86400 * 14) < 0
  for: 1h
  labels:
    severity: warning
  annotations:
    summary: "Disk space exhaustion predicted on {{ $labels.instance }}"
    description: "Filesystem {{ $labels.mountpoint }} predicted to exhaust within 14 days"
```
### Phase 6: Dashboard creation
Dashboards provide visual representation of infrastructure health for operational monitoring and capacity planning.
**Import standard dashboards.**
For Grafana with Prometheus, import community dashboards as starting points:
- Node Exporter Full (Dashboard ID 1860): Comprehensive Linux server metrics
- Windows Exporter Dashboard (Dashboard ID 14694): Windows server metrics
- SNMP Device Dashboard (Dashboard ID 11169): Network device metrics
To import, navigate to Dashboards → Import in Grafana, enter the dashboard ID, and select your Prometheus data source.
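If Grafana is managed as configuration rather than by hand, dashboards can also be loaded from disk at startup through file provisioning. A minimal sketch; the provider name, folder, and path are choices, not requirements:

```yaml
# /etc/grafana/provisioning/dashboards/infrastructure.yml
apiVersion: 1
providers:
  - name: 'infrastructure-dashboards'
    folder: 'Infrastructure'
    type: file
    options:
      path: /var/lib/grafana/dashboards  # place exported dashboard JSON files here
```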
**Create an infrastructure overview dashboard.**
An overview dashboard provides at-a-glance status for all infrastructure. Create a new dashboard in Grafana and add the following panels:
**Server health matrix.** Use a stat panel with the following query to show server count by health status:
```promql
# Healthy servers (CPU < 85%, Memory < 85%, Disk < 80%)
count(
  instance:node_cpu_utilisation:ratio < 0.85
  and instance:node_memory_utilisation:ratio < 0.85
  and instance:node_filesystem_utilisation:ratio < 0.80
)
```
**Network device status.** Use a stat panel showing device availability:
```promql
# Available devices
count(probe_success{job="network-ping"} == 1)

# Unavailable devices
count(probe_success{job="network-ping"} == 0)
```
**Resource utilisation heatmap.** Use a table panel showing current utilisation across servers:
```promql
# Query for table
(
  label_replace(instance:node_cpu_utilisation:ratio * 100, "metric", "CPU", "", "")
  or
  label_replace(instance:node_memory_utilisation:ratio * 100, "metric", "Memory", "", "")
  or
  label_replace(max by (instance) (instance:node_filesystem_utilisation:ratio) * 100, "metric", "Disk", "", "")
)
```
**Create capacity trend dashboards.**
Capacity dashboards show historical trends and projections for planning purposes.
**Disk growth trend panel.** Use a time series panel with 30-day range:
```promql
node_filesystem_size_bytes{fstype=~"ext4|xfs"} - node_filesystem_avail_bytes{fstype=~"ext4|xfs"}
```
Add an annotation layer showing alert firings to correlate capacity changes with events.
**Memory trend panel.** Show memory utilisation over time with baseline reference:
```promql
instance:node_memory_utilisation:ratio * 100
```
Add threshold lines at 85% and 95% using panel overrides.
## Verification
After completing implementation, verify that monitoring data is flowing correctly and thresholds are functioning.
**Verify metric collection.** Query each metric type to confirm data exists:
```promql
# Server metrics present
count(node_cpu_seconds_total)
count(windows_cpu_time_total)

# Network metrics present
count(ifHCInOctets)
count(probe_success{job="network-ping"})

# Storage metrics present
count(node_filesystem_size_bytes)
```
Expected result: each query returns a count greater than zero, scaling with the number of monitored resources.
**Verify scrape targets are healthy.** In Prometheus, navigate to Status → Targets. All targets should show state "UP" with recent scrape timestamps. Any target showing "DOWN" indicates a collection failure requiring investigation.
**Verify alerting rules are loaded.** Navigate to Status → Rules in Prometheus. All rule groups should show status "ok" with evaluation timestamps.
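Rule files can also be validated offline before a reload using promtool, which ships with Prometheus:

```sh
# Validate rule files and the main configuration
promtool check rules /etc/prometheus/rules/*.yml
promtool check config /etc/prometheus/prometheus.yml
```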
**Test alert firing.** Temporarily lower a threshold to trigger an alert:
```yaml
# Temporarily set CPU threshold to 5% to verify alerting
- alert: TestAlert
  expr: instance:node_cpu_utilisation:ratio > 0.05
  for: 1m
```
Verify the alert appears in the Alerts view, then remove the test rule.
**Verify dashboard data population.** Open each created dashboard and confirm panels display data without errors. Panels showing "No data" or query errors indicate configuration problems.
**Test data retention.** Query historical data matching your retention configuration:
```promql
# Query data from 7 days ago
node_cpu_seconds_total offset 7d
```
Data within the retention window should be returned. If retention is configured for 15 days, this query returns data, while queries offset beyond 15 days return nothing.
## Troubleshooting
| Symptom | Cause | Resolution |
|---|---|---|
| Node Exporter returns connection refused | Service not running | Check `systemctl status prometheus-node-exporter`. Start if stopped. Check journal logs with `journalctl -u prometheus-node-exporter` |
| Prometheus target shows as DOWN | Network connectivity or firewall blocking | Verify connectivity with `curl http://target:port/metrics`. Check firewall rules on target and intermediate devices |
| SNMP metrics not appearing | Incorrect community string or ACL | Test SNMP manually: `snmpwalk -v2c -c community device_ip 1.3.6.1.2.1.1`. Verify community string and source IP ACL on device |
| CloudWatch metrics delayed by 10+ minutes | API polling interval or propagation delay | CloudWatch metrics have an inherent 1-5 minute delay. Verify the exporter polling interval. Check the AWS service health dashboard |
| Dashboard panels show "No data" | Metric name mismatch or label selector error | Check the metric exists in the Prometheus expression browser. Verify label selectors match actual labels |
| Alert firing but no notification | Alertmanager configuration issue | Verify Alertmanager is running and connected. Check `alertmanager.yml` routing and receiver configuration |
| High cardinality warning in Prometheus | Too many label combinations | Review metrics with high label cardinality. Use relabeling to drop unnecessary labels. Consider aggregating recording rules |
| Disk space exhaustion on monitoring server | Retention too long or sample rate too high | Reduce `--storage.tsdb.retention.time`. Increase the scrape interval (collect less frequently) for non-critical metrics. Review and prune unused metrics |
| SNMP Exporter timeout errors | Network latency or slow device response | Increase timeout in `snmp.yml`. Reduce metrics collected per scrape. Check device CPU during polling |
| Windows Exporter service fails to start | Missing dependencies or port conflict | Check the Windows Event Log for errors. Verify .NET Framework 4.5+ is installed. Check port 9182 availability |
| Recording rules not evaluating | Syntax error or dependency failure | Check Prometheus logs for rule evaluation errors. Validate YAML syntax. Ensure source metrics exist |
| Cloud service discovery returns no targets | IAM permissions or tag filters | Verify service account permissions. Check filter configuration matches actual resource tags |