Infrastructure Monitoring

Infrastructure monitoring collects and analyses metrics from servers, network devices, storage systems, and cloud resources to detect degradation before service impact occurs. This task establishes the collectors, agents, and integrations that feed your monitoring platform with the raw data required for alerting and capacity planning.

Prerequisites

Before implementing infrastructure monitoring, verify that the following requirements are satisfied.

Monitoring platform deployed. A functioning monitoring system must be operational and accessible. This procedure assumes one of the following platforms:

| Platform | Deployment model | Agent protocol | Minimum version |
| --- | --- | --- | --- |
| Prometheus + Grafana | Self-hosted | HTTP pull (scrape) | Prometheus 2.45+, Grafana 10+ |
| Zabbix | Self-hosted | Zabbix agent, SNMP, IPMI | 6.4+ |
| Checkmk | Self-hosted or SaaS | Checkmk agent, SNMP | 2.2+ |
| Datadog | SaaS | Datadog agent | Agent 7+ |

Network connectivity established. Monitoring traffic must traverse your network without obstruction. For pull-based collection (Prometheus), the monitoring server requires access to each monitored host on the exporter's port; for agent-initiated collection (Zabbix active checks, Datadog), monitored hosts require outbound connectivity to the monitoring server on the agent's port. For SNMP-based collection, the monitoring server requires UDP 161 access to network devices. For cloud API collection, outbound HTTPS to provider API endpoints is required.
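These reachability requirements can be spot-checked before any agents are deployed. The sketch below attempts TCP connections to a few placeholder endpoints (hostnames and ports are illustrative; note that SNMP uses UDP 161, which a TCP probe cannot verify):

```python
import socket

def port_reachable(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Placeholder endpoints; substitute your own monitoring server and hosts.
checks = [
    ("monitoring.example.org", 10050),  # Zabbix agent traffic (hypothetical host)
    ("webserver01.example.org", 9100),  # Prometheus Node Exporter scrape port
]

if __name__ == "__main__":
    for host, port in checks:
        status = "open" if port_reachable(host, port) else "unreachable"
        print(f"{host}:{port} {status}")
```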

Credentials and access prepared. Assemble the following before beginning:

  • SSH access or local administrator rights on servers receiving agents
  • SNMP community strings or SNMPv3 credentials for network devices
  • Read-only API credentials for cloud providers (AWS IAM user with CloudWatch read access, Azure service principal with Monitoring Reader role, GCP service account with Monitoring Viewer role)
  • Service account for the monitoring platform with appropriate permissions

Baseline data available. Normal operating ranges cannot be established without historical context. If this is a new deployment, plan to collect data for 14 days before setting thresholds. If migrating from another monitoring system, export historical baselines for reference.

Target inventory documented. List all infrastructure components to be monitored with their hostnames, IP addresses, operating systems, and roles. A spreadsheet or CMDB export suffices. For network devices, include model numbers and firmware versions to verify SNMP MIB compatibility.

Procedure

Infrastructure monitoring implementation proceeds through six phases: server monitoring, network monitoring, storage monitoring, cloud infrastructure monitoring, threshold configuration, and dashboard creation. Complete each phase for your relevant infrastructure before proceeding to the next.

Phase 1: Server monitoring

Server monitoring captures compute resource utilisation, system health indicators, and process states. The collection mechanism varies by operating system and monitoring platform.

  1. Deploy the monitoring agent to Linux servers.

    For Prometheus-based monitoring, install Node Exporter on each Linux server. Node Exporter exposes system metrics on an HTTP endpoint that Prometheus scrapes at configured intervals.

    On Debian/Ubuntu systems:

    Terminal window
    sudo apt update
    sudo apt install prometheus-node-exporter
    sudo systemctl enable prometheus-node-exporter
    sudo systemctl start prometheus-node-exporter

    On RHEL/Rocky/AlmaLinux systems:

    Terminal window
    sudo dnf install node_exporter
    sudo systemctl enable node_exporter
    sudo systemctl start node_exporter

    Verify the exporter is running and accessible:

    Terminal window
    curl http://localhost:9100/metrics | head -20

    Expected output shows metric lines beginning with node_:

    # HELP node_cpu_seconds_total Seconds the CPUs spent in each mode.
    # TYPE node_cpu_seconds_total counter
    node_cpu_seconds_total{cpu="0",mode="idle"} 258459.92
    node_cpu_seconds_total{cpu="0",mode="iowait"} 1029.36
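    The exposition format shown above is plain text and line-oriented, which makes ad-hoc inspection straightforward. A minimal parser sketch (illustrative only; real deployments should rely on Prometheus itself or a client library, and label values containing spaces would defeat this simple split):

```python
def parse_metrics(text: str) -> dict:
    """Parse Prometheus exposition text into {metric{labels}: float}."""
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # skip HELP/TYPE comment lines
            continue
        name, _, value = line.rpartition(" ")  # value is the last field
        samples[name] = float(value)
    return samples

example = """\
# HELP node_cpu_seconds_total Seconds the CPUs spent in each mode.
# TYPE node_cpu_seconds_total counter
node_cpu_seconds_total{cpu="0",mode="idle"} 258459.92
node_cpu_seconds_total{cpu="0",mode="iowait"} 1029.36
"""

metrics = parse_metrics(example)
print(metrics['node_cpu_seconds_total{cpu="0",mode="idle"}'])  # → 258459.92
```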

    For Zabbix-based monitoring, install the Zabbix agent:

    Terminal window
    # Add Zabbix repository first (Debian/Ubuntu example)
    wget https://repo.zabbix.com/zabbix/6.4/ubuntu/pool/main/z/zabbix-release/zabbix-release_6.4-1+ubuntu22.04_all.deb
    sudo dpkg -i zabbix-release_6.4-1+ubuntu22.04_all.deb
    sudo apt update
    sudo apt install zabbix-agent2

    Configure the agent to connect to your Zabbix server by editing /etc/zabbix/zabbix_agent2.conf:

    Server=monitoring.example.org
    ServerActive=monitoring.example.org
    Hostname=webserver01.example.org

    Start the agent:

    Terminal window
    sudo systemctl enable zabbix-agent2
    sudo systemctl start zabbix-agent2
  2. Deploy the monitoring agent to Windows servers.

    For Prometheus-based monitoring, download Windows Exporter from the project’s GitHub releases page. Install using the MSI package with default options, which registers the service to start automatically:

    Terminal window
    msiexec /i windows_exporter-0.25.1-amd64.msi

    Verify the exporter is accessible:

    Terminal window
    (Invoke-WebRequest -Uri http://localhost:9182/metrics).Content -split "`n" | Select-Object -First 20

    For Zabbix-based monitoring, download the Zabbix agent MSI from the Zabbix website. During installation, specify your Zabbix server hostname and the local hostname for this server.

  3. Register monitored servers with the monitoring platform.

    For Prometheus, add scrape targets to your prometheus.yml configuration file. Each target specifies the host and port where metrics are exposed:

    scrape_configs:
      - job_name: 'linux-servers'
        static_configs:
          - targets:
              - 'webserver01.example.org:9100'
              - 'webserver02.example.org:9100'
              - 'dbserver01.example.org:9100'
        relabel_configs:
          - source_labels: [__address__]
            target_label: instance
            regex: '([^:]+):\d+'
            replacement: '${1}'
      - job_name: 'windows-servers'
        static_configs:
          - targets:
              - 'fileserver01.example.org:9182'
              - 'appserver01.example.org:9182'

    For deployments exceeding 50 servers, use file-based service discovery instead of static configuration. Create a JSON file listing targets:

    [
      {
        "targets": ["webserver01.example.org:9100", "webserver02.example.org:9100"],
        "labels": {"env": "production", "role": "web"}
      },
      {
        "targets": ["dbserver01.example.org:9100"],
        "labels": {"env": "production", "role": "database"}
      }
    ]

    Reference the file in your Prometheus configuration:

    scrape_configs:
      - job_name: 'linux-servers'
        file_sd_configs:
          - files:
              - '/etc/prometheus/targets/linux-servers.json'
            refresh_interval: 5m
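Rather than maintaining the JSON by hand, the target file can be generated from the inventory assembled in the prerequisites. A sketch under the assumption that the inventory is a simple list of (hostname, environment, role) tuples (hosts and field names are illustrative):

```python
import json

# Hypothetical inventory rows: (hostname, environment, role)
inventory = [
    ("webserver01.example.org", "production", "web"),
    ("webserver02.example.org", "production", "web"),
    ("dbserver01.example.org", "production", "database"),
]

def build_file_sd(rows, port=9100):
    """Group hosts by (env, role) into Prometheus file_sd target entries."""
    groups = {}
    for host, env, role in rows:
        groups.setdefault((env, role), []).append(f"{host}:{port}")
    return [
        {"targets": targets, "labels": {"env": env, "role": role}}
        for (env, role), targets in sorted(groups.items())
    ]

if __name__ == "__main__":
    # Redirect this to /etc/prometheus/targets/linux-servers.json in practice
    print(json.dumps(build_file_sd(inventory), indent=2))
```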

    Reload the Prometheus configuration. The /-/reload endpoint requires Prometheus to be started with the --web.enable-lifecycle flag; if the flag is not set, send SIGHUP to the Prometheus process instead:

    Terminal window
    curl -X POST http://localhost:9090/-/reload

    For Zabbix, navigate to Configuration → Hosts → Create host in the web interface. Assign the appropriate template (Template OS Linux by Zabbix agent for Linux servers, Template OS Windows by Zabbix agent for Windows servers) to enable standard metric collection.

  4. Configure essential server metrics.

    The following metrics form the baseline for server health monitoring. All values should be collected at 60-second intervals for operational monitoring. Longer intervals (300 seconds) are acceptable for capacity planning metrics where real-time visibility is unnecessary.

    For Prometheus with Node Exporter, the metrics are collected automatically. Create recording rules to pre-calculate commonly used aggregations. Add to /etc/prometheus/rules/server-rules.yml:

    groups:
      - name: server-metrics
        rules:
          - record: instance:node_cpu_utilisation:ratio
            expr: 1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m]))
          - record: instance:node_memory_utilisation:ratio
            expr: 1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)
          - record: instance:node_filesystem_utilisation:ratio
            expr: 1 - (node_filesystem_avail_bytes{fstype=~"ext4|xfs"} / node_filesystem_size_bytes{fstype=~"ext4|xfs"})
          - record: instance:node_disk_io_utilisation:ratio
            expr: rate(node_disk_io_time_seconds_total[5m])

    For Windows Exporter, equivalent metrics use different names:

    groups:
      - name: windows-metrics
        rules:
          - record: instance:windows_cpu_utilisation:ratio
            expr: 1 - avg by (instance) (rate(windows_cpu_time_total{mode="idle"}[5m]))
          - record: instance:windows_memory_utilisation:ratio
            expr: 1 - (windows_os_physical_memory_free_bytes / windows_cs_physical_memory_bytes)
          - record: instance:windows_disk_utilisation:ratio
            expr: 1 - (windows_logical_disk_free_bytes / windows_logical_disk_size_bytes)

Phase 2: Network monitoring

Network monitoring tracks device availability, interface utilisation, error rates, and traffic patterns across switches, routers, firewalls, and wireless access points.

  1. Enable SNMP on network devices.

    SNMP (Simple Network Management Protocol) remains the standard mechanism for network device monitoring. SNMPv3 provides authentication and encryption; use SNMPv2c only when devices lack SNMPv3 support.

    Configuration syntax varies by vendor. For Cisco IOS devices:

    snmp-server community readonly-community RO
    snmp-server location "Headquarters DC Rack A3"
    snmp-server contact "it-operations@example.org"
    snmp-server enable traps
    snmp-server host 10.0.1.50 version 2c readonly-community

    For Juniper Junos devices:

    set snmp community readonly-community authorization read-only
    set snmp location "Headquarters DC Rack A3"
    set snmp contact "it-operations@example.org"
    set snmp trap-group monitoring-traps targets 10.0.1.50

    For SNMPv3 (recommended), configure authentication and privacy:

    # Cisco IOS SNMPv3
    snmp-server group monitoring-group v3 priv
    snmp-server user monitoring-user monitoring-group v3 auth sha AuthPassword priv aes 128 PrivPassword
  2. Configure SNMP polling in your monitoring platform.

    For Prometheus-based monitoring, deploy the SNMP Exporter. This component translates SNMP OIDs into Prometheus metrics using generator-produced configuration.

    Install SNMP Exporter:

    Terminal window
    wget https://github.com/prometheus/snmp_exporter/releases/download/v0.24.1/snmp_exporter-0.24.1.linux-amd64.tar.gz
    tar xzf snmp_exporter-0.24.1.linux-amd64.tar.gz
    sudo mv snmp_exporter-0.24.1.linux-amd64/snmp_exporter /usr/local/bin/

    The default snmp.yml configuration supports standard MIBs. For vendor-specific MIBs, use the generator tool to create custom configurations.

    Create a systemd service file at /etc/systemd/system/snmp-exporter.service:

    [Unit]
    Description=Prometheus SNMP Exporter
    After=network.target
    [Service]
    User=prometheus
    ExecStart=/usr/local/bin/snmp_exporter --config.file=/etc/prometheus/snmp.yml
    Restart=always
    [Install]
    WantedBy=multi-user.target

    Start the exporter:

    Terminal window
    sudo systemctl daemon-reload
    sudo systemctl enable snmp-exporter
    sudo systemctl start snmp-exporter

    Add network device targets to your Prometheus configuration:

    scrape_configs:
      - job_name: 'network-devices'
        static_configs:
          - targets:
              - 'switch01.example.org'
              - 'switch02.example.org'
              - 'router01.example.org'
              - 'firewall01.example.org'
        metrics_path: /snmp
        params:
          module: [if_mib]
        relabel_configs:
          - source_labels: [__address__]
            target_label: __param_target
          - source_labels: [__param_target]
            target_label: instance
          - target_label: __address__
            replacement: localhost:9116  # SNMP Exporter address

    For Zabbix, add each network device as a host and assign the appropriate SNMP template. Zabbix includes templates for common vendors (Cisco, Juniper, HP/Aruba, Ubiquiti) that auto-discover interfaces and apply standard items.

  3. Implement interface monitoring.

    Interface metrics reveal bandwidth utilisation, error rates, and packet loss. The IF-MIB standard provides these metrics across vendors.

    Key interface metrics to monitor:

    ifHCInOctets - Bytes received (64-bit counter)
    ifHCOutOctets - Bytes transmitted (64-bit counter)
    ifInErrors - Inbound packet errors
    ifOutErrors - Outbound packet errors
    ifInDiscards - Inbound packets discarded
    ifOutDiscards - Outbound packets discarded
    ifOperStatus - Interface operational state (1=up, 2=down)
    ifSpeed - Interface speed in bits per second

    Calculate utilisation as a percentage of interface capacity. For a 1 Gbps interface with 5-minute average traffic of 400 Mbps inbound:

    Utilisation = (400,000,000 / 1,000,000,000) × 100 = 40%

    In Prometheus, express this as a query:

    rate(ifHCInOctets{ifDescr="GigabitEthernet0/1"}[5m]) * 8 / ifSpeed * 100
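    The arithmetic behind this query can be reproduced from two raw counter samples. A sketch mirroring what rate() and the expression above compute, with sample values chosen to match the 40% worked example:

```python
def interface_utilisation(octets_t0, octets_t1, interval_s, if_speed_bps):
    """Percentage utilisation from two ifHCInOctets samples interval_s apart."""
    octets_per_s = (octets_t1 - octets_t0) / interval_s  # what rate() computes
    bits_per_s = octets_per_s * 8                        # octets → bits
    return bits_per_s / if_speed_bps * 100               # percent of capacity

# 5-minute window on a 1 Gbps link carrying 400 Mbps:
# 400e6 bits/s = 50e6 octets/s, so 300 s accumulates 15e9 octets.
util = interface_utilisation(0, 15_000_000_000, 300, 1_000_000_000)
print(util)  # percent utilisation
```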
  4. Configure network device availability monitoring.

    ICMP ping provides basic reachability verification. For Prometheus, deploy the Blackbox Exporter:

    Terminal window
    wget https://github.com/prometheus/blackbox_exporter/releases/download/v0.24.0/blackbox_exporter-0.24.0.linux-amd64.tar.gz
    tar xzf blackbox_exporter-0.24.0.linux-amd64.tar.gz
    sudo mv blackbox_exporter-0.24.0.linux-amd64/blackbox_exporter /usr/local/bin/

    Configure ICMP probing in /etc/prometheus/blackbox.yml:

    modules:
      icmp:
        prober: icmp
        timeout: 5s
        icmp:
          preferred_ip_protocol: ip4

    Add ping targets to Prometheus:

    scrape_configs:
      - job_name: 'network-ping'
        metrics_path: /probe
        params:
          module: [icmp]
        static_configs:
          - targets:
              - 'switch01.example.org'
              - 'switch02.example.org'
              - 'router01.example.org'
              - '192.168.1.1'  # ISP gateway
        relabel_configs:
          - source_labels: [__address__]
            target_label: __param_target
          - source_labels: [__param_target]
            target_label: instance
          - target_label: __address__
            replacement: localhost:9115  # Blackbox Exporter address

The following diagram illustrates the network monitoring data flow from devices through collectors to the monitoring platform:

NETWORK INFRASTRUCTURE

   Core Switch      Access Switches      Firewall      Wireless APs
        |                  |                |               |
        +------------------+------+---------+---------------+
                                  |
                 +----------------+----------------+
                 |                                 |
MONITORING NETWORK
                 |                                 |
                 v                                 v
       +---------+---------+             +---------+---------+
       |   SNMP Exporter   |             | Blackbox Exporter |
       |       :9116       |             |       :9115       |
       |   SNMP polling    |             |    ICMP probes    |
       |     every 60s     |             |     every 30s     |
       +---------+---------+             +---------+---------+
                 |                                 |
                 +----------------+----------------+
                                  |
                                  v
                       +----------+----------+
                       |     Prometheus      |
                       |        :9090        |
                       |  Scrapes exporters  |
                       | Stores time series  |
                       +----------+----------+
                                  |
                                  v
                       +---------------------+
                       |       Grafana       |
                       |        :3000        |
                       |     Dashboards      |
                       |    Visualisation    |
                       +---------------------+

Figure 1: Network monitoring architecture showing SNMP collection and availability probing

Phase 3: Storage monitoring

Storage monitoring tracks capacity consumption, performance characteristics, and health indicators across local disks, network storage, and storage area networks.

  1. Configure local disk monitoring.

    Node Exporter and Windows Exporter include disk metrics by default. The relevant metrics for capacity monitoring are:

    # Linux filesystem capacity
    node_filesystem_size_bytes{fstype=~"ext4|xfs|btrfs"}
    node_filesystem_avail_bytes{fstype=~"ext4|xfs|btrfs"}
    # Windows disk capacity
    windows_logical_disk_size_bytes
    windows_logical_disk_free_bytes

    For disk performance, monitor I/O operations and latency:

    # Linux disk I/O
    rate(node_disk_reads_completed_total[5m]) # Read IOPS
    rate(node_disk_writes_completed_total[5m]) # Write IOPS
    rate(node_disk_read_bytes_total[5m]) # Read throughput
    rate(node_disk_written_bytes_total[5m]) # Write throughput
    rate(node_disk_read_time_seconds_total[5m]) / rate(node_disk_reads_completed_total[5m]) # Read latency
    rate(node_disk_write_time_seconds_total[5m]) / rate(node_disk_writes_completed_total[5m]) # Write latency
    # Windows disk I/O
    windows_logical_disk_reads_total
    windows_logical_disk_writes_total
    windows_logical_disk_read_seconds_total
    windows_logical_disk_write_seconds_total

    Create recording rules for disk health metrics in /etc/prometheus/rules/storage-rules.yml:

    groups:
      - name: storage-metrics
        rules:
          - record: instance:node_disk_read_latency:avg5m
            expr: rate(node_disk_read_time_seconds_total[5m]) / rate(node_disk_reads_completed_total[5m])
          - record: instance:node_disk_write_latency:avg5m
            expr: rate(node_disk_write_time_seconds_total[5m]) / rate(node_disk_writes_completed_total[5m])
          - record: instance:node_disk_iops:rate5m
            expr: rate(node_disk_reads_completed_total[5m]) + rate(node_disk_writes_completed_total[5m])
  2. Configure network storage monitoring.

    For NFS servers, monitor export availability and client connections through the NFS exporter or host-level metrics. For NFS clients, Node Exporter exposes NFS operation statistics through the node_nfs_* metrics when the nfs collector is enabled.

    For SMB/CIFS shares on Windows, use Windows Exporter with the smb collector enabled. Edit the Windows Exporter configuration or pass collector flags:

    Terminal window
    # Enable SMB collector
    windows_exporter.exe --collectors.enabled "cpu,cs,logical_disk,memory,net,os,process,smb"

    For dedicated NAS appliances (Synology, QNAP, NetApp), use SNMP monitoring with vendor-specific MIBs. Many NAS vendors also provide Prometheus exporters or API integrations.

  3. Configure SAN monitoring.

    Storage area network monitoring requires vendor-specific approaches. For Fibre Channel SANs, monitor switch port statistics through SNMP. For iSCSI, monitor target availability and session state.

    Common SAN metrics to collect:

    Storage array capacity (total, used, available)
    Volume/LUN capacity utilisation
    Array controller CPU and cache utilisation
    Port throughput and errors
    Disk health and predictive failure indicators
    Replication lag (if applicable)

    For open-source storage platforms like TrueNAS, enable the Prometheus endpoint in the web interface. For commercial arrays, consult vendor documentation for monitoring integration options. Most enterprise arrays (NetApp, Dell EMC, Pure Storage, HPE) provide REST APIs that can be scraped with custom exporters or vendor-provided integrations.

Phase 4: Cloud infrastructure monitoring

Cloud infrastructure monitoring extends visibility to resources deployed in public cloud environments. Each provider offers native monitoring that can be integrated with your existing monitoring platform.

  1. Configure AWS CloudWatch integration.

    For Prometheus-based monitoring, deploy the CloudWatch Exporter or use the YACE (Yet Another CloudWatch Exporter) for more flexible configuration.

    Create an IAM user or role with the following policy:

    {
      "Version": "2012-10-17",
      "Statement": [
        {
          "Effect": "Allow",
          "Action": [
            "cloudwatch:GetMetricData",
            "cloudwatch:GetMetricStatistics",
            "cloudwatch:ListMetrics",
            "ec2:DescribeInstances",
            "ec2:DescribeVolumes",
            "rds:DescribeDBInstances",
            "elasticloadbalancing:DescribeLoadBalancers",
            "tag:GetResources"
          ],
          "Resource": "*"
        }
      ]
    }

    Configure YACE in /etc/yace/config.yml:

    discovery:
      jobs:
        - type: AWS/EC2
          regions:
            - eu-west-1
          period: 300
          length: 300
          metrics:
            - name: CPUUtilization
              statistics: [Average, Maximum]
            - name: NetworkIn
              statistics: [Sum]
            - name: NetworkOut
              statistics: [Sum]
            - name: DiskReadOps
              statistics: [Sum]
            - name: DiskWriteOps
              statistics: [Sum]
        - type: AWS/RDS
          regions:
            - eu-west-1
          period: 300
          length: 300
          metrics:
            - name: CPUUtilization
              statistics: [Average]
            - name: FreeStorageSpace
              statistics: [Average]
            - name: DatabaseConnections
              statistics: [Average]
            - name: ReadIOPS
              statistics: [Average]
            - name: WriteIOPS
              statistics: [Average]

    Add the exporter to your Prometheus configuration:

    scrape_configs:
    - job_name: 'aws-cloudwatch'
    static_configs:
    - targets: ['localhost:5000']
  2. Configure Azure Monitor integration.

    For Azure, use the Azure Monitor Exporter for Prometheus. Create a service principal with the Monitoring Reader role:

    Terminal window
    az ad sp create-for-rbac --name "prometheus-monitor" --role "Monitoring Reader" --scopes /subscriptions/YOUR_SUBSCRIPTION_ID

    Configure the exporter with subscription and resource group filters to avoid excessive API calls and costs:

    active_directory_authority_url: "https://login.microsoftonline.com/"
    resource_manager_url: "https://management.azure.com/"
    credentials:
      subscription_id: "your-subscription-id"
      client_id: "your-client-id"
      client_secret: "your-client-secret"
      tenant_id: "your-tenant-id"
    targets:
      - resource: "/subscriptions/xxx/resourceGroups/production/providers/Microsoft.Compute/virtualMachines/webserver01"
        metrics:
          - name: "Percentage CPU"
            aggregations: ["Average", "Maximum"]
          - name: "Network In Total"
            aggregations: ["Total"]
          - name: "Network Out Total"
            aggregations: ["Total"]
  3. Configure GCP Cloud Monitoring integration.

    For Google Cloud Platform, use the Stackdriver Exporter (now Cloud Monitoring Exporter). Create a service account with the Monitoring Viewer role:

    Terminal window
    gcloud iam service-accounts create prometheus-monitor --display-name="Prometheus Monitoring"
    gcloud projects add-iam-policy-binding YOUR_PROJECT_ID \
    --member="serviceAccount:prometheus-monitor@YOUR_PROJECT_ID.iam.gserviceaccount.com" \
    --role="roles/monitoring.viewer"
    gcloud iam service-accounts keys create ~/prometheus-monitor-key.json \
    --iam-account=prometheus-monitor@YOUR_PROJECT_ID.iam.gserviceaccount.com

    Configure the exporter to collect Compute Engine metrics:

    google:
      project_id: "your-project-id"
      metrics_type_prefixes:
        - "compute.googleapis.com/instance/cpu"
        - "compute.googleapis.com/instance/disk"
        - "compute.googleapis.com/instance/network"
  4. Deploy cloud-native agents for detailed visibility.

    Cloud provider monitoring APIs provide aggregate metrics with 1-5 minute granularity. For detailed system-level metrics equivalent to on-premises monitoring, deploy agents to cloud instances using the same procedures as Phase 1.

    For autoscaling groups and containerised workloads, use configuration management or cloud-init to deploy agents automatically:

    #cloud-config
    # cloud-init example for EC2 instances (#cloud-config must be the first line)
    packages:
      - prometheus-node-exporter
    runcmd:
      - systemctl enable prometheus-node-exporter
      - systemctl start prometheus-node-exporter

    Use EC2 service discovery in Prometheus to automatically find instances:

    scrape_configs:
      - job_name: 'ec2-instances'
        ec2_sd_configs:
          - region: eu-west-1
            port: 9100
            filters:
              - name: tag:Environment
                values: [production]
        relabel_configs:
          - source_labels: [__meta_ec2_tag_Name]
            target_label: instance_name
          - source_labels: [__meta_ec2_instance_id]
            target_label: instance_id
          - source_labels: [__meta_ec2_availability_zone]
            target_label: availability_zone

Phase 5: Threshold configuration

Thresholds define the boundaries between normal operation and conditions requiring attention. Effective thresholds balance sensitivity (detecting real issues) against noise (avoiding false positives).

  1. Establish baselines from collected data.

    After collecting data for 14 days, query your monitoring platform to understand normal operating ranges. For CPU utilisation:

    # Average CPU utilisation by server over 14 days
    avg_over_time(instance:node_cpu_utilisation:ratio[14d]) * 100
    # Maximum CPU utilisation peaks
    max_over_time(instance:node_cpu_utilisation:ratio[14d]) * 100
    # 95th percentile CPU utilisation
    quantile_over_time(0.95, instance:node_cpu_utilisation:ratio[14d]) * 100
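    The same baseline statistics can be reproduced offline from exported samples, which is useful when comparing against a previous monitoring system. A sketch using a nearest-rank percentile (the utilisation ratios are made-up illustrative data):

```python
import math

def percentile(values, q):
    """Nearest-rank percentile (q in 0..100) of a list of samples."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q / 100 * len(ordered)))
    return ordered[rank - 1]

# Hypothetical CPU utilisation ratio samples for one server
samples = [0.20, 0.22, 0.25, 0.24, 0.30, 0.28, 0.60, 0.85, 0.26, 0.23]

print("avg:", round(sum(samples) / len(samples), 3))
print("max:", max(samples))
print("p95:", percentile(samples, 95))
```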

    Document baseline values for each server class. A web server with average CPU of 25%, 95th percentile of 60%, and peaks of 85% has headroom for growth. A database server averaging 70% with peaks at 98% requires immediate capacity attention.

  2. Configure static thresholds.

    Static thresholds apply fixed limits appropriate for the resource type. Start with conservative thresholds and refine based on operational experience.

    For Prometheus alerting rules, create /etc/prometheus/rules/infrastructure-alerts.yml:

    groups:
      - name: infrastructure-alerts
        rules:
          # CPU alerts
          - alert: HighCPUUtilisation
            expr: instance:node_cpu_utilisation:ratio > 0.85
            for: 15m
            labels:
              severity: warning
            annotations:
              summary: "High CPU utilisation on {{ $labels.instance }}"
              description: "CPU utilisation is {{ $value | humanizePercentage }} for 15 minutes"
          - alert: CriticalCPUUtilisation
            expr: instance:node_cpu_utilisation:ratio > 0.95
            for: 5m
            labels:
              severity: critical
            annotations:
              summary: "Critical CPU utilisation on {{ $labels.instance }}"
              description: "CPU utilisation is {{ $value | humanizePercentage }} for 5 minutes"
          # Memory alerts
          - alert: HighMemoryUtilisation
            expr: instance:node_memory_utilisation:ratio > 0.85
            for: 15m
            labels:
              severity: warning
            annotations:
              summary: "High memory utilisation on {{ $labels.instance }}"
              description: "Memory utilisation is {{ $value | humanizePercentage }}"
          - alert: CriticalMemoryUtilisation
            expr: instance:node_memory_utilisation:ratio > 0.95
            for: 5m
            labels:
              severity: critical
            annotations:
              summary: "Critical memory utilisation on {{ $labels.instance }}"
              description: "Memory utilisation is {{ $value | humanizePercentage }}"
          # Disk space alerts
          - alert: DiskSpaceWarning
            expr: instance:node_filesystem_utilisation:ratio > 0.80
            for: 30m
            labels:
              severity: warning
            annotations:
              summary: "Disk space low on {{ $labels.instance }}"
              description: "Filesystem {{ $labels.mountpoint }} is {{ $value | humanizePercentage }} full"
          - alert: DiskSpaceCritical
            expr: instance:node_filesystem_utilisation:ratio > 0.90
            for: 15m
            labels:
              severity: critical
            annotations:
              summary: "Disk space critical on {{ $labels.instance }}"
              description: "Filesystem {{ $labels.mountpoint }} is {{ $value | humanizePercentage }} full"
          # Disk I/O alerts
          - alert: HighDiskLatency
            expr: instance:node_disk_read_latency:avg5m > 0.1 or instance:node_disk_write_latency:avg5m > 0.1
            for: 10m
            labels:
              severity: warning
            annotations:
              summary: "High disk latency on {{ $labels.instance }}"
              description: "Disk latency exceeds 100ms"
          # Network device alerts
          - alert: NetworkDeviceDown
            expr: probe_success{job="network-ping"} == 0
            for: 2m
            labels:
              severity: critical
            annotations:
              summary: "Network device unreachable: {{ $labels.instance }}"
          - alert: InterfaceDown
            expr: ifOperStatus == 2
            for: 5m
            labels:
              severity: warning
            annotations:
              summary: "Interface down: {{ $labels.ifDescr }} on {{ $labels.instance }}"
          - alert: HighInterfaceUtilisation
            expr: (rate(ifHCInOctets[5m]) * 8 / ifSpeed) > 0.80
            for: 15m
            labels:
              severity: warning
            annotations:
              summary: "High interface utilisation on {{ $labels.instance }}"
              description: "Interface {{ $labels.ifDescr }} is at {{ $value | humanizePercentage }} utilisation"
  3. Configure predictive thresholds for capacity.

    Predictive thresholds anticipate exhaustion before it occurs. Linear regression across historical data projects when a resource will reach capacity.

    For disk space exhaustion prediction:

    # Predict when disk will be full based on 7-day trend
    predict_linear(node_filesystem_avail_bytes{fstype=~"ext4|xfs"}[7d], 86400 * 14) < 0

    This alert fires when extrapolation predicts disk exhaustion within 14 days:

    - alert: DiskSpaceExhaustionPredicted
      expr: predict_linear(node_filesystem_avail_bytes{fstype=~"ext4|xfs"}[7d], 86400 * 14) < 0
      for: 1h
      labels:
        severity: warning
      annotations:
        summary: "Disk space exhaustion predicted on {{ $labels.instance }}"
        description: "Filesystem {{ $labels.mountpoint }} predicted to exhaust within 14 days"
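predict_linear() fits a least-squares line over the range and extrapolates it forward. The underlying logic can be sketched as follows (the sample series is made up: a filesystem losing 10 GiB of free space per day):

```python
def predict_linear(samples, horizon_s):
    """Least-squares extrapolation of (timestamp, value) samples to horizon_s
    seconds past the last sample, mirroring PromQL's predict_linear()."""
    n = len(samples)
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    cov = sum((t - mean_t) * (v - mean_v) for t, v in samples)
    var = sum((t - mean_t) ** 2 for t, _ in samples)
    slope = cov / var
    intercept = mean_v - slope * mean_t
    return slope * (samples[-1][0] + horizon_s) + intercept

# Hypothetical series: free bytes sampled daily, shrinking by 10 GiB/day
day = 86_400
series = [(0, 100e9), (day, 90e9), (2 * day, 80e9), (3 * day, 70e9)]

projected = predict_linear(series, 14 * day)
print(projected < 0)  # → True: exhaustion predicted within 14 days
```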

Phase 6: Dashboard creation

Dashboards provide visual representation of infrastructure health for operational monitoring and capacity planning.

  1. Import standard dashboards.

    For Grafana with Prometheus, import community dashboards as starting points:

    • Node Exporter Full (Dashboard ID 1860): Comprehensive Linux server metrics
    • Windows Exporter Dashboard (Dashboard ID 14694): Windows server metrics
    • SNMP Device Dashboard (Dashboard ID 11169): Network device metrics

    To import, navigate to Dashboards → Import in Grafana, enter the dashboard ID, and select your Prometheus data source.

  2. Create an infrastructure overview dashboard.

    An overview dashboard provides at-a-glance status for all infrastructure. Create a new dashboard in Grafana and add the following panels:

    Server health matrix. Use a stat panel with the following query to show server count by health status:

    # Healthy servers (CPU < 85%, Memory < 85%, Disk < 80%)
    count(
    instance:node_cpu_utilisation:ratio < 0.85
    and instance:node_memory_utilisation:ratio < 0.85
    and instance:node_filesystem_utilisation:ratio < 0.80
    )

    Network device status. Use a stat panel showing device availability:

    # Available devices
    count(probe_success{job="network-ping"} == 1)
    # Unavailable devices
    count(probe_success{job="network-ping"} == 0)

    Resource utilisation heatmap. Use a table panel showing current utilisation across servers:

    # Query for table
    (
    label_replace(instance:node_cpu_utilisation:ratio * 100, "metric", "CPU", "", "")
    or
    label_replace(instance:node_memory_utilisation:ratio * 100, "metric", "Memory", "", "")
    or
    label_replace(max by (instance) (instance:node_filesystem_utilisation:ratio) * 100, "metric", "Disk", "", "")
    )
  3. Create capacity trend dashboards.

    Capacity dashboards show historical trends and projections for planning purposes.

    Disk growth trend panel. Use a time series panel with 30-day range:

    node_filesystem_size_bytes{fstype=~"ext4|xfs"} - node_filesystem_avail_bytes{fstype=~"ext4|xfs"}

    Add an annotation layer showing alert firings to correlate capacity changes with events.

    Memory trend panel. Show memory utilisation over time with baseline reference:

    instance:node_memory_utilisation:ratio * 100

    Add threshold lines at 85% and 95% using panel overrides.

Verification

After completing implementation, verify that monitoring data is flowing correctly and thresholds are functioning.

Verify metric collection. Query each metric type to confirm data exists:

# Server metrics present
count(node_cpu_seconds_total)
count(windows_cpu_time_total)
# Network metrics present
count(ifHCInOctets)
count(probe_success{job="network-ping"})
# Storage metrics present
count(node_filesystem_size_bytes)

Expected result: Each query returns a count greater than zero matching the number of monitored resources.

Verify scrape targets are healthy. In Prometheus, navigate to Status → Targets. All targets should show state “UP” with recent scrape timestamps. Any targets showing “DOWN” indicate collection failures requiring investigation.
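Target health can also be checked programmatically through the /api/v1/targets endpoint rather than the web interface. A sketch that summarises such a response (the embedded JSON is a trimmed, made-up example of the response shape):

```python
import json

# Trimmed, illustrative response body from GET /api/v1/targets
response_text = """
{
  "status": "success",
  "data": {
    "activeTargets": [
      {"labels": {"instance": "webserver01.example.org:9100", "job": "linux-servers"}, "health": "up"},
      {"labels": {"instance": "switch01.example.org", "job": "network-devices"}, "health": "down"}
    ]
  }
}
"""

def down_targets(text):
    """Return instance labels of active targets whose health is not 'up'."""
    targets = json.loads(text)["data"]["activeTargets"]
    return [t["labels"]["instance"] for t in targets if t["health"] != "up"]

print(down_targets(response_text))  # → ['switch01.example.org']
```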

Verify alerting rules are loaded. Navigate to Status → Rules in Prometheus. All rule groups should show status “ok” with evaluation timestamps.

Test alert firing. Temporarily lower a threshold to trigger an alert:

# Temporarily set CPU threshold to 5% to verify alerting
- alert: TestAlert
  expr: instance:node_cpu_utilisation:ratio > 0.05
  for: 1m

Verify the alert appears in the Alerts view, then remove the test rule.

Verify dashboard data population. Open each created dashboard and confirm panels display data without errors. Panels showing “No data” or query errors indicate configuration problems.

Test data retention. Query historical data matching your retention configuration:

# Query data from 7 days ago
node_cpu_seconds_total offset 7d

If retention is configured for 15 days, queries within that window (such as the 7-day offset above) should return data, while queries beyond it return nothing.

Troubleshooting

| Symptom | Cause | Resolution |
| --- | --- | --- |
| Node Exporter returns connection refused | Service not running | Check systemctl status prometheus-node-exporter. Start if stopped. Check journal logs with journalctl -u prometheus-node-exporter |
| Prometheus target shows as DOWN | Network connectivity or firewall blocking | Verify connectivity with curl http://target:port/metrics. Check firewall rules on target and intermediate devices |
| SNMP metrics not appearing | Incorrect community string or ACL | Test SNMP manually: snmpwalk -v2c -c community device_ip 1.3.6.1.2.1.1. Verify community string and source IP ACL on device |
| CloudWatch metrics delayed by 10+ minutes | API polling interval or propagation delay | CloudWatch metrics have inherent 1-5 minute delay. Verify exporter polling interval. Check AWS service health dashboard |
| Dashboard panels show "No data" | Metric name mismatch or label selector error | Check metric exists in Prometheus expression browser. Verify label selectors match actual labels |
| Alert firing but no notification | Alertmanager configuration issue | Verify Alertmanager is running and connected. Check alertmanager.yml routing and receiver configuration |
| High cardinality warning in Prometheus | Too many label combinations | Review metrics with high label cardinality. Use relabeling to drop unnecessary labels. Consider aggregating recording rules |
| Disk space exhaustion on monitoring server | Retention too long or sample rate too high | Reduce --storage.tsdb.retention.time. Decrease scrape interval for non-critical metrics. Review and prune unused metrics |
| SNMP Exporter timeout errors | Network latency or slow device response | Increase timeout in snmp.yml. Reduce metrics collected per scrape. Check device CPU during polling |
| Windows Exporter service fails to start | Missing dependencies or port conflict | Check Windows Event Log for errors. Verify .NET Framework 4.5+ installed. Check port 9182 availability |
| Recording rules not evaluating | Syntax error or dependency failure | Check Prometheus logs for rule evaluation errors. Validate YAML syntax. Ensure source metrics exist |
| Cloud service discovery returns no targets | IAM permissions or tag filters | Verify service account permissions. Check filter configuration matches actual resource tags |

See also