Infrastructure Monitoring

Infrastructure monitoring collects and analyses metrics from servers, network devices, storage systems, and cloud resources to detect degradation before service impact occurs. This task establishes the collectors, agents, and integrations that feed your monitoring platform with the raw data required for alerting and capacity planning.

Prerequisites

Before implementing infrastructure monitoring, verify that the following requirements are satisfied.

Monitoring platform deployed. A functioning monitoring system must be operational and accessible. This procedure assumes one of the following platforms:

| Platform | Deployment model | Agent protocol | Minimum version |
| --- | --- | --- | --- |
| Prometheus + Grafana | Self-hosted | HTTP pull (scrape) | Prometheus 2.45+, Grafana 10+ |
| Zabbix | Self-hosted | Zabbix agent, SNMP, IPMI | 6.4+ |
| Checkmk | Self-hosted or SaaS | Checkmk agent, SNMP | 2.2+ |
| Datadog | SaaS | Datadog agent | Agent 7+ |

Network connectivity established. Monitoring traffic must traverse your network without obstruction. For pull-based collection (Prometheus), the monitoring server requires access to each monitored host on the exporter's port; for agent-initiated collection (Zabbix active checks, Datadog), monitored hosts require outbound connectivity to the monitoring server on the agent's port. For SNMP-based collection, the monitoring server requires UDP 161 access to network devices. For cloud API collection, outbound HTTPS to provider API endpoints is required.
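These reachability requirements can be spot-checked before any agents are deployed. The sketch below attempts TCP connections to a few placeholder endpoints (hostnames and ports are illustrative; note that SNMP uses UDP 161, which a TCP probe cannot verify):

```python
import socket

def port_reachable(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Placeholder endpoints; substitute your own monitoring server and hosts.
checks = [
    ("monitoring.example.org", 10050),  # Zabbix agent traffic (hypothetical host)
    ("webserver01.example.org", 9100),  # Prometheus Node Exporter scrape port
]

if __name__ == "__main__":
    for host, port in checks:
        status = "open" if port_reachable(host, port) else "unreachable"
        print(f"{host}:{port} {status}")
```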

Credentials and access prepared. Assemble the following before beginning:

  • SSH access or local administrator rights on servers receiving agents
  • SNMP community strings or SNMPv3 credentials for network devices
  • Read-only API credentials for cloud providers (AWS IAM user with CloudWatch read access, Azure service principal with Monitoring Reader role, GCP service account with Monitoring Viewer role)
  • Service account for the monitoring platform with appropriate permissions

Baseline data available. Normal operating ranges cannot be established without historical context. If this is a new deployment, plan to collect data for 14 days before setting thresholds. If migrating from another monitoring system, export historical baselines for reference.

Target inventory documented. List all infrastructure components to be monitored with their hostnames, IP addresses, operating systems, and roles. A spreadsheet or CMDB export suffices. For network devices, include model numbers and firmware versions to verify SNMP MIB compatibility.

Procedure

Infrastructure monitoring implementation proceeds through six phases: server monitoring, network monitoring, storage monitoring, cloud infrastructure monitoring, threshold configuration, and dashboard creation. Complete each phase for your relevant infrastructure before proceeding to the next.

Phase 1: Server monitoring

Server monitoring captures compute resource utilisation, system health indicators, and process states. The collection mechanism varies by operating system and monitoring platform.

  1. Deploy the monitoring agent to Linux servers.

    For Prometheus-based monitoring, install Node Exporter on each Linux server. Node Exporter exposes system metrics on an HTTP endpoint that Prometheus scrapes at configured intervals.

    On Debian/Ubuntu systems:

    Terminal window
    sudo apt update
    sudo apt install prometheus-node-exporter
    sudo systemctl enable prometheus-node-exporter
    sudo systemctl start prometheus-node-exporter

    On RHEL/Rocky/AlmaLinux systems:

    Terminal window
    sudo dnf install node_exporter
    sudo systemctl enable node_exporter
    sudo systemctl start node_exporter

    Verify the exporter is running and accessible:

    Terminal window
    curl http://localhost:9100/metrics | head -20

    Expected output shows metric lines beginning with node_:

    # HELP node_cpu_seconds_total Seconds the CPUs spent in each mode.
    # TYPE node_cpu_seconds_total counter
    node_cpu_seconds_total{cpu="0",mode="idle"} 258459.92
    node_cpu_seconds_total{cpu="0",mode="iowait"} 1029.36
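    The exposition format shown above is plain text and line-oriented, which makes ad-hoc inspection straightforward. A minimal parser sketch (illustrative only; real deployments should rely on Prometheus itself or a client library, and label values containing spaces would defeat this simple split):

```python
def parse_metrics(text: str) -> dict:
    """Parse Prometheus exposition text into {metric{labels}: float}."""
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # skip HELP/TYPE comment lines
            continue
        name, _, value = line.rpartition(" ")  # value is the last field
        samples[name] = float(value)
    return samples

example = """\
# HELP node_cpu_seconds_total Seconds the CPUs spent in each mode.
# TYPE node_cpu_seconds_total counter
node_cpu_seconds_total{cpu="0",mode="idle"} 258459.92
node_cpu_seconds_total{cpu="0",mode="iowait"} 1029.36
"""

metrics = parse_metrics(example)
print(metrics['node_cpu_seconds_total{cpu="0",mode="idle"}'])  # → 258459.92
```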

    For Zabbix-based monitoring, install the Zabbix agent:

    Terminal window
    # Add Zabbix repository first (Debian/Ubuntu example)
    wget https://repo.zabbix.com/zabbix/6.4/ubuntu/pool/main/z/zabbix-release/zabbix-release_6.4-1+ubuntu22.04_all.deb
    sudo dpkg -i zabbix-release_6.4-1+ubuntu22.04_all.deb
    sudo apt update
    sudo apt install zabbix-agent2

    Configure the agent to connect to your Zabbix server by editing /etc/zabbix/zabbix_agent2.conf:

    Server=monitoring.example.org
    ServerActive=monitoring.example.org
    Hostname=webserver01.example.org

    Start the agent:

    Terminal window
    sudo systemctl enable zabbix-agent2
    sudo systemctl start zabbix-agent2
  2. Deploy the monitoring agent to Windows servers.

    For Prometheus-based monitoring, download Windows Exporter from the project’s GitHub releases page. Install using the MSI package with default options, which registers the service to start automatically:

    Terminal window
    msiexec /i windows_exporter-0.25.1-amd64.msi

    Verify the exporter is accessible:

    Terminal window
    (Invoke-WebRequest -Uri http://localhost:9182/metrics).Content -split "`n" | Select-Object -First 20

    For Zabbix-based monitoring, download the Zabbix agent MSI from the Zabbix website. During installation, specify your Zabbix server hostname and the local hostname for this server.

  3. Register monitored servers with the monitoring platform.

    For Prometheus, add scrape targets to your prometheus.yml configuration file. Each target specifies the host and port where metrics are exposed:

    scrape_configs:
      - job_name: 'linux-servers'
        static_configs:
          - targets:
              - 'webserver01.example.org:9100'
              - 'webserver02.example.org:9100'
              - 'dbserver01.example.org:9100'
        relabel_configs:
          - source_labels: [__address__]
            target_label: instance
            regex: '([^:]+):\d+'
            replacement: '${1}'
      - job_name: 'windows-servers'
        static_configs:
          - targets:
              - 'fileserver01.example.org:9182'
              - 'appserver01.example.org:9182'

    For deployments exceeding 50 servers, use file-based service discovery instead of static configuration. Create a JSON file listing targets:

    [
      {
        "targets": ["webserver01.example.org:9100", "webserver02.example.org:9100"],
        "labels": {"env": "production", "role": "web"}
      },
      {
        "targets": ["dbserver01.example.org:9100"],
        "labels": {"env": "production", "role": "database"}
      }
    ]

    Reference the file in your Prometheus configuration:

    scrape_configs:
      - job_name: 'linux-servers'
        file_sd_configs:
          - files:
              - '/etc/prometheus/targets/linux-servers.json'
            refresh_interval: 5m
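Rather than maintaining the JSON by hand, the target file can be generated from the inventory assembled in the prerequisites. A sketch under the assumption that the inventory is a simple list of (hostname, environment, role) tuples (hosts and field names are illustrative):

```python
import json

# Hypothetical inventory rows: (hostname, environment, role)
inventory = [
    ("webserver01.example.org", "production", "web"),
    ("webserver02.example.org", "production", "web"),
    ("dbserver01.example.org", "production", "database"),
]

def build_file_sd(rows, port=9100):
    """Group hosts by (env, role) into Prometheus file_sd target entries."""
    groups = {}
    for host, env, role in rows:
        groups.setdefault((env, role), []).append(f"{host}:{port}")
    return [
        {"targets": targets, "labels": {"env": env, "role": role}}
        for (env, role), targets in sorted(groups.items())
    ]

if __name__ == "__main__":
    # Redirect this to /etc/prometheus/targets/linux-servers.json in practice
    print(json.dumps(build_file_sd(inventory), indent=2))
```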

    Reload the Prometheus configuration. The /-/reload endpoint requires Prometheus to be started with the --web.enable-lifecycle flag; if the flag is not set, send SIGHUP to the Prometheus process instead:

    Terminal window
    curl -X POST http://localhost:9090/-/reload

    For Zabbix, navigate to Configuration → Hosts → Create host in the web interface. Assign the appropriate template (Template OS Linux by Zabbix agent for Linux servers, Template OS Windows by Zabbix agent for Windows servers) to enable standard metric collection.

  4. Configure essential server metrics.

    The following metrics form the baseline for server health monitoring. All values should be collected at 60-second intervals for operational monitoring. Longer intervals (300 seconds) are acceptable for capacity planning metrics where real-time visibility is unnecessary.

    For Prometheus with Node Exporter, the metrics are collected automatically. Create recording rules to pre-calculate commonly used aggregations. Add to /etc/prometheus/rules/server-rules.yml:

    groups:
      - name: server-metrics
        rules:
          - record: instance:node_cpu_utilisation:ratio
            expr: 1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m]))
          - record: instance:node_memory_utilisation:ratio
            expr: 1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)
          - record: instance:node_filesystem_utilisation:ratio
            expr: 1 - (node_filesystem_avail_bytes{fstype=~"ext4|xfs"} / node_filesystem_size_bytes{fstype=~"ext4|xfs"})
          - record: instance:node_disk_io_utilisation:ratio
            expr: rate(node_disk_io_time_seconds_total[5m])

    For Windows Exporter, equivalent metrics use different names:

    groups:
      - name: windows-metrics
        rules:
          - record: instance:windows_cpu_utilisation:ratio
            expr: 1 - avg by (instance) (rate(windows_cpu_time_total{mode="idle"}[5m]))
          - record: instance:windows_memory_utilisation:ratio
            expr: 1 - (windows_os_physical_memory_free_bytes / windows_cs_physical_memory_bytes)
          - record: instance:windows_disk_utilisation:ratio
            expr: 1 - (windows_logical_disk_free_bytes / windows_logical_disk_size_bytes)

Phase 2: Network monitoring

Network monitoring tracks device availability, interface utilisation, error rates, and traffic patterns across switches, routers, firewalls, and wireless access points.

  1. Enable SNMP on network devices.

    SNMP (Simple Network Management Protocol) remains the standard mechanism for network device monitoring. SNMPv3 provides authentication and encryption; use SNMPv2c only when devices lack SNMPv3 support.

    Configuration syntax varies by vendor. For Cisco IOS devices:

    snmp-server community readonly-community RO
    snmp-server location "Headquarters DC Rack A3"
    snmp-server contact "it-operations@example.org"
    snmp-server enable traps
    snmp-server host 10.0.1.50 version 2c readonly-community

    For Juniper Junos devices:

    set snmp community readonly-community authorization read-only
    set snmp location "Headquarters DC Rack A3"
    set snmp contact "it-operations@example.org"
    set snmp trap-group monitoring-traps targets 10.0.1.50

    For SNMPv3 (recommended), configure authentication and privacy:

    # Cisco IOS SNMPv3
    snmp-server group monitoring-group v3 priv
    snmp-server user monitoring-user monitoring-group v3 auth sha AuthPassword priv aes 128 PrivPassword
  2. Configure SNMP polling in your monitoring platform.

    For Prometheus-based monitoring, deploy the SNMP Exporter. This component translates SNMP OIDs into Prometheus metrics using generator-produced configuration.

    Install SNMP Exporter:

    Terminal window
    wget https://github.com/prometheus/snmp_exporter/releases/download/v0.24.1/snmp_exporter-0.24.1.linux-amd64.tar.gz
    tar xzf snmp_exporter-0.24.1.linux-amd64.tar.gz
    sudo mv snmp_exporter-0.24.1.linux-amd64/snmp_exporter /usr/local/bin/

    The default snmp.yml configuration supports standard MIBs. For vendor-specific MIBs, use the generator tool to create custom configurations.

    Create a systemd service file at /etc/systemd/system/snmp-exporter.service:

    [Unit]
    Description=Prometheus SNMP Exporter
    After=network.target
    [Service]
    User=prometheus
    ExecStart=/usr/local/bin/snmp_exporter --config.file=/etc/prometheus/snmp.yml
    Restart=always
    [Install]
    WantedBy=multi-user.target

    Start the exporter:

    Terminal window
    sudo systemctl daemon-reload
    sudo systemctl enable snmp-exporter
    sudo systemctl start snmp-exporter

    Add network device targets to your Prometheus configuration:

    scrape_configs:
      - job_name: 'network-devices'
        static_configs:
          - targets:
              - 'switch01.example.org'
              - 'switch02.example.org'
              - 'router01.example.org'
              - 'firewall01.example.org'
        metrics_path: /snmp
        params:
          module: [if_mib]
        relabel_configs:
          - source_labels: [__address__]
            target_label: __param_target
          - source_labels: [__param_target]
            target_label: instance
          - target_label: __address__
            replacement: localhost:9116  # SNMP Exporter address

    For Zabbix, add each network device as a host and assign the appropriate SNMP template. Zabbix includes templates for common vendors (Cisco, Juniper, HP/Aruba, Ubiquiti) that auto-discover interfaces and apply standard items.

  3. Implement interface monitoring.

    Interface metrics reveal bandwidth utilisation, error rates, and packet loss. The IF-MIB standard provides these metrics across vendors.

    Key interface metrics to monitor:

    ifHCInOctets - Bytes received (64-bit counter)
    ifHCOutOctets - Bytes transmitted (64-bit counter)
    ifInErrors - Inbound packet errors
    ifOutErrors - Outbound packet errors
    ifInDiscards - Inbound packets discarded
    ifOutDiscards - Outbound packets discarded
    ifOperStatus - Interface operational state (1=up, 2=down)
    ifSpeed - Interface speed in bits per second

    Calculate utilisation as a percentage of interface capacity. For a 1 Gbps interface with 5-minute average traffic of 400 Mbps inbound:

    Utilisation = (400,000,000 / 1,000,000,000) × 100 = 40%

    In Prometheus, express this as a query:

    rate(ifHCInOctets{ifDescr="GigabitEthernet0/1"}[5m]) * 8 / ifSpeed * 100
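    The arithmetic behind this query can be reproduced from two raw counter samples. A sketch mirroring what rate() and the expression above compute, with sample values chosen to match the 40% worked example:

```python
def interface_utilisation(octets_t0, octets_t1, interval_s, if_speed_bps):
    """Percentage utilisation from two ifHCInOctets samples interval_s apart."""
    octets_per_s = (octets_t1 - octets_t0) / interval_s  # what rate() computes
    bits_per_s = octets_per_s * 8                        # octets → bits
    return bits_per_s / if_speed_bps * 100               # percent of capacity

# 5-minute window on a 1 Gbps link carrying 400 Mbps:
# 400e6 bits/s = 50e6 octets/s, so 300 s accumulates 15e9 octets.
util = interface_utilisation(0, 15_000_000_000, 300, 1_000_000_000)
print(util)  # percent utilisation
```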
  4. Configure network device availability monitoring.

    ICMP ping provides basic reachability verification. For Prometheus, deploy the Blackbox Exporter:

    Terminal window
    wget https://github.com/prometheus/blackbox_exporter/releases/download/v0.24.0/blackbox_exporter-0.24.0.linux-amd64.tar.gz
    tar xzf blackbox_exporter-0.24.0.linux-amd64.tar.gz
    sudo mv blackbox_exporter-0.24.0.linux-amd64/blackbox_exporter /usr/local/bin/

    Configure ICMP probing in /etc/prometheus/blackbox.yml:

    modules:
      icmp:
        prober: icmp
        timeout: 5s
        icmp:
          preferred_ip_protocol: ip4

    Add ping targets to Prometheus:

    scrape_configs:
      - job_name: 'network-ping'
        metrics_path: /probe
        params:
          module: [icmp]
        static_configs:
          - targets:
              - 'switch01.example.org'
              - 'switch02.example.org'
              - 'router01.example.org'
              - '192.168.1.1'  # ISP gateway
        relabel_configs:
          - source_labels: [__address__]
            target_label: __param_target
          - source_labels: [__param_target]
            target_label: instance
          - target_label: __address__
            replacement: localhost:9115  # Blackbox Exporter address

The following diagram illustrates the network monitoring data flow from devices through collectors to the monitoring platform:

NETWORK INFRASTRUCTURE

   Core Switch      Access Switches      Firewall      Wireless APs
        |                  |                |               |
        +------------------+------+---------+---------------+
                                  |
                 +----------------+----------------+
                 |                                 |
MONITORING NETWORK
                 |                                 |
                 v                                 v
       +---------+---------+             +---------+---------+
       |   SNMP Exporter   |             | Blackbox Exporter |
       |       :9116       |             |       :9115       |
       |   SNMP polling    |             |    ICMP probes    |
       |     every 60s     |             |     every 30s     |
       +---------+---------+             +---------+---------+
                 |                                 |
                 +----------------+----------------+
                                  |
                                  v
                       +----------+----------+
                       |     Prometheus      |
                       |        :9090        |
                       |  Scrapes exporters  |
                       | Stores time series  |
                       +----------+----------+
                                  |
                                  v
                       +---------------------+
                       |       Grafana       |
                       |        :3000        |
                       |     Dashboards      |
                       |    Visualisation    |
                       +---------------------+

Figure 1: Network monitoring architecture showing SNMP collection and availability probing

Phase 3: Storage monitoring

Storage monitoring tracks capacity consumption, performance characteristics, and health indicators across local disks, network storage, and storage area networks.

  1. Configure local disk monitoring.

    Node Exporter and Windows Exporter include disk metrics by default. The relevant metrics for capacity monitoring are:

    # Linux filesystem capacity
    node_filesystem_size_bytes{fstype=~"ext4|xfs|btrfs"}
    node_filesystem_avail_bytes{fstype=~"ext4|xfs|btrfs"}
    # Windows disk capacity
    windows_logical_disk_size_bytes
    windows_logical_disk_free_bytes

    For disk performance, monitor I/O operations and latency:

    # Linux disk I/O
    rate(node_disk_reads_completed_total[5m]) # Read IOPS
    rate(node_disk_writes_completed_total[5m]) # Write IOPS
    rate(node_disk_read_bytes_total[5m]) # Read throughput
    rate(node_disk_written_bytes_total[5m]) # Write throughput
    rate(node_disk_read_time_seconds_total[5m]) / rate(node_disk_reads_completed_total[5m]) # Read latency
    rate(node_disk_write_time_seconds_total[5m]) / rate(node_disk_writes_completed_total[5m]) # Write latency
    # Windows disk I/O
    windows_logical_disk_reads_total
    windows_logical_disk_writes_total
    windows_logical_disk_read_seconds_total
    windows_logical_disk_write_seconds_total

    Create recording rules for disk health metrics in /etc/prometheus/rules/storage-rules.yml:

    groups:
      - name: storage-metrics
        rules:
          - record: instance:node_disk_read_latency:avg5m
            expr: rate(node_disk_read_time_seconds_total[5m]) / rate(node_disk_reads_completed_total[5m])
          - record: instance:node_disk_write_latency:avg5m
            expr: rate(node_disk_write_time_seconds_total[5m]) / rate(node_disk_writes_completed_total[5m])
          - record: instance:node_disk_iops:rate5m
            expr: rate(node_disk_reads_completed_total[5m]) + rate(node_disk_writes_completed_total[5m])
  2. Configure network storage monitoring.

    For NFS servers, monitor export availability and client connections through the NFS exporter or host-level metrics. For NFS clients, Node Exporter exposes NFS operation statistics through the node_nfs_* metrics when the nfs collector is enabled.

    For SMB/CIFS shares on Windows, use Windows Exporter with the smb collector enabled. Edit the Windows Exporter configuration or pass collector flags:

    Terminal window
    # Enable SMB collector
    windows_exporter.exe --collectors.enabled "cpu,cs,logical_disk,memory,net,os,process,smb"

    For dedicated NAS appliances (Synology, QNAP, NetApp), use SNMP monitoring with vendor-specific MIBs. Many NAS vendors also provide Prometheus exporters or API integrations.

  3. Configure SAN monitoring.

    Storage area network monitoring requires vendor-specific approaches. For Fibre Channel SANs, monitor switch port statistics through SNMP. For iSCSI, monitor target availability and session state.

    Common SAN metrics to collect:

    Storage array capacity (total, used, available)
    Volume/LUN capacity utilisation
    Array controller CPU and cache utilisation
    Port throughput and errors
    Disk health and predictive failure indicators
    Replication lag (if applicable)

    For open-source storage platforms like TrueNAS, enable the Prometheus endpoint in the web interface. For commercial arrays, consult vendor documentation for monitoring integration options. Most enterprise arrays (NetApp, Dell EMC, Pure Storage, HPE) provide REST APIs that can be scraped with custom exporters or vendor-provided integrations.

Phase 4: Cloud infrastructure monitoring

Cloud infrastructure monitoring extends visibility to resources deployed in public cloud environments. Each provider offers native monitoring that can be integrated with your existing monitoring platform.

  1. Configure AWS CloudWatch integration.

    For Prometheus-based monitoring, deploy the CloudWatch Exporter or use the YACE (Yet Another CloudWatch Exporter) for more flexible configuration.

    Create an IAM user or role with the following policy:

    {
      "Version": "2012-10-17",
      "Statement": [
        {
          "Effect": "Allow",
          "Action": [
            "cloudwatch:GetMetricData",
            "cloudwatch:GetMetricStatistics",
            "cloudwatch:ListMetrics",
            "ec2:DescribeInstances",
            "ec2:DescribeVolumes",
            "rds:DescribeDBInstances",
            "elasticloadbalancing:DescribeLoadBalancers",
            "tag:GetResources"
          ],
          "Resource": "*"
        }
      ]
    }

    Configure YACE in /etc/yace/config.yml:

    discovery:
      jobs:
        - type: AWS/EC2
          regions:
            - eu-west-1
          period: 300
          length: 300
          metrics:
            - name: CPUUtilization
              statistics: [Average, Maximum]
            - name: NetworkIn
              statistics: [Sum]
            - name: NetworkOut
              statistics: [Sum]
            - name: DiskReadOps
              statistics: [Sum]
            - name: DiskWriteOps
              statistics: [Sum]
        - type: AWS/RDS
          regions:
            - eu-west-1
          period: 300
          length: 300
          metrics:
            - name: CPUUtilization
              statistics: [Average]
            - name: FreeStorageSpace
              statistics: [Average]
            - name: DatabaseConnections
              statistics: [Average]
            - name: ReadIOPS
              statistics: [Average]
            - name: WriteIOPS
              statistics: [Average]

    Add the exporter to your Prometheus configuration:

    scrape_configs:
    - job_name: 'aws-cloudwatch'
    static_configs:
    - targets: ['localhost:5000']
  2. Configure Azure Monitor integration.

    For Azure, use the Azure Monitor Exporter for Prometheus. Create a service principal with the Monitoring Reader role:

    Terminal window
    az ad sp create-for-rbac --name "prometheus-monitor" --role "Monitoring Reader" --scopes /subscriptions/YOUR_SUBSCRIPTION_ID

    Configure the exporter with subscription and resource group filters to avoid excessive API calls and costs:

    active_directory_authority_url: "https://login.microsoftonline.com/"
    resource_manager_url: "https://management.azure.com/"
    credentials:
      subscription_id: "your-subscription-id"
      client_id: "your-client-id"
      client_secret: "your-client-secret"
      tenant_id: "your-tenant-id"
    targets:
      - resource: "/subscriptions/xxx/resourceGroups/production/providers/Microsoft.Compute/virtualMachines/webserver01"
        metrics:
          - name: "Percentage CPU"
            aggregations: ["Average", "Maximum"]
          - name: "Network In Total"
            aggregations: ["Total"]
          - name: "Network Out Total"
            aggregations: ["Total"]
  3. Configure GCP Cloud Monitoring integration.

    For Google Cloud Platform, use the Stackdriver Exporter (now Cloud Monitoring Exporter). Create a service account with the Monitoring Viewer role:

    Terminal window
    gcloud iam service-accounts create prometheus-monitor --display-name="Prometheus Monitoring"
    gcloud projects add-iam-policy-binding YOUR_PROJECT_ID \
    --member="serviceAccount:prometheus-monitor@YOUR_PROJECT_ID.iam.gserviceaccount.com" \
    --role="roles/monitoring.viewer"
    gcloud iam service-accounts keys create ~/prometheus-monitor-key.json \
    --iam-account=prometheus-monitor@YOUR_PROJECT_ID.iam.gserviceaccount.com

    Configure the exporter to collect Compute Engine metrics:

    google:
      project_id: "your-project-id"
      metrics_type_prefixes:
        - "compute.googleapis.com/instance/cpu"
        - "compute.googleapis.com/instance/disk"
        - "compute.googleapis.com/instance/network"
  4. Deploy cloud-native agents for detailed visibility.

    Cloud provider monitoring APIs provide aggregate metrics with 1-5 minute granularity. For detailed system-level metrics equivalent to on-premises monitoring, deploy agents to cloud instances using the same procedures as Phase 1.

    For autoscaling groups and containerised workloads, use configuration management or cloud-init to deploy agents automatically:

    #cloud-config
    # cloud-init example for EC2 instances (#cloud-config must be the first line)
    packages:
      - prometheus-node-exporter
    runcmd:
      - systemctl enable prometheus-node-exporter
      - systemctl start prometheus-node-exporter

    Use EC2 service discovery in Prometheus to automatically find instances:

    scrape_configs:
      - job_name: 'ec2-instances'
        ec2_sd_configs:
          - region: eu-west-1
            port: 9100
            filters:
              - name: tag:Environment
                values: [production]
        relabel_configs:
          - source_labels: [__meta_ec2_tag_Name]
            target_label: instance_name
          - source_labels: [__meta_ec2_instance_id]
            target_label: instance_id
          - source_labels: [__meta_ec2_availability_zone]
            target_label: availability_zone

Phase 5: Threshold configuration

Thresholds define the boundaries between normal operation and conditions requiring attention. Effective thresholds balance sensitivity (detecting real issues) against noise (avoiding false positives).

  1. Establish baselines from collected data.

    After collecting data for 14 days, query your monitoring platform to understand normal operating ranges. For CPU utilisation:

    # Average CPU utilisation by server over 14 days
    avg_over_time(instance:node_cpu_utilisation:ratio[14d]) * 100
    # Maximum CPU utilisation peaks
    max_over_time(instance:node_cpu_utilisation:ratio[14d]) * 100
    # 95th percentile CPU utilisation
    quantile_over_time(0.95, instance:node_cpu_utilisation:ratio[14d]) * 100
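    The same baseline statistics can be reproduced offline from exported samples, which is useful when comparing against a previous monitoring system. A sketch using a nearest-rank percentile (the utilisation ratios are made-up illustrative data):

```python
import math

def percentile(values, q):
    """Nearest-rank percentile (q in 0..100) of a list of samples."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q / 100 * len(ordered)))
    return ordered[rank - 1]

# Hypothetical CPU utilisation ratio samples for one server
samples = [0.20, 0.22, 0.25, 0.24, 0.30, 0.28, 0.60, 0.85, 0.26, 0.23]

print("avg:", round(sum(samples) / len(samples), 3))
print("max:", max(samples))
print("p95:", percentile(samples, 95))
```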

    Document baseline values for each server class. A web server with average CPU of 25%, 95th percentile of 60%, and peaks of 85% has headroom for growth. A database server averaging 70% with peaks at 98% requires immediate capacity attention.

  2. Configure static thresholds.

    Static thresholds apply fixed limits appropriate for the resource type. Start with conservative thresholds and refine based on operational experience.

    For Prometheus alerting rules, create /etc/prometheus/rules/infrastructure-alerts.yml:

    groups:
      - name: infrastructure-alerts
        rules:
          # CPU alerts
          - alert: HighCPUUtilisation
            expr: instance:node_cpu_utilisation:ratio > 0.85
            for: 15m
            labels:
              severity: warning
            annotations:
              summary: "High CPU utilisation on {{ $labels.instance }}"
              description: "CPU utilisation is {{ $value | humanizePercentage }} for 15 minutes"
          - alert: CriticalCPUUtilisation
            expr: instance:node_cpu_utilisation:ratio > 0.95
            for: 5m
            labels:
              severity: critical
            annotations:
              summary: "Critical CPU utilisation on {{ $labels.instance }}"
              description: "CPU utilisation is {{ $value | humanizePercentage }} for 5 minutes"
          # Memory alerts
          - alert: HighMemoryUtilisation
            expr: instance:node_memory_utilisation:ratio > 0.85
            for: 15m
            labels:
              severity: warning
            annotations:
              summary: "High memory utilisation on {{ $labels.instance }}"
              description: "Memory utilisation is {{ $value | humanizePercentage }}"
          - alert: CriticalMemoryUtilisation
            expr: instance:node_memory_utilisation:ratio > 0.95
            for: 5m
            labels:
              severity: critical
            annotations:
              summary: "Critical memory utilisation on {{ $labels.instance }}"
              description: "Memory utilisation is {{ $value | humanizePercentage }}"
          # Disk space alerts
          - alert: DiskSpaceWarning
            expr: instance:node_filesystem_utilisation:ratio > 0.80
            for: 30m
            labels:
              severity: warning
            annotations:
              summary: "Disk space low on {{ $labels.instance }}"
              description: "Filesystem {{ $labels.mountpoint }} is {{ $value | humanizePercentage }} full"
          - alert: DiskSpaceCritical
            expr: instance:node_filesystem_utilisation:ratio > 0.90
            for: 15m
            labels:
              severity: critical
            annotations:
              summary: "Disk space critical on {{ $labels.instance }}"
              description: "Filesystem {{ $labels.mountpoint }} is {{ $value | humanizePercentage }} full"
          # Disk I/O alerts
          - alert: HighDiskLatency
            expr: instance:node_disk_read_latency:avg5m > 0.1 or instance:node_disk_write_latency:avg5m > 0.1
            for: 10m
            labels:
              severity: warning
            annotations:
              summary: "High disk latency on {{ $labels.instance }}"
              description: "Disk latency exceeds 100ms"
          # Network device alerts
          - alert: NetworkDeviceDown
            expr: probe_success{job="network-ping"} == 0
            for: 2m
            labels:
              severity: critical
            annotations:
              summary: "Network device unreachable: {{ $labels.instance }}"
          - alert: InterfaceDown
            expr: ifOperStatus == 2
            for: 5m
            labels:
              severity: warning
            annotations:
              summary: "Interface down: {{ $labels.ifDescr }} on {{ $labels.instance }}"
          - alert: HighInterfaceUtilisation
            expr: (rate(ifHCInOctets[5m]) * 8 / ifSpeed) > 0.80
            for: 15m
            labels:
              severity: warning
            annotations:
              summary: "High interface utilisation on {{ $labels.instance }}"
              description: "Interface {{ $labels.ifDescr }} is at {{ $value | humanizePercentage }} utilisation"
  3. Configure predictive thresholds for capacity.

    Predictive thresholds anticipate exhaustion before it occurs. Linear regression across historical data projects when a resource will reach capacity.

    For disk space exhaustion prediction:

    # Predict when disk will be full based on 7-day trend
    predict_linear(node_filesystem_avail_bytes{fstype=~"ext4|xfs"}[7d], 86400 * 14) < 0

    This alert fires when extrapolation predicts disk exhaustion within 14 days:

    - alert: DiskSpaceExhaustionPredicted
      expr: predict_linear(node_filesystem_avail_bytes{fstype=~"ext4|xfs"}[7d], 86400 * 14) < 0
      for: 1h
      labels:
        severity: warning
      annotations:
        summary: "Disk space exhaustion predicted on {{ $labels.instance }}"
        description: "Filesystem {{ $labels.mountpoint }} predicted to exhaust within 14 days"
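predict_linear() fits a least-squares line over the range and extrapolates it forward. The underlying logic can be sketched as follows (the sample series is made up: a filesystem losing 10 GiB of free space per day):

```python
def predict_linear(samples, horizon_s):
    """Least-squares extrapolation of (timestamp, value) samples to horizon_s
    seconds past the last sample, mirroring PromQL's predict_linear()."""
    n = len(samples)
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    cov = sum((t - mean_t) * (v - mean_v) for t, v in samples)
    var = sum((t - mean_t) ** 2 for t, _ in samples)
    slope = cov / var
    intercept = mean_v - slope * mean_t
    return slope * (samples[-1][0] + horizon_s) + intercept

# Hypothetical series: free bytes sampled daily, shrinking by 10 GiB/day
day = 86_400
series = [(0, 100e9), (day, 90e9), (2 * day, 80e9), (3 * day, 70e9)]

projected = predict_linear(series, 14 * day)
print(projected < 0)  # → True: exhaustion predicted within 14 days
```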

Phase 6: Dashboard creation

Dashboards provide visual representation of infrastructure health for operational monitoring and capacity planning.

  1. Import standard dashboards.

    For Grafana with Prometheus, import community dashboards as starting points:

    • Node Exporter Full (Dashboard ID 1860): Comprehensive Linux server metrics
    • Windows Exporter Dashboard (Dashboard ID 14694): Windows server metrics
    • SNMP Device Dashboard (Dashboard ID 11169): Network device metrics

    To import, navigate to Dashboards → Import in Grafana, enter the dashboard ID, and select your Prometheus data source.

  2. Create an infrastructure overview dashboard.

    An overview dashboard provides at-a-glance status for all infrastructure. Create a new dashboard in Grafana and add the following panels:

    Server health matrix. Use a stat panel with the following query to show server count by health status:

    # Healthy servers (CPU < 85%, Memory < 85%, Disk < 80%)
    count(
    instance:node_cpu_utilisation:ratio < 0.85
    and instance:node_memory_utilisation:ratio < 0.85
    and instance:node_filesystem_utilisation:ratio < 0.80
    )

    Network device status. Use a stat panel showing device availability:

    # Available devices
    count(probe_success{job="network-ping"} == 1)
    # Unavailable devices
    count(probe_success{job="network-ping"} == 0)

    Resource utilisation heatmap. Use a table panel showing current utilisation across servers:

    # Query for table
    (
    label_replace(instance:node_cpu_utilisation:ratio * 100, "metric", "CPU", "", "")
    or
    label_replace(instance:node_memory_utilisation:ratio * 100, "metric", "Memory", "", "")
    or
    label_replace(max by (instance) (instance:node_filesystem_utilisation:ratio) * 100, "metric", "Disk", "", "")
    )
  3. Create capacity trend dashboards.

    Capacity dashboards show historical trends and projections for planning purposes.

    Disk growth trend panel. Use a time series panel with 30-day range:

    node_filesystem_size_bytes{fstype=~"ext4|xfs"} - node_filesystem_avail_bytes{fstype=~"ext4|xfs"}

    Add an annotation layer showing alert firings to correlate capacity changes with events.

    Memory trend panel. Show memory utilisation over time with baseline reference:

    instance:node_memory_utilisation:ratio * 100

    Add threshold lines at 85% and 95% using panel overrides.

Verification

After completing implementation, verify that monitoring data is flowing correctly and thresholds are functioning.

Verify metric collection. Query each metric type to confirm data exists:

# Server metrics present
count(node_cpu_seconds_total)
count(windows_cpu_time_total)
# Network metrics present
count(ifHCInOctets)
count(probe_success{job="network-ping"})
# Storage metrics present
count(node_filesystem_size_bytes)

Expected result: Each query returns a count greater than zero matching the number of monitored resources.

Verify scrape targets are healthy. In Prometheus, navigate to Status → Targets. All targets should show state “UP” with recent scrape timestamps. Any targets showing “DOWN” indicate collection failures requiring investigation.
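Target health can also be checked programmatically through the /api/v1/targets endpoint rather than the web interface. A sketch that summarises such a response (the embedded JSON is a trimmed, made-up example of the response shape):

```python
import json

# Trimmed, illustrative response body from GET /api/v1/targets
response_text = """
{
  "status": "success",
  "data": {
    "activeTargets": [
      {"labels": {"instance": "webserver01.example.org:9100", "job": "linux-servers"}, "health": "up"},
      {"labels": {"instance": "switch01.example.org", "job": "network-devices"}, "health": "down"}
    ]
  }
}
"""

def down_targets(text):
    """Return instance labels of active targets whose health is not 'up'."""
    targets = json.loads(text)["data"]["activeTargets"]
    return [t["labels"]["instance"] for t in targets if t["health"] != "up"]

print(down_targets(response_text))  # → ['switch01.example.org']
```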

Verify alerting rules are loaded. Navigate to Status → Rules in Prometheus. All rule groups should show status “ok” with evaluation timestamps.

Test alert firing. Temporarily lower a threshold to trigger an alert:

# Temporarily set CPU threshold to 5% to verify alerting
- alert: TestAlert
  expr: instance:node_cpu_utilisation:ratio > 0.05
  for: 1m

Verify the alert appears in the Alerts view, then remove the test rule.

Verify dashboard data population. Open each created dashboard and confirm panels display data without errors. Panels showing “No data” or query errors indicate configuration problems.

Test data retention. Query historical data matching your retention configuration:

# Query data from 7 days ago
node_cpu_seconds_total offset 7d

If retention is configured for 15 days, queries within that window (such as the 7-day offset above) should return data, while queries beyond it return nothing.

Troubleshooting

| Symptom | Cause | Resolution |
| --- | --- | --- |
| Node Exporter returns connection refused | Service not running | Check systemctl status prometheus-node-exporter. Start if stopped. Check journal logs with journalctl -u prometheus-node-exporter |
| Prometheus target shows as DOWN | Network connectivity or firewall blocking | Verify connectivity with curl http://target:port/metrics. Check firewall rules on target and intermediate devices |
| SNMP metrics not appearing | Incorrect community string or ACL | Test SNMP manually: snmpwalk -v2c -c community device_ip 1.3.6.1.2.1.1. Verify community string and source IP ACL on device |
| CloudWatch metrics delayed by 10+ minutes | API polling interval or propagation delay | CloudWatch metrics have inherent 1-5 minute delay. Verify exporter polling interval. Check AWS service health dashboard |
| Dashboard panels show "No data" | Metric name mismatch or label selector error | Check metric exists in Prometheus expression browser. Verify label selectors match actual labels |
| Alert firing but no notification | Alertmanager configuration issue | Verify Alertmanager is running and connected. Check alertmanager.yml routing and receiver configuration |
| High cardinality warning in Prometheus | Too many label combinations | Review metrics with high label cardinality. Use relabeling to drop unnecessary labels. Consider aggregating recording rules |
| Disk space exhaustion on monitoring server | Retention too long or sample rate too high | Reduce --storage.tsdb.retention.time. Decrease scrape interval for non-critical metrics. Review and prune unused metrics |
| SNMP Exporter timeout errors | Network latency or slow device response | Increase timeout in snmp.yml. Reduce metrics collected per scrape. Check device CPU during polling |
| Windows Exporter service fails to start | Missing dependencies or port conflict | Check Windows Event Log for errors. Verify .NET Framework 4.5+ installed. Check port 9182 availability |
| Recording rules not evaluating | Syntax error or dependency failure | Check Prometheus logs for rule evaluation errors. Validate YAML syntax. Ensure source metrics exist |
| Cloud service discovery returns no targets | IAM permissions or tag filters | Verify service account permissions. Check filter configuration matches actual resource tags |

See also