
Monitoring Strategy

Monitoring strategy establishes the architectural patterns, data collection mechanisms, and analysis approaches that provide visibility into IT service health and infrastructure performance. The strategy determines what gets measured, how measurements flow through the monitoring ecosystem, and how those measurements translate into actionable information for operations teams. Organisations without a coherent monitoring strategy accumulate disconnected tools that generate noise rather than insight, creating gaps in coverage while simultaneously overwhelming staff with irrelevant alerts.

Monitoring
The systematic collection, aggregation, and analysis of metrics, logs, and traces from IT systems to detect conditions requiring attention and to inform operational decisions.
Metric
A numerical measurement captured at regular intervals representing a specific aspect of system behaviour. Metrics have names, values, timestamps, and optional dimensional tags.
Observability
The degree to which a system’s internal state can be inferred from its external outputs. High observability requires instrumentation that exposes meaningful signals about system behaviour.
Telemetry
The automated collection and transmission of measurement data from remote systems to centralised analysis infrastructure.
Baseline
A statistical characterisation of normal system behaviour derived from historical metric data, used as reference for detecting anomalies.
Threshold
A boundary value that triggers alerting or automated response when a metric crosses it. Static thresholds use fixed values; dynamic thresholds adjust based on baseline patterns.
Cardinality
The number of unique combinations of dimensional tag values for a metric. High-cardinality metrics consume more storage and query resources.

Monitoring Objectives

Monitoring exists to answer operational questions with sufficient speed and accuracy that staff can maintain service quality. The fundamental questions fall into four categories, each requiring different data collection and analysis approaches.

Availability monitoring answers whether services are accessible and functioning. This requires synthetic checks that probe endpoints from external vantage points, simulating user access patterns. A web application availability check issues HTTP requests every 60 seconds from three geographic regions, expecting 200-status responses within 2 seconds. Availability calculations derive from the ratio of successful checks to total checks over a measurement period. An application returning successful responses for 43,150 of 43,200 checks in a 30-day period achieves 99.88% availability: the 50 failed checks, each representing one 60-second interval, translate to approximately 50 minutes of detected downtime.
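The availability arithmetic above can be sketched in a few lines (function names are illustrative, not from any particular platform):

```python
def availability_pct(successful: int, total: int) -> float:
    """Availability as the percentage of successful checks."""
    return 100.0 * successful / total

def detected_downtime_minutes(successful: int, total: int,
                              interval_seconds: int = 60) -> float:
    """Each failed check represents one polling interval of detected downtime."""
    return (total - successful) * interval_seconds / 60.0

# 30 days of checks at 60-second intervals, as in the worked figures above
print(round(availability_pct(43_150, 43_200), 2))   # 99.88
print(detected_downtime_minutes(43_150, 43_200))    # 50.0
```

Note that downtime shorter than the check interval goes undetected, which is why check frequency bounds the precision of any availability figure.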

Performance monitoring answers whether services respond within acceptable parameters. Response time, throughput, and resource utilisation form the core performance dimensions. Response time monitoring captures latency distributions rather than simple averages because averages obscure the experience of affected users. A service averaging 200ms response time might have a 99th percentile of 3 seconds, meaning 1% of requests experience unacceptable delays. Throughput monitoring tracks transactions per second, requests per minute, or similar rate metrics that indicate capacity consumption. Resource utilisation monitoring measures CPU, memory, disk, and network consumption to identify constraints before they impact performance.
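The gap between averages and percentiles is easy to demonstrate. The sketch below uses a simple nearest-rank percentile and a deliberately constructed latency sample with a small slow tail (all numbers are illustrative):

```python
def nearest_rank_percentile(samples: list[float], p: float) -> float:
    """Nearest-rank percentile: the value below which roughly p% of samples fall."""
    ordered = sorted(samples)
    rank = max(1, round(p / 100 * len(ordered)))
    return ordered[rank - 1]

# 985 fast requests plus a small slow tail: the mean hides the tail, p99 does not
latencies = [0.2] * 985 + [3.0] * 15
avg = sum(latencies) / len(latencies)
print(f"mean={avg:.3f}s  p50={nearest_rank_percentile(latencies, 50)}s  "
      f"p99={nearest_rank_percentile(latencies, 99)}s")
# mean=0.242s  p50=0.2s  p99=3.0s
```

A dashboard showing only the 0.242-second mean would miss that 1.5% of users wait 3 seconds, which is exactly why latency should be captured as distributions.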

Health monitoring answers whether infrastructure components operate within normal parameters. Unlike availability monitoring, which tests external behaviour, health monitoring examines internal state. A database server health check verifies connection pool utilisation, replication lag, query queue depth, and storage consumption. Health monitoring enables proactive intervention before degraded internal state manifests as external failure.

Business monitoring answers whether IT services support organisational outcomes. Transaction completion rates, data processing volumes, and workflow throughput connect technical metrics to programme delivery. A beneficiary registration system tracking 2,400 daily registrations can alert when volumes drop below 1,800, indicating potential access problems even when all technical health checks pass.

These objectives create a monitoring hierarchy where business outcomes depend on service availability, which depends on component health, which depends on infrastructure performance. Effective strategy addresses all layers while maintaining clear relationships between them.

+------------------------------------------------------------------------+
| MONITORING HIERARCHY |
+------------------------------------------------------------------------+
| |
| +------------------------------------------------------------------+ |
| | BUSINESS MONITORING | |
| | Transaction volumes, completion rates, SLA achievement | |
| +------------------------------------------------------------------+ |
| | | | |
| v v v |
| +------------------------------------------------------------------+ |
| | AVAILABILITY MONITORING | |
| | Synthetic checks, endpoint probing, user journey simulation | |
| +------------------------------------------------------------------+ |
| | | | |
| v v v |
| +------------------------------------------------------------------+ |
| | HEALTH MONITORING | |
| | Component state, internal metrics, dependency status | |
| +------------------------------------------------------------------+ |
| | | | |
| v v v |
| +------------------------------------------------------------------+ |
| | PERFORMANCE MONITORING | |
| | Latency, throughput, utilisation, saturation, errors | |
| +------------------------------------------------------------------+ |
| |
+------------------------------------------------------------------------+

Figure 1: Monitoring hierarchy showing dependency relationships between monitoring types

Monitoring Architecture Patterns

Monitoring architecture determines how telemetry flows from sources through collection, storage, and analysis to action. Three patterns dominate: centralised, distributed, and hybrid. The choice depends on organisational scale, geographic distribution, network topology, and operational requirements.

Centralised Architecture

Centralised monitoring routes all telemetry to a single collection and analysis platform. Every monitored endpoint transmits metrics, logs, and traces to central infrastructure, typically hosted in a primary data centre or cloud region. This pattern simplifies operations by providing a single interface for all monitoring data and eliminates the need to correlate information across multiple systems.

+--------------------------------------------------------------------------+
| CENTRALISED MONITORING |
+--------------------------------------------------------------------------+
| |
| FIELD OFFICES HEADQUARTERS CLOUD |
| |
| +-------------+ +----------------+ +-------------+ |
| | Nairobi | | | | Azure | |
| | Servers +------------->| |<-----+ VMs | |
| | Network | | | | | |
| +-------------+ | CENTRAL | +-------------+ |
| | MONITORING | |
| +-------------+ | PLATFORM | +-------------+ |
| | Kampala | | | | AWS | |
| | Servers +------------->| - Collection |<-----+ Containers | |
| | Network | | - Storage | | | |
| +-------------+ | - Analysis | +-------------+ |
| | - Alerting | |
| +-------------+ | - Dashboards | +-------------+ |
| | Juba | | | | SaaS | |
| | Servers +------------->| |<-----+ Apps | |
| | Network | +-------+--------+ | | |
| +-------------+ | +-------------+ |
| | |
| v |
| +--------+--------+ |
| | OPERATIONS | |
| | TEAM | |
| +-----------------+ |
+--------------------------------------------------------------------------+

Figure 2: Centralised monitoring architecture with all telemetry flowing to single platform

Centralised architecture works well for organisations with reliable network connectivity between all locations and the central platform. The pattern fails when network links are unreliable or bandwidth-constrained. A field office connected via satellite with 512 Kbps bandwidth cannot transmit detailed telemetry to headquarters without consuming capacity needed for programme operations. Centralised platforms also create single points of failure: if the monitoring infrastructure becomes unavailable, visibility into all systems disappears simultaneously.

Distributed Architecture

Distributed monitoring deploys independent monitoring capabilities at each significant location or network segment. Each instance collects, stores, and analyses telemetry locally, generating alerts and dashboards for local operations. Federation mechanisms aggregate summary data to provide organisation-wide visibility without requiring all raw telemetry to traverse the network.

+------------------------------------------------------------------------------+
| DISTRIBUTED MONITORING |
+------------------------------------------------------------------------------+
| |
| NAIROBI KAMPALA JUBA |
| +------------------+ +------------------+ +------------------+ |
| | | | | | | |
| | +------------+ | | +------------+ | | +------------+ | |
| | | Local | | | | Local | | | | Local | | |
| | | Servers | | | | Servers | | | | Servers | | |
| | +-----+------+ | | +-----+------+ | | +-----+------+ | |
| | | | | | | | | | |
| | v | | v | | v | |
| | +------------+ | | +------------+ | | +------------+ | |
| | | Monitoring | | | | Monitoring | | | | Monitoring | | |
| | | Instance | | | | Instance | | | | Instance | | |
| | | - Collect | | | | - Collect | | | | - Collect | | |
| | | - Store | | | | - Store | | | | - Store | | |
| | | - Alert | | | | - Alert | | | | - Alert | | |
| | +-----+------+ | | +-----+------+ | | +-----+------+ | |
| | | | | | | | | | |
| +-------+----------+ +-------+----------+ +-------+----------+ |
| | | | |
| | SUMMARY METRICS | | |
| +------------+-------------+-------------+------------+ |
| | | |
| v v |
| +-------+---------------------------+-------+ |
| | FEDERATION LAYER | |
| | - Aggregated dashboards | |
| | - Cross-site correlation | |
| | - Organisation-wide reporting | |
| +-------------------------------------------+ |
+------------------------------------------------------------------------------+

Figure 3: Distributed monitoring with local instances and federation layer

Distributed architecture suits organisations with autonomous regional operations, unreliable inter-site connectivity, or data sovereignty requirements that prevent telemetry from leaving specific jurisdictions. The pattern increases operational complexity because each instance requires maintenance, and troubleshooting cross-site issues demands correlating data from multiple sources. Federation reduces but does not eliminate this complexity.

Hybrid Architecture

Hybrid monitoring combines centralised collection for systems with reliable connectivity with distributed instances for locations where bandwidth or reliability constraints prevent central forwarding. Edge locations run lightweight monitoring that stores data locally and forwards summaries when connectivity permits, while well-connected infrastructure transmits full telemetry to central platforms.

+--------------------------------------------------------------------------+
| HYBRID MONITORING |
+--------------------------------------------------------------------------+
| |
| WELL-CONNECTED EDGE LOCATIONS |
| INFRASTRUCTURE |
| |
| +-------------+ +------------------+ |
| | Cloud | | Remote Clinic | |
| | Workloads +---+ | +------------+ | |
| +-------------+ | | | Minimal | | |
| | | | Collector | | |
| +-------------+ | +---------------+ | +-----+------+ | |
| | HQ | | | | | | | |
| | Servers +---+-->| CENTRAL | | +-----v------+ | |
| +-------------+ | | PLATFORM | | | Local | | |
| | | | | | Storage | | |
| +-------------+ | | Full | | | (7 days) | | |
| | Regional | | | Telemetry | | +-----+------+ | |
| | Offices +---+ | | | | | |
| +-------------+ +-------+-------+ +-------+----------+ |
| | | |
| FULL TELEMETRY | | SUMMARY ONLY |
| (real-time) | | (when connected) |
| v v |
| +-------+---------------------+-------+ |
| | UNIFIED DASHBOARDS | |
| | - Full detail for connected sites | |
| | - Summary view for edge sites | |
| | - Gap indicators for offline | |
| +-------------------------------------+ |
+--------------------------------------------------------------------------+

Figure 4: Hybrid architecture combining centralised and edge monitoring

The hybrid pattern matches the reality of organisations operating across varied connectivity conditions. Implementation requires clear policies defining which systems use which pattern, mechanisms for handling the transition when edge locations gain or lose connectivity, and dashboard designs that accommodate mixed data fidelity.

Data Collection Mechanisms

Telemetry reaches monitoring platforms through three primary mechanisms: agent-based collection, agentless protocols, and API integration. Each mechanism suits different system types and operational constraints.

Agent-Based Collection

Monitoring agents are software components installed on monitored systems that collect telemetry locally and transmit it to collection infrastructure. Agents provide the deepest visibility because they execute within the monitored environment, accessing process-level metrics, file system details, and application internals unavailable through network protocols.

Agent deployment follows push or pull models. Push agents transmit telemetry on schedules they control, initiating connections to collection endpoints. This model works through firewalls that permit outbound connections and enables agents to buffer data during collection endpoint unavailability. Pull agents expose metrics endpoints that collection infrastructure queries on its schedule. Pull models give central infrastructure control over collection timing and reduce agent complexity but require network paths permitting inbound connections to monitored systems.

Agent resource consumption requires consideration during selection and configuration. A Prometheus node exporter on Linux consumes approximately 10-15 MB of memory and negligible CPU during normal operation. Heavier agents providing application performance monitoring, log collection, and security functions together consume 200-500 MB of memory and 2-5% of CPU capacity. On resource-constrained field servers, this overhead affects application performance.

Configuration management for agents across distributed infrastructure presents operational challenges. Organisations with 200 servers require mechanisms to deploy agents, distribute configuration updates, and verify agent health. Configuration management tools like Ansible, Puppet, or Salt handle agent lifecycle, while monitoring platforms themselves should provide visibility into agent status across the fleet.

Agentless Collection

Agentless monitoring queries systems using network protocols without requiring software installation on monitored targets. SNMP remains the dominant protocol for network device monitoring, providing standardised access to device metrics through management information bases. SNMP v3 adds encryption and authentication absent in earlier versions and should be the minimum for production deployment.

WMI and WinRM enable agentless Windows server monitoring, exposing performance counters, event logs, and system state through authenticated remote queries. SSH-based collection executes commands on Unix-like systems, parsing output to extract metrics. IPMI provides hardware-level monitoring of server components including temperature, fan speed, and power consumption independent of operating system state.

Agentless approaches reduce deployment complexity and eliminate agent maintenance overhead. The trade-off is reduced visibility: network protocols expose only what systems publish through their management interfaces, missing application-internal metrics available to agents. Agentless collection also generates network traffic for each collection cycle, which aggregates significantly when polling hundreds of devices every minute.
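The aggregate polling load is worth estimating before deployment. A rough sketch, with illustrative numbers (actual SNMP response sizes vary widely by template):

```python
def polling_kbps(devices: int, bytes_per_poll: int, interval_seconds: int) -> float:
    """Average sustained link load from agentless polling across a device fleet."""
    return devices * bytes_per_poll * 8 / interval_seconds / 1000

# Illustrative: 300 devices returning ~4 KB of SNMP data each, polled every minute
print(polling_kbps(300, 4_000, 60))  # 160.0 kbps sustained
```

On a well-provisioned LAN 160 kbps is negligible; on the 512 kbps satellite link described earlier it would consume nearly a third of available capacity, which is the kind of result that pushes field sites toward local collection.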

API Integration

Modern applications and cloud services expose telemetry through APIs rather than traditional monitoring protocols. Cloud provider APIs report virtual machine metrics, managed database performance, storage utilisation, and service health. SaaS applications provide API endpoints for usage metrics, error rates, and integration status. Custom applications instrument their code to emit metrics through libraries that expose API endpoints or push to collection infrastructure.

API-based collection requires authentication credential management, rate limit awareness, and handling of API versioning and deprecation. Cloud provider APIs enforce rate limits that constrain collection frequency for large-scale deployments; AWS CloudWatch permits 400 GetMetricData transactions per second per account, which organisations with thousands of resources can exhaust if collection intervals are aggressive.

OpenTelemetry has emerged as the standard instrumentation framework for custom application telemetry. Applications instrumented with OpenTelemetry emit traces, metrics, and logs in vendor-neutral formats that any compatible collection platform can ingest. Adopting OpenTelemetry for new development avoids lock-in to specific monitoring vendors while providing rich observability data.

Metric Selection and Design

Effective monitoring requires deliberate metric selection rather than collecting everything available. Excessive metrics create storage costs, query performance problems, and cognitive overload without improving operational capability. Strategic metric selection focuses on signals that inform decisions.

The USE Method

The USE method provides a framework for infrastructure metric selection: Utilisation, Saturation, and Errors for each resource type. Utilisation measures the proportion of resource capacity currently consumed. Saturation measures the degree to which work is queued waiting for the resource. Errors count failures in resource operations.

For a database server, USE metrics include CPU utilisation percentage, disk I/O utilisation, and network interface utilisation; saturation metrics include query queue depth, connection pool wait time, and disk I/O queue length; error metrics include failed queries, connection failures, and storage errors. This structured approach ensures coverage of failure modes while limiting metric proliferation.

The RED Method

The RED method provides a framework for service metric selection: Rate, Errors, and Duration. Rate measures requests per second processed by the service. Errors measures the proportion of requests that fail. Duration measures request latency, typically captured as histograms or percentile distributions.

For an API service handling beneficiary data queries, RED metrics include requests per minute by endpoint, error rate by response code category, and latency at 50th, 95th, and 99th percentiles. These metrics directly indicate user experience and service health without requiring understanding of underlying infrastructure.

Metric Naming and Dimensionality

Consistent metric naming enables discovery and correlation across systems. Names should follow a hierarchical pattern indicating the source system, component, and measurement type. The pattern {system}_{component}_{measurement}_{unit} produces names like registration_api_requests_total, registration_api_latency_seconds, and registration_database_connections_active.
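A small helper can enforce the naming pattern at instrumentation time (a hedged sketch; the function and its normalisation rules are illustrative, not part of any monitoring library):

```python
def metric_name(system: str, component: str, measurement: str,
                unit: str = "") -> str:
    """Build a {system}_{component}_{measurement}_{unit} metric name."""
    parts = [system, component, measurement] + ([unit] if unit else [])
    return "_".join(p.strip().lower().replace("-", "_").replace(" ", "_")
                    for p in parts)

print(metric_name("registration", "api", "latency", "seconds"))
# registration_api_latency_seconds
print(metric_name("registration", "database", "connections active"))
# registration_database_connections_active
```

Centralising name construction like this keeps naming drift out of dashboards and alert rules, where inconsistent names are hardest to untangle.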

Dimensional tags add context to metrics without creating separate metric names for each variation. A request counter tagged with endpoint, method, and status_code dimensions enables slicing by any combination: total requests, requests by endpoint, requests by status, or requests for a specific endpoint with error status. Each unique tag combination creates a separate time series, so high-cardinality dimensions like user ID or request ID explode storage requirements and should be avoided in metrics, instead appearing in logs or traces.

A registration API receiving 10,000 requests per minute across 15 endpoints, 4 HTTP methods, and 10 status code categories generates up to 600 unique time series for a single counter metric (15 × 4 × 10). Adding a region dimension with 8 values increases this to 4,800 time series. Adding request ID as a dimension would create 10,000 new series per minute, rapidly consuming storage allocation.
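The cardinality arithmetic in this example is just a product of tag cardinalities, which makes it easy to check before adding a dimension:

```python
from math import prod

def series_upper_bound(dimension_cardinalities: list[int]) -> int:
    """Worst-case time series count: the product of each tag's unique values."""
    return prod(dimension_cardinalities)

print(series_upper_bound([15, 4, 10]))     # 600: endpoints x methods x statuses
print(series_upper_bound([15, 4, 10, 8]))  # 4800: after adding an 8-value region tag
```

Running this check against any proposed unbounded dimension (user ID, request ID) makes the explosion obvious before it reaches production storage.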

Threshold Strategies

Thresholds convert metric values into actionable signals by defining boundaries between normal and abnormal states. Threshold strategy significantly impacts both alert quality and operational workload.

Static Thresholds

Static thresholds use fixed values determined through capacity planning, vendor recommendations, or operational experience. A disk volume alerts when utilisation exceeds 85%, regardless of historical patterns or time of day. Static thresholds work well for absolute limits (disk full at 100%, memory exhausted at 100%) and for metrics with stable baselines.

Setting appropriate static thresholds requires understanding both the metric’s meaning and the operational context. CPU utilisation at 90% indicates healthy capacity consumption for batch processing workloads but signals imminent saturation for latency-sensitive services. Default thresholds from monitoring tools rarely match organisational requirements and should be reviewed during implementation.

Dynamic Thresholds

Dynamic thresholds adjust based on historical patterns, detecting deviations from baseline behaviour rather than violations of fixed limits. A service processing 2,000 requests per minute during business hours and 200 requests per minute overnight would trigger static threshold alerts during normal overnight operation if the threshold were set based on daytime patterns. Dynamic thresholds learn that 200 requests per minute is normal at 03:00 and alert only when overnight traffic deviates significantly from this baseline.

Dynamic threshold calculation requires sufficient historical data to establish patterns, typically 2-4 weeks of baseline data before reliable anomaly detection. The algorithms range from simple standard deviation calculations to sophisticated machine learning models. Most monitoring platforms offer dynamic thresholds as features, though the underlying mechanisms vary in sophistication.

Dynamic thresholds can generate false positives during legitimate changes. A marketing campaign doubling traffic triggers anomaly alerts even though the increase reflects intended business activity. Suppression mechanisms during known events and feedback loops that update baselines reduce this problem.

Threshold Worked Example

Consider CPU utilisation monitoring for an application server cluster:

The cluster contains 8 servers, each with 16 CPU cores, running a web application. Historical analysis shows utilisation averaging 35% during business hours with peaks reaching 60% during report generation at month-end. Overnight utilisation averages 10% for scheduled batch processing.

Static threshold approach: Set warning at 75% and critical at 90%. This catches genuine overload conditions but misses anomalies like 50% utilisation at 03:00 (which indicates unexpected batch job failure or malicious activity).

Dynamic threshold approach: Baseline learns hourly patterns across the week. Alert when current utilisation exceeds 2 standard deviations from the same hour’s historical mean. This catches both absolute overload and pattern deviation but requires tuning to avoid alerting during legitimate variation.

Hybrid approach: Static critical threshold at 90% (never acceptable regardless of baseline), dynamic thresholds for pattern deviation, and static minimum threshold at 5% (catch complete workload failures where all traffic stops).
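The hybrid approach can be sketched as a single evaluation function (the thresholds, baseline values, and message strings are illustrative; production platforms apply the same logic per hour-of-week bucket):

```python
from statistics import mean, stdev

def evaluate_cpu(value: float, hour_baseline: list[float],
                 critical: float = 90.0, minimum: float = 5.0,
                 k: float = 2.0) -> str:
    """Hybrid check: static ceiling and floor plus a dynamic baseline band."""
    if value >= critical:
        return "critical: absolute overload"
    if value <= minimum:
        return "critical: workload appears to have stopped"
    mu, sigma = mean(hour_baseline), stdev(hour_baseline)
    if abs(value - mu) > k * sigma:
        return "warning: deviation from this hour's baseline"
    return "ok"

# Four weeks of 03:00 samples (percent CPU): quiet overnight batch processing
baseline_03 = [9.5, 10.2, 11.0, 9.8, 10.5, 10.1, 9.9, 10.4, 10.0, 9.7,
               10.3, 10.6, 9.6, 10.2, 10.1, 9.9, 10.0, 10.4, 9.8, 10.3,
               10.5, 9.7, 10.1, 10.2, 10.0, 9.9, 10.3, 10.1]

print(evaluate_cpu(10.2, baseline_03))  # ok
print(evaluate_cpu(50.0, baseline_03))  # warning: pattern deviation at 03:00
print(evaluate_cpu(92.0, baseline_03))  # critical: absolute overload
```

The ordering matters: static limits are checked first so that an absolute overload never gets reclassified as a mere baseline deviation.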

Technology Options

Monitoring tool selection balances capability requirements against operational capacity and cost constraints. Open source platforms provide full functionality without licensing costs but require operational investment. Commercial platforms and managed services reduce operational burden at financial cost.

Open Source Platforms

Prometheus serves as the foundation for most open source monitoring architectures. It implements a pull-based collection model with a powerful query language (PromQL) for metric analysis. Prometheus stores time-series data in a purpose-built local database and provides alerting through its Alertmanager component. Single-instance Prometheus handles millions of active time series on modest hardware: 8 cores and 32 GB RAM support 10 million series with 30-day retention.

Grafana provides visualisation and dashboarding for data from Prometheus and numerous other sources. Its plugin architecture supports diverse data sources including cloud monitoring APIs, databases, and specialised monitoring systems. Grafana’s alerting capabilities have matured significantly, offering an alternative to Prometheus Alertmanager for organisations preferring unified configuration.

Victoria Metrics offers Prometheus-compatible storage with superior compression and query performance for large-scale deployments. Organisations exceeding single-instance Prometheus capacity can migrate to Victoria Metrics with minimal query modification. Victoria Metrics also provides better support for long-term retention, reducing storage costs for compliance-driven retention requirements.

Zabbix provides traditional infrastructure monitoring with extensive device support, template libraries, and integrated alerting. Its agent-based architecture suits environments where pull-based collection faces network constraints. Zabbix’s learning curve is steeper than Prometheus/Grafana, but its all-in-one approach simplifies deployment for organisations without dedicated monitoring specialists.

Commercial Platforms

Datadog provides comprehensive monitoring as a managed service covering infrastructure, applications, logs, and security. Pricing scales with host count and data volume, making cost prediction difficult for growing organisations. A 50-server deployment with application monitoring, log management, and standard retention costs approximately $15,000-25,000 annually. Datadog offers nonprofit discounts through application.

New Relic offers similar capabilities with a consumption-based pricing model charging per GB of ingested data. This model benefits organisations with efficient instrumentation but creates cost uncertainty for those with verbose logging or high-cardinality metrics.

Elastic Observability (formerly Elastic APM and Metrics) builds on the Elastic Stack, available as managed service or self-hosted. Organisations already using Elasticsearch for log management can extend to metrics and traces without additional platform complexity.

Selection Criteria

Operational capacity
Self-hosted platforms require Linux administration, storage management, and upgrade maintenance. Managed services eliminate this overhead.
Data sovereignty
Self-hosted platforms keep telemetry within organisational infrastructure. Managed services store data in vendor infrastructure, typically in US or EU regions.
Scale requirements
Prometheus single-instance supports most small-to-medium deployments. Beyond 50 million active time series, federated or clustered solutions become necessary.
Integration requirements
Evaluate pre-built integrations for existing infrastructure. Cloud-native platforms offer deeper cloud provider integration than generic tools.
Query capability
Complex analysis and correlation require sophisticated query languages. Prometheus PromQL provides powerful capabilities; some commercial platforms offer simplified interfaces at the cost of flexibility.
Alert routing
Integration with existing notification channels (Slack, PagerDuty, SMS, email) and on-call management systems.

Integration with Service Management

Monitoring creates value through integration with service management processes. Isolated monitoring that generates alerts without connecting to incident management, problem management, and change management delivers limited operational benefit.

Incident Management Integration

Monitoring systems should create incident tickets automatically for alerts requiring human response. The integration maps alert severity to incident priority, routes tickets to appropriate assignment groups based on affected service, and populates tickets with diagnostic context from the alert. Bidirectional integration updates alert status when tickets progress, preventing duplicate alerts for acknowledged issues.

Alert correlation reduces ticket volume by grouping related alerts into single incidents. A storage failure generating alerts from the storage array, dependent databases, and affected applications should create one incident with context about all symptoms rather than a dozen tickets requiring manual correlation.
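At its simplest, correlation is grouping alerts by a shared root resource. The sketch below assumes each alert already carries a topology-derived `root` field; real platforms infer this from dependency maps or time-window heuristics, so this is a minimal illustration of the grouping step only:

```python
def correlate(alerts: list[dict]) -> dict[str, list[str]]:
    """Group alerts that share a suspected root resource into one incident."""
    incidents: dict[str, list[str]] = {}
    for alert in alerts:
        incidents.setdefault(alert["root"], []).append(alert["source"])
    return incidents

alerts = [
    {"source": "storage-array-01", "root": "storage-array-01"},
    {"source": "db-primary",       "root": "storage-array-01"},
    {"source": "registration-api", "root": "storage-array-01"},
    {"source": "mail-gateway",     "root": "mail-gateway"},
]
print(correlate(alerts))
# two candidate incidents: the storage failure cascade, plus one unrelated alert
```

Four raw alerts become two tickets, with the storage incident carrying all three symptoms as context for the responder.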

Problem Management Integration

Trend analysis and anomaly detection from monitoring data identify problem candidates before recurring incidents make them obvious. Gradual performance degradation, increasing error rates, or growing resource utilisation indicate underlying issues warranting investigation. Monitoring platforms should provide interfaces for exporting this analytical data to problem management processes.

Historical monitoring data supports root cause analysis by providing the evidence of system state before, during, and after incidents. Retention policies should preserve sufficient history for meaningful analysis, typically 30-90 days of high-resolution data with longer retention at reduced resolution.

Change Management Integration

Monitoring provides change verification through automated comparison of metrics before and after changes. A database upgrade change should include monitoring validation confirming query latency, error rates, and connection pool behaviour remain within expected parameters post-implementation. Change windows should suppress non-critical alerts for affected systems while maintaining critical threshold monitoring.

Deployment annotations in monitoring platforms mark when changes occurred, enabling correlation between deployments and subsequent metric shifts. When latency increases 30% following a deployment, the annotation provides immediate context that would otherwise require cross-referencing change records.
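The before/after comparison behind change verification reduces to a tolerance check on summary statistics. A minimal sketch with illustrative samples and a hypothetical 10% tolerance (real verification should compare percentiles and error rates as well, not just means):

```python
from statistics import mean

def change_regressed(before: list[float], after: list[float],
                     tolerance: float = 0.10) -> bool:
    """Flag a change when mean latency rises by more than the tolerance fraction."""
    return (mean(after) - mean(before)) / mean(before) > tolerance

pre_deploy  = [0.21, 0.19, 0.20, 0.22, 0.20]   # seconds, sampled before the change
post_deploy = [0.27, 0.26, 0.28, 0.25, 0.27]   # roughly 30% higher afterwards

print(change_regressed(pre_deploy, post_deploy))  # True: fails verification
```

Anchoring the "after" window to the deployment annotation timestamp is what makes this comparison automatic rather than a manual cross-reference against change records.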

Implementation Considerations

For Organisations with Limited IT Capacity

Single-person IT departments cannot operate sophisticated distributed monitoring platforms alongside other responsibilities. The priority is basic visibility that catches service-affecting issues without creating operational overhead.

Start with cloud-provider native monitoring for cloud workloads. AWS CloudWatch, Azure Monitor, and Google Cloud Monitoring require no deployment, integrate automatically with provider services, and include alerting capabilities. Configure alerts for critical conditions: disk space exhaustion, service unavailability, and certificate expiration.

Add Uptime Robot or similar external availability monitoring for public-facing services. Free tiers support 50 monitors with 5-minute intervals. External monitoring catches issues invisible to internal monitoring, including DNS failures and network path problems.

Deploy Prometheus and Grafana only when cloud-native monitoring proves insufficient. Use container-based deployment (Docker Compose) on an existing server rather than dedicated monitoring infrastructure. Allocate 4 GB RAM and 50 GB storage for small environments. Use community dashboards rather than building custom visualisations.

Minimal monitoring stack (10 servers, single IT person):

  • Cloud provider monitoring for cloud workloads (included in cloud costs)
  • External availability monitoring: no cost (free tier)
  • Basic Prometheus/Grafana on existing server: no additional cost
  • Implementation time: 2-3 days
  • Maintenance time: 2-4 hours monthly

For Organisations with Established IT Functions

Dedicated IT teams can implement structured monitoring architecture with appropriate tooling investment. The priority shifts from basic visibility to comprehensive observability enabling proactive operations.

Deploy dedicated monitoring infrastructure sized for current scale with 2x growth capacity. A Prometheus instance with 8 cores, 64 GB RAM, and 500 GB SSD storage handles roughly 2 million active time series with 30-day retention. Add VictoriaMetrics for long-term storage if retention requirements exceed 90 days.
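The sizing above can be sanity-checked with back-of-the-envelope arithmetic. The ~2 bytes per compressed sample and ~8 KB of RAM per active series used here are rough rules of thumb, not guarantees for any specific workload.

```python
# Rough Prometheus capacity estimate (all figures are rules of thumb)
active_series = 2_000_000
scrape_interval_s = 30        # within the common 15-60 s range
bytes_per_sample = 2          # TSDB compresses samples to ~1-2 bytes
retention_days = 30
ram_bytes_per_series = 8_000  # ~8 KB per active series in memory

samples_per_sec = active_series / scrape_interval_s
storage_gb = samples_per_sec * bytes_per_sample * 86_400 * retention_days / 1e9
ram_gb = active_series * ram_bytes_per_series / 1e9

print(f"storage ~ {storage_gb:.0f} GB, RAM ~ {ram_gb:.0f} GB")
# → storage ~ 346 GB, RAM ~ 16 GB
```

Both estimates land comfortably inside the 500 GB / 64 GB specification, leaving headroom for churn, queries, and the stated 2x growth.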

Implement the USE and RED metric frameworks across infrastructure and services. Create service-specific dashboards showing the metrics relevant to each service’s health. Establish baseline documentation for normal metric ranges and known anomaly patterns.
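The RED signals (Rate, Errors, Duration) can be illustrated with a minimal in-process tracker. Production services would use a metrics client library instead; this plain-Python sketch only shows what each signal captures, and the endpoint names are hypothetical.

```python
from collections import defaultdict

class REDTracker:
    """Minimal illustration of the RED signals per endpoint."""
    def __init__(self):
        self.requests = defaultdict(int)    # R: request count
        self.errors = defaultdict(int)      # E: error count
        self.durations = defaultdict(list)  # D: latency samples (seconds)

    def observe(self, endpoint, duration_s, error=False):
        self.requests[endpoint] += 1
        if error:
            self.errors[endpoint] += 1
        self.durations[endpoint].append(duration_s)

    def summary(self, endpoint):
        n = self.requests[endpoint]
        durs = sorted(self.durations[endpoint])
        return {
            "rate": n,
            "error_ratio": self.errors[endpoint] / n if n else 0.0,
            # nearest-rank p95; crude for small sample counts
            "p95_s": durs[int(0.95 * (len(durs) - 1))] if durs else None,
        }

red = REDTracker()
red.observe("/api/orders", 0.12)
red.observe("/api/orders", 0.31, error=True)
print(red.summary("/api/orders"))
```

A service-specific dashboard then plots exactly these three series per endpoint, which is what makes RED dashboards uniform across otherwise dissimilar services.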

Integrate monitoring with incident management through bidirectional ticketing integration. Configure alert routing that matches organisational on-call structures. Implement alert correlation to reduce ticket volume from cascading failures.

Comprehensive monitoring stack (100 servers, 5-person IT team):

  • Dedicated Prometheus VM: 8 cores, 64 GB RAM, 500 GB SSD
  • Grafana VM: 4 cores, 16 GB RAM (or shared with Prometheus)
  • Alertmanager cluster: 3 small instances for high availability
  • VictoriaMetrics for long-term storage: sized to retention requirements
  • Implementation time: 4-6 weeks
  • Maintenance time: 8-16 hours monthly

Field Deployment Considerations

Monitoring in bandwidth-constrained or intermittently connected locations requires architectural adaptation. Edge collection with summary forwarding reduces bandwidth consumption while maintaining local visibility during connectivity outages.

Deploy lightweight collectors with minimal resource overhead at field locations. Aggressive Prometheus scrape intervals consume more bandwidth than the added data justifies; 5-minute intervals suffice for most field infrastructure. Configure local storage for 7-14 days of data, enabling local troubleshooting during extended connectivity gaps.

Forward only summary metrics to central platforms: aggregate values, anomaly indicators, and health status rather than raw time-series data. A field office generating 50,000 metric samples per minute can summarise to 500 samples covering key health indicators, reducing bandwidth by 99% while preserving essential visibility.

Implement store-and-forward mechanisms that queue telemetry during connectivity outages and transmit when connections restore. Configure transmission during low-usage periods to avoid competing with programme traffic for limited bandwidth.
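The store-and-forward mechanism reduces to a bounded local queue that drains only while transmission succeeds. This sketch assumes a caller-supplied `send` callable returning success or failure; scheduling it into low-usage windows would sit outside this class.

```python
from collections import deque

class StoreAndForward:
    """Queue telemetry locally; drain when the link is up."""

    def __init__(self, send, max_queue=10_000):
        self.send = send  # callable: send(record) -> bool (True on success)
        # bounded queue: when full, the oldest records are dropped first
        self.queue = deque(maxlen=max_queue)

    def submit(self, record):
        self.queue.append(record)

    def flush(self):
        """Transmit queued records in order; stop at the first failure
        (treated as the link being down) and keep the rest queued."""
        sent = 0
        while self.queue:
            if not self.send(self.queue[0]):
                break
            self.queue.popleft()  # remove only after confirmed send
            sent += 1
        return sent
```

Removing a record only after a confirmed send means a mid-flush outage loses nothing; the remaining records simply wait for the next flush attempt.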

Field monitoring configuration:

  • Scrape interval: 300 seconds (5 minutes) vs standard 15-60 seconds
  • Local retention: 7-14 days at full resolution
  • Central forwarding: Summary metrics only, queued during outages
  • Bandwidth budget: 10-50 KB per minute for typical field office
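The bandwidth budget above is consistent with the summarisation figures: 500 summary records per minute at an assumed ~50 bytes per compactly serialised record.

```python
# Rough bandwidth check for summary forwarding; the ~50 bytes/record
# figure assumes a compact serialised format and is an estimate.
summary_records_per_min = 500
bytes_per_record = 50
kb_per_min = summary_records_per_min * bytes_per_record / 1000
print(f"~{kb_per_min:.0f} KB/min")  # → ~25 KB/min, inside the 10-50 KB budget
```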

Legacy Integration

Existing infrastructure predating modern monitoring platforms requires protocol adaptation and custom collection. SNMP remains the primary interface for network equipment, older servers, and environmental monitoring systems (UPS, HVAC, generators). The Prometheus SNMP exporter translates SNMP data into Prometheus metrics, requiring MIB configuration for each device type.

Windows servers without modern agents expose metrics through WMI, collected via the Prometheus windows_exporter (the successor to the earlier WMI exporter). Legacy Unix systems without agent support provide metrics through SSH-based command execution, though this approach scales poorly beyond tens of systems.

Application logs without structured output require parsing to extract metrics. Regular expression-based log parsing identifies error patterns, transaction completions, and performance indicators. This approach proves fragile when log formats change and should be considered transitional until applications receive proper instrumentation.
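A minimal sketch of such extraction follows. The log line format and patterns here are assumptions; real deployments must match their own applications' formats, and the fragility noted above applies whenever those formats change.

```python
import re

# Illustrative patterns for an unstructured application log (assumed format)
ERROR_RE = re.compile(r"\bERROR\b")
LATENCY_RE = re.compile(r"completed in (\d+)ms")

def parse_log(lines):
    """Extract an error count and latency samples from raw log lines."""
    errors, latencies = 0, []
    for line in lines:
        if ERROR_RE.search(line):
            errors += 1
        m = LATENCY_RE.search(line)
        if m:
            latencies.append(int(m.group(1)))
    return {"errors": errors, "latency_ms": latencies}

log = [
    "2024-05-01 12:00:01 INFO request /orders completed in 120ms",
    "2024-05-01 12:00:02 ERROR timeout contacting payment gateway",
    "2024-05-01 12:00:03 INFO request /orders completed in 480ms",
]
print(parse_log(log))  # → {'errors': 1, 'latency_ms': [120, 480]}
```

The extracted counts and samples can then be exposed to the monitoring platform like any other metric, which is exactly why this works as a transitional measure until proper instrumentation lands.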

See Also