Capacity Planning
Capacity planning determines whether IT infrastructure can sustain projected workloads by measuring current utilisation, forecasting future demand, and defining thresholds that trigger scaling actions. This task produces capacity baselines for each monitored resource, demand forecasts extending 3 to 12 months forward, and documented trigger points that initiate procurement or provisioning before constraints affect service delivery.
Perform capacity planning quarterly for stable environments, monthly for growing organisations, and continuously for cloud infrastructure where consumption directly affects cost. The outcome is a capacity model that predicts when each resource will reach its constraint threshold, enabling proactive investment decisions rather than reactive crisis response.
Prerequisites
Before beginning capacity planning, confirm the following requirements are satisfied:
| Requirement | Specification | Verification |
|---|---|---|
| Monitoring data | 90 days minimum history for baseline accuracy | Query monitoring system for data retention period |
| Resource inventory | Complete list of capacity-constrained resources | Cross-reference with CMDB or asset register |
| Service catalogue | Documented services with resource dependencies | Confirm service-to-infrastructure mapping exists |
| Growth projections | Organisational plans affecting IT demand | Obtain from programme teams or strategic planning |
| Access permissions | Read access to monitoring dashboards and raw metrics | Test query execution before planning session |
| Financial data | Current costs per resource for scaling calculations | Obtain from IT Budgeting or cloud billing |
Monitoring data quality determines forecast accuracy. Verify that collection gaps do not exceed 5% of the measurement period by querying your monitoring system:
```shell
# Prometheus example: check data completeness for CPU metrics over 90 days
curl -s "http://prometheus:9090/api/v1/query?query=count_over_time(node_cpu_seconds_total[90d])" | jq '.data.result[0].value[1]'

# Expected: approximately 129,600 samples (90 days × 24 hours × 60 minutes)
# If below 123,120 (95% of expected), investigate collection gaps
```

For organisations without 90 days of monitoring history, capacity planning remains possible but forecasts carry higher uncertainty. Document the reduced confidence level in planning outputs and schedule a follow-up assessment once sufficient data accumulates.
Procedure
Establish resource inventory
- Generate a list of all capacity-constrained resources from your monitoring system and asset inventory. Capacity constraints occur in compute (CPU, memory), storage (capacity, IOPS), network (bandwidth, connections), and licensing (concurrent users, transaction limits).
```shell
# Export monitored hosts from Prometheus
curl -s "http://prometheus:9090/api/v1/label/instance/values" | jq -r '.data[]' > monitored_hosts.txt

# Cross-reference with asset register (example using CSV export)
comm -23 <(sort asset_register_hosts.txt) <(sort monitored_hosts.txt) > unmonitored_hosts.txt
```

Any hosts appearing in unmonitored_hosts.txt require monitoring deployment before capacity planning can include them.
Categorise each resource by constraint type. A single server contributes multiple constraint dimensions: CPU cycles, memory bytes, disk capacity, disk IOPS, and network throughput each constitute separate planning targets.
Create a capacity inventory spreadsheet with columns for: resource identifier, constraint type, current maximum capacity, measurement unit, and monitoring metric name. For a typical application server:
| Resource | Constraint | Maximum | Unit | Metric |
|---|---|---|---|---|
| app-server-01 | CPU | 8 | cores | node_cpu_seconds_total |
| app-server-01 | Memory | 32 | GB | node_memory_MemTotal_bytes |
| app-server-01 | Disk capacity | 500 | GB | node_filesystem_size_bytes |
| app-server-01 | Disk IOPS | 3000 | ops/sec | node_disk_io_time_seconds_total |
| app-server-01 | Network | 1000 | Mbps | node_network_transmit_bytes_total |

Document shared resources where multiple services compete for capacity. Database servers, storage arrays, network links, and authentication services typically serve multiple consumers. Map each shared resource to its dependent services using configuration management data or service documentation.
Calculate capacity baselines
The capacity baseline represents normal operating levels against which you measure growth and detect anomalies. Calculate baselines using the 95th percentile of utilisation over your measurement period, which excludes transient spikes while capturing sustained high-water marks.
- Query your monitoring system for 95th percentile utilisation of each constraint type. The query syntax varies by platform:
```
# Prometheus: 95th percentile CPU utilisation over 90 days
quantile_over_time(0.95,
  (1 - avg by(instance)(rate(node_cpu_seconds_total{mode="idle"}[5m])))[90d:1h]
)
```

```
-- InfluxDB: 95th percentile memory utilisation
SELECT percentile("used_percent", 95) FROM "mem"
WHERE time > now() - 90d GROUP BY "host"
```

Record the baseline value for each resource-constraint combination in your capacity inventory.
- Calculate headroom as the difference between maximum capacity and baseline utilisation, expressed as both absolute units and percentage:
```
Headroom (absolute) = Maximum capacity - Baseline utilisation
Headroom (percentage) = ((Maximum - Baseline) / Maximum) × 100
```

For app-server-01 with 32 GB memory and baseline utilisation of 24 GB:
- Headroom (absolute) = 32 - 24 = 8 GB
- Headroom (percentage) = ((32 - 24) / 32) × 100 = 25%
- Identify resources with headroom below 30%. These require immediate attention in the forecasting phase, as growth could push them into constraint within one planning cycle. Resources with headroom below 15% require emergency assessment outside the normal planning cycle.
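The headroom arithmetic and the 30%/15% attention bands above can be sketched in a few lines (a minimal illustration; the function names are ours, and the values come from the app-server-01 example):

```python
def headroom(maximum, baseline):
    """Absolute and percentage headroom for one resource constraint."""
    absolute = maximum - baseline
    percentage = absolute / maximum * 100
    return absolute, percentage

def attention_level(pct):
    """Map a headroom percentage to the planning bands described above."""
    if pct < 15:
        return "emergency assessment"
    if pct < 30:
        return "forecasting priority"
    return "routine"

# app-server-01 memory: 32 GB maximum, 24 GB baseline (from the example)
abs_gb, pct = headroom(32, 24)
print(abs_gb, pct, attention_level(pct))  # 8 25.0 forecasting priority
```

Note that 25% headroom already falls below the 30% band, so this resource would be prioritised in the forecasting phase.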
```
                CAPACITY BASELINE ANALYSIS

  Resource: database-primary (16 cores)
  Constraint: CPU utilisation (95th percentile)

 100% |                                  (Max)
      |
  80% |                           [#]
      |                           [#]   Risk
  75% |---------------------------[#]---(Limit)
      |                 [#]  [#]  [#]
  60% |            [#]  [#]  [#]  [#]
      |       [#]  [#]  [#]  [#]  [#]
  40% |  [#]  [#]  [#]  [#]  [#]  [#]
      |  [#]  [#]  [#]  [#]  [#]  [#]
   0% +----------------------------------------
        Jan  Feb  Mar  Apr  May  Jun

  Current headroom: 20% (3.2 cores remaining)
  Status: ALERT - breached 75% baseline
```

Figure 1: Capacity baseline showing 95th percentile utilisation trend
Forecast demand growth
Demand forecasting projects future utilisation based on historical trends and known business drivers. The growth rate combines organic trend (what monitoring data shows) with planned change (what the organisation intends to do).
- Calculate the organic growth rate using linear regression on your baseline data. Most monitoring systems provide trend functions:
```
# Prometheus: predict CPU utilisation 90 days forward based on 90 days history
predict_linear(
  (1 - avg by(instance)(rate(node_cpu_seconds_total{mode="idle"}[5m])))[90d:1d],
  90 * 24 * 3600
)
```

For manual calculation, use the least-squares method on monthly averages:
Growth rate = (Σ(x - x̄)(y - ȳ)) / (Σ(x - x̄)²)
```
Where:
  x = month number (1, 2, 3...)
  y = average utilisation for that month
  x̄ = mean of month numbers
  ȳ = mean of utilisation values
```

Example with six months of memory data (percentages):
| Month | x | y (%) | x - x̄ | y - ȳ | (x-x̄)(y-ȳ) | (x-x̄)² |
|---|---|---|---|---|---|---|
| Jan | 1 | 45 | -2.5 | -7.5 | 18.75 | 6.25 |
| Feb | 2 | 48 | -1.5 | -4.5 | 6.75 | 2.25 |
| Mar | 3 | 50 | -0.5 | -2.5 | 1.25 | 0.25 |
| Apr | 4 | 54 | 0.5 | 1.5 | 0.75 | 0.25 |
| May | 5 | 56 | 1.5 | 3.5 | 5.25 | 2.25 |
| Jun | 6 | 62 | 2.5 | 9.5 | 23.75 | 6.25 |
| Sum | x̄=3.5 | ȳ=52.5 | | | 56.5 | 17.5 |
Growth rate = 56.5 / 17.5 = 3.23% per month
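The least-squares calculation above can be reproduced programmatically (a short sketch using the same six months of data):

```python
# Monthly average memory utilisation (%) from the worked example above
xs = [1, 2, 3, 4, 5, 6]
ys = [45, 48, 50, 54, 56, 62]

x_bar = sum(xs) / len(xs)   # 3.5
y_bar = sum(ys) / len(ys)   # 52.5

# Least-squares slope: Σ(x - x̄)(y - ȳ) / Σ(x - x̄)²
numerator = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))  # 56.5
denominator = sum((x - x_bar) ** 2 for x in xs)                     # 17.5

growth_rate = numerator / denominator
print(round(growth_rate, 2))  # 3.23 (percentage points per month)
```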
Gather planned changes from programme teams, HR (staff growth), and strategic planning. Each planned change translates to capacity demand through documented ratios. Establish these ratios from historical data:
- Users per GB of file storage: measure current storage divided by current users
- Database transactions per programme beneficiary: measure transaction logs against beneficiary counts
- Compute cycles per concurrent application user: measure during known-load periods
Example translation for a planned programme expansion adding 5,000 beneficiaries:
```
Current beneficiaries: 20,000
Current database size: 400 GB
Ratio: 400 GB / 20,000 = 0.02 GB per beneficiary

Additional demand: 5,000 × 0.02 GB = 100 GB
Timeline: programme launches in 6 months
Monthly demand increase: 100 GB / 6 = 16.7 GB per month (during ramp-up)
```

- Combine organic growth with planned changes to create a composite forecast:
```
Forecast utilisation = Baseline + (Organic growth × Months) + Planned demand
```

For the database example with 400 GB baseline, 2% monthly organic growth, and 100 GB planned increase over 6 months:
| Month | Organic forecast | Planned addition (monthly) | Cumulative forecast |
|---|---|---|---|
| 1 | 408 GB | +16.7 GB | 424.7 GB |
| 2 | 416 GB | +16.7 GB | 449.4 GB |
| 3 | 424 GB | +16.7 GB | 474.1 GB |
| 4 | 432 GB | +16.7 GB | 498.8 GB |
| 5 | 440 GB | +16.7 GB | 523.5 GB |
| 6 | 448 GB | +16.7 GB | 548.2 GB |
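The composite forecast table can be regenerated with a few lines (a sketch that, like the table, applies simple rather than compounded organic growth):

```python
baseline_gb = 400.0
organic_rate = 0.02        # 2% of baseline per month (simple growth)
planned_monthly_gb = 16.7  # 100 GB planned demand spread over 6 months

forecasts = []
for month in range(1, 7):
    organic = baseline_gb * (1 + organic_rate * month)
    forecast = organic + planned_monthly_gb * month
    forecasts.append(round(forecast, 1))

print(forecasts)  # [424.7, 449.4, 474.1, 498.8, 523.5, 548.2]
```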
Figure 2: Demand forecast combining historical trend with planned growth (chart not reproduced; the values are those in the forecast table above)
Define capacity thresholds
Capacity thresholds are utilisation levels that trigger specific actions. Define three threshold tiers for each resource: monitoring threshold (increased observation), warning threshold (planning action required), and critical threshold (immediate action required).
Set threshold values based on resource characteristics and procurement lead times. Resources with long lead times (physical hardware requiring procurement, budget approval, and installation) require lower thresholds than instantly-scalable cloud resources:
| Resource type | Monitor | Warning | Critical | Rationale |
|---|---|---|---|---|
| Physical server | 50% | 65% | 80% | 12-16 week procurement cycle |
| On-premises storage | 55% | 70% | 85% | 8-12 week procurement cycle |
| Cloud compute | 70% | 80% | 90% | Minutes to provision |
| Cloud storage | 75% | 85% | 95% | Instant provisioning |
| Network link | 60% | 75% | 85% | 4-8 week circuit provisioning |
| Software licence | 70% | 85% | 95% | Days to weeks for procurement |

Calculate the time-to-threshold for each resource using your demand forecast:
```
Time to threshold = (Threshold - Current utilisation) / Monthly growth rate
```

For a database at 52.5% utilisation with 3.23% monthly growth and a 70% warning threshold:

```
Time to warning = (70 - 52.5) / 3.23 = 5.4 months
```

Resources with a time-to-threshold shorter than their procurement lead time require immediate action.
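The time-to-threshold calculation is straightforward to script across a whole inventory; a minimal sketch using the database example:

```python
def months_to_threshold(threshold_pct, current_pct, monthly_growth_pct):
    """Months until utilisation reaches the threshold at the current growth rate."""
    return (threshold_pct - current_pct) / monthly_growth_pct

# Database example: 52.5% utilised, growing 3.23 points/month, 70% warning threshold
print(round(months_to_threshold(70, 52.5, 3.23), 1))  # 5.4

# Compare against procurement lead time (2 months here is a hypothetical value)
procurement_lead_months = 2
if months_to_threshold(70, 52.5, 3.23) < procurement_lead_months:
    print("immediate action required")
```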
- Configure alerting rules in your monitoring system to trigger at each threshold:
```yaml
# Prometheus alerting rules example
groups:
  - name: capacity_thresholds
    rules:
      - alert: CapacityMonitor
        expr: (node_filesystem_avail_bytes / node_filesystem_size_bytes) * 100 < 50
        for: 1h
        labels:
          severity: info
        annotations:
          summary: "Storage capacity monitoring threshold reached"

      - alert: CapacityWarning
        expr: (node_filesystem_avail_bytes / node_filesystem_size_bytes) * 100 < 35
        for: 1h
        labels:
          severity: warning
        annotations:
          summary: "Storage capacity warning - planning action required"

      - alert: CapacityCritical
        expr: (node_filesystem_avail_bytes / node_filesystem_size_bytes) * 100 < 20
        for: 15m
        labels:
          severity: critical
        annotations:
          summary: "Storage capacity critical - immediate action required"
```

Model scaling scenarios
Before reaching thresholds, model the scaling options available for each constrained resource. Scaling takes two forms: vertical scaling (increasing capacity of existing resources) and horizontal scaling (adding additional resource instances).
- Document vertical scaling limits for each resource. Physical servers have maximum memory slots, maximum CPU sockets, and maximum disk bays. Virtual machines have hypervisor-imposed limits. Cloud instances have instance-type ceilings:
```
Resource: app-server-01 (physical)
Current: 2× CPU (16 cores total), 64 GB RAM, 4× 1 TB SSD
Maximum: 4× CPU (32 cores), 256 GB RAM, 8× disk bays
Vertical headroom: 100% CPU, 300% RAM, 100% storage

Resource: app-server-02 (AWS EC2 m5.2xlarge)
Current: 8 vCPU, 32 GB RAM
Maximum in family: m5.24xlarge (96 vCPU, 384 GB RAM)
Vertical headroom: 1100% CPU, 1100% RAM
Migration required for larger: m5.metal (96 vCPU, 384 GB)
```

- Assess horizontal scaling feasibility for each application. Horizontal scaling requires application support for distributed operation: stateless design, external session storage, load balancer compatibility, and database connection pooling. Document scaling constraints:
```
Application: Grants Management System
Horizontal scaling: Supported (stateless application tier)
Session handling: Redis cluster (external)
Database: PostgreSQL with read replicas
Load balancer: HAProxy (already deployed)
Scaling unit: 1 additional server adds capacity for ~500 concurrent users
Cost per unit: approximately £400/month (cloud) or £8,000 capital (physical)

Application: Legacy Finance System
Horizontal scaling: Not supported (stateful, single-instance design)
Vertical scaling only: current VM can expand to 4× current resources
Replacement planning: migration to cloud ERP scheduled for Q3
```

- Calculate cost per capacity unit for each scaling option to enable comparison:
```
Vertical scaling cost efficiency:
  Current: 8 vCPU at £200/month = £25 per vCPU
  Upgrade to 16 vCPU: £380/month = £23.75 per vCPU (5% more efficient)
  Upgrade to 32 vCPU: £720/month = £22.50 per vCPU (10% more efficient)

Horizontal scaling cost efficiency:
  Additional 8 vCPU instance: £200/month = £25 per vCPU
  Plus load balancer overhead: £50/month shared across instances

Break-even analysis:
  - Below 16 vCPU total: vertical scaling more cost-effective
  - Above 16 vCPU total: horizontal scaling provides better resilience value
```

```
              Capacity threshold breach predicted
                             |
                             v
              Application supports horizontal scaling?
                   |                        |
                  Yes                       No
                   |                        |
                   v                        v
      High availability required?   Vertical headroom available?
           |            |                |            |
          Yes           No              Yes           No
           |            |                |            |
           v            v                v            v
       Scale out    Compare costs:   Scale up     Replace or
       (horizontal) vert vs horiz    (vertical)   migrate
```

Figure 3: Decision tree for selecting scaling approach
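The cost-per-vCPU comparison can be checked with a short calculation. This is a sketch: the £ figures come from the example above, and horizontal_cost_per_vcpu is an illustrative helper assuming each extra instance costs £200/month with the £50/month load balancer shared across instances:

```python
def cost_per_vcpu(monthly_cost_gbp, vcpus):
    """Monthly cost divided by vCPU count."""
    return monthly_cost_gbp / vcpus

print(cost_per_vcpu(200, 8))    # 25.0
print(cost_per_vcpu(380, 16))   # 23.75
print(cost_per_vcpu(720, 32))   # 22.5

def horizontal_cost_per_vcpu(instances):
    """Effective £/vCPU for n 8-vCPU instances plus shared load balancer."""
    return (200 * instances + 50) / (8 * instances)

print(round(horizontal_cost_per_vcpu(2), 2))  # 28.12 at 16 vCPU total
```

At 16 vCPU total the vertical option (£23.75/vCPU) beats the horizontal one on raw cost, which is consistent with the break-even analysis; the horizontal premium buys resilience rather than capacity.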
Document capacity plan
Compile findings into a capacity plan document containing: resource inventory with current baselines, demand forecasts by resource, threshold breach predictions, recommended scaling actions, cost estimates, and timeline for procurement or provisioning.
Structure the document with an executive summary showing resources requiring action within the planning horizon (typically 12 months), followed by detailed analysis for each resource category.
Present the capacity plan to stakeholders with budget authority. The plan should answer: what resources will become constrained, when will constraints occur, what are the options to address them, and what is the cost of each option. Include a do-nothing scenario showing the service impact of failing to scale.
Obtain approval for scaling actions and update procurement plans, cloud budgets, or project schedules accordingly. Record approved actions in a capacity action register:
| Resource | Action | Trigger date | Lead time | Completion target | Budget | Owner |
|---|---|---|---|---|---|---|
| SAN-01 | Add shelf | 2024-09-01 | 8 weeks | 2024-11-01 | £12,000 | Infrastructure |
| db-cluster | Add replica | 2024-07-15 | 2 weeks | 2024-08-01 | £800/mo | Database |
| app-tier | Scale policy | Immediate | 1 day | 2024-06-15 | Variable | Cloud ops |
Configure cloud auto-scaling
For cloud infrastructure, capacity planning translates into auto-scaling policies that respond to demand automatically. Configure scaling policies based on the thresholds established in your capacity plan.
- Define scaling metrics that align with your capacity constraints. CPU utilisation serves most compute workloads, but queue depth, response latency, or custom application metrics provide better scaling signals for specific workload types:
```yaml
# AWS Auto Scaling policy example
Resources:
  ScalingPolicy:
    Type: AWS::AutoScaling::ScalingPolicy
    Properties:
      AutoScalingGroupName: !Ref AppServerGroup
      PolicyType: TargetTrackingScaling
      TargetTrackingConfiguration:
        PredefinedMetricSpecification:
          PredefinedMetricType: ASGAverageCPUUtilization
        TargetValue: 70.0
        ScaleInCooldown: 300
        ScaleOutCooldown: 60
```

- Set minimum and maximum instance counts based on your baseline and forecast:
```yaml
  AppServerGroup:
    Type: AWS::AutoScaling::AutoScalingGroup
    Properties:
      MinSize: 2           # Minimum for high availability
      MaxSize: 10          # Budget cap based on forecast peak
      DesiredCapacity: 3   # Current baseline requirement
```

Calculate maximum size from your demand forecast: if the peak forecast shows 280% of the current baseline, and the current baseline requires 3 instances, the maximum should be at least 9 instances (3 × 2.8, rounded up).
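The sizing rule above is simply a ceiling of baseline instances times the forecast peak ratio; a one-liner makes the rounding explicit:

```python
import math

baseline_instances = 3     # current DesiredCapacity
peak_forecast_ratio = 2.8  # peak forecast at 280% of current baseline

# Round up: a fractional instance must become a whole one
max_size = math.ceil(baseline_instances * peak_forecast_ratio)
print(max_size)  # 9
```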
- Configure scaling cooldown periods to prevent oscillation. Scale-out cooldown should be short (60 seconds) to respond to demand spikes. Scale-in cooldown should be longer (300 seconds or more) to avoid premature scale-down:
```hcl
# Terraform example for Azure VM Scale Set
resource "azurerm_monitor_autoscale_setting" "app" {
  name                = "app-autoscale"
  resource_group_name = azurerm_resource_group.main.name
  location            = azurerm_resource_group.main.location
  target_resource_id  = azurerm_linux_virtual_machine_scale_set.app.id

  profile {
    name = "default"

    capacity {
      default = 3
      minimum = 2
      maximum = 10
    }

    rule {
      metric_trigger {
        metric_name        = "Percentage CPU"
        metric_resource_id = azurerm_linux_virtual_machine_scale_set.app.id
        time_grain         = "PT1M"
        statistic          = "Average"
        time_window        = "PT5M"
        time_aggregation   = "Average"
        operator           = "GreaterThan"
        threshold          = 70
      }

      scale_action {
        direction = "Increase"
        type      = "ChangeCount"
        value     = "1"
        cooldown  = "PT1M"
      }
    }

    rule {
      metric_trigger {
        metric_name        = "Percentage CPU"
        metric_resource_id = azurerm_linux_virtual_machine_scale_set.app.id
        time_grain         = "PT1M"
        statistic          = "Average"
        time_window        = "PT10M"
        time_aggregation   = "Average"
        operator           = "LessThan"
        threshold          = 30
      }

      scale_action {
        direction = "Decrease"
        type      = "ChangeCount"
        value     = "1"
        cooldown  = "PT5M"
      }
    }
  }
}
```

Address field infrastructure constraints
Field locations present capacity planning challenges that differ from headquarters or cloud infrastructure. Procurement lead times extend to months rather than weeks due to shipping, customs, and installation logistics. Power and cooling constraints impose hard ceilings that procurement cannot solve.
- Inventory power capacity at each field location. Calculate total available watts, current draw, and headroom:
```
Location: Juba field office
Power source: Solar + battery (4 kW system)
Available for IT: 1.5 kW (remainder for lighting, HVAC, other)
Current IT load: 1.1 kW
Headroom: 400 W (27%)

Constraint: Cannot add equipment drawing more than 400 W without a solar
system upgrade (12-week lead time, £8,000 cost)
```

Plan capacity additions to align with logistics windows. Field locations receiving quarterly supply shipments require capacity planning 4-6 months ahead to include equipment in the next shipment:
| Location | Logistics window | Planning deadline | Equipment cutoff |
|---|---|---|---|
| Nairobi hub | Monthly | 3 weeks prior | 4 weeks prior |
| Juba office | Quarterly | 10 weeks prior | 12 weeks prior |
| Cox’s Bazar | Bi-monthly | 6 weeks prior | 8 weeks prior |
| Remote sites | Per deployment | 16 weeks prior | 20 weeks prior |

Factor bandwidth constraints into application capacity planning. A field office with 2 Mbps connectivity cannot support the same concurrent user count as headquarters, regardless of local compute capacity:
```
Application: Case management system
Per-user bandwidth: 50 Kbps average, 200 Kbps peak
Headquarters (100 Mbps): supports 500 concurrent users (bandwidth)
Field office (2 Mbps): supports 10 concurrent users (bandwidth limited)

Local caching reduces the bandwidth requirement to 15 Kbps average:
Field office with caching: supports 33 concurrent users
```

```
Week: -20        -16        -12         -8         -4          0
 |- Capacity assessment -|
         |- Forecast -|
              |- Procurement approval -|
                        |- Equipment ordering -|
                                  |- Shipping and customs -|
                                                  |- Installation -|
                                                               Live

Total lead time: 20 weeks from assessment to deployment
```

Figure 4: Field infrastructure capacity planning timeline showing extended lead times
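The bandwidth-limited user counts above can be checked with a small helper. One assumption is ours: since the worked numbers size capacity on per-user peak bandwidth, we assume caching reduces peak proportionally to the average (50 → 15 Kbps average implies roughly 200 × 15/50 = 60 Kbps peak):

```python
def bandwidth_limited_users(link_mbps, per_user_peak_kbps):
    """Concurrent users a link supports, sized on per-user peak bandwidth."""
    return int(link_mbps * 1000 // per_user_peak_kbps)

print(bandwidth_limited_users(100, 200))  # 500 (headquarters, 100 Mbps)
print(bandwidth_limited_users(2, 200))    # 10  (field office, 2 Mbps)

# With caching: assumed peak of 200 * 15/50 = 60 Kbps per user
print(bandwidth_limited_users(2, 60))     # 33
```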
Verification
After completing capacity planning activities, verify the outputs meet quality requirements:
Baseline accuracy: Compare calculated baselines against known peak utilisation periods. Baselines should be within 10% of observed peaks during normal operations. Query a known busy period:
```
# Verify baseline against last month's peak
max_over_time(
  (1 - avg by(instance)(rate(node_cpu_seconds_total{mode="idle"}[5m])))[30d:1h]
)
```

If the peak exceeds your baseline by more than 10%, investigate whether the peak represents normal operation (adjust baseline) or an anomaly (document exception).
Forecast validation: Backtest your forecasting model by applying it to historical data and comparing predictions against actuals:
Backtest procedure:
1. Use data from months 1-6 to forecast month 9
2. Compare the forecast against actual month 9 utilisation
3. Calculate forecast error: |Forecast - Actual| / Actual × 100
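The backtest can be scripted against the regression from the worked example; this sketch fits months 1-6 and scores a month 9 prediction (the "actual" month 9 value here is hypothetical, for illustration only):

```python
xs = list(range(1, 7))
ys = [45, 48, 50, 54, 56, 62]       # months 1-6 from the worked example

# Least-squares fit, as in the forecasting step
x_bar, y_bar = sum(xs) / 6, sum(ys) / 6
slope = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
         / sum((x - x_bar) ** 2 for x in xs))
intercept = y_bar - slope * x_bar

forecast_m9 = intercept + slope * 9  # about 70.3%
actual_m9 = 66.0                     # hypothetical observed value
error_pct = abs(forecast_m9 - actual_m9) / actual_m9 * 100

print(round(forecast_m9, 1), round(error_pct, 1))  # 70.3 6.5
```

An error of roughly 6.5% would sit comfortably inside the 15% acceptance band for 3-month forecasts.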
Acceptable error: below 15% for 3-month forecasts
Requires investigation: error above 25%

Threshold alert testing: Confirm monitoring alerts trigger at defined thresholds by temporarily lowering thresholds or using test metrics:
```shell
# Create test metric to verify alerting (Prometheus example)
curl -X POST http://pushgateway:9091/metrics/job/capacity_test \
  -d 'test_capacity_utilisation 95'

# Verify alert fires within expected timeframe
# Remove test metric after verification
curl -X DELETE http://pushgateway:9091/metrics/job/capacity_test
```

Documentation completeness: Confirm the capacity plan contains all required elements:
- Resource inventory with current baselines: present and dated within 30 days
- Demand forecasts with methodology documented: present for all constrained resources
- Threshold definitions with alert configuration references: present and tested
- Scaling recommendations with cost estimates: present for resources breaching thresholds
- Approval records for planned actions: present with dates and approvers
Troubleshooting
| Symptom | Cause | Resolution |
|---|---|---|
| Baseline calculation returns NULL or zero | Monitoring data gaps exceed query tolerance | Reduce query timeframe or repair monitoring collection; verify metric names match current configuration |
| Forecast shows negative growth despite observed increases | Seasonal pattern creating misleading trend | Use year-over-year comparison instead of linear regression; apply seasonal decomposition |
| Time-to-threshold shorter than procurement lead time | Delayed capacity planning or accelerated growth | Initiate emergency procurement; implement temporary mitigations (optimisation, load shedding); escalate to management |
| Auto-scaling triggers continuously (oscillation) | Cooldown period too short or threshold too close to steady-state | Increase cooldown period; widen gap between scale-out and scale-in thresholds; use predictive scaling |
| Auto-scaling fails to respond to demand spike | Wrong metric selected or aggregation period too long | Verify metric reflects actual bottleneck; reduce aggregation window; add multiple trigger metrics |
| Capacity plan rejected due to cost | Budget constraints not incorporated in planning | Include finance stakeholders earlier; model multiple scenarios with different cost profiles; identify optimisation opportunities |
| Field equipment arrives but cannot deploy | Power, cooling, or space constraints not assessed | Conduct site survey before procurement; include infrastructure requirements in equipment specifications |
| Forecast accuracy degrades over time | Business conditions changed from planning assumptions | Increase planning frequency; establish triggers for unscheduled plan updates; improve communication with programme teams |
| Horizontal scaling fails to improve performance | Application bottleneck not in scaled component | Profile application to identify actual constraint; verify load balancer distributes traffic; check for shared resource contention |
| Vertical scaling causes application instability | Application not tested at higher resource levels; configuration limits | Test scaling in non-production first; verify application configuration supports larger resource allocation; check for 32-bit limitations |
| Monitoring shows lower utilisation than users report | Monitoring sampling misses short spikes; wrong metric monitored | Increase sampling frequency; verify metric captures user-experienced performance; add application-level metrics |
| Storage capacity forecast inaccurate | Compression or deduplication ratios changed; data growth patterns shifted | Re-baseline using recent data; factor compression ratio into forecasts; monitor ratio changes |
Emergency capacity situations
When capacity constraints cause immediate service impact, bypass normal planning procedures. Document the emergency action taken, notify stakeholders, and schedule a post-incident review to update capacity plans and thresholds.