
Capacity Planning

Capacity planning determines whether IT infrastructure can sustain projected workloads by measuring current utilisation, forecasting future demand, and defining thresholds that trigger scaling actions. This task produces capacity baselines for each monitored resource, demand forecasts extending 3 to 12 months forward, and documented trigger points that initiate procurement or provisioning before constraints affect service delivery.

Perform capacity planning quarterly for stable environments, monthly for growing organisations, and continuously for cloud infrastructure where consumption directly affects cost. The outcome is a capacity model that predicts when each resource will reach its constraint threshold, enabling proactive investment decisions rather than reactive crisis response.

Prerequisites

Before beginning capacity planning, confirm the following requirements are satisfied:

| Requirement | Specification | Verification |
| --- | --- | --- |
| Monitoring data | 90 days minimum history for baseline accuracy | Query monitoring system for data retention period |
| Resource inventory | Complete list of capacity-constrained resources | Cross-reference with CMDB or asset register |
| Service catalogue | Documented services with resource dependencies | Confirm service-to-infrastructure mapping exists |
| Growth projections | Organisational plans affecting IT demand | Obtain from programme teams or strategic planning |
| Access permissions | Read access to monitoring dashboards and raw metrics | Test query execution before planning session |
| Financial data | Current costs per resource for scaling calculations | Obtain from IT Budgeting or cloud billing |

Monitoring data quality determines forecast accuracy. Verify that collection gaps do not exceed 5% of the measurement period by querying your monitoring system:

# Prometheus example: check data completeness for CPU metrics over 90 days
curl -s "http://prometheus:9090/api/v1/query?query=count_over_time(node_cpu_seconds_total[90d])" | jq '.data.result[0].value[1]'
# Expected: approximately 129,600 samples (90 days × 24 hours × 60 minutes)
# If below 123,120 (95% of expected), investigate collection gaps

For organisations without 90 days of monitoring history, capacity planning remains possible but forecasts carry higher uncertainty. Document the reduced confidence level in planning outputs and schedule a follow-up assessment once sufficient data accumulates.

Procedure

Establish resource inventory

  1. Generate a list of all capacity-constrained resources from your monitoring system and asset inventory. Capacity constraints occur in compute (CPU, memory), storage (capacity, IOPS), network (bandwidth, connections), and licensing (concurrent users, transaction limits).
# Export monitored hosts from Prometheus
curl -s "http://prometheus:9090/api/v1/label/instance/values" | jq -r '.data[]' > monitored_hosts.txt
# Cross-reference with asset register (example using CSV export)
comm -23 <(sort asset_register_hosts.txt) <(sort monitored_hosts.txt) > unmonitored_hosts.txt

Any hosts appearing in unmonitored_hosts.txt require monitoring deployment before capacity planning can include them.

  2. Categorise each resource by constraint type. A single server contributes multiple constraint dimensions: CPU cycles, memory bytes, disk capacity, disk IOPS, and network throughput each constitute separate planning targets.

    Create a capacity inventory spreadsheet with columns for: resource identifier, constraint type, current maximum capacity, measurement unit, and monitoring metric name. For a typical application server:

    | Resource | Constraint | Maximum | Unit | Metric |
    | --- | --- | --- | --- | --- |
    | app-server-01 | CPU | 8 | cores | node_cpu_seconds_total |
    | app-server-01 | Memory | 32 | GB | node_memory_MemTotal_bytes |
    | app-server-01 | Disk capacity | 500 | GB | node_filesystem_size_bytes |
    | app-server-01 | Disk IOPS | 3000 | ops/sec | node_disk_io_time_seconds_total |
    | app-server-01 | Network | 1000 | Mbps | node_network_transmit_bytes_total |
  3. Document shared resources where multiple services compete for capacity. Database servers, storage arrays, network links, and authentication services typically serve multiple consumers. Map each shared resource to its dependent services using configuration management data or service documentation.

Calculate capacity baselines

The capacity baseline represents normal operating levels against which you measure growth and detect anomalies. Calculate baselines using the 95th percentile of utilisation over your measurement period, which excludes transient spikes while capturing sustained high-water marks.

  1. Query your monitoring system for 95th percentile utilisation of each constraint type. The query syntax varies by platform:
# Prometheus: 95th percentile CPU utilisation over 90 days
quantile_over_time(0.95,
(1 - avg by(instance)(rate(node_cpu_seconds_total{mode="idle"}[5m])))[90d:1h]
)
-- InfluxDB: 95th percentile memory utilisation
SELECT percentile("used_percent", 95)
FROM "mem"
WHERE time > now() - 90d
GROUP BY "host"

Record the baseline value for each resource-constraint combination in your capacity inventory.

  2. Calculate headroom as the difference between maximum capacity and baseline utilisation, expressed as both absolute units and percentage:
Headroom (absolute) = Maximum capacity - Baseline utilisation
Headroom (percentage) = ((Maximum - Baseline) / Maximum) × 100

For app-server-01 with 32 GB memory and baseline utilisation of 24 GB:

  • Headroom (absolute) = 32 - 24 = 8 GB
  • Headroom (percentage) = ((32 - 24) / 32) × 100 = 25%
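The headroom formulas above can be sketched in a few lines of Python (illustrative only; the values are the app-server-01 memory example):

```python
def headroom(maximum: float, baseline: float) -> tuple[float, float]:
    """Return (absolute, percentage) headroom for one resource-constraint pair."""
    absolute = maximum - baseline
    percentage = absolute / maximum * 100
    return absolute, percentage

# app-server-01 memory: 32 GB maximum, 24 GB baseline (95th percentile)
abs_gb, pct = headroom(32, 24)
print(f"Headroom: {abs_gb} GB ({pct:.0f}%)")  # Headroom: 8 GB (25%)
```

Run the same function over every row of the capacity inventory to populate the headroom columns in one pass.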
  3. Identify resources with headroom below 30%. These require immediate attention in the forecasting phase, as growth could push them into constraint within one planning cycle. Resources with headroom below 15% require emergency assessment outside the normal planning cycle.
+-------------------------------------------------------------+
| CAPACITY BASELINE ANALYSIS |
+-------------------------------------------------------------+
| |
| Resource: database-primary (16 Cores) |
| Constraint: CPU Utilization (95th Percentile) |
| |
| 100% | (Max) |
| | |
| 80% | [#] |
| | [#] Risk |
| 75% |--------------------------------------[#]----(Limit)-|
| | [#] [#] [#] |
| 60% | [#] [#] [#] [#] |
| | [#] [#] [#] [#] [#] |
| 40% | [#] [#] [#] [#] [#] [#] |
| | [#] [#] [#] [#] [#] [#] |
| 0% +-------------------------------------------------- |
| Jan Feb Mar Apr May Jun |
| |
| Current Headroom: 20% (3.2 cores remaining) |
| Status: ALERT - Breached 75% Baseline |
| |
+-------------------------------------------------------------+

Figure 1: Capacity baseline showing 95th percentile utilisation trend

Forecast demand growth

Demand forecasting projects future utilisation based on historical trends and known business drivers. The growth rate combines organic trend (what monitoring data shows) with planned change (what the organisation intends to do).

  1. Calculate the organic growth rate using linear regression on your baseline data. Most monitoring systems provide trend functions:
# Prometheus: predict CPU utilisation 90 days forward based on 90 days history
predict_linear(
(1 - avg by(instance)(rate(node_cpu_seconds_total{mode="idle"}[5m])))[90d:1d],
90 * 24 * 3600
)

For manual calculation, use the least-squares method on monthly averages:

Growth rate = (Σ(x - x̄)(y - ȳ)) / (Σ(x - x̄)²)
Where:
x = month number (1, 2, 3...)
y = average utilisation for that month
x̄ = mean of month numbers
ȳ = mean of utilisation values

Example with six months of memory data (percentages):

| Month | x | y (%) | x - x̄ | y - ȳ | (x-x̄)(y-ȳ) | (x-x̄)² |
| --- | --- | --- | --- | --- | --- | --- |
| Jan | 1 | 45 | -2.5 | -7.5 | 18.75 | 6.25 |
| Feb | 2 | 48 | -1.5 | -4.5 | 6.75 | 2.25 |
| Mar | 3 | 50 | -0.5 | -2.5 | 1.25 | 0.25 |
| Apr | 4 | 54 | 0.5 | 1.5 | 0.75 | 0.25 |
| May | 5 | 56 | 1.5 | 3.5 | 5.25 | 2.25 |
| Jun | 6 | 62 | 2.5 | 9.5 | 23.75 | 6.25 |
| Sum | x̄ = 3.5 | ȳ = 52.5 | | | 56.5 | 17.5 |

Growth rate = 56.5 / 17.5 = 3.23% per month
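The manual least-squares calculation can be verified with a short script (a sketch; the input list is the six-month memory example above):

```python
def growth_rate(monthly_utilisation: list[float]) -> float:
    """Least-squares slope: percentage-point change in utilisation per month."""
    n = len(monthly_utilisation)
    xs = list(range(1, n + 1))                 # month numbers 1..n
    x_bar = sum(xs) / n
    y_bar = sum(monthly_utilisation) / n
    numerator = sum((x - x_bar) * (y - y_bar)
                    for x, y in zip(xs, monthly_utilisation))
    denominator = sum((x - x_bar) ** 2 for x in xs)
    return numerator / denominator

rate = growth_rate([45, 48, 50, 54, 56, 62])
print(f"{rate:.2f}% per month")  # 3.23% per month
```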

  2. Gather planned changes from programme teams, HR (staff growth), and strategic planning. Each planned change translates to capacity demand through documented ratios. Establish these ratios from historical data:

    • Users per GB of file storage: measure current storage divided by current users
    • Database transactions per programme beneficiary: measure transaction logs against beneficiary counts
    • Compute cycles per concurrent application user: measure during known-load periods

    Example translation for a planned programme expansion adding 5,000 beneficiaries:

Current beneficiaries: 20,000
Current database size: 400 GB
Ratio: 400 GB / 20,000 = 0.02 GB per beneficiary
Additional demand: 5,000 × 0.02 GB = 100 GB
Timeline: Programme launches in 6 months
Monthly demand increase: 100 GB / 6 = 16.7 GB per month (during ramp-up)
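The same ratio translation applies to any business driver; a minimal sketch using the beneficiary example above:

```python
# Establish the ratio from current measurements
current_beneficiaries = 20_000
current_db_gb = 400
gb_per_beneficiary = current_db_gb / current_beneficiaries  # 0.02 GB

# Translate the planned change into capacity demand
new_beneficiaries = 5_000
ramp_months = 6
additional_gb = new_beneficiaries * gb_per_beneficiary      # 100 GB
monthly_increase_gb = additional_gb / ramp_months           # ~16.7 GB/month
print(round(monthly_increase_gb, 1))  # 16.7
```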
  3. Combine organic growth with planned changes to create a composite forecast:
Forecast utilisation = Baseline + (Organic growth × Months) + Planned demand

For the database example with 400 GB baseline, 2% monthly organic growth, and 100 GB planned increase over 6 months:

Organic growth is applied linearly at 8 GB per month (2% of the 400 GB baseline); the planned addition accumulates at 16.7 GB per month:

| Month | Organic growth | Planned addition | Cumulative forecast |
| --- | --- | --- | --- |
| 1 | 408 GB | +16.7 GB | 424.7 GB |
| 2 | 416 GB | +16.7 GB | 449.4 GB |
| 3 | 424 GB | +16.7 GB | 474.1 GB |
| 4 | 432 GB | +16.7 GB | 498.8 GB |
| 5 | 440 GB | +16.7 GB | 523.5 GB |
| 6 | 448 GB | +16.7 GB | 548.2 GB |
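A minimal sketch of the composite formula, using the database example's linear assumptions (8 GB/month organic, 16.7 GB/month planned):

```python
def composite_forecast(baseline: float, organic_per_month: float,
                       planned_per_month: float, month: int) -> float:
    """Forecast utilisation = baseline + organic trend + planned demand."""
    return baseline + (organic_per_month + planned_per_month) * month

for m in (1, 6):
    print(f"Month {m}: {composite_forecast(400, 8, 16.7, m):.1f} GB")
# Month 1: 424.7 GB
# Month 6: 548.2 GB
```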

Figure 2: Demand forecast combining historical trend with planned growth

Define capacity thresholds

Capacity thresholds are utilisation levels that trigger specific actions. Define three threshold tiers for each resource: monitoring threshold (increased observation), warning threshold (planning action required), and critical threshold (immediate action required).

  1. Set threshold values based on resource characteristics and procurement lead times. Resources with long lead times (physical hardware requiring procurement, budget approval, and installation) require lower thresholds than instantly-scalable cloud resources:

    | Resource type | Monitor | Warning | Critical | Rationale |
    | --- | --- | --- | --- | --- |
    | Physical server | 50% | 65% | 80% | 12-16 week procurement cycle |
    | On-premises storage | 55% | 70% | 85% | 8-12 week procurement cycle |
    | Cloud compute | 70% | 80% | 90% | Minutes to provision |
    | Cloud storage | 75% | 85% | 95% | Instant provisioning |
    | Network link | 60% | 75% | 85% | 4-8 week circuit provisioning |
    | Software licence | 70% | 85% | 95% | Days to weeks for procurement |
  2. Calculate the time-to-threshold for each resource using your demand forecast:

Time to threshold = (Threshold - Current utilisation) / Monthly growth rate

For a database at 52.5% utilisation with 3.23% monthly growth and 70% warning threshold:

Time to warning = (70 - 52.5) / 3.23 = 5.4 months

Resources with time-to-threshold less than their procurement lead time require immediate action.
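The time-to-threshold formula, sketched with the database example's figures:

```python
def months_to_threshold(threshold_pct: float, current_pct: float,
                        growth_pct_per_month: float) -> float:
    """Months until utilisation reaches the threshold at the current growth rate."""
    return (threshold_pct - current_pct) / growth_pct_per_month

t = months_to_threshold(70, 52.5, 3.23)
print(f"{t:.1f} months to warning threshold")  # 5.4 months to warning threshold
```

Compare the result against the resource's procurement lead time: a value below the lead time means the order must be placed now.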

  3. Configure alerting rules in your monitoring system to trigger at each threshold:
# Prometheus alerting rules example
groups:
  - name: capacity_thresholds
    rules:
      - alert: CapacityMonitor
        expr: (node_filesystem_avail_bytes / node_filesystem_size_bytes) * 100 < 50
        for: 1h
        labels:
          severity: info
        annotations:
          summary: "Storage capacity monitoring threshold reached"
      - alert: CapacityWarning
        expr: (node_filesystem_avail_bytes / node_filesystem_size_bytes) * 100 < 35
        for: 1h
        labels:
          severity: warning
        annotations:
          summary: "Storage capacity warning - planning action required"
      - alert: CapacityCritical
        expr: (node_filesystem_avail_bytes / node_filesystem_size_bytes) * 100 < 20
        for: 15m
        labels:
          severity: critical
        annotations:
          summary: "Storage capacity critical - immediate action required"

Model scaling scenarios

Before reaching thresholds, model the scaling options available for each constrained resource. Scaling takes two forms: vertical scaling (increasing capacity of existing resources) and horizontal scaling (adding additional resource instances).

  1. Document vertical scaling limits for each resource. Physical servers have maximum memory slots, maximum CPU sockets, and maximum disk bays. Virtual machines have hypervisor-imposed limits. Cloud instances have instance-type ceilings:
Resource: app-server-01 (physical)
Current: 2× CPU (16 cores total), 64 GB RAM, 4× 1TB SSD
Maximum: 4× CPU (32 cores), 256 GB RAM, 8× disk bays
Vertical headroom: 100% CPU, 300% RAM, 100% storage
Resource: app-server-02 (AWS EC2 m5.2xlarge)
Current: 8 vCPU, 32 GB RAM
Maximum in family: m5.24xlarge (96 vCPU, 384 GB RAM)
Vertical headroom: 1100% CPU, 1100% RAM
Migration required for larger: m5.metal (96 vCPU, 384 GB)
  2. Assess horizontal scaling feasibility for each application. Horizontal scaling requires application support for distributed operation: stateless design, external session storage, load balancer compatibility, and database connection pooling. Document scaling constraints:
Application: Grants Management System
Horizontal scaling: Supported (stateless application tier)
Session handling: Redis cluster (external)
Database: PostgreSQL with read replicas
Load balancer: HAProxy (already deployed)
Scaling unit: 1 additional server adds capacity for ~500 concurrent users
Cost per unit: approximately £400/month (cloud) or £8,000 capital (physical)
Application: Legacy Finance System
Horizontal scaling: Not supported (stateful, single-instance design)
Vertical scaling only: Current VM can expand to 4× current resources
Replacement planning: Migration to cloud ERP scheduled for Q3
  3. Calculate cost per capacity unit for each scaling option to enable comparison:
Vertical scaling cost efficiency:
Current: 8 vCPU at £200/month = £25 per vCPU
Upgrade to 16 vCPU: £380/month = £23.75 per vCPU (5% more efficient)
Upgrade to 32 vCPU: £720/month = £22.50 per vCPU (10% more efficient)
Horizontal scaling cost efficiency:
Additional 8 vCPU instance: £200/month = £25 per vCPU
Plus load balancer overhead: £50/month shared across instances
Break-even analysis:
- Below 16 vCPU total: vertical scaling more cost-effective
- Above 16 vCPU total: horizontal scaling provides better resilience value
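The per-unit comparison can be tabulated programmatically; a sketch using the figures above (the horizontal row assumes two £200 instances sharing the £50 load balancer):

```python
# (total vCPU, monthly cost in GBP) for each option from the example above
options = {
    "current (8 vCPU)":   (8, 200),
    "vertical 16 vCPU":   (16, 380),
    "vertical 32 vCPU":   (32, 720),
    "horizontal 16 vCPU": (16, 450),  # assumed: 2 x GBP 200 instances + GBP 50 LB
}
for name, (vcpu, cost) in options.items():
    print(f"{name}: GBP {cost / vcpu:.2f} per vCPU")
```

The cheapest option per vCPU is not automatically the right one: horizontal scaling buys resilience that the raw unit cost does not capture.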
+-------------------------------------------------------------------+
| SCALING DECISION TREE |
+-------------------------------------------------------------------+
| |
| +------------------------+ |
| | Capacity threshold | |
| | breach predicted | |
| +-----------+------------+ |
| | |
| v |
| +------------------------+ |
| | Application supports | |
| | horizontal scaling? | |
| +-----------+------------+ |
| | |
| +---------------+---------------+ |
| | | |
| v v |
| +--------+--------+ +--------+--------+ |
| | Yes | | No | |
| +--------+--------+ +--------+--------+ |
| | | |
| v v |
| +--------+--------+ +--------+--------+ |
| | High | | Vertical | |
| | availability | | headroom | |
| | required? | | available? | |
| +--------+--------+ +--------+--------+ |
| | | |
| +-------+-------+ +-------+-------+ |
| | | | | |
| v v v v |
| +---+---+ +---+---+ +---+---+ +---+---+ |
| | Yes | | No | | Yes | | No | |
| +---+---+ +---+---+ +---+---+ +---+---+ |
| | | | | |
| v v v v |
| +---+---+ +---+---+ +---+---+ +---+---+ |
| |Scale | |Compare| |Scale | |Replace| |
| |out | |costs: | |up | |or | |
| |(horiz)| |vert vs| |(vert) | |migrate| |
| +-------+ |horiz | +-------+ +-------+ |
| +-------+ |
| |
+-------------------------------------------------------------------+

Figure 3: Decision tree for selecting scaling approach

Document capacity plan

  1. Compile findings into a capacity plan document containing: resource inventory with current baselines, demand forecasts by resource, threshold breach predictions, recommended scaling actions, cost estimates, and timeline for procurement or provisioning.

    Structure the document with an executive summary showing resources requiring action within the planning horizon (typically 12 months), followed by detailed analysis for each resource category.

  2. Present the capacity plan to stakeholders with budget authority. The plan should answer: what resources will become constrained, when will constraints occur, what are the options to address them, and what is the cost of each option. Include a do-nothing scenario showing the service impact of failing to scale.

  3. Obtain approval for scaling actions and update procurement plans, cloud budgets, or project schedules accordingly. Record approved actions in a capacity action register:

    | Resource | Action | Trigger date | Lead time | Completion target | Budget | Owner |
    | --- | --- | --- | --- | --- | --- | --- |
    | SAN-01 | Add shelf | 2024-09-01 | 8 weeks | 2024-11-01 | £12,000 | Infrastructure |
    | db-cluster | Add replica | 2024-07-15 | 2 weeks | 2024-08-01 | £800/mo | Database |
    | app-tier | Scale policy | Immediate | 1 day | 2024-06-15 | Variable | Cloud ops |

Configure cloud auto-scaling

For cloud infrastructure, capacity planning translates into auto-scaling policies that respond to demand automatically. Configure scaling policies based on the thresholds established in your capacity plan.

  1. Define scaling metrics that align with your capacity constraints. CPU utilisation serves most compute workloads, but queue depth, response latency, or custom application metrics provide better scaling signals for specific workload types:
# AWS Auto Scaling policy example
Resources:
  ScalingPolicy:
    Type: AWS::AutoScaling::ScalingPolicy
    Properties:
      AutoScalingGroupName: !Ref AppServerGroup
      PolicyType: TargetTrackingScaling
      TargetTrackingConfiguration:
        PredefinedMetricSpecification:
          PredefinedMetricType: ASGAverageCPUUtilization
        TargetValue: 70.0
      ScaleInCooldown: 300
      ScaleOutCooldown: 60
  2. Set minimum and maximum instance counts based on your baseline and forecast:
AppServerGroup:
  Type: AWS::AutoScaling::AutoScalingGroup
  Properties:
    MinSize: 2            # Minimum for high availability
    MaxSize: 10           # Budget cap based on forecast peak
    DesiredCapacity: 3    # Current baseline requirement

Calculate maximum size from your demand forecast: if peak forecast shows 280% of current baseline, and current baseline requires 3 instances, maximum should be at least 9 instances (3 × 2.8, rounded up).
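The maximum-size calculation, as a one-liner sketch:

```python
import math

baseline_instances = 3   # current DesiredCapacity
peak_ratio = 2.8         # forecast peak is 280% of current baseline
max_size = math.ceil(baseline_instances * peak_ratio)
print(max_size)  # 9
```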

  3. Configure scaling cooldown periods to prevent oscillation. Scale-out cooldown should be short (60 seconds) to respond to demand spikes. Scale-in cooldown should be longer (300 seconds or more) to avoid premature scale-down:
# Terraform example for Azure VM Scale Set
resource "azurerm_monitor_autoscale_setting" "app" {
  name                = "app-autoscale"
  resource_group_name = azurerm_resource_group.main.name
  location            = azurerm_resource_group.main.location
  target_resource_id  = azurerm_linux_virtual_machine_scale_set.app.id

  profile {
    name = "default"

    capacity {
      default = 3
      minimum = 2
      maximum = 10
    }

    rule {
      metric_trigger {
        metric_name        = "Percentage CPU"
        metric_resource_id = azurerm_linux_virtual_machine_scale_set.app.id
        time_grain         = "PT1M"
        statistic          = "Average"
        time_window        = "PT5M"
        time_aggregation   = "Average"
        operator           = "GreaterThan"
        threshold          = 70
      }

      scale_action {
        direction = "Increase"
        type      = "ChangeCount"
        value     = "1"
        cooldown  = "PT1M"
      }
    }

    rule {
      metric_trigger {
        metric_name        = "Percentage CPU"
        metric_resource_id = azurerm_linux_virtual_machine_scale_set.app.id
        time_grain         = "PT1M"
        statistic          = "Average"
        time_window        = "PT10M"
        time_aggregation   = "Average"
        operator           = "LessThan"
        threshold          = 30
      }

      scale_action {
        direction = "Decrease"
        type      = "ChangeCount"
        value     = "1"
        cooldown  = "PT5M"
      }
    }
  }
}

Address field infrastructure constraints

Field locations present capacity planning challenges that differ from headquarters or cloud infrastructure. Procurement lead times extend to months rather than weeks due to shipping, customs, and installation logistics. Power and cooling constraints impose hard ceilings that procurement cannot solve.

  1. Inventory power capacity at each field location. Calculate total available watts, current draw, and headroom:
Location: Juba field office
Power source: Solar + battery (4 kW system)
Available for IT: 1.5 kW (remainder for lighting, HVAC, other)
Current IT load: 1.1 kW
Headroom: 400 W (27%)
Constraint: Cannot add equipment drawing more than 400 W without
solar system upgrade (12-week lead time, £8,000 cost)
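The power headroom arithmetic from the Juba example, as a sketch:

```python
available_w = 1500       # 1.5 kW allocated to IT from the 4 kW solar system
current_load_w = 1100
headroom_w = available_w - current_load_w
headroom_pct = headroom_w / available_w * 100
print(f"{headroom_w} W ({headroom_pct:.0f}%)")  # 400 W (27%)
```

Any proposed equipment whose rated draw exceeds `headroom_w` must wait for the power system upgrade, regardless of compute capacity arguments.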
  2. Plan capacity additions to align with logistics windows. Field locations receiving quarterly supply shipments require capacity planning 4-6 months ahead to include equipment in the next shipment:

    | Location | Logistics window | Planning deadline | Equipment cutoff |
    | --- | --- | --- | --- |
    | Nairobi hub | Monthly | 3 weeks prior | 4 weeks prior |
    | Juba office | Quarterly | 10 weeks prior | 12 weeks prior |
    | Cox’s Bazar | Bi-monthly | 6 weeks prior | 8 weeks prior |
    | Remote sites | Per deployment | 16 weeks prior | 20 weeks prior |
  3. Factor bandwidth constraints into application capacity planning. A field office with 2 Mbps connectivity cannot support the same concurrent user count as headquarters, regardless of local compute capacity:

Application: Case management system
Per-user bandwidth: 50 Kbps average, 200 Kbps peak
Headquarters (100 Mbps): supports 500 concurrent users (bandwidth)
Field office (2 Mbps): supports 10 concurrent users (bandwidth limited)
Local caching reduces bandwidth requirement to 15 Kbps average (60 Kbps peak):
Field office with caching: supports 33 concurrent users (2 Mbps ÷ 60 Kbps)
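The bandwidth-limited concurrency calculation, as a sketch (sizing against per-user peak demand, and assuming caching scales peak down proportionally with the average):

```python
def bandwidth_limited_users(link_kbps: float, per_user_peak_kbps: float) -> int:
    """Concurrent users a link can sustain, sized against per-user peak demand."""
    return int(link_kbps // per_user_peak_kbps)

print(bandwidth_limited_users(100_000, 200))  # headquarters (100 Mbps): 500
print(bandwidth_limited_users(2_000, 200))    # field office (2 Mbps): 10
# assumption: caching cuts 50 -> 15 Kbps average, so peak scales 200 -> 60 Kbps
print(bandwidth_limited_users(2_000, 200 * 15 / 50))  # field with caching: 33
```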
+------------------------------------------------------------------+
| FIELD CAPACITY PLANNING TIMELINE |
+------------------------------------------------------------------+
| |
| Week: -20 -16 -12 -8 -4 0 +4 +8 +12 +16 +20 |
| | | | | | | | | | | | |
| v v v v v v v v v v v |
| +-------+----+----+----+----+----+----+----+----+----+----+ |
| | Capacity assessment | | |
| +----------------------------+ | |
| | | |
| v | |
| +----+----+ | |
| | Forecast | | |
| +----+----+ | |
| | | |
| v | |
| +----+---------+ | |
| | Procurement | | |
| | approval | | |
| +----+---------+ | |
| | | |
| v | |
| +----+-------------------+ | |
| | Equipment ordering | | |
| +----+-------------------+ | |
| | | |
| v | |
| +----+-----------------------+ | |
| | Shipping and customs | | |
| +----+-----------------------+ | |
| | | |
| v | |
| +----+------------+ | |
| | Installation | | |
| +----+------------+ | |
| | | |
| v | |
| +----+----+ | |
| | Live | | |
| +---------+ | |
| | |
| Total lead time: 20 weeks from assessment to deployment | |
| | |
+------------------------------------------------------------------+

Figure 4: Field infrastructure capacity planning timeline showing extended lead times

Verification

After completing capacity planning activities, verify the outputs meet quality requirements:

Baseline accuracy: Compare calculated baselines against known peak utilisation periods. Baselines should be within 10% of observed peaks during normal operations. Query a known busy period:

# Verify baseline against last month's peak
max_over_time(
(1 - avg by(instance)(rate(node_cpu_seconds_total{mode="idle"}[5m])))[30d:1h]
)

If the peak exceeds your baseline by more than 10%, investigate whether the peak represents normal operation (adjust baseline) or an anomaly (document exception).

Forecast validation: Backtest your forecasting model by applying it to historical data and comparing predictions against actuals:

Backtest procedure:
1. Use data from months 1-6 to forecast month 9
2. Compare forecast against actual month 9 utilisation
3. Calculate forecast error: |Forecast - Actual| / Actual × 100
Acceptable error: below 15% for 3-month forecasts
Requires investigation: error above 25%
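The error calculation from the backtest procedure, as a sketch (the sample forecast and actual figures are hypothetical):

```python
def forecast_error_pct(forecast: float, actual: float) -> float:
    """Absolute percentage error of a forecast against the observed value."""
    return abs(forecast - actual) / actual * 100

# hypothetical backtest: months 1-6 predicted 58% for month 9; actual was 52%
error = forecast_error_pct(58, 52)
print(f"{error:.1f}%")  # 11.5% -- within the 15% acceptance band
```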

Threshold alert testing: Confirm monitoring alerts trigger at defined thresholds by temporarily lowering thresholds or using test metrics:

# Create test metric to verify alerting (Prometheus example)
echo 'test_capacity_utilisation 95' | \
  curl --data-binary @- http://pushgateway:9091/metrics/job/capacity_test
# Verify alert fires within expected timeframe
# Remove test metric after verification
curl -X DELETE http://pushgateway:9091/metrics/job/capacity_test

Documentation completeness: Confirm the capacity plan contains all required elements:

  • Resource inventory with current baselines: present and dated within 30 days
  • Demand forecasts with methodology documented: present for all constrained resources
  • Threshold definitions with alert configuration references: present and tested
  • Scaling recommendations with cost estimates: present for resources breaching thresholds
  • Approval records for planned actions: present with dates and approvers

Troubleshooting

| Symptom | Cause | Resolution |
| --- | --- | --- |
| Baseline calculation returns NULL or zero | Monitoring data gaps exceed query tolerance | Reduce query timeframe or repair monitoring collection; verify metric names match current configuration |
| Forecast shows negative growth despite observed increases | Seasonal pattern creating misleading trend | Use year-over-year comparison instead of linear regression; apply seasonal decomposition |
| Time-to-threshold shorter than procurement lead time | Delayed capacity planning or accelerated growth | Initiate emergency procurement; implement temporary mitigations (optimisation, load shedding); escalate to management |
| Auto-scaling triggers continuously (oscillation) | Cooldown period too short or threshold too close to steady state | Increase cooldown period; widen gap between scale-out and scale-in thresholds; use predictive scaling |
| Auto-scaling fails to respond to demand spike | Wrong metric selected or aggregation period too long | Verify metric reflects actual bottleneck; reduce aggregation window; add multiple trigger metrics |
| Capacity plan rejected due to cost | Budget constraints not incorporated in planning | Include finance stakeholders earlier; model multiple scenarios with different cost profiles; identify optimisation opportunities |
| Field equipment arrives but cannot deploy | Power, cooling, or space constraints not assessed | Conduct site survey before procurement; include infrastructure requirements in equipment specifications |
| Forecast accuracy degrades over time | Business conditions changed from planning assumptions | Increase planning frequency; establish triggers for unscheduled plan updates; improve communication with programme teams |
| Horizontal scaling fails to improve performance | Application bottleneck not in scaled component | Profile application to identify actual constraint; verify load balancer distributes traffic; check for shared resource contention |
| Vertical scaling causes application instability | Application not tested at higher resource levels; configuration limits | Test scaling in non-production first; verify application configuration supports larger resource allocation; check for 32-bit limitations |
| Monitoring shows lower utilisation than users report | Monitoring sampling misses short spikes; wrong metric monitored | Increase sampling frequency; verify metric captures user-experienced performance; add application-level metrics |
| Storage capacity forecast inaccurate | Compression or deduplication ratios changed; data growth patterns shifted | Re-baseline using recent data; factor compression ratio into forecasts; monitor ratio changes |

Emergency capacity situations

When capacity constraints cause immediate service impact, bypass normal planning procedures. Document the emergency action taken, notify stakeholders, and schedule a post-incident review to update capacity plans and thresholds.

See also