Capacity Planning
Capacity planning determines whether IT infrastructure can sustain projected workloads by measuring current utilisation, forecasting future demand, and defining thresholds that trigger scaling actions. This task produces capacity baselines for each monitored resource, demand forecasts extending 3 to 12 months forward, and documented trigger points that initiate procurement or provisioning before constraints affect service delivery.
Perform capacity planning quarterly for stable environments, monthly for growing organisations, and continuously for cloud infrastructure where consumption directly affects cost. The outcome is a capacity model that predicts when each resource will reach its constraint threshold, enabling proactive investment decisions rather than reactive crisis response.
Prerequisites
Before beginning capacity planning, confirm the following requirements are satisfied:
| Requirement | Specification | Verification |
|---|---|---|
| Monitoring data | 90 days minimum history for baseline accuracy | Query monitoring system for data retention period |
| Resource inventory | Complete list of capacity-constrained resources | Cross-reference with CMDB or asset register |
| Service catalogue | Documented services with resource dependencies | Confirm service-to-infrastructure mapping exists |
| Growth projections | Organisational plans affecting IT demand | Obtain from programme teams or strategic planning |
| Access permissions | Read access to monitoring dashboards and raw metrics | Test query execution before planning session |
| Financial data | Current costs per resource for scaling calculations | Obtain from IT Budgeting or cloud billing |
Monitoring data quality determines forecast accuracy. Verify that collection gaps do not exceed 5% of the measurement period by querying your monitoring system:
```shell
# Prometheus example: check data completeness for CPU metrics over 90 days
curl -s "http://prometheus:9090/api/v1/query?query=count_over_time(node_cpu_seconds_total[90d])" | jq '.data.result[0].value[1]'

# Expected: approximately 129,600 samples (90 days × 24 hours × 60 minutes)
# If below 123,120 (95% of expected), investigate collection gaps
```

For organisations without 90 days of monitoring history, capacity planning remains possible but forecasts carry higher uncertainty. Document the reduced confidence level in planning outputs and schedule a follow-up assessment once sufficient data accumulates.
Procedure
Establish resource inventory
- Generate a list of all capacity-constrained resources from your monitoring system and asset inventory. Capacity constraints occur in compute (CPU, memory), storage (capacity, IOPS), network (bandwidth, connections), and licensing (concurrent users, transaction limits).
```shell
# Export monitored hosts from Prometheus
curl -s "http://prometheus:9090/api/v1/label/instance/values" | jq -r '.data[]' > monitored_hosts.txt

# Cross-reference with asset register (example using CSV export)
comm -23 <(sort asset_register_hosts.txt) <(sort monitored_hosts.txt) > unmonitored_hosts.txt
```

Any hosts appearing in unmonitored_hosts.txt require monitoring deployment before capacity planning can include them.
Categorise each resource by constraint type. A single server contributes multiple constraint dimensions: CPU cycles, memory bytes, disk capacity, disk IOPS, and network throughput each constitute separate planning targets.
Create a capacity inventory spreadsheet with columns for: resource identifier, constraint type, current maximum capacity, measurement unit, and monitoring metric name. For a typical application server:
| Resource | Constraint | Maximum | Unit | Metric |
|---|---|---|---|---|
| app-server-01 | CPU | 8 | cores | node_cpu_seconds_total |
| app-server-01 | Memory | 32 | GB | node_memory_MemTotal_bytes |
| app-server-01 | Disk capacity | 500 | GB | node_filesystem_size_bytes |
| app-server-01 | Disk IOPS | 3000 | ops/sec | node_disk_io_time_seconds_total |
| app-server-01 | Network | 1000 | Mbps | node_network_transmit_bytes_total |

Document shared resources where multiple services compete for capacity. Database servers, storage arrays, network links, and authentication services typically serve multiple consumers. Map each shared resource to its dependent services using configuration management data or service documentation.
Calculate capacity baselines
The capacity baseline represents normal operating levels against which you measure growth and detect anomalies. Calculate baselines using the 95th percentile of utilisation over your measurement period, which excludes transient spikes while capturing sustained high-water marks.
- Query your monitoring system for 95th percentile utilisation of each constraint type. The query syntax varies by platform:
```
# Prometheus: 95th percentile CPU utilisation over 90 days
quantile_over_time(0.95,
  (1 - avg by(instance)(rate(node_cpu_seconds_total{mode="idle"}[5m])))[90d:1h]
)
```

```
-- InfluxDB: 95th percentile memory utilisation
SELECT percentile("used_percent", 95) FROM "mem"
WHERE time > now() - 90d GROUP BY "host"
```

Record the baseline value for each resource-constraint combination in your capacity inventory.
- Calculate headroom as the difference between maximum capacity and baseline utilisation, expressed as both absolute units and percentage:
```
Headroom (absolute) = Maximum capacity - Baseline utilisation
Headroom (percentage) = ((Maximum - Baseline) / Maximum) × 100
```

For app-server-01 with 32 GB memory and baseline utilisation of 24 GB:
- Headroom (absolute) = 32 - 24 = 8 GB
- Headroom (percentage) = ((32 - 24) / 32) × 100 = 25%
- Identify resources with headroom below 30%. These require immediate attention in the forecasting phase, as growth could push them into constraint within one planning cycle. Resources with headroom below 15% require emergency assessment outside the normal planning cycle.
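The headroom arithmetic and the 30%/15% attention bands above can be sketched in a few lines (a minimal illustration; the function names are ours, and the values come from the app-server-01 example):

```python
def headroom(maximum, baseline):
    """Absolute and percentage headroom for one resource constraint."""
    absolute = maximum - baseline
    percentage = absolute / maximum * 100
    return absolute, percentage

def attention_level(pct):
    """Map a headroom percentage to the planning bands described above."""
    if pct < 15:
        return "emergency assessment"
    if pct < 30:
        return "forecasting priority"
    return "routine"

# app-server-01 memory: 32 GB maximum, 24 GB baseline (from the example)
abs_gb, pct = headroom(32, 24)
print(abs_gb, pct, attention_level(pct))  # 8 25.0 forecasting priority
```

Note that 25% headroom already falls below the 30% band, so this resource would be prioritised in the forecasting phase.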
```
                CAPACITY BASELINE ANALYSIS

  Resource: database-primary (16 cores)
  Constraint: CPU utilisation (95th percentile)

 100% |                                  (Max)
      |
  80% |                           [#]
      |                           [#]   Risk
  75% |---------------------------[#]---(Limit)
      |                 [#]  [#]  [#]
  60% |            [#]  [#]  [#]  [#]
      |       [#]  [#]  [#]  [#]  [#]
  40% |  [#]  [#]  [#]  [#]  [#]  [#]
      |  [#]  [#]  [#]  [#]  [#]  [#]
   0% +----------------------------------------
        Jan  Feb  Mar  Apr  May  Jun

  Current headroom: 20% (3.2 cores remaining)
  Status: ALERT - breached 75% baseline
```

Figure 1: Capacity baseline showing 95th percentile utilisation trend
Forecast demand growth
Demand forecasting projects future utilisation based on historical trends and known business drivers. The growth rate combines organic trend (what monitoring data shows) with planned change (what the organisation intends to do).
- Calculate the organic growth rate using linear regression on your baseline data. Most monitoring systems provide trend functions:
```
# Prometheus: predict CPU utilisation 90 days forward based on 90 days history
predict_linear(
  (1 - avg by(instance)(rate(node_cpu_seconds_total{mode="idle"}[5m])))[90d:1d],
  90 * 24 * 3600
)
```

For manual calculation, use the least-squares method on monthly averages:
Growth rate = (Σ(x - x̄)(y - ȳ)) / (Σ(x - x̄)²)
```
Where:
  x = month number (1, 2, 3...)
  y = average utilisation for that month
  x̄ = mean of month numbers
  ȳ = mean of utilisation values
```

Example with six months of memory data (percentages):
| Month | x | y (%) | x - x̄ | y - ȳ | (x-x̄)(y-ȳ) | (x-x̄)² |
|---|---|---|---|---|---|---|
| Jan | 1 | 45 | -2.5 | -7.5 | 18.75 | 6.25 |
| Feb | 2 | 48 | -1.5 | -4.5 | 6.75 | 2.25 |
| Mar | 3 | 50 | -0.5 | -2.5 | 1.25 | 0.25 |
| Apr | 4 | 54 | 0.5 | 1.5 | 0.75 | 0.25 |
| May | 5 | 56 | 1.5 | 3.5 | 5.25 | 2.25 |
| Jun | 6 | 62 | 2.5 | 9.5 | 23.75 | 6.25 |
| Sum | x̄=3.5 | ȳ=52.5 | | | 56.5 | 17.5 |
Growth rate = 56.5 / 17.5 = 3.23% per month
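The least-squares calculation above can be reproduced programmatically (a short sketch using the same six months of data):

```python
# Monthly average memory utilisation (%) from the worked example above
xs = [1, 2, 3, 4, 5, 6]
ys = [45, 48, 50, 54, 56, 62]

x_bar = sum(xs) / len(xs)   # 3.5
y_bar = sum(ys) / len(ys)   # 52.5

# Least-squares slope: Σ(x - x̄)(y - ȳ) / Σ(x - x̄)²
numerator = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))  # 56.5
denominator = sum((x - x_bar) ** 2 for x in xs)                     # 17.5

growth_rate = numerator / denominator
print(round(growth_rate, 2))  # 3.23 (percentage points per month)
```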
Gather planned changes from programme teams, HR (staff growth), and strategic planning. Each planned change translates to capacity demand through documented ratios. Establish these ratios from historical data:
- Users per GB of file storage: measure current storage divided by current users
- Database transactions per programme beneficiary: measure transaction logs against beneficiary counts
- Compute cycles per concurrent application user: measure during known-load periods
Example translation for a planned programme expansion adding 5,000 beneficiaries:
```
Current beneficiaries: 20,000
Current database size: 400 GB
Ratio: 400 GB / 20,000 = 0.02 GB per beneficiary

Additional demand: 5,000 × 0.02 GB = 100 GB
Timeline: programme launches in 6 months
Monthly demand increase: 100 GB / 6 = 16.7 GB per month (during ramp-up)
```

- Combine organic growth with planned changes to create a composite forecast:
```
Forecast utilisation = Baseline + (Organic growth × Months) + Planned demand
```

For the database example with 400 GB baseline, 2% monthly organic growth, and 100 GB planned increase over 6 months:
| Month | Organic forecast | Planned addition (monthly) | Cumulative forecast |
|---|---|---|---|
| 1 | 408 GB | +16.7 GB | 424.7 GB |
| 2 | 416 GB | +16.7 GB | 449.4 GB |
| 3 | 424 GB | +16.7 GB | 474.1 GB |
| 4 | 432 GB | +16.7 GB | 498.8 GB |
| 5 | 440 GB | +16.7 GB | 523.5 GB |
| 6 | 448 GB | +16.7 GB | 548.2 GB |
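The composite forecast table can be regenerated with a few lines (a sketch that, like the table, applies simple rather than compounded organic growth):

```python
baseline_gb = 400.0
organic_rate = 0.02        # 2% of baseline per month (simple growth)
planned_monthly_gb = 16.7  # 100 GB planned demand spread over 6 months

forecasts = []
for month in range(1, 7):
    organic = baseline_gb * (1 + organic_rate * month)
    forecast = organic + planned_monthly_gb * month
    forecasts.append(round(forecast, 1))

print(forecasts)  # [424.7, 449.4, 474.1, 498.8, 523.5, 548.2]
```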
Figure 2: Demand forecast combining historical trend with planned growth (chart not reproduced; the values are those in the forecast table above)
Define capacity thresholds
Capacity thresholds are utilisation levels that trigger specific actions. Define three threshold tiers for each resource: monitoring threshold (increased observation), warning threshold (planning action required), and critical threshold (immediate action required).
Set threshold values based on resource characteristics and procurement lead times. Resources with long lead times (physical hardware requiring procurement, budget approval, and installation) require lower thresholds than instantly-scalable cloud resources:
| Resource type | Monitor | Warning | Critical | Rationale |
|---|---|---|---|---|
| Physical server | 50% | 65% | 80% | 12-16 week procurement cycle |
| On-premises storage | 55% | 70% | 85% | 8-12 week procurement cycle |
| Cloud compute | 70% | 80% | 90% | Minutes to provision |
| Cloud storage | 75% | 85% | 95% | Instant provisioning |
| Network link | 60% | 75% | 85% | 4-8 week circuit provisioning |
| Software licence | 70% | 85% | 95% | Days to weeks for procurement |

Calculate the time-to-threshold for each resource using your demand forecast:
```
Time to threshold = (Threshold - Current utilisation) / Monthly growth rate
```

For a database at 52.5% utilisation with 3.23% monthly growth and a 70% warning threshold:

```
Time to warning = (70 - 52.5) / 3.23 = 5.4 months
```

Resources with a time-to-threshold shorter than their procurement lead time require immediate action.
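The time-to-threshold calculation is straightforward to script across a whole inventory; a minimal sketch using the database example:

```python
def months_to_threshold(threshold_pct, current_pct, monthly_growth_pct):
    """Months until utilisation reaches the threshold at the current growth rate."""
    return (threshold_pct - current_pct) / monthly_growth_pct

# Database example: 52.5% utilised, growing 3.23 points/month, 70% warning threshold
print(round(months_to_threshold(70, 52.5, 3.23), 1))  # 5.4

# Compare against procurement lead time (2 months here is a hypothetical value)
procurement_lead_months = 2
if months_to_threshold(70, 52.5, 3.23) < procurement_lead_months:
    print("immediate action required")
```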
- Configure alerting rules in your monitoring system to trigger at each threshold:
```yaml
# Prometheus alerting rules example
groups:
  - name: capacity_thresholds
    rules:
      - alert: CapacityMonitor
        expr: (node_filesystem_avail_bytes / node_filesystem_size_bytes) * 100 < 50
        for: 1h
        labels:
          severity: info
        annotations:
          summary: "Storage capacity monitoring threshold reached"

      - alert: CapacityWarning
        expr: (node_filesystem_avail_bytes / node_filesystem_size_bytes) * 100 < 35
        for: 1h
        labels:
          severity: warning
        annotations:
          summary: "Storage capacity warning - planning action required"

      - alert: CapacityCritical
        expr: (node_filesystem_avail_bytes / node_filesystem_size_bytes) * 100 < 20
        for: 15m
        labels:
          severity: critical
        annotations:
          summary: "Storage capacity critical - immediate action required"
```

Model scaling scenarios
Before reaching thresholds, model the scaling options available for each constrained resource. Scaling takes two forms: vertical scaling (increasing capacity of existing resources) and horizontal scaling (adding additional resource instances).
- Document vertical scaling limits for each resource. Physical servers have maximum memory slots, maximum CPU sockets, and maximum disk bays. Virtual machines have hypervisor-imposed limits. Cloud instances have instance-type ceilings:
```
Resource: app-server-01 (physical)
Current: 2× CPU (16 cores total), 64 GB RAM, 4× 1 TB SSD
Maximum: 4× CPU (32 cores), 256 GB RAM, 8× disk bays
Vertical headroom: 100% CPU, 300% RAM, 100% storage

Resource: app-server-02 (AWS EC2 m5.2xlarge)
Current: 8 vCPU, 32 GB RAM
Maximum in family: m5.24xlarge (96 vCPU, 384 GB RAM)
Vertical headroom: 1100% CPU, 1100% RAM
Migration required for larger: m5.metal (96 vCPU, 384 GB)
```

- Assess horizontal scaling feasibility for each application. Horizontal scaling requires application support for distributed operation: stateless design, external session storage, load balancer compatibility, and database connection pooling. Document scaling constraints:
```
Application: Grants Management System
Horizontal scaling: Supported (stateless application tier)
Session handling: Redis cluster (external)
Database: PostgreSQL with read replicas
Load balancer: HAProxy (already deployed)
Scaling unit: 1 additional server adds capacity for ~500 concurrent users
Cost per unit: approximately £400/month (cloud) or £8,000 capital (physical)

Application: Legacy Finance System
Horizontal scaling: Not supported (stateful, single-instance design)
Vertical scaling only: current VM can expand to 4× current resources
Replacement planning: migration to cloud ERP scheduled for Q3
```

- Calculate cost per capacity unit for each scaling option to enable comparison:
```
Vertical scaling cost efficiency:
  Current: 8 vCPU at £200/month = £25 per vCPU
  Upgrade to 16 vCPU: £380/month = £23.75 per vCPU (5% more efficient)
  Upgrade to 32 vCPU: £720/month = £22.50 per vCPU (10% more efficient)

Horizontal scaling cost efficiency:
  Additional 8 vCPU instance: £200/month = £25 per vCPU
  Plus load balancer overhead: £50/month shared across instances

Break-even analysis:
  - Below 16 vCPU total: vertical scaling more cost-effective
  - Above 16 vCPU total: horizontal scaling provides better resilience value
```

```
              Capacity threshold breach predicted
                             |
                             v
              Application supports horizontal scaling?
                   |                        |
                  Yes                       No
                   |                        |
                   v                        v
      High availability required?   Vertical headroom available?
           |            |                |            |
          Yes           No              Yes           No
           |            |                |            |
           v            v                v            v
       Scale out    Compare costs:   Scale up     Replace or
       (horizontal) vert vs horiz    (vertical)   migrate
```

Figure 3: Decision tree for selecting scaling approach
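The cost-per-vCPU comparison can be checked with a short calculation. This is a sketch: the £ figures come from the example above, and horizontal_cost_per_vcpu is an illustrative helper assuming each extra instance costs £200/month with the £50/month load balancer shared across instances:

```python
def cost_per_vcpu(monthly_cost_gbp, vcpus):
    """Monthly cost divided by vCPU count."""
    return monthly_cost_gbp / vcpus

print(cost_per_vcpu(200, 8))    # 25.0
print(cost_per_vcpu(380, 16))   # 23.75
print(cost_per_vcpu(720, 32))   # 22.5

def horizontal_cost_per_vcpu(instances):
    """Effective £/vCPU for n 8-vCPU instances plus shared load balancer."""
    return (200 * instances + 50) / (8 * instances)

print(round(horizontal_cost_per_vcpu(2), 2))  # 28.12 at 16 vCPU total
```

At 16 vCPU total the vertical option (£23.75/vCPU) beats the horizontal one on raw cost, which is consistent with the break-even analysis; the horizontal premium buys resilience rather than capacity.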
Document capacity plan
Compile findings into a capacity plan document containing: resource inventory with current baselines, demand forecasts by resource, threshold breach predictions, recommended scaling actions, cost estimates, and timeline for procurement or provisioning.
Structure the document with an executive summary showing resources requiring action within the planning horizon (typically 12 months), followed by detailed analysis for each resource category.
Present the capacity plan to stakeholders with budget authority. The plan should answer: what resources will become constrained, when will constraints occur, what are the options to address them, and what is the cost of each option. Include a do-nothing scenario showing the service impact of failing to scale.
Obtain approval for scaling actions and update procurement plans, cloud budgets, or project schedules accordingly. Record approved actions in a capacity action register:
| Resource | Action | Trigger date | Lead time | Completion target | Budget | Owner |
|---|---|---|---|---|---|---|
| SAN-01 | Add shelf | 2024-09-01 | 8 weeks | 2024-11-01 | £12,000 | Infrastructure |
| db-cluster | Add replica | 2024-07-15 | 2 weeks | 2024-08-01 | £800/mo | Database |
| app-tier | Scale policy | Immediate | 1 day | 2024-06-15 | Variable | Cloud ops |
Configure cloud auto-scaling
For cloud infrastructure, capacity planning translates into auto-scaling policies that respond to demand automatically. Configure scaling policies based on the thresholds established in your capacity plan.
- Define scaling metrics that align with your capacity constraints. CPU utilisation serves most compute workloads, but queue depth, response latency, or custom application metrics provide better scaling signals for specific workload types:
```yaml
# AWS Auto Scaling policy example
Resources:
  ScalingPolicy:
    Type: AWS::AutoScaling::ScalingPolicy
    Properties:
      AutoScalingGroupName: !Ref AppServerGroup
      PolicyType: TargetTrackingScaling
      TargetTrackingConfiguration:
        PredefinedMetricSpecification:
          PredefinedMetricType: ASGAverageCPUUtilization
        TargetValue: 70.0
        ScaleInCooldown: 300
        ScaleOutCooldown: 60
```

- Set minimum and maximum instance counts based on your baseline and forecast:
```yaml
  AppServerGroup:
    Type: AWS::AutoScaling::AutoScalingGroup
    Properties:
      MinSize: 2           # Minimum for high availability
      MaxSize: 10          # Budget cap based on forecast peak
      DesiredCapacity: 3   # Current baseline requirement
```

Calculate maximum size from your demand forecast: if the peak forecast shows 280% of the current baseline, and the current baseline requires 3 instances, the maximum should be at least 9 instances (3 × 2.8, rounded up).
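The sizing rule above is simply a ceiling of baseline instances times the forecast peak ratio; a one-liner makes the rounding explicit:

```python
import math

baseline_instances = 3     # current DesiredCapacity
peak_forecast_ratio = 2.8  # peak forecast at 280% of current baseline

# Round up: a fractional instance must become a whole one
max_size = math.ceil(baseline_instances * peak_forecast_ratio)
print(max_size)  # 9
```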
- Configure scaling cooldown periods to prevent oscillation. Scale-out cooldown should be short (60 seconds) to respond to demand spikes. Scale-in cooldown should be longer (300 seconds or more) to avoid premature scale-down:
```hcl
# Terraform example for Azure VM Scale Set
resource "azurerm_monitor_autoscale_setting" "app" {
  name                = "app-autoscale"
  resource_group_name = azurerm_resource_group.main.name
  location            = azurerm_resource_group.main.location
  target_resource_id  = azurerm_linux_virtual_machine_scale_set.app.id

  profile {
    name = "default"

    capacity {
      default = 3
      minimum = 2
      maximum = 10
    }

    rule {
      metric_trigger {
        metric_name        = "Percentage CPU"
        metric_resource_id = azurerm_linux_virtual_machine_scale_set.app.id
        time_grain         = "PT1M"
        statistic          = "Average"
        time_window        = "PT5M"
        time_aggregation   = "Average"
        operator           = "GreaterThan"
        threshold          = 70
      }

      scale_action {
        direction = "Increase"
        type      = "ChangeCount"
        value     = "1"
        cooldown  = "PT1M"
      }
    }

    rule {
      metric_trigger {
        metric_name        = "Percentage CPU"
        metric_resource_id = azurerm_linux_virtual_machine_scale_set.app.id
        time_grain         = "PT1M"
        statistic          = "Average"
        time_window        = "PT10M"
        time_aggregation   = "Average"
        operator           = "LessThan"
        threshold          = 30
      }

      scale_action {
        direction = "Decrease"
        type      = "ChangeCount"
        value     = "1"
        cooldown  = "PT5M"
      }
    }
  }
}
```

Address field infrastructure constraints
Field locations present capacity planning challenges that differ from headquarters or cloud infrastructure. Procurement lead times extend to months rather than weeks due to shipping, customs, and installation logistics. Power and cooling constraints impose hard ceilings that procurement cannot solve.
- Inventory power capacity at each field location. Calculate total available watts, current draw, and headroom:
```
Location: Juba field office
Power source: Solar + battery (4 kW system)
Available for IT: 1.5 kW (remainder for lighting, HVAC, other)
Current IT load: 1.1 kW
Headroom: 400 W (27%)

Constraint: Cannot add equipment drawing more than 400 W without a solar
system upgrade (12-week lead time, £8,000 cost)
```

Plan capacity additions to align with logistics windows. Field locations receiving quarterly supply shipments require capacity planning 4-6 months ahead to include equipment in the next shipment:
| Location | Logistics window | Planning deadline | Equipment cutoff |
|---|---|---|---|
| Nairobi hub | Monthly | 3 weeks prior | 4 weeks prior |
| Juba office | Quarterly | 10 weeks prior | 12 weeks prior |
| Cox’s Bazar | Bi-monthly | 6 weeks prior | 8 weeks prior |
| Remote sites | Per deployment | 16 weeks prior | 20 weeks prior |

Factor bandwidth constraints into application capacity planning. A field office with 2 Mbps connectivity cannot support the same concurrent user count as headquarters, regardless of local compute capacity:
```
Application: Case management system
Per-user bandwidth: 50 Kbps average, 200 Kbps peak
Headquarters (100 Mbps): supports 500 concurrent users (bandwidth)
Field office (2 Mbps): supports 10 concurrent users (bandwidth limited)

Local caching reduces the bandwidth requirement to 15 Kbps average:
Field office with caching: supports 33 concurrent users
```

```
Week: -20        -16        -12         -8         -4          0
 |- Capacity assessment -|
         |- Forecast -|
              |- Procurement approval -|
                        |- Equipment ordering -|
                                  |- Shipping and customs -|
                                                  |- Installation -|
                                                               Live

Total lead time: 20 weeks from assessment to deployment
```

Figure 4: Field infrastructure capacity planning timeline showing extended lead times
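The bandwidth-limited user counts above can be checked with a small helper. One assumption is ours: since the worked numbers size capacity on per-user peak bandwidth, we assume caching reduces peak proportionally to the average (50 → 15 Kbps average implies roughly 200 × 15/50 = 60 Kbps peak):

```python
def bandwidth_limited_users(link_mbps, per_user_peak_kbps):
    """Concurrent users a link supports, sized on per-user peak bandwidth."""
    return int(link_mbps * 1000 // per_user_peak_kbps)

print(bandwidth_limited_users(100, 200))  # 500 (headquarters, 100 Mbps)
print(bandwidth_limited_users(2, 200))    # 10  (field office, 2 Mbps)

# With caching: assumed peak of 200 * 15/50 = 60 Kbps per user
print(bandwidth_limited_users(2, 60))     # 33
```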
Verification
After completing capacity planning activities, verify the outputs meet quality requirements:
Baseline accuracy: Compare calculated baselines against known peak utilisation periods. Baselines should be within 10% of observed peaks during normal operations. Query a known busy period:
```
# Verify baseline against last month's peak
max_over_time(
  (1 - avg by(instance)(rate(node_cpu_seconds_total{mode="idle"}[5m])))[30d:1h]
)
```

If the peak exceeds your baseline by more than 10%, investigate whether the peak represents normal operation (adjust baseline) or an anomaly (document exception).
Forecast validation: Backtest your forecasting model by applying it to historical data and comparing predictions against actuals:
Backtest procedure:
1. Use data from months 1-6 to forecast month 9
2. Compare the forecast against actual month 9 utilisation
3. Calculate forecast error: |Forecast - Actual| / Actual × 100
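The backtest can be scripted against the regression from the worked example; this sketch fits months 1-6 and scores a month 9 prediction (the "actual" month 9 value here is hypothetical, for illustration only):

```python
xs = list(range(1, 7))
ys = [45, 48, 50, 54, 56, 62]       # months 1-6 from the worked example

# Least-squares fit, as in the forecasting step
x_bar, y_bar = sum(xs) / 6, sum(ys) / 6
slope = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
         / sum((x - x_bar) ** 2 for x in xs))
intercept = y_bar - slope * x_bar

forecast_m9 = intercept + slope * 9  # about 70.3%
actual_m9 = 66.0                     # hypothetical observed value
error_pct = abs(forecast_m9 - actual_m9) / actual_m9 * 100

print(round(forecast_m9, 1), round(error_pct, 1))  # 70.3 6.5
```

An error of roughly 6.5% would sit comfortably inside the 15% acceptance band for 3-month forecasts.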
Acceptable error: below 15% for 3-month forecasts
Requires investigation: error above 25%

Threshold alert testing: Confirm monitoring alerts trigger at defined thresholds by temporarily lowering thresholds or using test metrics:
```shell
# Create test metric to verify alerting (Prometheus example)
curl -X POST http://pushgateway:9091/metrics/job/capacity_test \
  -d 'test_capacity_utilisation 95'

# Verify alert fires within expected timeframe
# Remove test metric after verification
curl -X DELETE http://pushgateway:9091/metrics/job/capacity_test
```

Documentation completeness: Confirm the capacity plan contains all required elements:
- Resource inventory with current baselines: present and dated within 30 days
- Demand forecasts with methodology documented: present for all constrained resources
- Threshold definitions with alert configuration references: present and tested
- Scaling recommendations with cost estimates: present for resources breaching thresholds
- Approval records for planned actions: present with dates and approvers
Troubleshooting
| Symptom | Cause | Resolution |
|---|---|---|
| Baseline calculation returns NULL or zero | Monitoring data gaps exceed query tolerance | Reduce query timeframe or repair monitoring collection; verify metric names match current configuration |
| Forecast shows negative growth despite observed increases | Seasonal pattern creating misleading trend | Use year-over-year comparison instead of linear regression; apply seasonal decomposition |
| Time-to-threshold shorter than procurement lead time | Delayed capacity planning or accelerated growth | Initiate emergency procurement; implement temporary mitigations (optimisation, load shedding); escalate to management |
| Auto-scaling triggers continuously (oscillation) | Cooldown period too short or threshold too close to steady-state | Increase cooldown period; widen gap between scale-out and scale-in thresholds; use predictive scaling |
| Auto-scaling fails to respond to demand spike | Wrong metric selected or aggregation period too long | Verify metric reflects actual bottleneck; reduce aggregation window; add multiple trigger metrics |
| Capacity plan rejected due to cost | Budget constraints not incorporated in planning | Include finance stakeholders earlier; model multiple scenarios with different cost profiles; identify optimisation opportunities |
| Field equipment arrives but cannot deploy | Power, cooling, or space constraints not assessed | Conduct site survey before procurement; include infrastructure requirements in equipment specifications |
| Forecast accuracy degrades over time | Business conditions changed from planning assumptions | Increase planning frequency; establish triggers for unscheduled plan updates; improve communication with programme teams |
| Horizontal scaling fails to improve performance | Application bottleneck not in scaled component | Profile application to identify actual constraint; verify load balancer distributes traffic; check for shared resource contention |
| Vertical scaling causes application instability | Application not tested at higher resource levels; configuration limits | Test scaling in non-production first; verify application configuration supports larger resource allocation; check for 32-bit limitations |
| Monitoring shows lower utilisation than users report | Monitoring sampling misses short spikes; wrong metric monitored | Increase sampling frequency; verify metric captures user-experienced performance; add application-level metrics |
| Storage capacity forecast inaccurate | Compression or deduplication ratios changed; data growth patterns shifted | Re-baseline using recent data; factor compression ratio into forecasts; monitor ratio changes |
Emergency capacity situations
When capacity constraints cause immediate service impact, bypass normal planning procedures. Document the emergency action taken, notify stakeholders, and schedule a post-incident review to update capacity plans and thresholds.