Deployment Practices
Deployment practices are repeatable strategies for transferring tested software from pre-production environments into live service. Each practice represents a distinct approach to managing the risk, speed, and coordination required when changes affect running systems. The choice of deployment practice determines how users experience the transition, how quickly problems become visible, and what options exist when something goes wrong.
Problem context
Production deployments operate under competing pressures that no single approach satisfies optimally. Speed matters because delayed deployments mean delayed value, accumulated change batches, and coordination overhead. Safety matters because production failures affect real users, damage trust, and consume incident response capacity. Visibility matters because operators need to detect problems before they propagate. Reversibility matters because even well-tested changes fail in production.
These pressures create the fundamental tension deployment practices must resolve. Deploying everything at once minimises coordination complexity but maximises blast radius. Deploying gradually reduces blast radius but extends the period during which two versions coexist. Maintaining parallel environments enables instant rollback but doubles infrastructure cost during transitions.
Deployment practices apply when releasing changes that alter running systems. They do not apply to initial installations into empty environments, disaster recovery restores, or configuration changes that take effect without service transitions. The practices assume changes have completed release management activities including testing, approval, and packaging.
Organisations face additional constraints that shape pattern selection. Field deployments over unreliable networks cannot assume continuous connectivity during transitions. Resource-constrained teams cannot maintain duplicate production environments indefinitely. Donor-funded systems may require change documentation that specific patterns support more naturally than others. Legacy applications lacking health check endpoints cannot participate in automated traffic shifting.
Solution
Five deployment practices address the core tensions with different trade-off profiles. Each practice specifies how traffic moves from old to new, what infrastructure the transition requires, how problems become visible, and what reversal options exist.
Big bang deployment
Big bang deployment replaces all instances of a service simultaneously at a scheduled moment. Traffic flows to the old version until the cutover instant, then flows entirely to the new version. No period of mixed versions exists.
TIME ------->

Before cutover:

+------------------+     +------------------+
|  Load Balancer   |---->| Version 1.0      |
|                  |     | (all traffic)    |
+------------------+     +------------------+

Cutover instant (T):

+------------------+     +------------------+
|  Load Balancer   |---->| Version 2.0      |
|                  |     | (all traffic)    |
+------------------+     +------------------+

After cutover: all users on Version 2.0

Figure 1: Big bang deployment showing instantaneous traffic switch at cutover
The mechanism requires stopping the old version and starting the new version within the maintenance window. For stateless services, this means draining connections, deploying new code, and accepting new connections. For stateful services, this additionally requires data migration or schema updates that both versions cannot share.
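For a stateless service, the drain-deploy-switch sequence can be sketched as below. This is a minimal illustration, assuming a mutable load balancer pool and a caller-supplied health check; none of these names come from a real load balancer API.

```python
# Hypothetical sketch of a big bang cutover for a stateless service.
# The pool, instance names, and health_check callable are illustrative
# assumptions, not a real load balancer API.

def big_bang_cutover(pool, old_instances, new_instances, health_check):
    """Switch all traffic from old to new in one step.

    pool          -- mutable list representing the load balancer pool
    old_instances -- instances currently serving traffic
    new_instances -- instances running the new version
    health_check  -- callable returning True when an instance is ready
    """
    # 1. Verify every new instance is healthy before touching traffic.
    if not all(health_check(i) for i in new_instances):
        raise RuntimeError("new version failed pre-cutover health checks")
    # 2. Drain: remove old instances so no new connections arrive.
    for i in old_instances:
        pool.remove(i)
    # 3. Cutover: add new instances; all traffic now flows to the new version.
    pool.extend(new_instances)
    return pool
```

The health check gate matters: without it, the cutover instant is also the first moment anyone learns the new version cannot start.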
Big bang deployment suits situations where mixed-version operation is impossible. Database schema changes that break backward compatibility, protocol version upgrades that require coordinated client and server changes, and licensing transitions that prohibit running both versions simultaneously all mandate big bang approaches. The pattern also suits low-risk changes to non-critical services where deployment speed outweighs gradual validation benefits.
The practice requires a maintenance window during which the service is unavailable or degraded. Window duration depends on deployment complexity: a container image swap completes in seconds while a database migration may require hours. Organisations serving users across time zones face difficulty finding windows with acceptable impact.
Rollback requires repeating the deployment process in reverse. If the new version corrupted data or made incompatible changes, rollback may be impossible without restoring from backup. The all-or-nothing nature means problems affect all users immediately upon cutover.
Rolling deployment
Rolling deployment replaces instances incrementally, shifting traffic gradually from old to new versions. At any moment during the deployment, both versions serve production traffic. The deployment completes when all instances run the new version.
TIME ------->

Stage 1: 25% complete

+------------------+     +--------+--------+
|  Load Balancer   |---->|  v1.0  |  v2.0  |
|                  |     |  (75%) |  (25%) |
+------------------+     +--------+--------+

Stage 2: 50% complete

+------------------+     +--------+--------+
|  Load Balancer   |---->|  v1.0  |  v2.0  |
|                  |     |  (50%) |  (50%) |
+------------------+     +--------+--------+

Stage 3: 100% complete

+------------------+     +------------------+
|  Load Balancer   |---->| Version 2.0      |
|                  |     | (100%)           |
+------------------+     +------------------+

Figure 2: Rolling deployment showing gradual traffic shift from old to new version
The mechanism removes instances from the load balancer pool, updates them, validates their health, and returns them to the pool. Orchestration platforms automate this sequence: Kubernetes rolling updates, AWS Auto Scaling group instance refresh, and Ansible serial execution all implement rolling deployment with varying configuration options.
Rolling deployment requires both versions to coexist safely. The database schema must support queries from both versions. API contracts must remain compatible. Session state must survive version transitions or use external storage accessible to both versions. These requirements constrain what changes rolling deployment can deliver.
A ten-instance service with 25% batch size deploys in four stages. Each stage removes 2-3 instances, reducing capacity temporarily. If instances require 60 seconds to start and pass health checks, the deployment requires at least 4 minutes excluding validation delays. Larger services deploy faster in wall-clock time relative to their instance count because more instances update in parallel.
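The stage arithmetic above can be checked with a small helper. This is an illustrative sketch, assuming batch sizes round up and that instances within a stage update in parallel; real orchestrators add validation delays on top of the minimum.

```python
import math

# Illustrative helper for rolling deployment stage arithmetic.
# Assumptions: batch size rounds up, batches update in parallel
# within a stage, and validation delays are excluded.

def rolling_plan(instance_count, batch_fraction, startup_seconds):
    """Return (stage_count, minimum_duration_seconds) for a rolling deployment."""
    batch_size = max(1, math.ceil(instance_count * batch_fraction))
    stage_count = math.ceil(instance_count / batch_size)
    # Minimum wall-clock time is one startup interval per stage.
    return stage_count, stage_count * startup_seconds
```

For the ten-instance example with a 25% batch and 60-second startup, this yields four stages and a four-minute minimum, matching the figures in the text.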
Rollback during a rolling deployment means continuing the process with the old version. Instances already updated must cycle through the rolling process again. If 60% of instances run the new version when problems appear, rolling back requires updating that 60% back to the old version through the same incremental process.
Blue-green deployment
Blue-green deployment maintains two identical production environments, with only one receiving traffic at any time. Deployment targets the inactive environment, validation occurs against that environment, and cutover switches traffic instantaneously between environments.
Before deployment (Blue active):

                         +------------------+
+------------------+  +->| Blue (v1.0)      |
|                  |  |  | ACTIVE           |
|  Load Balancer   |--+  +------------------+
|                  |
+------------------+     +------------------+
                         | Green (idle)     |
                         | STANDBY          |
                         +------------------+

After deployment (Green active):

                         +------------------+
+------------------+     | Blue (v1.0)      |
|                  |     | STANDBY          |
|  Load Balancer   |--+  +------------------+
|                  |  |
+------------------+  +->+------------------+
                         | Green (v2.0)     |
                         | ACTIVE           |
                         +------------------+

Figure 3: Blue-green deployment showing environment switch with instant cutover
The mechanism deploys version 2.0 to the inactive green environment while blue continues serving traffic. Testing validates green independently: synthetic transactions, integration tests, and manual verification all execute against green without affecting production users. When validation passes, the load balancer or DNS record switches to point at green. Blue remains available for instant rollback by reversing the switch.
Blue-green deployment enables pre-production validation against actual production infrastructure. The green environment runs with production configuration, production data access (read-only or via replication), and production network topology. Issues that manifest only under production conditions become visible before users encounter them.
The pattern doubles infrastructure cost during the standby period. A production environment consuming 10 compute instances requires 20 instances to support blue-green deployment. Cloud infrastructure with rapid provisioning reduces this overhead by creating the green environment only for deployment, but the environment must exist long enough for meaningful validation.
Database management creates complexity. If both environments share a database, schema changes must work for both versions simultaneously, reducing blue-green to rolling deployment semantics for the data layer. If environments use separate databases, data synchronisation during the deployment and cutover adds operational burden.
Rollback executes in seconds by switching traffic back to blue. Unlike rolling deployment, no instances require reconfiguration. The old version remains running and tested. This instant rollback makes blue-green deployment attractive for high-risk changes where rapid recovery outweighs infrastructure cost.
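The environment switch and its instant rollback can be sketched as a small state machine. This is a minimal illustration assuming a single router pointer and a caller-supplied validation callable; the class and method names are hypothetical.

```python
# Hypothetical blue-green state machine. Environment names and the
# validate callable are illustrative assumptions.

class BlueGreen:
    def __init__(self):
        self.environments = {"blue": "v1.0", "green": None}
        self.active = "blue"

    def inactive(self):
        return "green" if self.active == "blue" else "blue"

    def deploy(self, version):
        # Deploy only to the environment not receiving traffic.
        self.environments[self.inactive()] = version

    def cutover(self, validate):
        # Switch traffic only after the inactive environment validates.
        target = self.inactive()
        if not validate(self.environments[target]):
            raise RuntimeError("validation failed; traffic stays on " + self.active)
        self.active = target

    def rollback(self):
        # The old environment is still running, so rollback is the same switch.
        self.active = self.inactive()
```

Note that `rollback` is literally `cutover` without validation: the speed comes from the old environment never having been torn down.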
Canary deployment
Canary deployment routes a small percentage of production traffic to the new version while the majority continues using the old version. The percentage increases gradually as confidence builds, reaching 100% only after sustained observation.
Stage 1: Canary (5% traffic)

+------------------+     +------------------+
|                  |---->| Version 1.0      |
|  Load Balancer   |     | (95%)            |
|                  |     +------------------+
|  (weighted       |
|   routing)       |     +------------------+
|                  |---->| Version 2.0      |
+------------------+     | (5% canary)      |
                         +------------------+

Stage 2: Expanded (25% traffic)

+------------------+     +------------------+
|                  |---->| Version 1.0      |
|  Load Balancer   |     | (75%)            |
|  (weighted       |     +------------------+
|   routing)       |
|                  |     +------------------+
|                  |---->| Version 2.0      |
+------------------+     | (25%)            |
                         +------------------+

Stage 3: Complete (100% traffic)

+------------------+     +------------------+
|  Load Balancer   |---->| Version 2.0      |
|                  |     | (100%)           |
+------------------+     +------------------+

Figure 4: Canary deployment showing progressive traffic percentage increase
The mechanism requires traffic splitting at the load balancer or service mesh layer. Percentage-based routing sends a configured proportion of requests to the canary instances. More sophisticated implementations use header-based or cookie-based routing to ensure individual users see consistent versions throughout their sessions.
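Identifier-based consistency can be approximated by hashing a stable user identifier into a fixed bucket. This is a hedged sketch; the bucket count and function name are assumptions, not any particular service mesh's implementation.

```python
import hashlib

# Sketch of consistent percentage routing: hashing a stable user
# identifier keeps each user on one version throughout the rollout.
# The 100-bucket scheme is an illustrative assumption.

def route(user_id, canary_percent):
    """Return 'canary' or 'stable' deterministically for a given user."""
    digest = hashlib.sha256(user_id.encode()).hexdigest()
    bucket = int(digest, 16) % 100          # stable bucket in 0..99
    return "canary" if bucket < canary_percent else "stable"
```

Because the bucket is a pure function of the identifier, raising the percentage only moves users from stable to canary, never the reverse, so no user flips back and forth as the rollout advances.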
Canary deployment provides production validation under real traffic before full commitment. A 5% canary serving 1,000 requests per hour generates sufficient signal to detect elevated error rates, latency increases, or resource consumption changes. The blast radius of an undetected defect remains limited to the canary percentage.
The pattern requires robust monitoring and clear success criteria. Operators must define what metrics indicate healthy deployment and what thresholds trigger rollback. Automated canary analysis compares canary metrics against baseline, promoting or aborting the deployment based on statistical significance. Manual canary analysis relies on operator judgement within defined escalation timeframes.
Traffic percentage progression follows deployment confidence. A typical progression moves from 5% to 25% to 50% to 100%, with observation periods between stages. Aggressive schedules compress progression to minutes; conservative schedules extend observation to days. The organisation’s risk tolerance and monitoring capability determine appropriate progression speed.
Rollback sets the canary percentage to 0%, routing all traffic to the old version. The canary instances can remain available for debugging or terminate immediately. Unlike blue-green, canary infrastructure cost scales with the percentage, not a full environment duplicate.
Feature flags
Feature flags deploy code containing both old and new behaviour, with runtime configuration determining which behaviour executes. The deployment itself becomes a non-event; the behavioural change happens through configuration changes that can target specific users, percentages, or conditions.
Deployment (code contains both paths):

+------------------+     +------------------+
|  Load Balancer   |---->| Version 2.0      |
|                  |     | (all traffic)    |
+------------------+     +------------------+
                                  |
                                  v
                         +------------------+
                         |  Feature Flag    |
                         |  Service         |
                         +------------------+
                                  |
                    +-------------+-------------+
                    |                           |
                    v                           v
           +----------------+         +----------------+
           |   Flag: OFF    |         |   Flag: ON     |
           | Old behaviour  |         | New behaviour  |
           +----------------+         +----------------+

Figure 5: Feature flag deployment showing runtime behaviour selection
The mechanism embeds conditional logic in the application that queries flag state before executing behaviour. Flag evaluation can check user attributes (role, location, organisation), request attributes (headers, parameters), or random distribution (percentage rollout). A feature flag service provides centralised flag management with immediate propagation.
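Flag evaluation along those lines might look like the following sketch. The rule fields (enabled_roles, percentage) are illustrative assumptions, not the schema of any particular flag service.

```python
import hashlib

# Illustrative flag evaluator: attribute rules first, then percentage
# rollout. Field names are hypothetical, not a real flag service schema.

def flag_enabled(flag, user):
    """Evaluate a flag against user attributes, then percentage rollout."""
    if user.get("role") in flag.get("enabled_roles", []):
        return True
    percent = flag.get("percentage", 0)
    # Hash the flag name together with the user id so that each flag
    # rolls out to an independent slice of the user base.
    key = (flag["name"] + ":" + user["id"]).encode()
    bucket = int(hashlib.sha256(key).hexdigest(), 16) % 100
    return bucket < percent
```

Including the flag name in the hash is the important detail: hashing the user id alone would put the same 5% of users into every flag's early rollout.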
Feature flags separate deployment risk from release risk. Code deploys through normal processes with new behaviour disabled. Enabling the flag releases the feature without deployment. Disabling the flag reverts behaviour without deployment. This separation enables deploying code during business hours and releasing features at optimal moments.
The pattern requires engineering discipline. Flag conditions scattered throughout code create maintenance burden. Flags that remain after features stabilise accumulate as technical debt. Testing must cover all flag combinations that can occur in production. Teams need processes for flag lifecycle management: creation, activation, stabilisation, and removal.
Feature flags enable use cases beyond deployment safety. A/B testing uses flags to present different experiences to user segments. Operational flags disable expensive features during capacity constraints. Entitlement flags control feature access based on subscription tier. The deployment use case represents one application of a general capability.
Rollback toggles the flag state, taking effect within the flag service propagation interval, commonly seconds to minutes. No infrastructure changes occur. The speed and simplicity of flag rollback makes feature flags attractive for features where rapid iteration matters more than infrastructure efficiency.
Implementation
Pattern selection depends on application architecture, infrastructure capabilities, change characteristics, and organisational constraints. No pattern universally dominates; each represents a valid response to different force balances.
Selection criteria
The decision tree begins with compatibility constraints. If the change requires database schema modifications that break backward compatibility, rolling deployment and canary deployment become impossible because both versions cannot coexist. Blue-green deployment with database separation or big bang deployment remain as options.
              +---------------------------+
              | Can old and new versions  |
              | coexist on shared data?   |
              +-------------+-------------+
                            |
                +-----------+-----------+
                |                       |
                v                       v
           +---------+             +---------+
           |   YES   |             |   NO    |
           +----+----+             +----+----+
                |                       |
                v                       v
    +-----------------------+  +-------------------+
    |   Instant rollback    |  | Blue-green with   |
    |      required?        |  | separate DBs      |
    +-----+-----------+-----+  | OR Big bang       |
          |           |        +-------------------+
          v           v
     +---------+ +---------+
     |   YES   | |   NO    |
     +----+----+ +----+----+
          |           |
          v           v
  +--------------+ +----------------+
  | Blue-green   | |   Gradual      |
  | OR           | |   validation   |
  | Feature flag | |   needed?      |
  +--------------+ +----+------+----+
                        |      |
                        v      v
                  +-------+ +-------+
                  |  YES  | |  NO   |
                  +---+---+ +---+---+
                      |         |
                      v         v
                +---------+ +---------+
                | Canary  | | Rolling |
                +---------+ +---------+

Figure 6: Deployment pattern selection decision tree
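The same decision tree can be expressed as a function, with boolean inputs mirroring the diagram's questions:

```python
# The selection decision tree as a function. The return strings mirror
# the diagram's leaf nodes.

def select_pattern(versions_coexist, instant_rollback_required,
                   gradual_validation_needed):
    """Walk the deployment pattern decision tree."""
    if not versions_coexist:
        return "blue-green with separate DBs OR big bang"
    if instant_rollback_required:
        return "blue-green OR feature flag"
    if gradual_validation_needed:
        return "canary"
    return "rolling"
```

Encoding the tree this way also makes the question ordering explicit: compatibility constraints are checked before rollback or validation preferences, because they eliminate options outright.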
Instant rollback requirements favour blue-green deployment or feature flags. If changes affect revenue-critical paths or user-facing features during high-traffic periods, the ability to revert in seconds rather than minutes justifies additional infrastructure cost. Rolling deployment rollback takes minutes to hours depending on instance count; canary rollback takes seconds but leaves partial traffic exposure until completion.
Gradual validation requirements favour canary deployment. If the change affects performance characteristics, resource consumption, or error rates in ways that testing cannot fully simulate, canary deployment provides production feedback before full commitment. Rolling deployment also provides gradual exposure but without percentage control.
Infrastructure cost constraints favour rolling deployment and feature flags. Neither requires duplicate environments. Feature flags add flag service dependencies and code complexity. Rolling deployment adds orchestration complexity and mixed-version operational overhead.
Organisational capability constraints matter. Blue-green deployment requires infrastructure automation to create and destroy environments efficiently. Canary deployment requires observability infrastructure to detect problems at low traffic percentages. Feature flags require engineering practices for flag lifecycle management. Rolling deployment has the lowest capability threshold, requiring only orchestration tools that most platforms provide.
Go-live coordination
Deployments affecting user experience require coordination beyond technical execution. Go-live coordination ensures stakeholders receive appropriate notice, support teams prepare for questions, and rollback decisions involve the right people.
A deployment communication timeline begins 5 business days before scheduled deployment with stakeholder notification. The notification identifies what changes, when deployment occurs, expected user impact, and rollback criteria. Recipients include service owners, support teams, communications teams, and affected user representatives.
At 24 hours before deployment, a readiness confirmation verifies prerequisites: change approval complete, deployment artifacts available, rollback plan documented, support teams briefed, monitoring dashboards prepared. Any failed prerequisite delays deployment.
During deployment, status updates flow to a designated communication channel at defined intervals: deployment started, percentage milestones for gradual patterns, validation checkpoints passed, deployment complete, or deployment aborted. Brief updates prevent information vacuums that generate support enquiries.
Post-deployment observation extends for a defined period based on pattern and risk. Blue-green deployments may hold the old environment available for 24-48 hours. Canary deployments may extend final percentage observation for similar periods. Feature flag rollouts may remain at partial percentage for days during A/B testing.
Cutover execution
Cutover procedures differ by pattern but share common elements. Pre-cutover verification confirms the target environment health: services responding, dependencies accessible, monitoring active, logs flowing. Verification failures abort cutover.
Database connection management
Connection pools on application servers may hold connections to old database instances or schemas after cutover. Plan for connection refresh through application restart or pool recycling.
Cutover execution follows documented runbook steps. For DNS-based cutover, TTL values determine propagation time; a 300-second TTL means up to 5 minutes before all clients see the new address. For load balancer cutover, propagation occurs within seconds, but existing connections may continue to old backends until timeout.
Smoke testing immediately follows cutover. A defined test suite executes against production endpoints verifying critical user journeys complete successfully. Smoke test scope balances thoroughness against time: comprehensive testing delays rollback decisions while superficial testing misses problems. A five-minute smoke test covering authentication, core transactions, and integration touchpoints provides reasonable coverage.
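A smoke test harness with a time budget might be sketched as follows. The structure and failure-reporting convention are assumptions; the five-minute default budget is taken from the text, and the named checks would be supplied per service.

```python
import time

# Illustrative smoke test harness. Checks are (name, callable) pairs
# returning truthy on success; the budget bounds rollback decision time.

def run_smoke_tests(checks, budget_seconds=300):
    """Run named checks until done or budget exhausted; return failures."""
    failures = []
    deadline = time.monotonic() + budget_seconds
    for name, check in checks:
        if time.monotonic() > deadline:
            failures.append((name, "skipped: time budget exhausted"))
            continue
        try:
            if not check():
                failures.append((name, "check returned false"))
        except Exception as exc:
            # A crashing check is a failure to report, not a harness abort.
            failures.append((name, str(exc)))
    return failures
```

The explicit budget enforces the trade-off described above: the harness reports what it could verify in the allotted time rather than silently delaying the rollback decision.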
Hypercare observation extends beyond smoke testing for hours or days depending on change risk. Operators monitor dashboards for anomalies, support teams track ticket volume and content, and stakeholders remain available for escalation. Defined exit criteria close the hypercare period: error rates within baseline, no critical incidents, stakeholder sign-off.
Consequences
Each pattern carries consequences beyond the deployment event that affect ongoing operations.
Blue-green deployment creates environment drift risk. The inactive environment may miss configuration updates, certificate renewals, or security patches applied to the active environment. Synchronisation processes or infrastructure-as-code practices mitigate drift but add operational overhead.
Rolling deployment creates mixed-version debugging complexity. Logs from the deployment period contain entries from both versions. Distinguishing version-specific behaviour requires version tagging in log entries and correlation by timestamp. Incident investigation during or immediately after rolling deployment must account for version mixing.
Canary deployment creates user experience inconsistency. Users may encounter different behaviour on successive requests if session affinity is not configured. Even with session affinity, users discussing the application may describe different experiences, generating confusion and support contacts.
Feature flags create testing combinatorics. If three flags exist with two states each, eight combinations exist. Testing all combinations becomes impractical as flag count grows. Risk-based testing prioritises likely combinations and known interactions while accepting that some combinations receive less coverage.
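The combinatorics are easy to enumerate, which is itself useful for risk-based test planning: n boolean flags yield 2**n combinations.

```python
from itertools import product

# Enumerate every on/off combination for a set of boolean flags,
# illustrating why exhaustive flag testing grows impractical.

def flag_combinations(flag_names):
    """Return a list of dicts, one per on/off combination."""
    return [dict(zip(flag_names, states))
            for states in product([False, True], repeat=len(flag_names))]
```

Three flags yield eight combinations, ten flags yield 1,024; a generated list like this can seed a prioritisation exercise rather than a test matrix to run in full.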
All gradual patterns create extended monitoring requirements. The organisation must sustain attention throughout the deployment period, which may span hours for rolling deployment or days for conservative canary progressions. Alert fatigue during extended deployments increases miss risk for genuine problems.
Variants
Infrastructure characteristics create pattern variants optimised for specific contexts.
Cloud-native variant
Container orchestration platforms provide deployment primitives that implement patterns with reduced custom automation. Kubernetes Deployments implement rolling updates natively with configurable surge and unavailability parameters. Service mesh implementations like Istio and Linkerd provide traffic splitting for canary deployment without load balancer reconfiguration. Cloud provider services like AWS CodeDeploy and Azure Deployment Slots provide blue-green semantics with managed infrastructure.
The cloud-native variant shifts complexity from custom automation to platform configuration. A Kubernetes Deployment specification with maxSurge: 25% and maxUnavailable: 25% implements rolling deployment without scripting. An Istio VirtualService with weighted routing implements canary deployment through declarative configuration.
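The instance-count bounds implied by those percentage settings can be computed directly. This sketch assumes the rounding Kubernetes documents for these fields (surge rounds up, unavailability rounds down); verify against your platform's documentation before relying on it.

```python
import math

# Compute instance-count bounds for percentage-based rolling update
# settings. Rounding direction (surge up, unavailable down) follows
# the behaviour Kubernetes documents; treat it as an assumption here.

def rolling_bounds(replicas, max_surge_pct, max_unavailable_pct):
    """Return (max_total_instances, min_ready_instances) during the update."""
    surge = math.ceil(replicas * max_surge_pct / 100)
    unavailable = math.floor(replicas * max_unavailable_pct / 100)
    return replicas + surge, replicas - unavailable
```

With 10 replicas and both settings at 25%, the update may briefly run 13 instances while guaranteeing at least 8 remain ready, which is the capacity-versus-cost envelope operators should budget for.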
On-premises variant
Traditional infrastructure without container orchestration requires explicit automation for deployment patterns. Load balancer API calls or configuration management tools adjust traffic routing. Deployment scripts coordinate instance updates with health checking. Blue-green deployment may use DNS switching between virtual IP addresses or load balancer pool swaps.
The on-premises variant requires more operational knowledge embedded in runbooks and scripts. Teams maintain deployment automation alongside application code. Rollback procedures may involve manual steps that cloud-native primitives automate.
Hybrid variant
Organisations running workloads across cloud and on-premises infrastructure face pattern selection constraints from the least capable environment. A canary deployment targeting both environments requires traffic splitting capability in both. Infrastructure differences may force different patterns for different deployment targets, complicating release coordination.
The hybrid variant often settles on rolling deployment as the common denominator available across environments. Where environments support different patterns, release documentation must specify per-environment procedures.
Field deployment variant
Deployments to field infrastructure over unreliable connectivity require offline-capable patterns. Big bang deployment with local execution packages enables deployment without continuous connectivity. The deployment package includes all artifacts and executes autonomously once delivered.
Rolling deployment over high-latency links extends deployment duration proportionally. A 500ms round-trip time adds seconds to each health check and orchestration command. Conservative timeouts prevent false-negative health checks from aborting healthy deployments.
Feature flags require flag service accessibility. Field deployments may cache flag state locally with defined refresh intervals, accepting stale flag values during connectivity gaps. Fallback behaviour when the flag service is unreachable must be defined and tested.
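A cached flag client with stale-value fallback might look like the following sketch. The fetch callable, refresh interval, and the choice of OSError as the "service unreachable" signal are illustrative assumptions.

```python
import time

# Sketch of a field-deployment flag cache: serve cached values during
# connectivity gaps, fall back to a safe default when nothing is cached.
# fetch is a hypothetical callable returning {flag_name: bool}.

class CachedFlagClient:
    def __init__(self, fetch, refresh_seconds=300, default=False):
        self._fetch = fetch
        self._refresh = refresh_seconds
        self._default = default
        self._cache = {}
        self._fetched_at = 0.0

    def is_enabled(self, flag_name):
        now = time.monotonic()
        if now - self._fetched_at >= self._refresh or not self._cache:
            try:
                self._cache = self._fetch()
                self._fetched_at = now
            except OSError:
                # Flag service unreachable: keep serving stale values.
                pass
        return self._cache.get(flag_name, self._default)
```

The two fallback layers mirror the requirements in the text: stale cached values cover connectivity gaps, and the default covers flags never fetched at all, so the unreachable-service behaviour is defined rather than accidental.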
Anti-patterns
Common deployment failures reveal anti-patterns to recognise and avoid.
Untested rollback assumes rollback will work without verification. Production rollback differs from deployment: data may have changed, caches may contain new-version artifacts, external systems may have received notifications. Testing rollback in staging environments and documenting rollback prerequisites prevents discovery during incident response.
Missing health checks allow deployment orchestration to treat unhealthy instances as ready. Rolling deployment with inadequate health checks routes traffic to starting instances before they can serve requests. Canary deployment without health checks cannot distinguish startup problems from traffic-induced problems.
Infinite rollout extends gradual deployments indefinitely without completion criteria. A canary deployment at 5% that never advances accumulates technical debt: two versions require maintenance, mixed-version bugs remain unresolved, and infrastructure supports both indefinitely. Define advancement criteria and maximum deployment duration.
Flag accumulation leaves feature flags in code after features stabilise. Each flag adds conditional complexity, testing burden, and cognitive load. Flag removal should be scheduled when flags are created, with removal treated as required work rather than optional cleanup.
Coordination bypass deploys changes without stakeholder communication, generating surprise support contacts and escalations. Even low-risk changes benefit from communication that sets expectations and enables preparation.
Single verification point relies on one check to confirm deployment success. Smoke tests can pass while monitoring reveals elevated errors. User reports can arrive after smoke tests complete. Multiple verification points across technical and experiential dimensions provide defence in depth.
Premature celebration declares deployment complete before observation period ends. Problems may manifest only under specific conditions: end-of-day batch processing, weekend traffic patterns, monthly report generation. Observation periods should cover representative operational cycles.