SLA Management
A service level agreement is a documented commitment between an IT function and its service consumers that specifies measurable targets for service delivery. SLAs transform implicit expectations into explicit, measurable commitments that both parties can track. This procedure covers the complete lifecycle: defining appropriate targets, negotiating and documenting agreements, establishing measurement, managing breaches, and conducting periodic reviews.
- **Service Level Agreement (SLA)**: A formal agreement between IT and a service consumer (internal department, partner organisation, or external party) specifying service targets and responsibilities.
- **Operational Level Agreement (OLA)**: An internal agreement between IT teams or functions that supports delivery of SLA commitments. If the SLA promises 99.5% availability, the OLA with the infrastructure team specifies their contribution to that target.
- **Underpinning Contract (UC)**: A contract with an external supplier that supports SLA delivery. Cloud provider commitments, managed service agreements, and telecommunications contracts function as UCs.
- **Service Level Target**: A specific, measurable metric within an SLA. Response time of 4 hours for high-priority incidents is a service level target; the SLA is the complete agreement containing multiple targets.
Agreement hierarchy
SLAs do not exist in isolation. Each SLA depends on supporting agreements that cascade responsibility through the service delivery chain. Understanding this hierarchy prevents commitments that cannot be met.
```
+-------------------------------------------------------------------+
|                         SERVICE CONSUMER                          |
|                        (Programme Team)                           |
+----------------------------------+--------------------------------+
                                   |
                                   | SLA
                                   | "99.5% availability"
                                   | "4-hour response"
                                   v
+-------------------------------------------------------------------+
|                       IT SERVICE MANAGEMENT                       |
|                   (Service Desk, Service Owner)                   |
+------------------+-------------------------------+----------------+
                   |                               |
                   | OLA                           | OLA
                   | "15-min triage"               | "2-hour restore"
                   v                               v
+------------------+----------------+ +------------+----------------+
|           SERVICE DESK            | |    INFRASTRUCTURE TEAM      |
|       (Incident handling)         | |   (Platform operations)     |
+------------------+----------------+ +------------+----------------+
                   |                               |
                   | UC                            | UC
                   | "24/7 support"                | "99.9% uptime"
                   v                               v
+------------------+----------------+ +------------+----------------+
|            ITSM VENDOR            | |      CLOUD PROVIDER         |
|      (Managed service desk)       | |      (Infrastructure)       |
+-----------------------------------+ +-----------------------------+
```
Figure 1: Agreement hierarchy showing SLA supported by OLAs and UCs
The mathematics of this hierarchy matter. If you commit to 99.5% service availability in your SLA but your cloud provider's UC guarantees only 99.9% infrastructure availability, that single dependency already consumes a fifth of your error budget (0.1 of the 0.5 percentage points of permitted downtime) before accounting for application issues, human error, or deployment problems. Because the downtime allowances of every component in the chain stack up, a 99.5% SLA realistically requires 99.95% or higher from each underlying component.
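To make the stacking concrete, here is a minimal sketch (hypothetical figures) of how serially dependent components compound. The component counts and values are illustrative, not taken from a specific agreement:

```python
# Sketch: composite availability of serially dependent components.
# Worst-case model: the service is up only when every component in
# the chain is up, and failures are independent.

def composite_availability(components):
    """Multiply per-component availabilities (expressed as fractions)."""
    result = 1.0
    for availability in components:
        result *= availability
    return result

# Five components each meeting 99.9% leave almost no margin under a 99.5% SLA:
print(f"{composite_availability([0.999] * 5):.4%}")

# The same five components at 99.95% restore a workable buffer:
print(f"{composite_availability([0.9995] * 5):.4%}")
```

Running the numbers this way before negotiation shows quickly whether a proposed SLA target is even theoretically reachable with the current supplier commitments.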
Prerequisites
Before beginning SLA development, confirm the following requirements are met.
Service catalogue availability. The services covered by the SLA must be defined in the service catalogue with clear scope, components, and ownership. Attempting to negotiate SLAs for undefined services creates ambiguity about what is actually being measured. If the service catalogue is incomplete, complete Service Catalogue Management first.
Monitoring capability. You must be able to measure what you commit to. For each potential service level target, verify that monitoring exists to capture the metric. An availability target requires uptime monitoring; a response time target requires incident timestamp tracking. If monitoring gaps exist, address them through Infrastructure Monitoring or Application Monitoring before finalising SLA targets.
Stakeholder identification. Identify the specific individuals who will negotiate and approve the SLA on the consumer side. For internal SLAs, this is typically a department head or programme director. For partner SLAs, identify the counterpart with authority to commit their organisation. Document contact details and confirm their availability for negotiation sessions.
Baseline data. Gather 3 to 6 months of historical performance data for the services in scope. This baseline informs realistic target setting. If you currently achieve 98.7% availability, committing to 99.9% without infrastructure changes sets up failure. Calculate current performance for availability, incident response times, request fulfilment times, and any other metrics under consideration.
Supporting agreement inventory. List all OLAs and UCs that underpin the services in scope. For each, document the committed service levels. These constrain what you can promise in the SLA.
| Prerequisite | Verification method | Required state |
|---|---|---|
| Service catalogue | Review catalogue entries | Services in scope are documented with owners |
| Monitoring | Check dashboards and alerting | Metrics for all proposed targets are collected |
| Stakeholders | Stakeholder register | Named individuals with approval authority |
| Baseline data | Historical reports | 3-6 months of performance data available |
| Supporting agreements | OLA/UC register | All underpinning agreements documented |
Procedure
Define service level targets
Identify metric categories for the service
Service level targets fall into distinct categories. For each service in scope, determine which categories apply based on service characteristics and consumer priorities.
Availability measures the proportion of time a service is operational and accessible. Express as a percentage over a defined period. A 99.5% monthly availability target permits 3.65 hours of downtime per month. Calculate the permitted downtime for proposed targets:
Permitted downtime = Total period × (1 - availability target)
```
Monthly (730 hours):
  99.0%  = 7.3 hours downtime
  99.5%  = 3.65 hours downtime
  99.9%  = 43.8 minutes downtime
  99.95% = 21.9 minutes downtime
```

Response time measures how quickly IT acknowledges an incident or request after the consumer reports it. This is not resolution time. A 4-hour response target means initial contact and triage within 4 hours, not a fix within 4 hours.
Resolution time measures elapsed time from report to restoration of service or completion of request. Resolution targets vary by priority. A critical incident might have a 4-hour resolution target; a low-priority request might have 10 working days.
Throughput measures transaction processing capacity. Relevant for systems handling high volumes: beneficiary registrations per hour, payment transactions per day, report generation capacity.
Quality measures accuracy and correctness. Data entry error rates, system-generated report accuracy, or successful transaction completion rates fall into this category.
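The permitted-downtime formula given for availability above is easy to script when evaluating candidate targets; a minimal sketch (the 730-hour month is the same illustrative convention used earlier):

```python
# Sketch: permitted downtime for a given availability target,
# per the formula: permitted downtime = total period x (1 - target).

def permitted_downtime_minutes(target, period_hours=730):
    """Downtime allowance in minutes over the period for an availability target."""
    return period_hours * 60 * (1 - target)

for target in (0.99, 0.995, 0.999, 0.9995):
    minutes = permitted_downtime_minutes(target)
    print(f"{target:.2%}: {minutes / 60:.2f} hours ({minutes:.1f} minutes)")
```

This is useful in negotiation sessions: translating "99.95%" into "21.9 minutes per month" makes the operational implications of a target immediately visible.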
Set target values based on baseline and constraints
For each metric category, propose a target value using this calculation approach:
Start with baseline performance. If current availability is 98.7%, that is your floor. Setting a target below current performance wastes the SLA’s value.
Apply the constraint ceiling. Check your supporting OLAs and UCs. If your cloud provider guarantees 99.9% and your internal infrastructure team’s OLA commits to 99.8%, your maximum achievable availability is approximately 99.7% (the product of component availabilities: 0.999 × 0.998 = 0.997).
Position the target between floor and ceiling, leaving margin for variation. If baseline is 98.7% and ceiling is 99.7%, a target of 99.3% provides reasonable stretch while remaining achievable.
Target positioning calculation:
```
Baseline (floor):   98.7%
Constraint ceiling: 99.7%
Available range:    1.0 percentage points

Conservative target (25% of range): 98.7% + 0.25% = 98.95% → round to 99.0%
Moderate target     (50% of range): 98.7% + 0.5%  = 99.2%
Aggressive target   (75% of range): 98.7% + 0.75% = 99.45% → round to 99.5%
```

For response and resolution times, apply similar logic using historical ticket data:
```
Historical P1 incident response times (last 6 months):
  Median:          47 minutes
  90th percentile: 2.1 hours
  95th percentile: 3.4 hours

Target options:
  Conservative: 4 hours (exceeds 95th percentile)
  Moderate:     2 hours (between 90th and 95th)
  Aggressive:   1 hour  (between median and 90th)
```

Define measurement methodology for each target
Each target requires a precise measurement definition. Ambiguous measurements create disputes during review.
Specify the measurement period. Monthly measurements smooth variation but delay breach detection. Weekly measurements catch problems faster but show more volatility. For most SLAs, monthly measurement with weekly interim reporting balances these concerns.
Define included and excluded time. Availability targets typically exclude scheduled maintenance windows. Specify the maximum permitted maintenance window duration and required notice period. A target of "99.5% availability, excluding up to 4 hours of scheduled maintenance per month announced with 72 hours notice" is more precise than simply "99.5% availability".
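The exclusion rule reduces to a short calculation: remove properly scheduled maintenance from both the measured period and the downtime before computing the percentage. A minimal sketch with hypothetical figures:

```python
# Sketch: monthly availability with scheduled maintenance excluded.
# Assumes all maintenance minutes were scheduled per the SLA terms
# (proper notice, within the permitted window); figures are illustrative.

def availability_excluding_maintenance(total_minutes, downtime_minutes,
                                       maintenance_minutes):
    """Availability over the period, ignoring properly scheduled maintenance."""
    measured = total_minutes - maintenance_minutes
    unplanned_downtime = downtime_minutes - maintenance_minutes
    return (measured - unplanned_downtime) / measured

# 730-hour month, 5 hours total downtime of which 4 hours was
# scheduled maintenance within the permitted window:
month_minutes = 730 * 60
result = availability_excluding_maintenance(month_minutes, 5 * 60, 4 * 60)
print(f"{result:.3%}")
```

Writing the exclusion down as arithmetic like this also makes disputes easier to settle: both parties can re-run the same calculation from the tagged maintenance records.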
Clarify the measurement source. Name the specific monitoring system or report that provides the metric. “Availability measured by Uptime Robot external monitoring from three geographic locations, calculated as successful checks divided by total checks” removes ambiguity.
For time-based targets, define the clock. Response time might measure only business hours (Monday to Friday, 09:00 to 17:00 local time) or 24/7. Resolution time might pause outside business hours for lower priorities. Document these definitions explicitly:
```
P1 (Critical) resolution target: 4 hours
- Clock: 24/7, no pause
- Measurement: Incident created timestamp to resolved timestamp
- Source: ServiceDesk Plus incident report

P3 (Low) resolution target: 10 working days
- Clock: Business hours only (Mon-Fri 09:00-17:00 GMT)
- Measurement: Incident created timestamp to resolved timestamp
- Source: ServiceDesk Plus incident report
- Note: Clock pauses on UK public holidays
```

Establish priority definitions
Service level targets typically vary by priority. Define priority levels consistently so both parties apply the same classification to incidents and requests.
A four-level priority scheme covers most requirements:
Priority 1 (Critical) applies when the service is completely unavailable or a critical business function cannot operate. All users affected, no workaround exists. Examples: email system down for entire organisation, finance system unavailable during payroll processing, beneficiary registration system offline during distribution.
Priority 2 (High) applies when the service is severely degraded or a significant business function is impaired. Multiple users affected, workaround may exist but causes significant inconvenience. Examples: email delays exceeding 30 minutes, finance system slow but functional, beneficiary registration possible but taking 5 times normal duration.
Priority 3 (Medium) applies when the service is partially degraded or a non-critical function is impaired. Limited users affected, workaround available. Examples: email attachments over 10MB failing, report generation slow, single field in registration form not saving.
Priority 4 (Low) applies when the issue causes minimal impact or is a cosmetic problem. Single user affected, workaround readily available, or issue is an enhancement request. Examples: email signature formatting problem, report column alignment, minor user interface inconsistency.
Document specific examples relevant to the services in scope. Generic definitions lead to classification disputes.
Negotiate and document the agreement
Prepare the draft SLA document
Create a draft SLA document containing all proposed targets and definitions. Use a consistent structure:
Service description section identifies the services covered, referencing service catalogue entries, and states the SLA’s effective period and review schedule.
Service hours section specifies when the service is expected to be available and when support is provided. A service might be available 24/7 but supported only during business hours.
Service level targets section lists each metric with its target value, measurement methodology, and any exclusions.
Responsibilities section clarifies obligations on both sides. Consumer responsibilities might include reporting incidents through proper channels, providing access for troubleshooting, and participating in scheduled reviews.
Reporting section specifies what reports consumers receive, their frequency, and delivery method.
Review section establishes the schedule for SLA review meetings and the process for proposing changes.
Conduct negotiation sessions with stakeholders
Schedule dedicated negotiation sessions. Attempting to finalise SLAs in passing conversations or email threads leads to misunderstandings. Book 90-minute sessions with decision-makers present.
Present baseline data first. Show current performance before discussing targets. This grounds the conversation in reality and prevents requests for targets that historical data shows are unachievable.
Negotiate category by category. Address availability first, then response times, then resolution times. Reaching agreement on one category before moving to the next prevents circular discussions.
Document concerns and constraints raised by consumers. A programme team might highlight that their donor requires 99.9% availability for beneficiary-facing systems. Capture this as a constraint, then discuss what infrastructure investment would be required to meet it.
When targets cannot be agreed, escalate to appropriate governance. Do not commit to targets you cannot meet to end an uncomfortable negotiation. Record disagreement and escalate to IT leadership and the consumer’s leadership for resolution.
Align OLAs and UCs with SLA commitments
Before finalising the SLA, verify that supporting agreements can deliver the required performance.
For each SLA target, trace the dependency chain:
```
SLA Target: 99.5% availability for Grants Management System

Dependencies:
├── Application hosting (Azure App Service)
│   └── UC: Microsoft Azure SLA 99.95%
├── Database (Azure SQL)
│   └── UC: Microsoft Azure SLA 99.99%
├── Authentication (Entra ID)
│   └── UC: Microsoft Azure SLA 99.99%
├── Network connectivity (ISP)
│   └── UC: ISP SLA 99.9%
└── Internal support (Infrastructure team)
    └── OLA: 99.8% platform availability

Combined theoretical maximum: 99.63%
SLA target 99.5%: Achievable with margin
```

If the dependency analysis reveals that proposed SLA targets exceed what supporting agreements can deliver, you have three options: renegotiate supporting agreements to obtain higher commitments, reduce SLA targets to achievable levels, or implement redundancy to exceed single-component reliability.
Obtain formal approval and signatures
Finalise the SLA document incorporating negotiation outcomes. Route for approval according to your organisation’s authority matrix. SLAs committing IT resources or accepting liability typically require IT leadership approval. SLAs with external parties may require legal review.
Obtain signatures from authorised representatives on both sides. For internal SLAs, this might be the IT director and department head. For external SLAs, follow your organisation’s contract signing authority.
Distribute the signed SLA to all stakeholders. Store the authoritative copy in your document management system with appropriate access controls.
Establish monitoring and measurement
Configure monitoring for each service level target
For each target in the signed SLA, verify monitoring configuration captures the required data.
Availability monitoring requires checks at intervals shorter than your target's sensitivity. For 99.9% availability (43.8 minutes of permitted downtime monthly), checks at 5-minute intervals are the minimum. Configure checks every 1 minute for critical services to capture brief outages.
```
# Example availability monitoring configuration (Uptime Robot)
Monitor name: Grants Management System - Production
URL: https://grants.example.org/health
Check interval: 60 seconds
Alert contacts: it-ops@example.org, grants-owner@example.org
Locations: London, Frankfurt, New York

Availability calculation:
- Period: Calendar month
- Excluded: Scheduled maintenance (tagged in monitoring system)
- Formula: (Successful checks / Total checks) × 100
```

Response and resolution time monitoring requires timestamp capture in your service management tool. Verify the tool records: incident creation time (when logged), response time (first communication to user), and resolution time (when marked resolved). Calculate elapsed time using the clock rules defined in the SLA.
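For targets whose clock runs business hours only (like the P3 definition earlier in this procedure), elapsed time must skip evenings and weekends. A minimal minute-stepping sketch, assuming both timestamps are stored in a single time zone and omitting public holidays for brevity:

```python
# Sketch: elapsed business minutes between two timestamps for a clock
# that runs Mon-Fri 09:00-17:00. Minute stepping is simple and adequate
# for per-incident calculations; public holidays are omitted here.
from datetime import datetime, timedelta, time

def business_minutes(start, end, open_t=time(9), close_t=time(17)):
    minutes = 0
    current = start
    while current < end:
        is_weekday = current.weekday() < 5          # Mon=0 .. Fri=4
        in_hours = open_t <= current.time() < close_t
        if is_weekday and in_hours:
            minutes += 1
        current += timedelta(minutes=1)
    return minutes

# Logged Friday 16:30, responded Monday 09:45:
# 30 minutes on Friday + 45 minutes on Monday = 75 business minutes.
created = datetime(2024, 3, 1, 16, 30)   # a Friday
responded = datetime(2024, 3, 4, 9, 45)  # the following Monday
print(business_minutes(created, responded))
```

A wall-clock calculation would report over 41 hours for the same pair of timestamps, which is why the clock definition must be fixed in the SLA rather than left to whichever report runs the numbers.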
```sql
-- Example query for P1 response time compliance
SELECT
    incident_id,
    created_at,
    first_response_at,
    TIMESTAMPDIFF(MINUTE, created_at, first_response_at) AS response_minutes,
    CASE
        WHEN TIMESTAMPDIFF(MINUTE, created_at, first_response_at) <= 60 THEN 'Met'
        ELSE 'Breached'
    END AS sla_status
FROM incidents
WHERE priority = 1
  AND created_at >= '2024-01-01'
  AND created_at < '2024-02-01';
```

Create SLA performance dashboards
Build dashboards that display current SLA performance for each target. Stakeholders should be able to view performance without requesting reports.
Dashboard elements for each target:
Current period performance against target (gauge or percentage display). Month-to-date availability of 99.7% against 99.5% target shows healthy margin.
Trend over previous periods (line chart). Six months of availability trending downward from 99.8% to 99.5% signals emerging risk even if current performance meets target.
Incidents or events affecting the metric (table). List downtime events, breached response times, or missed resolution targets with root cause categories.
Grant dashboard access to SLA stakeholders. For internal SLAs, share dashboard links with department contacts. For external SLAs, determine appropriate access method (shared dashboard, PDF export, or portal access).
Establish breach detection and notification
Configure alerts for actual and predicted SLA breaches.
Actual breach alerts trigger when a target is missed. A P1 incident exceeding the 1-hour response target triggers immediate notification to the service desk manager and service owner.
Threshold alerts trigger when performance approaches breach levels. If monthly availability drops below 99.6% against a 99.5% target, alert the service owner that breach is imminent without additional downtime margin.
Predictive alerts trigger when trending suggests future breach. If weekly incident volume increases 40% over trend, alert that resolution time targets may be at risk.
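The actual-breach and threshold conditions described above reduce to a simple classification against the target plus a warning margin; a minimal sketch with hypothetical thresholds:

```python
# Sketch: classify month-to-date availability against a target, with a
# warning band 0.1 percentage points above the target, mirroring the
# "at risk" threshold alert described above. Values are illustrative.

def availability_status(mtd_availability, target, warning_margin=0.001):
    """Return 'breached', 'at risk', or 'healthy' for the current period."""
    if mtd_availability < target:
        return "breached"
    if mtd_availability < target + warning_margin:
        return "at risk"
    return "healthy"

print(availability_status(0.9972, 0.995))  # healthy
print(availability_status(0.9955, 0.995))  # at risk
print(availability_status(0.9940, 0.995))  # breached
```

The same three-state logic can drive dashboard colouring and alert routing so that the "at risk" notification always fires before the breach notification.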
Alert configuration example:
```
Alert: P1 Response Time Breach
Condition: P1 incident age > 55 minutes without response
Action:
- Email: servicedesk-manager@example.org
- SMS: On-call service desk lead
- Slack: #incident-response channel

Alert: Availability Warning
Condition: MTD availability < (target + 0.1%)
Action:
- Email: service-owner@example.org
- Dashboard: Flag service as "at risk"
```

Manage breaches and exceptions
Respond to SLA breaches
When a breach occurs, initiate the breach response workflow:
```
+------------------+     +------------------+     +------------------+
|     Breach       |     |    Immediate     |     |    Root cause    |
|    detected      +---->+   notification   +---->+     analysis     |
|                  |     |                  |     |                  |
+------------------+     +------------------+     +--------+---------+
                                                           |
+------------------+     +------------------+     +--------v---------+
|   Stakeholder    |     |    Corrective    |     |     Document     |
|  communication   +<----+   action plan    +<----+      in CSI      |
|                  |     |                  |     |     register     |
+------------------+     +------------------+     +------------------+
```
Figure 2: SLA breach response workflow
Notify the service owner and consumer contact within 4 hours of breach confirmation. The notification should state what target was breached, what caused the breach, what immediate actions are being taken, and when a full analysis will be available.
Conduct root cause analysis appropriate to breach severity. A single P3 resolution time breach might warrant a brief review. Repeated P1 availability breaches require formal problem management investigation per Problem Management.
Document the breach in the continual service improvement register with root cause and corrective action. Track corrective actions to completion.
Process exception requests
Circumstances sometimes warrant temporary exception from SLA targets. Planned major upgrades, organisational restructuring, or external events may justify adjusted expectations.
Exception requests must specify: which targets are affected, the exception period (start and end dates), the reason requiring exception, and what temporary targets (if any) apply during the exception.
Route exception requests for approval. Minor exceptions (single target, under 30 days) might be approved by service owner and consumer contact. Major exceptions (multiple targets, over 30 days, or complete suspension) require IT leadership and consumer leadership approval.
Document approved exceptions in the SLA record. Exceptions modify the agreement temporarily; both parties must acknowledge the modification in writing.
Exception record example:
```
SLA: Programme Systems SLA
Exception ID: EXC-2024-003

Affected target: Availability (99.5% → 98.0%)
Period: 2024-03-15 to 2024-03-17

Reason: Planned migration to new cloud region
Approved by: IT Director, Programme Director
Date approved: 2024-02-28

Temporary provisions:
- 24-hour on-call support during migration
- Hourly status updates to programme team
- Rollback if availability drops below 95%
```

Conduct periodic reviews
Prepare for scheduled SLA review meetings
SLA review meetings occur at the frequency specified in the agreement, typically quarterly for critical services and annually for standard services.
Compile performance data for the review period. For each target, calculate: achievement percentage (what proportion of measurements met target), trend compared to previous periods, and breach count with causes.
```
SLA Performance Summary: Q3 2024

Availability
  Target:   99.5%
  Achieved: 99.72%
  Status:   Met
  Breaches: 0
  Trend:    Stable (Q2: 99.68%)

P1 Response Time
  Target:   1 hour
  Achieved: 94% within target
  Status:   Met (threshold: 90%)
  Breaches: 2 of 33 incidents
  Trend:    Improved (Q2: 89%)

P1 Resolution Time
  Target:   4 hours
  Achieved: 88% within target
  Status:   Breached (threshold: 90%)
  Breaches: 4 of 33 incidents
  Trend:    Declined (Q2: 93%)
```

Identify topics requiring discussion: targets consistently breached, targets achieved with no margin, changes in service scope or consumer requirements, upcoming changes affecting service delivery.
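The achievement percentages in a performance summary like the one above can be derived directly from ticket counts; a minimal sketch with hypothetical figures matching the P1 resolution line:

```python
# Sketch: achievement rate for a time-based target, compared against
# the agreed compliance threshold. Counts are illustrative.

def achievement(met_count, total_count):
    """Proportion of measurements that met the target (1.0 if no data)."""
    return met_count / total_count if total_count else 1.0

# 33 P1 incidents in the quarter, 4 resolution-time breaches -> 29 met.
rate = achievement(33 - 4, 33)
threshold = 0.90
print(f"{rate:.0%} within target")
print("Met" if rate >= threshold else "Breached")
```

Calculating the rate mechanically from the incident register, rather than summarising by hand, keeps the quarterly figures reproducible when stakeholders question them.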
Distribute the performance summary to attendees 5 working days before the review meeting.
Facilitate the review meeting
Structure the review meeting in three parts:
Performance review (30 minutes): Present performance against each target. Discuss breaches, their causes, and corrective actions taken. Acknowledge areas of strong performance.
Issue discussion (20 minutes): Address any concerns raised by either party. Consumer concerns about service quality. IT concerns about unrealistic targets or resource constraints. Changes in business requirements affecting service needs.
Forward planning (10 minutes): Confirm targets for the next period. Identify any changes requiring SLA amendment. Schedule next review.
Document meeting outcomes including decisions made, actions assigned, and any agreed target modifications.
Process SLA amendments
When reviews identify needed changes, process amendments formally rather than through informal agreement.
Draft the amendment specifying: which sections change, the current wording, the new wording, and the effective date.
Route the amendment through the same approval process as the original SLA. Both parties must approve changes that affect commitments.
Update the master SLA document and redistribute to stakeholders. Maintain version history showing when changes occurred and why.
Verification
After establishing an SLA, verify the complete implementation:
Agreement documentation. Retrieve the signed SLA from document storage. Confirm signatures from authorised representatives on both sides. Verify the document contains all required sections: service description, service hours, service level targets with measurement methodology, responsibilities, reporting, and review schedule.
Monitoring configuration. For each service level target, access the monitoring system and confirm data collection is active. Run a test query or report covering the current period. Verify the output matches the measurement methodology specified in the SLA.
```shell
# Verify availability monitoring is active
curl -s "https://api.uptimerobot.com/v2/getMonitors" \
  -d "api_key=YOUR_API_KEY" \
  -d "monitors=MONITOR_ID" | jq '.monitors[0].status'
# Expected output: 2 (monitoring active)

# Verify incident data collection
mysql -e "SELECT COUNT(*) FROM incidents WHERE created_at >= DATE_SUB(NOW(), INTERVAL 7 DAY);"
# Expected output: Non-zero count confirming recent data
```

Dashboard accessibility. Log in as a stakeholder user (or request a stakeholder to verify). Confirm the SLA dashboard loads and displays current period data. Verify the stakeholder can access without IT assistance.
Alert configuration. Trigger a test alert by temporarily lowering a threshold or simulating a condition. Confirm the alert reaches intended recipients through configured channels. Reset the threshold after testing.
Supporting agreement alignment. For each SLA target, review the dependency trace created during negotiation. Confirm all referenced OLAs and UCs remain current and their commitments still support SLA targets. Flag any expired or modified supporting agreements for attention.
Stakeholder awareness. Contact the consumer’s primary contact and confirm they received the signed SLA, understand how to access performance dashboards, know the escalation path for concerns, and have the review meeting in their calendar.
Troubleshooting
Monitoring shows different availability than user experience. External monitoring may report the service as available while users experience failures. This typically indicates the monitoring check is too shallow. If monitoring only checks that a login page loads, it misses failures in authentication, database queries, or specific functions. Expand monitoring to include synthetic transactions that exercise critical paths: log in, retrieve data, submit a form. Add monitoring endpoints within the application that verify database connectivity, external integrations, and background processes.
Incident timestamps appear incorrect for SLA calculations. Time zone mismatches between systems cause calculation errors. An incident logged at 09:00 local time but recorded as 09:00 UTC shifts the timeline. Standardise all incident timestamps to UTC in the service management system. Apply local time zone conversion only in reports and dashboards. Verify the service management system clock synchronises with NTP.
SLA calculations exclude incidents they should include. Filtering logic in reports may inadvertently exclude valid incidents. Common causes: priority changes during incident lifecycle (a P2 upgraded to P1 might not appear in P1 reports if the filter uses original priority), incidents closed and reopened (the second occurrence may be treated as a new incident), and category misalignment (incidents may be logged against a related but different service). Review report queries and confirm they capture the intended population. Use incident IDs from known breaches to verify they appear in reports.
Stakeholders dispute breach determination. Disagreement about whether a breach occurred indicates ambiguous measurement definitions. Review the SLA wording for the disputed target. If wording permits multiple interpretations, clarify through an amendment. For the immediate dispute, apply the interpretation most favourable to the consumer (IT bears the burden of clear documentation). Document the clarified interpretation for future reference.
OLA or UC changes after SLA signing. A vendor may modify their SLA, or an internal team may renegotiate their OLA, affecting your ability to meet SLA commitments. Immediately assess impact on each affected SLA target. If the change undermines your SLA commitment, notify consumer stakeholders within 5 working days. Propose either SLA amendment (reduced targets) or mitigation (additional redundancy, alternative supplier). Do not wait until a breach occurs to disclose the constraint change.
Availability target breached due to scheduled maintenance. If maintenance was scheduled according to SLA terms (proper notice, within permitted window), it should be excluded from availability calculation. Verify the maintenance window was documented before it occurred and tagged correctly in the monitoring system. If the maintenance exceeded the permitted duration or notice was insufficient, the excess time counts against availability. Review maintenance scheduling process to prevent recurrence.
Review meetings repeatedly cancelled or poorly attended. Stakeholder disengagement indicates the SLA may not be providing value. Assess whether targets are too easily met (no tension to discuss), reports are unclear or inaccessible (stakeholders cannot engage meaningfully), or business priorities have shifted (the service is less critical). Address root cause: tighten targets if too loose, improve reporting clarity, or propose SLA retirement if the service no longer warrants formal agreement.
Consumer requests targets that exceed infrastructure capability. When consumers request targets that OLAs and UCs cannot support, present the dependency analysis showing the constraint. Calculate the investment required to achieve the requested target: redundant systems, premium support tiers, additional providers. Provide this as a business case for consumer leadership to decide whether the investment is justified. Do not commit to unachievable targets to satisfy the immediate request.
Multiple SLAs contain conflicting priority definitions. Different SLAs may use the same priority labels with different meanings, causing confusion when incidents affect multiple services. Standardise priority definitions across all SLAs using a single priority matrix. Reference the standard matrix from each SLA rather than embedding definitions. When updating existing SLAs, align definitions through amendment during regular reviews.
Service performance fluctuates near target threshold. Performance hovering at 99.4% to 99.6% against a 99.5% target creates ongoing breach risk. Investigate the variation source. If variation is random, the target may be set too aggressively for current infrastructure. If variation correlates with specific events (time of day, workload peaks, specific operations), address those triggers. Consider implementing a buffer by targeting internal operations at 99.7% to provide margin for the 99.5% SLA commitment.