Selecting the right KPIs for IT operations is not a matter of comprehensiveness - monitoring platforms will surface hundreds of metrics if you let them. The discipline is selection: which twelve metrics will tell you definitively whether your infrastructure is reliable, your service delivery is effective, your security posture is improving, and your applications are performing? This article defines those twelve metrics with precise formulas, recommended targets, common measurement pitfalls, and the strategic questions each metric is designed to answer.
These KPIs are organized across four categories that correspond to the primary concerns of IT leadership: infrastructure reliability, service delivery, security and compliance, and application performance.
Infrastructure Reliability
Infrastructure reliability KPIs answer the foundational question: can the business operate? Downtime is not an IT problem; it is a business problem with a measurable financial cost. Reliability metrics must be calculated precisely and communicated in business terms.
System Uptime and Availability
Availability is the percentage of scheduled operating time during which a system or service is accessible and functioning within acceptable parameters.
Formula:
Availability (%) = (Total Scheduled Time - Downtime) / Total Scheduled Time x 100
For a system with 24/7 scheduling, one month contains approximately 730 hours. If that system experienced 2.5 hours of unplanned downtime, availability is (730 - 2.5) / 730 x 100 = 99.66%.
Nines notation is the conventional shorthand. 99.9% (“three nines”) allows 8.76 hours of downtime per year. 99.99% (“four nines”) allows 52.6 minutes. 99.999% (“five nines”) allows 5.26 minutes. Each additional nine represents roughly a 10x improvement and typically requires a commensurate investment in redundancy and automation.
Targets: Consumer-facing web services should target 99.9% or better. Core financial and transaction processing systems typically require 99.99%. Life-safety or regulated systems may require 99.999%.
Common pitfalls: Planned maintenance windows are often excluded from availability calculations, which can obscure the true operational cost of maintenance overhead. Consider tracking “total availability” inclusive of planned windows as a secondary metric. Also distinguish between availability (the system is reachable) and correctness (the system is producing valid results) - a degraded system that accepts requests but returns errors may register as “available” while failing users.
Business translation: Multiply revenue per hour by availability shortfall against target to produce an annualized revenue exposure figure. A $200M/year revenue business with $200,000/hour peak exposure running at 99.7% against a 99.9% target has a 0.2 percentage point shortfall, approximately 17.5 hours per year, with $3.5M exposure at peak rates.
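The availability formula and the revenue-exposure translation above can be sketched together. This is a minimal illustration using the article's own figures ($200,000/hour peak exposure, 99.9% target); the function names are illustrative, not from any standard library.

```python
# Availability and annualized revenue exposure, per the formulas above.
HOURS_PER_YEAR = 8760  # 24/7 scheduling

def availability_pct(scheduled_hours: float, downtime_hours: float) -> float:
    """Availability (%) = (scheduled - downtime) / scheduled * 100."""
    return (scheduled_hours - downtime_hours) / scheduled_hours * 100

def annual_revenue_exposure(actual_pct: float, target_pct: float,
                            revenue_per_hour: float) -> float:
    """Annualized revenue exposure from an availability shortfall vs. target."""
    shortfall_hours = (target_pct - actual_pct) / 100 * HOURS_PER_YEAR
    return max(shortfall_hours, 0.0) * revenue_per_hour

# Worked example from the text: 2.5 h of downtime in a 730-hour month.
print(round(availability_pct(730, 2.5), 2))  # 99.66

# 99.7% actual vs. a 99.9% target at $200,000/hour peak exposure
# (~17.5 shortfall hours per year, ~$3.5M exposure).
print(round(annual_revenue_exposure(99.7, 99.9, 200_000)))
```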
Mean Time to Recovery (MTTR)
MTTR measures the average time from incident detection to full service restoration. It is the primary metric for evaluating incident response effectiveness.
Formula:
MTTR = Total Downtime Across All Incidents / Number of Incidents
If twelve incidents in a quarter accumulated 36 hours of total downtime, MTTR is 3 hours.
Component decomposition is analytically more useful than the aggregate figure. MTTR has three components: time to detect (TTD), time to diagnose (TDiag), and time to restore (TTR). Monitoring tool alerting quality drives TTD. Runbook completeness and on-call skill drive TDiag. Change control processes and deployment automation drive TTR. Knowing which component dominates your MTTR tells you where to invest.
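The decomposition above can be sketched as follows. The incident records and their field names are hypothetical; a real implementation would pull these timestamps from the incident management system.

```python
# MTTR with the detect/diagnose/restore decomposition described above.
def mttr(incidents):
    """MTTR = total downtime across all incidents / number of incidents."""
    total = sum(i["detect"] + i["diagnose"] + i["restore"] for i in incidents)
    return total / len(incidents)

def dominant_component(incidents):
    """Which MTTR component (detect/diagnose/restore) consumes the most time."""
    totals = {k: sum(i[k] for i in incidents)
              for k in ("detect", "diagnose", "restore")}
    return max(totals, key=totals.get)

# Hypothetical quarter: hours spent per component, per incident.
incidents = [
    {"detect": 0.5, "diagnose": 1.5, "restore": 1.0},
    {"detect": 0.2, "diagnose": 2.8, "restore": 0.5},
    {"detect": 0.3, "diagnose": 2.0, "restore": 0.2},
]
print(round(mttr(incidents), 2))      # 3.0 (hours)
print(dominant_component(incidents))  # diagnose -> invest in runbooks
```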
Targets: Industry benchmarks suggest MTTR under 4 hours for Tier 1 systems, under 1 hour for organizations with mature SRE practices. Elite performers achieve MTTR under 15 minutes through automated remediation.
Common pitfalls: MTTR can be gamed by declaring incidents resolved before verification. Require a post-resolution monitoring period (typically 30 minutes for automated confirmation) before closing. Track “re-open rate” alongside MTTR.
Mean Time Between Failures (MTBF)
MTBF measures the average time between consecutive failures of a system or component. It is the primary metric for assessing infrastructure reliability and planning maintenance cycles.
Formula:
MTBF = Total Uptime / Number of Failures
A server that ran for 8,640 hours over twelve months and experienced three failures has an MTBF of 2,880 hours, or 120 days.
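As a minimal sketch of the formula, using the worked example above:

```python
# MTBF = total uptime / number of failures.
def mtbf_hours(total_uptime_hours: float, failure_count: int) -> float:
    return total_uptime_hours / failure_count

# The server from the example: 8,640 hours of uptime, three failures.
mtbf = mtbf_hours(8640, 3)
print(mtbf, mtbf / 24)  # 2880.0 hours, 120.0 days
```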
MTBF vs. MTTF: MTBF applies to repairable systems. Mean Time to Failure (MTTF) applies to non-repairable components (individual hard drives, for instance). Most IT infrastructure reporting uses MTBF.
Targets: MTBF targets vary enormously by component type. Storage arrays should target MTBF measured in years. Individual services should target MTBF of weeks to months. Network devices in production should rarely fail more than once per year per device.
Strategic use: Trending MTBF over time reveals whether infrastructure reliability is improving or degrading. A declining MTBF trend on a class of hardware is a capital refresh signal. A declining MTBF trend on a software service is a code quality or dependency signal.
Server Capacity Utilization
Capacity utilization measures the percentage of available compute, memory, storage, or network capacity currently consumed. It is the primary input for capacity planning and cloud cost optimization.
Formula (CPU example):
CPU Utilization (%) = CPU Time Used / Total Available CPU Time x 100
Multi-dimensional nature: Capacity utilization must be tracked across four dimensions simultaneously: CPU, memory, storage I/O, and network bandwidth. A server with 30% CPU utilization but 95% memory utilization is capacity-constrained. Single-dimension reporting misleads.
Targets: The optimal utilization range for on-premises infrastructure is 60-80%. Below 60% suggests over-provisioning and wasted capital. Above 80% leaves insufficient headroom for traffic spikes and increases the likelihood of a utilization-driven incident. Cloud infrastructure can target higher utilization because capacity can be added programmatically.
Common pitfalls: Average utilization masks peak utilization, which is what actually causes incidents. Track P95 and P99 utilization in addition to averages. A server averaging 50% CPU but peaking at 95% for 5% of the time is a reliability risk.
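The average-versus-percentile gap can be demonstrated with a short sketch. The CPU samples here are synthetic; a real pipeline would pull per-minute samples from the monitoring platform, and the nearest-rank percentile used here is one of several accepted definitions.

```python
# Why P95/P99 utilization matters more than the average.
def percentile(samples, p):
    """Nearest-rank percentile (sufficient for capacity reporting)."""
    ordered = sorted(samples)
    rank = max(1, round(p / 100 * len(ordered)))
    return ordered[rank - 1]

# 100 synthetic CPU samples: mostly ~50%, with a brief 95% burst.
cpu = [50.0] * 94 + [95.0] * 6
avg = sum(cpu) / len(cpu)

# The average looks healthy; the tail percentiles expose the spikes.
print(round(avg, 1), percentile(cpu, 95), percentile(cpu, 99))  # 52.7 95.0 95.0
```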
Service Delivery
Service delivery KPIs measure IT’s performance as a service organization. They are the primary metrics for conversations between IT leadership and business stakeholders, and they feed SLA commitments and service review meetings.
Incident Resolution Time
Incident resolution time is the elapsed time from the moment an incident is logged to the moment it is marked resolved and confirmed stable. It differs from MTTR in that it captures the full ticket lifecycle including communication, escalation, and verification steps.
Formula:
Incident Resolution Time = Ticket Closed Timestamp - Ticket Opened Timestamp
Track this metric by priority tier (P1, P2, P3, P4) and report separately. P1 resolution time of 2 hours and P4 resolution time of 5 days are both valid depending on your SLA structure.
Targets: Common ITSM SLA structures target P1 at 4 hours, P2 at 8 hours, P3 at 3 business days, and P4 at 5 business days. High-performing service desks resolve P1 incidents within 2 hours.
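Per-tier reporting can be sketched as below. The ticket shape and timestamp format are hypothetical; a real report would read from the ITSM ticketing export.

```python
# Resolution time = closed timestamp - opened timestamp, reported per tier.
from datetime import datetime
from collections import defaultdict

FMT = "%Y-%m-%d %H:%M"

def resolution_hours(opened: str, closed: str) -> float:
    delta = datetime.strptime(closed, FMT) - datetime.strptime(opened, FMT)
    return delta.total_seconds() / 3600

# Hypothetical tickets.
tickets = [
    {"priority": "P1", "opened": "2024-03-01 09:00", "closed": "2024-03-01 11:00"},
    {"priority": "P3", "opened": "2024-03-01 09:00", "closed": "2024-03-04 09:00"},
]

by_tier = defaultdict(list)
for t in tickets:
    by_tier[t["priority"]].append(resolution_hours(t["opened"], t["closed"]))

for tier in sorted(by_tier):
    hours = by_tier[tier]
    print(tier, sum(hours) / len(hours))  # P1 2.0, P3 72.0
```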
SLA Compliance Rate
SLA compliance rate measures the percentage of incidents, requests, or service interactions resolved within the contracted or committed service level agreement timeframe.
Formula:
SLA Compliance Rate (%) = Tickets Resolved Within SLA / Total Tickets x 100
Segment by ticket type: Report SLA compliance separately for incidents, service requests, and change requests. Blended compliance rates hide the reality that P1 incidents may be handled well while routine service requests consistently breach SLA.
Targets: Most enterprise SLA frameworks target 95% or higher overall compliance. P1-specific compliance should target 99%.
Near-miss tracking: Complement compliance rate with a “time remaining at resolution” distribution. A ticket resolved with 2 minutes to SLA expiry represents a fundamentally different operational reality than one resolved with 2 hours remaining, even though both count equally in the compliance rate.
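Both the compliance rate and the near-miss view can be sketched together. The ticket data and the 25% "margin remaining" threshold are illustrative assumptions; times are in hours.

```python
# SLA compliance rate plus near-miss detection, as described above.
def sla_compliance(tickets):
    met = sum(1 for t in tickets if t["resolved_in"] <= t["sla"])
    return met / len(tickets) * 100

def near_misses(tickets, margin=0.25):
    """Tickets resolved within SLA but with less than `margin` of the
    SLA window remaining: compliant, yet operationally fragile."""
    return [t for t in tickets
            if t["resolved_in"] <= t["sla"]
            and (t["sla"] - t["resolved_in"]) / t["sla"] < margin]

# Hypothetical tickets (SLA and resolution time in hours).
tickets = [
    {"id": 1, "sla": 4.0, "resolved_in": 3.9},   # met, barely
    {"id": 2, "sla": 4.0, "resolved_in": 1.0},   # met comfortably
    {"id": 3, "sla": 8.0, "resolved_in": 9.0},   # breach
]
print(round(sla_compliance(tickets), 1))        # 66.7
print([t["id"] for t in near_misses(tickets)])  # [1]
```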
Ticket Volume
Ticket volume tracks the total number of incidents and service requests received over a period, segmented by category, source, and priority. Volume trends are a leading indicator of infrastructure health, change quality, and user experience.
Formula:
Ticket Volume per Period = COUNT(tickets) WHERE created_date BETWEEN period_start AND period_end
Volume is a symptom metric. Rising ticket volume typically indicates one of three things: degrading infrastructure health, a recent change that introduced regressions, or growing user population. The diagnostic value is in decomposition: which categories are growing, which are stable, and which are declining? A growing “password reset” category signals an opportunity for self-service automation. A growing “application error” category signals a code quality problem.
Deflection rate is the complementary metric: the percentage of potential tickets resolved through self-service portals, chatbots, or knowledge base articles before a ticket is created. Mature service desks target 30-50% deflection.
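The category decomposition described above amounts to a period-over-period comparison per category. The counts here are hypothetical:

```python
# Per-category ticket-volume trend: this period vs. last period.
from collections import Counter

last_period = Counter({"password reset": 120, "application error": 40, "hardware": 30})
this_period = Counter({"password reset": 150, "application error": 65, "hardware": 28})

# Period-over-period growth per category, in percent.
growth = {c: (this_period[c] - last_period[c]) / last_period[c] * 100
          for c in this_period}

# Fastest-growing categories first: these are the diagnostic signal.
for category in sorted(growth, key=growth.get, reverse=True):
    print(f"{category}: {this_period[category]} tickets ({growth[category]:+.1f}%)")
```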
Change Failure Rate
Change failure rate measures the percentage of production changes that result in an incident, rollback, or unplanned outage. It is the primary metric for change management quality and a direct measure of deployment risk.
Formula:
Change Failure Rate (%) = Failed Changes / Total Changes x 100
A team that deploys 80 changes per month and experiences 6 failures has a 7.5% change failure rate.
DORA benchmark context: In DORA’s original Accelerate research, change failure rate was the weakest differentiator across performance tiers - elite performers reported 0-15% CFR while high, medium, and low performers all clustered in the same 0-15% range, showing significant overlap. The 2023 State of DevOps Report collected exact percentages rather than ranges and found more separation: elite teams at roughly 5%, high at 10%, medium at 15%, and low at 64%. Elite performers achieve low CFR through automated testing, progressive delivery, and feature flags.
Common pitfalls: Change failure rate requires a clear definition of “failure.” Establish an explicit policy: does a hotfix within 24 hours of a deployment count as a failure? Does a rollback count? Inconsistent definitions make trend analysis meaningless.
Security and Compliance
Security KPIs provide the quantitative foundation for security posture reporting. They translate qualitative assessments of “how secure are we?” into trackable, trend-able metrics.
Patch Compliance Rate
Patch compliance rate measures the percentage of systems that have received and applied required security patches within the mandated timeframe.
Formula:
Patch Compliance Rate (%) = Systems Patched Within Policy Window / Total Systems in Scope x 100
Most security frameworks mandate different patching windows by vulnerability severity. Critical vulnerabilities (CVSS 9.0+) typically require patching within 24-72 hours. High severity (CVSS 7.0-8.9) within 7-14 days. Medium severity within 30 days. Report compliance rate separately for each severity tier.
Targets: Regulated industries typically require 95%+ compliance within the policy window for critical patches. PCI-DSS mandates patching within one month of release for applicable systems.
Common pitfalls: Patch compliance rate counts systems that were patched, not systems that are currently protected. A system that was patched last month but has missed this month’s critical patch should not count as compliant. Use a rolling compliance window.
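A rolling-window compliance check can be sketched as below. The fleet data, field names, and policy-day values are illustrative; the point is that a system is compliant only if every patch in the window was applied within its severity's policy timeframe.

```python
# Rolling patch compliance: one missed patch makes a system non-compliant.
POLICY_DAYS = {"critical": 3, "high": 14, "medium": 30}

def is_compliant(system_patches):
    """system_patches: patches released within the rolling window, each with
    a severity and days_to_apply (None = still unpatched)."""
    for p in system_patches:
        limit = POLICY_DAYS[p["severity"]]
        if p["days_to_apply"] is None or p["days_to_apply"] > limit:
            return False
    return True

def compliance_rate(fleet):
    compliant = sum(1 for patches in fleet.values() if is_compliant(patches))
    return compliant / len(fleet) * 100

# Hypothetical fleet: web-02 was patched last month but missed this month's
# critical patch, so it must not count as compliant.
fleet = {
    "web-01": [{"severity": "critical", "days_to_apply": 2}],
    "web-02": [{"severity": "critical", "days_to_apply": 2},
               {"severity": "critical", "days_to_apply": None}],
    "db-01":  [{"severity": "high", "days_to_apply": 10}],
}
print(round(compliance_rate(fleet), 1))  # 66.7
```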
Security Threat Count
Security threat count tracks the number of validated security threats detected within a period, categorized by threat type, severity, and source. It is a primary input for security operations capacity planning and risk reporting.
Formula:
Threats per Period = COUNT(validated_security_events) WHERE severity >= threshold AND period = reporting_period
Distinguish events from threats. Raw SIEM events number in the millions per day. Validated threats after deduplication, correlation, and analyst triage are far fewer. Reporting raw event volume creates noise that obscures the signal. Report on validated threats, defined as events that have been reviewed and confirmed as non-false-positive.
Trending matters more than absolute count. A stable threat count of 50 per month is a healthy signal. A count that doubled over three months without a corresponding change in detection coverage warrants investigation.
Vulnerability Count by Severity
Open vulnerability count tracks the number of known, unremediated vulnerabilities in the environment, segmented by CVSS severity tier. It is a direct measure of attack surface and a leading indicator of breach risk.
Formula:
Vulnerability Exposure Score = SUM(vulnerabilities x severity_weight)
Where severity weights are: Critical = 10, High = 5, Medium = 2, Low = 1. This weighted score allows trend comparison across periods even when the distribution across severity tiers shifts.
Mean Time to Remediate (MTTR for vulnerabilities): Track the average time between vulnerability discovery and confirmed remediation, by severity tier. Critical vulnerability MTTR should be measured in hours to days. High severity in days to weeks.
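The weighted exposure score defined above can be sketched directly. The open-vulnerability counts are hypothetical:

```python
# Vulnerability Exposure Score = SUM(vulnerabilities x severity_weight),
# with the weights given in the text.
SEVERITY_WEIGHTS = {"critical": 10, "high": 5, "medium": 2, "low": 1}

def exposure_score(open_vulns):
    """open_vulns: mapping of severity tier -> open vulnerability count."""
    return sum(count * SEVERITY_WEIGHTS[sev] for sev, count in open_vulns.items())

# 2*10 + 10*5 + 40*2 + 100*1 = 250
print(exposure_score({"critical": 2, "high": 10, "medium": 40, "low": 100}))
```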
Application Performance
Application performance KPIs connect infrastructure health to user experience. They are the metrics most directly visible to end users and therefore most closely watched by product and engineering leadership.
Application Response Time (Apdex)
Application Performance Index (Apdex) is a standardized measure of user satisfaction with response time. It converts continuous latency measurements into a 0-1 satisfaction score.
Formula:
Apdex = (Satisfied + (Tolerating / 2)) / Total Samples
Where:
- Satisfied = requests completed in <= T seconds (T is the threshold, typically 0.5s)
- Tolerating = requests completed in T to 4T seconds
- Frustrated = requests taking > 4T seconds
An Apdex of 0.90 or above is generally considered good. Below 0.70 represents a user experience problem requiring immediate attention.
Latency percentiles complement Apdex. Report P50 (median), P95, and P99 response times. P95 and P99 capture the tail latency experience that averages obscure. A service with 200ms P50 but 8,000ms P99 is failing 1% of users severely.
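Both Apdex and the tail percentiles can be sketched on the same sample set. The latency samples are synthetic, T = 0.5 s as in the text, and the nearest-rank percentile is one of several accepted definitions:

```python
# Apdex = (Satisfied + Tolerating/2) / Total Samples, with T = 0.5 s.
def apdex(latencies, t=0.5):
    satisfied = sum(1 for x in latencies if x <= t)
    tolerating = sum(1 for x in latencies if t < x <= 4 * t)
    return (satisfied + tolerating / 2) / len(latencies)

def percentile(samples, p):
    """Nearest-rank percentile."""
    ordered = sorted(samples)
    return ordered[max(1, round(p / 100 * len(ordered))) - 1]

# 100 synthetic response times (s): 85 satisfied, 10 tolerating, 5 frustrated.
latencies = [0.2] * 85 + [1.0] * 10 + [6.0] * 5

print(apdex(latencies))  # 0.9 -- "good" per the threshold above
# P50 looks fine; P99 exposes the frustrated tail.
print(percentile(latencies, 50), percentile(latencies, 99))  # 0.2 6.0
```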
Targets: Consumer applications should target sub-200ms P50 response time. B2B applications with complex queries may tolerate higher latency with appropriate user expectations. P99 should not exceed 3-5x the P50.
Bandwidth Utilization
Bandwidth utilization measures the percentage of available network capacity consumed, tracked by circuit, segment, or region.
Formula:
Bandwidth Utilization (%) = Actual Throughput / Maximum Throughput x 100
Peak vs. average: Like CPU utilization, bandwidth utilization must be evaluated at peak as well as average. A circuit averaging 40% utilization but peaking at 95% during business hours is a reliability risk. Track hourly peaks alongside daily averages.
Targets: WAN circuits should target under 70% peak utilization to maintain quality of service headroom. Internal LAN segments can tolerate higher average utilization but should still maintain headroom for burst traffic.
Using These KPIs Together
No single KPI tells the full story. The power of this framework is in cross-metric correlation. High ticket volume combined with high change failure rate suggests that deployment quality is driving operational load. High MTTR combined with declining MTBF suggests infrastructure aging that is also generating complex, difficult-to-diagnose failures. Low patch compliance combined with a rising vulnerability count is a compounding risk signal.
For guidance on the data sources that feed these metrics, see IT Data Sources. For the analytical techniques that turn these metrics into actionable insight - including SLA breach prediction and capacity forecasting - see IT Techniques & Models. For dashboard designs that present these KPIs to the right audience, see IT Dashboards.