The gap between collecting IT metrics and generating actionable operational intelligence is wider than most organizations recognize. Raw telemetry tells you what is happening. Analytical technique tells you why it is happening, what will happen next, and what the business impact is. This article covers eight analytical techniques that transform IT data from monitoring output into strategic and operational decision support.
These techniques are ordered by analytical maturity. Real-time monitoring is a prerequisite for everything else. AIOps and capacity planning are where most organizations should focus their second investment. SLA breach prediction, root cause analysis, and change impact analysis represent mature analytical practice. Security analytics and IT cost optimization are domains that benefit substantially from data integration work done for the other techniques.
Real-Time Monitoring and Alerting
Real-time monitoring is the foundation of IT analytics. Before any trend analysis, predictive model, or cost optimization initiative can succeed, the organization needs reliable, low-latency visibility into the state of its infrastructure. This sounds obvious but is frequently underinvested.
Architectural Requirements
An effective real-time monitoring architecture has four components:
Collection layer: Agents on hosts (Datadog Agent, New Relic Infrastructure Agent, Prometheus Node Exporter) collect metrics at regular intervals - typically every 10-60 seconds for infrastructure metrics, every second for APM data. Agents push to a collection endpoint or are polled by a scraper.
Streaming transport: Collected data flows through a streaming platform (Kafka, AWS Kinesis, Azure Event Hubs) that decouples producers from consumers and provides durability guarantees. This layer is what makes it possible to have multiple consumers (real-time dashboards, alerting, historical storage) without coupling them.
Alert evaluation engine: Alert rules are evaluated against the streaming data in near real-time. The alert engine must handle four cases correctly: state transitions (OK to ALERTING), transient spikes (avoid alert fatigue from single-point anomalies), sustained conditions (alert when a threshold is exceeded for N consecutive minutes, not for a single data point), and recovery (return to OK with hysteresis to prevent flapping).
Alert routing and escalation: Alerts route to the correct on-call responder based on service ownership, severity, and time of day. PagerDuty, OpsGenie, and VictorOps provide on-call scheduling and escalation policies. The integration between the alert engine and the on-call platform is a critical operational dependency.
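The sustained-condition and hysteresis behavior the alert engine must implement can be sketched as a small state machine. This is a minimal illustration, not a production evaluator; the thresholds and sample counts are hypothetical:

```python
class AlertRule:
    """Minimal alert evaluator: fires only after the threshold is breached for
    N consecutive samples, and recovers only below a lower hysteresis bound."""

    def __init__(self, fire_above, recover_below, sustain_samples):
        self.fire_above = fire_above        # e.g. 500 (ms) -- illustrative
        self.recover_below = recover_below  # e.g. 400 (ms) -- hysteresis gap
        self.sustain = sustain_samples      # consecutive breaches required
        self.state = "OK"
        self.breach_streak = 0

    def evaluate(self, value):
        if self.state == "OK":
            self.breach_streak = self.breach_streak + 1 if value > self.fire_above else 0
            if self.breach_streak >= self.sustain:
                self.state = "ALERTING"     # sustained condition, not a single spike
        elif value < self.recover_below:
            self.state = "OK"               # recovery with hysteresis prevents flapping
            self.breach_streak = 0
        return self.state
```

A transient single-point spike resets to OK without firing, and a value between the two thresholds keeps an active alert from flapping back and forth.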
Designing Alerts That Work
Alert fatigue is the primary failure mode of real-time monitoring. When too many alerts fire, on-call engineers habituate to ignoring them, and the signal-to-noise ratio degrades to the point where critical alerts are missed.
Alert design principles:
Alert on symptoms, not causes. Alert when user-facing latency exceeds threshold, not when CPU utilization is high. CPU utilization is a cause; high latency is a symptom. Users experience symptoms.
Use multi-condition alerts. Alert when CPU > 90% AND response time > 500ms AND for at least 5 consecutive minutes. Single-condition, single-point alerts produce excessive noise.
Set thresholds relative to baseline, not absolute values. A fixed request-rate threshold of 1,000 rpm is too tight for a service that normally handles 900 rpm and meaningless for one that handles 50 rpm. Thresholds should be “X standard deviations from the rolling 7-day baseline” rather than a fixed absolute value.
Track Mean Time to Acknowledge (MTTA) alongside MTTR. If MTTA is high, your routing is misconfigured. If MTTR is high relative to MTTA, your runbooks or access are insufficient.
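The baseline-relative threshold principle can be sketched as follows, assuming the caller supplies the rolling baseline window (for example, the same metric sampled over the prior 7 days):

```python
import statistics

def is_anomalous(current_value, baseline_window, n_sigma=3.0):
    """Flag a value more than n_sigma standard deviations from the mean of a
    rolling baseline window, instead of comparing against a fixed absolute."""
    mean = statistics.fmean(baseline_window)
    stdev = statistics.stdev(baseline_window)
    if stdev == 0:
        return current_value != mean  # flat baseline: any change is a deviation
    return abs(current_value - mean) / stdev > n_sigma
```

The same rule now adapts per service: a 1,000 rpm reading is unremarkable against a 900 rpm baseline but strongly anomalous against a 50 rpm baseline.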
AIOps and Incident Pattern Recognition
AIOps (AI for IT Operations) applies machine learning to IT telemetry to detect anomalies, suppress noise, correlate related events, and identify recurring incident patterns. The term covers a range of techniques from simple statistical anomaly detection to sophisticated causal graph models.
Anomaly Detection
The most widely deployed AIOps technique is anomaly detection on time series metrics. The goal is to identify metric values that deviate from expected behavior without requiring a human to set a static threshold.
Seasonal decomposition: Most IT metrics exhibit daily and weekly seasonality. A web service handles more traffic on weekdays than weekends, and more traffic in business hours than overnight. A metric value that would be normal at 3pm Tuesday might be anomalous at 3am Sunday. Seasonal decomposition separates a time series into trend, seasonal, and residual components, then applies anomaly detection to the residual.
The STL (Seasonal and Trend decomposition using Loess) algorithm is widely used for this purpose. Given a time series y_t, STL decomposes it as:
y_t = T_t + S_t + R_t
Where:
- T_t = trend component (smoothed long-term direction)
- S_t = seasonal component (recurring pattern)
- R_t = remainder (residual after removing trend and seasonality)
Anomalies are detected in R_t using a threshold based on the interquartile range (IQR) or a z-score against the historical distribution of residuals.
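A dependency-free sketch of this pipeline follows. In practice a proper STL implementation (e.g. `statsmodels.tsa.seasonal.STL`) would be used; the centered-moving-average trend and per-slot seasonal means here are deliberate simplifications of the Loess-based decomposition:

```python
import statistics

def decompose_and_flag(series, period, iqr_k=3.0):
    """Naive additive decomposition y = T + S + R, with anomalies flagged
    where the residual R falls outside median +/- iqr_k * IQR."""
    n, half = len(series), period // 2
    # Trend: centered moving average (endpoints reuse a truncated window).
    trend = [statistics.fmean(series[max(0, i - half):i + half + 1]) for i in range(n)]
    detrended = [y - t for y, t in zip(series, trend)]
    # Seasonal: average detrended value at each position within the period.
    slot_means = [statistics.fmean(detrended[s::period]) for s in range(period)]
    resid = [detrended[i] - slot_means[i % period] for i in range(n)]
    # IQR-based threshold on the residual distribution.
    q1, q2, q3 = statistics.quantiles(resid, n=4)
    lo, hi = q2 - iqr_k * (q3 - q1), q2 + iqr_k * (q3 - q1)
    return [i for i, r in enumerate(resid) if r < lo or r > hi]
```

A spike injected into an otherwise regular periodic series survives the trend and seasonal subtraction and stands out in the residual, while the repeating pattern itself does not.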
ARIMA and Prophet: ARIMA models forecast a series from its own lagged values and past forecast errors; observations falling outside the forecast interval are flagged as anomalies. Facebook’s Prophet library is widely used in IT operations for time series anomaly detection and forecasting because it handles multiple seasonality levels (daily, weekly, annual), holiday effects, and missing data gracefully. Prophet models decompose similarly to STL but provide uncertainty intervals that make threshold-setting more principled.
Event Correlation and Noise Suppression
A major infrastructure incident - a network partition, for example - can generate thousands of correlated alerts across all services affected by the partition. Without correlation, the on-call engineer receives thousands of pages for a single root cause.
Topology-aware correlation: Map alerts to the network/service topology graph. When a parent node fails, suppress alerts from all child nodes that depend on it, since they are symptoms of the parent failure. This requires a maintained topology model (CMDB or service dependency map).
Temporal clustering: Group alerts that fire within a short time window and share common attributes (same host, same service cluster, same geographic region) into a single incident. The time window and attribute matching rules are tunable.
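A minimal sketch of temporal clustering, assuming alerts arrive as dictionaries sorted by timestamp and that sharing the (tunable) attribute keys plus firing within the window is sufficient to merge:

```python
from datetime import timedelta

def cluster_alerts(alerts, window=timedelta(minutes=5), keys=("service",)):
    """Group alerts into candidate incidents: alerts that share the given
    attributes and fire within `window` of the previous alert in their group
    merge into one cluster. `alerts` must be sorted by timestamp `ts`."""
    clusters = {}  # attribute tuple -> list of clusters (each a list of alerts)
    for alert in alerts:
        key = tuple(alert[k] for k in keys)
        groups = clusters.setdefault(key, [])
        if groups and alert["ts"] - groups[-1][-1]["ts"] <= window:
            groups[-1].append(alert)   # extend the open cluster
        else:
            groups.append([alert])     # gap exceeded: start a new cluster
    return [c for groups in clusters.values() for c in groups]
```

Widening the window or dropping attribute keys merges more aggressively, which is exactly the false-merge risk noted below for managed correlation platforms.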
Moogsoft, BigPanda, PagerDuty AIOps, Dynatrace Davis: These platforms implement event correlation as a managed capability. The key evaluation criterion is false positive rate - correlation systems that incorrectly merge unrelated incidents are worse than no correlation at all.
Recurring Incident Pattern Recognition
Analyze the incident history in your ITSM to identify patterns:
SELECT
category,
subcategory,
EXTRACT(hour FROM opened_at) AS hour_of_day,
EXTRACT(dow FROM opened_at) AS day_of_week,
COUNT(*) AS incident_count,
AVG(EXTRACT(epoch FROM resolved_at - opened_at) / 3600) AS avg_mttr_hours
FROM incidents
WHERE opened_at >= NOW() - INTERVAL '90 days'
GROUP BY category, subcategory, hour_of_day, day_of_week
ORDER BY incident_count DESC
This query surfaces which incident categories occur most frequently, at what times, with what average resolution time. Categories with high frequency and high MTTR are candidates for automated remediation runbooks.
Capacity Planning
Capacity planning is the discipline of predicting when current infrastructure will be insufficient to meet demand, with enough lead time to provision additional capacity before a constraint-driven incident occurs. It is the primary use case for historical trend analysis.
Saturation Point Forecasting
For each capacity dimension (CPU, memory, storage, network), fit a trend model to historical utilization and extrapolate to find the date at which utilization will cross the saturation threshold.
Simple linear regression approach:
Utilization(t) = a + b * t + ε
Where:
- t = time (days since baseline)
- a = intercept (baseline utilization)
- b = growth rate (percentage points per day)
- ε = residual error
Saturation point T* is where a + b * T* = threshold:
T* = (threshold - a) / b
For storage on a host with 60% current utilization growing at 0.1 percentage points per day, saturation at 90% is (90 - 60) / 0.1 = 300 days away. This gives you a procurement lead time estimate.
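The linear extrapolation can be sketched as an ordinary least squares fit, assuming daily utilization samples with t = 0 at the first observation:

```python
def saturation_day(utilization, threshold):
    """Fit Utilization(t) = a + b*t by least squares and return
    T* = (threshold - a) / b, the day index at which the fitted trend
    crosses the saturation threshold (None if the trend is not growing)."""
    n = len(utilization)
    t_mean = (n - 1) / 2
    u_mean = sum(utilization) / n
    ss_t = sum((t - t_mean) ** 2 for t in range(n))
    b = sum((t - t_mean) * (u - u_mean) for t, u in enumerate(utilization)) / ss_t
    a = u_mean - b * t_mean  # intercept: fitted baseline utilization
    if b <= 0:
        return None
    return (threshold - a) / b
```

Fed the worked example above (60% baseline growing 0.1 percentage points per day, 90% threshold), the fit recovers the 300-day estimate.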
Non-linear growth: Many IT resources grow non-linearly. User growth often follows an S-curve. Traffic growth often has step-change patterns correlated with product launches. For non-linear growth, use polynomial regression or fit an appropriate growth model (logistic, exponential) based on observed growth pattern shape.
Confidence intervals matter. Report the 90% confidence interval on the saturation date, not just the point estimate. “Storage exhaustion is estimated at 300 days away (90% CI: 220-420 days)” is a more actionable statement than “300 days” because it quantifies the urgency of the procurement decision.
Resource Group Analysis
Individual server capacity planning is necessary but not sufficient. Modern infrastructure runs workloads across resource pools (Kubernetes node groups, auto-scaling groups, database clusters). Capacity planning at the pool level requires:
- Aggregate pool utilization: Sum utilization across all nodes in the pool and divide by total capacity. Track this alongside per-node peak utilization to identify imbalanced load distribution.
- Packing efficiency: What fraction of the pool is allocated (reserved by workloads) vs. actually consumed? A pool with 80% allocation and 40% actual consumption has poor packing efficiency and excess cost.
- Bin-packing constraints: Some workloads have memory-to-CPU ratio requirements that constrain how they can be packed. Analyze the actual constraint binding (CPU-bound vs. memory-bound) to select the right instance type for future provisioning.
Cloud Spend Forecasting
For cloud-native or hybrid infrastructure, capacity planning intersects with cost planning. Cloud spend typically grows faster than utilization because unit prices change, usage patterns shift, and shadow IT accumulates.
Decompose cloud spend growth:
Spend Growth = Volume Growth + Price Changes + Mix Changes + Waste/Inefficiency
Where:
- Volume Growth = spend increase attributable to more workload
- Price Changes = spend increase from commitment expiry or pricing changes
- Mix Changes = spend from shifting to more expensive service tiers
- Waste = spend from idle resources, orphaned storage, unoptimized instance types
This decomposition, producible from the AWS Cost and Usage Report or equivalent, allows you to separate legitimate infrastructure investment from avoidable waste.
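A minimal volume/price split for a single line item illustrates the mechanics of the decomposition; the mix and waste terms require per-service tagging and utilization data not shown in this sketch:

```python
def decompose_spend_growth(units0, price0, units1, price1):
    """Two-factor split of spend growth for one usage line item:
    volume effect at old prices, price effect at new volume.
    The two effects sum exactly to the total spend change."""
    volume_effect = (units1 - units0) * price0
    price_effect = units1 * (price1 - price0)
    total_change = units1 * price1 - units0 * price0
    return volume_effect, price_effect, total_change
```

For example, usage growing from 100 to 120 units while the effective unit price rises from 2.0 to 2.5 attributes 40 of the 100-unit spend increase to volume and 60 to price.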
SLA Breach Prediction
SLA breach prediction identifies tickets and services at risk of breaching their SLA commitment before the breach occurs, enabling proactive intervention. It is one of the highest-value analytical techniques for service desk operations.
Ticket-Level Breach Risk Scoring
For each open ticket, calculate the current elapsed time as a percentage of the SLA window:
SLA Consumption (%) = Time Elapsed Since Ticket Created / SLA Window Duration x 100
A ticket with 80% of its SLA window consumed and no update in 24 hours is a breach risk signal even if it has not yet breached. Combine SLA consumption with ticket characteristics to build a risk score:
Risk Score = SLA Consumption (%) x Assignment Group Overload Factor x Category MTTR Ratio
Where:
- Assignment Group Overload Factor = current open tickets in group / group normal capacity
- Category MTTR Ratio = category historical average MTTR / SLA window duration
Tickets with a risk score above a threshold (calibrate through backtesting against historical breach data) get flagged for proactive escalation.
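The risk score formula translates directly into code. All input values here are hypothetical, and the flagging threshold should come from backtesting as noted above:

```python
def breach_risk_score(elapsed_hours, sla_window_hours,
                      open_in_group, group_capacity,
                      category_avg_mttr_hours):
    """Risk score = SLA consumption (%) x group overload factor
    x category MTTR ratio, per the definitions above."""
    sla_consumption = elapsed_hours / sla_window_hours * 100
    overload = open_in_group / group_capacity
    mttr_ratio = category_avg_mttr_hours / sla_window_hours
    return sla_consumption * overload * mttr_ratio
```

A ticket 80% through a 24-hour SLA window, in a group at 1.5x normal load, in a category whose typical MTTR is 75% of the window, scores 80 x 1.5 x 0.75 = 90.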
Service-Level SLA Prediction
At the service portfolio level, predict the SLA compliance rate for the current month before the month ends using the data available so far.
Projected Compliance = (Compliant Tickets Closed + Estimated Compliant Tickets Remaining) / Total Tickets
Where Estimated Compliant Tickets Remaining = Open Tickets x Historical Compliance Rate for Ticket Type x Time Adjustment Factor
This projection, updated daily, gives service desk managers advance warning if the month’s SLA compliance is trending below target, with enough lead time to reallocate resources.
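A sketch of the daily projection, with the time adjustment factor supplied by the caller (for example, a discount reflecting SLA time already consumed by the open tickets):

```python
def projected_compliance(compliant_closed, closed_total, open_tickets,
                         historical_compliance, time_adjustment=1.0):
    """Projected month-end SLA compliance: compliant tickets already closed
    plus the estimated compliant share of still-open tickets, over the
    total ticket population (closed + open)."""
    est_compliant_remaining = open_tickets * historical_compliance * time_adjustment
    return (compliant_closed + est_compliant_remaining) / (closed_total + open_tickets)
```

With 380 of 400 closed tickets compliant, 100 open tickets, a 95% historical rate, and a 0.9 adjustment, the projection is (380 + 85.5) / 500 = 93.1%.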
Root Cause Analysis
Root cause analysis (RCA) is the structured investigation of why an incident occurred, with the goal of identifying systemic contributing factors rather than immediate triggers. Analytically, RCA involves correlating metrics across multiple data sources around the time window of an incident.
Change Correlation
The most common root cause of production incidents is change. Analyzing the change log in the ITSM relative to the incident timeline is the first step in every RCA.
SELECT
c.number AS change_number,
c.short_description,
c.type,
c.start_date,
c.close_date,
i.number AS incident_number,
i.opened_at,
EXTRACT(epoch FROM i.opened_at - c.close_date) / 60 AS minutes_after_change
FROM incidents i
JOIN change_requests c ON
c.close_date BETWEEN i.opened_at - INTERVAL '4 hours' AND i.opened_at
AND c.cmdb_ci = i.cmdb_ci
WHERE i.priority = 1
ORDER BY minutes_after_change ASC
This query finds P1 incidents that occurred within 4 hours of a change closing on the same configuration item. The temporal proximity and CI overlap are the evidentiary basis for investigating the change as a contributing cause.
Metric Correlation at Incident Time
For each P1 incident, extract a time series window (e.g., 30 minutes before and after the incident start) for all metrics associated with the affected CI and its upstream and downstream dependencies.
Calculate pairwise cross-correlations between metrics at different time lags to identify which metric changes preceded the incident symptoms. A network throughput drop that precedes CPU spikes by 2 minutes suggests a network-induced cause, not a compute-induced cause.
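A dependency-free sketch of lagged cross-correlation (in practice numpy or scipy would do this more efficiently). Here a positive best lag means changes in `x` appear in `y` that many samples later, i.e. `x` leads `y`:

```python
def lagged_correlation(x, y, max_lag):
    """Pearson correlation of x[t] against y[t + lag] for each lag in
    [-max_lag, max_lag]; the lag with the strongest correlation suggests
    which metric led the other, and by how many samples."""
    def pearson(a, b):
        n = len(a)
        ma, mb = sum(a) / n, sum(b) / n
        cov = sum((p - ma) * (q - mb) for p, q in zip(a, b))
        var = (sum((p - ma) ** 2 for p in a) * sum((q - mb) ** 2 for q in b)) ** 0.5
        return cov / var if var else 0.0
    out = {}
    for lag in range(-max_lag, max_lag + 1):
        a = x[:len(x) - lag] if lag >= 0 else x[-lag:]
        b = y[lag:] if lag >= 0 else y[:len(y) + lag]
        if len(a) > 2:
            out[lag] = pearson(a, b)
    return out
```

Applied to the example in the text, a network throughput series whose shape reappears in the CPU series two samples later would peak at lag +2.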
Five Whys Structured Data
Encode post-incident RCA findings in a structured format in your ITSM, not just free-text fields. Track: immediate trigger, contributing factors (1-3), systemic root cause, and remediation action. Analyzing this structured data over 50-100 incidents reveals systemic patterns: if 40% of P1 incidents trace to the same root cause category (e.g., insufficient testing of database schema changes), that is a process investment signal.
Change Impact Analysis
Change impact analysis quantifies the effect of production changes on system performance, stability, and business metrics. It is the analytical foundation for improving change management quality.
Statistical Before/After Comparison
For each production change, compare key metrics in a control window before the change to the equivalent window after the change.
Approach:
- Define the change window (deployment timestamp +/- buffer).
- Extract a before window (e.g., same time period 7 days prior to the change).
- Extract an after window (e.g., 24 hours post-change).
- Compare distributions using Welch’s t-test or Mann-Whitney U test for statistical significance.
- Report effect size (Cohen’s d) alongside p-value to separate statistical significance from practical significance.
Metrics to compare: Response time (P50, P95, P99), error rate, throughput (requests per minute), CPU and memory utilization of the changed service.
A change that increased P99 response time from 800ms to 2,100ms with p < 0.001 and Cohen’s d of 1.8 is a statistically significant and practically meaningful degradation. This evidence basis is far stronger than the qualitative “seems a bit slower” that typically drives rollback decisions.
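Welch's t statistic and Cohen's d can be computed without external dependencies. The p-value step needs a t-distribution CDF (e.g. `scipy.stats.ttest_ind` with `equal_var=False`) and is omitted from this sketch:

```python
import statistics

def welch_t_and_cohens_d(before, after):
    """Welch's t statistic (unequal-variance t-test numerator/denominator)
    and Cohen's d using the pooled standard deviation, for a before/after
    comparison of one metric's samples."""
    n1, n2 = len(before), len(after)
    m1, m2 = statistics.fmean(before), statistics.fmean(after)
    v1, v2 = statistics.variance(before), statistics.variance(after)
    t = (m2 - m1) / (v1 / n1 + v2 / n2) ** 0.5
    pooled_sd = (((n1 - 1) * v1 + (n2 - 1) * v2) / (n1 + n2 - 2)) ** 0.5
    d = (m2 - m1) / pooled_sd
    return t, d
```

Reporting both values together separates the two questions the section distinguishes: whether the shift is real (t, and its p-value) and whether it is large enough to matter (d).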
Change-to-Incident Attribution
Build a model that calculates what percentage of incidents can be attributed to preceding changes, by change type:
Change-Induced Incident Rate (%) = Incidents Opened Within 24h of Change on Same CI / Total Changes of Type x 100
Track this rate by change type (standard, normal, emergency) and by application/service. Emergency changes typically have 3-5x higher incident rates than planned changes, which is the quantitative justification for the additional review burden on emergency change processes.
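The attribution rate can be sketched as follows, assuming incidents and changes are available as (CI, timestamp) pairs with timestamps in hours since a common epoch:

```python
def change_induced_incident_rate(incidents, changes, window_hours=24):
    """Percentage of changes followed within `window_hours` by an incident
    on the same CI. `incidents`: (ci, opened_at) pairs; `changes`:
    (ci, closed_at) pairs; timestamps in hours."""
    induced = sum(
        any(ci == change_ci and 0 <= opened - closed <= window_hours
            for ci, opened in incidents)
        for change_ci, closed in changes)
    return induced / len(changes) * 100
```

Running this separately for each change type (standard, normal, emergency) produces the per-type rates the section recommends tracking.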
Security Analytics
Security analytics applies data science techniques to security event data to detect threats, measure posture, and prioritize remediation.
Behavioral Baseline and User Entity Behavior Analytics (UEBA)
Establish behavioral baselines for users and systems, then flag deviations. Key behavioral signals:
- Authentication pattern deviations: login time, source IP geolocation, and accessed resources. Flag users logging in at unusual hours from unusual locations.
- Data access volume anomalies: a user who normally accesses 50 records per day accessing 50,000 records is an anomaly worth investigating.
- Privilege escalation timing: legitimate escalations follow predictable patterns. Escalations outside business hours or by users without a change ticket open are anomalous.
Entity risk scoring: Aggregate behavioral anomalies into a per-entity risk score using exponential decay weighting (recent anomalies weight more heavily than older ones). Entities exceeding a risk score threshold go to the security analyst queue. This reduces analyst workload compared to reviewing every alert individually.
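The exponential decay weighting can be sketched as follows; the half-life is a tunable assumption, and timestamps are simplified to day counts:

```python
def entity_risk_score(anomalies, now, half_life_days=7.0):
    """Sum anomaly severities weighted by exponential decay of their age:
    weight = 0.5 ** (age_days / half_life_days), so an anomaly one
    half-life old counts half as much as one observed today.
    `anomalies` is a list of (timestamp_days, severity) pairs."""
    return sum(severity * 0.5 ** ((now - ts) / half_life_days)
               for ts, severity in anomalies)
```

Two severity-10 anomalies, one today and one a full half-life ago, score 10 + 5 = 15 rather than 20, so stale behavior ages out of the analyst queue.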
Vulnerability Prioritization
Not all vulnerabilities can be remediated immediately. Data-driven prioritization uses multiple factors beyond CVSS score:
Prioritization score = CVSS Score x Asset Criticality x Exploitability x Exposure
Where:
- Asset Criticality is scored 1-5 based on the business function of the affected system
- Exploitability = 1 if active exploit in the wild, 0.5 if PoC exists, 0.1 if theoretical only
- Exposure = 1 if internet-facing, 0.5 if internal but accessible, 0.1 if isolated
This scoring produces a rank-ordered remediation queue that directs patching effort toward vulnerabilities with the highest actual risk rather than the highest theoretical severity.
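The prioritization formula translates directly; the factor values follow the scales given above, and the status labels are hypothetical identifiers:

```python
def vuln_priority(cvss, asset_criticality, exploit_status, exposure_status):
    """Remediation priority = CVSS x asset criticality (1-5)
    x exploitability factor x exposure factor, per the scales above."""
    exploitability = {"in_the_wild": 1.0, "poc": 0.5, "theoretical": 0.1}[exploit_status]
    exposure = {"internet": 1.0, "internal": 0.5, "isolated": 0.1}[exposure_status]
    return cvss * asset_criticality * exploitability * exposure
```

Note how the multiplicative factors reorder the queue: a CVSS 9.8 flaw on a critical, internet-facing system with an active exploit far outranks a CVSS 10.0 flaw that is theoretical and isolated.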
IT Cost Optimization
IT cost optimization analytics identifies where infrastructure spend is misaligned with actual utilization and business value.
Rightsizing Analysis
For each cloud compute instance, compare allocated capacity to actual utilization:
Waste Score = 1 - (Average Actual Utilization / Allocated Capacity)
Annualized Waste = Current Annual Instance Cost x Waste Score
A cluster of 50 m5.4xlarge instances (16 vCPU, 64GB RAM) averaging 12% CPU utilization has a waste score of 0.88 on CPU. Downsizing to m5.xlarge (4 vCPU, 16GB RAM) at one-quarter the price would save approximately 75% of instance cost if workloads are genuinely CPU-constrained at current levels. Apply this analysis across the fleet to identify the highest-value rightsizing opportunities.
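The waste calculation translates directly into a fleet-scannable function; cost inputs in the example are hypothetical:

```python
def rightsizing_waste(annual_cost, avg_utilization, allocated=1.0):
    """Waste score = 1 - (actual utilization / allocated capacity), and
    annualized waste = annual cost x waste score. Utilization and
    allocation are expressed as fractions of the same capacity unit."""
    waste_score = 1 - avg_utilization / allocated
    return waste_score, annual_cost * waste_score
```

An instance averaging 12% utilization of its allocated CPU has a waste score of 0.88, so 88% of its annual cost is the upper bound on the rightsizing opportunity.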
Chargeback and Showback
Allocate infrastructure costs to the business units and applications that consume them. Showback (reporting without charge) is the first step; chargeback (internal billing) follows once allocation models are trusted.
Allocation methodology:
- Direct costs: cloud resources tagged to a specific application or team allocate directly.
- Shared costs: shared infrastructure (network egress, logging, monitoring) allocates by consumption ratio (e.g., proportional to CPU usage share of the shared cluster).
- Unallocated costs: residual costs from resources without tags or ownership. Report unallocated cost as a percentage of total - high unallocated cost percentages signal tag governance problems.
Financial impact: Organizations that implement mature chargeback programs typically reduce cloud waste by 15-30% because engineering teams respond to cost visibility by optimizing their own resource consumption.
Connecting the Techniques
These techniques build on each other. Real-time monitoring provides the signals that anomaly detection analyzes. Anomaly detection identifies the incidents that RCA investigates. RCA findings inform change impact analysis thresholds. Capacity planning depends on the same trend models used in SLA breach prediction. Security analytics and cost optimization both depend on the behavioral baseline work done for AIOps.
For the KPIs these techniques support, see IT KPIs. For the data sources that feed these models, see IT Data Sources. For how to present outputs from these techniques in dashboard form, see IT Dashboards.