
Data Sources

Audit logs, compliance reports, and risk management systems.

By D-LIT Team

Risk analytics is only as comprehensive as the data that feeds it. Most organizations already have the raw material for sophisticated risk measurement - it is just distributed across a half-dozen systems that were never designed to speak to each other. Building integrated risk analytics is a data integration problem before it is an analytical one.

This article catalogs the eight primary data source categories for enterprise risk analytics, covering what each system contains, how to extract it, what data quality problems to anticipate, and how it connects to the KPIs and techniques described in Risk KPIs and Risk Techniques.

Risk Management Information Systems (RMIS)

RMIS platforms are the canonical system of record for enterprise risk events, insurance program data, and claims management. The dominant vendors - Riskonnect, SAI360, Origami Risk, Ventiv - share a common data model built around incident reports, loss events, claims, and policy data.

What RMIS systems contain:

  • Loss event records with date, category, business unit, and financial impact
  • Incident reports and near-miss records
  • Insurance policy terms, coverage limits, and renewal data
  • Claims files with status, reserve amounts, and payment history
  • Risk assessments and risk register items
  • Corrective action tracking

Integration approach: Most modern RMIS platforms expose REST APIs with OAuth 2.0 authentication. Where APIs are unavailable, scheduled CSV exports are standard. Key extraction targets are the loss event tables and the risk register - these drive the majority of enterprise risk KPIs.
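The extraction loop for a paginated REST endpoint can be sketched as below. This is a minimal illustration, not any vendor's actual API: the offset/limit pagination style, page size, and the `fetch_page` callable are assumptions - in production `fetch_page` would wrap an authenticated (OAuth 2.0 bearer-token) GET against the platform's loss-event endpoint.

```python
def extract_loss_events(fetch_page, page_size=500):
    """Page through a loss-event endpoint until a short (final) page arrives.

    fetch_page(offset, limit) returns a list of event dicts; in production
    it would wrap an authenticated REST call to the RMIS API.
    """
    events, offset = [], 0
    while True:
        page = fetch_page(offset, page_size)
        events.extend(page)
        if len(page) < page_size:  # short page signals the end of the data
            return events
        offset += page_size
```

Injecting the fetch function keeps the pagination logic testable without a live RMIS connection.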

Data quality issues: RMIS data quality is often poor in organizations where RMIS is used primarily for insurance renewal documentation rather than active risk management. Loss event categorization is frequently inconsistent (the same type of operational failure classified under different categories by different business units), root cause fields are sparsely populated, and financial impact estimates are rarely updated as actual costs crystallize. Plan for a significant data cleansing effort before RMIS data can be reliably used in analytics.

Key fields for integration: incident_date, discovery_date, business_unit_id, risk_category, event_type, gross_loss_amount, net_loss_amount, insurance_recovery, status, root_cause_category

Connects to: Incident Cost (Risk KPIs), Operational Risk dashboards (Risk Dashboards), Enterprise Risk Quantification (Risk Techniques)

ERP and Financial Systems

Enterprise Resource Planning systems - SAP, Oracle Financials, Microsoft Dynamics, NetSuite - are primary data sources for financial risk analytics, fraud detection, and audit analytics. They contain the transaction-level detail that makes it possible to detect anomalies, reconstruct control failures, and quantify financial exposures.

What ERP systems contain:

  • Accounts payable and accounts receivable transaction records
  • General ledger entries with full journal detail
  • Purchase order and vendor payment data
  • Expense report and employee payment data
  • Budget vs. actual data for variance analysis
  • Fixed asset registers
  • Intercompany transaction records

Integration approach: SAP provides the BAPI/RFC interface and the ODP (Operational Data Provisioning) framework for direct extraction. Oracle offers database views and REST APIs depending on version. Most ERP vendors now support OData APIs that simplify incremental extraction. For financial risk analytics, the most valuable extraction targets are the AP subledger (for duplicate payment detection and vendor fraud), the GL journal entry table (for unusual transaction detection), and the expense report detail table.
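A duplicate-payment check on the AP subledger reduces to grouping invoices on a normalized key. The sketch below assumes an `invoice_number` field in addition to the fields listed later; the normalization rules (strip hyphens and leading zeros, case-fold) are illustrative, since real AP data usually needs richer rules.

```python
from collections import defaultdict

def flag_duplicate_payments(invoices):
    """Group AP invoices by (vendor_id, normalized invoice number, amount);
    any group with more than one record is a duplicate-payment candidate."""
    groups = defaultdict(list)
    for inv in invoices:
        key = (
            inv["vendor_id"],
            # strip hyphens and leading zeros so "0001-A" and "1A" collide
            inv["invoice_number"].replace("-", "").lstrip("0").upper(),
            round(inv["invoice_amount"], 2),
        )
        groups[key].append(inv)
    return [group for group in groups.values() if len(group) > 1]
```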

Data quality issues: ERP financial data is typically high quality by accounting standards, but analytical use cases expose problems that accounting controls do not catch. Vendor master data is frequently inconsistent - the same supplier may appear under multiple vendor IDs with slight name variations, making vendor concentration analysis unreliable. Transaction dates vs. posting dates create ambiguity for period-boundary analysis. Intercompany eliminations need to be applied before cross-entity aggregation.

Critical tables for risk analytics:

  • AP invoice table: vendor_id, invoice_date, invoice_amount, payment_date, payment_method, approver_id
  • GL journal entry: entry_date, account_code, debit_amount, credit_amount, user_id, posting_time
  • Expense report: employee_id, submission_date, expense_category, amount, merchant, approval_chain

Connects to: Fraud Rate, Audit Finding Rate (Risk KPIs), Fraud Detection and Anomaly Analysis (Risk Techniques), Financial Risk and Audit dashboards (Risk Dashboards)

Transaction Data

For organizations with payment processing, lending, or trading operations, transaction-level data is the primary source for fraud analytics and credit risk measurement. This category includes payment networks, core banking systems, card processing platforms, and trading systems.

What transaction data contains:

  • Individual payment and settlement records with full attribute sets
  • Card authorization and decline records with reason codes
  • ACH/wire transfer records with originator and beneficiary data
  • Loan origination and servicing transaction records
  • Trading order and execution records
  • Chargeback and dispute records

Integration approach: Transaction systems are typically high-volume, real-time sources. Fraud analytics requires either streaming integration (Apache Kafka, AWS Kinesis) or near-real-time batch extraction at intervals of minutes rather than hours. Historical analysis for model training typically uses daily batch files. Payment networks and card processors provide standardized ISO 8583 or ISO 20022 transaction formats.

Data quality issues: Transaction data volume is the primary challenge - billions of records in mature payment operations. The key quality issue for fraud analytics is label quality: confirmed fraud cases (chargebacks, dispute resolutions) arrive weeks or months after the transaction, creating a training data lag that requires careful handling in ML model development. Transaction data also contains high cardinality fields (merchant names, IP addresses, device IDs) that require normalization before analysis.
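One standard way to handle the label lag is to train only on transactions old enough for chargebacks and dispute outcomes to have arrived. A minimal sketch, assuming a 90-day maturity window and a `transaction_date` field (the right window depends on your chargeback timelines):

```python
from datetime import date, timedelta

def mature_training_rows(transactions, as_of, maturity_days=90):
    """Keep only transactions old enough for fraud labels to have arrived;
    newer unlabeled rows would otherwise be learned as legitimate."""
    cutoff = as_of - timedelta(days=maturity_days)
    return [t for t in transactions if t["transaction_date"] <= cutoff]
```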

Key fields for fraud analytics: transaction_id, transaction_timestamp, account_id, merchant_id, merchant_category_code, transaction_amount, transaction_currency, channel (card-present/card-not-present/ACH), device_fingerprint, ip_address, outcome (approved/declined), fraud_flag

Connects to: Fraud Rate (Risk KPIs), Real-Time Transaction Monitoring, Anomaly Detection (Risk Techniques), Fraud Detection dashboards (Risk Dashboards)

Regulatory and Compliance Systems

Governance, risk, and compliance (GRC) platforms and compliance management systems store the structured data that drives compliance rate metrics and audit analytics. This category includes dedicated GRC tools (ServiceNow GRC, MetricStream, Archer) as well as purpose-built compliance applications for specific regulatory domains (anti-money laundering, FCPA, data privacy).

What compliance systems contain:

  • Control inventories with design and operating effectiveness ratings
  • Policy libraries with version histories and attestation records
  • Regulatory requirement mappings (controls to requirements)
  • Audit findings with severity ratings and remediation tracking
  • Regulatory exam findings and management responses
  • Training completion records
  • Exception and waiver logs

Integration approach: GRC platforms typically offer REST APIs or scheduled exports. The critical integration challenge is mapping between the GRC platform’s control framework and the specific regulatory requirements or internal policy standards that drive your compliance metrics. This mapping is rarely available as structured data - it often exists only in documentation - and must be built as a reference table to enable compliance rate calculations.
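Once the mapping reference table exists, a compliance rate per requirement is a straightforward join-and-count. The sketch below uses illustrative dict structures and a hypothetical requirement identifier; the key design point is returning None for unmapped requirements so they are never reported as 100% compliant.

```python
def compliance_rate(controls, requirement_map, requirement):
    """Share of controls mapped to a requirement whose operating
    effectiveness is rated 'effective'. Returns None when nothing is
    mapped, so an unmapped requirement cannot look fully compliant."""
    control_ids = requirement_map.get(requirement, [])
    if not control_ids:
        return None
    effective = sum(
        1 for cid in control_ids
        if controls[cid]["operating_effectiveness"] == "effective"
    )
    return effective / len(control_ids)
```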

Data quality issues: Compliance system data suffers from status inconsistency (controls marked “effective” based on documentation review rather than operational testing), stale risk assessments (risk scores not updated as the threat environment changes), and fragmented audit trails when findings span multiple audit cycles. Control frequency metadata (continuous, daily, monthly, annual) is often absent, making it impossible to calculate whether a required control was performed on schedule.

Key fields for compliance analytics: control_id, control_name, control_owner, regulatory_requirement_reference, assessment_date, design_effectiveness, operating_effectiveness, finding_severity, remediation_due_date, remediation_status, attestation_date

Connects to: Compliance Rate, Audit Finding Rate (Risk KPIs), Regulatory Compliance Automation (Risk Techniques), Compliance dashboards (Risk Dashboards)

Credit Bureau and External Reference Data

For organizations with credit risk exposure - banks, insurance carriers, leasing companies, trade credit providers - external credit data supplements internal transaction history with information about counterparty creditworthiness that the organization could not otherwise observe.

What credit bureau and external data contains:

  • Individual consumer credit reports: scores, tradeline history, public records, inquiries
  • Commercial credit data: business credit scores, payment history, lien and judgment data
  • Financial statement data for commercial obligors
  • Industry default rate statistics
  • Macro-economic variables: unemployment rates, GDP growth, interest rate curves
  • Sanctions and watchlist data (OFAC, EU consolidated list, UN sanctions)

Integration approach: Credit bureau data (Equifax, Experian, TransUnion for consumer; Dun & Bradstreet, Moody’s Analytics for commercial) is typically accessed through batch or real-time API calls at origination and through periodic portfolio refresh pulls. The contractual and regulatory restrictions on credit data use - permissible purpose requirements under FCRA, data minimization requirements under GDPR and CCPA - must be reflected in your data governance model before integrating credit data into analytics pipelines.

Data quality issues: Credit bureau data contains errors that obligors have disputed but which remain in files pending resolution. Commercial credit data for small and mid-market businesses is sparse and stale. Macro-economic variable vintage (the date the variable was measured) must be tracked carefully when building credit models - using current macro data to backtest a model trained on historical data introduces look-ahead bias.
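Vintage tracking can be enforced with an as-of lookup that only ever returns observations published on or before the analysis date. A minimal sketch, assuming the series is stored as (vintage_date, value) pairs sorted ascending with ISO 8601 date strings:

```python
from bisect import bisect_right

def macro_as_of(vintaged_series, as_of):
    """Latest macro observation whose vintage (publication date) is on or
    before as_of; restricting to vintages <= as_of keeps backtests free
    of look-ahead bias."""
    dates = [d for d, _ in vintaged_series]
    i = bisect_right(dates, as_of)  # ISO dates sort correctly as strings
    return vintaged_series[i - 1][1] if i else None
```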

Key fields for credit risk analytics: obligor_id, bureau_pull_date, credit_score, score_model_version, derogatory_count, revolving_utilization, months_since_delinquency, public_record_count, estimated_income (for consumer), duns_number, paydex_score, credit_limit_recommended (for commercial)

Connects to: Loss Given Default, Portfolio Concentration Risk (Risk KPIs), Credit Risk Modeling (Risk Techniques), Credit Risk dashboards (Risk Dashboards)

Security Information and Event Management (SIEM) Systems

SIEM platforms - Splunk, Microsoft Sentinel, IBM QRadar, Elastic SIEM - aggregate log and event data from IT infrastructure and generate security alerts. For enterprise risk analytics, SIEM data bridges the gap between cybersecurity operations and operational risk measurement.

What SIEM systems contain:

  • Authentication event logs: login attempts, MFA challenges, privilege escalation events
  • Network access logs: firewall events, VPN connections, lateral movement indicators
  • Endpoint events: file access, process execution, USB device connections
  • Application logs: failed transactions, error patterns, access control violations
  • Security alerts: IDS/IPS alerts, threat intelligence matches, behavioral anomalies
  • Incident tickets linked to underlying events

Integration approach: SIEM platforms expose search APIs (Splunk’s REST API, Microsoft Sentinel’s Log Analytics API) that enable structured queries against log data. For risk analytics purposes, you rarely need raw log volumes - pre-aggregated security metrics computed within the SIEM are sufficient. Define the specific alert categories and event types relevant to your risk taxonomy (unauthorized access attempts, privilege abuse, data exfiltration indicators) and pull aggregated counts by time period and system.
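The aggregation itself is simple once the relevant categories are defined. A sketch of the daily roll-up, assuming ISO 8601 timestamps (so the first ten characters are the date) and the field names listed below:

```python
from collections import Counter

def alert_counts_by_day(events, categories):
    """Roll raw SIEM events up to daily counts per alert category - the
    pre-aggregated form most risk dashboards consume."""
    counts = Counter(
        (e["event_timestamp"][:10], e["event_category"])  # (day, category)
        for e in events
        if e["event_category"] in categories
    )
    return dict(counts)
```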

Data quality issues: SIEM data is extremely high volume, and raw logs are noisy. Alert fatigue - where high alert volumes cause security teams to triage inadequately - means that SIEM alert data reflects both genuine threats and detection system configuration quality. A spike in alerts may indicate increased threat activity or a new detection rule that is over-triggering. Contextualizing SIEM metrics requires knowing what changed in detection rules during the measurement period.

Key fields for operational risk analytics: event_timestamp, source_system, event_category, severity, user_id, source_ip, destination_system, alert_name, disposition (true positive/false positive), incident_id (if escalated)

Connects to: Open Risk Items, MTTI Risk (Risk KPIs), Anomaly Detection (Risk Techniques), Operational Risk dashboards (Risk Dashboards)

Internal Audit Systems

Dedicated audit management platforms - TeamMate+, AuditBoard, Galvanize (formerly ACL) - provide structured data about audit activity, findings, and remediation that drives audit KPIs and provides a lagging indicator of control quality.

What internal audit systems contain:

  • Audit plan and engagement records with scope, timing, and assigned auditors
  • Finding records with classification (observation, control deficiency, material weakness), risk rating, and management response
  • Remediation action plans with owners and due dates
  • Closure records with auditor verification of remediation
  • Follow-up audit results
  • Historical finding trends by business unit and control domain

Integration approach: Most audit management platforms offer REST APIs or scheduled data exports. The most analytically valuable tables are the finding register and the remediation tracking table - these support Audit Finding Rate and MTTR Risk calculations. Linking audit findings to the control inventory in your GRC system (via shared control ID references) enables impact analysis showing which controls are generating the most audit findings.
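The linkage described above boils down to counting findings per shared control ID. A minimal sketch (field names follow the key fields listed below; findings without a control reference are skipped rather than guessed):

```python
from collections import Counter

def findings_per_control(findings):
    """Count audit findings by shared control_id so GRC control records
    can be ranked by how often they generate findings."""
    return Counter(f["control_id"] for f in findings if f.get("control_id"))
```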

Data quality issues: Finding severity ratings are often applied inconsistently across audit teams and time periods, particularly in organizations that have reorganized their internal audit function or changed leadership. Remediation closure evidence quality varies - some findings are closed based on management attestation rather than auditor verification. Build metadata about closure methodology (management assertion vs. auditor-tested) into your reporting.

Key fields for audit analytics: finding_id, audit_engagement_id, finding_date, business_unit_id, control_domain, severity_rating, repeat_finding_flag, prior_finding_id (for repeat findings), remediation_due_date, remediation_close_date, closure_method

Connects to: Audit Finding Rate, MTTR Risk (Risk KPIs), Regulatory Compliance Automation (Risk Techniques)

Insurance Claims and Policy Data

For organizations with captive insurance programs or significant insured risk portfolios, insurance data is the actuarial foundation for risk quantification and the empirical basis for loss modeling.

What insurance data contains:

  • Policy terms: coverage types, limits, deductibles, effective dates, premium amounts
  • Claims records: claim date, loss date, claimant, cause, loss amount, reserve amount, payment history, recovery amounts
  • Loss development triangles for actuarial analysis
  • Subrogation and recovery records
  • Broker and underwriter correspondence (often unstructured)

Integration approach: Insurance data typically requires extraction from the RMIS (if claims are managed there), the insurer’s claims portal, or a combination of both. Large organizations with captive programs may have dedicated actuarial systems (Willis Towers Watson’s ResQ, Milliman MG-ALFA) that contain more sophisticated loss development data. Standardize the claims hierarchy (claim-level vs. occurrence-level vs. policy-level) before integration - inconsistency here produces incorrect aggregations.
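Standardizing the hierarchy usually means rolling claim-level amounts up to a common level before mixing sources. A sketch of the occurrence-level roll-up, using the incurred-loss and occurrence fields listed below:

```python
from collections import defaultdict

def rollup_to_occurrence(claims):
    """Sum claim-level incurred amounts to occurrence level so sources
    reported at different levels of the claims hierarchy aggregate
    consistently before analysis."""
    totals = defaultdict(float)
    for c in claims:
        totals[c["occurrence_id"]] += c["incurred_loss_amount"]
    return dict(totals)
```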

Data quality issues: Reserve amounts are actuarial estimates that change over time; historical reserve data without effective-date versioning is analytically misleading. Cause-of-loss coding is often applied at first notice of loss (FNOL) and not updated as claims develop, so codes reflect the initial description rather than the final determination. Recovery amounts often post to different periods than the original loss, creating reconciliation complexity.

Key fields for risk analytics: claim_id, occurrence_id, policy_id, loss_date, report_date, cause_of_loss_code, line_of_business, incurred_loss_amount, paid_loss_amount, case_reserve_amount, recovery_amount, claim_status, close_date

Connects to: Incident Cost, Risk Mitigation Effectiveness (Risk KPIs), Enterprise Risk Quantification (Risk Techniques)

Integration Architecture Principles

Connecting these eight source systems into an integrated risk data layer requires deliberate architectural choices. Tools like Plotono can serve as the connective analytics layer that ingests data from these disparate systems and delivers governed, cross-domain risk views to stakeholders.

Canonical risk event model. Define a single risk event schema that can represent incidents from RMIS, findings from audit systems, alerts from SIEM, and claims from insurance data. The canonical model should include: event_id, event_type, source_system, business_unit_id, risk_category, severity, financial_impact, discovery_date, resolution_date. This model becomes the analytical foundation for cross-domain KPIs like MTTI Risk and the REI.
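The canonical schema and a per-source mapper can be sketched as below. The `RiskEvent` fields follow the schema above; the RMIS field names in `from_rmis` follow the key fields listed earlier, except `id`, `severity`, and `close_date`, which are illustrative assumptions about a given RMIS extract.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class RiskEvent:
    """Canonical cross-source risk event (one mapper per source system)."""
    event_id: str
    event_type: str
    source_system: str
    business_unit_id: str
    risk_category: str
    severity: str
    financial_impact: float
    discovery_date: str
    resolution_date: Optional[str] = None

def from_rmis(rec):
    """Map one RMIS loss-event record into the canonical schema."""
    return RiskEvent(
        event_id=f"rmis:{rec['id']}",  # prefix avoids cross-system ID clashes
        event_type=rec["event_type"],
        source_system="RMIS",
        business_unit_id=rec["business_unit_id"],
        risk_category=rec["risk_category"],
        severity=rec.get("severity", "unrated"),
        financial_impact=rec["net_loss_amount"],
        discovery_date=rec["discovery_date"],
        resolution_date=rec.get("close_date"),
    )
```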

Temporal consistency. Risk data sources use different time dimensions - RMIS uses loss dates, ERP uses posting dates, audit systems use finding dates, SIEM uses event timestamps. Establish a consistent temporal model for your analytics: define which date field each source uses as its primary time dimension and ensure your aggregation logic is consistent.

Slowly changing dimensions. Risk data involves slowly changing reference data - organizational hierarchies change, control ownership changes, vendor categories change. Implement Type 2 slowly changing dimensions for critical reference tables (business unit hierarchy, control owner, vendor category) to preserve the ability to analyze historical data accurately.
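The Type 2 mechanics can be sketched as a close-and-append update. This is a minimal in-memory illustration of the pattern (in practice it runs as SQL or in an ETL tool); rows carry `key`, `start_date`, `end_date`, and the tracked attributes, with an open `end_date` of None marking the current version.

```python
def scd2_upsert(history, key, attrs, as_of):
    """Type 2 update for one reference entity: close the current version
    (set its end_date) and append the new version with an open end_date."""
    for row in history:
        if row["key"] == key and row["end_date"] is None:
            row["end_date"] = as_of  # close the previously-current version
    history.append({"key": key, "start_date": as_of, "end_date": None, **attrs})
    return history
```

Because closed versions are retained, historical KPIs can be re-aggregated against the hierarchy as it stood at the time.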

Vendor and entity resolution. Multiple source systems will contain representations of the same entity - the same vendor, the same employee, the same business unit - with slightly different identifiers or names. Entity resolution (probabilistic matching, deterministic key matching) is prerequisite work before cross-system analytics will produce reliable results.
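Both matching styles can be sketched for vendor names: deterministic normalization first, with a similarity ratio as the probabilistic fallback. The normalization rules and the 0.9 threshold are illustrative assumptions; production entity resolution typically needs much richer rules and match review.

```python
from difflib import SequenceMatcher

def normalize_vendor(name):
    """Deterministic step: uppercase, drop punctuation and common legal
    suffixes so 'Acme Corp.' and 'ACME CORP' collapse to one key."""
    cleaned = "".join(ch for ch in name.upper() if ch.isalnum() or ch == " ")
    suffixes = {"INC", "LLC", "LTD", "CORP", "CO", "GMBH"}
    return " ".join(t for t in cleaned.split() if t not in suffixes)

def vendors_match(a, b, threshold=0.9):
    """Probabilistic fallback: similarity ratio on normalized names."""
    ratio = SequenceMatcher(None, normalize_vendor(a), normalize_vendor(b)).ratio()
    return ratio >= threshold
```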
