Building a Data Foundation for AI in Insurance — What Most Carriers Get Wrong

05.20.2026


State of Play: The Readiness Illusion

The paradox of insurance AI in 2026 is precise: adoption is near-universal, but readiness is not. While 96% of organizations report integrating AI into core business processes, 4 in 5 (80%) admit their AI initiatives are still constrained by limited data access across environments (Cloudera Data Readiness Index, 2026).

The gap between confidence and readiness is especially acute in financial services. Only 24% of financial services organizations report they can access all their data at any time — compared to 51% in telecommunications. Only 30% say they have full visibility into where their data even resides.

The consequence is measurable. Gartner projects that 60% of AI projects unsupported by AI-ready data will be abandoned, and only 7% of insurers have brought AI to enterprise-wide scale (BCG, 2025). The data foundation is not a precondition for AI. It is the AI investment.


What "Data Foundation" Actually Means

Most carriers use "data foundation" as shorthand for a data warehouse upgrade or a new lake architecture. Both are components. Neither is the definition.

An AI-ready data foundation is the governed, queryable, continuously maintained data layer that makes model training, deployment, and audit possible. It is not a single technology. It is the combination of data architecture, access controls, quality standards, lineage tracking, and governance ownership that allows AI models to train on trustworthy inputs and produce explainable outputs.

Without it, a model is only as reliable as the worst data source it touched during training. In insurance — where policy admin, claims, billing, actuarial, and third-party data routinely live in separate systems, separate schemas, and separate ownership structures — that reliability floor is typically very low.


The Five Mistakes Carriers Make Before the First Model Is Trained


Mistake 1: Starting With the Model, Not the Data

The Trap:

  • A vendor demonstrates a high-accuracy AI model on clean, normalized demo data
  • The carrier approves the investment and begins model configuration
  • Six months later, the model cannot train on production data because policy, claims, and billing records exist in incompatible schemas across five legacy systems
  • The project is declared "ongoing" — and never reaches production

The Fix:

  • The data architecture audit must precede vendor selection, not follow it
  • Map every source system, identify schema conflicts, and establish a data dictionary with agreed field definitions before any model selection begins
  • Carriers that complete this step first consistently reach production within 12 to 18 months; those that skip it consistently do not reach production at all
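
The mapping step can be sketched in a few lines of Python. This is a minimal illustration, not a prescribed tool: it assumes each source system's schema has been exported as a simple field-to-type mapping, and every system name, field name, and type below is hypothetical.

```python
from collections import defaultdict

# Hypothetical data dictionary: canonical field -> field name used in each source system.
DATA_DICTIONARY = {
    "policy_id": {"policy_admin": "POL_NO", "claims": "policy_number", "billing": "PolicyRef"},
    "effective_date": {"policy_admin": "EFF_DT", "claims": "policy_start", "billing": "StartDate"},
}

# Hypothetical exported schemas: system -> {field name: declared type}.
SOURCE_SCHEMAS = {
    "policy_admin": {"POL_NO": "CHAR(10)", "EFF_DT": "DATE"},
    "claims": {"policy_number": "VARCHAR(12)", "policy_start": "VARCHAR(8)"},
    "billing": {"PolicyRef": "CHAR(10)"},
}

def find_schema_conflicts(dictionary, schemas):
    """Return (canonical_field, issue) tuples for missing fields and type mismatches."""
    conflicts = []
    for canonical, mapping in dictionary.items():
        types_seen = defaultdict(list)
        for system, field in mapping.items():
            declared = schemas.get(system, {}).get(field)
            if declared is None:
                conflicts.append((canonical, f"{system}.{field} is missing from the exported schema"))
            else:
                types_seen[declared].append(system)
        # More than one declared type for the same concept is a conflict to resolve.
        if len(types_seen) > 1:
            detail = "; ".join(
                f"{dtype} in {', '.join(sorted(systems))}"
                for dtype, systems in sorted(types_seen.items())
            )
            conflicts.append((canonical, f"type mismatch: {detail}"))
    return conflicts

conflicts = find_schema_conflicts(DATA_DICTIONARY, SOURCE_SCHEMAS)
for field, issue in conflicts:
    print(f"{field}: {issue}")
```

The value of a report like this is that it exists before vendor selection: each conflict becomes a line item with an owner, not a surprise discovered during model configuration.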

Mistake 2: Assuming "Most Data Is Governed" Means "Data Is AI-Ready"

The Trap:

  • 71% of organizations report that most of their data is governed — but only 18% report their data is fully governed across all systems (Cloudera, 2026)
  • Data that appears consistent within a single system breaks down the moment it is combined with data from another system or used across teams
  • The carrier believes its data is ready because each system looks clean in isolation

The Fix:

  • Governance must be tested at the point of integration, not within individual systems
  • The relevant question is not "Is our claims data governed?" but "Can our claims data be joined to our policy data, enriched with third-party signals, and queried consistently — on demand, by a model, without manual intervention?"
  • If the answer is no, the model will produce results no one can trust or defend
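
One way to make the integration question testable is to measure how many claims actually join to a policy without manual cleanup. The sketch below is illustrative only: it assumes records are plain dictionaries sharing a `policy_id` key, and the function name `integration_report` and the sample data are hypothetical.

```python
def integration_report(policies, claims, policy_key="policy_id"):
    """Summarize whether claims can be joined to policies without manual fixes."""
    policy_ids = {p[policy_key] for p in policies}
    # Claims whose join key was never populated.
    missing_key = [c for c in claims if not c.get(policy_key)]
    # Claims with a key that matches no policy (often a formatting difference).
    orphans = [c for c in claims if c.get(policy_key) and c[policy_key] not in policy_ids]
    joined = len(claims) - len(missing_key) - len(orphans)
    return {
        "claims": len(claims),
        "joined": joined,
        "missing_key": len(missing_key),
        "orphans": len(orphans),
        "join_rate": joined / len(claims) if claims else 0.0,
    }

policies = [{"policy_id": "P-001"}, {"policy_id": "P-002"}]
claims = [
    {"claim_id": "C-1", "policy_id": "P-001"},
    {"claim_id": "C-2", "policy_id": "p001"},  # same policy, different ID format
    {"claim_id": "C-3", "policy_id": None},    # join key never populated
    {"claim_id": "C-4", "policy_id": "P-002"},
]

report = integration_report(policies, claims)
print(report)
```

Each system in this sample would pass its own internal checks, yet only half of the claims join. That gap is what governance tested at the point of integration is meant to expose.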

Mistake 3: Treating Data Quality as an IT Problem, Not a Business Problem

The Trap:

  • 85% of AI project failures trace back to poor data quality (Gartner)
  • 78% of organizations cannot validate data before it enters training pipelines (Informatica, 2025)
  • Because data quality is routed to IT as a maintenance issue, no one with P&L accountability owns the problem — and it stays unsolved while model development continues

The Fix:

  • Data quality ownership must be assigned to business line owners, not infrastructure teams
  • Every data domain — underwriting, claims, billing, actuarial — needs a named data steward with explicit accountability for completeness, consistency, and freshness standards
  • Data quality is not a technical debt to be resolved; it is a business capability to be built and maintained

Mistake 4: Building Siloed Data Pipelines for Individual AI Use Cases

The Trap:

  • A carrier builds a data pipeline for its claims AI project
  • Six months later, the underwriting team builds a separate pipeline for its risk scoring model
  • A year after that, the fraud detection team builds a third pipeline
  • Each pipeline accesses the same source systems but extracts data differently, applies different transformation logic, and produces results that cannot be reconciled against each other
  • When leadership asks for an enterprise view of AI-driven performance, the three teams produce three incompatible answers

The Fix:

  • Build one governed data layer as a shared foundation — not one pipeline per use case
  • A unified data layer with consistent schema, lineage tracking, and access controls serves every AI model in the organization without duplication or reconciliation overhead
  • The upfront investment in a shared foundation is typically recovered by the second use-case deployment

Mistake 5: Ignoring Regulatory Data Requirements Until Deployment

The Trap:

  • A carrier deploys an AI underwriting or claims model without building audit trail infrastructure
  • As of early 2026, 23 states have adopted the NAIC's Model Bulletin on AI Systems, and a 12-state pilot of the NAIC's AI Systems Evaluation Tool is underway — designed to assess AI governance during regulatory exams
  • When a state examination triggers an AI audit, the carrier cannot demonstrate model lineage, training data provenance, or decision traceability
  • The program is suspended pending compliance remediation

The Fix:

  • Data lineage — the documented trail of where data originated, how it was transformed, and which model version used it — must be built into the data foundation architecture from day one
  • Regulatory requirements are architecture inputs, not post-deployment add-ons
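  • Colorado's AI Act (originally effective February 2026; the effective date was subsequently delayed to June 30, 2026) and the EU AI Act both treat AI used in certain insurance decisions as high-risk (the EU AI Act names risk assessment and pricing for life and health insurance), requiring documentation that only a governed data foundation can produce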

Key Takeaway: The five mistakes above share a single root cause. Carriers treat the data foundation as infrastructure work to be completed before the "real" AI project begins. It is not infrastructure work. It is the AI project.


The Four Layers of an AI-Ready Insurance Data Foundation

A governed, AI-ready data foundation in insurance has four distinct layers. Each layer is a prerequisite for the one above it.

Layer 1: Source System Integration

  • All core systems — policy administration, claims management, billing, actuarial, and primary third-party data feeds — connected via API-based pipelines with real-time or near-real-time refresh cadences
  • No manual extraction, no flat file transfers, no data that requires human intervention to move

Layer 2: A Unified Data Schema With a Common Dictionary

  • A single agreed definition for every key entity: policy, claim, customer, risk, loss event, payment
  • Schema conflicts between systems (different date formats, different ID structures, different field names for the same concept) resolved at the integration layer — not left for individual models to handle
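
Resolving conflicts at the integration layer can be sketched as a small set of normalizers applied before any model sees the data. Everything here is an assumption made for illustration: the field maps, the list of date formats, and the canonical policy ID form (`P-` plus six zero-padded digits) are hypothetical, and a real carrier would draw them from its own data dictionary.

```python
from datetime import date, datetime

# Hypothetical date formats observed across source systems.
# Caveat: day-first formats such as DD/MM/YYYY would need an explicit per-system rule.
KNOWN_DATE_FORMATS = ("%Y-%m-%d", "%m/%d/%Y", "%Y%m%d")

# Hypothetical mapping: system -> {source field name: canonical field name}.
FIELD_MAP = {
    "policy_admin": {"POL_NO": "policy_id", "EFF_DT": "effective_date"},
    "claims": {"policy_number": "policy_id", "policy_start": "effective_date"},
}

def normalize_date(raw: str) -> date:
    """Parse a date string in any known source format into a date object."""
    for fmt in KNOWN_DATE_FORMATS:
        try:
            return datetime.strptime(raw.strip(), fmt).date()
        except ValueError:
            continue
    raise ValueError(f"unrecognized date format: {raw!r}")

def normalize_policy_id(raw: str) -> str:
    """Assumed canonical form: 'P-' followed by six zero-padded digits."""
    digits = "".join(ch for ch in raw if ch.isdigit())
    if not digits:
        raise ValueError(f"no digits found in policy id: {raw!r}")
    return f"P-{int(digits):06d}"

def to_canonical(system: str, record: dict) -> dict:
    """Rename source fields to canonical names, then normalize their values."""
    out = {canonical: record[src] for src, canonical in FIELD_MAP[system].items()}
    out["policy_id"] = normalize_policy_id(out["policy_id"])
    out["effective_date"] = normalize_date(out["effective_date"])
    return out

a = to_canonical("policy_admin", {"POL_NO": "POL0001234", "EFF_DT": "01/15/2026"})
b = to_canonical("claims", {"policy_number": "p-1234", "policy_start": "20260115"})
print(a == b)
```

Because the normalization lives in one place, every downstream model inherits the same definition of a policy rather than re-implementing its own.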

Layer 3: Data Quality Monitoring With Assigned Stewardship

  • Automated monitoring of completeness, consistency, freshness, and duplicate rates — with alerts routed to named business owners, not IT queues
  • Quality standards defined to the field level, with documented acceptable thresholds and remediation protocols
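
A minimal sketch of this kind of monitoring follows, assuming records arrive as dictionaries with a `loaded_at` timestamp. The thresholds, the steward address, and the `check_domain` helper are hypothetical; the point is only that each alert carries a named business owner rather than landing in a generic queue.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical rule set: thresholds and steward contacts are illustrative only.
RULES = {
    "claims": {
        "steward": "claims-steward@example.com",
        "key_field": "claim_id",
        "required_fields": ["claim_id", "policy_id", "loss_date"],
        "min_completeness": 0.98,
        "max_duplicate_rate": 0.01,
        "max_age": timedelta(hours=24),
    }
}

def check_domain(domain, records, now):
    """Return alerts for one data domain, each addressed to its named steward."""
    rule = RULES[domain]
    alerts = []

    def alert(check, detail):
        alerts.append({"domain": domain, "steward": rule["steward"],
                       "check": check, "detail": detail})

    total = len(records)
    if total == 0:
        alert("freshness", "no records loaded")
        return alerts

    # Completeness: share of records with each required field populated.
    for field in rule["required_fields"]:
        filled = sum(1 for r in records if r.get(field) not in (None, ""))
        if filled / total < rule["min_completeness"]:
            alert("completeness", f"{field}: {filled}/{total} populated")

    # Duplicates: share of records whose key repeats an earlier record.
    keys = [r.get(rule["key_field"]) for r in records]
    duplicate_rate = 1 - len(set(keys)) / total
    if duplicate_rate > rule["max_duplicate_rate"]:
        alert("duplicates", f"duplicate rate {duplicate_rate:.0%}")

    # Freshness: age of the most recently loaded record.
    newest = max(r["loaded_at"] for r in records)
    if now - newest > rule["max_age"]:
        alert("freshness", f"newest record loaded {newest.isoformat()}")
    return alerts

now = datetime(2026, 5, 20, 12, 0, tzinfo=timezone.utc)
stale = now - timedelta(days=2)
records = [
    {"claim_id": "C-1", "policy_id": "P-1", "loss_date": "2026-05-01", "loaded_at": stale},
    {"claim_id": "C-2", "policy_id": "P-2", "loss_date": "", "loaded_at": stale},
    {"claim_id": "C-2", "policy_id": "P-2", "loss_date": "2026-05-03", "loaded_at": stale},
    {"claim_id": "C-3", "policy_id": "P-3", "loss_date": "2026-05-04", "loaded_at": stale},
]

alerts = check_domain("claims", records, now)
for a in alerts:
    print(f"to {a['steward']}: [{a['check']}] {a['detail']}")
```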

Layer 4: Lineage Tracking and Governance Controls

  • Full audit trail of data provenance: where each record originated, how it was transformed, which model version used it, and when
  • Access controls enforced at the data layer — not the application layer — so governance does not break when a new model or team accesses the same data
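
A lineage entry can be as simple as a hashed before-and-after record tied to a transformation name and a model version. The sketch below is one possible shape under stated assumptions, not a regulatory template: the field names, the `lineage_entry` and `trace` helpers, and the version strings are all hypothetical, and an in-memory list stands in for what would need to be durable, tamper-evident storage.

```python
import hashlib
import json
from datetime import datetime, timezone

def fingerprint(record: dict) -> str:
    """Stable SHA-256 hash of a record, independent of key order."""
    payload = json.dumps(record, sort_keys=True, default=str).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()

def lineage_entry(source_system, raw_record, transformed_record,
                  transform_name, model_version, at=None):
    """Describe where a record came from, how it changed, and which model used it."""
    return {
        "source_system": source_system,
        "input_hash": fingerprint(raw_record),
        "output_hash": fingerprint(transformed_record),
        "transform": transform_name,
        "model_version": model_version,
        "recorded_at": (at or datetime.now(timezone.utc)).isoformat(),
    }

def trace(log, record):
    """Find every lineage entry that produced the given transformed record."""
    target = fingerprint(record)
    return [entry for entry in log if entry["output_hash"] == target]

# In-memory stand-in for an append-only lineage store.
lineage_log = []

raw = {"policy_number": "p-1234", "policy_start": "20260115"}
clean = {"policy_id": "P-001234", "effective_date": "2026-01-15"}

entry = lineage_entry("claims", raw, clean,
                      transform_name="normalize_policy_v1",       # hypothetical name
                      model_version="underwriting-risk-0.3.1")    # hypothetical version
lineage_log.append(entry)

for hit in trace(lineage_log, clean):
    print(hit["source_system"], hit["transform"], hit["model_version"])
```

Hashing the record rather than storing a copy keeps the trail small and avoids duplicating sensitive policyholder data, at the cost of needing the original record to verify a match.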

From Foundation to Production: The AI Pathfinder

Every layer above is assessed in Phase 1 of DOOR3's AI Pathfinder for Insurance. The methodology treats data foundation maturity as the primary filter for use case sequencing — not model sophistication, not strategic ambition, and not vendor roadmap.

The AI Pathfinder data assessment answers four specific questions before any model work begins:

  • Which source systems can be accessed via API today, and which require manual extraction?
  • Where do schema conflicts exist between core systems, and what resolution is required?
  • Which data domains have assigned stewards and documented quality standards?
  • What lineage and audit trail infrastructure is in place, and what regulatory obligations does it need to satisfy?

The output is a prioritized data readiness roadmap — not a list of deficiencies. DOOR3's AI consulting work with insurers including AIG and Munich Re consistently shows the same finding: carriers that invest in the data foundation first reach production AI faster, at lower total cost, and with fewer compliance remediation cycles. The path from a governed data foundation to custom insurance software and full-lifecycle AI is a sequence of validated steps — not a series of parallel bets.


Strategic Direction: Five Actions Before Your Next AI Investment Decision

  1. Audit every source system for API accessibility. If a system requires manual extraction to share data, it is not AI-ready. Document the gap and include API connectivity in your next vendor contract.

  2. Build a unified data dictionary before selecting any AI vendor. Agree on canonical definitions for every entity the model will use — policy, claim, customer, risk — before any model is configured.

  3. Assign named data stewards for every data domain. Data quality accountability belongs with the business owner of the domain, not with IT. Without a named owner, quality debt accumulates unchecked.

  4. Test data governance at the point of integration, not within individual systems. The relevant governance question is whether data from two different systems can be joined, queried, and validated consistently — on demand.

  5. Treat lineage tracking as a regulatory requirement from day one. The NAIC Model Bulletin, the Colorado AI Act, and the EU AI Act all require documentation that only a governed data foundation can produce. Build it in; do not retrofit it.

The carriers generating measurable AI ROI across underwriting, claims, and fraud detection share one structural characteristic: they built the data foundation before they selected a model. Every other variable — vendor, architecture, use case — was secondary to that sequence.


Salvatore Magnone is a father, a veteran, a repeat co-founder (a repeat offender in the best way), and a long-time collaborator at DOOR3. Sal builds successful multinational technology companies and runs obstacle courses. He teaches business and military strategy at the university level and directly to entrepreneurs and military leaders.

https://www.linkedin.com/in/salmagnone/

Door3.com