Case study · Production

Multi-model extraction pipeline

A configurable multi-stage agent pipeline with field-level consensus voting, deterministic validators, and Raw → Suggested → Final audit layering — built to take a 15-minute manual entry process and make it a two-second one.

Outcome: 30× throughput improvement, from 4 entries/hour to 120/hour, with every decision auditable end-to-end. Retired an ongoing ~$60–70k/yr contractor spend; reviewer time redirected to exception-handling and learning-loop curation rather than data entry.

The problem

Scanned documents arriving at a steady rate, each one needing a structured record extracted into a downstream system. The existing process was manual: someone opened the scan, typed the fields into a form, and moved on. Fifteen minutes per scan. The throughput ceiling was a person, and the person was a bottleneck.

LLM extraction was the obvious tool. The hard part was making it trustworthy enough to remove the person from the loop without removing the audit trail.

Architecture

upload → render → segment → extract → consensus → validate → research → persist
                                        ↑                     ↑
                                   agreement score       address/format rules

data layering:   Raw (model outputs) → Suggested (consensus) → Final (human-confirmed)
                 └── no silent overwrites. every layer is auditable ──┘

Each stage of the pipeline is a separately addressable Azure Function inside a Durable orchestration. A scan goes in; a structured record comes out; every intermediate state is persisted. The orchestration is the audit trail.
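The shape of that orchestration is easy to sketch with the durable-functions Node.js library. Stage and activity names below are illustrative, not the pipeline's real function names:

import * as df from "durable-functions";

// Sketch of the orchestration shape. Each activity is a separately
// addressable Azure Function; the orchestration history records every step.
const pipeline = df.orchestrator(function* (context) {
  const scan = context.df.getInput();

  const pages = yield context.df.callActivity("RenderScan", scan);
  const segments = yield context.df.callActivity("SegmentPages", pages);

  // Fan out to multiple extraction models, fan back in for consensus.
  const extractions = yield context.df.Task.all(
    ["modelA", "modelB", "modelC"].map((model) =>
      context.df.callActivity("ExtractFields", { model, segments })
    )
  );

  const suggested = yield context.df.callActivity("Consensus", extractions);
  const validated = yield context.df.callActivity("Validate", suggested);
  const enriched = yield context.df.callActivity("Research", validated);

  // Every intermediate state above is persisted, which is what makes
  // the run auditable end-to-end.
  return yield context.df.callActivity("Persist", enriched);
});

export default pipeline;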

Why consensus voting

One model is one opinion. The interesting failures aren't refusals or hallucinations — they're plausible-but-wrong values that pass eyeballing. Field-level consensus across multiple extraction models gives you an agreement score per field. When models agree, you trust the value and let it through. When they disagree, you know exactly which field to look at.

This isn't about averaging the models. It's about locating uncertainty. The tool surfaces the disagreement at the field level so a human reviews thirty seconds of work, not a fifteen-minute form.
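As a rough sketch of what "agreement score per field" means in practice (data shapes and the naive normalization are assumptions, not the pipeline's actual implementation):

type Extraction = Record<string, string | null>;

interface FieldConsensus {
  value: string | null;   // modal value across models
  agreement: number;      // fraction of models that agree on it
  candidates: string[];   // everything the models proposed
}

function consensusForField(field: string, extractions: Extraction[]): FieldConsensus {
  const votes = new Map<string, number>();
  for (const e of extractions) {
    const v = (e[field] ?? "").trim().toLowerCase(); // naive normalization
    votes.set(v, (votes.get(v) ?? 0) + 1);
  }

  let best = "";
  let bestCount = 0;
  for (const [v, n] of votes) {
    if (n > bestCount) { best = v; bestCount = n; }
  }

  return {
    value: best || null,
    agreement: bestCount / extractions.length,
    candidates: [...votes.keys()].filter(Boolean),
  };
}

// Fields below the review threshold are routed to a human; the rest pass through.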

Why three data layers

Raw, Suggested, Final. Raw is what each model returned. Suggested is consensus. Final is what a human confirmed (or auto-confirmed under threshold). Each layer is preserved on disk. Nothing is silently overwritten.

This is the layer that turns "the AI is making decisions" into "the AI is making suggestions you can review and override." It's also what made the compliance conversation tractable — every record has a complete provenance chain.
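A minimal sketch of how the three layers can be kept separate on disk (types and field names are illustrative):

interface FieldValue {
  value: string | null;
  source: "model" | "consensus" | "human" | "auto-approved";
  at: string; // ISO timestamp
}

interface LayeredRecord {
  raw: Record<string, { model: string; value: string | null }[]>; // per-model outputs
  suggested: Record<string, FieldValue>;                          // consensus layer
  final: Record<string, FieldValue>;                              // human- or auto-confirmed
}

// Confirming a field writes to `final`; `raw` and `suggested` are never mutated,
// so the provenance chain for every value stays intact.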

Deterministic validators

LLMs are not the right tool for "is this a valid US address." A Smarty / USPS lookup is. The pipeline runs deterministic validators after extraction — address verification, format rules, range checks — and pushes failures back into the review queue. Models propose; validators dispose.
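The format-rule half of this is straightforward to sketch with Zod, which is already in the stack; the address check appears only as a commented-out placeholder, since the real Smarty/USPS integration isn't described here:

import { z } from "zod";

// Format and range rules run deterministically after extraction.
// Schema shape and field names are illustrative.
const RecordSchema = z.object({
  amount: z.coerce.number().positive().max(1_000_000),
  invoiceId: z.string().regex(/^[A-Z0-9-]{6,20}$/),
  postalCode: z.string().regex(/^\d{5}(-\d{4})?$/),
});

async function validateRecord(suggested: unknown): Promise<string[]> {
  const failures: string[] = [];

  const parsed = RecordSchema.safeParse(suggested);
  if (!parsed.success) {
    failures.push(...parsed.error.issues.map((i) => i.path.join(".")));
  }

  // Address verification goes through an external lookup (Smarty/USPS).
  // `verifyAddress` is a hypothetical wrapper, not a real SDK call:
  // if (!(await verifyAddress(record.address))) failures.push("address");

  return failures; // non-empty => field(s) pushed back into the review queue
}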

Learning loop

Every human correction produces a structured event: which field, what the model said, what the human changed it to, and why. Those events feed few-shot examples into subsequent extractions. The accuracy curve trends up over time without anyone retraining a model.
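A sketch of what such a correction event might look like and how it could be folded into later prompts (the real event schema isn't shown here, so names are assumptions):

interface CorrectionEvent {
  recordId: string;
  field: string;
  modelValue: string | null; // what the model said
  humanValue: string;        // what the reviewer changed it to
  reason?: string;           // free-text or coded reason
  at: string;                // ISO timestamp
}

// Recent corrections for a field become few-shot examples in subsequent prompts.
function toFewShotExamples(events: CorrectionEvent[], field: string, limit = 5): string {
  return events
    .filter((e) => e.field === field)
    .slice(-limit)
    .map((e) => `Document said "${e.modelValue ?? ""}", correct value: "${e.humanValue}"`)
    .join("\n");
}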

Decisions and trade-offs

Auto-approve threshold

Auto-apply gates exist because review time is the constraint, not extraction accuracy. The threshold is configurable per field — high-stakes fields (amounts, identifiers) require higher confidence than low-stakes ones (categorization, optional notes). The threshold is exposed in the admin UI; it's not buried in a config file.
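A minimal sketch of a per-field gate; the threshold values are made up to illustrate the high-stakes/low-stakes split, not production settings:

// Per-field auto-approve thresholds, editable from the admin UI.
const autoApproveThresholds: Record<string, number> = {
  amount: 0.95,    // high stakes: effectively all models must agree
  invoiceId: 0.95,
  category: 0.7,   // low stakes: majority agreement is enough
  notes: 0.6,
};

function canAutoApprove(field: string, agreement: number): boolean {
  const threshold = autoApproveThresholds[field] ?? 1.0; // unknown fields always go to review
  return agreement >= threshold;
}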

No fine-tuned model

The pipeline runs general-purpose models with structured prompts. Fine-tuning was considered and skipped — the consensus + validator + correction loop got accuracy past the threshold without it, and a fine-tuned model would have locked the architecture to a specific provider. The pipeline is provider-agnostic and can swap models per stage.
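One way this kind of per-stage swapping can stay clean is to put every model behind the same small interface; the sketch below is an assumption about shape, not the pipeline's actual abstraction:

// Each extraction model sits behind the same interface, so a stage can
// swap providers by configuration rather than by code changes.
interface ExtractionModel {
  id: string; // e.g. "providerA:model-x"
  extract(segmentText: string): Promise<Record<string, string | null>>;
}

function buildModels(
  configuredIds: string[],
  registry: Map<string, ExtractionModel>
): ExtractionModel[] {
  return configuredIds.flatMap((id) => registry.get(id) ?? []);
}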

Audit-first failure mode

When something goes wrong — a model returns malformed JSON, a validator times out, a stage crashes — the orchestration captures the failure as a first-class event. The record is preserved in its last good state with the failure attached. There is no failure mode where work is silently lost.
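A sketch of the "failure as a first-class event" idea, with assumed types: a stage result is either a success or a captured failure carrying the last good state, never an unhandled throw that loses work:

type StageResult<T> =
  | { ok: true; value: T }
  | { ok: false; stage: string; error: string; lastGoodState: unknown };

async function runStage<T>(
  stage: string,
  lastGoodState: unknown,
  fn: () => Promise<T>
): Promise<StageResult<T>> {
  try {
    return { ok: true, value: await fn() };
  } catch (err) {
    // Persist the failure alongside the record's last good state so nothing is silently lost.
    return { ok: false, stage, error: String(err), lastGoodState };
  }
}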

Stack

TypeScript · Next.js · Azure Functions · Durable Orchestration · Static Web Apps · Azure SQL · Blob Storage · Entra ID · Bicep IaC · Key Vault RBAC · Zod · Playwright E2E · Vitest · Sentry (PII-scrubbed)

What I'd do again

What I'd do differently
