Case study · Production
Multi-model extraction pipeline
A configurable multi-stage agent pipeline with field-level consensus voting, deterministic validators, and Raw → Suggested → Final audit layering — built to take a 15-minute manual entry process and make it a two-second one.
Domain: Regulated nonprofit
Role: Architect & builder
Status: Production
The problem
Scanned documents arrived at a steady rate, each one needing a structured record extracted into a downstream system. The existing process was manual: someone opened the scan, typed the fields into a form, and moved on. Fifteen minutes per scan. The throughput ceiling was a person, and the person was a bottleneck.
LLM extraction was the obvious tool. The hard part was making it trustworthy enough to remove the person from the loop without removing the audit trail.
Architecture
upload → render → segment → extract → consensus → validate → research → persist
                                      ↑           ↑
                             agreement score   address/format rules
data layering: Raw (model outputs) → Suggested (consensus) → Final (human-confirmed)
└── no silent overwrites. every layer is auditable ──┘
Each stage of the pipeline is a separately addressable Azure Function inside a Durable orchestration. A scan goes in; a structured record comes out; every intermediate state is persisted. The orchestration is the audit trail.
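The orchestration shape can be sketched in plain TypeScript. This is an illustration, not the production code: the stage names and `AuditEntry` shape are hypothetical, and in the real system each stage is a separate Azure Function driven by the Durable runtime rather than a local loop.

```typescript
// Sketch: the pipeline as a sequence of named stages, persisting every
// intermediate state. In production each stage is an Azure Function and
// this loop is a Durable orchestration.
type StageFn = (input: unknown) => Promise<unknown>;

interface AuditEntry {
  stage: string;
  output: unknown;
  at: string;
}

async function runPipeline(
  scan: unknown,
  stages: Array<[string, StageFn]>,
): Promise<{ result: unknown; audit: AuditEntry[] }> {
  const audit: AuditEntry[] = [];
  let state = scan;
  for (const [name, fn] of stages) {
    state = await fn(state);
    // Every intermediate state is recorded: the orchestration is the audit trail.
    audit.push({ stage: name, output: state, at: new Date().toISOString() });
  }
  return { result: state, audit };
}
```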
Why consensus voting
One model is one opinion. The interesting failures aren't refusals or hallucinations — they're plausible-but-wrong values that pass eyeballing. Field-level consensus across multiple extraction models gives you an agreement score per field. When models agree, you trust the value and let it through. When they disagree, you know exactly which field to look at.
This isn't about averaging the models. It's about locating uncertainty. The tool surfaces the disagreement at the field level so a human reviews thirty seconds of work, not a fifteen-minute form.
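A minimal sketch of field-level consensus. Field names and the simple majority vote are illustrative; the production scoring is configurable, but the shape is the same: one agreement score per field.

```typescript
// One extraction per model; compute a majority value and agreement score
// per field. Disagreement shows up as a low score on exactly that field.
type Extraction = Record<string, string>;

interface FieldConsensus {
  value: string;     // majority value
  agreement: number; // fraction of models that returned it
}

function consensus(extractions: Extraction[]): Record<string, FieldConsensus> {
  const out: Record<string, FieldConsensus> = {};
  const fields = new Set(extractions.flatMap((e) => Object.keys(e)));
  for (const field of fields) {
    const counts = new Map<string, number>();
    for (const e of extractions) {
      const v = e[field] ?? "";
      counts.set(v, (counts.get(v) ?? 0) + 1);
    }
    // Take the most common value and its share of the vote.
    const [value, n] = [...counts.entries()].sort((a, b) => b[1] - a[1])[0];
    out[field] = { value, agreement: n / extractions.length };
  }
  return out;
}
```

With three models, a unanimous field scores 1.0 and sails through; a 2-of-3 split scores 0.67 and lands in the review queue as a single field, not a whole form.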
Why three data layers
Raw, Suggested, Final. Raw is what each model returned. Suggested is consensus. Final is what a human confirmed (or auto-confirmed under threshold). Each layer is preserved on disk. Nothing is silently overwritten.
This is the layer that turns "the AI is making decisions" into "the AI is making suggestions you can review and override." It's also what made the compliance conversation tractable — every record has a complete provenance chain.
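The layering can be sketched as a type. Names are hypothetical; the point is that confirming a value adds a Final entry without touching Raw or Suggested.

```typescript
// One field across the three layers. Earlier layers are append-only:
// confirming writes a new Final entry alongside the preserved history.
interface LayeredField {
  raw: { model: string; value: string }[];        // every model's answer, verbatim
  suggested?: string;                             // consensus value
  final?: { value: string; confirmedBy: string }; // human- or auto-confirmed
}

function confirm(field: LayeredField, value: string, who: string): LayeredField {
  // Return a new object: the input is never mutated, so every layer stays auditable.
  return { ...field, final: { value, confirmedBy: who } };
}
```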
Deterministic validators
LLMs are not the right tool for "is this a valid US address." A Smarty / USPS lookup is. The pipeline runs deterministic validators after extraction — address verification, format rules, range checks — and pushes failures back into the review queue. Models propose; validators dispose.
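A sketch of the validator shape, with illustrative rules only: the real address check is a Smarty / USPS lookup, not a regex. What matters is that each validator is a pure, deterministic function whose failures route back to review.

```typescript
// A validator returns null on pass, or an error message on fail.
type Validator = (value: string) => string | null;

const zipFormat: Validator = (v) =>
  /^\d{5}(-\d{4})?$/.test(v) ? null : "not a valid ZIP format";

const amountRange: Validator = (v) => {
  const n = Number(v);
  if (!Number.isFinite(n)) return "not a number";
  return n >= 0 && n <= 1_000_000 ? null : "amount out of range";
};

// Run every rule against its field; failures go to the review queue,
// never silently dropped.
function validate(
  fields: Record<string, string>,
  rules: Record<string, Validator[]>,
): { field: string; error: string }[] {
  const failures: { field: string; error: string }[] = [];
  for (const [field, validators] of Object.entries(rules)) {
    for (const check of validators) {
      const err = check(fields[field] ?? "");
      if (err) failures.push({ field, error: err });
    }
  }
  return failures;
}
```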
Learning loop
Every human correction produces a structured event: which field, what the model said, what the human changed it to, and why. Those events feed few-shot examples into subsequent extractions. The accuracy curve trends up over time without anyone retraining a model.
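A sketch of the correction event and a naive recency-based few-shot selector. The event shape mirrors the description above; the selection strategy here is illustrative, not the production one.

```typescript
// One structured event per human correction: which field, what the model
// said, what the human changed it to, and why.
interface CorrectionEvent {
  field: string;
  modelValue: string;
  humanValue: string;
  reason: string;
  at: number; // epoch ms
}

// Turn the k most recent corrections for a field into few-shot lines
// for the next extraction prompt.
function fewShotExamples(log: CorrectionEvent[], field: string, k: number): string[] {
  return log
    .filter((e) => e.field === field)
    .sort((a, b) => b.at - a.at) // most recent first
    .slice(0, k)
    .map((e) => `Field "${e.field}": "${e.modelValue}" should be "${e.humanValue}" (${e.reason})`);
}
```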
Decisions and trade-offs
Auto-approve threshold
Auto-apply gates exist because review time is the constraint, not extraction accuracy. The threshold is configurable per field — high-stakes fields (amounts, identifiers) require higher confidence than low-stakes ones (categorization, optional notes). The threshold is exposed in the admin UI; it's not buried in a config file.
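A sketch of the per-field gate. The threshold values below are illustrative; in production they come from the admin UI, not a hard-coded map.

```typescript
// Per-field auto-approve thresholds (illustrative values).
const thresholds: Record<string, number> = {
  amount: 1.0,     // high-stakes: every model must agree
  identifier: 1.0,
  category: 0.6,   // low-stakes: simple majority is enough
};

function autoApprove(field: string, agreement: number): boolean {
  // Unknown fields default to the strictest gate rather than slipping through.
  const t = thresholds[field] ?? 1.0;
  return agreement >= t;
}
```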
No fine-tuned model
The pipeline runs general-purpose models with structured prompts. Fine-tuning was considered and skipped — the consensus + validator + correction loop got accuracy past the threshold without it, and a fine-tuned model would have locked the architecture to a specific provider. The pipeline is provider-agnostic and can swap models per stage.
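The provider seam can be sketched as a narrow interface. The interface name and stub are hypothetical; the principle is that a stage depends only on this contract, so models can be swapped per stage without touching the pipeline.

```typescript
// The only thing a stage knows about a model.
interface ExtractionModel {
  name: string;
  extract(documentText: string): Promise<Record<string, string>>;
}

// A stub provider for testing; real providers wrap a specific vendor API
// behind the same contract.
const stub = (name: string, answer: Record<string, string>): ExtractionModel => ({
  name,
  extract: async () => answer,
});

// Run every configured model; downstream consensus compares the results.
async function extractStage(doc: string, models: ExtractionModel[]) {
  return Promise.all(
    models.map(async (m) => ({ model: m.name, fields: await m.extract(doc) })),
  );
}
```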
Audit-first failure mode
When something goes wrong — a model returns malformed JSON, a validator times out, a stage crashes — the orchestration captures the failure as a first-class event. The record is preserved in its last good state with the failure attached. There is no failure mode where work is silently lost.
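A sketch of the failure path, with hypothetical names: a thrown error becomes an event on the record, and the last good state survives.

```typescript
// A record carries its last good state plus a first-class failure log.
interface RecordState {
  lastGood: unknown;
  failures: { stage: string; error: string; at: string }[];
}

async function runStageSafely(
  rec: RecordState,
  stage: string,
  fn: (input: unknown) => Promise<unknown>,
): Promise<RecordState> {
  try {
    return { ...rec, lastGood: await fn(rec.lastGood) };
  } catch (err) {
    // No silent loss: preserve the last good state, attach the failure.
    return {
      ...rec,
      failures: [
        ...rec.failures,
        { stage, error: String(err), at: new Date().toISOString() },
      ],
    };
  }
}
```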
Stack
TypeScript
Next.js
Azure Functions
Durable Orchestration
Static Web Apps
Azure SQL
Blob Storage
Entra ID
Bicep IaC
Key Vault RBAC
Zod
Playwright E2E
Vitest
Sentry (PII-scrubbed)
What I'd do again
- The Raw / Suggested / Final pattern. It is the single feature that made compliance review easy.
- Field-level consensus over document-level. Locating uncertainty at the field level is the difference between "review everything" and "review the two fields that disagree."
- Deterministic validators on top. The LLM proposes; the validator decides what's actually a valid value.
What I'd do differently
- Treat the correction-event log as a first-class dataset earlier. It has more downstream uses than I planned for.
- Bake provider-swap into the prompt registry from day one rather than retrofitting it.