Today I Learned

Document extraction: four main approaches with a 1000x cost difference

I looked at the four main ways to turn unstructured documents into structured data: full LLM inference, fine-tuned small models, template-based extraction, and cloud OCR services.

The cost difference is huge: template-based extraction costs $0.001 per document, while full LLM inference costs $5 to $15 per document. That's a 1000x+ difference.

Most companies waste money by treating all documents the same. Document classification upfront can cut costs by 85%+ while maintaining flexibility for edge cases.

What I learned

Cloud OCR services (Azure Document Intelligence, AWS Textract, Google Document AI) cost $1.50 per 1,000 pages for basic OCR. They're fully managed, pre-trained on common document types, and great for MVPs.

Recent benchmarks: Gemini 2.0 Pro achieved 100% item extraction accuracy at $0.0045 per invoice, while AWS and Azure cost $0.01 per invoice. Azure's asynchronous processing delivers an 85% cost saving: 30 pages cost $0.045 async versus $0.30 synchronous.

The downside: per-page costs add up quickly at volume, and Azure's custom extraction models cost $50 per 1,000 pages.
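Here's a minimal sketch of what calling one of these services looks like, using synchronous text detection in AWS Textract through boto3 (the filename is a placeholder, and AWS credentials are assumed to be configured in the environment):

  import boto3

  # Assumes credentials via env vars, ~/.aws/credentials, or an IAM role.
  textract = boto3.client("textract")

  # Synchronous OCR on a single-page image (PNG or JPEG).
  with open("invoice.png", "rb") as f:  # placeholder filename
      response = textract.detect_document_text(Document={"Bytes": f.read()})

  # Results come back as PAGE/LINE/WORD blocks; print the recognized lines.
  for block in response["Blocks"]:
      if block["BlockType"] == "LINE":
          print(block["Text"])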

Fine-tuned small models (7–8B parameter models like Llama 3.1 8B and Mistral 7B) cost $0.00368 per 1,000 tokens for inference after training.

Real benchmarks: LLaMA-3 8B achieved 76.6% accuracy without any fine-tuning, matching fine-tuned LLaMA-2 70B. After fine-tuning on just 861 samples, LLaMA-2 7B jumped from 47.6% to 61.5% accuracy, with a 47.78% reduction in hallucinations.

Cost of training: less than $2 for QLoRA on A100 GPUs (46 minutes for Mistral 7B). Inference hosting runs $288 to $530 per month on cloud GPUs. Breakeven versus GPT-4 API pricing comes at roughly 1 million documents per year.
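As a rough sketch, the QLoRA setup looks like this with the Hugging Face transformers and peft libraries (the model name and LoRA hyperparameters are illustrative defaults, not the exact configuration behind those benchmarks):

  import torch
  from transformers import AutoModelForCausalLM, BitsAndBytesConfig
  from peft import LoraConfig, get_peft_model

  # Load the base model quantized to 4-bit (the "Q" in QLoRA).
  bnb = BitsAndBytesConfig(
      load_in_4bit=True,
      bnb_4bit_quant_type="nf4",
      bnb_4bit_compute_dtype=torch.bfloat16,
  )
  model = AutoModelForCausalLM.from_pretrained(
      "mistralai/Mistral-7B-v0.1", quantization_config=bnb, device_map="auto"
  )

  # Attach small trainable LoRA adapters; the 4-bit base model stays frozen.
  lora = LoraConfig(r=16, lora_alpha=32, target_modules=["q_proj", "v_proj"],
                    task_type="CAUSAL_LM")
  model = get_peft_model(model, lora)
  model.print_trainable_parameters()  # typically well under 1% of weights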

Template-based extraction costs fractions of a cent per document, but templates must be built ahead of time. For known formats, modern tools reach F1 scores of 1.0 at sub-second latency.

PyMuPDF scored F1 between 0.983 and 0.993 on government, legal, and financial documents. Camelot handled table extraction well, with a 0.828 F1 score on complex government tenders. Processing speed: structured documents take 0.3 to 1.6 seconds, versus 33.9 seconds for multimodal LLM approaches, making templates roughly 54 times faster.
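A minimal sketch of that template path, with PyMuPDF for text and Camelot for tables (the filename and page selection are placeholders):

  import fitz  # PyMuPDF
  import camelot

  # Text extraction: deterministic and fast, no model inference involved.
  doc = fitz.open("tender.pdf")  # placeholder filename
  for page in doc:
      text = page.get_text()
      # ...match text against known field positions or regex templates here

  # Table extraction; "lattice" mode targets tables with ruled lines.
  tables = camelot.read_pdf("tender.pdf", pages="1", flavor="lattice")
  print(tables[0].df)  # each detected table is exposed as a pandas DataFrame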

Azure Document Intelligence requires only 3 training + 3 test documents for template model creation, with the first 10 hours of neural training free.

Full LLM inference (Claude 3.5 Sonnet, GPT-4o, Gemini 2.0 and Gemini 2.5) costs $0.005-0.02 per typical invoice. It handles any format without training, adapts to changes, and can reason about context.

Production benchmarks: Claude and GPT-4o reach 92–95% accuracy on line items and 95–98% on overall invoice extraction. Claude processes in 200 to 300 milliseconds; GPT-4o takes 1 to 30 seconds depending on complexity.

Cost optimization: Prompt caching cuts the cost of repeated prompt content by 90%, and batch API processing halves costs for non-urgent workloads. With caching, 10,000 invoices a month cost $30 to $90 on Claude versus $50 to $180 on GPT-4o.
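Here's a minimal sketch of the prompt-caching pattern with the Anthropic SDK: the long, stable part of the prompt (schema plus instructions) is marked cacheable so repeat calls pay a fraction of its cost, and only the per-invoice text is billed at the full rate. The model name and prompt contents are placeholders:

  import anthropic

  client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

  EXTRACTION_INSTRUCTIONS = "..."  # long, stable schema + few-shot examples

  def extract(invoice_text: str) -> str:
      response = client.messages.create(
          model="claude-3-5-sonnet-latest",  # placeholder model name
          max_tokens=1024,
          system=[{
              "type": "text",
              "text": EXTRACTION_INSTRUCTIONS,
              # Mark the stable prefix cacheable; later calls reuse it cheaply.
              "cache_control": {"type": "ephemeral"},
          }],
          messages=[{"role": "user", "content": invoice_text}],
      )
      return response.content[0].text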

The hybrid strategy

The most effective approach routes every document through an upfront classifier, as shown in the October 2024 Hybrid OCR-LLM Framework study:

  • Standard forms (60%) → Table-based extraction (F1=1.0, 0.3s latency)
  • Semi-structured (30%) → PaddleOCR + table method (F1=0.997, 0.6s)
  • Novel formats (10%) → Multimodal LLM (F1=0.999, 34s)

Real-world impact: Asian Paints cut processing time from 5 minutes to 30 seconds per document (10 times faster), saving 192 person-hours a month and finding $47,000 in vendor overcharges.

The filename classification optimization: lightweight classifiers achieve 96.7% accuracy while running 442x faster than full content analysis, sending 80%+ of documents down fast paths before any expensive model is invoked.
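A minimal sketch of such a filename classifier, here as a character n-gram model in scikit-learn (the training filenames and labels below are made up for illustration):

  from sklearn.feature_extraction.text import TfidfVectorizer
  from sklearn.linear_model import LogisticRegression
  from sklearn.pipeline import make_pipeline

  # Hypothetical training data: filenames labeled with a document class.
  filenames = ["acme_invoice_2024_03.pdf", "po_38211_vendor.pdf",
               "contract_draft_v2.pdf"]
  labels = ["invoice", "purchase_order", "contract"]

  clf = make_pipeline(
      TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4)),  # char n-grams
      LogisticRegression(max_iter=1000),
  )
  clf.fit(filenames, labels)
  print(clf.predict(["invoice_9912.pdf"]))  # likely "invoice", via shared n-grams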

This brings the blended cost down to $1.50 per document, from $10 for pure LLM. That's an 85% cost reduction while still keeping flexibility for edge cases.

How to choose

More than 10,000 documents per month: use fine-tuned models or templates for common document types. Mistral 7B trains in 46 minutes for $1.46 on RunPod and reaches 85% of GPT-4's accuracy at one-eighth the cost.

Fewer than 10,000 documents per month: use cloud OCR services for speed to market. For custom extractors, Google offers the first 1,000 documents free, then $30 per 1,000 pages.

Accuracy critical: Template extraction with rules. Azure supports up to 500 trained models in composed architectures with incremental training on misclassified documents.

Format highly variable: LLM-based extraction. Claude 3.5 Sonnet handles 100-page PDFs up to 30 MB within a 200K-token context window, eliminating most preprocessing.
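For reference, a minimal sketch of handing a whole PDF to Claude as a base64 document block via the Messages API (the filename and model name are placeholders):

  import base64
  import anthropic

  client = anthropic.Anthropic()

  with open("contract.pdf", "rb") as f:  # placeholder filename
      pdf_b64 = base64.standard_b64encode(f.read()).decode()

  response = client.messages.create(
      model="claude-3-5-sonnet-latest",  # placeholder model name
      max_tokens=2048,
      messages=[{
          "role": "user",
          "content": [
              {"type": "document",
               "source": {"type": "base64",
                          "media_type": "application/pdf",
                          "data": pdf_b64}},
              {"type": "text",
               "text": "Extract the parties, dates, and totals as JSON."},
          ],
      }],
  )
  print(response.content[0].text)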

The winning architecture

Don't pick one approach. Route intelligently:

IF standard_form → Template (F1=1.0, 0.3s, $0.001)
ELIF semi_structured → Fine-tuned 7B (F1=0.997, 0.6s, $0.03)
ELSE → LLM fallback (F1=0.999, 34s, $10)

Blended cost: $1.50/doc vs $10 pure LLM = 85% savings
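In Python the router is only a few lines; a sketch in which classify() and the three extractor functions are hypothetical stand-ins for whatever implementations you pick:

  # All function names here are hypothetical placeholders.
  def route(document: bytes) -> dict:
      doc_class = classify(document)          # cheap upfront classifier
      if doc_class == "standard_form":
          return extract_template(document)   # F1=1.0, ~0.3s, ~$0.001
      elif doc_class == "semi_structured":
          return extract_finetuned(document)  # F1=0.997, ~0.6s, ~$0.03
      else:
          return extract_llm(document)        # F1=0.999, ~34s, ~$10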

The main point

Through smart routing, the best accounts payable (AP) departments get their cost per invoice down to $2.78, far below the industry average of $9.40. They run 78% cheaper and 82% faster than their competitors.

The market data backs this up: document extraction is projected to grow from $10.57 billion in 2025 to $66.68 billion by 2032, a 30.6% CAGR, driven by companies adopting smart routing instead of relying on expensive LLMs for everything.

Tools and Resources

Open-source PDF parsing: PyMuPDF, Camelot

Fine-tuning frameworks: Hugging Face transformers + peft for QLoRA (as in the sketch above)

Cloud platforms: Azure Document Intelligence, AWS Textract, Google Document AI

RAG frameworks:

Key research papers: the October 2024 Hybrid OCR-LLM Framework study

Official documentation: