Today I Learned

Document extraction: four main approaches with a 1000x cost difference

I looked at the four main ways to turn unstructured documents into structured data: full LLM inference, fine-tuned small models, template-based extraction, and cloud OCR services.

The cost difference is huge: template-based extraction costs $0.001 per document, while full LLM inference costs $5 to $15 per document. That's a 1000x+ difference.

Most companies waste money by treating all documents the same. Document classification upfront can cut costs by 85%+ while maintaining flexibility for edge cases.

What I learned

Cloud OCR services (Azure Document Intelligence, AWS Textract, Google Document AI) cost $1.50 per 1,000 pages for basic OCR. They're fully managed, pre-trained on common document types, and great for MVPs.

Recent benchmarks: Gemini 2.0 Pro achieved 100% item extraction accuracy at $0.0045 per invoice, while AWS and Azure cost $0.01 per invoice. Azure's asynchronous processing delivers an 85% cost saving: 30 pages cost $0.045 async versus $0.30 synchronous.

The downside: per-page costs add up quickly at volume, and Azure's custom extraction models cost $50 per 1,000 pages.
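Here's a minimal sketch of what calling one of these services looks like, using synchronous text detection in AWS Textract through boto3 (the filename is a placeholder, and AWS credentials are assumed to be configured in the environment):

  import boto3

  # Assumes credentials via env vars, ~/.aws/credentials, or an IAM role.
  textract = boto3.client("textract")

  # Synchronous OCR on a single-page image (PNG or JPEG).
  with open("invoice.png", "rb") as f:  # placeholder filename
      response = textract.detect_document_text(Document={"Bytes": f.read()})

  # Results come back as PAGE/LINE/WORD blocks; print the recognized lines.
  for block in response["Blocks"]:
      if block["BlockType"] == "LINE":
          print(block["Text"])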

Fine-tuned small models (7–8B parameter models like Llama 3.1 8B and Mistral 7B) cost $0.00368 per 1,000 tokens for inference after training.

Real benchmarks: LLaMA-3 8B achieved 76.6% accuracy without any fine-tuning, matching fine-tuned LLaMA-2 70B. After fine-tuning on just 861 samples, LLaMA-2 7B jumped from 47.6% to 61.5% accuracy, with a 47.78% reduction in hallucinations.

Cost of training: less than $2 for QLoRA on A100 GPUs (46 minutes for Mistral 7B). Inference hosting runs $288 to $530 per month on cloud GPUs. Breakeven versus GPT-4 API pricing comes at roughly 1 million documents per year.
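As a rough sketch, the QLoRA setup looks like this with the Hugging Face transformers and peft libraries (the model name and LoRA hyperparameters are illustrative defaults, not the exact configuration behind those benchmarks):

  import torch
  from transformers import AutoModelForCausalLM, BitsAndBytesConfig
  from peft import LoraConfig, get_peft_model

  # Load the base model quantized to 4-bit (the "Q" in QLoRA).
  bnb = BitsAndBytesConfig(
      load_in_4bit=True,
      bnb_4bit_quant_type="nf4",
      bnb_4bit_compute_dtype=torch.bfloat16,
  )
  model = AutoModelForCausalLM.from_pretrained(
      "mistralai/Mistral-7B-v0.1", quantization_config=bnb, device_map="auto"
  )

  # Attach small trainable LoRA adapters; the 4-bit base model stays frozen.
  lora = LoraConfig(r=16, lora_alpha=32, target_modules=["q_proj", "v_proj"],
                    task_type="CAUSAL_LM")
  model = get_peft_model(model, lora)
  model.print_trainable_parameters()  # typically well under 1% of weights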

Template-based extraction costs fractions of a cent per document, but templates must be built ahead of time. For known formats, modern tools reach F1 scores of 1.0 at sub-second latency.

PyMuPDF scored F1 between 0.983 and 0.993 on government, legal, and financial documents. Camelot handled table extraction well, with a 0.828 F1 score on complex government tenders. Processing speed: structured documents take 0.3 to 1.6 seconds, versus 33.9 seconds for multimodal LLM approaches, making templates roughly 54 times faster.
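A minimal sketch of that template path, with PyMuPDF for text and Camelot for tables (the filename and page selection are placeholders):

  import fitz  # PyMuPDF
  import camelot

  # Text extraction: deterministic and fast, no model inference involved.
  doc = fitz.open("tender.pdf")  # placeholder filename
  for page in doc:
      text = page.get_text()
      # ...match text against known field positions or regex templates here

  # Table extraction; "lattice" mode targets tables with ruled lines.
  tables = camelot.read_pdf("tender.pdf", pages="1", flavor="lattice")
  print(tables[0].df)  # each detected table is exposed as a pandas DataFrame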

Azure Document Intelligence requires only 3 training + 3 test documents for template model creation, with the first 10 hours of neural training free.

Full LLM inference (Claude 3.5 Sonnet, GPT-4o, Gemini 2.0 and Gemini 2.5) costs $0.005-0.02 per typical invoice. It handles any format without training, adapts to changes, and can reason about context.

Production benchmarks: Claude and GPT-4o reach 92–95% accuracy on line items and 95–98% on overall invoice extraction. Claude processes in 200 to 300 milliseconds; GPT-4o takes 1 to 30 seconds depending on complexity.

Cost optimization: Prompt caching cuts the cost of repeated prompt content by 90%, and batch API processing halves costs for non-urgent workloads. With caching, 10,000 invoices a month cost $30 to $90 on Claude versus $50 to $180 on GPT-4o.
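Here's a minimal sketch of the prompt-caching pattern with the Anthropic SDK: the long, stable part of the prompt (schema plus instructions) is marked cacheable so repeat calls pay a fraction of its cost, and only the per-invoice text is billed at the full rate. The model name and prompt contents are placeholders:

  import anthropic

  client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

  EXTRACTION_INSTRUCTIONS = "..."  # long, stable schema + few-shot examples

  def extract(invoice_text: str) -> str:
      response = client.messages.create(
          model="claude-3-5-sonnet-latest",  # placeholder model name
          max_tokens=1024,
          system=[{
              "type": "text",
              "text": EXTRACTION_INSTRUCTIONS,
              # Mark the stable prefix cacheable; later calls reuse it cheaply.
              "cache_control": {"type": "ephemeral"},
          }],
          messages=[{"role": "user", "content": invoice_text}],
      )
      return response.content[0].text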

The hybrid strategy

The most effective approach routes every document through an upfront classifier, as shown in the October 2024 Hybrid OCR-LLM Framework study:

  • Standard forms (60%) → Table-based extraction (F1=1.0, 0.3s latency)
  • Semi-structured (30%) → PaddleOCR + table method (F1=0.997, 0.6s)
  • Novel formats (10%) → Multimodal LLM (F1=0.999, 34s)

Real-world impact: Asian Paints cut processing time from 5 minutes to 30 seconds per document (10 times faster), saving 192 person-hours a month and finding $47,000 in vendor overcharges.

The filename classification optimization: lightweight classifiers achieve 96.7% accuracy while running 442x faster than full content analysis, sending 80%+ of documents down fast paths before any expensive model is invoked.
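A minimal sketch of such a filename classifier, here as a character n-gram model in scikit-learn (the training filenames and labels below are made up for illustration):

  from sklearn.feature_extraction.text import TfidfVectorizer
  from sklearn.linear_model import LogisticRegression
  from sklearn.pipeline import make_pipeline

  # Hypothetical training data: filenames labeled with a document class.
  filenames = ["acme_invoice_2024_03.pdf", "po_38211_vendor.pdf",
               "contract_draft_v2.pdf"]
  labels = ["invoice", "purchase_order", "contract"]

  clf = make_pipeline(
      TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4)),  # char n-grams
      LogisticRegression(max_iter=1000),
  )
  clf.fit(filenames, labels)
  print(clf.predict(["invoice_9912.pdf"]))  # likely "invoice", via shared n-grams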

This brings the blended cost down to $1.50 per document, from $10 for pure LLM. That's an 85% cost reduction while still keeping flexibility for edge cases.

How to choose

More than 10,000 documents per month: use fine-tuned models or templates for common document types. Mistral 7B trains in 46 minutes for $1.46 on RunPod and reaches 85% of GPT-4's accuracy at one-eighth the cost.

Fewer than 10,000 documents per month: use cloud OCR services for speed to market. For custom extractors, Google offers the first 1,000 documents free, then $30 per 1,000 pages.

Accuracy critical: Template extraction with rules. Azure supports up to 500 trained models in composed architectures with incremental training on misclassified documents.

Format highly variable: LLM-based extraction. Claude 3.5 Sonnet handles 100-page PDFs up to 30 MB within a 200K-token context window, eliminating most preprocessing.
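For reference, a minimal sketch of handing a whole PDF to Claude as a base64 document block via the Messages API (the filename and model name are placeholders):

  import base64
  import anthropic

  client = anthropic.Anthropic()

  with open("contract.pdf", "rb") as f:  # placeholder filename
      pdf_b64 = base64.standard_b64encode(f.read()).decode()

  response = client.messages.create(
      model="claude-3-5-sonnet-latest",  # placeholder model name
      max_tokens=2048,
      messages=[{
          "role": "user",
          "content": [
              {"type": "document",
               "source": {"type": "base64",
                          "media_type": "application/pdf",
                          "data": pdf_b64}},
              {"type": "text",
               "text": "Extract the parties, dates, and totals as JSON."},
          ],
      }],
  )
  print(response.content[0].text)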

The winning architecture

Don't pick one approach. Route intelligently:

IF standard_form → Template (F1=1.0, 0.3s, $0.001)
ELIF semi_structured → Fine-tuned 7B (F1=0.997, 0.6s, $0.03)
ELSE → LLM fallback (F1=0.999, 34s, $10)

Blended cost: $1.50/doc vs $10 pure LLM = 85% savings
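In Python the router is only a few lines; a sketch in which classify() and the three extractor functions are hypothetical stand-ins for whatever implementations you pick:

  # All function names here are hypothetical placeholders.
  def route(document: bytes) -> dict:
      doc_class = classify(document)          # cheap upfront classifier
      if doc_class == "standard_form":
          return extract_template(document)   # F1=1.0, ~0.3s, ~$0.001
      elif doc_class == "semi_structured":
          return extract_finetuned(document)  # F1=0.997, ~0.6s, ~$0.03
      else:
          return extract_llm(document)        # F1=0.999, ~34s, ~$10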

The main point

Through smart routing, the best accounts payable (AP) departments get their cost per invoice down to $2.78, far below the industry average of $9.40. They run 78% cheaper and 82% faster than their competitors.

The market data backs this up: document extraction is projected to grow from $10.57 billion in 2025 to $66.68 billion by 2032, a 30.6% CAGR, driven by companies adopting smart routing instead of relying on expensive LLMs for everything.

Tools and Resources

Open-source PDF parsing: PyMuPDF, Camelot

Fine-tuning frameworks: Hugging Face transformers + peft for QLoRA (as in the sketch above)

Cloud platforms: Azure Document Intelligence, AWS Textract, Google Document AI

RAG frameworks:

Key research papers: the October 2024 Hybrid OCR-LLM Framework study

Official documentation: