Every document processing vendor publishes an accuracy number. Most of those numbers are meaningless.
"99% accurate" could mean character accuracy on clean PDFs, field accuracy on a curated sample, or whatever other metric sounded best. If you are evaluating a platform for real work, you need a benchmark that reflects your documents and your definition of right. Here is how to build one.
The three levels of accuracy
Most published numbers conflate these. They are not the same:
Character accuracy
Percentage of characters recognized correctly. Useful for pure OCR — converting a page of text into a string. Almost useless for structured extraction. A 99% character accuracy can mean a 0% invoice‑total accuracy if every total has one wrong digit.
Field accuracy
Percentage of extracted fields that exactly match ground truth. This is what most business workflows care about. A field is right or it is wrong. "Invoice total = $4,231.17" either matches the ground truth value or it does not.
Document accuracy
Percentage of documents where every extracted field is correct. Much harsher than field accuracy. If a document has 15 fields and you get 14 right, field accuracy is 93% but document accuracy is 0% for that doc.
For production workflows with auto‑approve, document accuracy is what determines how many docs actually skip human review. Plan accordingly.
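To make the gap between the three levels concrete, here is a minimal sketch on two made-up invoices. The data and the helper name `char_matches` are illustrative assumptions, and the position-by-position character comparison is a simplification (real OCR scoring usually uses edit distance).

```python
# Toy ground truth and extraction results for two invoices (illustrative data only).
ground_truth = {
    "001": {"vendor": "Acme Corp", "total": "4231.17"},
    "002": {"vendor": "Beta LLC", "total": "1250.00"},
}
extracted = {
    "001": {"vendor": "Acme Corp", "total": "4231.77"},  # one wrong digit
    "002": {"vendor": "Beta LLC", "total": "1250.00"},
}

def char_matches(truth: str, guess: str) -> int:
    # Position-wise comparison; a simplification that assumes equal-length strings.
    return sum(t == g for t, g in zip(truth, guess))

# Flatten to (truth, extracted) pairs, one per field per document.
pairs = [
    (value, extracted[doc_id].get(field, ""))
    for doc_id, fields in ground_truth.items()
    for field, value in fields.items()
]

char_accuracy = sum(char_matches(t, g) for t, g in pairs) / sum(len(t) for t, _ in pairs)
field_accuracy = sum(t == g for t, g in pairs) / len(pairs)
document_accuracy = sum(
    all(extracted[doc_id].get(field) == value for field, value in fields.items())
    for doc_id, fields in ground_truth.items()
) / len(ground_truth)

print(f"character: {char_accuracy:.1%}, field: {field_accuracy:.1%}, document: {document_accuracy:.1%}")
```

One wrong digit out of 31 characters leaves character accuracy near 97%, while field accuracy drops to 75% and document accuracy to 50%.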
Setting up a benchmark
1. Pick a representative sample
Not your cleanest documents. Not your worst. A stratified sample that reflects your real workload:
- Mix of vendors / sources / layouts.
- Mix of quality (clean PDFs, scans, photos, faxes).
- Mix of document types if your pipeline handles more than one.
- Include the edge cases you care about (handwriting, foreign languages, low contrast).
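The stratification above can be sketched roughly as follows. The function name `stratified_sample`, the `quality` field, and the proportional allocation with a floor of one document per stratum are assumptions for illustration, not a prescribed method.

```python
import random
from collections import defaultdict

def stratified_sample(docs, key, n, seed=0):
    """Draw about n documents, allocating to each stratum in proportion to its size."""
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    strata = defaultdict(list)
    for doc in docs:
        strata[key(doc)].append(doc)
    sample = []
    for members in strata.values():
        # Keep at least one document per stratum so rare layouts are not dropped.
        k = max(1, round(n * len(members) / len(docs)))
        sample.extend(rng.sample(members, min(k, len(members))))
    return sample

# Hypothetical workload: a quarter of the documents are scans, the rest clean PDFs.
workload = [{"id": i, "quality": "scan" if i % 4 == 0 else "clean"} for i in range(200)]
picked = stratified_sample(workload, key=lambda d: d["quality"], n=20)
print(len(picked), sum(d["quality"] == "scan" for d in picked))
```

In practice the key would combine several of the dimensions listed above, for example `(vendor, quality, doc_type)`.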
A sample of 100 documents is the practical minimum; 300 is better. With fewer than 50, sampling variance will swamp your signal.
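To see why small samples are noisy, a rough sketch using the Wilson score interval for a proportion. This is a standard statistical approximation that I am bringing in, not something specific to any platform, and it assumes documents are sampled independently.

```python
import math

def wilson_interval(correct: int, n: int, z: float = 1.96):
    """Approximate 95% confidence interval for a proportion (Wilson score interval)."""
    p = correct / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return centre - half, centre + half

# Same observed 90% rate, three different sample sizes.
for n in (30, 100, 300):
    lo, hi = wilson_interval(round(0.9 * n), n)
    print(f"n={n}: an observed 90% is consistent with roughly {lo:.1%} to {hi:.1%}")
```

The interval narrows as the sample grows, which is the practical argument for 300 documents over 100.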
2. Define ground truth
For each document, record the correct value for every field you care about. This is tedious and unavoidable. If you cannot get a human to write down what the answer should be, you cannot measure accuracy.
Store ground truth in a simple structured format — CSV or JSONL works fine:
{"doc_id": "001", "vendor": "Acme Corp", "invoice_number": "INV-4823", "date": "2026-03-15", "total": 4231.17}
{"doc_id": "002", "vendor": "Beta LLC", "invoice_number": "B-2024-0912", "date": "2026-03-14", "total": 1250.00}
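A minimal sketch of loading that JSONL into memory, keyed by `doc_id`. The function name `load_ground_truth` is an assumption; parsing amounts as `Decimal` rather than float is my choice, made so that money comparisons later are exact.

```python
import json
from decimal import Decimal

def load_ground_truth(lines):
    """Parse JSONL lines into {doc_id: {field: value}}; blank lines are skipped."""
    truth = {}
    for line in lines:
        if line.strip():
            # parse_float=Decimal keeps amounts like 4231.17 exact.
            record = json.loads(line, parse_float=Decimal)
            truth[record.pop("doc_id")] = record
    return truth

sample = [
    '{"doc_id": "001", "vendor": "Acme Corp", "invoice_number": "INV-4823", "date": "2026-03-15", "total": 4231.17}',
    '{"doc_id": "002", "vendor": "Beta LLC", "invoice_number": "B-2024-0912", "date": "2026-03-14", "total": 1250.00}',
]
truth = load_ground_truth(sample)
print(truth["001"]["total"])
```

The function accepts any iterable of lines, so an open file handle works in place of the list.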
3. Run extraction
Process every document in your sample through each platform you are evaluating. Save the raw outputs. Do not eyeball and summarize — keep the full output so you can recompute metrics later.
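One way to keep the full outputs is to write one JSON line per document as you go. In this sketch, `extract` is a hypothetical callable standing in for whatever client each platform provides, and `fake_extract` and the file names exist only so the example runs.

```python
import json
import os
import tempfile
from pathlib import Path

def run_and_save(doc_paths, extract, out_path):
    """Run `extract` on each document and write the raw result as one JSON line."""
    with open(out_path, "w", encoding="utf-8") as out:
        for path in doc_paths:
            raw = extract(path)  # hypothetical wrapper around one platform's API
            out.write(json.dumps({"doc_id": Path(path).stem, "raw": raw}) + "\n")

def fake_extract(path):
    # Stand-in for a real platform call; returns a made-up response.
    return {"total": "4231.17", "confidence": 0.93}

out_file = os.path.join(tempfile.mkdtemp(), "platform_a.jsonl")
run_and_save(["invoices/001.pdf", "invoices/002.pdf"], fake_extract, out_file)

with open(out_file, encoding="utf-8") as f:
    saved = [json.loads(line) for line in f]
print(saved[0])
```

Keeping one file per platform makes it easy to rescore later when you change a matching rule.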
4. Score it
For each field on each document, compare extraction to ground truth. Common matching rules:
- Strings (vendor name, invoice number): exact match, case‑insensitive, whitespace‑normalized.
- Dates: parse both to a canonical format, compare.
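- Currency: parse to a decimal number and compare to the cent. If you must use floats, allow a tolerance smaller than one cent (e.g. $0.005) so that floating‑point noise passes but a one‑cent discrepancy still counts as an error.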
- Enums (country, currency): exact match.
- Tables / line items: row‑level matching — for each ground‑truth row, find the best extracted row; penalize missing and extra rows.
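The matching rules above could look something like this. All function names, the two date formats, and the strict less-than on the tolerance (so a full one-cent difference still fails) are assumptions of this sketch. The row matcher is greedy and takes the first matching row rather than searching for the best one, which is a simplification.

```python
from datetime import datetime
from decimal import Decimal, InvalidOperation

def match_string(truth: str, guess: str) -> bool:
    """Case-insensitive, whitespace-normalized exact match."""
    def norm(s):
        return " ".join(s.split()).casefold()
    return norm(truth) == norm(guess)

DATE_FORMATS = ("%Y-%m-%d", "%m/%d/%Y")  # assumed formats; extend for your data

def parse_date(text: str):
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt).date()
        except ValueError:
            continue
    return None  # unparseable

def match_date(truth: str, guess: str) -> bool:
    parsed = parse_date(guess)
    return parsed is not None and parsed == parse_date(truth)

def parse_amount(text: str) -> Decimal:
    return Decimal(text.replace("$", "").replace(",", "").strip())

def match_currency(truth: str, guess: str, tolerance=Decimal("0.01")) -> bool:
    try:
        return abs(parse_amount(truth) - parse_amount(guess)) < tolerance
    except InvalidOperation:
        return False  # the extracted text was not a number at all

def match_rows(truth_rows, guess_rows, same_row):
    """Greedy row matching; returns (matched, missing, extra) counts."""
    unused = list(guess_rows)
    matched = 0
    for row in truth_rows:
        for i, candidate in enumerate(unused):
            if same_row(row, candidate):
                del unused[i]  # each extracted row can satisfy only one truth row
                matched += 1
                break
    return matched, len(truth_rows) - matched, len(unused)

print(match_string("Acme  Corp", "acme corp "), match_date("2026-03-15", "03/15/2026"))
print(match_currency("4231.17", "$4,231.17"), match_currency("4231.17", "4231.18"))
```

Whatever rules you settle on, write them down and apply the same ones to every platform you compare.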
The metrics to report
Go beyond a single "accuracy %":
Field‑level metrics
For each field type:
- Precision = (correct extractions) / (total extractions). How often is the system right when it returns a value?
- Recall = (correct extractions) / (total ground‑truth values). How often does the system find a value when there is one?
- F1 = harmonic mean of precision and recall.
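As a sketch, the three formulas applied to one field across a handful of documents. The data is made up, `None` stands for "no value", and plain equality stands in for whatever matching rule you use for that field.

```python
def precision_recall_f1(pairs):
    """pairs: (truth, extracted) tuples for one field; None means 'no value'."""
    extracted = sum(1 for _, e in pairs if e is not None)
    expected = sum(1 for t, _ in pairs if t is not None)
    correct = sum(1 for t, e in pairs if t is not None and e == t)
    precision = correct / extracted if extracted else 0.0
    recall = correct / expected if expected else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

invoice_numbers = [
    ("INV-4823", "INV-4823"),        # correct
    ("B-2024-0912", "B-2024-0917"),  # wrong value
    ("X-1", None),                   # missed
    (None, "INV-0001"),              # returned a value where none exists
    ("Y-2", "Y-2"),                  # correct
]
precision, recall, f1 = precision_recall_f1(invoice_numbers)
print(precision, recall, f1)
```

Here 2 of 4 returned values are correct and 2 of 4 expected values are found, so precision, recall, and F1 all come out at 0.5.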
Some fields are more forgiving than others. An invoice number wrong by one character is wrong. A vendor name with "Inc." vs. "Inc" may or may not matter depending on downstream matching.
Document‑level metrics
- Perfect document rate. % of documents where every field is correct.
- Auto‑approve eligible rate. % of documents that would auto‑approve given a configurable threshold.
- Manual correction time. Average seconds to correct errors on non‑perfect documents.
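A sketch of the three document-level metrics. The record layout is hypothetical, and treating a document as auto-approve eligible when its lowest field confidence clears the threshold is one possible rule, assumed here for illustration.

```python
# Hypothetical per-document results after scoring against ground truth.
docs = [
    {"doc_id": "001", "all_correct": True,  "min_confidence": 0.97, "correction_seconds": 0},
    {"doc_id": "002", "all_correct": False, "min_confidence": 0.62, "correction_seconds": 45},
    {"doc_id": "003", "all_correct": True,  "min_confidence": 0.88, "correction_seconds": 0},
    {"doc_id": "004", "all_correct": False, "min_confidence": 0.93, "correction_seconds": 75},
]

def document_metrics(docs, threshold=0.90):
    imperfect = [d for d in docs if not d["all_correct"]]
    # Assumed rule: a document is eligible if its least confident field clears the threshold.
    eligible = [d for d in docs if d["min_confidence"] >= threshold]
    return {
        "perfect_document_rate": (len(docs) - len(imperfect)) / len(docs),
        "auto_approve_eligible_rate": len(eligible) / len(docs),
        "mean_correction_seconds": (
            sum(d["correction_seconds"] for d in imperfect) / len(imperfect)
            if imperfect else 0.0
        ),
    }

metrics = document_metrics(docs)
print(metrics)
```

Note that document 004 is eligible for auto-approval but not actually perfect, which is exactly the case worth counting separately.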
Error breakdown
For every error, categorize it:
- Value error — wrong value extracted.
- Missing value — field should have a value, nothing extracted.
- Hallucinated value — field should be empty, something extracted.
- Format error — correct value, wrong format (e.g. "03/15/2026" vs "2026-03-15").
Format errors are easy to fix with a post‑processor. Value errors matter more. Hallucinations are a red flag.
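The four categories can be assigned mechanically once you have a normalizer per field. A minimal sketch, where `categorize` and `normalize_date` are illustrative names and the two date formats are assumptions:

```python
from collections import Counter
from datetime import datetime

def normalize_date(text):
    """Parse a date in one of two assumed formats; return the text unchanged otherwise."""
    for fmt in ("%Y-%m-%d", "%m/%d/%Y"):
        try:
            return datetime.strptime(text, fmt).date()
        except ValueError:
            pass
    return text

def categorize(truth, extracted, normalize=lambda v: v):
    """Return None for a correct field, otherwise one of the four error categories."""
    if truth is None and extracted is None:
        return None
    if truth is None:
        return "hallucinated_value"
    if extracted is None:
        return "missing_value"
    if extracted == truth:
        return None
    if normalize(extracted) == normalize(truth):
        return "format_error"  # same value once both sides are normalized
    return "value_error"

results = [
    ("2026-03-15", "03/15/2026"),  # right date, wrong format
    ("2026-03-14", "2026-03-14"),  # correct
    ("2026-03-10", None),          # nothing extracted
    (None, "2026-01-01"),          # value where none should exist
    ("2026-03-01", "2026-03-07"),  # wrong date
]
categories = [categorize(t, e, normalize_date) for t, e in results]
breakdown = Counter(c for c in categories if c is not None)
print(breakdown)
```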
Heads up
Hallucination rate is the single most important metric for AI‑based extraction, and it is missing from most vendor benchmarks. Measure it for any platform built on foundation models. If a field's ground truth is "no value" and the platform returns something, that is a hallucination. Count them.
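Counting them takes only a few lines. In this sketch the denominator is the set of fields whose ground truth is empty, which is one reasonable definition but an assumption on my part; the helper names and the PO-number data are made up.

```python
def is_empty(value) -> bool:
    """Treat None and blank strings as 'no value'."""
    return value is None or str(value).strip() == ""

def hallucination_rate(pairs):
    """Share of should-be-empty fields where the platform returned a value anyway."""
    should_be_empty = [extracted for truth, extracted in pairs if is_empty(truth)]
    if not should_be_empty:
        return None  # nothing to measure against
    hallucinated = sum(1 for extracted in should_be_empty if not is_empty(extracted))
    return hallucinated / len(should_be_empty)

# Hypothetical PO-number field: most invoices in this sample have no PO number.
po_numbers = [
    (None, None),
    (None, "PO-7731"),       # hallucinated
    (None, ""),              # blank output counts as empty
    ("PO-1002", "PO-1002"),  # ground truth has a value, so not part of the denominator
    (None, None),
]
rate = hallucination_rate(po_numbers)
print(rate)
```

For this to work, the benchmark sample must include documents where fields are legitimately empty, and the ground truth must record that explicitly.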
Reading accuracy numbers
When a vendor says "99% accurate," the questions to ask:
- Measured on what sample? (Their benchmark or yours?)
- What is the unit? (Character? Field? Document?)
- What counts as correct? (Exact match? Fuzzy?)
- What is the hallucination rate?
- How does accuracy decay with document quality? (Clean vs. scanned vs. handwritten.)
If the vendor cannot answer those quickly, the number is not reliable.
Building an ongoing benchmark
For a production pipeline, one‑time benchmarking is not enough. Set up an ongoing evaluation:
- Sample 1–2% of production documents weekly; human‑review them as ground truth.
- Track field accuracy weekly, by document type.
- Alert on regressions — if accuracy on a document type drops below baseline, someone should know.
- Re‑benchmark whenever the AI model is upgraded.
This is not extra work. It is the same review a decent operations team is already doing; it just needs to land in a place where you can look at trends.
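The regression alert can start as a plain comparison against a stored baseline. A minimal sketch, where the function name, the one-point tolerance, and the accuracy figures are all illustrative assumptions:

```python
def find_regressions(weekly, baseline, tolerance=0.01):
    """Return {doc_type: (baseline, current)} for types that fell below baseline - tolerance."""
    return {
        doc_type: (baseline[doc_type], accuracy)
        for doc_type, accuracy in weekly.items()
        if doc_type in baseline and accuracy < baseline[doc_type] - tolerance
    }

# Hypothetical field accuracy by document type.
baseline = {"invoice": 0.97, "receipt": 0.94}
this_week = {"invoice": 0.965, "receipt": 0.90}

alerts = find_regressions(this_week, baseline)
for doc_type, (expected, actual) in alerts.items():
    print(f"ALERT {doc_type}: field accuracy {actual:.1%} vs baseline {expected:.1%}")
```

With a 1–2% weekly sample the numbers will wobble, so the tolerance should be wide enough to absorb normal sampling noise.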
For more on how confidence scoring connects to these metrics, see "When to trust AI output: auto‑approve and confidence thresholds."
Accuracy is a configuration, not a fixed number
A non‑obvious point: for any AI document platform, accuracy is largely a configuration. You tune it by:
- Picking the right AI model for the document type.
- Defining validation rules.
- Setting confidence thresholds.
- Designing the review queue so that errors get caught.
The same document processed by a strong model such as GPT‑5, with validation and review, will be much more accurate end‑to‑end than one processed by a cheap model with no review. Do not evaluate a platform at its default settings; evaluate it at the configuration you would actually run.
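One way to see accuracy as a configuration is to sweep the confidence threshold over your benchmark results and watch the trade-off. This sketch assumes each document has a single confidence score and a known all-fields-correct flag; the data is invented.

```python
# Hypothetical benchmark results: (document confidence, every field correct?).
scored_docs = [
    (0.99, True), (0.97, True), (0.95, False), (0.92, True),
    (0.88, True), (0.80, False), (0.75, True), (0.60, False),
]

def sweep(docs, thresholds):
    """For each threshold, return (threshold, auto-approve rate, error rate among approved)."""
    rows = []
    for t in thresholds:
        approved = [ok for confidence, ok in docs if confidence >= t]
        wrong = sum(1 for ok in approved if not ok)
        rows.append((
            t,
            len(approved) / len(docs),
            wrong / len(approved) if approved else 0.0,
        ))
    return rows

results = sweep(scored_docs, [0.70, 0.90, 0.96])
for threshold, approve_rate, error_rate in results:
    print(f"threshold {threshold:.2f}: auto-approve {approve_rate:.0%}, errors among approved {error_rate:.0%}")
```

Raising the threshold sends more documents to review but lets fewer bad ones through, so pick the point your downstream process can tolerate and benchmark at that setting.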
The honest PaperAI numbers
Internal benchmarking on mixed real‑world document samples:
- Field accuracy on clean PDFs (printed invoices, forms): typically 97–99%.
- Field accuracy on scanned documents: typically 94–98%.
- Field accuracy on handwriting: varies widely — 75–95% depending on legibility and model.
- Auto‑approve rate at default thresholds on well‑configured Flows: typically 50–70%.
Your documents will differ. That is why we encourage every team to start free and run the benchmark on their own data — it takes 1–2 hours and gives you real numbers you can plan around.
Five rules that will save you from a bad procurement decision
- Character accuracy, field accuracy, and document accuracy are three different units. Know which one the vendor is quoting.
- Build your own 100–300 document benchmark with ground truth. It takes an afternoon and is more useful than any vendor white paper.
- Report precision, recall, F1, and hallucination rate. Not a single number.
- Re‑benchmark every time the model is upgraded or a Flow changes. Regressions do happen.
- Accuracy is a configuration. Evaluate platforms at the settings you would actually run, not their defaults.
Document AI is mature enough to be production‑ready. The teams that get value out of it measure what they deploy. The teams that skip the measurement end up back on spreadsheets by Q3.