We continuously test large language models on business automation tasks. Our AI model benchmark is based on datasets of digital documents in various layouts and languages, representative of the documents processed in real projects.
We test how well AI models extract data from complex documents by assessing the accuracy and completeness of data detection.
How accurately an AI model detects and extracts data from a document, like field titles and values, document layout, text and character blocks.
How long it takes a model to process one document on average.
The processing cost per 1000 pages and any additional costs.
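As an illustration of the three criteria above, here is a minimal sketch of how they could be computed from per-document runs. The exact scoring rules of the benchmark are not stated here, so the exact-match field comparison, the `runs` record layout, and the names `field_accuracy` and `summarize` are assumptions made for this example only.

```python
from statistics import mean


def field_accuracy(expected: dict, extracted: dict) -> float:
    """Share of expected fields whose extracted value matches.

    Assumption: a field counts as correct on an exact match after
    trimming whitespace and ignoring case.
    """
    if not expected:
        return 0.0
    hits = sum(
        1
        for name, value in expected.items()
        if extracted.get(name, "").strip().lower() == value.strip().lower()
    )
    return hits / len(expected)


def summarize(runs: list) -> dict:
    """Aggregate per-document runs into accuracy, speed, and cost."""
    total_pages = sum(r["pages"] for r in runs)
    total_cost = sum(r["cost_usd"] for r in runs)
    return {
        "accuracy": mean(field_accuracy(r["expected"], r["extracted"]) for r in runs),
        "seconds_per_document": mean(r["seconds"] for r in runs),
        "cost_per_1000_pages": 1000 * total_cost / total_pages,
    }


# Hypothetical sample data: two documents, the second with one wrong field.
runs = [
    {
        "expected": {"invoice_no": "A-17", "total": "120.00"},
        "extracted": {"invoice_no": "A-17", "total": "120.00"},
        "seconds": 4.0, "cost_usd": 0.01, "pages": 1,
    },
    {
        "expected": {"invoice_no": "B-02", "total": "75.50"},
        "extracted": {"invoice_no": "B-02", "total": "57.50"},
        "seconds": 6.0, "cost_usd": 0.03, "pages": 3,
    },
]
summary = summarize(runs)
print(summary)
```

A stricter or looser matching rule (for example, numeric tolerance on totals) would change the accuracy figure, so results are only comparable when the same rule is applied to every model.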
| Model | Average model accuracy | Average cost per 1000 forms | Average processing time per form, s |
|---|---|---|---|
| GPT 5 Mini | 88.19% | $5.06 | 32.179 |
| Gemini 2.5 Flash Lite | 87.29% | $0.37 | 5.484 |
| AWS | 72.61% | $65 | 4.845 |
| Claude Sonnet | 70.34% | $18.7 | 15.488 |
| Azure | 67.52% | $10 | 6.588 |
| | 50.69% | $30 | 5.633 |
| Grok 4 | 22.74% | $11.5 | 129.257 |
| Service | Table Extraction Accuracy | Processing duration per page, s | Cost per 1000 pages |
|---|---|---|---|
| | 81.5% | 4.3 ± 0.2 | $10 |
| | 82.1% | 2.9 ± 0.2 | $15 |
| | 94.2% | 47.4 ± 15.7 | $58 |
| | 38.5% | 16.9 ± 1.9 | $19 |
| Grok 2 vision 1212 | Failed | — | — |
| Pixtral large latest | Failed | — | — |
| Google Layout Parser | Failed | — | — |
| Model | Hit Rate@5 | MRR | nDCG@5 | Cost per RFQ |
|---|---|---|---|---|
| GPT 5 | 0.759 | 0.633 | 0.665 | $1.2907 |
| GPT 5 Nano | 0.826 | 0.609 | 0.663 | $0.0849 |
| Gemini 2.5 Pro | 0.767 | 0.637 | 0.670 | $1.0325 |
| Grok 4 | 0.771 | 0.594 | 0.637 | $0.0461 |
| Claude Sonnet 4 | 0.800 | 0.581 | 0.637 | $1.6369 |
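For readers unfamiliar with the ranking metrics in the table above, here is a minimal sketch of their standard textbook definitions for a single query. How the benchmark itself computes them is not described here, so binary relevance (an item is either relevant or not) and the function names are assumptions. The reported figures would then be averages of these per-query values over all queries.

```python
import math


def hit_rate_at_k(ranked: list, relevant: set, k: int = 5) -> float:
    """1.0 if at least one relevant item appears in the top k, else 0.0."""
    return 1.0 if any(item in relevant for item in ranked[:k]) else 0.0


def reciprocal_rank(ranked: list, relevant: set) -> float:
    """1 / rank of the first relevant item, or 0.0 if none was returned."""
    for rank, item in enumerate(ranked, start=1):
        if item in relevant:
            return 1.0 / rank
    return 0.0


def ndcg_at_k(ranked: list, relevant: set, k: int = 5) -> float:
    """Discounted gain of the top k, normalized by the best possible ordering."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, item in enumerate(ranked[:k], start=1)
        if item in relevant
    )
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg else 0.0


# Hypothetical query: the model returned five candidates, two are correct.
ranked = ["p7", "p2", "p9", "p4", "p1"]
relevant = {"p2", "p1"}

hit = hit_rate_at_k(ranked, relevant)
rr = reciprocal_rank(ranked, relevant)   # averaged over queries this gives MRR
ndcg = ndcg_at_k(ranked, relevant)
print(hit, rr, round(ndcg, 3))
```

Hit Rate@5 only checks whether a correct item made the top five, while MRR and nDCG@5 also reward placing it higher, which is why a model can lead on one metric and trail on another.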
| Service | Data Extraction Efficiency | Processing duration per page, s | Cost per 1000 pages |
|---|---|---|---|
| | 77.34% | 77.5 | $30.5 |
| | 79.96% | 91.4 | $130.4 |
| | 39.59% | 41.75 | $24.9 |
| | 20.38% | 163 | $239.2 |
| | 40.49% | 64.8 | $312 |
| | 7.64 | 22 | $1.59 |
We analysed the 7 most popular AI document detection models to test how well they work out of the box on a set of digital invoices in various layouts and languages.
| Service | Invoice Detection Accuracy Without Items | Invoice Detection Accuracy With Items | Processing duration per page, s | Cost per 1000 pages |
|---|---|---|---|---|
| | 85.8% | 85.7% | 4.3 ± 0.2 | $10 |
| GPT-4o using third-party OCR (Prebuilt Layout model by Azure AI) | 90.8% | 86.5% | 33.0 ± 2.3 | $8.8 ¹ |
| | 88.3% | 89.2% | 16.9 ± 1.9 | $8.8 |
| | 83.8% | 68.1% | 3.8 ± 0.2 | $10 |
| | 91.3% | 91.1% | 2.9 ± 0.2 | $10 ² |
| Gemini 2.0 Pro | 90% | 90.2% | 8 ± 1.5 | $4.5 ³ |
| DeepSeek v3 API (Prebuilt Layout model by Azure AI) | 93.3% | 88.1% | 69 | $11 |
| Unified Approach | ~99% | ~97% | ~15 | ~$30 |
To achieve high accuracy in extracting data from invoices, we combine multiple large language models (LLMs). Matching algorithms compare the outputs of the models, and the final result is selected by majority vote.
This ensemble approach lets us leverage the strengths of each LLM, providing robust and scalable invoice data extraction for real-world business needs.
As a result, the average extraction accuracy increased from 85% to 97%.
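The majority-vote step described above can be sketched roughly as follows. This is not the production implementation: the matching algorithms are not specified here, so the simple `normalize` function stands in for them, and the field names and the strict-majority rule are assumptions made for illustration.

```python
from collections import Counter
from typing import Optional


def normalize(value: str) -> str:
    """Stand-in for the matching step: ignore case and extra whitespace."""
    return " ".join(value.split()).lower()


def majority_vote(outputs: list) -> dict:
    """For each field, keep the value returned by more than half of the models.

    Fields without a strict majority are set to None so they can be
    routed to manual review instead of being guessed.
    """
    fields = sorted({name for output in outputs for name in output})
    result: dict = {}
    for name in fields:
        votes: Counter = Counter()
        first_seen: dict = {}
        for output in outputs:
            if name in output:
                key = normalize(output[name])
                votes[key] += 1
                # Remember the first raw spelling of each normalized value.
                first_seen.setdefault(key, output[name])
        key, count = votes.most_common(1)[0]
        winner: Optional[str] = first_seen[key] if count > len(outputs) / 2 else None
        result[name] = winner
    return result


# Hypothetical outputs of three models for the same invoice.
outputs = [
    {"invoice_no": "INV-001", "total": "120.00"},
    {"invoice_no": "inv-001", "total": "120.00"},
    {"invoice_no": "INV-007", "total": "120.00", "currency": "EUR"},
]
final = majority_vote(outputs)
print(final)
```

With an odd number of models a strict majority avoids ties on two-way disagreements. Values that only one model returns, like `currency` here, are left unresolved rather than accepted.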
Do you want to know the total cost of developing and implementing your project? Tell us about your requirements, and our specialists will contact you as soon as possible.