OCR Benchmarks 2025: The Best Open Source Models in a Practical Test
Published February 8, 2026
Introduction
After we explained in the first part of our series how LLM-based OCR fundamentally differs from classic methods, and in the second part examined the technical implementation, we now turn to the crucial question of model selection. The market for open-source models is moving rapidly, and the choice of the right "engine" significantly determines the quality and efficiency of the pipeline.
For this benchmark, we pitted a selection of the currently most promising models against each other in a variety of test cases: PaddleOCR-VL, MinerU, Qwen3-VL-32B, Dots.OCR, DeepSeek OCR and HunyuanOCR.
Methodology and Datasets
To analyze the performance in a differentiated way, we rely on two pillars:
- Qualitative Analysis (Use Cases): We revisit the realistic examples from our first article, from historical documents to complex tables. To test the robustness of the models, we partially modified these documents (sharpened or converted to grayscale), which is noted in the results as sharp or GS.
- Quantitative Analysis (Kaggle Dataset): Additionally, we used parts of an established test dataset from Kaggle.
An important methodological limitation must be noted here: The ground truth (the correctly defined solution) of this dataset is available as plain text. However, our modern VLM approaches are trained to also extract structural information (like Markdown or HTML). A simple character-by-character comparison would therefore lead to distorted results, as the models inevitably produce "longer" text than intended in the ground truth due to the added structure tags.
We will discuss this discrepancy and our handling of it in detail in the evaluation.
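One way of reducing this distortion is to strip structural markup from the model output before comparing it against the plain-text ground truth. The sketch below shows a simplified normalization; the regex rules are illustrative assumptions, not the exact normalization used in our benchmark.

```python
import re

def strip_markup(text: str) -> str:
    """Remove common Markdown/HTML structure so model output can be
    compared against a plain-text ground truth on a more equal footing."""
    text = re.sub(r"<[^>]+>", " ", text)      # drop HTML tags
    text = re.sub(r"[#*_`|>-]+", " ", text)   # drop Markdown syntax characters
    text = re.sub(r"\s+", " ", text)          # collapse whitespace
    return text.strip()

# A table rendered as Markdown collapses to its plain-text content:
print(strip_markup("# Invoice\n| Item | Price |\n|---|---|\n| Pen | **2** |"))
```

A real pipeline would likely use a proper Markdown/HTML parser instead of regexes, but the principle of comparing on normalized text is the same.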
Technical Indicators
Before we dive into the results, let's take a look at the technical side of the models.
💡Reading Guide: How to Interpret the Boxplots
To show not only the pure performance but also the reliability of the models, we visualize the data as boxplots. Here's what you need to know:
- The Box (The Core): It represents the middle 50% of the results. The narrower the box, the more consistent the model's performance.
- The Line (Median): The dividing line in the box is the median. It divides the results exactly into a better and a worse half. It is more meaningful than the average because it is not distorted by extremely good or bad individual values.
- The Whiskers: The lines extending upwards and downwards show the range of "normal" dispersion, typically reaching up to 1.5 times the interquartile range beyond the box.
- The Dots (Outliers): Individual points outside the whiskers are documents where the model performed unusually poorly (or well) – so-called "hallucinations" or total failures.
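To illustrate why we plot the median rather than the mean, the following sketch uses made-up accuracy scores for a hypothetical model that is mostly consistent but fails completely on one document; only Python's standard library is used.

```python
import statistics

# Made-up per-document scores: consistent performance plus one total failure.
scores = [0.91, 0.93, 0.90, 0.94, 0.92, 0.89, 0.12]

mean = statistics.mean(scores)
median = statistics.median(scores)
q1, q2, q3 = statistics.quantiles(scores, n=4)  # box edges (Q1, Q3) and median

# The single outlier drags the mean down, while the median stays stable.
print(f"mean={mean:.2f} median={median:.2f} box=[{q1:.2f}, {q3:.2f}]")
```

The outlier pulls the mean down to about 0.80, while the median remains at 0.91, which is exactly the robustness the boxplots exploit.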
Throughput, Inference Times & Token Usage
Figure 1: Throughput of the different OCR models.
The throughput (Figure 1) indicates how many tokens per second can be processed. A high value means that the model works faster and is therefore better suited for high-volume applications. In this metric, PaddleOCR-VL is ahead, followed by MinerU and Dots.OCR. Bringing up the rear is Qwen3-VL-32B. As expected, the results correlate with model size, i.e., the number of parameters. While the OCR-specialized models have between 0.9 billion and 1.7 billion parameters and therefore operate significantly faster, the 32-billion-parameter variant of the general Qwen VL model was deliberately included in the comparison as a reference.
Figure 2: Token usage of the different OCR models.
In our benchmark, we capped generation at 1024 tokens. Token usage (Figure 2) therefore shows how efficiently the models use the available budget. Models that require fewer tokens are generally more resource-efficient and can work faster. Dots.OCR and PaddleOCR-VL tend to use the fewest tokens, which indicates more efficient processing.
Figure 3: Inference times of the different OCR models.
The inference time (Figure 3) follows from the two previous metrics: it is the time a model needs to process an input. Shorter inference times are particularly advantageous in real-time applications. PaddleOCR-VL and MinerU are the fastest models, while Qwen3-VL-32B has by far the longest processing times.
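The relationship between the three metrics can be sketched in a few lines. The token counts and throughput figures below are purely illustrative, not measurements from our benchmark.

```python
def inference_time(tokens_generated: int, tokens_per_second: float) -> float:
    """Back-of-the-envelope relation:
    inference time ~= generated tokens / throughput (tokens per second)."""
    return tokens_generated / tokens_per_second

# A small, fast OCR model vs. a large general VLM (hypothetical figures):
fast = inference_time(tokens_generated=600, tokens_per_second=120.0)
large = inference_time(tokens_generated=900, tokens_per_second=15.0)
print(f"fast model: {fast:.1f}s, large model: {large:.1f}s")
```

This also shows why token efficiency matters twice over: a model that emits fewer tokens finishes sooner even at identical throughput.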
OCR Benchmarks
Text Length, Character Accuracy & Semantic Accuracy
Figure 4: Text length of the different OCR models.
The text length metric (Figure 4) indicates how much text the models generate on average. Ideally, the generated length should match the length of the ground truth. As mentioned at the beginning, however, the ground truth contains no structural information (Markdown/HTML), so the models are expected to produce somewhat longer texts. Nevertheless, potential sources of error can be read from this metric: models that consistently generate significantly longer texts may add unnecessary or incorrect information, while models with significantly shorter texts may overlook important details. Here too, PaddleOCR-VL and Dots.OCR are in the lead.
Figure 5: Character accuracy of the different OCR models.
Character accuracy (Figure 5) measures how closely the generated texts match the ground truth character by character. Again, it must be taken into account that the ground truth contains no structural information, which can depress the agreement values. Nevertheless, this metric provides valuable insight into the precision of the models. The clear winner in this discipline is Dots.OCR, followed by PaddleOCR-VL, MinerU and HunyuanOCR, which all perform similarly. DeepSeek OCR takes fifth place, while Qwen3-VL-32B is again clearly at the lower end of the scale.
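As a rough sketch, character accuracy can be approximated with a similarity ratio over raw characters. The use of `difflib.SequenceMatcher` here is an assumption for illustration; metrics such as CER based on Levenshtein distance are equally common.

```python
from difflib import SequenceMatcher

def character_accuracy(prediction: str, ground_truth: str) -> float:
    """Similarity ratio in [0, 1] over raw characters.
    One of several reasonable definitions of character accuracy."""
    return SequenceMatcher(None, prediction, ground_truth).ratio()

# Two typical OCR confusions: 'l' vs '1' and '.' vs ','.
exact = character_accuracy("Total: 42.00 EUR", "Total: 42.00 EUR")
noisy = character_accuracy("Tota1: 42,00 EUR", "Total: 42.00 EUR")
print(f"exact: {exact:.2f}, noisy: {noisy:.2f}")
```

Note that this metric penalizes added Markdown/HTML syntax just as harshly as genuine misreadings, which is exactly the distortion discussed above.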
Figure 6: Semantic accuracy of the different OCR models.
Semantic accuracy (Figure 6) evaluates how well the models have understood and reproduced the content of the texts, regardless of the exact character sequence. This metric is particularly important as it reflects the models' ability to grasp the meaning and context of the information. Here again, Dots.OCR and PaddleOCR-VL lead the ranking, closely followed by Qwen3-VL-32B. The good performance of Qwen3-VL-32B in this discipline indicates that the model captures the content of the documents well even where its exact character accuracy is weak. Since Qwen3-VL-32B is not a specialized OCR model but a general multimodal model, this matches our expectations.
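Semantic accuracy of this kind is typically computed as cosine similarity between embedding vectors of the prediction and the ground truth. The sketch below shows only the similarity step; the vectors are made up for illustration, and a real pipeline would obtain them from a sentence-embedding model.

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two vectors: 1.0 means identical direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return dot / norm

# Invented stand-ins for embeddings of prediction and ground truth.
v_pred = [0.8, 0.1, 0.6]
v_truth = [0.7, 0.2, 0.6]
print(f"semantic score: {cosine_similarity(v_pred, v_truth):.3f}")
```

Because the embedding step abstracts away formatting, two texts with very different Markdown syntax can still score close to 1.0 here.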
Model-to-Model Comparisons
💡Reading Guide: Model-to-Model Comparison
This graphic does not show how well a model performs against the master solution (ground truth), but how much the models agree with each other.
- Top (Character Agreement): Here we compare exact characters. Low values (light yellow) are expected and not a bad sign: since LLMs generate structure formats (Markdown, HTML, JSON) themselves, they often differ greatly in syntax even when the content is the same.
- Bottom (Semantic Agreement): Here we compare the meaning of the text (via embeddings). High values show that the models have understood the content identically, even if they formatted it differently in the upper diagram.
Figure 7: Pairwise agreement between the OCR models (character level, top; semantic level, bottom).
As a consistency test, we compare all models with each other here. It is noticeable that Dots.OCR and PaddleOCR-VL show the highest agreement (both at the character and semantic level) and also performed best in the previous metrics. This indicates that these two models are not only individually strong, but also deliver similar results, which points to robust and reliable performance.
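A minimal sketch of how such a pairwise character-level comparison can be computed follows below; the model outputs are invented for illustration, and the semantic variant would swap `SequenceMatcher` for embedding cosine similarity.

```python
from difflib import SequenceMatcher
from itertools import combinations

# Invented outputs of three models on the same page: same content,
# different structural formatting.
outputs = {
    "model_a": "# Invoice\nTotal: 42.00 EUR",
    "model_b": "Invoice\n\nTotal: 42.00 EUR",
    "model_c": "Invoice Total 42.00 EUR",
}

# Character-level agreement for every model pair (upper triangle of the matrix).
for (name_a, text_a), (name_b, text_b) in combinations(outputs.items(), 2):
    score = SequenceMatcher(None, text_a, text_b).ratio()
    print(f"{name_a} vs {name_b}: {score:.2f}")
```

Filling a symmetric matrix with these scores and plotting it as a heatmap yields exactly the kind of visualization shown in Figure 7.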
💡Reading Guide: Reality vs. Expectation (Ground Truth)
Here we compare the model outputs against the correct solution ("Ground Truth"). One should note the crucial difference between the two metrics.
- Top (Character Level): Shows whether the characters match exactly. Noticeably, the forms (funsd_...) show a lot of yellow (low values). This is expected, as the models add structure (Markdown/HTML) that is missing in the plain-text solution.
- Bottom (Semantic Level): Shows whether the content was correctly understood. Here the dark blue tones dominate, which shows that even where the characters in the upper diagram differ (due to formatting), the models have mostly captured the content of the forms correctly.
Figure 8: Agreement of the OCR models with the ground truth (character level, top; semantic level, bottom).
In this comparison against the ground truth, it becomes clear how much the models are influenced by the type of test cases. Especially with the forms (funsd_...), there are considerable deviations at the character level (top), which is due to the previously mentioned problem with the structural information. At the semantic level (bottom), on the other hand, the models perform significantly better, which indicates that they capture the content of the documents well despite the formatting differences. This underlines the importance of not only paying attention to the exact character match when evaluating OCR models, but also considering the content context.
Conclusion and Outlook
Figure 9: Ranking of the models by test.
Our benchmark clearly shows that specialized OCR models such as Dots.OCR and PaddleOCR-VL currently deliver the best performance, both in technical terms and in actual text recognition. These models stand out for their speed, efficiency and accuracy, which makes them excellent candidates for use in production OCR pipelines. At the same time, the comparison with the ground truth illustrates the challenges that arise from the discrepancy between plain-text solutions and the structured formats generated by the models. For future benchmarks, it would therefore be useful to use datasets that include both the text and the structural information in order to obtain an even more comprehensive picture of model performance. Overall, the results provide valuable insights for developers and companies looking for powerful OCR solutions.
We would be happy to advise you in more detail on selecting an OCR model for your specific application, or to build complete pipelines for you. Just contact us with any questions.