OCR for Sensitive Data on Your Own GPU
Published December 16, 2025
Introduction
The first part, How LLMs Are Revolutionizing OCR-Based Document Analysis, demonstrated the robustness and structural-analysis capabilities of Vision-Language Models (VLMs) such as Dots.OCR in direct comparison with classic OCR (Tesseract). Now the crucial question arises: how do we implement this technology in a GDPR-compliant and high-performance manner?
Processing sensitive customer data or internal documents via external cloud services is often ruled out for data protection reasons. Part of the solution can lie in your own, controlled infrastructure.
In this second part, we focus on the practical implementation of this high-performance pipeline. We show step-by-step how to set up a dedicated, fast processing server on your own NVIDIA GPU using Podman (on Rocky Linux) and the vLLM inference engine. We then build an asynchronous Python client to fully leverage the GPU's power and process even large stacks of documents.
All discussed code examples, including the final main.py and docker-compose.yml, can be found in the corresponding GitHub repository.
Server-Side Setup
To realize the high-performance VLM pipeline, a dedicated server environment is required. The decision to run the Vision-Language Model (rednote-hilab/Dots.OCR) on your own hardware enables GDPR compliance but requires specific resources:
Hardware Requirements for vLLM Inference
Since we are using an architecture based on Large Language Models (LLMs), GPU acceleration is essential to achieve economical throughput.
Graphics Card (GPU): An **NVIDIA GPU with CUDA support** is mandatory.
Video Memory (VRAM): Although the Dots.OCR model with 1.7 billion parameters is relatively compact, we recommend **at least 24 GB of VRAM** for loading the model and performing efficient inference with vLLM. In our tests we used an NVIDIA RTX 3090, which ran stably once a balance was found between model size, batch size, and document length.
Note: Operation on pure CPU hardware is theoretically possible, but the throughput is severely limited compared to GPU acceleration.
If local hardware is not available, dedicated German cloud servers (e.g., at IONOS, StackIT, or Hetzner) can be a **GDPR-compliant** and even **ISO 27001** certified alternative.
The Choice of Inference Engine: vLLM
To optimally manage the computational load of the VLM on the GPU and thus ensure fast response times as well as high throughput, we rely on the open-source inference engine **vLLM** (see the vLLM documentation).
vLLM is known for its high throughput and offers an **OpenAI-compatible API**, which significantly simplifies integration into our client (and into existing solutions).
Deployment via Podman (Rocky Linux)
To ensure that our setup is reproducible and can be operated securely without root privileges (rootless), we use Podman, the standard container engine on RHEL-based systems like Rocky Linux (Docker works as an alternative).
To use the container solution, we first need to install and configure Podman and the NVIDIA Container Toolkit.
1. Install Podman and Tools
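On Rocky Linux, both tools are available via dnf (package names as of Rocky Linux 9; they may differ on other RHEL derivatives):

```shell
# Install Podman and the compose wrapper
sudo dnf install -y podman podman-compose

# Verify the installation
podman --version
podman-compose --version
```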
2. NVIDIA Container Toolkit & CDI
Unlike Docker, Podman uses the modern CDI (Container Device Interface) for GPU access.
Note: The /var/run directory is cleared on every reboot, so the CDI specification (/var/run/cdi/nvidia.yaml) must be regenerated at every system start.
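The commands below follow NVIDIA's documented installation flow for RHEL-based systems; the repository URL is taken from the NVIDIA Container Toolkit documentation:

```shell
# Add NVIDIA's package repository and install the toolkit
curl -s -L https://nvidia.github.io/libnvidia-container/stable/rpm/nvidia-container-toolkit.repo \
  | sudo tee /etc/yum.repos.d/nvidia-container-toolkit.repo
sudo dnf install -y nvidia-container-toolkit

# Generate the CDI specification for the installed GPU(s)
sudo nvidia-ctk cdi generate --output=/var/run/cdi/nvidia.yaml

# List the device names Podman can now address (e.g. nvidia.com/gpu=all)
nvidia-ctk cdi list
```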
On the server, we first create a new directory and move into it with the command:
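The folder name dotsOCR is reused later in the Systemd unit, so it is worth keeping it consistent:

```shell
mkdir dotsOCR && cd dotsOCR
```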
There we create a new file named docker-compose.yml via VIM:
We fill this file with the following content. We use the CDI syntax for devices and the :z flag for SELinux compatibility under Rocky Linux.
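A sketch of such a docker-compose.yml is shown below. The image name, port, and serving flags are assumptions based on the public vLLM OpenAI-compatible image; check the Dots.OCR model card for the exact serving instructions, as the model may require a custom image or additional flags:

```yaml
services:
  vllm:
    image: docker.io/vllm/vllm-openai:latest   # assumption: upstream vLLM image
    devices:
      - nvidia.com/gpu=all                     # CDI syntax for GPU access
    env_file:
      - .env
    ports:
      - "8000:8000"
    volumes:
      - ./models:/root/.cache/huggingface:z    # :z relabels the volume for SELinux
    ipc: host
    restart: unless-stopped
    command: >
      --model rednote-hilab/dots.ocr
      --trust-remote-code
      --gpu-memory-utilization 0.90
```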
In the next step, we create another file in the same way named .env, in which we set the necessary environment variables:
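The variable names below are illustrative: HUGGING_FACE_HUB_TOKEN is only needed if the model download requires authentication, and VLLM_API_KEY can be used to protect the endpoint.

```ini
# .env -- values are placeholders
HUGGING_FACE_HUB_TOKEN=hf_your_token_here
VLLM_API_KEY=EMPTY
```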
Now we can start the container and turn to the script. For this, we simply enter the following command into the terminal:
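Assuming podman-compose is used (the docker-compose CLI against a Podman socket works analogously):

```shell
podman-compose up -d

# Follow the logs until the model has finished loading
podman-compose logs -f
```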
Automation and Autostart via Systemd
To ensure our inference server starts automatically after a system reboot and runs robustly in the background, we set up a Systemd service. This replaces the manual start command.
To do this, create the service file at /etc/systemd/system/vllm.service:
Then add the following content.
Important: Adjust the path in ExecStart and ExecStop to the location where you previously created the dotsOCR folder (the example below assumes /root/dotsOCR).
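A minimal unit sketch is shown below. It also regenerates the CDI specification before each start, since /var/run is cleared on reboot (see the note above):

```ini
[Unit]
Description=vLLM Dots.OCR inference server
After=network-online.target
Wants=network-online.target

[Service]
Type=oneshot
RemainAfterExit=true
WorkingDirectory=/root/dotsOCR
# Regenerate the CDI spec, which is lost on every reboot
ExecStartPre=/usr/bin/nvidia-ctk cdi generate --output=/var/run/cdi/nvidia.yaml
ExecStart=/usr/bin/podman-compose -f /root/dotsOCR/docker-compose.yml up -d
ExecStop=/usr/bin/podman-compose -f /root/dotsOCR/docker-compose.yml down

[Install]
WantedBy=multi-user.target
```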
Afterward, we set the correct permissions for the file (root user, read permissions for all) and enable the service.
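The following commands set the ownership and permissions and register the service:

```shell
sudo chown root:root /etc/systemd/system/vllm.service
sudo chmod 644 /etc/systemd/system/vllm.service
sudo systemctl daemon-reload
sudo systemctl enable --now vllm.service
```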
To ensure the container is running correctly, we check the status:
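A quick status check:

```shell
systemctl status vllm.service
```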
If a green "active (running)" appears here, the server is ready for use.
Pipeline
Client Setup and Dependencies
To send requests to our vLLM server and preprocess the complex PDF documents, we need the following Python libraries on the client side. We use pdf2image to convert PDF pages into images and the openai library to communicate with the vLLM endpoint.
Install all necessary packages in your local (or client) environment.
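Note that pdf2image is a thin wrapper around the Poppler utilities, which must be installed separately on the system:

```shell
pip install pdf2image openai

# pdf2image requires Poppler; on Rocky Linux, for example:
sudo dnf install -y poppler-utils
```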
and create the file main.py.
1. Load PDF Files
First, we need a function to read .pdf files. For this, we write in our main.py:
This function takes a path to a .pdf and returns a list of PIL.Images. However, we cannot use these images for requests to our server yet; we first need to convert them to a Base64 string.
2. PIL.Image to Base64
Two standard-library imports are necessary here (io for in-memory buffers, base64 for the encoding); we add them to the imports at the beginning of our file and define the following new function at the end:
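A sketch of the conversion (the function name image_to_base64 is our choice); we encode the page as PNG to avoid introducing further compression artifacts:

```python
import base64
import io


def image_to_base64(image) -> str:
    """Serialize a PIL image to an in-memory PNG and Base64-encode it."""
    buffer = io.BytesIO()
    image.save(buffer, format="PNG")
    return base64.b64encode(buffer.getvalue()).decode("utf-8")
```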
3. Client Initialization & Request Logic
To be able to send requests to our server, we need an OpenAI client. We add the import at the beginning of our file and, before the function definitions, initialize the client and the request logic:
Here we would typically write a loop that sends the images to the server one by one (sequentially). But that is often too slow.
Acceleration
Sequential processing (image by image) does not exploit the full potential of the vLLM endpoint. Our script would wait for a response before sending the next image.
To really utilize the GPU, we should send the requests in parallel. It is also impractical to hardcode the path and prompt in the code, so we will make improvements to main.py.
1. Parallel Requests with asyncio
To bridge the waiting time, we replace the synchronous client with the asynchronous openai.AsyncOpenAI and use the asyncio framework.
We redefine the processing logic and introduce a constant MAX_CONCURRENT_REQUESTS to control the server load:
2. Integration of argparse
To control the script flexibly from the terminal, we add argparse. This allows calling it as follows:
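A minimal argparse setup might look like this (the flag names --input, --prompt, and --output are our choices):

```python
import argparse


def parse_args(argv=None):
    """Parse the command-line arguments for the OCR pipeline."""
    parser = argparse.ArgumentParser(
        description="OCR a PDF via a local vLLM server"
    )
    parser.add_argument("--input", required=True, help="path to the PDF file")
    parser.add_argument("--prompt",
                        default="Extract all text from this document page.",
                        help="instruction sent with every page")
    parser.add_argument("--output", default="output.md",
                        help="file for the extracted text")
    return parser.parse_args(argv)
```

The script can then be invoked as, for example, `python main.py --input scan.pdf --output result.md`.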
Summary
Our final main.py now combines all steps: it loads the PDF, converts the pages, and sends them highly efficiently and in parallel to our local Podman container:
This final version of the pipeline allows you to process documents within your own network. However, it should not be exposed to the internet in this form; further security measures are necessary, such as securing requests via HTTPS and setting appropriate firewall rules.
Performance
The processing speed depends not only on the hardware used and the number of parallel requests but also heavily on the complexity and quality of the documents. In several tests with historical documents from the last years of the war and the early post-war period (heavily faded text, handwritten notes, various fonts, tabular structures), we achieved an average throughput of about 180 pages per hour on an NVIDIA RTX 3090 (24 GB VRAM) with a batch size of 5 parallel requests. Less complex, modern documents (e.g., invoices, forms) can be processed significantly faster, but the chosen test scenario shows that even difficult documents can be analyzed in an acceptable amount of time.
Next Steps to an Enterprise Solution
The setup described above provides a robust foundation for performing VLM-based document analysis on your own hardware. However, to seamlessly integrate this architecture into a complex Enterprise IT landscape, the following aspects should be considered:
- Orchestration instead of a single server: While podman-compose is ideal for single instances, production operation often requires high availability. Here, the use of Kubernetes or, fitting the Rocky Linux ecosystem, Red Hat OpenShift is recommended. This allows vLLM inference pods to be scaled dynamically according to load and distributed across multiple GPU nodes.
- API Gateway & Security: Direct access to the vLLM container should be avoided. An upstream API Gateway (e.g., LiteLLM) can handle tasks such as rate limiting, centralized logging, managing user groups, fallback solutions, and caching.
- Observability: To identify bottlenecks early, detailed monitoring is essential. Metrics such as GPU utilization (via DCGM Exporter), VRAM allocation, and request latencies should be collected in systems like Prometheus and visualized in Grafana.
With this architecture, the processing of thousands of documents is measured not in days, but in minutes – and all under your own full data control.
We are also happy to support you with your own projects in the OCR area and to discuss further details in a personal conversation.