PaddleOCR-VL Environment Configuration Tutorial for NVIDIA Blackwell-Architecture GPUs
This tutorial provides guidance on configuring the environment for NVIDIA Blackwell-architecture GPUs. After completing the environment setup, please refer to the PaddleOCR-VL Usage Tutorial to use PaddleOCR-VL.
Before starting the tutorial, please ensure that your NVIDIA driver supports CUDA 12.9 or higher.
1. Environment Preparation
This section introduces how to set up the PaddleOCR-VL runtime environment using one of the following two methods:
- Method 1: Use the official Docker image.
- Method 2: Manually install PaddlePaddle and PaddleOCR.
1.1 Method 1: Using Docker Image
We recommend using the official Docker image (requires Docker version >= 19.03, GPU-equipped machine with NVIDIA driver supporting CUDA 12.9 or higher):
docker run \
-it \
--gpus all \
--network host \
--user root \
ccr-2vdh3abv-pub.cnc.bj.baidubce.com/paddlepaddle/paddleocr-vl:latest-gpu-sm120 \
/bin/bash
# Call PaddleOCR CLI or Python API in the container
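# For example, a quick sanity check that the CLI is available inside the container (standard --help output):
paddleocr --help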
If you wish to use PaddleOCR-VL in an offline environment, replace ccr-2vdh3abv-pub.cnc.bj.baidubce.com/paddlepaddle/paddleocr-vl:latest-gpu-sm120 (image size ~10 GB) in the above command with the offline version image ccr-2vdh3abv-pub.cnc.bj.baidubce.com/paddlepaddle/paddleocr-vl:latest-gpu-sm120-offline (image size ~12 GB).
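If the target machine has no network access at all, a common approach is to pull the offline image on a connected machine, export it, and load it on the offline machine (standard Docker commands; the archive file name below is illustrative):
docker save -o paddleocr-vl-offline.tar ccr-2vdh3abv-pub.cnc.bj.baidubce.com/paddlepaddle/paddleocr-vl:latest-gpu-sm120-offline
docker load -i paddleocr-vl-offline.tar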
1.2 Method 2: Manually Install PaddlePaddle and PaddleOCR
If Docker is not an option, you can manually install PaddlePaddle and PaddleOCR. Python version 3.8–3.12 is required.
We strongly recommend installing PaddleOCR-VL in a virtual environment to avoid dependency conflicts. For example, create a virtual environment using Python's standard venv library:
# Create a virtual environment
python -m venv .venv_paddleocr
# Activate the environment
source .venv_paddleocr/bin/activate
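# On Windows, activate the environment with the corresponding script instead (cmd shown; PowerShell uses Activate.ps1)
.venv_paddleocr\Scripts\activate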
Run the following commands to complete the installation:
# Note that PaddlePaddle for cu129 is being installed here
python -m pip install paddlepaddle-gpu==3.2.1 -i https://www.paddlepaddle.org.cn/packages/stable/cu129/
python -m pip install -U "paddleocr[doc-parser]"
# For Linux systems, run:
python -m pip install https://paddle-whl.bj.bcebos.com/nightly/cu126/safetensors/safetensors-0.6.2.dev0-cp38-abi3-linux_x86_64.whl
# For Windows systems, run:
python -m pip install https://xly-devops.cdn.bcebos.com/safetensors-nightly/safetensors-0.6.2.dev0-cp38-abi3-win_amd64.whl
Please ensure that PaddlePaddle framework version 3.2.1 or higher is installed, along with the special nightly build of safetensors installed above.
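You can run a quick sanity check of the installation (a minimal sketch using PaddlePaddle's built-in self-test and the installed safetensors version):
python -c "import paddle; paddle.utils.run_check()"
python -c "import safetensors; print(safetensors.__version__)"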
2. Quick Start
Please refer to the corresponding section in the PaddleOCR-VL Usage Tutorial.
3. Improving VLM Inference Performance Using Inference Acceleration Frameworks
The inference performance under default configurations may not be fully optimized and may not meet actual production requirements. This section introduces how to use the vLLM and SGLang inference acceleration frameworks to enhance PaddleOCR-VL's inference performance.
3.1 Starting the VLM Inference Service
There are two methods to start the VLM inference service; choose one:
- Method 1: Start the service using the official Docker image.
- Method 2: Manually install dependencies and start the service via PaddleOCR CLI.
3.1.1 Method 1: Using Docker Image
PaddleOCR provides a Docker image for quickly starting the vLLM inference service. Use the following command to start the service (requires Docker version >= 19.03, GPU-equipped machine with NVIDIA driver supporting CUDA 12.9 or higher):
docker run \
-it \
--rm \
--gpus all \
--network host \
ccr-2vdh3abv-pub.cnc.bj.baidubce.com/paddlepaddle/paddleocr-genai-vllm-server:latest-gpu-sm120 \
paddleocr genai_server --model_name PaddleOCR-VL-0.9B --host 0.0.0.0 --port 8118 --backend vllm
If you wish to start the service in an offline environment, replace ccr-2vdh3abv-pub.cnc.bj.baidubce.com/paddlepaddle/paddleocr-genai-vllm-server:latest-gpu-sm120 (image size ~12 GB) in the above command with the offline version image ccr-2vdh3abv-pub.cnc.bj.baidubce.com/paddlepaddle/paddleocr-genai-vllm-server:latest-gpu-sm120-offline (image size ~14 GB).
More parameters can be passed when starting the vLLM inference service; supported parameters are detailed in the next subsection.
3.1.2 Method 2: Installation and Usage via PaddleOCR CLI
Since inference acceleration frameworks may have dependency conflicts with the PaddlePaddle framework, installation in a virtual environment is recommended. Taking vLLM as an example:
# If there is an active virtual environment, deactivate it first using `deactivate`
# Create a virtual environment
python -m venv .venv_vlm
# Activate the environment
source .venv_vlm/bin/activate
# Install PaddleOCR
python -m pip install "paddleocr[doc-parser]"# Install dependencies for inference acceleration services
paddleocr install_genai_server_deps vllm
python -m pip install flash-attn==2.8.3
The paddleocr install_genai_server_deps command may require CUDA compilation tools such as nvcc during execution. If these tools are not available in your environment or the installation takes too long, you can obtain a pre-compiled version of FlashAttention from this repository. For example, run python -m pip install https://github.com/mjun0812/flash-attention-prebuild-wheels/releases/download/v0.3.14/flash_attn-2.8.2+cu128torch2.8-cp310-cp310-linux_x86_64.whl.
Usage of the paddleocr install_genai_server_deps command:
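# General form (the <framework_name> placeholder is illustrative; supported names are listed below)
paddleocr install_genai_server_deps <framework_name>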
Currently supported framework names are vllm and sglang, corresponding to vLLM and SGLang, respectively.
After installation, you can start the service using the paddleocr genai_server command:
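# For example, mirroring the Docker invocation shown above:
paddleocr genai_server --model_name PaddleOCR-VL-0.9B --host 0.0.0.0 --port 8118 --backend vllm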
The parameters supported by this command are as follows:
| Parameter | Description |
|---|---|
| --model_name | Name of the model |
| --model_dir | Directory containing the model |
| --host | Server hostname |
| --port | Server port number |
| --backend | Backend name, i.e., the name of the inference acceleration framework being used; options are vllm or sglang |
| --backend_config | YAML file specifying backend configuration |
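For example, backend-specific options can be collected in a YAML file and passed via --backend_config. The following is a hypothetical sketch: gpu_memory_utilization is a real vLLM engine option, but the assumption that --backend_config entries map directly to such options should be verified against the backend documentation:
# Write a minimal backend configuration file (keys assumed to map to vLLM engine options)
echo "gpu_memory_utilization: 0.8" > vllm_config.yaml
# Start the service with the backend configuration applied
paddleocr genai_server --model_name PaddleOCR-VL-0.9B --host 0.0.0.0 --port 8118 --backend vllm --backend_config vllm_config.yaml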
3.2 Client Usage
Please refer to the corresponding section in the PaddleOCR-VL Usage Tutorial.
4. Service Deployment
This section mainly introduces how to deploy PaddleOCR-VL as a service and invoke it. There are two methods available; choose one:
- Method 1: Deploy using Docker Compose.
- Method 2: Manually install dependencies for deployment.
Please note that the PaddleOCR-VL service introduced in this section differs from the VLM inference service in the previous section: the latter is responsible for only one part of the complete process (i.e., VLM inference) and is called as an underlying service by the former.
4.1 Method 1: Deploy Using Docker Compose
- Copy the content from here and save it as a compose.yaml file.
- Copy the following content and save it as a .env file:
- Execute the following command in the directory containing the compose.yaml and .env files to start the server, which will listen on port 8080 by default:

After startup, you will see output similar to the following:
4.2 Method 2: Manually Install Dependencies for Deployment
Execute the following command to install the service deployment plugin via the PaddleX CLI:
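# A sketch of the plugin installation, assuming the serving plugin name used by the PaddleX CLI:
paddlex --install serving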
Then, start the server using the PaddleX CLI:
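# A sketch, assuming the pipeline's registered name is PaddleOCR-VL; a pipeline configuration file path can be passed instead
paddlex --serve --pipeline PaddleOCR-VL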
After startup, you will see output similar to the following, with the server listening on port 8080 by default:
INFO: Started server process [63108]
INFO: Waiting for application startup.
INFO: Application startup complete.
INFO: Uvicorn running on http://0.0.0.0:8080 (Press CTRL+C to quit)
The command-line parameters related to service deployment are as follows:
| Name | Description |
|---|---|
| --pipeline | Registered name of the PaddleX pipeline or path to the pipeline configuration file. |
| --device | Device for pipeline deployment. By default, GPU is used if available; otherwise, CPU is used. |
| --host | Hostname or IP address to which the server is bound. Defaults to 0.0.0.0. |
| --port | Port number on which the server listens. Defaults to 8080. |
| --use_hpip | Enable high-performance inference mode. Refer to the high-performance inference documentation for more information. |
| --hpi_config | High-performance inference configuration. Refer to the high-performance inference documentation for more information. |
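For example, a sketch that combines several of these parameters (assuming the registered pipeline name PaddleOCR-VL and PaddleX's gpu:0 device notation):
paddlex --serve --pipeline PaddleOCR-VL --device gpu:0 --host 0.0.0.0 --port 8080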
To adjust pipeline-related configurations (such as model paths, batch sizes, deployment devices, etc.), refer to Section 4.4.
4.3 Client Invocation Methods
Please refer to the corresponding section in the PaddleOCR-VL Usage Tutorial.
4.4 Pipeline Configuration Adjustment Instructions
Please refer to the corresponding section in the PaddleOCR-VL Usage Tutorial.
5. Model Fine-Tuning
Please refer to the corresponding section in the PaddleOCR-VL Usage Tutorial.