1. PP-OCRv6 Introduction¶

PP-OCRv6 is the latest generation of the PP-OCR universal text recognition solution. Built on the newly designed PPLCNetV4 unified backbone, it offers tiny, small, and medium tiers targeting edge/IoT, mobile/desktop, and server scenarios respectively. PP-OCRv6 achieves a major breakthrough in language coverage—the medium/small tiers support 50 languages with a single unified model, including Simplified Chinese, Traditional Chinese, English, Japanese, and 46 Latin-script languages (tiny supports 49, excluding Japanese). On our in-house multi-scenario benchmark, PP-OCRv6_medium achieves +5.1% recognition accuracy and +4.6% detection Hmean over PP-OCRv5_server, with 2.37× GPU inference speedup; with only 34.5M parameters, it surpasses VLMs such as Qwen3-VL-235B and GPT-5.5 in accuracy.

Main contributions:

Unified and Scalable Model Family: A three-tier OCR model family spanning 1.5M to 34.5M parameters. The medium tier achieves 86.2% detection Hmean and 83.2% recognition accuracy, serving as production-ready infrastructure for industrial deployment and large-scale data pipelines.
Tailored Lightweight Architectural Innovations: (i) LCNetV4: a MetaFormer-style lightweight backbone with structural reparameterization; (ii) RepLKFPN: a detection neck with dilated reparameterizable depthwise convolutions for large receptive fields; (iii) EncoderWithLightSVTR: a recognition neck with local-global attention and additive skip connections.
Extensive Multi-Language and Scenario Generalization: A single model scaled to support 50 languages and diverse challenging industrial scenes (e.g., digital displays, dot-matrix characters, tire prints), significantly improving OCR performance in scenarios traditionally underserved by general-purpose VLMs.

Performance comparison between PP-OCRv6, PP-OCRv5, and Vision-Language Models. Left: text detection average Hmean (%); Right: text recognition weighted average accuracy (%).

2. Key Technical Improvements¶

2.1 Unified Backbone: PPLCNetV4¶

LCNetV4Block: Following the MetaFormer paradigm, each layer is decomposed into a Token Mixer and a Channel Mixer. Given input feature \(\mathbf{x} \in \mathbb{R}^{C \times H \times W}\):

\[\hat{\mathbf{x}} = \text{SE}(\text{DW}(\mathbf{x})) + \mathbf{x}\]

\[\mathbf{y} = W_2\,\sigma(W_1\,\hat{\mathbf{x}}) + \hat{\mathbf{x}}\]

where \(\text{DW}(\cdot)\) is a 3×3 depthwise convolution (Token Mixer), SE is an optional channel attention module, \(W_1 \in \mathbb{R}^{2C \times C}\) and \(W_2 \in \mathbb{R}^{C \times 2C}\) form the Channel Mixer with expansion ratio 2, and \(\sigma\) is GELU activation.

Task-Adaptive Downsampling: The same backbone serves both tasks via different stride strategies—detection mode uses standard stride-2 spatial downsampling producing multi-scale feature maps (stride 4/8/16/32); recognition mode uses asymmetric stride \((2,1)\) at Stage 3/4, reducing height only while preserving width, followed by height-axis average pooling to produce 1-D sequential features for CTC/NRTR decoding.

Comparison with LCNetV3:

Design Aspect	LCNetV3	LCNetV4
Architecture	MobileNet-style (DW→SE→PW)	MetaFormer (TokenMixer + ChannelMixer)
Channel Interaction	Single 1×1 PW Conv	Expand(2×)→Act→Compress + residual
Spatial Mixing	Plain DW Conv	RepDWConv (3×3 + 1×1 + identity)
BN Initialization	Standard	Zero-init on compress BN

PPLCNetV4 backbone architecture.

2.2 Detection Module¶

RepLKFPN: Lightweight large-kernel FPN using DilatedReparamBlock (7×7 depthwise conv + dilated branches), 31% fewer parameters than PP-OCRv5's RSEFPN (118K vs 172K) with receptive field expanded from 3×3 to 7×7.
Auxiliary Deep Supervision: Prediction heads at P2, P3, P4 levels for stronger gradient signals during training.
DiceBCE Loss: Combined DiceLoss + Focal Loss for better per-pixel supervision on small and dense text.

PP-OCRv6 detection module architecture.

2.3 Recognition Module¶

EncoderWithLightSVTR Neck: Local context modeling (1×7 depthwise conv) + global self-attention (1-2 Transformer layers), with additive skip connections (instead of concatenation in PP-OCRv5) to reduce parameters.
Multi-Head Decoder: CTCHead for efficient parallel inference; NRTRHead for auxiliary training supervision (removed at inference).
Tiny Model Design: No neck (direct reshape + FC), trained with knowledge distillation from the medium model.
Multilingual Unification: Dictionary extended with ~200 diacritical characters, enabling single-model 50-language coverage.

PP-OCRv6 recognition module architecture.

3. Key Metrics¶

3.1 Text Detection¶

Text detection Hmean (%) on our in-house multi-scenario benchmark (16 categories):

Model	AVG	HW-CN	HW-EN	Print-CN	Print-EN	TC	Anc.	JP	Blur	Emo.	Warp	Pin.	Art.	Tab.	Rot.	Indus.	Gen.
PP-OCRv6_medium	86.2	83.7	84.0	95.1	93.7	86.3	80.2	84.3	94.1	99.6	88.6	74.0	69.0	96.8	93.8	73.3	82.8
PP-OCRv6_small	84.1	80.5	87.1	94.2	93.6	85.7	72.6	82.3	92.6	99.7	87.6	69.6	65.3	95.6	93.7	67.6	78.2
PP-OCRv6_tiny	80.6	79.4	85.9	93.1	92.3	83.7	63.0	76.6	89.3	99.8	86.1	59.0	60.1	94.7	91.0	62.0	73.8
PP-OCRv5_server	81.6	80.3	84.1	94.5	91.7	81.5	67.6	77.2	90.1	96.2	87.6	67.1	67.3	97.1	80.0	64.3	79.7
PP-OCRv5_mobile	75.2	74.4	77.7	90.5	91.0	82.3	58.1	72.7	87.4	93.6	82.7	57.5	52.5	92.8	64.7	52.8	72.1
Gemini-3.1-Pro	46.8	53.4	56.5	47.3	47.6	39.0	45.8	38.2	50.0	68.1	44.6	40.6	65.2	26.9	22.1	52.5	50.2
GPT-5.5	45.6	42.4	58.5	50.2	51.9	35.0	26.7	42.0	49.1	97.5	37.7	36.3	52.0	71.0	10.0	36.2	32.6
Qwen3-VL-235B	38.3	56.5	66.0	41.7	37.0	19.3	13.1	27.0	38.5	81.2	28.5	33.0	68.3	19.6	2.1	48.4	32.3

3.2 Text Recognition¶

Text recognition accuracy (%) on our in-house multi-scenario benchmark (15 categories):

Model	W-Avg	HW-CN	HW-EN	Print-CN	Print-EN	TC	Anc.	JP	Conf.	Spec.	Gen.	Pin.	Art.	Indus.	Screen	Card
PP-OCRv6_medium	83.2	62.1	67.8	91.5	94.1	78.6	72.4	90.5	64.9	61.7	87.5	78.1	71.2	77.4	82.5	88.1
PP-OCRv6_small	81.3	57.6	61.1	90.5	93.3	77.0	71.1	88.2	64.1	60.2	85.7	75.9	68.4	76.4	79.7	86.9
PP-OCRv6_tiny	73.5	40.1	39.3	86.7	88.4	65.0	68.4	89.8	52.3	57.1	78.0	65.4	54.7	62.1	71.2	80.5
PP-OCRv5_server	78.1	58.0	59.6	90.1	85.1	74.7	60.4	73.7	59.4	56.8	86.5	74.4	64.0	70.2	68.1	87.6
PP-OCRv5_mobile	73.7	41.7	50.9	86.0	86.0	72.0	57.8	75.8	55.7	54.8	80.7	72.5	54.0	59.3	57.6	81.7
Qwen3-VL-235B	74.9	49.7	73.2	82.3	86.2	76.4	33.6	66.2	56.1	49.0	82.5	76.5	69.6	74.7	73.8	78.7
Gemini-3.1-Pro	71.4	46.4	73.0	80.0	90.5	69.5	18.0	67.2	54.4	50.3	74.6	75.9	63.1	69.1	73.2	75.9
GPT-5.5	64.2	19.2	56.9	75.7	82.2	57.5	63.7	58.6	49.1	48.3	67.7	50.4	53.0	62.4	67.7	71.1

3.3 End-to-End Inference Speed (s/image)¶

Tested on 200 images (general + document scenes), including image I/O, pre/post-processing, and model inference.

Hardware	Backend	PP-OCRv6_medium	PP-OCRv6_small	PP-OCRv6_tiny	PP-OCRv5_server	PP-OCRv5_mobile	PP-OCRv4_mobile
NVIDIA A100	PaddlePaddle	0.29	0.25	0.13	0.32	0.25	0.14
NVIDIA A100	TensorRT	--	0.32	0.16	--	0.33	0.16
NVIDIA V100	PaddlePaddle	0.72	0.49	0.21	0.66	0.50	0.25
NVIDIA V100	ONNX Runtime	0.67	0.53	0.29	0.77	0.46	0.27
NVIDIA V100	TensorRT	0.77	0.60	0.23	0.73	0.59	0.27
Intel Xeon 8350C	PaddlePaddle	2.05	0.79	0.32	2.04	0.80	0.62
Intel Xeon 8350C	OpenVINO	1.40	0.59	0.20	7.30	0.78	0.60
Intel Xeon 8350C	ONNX Runtime	3.31	0.61	0.22	6.36	0.61	0.49
Apple M4	PaddlePaddle	8.82	3.07	0.96	>10	5.82	5.65
Apple M4	ONNX Runtime	5.55	1.29	0.35	7.20	1.10	1.02

PP-OCRv6_medium matches or outperforms PP-OCRv5_server on all platforms: 1.1× faster on A100 (0.29s vs 0.32s), 1.15× on V100 ONNX Runtime (0.67s vs 0.77s), 5.2× on Intel Xeon OpenVINO (1.40s vs 7.30s).
PP-OCRv6_small matches PP-OCRv5_mobile in latency on most platforms with higher accuracy; 1.9× faster on Apple M4 PaddlePaddle (3.07s vs 5.82s).
PP-OCRv6_tiny is the fastest model across all platforms: 6.1× over PP-OCRv5_mobile on Apple M4 PaddlePaddle (0.96s vs 5.82s), 3.9× on Intel Xeon OpenVINO (0.20s vs 0.78s), reaching 0.13s on A100.

4. Visualization¶

4.1 Detection Comparison¶

Text detection comparison. Left to right: PP-OCRv6_medium, PP-OCRv5_server, Gemini-3.1-Pro, GPT-5.5.

4.2 Hallucination Comparison¶

PP-OCRv6_medium vs VLMs hallucination comparison. PP-OCRv6 faithfully reproduces visual text content, while VLMs introduce hallucinated corrections based on linguistic priors.

4.3 End-to-End OCR Comparison¶

End-to-end OCR comparison between PP-OCRv6_medium and PP-OCRv5_server across Chinese, English, Japanese, artistic fonts, industrial characters, rotated text, pinyin, and dot-matrix characters.

5. Quick Start¶

from paddleocr import PaddleOCR

# Default: PP-OCRv6_medium
ocr = PaddleOCR(
    use_doc_orientation_classify=False,
    use_doc_unwarping=False,
    use_textline_orientation=False,
)
result = ocr.predict("https://paddle-model-ecology.bj.bcebos.com/paddlex/imgs/demo_image/general_ocr_002.png")

for res in result:
    res.print()
    res.save_to_img("output")
    res.save_to_json("output")

# CLI usage
paddleocr ocr -i https://paddle-model-ecology.bj.bcebos.com/paddlex/imgs/demo_image/general_ocr_002.png \
    --use_doc_orientation_classify False \
    --use_doc_unwarping False \
    --use_textline_orientation False

Using Transformers Engine:

PP-OCRv6 supports inference via Hugging Face Transformers (requires transformers>=5.8.0):

from paddleocr import TextRecognition

model = TextRecognition(
    model_name="PP-OCRv6_medium_rec",
    engine="transformers",
)
output = model.predict(input="general_ocr_rec_001.png", batch_size=1)
for res in output:
    res.print()

Using High-Performance Inference (ONNX Runtime backend):

Enable the high-performance inference plugin with enable_hpi=True:

from paddleocr import PaddleOCR

ocr = PaddleOCR(
    use_doc_orientation_classify=False,
    use_doc_unwarping=False,
    use_textline_orientation=False,
    enable_hpi=True,
)
result = ocr.predict("general_ocr_002.png")

The HPI plugin requires additional installation. See High-Performance Inference Guide.

6. Deployment and Custom Development¶

Multi-OS Support: Compatible with Windows, Linux, and Mac.
Multi-Hardware Support: Supports NVIDIA GPU, Intel CPU, Kunlun, Ascend, and more.
High-Performance Inference Plugin: See High-Performance Inference Guide.
Serving Deployment: See Serving Deployment Guide.
Custom Development: Supports custom dataset training, dictionary extension, and model fine-tuning. See Text Detection Tutorial and Text Recognition Tutorial.