{"id":571,"date":"2026-03-19T02:41:33","date_gmt":"2026-03-18T18:41:33","guid":{"rendered":"https:\/\/connectword.dpdns.org\/?p=571"},"modified":"2026-03-19T02:41:33","modified_gmt":"2026-03-18T18:41:33","slug":"baidu-qianfan-team-releases-qianfan-ocr-a-4b-parameter-unified-document-intelligence-model","status":"publish","type":"post","link":"https:\/\/connectword.dpdns.org\/?p=571","title":{"rendered":"Baidu Qianfan Team Releases Qianfan-OCR: A 4B-Parameter Unified Document Intelligence Model"},"content":{"rendered":"<p>The Baidu Qianfan Team introduced <strong>Qianfan-OCR<\/strong>, a 4B-parameter end-to-end model designed to unify document parsing, layout analysis, and document understanding within a single vision-language architecture. Unlike traditional multi-stage OCR pipelines that chain separate modules for layout detection and text recognition, Qianfan-OCR performs direct image-to-Markdown conversion and supports prompt-driven tasks like table extraction and document question answering.<\/p>\n<div class=\"wp-block-image\">\n<figure class=\"aligncenter size-full is-resized\"><img loading=\"lazy\" decoding=\"async\" width=\"1342\" height=\"672\" data-attachment-id=\"78439\" data-permalink=\"https:\/\/www.marktechpost.com\/2026\/03\/18\/baidu-qianfan-team-releases-qianfan-ocr-a-4b-parameter-unified-document-intelligence-model\/screenshot-2026-03-18-at-11-40-01-am-2\/\" data-orig-file=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2026\/03\/Screenshot-2026-03-18-at-11.40.01-AM-1.png\" data-orig-size=\"1342,672\" data-comments-opened=\"1\" data-image-meta='{\"aperture\":\"0\",\"credit\":\"\",\"camera\":\"\",\"caption\":\"\",\"created_timestamp\":\"0\",\"copyright\":\"\",\"focal_length\":\"0\",\"iso\":\"0\",\"shutter_speed\":\"0\",\"title\":\"\",\"orientation\":\"0\"}' data-image-title=\"Screenshot 2026-03-18 at 11.40.01\u202fAM\" data-image-description=\"\" data-image-caption=\"\" 
data-medium-file=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2026\/03\/Screenshot-2026-03-18-at-11.40.01-AM-1-300x150.png\" data-large-file=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2026\/03\/Screenshot-2026-03-18-at-11.40.01-AM-1-1024x513.png\" src=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2026\/03\/Screenshot-2026-03-18-at-11.40.01-AM-1.png\" alt=\"\" class=\"wp-image-78439\" \/><figcaption class=\"wp-element-caption\">https:\/\/arxiv.org\/pdf\/2603.13398<\/figcaption><\/figure>\n<\/div>\n<h3 class=\"wp-block-heading\"><strong>Architecture and Technical Specifications<\/strong><\/h3>\n<p>Qianfan-OCR utilizes the multimodal bridging architecture from the Qianfan-VL framework. <strong>The system consists of three primary components:<\/strong><\/p>\n<ul class=\"wp-block-list\">\n<li><strong>Vision Encoder (Qianfan-ViT):<\/strong> Employs an <strong>Any Resolution<\/strong> design that tiles images into 448 x 448 patches. It supports variable-resolution inputs up to 4K, producing up to 4,096 visual tokens per image to maintain spatial resolution for small fonts and dense text.<\/li>\n<li><strong>Cross-Modal Adapter:<\/strong> A lightweight two-layer MLP with GELU activation that projects visual features into the language model\u2019s embedding space.<\/li>\n<li><strong>Language Model Backbone (Qwen3-4B):<\/strong> A 4.0B-parameter model with 36 layers and a native 32K context window. It utilizes Grouped-Query Attention (GQA) to reduce KV cache memory usage by 4x.<\/li>\n<\/ul>\n<h3 class=\"wp-block-heading\"><strong>\u2018Layout-as-Thought\u2019 Mechanism<\/strong><\/h3>\n<p>The main feature of the model is <strong>Layout-as-Thought<\/strong>, an optional thinking phase triggered by <code>&lt;think&gt;<\/code> tokens. 
During this phase, the model generates structured layout representations\u2014including bounding boxes, element types, and reading order\u2014before producing the final output.<\/p>\n<ul class=\"wp-block-list\">\n<li><strong>Functional Utility:<\/strong> This process recovers explicit layout analysis capabilities (element localization and type classification) often lost in end-to-end paradigms.<\/li>\n<li><strong>Performance Characteristics:<\/strong> Evaluation on <strong>OmniDocBench v1.5<\/strong> indicates that enabling the thinking phase provides a consistent advantage on documents with high \u201clayout label entropy\u201d\u2014those containing heterogeneous elements like mixed text, formulas, and diagrams.<\/li>\n<li><strong>Efficiency:<\/strong> Bounding box coordinates are represented as dedicated special tokens (<code>&lt;COORD_0&gt;<\/code> to <code>&lt;COORD_999&gt;<\/code>), reducing thinking output length by approximately 50% compared to plain digit sequences.<\/li>\n<\/ul>\n<h3 class=\"wp-block-heading\"><strong>Empirical Performance and Benchmarks<\/strong><\/h3>\n<p>Qianfan-OCR was evaluated against both specialized OCR systems and general vision-language models (VLMs).<\/p>\n<h4 class=\"wp-block-heading\"><strong>Document Parsing and General OCR<\/strong><\/h4>\n<p><strong>The model ranks first among end-to-end models on several key benchmarks:<\/strong><\/p>\n<ul class=\"wp-block-list\">\n<li><strong>OmniDocBench v1.5:<\/strong> Achieved a score of <strong>93.12<\/strong>, surpassing DeepSeek-OCR-v2 (91.09) and Gemini-3 Pro (90.33).<\/li>\n<li><strong>OlmOCR Bench:<\/strong> Scored <strong>79.8<\/strong>, leading the end-to-end category.<\/li>\n<li><strong>OCRBench:<\/strong> Achieved a score of <strong>880<\/strong>, ranking first among all tested models.<\/li>\n<\/ul>\n<h4 class=\"wp-block-heading\"><strong>Key Information Extraction (KIE)<\/strong><\/h4>\n<p>On public KIE benchmarks, Qianfan-OCR achieved the highest average score (87.9), 
outperforming significantly larger models.<\/p>\n<figure class=\"wp-block-table is-style-stripes\">\n<table class=\"has-fixed-layout\">\n<thead>\n<tr>\n<td><strong>Model<\/strong><\/td>\n<td><strong>Overall Mean (KIE)<\/strong><\/td>\n<td><strong>OCRBench KIE<\/strong><\/td>\n<td><strong>Nanonets KIE (F1)<\/strong><\/td>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td><strong>Qianfan-OCR (4B)<\/strong><\/td>\n<td><strong>87.9<\/strong><\/td>\n<td>95.0<\/td>\n<td><strong>86.5<\/strong><\/td>\n<\/tr>\n<tr>\n<td>Qwen3-4B-VL<\/td>\n<td>83.5<\/td>\n<td>89.0<\/td>\n<td>83.3<\/td>\n<\/tr>\n<tr>\n<td>Qwen3-VL-235B-A22B<\/td>\n<td>84.2<\/td>\n<td>94.0<\/td>\n<td>83.8<\/td>\n<\/tr>\n<tr>\n<td>Gemini-3.1-Pro<\/td>\n<td>79.2<\/td>\n<td><strong>96.0<\/strong><\/td>\n<td>76.1<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<\/figure>\n<h4 class=\"wp-block-heading\"><strong>Document Understanding<\/strong><\/h4>\n<p>Comparative testing revealed that two-stage OCR+LLM pipelines often fail on tasks requiring spatial reasoning. 
For instance, all tested two-stage systems scored <strong>0.0<\/strong> on <strong>CharXiv<\/strong> benchmarks, as the text extraction phase discards the visual context (axis relationships, data point positions) necessary for chart interpretation.<\/p>\n<div class=\"wp-block-image\">\n<figure class=\"aligncenter size-full is-resized\"><img loading=\"lazy\" decoding=\"async\" width=\"1272\" height=\"560\" data-attachment-id=\"78441\" data-permalink=\"https:\/\/www.marktechpost.com\/2026\/03\/18\/baidu-qianfan-team-releases-qianfan-ocr-a-4b-parameter-unified-document-intelligence-model\/screenshot-2026-03-18-at-11-40-47-am-2\/\" data-orig-file=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2026\/03\/Screenshot-2026-03-18-at-11.40.47-AM-1.png\" data-orig-size=\"1272,560\" data-comments-opened=\"1\" data-image-meta='{\"aperture\":\"0\",\"credit\":\"\",\"camera\":\"\",\"caption\":\"\",\"created_timestamp\":\"0\",\"copyright\":\"\",\"focal_length\":\"0\",\"iso\":\"0\",\"shutter_speed\":\"0\",\"title\":\"\",\"orientation\":\"0\"}' data-image-title=\"Screenshot 2026-03-18 at 11.40.47\u202fAM\" data-image-description=\"\" data-image-caption=\"\" data-medium-file=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2026\/03\/Screenshot-2026-03-18-at-11.40.47-AM-1-300x132.png\" data-large-file=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2026\/03\/Screenshot-2026-03-18-at-11.40.47-AM-1-1024x451.png\" src=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2026\/03\/Screenshot-2026-03-18-at-11.40.47-AM-1.png\" alt=\"\" class=\"wp-image-78441\" \/><figcaption class=\"wp-element-caption\">https:\/\/arxiv.org\/pdf\/2603.13398<\/figcaption><\/figure>\n<\/div>\n<h3 class=\"wp-block-heading\"><strong>Deployment and Inference<\/strong><\/h3>\n<p>Inference efficiency was measured in <strong>Pages Per Second (PPS)<\/strong> using a single NVIDIA A100 GPU.<\/p>\n<ul 
class=\"wp-block-list\">\n<li><strong>Quantization:<\/strong> With <strong>W8A8 (AWQ) quantization<\/strong>, Qianfan-OCR achieved <strong>1.024 PPS<\/strong>, a 2x speedup over the W16A16 baseline with negligible accuracy loss.<\/li>\n<li><strong>Architecture Advantage:<\/strong> Unlike pipeline systems that rely on CPU-based layout analysis\u2014which can become a bottleneck\u2014Qianfan-OCR is <strong>GPU-centric<\/strong>. This avoids inter-stage processing delays and allows for efficient large-batch inference.<\/li>\n<\/ul>\n<hr class=\"wp-block-separator has-alpha-channel-opacity\" \/>\n<p>Check out\u00a0<strong><a href=\"https:\/\/arxiv.org\/pdf\/2603.13398\" target=\"_blank\" rel=\"noreferrer noopener\">Paper<\/a><\/strong>, <a href=\"https:\/\/github.com\/baidubce\/Qianfan-VL\" target=\"_blank\" rel=\"noreferrer noopener\"><strong>Repo<\/strong><\/a> and <strong><a href=\"https:\/\/huggingface.co\/collections\/baidu\/qianfan-vl\" target=\"_blank\" rel=\"noreferrer noopener\">Model on HF<\/a>.\u00a0<\/strong>Also,\u00a0feel free to follow us on\u00a0<strong><a href=\"https:\/\/x.com\/intent\/follow?screen_name=marktechpost\" target=\"_blank\" rel=\"noreferrer noopener\"><mark>Twitter<\/mark><\/a><\/strong>\u00a0and don\u2019t forget to join our\u00a0<strong><a href=\"https:\/\/www.reddit.com\/r\/machinelearningnews\/\" target=\"_blank\" rel=\"noreferrer noopener\">120k+ ML SubReddit<\/a><\/strong>\u00a0and Subscribe to\u00a0<strong><a href=\"https:\/\/www.aidevsignals.com\/\" target=\"_blank\" rel=\"noreferrer noopener\">our Newsletter<\/a><\/strong>. Wait! 
Are you on Telegram?\u00a0<strong><a href=\"https:\/\/t.me\/machinelearningresearchnews\" target=\"_blank\" rel=\"noreferrer noopener\">Now you can join us on Telegram as well.<\/a><\/strong><\/p>\n<p>The post <a href=\"https:\/\/www.marktechpost.com\/2026\/03\/18\/baidu-qianfan-team-releases-qianfan-ocr-a-4b-parameter-unified-document-intelligence-model\/\">Baidu Qianfan Team Releases Qianfan-OCR: A 4B-Parameter Unified Document Intelligence Model<\/a> appeared first on <a href=\"https:\/\/www.marktechpost.com\/\">MarkTechPost<\/a>.<\/p>","protected":false},"excerpt":{"rendered":"<p>The Baidu Qianfan Team introdu&hellip;<\/p>\n","protected":false},"author":1,"featured_media":572,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[1],"tags":[],"class_list":["post-571","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-uncategorized"],"_links":{"self":[{"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=\/wp\/v2\/posts\/571","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=571"}],"version-history":[{"count":0,"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=\/wp\/v2\/posts\/571\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=\/wp\/v2\/media\/572"}],"wp:attachment":[{"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=571"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/connectword.dpdns.org\/inde
x.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=571"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=571"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}