{"id":87,"date":"2025-12-09T09:03:00","date_gmt":"2025-12-09T01:03:00","guid":{"rendered":"https:\/\/connectword.dpdns.org\/?p=87"},"modified":"2025-12-09T09:03:00","modified_gmt":"2025-12-09T01:03:00","slug":"z-ai-debuts-open-source-glm-4-6v-a-native-tool-calling-vision-model-for-multimodal-reasoning","status":"publish","type":"post","link":"https:\/\/connectword.dpdns.org\/?p=87","title":{"rendered":"Z.ai debuts open source GLM-4.6V, a native tool-calling vision model for multimodal reasoning"},"content":{"rendered":"<p>Chinese AI startup Zhipu AI aka <a href=\"https:\/\/z.ai\/blog\/glm-4.6v\"><b>Z.ai has released its GLM-4.6V series<\/b><\/a>, a new generation of open-source vision-language models (VLMs) optimized for multimodal reasoning, frontend automation, and high-efficiency deployment. <\/p>\n<p>The release includes two models in &#8220;large&#8221; and &#8220;small&#8221; sizes: <\/p>\n<ol>\n<li>\n<p><b>GLM-4.6V (106B)<\/b>, a larger 106-billion parameter model aimed at cloud-scale inference<\/p>\n<\/li>\n<li>\n<p><b>GLM-4.6V-Flash (9B)<\/b>, a smaller model of only 9 billion parameters designed for low-latency, local applications<\/p>\n<\/li>\n<\/ol>\n<p>Recall that generally speaking, models with more parameters \u2014 or internal settings governing their behavior, i.e. weights and biases \u2014 are more powerful, performant, and capable of performing at a higher general level across more varied tasks.<\/p>\n<p>However, smaller models can offer better efficiency for edge or real-time applications where latency and resource constraints are critical.<\/p>\n<p>The defining innovation in this series is the introduction of <b>native function calling<\/b> in a vision-language model\u2014enabling direct use of tools such as search, cropping, or chart recognition with visual inputs. 
<\/p>\n<p>With a 128,000 token context length (equivalent to a 300-page novel&#8217;s worth of text exchanged in a single input\/output interaction with the user) and state-of-the-art (SoTA) results across more than 20 benchmarks, the GLM-4.6V series positions itself as a highly competitive alternative to both closed and open-source VLMs. It&#8217;s available in the following formats:<\/p>\n<ul>\n<li>\n<p><a href=\"https:\/\/docs.z.ai\/guides\/vlm\/glm-4.6v\">API access<\/a> via OpenAI-compatible interface<\/p>\n<\/li>\n<li>\n<p><a href=\"https:\/\/chat.z.ai\/\">Try the demo<\/a> on Zhipu\u2019s web interface<\/p>\n<\/li>\n<li>\n<p><a href=\"https:\/\/huggingface.co\/collections\/zai-org\/glm-46v\">Download weights<\/a> from Hugging Face<\/p>\n<\/li>\n<li>\n<p>Desktop assistant app available on <a href=\"https:\/\/huggingface.co\/spaces\/zai-org\/GLM-4.5V-Demo-App\">Hugging Face Spaces<\/a><\/p>\n<\/li>\n<\/ul>\n<h2><b>Licensing and Enterprise Use<\/b><\/h2>\n<p>GLM\u20114.6V and GLM\u20114.6V\u2011Flash are distributed under the <a href=\"https:\/\/opensource.org\/licenses\/MIT\">MIT license<\/a>, a permissive open-source license that allows free commercial and non-commercial use, modification, redistribution, and local deployment without obligation to open-source derivative works. <\/p>\n<p>This licensing model makes the series suitable for enterprise adoption, including scenarios that require full control over infrastructure, compliance with internal governance, or air-gapped environments.<\/p>\n<p>Model weights and documentation are publicly hosted on <a href=\"https:\/\/huggingface.co\/collections\/zai-org\/glm-46v\">Hugging Face<\/a>, with supporting code and tooling available on <a href=\"https:\/\/github.com\/zai-org\/GLM-V\">GitHub<\/a>. 
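<\/p>\n<p>As a sketch of how the OpenAI-compatible API mentioned above can be exercised, the snippet below builds a chat-completion request pairing an image with a text prompt. The message shape follows OpenAI\u2019s vision convention; the model identifier \"glm-4.6v\" and endpoint details are assumptions that should be confirmed against Z.ai\u2019s API documentation.<\/p>

```python
import json

# Request body for an OpenAI-compatible chat-completions call.
# NOTE: the model name "glm-4.6v" and the image-part layout follow
# OpenAI's vision message convention and are assumptions here;
# verify both against Z.ai's API documentation.
def build_vision_request(prompt: str, image_url: str, model: str = "glm-4.6v") -> dict:
    return {
        "model": model,
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "image_url", "image_url": {"url": image_url}},
                    {"type": "text", "text": prompt},
                ],
            }
        ],
    }

payload = build_vision_request(
    "Summarize the chart in this screenshot.",
    "https://example.com/chart.png",
)
print(json.dumps(payload, indent=2))
```

<p>Any OpenAI-compatible client can then POST this payload to Z.ai\u2019s endpoint with an API key; relative to other providers, only the base URL and model name change. 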
<\/p>\n<p>The MIT license ensures maximum flexibility for integration into proprietary systems, including internal tools, production pipelines, and edge deployments.<\/p>\n<h2><b>Architecture and Technical Capabilities<\/b><\/h2>\n<p>The GLM-4.6V models follow a conventional encoder-decoder architecture with significant adaptations for multimodal input. <\/p>\n<p>Both models incorporate a Vision Transformer (ViT) encoder\u2014based on AIMv2-Huge\u2014and an MLP projector to align visual features with a large language model (LLM) decoder. <\/p>\n<p>Video inputs benefit from 3D convolutions and temporal compression, while spatial encoding is handled using 2D-RoPE and bicubic interpolation of absolute positional embeddings.<\/p>\n<p>A key technical feature is the system\u2019s support for arbitrary image resolutions and aspect ratios, including wide panoramic inputs up to 200:1. <\/p>\n<p>In addition to static image and document parsing, GLM-4.6V can ingest temporal sequences of video frames with explicit timestamp tokens, enabling robust temporal reasoning.<\/p>\n<p>On the decoding side, the model supports token generation aligned with function-calling protocols, allowing for structured reasoning across text, image, and tool outputs. This is supported by extended tokenizer vocabulary and output formatting templates to ensure consistent API or agent compatibility.<\/p>\n<h2><b>Native Multimodal Tool Use<\/b><\/h2>\n<p>GLM-4.6V introduces native multimodal function calling, allowing visual assets\u2014such as screenshots, images, and documents\u2014to be passed directly as parameters to tools. 
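<\/p>\n<p>To make the idea concrete, the sketch below declares a hypothetical crop_image tool in the widely used JSON-Schema function-calling style, with an image URL as one of its parameters. The tool name and fields are illustrative assumptions, not Z.ai\u2019s published schema.<\/p>

```python
import json

# A hypothetical tool declaration in the JSON-Schema function-calling
# style: the tool accepts an image URL plus a bounding box, so the model
# can request a crop of a visual input. Name and fields are illustrative.
crop_tool = {
    "type": "function",
    "function": {
        "name": "crop_image",
        "description": "Crop a region out of an input image for closer inspection.",
        "parameters": {
            "type": "object",
            "properties": {
                "image_url": {
                    "type": "string",
                    "description": "Source image to crop.",
                },
                "bbox": {
                    "type": "array",
                    "items": {"type": "number"},
                    "description": "Region as [x1, y1, x2, y2] in pixels.",
                },
            },
            "required": ["image_url", "bbox"],
        },
    },
}

print(json.dumps(crop_tool, indent=2))
```

<p>A model with native multimodal function calling can emit a call to such a tool with the image reference filled in, rather than first flattening the image to text.<\/p>\n<p>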
This eliminates the need for intermediate text-only conversions, which have historically introduced information loss and complexity.<\/p>\n<p>The tool invocation mechanism works bi-directionally:<\/p>\n<ul>\n<li>\n<p>Input tools can be passed images or videos directly (e.g., document pages to crop or analyze).<\/p>\n<\/li>\n<li>\n<p>Output tools such as chart renderers or web snapshot utilities return visual data, which GLM-4.6V integrates directly into the reasoning chain.<\/p>\n<\/li>\n<\/ul>\n<p>In practice, this means GLM-4.6V can complete tasks such as:<\/p>\n<ul>\n<li>\n<p>Generating structured reports from mixed-format documents<\/p>\n<\/li>\n<li>\n<p>Performing visual audits of candidate images<\/p>\n<\/li>\n<li>\n<p>Automatically cropping figures from papers during generation<\/p>\n<\/li>\n<li>\n<p>Conducting visual web search and answering multimodal queries<\/p>\n<\/li>\n<\/ul>\n<h2><b>High Performance Benchmarks Compared to Other Similar-Sized Models<\/b><\/h2>\n<p>GLM-4.6V was evaluated across more than 20 public benchmarks covering general VQA, chart understanding, OCR, STEM reasoning, frontend replication, and multimodal agents. <\/p>\n<p>According to the benchmark chart released by Zhipu AI:<\/p>\n<ul>\n<li>\n<p>GLM-4.6V (106B) achieves SoTA or near-SoTA scores among open-source models of comparable size (106B) on MMBench, MathVista, MMLongBench, ChartQAPro, RefCOCO, TreeBench, and more.<\/p>\n<\/li>\n<li>\n<p>GLM-4.6V-Flash (9B) outperforms other lightweight models (e.g., Qwen3-VL-8B, GLM-4.1V-9B) across almost all categories tested.<\/p>\n<\/li>\n<li>\n<p>The 106B model\u2019s 128K-token window allows it to outperform larger models like Step-3 (321B) and Qwen3-VL-235B on long-context document tasks, video summarization, and structured multimodal reasoning.<\/p>\n<\/li>\n<\/ul>\n<p>Example scores from the leaderboard include:<\/p>\n<ul>\n<li>\n<p>MathVista: 88.2 (GLM-4.6V) vs. 84.6 (GLM-4.5V) vs. 
81.4 (Qwen3-VL-8B)<\/p>\n<\/li>\n<li>\n<p>WebVoyager: 81.0 vs. 68.4 (Qwen3-VL-8B)<\/p>\n<\/li>\n<li>\n<p>Ref-L4-test: 88.9 vs. 89.5 (GLM-4.5V), but with better grounding fidelity at 87.7 (Flash) vs. 86.8<\/p>\n<\/li>\n<\/ul>\n<p>Both models were evaluated using the vLLM inference backend and support SGLang for video-based tasks.<\/p>\n<h2><b>Frontend Automation and Long-Context Workflows<\/b><\/h2>\n<p>Zhipu AI emphasized GLM-4.6V\u2019s ability to support frontend development workflows. The model can:<\/p>\n<ul>\n<li>\n<p>Replicate pixel-accurate HTML\/CSS\/JS from UI screenshots<\/p>\n<\/li>\n<li>\n<p>Accept natural language editing commands to modify layouts<\/p>\n<\/li>\n<li>\n<p>Identify and manipulate specific UI components visually<\/p>\n<\/li>\n<\/ul>\n<p>This capability is integrated into an end-to-end visual programming interface, where the model iterates on layout, design intent, and output code using its native understanding of screen captures.<\/p>\n<p>In long-document scenarios, GLM-4.6V can process up to 128,000 tokens\u2014enabling a single inference pass across:<\/p>\n<ul>\n<li>\n<p>150 pages of text (input)<\/p>\n<\/li>\n<li>\n<p>200-slide decks<\/p>\n<\/li>\n<li>\n<p>1-hour videos<\/p>\n<\/li>\n<\/ul>\n<p>Zhipu AI reported successful use of the model in financial analysis across multi-document corpora and in summarizing full-length sports broadcasts with timestamped event detection.<\/p>\n<h2><b>Training and Reinforcement Learning<\/b><\/h2>\n<p>The model was trained using multi-stage pre-training followed by supervised fine-tuning (SFT) and reinforcement learning (RL). 
Key innovations include:<\/p>\n<ul>\n<li>\n<p>Reinforcement Learning with Curriculum Sampling (RLCS): Dynamically adjusts the difficulty of training samples based on model progress<\/p>\n<\/li>\n<li>\n<p>Multi-domain reward systems: Task-specific verifiers for STEM, chart reasoning, GUI agents, video QA, and spatial grounding<\/p>\n<\/li>\n<li>\n<p>Function-aware training: Uses structured tags (e.g., &lt;think&gt;, &lt;answer&gt;, &lt;|begin_of_box|&gt;) to align reasoning and answer formatting<\/p>\n<\/li>\n<\/ul>\n<p>The reinforcement learning pipeline emphasizes verifiable rewards (RLVR) over human feedback (RLHF) for scalability, and avoids KL\/entropy losses to stabilize training across multimodal domains.<\/p>\n<h2><b>Pricing (API)<\/b><\/h2>\n<p>Zhipu AI offers competitive pricing for the GLM-4.6V series, with both the flagship model and its lightweight variant positioned for high accessibility.<\/p>\n<ul>\n<li>\n<p>GLM-4.6V: $0.30 (input) \/ $0.90 (output) per 1M tokens<\/p>\n<\/li>\n<li>\n<p>GLM-4.6V-Flash: Free<\/p>\n<\/li>\n<\/ul>\n<p>Compared to major vision-capable and text-first LLMs, GLM-4.6V is among the most cost-efficient for multimodal reasoning at scale. 
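<\/p>\n<p>At those list prices, per-request cost is simple arithmetic; the helper below is a quick sanity check using the published GLM-4.6V rates.<\/p>

```python
# Published GLM-4.6V rates: $0.30 per 1M input tokens, $0.90 per 1M output tokens.
INPUT_PER_M = 0.30
OUTPUT_PER_M = 0.90

def glm46v_cost(input_tokens: int, output_tokens: int) -> float:
    """Dollar cost of a single request at GLM-4.6V's listed API prices."""
    return (input_tokens / 1_000_000) * INPUT_PER_M + (output_tokens / 1_000_000) * OUTPUT_PER_M

# Example: a maxed-out 128K-token prompt with a 4K-token completion.
print(round(glm46v_cost(128_000, 4_000), 4))  # prints 0.042
```

<p>Filling the full 128K-token context and generating a 4,000-token answer thus costs roughly $0.04 per call.<\/p>\n<p>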
Below is a comparative snapshot of pricing across providers:<\/p>\n<p><i>USD per 1M tokens \u2014 sorted lowest \u2192 highest total cost<\/i><\/p>\n<table>\n<tbody>\n<tr>\n<td>\n<p><b>Model<\/b><\/p>\n<\/td>\n<td>\n<p><b>Input<\/b><\/p>\n<\/td>\n<td>\n<p><b>Output<\/b><\/p>\n<\/td>\n<td>\n<p><b>Total Cost<\/b><\/p>\n<\/td>\n<td>\n<p><b>Source<\/b><\/p>\n<\/td>\n<\/tr>\n<tr>\n<td>\n<p>Qwen 3 Turbo<\/p>\n<\/td>\n<td>\n<p>$0.05<\/p>\n<\/td>\n<td>\n<p>$0.20<\/p>\n<\/td>\n<td>\n<p>$0.25<\/p>\n<\/td>\n<td>\n<p><a href=\"https:\/\/www.alibabacloud.com\/en\/campaign\/qwen-ai-landing-page?_p_lc=1&amp;src=qwenai\">Alibaba Cloud<\/a><\/p>\n<\/td>\n<\/tr>\n<tr>\n<td>\n<p>ERNIE 4.5 Turbo<\/p>\n<\/td>\n<td>\n<p>$0.11<\/p>\n<\/td>\n<td>\n<p>$0.45<\/p>\n<\/td>\n<td>\n<p>$0.56<\/p>\n<\/td>\n<td>\n<p><a href=\"https:\/\/cloud.baidu.com\/doc\/WENXINWORKSHOP\/s\/Blfmc9do4\">Qianfan<\/a><\/p>\n<\/td>\n<\/tr>\n<tr>\n<td>\n<p>Grok 4.1 Fast (reasoning)<\/p>\n<\/td>\n<td>\n<p>$0.20<\/p>\n<\/td>\n<td>\n<p>$0.50<\/p>\n<\/td>\n<td>\n<p>$0.70<\/p>\n<\/td>\n<td>\n<p><a href=\"https:\/\/docs.x.ai\/docs\/models?cluster=us-east-1#detailed-pricing-for-all-grok-models\">xAI<\/a><\/p>\n<\/td>\n<\/tr>\n<tr>\n<td>\n<p>Grok 4.1 Fast (non-reasoning)<\/p>\n<\/td>\n<td>\n<p>$0.20<\/p>\n<\/td>\n<td>\n<p>$0.50<\/p>\n<\/td>\n<td>\n<p>$0.70<\/p>\n<\/td>\n<td>\n<p><a href=\"https:\/\/docs.x.ai\/docs\/models?cluster=us-east-1#detailed-pricing-for-all-grok-models\">xAI<\/a><\/p>\n<\/td>\n<\/tr>\n<tr>\n<td>\n<p>deepseek-chat (V3.2-Exp)<\/p>\n<\/td>\n<td>\n<p>$0.28<\/p>\n<\/td>\n<td>\n<p>$0.42<\/p>\n<\/td>\n<td>\n<p>$0.70<\/p>\n<\/td>\n<td>\n<p><a href=\"https:\/\/api-docs.deepseek.com\/quick_start\/pricing\">DeepSeek<\/a><\/p>\n<\/td>\n<\/tr>\n<tr>\n<td>\n<p>deepseek-reasoner (V3.2-Exp)<\/p>\n<\/td>\n<td>\n<p>$0.28<\/p>\n<\/td>\n<td>\n<p>$0.42<\/p>\n<\/td>\n<td>\n<p>$0.70<\/p>\n<\/td>\n<td>\n<p><a href=\"https:\/\/api-docs.deepseek.com\/quick_start\/pricing\">DeepSeek<\/a><\/p>\n<\/td>\n<\/tr>\n<tr>\n<td>\n<p><b>GLM\u20114.6V<\/b><\/p>\n<\/td>\n<td>\n<p><b>$0.30<\/b><\/p>\n<\/td>\n<td>\n<p><b>$0.90<\/b><\/p>\n<\/td>\n<td>\n<p><b>$1.20<\/b><\/p>\n<\/td>\n<td>\n<p><a href=\"https:\/\/docs.z.ai\/guides\/overview\/pricing\">Z.AI<\/a><\/p>\n<\/td>\n<\/tr>\n<tr>\n<td>\n<p>Qwen 3 Plus<\/p>\n<\/td>\n<td>\n<p>$0.40<\/p>\n<\/td>\n<td>\n<p>$1.20<\/p>\n<\/td>\n<td>\n<p>$1.60<\/p>\n<\/td>\n<td>\n<p><a href=\"https:\/\/www.alibabacloud.com\/en\/campaign\/qwen-ai-landing-page?_p_lc=1&amp;src=qwenai\">Alibaba Cloud<\/a><\/p>\n<\/td>\n<\/tr>\n<tr>\n<td>\n<p>ERNIE 5.0<\/p>\n<\/td>\n<td>\n<p>$0.85<\/p>\n<\/td>\n<td>\n<p>$3.40<\/p>\n<\/td>\n<td>\n<p>$4.25<\/p>\n<\/td>\n<td>\n<p><a href=\"https:\/\/cloud.baidu.com\/doc\/WENXINWORKSHOP\/s\/Blfmc9do4\">Qianfan<\/a><\/p>\n<\/td>\n<\/tr>\n<tr>\n<td>\n<p>Qwen-Max<\/p>\n<\/td>\n<td>\n<p>$1.60<\/p>\n<\/td>\n<td>\n<p>$6.40<\/p>\n<\/td>\n<td>\n<p>$8.00<\/p>\n<\/td>\n<td>\n<p><a href=\"https:\/\/www.alibabacloud.com\/en\/campaign\/qwen-ai-landing-page?_p_lc=1&amp;src=qwenai\">Alibaba Cloud<\/a><\/p>\n<\/td>\n<\/tr>\n<tr>\n<td>\n<p>GPT-5.1<\/p>\n<\/td>\n<td>\n<p>$1.25<\/p>\n<\/td>\n<td>\n<p>$10.00<\/p>\n<\/td>\n<td>\n<p>$11.25<\/p>\n<\/td>\n<td>\n<p><a href=\"https:\/\/openai.com\/pricing\">OpenAI<\/a><\/p>\n<\/td>\n<\/tr>\n<tr>\n<td>\n<p>Gemini 2.5 Pro (\u2264200K)<\/p>\n<\/td>\n<td>\n<p>$1.25<\/p>\n<\/td>\n<td>\n<p>$10.00<\/p>\n<\/td>\n<td>\n<p>$11.25<\/p>\n<\/td>\n<td>\n<p><a href=\"https:\/\/ai.google.dev\/gemini-api\/docs\/pricing\">Google<\/a><\/p>\n<\/td>\n<\/tr>\n<tr>\n<td>\n<p>Gemini 3 Pro (\u2264200K)<\/p>\n<\/td>\n<td>\n<p>$2.00<\/p>\n<\/td>\n<td>\n<p>$12.00<\/p>\n<\/td>\n<td>\n<p>$14.00<\/p>\n<\/td>\n<td>\n<p><a href=\"https:\/\/ai.google.dev\/gemini-api\/docs\/pricing\">Google<\/a><\/p>\n<\/td>\n<\/tr>\n<tr>\n<td>\n<p>Gemini 2.5 Pro 
(&gt;200K)<\/p>\n<\/td>\n<td>\n<p>$2.50<\/p>\n<\/td>\n<td>\n<p>$15.00<\/p>\n<\/td>\n<td>\n<p>$17.50<\/p>\n<\/td>\n<td>\n<p><a href=\"https:\/\/ai.google.dev\/gemini-api\/docs\/pricing\">Google<\/a><\/p>\n<\/td>\n<\/tr>\n<tr>\n<td>\n<p>Grok 4 (0709)<\/p>\n<\/td>\n<td>\n<p>$3.00<\/p>\n<\/td>\n<td>\n<p>$15.00<\/p>\n<\/td>\n<td>\n<p>$18.00<\/p>\n<\/td>\n<td>\n<p><a href=\"https:\/\/docs.x.ai\/docs\/models?cluster=us-east-1#detailed-pricing-for-all-grok-models\">xAI<\/a><\/p>\n<\/td>\n<\/tr>\n<tr>\n<td>\n<p>Gemini 3 Pro (&gt;200K)<\/p>\n<\/td>\n<td>\n<p>$4.00<\/p>\n<\/td>\n<td>\n<p>$18.00<\/p>\n<\/td>\n<td>\n<p>$22.00<\/p>\n<\/td>\n<td>\n<p><a href=\"https:\/\/ai.google.dev\/gemini-api\/docs\/pricing\">Google<\/a><\/p>\n<\/td>\n<\/tr>\n<tr>\n<td>\n<p>Claude Opus 4.1<\/p>\n<\/td>\n<td>\n<p>$15.00<\/p>\n<\/td>\n<td>\n<p>$75.00<\/p>\n<\/td>\n<td>\n<p>$90.00<\/p>\n<\/td>\n<td>\n<p><a href=\"https:\/\/docs.anthropic.com\/claude\/docs\/models-overview\">Anthropic<\/a><\/p>\n<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<h2><b>Previous Releases: GLM\u20114.5 Series and Enterprise Applications<\/b><\/h2>\n<p>Prior to GLM\u20114.6V, Z.ai released the GLM\u20114.5 family in mid-2025, establishing the company as a serious contender in open-source LLM development. <\/p>\n<p>The flagship GLM\u20114.5 and its smaller sibling GLM\u20114.5\u2011Air both support reasoning, tool use, coding, and agentic behaviors, while offering strong performance across standard benchmarks. <\/p>\n<p>The models introduced dual reasoning modes (\u201cthinking\u201d and \u201cnon-thinking\u201d) and could automatically generate complete PowerPoint presentations from a single prompt \u2014 a feature positioned for use in enterprise reporting, education, and internal comms workflows. 
Z.ai also extended the GLM\u20114.5 series with additional variants such as GLM\u20114.5\u2011X, AirX, and Flash, targeting ultra-fast inference and low-cost scenarios.<\/p>\n<p>Together, these features position the GLM\u20114.5 series as a cost-effective, open, and production-ready alternative for enterprises needing autonomy over model deployment, lifecycle management, and integration pipelines.<\/p>\n<h2><b>Ecosystem Implications<\/b><\/h2>\n<p>The GLM-4.6V release represents a notable advance in open-source multimodal AI. While large vision-language models have proliferated over the past year, few offer:<\/p>\n<ul>\n<li>\n<p>Integrated visual tool usage<\/p>\n<\/li>\n<li>\n<p>Structured multimodal generation<\/p>\n<\/li>\n<li>\n<p>Agent-oriented memory and decision logic<\/p>\n<\/li>\n<\/ul>\n<p>Zhipu AI\u2019s emphasis on \u201cclosing the loop\u201d from perception to action via native function calling marks a step toward agentic multimodal systems. <\/p>\n<p>The model\u2019s architecture and training pipeline show a continued evolution of the GLM family, positioning it competitively alongside offerings like OpenAI\u2019s GPT-4V and Google DeepMind\u2019s Gemini-VL.<\/p>\n<h2><b>Takeaway for Enterprise Leaders<\/b><\/h2>\n<p>With GLM-4.6V, Zhipu AI introduces an open-source VLM capable of native visual tool use, long-context reasoning, and frontend automation. 
It sets new performance marks among models of similar size and provides a scalable platform for building agentic, multimodal AI systems.<\/p>","protected":false},"excerpt":{"rendered":"<p>Chinese AI startup Zhipu AI ak&hellip;<\/p>\n","protected":false},"author":1,"featured_media":88,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[1],"tags":[],"class_list":["post-87","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-uncategorized"],"_links":{"self":[{"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=\/wp\/v2\/posts\/87","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=87"}],"version-history":[{"count":0,"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=\/wp\/v2\/posts\/87\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=\/wp\/v2\/media\/88"}],"wp:attachment":[{"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=87"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=87"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=87"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}