{"id":519,"date":"2026-03-07T07:53:26","date_gmt":"2026-03-06T23:53:26","guid":{"rendered":"https:\/\/connectword.dpdns.org\/?p=519"},"modified":"2026-03-07T07:53:26","modified_gmt":"2026-03-06T23:53:26","slug":"microsoft-releases-phi-4-reasoning-vision-15b-a-compact-multimodal-model-for-math-science-and-gui-understanding","status":"publish","type":"post","link":"https:\/\/connectword.dpdns.org\/?p=519","title":{"rendered":"Microsoft Releases Phi-4-Reasoning-Vision-15B: A Compact Multimodal Model for Math, Science, and GUI Understanding"},"content":{"rendered":"<p>Microsoft has released <strong>Phi-4-reasoning-vision-15B<\/strong>, a <strong>15 billion parameter open-weight multimodal reasoning model<\/strong> designed for image and text tasks that require both perception and selective reasoning. It is a compact model built to balance reasoning quality, compute efficiency, and training-data requirements, with particular strength in <strong>scientific and mathematical reasoning<\/strong> and <strong>understanding user interfaces<\/strong>. 
<\/p>\n<div class=\"wp-block-image\">\n<figure class=\"aligncenter size-full is-resized\"><img loading=\"lazy\" decoding=\"async\" width=\"1628\" height=\"1104\" data-attachment-id=\"78253\" data-permalink=\"https:\/\/www.marktechpost.com\/2026\/03\/06\/microsoft-releases-phi-4-reasoning-vision-15b-a-compact-multimodal-model-for-math-science-and-gui-understanding\/screenshot-2026-03-06-at-3-45-47-pm\/\" data-orig-file=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2026\/03\/Screenshot-2026-03-06-at-3.45.47-PM.png\" data-orig-size=\"1628,1104\" data-comments-opened=\"1\" data-image-meta='{\"aperture\":\"0\",\"credit\":\"\",\"camera\":\"\",\"caption\":\"\",\"created_timestamp\":\"0\",\"copyright\":\"\",\"focal_length\":\"0\",\"iso\":\"0\",\"shutter_speed\":\"0\",\"title\":\"\",\"orientation\":\"0\"}' data-image-title=\"Screenshot 2026-03-06 at 3.45.47\u202fPM\" data-image-description=\"\" data-image-caption=\"\" data-medium-file=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2026\/03\/Screenshot-2026-03-06-at-3.45.47-PM-300x203.png\" data-large-file=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2026\/03\/Screenshot-2026-03-06-at-3.45.47-PM-1024x694.png\" src=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2026\/03\/Screenshot-2026-03-06-at-3.45.47-PM.png\" alt=\"\" class=\"wp-image-78253\" \/><figcaption class=\"wp-element-caption\">https:\/\/arxiv.org\/pdf\/2603.03975<\/figcaption><\/figure>\n<\/div>\n<h3 class=\"wp-block-heading\"><strong>What the model is built on?<\/strong><\/h3>\n<p>Phi-4-reasoning-vision-15B combines the <strong>Phi-4-Reasoning<\/strong> language backbone with the <strong>SigLIP-2<\/strong> vision encoder using a <strong>mid-fusion architecture<\/strong>. In this setup, the vision encoder first converts images into visual tokens, then those tokens are projected into the language model embedding space and processed by the pretrained language model. 
This design acts as a practical trade-off: it preserves strong cross-modal reasoning while keeping training and inference costs manageable compared with heavier early-fusion designs. <\/p>\n<div class=\"wp-block-image\">\n<figure class=\"aligncenter size-full is-resized\"><img loading=\"lazy\" decoding=\"async\" width=\"1554\" height=\"776\" data-attachment-id=\"78258\" data-permalink=\"https:\/\/www.marktechpost.com\/2026\/03\/06\/microsoft-releases-phi-4-reasoning-vision-15b-a-compact-multimodal-model-for-math-science-and-gui-understanding\/screenshot-2026-03-06-at-3-46-48-pm-2\/\" data-orig-file=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2026\/03\/Screenshot-2026-03-06-at-3.46.48-PM-1.png\" data-orig-size=\"1554,776\" data-comments-opened=\"1\" data-image-meta='{\"aperture\":\"0\",\"credit\":\"\",\"camera\":\"\",\"caption\":\"\",\"created_timestamp\":\"0\",\"copyright\":\"\",\"focal_length\":\"0\",\"iso\":\"0\",\"shutter_speed\":\"0\",\"title\":\"\",\"orientation\":\"0\"}' data-image-title=\"Screenshot 2026-03-06 at 3.46.48\u202fPM\" data-image-description=\"\" data-image-caption=\"\" data-medium-file=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2026\/03\/Screenshot-2026-03-06-at-3.46.48-PM-1-300x150.png\" data-large-file=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2026\/03\/Screenshot-2026-03-06-at-3.46.48-PM-1-1024x511.png\" src=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2026\/03\/Screenshot-2026-03-06-at-3.46.48-PM-1.png\" alt=\"\" class=\"wp-image-78258\" \/><figcaption class=\"wp-element-caption\">https:\/\/arxiv.org\/pdf\/2603.03975<\/figcaption><\/figure>\n<\/div>\n<h3 class=\"wp-block-heading\"><strong>Why Microsoft took the smaller-model route<\/strong><\/h3>\n<p>Many recent vision-language models have grown in parameter count and token usage, which raises both latency and deployment cost. 
Phi-4-reasoning-vision-15B was built as a smaller alternative that still handles common multimodal workloads without relying on extremely large training datasets or excessive inference-time token generation. The model was trained on <strong>200 billion multimodal tokens<\/strong>, building on <strong>Phi-4-Reasoning<\/strong>, which was trained on <strong>16 billion tokens<\/strong>, and ultimately on the <strong>Phi-4<\/strong> base model, which was trained on <strong>400 billion unique tokens<\/strong>. Microsoft contrasts that with the <strong>more than 1 trillion tokens<\/strong> used to train several recent multimodal models such as <strong>Qwen 2.5 VL<\/strong>, <strong>Qwen 3 VL<\/strong>, <strong>Kimi-VL<\/strong>, and <strong>Gemma 3<\/strong>.<\/p>\n<div class=\"wp-block-image\">\n<figure class=\"aligncenter size-full is-resized\"><img loading=\"lazy\" decoding=\"async\" width=\"1578\" height=\"824\" data-attachment-id=\"78256\" data-permalink=\"https:\/\/www.marktechpost.com\/2026\/03\/06\/microsoft-releases-phi-4-reasoning-vision-15b-a-compact-multimodal-model-for-math-science-and-gui-understanding\/screenshot-2026-03-06-at-3-46-20-pm-2\/\" data-orig-file=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2026\/03\/Screenshot-2026-03-06-at-3.46.20-PM-1.png\" data-orig-size=\"1578,824\" data-comments-opened=\"1\" data-image-meta='{\"aperture\":\"0\",\"credit\":\"\",\"camera\":\"\",\"caption\":\"\",\"created_timestamp\":\"0\",\"copyright\":\"\",\"focal_length\":\"0\",\"iso\":\"0\",\"shutter_speed\":\"0\",\"title\":\"\",\"orientation\":\"0\"}' data-image-title=\"Screenshot 2026-03-06 at 3.46.20\u202fPM\" data-image-description=\"\" data-image-caption=\"\" data-medium-file=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2026\/03\/Screenshot-2026-03-06-at-3.46.20-PM-1-300x157.png\" data-large-file=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2026\/03\/Screenshot-2026-03-06-at-3.46.20-PM-1-1024x535.png\" 
src=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2026\/03\/Screenshot-2026-03-06-at-3.46.20-PM-1.png\" alt=\"\" class=\"wp-image-78256\" \/><figcaption class=\"wp-element-caption\">https:\/\/arxiv.org\/pdf\/2603.03975<\/figcaption><\/figure>\n<\/div>\n<h3 class=\"wp-block-heading\"><strong>High-resolution perception was a core design choice<\/strong><\/h3>\n<p>Microsoft team explains one of the more useful technical lessons in their technical report that multimodal reasoning often fails because perception fails first. Models can miss the answer not because they lack reasoning ability, but because they fail to extract the relevant visual details from dense images such as screenshots, documents, or interfaces with small interactive elements. <\/p>\n<p>Phi-4-reasoning-vision-15B uses a <strong>dynamic resolution vision encoder with up to 3,600 visual tokens<\/strong>, which is intended to support high-resolution understanding for tasks such as <strong>GUI grounding<\/strong> and <strong>fine-grained document analysis<\/strong>. The Microsoft team states that <strong>high-resolution, dynamic-resolution encoders yield consistent improvements<\/strong>, and explicitly notes that <strong>accurate perception is a prerequisite for high-quality reasoning<\/strong>. <\/p>\n<h3 class=\"wp-block-heading\"><strong>Mixed reasoning instead of forcing reasoning everywhere<\/strong><\/h3>\n<p>A second important design decision is the model\u2019s <strong>mixed reasoning and non-reasoning training strategy<\/strong>. Rather than forcing chain-of-thought-style reasoning for all tasks, Microsoft team trained the model to switch between two modes. Reasoning samples include <strong><code>&lt;think&gt;...&lt;\/think&gt;<\/code><\/strong> traces, while non-reasoning samples begin with <strong><code>&lt;nothink&gt;<\/code><\/strong> and are used for perception-focused tasks such as <strong>captioning, grounding, OCR, and simple VQA<\/strong>. 
The reasoning data makes up <strong>about 20%<\/strong> of the overall training mixture.<\/p>\n<p>The goal of this hybrid setup is to let the model respond directly on tasks where longer reasoning adds latency without improving accuracy, while still invoking structured reasoning on tasks such as math and science. The Microsoft team also notes an important limitation: the boundary between these modes is learned implicitly, so switching is not always optimal. Users can override the default behavior through explicit prompting with <strong><code>&lt;think&gt;<\/code><\/strong> or <strong><code>&lt;nothink&gt;<\/code><\/strong> tokens.<\/p>\n<h3 class=\"wp-block-heading\"><strong>Where is the model strongest?<\/strong><\/h3>\n<p>The Microsoft team highlights two main application areas. The first is <strong>scientific and mathematical reasoning over visual inputs<\/strong>, including handwritten equations, diagrams, charts, tables, and quantitative documents. The second is <strong>computer-use agent tasks<\/strong>, where the model interprets screen content, localizes GUI elements, and supports interaction with desktop, web, or mobile interfaces.<\/p>\n<div class=\"wp-block-image\">\n<figure class=\"aligncenter size-full is-resized\"><img loading=\"lazy\" decoding=\"async\" width=\"1556\" height=\"656\" data-attachment-id=\"78252\" data-permalink=\"https:\/\/www.marktechpost.com\/2026\/03\/06\/microsoft-releases-phi-4-reasoning-vision-15b-a-compact-multimodal-model-for-math-science-and-gui-understanding\/screenshot-2026-03-06-at-3-43-36-pm-2\/\" data-orig-file=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2026\/03\/Screenshot-2026-03-06-at-3.43.36-PM-1.png\" data-orig-size=\"1556,656\" data-comments-opened=\"1\" data-image-meta='{\"aperture\":\"0\",\"credit\":\"\",\"camera\":\"\",\"caption\":\"\",\"created_timestamp\":\"0\",\"copyright\":\"\",\"focal_length\":\"0\",\"iso\":\"0\",\"shutter_speed\":\"0\",\"title\":\"\",\"orientation\":\"0\"}' data-image-title=\"Screenshot 
2026-03-06 at 3.43.36\u202fPM\" data-image-description=\"\" data-image-caption=\"\" data-medium-file=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2026\/03\/Screenshot-2026-03-06-at-3.43.36-PM-1-300x126.png\" data-large-file=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2026\/03\/Screenshot-2026-03-06-at-3.43.36-PM-1-1024x432.png\" src=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2026\/03\/Screenshot-2026-03-06-at-3.43.36-PM-1.png\" alt=\"\" class=\"wp-image-78252\" \/><figcaption class=\"wp-element-caption\">https:\/\/arxiv.org\/pdf\/2603.03975<\/figcaption><\/figure>\n<\/div>\n<h3 class=\"wp-block-heading\"><strong>Benchmark results<\/strong><\/h3>\n<p>The Microsoft team reports the following benchmark scores for Phi-4-reasoning-vision-15B: <strong>84.8 on AI2D (test)<\/strong>, <strong>83.3 on ChartQA (test)<\/strong>, <strong>44.9 on MathVerse (mini)<\/strong>, <strong>36.2 on MathVision (mini)<\/strong>, <strong>75.2 on MathVista (mini)<\/strong>, <strong>54.3 on MMMU (val)<\/strong>, <strong>64.5 on MMStar<\/strong>, <strong>76.0 on OCRBench<\/strong>, and <strong>88.2 on ScreenSpot-v2<\/strong>. 
The technical report also notes that these results were generated using <strong>Eureka ML Insights<\/strong> and <strong>VLMEvalKit<\/strong>, with fixed evaluation settings, and that the Microsoft team presents them as comparative results rather than leaderboard claims.<\/p>\n<h3 class=\"wp-block-heading\"><strong>Key Takeaways<\/strong><\/h3>\n<ul class=\"wp-block-list\">\n<li><strong>Phi-4-reasoning-vision-15B is a 15B open-weight multimodal model<\/strong> built by combining <strong>Phi-4-Reasoning<\/strong> with the <strong>SigLIP-2<\/strong> vision encoder in a <strong>mid-fusion architecture<\/strong>.<\/li>\n<li><strong>The Microsoft team designed the model for compact multimodal reasoning<\/strong>, with a focus on <strong>math, science, document understanding, and GUI grounding<\/strong>, rather than scaling to a much larger parameter count.<\/li>\n<li><strong>High-resolution visual perception is a core part of the system<\/strong>, with support for <strong>dynamic resolution encoding and up to 3,600 visual tokens<\/strong>, which helps on dense screenshots, documents, and interface-heavy tasks.<\/li>\n<li><strong>The model uses mixed reasoning and non-reasoning training<\/strong>, allowing it to switch between <strong><code>&lt;think&gt;<\/code><\/strong> and <strong><code>&lt;nothink&gt;<\/code><\/strong> modes depending on whether a task needs explicit reasoning or direct perception-based output.<\/li>\n<li><strong>Microsoft\u2019s reported benchmarks show strong performance for its size<\/strong>, including results on <strong>AI2D (test), ChartQA (test), MathVista (mini), OCRBench, and ScreenSpot-v2<\/strong>, which supports its positioning as a compact but capable vision-language reasoning model.<\/li>\n<\/ul>\n<hr class=\"wp-block-separator has-alpha-channel-opacity\" \/>\n<p>Check out the\u00a0<strong><a href=\"https:\/\/arxiv.org\/pdf\/2603.03975\" target=\"_blank\" rel=\"noreferrer noopener\">Paper<\/a>, <a 
href=\"https:\/\/github.com\/microsoft\/Phi-4-reasoning-vision-15B\" target=\"_blank\" rel=\"noreferrer noopener\">Repo<\/a> <\/strong>and<strong> <a href=\"https:\/\/huggingface.co\/microsoft\/Phi-4-reasoning-vision-15B\" target=\"_blank\" rel=\"noreferrer noopener\">Model Weights<\/a>.\u00a0<\/strong>Also,\u00a0feel free to follow us on\u00a0<strong><a href=\"https:\/\/x.com\/intent\/follow?screen_name=marktechpost\" target=\"_blank\" rel=\"noreferrer noopener\"><mark>Twitter<\/mark><\/a><\/strong>\u00a0and don\u2019t forget to join our\u00a0<strong><a href=\"https:\/\/www.reddit.com\/r\/machinelearningnews\/\" target=\"_blank\" rel=\"noreferrer noopener\">120k+ ML SubReddit<\/a><\/strong>\u00a0and Subscribe to\u00a0<strong><a href=\"https:\/\/www.aidevsignals.com\/\" target=\"_blank\" rel=\"noreferrer noopener\">our Newsletter<\/a><\/strong>. Wait! are you on telegram?\u00a0<strong><a href=\"https:\/\/t.me\/machinelearningresearchnews\" target=\"_blank\" rel=\"noreferrer noopener\">now you can join us on telegram as well.<\/a><\/strong><\/p>\n<p>The post <a href=\"https:\/\/www.marktechpost.com\/2026\/03\/06\/microsoft-releases-phi-4-reasoning-vision-15b-a-compact-multimodal-model-for-math-science-and-gui-understanding\/\">Microsoft Releases Phi-4-Reasoning-Vision-15B: A Compact Multimodal Model for Math, Science, and GUI Understanding<\/a> appeared first on <a href=\"https:\/\/www.marktechpost.com\/\">MarkTechPost<\/a>.<\/p>","protected":false},"excerpt":{"rendered":"<p>Microsoft has released 
Phi-4-r&hellip;<\/p>\n","protected":false},"author":1,"featured_media":520,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[1],"tags":[],"class_list":["post-519","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-uncategorized"],"_links":{"self":[{"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=\/wp\/v2\/posts\/519","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=519"}],"version-history":[{"count":0,"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=\/wp\/v2\/posts\/519\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=\/wp\/v2\/media\/520"}],"wp:attachment":[{"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=519"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=519"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=519"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}