{"id":348,"date":"2026-02-02T15:26:12","date_gmt":"2026-02-02T07:26:12","guid":{"rendered":"https:\/\/connectword.dpdns.org\/?p=348"},"modified":"2026-02-02T15:26:12","modified_gmt":"2026-02-02T07:26:12","slug":"nvidia-ai-brings-nemotron-3-nano-30b-to-nvfp4-with-quantization-aware-distillation-qad-for-efficient-reasoning-inference","status":"publish","type":"post","link":"https:\/\/connectword.dpdns.org\/?p=348","title":{"rendered":"NVIDIA AI Brings Nemotron-3-Nano-30B to NVFP4 with Quantization Aware Distillation (QAD) for Efficient Reasoning Inference"},"content":{"rendered":"<p>NVIDIA has released <strong>Nemotron-Nano-3-30B-A3B-NVFP4<\/strong>, a production checkpoint that runs a 30B parameter reasoning model in <strong>4 bit NVFP4<\/strong> format while keeping accuracy close to its BF16 baseline. The model combines a hybrid <strong>Mamba2 Transformer Mixture of Experts<\/strong> architecture with a <strong>Quantization Aware Distillation (QAD)<\/strong> recipe designed specifically for NVFP4 deployment. 
Overall, it is an ultra-efficient NVFP4 precision version of Nemotron-3-Nano that delivers up to 4x higher throughput on Blackwell B200.<\/p>\n<div class=\"wp-block-image\">\n<figure class=\"aligncenter size-full is-resized\"><img loading=\"lazy\" decoding=\"async\" width=\"2560\" height=\"1440\" data-attachment-id=\"77634\" data-permalink=\"https:\/\/www.marktechpost.com\/2026\/02\/01\/nvidia-ai-brings-nemotron-3-nano-30b-to-nvfp4-with-quantization-aware-distillation-qad-for-efficient-reasoning-inference\/g_w1-dbxuau2mwv-2\/\" data-orig-file=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2026\/02\/G_w1-DBXUAU2mwv-1-scaled.jpeg\" data-orig-size=\"2560,1440\" data-comments-opened=\"1\" data-image-meta='{\"aperture\":\"0\",\"credit\":\"\",\"camera\":\"\",\"caption\":\"\",\"created_timestamp\":\"0\",\"copyright\":\"\",\"focal_length\":\"0\",\"iso\":\"0\",\"shutter_speed\":\"0\",\"title\":\"\",\"orientation\":\"0\"}' data-image-title=\"G_w1-DBXUAU2mwv\" data-image-description=\"\" data-image-caption=\"\" data-medium-file=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2026\/02\/G_w1-DBXUAU2mwv-1-300x169.jpeg\" data-large-file=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2026\/02\/G_w1-DBXUAU2mwv-1-1024x576.jpeg\" src=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2026\/02\/G_w1-DBXUAU2mwv-1-scaled.jpeg\" alt=\"\" class=\"wp-image-77634\" \/><figcaption class=\"wp-element-caption\">https:\/\/huggingface.co\/nvidia\/NVIDIA-Nemotron-3-Nano-30B-A3B-NVFP4<\/figcaption><\/figure>\n<\/div>\n<h3 class=\"wp-block-heading\"><strong>What is Nemotron-Nano-3-30B-A3B-NVFP4?<\/strong><\/h3>\n<p><strong>Nemotron-Nano-3-30B-A3B-NVFP4<\/strong> is a quantized version of <strong>Nemotron-3-Nano-30B-A3B-BF16<\/strong>, trained from scratch by the NVIDIA team as a unified reasoning and chat model. 
It is built as a <strong>hybrid Mamba2 Transformer MoE<\/strong> network:<\/p>\n<ul class=\"wp-block-list\">\n<li>30B parameters in total<\/li>\n<li>52 layers in depth<\/li>\n<li>23 Mamba2 and MoE layers<\/li>\n<li>6 grouped query attention layers with 2 groups<\/li>\n<li>Each MoE layer has 128 routed experts and 1 shared expert<\/li>\n<li>6 experts are active per token, which gives about 3.5B active parameters per token<\/li>\n<\/ul>\n<p>The model is pre-trained on <strong>25T tokens<\/strong> using a <strong>Warmup Stable Decay<\/strong> learning rate schedule with a batch size of 3072, a peak learning rate of 1e-3 and a minimum learning rate of 1e-5. <\/p>\n<p><strong>Post training follows a 3 stage pipeline:<\/strong><\/p>\n<ol class=\"wp-block-list\">\n<li><strong>Supervised fine tuning<\/strong> on synthetic and curated data for code, math, science, tool calling, instruction following and structured outputs.<\/li>\n<li><strong>Reinforcement learning<\/strong> with synchronous GRPO across multi step tool use, multi turn chat and structured environments, and RLHF with a generative reward model.<\/li>\n<li><strong>Post training quantization<\/strong> to NVFP4 with FP8 KV cache and a selective high precision layout, followed by QAD.<\/li>\n<\/ol>\n<p>The NVFP4 checkpoint keeps the attention layers and the Mamba layers that feed into them in BF16, quantizes the remaining layers to NVFP4 and uses FP8 for the KV cache.<\/p>\n<h3 class=\"wp-block-heading\"><strong>NVFP4 format and why it matters?<\/strong><\/h3>\n<p><strong>NVFP4<\/strong> is a <strong>4 bit floating point<\/strong> format designed for both training and inference on recent NVIDIA GPUs. 
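As a rough illustration of what such a format does numerically, the sketch below emulates NVFP4 style quantize-dequantize in NumPy. This is not NVIDIA's kernel: the E2M1 value grid, the E4M3 rounding helper and all function names are assumptions made for illustration, and real NVFP4 runs on Blackwell tensor cores, not in NumPy.

```python
import numpy as np

# FP4 E2M1 representable magnitudes (illustrative; sign handled separately).
E2M1_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def _round_e4m3(v):
    # Crude emulation of rounding a positive scale to FP8 E4M3 precision:
    # 3 mantissa bits, max normal value 448. Not bit exact.
    m, e = np.frexp(v)                    # v = m * 2**e with m in [0.5, 1)
    m = np.round(m * 16.0) / 16.0         # keep 1 implicit + 3 mantissa bits
    return float(np.clip(m * 2.0 ** e, 2.0 ** -9, 448.0))

def fake_quant_nvfp4(x, block=16):
    """Quantize-dequantize x with a per block E4M3 scale and a per tensor FP32 scale."""
    x = np.asarray(x, dtype=np.float32)
    flat = x.ravel()
    pad = (-flat.size) % block
    blocks = np.pad(flat, (0, pad)).reshape(-1, block)
    amax = float(np.abs(flat).max())
    # Per tensor FP32 scale chosen so the largest block scale lands at E4M3 max (448).
    tensor_scale = amax / (6.0 * 448.0) if amax > 0 else 1.0
    out = np.zeros_like(blocks)
    for i, blk in enumerate(blocks):
        bmax = float(np.abs(blk).max())
        if bmax == 0.0:
            continue  # all-zero block stays zero
        # Per block scale maps the block's max magnitude onto 6.0, the E2M1 max.
        block_scale = _round_e4m3(bmax / 6.0 / tensor_scale)
        scale = block_scale * tensor_scale
        # Round each value to the nearest representable E2M1 magnitude.
        idx = np.abs(np.abs(blk)[:, None] / scale - E2M1_GRID).argmin(axis=1)
        out[i] = np.sign(blk) * E2M1_GRID[idx] * scale
    return out.ravel()[:flat.size].reshape(x.shape)
```

The per block scale adapts to local statistics while the per tensor scale extends dynamic range, which is the two level idea in miniature.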
The main properties of NVFP4:<\/p>\n<ul class=\"wp-block-list\">\n<li>Compared with FP8, NVFP4 delivers <strong>2 to 3 times higher arithmetic throughput<\/strong>.<\/li>\n<li>It reduces memory usage by about <strong>1.8 times<\/strong> for weights and activations.<\/li>\n<li>It extends MXFP4 by reducing the <strong>block size from 32 to 16<\/strong> and introduces <strong>two level scaling<\/strong>.<\/li>\n<\/ul>\n<p>The two level scaling uses <strong>E4M3-FP8 scales per block<\/strong> and an <strong>FP32 scale per tensor<\/strong>. The smaller block size allows the quantizer to adapt to local statistics and the dual scaling increases dynamic range while keeping quantization error low.<\/p>\n<p>For very large LLMs, simple <strong>post training quantization (PTQ)<\/strong> to NVFP4 already gives decent accuracy across benchmarks. For smaller models, especially those that go through heavy post training pipelines, the research team notes that PTQ causes <strong>non negligible accuracy drops<\/strong>, which motivates a training based recovery method.<\/p>\n<h3 class=\"wp-block-heading\"><strong>From QAT to QAD<\/strong><\/h3>\n<p>Standard <strong>Quantization Aware Training (QAT)<\/strong> inserts pseudo quantization into the forward pass and reuses the <strong>original task loss<\/strong>, such as next token cross entropy. This works well for convolutional networks, <strong>but the research team lists 2 main issues for modern LLMs:<\/strong><\/p>\n<ul class=\"wp-block-list\">\n<li>Complex multi stage post training pipelines with SFT, RL and model merging are hard to reproduce.<\/li>\n<li>Original training data for open models is often unavailable in public form.<\/li>\n<\/ul>\n<p><strong>Quantization Aware Distillation (QAD)<\/strong> changes the objective instead of the full pipeline. A frozen <strong>BF16 model acts as the teacher<\/strong> and the NVFP4 model is the student. 
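A teacher-student distillation objective of this kind can be sketched in a few lines of NumPy. This is a minimal illustration, not NVIDIA's implementation: the function names and the [batch, seq, vocab] logit shapes are assumptions for the example.

```python
import numpy as np

def log_softmax(logits, axis=-1):
    # Numerically stable log-softmax over the vocabulary axis.
    z = logits - logits.max(axis=axis, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=axis, keepdims=True))

def qad_loss(teacher_logits, student_logits):
    """Mean per token KL(teacher || student) over output token distributions.

    Here the teacher stands for the frozen BF16 model (no gradients) and the
    student for the fake-quantized NVFP4 model; shapes are [batch, seq, vocab].
    """
    t_logp = log_softmax(teacher_logits)
    s_logp = log_softmax(student_logits)
    kl = (np.exp(t_logp) * (t_logp - s_logp)).sum(axis=-1)  # per-token KL
    return float(kl.mean())
```

Because the loss only needs logits from both models on some input text, it works without the original labels, reward models or RL environments.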
Training minimizes <strong>KL divergence<\/strong> between their output token distributions, not the original supervised or RL objective.<\/p>\n<p><strong>The research team highlights 3 properties of QAD:<\/strong><\/p>\n<ol class=\"wp-block-list\">\n<li>It aligns the quantized model with the high precision teacher more accurately than QAT.<\/li>\n<li>It stays stable even when the teacher has already gone through several stages, such as supervised fine tuning, reinforcement learning and model merging, because QAD only tries to match the final teacher behavior.<\/li>\n<li>It works with partial, synthetic or filtered data, because it only needs input text to query the teacher and student, not the original labels or reward models.<\/li>\n<\/ol>\n<h3 class=\"wp-block-heading\"><strong>Benchmarks on Nemotron-3-Nano-30B<\/strong><\/h3>\n<p>Nemotron-3-Nano-30B-A3B is one of the RL heavy models in the QAD research. The table below shows accuracy on AA-LCR, AIME25, GPQA-D, LiveCodeBench-v5 and SciCode-TQ for the NVFP4-QAT and NVFP4-QAD variants.<\/p>\n<div class=\"wp-block-image\">\n<figure class=\"aligncenter size-full is-resized\"><img loading=\"lazy\" decoding=\"async\" width=\"2094\" height=\"1068\" data-attachment-id=\"77630\" data-permalink=\"https:\/\/www.marktechpost.com\/2026\/02\/01\/nvidia-ai-brings-nemotron-3-nano-30b-to-nvfp4-with-quantization-aware-distillation-qad-for-efficient-reasoning-inference\/screenshot-2026-02-01-at-10-43-40-pm\/\" data-orig-file=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2026\/02\/Screenshot-2026-02-01-at-10.43.40-PM.png\" data-orig-size=\"2094,1068\" data-comments-opened=\"1\" data-image-meta='{\"aperture\":\"0\",\"credit\":\"\",\"camera\":\"\",\"caption\":\"\",\"created_timestamp\":\"0\",\"copyright\":\"\",\"focal_length\":\"0\",\"iso\":\"0\",\"shutter_speed\":\"0\",\"title\":\"\",\"orientation\":\"0\"}' data-image-title=\"Screenshot 2026-02-01 at 10.43.40\u202fPM\" data-image-description=\"\" data-image-caption=\"\" 
data-medium-file=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2026\/02\/Screenshot-2026-02-01-at-10.43.40-PM-300x153.png\" data-large-file=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2026\/02\/Screenshot-2026-02-01-at-10.43.40-PM-1024x522.png\" src=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2026\/02\/Screenshot-2026-02-01-at-10.43.40-PM.png\" alt=\"\" class=\"wp-image-77630\" \/><figcaption class=\"wp-element-caption\">https:\/\/research.nvidia.com\/labs\/nemotron\/files\/NVFP4-QAD-Report.pdf<\/figcaption><\/figure>\n<\/div>\n<h3 class=\"wp-block-heading\"><strong>Key Takeaways<\/strong><\/h3>\n<ul class=\"wp-block-list\">\n<li><strong>Nemotron-3-Nano-30B-A3B-NVFP4 is a 30B parameter hybrid Mamba2 Transformer MoE model<\/strong> that runs in 4 bit NVFP4 with FP8 KV cache and a small set of BF16 layers preserved for stability, while keeping about 3.5B active parameters per token and supporting context windows up to 1M tokens.<\/li>\n<li><strong>NVFP4 is a 4 bit floating point format with block size 16 and two level scaling<\/strong>, using E4M3-FP8 per block scales and an FP32 per tensor scale, which gives about 2 to 3 times higher arithmetic throughput and about 1.8 times lower memory cost than FP8 for weights and activations.<\/li>\n<li><strong>Quantization Aware Distillation (QAD) replaces the original task loss with KL divergence to a frozen BF16 teacher<\/strong>, so the NVFP4 student directly matches the teacher\u2019s output distribution without replaying the full SFT, RL and model merge pipeline or needing the original reward models.<\/li>\n<li>Using the new Quantization Aware Distillation method, the NVFP4 version achieves up to <strong>99.4% of BF16 accuracy<\/strong>.<\/li>\n<li><strong>On AA-LCR, AIME25, GPQA-D, LiveCodeBench and SciCode, NVFP4-PTQ shows noticeable accuracy loss and NVFP4-QAT degrades further<\/strong>, while NVFP4-QAD recovers performance to near BF16 levels, reducing the gap to only a few points across 
these reasoning and coding benchmarks.<\/li>\n<\/ul>\n<hr class=\"wp-block-separator has-alpha-channel-opacity\" \/>\n<p>Check out the\u00a0<strong><a href=\"https:\/\/research.nvidia.com\/labs\/nemotron\/files\/NVFP4-QAD-Report.pdf\" target=\"_blank\" rel=\"noreferrer noopener\">Paper<\/a> and <a href=\"https:\/\/huggingface.co\/nvidia\/NVIDIA-Nemotron-3-Nano-30B-A3B-NVFP4\" target=\"_blank\" rel=\"noreferrer noopener\">Model Weights<\/a><\/strong>.\u00a0Also,\u00a0feel free to follow us on\u00a0<strong><a href=\"https:\/\/x.com\/intent\/follow?screen_name=marktechpost\" target=\"_blank\" rel=\"noreferrer noopener\"><mark>Twitter<\/mark><\/a><\/strong>\u00a0and don\u2019t forget to join our\u00a0<strong><a href=\"https:\/\/www.reddit.com\/r\/machinelearningnews\/\" target=\"_blank\" rel=\"noreferrer noopener\">100k+ ML SubReddit<\/a><\/strong>\u00a0and subscribe to\u00a0<strong><a href=\"https:\/\/www.aidevsignals.com\/\" target=\"_blank\" rel=\"noreferrer noopener\">our Newsletter<\/a><\/strong>. Wait, 
are you on telegram?\u00a0<strong><a href=\"https:\/\/t.me\/machinelearningresearchnews\" target=\"_blank\" rel=\"noreferrer noopener\">now you can join us on telegram as well.<\/a><\/strong><\/p>\n<p>The post <a href=\"https:\/\/www.marktechpost.com\/2026\/02\/01\/nvidia-ai-brings-nemotron-3-nano-30b-to-nvfp4-with-quantization-aware-distillation-qad-for-efficient-reasoning-inference\/\">NVIDIA AI Brings Nemotron-3-Nano-30B to NVFP4 with Quantization Aware Distillation (QAD) for Efficient Reasoning Inference<\/a> appeared first on <a href=\"https:\/\/www.marktechpost.com\/\">MarkTechPost<\/a>.<\/p>","protected":false},"excerpt":{"rendered":"<p>NVIDIA has released Nemotron-N&hellip;<\/p>\n","protected":false},"author":1,"featured_media":349,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[1],"tags":[],"class_list":["post-348","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-uncategorized"],"_links":{"self":[{"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=\/wp\/v2\/posts\/348","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=348"}],"version-history":[{"count":0,"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=\/wp\/v2\/posts\/348\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=\/wp\/v2\/media\/349"}],"wp:attachment":[{"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=348"}],"wp:term":[{"taxonomy":"category","e
mbeddable":true,"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=348"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=348"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}