{"id":609,"date":"2026-03-25T02:49:40","date_gmt":"2026-03-24T18:49:40","guid":{"rendered":"https:\/\/connectword.dpdns.org\/?p=609"},"modified":"2026-03-25T02:49:40","modified_gmt":"2026-03-24T18:49:40","slug":"this-ai-paper-introduces-tinylora-a-13-parameter-fine-tuning-method-that-reaches-91-8-percent-gsm8k-on-qwen2-5-7b","status":"publish","type":"post","link":"https:\/\/connectword.dpdns.org\/?p=609","title":{"rendered":"This AI Paper Introduces TinyLoRA, A 13-Parameter Fine-Tuning Method That Reaches 91.8 Percent GSM8K on Qwen2.5-7B"},"content":{"rendered":"<p>Researchers from <strong>FAIR at Meta<\/strong>, <strong>Cornell University<\/strong>, and <strong>Carnegie Mellon University<\/strong> have demonstrated that large language models (LLMs) can learn to reason using a remarkably small number of trained parameters. The research team introduces <strong>TinyLoRA<\/strong>, a parameterization that can scale down to a single trainable parameter under extreme sharing settings. Using this method on a <strong>Qwen2.5-7B-Instruct<\/strong> backbone, the team achieved <strong>91.8% accuracy<\/strong> on the GSM8K benchmark with only <strong>13 parameters<\/strong>, totaling just 26 bytes in bf16.<\/p>\n<h3 class=\"wp-block-heading\"><strong>Overcoming the Constraints of Standard LoRA<\/strong><\/h3>\n<p>Standard Low-Rank Adaptation (LoRA) adapts a frozen linear layer <strong>W \u2208 R<sup><em>d\u00d7k<\/em><\/sup><\/strong> using trainable matrices <strong>A \u2208 R<sup><em>d\u00d7r<\/em><\/sup><\/strong> and <strong>B \u2208 R<sup><em>r\u00d7k<\/em><\/sup><\/strong>, so the low-rank update is the product <strong>AB<\/strong>. The trainable parameter count in standard LoRA still scales with layer width and rank, which leaves a nontrivial lower bound even at rank 1. 
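As a back-of-the-envelope check, that rank-1 floor can be counted directly. The module shapes below are the commonly published Llama3-8B dimensions (4096 hidden size, 14336 MLP width, grouped-query attention, 32 layers), assumed here for illustration rather than taken from the paper:

```python
# Rank-1 LoRA parameter floor: each adapted d-by-k linear layer still
# needs d*r + r*k trainable parameters, even at rank r = 1.
# (d, k) shapes per transformer layer, assumed Llama3-8B dimensions:
LLAMA3_8B_MODULES = [
    (4096, 4096),   # q_proj
    (4096, 1024),   # k_proj (grouped-query attention)
    (4096, 1024),   # v_proj
    (4096, 4096),   # o_proj
    (4096, 14336),  # gate_proj
    (4096, 14336),  # up_proj
    (14336, 4096),  # down_proj
]
N_LAYERS = 32

def lora_param_count(modules, n_layers, r):
    """Total trainable LoRA parameters at rank r over all layers."""
    return n_layers * sum(d * r + r * k for d, k in modules)

total = lora_param_count(LLAMA3_8B_MODULES, N_LAYERS, r=1)
print(total)  # 2621440, i.e. ~2.6M -- the "approximately 3 million" floor
```

Because the count is linear in both the rank and the layer widths, shrinking the rank to 1 still leaves millions of parameters; that is the floor TinyLoRA is designed to break through.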
For a model like Llama3-8B, this minimum update size is approximately <strong>3 million parameters<\/strong>.<\/p>\n<p>TinyLoRA circumvents this floor by building upon <strong>LoRA-XS<\/strong>, which utilizes the <strong>truncated Singular Value Decomposition (SVD)<\/strong> of the frozen weights. While LoRA-XS typically requires at least one parameter per adapted module, TinyLoRA replaces the trainable matrix with a low-dimensional trainable vector <strong><em>v<\/em> \u2208 R<sup><em>u<\/em><\/sup><\/strong> projected through a fixed random tensor <strong><em>P<\/em> \u2208 R<sup><em>u\u00d7r\u00d7r<\/em><\/sup><\/strong>.<\/p>\n<p><strong>The update rule is defined as:<\/strong><\/p>\n<div class=\"wp-block-mathml-mathmlblock\">$$W' = W + U\\Sigma\\left(\\sum_{i=1}^{u} v_i P_i\\right)V^{\\top}$$<\/div>\n<p>By applying a weight-tying factor <strong><em>n<\/em><sub><em>tie<\/em><\/sub><\/strong>, the total trainable parameter count scales as <em>O<\/em>(<em>nmu<\/em>\/<em>n<\/em><sub><em>tie<\/em><\/sub>), where <em>n<\/em> is the number of layers, <em>m<\/em> the number of adapted modules per layer, and <em>u<\/em> the dimension of the shared vector; updates thus scale down to a single parameter when all modules across all layers share the same one-dimensional vector.<\/p>\n<h3 class=\"wp-block-heading\"><strong>Reinforcement Learning: The Catalyst for Tiny Updates<\/strong><\/h3>\n<p>A core finding of the research is that <strong>Reinforcement Learning (RL)<\/strong> is fundamentally more efficient than <strong>Supervised Finetuning (SFT)<\/strong> at extremely low parameter counts. The research team reports that models trained via SFT require updates <strong>100 to 1,000 times larger<\/strong> to reach the same performance as those trained with RL.<\/p>\n<p>This gap is attributed to the \u2018information density\u2019 of the training signal. 
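For intuition, the TinyLoRA update rule can be sketched in NumPy with toy shapes. This is a minimal illustration of the parameterization described above, not the paper's implementation; the name `adapted_weight` and all dimensions are chosen for the example:

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, r, u = 16, 12, 2, 3           # toy layer dims, frozen rank, trainable dim

W = rng.standard_normal((d, k))     # frozen pretrained weight
U, s, Vt = np.linalg.svd(W, full_matrices=False)
U, S, Vt = U[:, :r], np.diag(s[:r]), Vt[:r, :]   # rank-r truncated SVD (frozen)

P = rng.standard_normal((u, r, r))  # fixed random projection tensor, never trained
v = np.zeros(u)                     # the only trainable parameters (u of them)

def adapted_weight(v):
    """W' = W + U Sigma (sum_i v_i P_i) V^T for a single adapted layer."""
    core = np.einsum("i,ijk->jk", v, P)   # sum_i v_i P_i, an r-by-r matrix
    return W + U @ S @ core @ Vt

# At initialization (v = 0) the update vanishes, so W' equals W exactly.
assert np.allclose(adapted_weight(np.zeros(u)), W)
```

Training touches only `v`; with weight tying, one such vector can be shared across many layers and modules, which is how the total drops to 13 (or even 1) parameters for a whole model.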
SFT forces a model to absorb many bits of information, including stylistic noise and irrelevant structure from human demonstrations, because its objective treats every token as equally informative. In contrast, RL (specifically <strong>Group Relative Policy Optimization<\/strong>, or <strong>GRPO<\/strong>) provides a sparser but cleaner signal. Because rewards are binary (e.g., exact match on a math answer), reward-relevant features correlate with the signal while irrelevant variations cancel out through resampling.<\/p>\n<h3 class=\"wp-block-heading\"><strong>Optimization Guidelines for Developers<\/strong><\/h3>\n<p><strong>The research team isolated several strategies to maximize the efficiency of tiny updates:<\/strong><\/p>\n<ul class=\"wp-block-list\">\n<li><strong>Optimal Frozen Rank (<em>r<\/em>):<\/strong> Analysis showed that a frozen SVD rank of <strong><em>r<\/em>=2<\/strong> was optimal. Higher ranks introduced too many degrees of freedom, complicating the optimization of the small trainable vector.<\/li>\n<li><strong>Tiling vs. Structured Sharing:<\/strong> The research team compared \u2018structured\u2019 sharing (modules of the same type share parameters) with \u2018<strong>tiling<\/strong>\u2019 (nearby modules of similar depth share parameters). 
Surprisingly, tiling was more effective, which suggests there is no inherent benefit to forcing parameter sharing exclusively among specific projection types such as Query or Key modules.<\/li>\n<li><strong>Precision:<\/strong> In bit-constrained regimes, storing parameters in <strong>fp32<\/strong> proved the most performant bit-for-bit, even after accounting for its larger footprint compared to <strong>bf16<\/strong> or <strong>fp16<\/strong>.<\/li>\n<\/ul>\n<h3 class=\"wp-block-heading\"><strong>Benchmark Performance<\/strong><\/h3>\n<p>The research team reports that <strong>Qwen-2.5<\/strong> models often needed around <strong>10x fewer<\/strong> updated parameters than <strong>LLaMA-3<\/strong> to reach similar performance in their setup.<\/p>\n<figure class=\"wp-block-table\">\n<table class=\"has-fixed-layout\">\n<thead>\n<tr>\n<th>Model<\/th>\n<th>Parameters Trained<\/th>\n<th>GSM8K Pass@1<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>Qwen2.5-7B-Instruct (Base)<\/td>\n<td>0<\/td>\n<td>88.2%<\/td>\n<\/tr>\n<tr>\n<td>Qwen2.5-7B-Instruct<\/td>\n<td>1<\/td>\n<td>82.0%<\/td>\n<\/tr>\n<tr>\n<td>Qwen2.5-7B-Instruct<\/td>\n<td>13<\/td>\n<td>91.8%<\/td>\n<\/tr>\n<tr>\n<td>Qwen2.5-7B-Instruct<\/td>\n<td>196<\/td>\n<td>92.2%<\/td>\n<\/tr>\n<tr>\n<td>Qwen2.5-7B-Instruct (Full FT)<\/td>\n<td>~7.6 Billion<\/td>\n<td>91.7%<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<\/figure>\n<p>On harder benchmarks such as <strong>MATH500<\/strong> and <strong>AIME24<\/strong>, 196-parameter updates for Qwen2.5-7B-Instruct retained <strong>87%<\/strong> of the absolute performance improvement of full finetuning across six difficult math benchmarks.<\/p>\n<h3 class=\"wp-block-heading\"><strong>Key Takeaways<\/strong><\/h3>\n<ul class=\"wp-block-list\">\n<li><strong>Extreme Parameter Efficiency<\/strong>: It is possible to train a <strong>Qwen2.5-7B-Instruct<\/strong> model to achieve <strong>91.8% accuracy<\/strong> on the 
GSM8K math benchmark using only <strong>13 parameters<\/strong> (26 total bytes).<\/li>\n<li><strong>The RL Advantage<\/strong>: Reinforcement Learning (RL) is fundamentally more efficient than Supervised Finetuning (SFT) in low-capacity regimes; SFT requires <strong>100\u20131000x larger updates<\/strong> to reach the same performance level as RL.<\/li>\n<li><strong>TinyLoRA Framework<\/strong>: The research team developed <strong>TinyLoRA<\/strong>, a new parameterization that uses weight tying and random projections to scale low-rank adapters down to a <strong>single trainable parameter<\/strong>.<\/li>\n<li><strong>Optimizing the \u201cMicro-Update\u201d<\/strong>: For these tiny updates, <strong>fp32 precision<\/strong> is more bit-efficient than half-precision formats, and <strong>\u201ctiling\u201d<\/strong> (sharing parameters by model depth) outperforms structured sharing by module type.<\/li>\n<li><strong>Scaling Trends<\/strong>: As models grow larger, they become more \u2018programmable\u2019 with fewer absolute parameters, suggesting that <strong>trillion-scale models<\/strong> could potentially be tuned for complex tasks using just a handful of bytes.<\/li>\n<\/ul>\n<hr class=\"wp-block-separator has-alpha-channel-opacity\" \/>\n<p>Check out the <strong><a href=\"https:\/\/arxiv.org\/pdf\/2602.04118\" target=\"_blank\" rel=\"noreferrer noopener\">Paper<\/a><\/strong>.<\/p>\n<p>The post <a href=\"https:\/\/www.marktechpost.com\/2026\/03\/24\/this-ai-paper-introduces-tinylora-a-13-parameter-fine-tuning-method-that-reaches-91-8-percent-gsm8k-on-qwen2-5-7b\/\">This AI Paper Introduces TinyLoRA, A 13-Parameter Fine-Tuning Method That Reaches 91.8 Percent GSM8K on Qwen2.5-7B<\/a> appeared first on <a href=\"https:\/\/www.marktechpost.com\/\">MarkTechPost<\/a>.<\/p>","protected":false},"excerpt":{"rendered":"<p>Researchers from FAIR at Meta,&hellip;<\/p>\n","protected":false},"author":1,"featured_media":29,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[1],"tags":[],"class_list":["post-609","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-uncategorized"],"_links":{"self":[{"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=\/wp\/v2\/posts\/609","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=609"}],"version-history":[{"count":0,"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=\/wp\/v2\/posts\/609\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=\/wp\/v2\/media\/29"}],"wp:attachment":[{"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=609"}],"wp:term":[{"taxonomy":"category","embeddable":true,"
href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=609"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=609"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}