{"id":415,"date":"2026-02-15T16:17:46","date_gmt":"2026-02-15T08:17:46","guid":{"rendered":"https:\/\/connectword.dpdns.org\/?p=415"},"modified":"2026-02-15T16:17:46","modified_gmt":"2026-02-15T08:17:46","slug":"meet-kani-tts-2-a-400m-param-open-source-text-to-speech-model-that-runs-in-3gb-vram-with-voice-cloning-support","status":"publish","type":"post","link":"https:\/\/connectword.dpdns.org\/?p=415","title":{"rendered":"Meet \u2018Kani-TTS-2\u2019: A 400M Param Open Source Text-to-Speech Model that Runs in 3GB VRAM with Voice Cloning Support"},"content":{"rendered":"<p>The landscape of generative audio is shifting toward efficiency. A new open-source contender, <strong>Kani-TTS-2<\/strong>, has been released by the team at <strong>nineninesix<\/strong>.ai. This model marks a departure from heavy, compute-expensive TTS systems. Instead, it treats audio as a language, delivering high-fidelity speech synthesis with a remarkably small footprint.<\/p>\n<p>Kani-TTS-2 offers a lean, high-performance alternative to closed-source APIs. It is currently available on Hugging Face in both <a href=\"https:\/\/huggingface.co\/nineninesix\/kani-tts-2-en\" target=\"_blank\" rel=\"noreferrer noopener\">English (<strong>EN<\/strong>)<\/a> and <a href=\"https:\/\/huggingface.co\/nineninesix\/kani-tts-2-pt\" target=\"_blank\" rel=\"noreferrer noopener\">Portuguese (<strong>PT<\/strong>)<\/a> versions.<\/p>\n<h3 class=\"wp-block-heading\"><strong>The Architecture: LFM2 and NanoCodec<\/strong><\/h3>\n<p>Kani-TTS-2 follows the <strong>\u2018Audio-as-Language<\/strong>\u2018 philosophy. The model does not use traditional mel-spectrogram pipelines. Instead, it converts raw audio into discrete tokens using a neural codec.<\/p>\n<p><strong>The system relies on a two-stage process:<\/strong><\/p>\n<ol start=\"1\" class=\"wp-block-list\">\n<li><strong>The Language Backbone:<\/strong> The model is built on <strong>LiquidAI\u2019s LFM2 (350M)<\/strong> architecture. This backbone generates \u2018audio intent\u2019 by predicting the next audio tokens. Because LFM (Liquid Foundation Models) are designed for efficiency, they provide a faster alternative to standard transformers.<\/li>\n<li><strong>The Neural Codec:<\/strong> It uses the <strong>NVIDIA NanoCodec<\/strong> to turn those tokens into 22kHz waveforms.<\/li>\n<\/ol>\n<p>By using this architecture, the model captures human-like prosody\u2014the rhythm and intonation of speech\u2014without the \u2018robotic\u2019 artifacts found in older TTS systems.<\/p>\n<h3 class=\"wp-block-heading\"><strong>Efficiency: 10,000 Hours in 6 Hours<\/strong><\/h3>\n<p>The training metrics for Kani-TTS-2 are a masterclass in optimization. The English model was trained on <strong>10,000 hours<\/strong> of high-quality speech data.<\/p>\n<p>While that scale is impressive, the speed of training is the real story. The research team trained the model in only <strong>6 hours<\/strong> using a cluster of <strong>8 NVIDIA H100 GPUs<\/strong>. This proves that massive datasets no longer require weeks of compute time when paired with efficient architectures like LFM2.<\/p>\n<h3 class=\"wp-block-heading\"><strong>Zero-Shot Voice Cloning and Performance<\/strong><\/h3>\n<p>The standout feature for developers is <strong>zero-shot voice cloning<\/strong>. 
<h3 class="wp-block-heading"><strong>Efficiency: 10,000 Hours in 6 Hours</strong></h3>
<p>The training metrics for Kani-TTS-2 are a masterclass in optimization. The English model was trained on <strong>10,000 hours</strong> of high-quality speech data.</p>
<p>While that scale is impressive, the training speed is the real story. The research team trained the model in only <strong>6 hours</strong> on a cluster of <strong>8 NVIDIA H100 GPUs</strong>. This shows that large datasets no longer demand weeks of compute time when paired with efficient architectures like LFM2.</p>
<h3 class="wp-block-heading"><strong>Zero-Shot Voice Cloning and Performance</strong></h3>
<p>The standout feature for developers is <strong>zero-shot voice cloning</strong>. Unlike traditional models that require fine-tuning for new voices, Kani-TTS-2 uses <strong>speaker embeddings</strong>, as shown in the sketch after this list.</p>
<ul class="wp-block-list">
<li><strong>How it works:</strong> You provide a short reference audio clip.</li>
<li><strong>The result:</strong> The model extracts the unique characteristics of that voice and applies them to the generated text instantly.</li>
</ul>
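<p>Below is a hedged sketch of what that cloning flow might look like in code. The <code>model</code> object and its <code>embed_speaker</code> and <code>generate</code> methods are hypothetical placeholder names, not the documented Kani-TTS-2 interface; only the <code>soundfile</code> call is a real library API.</p>
<pre class="wp-block-code"><code># Hypothetical zero-shot cloning flow. The model object and its
# embed_speaker/generate methods are placeholders, not the real
# Kani-TTS-2 interface; soundfile is a real audio-I/O library.
import soundfile as sf

def clone_voice(model, reference_clip, text, out_path="cloned_output.wav"):
    """Zero-shot cloning: embed a reference voice once, then
    condition generation on that embedding. No fine-tuning."""
    # Extract a speaker embedding capturing the reference speaker's
    # vocal characteristics from a short clip (a few seconds).
    speaker_embedding = model.embed_speaker(reference_clip)

    # Generate the new text conditioned on that embedding.
    waveform = model.generate(text, speaker_embedding=speaker_embedding)

    # Kani-TTS-2 outputs 22 kHz audio.
    sf.write(out_path, waveform, samplerate=22050)
    return out_path
</code></pre>
<p>The key point is that the voice is captured as a single embedding at inference time, so adding a new speaker costs one forward pass over the reference clip rather than a training run.</p>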
<p><strong>From a deployment perspective, the model is highly accessible:</strong></p>
<ul class="wp-block-list">
<li><strong>Parameter Count:</strong> 400M (0.4B) parameters.</li>
<li><strong>Speed:</strong> It features a <strong>Real-Time Factor (RTF) of 0.2</strong>, meaning it generates 10 seconds of speech in roughly 2 seconds.</li>
<li><strong>Hardware:</strong> It requires only <strong>3GB of VRAM</strong>, making it compatible with consumer-grade GPUs like the RTX 3060 or 4050.</li>
<li><strong>License:</strong> Released under the <strong>Apache 2.0</strong> license, allowing commercial use.</li>
</ul>
<h3 class="wp-block-heading"><strong>Key Takeaways</strong></h3>
<ul class="wp-block-list">
<li><strong>Efficient Architecture:</strong> The model is a roughly <strong>400M-parameter</strong> system built around <strong>LiquidAI&#8217;s LFM2 (350M)</strong> backbone. The &#8216;Audio-as-Language&#8217; approach treats speech as discrete tokens, allowing faster processing and more human-like intonation than traditional architectures.</li>
<li><strong>Rapid Training at Scale:</strong> Kani-TTS-2-EN was trained on <strong>10,000 hours</strong> of high-quality speech data in just <strong>6 hours</strong> on <strong>8 NVIDIA H100 GPUs</strong>.</li>
<li><strong>Instant Zero-Shot Cloning:</strong> No fine-tuning is needed to replicate a specific voice. Given a short reference audio clip, the model uses <strong>speaker embeddings</strong> to synthesize text in the target speaker&#8217;s voice instantly.</li>
<li><strong>High Performance on Edge Hardware:</strong> With a <strong>Real-Time Factor (RTF) of 0.2</strong>, the model generates 10 seconds of audio in approximately 2 seconds. It requires only <strong>3GB of VRAM</strong>, making it fully functional on consumer-grade GPUs like the RTX 3060.</li>
<li><strong>Developer-Friendly Licensing:</strong> Released under the <strong>Apache 2.0 license</strong>, Kani-TTS-2 is ready for commercial integration. It offers a local-first, low-latency alternative to expensive closed-source TTS APIs.</li>
</ul>
<hr class="wp-block-separator has-alpha-channel-opacity" />
<p>Check out the <strong><a href="https://huggingface.co/nineninesix/kani-tts-2-en" target="_blank" rel="noreferrer noopener">model weights</a></strong>.</p>