{"id":413,"date":"2026-02-14T02:05:20","date_gmt":"2026-02-13T18:05:20","guid":{"rendered":"https:\/\/connectword.dpdns.org\/?p=413"},"modified":"2026-02-14T02:05:20","modified_gmt":"2026-02-13T18:05:20","slug":"kyutai-releases-hibiki-zero-a3b-parameter-simultaneous-speech-to-speech-translation-model-using-grpo-reinforcement-learning-without-any-word-level-aligned-data","status":"publish","type":"post","link":"https:\/\/connectword.dpdns.org\/?p=413","title":{"rendered":"Kyutai Releases Hibiki-Zero: A3B Parameter Simultaneous Speech-to-Speech Translation Model Using GRPO Reinforcement Learning Without Any Word-Level Aligned Data"},"content":{"rendered":"<p>Kyutai has released <strong>Hibiki-Zero<\/strong>, a new model for simultaneous speech-to-speech translation (S2ST) and speech-to-text translation (S2TT). The system translates source speech into a target language in real-time. It handles non-monotonic word dependencies during the process. Unlike previous models, Hibiki-Zero does not require word-level aligned data for training. This eliminates a major bottleneck in scaling AI translation to more languages.<\/p>\n<p>Traditional approaches rely on supervised training with word-level alignments. These alignments are difficult to collect at scale. Developers usually depend on synthetic alignments and language-specific heuristics. 
Hibiki-Zero removes this complexity by using a novel reinforcement learning (RL) strategy to optimize latency.<\/p>\n<div class=\"wp-block-image\">\n<figure class=\"aligncenter size-full is-resized\"><img loading=\"lazy\" decoding=\"async\" width=\"1498\" height=\"1026\" data-attachment-id=\"77879\" data-permalink=\"https:\/\/www.marktechpost.com\/2026\/02\/13\/kyutai-releases-hibiki-zero-a3b-parameter-simultaneous-speech-to-speech-translation-model-using-grpo-reinforcement-learning-without-any-word-level-aligned-data\/screenshot-2026-02-13-at-10-01-41-am-2\/\" data-orig-file=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2026\/02\/Screenshot-2026-02-13-at-10.01.41-AM-1.png\" data-orig-size=\"1498,1026\" data-comments-opened=\"1\" data-image-meta='{\"aperture\":\"0\",\"credit\":\"\",\"camera\":\"\",\"caption\":\"\",\"created_timestamp\":\"0\",\"copyright\":\"\",\"focal_length\":\"0\",\"iso\":\"0\",\"shutter_speed\":\"0\",\"title\":\"\",\"orientation\":\"0\"}' data-image-title=\"Screenshot 2026-02-13 at 10.01.41\u202fAM\" data-image-description=\"\" data-image-caption=\"\" data-medium-file=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2026\/02\/Screenshot-2026-02-13-at-10.01.41-AM-1-300x205.png\" data-large-file=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2026\/02\/Screenshot-2026-02-13-at-10.01.41-AM-1-1024x701.png\" src=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2026\/02\/Screenshot-2026-02-13-at-10.01.41-AM-1.png\" alt=\"\" class=\"wp-image-77879\" \/><figcaption class=\"wp-element-caption\">https:\/\/kyutai.org\/blog\/2026-02-12-hibiki-zero<\/figcaption><\/figure>\n<\/div>\n<h3 class=\"wp-block-heading\"><strong>A Multistream Architecture<\/strong><\/h3>\n<p>Hibiki-Zero is a decoder-only model. It uses a multistream architecture to model sequences of tokens jointly. 
<strong>The model handles three streams:<\/strong><\/p>\n<ul class=\"wp-block-list\">\n<li><strong>Source Stream<\/strong>: Audio tokens from the input speech.<\/li>\n<li><strong>Target Stream<\/strong>: Generated audio tokens for the translated speech.<\/li>\n<li><strong>Inner Monologue<\/strong>: A stream of padded text tokens that matches the target audio.<\/li>\n<\/ul>\n<p>The system uses the <strong>Mimi<\/strong> neural audio codec, a causal, streaming codec that encodes waveforms into discrete tokens at a frame rate of <strong>12.5 Hz<\/strong>. The model uses an <strong>RQ-Transformer<\/strong> to model these audio streams.<\/p>\n<p><strong>The architectural specs include:<\/strong><\/p>\n<ul class=\"wp-block-list\">\n<li><strong>Total Parameters<\/strong>: 3B.<\/li>\n<li><strong>Temporal Transformer<\/strong>: 28 layers with a latent dimension of 2048.<\/li>\n<li><strong>Depth Transformer<\/strong>: 6 layers per codebook with a latent dimension of 1024.<\/li>\n<li><strong>Context Window<\/strong>: 4 minutes.<\/li>\n<li><strong>Audio Codebooks<\/strong>: 16 levels for high-quality speech.<\/li>\n<\/ul>\n<h3 class=\"wp-block-heading\"><strong>Training Without Human Interpretation Data<\/strong><\/h3>\n<p><strong>Hibiki-Zero is trained in two main stages:<\/strong><\/p>\n<ol start=\"1\" class=\"wp-block-list\">\n<li><strong>Coarse Alignment Training<\/strong>: The model first trains on sentence-level aligned data, which ensures that the i<sup>th<\/sup> sentence in the target is a translation of the i<sup>th<\/sup> sentence in the source. The research team inserts artificial silence into the target speech to delay its content relative to the source.<\/li>\n<li><strong>Reinforcement Learning (RL)<\/strong>: The model uses <strong>Group Relative Policy Optimization (GRPO)<\/strong> to refine its policy. 
This stage reduces translation latency while preserving quality.<\/li>\n<\/ol>\n<p>The RL process uses <strong>process rewards<\/strong> based only on the <strong>BLEU score<\/strong>, computing intermediate rewards at multiple points during translation. A hyperparameter \u03b1 balances the trade-off between speed and accuracy: a lower \u03b1 reduces latency but may slightly decrease quality.<\/p>\n<h3 class=\"wp-block-heading\"><strong>Scaling to Italian in Record Time<\/strong><\/h3>\n<p>The researchers demonstrated how easily Hibiki-Zero adapts to new languages. They added Italian as an input language using less than <strong>1,000 hours<\/strong> of speech data.<\/p>\n<ul class=\"wp-block-list\">\n<li>They performed supervised fine-tuning followed by the GRPO process.<\/li>\n<li>The model reached a quality and latency trade-off similar to Meta\u2019s <strong>Seamless<\/strong> model.<\/li>\n<li>It surpassed Seamless in speaker similarity by over <strong>30 points<\/strong>.<\/li>\n<\/ul>\n<h3 class=\"wp-block-heading\"><strong>Performance and Results<\/strong><\/h3>\n<p>Hibiki-Zero achieves state-of-the-art results across five X-to-English tasks. 
It was tested on the <strong>Audio-NTREX-4L<\/strong> long-form benchmark, which includes 15 hours of speech per TTS system.<\/p>\n<figure class=\"wp-block-table\">\n<table class=\"has-fixed-layout\">\n<thead>\n<tr>\n<td><strong>Metric<\/strong><\/td>\n<td><strong>Hibiki-Zero (French)<\/strong><\/td>\n<td><strong>Seamless (French)<\/strong><\/td>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td><strong>ASR-BLEU (\u2191)<\/strong><\/td>\n<td>28.7<\/td>\n<td>23.9<\/td>\n<\/tr>\n<tr>\n<td><strong>Speaker Similarity (\u2191)<\/strong><\/td>\n<td>61.3<\/td>\n<td>44.4<\/td>\n<\/tr>\n<tr>\n<td><strong>Average Lag (LAAL, seconds) (\u2193)<\/strong><\/td>\n<td>2.3<\/td>\n<td>6.2<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<\/figure>\n<p>In short-form tasks (Europarl-ST), Hibiki-Zero reached an ASR-BLEU of <strong>34.6<\/strong> with a lag of <strong>2.8 seconds<\/strong>. Human raters also scored the model significantly higher than baselines for speech naturalness and voice transfer.<\/p>\n<div class=\"wp-block-image\">\n<figure class=\"aligncenter size-full is-resized\"><img loading=\"lazy\" decoding=\"async\" width=\"1656\" height=\"820\" data-attachment-id=\"77881\" data-permalink=\"https:\/\/www.marktechpost.com\/2026\/02\/13\/kyutai-releases-hibiki-zero-a3b-parameter-simultaneous-speech-to-speech-translation-model-using-grpo-reinforcement-learning-without-any-word-level-aligned-data\/screenshot-2026-02-13-at-10-02-23-am-2\/\" data-orig-file=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2026\/02\/Screenshot-2026-02-13-at-10.02.23-AM-1.png\" data-orig-size=\"1656,820\" data-comments-opened=\"1\" data-image-meta='{\"aperture\":\"0\",\"credit\":\"\",\"camera\":\"\",\"caption\":\"\",\"created_timestamp\":\"0\",\"copyright\":\"\",\"focal_length\":\"0\",\"iso\":\"0\",\"shutter_speed\":\"0\",\"title\":\"\",\"orientation\":\"0\"}' data-image-title=\"Screenshot 2026-02-13 at 10.02.23\u202fAM\" data-image-description=\"\" 
data-image-caption=\"\" data-medium-file=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2026\/02\/Screenshot-2026-02-13-at-10.02.23-AM-1-300x149.png\" data-large-file=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2026\/02\/Screenshot-2026-02-13-at-10.02.23-AM-1-1024x507.png\" src=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2026\/02\/Screenshot-2026-02-13-at-10.02.23-AM-1.png\" alt=\"\" class=\"wp-image-77881\" \/><figcaption class=\"wp-element-caption\">https:\/\/kyutai.org\/blog\/2026-02-12-hibiki-zero<\/figcaption><\/figure>\n<\/div>\n<h3 class=\"wp-block-heading\"><strong>Key Takeaways<\/strong><\/h3>\n<ul class=\"wp-block-list\">\n<li><strong>Zero Aligned Data Requirement<\/strong>: Hibiki-Zero eliminates the need for expensive, hand-crafted word-level alignments between source and target speech, which were previously the biggest bottleneck in scaling simultaneous translation to new languages.<\/li>\n<li><strong>GRPO-Driven Latency Optimization<\/strong>: The model uses Group Relative Policy Optimization (GRPO) and a simple reward system based only on BLEU scores to automatically learn an efficient translation policy, balancing high translation quality with low latency.<\/li>\n<li><strong>Coarse-to-Fine Training Strategy<\/strong>: The training pipeline starts with sentence-level aligned data to teach the model base translation at high latency, followed by a reinforcement learning phase that \u201cteaches\u201d the model when to speak and when to listen.<\/li>\n<li><strong>Superior Voice and Naturalness<\/strong>: In benchmarking against previous state-of-the-art systems like Seamless, Hibiki-Zero achieved a 30-point lead in speaker similarity and significantly higher scores in speech naturalness and audio quality across five language tasks.<\/li>\n<li><strong>Rapid New Language Adaptation<\/strong>: The architecture is highly portable; researchers demonstrated that Hibiki-Zero could be adapted to a new input language (Italian) with 
less than 1,000 hours of speech data while maintaining its original performance on other languages.<\/li>\n<\/ul>\n<hr class=\"wp-block-separator has-alpha-channel-opacity\" \/>\n<p>Check out the\u00a0<strong><a href=\"https:\/\/arxiv.org\/pdf\/2602.11072\" target=\"_blank\" rel=\"noreferrer noopener\">Paper<\/a>, <a href=\"https:\/\/kyutai.org\/blog\/2026-02-12-hibiki-zero\" target=\"_blank\" rel=\"noreferrer noopener\">Technical details<\/a>, <a href=\"https:\/\/github.com\/kyutai-labs\/hibiki-zero\" target=\"_blank\" rel=\"noreferrer noopener\">Repo<\/a> and <a href=\"https:\/\/huggingface.co\/spaces\/kyutai\/hibiki-zero-samples\" target=\"_blank\" rel=\"noreferrer noopener\">Samples<\/a>.<\/strong><\/p>\n<p>The post <a href=\"https:\/\/www.marktechpost.com\/2026\/02\/13\/kyutai-releases-hibiki-zero-a3b-parameter-simultaneous-speech-to-speech-translation-model-using-grpo-reinforcement-learning-without-any-word-level-aligned-data\/\">Kyutai Releases Hibiki-Zero: A 3B-Parameter Simultaneous Speech-to-Speech Translation Model Using GRPO Reinforcement Learning Without Any Word-Level Aligned Data<\/a> appeared first on <a href=\"https:\/\/www.marktechpost.com\/\">MarkTechPost<\/a>.<\/p>","protected":false},"excerpt":{"rendered":"<p>Kyutai has released Hibiki-Zer&hellip;<\/p>\n","protected":false},"author":1,"featured_media":414,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[1],"tags":[],"class_list":["post-413","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-uncategorized"],"_links":{"self":[{"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=\/wp\/v2\/posts\/413","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=413"}],"version-history":[{"count":0,"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=\/wp\/v2\/posts\/413\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=\/wp\/v2\/media\/414"}],"wp:attachment":[{"href":"https:\/\/connectword.dpdns.org\/index.php?res
t_route=%2Fwp%2Fv2%2Fmedia&parent=413"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=413"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=413"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}