{"id":837,"date":"2026-05-03T15:47:42","date_gmt":"2026-05-03T07:47:42","guid":{"rendered":"https:\/\/connectword.dpdns.org\/?p=837"},"modified":"2026-05-03T15:47:42","modified_gmt":"2026-05-03T07:47:42","slug":"sakana-ai-introduces-kame-a-tandem-speech-to-speech-architecture-that-injects-llm-knowledge-in-real-time","status":"publish","type":"post","link":"https:\/\/connectword.dpdns.org\/?p=837","title":{"rendered":"Sakana AI Introduces KAME: A Tandem Speech-to-Speech Architecture That Injects LLM Knowledge in Real Time"},"content":{"rendered":"<p>The fundamental tension in conversational AI has always been a binary choice: respond fast or respond smart. Real-time speech-to-speech (S2S) models \u2014 the kind that power natural-feeling voice assistants \u2014 start talking almost instantly, but their answers tend to be shallow. Cascaded systems that route speech through a large language model (LLM) are far more knowledgeable, but the pipeline delay is long enough to make conversation feel stilted and robotic. <strong>Researchers at Sakana AI,<\/strong> the Tokyo-based AI lab, introduce <strong>KAME<\/strong> (Knowledge-Access Model Extension), a hybrid architecture that keeps the near-zero response latency of a direct S2S system while injecting the richer knowledge of a back-end LLM in real time.<\/p>\n<h3 class=\"wp-block-heading\"><strong>The Problem: Two Paradigms, Two Tradeoffs<\/strong><\/h3>\n<p>To see why KAME matters, it helps to understand the two dominant designs it bridges.<\/p>\n<p>A direct S2S model like Moshi (developed by Kyutai) is a monolithic transformer that takes in audio tokens and produces audio tokens in a continuous loop. Because it doesn\u2019t need to synchronize with external systems, its response latency is exceptionally low \u2014 for many queries, the model starts speaking before the user even finishes their question. 
But because acoustic signals are far more information-dense than text, the model has to spend significant capacity modeling paralinguistic features like tone, emotion, and rhythm. That leaves less room for factual knowledge and deep reasoning.<\/p>\n<p>A cascaded system, by contrast, routes the user\u2019s speech through an Automatic Speech Recognition (ASR) model, feeds the resulting text into a powerful LLM, and then converts the LLM\u2019s response back into speech via a Text-to-Speech (TTS) engine. The knowledge quality is excellent \u2014 you can plug in any frontier LLM \u2014 but the system must wait for the user to finish speaking before ASR and LLM processing can even begin. The result is a median latency of around 2.1 seconds, which is long enough to noticeably interrupt natural conversational flow.<\/p>\n<div class=\"wp-block-image\">\n<figure class=\"aligncenter size-full is-resized\"><img loading=\"lazy\" decoding=\"async\" width=\"1994\" height=\"1080\" data-attachment-id=\"79478\" data-permalink=\"https:\/\/www.marktechpost.com\/2026\/05\/03\/sakana-ai-introduces-kame-a-tandem-speech-to-speech-architecture-that-injects-llm-knowledge-in-real-time\/screenshot-2026-05-03-at-12-46-41-am-2\/\" data-orig-file=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2026\/05\/Screenshot-2026-05-03-at-12.46.41-AM-1.png\" data-orig-size=\"1994,1080\" data-comments-opened=\"0\" data-image-meta='{\"aperture\":\"0\",\"credit\":\"\",\"camera\":\"\",\"caption\":\"\",\"created_timestamp\":\"0\",\"copyright\":\"\",\"focal_length\":\"0\",\"iso\":\"0\",\"shutter_speed\":\"0\",\"title\":\"\",\"orientation\":\"0\"}' data-image-title=\"Screenshot 2026-05-03 at 12.46.41\u202fAM\" data-image-description=\"\" data-image-caption=\"\" data-large-file=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2026\/05\/Screenshot-2026-05-03-at-12.46.41-AM-1-1024x555.png\" src=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2026\/05\/Screenshot-2026-05-03-at-12.46.41-AM-1.png\" 
alt=\"\" class=\"wp-image-79478\" \/><figcaption class=\"wp-element-caption\">https:\/\/pub.sakana.ai\/kame\/<\/figcaption><\/figure>\n<\/div>\n<h3 class=\"wp-block-heading\"><strong>KAME\u2019s Architecture: Speaking While Thinking<\/strong><\/h3>\n<p>KAME operates as a tandem system with two asynchronous components running in parallel.<\/p>\n<p>The <strong>front-end S2S module<\/strong> is based on the Moshi architecture and processes audio in real time, one step per discrete audio token (approximately every 80 milliseconds). It begins generating a spoken response immediately. Internally, Moshi\u2019s original three-stream design \u2014 input audio, inner monologue (text), and output audio \u2014 is extended in KAME with a fourth stream: the <strong>oracle stream<\/strong>. This is the key innovation.<\/p>\n<p>The <strong>back-end LLM module<\/strong> consists of a streaming speech-to-text (STT) component paired with a full-scale LLM. As the user speaks, the STT component continuously builds a partial transcript and periodically sends it to the back-end LLM. For each partial transcript it receives, the LLM generates a candidate text response \u2014 called an oracle \u2014 and streams it back to the front-end. Because the user\u2019s speech is still arriving, these oracles start as educated guesses and become progressively more accurate as the transcript grows more complete.<\/p>\n<p>The front-end S2S transformer then conditions its ongoing speech output on both its own internal context and these incoming oracle tokens. When a new, better oracle arrives, the model can correct course \u2014 effectively updating its response mid-sentence, the way a human might. Because both modules run asynchronously and independently, the initial response latency stays near zero.<\/p>\n<h3 class=\"wp-block-heading\"><strong>Training on Simulated Oracles<\/strong><\/h3>\n<p>One challenge is that no naturally occurring dataset contains oracle signals. 
The Sakana AI research team addresses this with a technique called <strong>Simulated Oracle Augmentation<\/strong>. Using a \u2018simulator\u2019 LLM and a standard conversational dataset (user utterance + ground-truth response), the research team generates synthetic oracle sequences that mimic what a real-time LLM would produce across different levels of transcript completeness. They define six hint levels (0\u20135), ranging from a completely unguided guess at hint level 0 to the verbatim ground-truth response at hint level 5. The training data for KAME was built from 56,582 synthetic dialogues drawn from MMLU-Pro, GSM8K, and HSSBench, converted to audio via TTS and augmented with these progressive oracle sequences.<\/p>\n<h3 class=\"wp-block-heading\"><strong>Results: Near-Cascaded Quality, Near-Zero Latency<\/strong><\/h3>\n<p>Evaluations on a speech-synthesized subset of the MT-Bench multi-turn Q&amp;A benchmark \u2014 specifically the reasoning, STEM, and humanities categories (Coding, Extraction, Math, Roleplay, and Writing were excluded as unsuitable for speech interaction) \u2014 show a dramatic improvement. Moshi alone scores 2.05 on average. KAME with gpt-4.1 as the back-end scores 6.43, and KAME with claude-opus-4-1 as the back-end scores 6.23 \u2014 both at essentially the same latency as Moshi. The leading cascaded system, Unmute (also backed by gpt-4.1), scores 7.70, but with a median latency of 2.1 seconds versus near-zero for KAME.<\/p>\n<p>To isolate back-end capability from timing effects, the research team also evaluated the back-end LLM\u2019s text responses from the final oracle injection in each KAME session directly \u2014 bypassing the premature-generation problem entirely. Those scores averaged 7.79 (reasoning 6.48, STEM 8.34, humanities 8.56), comparable to Unmute\u2019s 7.70. 
This confirms that KAME\u2019s gap to cascaded systems is not a ceiling on the back-end LLM\u2019s knowledge, but a consequence of starting to speak before the full user query has been heard.<\/p>\n<p>Crucially, KAME is fully <strong>back-end agnostic<\/strong>. The front-end was trained using gpt-4.1-nano as the primary back-end, but swapping in claude-opus-4-1 or gemini-2.5-flash at inference time requires no retraining. In Sakana AI\u2019s experiments, claude-opus-4-1 tended to outperform gpt-4.1 on reasoning tasks, while gpt-4.1 scored higher on humanities questions \u2014 suggesting practitioners can route queries to the most task-appropriate LLM without touching the front-end model.<\/p>\n<h3 class=\"wp-block-heading\"><strong>Key Takeaways<\/strong><\/h3>\n<ul class=\"wp-block-list\">\n<li><strong>KAME bridges the speed-vs-knowledge tradeoff in conversational AI<\/strong> by running a front-end speech-to-speech model and a back-end LLM asynchronously in parallel \u2014 the S2S model responds immediately while the LLM continuously injects progressively refined \u2018oracle\u2019 signals in real time, shifting the paradigm from \u2018think, then speak\u2019 to \u2018speak while thinking.\u2019<\/li>\n<li><strong>The performance gains are substantial without any latency cost<\/strong> \u2014 KAME raises the MT-Bench score from 2.05 (Moshi baseline) to 6.43, approaching the cascaded system Unmute\u2019s 7.70, while maintaining near-zero median response latency versus Unmute\u2019s 2.1 seconds.<\/li>\n<li><strong>The architecture is fully back-end agnostic<\/strong> \u2014 the front-end was trained using gpt-4.1-nano but supports plug-and-play swapping of any frontier LLM (gpt-4.1, claude-opus-4-1, gemini-2.5-flash) at inference time with no retraining, enabling task-specific LLM selection based on domain strengths.<\/li>\n<\/ul>\n<hr class=\"wp-block-separator aligncenter has-alpha-channel-opacity is-style-wide\" \/>\n<p>Check out\u00a0the\u00a0<strong><a 
href=\"https:\/\/huggingface.co\/SakanaAI\/kame\" target=\"_blank\" rel=\"noreferrer noopener\">Model Weights<\/a>, <a href=\"https:\/\/arxiv.org\/pdf\/2510.02327\" target=\"_blank\" rel=\"noreferrer noopener\">Paper<\/a>, <a href=\"https:\/\/github.com\/SakanaAI\/kame\" target=\"_blank\" rel=\"noreferrer noopener\">Inference code<\/a> <\/strong>and<strong> <a href=\"https:\/\/pub.sakana.ai\/kame\/\" target=\"_blank\" rel=\"noreferrer noopener\">Technical details<\/a><\/strong>.<\/p>\n<p>The post <a href=\"https:\/\/www.marktechpost.com\/2026\/05\/03\/sakana-ai-introduces-kame-a-tandem-speech-to-speech-architecture-that-injects-llm-knowledge-in-real-time\/\">Sakana AI Introduces KAME: A Tandem Speech-to-Speech Architecture That Injects LLM Knowledge in Real Time<\/a> appeared first on <a href=\"https:\/\/www.marktechpost.com\/\">MarkTechPost<\/a>.<\/p>","protected":false},"excerpt":{"rendered":"<p>The fundamental tension in con&hellip;<\/p>\n","protected":false},"author":1,"featured_media":838,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[1],"tags":[],"class_list":["post-837","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-uncategorized"],"_links":{"self":[{"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=\/wp\/v2\/posts\/837","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=837"}],"version-history":[{"count":0,"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=\/wp\/v2\/posts\/837\/revisions"}],"wp:featuredmedia":[{"embeddabl
e":true,"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=\/wp\/v2\/media\/838"}],"wp:attachment":[{"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=837"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=837"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=837"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}