{"id":1029,"date":"2026-06-04T16:11:05","date_gmt":"2026-06-04T08:11:05","guid":{"rendered":"https:\/\/connectword.dpdns.org\/?p=1029"},"modified":"2026-06-04T16:11:05","modified_gmt":"2026-06-04T08:11:05","slug":"miso-labs-releases-misotts-an-8b-emotive-text-to-speech-model-with-open-weights","status":"publish","type":"post","link":"https:\/\/connectword.dpdns.org\/?p=1029","title":{"rendered":"Miso Labs Releases MisoTTS: An 8B Emotive Text-to-Speech Model with Open Weights"},"content":{"rendered":"<p class=\"wp-block-paragraph\">Miso Labs has released MisoTTS, an open-weights 8-billion-parameter text-to-speech model. It generates expressive speech from both text and audio context. The model uses residual vector quantization (RVQ) to widen its sonic range. This avoids scaling a single flat vocabulary while keeping parameter count fixed. <\/p>\n<h2 class=\"wp-block-heading\"><strong>What is MisoTTS<\/strong><\/h2>\n<p class=\"wp-block-paragraph\">MisoTTS is an 8B-parameter text-to-dialogue RVQ Transformer. It is inspired by the Sesame CSM architecture. It pairs a Llama 3.2-style backbone with a smaller audio decoder. It generates Mimi audio codes from text and optional audio context. The model conditions on both text and prior audio. That second input lets it respond to the speaker\u2019s tone.<\/p>\n<p class=\"wp-block-paragraph\">The text vocabulary is 128,256 tokens, and there are 32 audio codebooks. Mimi is the audio tokenizer, and max sequence length is 2,048. Default inference runs in <code>torch.bfloat16<\/code>.<\/p>\n<p class=\"wp-block-paragraph\">Miso Labs claims 110ms latency. It lists ElevenLabs at 700ms and Sesame at 300ms. <\/p>\n<h2 class=\"wp-block-heading\"><strong>The Vocabulary Size Problem<\/strong><\/h2>\n<p class=\"wp-block-paragraph\">Standard transformers generate from a fixed vocabulary of discrete tokens. That works when a small vocabulary covers the target space. Human speech does not fit that assumption. 
It varies across pitch, rhythm, emphasis, emotion, and accent.<\/p>\n<p class=\"wp-block-paragraph\">Expanding the audio vocabulary is the obvious fix. But in a standard transformer, a larger vocabulary needs more parameters, because every token needs its own embedding and its own output logit. Miso Labs calls this the vocabulary size problem.<\/p>\n<p class=\"wp-block-paragraph\">The second issue is conditioning. Most TTS models condition only on text and ignore the interlocutor\u2019s tone. Miso Labs argues this contributes to the \u201cuncanny valley\u201d effect.<\/p>\n<h2 class=\"wp-block-heading\"><strong>Residual Vector Quantization: The Core Idea<\/strong><\/h2>\n<p class=\"wp-block-paragraph\">MisoTTS addresses both problems with residual vector quantization (RVQ). Miso Labs traces RVQ to image-generation research and to Sesame\u2019s CSM for audio. Instead of one token index, the model emits a vector of indices.<\/p>\n<p class=\"wp-block-paragraph\">Each audio token is a vector of 32 indices, one per codebook, and each codebook holds 2,048 entries. The model keeps a separate codebook for each position in the vector. To recover the sound, it looks up one vector in each codebook and sums them. Each codebook adds another refinement to the signal.<\/p>\n<p class=\"wp-block-paragraph\">This is what makes the scaling work. The addressable vocabulary equals the codebook size raised to the power of the depth, that is, the number of codebooks. Growing the depth adds no parameters to the model. So MisoTTS reaches about 2048<sup>32<\/sup>, or roughly 10<sup>105<\/sup>, addressable tokens. 
Miso Labs notes naive scaling would require a far larger network.<\/p>\n<div class=\"wp-block-image\">\n<figure class=\"aligncenter size-full is-resized\"><img loading=\"lazy\" decoding=\"async\" width=\"1832\" height=\"1258\" src=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2026\/06\/Screenshot-2026-06-04-at-12.56.18-AM-1.png\" alt=\"\" class=\"wp-image-80297\" \/><figcaption class=\"wp-element-caption\">https:\/\/www.misolabs.ai\/blog\/miso-tts-8b<\/figcaption><\/figure>\n<\/div>\n<h2 class=\"wp-block-heading\"><strong>The Two-Transformer Architecture<\/strong><\/h2>\n<p class=\"wp-block-paragraph\">The model splits into a backbone and a decoder. The backbone is a 7.7B-parameter transformer, autoregressive over time. It predicts the first codebook index and a final hidden state.<\/p>\n<p class=\"wp-block-paragraph\">A 300M-parameter decoder then runs autoregressively over depth. 
It predicts the remaining codebook indices, one position at a time. Each prediction conditions on the indices already chosen in the frame. The same 300M parameters are reused for every position.<\/p>\n<p class=\"wp-block-paragraph\">Embeddings follow the same logic. Text tokens use a single lookup. An audio token\u2019s embedding is the sum of per-position codebook lookups. Interleaving text and audio lets the backbone use conversation history. That is how it carries context across turns.<\/p>\n<h2 class=\"wp-block-heading\"><strong>Strengths and Challenges<\/strong><\/h2>\n<h4 class=\"wp-block-heading\"><strong>Strengths:<\/strong><\/h4>\n<ul class=\"wp-block-list\">\n<li>Open weights on day one, under a modified MIT license.<\/li>\n<li>RVQ scales the sonic range without scaling parameter count.<\/li>\n<li>Conditions on audio context, not text alone.<\/li>\n<li>Local deployment keeps sensitive audio data in-house.<\/li>\n<li>The architecture and math are documented in a public blog post.<\/li>\n<\/ul>\n<h4 class=\"wp-block-heading\"><strong>Challenges:<\/strong><\/h4>\n<ul class=\"wp-block-list\">\n<li>Half-duplex only, with no turn-taking yet.<\/li>\n<li>The large model needs a capable CUDA GPU.<\/li>\n<li>API access is announced but not yet available.<\/li>\n<li>Latency and quality claims still need third-party testing.<\/li>\n<\/ul>\n<h2 class=\"wp-block-heading\"><strong>Marktechpost\u2019s Visual Explainer<\/strong><\/h2>\n<p><!-- ===== MARKTECHPOST \u00b7 MisoTTS SLIDER (Anthropic theme) \u00b7 paste into a WordPress Custom HTML block ===== --><\/p>\n<div>\n<div class=\"mtp-top\">\n    <span class=\"mtp-brand\"><b>Marktechpost<\/b> \u00b7 Model Brief<\/span><br \/>\n    <span class=\"mtp-count\" data-mtp-count>01 \/ 09<\/span>\n  <\/div>\n<div class=\"mtp-view\">\n<div class=\"mtp-track\" data-mtp-track>\n<p>      <!-- 1 --><\/p>\n<section class=\"mtp-slide mtp-cover\">\n<p class=\"mtp-eyebrow\">Open-Weights Release \u00b7 June 3, 2026<\/p>\n<h2 
class=\"mtp-h\">MisoTTS<\/h2>\n<p class=\"mtp-sub\">An 8B emotive text-to-speech model from Miso Labs, built on residual vector quantization and conditioned on both text and audio.<\/p>\n<p>        <span class=\"mtp-rule\"><\/span><\/p>\n<div class=\"mtp-tags\">\n          <span class=\"mtp-chip\">8B params<\/span><br \/>\n          <span class=\"mtp-chip\">RVQ Transformer<\/span><br \/>\n          <span class=\"mtp-chip\">Mimi codes<\/span><br \/>\n          <span class=\"mtp-chip\">modified MIT<\/span>\n        <\/div>\n<\/section>\n<p>      <!-- 2 --><\/p>\n<section class=\"mtp-slide\">\n<p class=\"mtp-eyebrow\">What MisoTTS Is<\/p>\n<h2 class=\"mtp-h\">A text-to-dialogue RVQ Transformer<\/h2>\n<p>        <span class=\"mtp-rule\"><\/span><\/p>\n<ul class=\"mtp-list\">\n<li>An <b>8B-parameter<\/b> model inspired by the <b>Sesame CSM<\/b> architecture.<\/li>\n<li>Pairs a <b>Llama 3.2-style backbone<\/b> with a smaller audio decoder.<\/li>\n<li>Generates <b>Mimi audio codes<\/b> from text and optional audio context.<\/li>\n<li>Conditions on prior audio, so output responds to <b>speaker tone<\/b>.<\/li>\n<\/ul>\n<\/section>\n<p>      <!-- 3 --><\/p>\n<section class=\"mtp-slide\">\n<p class=\"mtp-eyebrow\">At a Glance<\/p>\n<h2 class=\"mtp-h\">Published specifications<\/h2>\n<p>        <span class=\"mtp-rule\"><\/span><\/p>\n<div class=\"mtp-grid\">\n<div class=\"mtp-cell\">\n<p class=\"k\">Parameters<\/p>\n<p class=\"v\">8B <small>(7.7B + 300M)<\/small><\/p>\n<\/div>\n<div class=\"mtp-cell\">\n<p class=\"k\">Architecture<\/p>\n<p class=\"v\">RVQ Transformer<\/p>\n<\/div>\n<div class=\"mtp-cell\">\n<p class=\"k\">Audio codebooks<\/p>\n<p class=\"v\">32 <small>(2048-way)<\/small><\/p>\n<\/div>\n<div class=\"mtp-cell\">\n<p class=\"k\">Audio tokenizer<\/p>\n<p class=\"v\">Mimi<\/p>\n<\/div>\n<div class=\"mtp-cell\">\n<p class=\"k\">Text vocabulary<\/p>\n<p class=\"v\">128,256<\/p>\n<\/div>\n<div class=\"mtp-cell\">\n<p class=\"k\">Max sequence length<\/p>\n<p 
class=\"v\">2,048<\/p>\n<\/div>\n<div class=\"mtp-cell\">\n<p class=\"k\">Default precision<\/p>\n<p class=\"v\">torch.bfloat16<\/p>\n<\/div>\n<div class=\"mtp-cell\">\n<p class=\"k\">License<\/p>\n<p class=\"v\">modified MIT<\/p>\n<\/div><\/div>\n<\/section>\n<p>      <!-- 4 --><\/p>\n<section class=\"mtp-slide\">\n<p class=\"mtp-eyebrow\">The Motivation<\/p>\n<h2 class=\"mtp-h\">The vocabulary size problem<\/h2>\n<p>        <span class=\"mtp-rule\"><\/span><\/p>\n<ul class=\"mtp-list\">\n<li>Transformers generate from a <b>fixed vocabulary<\/b> of discrete tokens.<\/li>\n<li>Speech varies in pitch, rhythm, emphasis, emotion, and accent.<\/li>\n<li>A bigger audio vocabulary needs <b>more parameters<\/b> in a standard transformer.<\/li>\n<li>Most TTS condition only on text, ignoring tone <span>\u2014 the \u201cuncanny valley\u201d effect.<\/span><\/li>\n<\/ul>\n<\/section>\n<p>      <!-- 5 --><\/p>\n<section class=\"mtp-slide\">\n<p class=\"mtp-eyebrow\">The Core Idea<\/p>\n<h2 class=\"mtp-h\">Residual vector quantization<\/h2>\n<p>        <span class=\"mtp-rule\"><\/span><\/p>\n<ul class=\"mtp-list\">\n<li>The model emits a <b>vector of indices<\/b>, not a single token index.<\/li>\n<li>Each token is <b>32 codebook indices<\/b> over 2048-way codebooks.<\/li>\n<li>Summing the looked-up vectors reconstructs the sound.<\/li>\n<li>Depth scales addressable vocabulary to <b>~2048<sup>32<\/sup> (\u224810<sup>105<\/sup>)<\/b> with no added parameters.<\/li>\n<\/ul>\n<\/section>\n<p>      <!-- 6 --><\/p>\n<section class=\"mtp-slide\">\n<p class=\"mtp-eyebrow\">Architecture<\/p>\n<h2 class=\"mtp-h\">Two transformers, one vector token<\/h2>\n<p>        <span class=\"mtp-rule\"><\/span><\/p>\n<ul class=\"mtp-list\">\n<li><b>Backbone (7.7B)<\/b> \u2014 autoregressive over time; predicts codebook index k\u2081 and hidden state h\u2080.<\/li>\n<li><b>Decoder (300M)<\/b> \u2014 autoregressive over depth; predicts k\u2082 through k\u2083\u2082.<\/li>\n<li>The same 300M parameters 
are <b>reused for every position<\/b>.<\/li>\n<li>Interleaved text and audio let the backbone use conversation history.<\/li>\n<\/ul>\n<\/section>\n<p>      <!-- 7 --><\/p>\n<section class=\"mtp-slide\">\n<p class=\"mtp-eyebrow\">Run It Locally<\/p>\n<h2 class=\"mtp-h\">Inference in a few lines<\/h2>\n<p>        <span class=\"mtp-rule\"><\/span><\/p>\n<pre><code><span class=\"kw\">from<\/span> generator <span class=\"kw\">import<\/span> load_miso_8b\n<span class=\"kw\">import<\/span> torchaudio\n\ngen = load_miso_8b(device=<span class=\"st\">\"cuda\"<\/span>,\n    model_path_or_repo_id=<span class=\"st\">\"MisoLabs\/MisoTTS\"<\/span>)\n\naudio = gen.generate(\n    text=<span class=\"st\">\"Hello from Miso.\"<\/span>,\n    speaker=<span class=\"st\">0<\/span>, context=[],\n    max_audio_length_ms=<span class=\"st\">10_000<\/span>)\n\ntorchaudio.save(<span class=\"st\">\"miso.wav\"<\/span>,\n    audio.unsqueeze(<span class=\"st\">0<\/span>).cpu(), gen.sample_rate)<\/code><\/pre>\n<p class=\"mtp-note\">Setup uses uv with Python 3.10. Weights download from Hugging Face. Audio is watermarked by default via SilentCipher. 
One-shot voice cloning works from a ~10-second clip.<\/p>\n<\/section>\n<p>      <!-- 8 --><\/p>\n<section class=\"mtp-slide\">\n<p class=\"mtp-eyebrow\">Limitations<\/p>\n<h2 class=\"mtp-h\">Where it stops, for now<\/h2>\n<p>        <span class=\"mtp-rule\"><\/span><\/p>\n<ul class=\"mtp-list\">\n<li>Handles <b>individual turns only<\/b>; no turn-taking yet.<\/li>\n<li>Generates <b>half-duplex<\/b> audio \u2014 it cannot speak while the other party speaks.<\/li>\n<li>Miso Labs frames full-duplex and turn-taking as <b>future work<\/b>.<\/li>\n<li><b>API access<\/b> is announced but not yet available.<\/li>\n<\/ul>\n<\/section>\n<p>      <!-- 9 --><\/p>\n<section class=\"mtp-slide\">\n<p class=\"mtp-eyebrow\">Key Takeaways<\/p>\n<h2 class=\"mtp-h\">The short version<\/h2>\n<p>        <span class=\"mtp-rule\"><\/span><\/p>\n<ul class=\"mtp-list\">\n<li>Open-weights 8B TTS under a modified MIT license.<\/li>\n<li>Conditions on text and audio, so output tracks speaker tone.<\/li>\n<li>RVQ scales vocabulary to ~2048<sup>32<\/sup> without adding parameters.<\/li>\n<li>7.7B backbone over time, 300M decoder over depth.<\/li>\n<li>Half-duplex and single-turn today; API access pending.<\/li>\n<\/ul>\n<\/section><\/div>\n<\/div>\n<div class=\"mtp-ctrl\">\n<div class=\"mtp-dots\" data-mtp-dots><\/div>\n<div class=\"mtp-arrows\">\n      <button class=\"mtp-btn\" data-mtp-prev>Prev<\/button><br \/>\n      <button class=\"mtp-btn pri\" data-mtp-next>Next<\/button>\n    <\/div>\n<\/div>\n<div class=\"mtp-foot\">\n<p class=\"tag\">Decoded by <b>Marktechpost<\/b> \u2014 AI research, model briefs, and developer tools for practitioners.<\/p>\n<p class=\"sub\">marktechpost.com<\/p>\n<\/div>\n<\/div>\n<p><!-- ===== END MisoTTS SLIDER ===== --><\/p>\n<h2 class=\"wp-block-heading\"><strong>Key Takeaways<\/strong><\/h2>\n<ul class=\"wp-block-list\">\n<li>Miso Labs open-sourced MisoTTS, an 8B text-to-speech model, under a modified MIT license.<\/li>\n<li>It conditions on both text and audio 
context, making generations responsive to speaker tone.<\/li>\n<li>Residual vector quantization (32 codebooks \u00d7 2048-way) scales vocabulary to ~2048\u00b3\u00b2 without adding parameters.<\/li>\n<li>Architecture splits a 7.7B backbone (over time) and a 300M decoder (over depth).<\/li>\n<li>It is half-duplex and single-turn only today; API access is still pending.<\/li>\n<\/ul>\n<p class=\"wp-block-paragraph\">\n<hr class=\"wp-block-separator has-alpha-channel-opacity\" \/>\n<\/p><p class=\"wp-block-paragraph\">Check out\u00a0the\u00a0<strong><a href=\"https:\/\/huggingface.co\/MisoLabs\/MisoTTS\" target=\"_blank\" rel=\"noreferrer noopener\">Model Weights<\/a><\/strong>, <strong><a href=\"https:\/\/github.com\/MisoLabsAI\/MisoTTS\" target=\"_blank\" rel=\"noreferrer noopener\">Repo<\/a> <\/strong>and<strong> <a href=\"https:\/\/www.misolabs.ai\/blog\/miso-tts-8b\" target=\"_blank\" rel=\"noreferrer noopener\">Technical details<\/a>.<\/strong><\/p>\n<p>The post <a href=\"https:\/\/www.marktechpost.com\/2026\/06\/04\/miso-labs-releases-misotts-an-8b-emotive-text-to-speech-model-with-open-weights\/\">Miso Labs Releases MisoTTS: An 8B Emotive Text-to-Speech Model with Open Weights<\/a> appeared first on <a href=\"https:\/\/www.marktechpost.com\/\">MarkTechPost<\/a>.<\/p>","protected":false},"excerpt":{"rendered":"<p>Miso Labs has released MisoTTS&hellip;<\/p>\n","protected":false},"author":1,"featured_media":1030,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[1],"tags":[],"class_list":["post-1029","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-uncategorized"],"_links":{"self":[{"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=\/wp\/v2\/posts\/1029","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=1029"}],"version-history":[{"count":0,"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=\/wp\/v2\/posts\/1029\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"
https:\/\/connectword.dpdns.org\/index.php?rest_route=\/wp\/v2\/media\/1030"}],"wp:attachment":[{"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=1029"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=1029"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=1029"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}