{"id":858,"date":"2026-05-06T08:34:38","date_gmt":"2026-05-06T00:34:38","guid":{"rendered":"https:\/\/connectword.dpdns.org\/?p=858"},"modified":"2026-05-06T08:34:38","modified_gmt":"2026-05-06T00:34:38","slug":"inworld-ai-launches-realtime-tts-2-a-closed-loop-voice-model-that-adapts-to-how-you-actually-talk","status":"publish","type":"post","link":"https:\/\/connectword.dpdns.org\/?p=858","title":{"rendered":"Inworld AI Launches Realtime TTS-2: A Closed-Loop Voice Model That Adapts to How You Actually Talk"},"content":{"rendered":"<p>Voice AI has a dirty secret: most of it was never designed for conversation. The dominant paradigm \u2014 feed text in, get audio out \u2014 traces its lineage to audiobook narration and voiceover production, where the model never hears the person on the other end. That\u2019s fine when you\u2019re generating a podcast intro. It\u2019s not fine when a frustrated user is trying to get support from an AI agent at 11pm.<\/p>\n<p>Inworld AI is calling that out directly with the launch of Realtime TTS-2, a new voice model released as a research preview via its Inworld API and Inworld Realtime API. The model hears the full audio of the exchange, picks up the user\u2019s tone, pacing and emotional state, then takes voice direction in plain English the way developers prompt an LLM.<\/p>\n<h3 class=\"wp-block-heading\"><strong>What\u2019s Actually Different Here<\/strong><\/h3>\n<p>The meaningful architectural distinction with TTS-2 is that it operates as a closed-loop system. The model takes the actual audio of the prior turns of the exchange as input, not just a transcript \u2014 it hears how the user actually sounded. That\u2019s a non-trivial difference. A transcript of \u201cokay, fine\u201d gives you the words. The audio of \u201cokay, fine\u201d tells you whether the person is relieved, resigned, or sarcastic. 
TTS-2 is designed to use that signal.<\/p>\n<p>The same line lands differently after a joke than after bad news, and the model knows the difference because it heard the prior turn. Tone, pacing, and emotional state carry forward automatically. Practically speaking, audio context flows across turns inside a Realtime session without developers needing to pass explicit <code>prior_audio<\/code> fields or build additional plumbing.<\/p>\n<h3 class=\"wp-block-heading\"><strong>Four Capabilities, One Model<\/strong><\/h3>\n<p>The Inworld team is shipping TTS-2 with <strong>four key features<\/strong>, positioning the combination, not any individual piece, as the differentiator.<\/p>\n<ol class=\"wp-block-list\">\n<li><strong>Voice Direction<\/strong>: It lets developers steer delivery using plain-language prompts inline at inference time. Instead of selecting from a fixed emotion enum like <code>[sad]<\/code> or <code>[excited]<\/code>, developers pass a bracket tag like <code>[speak sadly, as if something bad just happened]<\/code> directly in the text. Long, descriptive prompts beat short labels: the model responds far better to full context than to a single word. Inline non-verbal markers like <code>[laugh]<\/code>, <code>[sigh]<\/code>, <code>[breathe]<\/code>, <code>[clear_throat]<\/code>, and <code>[cough]<\/code> can be dropped anywhere in the text where the moment should occur, and the model places them as audio events, not pronounced words.<\/li>\n<li><strong>Conversational Awareness<\/strong>: This is the closed-loop architecture described above, the shift that separates TTS-2 from prior-generation models that treat each sentence as a stateless generation call.<\/li>\n<li><strong>Crosslingual<\/strong> support: One voice identity is preserved across more than 100 languages, including mid-utterance language switches inside a single generation. 
No language flag is needed: the model handles transitions automatically, keeping timbre, pitch, and character constant across the switch. <strong>The top-tier languages ship at native-speaker quality, while the long tail is described as launch-window experimental, consistent with the model\u2019s release as a research preview.<\/strong><\/li>\n<li><strong>Advanced Voice Design<\/strong>: It generates a saved voice from a written prompt, with no reference audio required. Developers can describe a person in prose, save the result as a reusable voice, and call it like any other voice in the app. Voice Design ships with <strong>three stability modes<\/strong>: Expressive (for live consumer conversation and companions), Balanced (the default for most agent workloads), and Stable (for IVR and professional deployments where pitch drift is unacceptable).<\/li>\n<\/ol>\n<h3 class=\"wp-block-heading\"><strong>The Conversational Layer Underneath<\/strong><\/h3>\n<p>Beyond the four key features, Inworld calls out a set of behaviors that push speech further into what it describes as \u201cperson paying attention\u201d territory. <strong>The most technically interesting of these is disfluencies: the model generates natural <em>uh<\/em> and <em>um<\/em>, self-corrections, mid-noun-phrase pauses, and trailing thoughts that signal warmth and recall rather than malfunction. Critically, different speaker profiles cluster fillers differently, and the model follows the rhythm: filler-as-energy sounds different from filler-as-hesitation.<\/strong> Voice cloning is also supported via a <strong>two-step API:<\/strong> upload a reference sample (5\u201315 seconds, clean, single speaker) to <code>\/voices\/v1\/voices:clone<\/code>, get a voice ID, and use it like any other voice.<\/p>\n<h3 class=\"wp-block-heading\"><strong>Where It Fits in the Stack<\/strong><\/h3>\n<p>TTS-2 is one layer in Inworld\u2019s broader Realtime API pipeline. 
The full stack includes Realtime STT, which transcribes and profiles the speaker in one pass \u2014 capturing age, accent, pitch, vocal style, emotional tone, and pacing as structured signals on the same connection. A Realtime Router that <strong>routes across 200+ models, selecting<\/strong> the appropriate model and tools based on the user\u2019s state and conversation context. And TTS-2 at the output layer. The pipeline runs over a single persistent WebSocket connection, with sub-200ms median time-to-first-audio for the TTS layer.<\/p>\n<div class=\"wp-block-image\">\n<figure class=\"aligncenter size-large is-resized\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"377\" data-attachment-id=\"79551\" data-permalink=\"https:\/\/www.marktechpost.com\/2026\/05\/05\/inworld-ai-launches-realtime-tts-2-a-closed-loop-voice-model-that-adapts-to-how-you-actually-talk\/screenshot-2026-05-05-at-5-20-25-pm\/\" data-orig-file=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2026\/05\/Screenshot-2026-05-05-at-5.20.25-PM-scaled.png\" data-orig-size=\"2560,943\" data-comments-opened=\"0\" data-image-meta='{\"aperture\":\"0\",\"credit\":\"\",\"camera\":\"\",\"caption\":\"\",\"created_timestamp\":\"0\",\"copyright\":\"\",\"focal_length\":\"0\",\"iso\":\"0\",\"shutter_speed\":\"0\",\"title\":\"\",\"orientation\":\"0\"}' data-image-title=\"Screenshot 2026-05-05 at 5.20.25\u202fPM\" data-image-description=\"\" data-image-caption=\"\" data-large-file=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2026\/05\/Screenshot-2026-05-05-at-5.20.25-PM-1024x377.png\" src=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2026\/05\/Screenshot-2026-05-05-at-5.20.25-PM-1024x377.png\" alt=\"\" class=\"wp-image-79551\" \/><figcaption class=\"wp-element-caption\">https:\/\/artificialanalysis.ai\/text-to-speech\/leaderboard.  
(data as of May 5, 2026)<\/figcaption><\/figure>\n<\/div>\n<h3 class=\"wp-block-heading\"><strong>The Broader Context<\/strong><\/h3>\n<p>Realtime TTS 1.5 already ranks #1 on the <a href=\"https:\/\/artificialanalysis.ai\/text-to-speech\/leaderboard\" target=\"_blank\" rel=\"noreferrer noopener\">Artificial Analysis Speech Arena<\/a> (as of May 5, 2026), ahead of Google (#2) and ElevenLabs (#3). The launch of TTS-2 signals that Inworld considers raw audio quality a solved problem \u2014 and is now competing on the behavioral layer: context-awareness, steerability, and identity consistency across languages.<\/p>\n<hr class=\"wp-block-separator has-alpha-channel-opacity\" \/>\n<p>Check out\u00a0the\u00a0<strong><a href=\"https:\/\/docs.inworld.ai\/tts\/tts\" target=\"_blank\" rel=\"noreferrer noopener\">Docs<\/a>\u00a0<\/strong>and<strong><a href=\"https:\/\/inworld.ai\/blog\/realtime-tts-2\" target=\"_blank\" rel=\"noreferrer noopener\">\u00a0Technical details<\/a><\/strong>.<\/p>\n<p>The post <a href=\"https:\/\/www.marktechpost.com\/2026\/05\/05\/inworld-ai-launches-realtime-tts-2-a-closed-loop-voice-model-that-adapts-to-how-you-actually-talk\/\">Inworld AI Launches Realtime TTS-2: A Closed-Loop Voice Model That Adapts to How You Actually Talk<\/a> appeared first on <a href=\"https:\/\/www.marktechpost.com\/\">MarkTechPost<\/a>.<\/p>","protected":false},"excerpt":{"rendered":"<p>Voice AI has a dirty secret: m&hellip;<\/p>\n","protected":false},"author":1,"featured_media":859,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[1],"tags":[],"class_list":["post-858","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-uncategorized"],"_links":{"self":[{"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=\/wp\/v2\/posts\/858","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=858"}],"version-history":[{"count":0,"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=\/wp\/v2\/posts\/858\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href"
:"https:\/\/connectword.dpdns.org\/index.php?rest_route=\/wp\/v2\/media\/859"}],"wp:attachment":[{"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=858"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=858"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=858"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}