{"id":973,"date":"2026-05-25T06:51:05","date_gmt":"2026-05-24T22:51:05","guid":{"rendered":"https:\/\/connectword.dpdns.org\/?p=973"},"modified":"2026-05-25T06:51:05","modified_gmt":"2026-05-24T22:51:05","slug":"stepfun-releases-stepaudio-2-5-realtime-an-end-to-end-voice-model-with-roleplay-specific-rlhf-and-paralinguistic-comprehension","status":"publish","type":"post","link":"https:\/\/connectword.dpdns.org\/?p=973","title":{"rendered":"StepFun Releases StepAudio 2.5 Realtime: An End-to-End Voice Model with Roleplay-Specific RLHF and Paralinguistic Comprehension"},"content":{"rendered":"<p class=\"wp-block-paragraph\">StepFun, the Shanghai-based AI lab, has released StepAudio 2.5 Realtime, an end-to-end real-time speech large language model with fully customizable persona capabilities.<\/p>\n<p class=\"wp-block-paragraph\">Unlike pipeline-based systems that run speech recognition, reasoning, and synthesis as separate sequential steps, StepAudio 2.5 Realtime takes audio in and produces audio out through a single unified model. It supports Chinese and English.<\/p>\n<p class=\"wp-block-paragraph\">The model is accessed through a WebSocket API. The endpoint is <code>wss:\/\/api.stepfun.com\/v1\/realtime<\/code>, and the model string is <code>step-2.5-realtime<\/code>.<\/p>\n<h2 class=\"wp-block-heading\"><strong>The Three Technical Pillars<\/strong><\/h2>\n<p class=\"wp-block-paragraph\"><strong>The StepFun research team describes three core architectural innovations behind the model:<\/strong><\/p>\n<h4 class=\"wp-block-heading\"><strong>1. Million-Scale Persona Data Augmentation<\/strong><\/h4>\n<p class=\"wp-block-paragraph\">Starting from 10,000+ high-quality natively authored personas, StepFun applied algorithmic augmentation to build a million-scale persona feature matrix. This was combined with millions of real-world conversational samples for training.
The intent is generalization, specifically stable performance on difficult, long-tail conversational topics.<\/p>\n<p class=\"wp-block-paragraph\">Instead of manually labeling millions of persona samples, the StepFun team used algorithmic expansion from a curated seed set.<\/p>\n<h4 class=\"wp-block-heading\"><strong>2. Roleplay-Specific RLHF Alignment<\/strong><\/h4>\n<p class=\"wp-block-paragraph\">A known failure mode in conversational AI is \u201cout-of-character\u201d (OOC) behavior, in which a model drifts away from its defined persona mid-conversation. The StepFun team conducted dedicated RLHF (Reinforcement Learning from Human Feedback) optimization for persona consistency in roleplay scenarios. RLHF is a training technique in which human preference signals are used to train a reward model, which then guides the language model\u2019s behavior. Applying it specifically to roleplay stability is a targeted design choice.<\/p>\n<h4 class=\"wp-block-heading\"><strong>3. Unified Speech Understanding and Generation<\/strong><\/h4>\n<p class=\"wp-block-paragraph\">StepAudio 2.5 Realtime inherits the StepAudio 2.5 TTS capabilities and deeply fuses speech understanding and generation through reinforcement learning. This enables what StepFun calls \u201cglobal scene-level tonal setting\u201d and \u201cintra-sentence detail sculpting.\u201d The model can set an overall emotional register for a response while adjusting finer acoustic details within individual sentences.<\/p>\n<h2 class=\"wp-block-heading\"><strong>Paralinguistic Understanding<\/strong><\/h2>\n<p class=\"wp-block-paragraph\">A technically distinct area of this model is paralinguistic perception. Paralinguistics refers to non-verbal acoustic information in speech, such as tone, speaking rate, pauses, sighs, and laughter. By analyzing these elements, the model can perceive the user\u2019s mood and underlying intentions.
For example, it can identify fatigue from a low tone or frustration from a rapid speech rate. Capturing these signals requires the model to operate on audio features rather than transcribed text alone.<\/p>\n<p class=\"wp-block-paragraph\">StepAudio 2.5 Realtime scored 82.18 on the paralinguistic comprehension benchmark, demonstrating perception of vocal speed, emotion, age, and other acoustic features.<\/p>\n<div class=\"wp-block-image\">\n<figure class=\"aligncenter size-full is-resized\"><img loading=\"lazy\" decoding=\"async\" width=\"1836\" height=\"1088\" data-attachment-id=\"80087\" data-permalink=\"https:\/\/www.marktechpost.com\/2026\/05\/24\/stepfun-releases-stepaudio-2-5-realtime-an-end-to-end-voice-model-with-roleplay-specific-rlhf-and-paralinguistic-comprehension\/screenshot-2026-05-24-at-3-50-40-pm-2\/\" data-orig-file=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2026\/05\/Screenshot-2026-05-24-at-3.50.40-PM-1.png\" data-orig-size=\"1836,1088\" data-comments-opened=\"0\" data-image-meta=\"{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;,&quot;alt&quot;:&quot;&quot;}\" data-image-title=\"Screenshot 2026-05-24 at 3.50.40\u202fPM\" data-image-description=\"\" data-image-caption=\"\" data-large-file=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2026\/05\/Screenshot-2026-05-24-at-3.50.40-PM-1-1024x607.png\" src=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2026\/05\/Screenshot-2026-05-24-at-3.50.40-PM-1.png\" alt=\"\" class=\"wp-image-80087\" \/><figcaption class=\"wp-element-caption\">https:\/\/stepaudiollm.github.io\/step-audio-2.5-realtime\/<\/figcaption><\/figure>\n<\/div>\n<h2 
class=\"wp-block-heading\"><strong>Benchmark Results<\/strong><\/h2>\n<p class=\"wp-block-paragraph\">The StepFun research team conducted a comprehensive suite of subjective and objective evaluations, benchmarking StepAudio 2.5 Realtime against leading real-time voice models across five dimensions.<\/p>\n<p class=\"wp-block-paragraph\">Human evaluation was conducted through real mobile app conversations scored by human raters. <strong>The scores:<\/strong><\/p>\n<ul class=\"wp-block-list\">\n<li>Human evaluation (subjective): <strong>80.41<\/strong><\/li>\n<li>General dialogue (objective): <strong>86.36<\/strong><\/li>\n<li>Automotive scenario (objective): <strong>84.80<\/strong><\/li>\n<li>Spoken QA, covering 11 audio understanding tasks (objective): <strong>79.80<\/strong><\/li>\n<li>Paralinguistic comprehension (objective): <strong>82.18<\/strong><\/li>\n<\/ul>\n<h2 class=\"wp-block-heading\"><strong>Key Takeaways<\/strong><\/h2>\n<ul class=\"wp-block-list\">\n<li>StepAudio 2.5 Realtime is an end-to-end real-time speech LLM, released by Shanghai-based StepFun.<\/li>\n<li>It uses persona-specific RLHF and million-scale data augmentation to keep characters consistent.<\/li>\n<li>The model ranked first across all five benchmark dimensions, tested in April 2026.<\/li>\n<li>Paralinguistic comprehension (perceiving tone, speaking rate, and emotion from audio) is a core technical differentiator.<\/li>\n<li>API access is via WebSocket at <code>wss:\/\/api.stepfun.com\/v1\/realtime<\/code> with model string <code>step-2.5-realtime<\/code>.<\/li>\n<\/ul>\n<hr class=\"wp-block-separator has-alpha-channel-opacity\" \/>\n<p class=\"wp-block-paragraph\">Check out\u00a0the\u00a0<strong><a href=\"https:\/\/stepaudiollm.github.io\/step-audio-2.5-realtime\/\" target=\"_blank\" rel=\"noreferrer noopener\">Model Card<\/a>\u00a0<\/strong>and<strong><a 
href=\"https:\/\/www.stepfun.com\/studio\/audio?tab=voice-chat\" target=\"_blank\" rel=\"noreferrer noopener\">\u00a0Demo<\/a>.<\/strong><\/p>\n<p>The post <a href=\"https:\/\/www.marktechpost.com\/2026\/05\/24\/stepfun-releases-stepaudio-2-5-realtime-an-end-to-end-voice-model-with-roleplay-specific-rlhf-and-paralinguistic-comprehension\/\">StepFun Releases StepAudio 2.5 Realtime: An End-to-End Voice Model with Roleplay-Specific RLHF and Paralinguistic Comprehension<\/a> appeared first on <a href=\"https:\/\/www.marktechpost.com\/\">MarkTechPost<\/a>.<\/p>","protected":false},"excerpt":{"rendered":"<p>StepFun, the Shanghai-based 
AI&hellip;<\/p>\n","protected":false},"author":1,"featured_media":974,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[1],"tags":[],"class_list":["post-973","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-uncategorized"],"_links":{"self":[{"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=\/wp\/v2\/posts\/973","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=973"}],"version-history":[{"count":0,"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=\/wp\/v2\/posts\/973\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=\/wp\/v2\/media\/974"}],"wp:attachment":[{"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=973"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=973"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=973"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}