{"id":807,"date":"2026-04-28T02:36:06","date_gmt":"2026-04-27T18:36:06","guid":{"rendered":"https:\/\/connectword.dpdns.org\/?p=807"},"modified":"2026-04-28T02:36:06","modified_gmt":"2026-04-27T18:36:06","slug":"openmoss-releases-moss-audio-an-open-source-foundation-model-for-speech-sound-music-and-time-aware-audio-reasoning","status":"publish","type":"post","link":"https:\/\/connectword.dpdns.org\/?p=807","title":{"rendered":"OpenMOSS Releases MOSS-Audio: An Open-Source Foundation Model for Speech, Sound, Music, and Time-Aware Audio Reasoning"},"content":{"rendered":"<p>Understanding what\u2019s happening in an audio clip is a deceptively hard problem. Transcribing spoken words is the easy part. A truly capable system also needs to recognize who is speaking, detect their emotional state, interpret background sounds, analyze musical content, and answer time-grounded questions like \u2018what did the speaker say at the 2-minute mark?\u2019. Tackling all of that has traditionally required stitching together multiple specialized systems.<\/p>\n<p>The OpenMOSS team, MOSI.AI, and Shanghai Innovation Institute released <strong>MOSS-Audio<\/strong>: an open-source audio understanding model designed to unify all of those capabilities inside a single foundation model.<\/p>\n<h3 class=\"wp-block-heading\"><strong>What MOSS-Audio Actually Does<\/strong><\/h3>\n<p>MOSS-Audio supports <strong>speech understanding, environmental sound understanding, music understanding, audio captioning, time-aware QA, and complex reasoning<\/strong> over real-world audio. Its capability set breaks down into several distinct areas. <strong>Speech &amp; Content Understanding<\/strong> accurately recognizes and transcribes spoken content, supporting both word-level and sentence-level timestamp alignment. <strong>Speaker, Emotion &amp; Event Analysis<\/strong> identifies speaker characteristics, analyzes emotional states based on tone, timbre, and context, and detects key acoustic events within the audio. 
<strong>Scene &amp; Sound Cue Extraction<\/strong> pulls meaningful signals from background sounds, environmental noise, and non-speech signals to infer scene context and atmosphere. <strong>Music Understanding<\/strong> analyzes musical style, emotional progression, and instrumentation. <strong>Audio Question Answering &amp; Summarization<\/strong> handles questions and summaries across speech, podcasts, meetings, and interviews. Finally, <strong>Complex Reasoning<\/strong> performs multi-hop reasoning over audio content, powered by both chain-of-thought training and reinforcement learning.<\/p>\n<p>In practical terms, a single MOSS-Audio model can do all of the above without switching between different specialized systems.<\/p>\n<h3 class=\"wp-block-heading\"><strong>Four Model Variants<\/strong><\/h3>\n<p>The team released four variants at launch: <strong>MOSS-Audio-4B-Instruct<\/strong>, <strong>MOSS-Audio-4B-Thinking<\/strong>, <strong>MOSS-Audio-8B-Instruct<\/strong>, and <strong>MOSS-Audio-8B-Thinking<\/strong>. The naming convention is worth understanding if you\u2019re deciding which to use. The <strong>Instruct<\/strong> variants are optimized for direct instruction following, making them well-suited for production pipelines where you want predictable, structured outputs. The <strong>Thinking<\/strong> variants provide stronger chain-of-thought reasoning capabilities, better suited for tasks requiring multi-hop inference. 
The 4B models use <strong>Qwen3-4B<\/strong> as the LLM backbone, and the 8B models use <strong>Qwen3-8B<\/strong>, resulting in total model sizes of approximately 4.6B and 8.6B parameters respectively.<\/p>\n<div class=\"wp-block-image\">\n<figure class=\"aligncenter size-full is-resized\"><img loading=\"lazy\" decoding=\"async\" width=\"1680\" height=\"804\" data-attachment-id=\"79345\" data-permalink=\"https:\/\/www.marktechpost.com\/2026\/04\/27\/openmoss-releases-moss-audio-an-open-source-foundation-model-for-speech-sound-music-and-time-aware-audio-reasoning\/screenshot-2026-04-27-at-11-34-12-am\/\" data-orig-file=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2026\/04\/Screenshot-2026-04-27-at-11.34.12-AM.png\" data-orig-size=\"1680,804\" data-comments-opened=\"0\" data-image-meta='{\"aperture\":\"0\",\"credit\":\"\",\"camera\":\"\",\"caption\":\"\",\"created_timestamp\":\"0\",\"copyright\":\"\",\"focal_length\":\"0\",\"iso\":\"0\",\"shutter_speed\":\"0\",\"title\":\"\",\"orientation\":\"0\"}' data-image-title=\"Screenshot 2026-04-27 at 11.34.12\u202fAM\" data-image-description=\"\" data-image-caption=\"\" data-large-file=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2026\/04\/Screenshot-2026-04-27-at-11.34.12-AM-1024x490.png\" src=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2026\/04\/Screenshot-2026-04-27-at-11.34.12-AM.png\" alt=\"\" class=\"wp-image-79345\" \/><figcaption class=\"wp-element-caption\">https:\/\/github.com\/OpenMOSS\/MOSS-Audio<\/figcaption><\/figure>\n<\/div>\n<h3 class=\"wp-block-heading\"><strong>The Architecture: Three Components Working Together<\/strong><\/h3>\n<p>MOSS-Audio follows a <strong>modular design comprising three components: an audio encoder, a modality adapter, and a large language model<\/strong>. Raw audio is first encoded by the <strong>MOSS-Audio-Encoder<\/strong> into continuous temporal representations at <strong>12.5 Hz<\/strong>. 
Those representations are then projected into the language model\u2019s embedding space through the adapter, and finally consumed by the LLM for auto-regressive text generation.<\/p>\n<p>The research team trained the encoder from scratch rather than relying on off-the-shelf audio frontends. Their reasoning: a dedicated encoder delivers more robust speech representations, tighter temporal alignment, and better extensibility across acoustic domains.<\/p>\n<p>Two architectural innovations inside MOSS-Audio are worth understanding in detail.<\/p>\n<p><strong>DeepStack Cross-Layer Feature Injection<\/strong>: A common weakness in audio models is that relying only on the encoder\u2019s top-layer features tends to lose low-level acoustic information, things like prosody, transient events, and local time-frequency structure. MOSS-Audio addresses this with a <strong>DeepStack<\/strong>-inspired cross-layer injection module between the encoder and the language model: in addition to the encoder\u2019s final-layer output, features from earlier and intermediate layers are selected, independently projected, and injected into the language model\u2019s early layers. This preserves multi-granularity information ranging from low-level acoustic details to high-level semantic abstractions, helping the model retain rhythm, timbre, transients, and background structure that a single high-level representation cannot fully capture.<\/p>\n<p><strong>Time-Aware Representation<\/strong>: Time is a critical dimension in audio that text models aren\u2019t naturally equipped to handle. MOSS-Audio addresses this through a <strong>time-marker insertion<\/strong> strategy during pretraining: explicit time tokens are inserted between audio frame representations at fixed time intervals to indicate temporal positions. 
This lets the model learn \u2018what happened when\u2019 within a unified text generation framework, naturally supporting timestamp ASR, event localization, time-based QA, and long-audio retrospection without requiring a separate localization head or post-processing pipeline.<\/p>\n<h3 class=\"wp-block-heading\"><strong>Benchmark Performance<\/strong><\/h3>\n<p>The numbers are strong. On general audio understanding, <strong>MOSS-Audio-8B-Thinking achieves an average accuracy of 71.08<\/strong> across four benchmarks (<strong>77.33 on MMAU<\/strong>, <strong>64.92 on MMAU-Pro<\/strong>, <strong>66.53 on MMAR<\/strong>, and <strong>75.52 on MMSU<\/strong>), outperforming the majority of open-source models. That includes larger models: Step-Audio-R1 at 33B scores 70.67, and Qwen3-Omni-30B-A3B-Instruct at 30B scores 67.91. For further context, Kimi-Audio (7B) scores 61.14 and MiMo-Audio-7B scores 62.97 on the same average. The 4B Thinking variant scores 68.37, meaning the smaller model with chain-of-thought training beats all larger open-source instruct-only competitors.<\/p>\n<p>On <strong>speech captioning<\/strong>, evaluated with an LLM-as-a-Judge methodology across 13 fine-grained dimensions including gender, age, accent, pitch, volume, speed, texture, clarity, fluency, emotion, tone, personality, and summary, MOSS-Audio-Instruct variants lead across 11 out of 13 dimensions, with MOSS-Audio-8B-Instruct achieving the best overall average score of <strong>3.7252<\/strong>.<\/p>\n<p>On <strong>automatic speech recognition (ASR)<\/strong> spanning 12 evaluation dimensions, including health condition, code-switching, dialect, singing, and non-speech scenarios, MOSS-Audio-8B-Instruct achieves the <strong>lowest overall CER (Character Error Rate) of 11.30<\/strong> across all tested models.<\/p>\n<div class=\"wp-block-image\">\n<figure class=\"aligncenter size-full is-resized\"><img loading=\"lazy\" decoding=\"async\" width=\"1584\" height=\"816\" 
data-attachment-id=\"79347\" data-permalink=\"https:\/\/www.marktechpost.com\/2026\/04\/27\/openmoss-releases-moss-audio-an-open-source-foundation-model-for-speech-sound-music-and-time-aware-audio-reasoning\/screenshot-2026-04-27-at-11-34-43-am-2\/\" data-orig-file=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2026\/04\/Screenshot-2026-04-27-at-11.34.43-AM-1.png\" data-orig-size=\"1584,816\" data-comments-opened=\"0\" data-image-meta='{\"aperture\":\"0\",\"credit\":\"\",\"camera\":\"\",\"caption\":\"\",\"created_timestamp\":\"0\",\"copyright\":\"\",\"focal_length\":\"0\",\"iso\":\"0\",\"shutter_speed\":\"0\",\"title\":\"\",\"orientation\":\"0\"}' data-image-title=\"Screenshot 2026-04-27 at 11.34.43\u202fAM\" data-image-description=\"\" data-image-caption=\"\" data-large-file=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2026\/04\/Screenshot-2026-04-27-at-11.34.43-AM-1-1024x528.png\" src=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2026\/04\/Screenshot-2026-04-27-at-11.34.43-AM-1.png\" alt=\"\" class=\"wp-image-79347\" \/><figcaption class=\"wp-element-caption\">https:\/\/github.com\/OpenMOSS\/MOSS-Audio<\/figcaption><\/figure>\n<\/div>\n<h3 class=\"wp-block-heading\"><strong>Key Takeaways<\/strong><\/h3>\n<ul class=\"wp-block-list\">\n<li><strong>Single Model, Full Audio Stack<\/strong>: MOSS-Audio unifies speech transcription, speaker and emotion analysis, environmental sound understanding, music analysis, audio captioning, time-aware QA, and complex reasoning into one open-source model, eliminating the need to chain multiple specialized systems together.<\/li>\n<li><strong>Two Architectural Innovations Drive Performance<\/strong>: DeepStack Cross-Layer Feature Injection preserves multi-granularity acoustic information by injecting features from intermediate encoder layers directly into the LLM\u2019s early layers, while time-marker insertion during pretraining gives the model explicit temporal awareness for timestamp-grounded 
tasks.<\/li>\n<li><strong>Best-in-Class Benchmark Results at Efficient Scale<\/strong>: MOSS-Audio-8B-Thinking achieves an average accuracy of 71.08 on general audio understanding benchmarks, outperforming all open-source models including 30B+ systems, while the 4B Thinking variant alone beats every larger open-source instruct-only competitor.<\/li>\n<li><strong>Dominant Timestamp ASR Accuracy<\/strong>: MOSS-Audio-8B-Instruct scores 35.77 AAS on AISHELL-1 and 131.61 AAS on LibriSpeech, dramatically outperforming both Qwen3-Omni-30B-A3B-Instruct (833.66) and the closed-source Gemini-3.1-Pro (708.24) on the same benchmark.<\/li>\n<\/ul>\n<hr class=\"wp-block-separator aligncenter has-alpha-channel-opacity is-style-wide\" \/>\n<p>Check out\u00a0the\u00a0<strong><a href=\"https:\/\/huggingface.co\/collections\/OpenMOSS-Team\/moss-audio\" target=\"_blank\" rel=\"noreferrer noopener\">Model Weights<\/a><\/strong>\u00a0and\u00a0<strong><a href=\"https:\/\/github.com\/OpenMOSS\/MOSS-Audio\" target=\"_blank\" rel=\"noreferrer noopener\">Repo<\/a><\/strong>.<\/p>\n<p>The post <a href=\"https:\/\/www.marktechpost.com\/2026\/04\/27\/openmoss-releases-moss-audio-an-open-source-foundation-model-for-speech-sound-music-and-time-aware-audio-reasoning\/\">OpenMOSS Releases MOSS-Audio: An Open-Source Foundation Model for Speech, Sound, Music, and Time-Aware Audio Reasoning<\/a> appeared first on <a href=\"https:\/\/www.marktechpost.com\/\">MarkTechPost<\/a>.<\/p>","protected":false},"excerpt":{"rendered":"<p>Understanding what\u2019s happening&hellip;<\/p>\n","protected":false},"author":1,"featured_media":808,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[1],"tags":[],"class_list":["post-807","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-uncategorized"],"_links":{"self":[{"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=\/wp\/v2\/posts\/807","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=807"}],"version-history":[{"count":0,"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=\/wp\/v2\/posts\/807\/revisions"}],"wp
:featuredmedia":[{"embeddable":true,"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=\/wp\/v2\/media\/808"}],"wp:attachment":[{"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=807"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=807"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=807"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}