{"id":610,"date":"2026-03-26T15:33:55","date_gmt":"2026-03-26T07:33:55","guid":{"rendered":"https:\/\/connectword.dpdns.org\/?p=610"},"modified":"2026-03-26T15:33:55","modified_gmt":"2026-03-26T07:33:55","slug":"tencent-ai-open-sources-covo-audio-a-7b-speech-language-model-and-inference-pipeline-for-real-time-audio-conversations-and-reasoning","status":"publish","type":"post","link":"https:\/\/connectword.dpdns.org\/?p=610","title":{"rendered":"Tencent AI Open Sources Covo-Audio: A 7B Speech Language Model and Inference Pipeline for Real-Time Audio Conversations and Reasoning"},"content":{"rendered":"<p>Tencent AI Lab has released <strong>Covo-Audio<\/strong>, a 7B-parameter end-to-end Large Audio Language Model (LALM). The model is designed to unify speech processing and language intelligence by directly processing continuous audio inputs and generating audio outputs within a single architecture.<\/p>\n<h3 class=\"wp-block-heading\"><strong>System Architecture<\/strong><\/h3>\n<p><strong>The Covo-Audio framework consists of four primary components designed for seamless cross-modal interaction:<\/strong><\/p>\n<ul class=\"wp-block-list\">\n<li><strong>Audio Encoder<\/strong>: The model utilizes <strong>Whisper-large-v3<\/strong> as its primary encoder due to its robustness against background noise and varied accents. 
This component operates at a frame rate of <strong>50 Hz<\/strong>.<\/li>\n<li><strong>Audio Adapter<\/strong>: To bridge the encoder and the LLM, a specialized adapter employs three downsampling modules, integrating linear and convolution layers to reduce the frame rate from <strong>50 Hz to 6.25 Hz<\/strong>.<\/li>\n<li><strong>LLM Backbone<\/strong>: The system is built upon <strong>Qwen2.5-7B-Base<\/strong>, which has been adapted to process interleaved sequences of continuous acoustic features and textual tokens.<\/li>\n<li><strong>Speech Tokenizer and Decoder<\/strong>: The tokenizer, based on <strong>WavLM-large<\/strong>, uses a codebook size of <strong>16,384<\/strong> to produce discrete audio tokens at <strong>25 Hz<\/strong>. The decoder employs a <strong>Flow-Matching (FM)<\/strong>-based framework and a <strong>BigVGAN<\/strong> vocoder to reconstruct high-fidelity <strong>24 kHz waveforms<\/strong>.<\/li>\n<\/ul>\n<div class=\"wp-block-image\">\n<figure class=\"aligncenter size-full is-resized\"><img loading=\"lazy\" decoding=\"async\" width=\"1604\" height=\"952\" data-attachment-id=\"78612\" data-permalink=\"https:\/\/www.marktechpost.com\/2026\/03\/26\/tencent-ai-open-sources-covo-audio-a-7b-speech-language-model-and-inference-pipeline-for-real-time-audio-conversations-and-reasoning\/screenshot-2026-03-26-at-12-33-14-am-2\/\" data-orig-file=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2026\/03\/Screenshot-2026-03-26-at-12.33.14-AM-1.png\" data-orig-size=\"1604,952\" data-comments-opened=\"1\" data-image-meta='{\"aperture\":\"0\",\"credit\":\"\",\"camera\":\"\",\"caption\":\"\",\"created_timestamp\":\"0\",\"copyright\":\"\",\"focal_length\":\"0\",\"iso\":\"0\",\"shutter_speed\":\"0\",\"title\":\"\",\"orientation\":\"0\"}' data-image-title=\"Screenshot 2026-03-26 at 12.33.14\u202fAM\" data-image-description=\"\" data-image-caption=\"\" 
data-medium-file=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2026\/03\/Screenshot-2026-03-26-at-12.33.14-AM-1-300x178.png\" data-large-file=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2026\/03\/Screenshot-2026-03-26-at-12.33.14-AM-1-1024x608.png\" src=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2026\/03\/Screenshot-2026-03-26-at-12.33.14-AM-1.png\" alt=\"\" class=\"wp-image-78612\" \/><figcaption class=\"wp-element-caption\">https:\/\/arxiv.org\/pdf\/2602.09823<\/figcaption><\/figure>\n<\/div>\n<h3 class=\"wp-block-heading\"><strong>Hierarchical Tri-modal Interleaving<\/strong><\/h3>\n<p>A core contribution of this work is the <strong>Hierarchical Tri-modal Speech-Text Interleaving<\/strong> strategy. Unlike traditional methods that operate solely at the word or character level, this framework aligns continuous acoustic features <math data-latex=\"(a_c)\"><semantics><mrow><mo form=\"prefix\" stretchy=\"false\">(<\/mo><msub><mi>a<\/mi><mi>c<\/mi><\/msub><mo form=\"postfix\" stretchy=\"false\">)<\/mo><\/mrow><annotation encoding=\"application\/x-tex\">(a_c)<\/annotation><\/semantics><\/math>, discrete speech tokens <math data-latex=\"(a_d)\"><semantics><mrow><mo form=\"prefix\" stretchy=\"false\">(<\/mo><msub><mi>a<\/mi><mi>d<\/mi><\/msub><mo form=\"postfix\" stretchy=\"false\">)<\/mo><\/mrow><annotation encoding=\"application\/x-tex\">(a_d)<\/annotation><\/semantics><\/math>, and natural language text <math data-latex=\"(t)\"><semantics><mrow><mo form=\"prefix\" stretchy=\"false\">(<\/mo><mi>t<\/mi><mo form=\"postfix\" stretchy=\"false\">)<\/mo><\/mrow><annotation encoding=\"application\/x-tex\">(t)<\/annotation><\/semantics><\/math>.<\/p>\n<p><strong>The model utilizes two primary patterns:<\/strong><\/p>\n<ol start=\"1\" class=\"wp-block-list\">\n<li><strong>Sequential Interleaving <\/strong><math data-latex=\"(a_c rightarrow t rightarrow a_d)\"><semantics><mrow><mo form=\"prefix\" 
stretchy=\"false\">(<\/mo><msub><mi>a<\/mi><mi>c<\/mi><\/msub><mo stretchy=\"false\">\u2192<\/mo><mi>t<\/mi><mo stretchy=\"false\">\u2192<\/mo><msub><mi>a<\/mi><mi>d<\/mi><\/msub><mo form=\"postfix\" stretchy=\"false\">)<\/mo><\/mrow><annotation encoding=\"application\/x-tex\">(a_c rightarrow t rightarrow a_d)<\/annotation><\/semantics><\/math>: Continuous features, text, and discrete tokens are arranged in a progressive chain.<\/li>\n<li><strong>Parallel Integration <\/strong><math data-latex=\"(a_c rightarrow t | a_d)\"><semantics><mrow><mo form=\"prefix\" stretchy=\"false\">(<\/mo><msub><mi>a<\/mi><mi>c<\/mi><\/msub><mo stretchy=\"false\">\u2192<\/mo><mi>t<\/mi><mi>|<\/mi><msub><mi>a<\/mi><mi>d<\/mi><\/msub><mo form=\"postfix\" stretchy=\"false\">)<\/mo><\/mrow><annotation encoding=\"application\/x-tex\">(a_c rightarrow t | a_d)<\/annotation><\/semantics><\/math>: Continuous features are aligned with a coupled text-discrete unit.<\/li>\n<\/ol>\n<p>The hierarchical aspect ensures structural coherence by using phrase-level interleaving for fine-grained alignment and sentence-level interleaving to preserve global semantic integrity in long-form utterances<sup><\/sup>. The training process involved a two-stage pre-training pipeline processing a total of <strong>2T tokens<\/strong><sup><\/sup>.<\/p>\n<h3 class=\"wp-block-heading\"><strong>Intelligence-Speaker Decoupling<\/strong><\/h3>\n<p>To mitigate the high cost of constructing large-scale dialogue data for specific speakers, the research team proposed an <strong>Intelligence Speaker Decoupling<\/strong> strategy. This technique separates dialogue intelligence from voice rendering, allowing for flexible voice customization using minimal text-to-speech (TTS) data.<\/p>\n<p>The method reformats high-quality TTS recordings into pseudo-conversations with <strong>masked text loss<\/strong><sup><\/sup>. 
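<\/p>
<p>As an illustrative sketch (hypothetical helper and role names, not the released training code), the masked text loss can be implemented by copying input ids to labels and assigning the conventional ignore index to the text-response span, so only the remaining tokens contribute to the loss:<\/p>

```python
# Illustrative sketch of a masked text loss (hypothetical helper, not the
# released training code). Input ids are copied to labels, but tokens whose
# role is 'response_text' receive the conventional ignore index, so the text
# response contributes no gradient while audio tokens stay supervised.

IGNORE_INDEX = -100  # label value most LM trainers skip in the loss

def build_labels(token_ids, roles):
    '''Return labels with the text-response span masked out.'''
    labels = []
    for tok, role in zip(token_ids, roles):
        if role == 'response_text':
            labels.append(IGNORE_INDEX)  # excluded from the loss
        else:
            labels.append(tok)           # still supervised
    return labels

tokens = [11, 12, 13, 21, 22, 31, 32, 33]
roles = ['user_text'] * 3 + ['response_text'] * 2 + ['response_audio'] * 3
print(build_labels(tokens, roles))  # [11, 12, 13, -100, -100, 31, 32, 33]
```

<p>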
By excluding the text response portion from the loss calculation, the model preserves its reasoning abilities while inheriting the naturalness of the TTS speaker. This enables personalized interaction without the need for extensive, speaker-specific dialogue datasets.<\/p>\n<h3 class=\"wp-block-heading\"><strong>Full-Duplex Voice Interaction<\/strong><\/h3>\n<p>Covo-Audio evolved into <strong>Covo-Audio-Chat-FD<\/strong>, a variant capable of simultaneous dual-stream communication. The audio encoder is reconfigured to operate in a chunk-streaming manner, and the user and model streams are chunk-interleaved in a <strong>1:4 ratio<\/strong>. Each chunk represents <strong>0.16s<\/strong> of audio.<\/p>\n<p><strong>The system manages conversational states through specific architectural tokens:<\/strong><\/p>\n<ul class=\"wp-block-list\">\n<li><strong>THINK Token<\/strong>: Indicates a listening-only state while the model waits to respond.<\/li>\n<li><strong>SHIFT Token<\/strong>: Signifies the transition to the model\u2019s speaking turn.<\/li>\n<li><strong>BREAK Token<\/strong>: Detects interruption signals (barge-ins), triggering the model to terminate speaking immediately and switch back to listening.<\/li>\n<\/ul>\n<p>For multi-turn scenarios, the model implements a <strong>recursive context-filling strategy<\/strong>, where continuous audio features from user input and generated tokens from previous turns are prepended as historical context.<\/p>\n<h3 class=\"wp-block-heading\"><strong>Audio Reasoning and Reinforcement Learning<\/strong><\/h3>\n<p>To enhance complex reasoning, the model incorporates <strong>Chain-of-Thought (CoT)<\/strong> reasoning and <strong>Group Relative Policy Optimization (GRPO)<\/strong>. 
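<\/p>
<p>As a hedged sketch (the general GRPO recipe with made-up reward values, not the released implementation), the group-relative advantage underlying GRPO can be computed by scoring a group of sampled responses and standardizing their rewards within the group:<\/p>

```python
# Hedged sketch of GRPO-style group-relative advantages (general recipe, not
# the paper's exact implementation). Each sampled response is scored with the
# composite reward R_total = R_accuracy + R_format + R_consistency +
# R_thinking, then rewards are standardized within the group so every sample
# is compared against its siblings instead of a learned value baseline.

def composite_reward(accuracy, fmt, consistency, thinking):
    '''Sum the four verifiable reward terms into R_total.'''
    return accuracy + fmt + consistency + thinking

def group_advantages(rewards):
    '''Standardize rewards within one sampled group (mean 0, unit std).'''
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = var ** 0.5 or 1.0  # guard against a zero-variance group
    return [(r - mean) / std for r in rewards]

# Four sampled answers to one audio question (made-up reward values):
rewards = [composite_reward(1.0, 0.1, 0.2, 0.1),  # correct and well-formed
           composite_reward(0.0, 0.1, 0.2, 0.1),  # wrong but well-formed
           composite_reward(1.0, 0.0, 0.0, 0.0),  # correct, no format bonus
           composite_reward(0.0, 0.0, 0.0, 0.0)]  # wrong, no bonuses
print([round(a, 2) for a in group_advantages(rewards)])
# → [1.3, -0.56, 0.56, -1.3]
```

<p>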
<strong>The model is optimized using a verifiable composite reward function:<\/strong><\/p>\n<div class=\"wp-block-mathml-mathmlblock\">$$R_{total} = R_{accuracy} + R_{format} + R_{consistency} + R_{thinking}$$\n<\/div>\n<p>This structure allows the model to optimize for correctness <math data-latex=\"(R_{accuracy})\"><semantics><mrow><mo form=\"prefix\" stretchy=\"false\">(<\/mo><msub><mi>R<\/mi><mrow><mi>a<\/mi><mi>c<\/mi><mi>c<\/mi><mi>u<\/mi><mi>r<\/mi><mi>a<\/mi><mi>c<\/mi><mi>y<\/mi><\/mrow><\/msub><mo form=\"postfix\" stretchy=\"false\">)<\/mo><\/mrow><annotation encoding=\"application\/x-tex\">(R_{accuracy})<\/annotation><\/semantics><\/math>, structured output adherence <math data-latex=\"(R_{format})\"><semantics><mrow><mo form=\"prefix\" stretchy=\"false\">(<\/mo><msub><mi>R<\/mi><mrow><mi>f<\/mi><mi>o<\/mi><mi>r<\/mi><mi>m<\/mi><mi>a<\/mi><mi>t<\/mi><\/mrow><\/msub><mo form=\"postfix\" stretchy=\"false\">)<\/mo><\/mrow><annotation encoding=\"application\/x-tex\">(R_{format})<\/annotation><\/semantics><\/math>, logical coherence <math data-latex=\"(R_{consistency})\"><semantics><mrow><mo form=\"prefix\" stretchy=\"false\">(<\/mo><msub><mi>R<\/mi><mrow><mi>c<\/mi><mi>o<\/mi><mi>n<\/mi><mi>s<\/mi><mi>i<\/mi><mi>s<\/mi><mi>t<\/mi><mi>e<\/mi><mi>n<\/mi><mi>c<\/mi><mi>y<\/mi><\/mrow><\/msub><mo form=\"postfix\" stretchy=\"false\">)<\/mo><\/mrow><annotation encoding=\"application\/x-tex\">(R_{consistency})<\/annotation><\/semantics><\/math>, and reasoning depth <math data-latex=\"(R_{thinking})\"><semantics><mrow><mo form=\"prefix\" stretchy=\"false\">(<\/mo><msub><mi>R<\/mi><mrow><mi>t<\/mi><mi>h<\/mi><mi>i<\/mi><mi>n<\/mi><mi>k<\/mi><mi>i<\/mi><mi>n<\/mi><mi>g<\/mi><\/mrow><\/msub><mo form=\"postfix\" stretchy=\"false\">)<\/mo><\/mrow><annotation encoding=\"application\/x-tex\">(R_{thinking})<\/annotation><\/semantics><\/math>.<\/p>\n<h3 class=\"wp-block-heading\"><strong>Evaluation and Performance<\/strong><\/h3>\n<p><strong>Covo-Audio (7B) shows 
competitive or superior results on several evaluated benchmarks, with its strongest results among models of comparable scale and on selected speech\/audio tasks.<\/strong> On the <strong>MMAU<\/strong> benchmark, it achieved an average score of <strong>75.30%<\/strong>, the highest among evaluated 7B-scale models. It notably excelled in music understanding with a score of <strong>76.05%<\/strong>. On the <strong>MMSU<\/strong> benchmark, Covo-Audio achieved a leading <strong>66.64%<\/strong> average accuracy.<\/p>\n<p>Regarding its conversational variants, <strong>Covo-Audio-Chat<\/strong> demonstrated strong performance on <strong>URO-Bench<\/strong>, particularly in speech reasoning and spoken dialogue tasks, outperforming models like <strong>Qwen3-Omni<\/strong> on the Chinese track. For empathetic interaction on the <strong>VStyle<\/strong> benchmark, it achieved state-of-the-art results in Mandarin for anger (<strong>4.89<\/strong>), sadness (<strong>4.93<\/strong>), and anxiety (<strong>5.00<\/strong>).<\/p>\n<p><strong>The research team notes an \u2018early-response\u2019 issue in the GaokaoEval full-duplex setting, where unusually long silent pauses between vocal fragments can cause premature responses.<\/strong> This behavior correlates with the model\u2019s pause-handling success metric and is identified as a key direction for future optimization.<\/p>\n<h3 class=\"wp-block-heading\"><strong>Key Takeaways<\/strong><\/h3>\n<ul class=\"wp-block-list\">\n<li><strong>Unified End-to-End Architecture<\/strong>: Covo-Audio is a 7B-parameter model that natively processes continuous audio inputs and generates high-fidelity audio outputs within a single, unified architecture. 
It eliminates the need for cascaded ASR-LLM-TTS pipelines, reducing error propagation and information loss.<\/li>\n<li><strong>Hierarchical Tri-modal Interleaving<\/strong>: The model employs a specialized strategy to align continuous acoustic features, discrete speech tokens, and natural language text. By interleaving these modalities at both phrase and sentence levels, it preserves global semantic integrity while capturing fine-grained prosodic nuances.<\/li>\n<li><strong>Intelligence-Speaker Decoupling<\/strong>: The Tencent research team introduces a technique to decouple dialogue intelligence from specific voice rendering. This allows for flexible voice customization using lightweight Text-to-Speech (TTS) data, significantly lowering the cost of developing personalized conversational agents.<\/li>\n<li><strong>Native Full-Duplex Interaction<\/strong>: The Covo-Audio-Chat-FD variant supports simultaneous listening and speaking. It utilizes specific architectural tokens\u2014THINK, SHIFT, and BREAK\u2014to manage complex real-time dynamics such as smooth turn-taking, backchanneling, and user barge-ins.<\/li>\n<li><strong>Superior Parameter Efficiency<\/strong>: Despite its compact 7B scale, Covo-Audio achieves state-of-the-art or highly competitive performance across core benchmarks, including MMAU, MMSU, and URO-Bench. 
It frequently matches or exceeds the performance of much larger systems, such as 32B-parameter models, in audio and speech understanding tasks.<\/li>\n<\/ul>\n<hr class=\"wp-block-separator has-alpha-channel-opacity\" \/>\n<p>Check out\u00a0the\u00a0<strong><a href=\"https:\/\/arxiv.org\/pdf\/2602.09823\" target=\"_blank\" rel=\"noreferrer noopener\">Paper<\/a>, <a href=\"https:\/\/huggingface.co\/tencent\/Covo-Audio-Chat\" target=\"_blank\" rel=\"noreferrer noopener\">Model on HF<\/a> <\/strong>and<strong> <a href=\"https:\/\/github.com\/Tencent\/Covo-Audio\" target=\"_blank\" rel=\"noreferrer noopener\">Repo<\/a>.\u00a0<\/strong>Also,\u00a0feel free to follow us on\u00a0<strong><a href=\"https:\/\/x.com\/intent\/follow?screen_name=marktechpost\" target=\"_blank\" rel=\"noreferrer noopener\"><mark>Twitter<\/mark><\/a><\/strong>\u00a0and don\u2019t forget to join our\u00a0<strong><a href=\"https:\/\/www.reddit.com\/r\/machinelearningnews\/\" target=\"_blank\" rel=\"noreferrer noopener\">120k+ ML SubReddit<\/a><\/strong>\u00a0and Subscribe to\u00a0<strong><a href=\"https:\/\/www.aidevsignals.com\/\" target=\"_blank\" rel=\"noreferrer noopener\">our Newsletter<\/a><\/strong>. Wait! 
Are you on Telegram?\u00a0<strong><a href=\"https:\/\/t.me\/machinelearningresearchnews\" target=\"_blank\" rel=\"noreferrer noopener\">You can now join us on Telegram as well.<\/a><\/strong><\/p>\n<p>The post <a href=\"https:\/\/www.marktechpost.com\/2026\/03\/26\/tencent-ai-open-sources-covo-audio-a-7b-speech-language-model-and-inference-pipeline-for-real-time-audio-conversations-and-reasoning\/\">Tencent AI Open Sources Covo-Audio: A 7B Speech Language Model and Inference Pipeline for Real-Time Audio Conversations and Reasoning<\/a> appeared first on <a href=\"https:\/\/www.marktechpost.com\/\">MarkTechPost<\/a>.<\/p>","protected":false},"excerpt":{"rendered":"<p>Tencent AI Lab has released Co&hellip;<\/p>\n","protected":false},"author":1,"featured_media":611,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[1],"tags":[],"class_list":["post-610","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-uncategorized"],"_links":{"self":[{"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=\/wp\/v2\/posts\/610","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=610"}],"version-history":[{"count":0,"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=\/wp\/v2\/posts\/610\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=\/wp\/v2\/media\/611"}],"wp:attachment":[{"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=610"}],"wp:term":[{"
taxonomy":"category","embeddable":true,"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=610"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=610"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}