{"id":717,"date":"2026-04-14T16:24:23","date_gmt":"2026-04-14T08:24:23","guid":{"rendered":"https:\/\/connectword.dpdns.org\/?p=717"},"modified":"2026-04-14T16:24:23","modified_gmt":"2026-04-14T08:24:23","slug":"nvidia-and-the-university-of-maryland-researchers-released-audio-flamingo-next-af-next-a-super-powerful-and-open-large-audio-language-model","status":"publish","type":"post","link":"https:\/\/connectword.dpdns.org\/?p=717","title":{"rendered":"NVIDIA and the University of Maryland Researchers Released Audio Flamingo Next (AF-Next): A Super Powerful and Open Large Audio-Language Model"},"content":{"rendered":"<p>Understanding audio has always been the multimodal frontier that lags behind vision. While image-language models have rapidly scaled toward real-world deployment, building open models that robustly reason over speech, environmental sounds, and music \u2014 especially at length \u2014 has remained quite hard. NVIDIA and the University of Maryland researchers are now taking a direct swing at that gap.<\/p>\n<p>The research team have released <strong>Audio Flamingo Next (AF-Next)<\/strong>, the most capable model in the Audio Flamingo series and a fully open Large Audio-Language Model (LALM) trained on internet-scale audio data. <\/p>\n<p><strong>Audio Flamingo Next (AF-Next)<\/strong> comes in <strong>three specialized variants for different use cases<\/strong>. The release includes <strong>AF-Next-Instruct <\/strong>for general question answering, <strong>AF-Next-Think<\/strong> for advanced multi-step reasoning, and <strong>AF-Next-Captioner<\/strong> for detailed audio captioning.<\/p>\n<h3 class=\"wp-block-heading\"><strong>What is a Large Audio-Language Model (LALM)?<\/strong><\/h3>\n<p>A <strong>Large Audio-Language Model (LALM)<\/strong> pairs an audio encoder with a decoder-only language model to enable question answering, captioning, transcription, and reasoning directly over audio inputs. 
Think of it as the audio equivalent of a vision-language model like LLaVA or GPT-4V, but designed to handle speech, environmental sounds, and music simultaneously \u2014 within a single unified model.<\/p>\n<div class=\"wp-block-image\">\n<figure class=\"aligncenter size-full is-resized\"><img loading=\"lazy\" decoding=\"async\" width=\"1618\" height=\"1048\" data-attachment-id=\"78998\" data-permalink=\"https:\/\/www.marktechpost.com\/2026\/04\/14\/nvidia-and-the-university-of-maryland-researchers-released-audio-flamingo-next-af-next-a-super-powerful-and-open-large-audio-language-model\/screenshot-2026-04-14-at-1-22-32-am-2\/\" data-orig-file=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2026\/04\/Screenshot-2026-04-14-at-1.22.32-AM-1.png\" data-orig-size=\"1618,1048\" data-comments-opened=\"1\" data-image-meta='{\"aperture\":\"0\",\"credit\":\"\",\"camera\":\"\",\"caption\":\"\",\"created_timestamp\":\"0\",\"copyright\":\"\",\"focal_length\":\"0\",\"iso\":\"0\",\"shutter_speed\":\"0\",\"title\":\"\",\"orientation\":\"0\"}' data-image-title=\"Screenshot 2026-04-14 at 1.22.32\u202fAM\" data-image-description=\"\" data-image-caption=\"\" data-large-file=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2026\/04\/Screenshot-2026-04-14-at-1.22.32-AM-1-1024x663.png\" src=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2026\/04\/Screenshot-2026-04-14-at-1.22.32-AM-1.png\" alt=\"\" class=\"wp-image-78998\" \/><figcaption class=\"wp-element-caption\">https:\/\/arxiv.org\/pdf\/2604.10905<\/figcaption><\/figure>\n<\/div>\n<h3 class=\"wp-block-heading\"><strong>The Architecture: Four Components Working in a Pipeline<\/strong><\/h3>\n<p>AF-Next is built around four main components: First is the <strong>AF-Whisper audio encoder<\/strong>, a custom Whisper-based encoder further pre-trained on a larger and more diverse corpus, including multilingual speech and multi-talker ASR data. 
Given an audio input, the model resamples it to 16 kHz mono and converts the waveform into a 128-channel log mel-spectrogram using a 25 ms window and 10 ms hop size. The spectrogram is processed in non-overlapping 30-second chunks through AF-Whisper, which outputs features at 50 Hz; a stride-2 pooling layer then halves this to 25 audio tokens per second. AF-Whisper\u2019s hidden dimension is 1280.<\/p>\n<p>Second is the <strong>audio adaptor<\/strong>, a 2-layer MLP that maps AF-Whisper\u2019s audio representations into the language model\u2019s embedding space. Third is the <strong>LLM backbone<\/strong>: Qwen-2.5-7B, a decoder-only causal model with 7B parameters, 36 transformer layers, and 16 attention heads, with context length extended from 32k to 128k tokens through additional long-context training.<\/p>\n<p>A subtle but important architectural detail is <strong>Rotary Time Embeddings (RoTE)<\/strong>. Standard positional encodings in transformers index a token by its discrete sequence position <code>i<\/code>. RoTE replaces this: instead of the standard RoPE rotation angle <code>\u03b8 \u2190 \u2212i \u00b7 2\u03c0<\/code>, RoTE uses <code>\u03b8 \u2190 \u2212\u03c4i \u00b7 2\u03c0<\/code>, where <code>\u03c4i<\/code> is each token\u2019s absolute timestamp. For audio tokens produced at a fixed 40 ms stride, discrete time positions are interpolated before being fed into the RoTE module. This yields positional representations grounded in actual time rather than sequence order \u2014 a core design choice enabling the model\u2019s temporal reasoning, particularly for long audio. Finally, a <strong>streaming TTS module<\/strong> enables voice-to-voice interaction.<\/p>\n<h3 class=\"wp-block-heading\"><strong>Temporal Audio Chain-of-Thought: The Key Reasoning Recipe<\/strong><\/h3>\n<p>Chain-of-Thought (CoT) prompting has improved reasoning across text and vision models, but prior audio CoT work showed only small gains because training datasets were limited to short clips with simple questions. 
AF-Next addresses this with <strong>Temporal Audio Chain-of-Thought<\/strong>, where the model explicitly anchors each intermediate reasoning step to a timestamp in the audio before producing an answer, encouraging faithful evidence aggregation and reducing hallucination over long recordings.<\/p>\n<p>To train this capability, the research team created <strong>AF-Think-Time<\/strong>, a dataset of question\u2013answer\u2013thinking-chain triplets curated from challenging audio sources including trailers, movie recaps, mystery stories, and long-form multi-party conversations. AF-Think-Time consists of approximately 43K training samples, with an average of 446.3 words per thinking chain.<\/p>\n<h3 class=\"wp-block-heading\"><strong>Training at Scale: 1 Million Hours, Four Stages<\/strong><\/h3>\n<p>The final training dataset comprises approximately 108 million samples and approximately 1 million hours of audio, drawn from both existing publicly released datasets and raw audio collected from the open internet and subsequently labeled synthetically. The newly introduced data categories include:<\/p>\n<ul class=\"wp-block-list\">\n<li>over 200K long videos spanning 5 to 30 minutes for long-form captioning and QA;<\/li>\n<li>multi-talker speech understanding data covering speaker identification, interruption identification, and target speaker ASR;<\/li>\n<li>approximately 1 million samples for multi-audio reasoning across multiple simultaneous audio inputs;<\/li>\n<li>approximately 386K safety and instruction-following samples.<\/li>\n<\/ul>\n<p>Training follows a <strong>four-stage curriculum<\/strong>, with each stage using a distinct data mixture and context length. <strong>Pre-training<\/strong> has two sub-stages: Stage 1 trains only the audio adaptor while keeping both AF-Whisper and the LLM frozen (max audio 30 seconds, 8K token context); Stage 2 additionally fine-tunes the audio encoder while still keeping the LLM frozen (max audio 1 minute, 8K token context). 
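The pre-training freezing schedule can be sketched as a small registry of which components train per sub-stage. This is a hedged sketch of the schedule described above, not the authors' code; the module names and the registry structure are illustrative.

```python
# Which modules train in each pre-training sub-stage, per the description
# above. Module names and this registry structure are illustrative only.

PRETRAIN_STAGES = {
    1: {"trainable": {"adaptor"},                    # encoder and LLM frozen
        "max_audio_s": 30, "context_tokens": 8 * 1024},
    2: {"trainable": {"adaptor", "audio_encoder"},   # LLM still frozen
        "max_audio_s": 60, "context_tokens": 8 * 1024},
}

def frozen_modules(stage: int) -> set:
    """Return the modules kept frozen in the given pre-training sub-stage."""
    return {"audio_encoder", "adaptor", "llm"} - PRETRAIN_STAGES[stage]["trainable"]

print(frozen_modules(1))  # only the adaptor trains in sub-stage 1
print(frozen_modules(2))  # the LLM stays frozen until mid-training
```

In a real training loop the same table would drive `requires_grad` flags on the corresponding parameter groups; the point is that the LLM backbone is not unfrozen until mid-training.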
<strong>Mid-training<\/strong> also has two sub-stages: Stage 1 performs full fine-tuning of the entire model, adding AudioSkills-XL and newly curated data (max audio 10 minutes, 24K token context); Stage 2 introduces long-audio captioning and QA, down-sampling the Stage 1 mixture to half its original blend weights while expanding context to 128K tokens and audio to 30 minutes. The model resulting from mid-training is released as <strong>AF-Next-Captioner<\/strong>. <strong>Post-training<\/strong> applies GRPO-based reinforcement learning focused on multi-turn chat, safety, instruction following, and selected skill-specific datasets, producing <strong>AF-Next-Instruct<\/strong>. Finally, <strong>CoT-training<\/strong> starts from AF-Next-Instruct, applies SFT on AF-Think-Time, then GRPO using the post-training data mixture, producing <strong>AF-Next-Think<\/strong>.<\/p>\n<p>One notable contribution from the research team is <strong>hybrid sequence parallelism<\/strong>, which makes 128K-context training feasible on long audio. Without it, audio token expansion blows past standard context windows and the quadratic memory cost of self-attention becomes prohibitive. The solution combines Ulysses attention \u2014 which uses all-to-all collectives to distribute sequence and head dimensions within nodes, where high-bandwidth interconnects are available \u2014 with Ring attention, which circulates key-value blocks across nodes via point-to-point transfers. 
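To see why long audio blows the context budget, here is a quick back-of-envelope count using the rates given earlier (50 Hz encoder features halved by stride-2 pooling to 25 tokens per second); the helper name is ours, and text tokens for the prompt and answer come on top of these figures.

```python
# Audio token counts at the rates described above: AF-Whisper emits features
# at 50 Hz, and stride-2 pooling halves that to 25 tokens/s (40 ms per token).

FEATURE_HZ = 50
POOL_STRIDE = 2
TOKENS_PER_SECOND = FEATURE_HZ // POOL_STRIDE  # 25

def audio_tokens(duration_s: float) -> int:
    """Approximate number of audio tokens the LLM sees for one clip."""
    return int(duration_s * TOKENS_PER_SECOND)

print(audio_tokens(10 * 60))  # 15000 -- fits the 24K mid-training context
print(audio_tokens(30 * 60))  # 45000 -- past a 32K window; needs the 128K context
```

At 30 minutes the audio alone consumes roughly 45K tokens, which is why both the 128K context extension and sequence-parallel attention are load-bearing for long-audio training.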
Ulysses handles intra-node communication efficiently; Ring scales across nodes.<\/p>\n<div class=\"wp-block-image\">\n<figure class=\"aligncenter size-full is-resized\"><img loading=\"lazy\" decoding=\"async\" width=\"996\" height=\"888\" data-attachment-id=\"78996\" data-permalink=\"https:\/\/www.marktechpost.com\/2026\/04\/14\/nvidia-and-the-university-of-maryland-researchers-released-audio-flamingo-next-af-next-a-super-powerful-and-open-large-audio-language-model\/screenshot-2026-04-14-at-1-21-46-am-2\/\" data-orig-file=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2026\/04\/Screenshot-2026-04-14-at-1.21.46-AM-1.png\" data-orig-size=\"996,888\" data-comments-opened=\"1\" data-image-meta='{\"aperture\":\"0\",\"credit\":\"\",\"camera\":\"\",\"caption\":\"\",\"created_timestamp\":\"0\",\"copyright\":\"\",\"focal_length\":\"0\",\"iso\":\"0\",\"shutter_speed\":\"0\",\"title\":\"\",\"orientation\":\"0\"}' data-image-title=\"Screenshot 2026-04-14 at 1.21.46\u202fAM\" data-image-description=\"\" data-image-caption=\"\" data-large-file=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2026\/04\/Screenshot-2026-04-14-at-1.21.46-AM-1.png\" src=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2026\/04\/Screenshot-2026-04-14-at-1.21.46-AM-1.png\" alt=\"\" class=\"wp-image-78996\" \/><figcaption class=\"wp-element-caption\">https:\/\/arxiv.org\/pdf\/2604.10905<\/figcaption><\/figure>\n<\/div>\n<h3 class=\"wp-block-heading\"><strong>Benchmark Results: Strong Across the Board<\/strong><\/h3>\n<p>On MMAU-v05.15.25, the most widely used audio reasoning benchmark, AF-Next-Instruct achieves an average accuracy of 74.20 vs. Audio Flamingo 3\u2019s 72.42, with AF-Next-Think reaching 75.01 and AF-Next-Captioner pushing to 75.76 \u2014 with gains across all three subcategories: sound (79.87), music (75.3), and speech (72.13). 
On the more challenging MMAU-Pro benchmark, AF-Next-Think (58.7) surpasses the closed-source Gemini-2.5-Pro (57.4).<\/p>\n<p>Music understanding sees particularly strong gains. On Medley-Solos-DB instrument recognition, AF-Next reaches 92.13 vs. Audio Flamingo 2\u2019s 85.80. On SongCaps music captioning, GPT5 coverage and correctness scores jump from 6.7 and 6.2 (AF3) to 8.8 and 8.9 respectively.<\/p>\n<p>Long-audio understanding is where AF-Next most clearly separates itself. On LongAudioBench, AF-Next-Instruct achieves 73.9, outperforming both Audio Flamingo 3 (68.6) and the closed-source Gemini 2.5 Pro (60.4). On the speech-inclusive variant (+Speech), AF-Next reaches 81.2 vs. Gemini 2.5 Pro\u2019s 66.2. On ASR, AF-Next-Instruct sets new lows among LALMs with a Word Error Rate of 1.54 on LibriSpeech test-clean and 2.76 on test-other. On VoiceBench, AF-Next-Instruct achieves the highest scores on AlpacaEval (4.43), CommonEval (3.96), and OpenBookQA (80.9), surpassing Audio Flamingo 3 by over 14 points on OpenBookQA. On CoVoST2 speech translation, AF-Next shows a particularly notable 12-point improvement over Phi-4-mm on Arabic EN\u2192X translation (21.9 vs. 
9.9).<\/p>\n<div class=\"wp-block-image\">\n<figure class=\"aligncenter size-full is-resized\"><img loading=\"lazy\" decoding=\"async\" width=\"1532\" height=\"484\" data-attachment-id=\"79000\" data-permalink=\"https:\/\/www.marktechpost.com\/2026\/04\/14\/nvidia-and-the-university-of-maryland-researchers-released-audio-flamingo-next-af-next-a-super-powerful-and-open-large-audio-language-model\/screenshot-2026-04-14-at-1-23-50-am-2\/\" data-orig-file=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2026\/04\/Screenshot-2026-04-14-at-1.23.50-AM-1.png\" data-orig-size=\"1532,484\" data-comments-opened=\"1\" data-image-meta='{\"aperture\":\"0\",\"credit\":\"\",\"camera\":\"\",\"caption\":\"\",\"created_timestamp\":\"0\",\"copyright\":\"\",\"focal_length\":\"0\",\"iso\":\"0\",\"shutter_speed\":\"0\",\"title\":\"\",\"orientation\":\"0\"}' data-image-title=\"Screenshot 2026-04-14 at 1.23.50\u202fAM\" data-image-description=\"\" data-image-caption=\"\" data-large-file=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2026\/04\/Screenshot-2026-04-14-at-1.23.50-AM-1-1024x324.png\" src=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2026\/04\/Screenshot-2026-04-14-at-1.23.50-AM-1.png\" alt=\"\" class=\"wp-image-79000\" \/><figcaption class=\"wp-element-caption\">https:\/\/arxiv.org\/pdf\/2604.10905<\/figcaption><\/figure>\n<\/div>\n<h3 class=\"wp-block-heading\"><strong>Key Takeaways<\/strong><\/h3>\n<p>Here are the key takeaways:<\/p>\n<ul class=\"wp-block-list\">\n<li><strong>A Fully Open Audio-Language Model at Internet Scale<\/strong>: AF-Next is considered the first LALM to scale audio understanding to internet-scale data \u2014 approximately 108 million samples and 1 million hours of audio.<\/li>\n<li><strong>Temporal Audio Chain-of-Thought Solves Long-Audio Reasoning<\/strong>: Unlike prior audio CoT approaches that reason in free-form text, AF-Next explicitly anchors each intermediate reasoning step to a timestamp in the audio before producing an answer. 
This makes the model significantly more faithful and interpretable on long recordings up to 30 minutes \u2014 a problem prior models largely sidestepped.<\/li>\n<li><strong>Three Specialized Variants for Different Use Cases<\/strong>: The release includes AF-Next-Instruct for general question answering, AF-Next-Think for advanced multi-step reasoning, and AF-Next-Captioner for detailed audio captioning \u2014 allowing practitioners to select the right model based on their task rather than using a one-size-fits-all checkpoint.<\/li>\n<li><strong>Beats Closed Models on Long Audio Despite Being Smaller<\/strong>: On LongAudioBench, AF-Next-Instruct scores 73.9 \u2014 outperforming the closed-source Gemini 2.5 Pro (60.4) and Audio Flamingo 3 (68.6). On the more challenging speech-inclusive variant, the gap widens further, with AF-Next reaching 81.2 vs. Gemini 2.5 Pro\u2019s 66.2.<\/li>\n<\/ul>\n<hr class=\"wp-block-separator has-alpha-channel-opacity\" \/>\n<p>Check out the <strong><a href=\"https:\/\/arxiv.org\/pdf\/2604.10905\" target=\"_blank\" rel=\"noreferrer noopener\">Paper<\/a>, <a href=\"https:\/\/huggingface.co\/spaces\/nvidia\/audio-flamingo-next\" target=\"_blank\" rel=\"noreferrer noopener\">Project Page<\/a>, and <a href=\"https:\/\/huggingface.co\/nvidia\/audio-flamingo-next-hf\" target=\"_blank\" rel=\"noreferrer noopener\">Model Weights<\/a><\/strong>.<\/p>\n<p>The post <a href=\"https:\/\/www.marktechpost.com\/2026\/04\/14\/nvidia-and-the-university-of-maryland-researchers-released-audio-flamingo-next-af-next-a-super-powerful-and-open-large-audio-language-model\/\">NVIDIA and the University of Maryland Researchers Released Audio Flamingo Next (AF-Next): A Super Powerful and Open Large Audio-Language Model<\/a> appeared first on <a href=\"https:\/\/www.marktechpost.com\/\">MarkTechPost<\/a>.<\/p>","protected":false},"excerpt":{"rendered":"<p>Understanding audio has always&hellip;<\/p>\n","protected":false},"author":1,"featured_media":718,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[1],"tags":[],"class_list":["post-717","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-uncategorized"],"_links":{"self":[{"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=\/wp\/v2\/posts\/717","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=717"}],"version-history":[{"count":0,"href":"https:\/\/connectword.dpdns.org\/index.php?rest_
route=\/wp\/v2\/posts\/717\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=\/wp\/v2\/media\/718"}],"wp:attachment":[{"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=717"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=717"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=717"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}