{"id":811,"date":"2026-04-29T15:31:31","date_gmt":"2026-04-29T07:31:31","guid":{"rendered":"https:\/\/connectword.dpdns.org\/?p=811"},"modified":"2026-04-29T15:31:31","modified_gmt":"2026-04-29T07:31:31","slug":"smol-audio-a-colab-friendly-notebook-collection-for-fine-tuning-whisper-parakeet-voxtral-granite-speech-and-audio-flamingo-3","status":"publish","type":"post","link":"https:\/\/connectword.dpdns.org\/?p=811","title":{"rendered":"smol-audio: A Colab-Friendly Notebook Collection for Fine-Tuning Whisper, Parakeet, Voxtral, Granite Speech, and Audio Flamingo 3"},"content":{"rendered":"<p>Audio AI has had a breakout year. Automatic speech recognition has gotten dramatically better with models like OpenAI\u2019s Whisper variants, NVIDIA\u2019s Parakeet, and Mistral\u2019s Voxtral. Audio understanding stepped forward with models like NVIDIA\u2019s Audio Flamingo 3. Dialogue-grade text-to-speech arrived via Nari Labs\u2019 Dia-1.6B. And Meta shipped the Perception Encoder Audiovisual (PE-AV), a multimodal encoder capable of learning a shared embedding space across audio, video, and text. The frontier has never moved faster.<\/p>\n<p>The catch? The practical knowledge required to actually work with these models \u2014 how to fine-tune them, adapt them to new languages, or run efficient inference \u2014 is scattered across GitHub issues, research blogs, and private notebooks that never see the light of day. If you are an ML engineer who just wants to fine-tune Whisper on a new domain or run zero-shot video classification with PE-AV, you are often starting from scratch.<\/p>\n<p>That is the gap <strong>smol-audio<\/strong> is designed to close.<\/p>\n<h3 class=\"wp-block-heading\"><strong>What is smol-audio?<\/strong><\/h3>\n<p>Released under the Apache-2.0 license by the Deep-unlearning team, smol-audio is a flat repository of self-contained Jupyter notebooks, each focused on a single practical audio AI task. 
Every notebook is designed to be opened directly in Google Colab, requires no local GPU setup, and is built entirely on the Hugging Face ecosystem \u2014 specifically <code>transformers<\/code>, <code>datasets<\/code>, <code>peft<\/code>, and <code>accelerate<\/code>. Most recipes fit within a 16 GB Colab runtime, which means a free or standard Colab tier is sufficient for the majority of tasks.<\/p>\n<p>The \u201cflat repo\u201d design is a deliberate choice. Rather than wrapping recipes inside a framework or hiding complexity behind convenience functions, smol-audio exposes every step. You can read the training loop, understand the data pipeline, and modify the configuration without reverse-engineering a library. For early-career engineers, that transparency is genuinely educational.<\/p>\n<h3 class=\"wp-block-heading\"><strong>ASR Fine-Tuning: Whisper, Parakeet, Voxtral, and Granite Speech<\/strong><\/h3>\n<p>The largest category in the repo today covers ASR fine-tuning across four distinct model families. Each requires meaningfully different handling.<\/p>\n<p>The <strong>Whisper<\/strong> notebook covers fine-tuning using <code>transformers<\/code> and <code>datasets<\/code>, making it straightforward to adapt the encoder-decoder architecture to a custom language or narrow domain. Whisper uses a sequence-to-sequence approach, generating transcripts token by token \u2014 familiar territory for anyone who has worked with language models.<\/p>\n<p><strong>NVIDIA\u2019s Parakeet<\/strong> uses a CTC (Connectionist Temporal Classification) architecture rather than a sequence-to-sequence setup. CTC models are typically faster and lighter at inference because they predict all frames in a single non-autoregressive pass; during training, the CTC loss sums over every valid alignment between audio frames and output tokens, so no frame-level labels are needed. 
The smol-audio notebook covers both full fine-tuning and LoRA (Low-Rank Adaptation) for Parakeet, which is important because fully fine-tuning large CTC models can be memory-intensive.<\/p>\n<p><strong>Mistral\u2019s Voxtral<\/strong> is architecturally distinct from both Whisper and Parakeet. Rather than a traditional ASR encoder-decoder, Voxtral is built on a large language model backbone \u2014 Ministral 3B for Voxtral Mini and Mistral Small 3.1 24B for Voxtral Small \u2014 making it an LLM-based speech understanding model. The smol-audio notebook handles fine-tuning for ASR with prompt masking, supporting both full fine-tuning and LoRA. Prompt masking is important here precisely because of this LLM architecture: when a model accepts text prompts alongside audio input, you typically do not want to compute loss on the prompt tokens themselves \u2014 only on the generated transcription. If prompt tokens are left in the loss, the model is also trained to reproduce the fixed prompt, which can dilute the transcription signal, so having a working reference implementation saves significant debugging time.<\/p>\n<p><strong>IBM\u2019s Granite Speech<\/strong> gets its own notebook focused on Italian ASR using the YODAS-Granary dataset. This is a useful example beyond just the model: it demonstrates domain- and language-specific fine-tuning on a real multilingual speech corpus, a common production scenario.<\/p>\n<h3 class=\"wp-block-heading\"><strong>Audio Understanding with NVIDIA\u2019s Audio Flamingo 3<\/strong><\/h3>\n<p>Audio Flamingo 3, developed by NVIDIA, is a Large Audio Language Model (LALM) for reasoning and understanding across speech, sound, and music. The smol-audio notebook fine-tunes it specifically for the audio captioning task \u2014 generating a natural language description of an audio clip, which is useful for accessibility tooling, content indexing, and retrieval systems. 
The notebook covers both full fine-tuning and LoRA-based fine-tuning, giving practitioners the choice between maximum performance and memory efficiency.<\/p>\n<p>LoRA, for those newer to parameter-efficient fine-tuning, works by freezing the original model weights and injecting small trainable rank-decomposition matrices into specific layers. For large multimodal models like Audio Flamingo 3, LoRA can substantially reduce GPU memory requirements compared to full fine-tuning, since gradients and optimizer state are kept only for the small adapter matrices, enabling iteration on commodity hardware.<\/p>\n<h3 class=\"wp-block-heading\"><strong>Dialogue TTS with Dia-1.6B<\/strong><\/h3>\n<p>The Dia-1.6B notebook covers dialogue-style text-to-speech, where the goal is not just synthesizing a single speaker but generating natural conversational exchanges. Dia is a 1.6-billion-parameter TTS model by Nari Labs capable of producing multi-speaker dialogue, making it relevant for anyone building voice agents, podcast generation tools, or conversational interfaces.<\/p>\n<h3 class=\"wp-block-heading\"><strong>Multimodal Inference with Meta\u2019s PE-AV<\/strong><\/h3>\n<p>Perhaps the most forward-looking notebook in the current release covers inference with Meta\u2019s <strong>Perception Encoder Audiovisual (PE-AV)<\/strong>. PE-AV is a multimodal encoder that learns a single shared embedding space across audio, video, and text \u2014 enabling zero-shot video classification without any task-specific fine-tuning, and audio\u2194text retrieval on benchmarks like AudioCaps. 
Because all three modalities map into the same embedding space, cross-modal queries such as retrieving an audio clip from a text description work via simple dot-product similarity.<\/p>\n<p>The notebook demonstrates how to run these inference pipelines directly, which is valuable because multimodal models with joint audio-visual-text encoders are architecturally more complex than single-modality models and typically require careful preprocessing of multiple input modalities.<\/p>\n<hr class=\"wp-block-separator aligncenter has-alpha-channel-opacity is-style-wide\" \/>\n<p>Check out\u00a0the\u00a0<strong><a href=\"https:\/\/github.com\/Deep-unlearning\/smol-audio\" target=\"_blank\" rel=\"noreferrer noopener\">Repo here<\/a><\/strong>.<\/p>\n<p>The post <a href=\"https:\/\/www.marktechpost.com\/2026\/04\/29\/smol-audio-a-colab-friendly-notebook-collection-for-fine-tuning-whisper-parakeet-voxtral-granite-speech-and-audio-flamingo-3\/\">smol-audio: A Colab-Friendly Notebook Collection for Fine-Tuning Whisper, Parakeet, Voxtral, Granite Speech, and Audio Flamingo 3<\/a> appeared first on <a href=\"https:\/\/www.marktechpost.com\/\">MarkTechPost<\/a>.<\/p>","protected":false},"excerpt":{"rendered":"<p>Audio AI has had a breakout ye&hellip;<\/p>\n","protected":false},"author":1,"featured_media":29,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[1],"tags":[],"class_list":["post-811","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-uncategorized"],"_links":{"self":[{"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=\/wp\/v2\/posts\/811","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=811"}],"version-history":[{"count":0,"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=\/wp\/v2\/posts\/811\/r
evisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=\/wp\/v2\/media\/29"}],"wp:attachment":[{"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=811"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=811"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=811"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}