{"id":359,"date":"2026-02-05T15:36:08","date_gmt":"2026-02-05T07:36:08","guid":{"rendered":"https:\/\/connectword.dpdns.org\/?p=359"},"modified":"2026-02-05T15:36:08","modified_gmt":"2026-02-05T07:36:08","slug":"mistral-ai-launches-voxtral-transcribe-2-pairing-batch-diarization-and-open-realtime-asr-for-multilingual-production-workloads-at-scale","status":"publish","type":"post","link":"https:\/\/connectword.dpdns.org\/?p=359","title":{"rendered":"Mistral AI Launches Voxtral Transcribe 2: Pairing Batch Diarization And Open Realtime ASR For Multilingual Production Workloads At Scale"},"content":{"rendered":"<p>Automatic speech recognition (ASR) is becoming a core building block for AI products, from meeting tools to voice agents. Mistral\u2019s new <strong>Voxtral Transcribe 2<\/strong> family targets this space with two models that split cleanly into batch and realtime use cases, while keeping cost, latency, and deployment constraints in focus.<\/p>\n<p><strong>The release includes:<\/strong><\/p>\n<ul class=\"wp-block-list\">\n<li><strong>Voxtral Mini Transcribe V2<\/strong> for batch transcription with diarization.<\/li>\n<li><strong>Voxtral Realtime (Voxtral Mini 4B Realtime 2602)<\/strong> for low-latency streaming transcription, released as open weights. <\/li>\n<\/ul>\n<p>Both models are designed for <strong>13 languages<\/strong>: English, Chinese, Hindi, Spanish, Arabic, French, Portuguese, Russian, German, Japanese, Korean, Italian, and Dutch. <\/p>\n<h3 class=\"wp-block-heading\"><strong>Model family: batch and streaming, with clear roles<\/strong><\/h3>\n<p>Mistral positions Voxtral Transcribe 2 as \u201ctwo next-generation speech-to-text models\u201d with <strong>state-of-the-art transcription quality, diarization, and ultra-low latency<\/strong>. <\/p>\n<ul class=\"wp-block-list\">\n<li><strong>Voxtral Mini Transcribe V2<\/strong> is the <strong>batch model<\/strong>. 
It is optimized for transcription quality and diarization across domains and languages, and is exposed as an efficient audio input model in the Mistral API. <\/li>\n<li><strong>Voxtral Realtime<\/strong> is the <strong>streaming model<\/strong>. It is built with a dedicated streaming architecture and is released as an open-weights model under <strong>Apache 2.0<\/strong> on Hugging Face, with a recommended vLLM runtime. <\/li>\n<\/ul>\n<p>A key detail: <strong>speaker diarization is provided by Voxtral Mini Transcribe V2<\/strong>, not by Voxtral Realtime. Realtime focuses strictly on fast, accurate streaming transcription.<\/p>\n<h3 class=\"wp-block-heading\"><strong>Voxtral Realtime: 4B-parameter streaming ASR with configurable delay<\/strong><\/h3>\n<p><strong>Voxtral Mini 4B Realtime 2602<\/strong> is a <strong>4B-parameter multilingual realtime speech-transcription model<\/strong>. It is among the first open-weights models to reach accuracy comparable to offline systems with a delay under 500 ms.<\/p>\n<p><strong>Architecture:<\/strong><\/p>\n<ul class=\"wp-block-list\">\n<li>\u22483.4B-parameter <strong>language model<\/strong>.<\/li>\n<li>\u22480.6B-parameter <strong>audio encoder<\/strong>.<\/li>\n<li>The audio encoder is trained from scratch with <strong>causal attention<\/strong>.<\/li>\n<li>Both encoder and LM use <strong>sliding-window attention<\/strong>, enabling effectively \u201cinfinite\u201d streaming.<\/li>\n<\/ul>\n<p><strong>Latency vs accuracy is explicitly configurable:<\/strong><\/p>\n<ul class=\"wp-block-list\">\n<li><strong>Transcription delay is tunable from 80 ms to 2.4 s<\/strong> via a <code>transcription_delay_ms<\/code> parameter. <\/li>\n<li>Mistral describes latency as <strong>\u201cconfigurable down to sub-200 ms\u201d<\/strong> for live applications. <\/li>\n<li>At <strong>480 ms delay<\/strong>, Realtime matches leading offline open-source transcription models and realtime APIs on benchmarks such as FLEURS and long-form English. 
<\/li>\n<li>At <strong>2.4 s delay<\/strong>, Realtime matches <strong>Voxtral Mini Transcribe V2<\/strong> on FLEURS, which is appropriate for subtitling tasks where slightly higher latency is acceptable. <\/li>\n<\/ul>\n<p><strong>From a deployment standpoint:<\/strong><\/p>\n<ul class=\"wp-block-list\">\n<li>The model is released in <strong>BF16<\/strong> and is designed for <strong>on-device or edge deployment<\/strong>.<\/li>\n<li>It can run in realtime on a <strong>single GPU with \u226516 GB memory<\/strong>, according to the vLLM serving instructions in the model card.<\/li>\n<\/ul>\n<p><strong>The main control knob is the delay setting:<\/strong><\/p>\n<ul class=\"wp-block-list\">\n<li>Lower delays (\u224880\u2013200 ms) for interactive agents where responsiveness dominates.<\/li>\n<li>Around <strong>480 ms<\/strong> as the recommended \u201csweet spot\u201d between latency and accuracy.<\/li>\n<li>Higher delays (up to 2.4 s) when you need accuracy as close as possible to the batch model.<\/li>\n<\/ul>\n<h3 class=\"wp-block-heading\"><strong>Voxtral Mini Transcribe V2: batch ASR with diarization and context biasing<\/strong><\/h3>\n<p><strong>Voxtral Mini Transcribe V2<\/strong> is a closed-weights <strong>audio input model<\/strong> optimized only for transcription. 
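The delay guidance above can be condensed into a small helper. This is a minimal illustrative sketch, not Mistral code: the helper and use-case names are our own assumptions, while the 80 ms\u20132.4 s supported range, the ~480 ms sweet spot, and the <code>transcription_delay_ms<\/code> parameter name come from the release notes.

```python
# Sketch: choosing a transcription_delay_ms value for Voxtral Realtime.
# The 80-2400 ms range and guidance values are from the article; the
# helper name and use-case labels are illustrative assumptions.

MIN_DELAY_MS = 80      # lowest supported transcription delay
MAX_DELAY_MS = 2400    # highest supported transcription delay

# Article guidance mapped to example use cases: low delay when
# responsiveness dominates, ~480 ms as the latency/accuracy sweet spot,
# and the maximum when accuracy should approach the batch model.
GUIDANCE_MS = {
    "interactive_agent": 120,
    "general": 480,
    "subtitling": 2400,
}

def transcription_delay_ms(use_case: str = "general") -> int:
    """Return a delay for the given use case, clamped to the valid range."""
    delay = GUIDANCE_MS.get(use_case, GUIDANCE_MS["general"])
    return max(MIN_DELAY_MS, min(MAX_DELAY_MS, delay))

print(transcription_delay_ms())               # 480
print(transcription_delay_ms("subtitling"))   # 2400
```

The clamp matters because the model card only documents the 80 ms\u20132.4 s window; values outside it are simply not supported settings.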
It is exposed in the Mistral API as <code>voxtral-mini-2602<\/code> at <strong>$0.003 per minute<\/strong>.<\/p>\n<p><strong>On benchmarks and pricing:<\/strong><\/p>\n<ul class=\"wp-block-list\">\n<li>Around <strong>4% word error rate (WER)<\/strong> on the FLEURS transcription benchmark, averaged over the top 10 languages.<\/li>\n<li><strong>\u201cBest price-performance of any transcription API\u201d<\/strong> at $0.003\/min.<\/li>\n<li>Outperforms <strong>GPT-4o mini Transcribe<\/strong>, <strong>Gemini 2.5 Flash<\/strong>, <strong>Assembly Universal<\/strong>, and <strong>Deepgram Nova<\/strong> on accuracy in their comparisons.<\/li>\n<li>Processes audio <strong>\u22483\u00d7 faster than ElevenLabs\u2019 Scribe v2<\/strong> while matching quality at <strong>one-fifth the cost<\/strong>.<\/li>\n<\/ul>\n<p><strong>Enterprise-oriented features are concentrated in this model:<\/strong><\/p>\n<ul class=\"wp-block-list\">\n<li><strong>Speaker diarization<\/strong>\n<ul class=\"wp-block-list\">\n<li>Outputs speaker labels with precise start and end times.<\/li>\n<li>Designed for meetings, interviews, and multi-party calls.<\/li>\n<li>For overlapping speech, the model typically emits a single speaker label.<\/li>\n<\/ul>\n<\/li>\n<li><strong>Context biasing<\/strong>\n<ul class=\"wp-block-list\">\n<li>Accepts up to <strong>100 words or phrases<\/strong> to bias transcription toward specific names or domain terms.<\/li>\n<li>Optimized for English, with <strong>experimental support<\/strong> for other languages.<\/li>\n<\/ul>\n<\/li>\n<li><strong>Word-level timestamps<\/strong>\n<ul class=\"wp-block-list\">\n<li>Per-word start and end timestamps for subtitles, alignment, and searchable audio workflows.<\/li>\n<\/ul>\n<\/li>\n<li><strong>Noise robustness<\/strong>\n<ul class=\"wp-block-list\">\n<li>Maintains accuracy in noisy environments such as factory floors, call centers, and field recordings.<\/li>\n<\/ul>\n<\/li>\n<li><strong>Longer audio support<\/strong>\n<ul 
class=\"wp-block-list\">\n<li>Handles up to <strong>3 hours<\/strong> of audio in a single request.<\/li>\n<\/ul>\n<\/li>\n<\/ul>\n<p>Language coverage mirrors Realtime: 13 languages, with Mistral noting that non-English performance \u201csignificantly outpaces competitors\u201d in their evaluation. <\/p>\n<div class=\"wp-block-image\">\n<figure class=\"aligncenter size-full is-resized\"><img loading=\"lazy\" decoding=\"async\" width=\"1956\" height=\"1024\" data-attachment-id=\"77750\" data-permalink=\"https:\/\/www.marktechpost.com\/2026\/02\/04\/mistral-ai-launches-voxtral-transcribe-2-pairing-batch-diarization-and-open-realtime-asr-for-multilingual-production-workloads-at-scale\/screenshot-2026-02-04-at-11-28-56-pm-2\/\" data-orig-file=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2026\/02\/Screenshot-2026-02-04-at-11.28.56-PM-1.png\" data-orig-size=\"1956,1024\" data-comments-opened=\"1\" data-image-meta='{\"aperture\":\"0\",\"credit\":\"\",\"camera\":\"\",\"caption\":\"\",\"created_timestamp\":\"0\",\"copyright\":\"\",\"focal_length\":\"0\",\"iso\":\"0\",\"shutter_speed\":\"0\",\"title\":\"\",\"orientation\":\"0\"}' data-image-title=\"Screenshot 2026-02-04 at 11.28.56\u202fPM\" data-image-description=\"\" data-image-caption=\"\" data-medium-file=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2026\/02\/Screenshot-2026-02-04-at-11.28.56-PM-1-300x157.png\" data-large-file=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2026\/02\/Screenshot-2026-02-04-at-11.28.56-PM-1-1024x536.png\" src=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2026\/02\/Screenshot-2026-02-04-at-11.28.56-PM-1.png\" alt=\"\" class=\"wp-image-77750\" \/><figcaption class=\"wp-element-caption\">https:\/\/mistral.ai\/news\/voxtral-transcribe-2<\/figcaption><\/figure>\n<\/div>\n<h3 class=\"wp-block-heading\"><strong>APIs, tooling, and deployment options<\/strong><\/h3>\n<p><strong>The integration paths are straightforward and differ slightly between the two 
models:<\/strong><\/p>\n<ul class=\"wp-block-list\">\n<li><strong>Voxtral Mini Transcribe V2<\/strong>\n<ul class=\"wp-block-list\">\n<li>Served via the Mistral <strong>audio transcription API<\/strong> (<code>\/v1\/audio\/transcriptions<\/code>) as an efficient transcription-only service. <\/li>\n<li>Priced at <strong>$0.003\/min<\/strong>. (<a href=\"https:\/\/mistral.ai\/news\/voxtral-transcribe-2\">Mistral AI<\/a>)<\/li>\n<li>Available in <strong>Mistral Studio\u2019s audio playground<\/strong> and in <strong>Le Chat<\/strong> for interactive testing.<\/li>\n<\/ul>\n<\/li>\n<li><strong>Voxtral Realtime<\/strong>\n<ul class=\"wp-block-list\">\n<li>Available via the Mistral API at <strong>$0.006\/min<\/strong>. <\/li>\n<li>Released as <strong>open weights<\/strong> on Hugging Face (<code>mistralai\/Voxtral-Mini-4B-Realtime-2602<\/code>) under Apache 2.0, with official vLLM Realtime support.<\/li>\n<\/ul>\n<\/li>\n<\/ul>\n<p><strong>The audio playground in Mistral Studio lets users: <\/strong><\/p>\n<ul class=\"wp-block-list\">\n<li>Upload up to <strong>10 audio files<\/strong> (.mp3, .wav, .m4a, .flac, .ogg) up to <strong>1 GB<\/strong> each.<\/li>\n<li>Toggle diarization, choose timestamp granularity, and configure context bias terms.<\/li>\n<\/ul>\n<h3 class=\"wp-block-heading\"><strong>Key Takeaways<\/strong><\/h3>\n<ol class=\"wp-block-list\">\n<li><strong>Two-model family with clear roles<\/strong>: Voxtral Mini Transcribe V2 targets batch transcription and diarization, while Voxtral Realtime targets low-latency streaming ASR, both across 13 languages.<\/li>\n<li><strong>Realtime model: 4B parameters with tunable delay<\/strong>: Voxtral Realtime uses a 4B architecture (\u22483.4B LM + \u22480.6B encoder) with sliding-window and causal attention, and supports configurable transcription delay from 80 ms to 2.4 s.<\/li>\n<li><strong>Latency vs accuracy trade-off is explicit<\/strong>: At around 480 ms delay, Voxtral Realtime reaches accuracy comparable to strong 
offline and realtime systems, and at 2.4 s it matches Voxtral Mini Transcribe V2 on FLEURS.<\/li>\n<li><strong>Batch model adds diarization and enterprise features<\/strong>: Voxtral Mini Transcribe V2 provides diarization, context biasing with up to 100 phrases, word-level timestamps, noise robustness, and supports up to 3 hours of audio per request at $0.003\/min.<\/li>\n<li><strong>Deployment: closed batch API, open realtime weights<\/strong>: Mini Transcribe V2 is served via Mistral\u2019s audio transcription API and playground, while Voxtral Realtime is priced at $0.006\/min and also available as Apache 2.0 open weights with official vLLM Realtime support.<\/li>\n<\/ol>\n<hr class=\"wp-block-separator has-alpha-channel-opacity\" \/>\n<p>Check out the\u00a0<strong><a href=\"https:\/\/mistral.ai\/news\/voxtral-transcribe-2\" target=\"_blank\" rel=\"noreferrer noopener\">Technical details<\/a> and <a href=\"https:\/\/huggingface.co\/mistralai\/Voxtral-Mini-4B-Realtime-2602\" target=\"_blank\" rel=\"noreferrer noopener\">Model Weights<\/a><\/strong>.<\/p>\n<p>The post <a href=\"https:\/\/www.marktechpost.com\/2026\/02\/04\/mistral-ai-launches-voxtral-transcribe-2-pairing-batch-diarization-and-open-realtime-asr-for-multilingual-production-workloads-at-scale\/\">Mistral AI Launches Voxtral Transcribe 2: Pairing Batch Diarization And Open Realtime ASR For Multilingual Production Workloads At Scale<\/a> appeared first on <a href=\"https:\/\/www.marktechpost.com\/\">MarkTechPost<\/a>.<\/p>","protected":false},"excerpt":{"rendered":"<p>Automatic speech recognition (&hellip;<\/p>\n","protected":false},"author":1,"featured_media":360,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[1],"tags":[],"class_list":["post-359","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-uncategorized"],"_links":{"self":[{"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=\/wp\/v2\/posts\/359","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=359"}],"version-history":[{"count":0,"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=\/wp\/v2\/posts\/359\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=\/wp\/v2\/media\/360"}],"wp:attachment":[{"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=359"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=359"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=359"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}