{"id":942,"date":"2026-05-21T15:14:59","date_gmt":"2026-05-21T07:14:59","guid":{"rendered":"https:\/\/connectword.dpdns.org\/?p=942"},"modified":"2026-05-21T15:14:59","modified_gmt":"2026-05-21T07:14:59","slug":"one-model-three-modalities-bytedance-releases-lance-for-image-and-video-understanding-generation-and-editing","status":"publish","type":"post","link":"https:\/\/connectword.dpdns.org\/?p=942","title":{"rendered":"One Model, Three Modalities: ByteDance Releases Lance for Image and Video Understanding, Generation, and Editing"},"content":{"rendered":"<p class=\"wp-block-paragraph\">Building a single model that can both understand and generate images and videos is harder than it sounds. The two tasks pull in opposite directions. Understanding benefits from high-level semantic features tightly aligned with language. Generation needs low-level continuous representations that preserve texture, geometry, and temporal dynamics. Most systems handle this tension by separating the two into distinct architectures, then bridging them post-hoc.<\/p>\n<p class=\"wp-block-paragraph\">The ByteDance research team took a different approach with <strong>Lance<\/strong>. Rather than assembling separate components, the team designed a model that natively integrates understanding, generation, and editing across both image and video modalities, trained jointly from the start. 
<\/p>\n<div class=\"wp-block-image\">\n<figure class=\"aligncenter size-large is-resized\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"455\" data-attachment-id=\"80011\" data-permalink=\"https:\/\/www.marktechpost.com\/2026\/05\/21\/one-model-three-modalities-bytedance-releases-lance-for-image-and-video-understanding-generation-and-editing\/screenshot-2026-05-21-at-12-07-02-am-2\/\" data-orig-file=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2026\/05\/Screenshot-2026-05-21-at-12.07.02-AM-1.png\" data-orig-size=\"1922,854\" data-comments-opened=\"0\" data-image-meta='{\"aperture\":\"0\",\"credit\":\"\",\"camera\":\"\",\"caption\":\"\",\"created_timestamp\":\"0\",\"copyright\":\"\",\"focal_length\":\"0\",\"iso\":\"0\",\"shutter_speed\":\"0\",\"title\":\"\",\"orientation\":\"0\",\"alt\":\"\"}' data-image-title=\"Screenshot 2026-05-21 at 12.07.02\u202fAM\" data-image-description=\"\" data-image-caption=\"\" data-large-file=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2026\/05\/Screenshot-2026-05-21-at-12.07.02-AM-1-1024x455.png\" src=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2026\/05\/Screenshot-2026-05-21-at-12.07.02-AM-1-1024x455.png\" alt=\"\" class=\"wp-image-80011\" \/><figcaption class=\"wp-element-caption\">https:\/\/arxiv.org\/pdf\/2605.18678<\/figcaption><\/figure>\n<\/div>\n<h2 class=\"wp-block-heading\"><strong>What Lance Can Do<\/strong><\/h2>\n<p class=\"wp-block-paragraph\">Lance organizes its capabilities into three output families: text (X2T), images (X2I), and videos (X2V). On the understanding side, this covers image and video captioning, visual question answering, OCR, visual grounding, and reasoning. On the generation side, it handles text-to-image, text-to-video, image-to-video, subject-driven generation, image editing, and video editing \u2014 including multi-turn consistency editing across both modalities.<\/p>\n<p class=\"wp-block-paragraph\">This all-in-one capability is a major milestone. 
While standard unified architectures typically stop at basic image understanding and text-to-image generation, Lance is among the few to natively bridge the entire image-video ecosystem across both understanding and generation tasks.<\/p>\n<div class=\"wp-block-image\">\n<figure class=\"aligncenter size-full is-resized\"><img loading=\"lazy\" decoding=\"async\" width=\"1724\" height=\"998\" data-attachment-id=\"80013\" data-permalink=\"https:\/\/www.marktechpost.com\/2026\/05\/21\/one-model-three-modalities-bytedance-releases-lance-for-image-and-video-understanding-generation-and-editing\/screenshot-2026-05-21-at-12-07-26-am-2\/\" data-orig-file=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2026\/05\/Screenshot-2026-05-21-at-12.07.26-AM-1.png\" data-orig-size=\"1724,998\" data-comments-opened=\"0\" data-image-meta='{\"aperture\":\"0\",\"credit\":\"\",\"camera\":\"\",\"caption\":\"\",\"created_timestamp\":\"0\",\"copyright\":\"\",\"focal_length\":\"0\",\"iso\":\"0\",\"shutter_speed\":\"0\",\"title\":\"\",\"orientation\":\"0\",\"alt\":\"\"}' data-image-title=\"Screenshot 2026-05-21 at 12.07.26\u202fAM\" data-image-description=\"\" data-image-caption=\"\" data-large-file=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2026\/05\/Screenshot-2026-05-21-at-12.07.26-AM-1-1024x593.png\" src=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2026\/05\/Screenshot-2026-05-21-at-12.07.26-AM-1.png\" alt=\"\" class=\"wp-image-80013\" \/><figcaption class=\"wp-element-caption\">https:\/\/arxiv.org\/pdf\/2605.18678<\/figcaption><\/figure>\n<\/div>\n<h2 class=\"wp-block-heading\"><strong>How the Architecture Works<\/strong><\/h2>\n<p class=\"wp-block-paragraph\">The architecture is based on two principles: <strong>unified context modeling<\/strong> and <strong>decoupled capability pathways<\/strong>.<\/p>\n<p class=\"wp-block-paragraph\">For unified context, Lance converts all inputs \u2014 text, images, and videos \u2014 into a single shared interleaved 
multimodal sequence. Text tokens come from the Qwen2.5-VL embedding layer. For understanding-oriented visual inputs, the Qwen2.5-VL ViT encoder produces compact semantic visual tokens. For generation-oriented visual inputs, the Wan2.2 3D causal VAE encoder encodes images and videos into continuous latent representations, applying 16\u00d7 spatial downsampling and 4\u00d7 temporal downsampling. All these heterogeneous token types \u2014 text, semantic visual, and latent visual \u2014 live in the same sequence. The model then runs generalized 3D causal attention over the full context, with text tokens using causal attention and visual tokens using bidirectional attention.<\/p>\n<p class=\"wp-block-paragraph\">For decoupled pathways, Lance uses a dual-stream mixture-of-experts architecture initialized from Qwen2.5-VL 3B. The understanding expert (LLM<sub>UND<\/sub>) handles text and semantic visual tokens, producing outputs for multimodal reasoning and text generation. The generation expert (LLM<sub>GEN<\/sub>) handles VAE latent tokens for visual synthesis and editing. Crucially, both experts operate over the same shared interleaved sequence \u2014 they share context but don\u2019t compete for the same parameters. The understanding expert is trained with a next-token prediction loss; the generation expert is trained with a flow matching objective in continuous latent space. The two losses are combined with configurable weights throughout training.<\/p>\n<h2 class=\"wp-block-heading\"><strong>Modality-Aware Rotary Positional Encoding (MaPE)<\/strong><\/h2>\n<p class=\"wp-block-paragraph\">Running ViT semantic tokens, clean VAE condition tokens, and noisy VAE target tokens through the same sequence creates a subtle problem. Standard 3D-RoPE encodes positions based on spatiotemporal layout alone \u2014 it has no way to tell these token groups apart. 
When multiple visual token groups occupy the same sequence, their positional boundaries become ambiguous, which can hurt cross-task alignment.<\/p>\n<p class=\"wp-block-paragraph\">Lance introduces <strong>Modality-Aware Rotary Positional Encoding (MaPE)<\/strong> to fix this. MaPE applies a fixed temporal offset to each modality group based on its index in the sequence. Spatial coordinates stay unchanged, so the intrinsic layout within images and videos is preserved. The temporal offset alone is enough to separate the token groups in the global positional space without disrupting temporal ordering within any individual video.<\/p>\n<p class=\"wp-block-paragraph\">Removing MaPE drops GenEval from 80.94 to 80.56, GEdit-Bench from 6.86 to 6.30, and VBench from 81.81 to 80.95, a consistent degradation across image generation, image editing, and video generation.<\/p>\n<h2 class=\"wp-block-heading\"><strong>Training: Four Stages, One Unified Framework<\/strong><\/h2>\n<p class=\"wp-block-paragraph\">Lance is trained through <strong>four sequential stages<\/strong>, each building on the last.<\/p>\n<p class=\"wp-block-paragraph\"><strong>Pre-Training (PT)<\/strong> lays the foundation using approximately 1B image-text and 140M video-text pairs, covering 1.5T training tokens. This stage establishes basic multimodal alignment and generation capability. The VAE and ViT encoders are frozen here; only the backbone and connectors are trained.<\/p>\n<p class=\"wp-block-paragraph\"><strong>Continual Training (CT)<\/strong> expands the task space by introducing interleaved multi-task data \u2014 editing samples, subject-driven generation samples, and multimodal understanding data \u2014 across approximately 300B tokens. 
A progressive data-mixture schedule gradually increases the proportion of harder tasks like editing as training proceeds.<\/p>\n<p class=\"wp-block-paragraph\"><strong>Supervised Fine-Tuning (SFT)<\/strong> tightens instruction following, editing accuracy, and identity consistency using curated high-quality data across 72B tokens.<\/p>\n<p class=\"wp-block-paragraph\"><strong>Reinforcement Learning (RL)<\/strong> uses Group Relative Policy Optimization (GRPO), with PaddleOCR serving as the reward model, to further sharpen text rendering accuracy and image-text alignment.<\/p>\n<p class=\"wp-block-paragraph\">Everything fits within a maximum training budget of 128 GPUs.<\/p>\n<h2 class=\"wp-block-heading\"><strong>Results<\/strong><\/h2>\n<p class=\"wp-block-paragraph\"><strong>Image Generation.<\/strong> On GenEval, Lance scores 0.90 overall, matching TUNA for the top spot among unified models. Subcategory scores include counting (0.84), colors (0.97), and spatial position (0.87). On DPG-Bench, Lance scores 84.67 overall, with particularly strong relation modeling \u2014 though TUNA (86.76) and TUNA-2 (86.54) lead that benchmark. To put the parameter efficiency in perspective: Janus-Pro-7B scores 0.80 on GenEval; Show-o2 (7B) scores 0.76. Lance matches the top unified model score at 3B activated parameters.<\/p>\n<p class=\"wp-block-paragraph\"><strong>Video Generation.<\/strong> On VBench, Lance achieves a Total Score of 85.11 (using LLM rewriting), the highest among unified models. The next-best unified model, TUNA, scores 84.06. Lance also outscores dedicated generation-only models including HunyuanVideo (83.43) and Wan2.1-T2V (83.69).<\/p>\n<p class=\"wp-block-paragraph\"><strong>Image Editing.<\/strong> On GEdit-Bench, Lance scores 7.30 Avg\/G_O, the highest among unified models. It leads in background change, material modification, motion change, portrait beautification, subject removal, subject replacement, and tone transfer. 
Text modification is flagged as a remaining weakness.<\/p>\n<p class=\"wp-block-paragraph\"><strong>Video Understanding.<\/strong> On MVBench, Lance achieves a 62.0 overall score, the highest among unified models. Show-o2 (7B), the next-best unified model, scores 55.7. Lance also outperforms several understanding-only models with more parameters \u2014 notable given that it is simultaneously trained for generation and editing.<\/p>\n<h2 class=\"wp-block-heading\"><strong>Marktechpost\u2019s Visual Explainer<\/strong><\/h2>\n<div>\n<p>  <!-- Header --><\/p>\n<div class=\"lg-header\">\n<div class=\"lg-header-badge\">How-To Guide<\/div>\n<h2>Getting Started with Lance by ByteDance<\/h2>\n<p>A step-by-step guide to installing and running Lance \u2014 a 3B native unified multimodal model for image &amp; video understanding, generation, and editing.<\/p>\n<\/div>\n<p>  <!-- Progress dots --><\/p>\n<div class=\"lg-progress\">\n    <button class=\"lg-step-dot active\"><\/button><br \/>\n    <button class=\"lg-step-dot\"><\/button><br \/>\n    <button class=\"lg-step-dot\"><\/button><br \/>\n    <button class=\"lg-step-dot\"><\/button><br \/>\n    <button class=\"lg-step-dot\"><\/button><br \/>\n    <button class=\"lg-step-dot\"><\/button><br \/>\n    <span class=\"lg-step-label\">Step 1 of 6<\/span>\n  <\/div>\n<p>  <!-- Slides --><\/p>\n<div class=\"lg-body\">\n<p>    <!-- Slide 1: Prerequisites --><\/p>\n<div class=\"lg-slide active\">\n<div class=\"lg-step-num\">Step 01 \u2014 Prerequisites<\/div>\n<h3>Check Your Environment First<\/h3>\n<p>Before cloning the repository, confirm your system meets the minimum software and hardware requirements. 
Lance requires CUDA-capable hardware with significant VRAM.<\/p>\n<div class=\"lg-req-grid\">\n<div class=\"lg-req-card\">\n<div class=\"req-icon\"><img decoding=\"async\" src=\"https:\/\/s.w.org\/images\/core\/emoji\/17.0.2\/72x72\/1f40d.png\" alt=\"\ud83d\udc0d\" class=\"wp-smiley\" \/><\/div>\n<div class=\"req-label\">Python<\/div>\n<div class=\"req-val\">3.10 or higher<\/div>\n<div class=\"req-note\">Required<\/div>\n<\/div>\n<div class=\"lg-req-card\">\n<div class=\"req-icon\"><img decoding=\"async\" src=\"https:\/\/s.w.org\/images\/core\/emoji\/17.0.2\/72x72\/26a1.png\" alt=\"\u26a1\" class=\"wp-smiley\" \/><\/div>\n<div class=\"req-label\">CUDA<\/div>\n<div class=\"req-val\">12.4 or higher<\/div>\n<div class=\"req-note\">Required<\/div>\n<\/div>\n<div class=\"lg-req-card\">\n<div class=\"req-icon\"><img decoding=\"async\" src=\"https:\/\/s.w.org\/images\/core\/emoji\/17.0.2\/72x72\/1f5a5.png\" alt=\"\ud83d\udda5\" class=\"wp-smiley\" \/><\/div>\n<div class=\"req-label\">GPU VRAM<\/div>\n<div class=\"req-val\">40 GB minimum<\/div>\n<div class=\"req-note\">For inference<\/div>\n<\/div>\n<div class=\"lg-req-card\">\n<div class=\"req-icon\"><img decoding=\"async\" src=\"https:\/\/s.w.org\/images\/core\/emoji\/17.0.2\/72x72\/1f4e6.png\" alt=\"\ud83d\udce6\" class=\"wp-smiley\" \/><\/div>\n<div class=\"req-label\">License<\/div>\n<div class=\"req-val\">Apache 2.0<\/div>\n<div class=\"req-note\">Open-source<\/div>\n<\/div>\n<\/div>\n<div class=\"lg-note\">\n<p><strong>Note:<\/strong> A GPU with at least 40 GB VRAM is required for running inference. CUDA 12.4+ is mandatory \u2014 lower versions are not officially supported.<\/p>\n<\/div>\n<\/div>\n<p>    <!-- Slide 2: Clone --><\/p>\n<div class=\"lg-slide\">\n<div class=\"lg-step-num\">Step 02 \u2014 Clone the Repository<\/div>\n<h3>Clone from GitHub<\/h3>\n<p>Clone the official Lance repository from ByteDance on GitHub. 
The repository includes the inference scripts, Gradio interface, benchmark scripts, and model configuration files.<\/p>\n<pre><code>git clone https:\/\/github.com\/bytedance\/Lance\ncd Lance<\/code><\/pre>\n<div class=\"lg-divider\"><\/div>\n<p>The repository structure you will see after cloning:<\/p>\n<div class=\"lg-task-grid\">\n<div class=\"lg-task-card\">\n<div class=\"task-name\">inference_lance.py<\/div>\n<div class=\"task-desc\">Main inference script for all tasks<\/div>\n<\/div>\n<div class=\"lg-task-card\">\n<div class=\"task-name\">inference_lance.sh<\/div>\n<div class=\"task-desc\">Shell wrapper with configurable parameters<\/div>\n<\/div>\n<div class=\"lg-task-card\">\n<div class=\"task-name\">lance_gradio_t2v_v2t.py<\/div>\n<div class=\"task-desc\">Gradio UI for T2V and V2T tasks<\/div>\n<\/div>\n<div class=\"lg-task-card\">\n<div class=\"task-name\">config\/examples\/<\/div>\n<div class=\"task-desc\">JSON example configs per task type<\/div>\n<\/div>\n<\/div>\n<\/div>\n<p>    <!-- Slide 3: Install deps --><\/p>\n<div class=\"lg-slide\">\n<div class=\"lg-step-num\">Step 03 \u2014 Install Dependencies<\/div>\n<h3>Install Required Packages<\/h3>\n<p>Install all Python dependencies from the provided <span class=\"inline-code\">requirements.txt<\/span> file. 
It is strongly recommended to use a dedicated virtual environment or conda environment before installing.<\/p>\n<pre><code># Create and activate a conda environment (recommended)\nconda create -n lance-env python=3.10 -y\nconda activate lance-env\n\n# Install all dependencies\npip install -r requirements.txt<\/code><\/pre>\n<div class=\"lg-note\">\n<p><strong>Tip:<\/strong> Using a clean conda environment prevents dependency conflicts with other projects on the same machine.<\/p>\n<\/div>\n<\/div>\n<p>    <!-- Slide 4: Download weights --><\/p>\n<div class=\"lg-slide\">\n<div class=\"lg-step-num\">Step 04 \u2014 Download Model Weights<\/div>\n<h3>Download Lance-3B Checkpoints<\/h3>\n<p>Download all necessary model checkpoints from the official Hugging Face repository at <strong>bytedance-research\/Lance<\/strong>. After downloading, place all files in the <span class=\"inline-code\">downloads\/<\/span> directory inside your cloned repo.<\/p>\n<pre><code># Install the Hugging Face CLI if not already installed\npip install huggingface_hub\n\n# Download the model weights\nhuggingface-cli download bytedance-research\/Lance \\\n  --local-dir downloads\/<\/code><\/pre>\n<div class=\"lg-divider\"><\/div>\n<p>Your directory should look like this after downloading:<\/p>\n<pre><code>Lance\/\n\u2514\u2500\u2500 downloads\/\n    \u2514\u2500\u2500 Lance_3B_Video\/     \u25c4 model weights go here<\/code><\/pre>\n<div class=\"lg-note\">\n<p><strong>Note:<\/strong> Model weights are large files. Ensure you have sufficient disk space and a stable connection before downloading.<\/p>\n<\/div>\n<\/div>\n<p>    <!-- Slide 5: Run inference --><\/p>\n<div class=\"lg-slide\">\n<div class=\"lg-step-num\">Step 05 \u2014 Run Inference<\/div>\n<h3>Run Tasks via the CLI<\/h3>\n<p>Lance provides a unified command-line interface for all tasks via <span class=\"inline-code\">inference_lance.sh<\/span>. Configure parameters at the top of the shell script before running. 
Supported tasks are listed below.<\/p>\n<div class=\"lg-task-grid\">\n<div class=\"lg-task-card\">\n<div class=\"task-name\">t2i<\/div>\n<div class=\"task-desc\">Text-to-image generation<\/div>\n<\/div>\n<div class=\"lg-task-card\">\n<div class=\"task-name\">t2v<\/div>\n<div class=\"task-desc\">Text-to-video generation<\/div>\n<\/div>\n<div class=\"lg-task-card\">\n<div class=\"task-name\">image_edit<\/div>\n<div class=\"task-desc\">Image editing from instruction<\/div>\n<\/div>\n<div class=\"lg-task-card\">\n<div class=\"task-name\">video_edit<\/div>\n<div class=\"task-desc\">Video editing from instruction<\/div>\n<\/div>\n<div class=\"lg-task-card\">\n<div class=\"task-name\">x2t_image<\/div>\n<div class=\"task-desc\">Image understanding \/ VQA<\/div>\n<\/div>\n<div class=\"lg-task-card\">\n<div class=\"task-name\">x2t_video<\/div>\n<div class=\"task-desc\">Video understanding \/ captioning<\/div>\n<\/div>\n<\/div>\n<p>Example command for text-to-video generation at 480p:<\/p>\n<pre><code>bash inference_lance.sh \\\n  --TASK_NAME t2v \\\n  --MODEL_PATH downloads\/Lance_3B_Video \\\n  --RESOLUTION video_480p \\\n  --NUM_FRAMES 121 \\\n  --VIDEO_HEIGHT 480 \\\n  --VIDEO_WIDTH 848 \\\n  --SAVE_PATH_GEN results\/t2v<\/code><\/pre>\n<\/div>\n<p>    <!-- Slide 6: Gradio + Tips --><\/p>\n<div class=\"lg-slide\">\n<div class=\"lg-step-num\">Step 06 \u2014 Gradio UI &amp; Tips<\/div>\n<h3>Launch the Gradio Interface (Optional)<\/h3>\n<p>For a visual interface covering text-to-video and video-to-text tasks, Lance includes a ready-to-run Gradio app.<\/p>\n<pre><code>python lance_gradio_t2v_v2t.py<\/code><\/pre>\n<div class=\"lg-divider\"><\/div>\n<p><strong>Prompt Tips<\/strong><\/p>\n<p>For all tasks, follow the prompt format used in the provided example configs under <span class=\"inline-code\">config\/examples\/<\/span>. 
Using the recommended format typically leads to better generation quality.<\/p>\n<div class=\"lg-task-grid\">\n<div class=\"lg-task-card\">\n<div class=\"task-name\">x2t_image_example.json<\/div>\n<div class=\"task-desc\">Examples for image understanding and VQA<\/div>\n<\/div>\n<div class=\"lg-task-card\">\n<div class=\"task-name\">x2t_video_example.json<\/div>\n<div class=\"task-desc\">Examples for video understanding and captioning<\/div>\n<\/div>\n<\/div>\n<div class=\"lg-note\">\n<p><strong>Customize:<\/strong> You can modify <span class=\"inline-code\">TASK_DEFAULT_CONFIGS<\/span> in <span class=\"inline-code\">inference_lance.py<\/span> to set your own default data samples for each task type.<\/p>\n<\/div>\n<\/div>\n<\/div>\n<p><!-- \/lg-body --><\/p>\n<p>  <!-- Footer --><\/p>\n<div class=\"lg-footer\">\n    <button class=\"lg-nav-btn prev\" disabled>\u2190 Prev<\/button>\n<div class=\"lg-links\">\n      <a href=\"https:\/\/huggingface.co\/bytedance-research\/Lance\" target=\"_blank\">HuggingFace<\/a><br \/>\n      <a href=\"https:\/\/github.com\/bytedance\/Lance\" target=\"_blank\">GitHub<\/a><br \/>\n      <a href=\"https:\/\/lance-project.github.io\/\" target=\"_blank\">Project Page<\/a>\n    <\/div>\n<p>    <button class=\"lg-nav-btn\">Next \u2192<\/button>\n  <\/p><\/div>\n<\/div>\n<h2 class=\"wp-block-heading\"><strong>Key Takeaways<\/strong><\/h2>\n<ol class=\"wp-block-list\">\n<li><strong>Lance is a 3B activated parameter native unified multimodal model<\/strong> that handles image and video understanding, generation, and editing within a single jointly trained framework.<\/li>\n<li><strong>A dual-stream mixture-of-experts architecture with Modality-Aware Rotary Positional Encoding (MaPE)<\/strong> decouples understanding and generation pathways while keeping them in shared interleaved multimodal context.<\/li>\n<li><strong>Lance achieves 0.90 on GenEval and 85.11 on VBench<\/strong>, the highest Total Score among unified models, trained within a 
maximum budget of 128 GPUs.<\/li>\n<li><strong>On MVBench, Lance scores 62.0<\/strong>, the highest among unified models \u2014 outperforming Show-o2 (7B) at 55.7, while also supporting generation and editing.<\/li>\n<li><strong>Lance is open-source under Apache 2.0<\/strong>, with weights available on Hugging Face.<\/li>\n<\/ol>\n<p class=\"wp-block-paragraph\">\n<hr class=\"wp-block-separator has-alpha-channel-opacity\" \/>\n<\/p><p class=\"wp-block-paragraph\">\n<\/p><p class=\"wp-block-paragraph\">Check out\u00a0the\u00a0<strong><a href=\"https:\/\/arxiv.org\/pdf\/2605.18678\" target=\"_blank\" rel=\"noreferrer noopener\">Paper<\/a>, <a href=\"https:\/\/huggingface.co\/bytedance-research\/Lance\" target=\"_blank\" rel=\"noreferrer noopener\">Model Weights<\/a> and <a href=\"https:\/\/lance-project.github.io\/\" target=\"_blank\" rel=\"noreferrer noopener\">Project Page<\/a>.<\/strong><\/p>\n<p>The post <a href=\"https:\/\/www.marktechpost.com\/2026\/05\/21\/one-model-three-modalities-bytedance-releases-lance-for-image-and-video-understanding-generation-and-editing\/\">One Model, Three Modalities: ByteDance Releases Lance for Image and Video Understanding, Generation, and Editing<\/a> appeared first on <a href=\"https:\/\/www.marktechpost.com\/\">MarkTechPost<\/a>.<\/p>","protected":false},"excerpt":{"rendered":"<p>Building a single model that c&hellip;<\/p>\n","protected":false},"author":1,"featured_media":943,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[1],"tags":[],"class_list":["post-942","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-uncategorized"],"_links":{"self":[{"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=\/wp\/v2\/posts\/942","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=942"}],"version-history":[{"count":0,"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=\/wp\/v2\/posts\/942\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=\/wp\/v2\/media\/943"}],"wp:attachment":[{"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=942"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=942"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=942"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}