{"id":665,"date":"2026-04-04T17:03:26","date_gmt":"2026-04-04T09:03:26","guid":{"rendered":"https:\/\/connectword.dpdns.org\/?p=665"},"modified":"2026-04-04T17:03:26","modified_gmt":"2026-04-04T09:03:26","slug":"netflix-ai-team-just-open-sourced-void-an-ai-model-that-erases-objects-from-videos-physics-and-all","status":"publish","type":"post","link":"https:\/\/connectword.dpdns.org\/?p=665","title":{"rendered":"Netflix AI Team Just Open-Sourced VOID: an AI Model That Erases Objects From Videos \u2014 Physics and All"},"content":{"rendered":"<p>Video editing has always had a dirty secret: removing an object from footage is easy; making the scene look like it was never there is brutally hard. Take out a person holding a guitar, and you\u2019re left with a floating instrument that defies gravity. Hollywood VFX teams spend weeks fixing exactly this kind of problem. A team of researchers from Netflix and INSAIT, Sofia University \u2018St. Kliment Ohridski,\u2019 released <strong>VOID<\/strong> (<strong>Video Object and Interaction Deletion<\/strong>) model that can do it automatically.<\/p>\n<p>VOID removes objects from videos along with all interactions they induce on the scene \u2014 not just secondary effects like shadows and reflections, but physical interactions like objects falling when a person is removed.<\/p>\n<h3 class=\"wp-block-heading\"><strong>What Problem Is VOID Actually Solving?<\/strong><\/h3>\n<p>Standard video inpainting models \u2014 the kind used in most editing workflows today \u2014 are trained to fill in the pixel region where an object was. They\u2019re essentially very sophisticated background painters. What they don\u2019t do is reason about <em>causality<\/em>: if I remove an actor who is holding a prop, what should happen to that prop?<\/p>\n<p>Existing video object removal methods excel at inpainting content \u2018behind\u2019 the object and correcting appearance-level artifacts such as shadows and reflections. 
However, when the removed object has more significant interactions, such as collisions with other objects, current models fail to correct them and produce implausible results.<\/p>\n<p>VOID is built on top of CogVideoX and fine-tuned for video inpainting with interaction-aware mask conditioning. The key innovation is in how the model understands the scene \u2014 not just \u2018what pixels should I fill?\u2019 but \u2018what is physically plausible after this object disappears?\u2019<\/p>\n<p>The canonical example from the research paper: if a person holding a guitar is removed, VOID also removes the person\u2019s effect on the guitar \u2014 causing it to fall naturally. That\u2019s not trivial. The model has to understand that the guitar was being <em>supported<\/em> by the person, and that removing the person means gravity takes over.<\/p>\n<p>VOID was also evaluated head-to-head against strong baselines. Experiments on both synthetic and real data show that the approach better preserves consistent scene dynamics after object removal than prior video object removal methods, including ProPainter, DiffuEraser, Runway, MiniMax-Remover, ROSE, and Gen-Omnimatte.<\/p>\n<div class=\"wp-block-image\">\n<figure class=\"aligncenter size-full is-resized\"><img loading=\"lazy\" decoding=\"async\" width=\"1644\" height=\"912\" data-attachment-id=\"78795\" data-permalink=\"https:\/\/www.marktechpost.com\/2026\/04\/04\/netflix-ai-team-just-open-sourced-void-an-ai-model-that-erases-objects-from-videos-physics-and-all\/screenshot-2026-04-04-at-2-02-50-am-2\/\" data-orig-file=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2026\/04\/Screenshot-2026-04-04-at-2.02.50-AM-1.png\" data-orig-size=\"1644,912\" data-comments-opened=\"1\" 
data-image-meta='{\"aperture\":\"0\",\"credit\":\"\",\"camera\":\"\",\"caption\":\"\",\"created_timestamp\":\"0\",\"copyright\":\"\",\"focal_length\":\"0\",\"iso\":\"0\",\"shutter_speed\":\"0\",\"title\":\"\",\"orientation\":\"0\"}' data-image-title=\"Screenshot 2026-04-04 at 2.02.50\u202fAM\" data-image-description=\"\" data-image-caption=\"\" data-medium-file=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2026\/04\/Screenshot-2026-04-04-at-2.02.50-AM-1-300x166.png\" data-large-file=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2026\/04\/Screenshot-2026-04-04-at-2.02.50-AM-1-1024x568.png\" src=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2026\/04\/Screenshot-2026-04-04-at-2.02.50-AM-1.png\" alt=\"\" class=\"wp-image-78795\" \/><figcaption class=\"wp-element-caption\">https:\/\/arxiv.org\/pdf\/2604.02296<\/figcaption><\/figure>\n<\/div>\n<h3 class=\"wp-block-heading\"><strong>The Architecture: CogVideoX Under the Hood<\/strong><\/h3>\n<p>VOID is built on <a href=\"https:\/\/huggingface.co\/alibaba-pai\/CogVideoX-Fun-V1.5-5b-InP\" target=\"_blank\" rel=\"noreferrer noopener\">CogVideoX-Fun-V1.5-5b-InP<\/a> \u2014 a model from Alibaba PAI \u2014 and fine-tuned for video inpainting with interaction-aware <strong>quadmask<\/strong> conditioning. CogVideoX is a 3D Transformer-based video generation model. Think of it like a video version of Stable Diffusion \u2014 a diffusion model that operates over temporal sequences of frames rather than single images. 
The specific base model (<code>CogVideoX-Fun-V1.5-5b-InP<\/code>) is released by Alibaba PAI on Hugging Face; it is the checkpoint engineers need to download separately before running VOID.<\/p>\n<p>The fine-tuned architecture specs:<\/p>\n<ul class=\"wp-block-list\">\n<li>CogVideoX 3D Transformer with 5B parameters<\/li>\n<li>Inputs: the source video, the quadmask, and a text prompt describing the scene after removal<\/li>\n<li>Default resolution of 384\u00d7672, with a maximum of 197 frames<\/li>\n<li>DDIM scheduler, running in BF16 with FP8 quantization for memory efficiency<\/li>\n<\/ul>\n<p>The <strong>quadmask<\/strong> is arguably the most interesting technical contribution here. Rather than a binary mask (remove this pixel \/ keep this pixel), the quadmask is a 4-value mask that encodes the primary object to remove, overlap regions, affected regions (falling objects, displaced items), and background to keep.<\/p>\n<p>In practice, each pixel in the mask gets one of four values: <code>0<\/code> (primary object being removed), <code>63<\/code> (overlap between primary and affected regions), <code>127<\/code> (interaction-affected region \u2014 things that will move or change as a result of the removal), and <code>255<\/code> (background, keep as-is). This gives the model a structured semantic map of <em>what\u2019s happening in the scene<\/em>, not just <em>where the object is<\/em>.<\/p>\n<h3 class=\"wp-block-heading\"><strong>Two-Pass Inference Pipeline<\/strong><\/h3>\n<p>VOID uses two transformer checkpoints, trained sequentially. You can run inference with Pass 1 alone or chain both passes for higher temporal consistency.<\/p>\n<p>Pass 1 (<code>void_pass1.safetensors<\/code>) is the base inpainting model and is sufficient for most videos. Pass 2 serves a specific purpose: correcting a known failure mode. 
If the model detects object morphing \u2014 a known failure mode of smaller video diffusion models \u2014 an optional second pass re-runs inference using flow-warped noise derived from the first pass, stabilizing object shape along the newly synthesized trajectories.<\/p>\n<p>It\u2019s worth understanding the distinction: Pass 2 isn\u2019t a general quality booster; it\u2019s specifically a <em>shape-stability fix<\/em>. When the diffusion model produces objects that gradually warp or deform across frames (a well-documented artifact in video diffusion), Pass 2 uses optical flow to warp the latents from Pass 1 and feeds them as initialization into a second diffusion run, anchoring the shape of synthesized objects frame-to-frame.<\/p>\n<h3 class=\"wp-block-heading\"><strong>How the Training Data Was Generated<\/strong><\/h3>\n<p>This is where things get genuinely interesting. Training a model to understand physical interactions requires paired videos \u2014 the same scene, with and without the object, where the physics plays out correctly in both. Real-world paired data at this scale doesn\u2019t exist. So the team built it synthetically.<\/p>\n<p>Training used paired counterfactual videos generated from two sources: HUMOTO (human-object interactions rendered in Blender with physics simulation) and Kubric (object-only interactions using Google Scanned Objects).<\/p>\n<p>HUMOTO uses motion-capture data of human-object interactions. The key mechanic is a Blender re-simulation: the scene is set up with a human and objects, rendered once with the human present; then the human is removed from the simulation and physics is re-run forward from that point. The result is a physically correct counterfactual \u2014 objects that were being held or supported now fall, exactly as they should. Kubric, developed by Google Research, applies the same idea to object-object collisions. 
Together, they produce a dataset of paired videos where the physics is provably correct, not approximated by a human annotator.<\/p>\n<h3 class=\"wp-block-heading\"><strong>Key Takeaways<\/strong><\/h3>\n<ul class=\"wp-block-list\">\n<li><strong>VOID goes beyond pixel-filling.<\/strong> Unlike existing video inpainting tools that only correct visual artifacts like shadows and reflections, VOID understands physical causality \u2014 if you remove a person holding an object, the object falls naturally in the output video.<\/li>\n<li><strong>The quadmask is the core innovation.<\/strong> Instead of a simple binary remove\/keep mask, VOID uses a 4-value quadmask (values 0, 63, 127, 255) that encodes not just what to remove, but which surrounding regions of the scene will be <em>physically affected<\/em> \u2014 giving the diffusion model structured scene understanding to work with.<\/li>\n<li><strong>Two-pass inference solves a real failure mode.<\/strong> Pass 1 handles most videos; Pass 2 exists specifically to fix object morphing artifacts \u2014 a known weakness of video diffusion models \u2014 by using optical flow-warped latents from Pass 1 as initialization for a second diffusion run.<\/li>\n<li><strong>Synthetic paired data made training possible.<\/strong> Since real-world paired counterfactual video data doesn\u2019t exist at scale, the research team built it using Blender physics re-simulation (HUMOTO) and Google\u2019s Kubric framework, generating ground-truth before\/after video pairs where the physics is provably correct.<\/li>\n<\/ul>\n<hr class=\"wp-block-separator has-alpha-channel-opacity\" \/>\n<p>Check out the <strong><a href=\"https:\/\/arxiv.org\/pdf\/2604.02296\" target=\"_blank\" rel=\"noreferrer noopener\">Paper<\/a>, <a href=\"https:\/\/huggingface.co\/netflix\/void-model\" target=\"_blank\" rel=\"noreferrer noopener\">Model Weights<\/a><\/strong> and <strong><a href=\"https:\/\/github.com\/netflix\/void-model?tab=readme-ov-file\" target=\"_blank\" rel=\"noreferrer noopener\">Repo<\/a><\/strong>.<\/p>\n<p>The post <a href=\"https:\/\/www.marktechpost.com\/2026\/04\/04\/netflix-ai-team-just-open-sourced-void-an-ai-model-that-erases-objects-from-videos-physics-and-all\/\">Netflix AI Team Just Open-Sourced VOID: an AI Model That Erases Objects From Videos \u2014 Physics and All<\/a> appeared first on <a href=\"https:\/\/www.marktechpost.com\/\">MarkTechPost<\/a>.<\/p>","protected":false},"excerpt":{"rendered":"<p>Video editing has always had 
a&hellip;<\/p>\n","protected":false},"author":1,"featured_media":666,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[1],"tags":[],"class_list":["post-665","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-uncategorized"],"_links":{"self":[{"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=\/wp\/v2\/posts\/665","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=665"}],"version-history":[{"count":0,"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=\/wp\/v2\/posts\/665\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=\/wp\/v2\/media\/666"}],"wp:attachment":[{"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=665"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=665"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=665"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}
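The 4-value quadmask described in the article (0 = primary object, 63 = overlap, 127 = interaction-affected region, 255 = background) can be sketched as a small NumPy routine. This is an illustrative reconstruction, not VOID's actual code: the function name, array shapes, and input masks are assumptions; only the four mask values come from the article.

```python
import numpy as np

def build_quadmask(primary: np.ndarray, affected: np.ndarray) -> np.ndarray:
    """Combine two boolean masks into a single uint8 quadmask.

    `primary` marks the object being removed; `affected` marks regions
    that will physically change as a result (e.g. an object that falls).
    Both are boolean arrays of shape (T, H, W). Values per the article:
      0   = primary object to remove
      63  = overlap of primary object and affected region
      127 = interaction-affected region
      255 = background to keep
    """
    quad = np.full(primary.shape, 255, dtype=np.uint8)  # start as background
    quad[affected] = 127                                # affected region
    quad[primary] = 0                                   # primary object
    quad[primary & affected] = 63                       # overlap wins
    return quad

# Tiny example: one frame, 2x3 pixels; masks are made up for illustration.
primary = np.array([[[True, True, False],
                     [False, False, False]]])
affected = np.array([[[False, True, True],
                      [False, False, False]]])
quad = build_quadmask(primary, affected)
# quad[0] → [[0, 63, 127], [255, 255, 255]]
```

The overlap value is written last so that pixels belonging to both masks end up as 63 rather than 0, matching the article's description of a distinct overlap code.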