{"id":361,"date":"2026-02-05T12:10:06","date_gmt":"2026-02-05T04:10:06","guid":{"rendered":"https:\/\/connectword.dpdns.org\/?p=361"},"modified":"2026-02-05T12:10:06","modified_gmt":"2026-02-05T04:10:06","slug":"nvidia-ai-release-vibetensor-an-ai-generated-deep-learning-runtime-built-end-to-end-by-coding-agents-programmatically","status":"publish","type":"post","link":"https:\/\/connectword.dpdns.org\/?p=361","title":{"rendered":"NVIDIA AI Release VibeTensor: An AI Generated Deep Learning Runtime Built End to End by Coding Agents Programmatically"},"content":{"rendered":"<p>NVIDIA has released VIBETENSOR, an open-source research system software stack for deep learning. VIBETENSOR was generated by LLM-powered coding agents under high-level human guidance.<\/p>\n<p>The project asks a concrete question: can coding agents generate a coherent deep learning runtime, spanning Python and JavaScript APIs down to C++ runtime components and CUDA memory management, and validate it only through tools?<\/p>\n<h3 class=\"wp-block-heading\"><strong>Architecture from frontends to CUDA runtime<\/strong><\/h3>\n<p>VIBETENSOR implements a PyTorch-style eager tensor library with a C++20 core for CPU and CUDA, a torch-like Python overlay via nanobind, and an experimental Node.js \/ TypeScript interface. 
It targets Linux x86_64 and NVIDIA GPUs via CUDA, and builds without CUDA are intentionally disabled.<\/p>\n<div class=\"wp-block-image\">\n<figure class=\"aligncenter size-full is-resized\"><img loading=\"lazy\" decoding=\"async\" width=\"1918\" height=\"1240\" data-attachment-id=\"77742\" data-permalink=\"https:\/\/www.marktechpost.com\/2026\/02\/04\/nvidia-ai-release-vibetensor-an-ai-generated-deep-learning-runtime-built-end-to-end-by-coding-agents-programmatically\/screenshot-2026-02-04-at-7-53-09-pm-2\/\" data-orig-file=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2026\/02\/Screenshot-2026-02-04-at-7.53.09-PM-1.png\" data-orig-size=\"1918,1240\" data-comments-opened=\"1\" data-image-meta='{\"aperture\":\"0\",\"credit\":\"\",\"camera\":\"\",\"caption\":\"\",\"created_timestamp\":\"0\",\"copyright\":\"\",\"focal_length\":\"0\",\"iso\":\"0\",\"shutter_speed\":\"0\",\"title\":\"\",\"orientation\":\"0\"}' data-image-title=\"Screenshot 2026-02-04 at 7.53.09\u202fPM\" data-image-description=\"\" data-image-caption=\"\" data-medium-file=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2026\/02\/Screenshot-2026-02-04-at-7.53.09-PM-1-300x194.png\" data-large-file=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2026\/02\/Screenshot-2026-02-04-at-7.53.09-PM-1-1024x662.png\" src=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2026\/02\/Screenshot-2026-02-04-at-7.53.09-PM-1.png\" alt=\"\" class=\"wp-image-77742\" \/><figcaption class=\"wp-element-caption\">https:\/\/arxiv.org\/pdf\/2601.16238<\/figcaption><\/figure>\n<\/div>\n<p>The core stack includes its own tensor and storage system, a schema-lite dispatcher, a reverse-mode autograd engine, a CUDA subsystem with streams, events, and CUDA graphs, a stream-ordered caching allocator with diagnostics, and a stable C ABI for dynamically loaded operator plugins. 
Frontends in Python and Node.js share a C++ dispatcher, tensor implementation, autograd engine, and CUDA runtime.<\/p>\n<p>The Python overlay exposes a <code>vibetensor.torch<\/code> namespace with tensor factories, operator dispatch, and CUDA utilities. The Node.js frontend is built on Node-API and focuses on async execution, using worker scheduling with bounds on concurrent inflight work as described in the implementation sections.<\/p>\n<p>At the runtime level, <code>TensorImpl<\/code> represents a view over reference-counted <code>Storage<\/code>, with sizes, strides, storage offsets, dtype, device metadata, and a shared version counter. This supports non-contiguous views and aliasing. A <code>TensorIterator<\/code> subsystem computes iteration shapes and per-operand strides for elementwise and reduction operators, and the same logic is exposed through the plugin ABI so external kernels follow the same aliasing and iteration rules.<\/p>\n<p>The dispatcher is schema-lite. It maps operator names to implementations across CPU and CUDA dispatch keys and allows wrapper layers for autograd and Python overrides. Device policies enforce invariants such as \u201call tensor inputs on the same device,\u201d while leaving room for specialized multi-device policies.<\/p>\n<h3 class=\"wp-block-heading\"><strong>Autograd, CUDA subsystem, and multi-GPU Fabric<\/strong><\/h3>\n<p>Reverse-mode autograd uses Node and Edge graph objects and per-tensor <code>AutogradMeta<\/code>. During backward, the engine maintains dependency counts, per-input gradient buffers, and a ready queue. For CUDA tensors, it records and waits on CUDA events to synchronize cross-stream gradient flows. 
The system also contains an experimental multi-device autograd mode for research on cross-device execution.<\/p>\n<div class=\"wp-block-image\">\n<figure class=\"aligncenter size-full is-resized\"><img loading=\"lazy\" decoding=\"async\" width=\"1926\" height=\"1216\" data-attachment-id=\"77744\" data-permalink=\"https:\/\/www.marktechpost.com\/2026\/02\/04\/nvidia-ai-release-vibetensor-an-ai-generated-deep-learning-runtime-built-end-to-end-by-coding-agents-programmatically\/screenshot-2026-02-04-at-7-54-17-pm-2\/\" data-orig-file=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2026\/02\/Screenshot-2026-02-04-at-7.54.17-PM-1.png\" data-orig-size=\"1926,1216\" data-comments-opened=\"1\" data-image-meta='{\"aperture\":\"0\",\"credit\":\"\",\"camera\":\"\",\"caption\":\"\",\"created_timestamp\":\"0\",\"copyright\":\"\",\"focal_length\":\"0\",\"iso\":\"0\",\"shutter_speed\":\"0\",\"title\":\"\",\"orientation\":\"0\"}' data-image-title=\"Screenshot 2026-02-04 at 7.54.17\u202fPM\" data-image-description=\"\" data-image-caption=\"\" data-medium-file=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2026\/02\/Screenshot-2026-02-04-at-7.54.17-PM-1-300x189.png\" data-large-file=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2026\/02\/Screenshot-2026-02-04-at-7.54.17-PM-1-1024x647.png\" src=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2026\/02\/Screenshot-2026-02-04-at-7.54.17-PM-1.png\" alt=\"\" class=\"wp-image-77744\" \/><figcaption class=\"wp-element-caption\">https:\/\/arxiv.org\/pdf\/2601.16238<\/figcaption><\/figure>\n<\/div>\n<p>The CUDA subsystem provides C++ wrappers for CUDA streams and events, a caching allocator with stream-ordered semantics, and CUDA graph capture and replay. The allocator includes diagnostics such as snapshots, statistics, memory-fraction caps, and GC ladders to make memory behavior observable in tests and debugging. 
CUDA graphs integrate with allocator \u201cgraph pools\u201d to manage memory lifetime across capture and replay.<\/p>\n<p>The Fabric subsystem is an experimental multi-GPU layer. It exposes explicit peer-to-peer GPU access via CUDA P2P and unified virtual addressing when the topology supports it. Fabric focuses on single-process multi-GPU execution and provides observability primitives such as statistics and event snapshots rather than a full distributed training stack.<\/p>\n<p>As a reference extension, VIBETENSOR ships a best-effort CUTLASS-based ring allreduce plugin for NVIDIA Blackwell-class GPUs. This plugin binds experimental ring-allreduce kernels, does not call NCCL, and is positioned as an illustrative example, not as an NCCL replacement. Multi-GPU results in the paper rely on Fabric plus this optional plugin, and they are reported only for Blackwell GPUs.<\/p>\n<h3 class=\"wp-block-heading\"><strong>Interoperability and extension points<\/strong><\/h3>\n<p>VIBETENSOR supports DLPack import and export for CPU and CUDA tensors and provides a C++20 Safetensors loader and saver for serialization. Extensibility mechanisms include Python-level overrides inspired by <code>torch.library<\/code>, a versioned C plugin ABI, and hooks for custom GPU kernels authored in Triton and CUDA template libraries such as CUTLASS. The plugin ABI exposes DLPack-based dtype and device metadata and <code>TensorIterator<\/code> helpers so external kernels integrate with the same iteration and aliasing rules as built-in operators.<\/p>\n<h3 class=\"wp-block-heading\"><strong>AI-assisted development<\/strong><\/h3>\n<p>VIBETENSOR was built using LLM-powered coding agents as the main code authors, guided only by high-level human specifications. Over roughly 2 months, humans defined targets and constraints, then agents proposed code diffs and executed builds and tests to validate them. 
The work does not introduce a new agent framework; it treats agents as black-box tools that modify the codebase under tool-based checks. Validation relies on C++ tests (CTest), Python tests via pytest, and differential checks against reference implementations such as PyTorch for selected operators. The research team also includes longer training regressions and allocator and CUDA diagnostics to catch stateful bugs and performance pathologies that do not show up in unit tests.<\/p>\n<h3 class=\"wp-block-heading\"><strong>Key Takeaways<\/strong><\/h3>\n<ul class=\"wp-block-list\">\n<li><strong>AI-generated, CUDA-first deep learning stack<\/strong>: VIBETENSOR is an Apache 2.0-licensed, open-source PyTorch-style eager runtime whose implementation changes were generated by LLM coding agents, targeting Linux x86_64 with NVIDIA GPUs and CUDA as a hard requirement.<\/li>\n<li><strong>Full runtime architecture, not just kernels<\/strong>: The system includes a C++20 tensor core (TensorImpl\/Storage\/TensorIterator), a schema-lite dispatcher, reverse-mode autograd, a CUDA subsystem (streams, events, and CUDA graphs), a stream-ordered caching allocator, and a versioned C plugin ABI, exposed through Python (<code>vibetensor.torch<\/code>) and experimental Node.js frontends.<\/li>\n<li><strong>Tool-driven, agent-centric development workflow<\/strong>: Over ~2 months, humans specified high-level goals, while agents proposed diffs and validated them via CTest, pytest, differential checks against PyTorch, allocator diagnostics, and long-horizon training regressions, without per-diff manual code review.<\/li>\n<li><strong>Strong microkernel speedups, slower end-to-end training<\/strong>: AI-generated kernels in Triton\/CuTeDSL achieve up to ~5\u20136\u00d7 speedups over PyTorch baselines in isolated benchmarks, but complete training workloads (Transformer toy tasks, CIFAR-10 ViT, miniGPT-style LM) run 1.7\u00d7 to 6.2\u00d7 slower than PyTorch, emphasizing the gap between kernel and system-level 
performance.<\/li>\n<\/ul>\n<hr class=\"wp-block-separator has-alpha-channel-opacity\" \/>\n<p>Check out the\u00a0<strong><a href=\"https:\/\/arxiv.org\/pdf\/2601.16238\" target=\"_blank\" rel=\"noreferrer noopener\">Paper<\/a> and <a href=\"https:\/\/github.com\/NVLabs\/vibetensor\" target=\"_blank\" rel=\"noreferrer noopener\">Repo here<\/a><\/strong>.<\/p>\n<p>The post <a href=\"https:\/\/www.marktechpost.com\/2026\/02\/04\/nvidia-ai-release-vibetensor-an-ai-generated-deep-learning-runtime-built-end-to-end-by-coding-agents-programmatically\/\">NVIDIA AI Release VibeTensor: An AI Generated Deep Learning Runtime Built End to End by Coding Agents Programmatically<\/a> appeared first on <a href=\"https:\/\/www.marktechpost.com\/\">MarkTechPost<\/a>.<\/p>","protected":false},"excerpt":{"rendered":"<p>NVIDIA has released 
VIBETENSOR&hellip;<\/p>\n","protected":false},"author":1,"featured_media":362,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[1],"tags":[],"class_list":["post-361","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-uncategorized"],"_links":{"self":[{"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=\/wp\/v2\/posts\/361","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=361"}],"version-history":[{"count":0,"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=\/wp\/v2\/posts\/361\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=\/wp\/v2\/media\/362"}],"wp:attachment":[{"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=361"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=361"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=361"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}