{"id":994,"date":"2026-05-29T16:43:32","date_gmt":"2026-05-29T08:43:32","guid":{"rendered":"https:\/\/connectword.dpdns.org\/?p=994"},"modified":"2026-05-29T16:43:32","modified_gmt":"2026-05-29T08:43:32","slug":"meet-mkernel-a-multi-gpu-multi-node-fused-kernel-library-for-gpu-driven-communication","status":"publish","type":"post","link":"https:\/\/connectword.dpdns.org\/?p=994","title":{"rendered":"Meet mKernel: A Multi-GPU, Multi-Node Fused Kernel Library for GPU-Driven Communication"},"content":{"rendered":"<p class=\"wp-block-paragraph\">GPU communication overhead is a measurable bottleneck in production AI workloads. According to data cited by the mKernel project, communication can consume <strong>43.6% of the forward pass and 32% of end-to-end training time<\/strong>. Across popular Mixture-of-Experts (MoE) models, inter-device communication can account for <strong>up to 47% of total execution time<\/strong>. Researchers from UC Berkeley\u2019s UCCL project have released mKernel, a library of persistent CUDA kernels that fuse intra-node NVLink communication, inter-node RDMA, and compute into a single kernel.<\/p>\n<h2 class=\"wp-block-heading\"><strong>The Problem: Host-Driven Communication<\/strong><\/h2>\n<p class=\"wp-block-paragraph\">The standard model for multi-GPU communication is <strong>host-driven<\/strong>: the CPU runs the control path and calls into a library like NCCL or NVSHMEM. The library issues the collective operation \u2014 an AllReduce, an AllGather, etc. \u2014 across GPUs. Compute and communication run on separate CUDA streams and overlap at kernel boundaries.<\/p>\n<p class=\"wp-block-paragraph\"><strong>The research team identifies two problems with this approach<\/strong>:<\/p>\n<p class=\"wp-block-paragraph\">(1) CPUs are not scaling with GPU compute. A GB300 NVL72 rack integrates 72 Blackwell Ultra GPUs and 36 Grace CPUs, delivering 720 PFLOP\/s FP8\/FP6, 1.44 EFLOP\/s FP4 Tensor Core performance, and 130 TB\/s of all-to-all intra-rack NVLink bandwidth. At those speeds, microsecond-scale host orchestration overhead \u2014 a <code>cudaLaunchKernel<\/code> call, a CPU-side \u201call writes done\u201d check, an inter-stream event \u2014 shows up directly as <strong>pipeline bubbles<\/strong>.<\/p>\n<p class=\"wp-block-paragraph\">(2) Host-driven systems overlap compute and communication at coarse kernel boundaries. Finer-grained overlap at the tile or chunk level is not possible from the host side.<\/p>\n<p class=\"wp-block-paragraph\">The alternative is <strong>GPU-driven communication<\/strong>: the GPU itself triggers transfers, with communication fused into the same kernel as the compute. Most existing fused kernel libraries operate within a single node, or a single GPU. mKernel targets the multi-node case.<\/p>\n<h2 class=\"wp-block-heading\"><strong>What mKernel Does<\/strong><\/h2>\n<p class=\"wp-block-paragraph\">mKernel is a library of <strong>persistent CUDA kernels<\/strong>. Each kernel fuses intra-node NVLink communication, inter-node RDMA, and dense compute into a single kernel.<\/p>\n<p class=\"wp-block-paragraph\"><strong>Multi-GPU + multi-node, in one kernel<\/strong>: Both intra-node NVLink and inter-node RDMA live inside the same persistent kernel.<\/p>\n<p class=\"wp-block-paragraph\"><strong>Fine-grained intra-kernel overlap<\/strong>: Compute and communication overlap at tile\/chunk granularity, covering both intra-node and inter-node GPU communication.<\/p>\n<p class=\"wp-block-paragraph\"><strong>Persistent kernel with SM specialization<\/strong>: CTAs self-assign roles: <code>compute<\/code>, <code>intra-comm<\/code>, <code>inter-send<\/code>, <code>inter-reduce<\/code>. The number of SMs dedicated to each role is tunable per shape.<\/p>\n<p class=\"wp-block-paragraph\"><strong>GPU-driven networking built on <code>libibverbs<\/code><\/strong>: mKernel uses GPU-initiated RDMA writes without depending on NCCL or NVSHMEM. The communication backend is written from scratch to maximize performance and support heterogeneous networking devices.<\/p>\n<h2 class=\"wp-block-heading\"><strong>The Five Fused Kernels<\/strong><\/h2>\n<figure class=\"wp-block-table\">\n<table class=\"has-fixed-layout\">\n<thead>\n<tr>\n<th>Kernel<\/th>\n<th>What it fuses<\/th>\n<th>Description<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td><strong>AllGather + GEMM<\/strong><\/td>\n<td>AllGather \u2192 GEMM<\/td>\n<td>Each rank holds a shard of <code>A<\/code>. While ranks gather peers\u2019 shards over NVLink\/RDMA, the local GEMM consumes tiles as soon as they arrive.<\/td>\n<\/tr>\n<tr>\n<td><strong>GEMM + AllReduce<\/strong><\/td>\n<td>GEMM \u2192 AllReduce<\/td>\n<td>Computes <code>C = A @ B<\/code> and reduces partial outputs across all ranks in one launch. Output tiles are pushed into the reduction tree the instant they\u2019re produced.<\/td>\n<\/tr>\n<tr>\n<td><strong>MoE Dispatch + GEMM<\/strong><\/td>\n<td>All-to-All dispatch \u2192 grouped GEMM<\/td>\n<td>Routes MoE tokens to their expert ranks (intra-node NVLink + inter-node all-to-all) and runs the per-expert grouped GEMM in the same kernel. Tokens are processed as soon as they land \u2014 no staging buffer round-trip.<\/td>\n<\/tr>\n<tr>\n<td><strong>Ring Attention<\/strong><\/td>\n<td>Ring KV exchange \u2192 FlashAttention<\/td>\n<td>Sequence-parallel attention across ranks. Each step rotates a KV chunk around the ring while the local FlashAttention consumes the previously-received chunk. Compute and the ring send\/recv run concurrently inside a single persistent kernel.<\/td>\n<\/tr>\n<tr>\n<td><strong>GEMM + ReduceScatter<\/strong><\/td>\n<td>GEMM \u2192 ReduceScatter<\/td>\n<td>Computes <code>C = A @ B<\/code> and reduce-scatters the output. Each output tile is reduced and forwarded to its owning rank as soon as it is produced.<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<\/figure>\n<p class=\"wp-block-paragraph\">\n<h2 class=\"wp-block-heading\"><strong>Evaluation Setup<\/strong><\/h2>\n<\/p><p class=\"wp-block-paragraph\">The research team evaluated mKernel on two 2-node \u00d7 8-H200 clusters that differ only in their inter-node fabric:<\/p>\n<figure class=\"wp-block-table\">\n<table class=\"has-fixed-layout\">\n<thead>\n<tr>\n<th>Testbed<\/th>\n<th>Nodes \u00d7 GPUs<\/th>\n<th>Intra-node<\/th>\n<th>Inter-node transport<\/th>\n<th>NIC<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td><strong>AWS EFA<\/strong><\/td>\n<td>2 \u00d7 8 H200<\/td>\n<td>NVLink<\/td>\n<td>AWS EFA \/ SRD<\/td>\n<td>16 \u00d7 200 Gb\/s EFA per node<\/td>\n<\/tr>\n<tr>\n<td><strong>ConnectX-7<\/strong><\/td>\n<td>2 \u00d7 8 H200<\/td>\n<td>NVLink<\/td>\n<td>InfiniBand<\/td>\n<td>8 \u00d7 400 Gb\/s NVIDIA ConnectX-7 per node<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<\/figure>\n<p class=\"wp-block-paragraph\">mKernel was benchmarked against NCCL, Triton-distributed, Flux, Mercury, MagiAttention, Transformer-Engine, and ring-flash-attention. The team notes that further benchmarking at larger scale is still in progress.<\/p>\n<h2 class=\"wp-block-heading\"><strong>Backends and Requirements<\/strong><\/h2>\n<p class=\"wp-block-paragraph\">mKernel supports two networking backends:<\/p>\n<figure class=\"wp-block-table\">\n<table class=\"has-fixed-layout\">\n<thead>\n<tr>\n<th>Backend<\/th>\n<th>Macro<\/th>\n<th>Transport<\/th>\n<th>Where it runs<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td><strong>CX7<\/strong><\/td>\n<td><code>-DINTERNODE_BACKEND_IBVERBS<\/code><\/td>\n<td>libibverbs RC<\/td>\n<td>ConnectX-7 \/ InfiniBand \/ RoCE<\/td>\n<\/tr>\n<tr>\n<td><strong>EFA<\/strong><\/td>\n<td><code>-DINTERNODE_BACKEND_EFA<\/code><\/td>\n<td>libibverbs + efadv (SRD)<\/td>\n<td>AWS p5\/p5e (H200, EFA)<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<\/figure>\n<p class=\"wp-block-paragraph\">Both backends share the same host-side API and the same on-GPU kernel. Only the proxy\/session implementation differs (<code>session.h<\/code> for CX7, <code>session_efa.h<\/code> for EFA). Requirements: NVIDIA Hopper GPUs (default build targets <code>sm_90a<\/code>), CUDA 12.9, Python with PyTorch. The CX7 backend requires libibverbs development headers and libraries. The EFA backend requires AWS EFA installation with libfabric, libibverbs, efadv, and EFA headers under <code>EFA_HOME=\/opt\/amazon\/efa<\/code> by default.<\/p>\n<h2 class=\"wp-block-heading\"><strong>Marktechpost\u2019s Visual Explainer<\/strong><\/h2>\n<div>\n<div class=\"mk-header\">\n<div class=\"mk-logo\">UCCL<\/div>\n<div class=\"mk-header-title\"><span>mKernel<\/span> \u2014 Multi-GPU, Multi-Node Fused Kernels Guide<\/div>\n<\/div>\n<div class=\"mk-progress\">\n<div class=\"mk-progress-bar\"><\/div>\n<\/div>\n<div class=\"mk-slides\">\n<div class=\"mk-track\">\n<p>      <!-- Slide 1: Overview --><\/p>\n<div class=\"mk-slide\">\n<div class=\"mk-slide-num\">01 \/ 07 \u2014 Overview<\/div>\n<h2>What is <span class=\"mk-accent\">mKernel<\/span>?<\/h2>\n<p>mKernel is an open-source library of persistent CUDA kernels from UC Berkeley\u2019s UCCL project. It fuses intra-node NVLink communication, inter-node RDMA, and dense compute into a single kernel.<\/p>\n<p>Most existing fused kernel libraries operate within a single node or a single GPU. mKernel is designed from the start to span node boundaries.<\/p>\n<div class=\"mk-stat-row\">\n<div class=\"mk-stat\">\n<div class=\"mk-stat-val\">43.6%<\/div>\n<div class=\"mk-stat-label\">of forward pass consumed by communication in production<\/div>\n<\/div>\n<div class=\"mk-stat\">\n<div class=\"mk-stat-val\">47%<\/div>\n<div class=\"mk-stat-label\">of total execution time in popular MoE models<\/div>\n<\/div>\n<div class=\"mk-stat\">\n<div class=\"mk-stat-val\">32%<\/div>\n<div class=\"mk-stat-label\">of end-to-end training time consumed by communication<\/div>\n<\/div>\n<\/div>\n<\/div>\n<p>      <!-- Slide 2: The Problem --><\/p>\n<div class=\"mk-slide\">\n<div class=\"mk-slide-num\">02 \/ 07 \u2014 The Problem<\/div>\n<h2>Why <span class=\"mk-accent\">Host-Driven<\/span> Communication Falls Short<\/h2>\n<p>The standard model is host-driven: the CPU calls NCCL or NVSHMEM, which issues collective operations across GPUs. The UCCL team identifies two problems.<\/p>\n<div class=\"mk-prop-row\">\n<div class=\"mk-prop\">\n<div class=\"mk-prop-icon\"><img decoding=\"async\" src=\"https:\/\/s.w.org\/images\/core\/emoji\/17.0.2\/72x72\/26a1.png\" alt=\"\u26a1\" class=\"wp-smiley\" \/><\/div>\n<div class=\"mk-prop-text\"><strong>CPUs are not scaling with GPUs.<\/strong> A GB300 NVL72 rack delivers 720 PFLOP\/s FP8\/FP6 and 1.44 EFLOP\/s FP4. At those speeds, microsecond-scale overhead from <code>cudaLaunchKernel<\/code>, CPU-side sync checks, and inter-stream events shows up directly as pipeline bubbles.<\/div>\n<\/div>\n<div class=\"mk-prop\">\n<div class=\"mk-prop-icon\"><img decoding=\"async\" src=\"https:\/\/s.w.org\/images\/core\/emoji\/17.0.2\/72x72\/1f532.png\" alt=\"\ud83d\udd32\" class=\"wp-smiley\" \/><\/div>\n<div class=\"mk-prop-text\"><strong>Overlap is too coarse.<\/strong> Host-driven systems overlap compute and communication only at kernel boundaries. Finer-grained overlap at the tile or chunk level is not possible from the host side.<\/div>\n<\/div>\n<div class=\"mk-prop\">\n<div class=\"mk-prop-icon\"><img decoding=\"async\" src=\"https:\/\/s.w.org\/images\/core\/emoji\/17.0.2\/72x72\/1f500.png\" alt=\"\ud83d\udd00\" class=\"wp-smiley\" \/><\/div>\n<div class=\"mk-prop-text\"><strong>The answer: GPU-driven communication.<\/strong> The GPU itself triggers fine-grained transfers, fused into the same kernel as the compute.<\/div>\n<\/div>\n<\/div>\n<\/div>\n<p>      <!-- Slide 3: Design Properties --><\/p>\n<div class=\"mk-slide\">\n<div class=\"mk-slide-num\">03 \/ 07 \u2014 Design<\/div>\n<h2>Four Core Design <span class=\"mk-accent\">Properties<\/span><\/h2>\n<div class=\"mk-prop-row\">\n<div class=\"mk-prop\">\n<div class=\"mk-prop-icon\">\ud83d\udda7<\/div>\n<div class=\"mk-prop-text\"><strong>Multi-GPU + multi-node, in one kernel.<\/strong> Intra-node NVLink and inter-node RDMA both live inside the same persistent kernel.<\/div>\n<\/div>\n<div class=\"mk-prop\">\n<div class=\"mk-prop-icon\"><img decoding=\"async\" src=\"https:\/\/s.w.org\/images\/core\/emoji\/17.0.2\/72x72\/1f52c.png\" alt=\"\ud83d\udd2c\" class=\"wp-smiley\" \/><\/div>\n<div class=\"mk-prop-text\"><strong>Fine-grained intra-kernel overlap.<\/strong> Compute and communication overlap at tile\/chunk granularity, covering both intra-node and inter-node communication.<\/div>\n<\/div>\n<div class=\"mk-prop\">\n<div class=\"mk-prop-icon\"><img decoding=\"async\" src=\"https:\/\/s.w.org\/images\/core\/emoji\/17.0.2\/72x72\/2699.png\" alt=\"\u2699\" class=\"wp-smiley\" \/><\/div>\n<div class=\"mk-prop-text\"><strong>Persistent kernel with SM specialization.<\/strong> CTAs self-assign roles: <code>compute<\/code>, <code>intra-comm<\/code>, <code>inter-send<\/code>, <code>inter-reduce<\/code>. SM split is tunable per shape.<\/div>\n<\/div>\n<div class=\"mk-prop\">\n<div class=\"mk-prop-icon\"><img decoding=\"async\" src=\"https:\/\/s.w.org\/images\/core\/emoji\/17.0.2\/72x72\/1f4e1.png\" alt=\"\ud83d\udce1\" class=\"wp-smiley\" \/><\/div>\n<div class=\"mk-prop-text\"><strong>GPU-driven networking via <code>libibverbs<\/code>.<\/strong> Uses GPU-initiated RDMA writes. No NCCL or NVSHMEM dependency. Communication backend is written from scratch.<\/div>\n<\/div>\n<\/div>\n<\/div>\n<p>      <!-- Slide 4: The Five Kernels --><\/p>\n<div class=\"mk-slide\">\n<div class=\"mk-slide-num\">04 \/ 07 \u2014 Kernels<\/div>\n<h2>The Five <span class=\"mk-accent\">Fused Kernels<\/span><\/h2>\n<div class=\"mk-kernel-list\">\n<div class=\"mk-kernel\">\n<div class=\"mk-kernel-name\">AllGather + GEMM<\/div>\n<div class=\"mk-kernel-fuse\">AllGather \u2014&gt; GEMM<\/div>\n<div class=\"mk-kernel-desc\">Each rank holds a shard of <code>A<\/code>. The local GEMM consumes tiles over NVLink\/RDMA as they arrive \u2014 matmul starts before the collective finishes.<\/div>\n<\/div>\n<div class=\"mk-kernel\">\n<div class=\"mk-kernel-name\">GEMM + AllReduce<\/div>\n<div class=\"mk-kernel-fuse\">GEMM \u2014&gt; AllReduce<\/div>\n<div class=\"mk-kernel-desc\">Computes <code>C = A @ B<\/code> and reduces partial outputs across all ranks in one launch. Output tiles enter the reduction tree the instant they are produced.<\/div>\n<\/div>\n<div class=\"mk-kernel\">\n<div class=\"mk-kernel-name\">MoE Dispatch + GEMM<\/div>\n<div class=\"mk-kernel-fuse\">All-to-All dispatch \u2014&gt; grouped GEMM<\/div>\n<div class=\"mk-kernel-desc\">Routes MoE tokens to expert ranks via NVLink + inter-node all-to-all, then runs per-expert grouped GEMM in the same kernel. No staging buffer round-trip.<\/div>\n<\/div>\n<div class=\"mk-kernel\">\n<div class=\"mk-kernel-name\">Ring Attention<\/div>\n<div class=\"mk-kernel-fuse\">Ring KV exchange \u2014&gt; FlashAttention<\/div>\n<div class=\"mk-kernel-desc\">Sequence-parallel attention across ranks. Each step rotates a KV chunk around the ring while the local FlashAttention consumes the previously-received chunk.<\/div>\n<\/div>\n<div class=\"mk-kernel\">\n<div class=\"mk-kernel-name\">GEMM + ReduceScatter<\/div>\n<div class=\"mk-kernel-fuse\">GEMM \u2014&gt; ReduceScatter<\/div>\n<div class=\"mk-kernel-desc\">Computes <code>C = A @ B<\/code> and reduce-scatters the output. Each tile is reduced and forwarded to its owning rank as soon as it is produced.<\/div>\n<\/div>\n<\/div>\n<\/div>\n<p>      <!-- Slide 5: Evaluation --><\/p>\n<div class=\"mk-slide\">\n<div class=\"mk-slide-num\">05 \/ 07 \u2014 Evaluation<\/div>\n<h2>Evaluation <span class=\"mk-accent\">Setup<\/span><\/h2>\n<p>Tested on two 2-node \u00d7 8-H200 clusters differing only in inter-node fabric.<\/p>\n<table class=\"mk-table\">\n<thead>\n<tr>\n<th>Testbed<\/th>\n<th>Nodes \u00d7 GPUs<\/th>\n<th>Inter-node<\/th>\n<th>NIC<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td><strong>AWS EFA<\/strong><\/td>\n<td>2 \u00d7 8 H200<\/td>\n<td>AWS EFA \/ SRD<\/td>\n<td>16 \u00d7 200 Gb\/s EFA per node<\/td>\n<\/tr>\n<tr>\n<td><strong>ConnectX-7<\/strong><\/td>\n<td>2 \u00d7 8 H200<\/td>\n<td>InfiniBand<\/td>\n<td>8 \u00d7 400 Gb\/s CX7 per node<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>Both testbeds use NVLink intra-node. Benchmarked against: NCCL, Triton-distributed, Flux, Mercury, MagiAttention, Transformer-Engine, and ring-flash-attention. Larger-scale benchmarking is still in progress.<\/p>\n<\/div>\n<p>      <!-- Slide 6: Backends &amp; Requirements --><\/p>\n<div class=\"mk-slide\">\n<div class=\"mk-slide-num\">06 \/ 07 \u2014 Backends &amp; Requirements<\/div>\n<h2>Backends &amp; <span class=\"mk-accent\">Requirements<\/span><\/h2>\n<table class=\"mk-table\">\n<thead>\n<tr>\n<th>Backend<\/th>\n<th>Transport<\/th>\n<th>Where it runs<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td><strong>CX7<\/strong><\/td>\n<td>libibverbs RC<\/td>\n<td>ConnectX-7 \/ InfiniBand \/ RoCE<\/td>\n<\/tr>\n<tr>\n<td><strong>EFA<\/strong><\/td>\n<td>libibverbs + efadv (SRD)<\/td>\n<td>AWS p5\/p5e (H200, EFA)<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<div class=\"mk-prop-row\">\n<div class=\"mk-prop\">\n<div class=\"mk-prop-icon\"><img decoding=\"async\" src=\"https:\/\/s.w.org\/images\/core\/emoji\/17.0.2\/72x72\/1f4cb.png\" alt=\"\ud83d\udccb\" class=\"wp-smiley\" \/><\/div>\n<div class=\"mk-prop-text\"><strong>Requirements:<\/strong> NVIDIA Hopper GPUs (default <code>sm_90a<\/code>), CUDA 12.9, Python with PyTorch. CX7 needs libibverbs headers. EFA needs libfabric, libibverbs, efadv under <code>EFA_HOME=\/opt\/amazon\/efa<\/code>.<\/div>\n<\/div>\n<div class=\"mk-prop\">\n<div class=\"mk-prop-icon\"><img decoding=\"async\" src=\"https:\/\/s.w.org\/images\/core\/emoji\/17.0.2\/72x72\/1f4dd.png\" alt=\"\ud83d\udcdd\" class=\"wp-smiley\" \/><\/div>\n<div class=\"mk-prop-text\"><strong>License &amp; Attribution:<\/strong> MIT licensed. MMA\/compute code adapted from ThunderKittens (HazyResearch).<\/div>\n<\/div>\n<\/div>\n<\/div>\n<p>      <!-- Slide 7: Roadmap + Takeaways --><\/p>\n<div class=\"mk-slide\">\n<div class=\"mk-slide-num\">07 \/ 07 \u2014 Roadmap &amp; Key Takeaways<\/div>\n<h2>Roadmap &amp; <span class=\"mk-accent\">Key Takeaways<\/span><\/h2>\n<div class=\"mk-roadmap\">\n<div class=\"mk-road-item done\">\n<div class=\"mk-road-icon\">\u2705<\/div>\n<div>Fused GPU-driven multi-node kernels (AG+GEMM, GEMM+AR, MoE Dispatch+GEMM, Ring Attention, GEMM+RS)<\/div>\n<\/div>\n<div class=\"mk-road-item done\">\n<div class=\"mk-road-icon\">\u2705<\/div>\n<div>ConnectX-7 and AWS EFA backends<\/div>\n<\/div>\n<div class=\"mk-road-item wip\">\n<div class=\"mk-road-icon\">\ud83d\udea7<\/div>\n<div>Full heterogeneous accelerator\/NIC support with topology-aware discovery, placement, routing<\/div>\n<\/div>\n<div class=\"mk-road-item wip\">\n<div class=\"mk-road-icon\">\ud83d\udea7<\/div>\n<div>Inter-node megakernels: collapsing several fused steps into a single megakernel spanning a transformer layer<\/div>\n<\/div>\n<div class=\"mk-road-item wip\">\n<div class=\"mk-road-icon\">\ud83d\udea7<\/div>\n<div>Blackwell GPU support<\/div>\n<\/div><\/div>\n<div class=\"mk-takeaway-list\">\n<div class=\"mk-takeaway\">\n<div class=\"mk-takeaway-dot\"><\/div>\n<div>Fuses NVLink, inter-node RDMA, and compute into a single persistent CUDA kernel<\/div>\n<\/div>\n<div class=\"mk-takeaway\">\n<div class=\"mk-takeaway-dot\"><\/div>\n<div>Five kernels: AllGather+GEMM, GEMM+AllReduce, MoE Dispatch+GEMM, Ring Attention, GEMM+ReduceScatter<\/div>\n<\/div>\n<div class=\"mk-takeaway\">\n<div class=\"mk-takeaway-dot\"><\/div>\n<div>GPU-initiated RDMA via <code>libibverbs<\/code> \u2014 no NCCL or NVSHMEM dependency<\/div>\n<\/div>\n<div class=\"mk-takeaway\">\n<div class=\"mk-takeaway-dot\"><\/div>\n<div>Requires Hopper GPUs (<code>sm_90a<\/code>) and ConnectX-7 or AWS EFA networking<\/div>\n<\/div><\/div>\n<\/div>\n<\/div>\n<p><!-- \/mk-track -->\n  <\/p><\/div>\n<p><!-- \/mk-slides --><\/p>\n<div class=\"mk-nav\">\n    <button class=\"mk-nav-btn\" disabled>\u2190 Prev<\/button>\n<div class=\"mk-dots\"><\/div>\n<p>    <button class=\"mk-nav-btn\">Next \u2192<\/button>\n  <\/p><\/div>\n<\/div>\n<h2 class=\"wp-block-heading\"><strong>Key Takeaways<\/strong><\/h2>\n<ul class=\"wp-block-list\">\n<li>mKernel fuses intra-node NVLink, inter-node RDMA, and compute into a single persistent CUDA kernel.<\/li>\n<li>Communication overhead accounts for up to 47% of execution time in MoE models per cited production data.<\/li>\n<li>Five kernels are included: AllGather+GEMM, GEMM+AllReduce, MoE Dispatch+GEMM, Ring Attention, and GEMM+ReduceScatter.<\/li>\n<li>GPU-initiated RDMA is implemented directly via <code>libibverbs<\/code> \u2014 no NCCL or NVSHMEM dependency.<\/li>\n<li>Currently requires Hopper GPUs (<code>sm_90a<\/code>) and ConnectX-7 or AWS EFA networking; Blackwell support is on the roadmap.<\/li>\n<\/ul>\n<p class=\"wp-block-paragraph\">\n<hr class=\"wp-block-separator has-alpha-channel-opacity\" \/>\n<\/p><p class=\"wp-block-paragraph\">\n<\/p><p class=\"wp-block-paragraph\">Check out\u00a0the\u00a0<strong><a href=\"https:\/\/github.com\/uccl-project\/mKernel\" target=\"_blank\" rel=\"noreferrer noopener\">Repo<\/a>\u00a0<\/strong>and<strong>\u00a0<a href=\"https:\/\/uccl-project.github.io\/posts\/mkernel\/\" target=\"_blank\" rel=\"noreferrer noopener\">Technical Details<\/a>.\u00a0<\/strong>Also,\u00a0feel free to follow us on\u00a0<strong><a href=\"https:\/\/x.com\/intent\/follow?screen_name=marktechpost\" target=\"_blank\" rel=\"noreferrer noopener\"><mark>Twitter<\/mark><\/a><\/strong>\u00a0and don\u2019t forget to join our\u00a0<strong><a href=\"https:\/\/www.reddit.com\/r\/machinelearningnews\/\" target=\"_blank\" rel=\"noreferrer noopener\">150k+ ML SubReddit<\/a><\/strong>\u00a0and Subscribe to\u00a0<strong><a href=\"https:\/\/www.aidevsignals.com\/\" target=\"_blank\" rel=\"noreferrer noopener\">our Newsletter<\/a><\/strong>. Wait! are you on telegram?\u00a0<strong><a href=\"https:\/\/t.me\/machinelearningresearchnews\" target=\"_blank\" rel=\"noreferrer noopener\">now you can join us on telegram as well.<\/a><\/strong><\/p>\n<p class=\"wp-block-paragraph\">Need to partner with us for promoting your GitHub Repo OR Hugging Face Page OR Product Release OR Webinar etc.?\u00a0<strong><a href=\"https:\/\/forms.gle\/wbash1wF6efRj8G58\" target=\"_blank\" rel=\"noreferrer noopener\"><mark>Connect with us<\/mark><\/a><\/strong><\/p>\n<p>The post <a href=\"https:\/\/www.marktechpost.com\/2026\/05\/29\/meet-mkernel-a-multi-gpu-multi-node-fused-kernel-library-for-gpu-driven-communication\/\">Meet mKernel: A Multi-GPU, Multi-Node Fused Kernel Library for GPU-Driven Communication<\/a> appeared first on <a href=\"https:\/\/www.marktechpost.com\/\">MarkTechPost<\/a>.<\/p>","protected":false},"excerpt":{"rendered":"<p>GPU communication overhead is &hellip;<\/p>\n","protected":false},"author":1,"featured_media":29,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[1],"tags":[],"class_list":["post-994","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-uncategorized"],"_links":{"self":[{"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=\/wp\/v2\/posts\/994","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=994"}],"version-history":[{"count":0,"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=\/wp\/v2\/posts\/994\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=\/wp\/v2\/media\/29"}],"wp:attachment":[{"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=994"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=994"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=994"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}