{"id":912,"date":"2026-05-15T07:21:09","date_gmt":"2026-05-14T23:21:09","guid":{"rendered":"https:\/\/connectword.dpdns.org\/?p=912"},"modified":"2026-05-15T07:21:09","modified_gmt":"2026-05-14T23:21:09","slug":"a-coding-implementation-to-master-gpu-computing-with-cupy-custom-cuda-kernels-streams-sparse-matrices-and-profiling","status":"publish","type":"post","link":"https:\/\/connectword.dpdns.org\/?p=912","title":{"rendered":"A Coding Implementation to Master GPU Computing with CuPy, Custom CUDA Kernels, Streams, Sparse Matrices, and Profiling"},"content":{"rendered":"<p>In this tutorial, we delve into <a href=\"https:\/\/github.com\/cupy\/cupy\"><strong>CuPy<\/strong><\/a>, a GPU-accelerated alternative to NumPy for high-performance numerical computing in Python. We start by inspecting the available CUDA device, checking the CuPy version, CUDA runtime version, GPU memory, and compute capability so that we understand the hardware environment before running heavy computations. Then, we compare NumPy and CuPy on large matrix multiplication and FFT workloads to see how GPU acceleration changes execution speed. We also work with memory pools, custom elementwise kernels, reduction kernels, raw CUDA kernels, CUDA streams, sparse matrices, dense linear solvers, GPU image processing, DLPack interoperability, event-based profiling, cupyx.jit, and kernel fusion. 
Through these examples, we build a practical understanding of how CuPy lets us write familiar Python code while still accessing advanced CUDA-level performance features.<\/p>\n<div class=\"dm-code-snippet dark dm-normal-version default no-background-mobile\">\n<div class=\"control-language\">\n<div class=\"dm-buttons\">\n<div class=\"dm-buttons-left\">\n<div class=\"dm-button-snippet red-button\"><\/div>\n<div class=\"dm-button-snippet orange-button\"><\/div>\n<div class=\"dm-button-snippet green-button\"><\/div>\n<\/div>\n<div class=\"dm-buttons-right\"><a><span class=\"dm-copy-text\">Copy Code<\/span><span class=\"dm-copy-confirmed\">Copied<\/span><span class=\"dm-error-message\">Use a different Browser<\/span><\/a><\/div>\n<\/div>\n<pre class=\" no-line-numbers\"><code class=\" no-wrap language-php\">import sys, time, subprocess\ntry:\n   import cupy as cp\nexcept ImportError:\n   subprocess.check_call([sys.executable, \"-m\", \"pip\", \"install\", \"-q\", \"cupy-cuda12x\"])\n   import cupy as cp\nimport numpy as np\nimport matplotlib.pyplot as plt\nfrom cupyx.scipy import sparse as cps\nfrom cupyx.scipy import ndimage as cdi\nfrom cupyx import jit\n# Print a section banner surrounded by separator lines\ndef header(t): print(\"\\n\" + \"=\"*64 + f\"\\n{t}\\n\" + \"=\"*64)\n# Return the mean wall-clock seconds per call; synchronizing ensures queued GPU work is included\ndef bench(fn, *args, n=5, warmup=2, gpu=True):\n   for _ in range(warmup): fn(*args)\n   if gpu: cp.cuda.Stream.null.synchronize()\n   t0 = time.perf_counter()\n   for _ in range(n): r = fn(*args)\n   if gpu: cp.cuda.Stream.null.synchronize()\n   return (time.perf_counter() - t0) \/ n\nheader(\"1. 
GPU INTROSPECTION\")\nprops = cp.cuda.runtime.getDeviceProperties(0)\nprint(f\"CuPy version       : {cp.__version__}\")\nprint(f\"CUDA runtime       : {cp.cuda.runtime.runtimeGetVersion()}\")\nprint(f\"Device             : {props['name'].decode()}\")\nprint(f\"Compute capability : {props['major']}.{props['minor']}\")\nprint(f\"SMs                : {props['multiProcessorCount']}\")\nprint(f\"Global memory      : {props['totalGlobalMem']\/1e9:.2f} GB\")\nheader(\"2. NUMPY vs CUPY BENCHMARK\")\nN = 4096\nA_np = np.random.rand(N, N).astype(np.float32)\nB_np = np.random.rand(N, N).astype(np.float32)\nA_cp, B_cp = cp.asarray(A_np), cp.asarray(B_np)\nt_np = bench(np.matmul, A_np, B_np, n=2, gpu=False)\nt_cp = bench(cp.matmul, A_cp, B_cp, n=3, gpu=True)\nprint(f\"Matmul {N}x{N}  NumPy={t_np*1000:7.1f} ms  CuPy={t_cp*1000:7.1f} ms  ({t_np\/t_cp:.1f}x)\")\nx_np = np.random.rand(2**21).astype(np.complex64)\nx_cp = cp.asarray(x_np)\nt_np = bench(np.fft.fft, x_np, n=3, gpu=False)\nt_cp = bench(cp.fft.fft, x_cp, n=5, gpu=True)\nprint(f\"FFT 2^21        NumPy={t_np*1000:7.1f} ms  CuPy={t_cp*1000:7.1f} ms  ({t_np\/t_cp:.1f}x)\")<\/code><\/pre>\n<\/div>\n<\/div>\n<p>We begin by setting up CuPy, NumPy, Matplotlib, sparse utilities, image-processing tools, and JIT support so the tutorial has all the required GPU-computing components. We define helper functions for section headers and reliable benchmarking, then inspect the available CUDA device to understand the GPU environment. 
We also compare NumPy and CuPy for large matrix multiplication and FFT operations to observe the performance difference between CPU-based and GPU-accelerated computation.<\/p>\n<div class=\"dm-code-snippet dark dm-normal-version default no-background-mobile\">\n<div class=\"control-language\">\n<div class=\"dm-buttons\">\n<div class=\"dm-buttons-left\">\n<div class=\"dm-button-snippet red-button\"><\/div>\n<div class=\"dm-button-snippet orange-button\"><\/div>\n<div class=\"dm-button-snippet green-button\"><\/div>\n<\/div>\n<div class=\"dm-buttons-right\"><a><span class=\"dm-copy-text\">Copy Code<\/span><span class=\"dm-copy-confirmed\">Copied<\/span><span class=\"dm-error-message\">Use a different Browser<\/span><\/a><\/div>\n<\/div>\n<pre class=\" no-line-numbers\"><code class=\" no-wrap language-php\">header(\"3. MEMORY POOL\")\npool = cp.get_default_memory_pool()\npinned = cp.get_default_pinned_memory_pool()\nprint(f\"Used  : {pool.used_bytes()\/1e6:8.2f} MB\")\nprint(f\"Total : {pool.total_bytes()\/1e6:8.2f} MB\")\ndel A_cp, B_cp, x_cp\npool.free_all_blocks(); pinned.free_all_blocks()\nprint(f\"After free_all_blocks \u2192 Used: {pool.used_bytes()\/1e6:.2f} MB\")\nheader(\"4. ELEMENTWISE KERNEL\")\nrobust_norm = cp.ElementwiseKernel(\n   in_params ='float32 x, float32 y, float32 eps',\n   out_params='float32 z',\n   operation ='z = sqrtf((x - y)*(x - y) + eps)',\n   name      ='robust_norm')\nx = cp.random.rand(2_000_000, dtype=cp.float32)\ny = cp.random.rand(2_000_000, dtype=cp.float32)\nz = robust_norm(x, y, cp.float32(1e-6))\nprint(f\"Output shape={z.shape}  mean={float(z.mean()):.5f}\")\nheader(\"5. 
REDUCTION KERNEL \u2014 L2 NORM\")\nl2 = cp.ReductionKernel(\n   in_params  ='T x',\n   out_params ='T y',\n   map_expr   ='x * x',\n   reduce_expr='a + b',\n   post_map_expr='y = sqrt(a)',\n   identity   ='0',\n   name       ='l2norm')\nv = cp.random.rand(5_000_000, dtype=cp.float32)\nprint(f\"Custom : {float(l2(v)):.6f}\")\nprint(f\"cupy   : {float(cp.linalg.norm(v)):.6f}\")<\/code><\/pre>\n<\/div>\n<\/div>\n<p>We examine CuPy\u2019s memory pool to understand how GPU memory is allocated, reused, and released during execution. We then create a custom ElementwiseKernel to perform a per-element robust distance calculation directly on the GPU. After that, we define a custom ReductionKernel for L2 norm computation and compare its result with CuPy\u2019s built-in linear algebra function.<\/p>\n<div class=\"dm-code-snippet dark dm-normal-version default no-background-mobile\">\n<div class=\"control-language\">\n<div class=\"dm-buttons\">\n<div class=\"dm-buttons-left\">\n<div class=\"dm-button-snippet red-button\"><\/div>\n<div class=\"dm-button-snippet orange-button\"><\/div>\n<div class=\"dm-button-snippet green-button\"><\/div>\n<\/div>\n<div class=\"dm-buttons-right\"><a><span class=\"dm-copy-text\">Copy Code<\/span><span class=\"dm-copy-confirmed\">Copied<\/span><span class=\"dm-error-message\">Use a different Browser<\/span><\/a><\/div>\n<\/div>\n<pre class=\" no-line-numbers\"><code class=\" no-wrap language-php\">header(\"6. 
RAW CUDA KERNEL \u2014 MANDELBROT\")\nmandel = cp.RawKernel(r'''\nextern \"C\" __global__\nvoid mandel(float xmin, float xmax, float ymin, float ymax,\n           int W, int H, int max_iter, int* out) {\n   int ix = blockDim.x * blockIdx.x + threadIdx.x;\n   int iy = blockDim.y * blockIdx.y + threadIdx.y;\n   if (ix &gt;= W || iy &gt;= H) return;\n   float cx = xmin + (xmax - xmin) * ix \/ (W - 1);\n   float cy = ymin + (ymax - ymin) * iy \/ (H - 1);\n   float zx = 0.f, zy = 0.f;\n   int it = 0;\n   while (zx*zx + zy*zy &lt; 4.f &amp;&amp; it &lt; max_iter) {\n       float t = zx*zx - zy*zy + cx;\n       zy = 2.f*zx*zy + cy;\n       zx = t; ++it;\n   }\n   out[iy*W + ix] = it;\n}\n''', 'mandel')\nW, H, ITER = 1024, 1024, 400\nimg = cp.zeros((H, W), dtype=cp.int32)\nthreads = (16, 16)\nblocks = ((W + 15)\/\/16, (H + 15)\/\/16)\n# RawKernel does not validate argument types, so scalars are cast to match the C signature\nmandel(blocks, threads,\n      (cp.float32(-2.0), cp.float32(1.0),\n       cp.float32(-1.5), cp.float32(1.5),\n       cp.int32(W), cp.int32(H), cp.int32(ITER), img))\ncp.cuda.Stream.null.synchronize()\nprint(f\"Mandelbrot done. max iter reached={int(img.max())}\")\nplt.figure(figsize=(6,6))\nplt.imshow(cp.asnumpy(cp.log1p(img)), cmap='twilight_shifted', extent=[-2,1,-1.5,1.5])\nplt.title(\"Mandelbrot set \u2014 computed with a CuPy RawKernel\")\nplt.axis('off'); plt.show()\nheader(\"7. CUDA STREAMS\")\ns1, s2 = cp.cuda.Stream(non_blocking=True), cp.cuda.Stream(non_blocking=True)\nwith s1:\n   a1 = cp.random.rand(2000, 2000, dtype=cp.float32)\n   b1 = cp.random.rand(2000, 2000, dtype=cp.float32)\n   c1 = a1 @ b1\nwith s2:\n   a2 = cp.random.rand(2000, 2000, dtype=cp.float32)\n   b2 = cp.random.rand(2000, 2000, dtype=cp.float32)\n   c2 = a2 @ b2\ns1.synchronize(); s2.synchronize()\nprint(f\"Stream-1 mean={float(c1.mean()):.4f}\")\nprint(f\"Stream-2 mean={float(c2.mean()):.4f}\")<\/code><\/pre>\n<\/div>\n<\/div>\n<p>We use a raw CUDA C kernel through CuPy\u2019s RawKernel interface to compute the Mandelbrot set directly on the GPU. 
We launch the kernel with custom thread and block dimensions, synchronize execution, and visualize the resulting fractal using Matplotlib. We also explore CUDA streams by running two independent matrix multiplications concurrently and checking the output means from both streams.<\/p>\n<div class=\"dm-code-snippet dark dm-normal-version default no-background-mobile\">\n<div class=\"control-language\">\n<div class=\"dm-buttons\">\n<div class=\"dm-buttons-left\">\n<div class=\"dm-button-snippet red-button\"><\/div>\n<div class=\"dm-button-snippet orange-button\"><\/div>\n<div class=\"dm-button-snippet green-button\"><\/div>\n<\/div>\n<div class=\"dm-buttons-right\"><a><span class=\"dm-copy-text\">Copy Code<\/span><span class=\"dm-copy-confirmed\">Copied<\/span><span class=\"dm-error-message\">Use a different Browser<\/span><\/a><\/div>\n<\/div>\n<pre class=\" no-line-numbers\"><code class=\" no-wrap language-php\">header(\"8. SPARSE LINEAR ALGEBRA\")\nN, density = 8000, 5e-4\nnnz = int(N*N*density)\ndata = cp.random.rand(nnz, dtype=cp.float32)\nrows = cp.random.randint(0, N, nnz)\ncols = cp.random.randint(0, N, nnz)\nA_sp = cps.csr_matrix((data, (rows, cols)), shape=(N, N))\nxv   = cp.random.rand(N, dtype=cp.float32)\nprint(f\"NNZ           : {A_sp.nnz}\")\nprint(f\"Sparse matvec : {bench(lambda: A_sp @ xv)*1000:.3f} ms\")\nA_dense = A_sp.toarray()\nprint(f\"Dense  matvec : {bench(lambda: A_dense @ xv)*1000:.3f} ms\")\nheader(\"9. LINEAR SYSTEM Ax = b\")\nN = 2000\nM = cp.random.rand(N, N, dtype=cp.float32)\nA = M @ M.T + N * cp.eye(N, dtype=cp.float32)\nb = cp.random.rand(N, dtype=cp.float32)\nx_sol = cp.linalg.solve(A, b)\nres   = cp.linalg.norm(A @ x_sol - b) \/ cp.linalg.norm(b)\nprint(f\"Solved {N}x{N} SPD system. Relative residual = {float(res):.2e}\")\nheader(\"10. 
GAUSSIAN FILTER ON GPU\")\nbig = cp.random.rand(4096, 4096, dtype=cp.float32)\nt = bench(cdi.gaussian_filter, big, 5.0, n=3)\nprint(f\"4096x4096 Gaussian \u03c3=5  \u2192  {t*1000:.2f} ms\")\nheader(\"11. INTEROP &amp; ZERO-COPY (DLPack)\")\ng = cp.arange(8, dtype=cp.float32)\nh = cp.asnumpy(g)\nback = cp.asarray(h)\ndl = g.toDlpack()\nrestored = cp.from_dlpack(dl)\nprint(f\"NumPy view : {h}\")\nprint(f\"DLPack RT  : {restored}\")<\/code><\/pre>\n<\/div>\n<\/div>\n<p>We use sparse linear algebra by generating a random sparse CSR matrix and comparing sparse matrix-vector multiplication with dense multiplication. We then solve a large symmetric positive definite linear system using CuPy\u2019s dense linear algebra tools and verify the solution through a relative residual. Finally, we apply a Gaussian filter to a large image-like array on the GPU and demonstrate interoperability between NumPy, CuPy, and DLPack for data movement and zero-copy exchange.<\/p>\n<div class=\"dm-code-snippet dark dm-normal-version default no-background-mobile\">\n<div class=\"control-language\">\n<div class=\"dm-buttons\">\n<div class=\"dm-buttons-left\">\n<div class=\"dm-button-snippet red-button\"><\/div>\n<div class=\"dm-button-snippet orange-button\"><\/div>\n<div class=\"dm-button-snippet green-button\"><\/div>\n<\/div>\n<div class=\"dm-buttons-right\"><a><span class=\"dm-copy-text\">Copy Code<\/span><span class=\"dm-copy-confirmed\">Copied<\/span><span class=\"dm-error-message\">Use a different Browser<\/span><\/a><\/div>\n<\/div>\n<pre class=\" no-line-numbers\"><code class=\" no-wrap language-php\">header(\"12. CUDA EVENTS\")\nA = cp.random.rand(4000, 4000, dtype=cp.float32)\nB = cp.random.rand(4000, 4000, dtype=cp.float32)\ne0, e1 = cp.cuda.Event(), cp.cuda.Event()\ne0.record(); C = A @ B; e1.record(); e1.synchronize()\nprint(f\"4000x4000 matmul = {cp.cuda.get_elapsed_time(e0, e1):.3f} ms (CUDA events)\")\nheader(\"13. 
cupyx.jit \u2014 SAXPY\")\n@jit.rawkernel()\ndef saxpy(a, x, y, out, n):\n   tid = jit.blockIdx.x * jit.blockDim.x + jit.threadIdx.x\n   if tid &lt; n:\n       out[tid] = a * x[tid] + y[tid]\nn = 2_000_000\nxv = cp.random.rand(n, dtype=cp.float32)\nyv = cp.random.rand(n, dtype=cp.float32)\nout = cp.empty_like(xv)\nTPB = 256\nblocks = (n + TPB - 1) \/\/ TPB\nsaxpy((blocks,), (TPB,), (cp.float32(2.5), xv, yv, out, n))\nprint(\"Correctness:\", bool(cp.allclose(out, 2.5*xv + yv)))\nheader(\"14. KERNEL FUSION with @cp.fuse\")\n@cp.fuse()\ndef fused(x, y, z):\n   return cp.sqrt(x*x + y*y + z*z) * cp.exp(-0.5*(x+y+z))\ndef unfused(x, y, z):\n   return cp.sqrt(x*x + y*y + z*z) * cp.exp(-0.5*(x+y+z))\nn = 4_000_000\nx = cp.random.rand(n, dtype=cp.float32)\ny = cp.random.rand(n, dtype=cp.float32)\nz = cp.random.rand(n, dtype=cp.float32)\nfused(x, y, z)\nt1 = bench(unfused, x, y, z)\nt2 = bench(fused,   x, y, z)\nprint(f\"Unfused : {t1*1e3:6.3f} ms\")\nprint(f\"Fused   : {t2*1e3:6.3f} ms   (speedup {t1\/t2:.2f}x)\")\nprint(\"\\n\" + \"=\"*64)\nprint(\"DONE \u2014 explore: cupy.linalg, cupyx.scipy.signal, cupy.cuda.Graph\")\nprint(\"=\"*64)<\/code><\/pre>\n<\/div>\n<\/div>\n<p>We profile a large GPU matrix multiplication using CUDA events to obtain accurate device-side timing. We then write a SAXPY kernel with cupyx.jit, launch it manually, and verify its correctness against the equivalent CuPy expression. Finally, we use @cp.fuse to combine multiple array operations into a fused kernel and compare its speed with the unfused version.<\/p>\n<p>In conclusion, we gained a hands-on overview of CuPy\u2019s advanced GPU computing capabilities. We learned how to benchmark GPU operations correctly, manage GPU memory, create custom kernels, run concurrent CUDA streams, process sparse and dense numerical problems, and apply GPU acceleration to image filtering and scientific workloads. 
We also explored interoperability through NumPy and DLPack, profiled operations using CUDA events, and improved performance with JIT kernels and fused computations. Finally, we saw how CuPy provides NumPy-like syntax while allowing us to delve deeper into CUDA programming when we need more control, speed, and scalability for real-world numerical and scientific computing tasks.<\/p>\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\" \/>\n<p>Check out\u00a0the\u00a0<strong><a href=\"https:\/\/github.com\/Marktechpost\/AI-Agents-Projects-Tutorials\/blob\/main\/Data%20Analysis\/cupy_gpu_computing_tutorial_Marktechpost.ipynb\" target=\"_blank\" rel=\"noreferrer noopener\">Full Code and Notebook here<\/a>.<\/strong><\/p>\n<p>The post <a href=\"https:\/\/www.marktechpost.com\/2026\/05\/14\/a-coding-implementation-to-master-gpu-computing-with-cupy-custom-cuda-kernels-streams-sparse-matrices-and-profiling\/\">A Coding Implementation to Master GPU Computing with CuPy, Custom CUDA Kernels, Streams, Sparse Matrices, and Profiling<\/a> appeared first on <a href=\"https:\/\/www.marktechpost.com\/\">MarkTechPost<\/a>.<\/p>","protected":false},"excerpt":{"rendered":"<p>In this tutorial, we delve int&hellip;<\/p>\n","protected":false},"author":1,"featured_media":29,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[1],"tags":[],"class_list":["post-912","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-uncategorized"],"_links":{"self":[{"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=\/wp\/v2\/posts\/912","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=912"}],"version-history":[{"count":0,"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=\/wp\/v2\/posts\/912\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=\/wp\/v2\/media\/29"}],"wp:attachment":[{"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=912"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=912"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=912"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}