{"id":606,"date":"2026-03-25T05:45:49","date_gmt":"2026-03-24T21:45:49","guid":{"rendered":"https:\/\/connectword.dpdns.org\/?p=606"},"modified":"2026-03-25T05:45:49","modified_gmt":"2026-03-24T21:45:49","slug":"paged-attention-in-large-language-models-llms","status":"publish","type":"post","link":"https:\/\/connectword.dpdns.org\/?p=606","title":{"rendered":"Paged Attention in Large Language Models LLMs"},"content":{"rendered":"<p>When running LLMs at scale, the real limitation is GPU memory rather than compute, mainly because each request requires a KV cache to store token-level data. In traditional setups, a large fixed memory block is reserved per request based on the maximum sequence length, which leads to significant unused space and limits concurrency. Paged Attention improves this by breaking the KV cache into smaller, flexible chunks that are allocated only when needed, similar to how virtual memory works. It also allows multiple requests with the same starting prompt to share memory and only duplicate it when their outputs start to differ. 
This approach greatly improves memory efficiency, allowing significantly higher throughput with very little overhead.<\/p>\n<p>In this article, we simulate the naive KV cache allocator, build a working Paged Attention implementation with a block table and Copy-on-Write prefix sharing, and measure the utilisation gap across batch sizes of 10 to 200 concurrent requests.<\/p>\n<div class=\"wp-block-image\">\n<figure class=\"aligncenter size-full is-resized\"><img loading=\"lazy\" decoding=\"async\" width=\"793\" height=\"247\" data-attachment-id=\"78572\" data-permalink=\"https:\/\/www.marktechpost.com\/2026\/03\/24\/paged-attention-in-large-language-models-llms\/image-383\/\" data-orig-file=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2026\/03\/image-46.png\" data-orig-size=\"793,247\" data-comments-opened=\"1\" data-image-meta='{\"aperture\":\"0\",\"credit\":\"\",\"camera\":\"\",\"caption\":\"\",\"created_timestamp\":\"0\",\"copyright\":\"\",\"focal_length\":\"0\",\"iso\":\"0\",\"shutter_speed\":\"0\",\"title\":\"\",\"orientation\":\"0\"}' data-image-title=\"image\" data-image-description=\"\" data-image-caption=\"\" data-medium-file=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2026\/03\/image-46-300x93.png\" data-large-file=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2026\/03\/image-46.png\" src=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2026\/03\/image-46.png\" alt=\"\" class=\"wp-image-78572\" \/><\/figure>\n<\/div>\n<div class=\"wp-block-image\">\n<figure class=\"aligncenter size-full is-resized\"><img loading=\"lazy\" decoding=\"async\" width=\"823\" height=\"248\" data-attachment-id=\"78574\" data-permalink=\"https:\/\/www.marktechpost.com\/2026\/03\/24\/paged-attention-in-large-language-models-llms\/image-384\/\" data-orig-file=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2026\/03\/image-47.png\" data-orig-size=\"823,248\" data-comments-opened=\"1\" 
data-image-meta='{\"aperture\":\"0\",\"credit\":\"\",\"camera\":\"\",\"caption\":\"\",\"created_timestamp\":\"0\",\"copyright\":\"\",\"focal_length\":\"0\",\"iso\":\"0\",\"shutter_speed\":\"0\",\"title\":\"\",\"orientation\":\"0\"}' data-image-title=\"image\" data-image-description=\"\" data-image-caption=\"\" data-medium-file=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2026\/03\/image-47-300x90.png\" data-large-file=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2026\/03\/image-47.png\" src=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2026\/03\/image-47.png\" alt=\"\" class=\"wp-image-78574\" \/><\/figure>\n<\/div>\n<div class=\"wp-block-image\">\n<figure class=\"aligncenter size-full is-resized\"><img loading=\"lazy\" decoding=\"async\" width=\"800\" height=\"237\" data-attachment-id=\"78576\" data-permalink=\"https:\/\/www.marktechpost.com\/2026\/03\/24\/paged-attention-in-large-language-models-llms\/image-386\/\" data-orig-file=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2026\/03\/image-48.png\" data-orig-size=\"800,237\" data-comments-opened=\"1\" data-image-meta='{\"aperture\":\"0\",\"credit\":\"\",\"camera\":\"\",\"caption\":\"\",\"created_timestamp\":\"0\",\"copyright\":\"\",\"focal_length\":\"0\",\"iso\":\"0\",\"shutter_speed\":\"0\",\"title\":\"\",\"orientation\":\"0\"}' data-image-title=\"image\" data-image-description=\"\" data-image-caption=\"\" data-medium-file=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2026\/03\/image-48-300x89.png\" data-large-file=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2026\/03\/image-48.png\" src=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2026\/03\/image-48.png\" alt=\"\" class=\"wp-image-78576\" \/><\/figure>\n<\/div>\n<h1 class=\"wp-block-heading\">Importing the dependencies<\/h1>\n<div class=\"dm-code-snippet dark dm-normal-version default no-background-mobile\">\n<div class=\"control-language\">\n<div class=\"dm-buttons\">\n<div 
class=\"dm-buttons-left\">\n<div class=\"dm-button-snippet red-button\"><\/div>\n<div class=\"dm-button-snippet orange-button\"><\/div>\n<div class=\"dm-button-snippet green-button\"><\/div>\n<\/div>\n<div class=\"dm-buttons-right\"><a><span class=\"dm-copy-text\">Copy Code<\/span><span class=\"dm-copy-confirmed\">Copied<\/span><span class=\"dm-error-message\">Use a different Browser<\/span><\/a><\/div>\n<\/div>\n<pre class=\" no-line-numbers\"><code class=\" no-wrap language-php\">import math\nimport random\nimport numpy as np\nimport matplotlib.pyplot as plt\nimport matplotlib.patches as mpatches\nfrom collections import defaultdict\n \nrandom.seed(42)\nnp.random.seed(42)<\/code><\/pre>\n<\/div>\n<\/div>\n<h1 class=\"wp-block-heading\">Setting up the Constants<\/h1>\n<p>Before simulating anything, we need to know how much GPU memory a single token actually costs. This depends entirely on the model\u2019s architecture. We use a GPT-style configuration \u2014 32 layers, 32 attention heads, 128 dimensions per head, stored in fp16. The factor of 2 at the front accounts for both the Key and Value projections (there is no Q cache \u2014 queries are recomputed at each step). Multiplying these out gives us 524,288 bytes, or 512 KB, per token. 
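<\/p>\n<p>As a quick sanity check, the per-token cost can be reproduced directly from the architecture numbers (2 projections \u00d7 layers \u00d7 heads \u00d7 head dimension \u00d7 bytes per fp16 value); this small sketch only restates the arithmetic described above:<\/p>\n

```python
# Per-token KV cache cost for a GPT-style model: Key + Value (the factor of 2),
# across every layer and head, stored in fp16 (2 bytes per value).
NUM_LAYERS, NUM_HEADS, HEAD_DIM, BYTES_FP16 = 32, 32, 128, 2

kv_bytes_per_token = 2 * NUM_LAYERS * NUM_HEADS * HEAD_DIM * BYTES_FP16
print(kv_bytes_per_token)          # 524288 bytes
print(kv_bytes_per_token / 1024)   # 512.0 KB per token
```

\n<p>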
This is the fundamental unit everything else is built on \u2014 pre-allocation sizes, page counts, and wasted memory all scale directly from this number.<\/p>\n<div class=\"dm-code-snippet dark dm-normal-version default no-background-mobile\">\n<div class=\"control-language\">\n<div class=\"dm-buttons\">\n<div class=\"dm-buttons-left\">\n<div class=\"dm-button-snippet red-button\"><\/div>\n<div class=\"dm-button-snippet orange-button\"><\/div>\n<div class=\"dm-button-snippet green-button\"><\/div>\n<\/div>\n<div class=\"dm-buttons-right\"><a><span class=\"dm-copy-text\">Copy Code<\/span><span class=\"dm-copy-confirmed\">Copied<\/span><span class=\"dm-error-message\">Use a different Browser<\/span><\/a><\/div>\n<\/div>\n<pre class=\" no-line-numbers\"><code class=\" no-wrap language-php\">NUM_LAYERS  = 32\nNUM_HEADS   = 32\nHEAD_DIM    = 128\nBYTES_FP16  = 2\nPAGE_SIZE   = 16    # tokens per page (vLLM default)\nMAX_SEQ_LEN = 2048\n \nKV_BYTES_PER_TOKEN = 2 * NUM_LAYERS * NUM_HEADS * HEAD_DIM * BYTES_FP16\nKV_MB_PER_TOKEN    = KV_BYTES_PER_TOKEN \/ 1024 \/ 1024<\/code><\/pre>\n<\/div>\n<\/div>\n<h1 class=\"wp-block-heading\">Naive KV Cache<\/h1>\n<p>The naive approach is simple: when a request arrives, a contiguous block of GPU memory is allocated sized to the maximum sequence length \u2014 2048 tokens in this case. This happens because the response length is unknown upfront, so the worst case is reserved.<\/p>\n<p>AVG_RESPONSE is set to 500, which is a realistic average for a production chatbot. Multiplying by KV_MB_PER_TOKEN gives what is actually written versus what was locked. The gap is the waste.<\/p>\n<p>The numbers make the problem concrete. Each request pre-allocates 1024 MB but uses only 250 MB \u2014 24.4% utilisation. The remaining 774 MB sits reserved for the entire duration of the request, unavailable to any other request. Across 100 concurrent users, that is 75 GB of GPU memory doing nothing. 
This is not an edge case \u2014 it is the default behavior of every system that does not implement paged allocation, and it is exactly why naive serving systems hit an OOM wall long before the GPU is computationally saturated.<\/p>\n<div class=\"dm-code-snippet dark dm-normal-version default no-background-mobile\">\n<div class=\"control-language\">\n<div class=\"dm-buttons\">\n<div class=\"dm-buttons-left\">\n<div class=\"dm-button-snippet red-button\"><\/div>\n<div class=\"dm-button-snippet orange-button\"><\/div>\n<div class=\"dm-button-snippet green-button\"><\/div>\n<\/div>\n<div class=\"dm-buttons-right\"><a><span class=\"dm-copy-text\">Copy Code<\/span><span class=\"dm-copy-confirmed\">Copied<\/span><span class=\"dm-error-message\">Use a different Browser<\/span><\/a><\/div>\n<\/div>\n<pre class=\" no-line-numbers\"><code class=\" no-wrap language-php\">print(\"=\" * 60)\nprint(\"SECTION 1 -- Naive KV Cache: The Waste Problem\")\nprint(\"=\" * 60)\n \nAVG_RESPONSE = 500   # realistic average tokens generated\n \npre_allocated_mb = MAX_SEQ_LEN  * KV_MB_PER_TOKEN\nactually_used_mb = AVG_RESPONSE * KV_MB_PER_TOKEN\n \nprint(f\"\\nKV cache per token    : {KV_BYTES_PER_TOKEN:,} bytes\")\nprint(f\"Pre-allocated\/request : {pre_allocated_mb:.2f} MB  ({MAX_SEQ_LEN} tokens)\")\nprint(f\"Actually used\/request : {actually_used_mb:.2f} MB  ({AVG_RESPONSE} tokens)\")\nprint(f\"Utilisation           : {actually_used_mb \/ pre_allocated_mb * 100:.1f}%\")\nprint(f\"Wasted per request    : {pre_allocated_mb - actually_used_mb:.2f} MB\")\n \nNUM_USERS = 100\nwasted_gb = (pre_allocated_mb - actually_used_mb) * NUM_USERS \/ 1024\nprint(f\"\\nAcross {NUM_USERS} concurrent users \u2192 {wasted_gb:.2f} GB wasted\")\nprint(\"\\n\u2192 Naive systems utilise only 20-38% of allocated KV cache memory\")\nprint(\"  (source: original Paged Attention \/ vLLM paper)\")<\/code><\/pre>\n<\/div>\n<\/div>\n<div class=\"wp-block-image\">\n<figure class=\"aligncenter size-full is-resized\"><img 
loading=\"lazy\" decoding=\"async\" width=\"1111\" height=\"576\" data-attachment-id=\"78580\" data-permalink=\"https:\/\/www.marktechpost.com\/2026\/03\/24\/paged-attention-in-large-language-models-llms\/image-390\/\" data-orig-file=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2026\/03\/image-50.png\" data-orig-size=\"1111,576\" data-comments-opened=\"1\" data-image-meta='{\"aperture\":\"0\",\"credit\":\"\",\"camera\":\"\",\"caption\":\"\",\"created_timestamp\":\"0\",\"copyright\":\"\",\"focal_length\":\"0\",\"iso\":\"0\",\"shutter_speed\":\"0\",\"title\":\"\",\"orientation\":\"0\"}' data-image-title=\"image\" data-image-description=\"\" data-image-caption=\"\" data-medium-file=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2026\/03\/image-50-300x156.png\" data-large-file=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2026\/03\/image-50-1024x531.png\" src=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2026\/03\/image-50.png\" alt=\"\" class=\"wp-image-78580\" \/><\/figure>\n<\/div>\n<div class=\"wp-block-image\">\n<figure class=\"aligncenter size-full is-resized\"><img loading=\"lazy\" decoding=\"async\" width=\"2172\" height=\"1042\" data-attachment-id=\"78583\" data-permalink=\"https:\/\/www.marktechpost.com\/2026\/03\/24\/paged-attention-in-large-language-models-llms\/screenshot-2026-03-24-at-2-36-04-pm-2\/\" data-orig-file=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2026\/03\/Screenshot-2026-03-24-at-2.36.04-PM-1.png\" data-orig-size=\"2172,1042\" data-comments-opened=\"1\" data-image-meta='{\"aperture\":\"0\",\"credit\":\"\",\"camera\":\"\",\"caption\":\"\",\"created_timestamp\":\"0\",\"copyright\":\"\",\"focal_length\":\"0\",\"iso\":\"0\",\"shutter_speed\":\"0\",\"title\":\"\",\"orientation\":\"0\"}' data-image-title=\"Screenshot 2026-03-24 at 2.36.04\u202fPM\" data-image-description=\"\" data-image-caption=\"\" 
data-medium-file=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2026\/03\/Screenshot-2026-03-24-at-2.36.04-PM-1-300x144.png\" data-large-file=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2026\/03\/Screenshot-2026-03-24-at-2.36.04-PM-1-1024x491.png\" src=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2026\/03\/Screenshot-2026-03-24-at-2.36.04-PM-1.png\" alt=\"\" class=\"wp-image-78583\" \/><\/figure>\n<\/div>\n<h1 class=\"wp-block-heading\">Paged Attention<\/h1>\n<p>Two classes are introduced here to simulate how Paged Attention actually works at the memory management level.<\/p>\n<p>PagePool represents the physical GPU memory pool \u2014 a flat array of equal-size pages, each holding 16 tokens. It maintains a free list and a ref count per page. When a page\u2019s ref count drops to zero, it is immediately returned to the free list and becomes available to any new request. This is the key difference from naive allocation \u2014 there are no reserved holes, no fragmentation, and no memory tied to a finished request.<\/p>\n<p>PagedRequest represents a single inference request. It holds a block_table \u2014 a list that maps logical page indices to physical page ids in the pool. Every time generate_token() is called and the token count crosses a page boundary, a new physical page is claimed from the pool. No memory is touched before it is needed.<\/p>\n<p>Five requests are run with token counts of 320, 48, 160, 96, and 272. The output shows pages allocated proportionally to actual usage \u2014 req-1 with 48 tokens gets 3 pages, req-0 with 320 tokens gets 20. When req-1 is freed, its 3 pages go straight back to the pool and are immediately reusable. The pool utilisation at 10.9% looks low only because 512 pages were provisioned for 5 small requests \u2014 in a fully loaded production pool it would sit near the 98% range seen in Section 4. 
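<\/p>\n<p>Those page counts fall straight out of ceil(tokens \/ PAGE_SIZE); a minimal check of the five requests above:<\/p>\n

```python
import math

PAGE_SIZE = 16  # tokens per page, as configured above
token_counts = [320, 48, 160, 96, 272]

# pages needed per request: the last page is allocated even if partially filled
pages = [math.ceil(n / PAGE_SIZE) for n in token_counts]
print(pages)                      # [20, 3, 10, 6, 17] -> 56 pages in total
print(sum(pages) / 512 * 100)     # 10.9375 -> the 10.9% pool utilisation
```

\n<p>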
The \u201c0 tokens wasted\u201d in the last-page column is an artifact of the chosen token counts \u2014 all five happen to be exact multiples of 16. In practice, the last-page waste averages roughly PAGE_SIZE \/ 2 = 8 tokens per request.<\/p>\n<div class=\"wp-block-image\">\n<figure class=\"aligncenter size-full is-resized\"><img loading=\"lazy\" decoding=\"async\" width=\"1103\" height=\"840\" data-attachment-id=\"78581\" data-permalink=\"https:\/\/www.marktechpost.com\/2026\/03\/24\/paged-attention-in-large-language-models-llms\/image-391\/\" data-orig-file=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2026\/03\/image-51.png\" data-orig-size=\"1103,840\" data-comments-opened=\"1\" data-image-meta='{\"aperture\":\"0\",\"credit\":\"\",\"camera\":\"\",\"caption\":\"\",\"created_timestamp\":\"0\",\"copyright\":\"\",\"focal_length\":\"0\",\"iso\":\"0\",\"shutter_speed\":\"0\",\"title\":\"\",\"orientation\":\"0\"}' data-image-title=\"image\" data-image-description=\"\" data-image-caption=\"\" data-medium-file=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2026\/03\/image-51-300x228.png\" data-large-file=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2026\/03\/image-51-1024x780.png\" src=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2026\/03\/image-51.png\" alt=\"\" class=\"wp-image-78581\" \/><\/figure>\n<\/div>\n<div class=\"dm-code-snippet dark dm-normal-version default no-background-mobile\">\n<div class=\"control-language\">\n<div class=\"dm-buttons\">\n<div class=\"dm-buttons-left\">\n<div class=\"dm-button-snippet red-button\"><\/div>\n<div class=\"dm-button-snippet orange-button\"><\/div>\n<div class=\"dm-button-snippet green-button\"><\/div>\n<\/div>\n<div class=\"dm-buttons-right\"><a><span class=\"dm-copy-text\">Copy Code<\/span><span class=\"dm-copy-confirmed\">Copied<\/span><span class=\"dm-error-message\">Use a different Browser<\/span><\/a><\/div>\n<\/div>\n<pre class=\" no-line-numbers\"><code class=\" no-wrap language-php\">print(\"\\n\" + \"=\" * 60)\nprint(\"SECTION 2 -- Paged Attention: Pages + Block Table\")\nprint(\"=\" * 60)\n \n\"\"\"\nInstead of one large contiguous block per request:\n  - KV cache is split into fixed-size pages (PAGE_SIZE tokens each)\n  - Pages are allocated on demand, can live anywhere in GPU memory\n  - Each request keeps a block_table: logical index \u2192 physical page id\n\"\"\"\n \nclass PagePool:\n    def __init__(self, total_pages):\n        self.free      = list(range(total_pages))\n        self.total     = total_pages\n        self.ref_count = defaultdict(int)\n \n    def allocate(self):\n        if not self.free:\n            raise MemoryError(\"OOM -- no free pages\")\n        pid = self.free.pop(0)\n        self.ref_count[pid] = 1\n        return pid\n \n    def release(self, pid):\n        self.ref_count[pid] -= 1\n        if self.ref_count[pid] &lt;= 0:\n            self.free.append(pid)\n            del self.ref_count[pid]\n \n    def share(self, pid):\n        \"\"\"Increment ref count -- another request is sharing this page.\"\"\"\n        self.ref_count[pid] += 1\n \n    def cow_copy(self, pid):\n        \"\"\"CoW: allocate a new page, decrement ref on the old one.\"\"\"\n        new_pid = self.allocate()\n        self.release(pid)\n        return new_pid\n \n    @property\n    def utilisation(self):\n        return (self.total - len(self.free)) \/ self.total * 100\n \n \nclass PagedRequest:\n    def __init__(self, req_id, pool: PagePool):\n        self.id          = req_id\n        self.pool        = pool\n        self.block_table = []   # logical index \u2192 physical page id\n        self.tokens      = 0\n \n    def generate_token(self):\n        if self.tokens % PAGE_SIZE == 0:   # page boundary \u2192 allocate new page\n            self.block_table.append(self.pool.allocate())\n        self.tokens += 1\n \n    def free(self):\n        for pid in self.block_table:\n            self.pool.release(pid)\n        self.block_table.clear()\n \n \npool = PagePool(total_pages=512)\nrequests = [PagedRequest(f\"req-{i}\", pool) for i in range(5)]\ntoken_counts = [320, 48, 160, 96, 272]\n \nfor req, n in zip(requests, token_counts):\n    for _ in range(n):\n        req.generate_token()\n \nprint(\"\\nRequest state after generation:\")\nprint(f\"  {'ID':&lt;10} {'Tokens':&gt;8} {'Pages':&gt;7} {'Last-page waste':&gt;16}\")\nfor req in requests:\n    waste = req.tokens % PAGE_SIZE\n    waste = PAGE_SIZE - waste if waste else 0\n    print(f\"  {req.id:&lt;10} {req.tokens:&gt;8} {len(req.block_table):&gt;7} {waste:&gt;16} tokens\")\n \nprint(f\"\\nPool utilisation : {pool.utilisation:.1f}%\")\nrequests[1].free()\nprint(f\"After freeing req-1 \u2192 utilisation: {pool.utilisation:.1f}%  (pages immediately reusable)\")<\/code><\/pre>\n<\/div>\n<\/div>\n<div class=\"wp-block-image\">\n<figure class=\"aligncenter size-full is-resized\"><img loading=\"lazy\" decoding=\"async\" width=\"1890\" height=\"610\" data-attachment-id=\"78585\" data-permalink=\"https:\/\/www.marktechpost.com\/2026\/03\/24\/paged-attention-in-large-language-models-llms\/screenshot-2026-03-24-at-2-37-27-pm-2\/\" data-orig-file=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2026\/03\/Screenshot-2026-03-24-at-2.37.27-PM-1.png\" data-orig-size=\"1890,610\" data-comments-opened=\"1\" data-image-meta='{\"aperture\":\"0\",\"credit\":\"\",\"camera\":\"\",\"caption\":\"\",\"created_timestamp\":\"0\",\"copyright\":\"\",\"focal_length\":\"0\",\"iso\":\"0\",\"shutter_speed\":\"0\",\"title\":\"\",\"orientation\":\"0\"}' data-image-title=\"Screenshot 2026-03-24 at 2.37.27\u202fPM\" data-image-description=\"\" data-image-caption=\"\" data-medium-file=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2026\/03\/Screenshot-2026-03-24-at-2.37.27-PM-1-300x97.png\" data-large-file=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2026\/03\/Screenshot-2026-03-24-at-2.37.27-PM-1-1024x330.png\" 
src=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2026\/03\/Screenshot-2026-03-24-at-2.37.27-PM-1.png\" alt=\"\" class=\"wp-image-78585\" \/><\/figure>\n<\/div>\n<h1 class=\"wp-block-heading\">Copy-on-Write: Shared System Prompts<\/h1>\n<p>In production, nearly every request to a deployed LLM carries the same system prompt \u2014 the instructions that define the model\u2019s behavior. Under naive allocation, each of those requests stores its own full copy of the system prompt\u2019s KV cache. With 10 concurrent requests and a 200-token system prompt, that is 10 identical copies of the same data occupying separate memory regions.<\/p>\n<p>The same PagePool from Section 2 is reused here, extended with two methods: share() increments a page\u2019s ref count without allocating anything new, and cow_copy() allocates a fresh page and decrements the ref count on the original. A new pool is instantiated and the system prompt is encoded into 13 pages \u2014 math.ceil(200 \/ 16). Each of the 10 user requests then calls share() on all 13 pages, pointing their block tables at the same physical memory. No new pages are allocated. The ref count on each shared page simply rises to 11.<\/p>\n<p>The savings are immediate: naive allocation would require 130 pages across 10 requests. With CoW, only 13 physical pages exist. That is 936 MB saved from a single shared prefix.<\/p>\n<p>When req-3 generates its first unique token, cow_copy() is called on its last shared page \u2014 page 12. A new page 13 is allocated as req-3\u2019s private copy, and the ref count on page 12 drops by one. The other 9 requests continue pointing at page 12, completely unaffected. 
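<\/p>\n<p>The ref-count bookkeeping behind that divergence can be sketched in a few lines, independent of the fuller PagePool class used in this article:<\/p>\n

```python
# Minimal CoW sketch: page 12 holds the allocator's ref plus 10 sharing requests.
ref_count = {12: 11}

# req-3 writes a diverging token -> it gets a private copy on a fresh page (id 13)
new_page = 13
ref_count[new_page] = 1   # private to req-3
ref_count[12] -= 1        # req-3 drops its reference to the shared page

print(ref_count)          # {12: 10, 13: 1} -- the other requests are untouched
```

\n<p>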
This is the CoW contract: shared until divergence, private only when necessary.<\/p>\n<div class=\"wp-block-image\">\n<figure class=\"aligncenter size-full is-resized\"><img loading=\"lazy\" decoding=\"async\" width=\"1744\" height=\"1258\" data-attachment-id=\"78587\" data-permalink=\"https:\/\/www.marktechpost.com\/2026\/03\/24\/paged-attention-in-large-language-models-llms\/screenshot-2026-03-24-at-2-37-59-pm-2\/\" data-orig-file=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2026\/03\/Screenshot-2026-03-24-at-2.37.59-PM-1.png\" data-orig-size=\"1744,1258\" data-comments-opened=\"1\" data-image-meta='{\"aperture\":\"0\",\"credit\":\"\",\"camera\":\"\",\"caption\":\"\",\"created_timestamp\":\"0\",\"copyright\":\"\",\"focal_length\":\"0\",\"iso\":\"0\",\"shutter_speed\":\"0\",\"title\":\"\",\"orientation\":\"0\"}' data-image-title=\"Screenshot 2026-03-24 at 2.37.59\u202fPM\" data-image-description=\"\" data-image-caption=\"\" data-medium-file=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2026\/03\/Screenshot-2026-03-24-at-2.37.59-PM-1-300x216.png\" data-large-file=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2026\/03\/Screenshot-2026-03-24-at-2.37.59-PM-1-1024x739.png\" src=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2026\/03\/Screenshot-2026-03-24-at-2.37.59-PM-1.png\" alt=\"\" class=\"wp-image-78587\" \/><\/figure>\n<\/div>\n<div class=\"dm-code-snippet dark dm-normal-version default no-background-mobile\">\n<div class=\"control-language\">\n<div class=\"dm-buttons\">\n<div class=\"dm-buttons-left\">\n<div class=\"dm-button-snippet red-button\"><\/div>\n<div class=\"dm-button-snippet orange-button\"><\/div>\n<div class=\"dm-button-snippet green-button\"><\/div>\n<\/div>\n<div class=\"dm-buttons-right\"><a><span class=\"dm-copy-text\">Copy Code<\/span><span class=\"dm-copy-confirmed\">Copied<\/span><span class=\"dm-error-message\">Use a different Browser<\/span><\/a><\/div>\n<\/div>\n<pre class=\" no-line-numbers\"><code 
class=\" no-wrap language-php\">print(\"\\n\" + \"=\" * 60)\nprint(\"SECTION 3 -- Copy-on-Write: Shared System Prompts\")\nprint(\"=\" * 60)\n \n\"\"\"\nIf N requests share a system prompt, naive allocation stores N copies.\nWith CoW, all requests point to the SAME physical pages.\nA private copy is made only when a request writes a diverging token.\n\"\"\"\n \ncow_pool    = PagePool(total_pages=512)\nSYSTEM_TOKENS = 200\nsystem_pages  = math.ceil(SYSTEM_TOKENS \/ PAGE_SIZE)\nshared_pids   = [cow_pool.allocate() for _ in range(system_pages)]\nprint(f\"\\nSystem prompt \u2192 {system_pages} shared pages: {shared_pids}\")\n \nN = 10\nuser_tables = []\nfor i in range(N):\n    table = list(shared_pids)\n    for pid in shared_pids:\n        cow_pool.share(pid)     # ref count up -- no physical copy\n    user_tables.append(table)\n \nsaved_mb = (system_pages * N - system_pages) * PAGE_SIZE * KV_MB_PER_TOKEN\nprint(f\"\\nStoring system prompt for {N} requests:\")\nprint(f\"  Naive : {system_pages * N} pages  ({system_pages * N * PAGE_SIZE * KV_MB_PER_TOKEN:.1f} MB)\")\nprint(f\"  CoW   : {system_pages} pages   ({system_pages * PAGE_SIZE * KV_MB_PER_TOKEN:.1f} MB)\")\nprint(f\"  Saved : {saved_mb:.1f} MB\")\n \nold_pid                 = user_tables[3][-1]\nnew_pid                 = cow_pool.cow_copy(old_pid)\nuser_tables[3][-1]      = new_pid\nprint(f\"\\nReq-3 diverges \u2192 CoW: old page {old_pid} \u2192 new page {new_pid}\")\nprint(f\"All other {N-1} requests still share page {old_pid} unaffected\")\n<\/code><\/pre>\n<\/div>\n<\/div>\n<div class=\"wp-block-image\">\n<figure class=\"aligncenter size-full is-resized\"><img loading=\"lazy\" decoding=\"async\" width=\"1836\" height=\"512\" data-attachment-id=\"78589\" data-permalink=\"https:\/\/www.marktechpost.com\/2026\/03\/24\/paged-attention-in-large-language-models-llms\/screenshot-2026-03-24-at-2-39-04-pm-2\/\" 
data-orig-file=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2026\/03\/Screenshot-2026-03-24-at-2.39.04-PM-1.png\" data-orig-size=\"1836,512\" data-comments-opened=\"1\" data-image-meta='{\"aperture\":\"0\",\"credit\":\"\",\"camera\":\"\",\"caption\":\"\",\"created_timestamp\":\"0\",\"copyright\":\"\",\"focal_length\":\"0\",\"iso\":\"0\",\"shutter_speed\":\"0\",\"title\":\"\",\"orientation\":\"0\"}' data-image-title=\"Screenshot 2026-03-24 at 2.39.04\u202fPM\" data-image-description=\"\" data-image-caption=\"\" data-medium-file=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2026\/03\/Screenshot-2026-03-24-at-2.39.04-PM-1-300x84.png\" data-large-file=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2026\/03\/Screenshot-2026-03-24-at-2.39.04-PM-1-1024x286.png\" src=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2026\/03\/Screenshot-2026-03-24-at-2.39.04-PM-1.png\" alt=\"\" class=\"wp-image-78589\" \/><\/figure>\n<\/div>\n<h1 class=\"wp-block-heading\">Utilisation: Naive vs Paged<\/h1>\n<p>Two functions are defined to measure utilisation under each approach across different batch sizes.<\/p>\n<p>naive_utilisation draws token counts from a normal distribution with avg=500 and std=200, clipped to [200, 2048]. This reflects a realistic production distribution \u2014 most responses fall between 200 and 800 tokens, with occasional long ones. For each request, the full 2048-slot block is pre-allocated regardless. Utilisation is then actual_tokens_sum \/ (2048 \u00d7 n) \u2014 the ratio of what was written to what was reserved.<\/p>\n<p>paged_utilisation takes the same actual token counts but computes how many pages each request would need \u2014 ceil(tokens \/ 16). The only waste is the unfilled tail of each request\u2019s last page, which averages 8 tokens. Utilisation is actual_tokens_sum \/ (pages_allocated \u00d7 16).<\/p>\n<p>The results are run across batch sizes of 10, 25, 50, 100, and 200. 
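<\/p>\n<p>For a single request at exactly the 500-token average, those two formulas give the gap directly; a worked one-request example under the same assumptions:<\/p>\n

```python
import math

MAX_SEQ_LEN, PAGE_SIZE, tokens = 2048, 16, 500

naive = tokens / MAX_SEQ_LEN * 100        # written vs reserved, full 2048-slot block
pages = math.ceil(tokens / PAGE_SIZE)     # 32 pages claimed on demand
paged = tokens / (pages * PAGE_SIZE) * 100  # only the last-page tail is wasted

print(f"naive {naive:.1f}%  vs  paged {paged:.1f}%")   # naive 24.4%  vs  paged 97.7%
```

\n<p>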
Naive utilisation hovers around 24% across all batch sizes \u2014 with some variance at smaller batches due to sampling noise \u2014 which is exactly avg \/ max_seq = 500 \/ 2048. It does not improve with scale because the waste is structural, not statistical.<\/p>\n<p>Paged utilisation sits flat at ~98.5% regardless of batch size, because the waste per request is bounded by a single partial page and does not scale with max_seq_len at all. The gap between the two numbers \u2014 roughly 74 percentage points \u2014 is directly what enables vLLM to fit 2\u20134\u00d7 more concurrent requests into the same GPU memory.<\/p>\n<div class=\"wp-block-image\">\n<figure class=\"aligncenter size-full is-resized\"><img loading=\"lazy\" decoding=\"async\" width=\"1856\" height=\"1092\" data-attachment-id=\"78591\" data-permalink=\"https:\/\/www.marktechpost.com\/2026\/03\/24\/paged-attention-in-large-language-models-llms\/screenshot-2026-03-24-at-2-39-39-pm-2\/\" data-orig-file=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2026\/03\/Screenshot-2026-03-24-at-2.39.39-PM-1.png\" data-orig-size=\"1856,1092\" data-comments-opened=\"1\" data-image-meta='{\"aperture\":\"0\",\"credit\":\"\",\"camera\":\"\",\"caption\":\"\",\"created_timestamp\":\"0\",\"copyright\":\"\",\"focal_length\":\"0\",\"iso\":\"0\",\"shutter_speed\":\"0\",\"title\":\"\",\"orientation\":\"0\"}' data-image-title=\"Screenshot 2026-03-24 at 2.39.39\u202fPM\" data-image-description=\"\" data-image-caption=\"\" data-medium-file=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2026\/03\/Screenshot-2026-03-24-at-2.39.39-PM-1-300x177.png\" data-large-file=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2026\/03\/Screenshot-2026-03-24-at-2.39.39-PM-1-1024x602.png\" src=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2026\/03\/Screenshot-2026-03-24-at-2.39.39-PM-1.png\" alt=\"\" class=\"wp-image-78591\" \/><\/figure>\n<\/div>\n<div class=\"dm-code-snippet dark dm-normal-version default 
no-background-mobile\">\n<div class=\"control-language\">\n<div class=\"dm-buttons\">\n<div class=\"dm-buttons-left\">\n<div class=\"dm-button-snippet red-button\"><\/div>\n<div class=\"dm-button-snippet orange-button\"><\/div>\n<div class=\"dm-button-snippet green-button\"><\/div>\n<\/div>\n<div class=\"dm-buttons-right\"><a><span class=\"dm-copy-text\">Copy Code<\/span><span class=\"dm-copy-confirmed\">Copied<\/span><span class=\"dm-error-message\">Use a different Browser<\/span><\/a><\/div>\n<\/div>\n<pre class=\" no-line-numbers\"><code class=\" no-wrap language-php\">print(\"\\n\" + \"=\" * 60)\nprint(\"SECTION 4 -- Utilisation: Naive vs Paged\")\nprint(\"=\" * 60)\n \ndef naive_utilisation(n, max_seq=2048, avg=500, std=200):\n    actual = np.clip(np.random.normal(avg, std, n).astype(int), 200, max_seq)\n    return actual.sum() \/ (max_seq * n) * 100, actual\n \ndef paged_utilisation(actual_tokens, page_size=PAGE_SIZE):\n    pages = np.ceil(actual_tokens \/ page_size).astype(int)\n    return actual_tokens.sum() \/ (pages * page_size).sum() * 100\n \nbatch_sizes = [10, 25, 50, 100, 200]\nnaive_u, paged_u = [], []\n \nprint(f\"\\n  {'Batch':&gt;6}   {'Naive':&gt;8}   {'Paged':&gt;8}\")\nfor bs in batch_sizes:\n    nu, actual = naive_utilisation(bs)\n    pu = paged_utilisation(actual)\n    naive_u.append(nu)\n    paged_u.append(pu)\n    print(f\"  {bs:&gt;6}   {nu:&gt;7.1f}%   {pu:&gt;7.1f}%\")<\/code><\/pre>\n<\/div>\n<\/div>\n<div class=\"wp-block-image\">\n<figure class=\"aligncenter size-full is-resized\"><img loading=\"lazy\" decoding=\"async\" width=\"1636\" height=\"1052\" data-attachment-id=\"78592\" data-permalink=\"https:\/\/www.marktechpost.com\/2026\/03\/24\/paged-attention-in-large-language-models-llms\/screenshot-2026-03-24-at-2-40-26-pm\/\" data-orig-file=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2026\/03\/Screenshot-2026-03-24-at-2.40.26-PM.png\" data-orig-size=\"1636,1052\" data-comments-opened=\"1\" 
data-image-meta='{\"aperture\":\"0\",\"credit\":\"\",\"camera\":\"\",\"caption\":\"\",\"created_timestamp\":\"0\",\"copyright\":\"\",\"focal_length\":\"0\",\"iso\":\"0\",\"shutter_speed\":\"0\",\"title\":\"\",\"orientation\":\"0\"}' data-image-title=\"Screenshot 2026-03-24 at 2.40.26\u202fPM\" data-image-description=\"\" data-image-caption=\"\" data-medium-file=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2026\/03\/Screenshot-2026-03-24-at-2.40.26-PM-300x193.png\" data-large-file=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2026\/03\/Screenshot-2026-03-24-at-2.40.26-PM-1024x658.png\" src=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2026\/03\/Screenshot-2026-03-24-at-2.40.26-PM.png\" alt=\"\" class=\"wp-image-78592\" \/><\/figure>\n<\/div>\n<hr class=\"wp-block-separator has-alpha-channel-opacity\" \/>\n<p>Check out\u00a0the\u00a0<a href=\"https:\/\/github.com\/Marktechpost\/AI-Tutorial-Codes-Included\/blob\/main\/Data%20Science\/Paged_Attention.ipynb\" target=\"_blank\" rel=\"noreferrer noopener\"><strong>Full Notebook here<\/strong><\/a><strong>.<\/strong><\/p>\n<p>The post <a href=\"https:\/\/www.marktechpost.com\/2026\/03\/24\/paged-attention-in-large-language-models-llms\/\">Paged Attention in Large Language Models LLMs<\/a> appeared first on <a href=\"https:\/\/www.marktechpost.com\/\">MarkTechPost<\/a>.<\/p>","protected":false},"excerpt":{"rendered":"<p>When running LLMs at scale, th&hellip;<\/p>\n","protected":false},"author":1,"featured_media":607,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[1],"tags":[],"class_list":["post-606","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-uncategorized"],"_links":{"self":[{"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=\/wp\/v2\/posts\/606","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=606"}],"version-history":[{"count":0,"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=\/wp\/v2\/posts\/606\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=\/wp\/v2\/media\/607"}],"wp:attachment":[{"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=606"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=606"},{"taxonomy":"post_tag","embeddable":tru
e,"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=606"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}