{"id":172,"date":"2025-12-21T17:23:44","date_gmt":"2025-12-21T09:23:44","guid":{"rendered":"https:\/\/connectword.dpdns.org\/?p=172"},"modified":"2025-12-21T17:23:44","modified_gmt":"2025-12-21T09:23:44","slug":"ai-interview-series-4-explain-kv-caching","status":"publish","type":"post","link":"https:\/\/connectword.dpdns.org\/?p=172","title":{"rendered":"AI Interview Series #4: Explain KV Caching"},"content":{"rendered":"<h3 class=\"wp-block-heading\"><strong>Question:<\/strong><\/h3>\n<p><em>You\u2019re deploying an LLM in production. Generating the first few tokens is fast, but as the sequence grows, each additional token takes progressively longer to generate\u2014even though the model architecture and hardware remain the same.<\/em><\/p>\n<p><em>If compute isn\u2019t the primary bottleneck, what inefficiency is causing this slowdown, and how would you redesign the inference process to make token generation significantly faster?<\/em><\/p>\n<h3 class=\"wp-block-heading\"><strong>What is KV Caching and how does it make token generation faster?<\/strong><\/h3>\n<p>KV caching is an optimization technique used during text generation in large language models to avoid redundant computation. In autoregressive generation, the model produces text one token at a time, and at each step it normally recomputes attention over all previous tokens. However, the keys (K) and values (V) computed for earlier tokens never change.<\/p>\n<p>With KV caching, the model stores these keys and values the first time they are computed. When generating the next token, it reuses the cached K and V instead of recomputing them from scratch, and only computes the query (Q), key, and value for the new token. Attention is then calculated using the cached information plus the new token.<\/p>\n<p>This reuse of past computations significantly reduces redundant work, making inference faster and more efficient\u2014especially for long sequences\u2014at the cost of additional memory to store the cache. 
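The mechanics described above can be sketched with a toy single-head attention step in NumPy. This is a minimal illustration, not the model's actual implementation: the projection matrices, dimensions, and helper names here are invented for the example. It shows that appending each new token's key and value to a cache produces exactly the same attention output as recomputing K and V for the whole prefix at every step:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8                                    # toy head dimension
Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))

def attend(q, K, V):
    # Causal attention for a single query against all keys/values so far.
    scores = q @ K.T / np.sqrt(d)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ V

tokens = rng.standard_normal((5, d))     # 5 toy token embeddings

# Incremental decoding WITH a KV cache: each step projects only the
# new token and appends its key/value row to the cache.
K_cache = np.empty((0, d))
V_cache = np.empty((0, d))
cached_out = []
for x in tokens:
    K_cache = np.vstack([K_cache, x @ Wk])
    V_cache = np.vstack([V_cache, x @ Wv])
    cached_out.append(attend(x @ Wq, K_cache, V_cache))

# Reference WITHOUT a cache: re-project K and V for the whole prefix
# at every step, then attend with the latest token's query.
full_out = []
for t in range(1, len(tokens) + 1):
    prefix = tokens[:t]
    K, V = prefix @ Wk, prefix @ Wv
    full_out.append(attend(prefix[-1] @ Wq, K, V))

# Identical outputs -- the cache removes work without changing results.
assert np.allclose(cached_out, full_out)
```

The assertion at the end is the key point: caching is a pure optimization, so the generated tokens are unchanged; only the amount of recomputation differs.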
Check out the <strong><a href=\"https:\/\/github.com\/Marktechpost\/AI-Tutorial-Codes-Included\/blob\/main\/Data%20Science\/KV_Caching.ipynb\" target=\"_blank\" rel=\"noreferrer noopener\">Practice Notebook here<\/a><\/strong><\/p>\n<div class=\"wp-block-image\">\n<figure class=\"aligncenter size-full is-resized\"><img loading=\"lazy\" decoding=\"async\" width=\"786\" height=\"470\" data-attachment-id=\"77004\" data-permalink=\"https:\/\/www.marktechpost.com\/2025\/12\/21\/ai-interview-series-4-explain-kv-caching\/image-262\/\" data-orig-file=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2025\/12\/image-22.png\" data-orig-size=\"786,470\" data-comments-opened=\"1\" data-image-meta='{\"aperture\":\"0\",\"credit\":\"\",\"camera\":\"\",\"caption\":\"\",\"created_timestamp\":\"0\",\"copyright\":\"\",\"focal_length\":\"0\",\"iso\":\"0\",\"shutter_speed\":\"0\",\"title\":\"\",\"orientation\":\"0\"}' data-image-title=\"image\" data-image-description=\"\" data-image-caption=\"\" data-medium-file=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2025\/12\/image-22-300x179.png\" data-large-file=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2025\/12\/image-22.png\" src=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2025\/12\/image-22.png\" alt=\"\" class=\"wp-image-77004\" \/><\/figure>\n<\/div>\n<h3 class=\"wp-block-heading\"><strong>Evaluating the Impact of KV Caching on Inference Speed<\/strong><\/h3>\n<p>In this code, we benchmark the impact of KV caching during autoregressive text generation. We run the same prompt through the model multiple times, once with KV caching enabled and once without it, and measure the average generation time. By keeping the model, prompt, and generation length constant, this experiment isolates how reusing cached keys and values significantly reduces redundant attention computation and speeds up inference. 
<\/p>\n<div class=\"dm-code-snippet dark dm-normal-version default no-background-mobile\">\n<div class=\"control-language\">\n<div class=\"dm-buttons\">\n<div class=\"dm-buttons-left\">\n<div class=\"dm-button-snippet red-button\"><\/div>\n<div class=\"dm-button-snippet orange-button\"><\/div>\n<div class=\"dm-button-snippet green-button\"><\/div>\n<\/div>\n<div class=\"dm-buttons-right\"><a><span class=\"dm-copy-text\">Copy Code<\/span><span class=\"dm-copy-confirmed\">Copied<\/span><span class=\"dm-error-message\">Use a different Browser<\/span><\/a><\/div>\n<\/div>\n<pre class=\" no-line-numbers\"><code class=\" no-wrap language-python\">import numpy as np\nimport time\nimport torch\nfrom transformers import AutoModelForCausalLM, AutoTokenizer\n\ndevice = \"cuda\" if torch.cuda.is_available() else \"cpu\"\n\nmodel_name = \"gpt2-medium\"\ntokenizer = AutoTokenizer.from_pretrained(model_name)\nmodel = AutoModelForCausalLM.from_pretrained(model_name).to(device)\n\nprompt = \"Explain KV caching in transformers.\"\n\ninputs = tokenizer(prompt, return_tensors=\"pt\").to(device)\n\nfor use_cache in (True, False):\n    times = []\n    for _ in range(5):\n        start = time.time()\n        model.generate(\n            **inputs,\n            use_cache=use_cache,\n            max_new_tokens=1000\n        )\n        times.append(time.time() - start)\n\n    print(\n        f\"{'with' if use_cache else 'without'} KV caching: \"\n        f\"{round(np.mean(times), 3)} \u00b1 {round(np.std(times), 3)} seconds\"\n    )<\/code><\/pre>\n<\/div>\n<\/div>\n<div class=\"wp-block-image\">\n<figure class=\"aligncenter size-full is-resized\"><img loading=\"lazy\" decoding=\"async\" width=\"590\" height=\"390\" data-attachment-id=\"77006\" 
data-permalink=\"https:\/\/www.marktechpost.com\/2025\/12\/21\/ai-interview-series-4-explain-kv-caching\/image-264\/\" data-orig-file=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2025\/12\/image-24.png\" data-orig-size=\"590,390\" data-comments-opened=\"1\" data-image-meta='{\"aperture\":\"0\",\"credit\":\"\",\"camera\":\"\",\"caption\":\"\",\"created_timestamp\":\"0\",\"copyright\":\"\",\"focal_length\":\"0\",\"iso\":\"0\",\"shutter_speed\":\"0\",\"title\":\"\",\"orientation\":\"0\"}' data-image-title=\"image\" data-image-description=\"\" data-image-caption=\"\" data-medium-file=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2025\/12\/image-24-300x198.png\" data-large-file=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2025\/12\/image-24.png\" src=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2025\/12\/image-24.png\" alt=\"\" class=\"wp-image-77006\" \/><\/figure>\n<\/div>\n<p>The results clearly demonstrate the impact of KV caching on inference speed. With KV caching enabled, generating 1000 tokens takes around 21.7 seconds, whereas disabling KV caching increases the generation time to over 107 seconds\u2014nearly a 5\u00d7 slowdown. This sharp difference occurs because, without KV caching, the model recomputes attention over all previously generated tokens at every step, leading to quadratic growth in computation.<\/p>\n<p>With KV caching, past keys and values are reused, eliminating redundant work and keeping generation time nearly linear as the sequence grows. 
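The quadratic-versus-linear difference can be made concrete with a quick back-of-the-envelope count. This sketch counts only the key/value projections (attention-score work against the cached keys still grows with context length either way, so the real speedup is smaller than this ratio, as the ~5x benchmark result above reflects):

```python
# Count how many tokens get their K and V projected over a full
# generation of N new tokens.
N = 1000

# Without a cache: at step t, K and V are recomputed for all t tokens
# in the prefix, so the total is 1 + 2 + ... + N = N * (N + 1) / 2.
no_cache_projections = sum(range(1, N + 1))

# With a cache: only the single new token is projected at each step.
with_cache_projections = N

print(no_cache_projections)                          # 500500 (quadratic in N)
print(with_cache_projections)                        # 1000   (linear in N)
print(no_cache_projections / with_cache_projections) # 500.5x fewer projections
```

The quadratic term is why the slowdown worsens as the sequence grows: early tokens are cheap either way, but late decoding steps without a cache redo nearly all of the prefix work.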
This experiment highlights why KV caching is essential for efficient, real-world deployment of autoregressive language models.<\/p>\n<hr class=\"wp-block-separator has-alpha-channel-opacity\" \/>\n<figure class=\"wp-block-embed is-type-wp-embed is-provider-marktechpost wp-block-embed-marktechpost\">\n<div class=\"wp-block-embed__wrapper\">\n<blockquote class=\"wp-embedded-content\" data-secret=\"A54fAErn9K\"><p><a href=\"https:\/\/www.marktechpost.com\/2025\/11\/23\/ai-interview-series-3-explain-federated-learning\/\">AI Interview Series #3: Explain Federated Learning<\/a><\/p><\/blockquote>\n<\/div>\n<\/figure>\n<p>The post <a href=\"https:\/\/www.marktechpost.com\/2025\/12\/21\/ai-interview-series-4-explain-kv-caching\/\">AI Interview Series #4: Explain KV Caching<\/a> appeared first on <a href=\"https:\/\/www.marktechpost.com\/\">MarkTechPost<\/a>.<\/p>","protected":false},"excerpt":{"rendered":"<p>Question: You\u2019re deploying an 
&hellip;<\/p>\n","protected":false},"author":1,"featured_media":173,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[1],"tags":[],"class_list":["post-172","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-uncategorized"],"_links":{"self":[{"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=\/wp\/v2\/posts\/172","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=172"}],"version-history":[{"count":0,"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=\/wp\/v2\/posts\/172\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=\/wp\/v2\/media\/173"}],"wp:attachment":[{"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=172"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=172"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=172"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}