{"id":927,"date":"2026-05-18T02:19:09","date_gmt":"2026-05-17T18:19:09","guid":{"rendered":"https:\/\/connectword.dpdns.org\/?p=927"},"modified":"2026-05-18T02:19:09","modified_gmt":"2026-05-17T18:19:09","slug":"a-coding-implementation-to-compress-and-benchmark-instruction-tuned-llms-with-fp8-gptq-and-smoothquant-quantization-using-llmcompressor","status":"publish","type":"post","link":"https:\/\/connectword.dpdns.org\/?p=927","title":{"rendered":"A Coding Implementation to Compress and Benchmark Instruction-Tuned LLMs with FP8, GPTQ, and SmoothQuant Quantization using llmcompressor"},"content":{"rendered":"<p>In this tutorial, we explore how to apply post-training quantization to an instruction-tuned language model using <a href=\"https:\/\/github.com\/vllm-project\/llm-compressor\"><strong>llmcompressor<\/strong><\/a>. We start with an FP16 baseline and then compare multiple compression strategies, including FP8 dynamic quantization, GPTQ W4A16, and SmoothQuant with GPTQ W8A8. Along the way, we benchmark each model variant for disk size, generation latency, throughput, perplexity, and output quality. We also prepare a reusable calibration dataset, save compressed model artifacts, and inspect how each recipe changes practical inference behavior. By the end, we get a practical understanding of how different quantization methods affect model efficiency, deployment readiness, and performance trade-offs. 
[<strong><a href=\"https:\/\/github.com\/Marktechpost\/AI-Agents-Projects-Tutorials\/blob\/main\/LLM%20Projects\/llm_compressor_quantization_marktechpost.py\" target=\"_blank\" rel=\"noreferrer noopener\">Codes with Notebook<\/a><\/strong>]<\/p>\n<div class=\"dm-code-snippet dark dm-normal-version default no-background-mobile\">\n<div class=\"control-language\">\n<div class=\"dm-buttons\">\n<div class=\"dm-buttons-left\">\n<div class=\"dm-button-snippet red-button\"><\/div>\n<div class=\"dm-button-snippet orange-button\"><\/div>\n<div class=\"dm-button-snippet green-button\"><\/div>\n<\/div>\n<div class=\"dm-buttons-right\"><a><span class=\"dm-copy-text\">Copy Code<\/span><span class=\"dm-copy-confirmed\">Copied<\/span><span class=\"dm-error-message\">Use a different Browser<\/span><\/a><\/div>\n<\/div>\n<pre class=\" no-line-numbers\"><code class=\" no-wrap language-php\">import subprocess, sys\ndef pip(*pkgs):\n   subprocess.check_call([sys.executable, \"-m\", \"pip\", \"install\", \"-q\", *pkgs])\npip(\"llmcompressor\", \"compressed-tensors\",\n   \"transformers&gt;=4.45\", \"accelerate\", \"datasets\")\nimport os, gc, time, json, math\nfrom pathlib import Path\nimport torch\nfrom transformers import AutoModelForCausalLM, AutoTokenizer\nfrom datasets import load_dataset\nassert torch.cuda.is_available(), (\n   \"Enable a GPU: Runtime &gt; Change runtime type &gt; T4 GPU\")\nprint(\"GPU:\", torch.cuda.get_device_name(0),\n     \"| CUDA:\", torch.version.cuda,\n     \"| torch:\", torch.__version__)\nMODEL_ID = \"Qwen\/Qwen2.5-0.5B-Instruct\"\nWORKDIR = Path(\"\/content\/quant_lab\"); WORKDIR.mkdir(exist_ok=True)\nos.chdir(WORKDIR)\ndef free_mem():\n   gc.collect(); torch.cuda.empty_cache()\ndef dir_size_gb(path):\n   total = 0\n   for root, _, files in os.walk(path):\n       for f in files:\n           total += os.path.getsize(os.path.join(root, f))\n   return total \/ 1e9\ndef time_generation(model, tok, prompt, max_new_tokens=64):\n   \"\"\"Greedy decode; reports 
latency &amp; tokens\/sec after a brief warmup.\"\"\"\n   inputs = tok(prompt, return_tensors=\"pt\").to(model.device)\n   _ = model.generate(**inputs, max_new_tokens=4, do_sample=False)\n   torch.cuda.synchronize()\n   t0 = time.time()\n   out = model.generate(**inputs, max_new_tokens=max_new_tokens,\n                        do_sample=False, pad_token_id=tok.eos_token_id)\n   torch.cuda.synchronize()\n   dt = time.time() - t0\n   new_ids = out[0][inputs[\"input_ids\"].shape[1]:]\n   # Count the tokens actually generated; decoding can stop early at EOS.\n   return tok.decode(new_ids, skip_special_tokens=True), dt, len(new_ids)\/dt\n@torch.no_grad()\ndef wikitext_ppl(model, tok, seq_len=512, max_chunks=20, stride=512):\n   \"\"\"Light WikiText-2 perplexity probe (fast, indicative).\"\"\"\n   ds = load_dataset(\"wikitext\", \"wikitext-2-raw-v1\", split=\"test\")\n   text = \"\\n\\n\".join(t for t in ds[\"text\"][:400] if t.strip())\n   enc = tok(text, return_tensors=\"pt\").input_ids.to(model.device)\n   nll_sum, tok_count = 0.0, 0\n   for begin in range(0, enc.size(1) - seq_len, stride):\n       chunk = enc[:, begin:begin+seq_len]\n       out = model(chunk, labels=chunk)\n       nll_sum += out.loss.float().item() * seq_len\n       tok_count += seq_len\n       if tok_count \/\/ seq_len &gt;= max_chunks: break\n   return math.exp(nll_sum \/ tok_count)\nresults = {}\nPROMPT = (\"&lt;|im_start|&gt;user\\nIn two sentences, explain why post-training \"\n         \"quantization works for large language models.&lt;|im_end|&gt;\\n\"\n         \"&lt;|im_start|&gt;assistant\\n\")\ndef benchmark(label, model_path_or_id):\n   free_mem()\n   print(f\"\\n\u2500\u2500\u2500\u2500 benchmarking: {label} \u2500\u2500\u2500\u2500\")\n   tok = AutoTokenizer.from_pretrained(model_path_or_id)\n   m = AutoModelForCausalLM.from_pretrained(\n           model_path_or_id, torch_dtype=\"auto\", device_map=\"cuda\").eval()\n   sample, dt, tps = time_generation(m, tok, PROMPT)\n   ppl = wikitext_ppl(m, tok)\n   size = dir_size_gb(model_path_or_id) if 
os.path.isdir(str(model_path_or_id)) else None\n   results[label] = {\"size_gb\": size, \"ppl\": round(ppl, 3),\n                     \"latency_s\": round(dt, 3), \"tok_per_s\": round(tps, 1),\n                     \"sample\": sample.strip().replace(\"\\n\", \" \")[:180]}\n   print(json.dumps(results[label], indent=2))\n   del m; free_mem()<\/code><\/pre>\n<\/div>\n<\/div>\n<p>We install all required libraries, import the core packages, and verify that a CUDA-enabled GPU is available in Colab. We define the base Qwen2.5 instruction model, create a working directory, and prepare helper functions for memory cleanup, model size calculation, generation timing, and perplexity evaluation. We also create a reusable benchmark function that loads any model variant, tests its generation speed, calculates perplexity, and stores the results for final comparison.<\/p>\n<div class=\"dm-code-snippet dark dm-normal-version default no-background-mobile\">\n<div class=\"control-language\">\n<div class=\"dm-buttons\">\n<div class=\"dm-buttons-left\">\n<div class=\"dm-button-snippet red-button\"><\/div>\n<div class=\"dm-button-snippet orange-button\"><\/div>\n<div class=\"dm-button-snippet green-button\"><\/div>\n<\/div>\n<div class=\"dm-buttons-right\"><a><span class=\"dm-copy-text\">Copy Code<\/span><span class=\"dm-copy-confirmed\">Copied<\/span><span class=\"dm-error-message\">Use a different Browser<\/span><\/a><\/div>\n<\/div>\n<pre class=\" no-line-numbers\"><code class=\" no-wrap language-php\">print(\"\\n\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550 Baseline (FP16) \u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\")\nbenchmark(\"00_fp16_baseline\", MODEL_ID)\nfrom llmcompressor import oneshot\nfrom llmcompressor.modifiers.quantization import QuantizationModifier\nprint(\"\\n\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550 Recipe 1: FP8_DYNAMIC 
\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\")\nmodel = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype=\"auto\")\ntok = AutoTokenizer.from_pretrained(MODEL_ID)\nrecipe_fp8 = QuantizationModifier(\n   targets=\"Linear\",\n   scheme=\"FP8_DYNAMIC\",\n   ignore=[\"lm_head\"],\n)\noneshot(model=model, recipe=recipe_fp8)\nFP8_DIR = \"Qwen2.5-0.5B-FP8-Dynamic\"\nmodel.save_pretrained(FP8_DIR, save_compressed=True)\ntok.save_pretrained(FP8_DIR)\ndel model; free_mem()\nbenchmark(\"01_fp8_dynamic\", FP8_DIR)<\/code><\/pre>\n<\/div>\n<\/div>\n<p>We first benchmark the original FP16 model to establish a reliable baseline for subsequent comparisons. We then apply FP8 dynamic quantization using llmcompressor, where linear layers are compressed while the language modeling head remains in higher precision. We save the compressed FP8 model and run the same benchmark again to compare its size, latency, throughput, and perplexity against the baseline.<\/p>\n<div class=\"dm-code-snippet dark dm-normal-version default no-background-mobile\">\n<div class=\"control-language\">\n<div class=\"dm-buttons\">\n<div class=\"dm-buttons-left\">\n<div class=\"dm-button-snippet red-button\"><\/div>\n<div class=\"dm-button-snippet orange-button\"><\/div>\n<div class=\"dm-button-snippet green-button\"><\/div>\n<\/div>\n<div class=\"dm-buttons-right\"><a><span class=\"dm-copy-text\">Copy Code<\/span><span class=\"dm-copy-confirmed\">Copied<\/span><span class=\"dm-error-message\">Use a different Browser<\/span><\/a><\/div>\n<\/div>\n<pre class=\" no-line-numbers\"><code class=\" no-wrap language-php\">NUM_CALIB_SAMPLES = 256\nMAX_SEQ_LEN       = 1024\ntok = AutoTokenizer.from_pretrained(MODEL_ID)\nraw = load_dataset(\"HuggingFaceH4\/ultrachat_200k\",\n                  split=f\"train_sft[:{NUM_CALIB_SAMPLES}]\")\ndef to_text(ex):\n   return {\"text\": tok.apply_chat_template(ex[\"messages\"], tokenize=False)}\ndef tokenize(ex):\n   return tok(ex[\"text\"], 
padding=False, truncation=True,\n              max_length=MAX_SEQ_LEN, add_special_tokens=False)\ncalib_ds = (raw.shuffle(seed=42)\n              .map(to_text)\n              .map(tokenize, remove_columns=raw.column_names))\nprint(\"Calibration set:\", len(calib_ds), \"samples, max_seq_len =\", MAX_SEQ_LEN)<\/code><\/pre>\n<\/div>\n<\/div>\n<p>We build a small calibration dataset using UltraChat samples so that the calibrated quantization recipes can observe realistic instruction-style inputs. We convert each chat example into model-compatible text through the tokenizer\u2019s chat template. We then tokenize the samples with a fixed maximum sequence length, creating a reusable dataset for GPTQ and SmoothQuant-based compression.<\/p>\n<div class=\"dm-code-snippet dark dm-normal-version default no-background-mobile\">\n<div class=\"control-language\">\n<div class=\"dm-buttons\">\n<div class=\"dm-buttons-left\">\n<div class=\"dm-button-snippet red-button\"><\/div>\n<div class=\"dm-button-snippet orange-button\"><\/div>\n<div class=\"dm-button-snippet green-button\"><\/div>\n<\/div>\n<div class=\"dm-buttons-right\"><a><span class=\"dm-copy-text\">Copy Code<\/span><span class=\"dm-copy-confirmed\">Copied<\/span><span class=\"dm-error-message\">Use a different Browser<\/span><\/a><\/div>\n<\/div>\n<pre class=\" no-line-numbers\"><code class=\" no-wrap language-php\">from llmcompressor.modifiers.quantization import GPTQModifier\nprint(\"\\n\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550 Recipe 2: GPTQ W4A16 \u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\")\nmodel = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype=\"auto\")\nrecipe_w4a16 = GPTQModifier(\n   targets=\"Linear\",\n   scheme=\"W4A16\",\n   ignore=[\"lm_head\"],\n   dampening_frac=0.01,\n)\noneshot(\n   model=model,\n   dataset=calib_ds,\n   recipe=recipe_w4a16,\n   max_seq_length=MAX_SEQ_LEN,\n   num_calibration_samples=NUM_CALIB_SAMPLES,\n)\nW4A16_DIR 
= \"Qwen2.5-0.5B-W4A16-G128\"\nmodel.save_pretrained(W4A16_DIR, save_compressed=True)\ntok.save_pretrained(W4A16_DIR)\ndel model; free_mem()\nbenchmark(\"02_gptq_w4a16\", W4A16_DIR)<\/code><\/pre>\n<\/div>\n<\/div>\n<p>We apply GPTQ W4A16 quantization to compress the model\u2019s linear weights into 4-bit precision while keeping activations in higher precision. We use the calibration dataset to enable GPTQ to reduce reconstruction error and preserve model quality during compression. We save the W4A16 compressed model and benchmark it to study how aggressive 4-bit weight compression affects speed, size, and perplexity.<\/p>\n<div class=\"dm-code-snippet dark dm-normal-version default no-background-mobile\">\n<div class=\"control-language\">\n<div class=\"dm-buttons\">\n<div class=\"dm-buttons-left\">\n<div class=\"dm-button-snippet red-button\"><\/div>\n<div class=\"dm-button-snippet orange-button\"><\/div>\n<div class=\"dm-button-snippet green-button\"><\/div>\n<\/div>\n<div class=\"dm-buttons-right\"><a><span class=\"dm-copy-text\">Copy Code<\/span><span class=\"dm-copy-confirmed\">Copied<\/span><span class=\"dm-error-message\">Use a different Browser<\/span><\/a><\/div>\n<\/div>\n<pre class=\" no-line-numbers\"><code class=\" no-wrap language-php\">from llmcompressor.modifiers.smoothquant import SmoothQuantModifier\nprint(\"\\n\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550 Recipe 3: SmoothQuant + GPTQ W8A8 \u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\")\nmodel = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype=\"auto\")\nrecipe_w8a8 = [\n   SmoothQuantModifier(smoothing_strength=0.8),\n   GPTQModifier(targets=\"Linear\", scheme=\"W8A8\", ignore=[\"lm_head\"]),\n]\noneshot(\n   model=model,\n   dataset=calib_ds,\n   recipe=recipe_w8a8,\n   max_seq_length=MAX_SEQ_LEN,\n   num_calibration_samples=NUM_CALIB_SAMPLES,\n)\nW8A8_DIR = \"Qwen2.5-0.5B-W8A8-SmoothQuant\"\nmodel.save_pretrained(W8A8_DIR, 
save_compressed=True)\ntok.save_pretrained(W8A8_DIR)\ndel model; free_mem()\nbenchmark(\"03_smoothquant_w8a8\", W8A8_DIR)\nprint(\"\\n\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550 FINAL SUMMARY \u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\")\nprint(f\"{'Variant':&lt;26}{'Size GB':&gt;9}{'PPL':&gt;10}{'tok\/s':&gt;9}{'Latency':&gt;11}\")\nprint(\"-\" * 65)\nfor k, v in results.items():\n   size = f\"{v['size_gb']:.3f}\" if v['size_gb'] else \"  (hub) \"\n   print(f\"{k:&lt;26}{size:&gt;9}{v['ppl']:&gt;10.2f}{v['tok_per_s']:&gt;9.1f}\"\n         f\"{v['latency_s']:&gt;10.2f}s\")\nprint(\"\\nSample completions (greedy, 64 new tokens):\")\nfor k, v in results.items():\n   print(f\"\\n[{k}]\\n  \u2192 {v['sample']}\")<\/code><\/pre>\n<\/div>\n<\/div>\n<p>We combine SmoothQuant with GPTQ W8A8 to create a quantization pipeline that handles activation outliers before applying 8-bit compression. We save and benchmark this SmoothQuant-based model using the same evaluation setup as the earlier variants. We also print a summary table and sample completions to compare all quantized models against the FP16 baseline in one place.<\/p>\n<p>In conclusion, we built a complete quantization workflow that compresses and evaluates a small instruction-tuned LLM using modern PTQ techniques. We saw that FP8 dynamic quantization offers a fast, data-free option, while GPTQ-based methods use calibration data to achieve stronger compression and improved accuracy recovery. We also compared all variants through consistent benchmarks, which helps us understand the trade-offs between size, throughput, latency, and perplexity. By saving each quantized model and testing generation quality, we made the workflow closer to a real deployment pipeline. 
This gives us a reusable Colab-ready framework for testing LLM compression methods before deploying efficient models in real-world inference systems.<\/p>\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\" \/>\n<p>The post <a href=\"https:\/\/www.marktechpost.com\/2026\/05\/17\/a-coding-implementation-to-compress-and-benchmark-instruction-tuned-llms-with-fp8-gptq-and-smoothquant-quantization-using-llmcompressor\/\">A Coding Implementation to Compress and Benchmark Instruction-Tuned LLMs with FP8, GPTQ, and SmoothQuant Quantization using llmcompressor<\/a> appeared first on <a href=\"https:\/\/www.marktechpost.com\/\">MarkTechPost<\/a>.<\/p>","protected":false},"excerpt":{"rendered":"<p>In this tutorial, we explore h&hellip;<\/p>\n","protected":false},"author":1,"featured_media":29,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[1],"tags":[],"class_list":["post-927","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-uncategorized"],"_links":{"self":[{"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=\/wp\/v2\/posts\/927","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=927"}],"version-history":[{"count":0,"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=\/wp
\/v2\/posts\/927\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=\/wp\/v2\/media\/29"}],"wp:attachment":[{"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=927"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=927"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=927"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}