{"id":836,"date":"2026-05-02T04:52:08","date_gmt":"2026-05-01T20:52:08","guid":{"rendered":"https:\/\/connectword.dpdns.org\/?p=836"},"modified":"2026-05-02T04:52:08","modified_gmt":"2026-05-01T20:52:08","slug":"a-coding-guide-on-llm-post-training-with-trl-from-supervised-fine-tuning-to-dpo-and-grpo-reasoning","status":"publish","type":"post","link":"https:\/\/connectword.dpdns.org\/?p=836","title":{"rendered":"A Coding Guide on LLM Post Training with TRL from Supervised Fine Tuning to DPO and GRPO Reasoning"},"content":{"rendered":"<p>In this tutorial, we walk through a complete, hands-on workflow for post-training large language models with the <a href=\"https:\/\/github.com\/huggingface\/trl\"><strong>TRL (Transformer Reinforcement Learning) library<\/strong><\/a>. We start from a lightweight model (Qwen2.5-0.5B-Instruct) and apply four techniques in turn: Supervised Fine-Tuning (SFT), Reward Modeling (RM), Direct Preference Optimization (DPO), and Group Relative Policy Optimization (GRPO). We also use LoRA adapters to keep training feasible on limited hardware, such as Google Colab\u2019s T4 GPU. 
As we move step by step, we build intuition for how modern alignment pipelines work, from teaching models how to respond to shaping their behavior using preferences and verifiable rewards.<\/p>\n<div class=\"dm-code-snippet dark dm-normal-version default no-background-mobile\">\n<div class=\"control-language\">\n<div class=\"dm-buttons\">\n<div class=\"dm-buttons-left\">\n<div class=\"dm-button-snippet red-button\"><\/div>\n<div class=\"dm-button-snippet orange-button\"><\/div>\n<div class=\"dm-button-snippet green-button\"><\/div>\n<\/div>\n<div class=\"dm-buttons-right\"><a><span class=\"dm-copy-text\">Copy Code<\/span><span class=\"dm-copy-confirmed\">Copied<\/span><span class=\"dm-error-message\">Use a different Browser<\/span><\/a><\/div>\n<\/div>\n<pre class=\" no-line-numbers\"><code class=\" no-wrap language-php\">import subprocess, sys\nsubprocess.check_call([sys.executable, \"-m\", \"pip\", \"install\", \"-q\", \"-U\",\n   \"torchao&gt;=0.16\",\n   \"trl&gt;=0.20\",\n   \"transformers&gt;=4.45\",\n   \"datasets\",\n   \"peft&gt;=0.13\",\n   \"accelerate\",\n   \"bitsandbytes\",\n])\n\n\nimport sys as _sys\nfor _m in [m for m in list(_sys.modules) if m.startswith((\"torchao\", \"peft\"))]:\n   _sys.modules.pop(_m, None)\ntry:\n   import torchao\nexcept Exception:\n   import types\n   _fake = types.ModuleType(\"torchao\")\n   _fake.__version__ = \"0.16.1\"\n   _sys.modules[\"torchao\"] = _fake\n\n\nimport os, re, gc, torch, warnings\nwarnings.filterwarnings(\"ignore\")\nos.environ[\"TOKENIZERS_PARALLELISM\"] = \"false\"\nos.environ[\"WANDB_DISABLED\"] = \"true\"\nos.environ[\"HF_HUB_DISABLE_PROGRESS_BARS\"] = \"1\"\n\n\nfrom datasets import load_dataset, Dataset\nfrom transformers import AutoTokenizer, AutoModelForCausalLM\nfrom peft import LoraConfig\n\n\nprint(f\"torch={torch.__version__}  cuda={torch.cuda.is_available()}\")\nif torch.cuda.is_available():\n   print(f\"GPU: {torch.cuda.get_device_name(0)}  \"\n         
f\"({torch.cuda.get_device_properties(0).total_memory\/1e9:.1f} GB)\")\n\n\nMODEL_NAME = \"Qwen\/Qwen2.5-0.5B-Instruct\"\nDEVICE     = \"cuda\" if torch.cuda.is_available() else \"cpu\"\nBF16_OK    = torch.cuda.is_available() and torch.cuda.is_bf16_supported()\n\n\nLORA_CFG = LoraConfig(\n   r=8, lora_alpha=16, lora_dropout=0.05, bias=\"none\",\n   target_modules=[\"q_proj\", \"k_proj\", \"v_proj\", \"o_proj\"],\n   task_type=\"CAUSAL_LM\",\n)\n\n\ndef cleanup():\n   \"\"\"Release VRAM between training stages (Colab T4 is tight).\"\"\"\n   gc.collect()\n   if torch.cuda.is_available():\n       torch.cuda.empty_cache()\n\n\ndef chat_generate(model, tokenizer, prompt, max_new_tokens=120):\n   \"\"\"Helper: format as chat, generate, decode just the assistant turn.\"\"\"\n   msgs = [{\"role\": \"user\", \"content\": prompt}]\n   ids = tokenizer.apply_chat_template(\n       msgs, return_tensors=\"pt\", add_generation_prompt=True\n   ).to(model.device)\n   with torch.no_grad():\n       out = model.generate(\n           ids, max_new_tokens=max_new_tokens,\n           do_sample=True, temperature=0.7, top_p=0.9,\n           pad_token_id=tokenizer.eos_token_id,\n       )\n   return tokenizer.decode(out[0][ids.shape[-1]:], skip_special_tokens=True)<\/code><\/pre>\n<\/div>\n<\/div>\n<p>We install and configure the full training stack, ensuring compatibility across libraries like TRL (Transformer Reinforcement Learning library), Transformers, and PEFT. We set up environment variables and GPU checks, and define reusable components such as LoRA configuration and helper functions. 
We also prepare utility functions for memory cleanup and chat-style generation to support all later stages.<\/p>\n<div class=\"dm-code-snippet dark dm-normal-version default no-background-mobile\">\n<div class=\"control-language\">\n<div class=\"dm-buttons\">\n<div class=\"dm-buttons-left\">\n<div class=\"dm-button-snippet red-button\"><\/div>\n<div class=\"dm-button-snippet orange-button\"><\/div>\n<div class=\"dm-button-snippet green-button\"><\/div>\n<\/div>\n<div class=\"dm-buttons-right\"><a><span class=\"dm-copy-text\">Copy Code<\/span><span class=\"dm-copy-confirmed\">Copied<\/span><span class=\"dm-error-message\">Use a different Browser<\/span><\/a><\/div>\n<\/div>\n<pre class=\" no-line-numbers\"><code class=\" no-wrap language-php\">print(\"\\n\" + \"=\"*72 + \"\\nPART 1 \u2014 Supervised Fine-Tuning (SFT)\\n\" + \"=\"*72)\n\n\nfrom trl import SFTTrainer, SFTConfig\n\n\nsft_ds = load_dataset(\"trl-lib\/Capybara\", split=\"train[:300]\")\nprint(f\"SFT dataset rows: {len(sft_ds)}\")\nprint(f\"Example messages: {sft_ds[0]['messages'][:1]}\")\n\n\nsft_args = SFTConfig(\n   output_dir=\".\/sft_out\",\n   num_train_epochs=1,\n   per_device_train_batch_size=2,\n   gradient_accumulation_steps=4,\n   learning_rate=2e-4,\n   logging_steps=10,\n   save_strategy=\"no\",\n   bf16=BF16_OK, fp16=not BF16_OK,\n   max_length=768,\n   gradient_checkpointing=True,\n   report_to=\"none\",\n)\n\n\nsft_trainer = SFTTrainer(\n   model=MODEL_NAME,\n   args=sft_args,\n   train_dataset=sft_ds,\n   peft_config=LORA_CFG,\n)\nsft_trainer.train()\n\n\nprint(\"\\n[SFT inference]\")\nprint(\"Q: Explain the bias-variance tradeoff in two sentences.\")\nprint(\"A:\", chat_generate(sft_trainer.model, sft_trainer.processing_class,\n                         \"Explain the bias-variance tradeoff in two sentences.\"))\n\n\nsft_trainer.save_model(\".\/sft_out\/final\")\ndel sft_trainer; cleanup()<\/code><\/pre>\n<\/div>\n<\/div>\n<p>We begin with supervised fine-tuning, loading a conversational dataset, 
and configuring the SFT trainer. We train the model to imitate high-quality responses, using LoRA for efficient adaptation on limited hardware. We then run a quick inference check to confirm that the model produces instruction-style responses.<\/p>\n<div class=\"dm-code-snippet dark dm-normal-version default no-background-mobile\">\n<div class=\"control-language\">\n<div class=\"dm-buttons\">\n<div class=\"dm-buttons-left\">\n<div class=\"dm-button-snippet red-button\"><\/div>\n<div class=\"dm-button-snippet orange-button\"><\/div>\n<div class=\"dm-button-snippet green-button\"><\/div>\n<\/div>\n<div class=\"dm-buttons-right\"><a><span class=\"dm-copy-text\">Copy Code<\/span><span class=\"dm-copy-confirmed\">Copied<\/span><span class=\"dm-error-message\">Use a different Browser<\/span><\/a><\/div>\n<\/div>\n<pre class=\" no-line-numbers\"><code class=\" no-wrap language-php\">print(\"\\n\" + \"=\"*72 + \"\\nPART 2 \u2014 Reward Modeling\\n\" + \"=\"*72)\n\n\nfrom trl import RewardTrainer, RewardConfig\n\n\nrm_ds = load_dataset(\"trl-lib\/ultrafeedback_binarized\", split=\"train[:300]\")\nprint(f\"RM dataset rows: {len(rm_ds)}  keys: {list(rm_ds[0].keys())}\")\n\n\nrm_args = RewardConfig(\n   output_dir=\".\/rm_out\",\n   num_train_epochs=1,\n   per_device_train_batch_size=2,\n   gradient_accumulation_steps=2,\n   learning_rate=1e-4,\n   logging_steps=10,\n   save_strategy=\"no\",\n   bf16=BF16_OK, fp16=not BF16_OK,\n   max_length=512,\n   gradient_checkpointing=True,\n   report_to=\"none\",\n)\n\n\nrm_lora = LoraConfig(\n   r=8, lora_alpha=16, lora_dropout=0.05, bias=\"none\",\n   target_modules=[\"q_proj\", \"k_proj\", \"v_proj\", \"o_proj\"],\n   task_type=\"SEQ_CLS\",\n)\n\n\nrm_trainer = RewardTrainer(\n   model=MODEL_NAME,\n   args=rm_args,\n   train_dataset=rm_ds,\n   peft_config=rm_lora,\n)\nrm_trainer.train()\ndel rm_trainer; cleanup()<\/code><\/pre>\n<\/div>\n<\/div>\n<p>We move to reward modeling, where we train a model to score responses based on 
human preference data. We configure LoRA for sequence classification (task type SEQ_CLS) and train on chosen vs. rejected pairs. This stage yields a learned reward signal of the kind that RL-based alignment methods optimize against; the DPO and GRPO stages in this tutorial do not consume it.<\/p>\n<div class=\"dm-code-snippet dark dm-normal-version default no-background-mobile\">\n<div class=\"control-language\">\n<div class=\"dm-buttons\">\n<div class=\"dm-buttons-left\">\n<div class=\"dm-button-snippet red-button\"><\/div>\n<div class=\"dm-button-snippet orange-button\"><\/div>\n<div class=\"dm-button-snippet green-button\"><\/div>\n<\/div>\n<div class=\"dm-buttons-right\"><a><span class=\"dm-copy-text\">Copy Code<\/span><span class=\"dm-copy-confirmed\">Copied<\/span><span class=\"dm-error-message\">Use a different Browser<\/span><\/a><\/div>\n<\/div>\n<pre class=\" no-line-numbers\"><code class=\" no-wrap language-php\">print(\"\\n\" + \"=\"*72 + \"\\nPART 3 \u2014 Direct Preference Optimization (DPO)\\n\" + \"=\"*72)\n\n\nfrom trl import DPOTrainer, DPOConfig\n\n\ndpo_ds = load_dataset(\"trl-lib\/ultrafeedback_binarized\", split=\"train[:300]\")\n\n\ndpo_args = DPOConfig(\n   output_dir=\".\/dpo_out\",\n   num_train_epochs=1,\n   per_device_train_batch_size=1,\n   gradient_accumulation_steps=4,\n   learning_rate=5e-6,\n   logging_steps=10,\n   save_strategy=\"no\",\n   bf16=BF16_OK, fp16=not BF16_OK,\n   max_length=512,\n   max_prompt_length=256,\n   beta=0.1,\n   gradient_checkpointing=True,\n   report_to=\"none\",\n)\n\n\ndpo_trainer = DPOTrainer(\n   model=MODEL_NAME,\n   args=dpo_args,\n   train_dataset=dpo_ds,\n   peft_config=LORA_CFG,\n)\ndpo_trainer.train()\ndel dpo_trainer; cleanup()<\/code><\/pre>\n<\/div>\n<\/div>\n<p>We implement Direct Preference Optimization to optimize the model directly on preference data, without a separate reward model. We use a low learning rate (5e-6) and control how far the policy may drift from the reference model with the beta parameter (0.1). 
Training then shifts the model outputs toward the preferred responses.<\/p>\n<div class=\"dm-code-snippet dark dm-normal-version default no-background-mobile\">\n<div class=\"control-language\">\n<div class=\"dm-buttons\">\n<div class=\"dm-buttons-left\">\n<div class=\"dm-button-snippet red-button\"><\/div>\n<div class=\"dm-button-snippet orange-button\"><\/div>\n<div class=\"dm-button-snippet green-button\"><\/div>\n<\/div>\n<div class=\"dm-buttons-right\"><a><span class=\"dm-copy-text\">Copy Code<\/span><span class=\"dm-copy-confirmed\">Copied<\/span><span class=\"dm-error-message\">Use a different Browser<\/span><\/a><\/div>\n<\/div>\n<pre class=\" no-line-numbers\"><code class=\" no-wrap language-php\">print(\"\\n\" + \"=\"*72 + \"\\nPART 4 \u2014 GRPO with verifiable math rewards\\n\" + \"=\"*72)\n\n\nfrom trl import GRPOTrainer, GRPOConfig\nimport random\n\n\nrandom.seed(0)\ndef make_math_problem():\n   a, b = random.randint(1, 50), random.randint(1, 50)\n   op = random.choice([\"+\", \"-\", \"*\"])\n   expr = f\"{a} {op} {b}\"\n   return {\n       \"prompt\": f\"Solve this and end your reply with only the final number. 
{expr} =\",\n       \"answer\": str(eval(expr)),\n   }\n\n\ngrpo_ds = Dataset.from_list([make_math_problem() for _ in range(200)])\nprint(f\"GRPO dataset rows: {len(grpo_ds)}\")\nprint(f\"Example: {grpo_ds[0]}\")\n\n\ndef correctness_reward(completions, **kwargs):\n   \"\"\"+1 if the last number in the completion matches the gold answer.\"\"\"\n   answers = kwargs[\"answer\"]\n   rewards = []\n   for c, gold in zip(completions, answers):\n       nums = re.findall(r\"-?\\d+\", c)\n       rewards.append(1.0 if nums and nums[-1] == gold else 0.0)\n   return rewards\n\n\ndef brevity_reward(completions, **kwargs):\n   \"\"\"Small bonus for short answers \u2014 discourages rambling.\"\"\"\n   return [max(0.0, 1.0 - len(c) \/ 200) * 0.2 for c in completions]\n\n\ngrpo_args = GRPOConfig(\n   output_dir=\".\/grpo_out\",\n   learning_rate=1e-5,\n   per_device_train_batch_size=2,\n   gradient_accumulation_steps=2,\n   num_generations=4,\n   max_prompt_length=128,\n   max_completion_length=96,\n   logging_steps=2,\n   save_strategy=\"no\",\n   bf16=BF16_OK, fp16=not BF16_OK,\n   gradient_checkpointing=True,\n   max_steps=15,\n   report_to=\"none\",\n)\n\n\ngrpo_trainer = GRPOTrainer(\n   model=MODEL_NAME,\n   args=grpo_args,\n   train_dataset=grpo_ds,\n   reward_funcs=[correctness_reward, brevity_reward],\n   peft_config=LORA_CFG,\n)\ngrpo_trainer.train()\n\n\nprint(\"\\n[GRPO inference]\")\nfor q in [\"What is 17 + 28?\", \"What is 9 * 7?\", \"What is 100 - 47?\"]:\n   a = chat_generate(grpo_trainer.model, grpo_trainer.processing_class, q, 60)\n   print(f\"Q: {q}\\nA: {a}\\n\")\n\n\ndel grpo_trainer; cleanup()\n\n\nprint(\"\\n\u2713 Tutorial complete \u2014 you've trained 4 post-training algorithms!\")<\/code><\/pre>\n<\/div>\n<\/div>\n<p>We apply GRPO by generating multiple responses per prompt and evaluating them using custom reward functions. We design deterministic rewards for correctness and brevity, allowing the model to learn from verifiable signals. 
We finally test the model on arithmetic queries to see how it responds after this short 15-step run.<\/p>\n<p>In conclusion, we implemented four major post-training paradigms used in current LLM alignment workflows. The methods form a conceptual progression: imitating demonstrations in SFT, learning a preference score in RM, optimizing directly on preferences with DPO, and reinforcement learning against verifiable rewards with GRPO. In this code, each stage starts from the same base checkpoint rather than from the output of the previous stage. We also showed that these techniques are not restricted to massive infrastructure; they can be prototyped on a single small GPU with LoRA and TRL. This gives us a foundation for further experimentation: customizing reward functions, scaling models, and designing our own aligned AI systems.<\/p>\n<hr class=\"wp-block-separator aligncenter has-alpha-channel-opacity is-style-wide\" \/>\n<p>Check out the <strong><a href=\"https:\/\/github.com\/Marktechpost\/AI-Agents-Projects-Tutorials\/blob\/main\/LLM%20Projects\/trl_llm_post_training_sft_dpo_grpo_marktechpost.py\" target=\"_blank\" rel=\"noreferrer noopener\">full code here<\/a><\/strong>.<\/p>\n<p>The post <a href=\"https:\/\/www.marktechpost.com\/2026\/05\/01\/a-coding-guide-on-llm-post-training-with-trl-from-supervised-fine-tuning-to-dpo-and-grpo-reasoning\/\">A Coding Guide on LLM Post Training with TRL from Supervised Fine Tuning to DPO and GRPO Reasoning<\/a> appeared first on <a href=\"https:\/\/www.marktechpost.com\/\">MarkTechPost<\/a>.<\/p>","protected":false},"excerpt":{"rendered":"<p>In this tutorial, we walk thro&hellip;<\/p>\n","protected":false},"author":1,"featured_media":29,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[1],"tags":[],"class_list":["post-836","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-uncategorized"],"_links":{"self":[{"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=\/wp\/v2\/posts\/836","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=836"}],"version-history":[{"count":0,"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=\/wp\/v2\/posts\/836\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href"
:"https:\/\/connectword.dpdns.org\/index.php?rest_route=\/wp\/v2\/media\/29"}],"wp:attachment":[{"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=836"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=836"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=836"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}