{"id":697,"date":"2026-04-11T15:33:41","date_gmt":"2026-04-11T07:33:41","guid":{"rendered":"https:\/\/connectword.dpdns.org\/?p=697"},"modified":"2026-04-11T15:33:41","modified_gmt":"2026-04-11T07:33:41","slug":"how-knowledge-distillation-compresses-ensemble-intelligence-into-a-single-deployable-ai-model","status":"publish","type":"post","link":"https:\/\/connectword.dpdns.org\/?p=697","title":{"rendered":"How Knowledge Distillation Compresses Ensemble Intelligence into a Single Deployable AI Model"},"content":{"rendered":"<p>Complex prediction problems often lead to ensembles because combining multiple models improves accuracy by reducing variance and capturing diverse patterns. However, these ensembles are impractical in production due to latency constraints and operational complexity.<\/p>\n<p>Instead of discarding them, Knowledge Distillation offers a smarter approach: keep the ensemble as a teacher and train a smaller student model using its soft probability outputs. This allows the student to inherit much of the ensemble\u2019s performance while being lightweight and fast enough for deployment.<\/p>\n<p>In this article, we <a href=\"https:\/\/github.com\/Marktechpost\/AI-Tutorial-Codes-Included\/blob\/main\/Data%20Science\/Knowledge_Distillation.ipynb\" target=\"_blank\" rel=\"noreferrer noopener\">build this pipeline from scratch<\/a> \u2014 training a 12-model teacher ensemble, generating soft targets with temperature scaling, and distilling it into a student that recovers 53.8% of the ensemble\u2019s accuracy edge while being 160\u00d7 smaller.<\/p>\n<figure class=\"wp-block-image size-full is-resized\"><img loading=\"lazy\" decoding=\"async\" width=\"889\" height=\"258\" data-attachment-id=\"78925\" data-permalink=\"https:\/\/www.marktechpost.com\/2026\/04\/11\/how-knowledge-distillation-compresses-ensemble-intelligence-into-a-single-deployable-ai-model\/image-428\/\" 
data-orig-file=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2026\/04\/image-23.png\" data-orig-size=\"889,258\" data-comments-opened=\"1\" data-image-meta='{\"aperture\":\"0\",\"credit\":\"\",\"camera\":\"\",\"caption\":\"\",\"created_timestamp\":\"0\",\"copyright\":\"\",\"focal_length\":\"0\",\"iso\":\"0\",\"shutter_speed\":\"0\",\"title\":\"\",\"orientation\":\"0\"}' data-image-title=\"image\" data-image-description=\"\" data-image-caption=\"\" data-large-file=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2026\/04\/image-23.png\" src=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2026\/04\/image-23.png\" alt=\"\" class=\"wp-image-78925\" \/><\/figure>\n<figure class=\"wp-block-image size-full is-resized\"><img loading=\"lazy\" decoding=\"async\" width=\"889\" height=\"200\" data-attachment-id=\"78924\" data-permalink=\"https:\/\/www.marktechpost.com\/2026\/04\/11\/how-knowledge-distillation-compresses-ensemble-intelligence-into-a-single-deployable-ai-model\/image-427\/\" data-orig-file=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2026\/04\/image-24.png\" data-orig-size=\"889,200\" data-comments-opened=\"1\" data-image-meta='{\"aperture\":\"0\",\"credit\":\"\",\"camera\":\"\",\"caption\":\"\",\"created_timestamp\":\"0\",\"copyright\":\"\",\"focal_length\":\"0\",\"iso\":\"0\",\"shutter_speed\":\"0\",\"title\":\"\",\"orientation\":\"0\"}' data-image-title=\"image\" data-image-description=\"\" data-image-caption=\"\" data-large-file=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2026\/04\/image-24.png\" src=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2026\/04\/image-24.png\" alt=\"\" class=\"wp-image-78924\" \/><\/figure>\n<figure class=\"wp-block-image size-full is-resized\"><img loading=\"lazy\" decoding=\"async\" width=\"891\" height=\"274\" data-attachment-id=\"78929\" 
data-permalink=\"https:\/\/www.marktechpost.com\/2026\/04\/11\/how-knowledge-distillation-compresses-ensemble-intelligence-into-a-single-deployable-ai-model\/image-432\/\" data-orig-file=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2026\/04\/image-28.png\" data-orig-size=\"891,274\" data-comments-opened=\"1\" data-image-meta='{\"aperture\":\"0\",\"credit\":\"\",\"camera\":\"\",\"caption\":\"\",\"created_timestamp\":\"0\",\"copyright\":\"\",\"focal_length\":\"0\",\"iso\":\"0\",\"shutter_speed\":\"0\",\"title\":\"\",\"orientation\":\"0\"}' data-image-title=\"image\" data-image-description=\"\" data-image-caption=\"\" data-large-file=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2026\/04\/image-28.png\" src=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2026\/04\/image-28.png\" alt=\"\" class=\"wp-image-78929\" \/><\/figure>\n<h3 class=\"wp-block-heading\"><strong>What is Knowledge Distillation?<\/strong><\/h3>\n<p>Knowledge distillation is a model compression technique in which a large, pre-trained \u201cteacher\u201d model transfers its learned behavior to a smaller \u201cstudent\u201d model. Instead of training solely on ground-truth labels, the student is trained to mimic the teacher\u2019s predictions\u2014capturing not just final outputs but the richer patterns embedded in its probability distributions. This approach enables the student to approximate the performance of complex models while remaining significantly smaller and faster. 
Originating from early work on compressing large ensemble models into single networks, knowledge distillation is now widely used across domains like NLP, speech, and computer vision, and has become especially important in scaling down massive generative AI models into efficient, deployable systems.<\/p>\n<h3 class=\"wp-block-heading\"><strong>Knowledge Distillation: From Ensemble Teacher to Lean Student<\/strong><\/h3>\n<h4 class=\"wp-block-heading\"><strong>Setting up the dependencies<\/strong><\/h4>\n<div class=\"dm-code-snippet dark dm-normal-version default no-background-mobile\">\n<div class=\"control-language\">\n<div class=\"dm-buttons\">\n<div class=\"dm-buttons-left\">\n<div class=\"dm-button-snippet red-button\"><\/div>\n<div class=\"dm-button-snippet orange-button\"><\/div>\n<div class=\"dm-button-snippet green-button\"><\/div>\n<\/div>\n<div class=\"dm-buttons-right\"><a><span class=\"dm-copy-text\">Copy Code<\/span><span class=\"dm-copy-confirmed\">Copied<\/span><span class=\"dm-error-message\">Use a different Browser<\/span><\/a><\/div>\n<\/div>\n<pre class=\" no-line-numbers\"><code class=\" no-wrap language-php\">pip install torch scikit-learn numpy<\/code><\/pre>\n<\/div>\n<\/div>\n<div class=\"dm-code-snippet dark dm-normal-version default no-background-mobile\">\n<div class=\"control-language\">\n<div class=\"dm-buttons\">\n<div class=\"dm-buttons-left\">\n<div class=\"dm-button-snippet red-button\"><\/div>\n<div class=\"dm-button-snippet orange-button\"><\/div>\n<div class=\"dm-button-snippet green-button\"><\/div>\n<\/div>\n<div class=\"dm-buttons-right\"><a><span class=\"dm-copy-text\">Copy Code<\/span><span class=\"dm-copy-confirmed\">Copied<\/span><span class=\"dm-error-message\">Use a different Browser<\/span><\/a><\/div>\n<\/div>\n<pre class=\" no-line-numbers\"><code class=\" no-wrap language-php\">import torch\nimport torch.nn as nn\nimport torch.nn.functional as F\nfrom torch.utils.data import DataLoader, TensorDataset\nfrom 
sklearn.datasets import make_classification\nfrom sklearn.model_selection import train_test_split\nfrom sklearn.preprocessing import StandardScaler\nimport numpy as np<\/code><\/pre>\n<\/div>\n<\/div>\n<div class=\"dm-code-snippet dark dm-normal-version default no-background-mobile\">\n<div class=\"control-language\">\n<div class=\"dm-buttons\">\n<div class=\"dm-buttons-left\">\n<div class=\"dm-button-snippet red-button\"><\/div>\n<div class=\"dm-button-snippet orange-button\"><\/div>\n<div class=\"dm-button-snippet green-button\"><\/div>\n<\/div>\n<div class=\"dm-buttons-right\"><a><span class=\"dm-copy-text\">Copy Code<\/span><span class=\"dm-copy-confirmed\">Copied<\/span><span class=\"dm-error-message\">Use a different Browser<\/span><\/a><\/div>\n<\/div>\n<pre class=\" no-line-numbers\"><code class=\" no-wrap language-php\">torch.manual_seed(42)\nnp.random.seed(42)<\/code><\/pre>\n<\/div>\n<\/div>\n<h4 class=\"wp-block-heading\"><strong>Creating the dataset<\/strong><\/h4>\n<p>This block creates and prepares a synthetic dataset for a binary classification task (like predicting whether a user clicks an ad). First, make_classification generates 5,000 samples with 20 features, of which some are informative and some redundant to simulate real-world data complexity. The dataset is then split into training and testing sets to evaluate model performance on unseen data.<\/p>\n<p>Next, StandardScaler normalizes the features so they have a consistent scale, which helps neural networks train more efficiently. The data is then converted into PyTorch tensors so it can be used in model training. 
Finally, a DataLoader is created to feed the data in mini-batches (size 64) during training, improving efficiency and enabling stochastic gradient descent.<\/p>\n<div class=\"dm-code-snippet dark dm-normal-version default no-background-mobile\">\n<div class=\"control-language\">\n<div class=\"dm-buttons\">\n<div class=\"dm-buttons-left\">\n<div class=\"dm-button-snippet red-button\"><\/div>\n<div class=\"dm-button-snippet orange-button\"><\/div>\n<div class=\"dm-button-snippet green-button\"><\/div>\n<\/div>\n<div class=\"dm-buttons-right\"><a><span class=\"dm-copy-text\">Copy Code<\/span><span class=\"dm-copy-confirmed\">Copied<\/span><span class=\"dm-error-message\">Use a different Browser<\/span><\/a><\/div>\n<\/div>\n<pre class=\" no-line-numbers\"><code class=\" no-wrap language-php\">X, y = make_classification(\n    n_samples=5000, n_features=20, n_informative=10,\n    n_redundant=5, random_state=42\n)\n \nX_train, X_test, y_train, y_test = train_test_split(\n    X, y, test_size=0.2, random_state=42\n)\n \nscaler = StandardScaler()\nX_train = scaler.fit_transform(X_train)\nX_test  = scaler.transform(X_test)\n \n# Convert to tensors\nX_train_t = torch.tensor(X_train, dtype=torch.float32)\ny_train_t  = torch.tensor(y_train, dtype=torch.long)\nX_test_t   = torch.tensor(X_test,  dtype=torch.float32)\ny_test_t   = torch.tensor(y_test,  dtype=torch.long)\n \ntrain_loader = DataLoader(\n    TensorDataset(X_train_t, y_train_t), batch_size=64, shuffle=True\n)<\/code><\/pre>\n<\/div>\n<\/div>\n<h4 class=\"wp-block-heading\"><strong>Model Architecture<\/strong><\/h4>\n<p>This section defines two neural network architectures: a <strong>TeacherModel<\/strong> and a <strong>StudentModel<\/strong>. 
The teacher represents one of the large models in the ensemble\u2014it has multiple layers, wider dimensions, and dropout for regularization, making it highly expressive but computationally expensive during inference.<\/p>\n<p>The student model, on the other hand, is a smaller and more efficient network with fewer layers and parameters. Its goal is not to match the teacher\u2019s complexity, but to learn its behavior through distillation. Importantly, the student still retains enough capacity to approximate the teacher\u2019s decision boundaries\u2014too small, and it won\u2019t be able to capture the richer patterns learned by the ensemble.<\/p>\n<figure class=\"wp-block-image size-full is-resized\"><img loading=\"lazy\" decoding=\"async\" width=\"593\" height=\"477\" data-attachment-id=\"78928\" data-permalink=\"https:\/\/www.marktechpost.com\/2026\/04\/11\/how-knowledge-distillation-compresses-ensemble-intelligence-into-a-single-deployable-ai-model\/image-431\/\" data-orig-file=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2026\/04\/image-27.png\" data-orig-size=\"593,477\" data-comments-opened=\"1\" data-image-meta='{\"aperture\":\"0\",\"credit\":\"\",\"camera\":\"\",\"caption\":\"\",\"created_timestamp\":\"0\",\"copyright\":\"\",\"focal_length\":\"0\",\"iso\":\"0\",\"shutter_speed\":\"0\",\"title\":\"\",\"orientation\":\"0\"}' data-image-title=\"image\" data-image-description=\"\" data-image-caption=\"\" data-large-file=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2026\/04\/image-27.png\" src=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2026\/04\/image-27.png\" alt=\"\" class=\"wp-image-78928\" \/><\/figure>\n<div class=\"dm-code-snippet dark dm-normal-version default no-background-mobile\">\n<div class=\"control-language\">\n<div class=\"dm-buttons\">\n<div class=\"dm-buttons-left\">\n<div class=\"dm-button-snippet red-button\"><\/div>\n<div class=\"dm-button-snippet orange-button\"><\/div>\n<div class=\"dm-button-snippet 
green-button\"><\/div>\n<\/div>\n<div class=\"dm-buttons-right\"><a><span class=\"dm-copy-text\">Copy Code<\/span><span class=\"dm-copy-confirmed\">Copied<\/span><span class=\"dm-error-message\">Use a different Browser<\/span><\/a><\/div>\n<\/div>\n<pre class=\" no-line-numbers\"><code class=\" no-wrap language-php\">class TeacherModel(nn.Module):\n    \"\"\"Represents one heavy model inside the ensemble.\"\"\"\n    def __init__(self, input_dim=20, num_classes=2):\n        super().__init__()\n        self.net = nn.Sequential(\n            nn.Linear(input_dim, 256), nn.ReLU(), nn.Dropout(0.3),\n            nn.Linear(256, 128),       nn.ReLU(), nn.Dropout(0.3),\n            nn.Linear(128, 64),        nn.ReLU(),\n            nn.Linear(64, num_classes)\n        )\n    def forward(self, x):\n        return self.net(x)\n \n \nclass StudentModel(nn.Module):\n    \"\"\"\n    The lean production model that learns from the ensemble.\n    Two hidden layers -- enough capacity to absorb distilled\n    knowledge, still ~160x smaller than the full ensemble.\n    \"\"\"\n    def __init__(self, input_dim=20, num_classes=2):\n        super().__init__()\n        self.net = nn.Sequential(\n            nn.Linear(input_dim, 64), nn.ReLU(),\n            nn.Linear(64, 32),        nn.ReLU(),\n            nn.Linear(32, num_classes)\n        )\n    def forward(self, x):\n        return self.net(x)<\/code><\/pre>\n<\/div>\n<\/div>\n<h4 class=\"wp-block-heading\"><strong>Helpers<\/strong><\/h4>\n<p>This section defines two utility functions for training and evaluation.<\/p>\n<p><strong>train_one_epoch <\/strong>handles one full pass over the training data. It puts the model in training mode, iterates through mini-batches, computes the loss, performs backpropagation, and updates the model weights using the optimizer. It also tracks and returns the average loss across all batches to monitor training progress.<\/p>\n<p><strong>evaluate <\/strong>is used to measure model performance. 
It switches the model to evaluation mode (disabling dropout and gradients), makes predictions on the input data, and computes the accuracy by comparing predicted labels with true labels.<\/p>\n<div class=\"dm-code-snippet dark dm-normal-version default no-background-mobile\">\n<div class=\"control-language\">\n<div class=\"dm-buttons\">\n<div class=\"dm-buttons-left\">\n<div class=\"dm-button-snippet red-button\"><\/div>\n<div class=\"dm-button-snippet orange-button\"><\/div>\n<div class=\"dm-button-snippet green-button\"><\/div>\n<\/div>\n<div class=\"dm-buttons-right\"><a><span class=\"dm-copy-text\">Copy Code<\/span><span class=\"dm-copy-confirmed\">Copied<\/span><span class=\"dm-error-message\">Use a different Browser<\/span><\/a><\/div>\n<\/div>\n<pre class=\" no-line-numbers\"><code class=\" no-wrap language-php\">def train_one_epoch(model, loader, optimizer, criterion):\n    model.train()\n    total_loss = 0\n    for xb, yb in loader:\n        optimizer.zero_grad()\n        loss = criterion(model(xb), yb)\n        loss.backward()\n        optimizer.step()\n        total_loss += loss.item()\n    return total_loss \/ len(loader)\n \n \ndef evaluate(model, X, y):\n    model.eval()\n    with torch.no_grad():\n        preds = model(X).argmax(dim=1)\n    return (preds == y).float().mean().item()<\/code><\/pre>\n<\/div>\n<\/div>\n<h4 class=\"wp-block-heading\"><strong>Training the Ensemble<\/strong><\/h4>\n<p>This section trains the teacher ensemble, which serves as the source of knowledge for distillation. Instead of a single model, 12 teacher models are trained independently with different random initializations, allowing each one to learn slightly different patterns from the data. This diversity is what makes ensembles powerful.<\/p>\n<p>Each teacher is trained for multiple epochs until convergence, and their individual test accuracies are printed. 
Once all models are trained, their predictions are combined using soft voting\u2014by averaging their output logits rather than taking a simple majority vote. This produces a stronger, more stable final prediction, giving you a high-performing ensemble that will act as the \u201cteacher\u201d in the next step.<\/p>\n<div class=\"dm-code-snippet dark dm-normal-version default no-background-mobile\">\n<div class=\"control-language\">\n<div class=\"dm-buttons\">\n<div class=\"dm-buttons-left\">\n<div class=\"dm-button-snippet red-button\"><\/div>\n<div class=\"dm-button-snippet orange-button\"><\/div>\n<div class=\"dm-button-snippet green-button\"><\/div>\n<\/div>\n<div class=\"dm-buttons-right\"><a><span class=\"dm-copy-text\">Copy Code<\/span><span class=\"dm-copy-confirmed\">Copied<\/span><span class=\"dm-error-message\">Use a different Browser<\/span><\/a><\/div>\n<\/div>\n<pre class=\" no-line-numbers\"><code class=\" no-wrap language-php\">print(\"=\" * 55)\nprint(\"STEP 1: Training the 12-model Teacher Ensemble\")\nprint(\"        (this happens offline, not in production)\")\nprint(\"=\" * 55)\n \nNUM_TEACHERS = 12\nteachers = []\n \nfor i in range(NUM_TEACHERS):\n    torch.manual_seed(i)                           # different init per teacher\n    model = TeacherModel()\n    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)\n    criterion = nn.CrossEntropyLoss()\n \n    for epoch in range(30):                        # train until convergence\n        train_one_epoch(model, train_loader, optimizer, criterion)\n \n    acc = evaluate(model, X_test_t, y_test_t)\n    print(f\"  Teacher {i+1:02d} -&gt; test accuracy: {acc:.4f}\")\n    model.eval()\n    teachers.append(model)\n \n# Soft voting: average logits across all teachers (stronger than majority vote)\nwith torch.no_grad():\n    avg_logits     = torch.stack([t(X_test_t) for t in teachers], dim=0).mean(dim=0)\n    ensemble_preds = avg_logits.argmax(dim=1)\nensemble_acc = (ensemble_preds == 
y_test_t).float().mean().item()\nprint(f\"\\n  Ensemble (soft vote) accuracy: {ensemble_acc:.4f}\")<\/code><\/pre>\n<\/div>\n<\/div>\n<h4 class=\"wp-block-heading\"><strong>Generating Soft Targets from the Ensemble<\/strong><\/h4>\n<p>This step generates soft targets from the trained teacher ensemble, which are the key ingredient in knowledge distillation. Instead of using hard labels (0 or 1), the ensemble\u2019s averaged predictions are converted into probability distributions, capturing how confident the model is across all classes.<\/p>\n<p>The function first averages the logits from all teachers (soft voting), then applies temperature scaling to smooth the probabilities. A higher temperature (like 3.0) makes the distribution softer, revealing subtle relationships between classes that hard labels cannot capture. These soft targets provide richer learning signals, allowing the student model to better approximate the ensemble\u2019s behavior.<\/p>\n<div class=\"dm-code-snippet dark dm-normal-version default no-background-mobile\">\n<div class=\"control-language\">\n<div class=\"dm-buttons\">\n<div class=\"dm-buttons-left\">\n<div class=\"dm-button-snippet red-button\"><\/div>\n<div class=\"dm-button-snippet orange-button\"><\/div>\n<div class=\"dm-button-snippet green-button\"><\/div>\n<\/div>\n<div class=\"dm-buttons-right\"><a><span class=\"dm-copy-text\">Copy Code<\/span><span class=\"dm-copy-confirmed\">Copied<\/span><span class=\"dm-error-message\">Use a different Browser<\/span><\/a><\/div>\n<\/div>\n<pre class=\" no-line-numbers\"><code class=\" no-wrap language-php\">TEMPERATURE = 3.0   # controls how \"soft\" the teacher's output is\n \ndef get_ensemble_soft_targets(teachers, X, T):\n    \"\"\"\n    Average logits from all teachers, then apply temperature scaling.\n    Soft targets carry richer signal than hard 0\/1 labels.\n    \"\"\"\n    with torch.no_grad():\n        logits = torch.stack([t(X) for t in teachers], dim=0).mean(dim=0)\n    return F.softmax(logits \/ T, dim=1)   # soft probability distribution\n \nsoft_targets = get_ensemble_soft_targets(teachers, X_train_t, TEMPERATURE)\n \nprint(f\"\\n  Sample hard label : {y_train_t[0].item()}\")\nprint(f\"  Sample soft target: [{soft_targets[0,0]:.4f}, {soft_targets[0,1]:.4f}]\")\nprint(\"  -&gt; Soft target carries confidence info, not just class identity.\")<\/code><\/pre>\n<\/div>\n<\/div>\n<h4 class=\"wp-block-heading\"><strong>Distillation: Training the Student<\/strong><\/h4>\n<p>This section trains the student model using knowledge distillation, where it learns from both the teacher ensemble and the true labels. A new dataloader is created that provides inputs along with hard labels and soft targets together.<\/p>\n<p>During training, two losses are computed:<\/p>\n<ul class=\"wp-block-list\">\n<li>Distillation loss (KL-divergence) encourages the student to match the teacher\u2019s softened probability distribution, transferring the ensemble\u2019s \u201cknowledge.\u201d<\/li>\n<li>Hard label loss (cross-entropy) ensures the student still aligns with the ground truth.<\/li>\n<\/ul>\n<p>These are combined using a weighting factor (ALPHA), where a higher value gives more importance to the teacher\u2019s guidance. Temperature scaling is applied again to keep consistency with the soft targets, and a rescaling factor ensures stable gradients. 
Over multiple epochs, the student gradually learns to approximate the ensemble\u2019s behavior while remaining much smaller and more efficient for deployment.<\/p>\n<div class=\"dm-code-snippet dark dm-normal-version default no-background-mobile\">\n<div class=\"control-language\">\n<div class=\"dm-buttons\">\n<div class=\"dm-buttons-left\">\n<div class=\"dm-button-snippet red-button\"><\/div>\n<div class=\"dm-button-snippet orange-button\"><\/div>\n<div class=\"dm-button-snippet green-button\"><\/div>\n<\/div>\n<div class=\"dm-buttons-right\"><a><span class=\"dm-copy-text\">Copy Code<\/span><span class=\"dm-copy-confirmed\">Copied<\/span><span class=\"dm-error-message\">Use a different Browser<\/span><\/a><\/div>\n<\/div>\n<pre class=\" no-line-numbers\"><code class=\" no-wrap language-php\">print(\"\\n\" + \"=\" * 55)\nprint(\"STEP 2: Training the Student via Knowledge Distillation\")\nprint(\"        (this produces the single production model)\")\nprint(\"=\" * 55)\n \nALPHA  = 0.7    # weight on distillation loss (0.7 = mostly soft targets)\nEPOCHS = 50\n \nstudent    = StudentModel()\noptimizer  = torch.optim.Adam(student.parameters(), lr=1e-3, weight_decay=1e-4)\nce_loss_fn = nn.CrossEntropyLoss()\n \n# Dataloader that yields (inputs, hard labels, soft targets) together\ndistill_loader = DataLoader(\n    TensorDataset(X_train_t, y_train_t, soft_targets),\n    batch_size=64, shuffle=True\n)\n \nfor epoch in range(EPOCHS):\n    student.train()\n    epoch_loss = 0\n \n    for xb, yb, soft_yb in distill_loader:\n        optimizer.zero_grad()\n \n        student_logits = student(xb)\n \n        # (1) Distillation loss: match the teacher's soft distribution\n        #     KL-divergence between student and teacher outputs at temperature T\n        student_soft = F.log_softmax(student_logits \/ TEMPERATURE, dim=1)\n        distill_loss = F.kl_div(student_soft, soft_yb, reduction='batchmean')\n        distill_loss *= TEMPERATURE ** 2   # rescale: keeps gradient magnitude\n                                           # stable across different T values\n \n        # (2) Hard label loss: also learn from ground truth\n        hard_loss = ce_loss_fn(student_logits, yb)\n \n        # Combined loss\n        loss = ALPHA * distill_loss + (1 - ALPHA) * hard_loss\n        loss.backward()\n        optimizer.step()\n        epoch_loss += loss.item()\n \n    if (epoch + 1) % 10 == 0:\n        acc = evaluate(student, X_test_t, y_test_t)\n        print(f\"  Epoch {epoch+1:02d}\/{EPOCHS}  loss: {epoch_loss\/len(distill_loader):.4f}  \"\n              f\"student accuracy: {acc:.4f}\")<\/code><\/pre>\n<\/div>\n<\/div>\n<h4 class=\"wp-block-heading\"><strong>Student trained on Hard Labels only<\/strong><\/h4>\n<p>This section trains a baseline student model without knowledge distillation, using only the ground truth labels. The architecture is identical to the distilled student, ensuring a fair comparison.<\/p>\n<p>The model is trained in the standard way with cross-entropy loss, learning directly from hard labels without any guidance from the teacher ensemble. 
After training, its accuracy is evaluated on the test set.<\/p>\n<p>This baseline acts as a reference point\u2014allowing you to clearly measure how much performance gain comes specifically from distillation, rather than just the student model\u2019s capacity or training process.<\/p>\n<div class=\"dm-code-snippet dark dm-normal-version default no-background-mobile\">\n<div class=\"control-language\">\n<div class=\"dm-buttons\">\n<div class=\"dm-buttons-left\">\n<div class=\"dm-button-snippet red-button\"><\/div>\n<div class=\"dm-button-snippet orange-button\"><\/div>\n<div class=\"dm-button-snippet green-button\"><\/div>\n<\/div>\n<div class=\"dm-buttons-right\"><a><span class=\"dm-copy-text\">Copy Code<\/span><span class=\"dm-copy-confirmed\">Copied<\/span><span class=\"dm-error-message\">Use a different Browser<\/span><\/a><\/div>\n<\/div>\n<pre class=\" no-line-numbers\"><code class=\" no-wrap language-php\">print(\"\\n\" + \"=\" * 55)\nprint(\"BASELINE: Student trained on hard labels only (no distillation)\")\nprint(\"=\" * 55)\n \nbaseline_student = StudentModel()\nb_optimizer = torch.optim.Adam(\n    baseline_student.parameters(), lr=1e-3, weight_decay=1e-4\n)\n \nfor epoch in range(EPOCHS):\n    train_one_epoch(baseline_student, train_loader, b_optimizer, ce_loss_fn)\n \nbaseline_acc = evaluate(baseline_student, X_test_t, y_test_t)\nprint(f\"  Baseline student accuracy: {baseline_acc:.4f}\")<\/code><\/pre>\n<\/div>\n<\/div>\n<h3 class=\"wp-block-heading\"><strong>Comparison<\/strong><\/h3>\n<p>To measure how much the ensemble\u2019s knowledge actually transfers, we run three models against the same held-out test set. The ensemble \u2014 all 12 teachers voting together via averaged logits \u2014 sets the accuracy ceiling at 97.80%. This is the number we are trying to approximate, not beat. The baseline student is an identical single-model architecture trained the conventional way, on hard labels only: it sees each sample as a binary 0 or 1, nothing more. 
It lands at 96.50%. The distilled student is the same architecture again, but trained on the ensemble\u2019s soft probability outputs at temperature T=3, with a combined loss weighted 70% toward matching the teacher\u2019s distribution and 30% toward ground truth labels. It reaches 97.20%.<\/p>\n<p>The 0.70 percentage point gap between the baseline and the distilled student is not a coincidence of random seed or training noise \u2014 it is the measurable value of the soft targets. The student did not get more data, a better architecture, or more computation. It got a richer training signal, and that alone recovered 53.8% of the gap between what a small model can learn on its own and what the full ensemble knows. The remaining gap of 0.60 percentage points between the distilled student and the ensemble is the honest cost of compression \u2014 the portion of the ensemble\u2019s knowledge that a 3,490-parameter model simply cannot hold, regardless of how well it is trained.<\/p>\n<figure class=\"wp-block-image size-full is-resized\"><img loading=\"lazy\" decoding=\"async\" width=\"886\" height=\"329\" data-attachment-id=\"78927\" data-permalink=\"https:\/\/www.marktechpost.com\/2026\/04\/11\/how-knowledge-distillation-compresses-ensemble-intelligence-into-a-single-deployable-ai-model\/image-430\/\" data-orig-file=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2026\/04\/image-26.png\" data-orig-size=\"886,329\" data-comments-opened=\"1\" data-image-meta='{\"aperture\":\"0\",\"credit\":\"\",\"camera\":\"\",\"caption\":\"\",\"created_timestamp\":\"0\",\"copyright\":\"\",\"focal_length\":\"0\",\"iso\":\"0\",\"shutter_speed\":\"0\",\"title\":\"\",\"orientation\":\"0\"}' data-image-title=\"image\" data-image-description=\"\" data-image-caption=\"\" data-large-file=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2026\/04\/image-26.png\" src=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2026\/04\/image-26.png\" alt=\"\" class=\"wp-image-78927\" 
\/><\/figure>\n<div class=\"dm-code-snippet dark dm-normal-version default no-background-mobile\">\n<div class=\"control-language\">\n<div class=\"dm-buttons\">\n<div class=\"dm-buttons-left\">\n<div class=\"dm-button-snippet red-button\"><\/div>\n<div class=\"dm-button-snippet orange-button\"><\/div>\n<div class=\"dm-button-snippet green-button\"><\/div>\n<\/div>\n<div class=\"dm-buttons-right\"><a><span class=\"dm-copy-text\">Copy Code<\/span><span class=\"dm-copy-confirmed\">Copied<\/span><span class=\"dm-error-message\">Use a different Browser<\/span><\/a><\/div>\n<\/div>\n<pre class=\" no-line-numbers\"><code class=\" no-wrap language-php\">distilled_acc = evaluate(student, X_test_t, y_test_t)\n \nprint(\"\\n\" + \"=\" * 55)\nprint(\"RESULTS SUMMARY\")\nprint(\"=\" * 55)\nprint(f\"  Ensemble  (12 models, production-undeployable) : {ensemble_acc:.4f}\")\nprint(f\"  Student   (distilled, production-ready)        : {distilled_acc:.4f}\")\nprint(f\"  Baseline  (student, hard labels only)          : {baseline_acc:.4f}\")\n \ngap      = ensemble_acc - distilled_acc\nrecovery = (distilled_acc - baseline_acc) \/ max(ensemble_acc - baseline_acc, 1e-9)\nprint(f\"\\n  Accuracy gap vs ensemble       : {gap:.4f}\")\nprint(f\"  Knowledge recovered vs baseline: {recovery*100:.1f}%\")<\/code><\/pre>\n<\/div>\n<\/div>\n<div class=\"dm-code-snippet dark dm-normal-version default no-background-mobile\">\n<div class=\"control-language\">\n<div class=\"dm-buttons\">\n<div class=\"dm-buttons-left\">\n<div class=\"dm-button-snippet red-button\"><\/div>\n<div class=\"dm-button-snippet orange-button\"><\/div>\n<div class=\"dm-button-snippet green-button\"><\/div>\n<\/div>\n<div class=\"dm-buttons-right\"><a><span class=\"dm-copy-text\">Copy Code<\/span><span class=\"dm-copy-confirmed\">Copied<\/span><span class=\"dm-error-message\">Use a different Browser<\/span><\/a><\/div>\n<\/div>\n<pre class=\" no-line-numbers\"><code class=\" no-wrap language-php\">def count_params(m):\n    return sum(p.numel() for p in m.parameters())\n \nsingle_teacher_params = count_params(teachers[0])\nstudent_params        = count_params(student)\n \nprint(f\"\\n  Single teacher parameters : {single_teacher_params:,}\")\nprint(f\"  Full ensemble parameters  : {single_teacher_params * NUM_TEACHERS:,}\")\nprint(f\"  Student parameters        : {student_params:,}\")\nprint(f\"  Size reduction            : {single_teacher_params * NUM_TEACHERS \/ student_params:.0f}x\")<\/code><\/pre>\n<\/div>\n<\/div>\n<figure class=\"wp-block-image size-full is-resized\"><img loading=\"lazy\" decoding=\"async\" width=\"882\" height=\"116\" data-attachment-id=\"78926\" data-permalink=\"https:\/\/www.marktechpost.com\/2026\/04\/11\/how-knowledge-distillation-compresses-ensemble-intelligence-into-a-single-deployable-ai-model\/image-429\/\" data-orig-file=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2026\/04\/image-25.png\" data-orig-size=\"882,116\" data-comments-opened=\"1\" data-image-meta='{\"aperture\":\"0\",\"credit\":\"\",\"camera\":\"\",\"caption\":\"\",\"created_timestamp\":\"0\",\"copyright\":\"\",\"focal_length\":\"0\",\"iso\":\"0\",\"shutter_speed\":\"0\",\"title\":\"\",\"orientation\":\"0\"}' data-image-title=\"image\" data-image-description=\"\" data-image-caption=\"\" data-large-file=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2026\/04\/image-25.png\" src=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2026\/04\/image-25.png\" alt=\"\" class=\"wp-image-78926\" \/><\/figure>\n<hr class=\"wp-block-separator has-alpha-channel-opacity\" \/>\n<p>Check out the <strong><a href=\"https:\/\/github.com\/Marktechpost\/AI-Tutorial-Codes-Included\/blob\/main\/Data%20Science\/Knowledge_Distillation.ipynb\" target=\"_blank\" rel=\"noreferrer noopener\">Full Codes here<\/a><\/strong>.<\/p>\n<p>The post <a href=\"https:\/\/www.marktechpost.com\/2026\/04\/11\/how-knowledge-distillation-compresses-ensemble-intelligence-into-a-single-deployable-ai-model\/\">How Knowledge Distillation Compresses Ensemble Intelligence into a Single Deployable AI Model<\/a> appeared first on <a href=\"https:\/\/www.marktechpost.com\/\">MarkTechPost<\/a>.<\/p>","protected":false},"excerpt":{"rendered":"<p>Complex prediction problems 
of&hellip;<\/p>\n","protected":false},"author":1,"featured_media":698,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[1],"tags":[],"class_list":["post-697","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-uncategorized"],"_links":{"self":[{"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=\/wp\/v2\/posts\/697","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=697"}],"version-history":[{"count":0,"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=\/wp\/v2\/posts\/697\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=\/wp\/v2\/media\/698"}],"wp:attachment":[{"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=697"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=697"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=697"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}