{"id":660,"date":"2026-04-03T15:48:18","date_gmt":"2026-04-03T07:48:18","guid":{"rendered":"https:\/\/connectword.dpdns.org\/?p=660"},"modified":"2026-04-03T15:48:18","modified_gmt":"2026-04-03T07:48:18","slug":"step-by-step-guide-to-build-an-end-to-end-model-optimization-pipeline-with-nvidia-model-optimizer-using-fastnas-pruning-and-fine-tuning","status":"publish","type":"post","link":"https:\/\/connectword.dpdns.org\/?p=660","title":{"rendered":"Step by Step Guide to Build an End-to-End Model Optimization Pipeline with NVIDIA Model Optimizer Using FastNAS Pruning and Fine-Tuning"},"content":{"rendered":"<p>In this tutorial, we build a complete end-to-end pipeline using <a href=\"https:\/\/github.com\/NVIDIA\/Model-Optimizer\"><strong>NVIDIA Model Optimizer<\/strong><\/a><strong> <\/strong>to train, prune, and fine-tune a deep learning model directly in Google Colab. We start by setting up the environment and preparing the CIFAR-10 dataset, then define a ResNet architecture and train it to establish a strong baseline. From there, we apply FastNAS pruning to systematically reduce the model\u2019s complexity under FLOPs constraints while preserving performance. We also handle real-world compatibility issues, restore the optimized subnet, and fine-tune it to recover accuracy. By the end, we have a fully working workflow that takes a model from training to deployment-ready optimization, all within a single streamlined setup. 
Check out\u00a0the\u00a0<strong><a href=\"https:\/\/github.com\/Marktechpost\/AI-Tutorial-Codes-Included\/blob\/main\/Deep%20Learning\/nvidia_model_optimizer_fastnas_pipeline_marktechpost.py\" target=\"_blank\" rel=\"noreferrer noopener\">Full Implementation Coding Notebook<\/a><\/strong>.<\/p>\n<div class=\"dm-code-snippet dark dm-normal-version default no-background-mobile\">\n<div class=\"control-language\">\n<div class=\"dm-buttons\">\n<div class=\"dm-buttons-left\">\n<div class=\"dm-button-snippet red-button\"><\/div>\n<div class=\"dm-button-snippet orange-button\"><\/div>\n<div class=\"dm-button-snippet green-button\"><\/div>\n<\/div>\n<div class=\"dm-buttons-right\"><a><span class=\"dm-copy-text\">Copy Code<\/span><span class=\"dm-copy-confirmed\">Copied<\/span><span class=\"dm-error-message\">Use a different Browser<\/span><\/a><\/div>\n<\/div>\n<pre class=\" no-line-numbers\"><code class=\" no-wrap language-php\">!pip -q install -U nvidia-modelopt torchvision torchprofile tqdm\n\n\nimport math\nimport os\nimport random\nimport time\n\n\nimport numpy as np\nimport torch\nimport torch.nn as nn\nimport torch.nn.functional as F\nimport torchvision\nimport torchvision.transforms as transforms\n\n\nfrom torch.utils.data import DataLoader, Subset\nfrom torchvision.models.resnet import BasicBlock\nfrom tqdm.auto import tqdm\n\n\nimport modelopt.torch.opt as mto\nimport modelopt.torch.prune as mtp\n\n\nSEED = 123\nrandom.seed(SEED)\nnp.random.seed(SEED)\ntorch.manual_seed(SEED)\nif torch.cuda.is_available():\n   torch.cuda.manual_seed_all(SEED)\n\n\ndevice = torch.device(\"cuda\" if torch.cuda.is_available() else \"cpu\")\n\n\nFAST_MODE = True\n\n\nbatch_size = 256 if FAST_MODE else 512\nbaseline_epochs = 20 if FAST_MODE else 120\nfinetune_epochs = 12 if FAST_MODE else 120\n\n\ntrain_subset_size = 12000 if FAST_MODE else None\nval_subset_size   = 2000  if FAST_MODE else None\ntest_subset_size  = 4000  if FAST_MODE else None\n\n\ntarget_flops = 60e6<\/code><\/pre>\n<\/div>\n<\/div>\n<p>We begin by installing all required dependencies and 
importing the necessary libraries to set up our environment. We initialize seeds to ensure reproducibility and configure the device to leverage a GPU if available. We also define key runtime parameters, such as batch size, epochs, dataset subsets, and FLOP constraints, to control the overall experiment.<\/p>\n<div class=\"dm-code-snippet dark dm-normal-version default no-background-mobile\">\n<div class=\"control-language\">\n<div class=\"dm-buttons\">\n<div class=\"dm-buttons-left\">\n<div class=\"dm-button-snippet red-button\"><\/div>\n<div class=\"dm-button-snippet orange-button\"><\/div>\n<div class=\"dm-button-snippet green-button\"><\/div>\n<\/div>\n<div class=\"dm-buttons-right\"><a><span class=\"dm-copy-text\">Copy Code<\/span><span class=\"dm-copy-confirmed\">Copied<\/span><span class=\"dm-error-message\">Use a different Browser<\/span><\/a><\/div>\n<\/div>\n<pre class=\" no-line-numbers\"><code class=\" no-wrap language-php\">def seed_worker(worker_id):\n   worker_seed = SEED + worker_id\n   np.random.seed(worker_seed)\n   random.seed(worker_seed)\n\n\ndef build_cifar10_loaders(train_batch_size=256,\n                         train_subset_size=None,\n                         val_subset_size=None,\n                         test_subset_size=None):\n   normalize = transforms.Normalize(\n       mean=[0.4914, 0.4822, 0.4465],\n       std=[0.2470, 0.2435, 0.2616],\n   )\n\n\n   train_transform = transforms.Compose([\n       transforms.ToTensor(),\n       transforms.RandomHorizontalFlip(),\n       transforms.RandomCrop(32, padding=4),\n       normalize,\n   ])\n   eval_transform = transforms.Compose([\n       transforms.ToTensor(),\n       normalize,\n   ])\n\n\n   train_full = torchvision.datasets.CIFAR10(\n       root=\".\/data\", train=True, transform=train_transform, download=True\n   )\n   val_full = torchvision.datasets.CIFAR10(\n       root=\".\/data\", train=True, transform=eval_transform, download=True\n   )\n   test_full = 
torchvision.datasets.CIFAR10(\n       root=\".\/data\", train=False, transform=eval_transform, download=True\n   )\n\n\n   n_trainval = len(train_full)\n   ids = np.arange(n_trainval)\n   np.random.shuffle(ids)\n\n\n   n_train = int(n_trainval * 0.9)\n   train_ids = ids[:n_train]\n   val_ids = ids[n_train:]\n\n\n   if train_subset_size is not None:\n       train_ids = train_ids[:min(train_subset_size, len(train_ids))]\n   if val_subset_size is not None:\n       val_ids = val_ids[:min(val_subset_size, len(val_ids))]\n\n\n   test_ids = np.arange(len(test_full))\n   if test_subset_size is not None:\n       test_ids = test_ids[:min(test_subset_size, len(test_ids))]\n\n\n   train_ds = Subset(train_full, train_ids.tolist())\n   val_ds = Subset(val_full, val_ids.tolist())\n   test_ds = Subset(test_full, test_ids.tolist())\n\n\n   num_workers = min(2, os.cpu_count() or 1)\n\n\n   g = torch.Generator()\n   g.manual_seed(SEED)\n\n\n   train_loader = DataLoader(\n       train_ds,\n       batch_size=train_batch_size,\n       shuffle=True,\n       num_workers=num_workers,\n       pin_memory=torch.cuda.is_available(),\n       worker_init_fn=seed_worker,\n       generator=g,\n   )\n   val_loader = DataLoader(\n       val_ds,\n       batch_size=512,\n       shuffle=False,\n       num_workers=num_workers,\n       pin_memory=torch.cuda.is_available(),\n       worker_init_fn=seed_worker,\n   )\n   test_loader = DataLoader(\n       test_ds,\n       batch_size=512,\n       shuffle=False,\n       num_workers=num_workers,\n       pin_memory=torch.cuda.is_available(),\n       worker_init_fn=seed_worker,\n   )\n\n\n   print(f\"Train: {len(train_ds)} | Val: {len(val_ds)} | Test: {len(test_ds)}\")\n   return train_loader, val_loader, test_loader\n\n\ntrain_loader, val_loader, test_loader = build_cifar10_loaders(\n   train_batch_size=batch_size,\n   train_subset_size=train_subset_size,\n   val_subset_size=val_subset_size,\n   
test_subset_size=test_subset_size,\n)\n<\/code><\/pre>\n<\/div>\n<\/div>\n<p>We construct the full data pipeline by preparing CIFAR-10 datasets with appropriate augmentations and normalization. We split the dataset to reduce its size and speed up experimentation. We then create efficient data loaders that ensure proper batching, shuffling, and reproducible data handling.<\/p>\n<div class=\"dm-code-snippet dark dm-normal-version default no-background-mobile\">\n<div class=\"control-language\">\n<div class=\"dm-buttons\">\n<div class=\"dm-buttons-left\">\n<div class=\"dm-button-snippet red-button\"><\/div>\n<div class=\"dm-button-snippet orange-button\"><\/div>\n<div class=\"dm-button-snippet green-button\"><\/div>\n<\/div>\n<div class=\"dm-buttons-right\"><a><span class=\"dm-copy-text\">Copy Code<\/span><span class=\"dm-copy-confirmed\">Copied<\/span><span class=\"dm-error-message\">Use a different Browser<\/span><\/a><\/div>\n<\/div>\n<pre class=\" no-line-numbers\"><code class=\" no-wrap language-php\">def _weights_init(m):\n   if isinstance(m, (nn.Linear, nn.Conv2d)):\n       nn.init.kaiming_normal_(m.weight)\n\n\nclass LambdaLayer(nn.Module):\n   def __init__(self, lambd):\n       super().__init__()\n       self.lambd = lambd\n\n\n   def forward(self, x):\n       return self.lambd(x)\n\n\nclass ResNet(nn.Module):\n   def __init__(self, num_blocks, num_classes=10):\n       super().__init__()\n       self.in_planes = 16\n       self.layers = nn.Sequential(\n           nn.Conv2d(3, 16, kernel_size=3, stride=1, padding=1, bias=False),\n           nn.BatchNorm2d(16),\n           nn.ReLU(),\n           self._make_layer(16, num_blocks, stride=1),\n           self._make_layer(32, num_blocks, stride=2),\n           self._make_layer(64, num_blocks, stride=2),\n           nn.AdaptiveAvgPool2d((1, 1)),\n           nn.Flatten(),\n           nn.Linear(64, num_classes),\n       )\n       self.apply(_weights_init)\n\n\n   def _make_layer(self, planes, num_blocks, stride):\n     
  strides = [stride] + [1] * (num_blocks - 1)\n       layers = []\n       for s in strides:\n           downsample = None\n           if s != 1 or self.in_planes != planes:\n               downsample = LambdaLayer(\n                   lambda x: F.pad(\n                       x[:, :, ::2, ::2],\n                       (0, 0, 0, 0, planes \/\/ 4, planes \/\/ 4),\n                       \"constant\",\n                       0,\n                   )\n               )\n           layers.append(BasicBlock(self.in_planes, planes, s, downsample))\n           self.in_planes = planes\n       return nn.Sequential(*layers)\n\n\n   def forward(self, x):\n       return self.layers(x)\n\n\ndef resnet20():\n   return ResNet(num_blocks=3).to(device)<\/code><\/pre>\n<\/div>\n<\/div>\n<p>We define the ResNet20 architecture from scratch, including custom initialization and shortcut handling through lambda layers. We structure the network using convolutional blocks and residual connections to capture hierarchical features. 
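The zero-padding shortcut inside `_make_layer` is the subtlest part of this architecture, so here is a standalone sketch of what that `F.pad` lambda does when a stage halves the spatial resolution and widens from 16 to 32 channels (the tensor sizes below are illustrative, not taken from the training run):

```python
import torch
import torch.nn.functional as F

# Illustrative input: a batch of 2 feature maps entering the 16 -> 32 stage.
x = torch.randn(2, 16, 32, 32)  # (batch, channels, height, width)
planes = 32                     # channel width of the next stage

# Subsample every other pixel, then zero-pad planes // 4 channels on each side,
# mirroring the parameter-free LambdaLayer shortcut used above.
shortcut = F.pad(x[:, :, ::2, ::2], (0, 0, 0, 0, planes // 4, planes // 4), "constant", 0)
print(tuple(shortcut.shape))  # (2, 32, 16, 16): spatial halved, channels doubled
```

This is why the shortcut adds no parameters: the extra channels are plain zeros rather than a learned 1x1 projection.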
We finally encapsulate the model creation into a reusable function that moves it directly to the selected device.<\/p>\n<div class=\"dm-code-snippet dark dm-normal-version default no-background-mobile\">\n<div class=\"control-language\">\n<div class=\"dm-buttons\">\n<div class=\"dm-buttons-left\">\n<div class=\"dm-button-snippet red-button\"><\/div>\n<div class=\"dm-button-snippet orange-button\"><\/div>\n<div class=\"dm-button-snippet green-button\"><\/div>\n<\/div>\n<div class=\"dm-buttons-right\"><a><span class=\"dm-copy-text\">Copy Code<\/span><span class=\"dm-copy-confirmed\">Copied<\/span><span class=\"dm-error-message\">Use a different Browser<\/span><\/a><\/div>\n<\/div>\n<pre class=\" no-line-numbers\"><code class=\" no-wrap language-php\">class CosineLRwithWarmup(torch.optim.lr_scheduler._LRScheduler):\n   def __init__(self, optimizer, warmup_steps, decay_steps, warmup_lr=0.0, last_epoch=-1):\n       self.warmup_steps = warmup_steps\n       self.warmup_lr = warmup_lr\n       self.decay_steps = max(decay_steps, 1)\n       super().__init__(optimizer, last_epoch)\n\n\n   def get_lr(self):\n       if self.last_epoch &lt; self.warmup_steps:\n           return [\n               (base_lr - self.warmup_lr) * self.last_epoch \/ max(self.warmup_steps, 1) + self.warmup_lr\n               for base_lr in self.base_lrs\n           ]\n       current_steps = self.last_epoch - self.warmup_steps\n       return [\n           0.5 * base_lr * (1 + math.cos(math.pi * current_steps \/ self.decay_steps))\n           for base_lr in self.base_lrs\n       ]\n\n\ndef get_optimizer_scheduler(model, lr, weight_decay, warmup_steps, decay_steps):\n   optimizer = torch.optim.SGD(\n       filter(lambda p: p.requires_grad, model.parameters()),\n       lr=lr,\n       momentum=0.9,\n       weight_decay=weight_decay,\n   )\n   scheduler = CosineLRwithWarmup(optimizer, warmup_steps, decay_steps)\n   return optimizer, scheduler\n\n\ndef loss_fn_default(model, outputs, labels):\n   return 
F.cross_entropy(outputs, labels)\n\n\ndef train_one_epoch(model, loader, optimizer, scheduler, loss_fn=loss_fn_default):\n   model.train()\n   running_loss = 0.0\n   total = 0\n   for images, labels in loader:\n       images = images.to(device, non_blocking=True)\n       labels = labels.to(device, non_blocking=True)\n\n\n       outputs = model(images)\n       loss = loss_fn(model, outputs, labels)\n\n\n       optimizer.zero_grad(set_to_none=True)\n       loss.backward()\n       optimizer.step()\n       scheduler.step()\n\n\n       running_loss += loss.item() * labels.size(0)\n       total += labels.size(0)\n\n\n   return running_loss \/ max(total, 1)\n\n\n@torch.no_grad()\ndef evaluate(model, loader):\n   model.eval()\n   correct = 0\n   total = 0\n   for images, labels in loader:\n       images = images.to(device, non_blocking=True)\n       labels = labels.to(device, non_blocking=True)\n       logits = model(images)\n       preds = logits.argmax(dim=1)\n       correct += (preds == labels).sum().item()\n       total += labels.size(0)\n   return 100.0 * correct \/ max(total, 1)\n\n\ndef train_model(model, train_loader, val_loader, epochs, ckpt_path,\n               lr=None, weight_decay=1e-4, print_every=1):\n   if lr is None:\n       lr = 0.1 * batch_size \/ 128\n\n\n   steps_per_epoch = len(train_loader)\n   warmup_steps = max(1, 2 * steps_per_epoch if FAST_MODE else 5 * steps_per_epoch)\n   decay_steps = max(1, epochs * steps_per_epoch)\n\n\n   optimizer, scheduler = get_optimizer_scheduler(\n       model=model,\n       lr=lr,\n       weight_decay=weight_decay,\n       warmup_steps=warmup_steps,\n       decay_steps=decay_steps,\n   )\n\n\n   best_val = -1.0\n   best_epoch = -1\n\n\n   print(f\"Training for {epochs} epochs...\")\n   for epoch in tqdm(range(1, epochs + 1)):\n       train_loss = train_one_epoch(model, train_loader, optimizer, scheduler)\n       val_acc = evaluate(model, val_loader)\n\n\n       if val_acc &gt;= best_val:\n           best_val = 
val_acc\n           best_epoch = epoch\n           torch.save(model.state_dict(), ckpt_path)\n\n\n       if epoch == 1 or epoch % print_every == 0 or epoch == epochs:\n           print(f\"Epoch {epoch:03d} | train_loss={train_loss:.4f} | val_acc={val_acc:.2f}%\")\n\n\n   model.load_state_dict(torch.load(ckpt_path, map_location=device))\n   print(f\"Restored best checkpoint from epoch {best_epoch} with val_acc={best_val:.2f}%\")\n   return model, best_val<\/code><\/pre>\n<\/div>\n<\/div>\n<p>We implement the training utilities, including a cosine learning rate scheduler with warmup, to enable stable optimization. We define loss computation, a training loop for one epoch, and an evaluation function to measure accuracy. We then build a complete training pipeline that tracks the best model and restores it based on validation performance.<\/p>\n<div class=\"dm-code-snippet dark dm-normal-version default no-background-mobile\">\n<div class=\"control-language\">\n<div class=\"dm-buttons\">\n<div class=\"dm-buttons-left\">\n<div class=\"dm-button-snippet red-button\"><\/div>\n<div class=\"dm-button-snippet orange-button\"><\/div>\n<div class=\"dm-button-snippet green-button\"><\/div>\n<\/div>\n<div class=\"dm-buttons-right\"><a><span class=\"dm-copy-text\">Copy Code<\/span><span class=\"dm-copy-confirmed\">Copied<\/span><span class=\"dm-error-message\">Use a different Browser<\/span><\/a><\/div>\n<\/div>\n<pre class=\" no-line-numbers\"><code class=\" no-wrap language-php\">baseline_model = resnet20()\nbaseline_ckpt = \"resnet20_baseline.pth\"\n\n\nstart = time.time()\nbaseline_model, baseline_val = train_model(\n   baseline_model,\n   train_loader,\n   val_loader,\n   epochs=baseline_epochs,\n   ckpt_path=baseline_ckpt,\n   lr=0.1 * batch_size \/ 128,\n   weight_decay=1e-4,\n   print_every=max(1, baseline_epochs \/\/ 4),\n)\nbaseline_test = evaluate(baseline_model, test_loader)\nbaseline_time = time.time() - start\n\n\nprint(f\"\\nBaseline validation accuracy: 
{baseline_val:.2f}%\")\nprint(f\"Baseline test accuracy:       {baseline_test:.2f}%\")\nprint(f\"Baseline training time:       {baseline_time\/60:.2f} min\")\n\n\nfastnas_cfg = mtp.fastnas.FastNASConfig()\nfastnas_cfg[\"nn.Conv2d\"][\"*\"][\"channel_divisor\"] = 16\nfastnas_cfg[\"nn.BatchNorm2d\"][\"*\"][\"feature_divisor\"] = 16\n\n\ndummy_input = torch.randn(1, 3, 32, 32, device=device)\n\n\ndef score_func(model):\n   return evaluate(model, val_loader)\n\n\nsearch_ckpt = \"modelopt_search_checkpoint_fastnas.pth\"\npruned_ckpt = \"modelopt_pruned_model_fastnas.pth\"\n\n\nimport torchprofile.profile as tp_profile\nfrom torchprofile.handlers import HANDLER_MAP\n\n\nif not hasattr(tp_profile, \"handlers\"):\n   tp_profile.handlers = tuple((tuple([op_name]), handler) for op_name, handler in HANDLER_MAP.items())\n\n\nprint(\"\\nRunning FastNAS pruning...\")\nprune_start = time.time()\n\n\nmodel_for_prune = resnet20()\nmodel_for_prune.load_state_dict(torch.load(baseline_ckpt, map_location=device))\n\n\npruned_model, pruned_metadata = mtp.prune(\n   model=model_for_prune,\n   mode=[(\"fastnas\", fastnas_cfg)],\n   constraints={\"flops\": target_flops},\n   dummy_input=dummy_input,\n   config={\n       \"data_loader\": train_loader,\n       \"score_func\": score_func,\n       \"checkpoint\": search_ckpt,\n   },\n)\n\n\nmto.save(pruned_model, pruned_ckpt)\nprune_elapsed = time.time() - prune_start\n\n\npruned_test_before_ft = evaluate(pruned_model, test_loader)\n\n\nprint(f\"Pruned model test accuracy before fine-tune: {pruned_test_before_ft:.2f}%\")\nprint(f\"Pruning\/search time: {prune_elapsed\/60:.2f} min\")<\/code><\/pre>\n<\/div>\n<\/div>\n<p>We train the baseline model and evaluate its performance to establish a reference point for optimization. We then configure FastNAS pruning, define constraints, and apply a compatibility patch to ensure proper FLOPs profiling. 
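Before the search runs, it helps to have a feel for what the `target_flops = 60e6` budget buys. Here is a rough, hand-rolled multiply-accumulate count for a single convolution (our own simplification that ignores batch norm, activations, and shortcut adds; `conv2d_macs` is an illustrative helper, not a Model Optimizer API):

```python
# Each output element of a Conv2d accumulates c_in * k * k products, and there
# are c_out * h_out * w_out output elements.
def conv2d_macs(c_in, c_out, k, h_out, w_out):
    return c_in * c_out * k * k * h_out * w_out

# Stem of the tutorial's ResNet: 3 -> 16 channels, 3x3 kernel, 32x32 output.
print(conv2d_macs(3, 16, 3, 32, 32))   # 442368
# A mid-network residual conv: 32 -> 32 channels, 3x3 kernel, 16x16 output.
print(conv2d_macs(32, 32, 3, 16, 16))  # 2359296
```

Summing terms like these over every layer is roughly what the profiler does; FastNAS then searches for per-layer channel widths whose total stays under the 60 MFLOPs constraint.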
We execute the pruning process to generate a compressed model and evaluate its performance before fine-tuning.<\/p>\n<div class=\"dm-code-snippet dark dm-normal-version default no-background-mobile\">\n<div class=\"control-language\">\n<div class=\"dm-buttons\">\n<div class=\"dm-buttons-left\">\n<div class=\"dm-button-snippet red-button\"><\/div>\n<div class=\"dm-button-snippet orange-button\"><\/div>\n<div class=\"dm-button-snippet green-button\"><\/div>\n<\/div>\n<div class=\"dm-buttons-right\"><a><span class=\"dm-copy-text\">Copy Code<\/span><span class=\"dm-copy-confirmed\">Copied<\/span><span class=\"dm-error-message\">Use a different Browser<\/span><\/a><\/div>\n<\/div>\n<pre class=\" no-line-numbers\"><code class=\" no-wrap language-php\">restored_pruned_model = resnet20()\nrestored_pruned_model = mto.restore(restored_pruned_model, pruned_ckpt)\n\n\nrestored_test = evaluate(restored_pruned_model, test_loader)\nprint(f\"Restored pruned model test accuracy: {restored_test:.2f}%\")\n\n\nprint(\"\\nFine-tuning pruned model...\")\nfinetune_ckpt = \"resnet20_pruned_finetuned.pth\"\n\n\nstart_ft = time.time()\nrestored_pruned_model, pruned_val_after_ft = train_model(\n   restored_pruned_model,\n   train_loader,\n   val_loader,\n   epochs=finetune_epochs,\n   ckpt_path=finetune_ckpt,\n   lr=0.05 * batch_size \/ 128,\n   weight_decay=1e-4,\n   print_every=max(1, finetune_epochs \/\/ 4),\n)\npruned_test_after_ft = evaluate(restored_pruned_model, test_loader)\nft_time = time.time() - start_ft\n\n\nprint(f\"\\nFine-tuned pruned validation accuracy: {pruned_val_after_ft:.2f}%\")\nprint(f\"Fine-tuned pruned test accuracy:       {pruned_test_after_ft:.2f}%\")\nprint(f\"Fine-tuning time:                      {ft_time\/60:.2f} min\")\n\n\ndef count_params(model):\n   return sum(p.numel() for p in model.parameters())\n\n\ndef count_nonzero_params(model):\n   total = 0\n   for p in model.parameters():\n       total += (p.detach() != 0).sum().item()\n   return 
total\n\n\nbaseline_params = count_params(baseline_model)\npruned_params = count_params(restored_pruned_model)\n\n\nbaseline_nonzero = count_nonzero_params(baseline_model)\npruned_nonzero = count_nonzero_params(restored_pruned_model)\n\n\nprint(\"\\n\" + \"=\" * 60)\nprint(\"FINAL SUMMARY\")\nprint(\"=\" * 60)\nprint(f\"Baseline test accuracy:                 {baseline_test:.2f}%\")\nprint(f\"Pruned test accuracy before finetune:   {pruned_test_before_ft:.2f}%\")\nprint(f\"Pruned test accuracy after finetune:    {pruned_test_after_ft:.2f}%\")\nprint(\"-\" * 60)\nprint(f\"Baseline total params:                  {baseline_params:,}\")\nprint(f\"Pruned total params:                    {pruned_params:,}\")\nprint(f\"Baseline nonzero params:                {baseline_nonzero:,}\")\nprint(f\"Pruned nonzero params:                  {pruned_nonzero:,}\")\nprint(\"-\" * 60)\nprint(f\"Baseline train time:                    {baseline_time\/60:.2f} min\")\nprint(f\"Pruning\/search time:                    {prune_elapsed\/60:.2f} min\")\nprint(f\"Pruned finetune time:                   {ft_time\/60:.2f} min\")\nprint(\"=\" * 60)\n\n\ntorch.save(baseline_model.state_dict(), \"baseline_resnet20_final_state_dict.pth\")\nmto.save(restored_pruned_model, \"pruned_resnet20_final_modelopt.pth\")\n\n\nprint(\"\\nSaved files:\")\nprint(\" - baseline_resnet20_final_state_dict.pth\")\nprint(\" - modelopt_pruned_model_fastnas.pth\")\nprint(\" - pruned_resnet20_final_modelopt.pth\")\nprint(\" - modelopt_search_checkpoint_fastnas.pth\")\n\n\n@torch.no_grad()\ndef show_sample_predictions(model, loader, n=8):\n   model.eval()\n   class_names = [\n       \"airplane\", \"automobile\", \"bird\", \"cat\", \"deer\",\n       \"dog\", \"frog\", \"horse\", \"ship\", \"truck\"\n   ]\n   images, labels = next(iter(loader))\n   images = images[:n].to(device)\n   labels = labels[:n]\n   logits = model(images)\n   preds = logits.argmax(dim=1).cpu()\n\n\n   print(\"\\nSample predictions:\")\n   for i in 
range(len(preds)):\n       print(f\"{i:02d} | pred={class_names[preds[i]]:&lt;10} | true={class_names[labels[i]]}\")\n\n\nshow_sample_predictions(restored_pruned_model, test_loader, n=8)<\/code><\/pre>\n<\/div>\n<\/div>\n<p>We restore the pruned model and verify its performance to ensure the pruning process succeeded. We fine-tune the model to recover accuracy lost during pruning and evaluate the final performance. We conclude by comparing metrics, saving artifacts, and running sample predictions to validate the optimized model end-to-end.<\/p>\n<p>In conclusion, we moved beyond theory and built a complete, production-grade model-optimization pipeline from scratch. We saw how a dense model is transformed into an efficient, compute-aware network through structured pruning, and how fine-tuning restores performance while retaining efficiency gains. We developed a strong intuition for FLOP constraints, automated architecture search, and how FastNAS intelligently navigates the trade-off between accuracy and efficiency. Most importantly, we walked away with a powerful, reusable workflow that we can apply to any model or dataset, enabling us to systematically design high-performance models that are not only accurate but also truly optimized for real-world deployment.<\/p>\n<hr class=\"wp-block-separator has-alpha-channel-opacity\" \/>\n<p>Check out\u00a0the\u00a0<strong><a href=\"https:\/\/github.com\/Marktechpost\/AI-Tutorial-Codes-Included\/blob\/main\/Deep%20Learning\/nvidia_model_optimizer_fastnas_pipeline_marktechpost.py\" target=\"_blank\" rel=\"noreferrer noopener\">Full Implementation Coding Notebook<\/a>. 
\u00a0<\/strong>Also,\u00a0feel free to follow us on\u00a0<strong><a href=\"https:\/\/x.com\/intent\/follow?screen_name=marktechpost\" target=\"_blank\" rel=\"noreferrer noopener\"><mark>Twitter<\/mark><\/a><\/strong>\u00a0and don\u2019t forget to join our\u00a0<strong><a href=\"https:\/\/www.reddit.com\/r\/machinelearningnews\/\" target=\"_blank\" rel=\"noreferrer noopener\">120k+ ML SubReddit<\/a><\/strong>\u00a0and subscribe to\u00a0<strong><a href=\"https:\/\/www.aidevsignals.com\/\" target=\"_blank\" rel=\"noreferrer noopener\">our Newsletter<\/a><\/strong>. Wait! Are you on Telegram?\u00a0<strong><a href=\"https:\/\/t.me\/machinelearningresearchnews\" target=\"_blank\" rel=\"noreferrer noopener\">Now you can join us on Telegram as well.<\/a><\/strong><\/p>\n<p>The post <a href=\"https:\/\/www.marktechpost.com\/2026\/04\/03\/step-by-step-guide-to-build-an-end-to-end-model-optimization-pipeline-with-nvidia-model-optimizer-using-fastnas-pruning-and-fine-tuning\/\">Step by Step Guide to Build an End-to-End Model Optimization Pipeline with NVIDIA Model Optimizer Using FastNAS Pruning and Fine-Tuning<\/a> appeared first on <a href=\"https:\/\/www.marktechpost.com\/\">MarkTechPost<\/a>.<\/p>","protected":false},"excerpt":{"rendered":"<p>In this tutorial, we build a 
c&hellip;<\/p>\n","protected":false},"author":1,"featured_media":29,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[1],"tags":[],"class_list":["post-660","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-uncategorized"],"_links":{"self":[{"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=\/wp\/v2\/posts\/660","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=660"}],"version-history":[{"count":0,"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=\/wp\/v2\/posts\/660\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=\/wp\/v2\/media\/29"}],"wp:attachment":[{"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=660"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=660"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=660"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}