{"id":683,"date":"2026-04-09T15:10:00","date_gmt":"2026-04-09T07:10:00","guid":{"rendered":"https:\/\/connectword.dpdns.org\/?p=683"},"modified":"2026-04-09T15:10:00","modified_gmt":"2026-04-09T07:10:00","slug":"sigmoid-vs-relu-activation-functions-the-inference-cost-of-losing-geometric-context","status":"publish","type":"post","link":"https:\/\/connectword.dpdns.org\/?p=683","title":{"rendered":"Sigmoid vs ReLU Activation Functions: The Inference Cost of Losing Geometric Context"},"content":{"rendered":"<p>A deep neural network can be understood as a geometric system, where each layer reshapes the input space to form increasingly complex decision boundaries. For this to work effectively, layers must preserve meaningful spatial information \u2014 particularly how far a data point lies from these boundaries \u2014 since this distance enables deeper layers to build rich, non-linear representations.<\/p>\n<p>Sigmoid disrupts this process by compressing all inputs into a narrow range between 0 and 1. As values move away from decision boundaries, they become indistinguishable, causing a loss of geometric context across layers. This leads to weaker representations and limits the effectiveness of depth.<\/p>\n<p>ReLU, on the other hand, preserves magnitude for positive inputs, allowing distance information to flow through the network. 
This enables deeper models to remain expressive without requiring excessive width or compute.<\/p>\n<p>In this article, we focus on this forward-pass behavior \u2014 analyzing how Sigmoid and ReLU differ in signal propagation and representation geometry using a two-moons experiment, and what that means for inference efficiency and scalability.<\/p>\n<figure class=\"wp-block-image size-full is-resized\"><img loading=\"lazy\" decoding=\"async\" width=\"803\" height=\"507\" data-attachment-id=\"78878\" data-permalink=\"https:\/\/www.marktechpost.com\/2026\/04\/09\/sigmoid-vs-relu-activation-functions-the-inference-cost-of-losing-geometric-context\/image-414\/\" data-orig-file=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2026\/04\/image-11.png\" data-orig-size=\"803,507\" data-comments-opened=\"1\" data-image-meta='{\"aperture\":\"0\",\"credit\":\"\",\"camera\":\"\",\"caption\":\"\",\"created_timestamp\":\"0\",\"copyright\":\"\",\"focal_length\":\"0\",\"iso\":\"0\",\"shutter_speed\":\"0\",\"title\":\"\",\"orientation\":\"0\"}' data-image-title=\"image\" data-image-description=\"\" data-image-caption=\"\" data-medium-file=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2026\/04\/image-11-300x189.png\" data-large-file=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2026\/04\/image-11.png\" src=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2026\/04\/image-11.png\" alt=\"\" class=\"wp-image-78878\" \/><\/figure>\n<figure class=\"wp-block-image size-full is-resized\"><img loading=\"lazy\" decoding=\"async\" width=\"816\" height=\"536\" data-attachment-id=\"78879\" data-permalink=\"https:\/\/www.marktechpost.com\/2026\/04\/09\/sigmoid-vs-relu-activation-functions-the-inference-cost-of-losing-geometric-context\/image-414\/\" data-orig-file=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2026\/04\/image-10.png\" data-orig-size=\"816,536\" data-comments-opened=\"1\" 
data-image-meta='{\"aperture\":\"0\",\"credit\":\"\",\"camera\":\"\",\"caption\":\"\",\"created_timestamp\":\"0\",\"copyright\":\"\",\"focal_length\":\"0\",\"iso\":\"0\",\"shutter_speed\":\"0\",\"title\":\"\",\"orientation\":\"0\"}' data-image-title=\"image\" data-image-description=\"\" data-image-caption=\"\" data-medium-file=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2026\/04\/image-10-300x197.png\" data-large-file=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2026\/04\/image-10.png\" src=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2026\/04\/image-10.png\" alt=\"\" class=\"wp-image-78879\" \/><\/figure>\n<h3 class=\"wp-block-heading\"><strong>Setting up the dependencies<\/strong><\/h3>\n<div class=\"dm-code-snippet dark dm-normal-version default no-background-mobile\">\n<div class=\"control-language\">\n<div class=\"dm-buttons\">\n<div class=\"dm-buttons-left\">\n<div class=\"dm-button-snippet red-button\"><\/div>\n<div class=\"dm-button-snippet orange-button\"><\/div>\n<div class=\"dm-button-snippet green-button\"><\/div>\n<\/div>\n<div class=\"dm-buttons-right\"><a><span class=\"dm-copy-text\">Copy Code<\/span><span class=\"dm-copy-confirmed\">Copied<\/span><span class=\"dm-error-message\">Use a different Browser<\/span><\/a><\/div>\n<\/div>\n<pre class=\" no-line-numbers\"><code class=\" no-wrap language-php\">import numpy as np\nimport matplotlib.pyplot as plt\nimport matplotlib.gridspec as gridspec\nfrom matplotlib.colors import ListedColormap\nfrom sklearn.datasets import make_moons\nfrom sklearn.preprocessing import StandardScaler\nfrom sklearn.model_selection import train_test_split<\/code><\/pre>\n<\/div>\n<\/div>\n<div class=\"dm-code-snippet dark dm-normal-version default no-background-mobile\">\n<div class=\"control-language\">\n<div class=\"dm-buttons\">\n<div class=\"dm-buttons-left\">\n<div class=\"dm-button-snippet red-button\"><\/div>\n<div class=\"dm-button-snippet orange-button\"><\/div>\n<div 
class=\"dm-button-snippet green-button\"><\/div>\n<\/div>\n<div class=\"dm-buttons-right\"><a><span class=\"dm-copy-text\">Copy Code<\/span><span class=\"dm-copy-confirmed\">Copied<\/span><span class=\"dm-error-message\">Use a different Browser<\/span><\/a><\/div>\n<\/div>\n<pre class=\" no-line-numbers\"><code class=\" no-wrap language-php\">plt.rcParams.update({\n    \"font.family\":        \"monospace\",\n    \"axes.spines.top\":    False,\n    \"axes.spines.right\":  False,\n    \"figure.facecolor\":   \"white\",\n    \"axes.facecolor\":     \"#f7f7f7\",\n    \"axes.grid\":          True,\n    \"grid.color\":         \"#e0e0e0\",\n    \"grid.linewidth\":     0.6,\n})\n \nT = {                          \n    \"bg\":      \"white\",\n    \"panel\":   \"#f7f7f7\",\n    \"sig\":     \"#e05c5c\",      \n    \"relu\":    \"#3a7bd5\",      \n    \"c0\":      \"#f4a261\",      \n    \"c1\":      \"#2a9d8f\",      \n    \"text\":    \"#1a1a1a\",\n    \"muted\":   \"#666666\",\n}<\/code><\/pre>\n<\/div>\n<\/div>\n<h3 class=\"wp-block-heading\"><strong>Creating the dataset<\/strong><\/h3>\n<p>To study the effect of activation functions in a controlled setting, we first generate a synthetic dataset using scikit-learn\u2019s make_moons. This creates a non-linear, two-class problem where simple linear boundaries fail, making it ideal for testing how well neural networks learn complex decision surfaces.<\/p>\n<p>We add a small amount of noise to make the task more realistic, then standardize the features using StandardScaler so both dimensions are on the same scale \u2014 ensuring stable training. The dataset is then split into training and test sets to evaluate generalization.<\/p>\n<p>Finally, we visualize the data distribution. 
This plot serves as the baseline geometry that both Sigmoid and ReLU networks will attempt to model, allowing us to later compare how each activation function transforms this space across layers.<\/p>\n<div class=\"dm-code-snippet dark dm-normal-version default no-background-mobile\">\n<div class=\"control-language\">\n<div class=\"dm-buttons\">\n<div class=\"dm-buttons-left\">\n<div class=\"dm-button-snippet red-button\"><\/div>\n<div class=\"dm-button-snippet orange-button\"><\/div>\n<div class=\"dm-button-snippet green-button\"><\/div>\n<\/div>\n<div class=\"dm-buttons-right\"><a><span class=\"dm-copy-text\">Copy Code<\/span><span class=\"dm-copy-confirmed\">Copied<\/span><span class=\"dm-error-message\">Use a different Browser<\/span><\/a><\/div>\n<\/div>\n<pre class=\" no-line-numbers\"><code class=\" no-wrap language-php\">X, y = make_moons(n_samples=400, noise=0.18, random_state=42)\nX = StandardScaler().fit_transform(X)\nX_train, X_test, y_train, y_test = train_test_split(\n    X, y, test_size=0.25, random_state=42\n)\n\nfig, ax = plt.subplots(figsize=(7, 5))\nfig.patch.set_facecolor(T[\"bg\"])\nax.set_facecolor(T[\"panel\"])\nax.scatter(X[y == 0, 0], X[y == 0, 1], c=T[\"c0\"], s=40,\n           edgecolors=\"white\", linewidths=0.5, label=\"Class 0\", alpha=0.9)\nax.scatter(X[y == 1, 0], X[y == 1, 1], c=T[\"c1\"], s=40,\n           edgecolors=\"white\", linewidths=0.5, label=\"Class 1\", alpha=0.9)\nax.set_title(\"make_moons -- our dataset\", color=T[\"text\"], fontsize=13)\nax.set_xlabel(\"x\u2081\", color=T[\"muted\"]); ax.set_ylabel(\"x\u2082\", color=T[\"muted\"])\nax.tick_params(colors=T[\"muted\"]); ax.legend(fontsize=10)\nplt.tight_layout()\nplt.savefig(\"moons_dataset.png\", dpi=140, bbox_inches=\"tight\")\nplt.show()<\/code><\/pre>\n<\/div>\n<\/div>\n<figure class=\"wp-block-image size-full is-resized\"><img loading=\"lazy\" decoding=\"async\" width=\"690\" height=\"490\" data-attachment-id=\"78876\" 
data-permalink=\"https:\/\/www.marktechpost.com\/2026\/04\/09\/sigmoid-vs-relu-activation-functions-the-inference-cost-of-losing-geometric-context\/image-412\/\" data-orig-file=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2026\/04\/image-8.png\" data-orig-size=\"690,490\" data-comments-opened=\"1\" data-image-meta='{\"aperture\":\"0\",\"credit\":\"\",\"camera\":\"\",\"caption\":\"\",\"created_timestamp\":\"0\",\"copyright\":\"\",\"focal_length\":\"0\",\"iso\":\"0\",\"shutter_speed\":\"0\",\"title\":\"\",\"orientation\":\"0\"}' data-image-title=\"image\" data-image-description=\"\" data-image-caption=\"\" data-medium-file=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2026\/04\/image-8-300x213.png\" data-large-file=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2026\/04\/image-8.png\" src=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2026\/04\/image-8.png\" alt=\"\" class=\"wp-image-78876\" \/><\/figure>\n<h3 class=\"wp-block-heading\"><strong>Creating the Network<\/strong><\/h3>\n<p>Next, we implement a small, controlled neural network to isolate the effect of activation functions. The goal here is not to build a highly optimized model, but to create a clean experimental setup where Sigmoid and ReLU can be compared under identical conditions.<\/p>\n<p>We define both activation functions (Sigmoid and ReLU) along with their derivatives, and use binary cross-entropy as the loss since this is a binary classification task. The TwoLayerNet class represents a simple 3-layer feedforward network (2 hidden layers + output), where the only configurable component is the activation function.<\/p>\n<p>A key detail is the initialization strategy: we use He initialization for ReLU and Xavier initialization for Sigmoid, ensuring that each network starts in a fair and stable regime based on its activation dynamics.<\/p>\n<p>The forward pass computes activations layer by layer, while the backward pass performs standard gradient descent updates. 
Importantly, we also include diagnostic methods like get_hidden and get_z_trace, which allow us to inspect how signals evolve across layers \u2014 this is crucial for analyzing how much geometric information is preserved or lost.<\/p>\n<p>By keeping architecture, data, and training setup constant, this implementation ensures that any difference in performance or internal representations can be directly attributed to the activation function itself \u2014 setting the stage for a clear comparison of their impact on signal propagation and expressiveness.<\/p>\n<div class=\"dm-code-snippet dark dm-normal-version default no-background-mobile\">\n<div class=\"control-language\">\n<div class=\"dm-buttons\">\n<div class=\"dm-buttons-left\">\n<div class=\"dm-button-snippet red-button\"><\/div>\n<div class=\"dm-button-snippet orange-button\"><\/div>\n<div class=\"dm-button-snippet green-button\"><\/div>\n<\/div>\n<div class=\"dm-buttons-right\"><a><span class=\"dm-copy-text\">Copy Code<\/span><span class=\"dm-copy-confirmed\">Copied<\/span><span class=\"dm-error-message\">Use a different Browser<\/span><\/a><\/div>\n<\/div>\n<pre class=\" no-line-numbers\"><code class=\" no-wrap language-php\">def sigmoid(z):      return 1 \/ (1 + np.exp(-np.clip(z, -500, 500)))\ndef sigmoid_d(a):    return a * (1 - a)\ndef relu(z):         return np.maximum(0, z)\ndef relu_d(z):       return (z &gt; 0).astype(float)\ndef bce(y, yhat):    return -np.mean(y * np.log(yhat + 1e-9) + (1 - y) * np.log(1 - yhat + 1e-9))\n\nclass TwoLayerNet:\n    def __init__(self, activation=\"relu\", seed=0):\n        np.random.seed(seed)\n        self.act_name = activation\n        self.act  = relu    if activation == \"relu\" else sigmoid\n        self.dact = relu_d  if activation == \"relu\" else sigmoid_d\n\n        # He init for ReLU, Xavier for Sigmoid\n        scale = lambda fan_in: np.sqrt(2 \/ fan_in) if activation == \"relu\" else np.sqrt(1 \/ fan_in)\n        self.W1 = np.random.randn(2, 8)  * 
scale(2)\n        self.b1 = np.zeros((1, 8))\n        self.W2 = np.random.randn(8, 8)  * scale(8)\n        self.b2 = np.zeros((1, 8))\n        self.W3 = np.random.randn(8, 1)  * scale(8)\n        self.b3 = np.zeros((1, 1))\n        self.loss_history = []\n\n    def forward(self, X, store=False):\n        z1 = X  @ self.W1 + self.b1;  a1 = self.act(z1)\n        z2 = a1 @ self.W2 + self.b2;  a2 = self.act(z2)\n        z3 = a2 @ self.W3 + self.b3;  out = sigmoid(z3)\n        if store:\n            self._cache = (X, z1, a1, z2, a2, z3, out)\n        return out\n\n    def backward(self, lr=0.05):\n        X, z1, a1, z2, a2, z3, out = self._cache\n        n = X.shape[0]\n\n        dout = (out - self.y_cache) \/ n\n        dW3 = a2.T @ dout;  db3 = dout.sum(axis=0, keepdims=True)\n        da2 = dout @ self.W3.T\n        dz2 = da2 * (self.dact(z2) if self.act_name == \"relu\" else self.dact(a2))\n        dW2 = a1.T @ dz2;  db2 = dz2.sum(axis=0, keepdims=True)\n        da1 = dz2 @ self.W2.T\n        dz1 = da1 * (self.dact(z1) if self.act_name == \"relu\" else self.dact(a1))\n        dW1 = X.T  @ dz1;  db1 = dz1.sum(axis=0, keepdims=True)\n\n        for p, g in [(self.W3,dW3),(self.b3,db3),(self.W2,dW2),\n                     (self.b2,db2),(self.W1,dW1),(self.b1,db1)]:\n            p -= lr * g\n\n    def train_step(self, X, y, lr=0.05):\n        self.y_cache = y.reshape(-1, 1)\n        out = self.forward(X, store=True)\n        loss = bce(self.y_cache, out)\n        self.backward(lr)\n        return loss\n\n    def get_hidden(self, X, layer=1):\n        \"\"\"Return post-activation values for layer 1 or 2.\"\"\"\n        z1 = X @ self.W1 + self.b1;  a1 = self.act(z1)\n        if layer == 1: return a1\n        z2 = a1 @ self.W2 + self.b2; return self.act(z2)\n\n    def get_z_trace(self, x_single):\n        \"\"\"Return pre-activation magnitudes per layer for ONE sample.\"\"\"\n        z1 = x_single @ self.W1 + self.b1\n        a1 = self.act(z1)\n        z2 = a1 @ self.W2 + 
self.b2\n        a2 = self.act(z2)\n        z3 = a2 @ self.W3 + self.b3\n        return [np.abs(z1).mean(), np.abs(a1).mean(),\n                np.abs(z2).mean(), np.abs(a2).mean(),\n                np.abs(z3).mean()]<\/code><\/pre>\n<\/div>\n<\/div>\n<h3 class=\"wp-block-heading\"><strong>Training the Networks<\/strong><\/h3>\n<p>Now we train both networks under identical conditions to ensure a fair comparison. We initialize two models \u2014 one using Sigmoid and the other using ReLU \u2014 with the same random seed so they start from equivalent weight configurations.<\/p>\n<p>The training loop runs for 800 epochs using mini-batch gradient descent. In each epoch, we shuffle the training data, split it into batches, and update both networks in parallel. This setup guarantees that the only variable changing between the two runs is the activation function.<\/p>\n<p>We also track the loss after every epoch and log it at regular intervals. This allows us to observe how each network evolves over time \u2014 not just in terms of convergence speed, but whether it continues improving or plateaus.<\/p>\n<p>This step is critical because it establishes the first signal of divergence: if both models start identically but behave differently during training, that difference must come from how each activation function propagates and preserves information through the network.<\/p>\n<div class=\"dm-code-snippet dark dm-normal-version default no-background-mobile\">\n<div class=\"control-language\">\n<div class=\"dm-buttons\">\n<div class=\"dm-buttons-left\">\n<div class=\"dm-button-snippet red-button\"><\/div>\n<div class=\"dm-button-snippet orange-button\"><\/div>\n<div class=\"dm-button-snippet green-button\"><\/div>\n<\/div>\n<div class=\"dm-buttons-right\"><a><span class=\"dm-copy-text\">Copy Code<\/span><span class=\"dm-copy-confirmed\">Copied<\/span><span class=\"dm-error-message\">Use a different Browser<\/span><\/a><\/div>\n<\/div>\n<pre class=\" no-line-numbers\"><code 
class=\" no-wrap language-php\">EPOCHS = 800\nLR     = 0.05\nBATCH  = 64\n\nnet_sig  = TwoLayerNet(\"sigmoid\", seed=42)\nnet_relu = TwoLayerNet(\"relu\",    seed=42)\n\nfor epoch in range(EPOCHS):\n    idx = np.random.permutation(len(X_train))\n    for net in [net_sig, net_relu]:\n        epoch_loss = []\n        for i in range(0, len(idx), BATCH):\n            b = idx[i:i+BATCH]\n            loss = net.train_step(X_train[b], y_train[b], LR)\n            epoch_loss.append(loss)\n        net.loss_history.append(np.mean(epoch_loss))\n\n    if (epoch + 1) % 200 == 0:\n        ls = net_sig.loss_history[-1]\n        lr = net_relu.loss_history[-1]\n        print(f\"  Epoch {epoch+1:4d} | Sigmoid loss: {ls:.4f} | ReLU loss: {lr:.4f}\")\n\nprint(\"\\n\u2705 Training complete.\")<\/code><\/pre>\n<\/div>\n<\/div>\n<figure class=\"wp-block-image size-full is-resized\"><img loading=\"lazy\" decoding=\"async\" width=\"812\" height=\"517\" data-attachment-id=\"78877\" data-permalink=\"https:\/\/www.marktechpost.com\/2026\/04\/09\/sigmoid-vs-relu-activation-functions-the-inference-cost-of-losing-geometric-context\/image-413\/\" data-orig-file=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2026\/04\/image-9.png\" data-orig-size=\"812,517\" data-comments-opened=\"1\" data-image-meta='{\"aperture\":\"0\",\"credit\":\"\",\"camera\":\"\",\"caption\":\"\",\"created_timestamp\":\"0\",\"copyright\":\"\",\"focal_length\":\"0\",\"iso\":\"0\",\"shutter_speed\":\"0\",\"title\":\"\",\"orientation\":\"0\"}' data-image-title=\"image\" data-image-description=\"\" data-image-caption=\"\" data-medium-file=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2026\/04\/image-9-300x191.png\" data-large-file=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2026\/04\/image-9.png\" src=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2026\/04\/image-9.png\" 
alt=\"\" class=\"wp-image-78877\" \/><\/figure>\n<h3 class=\"wp-block-heading\"><strong>Training Loss Curve<\/strong><\/h3>\n<p>The loss curves make the divergence between Sigmoid and ReLU very clear. Both networks start from the same initialization and are trained under identical conditions, yet their learning trajectories quickly separate. Sigmoid improves initially but plateaus around ~0.28 by epoch 400, showing almost no progress afterward \u2014 a sign that the network has exhausted the useful signal it can extract.<\/p>\n<p>ReLU, in contrast, continues to steadily reduce loss throughout training, dropping from ~0.15 to ~0.03 by epoch 800. This isn\u2019t just faster convergence; it reflects a deeper issue: Sigmoid\u2019s compression is limiting the flow of meaningful information, causing the model to stall, while ReLU preserves that signal, allowing the network to keep refining its decision boundary.<\/p>\n<div class=\"dm-code-snippet dark dm-normal-version default no-background-mobile\">\n<div class=\"control-language\">\n<div class=\"dm-buttons\">\n<div class=\"dm-buttons-left\">\n<div class=\"dm-button-snippet red-button\"><\/div>\n<div class=\"dm-button-snippet orange-button\"><\/div>\n<div class=\"dm-button-snippet green-button\"><\/div>\n<\/div>\n<div class=\"dm-buttons-right\"><a><span class=\"dm-copy-text\">Copy Code<\/span><span class=\"dm-copy-confirmed\">Copied<\/span><span class=\"dm-error-message\">Use a different Browser<\/span><\/a><\/div>\n<\/div>\n<pre class=\" no-line-numbers\"><code class=\" no-wrap language-php\">fig, ax = plt.subplots(figsize=(10, 5))\nfig.patch.set_facecolor(T[\"bg\"])\nax.set_facecolor(T[\"panel\"])\n\nax.plot(net_sig.loss_history,  color=T[\"sig\"],  lw=2.5, label=\"Sigmoid\")\nax.plot(net_relu.loss_history, color=T[\"relu\"], lw=2.5, label=\"ReLU\")\n\nax.set_xlabel(\"Epoch\", color=T[\"muted\"])\nax.set_ylabel(\"Binary Cross-Entropy Loss\", color=T[\"muted\"])\nax.set_title(\"Training Loss -- same architecture, same 
init, same LR\\nonly the activation differs\",\n             color=T[\"text\"], fontsize=12)\nax.legend(fontsize=11)\nax.tick_params(colors=T[\"muted\"])\n\n# Annotate final losses\nfor net, color, va in [(net_sig, T[\"sig\"], \"bottom\"), (net_relu, T[\"relu\"], \"top\")]:\n    final = net.loss_history[-1]\n    ax.annotate(f\"  final: {final:.4f}\", xy=(EPOCHS-1, final),\n                color=color, fontsize=9, va=va)\n\nplt.tight_layout()\nplt.savefig(\"loss_curves.png\", dpi=140, bbox_inches=\"tight\")\nplt.show()<\/code><\/pre>\n<\/div>\n<\/div>\n<figure class=\"wp-block-image size-full is-resized\"><img loading=\"lazy\" decoding=\"async\" width=\"987\" height=\"489\" data-attachment-id=\"78875\" data-permalink=\"https:\/\/www.marktechpost.com\/2026\/04\/09\/sigmoid-vs-relu-activation-functions-the-inference-cost-of-losing-geometric-context\/image-411\/\" data-orig-file=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2026\/04\/image-7.png\" data-orig-size=\"987,489\" data-comments-opened=\"1\" data-image-meta='{\"aperture\":\"0\",\"credit\":\"\",\"camera\":\"\",\"caption\":\"\",\"created_timestamp\":\"0\",\"copyright\":\"\",\"focal_length\":\"0\",\"iso\":\"0\",\"shutter_speed\":\"0\",\"title\":\"\",\"orientation\":\"0\"}' data-image-title=\"image\" data-image-description=\"\" data-image-caption=\"\" data-medium-file=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2026\/04\/image-7-300x149.png\" data-large-file=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2026\/04\/image-7.png\" src=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2026\/04\/image-7.png\" alt=\"\" class=\"wp-image-78875\" \/><\/figure>\n<h3 class=\"wp-block-heading\"><strong>Decision Boundary Plots<\/strong><\/h3>\n<p>The decision boundary visualization makes the difference even more tangible. The Sigmoid network learns a nearly linear boundary, failing to capture the curved structure of the two-moons dataset, which results in lower accuracy (~79%). 
This is a direct consequence of its compressed internal representations \u2014 the network simply doesn\u2019t have enough geometric signal to construct a complex boundary.<\/p>\n<p>In contrast, the ReLU network learns a highly non-linear, well-adapted boundary that closely follows the data distribution, achieving much higher accuracy (~96%). Because ReLU preserves magnitude across layers, it enables the network to progressively bend and refine the decision surface, turning depth into actual expressive power rather than wasted capacity.<\/p>\n<div class=\"dm-code-snippet dark dm-normal-version default no-background-mobile\">\n<div class=\"control-language\">\n<div class=\"dm-buttons\">\n<div class=\"dm-buttons-left\">\n<div class=\"dm-button-snippet red-button\"><\/div>\n<div class=\"dm-button-snippet orange-button\"><\/div>\n<div class=\"dm-button-snippet green-button\"><\/div>\n<\/div>\n<div class=\"dm-buttons-right\"><a><span class=\"dm-copy-text\">Copy Code<\/span><span class=\"dm-copy-confirmed\">Copied<\/span><span class=\"dm-error-message\">Use a different Browser<\/span><\/a><\/div>\n<\/div>\n<pre class=\" no-line-numbers\"><code class=\" no-wrap language-php\">def plot_boundary(ax, net, X, y, title, color):\n    h = 0.025\n    x_min, x_max = X[:, 0].min() - 0.5, X[:, 0].max() + 0.5\n    y_min, y_max = X[:, 1].min() - 0.5, X[:, 1].max() + 0.5\n    xx, yy = np.meshgrid(np.arange(x_min, x_max, h),\n                         np.arange(y_min, y_max, h))\n    grid = np.c_[xx.ravel(), yy.ravel()]\n    Z = net.forward(grid).reshape(xx.shape)\n\n    # Soft shading\n    cmap_bg = ListedColormap([\"#fde8c8\", \"#c8ece9\"])\n    ax.contourf(xx, yy, Z, levels=50, cmap=cmap_bg, alpha=0.85)\n    ax.contour(xx, yy, Z, levels=[0.5], colors=[color], linewidths=2)\n\n    ax.scatter(X[y==0, 0], X[y==0, 1], c=T[\"c0\"], s=35,\n               edgecolors=\"white\", linewidths=0.4, alpha=0.9)\n    ax.scatter(X[y==1, 0], X[y==1, 1], c=T[\"c1\"], s=35,\n               
edgecolors=\"white\", linewidths=0.4, alpha=0.9)\n\n    acc = ((net.forward(X) &gt;= 0.5).ravel() == y).mean()\n    ax.set_title(f\"{title}\\nTest acc: {acc:.1%}\", color=color, fontsize=12)\n    ax.set_xlabel(\"x\u2081\", color=T[\"muted\"]); ax.set_ylabel(\"x\u2082\", color=T[\"muted\"])\n    ax.tick_params(colors=T[\"muted\"])\n\nfig, axes = plt.subplots(1, 2, figsize=(13, 5.5))\nfig.patch.set_facecolor(T[\"bg\"])\nfig.suptitle(\"Decision Boundaries learned on make_moons\",\n             fontsize=13, color=T[\"text\"])\n\nplot_boundary(axes[0], net_sig,  X_test, y_test, \"Sigmoid\", T[\"sig\"])\nplot_boundary(axes[1], net_relu, X_test, y_test, \"ReLU\",    T[\"relu\"])\n\nplt.tight_layout()\nplt.savefig(\"decision_boundaries.png\", dpi=140, bbox_inches=\"tight\")\nplt.show()<\/code><\/pre>\n<\/div>\n<\/div>\n<figure class=\"wp-block-image size-large is-resized\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"431\" data-attachment-id=\"78883\" data-permalink=\"https:\/\/www.marktechpost.com\/2026\/04\/09\/sigmoid-vs-relu-activation-functions-the-inference-cost-of-losing-geometric-context\/image-418\/\" data-orig-file=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2026\/04\/image-15.png\" data-orig-size=\"1289,543\" data-comments-opened=\"1\" data-image-meta='{\"aperture\":\"0\",\"credit\":\"\",\"camera\":\"\",\"caption\":\"\",\"created_timestamp\":\"0\",\"copyright\":\"\",\"focal_length\":\"0\",\"iso\":\"0\",\"shutter_speed\":\"0\",\"title\":\"\",\"orientation\":\"0\"}' data-image-title=\"image\" data-image-description=\"\" data-image-caption=\"\" data-medium-file=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2026\/04\/image-15-300x126.png\" data-large-file=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2026\/04\/image-15-1024x431.png\" src=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2026\/04\/image-15-1024x431.png\" alt=\"\" class=\"wp-image-78883\" \/><\/figure>\n<h3 
class=\"wp-block-heading\"><strong>Layer-by-Layer Signal Trace<\/strong><\/h3>\n<p>This chart tracks how the signal evolves across layers for a point far from the decision boundary \u2014 and it clearly shows where Sigmoid fails. Both networks start with similar pre-activation magnitude at the first layer (~2.0), but Sigmoid immediately compresses it to ~0.3, while ReLU retains a higher value. As we move deeper, Sigmoid continues to squash the signal into a narrow band (0.5\u20130.6), effectively erasing meaningful differences. ReLU, on the other hand, preserves and amplifies magnitude, with the final layer reaching values as high as 9\u201320.<\/p>\n<p>This means the output neuron in the ReLU network is making decisions based on a strong, well-separated signal, while the Sigmoid network is forced to classify using a weak, compressed one. The key takeaway is that ReLU preserves distance from the decision boundary across layers, allowing that information to compound, whereas Sigmoid progressively destroys it.<\/p>\n<div class=\"dm-code-snippet dark dm-normal-version default no-background-mobile\">\n<div class=\"control-language\">\n<div class=\"dm-buttons\">\n<div class=\"dm-buttons-left\">\n<div class=\"dm-button-snippet red-button\"><\/div>\n<div class=\"dm-button-snippet orange-button\"><\/div>\n<div class=\"dm-button-snippet green-button\"><\/div>\n<\/div>\n<div class=\"dm-buttons-right\"><a><span class=\"dm-copy-text\">Copy Code<\/span><span class=\"dm-copy-confirmed\">Copied<\/span><span class=\"dm-error-message\">Use a different Browser<\/span><\/a><\/div>\n<\/div>\n<pre class=\" no-line-numbers\"><code class=\" no-wrap language-php\">far_class0 = X_train[y_train == 0][np.argmax(\n    np.linalg.norm(X_train[y_train == 0] - [-1.2, -0.3], axis=1)\n)]\nfar_class1 = X_train[y_train == 1][np.argmax(\n    np.linalg.norm(X_train[y_train == 1] - [1.2, 0.3], axis=1)\n)]\n\nstage_labels = [\"z\u2081 (pre)\", \"a\u2081 (post)\", \"z\u2082 (pre)\", \"a\u2082 (post)\", 
\"z\u2083 (out)\"]\nx_pos = np.arange(len(stage_labels))\n\nfig, axes = plt.subplots(1, 2, figsize=(13, 5.5))\nfig.patch.set_facecolor(T[\"bg\"])\nfig.suptitle(\"Layer-by-layer signal magnitude -- a point far from the boundary\",\n             fontsize=12, color=T[\"text\"])\n\nfor ax, sample, title in zip(\n    axes,\n    [far_class0, far_class1],\n    [\"Class 0 sample (deep in its moon)\", \"Class 1 sample (deep in its moon)\"]\n):\n    ax.set_facecolor(T[\"panel\"])\n    sig_trace  = net_sig.get_z_trace(sample.reshape(1, -1))\n    relu_trace = net_relu.get_z_trace(sample.reshape(1, -1))\n\n    ax.plot(x_pos, sig_trace,  \"o-\", color=T[\"sig\"],  lw=2.5, markersize=8, label=\"Sigmoid\")\n    ax.plot(x_pos, relu_trace, \"s-\", color=T[\"relu\"], lw=2.5, markersize=8, label=\"ReLU\")\n\n    for i, (s, r) in enumerate(zip(sig_trace, relu_trace)):\n        ax.text(i, s - 0.06, f\"{s:.3f}\", ha=\"center\", fontsize=8, color=T[\"sig\"])\n        ax.text(i, r + 0.04, f\"{r:.3f}\", ha=\"center\", fontsize=8, color=T[\"relu\"])\n\n    ax.set_xticks(x_pos); ax.set_xticklabels(stage_labels, color=T[\"muted\"], fontsize=9)\n    ax.set_ylabel(\"Mean |activation|\", color=T[\"muted\"])\n    ax.set_title(title, color=T[\"text\"], fontsize=11)\n    ax.tick_params(colors=T[\"muted\"]); ax.legend(fontsize=10)\n\nplt.tight_layout()\nplt.savefig(\"signal_trace.png\", dpi=140, bbox_inches=\"tight\")\nplt.show()<\/code><\/pre>\n<\/div>\n<\/div>\n<figure class=\"wp-block-image size-large is-resized\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"433\" data-attachment-id=\"78880\" data-permalink=\"https:\/\/www.marktechpost.com\/2026\/04\/09\/sigmoid-vs-relu-activation-functions-the-inference-cost-of-losing-geometric-context\/image-415\/\" data-orig-file=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2026\/04\/image-12.png\" data-orig-size=\"1285,543\" data-comments-opened=\"1\" 
data-image-meta='{\"aperture\":\"0\",\"credit\":\"\",\"camera\":\"\",\"caption\":\"\",\"created_timestamp\":\"0\",\"copyright\":\"\",\"focal_length\":\"0\",\"iso\":\"0\",\"shutter_speed\":\"0\",\"title\":\"\",\"orientation\":\"0\"}' data-image-title=\"image\" data-image-description=\"\" data-image-caption=\"\" data-medium-file=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2026\/04\/image-12-300x127.png\" data-large-file=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2026\/04\/image-12-1024x433.png\" src=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2026\/04\/image-12-1024x433.png\" alt=\"\" class=\"wp-image-78880\" \/><\/figure>\n<h3 class=\"wp-block-heading\"><strong>Hidden Space Scatter<\/strong><\/h3>\n<p>This is the most important visualization because it directly exposes how each network uses (or fails to use) depth. In the Sigmoid network (left), both classes collapse into a tight, overlapping region \u2014 a diagonal smear where points are heavily entangled. The standard deviation actually decreases from layer 1 (0.26) to layer 2 (0.19), meaning the representation is becoming less expressive with depth. Each layer is compressing the signal further, stripping away the spatial structure needed to separate the classes.<\/p>\n<p>ReLU shows the opposite behavior. In layer 1, while some neurons are inactive (the \u201cdead zone\u201d), the active ones already spread across a wider range (1.15 std), indicating preserved variation. By layer 2, this expands even further (1.67 std), and the classes become clearly separable \u2014 one is pushed to high activation ranges while the other remains near zero. 
At this point, the output layer\u2019s job is trivial.<\/p>\n<div class=\"dm-code-snippet dark dm-normal-version default no-background-mobile\">\n<div class=\"control-language\">\n<div class=\"dm-buttons\">\n<div class=\"dm-buttons-left\">\n<div class=\"dm-button-snippet red-button\"><\/div>\n<div class=\"dm-button-snippet orange-button\"><\/div>\n<div class=\"dm-button-snippet green-button\"><\/div>\n<\/div>\n<div class=\"dm-buttons-right\"><a><span class=\"dm-copy-text\">Copy Code<\/span><span class=\"dm-copy-confirmed\">Copied<\/span><span class=\"dm-error-message\">Use a different Browser<\/span><\/a><\/div>\n<\/div>\n<pre class=\" no-line-numbers\"><code class=\" no-wrap language-php\">fig, axes = plt.subplots(2, 2, figsize=(13, 10))\nfig.patch.set_facecolor(T[\"bg\"])\nfig.suptitle(\"Hidden-space representations on make_moons test set\",\n             fontsize=13, color=T[\"text\"])\n\nfor col, (net, color, name) in enumerate([\n    (net_sig,  T[\"sig\"],  \"Sigmoid\"),\n    (net_relu, T[\"relu\"], \"ReLU\"),\n]):\n    for row, layer in enumerate([1, 2]):\n        ax = axes[row][col]\n        ax.set_facecolor(T[\"panel\"])\n        H = net.get_hidden(X_test, layer=layer)\n\n        ax.scatter(H[y_test==0, 0], H[y_test==0, 1], c=T[\"c0\"], s=40,\n                   edgecolors=\"white\", linewidths=0.4, alpha=0.85, label=\"Class 0\")\n        ax.scatter(H[y_test==1, 0], H[y_test==1, 1], c=T[\"c1\"], s=40,\n                   edgecolors=\"white\", linewidths=0.4, alpha=0.85, label=\"Class 1\")\n\n        spread = H.std()\n        ax.text(0.04, 0.96, f\"std: {spread:.4f}\",\n                transform=ax.transAxes, fontsize=9, va=\"top\",\n                color=T[\"text\"],\n                bbox=dict(boxstyle=\"round,pad=0.3\", fc=\"white\", ec=color, alpha=0.85))\n\n        ax.set_title(f\"{name}  --  Layer {layer} hidden space\",\n                     color=color, fontsize=11)\n        ax.set_xlabel(f\"Unit 1\", color=T[\"muted\"])\n        ax.set_ylabel(f\"Unit 
2\", color=T[\"muted\"])\n        ax.tick_params(colors=T[\"muted\"])\n        if row == 0 and col == 0: ax.legend(fontsize=9)\n\nplt.tight_layout()\nplt.savefig(\"hidden_space.png\", dpi=140, bbox_inches=\"tight\")\nplt.show()<\/code><\/pre>\n<\/div>\n<\/div>\n<figure class=\"wp-block-image size-large is-resized\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"782\" data-attachment-id=\"78882\" data-permalink=\"https:\/\/www.marktechpost.com\/2026\/04\/09\/sigmoid-vs-relu-activation-functions-the-inference-cost-of-losing-geometric-context\/image-417\/\" data-orig-file=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2026\/04\/image-14.png\" data-orig-size=\"1289,985\" data-comments-opened=\"1\" data-image-meta='{\"aperture\":\"0\",\"credit\":\"\",\"camera\":\"\",\"caption\":\"\",\"created_timestamp\":\"0\",\"copyright\":\"\",\"focal_length\":\"0\",\"iso\":\"0\",\"shutter_speed\":\"0\",\"title\":\"\",\"orientation\":\"0\"}' data-image-title=\"image\" data-image-description=\"\" data-image-caption=\"\" data-medium-file=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2026\/04\/image-14-300x229.png\" data-large-file=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2026\/04\/image-14-1024x782.png\" src=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2026\/04\/image-14-1024x782.png\" alt=\"\" class=\"wp-image-78882\" \/><\/figure>\n<figure class=\"wp-block-image size-full is-resized\"><img loading=\"lazy\" decoding=\"async\" width=\"818\" height=\"505\" data-attachment-id=\"78881\" data-permalink=\"https:\/\/www.marktechpost.com\/2026\/04\/09\/sigmoid-vs-relu-activation-functions-the-inference-cost-of-losing-geometric-context\/image-416\/\" data-orig-file=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2026\/04\/image-13.png\" data-orig-size=\"818,505\" data-comments-opened=\"1\" 
data-image-meta='{\"aperture\":\"0\",\"credit\":\"\",\"camera\":\"\",\"caption\":\"\",\"created_timestamp\":\"0\",\"copyright\":\"\",\"focal_length\":\"0\",\"iso\":\"0\",\"shutter_speed\":\"0\",\"title\":\"\",\"orientation\":\"0\"}' data-image-title=\"image\" data-image-description=\"\" data-image-caption=\"\" data-medium-file=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2026\/04\/image-13-300x185.png\" data-large-file=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2026\/04\/image-13.png\" src=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2026\/04\/image-13.png\" alt=\"\" class=\"wp-image-78881\" \/><\/figure>\n<hr class=\"wp-block-separator has-alpha-channel-opacity\" \/>\n<p>Check out the <strong><a href=\"https:\/\/github.com\/Marktechpost\/AI-Tutorial-Codes-Included\/blob\/main\/Deep%20Learning\/Sigmoid_Relu.ipynb\" target=\"_blank\" rel=\"noreferrer noopener\">full codes here<\/a>.<\/strong><\/p>\n<p>The post <a href=\"https:\/\/www.marktechpost.com\/2026\/04\/09\/sigmoid-vs-relu-activation-functions-the-inference-cost-of-losing-geometric-context\/\">Sigmoid vs ReLU Activation Functions: The Inference Cost of Losing Geometric Context<\/a> appeared first on <a href=\"https:\/\/www.marktechpost.com\/\">MarkTechPost<\/a>.<\/p>","protected":false},"excerpt":{"rendered":"<p>A deep neural network can be u&hellip;<\/p>\n","protected":false},"author":1,"featured_media":684,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[1],"tags":[],"class_list":["post-683","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-uncategorized"],"_links":{"self":[{"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=\/wp\/v2\/posts\/683","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=683"}],"version-history":[{"count":0,"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=\/wp\/v2\/posts\/683\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/connectword.dpdn
s.org\/index.php?rest_route=\/wp\/v2\/media\/684"}],"wp:attachment":[{"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=683"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=683"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=683"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}