{"id":847,"date":"2026-05-05T15:26:29","date_gmt":"2026-05-05T07:26:29","guid":{"rendered":"https:\/\/connectword.dpdns.org\/?p=847"},"modified":"2026-05-05T15:26:29","modified_gmt":"2026-05-05T07:26:29","slug":"why-gradient-descent-zigzags-and-how-momentum-fixes-it","status":"publish","type":"post","link":"https:\/\/connectword.dpdns.org\/?p=847","title":{"rendered":"Why Gradient Descent Zigzags and How Momentum Fixes It"},"content":{"rendered":"<p>Gradient descent has a fundamental limitation: on most real-world loss surfaces, it is inefficient. When the surface has uneven curvature\u2014steep in one direction and flat in another, which is common in practice\u2014the algorithm struggles to make consistent progress. A high learning rate helps move faster along the flat direction but causes overshooting and oscillations along the steep direction. Reducing the learning rate stabilizes the updates but significantly slows convergence. This trade-off is not rare; it is typical behavior for standard gradient descent.<\/p>\n<p>Momentum addresses this issue by incorporating information from past gradients. Instead of relying only on the current gradient, it maintains a running average (often called velocity) and updates parameters based on this accumulated direction. 
As a result, consistent gradients reinforce each other, allowing faster movement across flat regions, while oscillating gradients tend to cancel out, reducing instability.<\/p>\n<p>In this article, we walk through exactly how this works: the update equations and a from-scratch simulation on a controlled anisotropic surface that lets us measure the difference precisely. Vanilla GD needs 185 steps to bring the loss below 0.001, momentum with \u03b2 = 0.90 needs 159, and \u03b2 = 0.99 does not reach that threshold within the step budget.<\/p>\n<figure class=\"wp-block-image size-full is-resized\"><img loading=\"lazy\" decoding=\"async\" width=\"1064\" height=\"783\" data-attachment-id=\"79528\" data-permalink=\"https:\/\/www.marktechpost.com\/2026\/05\/05\/why-gradient-descent-zigzags-and-how-momentum-fixes-it\/image-484\/\" data-orig-file=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2026\/05\/image-13.png\" data-orig-size=\"1064,783\" data-comments-opened=\"0\" data-image-meta='{\"aperture\":\"0\",\"credit\":\"\",\"camera\":\"\",\"caption\":\"\",\"created_timestamp\":\"0\",\"copyright\":\"\",\"focal_length\":\"0\",\"iso\":\"0\",\"shutter_speed\":\"0\",\"title\":\"\",\"orientation\":\"0\"}' data-image-title=\"image\" data-image-description=\"\" data-image-caption=\"\" data-large-file=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2026\/05\/image-13-1024x754.png\" src=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2026\/05\/image-13.png\" alt=\"\" class=\"wp-image-79528\" \/><\/figure>\n<figure class=\"wp-block-image size-full is-resized\"><img loading=\"lazy\" decoding=\"async\" width=\"1073\" height=\"677\" data-attachment-id=\"79525\" data-permalink=\"https:\/\/www.marktechpost.com\/2026\/05\/05\/why-gradient-descent-zigzags-and-how-momentum-fixes-it\/image-481\/\" data-orig-file=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2026\/05\/image-11.png\" data-orig-size=\"1073,677\" data-comments-opened=\"0\" 
data-image-meta='{\"aperture\":\"0\",\"credit\":\"\",\"camera\":\"\",\"caption\":\"\",\"created_timestamp\":\"0\",\"copyright\":\"\",\"focal_length\":\"0\",\"iso\":\"0\",\"shutter_speed\":\"0\",\"title\":\"\",\"orientation\":\"0\"}' data-image-title=\"image\" data-image-description=\"\" data-image-caption=\"\" data-large-file=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2026\/05\/image-11-1024x646.png\" src=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2026\/05\/image-11.png\" alt=\"\" class=\"wp-image-79525\" \/><\/figure>\n<h3 class=\"wp-block-heading\"><strong>Setting up the dependencies<\/strong><\/h3>\n<div class=\"dm-code-snippet dark dm-normal-version default no-background-mobile\">\n<div class=\"control-language\">\n<div class=\"dm-buttons\">\n<div class=\"dm-buttons-left\">\n<div class=\"dm-button-snippet red-button\"><\/div>\n<div class=\"dm-button-snippet orange-button\"><\/div>\n<div class=\"dm-button-snippet green-button\"><\/div>\n<\/div>\n<div class=\"dm-buttons-right\"><a><span class=\"dm-copy-text\">Copy Code<\/span><span class=\"dm-copy-confirmed\">Copied<\/span><span class=\"dm-error-message\">Use a different Browser<\/span><\/a><\/div>\n<\/div>\n<pre class=\" no-line-numbers\"><code class=\" no-wrap language-php\">import numpy as np\nimport matplotlib.pyplot as plt\nfrom matplotlib.gridspec import GridSpec<\/code><\/pre>\n<\/div>\n<\/div>\n<h3 class=\"wp-block-heading\"><strong>Defining the Loss Surface<\/strong><\/h3>\n<p>The loss surface is a stretched bowl \u2014 flat along one axis, steep along the other. This is controlled by the two coefficients: 0.05 in x makes that direction nearly flat, while 5 in y makes it steep. The gradients reflect this directly \u2014 0.1\u00b7x in the flat direction, 10\u00b7y in the steep one.<\/p>\n<p>The Hessian of this surface is diagonal with eigenvalues 0.1 and 10, giving a condition number of 100. 
That number is the core of the problem: it tells you the surface is 100\u00d7 more curved in one direction than the other, which is what forces GD into its zigzag behavior.<\/p>\n<p>The learning rate of 0.18 is chosen deliberately. The stability limit for GD is 2 \/ \u03bb_max = 2 \/ 10 = 0.2 \u2014 any higher and the optimizer diverges outright. At 0.18, the steep axis update factor is |1 \u2212 10 \u00d7 0.18| = 0.8, meaning the optimizer overshoots and reverses direction every single step. The flat axis factor is |1 \u2212 0.1 \u00d7 0.18| = 0.982, meaning it recovers only 1.8% of the remaining distance per step. This is the worst-case combination Momentum is built for: oscillation in one direction, near-stagnation in the other.<\/p>\n<div class=\"dm-code-snippet dark dm-normal-version default no-background-mobile\">\n<div class=\"control-language\">\n<div class=\"dm-buttons\">\n<div class=\"dm-buttons-left\">\n<div class=\"dm-button-snippet red-button\"><\/div>\n<div class=\"dm-button-snippet orange-button\"><\/div>\n<div class=\"dm-button-snippet green-button\"><\/div>\n<\/div>\n<div class=\"dm-buttons-right\"><a><span class=\"dm-copy-text\">Copy Code<\/span><span class=\"dm-copy-confirmed\">Copied<\/span><span class=\"dm-error-message\">Use a different Browser<\/span><\/a><\/div>\n<\/div>\n<pre class=\" no-line-numbers\"><code class=\" no-wrap language-php\">def loss(x, y):\n    return 0.05 * x**2 + 5 * y**2\n \ndef grad(x, y):\n    return np.array([0.1 * x, 10 * y])<\/code><\/pre>\n<\/div>\n<\/div>\n<h3 class=\"wp-block-heading\"><strong>Optimizers<\/strong><\/h3>\n<p>Both methods follow the same overall process: start from an initial position, take a fixed number of steps, and track how the position changes. The key difference lies in how each step is computed.<\/p>\n<p>Vanilla gradient descent is very simple. At each step, it updates the position by subtracting the gradient scaled by the learning rate. It does not remember anything from previous steps. 
This is why oscillations occur\u2014if the gradient direction keeps changing (up, then down), the updates simply follow that pattern with no mechanism to smooth it out.<\/p>\n<p>Momentum introduces one additional term: velocity (v), which is initially zero. Instead of using only the current gradient, it updates this velocity by combining the previous velocity and the new gradient. The parameter \u03b2 controls how much weight is given to past information. A higher \u03b2 (e.g., 0.9) means the update relies more on past gradients, while a lower \u03b2 makes it behave more like standard gradient descent.<\/p>\n<p>This averaging effect behaves differently across directions. In steep directions, where gradients frequently change sign, the updates tend to cancel each other out, reducing oscillations. In flatter directions, where gradients are more consistent, they accumulate over time, allowing the optimizer to move faster.<\/p>\n<p>Finally, the position is updated using this velocity instead of the raw gradient. 
This results in smoother and more stable progress toward the minimum.<\/p>\n<div class=\"dm-code-snippet dark dm-normal-version default no-background-mobile\">\n<div class=\"control-language\">\n<div class=\"dm-buttons\">\n<div class=\"dm-buttons-left\">\n<div class=\"dm-button-snippet red-button\"><\/div>\n<div class=\"dm-button-snippet orange-button\"><\/div>\n<div class=\"dm-button-snippet green-button\"><\/div>\n<\/div>\n<div class=\"dm-buttons-right\"><a><span class=\"dm-copy-text\">Copy Code<\/span><span class=\"dm-copy-confirmed\">Copied<\/span><span class=\"dm-error-message\">Use a different Browser<\/span><\/a><\/div>\n<\/div>\n<pre class=\" no-line-numbers\"><code class=\" no-wrap language-php\">def gradient_descent(start, lr, steps=300):\n    \"\"\"\n    Vanilla GD:  \u03b8 \u2190 \u03b8 \u2212 lr \u00b7 \u2207L(\u03b8)\n \n    Each update depends only on the current gradient.\n    No memory of past gradients -- oscillations persist.\n    \"\"\"\n    path = [np.array(start, dtype=float)]\n    pos  = np.array(start, dtype=float)\n    for _ in range(steps):\n        pos = pos - lr * grad(*pos)\n        path.append(pos.copy())\n    return np.array(path)\n \n \ndef momentum_gd(start, lr, beta, steps=300):\n    \"\"\"\n    Momentum GD:\n        v \u2190 \u03b2\u00b7v + (1\u2212\u03b2)\u00b7\u2207L(\u03b8)\n        \u03b8 \u2190 \u03b8 \u2212 lr\u00b7v\n \n    v is a weighted running average of past gradients (exponential moving avg).\n \n    Why it helps:\n      - In y: gradients alternate sign \u2192 they cancel in v \u2192 oscillations damped.\n      - In x: gradients share the same sign \u2192 they accumulate in v \u2192 faster steps.\n \n    \u03b2 controls memory length. High \u03b2 \u2192 longer memory \u2192 more smoothing (and risk\n    of overshooting). 
Low \u03b2 \u2192 shorter memory \u2192 closer to vanilla GD.\n    \"\"\"\n    path = [np.array(start, dtype=float)]\n    pos  = np.array(start, dtype=float)\n    v    = np.zeros(2)\n    for _ in range(steps):\n        g   = grad(*pos)\n        v   = beta * v + (1 - beta) * g\n        pos = pos - lr * v\n        path.append(pos.copy())\n    return np.array(path)<\/code><\/pre>\n<\/div>\n<\/div>\n<h3 class=\"wp-block-heading\"><strong>Running all three scenarios<\/strong><\/h3>\n<p>All three experiments start from the same point (\u22124.0, 1.5), use the same learning rate, and run for 300 steps. The only difference is the use of momentum and the value of \u03b2. Instead of just recording the final position, the full trajectory is stored for each run, which allows us to analyze how the optimizer moves over time. Vanilla gradient descent progresses slowly with a zigzag pattern and reaches a final loss of 0.000015. Momentum with \u03b2 = 0.90 performs more efficiently, reducing oscillations and building speed in the right direction, ultimately achieving a lower loss of 0.000001 within the same number of steps.<\/p>\n<p>However, momentum is sensitive to the choice of \u03b2. When \u03b2 is set too high (e.g., 0.99), the velocity decays very slowly, so the optimizer overshoots the minimum and keeps swinging back and forth around it. The result is a much higher final loss of 0.487363 after 300 steps. The iterates are not blowing up; the oscillation shrinks so slowly that the run does not converge within the step budget. 
These results highlight that while momentum can significantly improve convergence, it must be carefully tuned: too little offers no benefit over standard gradient descent, while too much causes long-lived oscillation around the minimum.<\/p>\n<div class=\"dm-code-snippet dark dm-normal-version default no-background-mobile\">\n<div class=\"control-language\">\n<div class=\"dm-buttons\">\n<div class=\"dm-buttons-left\">\n<div class=\"dm-button-snippet red-button\"><\/div>\n<div class=\"dm-button-snippet orange-button\"><\/div>\n<div class=\"dm-button-snippet green-button\"><\/div>\n<\/div>\n<div class=\"dm-buttons-right\"><a><span class=\"dm-copy-text\">Copy Code<\/span><span class=\"dm-copy-confirmed\">Copied<\/span><span class=\"dm-error-message\">Use a different Browser<\/span><\/a><\/div>\n<\/div>\n<pre class=\" no-line-numbers\"><code class=\" no-wrap language-php\">START = [-4.0, 1.5]\nLR    = 0.18\nSTEPS = 300\n \npath_gd        = gradient_descent(START, lr=LR,           steps=STEPS)\npath_mom_good  = momentum_gd(START,      lr=LR, beta=0.90, steps=STEPS)\npath_mom_large = momentum_gd(START,      lr=LR, beta=0.99, steps=STEPS)\n \nprint(f\"Vanilla GD        -- final loss: {loss(*path_gd[-1]):.6f}\")\nprint(f\"Momentum \u03b2=0.90   -- final loss: {loss(*path_mom_good[-1]):.6f}\")\nprint(f\"Momentum \u03b2=0.99   -- final loss: {loss(*path_mom_large[-1]):.6f}  \u2190 not converged\")<\/code><\/pre>\n<\/div>\n<\/div>\n<figure class=\"wp-block-image size-full is-resized\"><img loading=\"lazy\" decoding=\"async\" width=\"799\" height=\"860\" data-attachment-id=\"79527\" data-permalink=\"https:\/\/www.marktechpost.com\/2026\/05\/05\/why-gradient-descent-zigzags-and-how-momentum-fixes-it\/image-483\/\" data-orig-file=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2026\/05\/image-12.png\" data-orig-size=\"799,860\" data-comments-opened=\"0\" 
data-image-meta='{\"aperture\":\"0\",\"credit\":\"\",\"camera\":\"\",\"caption\":\"\",\"created_timestamp\":\"0\",\"copyright\":\"\",\"focal_length\":\"0\",\"iso\":\"0\",\"shutter_speed\":\"0\",\"title\":\"\",\"orientation\":\"0\"}' data-image-title=\"image\" data-image-description=\"\" data-image-caption=\"\" data-large-file=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2026\/05\/image-12.png\" src=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2026\/05\/image-12.png\" alt=\"\" class=\"wp-image-79527\" \/><\/figure>\n<h3 class=\"wp-block-heading\"><strong>Visualizing the Result<\/strong><\/h3>\n<p>The visualization is divided into two parts. The top row shows the first 55 steps of each optimizer plotted on the contour map of the loss surface, making it easier to observe movement patterns without clutter. The bottom row displays the full 300-step loss curves on a log scale, allowing a clear comparison of convergence speed over the entire run.<\/p>\n<p>From the contour plots, the behavior is immediately clear. Vanilla gradient descent oscillates heavily along one direction, making very slow progress toward the minimum. Momentum with \u03b2 = 0.90 stabilizes these oscillations and follows a smoother, more direct path. In contrast, \u03b2 = 0.99 leads to persistent bouncing with almost no progress. 
The loss curves confirm this: vanilla GD decreases steadily but slowly, \u03b2 = 0.90 converges faster and more efficiently, while \u03b2 = 0.99 shows repeated spikes due to overshooting and fails to converge within the given steps.<\/p>\n<div class=\"dm-code-snippet dark dm-normal-version default no-background-mobile\">\n<div class=\"control-language\">\n<div class=\"dm-buttons\">\n<div class=\"dm-buttons-left\">\n<div class=\"dm-button-snippet red-button\"><\/div>\n<div class=\"dm-button-snippet orange-button\"><\/div>\n<div class=\"dm-button-snippet green-button\"><\/div>\n<\/div>\n<div class=\"dm-buttons-right\"><a><span class=\"dm-copy-text\">Copy Code<\/span><span class=\"dm-copy-confirmed\">Copied<\/span><span class=\"dm-error-message\">Use a different Browser<\/span><\/a><\/div>\n<\/div>\n<pre class=\" no-line-numbers\"><code class=\" no-wrap language-php\">PLOT_STEPS = 55\n \nx_ = np.linspace(-5, 5, 500)\ny_ = np.linspace(-2.2, 2.2, 500)\nX, Y = np.meshgrid(x_, y_)\nZ    = loss(X, Y)\n \nfig = plt.figure(figsize=(16, 10), facecolor=\"#FAFAF8\")\ngs  = GridSpec(2, 3, figure=fig, hspace=0.45, wspace=0.38,\n               left=0.07, right=0.97, top=0.88, bottom=0.08)\n \nCOLORS = {\n    \"gd\":        \"#E05C4B\",\n    \"mom_good\":  \"#3A7CA5\",\n    \"mom_large\": \"#F4A536\",\n    \"contour\":   \"#D4C9B8\",\n    \"minima\":    \"#2A9D5C\",\n    \"start\":     \"#444444\",\n}\n \nPANEL_TITLES = [\n    \"Vanilla Gradient Descent\\nOscillates, slow  (185 steps to converge)\",\n    \"Momentum  \u03b2 = 0.90\\nSmooth, fast  (159 steps to converge)\",\n    \"Momentum  \u03b2 = 0.99 (too large)\\nOvershoots -- not converged in 300 steps\",\n]\n \npaths_plot = [\n    path_gd[:PLOT_STEPS+1],\n    path_mom_good[:PLOT_STEPS+1],\n    path_mom_large[:PLOT_STEPS+1],\n]\ncolors = [COLORS[\"gd\"], COLORS[\"mom_good\"], COLORS[\"mom_large\"]]\n \n# top row: trajectory panels\nfor col, (path, color, title) in enumerate(zip(paths_plot, colors, PANEL_TITLES)):\n    ax = fig.add_subplot(gs[0, 
col])\n    ax.set_facecolor(\"#F5F3EE\")\n \n    levels = np.geomspace(0.005, 3.5, 28)\n    ax.contour(X, Y, Z, levels=levels, colors=COLORS[\"contour\"],\n               linewidths=0.7, alpha=0.9)\n \n    ax.plot(path[:, 0], path[:, 1], color=color, lw=1.8, alpha=0.85, zorder=3)\n    ax.scatter(path[:, 0], path[:, 1], color=color, s=18, zorder=4, alpha=0.6)\n \n    ax.scatter(*path[0],  marker=\"o\", s=90,  color=COLORS[\"start\"],  zorder=5, label=\"start\")\n    ax.scatter(*path[-1], marker=\"*\", s=120, color=COLORS[\"minima\"], zorder=5, label=\"end\")\n    ax.scatter(0, 0, marker=\"+\", s=200, color=COLORS[\"minima\"], linewidths=2.5, zorder=6)\n \n    ax.set_xlim(-5, 5)\n    ax.set_ylim(-2.2, 2.2)\n    ax.set_title(title, fontsize=9.5, fontweight=\"bold\", color=\"#222\", pad=7, loc=\"left\")\n    ax.set_xlabel(\"\u03b8\u2081  (slow direction)\", fontsize=8, color=\"#666\")\n    ax.set_ylabel(\"\u03b8\u2082  (fast direction)\", fontsize=8, color=\"#666\")\n    ax.tick_params(labelsize=7, colors=\"#888\")\n    for spine in ax.spines.values():\n        spine.set_edgecolor(\"#CCCCCC\")\n \n# bottom-left: loss curves (full 300 steps)\nax_loss = fig.add_subplot(gs[1, :2])\nax_loss.set_facecolor(\"#F5F3EE\")\n \nfull_paths  = [path_gd, path_mom_good, path_mom_large]\nfull_labels = [\"Vanilla GD  (185 steps)\", \"Momentum \u03b2=0.90  (159 steps)\", \"Momentum \u03b2=0.99  (not converged)\"]\n \nfor path, color, label in zip(full_paths, colors, full_labels):\n    losses = [loss(*p) for p in path]\n    steps_range = np.arange(len(path))\n    ax_loss.plot(steps_range, losses, color=color, lw=2, label=label, alpha=0.9)\n \nax_loss.axhline(0.001, color=\"#999\", lw=1, ls=\"--\", alpha=0.6)\nax_loss.text(305, 0.001, \"convergence\\nthreshold\", fontsize=7, color=\"#888\", va=\"center\")\n \nax_loss.set_yscale(\"log\")\nax_loss.set_xlim(0, STEPS)\nax_loss.set_title(\"Loss vs. 
Optimisation Step  (log scale, 300 steps)\",\n                  fontsize=10.5, fontweight=\"bold\", color=\"#222\", loc=\"left\")\nax_loss.set_xlabel(\"Step\", fontsize=9, color=\"#666\")\nax_loss.set_ylabel(\"Loss  f(\u03b8)\", fontsize=9, color=\"#666\")\nax_loss.legend(fontsize=8.5, framealpha=0.6)\nax_loss.tick_params(labelsize=8, colors=\"#888\")\nfor spine in ax_loss.spines.values():\n    spine.set_edgecolor(\"#CCCCCC\")\n \n# bottom-right: annotation panel\nax_ann = fig.add_subplot(gs[1, 2])\nax_ann.set_facecolor(\"#F5F3EE\")\nax_ann.axis(\"off\")\n \nannotation = (\n    \"Update rules\\n\\n\"\n    \"Vanilla GD\\n\"\n    \"  \u03b8 \u2190 \u03b8 \u2212 \u03b1\u00b7\u2207L(\u03b8)\\n\\n\"\n    \"Momentum GD\\n\"\n    \"  v \u2190 \u03b2\u00b7v + (1\u2212\u03b2)\u00b7\u2207L(\u03b8)\\n\"\n    \"  \u03b8 \u2190 \u03b8 \u2212 \u03b1\u00b7v\\n\\n\"\n    \"Key intuition\\n\"\n    \"  v accumulates past gradients.\\n\"\n    \"  Vertical oscillations cancel out.\\n\"\n    \"  Horizontal steps compound.\\n\\n\"\n    \"Hyperparameter \u03b2\\n\"\n    \"  \u03b2 \u2192 0  :  behaves like GD\\n\"\n    \"  \u03b2 = 0.9:  typical sweet spot\\n\"\n    \"  \u03b2 \u2192 1  :  overshoots \/ slow to settle\"\n)\nax_ann.text(0.05, 0.97, annotation, transform=ax_ann.transAxes,\n            fontsize=8.8, va=\"top\", ha=\"left\",\n            fontfamily=\"monospace\", color=\"#333\", linespacing=1.7)\n \nfig.suptitle(\"Momentum in Gradient Descent\",\n             fontsize=16, fontweight=\"bold\", color=\"#111\", y=0.95)\n \nplt.savefig(\"momentum_explainer.png\", dpi=150, bbox_inches=\"tight\",\n            facecolor=fig.get_facecolor())\nplt.show()<\/code><\/pre>\n<\/div>\n<\/div>\n<figure class=\"wp-block-image size-full is-resized\"><img loading=\"lazy\" decoding=\"async\" width=\"1515\" height=\"928\" data-attachment-id=\"79529\" data-permalink=\"https:\/\/www.marktechpost.com\/2026\/05\/05\/why-gradient-descent-zigzags-and-how-momentum-fixes-it\/image-485\/\" 
data-orig-file=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2026\/05\/image-14.png\" data-orig-size=\"1515,928\" data-comments-opened=\"0\" data-image-meta='{\"aperture\":\"0\",\"credit\":\"\",\"camera\":\"\",\"caption\":\"\",\"created_timestamp\":\"0\",\"copyright\":\"\",\"focal_length\":\"0\",\"iso\":\"0\",\"shutter_speed\":\"0\",\"title\":\"\",\"orientation\":\"0\"}' data-image-title=\"image\" data-image-description=\"\" data-image-caption=\"\" data-large-file=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2026\/05\/image-14-1024x627.png\" src=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2026\/05\/image-14.png\" alt=\"\" class=\"wp-image-79529\" \/><\/figure>\n<h3 class=\"wp-block-heading\"><strong>\u03b2 sensitivity sweep<\/strong><\/h3>\n<p>The experiment runs the momentum optimizer multiple times with different \u03b2 values, each for up to 500 steps. For every run, it checks when the loss first drops below 0.001 and records that step as the convergence point. \u03b2 = 0 serves as a baseline, since it removes the effect of momentum and behaves exactly like vanilla gradient descent.<\/p>\n<p>The results show a clear pattern. As \u03b2 increases from 0.0 to 0.95, convergence steadily improves, with fewer steps needed each time. This happens because higher \u03b2 values better smooth out oscillations and build useful momentum in the right direction. However, at \u03b2 = 0.99, performance drops sharply. The optimizer becomes too slow to adjust because it relies too heavily on past gradients, leading to excessive overshooting and delayed convergence. 
Overall, this creates an inverted U-shaped relationship: moderate \u03b2 values (around 0.9\u20130.95) give the best performance, while values that are too high can hurt convergence significantly.<\/p>\n<div class=\"dm-code-snippet dark dm-normal-version default no-background-mobile\">\n<div class=\"control-language\">\n<div class=\"dm-buttons\">\n<div class=\"dm-buttons-left\">\n<div class=\"dm-button-snippet red-button\"><\/div>\n<div class=\"dm-button-snippet orange-button\"><\/div>\n<div class=\"dm-button-snippet green-button\"><\/div>\n<\/div>\n<div class=\"dm-buttons-right\"><a><span class=\"dm-copy-text\">Copy Code<\/span><span class=\"dm-copy-confirmed\">Copied<\/span><span class=\"dm-error-message\">Use a different Browser<\/span><\/a><\/div>\n<\/div>\n<pre class=\" no-line-numbers\"><code class=\" no-wrap language-php\">THRESHOLD = 0.001\nbetas = [0.0, 0.5, 0.7, 0.85, 0.90, 0.95, 0.99]\n \nprint(f\"\\n\u2500\u2500 \u03b2 Sensitivity  (steps to loss &lt; {THRESHOLD}) \u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\")\nprint(f\"{'\u03b2':&gt;6}  {'steps':&gt;10}  note\")\nprint(\"\u2500\" * 46)\n \nfor b in betas:\n    path = momentum_gd(START, lr=LR, beta=b, steps=500)\n    losses = [loss(*p) for p in path]\n    hit = next((i for i, l in enumerate(losses) if l &lt; THRESHOLD), None)\n    note = \"\"\n    if b == 0.0:  note = \"\u2190 equivalent to vanilla GD\"\n    elif b == 0.90: note = \"\u2190 typical sweet spot\"\n    elif b == 0.99: note = \"\u2190 overshoots, slow to settle\"\n    status = f\"{hit:&gt;6} steps\" if hit is not None else \"  did not converge\"\n    print(f\"{b:&gt;6.2f}  {status}  {note}\")<\/code><\/pre>\n<\/div>\n<\/div>\n<figure class=\"wp-block-image size-full is-resized\"><img loading=\"lazy\" decoding=\"async\" width=\"1065\" height=\"796\" data-attachment-id=\"79530\" data-permalink=\"https:\/\/www.marktechpost.com\/2026\/05\/05\/why-gradient-descent-zigzags-and-how-momentum-fixes-it\/image-486\/\" 
data-orig-file=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2026\/05\/image-15.png\" data-orig-size=\"1065,796\" data-comments-opened=\"0\" data-image-meta='{\"aperture\":\"0\",\"credit\":\"\",\"camera\":\"\",\"caption\":\"\",\"created_timestamp\":\"0\",\"copyright\":\"\",\"focal_length\":\"0\",\"iso\":\"0\",\"shutter_speed\":\"0\",\"title\":\"\",\"orientation\":\"0\"}' data-image-title=\"image\" data-image-description=\"\" data-image-caption=\"\" data-large-file=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2026\/05\/image-15-1024x765.png\" src=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2026\/05\/image-15.png\" alt=\"\" class=\"wp-image-79530\" \/><\/figure>\n<hr class=\"wp-block-separator has-alpha-channel-opacity\" \/>\n<p>Check out\u00a0the\u00a0<strong><a href=\"https:\/\/github.com\/Marktechpost\/AI-Agents-Projects-Tutorials\/blob\/main\/Data%20Science\/Momentum_Gradient_Descent.ipynb\" target=\"_blank\" rel=\"noreferrer noopener\">Codes with Notebook here<\/a><\/strong>.<\/p>\n<p>The post <a href=\"https:\/\/www.marktechpost.com\/2026\/05\/05\/why-gradient-descent-zigzags-and-how-momentum-fixes-it\/\">Why Gradient Descent Zigzags and How Momentum Fixes It<\/a> appeared first on <a href=\"https:\/\/www.marktechpost.com\/\">MarkTechPost<\/a>.<\/p>","protected":false},"excerpt":{"rendered":"<p>Gradient descent has a fundame&hellip;<\/p>\n","protected":false},"author":1,"featured_media":848,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[1],"tags":[],"class_list":["post-847","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-uncategorized"],"_links":{"self":[{"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=\/wp\/v2\/posts\/847","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=847"}],"version-history":[{"count":0,"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=\/wp\/v2\/posts\/847\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=\/wp\/v2\/media\/848"}],"wp:att
achment":[{"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=847"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=847"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=847"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}