{"id":932,"date":"2026-05-19T04:18:55","date_gmt":"2026-05-18T20:18:55","guid":{"rendered":"https:\/\/connectword.dpdns.org\/?p=932"},"modified":"2026-05-19T04:18:55","modified_gmt":"2026-05-18T20:18:55","slug":"stochastic-gradient-descent-sgds-frequency-bias-and-how-adam-fixes-it","status":"publish","type":"post","link":"https:\/\/connectword.dpdns.org\/?p=932","title":{"rendered":"Stochastic Gradient Descent (SGD\u2019s) Frequency Bias and How Adam Fixes It\u00a0"},"content":{"rendered":"<p>Modern language models are trained on data with extremely uneven token distributions. A small number of words appear in almost every sentence, while many rare but meaningful tokens occur only occasionally. This creates a hidden optimization challenge: parameters associated with common tokens receive gradient updates at almost every step, while parameters tied to rare tokens may go hundreds or thousands of steps without receiving any meaningful signal. Under standard Stochastic Gradient Descent (SGD), every parameter uses the same learning rate, so frequently updated weights converge quickly while rare-token weights often remain close to their initial values.<\/p>\n<p>This is where Adam\u2019s adaptive optimization becomes important. Adam is often summarized as SGD with momentum, but the feature that matters most here is its per-parameter normalization by a running average of squared gradients, which this article calls variance normalization. Adam tracks gradient statistics for each parameter independently and scales each update by the typical magnitude of that parameter\u2019s recent gradients. Parameters that rarely receive gradients accumulate a small squared-gradient average and therefore get proportionally larger effective learning rates, allowing underrepresented features to learn much faster than they would under vanilla SGD.<\/p>\n<p>To demonstrate this behavior concretely, we build a controlled NumPy experiment using a six-token vocabulary spanning roughly three orders of magnitude in frequency, from tokens appearing in nearly every sample to tokens appearing in only 0.1% of samples. 
We train the same linear model twice, once with SGD and once with Adam, while keeping all target weights identical. By comparing final parameter values, non-zero gradient counts, and Adam\u2019s effective learning rates for each token, we can directly observe how adaptive optimization compensates for frequency imbalance in real training dynamics.<\/p>\n<div class=\"wp-block-image\">\n<figure class=\"aligncenter size-full is-resized\"><img loading=\"lazy\" decoding=\"async\" width=\"914\" height=\"635\" data-attachment-id=\"79945\" data-permalink=\"https:\/\/www.marktechpost.com\/2026\/05\/18\/stochastic-gradient-descent-sgds-frequency-bias-and-how-adam-fixes-it\/image-509\/\" data-orig-file=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2026\/05\/image-36.png\" data-orig-size=\"914,635\" data-comments-opened=\"0\" data-image-meta='{\"aperture\":\"0\",\"credit\":\"\",\"camera\":\"\",\"caption\":\"\",\"created_timestamp\":\"0\",\"copyright\":\"\",\"focal_length\":\"0\",\"iso\":\"0\",\"shutter_speed\":\"0\",\"title\":\"\",\"orientation\":\"0\"}' data-image-title=\"image\" data-image-description=\"\" data-image-caption=\"\" data-large-file=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2026\/05\/image-36.png\" src=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2026\/05\/image-36.png\" alt=\"\" class=\"wp-image-79945\" \/><\/figure>\n<\/div>\n<div class=\"wp-block-image\">\n<figure class=\"aligncenter size-full is-resized\"><img loading=\"lazy\" decoding=\"async\" width=\"913\" height=\"301\" data-attachment-id=\"79946\" data-permalink=\"https:\/\/www.marktechpost.com\/2026\/05\/18\/stochastic-gradient-descent-sgds-frequency-bias-and-how-adam-fixes-it\/image-510\/\" data-orig-file=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2026\/05\/image-37.png\" data-orig-size=\"913,301\" data-comments-opened=\"0\" 
data-image-meta='{\"aperture\":\"0\",\"credit\":\"\",\"camera\":\"\",\"caption\":\"\",\"created_timestamp\":\"0\",\"copyright\":\"\",\"focal_length\":\"0\",\"iso\":\"0\",\"shutter_speed\":\"0\",\"title\":\"\",\"orientation\":\"0\"}' data-image-title=\"image\" data-image-description=\"\" data-image-caption=\"\" data-large-file=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2026\/05\/image-37.png\" src=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2026\/05\/image-37.png\" alt=\"\" class=\"wp-image-79946\" \/><\/figure>\n<\/div>\n<h1 class=\"wp-block-heading\">Setting up the dependencies<\/h1>\n<p>We begin by constructing a deliberately simplified training environment that isolates a single factor: token frequency. The vocabulary contains six tokens ranging from extremely common words like \u201cthe\u201d to very rare tokens like \u201cthalweg,\u201d with appearance probabilities spanning roughly three orders of magnitude (0.95 down to 0.001). Every token is assigned the same ground-truth importance (the correct weight for all tokens is set to 1.0), so the experiment removes semantic complexity and focuses entirely on how often each parameter receives gradient updates.<\/p>\n<p>Each training sample is represented as a sparse binary vector indicating which tokens are present in that sample. The target value is simply the sum of the active token weights plus a small amount of noise. We then train a small linear model using this synthetic dataset. Because a weight\u2019s gradient is non-zero only when its token appears in at least one sample of the batch, rare tokens naturally receive far fewer updates than common ones. 
This setup creates a clean environment for observing how SGD and Adam behave under highly imbalanced gradient exposure.<\/p>\n<div class=\"dm-code-snippet dark dm-normal-version default no-background-mobile\">\n<div class=\"control-language\">\n<div class=\"dm-buttons\">\n<div class=\"dm-buttons-left\">\n<div class=\"dm-button-snippet red-button\"><\/div>\n<div class=\"dm-button-snippet orange-button\"><\/div>\n<div class=\"dm-button-snippet green-button\"><\/div>\n<\/div>\n<div class=\"dm-buttons-right\"><a><span class=\"dm-copy-text\">Copy Code<\/span><span class=\"dm-copy-confirmed\">Copied<\/span><span class=\"dm-error-message\">Use a different Browser<\/span><\/a><\/div>\n<\/div>\n<pre class=\" no-line-numbers\"><code class=\" no-wrap language-php\">import numpy as np\nimport matplotlib.pyplot as plt\nimport matplotlib.gridspec as gridspec\n\nnp.random.seed(42)<\/code><\/pre>\n<\/div>\n<\/div>\n<div class=\"dm-code-snippet dark dm-normal-version default no-background-mobile\">\n<div class=\"control-language\">\n<div class=\"dm-buttons\">\n<div class=\"dm-buttons-left\">\n<div class=\"dm-button-snippet red-button\"><\/div>\n<div class=\"dm-button-snippet orange-button\"><\/div>\n<div class=\"dm-button-snippet green-button\"><\/div>\n<\/div>\n<div class=\"dm-buttons-right\"><a><span class=\"dm-copy-text\">Copy Code<\/span><span class=\"dm-copy-confirmed\">Copied<\/span><span class=\"dm-error-message\">Use a different Browser<\/span><\/a><\/div>\n<\/div>\n<pre class=\" no-line-numbers\"><code class=\" no-wrap language-php\">TOKENS = [\"the\", \"model\", \"embedding\", \"tokenization\", \"xenobiotic\", \"thalweg\"]\n# Appearance probability per sample -- spans roughly 3 orders of magnitude\nFREQ   = np.array([0.95,   0.60,    0.20,          0.05,          0.005,       0.001])\nTRUE_W = np.ones(6)   # all weights should reach 1.0\n\nN_STEPS    = 3000\nLR         = 0.05\nBATCH_SIZE = 32     # samples per step\n\n\ndef sample_batch(batch_size):\n    \"\"\"\n    Each sample is a 
sparse binary feature vector.\n    Token i appears in the sample with probability FREQ[i].\n    Target y = x @ TRUE_W + small noise.\n    \"\"\"\n    X = (np.random.rand(batch_size, 6) &lt; FREQ).astype(float)\n    y = X @ TRUE_W + np.random.randn(batch_size) * 0.1\n    return X, y<\/code><\/pre>\n<\/div>\n<\/div>\n<h1 class=\"wp-block-heading\">SGD<\/h1>\n<p>We first train the model using standard mini-batch SGD. The model weights are initialized to zero, and at every training step we sample a batch, compute the prediction error, calculate the average gradient across the batch, and update the weights using a fixed learning rate. The implementation also records the full weight trajectory over time along with the number of steps in which each parameter received a non-zero gradient.<\/p>\n<p>The key behavior emerges from the sparsity of the input vectors. A token only contributes to the gradient when it appears in the sampled batch. For common tokens, this happens almost every step, so their associated weights receive frequent updates and converge quickly. Rare tokens, however, are absent from most batches, causing their gradients to remain near zero for long stretches of training. 
As a result, SGD spends most of its optimization effort on high-frequency tokens while low-frequency tokens barely move from initialization.<\/p>\n<div class=\"dm-code-snippet dark dm-normal-version default no-background-mobile\">\n<div class=\"control-language\">\n<div class=\"dm-buttons\">\n<div class=\"dm-buttons-left\">\n<div class=\"dm-button-snippet red-button\"><\/div>\n<div class=\"dm-button-snippet orange-button\"><\/div>\n<div class=\"dm-button-snippet green-button\"><\/div>\n<\/div>\n<div class=\"dm-buttons-right\"><a><span class=\"dm-copy-text\">Copy Code<\/span><span class=\"dm-copy-confirmed\">Copied<\/span><span class=\"dm-error-message\">Use a different Browser<\/span><\/a><\/div>\n<\/div>\n<pre class=\" no-line-numbers\"><code class=\" no-wrap language-php\">def train_sgd(n_steps, lr, batch_size):\n    w        = np.zeros(6)\n    history  = np.zeros((n_steps, 6))   # weight trajectory per token\n    grad_counts = np.zeros(6)            # how many non-zero gradients each weight got\n\n    for t in range(n_steps):\n        X, y    = sample_batch(batch_size)\n        error   = X @ w - y\n        grad    = (X.T @ error) \/ batch_size\n        w      -= lr * grad\n\n        grad_counts += (np.abs(grad) &gt; 1e-9).astype(float)\n        history[t]  = w.copy()\n\n    return history, grad_counts<\/code><\/pre>\n<\/div>\n<\/div>\n<h1 class=\"wp-block-heading\">Adam<\/h1>\n<p>We now train the same model using Adam to observe how adaptive optimization changes the learning dynamics. Alongside the model weights, Adam maintains two additional running statistics for every parameter: a momentum estimate m, which tracks an exponential moving average of past gradients, and a variance estimate v (strictly, the uncentered second moment), which tracks an exponential moving average of squared gradients. 
Before applying updates, both statistics are bias-corrected to account for their initialization at zero.<\/p>\n<div class=\"dm-code-snippet dark dm-normal-version default no-background-mobile\">\n<div class=\"control-language\">\n<div class=\"dm-buttons\">\n<div class=\"dm-buttons-left\">\n<div class=\"dm-button-snippet red-button\"><\/div>\n<div class=\"dm-button-snippet orange-button\"><\/div>\n<div class=\"dm-button-snippet green-button\"><\/div>\n<\/div>\n<div class=\"dm-buttons-right\"><a><span class=\"dm-copy-text\">Copy Code<\/span><span class=\"dm-copy-confirmed\">Copied<\/span><span class=\"dm-error-message\">Use a different Browser<\/span><\/a><\/div>\n<\/div>\n<pre class=\" no-line-numbers\"><code class=\" no-wrap language-php\">def train_adam(n_steps, lr, batch_size, beta1=0.9, beta2=0.999, eps=1e-8):\n    w        = np.zeros(6)\n    m        = np.zeros(6)\n    v        = np.zeros(6)\n    history  = np.zeros((n_steps, 6))\n    v_history = np.zeros((n_steps, 6))  # track variance accumulation\n\n    for t in range(1, n_steps + 1):\n        X, y   = sample_batch(batch_size)\n        error  = X @ w - y\n        grad   = (X.T @ error) \/ batch_size\n\n        m = beta1 * m + (1 - beta1) * grad\n        v = beta2 * v + (1 - beta2) * grad ** 2\n\n        m_hat = m \/ (1 - beta1 ** t)\n        v_hat = v \/ (1 - beta2 ** t)\n\n        w -= lr * m_hat \/ (np.sqrt(v_hat) + eps)\n\n        history[t-1]   = w.copy()\n        v_history[t-1] = v_hat.copy()\n\n    return history, v_history<\/code><\/pre>\n<\/div>\n<\/div>\n<h1 class=\"wp-block-heading\">Running both<\/h1>\n<p>With both optimizers implemented, we train the model twice under identical conditions \u2014 once using SGD and once using Adam. Each optimizer sees the same synthetic data distribution, uses the same initialization, learning rate, batch size, and training duration. 
This ensures that any difference in the final behavior comes entirely from the optimization strategy itself rather than changes in the dataset or model architecture.<\/p>\n<div class=\"dm-code-snippet dark dm-normal-version default no-background-mobile\">\n<div class=\"control-language\">\n<div class=\"dm-buttons\">\n<div class=\"dm-buttons-left\">\n<div class=\"dm-button-snippet red-button\"><\/div>\n<div class=\"dm-button-snippet orange-button\"><\/div>\n<div class=\"dm-button-snippet green-button\"><\/div>\n<\/div>\n<div class=\"dm-buttons-right\"><a><span class=\"dm-copy-text\">Copy Code<\/span><span class=\"dm-copy-confirmed\">Copied<\/span><span class=\"dm-error-message\">Use a different Browser<\/span><\/a><\/div>\n<\/div>\n<pre class=\" no-line-numbers\"><code class=\" no-wrap language-php\">print(\"Training SGD...\")\nsgd_history, sgd_grad_counts = train_sgd(N_STEPS, LR, BATCH_SIZE)\n\nprint(\"Training Adam...\")\nadam_history, adam_v_history = train_adam(N_STEPS, LR, BATCH_SIZE)\n\nprint()<\/code><\/pre>\n<\/div>\n<\/div>\n<h1 class=\"wp-block-heading\">Measuring the failure<\/h1>\n<p>We now evaluate how well each optimizer learned the token weights after training. Since every token has the same true target weight of 1.0, the ideal outcome is that all learned weights also end close to 1.0 regardless of token frequency. Along with the final weights, we also measure how many training steps each token actually received a non-zero gradient. This helps us directly compare optimization quality against gradient exposure frequency.<\/p>\n<p>The results clearly show the difference between SGD and Adam. For common tokens, both optimizers learn the correct weights successfully because these tokens appear in almost every batch. But for rare tokens, SGD struggles badly. \u201cxenobiotic\u201d only receives gradients in about 15% of training steps and its weight stops around 0.53 instead of 1.0. 
The rarest token, \u201cthalweg,\u201d receives gradients in only 3.4% of steps and SGD barely learns it at all, ending near 0.15. Adam, however, keeps both rare-token weights close to the correct value despite receiving the same sparse gradient signals.<\/p>\n<div class=\"dm-code-snippet dark dm-normal-version default no-background-mobile\">\n<div class=\"control-language\">\n<div class=\"dm-buttons\">\n<div class=\"dm-buttons-left\">\n<div class=\"dm-button-snippet red-button\"><\/div>\n<div class=\"dm-button-snippet orange-button\"><\/div>\n<div class=\"dm-button-snippet green-button\"><\/div>\n<\/div>\n<div class=\"dm-buttons-right\"><a><span class=\"dm-copy-text\">Copy Code<\/span><span class=\"dm-copy-confirmed\">Copied<\/span><span class=\"dm-error-message\">Use a different Browser<\/span><\/a><\/div>\n<\/div>\n<pre class=\" no-line-numbers\"><code class=\" no-wrap language-php\">sgd_final  = sgd_history[-1]\nadam_final = adam_history[-1]\n\nprint(\"=\" * 62)\nprint(f\"{'Token':&lt;16} {'Freq':&gt;6}  {'SGD w':&gt;8}  {'Adam w':&gt;8}  {'SGD grads':&gt;10}\")\nprint(\"-\" * 62)\nfor i, token in enumerate(TOKENS):\n    sgd_err  = abs(sgd_final[i]  - TRUE_W[i])\n    adam_err = abs(adam_final[i] - TRUE_W[i])\n    flag = \"  \u2190 fails\" if sgd_err &gt; 0.3 else \"\"\n    print(\n        f\"{token:&lt;16} {FREQ[i]:&gt;6.3f}  {sgd_final[i]:&gt;8.4f}  \"\n        f\"{adam_final[i]:&gt;8.4f}  {int(sgd_grad_counts[i]):&gt;10}{flag}\"\n    )\nprint()\nprint(f\"True weight for all tokens: {TRUE_W[0]:.1f}\")\nprint()\n\n# How many steps did each token get a non-zero gradient?\nprint(\"Non-zero gradient steps out of\", N_STEPS)\nfor i, token in enumerate(TOKENS):\n    pct = sgd_grad_counts[i] \/ N_STEPS * 100\n    bar = \"\u2588\" * int(pct \/ 2)\n    print(f\"  {token:&lt;16} {bar:&lt;50} {pct:.1f}%\")\n\nprint()<\/code><\/pre>\n<\/div>\n<\/div>\n<h1 class=\"wp-block-heading\">Effective Learning Rate<\/h1>\n<p>To understand why Adam succeeds on rare tokens, we 
examine its effective learning rate for each parameter at the end of training. Adam does not use the same update scale for every weight. Instead, each parameter\u2019s update is divided by the square root of its accumulated variance estimate v. This means the practical step size depends on how large or small that variance has become during training.<\/p>\n<p>The numbers reveal a clear pattern. Common tokens such as \u201cthe\u201d and \u201cmodel\u201d accumulate large variance values because they receive gradients almost every step, so their effective learning rates remain relatively small. Rare tokens behave very differently. Since \u201cxenobiotic\u201d and \u201cthalweg\u201d receive gradients only occasionally, their variance estimates stay tiny, causing Adam to automatically amplify their effective learning rates by a massive amount. Even though the nominal learning rate is fixed at 0.05, the rarest token ends up receiving an effective step size above 40. This adaptive scaling is the core reason Adam can learn sparse parameters that SGD fails to optimize properly.<\/p>\n<div class=\"dm-code-snippet dark dm-normal-version default no-background-mobile\">\n<div class=\"control-language\">\n<div class=\"dm-buttons\">\n<div class=\"dm-buttons-left\">\n<div class=\"dm-button-snippet red-button\"><\/div>\n<div class=\"dm-button-snippet orange-button\"><\/div>\n<div class=\"dm-button-snippet green-button\"><\/div>\n<\/div>\n<div class=\"dm-buttons-right\"><a><span class=\"dm-copy-text\">Copy Code<\/span><span class=\"dm-copy-confirmed\">Copied<\/span><span class=\"dm-error-message\">Use a different Browser<\/span><\/a><\/div>\n<\/div>\n<pre class=\" no-line-numbers\"><code class=\" no-wrap language-php\">eps = 1e-8\nadam_v_final    = adam_v_history[-1]\neffective_lr    = LR \/ (np.sqrt(adam_v_final) + eps)\n\nprint(\"=\" * 55)\nprint(\"Adam Effective Learning Rate (final step)\")\nprint(\"=\" * 55)\nfor i, token in enumerate(TOKENS):\n    print(f\"  {token:&lt;16}  
v_hat={adam_v_final[i]:.6f}  lr_eff={effective_lr[i]:.4f}\")\nprint()\nprint(f\"Nominal LR: {LR}\")\nprint(\"Rare tokens get an automatically amplified effective LR.\")\nprint()<\/code><\/pre>\n<\/div>\n<\/div>\n<h1 class=\"wp-block-heading\">Visualizing the Results<\/h1>\n<p>Finally, we visualize the full training dynamics to compare how SGD and Adam behave across tokens with vastly different frequencies. The first two plots track the weight trajectories during training, showing whether each optimizer can move rare-token parameters toward the correct value. We also compare the final weight errors for every token to measure overall learning quality.<\/p>\n<p>The four charts tell a single story across two optimizers. The top-left shows SGD\u2019s weight trajectories: common tokens (dark and medium blue) shoot up to 1.0 within the first few hundred steps, while the two rare tokens \u2014 xenobiotic and thalweg \u2014 barely leave the floor, crawling to 0.53 and 0.15 respectively after all 3,000 steps. The top-right bar chart makes the damage concrete: SGD\u2019s error bars for xenobiotic and thalweg dwarf everything else, while Adam\u2019s blue bars stay uniformly small across all six tokens.<\/p>\n<p>The bottom-left shows Adam\u2019s trajectories \u2014 all six tokens converge to 1.0, including the rare ones, though with more oscillation because each rare gradient update carries a large amplified step. The bottom-right explains why: plotted on a log-log scale, the relationship between token frequency and Adam\u2019s effective learning rate is a clean inverse \u2014 thalweg sits at the top-left with a 41\u00d7 amplified effective LR, \u201cthe\u201d sits at the bottom-right near the nominal 0.05, and every other token falls on the same diagonal. 
Adam did not receive any special instructions about which tokens were rare; the variance term computed it automatically from gradient history alone.<\/p>\n<div class=\"dm-code-snippet dark dm-normal-version default no-background-mobile\">\n<div class=\"control-language\">\n<div class=\"dm-buttons\">\n<div class=\"dm-buttons-left\">\n<div class=\"dm-button-snippet red-button\"><\/div>\n<div class=\"dm-button-snippet orange-button\"><\/div>\n<div class=\"dm-button-snippet green-button\"><\/div>\n<\/div>\n<div class=\"dm-buttons-right\"><a><span class=\"dm-copy-text\">Copy Code<\/span><span class=\"dm-copy-confirmed\">Copied<\/span><span class=\"dm-error-message\">Use a different Browser<\/span><\/a><\/div>\n<\/div>\n<pre class=\" no-line-numbers\"><code class=\" no-wrap language-php\">BG   = \"#fafaf8\"\nDARK = \"#1a1a1a\"\n\n# Color ramp: blue for common tokens, red for rare\nTOKEN_COLORS = [\"#1a5276\", \"#2471a3\", \"#5dade2\", \"#e67e22\", \"#c0392b\", \"#7d2a2a\"]\n\nsteps = np.arange(N_STEPS)\n\nfig = plt.figure(figsize=(16, 11), facecolor=BG)\nfig.suptitle(\n    \"SGD vs. Adam on Rare Tokens -- Frequency Bias and Variance Normalization\",\n    fontsize=14, fontweight=\"bold\", color=DARK, y=0.99\n)\n\ngs = gridspec.GridSpec(2, 3, figure=fig, hspace=0.45, wspace=0.35)\n\n# \u2500\u2500 1. 
SGD weight trajectories \u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\nax1 = fig.add_subplot(gs[0, :2])\nax1.set_facecolor(BG)\nax1.axhline(1.0, color=DARK, lw=1, ls=\"--\", alpha=0.3, label=\"True weight = 1.0\")\n\nfor i, (token, color) in enumerate(zip(TOKENS, TOKEN_COLORS)):\n    ax1.plot(steps, sgd_history[:, i], color=color, lw=1.8,\n             label=f\"{token} (freq={FREQ[i]:.3f})\")\n\nax1.set_title(\"SGD -- Weight Trajectories\\nRare tokens barely move from zero\", fontsize=11, color=DARK)\nax1.set_xlabel(\"Training Step\", fontsize=9)\nax1.set_ylabel(\"Learned Weight\", fontsize=9)\nax1.legend(fontsize=8, loc=\"right\")\nax1.set_ylim(-0.3, 1.6)\nax1.spines[[\"top\", \"right\"]].set_visible(False)\n\n# Annotate failure zone\nax1.annotate(\n    \"Rare tokens stuck\\nnear zero\",\n    xy=(N_STEPS * 0.95, sgd_history[-1, 5]),\n    xytext=(N_STEPS * 0.65, -0.15),\n    fontsize=8.5, color=\"#c0392b\",\n    arrowprops=dict(arrowstyle=\"-&gt;\", color=\"#c0392b\", lw=1.2),\n    bbox=dict(boxstyle=\"round,pad=0.3\", facecolor=\"#fff0f0\", edgecolor=\"#c0392b\", alpha=0.85)\n)\n\n# \u2500\u2500 2. 
Final weight error bar chart \u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\nax2 = fig.add_subplot(gs[0, 2])\nax2.set_facecolor(BG)\n\nx      = np.arange(6)\nw_sgd  = sgd_final\nw_adam = adam_final\nwidth  = 0.35\n\nbars_sgd  = ax2.bar(x - width\/2, np.abs(w_sgd  - TRUE_W), width, color=\"#c0392b\", alpha=0.85, label=\"SGD error\")\nbars_adam = ax2.bar(x + width\/2, np.abs(w_adam - TRUE_W), width, color=\"#2980b9\", alpha=0.85, label=\"Adam error\")\n\nax2.set_xticks(x)\nax2.set_xticklabels([t[:8] for t in TOKENS], rotation=30, ha=\"right\", fontsize=8)\nax2.set_ylabel(\"|learned w \u2212 true w|\", fontsize=9)\nax2.set_title(\"Final Weight Error\\n(lower = better)\", fontsize=11, color=DARK)\nax2.legend(fontsize=8)\nax2.spines[[\"top\", \"right\"]].set_visible(False)\n\n# \u2500\u2500 3. Adam weight trajectories \u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\nax3 = fig.add_subplot(gs[1, :2])\nax3.set_facecolor(BG)\nax3.axhline(1.0, color=DARK, lw=1, ls=\"--\", alpha=0.3, label=\"True weight = 1.0\")\n\nfor i, (token, color) in enumerate(zip(TOKENS, TOKEN_COLORS)):\n    ax3.plot(steps, adam_history[:, i], color=color, lw=1.8,\n             label=f\"{token} (freq={FREQ[i]:.3f})\")\n\nax3.set_title(\"Adam -- Weight Trajectories\\nRare tokens converge via variance normalization\", fontsize=11, color=DARK)\nax3.set_xlabel(\"Training Step\", fontsize=9)\nax3.set_ylabel(\"Learned Weight\", fontsize=9)\nax3.legend(fontsize=8, loc=\"right\")\nax3.set_ylim(-0.3, 1.6)\nax3.spines[[\"top\", \"right\"]].set_visible(False)\n\nax3.annotate(\n    \"Rare tokens converge\\ndespite sparse gradients\",\n    xy=(N_STEPS * 0.95, adam_history[-1, 5]),\n    xytext=(N_STEPS * 0.60, 0.3),\n    fontsize=8.5, color=\"#27ae60\",\n    
arrowprops=dict(arrowstyle=\"-&gt;\", color=\"#27ae60\", lw=1.2),\n    bbox=dict(boxstyle=\"round,pad=0.3\", facecolor=\"#f0fff4\", edgecolor=\"#27ae60\", alpha=0.85)\n)\n\n# \u2500\u2500 4. Effective LR vs frequency \u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\nax4 = fig.add_subplot(gs[1, 2])\nax4.set_facecolor(BG)\n\nax4.scatter(FREQ, effective_lr, c=TOKEN_COLORS, s=120, zorder=5, edgecolors=\"white\", lw=1.5)\nfor i, token in enumerate(TOKENS):\n    ax4.annotate(token, (FREQ[i], effective_lr[i]),\n                 textcoords=\"offset points\", xytext=(6, 4), fontsize=7.5, color=TOKEN_COLORS[i])\n\nax4.axhline(LR, color=DARK, lw=1, ls=\"--\", alpha=0.4)\nax4.text(0.5, LR * 1.05, f\"Nominal LR = {LR}\", fontsize=8, color=DARK, alpha=0.6)\n\nax4.set_xscale(\"log\")\nax4.set_yscale(\"log\")\nax4.set_xlabel(\"Token Frequency (log scale)\", fontsize=9)\nax4.set_ylabel(\"Adam Effective LR  lr\/\u221av\u0302  (log scale)\", fontsize=9)\nax4.set_title(\"Adam's Automatic Equalizer\\nRare tokens get amplified LR\", fontsize=11, color=DARK)\nax4.spines[[\"top\", \"right\"]].set_visible(False)\n\nplt.savefig(\"sgd_vs_adam.png\", dpi=150, bbox_inches=\"tight\", facecolor=BG)\nplt.show()<\/code><\/pre>\n<\/div>\n<\/div>\n<div class=\"wp-block-image\">\n<figure class=\"aligncenter size-full is-resized\"><img loading=\"lazy\" decoding=\"async\" width=\"1391\" height=\"1034\" data-attachment-id=\"79942\" data-permalink=\"https:\/\/www.marktechpost.com\/2026\/05\/18\/stochastic-gradient-descent-sgds-frequency-bias-and-how-adam-fixes-it\/image-506\/\" data-orig-file=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2026\/05\/image-34.png\" data-orig-size=\"1391,1034\" data-comments-opened=\"0\" 
data-image-meta='{\"aperture\":\"0\",\"credit\":\"\",\"camera\":\"\",\"caption\":\"\",\"created_timestamp\":\"0\",\"copyright\":\"\",\"focal_length\":\"0\",\"iso\":\"0\",\"shutter_speed\":\"0\",\"title\":\"\",\"orientation\":\"0\"}' data-image-title=\"image\" data-image-description=\"\" data-image-caption=\"\" data-large-file=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2026\/05\/image-34-1024x761.png\" src=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2026\/05\/image-34.png\" alt=\"\" class=\"wp-image-79942\" \/><\/figure>\n<\/div>\n<div class=\"wp-block-image\">\n<figure class=\"aligncenter size-full is-resized\"><img loading=\"lazy\" decoding=\"async\" width=\"920\" height=\"705\" data-attachment-id=\"79948\" data-permalink=\"https:\/\/www.marktechpost.com\/2026\/05\/18\/stochastic-gradient-descent-sgds-frequency-bias-and-how-adam-fixes-it\/image-512\/\" data-orig-file=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2026\/05\/image-39.png\" data-orig-size=\"920,705\" data-comments-opened=\"0\" data-image-meta='{\"aperture\":\"0\",\"credit\":\"\",\"camera\":\"\",\"caption\":\"\",\"created_timestamp\":\"0\",\"copyright\":\"\",\"focal_length\":\"0\",\"iso\":\"0\",\"shutter_speed\":\"0\",\"title\":\"\",\"orientation\":\"0\"}' data-image-title=\"image\" data-image-description=\"\" data-image-caption=\"\" data-large-file=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2026\/05\/image-39.png\" src=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2026\/05\/image-39.png\" alt=\"\" class=\"wp-image-79948\" \/><\/figure>\n<\/div>\n<hr class=\"wp-block-separator has-alpha-channel-opacity\" \/>\n<p>Check out the <strong><a href=\"https:\/\/github.com\/Marktechpost\/AI-Agents-Projects-Tutorials\/blob\/main\/NLP\/SGD_Adam.ipynb\" target=\"_blank\" rel=\"noreferrer noopener\">Codes with Notebook<\/a><\/strong>.<\/p>\n<p>The post <a href=\"https:\/\/www.marktechpost.com\/2026\/05\/18\/stochastic-gradient-descent-sgds-frequency-bias-and-how-adam-fixes-it\/\">Stochastic Gradient Descent (SGD\u2019s) Frequency Bias and How Adam Fixes It\u00a0<\/a> appeared first on <a href=\"https:\/\/www.marktechpost.com\/\">MarkTechPost<\/a>.<\/p>","protected":false},"excerpt":{"rendered":"<p>Modern language models are 
tra&hellip;<\/p>\n","protected":false},"author":1,"featured_media":933,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[1],"tags":[],"class_list":["post-932","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-uncategorized"],"_links":{"self":[{"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=\/wp\/v2\/posts\/932","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=932"}],"version-history":[{"count":0,"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=\/wp\/v2\/posts\/932\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=\/wp\/v2\/media\/933"}],"wp:attachment":[{"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=932"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=932"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=932"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}