{"id":459,"date":"2026-02-24T17:48:58","date_gmt":"2026-02-24T09:48:58","guid":{"rendered":"https:\/\/connectword.dpdns.org\/?p=459"},"modified":"2026-02-24T17:48:58","modified_gmt":"2026-02-24T09:48:58","slug":"google-deepmind-researchers-apply-semantic-evolution-to-create-non-intuitive-vad-cfr-and-shor-psro-variants-for-superior-algorithmic-convergence","status":"publish","type":"post","link":"https:\/\/connectword.dpdns.org\/?p=459","title":{"rendered":"Google DeepMind Researchers Apply Semantic Evolution to Create Non Intuitive VAD-CFR and SHOR-PSRO Variants for Superior Algorithmic Convergence"},"content":{"rendered":"<p>In the competitive arena of Multi-Agent Reinforcement Learning (MARL), progress has long been bottlenecked by human intuition. For years, researchers have manually refined algorithms like <strong>Counterfactual Regret Minimization (CFR)<\/strong> and <strong>Policy Space Response Oracles (PSRO)<\/strong>, navigating a vast combinatorial space of update rules via trial-and-error.<\/p>\n<p>The Google DeepMind research team has now shifted this paradigm with <strong>AlphaEvolve<\/strong>, an evolutionary coding agent powered by Large Language Models (LLMs) that automatically discovers new multi-agent learning algorithms. By treating source code as a genome, AlphaEvolve doesn\u2019t just tune parameters\u2014it invents entirely new symbolic logic.<\/p>\n<h3 class=\"wp-block-heading\"><strong>Semantic Evolution: Beyond Hyperparameter Tuning<\/strong><\/h3>\n<p>Unlike traditional AutoML, which often optimizes numeric constants, AlphaEvolve performs <strong>semantic evolution<\/strong>. 
It utilizes <strong>Gemini 2.5 Pro<\/strong> as an intelligent genetic operator to rewrite logic, introduce novel control flows, and inject symbolic operations into the algorithm\u2019s source code.<\/p>\n<p><strong>The framework follows a rigorous evolutionary loop:<\/strong><\/p>\n<ul class=\"wp-block-list\">\n<li><strong>Initialization<\/strong>: The population begins with standard baseline implementations, such as vanilla CFR.<\/li>\n<li><strong>LLM-Driven Mutation<\/strong>: A parent algorithm is selected based on fitness, and the LLM is prompted to modify the code to reduce exploitability.<\/li>\n<li><strong>Automated Evaluation<\/strong>: Candidates are executed on proxy games (e.g., Kuhn Poker) to compute negative exploitability scores.<\/li>\n<li><strong>Selection<\/strong>: Valid, high-performing candidates are added back to the population, allowing the search to discover non-intuitive optimizations.<\/li>\n<\/ul>\n<h3 class=\"wp-block-heading\"><strong>VAD-CFR: Mastering Game Volatility<\/strong><\/h3>\n<p>The first major discovery is <strong>Volatility-Adaptive Discounted (VAD-) CFR<\/strong>. In Extensive-Form Games (EFGs) with imperfect information, agents must minimize regret across a sequence of histories. While traditional variants use static discounting, <strong>VAD-CFR introduces three mechanisms that often elude human designers:<\/strong><\/p>\n<ol start=\"1\" class=\"wp-block-list\">\n<li><strong>Volatility-Adaptive Discounting<\/strong>: Using an <strong>Exponential Weighted Moving Average (EWMA)<\/strong> of the instantaneous regret magnitude, the algorithm tracks the \u201cshake\u201d of the learning process. When volatility is high, it increases discounting to forget unstable history faster; when it drops, it retains more history for fine-tuning.<\/li>\n<li><strong>Asymmetric Instantaneous Boosting<\/strong>: VAD-CFR boosts positive instantaneous regrets by a factor of <strong>1.1<\/strong>. 
This allows the agent to immediately exploit beneficial deviations without the lag associated with standard accumulation.<\/li>\n<li><strong>Hard Warm-Start &amp; Regret-Magnitude Weighting<\/strong>: The algorithm enforces a \u2018hard warm-start,\u2019 postponing policy averaging until <strong>iteration 500<\/strong>. Interestingly, the LLM generated this threshold without knowing the 1000-iteration evaluation horizon. Once accumulation begins, policies are weighted by the magnitude of instantaneous regret to filter out noise.<\/li>\n<\/ol>\n<p>In empirical tests, VAD-CFR matched or surpassed state-of-the-art performance in <strong>10 out of 11 games<\/strong>, including Leduc Poker and Liar\u2019s Dice, with 4-player Kuhn Poker being the only exception.<\/p>\n<h3 class=\"wp-block-heading\"><strong>SHOR-PSRO: The Hybrid Meta-Solver<\/strong><\/h3>\n<p>The second breakthrough is <strong>Smoothed Hybrid Optimistic Regret (SHOR-) PSRO<\/strong>. PSRO operates on a higher abstraction called the <strong>Meta-Game<\/strong>, where a population of policies is iteratively expanded. SHOR-PSRO evolves the <strong>Meta-Strategy Solver (MSS)<\/strong>, the component that determines how opponents are pitted against each other.<\/p>\n<p><strong>The core of SHOR-PSRO is a Hybrid Blending Mechanism that constructs a meta-strategy \u03c3 by linearly blending two distinct components:<\/strong><\/p>\n<h3 class=\"wp-block-heading\"><strong>\u03c3<sub>hybrid<\/sub> = (1 - \ud835\udecc) \u00b7 \u03c3<sub>ORM<\/sub> + \ud835\udecc \u00b7 
\u03c3<sub>Softmax<\/sub><\/strong><\/h3>\n<ul class=\"wp-block-list\">\n<li><strong>\u03c3<sub>ORM<\/sub><\/strong>: Provides the stability of <strong>Optimistic Regret Matching<\/strong>.<\/li>\n<li><strong>\u03c3<sub>Softmax<\/sub><\/strong>: A Boltzmann distribution over pure strategies that aggressively biases the solver toward high-reward modes.<\/li>\n<\/ul>\n<p>SHOR-PSRO employs a dynamic <strong>Annealing Schedule<\/strong>. The blending factor <strong>\ud835\udecc<\/strong> anneals from <strong>0.3 to 0.05<\/strong>, gradually shifting the focus from greedy exploration to robust equilibrium finding. Furthermore, the search discovered a <strong>Training vs. Evaluation Asymmetry<\/strong>: the training solver uses the annealing schedule for stability, while the evaluation solver uses a fixed, low blending factor (<strong>\ud835\udecc<\/strong>=0.01) for reactive exploitability estimates.<\/p>\n<h3 class=\"wp-block-heading\"><strong>Key Takeaways<\/strong><\/h3>\n<ul class=\"wp-block-list\">\n<li><strong>AlphaEvolve Framework<\/strong>: DeepMind researchers introduced AlphaEvolve, an evolutionary system that uses Large Language Models (LLMs) to perform \u2018semantic evolution\u2019 by treating an algorithm\u2019s source code as its genome. This allows the system to discover entirely new symbolic logic and control flows rather than just tuning hyperparameters.<\/li>\n<li><strong>Discovery of VAD-CFR<\/strong>: The system evolved a new regret minimization algorithm called Volatility-Adaptive Discounted (VAD-) CFR. It outperforms state-of-the-art baselines like Discounted Predictive CFR+ by using non-intuitive mechanisms to manage regret accumulation and policy derivation.<\/li>\n<li><strong>VAD-CFR\u2019s Adaptive Mechanisms<\/strong>: VAD-CFR utilizes a volatility-sensitive discounting schedule that tracks learning instability via an Exponential Weighted Moving Average (EWMA). 
It also features an \u2018Asymmetric Instantaneous Boosting\u2019 factor of 1.1 for positive regrets and a hard warm-start that delays policy averaging until iteration 500 to filter out early-stage noise.<\/li>\n<li><strong>Discovery of SHOR-PSRO<\/strong>: For population-based training, AlphaEvolve discovered Smoothed Hybrid Optimistic Regret (SHOR-) PSRO. This variant utilizes a hybrid meta-solver that blends Optimistic Regret Matching with a smoothed, temperature-controlled distribution over best pure strategies to improve convergence speed and stability.<\/li>\n<li><strong>Dynamic Annealing and Asymmetry<\/strong>: SHOR-PSRO automates the transition from exploration to exploitation by annealing its blending factor and diversity bonuses during training. The search also discovered a performance-boosting asymmetry where the training-time solver uses time-averaging for stability while the evaluation-time solver uses a reactive last-iterate strategy.<\/li>\n<\/ul>\n<hr class=\"wp-block-separator has-alpha-channel-opacity\" \/>\n<p>Check out the\u00a0<strong><a href=\"https:\/\/arxiv.org\/pdf\/2602.16928\" target=\"_blank\" rel=\"noreferrer noopener\">Paper<\/a>.\u00a0<\/strong>Also,\u00a0feel free to follow us on\u00a0<strong><a href=\"https:\/\/x.com\/intent\/follow?screen_name=marktechpost\" target=\"_blank\" rel=\"noreferrer noopener\"><mark>Twitter<\/mark><\/a><\/strong>\u00a0and don\u2019t forget to join our\u00a0<strong><a href=\"https:\/\/www.reddit.com\/r\/machinelearningnews\/\" target=\"_blank\" rel=\"noreferrer noopener\">100k+ ML SubReddit<\/a><\/strong>\u00a0and Subscribe to\u00a0<strong><a href=\"https:\/\/www.aidevsignals.com\/\" target=\"_blank\" rel=\"noreferrer noopener\">our Newsletter<\/a><\/strong>. Wait! 
Are you on Telegram?\u00a0<strong><a href=\"https:\/\/t.me\/machinelearningresearchnews\" target=\"_blank\" rel=\"noreferrer noopener\">You can join us on Telegram as well.<\/a><\/strong><\/p>\n<p>The post <a href=\"https:\/\/www.marktechpost.com\/2026\/02\/24\/google-deepmind-researchers-apply-semantic-evolution-to-create-non-intuitive-vad-cfr-and-shor-psro-variants-for-superior-algorithmic-convergence\/\">Google DeepMind Researchers Apply Semantic Evolution to Create Non Intuitive VAD-CFR and SHOR-PSRO Variants for Superior Algorithmic Convergence<\/a> appeared first on <a href=\"https:\/\/www.marktechpost.com\/\">MarkTechPost<\/a>.<\/p>","protected":false},"excerpt":{"rendered":"<p>In the competitive arena of Mu&hellip;<\/p>\n","protected":false},"author":1,"featured_media":29,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[1],"tags":[],"class_list":["post-459","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-uncategorized"],"_links":{"self":[{"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=\/wp\/v2\/posts\/459","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=459"}],"version-history":[{"count":0,"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=\/wp\/v2\/posts\/459\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=\/wp\/v2\/media\/29"}],"wp:attachment":[{"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent
=459"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=459"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=459"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}