{"id":743,"date":"2026-04-18T14:00:41","date_gmt":"2026-04-18T06:00:41","guid":{"rendered":"https:\/\/connectword.dpdns.org\/?p=743"},"modified":"2026-04-18T14:00:41","modified_gmt":"2026-04-18T06:00:41","slug":"google-ai-releases-auto-diagnose-an-large-language-model-llm-based-system-to-diagnose-integration-test-failures-at-scale","status":"publish","type":"post","link":"https:\/\/connectword.dpdns.org\/?p=743","title":{"rendered":"Google AI Releases Auto-Diagnose: A Large Language Model (LLM)-Based System to Diagnose Integration Test Failures at Scale"},"content":{"rendered":"<p>If you have ever stared at thousands of lines of integration test logs wondering which of the sixteen log files actually contains your bug, you are not alone \u2014 and Google now has data to prove it.<\/p>\n<p>A team of Google researchers introduced <strong>Auto-Diagnose<\/strong>, an LLM-powered tool that automatically reads the failure logs from a broken integration test, finds the root cause, and posts a concise diagnosis directly into the code review where the failure showed up. On a manual evaluation of 71 real-world failures spanning <strong>39 distinct teams<\/strong>, the tool correctly identified the root cause <strong>90.14% of the time<\/strong>. 
It has run on <strong>52,635 distinct failing tests<\/strong> across <strong>224,782 executions<\/strong> on <strong>91,130 code changes<\/strong> authored by <strong>22,962 distinct developers<\/strong>, with a \u2018Not helpful\u2019 rate of just 5.8% on the feedback received.<\/p>\n<div class=\"wp-block-image\">\n<figure class=\"aligncenter size-full is-resized\"><img loading=\"lazy\" decoding=\"async\" width=\"872\" height=\"680\" data-attachment-id=\"79108\" data-permalink=\"https:\/\/www.marktechpost.com\/2026\/04\/17\/google-ai-releases-auto-diagnose-an-large-language-model-llm-based-system-to-diagnose-integration-test-failures-at-scale\/screenshot-2026-04-17-at-11-00-24-pm-2\/\" data-orig-file=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2026\/04\/Screenshot-2026-04-17-at-11.00.24-PM-1.png\" data-orig-size=\"872,680\" data-comments-opened=\"0\" data-image-meta='{\"aperture\":\"0\",\"credit\":\"\",\"camera\":\"\",\"caption\":\"\",\"created_timestamp\":\"0\",\"copyright\":\"\",\"focal_length\":\"0\",\"iso\":\"0\",\"shutter_speed\":\"0\",\"title\":\"\",\"orientation\":\"0\"}' data-image-title=\"Screenshot 2026-04-17 at 11.00.24\u202fPM\" data-image-description=\"\" data-image-caption=\"\" data-large-file=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2026\/04\/Screenshot-2026-04-17-at-11.00.24-PM-1.png\" src=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2026\/04\/Screenshot-2026-04-17-at-11.00.24-PM-1.png\" alt=\"\" class=\"wp-image-79108\" \/><figcaption class=\"wp-element-caption\">https:\/\/arxiv.org\/pdf\/2604.12108<\/figcaption><\/figure>\n<\/div>\n<h3 class=\"wp-block-heading\"><strong>The problem: integration tests are a debugging tax<\/strong><\/h3>\n<p>Integration tests verify that multiple components of a distributed system actually communicate with each other correctly. 
The tests Auto-Diagnose targets are <em>hermetic functional integration tests<\/em>: tests where an entire system under test (SUT) \u2014 typically a graph of communicating servers \u2014 is brought up inside an isolated environment by a test driver, and exercised against business logic. A separate Google survey of 239 respondents found that <strong>78% of integration tests at Google are functional<\/strong>, which is what motivated the scope.<\/p>\n<p>Diagnosing integration test failures showed up as one of the top five complaints in <em>EngSat<\/em>, a Google-wide survey of 6,059 developers. A follow-up survey of 116 developers found that <strong>38.4% of integration test failures take more than an hour to diagnose, and 8.9% take more than a day<\/strong> \u2014 versus 2.7% and 0% for unit tests.<\/p>\n<p>The root cause is structural. Test driver logs usually surface only a generic symptom (a timeout, an assertion). The actual error lives somewhere inside one of the SUT component logs, often buried under recoverable warnings and ERROR-level lines that are not actually the cause.<\/p>\n<div class=\"wp-block-image\">\n<figure class=\"aligncenter size-full is-resized\"><img loading=\"lazy\" decoding=\"async\" width=\"2102\" height=\"506\" data-attachment-id=\"79106\" data-permalink=\"https:\/\/www.marktechpost.com\/2026\/04\/17\/google-ai-releases-auto-diagnose-an-large-language-model-llm-based-system-to-diagnose-integration-test-failures-at-scale\/screenshot-2026-04-17-at-10-59-30-pm-2\/\" data-orig-file=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2026\/04\/Screenshot-2026-04-17-at-10.59.30-PM-1.png\" data-orig-size=\"2102,506\" data-comments-opened=\"0\" data-image-meta='{\"aperture\":\"0\",\"credit\":\"\",\"camera\":\"\",\"caption\":\"\",\"created_timestamp\":\"0\",\"copyright\":\"\",\"focal_length\":\"0\",\"iso\":\"0\",\"shutter_speed\":\"0\",\"title\":\"\",\"orientation\":\"0\"}' data-image-title=\"Screenshot 2026-04-17 at 10.59.30\u202fPM\" 
data-image-description=\"\" data-image-caption=\"\" data-large-file=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2026\/04\/Screenshot-2026-04-17-at-10.59.30-PM-1-1024x247.png\" src=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2026\/04\/Screenshot-2026-04-17-at-10.59.30-PM-1.png\" alt=\"\" class=\"wp-image-79106\" \/><figcaption class=\"wp-element-caption\">https:\/\/arxiv.org\/pdf\/2604.12108<\/figcaption><\/figure>\n<\/div>\n<h3 class=\"wp-block-heading\"><strong>How Auto-Diagnose works<\/strong><\/h3>\n<p>When an integration test fails, a pub\/sub event triggers Auto-Diagnose. The system collects all test driver and SUT component logs at level INFO and above \u2014 across data centers, processes, and threads \u2014 then <strong>joins and sorts them by timestamp into a single log stream<\/strong>. That stream is dropped into a prompt template along with component metadata.<\/p>\n<p>The model is <strong>Gemini 2.5 Flash<\/strong>, called with <code>temperature = 0.1<\/code> (for near-deterministic, debuggable outputs) and <code>top<sub>p<\/sub> = 0.8<\/code>. Gemini was not fine-tuned on Google\u2019s integration test data; this is pure prompt engineering on a general-purpose model.<\/p>\n<p>The prompt itself is the most instructive part of this research. It walks the model through an explicit step-by-step protocol: scan log sections, read component context, locate the failure, summarize errors, and only then attempt a conclusion. Critically, it includes hard negative constraints \u2014 for example: <em>if the logs do not contain lines from the component that failed, do not draw any conclusion.<\/em> <\/p>\n<p>The model\u2019s response is post-processed into a markdown finding with <code>==Conclusion==<\/code>, <code>==Investigation Steps==<\/code>, and <code>==Most Relevant Log Lines==<\/code> sections, then posted as a comment in <strong>Critique<\/strong>, Google\u2019s internal code review system. 
Each cited log line is rendered as a clickable link.<\/p>\n<h3 class=\"wp-block-heading\"><strong>Numbers from production<\/strong><\/h3>\n<p>Auto-Diagnose averages <strong>110,617 input tokens and 5,962 output tokens per execution<\/strong>, and posts findings with a <strong>p50 latency of 56 seconds and p90 of 346 seconds<\/strong> \u2014 fast enough that developers see the diagnosis before they have switched contexts.<\/p>\n<p>Critique exposes three feedback buttons on a finding: <em>Please fix<\/em> (used by reviewers), <em>Helpful<\/em>, and <em>Not helpful<\/em> (both used by authors). Across 517 total feedback reports from 437 distinct developers, <strong>436 (84.3%) were \u201cPlease fix\u201d<\/strong> from 370 reviewers \u2014 by far the dominant interaction, and a sign that reviewers are actively asking authors to act on the diagnoses. Among dev-side feedback, the helpfulness ratio (<code>H \/ (H + N)<\/code>) is 62.96%, and the \u201cNot helpful\u201d rate (<code>N \/ (PF + H + N)<\/code>) is 5.8% \u2014 well under Google\u2019s 10% threshold for keeping a tool live. Across <strong>370 tools that post findings to Critique<\/strong>, Auto-Diagnose ranks <strong>#14 in helpfulness, putting it in the top 3.78%<\/strong>.<\/p>\n<p>The manual evaluation also surfaced a useful side effect. Of the seven cases where Auto-Diagnose failed, four were because test driver logs were not properly saved on crash, and three were because SUT component logs were not saved when the component crashed \u2014 both real infrastructure bugs, reported back to the relevant teams. 
In production, around 20 <em>\u2018more information is needed\u2019<\/em> diagnoses have similarly helped surface infrastructure issues.<\/p>\n<h3 class=\"wp-block-heading\"><strong>Key Takeaways<\/strong><\/h3>\n<ul class=\"wp-block-list\">\n<li><strong>Auto-Diagnose hit 90.14% root-cause accuracy<\/strong> on a manual evaluation of 71 real-world integration test failures spanning 39 teams at Google, addressing a problem 6,059 developers ranked among their top five complaints in the EngSat survey.<\/li>\n<li><strong>The system runs on Gemini 2.5 Flash with no fine-tuning<\/strong> \u2014 just prompt engineering. A pub\/sub trigger collects logs across data centers and processes, joins them by timestamp, and sends them to the model at temperature 0.1 and top<sub>p<\/sub> 0.8.<\/li>\n<li><strong>The prompt is engineered to refuse rather than guess.<\/strong> Hard negative constraints force the model to respond with \u201cmore information is needed\u201d when evidence is missing \u2014 a deliberate trade-off that prevents hallucinated root causes and even helped surface real infrastructure bugs in Google\u2019s logging pipeline.<\/li>\n<li><strong>In production since May 2025, Auto-Diagnose has run on 52,635 distinct failing tests across 224,782 executions on 91,130 code changes from 22,962 developers<\/strong>, posting findings in a p50 of 56 seconds \u2014 fast enough that engineers see the diagnosis before switching contexts.<\/li>\n<\/ul>\n<hr class=\"wp-block-separator has-alpha-channel-opacity\" \/>\n<p>Check out\u00a0the\u00a0<strong><a href=\"https:\/\/arxiv.org\/pdf\/2604.12108\" target=\"_blank\" rel=\"noreferrer noopener\">Pre-Print Paper here<\/a><\/strong>.<\/p>\n<p>The post <a href=\"https:\/\/www.marktechpost.com\/2026\/04\/17\/google-ai-releases-auto-diagnose-an-large-language-model-llm-based-system-to-diagnose-integration-test-failures-at-scale\/\">Google AI Releases Auto-Diagnose: A Large Language Model (LLM)-Based System to Diagnose Integration Test Failures at Scale<\/a> appeared first on <a href=\"https:\/\/www.marktechpost.com\/\">MarkTechPost<\/a>.<\/p>","protected":false},"excerpt":{"rendered":"<p>If you have ever stared at 
tho&hellip;<\/p>\n","protected":false},"author":1,"featured_media":744,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[1],"tags":[],"class_list":["post-743","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-uncategorized"],"_links":{"self":[{"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=\/wp\/v2\/posts\/743","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=743"}],"version-history":[{"count":0,"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=\/wp\/v2\/posts\/743\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=\/wp\/v2\/media\/744"}],"wp:attachment":[{"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=743"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=743"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=743"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}