{"id":575,"date":"2026-03-18T15:08:46","date_gmt":"2026-03-18T07:08:46","guid":{"rendered":"https:\/\/connectword.dpdns.org\/?p=575"},"modified":"2026-03-18T15:08:46","modified_gmt":"2026-03-18T07:08:46","slug":"servicenow-research-introduces-enterpriseops-gym-a-high-fidelity-benchmark-designed-to-evaluate-agentic-planning-in-realistic-enterprise-settings","status":"publish","type":"post","link":"https:\/\/connectword.dpdns.org\/?p=575","title":{"rendered":"ServiceNow Research Introduces EnterpriseOps-Gym: A High-Fidelity Benchmark Designed to Evaluate Agentic Planning in Realistic Enterprise Settings"},"content":{"rendered":"<p>Large language models (LLMs) are transitioning from conversational to autonomous agents capable of executing complex professional workflows. However, their deployment in enterprise environments remains limited by the lack of benchmarks that capture the specific challenges of professional settings: long-horizon planning, persistent state changes, and strict access protocols. 
To address this, researchers from ServiceNow Research, Mila and Universite de Montreal have introduced <strong>EnterpriseOps-Gym<\/strong>, a high-fidelity sandbox designed to evaluate agentic planning in realistic enterprise scenarios.<\/p>\n<div class=\"wp-block-image\">\n<figure class=\"aligncenter size-full is-resized\"><img loading=\"lazy\" decoding=\"async\" width=\"1780\" height=\"1262\" data-attachment-id=\"78425\" data-permalink=\"https:\/\/www.marktechpost.com\/2026\/03\/18\/servicenow-research-introduces-enterpriseops-gym-a-high-fidelity-benchmark-designed-to-evaluate-agentic-planning-in-realistic-enterprise-settings\/screenshot-2026-03-18-at-12-00-24-am-2\/\" data-orig-file=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2026\/03\/Screenshot-2026-03-18-at-12.00.24-AM-1.png\" data-orig-size=\"1780,1262\" data-comments-opened=\"1\" data-image-meta='{\"aperture\":\"0\",\"credit\":\"\",\"camera\":\"\",\"caption\":\"\",\"created_timestamp\":\"0\",\"copyright\":\"\",\"focal_length\":\"0\",\"iso\":\"0\",\"shutter_speed\":\"0\",\"title\":\"\",\"orientation\":\"0\"}' data-image-title=\"Screenshot 2026-03-18 at 12.00.24\u202fAM\" data-image-description=\"\" data-image-caption=\"\" data-medium-file=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2026\/03\/Screenshot-2026-03-18-at-12.00.24-AM-1-300x213.png\" data-large-file=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2026\/03\/Screenshot-2026-03-18-at-12.00.24-AM-1-1024x726.png\" src=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2026\/03\/Screenshot-2026-03-18-at-12.00.24-AM-1.png\" alt=\"\" class=\"wp-image-78425\" \/><figcaption class=\"wp-element-caption\">https:\/\/arxiv.org\/pdf\/2603.13594<\/figcaption><\/figure>\n<\/div>\n<h3 class=\"wp-block-heading\"><strong>The Evaluation Environment<\/strong><\/h3>\n<p><strong>EnterpriseOps-Gym features a containerized Docker environment that simulates eight mission-critical enterprise domains:<\/strong><\/p>\n<ul 
class=\"wp-block-list\">\n<li><strong>Operational Domains:<\/strong> Customer Service Management (CSM), Human Resources (HR), and IT Service Management (ITSM).<\/li>\n<li><strong>Collaboration Domains:<\/strong> Email, Calendar, Teams, and Drive.<\/li>\n<li><strong>Hybrid Domain:<\/strong> Cross-domain tasks requiring coordinated execution across multiple systems.<\/li>\n<\/ul>\n<p>The benchmark comprises <strong>164 relational database tables<\/strong> and <strong>512 functional tools<\/strong>. With a mean foreign key degree of <strong>1.7<\/strong>, the environment presents high relational density, forcing agents to navigate complex inter-table dependencies to maintain referential integrity. The benchmark includes <strong>1,150 expert-curated tasks<\/strong>, with execution trajectories averaging 9 steps and reaching up to 34 steps.<\/p>\n<h3 class=\"wp-block-heading\"><strong>Performance Results: A Capability Gap<\/strong><\/h3>\n<p>The research team evaluated 14 frontier models using a <strong>pass@1<\/strong> metric, where a task is successful only if all outcome-based SQL verifiers pass.<\/p>\n<figure class=\"wp-block-table\">\n<table class=\"has-fixed-layout\">\n<thead>\n<tr>\n<th><strong>Model<\/strong><\/th>\n<th><strong>Average Success Rate (%)<\/strong><\/th>\n<th><strong>Cost per Task (USD)<\/strong><\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td><strong>Claude Opus 4.5<\/strong><\/td>\n<td>37.4%<\/td>\n<td>$0.36<\/td>\n<\/tr>\n<tr>\n<td><strong>Gemini-3-Flash<\/strong><\/td>\n<td>31.9%<\/td>\n<td>$0.03<\/td>\n<\/tr>\n<tr>\n<td><strong>GPT-5.2 (High)<\/strong><\/td>\n<td>31.8%<\/td>\n<td>Not reported<\/td>\n<\/tr>\n<tr>\n<td><strong>Claude Sonnet 
4.5<\/strong><\/td>\n<td>30.9%<\/td>\n<td>$0.26<\/td>\n<\/tr>\n<tr>\n<td><strong>GPT-5<\/strong><\/td>\n<td>29.8%<\/td>\n<td>$0.16<\/td>\n<\/tr>\n<tr>\n<td><strong>DeepSeek-V3.2 (High)<\/strong><\/td>\n<td>24.5%<\/td>\n<td>$0.014<\/td>\n<\/tr>\n<tr>\n<td><strong>GPT-OSS-120B (High)<\/strong><\/td>\n<td>23.7%<\/td>\n<td>$0.015<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<\/figure>\n<p>The results indicate that even state-of-the-art models fail to reach 40% reliability in these structured environments. Performance is strongly domain-dependent; models performed best on collaboration tools (Email, Teams) but dropped significantly in policy-heavy domains like <strong>ITSM (28.5%)<\/strong> and <strong>Hybrid (30.7%)<\/strong> workflows.<\/p>\n<h3 class=\"wp-block-heading\"><strong>Planning vs. Execution<\/strong><\/h3>\n<p>A critical finding of this research is that <strong>strategic planning<\/strong>, rather than tool invocation, is the primary performance bottleneck.<\/p>\n<p>The research team conducted \u2018Oracle\u2019 experiments where agents were provided with human-authored plans. This intervention improved performance by <strong>14-35 percentage points<\/strong> across all models. Strikingly, smaller models like <strong>Qwen3-4B<\/strong> became competitive with much larger models when strategic reasoning was externalized. 
Conversely, adding \u2018distractor tools\u2019 to simulate retrieval errors had a negligible impact on performance, further suggesting that tool discovery is not the binding constraint.<\/p>\n<h3 class=\"wp-block-heading\"><strong>Failure Modes and Safety Concerns<\/strong><\/h3>\n<p><strong>The qualitative analysis revealed four recurring failure patterns:<\/strong><\/p>\n<ol start=\"1\" class=\"wp-block-list\">\n<li><strong>Missing Prerequisite Lookup:<\/strong> Creating objects without querying necessary prerequisites, leading to \u201corphaned\u201d records.<\/li>\n<li><strong>Cascading State Propagation:<\/strong> Failing to trigger follow-up actions required by system policies after a state change.<\/li>\n<li><strong>Incorrect ID Resolution:<\/strong> Passing unverified or guessed identifiers to tool calls.<\/li>\n<li><strong>Premature Completion Hallucination:<\/strong> Declaring a task finished before all required steps are executed.<\/li>\n<\/ol>\n<p>Furthermore, agents struggle with <strong>safe refusal<\/strong>. The benchmark includes 30 infeasible tasks (e.g., requests violating access rules or involving inactive users). The best-performing model, <strong>GPT-5.2 (Low)<\/strong>, correctly refused these tasks only <strong>53.9%<\/strong> of the time. In professional settings, failing to refuse an unauthorized or impossible task can lead to corrupted database states and security risks.<\/p>\n<h3 class=\"wp-block-heading\"><strong>Orchestration and Multi-Agent Systems (MAS)<\/strong><\/h3>\n<p>The research team also evaluated whether more complex agent architectures could close the performance gap. 
While a <strong>Planner+Executor<\/strong> setup (where one model plans and another executes) yielded modest gains, more complex <strong>decomposition architectures<\/strong> often regressed performance. In domains like CSM and HR, tasks have strong sequential state dependencies; breaking these into sub-tasks for separate agents often disrupted the necessary context, leading to lower success rates than simple ReAct loops.<\/p>\n<h3 class=\"wp-block-heading\"><strong>Economic Considerations: The Pareto Frontier<\/strong><\/h3>\n<p><strong>For deployment, the benchmark establishes a clear cost-performance tradeoff:<\/strong><\/p>\n<ul class=\"wp-block-list\">\n<li><strong>Gemini-3-Flash<\/strong> represents the strongest practical tradeoff for closed-source models, offering 31.9% performance at a 90% lower cost than GPT-5 or Claude Sonnet 4.5.<\/li>\n<li><strong>DeepSeek-V3.2 (High)<\/strong> and <strong>GPT-OSS-120B (High)<\/strong> are the dominant open-source options, offering approximately 24% performance at roughly $0.015 per task.<\/li>\n<li><strong>Claude Opus 4.5<\/strong> remains the benchmark for absolute reliability (37.4%) but at the highest cost of $0.36 per task.<\/li>\n<\/ul>\n<h3 class=\"wp-block-heading\"><strong>Key Takeaways<\/strong><\/h3>\n<ul class=\"wp-block-list\">\n<li><strong>Benchmark Scale and Complexity<\/strong>: EnterpriseOps-Gym provides a high-fidelity evaluation environment featuring <strong>164 relational database tables<\/strong> and <strong>512 functional tools<\/strong> across eight enterprise domains.<\/li>\n<li><strong>Significant Performance Gap<\/strong>: Current frontier models are not yet reliable for autonomous deployment; the top-performing model, <strong>Claude Opus 4.5<\/strong>, achieves only a <strong>37.4% success rate<\/strong>.<\/li>\n<li><strong>Planning as the Primary Bottleneck<\/strong>: Strategic reasoning is the binding constraint rather than tool execution, as providing agents with human-authored plans 
improves performance by <strong>14 to 35 percentage points<\/strong>.<\/li>\n<li><strong>Inadequate Safe Refusal<\/strong>: Models struggle to identify and refuse infeasible or policy-violating requests, with even the best-performing model cleanly abstaining only <strong>53.9%<\/strong> of the time.<\/li>\n<li><strong>Thinking Budget Limitations<\/strong>: While increasing test-time compute yields gains in some domains, performance plateaus in others, suggesting that more \u2018thinking\u2019 tokens cannot fully overcome fundamental gaps in <strong>policy understanding<\/strong> or <strong>domain knowledge<\/strong>.<\/li>\n<\/ul>\n<hr class=\"wp-block-separator has-alpha-channel-opacity\" \/>\n<p>Check out\u00a0<strong><a href=\"https:\/\/arxiv.org\/pdf\/2603.13594\" target=\"_blank\" rel=\"noreferrer noopener\">Paper<\/a><\/strong>, <strong><a href=\"https:\/\/github.com\/ServiceNow\/EnterpriseOps-Gym\" target=\"_blank\" rel=\"noreferrer noopener\">Codes<\/a><\/strong> and <strong><a href=\"https:\/\/enterpriseops-gym.github.io\/\" target=\"_blank\" rel=\"noreferrer noopener\">Technical details<\/a>.\u00a0<\/strong>
<\/p>\n<p>The post <a href=\"https:\/\/www.marktechpost.com\/2026\/03\/18\/servicenow-research-introduces-enterpriseops-gym-a-high-fidelity-benchmark-designed-to-evaluate-agentic-planning-in-realistic-enterprise-settings\/\">ServiceNow Research Introduces EnterpriseOps-Gym: A High-Fidelity Benchmark Designed to Evaluate Agentic Planning in Realistic Enterprise Settings<\/a> appeared first on <a href=\"https:\/\/www.marktechpost.com\/\">MarkTechPost<\/a>.<\/p>","protected":false},"excerpt":{"rendered":"<p>Large language models (LLMs) a&hellip;<\/p>\n","protected":false},"author":1,"featured_media":576,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[1],"tags":[],"class_list":["post-575","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-uncategorized"],"_links":{"self":[{"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=\/wp\/v2\/posts\/575","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=575"}],"version-history":[{"count":0,"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=\/wp\/v2\/posts\/575\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=\/wp\/v2\/media\/576"}],"wp:attachment":[{"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&p
arent=575"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=575"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=575"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}