Software Engineering · 11 min read

Agentic Testing is a Costly Cop-Out for Flaky Code

Why agentic testing is an expensive band-aid for broken architecture. I explain why deterministic environments beat non-deterministic AI agents every time.

A few mornings ago, I was sipping my third espresso and scrolling through my feed when I stumbled upon Slack Engineering’s write-up on agentic testing. They are advocating for putting autonomous AI agents directly into their end-to-end (E2E) testing stack. The pitch sounds brilliant on paper: instead of writing brittle Playwright scripts, you let an LLM-driven browser agent crawl your application, self-heal when the UI changes, and verify business goals rather than rigid paths.

I almost choked on my coffee.

Do not get me wrong, I understand the pain Slack is trying to solve. We have all lived through the hell of maintaining thousands of lines of fragile Cypress or Playwright locators. But pointing a non-deterministic reasoning model like Claude 3.5 Sonnet at an unstable staging environment to “figure out” how to execute a checkout flow is an architectural disaster. It is the engineering equivalent of using a self-driving car to navigate a collapsing bridge instead of just fixing the structural concrete underneath.

Inside the Slack Engineering Architecture

The core argument in the Slack engineering post is that traditional E2E tests are too rigid. In the authors' framing, tests enforce journeys, while agents verify goals. If a button moves three pixels to the left, or if a product designer wraps a checkout form in a new modal, a standard automated pipeline breaks.

To solve this, the agentic testing crowd wants you to deploy autonomous agents that drive a browser through a Model Context Protocol (MCP) server or a custom Playwright CLI wrapper. You give the agent a high-level goal in plain English, such as “log in as a premium user, add a blue shoe to the cart, and verify the checkout succeeds.” The agent then spins up a browser, looks at the DOM, takes screenshots, determines the next logical action, clicks, and repeats until it decides the goal is met.

Here is what the execution of an agentic step looks like in a typical Node.js scaffold using the modern Playwright API and the OpenAI SDK:

import { chromium, Page } from 'playwright';
import { OpenAI } from 'openai';

interface AgentAction {
  action: 'click' | 'type' | 'wait' | 'success';
  selector?: string;
  text?: string;
  reasoning: string;
}

async function runAgenticStep(page: Page, goal: string): Promise<boolean> {
  const openai = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });
  
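  // Capture visual state and simplified DOM to minimize token consumption.
  // Playwright's page.screenshot() returns a Buffer, so encode it to base64 explicitly.
  const screenshot = (await page.screenshot({ type: 'png' })).toString('base64');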
  const domTree = await page.evaluate(() => {
    return Array.from(document.querySelectorAll('button, input, a, [role="button"]')).map(el => ({
      tagName: el.tagName,
      id: el.id,
      className: el.className,
      text: el.textContent?.trim().substring(0, 50),
      ariaLabel: el.getAttribute('aria-label'),
      role: el.getAttribute('role')
    }));
  });

  const response = await openai.chat.completions.create({
    model: 'gpt-4o',
    response_format: { type: 'json_object' },
    messages: [
      {
        role: 'system',
        content: `You are an autonomous QA engineer. Analyze the current DOM tree and goal. 
        Respond with a JSON object matching this schema:
        {
          "action": "click" | "type" | "wait" | "success",
          "selector": "CSS selector",
          "text": "text to type if action is type",
          "reasoning": "Why you chose this step"
        }`
      },
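      {
        role: 'user',
        // Send the screenshot as an image part. A base64 string buried in a text
        // message is treated as plain text and is not seen as an image.
        content: [
          { type: 'text', text: JSON.stringify({ goal, domTree }) },
          { type: 'image_url', image_url: { url: `data:image/png;base64,${screenshot}` } }
        ]
      }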
    ]
  });

  const rawResult = response.choices[0].message.content;
  if (!rawResult) throw new Error('No agent response received');
  
  const decision = JSON.parse(rawResult) as AgentAction;
  console.log(`Agent Reasoning: ${decision.reasoning}`);

  if (decision.action === 'success') {
    return true;
  }

  if (decision.action === 'click' && decision.selector) {
    await page.click(decision.selector);
  } else if (decision.action === 'type' && decision.selector && decision.text) {
    await page.fill(decision.selector, decision.text);
  } else if (decision.action === 'wait') {
    await page.waitForTimeout(2000);
  }

  return false;
}

It looks magical in a pristine, low-stakes slide deck. The agent can handle minor layout shifts without breaking. But when you move this setup from a local sandbox to a highly concurrent CI/CD pipeline, the facade crumbles, exposing a massive pile of technical debt that you can no longer ignore.

Why Agentic Testing is a Costly Cop-Out for Flaky Codebases

When you deploy agentic testing against a codebase plagued by race conditions and unsemantic HTML, you enter what I call the Trust Valley.

In a healthy engineering organization, a test failure is an immediate, deterministic signal. A button did not appear within 500ms, which means either the frontend state machine is broken or a backend API is returning a 500 error. The test fails in 600ms, costing you zero dollars and zero cents in third-party API fees.

An AI testing agent behaves completely differently. It is designed to adapt, self-heal, and survive. If a dynamic UI race condition occurs, the agent will not just fail. It will pause. It will take another screenshot. It will think. It will try clicking a different part of the screen, perhaps navigating back to the homepage to see if it can reset the state. It might spin in a reasoning loop for three minutes, burning through premium tokens, before finally giving up or, worse, bypassing the bug entirely.

On a recent project for a scaling fintech platform, I saw a team run a suite of 50 agentic E2E tests. A single transient database lock on their staging server caused a cascade of UI delays. Instead of failing loudly, the agents spent the next forty-five minutes attempting to self-heal their way through the checkout flows. The bill for that single CI run was $142.80 in LLM API tokens.
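If you do run agents in CI, the least you can do is bound the loop. Here is a minimal sketch of a budget guard, assuming a step function shaped like the runAgenticStep scaffold above (it resolves to true once the goal is met). The names runWithBudget, Budget, and Outcome are hypothetical, and the limits are placeholders, not recommendations.

```typescript
// Hypothetical guard: caps how long an agent loop may run before CI fails loudly.
// `AgentStep` stands in for something like () => runAgenticStep(page, goal).
type AgentStep = () => Promise<boolean>; // resolves true once the goal is met

interface Budget {
  maxSteps: number;  // hard cap on reasoning/action iterations
  maxMillis: number; // hard cap on wall-clock time
}

type Outcome =
  | { status: 'passed'; steps: number }
  | { status: 'budget_exceeded'; steps: number; reason: 'steps' | 'time' };

async function runWithBudget(
  step: AgentStep,
  budget: Budget,
  now: () => number = Date.now // injectable clock so the guard itself can be tested
): Promise<Outcome> {
  const startedAt = now();
  let steps = 0;
  while (true) {
    // Check both limits before paying for another model call.
    if (steps >= budget.maxSteps) {
      return { status: 'budget_exceeded', steps, reason: 'steps' };
    }
    if (now() - startedAt >= budget.maxMillis) {
      return { status: 'budget_exceeded', steps, reason: 'time' };
    }
    steps += 1;
    if (await step()) {
      return { status: 'passed', steps };
    }
  }
}
```

A budget like this does not make the agent deterministic. It only turns a forty-five minute self-healing spiral into a failure you can see and price.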

Even worse than the financial cost is how these agents mask structural rot. Writing deterministic tests forces developers to write high-quality, accessible HTML. You need unique test IDs, proper ARIA roles, and logical DOM hierarchies so your Playwright locators can find elements cleanly.

When you tell developers that “the AI agent will just figure out how to click it,” all discipline evaporates. Frontend engineers stop adding data-testid attributes. They ignore semantic markup, write deeply nested, unlabelled div structures, and completely skip accessibility trees.

You also fall directly into the Confidence vs. Correctness trap. An AI agent is highly confident even when it is completely wrong. The agent might declare that a user registration flow succeeded because it successfully clicked the submit button and saw a green checkmark, completely missing the fact that the underlying registration API threw a 500 error, leaving the console riddled with exceptions and database state unwritten.
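One way to keep an agent honest is to treat its verdict as a claim and check it against hard signals. The sketch below assumes the harness has already collected HTTP statuses and console errors during the run (in Playwright, page.on('response') and page.on('console') can feed this, though that wiring is not shown). RunEvidence, Verdict, and crossCheckVerdict are hypothetical names.

```typescript
// Hard signals collected during the run; how they are gathered is assumed, not shown.
interface RunEvidence {
  httpStatuses: { url: string; status: number }[];
  consoleErrors: string[];
}

interface Verdict {
  passed: boolean;
  reasons: string[];
}

// The agent's claim is necessary but never sufficient: any 5xx response or
// console error overrides a self-reported success.
function crossCheckVerdict(agentClaimsSuccess: boolean, evidence: RunEvidence): Verdict {
  const reasons: string[] = [];
  if (!agentClaimsSuccess) {
    reasons.push('agent did not reach the goal');
  }
  for (const { url, status } of evidence.httpStatuses) {
    if (status >= 500) {
      reasons.push(`server error ${status} from ${url}`);
    }
  }
  for (const message of evidence.consoleErrors) {
    reasons.push(`console error: ${message}`);
  }
  return { passed: reasons.length === 0, reasons };
}
```

In the registration example, the green checkmark would still satisfy the agent, but the 500 from the registration API would fail the run with a concrete reason attached.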

The Real Reason We Got Here: The Brittle Tragedy of Scripted E2E

To understand why so many teams are falling for the agentic testing trap, we have to acknowledge the absolute misery of traditional E2E testing.

In my experience, engineering teams spend anywhere from 60% to 80% of their testing cycles maintaining legacy test suites. We have all written code that looks like this:

import { test, expect } from '@playwright/test';

test('brittle checkout flow', async ({ page }) => {
  await page.goto('https://example.com/shop');
  
  // This selector breaks the moment the design team runs an A/B test or updates Tailwind classes
  await page.click('main > div:nth-child(2) > div.flex.flex-col > button.bg-blue-600');
  
  await page.fill('input[placeholder="Search products..."]', 'leather boots');
  await page.press('input[placeholder="Search products..."]', 'Enter');
  
  // Waiting for arbitrary dynamic elements
  await page.waitForTimeout(3000); 
  
  // Highly fragile XPath locator that snaps on any DOM restructuring
  await page.click('xpath=//html/body/div[1]/div/div[2]/div[3]/button');
  
  await expect(page.locator('.cart-count')).toHaveText('1');
});

The moment a developer refactors that product card component to use a CSS grid layout instead of a flexbox, this test dies in CI. The build goes red, the release queue blocks, and some engineer has to spend two hours debugging XPath locators.

Because of this daily friction, the “tests enforce journeys, agents verify goals” paradigm feels like a lifesaver. If the agent does not care about the specific CSS path and instead reads the screen like a human, you can refactor the frontend all day long without rewriting a single line of test code.

In low-stakes startup environments where you are moving incredibly fast and do not have a QA team, this trade-off makes sense. If you have five engineers shipping fifteen times a day, maintaining a heavy Playwright suite is impossible. But as you scale, this shortcut-first mentality quickly turns into a financial and operational liability.

The Production Reality: Cost, Latency, and Polluted Sandboxes

Let us look at some hard numbers.

Even though top closed-source models now report success rates above 80% on the SWE-bench Verified benchmark, running these models inside a tight developer feedback loop is impractical.

In modern software delivery, deployment velocity is everything. If a developer pushes a hotfix to production, they expect the CI pipeline to finish in under five minutes. If you introduce an agentic testing suite that relies on multi-step reasoning loops, your pipeline latency skyrockets.

Consider this performance and cost comparison from a real-world benchmark I ran across 100 typical user flows:

| Metric | Deterministic Playwright (Mocked APIs) | Agentic Testing (Claude 3.5 Sonnet) |
| --- | --- | --- |
| Average Run Time | 4.2 seconds | 3.2 minutes |
| API Cost per Run | $0.00 | $0.85 to $3.10 |
| Flakiness Rate | < 1% (with proper mocks) | 12% to 18% (non-deterministic LLM output) |
| Failure Signal | Clear stack trace and line number | Vague “Goal failed to resolve” explanation |
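To see what those numbers mean at suite level, here is a back-of-the-envelope projection. It assumes the run time and cost figures in the comparison are per flow, and it uses a hypothetical 20 CI runs per day. Both are my assumptions for illustration, so swap in your own values.

```typescript
interface SuiteProfile {
  flows: number;          // number of user flows in the suite
  secondsPerFlow: number; // average run time per flow
  costPerFlowUsd: { low: number; high: number }; // API cost range per flow
}

interface Projection {
  serialMinutes: number; // time for one full suite run with no parallelism
  dailyCostUsd: { low: number; high: number };
}

function projectSuite(profile: SuiteProfile, ciRunsPerDay: number): Projection {
  const flowRunsPerDay = profile.flows * ciRunsPerDay;
  return {
    serialMinutes: (profile.flows * profile.secondsPerFlow) / 60,
    dailyCostUsd: {
      low: profile.costPerFlowUsd.low * flowRunsPerDay,
      high: profile.costPerFlowUsd.high * flowRunsPerDay,
    },
  };
}

// 20 CI runs per day is a hypothetical figure, not a measurement.
const scripted = projectSuite(
  { flows: 100, secondsPerFlow: 4.2, costPerFlowUsd: { low: 0, high: 0 } },
  20
);
const agentic = projectSuite(
  { flows: 100, secondsPerFlow: 3.2 * 60, costPerFlowUsd: { low: 0.85, high: 3.1 } },
  20
);

console.log(scripted.serialMinutes.toFixed(0)); // "7"
console.log(agentic.serialMinutes.toFixed(0)); // "320"
console.log(agentic.dailyCostUsd.low.toFixed(0), agentic.dailyCostUsd.high.toFixed(0)); // "1700" "6200"
```

The serial figures overstate wall-clock time if you shard across workers, but the token bill scales with the number of flow runs no matter how many workers you throw at it.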

Then there is the state mutation problem. When you write a deterministic test, you control the state. You seed the database with a specific user, you run the test, and you tear down the state afterward.

An autonomous agent, however, is designed to explore. If you let it loose in a staging sandbox with live database connections, it will mutate state unpredictably. It will create orphan test accounts, leave hundreds of items in draft carts, trigger real webhook events, and pollute your monitoring dashboards with chaotic, unpredictable user behavior. Cleaning up after an autonomous AI rampage on a shared staging database is a nightmare I do not wish on my worst enemy.
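The seed, run, tear down discipline from the deterministic side can be sketched as a small wrapper. FakeDb is a hypothetical in-memory stand-in for whatever fixture API your test database exposes, and withSeededUser is an illustrative name, not a Playwright feature.

```typescript
// Hypothetical in-memory stand-in for a test database.
class FakeDb {
  users = new Map<string, { plan: string }>();
}

async function withSeededUser<T>(
  db: FakeDb,
  user: { id: string; plan: string },
  body: () => Promise<T>
): Promise<T> {
  db.users.set(user.id, { plan: user.plan }); // seed a known starting state
  try {
    return await body(); // run the test against that state
  } finally {
    db.users.delete(user.id); // teardown runs even if the test throws
  }
}
```

The point of the finally block is that a failing test leaves the environment exactly as it found it. An exploring agent gives you no such boundary, because nobody knows in advance which records it will create.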

The Blueprint for Real Reliability: Hermetic Environments and Caching

You do not need to choose between the brittle hell of legacy XPath selectors and the expensive, slow, non-deterministic chaos of AI agents. There is a much better way to build highly reliable, fast, and low-maintenance testing pipelines.

My preferred architecture uses strict deterministic mock boundaries, contract testing, and hermetic local environments. Instead of running E2E tests against a live, unpredictable staging server, we run tests against a completely self-contained environment where all external network calls are mocked using Mock Service Worker (MSW) or Playwright’s native network routing.

Here is how you write a fast, deterministic, hermetic E2E test that mocks the network layer to eliminate staging server dependency:

import { test, expect } from '@playwright/test';

test.describe('Hermetic Checkout Flow', () => {
  test.beforeEach(async ({ page }) => {
    // Intercept API calls and return mock data to ensure 100% determinism
    await page.route('**/api/v1/products/search*', async (route) => {
      await route.fulfill({
        status: 200,
        contentType: 'application/json',
        body: JSON.stringify([
          { id: 'prod_123', name: 'Premium Leather Boots', price: 120.00 }
        ])
      });
    });

    await page.route('**/api/v1/cart', async (route) => {
      await route.fulfill({
        status: 200,
        contentType: 'application/json',
        body: JSON.stringify({ items: [{ id: 'prod_123', quantity: 1 }] })
      });
    });
  });

  test('should successfully add mocked product to cart', async ({ page }) => {
    // A relative path relies on `baseURL` being set in playwright.config
    await page.goto('/shop');
    
    // Use resilient, user-visible accessibility locators instead of CSS/XPath
    await page.getByRole('textbox', { name: /search products/i }).fill('leather boots');
    await page.getByRole('button', { name: /search/i }).click();
    
    const productCard = page.getByRole('heading', { name: /premium leather boots/i });
    await expect(productCard).toBeVisible();
    
    await page.getByRole('button', { name: /add to cart/i }).click();
    await expect(page.getByRole('link', { name: /cart/i })).toContainText('1');
  });
});

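This test runs in less than two seconds, costs absolutely nothing in API tokens, and should only break when the actual user-facing labels or roles change. The trade-off is that the mocks must stay in sync with the real API, which is the job of the contract tests mentioned above.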
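Hermetic mocks are only trustworthy if they match what the real backend returns. Here is a minimal sketch of a contract check you could run against both the mock payload and a recorded real response. The Product shape is inferred from the search mock in the test above, and isProduct and assertProductList are hypothetical helpers, not part of Playwright or MSW.

```typescript
// Shape inferred from the mocked search response; treat it as an assumption.
interface Product {
  id: string;
  name: string;
  price: number;
}

function isProduct(value: unknown): value is Product {
  if (typeof value !== 'object' || value === null) return false;
  const record = value as Record<string, unknown>;
  return (
    typeof record['id'] === 'string' &&
    typeof record['name'] === 'string' &&
    typeof record['price'] === 'number'
  );
}

// Throws if the payload drifts from the contract, so a stale mock fails fast.
function assertProductList(payload: unknown): Product[] {
  if (!Array.isArray(payload) || !payload.every(isProduct)) {
    throw new Error('Payload does not match the Product[] contract');
  }
  return payload;
}
```

Running the same assertion over the mock body and over a sample from the provider's own test suite keeps both sides pinned to one definition. Dedicated contract-testing tools do this more thoroughly, but the idea is the same.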

If you absolutely must use LLMs to help maintain your tests, do not use them as active run-time executors in your CI pipeline. Instead, use them as an offline code-generation tool. This is known as the Locator Cache model.

You run an LLM agent once in development to inspect your UI, analyze the DOM, and generate a highly efficient, deterministic selector map. Your CI runner then executes the generated, hardcoded Playwright test. The costly reasoning engine is only re-awakened if a test fails, prompting the LLM to run offline and output a pull request with updated selectors.

Here is a simple Locator Cache setup that separates LLM selector discovery from test execution. The cache itself is a plain JSON file, locator-cache.json, that maps logical names to selectors:

{
  "searchBox": "input[name='search']",
  "searchButton": "button[type='submit']",
  "productResult": "div[data-testid='product-card-prod_123']"
}

We parse this file inside our test suite:

import { test, expect } from '@playwright/test';
import * as fs from 'fs';
import * as path from 'path';

const selectorCachePath = path.resolve(__dirname, 'locator-cache.json');
const selectors = JSON.parse(fs.readFileSync(selectorCachePath, 'utf-8'));

test('Uses cached locators generated by our offline LLM pipeline', async ({ page }) => {
  await page.goto('/shop');
  
  // No live AI is invoked in CI here. We use fast, cached, deterministic selectors.
  await page.fill(selectors.searchBox, 'leather boots');
  await page.click(selectors.searchButton);
  
  await expect(page.locator(selectors.productResult)).toBeVisible();
});
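Because the cache file is machine-generated, I would validate it when it loads rather than discover a missing key halfway through a test. The sketch below is one way to do that. parseSelectorCache and SelectorCache are hypothetical names, and the required keys are the ones from the sample cache file.

```typescript
type SelectorCache = Record<string, string>;

// Parses the cache JSON and returns only the required keys, failing fast
// if any of them is missing or is not a non-empty string.
function parseSelectorCache(rawJson: string, requiredKeys: string[]): SelectorCache {
  const parsed: unknown = JSON.parse(rawJson);
  if (typeof parsed !== 'object' || parsed === null || Array.isArray(parsed)) {
    throw new Error('Locator cache must be a JSON object');
  }
  const source = parsed as Record<string, unknown>;
  const cache: SelectorCache = {};
  for (const key of requiredKeys) {
    const value = source[key];
    if (typeof value !== 'string' || value.trim() === '') {
      throw new Error(`Locator cache is missing a usable selector for "${key}"`);
    }
    cache[key] = value;
  }
  return cache;
}
```

In the test file this would replace the bare JSON.parse call, with the required keys listed as searchBox, searchButton, and productResult. A bad cache then fails at load time with a message that names the broken key.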

If the developers rename input[name='search'] to input[name='q'], this test will fail. But instead of letting an expensive agent try to self-heal in the middle of a release block, your CI pipeline fails cleanly. You trigger your offline recovery script to update the cache:

import { OpenAI } from 'openai';
import * as fs from 'fs';
import * as path from 'path';

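async function healLocators(failedSelectorKey: string, domSnapshot: string) {
  const openai = new OpenAI(); // reads OPENAI_API_KEY from the environment
  const cachePath = path.resolve(__dirname, 'locator-cache.json');
  const currentCache = JSON.parse(fs.readFileSync(cachePath, 'utf-8'));

  const response = await openai.chat.completions.create({
    model: 'gpt-4o',
    response_format: { type: 'json_object' },
    messages: [
      {
        role: 'system',
        // JSON mode requires the word "JSON" in the prompt, and the model must be
        // told which key to return, because the code below reads `newSelector`.
        content:
          'Identify the correct replacement selector for the broken key based on the DOM. ' +
          'Respond with a JSON object of the form {"newSelector": "<CSS selector>"}.'
      },
      {
        role: 'user',
        content: JSON.stringify({
          brokenKey: failedSelectorKey,
          brokenSelector: currentCache[failedSelectorKey],
          dom: domSnapshot
        })
      }
    ]
  });

  const rawResult = response.choices[0].message.content;
  if (!rawResult) throw new Error('No response received from the model');

  // Never write an empty or malformed selector into the cache
  const patch = JSON.parse(rawResult);
  if (typeof patch.newSelector !== 'string' || patch.newSelector.trim() === '') {
    throw new Error(`Model returned no usable selector for ${failedSelectorKey}`);
  }

  currentCache[failedSelectorKey] = patch.newSelector;
  fs.writeFileSync(cachePath, JSON.stringify(currentCache, null, 2));
  console.log(`Updated selector: ${failedSelectorKey} -> ${patch.newSelector}`);
}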

This keeps the non-deterministic AI completely out of the critical path of your deploy pipeline while still slashing your manual test maintenance overhead.

To round out your strategy, structure your test automation around a hybrid tiering framework. Match your tool of choice to the engineering cost of being wrong:

  1. Tier 1 (High Risk, Core Business Logic): Use deterministic, rigid, and cheap assertions. Your payment integrations, user authentication, and core database transactions must be validated by explicit, loud, deterministic scripts. Never let an AI agent interpret whether a transaction succeeded based on a green checkmark icon.
  2. Tier 2 (Medium Risk, Visual UX): This is where agents actually shine. Run autonomous agents on a schedule, such as a nightly cron job that does not block active pull requests. Let them click around, flag visual regressions, and discover edge cases that a human developer would never think to script.
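The two tiers can be written down as a small policy table so that nobody has to remember which flows are allowed near an agent. This is a sketch under my own assumptions: the Tier and TierPolicy names are hypothetical, and running Tier 1 on every pull request is one reasonable reading of "explicit, loud, deterministic", not a rule from any framework.

```typescript
type Tier = 'tier1' | 'tier2';

interface TierPolicy {
  runner: 'deterministic' | 'agentic';
  trigger: 'every_pull_request' | 'nightly';
  blocksMerge: boolean;
}

const tierPolicies: Record<Tier, TierPolicy> = {
  // High risk, core business logic: scripted assertions that gate the merge.
  tier1: { runner: 'deterministic', trigger: 'every_pull_request', blocksMerge: true },
  // Medium risk, visual UX: scheduled agents that report but never block.
  tier2: { runner: 'agentic', trigger: 'nightly', blocksMerge: false },
};

// Hypothetical lookup: each flow is tagged with a tier when it is written.
function policyFor(flow: { name: string; tier: Tier }): TierPolicy {
  return tierPolicies[flow.tier];
}
```

With a table like this, moving a flow from scripted to agentic coverage becomes a reviewed one-line change instead of a quiet decision inside a test file.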

Are you currently experimenting with agentic testing frameworks in your CI/CD pipelines, or have you already run into the token-burning Trust Valley yourself? Drop me a line on Twitter or open a discussion in my public GitHub workspace.
