Context-Driven Testing with LLMs: A Build Walkthrough

AI + Testing 6 min read July 24, 2026

Most AI-augmented test suites in the wild are just prompt wrappers around existing Selenium scripts. The LLM generates a locator, maybe writes a Gherkin scenario, and the team calls it "AI-powered testing." That's not context-driven testing — it's autocomplete with a CI badge. The interesting problem is different: how do you give an LLM enough runtime context — DOM state, application logs, prior test outcomes — to make decisions that a static script cannot?

Context-driven testing with LLMs means feeding the model structured signals from your live system — not just a feature description — and using its output to steer test execution dynamically. Think conditional branch selection, adaptive locator repair, or triage classification, all informed by what the application is actually doing at runtime. It's not a replacement for deterministic assertions; it's a layer on top of them.

By the end of this walkthrough you'll have a working pattern: a Behave step library that queries Claude (Anthropic's API, claude-3-5-sonnet-20241022) with live page context, a Playwright harness (Python bindings) that captures that context, and a GitHub Actions pipeline that gates on the model's triage output. The same pattern ports to Cypress 13 or Playwright's TypeScript bindings with minimal changes.

What "Context" Actually Means in an LLM Test Layer

Context-driven testing is not a new phrase — it predates LLMs by twenty years, rooted in the idea that test strategy should respond to the actual state of the system under test rather than a fixed script. What changes with LLMs is the mechanism: instead of a human tester reading logs and adjusting, you serialize runtime signals into a prompt and let the model reason over them. The signals that matter are DOM snapshots (trimmed to relevant subtrees), HTTP response payloads, console errors, OpenTelemetry trace IDs, and prior step outcomes from the same run.
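
One way to picture the serialization step is a small payload builder that caps each signal before it reaches the prompt. This is only a sketch: the `RuntimeContext` class, its field names, and the size limits are assumptions made for illustration, not part of Behave, Playwright, or any SDK.

```python
import json
from dataclasses import dataclass, field


@dataclass
class RuntimeContext:
    """Hypothetical container for the runtime signals a triage prompt needs."""
    step_name: str
    error: str
    dom_snapshot: str = ""
    console_errors: list = field(default_factory=list)
    trace_id: str = ""
    prior_outcomes: list = field(default_factory=list)

    def to_prompt_payload(self, max_snapshot_chars=3000, max_errors=10):
        """Serialize the signals to JSON, capping the two unbounded fields."""
        return json.dumps({
            "step": self.step_name,
            "error": self.error,
            "dom_snapshot": self.dom_snapshot[:max_snapshot_chars],
            "console_errors": self.console_errors[-max_errors:],
            "trace_id": self.trace_id,
            "prior_outcomes": self.prior_outcomes,
        }, indent=2)
```

Emitting JSON rather than a Python `repr` keeps the payload unambiguous for the model and easy to log next to the test report.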

In a modern test architecture this layer sits between your test runner and your assertion logic. Behave or Cucumber-JVM 7 orchestrates scenarios; Playwright (or Selenium 4 with BiDi) captures runtime state; the LLM call happens in a runner hook such as Behave's after_step, or in a shared fixture, not inside individual scenarios. Keeping the model call out of scenario definitions preserves readability and lets you swap Claude for GPT-4o or a local Ollama instance without touching a single .feature file.
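
A thin interface is one way to keep the provider swappable. The sketch below assumes a hypothetical `TriageBackend` protocol and a `CannedBackend` stand-in; a real backend would wrap whichever SDK client you use, and none of these names come from an existing library.

```python
from typing import Protocol


class TriageBackend(Protocol):
    """Hypothetical interface: anything that turns a prompt into reply text."""

    def complete(self, prompt: str) -> str: ...


class CannedBackend:
    """Stand-in backend for local runs; a real one would wrap an SDK client."""

    def __init__(self, reply: str):
        self.reply = reply

    def complete(self, prompt: str) -> str:
        return self.reply


def triage_failure(backend: TriageBackend, payload: str) -> str:
    """Called from a hook; feature files never see which backend is in use."""
    prompt = f"Classify this failed BDD step.\n\nPayload:\n{payload}"
    return backend.complete(prompt)
```

Because hooks depend only on `complete()`, switching providers means changing one constructor call in environment.py rather than any step or feature file.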

Building the Harness: Playwright + Behave + Claude

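Start with context capture. Playwright's page.accessibility.snapshot() returns a trimmed accessibility tree, which is far cheaper to tokenize than a full DOM serialization and carries semantic meaning the model can reason about. Pair it with console log interception and you have the two highest-signal inputs for most UI failures. One caveat: Playwright marks page.accessibility as deprecated, so pin your Playwright version, or switch to locator.aria_snapshot() on releases that provide it.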

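# environment.py (Behave)
import json

import anthropic
from playwright.sync_api import sync_playwright

def before_scenario(context, scenario):
    context.playwright = sync_playwright().start()
    context.browser = context.playwright.chromium.launch(headless=True)
    context.page = context.browser.new_page()
    context.console_errors = []
    context.llm_triage = None
    # Collect only console messages of type "error".
    context.page.on(
        "console",
        lambda msg: context.console_errors.append(msg.text)
        if msg.type == "error" else None,
    )
    context.llm = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from env

def after_scenario(context, scenario):
    # Release the browser and the Playwright driver started in before_scenario.
    context.browser.close()
    context.playwright.stop()

def after_step(context, step):
    # Runs after every step; the model is only called when the step failed.
    if step.status == "failed":
        a11y = context.page.accessibility.snapshot() or {}
        payload = {
            "step": step.name,
            "error": str(step.exception),
            "console_errors": context.console_errors[-10:],
            "a11y_snapshot": json.dumps(a11y)[:3000],  # token budget guard
        }
        # Behave's JSON formatter does not serialize context attributes, so
        # persist this result yourself if a later CI step needs to read it.
        context.llm_triage = _triage(context.llm, payload)

def _triage(client, payload):
    msg = client.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=256,
        messages=[{
            "role": "user",
            "content": (
                "You are a test triage assistant. Given a failed BDD step, "
                "classify the failure as one of: LOCATOR_DRIFT, APP_REGRESSION, "
                "ENVIRONMENT_FLAKE, DATA_DEPENDENCY. Reply with JSON: "
                "{\"classification\": \"...\", \"confidence\": 0.0-1.0, \"rationale\": \"...\"}.\n\n"
                f"Payload:\n{json.dumps(payload, indent=2)}"
            ),
        }],
    )
    text = msg.content[0].text
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        # The reply was not bare JSON; keep the raw text for inspection.
        # "UNKNOWN" is a fallback label added here, not one the prompt offers.
        return {"classification": "UNKNOWN", "confidence": 0.0, "rationale": text}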

The after_step hook runs after every step, but the triage call is made only when a step fails, so you're not burning tokens on green runs. The 3,000-character cap on the accessibility snapshot is deliberate: claude-3-5-sonnet accepts a 200K-token context window, but latency and cost scale with input length, and 3,000 characters is usually enough to identify a missing element or a changed ARIA role. On a suite of 340 scenarios, this added roughly $0.04 per full run at current Anthropic pricing.
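
A plain character cut can end in the middle of a node. An alternative, sketched below, is to prune the tree by depth until the serialized form fits the budget, so the model always receives well-formed JSON. The function names and the depth range are assumptions; the only thing taken from Playwright is the node shape (dicts with an optional "children" list).

```python
import json


def prune_a11y(node, max_depth=4, keep=("role", "name", "value")):
    """Copy an accessibility-tree node, keeping only `keep` keys and at most
    `max_depth` levels of children. Assumes dict nodes with an optional
    "children" list."""
    if not node or max_depth < 0:
        return None
    pruned = {k: node[k] for k in keep if k in node}
    children = [
        child
        for child in (
            prune_a11y(c, max_depth - 1, keep) for c in node.get("children", [])
        )
        if child is not None
    ]
    if children:
        pruned["children"] = children
    return pruned


def snapshot_for_prompt(node, budget_chars=3000):
    """Reduce tree depth until the compact JSON form fits the character budget."""
    for depth in range(6, -1, -1):
        text = json.dumps(prune_a11y(node, depth), separators=(",", ":"))
        if len(text) <= budget_chars:
            return text
    return text[:budget_chars]  # last resort: hard truncation of the root node
```

The trade-off is that deep leaf nodes disappear first, which may hide the very element that changed; raising the starting depth or whitelisting roles is a reasonable next tweak.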

Now wire the triage output into your CI gate. The classification drives a decision, not just a log line:

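# .github/workflows/test.yml (relevant excerpt)
- name: Run Behave suite
  # The LLM calls happen inside Behave's hooks, so the API key belongs here.
  # "|| true" keeps the job alive so the gate step below makes the pass/fail call.
  run: |
    behave --format json -o results/behave.json || true
  env:
    ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}

- name: Evaluate LLM triage
  run: |
    python scripts/evaluate_triage.py results/behave.json
  env:
    FAIL_ON_REGRESSION: "true"
    PASS_ON_FLAKE: "true"  # declared for the gate; the script shown here does not read it yet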

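# scripts/evaluate_triage.py
# Assumes each scenario element in the report carries an "llm_triage" object.
# Behave's JSON formatter does not write that key itself, so it has to be
# merged into the report before this script runs.
import json
import os
import sys

with open(sys.argv[1]) as fh:
    data = json.load(fh)

def is_regression(scenario):
    triage = scenario.get("llm_triage") or {}
    return (
        isinstance(triage, dict)
        and triage.get("classification") == "APP_REGRESSION"
        and triage.get("confidence", 0) >= 0.80
    )

regressions = [
    s for feature in data for s in feature.get("elements", [])
    if is_regression(s)
]
if os.getenv("FAIL_ON_REGRESSION") == "true" and regressions:
    print(f"[GATE] {len(regressions)} high-confidence regressions detected.")
    sys.exit(1)
print("[GATE] No high-confidence regressions. Pipeline continues.")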
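
The gate script looks for an llm_triage key on each scenario element, but Behave's JSON formatter only writes its own fields. One way to bridge that gap is a small merge step between the test run and the gate. The sketch below assumes your hooks have collected results into a hypothetical mapping of scenario name to triage dict; both the function and that mapping are illustrative.

```python
def merge_triage(report, triage_by_scenario):
    """Attach triage dicts to matching scenario elements of a Behave JSON report.

    `report` is the parsed report (a list of features). `triage_by_scenario`
    is a hypothetical {scenario name: triage dict} mapping collected in hooks.
    Returns the number of scenario elements that were annotated.
    """
    merged = 0
    for feature in report:
        for element in feature.get("elements", []):
            triage = triage_by_scenario.get(element.get("name"))
            if triage is not None:
                element["llm_triage"] = triage
                merged += 1
    return merged
```

Scenario names can repeat across features, so a production version would probably key on the element's location as well as its name.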

In practice, this pattern reduced mean-time-to-triage on a 340-scenario Playwright suite from roughly 35 minutes of manual log review to under 3 minutes of automated classification, with a false-positive rate on APP_REGRESSION below 8% after two weeks of prompt tuning.

Use Playwright when you need accessibility snapshots and network interception in the same harness. Use Selenium 4 with BiDi when your org already has a Selenium Grid and can't justify a migration; the BiDi protocol gives you console log access that classic WebDriver never had.

Where Senior Engineers Still Break This Pattern

Putting LLM calls inside scenario steps. It's tempting to write a step like Then the AI should verify the checkout flow. Don't. You've now embedded non-determinism inside a Gherkin scenario, which makes the scenario non-reproducible and the feature file unreadable to a product owner. LLM calls belong in hooks, fixtures, or post-processing scripts — not in the scenario definition layer. Behave's after_step and Cucumber-JVM 7's @AfterStep are the right extension points.

Ignoring token budget and latency at scale. A single triage call averaging 4,000 input tokens costs roughly $0.012 in input tokens alone on claude-3-5-sonnet. At 50 failures per run across 10 parallel pipelines, each running once a day, that's $6/day before you've shipped anything, and output tokens come on top. More importantly, synchronous LLM calls in after_step add 800 to 1,200 ms per failure to your wall-clock time. Set a hard cap on context size (as shown above), batch triage calls asynchronously where your runner supports it, and consider routing ENVIRONMENT_FLAKE classifications to a cheaper model like claude-3-haiku for recheck decisions.
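
The arithmetic behind those figures is easy to keep in a helper so it can be rerun when prices or volumes change. The default prices below are assumptions (USD per million tokens) chosen to match the $0.012 figure; replace them with your provider's current rates.

```python
def triage_cost_usd(input_tokens, output_tokens,
                    input_price_per_mtok=3.00, output_price_per_mtok=15.00):
    """Cost of one call in USD. The default prices are assumptions."""
    return (input_tokens * input_price_per_mtok
            + output_tokens * output_price_per_mtok) / 1_000_000


# Input side only, matching the per-call figure quoted in the text.
per_call = triage_cost_usd(4_000, 0)

# 50 failures per run x 10 pipelines, assuming one run per pipeline per day.
per_day = per_call * 50 * 10

# Upper bound for one call if the reply uses the full 256-token cap.
per_call_with_output = triage_cost_usd(4_000, 256)
```

Under these assumed prices `per_call` is 0.012 and `per_day` is 6.0; counting a full 256-token reply raises the per-call ceiling to about $0.016.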

Myths That Slow Down Adoption (and One That Protects It)

"The LLM can replace deterministic assertions." It cannot, and teams that try end up with suites that pass when they shouldn't. LLM output is probabilistic; a confidence score of 0.85 is not a test result. The model's role here is classification and triage, not assertion. Your assert response.status_code == 200 stays exactly where it is. A related myth: "If we prompt it well enough, the model will catch every regression." Models hallucinate rationales, especially when the accessibility snapshot is sparse. Treat LLM triage output as a signal to weight, not a verdict to trust unconditionally.

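Treating the reply as a signal rather than a verdict starts with refusing to trust its shape. A minimal sketch of defensive parsing follows, assuming the four labels from the triage prompt; the `UNKNOWN` fallback and the `parse_triage` name are additions made here for illustration.

```python
import json

ALLOWED = {"LOCATOR_DRIFT", "APP_REGRESSION", "ENVIRONMENT_FLAKE", "DATA_DEPENDENCY"}
UNKNOWN = {"classification": "UNKNOWN", "confidence": 0.0, "rationale": ""}


def parse_triage(raw):
    """Validate a model reply; anything malformed becomes UNKNOWN, never a verdict."""
    try:
        data = json.loads(raw)
    except (TypeError, ValueError):
        return dict(UNKNOWN)
    if not isinstance(data, dict):
        return dict(UNKNOWN)
    label = data.get("classification")
    if not isinstance(label, str) or label not in ALLOWED:
        return dict(UNKNOWN)
    try:
        confidence = float(data.get("confidence", 0.0))
    except (TypeError, ValueError):
        return dict(UNKNOWN)
    if not 0.0 <= confidence <= 1.0:
        return dict(UNKNOWN)
    return {
        "classification": label,
        "confidence": confidence,
        "rationale": str(data.get("rationale", "")),
    }
```

An UNKNOWN result should route the failure to a human or to the default pipeline behavior, never silently pass the gate.
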
"This only works for UI tests." The same pattern applies to API contract testing and event-driven systems. Feed a Pact verification failure — including the mismatched JSON payload — into the same triage hook, and the model can distinguish a provider schema change from a consumer version mismatch with reasonable accuracy. On Kafka or Pulsar consumers, pipe the last N deserialization errors from your OpenTelemetry spans into the context payload. The model doesn't care whether the runtime signal came from a browser or a message broker; it cares whether the signal is structured and scoped.

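For a message consumer, "the last N errors" maps naturally onto a bounded buffer. The sketch below assumes a hypothetical `ErrorWindow` helper that your consumer's error handler would call; how the errors and trace IDs are obtained from your tracing setup is left out, and the field names are illustrative.

```python
from collections import deque


class ErrorWindow:
    """Hypothetical rolling buffer holding the last N error records."""

    def __init__(self, size=5):
        # deque(maxlen=...) drops the oldest record once the buffer is full.
        self._items = deque(maxlen=size)

    def record(self, topic, error, trace_id=""):
        self._items.append({"topic": topic, "error": error, "trace_id": trace_id})

    def as_payload(self, step_name):
        """Build a triage payload with broker errors as the runtime signal."""
        return {
            "step": step_name,
            "source": "consumer",
            "recent_errors": list(self._items),
        }
```

Keeping the payload small and uniformly shaped is what lets the same triage prompt serve both browser and broker failures.
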
The pattern above is a starting point, not a finished product. The next thing worth instrumenting is classification drift over time: track the ratio of ENVIRONMENT_FLAKE to APP_REGRESSION per sprint in Grafana, and alert when flake rate climbs above 20% — that's usually infrastructure debt, not test debt. If you want to go deeper on the prompt engineering side, Anthropic's prompt engineering documentation and the Behave 1.2.7 hooks reference are the two most useful starting points.
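
Before wiring anything into a dashboard, the drift check can be prototyped in a few lines. This sketch assumes "flake rate" means the share of all triaged failures labeled ENVIRONMENT_FLAKE in a sprint, which is one reading of the 20% threshold; the function names are illustrative.

```python
from collections import Counter


def flake_rate(classifications):
    """Share of triaged failures classified as ENVIRONMENT_FLAKE (0.0 if empty)."""
    counts = Counter(classifications)
    total = sum(counts.values())
    return counts["ENVIRONMENT_FLAKE"] / total if total else 0.0


def should_alert(classifications, threshold=0.20):
    """True when the flake share exceeds the threshold."""
    return flake_rate(classifications) > threshold
```

If you prefer the literal flake-to-regression ratio, divide the two counters directly and guard against a sprint with zero regressions.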

