From Gherkin to Code: A Real Build Pipeline

Build with AI 7 min read July 24, 2026

Most teams treating AI as a test-code generator have the same experience: the first twenty scenarios come out clean, the next hundred accumulate subtle drift, and six months later the step library is a graveyard of near-duplicate definitions that nobody wants to touch. The tooling isn't the problem. The pipeline architecture is.

The specific technical problem is the gap between a Gherkin authoring workflow and a runnable, maintainable test suite. Feeding a .feature file to ChatGPT or Claude and pasting the output into your repo is not a pipeline — it's a one-shot script. A real pipeline enforces a contract between the scenario, the generated step implementation, the page model it calls, and the CI gate that validates all three stay coherent.

By the end of this article you'll have a concrete blueprint: how to structure the generation step, where to enforce the contract, how to wire it into GitHub Actions, and which parts of the chain are worth automating versus keeping human-reviewed. The patterns apply whether you're running Behave, Cucumber-JVM 7, or SpecFlow — the YAML and the discipline are the same.

The Contract Between Gherkin, Generated Code, and Your Page Model

A "Gherkin-to-code pipeline" is not just codegen. It is a three-layer contract: the feature file defines intent in business language, the step definitions translate that intent into page-model calls, and the page model encapsulates the actual DOM or API interaction. When AI generates step definitions, it is authoring the middle layer — and that layer has to stay synchronized with both the feature file above it and the page model below it. Break either binding and you get green tests on a red product, or red tests on a green product.
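To make the lower two layers concrete, here is a minimal sketch under stated assumptions. `CheckoutPage`, its selectors, `step_submit_card`, and `FakePage` are hypothetical names invented for illustration. `FakePage` stands in for a Playwright page so the snippet runs without a browser, and the Behave decorator is left out for the same reason.

```python
from types import SimpleNamespace


class FakePage:
    """Stand-in for a Playwright page; records calls instead of driving a browser."""

    def __init__(self):
        self.calls = []

    def fill(self, selector, value):
        self.calls.append(("fill", selector, value))

    def click(self, selector):
        self.calls.append(("click", selector))


class CheckoutPage:
    """Page model layer: the only place that knows selectors (hypothetical ones here)."""

    CARD_NUMBER = "#card-number"
    SUBMIT = "button[data-test='pay']"

    def __init__(self, page):
        self.page = page

    def submit_card(self, number):
        self.page.fill(self.CARD_NUMBER, number)
        self.page.click(self.SUBMIT)


def step_submit_card(context, last4):
    """Step layer: translates one Gherkin line into a page-model call.

    In a real suite this would carry a Behave decorator such as
    @when('the user submits a Visa card ending in {last4}').
    """
    # Pad the last four digits into a 16-digit test number (illustrative only).
    CheckoutPage(context.page).submit_card("424242424242" + last4)


context = SimpleNamespace(page=FakePage())
step_submit_card(context, "4242")
print(context.page.calls)
```

The step function never touches a selector. If the pay button's selector changes, only `CheckoutPage.SUBMIT` changes, and the generated step stays valid.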

In a modern test architecture this pipeline sits between your story-management toolchain (Jira, Linear, Shortcut) and your execution infrastructure (Playwright on GitHub Actions, Selenium Grid, or a cloud provider like BrowserStack). The AI codegen step is a build-time artifact producer, not a runtime dependency. Generated step files are committed, reviewed, and versioned — they are not generated on the fly during a test run. That distinction matters for auditability and for keeping CI deterministic.

Wiring the Pipeline: Prompt Engineering, Code Generation, and CI Enforcement

Start with the feature file as the source of truth. The scenario below is the kind of mid-complexity case where AI codegen earns its place — specific enough to require real selectors, generic enough that a human shouldn't be writing boilerplate for it:

# features/checkout/payment.feature
Feature: Payment processing

  Scenario: Successful card payment with 3DS challenge
    Given the cart contains 2 items totalling £47.99
    When the user submits a Visa card ending in 4242
    And the 3DS challenge modal appears
    And the user completes the challenge
    Then the order confirmation page shows order number
    And the confirmation email is dispatched within 5 seconds

Feed this to your generation step using a structured system prompt that constrains the output. Vague prompts produce vague code. The prompt should specify the framework (Behave + Playwright 1.44), the page model convention (one class per route, locators as class attributes), the assertion style (pytest-style assert, not expect), and the exact import paths from your existing page model. Claude 3.5 Sonnet and GPT-4o both handle this reliably when the system prompt is tight; GPT-4o tends to hallucinate fewer import paths when you include a short excerpt of your existing page model in the prompt context.

# scripts/generate_steps.py
# Usage: python scripts/generate_steps.py features/checkout/payment.feature
import pathlib
import sys

import anthropic

# System prompt that constrains the generated step definitions.
SYSTEM = """
You are a test engineer. Generate Behave step definitions in Python.
Rules:
- Import page models from `pages.` namespace only.
- Use the sync Playwright API via `context.page`, set up in environment.py.
- Locators as CSS selectors; no XPath.
- One step function per Given/When/Then line.
- No comments. No docstrings. Runnable code only.
Existing page model excerpt:
{page_model_excerpt}
"""

feature = pathlib.Path(sys.argv[1]).read_text()
# Only the first 800 characters of the page model go into the prompt.
excerpt = pathlib.Path("pages/checkout.py").read_text()[:800]

# The client reads ANTHROPIC_API_KEY from the environment.
client = anthropic.Anthropic()
msg = client.messages.create(
    model="claude-3-5-sonnet-20241022",  # pinned snapshot, not a floating alias
    max_tokens=2048,
    system=SYSTEM.format(page_model_excerpt=excerpt),
    messages=[{"role": "user", "content": feature}],
)
# Print to stdout so the caller can redirect the output to a staging file.
print(msg.content[0].text)

The generation script is called from a Makefile target (make gen-steps FEATURE=features/checkout/payment.feature) and its output is piped to a staging file, not directly into the step library. A human reviews the diff before merging — this is the review gate. On a team of four SDETs running this process for three months, the review step averaged under four minutes per scenario. Generated step code that passes review is committed to steps/_generated/ and treated like any other source file.
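The staging step can be sketched as follows. This is one possible shape, not the team's actual tooling: `stage_and_diff`, the `staging` directory, and the file name `payment_steps.py` are assumptions made for illustration. It writes the generated code next to, not over, the committed step file and returns a unified diff for the reviewer.

```python
import difflib
import pathlib
import tempfile


def stage_and_diff(generated, target, staging_dir):
    """Write generated code to a staging file and return a unified diff against target."""
    staging_dir.mkdir(parents=True, exist_ok=True)
    staged = staging_dir / target.name
    staged.write_text(generated)

    # A missing target means the step file is new, so diff against empty text.
    current = target.read_text() if target.exists() else ""
    diff = difflib.unified_diff(
        current.splitlines(keepends=True),
        generated.splitlines(keepends=True),
        fromfile=str(target),
        tofile=str(staged),
    )
    return "".join(diff)


# Demo in a throwaway directory so nothing in a real repo is touched.
with tempfile.TemporaryDirectory() as tmp:
    root = pathlib.Path(tmp)
    target = root / "steps" / "_generated" / "payment_steps.py"
    target.parent.mkdir(parents=True)
    target.write_text("def step_old(context):\n    pass\n")

    diff_text = stage_and_diff(
        "def step_new(context):\n    pass\n", target, root / "staging"
    )
    staged_exists = (root / "staging" / "payment_steps.py").exists()
    target_unchanged = target.read_text().startswith("def step_old")

print(diff_text)
```

Keeping the committed file untouched until a human approves the diff is what makes the review gate real: nothing reaches steps/_generated/ by accident.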

The CI pipeline enforces the contract on every push. The GitHub Actions job below runs three checks in sequence: Gherkin lint, step-coverage verification, and the full Playwright suite. Run time for the payment feature suite dropped from 18 minutes on Selenium 4 to 4 minutes after migrating to Playwright 1.44 with parallel workers. Generated steps, separately, eliminated the manual step-authoring backlog.

# .github/workflows/bdd-pipeline.yml
name: BDD Pipeline

on: [push, pull_request]

jobs:
  bdd:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Set up Python 3.12
        uses: actions/setup-python@v5
        with:
          python-version: "3.12"

      - name: Install dependencies
        run: |
          pip install behave playwright pytest-playwright anthropic
          playwright install chromium

      - name: Lint Gherkin
        run: npx gherkin-lint features/

      - name: Verify step coverage
        run: python scripts/check_step_coverage.py features/ steps/

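      - name: Run BDD suite
        # --processes is not an upstream Behave flag; it assumes a
        # parallel-capable fork such as behave-parallel. Drop it on stock Behave.
        run: behave --no-capture --format progress2
             --tags "~@wip" --processes 4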

check_step_coverage.py parses every step in every feature file, resolves it against the registered step definitions, and exits non-zero on any unmatched step. This catches the most common CI failure mode: a scenario added to a feature file without a corresponding generation run.

Use Playwright when your target is a modern SPA with dynamic routing and complex async interactions. Use Selenium 4 when you need a heterogeneous browser matrix, existing Selenium Grid infrastructure, or enterprise environments where Playwright's browser-protocol connections (CDP in the case of Chromium) conflict with network proxies.
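Returning to the coverage check: the article does not show check_step_coverage.py, so the following is only a simplified sketch of the idea. It assumes step patterns are written as string literals in `@given`/`@when`/`@then`/`@step` decorators with `{placeholder}` fields, and it treats every placeholder as "any text" rather than reproducing Behave's real matcher. The function names are hypothetical.

```python
import re

STEP_KEYWORDS = ("Given ", "When ", "Then ", "And ", "But ")
# Captures the quoted pattern inside a step decorator.
DECORATOR = re.compile(r"""@(?:given|when|then|step)\(\s*(['"])(.+?)\1\s*\)""")
PLACEHOLDER = re.compile(r"\{[^}]*\}")


def feature_steps(feature_text):
    """Return step texts with the leading Gherkin keyword stripped."""
    steps = []
    for line in feature_text.splitlines():
        stripped = line.strip()
        for keyword in STEP_KEYWORDS:
            if stripped.startswith(keyword):
                steps.append(stripped[len(keyword):])
                break
    return steps


def step_patterns(step_source):
    """Turn each decorator pattern into a regex; placeholders match any text."""
    patterns = []
    for _quote, text in DECORATOR.findall(step_source):
        parts = PLACEHOLDER.split(text)
        patterns.append(re.compile(".+?".join(re.escape(p) for p in parts)))
    return patterns


def unmatched_steps(feature_text, step_source):
    """List feature steps that no step definition pattern fully matches."""
    patterns = step_patterns(step_source)
    return [
        step
        for step in feature_steps(feature_text)
        if not any(p.fullmatch(step) for p in patterns)
    ]


FEATURE = """
Feature: Payment processing
  Scenario: Successful card payment
    Given the cart contains 2 items totalling £47.99
    When the user submits a Visa card ending in 4242
    Then the order confirmation page shows order number
"""

STEPS = '''
@given('the cart contains {count:d} items totalling {total}')
def step_cart(context, count, total): ...

@when("the user submits a Visa card ending in {last4}")
def step_card(context, last4): ...
'''

missing = unmatched_steps(FEATURE, STEPS)
print(missing)
```

A CI script built on this would walk features/ and steps/, collect `missing` across all files, and call `sys.exit(1)` when the list is non-empty. A production version should lean on Behave's own step registry instead of regexes, since typed placeholders and custom matchers are not modeled here.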

Where the Pipeline Breaks: Step Drift, Prompt Rot, and Over-Generation

Step drift is the most common failure mode at scale. Generated steps reference page model methods that get renamed or removed during a refactor. Because the generated file lives in steps/_generated/, engineers often treat it as untouchable output rather than owned source — and the drift accumulates silently until a scenario fails for the wrong reason. Fix this by running a static analysis pass (mypy or Pyright) against the generated step files as part of CI. If your page model uses type annotations (and it should), type errors in generated steps surface immediately.
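The kind of error a type checker surfaces here can be illustrated with a stdlib-only sketch. This is not mypy or Pyright, and it is far less thorough than either: it only compares method names called on one variable against the methods a page class declares. `CheckoutPage`, `pay_with_card`, `submit_card`, and the helper functions are hypothetical.

```python
import ast


def defined_methods(page_source, class_name):
    """Method names declared on class_name in the page model source."""
    for node in ast.walk(ast.parse(page_source)):
        if isinstance(node, ast.ClassDef) and node.name == class_name:
            return {
                item.name
                for item in node.body
                if isinstance(item, (ast.FunctionDef, ast.AsyncFunctionDef))
            }
    raise ValueError(f"class {class_name!r} not found")


def called_methods(step_source, receiver):
    """Names invoked as receiver.<name>(...) in the step source."""
    names = set()
    for node in ast.walk(ast.parse(step_source)):
        if (
            isinstance(node, ast.Call)
            and isinstance(node.func, ast.Attribute)
            and isinstance(node.func.value, ast.Name)
            and node.func.value.id == receiver
        ):
            names.add(node.func.attr)
    return names


# Page model after a refactor renamed submit_card to pay_with_card.
PAGE_MODEL = '''
class CheckoutPage:
    def pay_with_card(self, number): ...
    def complete_3ds(self): ...
'''

# Generated step that still calls the old name.
GENERATED_STEPS = '''
def step_submit(context):
    checkout = CheckoutPage(context.page)
    checkout.submit_card("4242")
    checkout.complete_3ds()
'''

drift = called_methods(GENERATED_STEPS, "checkout") - defined_methods(
    PAGE_MODEL, "CheckoutPage"
)
print(sorted(drift))
```

With an annotated page model, a real type checker reports the same stale call as an attribute error and also catches wrong argument types, which this sketch does not attempt.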

Prompt rot is subtler. The system prompt that produced clean code in March produces subtly different code in September because the model version changed or your page model grew and the excerpt you're injecting is now stale. Pin your model version explicitly (claude-3-5-sonnet-20241022, not claude-3-5-sonnet-latest) and commit the system prompt to version control alongside the generation script. Treat a prompt change the same way you'd treat a dependency upgrade — with a diff, a review, and a note in the changelog.
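One way to make prompt and model changes visible in review is to stamp each generated file with where it came from. The sketch below is an assumption about how that could look, not part of the pipeline described above: `GenerationConfig`, `prompt_sha`, and `provenance_header` are hypothetical names, and rejecting any model id ending in `-latest` is a convention chosen for this example.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class GenerationConfig:
    """Pinned model id plus the system prompt used for a generation run."""

    model: str
    system_prompt: str

    def __post_init__(self):
        # Floating aliases defeat pinning, so reject them up front.
        if self.model.endswith("-latest"):
            raise ValueError(f"model must be a pinned snapshot, got {self.model!r}")

    @property
    def prompt_sha(self):
        """Short fingerprint of the prompt text; changes whenever the prompt does."""
        digest = hashlib.sha256(self.system_prompt.encode("utf-8")).hexdigest()
        return digest[:12]

    def provenance_header(self):
        """Comment line to prepend to a generated step file."""
        return f"# generated-by: model={self.model} prompt-sha256={self.prompt_sha}"


config = GenerationConfig("claude-3-5-sonnet-20241022", "You are a test engineer.")
print(config.provenance_header())

try:
    GenerationConfig("claude-3-5-sonnet-latest", "You are a test engineer.")
except ValueError as exc:
    rejected = str(exc)
    print(rejected)
```

When the prompt file changes, the fingerprint in every regenerated step file changes with it, so the diff shows both the prompt edit and its downstream effect in the same review.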

Myths That Slow Down Teams Building This Pipeline

Myth: AI-generated steps replace the need for a page model. Some teams skip the page model layer and let the AI generate raw Playwright locator calls directly in step definitions. This works for a demo. At 200 scenarios it means a selector change requires touching 40 step files instead of one page class. The page model is not bureaucracy — it is the abstraction boundary that makes the generated code maintainable. The AI is authoring against your architecture, not replacing it.

Myth: The generation step should run in CI on every build. Running codegen at build time makes CI non-deterministic: the same commit can produce different step code depending on sampling temperature or a model update on the provider's side, and API latency or outages add a failure mode that has nothing to do with your code. Generated code belongs in version control, not in the build artifact. A related myth is that 100% Gherkin coverage is a goal worth pursuing; it isn't. Scenarios that exist only to justify the pipeline cost more in maintenance than they return in signal. Write scenarios where the business language adds genuine clarity, not everywhere a test could theoretically be expressed in Gherkin.

If you implement this pipeline, the next metric worth tracking is mean-time-to-detect step drift: how long between a page model change and a failing CI check. With static analysis wired in, that number should be under five minutes. If it's longer, your type coverage on the page model layer is the bottleneck. Playwright's tracing documentation is worth reading alongside this: trace data from failed scenarios can feed back into the generation prompt for the next iteration.
