Build an AI Test Assistant with Memory

Build with AI 6 min read July 24, 2026

Most AI-assisted testing tools today are stateless: you paste a failing test, get a suggestion, close the tab, and repeat the same conversation tomorrow. The model has no idea your team banned time.sleep() three sprints ago, that your checkout flow has a known flakiness pattern on Safari 17, or that your staging environment drops WebSocket connections under load. Every prompt starts from zero. That's not an assistant — that's autocomplete with better vocabulary.

The missing layer is memory: a persistent, queryable store of your team's test history, failure patterns, architecture decisions, and domain-specific constraints that the model can retrieve at inference time. Retrieval-Augmented Generation (RAG) applied to a test corpus is the mechanism. It's not new as a pattern, but most teams haven't wired it into their test toolchain yet because the integration points are non-obvious.

By the end of this article you'll have a working design for a stateful AI test assistant that ingests Pytest results, Gherkin feature files, and CI failure logs into a vector store, then surfaces context-aware suggestions during test authoring and triage. The architecture runs on OpenAI's text-embedding-3-small model, a local Chroma DB instance, and a thin Python orchestration layer — swappable at every joint.

What "Memory" Actually Means in a Test Assistant

A stateful AI test assistant is a RAG pipeline scoped to your test domain. At write time, it embeds and indexes artifacts — Gherkin scenarios, Pytest parametrize tables, Playwright trace metadata, CI YAML, past failure stack traces — into a vector database. At query time, a retrieval step pulls the top-k most semantically relevant chunks and injects them into the model's context window before generation. The model never "learns" your codebase; it retrieves relevant facts on demand. That distinction matters for compliance and for reasoning about what the assistant actually knows.
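To make the query-time step concrete, here is a minimal sketch of top-k retrieval by cosine similarity. The three-dimensional vectors and the INDEX contents are made-up stand-ins for real embeddings, and the function names are illustrative only; a vector database does the same ranking with an approximate index rather than a full scan.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec: list[float], indexed: list[tuple[str, list[float]]], k: int = 2) -> list[str]:
    """Return the k documents whose vectors are most similar to the query."""
    scored = [(cosine_similarity(query_vec, vec), doc) for doc, vec in indexed]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in scored[:k]]


# Toy index: (document, vector) pairs. Real embeddings have hundreds of dimensions.
INDEX = [
    ("checkout_flow failed: session cookie missing on Safari", [0.9, 0.1, 0.0]),
    ("login test passed", [0.0, 0.2, 0.9]),
    ("cart test failed: cookie expired", [0.7, 0.3, 0.1]),
]


def build_prompt_context(query_vec: list[float], k: int = 2) -> str:
    """Join the retrieved chunks into one block to prepend to the prompt."""
    return "\n---\n".join(top_k(query_vec, INDEX, k))
```

A query vector close to the first axis retrieves the two cookie-related failures and skips the unrelated login test, which is the behavior the retrieval step relies on.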

In a modern test architecture this sits between your CI event bus and your LLM call. GitHub Actions or Jenkins emits a test result event; a lightweight consumer (a Python Lambda or a Kafka consumer on Confluent Cloud) extracts structured data, embeds it, and upserts into Chroma or Pinecone. Your IDE plugin or Slack bot then queries that store before forwarding any prompt to Claude or ChatGPT. The vector store is the source of truth for accumulated test knowledge — not the model weights, not a wiki nobody updates.
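As a rough sketch of the "extracts structured data" step, the consumer below assumes the CI job publishes a JUnit-style XML report (Pytest can produce one with --junitxml). The function name and the output dictionary shape are hypothetical; the embed and upsert calls are left out so the parsing can be read on its own.

```python
import xml.etree.ElementTree as ET


def documents_from_junit(xml_text: str) -> list[dict]:
    """Turn each <testcase> in a JUnit-style report into an indexable record."""
    root = ET.fromstring(xml_text)
    docs = []
    for case in root.iter("testcase"):
        # Compare against None explicitly: an Element with no children is falsy.
        failure = case.find("failure")
        if failure is None:
            failure = case.find("error")
        skipped = case.find("skipped")

        if failure is not None:
            outcome, detail = "failed", failure.get("message") or ""
        elif skipped is not None:
            outcome, detail = "skipped", skipped.get("message") or ""
        else:
            outcome, detail = "passed", ""

        test_id = f"{case.get('classname', '')}::{case.get('name', '')}"
        docs.append({
            "id": test_id,
            "document": f"TEST: {test_id}\nOUTCOME: {outcome}\nDETAIL: {detail}",
            "metadata": {"outcome": outcome, "duration": float(case.get("time", 0.0))},
        })
    return docs
```

Each returned record maps directly onto the ids, documents, and metadatas arguments of a vector-store upsert, so the consumer stays a thin translation layer.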

Wiring the Pipeline: Embeddings, Chroma, and a Pytest Plugin

Start with ingestion. A conftest.py hook captures each Pytest result as the test finishes and ships it to your embedding pipeline. The pytest_runtest_logreport hook receives a report object carrying outcome, nodeid, longrepr (the failure representation, with the traceback text available as longreprtext), and duration, which is everything you need to build a meaningful document for the vector store.

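# conftest.py
import os
import time

import chromadb
import pytest
from openai import OpenAI

oai = OpenAI()  # reads OPENAI_API_KEY from the environment
chroma = chromadb.HttpClient(host="localhost", port=8000)
collection = chroma.get_or_create_collection("test_memory")

# One ID per CI run, so each run adds a record instead of overwriting the last one.
# GITHUB_RUN_ID is set by GitHub Actions; the timestamp fallback is an assumption for local runs.
RUN_ID = os.environ.get("GITHUB_RUN_ID", str(int(time.time())))

def _embed(text: str) -> list[float]:
    """Return the embedding vector for one document."""
    resp = oai.embeddings.create(model="text-embedding-3-small", input=text)
    return resp.data[0].embedding

@pytest.hookimpl(tryfirst=True)
def pytest_runtest_logreport(report):
    # Index only the test body, not the setup and teardown phases.
    if report.when != "call":
        return
    # longreprtext holds the traceback text for failures and is empty for passes.
    doc = f"TEST: {report.nodeid}\nOUTCOME: {report.outcome}\nDETAIL: {getattr(report, 'longreprtext', '')}"
    collection.upsert(
        ids=[f"{report.nodeid}@{RUN_ID}"],
        embeddings=[_embed(doc)],
        documents=[doc],
        metadatas=[{
            "nodeid": report.nodeid,
            "outcome": report.outcome,
            "duration": report.duration,
            "ingested_at": time.time(),  # lets a pruning job expire old records
        }],
    )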

This hook fires for every test on every CI run. Over two weeks on a mid-size suite (~800 scenarios), the Chroma collection accumulates enough signal that nearest-neighbor retrieval starts surfacing genuinely useful context: the same checkout_flow test that failed 11 times in the last 30 days shows up when you ask "why does Safari drop the session cookie?" even though you never mentioned the test name. Embedding cost on text-embedding-3-small for 800 documents averages under $0.002 per full suite run.
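The cost figure can be sanity-checked with a few lines of arithmetic. This sketch assumes a price of $0.02 per million tokens for text-embedding-3-small and an average of about 100 tokens per document; both numbers are assumptions, so check current pricing and measure your own documents before relying on them.

```python
# Assumed price in USD per one million input tokens; verify against current pricing.
PRICE_PER_MILLION_TOKENS_USD = 0.02


def embedding_cost_usd(
    num_documents: int,
    avg_tokens_per_document: int,
    price_per_million: float = PRICE_PER_MILLION_TOKENS_USD,
) -> float:
    """Estimate the embedding cost of one full suite run."""
    total_tokens = num_documents * avg_tokens_per_document
    return total_tokens / 1_000_000 * price_per_million


# 800 documents at roughly 100 tokens each is 80,000 tokens per run.
suite_run_cost = embedding_cost_usd(num_documents=800, avg_tokens_per_document=100)
```

Under those assumptions the run comes to about $0.0016, consistent with the sub-$0.002 figure. The same 800 documents at 2,000 tokens each would cost about $0.032, so document size is the variable to watch.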

The query side is a thin wrapper. Before any prompt reaches the model, retrieve the top-5 chunks and prepend them as a system message. Here's the core retrieval call in TypeScript for a VS Code extension context:

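// assistant.ts
import { ChromaClient } from "chromadb";
import OpenAI from "openai";

const openai = new OpenAI(); // reads OPENAI_API_KEY from the environment
const client = new ChromaClient({ path: "http://localhost:8000" });
// Top-level await requires an ES module context.
const collection = await client.getCollection({ name: "test_memory" });

// Queries must be embedded with the same model used at ingestion time.
async function embed(text: string): Promise<number[]> {
  const resp = await openai.embeddings.create({
    model: "text-embedding-3-small",
    input: text,
  });
  return resp.data[0].embedding;
}

// Retrieve the five nearest documents and join them into one context block.
async function buildContext(userQuery: string): Promise<string> {
  const queryEmbedding = await embed(userQuery);
  const results = await collection.query({
    queryEmbeddings: [queryEmbedding],
    nResults: 5,
    include: ["documents", "metadatas"],
  });
  return results.documents[0].join("\n---\n");
}

// Prepend the retrieved history as a system message, then ask the model.
async function ask(userQuery: string): Promise<string> {
  const context = await buildContext(userQuery);
  const response = await openai.chat.completions.create({
    model: "gpt-4o",
    messages: [
      { role: "system", content: `Relevant test history:\n${context}` },
      { role: "user", content: userQuery },
    ],
  });
  return response.choices[0].message.content ?? "";
}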

In practice, teams using this pattern report triage time dropping from roughly 18 minutes per flaky-test investigation to under 4 — not because the model is smarter, but because the retrieval step eliminates the manual grep-through-CI-logs phase. Also index your Gherkin feature files: embed each Scenario block as its own document. When an engineer asks "do we already cover guest checkout with an expired card?", the assistant retrieves the three closest scenarios and answers from evidence rather than hallucinating coverage.
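One way to get one document per Scenario is a small line-based splitter like the sketch below. It is deliberately not a full Gherkin parser: it assumes English keywords, recognizes only Scenario and Scenario Outline, keeps tag lines that sit directly above a scenario, and drops the Feature header and Background section. The function name is illustrative.

```python
import re

# Matches the first line of a scenario block (checked against stripped lines).
SCENARIO_START = re.compile(r"^(Scenario Outline|Scenario):")


def split_scenarios(feature_text: str) -> list[str]:
    """Split a .feature file into one text block per scenario."""
    scenarios = []     # each entry is the list of lines for one scenario
    current = None     # lines of the scenario being collected, if any
    pending_tags = []  # @tag lines seen since the last non-tag line

    for raw in feature_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        if line.startswith("@"):
            pending_tags.append(line)
            continue
        if SCENARIO_START.match(line):
            current = pending_tags + [line]
            scenarios.append(current)
            pending_tags = []
            continue
        pending_tags = []  # tags that did not precede a scenario are dropped
        if current is not None:
            current.append(line)  # steps, tables, and Examples rows

    return ["\n".join(block) for block in scenarios]
```

Each returned block can then be embedded and upserted as its own document, with the feature file path and scenario title stored as metadata so that retrieved results point back to the source.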

Where Senior Engineers Still Get Burned

Embedding noise instead of signal. The most common mistake is indexing raw CI logs verbatim — thousands of tokens of stack trace boilerplate, timestamps, and ANSI escape codes. Embeddings of noisy text cluster poorly; retrieval returns garbage. Pre-process aggressively: strip ANSI, truncate stack traces to the first unique frame, normalize test node IDs. A 200-token clean document outperforms a 2,000-token raw one every time. This is a data-quality problem, not a model problem, and treating it as the latter wastes weeks.
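A minimal sketch of that pre-processing might look like the following. The details are interpretations rather than a prescribed recipe: stack-trace truncation is approximated by keeping each distinct line once and capping the total, timestamps are assumed to be ISO-8601 style, and node ID normalization is assumed to mean unifying path separators and dropping the parametrize suffix.

```python
import re

# ANSI CSI sequences such as color codes, e.g. "\x1b[31m".
ANSI_ESCAPE = re.compile(r"\x1b\[[0-?]*[ -/]*[@-~]")
# ISO-8601 style timestamps, e.g. "2026-07-24T10:15:00Z" (an assumed log format).
TIMESTAMP = re.compile(r"\d{4}-\d{2}-\d{2}[T ]\d{2}:\d{2}:\d{2}(?:[.,]\d+)?Z?[ \t]*")
# Trailing parametrize suffix on a Pytest node ID, e.g. "[usd-10]".
PARAM_SUFFIX = re.compile(r"\[.*\]$")


def normalize_nodeid(nodeid: str) -> str:
    """Use forward slashes and drop the parametrize suffix."""
    return PARAM_SUFFIX.sub("", nodeid.replace("\\", "/"))


def clean_log(raw: str, max_lines: int = 20) -> str:
    """Strip ANSI codes and timestamps, drop blank and repeated lines, cap length."""
    text = ANSI_ESCAPE.sub("", raw)
    text = TIMESTAMP.sub("", text)
    seen = set()
    kept = []
    for line in text.splitlines():
        line = line.rstrip()
        if not line.strip() or line in seen:
            continue  # skip blank lines and frames already kept
        seen.add(line)
        kept.append(line)
        if len(kept) == max_lines:
            break
    return "\n".join(kept)
```

Running the cleaner before embedding, rather than at query time, keeps the stored documents small and means every later retrieval benefits from the same normalization.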

Ignoring collection drift. A vector store that never expires documents becomes a liability. A test that was flaky six months ago but has since been rewritten will still poison retrieval results. Implement a TTL strategy: tag every document with an ingestion timestamp and prune records older than 90 days on a nightly cron. Chroma's delete API accepts a where filter on metadata — use it. Teams that skip this step find their assistant confidently citing obsolete failure patterns within a quarter.
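The pruning job can be very small. This sketch assumes every document was stored with a numeric ingested_at metadata field holding a Unix timestamp; that field name is hypothetical, so use whatever key your ingestion step writes. The collection argument is expected to be a Chroma collection, whose delete method accepts a where filter with comparison operators such as $lt.

```python
import time

NINETY_DAYS_SECONDS = 90 * 24 * 60 * 60


def prune_expired(collection, now=None, max_age_seconds=NINETY_DAYS_SECONDS) -> float:
    """Delete documents older than max_age_seconds and return the cutoff used.

    Assumes each document's metadata has a numeric "ingested_at" Unix timestamp.
    """
    now = time.time() if now is None else now
    cutoff = now - max_age_seconds
    # Metadata filter: remove every record ingested before the cutoff.
    collection.delete(where={"ingested_at": {"$lt": cutoff}})
    return cutoff
```

Passing now explicitly makes the job easy to test and to replay for a specific date; the nightly cron entry would call prune_expired with the test_memory collection and no other arguments.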

Myths That Will Slow Your Rollout

"Fine-tuning gives better results than RAG." For a test assistant, this is almost always wrong. Fine-tuning bakes knowledge into weights at a point in time; your test suite changes daily. RAG retrieves current evidence at query time. Fine-tuning also requires labeled data curation, retraining cycles, and a new model deployment every time your architecture shifts. RAG requires a vector upsert. Use fine-tuning for style and format consistency if you must; use RAG for domain knowledge. They're not competing approaches — but if you can only do one, RAG wins for a living codebase.

"The assistant replaces test review." It doesn't. The assistant is good at pattern matching across history: surfacing similar past failures, flagging scenarios that duplicate existing coverage, suggesting parameter variations based on what's been missed before. It is not good at reasoning about causality in distributed systems, evaluating whether a new scenario actually tests the right contract, or catching a subtle race condition in an async Playwright test. Keep human review in the loop for scenario authorship and root-cause sign-off. The assistant accelerates those humans; it doesn't replace the judgment.

A stateful test assistant is a retrieval pipeline first and an LLM feature second. Get the ingestion and data-quality layer right before optimizing prompts. Once the memory is reliable, the next thing worth measuring is mean-time-to-detect on recurring flaky tests — a well-indexed failure history should surface repeat offenders within one CI cycle rather than after a week of manual correlation. From there, look at extending the corpus to include OpenTelemetry trace data from Grafana Tempo to give the assistant observability context alongside test history.

