Dave Farley's Keynote: Acceptance Testing as 5GL
Dave Farley opened the unconference with a vision: acceptance tests as a fifth-generation programming language. His layered ATDD architecture separates concerns into four layers:
- DSL Layer: Business-level domain language describing behavior
- Protocol Driver Layer: An adapter translating the DSL into executable actions
- Test Skills Layer: The actual test execution infrastructure
- System Under Test: The application being tested
The key insight: the same DSL can exercise the system through different interfaces (unit tests, API, browser) by swapping out the protocol driver layer. This makes AI-driven development feasible — you write the specification, AI writes the implementation.
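The swappable-driver idea can be sketched in a few lines of Python. All names here are illustrative, not taken from Farley's Flight-Search-ATDD repository: the test speaks only the business DSL, and retargeting it at a different interface means swapping the protocol driver, nothing else.

```python
# Minimal sketch of the four-layer ATDD idea: DSL on top,
# protocol driver as the adapter, system under test behind it.
from abc import ABC, abstractmethod


class ProtocolDriver(ABC):
    """Protocol driver layer: translates DSL calls into executable actions."""

    @abstractmethod
    def search_flights(self, origin: str, destination: str) -> list[str]: ...


class InMemoryDriver(ProtocolDriver):
    """Drives the domain model directly (unit-test speed). An HTTP or
    browser driver would implement the same interface."""

    def __init__(self, flights: dict[tuple[str, str], list[str]]):
        self._flights = flights

    def search_flights(self, origin, destination):
        return self._flights.get((origin, destination), [])


class FlightSearchDsl:
    """DSL layer: business-level language, no protocol details leak in."""

    def __init__(self, driver: ProtocolDriver):
        self._driver = driver

    def search(self, origin, destination):
        return self._driver.search_flights(origin, destination)


# The same specification runs unchanged against any driver.
dsl = FlightSearchDsl(InMemoryDriver({("LHR", "JFK"): ["BA117"]}))
assert dsl.search("LHR", "JFK") == ["BA117"]
```

The test code never mentions HTTP, URLs, or browsers, which is exactly what makes the specification stable while implementations change underneath it.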
Dave's slides are available: "The Future of Programming with AI: Acceptance Testing as 5GL". His demo repository is at Flight-Search-ATDD on GitHub.
"Let's Put Dave to the Test": What Actually Happened
A participant organized a hands-on mob-programming workshop (Day 1, Sessions 2–4) in which the group attempted to apply Farley's approach to a real coding challenge: a banking API from a finmid interview exercise. Multiple tools were tried simultaneously — Claude Code (Python) and OpenAI Codex (Java).
The results were sobering:
What went wrong
- ✗ AI-generated scenarios were too technical. When given the OpenAPI spec and asked to generate test scenarios, the AI produced only 3 test cases (one per endpoint) using HTTP status codes (404, 409) instead of banking domain language. "There's no conflict error in the banking space" — the AI leaked implementation details into what should be domain-level tests.
- ✗ Context confusion from mixed projects. Having both Farley's Flight Search project and the banking challenge in the same workspace confused the AI — it tried to mix flights and bank accounts.
- ✗ Critical edge cases were missed. A participant who spent a full week on the same challenge with various LLMs reported: 15–20 rounds with different models, and none identified concurrency/deadlock issues until explicitly prompted. None thought about big decimals for money (vs. floating point). "There are pitfalls that none of the LLMs can figure out just from the requirements."
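The big-decimal pitfall the participant mentioned is easy to demonstrate: binary floats cannot represent most decimal fractions exactly, so repeated money arithmetic drifts. A two-line comparison makes the point:

```python
# Why floating point is wrong for money: 0.10 has no exact binary
# representation, so the rounding error accumulates.
from decimal import Decimal

balance_float = 0.0
balance_exact = Decimal("0.00")
for _ in range(10):
    balance_float += 0.10            # accumulates binary rounding error
    balance_exact += Decimal("0.10") # exact decimal arithmetic

print(balance_float == 1.0)               # False
print(balance_exact == Decimal("1.00"))   # True
```

None of the models flagged this from the requirements alone, which is the participant's point: the pitfall lives in domain expertise, not in the spec.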
The group's key takeaways:
- Write the scenarios yourself — do not let AI generate the acceptance criteria
- Do not mix programming languages in the same context
- Feed context incrementally rather than all at once
- Fine-tune your agents.md/claude.md — include rules like "no comments," "work in the domain space," "do one thing at a time," "ask me for confirmation"
- You still need to be the expert — "It's like a newborn baby every time"
"Worst case scenario is that it actually exactly does what you asked it to do. It's just not what you wanted."
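A hypothetical claude.md embodying the group's rules might look like this (the wording is illustrative, assembled from the takeaways above, not a file shared at the session):

```markdown
# claude.md — project rules

- Do not write code comments.
- Work in the domain space; never leak HTTP status codes or other
  implementation details into acceptance scenarios.
- Do one thing at a time.
- Ask me for confirmation before making each change.
```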
The Thinking Problem: Brain Rot and the 97% Rework Rate
The session "How to Detect Thinking by Humans" (Day 1, Session 1) hosted one of the most philosophically rich conversations of the unconference. A participant posed a fundamental question:
"In the past, we used to use written documents to figure out if someone has thought about a problem. Now we're saying generate these documents with AI. How do I know that someone has thought about a problem? We can't use the document anymore."
The group explored where human thinking checkpoints should be placed in the development process and converged on the specification phase — understanding what to solve matters more than how to solve it.
At one company, developers using Copilot had 90% more throughput on requests and deployments — but also 97% more reworks. The sensation of speed does not equal actual efficiency.
"Maybe writing slowly would actually go as fast."
The group identified two distinct feedback loops:
- Inner loop (fast): Translating a defined specification into code, running tests, iterating
- Outer loop (slow): Validating with real users whether the product hypothesis is correct
AI is accelerating the inner loop. But the outer loop is the one that ultimately matters. And a provocative point was raised: if regenerating code from a changed prompt is nearly free, the traditional wisdom that "bugs found later are more expensive to fix" may not hold for AI-generated code.
"Writing Triggers Thinking"
A key insight emerged: the act of writing code triggers thinking. When you start coding, you discover things you had not considered in the specification. Remove humans from the coding process, and this discovery loop never kicks in. One participant compared it to children doing homework: "If they write something, they actually think about it."
This concern was echoed across multiple sessions. In the Day 2 "AI And The Physical World" session, participants discussed doctors who used AI diagnostic tools and got measurably worse at diagnosis within three months: "All of them now think, okay, I have the machine." The brain, as one participant put it, "is lazy" — it naturally offloads work when possible, and the atrophied skills may not come back easily.
AI as Inquiry Partner, Not Answer Machine
A session on "Building AI Supported Inquiry Systems" (Day 1, Session 2) offered a compelling reframe. A participant's central thesis: use AI not to provide answers, but to support better inquiry.
"I cannot trust the AI to give me the answer, especially in a domain it doesn't know. But if at least it's able to build a common way for me to think through certain problems, then it gets me better."
His concrete technique: instead of having AI generate acceptance tests as statements, generate them as questions with multiple-choice answers (A, B, or C). Humans then validate the answers. This bridged the trust gap between technical teams and business stakeholders.
In the ATDD session, another participant described a complementary practice: when letting AI write production code, he deliberately hid the acceptance tests from the AI, because "it will try to cheat at some point — it's like training versus testing, two different data sets."
Example Mapping: The Human Practice That AI Cannot Replace
The session on Example Mapping (Day 1, Session 1) was a grounding reminder that some practices are inherently human. Example Mapping is a ~30 minute exercise where a cross-functional group (business, development, testing) works through a user story using colored sticky notes: a yellow card for the story, blue cards for the rules, green cards for concrete examples, and red cards for open questions.
Two rules of thumb stood out:
- If a story has more than 3 rules, consider splitting it
- Do NOT write Gherkin in the session — it blows up session time without adding to understanding. Write Gherkin afterward.
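A toy model shows how little machinery the practice needs; the class below is my own sketch (field names follow the standard card colors of Example Mapping) with the session's "more than 3 rules" heuristic as a method:

```python
# Toy example map: one story card, plus rules, examples, and questions.
from dataclasses import dataclass, field


@dataclass
class ExampleMap:
    story: str                                          # yellow card
    rules: list[str] = field(default_factory=list)      # blue cards
    examples: list[str] = field(default_factory=list)   # green cards
    questions: list[str] = field(default_factory=list)  # red cards

    def should_split(self) -> bool:
        # Rule of thumb from the session: >3 rules suggests the
        # story is too big and should be split.
        return len(self.rules) > 3


story = ExampleMap(
    "Withdraw cash",
    rules=["no overdraft", "daily limit", "card must be active", "PIN verified"],
)
assert story.should_split()  # 4 rules: consider splitting
```

The value of the session, of course, is the conversation around the cards, which is exactly the part no tool replaces.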
The session also bridged into Event Storming as a complementary "outside-in" approach for discovering user stories before zooming into individual ones.
A useful resource shared: the Humanizing Work Story Splitting Flowchart, which helps ask the right questions to arrive at good scenarios.
TDD with AI Agents: The RGR Sub-Agent Vision
A session on "RGR/TDD Sub-Agents" (Day 1, Session 5) explored a vision for full automation of the TDD process using specialized sub-agents. The concept: create three specialized sub-agents — a "red agent", a "green agent", and a "refactor agent" — each handling their respective phase of the Red-Green-Refactor cycle. These agents would run in a continuous loop, autonomously, until a large feature is complete. Combined with Claude Code's planning mode beforehand, this approach aims to ship large pieces of functionality very incrementally, with high accuracy and quality.
Supporting this vision, a participant shared TDD-Guard — an automated TDD enforcement tool for Claude Code that blocks agents from skipping tests.
The pattern connects to the broader "baby steps" philosophy: work inside-out (domain first, then adapters, one layer at a time), verify each step with tests, and never let AI-produced code that you cannot explain pass to main.
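The orchestration shape of the sub-agent vision can be sketched as a plain loop. The agent functions below are stubs of my own; in the session's vision they would be Claude Code sub-agents, with TDD-Guard blocking any attempt to skip the red phase.

```python
# Skeleton of the Red-Green-Refactor sub-agent loop: each phase is a
# separate agent, and every feature passes through all three in order.
from typing import Callable


def rgr_loop(features: list[str],
             red: Callable[[str], str],
             green: Callable[[str], str],
             refactor: Callable[[str], str]) -> list[str]:
    """Run each feature through one Red-Green-Refactor pass."""
    shipped = []
    for feature in features:
        failing_test = red(feature)    # red agent: write a failing test
        passing_code = green(failing_test)  # green agent: make it pass
        shipped.append(refactor(passing_code))  # refactor agent: clean up
    return shipped


# Stub agents just label their phase so the loop's shape is visible.
result = rgr_loop(
    ["transfer money"],
    red=lambda f: f"test({f})",
    green=lambda t: f"impl({t})",
    refactor=lambda c: f"clean({c})",
)
assert result == ["clean(impl(test(transfer money)))"]
```

Keeping the phases as separate agents mirrors the "one layer at a time" discipline: no agent can conflate writing the test with making it pass.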