Testing Kit · Measuring Success
Signal • Not • Scale
How do you know if it’s working?
Most AI metrics measure the floor — time saved, cost reduced, output volume. The Human–AI Loop claims something happens above the floor. So our metrics have to reach higher than the standard toolkit, while still being credible to a skeptical PM or executive.
At this stage, we’re not trying to prove it at N=1,000. We’re trying to identify the right questions — the ones that tell us where the Loop is working, and where it needs to improve. That’s Phase 1. Signal over scale.
Why qualitative metrics first
We’re measuring above the floor.
The standard AI metrics toolkit — time saved, tokens processed, drafts generated — measures whether AI is faster than not having AI. That’s a low bar, and it’s not the interesting question.
The Human–AI Loop makes a different claim: that structured human–AI collaboration produces qualitatively different outcomes — not just faster versions of the same thing. Decisions that are more robust. Outputs that reflect a level of creative rigor that neither human nor AI could reach independently. Learning that compounds across cycles rather than evaporating after each task.
You can’t measure that with a stopwatch. Qualitative metrics aren’t a compromise — they’re philosophically consistent with the thesis. And at this stage, finding the right questions matters more than generating numbers we don’t yet know how to interpret.
The core principle
If we led with quantitative metrics, we’d be playing by the “outputs” rulebook — the exact framing the Loop is designed to move beyond. Leaning into qualitative metrics is the methodology in action.
Start here
Three metrics that matter most right now.
Anchor your pilot on these three. They're credible to a skeptical senior PM or executive, they're not the metrics everyone else is already tracking, and they directly reflect the central claim of the methodology.
Metric 01
“Would you have gotten here alone?”
After each cycle, the human asks: did the Loop take you somewhere you wouldn’t have reached independently? Not faster — somewhere different. This is a simple self-report, but it directly names the central claim of the methodology and it’s almost impossible to game.
What to watch for: consistent “yes” answers across cycles and team members. Occasional “no” is fine — that’s useful signal about where the Loop isn’t adding value.
Metric 02
Option surface
Did the Loop surface options or directions the human hadn’t considered before the AI partner entered? This is a counterfactual self-report — after Test and Build, did the option space expand in ways that surprised you?
Senior leaders get this immediately. Their biggest frustration with teams is convergence that happens too fast — everyone anchoring on the first reasonable idea. The Loop is designed to counter that. Option surface is the proof point.
Metric 03
Decision durability
Did the output hold up downstream — or did it get revised, reversed, or quietly shelved? Durable decisions are expensive to produce and easy to recognize in hindsight. Loop-produced decisions should hold up better because they’ve been stress-tested through multiple rounds of human judgment and AI challenge.
This one is worth tracking over a longer horizon. Ask about it 2–4 weeks after the cycle closes, not immediately after.
Full metrics taxonomy
Other dimensions worth watching.
These don’t all need to be tracked in a first pilot — but they’re the directions worth building toward as the methodology matures.
Process signals
Observable indicators that the collaboration is actually happening — not just AI producing and the human accepting. A minimal logging sketch follows the list.
- Iteration depth — how many substantive rounds before the output was accepted? Zero rounds = rubber-stamping.
- Human judgment touchpoints — how often did the human redirect, push back, or override?
- Time-to-confidence — not time-to-output, but time until the human felt genuinely confident in the result.
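To make these three signals concrete, here is a minimal sketch of what a per-cycle log could look like. Everything in it is illustrative: the CycleLog name, its fields, and the idea that a team records three numbers per cycle are assumptions, not part of the methodology.

```python
# Illustrative sketch only -- not part of the methodology.
# Assumes a team is willing to record three numbers per Loop cycle;
# the CycleLog name and its fields are hypothetical.
from dataclasses import dataclass
from datetime import datetime, timedelta

@dataclass
class CycleLog:
    cycle_id: str
    started: datetime
    # Substantive revision rounds before the output was accepted.
    # Zero rounds suggests rubber-stamping, not collaboration.
    iteration_rounds: int = 0
    # Times the human redirected, pushed back, or overrode the AI.
    judgment_touchpoints: int = 0
    # When the human felt genuinely confident in the result --
    # deliberately distinct from when the output first existed.
    confident_at: datetime | None = None

    def time_to_confidence(self) -> timedelta | None:
        # None means confidence never arrived; that is signal too.
        if self.confident_at is None:
            return None
        return self.confident_at - self.started
```

A spreadsheet with the same three columns would serve just as well; the point is that zero-round cycles and absent confidence become visible rather than lost.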
Learning and compounding
The Loop's claim is that learning compounds across cycles, not just within them. This is the dimension nobody else is measuring. A sketch of one way to compute reuse rate follows the list.
- Reuse rate — how often does a Codify-phase output get pulled into future work?
- Test-phase cycle time — does it get faster over repeated cycles as the team builds intuition?
- Onboarding acceleration — can a new team member get up to speed faster because Loop artifacts exist?
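Reuse rate is the most directly computable of the three. Here is a sketch of one way to calculate it, assuming each cycle records which earlier Codify-phase artifacts it pulled in; the function name and artifact labels are hypothetical.

```python
# Illustrative sketch only. Assumes each cycle records which earlier
# Codify-phase artifacts it pulled in; names below are hypothetical.
def reuse_rate(codified: set[str], later_cycles: list[set[str]]) -> float:
    """Fraction of codified artifacts referenced in at least one later cycle."""
    if not codified:
        return 0.0
    reused = {a for cycle in later_cycles for a in cycle if a in codified}
    return len(reused) / len(codified)

# Two of three codified artifacts were picked up by later work -> ~0.67.
print(reuse_rate(
    {"brief-template", "risk-checklist", "eval-rubric"},
    [{"brief-template"}, {"risk-checklist", "vendor-doc"}],
))
```

Cycle time and onboarding acceleration follow the same pattern: pick a simple, honest definition first, then refine it once you've seen a few cycles.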
Team dynamics
The hardest to measure and the most important. These are the signals that indicate the Loop is changing how the team works, not just what they produce.
- Felt human agency — does the human feel more in control of their work, or less?
- Attribution clarity — can everyone articulate what the human contributed vs. AI?
- Adoption spread — does one team running the Loop pull adjacent teams toward it?
Output quality
Subjective but defensible, especially if you design the evaluation carefully. A small summary sketch for blind-review scores follows the list.
- Blind review — Loop output vs. solo-human or standard-AI output on the same brief, rated by a neutral panel on creativity, robustness, and clarity.
- Stakeholder reception — how was the output received by people who didn’t know how it was produced?
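Even a tiny blind-review panel can be summarized simply. The sketch below assumes three raters scoring each output 1–5 on the three dimensions named above, blind to how either was produced; the scores are placeholders, not data.

```python
# Illustrative sketch only. Scores are placeholders: three raters,
# 1-5 scale, same brief, raters blind to how each output was produced.
from statistics import mean

ratings = {
    "loop":     {"creativity": [4, 5, 4], "robustness": [4, 4, 5], "clarity": [3, 4, 4]},
    "baseline": {"creativity": [3, 3, 4], "robustness": [3, 4, 3], "clarity": [4, 4, 3]},
}

for dimension in ("creativity", "robustness", "clarity"):
    loop_avg = mean(ratings["loop"][dimension])
    base_avg = mean(ratings["baseline"][dimension])
    print(f"{dimension}: loop {loop_avg:.1f} vs. baseline {base_avg:.1f}")
```

With panels this small, the means are noisy; treat the comparison as signal, not proof, which is consistent with the Phase 1 framing.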
Not yet — and why that’s honest
The rigorous metrics come later.
There are metrics that would be genuinely rigorous — A/B testing at the team level, longitudinal capability tracking, quantitative outcome comparison across matched cohorts. These are the gold standard. They’re also not where to start.
The quantitative layer comes after you’ve run enough cycles to know what you’re measuring. Designing a rigorous measurement framework before you understand the failure modes and the dimensions that matter is backwards. Phase 1 is about finding the right questions. The numbers follow.
This is good research design, not a gap
Every well-designed methodology goes through a qualitative discovery phase before it quantifies. That’s not a weakness — it’s how you avoid measuring the wrong things with great precision. The Testing Kit is Phase 1. It’s designed to surface signal, not prove the theorem.
For pilot testers
What to actually capture during your pilot.
You don't need a scoring rubric. After you've run a cycle, spend 10 minutes reflecting on these questions. That reflection is the data. If you want the reflections to accumulate in one place across cycles, a lightweight capture sketch follows the questions.
After Test
What surprised you?
Did the AI surface directions you hadn’t considered? Did any of them change your approach to the Build phase?
After Build
How many rounds did it take?
Did you push back, redirect, or challenge the AI? Or did you accept early? What drove the iterations?
After Codify
Would you reuse this?
Is the framework, template, or pattern something you’d pull into future work — or did it feel one-time?
2–4 weeks later
Did it hold up?
Was the decision revised or reversed? Did the output stand up to scrutiny from people who didn’t know how it was made?
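For teams that prefer the reflections to accumulate rather than evaporate, here is a lightweight capture sketch, assuming free-text answers appended to a local JSONL file. The file name, field names, and example answers are all hypothetical; the free-text reflection itself remains the data.

```python
# Illustrative sketch only. File name, field names, and example
# answers are hypothetical; the free-text reflection is the data.
import json
from datetime import date

def record_reflection(cycle_id: str, answers: dict[str, str],
                      path: str = "loop-reflections.jsonl") -> None:
    # Append one reflection per line so entries accumulate across cycles.
    entry = {"cycle": cycle_id, "date": date.today().isoformat(), **answers}
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(entry) + "\n")

record_reflection("cycle-03", {
    "after_test": "AI surfaced a pricing angle we hadn't considered.",
    "after_build": "Three rounds; pushed back twice on framing.",
    "after_codify": "Yes, the brief template is reusable.",
    "weeks_later": "",  # revisit 2-4 weeks after the cycle closes
})
```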
We’re less interested in what AI can produce — and more interested in what humans and AI can achieve together. Measuring that well is part of the work.