Autonomous AI in Your SDLC, Part II: We Tried to Give the Method Away

Mikhail Ershov
Operational Director

A continuation of “Autonomous AI in Your SDLC.” Last time we argued that Claude Code is a senior engineer who joined your company this morning, and that your job is to build the guardrails — process, context, focus — where it can work autonomously and verifiably. This is what happened when we stopped describing the method and tried to make it something anyone could run.

Part one ended on a high: a guardrail system that improved a bilingual search engine from 35% to 98% across 22 mostly autonomous iterations, and a second project that replicated the result in a different domain. Same framework, same outcome. We were pleased.

But we were also nagged by two uncomfortable facts.

The first: every success had the same person in the loop. The results were real, but did the method work, or did we? You cannot answer that from inside your own wins.

The second was more practical. The method lived in one head. Every new program started with the same speech: here’s how we define “done” as a number, here’s why the test suite comes before the feature, here’s the debugging discipline, here’s how we keep ourselves honest. Claude Code has no memory between sessions. Every session is day one, so the foundation was re-taught by hand, project after project. It couldn’t be handed to a teammate without them re-learning the expensive lessons the expensive way.

A method that only works when its author runs it is not a method. It’s a person. So we set out to prove the framework was real by removing ourselves from it.

From a method to an artifact

We encoded the methodology as a set of Claude Code “skills”: installable commands that carry the whole discipline. That means the structured interview that turns “quality is poor” into a runnable program, the oracle design, the tracing, the cost tiers, the gates, and, most importantly, every lesson we’d paid for in earlier projects, written where the agent reads it automatically.

That last point is what makes it more than a prompt template. The skill doesn’t just ask good questions; it carries scar tissue. A few examples of what’s baked in:

The oracle question. Before any work starts, it asks whether your “gold” data could itself be wrong. If your reference comes from the tool you’re trying to replace, treating it as ground truth just teaches the system to reproduce that tool’s errors. We learned that the hard way — nine wasted iterations on one project.
The escalation ladder. “Stop after three failed attempts” isn’t enough; the attempts have to change kind — a prompt tweak, then deterministic code, then a convention question for the human — never the same move twice.
Canary validation. When you re-test a fix cheaply, the sample must include the cases the fix is most likely to break, not just the ones it targets. On one project, a change that scored 95.6% on its favorable slice had silently broken 73 unrelated fields at full scale. A favorable slice is how regressions sneak into production.
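The canary rule can be sketched in a few lines of Python. This is a minimal illustration, not the skill's actual code: the function name, its parameters, and the idea of tagging cases as "targeted" or "at risk" are all assumptions made for the example.

```python
import random


def build_canary_sample(all_ids, targeted_ids, at_risk_ids, extra=20, seed=0):
    """Build a cheap re-test sample (hypothetical helper).

    The sample always contains the cases the fix targets AND the cases it is
    most likely to break, plus a random background slice of everything else.
    """
    must_include = set(targeted_ids) | set(at_risk_ids)
    remaining = sorted(set(all_ids) - must_include)
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    background = rng.sample(remaining, min(extra, len(remaining)))
    return sorted(must_include) + background


# Illustrative data only: 100 cases, 2 targeted by the fix, 2 judged at risk.
sample = build_canary_sample(range(100), [1, 2], [50, 51], extra=5)
```

The design choice is that the at-risk cases are mandatory rather than sampled, so a favorable slice cannot hide a regression simply by leaving them out.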

None of this is novel to a seasoned engineering manager. That’s the point. It’s the stuff we’d otherwise repeat to every team, now running at machine speed and surfacing to a human only the questions that genuinely need one.

The field test: replacing a vendor in a day

Before any controlled experiment, we ran it for real. A new domain — commercial-lease abstraction, 114 structured fields pulled from messy PDFs — with a real business goal: replace an incumbent vendor tool, with evidence good enough to put in front of a customer.

One working day. About $80 of API spend. On five lease bundles the engine had never seen, it verified 96.5% of fields correct against the source documents, versus the incumbent’s 87.2% on the same contracts.

The interesting part wasn’t the number; it was how the number was earned. The only available “ground truth” was the vendor’s own output, and it was wrong often enough to matter. So the skill’s machinery did something a naive accuracy script never would: a blinded two-model panel read the original leases, every ruling in our favor was attacked by an adversarial reviewer hunting for counter-evidence, and a human signed off on the consequential calls. The audit trail wasn’t debugging output. It was the customer deliverable.
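One possible shape for that adjudication logic is sketched below. Everything in it is an assumption made for illustration: the `FieldRuling` structure, the vote labels, and the rule that panel disagreement escalates to a human are not taken from the project, which only states that a blinded panel ruled, favorable rulings were challenged, and a human signed off on consequential calls.

```python
from dataclasses import dataclass


@dataclass
class FieldRuling:
    field_name: str
    panel_votes: tuple[str, str]  # each vote: "engine", "vendor", or "unclear"
    challenge_upheld: bool        # adversarial reviewer found counter-evidence
    consequential: bool           # would this call matter to the customer?


def adjudicate(ruling: FieldRuling) -> str:
    """Return "engine", "vendor", or "human_review" (hypothetical policy)."""
    first, second = ruling.panel_votes
    # Assumption: a split or unclear panel never decides on its own.
    if first != second or first == "unclear":
        return "human_review"
    # A ruling in the engine's favor must survive the adversarial challenge.
    if first == "engine" and ruling.challenge_upheld:
        return "human_review"
    # Consequential calls always get a human signature.
    if ruling.consequential:
        return "human_review"
    return first
```

The asymmetry is deliberate in this sketch: only rulings that favor the engine are attacked, because those are the ones the team has an incentive to believe.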

And, consistent with part one’s honesty-over-hype stance, the project’s own write-up named where the skill fell short: it burned through budget fast until a human imposed economics, and it would have trusted the flawed vendor reference if a human hadn’t overruled it. We fed every one of those weaknesses back into the skill the same day. Hold that thought; it becomes a pattern.

The honest experiment: a time machine

The field test proved the skill delivers. It did not prove what carried the result: the skill, or the dozen judgment calls the operator made along the way. To isolate that, we needed the same problem with the author removed. But all our problems were already solved.

So we used git history as a time machine. We checked out the bilingual-search project at the exact commit where its test framework first landed — before any of the quality journey — handed a fresh agent the test suite (the one thing an operator must supply), installed the skill, and let it run. We were strict about it: the future commits were physically removed so the agent couldn’t read ahead, the old run history was wiped, and we wrote down, in advance, exactly what we’d measure and what would count as cheating. A result you can’t attack is a result you can’t trust.

The replay started at 49% and reached 100%. Round one — fully autonomous — matched the original multi-evening, 22-run program in 78 minutes for about sixty cents. Total cost across the whole thing: roughly a dollar. It even audited its predecessor. It discovered that our original “100%” had been measured on a test path that quietly skipped a production filter. The old suite had been flattering itself, and the replay said so, with evidence.

The framework worked without us. That was the question, answered.

What our people were needed for

The same experiment drew the line between the two jobs more clearly than any description could. Round one ran to a strong score with no human input. Then a human stepped in for three decisions, and each was a values question in a technical costume: should a category the AI merely guessed be allowed to exclude results, or only rank them? Should the test suite measure the harsher real pipeline, even though the score would drop? Is “pasta” the same as “noodles” for search purposes?

Those three rulings unlocked the final stretch, and surfaced two production bugs the original project had never found. The machine owned the mechanisms. The human owned the meaning. An operator who only presses “continue” gets a well-documented mediocre result; the framework amplifies judgment, it doesn’t replace it.

What broke, and why it was the best thing that happened

Then someone else ran it. A different person, a different task. And it exposed the most important limitation we’ve found.

The program hit its target — 50% to 100%. But the honest reading, which that operator wrote down herself, was damning: three of four iterations had moved the score by editing the test, not improving the system. About nine lines of application code changed. “100%” meant “the app now matches a corrected, repeatedly-adjusted ruler,” not “the system got better.”

What had the loop done? Every time the model was weak at something — say, estimating a nutrition value — the cheap path was to loosen the test to match. It kept taking that path, one reasonable-looking, human-approved decision at a time. It was accommodating the model’s limitations instead of transforming the system to overcome them. The most dangerous sentence in this entire field is “it’s an LLM, it hallucinates, live with it” because it’s often true, always available, and it quietly ends the inquiry.

A person doesn’t do that. Faced with “embeddings can’t tell chicken from lamb,” our original project didn’t loosen the meat tests; it invented a new mechanism. It changed reality to meet the expectation, rather than bending the expectation to meet flawed reality. The autonomous loop never even put that option on the table.

This pointed at a real limit, and it was the same limit twice: an autonomous loop hill-climbs brilliantly, and cannot tell when it’s climbing the wrong hill. It optimizes within a frame; it can’t step back and question the frame. And the polish of a capable tool quietly lulls its operator from a co-author who pushes back into a consumer who approves output.
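One way to make this failure visible early is to label each score-moving iteration by what it actually changed. The sketch below is an assumption-laden illustration: the `tests/` path prefix, the function names, and the sample data are invented for the example and do not come from the project described above.

```python
from collections import Counter

TEST_PREFIXES = ("tests/",)  # hypothetical repository layout


def classify_iteration(changed_paths, test_prefixes=TEST_PREFIXES):
    """Label an iteration by whether it touched tests, application code, or both."""
    touched_tests = any(p.startswith(test_prefixes) for p in changed_paths)
    touched_app = any(not p.startswith(test_prefixes) for p in changed_paths)
    if touched_tests and touched_app:
        return "mixed"
    if touched_tests:
        return "test_only"
    if touched_app:
        return "app_only"
    return "no_change"


def summarize(iterations):
    """Count score-raising iterations by kind.

    iterations: list of (score_before, score_after, changed_paths).
    """
    counts = Counter()
    for before, after, paths in iterations:
        if after > before:
            counts[classify_iteration(paths)] += 1
    return dict(counts)


# Invented data for illustration only.
history = [
    (50, 70, ["tests/nutrition_cases.json"]),
    (70, 80, ["app/ranking.py"]),
    (80, 90, ["tests/nutrition_cases.json"]),
    (90, 100, ["tests/search_cases.json"]),
]
report = summarize(history)
```

A report dominated by `test_only` entries does not prove anything was done wrong, since a ruler can legitimately need correcting, but it is the moment for a human to ask which of the two was really improved.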

The fix: two tools, because there are two relationships

The tempting fix — make the autonomous loop “creative” enough to question itself — is the wrong one. Every step in that direction blunts the thing that made it valuable (running the routine without you), and still doesn’t produce a genuine thinking partner. The problem wasn’t a missing feature. It was one tool being asked to serve two opposite needs.

So we split it.

Manage the machine — the implementer. A convergent, autonomous loop that grinds a measurable target to its number, cheaply and honestly. Best when the problem is well formed and you have, or can build, an oracle. It still hill-climbs, but now it knows it does, and when it plateaus or catches itself about to lower the bar, it says so and hands off.
Think with the machine — the collaborator. A divergent partner, in dialogue, one question at a time. It shapes a fuzzy problem into something measurable, and breaks a stuck one by naming the contradiction and inventing structurally different approaches — the “transform the system” option the loop can’t generate on its own.

They share one core of principles and compose naturally: the collaborator shapes a problem and hands it to the implementer; when the implementer gets stuck, it routes back to the collaborator. Managing a worker and thinking with a peer are different relationships, and they finally have different tools.
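The hand-off between the two tools can be pictured as a small routing rule. This is a sketch under stated assumptions: the function name, the `patience` and `min_gain` thresholds, and the return labels are hypothetical and are not the skills' real interface.

```python
def next_step(scores, target, oracle_edit_proposed, patience=3, min_gain=0.5):
    """Decide what the implementer loop does next (hypothetical policy).

    scores: score after each iteration so far, oldest first.
    oracle_edit_proposed: True if the next move would loosen the test.
    """
    # About to lower the bar: stop and hand off instead of proceeding.
    if oracle_edit_proposed:
        return "hand_off_to_collaborator"
    if scores and scores[-1] >= target:
        return "done"
    # Plateau: too little gain over the last `patience` iterations.
    if len(scores) > patience:
        recent_gain = scores[-1] - scores[-1 - patience]
        if recent_gain < min_gain:
            return "hand_off_to_collaborator"
    return "continue"
```

Checking for a proposed oracle edit before anything else is the point of the sketch: a loop that is about to lower the bar should say so even when the score looks healthy.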

And note the recurring pattern: the lease project improved the skill, the replay improved the skill, and this failure improved it most of all. The methodology that demands a retrospective after every iteration got one itself, and acted on it. Every program audits the instrument that runs it.

There’s a final loop we didn’t plan. The replay had authored its own held-out tests; we ported them into the live production search engine — months and many releases downstream of where the experiment began — and ran the skill against the real codebase. The tests caught one gap the original program had missed; the skill closed it. Production now passes all 68 of 68. The clean-room experiment didn’t just reproduce a historical result: its tests became a standing audit of the real product, and the same skill that wrote them brought the product up to meet them.

Who needs to rethink their role?

Part one argued that adopting autonomous AI changes the manager’s job. This sharpens it. With two tools — one to manage the machine, one to think with it — the defining skill of the person in the loop is no longer writing the code. It’s knowing, moment to moment, which relationship the problem needs, and owning the outcome across both.

This is the developer’s role in a “vibe-coding” world. Anyone can ask for a result and get one. The distance between a result and success is knowing how to organize work, and that knowledge turns out to be mostly human-proven engineering management, which transfers almost verbatim, plus one AI-specific layer: managing the machine’s context and focus. The scarce skill is shifting from producing the work to defining “done,” building verification you can trust without reading the code, delegating well, and exercising judgment at the right altitude. The skills encode the harness. The judgment is still hired.

Real limitations worth naming

In the spirit of the first article, the honest constraints:

The collaborator is young. The implementer has five programs behind it; the divergent thinking partner is new and far less battle-tested. It structures divergent thinking and forces contradictions into the open, but it is an aid to a human’s judgment, not an insight oracle.
A soft oracle invites surrender. When the test set is small and editable, the loop will discover that it can raise the score by lowering the bar. The fix is partly cultural (a human watching for it) and partly mechanical (freeze the oracle; require the system to still pass the original one). Neither is automatic.
It can lull a passive operator. The framework can make mediocrity visible — it logs every place a judgment call was made — but it cannot install the disposition to push back in someone who’d rather have a vending machine. It raises the floor; it doesn’t replace the operator.
Not every problem fits. If “done” is a genuine matter of taste with no expressible rubric, or there’s no affordable way to build ground truth, the whole apparatus has nothing to grip and just adds ceremony. The skill will tell you so; declining to oversell is part of the method.
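The mechanical half of the soft-oracle fix, freezing the oracle and requiring the system to still pass the original, might look like the sketch below. The case format, the function names, and the use of a SHA-256 fingerprint are assumptions for illustration, not a description of how the skill implements it.

```python
import hashlib
import json


def fingerprint(cases):
    """Stable hash of a test set, recorded once when the oracle is frozen."""
    canonical = json.dumps(cases, sort_keys=True, ensure_ascii=False)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def gate(frozen_cases, frozen_fingerprint, current_cases, passes):
    """Accept a change only if it passes BOTH the frozen and the current oracle.

    passes: callable taking one case dict and returning True or False.
    """
    if fingerprint(frozen_cases) != frozen_fingerprint:
        raise ValueError("frozen oracle changed since it was recorded")
    frozen_failures = [c["id"] for c in frozen_cases if not passes(c)]
    current_failures = [c["id"] for c in current_cases if not passes(c)]
    return {
        "frozen_failures": frozen_failures,
        "current_failures": current_failures,
        "accepted": not frozen_failures and not current_failures,
    }
```

With this gate, loosening the current test set can still raise the visible score, but it cannot produce an accepted change unless the original cases pass too. Deciding when the frozen set itself is wrong remains a human ruling.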

We started this chapter trying to answer one question: was it the method, or was it us? The answer is both, at different altitudes. The method is real, transferable, and now installable. It reproduced a multi-week result in 78 minutes without its author in the room. And the judgment is still indispensable, just relocated: not to writing the code, but to defining the goal, ruling on what’s true, and knowing when to stop managing the machine and start thinking with it.

Twenty years of engineering judgment didn’t become unnecessary. It became infrastructure. The job now is to keep being the person whose judgment is worth encoding.

The skills, the experiment protocol, and the full reproducible record are available on request.

June 2026