Join us at Realcomm in San Diego (June 2–4) → Turning AI into real estate ROI. Book a meeting.


Autonomous AI in Your SDLC

Mikhail Ershov
Operational Director
4 min read

A new guardrail framework lets engineers hand off overnight coding runs to Claude Code and wake up to measurable results. Mikhail Ershov explains how it works.

95.9%: test pass rate achieved autonomously, up from 73%
22: autonomous runs to reach 98% on a search pipeline
0: human approvals needed during overnight iterations

The wrong mental model is costing you time

Most teams using AI code generation tools are making the same conceptual mistake. They treat the tool as either a junior developer to supervise step-by-step, or a clever autocomplete to sit beside. Both framings undervalue what’s possible — and both require constant human attention.

A more useful analogy: Claude Code is a highly capable senior engineer who joined your company this morning. Technically brilliant, confident, fast — but has zero knowledge about your project, your standards, or how you define “done.” Every new session is day one for this person. Your job is not to watch over their shoulder. Your job is to onboard them properly.

“The goal is to spend less personal time and let Claude Code work autonomously overnight — without you approving every step.”

Mikhail Ershov
Operational Director, First Line Software

This reframe drives everything that follows. If you accept it, the next question becomes obvious: what does a good first day look like? What guardrails, documentation, and measurable goals does a brilliant-but-contextless engineer need to do great work unsupervised?

Three layers of guardrails — not prompts

Across two production projects, the same three-layer system produced dramatic autonomous improvements. The key insight: the approach relies on engineering infrastructure, not prompting tricks.

Layer 1: Process guardrails. Build test infrastructure first, with everything stored in a database: structured test fixtures, an autonomous test runner, a regression suite, and full request tracing. Define specifications before writing code. The same engineering discipline you’d expect for a human team also applies here.
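A minimal sketch of that first layer in Python with SQLite; `run_prompt`, the schema, and the pass criterion are hypothetical stand-ins, not the article’s actual implementation:

```python
import sqlite3

def make_db(path=":memory:"):
    """Create the results store the autonomous runner writes into."""
    db = sqlite3.connect(path)
    db.execute("""CREATE TABLE IF NOT EXISTS results (
        run_id INTEGER, prompt TEXT, passed INTEGER, detail TEXT)""")
    return db

def run_suite(db, run_id, fixtures, run_prompt):
    """Run every (prompt, expected) fixture and persist each outcome.

    run_prompt is an assumed callable that executes one demo prompt
    and returns its output as text.
    """
    for prompt, expected in fixtures:
        try:
            out = run_prompt(prompt)
            passed = int(expected in out)   # illustrative pass check
            detail = out[:200]
        except Exception as e:
            passed, detail = 0, repr(e)     # failures are data, not noise
        db.execute("INSERT INTO results VALUES (?,?,?,?)",
                   (run_id, prompt, passed, detail))
    db.commit()
    total, ok = db.execute(
        "SELECT COUNT(*), SUM(passed) FROM results WHERE run_id=?",
        (run_id,)).fetchone()
    return ok / total  # the measurable "done" signal for this run
```

Because every result lands in the database, the pass rate is a number any later session can query rather than a feeling someone has to remember.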

Layer 2: Context encoding. Since every session starts fresh, all execution results, quality reports, architecture decisions, and optimization history must live in a database. Claude Code can then review what past sessions tried, what worked, and what regressions occurred — without needing you to re-explain everything.
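One way to picture that second layer, again as an assumed sketch (the table layout and summary format are illustrative): each session logs what it changed and what happened to the metric, and the next session reads a briefing instead of being re-onboarded by a human.

```python
import sqlite3

def log_attempt(db, session, change, metric_before, metric_after):
    """Record one optimization attempt and its measured effect."""
    db.execute("""CREATE TABLE IF NOT EXISTS history (
        session TEXT, change TEXT, before REAL, after REAL)""")
    db.execute("INSERT INTO history VALUES (?,?,?,?)",
               (session, change, metric_before, metric_after))
    db.commit()

def briefing(db, limit=10):
    """Summarize recent attempts: what helped, what regressed."""
    rows = db.execute("""SELECT session, change, before, after
                         FROM history ORDER BY rowid DESC LIMIT ?""",
                      (limit,)).fetchall()
    return [
        f"{s}: {c} ({'improved' if a > b else 'regressed'} {b:.1%} -> {a:.1%})"
        for s, c, b, a in rows
    ]
```

The point is that the briefing is generated from recorded facts, so a fresh session can see that a regression already happened without anyone re-explaining it.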

Layer 3: Focus management. Storing results in a database naturally narrows Claude Code’s attention. Instead of scanning the entire codebase, it reads the traces and compares them to the target. Scoped tasks consistently outperformed open-ended exploration, which burned context and returned generic output.

What this looks like in practice

Take, for instance, our project with SaleSpace, a natural-language real estate research platform that queries multiple databases, geocoding APIs, and market data sources. The team had 144 demo prompts that needed to pass consistently. Manual testing was slow and expensive. Fixing one prompt often broke others.

Using the guardrail framework, Claude Code ran autonomously across two nights. It discovered rate limit issues, empty role schemas, column type mismatches, and category mapping bugs — and fixed them. It implemented auto-retry logic with exponential backoff, achieving 60% self-recovery from SQL errors. The test pass rate moved from 73% to 95.9% without a human in the loop.
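The auto-retry behavior described above can be sketched as a generic exponential-backoff wrapper; the attempt count, base delay, and jitter here are illustrative assumptions, not the values from the run:

```python
import random
import time

def with_retries(fn, *, attempts=4, base=0.5, sleep=time.sleep):
    """Call fn(); on failure wait base * 2**i (plus jitter) and retry."""
    for i in range(attempts):
        try:
            return fn()
        except Exception:
            if i == attempts - 1:
                raise          # out of retries: surface the error
            sleep(base * 2 ** i + random.uniform(0, 0.1))
```

Wrapping flaky SQL or API calls this way is what turns transient rate-limit errors into self-recovered iterations instead of failed overnight runs.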

On a separate cooking app project with a hybrid vector/full-text search pipeline, the same methodology took success rates from 35% to 98% across 22 autonomous runs. Claude Code independently identified that vector embeddings couldn’t distinguish between different types of meat — and built cross-language search capabilities to compensate.

What didn’t work:

  - Chain-of-thought prompting added to CLAUDE.md: 0% improvement.
  - Routing easier queries to a cheaper model: performance dropped from 73% to 60%.
  - Open-ended “look at the codebase and suggest improvements”: context burned, output generic.

The guardrail system made these failures fast and cheap to detect. Measure results; don’t guess.

Pattern Recognition

| Aspect | SaleSpace | Cooking Class | The Pattern |
| --- | --- | --- | --- |
| Goal | 136 prompts, 95%+ pass rate | 55 tests, pass rate per category | A number, not a feeling |
| Observability | Request tracing: LLM, DB, tools | Pipeline tracing: vectors, filters | CC can self-diagnose |
| Test infra | Runner + regression + E2E in DB | Runner + streaming + history | Autonomous feedback loop |
| Cost tiers | model-eval vs regression | Zero-API vs LLM ($0.08-0.14) | Cheap iterate, expensive confirm |
| CLAUDE.md | Safety, debugging, memory files | Safety, debugging rules | Hard rules, not suggestions |
| CC’s role | Rate limits, schemas, retry logic | Code paths, ingredient validation | Technical solutions |
| My role | Goals, tests, tracing, process | Goals, tests, tracing, process | System design, not solution design |

Five principles for autonomous AI collaboration

  1. Define “done” as a number, not a feeling. Measurable targets — pass rates, cost per run, latency thresholds — give Claude Code something to optimize toward and give you something to verify.
  2. Build fast, cheap feedback loops. The faster Claude Code can run a test and see the result, the more iterations happen in a given window. Infrastructure investment here pays compound returns.
  3. Encode constraints in memory and specs. Don’t rely on a single session to remember your standards. Store them. Every new session starts cold; your documentation shouldn’t.
  4. Narrow the task, give freedom on how. Define the goal precisely. Leave the implementation open. This is what enables overnight runs without approval gates.
  5. Track costs as numbers, too. Set a spend limit per run and let Claude Code optimize within it. Money is a constraint like any other — model it explicitly.
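Principle 5 can be as simple as a hard cap checked before every model call. A hypothetical sketch (the class and its interface are assumptions for illustration, not the article’s tooling):

```python
class RunBudget:
    """Track spend for one autonomous run against a hard dollar cap."""

    def __init__(self, limit_usd):
        self.limit = limit_usd
        self.spent = 0.0

    def charge(self, cost_usd):
        """Record one model call; refuse it if it would exceed the cap."""
        if self.spent + cost_usd > self.limit:
            raise RuntimeError("spend limit reached; stop this run")
        self.spent += cost_usd
        return self.limit - self.spent  # remaining budget
```

With the cap modeled explicitly, an overnight run fails loudly at its budget instead of silently burning money, and the spend per run becomes one more number to optimize.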

Who needs to rethink their role

This approach doesn’t make engineers redundant — but it does change what engineering means. A developer who simply implements specifications to a 95% pass rate can be replaced by Claude Code running this framework. What can’t be replaced: understanding the right specifications in the first place.

The SaleSpace platform wasn’t limited by Claude Code’s ability to write SQL or call APIs. The binding constraint was data engineering — knowing which data sources tell a coherent story, which questions real estate clients actually ask, and how to structure databases so that natural language queries return trustworthy answers. That domain judgment is irreducibly human.

At the same time, business analysts who understand client needs but avoid technical tooling are increasingly at a disadvantage. The gap between “technical enough to use these tools” and “not technical enough” is closing, but the cost of being on the wrong side is rising. The emerging pattern: business analysts and engineers working in close pairs — analysts defining goals and acceptance criteria, engineers handling technical risk during execution.

“Roles are compressing. Developers need business understanding. Analysts need technical awareness. The gap between them is where AI-era value gets created.”

Mikhail Ershov
Operational Director, First Line Software

Real limitations worth naming

The framework has genuine constraints. Claude Code struggles when tasks span more than roughly 15 files simultaneously — the task must be decomposed. It cannot push back on a bad requirement or challenge a flawed target; product judgment remains entirely human. And the permanent amnesia of each new session means documentation maintenance is not optional — it’s load-bearing infrastructure.

These aren’t reasons to avoid the approach. They’re the parameters you design within. The three-layer guardrail system exists precisely to work around the amnesia problem. Scope discipline exists to work around the context ceiling. Measurable goals exist because Claude Code has no opinion on whether your target was wise — only whether it was met.

Mikhail Ershov

Operational Director

Mikhail Ershov is Operational Director at First Line Software. He thinks about AI the way engineers think about systems — with rigor, structure, and a bias for what actually ships. His work focuses on practical AI solutions in software delivery, applying engineering management principles to orchestrate multi-functional collaboration across teams. When AI teams need a framework to stop experimenting and start producing, Mikhail builds it.

Start a conversation today