Join us at Realcomm in San Diego (June 2–4) → Turning AI into real estate ROI. Book a meeting.


Autonomous AI in Your SDLC

Mikhail Ershov
Operational Director
4 min read

A new guardrail framework lets engineers hand off overnight coding runs to Claude Code and wake up to measurable results. Mikhail Ershov explains how it works.

95.9%: test pass rate achieved autonomously, up from 73%
22: autonomous runs to reach 98% on a search pipeline
0: human approvals needed during overnight iterations

The wrong mental model is costing you time

Most teams using AI code generation tools are making the same conceptual mistake. They treat the tool as either a junior developer to supervise step-by-step, or a clever autocomplete to sit beside. Both framings undervalue what’s possible — and both require constant human attention.

A more useful analogy: Claude Code is a highly capable senior engineer who joined your company this morning. Technically brilliant, confident, fast — but has zero knowledge about your project, your standards, or how you define “done.” Every new session is day one for this person. Your job is not to watch over their shoulder. Your job is to onboard them properly.

“The goal is to spend less personal time and let Claude Code work autonomously overnight — without you approving every step.”

Mikhail Ershov
Operational Director, First Line Software

This reframe drives everything that follows. If you accept it, the next question becomes obvious: what does a good first day look like? What guardrails, documentation, and measurable goals does a brilliant-but-contextless engineer need to do great work unsupervised?

Three layers of guardrails — not prompts

Across two production projects, the same three-layer system produced dramatic autonomous improvements. The key insight: the approach relies on engineering infrastructure, not prompting tricks.

Layer 1: Process guardrails. Build test infrastructure first, with everything stored in a database: structured test fixtures, an autonomous test runner, a regression suite, and full request tracing. Define specifications before writing code. The same engineering discipline you’d expect for a human team also applies here.
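A minimal sketch of that first layer in Python with SQLite; `run_prompt`, the schema, and the pass criterion are hypothetical stand-ins, not the article’s actual implementation:

```python
import sqlite3

def make_db(path=":memory:"):
    """Create the results store the autonomous runner writes into."""
    db = sqlite3.connect(path)
    db.execute("""CREATE TABLE IF NOT EXISTS results (
        run_id INTEGER, prompt TEXT, passed INTEGER, detail TEXT)""")
    return db

def run_suite(db, run_id, fixtures, run_prompt):
    """Run every (prompt, expected) fixture and persist each outcome.

    run_prompt is an assumed callable that executes one demo prompt
    and returns its output as text.
    """
    for prompt, expected in fixtures:
        try:
            out = run_prompt(prompt)
            passed = int(expected in out)   # illustrative pass check
            detail = out[:200]
        except Exception as e:
            passed, detail = 0, repr(e)     # failures are data, not noise
        db.execute("INSERT INTO results VALUES (?,?,?,?)",
                   (run_id, prompt, passed, detail))
    db.commit()
    total, ok = db.execute(
        "SELECT COUNT(*), SUM(passed) FROM results WHERE run_id=?",
        (run_id,)).fetchone()
    return ok / total  # the measurable "done" signal for this run
```

Because every result lands in the database, the pass rate is a number any later session can query rather than a feeling someone has to remember.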

Layer 2: Context encoding. Since every session starts fresh, all execution results, quality reports, architecture decisions, and optimization history must live in a database. Claude Code can then review what past sessions tried, what worked, and what regressions occurred — without needing you to re-explain everything.
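One way to picture that second layer, again as an assumed sketch (the table layout and summary format are illustrative): each session logs what it changed and what happened to the metric, and the next session reads a briefing instead of being re-onboarded by a human.

```python
import sqlite3

def log_attempt(db, session, change, metric_before, metric_after):
    """Record one optimization attempt and its measured effect."""
    db.execute("""CREATE TABLE IF NOT EXISTS history (
        session TEXT, change TEXT, before REAL, after REAL)""")
    db.execute("INSERT INTO history VALUES (?,?,?,?)",
               (session, change, metric_before, metric_after))
    db.commit()

def briefing(db, limit=10):
    """Summarize recent attempts: what helped, what regressed."""
    rows = db.execute("""SELECT session, change, before, after
                         FROM history ORDER BY rowid DESC LIMIT ?""",
                      (limit,)).fetchall()
    return [
        f"{s}: {c} ({'improved' if a > b else 'regressed'} {b:.1%} -> {a:.1%})"
        for s, c, b, a in rows
    ]
```

The point is that the briefing is generated from recorded facts, so a fresh session can see that a regression already happened without anyone re-explaining it.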

Layer 3: Focus management. Storing results in a database naturally narrows Claude Code’s attention. Instead of scanning the entire codebase, it reads the traces and compares them to the target. Scoped tasks consistently outperformed open-ended exploration, which burned context and returned generic output.

What this looks like in practice

Take, for instance, our project with SaleSpace, a natural-language real estate research platform that queries multiple databases, geocoding APIs, and market data sources. The team had 144 demo prompts that needed to pass consistently. Manual testing was slow and expensive. Fixing one prompt often broke others.

Using the guardrail framework, Claude Code ran autonomously across two nights. It discovered rate limit issues, empty role schemas, column type mismatches, and category mapping bugs — and fixed them. It implemented auto-retry logic with exponential backoff, achieving 60% self-recovery from SQL errors. The test pass rate moved from 73% to 95.9% without a human in the loop.
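The auto-retry behavior described above can be sketched as a generic exponential-backoff wrapper; the attempt count, base delay, and jitter here are illustrative assumptions, not the values from the run:

```python
import random
import time

def with_retries(fn, *, attempts=4, base=0.5, sleep=time.sleep):
    """Call fn(); on failure wait base * 2**i (plus jitter) and retry."""
    for i in range(attempts):
        try:
            return fn()
        except Exception:
            if i == attempts - 1:
                raise          # out of retries: surface the error
            sleep(base * 2 ** i + random.uniform(0, 0.1))
```

Wrapping flaky SQL or API calls this way is what turns transient rate-limit errors into self-recovered iterations instead of failed overnight runs.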

On a separate cooking app project with a hybrid vector/full-text search pipeline, the same methodology took success rates from 35% to 98% across 22 autonomous runs. Claude Code independently identified that vector embeddings couldn’t distinguish between different types of meat — and built cross-language search capabilities to compensate.

What didn’t work:

  - Chain-of-thought prompting added to CLAUDE.md: 0% improvement.
  - Routing easier queries to a cheaper model: performance dropped from 73% to 60%.
  - Open-ended “look at the codebase and suggest improvements”: context burned, output generic.

The guardrail system made these failures fast and cheap to detect. Measure results; don’t guess.

Pattern Recognition

| Aspect | SaleSpace | Cooking Class | The Pattern |
| --- | --- | --- | --- |
| Goal | 136 prompts, 95%+ pass rate | 55 tests, pass rate per category | A number, not a feeling |
| Observability | Request tracing: LLM, DB, tools | Pipeline tracing: vectors, filters | CC can self-diagnose |
| Test infra | Runner + regression + E2E in DB | Runner + streaming + history | Autonomous feedback loop |
| Cost tiers | model-eval vs regression | Zero-API vs LLM ($0.08-0.14) | Cheap iterate, expensive confirm |
| CLAUDE.md | Safety, debugging, memory files | Safety, debugging rules | Hard rules, not suggestions |
| CC’s role | Rate limits, schemas, retry logic | Code paths, ingredient validation | Technical solutions |
| My role | Goals, tests, tracing, process | Goals, tests, tracing, process | System design, not solution design |

Five principles for autonomous AI collaboration

  1. Define “done” as a number, not a feeling. Measurable targets — pass rates, cost per run, latency thresholds — give Claude Code something to optimize toward and give you something to verify.
  2. Build fast, cheap feedback loops. The faster Claude Code can run a test and see the result, the more iterations happen in a given window. Infrastructure investment here pays compound returns.
  3. Encode constraints in memory and specs. Don’t rely on a single session to remember your standards. Store them. Every new session starts cold; your documentation shouldn’t.
  4. Narrow the task, give freedom on how. Define the goal precisely. Leave the implementation open. This is what enables overnight runs without approval gates.
  5. Track costs as numbers, too. Set a spend limit per run and let Claude Code optimize within it. Money is a constraint like any other — model it explicitly.
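Principle 5 can be as simple as a hard cap checked before every model call. A hypothetical sketch (the class and its interface are assumptions for illustration, not the article’s tooling):

```python
class RunBudget:
    """Track spend for one autonomous run against a hard dollar cap."""

    def __init__(self, limit_usd):
        self.limit = limit_usd
        self.spent = 0.0

    def charge(self, cost_usd):
        """Record one model call; refuse it if it would exceed the cap."""
        if self.spent + cost_usd > self.limit:
            raise RuntimeError("spend limit reached; stop this run")
        self.spent += cost_usd
        return self.limit - self.spent  # remaining budget
```

With the cap modeled explicitly, an overnight run fails loudly at its budget instead of silently burning money, and the spend per run becomes one more number to optimize.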

Who needs to rethink their role

This approach doesn’t make engineers redundant — but it does change what engineering means. A developer who simply implements specifications to a 95% pass rate can be replaced by Claude Code running this framework. What can’t be replaced: understanding the right specifications in the first place.

The SaleSpace platform wasn’t limited by Claude Code’s ability to write SQL or call APIs. The binding constraint was data engineering — knowing which data sources tell a coherent story, which questions real estate clients actually ask, and how to structure databases so that natural language queries return trustworthy answers. That domain judgment is irreducibly human.

At the same time, business analysts who understand client needs but avoid technical tooling are increasingly at a disadvantage. The gap between “technical enough to use these tools” and “not technical enough” is closing, but the cost of being on the wrong side is rising. The emerging pattern: business analysts and engineers working in close pairs — analysts defining goals and acceptance criteria, engineers handling technical risk during execution.

“Roles are compressing. Developers need business understanding. Analysts need technical awareness. The gap between them is where AI-era value gets created.”

Mikhail Ershov
Operational Director, First Line Software

Real limitations worth naming

The framework has genuine constraints. Claude Code struggles when tasks span more than roughly 15 files simultaneously — the task must be decomposed. It cannot push back on a bad requirement or challenge a flawed target; product judgment remains entirely human. And the permanent amnesia of each new session means documentation maintenance is not optional — it’s load-bearing infrastructure.

These aren’t reasons to avoid the approach. They’re the parameters you design within. The three-layer guardrail system exists precisely to work around the amnesia problem. Scope discipline exists to work around the context ceiling. Measurable goals exist because Claude Code has no opinion on whether your target was wise — only whether it was met.

Mikhail Ershov

Operational Director

Mikhail Ershov is Operational Director at First Line Software. He thinks about AI the way engineers think about systems — with rigor, structure, and a bias for what actually ships. His work focuses on practical AI solutions in software delivery, applying engineering management principles to orchestrate multi-functional collaboration across teams. When AI teams need a framework to stop experimenting and start producing, Mikhail builds it.

Start a conversation today