ralph-starter - AI-Powered Autonomous Coding from Specs to Production Blog

Spec Driven Development with ralph-starter

2026-04-08T00:00:00.000Z

Spec Driven Development is the biggest shift in AI coding since agents learned to run tests. Here is how ralph-starter fits in.

The problem with "just prompt it"

Most people use AI coding agents the same way: type a sentence, hit enter, hope for the best. "Add user auth." "Fix the sidebar." Three words and vibes.

I did this for weeks. The agent would generate something that looked plausible but missed what I actually wanted. I blamed the tool, but the problem was me. I was not giving it enough context.

Then I started writing specs -- not essays, just 10-20 lines describing what I actually wanted, how to verify it, and where things should go. The difference was night and day. 2 loops instead of 5. $0.50 instead of $3. Correct output instead of close-but-wrong.

This pattern has a name now: Spec Driven Development (SDD).

The SDD landscape

Three frameworks are leading the SDD conversation:

Tool	Philosophy	Lock-in
OpenSpec (Fission AI)	Lightweight, fluid, tool-agnostic	None
Spec-Kit (GitHub)	Heavyweight, rigid 5-phase gates	GitHub ecosystem
Kiro (AWS)	Full IDE with built-in agents	AWS account required

OpenSpec organizes specs into changes with proposal.md, design.md, tasks.md, and requirement specs using RFC 2119 keywords (SHALL, MUST, SHOULD). It is the lightest of the three.

Spec-Kit enforces five phases: constitution, specification, plan, tasks, implement. Thorough but heavy.

Kiro bundles everything into a VS Code fork with agent hooks and EARS notation. Powerful but locked to AWS.

Where ralph-starter fits

ralph-starter takes a different angle: your specs already exist somewhere.

They are in GitHub Issues. Linear tickets. Notion docs. Figma designs. OpenSpec directories. Why rewrite them in a new format?

ralph-starter pulls specs from where they already live:

# From GitHub issues
ralph-starter run --from github --project myorg/myrepo --label "ready"

# From OpenSpec directories
ralph-starter run --from openspec:add-auth

# From Linear tickets
ralph-starter run --from linear --project "Mobile App"

# From a Notion doc
ralph-starter run --from notion --project "https://notion.so/spec-abc123"

Then it runs autonomous loops: build context, spawn agent, collect output, run validation (lint/build/test), commit, repeat until done.

New in v0.5.0: OpenSpec + spec validation

We just shipped native OpenSpec support and a spec validator:

# List all OpenSpec changes in the project
ralph-starter spec list

# Validate spec completeness (0-100 score)
ralph-starter spec validate

# Validate before running -- stops if spec is too thin
ralph-starter run --from openspec:my-feature --spec-validate

The validator checks for:

Proposal or rationale section (why are we building this?)
RFC 2119 keywords (SHALL, MUST -- formal requirements)
Given/When/Then acceptance criteria (testable conditions)
Design section (how to build it)
Task breakdown (implementation steps)

A spec scoring below 40/100 gets flagged before the agent starts. This saves tokens on underspecified work.
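
To make the scoring idea concrete, here is a minimal sketch of a completeness check over the five items listed above. It is not ralph-starter's actual validator: the regular expressions, the equal 20-point weights, and the names `CHECKS`, `score_spec`, and `is_too_thin` are all assumptions for illustration, and it assumes the spec is markdown.

```python
import re

# Hypothetical checks mirroring the five items above. Patterns and the equal
# 20-point weights are assumptions, not ralph-starter's real scoring rules.
CHECKS = {
    "rationale": re.compile(r"^#+\s*(proposal|rationale|why)\b", re.I | re.M),
    "rfc2119": re.compile(r"\b(SHALL|MUST|SHOULD)\b"),
    "acceptance": re.compile(r"\bGiven\b.*\bWhen\b.*\bThen\b", re.I | re.S),
    "design": re.compile(r"^#+\s*design\b", re.I | re.M),
    "tasks": re.compile(r"^\s*-\s*\[[ xX]\]", re.M),
}


def score_spec(text):
    """Return (score 0-100, names of the checks that found nothing)."""
    missing = [name for name, pattern in CHECKS.items() if not pattern.search(text)]
    return 100 - 20 * len(missing), missing


def is_too_thin(text, threshold=40):
    """True when the spec scores below the threshold and should be flagged."""
    return score_spec(text)[0] < threshold
```

With these weights, a one-liner like "Add authentication to the app" scores 0 and is flagged, while a spec that has all five elements scores 100.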

The new spec command

ralph-starter spec gives you a CLI for spec operations:

# Validate all specs in the project
ralph-starter spec validate

# List available specs (auto-detects OpenSpec, Spec-Kit, or raw)
ralph-starter spec list

# Show completeness summary
ralph-starter spec summary

It auto-detects whether you are using OpenSpec format, GitHub Spec-Kit format, or plain markdown specs.

The numbers

Metric	Without specs	With specs
Loops per task	5	2
Cost per task	~$3.00	~$0.50
Output accuracy	Hit or miss	Consistent
Time writing spec	0 min	3 min

The 3 minutes spent writing a spec save 15 minutes of iteration and debugging. The spec is the leverage.

What is next

We are working on:

Spec coverage tracking -- which requirements have been implemented?
Spec-to-test generation -- Given/When/Then to test stubs
Living specs -- specs that update as implementation diverges

SDD is not a fad. It is the natural evolution of AI-assisted coding. The spec is the interface between human intent and machine execution. The clearer the spec, the better the output.

ralph-starter is open source, MIT licensed: github.com/multivmlabs/ralph-starter

How ralph-starter Converts Figma Designs to Pixel-Perfect Code

2026-03-07T00:00:00.000Z

I ran ralph-starter figma, pasted a Figma URL, picked my tech stack, and walked away. When I came back, the generated landing page matched the Figma design at 98.2% pixel accuracy. The AI agent had caught its own font-size mismatch, fixed it, and passed a strict visual comparison — all without me writing a single line of code.

This is how Figma-to-code should work. Here is exactly how it works.

The Figma-to-code gap

Every developer has been here: you get a Figma design, spend hours eyeballing spacing values, manually extracting colors, guessing which font weight the designer used, and then the design review comes back with 30 comments about misaligned elements.

Tools like the official Figma MCP and others try to solve this, but most of them break down when the designer did not use Auto Layout. Real-world Figma files are messy — absolute positioning, nested frames without constraints, text layers with overrides.

ralph-starter takes a different approach. It reads the raw Figma node tree via the REST API and calculates positioning and z-index from the actual coordinates of each element. It does not depend on how the designer organized the layers.

Messy Figma file? Works the same.
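
As a rough illustration of working from coordinates instead of layer structure, the sketch below derives vertical gaps between sibling nodes from their bounding boxes. The `absoluteBoundingBox` field (x, y, width, height) is part of the Figma REST API node format; the function name `vertical_gaps` and the sample node names are hypothetical, and this is not ralph-starter's actual extraction code.

```python
def vertical_gaps(nodes):
    """Return (upper_name, lower_name, gap_px) for siblings ordered top to bottom.

    Each node is a dict shaped like a Figma REST API node with a `name` and an
    `absoluteBoundingBox`. A negative gap means the two boxes overlap.
    """
    ordered = sorted(nodes, key=lambda n: n["absoluteBoundingBox"]["y"])
    gaps = []
    for upper, lower in zip(ordered, ordered[1:]):
        upper_box = upper["absoluteBoundingBox"]
        lower_box = lower["absoluteBoundingBox"]
        gap = lower_box["y"] - (upper_box["y"] + upper_box["height"])
        gaps.append((upper["name"], lower["name"], gap))
    return gaps
```

Because the input is sorted by position, it does not matter what order the designer left the layers in.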

One command: `ralph-starter figma`

The v0.4.0 release adds an interactive wizard that handles the full workflow:

$ ralph-starter figma
  Figma to Code
  Design to code in one command

? Figma design URL: https://figma.com/design/ABC123/Dashboard
? What would you like to build? responsive dashboard with sidebar nav
? Tech stack? Next.js + TypeScript + Tailwind CSS (Detected)
? Which model? Claude Opus 4.6 — maximum quality (Recommended)

Four steps:

Paste the Figma URL. Any design file link works — no special setup or plugins needed.
Describe what to build. A short natural-language description of the component or page.
Pick your tech stack. Auto-detected from your package.json. Supports Next.js, React, Vue, Nuxt, Svelte, Astro, and plain HTML.
Choose a model. ralph-starter detects which AI coding agents you have installed (Claude Code, Cursor, Codex, Gemini CLI, GitHub Copilot, etc.) and shows the relevant models.

Under the hood, it extracts a complete technical spec from the Figma API — typography, colors, spacing, images, icons, font detection — and passes it to the AI agent along with an implementation plan. The agent works in an autonomous loop: write code, validate with lint and build, commit, repeat.

Three-layer visual validation

The real differentiator in v0.4.0 is the visual validation pipeline. After each coding iteration, ralph-starter doesn't just check if the code builds — it checks if it looks right.

Layer 1: Pixel comparison (pixelmatch)

ralph-starter starts your dev server, captures a full-page screenshot with Playwright, and runs pixelmatch against a screenshot of the Figma design. If the pixel difference is under 2%, it passes immediately. Zero LLM cost.
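
The 2% gate is easy to picture with a toy version. The sketch below is a stand-in, not pixelmatch: it does a plain per-channel comparison on small in-memory images (lists of rows of RGB tuples), whereas pixelmatch uses a perceptual color metric and anti-aliasing detection. The names `diff_ratio` and `passes_pixel_gate` are hypothetical.

```python
def diff_ratio(img_a, img_b, tolerance=0):
    """Fraction of pixels where any RGB channel differs by more than `tolerance`."""
    same_shape = len(img_a) == len(img_b) and all(
        len(row_a) == len(row_b) for row_a, row_b in zip(img_a, img_b)
    )
    if not same_shape:
        raise ValueError("images must have identical dimensions")
    total = sum(len(row) for row in img_a)
    differing = sum(
        1
        for row_a, row_b in zip(img_a, img_b)
        for pixel_a, pixel_b in zip(row_a, row_b)
        if any(abs(a - b) > tolerance for a, b in zip(pixel_a, pixel_b))
    )
    return differing / total


def passes_pixel_gate(img_a, img_b, max_diff=0.02):
    """Pass only when the differing fraction is strictly under the threshold."""
    return diff_ratio(img_a, img_b) < max_diff
```

On a 10x10 image, one differing pixel (1%) passes and three differing pixels (3%) fail, which is when the more expensive vision step would kick in.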

Layer 2: LLM vision analysis

When the pixel diff exceeds 2%, things get interesting. ralph-starter sends three images to an LLM vision API:

The original Figma design screenshot
The implementation screenshot
A diff overlay where red pixels highlight the mismatches

The model returns actionable, numbered issues:

Header font should be 88px serif, currently 40px sans-serif
Section gap should be ~80px, currently ~20px
Button border-radius should be 8px, currently square

The agent then fixes each issue in the next loop iteration.

Layer 3: Strict gate

After fixes are applied, the pipeline runs pixelmatch again with a strict 2% threshold. This catches anything the LLM missed — sub-pixel rendering differences are fine, but real layout or color mismatches get flagged for another round.

Here is what a typical run looks like end-to-end:

→ Capturing implementation screenshot...
→ Running pixel comparison...
  Pixel diff: 8.3% (14,231 pixels differ) — analyzing...
→ Sending to LLM vision for semantic analysis...
  1. Header font should be 88px serif, currently 40px sans-serif
  2. Section gap should be ~80px, currently ~20px
→ Agent fixing 2 issues...
→ Re-running strict pixel comparison...
  Pixel diff: 1.1% — strict check passed
✓ Visual validation passed

Playwright and sharp are auto-installed the first time visual validation runs. No manual setup needed.

What gets extracted from Figma

ralph-starter does not just grab colors and text. The full extraction includes:

Typography: font family, size, weight, line height, letter spacing, text decoration
Colors: fills, strokes, gradients, opacity values
Spacing: padding, margins, gaps calculated from element coordinates
Layout: auto-layout properties, constraints, absolute positioning
Images and icons: downloaded at correct scale, optimized with sharp
Fonts: Google Fonts detected and configured in your project automatically
Component variants: component properties and variant metadata preserved
Effects: shadows, blurs, and other layer effects

Five extraction modes are available: spec (default), tokens, components, assets, and content.

Works with any AI coding agent

ralph-starter is agent-agnostic. The Figma wizard works with:

Claude Code (Claude Opus 4.6, Sonnet 4.5)
Cursor (any model)
OpenAI Codex (o3, o4-mini)
Gemini CLI (Gemini models)
GitHub Copilot
OpenCode
Amp

Skills (Tailwind v4, React best practices, design systems) are auto-injected into the agent's context, so it generates code that follows current best practices regardless of which agent you choose.

Try it

npm install -g ralph-starter@latest
ralph-starter figma

Or without installing:

npx ralph-starter figma

The tool is free and open source. Costs come from the AI agent you choose — typically $0.10 to $1.00 per page depending on complexity and model.

Beyond Figma

ralph-starter is not just for Figma. The same autonomous loop works with specs from GitHub issues, Linear tickets, Notion pages, URLs, PDFs, and plain text files. Point it at a spec, pick an agent, and let it build.

ralph-starter run --from github --issue 42 --commit
ralph-starter run --from linear --label ready --commit
ralph-starter run --from notion --project "API Spec" --commit

ralph-starter vs doing it manually

2026-02-14T00:00:00.000Z

I tracked one full week of development. Half the tasks with ralph-starter, half by hand. Same sprint, same project, same me, same coffee intake (a lot).

Look, I have been telling people ralph-starter saves time for weeks, but I realized I had never actually measured it. I was just vibing on the feeling that things were faster. That is not great. So I ran an honest experiment on myself.

12 features from the sprint backlog. All had clear specs in Linear. Endpoints, bug fixes, component updates, tests. Nothing exotic -- the kind of stuff that fills up every sprint everywhere.

I split them down the middle. 6 done manually (IDE open, ChatGPT tab, write code, run tests, fix, commit, repeat). 6 with ralph-starter.

Manual: the way I have been doing it for years

Read the ticket. Think about approach. Open files. Start coding. Hit a snag, open ChatGPT, paste context, get a suggestion back, realize it is not quite right, spend 10 minutes adapting it. Run tests. Something fails. Fix it. Run lint. Fix that too. Commit. Push. Open PR. You know this loop. You have lived this loop.

Average: 45 minutes per task for a total of about 4.5 hours of focused work.

And honestly, "focused" is generous. A chunk of that time was me being a human clipboard between ChatGPT and my editor. The AI was helpful but I was the integration layer. I was the glue code.

ralph-starter: the other way

Read ticket. Label it "ralph-ready". Run one command. Go do something else. Review the PR when it shows up.

$ ralph-starter run --from linear --project ENG --issue ENG-71 --commit --pr --loops 5 --test --lint

🔄 Loop 1/5
  → Fetching spec from Linear: ENG-71
  → "Add email validation to signup form"
  → Writing code with Claude Code...
  → Running tests... 4 passed, 1 failed

🔄 Loop 2/5
  → Fixing: regex pattern for edge case emails...
  → Running tests... 5 passed ✓
  → Running lint... clean ✓
  → Committing changes...
  → Opening PR #112...

✅ Done in 1m 38s | Cost: $0.23 | Tokens: 17,640

Average: 12 minutes per task -- and most of that was waiting while I worked on other stuff. The actual hands-on time for all 6 was about 1.2 hours. Reading the PR, checking the diff, approving or leaving a comment. That is it.

Quality

Both approaches produced working code. Tests passed, lint passed. The ralph-starter PRs were more verbose in places -- the AI writes way more error handling and comments than I would. I would have skipped half those try/catch blocks. Probably better practice, but yeah, I am lazier than the AI.

Nothing needed major rework either way. The AI code was not worse. It was just... more cautious than I am.

Where ralph-starter won

Consistency. Every single PR had tests, passed lint, passed build. When I code manually I sometimes skip writing tests for small changes -- "it is just a tiny fix, I will add tests later." (I never add tests later.) The validation loop does not let the AI get away with that. I talked about this in the Ralph Wiggum technique post -- the loop enforces discipline that I personally do not have.

Throughput. 6 features in the time I would normally do 2, maybe 3 -- closer to 3.5x once you factor in all the context switching overhead from doing things manually.

Cost. The 6 ralph-starter tasks cost $1.87 total in API spend -- six tasks for less than two dollars. I track costs carefully now and this is pretty typical.

Where I won

Complex design decisions. One task needed choosing between two data modeling approaches. The AI would have just picked one and run with it. I needed to think through tradeoffs, talk to the team, consider what happens next quarter. Ralph Wiggum vibes do not work here -- "I choo-choo-choose you!" is not a valid data modeling strategy.

Knowing stuff the AI does not. On a refactoring task, I knew a certain pattern was going to be deprecated next sprint. The AI had no idea. It would have happily written more of the old pattern and felt great about it. Sometimes you need the person who was in last week's meeting.

What I took away from this

ralph-starter is better than me at "turn this spec into code." Faster, more consistent, less lazy about error handling. I am better than the AI at "figure out what we should build."

So now I focus on three things: writing clear specs (AI input), reviewing PRs (AI output), and architecture decisions (AI blind spot). The mechanical part in between -- translating spec to code -- ralph-starter does that. And it does not get distracted by Slack.

Want to run your own comparison?

npm i -g ralph-starter
ralph-starter init
ralph-starter run --from github --issue YOUR_ISSUE --commit --pr --loops 3 --test --lint

Pick a task you would normally do by hand. Time yourself both ways. I think you will be surprised.

I tried 5 AI coding agents with ralph-starter

2026-02-13T00:00:00.000Z

ralph-starter works with multiple coding agents. I use Claude Code for basically everything, but I wanted to actually test the others on real tasks instead of just assuming. So I ran the same task on Claude Code, Cursor, Codex CLI, and OpenCode over the past few weeks. Some surprises, some not.

Quick note: ralph-starter auto-detects which agents you have installed. Checks in order: Claude Code, Cursor, Codex, OpenCode. Uses the first one it finds. You can also be explicit:

$ ralph-starter run "add pagination to /api/users" --agent claude-code

  Agent: Claude Code v1.0.16 (auto-detected)
  Mode: autonomous (--dangerously-skip-permissions)
  Output: stream-json

Loop 1/3
  Writing code with Claude Code...

Swapping agents is just a flag:

ralph-starter run "your task" --agent claude-code
ralph-starter run "your task" --agent cursor
ralph-starter run "your task" --agent codex
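
The detection order described above (Claude Code, Cursor, Codex, OpenCode, first match wins, explicit flag overrides) can be sketched roughly like this. The executable names in `AGENT_BINARIES` are assumptions for illustration, as is the `detect_agent` function itself; the real detection logic may check other things.

```python
import shutil

# Agent name -> executable to look for on PATH. The executable names are
# assumptions for this sketch; the actual binaries may be named differently.
AGENT_BINARIES = [
    ("claude-code", "claude"),
    ("cursor", "cursor"),
    ("codex", "codex"),
    ("opencode", "opencode"),
]


def detect_agent(explicit=None, which=shutil.which):
    """Return the explicit agent if given, else the first one found on PATH."""
    if explicit is not None:
        return explicit
    for agent, binary in AGENT_BINARIES:
        if which(binary):
            return agent
    return None
```

The `which` parameter is only there so the lookup can be swapped out in tests; by default it uses `shutil.which`.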

For this comparison I ran the same task -- "add JWT auth middleware with tests" -- on each agent. Same project, same spec, same validation pipeline. Tried to make it as fair as possible.

Claude Code

My daily driver, and honestly it's not close. It is the fastest for autonomous loops because of prompt caching (cached input tokens cost 90% less after the first loop), and the stream-json output lets ralph-starter track progress in real time. But the real reason I keep coming back: it just handles multi-file changes without flinching. Create a middleware file, write tests for it, update 3 route files, wire it all together. Done. Other agents sometimes get nervous about touching too many files.

$ ralph-starter run "add JWT auth middleware" --agent claude-code --loops 3 --test

Done in 1m 22s | Cost: $0.28 | Loops: 2/3

npm i -g @anthropic-ai/claude-code

Cursor

Cursor is good if you're already living in the Cursor IDE. It indexes your workspace, so it knows your project structure out of the box. The catch is it's more interactive by nature -- autonomous mode requires some extra config to work smoothly.

$ ralph-starter run "add JWT auth middleware" --agent cursor --loops 3 --test

Done in 2m 48s | Cost: $0.41 | Loops: 3/3

Slower and 46% more expensive than Claude Code on the same task. It needed all 3 loops to finish. Got there eventually, but I wouldn't pick it for batch processing a bunch of issues. If you're already a Cursor user and want to stay in that ecosystem, it works. Otherwise I'd go with Claude Code.

Codex CLI

OpenAI's entry. It supports --auto-approve for autonomous mode, which is nice. The code it produces is clean and conservative -- it's the "measure twice, cut once" agent. Doesn't try to do too much at once. The flip side is it won't tackle big multi-file refactors the way Claude Code will. It kind of plays it safe, which is fine for straightforward features but frustrating when you need it to be bold.

$ ralph-starter run "add JWT auth middleware" --agent codex --loops 3 --test

Done in 2m 15s | Cost: $0.35 | Loops: 2/3

npm i -g @openai/codex

OpenCode

The newest one, and it supports --auto for autonomous mode. I'll be real: it's the least polished of the bunch right now. I've gotten decent results on smaller, focused tasks -- single-file stuff, utility functions, that kind of thing. But the JWT middleware task tripped it up because it needed to coordinate changes across multiple files. It kept getting confused about which file it had already edited. Still early days though, and it's improving fast.

npm i -g opencode-ai

What actually matters

After running all these comparisons, the validation pipeline matters way more than which agent you pick. Tests, lint, build -- those catch mistakes regardless of who wrote the code. A weaker agent that iterates 5 times with test feedback produces better code than a strong agent running once with no tests. The loop is the product, not the agent. That is the whole Ralph Wiggum technique.

My actual config:

agent: claude-code
auto_commit: true
max_iterations: 50
validation:
  test: npm test
  build: npm run build
  lint: npm run lint

If one agent doesn't work for a task, I just switch. ralph-starter is agent-agnostic on purpose -- the loop executor and validation pipeline work the same regardless of which AI is doing the coding.

My honest advice? If you're starting fresh, install Claude Code. It's what I use for 95% of my work, the prompt caching makes it the cheapest option for loops, and it handles multi-file tasks better than anything else I've tried. If you already have something installed, start with that. Don't overthink it.

npx ralph-starter init

It'll detect what you have and set things up. Pick one and start looping.

Ralph Wiggum technique explained in 2 minutes

2026-02-11T00:00:00.000Z

The Ralph Wiggum technique is running AI coding agents in autonomous loops until the task is done. You give it a job, walk away, come back to a PR. That is the whole idea.

The technique was created by Geoffrey Huntley, an open source developer in Australia who started experimenting with autonomous AI coding loops in mid-2025. His original implementation was almost disappointingly simple: a bash while loop that feeds the same prompt to Claude over and over until the task is done. It went viral by the end of 2025 and has since been adopted by Anthropic's Claude Code, Vercel's AI SDK, and others.

People always ask about the name. Yes, it is the Simpsons character. Ralph approaches everything with pure, unfiltered confidence and persistence. "I'm learnding!" He just... keeps going. And that is exactly what you want the AI to do.

Instead of treating the AI as a chat partner you go back and forth with (copy error, paste into chat, get fix, paste back, run tests, repeat), you treat it as a worker that iterates until done. You step out of the loop entirely.

Traditional AI coding (the clipboard dance):

You: "build this feature"
AI: generates code
You: *runs tests* "tests fail, here's the error"
AI: generates fix
You: *runs lint* "linter is angry about unused imports"
AI: another fix
You: *runs tests again* "ok now commit it"

That is 4 round trips. Each one takes you 30 seconds to a minute because you have to context switch, copy output, paste it, wait for a response. It adds up fast.

Ralph Wiggum technique:

You: "build this feature, run tests, fix errors, commit when done"
AI: loops autonomously until everything passes
You: *reviews PR*

One input. One output. Everything in between is handled.

The difference is not just convenience. The AI can iterate fast without waiting on a human to relay errors. It reads the error output directly, understands what went wrong, fixes it, re-runs validation. All in seconds. No clipboard involved.

Here is what it looks like in practice with ralph-starter:

$ ralph-starter run "add user registration with email/password" --loops 5 --test --lint --build --commit

🔄 Loop 1/5
  → Writing code with Claude Code...
  → Created: src/auth/register.ts, src/auth/__tests__/register.test.ts
  → Running tests... 3 passed, 2 failed
  → Test failure: bcrypt not imported

🔄 Loop 2/5
  → Fixing: adding bcrypt import and hash logic...
  → Running tests... 5 passed ✓
  → Running lint... 1 issue (unused variable)

🔄 Loop 3/5
  → Fixing lint: removing unused `salt` variable...
  → Running lint... clean ✓
  → Running build... success ✓
  → Committing changes...

✅ Done in 1m 44s | Cost: $0.34 | Tokens: 26,190

Three loops, under 2 minutes, 34 cents. The agent saw the bcrypt error, fixed it, saw the lint warning, fixed that too -- I did not touch anything.

The loop executor runs the coding agent, checks the result against your test suite, lint, and build. If anything fails, the failure becomes context for the next loop. The agent sees the exact error message and fixes it. Just like Ralph Wiggum -- "I bent my Wookiee" -- it acknowledges the problem and keeps going.
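
Here is a minimal sketch of that feedback loop, assuming the agent and the validators are plain callables. The function name `run_loop`, the prompt wording, and the return shape are all hypothetical; the point is only that a failed check becomes part of the next prompt.

```python
def run_loop(task, agent, validators, max_loops=5):
    """Run `agent` until every validator passes or `max_loops` is reached.

    agent: callable taking a prompt string (stands in for the coding agent).
    validators: list of (name, check) pairs, where check() returns (ok, output).
    """
    feedback = ""
    for loop in range(1, max_loops + 1):
        prompt = task
        if feedback:
            # The previous failure is handed to the agent verbatim.
            prompt = f"{task}\n\nPrevious validation failed:\n{feedback}"
        agent(prompt)

        feedback = ""
        for name, check in validators:
            ok, output = check()
            if not ok:
                feedback = f"[{name}] {output}"
                break  # stop at the first failing stage
        if not feedback:
            return {"done": True, "loops": loop}
    return {"done": False, "loops": max_loops}
```

In practice the validators would shell out to the test, lint, and build commands and return their output; here they are injected so the loop itself stays easy to follow.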

Three things prevent it from going off the rails:

Circuit breaker trips after 3 consecutive identical failures or 5 total occurrences of the same error, so it does not keep burning tokens on something that is stuck. I have seen this save me money when a task genuinely needed a different approach.

Completion detector verifies that files actually changed before accepting "I'm done" from the agent. Without this, the AI occasionally claims it finished without actually writing anything -- learned that one the hard way.

Cost tracker runs in real time so you see what you are spending per iteration. Transparency matters when you are running lots of loops.
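
The first two safeguards are simple enough to sketch. This version reads the thresholds above as "3 identical failures in a row, or 5 occurrences of the same error overall", which is my interpretation. The class and function names (`CircuitBreaker`, `really_done`) are hypothetical and this is not ralph-starter's implementation.

```python
from collections import Counter


class CircuitBreaker:
    """Trips on repeated failures so a stuck task stops burning tokens."""

    def __init__(self, max_consecutive=3, max_same_error=5):
        self.max_consecutive = max_consecutive
        self.max_same_error = max_same_error
        self.last_error = None
        self.streak = 0
        self.counts = Counter()

    def record_failure(self, error):
        """Record one failure; return True if the breaker has tripped."""
        self.streak = self.streak + 1 if error == self.last_error else 1
        self.last_error = error
        self.counts[error] += 1
        return (
            self.streak >= self.max_consecutive
            or self.counts[error] >= self.max_same_error
        )

    def record_success(self):
        """A passing loop resets the consecutive streak but keeps the totals."""
        self.last_error = None
        self.streak = 0


def really_done(agent_says_done, changed_files):
    """Accept 'done' only if the agent claims it AND files actually changed."""
    return bool(agent_says_done) and len(changed_files) > 0
```

For the completion check, `changed_files` could come from something like the output of `git status --porcelain`, though that is an assumption about how you would wire it up.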

What works and what does not

Best tasks for the Ralph Wiggum technique: well-defined inputs and outputs. Add an endpoint, fix a bug that has a failing test, implement a component from a design spec. Things where "done" is clear.

Worst tasks: vague ones like "make the code better" or "improve performance." The AI has no target to iterate toward. I learned this the hard way when I tried to batch process 10 issues and the vague ones hit the circuit breaker every time.

Good specs, good tests, let the ralph loop handle the rest. That is the technique in one sentence.

Want to try the Ralph Wiggum technique on your own project?

npx ralph-starter init
ralph-starter run "your task here" --loops 3 --test --lint --commit

"Hi, Super Nintendo Chalmers!" -- just let Ralph do his thing.

Specs are the new code

2026-02-09T00:00:00.000Z

I spend more time writing specs than writing code now, and my output went up, not down. That genuinely surprised me.

When I started using ralph-starter I was writing lazy one-liners. "Add user auth." "Fix the sidebar." Three words and vibes. You can guess what happened. The AI generated something that looked plausible but completely missed what I actually wanted. I would look at the PR and think, "that is not even close to what I meant."

I blamed the tool for like two weeks, but the problem was me.

So I started writing real specs. Not essays -- I do not have time for that. Just clear, specific descriptions of what I actually want.

What I used to write:

Add authentication to the app

What I write now:

Add JWT auth to the Express API.

- POST /api/auth/login takes { email, password }, validates against users table
- Return { token, expiresIn } on success, 401 with { error } on failure
- Token TTL: 24h
- Auth middleware goes in src/middleware/auth.ts (check Authorization: Bearer header)
- Tests: login success, login failure, protected route without token

That second spec is maybe 10 lines and took me 3 minutes to write. But it tells the agent exactly what to build, where to put it, and how to verify it works. ralph-starter turns that into an implementation plan and the agent nails it in 2 to 3 loops.

The difference is ridiculous:

# Bad spec: vague task, agent guesses wrong
$ ralph-starter run "Add authentication to the app" --loops 5 --test --commit

🔄 Loop 1/5 → tests failed (wrong auth strategy)
🔄 Loop 2/5 → tests failed (missing middleware)
🔄 Loop 3/5 → tests failed (wrong token format)
🔄 Loop 4/5 → tests passed ✓ but not what I wanted
✅ Done in 3m 45s | Cost: $0.94 | Tokens: 71,203

# Good spec: clear requirements, agent nails it
$ ralph-starter run --from github --issue 42 --loops 5 --test --commit

🔄 Loop 1/5 → tests: 2 passed, 1 failed (token expiry)
🔄 Loop 2/5 → tests: 3 passed ✓
✅ Done in 1m 12s | Cost: $0.27 | Tokens: 19,844

Same feature. Same agent. Same model. The only difference was the spec. 67 cents and about two and a half minutes I did not need to spend. Better spec, fewer loops, less cost, better code.

And this is exactly why the GitHub, Linear, Notion integrations matter so much. You are probably already writing specs there. ralph-starter just pulls them directly. No copy-paste, no "let me summarize this ticket for the AI."

ralph-starter run --from github --project myorg/myrepo --issue 42 --commit --pr

After running hundreds of tasks, I am pretty confident about this: the quality of the PR is directly proportional to the quality of the issue. I have started treating it as a law of nature.

It completely changed how I write tickets. Every issue now has a clear description of what needs to happen. Not "improve the thing" but "response time of /api/users should be under 200ms." Acceptance criteria as a checklist. Technical context when it matters, like "we use Prisma" or "follow the pattern in src/api/orders.ts."

My whole workflow flipped. Before AI coding I spent maybe 10% of my time planning and 90% implementing. Now it is 40% writing specs and 60% reviewing output. Total time is less, quality is higher, and here is the bonus I did not expect: the specs double as documentation for what was built and why.

So if you are using ralph-starter and the output is not good enough, the fix is almost always in the spec -- not in the tool, not in the agent, not in the model.

Want to see the difference good specs make?

npx ralph-starter init

Write a detailed issue, point ralph-starter at it, and watch what happens.

Prompt caching saved me $47 last month

2026-02-07T00:00:00.000Z

I run ralph-starter a lot. Multiple tasks per day, 3 to 7 loops each. Last month I checked my Anthropic dashboard and the total was $62. I almost scrolled past it, but then I looked at the prompt caching line and did the math. Without caching? $109. I saved $47 by literally doing nothing.

Ok so let me explain why this blew my mind.

Every time ralph-starter runs Claude Code in a loop, that first iteration ships everything over: system prompt, project files, the spec, the implementation plan. All fresh tokens, expensive. But Claude's prompt caching stores all of that server-side, so on loop 2, 3, 4 those same tokens cost 90% less. You basically get a discount for repeating yourself.

Regular input:  $3.00 per million tokens
Cache write:    $3.75 per million (first time, slight premium)
Cache read:     $0.30 per million (90% off)
Output:         $15.00 per million tokens

Think about a 5-loop task. Something like 80% of the input tokens are identical every single iteration. The system prompt, the spec, your project context -- none of that changes between loops. The only new stuff is validation feedback, like test errors or lint output. So after loop 1, you are paying $0.30/M instead of $3.00/M on most of your input. Ten times cheaper, for free.
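
If you want to run the numbers yourself, here is a small sketch using the prices listed above. It only models input tokens, and it assumes the same token count and the same cached fraction on every loop, which real runs will not match exactly. The function name `input_cost` and its parameters are made up for this example.

```python
# Prices in dollars per million tokens, taken from the table above.
INPUT = 3.00
CACHE_WRITE = 3.75
CACHE_READ = 0.30


def input_cost(tokens_per_loop, loops, cached_fraction, use_cache=True):
    """Estimate input-token cost in dollars for a multi-loop task.

    Simplifying assumptions: every loop sends the same number of input tokens,
    and the same fraction of them is cacheable. Output tokens are not modeled.
    """
    if not use_cache:
        return loops * tokens_per_loop * INPUT / 1_000_000
    cached = tokens_per_loop * cached_fraction
    fresh = tokens_per_loop - cached
    first_loop = cached * CACHE_WRITE + fresh * INPUT
    later_loops = (loops - 1) * (cached * CACHE_READ + fresh * INPUT)
    return (first_loop + later_loops) / 1_000_000
```

For a 5-loop task at 20,000 input tokens per loop with 80% of them cached, this gives $0.30 without caching and roughly $0.14 with it. With a single loop the cached version comes out slightly more expensive under this model, because you pay the write premium and never read the cache back.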

ralph-starter tracks all of this automatically. After every run you see exactly what you spent, no extra setup:

$ ralph-starter run "add pagination to /api/users" --loops 5 --test --commit

Loop 1/5
  Writing code with Claude Code...
  Running tests... 4 passed, 1 failed

Loop 2/5
  Fixing: off-by-one in page calculation...
  Running tests... 5 passed
  Committing changes...

Done in 1m 48s | Cost: $0.31 | Tokens: 28,412

Cost Breakdown:
  Loop 1: $0.22 (cache write)
  Loop 2: $0.09 (cache read, 90% savings on 22K input tokens)
  Cache savings: $0.18

31 cents for pagination, tests, and a commit. I used to spend more than that on a single ChatGPT message and still have to copy-paste the code myself.

What affects your costs

More loops = more cache hits. Simple math. A 1-loop task barely benefits because the context gets cached but never reused. A 5-loop task though? Loops 2 through 5 all ride on that cached context. This is why the ralph loop design matters -- the repeated context is close to free once you have paid for the first pass, and only the new feedback and the output are billed at full price.

Bigger specs also mean more tokens getting cached, so the absolute savings on subsequent loops go up. A detailed spec with acceptance criteria costs more on loop 1, sure, but the savings compound after that.

Oh and the Batch API is a completely different beast. ralph-starter auto --batch uses the Anthropic Batch API at a flat 50% discount on all tokens. The catch is that results can take up to 24 hours and there is no tool use, but for straightforward tasks where you do not need results right now? Worth it. I wrote about that in the batch processing post.

What actually helped me cut costs

My first week came in at $35 and I panicked a little. Looked at the cost tracker and immediately saw the problem: I was setting --loops 10 on everything. A task that completes in 2 loops does not need 10. I was being lazy with the flag and paying for it.

Now I keep it simple: --loops 3 for easy stuff, --loops 5 for complex things. I almost never go higher.

But the biggest cost optimization is not a caching trick -- it is writing better specs. A clear spec means the agent gets it right in fewer loops, which means fewer API calls. A spec that lands in 2 iterations instead of 7 saves you way more than any caching ever could.

Want to see your own cost breakdown?

npx ralph-starter init

Every run shows token usage and cost automatically. No extra config needed.

How I ship tasks from Linear every day with AI

2026-02-05T00:00:00.000Z

Linear is where my team plans work. ralph-starter is where it gets built. I have been running this combo every single day for weeks now, and I want to show you exactly what my workflow looks like.

Every morning I open Linear and check the sprint. Tickets that are well-specified -- clear inputs, clear outputs -- I process with ralph-starter. Tickets that need thinking or architecture decisions? Those I handle myself. The split is usually 60/40 in ralph-starter's favor, which means I spend most of my day on the hard problems.

Setup is quick. One command:

ralph-starter config set linear.apiKey lin_api_xxxxx

For a single ticket:

$ ralph-starter run --from linear --project ENG --issue ENG-42 --commit --pr

🔄 Loop 1/5
  → Fetching spec from Linear: ENG-42
  → "Add retry logic to webhook delivery"
  → Priority: High | Labels: backend, webhooks
  → Writing code with Claude Code...
  → Running tests... 6 passed, 2 failed

🔄 Loop 2/5
  → Fixing: retry delay calculation off by one...
  → Running tests... 8 passed ✓
  → Committing changes...
  → Opening PR #94...

✅ Done in 1m 51s | Cost: $0.22 | Tokens: 16,830

It hits the Linear GraphQL API, pulls the title, description, priority, labels, sub-issues -- everything. All of that becomes context for the coding agent. The agent does not just see "add retry logic." It sees the full ticket with all the context your team wrote.

For batch processing I filter by label. We use "ralph-ready" for tickets that are groomed:

$ ralph-starter auto --source linear --project ENG --label "ralph-ready" --limit 5

  Fetched 5 issues from Linear (sorted by priority):

  Urgent:
    ENG-89: Fix auth token expiry handling

  High:
    ENG-91: Add retry logic to webhook delivery
    ENG-93: Rate limit the public API

  Medium:
    ENG-95: Add dark mode to settings
    ENG-97: Update user avatar component

  Processing...

What Linear gives you that GitHub does not (at least not as cleanly) is structured data. Every ticket has priority (urgent, high, medium, low), status, labels -- and ralph-starter uses priority for task ordering, so urgent tickets get processed first.
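A minimal sketch of that ordering, assuming tickets arrive as plain dictionaries with a priority label. The field names and the rank table are illustrative assumptions, not ralph-starter's internals.

```python
# Hypothetical ticket records; the field names are assumptions for illustration.
PRIORITY_RANK = {"urgent": 0, "high": 1, "medium": 2, "low": 3}

def order_by_priority(tickets):
    """Return tickets sorted urgent-first; unknown priorities go last."""
    return sorted(
        tickets,
        key=lambda t: PRIORITY_RANK.get(t["priority"], len(PRIORITY_RANK)),
    )

tickets = [
    {"id": "ENG-95", "priority": "medium"},
    {"id": "ENG-89", "priority": "urgent"},
    {"id": "ENG-91", "priority": "high"},
]
ordered_ids = [t["id"] for t in order_by_priority(tickets)]
print(ordered_ids)
```

Python's sort is stable, so tickets that share a priority keep the order they were fetched in.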

My typical day looks like this: morning standup, I see 3 or 4 tickets assigned to me. I label the straightforward ones "ralph-ready" and kick off auto mode. While that runs, I work on the complex ticket that actually needs my brain. By the time I finish the hard work, ralph-starter has PRs waiting for my review. I wrote about this batch workflow in more detail in automating entire workflows -- same idea, different source.

One thing that works really well is writing acceptance criteria as a checklist in Linear:

Acceptance:
[ ] Endpoint returns JSON response with { data, meta } shape
[ ] Tests cover happy path and error case
[ ] No lint warnings
[ ] Build succeeds

ralph-starter extracts those checkboxes from the ticket body and uses them as completion criteria. The agent knows it needs to satisfy each point before signaling done. This is why specs matter so much -- the better your Linear tickets, the better the PRs.
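As a rough illustration of what checkbox extraction could look like, here is a sketch that pulls checklist lines out of a ticket body with a regular expression. The pattern and the function name are assumptions made for this example; the actual parser may work differently.

```python
import re

# Matches lines such as "[ ] Build succeeds" or "- [x] Build succeeds".
CHECKBOX = re.compile(r"^\s*(?:[-*]\s*)?\[( |x|X)\]\s+(.+?)\s*$")

def extract_criteria(ticket_body):
    """Return (text, done) pairs for every checkbox line in a ticket body."""
    criteria = []
    for line in ticket_body.splitlines():
        match = CHECKBOX.match(line)
        if match:
            criteria.append((match.group(2), match.group(1).lower() == "x"))
    return criteria

body = """Acceptance:
[ ] Endpoint returns JSON response with { data, meta } shape
[x] Tests cover happy path and error case
- [ ] No lint warnings
"""
criteria = extract_criteria(body)
for text, done in criteria:
    print("done" if done else "todo", "|", text)
```

Each unchecked item can then be handed to the agent as one condition that has to hold before the task counts as finished.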

The tickets that work best are the ones with clear inputs and outputs. "Add this endpoint", "Fix this test", "Update this component to match the new design." The ones that need a human are where the approach is not obvious, where you need to ask "should we even build this?"

Clear instructions work, vague ones don't -- same as with a human developer, really.

Want to connect your Linear workspace?

npm i -g ralph-starter
ralph-starter init
ralph-starter config set linear.apiKey lin_api_your_key_here
ralph-starter run --from linear --project YOUR-PROJECT --issue YOUR-ISSUE --commit --pr

Automating entire workflows with ralph-starter

2026-02-03T00:00:00.000Z

ralph-starter is a CLI tool that orchestrates AI coding agents in autonomous loops. You give it a task (or point it at a GitHub issue, a Linear ticket, a Notion page), it runs the agent, checks if tests pass, if lint is clean, if build works. If something fails it feeds the error back to the agent and loops again. When everything passes it commits, pushes, and opens a PR.

It supports Claude Code, Cursor, Codex CLI, OpenCode, Gemini CLI, Copilot, Amp, and Openclaw. You do not need to pick one in advance; it auto-detects what you have installed.

It is open source, MIT licensed. I built it because I was tired of being the middleman between my terminal and my AI chat window.

Why I built it

I was using AI coding assistants every day. Claude, ChatGPT, Copilot, whatever was available. And the workflow was always the same: I read a ticket, I open the editor, I start coding, I get stuck, I open the AI chat, I paste the context, I get a suggestion, I adapt it, I paste it back. Then I run tests. Something breaks. I go back to the chat, paste the error, get a fix, paste that back. Lint complains. Another round trip. Then I commit, push, open a PR.

That is like 12 steps and I was doing it 5 to 8 times a day. The AI was doing the hard part (writing the code) and I was just the relay moving text between windows. I felt like a human clipboard.

So I wrote a script that does the relay for me. The script takes a spec, sends it to the agent, runs my test suite, and if something fails it sends the error output back to the agent automatically. No copying, no pasting, no switching windows. The agent sees the error and fixes it on its own.

That script grew into ralph-starter.

Where it is most useful

ralph-starter works best when you have:

A clear spec. "Add /health endpoint that returns 200 with JSON body { status: 'ok' }" finishes in 1 loop. "Make the app better" will still run, the agent will analyze your codebase and pick something to improve, but it might take 4 loops and the result might not be what you wanted. The more specific the spec, the fewer loops and the better the output.
Tests. The loop needs something to validate against. If you have no tests the agent does not know when it is done.
Routine implementation work. Endpoints, bug fixes, component updates, adding tests, config changes. The stuff that fills up a sprint backlog.

Vague specs do not break it, they just cost more. "Refactor the auth system" with no details will make the agent try different approaches each loop until the circuit breaker trips. "Add JWT middleware at src/middleware/auth.ts using bcrypt, httpOnly cookies, add tests for login success and failure" finishes in 2 loops because the agent knows exactly what done looks like.

I use it every day for the mechanical parts of development. I still do the thinking, the architecture, the spec writing. ralph-starter handles the translation from spec to code.

Getting started

You can start from an idea and ralph-starter will generate the spec for you. Or you can point it at an existing GitHub issue or Linear ticket and it fetches the spec automatically.

# Install and initialize
npx ralph-starter init

ralph-starter init detects your project type (Node.js, Python, Rust, Go), finds which agents you have installed, and sets up your validation commands (test, lint, build). If it finds a Ralph Playbook in your project it picks up AGENTS.md, IMPLEMENTATION_PLAN.md, and your prompt files automatically.

Run your first task with an inline spec:

ralph-starter run "add a /ping endpoint that returns pong" --commit

Or point it at a GitHub issue or Linear ticket:

# From GitHub
ralph-starter run --from github --project rubenmarcus/ralph-starter --issue 2

# From Linear
ralph-starter run --from linear --project ENG --issue ENG-71 --commit --pr

To connect GitHub, Linear, Notion, or Figma as spec sources, use the config commands:

ralph-starter config set github.token ghp_xxx
ralph-starter config set linear.apiKey lin_api_xxx
ralph-starter config set notion.apiKey ntn_xxx
ralph-starter config set figma.token figd_xxx

ralph-starter setup configures the CLI agent preferences. Integrations are managed through ralph-starter config.

How the loop works

The loop executor follows this sequence:

1. Fetch spec (GitHub issue, Linear ticket, inline text)
2. Create branch (auto/42-health-endpoint)
3. Run agent with the spec as prompt
4. Run validations: test → lint → build
5. If any validation fails → feed error output back to agent → go to step 3
6. If all pass → commit, push, open PR
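The agent, validate, and retry part of this sequence can be sketched in a few lines. This is a simplified illustration, not ralph-starter's code: run_loop, run_validations, and the stub agent are hypothetical names, and the "test" command is a stand-in that fails until a marker file exists.

```python
import subprocess
import sys
import tempfile
from pathlib import Path

def run_validations(commands):
    """Run validators in order; return (name, raw output) of the first failure, else None."""
    for name, argv in commands.items():
        result = subprocess.run(argv, capture_output=True, text=True)
        if result.returncode != 0:
            return name, result.stdout + result.stderr
    return None

def run_loop(spec, agent, commands, max_loops=5):
    """Call the agent until every validation passes or max_loops is used up."""
    feedback = None
    for loop in range(1, max_loops + 1):
        agent(spec, feedback)                # agent edits the working tree
        failure = run_validations(commands)  # test -> lint -> build
        if failure is None:
            return loop                      # a real run would commit and open a PR here
        feedback = failure[1]                # raw error output goes back to the agent
    return None

# Stand-in pieces so the sketch runs on its own.
marker = Path(tempfile.mkdtemp()) / "fixed.txt"
calls = []

def stub_agent(spec, feedback):
    calls.append(feedback)
    if feedback is not None:                 # "fixes" the problem once it sees an error
        marker.touch()

check = [
    sys.executable, "-c",
    "import os, sys; sys.exit(0 if os.path.exists(sys.argv[1]) else 'missing fixed.txt')",
    str(marker),
]
loops_used = run_loop("add /health endpoint", stub_agent, {"test": check})
print(loops_used)
```

The stub fails validation on the first pass and succeeds on the second, which mirrors the common two-loop runs shown in the transcripts.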

The validation step is configurable in ralph-starter.config.yaml:

validation:
  test: pnpm test
  lint: pnpm lint
  build: pnpm build

When a validation fails, ralph-starter takes the stderr/stdout and builds context for the next iteration. The context includes the original spec, the diff of what changed, and the full validation output. The agent sees TypeError: Cannot read property 'id' of undefined at src/routes/user.ts:42 and knows exactly what to fix.

The agent does not get a summary. It gets the raw error. This is faster than me copying the error into a chat window because there is zero delay between failure and the next attempt.
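A small sketch of how such a context could be assembled from the three pieces mentioned above. The section titles and the build_context helper are assumptions for illustration; the point is only that the validation output is passed through unedited.

```python
def build_context(spec, diff, validation_output):
    """Join the spec, the current diff, and the raw validator output into one prompt."""
    sections = [
        ("Original spec", spec),
        ("Changes so far (diff)", diff),
        ("Validation output (raw)", validation_output),
    ]
    return "\n\n".join(f"## {title}\n{body.strip()}" for title, body in sections)

context = build_context(
    spec="Add GET /users/:id",
    diff="+ const id = req.user.id",
    validation_output=(
        "TypeError: Cannot read property 'id' of undefined at src/routes/user.ts:42"
    ),
)
print(context)
```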

Real example: building a landing page from a GitHub issue

Here is a real run. I pointed ralph-starter at a GitHub issue that asked for a landing page for a London pet shop. The spec had 8 tasks (header, hero, services, gallery, testimonials, contact form, footer, polish).

ralph-starter detected 28 installed skills (frontend-design, tailwind, responsive-web-design, etc.), picked the relevant ones for the task, and started the loop with Claude Code.

The loop ran for 2 iterations. The first handled project setup plus 4 of the 8 tasks (Header & Navigation, Hero Section, Services Section, Featured Pets Gallery). The second picked up the remaining 4 (Testimonials, Contact Form, Footer, Polish). It stopped automatically when no file changes were detected for 2 consecutive iterations.

Final result:

Cost Summary:
  Tokens: 47.0K (764 in / 46.2K out)
  Cost: $0.696 ($0.348/iteration avg)

Loop completed!
  Exit reason: completed
  Iterations: 2
  Total duration: 8m 19s
  Total cost: $0.696 (47.0K tokens)

8 minutes. 69 cents. A full landing page with React components, Tailwind styling, and responsive layout. I did not open the editor at all.

What it actually costs

I personally use the Claude Max Plan, so my per-task cost is effectively zero on top of the monthly subscription. But ralph-starter tracks token usage either way, and for users on pay-per-token plans the numbers are worth knowing.

On the Anthropic API, a typical task runs $0.10 to $0.40. The reason it stays low is prompt caching. The first loop sends the full context at $3.00 per million input tokens. Loops 2, 3, 4 reuse the cached portion at $0.30 per million — 90% less. Most tasks finish in 2 to 3 loops, so the bulk of the input is already cached after the first one. I wrote the detailed breakdown with exact numbers here.
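To make the arithmetic concrete, here is a back-of-the-envelope sketch using the two prices quoted above. It makes simplifying assumptions: the same context is sent on every loop, all of it is cacheable, and output tokens and any cache-write surcharge are ignored. The 20,000-token context is a made-up figure.

```python
INPUT_PRICE = 3.00         # USD per million input tokens, from the text
CACHED_INPUT_PRICE = 0.30  # USD per million cached input tokens, from the text

def input_cost(context_tokens, loops, cached=True):
    """Input-token cost in USD for sending the same context on every loop."""
    first = context_tokens * INPUT_PRICE / 1_000_000
    later_price = CACHED_INPUT_PRICE if cached else INPUT_PRICE
    later = (loops - 1) * context_tokens * later_price / 1_000_000
    return first + later

with_cache = input_cost(20_000, loops=3)
without_cache = input_cost(20_000, loops=3, cached=False)
print(f"with cache: ${with_cache:.3f}, without: ${without_cache:.3f}")
```

Under these assumptions a 3-loop task spends $0.072 on input with caching versus $0.180 without, and the gap widens with every extra loop.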

You are not locked into Claude either. ralph-starter supports OpenRouter, which gives you access to cheaper models like Kimi K2, MiniMax, DeepSeek, and others. Some of these cost a fraction of what Claude or GPT-4 charge per token, and they work well for straightforward tasks. You can mix and match — use Claude Code for complex multi-file changes and a cheaper model for simple fixes.

Before each run, ralph-starter shows you an estimate so you know what to expect, and after each run it shows the actual cost breakdown: tokens in, tokens out, cache hits, cost per iteration.

A few things that help keep costs down regardless of which model you use:

Good specs mean fewer loops. Clear acceptance criteria = agent knows when it is done.
Prompt caching saves 90% on input tokens after the first loop (on models that support it).
Circuit breaker stops tasks that are stuck, so you do not burn money on something unsolvable.
Skills teach the agent patterns so it gets things right faster (fewer iterations = less cost).
OpenRouter lets you pick the cheapest model that can handle each task.

Batch mode: 10 issues, 8 PRs

During sprint grooming I label issues as "auto-ready". These are the well-defined tickets with clear specs. Then I run a single command and go get lunch:

# From GitHub
ralph-starter auto --source github --project multivmlabs/ralph-starter --label "auto-ready" --limit 10

# From Linear
ralph-starter auto --source linear --project ENG --label "auto-ready" --limit 10

ralph-starter picks up all matching issues, shows the estimate for each, and runs the Ralph Wiggum loop on them one by one.

It works with both GitHub Issues and Linear tickets. Each issue gets its own branch, its own loop, its own PR:

[1/10] Issue #145: Add health check endpoint
  > Branch: auto/145
  > 2 loops > Validation: passed
  > PR #151 created

[2/10] Issue #147: Add rate limit headers
  > Branch: auto/147
  > 1 loop > Validation: passed
  > PR #152 created

[3/10] Issue #150: Improve performance
  > 3 loops > Circuit breaker tripped. Skipping.

...

Completed: 8/10 | Failed: 2/10
Total cost: $1.84

8 out of 10. The 2 that failed were vague tickets. One was "Improve performance" with no metrics or targets. The agent tried different optimizations each loop but had nothing to validate against. The circuit breaker tripped after 3 loops.

The other was a refactoring ticket that referenced a discussion from a team meeting. The agent did not have that context.

The circuit breaker trips after 3 consecutive identical failures or 5 of the same error type. It prevents burning tokens on something the agent cannot solve.
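The two thresholds can be expressed as a small state machine. This is a sketch under assumptions: the class name, the way errors are grouped into types, and the reset on success are all illustrative choices rather than the tool's actual implementation.

```python
from collections import Counter

class CircuitBreaker:
    """Trips after 3 consecutive identical failures or 5 failures of one error type."""

    def __init__(self, max_identical=3, max_same_type=5):
        self.max_identical = max_identical
        self.max_same_type = max_same_type
        self.last_message = None
        self.identical_streak = 0
        self.type_counts = Counter()

    def record_failure(self, error_type, message):
        """Record one failed loop; return True when the breaker should trip."""
        if message == self.last_message:
            self.identical_streak += 1
        else:
            self.last_message = message
            self.identical_streak = 1
        self.type_counts[error_type] += 1
        return (
            self.identical_streak >= self.max_identical
            or self.type_counts[error_type] >= self.max_same_type
        )

    def record_success(self):
        """A passing loop resets the consecutive-failure streak."""
        self.last_message = None
        self.identical_streak = 0

breaker = CircuitBreaker()
identical = [breaker.record_failure("TypeError", "x is undefined") for _ in range(3)]

breaker = CircuitBreaker()
same_type = [breaker.record_failure("TypeError", f"variant {i}") for i in range(5)]
print(identical, same_type)
```

The first run trips on the third identical failure; the second trips on the fifth failure of the same type even though every message differs.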

Picking an agent

You can be explicit about which agent to use:

ralph-starter run "your task" --agent claude-code
ralph-starter run "your task" --agent codex
ralph-starter run "your task" --agent cursor

Or let ralph-starter auto-detect. It checks what you have installed and uses the first one it finds.
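One plausible way to implement "first one found" detection is to walk a preference list and check the PATH. The executable names below are assumptions for illustration and may not match what each agent actually installs.

```python
import shutil

# Hypothetical executable names in preference order; real binary names may differ.
CANDIDATE_AGENTS = [
    ("claude-code", "claude"),
    ("codex", "codex"),
    ("cursor", "cursor-agent"),
]

def detect_agent(candidates=CANDIDATE_AGENTS, which=shutil.which):
    """Return the name of the first agent whose executable is on PATH, or None."""
    for name, executable in candidates:
        if which(executable) is not None:
            return name
    return None

# Simulate a machine where only the codex binary is installed.
fake_path = {"codex": "/usr/local/bin/codex"}
detected = detect_agent(which=fake_path.get)
print(detected)
```

Passing the lookup function in as a parameter keeps the sketch testable without touching the real PATH.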

I use Claude Code daily because prompt caching makes the loops cheaper and stream-json output lets ralph-starter track progress in real time. But the loop executor and validation pipeline are the same for all agents. I ran the same JWT auth task on 4 different agents and they all got there, just with different loop counts and costs.

Why I keep building it

I did a side-by-side comparison of 12 tasks from the same sprint. 6 manual, 6 with ralph-starter. Same project, same type of work. The ralph-starter tasks averaged 12 minutes of my attention vs 45 minutes coding manually. Code quality was comparable.

Now I spend my time on three things: writing clear specs (the input), reviewing PRs (the output), and architecture decisions (the part the AI cannot do). Everything in between, the mechanical translation of spec to code, ralph-starter handles that.

Every PR it produces passes tests, lint, and build. When I code manually I sometimes skip tests for small changes -- the validation loop does not let the agent skip anything, and honestly that discipline is better than what I do on my own.

About the name

The name comes from the Ralph Wiggum technique. You give the AI a task and let it keep going until done. No micro-managing. Full explanation here.

Links

ralph-starter is open source, MIT licensed.

If you try it, open an issue or drop a star. All feedback is welcome.

Building a full app from a Figma file in one command

2026-02-01T00:00:00.000Z

A designer handed me a Figma file on Friday afternoon with 12 screens for a dashboard. My immediate thought was "cool, that is next week gone." Instead I pointed ralph-starter at it and had working React components before I left for the weekend.

You know the Figma-to-code dance. Squint at the inspector panel. Copy hex colors one at a time. Guess at spacing values. Eyeball the layout. Realize the padding was 24 not 20. Go back. Fix it. Repeat for every single frame. I have lost days of my life to this.

So I tried something. I just pointed ralph-starter at the Figma file to see what would happen. Honestly expected it to fall apart. It did not.

The integration has 5 modes. The one I reach for most is components -- it reads your Figma file and generates actual component code.

$ ralph-starter run --from figma \
  --project "https://figma.com/file/ABC123/Dashboard" \
  --figma-mode components \
  --figma-framework react \
  --max-iterations 5 --test --commit

🔄 Loop 1/5
  → Fetching from Figma API... 12 frames, 34 components found
  → Generating implementation plan...
  → Writing code with Claude Code...
  → Created: 14 files in src/components/
  → Running tests... 8 passed, 3 failed

🔄 Loop 2/5
  → Fixing import paths and missing props...
  → Running tests... 10 passed, 1 failed

🔄 Loop 3/5
  → Fixing: Sidebar component missing responsive breakpoint...
  → Running tests... 11 passed ✓
  → Running lint... clean ✓
  → Committing changes...

✅ Done in 4m 18s | Cost: $0.87 | Tokens: 67,412

What happens under the hood: ralph-starter hits the Figma API, pulls every component and frame, converts the design data into specs, and the coding agent implements each one. 87 cents for a 12-screen dashboard scaffold. I checked the number twice.

It is not pixel-perfect -- the agent works from structural data, not screenshots. But the component breakdown and layout are right, so you end up tweaking CSS values instead of writing everything from scratch. I spent maybe 2 hours polishing what would have taken me 2 full days.

The 5 modes (yes, there are 5)

Spec (default) converts frames to markdown specs. I use this when I want the AI to understand the design intent before touching any code.

Tokens extracts your design system -- colors, typography, spacing. Exports to CSS, SCSS, JSON, or Tailwind. This one was a pleasant surprise:

$ ralph-starter integrations fetch figma "ABC123" --figma-mode tokens --figma-format tailwind

  Extracted design tokens:
    → 24 colors (primary, secondary, neutral, semantic)
    → 6 font sizes + 4 font weights
    → 8 spacing values
    → 4 border radius values
  Written to: tailwind.config.tokens.js

Components generates actual code. React, Vue, Svelte, Astro, Next.js, Nuxt, HTML -- pick your poison:

ralph-starter integrations fetch figma "ABC123" --figma-mode components --figma-framework react

Assets exports icons and images with download scripts.

Content extracts text content for static sites.

What I actually did with the dashboard

I ran tokens first to get the Tailwind config right, then components for the React code. Two commands, maybe 10 minutes total, and I had a working foundation to start refining.

Setup is dead simple, just a personal access token from Figma:

ralph-starter config set figma.token figd_xxxxx

Figma Plan Matters

The free/starter plan limits you to 6 API requests per month -- that is barely one fetch. For real development, you need a Professional plan with a Dev seat ($12/month), which gives you 10+ requests per minute. Responses are cached locally so repeated runs are free.

The reason this works at all is the loop. The agent does not just generate code and stop. It generates, runs tests, sees what broke, fixes it, runs again. By loop 3 or 4 you have components that actually render and pass lint. Same Ralph Wiggum technique I use for everything else -- just pointed at a design file instead of a GitHub issue. I did not even plan it this way. It just... worked.

Want to try it with your own Figma file?

npx ralph-starter init
ralph-starter config set figma.token figd_your_token_here
ralph-starter run --from figma --project "your-figma-url" --figma-mode components --figma-framework react --max-iterations 5

You can also pick your model with --model. Sonnet is fast and cheap for UI work, Opus is better for complex state logic:

# Fast + cheap (recommended for most Figma workflows)
ralph-starter run --from figma --project "your-figma-url" --figma-mode components --model claude-sonnet-4-5-20250929 --max-iterations 5

# Maximum quality
ralph-starter run --from figma --project "your-figma-url" --figma-mode components --model claude-opus-4-6 --max-iterations 3

ralph-starter + Claude Code: the full setup

2026-01-30T00:00:00.000Z

I wanted to write the post I wish existed when I started: how to go from zero to your first automated PR with ralph-starter and Claude Code. No fluff, just the steps.

Claude Code is the best agent I use with ralph-starter. Prompt caching makes loops cheap, stream-json output lets ralph-starter track progress in real time, and it handles multi-file changes without breaking a sweat.

Install

npm i -g @anthropic-ai/claude-code
npm i -g ralph-starter

You need ANTHROPIC_API_KEY in your environment. Quick sanity check:

$ claude --version
claude-code 1.0.16

$ ralph-starter --version
ralph-starter 0.6.2

If both work, you are ready.

Init

$ cd your-project
$ ralph-starter init

  Detected: Node.js project (package.json found)
  Agent: Claude Code (claude-code v1.0.16)

  Created:
    ✓ AGENTS.md — validation commands
    ✓ PROMPT_build.md — agent build instructions
    ✓ PROMPT_plan.md — planning phase prompt
    ✓ .ralph/config.yaml — project config

  Run your first task:
    ralph-starter run "your task" --loops 3 --test

This detects your project type (Node, Python, Rust, Go) and reads package.json to find test/build/lint commands. It also creates a few files:

AGENTS.md with validation commands
PROMPT_build.md and PROMPT_plan.md for agent behavior
.ralph/config.yaml
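Detection along these lines can be approximated by looking for marker files. In this sketch only package.json comes from the text; the other marker files and the detect_project helper are assumptions for illustration.

```python
import json
import tempfile
from pathlib import Path

# Marker files are assumptions, except package.json which the text mentions.
MARKERS = [
    ("Node.js", "package.json"),
    ("Python", "pyproject.toml"),
    ("Rust", "Cargo.toml"),
    ("Go", "go.mod"),
]

def detect_project(root):
    """Return (project_type, validation_commands) for the directory at root."""
    root = Path(root)
    for project_type, marker in MARKERS:
        if (root / marker).exists():
            break
    else:
        return None, {}
    commands = {}
    if project_type == "Node.js":
        scripts = json.loads((root / "package.json").read_text()).get("scripts", {})
        for name in ("test", "build", "lint"):
            if name in scripts:
                commands[name] = "npm test" if name == "test" else f"npm run {name}"
    return project_type, commands

# Try it on a throwaway directory with a minimal package.json.
demo = Path(tempfile.mkdtemp())
(demo / "package.json").write_text(
    json.dumps({"scripts": {"test": "jest", "lint": "eslint ."}})
)
project_type, commands = detect_project(demo)
print(project_type, commands)
```

Only scripts that exist in package.json become validation commands, so a project without a build script simply gets no build step.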

The config is straightforward:

agent: claude-code
auto_commit: true
max_iterations: 50
validation:
  test: npm test
  build: npm run build
  lint: npm run lint

First task

Pick something small for your first run:

$ ralph-starter run "add a health check endpoint at /api/health" --loops 3 --test --commit

🔄 Loop 1/3
  → Writing code with Claude Code...
  → Created: src/api/health.ts, src/api/__tests__/health.test.ts
  → Running tests... 5 passed ✓
  → Committing changes...

✅ Done in 47s | Cost: $0.11 | Tokens: 8,924

47 seconds. 11 cents. A working health endpoint with tests.

Under the hood, ralph-starter launches Claude Code with --dangerously-skip-permissions for autonomous mode and --output-format stream-json so it can track progress in real time. You do not need to know this, but I think it is cool.

After loop 1 your context gets cached. Loops 2, 3, 4 reuse that cache at 90% less cost for the cached input. On a 5-loop task you pay the full input price only on the first iteration. I wrote more about this in prompt caching saved me $47.

Auto PRs from GitHub

You can also go straight from a GitHub issue to a PR:

$ gh auth login
$ ralph-starter run --from github --project myorg/myrepo --issue 42 --commit --pr

🔄 Loop 1/5
  → Fetching spec from GitHub issue #42...
  → "Add rate limiting to /api/users endpoint"
  → Writing code with Claude Code...
  → Running tests... 7 passed, 1 failed

🔄 Loop 2/5
  → Fixing: rate limit header format...
  → Running tests... 8 passed ✓
  → Committing changes...
  → Opening PR #87...

✅ Done in 2m 12s | Cost: $0.19 | Tokens: 18,340

Creates branch, runs loops, commits, pushes, opens PR. For multiple issues at once:

ralph-starter auto --source github --project myorg/myrepo --label "auto-ready" --limit 5

I label issues "auto-ready" when they have clear specs and run this once or twice a week.

One thing that made a big difference

Add specific context in .claude/CLAUDE.md. Things like "we use Tailwind", "tests in __tests__/", "follow pattern in src/api/". The more specific you are, the better the output gets. I have seen first-loop success rate go from maybe 40% to 70% just by adding a few lines of project context.

Ready to try it?

npx ralph-starter init

My first ralph loop: what actually happens

2026-01-28T00:00:00.000Z

I keep saying "I type one command and get a PR" and people want to know what actually happens in between. So let me walk you through a real one.

Say you want to add a dark mode toggle to your settings page. Nothing crazy, but enough to touch a few files. You run:

ralph-starter run "add dark mode toggle to settings page" --loops 5 --test --lint --commit

First thing ralph-starter does is detect your coding agent. It prefers Claude Code but works with Cursor, Codex, OpenCode too. Then it reads your AGENTS.md to find your test/lint/build commands. No guessing -- it knows how your project validates code.

Loop 1 starts. The agent gets the task with full project context, reads your files, creates the components. First pass usually gets the structure right but something breaks -- which is fine, that is the whole point of loops.

Here is what the real terminal output looked like:

$ ralph-starter run "add dark mode toggle to settings page" --loops 5 --test --lint --commit

🔄 Loop 1/5
  → Planning implementation...
  → Writing code with Claude Code...
  → Running tests... 2 passed, 1 failed
  → Test failure: ThemeContext is not exported from './contexts'

🔄 Loop 2/5
  → Fixing: adding ThemeContext export...
  → Running tests... 3 passed ✓
  → Running lint... 2 issues found

🔄 Loop 3/5
  → Fixing lint issues (unused import, missing type)...
  → Running tests... 3 passed ✓
  → Running lint... clean ✓
  → Committing changes...

✅ Done in 1m 32s | Cost: $0.29 | Tokens: 45,218

Test failure goes back as context for loop 2. Agent sees the exact error -- ThemeContext is not exported -- and fixes it. Loop 2 passes tests but lint complains about an unused import. Loop 3 cleans that up.

Three loops, about 90 seconds total. The other 2 loops never ran because the task completed early -- ralph-starter stops as soon as everything passes so you are not wasting tokens.

You also get a cost summary at the end:

Cost Summary:
  Tokens: 45K (32K in / 13K out)
  Cost: $0.29 (3 iterations)
  Cache savings: $0.12

29 cents for a feature with tests that pass and clean lint. I used to spend 20 minutes doing this exact thing by hand.

There is a circuit breaker too -- if the agent fails the same way 3 consecutive times, it stops instead of burning tokens on something that is stuck. As Ralph Wiggum would say, "I bent my Wookiee" -- sometimes you just have to stop and try a different approach.

Want to try it yourself?

npm i -g ralph-starter
ralph-starter init
ralph-starter run "your first task" --loops 3 --test

Three commands and you are in the loop. If you want to understand why I built this in the first place, I wrote about that too.

Why I built ralph-starter

2026-01-25T00:00:00.000Z

You know when you have a GitHub issue with the full spec, and then you open Claude or ChatGPT, copy the issue, paste it in, get code back, paste that into your editor, run tests, something breaks, go back to chat, paste the error? I was doing this 20 times a day. Twenty. I counted.

The AI was doing the hard part. I was just the middleman moving text around. A glorified clipboard manager, basically.

So one night I wrote a bash script. Nothing fancy -- it pulled the issue body with gh, piped it into Claude Code, ran the tests, and if they failed it sent the error back and let Claude try again. I ran it, went to make coffee, came back. There was a working PR sitting there. I had not touched my keyboard once.

That was the moment. I literally said out loud: "why was I doing this by hand?"

That script became ralph-starter.

Here is what it looks like now:

$ ralph-starter run --from github --project myorg/myrepo --issue 42 --commit --pr

🔄 Loop 1/5
  → Fetching spec from GitHub issue #42...
  → Generating implementation plan...
  → Writing code with Claude Code...
  → Running tests... 3 passed, 1 failed

🔄 Loop 2/5
  → Fixing test failures...
  → Running tests... 4 passed ✓
  → Committing changes...
  → Opening PR #87...

✅ Done in 2m 34s | Cost: $0.08 | Tokens: 12,847

Your specs already live somewhere -- GitHub, Linear, Notion, Figma. There is no reason to copy them manually into a chat window.

One command. Fetches the spec, makes a branch, runs the AI in loops with your tests as the guardrails, commits, opens a PR. You review it like any other PR from your team. No ceremony.

I use this every day now. Linear tickets in the morning, ralph-starter processes them while I work on the hard stuff, and I review PRs after lunch. It does not replace thinking about architecture -- but it handles the mechanical part of turning specs into code. The part that was eating my day.

The name comes from the Ralph Wiggum technique. As Ralph would say: "I'm learnding!" -- and honestly, that is exactly what the loop does. You give the AI a task and let it keep going until done. No prompting back and forth. Just autonomous iteration.

ralph-starter is open source because AI coding tooling is evolving so fast that the community can push it further than I could alone. And honestly, I want to see what people build with it.

If you want to try it:

npx ralph-starter init

Let your tests guide the AI

2026-01-20T00:00:00.000Z

The first time I let an AI agent write code without running tests, it produced something that looked perfect. Clean code, nice comments, the works. Blew up at runtime. The second time, I added --test and the agent caught its own mistake and fixed it in the next loop. That's when I realized: the tests aren't just for me anymore. They're for the agent.

How this actually works

So when you pass --test to ralph-starter, every loop ends with your test suite running. If something fails, the error output goes straight back into the agent's context. It reads the failure, figures out what it got wrong, and tries again. Basically your tests become the agent's to-do list.

ralph-starter run "add Stripe webhook handler" \
  --test \
  --lint \
  --loops 5

Here's what that actually looked like on a Stripe webhook handler I built last week:

Loop 1: Implementing webhook handler...
  → Running tests...
  FAIL src/webhooks/stripe.test.ts
    x should verify webhook signature (8ms)
      Error: No signatures found matching the expected signature for payload

  → 1 test failed. Feeding errors back to agent.

Loop 2: Fixing signature verification...
  → Added raw body parsing middleware for webhook route
  → Running tests...
  PASS src/webhooks/stripe.test.ts
    ✓ should verify webhook signature (12ms)
    ✓ should handle checkout.session.completed (5ms)
    ✓ should return 400 for unknown events (3ms)
  → Running linter... 1 issue
    src/webhooks/stripe.ts:14:7 - 'event' is defined but never used

Loop 3: Removing unused variable...
  → Running tests... passed
  → Running linter... passed

Done in 3 loops.

The thing that blew my mind: the agent saw No signatures found matching the expected signature for payload and just... knew it needed raw body parsing. I didn't tell it that. The test output was specific enough. That's the Stripe webhook gotcha that trips up every developer the first time, and the agent figured it out from the error message alone. Took me like 20 minutes of Googling when I first hit it myself.

Your tests are basically the spec

This flips the whole workflow on its head. Instead of describing what you want in a prompt and hoping the output is correct, you write tests that define correct behavior and let the agent figure out the implementation.

If you already do TDD, this is basically what you've been training for. Write your tests first, then:

ralph-starter run "make the failing tests pass" --test --loops 5

Each failing test becomes a requirement, and when they all pass, the task is done.

Setting it up

In your config file, just tell ralph-starter what to run:

# ralph.config.yaml
validation:
  test: "pnpm test"
  lint: "pnpm lint"
  build: "pnpm build"

Or pass them as flags:

ralph-starter run "fix the auth bug" \
  --test "pytest -x" \
  --lint "ruff check ."

That -x flag on pytest is a pro tip, by the way. It stops at the first failure, so the agent gets one focused error instead of a wall of 47 failures. Way more useful.

What I've learned about good validation setups

Fast tests matter a lot. The agent runs your suite on every loop. If your tests take 10 minutes, a 5-loop run takes close to an hour. If they take 10 seconds, you're done in a few minutes. I learned this the hard way on a project with a 7-minute test suite. Now I usually point the agent at a subset:

ralph-starter run "fix payment processing" \
  --test "pnpm test -- --testPathPattern=payment"
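The arithmetic behind the fast-tests point, as a rough sketch. The one minute of agent work per loop is an assumed figure for illustration, not something ralph-starter reports.

```typescript
// Rough wall-clock estimate: every loop pays for the agent's work plus one full test run.
// agentMinutesPerLoop is an assumption, not a measured value.
function estimateRunMinutes(loops: number, testMinutes: number, agentMinutesPerLoop: number): number {
  return loops * (testMinutes + agentMinutesPerLoop);
}

const slowSuite = estimateRunMinutes(5, 10, 1);      // 5 * (10 + 1) = 55 minutes
const fastSuite = estimateRunMinutes(5, 10 / 60, 1); // 5 * (1/6 + 1), a bit under 6 minutes
console.log(slowSuite, fastSuite.toFixed(1));
```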

Specific error messages make a huge difference. Compare these two test failures:

# Bad: agent has to guess what went wrong
AssertionError: expected false to be true

# Good: agent knows exactly what to fix
Expected status code 201 for POST /api/users
Received status code 400 with body: {"error": "email is required"}

The first one forces the agent to guess. The second one tells it exactly what is missing. More information in your test output means fewer loops and less money.
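One way to get the second kind of failure is a small assertion helper that names the request and includes the response body. This is a sketch: `expectStatus` and the `ApiResponse` shape are hypothetical and not tied to any test framework or HTTP library.

```typescript
// Minimal response shape for the sketch; not tied to any HTTP library.
interface ApiResponse {
  status: number;
  body: unknown;
}

// Throws an error that names the request, the expectation, and what actually came back.
function expectStatus(res: ApiResponse, expected: number, request: string): void {
  if (res.status !== expected) {
    throw new Error(
      `Expected status code ${expected} for ${request}\n` +
        `Received status code ${res.status} with body: ${JSON.stringify(res.body)}`
    );
  }
}

// Hypothetical failing response, mirroring the "good" output above.
let message = "";
try {
  expectStatus({ status: 400, body: { error: "email is required" } }, 201, "POST /api/users");
} catch (err) {
  message = (err as Error).message;
}
console.log(message);
```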

Type checking is worth adding too. It catches a totally different class of bugs. I add it as another validator:

validation:
  test: "pnpm test"
  lint: "pnpm lint"
  typecheck: "pnpm tsc --noEmit"

Every validator runs after every loop, and the agent does not move on until all of them pass.
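As a mental model, the gate can be sketched like this. The validators are stubbed functions here. In a real setup each one would shell out to the configured command. Whether ralph-starter runs every validator or stops at the first failure is an assumption in this sketch.

```typescript
// One validator: a name plus a function that reports pass/fail and its output.
// run() is injected so the sketch works without any real tooling.
interface Validator {
  name: string;
  run: () => { ok: boolean; output: string };
}

// Run every validator and collect the output of the ones that failed.
function validate(validators: Validator[]): { passed: boolean; failures: string[] } {
  const failures: string[] = [];
  for (const v of validators) {
    const result = v.run();
    if (!result.ok) failures.push(`${v.name}: ${result.output}`);
  }
  return { passed: failures.length === 0, failures };
}

// Stubbed validators mirroring the config keys above.
const gate = validate([
  { name: "test", run: () => ({ ok: true, output: "" }) },
  { name: "lint", run: () => ({ ok: false, output: "'event' is defined but never used" }) },
  { name: "typecheck", run: () => ({ ok: true, output: "" }) },
]);
console.log(gate);
```

With one validator failing, `passed` is false and the lint output is what the agent would see on the next loop.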

When to skip auto-commit

I'll be honest, I don't always trust the agent enough to commit automatically. When I'm trying a new type of task, I run without --commit first so I can look at the diff:

ralph-starter run "add rate limiting" --test --lint --loops 5

# Review what the agent did
git diff

# Commit if it looks good
git add -A && git commit -m "feat: add rate limiting"

Once I trust the pattern for a given type of task, I add --commit and let it ship. That trust builds over time. Took me maybe a week before I stopped reviewing every diff.

Details on all the validation options are in the validation config docs.

From spec to code in one command

2026-01-18T00:00:00.000Z

Every feature you build starts with a spec that already exists somewhere. GitHub issue, Linear ticket, Notion doc. It's already written. The annoying part is getting it into your AI tool without losing half of it.

The problem I kept running into

I'd open a GitHub issue, read through the description and comments, mentally summarize it, then type a prompt for Claude that captured... maybe 60% of what was actually in the issue. The linked design doc? Forgot to include it. The acceptance criteria someone added in comment #3? Missed that too.

The spec was right there. I was just a really bad copy-paster. And you know what's dumb? I did this for months before it occurred to me that a script could just fetch the issue directly.

So I made it one command

ralph-starter just pulls the spec directly and feeds it to the agent:

ralph-starter run --from github --project myorg/api --issue 42 --loops 5 --test --commit

What happens here: it authenticates with GitHub using your existing gh CLI session, grabs issue #42 -- body, all comments, labels, linked references, everything -- and hands it all to the coding agent. Then the agent implements the feature, runs your tests after each loop, and commits when everything passes.
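Roughly, "hands it all to the coding agent" means flattening the issue into one document. Here's a sketch of that step. The `Issue` shape is loosely modeled on what `gh issue view --json number,title,body,labels,comments` returns and should be treated as an assumption. `issueToSpec` is a hypothetical helper, not ralph-starter's actual code.

```typescript
// Assumed issue shape; field names follow the GitHub CLI's JSON output as I understand it.
interface Issue {
  number: number;
  title: string;
  body: string;
  labels: { name: string }[];
  comments: { author: { login: string }; body: string }[];
}

// Flatten the issue into one markdown document so nothing is lost to summarizing.
function issueToSpec(issue: Issue): string {
  const labels = issue.labels.map((l) => l.name).join(", ") || "none";
  const comments = issue.comments
    .map((c, i) => `### Comment ${i + 1} (${c.author.login})\n\n${c.body}`)
    .join("\n\n");
  return [
    `# #${issue.number}: ${issue.title}`,
    `Labels: ${labels}`,
    issue.body,
    comments ? `## Comments\n\n${comments}` : "",
  ]
    .filter((part) => part !== "")
    .join("\n\n");
}

// Hypothetical issue content.
const spec = issueToSpec({
  number: 42,
  title: "Add password reset flow",
  body: "Users need a way to reset a forgotten password.",
  labels: [{ name: "ready" }],
  comments: [
    { author: { login: "alice" }, body: "Design doc is linked in the description." },
    { author: { login: "bob" }, body: "Acceptance criteria: reset link expires after 1 hour." },
  ],
});
console.log(spec);
```

The point is that the acceptance criteria buried in a later comment end up in the spec without anyone remembering to paste them.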

No tab-switching, no summarizing, no "let me paste the relevant parts." The agent gets the raw spec, the whole thing.

GitHub issues and PRs

This is the one I use the most, by far. Just point it at an issue:

ralph-starter run --from github --project owner/repo --issue 123

It pulls the title, body, comments, file references. If the issue links to other issues, those come along too. Basically the agent sees everything your team wrote -- which is usually way more context than what I'd remember to paste.

PRs work the same way, which is great for when you get review feedback and don't want to fix 12 nits by hand:

ralph-starter run --from github --project owner/repo --issue 456 --loops 3 --test

Linear tickets

If your team uses Linear, same deal:

ralph-starter run --from linear --project PROJ --issue PROJ-123 --commit

Grabs the ticket description, sub-issues, attachments, priority. One thing I've noticed: Linear tickets tend to be really well-structured compared to GitHub issues, so the agent gets cleaner input and the results are usually better on the first try. Not always, but noticeably.

Notion pages

For teams that write everything in Notion (I've been there):

ralph-starter run --from notion --project "page-id" --loops 5 --test

The page content gets converted to markdown, and child pages and linked databases come along for the ride. This is especially nice for those longer specs -- you know, the ones with 3 sections and a table and a "Notes from the last meeting" block. Try pasting all of that into a chat window. Actually, don't.

Local files and URLs

Sometimes the spec is just a markdown file in your repo:

ralph-starter run --from ./specs/auth-feature.md --test --commit

Or a URL:

ralph-starter run --from "https://example.com/spec.md"

Combining sources (this is the good part)

OK so the thing I actually use the most in practice is combining a GitHub issue with extra local context:

ralph-starter run \
  --from github --project owner/repo --issue 123 \
  --context ./docs/api-conventions.md \
  --loops 5 \
  --test \
  --commit

The agent gets the issue spec plus your project conventions in one shot. This is where the quality jumps noticeably. Before I started doing this, the agent would generate code that worked but didn't follow our patterns -- wrong naming conventions, different error handling style, that kind of thing. Now it matches the rest of the codebase on the first try most of the time.

Setting up auth

One-time setup, takes like 30 seconds:

# GitHub (uses your existing gh CLI login)
gh auth login

# Linear
ralph-starter config set linear.apiKey lin_api_xxx

# Notion
ralph-starter config set notion.token secret_xxx

The full list of supported sources is in the integrations guide.

Why autonomous AI coding loops work

2026-01-15T00:00:00.000Z

I spent months copy-pasting code between ChatGPT and my editor before I realized I was the bottleneck.

The back-and-forth tax

You know the drill. Open ChatGPT, describe what you want, get some code back, paste it into your editor, run it, something breaks, copy the error, go back to the chat, paste the error, get a new version, paste that in. I was doing this maybe 15-20 times a day. For months.

And every round trip you lose a little bit of context. The model half-forgets what it suggested two messages ago. You lose track of which version you pasted where. Meanwhile you're the one running tests, reading stack traces, deciding what to try next.

At some point I realized the AI was doing the easy part -- generating code -- and I was doing everything else. Basically a clipboard manager with opinions.

What if the agent just... kept going?

So the idea behind ralph-starter is stupid simple. Instead of you being the middleman, the agent does the whole thing:

ralph-starter run "add user authentication with JWT" --loops 5 --test --commit

It reads your codebase, writes code, runs your tests. If something fails, it reads the error and fixes it. Then runs tests again. Over and over until things pass or it runs out of loops. You just... go do something else.

Here's a real run from last week:

Loop 1: Read codebase, generated auth middleware and routes
  → Tests: 3 failed (missing bcrypt import, wrong token expiry, no error handler)

Loop 2: Fixed imports, updated token config, added error handling
  → Tests: 1 failed (error handler not catching expired tokens)

Loop 3: Added expired token case to error handler
  → Tests: passed
  → Lint: 2 warnings (unused import, missing return type)

Loop 4: Cleaned up lint issues
  → Tests: passed, Lint: passed, Build: passed
  → Committed: feat: add JWT authentication

Four loops, zero copy-pasting, zero babysitting. I reviewed the diff after and it was clean -- I made coffee during loop 2.

Why this works better than chatting

The big difference is context. In a chat, the model kind of forgets what it tried two messages ago. In a loop, the agent sees everything -- its own previous attempts, the full test output, the whole history. It doesn't start over each time.

And errors become free instructions, which is the part that really clicked for me. When "TypeError: Cannot read properties of undefined" shows up in the test output, the agent gets that exact string -- you don't have to describe the problem. It reads the stack trace and acts on it. That is the stuff I was doing manually before, and honestly the agent is better at it than me because it does not skip lines.

A chat session might take 15-20 messages to land on working code. A loop usually finishes in 3-5 iterations because each one does real work, validates it, and course-corrects. You're paying for results, not conversation.
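The loop itself can be sketched in a few lines. Both the agent and the validator are stubs here, and `runLoop` is a hypothetical name. The real tool drives an actual coding agent and your real test command, so read this as the shape of the idea only.

```typescript
// The agent and the validator are injected so the sketch runs without any real tooling.
type Agent = (task: string, history: string[]) => string; // returns a description of the change
type Validate = (change: string) => { ok: boolean; output: string };

function runLoop(task: string, agent: Agent, validate: Validate, maxLoops: number) {
  const history: string[] = [];
  for (let loop = 1; loop <= maxLoops; loop++) {
    const change = agent(task, history);
    const result = validate(change);
    if (result.ok) return { done: true, loops: loop, history };
    // The raw error text goes back in verbatim; nobody has to paraphrase it.
    history.push(`Loop ${loop} failed: ${result.output}`);
  }
  return { done: false, loops: maxLoops, history };
}

// Stub agent: "fixes" whatever the last error mentioned.
const stubAgent: Agent = (_task, history) =>
  history.length === 0 ? "initial attempt" : `fix for: ${history[history.length - 1]}`;
// Stub validator: fails the first attempt, passes once a fix is proposed.
const stubValidate: Validate = (change) =>
  change.startsWith("fix for")
    ? { ok: true, output: "" }
    : { ok: false, output: "TypeError: Cannot read properties of undefined" };

const outcome = runLoop("add user authentication with JWT", stubAgent, stubValidate, 5);
console.log(outcome);
```

The history array is the difference from a chat: each iteration sees every earlier failure, and the loop stops as soon as validation passes or the loop budget runs out.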

Where it actually works (and where it doesn't)

This is not magic though. It works really well when you have:

A clear spec. "Add password reset flow per the design in issue #42" works great. "Make the auth better" does not.
Tests. Even crappy ones. Tests give the agent a finish line. Without them it's just vibes.
A linter and type checker. More automated checks = more signal for the agent to self-correct.

The tasks where I've had the best results: implementing a well-scoped feature from a GitHub issue, fixing a bug with a reproducible test case, refactoring code that has good coverage.

Where it falls apart: vague requirements with no tests, greenfield projects with no structure yet, anything that needs human judgment about UX. I tried it on a "redesign the dashboard" task once and... yeah. Don't do that.

It gets even better with real specs

One more thing that made a huge difference. Instead of writing a prompt from scratch, you can point it at an actual GitHub issue:

ralph-starter run --from github --project myorg/api --issue 42 --loops 5 --test --commit

This fetches the full issue body, comments, linked context -- all of it. The agent gets the same spec your team wrote for a human developer. Except it doesn't skim. It reads the whole thing, every comment, every acceptance criterion. Honestly it's more thorough than I am.

If you want to try it, the quickstart takes about two minutes.
