Vibe coding in production: what breaks, and the math on fixing it later

A bug that ships to production costs you somewhere between 30x and 100x more to fix than one caught in review. That ratio has been industry folklore for two decades, and AI did not repeal it. It made it worse, because now the code arrives faster than any human can read it. Vibe coding is the right speed for a mid-market operator. Unreviewed vibe coding is a line item you have not budgeted for.

Let me show you the leak, name the system that plugs it, and tell you who should not buy it.

What vibe coding actually is

The term comes from Andrej Karpathy. You describe what you want in plain language, the model writes the code, and you guide and test rather than hand-write every line. The Wikipedia definition captures it well: the human steers and verifies, the machine types. Done right, it collapses a 6-to-8-week agency build into days.

That speed is real and it is good. A landing page that used to take a quarter to ship can go live in a week. The problem is not the method, and it never was. The problem is the half of operators who skip the second verb in the Karpathy formula: test. They keep the prompt-and-pray loop running until the screen looks right, then ship. The screen looking right and the code being safe are two different facts, and only one of them shows up in your bank account.

There is a second misread worth naming. Operators treat vibe coding as a tool decision, as if the choice were between one model and another. It is a process decision. The model is a junior engineer who types fast, never tires, and has no judgment about what happens after the code runs once. You would not let a fast junior push straight to production without a senior reading the diff. The speed does not change the rule. It raises the stakes, because the junior now ships ten times the volume.

The leak: what unreviewed AI code costs you

The numbers are not subtle. Independent security research has found that 40 to 62 percent of AI-generated code samples contain exploitable security flaws, that misconfigurations show up roughly 75 percent more often than in hand-written code, and that certain vulnerability classes appear at 2.74x the baseline rate. Code now moves from prompt to deploy with little or no human in the loop, and analysts expect a wave of these failures to surface in production through 2026.

Translate that to your P&L. Say a vibe-coded checkout flow ships with a misconfigured payment webhook. The fix is an afternoon. The chargebacks, the support tickets, the customer who screenshots the error and posts it - that is your CAC working in reverse. One leaked credential in a deploy script can end a mid-market company. Research bodies like Gartner and McKinsey have been flagging the AI-code governance gap for exactly this reason: the speed is banked instantly, the risk is deferred and compounding.

The honest framing: the vibes get you 80 percent of the way in 20 percent of the time. The last 20 percent - the security review, the tests, the ownership - is where the entire risk lives. Skip it and you have not saved money. You have borrowed it at a bad rate.

The reason the cost compounds is structural, not bad luck. A defect caught in review is a five-minute fix in a text file. The same defect in production is a fix plus an incident: someone has to notice, reproduce, patch, redeploy, then clean up whatever the defect did to live data while it was loose. IBM research on defect economics puts the post-release multiplier in the range our folklore describes, and generated code bends that curve against you: more code per hour means more places for the defect to hide.

The five things that break first

Across the AI-built stacks we see, the failure modes repeat. They are predictable, which is exactly why review catches them.

Auth and access control top the list. The model wires a login that works for the happy path and quietly skips the check on the admin route. Input validation comes next: a form that trusts whatever the browser sends, straight into your database. Then secrets, where API keys land in client-side code or a public repo. Dependency sprawl is the fourth, with the model pulling in 5 packages to do the job of 1, each one a new attack surface. The fifth is the silent error path, where a failed payment or a dropped webhook returns a cheerful 200 and nobody finds out for days.

None of these are exotic. A senior engineer spots all 5 in a single read of the diff. The model will not flag them on its own, because it optimised for output that runs, not output that holds up when a real user does something the prompt never imagined.

It helps to see why each one happens, because the cause tells you why the model cannot self-correct. The auth gap appears because the prompt described the feature, not the threat model, so the model never wrote the guard. Input validation slips because the demo data was always clean; the model never saw the apostrophe in a last name or the script tag in a comment box. Secrets leak because hard-coding the key is the fastest path to a working call, and working is what the model optimises for. Dependency sprawl grows because pulling a library is one line and writing the function is twenty. The silent error path survives because a 200 makes the demo look finished, and the model has no incentive to imagine the webhook that times out at 2 a.m.
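The input-validation case is the easiest of the five to show in code. What follows is a minimal sketch in Python, assuming a hypothetical customers table and a hypothetical save_customer helper (neither comes from a real codebase). The value is checked before it is stored, and it reaches the database as a bound parameter rather than being pasted into the SQL string, so the apostrophe in a last name is stored as data.

```python
import sqlite3


def save_customer(conn: sqlite3.Connection, last_name: str) -> int:
    """Validate a last name, then store it with a bound parameter.

    Hypothetical helper for illustration; the table and length limit are assumptions.
    """
    cleaned = last_name.strip()
    if not cleaned or len(cleaned) > 100:
        raise ValueError("last_name must be between 1 and 100 characters")
    # The "?" placeholder keeps the value out of the SQL text,
    # so an apostrophe cannot change the query.
    cursor = conn.execute(
        "INSERT INTO customers (last_name) VALUES (?)", (cleaned,)
    )
    conn.commit()
    return cursor.lastrowid


# In-memory database so the sketch runs without any setup.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customers (id INTEGER PRIMARY KEY, last_name TEXT NOT NULL)"
)
row_id = save_customer(conn, "O'Brien")
stored = conn.execute(
    "SELECT last_name FROM customers WHERE id = ?", (row_id,)
).fetchone()[0]
print(stored)
```

A reviewer reading the diff looks for exactly this pair: a check on what came in, and a query that never concatenates user input.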

A worked example: the cost of one skipped review

Numbers settle arguments faster than principles. Take a mid-market operator with a checkout page doing 200 orders a day at an average order value of 120 dollars. That page moves 24,000 dollars a day, or 1,000 dollars an hour.

The team vibe-codes a pricing change on a Thursday and ships it unreviewed. The change breaks the webhook that confirms payment to the fulfillment system. Orders still take money; they stop reaching the warehouse, and the error returns a 200, so no alarm fires. The break runs from Thursday to Monday, when the first customer asks where their order is. Call it three business days before anyone notices.

Now the bill. The engineering fix is an afternoon, maybe 400 dollars of senior time. That is the part operators anchor on, and it is the smallest line. Three days of orders that took money but never shipped is roughly 72,000 dollars now sitting in a refund-or-scramble pile. Support has to contact every affected customer, reship or refund, and absorb the spike. A fraction of those customers never come back, and your CAC to replace them is money you already spent once. None of that lands if a senior engineer reads the diff and asks the question the prompt forgot: what happens when the webhook fails.

The point is not the exact figure. It is the shape. The visible cost of the fix is a rounding error next to the cost of the window the defect stayed open. Review does not make the fix cheaper. It closes the window before it opens.
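The arithmetic in this example fits in a few lines, which makes it easy to rerun with your own numbers. This is a sketch using only the figures stated above. Support load and lost repeat customers are left out because no figure is given for them, and the variable names are illustrative.

```python
ORDERS_PER_DAY = 200
AVERAGE_ORDER_VALUE = 120    # dollars
DAYS_BEFORE_NOTICED = 3      # business days the break stayed open
ENGINEERING_FIX_COST = 400   # dollars of senior time

daily_revenue = ORDERS_PER_DAY * AVERAGE_ORDER_VALUE
hourly_revenue = daily_revenue / 24
stranded_orders = ORDERS_PER_DAY * DAYS_BEFORE_NOTICED
stranded_revenue = daily_revenue * DAYS_BEFORE_NOTICED
window_to_fix_ratio = stranded_revenue / ENGINEERING_FIX_COST

print(f"Daily revenue:    {daily_revenue:,} dollars")
print(f"Hourly revenue:   {hourly_revenue:,.0f} dollars")
print(f"Stranded orders:  {stranded_orders}")
print(f"Stranded revenue: {stranded_revenue:,} dollars")
print(f"Window vs fix:    {window_to_fix_ratio:.0f}x")
```

On these inputs the stranded revenue is 180 times the engineering fix, before a single support ticket is counted.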

The named system: a reviewed Vibe Code build

The luup answer is the Vibe Code build. Same speed as raw vibe coding - sites and landing pages shipped in 7 days, not 6 to 8 weeks - but every line passes human review, sits behind guardrails and tests, and gets handed to you with full ownership. The discipline is the product. The vibes are the engine.

Concretely, that means a senior engineer reviews the generated diff before it merges, not after a customer finds the hole. It means the deploy pipeline has gates. It means you get the repo, the credentials, and the docs - no vendor lock-in, no monthly ransom for your own code. We built the same way for service businesses and SaaS teams; you can read the 7-day Vibe Code launch breakdown for the full sequence.

This is the opposite of the vague retainer. You are not renting a seat on someone else's roadmap. You get a working asset and the keys. See the Website and Vibe Code service for scope and pricing.

The review layer is not a vibe check at the end. It is four passes, each aimed at one of the failure modes above. The first reads the diff for the five repeat offenders: auth gaps, unvalidated input, exposed secrets, dependency sprawl, and silent error paths. The second runs tests against the unhappy paths the model never imagined - the malformed form, the timed-out webhook, the user who hits the admin route logged out. The third checks the deploy gates: nothing merges that fails a test. The fourth is the handoff, where you receive the repo, the keys, and the docs. Speed comes from the model. Safety comes from the four passes.
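The second and third passes can be pictured as a small set of unhappy-path checks that must all pass before a merge is allowed. The sketch below is one possible shape, not a description of any particular pipeline. The admin_dashboard handler, the role names, and the status codes chosen are assumptions made for illustration.

```python
from typing import Optional, Tuple


def admin_dashboard(user: Optional[dict]) -> Tuple[int, str]:
    """Hypothetical admin route handler with the guard written explicitly."""
    if user is None:
        return 401, "login required"
    if user.get("role") != "admin":
        return 403, "forbidden"
    return 200, "admin dashboard"


def run_unhappy_path_checks() -> list:
    """Return the names of failed checks; an empty list means the gate passes."""
    checks = {
        "logged-out visitor is rejected": admin_dashboard(None)[0] == 401,
        "non-admin user is rejected": admin_dashboard({"role": "customer"})[0] == 403,
        "admin user is allowed": admin_dashboard({"role": "admin"})[0] == 200,
    }
    return [name for name, passed in checks.items() if not passed]


failures = run_unhappy_path_checks()
print("merge allowed" if not failures else f"merge blocked: {failures}")
```

Delete the first guard from the handler and the first check fails, which is the point: the gate catches the missing admin check before a user does.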

Raw vibe coding vs a reviewed Vibe Code build

Dimension	Raw vibe coding	Reviewed Vibe Code build
Speed to ship	Days	7 days
Human review of generated code	Little to none	Senior engineer on every diff
Security flaw exposure	40 to 62 percent of samples	Reviewed, tested, gated
Tests and deploy guardrails	Optional, often skipped	Built in
Who owns the code	Unclear, often the tool	You - repo, keys, docs
Cost of a production failure	30x to 100x the review cost	Caught pre-merge

Stack-wise we build on Webflow, Framer, and Vercel, with design in Figma. The tools are not the moat. The review layer is.

Common mistakes operators make with AI-built sites

Five patterns show up again and again when a team ships AI-generated code without a process around it. Each is cheap to avoid and expensive to ignore.

The first is treating the preview as the product. The screen renders, the demo clicks through, so it ships. But the preview only exercises the happy path. Everything that breaks lives in the paths nobody clicked.

The second is no rollback plan. When the change breaks production, the team finds no clean way to revert, because nobody set up deploy gates or kept a known-good version. The fix turns into a forensic dig instead of a one-line rollback.

The third is accepting whatever dependencies the model chose. Each package is code you now run and cannot read. A site that pulls 30 libraries to render a landing page has 30 supply-chain surfaces, and you reviewed none of them.

The fourth is shipping with the model's default error handling, which swallows the failure and returns success. A payment that fails silently is worse than one that fails loudly, because the loud one wakes someone up.

The fifth is no ownership of the output. The team builds inside a tool that holds the code, and when they want to move or hire an engineer to extend it, they find they never had the keys. The build was a rental the whole time.
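The fourth mistake, error handling that swallows the failure and returns success, is easy to see side by side in code. This is a hedged sketch: the function names, the order ID, and the choice of a 502 status are assumptions for illustration, and timed_out_webhook stands in for a real fulfillment call.

```python
import logging
from typing import Callable

logger = logging.getLogger("checkout")


def forward_order_silently(order_id: str, send: Callable[[str], None]) -> int:
    """The pattern to avoid: the failure is swallowed and success is reported."""
    try:
        send(order_id)
    except Exception:
        pass  # nobody is told the order never reached fulfillment
    return 200


def forward_order_loudly(order_id: str, send: Callable[[str], None]) -> int:
    """Log the failure and return an error status so monitoring can see it."""
    try:
        send(order_id)
    except Exception:
        logger.exception("fulfillment webhook failed for order %s", order_id)
        return 502
    return 200


def timed_out_webhook(order_id: str) -> None:
    """Stand-in for a fulfillment call that times out."""
    raise TimeoutError("fulfillment system did not respond")


silent_status = forward_order_silently("order-1001", timed_out_webhook)
loud_status = forward_order_loudly("order-1001", timed_out_webhook)
print(silent_status, loud_status)
```

Both functions handle the same failure. Only the second one leaves a log line and a status code that an alert can fire on.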

Who this is NOT for

This is an honest fit, so here is where it does not fit.

If you have an internal engineering team that already runs code review, CI, and a real security process, you do not need us to add a review layer. You need us for capacity, not discipline, and that is a different conversation.

If you are pre-revenue and throwing a weekend prototype at the wall to see if anyone cares, raw vibe coding is the correct call. Do not pay for guardrails on a thing that might not exist in 30 days. Ship the ugly version, get the signal, then come back when real money flows through it.

If you want the cheapest possible site and you accept that you own the risk, a no-review build is a legitimate choice. We are not it. We are for the operator whose checkout, lead form, or pricing page touches actual revenue and cannot quietly break for 3 weeks before anyone notices.

One more case where we are the wrong call: if your project is a regulated build that needs a formal audit trail, third-party penetration testing, and compliance sign-off, our review layer is a strong first line but not the whole apparatus. We will tell you that on the call, not after the invoice.

How to decide in the next 10 minutes

Run the math on your own situation. What does an hour of downtime on your money page cost? What is your AOV times the number of orders that page takes each day? If a silent failure for one business day is a rounding error, vibe away. If it is a number that makes your stomach drop, you need review.

Use a three-question framework before you spend anything. First, does the page touch money or capture leads that become money? If no, ship raw and move on. Second, would three business days of silent failure hurt? That is the realistic window before anyone notices a quiet break. If the loss over that window beats the cost of a review, review wins on math alone. Third, do you own the code today, repo and keys in hand? If not, you are renting your own revenue surface, and that is the first thing to fix.
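The three questions reduce to a short function. This is a sketch under stated assumptions: the Surface fields and the decide helper are hypothetical names, and the 5,000 dollar review cost is a placeholder rather than a real price. The checkout revenue reuses the 24,000 dollars a day from the worked example.

```python
from dataclasses import dataclass


@dataclass
class Surface:
    name: str
    touches_money: bool
    daily_revenue: float   # dollars moving through the page per day
    review_cost: float     # dollars; an assumed placeholder, not a real quote
    owns_code: bool        # repo and keys in hand today


def decide(surface: Surface, window_days: int = 3) -> list:
    """Apply the three questions in order and return the recommended actions."""
    if not surface.touches_money:
        return ["ship raw"]
    actions = []
    if not surface.owns_code:
        actions.append("get the repo and keys first")
    exposure = surface.daily_revenue * window_days
    if exposure > surface.review_cost:
        actions.append("review before merge")
    else:
        actions.append("ship raw")
    return actions


checkout = Surface("checkout", True, 24_000, 5_000, False)
blog = Surface("blog", False, 0, 5_000, True)
print(decide(checkout))
print(decide(blog))
```

The ownership check is listed first in the output because it is the first thing to fix, even though it is the third question asked.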

Then find the actual leak. Our Loop Map generator maps where revenue enters and exits your systems, so you can see which surfaces are too expensive to ship unreviewed. The free Closed Loop Audit scores your stack the same way we scored 50 mid-market setups and found 87 percent broken - the writeup is in our 50-stack teardown.

If the numbers say review, see what reviewed builds look like in our case studies, then book the build. Vibe coding is the right speed. Shipping it blind is the wrong bet.

Frequently asked questions

Is vibe coding safe for production use?

Raw vibe coding is risky for production. Security research has found that 40 to 62 percent of AI-generated code samples contain exploitable flaws, and that misconfigurations show up about 75 percent more often than in hand-written code. It becomes safe when a senior engineer reviews every diff and when tests and deploy guardrails sit in front of the code. The method is not the problem. The missing review step is. Add the step and you keep the speed without carrying the risk into production.

What is vibe coding in simple terms?

Vibe coding, a term from Andrej Karpathy, means describing what you want in plain language and letting an AI model write the code while you guide and test it rather than hand-writing every line. It collapses multi-week builds into days, but the testing step is the part that protects you. Think of the model as a fast junior engineer who types tirelessly and has no judgment about what happens after the code runs once. The speed is a gift. The judgment is your job.

How fast can luup ship a reviewed Vibe Code build?

luup ships high-conversion sites and landing pages in 7 days, versus the typical 6-to-8-week agency timeline. The speed comes from AI generation; the safety comes from human review, tests, guardrails, and handing you full ownership of the repo and keys. You are not trading speed for safety, and you are not trading safety for ownership. The review layer is what lets you keep all three at once.

Who should not pay for a reviewed Vibe Code build?

Teams with their own code review, CI, and security process do not need our review layer. Pre-revenue founders testing a weekend prototype should ship raw and fast. Reviewed builds are for operators whose pages touch real revenue and cannot break silently for days. If your build is regulated and needs formal third-party audit and compliance sign-off, our review is a strong first line but not the whole apparatus, and we will say so before you pay.

How much does production AI code cost to fix after launch?

A defect caught after deploy runs roughly 30x to 100x the cost of catching it in review, and that is before chargebacks, support load, and reputation damage. Reviewing AI-generated code before merge is the cheapest insurance you will buy this quarter. The fix itself is usually cheap. The window the defect stays open is what bleeds you, and that window is exactly what a pre-merge review closes.

Vibe coding is the right speed for a mid-market operator. Unreviewed vibe coding is a liability. Start with the free Closed Loop Audit and decide which surfaces are too expensive to ship blind.
