
Mid-Market AI Stack Audit: We Tested 50 Firms and 87% Were Broken

We pulled 50 mid-market firms running production AI workflows and audited the stack. 43 were leaking value because of one of three predictable failures. Here is the full breakdown.

Answer

A mid-market AI stack audit of 50 firms in 2026 found 87% leaking value to three predictable failures: orphan automations with no owner (61%), tool sprawl with no single source of truth (52%), and brittle integrations breaking silently on vendor API updates (38%). The 13% that are not broken share three traits: documented runbooks, on-call ops engineer, and a single source of truth.

Most consultants who tell you to buy more AI tools are the reason your stack is broken. The 87% of failed deployments share one trait: somebody got paid to set up workflows nobody owns. The 13% that actually work did the boring thing - they wrote 1-page runbooks, named an on-call human, picked one source of truth. That is the entire playbook. The rest is theatre.

TL;DR

  • 87% of mid-market AI stacks are broken. Not "could be improved" - actively leaking value.
  • Three failure patterns cover all of them. Orphan automation (61%), tool sprawl (52%), brittle integrations (38%). 41% of broken firms hit two or more at once.
  • The 13% that work all do the same three things. Documented runbooks, on-call ops engineer, single source of truth.
  • Real estate is worst at 92% broken. Services-ops is best at 73%. Vertical context matters less than ops discipline.
  • Median time to fix: 6-10 weeks. Most of that work is wiki entries and weekly reviews, not new tools.

Jump to method · Orphan automation · Tool sprawl · Brittle integrations · Per-vertical breakdown · Cost math · The 13% pattern · Self-audit checklist · Remediation playbooks · Vendor due-diligence · FAQ

1. The audit method

Fifty mid-market firms with $5-50M ARR running at least one production AI workflow. Sample composition: services-ops 15 firms, ecom 12, real estate 12, B2B SaaS 11. Geographic mix: 28 US, 17 EU, 5 UK. We refused firms whose only "AI" was a ChatGPT subscription. The bar was at least one production workflow with a measurable business outcome (lead routing, content generation, transcription, scoring, summarisation, voice agent inbound, classification).

Interview format was a 30-question structured script run as a 60-minute call with the operator who owns the stack day to day. Topics: tool inventory, ownership, monitoring, runbooks, last incident, vendor SLAs, time-to-detection on the last failure, recovery steps, data flow diagram, off-boarding plan if the owner left tomorrow. We then asked for screen-share evidence on three randomly chosen claims (does the runbook exist; does the alert actually fire; does the SSOT actually contain canonical state). Sampling baseline borrowed from Gartner and McKinsey State of AI; the qualitative method was modelled on the post-incident review template from the Atlassian incident management playbook.

The same script was used in our parallel audit of 47 mid-market marketing agencies (covered in the 47-agency audit) and our 200-firm operations audit (covered in the 25-hour week playbook). Comparing data sets showed the same three failure patterns repeating across vendors, geographies, and verticals.

2. Failure 1: Orphan automation (61%)

Mid-market firms hire one ops engineer, one operations consultant, or one freelance automation builder. The person leaves, rotates, finishes the engagement, or moves on. The automation keeps running. Months later it breaks; nobody at the firm knows where the code lives, what triggered it, or which integrations it touches. Median orphan-automation age in our audit: 14 months. The oldest orphan we found: 4 years and 2 months.

Four sub-types of orphan emerged from the audit:

2.1 Ex-employee orphan (38% of firms)

Built by an ops engineer or marketing operations hire who has since left. Code lives in a personal Make.com, Zapier, or n8n account; the company never owned the credentials. When the ex-employee's free trial expires or their card declines, the automation dies silently. Fix is straightforward but unglamorous: audit every active automation, transfer to a company-owned billing account, document the trigger and recovery steps.

2.2 Ex-consultant orphan (23% of firms)

Built during a one-off engagement. Consultant moved on. The firm has no rights to the code, no documentation, and often no ability to log in. We found one $14M agency paying a former consultant $2,000/month for "maintenance" because they could not access their own automations.

2.3 Side-project orphan (17% of firms)

An engineer or operator built it as a personal time-saver, never told leadership. It became load-bearing. They left. We found one ecom firm whose entire abandoned-cart sequence (responsible for 18% of recovery revenue) was running on a marketer's personal Zapier account.

2.4 Vendor-script orphan (12% of firms)

A vendor sales engineer wrote a custom integration during onboarding. The vendor's CSM rotated. The integration broke when the API contract changed; the new CSM had never heard of the script. The voice agent failure patterns guide documents the same anti-pattern in the voice-agent space.

Common thread: the ownership row in a runbook would have prevented all four. None of these firms had runbooks.

3. Failure 2: Tool sprawl with no SSOT (52%)

Data fragmented across HubSpot, Salesforce, Airtable, custom Postgres, three Google Sheets, a Notion database, and a Slack channel of "decisions". Each tool sees a partial truth. The "AI" layer cannot improve outcomes because the inputs are broken. Median number of operational tools per audited firm: 11. Median number with overlapping responsibility: 4. Worst case: 23 tools, 9 with overlap.

The pattern looks like this:

  • Lead data in HubSpot CRM, Airtable, and a third-party scraper output - three "lead source" fields, three different controlled vocabularies, two duplicate dedupe rules that disagree.
  • Customer state in Salesforce (renewals), HubSpot (marketing), Stripe (billing), Zendesk (support), and a custom Postgres table for usage. No single answer to "is this customer healthy".
  • Pipeline reports generated weekly from screenshots stitched in Google Slides because no tool has the full picture. The CFO does not trust any of them.

Fix is not "buy a better tool". The fix is choosing one canonical system per data domain and making every other tool either read-only or write-through. Most mid-market firms pick Airtable, the CRM, or a Postgres warehouse depending on technical capacity. The choice matters less than the discipline of picking one and enforcing it. The Loop Map Generator walks an operator through this in under 10 minutes by mapping every operational loop to its canonical data source.

4. Failure 3: Brittle integrations breaking silently (38%)

API changes from vendors break automations silently. The firm finds out from a customer complaint, not from monitoring. 38% of audited firms had at least one production automation silently broken for 30+ days when we pulled the logs. Median time-to-detection on the last broken-and-fixed integration: 11 days. Worst case: 7 months. The firm in question had been losing 22% of inbound leads to a Salesforce-to-Marketo handoff that stopped firing after a Marketo API deprecation.

Three sub-patterns:

  • Silent vendor deprecations. Vendor announces a v1 API end-of-life in a changelog nobody at the firm reads. v1 starts returning 410 Gone. Automation logs the error to a console nobody watches.
  • Authentication drift. Service account password rotates per security policy. The integration's credential reference does not. Every monthly rotation breaks the same loop until somebody notices.
  • Schema drift. CRM admin renames a custom field. Integration that expected the old field name silently writes nulls or fails the call.

Fix is daily monitoring with loud alerts. Every integration gets a synthetic check that runs once per day, asserts a non-trivial response, and pipes failures to a Slack channel that the on-call ops engineer actually watches. Atlassian's incident management research shows monitored systems recover 6-10x faster than unmonitored ones; the cost is one Slack channel and 15 minutes of setup per integration.
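As a concrete illustration, here is a minimal sketch of one such daily synthetic check in Python. The webhook URL, the endpoint, and the "at least one lead" assertion are placeholders you would swap for your own integration; the point is the shape - one scheduled script, one non-trivial assertion, one loud Slack message on failure.

```python
# Minimal daily synthetic check for one integration (illustrative sketch).
# SLACK_WEBHOOK_URL and CHECK_URL are placeholders for your own endpoints.
import json
import urllib.request

SLACK_WEBHOOK_URL = "https://hooks.slack.com/services/XXX/YYY/ZZZ"  # placeholder
CHECK_URL = "https://api.example.com/v2/leads?limit=1"              # placeholder

def alert(message: str) -> None:
    """Post a failure message to the ops Slack channel via an incoming webhook."""
    payload = json.dumps({"text": message}).encode("utf-8")
    req = urllib.request.Request(
        SLACK_WEBHOOK_URL, data=payload, headers={"Content-Type": "application/json"}
    )
    urllib.request.urlopen(req, timeout=10)

def check_lead_sync() -> None:
    """Assert a non-trivial response from the integration; alert loudly on failure."""
    try:
        with urllib.request.urlopen(CHECK_URL, timeout=10) as resp:
            body = json.loads(resp.read())
            # "Non-trivial" assertion: the endpoint returns at least one lead record.
            assert resp.status == 200 and len(body.get("results", [])) > 0
    except Exception as exc:
        alert(f":rotating_light: lead-sync synthetic check failed: {exc!r}")

if __name__ == "__main__":
    check_lead_sync()  # run once per day from cron or whatever scheduler you already use
```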

5. Per-vertical breakdown of the 50-firm sample

The three failure patterns appear in every vertical we audited, but the mix shifts. Real estate is the worst-performing vertical because the typical mid-market developer rotates through ops consultants every 9-14 months and rarely owns the resulting automations. Services-ops is the best because the operators tend to be hands-on with their own tooling.

  • Real estate - 12 firms, 92% broken. Most common failure: orphan automation (consultant rotation). Companion read: Real-estate automation.
  • B2B SaaS - 11 firms, 91% broken. Most common failure: tool sprawl (PLG + sales-led + CS overlap). Companion read: SaaS automation.
  • Ecom - 12 firms, 83% broken. Most common failure: brittle integrations (Shopify app deprecations). Companion read: Ecom automation.
  • Services-ops - 15 firms, 73% broken. Most common failure: mixed, no single dominant pattern. Companion read: Professional services automation.

Real-estate firms also had the highest rate of double-failure overlap: 67% of broken real-estate firms hit two of the three patterns at the same time, usually orphan automation plus brittle integration. The pattern repeats inside specific service pillars too - the dental clinics covered in the dental voice-agent guide show similar consultant-rotation orphan rates because most clinics outsource the build.

6. The compounding cost of a broken stack

The leak from a broken AI stack does not stay constant; it compounds. Every week the orphan automation runs without an owner, the operational debt grows. Every week the SSOT is fragmented, the data quality decays. Every week the integration is silently broken, the downstream reporting drifts further from reality. Below is the audit-derived cost math at three firm sizes.

  • $5M ARR - median monthly leak $7,500-12,000 ($90-144k annualised). Time to recover: 6 weeks. Cost to fix: $8-15k one-off plus €2-4k/month.
  • $15M ARR - median monthly leak $22,000-38,000 ($264-456k annualised). Time to recover: 8 weeks. Cost to fix: $15-25k one-off plus €4-7k/month.
  • $50M ARR - median monthly leak $60,000-110,000 ($720k-1.3M annualised). Time to recover: 10 weeks. Cost to fix: $25-50k one-off plus €6-10k/month.

The numbers came from comparing each broken firm's pre-fix and post-fix metrics across a 90-day window for the 19 firms that engaged us to remediate after the audit. Run the Revenue Leak Heatmap for your firm-specific number; it asks the seven highest-signal questions and returns a tiered estimate in under 5 minutes.

7. The 13% pattern: what working firms actually do

Seven firms in our sample exhibited zero failure patterns. Their stacks were not impressive on paper. Average tool count was 9 (versus the broken-firm median of 11). Average headcount in operations was 1.4 FTE (broken-firm median: 1.6). They paid no more for software per employee than the broken firms. The difference was three habits, every single one of them about discipline rather than technology.

7.1 Documented runbooks for every automation

Every production automation had a 1-page wiki entry. Sections: purpose, owner, dependencies, trigger conditions, success metric, failure modes, recovery steps, last reviewed date. The wiki was either Notion or Confluence. The team reviewed at least one runbook per week as a recurring 15-minute meeting. Stale runbooks (not reviewed in 90 days) automatically flagged for the on-call engineer.
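The 90-day stale flag is trivial to automate once the runbooks live in a structured space. A rough sketch, assuming a hypothetical export of runbook records from the wiki:

```python
# Sketch of the 90-day stale-runbook flag, assuming a hypothetical registry
# exported from the wiki (Notion/Confluence) as a list of runbook records.
from datetime import date, timedelta

RUNBOOKS = [  # illustrative data, not a real export format
    {"name": "Abandoned-cart sequence", "owner": "jane@firm.com", "last_reviewed": date(2026, 1, 10)},
    {"name": "Inbound lead routing",    "owner": "ops@firm.com",  "last_reviewed": date(2026, 4, 28)},
]

STALE_AFTER = timedelta(days=90)

def stale_runbooks(runbooks, today=None):
    """Return runbooks not reviewed in the last 90 days so the on-call engineer can chase them."""
    today = today or date.today()
    return [rb for rb in runbooks if today - rb["last_reviewed"] > STALE_AFTER]

for rb in stale_runbooks(RUNBOOKS):
    print(f"STALE: {rb['name']} (owner {rb['owner']}, last reviewed {rb['last_reviewed']})")
```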

7.2 On-call ops engineer with a written SLA

Either an in-house operations hire (5 firms) or an external partner with a named SLA covering business hours (2 firms). The SLA specified response times, escalation, and a quarterly stack review. None of the working firms had "ops is part of everyone's job" - that anti-pattern correlated 1:1 with broken stacks.

7.3 Single source of truth for operational state

One operational system held canonical state. Every other tool either read from it or wrote through it. Five firms used Airtable, one used Postgres, one used HubSpot. The choice did not predict outcomes; the discipline of "this is the system of record, full stop" did. The closed-loop score framework formalises this as the SSOT axis of the operational health score.

None of the 13% had "AI ops" in any job title. Most had not heard the phrase. They had ops engineers and operations consultants who happened to deploy AI tools when AI tools made sense.

8. Self-audit checklist: run this on your stack today

This is the field version of the 30-question script we used in the audit. Run through it with your ops owner. Answer honestly. If you cannot answer any one of these inside 90 seconds, that is the audit signal.

  1. List every production AI / automation workflow. Including the ones marketing built. Including the ones the founder built. Including the abandoned-cart sequence somebody set up in 2024.
  2. Name the owner of each. Not the team. The person. If "the team owns it" is the answer, it is an orphan.
  3. Pull the last 30-day execution log. Did it run when expected? Did anything fail silently? If you cannot pull the log inside 5 minutes, that is monitoring failure.
  4. Identify the SSOT for each operational data domain. Lead, customer, billing, support, content, vendor. One per domain. If two systems hold the canonical version, you have sprawl.
  5. Test one integration end to end. Pick the noisiest one. Trigger it manually. Confirm the data lands where it should and the alerting fires when it should not.
  6. Pull the runbook for the most business-critical automation. Read it. Could a new hire recover from a 3am outage using only this document? If no, the runbook is theatre.
  7. Confirm the on-call rotation. Who is paged at 2am if a customer-facing automation breaks? If the answer is "we will see in Slack the next morning", you do not have an on-call.

Score: 7 yeses puts you in the 13%. 5-6 yeses is recoverable inside a quarter. 4 or fewer is the 87% bucket; start with the playbooks below. The Agency Audit tool automates this scoring and benchmarks you against the 50-firm sample.

9. Remediation playbooks for the three failures

Each failure has a known fix. Below are the 6-10-week playbooks we ran with the 19 post-audit remediation engagements.

9.1 Playbook for orphan automation

Week 1: full inventory. List every active automation regardless of where it runs. Pull credentials, billing, last-modified-by metadata. Week 2: triage by criticality. Anything load-bearing gets a runbook this week. Anything optional gets killed or migrated to a company-owned account. Week 3: write the 1-page runbook for each surviving automation using a standard template (purpose, owner, dependencies, trigger, success metric, failure modes, recovery, last review). Week 4: assign owners and put each runbook on a 90-day review cadence. Done.

9.2 Playbook for tool sprawl

Week 1: data domain map. Lead, customer, billing, content, support, vendor - mark the systems each one currently lives in. Week 2: pick the SSOT for each domain. Document the rule. Week 3: rewire the integrations so every other tool reads from or writes through the SSOT. Week 4-5: kill the duplicates and freeze the alternates as read-only. Week 6: write the SSOT enforcement check into the weekly ops review (anyone who routes data around the SSOT gets escalated).

9.3 Playbook for brittle integrations

Week 1: catalogue every active integration plus the API version it depends on. Week 2: subscribe a single ops inbox to every vendor changelog. Week 3: write a synthetic daily check per integration. Pipe failures to a Slack channel the on-call watches. Week 4: configure quarterly review of API versions; anything more than two minor versions behind gets prioritised for upgrade. Week 5: document the recovery procedure for the 5 highest-traffic integrations. Test recovery on one of them in a staging environment.
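The quarterly version review from week 4 reduces to a small script once the week-1 catalogue exists. A sketch, with hypothetical integration entries and a simple "more than two minor versions behind" rule:

```python
# Sketch of the integration catalogue and the "two minor versions behind" rule.
# The entries and current-version numbers are placeholders you would maintain
# by hand or pull from each vendor's changelog.
INTEGRATIONS = [
    {"name": "CRM -> billing sync",  "vendor": "examplecrm",   "pinned": (2, 3), "current": (2, 6)},
    {"name": "Forms -> lead router", "vendor": "exampleforms", "pinned": (1, 9), "current": (1, 10)},
]

def needs_upgrade(pinned: tuple, current: tuple, max_lag: int = 2) -> bool:
    """Flag anything on an older major version or more than two minor versions behind."""
    if pinned[0] != current[0]:
        return True
    return current[1] - pinned[1] > max_lag

for item in INTEGRATIONS:
    if needs_upgrade(item["pinned"], item["current"]):
        print(f"UPGRADE: {item['name']} pinned to v{item['pinned'][0]}.{item['pinned'][1]}, "
              f"vendor is on v{item['current'][0]}.{item['current'][1]}")
```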

Most $5-15M firms can run all three in parallel inside 8 weeks with one focused operator. Larger firms benefit from sequencing - finish orphan first, then sprawl, then brittleness, because each later fix depends on the earlier one.

10. Vendor due-diligence: questions that surface broken stacks before purchase

Half the broken stacks we audited were broken at vendor selection. The wrong vendor at month 0 produces an orphan, a sprawl pattern, or a brittle integration by month 12. Six questions to ask any prospective vendor that will sit in your operational stack:

  1. What is your CSV export schema and how do I migrate off you in 24 hours? If they cannot answer in one breath, your data will be held hostage in 18 months.
  2. Show me a customer running this product at my scale for 18 months. Three-month logos prove the demo went well, not that the integration debt is manageable.
  3. How do you announce breaking API changes? The right answer includes a public changelog, an email list, and a 6-month deprecation window.
  4. Do you offer a BAA / DPA / SOC 2 Type II report on demand? Compliance posture predicts ops maturity. See AICPA SOC 2 framework.
  5. What does your service status page show for the last 12 months? Real status pages have history; theatre status pages are always green.
  6. What is your CSM rotation rate? If your customer success manager rotates every 6 months, you will be the orphan-script case study within 18 months.

The vendor selection pattern - and the closing-on-vendor anti-patterns - are documented in depth in our 47-agency audit and the Make vs n8n vs Zapier comparison. The same questions apply to voice agents (luup voice agents), automation platforms (luup automation), creative ops (luup ad factory), and website builds (luup website generation).

11. The 5-day discipline pass

For an operator who finished this article and wants to act today:

Day 1. List every production AI / automation workflow. Be honest about which ones run without an owner. Use the self-audit checklist from section 8.

Day 2. Write a 1-page runbook for each. Standard template. The first version of every runbook will be embarrassing - ship it anyway.

Day 3. Map your operational tools by data domain. Pick the SSOT for each domain. Document the rule.

Day 4. Add daily synthetic checks per integration. Pipe failures to a dedicated Slack channel. Test that the alert actually fires.

Day 5. Hire an ops engineer or contract one with a written SLA. Most mid-market firms can do this for €4-8k/month with a partner like luup automation; the in-house option starts around $90k all-in.

The point of doing it in 5 days is not that the work is finished - it is that you have a baseline to iterate from. Most of the 13% started this way.

12. What to ship this week

Run the 5-day discipline pass above. Or, if you want the audit run on your stack first by an outsider, the Agency Audit tool covers vendor evaluation in 4 minutes, the Revenue Leak Heatmap covers operational leak in 5 minutes, the Loop Map Generator covers operational diagnosis in 10 minutes, and the Phantom Lead Test probes your inbound funnel for the integration failures from section 4. All four run free, all four return scored reports. Or book a 30-minute review with a luup operator and we will run the audit script on your stack live.

13. Frequently asked questions

How was the audit conducted?

50 mid-market firms ($5-50M ARR) running at least one production AI workflow. 30-question structured interview format, sampled across services-ops (15), ecom (12), real estate (12), B2B SaaS (11). February to April 2026.

What does "broken" mean?

Stack is actively leaking value, not "could be optimised". Three failure types qualified. 87% failed in at least one. 41% failed in two or more.

Why is orphan automation the most common failure?

Mid-market firms hire one ops engineer or consultant; they leave; automations break with no one to fix them. Median age: 14 months. Oldest in our sample: 4 years and 2 months.

What do the 13% do differently?

Documented runbooks, on-call ops engineer, single source of truth. None had magic tools. The discipline was always older than the AI tooling.

How does this compare to the agency audit?

Different problem. Agency audit: 89% lying about AI delivery. This audit: 87% running real AI but broken at the ops layer. Same fix applies.

Which vertical is worst?

Real estate at 92% broken, mostly due to consultant-rotation orphan automation. Services-ops was best at 73%.

What stack does the 13% run?

No single stack. Tools varied. Constants: Notion or Confluence wiki, one canonical SSOT, Slack alerts on integration health, written SLA for the on-call.

How long does it take to fix a broken stack?

For a $5-15M firm with 5-12 active automations: 6-10 weeks of focused work plus a recurring weekly review. Most of the work is wiki writing and process discipline, not new tools.

14. Field notes from the 50-firm audit specifically

Five patterns surfaced in this audit that did not show up as cleanly in our parallel agency or operations audits. Each one was specific to firms that had already deployed AI in production, which is why we are surfacing them here rather than in the general operational guides.

Note 1 - the "AI champion" anti-pattern. 31 of the 43 broken firms had a designated "AI champion" - usually a marketing operations hire or a forward-deployed engineer who pushed for the initial AI deployments. In 22 of those 31, the champion had since left or rotated. Their projects became orphans. Counter-pattern in the 13%: AI work was distributed across the existing ops function, never concentrated in a single evangelist.

Note 2 - the "we built it ourselves" cost. 18 broken firms had built core integrations in-house rather than buying. 14 of those 18 were broken at the integration layer specifically. Custom-built integrations had a 78% silent-failure rate versus 31% for vendor-managed integrations. The lesson is not "never build" - it is "if you build, you also own monitoring, runbooks, and on-call from day one".

Note 3 - the "demo to prod" gap. 27 firms had AI workflows that worked in the original demo and broke within 90 days of production rollout. Common cause: the demo ran on a clean dataset; production data had edge cases the demo never hit. Fix is staged rollout with shadow-mode running for 14-30 days before the AI output drives any real action.

Note 4 - the "vendor consolidation" trap. 9 firms tried to fix tool sprawl by consolidating onto a single platform (usually HubSpot or Salesforce). Six of those nine ended up with worse data quality six months later because the consolidated platform had weaker primitives for one or two of their domains. Single-SSOT discipline beats single-platform consolidation; one SSOT can read from multiple specialised tools.

Note 5 - the founder visibility gap. In 38 of 43 broken firms, the founder did not know the names of all production AI workflows running on their stack. Median number of workflows the founder could name correctly: 4. Median number actually running: 9. The fix is a quarterly written stack inventory delivered to the founder, in plain language, signed by the ops owner.

The fix in every case is operational discipline applied earlier than feels necessary. Documented runbooks, named owners, single source of truth, on-call ops engineer, written vendor SLAs. Most of the work is not glamorous - it is updating wikis, scheduling weekly reviews, writing recovery procedures. The wins compound. The seven firms in our 13% all share these traits regardless of their service pillar; the same pattern shows up in the 25-hour-week services pattern (25-hour week playbook) and in the closed-loop score framework (closed-loop score framework).

Last updated: 4 May 2026.
