If you’ve read any “AI saved our company $40 million” case study in the last eighteen months, you’ve probably noticed the number is always large, always round, and always impossible to verify. There’s a reason for that. The industry has a strong incentive to inflate ROI claims and almost no incentive to audit them.
We run into this on every Forge deployment. A client asks us what ROI to expect. We give them a number. They ask us to put it in writing. We put it in writing, and now we have to actually measure it — and if we’re off by more than a handful of percentage points in the wrong direction, we’re refunding the deployment fee.
That constraint — being on the hook for your own ROI claims — forces honesty. Here’s the methodology we’ve landed on.
The four ROI lies we try to avoid
Before the methodology, the traps. Every AI ROI claim you’ve ever read is probably committing at least one of these.
Lie #1: Counting hours saved as dollars saved.
A finance associate spending 6 hours a day on invoice matching is not “$360,000 of waste” just because you multiplied their hourly rate by 6 hours a day and 250 working days. If the agent saves those 6 hours, you don’t get to fire 1.5 FTEs (and you wouldn’t want to). You get 6 hours of someone’s day back. That’s real — but it’s only valuable if you can redirect those hours to something that actually generates revenue or reduces a different cost.
What we do instead: we track hours saved as capacity, not as dollars, and we ask the finance leader to tell us what projects were previously blocked by capacity. When those projects ship, we count the project outcomes as the ROI. Not the hours.
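If you want the gap between the two framings spelled out, here is a rough sketch. The hourly rate is back-solved from the $360,000 figure above, and the project value is a made-up placeholder, since only the client can tell us what the unblocked work was actually worth:

```python
# Illustrative only: the rate is implied by the $360k example above,
# and the project value is a made-up placeholder.
HOURLY_RATE = 240            # $/hour, back-solved: 360,000 / (6 h x 250 days)
HOURS_SAVED_PER_DAY = 6
WORKING_DAYS = 250

# Lie #1: hours saved, annualized and converted straight to dollars.
naive_dollars = HOURLY_RATE * HOURS_SAVED_PER_DAY * WORKING_DAYS

# What we track instead: freed hours as capacity, plus the outcomes of the
# previously blocked projects that capacity made possible.
capacity_hours_freed = HOURS_SAVED_PER_DAY * WORKING_DAYS
shipped_project_value = 120_000      # counted only when the project actually ships

print(f"Naive 'savings' claim:  ${naive_dollars:,}")            # $360,000
print(f"Capacity freed:         {capacity_hours_freed:,} hours/year")
print(f"ROI we actually report: ${shipped_project_value:,}")
```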
Lie #2: Counting cost avoidance as cash.
“We avoided hiring 3 more people next year” is a real benefit, but only if you can prove you were actually going to hire them. If hiring was already paused or wasn’t in the approved headcount plan, cost avoidance is imaginary.
What we do instead: we only count cost avoidance against a signed, dated, pre-existing headcount plan. If it’s not in last quarter’s board deck, we don’t count it.
Lie #3: Confusing accuracy with value.
An agent that reconciles invoices with 98% accuracy sounds great. But if it’s only being tested on invoices that were already easy to reconcile manually, the accuracy number tells you nothing about the actual value.
What we do instead: we measure accuracy separately for easy and hard cases, and we define “hard” before we turn the agent on. Hard cases are the ones that used to end up on an analyst’s desk. If the agent can only handle the easy ones, we report that honestly.
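A sketch of what that reporting looks like, with invented cases. The only real requirement is that the easy/hard label is assigned before the agent ever sees the invoice:

```python
# Invented data; the point is the structure, not the numbers. "hard" means the
# invoice would previously have ended up on an analyst's desk, and the label
# is assigned before the agent runs.
cases = [
    ("easy", True), ("easy", True), ("easy", True), ("easy", True), ("easy", False),
    ("hard", True), ("hard", False), ("hard", False), ("hard", True), ("hard", False),
]

def accuracy(subset):
    return sum(correct for _, correct in subset) / len(subset) if subset else None

easy = [c for c in cases if c[0] == "easy"]
hard = [c for c in cases if c[0] == "hard"]

print(f"Easy-case accuracy: {accuracy(easy):.0%}")  # the headline number in most decks
print(f"Hard-case accuracy: {accuracy(hard):.0%}")  # the number that reflects value
```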
Lie #4: Not measuring the humans.
The most common AI ROI mistake is forgetting that the humans are part of the system. If the agent takes 4 hours off the team’s day, but the team now spends 3 hours a day babysitting the agent, you’ve saved 1 hour, not 4. We’ve seen plenty of deployments where the “time saved” number was gross and the net was negative.
What we do instead: we time-track supervision and escalation overhead for the first 8 weeks, and we subtract it from the gross savings. The final ROI number is the net.
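Here is the same arithmetic as a sketch, using the 4-hours-gross, 3-hours-overhead example from above; only the split between supervision and escalation is invented:

```python
# Numbers match the example above; the supervision/escalation split is invented.
gross_hours_saved_per_day = 4.0      # time the agent takes off the team's day
supervision_hours_per_day = 2.5      # reviewing and correcting the agent's output
escalation_hours_per_day = 0.5       # hard cases bounced back to a human

overhead = supervision_hours_per_day + escalation_hours_per_day
net_hours_saved_per_day = gross_hours_saved_per_day - overhead

print(f"Gross: {gross_hours_saved_per_day:.1f} h/day, "
      f"overhead: {overhead:.1f} h/day, net: {net_hours_saved_per_day:.1f} h/day")
# Net is the only number that goes in the final ROI report.
```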
The methodology: baseline, target, measurement
Every Forge deployment starts the same way. Three steps, all defined before a line of agent code is written.
Step 1: Baseline — week 1
We spend the first week instrumenting the current process. Specifically:
- Volume. How many transactions per day, week, or month? Counted, not estimated.
- Cycle time. From request to completion, measured at the p50 and p95.
- Error rate. How often does the current process produce a bad outcome? Define “bad” tightly.
- Touch count. How many humans touch each transaction on average?
- Capacity consumption. What fraction of the team’s week is spent on this process? Not estimated — calendar-based.
This is the baseline document. The client signs it at the end of week 1. Every ROI claim we make at the end of the engagement is measured against this signed baseline, with no retroactive adjustments.
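For the curious, here is roughly what that instrumentation boils down to once the raw records are exported. The record format and the values are invented; in a real engagement they come straight out of the ERP or ticket system:

```python
from statistics import quantiles

# Invented transaction log: (cycle_time_days, human_touches, bad_outcome).
# In practice this is an export from the ERP or ticket system, not a typed-in list.
transactions = [
    (2.1, 2, False), (5.0, 3, False), (1.4, 1, False),
    (8.2, 4, True),  (3.7, 2, False), (6.5, 3, False),
]

cycle_times = [t[0] for t in transactions]
cuts = quantiles(cycle_times, n=100, method="inclusive")
p50, p95 = cuts[49], cuts[94]

volume = len(transactions)                               # counted, not estimated
error_rate = sum(t[2] for t in transactions) / volume    # "bad" defined tightly up front
avg_touches = sum(t[1] for t in transactions) / volume

team_hours_per_week = 200          # e.g. 5 people x 40 hours
process_hours_per_week = 48        # pulled from calendars, not from a survey
capacity_fraction = process_hours_per_week / team_hours_per_week

print(f"Volume: {volume}/week  cycle p50: {p50:.1f}d  p95: {p95:.1f}d")
print(f"Error rate: {error_rate:.0%}  avg touches: {avg_touches:.1f}")
print(f"Capacity consumed: {capacity_fraction:.0%} of the team's week")
```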
Step 2: Target — also week 1
With the baseline in hand, we write the target into the SOW. A good target is:
- Specific: “Reduce p50 cycle time from 4.2 days to 1.5 days.”
- Time-bound: “By end of week 12.”
- Measurable from existing systems: We can pull the number from the ERP or the ticket system without adding custom instrumentation.
- Attributable: We can isolate it from other things happening in the business.
If any of those four is missing, the target doesn’t go in the SOW. This is the number the refund clause references.
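If it helps to see the shape of it, here is a SOW target reduced to data. The field names are ours, for illustration only; the values echo the example above:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class SowTarget:
    metric: str              # specific: which number, defined against the baseline
    baseline_value: float
    target_value: float
    deadline_week: int       # time-bound
    source_system: str       # measurable: pulled from an existing system
    attribution_note: str    # attributable: what stays frozen during the engagement

target = SowTarget(
    metric="p50 cycle time (days)",
    baseline_value=4.2,
    target_value=1.5,
    deadline_week=12,
    source_system="ERP",
    attribution_note="no other changes to the process during the 12-week window",
)
```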
Step 3: Measurement — week 12
At week 12, we re-run the baseline measurement using the same method, on the same systems, at the same granularity. We publish a written report. The client reviews it. If we hit the target, deployment is complete and we move into the optimization phase. If we miss, the refund schedule kicks in.
That’s it. No mystery, no hand-waving.
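The week-12 check itself is not much more complicated than this sketch. The numbers are illustrative, and the real report re-runs every baseline metric, not just one:

```python
# Illustrative numbers; the real report covers every metric in the signed baseline.
baseline_p50_days = 4.2   # from the signed week-1 baseline
target_p50_days = 1.5     # from the SOW
week12_p50_days = 1.4     # re-measured with the same method, on the same systems

improvement = (baseline_p50_days - week12_p50_days) / baseline_p50_days
hit = week12_p50_days <= target_p50_days

print(f"Improvement vs baseline: {improvement:.0%}")
print("Target hit: deployment complete, move to optimization."
      if hit else "Target missed: refund schedule applies.")
```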
What makes this work
Two things. First, the baseline is signed, so we can’t quietly move the goalposts. Second, the target is attributable, so we can’t credit the agent for improvements that came from other things (a new hire, a process change, an integration upgrade).
The attribution piece is where most ROI measurement falls apart. If you deploy an agent and simultaneously hire three analysts and redesign the approval workflow, you will not be able to tell which of the three drove the improvement. We make the client commit, in the SOW, to not making other concurrent changes to the process during the 12-week engagement. If something else has to change, we pause the measurement and reset the baseline.
What we’ve actually seen
Across our first set of founding-client deployments, the numbers have landed in a narrower, more modest range than most vendor case studies would have you believe:
- Cycle time: 55–80% reduction on the target process.
- Capacity freed: 20–35% of the team’s week, net of supervision overhead.
- Error rate: Typically lower, but not always — we sometimes surface errors that were already there and uncounted.
- Payback period: 4–6 months on the deployment fee. Recurring cost is usually covered by the freed capacity in the first quarter.
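The payback math is equally unexciting. A sketch with placeholder figures, none of which are actual Forge pricing:

```python
# Placeholder figures, not actual pricing.
deployment_fee = 120_000         # one-time fee
monthly_capacity_value = 30_000  # value of freed capacity, net of supervision overhead
monthly_recurring_cost = 5_000   # agent runtime plus maintenance

payback_months = deployment_fee / (monthly_capacity_value - monthly_recurring_cost)
print(f"Payback period: {payback_months:.1f} months")  # 4.8 with these inputs
```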
Nothing dramatic. No $40 million case studies. Just honest, measurable, attributable improvements that pay for the engagement and leave the client with a new capability they didn’t have before.
If you want the long version — including our actual baseline template and a sample week-12 report — we ship them as part of the sample SOW pack.