[Figure: ROI chart ascending from baseline to target across 12 weeks]
Playbook · Mar 17, 2026 · 11 min read

How we measure AI ROI in 90 days (without lying to ourselves)

Most AI ROI numbers are theatre. Here’s the baseline methodology we use on every Forge deployment — including the four ways we catch ourselves overclaiming, and how we write it all into the SOW before we start.

If you’ve read any “AI saved our company $40 million” case study in the last eighteen months, you’ve probably noticed the number is always large, always round, and always impossible to verify. There’s a reason for that. The industry has a strong incentive to inflate ROI claims and almost no incentive to audit them.

We run into this on every Forge deployment. A client asks us what ROI to expect. We give them a number. They ask us to put it in writing. We put it in writing, and now we have to actually measure it — and if we’re off by more than a handful of percentage points in the wrong direction, we’re refunding the deployment fee.

That constraint — being on the hook for your own ROI claims — forces honesty. Here’s the methodology we’ve landed on.

The four ROI lies we try to avoid

Before the methodology, the traps. Every AI ROI claim you’ve ever read is probably committing at least one of these.

Lie #1: Counting hours saved as dollars saved.

A finance associate spending 6 hours a day on invoice matching is not “$360,000 of waste” just because you multiplied their hourly rate by 250 working days. If the agent saves those 6 hours, you don’t get to fire 1.5 FTEs (and you wouldn’t want to). You get 6 hours of someone’s day back. That’s real — but it’s only valuable if you can redirect those hours to something that actually generates revenue or reduces a different cost.

What we do instead: we track hours saved as capacity, not as dollars, and we ask the finance leader to tell us what projects were previously blocked by capacity. When those projects ship, we count the project outcomes as the ROI. Not the hours.
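The arithmetic behind the trap can be made concrete. Here is a minimal sketch using the article's numbers (the $240/hour fully loaded rate is inferred from the $360,000 figure; both functions are illustrative, not our internal tooling):

```python
HOURLY_RATE = 240.0   # implied by $360,000 / (6 h/day * 250 days); hypothetical
WORKING_DAYS = 250

def naive_dollar_savings(hours_per_day: float) -> float:
    """The lie: hours saved x rate x working days, booked as cash."""
    return hours_per_day * HOURLY_RATE * WORKING_DAYS

def freed_capacity_hours(hours_per_day: float) -> float:
    """What we track instead: freed capacity, in hours per year."""
    return hours_per_day * WORKING_DAYS

# 6 hours/day of invoice matching:
# naive_dollar_savings(6)  -> 360000.0, but nobody's payroll shrinks by that
# freed_capacity_hours(6)  -> 1500.0 hours to redirect to blocked projects
```

The capacity number only becomes ROI once those hours ship a previously blocked project, which is why we count the project outcomes, not the hours.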

Lie #2: Counting cost avoidance as cash.

“We avoided hiring 3 more people next year” is a real benefit, but only if you can prove you were actually going to hire them. If hiring was already paused or wasn’t in the approved headcount plan, cost avoidance is imaginary.

What we do instead: we only count cost avoidance against a signed, dated, pre-existing headcount plan. If it’s not in last quarter’s board deck, we don’t count it.
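As a sketch of how that rule might look if you wrote it down as code (the function and field names are hypothetical, not Forge's actual tooling):

```python
from datetime import date

def countable_cost_avoidance(planned_hires, deployment_start):
    """Sum the annual cost of avoided hires, counting only those backed by
    a headcount plan signed before the deployment began.

    planned_hires: list of (plan_signed: date, annual_cost: float) pairs.
    """
    return sum(
        cost
        for plan_signed, cost in planned_hires
        if plan_signed < deployment_start  # must predate the engagement
    )

hires = [
    (date(2025, 11, 4), 120_000.0),  # in last quarter's board deck: counts
    (date(2026, 2, 1), 95_000.0),    # "planned" after kickoff: ignored
]
print(countable_cost_avoidance(hires, date(2026, 1, 5)))  # 120000.0
```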

Lie #3: Confusing accuracy with value.

An agent that reconciles invoices with 98% accuracy sounds great. But if it’s only being tested on invoices that were already easy to reconcile manually, the accuracy number tells you nothing about the actual value.

What we do instead: we measure accuracy separately for easy and hard cases, and we define “hard” before we turn the agent on. Hard cases are the ones that used to end up on an analyst’s desk. If the agent can only handle the easy ones, we report that honestly.
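A stratified accuracy report is simple to compute once the labels exist. A sketch (the easy/hard assignment must happen before the agent goes live, as above; the names are illustrative):

```python
def stratified_accuracy(results):
    """Accuracy reported separately per stratum.

    results: list of (stratum: str, correct: bool) pairs, where the
    stratum ("easy" or "hard") was fixed before the agent was turned on.
    """
    counts = {}  # stratum -> (correct, total)
    for stratum, correct in results:
        c, t = counts.get(stratum, (0, 0))
        counts[stratum] = (c + int(correct), t + 1)
    return {s: c / t for s, (c, t) in counts.items()}

# 98% on easy invoices can coexist with 50% on the hard ones --
# the headline number alone hides exactly the cases that matter.
```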

Lie #4: Not measuring the humans.

The most common AI ROI mistake is forgetting that the humans are part of the system. If the agent takes 4 hours off the team’s day, but the team now spends 3 hours a day babysitting the agent, you’ve saved 1 hour, not 4. We’ve seen plenty of deployments where the “time saved” number was gross and the net was negative.

What we do instead: we time-track supervision and escalation overhead for the first 8 weeks, and we subtract it from the gross savings. The final ROI number is the net.
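The subtraction is trivial, but writing it down keeps us honest. A sketch (a hypothetical helper mirroring the 4 − 3 = 1 example above):

```python
def net_hours_saved(gross_hours, supervision_hours, escalation_hours=0.0):
    """Gross time saved minus the human overhead of running the agent.
    This net figure, not the gross, is what goes in the week-12 report."""
    return gross_hours - supervision_hours - escalation_hours

print(net_hours_saved(4.0, 3.0))       # 1.0 hour/day, not 4
print(net_hours_saved(4.0, 3.0, 1.5))  # -0.5: a negative net, reported as such
```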

The methodology: baseline, target, measurement

Every Forge deployment starts the same way. Three steps, all before a line of agent code is written.

Step 1: Baseline — week 1

We spend the first week instrumenting the current process: how many hours each person spends on it, which cases are easy versus hard (the hard ones being those that end up on an analyst's desk), the current error and escalation rates, and any signed headcount plans the process depends on.

This is the baseline document. The client signs it at the end of week 1. Every ROI claim we make at the end of the engagement is measured against this signed baseline, with no retroactive adjustments.

Step 2: Target — also week 1

With the baseline in hand, we write the target into the SOW. A good target is measurable by the same method as the signed baseline, attributable to the agent alone, dated to the week-12 re-measurement, and specific enough that the refund clause can reference it unambiguously.

If any of those four is missing, the target doesn’t go in the SOW. This is the number the refund clause references.

Step 3: Measurement — week 12

At week 12, we re-run the baseline measurement using the same method, on the same systems, at the same granularity. We publish a written report. The client reviews it. If we hit the target, deployment is complete and we move into the optimization phase. If we miss, the refund schedule kicks in.
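The week-12 comparison reduces to a single check against the signed numbers. A sketch (the dataclass and field names are ours for illustration; it assumes a higher-is-better metric):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Engagement:
    metric: str        # e.g. "net hours saved per day"
    baseline: float    # signed at the end of week 1; never adjusted
    target: float      # written into the SOW alongside the baseline
    week12: float      # re-measured with the same method, on the same systems

def target_hit(e: Engagement) -> bool:
    # Assumes higher is better; flip the comparison for cost-like metrics.
    return e.week12 >= e.target

e = Engagement("net hours saved per day", baseline=0.0, target=3.0, week12=3.4)
print(target_hit(e))  # True: deployment complete, move to optimization
```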

That’s it. No mystery, no hand-waving.

What makes this work

Two things. First, the baseline is signed, so we can’t quietly move the goalposts. Second, the target is attributable, so we can’t credit the agent for improvements that came from other things (a new hire, a process change, an integration upgrade).

The attribution piece is where most ROI measurement falls apart. If you deploy an agent, hire three analysts, and redesign the approval workflow all at once, you will not be able to tell which of the three drove the improvement. So we make the client commit, in the SOW, to making no other changes to the process during the 12-week engagement. If something else has to change, we pause the measurement and reset the baseline.

What we’ve actually seen

Across our first set of founding-client deployments, the realistic numbers have landed in a tighter, more modest range than most vendors would have you believe.

Nothing dramatic. No $40 million case studies. Just honest, measurable, attributable improvements that pay for the engagement and leave the client with a new capability they didn’t have before.

If you want the long version — including our actual baseline template and a sample week-12 report — we ship them as part of the sample SOW pack.

Want the sample SOW pack?

Includes our baseline template, target methodology, refund schedule, and a redacted week-12 measurement report. Delivered within the hour.

Request the SOW pack →