STUDY № 072 · REVENUE · MICROSOFT (BING)

How a One-Line Code Change Was Worth $100M — and Why Bing Ran 20,000 Experiments a Year to Find It

A Bing engineer proposed a trivial-seeming change: promote the second line of an ad to the title. The experiment triggered a revenue alarm — not because something broke, but because revenue jumped ~12%. That one idea was worth $100 million and became the biggest revenue impact in Bing's history. The lesson isn't the idea; it's the system that found it: 20,000 experiments a year, with marginal cost approaching zero, run inside a culture that expected most ideas to fail.

The Alarm That Wasn't a Bug

Somewhere inside the Bing experimentation platform, an alert fired. A revenue metric had moved so far outside normal bounds that the on-call team escalated it as a potential bug. Engineers started digging — and found nothing wrong. No bad code, no infrastructure fault, no data pipeline error. The revenue was just... higher. A lot higher.

The cause was a single change: a Bing engineer had proposed that the second line of a search ad — normally displayed below the clickable title — should be promoted up and merged into the title itself, making the title line larger and more prominent. The idea sounds almost embarrassingly small. No new product surface. No new ad format. Just a one-line rearrangement. And it increased revenue by about 12%.

At the time, Bing was smaller than it is today. Twelve percent of its ad revenue amounted to roughly $100 million. It became, as Ronny Kohavi puts it, "the biggest revenue impact to Bing in all its history." Critically, it wasn't a trick: it didn't flood the page with more ads or degrade the user experience in ways the guardrail metrics would catch. It was, in his words, "a home run that improved revenue, didn't significantly hurt the guardrail metrics."

The engineer who proposed it had no idea it would work at that scale. Neither did anyone else.

Two Thirds of Ideas Fail — and That's Optimistic

The reason the Bing story matters isn't the cleverness of the idea. It's the context. Kohavi spent years at Amazon, then Microsoft, then Airbnb building and running experimentation platforms, and the most consistent thing he observed across all three companies was how wrong smart people are about their own ideas.

At Microsoft overall, roughly two-thirds of ideas fail to move the metric they were designed to move. At Bing specifically, a search engine that had been intensively optimized for years, the failure rate climbed to around 85%. At Airbnb, during a stretch of approximately 250 search relevance experiments, 92% failed. As Kohavi puts it: "I know that every group that starts to run experiments, they always start off by thinking that somehow, they're different. And their success rate's going to be much, much higher, and they're all humbled."

Booking.com, Google Ads, and other companies that have published their numbers land in the same range: 80 to 90 percent of ideas produce no meaningful positive result.

The corollary is the part that changes how you think about velocity: if only 8–33% of ideas work, and you cannot reliably predict which ones, then the throughput of your idea-testing system is itself a growth lever. The team that runs 20 experiments a month doesn't just learn twenty times faster than the team running one — it finds the $100M insight that the one-experiment team's annual plan never even scheduled.
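The throughput argument can be made concrete with a little arithmetic. The sketch below is illustrative only: it assumes each experiment succeeds independently with a fixed probability, which real idea pipelines will not satisfy exactly, and the function names are hypothetical. The success rates (8% and 33%) and the monthly volumes (1 and 20) are the figures quoted above.

```python
# Illustrative arithmetic only. Assumes independent experiments with a
# fixed success probability; function names are hypothetical.

def expected_wins(experiments: int, success_rate: float) -> float:
    """Expected number of winning experiments."""
    return experiments * success_rate


def prob_at_least_one_win(experiments: int, success_rate: float) -> float:
    """Probability that at least one experiment wins."""
    return 1.0 - (1.0 - success_rate) ** experiments


for per_month in (1, 20):
    per_year = per_month * 12
    for rate in (0.08, 0.33):
        wins = expected_wins(per_year, rate)
        p_any = prob_at_least_one_win(per_year, rate)
        print(
            f"{per_month:>2}/month, success rate {rate:.0%}: "
            f"expected wins/year = {wins:.1f}, P(at least one win) = {p_any:.3f}"
        )
```

At an 8% success rate, a team running one experiment a month should expect about one winner a year, while a team running 20 a month should expect about 19. Under these assumptions the slower team also has a meaningful chance of finishing the year with no winner at all.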

The Platform Is the Unlock

When Kohavi joined the group at Microsoft, almost nobody was running experiments. He set out to change that by building the platform first. The insight was economic: "Once you build a platform, the incremental cost of running an experiment should approach zero. And we got to that at Microsoft, where after a while, the cost of running experiments was so low that nobody was questioning the idea that everything should be experimented with."

By the time he left Microsoft in 2019, the organization was running 20,000 to 25,000 experiments per year — roughly 100 new treatments started every working day. That number is staggering, but it only becomes possible when the marginal cost of a test is truly near zero. At that scale, the question stops being "should we experiment?" and becomes "what should we test next?"

The experimentation platform serves two purposes simultaneously: it's a safety net (if you launch something harmful, you catch it and abort fast), and it's an oracle (at the end of a two-week experiment, it tells you exactly what happened to your key metric). Both functions compound at volume. The safety net means engineers ship confidently without fear of silent disasters. The oracle means the institution builds a searchable, queryable history of every idea it ever tried.
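The two roles can be sketched in a few lines. This is a minimal illustration, not a description of Microsoft's platform: it assumes a plain two-sample z-test on per-user means, a 0.05 significance threshold, and made-up metric values. The names `relative_lift_test` and `decide` are hypothetical.

```python
# Minimal sketch of "oracle" (what happened to the key metric?) and
# "safety net" (did a guardrail degrade?). All numbers are invented and
# the decision rule is an assumption, not Microsoft's actual method.
from math import sqrt
from statistics import NormalDist


def relative_lift_test(c_mean, c_sd, c_n, t_mean, t_sd, t_n):
    """Two-sample z-test on means. Returns (relative lift, two-sided p-value)."""
    se = sqrt(c_sd**2 / c_n + t_sd**2 / t_n)
    z = (t_mean - c_mean) / se
    p_value = 2.0 * (1.0 - NormalDist().cdf(abs(z)))
    lift = (t_mean - c_mean) / c_mean
    return lift, p_value


def decide(key_result, guardrail_results, alpha=0.05):
    """key_result is (lift, p); guardrail_results maps name -> (lift, p).

    A guardrail counts as degraded when its lift is negative and significant.
    """
    for name, (lift, p_value) in guardrail_results.items():
        if lift < 0 and p_value < alpha:
            return f"abort: guardrail '{name}' degraded"
    lift, p_value = key_result
    if lift > 0 and p_value < alpha:
        return "ship candidate"
    return "do not ship"


# Hypothetical data: revenue per user (key metric), sessions per user (guardrail).
revenue = relative_lift_test(1.00, 5.0, 1_000_000, 1.12, 5.2, 1_000_000)
sessions = relative_lift_test(3.000, 2.0, 1_000_000, 2.999, 2.0, 1_000_000)
print(decide(revenue, {"sessions_per_user": sessions}))
```

A production system would also need to handle multiple comparisons, variance reduction, and sample-ratio checks, all of which this sketch leaves out.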

The Math Behind 'Test Everything'

Kohavi's core thesis is simple enough to fit in a sentence: "I don't think it's possible to experiment too much." Even small, apparently low-stakes changes — minor bug fixes, incremental copy tweaks, subtle layout shifts — can produce surprising results in either direction. The Windows indexer experiment at Microsoft looked like a clean relevance win in offline tests, then turned out to destroy laptop battery life in production. Nobody anticipated it. The experiment caught it.

The Bing social integration project — integrating Twitter and Facebook feeds into search results — consumed 100 person-years of engineering effort. Hundreds of experiments. None produced a breakthrough. It was eventually shut down. "All the experiments were just negative to flat." Without the experimentation system, that team might have kept building for years.

The 8–33% success rate isn't a failure of imagination. It's the empirical baseline for what good teams working on mature products can expect. The correct response isn't to be more confident in your ideas. It's to build the machine that lets you run more of them faster, and to stop shipping the 66–92% of things that aren't working before they accumulate as maintenance debt.

The Lesson

At Airbnb, 250 search relevance experiments — with only 8% individually succeeding — compounded into a 6% improvement in revenue. No single idea drove it. The velocity drove it. The ad-title experiment at Bing is the story everyone remembers, but Kohavi's point is precisely that you cannot plan for it. You can only build the system that would find it.
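A back-of-the-envelope calculation shows how small each Airbnb win could have been. The sketch assumes, purely for illustration, that the winning experiments were equal in size and compounded multiplicatively. The source gives only the totals (about 250 experiments, 8% succeeding, 6% cumulative improvement), so the per-win figure is a derived estimate rather than a reported number. The function name is hypothetical.

```python
# Illustrative only: assumes equal-sized wins that compound multiplicatively.

def per_win_lift(cumulative_lift: float, wins: int) -> float:
    """Average (geometric) lift per win that compounds to cumulative_lift."""
    return (1.0 + cumulative_lift) ** (1.0 / wins) - 1.0


experiments = 250       # approximate count from the text
success_rate = 0.08     # 8% of experiments succeeded
cumulative_lift = 0.06  # 6% cumulative revenue improvement

wins = round(experiments * success_rate)
lift = per_win_lift(cumulative_lift, wins)
print(f"{wins} wins, average lift per win = {lift:.2%}")
```

Under these assumptions there were about 20 winning experiments, each worth roughly 0.29% on average. No single one of them would make headlines, which is consistent with the point that velocity, not any one idea, produced the result.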

Challenge

Most product ideas — even from expert teams — fail to move their target metric, with failure rates of 66–92% across Microsoft, Bing, and Airbnb. Without an experimentation system, teams ship winning and losing ideas indiscriminately, and the rare breakthrough (like a $100M revenue gain) is invisible amid the noise.

Approach

Kohavi's team at Microsoft built an experimentation platform that drove the marginal cost of each test toward zero, enabling 20,000–25,000 experiments per year (roughly 100 new treatments per working day). Every code change ran inside an experiment; guardrail metrics caught regressions; and the searchable history of all past experiments became institutional memory.

Results

  • Revenue uplift from the Bing ad-title experiment: ~12% (approximately $100M at the time)
  • Bing ad-title result, as described by Kohavi: "the biggest revenue impact to Bing in all its history"
  • Microsoft idea failure rate (overall): ~66% (two thirds)
  • Bing idea failure rate (optimized domain): ~85%
  • Airbnb search relevance failure rate: 92% (only 8% of ~250 experiments moved the key metric)
  • Airbnb search relevance cumulative result: 6% revenue improvement across ~250 experiments
  • Microsoft experiment run rate (at Kohavi's departure in 2019): 20,000–25,000 experiments per year (~100 new treatments per working day)
  • Industry failure rate benchmark (Booking, Google Ads, others): 80–90%

Sources

The full record sits in the studio register.

Related

Part of the Revenue growth pillar. See also Netflix's Price Increase Playbook, Figma's Freemium-to-Enterprise Expansion, Zoom's 40-Minute Limit as Conversion Engine.

Cite as · Omega Point Studies № 072 · Microsoft (Bing)

experimentation · a/b-testing · experiment-velocity · revenue-optimization · search-advertising · platform-investment · failure-rates · institutional-learning · microsoft · bing