<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<title>Microsoft ExP — OEC, CUPED, and trustworthy experiments</title>
<link rel="stylesheet" href="framework.css">
<style>
  /* Page-accent — overrides framework.css fallback */
  :root{--page-accent:var(--blue);--page-accent-soft:var(--blue-soft)}
  /* three-concept layout */
  .three{display:grid;grid-template-columns:repeat(3,1fr);gap:14px;margin:14px 0}
  .concept{background:#fff;border:1px solid var(--line);border-radius:12px;padding:18px 20px;box-shadow:var(--shadow);border-top:5px solid var(--page-accent)}
  .concept .ab{font-family:Georgia,serif;font-size:14px;color:var(--page-accent);font-weight:700;letter-spacing:.05em;text-transform:uppercase;margin-bottom:4px}
  .concept h3{margin:0 0 8px;font-size:18px;font-family:Georgia,serif}
  .concept p{margin:0;font-size:13.5px;color:var(--ink-soft);line-height:1.5}
  .concept p b{color:var(--ink)}
  /* example */
  .example{background:#fff;border:1px solid var(--line);border-radius:12px;padding:20px 24px;margin:14px 0;box-shadow:var(--shadow);border-left:5px solid var(--page-accent)}
  .example h3{margin:0 0 8px;font-size:18px;font-family:Georgia,serif}
  .example .quote{font-style:italic;color:var(--ink-soft);border-left:3px solid var(--line);padding:6px 14px;margin:10px 0;font-size:14.5px}
  .stats{display:flex;gap:10px;flex-wrap:wrap;margin:14px 0}
  .stat{flex:1 1 150px;background:#f6f4ee;border:1px solid var(--line);border-radius:8px;padding:12px 14px;text-align:center}
  .stat .n{font-family:Georgia,serif;font-size:22px;color:var(--page-accent);font-weight:700;line-height:1.1}
  .stat .l{font-size:11.5px;color:var(--ink-soft);margin-top:4px;line-height:1.35}
  .caveat{font-size:12.5px;color:#8a6d2e;background:#fbf6e7;border:1px dashed #d8c98f;border-radius:6px;padding:7px 11px;margin-top:10px}
  @media(max-width:680px){.three{grid-template-columns:1fr}}
</style>
</head>
<body>
<nav class="sitenav">
<details>
<summary>📑 Jump to</summary>
<div class="navmenu">
<div class="navgrp"><h4>Start here</h4>
<a href="index.html"><b>← Home (goal &amp; map)</b></a>
<a href="impact-saas-companies.html">SaaS / B2B field study</a>
<a href="impact-consumer-companies.html">Consumer-tech field study</a>
<a href="methodologies-comparison.html"><b>All methods compared →</b></a>
<a href="experiment-trustworthiness.html">How 40k tests actually work →</a>
<a href="jargon.html">Jargon (glossary)</a>
</div>
<div class="navgrp"><h4>Scoring &amp; Input modeling</h4>
<a href="rice-framework.html">RICE (Intercom)</a>
<a href="north-star-framework.html">North Star (Amplitude / Slack)</a>
</div>
<div class="navgrp"><h4>Goal-laddering / Define first</h4>
<a href="v2mom-framework.html">V2MOM (Salesforce)</a>
<a href="pyramid-of-clarity-framework.html">Pyramid of Clarity (Asana)</a>
<a href="pr-faq-framework.html">PR-FAQ / Working Backwards (Amazon)</a>
<a href="heart-framework.html">HEART (Google)</a>
<a href="dibb-framework.html">DIBB (Spotify)</a>
</div>
<div class="navgrp"><h4>Experimentation (SaaS)</h4>
<a class="cur" href="microsoft-exp-framework.html">Microsoft ExP / CUPED</a>
<a href="linkedin-xlnt-framework.html">LinkedIn T-REX</a>
</div>
<div class="navgrp"><h4>Experimentation (Consumer)</h4>
<a href="netflix-experimentation.html">Netflix · ABlaze</a>
<a href="booking-experimentation.html">Booking.com</a>
<a href="airbnb-erf-framework.html">Airbnb ERF</a>
<a href="uber-xp-framework.html">Uber XP</a>
<a href="doordash-switchback-framework.html">DoorDash switchback</a>
<a href="lyft-experimentation.html">Lyft</a>
<a href="pinterest-ab-framework.html">Pinterest</a>
</div>
<div class="navgrp"><h4>AI labs</h4>
<a href="anthropic-pm-on-ai-exponential.html">Anthropic · PM on AI exponential</a>
<a href="google-customer-zero-2026.html">Google · "Customer zero" 2026</a>
</div>
<div class="navgrp"><h4>Written discipline</h4>
<a href="stripe-shaping-framework.html">Stripe shaping</a>
</div>
</div>
</details>
</nav>

<div class="wrap">
  <header class="masthead">
    <p class="kicker">Methods · Deep-dive · Experimentation</p>
    <h1>Microsoft ExP — OEC, CUPED, and trustworthy experiments at scale <span class="srcyr">2013</span></h1>
    <p class="sub">Pioneered by Ron Kohavi and team across Bing, Office, Windows, and Azure. Three pieces — a <b>platform</b> that runs experiments cheaply, a single agreed success metric per test (<b>OEC</b>), and a variance-reduction technique (<b><a class="j" href="jargon.html#cuped">CUPED</a></b>) that gets the same answer from far fewer users.</p>
    <p class="sub">Earliest canonical publication: the CUPED paper (Deng, Xu, Kohavi, Walker — 2013). Codified at book length in <em>Trustworthy Online Controlled Experiments</em> (Kohavi, Tang, Xu — 2020).</p>
    <div class="goal"><span>Goal</span><br>Decide features by data-backed expected impact — choose by outcome, not by to-do list or opinion.</div>
  </header>

  <div class="eli">
    <div class="lbl">🎓 8th-grade version</div>
    Everyone has opinions about whether a feature will work. Microsoft figured out that opinions are mostly wrong — when you actually test ideas on real users by showing the new thing to half of them and the old thing to the other half (an <em>A/B test</em>), only about <b>one in three</b> ideas actually makes the number you cared about go up. One in three changes nothing. One in three makes it <em>worse</em>. So they built a platform (<b>ExP</b>) that makes running these tests cheap, agreed on a single success number per test (<b>OEC</b>) so nobody can cheat by switching what counts as winning, and invented a math trick (<b>CUPED</b>) that gets the same answer from half as many users — so you find out faster. The whole point: stop arguing about whether ideas will work, just test them.
  </div>

  <nav class="toc">
    <a href="#headline">Honest headline</a>
    <a href="#anatomy">The three pieces</a>
    <a href="#mechanism">How it picks work</a>
    <a href="#example">Bing headline example</a>
    <a href="#stats">The humbling stats</a>
    <a href="#apply">Apply to a sheet</a>
    <a href="methodologies-comparison.html" style="color:var(--blue);font-weight:700">Comparison table →</a>
  </nav>

  <div class="finding" id="headline">
    <h2>The honest headline: this is the only ground-truth method on the list</h2>
    <p>Most frameworks <b>estimate</b> impact (RICE) or <b>model</b> it (North Star). Microsoft's experimentation platform <b>measures</b> it — a controlled A/B test on real users is the only number that's actually causal. The other methods narrow the candidates; this one decides the winner.</p>
    <p>The cost of that rigour: it requires traffic, infra, instrumentation, and a culture that accepts that <code>only ~1 in 3 ideas actually win</code>. The widely cited Kohavi finding (paraphrased from his talks and the Kohavi/Tang/Xu book): <em>"When evaluating well-designed and executed experiments that were designed to improve a key metric at Microsoft, only about one-third were successful at improving the key metric."</em> Trustworthy experiments are how you find out which third.</p>
  </div>

  <!-- ANATOMY -->
  <h2 class="sec" id="anatomy">The three pieces — ExP, OEC, CUPED</h2>
  <p class="secsub">It's tempting to talk about "doing A/B tests" as if it's one thing. Microsoft separates three distinct problems and gives each a name.</p>

  <div class="three">
    <div class="concept">
      <div class="ab">ExP</div>
      <h3>The platform</h3>
      <p>Internal experimentation system that lets <b>any team</b> design, randomize, deploy, and analyse experiments without re-building plumbing each time. The platform itself is the cultural enabler — without it, experiments are too expensive to run by default.</p>
    </div>
    <div class="concept">
      <div class="ab">OEC</div>
      <h3>Overall Evaluation Criterion</h3>
      <p>The <b>one agreed success metric</b> per experiment, decided <em>before</em> the test runs. Without an OEC the team will cherry-pick whichever metric moved. The OEC also embeds long-term thinking (e.g. revenue per session weighted by churn risk) so a team can't win by gaming a short-term proxy.</p>
    </div>
    <div class="concept">
      <div class="ab">CUPED</div>
      <h3>Controlled-experiment Using Pre-Experiment Data</h3>
      <p>Variance-reduction technique from Deng, Xu, Kohavi, Walker (WSDM 2013). Use each user's <em>pre-experiment</em> behaviour as a covariate to subtract baseline noise. The Bing slowdown experiment in the paper is the canonical illustration — with CUPED, the +250ms server delay's CTR effect was statistically significant from day 1, whereas the un-adjusted analysis took two weeks at a smaller exposure to reach borderline significance. Three Bing experiments cited in the paper showed variance reductions of <strong>45%, 52%, and 49%</strong> — "the same statistical power with about half the users or half the duration." (Variance reduction is metric-dependent: revenue-per-user gained &lt;5%; the DoorDash Dash-AB post reports 10–20% sample reduction.)</p>
    </div>
  </div>
  <div class="src">Source: <a class="cite" href="https://exp-platform.com/">exp-platform.com</a> (Kohavi's hub) · <a class="cite" href="https://exp-platform.com/Documents/2013-02-CUPED-ImprovingSensitivityOfControlledExperiments.pdf">CUPED paper (Deng et al., 2013)</a> · book: <em>Trustworthy Online Controlled Experiments</em> (Kohavi, Tang, Xu, 2020).</div>
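  <p>A minimal sketch of the CUPED adjustment described above, on synthetic data. It uses the form from the CUPED paper, Y − θ(X − mean X) with θ = cov(X, Y) / var(X), which removes a ρ² share of the variance when the pre-experiment covariate X correlates with the metric Y at ρ. The function name, the simulated users, and the ρ = 0.7 correlation are our illustrative assumptions, not values from the sources.</p>

```python
import random
import statistics


def cuped_adjust(y, x):
    """Return CUPED-adjusted values y_i - theta * (x_i - mean(x)).

    y: in-experiment metric per user
    x: the same user's pre-experiment metric (the covariate)
    theta = cov(x, y) / var(x) minimises the variance of the adjusted metric.
    """
    mean_x = statistics.fmean(x)
    mean_y = statistics.fmean(y)
    cov_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / (len(x) - 1)
    theta = cov_xy / statistics.variance(x)
    return [b - theta * (a - mean_x) for a, b in zip(x, y)]


# Synthetic users (hypothetical data): in-experiment behaviour correlates
# with pre-experiment behaviour at roughly rho = 0.7.
rng = random.Random(42)
rho = 0.7
pre = [rng.gauss(0.0, 1.0) for _ in range(20000)]
post = [rho * p + (1 - rho ** 2) ** 0.5 * rng.gauss(0.0, 1.0) for p in pre]

adjusted = cuped_adjust(post, pre)
raw_var = statistics.variance(post)
adj_var = statistics.variance(adjusted)
reduction = 1 - adj_var / raw_var  # expected to land near rho ** 2 = 0.49

print(f"raw variance       {raw_var:.3f}")
print(f"adjusted variance  {adj_var:.3f}")
print(f"variance reduction {reduction:.1%}")
```

  <p>In practice θ is estimated once on the pooled treatment and control users and the same adjustment is applied to both arms, so the treatment-minus-control difference keeps its expected value while its variance shrinks.</p>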

  <!-- MECHANISM -->
  <h2 class="sec" id="mechanism">How an ExP-style experiment actually picks the winner</h2>
  <p class="secsub">Five steps. The discipline that distinguishes "trustworthy" experiments from "we shipped an A/B test" is that <b>steps 1–2 happen before any code is written</b>.</p>

  <div class="step"><div class="num">1</div><div><h3>Write the hypothesis and pick the OEC</h3><p>A single sentence: <em>"If we ship X, we expect the OEC to move by Y, because Z."</em> The OEC is locked before the test starts — no peeking and re-picking.</p></div></div>
  <div class="step"><div class="num">2</div><div><h3>Define guardrail metrics</h3><p>Metrics that must <em>not</em> get worse: latency, error rate, support volume. Catches the "OEC went up but churn went up too" failure mode.</p></div></div>
  <div class="step"><div class="num">3</div><div><h3>Power-calculate the sample size</h3><p>How many users do you need to detect a given effect size? CUPED comes in here: by removing the share of metric variance that pre-experiment behaviour explains, it can deliver the same statistical power with roughly half as many users, depending on the metric.</p></div></div>
  <div class="step"><div class="num">4</div><div><h3>Run the test, then check it ran clean</h3><p>Sample-Ratio-Mismatch checks, A/A pre-tests, segment analysis. The "trust this readout before reading it" discipline is the central argument of Kohavi / Tang / Xu's <em>Trustworthy Online Controlled Experiments</em> book — the specific <em>"~70% of surprising results turn out to be bugs"</em> figure is from Kohavi's talks rather than a published paper, and is repeated here as a cultural finding rather than a hard statistic.</p></div></div>
  <div class="step"><div class="num">5</div><div><h3>Read the OEC. Ship, kill, or iterate.</h3><p>If the OEC moved and guardrails held, ship. If not, kill — even if you loved it. The book calls this "humility through data."</p></div></div>
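  <p>The power calculation in step 3 can be sketched with the standard two-sample normal approximation, n per arm = 2 (z<sub>1−α/2</sub> + z<sub>power</sub>)² σ² / δ². This is a textbook formula, not ExP's own calculator; the function name, the metric's standard deviation, the effect size, and the 50% variance reduction below are hypothetical inputs.</p>

```python
import math
from statistics import NormalDist


def users_per_arm(sigma, delta, alpha=0.05, power=0.80, variance_reduction=0.0):
    """Two-sample z-test sample size per arm for an absolute effect `delta`.

    sigma: standard deviation of the per-user metric
    variance_reduction: fraction of variance removed by CUPED (0.0 = no CUPED)
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided significance level
    z_power = NormalDist().inv_cdf(power)
    effective_var = sigma ** 2 * (1 - variance_reduction)
    return math.ceil(2 * (z_alpha + z_power) ** 2 * effective_var / delta ** 2)


# Hypothetical metric: mean 10.0, sd 20.0; the MDE is a 1% relative lift,
# i.e. an absolute difference of 0.1.
raw_n = users_per_arm(sigma=20.0, delta=0.1)
cuped_n = users_per_arm(sigma=20.0, delta=0.1, variance_reduction=0.5)

print(f"users per arm without CUPED: {raw_n:,}")
print(f"users per arm with 50% variance reduction: {cuped_n:,}")
```

  <p>Because the required sample scales linearly with variance, a 50% variance reduction halves the users per arm; a metric where CUPED removes only 5% of the variance saves only 5% of the sample.</p>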

  <!-- EXAMPLE -->
  <h2 class="sec" id="example">Worked example: Bing's $100M ad-headline test</h2>
  <div class="example">
    <h3>Tiny change, enormous OEC movement</h3>
    <p>The most cited Microsoft experiment: in 2012 a Microsoft employee working on Bing proposed changing the way ad headlines were displayed on the results page. Rated low priority, it sat in the backlog for <strong>more than six months</strong> until an engineer ran the A/B test.</p>
    <div class="quote">"The change had increased revenue by an astonishing 12% — which on an annual basis would come to more than $100 million in the United States alone." — Kohavi &amp; Thomke, HBR (Sept–Oct 2017, verbatim)</div>
    <p>The lesson is not "small changes can be big". It is that <b>without the experiment, no scoring framework would have ranked this idea above the work that kept outranking it</b>. The team's pre-test estimate was effectively zero. Only measurement found it.</p>
    <div class="src" style="margin-top:8px">Source: <a class="cite" href="https://hbr.org/2017/09/the-surprising-power-of-online-experiments">HBR — The Surprising Power of Online Experiments</a> (Kohavi &amp; Thomke).</div>
  </div>

  <!-- STATS -->
  <h2 class="sec" id="stats">The humbling stats — why estimation is so hard</h2>
  <p class="secsub">Across Microsoft, Bing, and Office, and at other well-instrumented organisations, the published numbers tell roughly the same humbling story: most ideas don't work. This is the empirical case for a measure-not-estimate culture.</p>

  <div class="stats">
    <div class="stat"><div class="n">~⅓</div><div class="l">of experiments improve the metric they were designed to improve</div></div>
    <div class="stat"><div class="n">~⅓</div><div class="l">are flat (no statistically significant change)</div></div>
    <div class="stat"><div class="n">~⅓</div><div class="l">make things <em>worse</em> — statistically significant negatives, caught only because we measured</div></div>
    <div class="stat"><div class="n">~50%</div><div class="l">variance cut by CUPED at Bing → 2× faster decisions</div></div>
    <div class="stat"><div class="n">~$100M</div><div class="l">/yr from one Bing ad-headline test alone</div></div>
  </div>
  <div class="caveat">Caveat: the ⅓ / ⅓ / ⅓ split is the Kohavi-canonical figure (talks, HBR, and the Kohavi/Tang/Xu 2020 book) — a Microsoft-internal aggregate, not a controlled study. <b>Bing's success rate is reportedly lower than the Microsoft-wide ⅓.</b> Other companies (Booking, Netflix) independently cite similar ratios. The $100M Bing figure is HBR-published (reputable secondary).</div>

  <div class="note"><b>Why this matters for prioritisation.</b> If only ~1/3 of ideas win, and scoring does not change those odds much, then ranking 50 ideas with RICE and shipping the top 10 means you ship roughly 3 winners and 7 features that are flat or harmful, and you may not know which were which. ExP-style discipline says: <b>ship behind a test, declare the OEC up front, kill what doesn't move it.</b> That's the entire argument for trustworthy experimentation as the layer that finishes what scoring frameworks start.</div>

  <!-- APPLY TO A SHEET -->
  <h2 class="sec" id="apply">Apply to a feature sheet</h2>
  <p class="secsub">ExP-style discipline doesn't score features pre-build — it adjudicates them <strong>post-build via experiment</strong>. If you adopt it, your feature sheet evolves from "ideas + scores" into a <em>live experiment ledger</em>: every shipped feature carries the test that earned it the right to stay.</p>

  <div class="note" style="background:var(--teal-soft);border-left-color:var(--teal)"><b>Try it Monday morning (30 minutes).</b> Pick the most recently shipped feature on your team. Write its post-hoc experiment design: what was the <em>OEC</em> (the one number that defined winning)? What <em>guardrails</em> should not have regressed? What <em>MDE</em> would have justified the build? If you can't answer those for a feature you already shipped, you've found the discipline gap: your team is shipping ideas faster than it is learning from them. Next sprint: pick one upcoming feature and write those three lines <em>before</em> any code goes in.</div>

  <div class="note" style="background:var(--blue-soft);border-left-color:var(--blue);font-size:13.5px"><b>Quick glossary for the columns below.</b> <b>OEC</b> = <a class="j" href="jargon.html#oec">Overall Evaluation Criterion</a> (the one agreed success metric). <b>MDE</b> = <a class="j" href="jargon.html#mde">Minimum Detectable Effect</a> (the smallest change the test is powered to spot). <b>SRM</b> = <a class="j" href="jargon.html#srm">Sample-Ratio Mismatch</a> (when a designed 50/50 split comes out further from 50/50 than chance allows, e.g. 51/49 on a large sample; a sign the experiment is broken). <b><a class="j" href="jargon.html#aa-test">A/A test</a></b> = both arms get the same variant; if you see "winning" results, the platform is lying to you. <b>CI</b> = Confidence Interval. <b>CTA</b> = Call-To-Action (the button you want clicked). <b><a class="j" href="jargon.html#a11y">a11y</a></b> = accessibility.</div>

  <div class="extable">
    <table class="ex">
      <thead><tr><th>Column to add</th><th>What it captures</th><th>How you fill it</th></tr></thead>
      <tbody>
        <tr><td>Feature</td><td>What's being tested</td><td>Backlog title</td></tr>
        <tr><td>Hypothesis</td><td>"If we ship X, OEC moves by Y, because Z"</td><td>One sentence — locked before code is written</td></tr>
        <tr><td>OEC</td><td>The single agreed success metric</td><td>Picked from the org's OEC catalogue — no per-test improvisation</td></tr>
        <tr><td>Guardrails</td><td>Metrics that must <em>not</em> regress (latency, errors, support volume, churn)</td><td>Standard set + feature-specific additions</td></tr>
        <tr><td>MDE</td><td>Minimum detectable effect — smallest move worth shipping for</td><td>Set by business value, not by what's easy to detect</td></tr>
        <tr><td>Sample size N (CUPED)</td><td>Users per arm needed for that MDE at 80% power</td><td>Power calc; CUPED can cut up to ~50% off the raw N, depending on the metric</td></tr>
        <tr><td>Ramp plan</td><td>Exposure schedule (1% → 5% → 25% → 50% → 100%)</td><td>Standard ramp; pause at any step if guardrail breaches</td></tr>
        <tr><td>Stop rule</td><td>When to call it (time, sample, or guardrail breach)</td><td>Defined upfront — no "let's run it another week" mid-test</td></tr>
        <tr><td>Trust checks</td><td>SRM, A/A pre-test, segment sanity</td><td>Platform-automated; flagged tests get blocked from readout</td></tr>
        <tr><td>Result</td><td>OEC delta + CI + guardrail status</td><td>Auto-populated from platform readout</td></tr>
        <tr><td>Decision</td><td>Ship / Iterate / Kill</td><td>Driven by the rule below, not by attachment</td></tr>
      </tbody>
    </table>
  </div>
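  <p>The SRM trust check in the table above can be sketched as a chi-square goodness-of-fit test on the arm counts. This is one common way to detect a mismatch, not necessarily what ExP runs; the function name, the user counts, and the 0.001 alert threshold are assumptions for illustration.</p>

```python
import math


def srm_p_value(n_control, n_treatment, expected_treatment_share=0.5):
    """Chi-square goodness-of-fit p-value (1 degree of freedom) for a two-arm split."""
    total = n_control + n_treatment
    exp_treatment = total * expected_treatment_share
    exp_control = total - exp_treatment
    chi2 = ((n_control - exp_control) ** 2 / exp_control
            + (n_treatment - exp_treatment) ** 2 / exp_treatment)
    # With 1 degree of freedom the chi-square tail equals erfc(sqrt(chi2 / 2)).
    return math.erfc(math.sqrt(chi2 / 2))


SRM_ALPHA = 0.001  # assumed alert threshold; choose one to suit your platform

healthy = srm_p_value(50_210, 49_790)  # 50.2 / 49.8 split on 100k users
broken = srm_p_value(51_000, 49_000)   # 51 / 49 split on 100k users

print(f"50.2/49.8 split: p = {healthy:.3f}, flagged = {SRM_ALPHA > healthy}")
print(f"51/49 split:     p = {broken:.2e}, flagged = {SRM_ALPHA > broken}")
```

  <p>The same 51/49 ratio on a few hundred users would not be flagged. What matters is how unlikely the deviation is at the observed sample size, which is why the check needs the raw counts rather than the percentages.</p>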

  <h3 style="font-family:Georgia,serif;font-size:18px;margin:24px 0 8px">Worked example — an experiment ledger snapshot</h3>
  <p style="font-size:13.5px;color:var(--ink-soft);margin:0 0 12px">Eight experiments in flight or just read out, modelled on Microsoft / Bing-style templates. The numbers are illustrative, chosen to show every verdict the discipline produces, including the unglamorous ones (under-powered, invalid).</p>

  <div class="extable">
    <table class="ex">
      <thead><tr><th>Feature</th><th>OEC</th><th>Guardrails</th><th>MDE</th><th>N / arm</th><th>Ramp</th><th>Result</th><th>Decision</th></tr></thead>
      <tbody>
        <tr class="top"><td>Ad-headline display tweak (Bing-style)</td><td>Revenue / session</td><td>Latency, abandon</td><td>0.5%</td><td>200k</td><td>50/50 · 2 wk</td><td>OEC +1.2% sig · guards held</td><td class="score">Ship</td></tr>
        <tr class="top"><td>Recommendation algo v2</td><td>Sessions / user / wk</td><td>CTR, latency</td><td>0.3%</td><td>150k</td><td>25/25/50 · 2 wk</td><td>OEC +0.4% sig · guards held</td><td class="score">Ship</td></tr>
        <tr class="top"><td>Bigger CTA buttons on landing</td><td>Conversion to signup</td><td>Bounce rate, a11y</td><td>0.5%</td><td>120k</td><td>50/50 · 1 wk</td><td>OEC +0.6% sig · bounce -0.3%</td><td class="score">Ship</td></tr>
        <tr><td>New onboarding tutorial</td><td><a class="j" href="jargon.html#dn-activation">D7</a> retention</td><td>Completion, errors</td><td>1.0%</td><td>80k</td><td>1→5→25→50%</td><td>OEC flat · guards OK</td><td class="score" style="color:var(--gold)">Iterate — try v2</td></tr>
        <tr><td>AI suggestion panel</td><td>Tasks completed / session</td><td>Page load, opt-out rate</td><td>1.5%</td><td>60k</td><td>1→5→25%</td><td>OEC +0.2% NOT sig (under-powered)</td><td class="score" style="color:var(--gold)">Iterate — extend run</td></tr>
        <tr><td>Aggressive paywall variant</td><td>Trial → paid</td><td>Post-paid churn, support tickets</td><td>3.0%</td><td>20k</td><td>5% · halted</td><td>OEC +8% BUT tickets +40%</td><td class="score" style="color:var(--accent)">Kill — guardrail breach</td></tr>
        <tr><td>Auto-translate UI for new locales</td><td>Locale activation rate</td><td>Tickets, error rate</td><td>2.0%</td><td>40k</td><td>25/25/50</td><td>OEC -2.1% sig</td><td class="score" style="color:var(--accent)">Kill</td></tr>
        <tr><td>Inline help tooltips</td><td>Help-doc opens ↓ (proxy)</td><td>Tickets, time-to-first-action</td><td>1.0%</td><td>50k</td><td>50/50</td><td>OEC NOT sig · SRM flagged</td><td class="score" style="color:var(--ink-soft)">Invalid — re-randomize</td></tr>
      </tbody>
    </table>
  </div>

  <div class="note" style="background:var(--accent-soft);border-left-color:var(--accent)"><b>The most important reading skill on this page.</b> Look at the "Aggressive paywall variant" row. OEC <em>moved</em> (+8%, a huge win on most teams' default metric). Guardrails <em>breached</em> (support tickets +40%). Decision: <b>Kill</b>. This is the row that explains why ExP exists. A team without guardrails ships the +8% feature and discovers the support cost months later, by which point the feature is entrenched. A team with guardrails sees both numbers in the same readout and kills it the same day. The OEC tells you what you wanted; the guardrails tell you what you forgot you needed.</div>

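  <div class="note"><b>Decision rule.</b> <b>Ship</b> only if (1) the OEC moves in the intended direction with statistical significance, (2) <em>every</em> guardrail holds, and (3) no SRM or trust flag was raised. <b>Iterate</b> if the OEC is flat or the test was under-powered but guardrails are clean: change the variant or the design, then re-run. <b>Kill</b> immediately when a guardrail breaches, even if the OEC looks great (the paywall row), because a feature that lifts conversion but burns support load is a net loss; also kill when the OEC moves significantly the wrong way (the auto-translate row). Treat any readout with an SRM or trust flag as <b>Invalid</b> and re-randomize before reading it. The point of CUPED is that you can read these verdicts sooner, in the best cases in days rather than weeks, and speed of decision is what makes the ⅓-win rate survivable.</div>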
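  <p>The decision rule can be sketched as a small function over a readout. The <code>Readout</code> fields and the <code>decide</code> name are hypothetical, and the sketch assumes an OEC where higher is better. The "Invalid" and "Kill on a significant negative" branches follow the verdicts in the ledger table above; the sample rows copy five of its illustrative readouts.</p>

```python
from dataclasses import dataclass


@dataclass
class Readout:
    feature: str
    oec_delta_pct: float      # observed OEC change in percent (higher is better)
    oec_significant: bool     # True when the confidence interval excludes zero
    guardrail_breached: bool
    trust_flag: bool          # SRM or any other validity flag


def decide(r: Readout) -> str:
    if r.trust_flag:
        return "Invalid"      # a readout that failed trust checks is not read at all
    if r.guardrail_breached:
        return "Kill"         # a breach overrides any OEC gain
    if r.oec_significant and r.oec_delta_pct > 0:
        return "Ship"
    if r.oec_significant:
        return "Kill"         # significant move in the wrong direction
    return "Iterate"          # flat or inconclusive, guardrails clean


ledger = [
    Readout("Headline tweak", 1.2, True, False, False),
    Readout("Onboarding tutorial", 0.0, False, False, False),
    # Significance is not stated for the paywall row; it is irrelevant once
    # a guardrail breaches, so True here is only a placeholder.
    Readout("Aggressive paywall", 8.0, True, True, False),
    Readout("Auto-translate UI", -2.1, True, False, False),
    # The table gives no delta for the tooltip row; 0.0 is a placeholder.
    Readout("Inline help tooltips", 0.0, False, False, True),
]

for row in ledger:
    print(f"{row.feature:22s} {decide(row)}")
```

  <p>The order of the checks is the design choice: trust flags come first and guardrails second, so an attractive OEC number is never consulted before the readout has earned the right to be read.</p>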

  <footer>
    Companion to <a href="impact-saas-companies.html#exp">← SaaS case studies · Experimentation</a> · <a href="methodologies-comparison.html">All methods compared</a><br>
    <b>Grounded in</b> <a href="https://exp-platform.com/">exp-platform.com</a> (Kohavi's hub), the CUPED paper (Deng, Xu, Kohavi, Walker — WSDM 2013), the Kohavi/Tang/Xu book <em>Trustworthy Online Controlled Experiments</em> (Cambridge University Press, 2020), and Kohavi &amp; Thomke's HBR article (Sept–Oct 2017). <b>Verbatim from CUPED 2013:</b> variance reductions of 45% / 52% / 49% on three Bing experiments; the Bing slowdown experiment narrative; "Pre-experiment period of 1–2 weeks works well." <b>Verbatim from HBR 2017:</b> the Bing-headline quote ("The change had increased revenue by an astonishing 12%..."). <b>Paraphrased from Kohavi's talks/book:</b> the ⅓-win-rate finding; the "~70% of surprising results turn out to be bugs" trust-flag figure (talks, not a published paper). <b>Added by us, not in the sources:</b> the 11-column experiment ledger, the 8-row worked-example readout table, the "Try it Monday" exercise, and the in-page glossary.<br>
    <em>Note: a 2026-05-26 source-verification pass against the CUPED 2013 paper confirmed the variance-reduction claims, added the Bing slowdown example, and softened the "~70% trust-flag" line (it's a talks-derived figure, not a hard statistic). The "Office, Windows, Azure" attribution for the ExP platform is general industry knowledge — the CUPED paper itself focuses on Bing.</em>
  </footer>
</div>
</body>
</html>