Advisory one-pager · Model-quality reliability · Prepared for Descript

Who gets paged when the transcription drifts?

I'd love to start with a quick call to hear what you actually need. You're probably swamped, so I did some homework first: a few small prototypes, built only from Descript's public posture and public ASR benchmarks, on a problem your industry is wrestling with right now.

They're rough sketches on public and synthetic numbers, not your data and not a finished product, just a faster way to show what working together could look like than a blank-page call. One screen is built and runnable: a metric-SEV console that watches a model-quality metric drift and shows, verdict-first, what would page someone versus what waits for the retro. Everything in it is labeled synthetic and public-benchmark-shaped.

The thing I keep noticing

Descript's product is its model quality. When the transcript mangles a brand name, when the AI editor flags the wrong "best clip," when a voice clone drifts, the damage isn't a dashboard cell: it's a creator who published something wrong, or churned, or posted "Descript got worse after the update." Laura Burkhauser has drawn the line out loud, that Descript is "not a slop machine" and that "you, and only you are qualified to write the eval criteria for what a good job looks like." That's a quality bar stated as a value. The genuine question, not an accusation: a value isn't an operating discipline. There's an eval-criteria owner; is there an on-call that fires when a surface drifts below the bar? If it's caught when a creator complains, it's caught at the next retro, not in 24 hours.

What's in the homework

  • The metric-SEV console (built, runnable) picks a model-quality metric, watches it drift across a synthetic quarter, and shows verdict-first what pages versus what waits. Three preset scenarios: the slice the average hides, the model swap that lifts the mean and regresses a sub-segment, the voice-clone safety SEV-1.
  • The bound-and-runbook worksheet is the artifact a real engagement leaves behind: one negotiated bound, a severity ladder, and the runbook for the one metric whose drift would actually hurt.
  • The model-swap gate turns "does the new model clear the bar?" into a check before rollout, instead of a hope.

The honest bound

I can't tell from outside whether you already page on a drift; that's the first thing I'd ask. The console runs on public-benchmark-shaped slice gaps and seeded synthetic drift, never your metrics, and wiring it to Descript's real numbers is the follow-on, not a claim here. The discipline is vendor-agnostic: it sits on top of whatever you monitor with (Arize, Fiddler, a homegrown eval), it doesn't replace it.

"My team could build this"

Probably true, and I'd say so on the call. A team that ships ML can wire a dashboard. What's harder to build fast is the negotiated bound and the on-call behind it, and, when a customer or a regulator asks "how would you catch a quality regression?", an independent record you didn't grade yourselves.

The hunches behind these sketches

1 · The average hides the slice. Overall accuracy looks fine while the accented-speaker and proper-noun slices drift worse; public reviews name exactly those weak spots. The console sketches a slice-bound that pages when the mean wouldn't. A real engagement checks it against your own per-slice WER.
2 · Every model swap is an un-paged risk. Burkhauser has talked about balancing frontier and in-house models. The hunch: a swap that lifts the average can quietly regress a sub-segment and ship anyway. The gate turns that into a check; we'd confirm it on one real recent swap.
3 · The voice-clone metric deserves a SEV-1. A synthetic-voice fidelity or safety breach is a trust incident, not a quality nit, and likely has no documented bound wired to a page today. The engagement writes that bound and the 1-hour rollback runbook.

The 4 to 6 week diagnostic

WK 1Public-data proof of the drift-risk shape (the console, reskinned to your surfaces) + pick the one metric whose drift would actually hurt, with your team
WK 2-3Negotiate and write its bound; build the severity ladder and the runbook to the bound-and-runbook artifact
WK 3-4Wire it to your existing pager (vendor-agnostic); run one fire-drill on purpose in staging
WK 5-6Hold the first blameless post-mortem + handoff doc so the discipline stays resident. All IP transfers

Who this is for

Head of ML / AI Eng, or a senior PM on an AI surface, championing up to the CEO. Triggers: a transcription or voice regression that a creator caught before you did; a model swap you shipped on the average; an enterprise or RAI review asking "who gets paged when quality drifts?"; the public quality bar needing an operating discipline behind it, not just a value.

Engagement & pricing

One-time diagnostic : one metric, its bound, the runbook, the pager wiring, one fire-drill, the first post-mortem$60k
Optional monthly advisory : the second and third metric, the model-swap gate as standing practice$8k / mo

Fixed-scope, 4 to 6 weeks, IP transfers; no platform, no subscription, no data-integration project, and the public-data phase needs no internal access at all. The retainer is optional, cancel-anytime, never a license. Final scope set after a 30-minute call. If the surface doesn't have a drift problem worth an on-call, I'll say so, and that's a cheap thing to find out at this price.

The ask

One 30-minute call, you tell me what you actually need. Bring one question: who gets paged when transcription WER drifts on the hardest slice today? I'll walk the console live and we'll map whether a bound is missing. If it's already covered in your shop, I'll say so and we'll have spent half an hour well.

The tool, live: descript-metric-sev.pages.dev · Book it: jeffpinto.com/engage · Method: the metric-SEV note

Who's behind this

Jeff Pinto runs a small, independent data and AI advisory practice (jeffpinto.com). Thirty years across AI data and privacy, health tech, marketing analytics, renewables, logistics, and broadcasting; the last seven in ML and AI. Hands-on at Meta, Uber, and IBM, plus six startups (one turnaround, three acquisitions). Two MScs: computer science (Toronto) and engineering (Loughborough). Engagements are fixed-scope, four to twelve weeks, no platform and no subscription; whatever gets built, the IP transfers to you.

The edge for this one: at Meta he ran a consumer-ML data-quality lockdown on the smart-glasses surface that wired metric bounds to a pager instead of a dashboard nobody opened, a documented ~14x detection-lift that cut drift-response from a quarter to 24 hours; and his UofT fairness research used decoupled classifiers to compress an accuracy parity gap from 35% to 1% on a small clinical corpus, the same "the average hides the worst slice" problem your accented-audio transcription has.

Sources : Descript public posture and Burkhauser quotes (The Cognitive Revolution; aakashg.com CEO interview) · public ASR benchmark structure (Open ASR Leaderboard, Common Voice, FLEURS) for the slice-gap shape · jeffpinto.com metric-SEV and decoupled-classifiers notes for the ~14x detection-lift and the 35% to 1% parity figures. Console data is synthetic and public-benchmark-shaped, labeled on screen; no Descript metrics are quoted or implied.

Built by Jeff Pinto : ran model-quality metrics at Meta scale · Meta / Uber / IBM + 6 startups · two MScs. jeffpinto.com