transcription is not what you think it is

theNumen · essay

Transcription sounds like the most familiar task in this whole field. Everyone has typed out what they heard at some point. So when the work shows up on Appen's CrowdGen platform — projects like ADAP transcription and segmentation — people arrive confident and start cleaning things up the way they would for a human reader. That instinct is exactly what gets their work rejected.

The shift you have to make is this: you're not writing a clean script for a person. You're producing training data for a speech model. Precision matters more than readability. The machine needs to learn from what was actually said, not from your tidied-up version of it.

the main speaker is everything

Every ADAP task defines a main speaker — the primary voice you're there to transcribe. Their speech gets written out normally, following the spelling rules. To help you, Appen provides main speaker samples so you can learn the voice.

Here's the detail that confuses new contributors: those samples sometimes contain more than one voice. That's deliberate. Appen uses real audio so you can practice the actual skill — distinguishing the main speaker from background and secondary speakers, rather than assuming the clip is conveniently clean.

everyone else gets a tag

When anyone other than the main speaker talks, you do not transcribe their words. You insert the tag <nonprimaryspeakertalking/> and move on. This holds for short interruptions, for background reactions like "yeah" or "mhm," for comments from another person in the room. You never guess at what the secondary speaker said, and you never rewrite it. Their words simply aren't yours to capture.
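As a rough illustration of the rule above, here is a minimal Python sketch. It assumes a hypothetical input where each utterance is already labeled with a speaker ID, which real tasks do not give you: you do that labeling by ear. The function name, the data shape, and the choice to collapse back-to-back non-primary turns into one tag are assumptions for illustration, not part of Appen's guidelines.

```python
NON_PRIMARY_TAG = "<nonprimaryspeakertalking/>"


def render_speech_segment(utterances, main_speaker):
    """Build one transcript line from (speaker_id, text) pairs in spoken order.

    Main-speaker text is kept verbatim; anyone else is replaced by the tag.
    """
    parts = []
    for speaker, text in utterances:
        if speaker == main_speaker:
            # Keep the main speaker's words exactly as given, fillers included.
            parts.append(text)
        elif not parts or parts[-1] != NON_PRIMARY_TAG:
            # Assumption: consecutive non-primary turns collapse into one tag.
            parts.append(NON_PRIMARY_TAG)
    return " ".join(parts)


example = [
    ("A", "so I went to the store"),
    ("B", "yeah"),
    ("B", "mhm"),
    ("A", "and uh it was closed"),
]
print(render_speech_segment(example, main_speaker="A"))
# so I went to the store <nonprimaryspeakertalking/> and uh it was closed
```

Note that the secondary speaker's "yeah" and "mhm" never reach the output. Only the fact that someone else spoke is recorded.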

speech versus noise

The audio is divided into segments, and each one is either speech or noise. A speech segment contains meaningful spoken language — main speaker transcribed as text, everyone else marked with the tag. A noise segment contains no meaningful speech at all: music, laughter, coughing, throat-clearing, the ambient sound of a room. In a noise segment there's no main speaker, you transcribe no words, and you apply a single noise tag. Mixing up the two rule sets is one of the most common ways to fail.
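The two rule sets can be pictured as a small self-check, sketched below in Python. The essay does not name the noise tag, so `<noise/>` here is a hypothetical placeholder, and `check_segment` is an illustrative helper rather than anything CrowdGen provides. Always take the real tag from your project's guidelines.

```python
NOISE_TAG = "<noise/>"  # hypothetical placeholder; the real tag comes from the project guidelines


def check_segment(kind, text):
    """Return a list of rule violations for one segment (empty list means OK)."""
    stripped = text.strip()
    problems = []
    if kind == "noise":
        # A noise segment carries no words: just the single noise tag.
        if stripped != NOISE_TAG:
            problems.append("noise segment must contain only the single noise tag")
    elif kind == "speech":
        # A speech segment must carry transcription (text and/or speaker tags).
        if not stripped:
            problems.append("speech segment is empty")
        elif stripped == NOISE_TAG:
            problems.append("speech segment contains only a noise tag")
    else:
        raise ValueError(f"unknown segment kind: {kind!r}")
    return problems


print(check_segment("noise", "<noise/>"))         # []
print(check_segment("noise", "someone laughing"))  # one violation
```

The point of the sketch is the branching: decide the segment type first, then apply that type's rules and only those.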

flag the cuts, don't fix them

Appen also asks you to notice when the audio has been segmented badly. You flag incorrect segmentation when a word is cut off at the beginning or end, when a sentence is interrupted unnaturally, or when the timing clearly doesn't match how the speech actually flows. These flags aren't busywork; they're how the dataset gets better.
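One way to keep the flagging conditions straight is to treat them as a checklist, sketched here in Python. The class and field names are hypothetical and exist only for illustration. In the actual task you make these judgments by ear and mark the flag in the tool.

```python
from dataclasses import dataclass, fields


@dataclass
class SegmentObservation:
    """What you noticed while listening to one segment (hypothetical checklist)."""
    word_cut_at_start: bool = False
    word_cut_at_end: bool = False
    sentence_interrupted: bool = False
    timing_mismatch: bool = False


def segmentation_flag_reasons(obs):
    """Return the names of every condition that justifies a segmentation flag."""
    return [f.name for f in fields(obs) if getattr(obs, f.name)]


obs = SegmentObservation(word_cut_at_end=True)
reasons = segmentation_flag_reasons(obs)
print(reasons)        # ['word_cut_at_end']
print(bool(reasons))  # True: flag this segment
```

Any single reason is enough to flag. The flag reports the problem and does not license you to repair it.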

And when a word is partly missing because of a cut, the discipline is to resist your own helpfulness. Don't reconstruct it. Don't guess the intended word. Transcribe only what's audible. The guidelines actively discourage interpretation, because your reasonable guess is noise to a model trying to learn from ground truth.

fillers, languages, and the urge to clean

If the main speaker says "uh," "um," or "ehm," you transcribe each one exactly as the project rules specify, because those hesitations are data too. If a non-primary speaker produces them, they fall under the tag like everything else. You never translate; you always transcribe in the original spoken language, using foreign-language flags only when the project calls for them. And you never "fix" grammar. Appen is evaluating rule compliance, not your writing style.

When you're unsure, ask the only question that matters: what would help the model learn from this audio?

the contributors who last

The people who do well on CrowdGen aren't the fastest typists or the most elegant writers. They're the ones who follow the rules exactly, avoid assumptions, and stay consistent across long sessions. The work rewards a particular kind of restraint — doing less, but doing it precisely. Get comfortable with that, and the rejections stop.
