reading the voice: a guide to audio annotation

Cardamom HybX annotation tasks are where a lot of careful people still get tripped up, because the work asks you to make distinctions that feel almost identical until you've been burned by QA a few times. This isn't a substitute for the official guidelines. It's the version of events you wish someone had told you before your first batch came back with rejections.

Each task asks two complementary things of you. First, compare the reference audio against the annotation audio across voice characteristics and speaking style. Second, add inline tags marking specific moments of speech delivery in the annotation audio only. Keep those two jobs separate in your head and most of the difficulty dissolves.

validity first, and lean toward valid

Before anything else, you judge whether the audio is valid — and the important thing to know is that valid is the normal case. Audio is valid when there's a single main speaker, even if that speaker is reading, narrating, acting out characters, or changing their voice to impersonate someone. Background noises and brief interjections don't break validity. Small transcript mismatches — a word or two, a stray sentence at the start or end — don't either.

Audio is invalid only in genuinely rare situations: a fluid conversation between two or more people, or a transcript that's a total mismatch with what you hear. The rule to carry with you is simple.

When in doubt, do not invalidate. Annotate.
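
If it helps to see that rule as logic, here is a minimal sketch. It assumes you note a handful of yes/no observations while listening; the field names are invented for this example and do not come from the official guidelines. Only two of them can ever invalidate a task, and everything else is listed just to show that it is ignored.

```python
from dataclasses import dataclass


@dataclass
class Observation:
    # Hypothetical field names, chosen for this sketch only.
    fluid_conversation: bool = False         # two or more people talking back and forth
    transcript_total_mismatch: bool = False  # transcript has nothing to do with the audio
    background_noise: bool = False           # does not affect validity
    brief_interjections: bool = False        # does not affect validity
    minor_transcript_mismatch: bool = False  # a word or two, a stray sentence at the edges


def is_valid(obs: Observation) -> bool:
    """Valid unless one of the two rare invalidating conditions is clearly present."""
    return not (obs.fluid_conversation or obs.transcript_total_mismatch)
```

Because every field defaults to False, an observation you were unsure about and left unset keeps the task valid, which mirrors the "when in doubt, annotate" default.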

voice characteristics: how the voice sounds

This field is only about the physical sound of the voice compared to the reference. The question to keep asking is whether the voice itself sounds different: its timbre (metallic, rough, soft, nasal, warm), its pitch, its stability, its articulation, its resonance. If the sound changes even slightly (a metallic quality giving way to a more natural one, a compressed sound opening up, roughness smoothing out), you mark that there's variation. If the voice is the same and only the interpretation changes, you mark no difference in this field.

What does not belong here: acting style, emotion, rhythm, expressiveness. Those live in the next field, and putting them here is a classic rejection.

speaking style: how the message is delivered

This field is about delivery, not sound. How is the speaker performing — theatrical or natural, formal or casual, confident or insecure, controlled or expressive, narrative or dramatic or conversational? Three rules keep you safe: always write comparatively, always use complete sentences, and never use the word "tone." That last one catches more people than anything else. Aim for twenty to fifty words.

A description that passes reads like this: "The delivery feels more theatrical and expressive, with a less spontaneous and more constructed performance that increases dramatic emphasis." A description that gets rejected reads like this: "The tone is more dramatic." Same instinct, very different outcome.
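
A rough self-check for speaking-style descriptions can be sketched as follows. This is not an official QA tool: the function name, the list of comparative markers, and the "ends with sentence punctuation" test are all assumptions made for illustration, and a heuristic like this can only catch the mechanical mistakes. Length is reported as a warning rather than a failure, since the word range is something to aim for.

```python
import re

MIN_WORDS, MAX_WORDS = 20, 50
# Heuristic comparative markers; an assumption of this sketch, not an official list.
COMPARATIVE_MARKERS = {"more", "less", "than"}


def check_speaking_style(text: str) -> dict:
    """Return hard failures ("errors") and soft notes ("warnings") for one description."""
    words = re.findall(r"[a-z']+", text.lower())
    word_count = len(text.split())
    errors, warnings = [], []

    if "tone" in words or "tones" in words:
        errors.append('uses the word "tone"')
    if not text.strip().endswith((".", "!", "?")):
        errors.append("does not end as a complete sentence")
    if not COMPARATIVE_MARKERS & set(words):
        errors.append("no comparative wording found")
    if not MIN_WORDS <= word_count <= MAX_WORDS:
        warnings.append(f"{word_count} words, target is {MIN_WORDS}-{MAX_WORDS}")

    return {"errors": errors, "warnings": warnings}


rejected = check_speaking_style("The tone is more dramatic.")
```

Here `rejected["errors"]` contains the "tone" failure and `rejected["warnings"]` flags the five-word length. Whether the description is actually about delivery rather than sound still needs your own judgment.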

inline tags: local events, annotation audio only

Tags mark noticeable moments in the annotation audio — emphasis on a word, a deliberate pause, a hesitation, an audible sigh, syllable-by-syllable pronunciation, a sing-song delivery. You place a tag in the space between words, immediately before the event. You never tag the words or the punctuation themselves, and you never quote the transcript text. So a tagged moment looks like a marker dropped into the gap right before the thing it describes — "ma se ride anche," then the tag noting marked emphasis on the subject, then "lei."
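
The placement rule can be sketched in a few lines. The square-bracket tag syntax and the function name below are placeholders invented for this example; the real annotation tool will have its own tag format. The point is only that the tag goes into the gap before a word and never wraps or replaces the word itself.

```python
def place_tag(words: list[str], before_index: int, note: str) -> str:
    """Drop a tag into the gap immediately before words[before_index]."""
    if not 0 <= before_index < len(words):
        raise IndexError("tag must sit immediately before an existing word")
    tag = f"[{note}]"  # bracket syntax is a placeholder, not the platform's real format
    return " ".join(words[:before_index] + [tag] + words[before_index:])


tagged = place_tag("ma se ride anche lei".split(), 4, "marked emphasis on the subject")
print(tagged)  # ma se ride anche [marked emphasis on the subject] lei
```

Note that the tag text describes the event and does not repeat the word it precedes.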

the rejections worth memorizing

Most rejections come from a short list: blurring voice characteristics into speaking style, using the word "tone," writing descriptions that are too short, falling back on bullet points instead of sentences, naming the reference or annotation audio explicitly, commenting on recording quality, or guessing at age, gender, identity, or accent. Forgetting to add at least one tag is its own quiet failure.

the checklist before you submit

Run through it every time until it's automatic: voice characteristics describe sound only; speaking style describes delivery only; at least one inline tag is placed; every description is comparative, clear, and twenty to fifty words; and none of the forbidden words slipped in. The short version is easy to remember — voice characteristics are how the voice sounds, speaking style is how the speech is delivered, tags are local events in the annotation audio, invalid tasks are rare, and when you're unsure, you annotate rather than invalidate.
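
The mechanical parts of that checklist can be sketched as a small pre-submit function. The `Submission` structure and the item labels are hypothetical, made up for this example; the real task form may look different. Items that need judgment (sound only, delivery only, genuinely comparative) are deliberately left out because a script cannot decide them for you.

```python
import re
from dataclasses import dataclass


@dataclass
class Submission:
    # Hypothetical structure for this sketch; the real task form may differ.
    voice_characteristics: str
    speaking_style: str
    tag_count: int


def in_word_range(text: str, low: int = 20, high: int = 50) -> bool:
    return low <= len(text.split()) <= high


def presubmit_checklist(sub: Submission) -> dict:
    """Map each mechanically checkable checklist item to True (ok) or False."""
    fields = [sub.voice_characteristics, sub.speaking_style]
    return {
        "at least one inline tag": sub.tag_count >= 1,
        "twenty to fifty words each": all(in_word_range(f) for f in fields),
        'no "tone"': not any(re.search(r"\btone\b", f, re.IGNORECASE) for f in fields),
        "no bullet points": all(
            not line.lstrip().startswith(("-", "*", "•"))
            for f in fields
            for line in f.splitlines()
        ),
    }
```

A typical use would be `failed = [item for item, ok in presubmit_checklist(sub).items() if not ok]`, then fixing whatever shows up in `failed` before you submit.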