Annotation is where a run becomes training signal for the team. It records whether the agent succeeded, why the run mattered, and what a better answer or fix should preserve.

The problem

Manual annotation is expensive. It takes domain expertise, context switching, and careful reading of traces. A naive queue fills up quickly with noise: duplicates, obvious mistakes, easy successes, and runs that are already covered by earlier examples. The hard failures are different. They are often hidden in domain assumptions, financial conventions, internal policy, or multi-step tool use. These are the cases where annotation is most valuable, because the missing knowledge is not visible from the final answer alone.

What’s wrong with random sampling?

Random sampling gives you a rough sense of the agent’s failure rate. It is a reasonable starting point when the system is new and failures are everywhere. It becomes inefficient once the agent is already reasonably good. Reviewers spend too much time on obvious successes, repeated mistakes, and behavior already covered by earlier annotations. The more capable the agent gets, the more valuable it is to spend review time on the few runs that reveal something new.
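A rough back-of-the-envelope sketch of why this happens, assuming runs are sampled uniformly and each sampled run fails independently with the same probability. The function names and the failure rates below are hypothetical, chosen only for illustration; they are not measurements from any real system.

```python
def expected_failures_in_sample(sample_size: int, failure_rate: float) -> float:
    """Expected number of failing runs in a uniform random sample."""
    if sample_size < 0:
        raise ValueError("sample_size must be non-negative")
    if not 0.0 <= failure_rate <= 1.0:
        raise ValueError("failure_rate must be between 0 and 1")
    return sample_size * failure_rate


def expected_reviews_per_failure(failure_rate: float) -> float:
    """Expected number of sampled runs a reviewer reads to find one failure.

    Under the independence assumption this is the mean of a geometric
    distribution, 1 / failure_rate.
    """
    if not 0.0 < failure_rate <= 1.0:
        raise ValueError("failure_rate must be in (0, 1]")
    return 1.0 / failure_rate


if __name__ == "__main__":
    # Hypothetical failure rates: a new agent, a decent one, a good one.
    for rate in (0.5, 0.1, 0.02):
        per_failure = expected_reviews_per_failure(rate)
        in_batch = expected_failures_in_sample(100, rate)
        print(
            f"failure rate {rate:.0%}: about {per_failure:.0f} reviews per failure, "
            f"about {in_batch:.0f} failures in a batch of 100"
        )
```

This counts every failure as equally useful. Since some of the failures a reviewer does find will repeat mistakes that are already annotated, the cost per new failure under random sampling would be higher than these figures suggest.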

Why not just use LLM-as-a-judge?

LLM judges are useful for straightforward checks. They can catch clear format errors, direct mismatches, or failures that are obvious from the local context. They are weaker when the failure depends on domain knowledge the agent also missed. If a general judge can reliably infer the issue from the same context, the agent often could have avoided the mistake in the first place. The most valuable failures are the ones that require a human to notice the hidden assumption and turn it into reusable guidance. That is where annotation gives teams an edge. It captures the domain judgment behind the failure, not just the fact that the output was wrong.