Trajel Audits the Part of Agent Failures Your Final-Answer Checker Never Sees
Final-answer evaluation is a terrible way to judge agents. It tells you whether the last sentence looked right, not whether the system took a safe path to get there. Trajel, a new dataset and evaluation framework for trajectory-level hallucinations, attacks exactly that blind spot by labeling failures inside the Thought-Action-Observation traces that most leaderboards flatten into a single pass/fail result.
That distinction matters because agents are not chatbots with longer transcripts. They call tools, observe state, make intermediate decisions, delegate work, and sometimes mutate the world. A final answer can be correct by accident. A trajectory can be unsafe even when the final response is polished. If your evaluation only checks the commit message, you are not reviewing the diff.
The middle of the trace is where the incident starts
Trajel focuses on multi-agent industrial workflows, including AssetOps-style tasks around data-center monitoring and infrastructure maintenance. The dataset contains 225 expert-annotated agent trajectories across 6 models and 42 industrial tasks. Instead of asking only whether the end result is acceptable, the framework labels failures at both the subtask level and the trajectory level, recording the hallucination type, its location inside the trace, and a free-text reviewer rationale.
The taxonomy is practical: factual, referential, logical, procedural, and scope-based hallucinations. Those categories are not academic hair-splitting. They map to different operational fixes. A factual error may need better evidence retrieval. A referential error may mean the agent lost track of which server, file, metric, or user it was discussing. A procedural error wants policy gates and checklists. A scope error means the agent wandered beyond authority. A logical error suggests the system connected correct observations into a wrong conclusion. One “hallucination rate” number does not tell you which of those systems to repair.
The uncomfortable result is that 48.7% of hallucinated trajectories exhibit multiple hallucination types at once. That should feel familiar to anyone who has debugged real automation. Incidents rarely come from one clean mistake. An agent misreads a metric, chooses the wrong runbook, skips a confirmation step, and then summarizes success outside the authorized scope. Which failure caused the incident? Yes.
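To make the taxonomy concrete, here is a minimal sketch of how a team might store such labels and compute a per-type breakdown alongside the share of trajectories with more than one type. The `Label` fields, the function names, and the sample data are all assumptions for illustration; they are not Trajel's actual schema or numbers.

```python
from collections import Counter
from dataclasses import dataclass
from enum import Enum


class HallucinationType(Enum):
    FACTUAL = "factual"
    REFERENTIAL = "referential"
    LOGICAL = "logical"
    PROCEDURAL = "procedural"
    SCOPE = "scope"


@dataclass(frozen=True)
class Label:
    """One reviewer annotation (hypothetical schema, not Trajel's)."""
    trajectory_id: str
    step_index: int          # location inside the trace
    kind: HallucinationType
    rationale: str           # free-text reviewer note


def breakdown(labels):
    """Count labels per hallucination type."""
    return Counter(label.kind for label in labels)


def multi_type_share(labels):
    """Fraction of labeled trajectories that carry more than one type."""
    kinds_by_trajectory = {}
    for label in labels:
        kinds_by_trajectory.setdefault(label.trajectory_id, set()).add(label.kind)
    if not kinds_by_trajectory:
        return 0.0
    multi = sum(1 for kinds in kinds_by_trajectory.values() if len(kinds) > 1)
    return multi / len(kinds_by_trajectory)


# Made-up annotations for two trajectories.
labels = [
    Label("t1", 2, HallucinationType.FACTUAL, "metric value not in tool output"),
    Label("t1", 5, HallucinationType.PROCEDURAL, "skipped confirmation step"),
    Label("t2", 1, HallucinationType.SCOPE, "touched a host outside the task"),
]

per_type = breakdown(labels)
share = multi_type_share(labels)
```

Keeping the type on each label, rather than one flag per trajectory, is what lets the breakdown point at a specific fix such as retrieval, reference tracking, or policy gates.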
Trajel reports that blind human reviewers identify a 68.3% hallucination rate, with Cohen’s κ = 0.456 between automated and human judgments. That is only moderate agreement, and that is part of the point. Trajectory-level evaluation is harder than checking whether a final answer matches a reference. Even careful reviewers can disagree about whether a step is merely inefficient, unsupported, or hallucinated. But production systems need to confront that ambiguity instead of hiding it behind a green checkmark at the end.
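For readers who want to run the same agreement check on their own reviews, Cohen’s κ is (p_o − p_e) / (1 − p_e), where p_o is the observed agreement and p_e is the agreement expected by chance from each rater’s label frequencies. The sketch below implements that standard formula in plain Python. The two label lists are invented for illustration and are not Trajel data.

```python
def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two equal-length lists of categorical labels."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("raters must label the same non-empty set of items")
    n = len(rater_a)
    # Observed agreement: share of items where both raters chose the same label.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: product of each rater's marginal frequency, per label.
    categories = set(rater_a) | set(rater_b)
    expected = sum(
        (rater_a.count(c) / n) * (rater_b.count(c) / n) for c in categories
    )
    if expected == 1.0:
        # Both raters used one identical label throughout; kappa is undefined.
        # Returning 1.0 here is a choice made for this sketch.
        return 1.0
    return (observed - expected) / (1 - expected)


# Invented labels: 1 = hallucinated, 0 = clean.
human = [1, 1, 1, 0, 1, 0, 1, 1, 0, 1]
automated = [1, 0, 1, 0, 1, 1, 1, 1, 0, 0]

kappa = cohens_kappa(human, automated)
```

In this toy example the raters agree on 7 of 10 items, but chance alone would produce 0.54 agreement, so κ lands well below the raw 70% figure. That gap is why κ is reported instead of simple percent agreement.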
Agent observability needs chain of custody
The paper benchmarks three supervised detection paradigms: subtask-level BERT classification, trajectory-level NLI, and long-context modeling with Longformer. The broader argument is more important than any one detector: trajectory-aware detection beats post-hoc final-answer verification for failures that originate inside intermediate steps. That is obvious once stated, yet many agent evaluation pipelines still behave as if the final text were the only artifact worth inspecting.
For practitioners, the action item is not “adopt Trajel’s model.” It is to make trajectories first-class observability objects. Store tool calls, observations, approvals, retries, state diffs, policy decisions, and compact reasoning summaries where appropriate. Log which agent or sub-agent made which decision and what evidence it had. Preserve enough trace context that a reviewer can answer: why did the system believe this, which tool gave it that information, and where did it leave the approved task boundary?
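One way to picture a trajectory as a first-class object is a small append-only record with explicit evidence links. The sketch below is an assumption about what such a store could look like, not a format Trajel prescribes. `TraceStep`, `Trajectory`, the field names, and the sample hosts and tools are all hypothetical.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Optional


@dataclass
class TraceStep:
    index: int
    agent: str                    # which agent or sub-agent acted
    kind: str                     # e.g. "observation", "thought_summary", "action"
    content: str
    tool: Optional[str] = None    # tool that produced or executed this step
    evidence: list = field(default_factory=list)  # indices of earlier steps relied on
    within_scope: bool = True     # set by a policy check at write time


@dataclass
class Trajectory:
    task_id: str
    steps: list = field(default_factory=list)

    def add(self, agent, kind, content, **extra):
        step = TraceStep(index=len(self.steps), agent=agent, kind=kind,
                         content=content, **extra)
        self.steps.append(step)
        return step

    def provenance(self, index):
        """Indices of every step the given step transitively relied on, plus itself."""
        seen, stack = set(), [index]
        while stack:
            i = stack.pop()
            if i not in seen:
                seen.add(i)
                stack.extend(self.steps[i].evidence)
        return sorted(seen)

    def scope_violations(self):
        return [s.index for s in self.steps if not s.within_scope]

    def to_jsonl(self):
        """One JSON object per step, suitable for an append-only log."""
        return "\n".join(json.dumps(asdict(s)) for s in self.steps)


traj = Trajectory("task-001")
obs = traj.add("monitor", "observation", "cpu_util=97% on rack-12", tool="metrics_query")
dec = traj.add("planner", "thought_summary", "rack-12 is overloaded", evidence=[obs.index])
act = traj.add("executor", "action", "restart service on rack-14",
               tool="service_restart", evidence=[dec.index], within_scope=False)

chain = traj.provenance(act.index)
violations = traj.scope_violations()
```

With evidence links stored per step, a reviewer can walk back from the bad restart to the metric reading it was based on and see that the target host changed along the way.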
This is especially relevant for coding agents. A coding agent can produce a passing patch after reading the wrong file, misunderstanding a test failure, or inventing a dependency constraint that happens not to matter in the sample case. If the final diff is small, a reviewer may never see the broken reasoning path. But broken traces become future regressions. They reveal where the model is unreliable, where the runtime lacks guardrails, and where review checklists should be stricter.
Teams should add eval cases that score procedure compliance and scope discipline, not just final success. Did the agent inspect the relevant files before editing? Did it run the right tests once, or repeatedly rerun commands without learning? Did it use the correct environment? Did it stay inside the requested directory? Did it call external tools when local evidence was sufficient? These questions are not side quests. They are the operational definition of trust.
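Several of these questions can be turned into mechanical checks over a recorded event log. The following is a rough sketch under stated assumptions: the `Event` shape, the three check functions, and the sample session are hypothetical, and paths are assumed to be already normalized, relative POSIX paths with no `..` segments.

```python
from dataclasses import dataclass
from pathlib import PurePosixPath


@dataclass(frozen=True)
class Event:
    kind: str    # "read", "edit", or "run"
    target: str  # file path for read/edit, command line for run


def edits_without_prior_read(events):
    """Files the agent edited before ever reading them."""
    read, offenders = set(), []
    for e in events:
        if e.kind == "read":
            read.add(e.target)
        elif e.kind == "edit" and e.target not in read:
            offenders.append(e.target)
    return offenders


def paths_outside(events, allowed_dir):
    """Read or edited paths that fall outside the requested directory."""
    root = PurePosixPath(allowed_dir)
    outside = []
    for e in events:
        if e.kind in ("read", "edit"):
            path = PurePosixPath(e.target)
            if path != root and root not in path.parents:
                outside.append(e.target)
    return outside


def repeated_runs_without_change(events):
    """Commands rerun with no edit in between, a sign of retrying blindly."""
    seen_since_edit, repeats = set(), []
    for e in events:
        if e.kind == "edit":
            seen_since_edit.clear()
        elif e.kind == "run":
            if e.target in seen_since_edit:
                repeats.append(e.target)
            seen_since_edit.add(e.target)
    return repeats


def score(events, allowed_dir):
    """Pass/fail per procedure check, independent of whether the final patch works."""
    return {
        "inspected_before_editing": not edits_without_prior_read(events),
        "stayed_in_scope": not paths_outside(events, allowed_dir),
        "no_blind_reruns": not repeated_runs_without_change(events),
    }


# Invented session: the task was limited to the src directory.
session = [
    Event("read", "src/app.py"),
    Event("run", "pytest tests/"),
    Event("run", "pytest tests/"),
    Event("edit", "src/app.py"),
    Event("edit", "config/prod.yaml"),
    Event("run", "pytest tests/"),
]

result = score(session, "src")
```

In this invented session all three checks fail even if the last test run passes, which is exactly the case a final-success metric would score as a win.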
Trajel is still a research artifact, and the dataset is small by web-scale standards. But it is pointed at the right layer. The next generation of agent benchmarks should not reward models for stumbling into correct answers through broken workflows. In production, the path is part of the output.
The take: if you only check the final answer, you are grading the agent’s closing argument while ignoring the evidence chain. For agents with tools, that is not evaluation. It is plausible deniability with JSON logs.
Sources: arXiv, Hugging Face Papers