AI researchers and developers are increasingly turning to large language models (LLMs) to evaluate the responses of other LLMs in a process known as “LLM-as-a-judge”. Unfortunately, the quality of these evaluations degrades on complex tasks like long-form factual checking, advanced coding, and math problems.
Now, a new research paper published by researchers from the University of Cambridge and Apple outlines a new system that augments AI judges with external validation tools to improve their judgment quality.
This system aims to overcome limitations found in both human and AI annotation. Humans face challenges and biases due to time limits, fatigue, and being influenced by writing style over factual accuracy while AI struggles with the aforementioned complex tasks.
The Evaluation Agent that the researchers created is agentic so it can assess the response to determine if external tools are needed and utilizes the correct tools. For each evaluation, three main steps are passed through: initial domain assessment, tool usage, and a final decision.
The fact-checking tool uses web search to verify atomic facts within a response; code execution leverages OpenAI’s code interpreter to run and verify code correctness; and math checker is a specialized version of the code execution tool for validating mathematical and arithmetic operations.
If none of the tools are found to be useful for making judgments, the baseline LLM annotator is used to avoid unnecessary processing and potential performance regression on simple tasks.
The system delivered notable improvements in long-form factual checking, with significant increases in agreement with ground-truth annotations across various baselines. In coding tasks, the agent-based approach significantly improved performance across all baselines. For challenging math tasks, the agents improved performance over some baselines, but not all, and overall agreement remained relatively low at around 56%. Notably, the researchers found that in long-form factual responses, the agent’s agreement with ground-truth was higher than that of human annotators.
This framework is extensible, so in the future, other tools could be integrated to further improve LLM evaluation systems. The code for the framework will be made open source on Apple’s GitHub, but it isn’t up yet.