CriticBench: Evaluating Large Language Models as Critic

Tian Lan1*, Wenwei Zhang2*, Chen Xu1, Heyan Huang1, Dahua Lin2, Kai Chen2†, Xian-Ling Mao1†
1 Beijing Institute of Technology 2 Shanghai AI Laboratory

Abstract

Critique ability is crucial for the scalable oversight and self-improvement of Large Language Models (LLMs). While many recent studies explore the critique ability of LLMs to judge and refine flaws in generations, how to comprehensively and reliably measure the critique abilities of LLMs remains under-explored. This paper introduces CriticBench, a novel benchmark designed to comprehensively and reliably evaluate four key critique ability dimensions of LLMs: feedback, comparison, correction, and meta-feedback. CriticBench encompasses nine diverse tasks, each assessing the LLMs' ability to critique responses at varying levels of quality granularity. Our extensive evaluations of open-source and closed-source LLMs reveal intriguing relationships between critique ability and task type, response quality, and model scale.


Figure 1. Comparison between CriticBench and previous works.

Introduction

CriticBench evaluates nine tasks (translation, general chat, question answering, summarization, harmlessness, math with chain-of-thought, math with program-of-thought, code with execution results, code without execution results) across four critique dimensions (Feedback, Comparison, Correction, Meta-Feedback) on four levels of response quality (low, medium, high, and correct). In addition, both objective and subjective scores are computed for each task and each critique dimension.
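As an illustration, a single CriticBench sample can be pictured as bundling an instruction, a rated response, and reference critiques for each dimension. The sketch below is a hypothetical schema, not the benchmark's actual data format; all field and class names are assumptions.

```python
# Hypothetical sketch of a CriticBench sample; field names are illustrative,
# not the benchmark's actual schema.
from dataclasses import dataclass, field

TASKS = [
    "translation", "general_chat", "question_answering", "summarization",
    "harmlessness", "math_cot", "math_pot", "code_exec", "code_no_exec",
]
DIMENSIONS = ["feedback", "comparison", "correction", "meta_feedback"]
QUALITIES = ["low", "medium", "high", "correct"]


@dataclass
class CriticBenchSample:
    task: str                       # one of TASKS
    instruction: str                # the input prompt given to the LLM
    response: str                   # a model response at a given quality level
    quality: str                    # one of QUALITIES
    reference_critiques: dict = field(default_factory=dict)  # dimension -> reference critique


sample = CriticBenchSample(
    task="math_cot",
    instruction="A train travels 60 km in 40 minutes. What is its speed in km/h?",
    response="Speed = 60 / 40 = 1.5 km/h.",  # flawed chain-of-thought
    quality="low",
    reference_critiques={
        "feedback": "The response never converts minutes to hours: 60 / (40 / 60) = 90 km/h."
    },
)
print(sample.task, sample.quality)
```

In this view, the low-, medium-, and high-quality responses described later are simply several such samples that share an instruction but differ in quality.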

Overall, CriticBench exhibits significant advantages over previous benchmarks for critique evaluation (Fig. 1), offering greater diversity in response quality granularity, critique formats, and critique dimensions, as well as a larger data size, which allows deeper analysis of the LLMs' critique capabilities.

Data Generation Pipeline


Figure 2. Overview and Construction Pipeline of CriticBench.

CriticBench is constructed with a human-in-the-loop pipeline that consists of three main phases: instruction collection, response generation, and reference critique generation. An overview of the construction is shown in Fig. 2, and the details of each phase are described as follows:

  1. Instruction Collection: Instructions for 9 distinct tasks are collected to evaluate critique capabilities comprehensively (Step 1 in Fig. 2). Specifically, the benchmark includes three representative classical language tasks: summarization, translation, and question answering. Since a popular application of LLMs is to serve as chatbots, where alignment is essential for safe deployment, we also collect instructions from general chat scenarios and harmlessness cases to evaluate the LLMs' critique ability for alignment. Furthermore, reasoning and coding capabilities are fundamental for augmenting LLMs as agents, another important and promising application of LLMs. Thus, we additionally collect instructions for math reasoning with chain-of-thought and program-of-thought, and for coding with and without execution results. To ensure the difficulty of CriticBench, we only keep coding and math reasoning questions that some 70B LLMs cannot answer correctly.
  2. Response Generation: For each collected instruction in each task, LLMs of different scales and capabilities are employed to generate flawed responses, which naturally yields responses of various qualities (Step 2 (a) in Fig. 2). To identify the quality of these responses efficiently, GPT-4 is first utilized to assign quality ratings ranging from 1 to 7 (Step 2 (b) in Fig. 2), which human annotators then meticulously review and adjust. Subsequently, three responses with distinct quality levels are chosen for each instruction based on their human-verified quality scores, yielding low-, medium-, and high-quality responses (a minimal selection sketch is given after this list).
  3. Reference Critique Generation: After collecting the instructions and the corresponding responses, we collect reference critiques for these responses with the assistance of GPT-4 to make the subjective evaluation more reliable, covering feedback, correction, comparison, and meta-feedback (a prompting sketch is given after this list). Note that the correction and meta-feedback critique dimensions are overlooked in previous works.
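The following is a minimal sketch of the selection in step 2, assuming the human-verified 1-7 ratings are grouped into low/medium/high bands; the band boundaries and function name are assumptions, not the paper's exact procedure.

```python
# Hypothetical sketch of the response-selection step (step 2): from a pool of
# rated responses for one instruction, keep one low-, one medium-, and one
# high-quality response. Rating bands are illustrative assumptions.
from typing import Optional


def select_by_quality(rated_responses: list[tuple[str, int]]) -> Optional[dict]:
    """rated_responses: (response_text, human_verified_rating in 1..7)."""
    bands = {
        "low": range(1, 3),     # ratings 1-2
        "medium": range(3, 6),  # ratings 3-5
        "high": range(6, 8),    # ratings 6-7
    }
    chosen = {}
    for level, band in bands.items():
        candidates = [text for text, score in rated_responses if score in band]
        if not candidates:
            return None         # this instruction lacks a response at this quality level
        chosen[level] = candidates[0]
    return chosen


ratings = [("answer A", 2), ("answer B", 4), ("answer C", 7)]
print(select_by_quality(ratings))
# {'low': 'answer A', 'medium': 'answer B', 'high': 'answer C'}
```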
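Below is a hypothetical prompt-building sketch for step 3, where GPT-4 drafts a reference feedback critique that human annotators then review; the prompt wording is an assumption, and the actual API call is omitted.

```python
# Hypothetical sketch of drafting a reference feedback critique (step 3).
# The prompt wording is illustrative; the drafted critique would still be
# reviewed and revised by human annotators before entering CriticBench.
FEEDBACK_PROMPT = """You are an expert reviewer.

Instruction:
{instruction}

Response:
{response}

Write a detailed critique: point out every flaw in the response, explain why
it is a flaw, and rate the overall quality on a scale from 1 to 7."""


def build_feedback_prompt(instruction: str, response: str) -> str:
    """Fill the critique prompt that would be sent to GPT-4."""
    return FEEDBACK_PROMPT.format(instruction=instruction, response=response)


print(build_feedback_prompt(
    "Translate 'bonjour' into English.",
    "Good morning.",
))
```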


Results

You can find the latest subjective and objective leaderboards of CriticBench here.


