CompassVerifier: A Unified and Robust Verifier for LLMs Evaluation and Outcome Reward

Introduction

CompassVerifier is an accurate and robust lightweight verifier model for evaluation and outcome reward. It demonstrates multi-domain competency spanning math, knowledge, and diverse reasoning tasks, with the capability to process various answer types, including multi-subproblems, formulas, and sequence answers, while effectively identifying abnormal/invalid responses.

We also introduce VerifierBench, a benchmark of questions and model outputs collected from multiple data sources. It is drawn from about 1 million predictions made by commonly used models on a variety of datasets, and it is labeled by human experts. Our verifier models are available in three sizes: CompassVerifier-3B, CompassVerifier-7B, and CompassVerifier-32B.

The verifier outputs a simple judgment: A (correct), B (incorrect), or C (quality problems). We also provide a CoT mode that gives a detailed analysis for complex problems. CompassVerifier achieves state-of-the-art performance on VerifierBench: all three sizes score a higher F1 than every other compared model in every domain, and the margin is largest in math (80.8 for CompassVerifier-32B versus 55.3 for the best other model).
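The A/B/C judgment can be turned into a binary outcome reward with a small parser. The following is a minimal sketch, not the project's own code: the helper names `parse_verdict` and `outcome_reward` are hypothetical, it assumes the verdict is the last standalone letter A, B, or C in the output, and scoring C the same as B is an assumption. Check the exact output format against the model's documented prompt before relying on it.

```python
import re
from typing import Optional

# Verdict letters as described above; the mapping is used only for readability.
VERDICTS = {"A": "correct", "B": "incorrect", "C": "quality problems"}


def parse_verdict(text: str) -> Optional[str]:
    """Return the last standalone A/B/C letter in the verifier output, or None.

    Assumption: in CoT mode the final verdict comes after the analysis,
    so the last standalone letter is taken as the judgment.
    """
    matches = re.findall(r"\b([ABC])\b", text)
    return matches[-1] if matches else None


def outcome_reward(text: str) -> float:
    """Map a verifier output to a scalar reward.

    Assumption: only A earns a reward; B, C, and unparseable outputs get 0.0.
    """
    return 1.0 if parse_verdict(text) == "A" else 0.0
```

Treating C and unparseable outputs as zero reward is the conservative choice. A training pipeline might instead drop those samples, so that invalid responses are filtered out rather than penalized.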

Performance comparison of CompassVerifier with other models across different domains on VerifierBench.

Leaderboard on VerifierBench

F1 scores on VerifierBench across different domains.

#	Model	Type	AVG	General Reasoning	Knowledge	Math	Science
1	CompassVerifier-32B 🥇	Verifier	87.7	90.3	94.8	80.8	84.7
2	CompassVerifier-7B 🥈	Verifier	83.4	87.7	92.6	74.8	78.5
3	CompassVerifier-3B 🥉	Verifier	80.4	85.9	87.7	71.0	77.1
4	xVerify-8B-I	Verifier	70.4	78.9	85.1	42.6	74.9
5	DeepSeek-V3-0324	LLM	70.3	76.6	81.2	54.7	68.5
6	xVerify-0.5B-I	Verifier	70.0	78.5	86.2	42.6	72.6
7	GPT-4.1-2025-04-14	LLM	69.8	79.5	82.9	42.0	75.0
8	xVerify-9B-C	Verifier	69.1	77.0	81.7	48.0	69.8
9	Tencent-Qwen2.5-7B-RLVR	Verifier	67.1	73.8	76.8	55.3	62.6
10	Qwen3-235B	LLM	62.7	73.7	73.1	53.9	50.0
11	Qwen3-32B	LLM	61.8	70.3	69.5	54.6	52.8
12	GPT-4o-2024-08-06	LLM	59.1	68.2	78.3	34.9	54.9
13	Qwen3-8B	LLM	56.4	61.8	69.4	51.6	42.9
14	Qwen2.5-72B-Instruct	LLM	53.9	49.0	68.5	37.5	60.5
15	Qwen2.5-32B-Instruct	LLM	42.2	42.2	46.4	31.6	48.8
16	Qwen2.5-7B-Instruct	LLM	42.1	51.1	50.7	30.0	36.6

Type: "Verifier" denotes a specialized verifier model; "LLM" denotes a general-purpose large language model.
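The source does not say how the AVG column is computed. The reported values are consistent with an unweighted mean of the four domain F1 scores, which the sketch below checks for a few rows copied from the table. The helper `domain_mean` and the tolerance are illustrative choices. The tolerance allows for the domain scores themselves being rounded to one decimal.

```python
# (AVG, [General Reasoning, Knowledge, Math, Science]) copied from the leaderboard.
rows = {
    "CompassVerifier-32B": (87.7, [90.3, 94.8, 80.8, 84.7]),
    "CompassVerifier-7B": (83.4, [87.7, 92.6, 74.8, 78.5]),
    "CompassVerifier-3B": (80.4, [85.9, 87.7, 71.0, 77.1]),
    "xVerify-8B-I": (70.4, [78.9, 85.1, 42.6, 74.9]),
    "Qwen2.5-7B-Instruct": (42.1, [51.1, 50.7, 30.0, 36.6]),
}


def domain_mean(scores):
    """Unweighted mean of the per-domain F1 scores."""
    return sum(scores) / len(scores)


def consistent_with_mean(avg, scores, tol=0.06):
    """True if the reported AVG matches the unweighted mean within rounding."""
    return abs(domain_mean(scores) - avg) < tol


for name, (avg, scores) in rows.items():
    print(f"{name}: reported {avg}, recomputed {domain_mean(scores):.2f}")
```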

🚨 To submit your results to the leaderboard, please contact us at liushudong@pjlab.org.cn and liuhongwei@pjlab.org.cn.

BibTeX

@article{liu2025compassverifier,
  title={CompassVerifier: A Unified and Robust Verifier for LLMs Evaluation and Outcome Reward},
  author={Liu, Shudong and Liu, Hongwei and Liu, Junnan and Xiao, Linchen and Gao, Songyang and Lyu, Chengqi and Gu, Yuzhe and Zhang, Wenwei and Wong, Derek F and Zhang, Songyang and Chen, Kai},
  journal={arXiv preprint arXiv:2508.03686},
  year={2025}
}