BotChat Leaderboard

About BotChat

BotChat evaluates LLMs' capabilities of having multi-turn dialogues. We begin with real-world human dialogues and then prompt language models to generate full multi-turn dialogues, one utterance at a time. The generated dialogues are then evaluated by state-of-the-art language models such as GPT-4. For more in-depth information, please refer to our documentation.

Leaderboard Metrics

We provide three different evaluation protocols:

GTEval: A comparison of the generated conversations with "Ground Truth" conversations (the gold standard).
UniEval: Independent evaluation of each generated dialogue.
Arena ELO: Comparative evaluation of responses from two distinct models.

In UniEval and Arena ELO, 'N' indicates the number of rounds of dialogue in each conversation.
Length is the token length of the utterances generated by each model.
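
To illustrate how pairwise comparisons can be turned into a ranking, here is a minimal sketch of the standard Elo update rule. It is not necessarily the exact procedure behind the Arena ELO column: the initial rating of 1000, the K-factor of 32, and the function names (expected_score, update_elo, compute_elo) are all assumptions chosen for illustration.

```python
from collections import defaultdict


def expected_score(rating_a, rating_b):
    """Expected score of A against B under the standard Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update_elo(ratings, model_a, model_b, outcome, k=32.0):
    """Update both ratings in place after one comparison.

    outcome is 1.0 if model_a is preferred, 0.0 if model_b is preferred,
    and 0.5 for a tie. The K-factor k is an assumed value.
    """
    delta = k * (outcome - expected_score(ratings[model_a], ratings[model_b]))
    ratings[model_a] += delta
    ratings[model_b] -= delta


def compute_elo(comparisons, initial=1000.0, k=32.0):
    """Process (model_a, model_b, outcome) triples in order.

    Every model starts from the same assumed initial rating.
    """
    ratings = defaultdict(lambda: initial)
    for model_a, model_b, outcome in comparisons:
        update_elo(ratings, model_a, model_b, outcome, k)
    return dict(ratings)


if __name__ == "__main__":
    # Hypothetical judge verdicts for two placeholder models.
    verdicts = [
        ("model_x", "model_y", 1.0),
        ("model_x", "model_y", 0.5),
        ("model_y", "model_x", 1.0),
    ]
    for name, rating in sorted(compute_elo(verdicts).items()):
        print(f"{name}: {rating:.1f}")
```

Because sequential Elo updates depend on the order of the comparisons, a practical setup might shuffle the comparisons and average the ratings over several runs.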

Citation


@misc{duan2023botchat,
      title={BotChat: Evaluating LLMs' Capabilities of Having Multi-Turn Dialogues}, 
      author={Haodong Duan and Jueqi Wei and Chonghua Wang and Hongwei Liu and Yixiao Fang and Songyang Zhang and Dahua Lin and Kai Chen},
      year={2023},
      eprint={2310.13650},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}