Creation-MMBench: Assessing Context-Aware Creative Intelligence in MLLMs

Xinyu Fang^*1,2, Zhijian Chen^*3, Kai Lan³, Lixin Ma³, Shengyuan Ding^2,4, Yingji Liang⁵, Xiangyu Zhao^2,6, Farong Wen⁶, Zicheng Zhang^2,6, Guofeng Zhang¹, Haodong Duan^†2, Kai Chen^†2, Dahua Lin^2,7

▶ ¹ Zhejiang University ▶ ² Shanghai AI Laboratory ▶ ³ Tongji University ▶ ⁴ Nanjing University ▶ ⁵ East China Normal University ▶ ⁶ Shanghai Jiao Tong University ▶ ⁷ The Chinese University of Hong Kong

* Equal contribution. † Corresponding authors.

arXiv Guide

🤗

Dataset

🛠️

Evaluation

🚀 Creation-MMBench: A multimodal benchmark specifically designed to evaluate the creative capabilities of MLLMs.
🚀 Include original, high-quality visual questions crafted by volunteers, spanning 4 categories and 51 fine-grained tasks.
🚀 Design a robust MLLM-as-a-judge evaluation methodology, consisting of Unitary Scoring and Pairwise Comparison.
🚀 Propose Creation-MMBench-TO, a text-only variant, to further explore the impact of visual instruction tuning.

🔥What's New

[2025.03.18] The Creation-MMBench Dataset, WebPage and Evaluation Code is all released!
[2025.03.19] The Creation-MMBench Paper is released! Check it! 🎉🎉🎉

Abstract

Creativity is a fundamental aspect of intelligence, involving the ability to generate novel and appropriate solutions across diverse contexts. While Large Language Models (LLMs) have been extensively evaluated for their creative capabilities, the assessment of Multimodal Large Language Models (MLLMs) in this domain remains largely unexplored. To address this gap, we introduce Creation-MMBench, a multimodal benchmark specifically designed to evaluate the creative capabilities of MLLMs in real-world, image-based tasks. The benchmark comprises 765 test cases spanning 51 fine-grained tasks. To ensure rigorous evaluation, we define instance-specific evaluation criteria for each test case, guiding the assessment of both general response quality and factual consistency with visual inputs. Experimental results reveal that current open-source MLLMs significantly underperform compared to proprietary models in creative tasks. Furthermore, our analysis demonstrates that visual fine-tuning can negatively impact the base LLM's creative abilities. Creation-MMBench provides valuable insights for advancing MLLM creativity and establishes a foundation for future improvements in multimodal generative intelligence. Full data and evaluation code is released on Creation-MMBench.

1. Lack of Multimodal Creative Benchmarks:

As a well-established theory in psychology, the Triarchic Theory of Intelligence comprises three subtheories: the analytical subtheory, the contextual subtheory, and the creative subtheory. The analytical subtheory primarily focuses on information processing and problem-solving skills based on domain-specific knowledge and can be assessed through various knowledge and reasoning benchmarks. The contextual subtheory, on the other hand, emphasizes practical intelligence in real-world scenarios and is typically evaluated using agent-based or embodied AI benchmarks. Despite the significance of the creative subtheory in intelligence, evaluations of MLLMs' creative capabilities remain highly inadequate and lag significantly behind those conducted for LLMs.

2. Limited Capabilities of Existing Benchmarks:

MLLMs have certain shortcomings in dealing with creative tasks in daily situation. However, existing benchmarks feature simple questions that fail to assess model performance in real-life creative tasks.

Overview of Creation-MMBench

Overview of Creation-MMBench. Contain four task categories: Literary Writing, Common Functional Writing, Professional Functional Writing, and Creative Multimodal Understanding. Each category consists of multiple tasks, and the types of images are diverse. Only a few representative tasks of each category are shown here. Complete list of tasks is detailed in Appendix A.

Benchmark Comparision and Statistics

Comparison of Creation-MMBench with other partial-creation MLLM benchmarks:

Benchmarks	Num of Creative Questions	Criteria Level	Multi-images Task	Specific Role for each Question	Visual Factuality Check
VisIT-Bench	65	benchmark	✔	✘	✔
MLLM-Bench	20	instance	✘	✘	✔
Touch-Stone	189	benchmark	✔	✘	✘
AlignMMbench	353	task	✘	✘	✘
Creation-MMBench	765	instance	✔	✔	✔

Statistics and Cases of Creation-MMBench:

(a) Distribution of query lengths.

(b) Roles in Creation-MMBench.

Statistics and Cases of Creation-MMBench: Compared to other widely used MLLM benchmarks, Creation-MMBench features a more comprehensive query design to capture abundant creative contexts. Diverse roles are introduced into the queries to stimulate MLLMs' utilization of disciplinary and prior knowledge. As an MLLM benchmark, Creation-MMBench includes a rich variety of images to thoroughly evaluate multiple capabilities of MLLMs.

Creation-MMBench Leaderboard

VFS stands for Visual Factuality Score. LW, CFW, PFW, and CMU stand for four categories in Creation-MMBench: Literary Writing, Common Functional Writing, Professional Functional Writing, and Creative Multimodal Understanding.

OC Score represents the average score of the OpenVLM Leaderboard and mainly demonstrates the objective performance of the model.

The token number is calculated with tiktoken GPT-4o-1120 tokenizer.

The best results are highlighted in bold.

Model	Overall		LW		CFW		PFW		CMU		OC Score	Avg Tokens
Model	VFS	Reward	VFS	Reward	VFS	Reward	VFS	Reward	VFS	Reward	OC Score	Avg Tokens
Proprietary MLLMs
Gemini-2.0-pro-exp	8.53	4.48	8.66	-1.88	8.98	12.71	8.01	3.33	8.65	-8.06	73.4	718
GPT-4o-1120 [Baseline]	8.72	0.00	8.86	0.00	8.93	0.00	8.26	0.00	9.38	0.00	72.0	497
Gemini-1.5-pro-002	8.41	-5.49	8.66	-6.04	8.59	-2.04	8.05	-4.82	8.75	-17.22	72.2	444
GPT-4.5-0227	8.54	-5.88	8.63	-4.38	8.76	-8.33	8.05	-5.88	9.29	-0.56	/	394
GPT-4o-mini	8.07	-13.56	8.30	-4.38	8.44	-15.28	7.50	-16.05	8.40	-12.78	64.1	436
Doubao-VL	8.38	-14.09	8.28	-19.17	9.01	-3.33	7.65	-18.72	8.77	-25.00	/	516
Claude-3.5-Sonnet	7.96	-15.46	8.44	-16.46	7.45	-21.57	7.98	-11.14	8.88	-9.44	70.6	336
Moonshot-v1-32k-vision	7.43	-20.58	7.30	-21.46	8.20	-8.80	6.91	-26.50	6.91	-36.11	/	485
Open-Source MLLMs
Qwen2.5-VL-72B-Instruct	8.33	-5.82	8.04	-10.83	8.91	4.44	7.68	-11.49	8.86	-11.94	76.1	553
InternVL2.5-78B-MPO	8.06	-12.55	8.22	-9.17	8.60	-5.00	7.45	-16.32	8.22	-27.78	77.0	461
InternVL2.5-8B-MPO	7.65	-15.10	8.09	-16.25	8.30	-3.80	6.80	-23.95	7.88	-19.44	70.3	548
InternVL2.5-78B	7.91	-16.43	8.05	-17.50	8.45	-7.69	7.26	-20.53	8.18	-28.33	75.2	473
Qwen2-VL-72B-instruct	7.87	-22.45	7.75	-24.58	8.17	-15.56	7.42	-26.84	8.43	-26.39	74.8	439
InternVL2.5-8B	7.38	-25.42	7.91	-23.33	7.95	-15.83	6.62	-33.95	7.45	-30.00	68.1	500
Qwen2.5-VL-7B-Instruct	7.55	-29.80	7.34	-39.38	8.40	-21.67	6.71	-33.25	7.78	-30.56	70.9	510
MiniCPM-o-2.6	7.49	-34.77	7.79	-35.42	7.95	-27.31	6.76	-40.88	8.08	-36.94	70.2	389
DeepSeek-VL2	7.24	-38.52	7.58	-33.75	7.58	-32.50	6.61	-44.02	7.81	-45.56	66.4	440
LLaVA-OneVision-72B	7.16	-39.87	7.26	-36.32	7.72	-30.61	6.43	-47.98	7.62	-46.37	68.0	315
LLaVA-OneVision-7B	6.75	-43.49	7.36	-43.54	7.27	-31.85	6.04	-50.53	6.82	-56.11	60.2	373
Qwen2-VL-7B-instruct	7.12	-43.76	6.99	-55.83	7.67	-36.30	6.57	-45.26	7.25	-45.28	67.1	456
VITA-1.5	6.43	-53.31	6.77	-46.19	7.23	-46.50	5.70	-57.43	6.22	-69.72	63.3	385

Comparing OC Score and Creation-MMBench Reward. This figure shows the model performance on the OpenVLM Leaderboard and Creation-MMBench, highlighting a significant gap between objective performance and visual creativity in some open-source models.

Creation-MMBench-TO Results

LLM performance on Creation-MMBench-TO and Visual Instruction Tuning Impact on VLM creation capability.

VLM	Corresponding LLM	Text Input w. LLM		Text Input w. VLM		Vision+Text Input w. VLM
VLM	Corresponding LLM	VFS	Reward	VFS	Reward	VFS	Reward
GPT-4o-1120	GPT-4o-1120	8.71	6.96	8.71	6.96	8.72	0.36
Gemini-2.0-pro-exp	Gemini-2.0-pro-exp	8.49	4.08	8.49	4.08	8.53	4.48
Qwen2.5-VL-72B-Instruct	Qwen2.5-72B-Instruct	8.55	0.82	8.51	-4.05	8.33	-5.82
Qwen2.5-VL-7B-Instruct	Qwen2.5-7B-Instruct	8.18	-19.18	7.97	-27.50	7.55	-29.80
MiniCPM-o-2.6	Qwen2.5-7B-Instruct	8.18	-19.18	7.78	-36.57	7.49	-34.77
InternVL2.5-8B	InternLM2.5-7B-Chat	7.83	-22.19	7.92	-28.73	7.38	-25.42

📃 BibTeX


          @misc{fang2025creationmmbenchassessingcontextawarecreative,
                title={Creation-MMBench: Assessing Context-Aware Creative Intelligence in MLLM}, 
                author={Xinyu Fang and Zhijian Chen and Kai Lan and Lixin Ma and Shengyuan Ding and Yingji Liang and Xiangyu Zhao and Farong Wen and Zicheng Zhang and Guofeng Zhang and Haodong Duan and Kai Chen and Dahua Lin},
                year={2025},
                eprint={2503.14478},
                archivePrefix={arXiv},
                primaryClass={cs.CV},
                url={https://arxiv.org/abs/2503.14478}, 
          }