CREA-Eval: An Evaluation Benchmark for Assessing Large Language Models’ Understanding of Rare Earth-Related Questions

Abstract: This study addresses the lack of domain-specific evaluation benchmarks for large language models (LLMs) in the Chinese rare earth domain. To this end, the Chinese Rare Earth Ability Evaluation benchmark (CREA-Eval) is developed, comprising 2,443 high-quality items across five thematic categories and four question types, enabling efficient evaluation of the boundaries of LLMs’ rare earth-related capabilities. Data collection, annotation, and validation were conducted through a hybrid pipeline integrating human experts, LLM assistance, and automated scripts, ensuring rigorous quality control. Evaluation employs a combined strategy of LLM-based scoring and regular-expression matching tailored to each question type. Using CREA-Eval, 22 representative LLMs from six platforms were systematically assessed, with accuracy reported by theme and question type. Notably, drawing on educational examination principles, questions were further classified into objective and subjective types. A significant performance gap between these two categories was observed in several models. Quantitative analysis based on cosine similarity differences suggests that this discrepancy may arise because knowledge about rare earth topics in model training often originates from out-of-domain or cross-thematic sources, without sufficient in-domain textual organization, so that expressive and reasoning abilities lag behind factual knowledge acquisition. CREA-Eval provides a standardized, reproducible framework for evaluating, selecting, and fine-tuning LLMs in the rare earth sector, and can serve as an official benchmark for domain-specific model competitions and industrial applications.
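
As an illustration of the regular-expression side of the evaluation strategy described in the abstract, the following is a minimal Python sketch of how an objective (multiple-choice) response might be matched against a gold answer and how accuracy might be aggregated by question type. It is not the authors' implementation: the answer pattern, the option range A-D, the field names (`type`, `gold`, `response`), and the question-type label are all assumptions made for the example. Subjective items, which the abstract says are handled by LLM-based scoring, are not covered here.

```python
import re
from collections import defaultdict
from typing import Optional

# Assumed pattern: "answer" (any case) or "答案", an optional connector,
# then an option letter A-D that is not the first letter of a longer word.
ANSWER_PATTERN = re.compile(
    r"(?:(?i:answer)|答案)\s*(?:is|:|：|是|为)?\s*\(?([A-D])(?![A-Za-z])"
)
# Assumed fallback: the whole response is just an option letter.
BARE_LETTER_PATTERN = re.compile(r"\s*\(?([A-D])\)?[.。]?\s*")


def extract_choice(response: str) -> Optional[str]:
    """Return the option letter found in a model response, or None."""
    match = ANSWER_PATTERN.search(response)
    if match:
        return match.group(1)
    bare = BARE_LETTER_PATTERN.fullmatch(response)
    return bare.group(1) if bare else None


def accuracy_by_type(items):
    """Compute accuracy per question type for objective items.

    Each item is a dict with hypothetical keys: 'type', 'gold', 'response'.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for item in items:
        total[item["type"]] += 1
        if extract_choice(item["response"]) == item["gold"]:
            correct[item["type"]] += 1
    return {qtype: correct[qtype] / total[qtype] for qtype in total}
```

A response from which no option letter can be extracted is counted as incorrect in this sketch; a real pipeline might instead route such cases to manual or LLM-based review.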

Version History

[V1] 2026-04-13 14:28:22 ChinaXiv:202604.00191V1
Peer Review Status
Awaiting Review