
CELLM is a 1.5B-parameter open-source Chinese education LLM released with all models, data, and code to democratise educational AI research.
CELLM (Chinese Education Large Language Model) is a 1.5B-parameter open-source LLM developed specifically for Chinese educational applications. It addresses two major gaps in current LLM development: the lack of transparency in training processes and the scarcity of high-quality Chinese educational datasets relative to English ones.
The project introduces a fully transparent training pipeline and publicly releases all models, data, and code, establishing CELLM as a foundational resource for Chinese educational AI research. The authors establish performance baselines across 11 evaluation datasets, including C-Eval, CMMLU, and MMLU.
A key innovation is the creation of Chinese-fineweb-edu-v2, a domain-specific pretraining corpus in which 25.4% is industry content and 18.6% is safety content, alongside additional educational resources. The team also developed a multi-turn dialogue translation framework and used it to convert 258,000 English instructional entries into Chinese with 97.7% accuracy, significantly expanding the available pool of Chinese educational data.
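The paper's translation framework is not reproduced here, but its overall shape can be sketched. The following is a minimal, hypothetical Python sketch: the `translate` and `verify` callables stand in for whatever translation model and quality check the authors actually used. It translates each dialogue turn while preserving the role structure, and retains only dialogues that pass verification, which is one way an accuracy figure like 97.7% could be measured or enforced.

```python
from typing import Callable

# A dialogue is a list of turns, e.g.
# [{"role": "user", "content": "..."}, {"role": "assistant", "content": "..."}]
Dialogue = list[dict[str, str]]

def translate_dialogue(dialogue: Dialogue,
                       translate: Callable[[str], str]) -> Dialogue:
    """Translate every turn's content while preserving roles and turn order."""
    return [{"role": turn["role"], "content": translate(turn["content"])}
            for turn in dialogue]

def translate_corpus(corpus: list[Dialogue],
                     translate: Callable[[str], str],
                     verify: Callable[[Dialogue, Dialogue], bool]) -> list[Dialogue]:
    """Translate a corpus, keeping only dialogues that pass a quality check
    (e.g. back-translation comparison or an LLM judge)."""
    translated = []
    for dialogue in corpus:
        zh = translate_dialogue(dialogue, translate)
        if verify(dialogue, zh):  # hypothetical quality gate
            translated.append(zh)
    return translated
```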
CELLM adopts a causal-decoder architecture with grouped-query attention (GQA) and rotary positional encoding (RoPE), optimised for educational contexts. The paper reports the full vocabulary size (151,936 tokens) and training budget (33.6B pretraining tokens, 16B fine-tuning tokens).
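To make the architecture concrete, here is a minimal PyTorch sketch of a causal attention block combining GQA and RoPE. The head counts and dimensions are illustrative placeholders, not CELLM's actual configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def apply_rope(x: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    """Rotary positional encoding over (batch, heads, seq, head_dim)."""
    _, _, t, d = x.shape
    half = d // 2
    freqs = base ** (-torch.arange(half, device=x.device).float() / half)
    angles = torch.arange(t, device=x.device).float()[:, None] * freqs[None, :]
    cos, sin = angles.cos(), angles.sin()          # (t, half), broadcast below
    x1, x2 = x[..., :half], x[..., half:]
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)

class GroupedQueryAttention(nn.Module):
    """Causal self-attention where several query heads share one K/V head."""
    def __init__(self, d_model: int, n_q_heads: int, n_kv_heads: int):
        super().__init__()
        assert n_q_heads % n_kv_heads == 0
        self.n_q, self.n_kv = n_q_heads, n_kv_heads
        self.head_dim = d_model // n_q_heads
        self.wq = nn.Linear(d_model, n_q_heads * self.head_dim, bias=False)
        self.wk = nn.Linear(d_model, n_kv_heads * self.head_dim, bias=False)
        self.wv = nn.Linear(d_model, n_kv_heads * self.head_dim, bias=False)
        self.wo = nn.Linear(n_q_heads * self.head_dim, d_model, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, t, _ = x.shape
        q = self.wq(x).view(b, t, self.n_q, self.head_dim).transpose(1, 2)
        k = self.wk(x).view(b, t, self.n_kv, self.head_dim).transpose(1, 2)
        v = self.wv(x).view(b, t, self.n_kv, self.head_dim).transpose(1, 2)
        q, k = apply_rope(q), apply_rope(k)
        # Each K/V head serves n_q // n_kv query heads.
        k = k.repeat_interleave(self.n_q // self.n_kv, dim=1)
        v = v.repeat_interleave(self.n_q // self.n_kv, dim=1)
        out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        return self.wo(out.transpose(1, 2).reshape(b, t, -1))
```

With, say, 16 query heads sharing 4 K/V heads, every key/value head serves four query heads, shrinking the KV cache to a quarter of standard multi-head attention, which is the usual motivation for GQA at small model scales.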
In evaluations, CELLM shows relative strengths in humanities (26.77% accuracy) and social sciences (26.35%), with weaker performance in STEM (21.48%) and programming (0.6 on MBPP). While the model is smaller than its commercial counterparts, future plans include expanding the pretraining data and exploring alignment techniques to improve STEM outcomes.
By releasing all components openly, CELLM democratises educational LLM development in non-English contexts, enabling community-driven research, fostering open collaboration in Chinese educational AI, and establishing essential infrastructure for future models.