Beijing Deep Logic has launched LLaSO, the first fully open-source framework for large speech language model (LSLM) research, designed to unify standards, lower barriers, and drive an ImageNet-like breakthrough in voice technology.
In August 2025, Beijing Deep Logic Intelligent Technology Co., Ltd. (EIT-NLP) released LLaSO, the world's first fully open-source large speech language model (LSLM) research framework. The initiative aims to transform LSLM research by addressing fragmentation, data opacity, and the lack of evaluation standards, while fostering global collaboration.
At the core of LLaSO are three major components. LLaSO-Align provides a 12-million-sample dataset for precise speech-text alignment built with automatic speech recognition (ASR). LLaSO-Instruct delivers 13.5 million samples across 20 tasks spanning linguistics, semantics, and paralinguistics, and supports three interaction modes: text-plus-audio, audio-plus-text, and audio-only (sketched below). LLaSO-Eval offers a standardised benchmark of 15,044 test samples with unified protocols to ensure fairness and reproducibility.
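To make the three interaction modes concrete, here is one illustrative sample per mode. The field names and file paths are hypothetical, for exposition only; they are not LLaSO's actual data schema.

```python
# Illustrative samples for the three modality configurations.
# All field names and paths are assumptions, not the real schema.

text_plus_audio = {          # text instruction, audio input
    "instruction": "Transcribe the following clip.",
    "input_audio": "clips/utterance_001.wav",
    "response": "hello world",
}

audio_plus_text = {          # spoken instruction, text input
    "instruction_audio": "clips/spoken_prompt_001.wav",
    "input_text": "hello world",
    "response": "The sentence is a greeting.",
}

audio_only = {               # spoken instruction and audio input
    "instruction_audio": "clips/spoken_prompt_002.wav",
    "input_audio": "clips/utterance_001.wav",
    "response": "The speaker sounds calm.",   # e.g. a paralinguistic task
}
```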
To validate the framework, the team trained LLaSO-Base, a 3.8-billion-parameter reference model that pairs a Whisper-large-v3 speech encoder with Llama-3.2-3B-Instruct through an MLP projector. It scored 0.72 on the LLaSO-Eval benchmark, outperforming models such as Kimi-Audio and Qwen2-Audio, and achieved outstanding ASR accuracy with a Word Error Rate (WER) of 0.08 and a Character Error Rate (CER) of 0.03, while demonstrating strong instruction-following and paralinguistic analysis capabilities.
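For readers who want to see how the three named pieces fit together, below is a minimal PyTorch sketch of this LLaVA-style composition: Whisper's encoder produces speech features, an MLP projects them into the LLM's embedding space, and the projected tokens can then be spliced into Llama's input. The projector's depth and activation are assumptions, since the release only says "MLP projector", and the actual LLaSO code may differ.

```python
import torch
import torch.nn as nn
from transformers import AutoModelForCausalLM, WhisperModel

class SpeechProjector(nn.Module):
    """Maps Whisper encoder features (dim 1280) into Llama-3.2-3B's
    embedding space (dim 3072). Two linear layers with a GELU between
    them is an assumption; the source only specifies 'an MLP'."""
    def __init__(self, in_dim: int = 1280, out_dim: int = 3072):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, out_dim),
            nn.GELU(),
            nn.Linear(out_dim, out_dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)

# The two pretrained components named in the article.
encoder = WhisperModel.from_pretrained("openai/whisper-large-v3").encoder
llm = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.2-3B-Instruct")
projector = SpeechProjector()

def speech_to_llm_embeddings(mel: torch.Tensor) -> torch.Tensor:
    """mel: (batch, 128, 3000) log-mel features from WhisperProcessor.
    Returns (batch, 1500, 3072) audio embeddings that an LSLM would
    concatenate with embedded text tokens before the Llama forward pass."""
    with torch.no_grad():
        audio_states = encoder(mel).last_hidden_state  # (batch, 1500, 1280)
    return projector(audio_states)
```

In the usual two-stage recipe for models of this shape, the alignment stage (here, LLaSO-Align data) would fit the projector, and instruction tuning (LLaSO-Instruct data) would adapt the language model; whether LLaSO-Base follows exactly this split is not stated in the article.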
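For context on those numbers, WER is the word-level edit distance between the model's transcript and a reference transcript, normalised by the reference length; CER is the same computation over characters. The sketch below is a standard reference implementation of the metric, not LLaSO's own scoring code.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference words,
    computed with the standard Levenshtein dynamic program."""
    ref, hyp = reference.split(), hypothesis.split()
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i                              # i deletions
    for j in range(len(hyp) + 1):
        d[0][j] = j                              # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,       # deletion
                          d[i][j - 1] + 1,       # insertion
                          d[i - 1][j - 1] + sub) # substitution or match
    return d[-1][-1] / max(len(ref), 1)

# A WER of 0.08 means roughly 8 word errors per 100 reference words;
# running the same program over character lists instead of .split() gives CER.
print(word_error_rate("the cat sat on the mat",
                      "the cat sat on a mat"))  # 1 substitution / 6 words = 0.1667
```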
The open-source strategy lowers research barriers by removing dependence on proprietary datasets, establishes unified technical standards, and enables reproducibility and fair comparison. It encourages a shift from isolated research to collaborative innovation and supports industry through reduced development costs, lower risk, and ecosystem customisation.
While current limitations include model scale, multilingual capability, real-time processing, and long-audio handling, LLaSO's open-source approach enables continuous improvement. The release is being positioned as a potential 'ImageNet moment' for LSLMs, laying the foundation for AI systems that can deeply understand human speech.