Introducing PolyLM: An Open Source Multilingual LLM for Diverse Language Processing

July 21, 2023

PolyLM is an open-source multilingual large language model (LLM) trained on 640B tokens and available in two sizes, 1.7B and 13B parameters. It covers 15 major non-English languages and is trained with a curriculum learning strategy intended to strengthen performance outside English.

Researchers from DAMO Academy and Alibaba Group have introduced PolyLM (Polyglot Large Language Model), an open-source multilingual Large Language Model designed to address the limitations of existing models. Available in two model sizes, 1.7B and 13B, PolyLM offers advanced capabilities in understanding, reasoning, and generating text across multiple languages.

Key Features of PolyLM:

Proficiency in major non-English languages: PolyLM is proficient in languages including Spanish, Russian, Arabic, Japanese, Korean, Thai, Indonesian, and Chinese, complementing existing models.
Advanced curriculum learning approach: PolyLM’s training strategy facilitates knowledge transfer from English to other languages, enhancing its multilingual performance.
MULTIALPACA dataset: To improve understanding of multilingual instructions, PolyLM utilizes the MULTIALPACA dataset, which provides high-quality multilingual instruction data.

The researchers trained PolyLM on 640B tokens drawn from sources such as Wikipedia, mC4, and CC-100. They used a curriculum learning technique: training initially concentrates on English, then gradually shifts focus toward low-resource languages. This approach is intended to transfer general knowledge learned in English to the other languages.
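
To make the curriculum idea concrete, here is a minimal sketch of a data-sampling schedule that starts mostly on English and gradually raises the share of non-English text. It is an illustration, not the PolyLM training code: the function names, the linear ramp, and the start/end shares (0.30 and 0.60) are assumptions chosen for the example.

```python
import random


def non_english_share(step, total_steps, start=0.30, end=0.60):
    """Share of non-English data at a given training step.

    Assumption: a linear ramp from `start` to `end`. The values are
    illustrative placeholders, not figures taken from this article.
    """
    if total_steps <= 1:
        return end
    progress = min(max(step / (total_steps - 1), 0.0), 1.0)
    return start + (end - start) * progress


def sample_language(step, total_steps, non_english_langs, rng):
    """Pick the language of the next training example.

    Early steps mostly return "en"; later steps return a non-English
    language more often, following the schedule above.
    """
    if rng.random() < non_english_share(step, total_steps):
        return rng.choice(non_english_langs)
    return "en"


if __name__ == "__main__":
    rng = random.Random(0)
    langs = ["es", "ru", "ar", "ja", "ko", "th", "id", "zh"]
    for step in (0, 500, 999):
        share = non_english_share(step, 1000)
        print(step, round(share, 2), sample_language(step, 1000, langs, rng))
```

A real pipeline would more likely switch proportions in discrete stages and weight each language individually, but the sketch captures the core mechanism: the sampling mix is a function of training progress.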

PolyLM’s evaluation involved a benchmark comprising multilingual tasks such as question answering, language understanding, text generation, and cross-lingual machine translation. The experiments showcased PolyLM’s superior performance in non-English languages compared to existing models of similar size. The model’s multilingual capabilities were further enhanced through the use of multilingual instruction data.

With PolyLM’s introduction, the AI community now has access to a powerful multilingual Large Language Model. Its proficiency in major non-English languages and advanced training techniques make it a significant milestone in the field of natural language processing.