GPT-NeoX-20B, a 20-billion-parameter natural language processing (NLP) AI model similar to GPT-3, has been open-sourced by EleutherAI researchers. The model was trained on 825GB of publicly available text data and performs comparably to GPT-3 models of similar size.
The EleutherAI blog announced the release. GPT-NeoX-20B was trained on EleutherAI’s open source Pile dataset using NVIDIA A100-SXM4-40GB GPUs. On a range of standard NLP benchmark tasks, it achieved accuracy roughly matching a linear interpolation between OpenAI’s Curie and DaVinci models, and its one-shot performance on the MATH test dataset outperformed GPT-3 175B. EleutherAI claims that GPT-NeoX-20B is the largest open source pre-trained autoregressive language model available; for context, OpenAI first published a paper on generative pre-trained transformers (GPT) in 2018 and released their 1.5B-parameter GPT-2 model in 2019.
OpenAI announced the GPT-3 model with 175B parameters in 2020 but did not release the trained model files. Instead, OpenAI offered an API that allows developers to integrate the model into their programs via web service calls. Larger models open-sourced since then include Megatron-11B, Pangu-13B, Meta’s Fairseq 13B, and EleutherAI’s older models, GPT-Neo and GPT-J-6B, which InfoQ covered last year.
In addition to these open source models, there are even larger models, such as GPT-3, with hundreds of billions or even trillions of parameters. However, EleutherAI claims that these are “almost universally” either API-gated or not publicly available at all. EleutherAI’s motivation for publishing their models is their belief that open access to models of this scale is essential for advancing research in the field.
GPT-NeoX-20B has a similar architecture to GPT-3, with a few important modifications. First, GPT-NeoX-20B encodes token position using rotary positional embeddings rather than learned embeddings. Second, GPT-NeoX-20B computes the attention and feed-forward layers in parallel rather than serially, yielding a 15% boost in throughput. Finally, GPT-NeoX-20B uses only dense layers, whereas GPT-3 alternates sparse and dense layers.
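To illustrate the first of these changes, here is a minimal NumPy sketch of rotary positional embeddings; the function name and the pairing of dimensions (first half with second half) are illustrative conventions, not EleutherAI’s actual implementation. The idea is to rotate each pair of query/key dimensions by an angle proportional to the token’s position, so that attention scores end up depending on relative rather than absolute position:

```python
import numpy as np

def rotary_embed(x, base=10000.0):
    """Apply rotary positional embeddings to x of shape (seq_len, dim).

    Each dimension pair (x1[i], x2[i]) is rotated by angle pos * inv_freq[i],
    where pos is the token's position in the sequence.
    """
    seq_len, dim = x.shape
    half = dim // 2
    # one rotation frequency per dimension pair, geometrically spaced
    inv_freq = base ** (-np.arange(half) * 2.0 / dim)
    # rotation angle for every (position, frequency) combination
    angles = np.outer(np.arange(seq_len), inv_freq)  # (seq_len, half)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, :half], x[:, half:]
    # standard 2D rotation applied pairwise
    return np.concatenate([x1 * cos - x2 * sin,
                           x1 * sin + x2 * cos], axis=-1)
```

Because each step is a pure rotation, vector norms are preserved, and the token at position 0 is left unchanged. The second change (parallel sublayers) amounts to computing roughly `x + attn(ln1(x)) + ff(ln2(x))` per block instead of feeding the attention output into the feed-forward layer sequentially.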
EleutherAI’s own software (also known as GPT-NeoX), built on Megatron and DeepSpeed and implemented in PyTorch, was used to train the model. Because the model was too large to fit on a single GPU, the team combined model parallelism with data parallelism during training. Furthermore, because the team’s compute budget limits made hyperparameter search “intractable,” they elected to re-use the GPT-3 paper’s hyperparameters.
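To make the model-parallelism idea concrete, here is a minimal single-process NumPy sketch of Megatron-style tensor parallelism (all names are hypothetical, and real systems run the shards on separate GPUs with communication primitives): a weight matrix is split column-wise across devices, each device computes a partial output against the full input, and the partial outputs are concatenated.

```python
import numpy as np

def column_parallel_matmul(x, w_shards):
    """Simulate a column-parallel linear layer.

    Each 'device' holds one shard of the weight matrix's columns and
    computes its slice of the output independently; concatenating the
    slices reproduces the full matmul.
    """
    return np.concatenate([x @ w for w in w_shards], axis=-1)

rng = np.random.default_rng(0)
x = rng.standard_normal((2, 8))       # a batch of activations
w = rng.standard_normal((8, 16))      # full weight matrix
shards = np.split(w, 2, axis=1)       # two 'devices', 8 columns each
out = column_parallel_matmul(x, shards)
assert np.allclose(out, x @ w)        # sharded result matches full matmul
```

Data parallelism is complementary: each replica holds a full copy of the (sharded) model and processes a different slice of the batch, averaging gradients across replicas.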
GPT-NeoX-20B was evaluated on a “wide range” of NLP benchmarks, including LAMBADA and WinoGrande, as well as the HendrycksTest knowledge benchmark and the MATH dataset. The researchers compared its performance to that of their prior GPT-J-6B model, Meta’s FairSeq 13B, and a variety of GPT-3 sizes. They claim that GPT-NeoX-20B’s performance on NLP tasks “might be enhanced,” but that it “excels” on science and math tests.