BigScience AI Researchers Open-Source “BLOOM”: An Autoregressive Multilingual Large Language Model Larger than GPT-3 and OPT-175B

Businesses are increasingly adopting ML and AI technologies to improve their services and goods. These systems include language models for various tasks, such as predicting the next word you’ll type on your mobile phone so you can finish the message faster.

In recent years, large machine learning (ML) models have revolutionized AI research. Yet only a few teams have been able to train and study them because of the high computational cost and the massive training data involved. Moreover, details about how these models are built, along with their metadata and code, often remain unshared and out of reach of the wider AI community.

To address these shortcomings, the BigScience project introduces BLOOM (BigScience Large Open-science Open-access Multilingual Language Model), the first multilingual large language model (LLM) trained in full transparency by the largest group of AI researchers. In contrast to the secrecy typical of industrial AI research labs, the project demonstrates that promising AI models can be trained and released responsibly and openly by the wider research community.

The BigScience research project was launched in 2021. It involves around 1000 researchers from more than 60 countries and more than 250 institutions. The research is being conducted by Hugging Face with support from GENCI, the IDRIS team at CNRS, the Megatron team at NVIDIA and the DeepSpeed team at Microsoft. Hugging Face has released a free web app that lets anyone try BLOOM without having to download it.

Every model used in this study is a decoder-only pre-trained Transformer with an autoregressive language modeling objective. As the researchers note in their article, “What language model to train if you have a million GPU hours?”, this architecture is a frequent choice for large language models because it can be applied immediately to many downstream tasks.
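To make "autoregressive language modeling" concrete, here is a minimal sketch in Python. It is not BLOOM's code and not a Transformer: a toy bigram counter stands in for the network, and the corpus and all function names are hypothetical. It only illustrates the two ideas behind the objective: the score of a sequence is the sum of log p(x_t | x_<t), and text is generated by predicting one token at a time and feeding it back in.

```python
import math
from collections import Counter, defaultdict


def train_bigram(tokens):
    """Count next-token frequencies per token (a toy stand-in for a Transformer)."""
    counts = defaultdict(Counter)
    for prev, nxt in zip(tokens, tokens[1:]):
        counts[prev][nxt] += 1
    return counts


def next_token_probs(counts, context):
    """Return p(next | context). This toy model only looks at the last token."""
    followers = counts[context[-1]]
    total = sum(followers.values())
    return {tok: n / total for tok, n in followers.items()}


def sequence_log_prob(counts, tokens):
    """Autoregressive objective: sum of log p(x_t | x_<t) over the sequence."""
    return sum(
        math.log(next_token_probs(counts, tokens[:t])[tokens[t]])
        for t in range(1, len(tokens))
    )


def generate(counts, prompt, max_new_tokens):
    """Greedy decoding: append the most likely next token, then feed it back in."""
    out = list(prompt)
    for _ in range(max_new_tokens):
        probs = next_token_probs(counts, out)
        if not probs:  # no known continuation for the last token
            break
        out.append(max(probs, key=probs.get))
    return out


# Hypothetical toy corpus, used only for illustration.
corpus = "the model reads the prompt and the model writes the answer".split()
counts = train_bigram(corpus)
print(generate(counts, ["the"], 3))
print(sequence_log_prob(counts, ["the", "model"]))
```

A real decoder-only model replaces the bigram table with a Transformer that conditions on the entire preceding context, but the generation loop keeps the same shape, which is why such models can be pointed at many downstream tasks simply by changing the prompt.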

The BLOOM model has 176 billion parameters and was trained for 11 weeks on the Jean Zay supercomputer in France. As a result, BLOOM can generate text in 46 natural languages and dialects and 13 programming languages. It can also follow prompts to perform tasks such as writing recipes, extracting data from news articles, or composing sentences that use newly coined words, although it has never been explicitly trained for these particular tasks. For many of these languages, including Spanish, French and Arabic, BLOOM is the first language model with more than 100 billion parameters ever created.
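To put the 176 billion parameter figure in perspective, the following back-of-the-envelope sketch estimates the memory needed just to hold the weights. The parameter count comes from the article; the numeric formats are illustrative assumptions (the article does not say which precision BLOOM is stored or served in), and the estimate ignores activations, optimizer state and any other runtime overhead.

```python
PARAMS = 176_000_000_000  # parameter count reported for BLOOM

# Bytes per parameter for common numeric formats (standard sizes).
# Which format is actually used for BLOOM is an assumption here.
BYTES_PER_PARAM = {"float32": 4, "bfloat16": 2, "int8": 1}


def weight_memory_gb(num_params, bytes_per_param):
    """Memory for the weights alone, in gigabytes (1 GB = 10**9 bytes)."""
    return num_params * bytes_per_param / 1e9


for fmt, size in BYTES_PER_PARAM.items():
    print(f"{fmt}: {weight_memory_gb(PARAMS, size):.0f} GB")
# float32: 704 GB
# bfloat16: 352 GB
# int8: 176 GB
```

Even at one byte per parameter, the weights alone exceed the memory of a single typical accelerator, which helps explain why so few teams have been able to train and study models of this size.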

According to their research, zero-shot generalization can be improved by supplementing Common Crawl data with high-quality cross-domain data. In their study of multilinguality, they found that multilingual models significantly underperform their monolingual counterparts on English zero-shot benchmarks.

To ensure that the training corpus was consistent with their values, the team adopted a data-driven strategy. The multidisciplinary and multinational structure of BigScience allowed them to critically evaluate each step of the process from different angles, including ethical, legal, environmental, linguistic and technical considerations, without compromising model performance.

To provide a framework for developing and publishing such models, the team has also released its Responsible AI License and a code of ethics. The project also demonstrated practical applications of scaling laws in building very large language models. Unlike previous efforts, this work provides comprehensive justifications for all architectural parameters.

Among the core principles that set it apart from similar work on very large language models are:

Open: All BigScience meeting minutes, discussions and code are accessible to the public. Throughout the process, the progress of the model's training has been made public, and all the statistics another team would need to reproduce this work have been provided. Numerous research articles written by hundreds of contributors have already been produced using BigScience’s “open first” methodology.

Accessibility: The team is building an easy-to-use API and making it freely available to all researchers.

Multilingualism: Unlike monolingual models such as LaMDA and GPT-3, BLOOM is multilingual, trained on 46 natural languages and 13 programming languages.

Researchers can now download and run BLOOM to study the performance and behavior of these newly developed massive language models, down to their most fundamental internal operations.

The President of BigScience believes that BigScience is distinctively participatory and people-oriented, bringing together the perspectives of thousands of multidisciplinary researchers from around the world. They believe this is the most effective way to work with those who use this technology to spread the values of accountability and inclusiveness.

The team believes that with continued workshops and experiments, BLOOM’s performance will continue to improve. The team plans to increase the number of languages and to reduce the size of the model while maintaining performance.



James G. Williams