Amazon AI researchers have proposed “DQ-BART”: a jointly distilled and quantized BART model that achieves a 16.5x model footprint compression ratio

This article was written as a summary by Marktechpost staff based on the paper 'DQ-BART: Efficient Sequence-to-Sequence Model via Joint Distillation and Quantization'. All credit for this research goes to the researchers of this project. Check out the paper and post.

Pre-trained sequence-to-sequence (seq2seq) models such as BART and T5 have performed very well on a range of natural language processing tasks, including text summarization, machine translation, question answering, and information extraction. But these large-scale pre-trained language models have hundreds of millions of parameters or more: BART was trained with 400 million parameters, while T5 pushed the limit to 11 billion parameters.

Constantly growing model sizes mean that a lot of compute and memory is needed at inference time. This makes deployment very difficult, especially in real-time, resource-limited settings, and it motivates researchers to make large pre-trained models smaller and faster while maintaining their good performance. Quantization approaches have received a lot of attention lately because they reduce the model footprint by using fewer bits for weight values without changing the carefully designed model architecture. Most previous work on quantizing transformers has focused on BERT-based models, and there has not been enough research on the effective quantization of encoder-decoder transformers. Earlier work showed that 8-bit quantization of a seq2seq transformer is possible without a significant drop in performance. Low-bit quantization remained difficult for that model, however (see the 4-bit performance in Table 2 of that work), because quantization errors tend to accumulate in seq2seq models. Additionally, that work did not attempt to quantize large-scale pre-trained language models, nor did it address NLP tasks other than machine translation. For BERT compression, a lot of research has been done on model distillation, which transfers knowledge from a large teacher model to a smaller student model.

Recent sequence-to-sequence pre-trained language models such as BART (Bidirectional and Auto-Regressive Transformers) have significantly improved performance on many NLP tasks. A typical BART model can have hundreds of millions of parameters, so the success of these models comes at the cost of a lot of memory and processing power.

This may make it impossible to use BART on resource-constrained devices, such as cell phones or smart home appliances. Scientists from the Amazon Web Services Artificial Intelligence Labs presented a paper at ACL 2022 that addresses this problem by using distillation and quantization to reduce a BART model to less than 1/16th of its original size without too much performance loss.

A two-part plan

Quantization, which maps high-precision values onto a limited menu of lower-precision values, and distillation, in which a smaller, more efficient student model is trained to mimic a larger, more powerful teacher model, are two common ways to reduce neural networks' memory footprints.
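To make the idea of a "limited menu" of values concrete, here is a minimal sketch of uniform min-max quantization in plain Python. It is a generic illustration, not necessarily the quantizer used in the paper; the function name `quantize` and the sample weights are assumptions made for this example.

```python
def quantize(values, bits):
    """Map each float to the nearest of 2**bits evenly spaced levels.

    Generic min-max uniform quantization (an illustrative assumption,
    not necessarily the scheme used for DQ-BART).
    """
    lo, hi = min(values), max(values)
    steps = 2 ** bits - 1              # number of gaps between lo and hi
    if hi == lo or steps == 0:
        return [lo for _ in values]    # nothing to spread out
    step = (hi - lo) / steps
    return [lo + round((v - lo) / step) * step for v in values]


# Hypothetical weights, quantized to 2 bits (at most 4 distinct values).
weights = [-1.0, -0.4, 0.1, 0.35, 1.0]
quantized = quantize(weights, bits=2)
```

With 2 bits, the five weights collapse onto at most four distinct values, which is why each one can be stored with a 2-bit index instead of a 32-bit float.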

Researchers begin by fine-tuning a BART model, which serves as the “teacher model,” on a specific NLP task, such as question answering or text summarization. The weights of selected layers of the trained teacher model are then copied into a smaller student model. This is the distillation step, which reduces the footprint of the model.
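The layer-copying step can be sketched as follows. The article does not say which teacher layers are copied, so the evenly spaced selection below is an assumption, and `init_student` and the toy layer dictionaries are hypothetical names used only for illustration.

```python
import copy


def init_student(teacher_layers, num_student_layers):
    """Build a student from a subset of teacher layers.

    Assumption: layers are picked at evenly spaced positions; the
    article only says that particular layer weights are copied.
    """
    n = len(teacher_layers)
    k = num_student_layers
    if not 1 <= k <= n:
        raise ValueError("student must have between 1 and n layers")
    if k == 1:
        indices = [n - 1]
    else:
        indices = [(i * (n - 1)) // (k - 1) for i in range(k)]
    # Deep-copy so later student updates do not modify the teacher.
    return [copy.deepcopy(teacher_layers[i]) for i in indices]


# Hypothetical 12-layer teacher with toy weights.
teacher = [{"name": f"layer{i}", "weights": [0.1 * i, -0.1 * i]} for i in range(12)]
student = init_student(teacher, 3)
```

The student starts from trained weights rather than random ones, so it only has to learn to compensate for the missing layers.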

Distillation-aware quantization is the next step. The student model is quantized, producing a low-precision model. The full-precision student model is also kept on hand, as it is needed for the next step.

The quantized student model then processes the dataset used to train the teacher model. Its outputs are evaluated using two metrics: the standard task-based loss, which measures how much the results differ from the ground truth, and a distillation loss, which measures the difference between the quantized, distilled student model and the teacher model.
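The two losses can be illustrated with a small sketch. It assumes cross-entropy for the task loss and a KL divergence between teacher and student output distributions for the distillation loss; the exact loss terms and the weighting factor `alpha` are assumptions for illustration, not details taken from the article.

```python
import math


def softmax(logits):
    m = max(logits)                      # subtract max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def task_loss(student_logits, target_index):
    """Cross-entropy of the student's prediction against the ground truth."""
    return -math.log(softmax(student_logits)[target_index])


def distillation_loss(student_logits, teacher_logits):
    """KL divergence from the teacher's distribution to the student's."""
    p = softmax(teacher_logits)
    q = softmax(student_logits)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def total_loss(student_logits, teacher_logits, target_index, alpha=1.0):
    # alpha is a hypothetical weighting between the two terms.
    return (task_loss(student_logits, target_index)
            + alpha * distillation_loss(student_logits, teacher_logits))


# Toy logits over a 3-token vocabulary; the correct token is index 0.
loss = total_loss([2.0, 0.5, 0.1], [3.0, 0.2, 0.1], target_index=0)
```

The distillation term is zero only when the student reproduces the teacher's output distribution exactly, so it keeps pulling the quantized student toward the teacher even where the ground-truth label alone is uninformative.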

Then, these losses are used to update the parameters of the full-precision student model, not the quantized one. This is because the standard algorithm for updating a neural network, gradient descent, requires model parameters that can change continuously. In a quantized model, the parameters take discrete values, so they cannot be adjusted in the small increments that gradient descent relies on.

After the full-precision student model has been updated to minimize its error on the training set and its difference from the teacher model, it is quantized again to reduce the amount of memory it uses.
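The quantize, evaluate, update, and re-quantize cycle described above can be sketched with a toy one-parameter model. Passing the gradient computed at the quantized weight straight to the full-precision copy is commonly called a straight-through estimator; the article does not name that technique, so treating the update this way is an assumption, as are the fixed grid step, the learning rate, and the squared-error losses.

```python
def quantize(w, step=0.25):
    """Snap a weight to the nearest multiple of `step` (toy fixed grid)."""
    return round(w / step) * step


def train(w_full, data, teacher_w, lr=0.05, epochs=50):
    """Toy distillation-aware quantization loop for the model y = w * x."""
    for _ in range(epochs):
        for x, y_true in data:
            w_q = quantize(w_full)       # forward pass uses the quantized weight
            y_pred = w_q * x
            y_teacher = teacher_w * x
            # Gradient of (y_pred - y_true)**2 + (y_pred - y_teacher)**2
            # with respect to the quantized weight w_q.
            grad = 2 * (y_pred - y_true) * x + 2 * (y_pred - y_teacher) * x
            w_full -= lr * grad          # update the full-precision copy instead
    return w_full, quantize(w_full)      # final re-quantization


# Hypothetical data generated by a "true" weight of 1.5; the teacher matches it.
data = [(x, 1.5 * x) for x in (1.0, 2.0, -1.0)]
w_full, w_q = train(0.0, data, teacher_w=1.5)
```

The full-precision weight drifts in small steps until its quantized version lands on the grid point that minimizes both losses, which would not be possible if the updates were applied to the discrete weight directly.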


The researchers tested their distilled, quantized BART model on text summarization and long-form question answering across three different benchmarks. They also looked at how distillation-aware quantization would work on a more complicated model like mBART, a multilingual model designed to translate sentences between languages, such as English and Romanian.

Their first analysis revealed that the combination of distillation and quantization gave better compression than quantization alone. There was no performance drop on the long-form question answering task and just a slight performance drop on the summarization task. They also found that the model can be shrunk to nearly 1/28th of its original size. But at this compression rate, the performance of the model is not consistent across tasks; the right amount of compression must be determined for each task.
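A back-of-the-envelope calculation shows why combining the two techniques compresses more than quantization alone. The sketch below assumes every parameter is stored at the same bit width and that the teacher uses 32-bit weights; the parameter counts are hypothetical and are not the configurations reported in the paper.

```python
def compression_ratio(teacher_params, student_params, student_bits, teacher_bits=32):
    """Ratio of teacher footprint to student footprint, in bits.

    Simplifying assumption: all parameters share one bit width.
    """
    teacher_size = teacher_params * teacher_bits
    student_size = student_params * student_bits
    return teacher_size / student_size


# Hypothetical numbers for illustration only.
quant_only = compression_ratio(400_000_000, 400_000_000, student_bits=8)
distill_and_quant = compression_ratio(400_000_000, 200_000_000, student_bits=8)
```

Under these assumptions, 8-bit quantization alone gives a 4x reduction, while halving the parameter count as well gives 8x: the two factors multiply. Real ratios depend on which parts of the model are quantized and to how many bits.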

For the mBART task, the team found that the distillation-aware approach was effective in reducing the model footprint when using eight-bit quantization. However, when the number of quantization bits was reduced to two, performance began to drop more significantly. The researchers believe that this drop was caused by accumulated distillation and quantization errors, which can be more severe for the complex problem of machine translation.


In future work, the researchers want to learn more about the multilingual mBART model and test other compression methods, such as head pruning and sequence-level distillation. Since the current study focused primarily on memory footprint, they also want to examine the effects on latency.

Pre-trained, transformer-based seq2seq language models such as BART have greatly improved the state of the art in several NLP tasks. Yet these large-scale models are difficult to deploy when resources are limited. DQ-BART, a jointly distilled and quantized BART model, was proposed to address this problem. Even though language generation tasks are complex, empirical results show that a model footprint compression ratio of 16.5x was achieved with little performance degradation on three generative benchmarks. The tradeoff between performance and efficiency for seq2seq models was also explored at compression ratios of up to 27.7x. In addition, distillation and quantization were studied for mBART on a machine translation task, which highlighted the difficulty of combining low-bit quantization with distillation for deeper models on multilingual tasks. According to the researchers, the method is the first to jointly apply quantization and distillation to pre-trained language models, and the first work to distill and quantize pre-trained seq2seq models for language generation tasks.

James G. Williams