Amazon AI researchers propose RescoreBERT, a BERT rescoring model trained with discriminative objective functions that improves ASR rescoring

This article is written as a summary by Marktechpost Staff based on the research paper 'RESCOREBERT: DISCRIMINATIVE SPEECH RECOGNITION RESCORING WITH BERT'. All credit for this research goes to the researchers of this project. Check out the paper and blog.


Automatic speech recognition (ASR) is an interdisciplinary subfield of computer science and computational linguistics concerned with methods for recognizing spoken language and converting it to text. Alexa, Siri, and Bixby are just a few of the many examples of ASR models in action. Because the amount of training data available for ASR models is limited, a model may struggle with unusual words and phrases. As a result, the ASR model's hypotheses are frequently passed to a language model, which represents the probabilities of word sequences and is trained on considerably more text data. The language model re-ranks the hypotheses to increase the accuracy of the ASR output.

BERT (Bidirectional Encoder Representations from Transformers) is a widely used Transformer-based model in NLP. BERT can be used as a rescoring model by masking each input token in turn and computing its log-likelihood from the rest of the input. The per-token results are then summed into a total score called the PLL (pseudo-log-likelihood). However, because this requires one pass through the model for every token, the calculation is too slow for real-time ASR. Most ASR models therefore use a more efficient Long Short-Term Memory (LSTM) language model for rescoring.
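
To make the cost of PLL scoring concrete, here is a minimal Python sketch of the masking loop described above. It is not the researchers' code: the `masked_log_prob` callback stands in for a real BERT forward pass, and `toy_masked_log_prob` with its `TOY_UNIGRAM` table is a made-up scorer that ignores context, included only so the example runs on its own.

```python
import math
from typing import Callable, Sequence

MASK = "[MASK]"

def pseudo_log_likelihood(
    tokens: Sequence[str],
    masked_log_prob: Callable[[Sequence[str], int, str], float],
) -> float:
    """Sum log P(token_i | all other tokens) over every position i."""
    total = 0.0
    for i, target in enumerate(tokens):
        masked = list(tokens)
        masked[i] = MASK  # hide exactly one token per pass
        # One model call per token: this is why PLL is slow for long inputs.
        total += masked_log_prob(masked, i, target)
    return total

# Hypothetical stand-in for a masked language model (assumption, not BERT):
# it ignores the context and looks the target up in a tiny unigram table.
TOY_UNIGRAM = {"play": 0.4, "some": 0.3, "jazz": 0.2, "chaz": 0.1}

def toy_masked_log_prob(masked_tokens: Sequence[str], position: int, target: str) -> float:
    return math.log(TOY_UNIGRAM.get(target, 1e-6))

pll_common = pseudo_log_likelihood(["play", "some", "jazz"], toy_masked_log_prob)
pll_rare = pseudo_log_likelihood(["play", "some", "chaz"], toy_masked_log_prob)
print(pll_common > pll_rare)
```

A sentence with N tokens needs N calls to the scorer, so with a real BERT model the loop would run N full forward passes per hypothesis.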

RescoreBERT is a new model developed by Amazon researchers that harnesses the power of BERT for second-pass rescoring. The paper proposing the model and its experimental analysis will be presented at this year's International Conference on Acoustics, Speech and Signal Processing (ICASSP). Compared to a typical LSTM-based rescoring model, RescoreBERT reduces the error rate of an ASR model by 13%. Moreover, the model remains efficient enough for commercial deployment thanks to a combination of knowledge distillation and discriminative training. RescoreBERT has also been deployed with the Alexa Teacher Model, a massive, pre-trained, multi-language model with billions of parameters that encodes the language and core patterns of Alexa interactions.

The key technique behind RescoreBERT is rescoring. A second-pass language model trained from scratch on a small amount of data can use rescoring to prioritize and re-rank hypotheses containing rare words accurately. To reduce the computational expense of computing PLL scores, the researchers build on previous work from Amazon: the output of the BERT model is fed through a neural network trained to mimic the PLL scores assigned by a larger "teacher" model. Since the distilled model is trained to match the teacher's predictions on the masked inputs, this process is known as MLM (masked language model) distillation. The output of the distilled model is then interpolated with the first-pass score to obtain a final score. This method reduces latency by distilling the PLL scores of a large BERT model into a much smaller BERT model.
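
The interpolation step can be illustrated with a short Python sketch. The `Hypothesis` class, the `weight` parameter, and the example scores are all invented for illustration, and the form `first_pass + weight * second_pass` is an assumed way of writing a linear interpolation. Scores are treated as costs, so the lowest interpolated score wins.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Hypothesis:
    text: str
    first_pass: float   # score from the ASR model (lower is better)
    second_pass: float  # score from the rescoring model (lower is better)

def interpolated_score(hyp: Hypothesis, weight: float) -> float:
    """Linear interpolation of the two passes (assumed form)."""
    return hyp.first_pass + weight * hyp.second_pass

def rerank(nbest: List[Hypothesis], weight: float) -> List[Hypothesis]:
    """Return the n-best list sorted from best (lowest score) to worst."""
    return sorted(nbest, key=lambda h: interpolated_score(h, weight))

# Made-up n-best list: the first pass slightly prefers the wrong transcript.
nbest = [
    Hypothesis("play some chaz", first_pass=4.0, second_pass=9.0),
    Hypothesis("play some jazz", first_pass=4.5, second_pass=3.0),
]

best = rerank(nbest, weight=0.5)[0]
print(best.text)
```

With `weight=0.0` the first-pass ranking is left unchanged, so the weight controls how much the second pass is allowed to overturn the first.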


Because the first-pass and second-pass scores are linearly interpolated, it is not enough for the rescoring model to assign a low score to the correct hypothesis. The interpolated score of the correct hypothesis must also be the lowest of all the hypotheses. As a result, the researchers decided that it would be useful to take the first-pass scores into account when training the second-pass model. MLM distillation, on the other hand, only tries to distill the PLL scores and hence ignores the results of the first pass. Discriminative training, applied after MLM distillation, is what brings the first-pass scores into play. RescoreBERT was trained to minimize ASR errors when the score linearly interpolated between the first-pass and second-pass scores is used to re-rank hypotheses.

Previously, the MWER (minimum word error rate) loss function was used to minimize the expected number of word errors predicted from the ASR hypothesis scores. The researchers developed a new loss function called MWED (matching word error distribution), which matches the distribution of hypothesis scores to the distribution of word errors across the individual hypotheses. MWED has been shown to be a strong alternative to standard MWER for improving the model's performance in English, but it does not produce equivalent results in Japanese. The benefits of discriminative training are demonstrated by the results: RescoreBERT trained with a discriminative objective reduces WER by 7% to 12%, while BERT trained with MLM distillation reduces it by only 3% to 6%.
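
The two loss functions can be sketched in a few lines of Python. The formulas below are an assumption based on the descriptions above, not taken from the researchers' code: MWER is written as the expected word-error count under a softmax over negated scores, and MWED as a cross-entropy between a softmax over word errors and a softmax over scores. The `temperature` parameter and the example numbers are hypothetical.

```python
import math
from typing import List, Sequence

def softmax(xs: Sequence[float]) -> List[float]:
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    z = sum(exps)
    return [e / z for e in exps]

def mwer_loss(scores: Sequence[float], errors: Sequence[float]) -> float:
    """Expected word errors, relative to the mean, over the n-best list.

    Lower scores are better, so scores are negated before the softmax.
    """
    posteriors = softmax([-s for s in scores])
    mean_err = sum(errors) / len(errors)
    return sum(p * (e - mean_err) for p, e in zip(posteriors, errors))

def mwed_loss(scores: Sequence[float], errors: Sequence[float],
              temperature: float = 1.0) -> float:
    """Cross-entropy between the word-error and score distributions.

    Hypotheses with more word errors should receive higher (worse) scores.
    """
    target = softmax([e / temperature for e in errors])
    predicted = softmax(scores)
    return -sum(t * math.log(p) for t, p in zip(target, predicted))

errors = [0, 2, 1]              # word errors per hypothesis (made up)
good_scores = [1.0, 5.0, 3.0]   # fewest errors gets the lowest score
bad_scores = [5.0, 1.0, 3.0]    # ranking inverted

print(mwer_loss(good_scores, errors) < mwer_loss(bad_scores, errors))
print(mwed_loss(good_scores, errors) < mwed_loss(bad_scores, errors))
```

In this sketch both losses are lower when the interpolated scores rank the hypotheses in the same order as their word errors, which is the behavior discriminative training is meant to encourage.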

James G. Williams