ASAPP AI researchers propose SEW, a family of pre-trained models for automatic speech recognition (ASR) that could significantly outperform the existing Wav2Vec 2.0 architecture

Recent research in natural language processing and computer vision has aimed to improve the efficiency of pre-trained models in order to reduce the financial and environmental cost of training and fine-tuning them. Speech, however, has not yet seen comparable efforts. Efficiency improvements in speech could mean better performance at similar inference times, in addition to the cost savings that come from more efficient pre-training.

Thanks to its self-supervised training paradigm, Wav2Vec 2.0 (W2V2) is one of the current state-of-the-art models for automatic speech recognition. This training method makes it possible to pre-train a model on unlabeled data, which is always more plentiful, and then fine-tune it for a specific task on a given dataset. The use of pre-trained W2V2 models in downstream applications such as speech-to-text translation (Wang et al., 2021) and named entity recognition (Shon et al., 2021) has generated a lot of interest and follow-up work. However, the researchers believe that the model architecture contains several suboptimal design decisions that make it inefficient. To support this claim, they ran a series of experiments on various components of the W2V2 architecture, revealing the performance-efficiency trade-off in the W2V2 design space: higher performance (lower ASR word error rate) requires a larger pre-trained model and lower efficiency (slower inference).
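As a minimal sketch of the pre-train/fine-tune workflow, the snippet below loads an already fine-tuned W2V2 checkpoint from the Hugging Face hub and transcribes an audio clip. The checkpoint name and the 16 kHz placeholder waveform are assumptions for illustration; the actual SEW code lives in the GitHub repository linked below.

```python
# Minimal sketch: ASR inference with a pre-trained, fine-tuned Wav2Vec 2.0 model.
# Assumes the Hugging Face "transformers" library and a 16 kHz mono waveform.
import torch
from transformers import Wav2Vec2Processor, Wav2Vec2ForCTC

model_name = "facebook/wav2vec2-base-960h"  # assumed public checkpoint
processor = Wav2Vec2Processor.from_pretrained(model_name)
model = Wav2Vec2ForCTC.from_pretrained(model_name)

# `waveform` is a 1-D float tensor of 16 kHz audio samples (placeholder here).
waveform = torch.zeros(16000)  # one second of silence as a stand-in
inputs = processor(waveform, sampling_rate=16000, return_tensors="pt")

with torch.no_grad():
    logits = model(inputs.input_values).logits   # (batch, time, vocab)

pred_ids = torch.argmax(logits, dim=-1)          # greedy CTC decoding
transcription = processor.batch_decode(pred_ids)[0]
print(transcription)
```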

Is it possible to get a better trade-off, that is, the same performance with faster inference? The researchers' answer is a more efficient pre-trained model whose efficiency gains can also be converted into better performance.

Squeezed and Efficient Wav2vec (SEW)

The researchers propose SEW (Squeezed and Efficient Wav2vec) and SEW-D (SEW with Disentangled attention), which achieve a much better performance-efficiency trade-off on academic datasets: their smallest model, SEW-D-mid, achieves a 13.5% WERR (word error rate reduction) compared to W2V2-base together with a 1.9x inference speedup. Their larger SEW-D-base+ model performs on par with W2V2-large while running at the same speed as W2V2-base. It also needs only a quarter of the training epochs to surpass W2V2-base, which greatly reduces the pre-training cost.
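For readers unfamiliar with the metric, WERR is simply the relative drop in word error rate. The short snippet below shows the calculation with hypothetical WER values (the actual numbers are reported in the paper).

```python
# Word Error Rate Reduction (WERR): relative improvement over a baseline WER.
# The WER values below are hypothetical placeholders, not results from the paper.
def werr(wer_baseline: float, wer_new: float) -> float:
    return (wer_baseline - wer_new) / wer_baseline

print(f"{werr(10.0, 8.65):.1%}")  # e.g. 10.0% -> 8.65% WER gives a 13.5% WERR
```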


Three key differences separate SEW from traditional W2V2 models.

  1. First, the researchers introduce a compact waveform feature extractor that distributes computation more evenly across its layers. This improves model performance without compromising speed.
  2. Second, they propose a "squeezed context network," which reduces compute and memory usage by downsampling the audio sequence. This makes it possible to use a larger model without slowing down inference (a rough sketch of the idea follows this list).
  3. Third, the researchers introduce MLP prediction heads during pre-training, which improve performance without adding any overhead to downstream applications, since they are discarded after pre-training.
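The following is only a rough illustration, under our own simplifying assumptions, of what a downsample-then-upsample ("squeeze") wrapper around a context encoder could look like; it is not the SEW implementation, whose actual code is in the linked repository.

```python
# Illustrative sketch (not the SEW code): downsample the frame sequence before a
# context encoder and upsample afterwards, cutting the encoder's compute and memory.
import torch
import torch.nn as nn

class SqueezedContextNetwork(nn.Module):
    def __init__(self, dim: int = 512, squeeze_factor: int = 2, num_layers: int = 4):
        super().__init__()
        # Strided convolution halves the sequence length (the "squeeze").
        self.down = nn.Conv1d(dim, dim, kernel_size=squeeze_factor,
                              stride=squeeze_factor)
        encoder_layer = nn.TransformerEncoderLayer(d_model=dim, nhead=8,
                                                   batch_first=True)
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=num_layers)
        # Transposed convolution restores the original frame rate (the "unsqueeze").
        self.up = nn.ConvTranspose1d(dim, dim, kernel_size=squeeze_factor,
                                     stride=squeeze_factor)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, dim) frame features from the waveform feature extractor
        h = self.down(x.transpose(1, 2)).transpose(1, 2)     # (batch, time//2, dim)
        h = self.encoder(h)                                  # transformer runs on fewer frames
        return self.up(h.transpose(1, 2)).transpose(1, 2)    # back to (batch, time, dim)

features = torch.randn(1, 100, 512)              # 100 hypothetical frames
print(SqueezedContextNetwork()(features).shape)  # torch.Size([1, 100, 512])
```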

SEW-D also replaces regular self-attention with DeBERTa's disentangled self-attention (He et al., 2020), which delivers better performance with half the number of parameters, along with significant reductions in inference time and memory footprint.
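As a rough, loop-based sketch of the disentangled attention idea (not the SEW-D or DeBERTa implementation, and with our own simplified handling of relative distances), the attention score between positions i and j combines content-to-content, content-to-position, and position-to-content terms:

```python
# Illustrative, unoptimized sketch of DeBERTa-style disentangled attention scores.
# Real implementations vectorize this and share projections across heads.
import torch

def disentangled_scores(H, W_q, W_k, W_qr, W_kr, rel_emb, max_dist):
    # H: (L, d) content vectors; rel_emb: (2*max_dist + 1, d) relative position embeddings
    L, d = H.shape
    Qc, Kc = H @ W_q, H @ W_k                     # content queries and keys
    scores = torch.empty(L, L)
    for i in range(L):
        for j in range(L):
            d_ij = max(-max_dist, min(max_dist, i - j)) + max_dist  # clipped distance i -> j
            d_ji = max(-max_dist, min(max_dist, j - i)) + max_dist  # clipped distance j -> i
            c2c = Qc[i] @ Kc[j]                                     # content-to-content
            c2p = Qc[i] @ (rel_emb[d_ij] @ W_kr)                    # content-to-position
            p2c = (rel_emb[d_ji] @ W_qr) @ Kc[j]                    # position-to-content
            scores[i, j] = (c2c + c2p + p2c) / (3 * d) ** 0.5
    return torch.softmax(scores, dim=-1)

d, L, k = 64, 10, 4
scores = disentangled_scores(torch.randn(L, d), torch.randn(d, d), torch.randn(d, d),
                             torch.randn(d, d), torch.randn(d, d),
                             torch.randn(2 * k + 1, d), k)
print(scores.shape)  # torch.Size([10, 10])
```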

Why is this important?

Conversational AI systems that use SEW pre-trained models will recognize what customers are saying, who is saying it, and how they are feeling more accurately and more quickly. For downstream models in automatic speech recognition, speaker identification, intent classification, emotion recognition, sentiment analysis, and named entity recognition, these pre-trained models open the door to cost reductions and/or performance gains. The speed of the pre-trained model carries over directly to downstream models: because the pre-trained model is smaller and faster, the fine-tuned downstream model is smaller and faster as well. These efficiency gains reduce training and development time, as well as the actual latency seen in production.

Article: https://arxiv.org/pdf/2109.06870.pdf

Github: https://github.com/asappresearch/sew

Reference: https://www.asapp.com/blog/wav2vec-could-be-more-efficient-so-we-created-our-own-pre-trained-asr-model-for-better-conversational-ai/

James G. Williams