Amazon Artificial Intelligence Researchers Develop GIT: A Generative Insertion Transformer for Controlled Generation of Data via Insertion Operations

This Article Is Based On The Research 'Controlled Data Generation via Insertion Operations for NLU'. All Credit For This Research Goes To The Researchers Of This Paper 👏👏👏

Natural language processing (NLP) is the field concerned with enabling computers to understand text and spoken words much as humans do.

The need for continuous annotation of user input is one of the most common barriers to deploying large-scale natural language understanding (NLU) algorithms in commercial applications. Annotation is costly, time-consuming and laborious.

Manual inspection of user data, which such annotation typically requires, is becoming less and less attractive at a time when user privacy is a growing concern in all AI applications. Accordingly, a number of efforts are underway to reduce the number of human annotations needed to train NLU models.

Data augmentation (DA) refers to techniques that increase the diversity of training samples without collecting new data. In a recent paper, Amazon researchers proposed a generative technique for producing labeled synthetic data. The goal was to create synthetic utterances and augment the original training data with a collection of utterance models that the team built from a limited amount of labeled data.

The researchers focused on the scenario where the synthesized data must retain a fine-grained interpretation of the original utterance. For example, when expanding to new features, they wanted to maintain the performance of the NLU model while controlling the feature composition of the training data.

By reframing the generation process as insertion rather than free-form generation, the team was able to control the intended annotation. The desired entities were preserved in the synthetic example by placing them in the model input during generation, and the team introduced ways to explicitly prevent entity corruption throughout the generation process.
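
To make the idea of insertion with protected entities concrete, here is a minimal Python sketch. It is not the researchers' model: a random choice from a small word list stands in for the trained insertion transformer, and the names `allowed_slots`, `insert_carrier_words` and the example labels are hypothetical. It only illustrates how entities placed in the input can survive generation if new tokens are never inserted inside an entity span.

```python
import random
from typing import List, Tuple

Token = Tuple[str, str]  # (word, label); the label "O" marks a non-entity token


def allowed_slots(tokens: List[Token]) -> List[int]:
    """Return insertion positions that do not split an entity span.

    Position i means "insert before tokens[i]"; len(tokens) means "append".
    Simplifying assumption: neighbouring tokens with the same non-"O" label
    belong to one entity, so the slot between them is blocked.
    """
    slots = []
    for i in range(len(tokens) + 1):
        left = tokens[i - 1][1] if i > 0 else "O"
        right = tokens[i][1] if i < len(tokens) else "O"
        if left != "O" and left == right:
            continue  # inside a multi-word entity: never insert here
        slots.append(i)
    return slots


def insert_carrier_words(
    tokens: List[Token],
    carrier_vocab: List[str],
    n_inserts: int,
    rng: random.Random,
) -> List[Token]:
    """Insert n_inserts non-entity words around the given entity tokens.

    The random choice of slot and word is a stand-in for a trained
    insertion model, which would score both (an assumption of this sketch).
    """
    out = list(tokens)
    for _ in range(n_inserts):
        slot = rng.choice(allowed_slots(out))
        out.insert(slot, (rng.choice(carrier_vocab), "O"))
    return out


# Hypothetical input: only the entities the synthetic utterance must contain.
seed = [("bohemian", "song"), ("rhapsody", "song"), ("queen", "artist")]
utterance = insert_carrier_words(seed, ["play", "by", "please"], 3, random.Random(0))
print(" ".join(word for word, _ in utterance))
```

Because every inserted token carries the label "O" and entity spans are never split, the annotation of the output is known by construction, which is the kind of control over labels that the insertion framing is meant to provide.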

According to the researchers, NLU models trained on 33% of the real data plus synthetic data perform as well as models trained on the full real dataset. The team also improved the quality of the synthetic data by filtering it with model confidence scores. The researchers showed that inserting appropriate tokens improves the semantics of utterances and their usefulness as training examples.
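
Filtering synthetic data by model confidence can be sketched in a few lines of Python. The article does not say which model produced the scores or what cutoff was used, so the `confidence_fn` callback, the 0.8 threshold and the example utterances below are all assumptions made for illustration.

```python
from typing import Callable, List, Tuple

Sample = Tuple[str, str]  # (utterance text, intent label)


def filter_by_confidence(
    samples: List[Sample],
    confidence_fn: Callable[[str, str], float],
    threshold: float = 0.8,  # hypothetical cutoff, not taken from the paper
) -> List[Sample]:
    """Keep only synthetic samples whose label the scoring model trusts.

    confidence_fn(text, intent) is assumed to return the probability that
    an existing classifier assigns to `intent` for `text`.
    """
    return [
        (text, intent)
        for text, intent in samples
        if confidence_fn(text, intent) >= threshold
    ]


# Stub scorer standing in for a real classifier (illustrative values only).
fake_scores = {
    "play bohemian rhapsody by queen": 0.95,
    "by play queen please": 0.40,
}
synthetic = [(text, "PlayMusic") for text in fake_scores]
kept = filter_by_confidence(synthetic, lambda text, intent: fake_scores[text])
print(kept)
```

In practice the retained samples would then be mixed with the available real data before training the intent classification and named entity recognition models.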


In a recent publication, Amazon researchers demonstrated that data augmentation with the Generative Insertion Transformer is a viable data generation strategy for intent classification and named entity recognition models, helping to counter declining annotation volumes due to privacy concerns. The increased control over entities makes it easier to add new features and protect customer privacy. The researchers plan to continue improving the performance of the model and exploring new areas.



James G. Williams