Meta AI researchers have built the first artificial intelligence-based speech-to-speech translation system for a primarily oral language, Hokkien, as part of the Universal Speech Translator (UST) project

Although more than half of the world's 7,000+ living languages are primarily oral and lack a standardized writing system, recent advances in AI translation have focused mainly on written languages. This is largely why machine translation systems cannot be built for such languages with conventional methods, which require substantial written content to train an AI model. To address this problem, Meta created the first AI-powered speech translation system for Hokkien, a language spoken widely across the Chinese diaspora but with no official written form. Released as part of Meta's Universal Speech Translator (UST) project, this open-source translation tool enables Hokkien speakers to communicate with English speakers.

Compared with standard machine translation systems, developing this speech-only translation system brought its own challenges in data collection, model building, and evaluation. One of the main hurdles Meta researchers faced was collecting enough data. Hokkien is a low-resource language, meaning far less training material is readily available than for high-resource languages like French or English. Data collection and annotation are also laborious because human English-to-Hokkien translators are scarce.

Instead, the researchers used Mandarin as an intermediate language to create pseudo-labels and to aid human translation: English (or Hokkien) speech was first translated into Mandarin text, then into Hokkien (or English). Leveraging data from a related high-resource language dramatically improved model performance. Using speech mining with a pre-trained speech encoder, the researchers embedded Hokkien speech into the same semantic space as other languages without needing a written representation of the language. Because Hokkien and English share this semantic embedding space, English speech can be synthesized from text and parallel Hokkien-English speech pairs can be produced.
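The mining step above can be sketched in miniature. The snippet below is a simplified illustration, not Meta's actual pipeline: it assumes utterances have already been encoded into a shared embedding space (here, toy random vectors stand in for real encoder outputs) and pairs source and target utterances by cosine similarity; production systems use more robust margin-based mining over real speech encoders.

```python
import numpy as np

def cosine_similarity_matrix(a, b):
    """Pairwise cosine similarity between two sets of embeddings."""
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    return a @ b.T

def mine_speech_pairs(src_emb, tgt_emb, threshold=0.8):
    """Greedily pair source and target utterances whose embeddings
    are close in the shared semantic space (a stand-in for the
    margin-based mining used in practice)."""
    sims = cosine_similarity_matrix(src_emb, tgt_emb)
    pairs = []
    for i in range(sims.shape[0]):
        j = int(np.argmax(sims[i]))
        if sims[i, j] >= threshold:
            pairs.append((i, j, float(sims[i, j])))
    return pairs

# Toy embeddings: target rows 0 and 1 are near-duplicates of source
# rows 0 and 1, mimicking semantically matching utterances.
rng = np.random.default_rng(0)
src = rng.normal(size=(3, 8))
tgt = np.vstack([src[0] + 0.01 * rng.normal(size=8),
                 src[1] + 0.01 * rng.normal(size=8),
                 rng.normal(size=8)])
pairs = mine_speech_pairs(src, tgt)
print(pairs)
```

The key property is that mining operates entirely on embeddings, so no written form of either language is required.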

Many speech translation methods rely on speech-to-text or transcription. However, producing transcribed text as the translation output is not viable here, since primarily oral languages lack standardized written forms. Instead, Meta focused on speech-to-speech translation. As for the architectural details, following the path previously traced by Meta, speech-to-unit translation (S2UT) was used to convert the input speech into a sequence of discrete acoustic units, from which the output waveforms were generated. UnitY was used as a two-pass decoding mechanism, where the first pass generates text in a related language (Mandarin) and the second pass generates units.
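To make the notion of "acoustic units" concrete, the sketch below shows one common way such units are derived (assumed here, not detailed in the article): frame-level speech features are quantized against learned k-means centroids, and consecutive repeated IDs are collapsed into a unit sequence. The 2-D toy features and centroids are invented for illustration; real systems quantize high-dimensional self-supervised features.

```python
import numpy as np

def frames_to_units(frames, centroids):
    """Map each feature frame to its nearest centroid ID (a discrete
    acoustic unit), then collapse consecutive repeats into a compact
    unit sequence."""
    # Squared distance from every frame to every centroid.
    d = ((frames[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
    ids = d.argmin(axis=1)
    units = [int(ids[0])]
    for u in ids[1:]:
        if u != units[-1]:
            units.append(int(u))
    return units

# Toy example: 6 frames, 3 "units" (centroids) in a 2-D feature space.
centroids = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 0.0]])
frames = np.array([[0.1, 0.0], [0.0, 0.1],    # near centroid 0
                   [0.9, 1.1], [1.0, 0.9],    # near centroid 1
                   [2.1, 0.1], [1.9, -0.1]])  # near centroid 2
units = frames_to_units(frames, centroids)
print(units)  # → [0, 1, 2]
```

In an S2UT system, a translation model predicts such unit sequences in the target language, and a unit-based vocoder then renders them as a waveform.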

For evaluation, automatic speech recognition (ASR) was first used to convert the translated speech into text, and BLEU scores comparing the transcribed text with human translations were then computed to assess the speech translation systems. Once again, the lack of a standard writing system for languages like Hokkien was a hurdle. The researchers created a method that transcribes Hokkien speech into a standardized phonetic notation known as Tâi-lô to facilitate automatic assessment, allowing the translation quality of various approaches to be compared via BLEU.
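The scoring side of this pipeline can be sketched as follows. This is a minimal sentence-level BLEU with add-one smoothing, written for illustration only; real evaluations use corpus-level tools such as sacreBLEU, and the "Tâi-lô" strings below are hypothetical placeholders, not actual system output.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Multiset of n-grams in a token sequence."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(hypothesis, reference, max_n=4):
    """Simplified sentence-level BLEU: geometric mean of smoothed
    n-gram precisions times a brevity penalty."""
    hyp, ref = hypothesis.split(), reference.split()
    orders = min(max_n, len(hyp))
    log_prec = 0.0
    for n in range(1, orders + 1):
        h, r = ngrams(hyp, n), ngrams(ref, n)
        overlap = sum(min(c, r[g]) for g, c in h.items())
        # Add-one smoothing keeps one missing n-gram order from
        # zeroing out the whole score.
        log_prec += math.log((overlap + 1) / (sum(h.values()) + 1))
    bp = 1.0 if len(hyp) >= len(ref) else math.exp(1 - len(ref) / len(hyp))
    return bp * math.exp(log_prec / orders)

# Hypothetical Tâi-lô transcriptions: ASR output vs. human reference.
asr_output = "li ho bo"
reference  = "li ho bo"
score = bleu(asr_output, reference)
print(round(score, 3))  # → 1.0
```

A perfect transcription match scores 1.0; any dropped or substituted syllable lowers the score, which is what makes the Tâi-lô intermediate representation useful for automatic comparison.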

Meta envisions bringing people together across geographical and linguistic barriers, even in the metaverse, through oral communication. In its current state, the method enables communication between Hokkien and English speakers, though the model can translate only one complete sentence at a time and is still under development. The researchers believe much work remains to extend UST to new languages. Still, speaking fluently with people in any language has long been a goal, and Meta's work is a step in that direction. They have open-sourced not only the Hokkien translation models but also the evaluation datasets and research papers to encourage further research.

Hokkien direct speech-to-speech translation | SpeechMatrix | Reference article



Khushboo Gupta is an intern consultant at MarktechPost. She is currently pursuing her B.Tech from the Indian Institute of Technology (IIT), Goa. She is passionate about machine learning, natural language processing, and web development, and enjoys learning more about these fields by participating in challenges.

