Machine Dubbing: Amazon AI Researchers Explore Length-Controlled Machine Translation

Jeff Bezos went viral in early October 2021. The Amazon founder praised rival streaming service Netflix for its “impressive and inspiring” internationalization strategy in a tweet on the worldwide success of the hit Korean series, Squid Game.

Meanwhile, Amazon (which hosts Prime Video) has been working on its own internationalization efforts. A team of Amazon AI researchers recently delved into automatic dubbing (AD) and machine translation (MT).

The resulting paper, “Machine translation verbosity control for automatic dubbing,” was published on the arXiv preprint platform on October 8, 2021. The authors are a collective of scientists and engineers from Amazon AI, a unit of Amazon Web Services. Among them is Marcello Federico, who, in addition to leading automatic dubbing efforts for Amazon, is also co-founder of translation productivity tool MateCat.


The research focuses on the “problem of controlling the verbosity of machine translation results” with the aim of generating higher-quality automatic dubbing. In this context, verbosity refers to length; that is, the authors want to control the length of the MT output for use in dubbing.

They explained, “Automatic dubbing aims to seamlessly replace the speech of a video document with synthetic speech in a different language.” This is complex, as translations must both reflect the meaning of the original and match its length.

The experiments involved content in French, Italian, German, and Spanish, automatically translated from English transcriptions, with the number of characters in the MT output controlled as a proxy for dubbing duration.

Better translations, worse dubbing?

According to the researchers, the experiments used intrinsic and extrinsic ratings, a significant difference from previous work: “Intrinsic ratings measure the quality and verbosity of machine translation relative to post-edited human translations matching length requirements, while extrinsic ratings measure the subjective quality of dubbed video clips using the generated translations.”

MT performance was measured using the BLEU score. To measure verbosity, the researchers counted the percentage of MT outputs that matched the length of the original with a tolerance of +/- 10%, which the researchers said they “consider acceptable for AD.”
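The verbosity metric described above — the share of MT outputs whose character length falls within +/- 10% of the original — is straightforward to compute. A minimal sketch (the function name and example strings are hypothetical, not from the paper):

```python
def length_compliance(sources, translations, tolerance=0.10):
    """Fraction of translations whose character length is within
    +/- tolerance of the corresponding source's character length."""
    assert len(sources) == len(translations)
    within = 0
    for src, hyp in zip(sources, translations):
        lo = len(src) * (1 - tolerance)
        hi = len(src) * (1 + tolerance)
        if lo <= len(hyp) <= hi:
            within += 1
    return within / len(sources)

# Illustrative example: the first translation is 28 chars against a
# 25-char source (outside the 22.5-27.5 window); the second is 22 chars
# against a 20-char source (inside the 18-22 window).
score = length_compliance(
    ["Hello, how are you today?", "The quick brown fox."],
    ["Bonjour, comment allez-vous?", "Le renard brun rapide."],
)
# → 0.5
```

Character count is only a proxy for speech duration; the paper pairs this intrinsic check with BLEU and with extrinsic viewer ratings.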


Meanwhile, for the subjective ratings, the researchers generated Italian and German dubbed videos and asked 40 subjects to rate their viewing experience.

In terms of MT quality, the researchers concluded that “our best resulting model not only produces much closer translations to the input, but often also better translations” compared to a standard Transformer MT model trained without verbosity information.

However, they said, subjective evaluation of automatically dubbed videos, produced with and without verbosity-controlled MT, confirmed an “increase in human preference” for dubbed videos made with the latter version (i.e., without verbosity control).

Automate the work of dubbing studios

The document notes that “AD tries to automate the localization of audiovisual content, a complex and demanding workflow handled during post-production by dubbing studios”. Dubbing workflows are indeed complex, involving multiple human steps and creative collaboration.

In today’s professional workflows, translators or post-editors are responsible for ensuring the original meaning is preserved, while adapters and voice artists typically handle length matching and lip-syncing.

Although professional dubbing is normally synonymous with lip-sync dubbing (with synchronized lip movements), the research aimed “only” to achieve synchronization at the utterance level and did not concern itself with synchronizing lip or body movements. This is the case with most work on automatic dubbing, the researchers said.

Not only are professional dubbing workflows still unmatched by today’s automatic dubbing, but many dubbing studios are seeing an increase in demand. As streaming services such as Netflix and Amazon continue to localize their content to drive subscriptions around the world, many are now also stepping up efforts to bring international content to English-speaking audiences.

James G. Williams