A.I. Automated Translation, Dubbing and Subtitling

Fast forward to a future G7 summit. Leaders jointly hold a press conference to announce more international cooperation to prevent a new virus from spreading as fast as COVID-19 did. Leaders from the USA, Italy, France and Germany will speak. The event is aired live around the globe.

A Swedish viewer tunes in. Merkel starts speaking in German. Instantly, Swedish subtitles appear on screen. Macron takes over in French. Again, Swedish subtitles are shown instantly. Whenever someone starts speaking, his or her name is shown on screen, and subtitles are displayed.

For a German viewer watching the same event, Merkel’s words do not need to be subtitled. Moreover, she is dyslexic and would not be able to read them anyway. That is why she has chosen the Dubbing option on her remote control. When Merkel speaks, she hears Merkel in her original voice. When Macron starts to speak, she hears a German synthesized voice that sounds strikingly similar to Macron’s.

It is not limited to this live event. Our Swedish viewer can switch to any of the 200 channels available on his set-top box, in whatever original language, and enjoy the content with subtitles displayed in real time, in perfect Swedish. Our dyslexic German viewer can do the same, and enjoy content with dubbed German voices that sound similar to the originals.

Far-fetched scenario?

When you look at the state of live subtitling and of Artificial Intelligence technology, this example seems far-fetched. Anyone who has ever used the automatically generated subtitles on a YouTube video, or their automatically generated translations, will know the current limitations of these technologies (a slightly exaggerated example circulates on YouTube). Some tablets will also generate closed captions for whatever media is playing, with similar limitations.

Speech-to-text AI technologies using Natural Language Processing are not new and have been used commercially in products like Apple’s Siri, Cortana and Alexa since the early 2010s. The WER (Word Error Rate) can be as low as 1.5% for languages such as German or English, but is higher for, say, the Baltic languages. Latvian ASR (Automatic Speech Recognition), for example, has to deal with the fact that a single Latvian word can have dozens or even hundreds of surface forms.
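For readers unfamiliar with the metric: WER is the word-level edit distance between the system’s transcript and a reference transcript, divided by the length of the reference. A minimal, dependency-free sketch:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: (substitutions + deletions + insertions) / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # Levenshtein distance over words, via dynamic programming.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(ref)][len(hyp)] / len(ref)
```

So if an ASR system drops one word from a six-word reference sentence, its WER on that sentence is 1/6, or about 17%. Rich morphology, as in Latvian, inflates WER because every wrong surface form of an otherwise correct word counts as a full substitution.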

If we look at our G7 summit example, this is not the only problem. To reach this level of sophistication in live closed captioning and dubbing, a cascaded process is needed, each stage with its own challenges and error margins:

1 – Recognize the language spoken
2 – Use ASR to transcribe speech into written text
3 – Use automated translation to translate the text from the original language to the target language
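The three stages above can be sketched as a simple pipeline. The stand-in stages below (a marker prefix for language identification, a pass-through for ASR, a tiny phrasebook for translation) are toy placeholders for trained models, not a real API; the point is the structure, and the fact that each stage’s output, errors included, feeds the next:

```python
# Hypothetical phrasebook standing in for a machine-translation model:
# (source language, ASR transcript) -> Swedish subtitle.
PHRASEBOOK = {
    ("de", "guten abend"): "god kväll",
    ("fr", "bonsoir"): "god kväll",
}

def detect_language(audio: str) -> str:
    # Stage 1: spoken-language identification (stubbed via a "lang:" prefix).
    return audio.split(":", 1)[0]

def transcribe(audio: str) -> str:
    # Stage 2: ASR -- speech to source-language text (stubbed as pass-through).
    return audio.split(":", 1)[1]

def subtitle(audio: str) -> str:
    # Stage 3: machine translation into the viewer's language.
    lang = detect_language(audio)
    text = transcribe(audio)
    # Anything a real MT model could not translate (e.g. a misrecognized word
    # from stage 2) would surface here -- errors compound across stages.
    return PHRASEBOOK.get((lang, text), text)
```

In a real deployment, each stub would be a separate trained model, and a misrecognition in stage 2 is translated as-is in stage 3, which is exactly the compounding-error problem the end-to-end approach described below tries to avoid.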

The steps needed for automated live dubbing, in which the characteristics of the original voice carry over to the synthesized target-language voice, are even more challenging. They rely on a promising new approach that is currently in development. This approach avoids dividing the task into separate stages, preventing errors from compounding between recognition and translation: a sequence-to-sequence model skips the intermediate text representation and translates speech in one language directly into speech in another. It is also better at handling words that do not need to be translated, such as names.

AI-driven ASR is not there yet

We are not at that point yet, but we are moving fast. A very well-known benchmark for AI is the Turing test. The test, proposed in 1950 by mathematician Alan Turing, essentially asks whether a human can distinguish a human from a machine based on textual output alone. Applying such tests to the recently released GPT-3 language model showed that humans identified the human-written text over the machine-written one only 52% of the time. In other words, 48% of the time humans judged the AI-written article to be human.

AI technologies are advancing at an incredible pace as more datasets become available, more powerful models are being created and very powerful chips and hardware accelerators are being implemented on edge devices.

Current AI transcription technologies can already be monetized

We do not need to wait for the G7 example to become reality to reap the benefits of using AI for transcription, translation and subtitling. We just need to take its current limitations into consideration. Some situations make it harder for AI to deliver acceptable results: multiple people speaking at the same time, dialects, non-native speakers, stuttering, loud background noise, and so on.

Problems with multiple people speaking at the same time can be dealt with by feeding the ASR system each speaker’s isolated audio. People who are for some reason very hard to understand might get a voice-over that serves as the input to the ASR.

Even though there are several ways to deal with the problems of using AI transcription in a live broadcast environment, there are still obvious limitations in terms of delay and the quality of the closed captions. Word Error Rates in a live environment cannot be meaningfully improved until the AI models themselves advance to the point where those rates approach zero.

Smart AI supported subtitling in non-live environments

This is very different for pre-recorded content. Delay is not an issue and subtitling quality can be assured by using a smart mix of humans and ASR systems, and tools to support that process and make it efficient.

The traditional (human) process of creating closed captions has some variations, but it often looks like this:
– A transcriber screens content and transcribes it into written text
– A proofreader checks it for mistakes
– An editor makes sure subtitles are displayed at exactly the right moment.

With the right AI transcription tools, the human effort in the first and third steps of this process can be eliminated almost entirely. A proofreader just needs to go through the transcribed text and fix the mistakes the ASR system has made; in most cases, the system will even point out where mistakes are likely. This makes creating closed captions 60 to 80% more efficient and less costly.
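The third step, timing, is the most mechanical of the three: modern ASR systems typically emit word-level timestamps, from which subtitle cues can be generated automatically. A minimal sketch, assuming the ASR output is a list of (word, start, end) tuples and targeting the common SubRip (.srt) format:

```python
def to_srt_time(seconds: float) -> str:
    """Format seconds as the SRT timestamp HH:MM:SS,mmm."""
    ms = round(seconds * 1000)
    h, rem = divmod(ms, 3_600_000)
    m, rem = divmod(rem, 60_000)
    s, ms = divmod(rem, 1000)
    return f"{h:02}:{m:02}:{s:02},{ms:03}"

def words_to_srt(words, max_words=7):
    """Group (word, start, end) tuples into numbered SRT cues.

    A production tool would also respect line-length limits, reading speed
    and sentence boundaries; here we simply chunk by word count.
    """
    cues = []
    for i in range(0, len(words), max_words):
        chunk = words[i:i + max_words]
        start, end = chunk[0][1], chunk[-1][2]
        text = " ".join(w for w, _, _ in chunk)
        cues.append(f"{len(cues) + 1}\n{to_srt_time(start)} --> {to_srt_time(end)}\n{text}")
    return "\n\n".join(cues)
```

For example, `words_to_srt([("Hello", 0.0, 0.4), ("world", 0.5, 0.9)])` yields a single cue timed 00:00:00,000 --> 00:00:00,900. The proofreader then only corrects the text; the timing comes for free from the ASR output.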

Going one step further: Using AI for translations

There are about 7,100 languages in the world today, and AI is pretty good at a few dozen of them. When foreign content is acquired, it is very likely that AI will be able to transcribe the spoken language using the same process, making it possible to create translations faster and more efficiently.

Humans only need to check the machine-generated translations, in much the same way as described above. To prevent compounding errors, one proofreader is needed for the original language, to fix AI transcription errors, and another to check the machine-generated translation.

Why subtitles make commercial sense

The models, methods and processing power needed to reduce Word Error Rates improve on a monthly basis. And as those rates go down, the cost of human intervention in the transcription, translation and subtitling process shrinks further.

It is not just governmental bodies that want broadcasters and other distributors to offer closed captioning to make content accessible to people with disabilities. Children who can decode more than five words start reading subtitles, and research shows this hugely increases literacy levels.

And a UK study conducted seven years ago showed that 20% of viewers switched on subtitles to follow the show they were watching, while only 4% of the population at the time was deaf or hard of hearing. On social media the trend is even more apparent: 85% of Facebook videos are watched without audio. Closed captions are becoming increasingly important for engaging and connecting with audiences.

The G7 example we started this article with is not that far-fetched after all, and will become reality in the next 3 to 7 years.

Dronyc.nl helps content distributors with AI driven subtitling and translations.
