Humans are a very intricate species. For a long time, many activities were perceived as specific to humans: playing board games, driving cars, or even artistic endeavours such as painting or composing music. Since the emergence of computers and deep neural networks, however, many technological breakthroughs have happened, and suddenly these activities can no longer be seen as limited to humans.
Surely we must have cherry-picked these examples of “no-longer-human-specific” activities! There are other activities, such as those that distinguish our species from animals. Those are the truly human activities that computers will possibly never perform at the same level, like human speech.
Many research groups are trying to automate language processing to the level of human performance. Thanks to these efforts, after more than a decade of targeted work with the most modern methods, English-Czech text translation has reached human quality in some areas (https://www.nature.com/articles/s41467-020-18073-9). This success builds upon very fast computers and very large amounts of data. New data for computational linguistics is created practically every second, because people communicate all the time.
A similar near-human or even superhuman performance has been observed in particular settings of speech recognition, i.e. automatic transcription of speech into text, since 2017.
Combining superhuman technologies…
ELITR is putting these top-performing systems together: near-human speech recognition followed by superhuman machine translation. Such a system combination can surely mimic human interpreters.
Well, you will take that with a grain of salt if you watch one of our public demos, given in the middle of the lecture “Why computational linguistics sits at the core of modern AI” at a prg.ai evening last year (it didn’t go according to plan).
…can make a disaster
Unfortunately, there are many issues on the way to simultaneous speech translation that have to be dealt with first. The demo mentioned above failed because of a bad network connection; everything had been tested and worked well before the audience came. But with too many cell phones around, all trying to reach the Wi-Fi, the sound was not transmitted reliably to the speech recognition system. Furthermore, the architecture has to send data across several European countries, and when the connection is slow at any point, the system either delivers the translation too late or fails to deliver any output at all.
Even small details like microphone positioning can mess up the outputs. The sound input has to be set up perfectly. When lapel microphones are used, the process may fail. When the speaker stands in front of the sound amplification system, the process may fail as well. The current version of the speech recognition system is the best humanity has ever had, but so far it is still 10 times worse than humans in a noisy environment. Accents (especially non-native ones) cause very serious trouble, and so do all the disfluencies commonly observed in spontaneous speech. These errors are then multiplied by machine translation, which works hard to convert any garbled and mis-segmented sequence of words into a complete sentence. Finally, the topic and domain-specific terminology can seriously fool the system, not to mention cross-sentence phenomena. The technology has a very complex architecture. There is a lot to take care of… And when all your systems and teams are ready for an important demo, a digger cuts the main power line or finds a WWII bomb at the server site of one of your partners.
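A common way to put a number on such gaps in speech recognition quality is the word error rate (WER): the number of word substitutions, deletions and insertions needed to turn the system output into the reference transcript, divided by the length of the reference. The text above does not say which metric is behind the "10 times worse" figure, so the following is only a minimal sketch of how WER could be computed; the function name and the example sentences are made up for illustration.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # turning i reference words into an empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1       # reference word missing in output
            insertion = curr[j - 1] + 1  # extra word in output
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical example: one misrecognized word and one dropped word.
reference = "the cat sat on the mat"
hypothesis = "the cat sit on mat"
print(f"WER: {word_error_rate(reference, hypothesis):.2f}")
```

In a cascade like the one described here, every misrecognized or dropped word is passed on to machine translation, so even a modest WER can noticeably distort the translated sentence.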
If you want more in-depth information, we highly recommend watching this easily digestible lecture in the Linguistic Mondays series.
It can work, at times
The ELITR team is focused primarily on creating a working product to support a European congress. The speech recognition systems will have to handle 7 languages: English, German, French, Spanish, Italian, Russian and Czech.
Speech in those languages will be translated into 24 main European languages and 19 less frequent ones, which makes an incredible total of 43 target languages.
Even without any adaptation, the system can make, for example, Czech Radio broadcasts accessible to international audiences; see the demo (high resolution is needed for readability).
As expected, names and novel terms like coronavirus are not recognized well. With some adaptation to a particular domain, domain-specific words and names are identified better; see a snippet from a talk on machine translation evaluation.
In both cases, you will surely get the gist of the content, even if you cannot understand Czech at all.
The potential of the technology is limitless. Stay tuned for new updates.