This section describes the technological support for (multilingual) communication and collaboration, with a specific focus on Natural Language Processing (NLP) technologies relevant to ELITR:

Figure: How ASR, MT, and SLT overlap: the recognition and translation of the spoken English word "Hello!" into Czech "Ahoj!".

Automatic Speech Recognition (ASR)

Automatic speech recognition has been developing for around 70 years: from recognizing a handful of single words, such as digits, mathematical operators, and calculator commands carefully pronounced by a single speaker; through short sentences with very restricted structure, composed from a few thousand given words; to today's intelligent virtual assistants, dialogue systems, and fully automatic transcription of, e.g., conference talks.

The first ASR systems, from the 1950s, were based on acoustic models of speech. They determined vowels and consonants by their characteristic patterns in the frequency spectrum. They were later extended with handcrafted rules, finite-state automata, and brute-force search into single-purpose, domain-dependent systems. Later still, powerful statistical models such as Hidden Markov Models and n-gram language models were introduced. These made it possible to train ASR systems fully automatically from paired examples of speech and text. Such a system covers a domain, language, dialect, or individual speaker's style only if it is sufficiently represented in the training data.
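To make the statistical idea concrete, the following is a minimal sketch of an n-gram language model of the kind mentioned above: a bigram model estimated by maximum likelihood from a tiny illustrative corpus (the corpus and the simple whitespace tokenization are assumptions for the example, not part of any real system).

```python
from collections import Counter

# A tiny corpus standing in for transcribed training text (illustrative only).
corpus = "the meeting starts now the meeting ends soon".split()

# Count unigrams and bigrams.
unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))

def bigram_prob(prev: str, word: str) -> float:
    """P(word | prev), estimated by maximum likelihood from the corpus."""
    if unigrams[prev] == 0:
        return 0.0
    return bigrams[(prev, word)] / unigrams[prev]

# "meeting" follows "the" in both occurrences of "the" in the corpus.
print(bigram_prob("the", "meeting"))  # 1.0
```

In a real ASR system, such probabilities (heavily smoothed and computed over much larger histories and corpora) would score candidate transcriptions produced by the acoustic model.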

Nowadays, the statistical models have been replaced by deep neural networks, which do the same job better, so the overall quality is sometimes on par with humans. Deep neural networks for end-to-end ASR are capable of finding their own way to transform speech into text: they learn to cope with acoustics, phonetics, grammar, vocabulary, real-world knowledge, and orthography without direct human supervision. All they need are training data, an appropriate design, and powerful computers.

Machine Translation (MT)

The historical development of machine translation is very similar to the ASR story. During the Cold War, in the 1950s, the first automatic translations of simple Russian sentences complying with handcrafted rules became available. Later, very complex single-purpose rule-based systems were used. These were then replaced by more flexible statistical systems, which could translate an arbitrary language pair and domain: they learned translations from training examples of sentence pairs in the source and target languages, and target-language grammar from monolingual texts.

Since around 2015, deep neural networks have started to outperform the purely statistical systems in MT. Again, they handle all the subtasks of MT in their own way: they learn morphology, syntax, word meaning, real-world knowledge, and the handling of unknown and rare words on their own, from parallel training data only.

Spoken Language Translation (SLT)

If we can transcribe speech into text and translate text from one language into another, then we can simply connect the ASR and MT systems together and translate spoken language. The only modification needed on the MT side is a focus on training data that resembles written-down spoken language.
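The cascade described above can be sketched as a simple function composition. The `asr()` and `mt()` functions below are hypothetical stand-ins for real ASR and MT systems, and the toy lexicon only mirrors the document's "Hello!" → "Ahoj!" running example.

```python
def asr(audio: bytes) -> str:
    """Hypothetical ASR: turns audio into a source-language transcript."""
    # A real system would run acoustic and language models here.
    return "Hello!"

def mt(text: str, src: str = "en", tgt: str = "cs") -> str:
    """Hypothetical MT: translates transcribed text into the target language."""
    # A real system would run a translation model; this toy lookup only
    # mirrors the document's running example.
    toy_lexicon = {"Hello!": "Ahoj!"}
    return toy_lexicon.get(text, text)

def slt(audio: bytes) -> str:
    """Cascaded SLT: simply chain ASR and MT."""
    return mt(asr(audio))

print(slt(b"\x00"))  # Ahoj!
```

The design point is that the two components communicate only through a single text string, which is exactly the bottleneck the end-to-end approach below tries to remove.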

Another, more challenging approach to SLT is end-to-end SLT with deep neural networks, which translate speech directly into text. In a cascade of ASR and MT, the ASR component may make errors, fail to express its uncertainty, or discard input features that the MT component could have used for a better translation. Instead, we can let a single neural network handle all the SLT subtasks, from acoustics to target-language punctuation, on its own, and let it learn the optimal way for the internal ASR and MT components to communicate.

Automatic Minuting

Figure: How automatic minuting and summarization work.

The minuting module will be responsible for identifying the most significant discourse segments, cleaning and compressing them (removing repetitions), and finally organizing them in a logical order to form the meeting minutes (the first output). The latter should mainly contain meeting decisions and other important points of discussion. The relevant text segments will also be aligned with the predefined agenda clauses to create the filled agenda (the second output). Sentence compression and text summarization techniques will be involved in the process.
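As a minimal sketch of the extractive part of this pipeline, the toy function below selects salient utterances, drops repetitions, and keeps the original order. It rests on strong simplifying assumptions: salience is approximated by keyword overlap with the agenda, and repetition removal is exact-duplicate filtering; real minuting would use sentence compression and trained summarization models, as the text states.

```python
def make_minutes(utterances: list[str], agenda_keywords: set[str]) -> list[str]:
    """Pick utterances mentioning agenda keywords, drop repeats, keep order."""
    seen = set()
    minutes = []
    for utt in utterances:
        words = set(utt.lower().split())
        if not words & agenda_keywords:
            continue  # not salient with respect to the agenda
        key = " ".join(sorted(words))
        if key in seen:
            continue  # repetition of an earlier point
        seen.add(key)
        minutes.append(utt)
    return minutes

transcript = [
    "We decided to release the dataset in May.",
    "Okay, so, um, yeah.",
    "We decided to release the dataset in May.",
    "The budget needs approval next week.",
]
print(make_minutes(transcript, {"dataset", "budget", "decided"}))
```

The same keyword-matching step, applied per agenda clause instead of globally, would yield the filled agenda described as the second output.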

It should be noted that the objective of automatic minuting is a high-risk one. No such technology has yet been proposed, so our aim is to lay the foundations for the necessary research by collecting and releasing relevant datasets, and to reach a first prototype of a meeting summarization tool.