CUNI has the top-scoring system at the IWSLT 2025 Simultaneous Speech Translation Shared Task

IWSLT is an annual international conference that runs several speech translation competitions. In 2025, the Charles University team of Dominik Macháček and Peter Polák participated in the simultaneous speech translation task. The goal was to build a system that translates Czech speech into English, and English speech into German, Japanese, and Chinese, with 2–5 seconds of latency and the highest possible quality.

The Charles University (CUNI) solution is based on a new implementation named SimulStreaming, which runs the robust, large, high-quality Whisper and EuroLLM models in simultaneous mode, using state-of-the-art methods for simultaneous processing of offline models.
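To illustrate one such method: a well-known policy for driving an offline model simultaneously is LocalAgreement, which our earlier WhisperStreaming release relied on. The model repeatedly re-decodes the growing audio buffer, and only the prefix on which two consecutive hypotheses agree is committed to the output. A minimal sketch (illustrative only; SimulStreaming's actual policy is described in the paper):

```python
def local_agreement(prev_hyp, curr_hyp, committed):
    """LocalAgreement-2 policy sketch: commit only the tokens on which
    two consecutive hypotheses over the growing audio buffer agree.

    prev_hyp, curr_hyp: token lists from two consecutive re-decodings.
    committed: tokens already emitted in earlier steps.
    Returns the newly committable tokens.
    """
    agreed = []
    for a, b in zip(prev_hyp, curr_hyp):
        if a != b:
            break
        agreed.append(a)
    # Emit only the part of the agreed prefix that extends what was
    # already committed.
    return agreed[len(committed):]

# Example: as more audio arrives, the hypothesis grows and stabilizes.
committed = []
prev = ["Hello", "world", "this"]
curr = ["Hello", "world", "this", "is"]
committed += local_agreement(prev, curr, committed)
```

Tokens that the two decodings disagree on stay unemitted until a later re-decoding confirms them, which is what keeps the output stable despite the underlying model being an offline one.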

Key features of SimulStreaming:
– 🌐 Multilingual: 99 Whisper source languages → 35 EuroLLM target languages.
– 📝 Translation + ASR.
– ⚡ Real-time and faster than ever: ~5× faster than our previous release, WhisperStreaming, thanks to an efficient simultaneous policy.
– 🧩 Flexible prompting & context: Supports in-domain terminology and retrieval augmented generation (RAG).
– 🎯 High quality: Simultaneous use of strong foundation models with little performance loss.
– 🔧 Simple & robust: Clean middleware design – server with mic input or file simulation.
– 💻 Feasible hardware: Optimized for 1–2 GPUs (Whisper large-v3 1.5B + EuroLLM 9B), but possible with smaller distilled models.

Competition results

The results of the IWSLT 2025 Simultaneous shared task (presented on July 31st 2025 at the IWSLT conference in Vienna) show that the CUNI submission achieves the highest scores (in COMET, an automatic translation quality metric) in Czech-to-English translation at both 2 and 4 seconds of latency, and in English-to-German, Japanese, and Chinese at 4–5 seconds of latency. The human evaluators rated it as the highest-quality automatic system that participated in Czech-to-English and English-to-Japanese. The only test cases in which it scored second were a challenging translation of non-native speech, and a comparison with a professional human Czech-to-English interpreter, where the CUNI system outperformed only the student interpreter.

In-domain terminology and context

A novel and unique feature of SimulStreaming is the option to integrate prompts with in-domain terminology. For example, in the Czech Parliament domain, the default Whisper model often mistranslates the Chamber of Parliament as the *Senate, which is a serious mistake because the Senate is a different body. However, it is possible to prime Whisper’s translation with the sentence “This is a Chamber of Parliament.”, which makes Whisper pick up the specific in-domain term.
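As a sketch of how such terminology priming could look in code (the helper below is hypothetical; SimulStreaming's actual prompting interface may differ):

```python
def build_terminology_prompt(terms):
    """Build a short priming text that surfaces in-domain terms.

    Hypothetical helper: Whisper conditions its decoding on prompt text,
    so simply mentioning the correct term steers the model away from
    mistranslations such as *Senate for the Chamber of Parliament.
    """
    return " ".join(f"This is a {term}." for term in terms)

prompt = build_terminology_prompt(["Chamber of Parliament"])
# The resulting text can then be passed to the model, e.g. via the
# `initial_prompt` argument of openai-whisper's `model.transcribe(...)`.
```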

Moreover, SimulStreaming provides an option to carry context longer than one sentence across processing units. Another useful option is an in-context example for EuroLLM translation.
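A rough sketch of how an in-context example and rolling context might be assembled into a translation prompt (the template and names below are assumptions for illustration, not the exact prompt SimulStreaming sends to EuroLLM):

```python
def build_translation_prompt(source, target_lang, example=None, context=None):
    """Assemble a translation prompt with an optional in-context example
    and optional rolling context from earlier output.

    Illustrative template only; the format SimulStreaming actually uses
    may differ.
    """
    parts = []
    if example is not None:
        # One-shot in-context example: a (source, target) sentence pair.
        src_ex, tgt_ex = example
        parts.append(f"English: {src_ex}\n{target_lang}: {tgt_ex}")
    if context is not None:
        # Context carried over from previously translated units.
        parts.append(f"Context of the translation so far: {context}")
    parts.append(f"English: {source}\n{target_lang}:")
    return "\n\n".join(parts)

prompt = build_translation_prompt(
    "The session is now open.",
    "German",
    example=("The chamber will vote.", "Die Kammer wird abstimmen."),
    context="Guten Morgen, meine Damen und Herren.",
)
```

The in-context example nudges the LLM toward the desired terminology and register, while the rolling context keeps pronouns and topic references consistent across sentence boundaries.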

Self-service demo

You can try SimulStreaming yourself. You only need a device with an Internet connection, a microphone, and a web browser. Click on this link for self-service operation and follow the instructions there.

If you want to know more:

  • This self-service demo will be available only for a limited time. It cannot be online permanently because it requires quite extensive hardware resources, but you can contact the authors for another time slot.
  • Only one operator (the person who connects the audio and starts/stops the service) can use it at a time.
  • Anyone can access the presentation web and read the transcripts and translations.
  • The labeled columns on the presentation web page are:
    • EN — Whisper large-v3 translation from automatically identified language into English.
    • ASR — the transcript in the original language, which is automatically identified from the first 3 seconds of speech after each 0.5 s pause. (You might not see any output for the first 3 seconds of talking.)
    • DE/JA/ZH/CS — translation from EN with EuroLLM, using a context of approx. 150 tokens. The processing restarts after every pause in the audio.
    • EN.PROMPT, ASR.PROMPT — currently not available.
  • The demo can transcribe/translate any audio that the operator plays on their computer or captures from the microphone.
  • If you see strange repeated outputs (“hallucinations”), it might be a limitation of the current state of the art. The repetition stops if you make a pause in the source speech and wait until the system processes all audio before the pause.
  • If you refresh the presentation web, all previously displayed text disappears.
  • Both the input sound quality and the quality of the Internet connection impact the latency and the translation quality.

Code, paper, presentation, service, license

Are you interested in SimulStreaming, or any of the features it offers?

  • Read the paper.
  • See the poster presentation at IWSLT 2025, on July 31 in Vienna.
  • See the interactive demo of SimulStreaming at the demo panel at IWSLT 2025, and then at several other planned events. Contact the authors for more dates.
  • Use SimulStreaming code on GitHub.
  • Ask the authors for a commercial licence. Commercial use is permitted to those who register, because knowing who uses SimulStreaming commercially greatly helps the authors in further research and development.
  • AI interpreting service: the CUNI team is available to provide a limited AI interpreting service for meetings and events, for personal or other use. See this post for more info.

Contact: Dominik Macháček, machacek@ufal.mff.cuni.cz.