VIOLA: Causal Language Models for Speech Recognition, Synthesis, and Translation

Tianrui Wang*, Long Zhou*, Ziqiang Zhang, Yu Wu, Shujie Liu, Yashesh Gaur,
Zhuo Chen, Jinyu Li, Furu Wei

Microsoft

Abstract. Recent research shows a big convergence in model architecture, training objectives, and inference methods across various tasks for different modalities. In this paper, we propose VIOLA, a single auto-regressive Transformer decoder-only network that unifies various cross-modal tasks involving speech and text, such as speech-to-text, text-to-text, text-to-speech, and speech-to-speech tasks, as a conditional language model task via a multi-task learning framework. To accomplish this, we first convert speech utterances to discrete tokens (similar to textual data) using an offline neural codec encoder. In this way, all these tasks are converted to token-based sequence prediction problems, which can be handled naturally by one conditional language model. We further integrate task IDs (TID) and language IDs (LID) into the proposed model to enhance its capability to handle different languages and tasks. Experimental results demonstrate that the proposed VIOLA model supports both single-modal and cross-modal tasks well, and that the decoder-only model achieves performance comparable to, or even better than, the strong baselines. Audio samples are available at https://violademo.github.io/.
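To make the token-based formulation in the abstract concrete, here is a minimal sketch of how one training example could be laid out for a conditional language model. The special-token names (`<asr>`, `<en>`, `<a_17>`), the ordering of LID, source, TID, and target, and the helper names `codec_tokens` and `build_sequence` are all assumptions made for illustration; this page does not specify the actual VIOLA vocabulary or sequence layout.

```python
from typing import List, Sequence, Tuple

# Hypothetical special tokens; the real VIOLA vocabulary is not given on this page.
TASK_IDS = {"asr": "<asr>", "mt": "<mt>", "tts": "<tts>"}
LANG_IDS = {"en": "<en>", "zh": "<zh>"}


def codec_tokens(codes: Sequence[int]) -> List[str]:
    """Render discrete codec indices as tokens that can share a vocabulary with text."""
    return [f"<a_{c}>" for c in codes]


def build_sequence(
    task: str,
    src_lang: str,
    src: Sequence[str],
    tgt_lang: str,
    tgt: Sequence[str],
) -> Tuple[List[str], List[int]]:
    """Lay out one example as: source LID, source tokens, TID, target LID, target tokens.

    Returns the token sequence and a loss mask that is 1 only on target positions,
    so the model is trained to predict the target conditioned on everything before it.
    The ordering is an assumption of this sketch, not the documented VIOLA format.
    """
    prefix = [LANG_IDS[src_lang], *src, TASK_IDS[task], LANG_IDS[tgt_lang]]
    sequence = prefix + list(tgt)
    loss_mask = [0] * len(prefix) + [1] * len(tgt)
    return sequence, loss_mask


# Example: English text-to-speech, where the target is a run of codec tokens.
tts_sequence, tts_mask = build_sequence(
    "tts", "en", ["hello", "world"], "en", codec_tokens([5, 17, 3])
)
print(tts_sequence)
print(tts_mask)
```

Under this layout, speech recognition, translation, and synthesis differ only in which side carries codec tokens and which task ID appears in the prefix, which is what lets one decoder-only model cover all of them.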

This page is for research demonstration purposes only.

Contents

Model Overview
Zero-Shot Text to Speech
Zero-Shot Cross-Lingual Text to Speech
Zero-Shot Speech-to-Speech Translation
Voice Emotion Maintenance
Ethics Statement

Model Overview

Figure. The overall framework of VIOLA, which treats various speech processing tasks as a conditional language model task. The model is trained in a multi-task learning framework on ASR, MT, and TTS tasks, and it can perform speech-to-text recognition and translation, text-to-text translation, text-to-speech synthesis, and speech-to-speech translation.
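One way to read the caption is that speech-to-speech translation can be obtained by chaining the three trained tasks inside the same model. The sketch below shows that data flow only. The `generate` callable, its `(task, source, prompt)` signature, and the use of the source speech as the synthesis prompt are assumptions made for illustration, and `toy_generate` is a stand-in that merely tags tokens rather than running a model.

```python
from typing import Callable, List, Optional

Tokens = List[str]
# Hypothetical interface: (task name, source tokens, optional prompt tokens) -> output tokens.
GenerateFn = Callable[[str, Tokens, Optional[Tokens]], Tokens]


def speech_to_speech(generate: GenerateFn, source_speech: Tokens) -> Tokens:
    """Chain ASR, MT, and TTS through one shared generate function."""
    transcript = generate("asr", source_speech, None)   # speech tokens -> source text
    translation = generate("mt", transcript, None)      # source text -> target text
    # Assumption: the source speech doubles as the prompt so that the output
    # can keep the speaker's voice characteristics.
    return generate("tts", translation, source_speech)  # target text -> speech tokens


def toy_generate(task: str, source: Tokens, prompt: Optional[Tokens]) -> Tokens:
    """Stand-in for a trained model: wraps each token so the call order is visible."""
    out = [f"{task}({tok})" for tok in source]
    if prompt is not None:
        out.append(f"prompt_len={len(prompt)}")
    return out


print(speech_to_speech(toy_generate, ["a1", "a2"]))
```

Because every stage goes through the same callable, swapping `toy_generate` for a real model wrapper would not change the chaining logic.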

Zero-Shot Text to Speech

English TTS with English prompts (English samples are from LibriSpeech).

English Text	Speaker Prompt	Baseline	VIOLA
Sweat covered brion's body trickling into the tight loincloth that was the only garment he wore.
But neither the glorified woods on the one hand nor the lake on the other could at first hold the eye.
On sunday morning a clear beautiful and still day the order was given for the whole army to advance and to attack immediately.
The merchant's daughter at first did not answer but as he kept on calling to her she finally asked him what it was that he wanted.
Gold is the most common metal in the land of oz and is used for many purposes because it is soft and pliable.
I understand you to say that there are three students who use this stair and are in the habit of passing your door yes there are.
The cat growled softly picked up the prize in her jaws and trotted into the bushes to devour it.

Zero-Shot Cross-Lingual Text to Speech

English TTS with Chinese prompts and accent (Samples are from EMIME).

English Text	Speaker Prompt	Baseline	VIOLA
The fishermen are inactive tired and disappointed.
Unfortunately others separate on the basis of accumulated hatred.
In reality the european parliament is practising delay tactics.
Summer time has in practice become normal time.
There is a connection between all of this.
This parliament represents the people of europe.
Turkey must come to terms with the truth.
The problems I mentioned are not insurmountable.
We reject this socially unacceptable reform.
In reality the european parliament is practising delay tactics.

Zero-Shot Speech-to-Speech Translation

Chinese to English on the EMIME dataset.

Model	Chinese Transcription (ASR Result)	English Text (MT Result)
Ground Truth	我提到的这个问题并不难处理	The problems I mentioned are not insurmountable.
AED->AED->VALLEX	我提到的这个问题并不难处理	The problem I mentioned is not difficult to deal with.
VIOLA	我提到的这个问题并不难处理	The problem I mentioned is not difficult to deal with.
Ground Truth	实际上欧洲议会正在施展拖延战术	In reality the european parliament is practising delay tactics.
AED->AED->VALLEX	实际上欧洲议会正在施展拖延战术	In fact europe is using delaying tactics once and for all.
VIOLA	实际上欧洲议会正在施展拖延战术	In practice europe is once again staging its procrastination tactics.
Ground Truth	这一切之间皆有关联	There is a connection between all of this.
AED->AED->VALLEX	这一切之间皆有关联	There is a link between all this.
VIOLA	这一切之间皆有关联	There is a connection between all of this.
Ground Truth	渔夫们没在工作显得累而失望	The fishermen are inactive tired and disappointed.
AED->AED->VALLEX	渔夫们没在工作是累而失望	The fishermen were tired and disappointed when they were not working.
VIOLA	渔夫们没在工作显得累而失望	The fishermen looked tired and disappointed when they were not working.
Ground Truth	立法已经到位以便达成这些目标	Legislation is already in place to achieve these aims.
AED->AED->VALLEX	立法已经到位已变成这些目标	Legislation is in place and has become one of those objectives.
VIOLA	立法已经到位已变成这些目标	Legislation is already in place and has become these targets.

Voice Emotion Maintenance

VIOLA can translate speech while maintaining the emotion of the source speech. The source audio is sampled from the Emotional Voices Database (ESD).

Emotion	Chinese Speech	English Speech Generated by VIOLA (Speech-to-Speech Translation)
Happy
Neutral
Angry
Sad
Surprise

Ethics Statement

Since VIOLA can synthesize speech that maintains speaker identity, it carries potential risks of misuse, such as spoofing voice identification or impersonating a specific speaker. We conducted the experiments under the assumption that the user agrees to be the target speaker in speech synthesis. If the model is generalized to unseen speakers in the real world, it should be accompanied by both a protocol to ensure that the speaker approves the use of their voice and a synthesized speech detection model.