VIOLA: Causal Language Models for Speech Recognition, Synthesis, and Translation
Tianrui Wang*,   Long Zhou*,   Ziqiang Zhang,   Yu Wu,   Shujie Liu,   Yashesh Gaur,  
Zhuo Chen,   Jinyu Li,   Furu Wei
Microsoft
Abstract. Recent research shows a strong convergence in model architecture, training objectives, and inference methods across tasks and modalities. In this paper, we propose VIOLA, a single auto-regressive Transformer decoder-only network that unifies various cross-modal tasks involving speech and text, such as speech-to-text, text-to-text, text-to-speech, and speech-to-speech tasks, as a conditional language modeling task via a multi-task learning framework. To accomplish this, we first convert speech utterances to discrete tokens (similar to textual data) using an offline neural codec encoder. In this way, all these tasks are converted to token-based sequence prediction problems, which can be handled naturally by one conditional language model. We further integrate task IDs (TID) and language IDs (LID) into the proposed model to improve its ability to handle different languages and tasks. Experimental results demonstrate that the proposed VIOLA model supports both single-modal and cross-modal tasks well, and that the decoder-only model achieves comparable or even better performance than strong baselines. Audio samples are available at https://violademo.github.io/.
This page is for research demonstration purposes only.
Model Overview
Figure. The overall framework of VIOLA, which treats various speech processing tasks as a conditional language modeling task. The model is trained in a multi-task learning framework on ASR, MT, and TTS tasks, and the trained model can perform speech-to-text recognition and translation, text-to-text translation, text-to-speech synthesis, and speech-to-speech translation.
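To make the "conditional language model" framing concrete, here is a minimal sketch of how ASR, MT, and TTS examples could each be flattened into one token sequence for a decoder-only model. The token layout, the `<tid:...>` and `<lid:...>` spellings, the `<eos>` marker, the loss mask, and the `build_example` helper are all assumptions made for illustration; this page only states that speech is converted to discrete codec tokens and that task IDs and language IDs are integrated into the model.

```python
from typing import List, Tuple

TASKS = {"asr", "mt", "tts"}  # training tasks named in the figure caption
LANGS = {"en", "zh"}          # languages that appear in the demo tables


def build_example(task: str, src_lang: str, tgt_lang: str,
                  source: List[str], target: List[str]) -> Tuple[List[str], List[int]]:
    """Return (tokens, loss_mask) for one training example.

    Hypothetical layout: <lid:src> source <tid:task> <lid:tgt> target <eos>.
    loss_mask is 1 on the positions the model would be trained to predict.
    """
    if task not in TASKS:
        raise ValueError(f"unknown task: {task}")
    if src_lang not in LANGS or tgt_lang not in LANGS:
        raise ValueError("unknown language")
    # Conditioning part: the source tokens plus the task and language markers.
    condition = [f"<lid:{src_lang}>", *source, f"<tid:{task}>", f"<lid:{tgt_lang}>"]
    # Continuation part: what the language model is asked to generate.
    continuation = [*target, "<eos>"]
    tokens = condition + continuation
    loss_mask = [0] * len(condition) + [1] * len(continuation)
    return tokens, loss_mask


# Speech is represented by discrete codec tokens (placeholder strings here).
asr_tokens, asr_mask = build_example("asr", "zh", "zh", ["<a12>", "<a7>", "<a99>"], ["你", "好"])
mt_tokens, mt_mask = build_example("mt", "zh", "en", ["你", "好"], ["hello"])
tts_tokens, tts_mask = build_example("tts", "en", "en", ["hello"], ["<a3>", "<a41>"])
```

Under this assumed layout, the three tasks differ only in which token types sit on each side of the task marker, which is why a single next-token predictor can serve all of them.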
Zero-Shot Text to Speech
English TTS with English prompts (English samples are from LibriSpeech).
English Text | Speaker Prompt | Baseline | VIOLA |
---|---|---|---|
Sweat covered brion's body trickling into the tight loincloth that was the only garment he wore. | |||
But neither the glorified woods on the one hand nor the lake on the other could at first hold the eye. | |||
On sunday morning a clear beautiful and still day the order was given for the whole army to advance and to attack immediately. | |||
The merchant's daughter at first did not answer but as he kept on calling to her she finally asked him what it was that he wanted. | |||
Gold is the most common metal in the land of oz and is used for many purposes because it is soft and pliable. | |||
I understand you to say that there are three students who use this stair and are in the habit of passing your door yes there are. | |||
The cat growled softly picked up the prize in her jaws and trotted into the bushes to devour it. |
Zero-Shot Cross-lingual Text to Speech
English TTS with Chinese prompts and accent (Samples are from EMIME).
English Text | Speaker Prompt | Baseline | VIOLA |
---|---|---|---|
The fishermen are inactive tired and disappointed. | |||
Unfortunately others separate on the basis of accumulated hatred. | |||
In reality the european parliament is practising delay tactics. | |||
Summer time has in practice become normal time. | |||
There is a connection between all of this. | |||
This parliament represents the people of europe. | |||
Turkey must come to terms with the truth. | |||
The problems I mentioned are not insurmountable. | |||
We reject this socially unacceptable reform. | |||
In reality the european parliament is practising delay tactics. |
Zero-Shot Speech-to-Speech Translation
Chinese to English on EMIME dataset.
Chinese Speech (Input) | Model | Chinese Transcription (ASR Result) | English Text (MT Result) | English Speech (TTS Result) |
---|---|---|---|---|
Ground Truth | 我提到的这个问题并不难处理 | The problems I mentioned are not insurmountable. | ||
AED->AED->VALLEX | 我提到的这个问题并不难处理 | The problem I mentioned is not difficult to deal with. | ||
VIOLA | 我提到的这个问题并不难处理 | The problem I mentioned is not difficult to deal with. | ||
Ground Truth | 实际上欧洲议会正在施展拖延战术 | In reality the european parliament is practising delay tactics. | ||
AED->AED->VALLEX | 实际上欧洲议会正在施展拖延战术 | In fact europe is using delaying tactics once and for all. | ||
VIOLA | 实际上欧洲议会正在施展拖延战术 | In practice europe is once again staging its procrastination tactics. | ||
Ground Truth | 这一切之间皆有关联 | There is a connection between all of this. | ||
AED->AED->VALLEX | 这一切之间皆有关联 | There is a link between all this. | ||
VIOLA | 这一切之间皆有关联 | There is a connection between all of this. | ||
Ground Truth | 渔夫们没在工作显得累而失望 | The fishermen are inactive tired and disappointed. | ||
AED->AED->VALLEX | 渔夫们没在工作是累而失望 | The fishermen were tired and disappointed when they were not working. | ||
VIOLA | 渔夫们没在工作显得累而失望 | The fishermen looked tired and disappointed when they were not working. | ||
Ground Truth | 立法已经到位以便达成这些目标 | Legislation is already in place to achieve these aims. | ||
AED->AED->VALLEX | 立法已经到位已变成这些目标 | Legislation is in place and has become one of those objectives. | ||
VIOLA | 立法已经到位已变成这些目标 | Legislation is already in place and has become these targets. |
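The table columns (ASR result, MT result, TTS result) suggest that speech-to-speech translation can be read as three conditional generation steps chained together. The sketch below shows that chaining under stated assumptions: the `generate` callable, its signature, and the `S2STResult` container are hypothetical, and `toy_generate` is a stub standing in for a real model. This page does not specify the actual decoding procedure, and conditioning the TTS step on the source speech as a speaker prompt is not modeled here.

```python
from typing import Callable, List, NamedTuple

# Hypothetical interface: (task, src_lang, tgt_lang, source_tokens) -> generated tokens.
Generate = Callable[[str, str, str, List[str]], List[str]]


class S2STResult(NamedTuple):
    transcription: List[str]  # ASR result (source-language text)
    translation: List[str]    # MT result (target-language text)
    speech_tokens: List[str]  # TTS result (discrete codec tokens)


def speech_to_speech(generate: Generate, speech_tokens: List[str],
                     src_lang: str = "zh", tgt_lang: str = "en") -> S2STResult:
    """Chain ASR -> MT -> TTS, feeding each step's output into the next."""
    transcription = generate("asr", src_lang, src_lang, speech_tokens)
    translation = generate("mt", src_lang, tgt_lang, transcription)
    out_speech = generate("tts", tgt_lang, tgt_lang, translation)
    return S2STResult(transcription, translation, out_speech)


def toy_generate(task: str, src_lang: str, tgt_lang: str, source: List[str]) -> List[str]:
    """Stub standing in for a trained model; returns canned outputs per task."""
    canned = {
        "asr": ["这", "一切"],
        "mt": ["all", "of", "this"],
        "tts": ["<a1>", "<a2>"],
    }
    return canned[task]


result = speech_to_speech(toy_generate, ["<a12>", "<a7>"])
```

Because each step consumes the previous step's output, an early error carries forward. The last example in the table shows this: both systems transcribe 以便达成 as 已变成, and the resulting English text says the legislation "has become" the targets.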
Voice Emotion Maintenance
VIOLA can translate speech while preserving the emotion of the source speech. The source audio is sampled from the Emotional Voices Database ESD.
Emotion | Chinese Speech | English Speech Generated by VIOLA (Speech-to-Speech Translation) |
---|---|---|
Happy | ||
Neutral | ||
Angry | ||
Sad | ||
Surprise |
Ethics Statement
Since VIOLA can synthesize speech that maintains speaker identity, it carries potential risks of misuse, such as spoofing voice identification or impersonating a specific speaker. We conducted the experiments under the assumption that the user agrees to be the target speaker in speech synthesis. If the model is generalized to unseen speakers in the real world, it should be accompanied by a protocol ensuring that the speaker approves the use of their voice, as well as a synthesized speech detection model.