PeriodWave-Turbo: Accelerating High-Fidelity Waveform Generation via Adversarial Flow Matching Optimization

Abstract

This paper introduces PeriodWave-Turbo, a high-fidelity and highly efficient waveform generation model trained via adversarial flow matching optimization. Recently, flow matching (FM) generative models have been successfully adopted for waveform generation tasks, leveraging a single vector field estimation objective for training. Although these models can generate high-fidelity waveform signals, they require significantly more ODE steps than GAN-based models, which need only a single generation step. Additionally, the generated samples often lack high-frequency information because noisy vector field estimation fails to ensure high-frequency reproduction. To address these limitations, we enhance pre-trained FM-based generative models by incorporating a fixed-step generator modification. We utilize reconstruction losses and adversarial feedback to accelerate high-fidelity waveform generation. Through adversarial flow matching optimization, the model requires only 1,000 steps of fine-tuning to achieve state-of-the-art performance across various objective metrics. Moreover, we significantly reduce the inference cost from 32 NFE (number of function evaluations) to 2 or 4 NFE. Additionally, by scaling up the backbone of PeriodWave from 29M to 70M parameters for improved generalization, PeriodWave-Turbo achieves unprecedented performance, with a perceptual evaluation of speech quality (PESQ) score of 4.454 on the LibriTTS dataset.
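The NFE figures can be read off the sampler settings listed with the audio samples on this page: a fixed-step Euler solver evaluates the vector field once per step, while the Midpoint method evaluates it twice per step. Assuming the 32 NFE baseline corresponds to the "step 16, Midpoint" setting, the arithmetic is 16 x 2 = 32, against 4 x 1 = 4 for "step 4, Euler". The following is a minimal sketch of that counting only. The function `solve_fixed_step` and the vector field `toy_field` are hypothetical illustrations, not the PeriodWave implementation, and the toy field stands in for a trained vector field estimator.

```python
from typing import Callable, List, Tuple

Vector = List[float]
Field = Callable[[Vector, float], Vector]

# Hypothetical target used only by the toy vector field below.
TARGET: Vector = [1.0, -0.5]


def toy_field(x: Vector, t: float) -> Vector:
    """Toy stand-in for a learned vector field: pulls x toward TARGET."""
    return [ti - xi for ti, xi in zip(TARGET, x)]


def solve_fixed_step(
    field: Field, x0: Vector, steps: int, method: str = "euler"
) -> Tuple[Vector, int]:
    """Integrate dx/dt = field(x, t) from t=0 to t=1 with a fixed step count.

    Returns the final state and the number of function evaluations (NFE).
    """
    if method not in ("euler", "midpoint"):
        raise ValueError(f"unknown method: {method}")
    if steps < 1:
        raise ValueError("steps must be a positive integer")

    h = 1.0 / steps
    x = list(x0)
    nfe = 0
    for i in range(steps):
        t = i * h
        k1 = field(x, t)  # first evaluation, used by both methods
        nfe += 1
        if method == "euler":
            x = [xi + h * ki for xi, ki in zip(x, k1)]
        else:
            # Midpoint: take a half step, then evaluate the field again there.
            x_mid = [xi + 0.5 * h * ki for xi, ki in zip(x, k1)]
            k2 = field(x_mid, t + 0.5 * h)
            nfe += 1
            x = [xi + h * ki for xi, ki in zip(x, k2)]
    return x, nfe


if __name__ == "__main__":
    for steps, method in [(16, "midpoint"), (4, "euler"), (2, "euler")]:
        _, nfe = solve_fixed_step(toy_field, [0.0, 0.0], steps, method)
        print(f"{method:8s} steps={steps:2d} NFE={nfe}")
```

Under these assumptions the three settings cost 32, 4, and 2 evaluations respectively, which is the reduction the abstract describes.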


Multi Speakers (LibriTTS Dataset)


Ground Truth
UnivNet
Vocos
BigVSAN
BigVGAN-base
BigVGAN
BigVGAN-v2
PeriodWave-B (step 16, Midpoint)
PeriodWave-B + FreeU (step 16, Midpoint)
PeriodWave-L (step 16, Midpoint)
PeriodWave-L + FreeU (step 16, Midpoint)
PeriodWave-Turbo-S (step 4, Euler)
PeriodWave-Turbo-B (step 4, Euler)
PeriodWave-Turbo-L (step 4, Euler)


Zero-shot TTS (ARDiT-TTS + Vocoder / 24,000 Hz)

To further demonstrate the effectiveness of our model for two-stage TTS, we added results for multi-speaker zero-shot TTS. As the TTS model, we used ARDiT-TTS, an autoregressive diffusion transformer-based zero-shot TTS model that uses the same Mel-spectrogram configuration for 24 kHz audio. We requested Mel-spectrograms generated by ARDiT-TTS from its authors, who kindly sent us the Mel-spectrograms of 500 samples from the LibriTTS test subsets. We have attached the UTMOS results for each vocoder, and we will conduct a MOS evaluation for this experiment. Although GAN-based models have shown powerful generative performance on the original Mel-spectrograms converted from ground-truth (GT) audio, these results show that they have low robustness to Mel-spectrograms generated by TTS models. We used the official implementations and checkpoints of BigVGAN and BigVSAN.


BigVSAN
BigVGAN-base
BigVGAN
BigVGAN-v2
PeriodWave-B (step 16, Midpoint)
PeriodWave-B + FreeU (step 16, Midpoint)
PeriodWave-L (step 16, Midpoint)
PeriodWave-L + FreeU (step 16, Midpoint)
PeriodWave-Turbo-S (step 4, Euler)
PeriodWave-Turbo-B (step 4, Euler)
PeriodWave-Turbo-L (step 4, Euler)


Single Speaker (LJSpeech Dataset)


Ground Truth
HiFi-GAN
BigVGAN-base
BigVGAN
BigVGAN-v2
PriorGrad (step 50)
PeriodWave (step 16, Midpoint)
PeriodWave + FreeU (step 16, Midpoint)
PeriodWave-Turbo (step 4, Euler)


Out-Of-Distribution: MUSDB18-HQ (Bass)


Ground Truth
UnivNet
Vocos
BigVSAN
BigVGAN-base
BigVGAN
BigVGAN-v2
PeriodWave-B (step 16, Midpoint)
PeriodWave-B + FreeU (step 16, Midpoint)
PeriodWave-L (step 16, Midpoint)
PeriodWave-L + FreeU (step 16, Midpoint)
PeriodWave-Turbo-S (step 4, Euler)
PeriodWave-Turbo-B (step 4, Euler)
PeriodWave-Turbo-L (step 4, Euler)


Out-Of-Distribution: MUSDB18-HQ (Drums)


Ground Truth
UnivNet
Vocos
BigVSAN
BigVGAN-base
BigVGAN
BigVGAN-v2
PeriodWave-B (step 16, Midpoint)
PeriodWave-B + FreeU (step 16, Midpoint)
PeriodWave-L (step 16, Midpoint)
PeriodWave-L + FreeU (step 16, Midpoint)
PeriodWave-Turbo-S (step 4, Euler)
PeriodWave-Turbo-B (step 4, Euler)
PeriodWave-Turbo-L (step 4, Euler)


Out-Of-Distribution: MUSDB18-HQ (Mixture)


Ground Truth
UnivNet
Vocos
BigVSAN
BigVGAN-base
BigVGAN
BigVGAN-v2
PeriodWave-B (step 16, Midpoint)
PeriodWave-B + FreeU (step 16, Midpoint)
PeriodWave-L (step 16, Midpoint)
PeriodWave-L + FreeU (step 16, Midpoint)
PeriodWave-Turbo-S (step 4, Euler)
PeriodWave-Turbo-B (step 4, Euler)
PeriodWave-Turbo-L (step 4, Euler)


Out-Of-Distribution: MUSDB18-HQ (Others)


Ground Truth
UnivNet
Vocos
BigVSAN
BigVGAN-base
BigVGAN
BigVGAN-v2
PeriodWave-B (step 16, Midpoint)
PeriodWave-B + FreeU (step 16, Midpoint)
PeriodWave-L (step 16, Midpoint)
PeriodWave-L + FreeU (step 16, Midpoint)
PeriodWave-Turbo-S (step 4, Euler)
PeriodWave-Turbo-B (step 4, Euler)
PeriodWave-Turbo-L (step 4, Euler)


Out-Of-Distribution: MUSDB18-HQ (Vocals)


Ground Truth
UnivNet
Vocos
BigVSAN
BigVGAN-base
BigVGAN
BigVGAN-v2
PeriodWave-B (step 16, Midpoint)
PeriodWave-B + FreeU (step 16, Midpoint)
PeriodWave-L (step 16, Midpoint)
PeriodWave-L + FreeU (step 16, Midpoint)
PeriodWave-Turbo-S (step 4, Euler)
PeriodWave-Turbo-B (step 4, Euler)
PeriodWave-Turbo-L (step 4, Euler)