NeMo Speech Synthesis models

Description: NeMo Speech Synthesis (Text to Speech, or TTS) models include spectrogram generators, which produce a spectrogram from text, and vocoders, which produce audio from a spectrogram.
Publisher: NVIDIA
Latest Version: 1.0.0a5
Modified: April 4, 2023
Size: 1.23 GB

Overview

The NVIDIA NeMo toolkit supports Text To Speech (TTS), also referred to as Speech Synthesis, via a two-step procedure. First, a spectrogram generator produces a mel spectrogram from text. Second, a vocoder produces audio from that mel spectrogram. This collection includes the mel spectrogram generators Tacotron 2 and Glow-TTS and the vocoders (audio generators) WaveGlow and SqueezeWave. You can train any of these models on domain-specific data using the scripts in the TTS directory. Note: Transfer learning is currently a research area in TTS.

Usage

You can instantiate all of these models directly from NGC. To do so, start your script with:

import nemo
import nemo.collections.tts as nemo_tts

Then choose the type of model you would like to instantiate (see the table below for the list of model base classes) and call the base_class.from_pretrained(...) method. For example:

# Spectrogram generator, which takes text as input and produces a spectrogram
spectrogram_generator = nemo_tts.models.Tacotron2Model.from_pretrained(model_name="Tacotron2-22050Hz")
# Vocoder, which takes a spectrogram and produces the actual audio
vocoder = nemo_tts.models.WaveGlowModel.from_pretrained(model_name="WaveGlow-22050Hz")
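
Once both models are instantiated, the two-step procedure from the Overview can be sketched as a small helper. This is a minimal sketch, not the toolkit's documented recipe: synthesize is a hypothetical name, and the method names parse, generate_spectrogram and convert_spectrogram_to_audio are assumed from the NeMo 1.x TTS base classes, so check them against the NeMo version you have installed.

```python
def synthesize(text, spectrogram_generator, vocoder):
    """Run the two-step TTS pipeline: text -> mel spectrogram -> audio.

    Assumption: both arguments follow the NeMo 1.x TTS interfaces
    (parse / generate_spectrogram on the generator,
    convert_spectrogram_to_audio on the vocoder).
    """
    # Turn the raw string into the token representation the generator expects.
    tokens = spectrogram_generator.parse(text)
    # Step 1: predict a mel spectrogram from the tokens.
    spectrogram = spectrogram_generator.generate_spectrogram(tokens=tokens)
    # Step 2: let the vocoder convert the spectrogram into a waveform.
    audio = vocoder.convert_spectrogram_to_audio(spec=spectrogram)
    return audio
```

With the two models from the example above, calling synthesize("Hello world.", spectrogram_generator, vocoder) would return whatever the vocoder produces, which under the same assumptions is a batch of audio samples at 22050 Hz.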

You can also list all available models through the API by calling the base_class.list_available_models(...) method.
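
As an illustration of the listing call, the sketch below collects only the model names a base class advertises. The helper name available_model_names is hypothetical, and the code assumes each returned entry exposes a pretrained_model_name attribute, as NeMo's PretrainedModelInfo does in the 1.x releases.

```python
def available_model_names(base_class):
    """Return the pretrained model names advertised by a NeMo model base class.

    Assumption: list_available_models() returns an iterable of objects with a
    pretrained_model_name attribute, or None when nothing is published.
    """
    infos = base_class.list_available_models() or []
    return [info.pretrained_model_name for info in infos]
```

For example, available_model_names(nemo_tts.models.Tacotron2Model) would return the names that can be passed as model_name to from_pretrained(...) for that class.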

You can also download the models' ".nemo" files from the "File Browser" tab and instantiate them with the base_class.restore_from(PATH_TO_DOTNEMO_FILE) method. In this case, make sure the NeMo version you have installed matches the version the models were saved with.
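
Restoring from a downloaded file can be wrapped in a small guard that fails early on a wrong path. This is a sketch under assumptions: restore_local_model is a hypothetical helper, and the restore_path keyword is assumed from the NeMo 1.x signature of restore_from.

```python
from pathlib import Path


def restore_local_model(base_class, nemo_file):
    """Restore a model from a local .nemo checkpoint after basic path checks.

    Assumption: base_class.restore_from accepts a restore_path keyword,
    as in NeMo 1.x.
    """
    path = Path(nemo_file)
    # Catch an obviously wrong file before NeMo tries to unpack it.
    if path.suffix != ".nemo":
        raise ValueError(f"Expected a .nemo file, got: {path.name}")
    if not path.is_file():
        raise FileNotFoundError(f"No such checkpoint: {path}")
    return base_class.restore_from(restore_path=str(path))
```

The helper does not compare versions itself. One way to check is to print nemo.__version__ and compare it with the version listed for the file you downloaded.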

Here is a list of currently available models together with their base classes and short descriptions.

| Model name | Model Base Class | Description |
| --- | --- | --- |
| Tacotron2-22050Hz | Tacotron2Model | Trained on LJSpeech sampled at 22050 Hz; generates a female English voice with an American accent. |
| WaveGlow-22050Hz | WaveGlowModel | Trained on LJSpeech sampled at 22050 Hz; can be used as a universal vocoder. |
| SqueezeWave-22050Hz | SqueezeWaveModel | Trained on LJSpeech sampled at 22050 Hz; can be used as a universal vocoder. |
| GlowTTS-22050Hz | GlowTTSModel | Trained on LJSpeech sampled at 22050 Hz; generates a female English voice with an American accent. |