
NeMo Speech Models

Description: NeMo Speech Models cover speech recognition, command recognition, speaker identification, speaker verification, and voice activity detection. These models are used for Automatic Speech Recognition (ASR) and its sub-tasks.
Publisher: NVIDIA
Latest Version: 1.0.0a5
Modified: April 4, 2023
Size: 1.51 GB

Overview

The NVIDIA NeMo toolkit supports multiple Automatic Speech Recognition (ASR) models, such as Jasper and QuartzNet. Pretrained checkpoints for these models, trained on standard datasets, can be used immediately with the speech_to_text.py script in the examples directory. Models for ASR sub-tasks such as speech classification are also provided; for example, MatchboxNet, trained on the Google Speech Commands dataset, is available via speech_to_label.py. Voice Activity Detection is supported by the same script, simply by changing the config file passed to it. NeMo also supports training speech recognition models with Byte Pair/Word Piece encoding of the corpus via the speech_to_text_bpe.py example; these models are still under development. To evaluate one of these models on a dataset, use the speech_to_text_infer.py example, which shows how to compute word error rate (WER) over the dataset.
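For reference, WER is the word-level edit distance between the model's hypotheses and the reference transcripts, divided by the total number of reference words. Below is a minimal, self-contained sketch of that computation (not the NeMo implementation used by speech_to_text_infer.py):

def word_error_rate(hypotheses, references):
    # Sum word-level Levenshtein distances over all utterance pairs,
    # then normalize by the total number of reference words.
    total_errors, total_words = 0, 0
    for hyp, ref in zip(hypotheses, references):
        h, r = hyp.split(), ref.split()
        d = list(range(len(h) + 1))  # DP row for the empty reference prefix
        for i in range(1, len(r) + 1):
            prev, d[0] = d[0], i
            for j in range(1, len(h) + 1):
                cur = d[j]
                # Deletion, insertion, or (possibly free) substitution.
                d[j] = min(cur + 1, d[j - 1] + 1, prev + (r[i - 1] != h[j - 1]))
                prev = cur
        total_errors += d[len(h)]
        total_words += len(r)
    return total_errors / max(total_words, 1)

print(word_error_rate(["the cat sat"], ["the cat sat down"]))  # 0.25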

Usage

You can instantiate any of these models directly from NGC. To do so, start your script with:

import nemo
import nemo.collections.asr as nemo_asr

Then choose the type of model you would like to instantiate. See the table below for the list of model base classes, then use the base_class.from_pretrained(...) method. For example:

quartznet = nemo_asr.models.EncDecCTCModel.from_pretrained(model_name="QuartzNet15x5Base-En")
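
Once instantiated, the model can be used for inference. Here is a minimal sketch; it assumes your NeMo version exposes a transcribe(...) helper on EncDecCTCModel (availability and signature vary across versions) and that the listed WAV files are 16 kHz mono recordings you supply:

# Hypothetical file paths; replace with your own 16 kHz mono WAV files.
audio_files = ["sample1.wav", "sample2.wav"]

# transcribe(...) runs inference and returns one transcript string per file.
# (Helper availability/signature may differ across NeMo versions.)
print(quartznet.transcribe(paths2audio_files=audio_files))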

Note that you can also list all available models programmatically by calling the base_class.list_available_models(...) method.
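
For example, a short sketch (the pretrained_model_name field on the returned entries is an assumption; inspect the objects in your NeMo version):

# Print the names of all pretrained checkpoints registered for this class.
for model_info in nemo_asr.models.EncDecCTCModel.list_available_models():
    print(model_info.pretrained_model_name)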

You can also download a model's ".nemo" file from the "File Browser" tab and then instantiate it with the base_class.restore_from(PATH_TO_DOTNEMO_FILE) method. In this case, make sure the NeMo version you are running matches the version the model was created with.
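
For example, a minimal sketch assuming you have downloaded QuartzNet15x5Base-En.nemo into the working directory (the file name is whatever the File Browser provides):

# Restore a model from a locally downloaded ".nemo" checkpoint.
quartznet = nemo_asr.models.EncDecCTCModel.restore_from("QuartzNet15x5Base-En.nemo")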

Here is a list of currently available models together with their base classes and short descriptions.

| Model name | Model Base Class | Description |
| --- | --- | --- |
| QuartzNet15x5Base-En | EncDecCTCModel | QuartzNet15x5 model trained on six datasets: LibriSpeech, Mozilla Common Voice (validated clips from en_1488h_2019-12-10), WSJ, Fisher, Switchboard, and NSC Singapore English. It was trained with Apex/Amp optimization level O1 for 600 epochs and achieves a WER of 3.79% on LibriSpeech dev-clean and 10.05% on dev-other. |
| QuartzNet15x5Base-Zh | EncDecCTCModel | QuartzNet15x5 model trained on the AISHELL-2 Mandarin Chinese dataset. |
| QuartzNet5x5LS-En | EncDecCTCModel | QuartzNet5x5 model trained on LibriSpeech only. It achieves a WER of 5.37% on LibriSpeech dev-clean and 15.69% on dev-other. |
| QuartzNet15x5NR-En | EncDecCTCModel | QuartzNet15x5 model trained for robustness to noise. The base model QuartzNet15x5Base-En was fine-tuned with RIR and noise augmentation; prefer this model for noisy speech transcription. It achieves a WER of 3.96% on LibriSpeech dev-clean and 10.14% on dev-other. |
| Jasper10x5Dr-En | EncDecCTCModel | Jasper10x5Dr model trained on six datasets: LibriSpeech, Mozilla Common Voice (validated clips from en_1488h_2019-12-10), WSJ, Fisher, Switchboard, and NSC Singapore English. It was trained with Apex/Amp optimization level O1 and achieves a WER of 3.37% on LibriSpeech dev-clean and 9.81% on dev-other. |
| ContextNet-192-WPE-1024-8x-Stride | EncDecCTCModelBPE | Initial ContextNet implementation trained on the LibriSpeech corpus. It achieves a WER of 10.09% on test-other and 10.11% on dev-other. |
| MatchboxNet-3x1x64-v1 | EncDecClassificationModel | MatchboxNet model trained on the Google Speech Commands dataset (v1, 30 classes); obtains 97.32% accuracy on the test set. |
| MatchboxNet-3x2x64-v1 | EncDecClassificationModel | MatchboxNet model trained on the Google Speech Commands dataset (v1, 30 classes); obtains 97.68% accuracy on the test set. |
| MatchboxNet-3x1x64-v2 | EncDecClassificationModel | MatchboxNet model trained on the Google Speech Commands dataset (v2, 35 classes); obtains 97.12% accuracy on the test set. |
| MatchboxNet-3x1x64-v2 | EncDecClassificationModel | MatchboxNet model trained on the Google Speech Commands dataset (v2, 30 classes); obtains 97.29% accuracy on the test set. |
| MatchboxNet-3x1x64-v2-subset-task | EncDecClassificationModel | MatchboxNet model trained on the Google Speech Commands dataset (v2, 10+2 classes); obtains 98.2% accuracy on the test set. |
| MatchboxNet-3x2x64-v2-subset-task | EncDecClassificationModel | MatchboxNet model trained on the Google Speech Commands dataset (v2, 10+2 classes); obtains 98.4% accuracy on the test set. |
| MatchboxNet-VAD-3x2 | EncDecClassificationModel | Voice Activity Detection MatchboxNet model trained on Google Speech Commands (v2) and Freesound background data. It obtains 0.992 accuracy on a test set from the same sources and a TPR of 0.852 at an FPR of 0.315 on the AVA movie test set (ALL). |
| SpeakerNet_recognition | EncDecSpeakerLabelModel | SpeakerNet model trained end-to-end for speaker recognition with cross-entropy loss. It was trained on the VoxCeleb1 and VoxCeleb2 dev sets, augmented with MUSAN music and noise, and achieves 2.65% EER on the VoxCeleb-O cleaned trial file. |
| SpeakerNet_verification | EncDecSpeakerLabelModel | SpeakerNet model trained end-to-end for speaker verification with ArcFace angular softmax loss. It was trained on the VoxCeleb1 and VoxCeleb2 dev sets, augmented with MUSAN music and noise, and achieves 2.12% EER on the VoxCeleb-O cleaned trial file. |
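
The base class in the table is the class whose from_pretrained(...) method you call. For example, the same pattern as above applied to a classification checkpoint and a speaker model from the table:

# Speech command classification model (MatchboxNet).
matchboxnet = nemo_asr.models.EncDecClassificationModel.from_pretrained(
    model_name="MatchboxNet-3x1x64-v1"
)

# Speaker verification model (SpeakerNet).
speakernet = nemo_asr.models.EncDecSpeakerLabelModel.from_pretrained(
    model_name="SpeakerNet_verification"
)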