
NeMo Speech Models

Description: NeMo Speech Models cover speech recognition, command recognition, speaker identification, speaker verification, and voice activity detection. These models are used for Automatic Speech Recognition (ASR) and its sub-tasks.
Publisher: NVIDIA
Latest Version: 1.0.0a5
Modified: April 4, 2023
Size: 1.51 GB

Overview

The NVIDIA NeMo toolkit supports multiple Automatic Speech Recognition (ASR) models, such as Jasper and QuartzNet. Pretrained checkpoints for these models, trained on standard datasets, can be used immediately with the speech_to_text.py script in the examples directory. Models for ASR sub-tasks such as speech classification are also provided; for example, MatchboxNet, trained on the Google Speech Commands dataset, is available via speech_to_label.py. Voice Activity Detection is supported by the same script, simply by changing the config file passed to it. NeMo also supports training speech recognition models with Byte Pair/Word Piece encoding of the corpus via the speech_to_text_bpe.py example; these models are still under development. To evaluate one of these models on a dataset, use the speech_to_text_infer.py example, which shows how to compute word error rate (WER) over the dataset.
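For reference, WER is the word-level edit distance between the model's hypotheses and the reference transcripts, divided by the total number of reference words. Below is a minimal, self-contained sketch of that computation (not the NeMo implementation used by speech_to_text_infer.py):

def word_error_rate(hypotheses, references):
    # Sum word-level Levenshtein distances over all utterance pairs,
    # then normalize by the total number of reference words.
    total_errors, total_words = 0, 0
    for hyp, ref in zip(hypotheses, references):
        h, r = hyp.split(), ref.split()
        d = list(range(len(h) + 1))  # DP row for the empty reference prefix
        for i in range(1, len(r) + 1):
            prev, d[0] = d[0], i
            for j in range(1, len(h) + 1):
                cur = d[j]
                # Deletion, insertion, or (possibly free) substitution.
                d[j] = min(cur + 1, d[j - 1] + 1, prev + (r[i - 1] != h[j - 1]))
                prev = cur
        total_errors += d[len(h)]
        total_words += len(r)
    return total_errors / max(total_words, 1)

print(word_error_rate(["the cat sat"], ["the cat sat down"]))  # 0.25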

Usage

You can instantiate any of these models directly from NGC. To do so, start your script with:

import nemo
import nemo.collections.asr as nemo_asr

Then choose the type of model you would like to instantiate. See the table below for the list of model base classes, then use the base_class.from_pretrained(...) method. For example:

quartznet = nemo_asr.models.EncDecCTCModel.from_pretrained(model_name="QuartzNet15x5Base-En")
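
Once instantiated, the model can be used for inference. Here is a minimal sketch; it assumes your NeMo version exposes a transcribe(...) helper on EncDecCTCModel (availability and signature vary across versions) and that the listed WAV files are 16 kHz mono recordings you supply:

# Hypothetical file paths; replace with your own 16 kHz mono WAV files.
audio_files = ["sample1.wav", "sample2.wav"]

# transcribe(...) runs inference and returns one transcript string per file.
# (Helper availability/signature may differ across NeMo versions.)
print(quartznet.transcribe(paths2audio_files=audio_files))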

Note that you can also list all available models programmatically by calling the base_class.list_available_models(...) method.
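
For example, a short sketch (the pretrained_model_name field on the returned entries is an assumption; inspect the objects in your NeMo version):

# Print the names of all pretrained checkpoints registered for this class.
for model_info in nemo_asr.models.EncDecCTCModel.list_available_models():
    print(model_info.pretrained_model_name)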

You can also download a model's ".nemo" file from the "File Browser" tab and then instantiate it with the base_class.restore_from(PATH_TO_DOTNEMO_FILE) method. In this case, make sure the NeMo version you are running matches the version the model was created with.
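
For example, a minimal sketch assuming you have downloaded QuartzNet15x5Base-En.nemo into the working directory (the file name is whatever the File Browser provides):

# Restore a model from a locally downloaded ".nemo" checkpoint.
quartznet = nemo_asr.models.EncDecCTCModel.restore_from("QuartzNet15x5Base-En.nemo")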

Here is a list of currently available models together with their base classes and short descriptions.

| Model name | Model Base Class | Description |
| --- | --- | --- |
| QuartzNet15x5Base-En | EncDecCTCModel | QuartzNet15x5 model trained on six datasets: LibriSpeech, Mozilla Common Voice (validated clips from en_1488h_2019-12-10), WSJ, Fisher, Switchboard, and NSC Singapore English. It was trained with Apex/Amp optimization level O1 for 600 epochs and achieves a WER of 3.79% on LibriSpeech dev-clean and 10.05% on dev-other. |
| QuartzNet15x5Base-Zh | EncDecCTCModel | QuartzNet15x5 model trained on the AISHELL-2 Mandarin Chinese dataset. |
| QuartzNet5x5LS-En | EncDecCTCModel | QuartzNet5x5 model trained on LibriSpeech only. It achieves a WER of 5.37% on LibriSpeech dev-clean and 15.69% on dev-other. |
| QuartzNet15x5NR-En | EncDecCTCModel | QuartzNet15x5 model trained for robustness to noise. The base model QuartzNet15x5Base-En was fine-tuned with RIR and noise augmentation; prefer this model for noisy speech transcription. It achieves a WER of 3.96% on LibriSpeech dev-clean and 10.14% on dev-other. |
| Jasper10x5Dr-En | EncDecCTCModel | Jasper10x5Dr model trained on six datasets: LibriSpeech, Mozilla Common Voice (validated clips from en_1488h_2019-12-10), WSJ, Fisher, Switchboard, and NSC Singapore English. It was trained with Apex/Amp optimization level O1 and achieves a WER of 3.37% on LibriSpeech dev-clean and 9.81% on dev-other. |
| ContextNet-192-WPE-1024-8x-Stride | EncDecCTCModelBPE | Initial ContextNet implementation trained on the LibriSpeech corpus. It achieves a WER of 10.09% on test-other and 10.11% on dev-other. |
| MatchboxNet-3x1x64-v1 | EncDecClassificationModel | MatchboxNet model trained on the Google Speech Commands dataset (v1, 30 classes); obtains 97.32% accuracy on the test set. |
| MatchboxNet-3x2x64-v1 | EncDecClassificationModel | MatchboxNet model trained on the Google Speech Commands dataset (v1, 30 classes); obtains 97.68% accuracy on the test set. |
| MatchboxNet-3x1x64-v2 | EncDecClassificationModel | MatchboxNet model trained on the Google Speech Commands dataset (v2, 35 classes); obtains 97.12% accuracy on the test set. |
| MatchboxNet-3x1x64-v2 | EncDecClassificationModel | MatchboxNet model trained on the Google Speech Commands dataset (v2, 30 classes); obtains 97.29% accuracy on the test set. |
| MatchboxNet-3x1x64-v2-subset-task | EncDecClassificationModel | MatchboxNet model trained on the Google Speech Commands dataset (v2, 10+2 classes); obtains 98.2% accuracy on the test set. |
| MatchboxNet-3x2x64-v2-subset-task | EncDecClassificationModel | MatchboxNet model trained on the Google Speech Commands dataset (v2, 10+2 classes); obtains 98.4% accuracy on the test set. |
| MatchboxNet-VAD-3x2 | EncDecClassificationModel | Voice Activity Detection MatchboxNet model trained on Google Speech Commands (v2) and Freesound background data. It obtains 0.992 accuracy on a test set from the same sources and a TPR of 0.852 at an FPR of 0.315 on the AVA movie test set (ALL). |
| SpeakerNet_recognition | EncDecSpeakerLabelModel | SpeakerNet model trained end-to-end for speaker recognition with cross-entropy loss. It was trained on the VoxCeleb1 and VoxCeleb2 dev sets, augmented with MUSAN music and noise, and achieves 2.65% EER on the VoxCeleb-O cleaned trial file. |
| SpeakerNet_verification | EncDecSpeakerLabelModel | SpeakerNet model trained end-to-end for speaker verification with ArcFace angular softmax loss. It was trained on the VoxCeleb1 and VoxCeleb2 dev sets, augmented with MUSAN music and noise, and achieves 2.12% EER on the VoxCeleb-O cleaned trial file. |
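
The base class in the table is the class whose from_pretrained(...) method you call. For example, the same pattern as above applied to a classification checkpoint and a speaker model from the table:

# Speech command classification model (MatchboxNet).
matchboxnet = nemo_asr.models.EncDecClassificationModel.from_pretrained(
    model_name="MatchboxNet-3x1x64-v1"
)

# Speaker verification model (SpeakerNet).
speakernet = nemo_asr.models.EncDecSpeakerLabelModel.from_pretrained(
    model_name="SpeakerNet_verification"
)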