Multidataset-Jasper10x5DR

Description: This is a checkpoint for Jasper10x5DR trained on multiple datasets in NeMo.
Publisher: NVIDIA
Latest Version: 5
Modified: April 4, 2023
Size: 1.15 GB

Overview

This is a checkpoint for the Jasper10x5DR model that was trained in NeMo on six datasets: LibriSpeech, Mozilla Common Voice (validated clips from en_1488h_2019-12-10), WSJ, Fisher, Switchboard, and NSC Singapore English. It was trained with Apex/Amp optimization level O1.
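
For context, O1 is Apex/Amp's "mixed precision" mode: it patches Torch functions so that operations run in FP16 where numerically safe and stay in FP32 elsewhere. A minimal sketch of how a model and optimizer are wrapped for O1 (the model and optimizer below are placeholders, not the actual Jasper training setup):

import torch
from apex import amp

# Placeholder model and optimizer; O1 casts to FP16 where safe and keeps
# FP32 where needed (e.g. softmax, loss computation). Apex requires CUDA.
model = torch.nn.Linear(64, 29).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
model, optimizer = amp.initialize(model, optimizer, opt_level="O1")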

The model achieves a WER of 3.37% on LibriSpeech dev-clean, 9.81% on dev-other, 3.60% on test-clean, and 9.99% on test-other.
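
For reference, word error rate (WER) is the word-level edit distance (substitutions + deletions + insertions) between the hypothesis and reference transcripts, divided by the number of reference words. A minimal sketch of the computation (the function below is illustrative, not NeMo's implementation):

def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref, hyp = reference.split(), hypothesis.split()
    # Dynamic-programming edit distance over words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

print(word_error_rate("the cat sat", "the cat sit"))  # 1 substitution / 3 words = 0.333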

Please download the latest version to ensure compatibility with the latest NeMo release.

  • JasperNet10x5-En-Base.nemo - a compressed tarball that contains the encoder and decoder checkpoints, as well as the associated config file.
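
Because the .nemo file is an ordinary compressed tarball, its contents can be inspected with Python's standard tarfile module before loading it into NeMo (a sketch, not an official workflow):

import tarfile

# A .nemo checkpoint is a standard tar archive; list what it ships with.
with tarfile.open("JasperNet10x5-En-Base.nemo") as tar:
    for member in tar.getmembers():
        print(member.name, member.size)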

More details

Multidataset Jasper with dense residuals is a network from the Jasper family that uses dense residual connections. The Jasper ("Just Another SPeech Recognizer") family of networks are deep time-delay neural networks (TDNN) composed of blocks of 1D-convolutional layers. This model is an end-to-end neural acoustic model for automatic speech recognition (ASR) that provides near state-of-the-art results on LibriSpeech compared to other end-to-end ASR models. The architecture of the convolutional layers is designed to facilitate fast GPU inference by allowing whole sub-blocks to be fused into a single GPU kernel. The model is called "end-to-end" because it transcribes speech samples without any additional alignment information; connectionist temporal classification (CTC) finds the alignment between audio and text. The CTC-ASR training pipeline consists of the following steps (see the sketch after this list):

  1. audio preprocessing (feature extraction): signal normalization, windowing, and (log) spectrogram (or mel-scale spectrogram, or MFCC) computation
  2. neural acoustic model, which predicts a probability distribution P_t(c) over vocabulary characters c at each time step t, given the input features
  3. CTC loss function

The encoder and decoder checkpoints available here were trained using the Neural Modules (NeMo) toolkit and can be used for transfer learning and fine-tuning on new datasets. NVIDIA's Apex/Amp O1 optimization level was used for training on V100 GPUs.
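
To make the three stages concrete, here is a minimal self-contained sketch of the same pipeline in PyTorch/torchaudio; the feature dimensions, vocabulary, and the single Jasper-style sub-block are illustrative placeholders, not the actual Jasper10x5DR configuration:

import torch
import torchaudio

# 1. Feature extraction: log mel-scale spectrogram of a 1-second dummy waveform.
waveform = torch.randn(1, 16000)  # placeholder audio at 16 kHz
features = torchaudio.transforms.MelSpectrogram(sample_rate=16000, n_mels=64)(waveform)
log_mels = torch.log(features + 1e-6)  # (batch, n_mels, time)

# 2. Acoustic model: one Jasper-style sub-block (1D conv + batch norm + ReLU),
#    followed by a 1x1 conv "decoder" mapping to per-timestep character logits.
vocab_size = 29  # e.g. 26 letters + space + apostrophe + CTC blank (illustrative)
encoder = torch.nn.Sequential(
    torch.nn.Conv1d(64, 256, kernel_size=11, padding=5),
    torch.nn.BatchNorm1d(256),
    torch.nn.ReLU(),
)
decoder = torch.nn.Conv1d(256, vocab_size, kernel_size=1)
logits = decoder(encoder(log_mels))                  # (batch, vocab, time)
log_probs = logits.permute(2, 0, 1).log_softmax(-1)  # (time, batch, vocab) for CTC

# 3. CTC loss against a dummy character-ID target sequence (blank index = 0).
targets = torch.randint(1, vocab_size, (1, 10))
input_lengths = torch.tensor([log_probs.size(0)])
target_lengths = torch.tensor([10])
loss = torch.nn.CTCLoss(blank=0)(log_probs, targets, input_lengths, target_lengths)
print(loss.item())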

Documentation

Source code and a developer guide are available at https://github.com/NVIDIA/NeMo. Documentation is available at https://docs.nvidia.com/deeplearning/nemo/neural-modules-release-notes/index.html

Usage example:

python examples/asr/speech2text_infer.py --asr_model=JasperNet10x5-En --dataset=test.json

You can also grab this model directly from your code by including this line:

asr_model = nemo_asr.models.ASRConvCTCModel.from_pretrained(model_info='JasperNet10x5-En')
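
The line above assumes the NeMo ASR collection has already been imported under the nemo_asr alias. A minimal sketch, with a commented-out alternative for recent NeMo releases in which the equivalent CTC model class is EncDecCTCModel (an assumption; check the release notes linked above for your installed version):

# NeMo's ASR collection is conventionally imported under this alias.
import nemo.collections.asr as nemo_asr

# Download and restore the pretrained checkpoint by name.
asr_model = nemo_asr.models.ASRConvCTCModel.from_pretrained(model_info='JasperNet10x5-En')

# In recent NeMo releases the equivalent calls would be (assumption -- verify
# against your installed version):
# asr_model = nemo_asr.models.EncDecCTCModel.restore_from("JasperNet10x5-En-Base.nemo")
# print(asr_model.transcribe(["sample.wav"]))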