This is a checkpoint for the Jasper10x5DR model that was trained in NeMo on six datasets: LibriSpeech, Mozilla Common Voice (validated clips from en_1488h_2019-12-10), WSJ, Fisher, Switchboard, and NSC Singapore English. It was trained with Apex/Amp optimization level O1.
The model achieves a WER of 3.37% on LibriSpeech dev-clean, 9.81% on dev-other, 3.60% on test-clean, and 9.99% on test-other.
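WER (word error rate) is the standard Levenshtein-based metric: substitutions, deletions, and insertions in the hypothesis divided by the number of words in the reference. The sketch below is only an illustration of the metric, not the evaluation script used to produce the numbers above.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance (S + D + I) divided by the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution / match
    return dp[len(ref)][len(hyp)] / max(len(ref), 1)

# One substituted word against a four-word reference gives 25% WER.
print(word_error_rate("the cat sat down", "the cat sat town"))  # 0.25
```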
Please download the latest version of the checkpoint to ensure compatibility with the latest NeMo release.
Multi-dataset Jasper with dense residuals is a network from the Jasper family that uses dense residual connections. The Jasper (“Just Another Speech Recognizer”) family of networks are deep time-delay neural networks (TDNN) composed of blocks of 1D convolutional layers. This model is an end-to-end neural acoustic model for automatic speech recognition (ASR) that provides near state-of-the-art results on LibriSpeech compared to other end-to-end ASR models. The architecture of the convolutional layers is designed to facilitate fast GPU inference by allowing whole sub-blocks to be fused into a single GPU kernel. The model is called “end-to-end” because it transcribes speech samples without any additional alignment information; CTC (connectionist temporal classification) is used to find the alignment between audio and text. The CTC-ASR training pipeline consists of the following blocks:

1. Audio preprocessing (feature extraction): signal normalization, windowing, and (log) spectrogram, mel-scale spectrogram, or MFCC features.
2. Neural acoustic model, which predicts a probability distribution over vocabulary characters at each time step given the input features.
3. CTC loss function.
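The dense-residual block structure and the CTC objective can be illustrated with a short PyTorch sketch. This is a simplified illustration rather than the NeMo implementation, and the sizes used (64 input features, 256 channels, a 28-symbol vocabulary, 5 repeated sub-blocks) are placeholders, not the Jasper10x5DR configuration.

```python
import torch
import torch.nn as nn

class JasperSubBlock(nn.Module):
    """1D convolution -> batch norm -> ReLU -> dropout on (batch, channels, time) tensors."""
    def __init__(self, in_ch, out_ch, kernel_size, dropout=0.2):
        super().__init__()
        self.conv = nn.Conv1d(in_ch, out_ch, kernel_size, padding=kernel_size // 2)
        self.norm_act = nn.Sequential(nn.BatchNorm1d(out_ch), nn.ReLU(), nn.Dropout(dropout))

    def forward(self, x):
        return self.norm_act(self.conv(x))

class DenseResidualBlock(nn.Module):
    """A block of repeated sub-blocks. With dense residuals, the outputs of all
    preceding blocks are projected with 1x1 convolutions and summed into this
    block's output (simplified here: the residual is added after the last sub-block)."""
    def __init__(self, residual_channels, out_ch, kernel_size, repeat=5):
        super().__init__()
        chans = [residual_channels[-1]] + [out_ch] * repeat
        self.subblocks = nn.ModuleList(
            JasperSubBlock(c_in, c_out, kernel_size)
            for c_in, c_out in zip(chans[:-1], chans[1:])
        )
        # one 1x1 projection per dense residual input
        self.res_projs = nn.ModuleList(
            nn.Conv1d(c, out_ch, kernel_size=1) for c in residual_channels
        )

    def forward(self, x, residual_inputs):
        out = x
        for sub in self.subblocks:
            out = sub(out)
        for proj, res in zip(self.res_projs, residual_inputs):
            out = out + proj(res)  # dense residual sum
        return torch.relu(out)

# Toy forward pass and CTC loss: 64 mel features, 28-character vocabulary (blank = 0).
feats = torch.randn(8, 64, 200)                          # (batch, features, time)
block = DenseResidualBlock([64], out_ch=256, kernel_size=11)
hidden = block(feats, residual_inputs=[feats])
logits = nn.Conv1d(256, 28, kernel_size=1)(hidden)       # per-timestep character scores
log_probs = logits.log_softmax(dim=1).permute(2, 0, 1)   # (time, batch, classes) for CTC
targets = torch.randint(1, 28, (8, 30))                  # character indices, no blanks
loss = nn.CTCLoss(blank=0)(
    log_probs, targets,
    input_lengths=torch.full((8,), 200, dtype=torch.long),
    target_lengths=torch.full((8,), 30, dtype=torch.long),
)
```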
Source code and a developer guide are available at https://github.com/NVIDIA/NeMo. Refer to the documentation at https://docs.nvidia.com/deeplearning/nemo/neural-modules-release-notes/index.html
Usage example: python examples/asr/speech2text_infer.py --asr_model=JasperNet10x5-En --dataset=test.json
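The --dataset argument points to an evaluation manifest. NeMo manifests are JSON-lines files with one utterance per line, using the audio_filepath, duration, and text fields; the paths, durations, and transcripts below are placeholders:

```
{"audio_filepath": "/path/to/utterance_0001.wav", "duration": 3.2, "text": "the transcript of the first utterance"}
{"audio_filepath": "/path/to/utterance_0002.wav", "duration": 5.7, "text": "the transcript of the second utterance"}
```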
You can also instantiate this model directly from your code:
import nemo.collections.asr as nemo_asr
asr_model = nemo_asr.models.ASRConvCTCModel.from_pretrained(model_info='JasperNet10x5-En')