Aishell2 Jasper 10x5dr

Description: Jasper10x5dr model trained on the AISHELL-2 Mandarin Chinese dataset
Latest Version: 2
Modified: April 4, 2023
Size: 1.26 GB

Overview

This is a Jasper10x5dr checkpoint trained on the AISHELL-2 dataset using the NeMo toolkit. NVIDIA’s Apex/Amp O1 optimization level was used for training on V100 GPUs.
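As background, the O1 level patches PyTorch ops so that whitelisted operations run in FP16, with dynamic loss scaling handled automatically. A minimal sketch of what Apex/Amp O1 looks like in plain PyTorch (the model, features, and loss below are hypothetical placeholders for illustration; this is part of training, not of the inference example further down):

    import torch
    from apex import amp  # NVIDIA Apex: https://github.com/NVIDIA/apex

    # Placeholder model and optimizer; the real training used Jasper10x5dr
    model = torch.nn.Conv1d(64, 128, kernel_size=11, padding=5).cuda()
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

    # O1: automatic mixed precision via op patching, with dynamic loss scaling
    model, optimizer = amp.initialize(model, optimizer, opt_level="O1")

    features = torch.randn(8, 64, 200).cuda()
    loss = model(features).pow(2).mean()  # dummy loss for illustration

    optimizer.zero_grad()
    with amp.scale_loss(loss, optimizer) as scaled_loss:
        scaled_loss.backward()  # backprop through the scaled loss
    optimizer.step()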

  • JasperDecoderForCTC-STEP-394050.pt - pretrained decoder module
  • JasperEncoder-STEP-394050.pt - pretrained encoder module
  • vocab.txt - the vocabulary file
  • jasper10x5dr.yaml - the configuration file
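Assuming the .pt files are ordinary PyTorch state dicts (an assumption about how these per-module checkpoints were saved), they can be inspected directly; a quick sketch:

    import torch

    # Load each module checkpoint onto the CPU for inspection
    encoder_state = torch.load("JasperEncoder-STEP-394050.pt", map_location="cpu")
    decoder_state = torch.load("JasperDecoderForCTC-STEP-394050.pt", map_location="cpu")

    # Print parameter names and shapes to verify the checkpoint contents
    for name, tensor in encoder_state.items():
        print(name, tuple(tensor.shape))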

Documentation

Refer to the documentation at https://github.com/NVIDIA/NeMo.

Usage example: put the checkpoints into the checkpoint directory ($checkpoint_dir below), then run aishell_jasper_infer.py (from NeMo's ASR examples):

    python aishell_jasper_infer.py --model_config=$nemo_root/nemo/examples/asr/configs/jasper10x5dr.yaml --eval_datasets=test.json --load_dir=$checkpoint_dir --vocab_file=vocab.txt
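The --eval_datasets argument points to a NeMo-style manifest: one JSON object per line giving the audio path, duration in seconds, and reference transcript. A minimal test.json (the file paths and transcripts below are made-up examples):

    {"audio_filepath": "/data/aishell2/wav/D0001.wav", "duration": 2.9, "text": "大家好"}
    {"audio_filepath": "/data/aishell2/wav/D0002.wav", "duration": 3.4, "text": "谢谢大家"}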

More details

Source code is available at https://github.com/NVIDIA/NeMo.

This is a network from the Jasper family, which uses dense residual connections. The Jasper (“Just Another SPeech Recognizer”) family of networks are deep time-delay neural networks (TDNNs) comprising blocks of 1D-convolutional layers. This model is an end-to-end neural acoustic model for automatic speech recognition (ASR) that provides near state-of-the-art results on AISHELL-2 compared to other end-to-end ASR models. The architecture of the convolutional layers is designed to facilitate fast GPU inference by allowing whole sub-blocks to be fused into a single GPU kernel. The model is called “end-to-end” because it transcribes speech samples without any additional alignment information; CTC (connectionist temporal classification) finds the alignment between audio and text. The CTC-ASR training pipeline consists of the following:

  1. audio preprocessing (feature extraction): signal normalization, windowing, (log) spectrogram (or mel scale spectrogram, or MFCC)
  2. neural acoustic model, which predicts a probability distribution P_t(c) over vocabulary characters c at each time step t, given the input features for that time step
  3. CTC loss function, which learns the alignment between the per-time-step predictions and the target transcript

The encoder and decoder checkpoints available here were trained using the Neural Modules (NeMo) toolkit and can be used for transfer learning and fine-tuning on new datasets. NVIDIA’s Apex/Amp O1 optimization level was used for training on V100 GPUs.
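A compact, self-contained PyTorch sketch of steps 1-3 above; the feature shapes, toy convolutional model, and vocabulary size are illustrative placeholders, not the actual Jasper10x5dr configuration:

    import torch
    import torch.nn as nn

    # Step 1 (placeholder): assume features are already extracted as
    # log-mel spectrograms of shape (batch, n_mels, time)
    batch, n_mels, time, vocab = 4, 64, 200, 30
    features = torch.randn(batch, n_mels, time)

    # Step 2: a toy 1D-convolutional acoustic model standing in for the
    # Jasper encoder/decoder stack; it predicts P_t(c) at each time step
    acoustic_model = nn.Sequential(
        nn.Conv1d(n_mels, 256, kernel_size=11, padding=5),
        nn.ReLU(),
        nn.Conv1d(256, vocab + 1, kernel_size=1),  # +1 for the CTC blank
    )
    logits = acoustic_model(features)                        # (batch, vocab+1, time)
    log_probs = logits.log_softmax(dim=1).permute(2, 0, 1)   # (time, batch, classes)

    # Step 3: CTC loss against integer-encoded transcripts
    targets = torch.randint(1, vocab + 1, (batch, 20))       # 0 is reserved for blank
    input_lengths = torch.full((batch,), time, dtype=torch.long)
    target_lengths = torch.full((batch,), 20, dtype=torch.long)

    ctc = nn.CTCLoss(blank=0)
    loss = ctc(log_probs, targets, input_lengths, target_lengths)
    loss.backward()

At inference time, greedy CTC decoding simply takes the most probable character at each time step, then collapses repeated characters and removes blanks to produce the transcript.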