This is a Jasper10x5dr checkpoint trained on the AISHELL-2 dataset with NeMo. Training used NVIDIA's Apex/Amp O1 optimization level on V100 GPUs.
Refer to the documentation at https://github.com/NVIDIA/NeMo
Usage example: put the checkpoints into the checkpoint directory, then run aishell_jasper_infer.py (from NeMo's ASR examples):
python aishell_jasper_infer.py --model_config=$nemo_root/nemo/examples/asr/configs/jasper10x5dr.yaml --eval_datasets=test.json --load_dir=$checkpoint_dir --vocab_file=vocab.txt
Source code is available at https://github.com/NVIDIA/NeMo
This is a network from the Jasper family, which uses dense residual connections. The Jasper ("Just Another SPeech Recognizer") family of networks are deep time-delay neural networks (TDNNs) composed of blocks of 1D-convolutional layers. This model is an end-to-end neural acoustic model for automatic speech recognition (ASR) that provides near state-of-the-art results on AISHELL-2 compared to other end-to-end ASR models. The convolutional architecture is designed for fast GPU inference: whole sub-blocks can be fused into a single GPU kernel. The model is called "end-to-end" because it transcribes speech samples without any additional alignment information; CTC (connectionist temporal classification) finds the alignment between audio and text. The CTC-ASR training pipeline consists of the following:
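As a brief aside before the pipeline steps: the way CTC recovers a transcript from frame-level predictions can be sketched with the greedy decoding collapse rule. This is an illustrative stand-alone sketch, not NeMo's implementation; the blank symbol "_" and the function name are assumptions made for this example.

```python
# Greedy CTC decoding sketch (illustrative, not NeMo's code).
# CTC emits one symbol per audio frame, including a special blank token;
# decoding collapses consecutive repeats, then removes blanks to
# recover the transcript without needing frame-level alignments.

BLANK = "_"  # hypothetical blank symbol used in this sketch

def ctc_greedy_collapse(frame_symbols):
    """Collapse repeated symbols, then drop blank tokens."""
    out = []
    prev = None
    for s in frame_symbols:
        if s != prev:  # merge consecutive repeats
            out.append(s)
        prev = s
    return "".join(s for s in out if s != BLANK)

# Frame-level output "hh_e_ll_llo" collapses to "hello";
# the blank between the two l-runs keeps the double "l".
print(ctc_greedy_collapse(list("hh_e_ll_llo")))
```

In practice the model outputs per-frame probability distributions and the argmax symbol sequence is collapsed this way; the blank token is what lets CTC represent repeated characters and silence.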