CNNs for Speech Processing
We propose to use convolutional neural networks (CNNs) for speech recognition, where convolution is applied along the frequency axis to normalize spectral variations in speech. We further propose a limited-weight-sharing scheme that better models speech features: because speech patterns differ across frequency regions, filter weights are shared only within limited frequency bands rather than across the whole axis. The special structure of CNNs, namely local connectivity, weight sharing, and pooling, exhibits some degree of invariance to small shifts of speech features along the frequency axis, which is important for handling speaker and environment variations.
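A minimal NumPy sketch of the two schemes may help make the distinction concrete. The sizes (40 log-mel channels, filter width 8, pooling size 2, a two-band split) and all function names are illustrative assumptions, not the configuration used in [1]: with full weight sharing each filter slides over the entire frequency axis, while with limited weight sharing the axis is split into bands and each band gets its own filter set.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes (assumptions, not the settings from the papers):
# 40 log-mel filterbank channels for one frame, filters of width 8,
# max-pooling over 2 adjacent frequency positions.
n_freq, filt_width, pool_size = 40, 8, 2
fbank = rng.standard_normal(n_freq)  # one frame of log-mel features

def conv_pool_full_sharing(x, weights):
    """Full weight sharing: each filter slides along the whole frequency
    axis; max-pooling then gives invariance to small frequency shifts."""
    n_filters, w = weights.shape
    n_pos = x.size - w + 1
    conv = np.array([[weights[f] @ x[p:p + w] for p in range(n_pos)]
                     for f in range(n_filters)])
    conv = np.maximum(conv, 0.0)  # ReLU nonlinearity
    n_pool = n_pos // pool_size   # max-pool along frequency positions
    return conv[:, :n_pool * pool_size].reshape(
        n_filters, n_pool, pool_size).max(axis=2)

def conv_pool_limited_sharing(x, band_weights):
    """Limited weight sharing: the frequency axis is split into bands and
    each band has its own filter set; pooling stays within a band."""
    band_len = x.size // len(band_weights)
    outputs = []
    for b, weights in enumerate(band_weights):
        seg = x[b * band_len:(b + 1) * band_len]
        outputs.append(conv_pool_full_sharing(seg, weights))
    return np.concatenate(outputs, axis=1)

w_full = rng.standard_normal((4, filt_width))        # 4 shared filters
full = conv_pool_full_sharing(fbank, w_full)

w_bands = [rng.standard_normal((4, filt_width)) for _ in range(2)]  # 2 bands
limited = conv_pool_limited_sharing(fbank, w_bands)
print(full.shape, limited.shape)  # (4, 16) (4, 12)
```

In the limited-sharing variant the pooled outputs of each band are simply concatenated; in a full model they would feed fully connected layers, and pooling across band boundaries is deliberately avoided because the band-specific filters produce incomparable feature maps.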
References:
[1] O. Abdel-Hamid, A. Mohamed, H. Jiang, L. Deng, G. Penn, and D. Yu, "Convolutional Neural Networks for Speech Recognition," IEEE/ACM Trans. on Audio, Speech, and Language Processing, vol. 22, no. 10, pp. 1533-1545, October 2014.
[2] O. Abdel-Hamid, A. Mohamed, H. Jiang, and G. Penn, "Applying Convolutional Neural Networks Concepts to Hybrid NN-HMM Model for Speech Recognition," Proc. of IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP'2012), Japan, March 2012.