Modeling Intra-label Dynamics and Analyzing the Role of Blank in Connectionist Temporal Classification

Document Type : Machine Learning - Monsefi


Ferdowsi University of Mashhad


The goal of many sequence-processing tasks is to map a sequence of input data to a sequence of output labels. Long short-term memory (LSTM), a type of recurrent neural network (RNN), equipped with connectionist temporal classification (CTC) has proven to be one of the most suitable tools for such tasks. With CTC, per-frame labeled sequences are no longer necessary; knowing only the sequence of labels suffices. However, CTC assigns only a single state to each label, and consequently the LSTM does not learn intra-label relationships. In this paper, we propose to remedy this weakness by increasing the number of states assigned to each label and actively modeling the resulting intra-label transitions. In addition, the output of a CTC network usually corresponds to the set of all possible labels along with a blank. One use of the blank is in recognizing multiple consecutive identical labels. By assigning more than one state to each label, we can also decode consecutive identical labels without resorting to the blank. We investigated the effect of increasing the number of sub-labels, with and without the blank, on the recognition rate of the system. We performed experiments on two Arabic datasets, printed and handwritten. Our experiments showed that while a model without the blank may converge faster on simple tasks, on complex real-world datasets the use of the blank significantly improves the results.
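To make the role of the blank concrete, the following minimal Python sketch contrasts the standard CTC collapsing rule with one possible blank-free decoding that uses several states per label. It is an illustration under stated assumptions, not the method of the paper: the function names, the `(label, state)` frame encoding, and the rule "emit a label each time its state 0 is entered" are all hypothetical choices made for this example.

```python
from itertools import groupby

BLANK = "-"  # assumed symbol for the CTC blank in this sketch


def ctc_collapse(path):
    """Standard CTC mapping: merge repeated symbols, then drop blanks."""
    merged = [symbol for symbol, _ in groupby(path)]
    return [symbol for symbol in merged if symbol != BLANK]


def sublabel_collapse(path):
    """Hypothetical blank-free decoding with several states per label.

    Each frame is a (label, state) pair. Repeated frames are merged, and a
    label is emitted each time its first state (state 0) is entered, so a
    return from a later state to state 0 marks a new instance of the label.
    """
    merged = [frame for frame, _ in groupby(path)]
    return [label for label, state in merged if state == 0]


# With a single state per label, "aa" needs a blank between the two a's.
with_blank = ctc_collapse(list("aa-a"))
without_blank = ctc_collapse(list("aaa"))

# With two states per label, the transition (a,1) -> (a,0) separates them.
two_states = sublabel_collapse(
    [("a", 0), ("a", 0), ("a", 1), ("a", 0), ("a", 1)]
)
```

In this sketch, a single-state path with no blank between repeated frames collapses to one label, whereas the two-state path can represent two consecutive identical labels because re-entering state 0 acts as the separator.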

