文章作者:Tyan
博客:noahsnail.com | CSDN | 簡書
翻譯論文匯總:https://github.com/SnailTyan/deep-learning-papers-translation
An End-to-End Trainable Neural Network for Image-based Sequence Recognition and Its Application to Scene Text Recognition
Abstract
Image-based sequence recognition has been a long-standing research topic in computer vision. In this paper, we investigate the problem of scene text recognition, which is among the most important and challenging tasks in image-based sequence recognition. A novel neural network architecture, which integrates feature extraction, sequence modeling and transcription into a unified framework, is proposed. Compared with previous systems for scene text recognition, the proposed architecture possesses four distinctive properties: (1) It is end-to-end trainable, in contrast to most of the existing algorithms whose components are separately trained and tuned. (2) It naturally handles sequences in arbitrary lengths, involving no character segmentation or horizontal scale normalization. (3) It is not confined to any predefined lexicon and achieves remarkable performances in both lexicon-free and lexicon-based scene text recognition tasks. (4) It generates an effective yet much smaller model, which is more practical for real-world application scenarios. The experiments on standard benchmarks, including the IIIT-5K, Street View Text and ICDAR datasets, demonstrate the superiority of the proposed algorithm over the prior arts. Moreover, the proposed algorithm performs well in the task of image-based music score recognition, which evidently verifies the generality of it.
摘要
基于圖像的序列識別一直是計算機視覺中長期存在的研究課題。在本文中,我們研究了場景文本識別的問題,這是基于圖像的序列識別中最重要和最具挑戰性的任務之一。本文提出了一種將特征提取,序列建模和轉錄整合到統一框架中的新型神經網絡架構。與以前的場景文本識別系統相比,所提出的架構具有四個顯著的特性:(1)與大多數現有的組件需要單獨訓練和調優的算法相比,它是端對端可訓練的。(2)它自然地處理任意長度的序列,不涉及字符分割或水平尺度歸一化。(3)它不局限于任何預定義的詞典,并且在無詞典和基于詞典的場景文本識別任務中都取得了顯著的表現。(4)它產生了一個有效而小得多的模型,這對于現實世界的應用場景更為實用。在包括IIIT-5K,Street View Text和ICDAR數據集在內的標準基準數據集上的實驗證明了提出的算法比現有技術更有優勢。此外,提出的算法在基于圖像的樂譜識別任務中表現良好,這明顯驗證了它的泛化性。
1. Introduction
Recently, the community has seen a strong revival of neural networks, which is mainly stimulated by the great success of deep neural network models, specifically Deep Convolutional Neural Networks (DCNN), in various vision tasks. However, majority of the recent works related to deep neural networks have devoted to detection or classification of object categories [12, 25]. In this paper, we are concerned with a classic problem in computer vision: image-based sequence recognition. In real world, a stable of visual objects, such as scene text, handwriting and musical score, tend to occur in the form of sequence, not in isolation. Unlike general object recognition, recognizing such sequence-like objects often requires the system to predict a series of object labels, instead of a single label. Therefore, recognition of such objects can be naturally cast as a sequence recognition problem. Another unique property of sequence-like objects is that their lengths may vary drastically. For instance, English words can either consist of 2 characters such as “OK” or 15 characters such as “congratulations”. Consequently, the most popular deep models like DCNN [25, 26] cannot be directly applied to sequence prediction, since DCNN models often operate on inputs and outputs with fixed dimensions, and thus are incapable of producing a variable-length label sequence.
1. 引言
最近,社區已經看到神經網絡的強大復興,這主要受到深度神經網絡模型,特別是深度卷積神經網絡(DCNN)在各種視覺任務中的巨大成功的推動。然而,最近大多數與深度神經網絡相關的工作主要致力于物體類別的檢測或分類[12,25]。在本文中,我們關注計算機視覺中的一個經典問題:基于圖像的序列識別。在現實世界中,一系列視覺對象,如場景文字,手寫文字和樂譜,往往以序列的形式出現,而不是孤立地出現。與一般的對象識別不同,識別這樣的類序列對象通常需要系統預測一系列對象標簽,而不是單個標簽。因此,可以自然地將這類對象的識別視為序列識別問題。類序列對象的另一個獨特之處在于它們的長度可能會有很大變化。例如,英文單詞可以由2個字符組成,如“OK”,或由15個字符組成,如“congratulations”。因此,最流行的深度模型像DCNN[25,26]不能直接應用于序列預測,因為DCNN模型通常對具有固定維度的輸入和輸出進行操作,因此不能產生可變長度的標簽序列。
Some attempts have been made to address this problem for a specific sequence-like object (e.g. scene text). For example, the algorithms in [35, 8] firstly detect individual characters and then recognize these detected characters with DCNN models, which are trained using labeled character images. Such methods often require training a strong character detector for accurately detecting and cropping each character out from the original word image. Some other approaches (such as [22]) treat scene text recognition as an image classification problem, and assign a class label to each English word (90K words in total). It turns out a large trained model with a huge number of classes, which is difficult to be generalized to other types of sequence-like objects, such as Chinese texts, musical scores, etc., because the numbers of basic combinations of such kind of sequences can be greater than 1 million. In summary, current systems based on DCNN can not be directly used for image-based sequence recognition.
已經有一些針對特定類序列對象(例如場景文本)的嘗試來解決該問題。例如,[35,8]中的算法首先檢測單個字符,然后用DCNN模型識別這些檢測到的字符,DCNN模型使用標注的字符圖像進行訓練。這些方法通常需要訓練一個強字符檢測器,以便從原始單詞圖像中準確地檢測和裁剪出每個字符。另一些方法(如[22])將場景文本識別視為圖像分類問題,并為每個英文單詞(總共9萬個詞)分配一個類標簽。這樣得到的是一個類別數量巨大的大型訓練模型,很難泛化到其它類型的類序列對象,如中文文本,樂譜等,因為這種序列的基本組合數目可能大于100萬。總之,目前基于DCNN的系統不能直接用于基于圖像的序列識別。
Recurrent neural networks (RNN) models, another important branch of the deep neural networks family, were mainly designed for handling sequences. One of the advantages of RNN is that it does not need the position of each element in a sequence object image in both training and testing. However, a preprocessing step that converts an input object image into a sequence of image features, is usually essential. For example, Graves et al. [16] extract a set of geometrical or image features from handwritten texts, while Su and Lu [33] convert word images into sequential HOG features. The preprocessing step is independent of the subsequent components in the pipeline, thus the existing systems based on RNN can not be trained and optimized in an end-to-end fashion.
循環神經網絡(RNN)模型是深度神經網絡家族中的另一個重要分支,主要設計用來處理序列。RNN的優點之一是在訓練和測試中都不需要序列目標圖像中每個元素的位置。然而,通常必需一個將輸入目標圖像轉換成圖像特征序列的預處理步驟。例如,Graves等[16]從手寫文本中提取一系列幾何或圖像特征,而Su和Lu[33]將單詞圖像轉換為序列化的HOG特征。預處理步驟獨立于流程中的后續組件,因此基于RNN的現有系統不能以端到端的方式進行訓練和優化。
Several conventional scene text recognition methods that are not based on neural networks also brought insightful ideas and novel representations into this field. For example, Almazán et al. [5] and Rodriguez-Serrano et al. [30] proposed to embed word images and text strings in a common vectorial subspace, and word recognition is converted into a retrieval problem. Yao et al. [36] and Gordo et al. [14] used mid-level features for scene text recognition. Though achieved promising performance on standard benchmarks, these methods are generally outperformed by previous algorithms based on neural networks [8, 22], as well as the approach proposed in this paper.
一些不是基于神經網絡的傳統場景文本識別方法也為這一領域帶來了有見地的想法和新穎的表示。例如,Almazán等人[5]和Rodriguez-Serrano等人[30]提出將單詞圖像和文本字符串嵌入到公共向量子空間中,并將詞識別轉換為檢索問題。Yao等人[36]和Gordo等人[14]使用中層特征進行場景文本識別。雖然在標準基準數據集上取得了不錯的性能,但是這些方法通常不如前面的基于神經網絡的算法[8,22]以及本文提出的方法。
The main contribution of this paper is a novel neural network model, whose network architecture is specifically designed for recognizing sequence-like objects in images. The proposed neural network model is named as Convolutional Recurrent Neural Network (CRNN), since it is a combination of DCNN and RNN. For sequence-like objects, CRNN possesses several distinctive advantages over conventional neural network models: 1) It can be directly learned from sequence labels (for instance, words), requiring no detailed annotations (for instance, characters); 2) It has the same property of DCNN on learning informative representations directly from image data, requiring neither hand-craft features nor preprocessing steps, including binarization/segmentation, component localization, etc.; 3) It has the same property of RNN, being able to produce a sequence of labels; 4) It is unconstrained to the lengths of sequence-like objects, requiring only height normalization in both training and testing phases; 5) It achieves better or highly competitive performance on scene texts (word recognition) than the prior arts [23, 8]; 6) It contains much less parameters than a standard DCNN model, consuming less storage space.
本文的主要貢獻是一種新穎的神經網絡模型,其網絡架構設計專門用于識別圖像中的類序列對象。所提出的神經網絡模型被稱為卷積循環神經網絡(CRNN),因為它是DCNN和RNN的組合。對于類序列對象,CRNN與傳統神經網絡模型相比具有一些獨特的優點:1)可以直接從序列標簽(例如單詞)學習,不需要詳細的標注(例如字符);2)直接從圖像數據學習信息表示時具有與DCNN相同的性質,既不需要手工特征也不需要預處理步驟,包括二值化/分割,組件定位等;3)具有與RNN相同的性質,能夠產生一系列標簽;4)對類序列對象的長度無約束,只需要在訓練階段和測試階段對高度進行歸一化;5)與現有技術[23,8]相比,它在場景文本(單詞識別)上獲得更好或更具競爭力的表現;6)它比標準DCNN模型包含的參數要少得多,占用更少的存儲空間。
2. The Proposed Network Architecture
The network architecture of CRNN, as shown in Fig. 1, consists of three components, including the convolutional layers, the recurrent layers, and a transcription layer, from bottom to top.
Figure 1. The network architecture. The architecture consists of three parts: 1) convolutional layers, which extract a feature sequence from the input image; 2) recurrent layers, which predict a label distribution for each frame; 3) transcription layer, which translates the per-frame predictions into the final label sequence.
2. 提出的網絡架構
如圖1所示,CRNN的網絡架構由三部分組成,從下到上依次為卷積層,循環層和轉錄層。
圖1。網絡架構。架構包括三部分:1) 卷積層,從輸入圖像中提取特征序列;2) 循環層,預測每一幀的標簽分布;3) 轉錄層,將每一幀的預測變為最終的標簽序列。
At the bottom of CRNN, the convolutional layers automatically extract a feature sequence from each input image. On top of the convolutional network, a recurrent network is built for making prediction for each frame of the feature sequence, outputted by the convolutional layers. The transcription layer at the top of CRNN is adopted to translate the per-frame predictions by the recurrent layers into a label sequence. Though CRNN is composed of different kinds of network architectures (eg. CNN and RNN), it can be jointly trained with one loss function.
在CRNN的底部,卷積層自動從每個輸入圖像中提取特征序列。在卷積網絡之上,構建了一個循環網絡,用于對卷積層輸出的特征序列的每一幀進行預測。采用CRNN頂部的轉錄層將循環層的每幀預測轉化為標簽序列。雖然CRNN由不同類型的網絡架構(如CNN和RNN)組成,但可以通過一個損失函數進行聯合訓練。
2.1. Feature Sequence Extraction
In CRNN model, the component of convolutional layers is constructed by taking the convolutional and max-pooling layers from a standard CNN model (fully-connected layers are removed). Such component is used to extract a sequential feature representation from an input image. Before being fed into the network, all the images need to be scaled to the same height. Then a sequence of feature vectors is extracted from the feature maps produced by the component of convolutional layers, which is the input for the recurrent layers. Specifically, each feature vector of a feature sequence is generated from left to right on the feature maps by column. This means the i-th feature vector is the concatenation of the i-th columns of all the maps. The width of each column in our settings is fixed to single pixel.
2.1. 特征序列提取
在CRNN模型中,通過采用標準CNN模型(去除全連接層)中的卷積層和最大池化層來構造卷積層的組件。這樣的組件用于從輸入圖像中提取序列特征表示。在進入網絡之前,所有的圖像需要縮放到相同的高度。然后從卷積層組件產生的特征圖中提取特征向量序列,這些特征向量序列作為循環層的輸入。具體地,特征序列的每一個特征向量在特征圖上按列從左到右生成。這意味著第i個特征向量是所有特征圖第i列的連接。在我們的設置中每列的寬度固定為單個像素。
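下面用一小段Python/NumPy代碼示意上面描述的“按列取特征向量”的過程(其中特征圖的通道數,高度和寬度均為假設的示例值,并非論文的具體配置):

```python
import numpy as np

def map_to_sequence(feature_maps):
    """Convert CNN feature maps of shape (C, H, W) into a feature sequence.

    The i-th feature vector is the concatenation of the i-th column of all
    maps, scanned from left to right; each column is a single pixel wide.
    """
    c, h, w = feature_maps.shape
    return [feature_maps[:, :, i].reshape(-1) for i in range(w)]

# Toy example: 512 maps of size 1 x 25 give a sequence of 25 vectors of dim 512.
maps = np.random.rand(512, 1, 25).astype(np.float32)
sequence = map_to_sequence(maps)
print(len(sequence), sequence[0].shape)  # 25 (512,)
```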
As the layers of convolution, max-pooling, and element-wise activation function operate on local regions, they are translation invariant. Therefore, each column of the feature maps corresponds to a rectangle region of the original image (termed the receptive field), and such rectangle regions are in the same order to their corresponding columns on the feature maps from left to right. As illustrated in Fig. 2, each vector in the feature sequence is associated with a receptive field, and can be considered as the image descriptor for that region.
Figure 2. The receptive field. Each vector in the extracted feature sequence is associated with a receptive field on the input image, and can be considered as the feature vector of that field.
由于卷積層,最大池化層和逐元素激活函數在局部區域上執行,因此它們是平移不變的。因此,特征圖的每一列對應于原始圖像的一個矩形區域(稱為感受野),并且這些矩形區域與特征圖上對應的列從左到右保持相同的順序。如圖2所示,特征序列中的每個向量關聯一個感受野,并且可以被認為是該區域的圖像描述符。
圖2。感受野。提取的特征序列中的每一個向量關聯輸入圖像的一個感受野,可認為是該區域的特征向量。
Being robust, rich and trainable, deep convolutional features have been widely adopted for different kinds of visual recognition tasks [25, 12]. Some previous approaches have employed CNN to learn a robust representation for sequence-like objects such as scene text [22]. However, these approaches usually extract holistic representation of the whole image by CNN, then the local deep features are collected for recognizing each component of a sequence-like object. Since CNN requires the input images to be scaled to a fixed size in order to satisfy with its fixed input dimension, it is not appropriate for sequence-like objects due to their large length variation. In CRNN, we convey deep features into sequential representations in order to be invariant to the length variation of sequence-like objects.
魯棒的,豐富的和可訓練的深度卷積特征已被廣泛應用于各種視覺識別任務[25,12]。一些以前的方法已經使用CNN來學習諸如場景文本之類的類序列對象的魯棒表示[22]。然而,這些方法通常通過CNN提取整個圖像的整體表示,然后收集局部深度特征來識別類序列對象的每個分量。由于CNN要求將輸入圖像縮放到固定尺寸以滿足其固定的輸入維度,而類序列對象的長度變化很大,因此CNN并不適合此類對象。在CRNN中,我們將深度特征傳遞到序列表示中,以便對類序列對象的長度變化保持不變。
2.2. Sequence Labeling
A deep bidirectional Recurrent Neural Network is built on the top of the convolutional layers, as the recurrent layers. The recurrent layers predict a label distribution $y_t$ for each frame $x_t$ in the feature sequence $x = x_1,...,x_T$. The advantages of the recurrent layers are three-fold. Firstly, RNN has a strong capability of capturing contextual information within a sequence. Using contextual cues for image-based sequence recognition is more stable and helpful than treating each symbol independently. Taking scene text recognition as an example, wide characters may require several successive frames to fully describe (refer to Fig. 2). Besides, some ambiguous characters are easier to distinguish when observing their contexts, e.g. it is easier to recognize “il” by contrasting the character heights than by recognizing each of them separately. Secondly, RNN can back-propagates error differentials to its input, i.e. the convolutional layer, allowing us to jointly train the recurrent layers and the convolutional layers in a unified network. Thirdly, RNN is able to operate on sequences of arbitrary lengths, traversing from starts to ends.
2.2. 序列標注
在卷積層之上構建了一個深度雙向循環神經網絡,作為循環層。循環層預測特征序列$x = x_1,...,x_T$中每一幀$x_t$的標簽分布$y_t$。循環層有三方面的優點。首先,RNN具有很強的捕獲序列內上下文信息的能力。在基于圖像的序列識別中,使用上下文線索比獨立處理每個符號更穩定,更有效。以場景文本識別為例,寬字符可能需要若干連續的幀才能完整描述(參見圖2)。此外,一些有歧義的字符在結合上下文觀察時更容易區分,例如,通過對比字符高度來識別“il”,比單獨識別其中的每一個更容易。其次,RNN可以將誤差反向傳播到其輸入,即卷積層,從而允許我們在統一的網絡中聯合訓練循環層和卷積層。第三,RNN能夠對任意長度的序列進行操作,從頭遍歷到尾。
A traditional RNN unit has a self-connected hidden layer between its input and output layers. Each time it receives a frame $x_t$ in the sequence, it updates its internal state $h_t$ with a non-linear function that takes both current input $x_t$ and past state $h_{t-1}$ as its inputs: $h_t = g(x_t, h_{t-1})$. Then the prediction $y_t$ is made based on $h_t$. In this way, past contexts $\lbrace x_{t\prime} \rbrace _{t \prime < t}$ are captured and utilized for prediction. Traditional RNN unit, however, suffers from the vanishing gradient problem [7], which limits the range of context it can store, and adds burden to the training process. Long-Short Term Memory [18, 11] (LSTM) is a type of RNN unit that is specially designed to address this problem. An LSTM (illustrated in Fig. 3) consists of a memory cell and three multiplicative gates, namely the input, output and forget gates. Conceptually, the memory cell stores the past contexts, and the input and output gates allow the cell to store contexts for a long period of time. Meanwhile, the memory in the cell can be cleared by the forget gate. The special design of LSTM allows it to capture long-range dependencies, which often occur in image-based sequences.
Figure 3. (a) The structure of a basic LSTM unit. An LSTM consists of a cell module and three gates, namely the input gate, the output gate and the forget gate. (b) The structure of deep bidirectional LSTM we use in our paper. Combining a forward (left to right) and a backward (right to left) LSTMs results in a bidirectional LSTM. Stacking multiple bidirectional LSTM results in a deep bidirectional LSTM.
傳統的RNN單元在其輸入和輸出層之間具有自連接的隱藏層。每次接收到序列中的一幀$x_t$時,它使用一個非線性函數來更新其內部狀態$h_t$,該非線性函數同時接收當前輸入$x_t$和過去狀態$h_{t-1}$作為輸入:$h_t = g(x_t, h_{t-1})$。然后基于$h_t$做出預測$y_t$。以這種方式,過去的上下文$\lbrace x_{t\prime} \rbrace _{t \prime < t}$被捕獲并用于預測。然而,傳統的RNN單元存在梯度消失問題[7],這限制了它可以存儲的上下文范圍,并給訓練過程增加了負擔。長短時記憶[18,11](LSTM)是一種專門設計用于解決這個問題的RNN單元。LSTM(如圖3所示)由一個存儲單元和三個乘法門組成,即輸入門,輸出門和遺忘門。在概念上,存儲單元存儲過去的上下文,輸入門和輸出門允許單元長時間地存儲上下文。同時,單元中的記憶可以被遺忘門清除。LSTM的特殊設計使其能夠捕獲在基于圖像的序列中經常出現的長距離依賴。
圖3。(a) 基本的LSTM單元的結構。LSTM包括一個單元模塊和三個門,即輸入門,輸出門和遺忘門。(b) 我們論文中使用的深度雙向LSTM結構。將一個前向(從左到右)LSTM和一個后向(從右到左)LSTM組合得到雙向LSTM。堆疊多個雙向LSTM得到深度雙向LSTM。
LSTM is directional, it only uses past contexts. However, in image-based sequences, contexts from both directions are useful and complementary to each other. Therefore, we follow [17] and combine two LSTMs, one forward and one backward, into a bidirectional LSTM. Furthermore, multiple bidirectional LSTMs can be stacked, resulting in a deep bidirectional LSTM as illustrated in Fig. 3.b. The deep structure allows higher level of abstractions than a shallow one, and has achieved significant performance improvements in the task of speech recognition [17].
LSTM是有方向的,它只使用過去的上下文。然而,在基于圖像的序列中,兩個方向的上下文都是有用且互補的。因此,我們遵循[17],將一個前向LSTM和一個后向LSTM組合成一個雙向LSTM。此外,可以堆疊多個雙向LSTM,得到如圖3.b所示的深度雙向LSTM。深層結構比淺層結構能夠進行更高層次的抽象,并且在語音識別任務中取得了顯著的性能改進[17]。
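下面是一個用PyTorch寫的循環層簡化示意,展示如何用兩層雙向LSTM對特征序列的每一幀預測標簽分布(輸入維度512,隱狀態維度256,類別數37均為假設的示例值,僅作說明,并非論文的準確配置):

```python
import torch
import torch.nn as nn

class SequenceLabeler(nn.Module):
    """A 2-layer deep bidirectional LSTM that predicts per-frame label scores."""
    def __init__(self, input_dim=512, hidden_dim=256, num_classes=37):
        super().__init__()
        # bidirectional=True combines a forward and a backward LSTM;
        # num_layers=2 stacks two bidirectional layers into a deep BLSTM.
        self.blstm = nn.LSTM(input_dim, hidden_dim, num_layers=2,
                             bidirectional=True)
        self.embedding = nn.Linear(hidden_dim * 2, num_classes)

    def forward(self, x):                 # x: (T, batch, input_dim)
        recurrent, _ = self.blstm(x)      # (T, batch, 2 * hidden_dim)
        return self.embedding(recurrent)  # per-frame scores over the label set

# A 25-frame sequence of 512-dim feature vectors, batch size 1.
y = SequenceLabeler()(torch.randn(25, 1, 512))
print(y.shape)  # torch.Size([25, 1, 37])
```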
In recurrent layers, error differentials are propagated in the opposite directions of the arrows shown in Fig. 3.b, i.e. Back-Propagation Through Time (BPTT). At the bottom of the recurrent layers, the sequence of propagated differentials are concatenated into maps, inverting the operation of converting feature maps into feature sequences, and fed back to the convolutional layers. In practice, we create a custom network layer, called “Map-to-Sequence”, as the bridge between convolutional layers and recurrent layers.
在循環層中,誤差沿圖3.b所示箭頭的相反方向傳播,即隨時間反向傳播(BPTT)。在循環層的底部,傳播回來的誤差序列被拼接成特征圖(即將特征圖轉換為特征序列這一操作的逆操作),并反饋給卷積層。在實現中,我們創建了一個稱為“Map-to-Sequence”的自定義網絡層,作為卷積層和循環層之間的橋梁。
2.3. Transcription
Transcription is the process of converting the per-frame predictions made by RNN into a label sequence. Mathematically, transcription is to find the label sequence with the highest probability conditioned on the per-frame predictions. In practice, there exists two modes of transcription, namely the lexicon-free and lexicon-based transcriptions. A lexicon is a set of label sequences that prediction is constraint to, e.g. a spell checking dictionary. In lexicon-free mode, predictions are made without any lexicon. In lexicon-based mode, predictions are made by choosing the label sequence that has the highest probability.
2.3. 轉錄
轉錄是將RNN所做的每幀預測轉換成標簽序列的過程。數學上,轉錄是在每幀預測的條件下找到具有最高概率的標簽序列。在實踐中,存在兩種轉錄模式,即無詞典轉錄和基于詞典的轉錄。詞典是一組約束預測結果的標簽序列,例如拼寫檢查字典。在無詞典模式中,預測時不使用任何詞典。在基于詞典的模式中,通過選擇具有最高概率的標簽序列進行預測。
2.3.1 Probability of label sequence
We adopt the conditional probability defined in the Connectionist Temporal Classification (CTC) layer proposed by Graves et al. [15]. The probability is defined for label sequence $l$ conditioned on the per-frame predictions $y=y_1,...,y_T$, and it ignores the position where each label in $l$ is located. Consequently, when we use the negative log-likelihood of this probability as the objective to train the network, we only need images and their corresponding label sequences, avoiding the labor of labeling positions of individual characters.
2.3.1 標簽序列的概率
我們采用Graves等人[15]提出的聯接時間分類(CTC)層中定義的條件概率。按照每幀預測$y=y_1,...,y_T$對標簽序列$l$定義概率,并忽略$l$中每個標簽所在的位置。因此,當我們使用這種概率的負對數似然作為訓練網絡的目標函數時,我們只需要圖像及其相應的標簽序列,避免了標注單個字符位置的勞動。
The formulation of the conditional probability is briefly described as follows: The input is a sequence $y = y_1,...,y_T$ where $T$ is the sequence length. Here, each $y_t \in\Re^{|{\cal L}'|}$ is a probability distribution over the set ${\cal L}' = {\cal L} \cup \lbrace\text{-}\rbrace$, where ${\cal L}$ contains all labels in the task (e.g. all English characters), and '-' denotes the 'blank' label. A sequence-to-sequence mapping function ${\cal B}$ is defined on sequence $\boldsymbol{\pi}\in{\cal L}'^{T}$, where $T$ is the length. ${\cal B}$ maps $\boldsymbol{\pi}$ onto $\mathbf{l}$ by firstly removing the repeated labels, then removing the 'blank's. For example, ${\cal B}$ maps “--hh-e-l-ll-oo--” ('-' represents 'blank') onto “hello”. Then, the conditional probability is defined as the sum of probabilities of all $\boldsymbol{\pi}$ that are mapped by ${\cal B}$ onto $\mathbf{l}$:
$$
\begin{equation}
p(\mathbf{l}|\mathbf{y})=\sum_{\boldsymbol{\pi}:{\cal B}(\boldsymbol{\pi})=\mathbf{l}}p(\boldsymbol{\pi}|\mathbf{y}),\tag{1}
\end{equation}
$$
where the probability of $\boldsymbol{\pi}$ is defined as $p(\boldsymbol{\pi}|\mathbf{y})=\prod_{t=1}^{T}y_{\pi_{t}}^{t}$, and $y_{\pi_{t}}^{t}$ is the probability of having label $\pi_{t}$ at time stamp $t$. Directly computing Eq.1 would be computationally infeasible due to the exponentially large number of summation items. However, Eq.1 can be efficiently computed using the forward-backward algorithm described in [15].
條件概率的公式簡要描述如下:輸入是序列$y = y_1,...,y_T$,其中$T$是序列長度。這里,每個$y_t \in\Re^{|{\cal L}'|}$是在集合${\cal L}' = {\cal L} \cup \lbrace\text{-}\rbrace$上的概率分布,其中${\cal L}$包含了任務中的所有標簽(例如,所有英文字符),'-'表示“空白”標簽。序列到序列的映射函數${\cal B}$定義在序列$\boldsymbol{\pi}\in{\cal L}'^{T}$上,其中$T$是長度。${\cal B}$將$\boldsymbol{\pi}$映射到$\mathbf{l}$上:首先刪除重復的標簽,然后刪除“空白”。例如,${\cal B}$將“--hh-e-l-ll-oo--”('-'表示“空白”)映射到“hello”。然后,條件概率被定義為由${\cal B}$映射到$\mathbf{l}$上的所有$\boldsymbol{\pi}$的概率之和:
$$
\begin{equation}
p(\mathbf{l}|\mathbf{y})=\sum_{\boldsymbol{\pi}:{\cal B}(\boldsymbol{\pi})=\mathbf{l}}p(\boldsymbol{\pi}|\mathbf{y}),\tag{1}
\end{equation}
$$
$\boldsymbol{\pi}$的概率定義為$p(\boldsymbol{\pi}|\mathbf{y})=\prod_{t=1}^{T}y_{\pi_{t}}^{t}$,其中$y_{\pi_{t}}^{t}$是時刻$t$具有標簽$\pi_{t}$的概率。由于求和項的數量是指數級的,直接計算方程1在計算上不可行。然而,使用[15]中描述的前向-后向算法可以有效地計算方程1。
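為了說明映射${\cal B}$和方程1的含義,下面給出一個簡化的Python示意:${\cal B}$先合并重復標簽再刪除空白;對極小的玩具例子,可以通過窮舉所有$\boldsymbol{\pi}$直接按定義驗證方程1(實際中應使用[15]的前向-后向算法,這里的窮舉僅用于說明定義,并非可行的實現):

```python
import itertools

BLANK = '-'

def B(pi):
    """Merge repeated labels, then remove blanks, e.g. B("--hh-e-l-ll-oo--") == "hello"."""
    merged = [k for k, _ in itertools.groupby(pi)]
    return ''.join(c for c in merged if c != BLANK)

def p_label_given_y(l, y, labels):
    """Brute-force Eq. 1: sum p(pi|y) over all pi with B(pi) == l.

    y[t][c] is the per-frame probability of label c; only feasible for tiny T.
    """
    total = 0.0
    for pi in itertools.product(labels, repeat=len(y)):
        if B(pi) == l:
            p = 1.0
            for t, c in enumerate(pi):
                p *= y[t][c]
            total += p
    return total

assert B("--hh-e-l-ll-oo--") == "hello"
# Toy check with T = 3 frames over the label set {'-', 'a'}.
y = [{'-': 0.4, 'a': 0.6}, {'-': 0.5, 'a': 0.5}, {'-': 0.4, 'a': 0.6}]
print(p_label_given_y("a", y, ['-', 'a']))
```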
2.3.2 Lexicon-free transcription
In this mode, the sequence $\mathbf{l}^{*}$ that has the highest probability as defined in Eq.1 is taken as the prediction. Since there exists no tractable algorithm to precisely find the solution, we use the strategy adopted in [15]. The sequence $\mathbf{l}^{*}$ is approximately found by $\mathbf{l}^{*}\approx{\cal B}(\arg\max_{\boldsymbol{\pi}}p(\boldsymbol{\pi}|\mathbf{y}))$, i.e. taking the most probable label $\pi_{t}$ at each time stamp $t$, and map the resulted sequence onto $\mathbf{l}^{*}$.
2.3.2 無字典轉錄
在這種模式下,將具有方程1中定義的最高概率的序列$\mathbf{l}^{*}$作為預測。由于不存在可行的算法來精確求解,我們采用[15]中的策略。序列$\mathbf{l}^{*}$通過$\mathbf{l}^{*}\approx{\cal B}(\arg\max_{\boldsymbol{\pi}}p(\boldsymbol{\pi}|\mathbf{y}))$近似求得,即在每個時間戳$t$取概率最大的標簽$\pi_{t}$,并將得到的序列映射為$\mathbf{l}^{*}$。
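下面用一小段Python代碼示意這種逐幀取最大概率標簽,再用${\cal B}$映射的近似解碼(標簽集合和各幀概率都是隨意構造的玩具數據):

```python
import itertools

def B(pi, blank='-'):
    # Merge repeated labels, then remove blanks (the mapping defined in 2.3.1).
    return ''.join(c for c, _ in itertools.groupby(pi) if c != blank)

def best_path_decode(y, labels):
    """Lexicon-free transcription: take the most probable label at every
    time stamp, then map the resulting path with B."""
    path = [max(labels, key=lambda c: frame[c]) for frame in y]
    return B(path)

# Toy per-frame distributions over {'-', 'o', 'k'} for T = 4 frames.
y = [{'-': 0.1, 'o': 0.8, 'k': 0.1},
     {'-': 0.7, 'o': 0.2, 'k': 0.1},
     {'-': 0.2, 'o': 0.1, 'k': 0.7},
     {'-': 0.1, 'o': 0.2, 'k': 0.7}]
print(best_path_decode(y, ['-', 'o', 'k']))  # "ok"
```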
2.3.3 Lexicon-based transcription
In lexicon-based mode, each test sample is associated with a lexicon ${\cal D}$. Basically, the label sequence is recognized by choosing the sequence in the lexicon that has highest conditional probability defined in Eq.1, i.e. $\mathbf{l}^{*}=\arg\max_{\mathbf{l}\in{\cal D}}p(\mathbf{l}|\mathbf{y})$. However, for large lexicons, e.g. the 50k-words Hunspell spell-checking dictionary [1], it would be very time-consuming to perform an exhaustive search over the lexicon, i.e. to compute Equation.1 for all sequences in the lexicon and choose the one with the highest
probability. To solve this problem, we observe that the label sequences predicted via lexicon-free transcription, described in 2.3.2, are often close to the ground-truth under the edit distance metric. This indicates that we can limit our search to the nearest-neighbor candidates ${\cal N}_{\delta}(\mathbf{l}')$, where $\delta$ is the maximal edit distance and $\mathbf{l}'$ is the sequence transcribed from $\mathbf{y}$ in lexicon-free mode:
$$
\begin{equation}
\mathbf{l}^{*}=\arg\max_{\mathbf{l}\in{\cal N}_{\delta}(\mathbf{l}')}p(\mathbf{l}|\mathbf{y}).\tag{2}
\end{equation}
$$
2.3.3 基于詞典的轉錄
在基于詞典的模式中,每個測試樣本與一個詞典${\cal D}$相關聯。基本上,通過選擇詞典中具有方程1中定義的最高條件概率的序列來識別標簽序列,即$\mathbf{l}^{*}=\arg\max_{\mathbf{l}\in{\cal D}}p(\mathbf{l}|\mathbf{y})$。然而,對于大型詞典,例如5萬個詞的Hunspell拼寫檢查詞典[1],對詞典進行窮舉搜索是非常耗時的,即對詞典中的所有序列計算方程1,并選擇概率最高的一個。為了解決這個問題,我們觀察到,2.3.2中描述的通過無詞典轉錄預測的標簽序列在編輯距離度量下通常接近于真實結果。這表示我們可以將搜索限制在最近鄰候選${\cal N}_{\delta}(\mathbf{l}')$中,其中$\delta$是最大編輯距離,$\mathbf{l}'$是在無詞典模式下從$\mathbf{y}$轉錄的序列:
$$
\begin{equation}
\mathbf{l}^{*}=\arg\max_{\mathbf{l}\in{\cal N}_{\delta}(\mathbf{l}')}p(\mathbf{l}|\mathbf{y}).\tag{2}
\end{equation}
$$
The candidates ${\cal N}_{\delta}(\mathbf{l}')$ can be found efficiently with the BK-tree data structure[9], which is a metric tree specifically adapted to discrete metric spaces. The search time complexity of BK-tree is $O(\log|{\cal D}|)$, where $|{\cal D}|$ is the lexicon size. Therefore this scheme readily extends to very large lexicons. In our approach, a BK-tree is constructed offline for a lexicon. Then we perform fast online search with the tree, by finding sequences that have less or equal to $\delta$ edit distance to the query sequence.
可以使用BK樹數據結構[9]有效地找到候選${\cal N}_{\delta}(\mathbf{l}')$,BK樹是一種專門適用于離散度量空間的度量樹。BK樹的搜索時間復雜度為$O(\log|{\cal D}|)$,其中$|{\cal D}|$是詞典大小。因此,這個方案可以很容易地擴展到非常大的詞典。在我們的方法中,為詞典離線構建一個BK樹,然后利用該樹執行快速的在線搜索,查找與查詢序列的編輯距離小于或等于$\delta$的序列。
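下面是一個BK樹的簡化Python實現示意,用編輯距離作為度量,離線建樹,在線查找與$\mathbf{l}'$編輯距離不超過$\delta$的候選序列(詞典內容為隨意構造的示例,并非論文使用的Hunspell詞典):

```python
def edit_distance(a, b):
    """Standard dynamic-programming Levenshtein distance."""
    dp = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, cb in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1,
                                     prev + (ca != cb))
    return dp[-1]

class BKTree:
    """Minimal BK-tree over a lexicon, used to retrieve all entries within
    edit distance delta of a query sequence."""
    def __init__(self, words, dist=edit_distance):
        self.dist = dist
        it = iter(words)
        self.root = (next(it), {})      # node = (word, children keyed by distance)
        for w in it:
            self._add(w)

    def _add(self, word):
        node = self.root
        while True:
            d = self.dist(word, node[0])
            child = node[1].get(d)
            if child is None:
                node[1][d] = (word, {})
                return
            node = child

    def query(self, word, delta):
        out, stack = [], [self.root]
        while stack:
            w, children = stack.pop()
            d = self.dist(word, w)
            if d <= delta:
                out.append(w)
            # Triangle inequality: only branches with |k - d| <= delta can match.
            stack.extend(child for k, child in children.items()
                         if d - delta <= k <= d + delta)
        return out

lexicon = ["hello", "help", "hall", "yellow", "world"]
tree = BKTree(lexicon)                  # built offline, queried online
print(tree.query("helo", delta=2))      # candidates N_delta(l') for l' = "helo"
```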
2.4. Network Training
Denote the training dataset by ${\cal X}= \lbrace I_i,\mathbf{l}_i \rbrace_i$, where $I_{i}$ is the training image and $\mathbf{l}_{i}$ is the ground truth label sequence. The objective is to minimize the negative log-likelihood of conditional probability of ground truth:
$$
\begin{equation}
{\cal O}=-\sum_{I_{i},\mathbf{l}_{i}\in{\cal X}}\log p(\mathbf{l}_{i}|\mathbf{y}_{i}),\tag{3}
\end{equation}
$$
where $\mathbf{y}_{i}$ is the sequence produced by the recurrent and convolutional layers from $I_{i}$. This objective function calculates a cost value directly from an image and its ground truth label sequence. Therefore, the network can be end-to-end trained on pairs of images and sequences, eliminating the procedure of manually labeling all individual components in training images.
2.4. 網絡訓練
${\cal X}= \lbrace I_i,\mathbf{l}_i \rbrace_i$表示訓練集,$I_{i}$是訓練圖像,$\mathbf{l}_{i}$是真實的標簽序列。目標是最小化真實條件概率的負對數似然:
$$
\begin{equation}
{\cal O}=-\sum_{I_{i},\mathbf{l}_{i}\in{\cal X}}\log p(\mathbf{l}_{i}|\mathbf{y}_{i}),\tag{3}
\end{equation}
$$
$\mathbf{y}_{i}$是循環層和卷積層從$I_{i}$生成的序列。目標函數直接從圖像和它的真實標簽序列計算代價值。因此,網絡可以在成對的圖像和序列上進行端對端訓練,去除了在訓練圖像中手動標記所有單獨組件的過程。
The network is trained with stochastic gradient descent (SGD). Gradients are calculated by the back-propagation algorithm. In particular, in the transcription layer, error differentials are back-propagated with the forward-backward algorithm, as described in [15]. In the recurrent layers, the Back-Propagation Through Time (BPTT) is applied to calculate the error differentials.
網絡使用隨機梯度下降(SGD)進行訓練。梯度由反向傳播算法計算。特別地,在轉錄層中,如[15]所述,誤差使用前向-后向算法進行反向傳播。在循環層中,應用隨時間反向傳播(BPTT)來計算誤差。
For optimization, we use the ADADELTA [37] to automatically calculate per-dimension learning rates. Compared with the conventional momentum [31] method, ADADELTA requires no manual setting of a learning rate. More importantly, we find that optimization using ADADELTA converges faster than the momentum method.
為了優化,我們使用ADADELTA[37]自動計算每維的學習率。與傳統的動量[31]方法相比,ADADELTA不需要手動設置學習率。更重要的是,我們發現使用ADADELTA的優化收斂速度比動量方法快。
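下面用PyTorch給出一個簡化的訓練步驟示意,把CTC損失(對應方程3的負對數似然)和ρ=0.9的ADADELTA優化器組合起來(這里僅用一個線性層代替完整的CRNN,各維度和標簽均為假設的示例值):

```python
import torch
import torch.nn as nn

# Stand-in for the CRNN output layer: per-frame class scores from a feature sequence.
T, N, C, D = 25, 4, 37, 512            # frames, batch, classes (incl. blank), feature dim
head = nn.Linear(D, C)
features = torch.randn(T, N, D)        # would come from the conv + recurrent layers
log_probs = head(features).log_softmax(2)

# Word-level ground truth only: label indices (blank = 0 is reserved) and lengths.
targets = torch.randint(1, C, (N, 10), dtype=torch.long)
input_lengths = torch.full((N,), T, dtype=torch.long)
target_lengths = torch.full((N,), 10, dtype=torch.long)

# Eq. 3: negative log-likelihood of the ground-truth sequences under CTC.
criterion = nn.CTCLoss(blank=0)
loss = criterion(log_probs, targets, input_lengths, target_lengths)

# ADADELTA with rho = 0.9; no manually tuned learning rate is needed.
optimizer = torch.optim.Adadelta(head.parameters(), rho=0.9)
optimizer.zero_grad()
loss.backward()
optimizer.step()
```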
3. Experiments
To evaluate the effectiveness of the proposed CRNN model, we conducted experiments on standard benchmarks for scene text recognition and musical score recognition, which are both challenging vision tasks. The datasets and setting for training and testing are given in Sec. 3.1, the detailed settings of CRNN for scene text images is provided in Sec. 3.2, and the results with the comprehensive comparisons are reported in Sec. 3.3. To further demonstrate the generality of CRNN, we verify the proposed algorithm on a music score recognition task in Sec. 3.4.
3. 實驗
為了評估提出的CRNN模型的有效性,我們在場景文本識別和樂譜識別的標準基準數據集上進行了實驗,這些都是具有挑戰性的視覺任務。數據集和訓練測試的設置見3.1小節,場景文本圖像中CRNN的詳細設置見3.2小節,綜合比較的結果在3.3小節報告。為了進一步證明CRNN的泛化性,在3.4小節我們在樂譜識別任務上驗證了提出的算法。
3.1. Datasets
For all the experiments for scene text recognition, we use the synthetic dataset (Synth) released by Jaderberg et al. [20] as the training data. The dataset contains 8 millions training images and their corresponding ground truth words. Such images are generated by a synthetic text engine and are highly realistic. Our network is trained on the synthetic data once, and tested on all other real-world test datasets without any fine-tuning on their training data. Even though the CRNN model is purely trained with synthetic text data, it works well on real images from standard text recognition benchmarks.
3.1. 數據集
對于場景文本識別的所有實驗,我們使用Jaderberg等人[20]發布的合成數據集(Synth)作為訓練數據。數據集包含8百萬張訓練圖像及其對應的真實單詞標注。這些圖像由合成文本引擎生成,非常逼真。我們的網絡只在合成數據上訓練一次,然后在所有其它真實世界的測試數據集上進行測試,而沒有在它們的訓練數據上做任何微調。盡管CRNN模型完全是用合成文本數據訓練的,但它在標準文本識別基準數據集的真實圖像上工作良好。
Four popular benchmarks for scene text recognition are used for performance evaluation, namely ICDAR 2003 (IC03), ICDAR 2013 (IC13), IIIT 5k-word (IIIT5k), and Street View Text (SVT).
有四個流行的基準數據集用于場景文本識別的性能評估,即ICDAR 2003(IC03),ICDAR 2013(IC13),IIIT 5k-word(IIIT5k)和Street View Text (SVT)。
IC03 [27] test dataset contains 251 scene images with labeled text bounding boxes. Following Wang et al. [34], we ignore images that either contain non-alphanumeric characters or have less than three characters, and get a test set with 860 cropped text images. Each test image is associated with a 50-words lexicon which is defined by Wang et al. [34]. A full lexicon is built by combining all the per-image lexicons. In addition, we use a 50k words lexicon consisting of the words in the Hunspell spell-checking dictionary [1].
IC03[27]測試數據集包含251張帶有標注文本邊界框的場景圖像。按照Wang等人[34]的做法,我們忽略包含非字母數字字符或少于三個字符的圖像,得到一個包含860張裁剪文本圖像的測試集。每張測試圖像與一個由Wang等人[34]定義的50詞的詞典相關聯。通過合并所有每張圖像的詞典構建一個完整的詞典。此外,我們還使用一個由Hunspell拼寫檢查字典[1]中的單詞組成的5萬詞的詞典。
IC13 [24] test dataset inherits most of its data from IC03. It contains 1,015 ground truths cropped word images.
IC13[24]測試數據集繼承了IC03中的大部分數據。它包含1015張帶有真實標注的裁剪單詞圖像。
IIIT5k [28] contains 3,000 cropped word test images collected from the Internet. Each image has been associated to a 50-words lexicon and a 1k-words lexicon.
IIIT5k[28]包含從互聯網收集的3000張裁剪的詞測試圖像。每張圖像關聯一個50詞的詞典和一個1000詞的詞典。
SVT [34] test dataset consists of 249 street view images collected from Google Street View. From them 647 word images are cropped. Each word image has a 50 words lexicon defined by Wang et al. [34].
SVT[34]測試數據集由從Google街景視圖收集的249張街景圖像組成。從它們中裁剪出了647張詞圖像。每張單詞圖像都有一個由Wang等人[34]定義的50個詞的詞典。
3.2. Implementation Details
The network configuration we use in our experiments is summarized in Table 1. The architecture of the convolutional layers is based on the VGG-VeryDeep architectures [32]. A tweak is made in order to make it suitable for recognizing English texts. In the 3rd and the 4th max-pooling layers, we adopt 1 × 2 sized rectangular pooling windows instead of the conventional squared ones. This tweak yields feature maps with larger width, hence longer feature sequence. For example, an image containing 10 characters is typically of size 100 × 32, from which a feature sequence of 25 frames can be generated. This length exceeds the lengths of most English words. On top of that, the rectangular pooling windows yield rectangular receptive fields (illustrated in Fig. 2), which are beneficial for recognizing some characters that have narrow shapes, such as ’i’ and ’l’.
Table 1. Network configuration summary. The first row is the top layer. ‘k’, ‘s’ and ‘p’ stand for kernel size, stride and padding size respectively.
3.2. 實現細節
在實驗中我們使用的網絡配置總結在表1中。卷積層的架構基于VGG-VeryDeep架構[32]。為了使其適用于識別英文文本,我們對其做了調整。在第3和第4個最大池化層中,我們采用1×2大小的矩形池化窗口,而不是傳統的正方形窗口。這種調整產生寬度更大的特征圖,從而得到更長的特征序列。例如,包含10個字符的圖像大小通常為100×32,可以從中生成25幀的特征序列。這個長度超過了大多數英文單詞的長度。除此之外,矩形池化窗口產生矩形感受野(如圖2所示),這有助于識別一些形狀較窄的字符,例如'i'和'l'。
表1。網絡配置總結。第一行是頂層。'k','s','p'分別表示核大小,步長和填充大小。
The network not only has deep convolutional layers, but also has recurrent layers. Both are known to be hard to train. We find that the batch normalization [19] technique is extremely useful for training network of such depth. Two batch normalization layers are inserted after the 5th and 6th convolutional layers respectively. With the batch normalization layers, the training process is greatly accelerated.
網絡不僅有深度卷積層,而且還有循環層。眾所周知兩者都難以訓練。我們發現批歸一化[19]技術對于訓練這種深度網絡非常有用。分別在第5和第6卷積層之后插入兩個批歸一化層。使用批歸一化層訓練過程大大加快。
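下面是一個用PyTorch寫的卷積部分簡化示意,用來說明矩形池化如何在壓縮高度的同時保留寬度(即序列長度),以及批歸一化層的插入方式;層數,通道數等均為簡化后的假設值,并非表1的完整配置:

```python
import torch
import torch.nn as nn

# Rough sketch (not the exact Table 1 configuration): 3x3 convolutions with
# square 2x2 pooling early on, then rectangular pooling that halves the height
# while keeping the width, so the feature sequence stays long.
conv = nn.Sequential(
    nn.Conv2d(1, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2, 2),    # 32x100 -> 16x50
    nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2, 2),  # -> 8x25
    nn.Conv2d(128, 256, 3, padding=1), nn.ReLU(),
    nn.MaxPool2d(kernel_size=(2, 1), stride=(2, 1)),                  # rectangular: -> 4x25
    nn.Conv2d(256, 512, 3, padding=1), nn.ReLU(), nn.BatchNorm2d(512),
    nn.MaxPool2d(kernel_size=(2, 1), stride=(2, 1)),                  # rectangular: -> 2x25
    nn.Conv2d(512, 512, (2, 1)), nn.ReLU(), nn.BatchNorm2d(512),      # collapse height -> 1x25
)

x = torch.randn(1, 1, 32, 100)          # grayscale image scaled to height 32, width 100
f = conv(x)                             # (1, 512, 1, 25): a 25-frame feature sequence
seq = f.squeeze(2).permute(2, 0, 1)     # (T=25, batch=1, 512), input to the recurrent layers
print(seq.shape)
```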
We implement the network within the Torch7 [10] framework, with custom implementations for the LSTM units (in Torch7/CUDA), the transcription layer (in C++) and the BK-tree data structure (in C++). Experiments are carried out on a workstation with a 2.50 GHz Intel(R) Xeon(R) E5-2609 CPU, 64GB RAM and an NVIDIA(R) Tesla(TM) K40 GPU. Networks are trained with ADADELTA, setting the parameter ρ to 0.9. During training, all images are scaled to 100 × 32 in order to accelerate the training process. The training process takes about 50 hours to reach convergence. Testing images are scaled to have height 32. Widths are proportionally scaled with heights, but at least 100 pixels. The average testing time is 0.16s/sample, as measured on IC03 without a lexicon. The approximate lexicon search is applied to the 50k lexicon of IC03, with the parameter δ set to 3. Testing each sample takes 0.53s on average.
我們在Torch7[10]框架內實現了網絡,并使用定制實現的LSTM單元(Torch7/CUDA),轉錄層(C++)和BK樹數據結構(C++)。實驗在一臺配有2.50 GHz Intel(R) Xeon(R) E5-2609 CPU,64GB RAM和NVIDIA(R) Tesla(TM) K40 GPU的工作站上進行。網絡用ADADELTA訓練,參數ρ設置為0.9。在訓練期間,所有圖像都被縮放為100×32,以加快訓練過程。訓練過程大約需要50個小時才能收斂。測試圖像的高度縮放為32,寬度與高度成比例地縮放,但至少為100像素。平均測試時間為0.16s/樣本,是在不使用詞典的IC03數據集上測得的。近似詞典搜索應用于IC03的5萬詞詞典,參數δ設置為3,測試每個樣本平均花費0.53s。
3.3. Comparative Evaluation
All the recognition accuracies on the above four public datasets, obtained by the proposed CRNN model and the recent state-of-the-arts techniques including the approaches based on deep models [23, 22, 21], are shown in Table 2.
Table 2. Recognition accuracies (%) on four datasets. In the second row, “50”, “1k”, “50k” and “Full” denote the lexicon used, and “None” denotes recognition without a lexicon. *[22] is not lexicon-free in the strict sense, as its outputs are constrained to a 90k dictionary.
3.3. 比較評估
表2給出了提出的CRNN模型以及最近的最新技術(包括基于深度模型的方法[23,22,21])在上述四個公共數據集上獲得的所有識別精度。
表2。四個數據集上的識別準確率(%)。在第二行,“50”,“1k”,“50k”和“Full”表示所使用的詞典,“None”表示不使用詞典進行識別。*[22]嚴格意義上講不是無詞典的,因為它的輸出被限制在一個9萬詞的字典內。
In the constrained lexicon cases, our method consistently outperforms most state-of-the-arts approaches, and in average beats the best text reader proposed in [22]. Specifically, we obtain superior performance on IIIT5k, and SVT compared to [22], only achieved lower performance on IC03 with the “Full” lexicon. Note that the model in[22] is trained on a specific dictionary, namely that each word is associated to a class label. Unlike [22], CRNN is not limited to recognize a word in a known dictionary, and able to handle random strings (e.g. telephone numbers), sentences or other scripts like Chinese words. Therefore, the results of CRNN are competitive on all the testing datasets.
在有約束詞典的情況下,我們的方法始終優于大多數最新的方法,并且平均來看超過了[22]中提出的最佳文本閱讀器。具體來說,與[22]相比,我們在IIIT5k和SVT上獲得了更優的性能,僅在使用“Full”詞典的IC03上取得略低的性能。請注意,[22]中的模型是在特定字典上訓練的,即每個單詞都與一個類標簽相關聯。與[22]不同,CRNN不限于識別已知字典中的單詞,并且能夠處理隨機字符串(例如電話號碼),句子或其他諸如中文單詞的文字。因此,CRNN的結果在所有測試數據集上都具有競爭力。
In the unconstrained lexicon cases, our method achieves the best performance on SVT, yet, is still behind some approaches [8, 22] on IC03 and IC13. Note that the blanks in the “none” columns of Table 2 denote that such approaches are unable to be applied to recognition without lexicon or did not report the recognition accuracies in the unconstrained cases. Our method uses only synthetic text with word level labels as the training data, very different to PhotoOCR [8] which used 7.9 millions of real word images with character-level annotations for training. The best performance is reported by [22] in the unconstrained lexicon cases, benefiting from its large dictionary, however, it is not a model strictly unconstrained to a lexicon as mentioned before. In this sense, our results in the unconstrained lexicon case are still promising.
在無約束詞典的情況下,我們的方法在SVT上取得了最佳性能,但在IC03和IC13上仍然落后于一些方法[8,22]。注意,表2的“none”列中的空白表示該方法不能應用于沒有詞典的識別,或者沒有報告無約束情況下的識別精度。我們的方法只使用具有單詞級標簽的合成文本作為訓練數據,與PhotoOCR[8]非常不同,后者使用790萬張具有字符級標注的真實單詞圖像進行訓練。在無約束詞典的情況下,[22]報告了最佳性能,這得益于它的大字典,然而,如前所述,它并不是嚴格意義上不受詞典約束的模型。在這個意義上,我們在無約束詞典情況下的結果仍然是有前途的。
For further understanding the advantages of the proposed algorithm over other text recognition approaches, we provide a comprehensive comparison on several properties named E2E Train, Conv Ftrs, CharGT-Free, Unconstrained, and Model Size, as summarized in Table 3.
Table 3. Comparison among various methods. Attributes for comparison include: 1) being end-to-end trainable (E2E Train); 2) using convolutional features that are directly learned from images rather than using hand-crafted ones (Conv Ftrs); 3) requiring no ground truth bounding boxes for characters during training (CharGT-Free); 4) not confined to a pre-defined dictionary (Unconstrained); 5) the model size (if an end-to-end trainable model is used), measured by the number of model parameters (Model Size, M stands for millions).
為了進一步了解與其它文本識別方法相比,所提出算法的優點,我們提供了在一些特性上的綜合比較,這些特性名稱為E2E Train,Conv Ftrs,CharGT-Free,Unconstrained和Model Size,如表3所示。
表3。各種方法的對比。比較的屬性包括:1)端到端訓練(E2E Train);2)從圖像中直接學習卷積特征而不是使用手動設計的特征(Conv Ftrs);3)訓練期間不需要字符的實際邊界框(CharGT-Free);4)不受限于預定義字典(Unconstrained);5)模型大小(如果使用端到端模型),通過模型參數數量來衡量(Model Size, M表示百萬)。
E2E Train: This column is to show whether a certain text reading model is end-to-end trainable, without any preprocess or through several separated steps, which indicates such approaches are elegant and clean for training. As can be observed from Table 3, only the models based on deep neural networks including [22, 21] as well as CRNN have this property.
E2E Train:這一列是為了顯示某種文字閱讀模型是否可以進行端到端的訓練,無需任何預處理或經過幾個分離的步驟,這表明這種方法對于訓練是優雅且干凈的。從表3可以看出,只有基于深度神經網絡的模型,包括[22,21]以及CRNN具有這種性質。
Conv Ftrs: This column is to indicate whether an approach uses the convolutional features learned from training images directly or handcraft features as the basic representations.
Conv Ftrs:這一列用來表明一個方法是否使用從訓練圖像直接學習到的卷積特征或手動特征作為基本的表示。
CharGT-Free: This column is to indicate whether the character-level annotations are essential for training the model. As the input and output labels of CRNN can be a sequence, character-level annotations are not necessary.
CharGT-Free:這一列用來表明字符級標注對于訓練模型是否是必要的。由于CRNN的輸入和輸出標簽是序列,因此字符級標注是不必要的。
Unconstrained: This column is to indicate whether the trained model is constrained to a specific dictionary, unable to handling out-of-dictionary words or random sequences. Notice that though the recent models learned by label embedding [5, 14] and incremental learning [22] achieved highly competitive performance, they are constrained to a specific dictionary.
Unconstrained:這一列用來表明訓練模型是否受限于一個特定的字典,是否不能處理字典之外的單詞或隨機序列。注意,盡管最近通過標簽嵌入[5, 14]和增量學習[22]學習到的模型取得了非常有競爭力的性能,但它們受限于一個特定的字典。
Model Size: This column is to report the storage space of the learned model. In CRNN, all layers have weight-sharing connections, and the fully-connected layers are not needed. Consequently, the number of parameters of CRNN is much less than the models learned on the variants of CNN [22, 21], resulting in a much smaller model compared with [22, 21]. Our model has 8.3 million parameters, taking only 33MB RAM (using 4-bytes single-precision float for each parameter), thus it can be easily ported to mobile devices.
Model Size:這一列報告了學習到的模型的存儲空間。在CRNN中,所有的層都采用權重共享的連接,并且不需要全連接層。因此,CRNN的參數數量遠小于基于CNN變體[22,21]得到的模型,與[22,21]相比模型要小得多。我們的模型有830萬個參數,僅占用33MB內存(每個參數使用4字節單精度浮點數),因此可以輕松地移植到移動設備上。
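按上面的數字粗略核算一下存儲占用(每個參數按4字節單精度浮點數計):

$$
8.3\times 10^{6}\times 4\ \text{字節}\approx 3.3\times 10^{7}\ \text{字節}\approx 33\ \text{MB}
$$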
Table 3 clearly shows the differences among different approaches in details, and fully demonstrates the advantages of CRNN over other competing methods.
表3詳細列出了不同方法之間的差異,充分展示了CRNN與其它競爭方法的優勢。
In addition, to test the impact of parameter $\delta$, we experiment different values of $\delta$ in Eq.2. In Fig.4 we plot the recognition accuracy as a function of $\delta$. Larger $\delta$ results in more candidates, thus more accurate lexicon-based transcription. On the other hand, the computational cost grows with larger $\delta$, due to longer BK-tree search time, as well as larger number of candidate sequences for testing. In practice, we choose $\delta=3$ as a tradeoff between accuracy and speed.
Figure 4. Blue line graph: recognition accuracy as a function parameter $\delta$. Red bars: lexicon search time per sample. Tested on the IC03 dataset with the 50k lexicon.
另外,為了測試參數$\delta$的影響,我們在方程2中實驗了$\delta$的不同值。在圖4中,我們將識別精度繪制為$\delta$的函數。更大的$\delta$導致更多的候選目標,從而基于詞典的轉錄更準確。另一方面,由于更長的BK樹搜索時間,以及更大數量的候選序列用于測試,計算成本隨著$\delta$的增大而增加。實際上,我們選擇$\delta=3$作為精度和速度之間的折衷。
圖4。藍線圖:識別準確率作為$\delta$的函數。紅條:每個樣本的詞典搜索時間。在IC03數據集上使用50k詞典進行的測試。
3.4. Musical Score Recognition
A musical score typically consists of sequences of musical notes arranged on staff lines. Recognizing musical scores in images is known as the Optical Music Recognition (OMR) problem. Previous methods often require image preprocessing (mostly binarization), staff lines detection and individual notes recognition [29]. We cast the OMR as a sequence recognition problem, and predict a sequence of musical notes directly from the image with CRNN. For simplicity, we recognize pitches only, ignore all chords and assume the same major scale (C major) for all scores.
3.4. 樂譜識別
樂譜通常由排列在五線譜上的音符序列組成。識別圖像中的樂譜被稱為光學音樂識別(OMR)問題。以前的方法通常需要圖像預處理(主要是二值化),五線譜檢測和單個音符識別[29]。我們將OMR視為序列識別問題,直接用CRNN從圖像中預測音符的序列。為了簡單起見,我們僅識別音高,忽略所有和弦,并假定所有樂譜具有相同的大調音階(C大調)。
To the best of our knowledge, there exists no public datasets for evaluating algorithms on pitch recognition. To prepare the training data needed by CRNN, we collect 2650 images from [2]. Each image contains a fragment of score containing 3 to 20 notes. We manually label the ground truth label sequences (sequences of note pitches) for all the images. The collected images are augmented to 265k training samples by being rotated, scaled and corrupted with noise, and by replacing their backgrounds with natural images. For testing, we create three datasets: 1) “Clean”, which contains 260 images collected from [2]. Examples are shown in Fig. 5.a; 2) “Synthesized”, which is created from “Clean”, using the augmentation strategy mentioned above. It contains 200 samples, some of which are shown in Fig. 5.b; 3) “Real-World”, which contains 200 images of score fragments taken from music books with a phone camera. Examples are shown in Fig. 5.c.
Figure 5. (a) Clean musical scores images collected from [2] (b) Synthesized musical score images. (c) Real-world score images taken with a mobile phone camera.
據我們所知,目前沒有用于評估音高識別算法的公共數據集。為了準備CRNN所需的訓練數據,我們從[2]中收集了2650張圖像。每張圖像包含一個含有3到20個音符的樂譜片段。我們手動標注了所有圖像的真實標簽序列(即音符音高的序列)。通過旋轉,縮放,添加噪聲以及用自然圖像替換背景,收集到的圖像被擴增為265k個訓練樣本。對于測試,我們創建了三個數據集:1)“純凈的”,其中包含從[2]收集的260張圖像,實例如圖5.a所示;2)“合成的”,使用上述的增強策略從“純凈的”創建而來,它包含200個樣本,其中一些如圖5.b所示;3)“現實世界”,其中包含用手機相機從音樂書籍中拍攝的200張樂譜片段圖像,例子如圖5.c所示。
圖5。(a)從[2]中收集的干凈的樂譜圖像。(b)合成的樂譜圖像。(c)用手機相機拍攝的現實世界的樂譜圖像。
Since we have limited training data, we use a simplified CRNN configuration in order to reduce model capacity. Different from the configuration specified in Tab. 1, the 4th and 6th convolution layers are removed, and the 2-layer bidirectional LSTM is replaced by a 2-layer single directional LSTM. The network is trained on the pairs of images and corresponding label sequences. Two measures are used for evaluating the recognition performance: 1) fragment accuracy, i.e. the percentage of score fragments correctly recognized; 2) average edit distance, i.e. the average edit distance between predicted pitch sequences and the ground truths. For comparison, we evaluate two commercial OMR engines, namely the Capella Scan [3] and the PhotoScore [4].
由于我們的訓練數據有限,因此我們使用簡化的CRNN配置來減少模型容量。與表1中指定的配置不同,我們移除了第4和第6個卷積層,將2層雙向LSTM替換為2層單向LSTM。網絡在圖像與對應標簽序列的成對數據上進行訓練。使用兩種度量來評估識別性能:1)片段準確度,即正確識別的樂譜片段的百分比;2)平均編輯距離,即預測音高序列與真實值之間的平均編輯距離。為了比較,我們評估了兩種商用OMR引擎,即Capella Scan[3]和PhotoScore[4]。
Tab. 4 summarizes the results. The CRNN outperforms the two commercial systems by a large margin. The Capella Scan and PhotoScore systems perform reasonably well on the Clean dataset, but their performances drop significantly on synthesized and real-world data. The main reason is that they rely on robust binarization to detect staff lines and notes, but the binarization step often fails on synthesized and real-world data due to bad lighting condition, noise corruption and cluttered background. The CRNN, on the other hand, uses convolutional features that are highly robust to noises and distortions. Besides, recurrent layers in CRNN can utilize contextual information in the score. Each note is recognized not only itself, but also by the nearby notes. Consequently, some notes can be recognized by comparing them with the nearby notes, e.g. contrasting their vertical positions.
Table 4. Comparison of pitch recognition accuracies, among CRNN and two commercial OMR systems, on the three datasets we have collected. Performances are evaluated by fragment accuracies and average edit distance (“fragment accuracy/average edit distance”).
表4總結了結果。CRNN大大優于兩個商業系統。Capella Scan和PhotoScore系統在干凈的數據集上表現相當不錯,但是它們的性能在合成數據和現實世界數據上顯著下降。主要原因是它們依賴于魯棒的二值化來檢測五線譜和音符,但是由于光線不良,噪聲破壞和雜亂的背景,二值化步驟經常在合成數據和現實數據上失敗。另一方面,CRNN使用的卷積特征對噪聲和扭曲具有很強的魯棒性。此外,CRNN中的循環層可以利用樂譜中的上下文信息。每個音符不僅依據其自身被識別,還會參考附近的音符。因此,一些音符可以通過與附近的音符進行比較而被識別,例如對比它們的垂直位置。
表4。在我們收集的數據集上,CRNN和兩個商業OMR系統的音高識別準確率對比。通過片段準確率和平均編輯距離(“片段準確率/平均編輯距離”)來評估性能。
The results have shown the generality of CRNN, in that it can be readily applied to other image-based sequence recognition problems, requiring minimal domain knowledge. Compared with Capella Scan and PhotoScore, our CRNN-based system is still preliminary and misses many functionalities. But it provides a new scheme for OMR, and has shown promising capabilities in pitch recognition.
結果顯示了CRNN的泛化性,因為它可以很容易地應用于其它的基于圖像的序列識別問題,需要極少的領域知識。與Capella Scan和PhotoScore相比,我們的基于CRNN的系統仍然是初步的,并且缺少許多功能。但它為OMR提供了一個新的方案,并且在音高識別方面表現出有前途的能力。
4. Conclusion
In this paper, we have presented a novel neural network architecture, called Convolutional Recurrent Neural Network (CRNN), which integrates the advantages of both Convolutional Neural Networks (CNN) and Recurrent Neural Networks (RNN). CRNN is able to take input images of varying dimensions and produces predictions with different lengths. It directly runs on coarse level labels (e.g. words), requiring no detailed annotations for each individual element (e.g. characters) in the training phase. Moreover, as CRNN abandons fully connected layers used in conventional neural networks, it results in a much more compact and efficient model. All these properties make CRNN an excellent approach for image-based sequence recognition.
4. 總結
在本文中,我們提出了一種新穎的神經網絡架構,稱為卷積循環神經網絡(CRNN),它集成了卷積神經網絡(CNN)和循環神經網絡(RNN)的優點。CRNN能夠接受不同尺寸的輸入圖像,并產生不同長度的預測。它直接在粗粒度的標簽(例如單詞)上運行,在訓練階段不需要對每一個單獨的元素(例如字符)進行詳細標注。此外,由于CRNN放棄了傳統神經網絡中使用的全連接層,因此得到了更加緊湊和高效的模型。所有這些屬性使得CRNN成為一種用于基于圖像的序列識別的極好方法。
The experiments on the scene text recognition benchmarks demonstrate that CRNN achieves superior or highly competitive performance, compared with conventional methods as well as other CNN and RNN based algorithms. This confirms the advantages of the proposed algorithm. In addition, CRNN significantly outperforms other competitors on a benchmark for Optical Music Recognition (OMR), which verifies the generality of CRNN.
在場景文本識別基準數據集上的實驗表明,與傳統方法以及其它基于CNN和RNN的算法相比,CRNN實現了優異或極具競爭力的性能。這證實了所提出的算法的優點。此外,CRNN在光學音樂識別(OMR)的基準數據集上顯著優于其它的競爭者,這驗證了CRNN的泛化性。
Actually, CRNN is a general framework, thus it can be applied to other domains and problems (such as Chinese character recognition), which involve sequence prediction in images. To further speed up CRNN and make it more practical in real-world applications is another direction that is worthy of exploration in the future.
實際上,CRNN是一個通用框架,因此可以應用于其它的涉及圖像序列預測的領域和問題(如漢字識別)。進一步加快CRNN,使其在現實應用中更加實用,是未來值得探索的另一個方向。
Acknowledgement
This work was primarily supported by National Natural Science Foundation of China (NSFC) (No. 61222308).
致謝
這項工作主要由中國國家自然科學基金(NSFC)(No. 61222308)資助。
References
[1] http://hunspell.sourceforge.net/. 4, 5
[2] https://musescore.com/sheetmusic. 7, 8
[3] http://www.capella.de/us/index.cfm/products/capella-scan/info-capella-scan/. 8
[4] http://www.sibelius.com/products/photoscore/ultimate.html. 8
[5] J. Almazán, A. Gordo, A. Fornés, and E. Valveny. Word spotting and recognition with embedded attributes. PAMI, 36(12):2552–2566, 2014. 2, 6, 7
[6] O. Alsharif and J. Pineau. End-to-end text recognition with hybrid HMM maxout models. ICLR, 2014. 6, 7
[7] Y. Bengio, P. Y. Simard, and P. Frasconi. Learning long-term dependencies with gradient descent is difficult. NN, 5(2):157–166, 1994. 3
[8] A. Bissacco, M. Cummins, Y. Netzer, and H. Neven. Photoocr: Reading text in uncontrolled conditions. In ICCV, 2013. 1, 2, 6, 7
[9] W. A. Burkhard and R. M. Keller. Some approaches to best-match file searching. Commun. ACM, 16(4):230–236, 1973. 4
[10] R. Collobert, K. Kavukcuoglu, and C. Farabet. Torch7: A matlab-like environment for machine learning. In BigLearn, NIPS Workshop, 2011. 6
[11] F. A. Gers, N. N. Schraudolph, and J. Schmidhuber. Learning precise timing with LSTM recurrent networks. JMLR, 3:115–143, 2002. 3
[12] R. B. Girshick, J. Donahue, T. Darrell, and J. Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In CVPR, 2014. 1, 3
[13] V. Goel, A. Mishra, K. Alahari, and C. V. Jawahar. Whole is greater than sum of parts: Recognizing scene text words. In ICDAR, 2013. 6, 7
[14] A. Gordo. Supervised mid-level features for word image representation. In CVPR, 2015. 2, 6, 7
[15] A. Graves, S. Fernández, F. J. Gomez, and J. Schmidhuber. Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks. In ICML, 2006. 4, 5
[16] A. Graves, M. Liwicki, S. Fernandez, R. Bertolami, H. Bunke, and J. Schmidhuber. A novel connectionist system for unconstrained handwriting recognition. PAMI, 31(5):855–868, 2009. 2
[17] A. Graves, A. Mohamed, and G. E. Hinton. Speech recognition with deep recurrent neural networks. In ICASSP, 2013. 3
[18] S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997. 3
[19] S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In ICML, 2015. 6
[20] M. Jaderberg, K. Simonyan, A. Vedaldi, and A. Zisserman. Synthetic data and artificial neural networks for natural scene text recognition. NIPS Deep Learning Workshop, 2014. 5
[21] M. Jaderberg, K. Simonyan, A. Vedaldi, and A. Zisserman. Deep structured output learning for unconstrained text recognition. In ICLR, 2015. 6, 7
[22] M. Jaderberg, K. Simonyan, A. Vedaldi, and A. Zisserman. Reading text in the wild with convolutional neural networks. IJCV (Accepted), 2015. 1, 2, 3, 6, 7
[23] M. Jaderberg, A. Vedaldi, and A. Zisserman. Deep features for text spotting. In ECCV, 2014. 2, 6, 7
[24] D. Karatzas, F. Shafait, S. Uchida, M. Iwamura, L. G. i Bigorda, S. R. Mestre, J. Mas, D. F. Mota, J. Almazán, and L. de las Heras. ICDAR 2013 robust reading competition. In ICDAR, 2013. 5
[25] A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification with deep convolutional neural networks. In NIPS, 2012. 1, 3
[26] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998. 1
[27] S. M. Lucas, A. Panaretos, L. Sosa, A. Tang, S. Wong, R. Young, K. Ashida, H. Nagai, M. Okamoto, H. Yamamoto, H. Miyao, J. Zhu, W. Ou, C. Wolf, J. Jolion, L. Todoran, M. Worring, and X. Lin. ICDAR 2003 robust reading competitions: entries, results, and future directions. IJDAR, 7(2-3):105–122, 2005. 5
[28] A. Mishra, K. Alahari, and C. V. Jawahar. Scene text recognition using higher order language priors. In BMVC, 2012. 5, 6, 7
[29] A. Rebelo, I. Fujinaga, F. Paszkiewicz, A. R. S. Marçal, C. Guedes, and J. S. Cardoso. Optical music recognition: state-of-the-art and open issues. IJMIR, 1(3):173–190, 2012. 7
[30] J. A. Rodríguez-Serrano, A. Gordo, and F. Perronnin. Label embedding: A frugal baseline for text recognition. IJCV, 113(3):193–207, 2015. 2, 6, 7
[31] D. E. Rumelhart, G. E. Hinton, and R. J. Williams. Neurocomputing: Foundations of research. chapter Learning Representations by Back-propagating Errors, pages 696–699. MIT Press, 1988. 5
[32] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. CoRR, abs/1409.1556, 2014. 5
[33] B. Su and S. Lu. Accurate scene text recognition based on recurrent neural network. In ACCV, 2014. 2, 6, 7
[34] K. Wang, B. Babenko, and S. Belongie. End-to-end scene text recognition. In ICCV, 2011. 5, 6, 7
[35] T. Wang, D. J. Wu, A. Coates, and A. Y. Ng. End-to-end text recognition with convolutional neural networks. In ICPR, 2012. 1, 6, 7
[36] C. Yao, X. Bai, B. Shi, and W. Liu. Strokelets: A learned multi-scale representation for scene text recognition. In CVPR, 2014. 2, 6, 7
[37] M. D. Zeiler. ADADELTA: an adaptive learning rate method. CoRR, abs/1212.5701, 2012. 5