Title: Fully-Convolutional Siamese Networks for Object Tracking
Source: ECCV 2016 Workshops
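Project page (with MATLAB code): http://www.robots.ox.ac.uk/~luca/siamese-fc.html
Contribution: the network is very simple, yet it runs in real time.
Drawbacks of Siamese trackers:
1. Because a Siamese tracker is a template-matching method, tracking fails when the target changes abruptly or moves outside the bounding box (image).
2. With heavy background clutter, i.e. when there are many similar-looking objects, tracking performs poorly.
Abstract: Object tracking has traditionally been tackled by learning a model of the object's appearance online, using the video itself as training data. Despite the success of these methods, their online-only approach limits the richness of the model they can learn. Recently, deep convolutional networks have been brought in to learn richer models. However, when the object to track is not known beforehand, it is necessary to run stochastic gradient descent online to adapt the weights of the network, which severely compromises the speed of the system.
In this paper we equip a basic tracking algorithm with a novel fully-convolutional Siamese network trained end-to-end on the ILSVRC15 dataset for object detection in video. Our tracker operates at frame-rates beyond real-time and, despite its extreme simplicity, achieves state-of-the-art performance in multiple benchmarks.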
1 Introduction
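In single-object tracking, the tracker may be asked to follow an arbitrary object, so it is practically impossible to have collected data in advance and trained a dedicated detector. For years, the most successful paradigm has been to learn a model of the object's appearance online, with methods such as TLD, Struck and KCF. A major limitation, however, is that the models learnt this way are too simple. Using deep conv-nets instead runs into difficulties.
There are two such difficulties: the scarcity of training data and the constraint of real-time operation.
Several works try to overcome these two limitations. They mainly use a pre-trained deep conv-net that was learnt for a different but related task, and fall into two kinds.
1. Shallow methods
apply shallow methods (e.g. correlation filters) using the network's internal representation as features
2. SGD
perform SGD (stochastic gradient descent) to fine-tune multiple layers of the network
Shortcomings of each: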
While the use of shallow methods does not take full advantage of the benefits of end-to-end learning, methods that apply SGD during tracking to achieve state-of-the-art results have not been able to operate in real-time.
So what approach does the paper take?
a deep conv-net is trained to address a more general similarity learning problem in an initial offline phase, and then this function is simply evaluated online during tracking.
The contributions of the paper:
The key contribution of this paper is to demonstrate that this approach achieves very competitive performance in modern tracking benchmarks at speeds that far exceed the frame-rate requirement. Specifically, we train a Siamese network to locate an exemplar image within a larger search image.
A further contribution is a novel Siamese architecture that is fully-convolutional with respect to the search image: dense and efficient sliding-window evaluation is achieved with a bilinear layer that computes the cross-correlation of its two inputs.
2 Deep similarity learning for tracking
Tracking of arbitrary objects is addressed with similarity learning.
We propose to learn a function f (z, x) that compares an exemplar image z to a candidate image x of the same size and returns a high score if the two images depict the same object and a low score otherwise. To find the position of the object in a new image, we can then exhaustively test all possible locations and choose the candidate with the maximum similarity to the past appearance of the object. In experiments, we will simply use the initial appearance of the object as the exemplar. The function f will be learnt from a dataset of videos with labelled object trajectories.
The function f used here is a deep conv-net. Similarity learning with deep conv-nets is typically addressed using Siamese architectures, which are the core of the network design.
So what is a Siamese network?
Siamese networks apply an identical transformation φ to both inputs (the same φ for both input images) and then combine their representations using another function g according to f (z, x) = g(φ(z), φ(x)). When the function g is a simple distance or similarity metric, the function φ can be considered an embedding.
[Figure: the Siamese architecture]
2.1 Fully-convolutional Siamese architecture
Similarity learning with deep conv-nets is realized through the Siamese architecture. Next comes the fully-convolutional architecture.
Definition of a fully-convolutional function:
1. We say that a function is fully-convolutional if it commutes with translation.
2. To give a more precise definition, introducing $L_\tau$ to denote the translation operator $(L_\tau x)[u] = x[u - \tau]$, a function $h$ that maps signals to signals is fully-convolutional with integer stride $k$ if
$$h(L_{k\tau} x) = L_\tau h(x) \tag{1}$$
for any translation $\tau$. (When $x$ is a finite signal, this only need hold for the valid region of the output.)
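As a quick numerical check of eq. 1, the sketch below uses a 1-D strided cross-correlation as a stand-in for h. The filter, the signal length and the values of k and τ are arbitrary choices made for illustration, not values from the paper.

```python
import numpy as np

def strided_correlate(x, w, k):
    """Valid 1-D cross-correlation of x with filter w, subsampled with stride k."""
    full = np.correlate(x, w, mode="valid")
    return full[::k]

rng = np.random.default_rng(0)
x = rng.standard_normal(64)   # arbitrary test signal
w = rng.standard_normal(5)    # arbitrary filter
k, tau = 2, 3                 # stride and output-space translation (illustrative)

h_x = strided_correlate(x, w, k)

# Translate the input by k * tau samples: (L_{k tau} x)[u] = x[u - k tau].
x_shifted = np.roll(x, k * tau)
h_shifted = strided_correlate(x_shifted, w, k)

# On the valid region, h(L_{k tau} x) should equal L_tau h(x).
commutes = np.allclose(h_shifted[tau:], h_x[:-tau])
print(commutes)
```

The first τ output samples are excluded because np.roll wraps the signal around, which is the finite-signal caveat in the parenthesis above.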
Advantage of a fully-convolutional network:
Instead of a candidate image of the same size, a much larger search image can be given to the network as input, and the similarity at all translated sub-windows on a dense grid is computed in a single evaluation.
To achieve this:
use a convolutional embedding function φ and combine the resulting feature maps using a cross-correlation layer
$$f(z, x) = \varphi(z) * \varphi(x) + b\,\mathbb{1}, \tag{2}$$
where $b\,\mathbb{1}$ denotes a signal which takes value $b \in \mathbb{R}$ in every location. The output of this network is not a single score but rather a score map defined on a finite grid $D \subset \mathbb{Z}^2$ as illustrated in Figure 1. Note that the output of the embedding function is a feature map with spatial support as opposed to a plain vector.
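During tracking, we use a search image centred on the previous position of the target. The position of the maximum score relative to the centre of the score map, multiplied by the stride of the network, gives the displacement of the target from frame to frame. Multiple scales are searched in a single forward pass by assembling a mini-batch of scaled images.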
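The peak-to-displacement step can be sketched as follows. The function name and the 17 x 17 map size are assumptions made for illustration; the stride of 8 is the value quoted in the network description. The result is in pixels of the (rescaled) search image, so it would still need to be mapped back to frame coordinates.

```python
import numpy as np

def peak_displacement(score_map, stride=8):
    """Return the (dy, dx) offset, in search-image pixels, implied by the score-map peak."""
    peak = np.unravel_index(np.argmax(score_map), score_map.shape)
    centre = (np.array(score_map.shape) - 1) / 2.0
    return (np.array(peak) - centre) * stride

# Hypothetical score map with a single peak two cells below and two cells left of centre.
score_map = np.zeros((17, 17))
score_map[10, 6] = 1.0
dy, dx = peak_displacement(score_map)
print(dy, dx)
```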
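Combining feature maps using cross-correlation and evaluating the network once on the larger search image is mathematically equivalent to combining feature maps using the inner product and evaluating the network on each translated sub-window independently. This is useful during both training and testing.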
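The equivalence can be checked numerically with a small NumPy sketch. Random arrays stand in for the feature maps φ(z) and φ(x); their sizes (6 x 6 and 22 x 22) and the channel count are assumptions made for illustration, and both function names are hypothetical.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def score_map_dense(phi_z, phi_x, b=0.0):
    """Cross-correlate phi(z) with phi(x) over all translations in one pass."""
    # windows has shape (rows, cols, hz, wz, channels) after dropping the singleton axis.
    windows = sliding_window_view(phi_x, phi_z.shape)[:, :, 0]
    return np.einsum("ijhwc,hwc->ij", windows, phi_z) + b

def score_map_per_window(phi_z, phi_x, b=0.0):
    """Score each translated sub-window independently with an inner product."""
    hz, wz, _ = phi_z.shape
    rows = phi_x.shape[0] - hz + 1
    cols = phi_x.shape[1] - wz + 1
    out = np.empty((rows, cols))
    for i in range(rows):
        for j in range(cols):
            window = phi_x[i:i + hz, j:j + wz, :]
            out[i, j] = np.sum(phi_z * window) + b
    return out

rng = np.random.default_rng(0)
phi_z = rng.standard_normal((6, 6, 8))     # stand-in for the exemplar embedding
phi_x = rng.standard_normal((22, 22, 8))   # stand-in for the search-image embedding

dense = score_map_dense(phi_z, phi_x, b=0.1)
independent = score_map_per_window(phi_z, phi_x, b=0.1)
print(dense.shape, np.allclose(dense, independent))
```

Both routes give the same score map; the dense one simply avoids re-running the comparison for every sub-window.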
2.2 Training with large search images
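A discriminative approach is adopted.
Figure caption: when a sub-window extends beyond the extent of the image, the missing portions are filled with the mean RGB value.
In Fig. 2, the upper and lower image pairs are extracted from two frames of the same video. Both frames contain the target and are at most T frames apart.
The class of the object is ignored during training. The scale of the object within each image is normalized without corrupting the aspect ratio of the image.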
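The mean-RGB padding mentioned in the figure caption can be sketched as below. This is a minimal illustration, not the released MATLAB code: the function name is hypothetical, and it assumes the mean is taken over the whole frame and that the crop is square.

```python
import numpy as np

def crop_with_mean_padding(image, centre, size):
    """Crop a size x size window centred at (cy, cx); pixels outside the image get the mean RGB."""
    cy, cx = centre
    half = size // 2
    top, left = cy - half, cx - half
    channels = image.shape[2]
    mean_rgb = image.reshape(-1, channels).mean(axis=0)
    out = np.empty((size, size, channels), dtype=np.float64)
    out[:] = mean_rgb
    # Intersection of the requested window with the image.
    y0, y1 = max(top, 0), min(top + size, image.shape[0])
    x0, x1 = max(left, 0), min(left + size, image.shape[1])
    if y0 < y1 and x0 < x1:
        out[y0 - top:y1 - top, x0 - left:x1 - left] = image[y0:y1, x0:x1]
    return out

# Tiny synthetic frame; the window centred on the top-left corner is mostly outside it.
image = np.arange(4 * 4 * 3, dtype=np.float64).reshape(4, 4, 3)
crop = crop_with_mean_padding(image, centre=(0, 0), size=4)
print(crop.shape)
```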
In the score map, the losses of the positive and negative examples are weighted to eliminate class imbalance.
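One way to do this weighting is sketched below. It assumes a per-position logistic loss log(1 + exp(-y v)) with labels y in {+1, -1}, positives defined as the cells within a fixed radius of the map centre, and positives and negatives each receiving half of the total weight. The radius, the map size and the exact weighting scheme are illustrative assumptions, not details taken from these notes.

```python
import numpy as np

def centre_labels(size, radius):
    """+1 within `radius` cells of the centre of a size x size map, -1 elsewhere."""
    coords = np.arange(size) - (size - 1) / 2.0
    dist = np.hypot(coords[:, None], coords[None, :])
    return np.where(dist <= radius, 1, -1)

def balanced_logistic_loss(scores, labels):
    """Logistic loss over a score map; positives and negatives each get total weight 0.5."""
    pos = labels == 1
    neg = labels == -1
    weights = np.zeros(scores.shape, dtype=np.float64)
    weights[pos] = 0.5 / pos.sum()
    weights[neg] = 0.5 / neg.sum()
    # log(1 + exp(-y * v)), computed in a numerically stable way.
    losses = np.logaddexp(0.0, -labels * scores)
    return float(np.sum(weights * losses))

labels = centre_labels(17, 2)
zero_loss = balanced_logistic_loss(np.zeros((17, 17)), labels)
good_loss = balanced_logistic_loss(10.0 * labels, labels)
print(zero_loss, good_loss)
```

With all scores at zero every position costs log 2, so the weighted total is log 2 regardless of how few positives there are, which is the point of the balancing.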
Note that since the network is symmetric f (z, x) = f (x, z), it is in fact also fully-convolutional in the exemplar. This allows us to use different size exemplar images for different objects in theory.
2.3 ImageNet Video for tracking
The ImageNet Video dataset can safely be used to train a deep model for tracking without over-fitting to the videos used in the tracking benchmarks.
2.4 Practical considerations
The practical considerations cover the following points, all of them useful hands-on details.
Dataset curation
Network architecture
The dimensions of the parameters and activations are given in Table 1. Max-pooling is employed after the first two convolutional layers. ReLU non-linearities follow every convolutional layer except for conv5, the final layer. During training, batch normalization is inserted immediately after every linear layer. The stride of the final representation is eight. An important aspect of the design is that no padding is introduced within the network. Although this is common practice in image classification, it violates the fully-convolutional property of eq. 1.
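Because there is no padding, the spatial sizes follow from simple arithmetic. The sketch below assumes AlexNet-like kernel sizes and strides and input sizes of 127 (exemplar) and 255 (search image); Table 1 is not reproduced in these notes, so these numbers are recalled from the paper and should be treated as assumptions. The resulting total stride of 8 does match the text above.

```python
def valid_output_size(size, layers):
    """Propagate a spatial size through (kernel, stride) layers with no padding."""
    for kernel, stride in layers:
        size = (size - kernel) // stride + 1
    return size

# (kernel, stride) for conv1, pool1, conv2, pool2, conv3, conv4, conv5 (assumed values).
LAYERS = [(11, 2), (3, 2), (5, 1), (3, 2), (3, 1), (3, 1), (3, 1)]

exemplar_map = valid_output_size(127, LAYERS)    # side of phi(z)
search_map = valid_output_size(255, LAYERS)      # side of phi(x)
score_map_size = search_map - exemplar_map + 1   # side of the cross-correlation output

total_stride = 1
for _, stride in LAYERS:
    total_stride *= stride

print(exemplar_map, search_map, score_map_size, total_stride)
```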
Tracking algorithm
Since our purpose is to prove the efficacy of our fully-convolutional Siamese network and its generalization capability when trained on ImageNet Video, we use an extremely simplistic algorithm to perform tracking.
Unlike more sophisticated trackers, we do not update a model or maintain a memory of past appearances, we do not incorporate additional cues such as optical flow or colour histograms, and we do not refine our prediction with bounding box regression.
Yet, despite its simplicity, the tracking algorithm achieves surprisingly good results when equipped with our offline-learnt similarity metric.
Online, we do incorporate some elementary temporal constraints: we only search for the object within a region of approximately four times its previous size, and a cosine window is added to the score map to penalize large displacements.
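A minimal sketch of the cosine-window penalty, assuming the score map is first normalised and then blended with a 2-D Hann window. The function name and the blending weight `influence` are hypothetical, chosen only for illustration.

```python
import numpy as np

def apply_cosine_window(score_map, influence=0.3):
    """Blend a normalised score map with a 2-D Hann window to penalise large displacements."""
    h, w = score_map.shape
    window = np.outer(np.hanning(h), np.hanning(w))
    window /= window.sum()
    scores = score_map - score_map.min()
    total = scores.sum()
    if total > 0:
        scores = scores / total
    return (1.0 - influence) * scores + influence * window

# Two equally strong responses: one at the centre, one far away in a corner.
score = np.zeros((17, 17))
score[0, 0] = 1.0
score[8, 8] = 1.0
windowed = apply_cosine_window(score)
peak = tuple(int(i) for i in np.unravel_index(np.argmax(windowed), windowed.shape))
print(peak)
```

After windowing, the central response wins the tie, which is the intended bias towards small displacements.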
Tracking through scale space is achieved by processing several scaled versions of the search image. Any change in scale is penalized and updates of the current scale are damped.
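The scale penalty and the damped update can be sketched as below. The scale step 1.025, the penalty 0.97 and the damping factor 0.35 are assumptions used for illustration, and both function names are hypothetical.

```python
def pick_scale(peak_scores, scale_factors, penalty=0.97):
    """Choose the scale factor with the highest peak after penalising any change of scale."""
    penalised = [
        score * (1.0 if factor == 1.0 else penalty)
        for score, factor in zip(peak_scores, scale_factors)
    ]
    best = max(range(len(penalised)), key=penalised.__getitem__)
    return scale_factors[best]

def damped_scale_update(current_scale, target_scale, damping=0.35):
    """Move the current scale only part of the way towards the target scale."""
    return (1.0 - damping) * current_scale + damping * target_scale

scale_factors = [1.025 ** k for k in (-1, 0, 1)]

# A slightly higher peak at the larger scale is not enough to overcome the penalty.
kept = pick_scale([0.50, 0.60, 0.61], scale_factors)

# A clearly higher peak at the larger scale wins, but the update is damped.
chosen = pick_scale([0.50, 0.60, 0.70], scale_factors)
new_scale = damped_scale_update(1.0, 1.0 * chosen)
print(kept, chosen, new_scale)
```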
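[Figure: partial screenshot of the overall results for this part]
Conclusion: the method performs no model update and uses only the first frame as the exemplar, yet it is surprisingly robust to motion blur, drastic change of appearance, poor illumination and scale change. On the other hand, it is sensitive to cluttered scenes: because the model is never updated, it drifts easily.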
3 Related work
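For object tracking, some works train RNNs: for example, training an RNN to predict the absolute position of the target in each frame, or training an RNN for tracking with a differentiable attention mechanism. These methods have not yet shown strong results, but they are worth investigating.
Training a deep conv-net on each new video is infeasible, which leads to fine-tuning pre-trained parameters. SO-DLT and MDNet both train a conv-net for a similar detection task in an offline phase and then use SGD at test time to learn a detector; unfortunately, these methods do not run in real time. An alternative is shallow methods (using the internal representation of a pre-trained conv-net as features), such as FCNT and DeepSRDCF. They achieve very good results but do not run in real time because of the high dimensionality of the conv-net representation.
This paper, along with some concurrent works, proposes using conv-nets for object tracking by learning a function of pairs of images. In GOTURN, for example, a conv-net is trained to regress directly from two images to the location in the second image of the object shown in the first. Predicting a rectangle rather than a position has the advantage that changes in scale can be handled without exhaustive evaluation. The drawback is that the approach has no intrinsic invariance to translation of the second image. The network must therefore be shown examples at all positions, which is achieved through considerable dataset augmentation.
Competing methods such as MDNet, SINT and GOTURN are trained on video sequences from the same ALOV/OTB/VOT datasets used by the benchmarks, which risks over-fitting. This paper therefore shows that a conv-net can be trained for effective object tracking without using videos from the same distribution as the test set.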
4 Experiments
4.1 Implementation details
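Training: this part mainly covers the parameter settings and how they were chosen.
Tracking: the feature representation of the exemplar could be updated online through simple strategies such as linear interpolation, but the tracker described here keeps the exemplar fixed to the first frame. Upsampling the score map with bicubic interpolation gives more accurate localization. To handle scale changes, the object is searched over five scales (-2, -1, 0, 1, 2) and the scale is updated by linear interpolation.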
4.2 Evaluation
Two variants:
SiamFC (Siamese Fully-Convolutional)
SiamFC-3s (searches over 3 scales instead of 5)
4.3 The OTB-13 benchmark
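The OTB-13 benchmark considers the average per-frame success rate at different thresholds: a tracker is successful in a given frame if the IoU (intersection-over-union) between its estimate and the ground truth exceeds a certain threshold.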
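The success rate at a set of thresholds can be sketched as follows, assuming boxes are given as (x, y, width, height). The function names and the example boxes are hypothetical; this is an illustration of the metric, not the benchmark toolkit.

```python
import numpy as np

def iou(box_a, box_b):
    """Intersection-over-union of two boxes given as (x, y, width, height)."""
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    iw = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    ih = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    inter = iw * ih
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0

def success_rates(predicted, ground_truth, thresholds):
    """Fraction of frames whose IoU exceeds each threshold."""
    overlaps = np.array([iou(p, g) for p, g in zip(predicted, ground_truth)])
    return np.array([(overlaps > t).mean() for t in thresholds])

# Three frames: a perfect match, a half-shifted box, and a complete miss.
ground_truth = [(0, 0, 10, 10)] * 3
predicted = [(0, 0, 10, 10), (5, 0, 10, 10), (20, 20, 10, 10)]
rates = success_rates(predicted, ground_truth, thresholds=[0.0, 0.5])
print(rates)
```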
4.4 The VOT benchmarks
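The vot2015-final toolkit evaluates trackers on sequences chosen from a pool of 356, selected so that they represent seven different challenging situations well.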
VOT-14 results:
Trackers are evaluated on two metrics: accuracy and robustness.
Accuracy is calculated as the average IoU.
Robustness is expressed in terms of the total number of failures.
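A simplified sketch of the two metrics, given a list of per-frame IoU values. It assumes that every frame with zero overlap marks one failure and that failed frames are excluded from the accuracy average. The real VOT toolkit also re-initialises the tracker after each failure, which this sketch omits; the function name is hypothetical.

```python
def accuracy_and_robustness(overlaps):
    """Return (average IoU over tracked frames, number of failures)."""
    failures = sum(1 for o in overlaps if o <= 0.0)
    tracked = [o for o in overlaps if o > 0.0]
    accuracy = sum(tracked) / len(tracked) if tracked else 0.0
    return accuracy, failures

# Hypothetical per-frame overlaps with a single failure in the third frame.
acc, fails = accuracy_and_robustness([0.8, 0.6, 0.0, 0.7, 0.5])
print(acc, fails)
```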
VOT-15 results:
VOT-16 results:
The fully-convolutional Siamese network reaches state-of-the-art performance and runs in real time. Performance could be improved further with, for example:
model update, bounding-box regression, fine-tuning, memory
4.5 Dataset size
Training a conv-net requires a very large amount of data.
This finding suggests that using a larger video dataset could increase the performance even further.
5 Conclusion
In this work, we depart from the traditional online learning methodology employed in tracking, and show an alternative approach that focuses on learning strong embeddings in an offline phase. Differently from their use in classification settings, we demonstrate that for tracking applications Siamese fully-convolutional deep networks have the ability to use the available data more efficiently. This is reflected both at test-time, by performing efficient spatial searches, but also at training-time, where every sub-window effectively represents a useful sample with little extra cost. The experiments show that deep embeddings provide a naturally rich source of features for online trackers, and enable simplistic test-time strategies to perform well. We believe that this approach is complementary to more sophisticated online tracking methodologies, and expect future work to explore this relationship more thoroughly.