Visual Tracking with Deep Learning and Context
I. Overview of Visual Tracking
1. What is visual tracking?
These three pictures are frames 1, 40, and 80 of the same video. Given the bounding box of the running woman in the first frame, the tracker can still locate the same woman in the later frames.
Given the initialized state (e.g. position and size) of a target object in the first frame of a video, the goal of tracking is to estimate the states of the target in the subsequent frames.
Although object tracking has been studied for several decades, and much progress has been made in recent years, it remains a very challenging problem.
Numerous factors affect the performance of a tracking algorithm, such as illumination variation, occlusion, and background clutter, and there exists no single tracking approach that can successfully handle all scenarios.
2. Difficulties of visual tracking
Many factors limit video-based object tracking, and in both theory and method, research on target tracking faces great challenges.
The diversity of the targets
Multiple moving targets are difficult to describe with a single unified model.
The motion patterns of the targets are very complex.
A target's movement can change its appearance.
Mutual occlusion may occur between multiple moving objects.
The complexity of the scene
Changes in lighting or atmospheric conditions in the scene can cause serious interference.
Some regions have an appearance similar to the target's.
The target may be occluded by objects in the scene.
A dilemma
Fast but Fallible
Robust but Slow
The trade-off between real-time performance and accuracy
3. Recent algorithms for visual tracking
Based on model matching
----- Global model matching
- Build a target appearance model online or offline.
- Search the image for the region most similar to the model.
- Advantage: works well for tracking rigid targets.
- Disadvantage: fails when the target's appearance changes.
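As a concrete illustration, global matching can be sketched as an exhaustive template search with normalized cross-correlation. This is a minimal sketch with illustrative function names, not any specific tracker's implementation:

```python
import numpy as np

def ncc(patch, template):
    """Normalized cross-correlation between two equally sized patches."""
    p = patch - patch.mean()
    t = template - template.mean()
    denom = np.sqrt((p * p).sum() * (t * t).sum())
    return (p * t).sum() / denom if denom > 0 else 0.0

def global_match(frame, template):
    """Slide the appearance template over the frame; return the best
    top-left corner and its similarity score."""
    th, tw = template.shape
    H, W = frame.shape
    best_score, best_pos = -1.0, (0, 0)
    for y in range(H - th + 1):
        for x in range(W - tw + 1):
            s = ncc(frame[y:y + th, x:x + tw], template)
            if s > best_score:
                best_score, best_pos = s, (y, x)
    return best_pos, best_score
```

The exhaustive scan also hints at why such trackers can be slow; in practice the search is restricted to a window around the previous position.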
----- Local model matching
- The tracked target is divided into components, and a separate model is built for each component.
- For example, human motion can be divided into head, limbs, and torso.
- Advantage: stable tracking, especially under occlusion.
- Disadvantage: matching between components is difficult and time-consuming.
----- Feature matching
- Extract features that are invariant to translation, rotation, and scaling.
- Match these features against the current frame.
- Advantage: insensitive to changes in the target's shape, scale, and the like.
- Disadvantage: most image features are sensitive to ambient conditions such as lighting changes.
Based on classification
- Treat tracking as online classification: one class is the target, the other is the background.
- Train a target-vs-background classifier.
- The classifier is updated with the current image frame.
- Advantage: adapts to some degree to changes in the target.
- Disadvantage: classification accuracy often depends on how well the target features are represented.
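The classification view can be sketched as a tiny online logistic regression over patch feature vectors. The feature vectors themselves are assumed given here; real trackers of this family use Haar, HOG, or CNN features:

```python
import numpy as np

class OnlineClassifier:
    """Binary target-vs-background classifier updated frame by frame
    (SGD on the logistic loss)."""

    def __init__(self, dim, lr=0.1):
        self.w = np.zeros(dim)
        self.b = 0.0
        self.lr = lr

    def score(self, x):
        # P(target | patch features)
        return 1.0 / (1.0 + np.exp(-(self.w @ x + self.b)))

    def update(self, x, label):
        # label: 1 for a target patch, 0 for a background patch.
        err = self.score(x) - label
        self.w -= self.lr * err * x
        self.b -= self.lr * err
```

On each new frame, candidate patches are scored, the highest-scoring one is taken as the target, and the classifier is updated with that patch as positive and distant patches as negatives.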
Based on Bayesian filtering
- Combine prior information with current information.
- The target state in the current frame is estimated optimally using the prior information accumulated before the current frame.
- Typical algorithms include the **Kalman filter** and the **particle filter**.
- Advantage: a wide range of applications and few constraints.
- Disadvantage: to maintain filtering precision, particle filter algorithms often require a large number of particles, and the more particles required, the higher the algorithm's complexity.
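The particle filter loop can be sketched in one dimension as follows; the random-walk motion model and Gaussian observation model are placeholder choices, not what any particular tracker prescribes:

```python
import numpy as np

def particle_filter_step(particles, weights, observation, rng,
                         motion_std=1.0, obs_std=2.0):
    """One predict-weight-resample cycle of a bootstrap particle filter (1-D)."""
    # Predict: diffuse particles with the motion model (a random walk here).
    particles = particles + rng.normal(0.0, motion_std, size=particles.shape)
    # Weight: likelihood of the observation under a Gaussian observation model.
    weights = np.exp(-0.5 * ((particles - observation) / obs_std) ** 2)
    weights = weights / weights.sum()
    # Resample: draw particles in proportion to their weights.
    idx = rng.choice(len(particles), size=len(particles), p=weights)
    return particles[idx], np.full(len(particles), 1.0 / len(particles))
```

The state estimate for a frame is then the mean of the particles; the disadvantage noted above shows up directly here, since every step costs time linear in the particle count.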
Based on deep learning (after 2015)
Deep learning's adoption in target tracking has not been smooth sailing. The main problem is the lack of training data: much of the magic of deep models comes from effective training on large amounts of labeled data, whereas target tracking provides only the first frame's bounding box as training data. In this setting it is difficult to train a deep model for the current target at the start of tracking.
Several ideas:
- Pre-train a deep model on auxiliary image data, then fine-tune it online during tracking (DLT, SO-DLT).
- Use a CNN classification network pre-trained on an existing large-scale classification dataset to extract features (FCNT, HCFT, ICCV 2015).
- Pre-train on tracking sequences (MDNet, CVPR 2016).
- Use RNNs (RTT, CVPR 2016).
4. Deep Learning for visual tracking
DLT: Learning a Deep Compact Image Representation for Visual Tracking (NIPS 2013)
Pre-training: SDAE + Tiny Images dataset + unsupervised training, yielding a generic object representation.
Online tracking structure: the SDAE encoder (generic feature representation) + a sigmoid classifier (binary-classification tracking), separating target from background.
Fine-tuning: positive and negative samples from the first frame give a classification network tailored to the current target and background.
Subsequent frames: a particle filter draws patches from the current frame; each patch is fed into the classification network to obtain a confidence score.
Model update: triggered by a fixed confidence threshold.
Advantage: pre-training + fine-tuning compensates for the shortage of training data.
Disadvantage: it is doubtful whether a 32×32 autoencoder suits a classification-style tracking task, and a 4-layer network has limited feature representation power.
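A toy sketch of DLT's classification stage, with a fixed random projection standing in for the pre-trained SDAE encoder. Everything here is illustrative, not the paper's actual network:

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for the pre-trained SDAE encoder: a fixed random linear map.
# (The real DLT learns this offline, unsupervised, on the Tiny Images dataset.)
W_enc = rng.normal(size=(16, 64)) / 8.0

def encode(patch):
    return np.tanh(W_enc @ patch)

def fine_tune(pos_patches, neg_patches, steps=300, lr=0.5):
    """Fit the sigmoid classification head on first-frame pos/neg samples."""
    w, b = np.zeros(16), 0.0
    X = np.array([encode(p) for p in pos_patches + neg_patches])
    y = np.array([1.0] * len(pos_patches) + [0.0] * len(neg_patches))
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))
        g = p - y
        w -= lr * (X.T @ g) / len(y)
        b -= lr * g.mean()
    return w, b

def confidence(patch, w, b):
    """Score a particle-filter patch; the tracker keeps the max-confidence one."""
    z = w @ encode(patch) + b
    return 1.0 / (1.0 + np.exp(-z))
```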
SO-DLT: Transferring Rich Feature Hierarchies for Robust Visual Tracking (ICCV 2015)
Online tracking: when processing frame t, center the search on the position predicted in frame t-1; sample regions at increasing scales and feed them into the network in turn; once the peak of the CNN's output probability map exceeds a threshold, stop sampling and take the current probability map as the best region; finally determine the bounding-box size and position within that region.
Model update: CNN_S responds quickly to target changes; CNN_L is robust to noise.
Takeaway: an ensemble-style update resolves the sensitivity of model updating, a reliable trick for raising a tracker's benchmark scores.
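The small-to-large scale search can be sketched as follows, with `score_fn` standing in for the CNN that maps a crop to the peak of its probability map (all names here are illustrative):

```python
def multi_scale_search(score_fn, center, scales, threshold=0.8):
    """SO-DLT-style search: try crop scales from small to large around the
    previous position, and stop at the first scale whose probability-map
    peak clears the threshold."""
    for scale in scales:
        peak = score_fn(center, scale)
        if peak >= threshold:
            return scale, peak  # best region found at this scale
    return scales[-1], peak     # fall back to the largest crop
```

Stopping at the smallest sufficient scale keeps the crop tight around the target and avoids running the network on needlessly large regions.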
FCNT: Visual Tracking with Fully Convolutional Networks (ICCV 2015)
Pre-training: VGGNet + the labeled ImageNet dataset.
Core idea: feature maps can directly localize the tracked target.
High-level features: good at distinguishing different classes (highly abstract).
Low-level features: good at distinguishing objects within the same class (attending to local details).
Two-branch convolutional structure: conv4-3 separates similar objects, i.e. distractors (SNet); conv5-3 captures category information (GNet).
Online tracking: sample a region centered on the previous frame's position and feed it into SNet and GNet separately, producing two complementary heat maps.
SNet: removes distractors.
GNet: makes the target more salient.
Summary: effectively suppresses drift but is not robust to occlusion; a new way to think about tracking (how many layers to use, and which ones).
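The two complementary heat maps can be fused into a single localization. The simple weighted sum below is a stand-in for FCNT's actual distractor-detection switching rule:

```python
import numpy as np

def localize(g_heat, s_heat, alpha=0.5):
    """Fuse the GNet (category-level) and SNet (distractor-aware) heat maps
    and return the peak position."""
    fused = alpha * g_heat + (1 - alpha) * s_heat
    return np.unravel_index(np.argmax(fused), fused.shape)
```

A distractor that lights up only in the category-level map is suppressed once the distractor-aware map is mixed in.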
MDNet:Learning Multi-Domain Convolutional Neural Networks for Visual Tracking(CVPR 2016)
There is a large gap between image classification and actual tracking.
Image classification: arbitrary combinations of target and background; the target must be detected against any background.
Actual tracking: once the foreground and background of the first frame are given, subsequent frames look very similar to the first.
So pre-train the CNN directly on video sequences. The catch: an object class that is the target in one sequence may be background in another.
Shared layers: the CNN learns a generic feature representation of targets.
Domain-specific layers: each training sequence maps to its own domain with its own binary classification layer, which separates foreground from background for that sequence (solving the inconsistency of targets across sequences).
Determining the bounding box: R-CNN-style region proposals; sample 256 proposals around the previous frame's position, then apply bounding-box regression.
Summary: precision reaches 94.8%; on real-time performance, it is unclear whether detection-style region proposals suit online tracking (256 proposals, 89 domains).
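The multi-domain layout can be sketched as a shared feature extractor with one binary head per training sequence. This is a toy stand-in: a fixed random projection replaces MDNet's learned convolutional layers:

```python
import numpy as np

class MultiDomainNet:
    """Shared feature layers plus one binary fg/bg head per training
    sequence (domain), mirroring MDNet's layout."""

    def __init__(self, in_dim, feat_dim, n_domains, rng):
        self.shared = rng.normal(size=(feat_dim, in_dim)) / np.sqrt(in_dim)
        self.heads = [np.zeros(feat_dim) for _ in range(n_domains)]

    def features(self, x):
        # Shared, domain-generic representation.
        return np.tanh(self.shared @ x)

    def score(self, x, domain):
        # Foreground probability under the given sequence's own head.
        z = self.heads[domain] @ self.features(x)
        return 1.0 / (1.0 + np.exp(-z))

    def update_head(self, x, label, domain, lr=0.5):
        # Only the current sequence's head is trained (label: 1 fg, 0 bg).
        err = self.score(x, domain) - label
        self.heads[domain] = self.heads[domain] - lr * err * self.features(x)
```

During offline training only the head of the current sequence's domain is active; at test time the heads are discarded, a fresh head is attached, and it is fine-tuned on the first frame's samples.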
Use RNN?
These are the 1st, 10th, and 20th frames of a video; as the car moves at constant speed, the sequence shows clear temporal correlation.
The tracking task is special: it is a time series with strong frame-to-frame correlation.
Can a multi-directional recurrent neural network (RNN) learn the temporal correlations of a tracking video sequence?
What is an RNN?
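In a vanilla RNN, the hidden state carries information from earlier inputs forward, which is what suits it to the frame-to-frame correlation of tracking sequences. A minimal sketch:

```python
import numpy as np

def rnn_step(x, h, Wxh, Whh, bh):
    """One step of a vanilla RNN: the new hidden state mixes the current
    input with the previous hidden state, carrying past frames forward."""
    return np.tanh(Wxh @ x + Whh @ h + bh)

def run_rnn(xs, h0, Wxh, Whh, bh):
    """Unroll the RNN over a sequence of per-frame inputs."""
    h, states = h0, []
    for x in xs:  # frames in temporal order
        h = rnn_step(x, h, Wxh, Whh, bh)
        states.append(h)
    return states
```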
RNN Tracker
CVPR 2016
AAAI 2016
5. Visual Tracking with Context
Context information is also very important for tracking.
Recently, some approaches have been proposed that mine auxiliary objects or local visual information surrounding the target to assist tracking.
The context information is especially helpful when the target is fully occluded or leaves the image region.
To improve tracking performance, some tracker-fusion methods have also been proposed recently.
Context-Aware Visual Tracking
The environment can also be advantageous to the tracker if it contains objects whose motion is correlated with the target's.
Question: is the object being followed by the tracker really the target?
Answer: use the dynamic environment!
How to track a face in a crowd?
- It is almost impossible to learn a discriminative model that distinguishes the face of interest from the rest of the crowd.
Why do we have to focus our attention only on the target?
- If the person (with that face) is wearing a quite distinctive shirt (or hat), then including the shirt (or hat) in the matching will surely make tracking much easier and more robust.
- If another face always accompanies the target face, the two can be treated as a geometric structure and tracked as a group.
It seems that:
- A target is seldom isolated and independent of the entire scene.
- There may exist objects with short-term or long-term motion correlations to the target.
So why not track the target and auxiliary objects as a group?
What are auxiliary objects?
- Frequent co-occurrence with the target.
- Consistent motion correlation with the target.
- Suitable for tracking.
This definition may cover a large variety of image regions or features.
- Simple, generic, low-level cues are better.
- Color regions are chosen rather than point features.
- This is because color regions can be tracked reliably and efficiently.
Experiments
(The yellow bounding box marks the target; the red boxes mark the auxiliary color regions.)
Tracking the Invisible: Learning Where the Object Might be
That context helps in object detection is well known.
One of the strongest predictors of vehicle presence and location in an image is the shadow the vehicle casts on the road.
In tracking, many temporary but potentially very strong links exist between the tracked object and the rest of the image.
Local image features vote for the object position.
- An Implicit Shape Model is used to choose the local image features.
- Object points lie on the object surface and thus always have a strong correlation with the object's motion (green points).
- Points on other independently moving objects, or in the static background, are considered to carry no information about the object's position (blue points).
- Supporters are features that help predict the target object's position: they at least temporarily move in a way that is statistically related to the target's motion (red points).
The position of an object can thus be estimated even when the object is not seen directly (e.g., fully occluded or outside the image region).
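The voting idea can be sketched directly: each supporter remembers its offset to the target, and the supporters' current positions plus those offsets vote for the target position. This is a minimal illustration, not the paper's full Hough-voting implementation:

```python
import numpy as np

def predict_from_supporters(supporter_positions, offsets):
    """Each supporter votes for the target with its stored offset
    (target position minus supporter position at learning time); the
    prediction is the mean of the votes."""
    votes = np.asarray(supporter_positions) + np.asarray(offsets)
    return votes.mean(axis=0)
```

Because the prediction uses only the supporters' positions, it still works when the target itself is fully occluded or outside the frame.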
How are the supporters chosen?
Experiments
We can see what we cannot see.
Context Tracker: Exploring Supporters and Distracters
Visual tracking is very challenging when the target leaves the field of view: the tracker may drift to another, similar object and then fail to reacquire the right target when it reappears.
There is additional information that can be exploited beyond the object region alone.
What are supporters and distracters?
Distracters
- Regions with an appearance similar to the target's
- Consistently co-occur with the target
- The tracker must keep tracking these distracters to avoid drifting
- Dangerous
Supporters
- Local key-points around the target
- Consistently co-occur with the target
- Motion correlated with the target
- Useful
Experiments
6. Directions for visual tracking
Improve the target's feature description
- Sufficiently strong features can cope with most adverse environmental influences.
Improve the system's real-time performance
- Search strategies that traverse many redundant regions greatly hurt a tracker's real-time performance.
- How can the target search range be narrowed?