Author (English name): Wang, Fu-En
Thesis title (English): Self-Supervised Learning of Depth and Camera Motion from 360 Videos
Advisor (English name): Sun, Min
Oral defense committee (English names): Wang, Yu-Chiang
Chen, Hwann-Tzong
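As 360 cameras become increasingly common in autonomous systems (self-driving cars and unmanned aerial vehicles), efficient 360 perception becomes ever more important. In this thesis, we propose a novel self-supervised method that predicts panoramic depth and camera motion from 360 videos. Starting from SfMLearner, a method designed for cameras with a normal field of view, we introduce three key techniques to process 360 images efficiently. First, we convert the equirectangular projection to a cubic projection to avoid the distortion of 360 images; before every convolution and deconvolution layer, we apply the Cube Padding algorithm to remove the boundaries of each cube face. Cube Padding pads each face with feature values from its adjacent faces so that the information on every face is complete. Second, we propose a novel spherical photometric consistency constraint, applied on the sphere, which avoids the problem in earlier methods that the training loss cannot be computed where projections fall outside the image boundary. Finally, rather than independently predicting six camera motions (applying SfMLearner directly to each face of the cube), we propose a novel camera motion consistency loss to ensure that the camera motions of the faces constrain one another. To train and evaluate the proposed method, we collected a new dataset, PanoSUNCG, which contains the largest number of 360 images to date together with the corresponding ground-truth depth and camera motion. On PanoSUNCG, the proposed method achieves the highest accuracy to date for depth and camera motion and also has faster inference speed. On real-world videos, our method can still predict reasonable depth and camera motion.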
As 360 cameras become prevalent in many autonomous systems (e.g., self-driving cars and drones), efficient 360 perception becomes increasingly important.
We propose a novel self-supervised learning approach for predicting the omnidirectional depth and camera motion from a 360 video.
In particular, starting from SfMLearner, which is designed for cameras with a normal field of view, we introduce three key features to process 360 images efficiently.
First, we convert each image from equirectangular projection to cubic projection in order to avoid image distortion. In each network layer, we use Cube Padding (CP), which pads each face with intermediate features from its adjacent faces, to eliminate the boundaries of each cube face.
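The conversion from equirectangular to cubic projection can be illustrated with a minimal sketch. It assumes a coordinate convention that is not taken from the thesis (x right, y up, z forward, with face coordinates `u`, `v` in [-1, 1] and `v` pointing up); the names `FACES` and `cube_to_equirect` are hypothetical. For a position on a cube face, it returns the equirectangular location that would be sampled to fill that position.

```python
import math

# Assumed convention (not from the thesis): x right, y up, z forward.
# Each face maps face coordinates (u, v) in [-1, 1]^2 to a ray direction.
FACES = {
    "front": lambda u, v: (u, v, 1.0),
    "back": lambda u, v: (-u, v, -1.0),
    "right": lambda u, v: (1.0, v, -u),
    "left": lambda u, v: (-1.0, v, u),
    "up": lambda u, v: (u, 1.0, -v),
    "down": lambda u, v: (u, -1.0, v),
}


def cube_to_equirect(face, u, v, width, height):
    """Return the continuous (col, row) in an equirectangular image of size
    width x height that is seen at position (u, v) on the given cube face."""
    x, y, z = FACES[face](u, v)
    lon = math.atan2(x, z)                 # longitude in [-pi, pi]
    lat = math.atan2(y, math.hypot(x, z))  # latitude in [-pi/2, pi/2]
    col = ((lon / (2 * math.pi) + 0.5) * width) % width  # wrap horizontally
    row = (0.5 - lat / math.pi) * height   # row 0 is the top of the image
    return col, row
```

Evaluating this mapping for every pixel of the six faces, followed by interpolation in the source image, would yield the cube map; Cube Padding itself operates on network features and is not shown here.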
Second, we propose a novel "spherical" photometric consistency constraint on the whole viewing sphere. In this way, no pixel is projected outside the image boundary, as typically happens in images with a normal field of view.
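The boundary argument can be made concrete with a small sketch comparing a pinhole projection with a projection onto the viewing sphere. The function names, the focal length parameter, and the axis convention (x right, y up, z forward) are assumptions for illustration, not the thesis's implementation. A pinhole camera loses points that are behind it or outside its field of view, whereas every nonzero 3D point has a valid location on the sphere.

```python
import math


def project_pinhole(point, focal, width, height):
    """Pinhole projection; returns None when the point leaves the image."""
    x, y, z = point
    if z <= 0:
        return None  # behind the camera
    col = focal * x / z + width / 2
    row = height / 2 - focal * y / z
    if 0 <= col < width and 0 <= row < height:
        return col, row
    return None  # outside the normal field of view


def project_sphere(point, width, height):
    """Project a nonzero 3D point onto an equirectangular image of the
    viewing sphere; the result is always a valid location."""
    x, y, z = point
    lon = math.atan2(x, z)
    lat = math.atan2(y, math.hypot(x, z))
    col = ((lon / (2 * math.pi) + 0.5) * width) % width
    row = (0.5 - lat / math.pi) * height
    return col, row
```

For example, the point `(5.0, 0.0, 1.0)` falls outside a 200 x 200 pinhole image with focal length 100, so it contributes nothing to a photometric loss there, while `project_sphere` still returns a location for it.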
Finally, rather than naively estimating six independent camera motions (i.e., applying SfMLearner separately to each face of the cube), we propose a novel camera pose consistency loss to ensure that the estimated camera motions reach consensus.
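One way such a consistency term could look is sketched below. This is an assumption, not the loss defined in the thesis: it handles only the translation part of the motion, assumes each face predicts a translation in its own local frame, and assumes a known rotation per face that maps local coordinates to a shared reference frame. The names `mat_vec` and `pose_consistency_loss` are hypothetical.

```python
def mat_vec(R, t):
    """Multiply a 3x3 matrix (list of rows) by a 3-vector."""
    return [sum(R[i][j] * t[j] for j in range(3)) for i in range(3)]


def pose_consistency_loss(translations, rotations):
    """Mean squared deviation of per-face translations from their mean,
    after rotating each one into the shared reference frame."""
    aligned = [mat_vec(R, t) for R, t in zip(rotations, translations)]
    n = len(aligned)
    mean = [sum(a[i] for a in aligned) / n for i in range(3)]
    return sum((a[i] - mean[i]) ** 2 for a in aligned for i in range(3)) / n


# Usage with two faces: the front face (identity) and the right face,
# whose local forward axis is assumed to point along the reference +x axis.
IDENTITY = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
RIGHT_TO_REF = [[0, 0, 1], [0, 1, 0], [-1, 0, 0]]

# A forward motion seen from the right face appears as a motion to its left.
consistent = pose_consistency_loss([(0, 0, 1), (-1, 0, 0)], [IDENTITY, RIGHT_TO_REF])
inconsistent = pose_consistency_loss([(0, 0, 1), (0, 0, 1)], [IDENTITY, RIGHT_TO_REF])
```

The loss is zero when all faces agree on the same motion in the shared frame and grows as their predictions diverge, which is the consensus behavior the abstract describes.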
To train and evaluate our approach, we collect a new PanoSUNCG dataset containing a large number of 360 videos with ground-truth depth and camera motion. Our approach achieves state-of-the-art depth prediction and camera motion estimation, with faster inference speed compared with processing equirectangular images. In real-world indoor videos, our approach can also achieve qualitatively reasonable depth prediction.
Introduction
Related work
Our approach
Dataset
Experiments
Conclusion
References
