
Detailed Record

Author (Chinese): 許家和
Author (English): Hsu, Chia-Ho
Title (Chinese): 基於擴張下採樣及多通道上採樣深度神經網路之快速單一視圖3D物件細節重建
Title (English): Fast Single-View 3D Object Reconstruction with Fine Details through Dilated Downsample and Multi-path Upsample Deep Neural Network
Advisor (Chinese): 邱瀞德
Advisor (English): Chiu, Ching-Te
Committee (Chinese): 楊家輝、賴尚宏、范倫達
Committee (English): Yang, Jar-Ferr; Lai, Shang-Hong; Van, Lan-Da
Degree: Master's
Institution: National Tsing Hua University
Department: Computer Science
Student ID: 106062638
Publication Year (ROC calendar): 108 (2019)
Graduation Academic Year: 107
Language: English
Number of Pages: 89
Keywords (Chinese): 3維物件重建、3維形狀重建、深度卷積神經網路、單張影像、單一視圖
Keywords (English): 3D Object Reconstruction; 3D Shape Reconstruction; Deep Convolutional Neural Network; Single Image; Single View
Abstract (Chinese):
3D object reconstruction has long been one of the most important research topics in computer vision. Its goal is to reconstruct the overall shape of an object from its 2D image, including information that the image itself cannot present. With the rapid development of deep learning, many methods based on convolutional neural networks (CNNs) and auto-encoders have been applied to this problem and have achieved very successful results. Current deep-learning methods for 3D object reconstruction fall into two main types: single-image reconstruction and multi-image reconstruction. Single-image reconstruction, as the name suggests, uses only a 2D image of the object from a single viewpoint as input to reconstruct its 3D shape. Multi-image reconstruction, conversely, uses images of the object from different viewpoints as input to reconstruct the 3D shape. In this regard, Choy et al. proposed a method applicable to both, called 3D-R2N2. Under their single deep network architecture, the latter reconstructs with higher quality than the former, because the latter can combine image features from different viewpoints while the former cannot. However, for real-time applications such as VR/AR, the latter is less efficient than the former, because combining images from different viewpoints means spending more computation time. Moreover, although the reconstruction quality of the former is lower than that of the latter, it is still not far from the expected shape.
In this thesis, to reconstruct better object shapes without spending too much computation time, we focus on the more widely used approach, single-image reconstruction. Because only one image is given as input and the output is a 3D volume of occupancy probabilities, the reconstructed object often loses its finer structures. To alleviate this problem, our auto-encoder architecture incorporates our proposed Dilated Downsample Block, which extracts more image features, and Multi-path Upsample Block, which applies multiple transformations to the shape features and combines them. Finally, we concatenate the corresponding layers of the encoder and decoder so that image features remain available during the decoding process. The proposed architecture improves reconstruction quality while preserving the fine structures of objects.
Finally, we conduct experiments on the dataset proposed by Choy et al. The results show that our method achieves 67.7% intersection-over-union (IoU) accuracy, 3.6% higher than the method proposed by Richter et al. In addition, compared with the PSVH method of Wang et al. on a smaller dataset, our method reaches 71.4%, 3.4% higher. Under the same experimental environment, our method is also about 25 times faster in reconstruction speed.
Moreover, we also experiment on the more difficult dataset proposed by Wu et al., achieving an average chamfer distance (CD) error of 9.9, 15% better than the AtlasNet method proposed by Groueix et al.
Furthermore, with the assistance of object silhouette maps, our result reaches a CD error of 8.1, a 31% improvement over AtlasNet. Compared with ShapeHD, whose network architecture is larger and whose training is more complex, our method achieves an 8% improvement.
Abstract (English):
3D object reconstruction has been one of the most important research areas in the field of computer vision. Its purpose is to reconstruct the overall shape of an object from its 2D image, including information that cannot be presented by the image itself. With the development of deep learning, many methods based on Convolutional Neural Networks (CNNs) and auto-encoders have been applied to this problem and have achieved successful results. Deep-learning methods for 3D object reconstruction fall into two main types: single-view reconstruction and multi-view reconstruction. Single-view reconstruction takes one randomly selected view image of an object as input and reconstructs its corresponding 3D shape. Multi-view reconstruction, on the other hand, takes more than one view image of the same object as input and reconstructs its 3D shape by integrating the features of the different views.
In this regard, Choy et al. proposed a unified method called 3D-R2N2 that fits both modes. Under the same network architecture, the reconstruction quality of the multi-view mode is better than that of the single-view mode: the former can leverage and fuse features from different views, while the latter only has the features of one input view. However, for applications requiring real-time interaction, such as VR/AR, the multi-view mode costs more computation time, since extracting and fusing the features of additional input views takes longer. Besides, although the single-view reconstruction quality is lower, it is still not far from the expected shape.
In this thesis, to obtain good 3D shape reconstruction with low computation time, we focus on the more widely used approach, single-view reconstruction. The main issue of using a single image as input is that the reconstructed shape often misses its fine structural details. To address this issue, we propose two components for our auto-encoder architecture: the dilated downsample block, which extracts more image features, and the multi-path upsample block, which applies multiple transformations to the shape features and combines them. Finally, we concatenate the corresponding layers of the encoder and decoder to keep image features available during the reconstruction process.
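To make the pipeline concrete, below is a minimal PyTorch sketch of the two block types; the channel counts, kernel sizes, and dilation rates are illustrative assumptions, not the exact configuration reported in Chapter 3.

import torch
import torch.nn as nn

class DilatedDownsampleBlock(nn.Module):
    # Sketch of a dilated downsample block (assumed layout): downsample the
    # 2D image features, then enlarge the receptive field with dilation.
    def __init__(self, c_in, c_out):
        super().__init__()
        self.down = nn.Conv2d(c_in, c_out, 3, stride=2, padding=1)         # halves H and W
        self.dilated = nn.Conv2d(c_out, c_out, 3, padding=2, dilation=2)   # wider context, same size
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.act(self.dilated(self.act(self.down(x))))

class MultiPathUpsampleBlock(nn.Module):
    # Sketch of a multi-path upsample block (assumed layout): upsample the
    # 3D shape features along two parallel paths that transform the
    # features differently, then fuse the results.
    def __init__(self, c_in, c_out):
        super().__init__()
        self.path_a = nn.ConvTranspose3d(c_in, c_out, 4, stride=2, padding=1)  # doubles D, H, W
        self.path_b = nn.Sequential(
            nn.Upsample(scale_factor=2, mode="nearest"),
            nn.Conv3d(c_in, c_out, 3, padding=1),
        )
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.act(self.path_a(x) + self.path_b(x))  # element-wise fusion of both paths

# Shape check on toy inputs.
img_feat = DilatedDownsampleBlock(3, 32)(torch.randn(1, 3, 128, 128))   # -> (1, 32, 64, 64)
vox_feat = MultiPathUpsampleBlock(64, 32)(torch.randn(1, 64, 8, 8, 8))  # -> (1, 32, 16, 16, 16)

The encoder-decoder concatenation then joins each encoder feature map with the decoder layer of matching resolution along the channel dimension, in the spirit of U-Net [46].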
Finally, we conduct experiments on the dataset provided by Choy et al. The experimental results show that our proposed method achieves 67.7% intersection-over-union (IoU) accuracy, 3.6% higher than the state-of-the-art method of Richter et al.
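For reference, the voxel IoU in this line of work is commonly computed by thresholding the predicted occupancy probability p(i) of each voxel i at some threshold t and comparing against the ground-truth occupancy y(i):

\mathrm{IoU} = \frac{\sum_i \mathbf{1}[p(i) > t] \cdot \mathbf{1}[y(i) = 1]}{\sum_i \mathbf{1}\big[\mathbf{1}[p(i) > t] + \mathbf{1}[y(i) = 1] > 0\big]}

where \mathbf{1}[\cdot] is the indicator function; higher IoU means a better reconstruction. The exact evaluation settings are described in Section 4.2.1 (Metrics).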
In addition, compared with the state-of-the-art PSVH method of Wang et al. on a smaller dataset covering only four categories, our result reaches 71.4%, 3.4% higher. As for reconstruction speed, our average reconstruction time is 13 ms, about 25 times faster than PSVH in the same experimental environment.
Moreover, we also experiment on the more difficult dataset provided by Wu et al., where our result achieves an average chamfer distance (CD) of 9.9, a 15% improvement over AtlasNet proposed by Groueix et al.
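For reference, the chamfer distance between a predicted point set S_1 and a ground-truth point set S_2 is commonly defined as

\mathrm{CD}(S_1, S_2) = \frac{1}{|S_1|} \sum_{x \in S_1} \min_{y \in S_2} \lVert x - y \rVert_2 + \frac{1}{|S_2|} \sum_{y \in S_2} \min_{x \in S_1} \lVert x - y \rVert_2

with lower values indicating that the two surfaces lie closer together. Conventions vary (e.g., squared distances or different normalizations); the exact variant evaluated here is specified in Section 4.2.1 (Metrics).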
Furthermore, with the assistance of object silhouettes, our result achieves a CD of 8.1, a 31% improvement over AtlasNet. Besides, compared with ShapeHD, whose architecture is much larger and whose training process is more complex, we achieve an 8% improvement.
1 Introduction 1
1.1 Motivation and Problem Description . . . . . . . . . . . . . . . . . 2
1.2 Goal and Contribution . . . . . . . . . . . . . . . . . . . . . . . . . 6
1.3 Thesis Organization . . . . . . . . . . . . . . . . . . . . . . . . . . 8
2 Related Works 9
2.1 Traditional methods . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.2 Learning-based methods . . . . . . . . . . . . . . . . . . . . . . . . 10
3 Single View 3D Object Reconstruction with Dilated Downsample
and Multi-path Upsample Assisted Deep Neural Network 15
3.1 Overall Approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
3.2 2D Image Encoder . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
3.2.1 Sparse Step Downsample block & Dense Step Downsample
block . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
3.2.2 Dilated Downsample block . . . . . . . . . . . . . . . . . . . 21
3.3 3D Shape Decoder . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
3.3.1 Upsample Block . . . . . . . . . . . . . . . . . . . . . . . . 24
3.3.2 Multi-path Upsample Block . . . . . . . . . . . . . . . . . . 26
3.4 Objective Function . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
3.4.1 Intersection over Union (IoU) Loss . . . . . . . . . . . . . . 28
3.4.2 Mean Squared False Cross Entropy Loss (MSFCEL) . . . . 29
3.5 Concatenation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
4 Experimental Results 34
4.1 Environment and Datasets . . . . . . . . . . . . . . . . . . . . . . . 34
4.2 Experimental Results . . . . . . . . . . . . . . . . . . . . . . . . . . 36
4.2.1 Metrics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
4.2.2 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
4.3 Ablation Study . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
4.3.1 Dense Step Downsample block vs. Dilated Downsample block 43
4.3.2 Different Type of Multi-path Upsample Block . . . . . . . . 44
4.3.3 Sparsity Problem Addressing . . . . . . . . . . . . . . . . . 49
4.4 Comparison with Other Works . . . . . . . . . . . . . . . . . . . . 51
4.5 Visualization of Reconstruction Result . . . . . . . . . . . . . . . . 54
4.6 Worst Case Analysis & Improvement . . . . . . . . . . . . . . 64
4.7 High Resolution Reconstruction . . . . . . . . . . . . . . . . . . . . 69
4.7.1 Dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
4.7.2 Modified Architecture . . . . . . . . . . . . . . . . . . . . . 70
4.7.3 Metric and Experimental Results . . . . . . . . . . . . . . . 71
5 Conclusion 79
6 References 81
[1] C. B. Choy, D. Xu, J. Gwak, K. Chen, and S. Savarese, “3d-r2n2: A unified
approach for single and multi-view 3d object reconstruction,” in Computer
Vision – ECCV 2016, B. Leibe, J. Matas, N. Sebe, and M. Welling, Eds.
Cham: Springer International Publishing, 2016, pp. 628–644.
[2] H. Wang, J. Yang, W. Liang, and X. Tong, “Deep single-view 3d object reconstruction
with visual hull embedding,” in Proceedings of the AAAI Conference
on Artificial Intelligence (AAAI), 2019.
[3] A. Kar, S. Tulsiani, J. Carreira, and J. Malik, “Category-specific object reconstruction
from a single image,” in The IEEE Conference on Computer
Vision and Pattern Recognition (CVPR), June 2015.
[4] S. Tulsiani, A. A. Efros, and J. Malik, “Multi-view consistency as supervisory
signal for learning shape and pose prediction,” in The IEEE Conference on
Computer Vision and Pattern Recognition (CVPR), June 2018.
[5] H. Fan, H. Su, and L. J. Guibas, “A point set generation network for 3d object
reconstruction from a single image,” in The IEEE Conference on Computer
Vision and Pattern Recognition (CVPR), July 2017.
[6] A. Arsalan Soltani, H. Huang, J. Wu, T. D. Kulkarni, and J. B. Tenenbaum,
“Synthesizing 3d shapes via modeling multi-view depth maps and silhouettes
with deep generative networks,” in The IEEE Conference on Computer Vision
and Pattern Recognition (CVPR), July 2017.
[7] S. R. Richter and S. Roth, “Matryoshka networks: Predicting 3d geometry
via nested shape layers,” in The IEEE Conference on Computer Vision and
Pattern Recognition (CVPR), June 2018.
[8] J. Wu, C. Zhang, X. Zhang, Z. Zhang, W. T. Freeman, and J. B. Tenenbaum,
“Learning shape priors for single-view 3d completion and reconstruction,” in
The European Conference on Computer Vision (ECCV), September 2018.
[9] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification
with deep convolutional neural networks,” in Advances in Neural
Information Processing Systems 25, F. Pereira, C. J. C. Burges,
L. Bottou, and K. Q. Weinberger, Eds. Curran Associates, Inc.,
2012, pp. 1097–1105. [Online]. Available: http://papers.nips.cc/paper/4824-imagenet-classification-with-deep-convolutional-neural-networks.pdf
[10] K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” arXiv preprint arXiv:1409.1556, 2014.
[11] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,”
in The IEEE Conference on Computer Vision and Pattern Recognition
(CVPR), June 2016.
[12] A. Johnston, R. Garg, G. Carneiro, I. Reid, and A. van den Hengel, “Scaling
cnns for high resolution volumetric reconstruction from a single image,” in
The IEEE International Conference on Computer Vision (ICCV) Workshops,
Oct 2017.
[13] A. X. Chang, T. A. Funkhouser, L. J. Guibas, P. Hanrahan, Q. Huang, Z. Li,
S. Savarese, M. Savva, S. Song, H. Su, J. Xiao, L. Yi, and F. Yu, “Shapenet:
An information-rich 3d model repository,” CoRR, vol. abs/1512.03012, 2015.
[Online]. Available: http://arxiv.org/abs/1512.03012
[14] F. Yu and V. Koltun, “Multi-scale context aggregation by dilated convolutions,”
arXiv preprint arXiv:1511.07122, 2015.
[15] P. Wang, P. Chen, Y. Yuan, D. Liu, Z. Huang, X. Hou, and G. Cottrell, “Understanding
convolution for semantic segmentation,” arXiv preprint arXiv:1702.08502, 2017.
[16] T. Groueix, M. Fisher, V. G. Kim, B. C. Russell, and M. Aubry, “A papier-mâché approach to learning 3d surface generation,” in The IEEE Conference
on Computer Vision and Pattern Recognition (CVPR), June 2018.
[17] S. Vicente, J. Carreira, L. Agapito, and J. Batista, “Reconstructing pascal
voc,” in The IEEE Conference on Computer Vision and Pattern Recognition
(CVPR), June 2014.
[18] Q. Huang, H. Wang, and V. Koltun, “Single-view reconstruction
via joint analysis of image and shape collections,” ACM Trans.
Graph., vol. 34, no. 4, pp. 87:1–87:10, Jul. 2015. [Online]. Available:
http://doi.acm.org/10.1145/2766890
[19] M. Sung, V. G. Kim, R. Angst, and L. Guibas, “Data-driven structural priors
for shape completion,” ACM Trans. Graph., vol. 34, no. 6, pp. 175:1–175:11,
Oct. 2015. [Online]. Available: http://doi.acm.org/10.1145/2816795.2818094
[20] Y. Li, A. Dai, L. Guibas, and M. Nießner, “Database-assisted object retrieval
for real-time 3d reconstruction,” in Computer Graphics Forum, vol. 34, no. 2.
Wiley Online Library, 2015.
[21] N. J. Mitra, L. J. Guibas, and M. Pauly, “Partial and approximate symmetry
detection for 3d geometry,” ACM Trans. Graph., vol. 25, no. 3, pp. 560–568,
Jul. 2006. [Online]. Available: http://doi.acm.org/10.1145/1141911.1141924
[22] Y. Liao, Y. Yang, and Y. F. Wang, “3d shape reconstruction from a single
2d image via 2d-3d self-consistency,” CoRR, vol. abs/1811.12016, 2018.
[Online]. Available: http://arxiv.org/abs/1811.12016
[23] S. Tulsiani, T. Zhou, A. A. Efros, and J. Malik, “Multi-view supervision for
single-view reconstruction via differentiable ray consistency,” in The IEEE
Conference on Computer Vision and Pattern Recognition (CVPR), July 2017.
[24] A. Kar, C. Häne, and J. Malik, “Learning a multi-view stereo machine,”
in Advances in Neural Information Processing Systems 30, I. Guyon, U. V.
Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett,
Eds. Curran Associates, Inc., 2017, pp. 365–376. [Online]. Available:
http://papers.nips.cc/paper/6640-learning-a-multi-view-stereo-machine.pdf
[25] X. Yan, J. Yang, E. Yumer, Y. Guo, and H. Lee, “Perspective transformer
nets: Learning single-view 3d object reconstruction without 3d supervision,”
in Advances in Neural Information Processing Systems 29, D. D. Lee,
M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett, Eds. Curran Associates,
Inc., 2016, pp. 1696–1704.
[26] G. Yang, Y. Cui, S. Belongie, and B. Hariharan, “Learning single-view 3d
reconstruction with limited pose supervision,” in The European Conference
on Computer Vision (ECCV), September 2018.
[27] X. Sun, J. Wu, X. Zhang, Z. Zhang, C. Zhang, T. Xue, J. B. Tenenbaum,
and W. T. Freeman, “Pix3d: Dataset and methods for single-image 3d shape
modeling,” in The IEEE Conference on Computer Vision and Pattern Recognition
(CVPR), June 2018.
[28] Z. Wu, S. Song, A. Khosla, F. Yu, L. Zhang, X. Tang, and J. Xiao, “3d
shapenets: A deep representation for volumetric shapes,” in The IEEE Conference
on Computer Vision and Pattern Recognition (CVPR), June 2015.
[29] J. Wu, Y. Wang, T. Xue, X. Sun, B. Freeman, and J. Tenenbaum,
“Marrnet: 3d shape reconstruction via 2.5d sketches,” in Advances in Neural
Information Processing Systems 30, I. Guyon, U. V. Luxburg, S. Bengio,
H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, Eds. Curran
Associates, Inc., 2017, pp. 540–550. [Online]. Available: http://papers.nips.cc/paper/6657-marrnet-3d-shape-reconstruction-via-25d-sketches.pdf
[30] R. Girdhar, D. F. Fouhey, M. Rodriguez, and A. Gupta, “Learning a
predictable and generative vector representation for objects,” CoRR, vol.
abs/1603.08637, 2016. [Online]. Available: http://arxiv.org/abs/1603.08637
[31] J. Wu, C. Zhang, T. Xue, B. Freeman, and J. Tenenbaum, “Learning a probabilistic latent space of object shapes via 3d generative-adversarial modeling,” in Advances in Neural Information Processing Systems 29, D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett, Eds. Curran Associates, Inc., 2016, pp. 82–90.
[32] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley,
S. Ozair, A. Courville, and Y. Bengio, “Generative adversarial nets,” in
Advances in Neural Information Processing Systems 27, Z. Ghahramani,
M. Welling, C. Cortes, N. D. Lawrence, and K. Q. Weinberger,
Eds. Curran Associates, Inc., 2014, pp. 2672–2680. [Online]. Available:
http://papers.nips.cc/paper/5423-generative-adversarial-nets.pdf
[33] J. Wu, T. Xue, J. J. Lim, Y. Tian, J. B. Tenenbaum, A. Torralba, and W. T.
Freeman, “Single image 3d interpreter network,” in European Conference on
Computer Vision (ECCV), 2016.
[34] A. Brock, T. Lim, J. M. Ritchie, and N. Weston, “Generative and
discriminative voxel modeling with convolutional neural networks,” CoRR,
vol. abs/1608.04236, 2016. [Online]. Available: http://arxiv.org/abs/1608.04236
[35] D. P. Kingma and M. Welling, “Auto-encoding variational bayes,” CoRR, vol.
abs/1312.6114, 2013.
[36] A. Sharma, O. Grau, and M. Fritz, “Vconv-dae: Deep volumetric shape learning
without object labels,” in ECCV Workshops, 2016.
[37] J. Gwak, C. B. Choy, M. Chandraker, A. Garg, and S. Savarese, “Weakly supervised
3d reconstruction with adversarial constraint,” in 3D Vision (3DV),
2017 Fifth International Conference on 3D Vision, 2017.
[38] C.-H. Lin, C. Kong, and S. Lucey, “Learning efficient point cloud generation
for dense 3d object reconstruction,” in AAAI Conference on Artificial
Intelligence (AAAI), 2018.
[39] C. R. Qi, H. Su, K. Mo, and L. J. Guibas, “Pointnet: Deep learning on point
sets for 3d classification and segmentation,” arXiv preprint arXiv:1612.00593,
2016.
[40] M. Tatarchenko, A. Dosovitskiy, and T. Brox, “Octree generating networks:
Efficient convolutional architectures for high-resolution 3d outputs,” in The
IEEE International Conference on Computer Vision (ICCV), Oct 2017.
[41] G. Riegler, A. Osman Ulusoy, and A. Geiger, “Octnet: Learning deep 3d
representations at high resolutions,” in The IEEE Conference on Computer
Vision and Pattern Recognition (CVPR), July 2017.
[42] G. Riegler, A. O. Ulusoy, H. Bischof, and A. Geiger, “Octnetfusion: Learning
depth fusion from data,” in Proceedings of the International Conference on
3D Vision, 2017.
[43] C. Zou, E. Yumer, J. Yang, D. Ceylan, and D. Hoiem, “3d-prnn: Generating
shape primitives with recurrent neural networks,” in The IEEE International
Conference on Computer Vision (ICCV), 2017.
[44] Y. Sun, Z. Liu, Y. Wang, and S. E. Sarma, “Im2avatar: Colorful 3d
reconstruction from a single image,” CoRR, vol. abs/1804.06375, 2018.
[Online]. Available: http://arxiv.org/abs/1804.06375
[45] S. Wang, W. Liu, J. Wu, L. Cao, Q. Meng, and P. J. Kennedy, “Training
deep neural networks on imbalanced data sets,” in 2016 International Joint
Conference on Neural Networks (IJCNN), July 2016, pp. 4368–4374.
[46] O. Ronneberger, P. Fischer, and T. Brox, “U-net: Convolutional networks
for biomedical image segmentation,” CoRR, vol. abs/1505.04597, 2015.
[Online]. Available: http://arxiv.org/abs/1505.04597
[47] D. Kingma and J. Ba, “Adam: A method for stochastic optimization,” in International Conference on Learning Representations, Dec. 2014.
[48] H. G. Barrow, J. M. Tenenbaum, R. C. Bolles, and H. C. Wolf, “Parametric
correspondence and chamfer matching: Two new techniques for image
matching,” in Proceedings of the 5th International Joint Conference on
Artificial Intelligence - Volume 2, ser. IJCAI’77. San Francisco, CA, USA:
Morgan Kaufmann Publishers Inc., 1977, pp. 659–663. [Online]. Available:
http://dl.acm.org/citation.cfm?id=1622943.1622971
[49] B. Yang, S. Rosa, A. Markham, N. Trigoni, and H. Wen, “Dense 3d object
reconstruction from a single depth view,” IEEE Transactions on Pattern
Analysis and Machine Intelligence, pp. 1–1, 2019.
[50] J. Xiao, J. Hays, K. A. Ehinger, A. Oliva, and A. Torralba, “Sun database:
Large-scale scene recognition from abbey to zoo,” in 2010 IEEE Computer
Society Conference on Computer Vision and Pattern Recognition, June 2010,
pp. 3485–3492.