
Detailed Record

Author (Chinese): 粘鈞泰
Author (English): NIEN, JIUN-TAI
Title (Chinese): 應用於影片動作分割的片段對比學習
Title (English): Segment-Aware Contrastive Learning for Action Segmentation
Advisor (Chinese): 金仲達
Advisor (English): KING, CHUNG-TA
Committee Members (Chinese): 許秋婷, 林國祥
Committee Members (English): HSU, CHIU-TING; LIN, GUO-SHIANG
Degree: Master's
Institution: National Tsing Hua University
Department: Department of Computer Science
Student ID: 108062635
Year of Publication (ROC calendar): 111 (2022)
Academic Year of Graduation: 110
Language: English
Number of Pages: 25
Keywords (Chinese): 對比學習 (contrastive learning), 動作分割 (action segmentation)
Keywords (English): Contrastive Learning, Action Segmentation
Abstract (Chinese): The goal of action segmentation is to partition an untrimmed video into action intervals and classify each of them. Current methods treat action segmentation as a frame-wise prediction task and focus on improving the model's understanding of temporal structure. In segmentation problems, however, the learned features are a major factor in model performance, and this direction has not been explored by existing methods. In this thesis, we propose a regularization method for feature learning that helps the model better capture contextual relationships and the characteristics of each action. We combine the proposed method with existing models and evaluate its effect on accuracy and on the continuity of the predicted segments. Our method improves the feature-extraction ability of existing models and achieves better segmentation results on two public datasets.
Abstract (English): The goal of action segmentation is to locate and classify actions in untrimmed videos. However, it suffers from over-segmentation errors, where predicted segments are too fine with respect to the ground-truth segments. Recent state-of-the-art models address this problem by improving their ability to extract temporal features, either with larger receptive fields or with stronger modeling of temporal structure. Unfortunately, current methods often fail to model the complete context information of an action due to the variety of action lengths and structures. In this work, we propose a contrastive learning framework that learns action context on top of the commonly used multi-stage architecture. We extend recent state-of-the-art models with our framework and obtain performance gains in segmental F1 score and Edit score on two challenging datasets, 50Salads and Breakfast.
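Neither abstract spells out the training objective, so the following is only a rough sketch, under stated assumptions, of what segment-aware contrastive learning over a frame-wise temporal model could look like in PyTorch. Frame-level features are mean-pooled into segment-level embeddings using ground-truth segment boundaries, and a supervised InfoNCE-style loss pulls embeddings of segments with the same action class together while pushing apart those of different classes. The function names, the mean-pooling choice, and the temperature value are illustrative assumptions, not the thesis's actual formulation.

import torch
import torch.nn.functional as F

def segment_embeddings(frame_feats, frame_labels):
    # Mean-pool frame features within each contiguous ground-truth segment (illustrative choice).
    # frame_feats: (T, D) features from a temporal model; frame_labels: (T,) action id per frame.
    embs, labels, start = [], [], 0
    for t in range(1, len(frame_labels) + 1):
        if t == len(frame_labels) or frame_labels[t] != frame_labels[start]:
            embs.append(frame_feats[start:t].mean(dim=0))
            labels.append(frame_labels[start])
            start = t
    return torch.stack(embs), torch.stack(labels)  # (S, D), (S,)

def segment_contrastive_loss(embs, labels, temperature=0.1):
    # Supervised InfoNCE over segment embeddings: same-class segments are
    # positives, different-class segments are negatives.
    z = F.normalize(embs, dim=1)
    logits = (z @ z.t()) / temperature
    self_mask = torch.eye(len(labels), dtype=torch.bool, device=embs.device)
    pos_mask = (labels.unsqueeze(0) == labels.unsqueeze(1)) & ~self_mask
    logits = logits.masked_fill(self_mask, float('-inf'))          # exclude self-similarity
    log_prob = logits - torch.logsumexp(logits, dim=1, keepdim=True)
    per_anchor = -(log_prob.masked_fill(~pos_mask, 0.0)).sum(dim=1) / pos_mask.sum(dim=1).clamp(min=1)
    valid = pos_mask.any(dim=1)                                    # anchors whose class occurs more than once
    return per_anchor[valid].mean() if valid.any() else embs.new_zeros(())

In a multi-stage setup this auxiliary loss would presumably be added, with some weight, to the usual frame-wise classification and smoothing losses; that combination is an assumption here, not something the abstracts state.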
Table of Contents
Acknowledgements
Abstract (Chinese) i
Abstract (English) ii
1 Introduction 1
2 Related Work 5
3 Method 9
3.1 System Overview 9
3.2 Contrast Samples 10
3.2.1 Constructing Segment-Level Embeddings 11
3.3 Loss Functions 12
4 Experiments 15
4.1 Experimental Setup 15
4.2 Overall Performance 16
4.3 Ablation Study 17
4.3.1 Impacts of Different Positives 17
4.3.2 Impacts of Memory Bank 18
4.3.3 Impacts of Segment-Level Representation 18
4.3.4 Effectiveness of Different Construction Methods 19
5 Conclusion 21
References 23