
Detailed Record

Author (Chinese): 曹習愛
Author (English): Tsao, Hsi-Ai
Title (Chinese): AutoVP: 自動化視覺提示框架與基準
Title (English): AutoVP: An Automated Visual Prompting Framework and Benchmark
Advisors (Chinese): 何宗易、王廷基
Advisors (English): Ho, Tsung-Yi; Wang, Ting-Chi
Committee Members (Chinese): 郭柏志、陳品諭
Committee Members (English): Kuo, Po-Chih; Chen, Pin-Yu
Degree: Master's
University: National Tsing Hua University
Department: Computer Science
Student ID: 111062753
Publication Year (ROC calendar): 113 (2024)
Graduating Academic Year: 112 (2023–2024)
Language: English
Number of Pages: 69
Keywords (Chinese): 視覺提示、遷移學習、分布外資料、電腦視覺、自動化
Keywords (English): visual prompting, transfer learning, out-of-distribution, computer vision, automation
Abstract (Chinese):
視覺提示(Visual prompting, VP)是一種新興的高效率微調方法,用於將預訓練的視覺模型適應於解決各種下游圖像分類任務。然而,至今尚未對VP的設計空間進行系統性的研究,也沒有明確的基準來評估其性能。為了彌補這些不足,我們提出了AutoVP,一個端到端的可擴展框架,用於自動化選擇VP的設計,同時提供12個下游圖像分類任務,作為全面的VP性能基準。我們的設計涵蓋了:1) 視覺提示的聯合優化;2) 預訓練模型的選擇,包括圖像分類器和結合文本與圖像的多模態分類器;以及3) 模型輸出映射策略,包括非參數化和可訓練的標籤映射。我們的實驗結果表明,AutoVP相比於目前已知的最佳VP方法有顯著優勢,準確率的提升最高可達6.7%;與線性探測(Linear probing, LP)相比,準確率最高提高了27.5%。AutoVP的貢獻有兩個面向:其可作為VP設計超參數選擇的高效工具,亦可作為一個全面的基準,可以合理地預期加速VP的開發。程式碼公開於 https://github.com/IBM/AutoVP。
Abstract (English):
Visual prompting (VP) is an emerging parameter-efficient fine-tuning approach for adapting pre-trained vision models to various downstream image-classification tasks. However, there has hitherto been little systematic study of the design space of VP and no clear benchmark for evaluating its performance. To bridge this gap, we propose AutoVP, an end-to-end expandable framework for automating VP design choices, along with 12 downstream image-classification tasks that can serve as a holistic VP-performance benchmark. Our design space covers 1) the joint optimization of the prompts; 2) the selection of pre-trained models, including image classifiers and text-image encoders; and 3) model output mapping strategies, including nonparametric and trainable label mapping. Our extensive experimental results show that AutoVP outperforms the best-known current VP methods by a substantial margin, achieving up to a 6.7% improvement in accuracy and a maximum gain of 27.5% over the linear-probing (LP) baseline. AutoVP thus makes a two-fold contribution: it serves both as an efficient tool for hyperparameter tuning of VP design choices and as a comprehensive benchmark that can reasonably be expected to accelerate VP's development. The source code is available at https://github.com/IBM/AutoVP.
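
To make the pipeline described in the abstract more concrete, the following is a minimal sketch of how padding-style visual prompting with a trainable output label mapping can be wired up in PyTorch. It assumes a frozen ImageNet-pretrained ResNet-18 backbone and a 10-class downstream task; the names PaddedVisualPrompt and FullyConnectedLabelMap, as well as the image and prompt sizes, are illustrative assumptions and are not taken from the AutoVP code base.

# Minimal, illustrative sketch (not the AutoVP implementation): a frame-shaped
# visual prompt is added around a resized input image, a frozen pre-trained
# classifier produces source logits, and a trainable linear layer maps those
# logits to the downstream classes.
import torch
import torch.nn as nn
import torch.nn.functional as F


class PaddedVisualPrompt(nn.Module):
    """Resize the input and surround it with a trainable frame (the prompt)."""

    def __init__(self, image_size: int = 224, inner_size: int = 160):
        super().__init__()
        self.inner_size = inner_size
        pad = image_size - inner_size
        self.pad = (pad // 2, pad - pad // 2)  # (left/top, right/bottom) widths
        self.prompt = nn.Parameter(torch.zeros(3, image_size, image_size))
        mask = torch.ones(1, image_size, image_size)
        mask[:, self.pad[0]:self.pad[0] + inner_size,
                self.pad[0]:self.pad[0] + inner_size] = 0  # 0 inside, 1 on frame
        self.register_buffer("mask", mask)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = F.interpolate(x, size=self.inner_size, mode="bilinear",
                          align_corners=False)                     # input scaling
        x = F.pad(x, (self.pad[0], self.pad[1], self.pad[0], self.pad[1]))
        return x + self.mask * self.prompt                         # apply prompt


class FullyConnectedLabelMap(nn.Module):
    """Trainable mapping from source-model logits to downstream labels."""

    def __init__(self, num_source_classes: int, num_target_classes: int):
        super().__init__()
        self.linear = nn.Linear(num_source_classes, num_target_classes)

    def forward(self, source_logits: torch.Tensor) -> torch.Tensor:
        return self.linear(source_logits)


# Only the prompt and the label map are trained; the backbone stays frozen.
backbone = torch.hub.load("pytorch/vision", "resnet18", weights="IMAGENET1K_V1").eval()
for p in backbone.parameters():
    p.requires_grad_(False)

prompt = PaddedVisualPrompt()
label_map = FullyConnectedLabelMap(num_source_classes=1000, num_target_classes=10)
optimizer = torch.optim.Adam(
    list(prompt.parameters()) + list(label_map.parameters()), lr=1e-3)

images = torch.rand(8, 3, 32, 32)          # stand-in for a CIFAR-10 batch
targets = torch.randint(0, 10, (8,))
logits = label_map(backbone(prompt(images)))
loss = F.cross_entropy(logits, targets)
loss.backward()
optimizer.step()

In practice one would also apply the backbone's input normalization to the prompted image, and the prompt size, backbone, and label-mapping strategy would not be fixed by hand: in AutoVP these are exactly the design dimensions that are tuned jointly, as outlined in Chapter 3 of the table of contents below.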
Abstract (Chinese) I
Abstract II
Acknowledgements III
Contents IV
List of Figures VII
List of Tables XI
List of Algorithms XIV
1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
2 Background and Related Work . . . . . . . . . . . . . . . . . . . . . 6
2.1 Background of Visual Prompts. . . . . . . . . . . . . . . . . . . . 6
2.2 The Design of Visual Prompts. . . . . . . . . . . . . . . . . . . . 7
2.3 Non-universal Visual Prompts. . . . . . . . . . . . . . . . . . . . 8
3 AutoVP Framework . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
3.1 Input Scaling. . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
3.2 Visual Prompt. . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
3.3 Pre-trained Classifier. . . . . . . . . . . . . . . . . . . . . . . 11
3.4 Output Label Mapping. . . . . . . . . . . . . . . . . . . . . . . . 11
3.5 End-to-end Hyper-parameter Tuning. . . . . . . . . . . . . . . . . . 13
4 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
4.1 Experimental Setup. . . . . . . . . . . . . . . . . . . . . . . . . 15
4.2 Experimental Results . . . . . . . . . . . . . . . . . . . . . . . . 16
4.2.1 Comparison of AutoVP and Prior Work. . . . . . . . . . . . . . . . 16
4.2.2 AutoVP with Source Model Selection. . . . . . . . . . . . . . . . 17
4.2.3 Data Scalability. . . . . . . . . . . . . . . . . . . . . . . . . 17
4.3 Ablation Studies of AutoVP . . . . . . . . . . . . . . . . . . . . . 18
4.3.1 Weight Initialization of FullyMap with CLIP. . . . . . . . . . . . 19
4.3.2 Impact of the Non-inclusion of Text Encoder in CLIP. . . . . . . . 20
4.3.3 The Impact of Visual Prompts. . . . . . . . . . . . . . . . . . . 20
4.3.4 Frequency Analysis of the Learned Visual Prompts. . . . . . . . . 21
5 Discussions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
5.1 Tuning Selection. . . . . . . . . . . . . . . . . . . . . . . . . . 22
5.2 AutoVP Robustness. . . . . . . . . . . . . . . . . . . . . . . . . . 22
5.3 Performance Evaluation on ID/OOD Downstream Tasks. . . . . . . . . . 23
6 The Likelihood Perspective For Visual Prompting . . . . . . . . . . . 25
6.1 The Role of Prompts in Models . . . . . . . . . . . . . . . . . . . 25
6.2 Log-Likelihood Ratio. . . . . . . . . . . . . . . . . . . . . . . . 26
6.3 LogME Evidence and Visual Prompting Evidence. . . . . . . . . . . . 27
6.4 Visual Prompts Approximation. . . . . . . . . . . . . . . . . . . . 28
6.5 The Effectiveness of LLR and Simulated Prompts . . . . . . . . . . . 30
6.6 The Sorting Results with Diverse Datasets . . . . . . . . . . . . . 31
7 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
8 Appendix . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
8.1 Implementation Details of AutoVP . . . . . . . . . . . . . . . . . . 36
8.1.1 Pre-trained Classifier Details . . . . . . . . . . . . . . . . . . 36
8.1.2 Output Label Mappings of AutoVP . . . . . . . . . . . . . . . . . 38
8.1.3 AutoVP Tuning Process . . . . . . . . . . . . . . . . . . . . . . 42
8.2 Datasets and Baselines . . . . . . . . . . . . . . . . . . . . . . . 43
8.2.1 The Twelve Downstream Datasets . . . . . . . . . . . . . . . . . . 43
8.2.2 Baselines Details . . . . . . . . . . . . . . . . . . . . . . . . 44
8.3 Additional Experiments with AutoVP . . . . . . . . . . . . . . . . . 45
8.3.1 Fixed Pre-trained Model vs. Auto Pre-trained Model Selection . . . 45
8.3.2 Data Scalability . . . . . . . . . . . . . . . . . . . . . . . . . 46
8.3.3 Dataset Analysis (ID/OOD vs. Accuracy Gain) . . . . . . . . . . . 48
8.4 Analysis of AutoVP Results . . . . . . . . . . . . . . . . . . . . . 50
8.4.1 Prompts in Frequency Domain . . . . . . . . . . . . . . . . . . . 50
8.4.2 Output Mapping Analysis . . . . . . . . . . . . . . . . . . . . . 51
8.4.3 Hyper-Parameter Tuning Preferences . . . . . . . . . . . . . . . . 52
8.5 Ablation Studies . . . . . . . . . . . . . . . . . . . . . . . . . . 53
8.5.1 The Impact of Text Encoder in CLIP . . . . . . . . . . . . . . . . 53
8.5.2 Visual Prompting in Segmentation and Detection Tasks . . . . . . . 54
8.5.3 Exploring Additional Tuning Axes . . . . . . . . . . . . . . . . . 56
8.5.4 Improved ILM-VP with Tuning Configuration . . . . . . . . . . . . 56
8.5.5 Comparison of AutoVP and BlackVIP . . . . . . . . . . . . . . . . 58
8.6 Performance and Resource Utilization . . . . . . . . . . . . . . . . 59
8.6.1 Comparison of AutoVP, Linear Probing, and Full Fine-Tuning . . . . 59
8.6.2 Computing Resources . . . . . . . . . . . . . . . . . . . . . . . 59
Bibliography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
(Full text not available for public access)