Author (Chinese): 黃屏倫
Author (English): Huang, Ping-Lun
Title (Chinese): 迷宮導航:探索大型語言模型代理的侷限性
Title (English): Navigating the Maze: Exploring the Limitations of Large Language Model Agent
Advisor (Chinese): 張正尚
Advisor (English): Chang, Cheng-Shang
Committee Members (Chinese): 李端興、許志仲
Committee Members (English): Lee, Duan-Shin; Hsu, Chih-Chung
Degree: Master's
University: National Tsing Hua University
Department: Institute of Communications Engineering
Student ID: 111064530
Year of Publication (ROC calendar): 113 (2024)
Graduating Academic Year (ROC calendar): 112
Language: English
Number of Pages: 56
Keywords (Chinese): 大型語言模型 (large language models), 迷宮導航 (maze navigation), 任務規劃 (task planning)
Keywords (English): ChatGPT, Maze navigation, LLM, Task planning
Abstract (Chinese):
This study investigates the capabilities and limitations of large language models in maze navigation tasks. Through experiments in a simple navigation setting and in complex maze environments, we evaluate models such as GPT-3.5-turbo, GPT-4-turbo, and GPT-4o on spatial reasoning and task planning, revealing their strengths and weaknesses.
The study identifies the main challenges large language models face in maze navigation: memory limitations, unstable rule adherence, and hallucinations. Memory issues prevent the models from accurately recalling previous states, leading to incorrect decisions. Unstable rule adherence causes the models to sometimes confuse rows and columns, or to produce basic yet incorrect paths and arithmetic errors when approaching the goal. Hallucinations lead the models to generate instructions that appear reasonable but are in fact incorrect.
The study also introduces a program-assisted navigation method that significantly improves performance in smaller mazes. By delegating specific tasks such as map memory and rule adherence to program tools, some of these limitations are mitigated. In larger mazes, however, the challenges persist, highlighting the need for further research and optimization.
An in-depth analysis compares how humans and large language models differ in maze navigation. Humans effectively combine intuitive judgment (System 1) with rational analysis (System 2), whereas large language models rely mainly on rapid association and lack deep analytical and planning capabilities.
The findings emphasize that building robust error detection and correction tools is essential for improving the reliability of large language models in real-world applications. Future research should focus on developing more reliable AI agent architectures to overcome these limitations and realize robust LLM-driven agents.

Abstract (English):
This study explores the capabilities and limitations of large language models (LLMs) in maze navigation tasks. Through experiments involving simple navigation tasks and complex maze environments, we evaluate the performance of models such as GPT-3.5-turbo, GPT-4-turbo, and GPT-4o in spatial reasoning and task planning, revealing their strengths and weaknesses.
The study identifies the main challenges of using LLMs for maze navigation: memory limitations, unstable rule adherence, and hallucination phenomena. Memory issues lead to incorrect decisions as the models cannot accurately recall previous states. Unstable rule adherence results in the models sometimes confusing rows and columns or making basic arithmetic errors. Hallucination problems cause LLMs to generate seemingly reasonable but actually incorrect paths when approaching the goal.
The research also introduces program-assisted navigation methods, significantly improving performance in smaller mazes. By utilizing program tools to handle specific tasks, such as map memory and rule adherence, some limitations are mitigated. However, challenges persist in larger mazes, highlighting the necessity for further research and optimization.
An in-depth analysis compares human and LLM approaches to maze navigation. Humans effectively combine intuitive judgments (System 1) with rational analysis (System 2), while LLMs primarily rely on rapid associative abilities, lacking deep analytical and planning capabilities.
The findings emphasize the importance of establishing robust error detection and correction tools to enhance the reliability of LLMs in real-world applications. Future research should focus on developing more reliable AI agent architectures to overcome these limitations and achieve robust LLM-driven agents.
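The program-assisted approach described in the abstract delegates bookkeeping to ordinary code so the model only has to propose the next step. The sketch below is a minimal illustration of that division of labor, not the thesis's actual implementation: it assumes the maze is a 2D grid of free cells (0) and walls (1), that positions are (row, col) tuples, and that a hypothetical ask_llm_for_move callback queries the model for a direction; all names here are illustrative.

# Minimal sketch of a code-assisted maze navigation loop (illustrative only).
# Assumes: maze is a 2D list with 0 = free cell and 1 = wall, positions are
# (row, col) tuples, and ask_llm_for_move is a caller-supplied function that
# queries an LLM and returns "up", "down", "left", or "right".

MOVES = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}

def is_valid_move(maze, pos, direction):
    # Program code, not the LLM, enforces the rules: stay on the grid
    # and never step into a wall.
    dr, dc = MOVES[direction]
    r, c = pos[0] + dr, pos[1] + dc
    return 0 <= r < len(maze) and 0 <= c < len(maze[0]) and maze[r][c] == 0

def navigate(maze, start, goal, ask_llm_for_move, max_steps=100):
    # The loop keeps the map memory (visited cells) and does the position
    # arithmetic; the LLM is only asked to propose the next direction.
    pos, visited = start, {start}
    for _ in range(max_steps):
        if pos == goal:
            return True, visited
        direction = ask_llm_for_move(maze, pos, goal, visited)
        if direction in MOVES and is_valid_move(maze, pos, direction):
            dr, dc = MOVES[direction]
            pos = (pos[0] + dr, pos[1] + dc)
            visited.add(pos)
        # Invalid or hallucinated proposals are simply discarded and the
        # model is queried again on the next iteration.
    return False, visited

Handing map memory, rule checking, and coordinate arithmetic to deterministic code addresses exactly the failure modes the abstract reports (forgotten states, row/column confusion, arithmetic slips), while the model remains responsible for the high-level choice of direction.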
Contents

List of Figures
List of Tables
1 Introduction
2 Related Work
3 Maze Navigation with LLMs Experiment
  3.1 Simple Navigation Task: A Virtual House Exploration
    3.1.1 Experiment
    3.1.2 Model and Implementation
    3.1.3 Evaluation
  3.2 LLMs for Maze Navigation
    3.2.1 Preliminary
    3.2.2 Environment Representation Methods
    3.2.3 Experiment Implementation
    3.2.4 Evaluation
    3.2.5 Causes of Failure
  3.3 Code-Assisted LLM for Maze Navigation
    3.3.1 Methods
    3.3.2 Evaluation
4 Discussion and Limitation
  4.1 Limitations of LLMs in Maze Navigation Tasks
  4.2 Differences Between Humans and LLMs in Maze Navigation Tasks
    4.2.1 Inner Dialogue, Scratchpad, and Backtrack
    4.2.2 Thinking, Fast and Slow
  4.3 Embedding Vectors and Visualization of Reasoning Pathways
    4.3.1 Experiment
5 Conclusion
6 Appendix
  6.1 Virtual House Exploration Experiment Details
  6.2 LLMs for Maze Navigation Experiment Details
  6.3 Code-Assisted LLM Agent for Maze Navigation Experiment Details