Author: Huang, Ping-Lun
Thesis title: Navigating the Maze: Exploring the Limitations of Large Language Model Agent
Advisor: Chang, Cheng-Shang
Oral defense committee: Lee, Duan-Shin; Hsu, Chih-Chung
Keywords: ChatGPT; Maze navigation; LLM; Task planning
This study explores the capabilities and limitations of large language models (LLMs) in maze navigation tasks. Through experiments involving simple navigation tasks and complex maze environments, we evaluate the performance of models such as GPT-3.5-turbo, GPT-4-turbo, and GPT-4o in spatial reasoning and task planning, revealing their strengths and weaknesses.
The study identifies the main challenges of using LLMs for maze navigation: memory limitations, unstable rule adherence, and hallucination phenomena. Memory issues lead to incorrect decisions as the models cannot accurately recall previous states. Unstable rule adherence results in the models sometimes confusing rows and columns or making basic arithmetic errors. Hallucination problems cause LLMs to generate seemingly reasonable but actually incorrect paths when approaching the goal.
The research also introduces program-assisted navigation methods, which significantly improve performance in smaller mazes. Delegating specific tasks, such as map memory and rule adherence, to program tools mitigates some of these limitations. However, challenges persist in larger mazes, indicating the need for further research and optimization.
An in-depth analysis compares human and LLM approaches to maze navigation. Humans effectively combine intuitive judgments (System 1) with rational analysis (System 2), while LLMs primarily rely on rapid associative abilities, lacking deep analytical and planning capabilities.
The findings emphasize the importance of establishing robust error detection and correction tools to enhance the reliability of LLMs in real-world applications. Future research should focus on developing more reliable AI agent architectures to overcome these limitations and achieve robust LLM-driven agents.
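As an illustration of the program-assisted idea summarized above, the following minimal sketch shows how a helper program might hold the map memory and enforce movement rules on behalf of a model. It is not the thesis's implementation: the `MazeHelper` class, the `MOVES` table, the `'#'`/`'.'` grid encoding, and the (row, column) convention are all assumptions made for this example.

```python
from typing import Dict, List, Tuple

# Hypothetical move table: (row delta, column delta). Rows grow downward.
MOVES: Dict[str, Tuple[int, int]] = {
    "up": (-1, 0),
    "down": (1, 0),
    "left": (0, -1),
    "right": (0, 1),
}


class MazeHelper:
    """Illustrative tool that stores the map and checks each proposed move."""

    def __init__(self, grid: List[str], start: Tuple[int, int]) -> None:
        self.grid = grid        # assumed encoding: '#' is a wall, '.' is open
        self.position = start   # always (row, column), never (column, row)
        self.visited = [start]  # map memory kept outside the model

    def is_open(self, row: int, col: int) -> bool:
        """Return True if the cell is inside the grid and not a wall."""
        in_bounds = 0 <= row < len(self.grid) and 0 <= col < len(self.grid[0])
        return in_bounds and self.grid[row][col] != "#"

    def legal_moves(self) -> List[str]:
        """List the moves that keep the agent on an open cell."""
        row, col = self.position
        return [name for name, (dr, dc) in MOVES.items()
                if self.is_open(row + dr, col + dc)]

    def apply(self, move: str) -> bool:
        """Apply a move proposed by the model; reject it if it breaks the rules."""
        if move not in MOVES:
            return False
        dr, dc = MOVES[move]
        row, col = self.position[0] + dr, self.position[1] + dc
        if not self.is_open(row, col):
            return False
        self.position = (row, col)
        self.visited.append(self.position)
        return True
```

In a setup like this, the model would only choose among the names returned by `legal_moves()`, while the coordinate arithmetic and the record of visited cells stay in the program, which is one way the row/column confusion and recall errors described above could be contained.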
Contents
List of Figures
List of Tables
1 Introduction
2 Related Work
3 Maze Navigation with LLMs Experiment
3.1 Simple Navigation Task: A Virtual House Exploration
3.1.1 Experiment
3.1.2 Model and Implementation
3.1.3 Evaluation
3.2 LLMs for Maze Navigation
3.2.1 Preliminary
3.2.2 Environment Representation Methods
3.2.3 Experiment Implementation
3.2.4 Evaluation
3.2.5 Causes of Failure
3.3 Code-Assisted LLM for Maze Navigation
3.3.1 Methods
3.3.2 Evaluation
4 Discussion and Limitation
4.1 Limitations of LLMs in Maze Navigation Tasks
4.2 Differences Between Humans and LLMs in Maze Navigation Tasks
4.2.1 Inner Dialogue, Scratchpad, and Backtrack
4.2.2 Thinking, Fast and Slow
4.3 Embedding Vectors and Visualization of Reasoning Pathways
4.3.1 Experiment
5 Conclusion
6 Appendix
6.1 Virtual House Exploration Experiment Details
6.2 LLMs for Maze Navigation Experiment Details
6.3 Code-Assisted LLM Agent for Maze Navigation Experiment Details
