Author (English): Dolor, Rosalie Jacob
Thesis Title (English): Learning to Think Fast and Slow
Advisor (English): Wu, Shan-Hung
Oral Defense Committee (English): Chen, Hwann-Tzong
Chen, Ming-Syan
Peng, Wen-Chih
Chien, Jen-Tzung
Keywords (English): Lifelong few-shot learning; Lifelong learning; Few-shot learning; Memory-augmented neural network; Catastrophic forgetting; Dual-memory system
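Deep learning has achieved major breakthroughs in many fields. However, there is still some way to go before machines can learn as humans do, for example by learning new things efficiently in a lifelong setting while avoiding catastrophic forgetting, or by quickly acquiring new knowledge from scarce information. To address these problems, this thesis proposes the Fast-and-Slow Thinker model, based on theories from brain science, to give machines lifelong and few-shot learning abilities. According to neuroscience theory, two systems shape how the brain operates and makes decisions: a fast, intuitive responder and a slow but deliberate decision-maker. Accordingly, our Fast-and-Slow Thinker model also consists of two parts: (1) a fast memory-augmented neural network, called the Fast Thinker (FT), and (2) a slower adaptive neural network, called the Slow Thinker (ST), which is trained so that it can easily adapt to all tasks at test time. The two networks complement each other through feedback that the Slow Thinker sends to the Fast Thinker, so that the Fast Thinker learns to provide the representations the Slow Thinker needs. In addition, the Fast Thinker and the Slow Thinker are trained together in an alternating fashion so that their interaction and feedback are as effective as possible. Experiments on the permuted MNIST task and on more difficult sequential tasks built from the CIFAR-100 dataset show that the model outperforms the current state of the art. Even with a very small memory capacity, the Slow Thinker generalizes across all tasks and does not suffer from catastrophic forgetting. Overall, the Fast-and-Slow Thinker model not only adapts at runtime (runtime adaptation) more efficiently than other models but also, with very little memory, performs very well on lifelong learning tasks and reasonably well on few-shot learning.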
Deep learning has achieved breakthroughs in many specialized domains; however, it is still limited when it comes to learning like humans, i.e., learning effectively throughout a lifetime without catastrophic forgetting and learning rapidly even from little information. To help bridge this gap, we propose the Fast-and-Slow Thinker (FST) model for lifelong and few-shot learning, which is inspired by how the brain works. It is believed that, for consolidating memory and making life decisions, the brain relies on two systems: a quick, intuitive thinker and a slow, more effortful decision-maker. FST's key components are 1) a per-task memory-augmented network, referred to simply as the Fast Thinker (FT), and 2) a slow, adaptive network called the Slow Thinker (ST) that is trained to be easily fine-tuned at inference time. The two systems work together through a form of feedback from the Slow Thinker that guides the Fast Thinker in representation learning. Furthermore, FT and ST are trained jointly in alternation to encourage more interaction during learning. On the permuted MNIST task and on more difficult sequential tasks created from CIFAR-100, FST outperforms the state of the art in lifelong learning. Moreover, even with a very small memory (storing as few as 10 examples per task), ST generalizes across tasks and does not suffer from forgetting. Overall, FST achieves high accuracy on lifelong learning problems and decent performance on few-shot tasks, with lower memory consumption and more effective runtime adaptation than the baseline models.
Abstract (in Chinese)
Abstract
Acknowledgement
List of Tables
List of Figures
1 Introduction
2 Related Work
2.1 Few-shot Learning
2.1.1 Memory-augmented Neural Networks
2.1.2 Meta-learning
2.2 Lifelong Learning
2.3 Lifelong Few-shot Learning
3 Fast-and-Slow Thinker (FST)
3.1 Predicting
3.2 FST Framework
3.3 FST Optimization
3.4 Technical Details
3.4.1 Alternate training
3.4.2 Approximation of higher-order derivatives
3.4.3 Slow thinker’s hyperparameters
4 Experimental Evaluation
4.1 Lifelong (Continual) Learning
4.1.1 Permuted MNIST
4.1.2 CIFAR-100 Sequential Task
4.2 Few-shot Learning
4.3 More Insights on FST
4.3.1 Different Alternate Training Step Ratio (FT:ST)
4.3.2 Different Number of Runtime Adaptation Steps
4.3.3 Different Combination of the Number of k-Nearest Neighbors and Runtime Adaptation Steps that ST used
5 Conclusion and Future Work
5.1 Conclusion
5.2 Future work
References
