Author (English): Yu, Shih-Ze
Thesis Title (English): Spatial-Temporal Graph Convolutional Network through Facial Landmarks for DeepFake Detection
Advisor (English): Lin, Chia-Wen
Oral Defense Committee (English): Lin, Yen-Yu
Hsu, Chih-Chung
Chen, Jun-Cheng
Keywords (English): DeepFake Detection; Spatial-Temporal Graph Convolutional Network; Landmarks
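DeepFake is a forgery method that uses deep learning to transplant a source face onto a target face in videos or photos. Such forgeries have already caused a steady stream of problems in society, including copyright infringement, the spread of false information that triggers public panic, and the production of illegal pornographic videos. Effectively detecting DeepFake content has therefore become an urgent public problem. Recently, a novel method was proposed that, unlike earlier methods based on pixel features, takes facial landmarks as input, and its results demonstrate the performance and potential of facial landmarks for DeepFake detection. Inspired by this, and in order to uncover more clues hidden in the spatial and temporal domains between facial landmarks, we first use Delaunay triangulation to establish connections between facial landmarks and construct a spatial-temporal graph sequence of the face. To make the model more powerful and flexible, we modify the original spatial-temporal graph convolutional network by adding an attention mechanism and a learnable adjacency matrix, and we design a new weight partition strategy suited to the DeepFake detection task.
Finally, our method achieves state-of-the-art results among facial-landmark-based methods. On the Celeb-DF, DFD, and DFDC datasets, our method improves AUC by 29.2%, 33.5%, and 23%, respectively, over the previous method. Our method also retains most of the advantages of landmark-based methods, including lower training cost and higher robustness to video compression.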
DeepFake is a forgery technology that uses deep learning to transplant a source face onto a target face in a video. These forgeries have caused a wide range of serious problems in society, such as copyright infringement, disinformation that causes public panic, and the creation of illegal pornographic videos. DeepFake detection has therefore become an urgent public problem. Recently, a novel method was proposed that, in contrast to previous pixel-based methods, uses facial landmarks as input; its results demonstrate that facial landmarks have performance and potential in DeepFake detection comparable to pixel-based methods. Inspired by this work, and to explore further clues hidden in the spatial and temporal domains between facial landmarks, we use Delaunay triangulation to establish connections between facial landmarks and construct a spatial-temporal graph sequence of faces. To make the model more powerful and flexible, we modify the original Spatial-Temporal Graph Convolutional Network (ST-GCN) by adding an attention mechanism and a learnable adjacency matrix, and we design a new weight partition strategy suited to the DeepFake detection task.
Our method achieves state-of-the-art results among landmark-based methods. On the Celeb-DF, DFD, and DFDC datasets, it improves the AUC scores by 29.2%, 33.5%, and 23%, respectively, compared with the previous method. It also retains most of the advantages of landmark-based methods, including lower training costs and higher robustness to video compression.
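The graph-construction step mentioned in the abstract can be illustrated with a minimal sketch. It is not the thesis implementation: it assumes 2D landmark coordinates, uses a brute-force empty-circumcircle test in place of a library routine (in practice something like `scipy.spatial.Delaunay` would normally be used), and assumes the usual ST-GCN convention of linking the same landmark across consecutive frames. The function names and the toy points are hypothetical.

```python
from itertools import combinations


def _orient(a, b, c):
    """Twice the signed area of triangle abc (positive if counter-clockwise)."""
    return (b[0] - a[0]) * (c[1] - a[1]) - (b[1] - a[1]) * (c[0] - a[0])


def _in_circumcircle(a, b, c, p):
    """True if p lies strictly inside the circumcircle of CCW triangle abc."""
    (ax, ay), (bx, by), (cx, cy) = [(q[0] - p[0], q[1] - p[1]) for q in (a, b, c)]
    det = ((ax * ax + ay * ay) * (bx * cy - cx * by)
           - (bx * bx + by * by) * (ax * cy - cx * ay)
           + (cx * cx + cy * cy) * (ax * by - bx * ay))
    return det > 1e-12


def delaunay_edges(points):
    """Brute-force Delaunay edges: keep triangles whose circumcircle is empty.

    Roughly O(n^4), which is acceptable for a few dozen landmarks.
    """
    n = len(points)
    edges = set()
    for i, j, k in combinations(range(n), 3):
        a, b, c = points[i], points[j], points[k]
        o = _orient(a, b, c)
        if abs(o) < 1e-12:
            continue  # collinear points do not form a triangle
        if o < 0:
            b, c = c, b  # make the triangle counter-clockwise
        others = (points[m] for m in range(n) if m not in (i, j, k))
        if any(_in_circumcircle(a, b, c, p) for p in others):
            continue
        edges.update({(i, j), (i, k), (j, k)})  # i < j < k, so pairs are sorted
    return edges


def adjacency_matrix(num_nodes, edges):
    """Symmetric 0/1 adjacency matrix with self-loops (spatial graph, one frame)."""
    adj = [[1 if r == c else 0 for c in range(num_nodes)] for r in range(num_nodes)]
    for i, j in edges:
        adj[i][j] = adj[j][i] = 1
    return adj


def temporal_edges(num_frames, num_nodes):
    """Assumed ST-GCN-style links: the same landmark in consecutive frames."""
    return [((t, v), (t + 1, v))
            for t in range(num_frames - 1) for v in range(num_nodes)]


# Toy points standing in for facial landmarks (hypothetical data).
landmarks = [(0.0, 0.0), (2.0, 0.0), (2.0, 2.0), (0.0, 2.0), (1.0, 1.0)]
edges = delaunay_edges(landmarks)
adj = adjacency_matrix(len(landmarks), edges)
```

In this sketch the spatial adjacency is fixed by the triangulation; a learnable adjacency matrix and attention weights, as described in the abstract, would be additional trainable terms on top of it.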
Abstract (Chinese)
Abstract
1 Introduction
2 Related Work
2.1 DeepFake Detection
2.1.1 Pixel-Based Methods
2.1.2 Landmark Sequences for DeepFake Detection
2.2 Graph Convolutional Networks
3 Methodology
3.1 Pipeline Overview
3.2 Face Graph Construction
3.3 Graph Convolution
3.4 Implementation of ST-GCN
3.5 Partition Strategy for Face
3.6 Learnable Adjacency Matrix and Attention Mechanisms
3.7 Implement Details
4 Experiments
4.1 Datasets & Evaluation Metrics
4.2 General Evaluation
4.3 Robustness to Video Compression
4.4 Training Cost
4.5 Ablation Study
4.5.1 Learnable Adjacency Matrix and Attention Mechanisms
4.5.2 Different Link Ways & Partition Mode
4.5.3 Case Study
5 Conclusion
References
