Author (English): Kung, Fan-Jie
Thesis title (English): Detection of the location of talkers via video and audio bimodal processing
Advisor (English): Liu, Yi-Wen
Committee members (English): Lu, Chung-Chin
Li, Pei-Chun
Kang, Shih-Chung
Keywords (English): TDOA, source detection, face detection, face recognition, audio, video
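In recent years, a growing number of studies have combined audio and video for sound source localization, which reduces the errors that arise when audio alone is used to estimate the source direction in noisy, reverberant environments. This thesis localizes a talker using two microphones and the webcam of a notebook computer. On the audio side, the angle of the sound source is estimated from the definition of a hyperbola. On the video side, faces are first detected with the face detection algorithm proposed by Viola and Jones, and then recognized with the method proposed by Turk and Pentland, which uses principal component analysis (PCA) to find the distinct eigenfaces of each person.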
Recently, much research has addressed sound source detection by combining audio and video methods. In noisy and reverberant environments, the combined audio-video method reduces the bias of source detection compared with using audio alone. In this thesis, we design a talker detection system that uses two microphones and a web camera. For audio, we use the definition of a hyperbolic surface to estimate the direction of sound sources relative to the microphones. For video, we use the Viola-Jones algorithm to detect faces. We then use the Turk-Pentland algorithm to find the eigenfaces by principal component analysis, and use the eigenfaces to recognize each face.
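As a rough illustration of the audio step, the sketch below estimates the time difference of arrival (TDOA) between two microphone signals by brute-force cross-correlation and converts it to a direction. It assumes the far-field approximation, in which the hyperbola is replaced by its asymptote so that the angle from broadside is arcsin(c·τ/d). This may differ from the exact formulation used in the thesis. The function names, the speed of sound (343 m/s), and the sign convention are assumptions made for this example.

```python
import math

SPEED_OF_SOUND = 343.0  # m/s; assumed room-temperature value


def estimate_delay_samples(left, right, max_lag):
    """Return the lag (in samples) that maximizes the cross-correlation.

    A positive lag means `right` is a delayed copy of `left`,
    i.e. the sound reached the `left` microphone first.
    """
    best_lag, best_score = 0, float("-inf")
    n = len(left)
    for lag in range(-max_lag, max_lag + 1):
        score = 0.0
        for i in range(n):
            j = i + lag
            if 0 <= j < n:
                score += left[i] * right[j]
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag


def delay_to_angle_deg(lag, sample_rate, mic_spacing):
    """Far-field angle from broadside, in degrees, for a given lag."""
    path_difference = SPEED_OF_SOUND * lag / sample_rate  # metres
    # Clamp so that measurement noise cannot push asin out of its domain.
    ratio = max(-1.0, min(1.0, path_difference / mic_spacing))
    return math.degrees(math.asin(ratio))
```

With a sampling rate of 16 kHz and a hypothetical microphone spacing of 0.2 m, a lag of 3 samples corresponds to a path difference of about 6.4 cm, or roughly 19 degrees off broadside.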
The location of a talking person is determined in two steps. First, we estimate the normal distance between the talker and the imaging plane of the camera from the size of the talker's face in the image. Then, we obtain a two-dimensional estimate of the talker's location by considering the angle of the talker relative to the camera (or to the center of the two microphones). Because video and audio information are used jointly, the system can identify the talker, and the availability of audio information makes face detection robust against rotations. In addition, when there are multiple talkers in the room, the number of sound sources can be estimated under the assumption that the sources are uncorrelated. This can be done either by counting the faces in the video or by calculating the cross-correlation function between the signals obtained by the two microphones.
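The two-step procedure can be sketched as follows. This is only an illustration under stated assumptions: it uses a pinhole camera model with a focal length expressed in pixels and an assumed average face width of 0.15 m, whereas the thesis may instead rely on an empirical calibration. It also treats the camera and the center of the two microphones as the same origin. All names and default values are hypothetical.

```python
import math


def face_distance_m(face_width_px, focal_length_px, real_face_width_m=0.15):
    """Normal distance to the imaging plane from the detected face width.

    Pinhole model: distance = focal length * real width / image width.
    """
    if face_width_px <= 0:
        raise ValueError("face_width_px must be positive")
    return focal_length_px * real_face_width_m / face_width_px


def talker_position(distance_m, angle_deg):
    """Two-dimensional (x, y) location of the talker.

    y is the normal distance from the camera plane; x is the lateral
    offset implied by the angle measured from the camera's optical axis.
    """
    x = distance_m * math.tan(math.radians(angle_deg))
    return x, distance_m
```

For example, a face that is 100 pixels wide seen through a camera with a 600-pixel focal length would be placed about 0.9 m from the imaging plane. Combined with an audio angle of 0 degrees, this puts the talker directly in front of the camera.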
Experiments were conducted, and the results showed that the bias in estimating the location of a single talker is less than 5 cm. Experiments on two-talker estimation were also conducted, and we demonstrated that, in principle, two microphones alone suffice to detect two sources as long as the sources are uncorrelated.
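Chapter 1 Introduction
1.1 Research Motivation
1.2 Research Methods
1.3 System Architecture
1.4 Chapter Overview
Chapter 2 Sound Source Localization
2.1 Conditions under Which the Sound Source Can Be Treated as a Plane Wave
2.3 Estimating the Source Direction Using the Definition of a Hyperbolic Surface
Chapter 3 Face Detection
3.2 Integral Image
3.3 Face Distance Estimation
Chapter 4 Face Recognition
4.2 Recognizing Face Images Using Eigenfaces
Chapter 5 Application Scenarios for Talker Detection Using Sound Source and Image Information
5.1 Calculating the Ratio between Movement in the Image and Movement of the Actual Object
5.2 Detecting Face Rotation Using the Sound Source Angle and the Face Image
Chapter 6 Experimental Setup and System Interface Workflow
6.1 Experimental Setup
6.2 System Interface Overview and Operating Procedure
Chapter 7 Experimental Results and Discussion
Experiment 1:
Analysis and comparison of the angle error and standard deviation of the system's detection for different types of sound sources.
Experiment 2:
The talker speaks from three different positions, and the two-dimensional coordinates estimated for each position, with the center of the two microphones as the origin, are observed.
Experiment 3:
Two loudspeakers play uncorrelated source signals, and the delay corresponding to the maximum cross-correlation coefficient is observed for the angles of different loudspeaker placements.
Chapter 8 Conclusions and Future Work
8.1 Conclusions
8.2 Future Work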

