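Author (Chinese): 荷西
Author (English): Jose Carlos Arriola Ortiz
Thesis title (English): Locality-Sensitive Hashing for Sentence Retrieval Applied to Example-Based Machine Translation
Thesis title (Chinese): 應用地點敏感散列於基於實例的機器翻譯進行實例檢索
Advisor: Soo, Von Wun (蘇豐文)
Committee members: Chen, Hwann-Tzong (陳煥宗); Chou, Jerry (周志遠)
Degree: Master's
Institution: National Tsing Hua University
Department: Institute of Information Systems and Applications
Student ID: 100065426
Year of publication (ROC calendar): 102 (2013)
Academic year of graduation: 101
Language: English
Number of pages: 43
Keywords (Chinese): 地點敏感散; 機器翻譯
Keywords (English): Locality-Sensitive Hashing; Data Indexing; Machine Translation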
As information technologies become increasingly necessary for analyzing large volumes of data, computational processes that emphasize the data rather than a set of predefined rules yield more scalable and flexible systems. Machine translation systems under the example-based machine translation (EBMT) paradigm are a good example: their output is obtained from the analysis of a large volume of data rather than from predefined grammatical translation rules. The EBMT paradigm is based on the analogy principle: two sentences annotated with a similar grammatical structure will preserve that grammatical similarity after being translated into a target language. Therefore, an arbitrary new sentence can be translated by looking up a previously translated sentence with a similar grammatical structure.
The goal of this research is to present the implementation details of the Locality-Sensitive Hashing (LSH) schema as an approach to building an indexing mechanism for retrieving sentences in the EBMT framework. A data set consisting of thousands of sentences was downloaded from the Open American National Corpus (OANC) project and parsed using the Stanford CoreNLP parser. The sentences were then transformed into vectors in Euclidean space, using part-of-speech (POS) tags as the mapping unit, to yield a data set that can be used to simulate an EBMT example database. The LSH schema is used as the indexing mechanism for querying an example database designed around the analogy principle. Finally, Structured String-Tree Correspondences were used to guide the translation process between a new input sentence and a previously translated sentence with a similar grammatical structure retrieved from the example database.
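The mapping from sentences to vectors can be illustrated with a minimal sketch. The abstract does not specify the exact mapping, so the count-per-tag encoding, the small tag inventory `TAGSET`, and the function names below are assumptions made for illustration only, not the thesis's actual transformation.

```python
from collections import Counter

# Hypothetical tag inventory: a small subset of Penn Treebank tags, for illustration only.
TAGSET = ["DT", "JJ", "NN", "NNS", "VB", "VBD", "VBZ", "IN", "PRP", "RB"]


def pos_count_vector(tags, tagset=TAGSET):
    """Map a POS-tag sequence to a vector with one count per tag in `tagset`.

    Assumed encoding: tags outside `tagset` are ignored.
    """
    counts = Counter(tags)
    return [float(counts[t]) for t in tagset]


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v)) ** 0.5


# Tag sequences as a POS tagger might produce them (hand-written here).
s1 = ["DT", "NN", "VBZ", "DT", "JJ", "NN"]  # e.g. "The cat chases a small mouse"
s2 = ["DT", "NN", "VBD", "DT", "JJ", "NN"]  # e.g. "The dog chased a big ball"
s3 = ["PRP", "VBD", "RB"]                   # e.g. "She ran quickly"

if __name__ == "__main__":
    v1, v2, v3 = (pos_count_vector(s) for s in (s1, s2, s3))
    print("d(s1, s2) =", euclidean(v1, v2))
    print("d(s1, s3) =", euclidean(v1, v3))
```

Under this assumed encoding, sentences with nearly identical tag sequences (s1 and s2) lie closer together than structurally different ones (s1 and s3), which is the property a nearest-neighbor index over the example database relies on.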
Section 2 introduces the theory behind the EBMT framework and the LSH schema, providing a basis for understanding the implementations explained in later sections. Its objective is to lay out the theoretical guidelines used to implement an EBMT example database based on the LSH schema.
Section 3 introduces the process of choosing the parameters of the LSH algorithm so that a given query can be searched efficiently. A sample query set, selected randomly from the data set, is used to analyze the average search cost and to estimate the best parameters for building an index structure able to answer subsequent queries.
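As a rough illustration of how k and L interact, the sketch below uses the standard LSH success bound: if a single hash function collides on a near neighbor with probability p1, then k concatenated functions collide with probability p1^k, and L independent tables find the neighbor with probability 1 - (1 - p1^k)^L. The function names and the values of p1, k, and the failure probability delta are assumptions for illustration; the thesis estimates its parameters empirically from a sample query set, which this sketch does not reproduce.

```python
import math


def success_probability(p1, k, L):
    """Probability that a near neighbor shares a bucket with the query in at least one of L tables."""
    return 1.0 - (1.0 - p1 ** k) ** L


def tables_needed(p1, k, delta):
    """Smallest L such that success_probability(p1, k, L) >= 1 - delta."""
    if not (0.0 < p1 < 1.0) or not (0.0 < delta < 1.0) or k < 1:
        raise ValueError("require 0 < p1 < 1, 0 < delta < 1 and k >= 1")
    return math.ceil(math.log(delta) / math.log(1.0 - p1 ** k))


if __name__ == "__main__":
    # Illustrative values only: p1 = 0.8, failure probability delta = 0.1.
    for k in (5, 10, 15):
        L = tables_needed(0.8, k, 0.1)
        print(f"k={k:2d}  L={L:3d}  success={success_probability(0.8, k, L):.3f}")
```

A larger k makes each bucket more selective, so fewer false candidates are examined, but more tables are then needed to keep the same success probability. This trade-off is what a parameter search over average query cost has to balance.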
Section 4 presents the implementation details of the LSH schema used to index the examples of a bilingual database in the EBMT framework. The theoretical background from Section 2 guides the construction of a set of hash functions, which serve as indexes for storing each data point of the data set in a set of hash tables.
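The idea of storing each data point in several hash tables can be sketched as follows. This is a minimal illustration of a common LSH construction for Euclidean space, in which each hash function has the form h(v) = floor((a · v + b) / w) with a drawn from a Gaussian distribution and b drawn uniformly from [0, w). Whether the thesis uses exactly this form is an assumption, and the class name `EuclideanLSH`, its methods, and the parameter values are hypothetical.

```python
import math
import random
from collections import defaultdict


class EuclideanLSH:
    """Toy LSH index: L hash tables, each keyed by k random-projection hashes."""

    def __init__(self, dim, k, L, w, seed=0):
        rng = random.Random(seed)
        self.w = w
        # For each table, draw k pairs (a, b): a is a Gaussian vector, b is uniform in [0, w).
        self.funcs = [
            [([rng.gauss(0.0, 1.0) for _ in range(dim)], rng.uniform(0.0, w))
             for _ in range(k)]
            for _ in range(L)
        ]
        self.tables = [defaultdict(list) for _ in range(L)]

    def _key(self, table_idx, v):
        """Bucket key of v in one table: the tuple of its k hash values."""
        return tuple(
            math.floor((sum(ai * vi for ai, vi in zip(a, v)) + b) / self.w)
            for a, b in self.funcs[table_idx]
        )

    def insert(self, item_id, v):
        """Store item_id in the matching bucket of every table."""
        for i, table in enumerate(self.tables):
            table[self._key(i, v)].append(item_id)

    def candidates(self, q):
        """Return ids that share a bucket with q in at least one table."""
        found = set()
        for i, table in enumerate(self.tables):
            found.update(table.get(self._key(i, q), ()))
        return found


if __name__ == "__main__":
    index = EuclideanLSH(dim=3, k=4, L=5, w=4.0, seed=1)
    index.insert("sentence-1", [2.0, 1.0, 2.0])
    index.insert("sentence-2", [0.0, 5.0, 9.0])
    print(index.candidates([2.0, 1.0, 2.0]))
```

The candidate set is only a shortlist: a full system would still compute exact distances between the query and each candidate before choosing the sentence to translate from.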
Section 5 explains how a structure tree is generated for a set of translated sentences and how the same method generates the structure tree for a new input sentence. The LSH schema provides the index structure for searching previously translated sentences with a similar grammatical structure, and Structured String-Tree Correspondences then represent the association between a pair of translated sentences in order to guide the translation of the input sentence.
2. THEORETICAL BACKGROUND
2.1 Example-Based Machine Translation
2.2 Example Database
2.3 Locality-Sensitive Hashing
2.4 LSH schema
2.5 Multiline projections
3. LSH PARAMETERS
3.1 Definition of an ℍ family of locality-sensitive functions
3.2 Calculation of k and L
4. LSH IMPLEMENTATION FOR SENTENCE INDEXING
4.1 The data
4.2 Sentence transformation to ℝ^d
4.3 LSH schema implementation
4.4 Hash implementation
4.5 Example
4.6 Sentence Similarity based in the LSH algorithm
4.7 System design
5. SIMPLE EXAMPLE-BASED MACHINE TRANSLATION IMPLEMENTATION
5.1 Translation example database founded in the analogy principle
5.2 Bilingual correspondences
5.3 Bilingual examples generation
5.4 Overall translation process
5.5 POS similarity based translation
5.6 POS grouping based translation
6. LIMITATIONS
7. CONCLUSIONS
8. REFERENCES