Author (English): Chang Chien, Jack Y. C.
Thesis title (English): Engineering Document Summarization System Using Text Mining Methods
Advisor (English): Trappey, Amy J. C.
Oral defense committee (English): Wu, Jheng-Long; Fan, Chin-Yuan
Keywords (English): Key term extraction; Word embedding; Automatic summarization; Clustering
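A Request for Quotation (RFQ) is an engineering document commonly used in highly customized industries. In large transformer manufacturing, for example, a customer (such as a power plant or a large factory) that wants to purchase a transformer first issues an RFQ and invites capable transformer manufacturers to provide a design proposal, a cost estimate, and a quotation. RFQs are long and their content is complicated, and their requirements on design specifications, process technology, and standards are strict. A manufacturer that wants to bid must parse all the important information in the RFQ in a short time, must not overlook any specification requirement, and must quickly estimate the cost of every procurement item in order to produce the best quotation. This is a time-consuming task that occupies senior engineering staff. Taking this case as an example, this research develops an automatic summary generation process dedicated to engineering documents. Before summaries can be generated automatically, a large collection of prior RFQ documents must be gathered (here, 1,331 RFQ documents for 69 kV to 230 kV transformers), and key terms must be extracted (retrieval) and ranked by importance (ranking). This research uses the TF-IDF and N-gram algorithms for key term extraction and ranking. Furthermore, three training datasets, namely the 1,331 RFQ documents, 1.2 million Wikipedia texts, and 1,000 technical papers on transformers, are used to evaluate combinations of training sets, so that the unsupervised word embedding algorithm (Word2vec) can be trained to yield a better model that accurately represents the words of this domain's documents as vectors. The purpose of extracting key terms is to pre-filter the important sentences of an RFQ technical document automatically, that is, to select the sentences that contain key terms. The sentences containing key terms are then converted into vectors with the Word2vec model, and TextRank ranks them by importance based on their similarity; the highly ranked sentences that cover the various categories of key terms are automatically assembled into a concise, high-quality summary. Effectiveness is evaluated by the compression ratio and retention ratio of the summary. This research tests the Word2vec models produced from the different training sets on 40 RFQ documents and evaluates the effectiveness of the generated summaries. A summary table that is filled in automatically according to specifications is developed; the original text and the generated summary are each fed into the table, and the compression and retention ratios of the two are compared to find the best Word2vec model. In addition, transformers have different specifications, and each specification carries different requirements. This research therefore also takes the sentences in the 1,331 RFQ documents that describe requirements on the same specification parameter (such as voltage, impedance, or capacity), applies K-means clustering and key term extraction to them, and consolidates the similar specification requirements that customers commonly make so that they can be managed. The result is validated with 40 new RFQs. When engineers read a new RFQ, this reduces the chance of missing a specification requirement and increases the accuracy of product design, cost evaluation, and quotation.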
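The key term extraction and ranking step can be illustrated with a minimal Python sketch. The abstract does not say how TF-IDF and N-gram are combined, so scoring word n-grams by their summed TF-IDF over the corpus is only one plausible reading; the function names, the `n_max` and `top_k` parameters, and the toy sentences are all hypothetical and not taken from the thesis.

```python
import math
import re
from collections import Counter


def ngrams(text, n_max=2):
    """Return lowercase word n-grams of length 1..n_max."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return [" ".join(words[i:i + n])
            for n in range(1, n_max + 1)
            for i in range(len(words) - n + 1)]


def tfidf_scores(docs, n_max=2):
    """Sum each n-gram's TF-IDF over all documents (assumed scoring rule)."""
    doc_terms = [Counter(ngrams(d, n_max)) for d in docs]
    # Document frequency: number of documents containing each n-gram.
    doc_freq = Counter(t for terms in doc_terms for t in terms)
    scores = Counter()
    for terms in doc_terms:
        total = sum(terms.values())
        for term, count in terms.items():
            idf = math.log(len(docs) / doc_freq[term])
            scores[term] += (count / total) * idf
    return scores


def rank_key_terms(docs, n_max=2, top_k=5):
    """Return the top_k n-grams by summed TF-IDF score."""
    return [t for t, _ in tfidf_scores(docs, n_max).most_common(top_k)]


# Toy stand-ins for RFQ documents (hypothetical text).
docs = [
    "The transformer rated voltage is 69 kV",
    "The transformer impedance shall be 10 percent",
    "The transformer cooling is ONAN",
]
print(rank_key_terms(docs))
```

An n-gram that occurs in every document, such as "transformer" in the toy corpus, gets an IDF of zero and drops out of the ranking, which is the behaviour one would want when looking for terms that distinguish specifications.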
A Request for Quotation (RFQ) is an engineering document often used in highly customized industries such as large transformer manufacturing. RFQs are long and complicated, which makes it hard to extract key information in a short time. This research takes the RFQ as a case to develop a novel summarization approach. First, 1,331 historical RFQ cases were collected for two purposes: to acquire key terms with TF-IDF and N-gram, which are used to filter the content before summarization, and to train a Word2vec model. When a new RFQ is received, its content is decomposed into sentences, and these sentences are filtered by the key terms. The trained Word2vec model then vectorizes the filtered sentences, and TextRank, an extractive summarization technique, is applied to determine the importance of each sentence. The sentences with higher importance are selected as the summary. To test the effect, 40 new RFQ cases and different Word2vec models were used, and the summaries were evaluated by retention ratio. An auto-fill table that classifies sentences according to specification key terms was developed; the evaluation inserts the original RFQ and the generated summary into that table and compares the two results. Because a transformer has different kinds of specifications, and each specification has various requirements, this research collected the sentences containing the same specification key terms from the 1,331 RFQ cases, vectorized them with the Word2vec model, and applied K-means to the sentences of each specification. After clustering, key terms were extracted from each cluster. Another 40 RFQ cases were then used: the sentences under the same specifications were extracted and classified into the clusters, so that the requirements of each specification could be obtained.
With this approach, when engineers read an RFQ document, they can check whether every requirement is met, which can increase the accuracy of product design, cost evaluation, and quotation.
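The filter-then-rank flow described above can be sketched in Python as follows. This is an illustration under stated assumptions, not the thesis implementation: plain bag-of-words count vectors stand in for the trained Word2vec sentence vectors, TextRank is written as a damped power iteration over a cosine-similarity graph, and every name (`filter_by_key_terms`, `textrank`, `summarize`, `top_k`) and every sample sentence is hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(sentence):
    return re.findall(r"[a-z0-9]+", sentence.lower())


def filter_by_key_terms(sentences, key_terms):
    """Keep only the sentences that mention at least one key term."""
    return [s for s in sentences
            if any(term.lower() in s.lower() for term in key_terms)]


def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[w] * v[w] for w in u if w in v)
    norm = (math.sqrt(sum(x * x for x in u.values()))
            * math.sqrt(sum(x * x for x in v.values())))
    return dot / norm if norm else 0.0


def textrank(sentences, damping=0.85, iterations=50):
    """Score sentences by PageRank-style iteration on a similarity graph."""
    # Assumption: bag-of-words vectors replace Word2vec sentence vectors.
    vectors = [Counter(tokenize(s)) for s in sentences]
    n = len(sentences)
    sim = [[cosine(vectors[i], vectors[j]) if i != j else 0.0
            for j in range(n)] for i in range(n)]
    out_weight = [sum(row) for row in sim]
    scores = [1.0 / n] * n
    for _ in range(iterations):
        scores = [(1 - damping) / n + damping * sum(
                      sim[j][i] / out_weight[j] * scores[j]
                      for j in range(n) if out_weight[j] > 0)
                  for i in range(n)]
    return scores


def summarize(sentences, key_terms, top_k=2):
    """Filter by key terms, rank with TextRank, keep top_k in source order."""
    kept = filter_by_key_terms(sentences, key_terms)
    if not kept:
        return []
    scores = textrank(kept)
    top = sorted(range(len(kept)), key=lambda i: scores[i], reverse=True)[:top_k]
    return [kept[i] for i in sorted(top)]


# Toy stand-ins for RFQ sentences (hypothetical text).
sentences = [
    "The rated voltage shall be 161 kV.",
    "Bidders must submit the form by Friday.",
    "The impedance voltage shall be 10 percent.",
    "The rated voltage of the tertiary winding shall be 11 kV.",
]
print(summarize(sentences, ["voltage", "impedance"]))
```

Returning the selected sentences in their original order, rather than in score order, keeps the summary readable as a shortened version of the source document; whether the thesis does the same is not stated in the abstract.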
