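Author (Chinese): 傅嬿羽
Author (English): Fu, Yen-Yu
Thesis title (Chinese): DataSika: 適用於中小型團隊和企業的數據管道初探 - 以領域特定語言為例
Thesis title (English): DataSika: Bringing domain-specific language and ease-of-use to data pipelines for small and medium teams and enterprises
Advisor (Chinese): 雷松亞
Advisor (English): Ray, Soumya
Oral defense committee (Chinese): 林福仁; 劉敦仁
Oral defense committee (English): Lin, Fu-Ren; Liu, Duen-Ren
Degree: Master's
Institution: National Tsing Hua University
Department: Institute of Service Science
Student ID: 109078513
Year of publication (ROC calendar): 111
Academic year of graduation: 110
Language: English
Number of pages: 98
Keywords (Chinese): 數據管道; 領域特定語言; 數據倉庫技術(擷取、轉換與載入); 中小型企業; 資料管道; 數據收集; 數據轉換
Keywords (English): Small and medium enterprise (SME); Data pipeline; Data collection; Data transformation; Domain-Specific Language (DSL); Extract-Transform-Load (ETL)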
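Abstract (translated from the Chinese): As big data sweeps through every industry, data collection and processing have become the norm at large companies. Although collecting data should be a fairly automated task, small organizations such as small and medium enterprises (SMEs) and scientific research teams still rely on disparate digital data formats and improvised workflows to collect data from scattered sources. This manual, fragile, and non-standardized approach makes it hard for organizations to respond to market trends, collect data in real time, and extract insights from it. Based on an understanding of the needs of SMEs and small research teams, this study systematically designs and develops a data pipeline tool, DataSika, in the hope that it will help these smaller actors. The study focuses on the pain points that SMEs and researchers face during data collection and follows a design science approach to develop the tool and validate that it meets user needs. DataSika uses a Domain-Specific Language (DSL) to model data processes, allowing small organizations to collect data from a variety of common web resources (e.g., Web APIs, HTML scraping); beyond supporting basic extract, transform, and load tasks, it can be extended to meet future needs. This study hopes to broaden academic research on data pipelines and data engineering beyond optimization work on relational databases. Ultimately, we hope this research brings the following benefits: (1) a better understanding of the data needs of SMEs and researchers; (2) a tool that offers users suggested data workflow designs; and (3) a test-bed for exploring future data pipeline research.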
As Big Data sweeps across all industries, data gathering and processing have become the norm in big companies. While collecting data should be a fairly automated task, smaller operations such as small and medium enterprises (SMEs) and scientific research teams still collect data from fragmented data sources, using disparate digital formats and ad hoc workflows. This manual, fragile approach makes it difficult to gather data and extract insights in a timely and reproducible way. Building on our understanding of the needs of SMEs and small research teams, we set about systematically designing and developing a data pipeline that can be helpful for these smaller actors. Our research focused on the pain points that SMEs and researchers face during data collection and followed a design science approach to develop and validate our artifact against user needs. Our system adopts a Domain-Specific Language (DSL) syntax to model data processes, can collect data from the various web resources that small organizations typically use (e.g., Web APIs, HTML scraping), supports basic extract-transform-load (ETL) tasks, and is extensible to future needs. We hope our work broadens the research on data pipelines and data engineering beyond previous research that has focused only on relational-database querying dataflows. Ultimately, we hope that our research can yield several benefits: a better understanding of the data needs of SMEs and researchers; an artifact with suggested workflow design patterns for them to adopt; and a test-bed for future research on data pipelines.
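The idea of modelling a data process declaratively and running it as extract, transform, and load steps can be illustrated with a minimal sketch. This is not DataSika's actual syntax or code, which this record does not show. The spec keys (`stages`, `task`, `keep`, `min_price`, `columns`), the function names, and the sample listing data are all hypothetical. The spec is written as the Python dict that a YAML file might parse into, and an in-memory JSON string stands in for a Web API response so that the example runs without network access.

```python
import json
import sqlite3

# Hypothetical pipeline description (NOT DataSika's real DSL): the kind of
# dict a small YAML file could parse into.
PIPELINE = {
    "name": "listings",
    "stages": [
        {"task": "extract", "format": "json"},
        {"task": "transform", "keep": ["id", "city", "price"], "min_price": 50},
        {"task": "load", "table": "listings", "columns": ["id", "city", "price"]},
    ],
}

# Made-up sample payload standing in for a Web API response.
RAW = json.dumps([
    {"id": 1, "city": "London", "price": 120, "host": "a"},
    {"id": 2, "city": "Leeds", "price": 40, "host": "b"},
    {"id": 3, "city": "Bath", "price": 75, "host": "c"},
])


def extract(spec, payload):
    """Parse the raw payload into a list of row dicts."""
    if spec["format"] != "json":
        raise ValueError(f"unsupported format: {spec['format']!r}")
    return json.loads(payload)


def transform(spec, rows):
    """Drop rows below the price threshold and keep only the listed fields."""
    return [
        {key: row[key] for key in spec["keep"]}
        for row in rows
        if row["price"] >= spec["min_price"]
    ]


def load(spec, rows, conn):
    """Insert rows into a SQLite table and return how many were written."""
    table, columns = spec["table"], spec["columns"]
    for name in [table, *columns]:
        # Identifiers cannot be bound as SQL parameters, so validate them.
        if not name.isidentifier():
            raise ValueError(f"unsafe identifier: {name!r}")
    conn.execute(f"CREATE TABLE IF NOT EXISTS {table} ({', '.join(columns)})")
    placeholders = ", ".join("?" for _ in columns)
    conn.executemany(
        f"INSERT INTO {table} VALUES ({placeholders})",
        [tuple(row[col] for col in columns) for row in rows],
    )
    conn.commit()
    return len(rows)


def run_pipeline(pipeline, payload, conn):
    """Run the stages in order, feeding each stage's output to the next."""
    data = payload
    for stage in pipeline["stages"]:
        task = stage["task"]
        if task == "extract":
            data = extract(stage, data)
        elif task == "transform":
            data = transform(stage, data)
        elif task == "load":
            data = load(stage, data, conn)
        else:
            raise ValueError(f"unknown task: {task!r}")
    return data


if __name__ == "__main__":
    connection = sqlite3.connect(":memory:")
    written = run_pipeline(PIPELINE, RAW, connection)
    print(written, "rows loaded")
```

Keeping the description of the pipeline as plain data, separate from the functions that execute it, is what lets a non-programmer edit the workflow without touching code; that separation is the general motivation for a declarative DSL of the kind the abstract describes.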
Abstract (Chinese)
Abstract
List of Figures
List of Tables
Chapter 1 Introduction
Chapter 2 Data Pipelines for SMEs
2.1 Data Engineering for SMEs and Small Teams
2.2 Data Pipeline
2.3 Popular Industry Solutions
2.4 Gathering Stakeholder Input
2.5 Gaps and Proposed Solution
Chapter 3 DataSika: The Building Blocks of a Data Pipeline
3.1 Interfacing with a Pipeline
3.1.1 Web-Based and GUI Interfaces
3.1.2 Programmatic Interfaces – Domain-Specific Language (DSL)
3.2 Data Sources, Formats, and Querying Standards
3.3 Pipeline Architecture
3.3.1 Pipeline, Stage, and Tasks
3.3.2 Various Types of Tasks
3.3.3 Function Details
3.4 Data Output
Chapter 4 DataSika System Design
4.1 ETL Pipeline Design
4.2 SQLite Data Storage
4.3 YAML-Based DSL
4.4 Performance and Concurrency
4.5 Stability and Monitoring
4.6 Findings from Development Iteration
Chapter 5 Validation and Feedback
5.1 Use Cases
5.1.1 United Kingdom Airbnb Host Listings
5.1.2 GitHub Repository Analysis
5.2 Tests
5.2.1 Unit Tests
5.2.2 Integration Tests
5.2.3 Acceptance Tests
5.3 Stakeholder Interviews
Chapter 6 Discussion
6.1 Limitations
6.2 Future Work
Chapter 7 Conclusion
References