Analytic Processing in Data Lakes: A Semantic Query-Driven Discovery Approach
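To address the challenge of integrating and discovering multidimensional data sources in data lakes, this work proposes an approach that combines semantic models with data-driven techniques: by mapping metadata to knowledge graph concepts, it enables reasoning-based discovery and ranking of the data sources relevant to a query.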
Abstract. Data integration and discovery are open issues in Data Lakes, which may store hundreds of data sources. The present paper addresses these issues for multidimensional data sources, that is, sources containing atomic or derived measures aggregated along a number of dimensions, typically derived from raw data for analytical and reporting purposes. Combining semantic models of metadata with existing data-driven techniques, the paper proposes an approach for discovering mappings between source metadata and concepts in a reference knowledge graph. These mappings enable reasoning-based techniques to discover, integrate, and rank the data sources relevant to a given analytical query. The efficiency and effectiveness of the approach are discussed by means of experiments on real-world scenarios.
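To make the idea of query-driven discovery over multidimensional sources more concrete, the following is a minimal sketch and not the paper's actual method. It assumes that each source's metadata has already been mapped to a measure concept and to one level concept per dimension, and that dimension levels form simple linear hierarchies. All names (`HIERARCHIES`, `Source`, `rollup_cost`, `discover`) and the ranking criterion (number of roll-up steps) are hypothetical choices made for illustration.

```python
from dataclasses import dataclass

# Hypothetical dimension hierarchies, ordered from finest to coarsest level.
# In a real setting these would come from the reference knowledge graph.
HIERARCHIES = {
    "time": ["day", "month", "year"],
    "geo": ["city", "region", "country"],
}


@dataclass
class Source:
    name: str
    measure: str   # concept the source's measure is assumed to be mapped to
    levels: dict   # dimension name -> level concept the source is aggregated at


def rollup_cost(source, measure, query_levels):
    """Return the number of roll-up steps needed to answer the query from
    this source, or None if the source cannot answer it."""
    if source.measure != measure:
        return None
    cost = 0
    for dim, wanted in query_levels.items():
        have = source.levels.get(dim)
        if have is None:
            return None  # the source lacks a dimension the query needs
        hierarchy = HIERARCHIES[dim]
        steps = hierarchy.index(wanted) - hierarchy.index(have)
        if steps < 0:
            return None  # the source is coarser than the requested level
        cost += steps
    # Dimensions present in the source but absent from the query are ignored
    # here; aggregating them away is an additional simplifying assumption.
    return cost


def discover(sources, measure, query_levels):
    """Return (cost, source name) pairs for relevant sources, cheapest first."""
    scored = [(rollup_cost(s, measure, query_levels), s.name) for s in sources]
    return sorted((c, n) for c, n in scored if c is not None)


sources = [
    Source("sales_daily_city", "revenue", {"time": "day", "geo": "city"}),
    Source("sales_monthly_region", "revenue", {"time": "month", "geo": "region"}),
    Source("sales_yearly_country", "revenue", {"time": "year", "geo": "country"}),
    Source("costs_monthly_region", "cost", {"time": "month", "geo": "region"}),
]

# Query: revenue by month and region.
ranked = discover(sources, "revenue", {"time": "month", "geo": "region"})
print(ranked)
```

In this toy setting, a source is considered relevant only if it carries the requested measure at a granularity equal to or finer than the query, and sources needing fewer aggregation steps are ranked first. The yearly source is excluded because data cannot be disaggregated from year to month.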