Загрузка страницы

Maps and Meaning Graph based Entity Resolution in Apache Spark & GraphX - Hendrik Frentrup

Data integration and the automation of tedious data extraction tasks are the fundamental building blocks of a data-driven organizations and are overlooked or underestimated at times. Aside from data extraction, scraping and ETL tasks, entity resolution is a crucial step in successfully combining datasets. The combination of data sources is usually what provides richness in features and variance. Building an expertise in entity resolution is important for data engineerings to successfully combine data sources. Graph-based entity resolution algorithms have emerged as a highly effective approach. This talk will present the implementation of a graph-bases entity resolution technique in GraphX and in GraphFrames respectively. Working from concept, through how to implement the algorithm in Spark, the technique will also be illustrated by walking through a practical example. The technique will exhibit an example where efficacy can be achieved based on simple heuristics, and at the same time map a path to a machine-learning assisted entity resolution engine with a powerful knowledge graph at its center. The role of ML can be found upstream in building the graph, for example by using classification algorithms in determining the link strength between nodes based on data, or downstream where dimensionality reduction can play a role in clustering and reduce the computational load in the resolution stage. The audience will leave with a clear picture of a scalable data pipeline performing entity resolution effectively and a thorough understanding of the internal mechanism, ready to apply it to their use cases.

About: Databricks provides a unified data analytics platform, powered by Apache Spark™, that accelerates innovation by unifying data science, engineering and business.
Read more here: https://databricks.com/product/unified-data-analytics-platform

Connect with us:
Website: https://databricks.com
Facebook: https://www.facebook.com/databricksinc
Twitter: https://twitter.com/databricks
LinkedIn: https://www.linkedin.com/company/databricks
Instagram: https://www.instagram.com/databricksinc/ Databricks is proud to announce that Gartner has named us a Leader in both the 2021 Magic Quadrant for Cloud Database Management Systems and the 2021 Magic Quadrant for Data Science and Machine Learning Platforms. Download the reports here. https://databricks.com/databricks-named-leader-by-gartner

Видео Maps and Meaning Graph based Entity Resolution in Apache Spark & GraphX - Hendrik Frentrup канала Databricks
Показать
Комментарии отсутствуют
Введите заголовок:

Введите адрес ссылки:

Введите адрес видео с YouTube:

Зарегистрируйтесь или войдите с
Информация о видео
22 октября 2019 г. 4:37:11
00:37:49
Яндекс.Метрика