Maps and Meaning: Graph-based Entity Resolution in Apache Spark & GraphX - Hendrik Frentrup
Data integration and the automation of tedious data extraction tasks are fundamental building blocks of a data-driven organization, yet they are at times overlooked or underestimated. Aside from data extraction, scraping and ETL tasks, entity resolution is a crucial step in successfully combining datasets. The combination of data sources is usually what provides richness in features and variance, so building expertise in entity resolution is important for data engineers. Graph-based entity resolution algorithms have emerged as a highly effective approach. This talk presents the implementation of a graph-based entity resolution technique in GraphX and in GraphFrames respectively. Working from the concept through to the implementation of the algorithm in Spark, the talk illustrates the technique by walking through a practical example. The example shows how efficacy can be achieved with simple heuristics, and at the same time maps a path to a machine-learning-assisted entity resolution engine with a powerful knowledge graph at its center. ML can play a role upstream in building the graph, for example by using classification algorithms to determine the link strength between nodes based on data, or downstream, where dimensionality reduction can support clustering and reduce the computational load in the resolution stage. The audience will leave with a clear picture of a scalable data pipeline performing entity resolution effectively and a thorough understanding of the internal mechanism, ready to apply it to their use cases.
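As a rough illustration of the idea described above, the following plain-Python sketch links records with a simple heuristic and then groups them into entities. It is not the speaker's implementation: the sample records, the field names, and the helper functions `build_edges`, `connected_components` and `resolve` are all hypothetical, and it assumes that the resolution step amounts to finding connected components of the record graph. In Spark, that step would typically be delegated to the `connectedComponents` operation that GraphX and GraphFrames both provide.

```python
from collections import defaultdict

# Hypothetical sample records; field names and values are illustrative only.
RECORDS = [
    {"id": 1, "name": "Ann Lee", "email": "ann@example.com", "phone": "555-0100"},
    {"id": 2, "name": "A. Lee", "email": "ann@example.com", "phone": None},
    {"id": 3, "name": "Ann L.", "email": None, "phone": "555-0100"},
    {"id": 4, "name": "Bob Ray", "email": "bob@example.com", "phone": "555-0199"},
]


def build_edges(records, keys=("email", "phone")):
    """Simple heuristic (an assumption): link two records when they share
    a non-empty value for any of the given keys."""
    edges = []
    for key in keys:
        first_seen = {}  # value -> id of the first record carrying it
        for rec in records:
            value = rec.get(key)
            if not value:
                continue  # missing values never create a link
            if value in first_seen:
                edges.append((first_seen[value], rec["id"]))
            else:
                first_seen[value] = rec["id"]
    return edges


def connected_components(node_ids, edges):
    """Union-find over the record graph. Each node is labelled with the
    smallest id in its component."""
    parent = {n: n for n in node_ids}

    def find(n):
        while parent[n] != n:
            parent[n] = parent[parent[n]]  # path halving
            n = parent[n]
        return n

    for a, b in edges:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            # Keep the smaller id as the root so labels are deterministic.
            parent[max(root_a, root_b)] = min(root_a, root_b)
    return {n: find(n) for n in node_ids}


def resolve(records):
    """Group record ids into resolved entities, keyed by component label."""
    ids = [rec["id"] for rec in records]
    labels = connected_components(ids, build_edges(records))
    clusters = defaultdict(list)
    for record_id, label in labels.items():
        clusters[label].append(record_id)
    return {label: sorted(members) for label, members in clusters.items()}


if __name__ == "__main__":
    print(resolve(RECORDS))
```

Here records 1, 2 and 3 end up in one entity because they are chained together by a shared email and a shared phone number, while record 4 stays on its own. Replacing the hard yes/no rule in `build_edges` with a scored link is one place where the classification step mentioned in the abstract could plug in.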
About: Databricks provides a unified data analytics platform, powered by Apache Spark™, that accelerates innovation by unifying data science, engineering and business.
Read more here: https://databricks.com/product/unified-data-analytics-platform
Connect with us:
Website: https://databricks.com
Facebook: https://www.facebook.com/databricksinc
Twitter: https://twitter.com/databricks
LinkedIn: https://www.linkedin.com/company/databricks
Instagram: https://www.instagram.com/databricksinc/
Databricks is proud to announce that Gartner has named us a Leader in both the 2021 Magic Quadrant for Cloud Database Management Systems and the 2021 Magic Quadrant for Data Science and Machine Learning Platforms. Download the reports here: https://databricks.com/databricks-named-leader-by-gartner