Working with Skewed Data: The Iterative Broadcast - Rob Keevil & Fokko Driesprong
"Skewed data is the enemy when joining tables using Spark. It shuffles a large proportion of the data onto a few overloaded nodes, bottlenecking Spark's parallelism and resulting in out-of-memory errors. The go-to answer is to use a broadcast join: leave the large, skewed dataset in place and transmit the smaller table to every machine in the cluster for joining. But what happens when your second table is too large to broadcast and does not fit into memory? Or, even worse, when a single key is bigger than the total size of your executor? First, we will give an introduction to the problem. Second, we will explain the current ways of fighting the problem, including why these solutions are limited. Finally, we will demonstrate a new technique, the iterative broadcast join, developed while processing ING Bank's global transaction data. This technique, implemented on top of the Spark SQL API, allows multiple large and highly skewed datasets to be joined successfully while retaining a high level of parallelism. This is not possible with existing Spark join types.
Session hashtag: #EUde11"
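The idea in the abstract can be illustrated with a minimal sketch in plain Python. This is not the speakers' Spark implementation. It only mimics the general shape of an iterative broadcast join under stated assumptions: the smaller table has unique keys, it is cut into a fixed number of slices, and each slice is small enough to broadcast. The names `split_into_passes`, `iterative_broadcast_join`, and `num_passes` are hypothetical and chosen for illustration.

```python
from typing import Dict, Hashable, List, Optional, Tuple

LargeRow = Tuple[Hashable, str]                    # (key, payload) on the large, skewed side
JoinedRow = Tuple[Hashable, str, Optional[str]]    # (key, payload, label or None)


def split_into_passes(small: Dict[Hashable, str], num_passes: int) -> List[Dict[Hashable, str]]:
    """Cut the smaller table into num_passes slices (round-robin by position)."""
    passes: List[Dict[Hashable, str]] = [dict() for _ in range(num_passes)]
    for i, (key, label) in enumerate(small.items()):
        passes[i % num_passes][key] = label
    return passes


def iterative_broadcast_join(
    large: List[LargeRow], small: Dict[Hashable, str], num_passes: int
) -> List[JoinedRow]:
    """Left-outer join `large` to `small`, one slice of `small` at a time."""
    # Start with every large-side row unmatched, as in a left outer join.
    result: List[JoinedRow] = [(key, payload, None) for key, payload in large]
    for chunk in split_into_passes(small, num_passes):
        # One pass: rows of the large side stay where they are; only the
        # current slice is consulted. Labels found in earlier passes are kept.
        result = [
            (key, payload, label if label is not None else chunk.get(key))
            for key, payload, label in result
        ]
    return result


if __name__ == "__main__":
    # Key "a" is heavily over-represented (skewed) on the large side.
    transactions = [("a", "t1"), ("a", "t2"), ("a", "t3"), ("b", "t4"), ("z", "t5")]
    accounts = {"a": "Alice", "b": "Bob", "c": "Carol"}
    for row in iterative_broadcast_join(transactions, accounts, num_passes=2):
        print(row)
```

In this sketch the large side is never regrouped by key, so a heavily repeated key does not pile up in one place. The cost is one full pass over the large side per slice of the smaller table.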
About: Databricks provides a unified data analytics platform, powered by Apache Spark™, that accelerates innovation by unifying data science, engineering and business.
Read more here: https://databricks.com/product/unified-data-analytics-platform
Connect with us:
Website: https://databricks.com
Facebook: https://www.facebook.com/databricksinc
Twitter: https://twitter.com/databricks
LinkedIn: https://www.linkedin.com/company/databricks
Instagram: https://www.instagram.com/databricksinc/