How to Performance-Tune Apache Spark Applications in Large Clusters
Omkar Joshi offers an overview of how performance challenges were addressed at Uber while rolling out Marmaray, its newly built, open-sourced flagship ingestion system, which ingests data from sources such as Kafka, MySQL, Cassandra, and Hadoop. The system has been running in production for over a year, with more ingestion pipelines onboarded on top of it. Omkar and his team made heavy use of jvm-profiler during their analysis to gain valuable insights. Built on the Spark framework, the system is designed to ingest billions of Kafka messages per topic, from thousands of topics, every 30 minutes; the pipeline handles data on the order of hundreds of terabytes, a scale at which every byte and every millisecond saved counts. Omkar details how to tackle such problems and shares insights into optimizations already running in production.
Some key highlights are:
- how to identify bottlenecks in your Spark applications, and when to cache (or not cache) your Spark DAG to avoid rereading input data
- how to effectively use accumulators to avoid unnecessary Spark actions
- how to inspect heap and non-heap memory usage across hundreds of executors
- how to change the layout of your data to save long-term storage cost
- how to effectively use serializers and compression to save network and disk traffic
- how to reduce the amortized cost of your application by multiplexing your jobs.
They applied a range of techniques to reduce the memory footprint, runtime, and on-disk usage of their running applications, achieving significant savings of roughly 10%–40% in each.
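The serializer and compression settings mentioned in the highlights are typically controlled through Spark configuration. The fragment below is a hedged sketch of commonly used settings (the specific values are examples, not the ones used at Uber):

```
# Illustrative Spark configuration (example values, not Uber's):
spark.serializer                 org.apache.spark.serializer.KryoSerializer
spark.kryo.registrationRequired  true    # fail fast on unregistered classes
spark.rdd.compress               true    # compress cached/serialized RDD partitions
spark.io.compression.codec       lz4     # codec for shuffle spills and broadcasts
spark.shuffle.compress           true
```

Kryo is generally faster and more compact than Java serialization, and compressing shuffle and cached data trades a little CPU for less network and disk traffic, which is usually a win at the scale described above.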
About:
Databricks provides a unified data analytics platform, powered by Apache Spark™, that accelerates innovation by unifying data science, engineering and business.
Read more here: https://databricks.com/product/unifie...
Connect with us:
Website: https://databricks.com
Facebook: https://www.facebook.com/databricksinc
Twitter: https://twitter.com/databricks
LinkedIn: https://www.linkedin.com/company/data...
Instagram: https://www.instagram.com/databricksinc/
Video "How to Performance-Tune Apache Spark Applications in Large Clusters" from the Databricks channel