The Parquet Format and Performance Optimization Opportunities - Boudewijn Braams (Databricks)
The Parquet format is one of the most widely used columnar storage formats in the Spark ecosystem. Because I/O is expensive and the storage layer is the entry point for any query execution, understanding the intricacies of your storage format is important for optimizing your workloads. As an introduction, we will provide context around the format, covering the basics of structured data formats and the alternative physical data storage models (row-wise, columnar and hybrid). With this context in place, we will dive deeper into the specifics of the Parquet format: its representation on disk, its physical data organization (row groups, column chunks and pages) and its encoding schemes. Equipped with this background, we will discuss several performance optimization opportunities the format offers: dictionary encoding, page compression, predicate pushdown (min/max skipping), dictionary filtering and partitioning schemes. We will learn how to combat the evil that is 'many small files', and will discuss the open-source Delta Lake format in relation to this problem and to Parquet in general. This talk serves both as an approachable refresher on columnar storage and as a guide to leveraging the Parquet format to speed up analytical workloads in Spark, with tangible tips and tricks.
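The min/max skipping mentioned in the abstract can be illustrated with a small toy model. The sketch below is not Parquet's actual reader or API: `RowGroup` and `scan_equal` are hypothetical names, and the statistics are computed on the fly, whereas a real Parquet file stores them in its metadata. It only shows why the layout of the data affects how many row groups an equality filter can skip.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class RowGroup:
    """Toy stand-in for a row group holding a single integer column (hypothetical)."""
    values: List[int]

    @property
    def min_value(self) -> int:
        # A real file would store this statistic in metadata; here it is recomputed.
        return min(self.values)

    @property
    def max_value(self) -> int:
        return max(self.values)


def scan_equal(row_groups: List[RowGroup], target: int) -> Tuple[List[int], int]:
    """Return the matching values and the number of row groups actually read."""
    matches: List[int] = []
    groups_read = 0
    for rg in row_groups:
        # Skip the row group when the target cannot lie within [min, max].
        if target < rg.min_value or target > rg.max_value:
            continue
        groups_read += 1
        matches.extend(v for v in rg.values if v == target)
    return matches, groups_read


# Same nine values, laid out sorted versus shuffled across three row groups.
sorted_groups = [RowGroup([1, 2, 3]), RowGroup([4, 5, 6]), RowGroup([7, 8, 9])]
shuffled_groups = [RowGroup([1, 5, 9]), RowGroup([2, 6, 7]), RowGroup([3, 4, 8])]

sorted_result = scan_equal(sorted_groups, 5)      # one row group read
shuffled_result = scan_equal(shuffled_groups, 5)  # all three row groups read

print(sorted_result, shuffled_result)
```

In this toy setup the sorted layout lets the filter skip two of three row groups, while the shuffled layout gives every row group a wide min/max range and nothing can be skipped.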
About: Databricks provides a unified data analytics platform, powered by Apache Spark™, that accelerates innovation by unifying data science, engineering and business.
Read more here: https://databricks.com/product/unified-data-analytics-platform
Connect with us:
Website: https://databricks.com
Facebook: https://www.facebook.com/databricksinc
Twitter: https://twitter.com/databricks
LinkedIn: https://www.linkedin.com/company/databricks
Instagram: https://www.instagram.com/databricksinc/