The columnar roadmap: Apache Parquet and Apache Arrow
The Hadoop ecosystem has standardized on columnar formats—Apache Parquet for on-disk storage and Apache Arrow for in-memory representation. Given this trend, deep integration with columnar formats is a key differentiator for big data technologies. Vertical integration from storage to execution greatly improves data access latency by pushing projections and filters down to the storage layer, reducing both the IO time spent reading from disk and the CPU time spent decompressing and decoding. Standards like Arrow and Parquet make this integration even more valuable, because data can now cross system boundaries without incurring costly translation. Cross-system programming using languages such as Spark, Python, or SQL can become as fast as native internal performance.
In this talk we’ll explain how Parquet is improving at the storage level, with metadata and statistics that will enable more optimizations in query engines in the future. We’ll detail how the new vectorized reader from Parquet to Arrow enables much faster reads by removing abstraction layers, and cover several planned improvements. We will also discuss how standard Arrow-based APIs pave the way to breaking down the silos of big data. One example is Arrow-based universal function libraries that can be written in any language (Java, Scala, C++, Python, R, ...) and used in any big data system (Spark, Impala, Presto, Drill). Another is a standard data access API with projection and predicate pushdowns, which will greatly simplify data access optimizations across the board.
Speaker
JULIEN LE DEM
Principal Engineer
WeWork
Video: The columnar roadmap: Apache Parquet and Apache Arrow, from the DataWorks Summit channel