Using LLVM to accelerate processing of data in Apache Arrow
Most query engines follow an interpreter-based approach: a SQL query is translated into a tree of relational algebra operators, which is then executed through a conventional tuple-at-a-time iterator model. We will explore the overhead associated with this approach and show how the performance of query execution on columnar data can be improved using run-time code generation via LLVM.
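The tuple-at-a-time iterator model can be sketched as a tree of operators, each pulling one row per call from its child. This is a minimal plain-Python illustration of the model the abstract describes, not any particular engine's implementation; the operator names and toy schema are assumptions.

```python
# Tuple-at-a-time ("Volcano") iterator model: each operator exposes
# next(), which returns one tuple or None when exhausted. Every tuple
# crossing every operator boundary costs a virtual function call --
# the interpretation overhead the abstract refers to.

class Scan:
    """Leaf operator: yields one tuple per next() call."""
    def __init__(self, rows):
        self._it = iter(rows)

    def next(self):
        return next(self._it, None)

class Filter:
    """Evaluates a predicate on every tuple, one at a time."""
    def __init__(self, child, predicate):
        self.child, self.predicate = child, predicate

    def next(self):
        while (row := self.child.next()) is not None:
            if self.predicate(row):   # one call per tuple
                return row
        return None

class Project:
    """Applies an expression to every tuple, one at a time."""
    def __init__(self, child, expr):
        self.child, self.expr = child, expr

    def next(self):
        row = self.child.next()
        return None if row is None else self.expr(row)

# SELECT a + b FROM t WHERE a > 2, executed as an operator tree
plan = Project(Filter(Scan([(1, 10), (3, 30), (5, 50)]),
                      lambda r: r[0] > 2),
               lambda r: r[0] + r[1])

out = []
while (row := plan.next()) is not None:
    out.append(row)
print(out)  # [33, 55]
```

Note how the result is produced row by row: for a billion-row table, the per-tuple call overhead dominates, which is what both vectorization and code generation aim to eliminate.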
Generally speaking, the best case for query execution performance is hand-written code that does exactly what the query needs, for the exact data types and format involved. Vectorized query processing models amortize the cost of function calls across batches of values. However, research has shown that hand-written code for a given query plan can still outperform a vectorized query processing model.
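The amortization idea can be shown with the same filter-and-project query processed a batch at a time: one function call handles an entire column vector, and the filter produces a selection vector that downstream operators consume. This is an illustrative sketch (function names and batch contents are assumptions), not a specific engine's API.

```python
# Vectorized execution: operators work on whole column vectors, so
# per-call overhead is paid once per batch instead of once per tuple.

def filter_gt(col, threshold):
    """FILTER: return a selection vector of row indices passing v > threshold."""
    return [i for i, v in enumerate(col) if v > threshold]

def project_add(col_a, col_b, sel):
    """PROJECT: compute a + b only for the selected row indices."""
    return [col_a[i] + col_b[i] for i in sel]

a = [1, 3, 5, 2]
b = [10, 30, 50, 20]

sel = filter_gt(a, 2)          # selection vector: [1, 2]
print(project_add(a, b, sel))  # [33, 55]
```

Hand-written or generated code can go one step further and fuse both loops into a single pass with no intermediate selection vector, which is one reason compiled plans can beat generic vectorized operators.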
Over the last decade, the LLVM compiler framework has seen significant development, and the database community has realized its potential to boost query performance by implementing JIT query compilation frameworks. With LLVM, a SQL query is translated into a portable intermediate representation (IR), which is subsequently converted into machine code for the desired target architecture.
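For a sense of what that intermediate representation looks like, here is hand-written textual LLVM IR for a trivial addition expression (an illustration only, not IR emitted by any particular engine); LLVM's JIT lowers such IR to native machine code for the target CPU at run time.

```llvm
; Portable IR for a 64-bit integer addition. The same IR can be
; compiled to x86-64, ARM, or any other supported target.
define i64 @add(i64 %a, i64 %b) {
entry:
  %sum = add i64 %a, %b
  ret i64 %sum
}
```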
Dremio is built on top of Apache Arrow’s in-memory columnar vector format. The in-memory vectors map directly to LLVM's vector types, which makes our job easier when writing query processing algorithms in LLVM. We will talk about how Dremio implemented query processing logic in LLVM for operators such as FILTER and PROJECT, and discuss the performance benefits of LLVM-based vectorized query execution over other methods.
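To make the Arrow layout concrete: each column is a flat buffer of values plus a packed validity bitmap (one bit per row) marking NULLs. The sketch below shows FILTER and PROJECT over that layout in plain Python purely for illustration; the real point of the talk is that the same per-row logic is generated as LLVM IR instead of being interpreted.

```python
# FILTER and PROJECT over an Arrow-style columnar layout:
# values in a flat buffer, NULLs tracked in a packed validity bitmap.

def is_valid(bitmap, i):
    """Arrow packs validity one bit per row; bit set means non-NULL."""
    return (bitmap >> i) & 1

def filter_gt(values, validity, threshold):
    """FILTER: selection vector of valid rows whose value exceeds threshold."""
    return [i for i, v in enumerate(values)
            if is_valid(validity, i) and v > threshold]

def project_add(a, b, sel):
    """PROJECT: compute a + b for each selected row."""
    return [a[i] + b[i] for i in sel]

a = [1, 3, 5, 7]
a_validity = 0b1011            # row 2 (third value) is NULL
b = [10, 30, 50, 70]

sel = filter_gt(a, a_validity, 2)   # rows 1 and 3 pass; row 2 is NULL
print(project_add(a, b, sel))       # [33, 77]
```

Because the value buffers and bitmaps are contiguous, the generated LLVM code can process them with straight-line loops and SIMD instructions rather than chasing per-row object pointers.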
Video "Using LLVM to accelerate processing of data in Apache Arrow" from the DataWorks Summit channel.