#bbuzz: Nishith Agarwal - Building large scale, transactional data lakes using Apache Hudi

More: https://berlinbuzzwords.de/session/building-large-scale-transactional-data-lakes-using-apache-hudi

With the proliferation of data in recent years, most business-critical decisions are heavily influenced by deep data analysis. As companies rely more on data to function, storing, managing and accessing data intelligently and efficiently is more important than ever.

As more business decisions are driven by data in real time, we require strong guarantees such as acceptable latencies, high data quality and system reliability. Moving from a full-reload model to a delta model of ingestion quickly became the primary way to ingest large amounts of data at scale. A number of such ingest patterns showcased how transaction support on these datasets could benefit use cases immensely.

Hudi, an Apache project, is attempting to introduce uniform data lake standards. Hudi is a storage abstraction library that uses Spark as an execution framework. In this talk, we will discuss how Hudi can provide ACID semantics to a data lake. We will discuss some of the basic primitives, such as upsert and delete, that are required to achieve acceptable ingestion latencies while providing high-quality data by enforcing schematization on datasets. We will also discuss more advanced primitives, such as restore, delta-pull, compaction and file sizing, that are required for reliability, for efficient storage management and for building incremental ETL pipelines.

We will dig deeper into Hudi’s metadata model, which allows for O(1) query planning, and show how it helps support time-travel queries to facilitate building feature stores for machine learning use cases. Apache Hudi builds on open-source file formats; we will discuss how to onboard your existing dataset to the Hudi format while keeping the same open-source formats, so you can start using all the features Hudi provides without making drastic changes to your data lake. We will talk about the challenges faced in productionizing large Spark-based Hudi jobs at scale at Uber and discuss how we addressed them.
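
To make the primitives named above more concrete, here is a minimal, purely illustrative sketch in plain Python of what upsert, delete, delta-pull (incremental pull) and time-travel reads mean on a keyed table with a commit log. It is not Hudi's API or storage layout: the `ToyTable` class, its method names and the integer commit times are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterable, List, Optional, Tuple

# Hypothetical toy model; NOT Hudi's API, file layout or metadata model.
Row = Dict[str, int]
Changes = Dict[str, Optional[Row]]  # None marks a deleted key


@dataclass
class ToyTable:
    # Ordered commit log: (commit_time, changes written by that commit).
    commits: List[Tuple[int, Changes]] = field(default_factory=list)

    def _commit(self, changes: Changes) -> int:
        # Assumption: commit times are simple increasing integers.
        commit_time = len(self.commits) + 1
        self.commits.append((commit_time, dict(changes)))
        return commit_time

    def upsert(self, rows: Dict[str, Row]) -> int:
        # Insert new keys and overwrite existing ones in a single commit.
        return self._commit(dict(rows))

    def delete(self, keys: Iterable[str]) -> int:
        # Record a delete marker for each key in a single commit.
        return self._commit({key: None for key in keys})

    def snapshot(self, as_of: Optional[int] = None) -> Dict[str, Row]:
        # Replay commits up to `as_of` (inclusive); None means the latest state.
        state: Dict[str, Row] = {}
        for commit_time, changes in self.commits:
            if as_of is not None and commit_time > as_of:
                break
            for key, row in changes.items():
                if row is None:
                    state.pop(key, None)
                else:
                    state[key] = row
        return state

    def incremental_pull(self, since: int) -> Changes:
        # Return only the keys changed after commit `since` (latest change wins).
        changed: Changes = {}
        for commit_time, changes in self.commits:
            if commit_time > since:
                changed.update(changes)
        return changed


table = ToyTable()
c1 = table.upsert({"a": {"fare": 10}, "b": {"fare": 20}})
c2 = table.upsert({"b": {"fare": 25}, "c": {"fare": 30}})  # updates b, inserts c
c3 = table.delete(["a"])

latest = table.snapshot()          # current view of the table
as_of_c1 = table.snapshot(c1)      # time-travel read as of the first commit
delta = table.incremental_pull(c1) # everything that changed after c1
```

In this toy model a downstream incremental ETL job would remember the last commit time it processed and call `incremental_pull` with it, instead of rescanning the whole table. A real system would persist and index this log rather than replay it in memory.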

Finally, we will make the case for the future, discussing various other primitives that will facilitate building rich and portable data applications.

Video "#bbuzz: Nishith Agarwal - Building large scale, transactional data lakes using Apache Hudi" from the Plain Schwarz channel
Video information
June 29, 2020, 20:39:28
00:53:47