Scalable Data Ingestion Architecture Using Airflow and Spark | Komodo Health
Get the slides: https://www.datacouncil.ai/talks/scalable-data-ingestion-architecture-using-airflow-and-spark
ABOUT THE TALK:
This is an experience report on implementing and migrating to a scalable data ingestion architecture. The requirements were to process tens of terabytes of data coming from several sources, with refresh cadences ranging from daily to annual. The main challenge is that each provider has its own quirks in schemas and delivery processes. To handle this, we use Apache Airflow to organize the workflows and schedule their execution, and we have developed custom Airflow hooks and operators to handle similar tasks across different pipelines. We run on AWS, using Apache Spark to scale the data processing horizontally and Kubernetes for container management.
We will explain the reasons for this architecture and share the pros and cons we have observed when working with these technologies. We will also explain how this approach has simplified bringing in new data sources and considerably reduced maintenance and operational overhead, and describe the challenges we faced during the transition.
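The idea of handling similar tasks in different pipelines with shared components can be illustrated with a minimal sketch. This is not the speakers' code: ProviderConfig, normalize, and all column names are hypothetical, and the example uses plain Python rather than Airflow so that it runs on its own. In an Airflow deployment, logic like normalize would plausibly live inside a custom operator, with one operator instance per provider configuration.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class ProviderConfig:
    """Hypothetical per-provider settings (illustrative names only)."""
    name: str
    cadence: str  # refresh cadence, e.g. "daily" or "annual"
    # Maps a provider-specific column name to the common column name.
    column_map: Dict[str, str] = field(default_factory=dict)


def normalize(rows: List[Dict[str, Any]], config: ProviderConfig) -> List[Dict[str, Any]]:
    """Rename provider-specific columns to the common schema.

    Columns missing from the mapping are passed through unchanged.
    """
    return [
        {config.column_map.get(key, key): value for key, value in row.items()}
        for row in rows
    ]


# Two made-up providers that deliver the same field under different names.
provider_a = ProviderConfig("provider_a", "daily", {"pt_id": "patient_id"})
provider_b = ProviderConfig("provider_b", "annual", {"PatientID": "patient_id"})

rows_a = normalize([{"pt_id": 1, "state": "CA"}], provider_a)
rows_b = normalize([{"PatientID": 1, "state": "CA"}], provider_b)
```

Under this assumption, the same task code serves every pipeline and the provider quirks are confined to configuration, so adding a source means writing a new configuration rather than a new pipeline.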
ABOUT THE SPEAKERS:
Dr. Johannes Leppä is a Data Engineer building scalable solutions for ingesting complex data sets at Komodo Health. Johannes is interested in the design of distributed systems and intricacies in the interactions between different technologies. He claims not to be lazy, but gets most excited about automating his work. Prior to data engineering he conducted research in the field of aerosol physics at the California Institute of Technology, and holds a PhD in physics from the University of Helsinki. Johannes is passionate about metal: wielding it, forging it and, especially, listening to it.
ABOUT DATA COUNCIL:
Data Council (https://www.datacouncil.ai/) is a community and conference series that provides data professionals with the learning and networking opportunities they need to grow their careers. Make sure to subscribe to our channel for more videos, including DC_THURS, our series of live online interviews with leading data professionals from top open source projects and startups.
FOLLOW DATA COUNCIL:
Twitter: https://twitter.com/DataCouncilAI
LinkedIn: https://www.linkedin.com/company/datacouncil-ai
Facebook: https://www.facebook.com/datacouncilai
Eventbrite: https://www.eventbrite.com/o/data-council-30357384520