PyCon.DE 2017 Tamara Mendt - Modern ETL-ing with Python and Airflow (and Spark)
Tamara Mendt (@TamaraMendt)
Tamara Mendt is a Data Engineer at HelloFresh, a meal kit delivery service headquartered in Berlin, and one of the top 3 tech startups to come out of Europe over the past 4 years. She devotes her time to building data pipelines and designing and maintaining the company's data infrastructure. Tamara has a computer engineering degree from her native country Venezuela, and an Erasmus Mundus Master's degree in IT for Business Intelligence. She wrote her Master's thesis at the TU Berlin with the research group where Apache Flink was born. At HelloFresh she continues to work with distributed technologies such as Apache Hadoop, Apache Kafka, and Apache Spark to cope with the scalability that the fast-growing company requires for dealing with its data.
Abstract
Tags: data data-science pipeline
The challenge of data integration is real. The sheer number of tools that exist to address this problem is proof that organizations struggle with it. This talk will discuss the inherent challenges of data integration and show how it can be tackled using Python, Apache Airflow, and Apache Spark.
Description
The way organizations analyze their data has evolved rapidly since the beginning of the millennium. The development of Hadoop, and the explosion in the variety of data that companies deal with nowadays, have fostered the emergence of the data lake concept and the shift from traditional ETL (extract, transform, load) to ELT (extract, load, transform). Yet the challenge of integrating data to obtain valuable insights remains, and despite the hype and attention focused on data, very few organizations have actually managed to become data driven. In this talk I will present insights into how we are currently building data pipelines using Python (as a replacement for high-level ETL software), Apache Airflow as a scheduler for our coded transformations, and Apache Spark to achieve scalability. Though building data pipelines is not the only element required to become data driven, it is a crucial one, and I hope to encourage the audience to use these open source technologies in their own ETL-ing (or ELT-ing) efforts.
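The ELT pattern the description contrasts with classic ETL can be sketched in a few lines of plain Python. This is a minimal, illustrative sketch only — all names and the in-memory "lake" dict are hypothetical, not from the talk; in a real setup each function would typically be an Airflow task and the transform step might be a Spark job.

```python
# Hypothetical sketch of ELT: raw records are loaded into the "lake"
# untouched first, and the transformation runs afterwards over the
# already-loaded data.

def extract():
    """Pull raw records from a source system (stubbed with sample data)."""
    return [
        {"order_id": 1, "amount": "19.99", "country": "DE"},
        {"order_id": 2, "amount": "24.50", "country": "DE"},
        {"order_id": 3, "amount": "12.00", "country": "NL"},
    ]

def load(records, lake):
    """Load raw records into the data lake as-is (load before transform)."""
    lake.setdefault("raw_orders", []).extend(records)

def transform(lake):
    """Transform already-loaded raw data into an aggregated view."""
    totals = {}
    for rec in lake.get("raw_orders", []):
        totals[rec["country"]] = totals.get(rec["country"], 0.0) + float(rec["amount"])
    lake["orders_by_country"] = totals
    return totals

lake = {}
load(extract(), lake)
totals = transform(lake)
```

The point of the ordering is that raw data lands in the lake before any transformation, so transformations can be re-run or rewritten later without re-extracting from the source.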
Recorded at PyCon.DE 2017 Karlsruhe: pycon.de
Video editing: Sebastian Neubauer & Andrei Dan
Tools: Blender, Avidemux & Sonic Pi
Video "PyCon.DE 2017 Tamara Mendt - Modern ETL-ing with Python and Airflow (and Spark)" from the PyConDE channel
Other videos on the channel
Scalable Data Ingestion Architecture Using Airflow and Spark | Komodo Health
Michał Karzyński - Developing elegant workflows in Python code with Apache Airflow
Functional Data Engineering - A Set of Best Practices | Lyft
ETL Is Dead, Long Live Streams: real-time streams w/ Apache Kafka
PyCon.DE 2017 Keynote Matthew Rocklin - Dask: Next Steps in Parallel Python
Keeping Spark on Track: Productionizing Spark for ETL: talk by Kyle Pistor and Miklos Christine
Laura Lorenz | How I learned to time travel, or, data pipelining and scheduling with Airflow
Docker Compose élesben: mire figyelj? - Gémes Tamás (Aggreg8.io)
Building a Recommender with Apache Spark & Elasticsearch
Data Pipeline Frameworks: The Dream and the Reality | Beeswax
Running Apache Airflow Reliably with Kubernetes | Astronomer
Basic difference between yield and return in Python
Large Scale Fuzzy Name Matching (Zhe Sun & Daniel van der Ende)
Airflow in Practice Stop Worrying Start Loving DAGs - Sarah Schattschneider
Matt Davis: A Practical Introduction to Airflow | PyData SF 2016
Dagster: A New Programming Model for Data Processing | Elementl
How Superset and Druid Power Real-Time Analytics at Airbnb | DataEngConf SF '17
Data Pipelines with Python and PostgreSQL
Building (Better) Data Pipelines with Apache Airflow
PySpark | Tutorial-9 | Incremental Data Load | Realtime Use Case | Bigdata Interview Questions