PyCon.DE 2017 Tamara Mendt - Modern ETL-ing with Python and Airflow (and Spark)
Tamara Mendt (@TamaraMendt)
Tamara Mendt is a Data Engineer at HelloFresh, a meal kit delivery service headquartered in Berlin, and one of the top 3 tech startups to come out of Europe over the past 4 years. She devotes her time to building data pipelines and designing and maintaining the company's data infrastructure. Tamara has a computer engineering degree from her native country Venezuela, and an Erasmus Mundus Master's degree in IT for Business Intelligence. She wrote her Master's thesis at the TU Berlin with the research group where Apache Flink was born. At HelloFresh she continues to work with distributed technologies such as Apache Hadoop, Apache Kafka, and Apache Spark to cope with the scalability that the fast-growing company requires for dealing with its data.
Abstract
Tags: data data-science pipeline
The challenge of data integration is real. The sheer number of tools that exist to address this problem is proof that organizations struggle with it. This talk will discuss the inherent challenges of data integration and show how it can be tackled using Python, Apache Airflow, and Apache Spark.
Description
The way organizations analyze their data has evolved rapidly since the beginning of the millennium. The development of Hadoop, and the explosion in the variety of data that companies deal with nowadays, have fostered the emergence of the data lake concept and the shift from traditional ETL (extract, transform, load) to ELT (extract, load, transform). Yet the challenge of integrating data to obtain valuable insights remains, and despite the hype and attention focused on data, very few organizations have actually managed to become data driven. In this talk I will present insights into how we are currently building data pipelines using Python (as a replacement for high-level ETL software), Apache Airflow as a scheduler for our coded transformations, and Apache Spark to achieve scalability. Though building data pipelines is not the only element required to become data driven, it is a crucial one, and I hope to encourage the audience to use these open source technologies in their own ETL-ing (or ELT-ing) efforts.
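The ELT pattern the description contrasts with classic ETL can be sketched in a few lines of plain Python. This is a minimal, illustrative sketch only — all names and the in-memory "lake" dict are hypothetical, not from the talk; in a real setup each function would typically be an Airflow task and the transform step might be a Spark job.

```python
# Hypothetical sketch of ELT: raw records are loaded into the "lake"
# untouched first, and the transformation runs afterwards over the
# already-loaded data.

def extract():
    """Pull raw records from a source system (stubbed with sample data)."""
    return [
        {"order_id": 1, "amount": "19.99", "country": "DE"},
        {"order_id": 2, "amount": "24.50", "country": "DE"},
        {"order_id": 3, "amount": "12.00", "country": "NL"},
    ]

def load(records, lake):
    """Load raw records into the data lake as-is (load before transform)."""
    lake.setdefault("raw_orders", []).extend(records)

def transform(lake):
    """Transform already-loaded raw data into an aggregated view."""
    totals = {}
    for rec in lake.get("raw_orders", []):
        totals[rec["country"]] = totals.get(rec["country"], 0.0) + float(rec["amount"])
    lake["orders_by_country"] = totals
    return totals

lake = {}
load(extract(), lake)
totals = transform(lake)
```

The point of the ordering is that raw data lands in the lake before any transformation, so transformations can be re-run or rewritten later without re-extracting from the source.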
Recorded at PyCon.DE 2017 Karlsruhe: pycon.de
Video editing: Sebastian Neubauer & Andrei Dan
Tools: Blender, Avidemux & Sonic Pi
Video "PyCon.DE 2017 Tamara Mendt - Modern ETL-ing with Python and Airflow (and Spark)" from the PyConDE channel
Other videos on the channel
Scalable Data Ingestion Architecture Using Airflow and Spark | Komodo Health
Michał Karzyński - Developing elegant workflows in Python code with Apache Airflow
Functional Data Engineering - A Set of Best Practices | Lyft
ETL Is Dead, Long Live Streams: real-time streams w/ Apache Kafka
PyCon.DE 2017 Keynote Matthew Rocklin - Dask: Next Steps in Parallel Python
Keeping Spark on Track: Productionizing Spark for ETL: talk by Kyle Pistor and Miklos Christine
Laura Lorenz | How I learned to time travel, or, data pipelining and scheduling with Airflow
Docker Compose élesben: mire figyelj? - Gémes Tamás (Aggreg8.io)
Building a Recommender with Apache Spark & Elasticsearch
Data Pipeline Frameworks: The Dream and the Reality | Beeswax
Running Apache Airflow Reliably with Kubernetes | Astronomer
Basic difference between yield and return in Python
Large Scale Fuzzy Name Matching (Zhe Sun & Daniel van der Ende)
Airflow in Practice Stop Worrying Start Loving DAGs - Sarah Schattschneider
Matt Davis: A Practical Introduction to Airflow | PyData SF 2016
Dagster: A New Programming Model for Data Processing | Elementl
How Superset and Druid Power Real-Time Analytics at Airbnb | DataEngConf SF '17
Data Pipelines with Python and PostgreSQL
Building (Better) Data Pipelines with Apache Airflow
PySpark | Tutorial-9 | Incremental Data Load | Realtime Use Case | Bigdata Interview Questions