Real-Time Data Pipelines Made Easy with Structured Streaming in Apache Spark | Databricks
WANT TO EXPERIENCE A TALK LIKE THIS LIVE?
Barcelona: https://www.datacouncil.ai/barcelona
New York City: https://www.datacouncil.ai/new-york-city
San Francisco: https://www.datacouncil.ai/san-francisco
Singapore: https://www.datacouncil.ai/singapore
Download Slides: https://www.datacouncil.ai/talks/building-real-time-data-pipelines-made-easy-with-structured-streaming-in-apache-spark?utm_source=youtube&utm_medium=social&utm_campaign=%20-%20DEC-SF-18%20Slides%20Download
ABOUT THE TALK:
Structured Streaming is the next generation of distributed stream processing in Apache Spark. Developers can write a query in their language of choice (Scala/Java/Python/R) using powerful high-level APIs (DataFrames / Datasets / SQL) and apply that same query to both static datasets and streaming data. For streaming, Spark automatically creates an incremental execution plan that handles late, out-of-order data and ensures end-to-end exactly-once fault-tolerance guarantees.
In this practical session, I will walk through a concrete streaming ETL example where – in less than 10 lines – you can read raw, unstructured data from Kafka, transform it, and write it out as a structured table ready for batch and ad-hoc queries on up-to-the-last-minute data. I will also give a quick glimpse of advanced features like event-time-based aggregations, stream-stream joins, and arbitrary stateful operations.
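The Kafka-to-table ETL pattern the talk describes can be sketched roughly as follows. This is a minimal illustration, not the speaker's actual code: the broker address, topic name, JSON schema, and output paths are all hypothetical placeholders.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._

object StreamingEtl {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("StreamingETL").getOrCreate()

    // Hypothetical schema for the JSON payload carried in each Kafka record.
    val schema = new StructType()
      .add("device", StringType)
      .add("ts", TimestampType)
      .add("value", DoubleType)

    // Read raw bytes from a Kafka topic and parse the JSON value column.
    val parsed = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092")  // assumed broker
      .option("subscribe", "events")                        // assumed topic
      .load()
      .select(from_json(col("value").cast("string"), schema).as("data"))
      .select("data.*")

    // Continuously write the parsed stream out as a structured Parquet table,
    // ready for batch and ad-hoc queries.
    parsed.writeStream
      .format("parquet")
      .option("path", "/tmp/events_table")
      .option("checkpointLocation", "/tmp/events_checkpoint")
      .start()
      .awaitTermination()
  }
}
```

The core pipeline (read, parse, write) is the handful of chained calls in the middle, which is what the "less than 10 lines" claim refers to; the same DataFrame transformations would work unchanged on a static dataset read with `spark.read`.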
ABOUT THE SPEAKER:
Tathagata is a committer and PMC member of the Apache Spark project and a Software Engineer at Databricks. He was the lead developer of Spark Streaming and now focuses primarily on Structured Streaming. Previously, he was a graduate student researcher at UC Berkeley's AMPLab, where he conducted research on data-center frameworks and networks with Scott Shenker and Ion Stoica.
FOLLOW DATA COUNCIL:
Twitter: https://twitter.com/DataCouncilAI
LinkedIn: https://www.linkedin.com/company/datacouncil-ai
Facebook: https://www.facebook.com/datacouncilai
Video "Real-Time Data Pipelines Made Easy with Structured Streaming in Apache Spark | Databricks" from the Data Council channel
OTHER VIDEOS FROM THIS CHANNEL:
A Tale of Three Apache Spark APIs: RDDs, DataFrames, and Datasets - Jules Damji
Top 5 Mistakes When Writing Spark Applications
ETL Is Dead, Long Live Streams: real-time streams w/ Apache Kafka
Spark + Parquet In Depth: Spark Summit East talk by: Emily Curtin and Robbie Strickland
Databricks Delta: A Unified Management System for Real-time Big Data
What is Apache Kafka®? (A Confluent Lightboard by Tim Berglund)
Data Engineering Principles - Build frameworks not pipelines - Gatis Seja
Apache Kafka Explained (Comprehensive Overview)
Get Rid of Traditional ETL, Move to Spark! (Bas Geerdink)
Data Pipeline Frameworks: The Dream and the Reality | Beeswax
The Parquet Format and Performance Optimization Opportunities - Boudewijn Braams (Databricks)
Building Streaming Microservices with Apache Kafka - Tim Berglund
A Deeper Understanding of Spark Internals - Aaron Davidson (Databricks)
Announcing Delta Lake Open Source Project | Ali Ghodsi (Databricks), Michael Armbrust (Databricks)
Structuring Apache Spark 2.0: SQL, DataFrames, Datasets And Streaming - by Michael Armbrust
What is Spark, RDD, DataFrames, Spark Vs Hadoop? Spark Architecture, Lifecycle with simple Example
Spark Analytics on Cassandra Data
Apache Kafka with Spark Streaming | Kafka Spark Streaming Examples | Kafka Training | Edureka
Easy, Scalable, Fault Tolerant Stream Processing with Structured Streaming in Apache Spark