PySpark Tutorial Section 2: PySpark Data Pipeline using AWS: S3, Glue Crawler, Catalog & Athena
In Section 2 of our PySpark Tutorial Series, learn how to build a complete ETL data pipeline using PySpark on AWS Glue, from raw data landing in Amazon S3 through to querying the results with Amazon Athena.
This hands-on session walks through the entire serverless pipeline setup—ideal for Data Engineers and Big Data Developers looking to leverage AWS Glue for scalable ETL workloads.
✅ What You’ll Learn (illustrative code sketches for each step follow the list):
Uploading and organizing data in Amazon S3
Creating and assigning IAM Roles for Glue access
Setting up a Glue Database and Glue Crawler
Generating schema with the Glue Data Catalog
Writing PySpark ETL code using Glue’s DynamicFrame API
Transforming and renaming columns using DataFrame APIs
Writing output data back to S3 in CSV format
Querying results using Amazon Athena
🎓 A perfect end-to-end demonstration of a real-world ETL pipeline!
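A minimal sketch of the first step, assuming a hypothetical bucket named pyspark-tutorial-bucket and a hypothetical file sales.csv: the raw file goes under a dedicated raw/ prefix so a crawler later sees exactly one dataset.

```python
import boto3

# Hypothetical bucket and key layout -- substitute your own names.
s3 = boto3.client("s3")
s3.upload_file(
    Filename="sales.csv",
    Bucket="pyspark-tutorial-bucket",
    Key="raw/sales/sales.csv",  # one prefix per dataset keeps crawls clean
)
```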
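Next, the IAM role that Glue assumes. This sketch uses a hypothetical role name (glue-etl-role) and attaches two AWS-managed policies; in practice you would scope the S3 permissions to your specific bucket rather than using AmazonS3FullAccess.

```python
import json

import boto3

iam = boto3.client("iam")

# Trust policy that lets the AWS Glue service assume this role.
trust_policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Principal": {"Service": "glue.amazonaws.com"},
        "Action": "sts:AssumeRole",
    }],
}

# The role name is a hypothetical placeholder.
iam.create_role(
    RoleName="glue-etl-role",
    AssumeRolePolicyDocument=json.dumps(trust_policy),
)

# AWS-managed policies: Glue service permissions plus S3 access.
for arn in (
    "arn:aws:iam::aws:policy/service-role/AWSGlueServiceRole",
    "arn:aws:iam::aws:policy/AmazonS3FullAccess",
):
    iam.attach_role_policy(RoleName="glue-etl-role", PolicyArn=arn)
```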
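With the role in place, a sketch of the database and crawler setup. The crawler scans the raw prefix, infers the schema, and registers a table in the Glue Data Catalog; database, crawler, and role names here are hypothetical placeholders.

```python
import boto3

glue = boto3.client("glue")

# Create the Glue database that will hold the crawled table definitions.
glue.create_database(DatabaseInput={"Name": "sales_db"})

# Point the crawler at the raw prefix; running it populates the Data Catalog.
glue.create_crawler(
    Name="sales-raw-crawler",
    Role="glue-etl-role",  # role from the previous sketch
    DatabaseName="sales_db",
    Targets={"S3Targets": [{"Path": "s3://pyspark-tutorial-bucket/raw/sales/"}]},
)
glue.start_crawler(Name="sales-raw-crawler")
```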
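A sketch of the Glue job script itself: read the crawled table through the DynamicFrame API, rename a column with the Spark DataFrame API, and write CSV output back to S3. The database, table, column, and path names are assumptions, not values from the video.

```python
import sys

from awsglue.context import GlueContext
from awsglue.dynamicframe import DynamicFrame
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext())
spark = glue_context.spark_session
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Read the table the crawler registered in the Glue Data Catalog.
dyf = glue_context.create_dynamic_frame.from_catalog(
    database="sales_db", table_name="sales")

# Drop to a Spark DataFrame for column-level transforms, e.g. renaming.
df = dyf.toDF().withColumnRenamed("cust_id", "customer_id")

# Wrap the result back into a DynamicFrame and write CSV to S3.
out = DynamicFrame.fromDF(df, glue_context, "out")
glue_context.write_dynamic_frame.from_options(
    frame=out,
    connection_type="s3",
    connection_options={"path": "s3://pyspark-tutorial-bucket/processed/sales/"},
    format="csv",
    format_options={"writeHeader": True},
)
job.commit()
```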
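Finally, a sketch of querying the output with Athena from boto3. Athena can read the processed CSVs once they are registered as a table (for example, by pointing a second crawler at the processed/ prefix); the table name and result location below are hypothetical. Note that Athena requires an S3 output location for query results.

```python
import boto3

athena = boto3.client("athena")

resp = athena.start_query_execution(
    QueryString="SELECT customer_id, COUNT(*) AS orders "
                "FROM processed_sales GROUP BY customer_id LIMIT 10",
    QueryExecutionContext={"Database": "sales_db"},
    ResultConfiguration={
        "OutputLocation": "s3://pyspark-tutorial-bucket/athena-results/"
    },
)
print("Query execution id:", resp["QueryExecutionId"])
```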
Video from the channel Code for Earth 🌳