"Exploring Wikipedia With Apache Spark" - Advanced Training by Sameer Farooqui (Databricks)
Live Big Data Training from Spark Summit 2016 in San Francisco.
"The real power and value proposition of Apache Spark is in building a unified use case that combines ETL, batch analytics, real time stream analysis, machine learning, graph processing and visualizations. In class we will explore various Wikipedia datasets while applying the ideal programming paradigm for each analysis. The class will comprise of about 50% lecture and 50% hands-on labs + demos." - Sameer
Class covers:
- Spark SQL and DataFrames
- Spark Streaming
- Machine Learning (NLP, k-means clustering, TF-IDF, PageRank, Shortest Path)
- GraphFrames
- Visualizations (Databricks, Matplotlib, Google Charts, D3.js)
- Advanced Performance Tuning and Debugging
- Spark UI
Data sets that we explore:
- Pageviews (March 2015) - 255 MB
- Clickstream (Feb 2015) - 1.2 GB
- Pagecounts (last hour) - ~550 MB
- English Wikipedia (Mar 2016) - 54 GB
- 6 Wikipedia Language Live Edit Streams (variable)
// About the Presenter //
Sameer Farooqui is a Technology Evangelist at Databricks, where he helps promote the adoption of Apache Spark. As a founding member of the training team, he has created and taught advanced Spark classes for private clients and at meetups and conferences worldwide.
Follow Sameer on:
Twitter: https://twitter.com/blueplastic
LinkedIn: https://www.linkedin.com/in/blueplastic