Brian Kent: Density Based Clustering in Python
PyData NYC 2015
Clustering data into similar groups is a fundamental task in data science. Probability density-based clustering has several advantages over popular parametric methods like K-Means, but practical usage of density-based methods has lagged for computational reasons. I will discuss recent algorithmic advances that are making density-based clustering practical for larger datasets.
Clustering data into similar groups is a fundamental task in data science applications such as exploratory data analysis, market segmentation, and outlier detection. Density-based clustering methods are based on the intuition that clusters are regions where many data points lie near each other, surrounded by regions without much data.
Density-based methods typically have several important advantages over popular model-based methods like K-Means: they do not require users to know the number of clusters in advance, they recover clusters with more flexible shapes, and they automatically detect outliers. On the other hand, density-based clustering tends to be more computationally expensive than parametric methods, so density-based methods have not seen the same level of adoption by data scientists.
Recent computational advances are changing this picture. I will talk about two density-based methods and how new Python implementations are making them more useful for larger datasets. DBSCAN is by far the most popular density-based clustering method. A new implementation in Dato's GraphLab Create machine learning package dramatically speeds up DBSCAN computation by taking advantage of GraphLab Create's multi-threaded architecture and using an algorithm based on the connected components of a similarity graph.
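The connected-components idea behind that implementation can be sketched in plain Python. This is not the GraphLab Create code, just a minimal illustration of the technique using scikit-learn and SciPy: build a sparse graph linking points within distance eps, mark points with at least min_pts neighbors as core points, and read clusters off as the connected components of the core-core subgraph.

```python
import numpy as np
from scipy.sparse.csgraph import connected_components
from sklearn.neighbors import radius_neighbors_graph

def dbscan_connected_components(X, eps=0.5, min_pts=5):
    """DBSCAN expressed as connected components of an eps-similarity graph.

    Core points (at least min_pts neighbors within eps, counting
    themselves) are linked when they lie within eps of each other; each
    connected component of that core-core graph is one cluster. Boundary
    points join the cluster of a nearby core point; all remaining points
    are noise, labeled -1.
    """
    # Sparse adjacency matrix: edge (i, j) iff dist(i, j) <= eps.
    adj = radius_neighbors_graph(X, radius=eps, mode="connectivity",
                                 include_self=True)
    degree = np.asarray(adj.sum(axis=1)).ravel()
    is_core = degree >= min_pts

    labels = np.full(X.shape[0], -1)
    if not is_core.any():
        return labels

    # Connected components of the subgraph restricted to core points.
    core_adj = adj[is_core][:, is_core]
    _, core_labels = connected_components(core_adj, directed=False)
    labels[is_core] = core_labels

    # Boundary points inherit the label of any adjacent core point.
    for i in np.flatnonzero(~is_core):
        neighbors = adj[i].indices
        core_neighbors = neighbors[is_core[neighbors]]
        if core_neighbors.size:
            labels[i] = labels[core_neighbors[0]]
    return labels
```

The multi-threaded speedup in the talk comes from parallelizing the neighbor search and component steps; the sketch above keeps only the algorithmic skeleton.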
The density Level Set Tree is a method first proposed theoretically by Chaudhuri and Dasgupta in 2010 as a way to represent a probability density function hierarchically, enabling users to use all density levels simultaneously, rather than choosing a specific level as with DBSCAN. The Python package DeBaCl implements a modification of this method, along with a tool for interactively visualizing the cluster hierarchy.
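The core notion is easy to sketch: the clusters at density level λ are the connected components of the upper level set {x : density(x) ≥ λ}, and the level set tree records how those components appear and split as λ rises. The following sketch computes a single slice of that hierarchy with scikit-learn and SciPy; it is not DeBaCl's API, and the bandwidth and radius parameters are illustrative choices.

```python
import numpy as np
from scipy.sparse.csgraph import connected_components
from sklearn.neighbors import KernelDensity, radius_neighbors_graph

def upper_level_set_clusters(X, level, bandwidth=0.3, radius=0.5):
    """Clusters of the upper level set {x : density(x) >= level}.

    A level set tree records how these clusters emerge and split as
    `level` increases; this function computes one slice of that
    hierarchy. Points whose estimated density falls below the level
    are labeled -1 (noise at this level).
    """
    # Kernel density estimate evaluated at each sample point.
    kde = KernelDensity(bandwidth=bandwidth).fit(X)
    density = np.exp(kde.score_samples(X))

    keep = density >= level
    labels = np.full(X.shape[0], -1)
    if not keep.any():
        return labels

    # Connect surviving points within `radius` of each other; each
    # connected component is one cluster at this density level.
    adj = radius_neighbors_graph(X[keep], radius=radius, mode="connectivity")
    _, comp = connected_components(adj, directed=False)
    labels[keep] = comp
    return labels
```

Sweeping `level` from low to high and tracking how the components split is what builds the full tree, which is the structure DeBaCl visualizes interactively.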
Slides available here: https://speakerdeck.com/papayawarrior/density-based-clustering-in-python
Notebooks: http://nbviewer.ipython.org/github/papayawarrior/public_talks/blob/master/pydata_nyc_dbscan.ipynb
http://nbviewer.ipython.org/github/papayawarrior/public_talks/blob/master/pydata_nyc_DeBaCl.ipynb