
Brian Kent: Density Based Clustering in Python

PyData NYC 2015

Clustering data into similar groups is a fundamental task in data science. Probability density-based clustering has several advantages over popular parametric methods like K-Means, but practical usage of density-based methods has lagged for computational reasons. I will discuss recent algorithmic advances that are making density-based clustering practical for larger datasets.

Clustering data into similar groups is a fundamental task in data science applications such as exploratory data analysis, market segmentation, and outlier detection. Density-based clustering methods are based on the intuition that clusters are regions where many data points lie near each other, surrounded by regions without much data.

Density-based methods typically have several important advantages over popular model-based methods like K-Means: they do not require users to know the number of clusters in advance, they recover clusters with more flexible shapes, and they automatically detect outliers. On the other hand, density-based clustering tends to be more computationally expensive than parametric methods, so density-based methods have not seen the same level of adoption by data scientists.

Recent computational advances are changing this picture. I will talk about two density-based methods and how new Python implementations are making them more useful for larger datasets. DBSCAN is by far the most popular density-based clustering method. A new implementation in Dato's GraphLab Create machine learning package dramatically speeds up DBSCAN computation by taking advantage of GraphLab Create's multi-threaded architecture and using an algorithm based on the connected components of a similarity graph.
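To make the connected-components idea concrete, here is a minimal pure-Python sketch. It is not the GraphLab Create implementation, and the function name `dbscan_connected_components` and its parameters `eps` and `min_pts` are hypothetical, chosen for illustration. Core points are joined whenever they lie within `eps` of each other, each connected component of that similarity graph becomes a cluster, border points join the cluster of a neighboring core point, and everything else is labeled as noise.

```python
from math import dist


def dbscan_connected_components(points, eps, min_pts):
    """Label points with DBSCAN-style clusters; -1 marks noise.

    Illustrative sketch only: brute-force O(n^2) neighbor search,
    single-threaded, no spatial index.
    """
    n = len(points)
    # Neighbor lists within radius eps (each point counts as its own neighbor).
    neighbors = [
        [j for j in range(n) if dist(points[i], points[j]) <= eps]
        for i in range(n)
    ]
    is_core = [len(nb) >= min_pts for nb in neighbors]

    # Union-find over the similarity graph restricted to core points.
    parent = list(range(n))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(n):
        if is_core[i]:
            for j in neighbors[i]:
                if is_core[j]:
                    parent[find(i)] = find(j)

    # Each connected component of core points becomes one cluster.
    labels = [-1] * n
    cluster_ids = {}
    for i in range(n):
        if is_core[i]:
            labels[i] = cluster_ids.setdefault(find(i), len(cluster_ids))

    # Border points take the label of the first core point within eps.
    for i in range(n):
        if not is_core[i]:
            for j in neighbors[i]:
                if is_core[j]:
                    labels[i] = labels[j]
                    break
    return labels


# Two tight squares and one isolated point.
points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11),
          (50, 50)]
labels = dbscan_connected_components(points, eps=1.5, min_pts=3)
print(labels)  # [0, 0, 0, 0, 1, 1, 1, 1, -1]
```

The number of clusters is never specified: it falls out of the graph structure, and the isolated point is flagged as an outlier rather than forced into a group.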

The density Level Set Tree is a method first proposed theoretically by Chaudhuri and Dasgupta in 2010 as a way to represent a probability density function hierarchically, enabling users to work with all density levels simultaneously rather than choosing a single level, as DBSCAN requires. The Python package DeBaCl implements a modification of this method, along with a tool for interactively visualizing the cluster hierarchy.
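The idea of looking at several density levels at once can be sketched in a few lines of pure Python. This is a simplified illustration, not DeBaCl's API or the exact Chaudhuri and Dasgupta construction: the helper names (`knn_lists`, `components`, `level_set_clusters`) are hypothetical, the density estimate is the unnormalized inverse of the distance to the k-th nearest neighbor, and connectivity comes from a symmetrized k-nearest-neighbor graph.

```python
from math import dist


def knn_lists(points, k):
    """For each point, return its k nearest neighbors and the k-th neighbor distance."""
    neighbors, radii = [], []
    for i, p in enumerate(points):
        ranked = sorted((dist(p, q), j) for j, q in enumerate(points) if j != i)
        neighbors.append([j for _, j in ranked[:k]])
        radii.append(ranked[k - 1][0])
    return neighbors, radii


def components(nodes, edges):
    """Connected components of the graph induced on `nodes` (depth-first search)."""
    nodes = set(nodes)
    seen, result = set(), []
    for start in sorted(nodes):
        if start in seen:
            continue
        seen.add(start)
        stack, comp = [start], []
        while stack:
            u = stack.pop()
            comp.append(u)
            for v in edges[u]:
                if v in nodes and v not in seen:
                    seen.add(v)
                    stack.append(v)
        result.append(sorted(comp))
    return result


def level_set_clusters(points, k, levels):
    """Map each density level to the clusters of points at or above that level."""
    neighbors, radii = knn_lists(points, k)
    # Unnormalized k-NN density estimate; duplicate points get infinite density.
    density = [1.0 / r if r > 0 else float("inf") for r in radii]
    # Symmetrized k-NN graph.
    edges = [set() for _ in points]
    for i, nbrs in enumerate(neighbors):
        for j in nbrs:
            edges[i].add(j)
            edges[j].add(i)
    return {
        lam: components([i for i, d in enumerate(density) if d >= lam], edges)
        for lam in levels
    }


# Two dense 1-D groups joined by a single sparse bridge point at x = 12.
points = [(x,) for x in [0, 1, 2, 3, 4, 12, 20, 21, 22, 23, 24]]
tree = level_set_clusters(points, k=2, levels=[0.1, 0.4, 0.9])
for lam, clusters in tree.items():
    print(lam, clusters)
```

At the lowest level all eleven points form one cluster. Raising the level removes the low-density bridge and splits the data into two clusters, and raising it further shrinks each cluster to its dense core. Reading the levels together gives the parent-child structure that a level set tree records.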

Slides available here: https://speakerdeck.com/papayawarrior/density-based-clustering-in-python

Notebooks: http://nbviewer.ipython.org/github/papayawarrior/public_talks/blob/master/pydata_nyc_dbscan.ipynb
http://nbviewer.ipython.org/github/papayawarrior/public_talks/blob/master/pydata_nyc_DeBaCl.ipynb

Video: Brian Kent: Density Based Clustering in Python, from the PyData channel
Video information
December 4, 2015, 22:43:50
00:39:24