Brian Kent: Density Based Clustering in Python

PyData NYC 2015

Clustering data into similar groups is a fundamental task in data science. Probability density-based clustering has several advantages over popular parametric methods like K-Means, but practical usage of density-based methods has lagged for computational reasons. I will discuss recent algorithmic advances that are making density-based clustering practical for larger datasets.

Clustering data into similar groups is a fundamental task in data science applications such as exploratory data analysis, market segmentation, and outlier detection. Density-based clustering methods are based on the intuition that clusters are regions where many data points lie near each other, surrounded by regions without much data.

Density-based methods typically have several important advantages over popular model-based methods like K-Means: they do not require users to know the number of clusters in advance, they recover clusters with more flexible shapes, and they automatically detect outliers. On the other hand, density-based clustering tends to be more computationally expensive than parametric methods, so density-based methods have not seen the same level of adoption by data scientists.

Recent computational advances are changing this picture. I will talk about two density-based methods and how new Python implementations are making them more useful for larger datasets. DBSCAN is by far the most popular density-based clustering method. A new implementation in Dato's GraphLab Create machine learning package dramatically speeds up DBSCAN computation by taking advantage of GraphLab Create's multi-threaded architecture and using an algorithm based on the connected components of a similarity graph.
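To make the connected-components view of DBSCAN concrete, here is a minimal pure-Python sketch. It is not GraphLab Create's implementation or API. The function name `dbscan_components`, the parameters `eps` and `min_pts`, and the toy data are assumptions chosen for illustration, and the brute-force neighbor search is only suitable for tiny inputs.

```python
from math import dist  # Python 3.8+

def dbscan_components(points, eps, min_pts):
    """Label points with cluster ids (0, 1, ...) or -1 for noise.

    Illustrative sketch: clusters are the connected components of the
    graph linking core points that lie within `eps` of each other;
    non-core points within `eps` of a core point join its cluster.
    """
    n = len(points)
    # Radius neighborhoods; each point counts as its own neighbor.
    neighbors = [
        [j for j in range(n) if dist(points[i], points[j]) <= eps]
        for i in range(n)
    ]
    is_core = [len(nb) >= min_pts for nb in neighbors]

    labels = [-1] * n
    cluster = 0
    for start in range(n):
        if not is_core[start] or labels[start] != -1:
            continue
        # Depth-first traversal of one connected component of core points.
        labels[start] = cluster
        stack = [start]
        while stack:
            i = stack.pop()
            for j in neighbors[i]:
                if labels[j] == -1:
                    labels[j] = cluster
                    # Only core points propagate the cluster further.
                    if is_core[j]:
                        stack.append(j)
        cluster += 1
    return labels

# Hypothetical toy data: two tight groups and one isolated point.
points = [
    (0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1),
    (5.0, 5.0), (5.1, 5.0), (5.0, 5.1), (5.1, 5.1),
    (10.0, 0.0),
]
labels = dbscan_components(points, eps=0.3, min_pts=3)
print(labels)
```

On this toy input the two groups come out as clusters 0 and 1 and the isolated point is labeled -1, without the number of clusters being specified in advance.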

The density level set tree is a method first proposed theoretically by Chaudhuri and Dasgupta in 2010 as a way to represent a probability density function hierarchically. It lets users work with all density levels simultaneously, rather than choosing a single level as with DBSCAN. The Python package DeBaCl implements a modification of this method, along with a tool for interactively visualizing the cluster hierarchy.
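The idea of looking at every density level at once can be sketched in a few lines. The following is a simplified illustration, not DeBaCl's API and not the exact Chaudhuri and Dasgupta construction. It assumes 1-D data with distinct values, a k-nearest-neighbor density estimate, and a k-nearest-neighbor graph; the function name `knn_level_set_clusters` and the sample data are hypothetical.

```python
def knn_level_set_clusters(xs, k, level):
    """Connected components of the k-NN graph restricted to points whose
    estimated density is at least `level` (1-D data, distinct values)."""
    n = len(xs)
    knn = []      # indices of the k nearest neighbors of each point
    density = []  # 1-D k-NN density estimate: k / (n * 2 * r_k)
    for i in range(n):
        order = sorted((j for j in range(n) if j != i),
                       key=lambda j: abs(xs[i] - xs[j]))
        knn.append(order[:k])
        r_k = abs(xs[i] - xs[order[k - 1]])  # distance to the k-th neighbor
        density.append(k / (n * 2 * r_k))

    # Upper level set: keep only points that are dense enough.
    keep = {i for i in range(n) if density[i] >= level}

    # Undirected k-NN edges between kept points.
    adj = {i: set() for i in keep}
    for i in keep:
        for j in knn[i]:
            if j in keep:
                adj[i].add(j)
                adj[j].add(i)

    # Connected components via depth-first search.
    clusters, seen = [], set()
    for start in sorted(keep):
        if start in seen:
            continue
        seen.add(start)
        stack, component = [start], []
        while stack:
            i = stack.pop()
            component.append(i)
            for j in adj[i]:
                if j not in seen:
                    seen.add(j)
                    stack.append(j)
        clusters.append(sorted(component))
    return clusters

# Hypothetical sample: two dense groups joined by a sparse bridge.
xs = [0.0, 0.2, 0.4, 0.6, 3.0, 5.0, 7.0, 9.4, 9.6, 9.8, 10.0]
tree = {level: knn_level_set_clusters(xs, k=2, level=level)
        for level in (0.0, 0.1, 0.3)}
for level, clusters in tree.items():
    print(level, clusters)
```

At level 0.0 all points form one component; at 0.1 the sparse bridge drops out and the data splits into two clusters; at 0.3 only the densest core of each group remains. Recording how components split as the level rises is what gives the tree its hierarchy.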

Slides available here: https://speakerdeck.com/papayawarrior/density-based-clustering-in-python

Notebooks: http://nbviewer.ipython.org/github/papayawarrior/public_talks/blob/master/pydata_nyc_dbscan.ipynb
http://nbviewer.ipython.org/github/papayawarrior/public_talks/blob/master/pydata_nyc_DeBaCl.ipynb

Video: Brian Kent: Density Based Clustering in Python, from the PyData channel
