
When Should You Use Random Forests?

Subscribe to RichardOnData here: https://www.youtube.com/channel/UCKPyg5gsnt6h0aA8EBw3i6A?sub_confirmation=1

In this video I talk about the Random Forest algorithm. It is one of my favorite machine learning algorithms and one of the most popular overall, and it is useful whether your goal is inference or prediction.

The base learner of the Random Forest is the Decision Tree. Decision Trees are simple and straightforward, but they tend to be high variance -- i.e., they overfit and do not generalize well from a training set to a test set. Random Forests correct for this problem. They are an "ensemble learning method" -- the process for creating them is as follows (a minimal Python sketch appears after the list):

1) Create a bootstrapped dataset by sampling rows with replacement
2) Fit a Decision Tree to that data, considering only a random subset of the available variables at each split
3) Repeat the process
4) Tally "votes" for predictions across trees -- predicted class is the one with the most "votes"
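
To make these four steps concrete, here is a rough from-scratch sketch in Python. It uses scikit-learn's DecisionTreeClassifier as the base learner, with max_features="sqrt" handling the per-split variable subsetting, and assumes non-negative integer class labels; in practice you would just use sklearn's RandomForestClassifier.

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def fit_forest(X, y, n_trees=100, seed=0):
    """Steps 1-3: bootstrap the rows and fit one tree per sample."""
    rng = np.random.default_rng(seed)
    n = len(X)
    trees = []
    for _ in range(n_trees):
        idx = rng.integers(0, n, size=n)                    # 1) bootstrap sample
        tree = DecisionTreeClassifier(max_features="sqrt")  # random variable subset per split
        tree.fit(X[idx], y[idx])                            # 2) fit a tree to that data
        trees.append(tree)                                  # 3) repeat
    return trees

def predict_forest(trees, X):
    """Step 4: each tree votes; the majority class wins."""
    votes = np.stack([t.predict(X) for t in trees])         # shape (n_trees, n_samples)
    return np.array([np.bincount(col.astype(int)).argmax() for col in votes.T])

Sampling rows with replacement means each tree sees a slightly different dataset, which decorrelates the trees; subsetting variables at each split decorrelates them further, and averaging many decorrelated trees is what brings the variance down.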

A key feature of Random Forests is that they can be used to produce Variable Importance plots. These rank, from top to bottom, the most "important" variables in the data. What is nice about these is that, while Random Forests are not interpretable in the way regression models are, they are constructed differently and can detect things like non-linear relationships. See the diagram, where RM is the most important variable, followed by LSTAT, then DIS.
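
RM, LSTAT, and DIS are variables from the Boston housing data shown in the video's plot; that dataset has been removed from recent scikit-learn releases, so as an assumption this sketch substitutes the California housing data to show how such a ranking is produced (feature_importances_ is sklearn's impurity-based measure):

from sklearn.datasets import fetch_california_housing
from sklearn.ensemble import RandomForestRegressor

# Stand-in dataset (downloads on first use); the video's plot used Boston housing
data = fetch_california_housing()
rf = RandomForestRegressor(n_estimators=200, random_state=0)
rf.fit(data.data, data.target)

# Rank variables from most to least "important" (impurity-based)
ranked = sorted(zip(data.feature_names, rf.feature_importances_),
                key=lambda pair: pair[1], reverse=True)
for name, imp in ranked:
    print(f"{name:>12s}  {imp:.3f}")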

Some other benefits of Random Forests are:
1) They are not extremely sensitive to outliers
2) They are fairly stable and can handle new data without changing dramatically
3) They have methods for handling missing data
4) They can be used for unsupervised learning (clustering) -- see the sketch after this list
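
On point 4, the abstract linked below describes the usual trick: create a synthetic copy of the data by permuting each column independently, train a Random Forest to separate real from synthetic, and treat how often two real observations land in the same leaf as a proximity measure. A rough sketch of that idea, assuming numeric features (rf_clusters is a hypothetical helper name):

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.ensemble import RandomForestClassifier

def rf_clusters(X, n_clusters=3, n_trees=200, seed=0):
    rng = np.random.default_rng(seed)
    # Synthetic copy: permute each column independently to break the joint structure
    X_synth = np.column_stack([rng.permutation(col) for col in X.T])
    y = np.r_[np.ones(len(X)), np.zeros(len(X))]   # real = 1, synthetic = 0
    rf = RandomForestClassifier(n_estimators=n_trees, random_state=seed)
    rf.fit(np.vstack([X, X_synth]), y)

    # Proximity: fraction of trees in which two real observations share a leaf
    leaves = rf.apply(X)                           # (n_samples, n_trees) leaf indices
    prox = np.zeros((len(X), len(X)))
    for t in range(leaves.shape[1]):
        prox += leaves[:, t][:, None] == leaves[:, t][None, :]
    prox /= leaves.shape[1]

    # Hierarchical clustering on 1 - proximity as a dissimilarity
    condensed = (1 - prox)[np.triu_indices(len(X), k=1)]
    return fcluster(linkage(condensed, method="average"), t=n_clusters, criterion="maxclust")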

However, some drawbacks are:
1) They can be slow and memory-intensive
2) Variable importance can become biased if you have: a) a mix of continuous and categorical variables, where the categorical variables have few levels; or b) correlated continuous variables
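
The bias in point 2 concerns the impurity-based importance measure. One common mitigation in scikit-learn is permutation importance computed on held-out data (conditional inference trees, linked below, are another route). A brief sketch on synthetic data:

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_tr, y_tr)

# Shuffle one feature at a time and measure the drop in held-out accuracy;
# this is less biased than impurity importance for mixed or correlated features
result = permutation_importance(rf, X_te, y_te, n_repeats=10, random_state=0)
for i in result.importances_mean.argsort()[::-1]:
    print(f"feature {i}: {result.importances_mean[i]:.3f} +/- {result.importances_std[i]:.3f}")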

See StatQuest with Josh Starmer's video on building random forests: https://www.youtube.com/watch?v=J4Wdy0Wc_xQ
See Leo Breiman and Adele Cutler's documentation on random forests: https://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm
See the StackExchange discussion on sensitivity to outliers: https://stats.stackexchange.com/questions/187200/how-are-random-forests-not-sensitive-to-outliers
See the StackExchange discussion on conditional inference trees: https://stats.stackexchange.com/questions/12140/conditional-inference-trees-vs-traditional-decision-trees
See the abstract on clustering using Random Forests: https://horvath.genetics.ucla.edu/html/RFclustering/RFclustering/RandomForestHorvath.pdf

Image credit for the decision tree image: https://www.geeksforgeeks.org/decision-tree/
Image credit for the variable importance plot: https://towardsdatascience.com/explaining-feature-importance-by-example-of-a-random-forest-d9166011959e (Eryk Lewinson, Towards Data Science)
