
today I tried: Evolution Strategies

This video was inspired by this amazing blog post by OpenAI: https://blog.openai.com/evolution-strategies/

Paint some reward! (just clone the repo and open index.html) https://github.com/JonComo/es_paint

Train a (simulated) robot arm: https://github.com/JonComo/arm

Loose script I worked from:

What is it?

An easy way to train reinforcement learning agents.

What's cool about it?

It's simple to implement, it parallelizes easily, and it's competitive with standard RL methods.

How does it work?

Well, here's the typical RL scenario: an agent and an environment, where the agent emits actions that affect the environment, which then returns its new state and a reward r.

The agent wants to change its actions to maximize total expected reward.
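Roughly, that loop looks something like this in Python (env, agent, and their methods here are just hypothetical placeholders, not any particular library):

def run_episode(env, agent, max_steps=200):
    state = env.reset()
    total_reward = 0.0
    for _ in range(max_steps):
        action = agent.act(state)                 # agent emits an action
        state, reward, done = env.step(action)    # environment returns its new state and a reward r
        total_reward += reward
        if done:
            break
    return total_reward                           # the thing the agent wants to maximize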

How is this done typically?

Well, the environment spits out a state, and the agent transforms that state with a differentiable function (typically a neural network) into a probability distribution over actions. It samples an action from that distribution and executes it in the environment, which in turn spits out a new state and reward. If the reward is good, we should increase the probability of the action the agent took in that state. Using the gradient of the neural network, we can do just that.
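As a rough illustration (not the actual code from the video), here's what that standard approach might look like with a tiny linear softmax policy; environment_step is a hypothetical stand-in for the environment:

import numpy as np

def policy(weights, state):
    logits = state @ weights                        # differentiable transform of the state
    exps = np.exp(logits - logits.max())
    return exps / exps.sum()                        # probability distribution over actions

def reinforce_step(weights, state, learning_rate=0.01):
    probs = policy(weights, state)
    action = np.random.choice(len(probs), p=probs)  # sample an action from that distribution
    reward = environment_step(state, action)        # hypothetical: returns the reward for this action
    one_hot = np.zeros_like(probs)
    one_hot[action] = 1.0
    grad_log_prob = np.outer(state, one_hot - probs)           # gradient of log-prob of the taken action
    return weights + learning_rate * reward * grad_log_prob    # good reward -> that action gets more likely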

What's different with evolution strats?

We modify the agent's weights in its network with random gaussian noise, get a reward from the environment, and if the reward was good we move our weights in the direction of the noise. We try many different "directions" of noise and, using the reward obtained for each, form a weighted average, which becomes our new weights.
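Here's a small sketch of one such update (following the version of the idea in the OpenAI blog post); evaluate is a hypothetical function that runs the agent with the given weights and returns its total reward:

import numpy as np

def es_step(weights, evaluate, n_samples=50, sigma=0.1, learning_rate=0.01):
    noise = np.random.randn(n_samples, weights.size)          # random gaussian "directions"
    rewards = np.array([evaluate(weights + sigma * eps.reshape(weights.shape))
                        for eps in noise])
    rewards = (rewards - rewards.mean()) / (rewards.std() + 1e-8)   # normalize the rewards
    update = (noise.T @ rewards).reshape(weights.shape)       # weighted average of the noise
    return weights + learning_rate / (n_samples * sigma) * update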

This comes from a simple idea called the finite difference. It may sound complex, but you're already familiar with it if you've watched my previous vids. Here's a visual demonstration of it.
Here's a simple example, where we have 2 parameters and we sample slightly different ones. The landscape has more reward where it's more blue. We then take a weighted average of the samples and move our parameters in that direction!
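Here's a toy, runnable version of that picture, using the es_step sketch above on a made-up 2-parameter landscape whose reward is highest at (3, -2) (standing in for "more blue = more reward"):

import numpy as np

def toy_reward(params):
    return -np.sum((params - np.array([3.0, -2.0])) ** 2)   # assumed toy landscape

params = np.zeros(2)
for _ in range(100):
    params = es_step(params, toy_reward, n_samples=20, sigma=0.5, learning_rate=0.3)
print(params)   # should have moved most of the way towards (3, -2)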

In a more complex example, getting a reward from the environment means doing a rollout: evaluating our parameters for a number of timesteps to see if they're good or not.
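In code, a rollout is just the episode loop from earlier run with a fixed set of weights; the total reward it returns is what the ES update uses to score each noise direction (act and env are again hypothetical placeholders):

def evaluate(weights, env, n_timesteps=500):
    state = env.reset()
    total_reward = 0.0
    for _ in range(n_timesteps):
        action = act(weights, state)            # hypothetical policy using these exact weights
        state, reward, done = env.step(action)
        total_reward += reward
        if done:
            break
    return total_reward                         # score for this set of weights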

Here's a more complex example, where the parameters control this arm's joint angles. I want to train it to move towards the mouse, so I pick random targets on screen, modify the arm's weights, and see if it gets closer with the modified weights. If so, that's a positive reward. Then I do the same thing as before: use a weighted average of the noise I applied to the weights to get the weight update!
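A heavily simplified sketch of that reward (not the code from the arm repo, and assuming the parameters are the joint angles themselves): the reward for a noise sample is how much closer the arm's end effector gets to the target with the perturbed angles than with the current ones.

import numpy as np

def end_effector(joint_angles, link_length=50.0):
    angles = np.cumsum(joint_angles)            # simple 2D forward kinematics, equal-length links
    return np.array([np.sum(link_length * np.cos(angles)),
                     np.sum(link_length * np.sin(angles))])

def arm_reward(joint_angles, noise, target, sigma=0.1):
    before = np.linalg.norm(end_effector(joint_angles) - target)
    after = np.linalg.norm(end_effector(joint_angles + sigma * noise) - target)
    return before - after                       # positive if the perturbed arm ends up closer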
