Pessimistic Reward Models for Off-Policy Learning in Recommendation (RecSys 2021)
Authors: Olivier Jeunen, University of Antwerp | Bart Goethals, University of Antwerp
Abstract: Methods for bandit learning from user interactions often require a model of the reward a certain context-action pair will yield -- for example, the probability of a click on a recommendation.
This common machine learning task is highly non-trivial, as the data-generating process for contexts and actions is often skewed by the recommender system itself.
Indeed, when the deployed recommendation policy at data collection time does not pick its actions uniformly-at-random, this leads to a selection bias that can impede effective reward modelling.
This in turn makes off-policy learning -- the typical setup in industry -- particularly challenging.
In this work, we propose and validate a general pessimistic reward modelling approach for off-policy learning in recommendation.
Bayesian uncertainty estimates allow us to express scepticism about our own reward model, which can in turn be used to generate a conservative decision rule.
We show how it alleviates a well-known decision-making phenomenon known as the Optimiser's Curse (illustrated in the simulation sketch below), and draw parallels with existing work on pessimistic policy learning.
Leveraging the available closed-form expressions for both the posterior mean and variance when a ridge regressor models the reward, we show how to apply pessimism effectively and efficiently to an off-policy recommendation use-case (sketched in code after the DOI).
Empirical observations in a wide range of environments show that being conservative in decision-making leads to a significant and robust increase in recommendation performance.
The merits of our approach are most pronounced in realistic settings with limited logging randomisation, limited training samples, and larger action spaces.
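The Optimiser's Curse referenced in the abstract is easy to reproduce: when every action is equally good but reward estimates are noisy, taking the argmax of the estimates systematically overstates the reward that will actually be realised. A minimal numpy simulation of this effect; the action count, reward level, and noise scale are illustrative assumptions, not values from the paper:

```python
import numpy as np

# Simulation of the Optimiser's Curse: picking the action with the highest
# *estimated* reward makes the estimate for the chosen action biased upward,
# even though every action has the same true reward. All constants below are
# illustrative assumptions.
rng = np.random.default_rng(42)

n_actions, n_trials = 50, 10_000
true_reward = np.full(n_actions, 0.05)   # every action is equally good
noise_scale = 0.02                       # estimation noise

estimates = true_reward + rng.normal(scale=noise_scale, size=(n_trials, n_actions))
chosen = estimates.argmax(axis=1)
anticipated = estimates.max(axis=1).mean()   # reward we *think* we'll get
realised = true_reward[chosen].mean()        # reward we actually get

print(f"anticipated reward: {anticipated:.4f}")  # noticeably above 0.05
print(f"realised reward:    {realised:.4f}")     # 0.05 in expectation
```

The gap between the anticipated and realised reward is the post-decision disappointment that a pessimistic decision rule is meant to counteract.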
DOI: https://doi.org/10.1145/3460231.3474247
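Since the abstract notes that the posterior mean and variance of a ridge regressor are available in closed form, here is a minimal sketch of the resulting conservative decision rule: score each candidate action by a lower confidence bound (posterior mean minus a multiple of the posterior standard deviation) and pick the argmax. This is a sketch under a standard Gaussian-prior formulation; the parameter names (alpha, sigma2, kappa) and the helper pessimistic_action are illustrative assumptions, not the authors' reference implementation:

```python
import numpy as np

# Bayesian ridge regression reward model with a pessimistic (lower
# confidence bound) decision rule. Data shapes and hyperparameters are
# illustrative assumptions.
rng = np.random.default_rng(0)

d, n_actions, n_logged = 8, 20, 500      # feature dim, action count, log size
Phi = rng.normal(size=(n_logged, d))     # logged context-action features
r = rng.binomial(1, 0.1, size=n_logged)  # logged binary rewards (e.g. clicks)

alpha, sigma2 = 1.0, 1.0                 # prior precision, noise variance

# Closed-form posterior for Bayesian ridge regression:
#   A = alpha * I + Phi^T Phi / sigma2,  w_mean = A^{-1} Phi^T r / sigma2
A = alpha * np.eye(d) + Phi.T @ Phi / sigma2
A_inv = np.linalg.inv(A)
w_mean = A_inv @ Phi.T @ r / sigma2

def pessimistic_action(x_candidates, kappa=1.0):
    """Pick the action whose lower confidence bound on the reward is highest.

    x_candidates: (n_actions, d) feature vectors, one per candidate action.
    kappa: pessimism weight (0 recovers the plain posterior-mean rule).
    """
    mean = x_candidates @ w_mean
    # Posterior variance of the mean reward at each x: x^T A^{-1} x.
    var = np.einsum("ij,jk,ik->i", x_candidates, A_inv, x_candidates)
    return int(np.argmax(mean - kappa * np.sqrt(var)))

x = rng.normal(size=(n_actions, d))      # candidate features for one context
print("chosen action:", pessimistic_action(x))
```

Penalising high-variance actions is what counters the Optimiser's Curse: the actions that look best purely by estimation luck are exactly the ones whose scores the variance term shrinks.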
Video "Pessimistic Reward Models for Off-Policy Learning in Recommendation" from the ACM RecSys channel
Other videos on the channel:
Building public service recommenders: Logbook of a journey
Tops, Bottoms, and Shoes: Building Capsule Wardrobes via Cross-Attention Tensor Network
PS 5: Latent Factor Models and Aggregation Operators for Collaborative Filtering in Reciprocal Recom
Workshop on Podcast Recommendations (PodRecs 2021)
Session 1: Personalizing Benefits Allocation Without Spending Money
PS2: Translation-based factorization machines for sequential
Workshop on Context-Aware Recommender Systems
Session 8: Adversary or Friend? An adversarial Approach to Improving Recommender Systems
Paper Session 4: Domain Adaptation in Display Advertising: An Application for Partner Cold-Start
Mitigating Confounding Bias in Recommendation via Information Bottleneck
PS 7: Eliciting pairwise preferences in recommender systems (Saikishore Kalloori)
PS 6: Judging similarity: a user-centric study of related item recommendations (Yuan Yao)
RecSys 2015 Session 4b: Algorithms
Boosting Local Recommendations With Partially Trained Global Model
Learning a voice-based conversational recommender using offline policy optimization
Session 9: Timely Personalization at Peloton: System and Algorithm for Boosting Time Relevant Content
RecSys 2020 Session P2B: Evaluating and Explaining Recommendations
RecSys 2020 Session P5B: Real World Applications II
Workshop on Recommender Systems for Human Resources (RecSys in HR 2021)
Privacy Preserving Collaborative Filtering by Distributed Mediation