Pessimistic Reward Models for Off-Policy Learning in Recommendation (RecSys 2021)
Authors: Olivier Jeunen, University of Antwerp | Bart Goethals, University of Antwerp
Abstract: Methods for bandit learning from user interactions often require a model of the reward a certain context-action pair will yield -- for example, the probability of a click on a recommendation.
This common machine learning task is highly non-trivial, as the data-generating process for contexts and actions is often skewed by the recommender system itself.
Indeed, when the deployed recommendation policy at data collection time does not pick its actions uniformly-at-random, this leads to a selection bias that can impede effective reward modelling.
This in turn makes off-policy learning -- the typical setup in industry -- particularly challenging.
In this work, we propose and validate a general pessimistic reward modelling approach for off-policy learning in recommendation.
Bayesian uncertainty estimates allow us to express scepticism about our own reward model, which can in turn be used to generate a conservative decision rule.
We show how it alleviates a well-known decision-making phenomenon known as the Optimiser's Curse (illustrated in the simulation sketch below), and draw parallels with existing work on pessimistic policy learning.
Leveraging the available closed-form expressions for both the posterior mean and variance when a ridge regressor models the reward, we show how to apply pessimism effectively and efficiently to an off-policy recommendation use-case (sketched in code after the DOI).
Empirical observations in a wide range of environments show that being conservative in decision-making leads to a significant and robust increase in recommendation performance.
The merits of our approach are most pronounced in realistic settings with limited logging randomisation, limited training samples, and larger action spaces.
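The Optimiser's Curse referenced in the abstract is easy to reproduce: when every action is equally good but reward estimates are noisy, taking the argmax of the estimates systematically overstates the reward that will actually be realised. A minimal numpy simulation of this effect; the action count, reward level, and noise scale are illustrative assumptions, not values from the paper:

```python
import numpy as np

# Simulation of the Optimiser's Curse: picking the action with the highest
# *estimated* reward makes the estimate for the chosen action biased upward,
# even though every action has the same true reward. All constants below are
# illustrative assumptions.
rng = np.random.default_rng(42)

n_actions, n_trials = 50, 10_000
true_reward = np.full(n_actions, 0.05)   # every action is equally good
noise_scale = 0.02                       # estimation noise

estimates = true_reward + rng.normal(scale=noise_scale, size=(n_trials, n_actions))
chosen = estimates.argmax(axis=1)
anticipated = estimates.max(axis=1).mean()   # reward we *think* we'll get
realised = true_reward[chosen].mean()        # reward we actually get

print(f"anticipated reward: {anticipated:.4f}")  # noticeably above 0.05
print(f"realised reward:    {realised:.4f}")     # 0.05 in expectation
```

The gap between the anticipated and realised reward is the post-decision disappointment that a pessimistic decision rule is meant to counteract.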
DOI: https://doi.org/10.1145/3460231.3474247
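Since the abstract notes that the posterior mean and variance of a ridge regressor are available in closed form, here is a minimal sketch of the resulting conservative decision rule: score each candidate action by a lower confidence bound (posterior mean minus a multiple of the posterior standard deviation) and pick the argmax. This is a sketch under a standard Gaussian-prior formulation; the parameter names (alpha, sigma2, kappa) and the helper pessimistic_action are illustrative assumptions, not the authors' reference implementation:

```python
import numpy as np

# Bayesian ridge regression reward model with a pessimistic (lower
# confidence bound) decision rule. Data shapes and hyperparameters are
# illustrative assumptions.
rng = np.random.default_rng(0)

d, n_actions, n_logged = 8, 20, 500      # feature dim, action count, log size
Phi = rng.normal(size=(n_logged, d))     # logged context-action features
r = rng.binomial(1, 0.1, size=n_logged)  # logged binary rewards (e.g. clicks)

alpha, sigma2 = 1.0, 1.0                 # prior precision, noise variance

# Closed-form posterior for Bayesian ridge regression:
#   A = alpha * I + Phi^T Phi / sigma2,  w_mean = A^{-1} Phi^T r / sigma2
A = alpha * np.eye(d) + Phi.T @ Phi / sigma2
A_inv = np.linalg.inv(A)
w_mean = A_inv @ Phi.T @ r / sigma2

def pessimistic_action(x_candidates, kappa=1.0):
    """Pick the action whose lower confidence bound on the reward is highest.

    x_candidates: (n_actions, d) feature vectors, one per candidate action.
    kappa: pessimism weight (0 recovers the plain posterior-mean rule).
    """
    mean = x_candidates @ w_mean
    # Posterior variance of the mean reward at each x: x^T A^{-1} x.
    var = np.einsum("ij,jk,ik->i", x_candidates, A_inv, x_candidates)
    return int(np.argmax(mean - kappa * np.sqrt(var)))

x = rng.normal(size=(n_actions, d))      # candidate features for one context
print("chosen action:", pessimistic_action(x))
```

Penalising high-variance actions is what counters the Optimiser's Curse: the actions that look best purely by estimation luck are exactly the ones whose scores the variance term shrinks.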
Video "Pessimistic Reward Models for Off-Policy Learning in Recommendation" from the ACM RecSys channel
Other videos on the channel:
Building public service recommenders: Logbook of a journey
Tops, Bottoms, and Shoes: Building Capsule Wardrobes via Cross-Attention Tensor Network
PS 5: Latent Factor Models and Aggregation Operators for Collaborative Filtering in Reciprocal Recom
Workshop on Podcast Recommendations (PodRecs 2021)
Session 1: Personalizing Benefits Allocation Without Spending Money
PS2: Translation-based factorization machines for sequential
Workshop on Context-Aware Recommender Systems
Session 8: Adversary or Friend? An adversarial Approach to Improving Recommender Systems
Paper Session 4: Domain Adaptation in Display Advertising: An Application for Partner Cold-Start
Mitigating Confounding Bias in Recommendation via Information Bottleneck
PS 7: Eliciting pairwise preferences in recommender systems (Saikishore Kalloori)
PS 6: Judging similarity: a user-centric study of related item recommendations (Yuan Yao)
RecSys 2015 Session 4b: Algorithms
Boosting Local Recommendations With Partially Trained Global Model
Learning a voice-based conversational recommender using offline policy optimization
Session 9: Timely Personalization at Peloton: System and Algorithm for Boosting Time Relevant Content
RecSys 2020 Session P2B: Evaluating and Explaining Recommendations
RecSys 2020 Session P5B: Real World Applications II
Workshop on Recommender Systems for Human Resources (RecSys in HR 2021)
Privacy Preserving Collaborative Filtering by Distributed Mediation