Papers

Oral Session 5 @ Blue 1

An Experimental Design Approach for Regret Minimization in Logistic Bandits

Blake Mason, Kwang-Sung Jun, Lalit Jain

[AAAI-22] Main Track

Abstract

In this work we consider the problem of regret minimization for logistic bandits. The main challenge of logistic bandits is reducing the dependence on a potentially large problem-dependent constant $\kappa$ that can at worst scale exponentially with the norm of the unknown parameter $\theta_{\ast}$. Abeille et al. (2021) applied self-concordance of the logistic function to remove this worst-case dependence, providing regret guarantees like $O(d\log^2(\kappa)\sqrt{\dot\mu T}\log(|\mathcal{X}|))$, where $d$ is the dimensionality, $T$ is the time horizon, and $\dot\mu$ is the variance of the best arm. This work improves upon this bound in the fixed-arm setting by employing an experimental design procedure that achieves a minimax regret of $O(\sqrt{d \dot\mu T\log(|\mathcal{X}|)})$. Our regret bound in fact takes a tighter instance-dependent (i.e., gap-dependent) form, the first of its kind for logistic bandits. We also propose a new warmup sampling algorithm that can dramatically reduce the lower-order term in the regret in general, and prove that it can reduce the lower-order term's dependence on $\kappa$ to $\log^2(\kappa)$ for some instances. Finally, we discuss the impact of the bias of the MLE on the logistic bandit problem: we provide an example where the $d^2$ lower-order regret (compared with $d$ for linear bandits) may not be improved as long as the MLE is used, and we discuss how bias-corrected estimators may be used to bring it closer to $d$.
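The exponential scaling of $\kappa$ mentioned above can be made concrete with a short calculation. The abstract does not define $\kappa$; the sketch below assumes one common convention from the logistic bandit literature, $\kappa = \max_{x \in \mathcal{X}} 1/\dot\mu(x^\top\theta_{\ast})$ with $\mu(z) = 1/(1+e^{-z})$, and additionally assumes arms with $\|x\| \le 1$.

```latex
\dot\mu(z) = \mu(z)\bigl(1-\mu(z)\bigr) = \frac{1}{(1+e^{-z})(1+e^{z})}
\quad\Longrightarrow\quad
\frac{1}{\dot\mu(z)} = 2 + e^{z} + e^{-z} = 2 + 2\cosh(z),
\qquad\text{so}\qquad
\kappa \le 2 + 2\cosh\bigl(\|\theta_{\ast}\|\bigr).
```

Under these assumptions the bound is attained when some unit-norm arm is aligned with $\theta_{\ast}$, in which case $\kappa$ grows like $e^{\|\theta_{\ast}\|}$.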

Keywords

poster session 5 @ blue 1, poster session 10 @ blue 1, oral session 5 @ blue 1, poster session 5, poster session 10, oral session 5

Saving Stochastic Bandits from Poisoning Attacks via Limited Data Verification

Anshuka Rangi, Long Tran-Thanh, Haifeng Xu, Massimo Franceschetti

[AAAI-22] Main Track

Abstract

This paper studies bandit algorithms under data poisoning attacks in a bounded reward setting. We consider a strong attacker model in which the attacker can observe both the selected actions and their corresponding rewards, and can contaminate the rewards with additive noise. We show that any bandit algorithm with regret $O(\log T)$ can be forced to suffer a regret $\Omega(T)$ with an expected amount of contamination $O(\log T)$. This amount of contamination is also necessary, as we prove that there exists an $O(\log T)$ regret bandit algorithm, specifically the classical UCB, that requires $\Omega(\log T)$ amount of contamination to suffer regret $\Omega(T)$. To combat such poisoning attacks, we propose, as our second main contribution, verification-based mechanisms that use limited verification to access a limited number of uncontaminated rewards. In particular, for the case of unlimited verifications, we show that with $O(\log T)$ expected number of verifications, a simple modified version of the Explore-then-Commit type bandit algorithm can restore the order-optimal $O(\log T)$ regret irrespective of the amount of contamination used by the attacker. We also provide a UCB-like verification scheme, called Secure-UCB, that likewise fully recovers from any attack, again with $O(\log T)$ expected number of verifications. To derive a matching lower bound on the number of verifications, we also prove that for any order-optimal bandit algorithm, this number of verifications $O(\log T)$ is necessary to recover the order-optimal regret.
On the other hand, when the number of verifications is bounded above by a budget $B$, we propose a novel algorithm, Secure-BARBAR, which provably achieves $\tilde{O}(\min\{C,T/\sqrt{B} \})$ regret with high probability against weak attackers (i.e., attackers who have to place the contamination before seeing the actual pulls of the bandit algorithm), where $C$ is the total amount of contamination by the attacker. This breaks the known $\Omega(C)$ lower bound of the non-verified setting if $C$ is large.
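The role of verification in an Explore-then-Commit scheme can be illustrated with a toy simulation. This is only a minimal sketch under stated assumptions, not the algorithm from the paper: the function names, the reward-flipping attacker, the two Bernoulli arms, and the exploration length of `8 * log(T)` pulls per arm are all hypothetical choices made for illustration. The commit phase is not simulated, since contaminated rewards observed after committing never influence the decision.

```python
import math
import random


def explore_then_commit(true_means, pulls_per_arm, attack, verify, rng):
    """Explore each arm a fixed number of times, then commit to the best one.

    Returns (committed_arm, verifications_used).
    """
    totals = [0.0] * len(true_means)
    verifications = 0
    for arm, mean in enumerate(true_means):
        for _ in range(pulls_per_arm):
            clean = 1.0 if rng.random() < mean else 0.0  # Bernoulli reward
            if verify:
                # Assumption: a verification reveals the uncontaminated reward.
                observed = clean
                verifications += 1
            else:
                # Without verification the learner only sees the attacked value.
                observed = attack(arm, clean)
            totals[arm] += observed
    committed = max(range(len(true_means)), key=lambda a: totals[a])
    return committed, verifications


def flip_attack(arm, reward):
    """Hypothetical attacker that flips every Bernoulli reward (stays in [0, 1])."""
    return 1.0 - reward


horizon = 10_000
pulls = math.ceil(8 * math.log(horizon))  # assumed O(log T) exploration length
means = [0.1, 0.9]  # arm 1 is the better arm

naive_arm, _ = explore_then_commit(means, pulls, flip_attack, False, random.Random(0))
secure_arm, secure_checks = explore_then_commit(means, pulls, flip_attack, True, random.Random(0))
print("naive commits to arm", naive_arm)
print("verified commits to arm", secure_arm, "using", secure_checks, "verifications")
```

In this toy setting the unverified learner is steered toward the worse arm, while the verified learner bases its commitment only on clean rewards and spends a number of verifications that grows logarithmically in the horizon.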

Keywords

poster session 5 @ blue 1, poster session 10 @ blue 1, oral session 5 @ blue 1, poster session 5, poster session 10, oral session 5

Modeling Attrition in Recommender Systems with Departing Bandits

Omer Ben-Porat, Lee Cohen, Liu Leqi, Zachary C. Lipton, Yishay Mansour

[AAAI-22] Main Track

Abstract

Traditionally, when recommender systems are formalized as multi-armed bandits, the policy of the recommender system influences the rewards accrued, but not the length of interaction. However, in real-world systems, dissatisfied users may depart (and never come back). In this work, we propose a novel multi-armed bandit setup that captures such policy-dependent horizons. Our setup consists of a finite set of user types, and multiple arms with Bernoulli payoffs. Each (user type, arm) pair corresponds to an (unknown) reward probability. Each user's type is initially unknown and can only be inferred through their response to recommendations. Moreover, if a user is dissatisfied with their recommendation, they might depart the system. We first address the case where all users share the same type, demonstrating that a recent UCB-based algorithm is optimal. We then turn to the more challenging case, where users are divided among two types. While naive approaches cannot handle this setting, we provide an efficient learning algorithm that achieves $\tilde{O}(\sqrt{T})$ regret, where $T$ is the number of users.
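A policy-dependent horizon can be illustrated with a small simulation of a single user session. The abstract does not spell out the departure model, so the sketch below assumes one: the same arm is shown repeatedly, the user clicks with probability `click_prob`, and after each non-click the user leaves with probability `depart_prob`. The function names and parameter values are hypothetical. Under this assumed model, the expected number of clicks $V$ satisfies $V = p(1+V) + (1-p)(1-\lambda)V$, which gives $V = p / ((1-p)\lambda)$.

```python
import random


def expected_clicks(click_prob, depart_prob):
    """Closed-form expected clicks under the assumed departure model."""
    return click_prob / ((1.0 - click_prob) * depart_prob)


def simulate_session(click_prob, depart_prob, rng):
    """Show one arm until the user departs; return the number of clicks.

    Requires click_prob < 1 and depart_prob > 0, otherwise the user never leaves.
    """
    clicks = 0
    while True:
        if rng.random() < click_prob:
            clicks += 1  # satisfied user stays for another recommendation
        elif rng.random() < depart_prob:
            return clicks  # dissatisfied user departs for good


rng = random.Random(1)
sessions = 20_000
average = sum(simulate_session(0.5, 0.5, rng) for _ in range(sessions)) / sessions
print("closed form:", expected_clicks(0.5, 0.5))
print("simulated:  ", round(average, 3))
```

The session length here depends on which arm is shown, which is the feature that distinguishes this setup from a bandit with a fixed horizon.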

Keywords

poster session 5 @ blue 1, poster session 10 @ blue 1, oral session 5 @ blue 1, poster session 5, poster session 10, oral session 5