Adversarial Attacks and Robust Learning in Multi-Armed Bandit Problems
In this paper we study the bandit learning problem in the presence of adversaries. The bandit learning setting is a classic sequential decision-making setting whose central challenge is balancing exploration and exploitation. Models based on the bandit setting are widely applied in fields such as marketing and advertising. However, the original bandit setting simplifies the reward-generation process as stochastic, which is inaccurate in some important cases. In fact, adversaries exist in many important applications and severely degrade the performance of most existing bandit learning algorithms. In this paper we consider two typical types of adversaries: the incentive adversary and the data corruption adversary. Incentive adversaries arise because the bandits the algorithm interacts with have incentives of their own. For example, in an online advertising business, a company sells its advertising slots to advertisers repeatedly; this would be a typical bandit problem except that the advertisers also want to maximize their own interests and hence bid strategically. Data corruption adversaries arise from a malicious third party that can manipulate data directly. For example, in a recommendation system, a platform recommends shops to users based on user feedback, but some of that feedback may be generated by the shops themselves, i.e., false feedback.

First, we show that most bandit algorithms designed for the original, adversary-free setting are vulnerable to both types of adversaries. Proving vulnerability to the data corruption adversary is of particular interest: we characterize an important feature shared by most bandit algorithms, construct several instances with different adversary strategies, and show that every algorithm with this feature fails in at least one of these instances.
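The kind of failure described above can be illustrated with a toy sketch. This is not the thesis's actual construction: the two-arm Bernoulli instance, the choice of UCB1 as the victim algorithm, and the zero-out corruption rule are all illustrative assumptions. The adversary's rule depends only on which arm generated the observation, not on the learner's internal state, so the corruption is oblivious to the algorithm's behavior:

```python
import math
import random

def ucb1(arm_means, horizon, corrupt=None, seed=0):
    """Run UCB1 on Bernoulli arms and return per-arm pull counts.

    `corrupt` is an optional map (arm, reward) -> corrupted reward,
    applied to the data stream before the learner observes it.
    """
    rng = random.Random(seed)
    k = len(arm_means)
    counts, sums = [0] * k, [0.0] * k
    for t in range(1, horizon + 1):
        if t <= k:
            arm = t - 1  # pull each arm once to initialize
        else:
            # standard UCB1 index: empirical mean + exploration bonus
            arm = max(range(k),
                      key=lambda a: sums[a] / counts[a]
                      + math.sqrt(2 * math.log(t) / counts[a]))
        reward = 1.0 if rng.random() < arm_means[arm] else 0.0
        if corrupt is not None:
            reward = corrupt(arm, reward)  # adversary rewrites the observation
        counts[arm] += 1
        sums[arm] += reward
    return counts

# Arm 0 is truly best (mean 0.7 vs 0.4); uncorrupted UCB1 favors it.
clean = ucb1([0.7, 0.4], horizon=5000)

# Oblivious corruption: zero out every reward of arm 0, independent of the
# learner's state.  Since UCB1 then pulls the apparently bad arm 0 only
# O(log T) times, a small corruption budget redirects almost all pulls.
attacked = ucb1([0.7, 0.4], horizon=5000,
                corrupt=lambda arm, r: 0.0 if arm == 0 else r)
```

Under this corruption the learner's estimate for the best arm collapses, and nearly the entire horizon is spent on the suboptimal arm.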
In particular, we propose an attack that is oblivious to the algorithm's behavior and show that it can completely manipulate many typical bandit algorithms. Next, we propose an algorithm that is simultaneously robust to both types of adversaries while efficiently utilizing side information (context). Finally, we study the case where only the data corruption adversary is present, which imposes a higher performance requirement than the setting where both adversaries coexist. We argue that the new challenge the data corruption adversary brings to bandit learning is a bias in the estimates that the algorithm cannot observe. Based on this view, we focus on estimators that minimize this bias and explore the conditions under which such estimators work. We show that these conditions appear in both existing algorithms that are robust to the data corruption adversary. We conclude with an open question: does there exist an estimator that strikes a balance between bias and variance such that it can be combined with other bandit algorithms to make them robust to the data corruption adversary?
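The bias–variance tension in the closing open question can be made concrete with a standard robust-statistics device, the trimmed mean. The thesis's estimators are not specified in this abstract, so the function below is only an illustration, and the toy sample is our own assumption: trimming caps the bias a bounded number of corrupted observations can inject, but pays for it by discarding clean data, which raises variance.

```python
import statistics

def trimmed_mean(samples, trim_frac=0.1):
    """Discard the lowest and highest trim_frac fraction of samples,
    then average the rest.  This bounds the bias that a small number of
    corrupted observations can inject, at the cost of extra variance
    from throwing away clean data."""
    xs = sorted(samples)
    k = int(len(xs) * trim_frac)
    core = xs[k:len(xs) - k] if k > 0 else xs
    return sum(core) / len(core)

# Deterministic toy sample with true mean 0.5, plus 5 adversarial outliers.
clean = [0.4, 0.6] * 50
corrupted = clean + [100.0] * 5

naive = statistics.mean(corrupted)   # dragged far above 0.5 by the outliers
robust = trimmed_mean(corrupted)     # stays close to the clean mean
```

Here the plain sample mean is pulled above 5 by five corrupted points, while the trimmed mean remains near 0.5; the open question asks whether an estimator managing this trade-off can be plugged into general bandit algorithms.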