Multi-armed Bandits with Constraint

Aashay Sachdeva
5 min readApr 3, 2022


I love using bandits in online systems. They are easy to understand conceptually, easy to implement, and allow you to explore a bigger policy space than you would be able to using a simple A/B test. They work great when the interpretability required is low, and window of opportunity for optimization is short-lived and there’s not enough time for gathering statistically significant results. For ex- optimizing pricing for a limited period offer.

Of course, they are not perfect. Convergence is hard if the policy space is large and too many policies together might spoil UX. But a well-designed bandit experiment can optimize over a large decision space.

Check out how Amazon optimizes over web page layout using bandits —

One of the issues with vanilla bandits is that it doesn’t take constraints into account. When you are running them in the real world, you will have multiple constraints on things like budget, time for optimization, number of users, etc.

So how do we use bandits in a constrained setting?

Budgeted Thompson Sampling

R∗ is the expected reward of the optimal policy (the policy that can obtain the maximum expected reward given the reward and cost distributions of each arm),

This paper proposes a method to estimate the approximate policy for a MAB under constraints. While getting the optimal policy for budgeted MAB could be very complex, the paper proves that, when the reward and cost per pulling are supported in [0, 1] and the budget is large, the method can achieve the almost optimal reward by always pulling the optimal arm (associated with the maximum expected-reward-to-expected-cost ratio).

The algorithm itself is pretty simple -

Let’s dive into it!

First, we need to define both our rewards and costs in [0,1]. This is easy; just normalize the reward and cost by the highest value. The total budget B is used to define the total number of iterations that happens; for every iteration, the cost is deducted from the budget. The algorithm stops optimizing once budget is exhausted.

Initialize for all arms two beta distributions Beta(α,β) as given in step 4-

  1. For reward- Beta(1,1)
  2. For cost- Beta(1,1)

Now for the first iteration, show the arm for which -

The numerator denotes the sample from the reward’s beta distribution, while the denominator denotes the sample from the cost’s beta distribution. In the starting, all arms will get picked randomly, since both the numerator and denominator are being sampled from almost the same distribution.

Once the arm is picked, the reward and cost distribution are updated -

for the reward’s beta distribution, the α parameter is updated by adding the reward value to it. For the β parameter, the update added is 1-r.

For cost distribution-

α parameter is updated by adding the cost value to it. For the β parameter, the update added is 1-cost.

From the budget, the cost is then deducted. The iterative process stops once the budget is over.

That’s it! Pretty simple to understand and straightforward to implement. But does the algorithm converge? Well, for that you will need to deep dive into the paper, where they provide proof that the algorithm converges to the optimal arm given enough iterations.

I did some simulations and given enough iterations, the model started to converge -

arms with cost -10,8,5,3, iteration 200
iteration 3000

Rough Code for generating the simulation -

import numpy as np
import random
from scipy.stats import beta
import matplotlib.pyplot as plt
from numpy.random import choice
plt.rcParams['figure.figsize'] = [10, 7]
import pandas as pd
def plot_list(reward,penalties,iteration,plot_name):
x = np.linspace(0, 1, 10000)
y1 = beta.pdf(x, reward[0], penalties[0])
y2 = beta.pdf(x, reward[1], penalties[1])
y3 = beta.pdf(x, reward[2], penalties[2])
y4 = beta.pdf(x, reward[3], penalties[3])
plt.title(f"PDF of Beta (Bell-shape) iteration({iteration} // {plot_name})", fontsize=20)
plt.xlabel("X", fontsize=16)
plt.ylabel("Probability Density", fontsize=16)
plt.plot(x, y1, linewidth=3, color='firebrick')
plt.annotate(f"Beta( {reward[0]},{penalties[0]}) { 10}", xy=(1.75, 19), size = 14, ha='center', va='center', color='firebrick')
plt.plot(x, y2, linewidth=3, color='burlywood')
plt.annotate(f"Beta( {reward[1]},{penalties[1]}) { 8}", xy=(1.75, 16.25), size = 14, ha='center', va='center', color='burlywood')
plt.plot(x, y3, linewidth=3, color='dodgerblue')
plt.annotate(f"Beta( {reward[2]},{penalties[2]}) { 5}", xy=(1.75, 15), size = 14, ha='center', va='center', color='dodgerblue')
plt.plot(x, y4, linewidth=3, color='green')
plt.annotate(f"Beta( {reward[3]},{penalties[3]}) { 3}", xy=(1.75, 17.5), size = 14, ha='center', va='center', color='green')
plt.ylim([0, 20])
plt.xlim([0, 2])
def plot_count(num,iteration):['A10','A08','A05','A03'],num)
plt.title(f"Arms selected the most {iteration}")
plt.ylabel('Number Of Times Each Arm Was Selected To Play')
rl10 = [choice([0,0,1]) for i in range(4000)]
rl8 = [choice([0,0,1]) for i in range(4000)]
rl5 = [choice([0,1]) for i in range(4000)]
rl3 = [choice([0,0,1]) for i in range(4000)]
rl10 = [choice([j/10 for j in range(10)]) for i in range(4000)]
rl8 = [choice([j/10 for j in range(10)]) for i in range(4000)]
rl5 = [choice([j/10 for j in range(10)]) for i in range(4000)]
rl3 = [choice([j/10 for j in range(10)]) for i in range(4000)]
def paper_approach_plot(observations,arms,data,data_cost):
#other initializations

arms_selected = []

for n in range(observations):
arm = 0
max_val = 0
for i in range(arms):
if I>max_val:

r = data.values[n, arm]
c = data_2.values[n, arm]
# if r==0:
# c=0

if n%200==0 or n==observations-1:

I hope you found it useful and are able to use bandits in a constraint setting. If you face any issues or just want to have a discussion, feel free to DM me on Twitter —