This project implements some of the action-value methods covered in Sutton and Barto's Reinforcement Learning: An Introduction for the multi-armed bandit problem, a special case of reinforcement learning (RL) with only one state (more info). main.py contains a series of experiments comparing various action-value methods:
- the sample average method vs. the exponential recency-weighted average method (update rules sketched below),
- the 0.1-greedy method vs. the 0.01-greedy method, and
- 0-initialization vs. the optimistic initial values method.
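The first comparison boils down to the step size used in the incremental value update: the sample-average method weights every reward equally (step size 1/n), while the recency-weighted method uses a constant step size and so weights recent rewards more heavily. A minimal sketch of the two update rules (function names are mine, not taken from main.py):

```python
def sample_average_update(q, n, reward):
    # Q_{n+1} = Q_n + (1/n) * (R_n - Q_n): every reward weighted equally.
    return q + (reward - q) / n


def constant_step_update(q, reward, alpha=0.1):
    # Q_{n+1} = Q_n + alpha * (R_n - Q_n): recent rewards weighted more heavily.
    return q + alpha * (reward - q)
```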
For the stationary bandit problem, the true values were sampled from the standard normal. The rewards came from a normal distribution with the mean set to the true value and standard deviation set to 1.
As for the non-stationary version, the true values were all initialized to 0 and then took a small random walk: at each time step I added a normally distributed increment with mean 0 and standard deviation 0.01 to every action's true value. Otherwise, the rewards worked the same way as in the stationary version.
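Here is a minimal sketch of the testbed described above (the class and its names are illustrative rather than the actual main.py code, and I'm assuming a 10-armed bandit by default):

```python
import numpy as np


class Bandit:
    """k-armed Gaussian bandit; optionally non-stationary (true values drift each step)."""

    def __init__(self, k=10, stationary=True, drift_std=0.01, rng=None):
        self.rng = rng if rng is not None else np.random.default_rng()
        self.stationary = stationary
        self.drift_std = drift_std
        # Stationary: true values ~ N(0, 1). Non-stationary: all true values start at 0.
        self.true_values = self.rng.normal(0.0, 1.0, k) if stationary else np.zeros(k)

    def pull(self, action):
        """Return a reward ~ N(true value, 1); random-walk the true values if non-stationary."""
        reward = self.rng.normal(self.true_values[action], 1.0)
        if not self.stationary:
            self.true_values += self.rng.normal(0.0, self.drift_std, self.true_values.size)
        return reward
```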
For the experiment, I ran the two methods for 10,000 time steps and averaged the reward at each time step over 1,000 independent runs.
Set-up details: For both action-value methods, I used an epsilon-greedy method with epsilon=0.1 to encourage some exploration. For the recency-weighted method, I used a constant step size of 0.1.
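Putting the pieces together, here is a sketch of a single run with that set-up (it assumes the hypothetical Bandit class sketched above; again, the names and structure are mine rather than main.py's):

```python
import numpy as np


def run_bandit(bandit, k=10, steps=10_000, epsilon=0.1, alpha=None, q_init=0.0, rng=None):
    """Run one epsilon-greedy agent for `steps` pulls and return the per-step rewards.

    alpha=None uses sample-average updates; alpha=0.1 gives the constant-step-size method.
    """
    rng = rng if rng is not None else np.random.default_rng()
    q = np.full(k, q_init, dtype=float)  # value estimates
    n = np.zeros(k)                      # pull counts (for sample-average step sizes)
    rewards = np.empty(steps)
    for t in range(steps):
        if rng.random() < epsilon:
            action = int(rng.integers(k))         # explore: random arm
        else:
            action = int(np.argmax(q))            # exploit: current best estimate
        reward = bandit.pull(action)
        n[action] += 1
        step = alpha if alpha is not None else 1.0 / n[action]
        q[action] += step * (reward - q[action])  # incremental value update
        rewards[t] = reward
    return rewards
```

Averaging the returned reward traces over 1,000 fresh bandit/agent pairs gives curves like the ones in the figures below.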
Results: I found that the two methods performed similarly on the stationary problem (as expected), but the
recency-weighted (or "Constant Step Size") method performed significantly better on the non-stationary problem (also as expected):
Figure 1: Stationary - both methods have similar performance.
Figure 2: Non-stationary - recency-weighted method performs much better.
Set-up details: I compared two epsilon-greedy methods (epsilon=0.1 and epsilon=0.01) on the stationary problem with the same set-up as before.
Results: As expected, the 0.1-greedy method performs better at first because it discovers the optimal actions earlier, but the 0.01-greedy method performs better over the long run because, once its estimates single out the optimal action, it selects it roughly 99% of the time versus about 90% for the 0.1-greedy method.
Figure 3: Comparing different epsilon-greedy methods to see how much exploration is optimal (for this particular scenario).
Set-up details: For optimistic initial values, I initialized the value estimates at +5 and tested the two methods on the stationary problem (stationarity is a requirement for the optimistic initial values trick, since the extra exploration it drives only happens early on). The 0-init method uses epsilon=0.1, whereas the optimistic initial values method is purely greedy (epsilon=0).
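With a run function like the sketch above, this comparison is just two parameter settings. One detail the write-up doesn't pin down is the update rule; I'm assuming a constant step size of 0.1 for both agents here, as in the book's version of this experiment:

```python
# Hypothetical names from the sketches above; not the actual main.py code.
zero_init  = run_bandit(Bandit(stationary=True), epsilon=0.1, alpha=0.1, q_init=0.0)
optimistic = run_bandit(Bandit(stationary=True), epsilon=0.0, alpha=0.1, q_init=5.0)
```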
Results: As expected, the optimistic initial values method performs worse at the very start because it explores more, but it quickly outperforms the 0-init method because it tends to find the optimal actions earlier and more frequently.
Figure 4: Line chart showing how optimistic initial values can help encourage exploration for stationary problems.