Code for training machine-learning players and playing them via the Catan.jl game engine. It implements the reinforcement-learned players and their training code.
Our current approach is via Temporal Difference learning. We model the game as a Markov Reward Process, in which the scripted player refines their initial policy by exploring the state space via repeated self-play.
The reward function for player $p$ assigns a scalar reward $R_p(s)$ to each game state $s$. The value of a state (that is, the reward plus an exponentially-decaying summation of future state rewards branching from this state) is

$$V_p(s_t) = R_p(s_t) + \sum_{k=1}^{\infty} \gamma^k \, R_p(s_{t+k}),$$

where $\gamma \in (0, 1)$ is the discount factor.
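As a concrete illustration of the update behind this, here is a minimal TD(0) sketch in Julia. The `Dict`-based value table, the state-hash keys, and the parameters `alpha` and `gamma` are illustrative assumptions, not the exact implementation.

```julia
# Minimal TD(0) sketch: state values live in a Dict keyed by a hashed game state.
# `alpha` (learning rate) and `gamma` (discount factor) are illustrative choices.
const StateValues = Dict{UInt64, Float64}

function td_update!(values::StateValues, s::UInt64, s_next::UInt64, reward::Float64;
                    alpha::Float64 = 0.1, gamma::Float64 = 0.9)
    v      = get(values, s, 0.0)       # current estimate V(s); unseen states default to 0
    v_next = get(values, s_next, 0.0)  # bootstrap from the successor state V(s')
    # Move V(s) toward the TD target r + γ V(s')
    values[s] = v + alpha * (reward + gamma * v_next - v)
    return values[s]
end
```

During self-play each player would apply an update like this along the trajectory of hashed states it visits, so repeated games refine the estimates.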
Our current approach is limited by the large state space of Catan, so the next step is to implement ideas for approximating the value of unseen states based on their proximity to already-explored states, as it is quite reasonable to expect some continuity in the feature space of the state value function. For example, if we consider a state that differs from an already-explored state in only a single feature (say, one extra wood card in hand), its value should be close to that of the explored state.
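One possible way to exploit this continuity is a nearest-neighbour lookup in feature space. The sketch below is an illustration under that assumption; the function name, the Euclidean distance metric, and the data layout are all hypothetical.

```julia
# Hypothetical sketch: approximate the value of an unseen state by averaging the
# values of its k nearest explored states in feature space (Euclidean distance).
function approximate_value(features::Vector{Int},
                           explored::Vector{Tuple{Vector{Int}, Float64}};
                           k::Int = 5)
    isempty(explored) && return 0.0
    # Distance from the query feature vector to every explored feature vector
    dists = [(sqrt(sum(abs2, features .- f)), v) for (f, v) in explored]
    sort!(dists; by = first)
    neighbours = dists[1:min(k, length(dists))]
    # Simple unweighted average of the neighbours' values
    return sum(last, neighbours) / length(neighbours)
end
```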
Based on experiments running TD learning on 4 players simultaneously, we reach ~2000 new states per game played. With our set of 32 integer-valued features, we can roughly estimate the size of the full state space.
Running `julia --project ./src/explore_temporal_difference_values.jl` starts tournaments of 4 `TemporalDifferencePlayer`s playing against each other, recording the estimated values of each hashed game state.
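The on-disk format of the recorded values is an implementation detail of the script; as a rough illustration, a `state hash => value` table can be persisted between runs with the standard-library `Serialization` module (the file path here is hypothetical).

```julia
using Serialization

# Hypothetical persistence of the hashed-state value table between tournament runs.
const VALUES_FILE = "./data/state_values.jls"  # assumed path, not necessarily the real one

load_values() = isfile(VALUES_FILE) ? deserialize(VALUES_FILE) : Dict{UInt64, Float64}()

save_values(values::Dict{UInt64, Float64}) = serialize(VALUES_FILE, values)
```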
After approximately 9000 games across 900 maps (10 games per randomly-generated map), we have explored ~20 million ($2.0\times 10^7$) states at least once, consistent with the ~2000 new states per game above. Given our above estimate of the state-space size, this is still only a small fraction of the states a player can encounter.
We test the temporal-difference-learned player against random players (`DefaultRobotPlayer`). Results, where `no_winner` occurs if we reach 5000 turns without anyone achieving 10 victory points:
| | No winner | TD player | Random player 1 | Random player 2 | Random player 3 |
|---|---|---|---|---|---|
| TD player first | 21 | 37 | 148 | 145 | 149 |
| TD player last | 27 | 40 | 135 | 150 | 148 |
So the TD player is performing badly, worse than a random player. To debug this, we implement a simple validation test, `test_feature_perturbations`, which constructs a feature vector and then perturbs one feature at a time, in a direction that should be evidently correlated with victory. For example, for `:SettlementCount`, we expect the underlying model output, the state reward, and the state value to increase if we change nothing except increasing the number of settlements a player has.
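A hedged sketch of such a perturbation check is given below; the scoring functions passed in (model output, state reward, state value) and the feature-vector representation are placeholders for whatever the real player exposes, not the actual `test_feature_perturbations` implementation.

```julia
# Illustrative perturbation check: increasing a "good" feature should not decrease
# any of the scoring functions (model output, state reward, state value).
function check_feature_perturbation(score_fns, features::Dict{Symbol, Int},
                                    feature::Symbol, delta::Int)
    perturbed = copy(features)
    perturbed[feature] = get(perturbed, feature, 0) + delta
    return all(score(perturbed) >= score(features) for score in score_fns)
end

# Example usage (placeholder names): starting from a null feature vector,
# a +1 settlement perturbation should not hurt any score.
# null_features = Dict(f => 0 for f in feature_names)
# check_feature_perturbation((model_output, state_reward, state_value),
#                            null_features, :SettlementCount, 1)
```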
Starting from a null feature vector:

- The Random Forest model fails on features `:SettlementCount` and `:CountWood` with +2 and +3 perturbations. All succeed with +1.
- The reward fails on features `:SumStoneDiceWeight` and `:CountYearOfPlenty`. All succeed with +1.
- The value succeeds on all features.
It is interesting (and slightly suspicious) that only two features fail for each of the RF model and the state reward, and that the two sets of problematic features are not the same.
| Player type | Mean game time |
|---|---|
| `Catan.DefaultRobotPlayer` | 332 ms |
From the `CatanLearning.jl/` directory,

```bash
julia --project -e "using CatanLearning; CatanLearning.create_new_model(\"./data/features.csv\", \"./data/model.jls\")"
```

will train a model using all the data in `./data/features.csv` with the label column `:WonGame`.
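Assuming the `.jls` file is written with the standard-library `Serialization` module (which the extension suggests), the trained model can be loaded back for inspection roughly like this:

```julia
using Serialization
using CatanLearning  # the package defining the model type must be loaded before deserializing

# Load the trained model from disk; assumes a plain Serialization round-trip.
model = deserialize("./data/model.jls")
```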