Reinforcement learning (example: maze solving)

Simple "reinforcement learning" models can solve surprisingly complex problems, and can be used to model human and animal behavior. The idea is to track the values of states and actions within those states, and gradually learn through trial and error which are the most valuable actions within each state. The values of good and bad outcomes can trickle back to earlier states through something called "temporal difference learning." The key to this is immediately updating the estimated values of actions with immediate results and with estimates of their down-stream consequences. In this demo, we'll see how a simple RL learner can figure out the shortest path to a goal state, where it gets a lot of reward, in just a few iterations of temporal difference learning.

We will use a method called Q-learning. In one form of Q-learning, states are represented by the variable \(s\). Each state is associated with a set of actions, with individual actions represented by the variable \(a\). The learner maintains a value, represented by \(Q\), for each possible state-action pair -- \(Q(s,a)\). When an action is taken, a new state is reached, and then the \(Q(s,a)\) corresponding to the action taken is updated. Two things feed these updates: (1) A rewarding or punishing event might occur. (2) The estimated future value of the next state, \(s'\), is fed back into \(Q(s,a)\). How do we determine that future value? In this case, we simply take the maximum \(Q\) available in the next state -- that is, the \(Q\) for the action \(a'\) with the highest Q-value in state \(s'\). That maximum serves as our best current estimate of how valuable the new state will turn out to be.

Thinking through the implementation of this update, we have to consider a few things. First, we might not want to simply replace our current estimate with the immediate reward plus the future value of the next state -- the environment might be noisy or changing, and fully overwriting the estimate would overreact to any single outcome. So, we will update \(Q(s,a)\) using the difference between what we expected to happen (exactly our old \(Q(s,a)\)) and what actually happened -- the "reward prediction error," usually referred to as the "temporal difference error" in this context -- weighted by a factor called the learning rate (referred to here with the symbol \(\alpha\)). Additionally, we might prefer not to take the next state's maximum future reward estimate at face value -- for example, future rewards might be noisier or worth less than immediate rewards. We will discount that element with a further factor called the discount factor, referred to with the symbol \(\gamma\). OK, here's the full equation for updating after taking action \(a\) from state \(s\), which leaves us in state \(s'\): \[Q_{t+1}(s,a)=\underbrace{Q_{t}(s,a)}_{\text{old value}}+\underbrace{\alpha}_{\text{learning rate}}\times\overbrace{\big[\underbrace{r_t}_{\text{reward}}+\underbrace{\gamma}_{\text{discount factor}} \times \underbrace{\max_{a'}Q_{t}(s',a')}_{\text{estimated future value of next state}}-Q_{t}(s,a)\big]}^{\text{temporal difference error, or reward prediction error}}\]
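To make the update rule concrete, here is a minimal Python sketch of that equation. The function name and the dictionary-of-dictionaries layout for the Q-table are my own illustrative choices, not the demo's actual code.

```python
def update_q(Q, s, a, reward, s_next, alpha=0.5, gamma=0.9):
    """One temporal-difference update to Q[s][a] (the gamma default here is an assumed value)."""
    # Estimated future value of the next state: its best available Q-value.
    future = max(Q[s_next].values()) if Q[s_next] else 0.0
    # Reward prediction error: what happened (reward + discounted future value)
    # minus what we expected (the old Q-value).
    td_error = reward + gamma * future - Q[s][a]
    # Move the old estimate part of the way toward the new information.
    Q[s][a] += alpha * td_error
    return td_error
```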

So, on each step, how do we make a choice? There are many ways to go from the Q-table of the present state to a choice. First, we could just choose the action with the maximum current Q-value, i.e. the \(a\) that satisfies \(\max_{a}Q(s,a)\). A problem with that is that we might get stuck in a pattern that isn't optimal, simply because our first few choices in this state turned out a certain way. So, in order to be sure that we'll converge on good solutions (and stay adaptive in a changing environment), we should also make some random choices sometimes. A simple way is to explore (choose randomly) with some probability, which we'll call \(\epsilon\), and otherwise pick the action with the best current Q-value. This basic tension, between maximizing rewards based on what we know and exploring the environment to discover what we're missing, is called the explore-exploit tradeoff. The strategy just described is called the epsilon-greedy strategy. There are lots of other solutions of greater complexity. For instance, one solution referred to as softmax turns all Q-values into probabilities of making each choice, with higher Q-values associated with larger probabilities, and includes a parameter that controls the degree of randomness (the extent to which these action probabilities tend to be even despite differing Q-values).
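Here is a sketch of both selection rules, again assuming Q[s] is a dictionary mapping each action available in state s to its current value estimate; the function names and the temperature parameter are illustrative, not taken from the demo.

```python
import math
import random

def choose_epsilon_greedy(Q, s, epsilon=0.1):
    """Epsilon-greedy: explore with probability epsilon, otherwise exploit."""
    actions = list(Q[s])
    if random.random() < epsilon:
        return random.choice(actions)                               # explore
    best = max(Q[s].values())
    # Break ties randomly so the all-zero Q-values at the start don't bias the agent.
    return random.choice([a for a in actions if Q[s][a] == best])   # exploit

def choose_softmax(Q, s, temperature=1.0):
    """Softmax: higher temperature -> flatter probabilities -> more randomness."""
    actions = list(Q[s])
    weights = [math.exp(Q[s][a] / temperature) for a in actions]
    return random.choices(actions, weights=weights)[0]
```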

Now let's get to our demo. On the right you see a grid or "maze" that needs to be navigated by our "agent." Our agent is the black square, and begins each "run" through this maze in the upper left corner. The goal-state, represented in green, is where the agent will receive a large reward, then be moved back to the beginning spot. Imagine we're simulating an experiment where a rat is dropped into one corner, and when it navigates to the opposite corner of the maze it gets a tasty treat and is removed from the maze.
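Before walking through the controls, here is a sketch of the maze as a tiny gridworld environment. I'm assuming a 4x4 grid with the start and goal in opposite corners (which matches a shortest path of 6 moves); the exact layout, coordinates, and orientation are assumptions, so the available moves may not line up exactly with what you see on screen.

```python
GRID_SIZE = 4                  # assumed size; shortest corner-to-corner path = 6 moves
START, GOAL = (0, 0), (3, 3)   # assumed start and goal coordinates
MOVES = {"UP": (-1, 0), "DOWN": (1, 0), "LEFT": (0, -1), "RIGHT": (0, 1)}

def available_actions(state):
    """Actions that keep the agent on the grid (unavailable ones appear as '--' in the demo)."""
    r, c = state
    return [a for a, (dr, dc) in MOVES.items()
            if 0 <= r + dr < GRID_SIZE and 0 <= c + dc < GRID_SIZE]

def step(state, action, move_cost=-1.0, goal_reward=100.0):
    """Take one action; return (next_state, reward, reached_goal)."""
    dr, dc = MOVES[action]
    next_state = (state[0] + dr, state[1] + dc)
    if next_state == GOAL:
        return next_state, goal_reward, True
    return next_state, move_cost, False
```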

Underneath the grid is a table that shows the current Q-values for the current position of the agent. All Q-values are set to 0 at the beginning, and I represent unavailable choices with "--". So at the start, the only options are DOWN and LEFT, and their Q-values are both 0. For these early choices, there's no difference between exploring and exploiting -- the choice will be random either way. Click "advance" to make one random choice.

All the Q-values of the next state are also 0, but I implemented a "cost" to moving, so the reward for the step is -1. Thus, the next time the agent is back in the start state, you'll notice that the Q-value for the action it took is \(\alpha\) times -1, or -0.5 with the default parameters (I've set \(\alpha\) to 0.5).
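Plugging the defaults into the update equation makes this concrete (the discount term drops out because every Q-value of the next state is still 0): \[Q_{t+1}(s,a) = 0 + 0.5 \times [\,-1 + \gamma \times 0 - 0\,] = -0.5\]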

Click advance a bunch of times. Eventually, the agent will reach the goal state, get a big (+100) reward, and be placed back at the beginning. If we repeat this over and over, the agent will find a very efficient path to the goal state! That's because the high reward value achieved in the goal state gets propagated "backwards" to the start state, via the updating equation described above.
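Putting the pieces together, here is a sketch of the whole training loop, reusing the update_q, choose_epsilon_greedy, available_actions, and step helpers sketched above (all illustrative; \(\alpha\) defaults to the demo's 0.5, while the \(\gamma\) and \(\epsilon\) defaults are assumptions):

```python
def train(n_runs=50, alpha=0.5, gamma=0.9, epsilon=0.1):
    """Run repeated trips from START to GOAL, learning Q-values along the way."""
    # Q-table: every state starts with a value of 0 for each available action.
    Q = {(r, c): {a: 0.0 for a in available_actions((r, c))}
         for r in range(GRID_SIZE) for c in range(GRID_SIZE)}
    steps_per_run = []
    for _ in range(n_runs):
        state, steps, done = START, 0, False
        while not done:
            action = choose_epsilon_greedy(Q, state, epsilon)
            next_state, reward, done = step(state, action)
            update_q(Q, state, action, reward, next_state, alpha, gamma)
            state = next_state
            steps += 1
        steps_per_run.append(steps)   # should fall toward the 6-step shortest path
    return Q, steps_per_run
```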

Click "run" to let the simulation run very quickly. The graph below the Q-table represents the number of steps required to achieve the goal state on each "run." Because the first run is a random walk, it might take many steps (e.g., I just ran it and it took 35 steps) to accidentally stumble into the goal state. However it will learn very quickly -- you should achieve less than 10 steps in only 5-10 iterations, and then it will level off close to 6 steps, which is the optimal (shortest) path possible. Sometimes the agent will randomly explore (according to the value of the \(\epsilon\) parameter), so it won't always be 6. The chart keeps track of the last 100 runs.

If you pause the run, you can wipe the slate clean with "reset learning." You can also, when paused, change some parameters (below the chart), to see how the learner reacts. Here are some things to try:

  • Train the learner, then pause, set epsilon to 0, update, and resume running. When you resume, the learner will always take one path straight to the goal and the graph below the maze will turn into a flat line at 6. Some versions of epsilon-greedy automatically reduce epsilon over time. In a stable environment, this is helpful. But in some environments, things change -- e.g., what if suddenly the experimenter gives a reward that is twice as valuable in the upper right corner? If epsilon is 0, the learner will never discover it once a path is carved out to the goal state.
  • Run a few times with cost to move of 0, and then a few times with cost to move of -1. Does the penalty for moving encourage rapid convergence to a shorter path?
  • Change the reward on finding goal to -100. By the third or fourth run, at least, the learner will avoid the "goal state" entirely -- except when it accidentally (due to exploration rate) stumbles into it.
  • Try different values of the learning rate and discount factor. How do these affect success and the rate of learning?

For experiments where we want to understand the behavior of animals or people, we might approximate the experimental environment in a similar way, and try to characterize behavior with an algorithm ("model") like this one. In that case, we would try to select the parameters (values of \(\alpha\), \(\gamma\), and \(\epsilon\)) that best reproduce the observed rate of learning and exploration behavior. These can then be used as estimates of underlying constructs -- e.g., differences in fitted parameters across time or people could help us understand how experimental conditions, or natural and artificially-induced individual differences, affect the learning processes that drive behavior.
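As a deliberately simplified illustration of that idea, the sketch below grid-searches \(\alpha\), \(\gamma\), and \(\epsilon\) for the combination whose simulated steps-per-run curve best matches an observed one, reusing the train function sketched above. Real analyses usually fit trial-by-trial choice likelihoods rather than summary curves, and observed_steps here is hypothetical input, not data from the demo.

```python
import itertools

def fit_parameters(observed_steps):
    """Crude grid search: pick the (alpha, gamma, epsilon) whose simulated
    learning curve is closest (in squared error) to the observed one."""
    best, best_err = None, float("inf")
    for alpha, gamma, epsilon in itertools.product(
            [0.1, 0.3, 0.5, 0.7], [0.5, 0.7, 0.9], [0.0, 0.1, 0.2]):
        _, simulated = train(len(observed_steps), alpha, gamma, epsilon)
        err = sum((s - o) ** 2 for s, o in zip(simulated, observed_steps))
        if err < best_err:
            best, best_err = (alpha, gamma, epsilon), err
    return best
```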



[Q-table for current position: value estimates for the UP, DOWN, LEFT, and RIGHT actions]


Parameters and attributes
Change these to alter the behavior of the model. Recommended ranges are suggested; if you go outside of those ranges, you might get weird behavior -- but that can be fun. You will need to pause the model and press "update" below to actually execute the changes.