This Demonstration simulates the state values in the cliff walking problem of reinforcement learning. In this scenario, a grid is shown with the start and goal states located at the bottom-left and bottom-right corners, respectively. The black grid squares at the bottom are called "the cliff" and should be avoided. The agent's task is to get from the initial state to the terminal state using the actions up, down, right and left, each chosen with the same probability. If the agent falls off the cliff (enters a black square), it receives a negative cliff reward as a penalty and is sent back to the start. The colors in the grid correspond to the computed state values: lighter blues mark the preferred states, and the agent tries to avoid the darker blues. The goal is to achieve the highest possible return for a given discount rate.
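The environment dynamics described here can be sketched in a few lines of Python. This is only an illustrative sketch, not the Demonstration's own code: the 4 × 12 grid size and the default reward values of -1 and -100 are assumptions borrowed from the textbook version of the problem, and the names `step`, `START`, `GOAL` and `CLIFF` are hypothetical.

```python
# Assumed layout (not stated in the text): 4 rows x 12 columns,
# with row 0 at the top and the cliff along the bottom row.
ROWS, COLS = 4, 12
START = (ROWS - 1, 0)            # bottom-left corner
GOAL = (ROWS - 1, COLS - 1)      # bottom-right corner
CLIFF = {(ROWS - 1, c) for c in range(1, COLS - 1)}
ACTIONS = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}


def step(state, action, step_reward=-1.0, cliff_reward=-100.0,
         terminal_reward=-1.0):
    """Return (next_state, reward) for one move; reward values are assumed."""
    dr, dc = ACTIONS[action]
    # Clamping keeps the agent in place when it walks into a wall.
    r = min(max(state[0] + dr, 0), ROWS - 1)
    c = min(max(state[1] + dc, 0), COLS - 1)
    nxt = (r, c)
    if nxt in CLIFF:
        # Falling off the cliff: penalty, then back to the start.
        return START, cliff_reward
    if nxt == GOAL:
        return nxt, terminal_reward
    return nxt, step_reward
```

For example, `step(START, "right")` moves the agent onto the cliff, so it returns the start state together with the cliff reward.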

Contributed by: Mónika Farsang (December 2020)

Open content licensed under CC BY-NC-SA

## Snapshots

## Details

The agent has an unlimited number of steps. If it hits a wall (e.g. by choosing the "up" action in the top row), it stays in place and receives the step reward. With the default settings, reaching the terminal state earns no special reward. However, it is interesting to observe what happens when the terminal reward is raised. With a low discount rate, immediate rewards count for more than later rewards. If the discount rate is close to 1, all of the agent's future rewards are counted with almost equal weights. The Bellman equation for the state-value function $v_\pi$ for policy $\pi$ in state $s$ is given by:

$$v_\pi(s) = \sum_{a} \pi(a \mid s) \sum_{s'} P^{a}_{ss'} \left[ R^{a}_{ss'} + \gamma \, v_\pi(s') \right],$$

where $s$ is the current state, $a$ is the action performed in the given state, $s'$ is the next state, $\pi(a \mid s)$ is the probability that the policy selects action $a$ in state $s$, $P^{a}_{ss'}$ is the transition probability for a given action, $R^{a}_{ss'}$ is the expected reward and $\gamma$ is the discount rate. The calculated values in the grid are the state-value functions for a random policy, in which each of the four directions is chosen with the same probability.
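One common way to obtain such state values is iterative policy evaluation: apply the Bellman update to every state repeatedly until the values stop changing. The sketch below does this for the equal-probability random policy. It is a minimal illustration under stated assumptions, not the Demonstration's implementation: the 4 × 12 grid, the default rewards of -1 and -100, and the function name `evaluate_random_policy` are all assumed.

```python
def evaluate_random_policy(step_reward=-1.0, terminal_reward=-1.0,
                           cliff_reward=-100.0, gamma=0.9,
                           rows=4, cols=12, tol=1e-10):
    """Iterative policy evaluation for the uniform random policy.

    Grid size and default rewards are assumptions, not values from the text.
    Requires gamma < 1 so that the sweeps converge.
    """
    start, goal = (rows - 1, 0), (rows - 1, cols - 1)
    cliff = {(rows - 1, c) for c in range(1, cols - 1)}
    moves = [(-1, 0), (1, 0), (0, -1), (0, 1)]  # up, down, left, right
    states = [(r, c) for r in range(rows) for c in range(cols)
              if (r, c) not in cliff and (r, c) != goal]
    v = {s: 0.0 for s in states}
    v[goal] = 0.0  # terminal state: no further reward after arrival

    while True:
        delta = 0.0
        for s in states:
            total = 0.0
            for dr, dc in moves:
                # Walls: clamp the move so the agent stays in place.
                nxt = (min(max(s[0] + dr, 0), rows - 1),
                       min(max(s[1] + dc, 0), cols - 1))
                if nxt in cliff:
                    nxt, reward = start, cliff_reward
                elif nxt == goal:
                    reward = terminal_reward
                else:
                    reward = step_reward
                # Each action has probability 1/4; moves are deterministic.
                total += 0.25 * (reward + gamma * v[nxt])
            delta = max(delta, abs(total - v[s]))
            v[s] = total  # in-place update within the sweep
        if delta < tol:
            return v
```

Calling `evaluate_random_policy()` returns a dictionary that maps each `(row, column)` pair to its state value; changing `gamma` or the reward arguments mimics moving the corresponding controls.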

The goal of the agent is to maximize its expected return, hence the greedy policy is to move toward the neighboring square with the highest state value.
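A greedy choice can be read off a table of state values with a one-step lookahead: for each action, add the immediate reward to the discounted value of the resulting square and pick the largest total. When all non-cliff moves earn the same reward, this reduces to stepping toward the highest-valued neighbor. The sketch below is illustrative only; the grid size, the default rewards and the name `greedy_action` are assumptions.

```python
def greedy_action(v, state, gamma=0.9, step_reward=-1.0,
                  terminal_reward=-1.0, cliff_reward=-100.0,
                  rows=4, cols=12):
    """Pick the action with the best one-step lookahead value.

    `v` maps (row, col) to a state value. Grid size and rewards are
    assumed defaults, not values taken from the text.
    """
    start, goal = (rows - 1, 0), (rows - 1, cols - 1)
    moves = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}
    best_name, best_q = None, float("-inf")
    for name, (dr, dc) in moves.items():
        # Walls: clamp the move so the agent stays in place.
        nxt = (min(max(state[0] + dr, 0), rows - 1),
               min(max(state[1] + dc, 0), cols - 1))
        if nxt[0] == rows - 1 and 0 < nxt[1] < cols - 1:
            # Cliff square: penalty and return to the start.
            nxt, reward = start, cliff_reward
        elif nxt == goal:
            reward = terminal_reward
        else:
            reward = step_reward
        # Squares missing from the table (e.g. the terminal one) count as 0.
        q = reward + gamma * v.get(nxt, 0.0)
        if q > best_q:  # ties keep the first action in dictionary order
            best_name, best_q = name, q
    return best_name
```

With a hypothetical value table equal to the negative Manhattan distance to the goal, the function picks "right" from a square in the row just above the cliff and "down" from the square directly above the goal.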

Snapshot 1: The initial environment with step and terminal reward , cliff reward and discount rate 0.9. In this scenario, the state values near the cliff are extremely low. The strategy would be to go up to the top, turn right until reaching the right edge and then go down to the terminal state.

Snapshot 2: A modified environment with step and terminal reward , cliff reward and discount rate 0.9. Although the policy would be the same as in Snapshot 1, the state values are higher and more balanced.

Snapshot 3: A modified environment with step reward 0, terminal reward , cliff reward and discount rate 0.82. In this case, the terminal reward is increased and some states have positive state values, and yet the agent does not change behavior.

Snapshot 4: A modified environment with step reward , terminal reward , cliff reward 0 and discount rate 0.9. With these settings, the policy shifts: the agent goes up only one square, moves right along the row above the cliff and then goes down one square, which is the shortest path.

The cliff walking problem is described in [1].

References

[1] R. S. Sutton and A. G. Barto, *Reinforcement Learning: An Introduction*, 2nd ed., Cambridge, MA: The MIT Press, 2018.

[2] J. Zhang, "Reinforcement Learning—Cliff Walking Implementation," *Towards Data Science*. (Nov. 2, 2020) towardsdatascience.com/reinforcement-learning-cliff-walking-implementation-e40ce98418d4.
