Iterative Policy Evaluation in a Nutshell
Check if an agent is doing the right thing with Policy Evaluation Methods
Iterative Policy Evaluation in a Gridworld
In this post, we explore policy evaluation — a key step in reinforcement learning that allows us to determine how good a given policy is. Using a simple maze environment, we demonstrate how iterative policy evaluation helps compute the expected return (value) of each state under different policies, and how this lays the groundwork for policy improvement.
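For reference (standard notation that the original walkthrough leaves implicit), iterative policy evaluation repeatedly sweeps every state with the Bellman expectation backup

$$
V_{k+1}(s) = \sum_{a} \pi(a \mid s) \sum_{s'} p(s' \mid s, a)\,\big[r(s, a, s') + \gamma\, V_k(s')\big],
$$

until the values stop changing. The `iterative_policy_eval` function later in the post implements exactly this update with $\gamma = 0.9$.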
Gridworld Setup
We define a 5x5 grid environment with four actions: UP, DOWN, LEFT, and RIGHT:
|   | 0 | 1 | 2 | 3 | 4 |
|---|---|---|---|---|---|
| 0 | S | . | . | X | G |
| 1 | . | X | . | X | . |
| 2 | . | X | . | . | . |
| 3 | . | . | . | X | . |
| 4 | X | X | . | . | . |
```python
# Constants for actions
UP, DOWN, LEFT, RIGHT = 0, 1, 2, 3
ACTIONS = [UP, DOWN, LEFT, RIGHT]
ACTION_DELTAS = {
    UP: (-1, 0),
    DOWN: (1, 0),
    LEFT: (0, -1),
    RIGHT: (0, 1),
}

# Maze config
MAZE_ROWS = 5
MAZE_COLS = 5
WALLS = {(0, 3), (1, 1), (1, 3), (2, 1), (3, 3), (4, 0), (4, 1)}
GOAL = (0, 4)
```
The grid cells are labeled as follows:

- S = Start
- G = Goal (terminal)
- X = Wall (unreachable)
- . = Empty state
```python
from typing import Tuple


class State:
    def __init__(self, name: Tuple[int, int], is_terminal: bool = False):
        self.name = name
        self.is_terminal = is_terminal
```
- The agent receives a reward of -1 per step and 0 upon reaching the goal. Transitions are deterministic.
```python
from typing import List, Tuple


class Env:
    def __init__(self):
        self.states = []
        self.state_map = {}
        for r in range(MAZE_ROWS):
            for c in range(MAZE_COLS):
                if (r, c) not in WALLS:
                    is_terminal = (r, c) == GOAL
                    state = State((r, c), is_terminal)
                    self.states.append(state)
                    self.state_map[(r, c)] = state
        self.actions = ACTIONS

    def step(self, state: State, action: int) -> List[Tuple[State, float, float]]:
        r, c = state.name
        dr, dc = ACTION_DELTAS[action]
        nr, nc = r + dr, c + dc
        # Stay in place if moving out of bounds or into a wall
        if (nr < 0 or nr >= MAZE_ROWS or nc < 0 or nc >= MAZE_COLS
                or (nr, nc) in WALLS):
            next_pos = (r, c)
        else:
            next_pos = (nr, nc)
        next_state = self.state_map[next_pos]
        reward = 0.0 if next_pos == GOAL else -1.0
        return [(next_state, reward, 1.0)]  # deterministic: single outcome with prob 1
```
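As a quick illustration (not part of the original post), this is how `step` behaves from the start square, assuming the `Env` class and constants defined above:

```python
# Illustrative probe of the deterministic transition function.
env = Env()
start = env.state_map[(0, 0)]

next_state, reward, prob = env.step(start, UP)[0]
print(next_state.name, reward, prob)  # (0, 0) -1.0 1.0 -> blocked by the top edge, stays put

next_state, reward, prob = env.step(start, RIGHT)[0]
print(next_state.name, reward, prob)  # (0, 1) -1.0 1.0 -> moves one cell to the right
```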
Policy Evaluation Code
- We define the agent policies and a core function for iterative policy evaluation.
```python
def iterative_policy_eval(agent, env: Env, gamma: float = 0.9, threshold: float = 1e-6):
    values = {state.name: 0.0 for state in env.states}
    while True:
        delta = 0.0
        for state in env.states:
            if state.is_terminal:
                continue
            old_value = values[state.name]
            new_value = 0.0
            for action in env.actions:
                action_prob = agent.policy(state, action)
                for next_state, reward, prob in env.step(state, action):
                    new_value += action_prob * prob * (reward + gamma * values[next_state.name])
            values[state.name] = new_value
            delta = max(delta, abs(old_value - new_value))
        if delta < threshold:
            break
    return values


if __name__ == '__main__':
    env = Env()
    agent = Agent()  # Agent is whichever policy class from the sections below
    values = iterative_policy_eval(agent, env)
    for state in sorted(env.states, key=lambda s: s.name):
        print(f"State {state.name}: V = {values[state.name]:.2f}")
```
Evaluating Different Policies
1. Left-Action-Only Agent
```python
class Agent:
    def policy(self, state: State, action: int) -> float:
        return 1 if action == LEFT else 0
```
```
State (0, 0): V = -10.00
State (0, 1): V = -10.00
State (0, 2): V = -10.00
State (0, 4): V = 0.00
State (1, 0): V = -10.00
State (1, 2): V = -10.00
State (1, 4): V = -10.00
State (2, 0): V = -10.00
State (2, 2): V = -10.00
State (2, 3): V = -10.00
State (2, 4): V = -10.00
State (3, 0): V = -10.00
State (3, 1): V = -10.00
State (3, 2): V = -10.00
State (3, 4): V = -10.00
State (4, 2): V = -10.00
State (4, 3): V = -10.00
State (4, 4): V = -10.00
```
- This policy is ineffective: moving only left, the agent either bumps into a wall or the grid boundary and stays put, or drifts along the left side of the maze, but it never reaches the goal.
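To see where the -10.00 comes from (a short derivation, not part of the original output): with $\gamma = 0.9$, a policy that never reaches the goal keeps collecting a reward of -1 forever, so every non-terminal state converges to

$$
V(s) = \sum_{k=0}^{\infty} \gamma^{k}\,(-1) = -\frac{1}{1-\gamma} = -10.
$$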
2. Stochastic Agent (Uniform Random Policy)
```python
class Agent:
    def policy(self, state: State, action: int) -> float:
        return 1.0 / 4
```
```
State (0, 0): V = -9.96
State (0, 1): V = -9.93
State (0, 2): V = -9.87
State (0, 4): V = 0.00
State (1, 0): V = -9.96
State (1, 2): V = -9.76
State (1, 4): V = -4.53
State (2, 0): V = -9.96
State (2, 2): V = -9.54
State (2, 3): V = -8.89
State (2, 4): V = -7.75
State (3, 0): V = -9.93
State (3, 1): V = -9.87
State (3, 2): V = -9.76
State (3, 4): V = -8.82
State (4, 2): V = -9.75
State (4, 3): V = -9.64
State (4, 4): V = -9.37
```
- The agent eventually reaches the goal, but with many wasted steps due to randomness. The values are better than just moving left, but still far from optimal.
3. Better-than-Random Policy
- From state (1, 4), the UP action leads straight to the goal, so it has a much higher action value than the alternatives. What if we account for that in our policy?
```python
class Agent:
    def policy(self, state: State, action: int) -> float:
        if state.name == (1, 4):
            return 1 if action == UP else 0
        return 1.0 / 4
```
```
State (0, 0): V = -9.92
State (0, 1): V = -9.87
State (0, 2): V = -9.77
State (0, 4): V = 0.00
State (1, 0): V = -9.94
State (1, 2): V = -9.56
State (1, 4): V = 0.00
State (2, 0): V = -9.92
State (2, 2): V = -9.15
State (2, 3): V = -7.97
State (2, 4): V = -5.88
State (3, 0): V = -9.87
State (3, 1): V = -9.77
State (3, 2): V = -9.56
State (3, 4): V = -7.85
State (4, 2): V = -9.55
State (4, 3): V = -9.35
State (4, 4): V = -8.85
```
- This single change lifts the value of state (1, 4) to 0.00 and pulls up the values of the states that route through it, such as (2, 4) and (2, 3).
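The same conclusion can be reached mechanically with a one-step lookahead. Below is a small sketch (not from the original post; `RandomAgent` is just a hypothetical rename of the uniform policy to avoid clashing with the Agent class above) that scores each action at (1, 4) against the random policy's values:

```python
# Hedged sketch: one-step lookahead at (1, 4) using the uniform random
# policy's value function. RandomAgent is a hypothetical helper name.
class RandomAgent:
    def policy(self, state, action):
        return 1.0 / len(ACTIONS)

env = Env()
values = iterative_policy_eval(RandomAgent(), env)
state = env.state_map[(1, 4)]
for action in env.actions:
    # Q(s, a) = sum over outcomes of p * (r + gamma * V(s'))
    q = sum(prob * (reward + 0.9 * values[next_state.name])
            for next_state, reward, prob in env.step(state, action))
    print(f"Q((1, 4), action={action}) = {q:.2f}")
```

UP comes out on top because it steps straight into the goal, which is exactly the action the adjusted policy above commits to.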
4. Hand-Crafted Optimal Policy
```python
class Agent:
    def __init__(self):
        self.policy_map = {
            (0, 0): RIGHT, (0, 1): RIGHT, (0, 2): DOWN,
            (1, 2): DOWN, (2, 2): RIGHT, (2, 3): RIGHT,
            (2, 4): UP, (1, 4): UP,
        }

    def policy(self, state, action):
        # If state is a State object, extract the tuple
        if hasattr(state, "name"):
            state = state.name
        optimal_action = self.policy_map.get(state)
        if optimal_action is None:
            return 0
        return 1 if action == optimal_action else 0
```
```
State (0, 0): V = -5.22
State (0, 1): V = -4.69
State (0, 2): V = -4.10
State (0, 4): V = 0.00
State (1, 0): V = 0.00
State (1, 2): V = -3.44
State (1, 4): V = 0.00
State (2, 0): V = 0.00
State (2, 2): V = -2.71
State (2, 3): V = -1.90
State (2, 4): V = -1.00
State (3, 0): V = 0.00
State (3, 1): V = 0.00
State (3, 2): V = 0.00
State (3, 4): V = 0.00
State (4, 2): V = 0.00
State (4, 3): V = 0.00
State (4, 4): V = 0.00
```
- This policy maximizes expected return by taking the shortest path to the goal. (States not covered by the `policy_map` assign zero probability to every action, so their values never move from the initial 0.0; they are simply unreachable under this policy, not genuinely high-value.)
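As a quick sanity check (worked out here, not shown in the original post), the start state sees seven -1 rewards followed by a free step into the goal, so

$$
V(0,0) = \sum_{k=0}^{6} 0.9^{k}\,(-1) = -\frac{1 - 0.9^{7}}{1 - 0.9} \approx -5.22,
$$

which matches the value reported above.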
How It's Used in RL
- The value function measures how good it is to be in a state, under a given policy.
- Iterative policy evaluation refines these values over time.
- By updating the policy to choose actions that maximize future rewards, we can move toward optimal behavior; a sketch of this greedy improvement step appears after this list.
- Exercise: Use the Bellman expectation equation to set policy evaluation up as a system of linear equations and solve it directly. Do you get the same values for each policy?
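To close the loop on the policy-improvement point above, here is a minimal sketch (not from the original post; `greedy_policy_improvement` is a hypothetical helper name) of a one-step greedy improvement over whatever value function `iterative_policy_eval` returns:

```python
# Hedged sketch: derive a greedy policy from an existing value function.
# greedy_policy_improvement is a hypothetical helper, not part of the post's code.
def greedy_policy_improvement(env: Env, values: dict, gamma: float = 0.9) -> dict:
    policy_map = {}
    for state in env.states:
        if state.is_terminal:
            continue
        # One-step lookahead: Q(s, a) = sum over outcomes of p * (r + gamma * V(s'))
        q_values = {
            action: sum(
                prob * (reward + gamma * values[next_state.name])
                for next_state, reward, prob in env.step(state, action)
            )
            for action in env.actions
        }
        policy_map[state.name] = max(q_values, key=q_values.get)
    return policy_map


# Usage idea: evaluate any policy above, then act greedily on its values.
# improved = greedy_policy_improvement(env, iterative_policy_eval(agent, env))
```

Alternating evaluation of the current policy with this greedy step is the policy iteration algorithm, which converges to an optimal policy in a finite MDP like this maze.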