Reinforcement Learning

Reinforcement Learning (RL) is a machine learning method where an agent learns to make decisions through trial and error, which means rewards for correct actions and penalties for mistakes.
The aim is to enforce correct actions to maximize future rewards.
For example, when our child behaves well, we express appreciation, and when the child makes a mistake, we offer corrective feedback or reprimand. Over time, this helps the child gradually learn to distinguish between acceptable and unacceptable behavior.

Reinforcement Learning (RL) Process

A standard reinforcement learning framework (Markov Decision Process) can be illustrated using the following diagrammatic representation:

Key Components:

Agent – The learner/decision-maker that chooses actions based on the current state S_t
Environment – The external system that responds to the agent’s actions and provides feedback.
State (Sₜ) – The current situation or observation the agent receives from the environment at time t.
Action (Aₜ) – The move or decision the agent takes at time t to influence the environment.
Reward (Rₜ) – The feedback signal from the environment indicating the success or failure of the action at time t.

The cycle works as flow:

At time t, the environment sends the state (Sₜ) and reward (Rₜ) to the agent.
The agent chooses an action (Aₜ).
The environment processes that action, updates its situation, and produces the next state (Sₜ₊₁) and reward (Rₜ₊₁).
The dotted line indicates the future state and reward (Sₜ₊₁, Rₜ₊₁). It represents the transition from the current step to the next step in time, showing that RL is a sequential process where each step influences the next.

How it works:

Interaction: The agent observes the current state of the environment and selects an action based on its current understanding of the environment (policy π).
Action Execution: The chosen action is performed, and the environment transitions to a new state (S_t+1).
Feedback: The environment provides a reward signal to the agent, indicating the outcome of the action.
Learning: The agent uses this reward (and potentially the new state) to update its understanding of the environment, i.e, update its value/Q-function and improve its decision-making strategy (policy π).
Iteration: This process of interaction, feedback, and learning repeats, with the agent progressively refining its policy to maximize its cumulative reward over time (called return)

Types of Reinforcement Learning

Reinforcement learning (RL) has several approaches, but three foundational methods are under two types of Reinforcement Learning:

1. Model-Based RL

The agent builds a model of the environment to predict outcomes before acting.

1.1 Dynamic Programming (DP)

Breaks a big problem into smaller steps and uses a complete model of the environment to decide the best actions.
Key Equation – Bellman Equation:

2. Model-Free RL

The agent learns from experience without knowing the full rules of the environment.

2.1 Monte Carlo (MC)

Learns only from complete episodes of experience by averaging returns for each state–action pair.
Key Equation – Value Function Estimate:

2.2 Temporal Difference (TD) Learning

Learns from raw interaction but updates after every step without waiting for the episode to finish.
Key Equation – TD Update Rule:

2.2.1 SARSA (State–Action–Reward–State–Action)

It is on-policy Temporal Difference (TD) learning
The agent learns the value of the policy it is following (including its exploration).
Process:

1. Start in state , take action .
2. Receive reward and move to next state .
3. Choose the next action based on the current policy.
4. Update Q-value using:

Key Point: SARSA learns from the sequence (S, A, R, S’, A’), so it evaluates and improves the same policy it uses.

2.2.2 Q-Learning

It is off-policy Temporal Difference (TD) learning
The agent learns the optimal policy regardless of the actions it’s currently taking for exploration.
Process:

1. Start in state , take action .
2. Receive reward and move to next state .
3. Look at the best possible next action (max Q-value), even if the agent wouldn’t actually choose it during exploration.
4. Update Q-value using:

Key Point: Q-learning always learns towards the best possible policy (greedy with respect to Q-values), even if it explores randomly during learning.