
Algorithms

  • On-Policy Algorithms
    • the most basic, entry-level, and oldest family of algorithms
    • Vanilla Policy Gradient
    • cannot reuse old data, so they are weaker on sample efficiency
    • these algorithms optimize directly for policy performance
    • Trade off sample efficiency in favor of stability
    • TRPO and PPO
  • Off-Policy Algos
    • younger, connected to Q-learning algorithms; they learn a Q-function and a policy, which are updated to improve each other
    • Can reuse old data very efficiently (see the replay-buffer sketch after this list)
      • they get this benefit by exploiting Bellman’s equations for optimality
    • No guarantee that doing a good job of satisfying Bellman’s equations leads to great policy performance.
      • Empirically one can get good performance, and when it works the sample efficiency is wonderful, but the absence of guarantees makes algorithms in this class potentially brittle and unstable
    • Deep Deterministic Policy Gradient (DDPG) is the foundational algorithm here
    • TD3 and SAC are descendants of DDPG
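Off-policy methods get their data reuse by storing past transitions and sampling from them during updates. A minimal sketch of that mechanism, assuming a simple uniform replay buffer (names and sizes here are illustrative, not taken from any specific library):

```python
# Minimal replay buffer sketch: off-policy algorithms (DDPG, TD3, SAC) keep old
# transitions around and sample random minibatches from them for their updates.
import random
from collections import deque

class ReplayBuffer:
    def __init__(self, capacity=100_000):
        # deque evicts the oldest transitions once capacity is reached
        self.buffer = deque(maxlen=capacity)

    def store(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size=64):
        # Uniformly sample past transitions, possibly gathered by older policies
        batch = random.sample(self.buffer, batch_size)
        states, actions, rewards, next_states, dones = zip(*batch)
        return states, actions, rewards, next_states, dones
```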

Introduction to Reinforcement Learning

Part 1: Key Concepts in RL

Key Concepts and Terminology

../_images/rl_diagram_transparent_bg.png Main characters of RL

  • Agent - decides what actions to take, and perceives a reward signal from the environment (a number saying how good or bad the current world state is). GOAL: maximize cumulative reward, called the return.
  • Environment - world that the agent lives in and interacts with

State - a complete description of the state of the world; there is no information about the world hidden from the state

Observation - a partial description of a state, which may omit information

Fully observed - when the agent observes the complete state of the environment

Partially observed - when the agent sees only a partial observation

Action Spaces - the set of all valid actions in a given environment

  • Discrete action spaces - only a finite number of moves is available
  • Continuous action spaces - actions are real-valued vectors
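As a concrete illustration (using the Gymnasium library's space objects as one common representation; the specific sizes and bounds are made up):

```python
# Illustrative only: discrete vs. continuous action spaces, represented here
# with Gymnasium space objects. The sizes/bounds are made-up examples.
import numpy as np
import gymnasium as gym

# Discrete action space: a finite number of moves (here, 4 of them)
discrete_space = gym.spaces.Discrete(4)
print(discrete_space.sample())        # e.g. 2

# Continuous action space: a real-valued vector (here, 3 values in [-1, 1])
continuous_space = gym.spaces.Box(low=-1.0, high=1.0, shape=(3,), dtype=np.float32)
print(continuous_space.sample())      # e.g. [ 0.13 -0.87  0.42]
```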

Definition of stochastic - describes a process or model that involves some level of randomness or unpredictability.

Policy - rule used by an agent to decide what actions to take

  • Deterministic Policy: $a_t = \mu(s_t)$

      • $a_t$ is the action to take at time $t$
      • $=$ implies a deterministic relationship
    • given a particular state, the agent will always take the same action; there is no randomness. Given state $s_t$ it will always take $\mu(s_t)$
  • Stochastic Policy: $a_t \sim \pi(\cdot \mid s_t)$

    • $a_t$: This represents the action taken by the agent at time $t$.
    • $\sim$: This symbol means “sampled from.” It denotes that the action $a_t$ is drawn from the probability distribution that follows.
      • implies there is a range of actions that could be taken
    • $\pi$: This is the policy, which, in the context of a stochastic policy, is a probability distribution over actions.
    • $\mid$ denotes conditional probability. Read as “given.” Given $s_t$ you have a distribution over actions
    • $\pi(\cdot \mid s_t)$: This is the probability distribution over actions given the current state $s_t$ at time $t$. The dot in the parentheses is a placeholder for the action space. It indicates that the policy provides probabilities for all possible actions that can be taken in state $s_t$.
      • When you see $\pi(a_t \mid s_t)$, this is read as “the probability of taking action $a_t$ given the state $s_t$” under the policy $\pi$.
    • $s_t$: This is the state of the environment (as seen by the agent) at time $t$.
  • Types of Stochastic Policies

    • Categorical Policies - discrete action space
      • A categorical policy is like a classifier over discrete actions. You build the neural network for a categorical policy the same way you would for a classifier: the input is the observation, followed by some number of layers (possibly convolutional or densely-connected, depending on the kind of input), and then you have one final linear layer that gives you logits for each action, followed by a softmax to convert the logits into probabilities.
    • Diagonal Gaussian Policies - continuous action spaces. A neural network maps observations to mean actions $\mu_\theta(s)$, and randomness is added via a vector of standard deviations $\sigma$ (a diagonal covariance matrix: because it is diagonal, the action dimensions are independent and all of the off-diagonal “covariances” are 0). An action is sampled as $a = \mu_\theta(s) + \sigma \odot z$ with noise $z \sim \mathcal{N}(0, I)$, where $\odot$ is the element-wise product.
      • The added randomness via the standard deviations can be interpreted as the model’s uncertainty over its choice. This uncertainty can be fixed (a standalone set of parameters), or it can be generated by the neural network so that it has different amounts of uncertainty per input. (A minimal sketch of both categorical and diagonal Gaussian policy networks follows this list.)
  • Parameterized Policies - policies whose outputs are computable functions that depend on a set of parameters (e.g. the weights and biases of a neural network) which we can adjust to change the behavior via some optimization algorithm

    • The parameters of these policies are usually denoted by $\theta$ or $\phi$, and are then written as a subscript on the policy symbol to show the connection: $\pi_\theta(\cdot \mid s)$, $\mu_\theta(s)$
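A minimal PyTorch sketch of both policy types described above. This is only a sketch under simple assumptions (small MLP, state-independent log-stds for the Gaussian policy); layer sizes and class names are illustrative.

```python
# Sketch of a categorical policy (discrete actions) and a diagonal Gaussian
# policy (continuous actions). Sizes and names are illustrative only.
import torch
import torch.nn as nn
from torch.distributions import Categorical, Normal

def mlp(sizes):
    # Small fully-connected network: obs -> hidden -> output
    layers = []
    for i in range(len(sizes) - 1):
        layers += [nn.Linear(sizes[i], sizes[i + 1])]
        if i < len(sizes) - 2:
            layers += [nn.Tanh()]
    return nn.Sequential(*layers)

class CategoricalPolicy(nn.Module):
    """Discrete action space: network outputs logits; softmax gives probabilities."""
    def __init__(self, obs_dim, n_actions):
        super().__init__()
        self.logits_net = mlp([obs_dim, 64, n_actions])

    def forward(self, obs):
        logits = self.logits_net(obs)
        return Categorical(logits=logits)    # softmax applied internally

class DiagonalGaussianPolicy(nn.Module):
    """Continuous action space: network outputs mean actions; log-stds are
    state-independent learned parameters (one common design choice)."""
    def __init__(self, obs_dim, act_dim):
        super().__init__()
        self.mu_net = mlp([obs_dim, 64, act_dim])
        self.log_std = nn.Parameter(-0.5 * torch.ones(act_dim))

    def forward(self, obs):
        mu = self.mu_net(obs)
        std = torch.exp(self.log_std)
        return Normal(mu, std)               # independent per-dimension Gaussians

# Usage: sample a_t ~ pi(.|s_t) and compute log pi(a_t|s_t)
obs = torch.randn(1, 8)
pi = DiagonalGaussianPolicy(obs_dim=8, act_dim=2)(obs)
a = pi.sample()
logp = pi.log_prob(a).sum(dim=-1)            # sum over action dimensions
```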

Trajectories

  • A trajectory $\tau$ is a sequence of states and actions, $\tau = (s_0, a_0, s_1, a_1, \ldots)$
    • The first state $s_0$ is randomly sampled from the start-state distribution: $s_0 \sim \rho_0(\cdot)$
      • $\cdot$ acts as a placeholder for a variable (here, the state being sampled).

State Transitions - what happens to the world between the state at time $t$ and the state at $t+1$ is defined by the natural laws of the environment. (Trajectories are also frequently called episodes or rollouts.)

  • Deterministic world environment: $s_{t+1} = f(s_t, a_t)$
  • Stochastic: $s_{t+1} \sim P(\cdot \mid s_t, a_t)$
    • We are sampling the next state ($s_{t+1}$) from the probability distribution given the current state and the action
    • $P$ - stands for the state transition probability; $P(s_{t+1} \mid s_t, a_t)$ is the probability of ending up in a particular next state
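A toy sketch (all numbers invented) of stochastic transitions and a short rollout, to make the sampling notation concrete:

```python
# Toy tabular MDP sketch (all numbers invented): sample s_{t+1} ~ P(.|s_t, a_t)
# and roll out a short trajectory tau = (s_0, a_0, s_1, a_1, ...).
import numpy as np

n_states, n_actions = 3, 2
rng = np.random.default_rng(0)

# P[s, a] is a probability distribution over next states
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))
rho_0 = np.array([1.0, 0.0, 0.0])               # start-state distribution

def random_policy(s):
    return rng.integers(n_actions)              # stand-in for sampling from pi(.|s)

s = rng.choice(n_states, p=rho_0)               # s_0 ~ rho_0(.)
trajectory = []
for t in range(5):
    a = random_policy(s)
    s_next = rng.choice(n_states, p=P[s, a])    # stochastic state transition
    trajectory.append((s, a))
    s = s_next
print(trajectory)                               # [(s_0, a_0), (s_1, a_1), ...]
```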

Reward and Return

The reward function $R$ takes in the current state of the environment, the action taken, and the state of the environment after the action: $r_t = R(s_t, a_t, s_{t+1})$.

  • Also frequently simplified to depend only on the state of the environment, $r_t = R(s_t)$, or on the state-action pair, $r_t = R(s_t, a_t)$
  • Types of Return
    • finite-horizon undiscounted return - $R(\tau) = \sum_{t=0}^{T} r_t$
      • which is just a sum of the rewards obtained in a fixed window of steps
    • infinite-horizon discounted return - $R(\tau) = \sum_{t=0}^{\infty} \gamma^t r_t$, where $\gamma \in (0, 1)$ is the discount factor
      • which is the sum of all rewards ever obtained by the agent, but discounted by how far off in the future they’re obtained. The goal of the agent is to maximize this notion of cumulative reward over a trajectory (a series of state-action pairs)
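A quick numerical sketch of the two notions of return (the reward values are invented for illustration):

```python
# Finite-horizon undiscounted vs. discounted return on one short trajectory.
# The reward values here are invented for illustration.
rewards = [1.0, 0.0, 2.0, 1.0]     # r_0, r_1, r_2, r_3 from one trajectory
gamma = 0.9                        # discount factor, 0 < gamma < 1

# Finite-horizon undiscounted return: R(tau) = r_0 + r_1 + ... + r_T
undiscounted = sum(rewards)

# Discounted return: R(tau) = sum_t gamma^t * r_t
discounted = sum(gamma**t * r for t, r in enumerate(rewards))

print(undiscounted)   # 4.0
print(discounted)     # 1.0 + 0.0 + 0.81*2.0 + 0.729*1.0 = 3.349
```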

Central Problem for RL

The central problem of reinforcement learning is maximizing cumulative reward by finding an optimal policy.

We can formulate this as $\pi^* = \arg\max_\pi J(\pi)$, where $J(\pi)$ is the expected return: the return this policy would collect on average across the whole distribution of possible trajectories, each weighted by its probability. For a $T$-step trajectory, that probability is $P(\tau \mid \pi) = \rho_0(s_0) \prod_{t=0}^{T-1} P(s_{t+1} \mid s_t, a_t)\, \pi(a_t \mid s_t)$.

    • $P(s_{t+1} \mid s_t, a_t)$ - stands for the state transition probability; it is the probability of ending up in a particular next state given the current state and action

Expected Return

  • $R(\tau)$ is the return of the trajectory $\tau$, which is the total accumulated reward from following that trajectory
  • the expected return is $J(\pi) = E_{\tau \sim \pi}[R(\tau)]$, which is the average return that you would expect to get if you were to follow policy $\pi$ over all trajectories
  • It is calculated as the integral $J(\pi) = \int_\tau P(\tau \mid \pi)\, R(\tau)$: for each trajectory, take the probability of that trajectory occurring times the return of that trajectory
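In practice the integral over all trajectories cannot be computed exactly, so $J(\pi)$ is typically estimated by averaging the returns of sampled trajectories. A sketch, assuming a hypothetical `run_episode` helper that rolls out the policy once and returns the rewards it collected:

```python
# Monte Carlo estimate of the expected return: J(pi) ~= (1/N) * sum_i R(tau_i),
# averaging the return over N trajectories sampled by running the policy.
# `run_episode` is a hypothetical helper: one rollout under pi -> list of rewards.
def estimate_expected_return(run_episode, num_trajectories=100, gamma=0.99):
    returns = []
    for _ in range(num_trajectories):
        rewards = run_episode()
        returns.append(sum(gamma**t * r for t, r in enumerate(rewards)))
    return sum(returns) / len(returns)
```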

Value Functions

Value functions are used when you need to know the value (expected return) of a state or state-action pair.

  1. On-Policy Value Function - $V^\pi(s) = E_{\tau \sim \pi}[R(\tau) \mid s_0 = s]$ - gives the expected return if you start in state $s$ and always act according to the policy $\pi$
      • Calculates the expected sum of the rewards an agent would receive given starting state $s$ while following policy $\pi$ for the trajectory.
      • This equation predicts how good a particular state would be under a policy.
    • $E[\cdot]$ is an expectation operator, and it denotes the expected value of a random variable
    • $\tau \sim \pi$ means the trajectories (the lists of state and action pairs) are sampled according to the policy $\pi$
    • the expectation means that it is an average over all trajectories starting with $s$
    • $s_0 = s$ denotes the given starting state … read as “given start state $s$”
  2. On-Policy Action-Value Function - $Q^\pi(s, a) = E_{\tau \sim \pi}[R(\tau) \mid s_0 = s, a_0 = a]$ - gives the expected return if you start in state $s$ and take an action $a$ (which may not have come from the policy), then forever after act according to the policy $\pi$
      • Calculates the expected sum of the rewards an agent would receive given a starting action of $a$ and state of $s$ while following policy $\pi$ for all future actions
  3. Optimal Value Function - $V^*(s) = \max_\pi E_{\tau \sim \pi}[R(\tau) \mid s_0 = s]$ - gives the expected return if you start in state $s$ and act with the optimal policy
  4. Optimal Action-Value Function - $Q^*(s, a) = \max_\pi E_{\tau \sim \pi}[R(\tau) \mid s_0 = s, a_0 = a]$ - gives the expected return if you start in state $s$, take an arbitrary action $a$, and then forever after act with the optimal policy
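Since all four value functions are expectations over trajectories, each can be estimated by the same kind of averaging. A sketch for the on-policy value function, assuming a hypothetical `rollout_from` helper that starts the environment in state `s` and follows $\pi$ until the episode ends:

```python
# Monte Carlo estimate of the on-policy value function
# V^pi(s) = E_{tau ~ pi}[ R(tau) | s_0 = s ].
# `rollout_from` is a hypothetical helper: start in state s, act according to pi
# until the episode ends, and return the list of rewards collected.
def estimate_value(rollout_from, s, num_rollouts=100, gamma=0.99):
    total = 0.0
    for _ in range(num_rollouts):
        rewards = rollout_from(s)
        total += sum(gamma**t * r for t, r in enumerate(rewards))
    return total / num_rollouts

# Q^pi(s, a) would be estimated the same way, except the very first action is
# forced to be a, and the policy only takes over from the second step onward.
```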

Part 2: Kinds of RL Algorithms

../_images/rl_algorithms_9_15.svg

Model of the environment - a function which predicts state transitions and rewards. If we have a model, we can use it to plan by thinking ahead, seeing what would happen for a range of possible choices. Agents can then distill the results from planning ahead into a learned policy. The main downside of having a model is that a model that perfectly reflects the real world rarely exists; if the agent wants to use a model, it has to learn the model purely from experience.
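One simple way an agent can use a model to "think ahead" is random-shooting planning: imagine many candidate action sequences with the model, score them, and execute the first action of the best one. A sketch under that assumption (`model` and `reward_fn` are hypothetical stand-ins for a learned dynamics model and reward function, not a specific library API):

```python
# Random-shooting planner sketch: use a model f(s, a) -> s' to imagine rollouts,
# score candidate action sequences, and act on the best one (MPC-style).
import numpy as np

def plan(model, reward_fn, s0, act_dim, horizon=10, num_candidates=256, rng=None):
    rng = rng or np.random.default_rng()
    best_return, best_first_action = -np.inf, None
    for _ in range(num_candidates):
        actions = rng.uniform(-1.0, 1.0, size=(horizon, act_dim))  # candidate plan
        s, total = s0, 0.0
        for a in actions:
            s_next = model(s, a)               # imagined state transition
            total += reward_fn(s, a, s_next)   # imagined reward
            s = s_next
        if total > best_return:
            best_return, best_first_action = total, actions[0]
    return best_first_action                   # execute only the first action
```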