A Markov Decision Process (MDP) is a mathematical framework used to model decision-making in situations where outcomes are uncertain. It is defined by a tuple (S, A, P_{sa}, γ, R), where:
S is the set of states.
A is the set of actions.
P_{sa} are the state transition probabilities.
γ is the discount factor.
R is the reward function.
Rewards are usually written as functions of states and actions, i.e. R(s, a), or of states alone, i.e. R(s).
Our goal is to find actions that maximize the expected sum of discounted rewards.
E[R(s_0) + γ R(s_1) + γ^2 R(s_2) + …]
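For intuition, with γ = 0.9 and a constant reward of 1 at every step, this sum is 1 + 0.9 + 0.81 + … = 1/(1 − γ) = 10; the discount factor controls how much future rewards count relative to immediate ones.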
A policy is any function π : S → A that maps from states to actions. A value function V^π is defined as the expected return starting from state s and following policy π:
V^π(s) = E[R(s_0) + γ R(s_1) + γ^2 R(s_2) + … | s_0 = s, π]
We can write the value function recursively as the Bellman equation:
V^π(s) = R(s) + γ ∑_{s′∈S} P_{sπ(s)}(s′) V^π(s′)
Note that P_{sπ(s)}(s′) is the probability of landing in state s′ from state s if we take the action π(s).
The optimal value function V^*(s) is the maximum value function over all policies:
V^*(s) = max_π V^π(s)
The Bellman equation for the optimal value function is:
V^*(s) = R(s) + max_{a∈A} γ ∑_{s′∈S} P_{sa}(s′) V^*(s′)
The optimal policy π^* is the one that maximizes the value function:
π^*(s) = arg max_{a∈A} ∑_{s′∈S} P_{sa}(s′) V^*(s′)
Finding the optimal policy is equivalent to finding the optimal value function, and there are two standard algorithms for doing this: Value Iteration and Policy Iteration.
Value Iteration
For each state s, initialize V(s)=0
Repeat until convergence {
For each state s, set {
V(s) ← R(s) + max_{a∈A} γ ∑_{s′∈S} P_{sa}(s′) V(s′)
}
}
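As a concrete illustration, here is a minimal sketch of value iteration in Python/NumPy. It assumes the transition probabilities are given as an array P of shape (|S|, |A|, |S|) and the rewards as a vector R over states; these inputs and names are illustrative assumptions, not something defined above.

```python
import numpy as np

def value_iteration(P, R, gamma=0.9, tol=1e-6):
    """Sketch of value iteration.

    P: array of shape (n_states, n_actions, n_states), P[s, a, s'] = P_sa(s').
    R: array of shape (n_states,), R[s] = reward for state s.
    """
    n_states, n_actions, _ = P.shape
    V = np.zeros(n_states)                      # initialize V(s) = 0
    while True:
        # Bellman backup: V(s) <- R(s) + max_a gamma * sum_s' P_sa(s') V(s')
        Q = R[:, None] + gamma * P @ V          # shape (n_states, n_actions)
        V_new = Q.max(axis=1)
        if np.max(np.abs(V_new - V)) < tol:     # stop when updates become tiny
            return V_new
        V = V_new
```

Once V has converged, the greedy policy can be read off exactly as in the π^*(s) formula above.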
Policy Iteration
Initialize π randomly
Repeat until convergence {
Let V := V^π (typically computed using a linear system solver)
For each state s, set {
π(s) ← arg max_{a∈A} ∑_{s′∈S} P_{sa}(s′) V(s′)
}
}
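A corresponding sketch of policy iteration, under the same assumed P and R arrays as before; the policy-evaluation step solves the linear system (I − γ P^π) V = R exactly.

```python
import numpy as np

def policy_iteration(P, R, gamma=0.9):
    """Sketch of policy iteration (same P, R conventions as the value iteration sketch)."""
    n_states, n_actions, _ = P.shape
    pi = np.random.randint(n_actions, size=n_states)   # random initial policy
    while True:
        # Policy evaluation: V = R + gamma * P_pi V  =>  (I - gamma * P_pi) V = R
        P_pi = P[np.arange(n_states), pi]               # shape (n_states, n_states)
        V = np.linalg.solve(np.eye(n_states) - gamma * P_pi, R)
        # Policy improvement: pi(s) <- argmax_a sum_s' P_sa(s') V(s')
        pi_new = (P @ V).argmax(axis=1)
        if np.array_equal(pi_new, pi):                  # converged: policy is stable
            return pi, V
        pi = pi_new
```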
For small MDPs, Policy Iteration often converges faster. However, for large MDPs, solving for V^π in each step of Policy Iteration involves solving a large system of linear equations, which is computationally expensive.
Usually, we do not know the state transition probabilities P_{sa} beforehand. Instead, they can be estimated by repeatedly running the agent in the MDP under some policy π and using the following formula:
P_{sa}(s′) = (# of times we took action a in state s and ended up in state s′) / (# of times we took action a in state s)
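As a sketch, these counts can be accumulated from logged (s, a, s′) transitions; the function below uses a uniform distribution over next states for state–action pairs that were never observed, which is an illustrative fallback choice.

```python
import numpy as np

def estimate_transitions(transitions, n_states, n_actions):
    """Estimate P_sa(s') from a list of observed (s, a, s_next) tuples."""
    counts = np.zeros((n_states, n_actions, n_states))
    for s, a, s_next in transitions:
        counts[s, a, s_next] += 1
    totals = counts.sum(axis=2, keepdims=True)   # # of times action a was taken in state s
    # For (s, a) pairs that were never tried, fall back to a uniform distribution
    # over next states; np.maximum avoids division by zero.
    P = np.where(totals > 0, counts / np.maximum(totals, 1), 1.0 / n_states)
    return P
```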
Value Function Approximation
One way to deal with continuous state spaces in MDPs is to discretize the state space. If we have d dimensions and we discretize each dimension into k values, then we have k^d states, so the number of states grows exponentially in the dimension d.
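For instance, a state with d = 4 dimensions and k = 100 values per dimension already gives 100^4 = 10^8 discrete states. A minimal sketch of such a grid discretization follows; the bounds, bin count, and example state are made up for illustration.

```python
import numpy as np

def discretize(state, lows, highs, k):
    """Map a continuous d-dimensional state to a single discrete index in [0, k**d)."""
    lows, highs = np.asarray(lows), np.asarray(highs)
    # Bin each dimension into one of k cells.
    bins = np.clip(((state - lows) / (highs - lows) * k).astype(int), 0, k - 1)
    # Flatten the d bin indices into one index (row-major).
    return int(np.ravel_multi_index(bins, (k,) * len(bins)))

# e.g. a 2-D state in [-1, 1] x [0, 5] with k = 10 cells per dimension
idx = discretize(np.array([0.3, 2.7]), lows=[-1, 0], highs=[1, 5], k=10)
```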
To deal with this, we can use function approximation.
One way to do this is to use a model-based approach: we use a simulator to execute n trials, each for T time steps, and fit a model of the dynamics to the observed transitions, for example a linear model s_{t+1} = A s_t + B a_t with A and B chosen to minimize the squared prediction error.
However, this is a deterministic model, and most real-world systems are stochastic. So we can modify the model to be stochastic by adding a noise term:
s_{t+1} = A s_t + B a_t + ϵ
ϵ ∼ N(0, Σ)
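Here is a minimal sketch of how A, B, and Σ could be fit from the simulated trajectories, assuming they are stored as arrays `states` of shape (n, T + 1, d_s) and `actions` of shape (n, T, d_a); these array names and shapes are illustrative assumptions, not something defined above.

```python
import numpy as np

def fit_linear_dynamics(states, actions):
    """Fit s_{t+1} ≈ A s_t + B a_t by least squares and estimate the noise covariance.

    states:  array of shape (n_trials, T + 1, d_s)
    actions: array of shape (n_trials, T, d_a)
    """
    d_s = states.shape[2]
    # Stack all (s_t, a_t) -> s_{t+1} pairs across trials and time steps.
    X = np.concatenate([states[:, :-1].reshape(-1, d_s),
                        actions.reshape(-1, actions.shape[2])], axis=1)
    Y = states[:, 1:].reshape(-1, d_s)
    # Solve min ||X W - Y||^2; then A = W[:d_s].T and B = W[d_s:].T
    W, *_ = np.linalg.lstsq(X, Y, rcond=None)
    A, B = W[:d_s].T, W[d_s:].T
    residuals = Y - X @ W
    Sigma = np.cov(residuals.T)        # empirical covariance of epsilon
    return A, B, Sigma
```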
Now, if we assume that our state space is continuous but our action space is small and discrete, we can use the Fitted Value Iteration algorithm.
s_{t+1} − (A s_t + B a_t) = ϵ
Since ϵ is normally distributed, the term s_{t+1} − (A s_t + B a_t) is also normally distributed. We can then write:
s_{t+1} ∼ N(A s_t + B a_t, Σ)
Our state transition function can now be written as:
P_{sa}(s′) = N(As + Ba, Σ)
Moreover, our value iteration update rule for the continuous case can be written as:
V(s) ← R(s) + γ max_{a∈A} ∫_{s′} P_{sa}(s′) V(s′) ds′
However, since our states s are continuous, we have to approximate the value function V(s) as well. We do so by finding a linear or non-linear mapping from states to the value function:
V(s) = θ^T ϕ(s)
We also cannot directly update V(s) using the value iteration update rule. Instead, we have to update the parameters θ.
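Putting these pieces together, here is a minimal sketch of fitted value iteration under the assumptions above: a small discrete action set, the fitted Gaussian model s′ ∼ N(As + Ba, Σ), and a linear value function V(s) = θ^T ϕ(s). The integral over s′ is approximated by sampling, and θ is updated by least-squares regression of the estimated targets onto the features. The feature map `phi` and the `reward_fn` argument are illustrative assumptions, not definitions from the text.

```python
import numpy as np

def phi(s):
    # Illustrative feature map: constant, linear, and squared terms.
    return np.concatenate([[1.0], s, s ** 2])

def fitted_value_iteration(S, actions, A, B, Sigma, reward_fn,
                           gamma=0.9, n_samples=20, n_iters=50):
    """Sketch of fitted value iteration.

    S:       array of sampled states, shape (m, d_s)
    actions: list of discrete actions (each a vector of shape (d_a,))
    """
    m = S.shape[0]
    Phi = np.vstack([phi(s) for s in S])              # (m, n_features)
    theta = np.zeros(Phi.shape[1])                    # V(s) = theta^T phi(s)
    for _ in range(n_iters):
        y = np.empty(m)
        for i, s in enumerate(S):
            q = []
            for a in actions:
                # Sample s' ~ N(As + Ba, Sigma) and estimate E[V(s')].
                mean = A @ s + B @ a
                s_next = np.random.multivariate_normal(mean, Sigma, size=n_samples)
                v_next = np.array([phi(sp) @ theta for sp in s_next]).mean()
                q.append(reward_fn(s) + gamma * v_next)
            y[i] = max(q)                             # target value for state s
        # Regress targets onto features: theta <- argmin ||Phi theta - y||^2
        theta, *_ = np.linalg.lstsq(Phi, y, rcond=None)
    return theta
```

At execution time, the greedy action in a state s can be chosen by recomputing the same sampled estimate of E[V(s′)] for each action and taking the argmax.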