Making Good Decisions given a Model of the World



If there is any error, please raise an issue.


Acting in an MDP

Recap:



Markov Process (=> Memoryless Stochastic Process)



Markov Reward Process



Calculation of State Value Function



Markov Decision Process



Bellman Backup Operator(s)

We know that, if there is an MDP \(M = (S, A, \textbf{P}, R, \gamma)\) and a (deterministic/stochastic) policy \(\pi(a|s)\), the pair can be considered equivalent to the MRP \(M' = (S, \textbf{P}^\pi, R^\pi, \gamma)\), where \(R^\pi(s) = \sum_{a \in A} \pi(a|s)\, R(s, a)\) and \(P^\pi(s'|s) = \sum_{a \in A} \pi(a|s)\, P(s'|s, a)\).

=> The value function of the policy \(\pi\) evaluated on \(M\), denoted by \(V^\pi\), is therefore the same as the value function evaluated on \(M'\). \(V^\pi\) lives in the finite-dimensional Banach space \(\mathbb{R}^{\mid S \mid}\).
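As a minimal sketch of this reduction (assuming NumPy, with the transition model stored as an \((|A|, |S|, |S|)\) array, the rewards as an \((|S|, |A|)\) array, and the policy as an \((|S|, |A|)\) array of probabilities; these layouts and the function name are illustrative choices, not from the original notes):

```python
import numpy as np

def induced_mrp(P, R, pi):
    """Collapse an MDP and a stochastic policy pi(a|s) into the induced MRP.

    P  : (A, S, S) array, P[a, s, s2] = P(s2 | s, a)
    R  : (S, A)    array, R[s, a]     = expected immediate reward
    pi : (S, A)    array, pi[s, a]    = pi(a | s)
    Returns P_pi of shape (S, S) and R_pi of shape (S,).
    """
    # P^pi(s'|s) = sum_a pi(a|s) * P(s'|s, a)
    P_pi = np.einsum("sa,ast->st", pi, P)
    # R^pi(s) = sum_a pi(a|s) * R(s, a)
    R_pi = (pi * R).sum(axis=1)
    return P_pi, R_pi
```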

Trivia: the Banach space \(\mathbb{R}^{\mid S \mid}\) is a complete normed vector space with norm \(\Vert \cdot \Vert\). A normed vector space is simply a vector space on which a norm is defined (\(x \mapsto \Vert x \Vert\)), i.e., one where the notion of “length” is captured.

Then, for an element \(U \in \mathbb{R}^{\mid S \mid}\), the Bellman expectation backup operator \(B^\pi\) is defined as

$$(B^\pi U)(s) = R^\pi(s) + \gamma \sum_{s' \in S} P^\pi(s'|s) \cdot U(s'); \quad \forall s \in S$$

This equation is also termed the Bellman Expectation Equation.
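A sketch of one application of \(B^\pi\), reusing the hypothetical layout above (`P_pi` as an \((|S|, |S|)\) matrix, `R_pi` as a length-\(|S|\) vector; the function name is illustrative):

```python
import numpy as np

def bellman_expectation_backup(U, P_pi, R_pi, gamma):
    """Apply B^pi once to a value estimate U in R^|S|.

    U     : (S,)   current value estimate
    P_pi  : (S, S) induced transition matrix, P_pi[s, s2] = P^pi(s2 | s)
    R_pi  : (S,)   induced reward vector R^pi(s)
    gamma : discount factor
    """
    # (B^pi U)(s) = R^pi(s) + gamma * sum_{s'} P^pi(s'|s) * U(s')
    return R_pi + gamma * P_pi @ U
```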

Similarly, for an element \(U \in \mathbb{R}^{\mid S \mid}\), the Bellman optimality backup operator \(B^*\) is defined as

$$(B^* U)(s) = \max_{a \in A} \Big[ R(s, a) + \gamma \sum_{s' \in S} P(s'|s, a) \cdot U(s')\Big]; \quad \forall s \in S $$

This equation is also termed the Bellman Optimality Equation.
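A corresponding sketch of \(B^*\); because of the \(\max\) over actions, it needs the action-dependent \(P\) and \(R\) of the MDP rather than the induced MRP quantities (same hypothetical layouts as above):

```python
import numpy as np

def bellman_optimality_backup(U, P, R, gamma):
    """Apply B* once to a value estimate U in R^|S|.

    U : (S,)      current value estimate
    P : (A, S, S) transition tensor, P[a, s, s2] = P(s2 | s, a)
    R : (S, A)    reward matrix R(s, a)
    """
    # Q[s, a] = R(s, a) + gamma * sum_{s'} P(s'|s, a) * U(s')
    Q = R + gamma * np.einsum("ast,t->sa", P, U)
    # (B* U)(s) = max_a Q[s, a]
    return Q.max(axis=1)
```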

The \(\max\) operation in the Optimality Equation makes it non-linear in \(U\); as a result, unlike the Expectation Equation, it admits no closed-form solution. Ref: StackExchange
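To make the contrast concrete: since \(V^\pi = R^\pi + \gamma P^\pi V^\pi\) is linear in \(V^\pi\), it rearranges to the linear system \((I - \gamma P^\pi)V^\pi = R^\pi\). A minimal sketch of that closed-form solution, under the assumed NumPy layouts above:

```python
import numpy as np

def evaluate_policy_closed_form(P_pi, R_pi, gamma):
    """Solve the Bellman Expectation Equation V = R^pi + gamma * P^pi V directly.

    The equation is linear in V, so (I - gamma * P^pi) V^pi = R^pi is a
    standard linear system.
    """
    S = len(R_pi)
    return np.linalg.solve(np.eye(S) - gamma * P_pi, R_pi)
```

No analogous rearrangement exists for the Optimality Equation; it is instead solved iteratively, e.g. by repeatedly applying \(B^*\), as in value iteration below.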



MDP Control



Policy Iteration & Value Iteration

Taken from Stack Overflow
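A minimal sketch of both algorithms, under the same assumed \((|A|, |S|, |S|)\) / \((|S|, |A|)\) array layouts used above (function names and the convergence tolerance are illustrative choices):

```python
import numpy as np

def value_iteration(P, R, gamma, tol=1e-8):
    """Repeatedly apply the Bellman optimality backup B* until convergence.

    P : (A, S, S) transition tensor, P[a, s, s2] = P(s2 | s, a)
    R : (S, A)    reward matrix R(s, a)
    Returns an estimate of V* and a greedy policy with respect to it.
    """
    S = R.shape[0]
    V = np.zeros(S)
    while True:
        Q = R + gamma * np.einsum("ast,t->sa", P, V)   # one-step lookahead
        V_new = Q.max(axis=1)                          # (B* V)(s)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new, Q.argmax(axis=1)
        V = V_new

def policy_iteration(P, R, gamma):
    """Alternate exact policy evaluation with greedy policy improvement."""
    A, S, _ = P.shape
    pi = np.zeros(S, dtype=int)                        # arbitrary initial policy
    while True:
        # Evaluation: solve the (linear) Expectation Equation for the current pi
        P_pi = P[pi, np.arange(S), :]                  # (S, S)
        R_pi = R[np.arange(S), pi]                     # (S,)
        V = np.linalg.solve(np.eye(S) - gamma * P_pi, R_pi)
        # Improvement: act greedily with respect to the one-step lookahead Q
        Q = R + gamma * np.einsum("ast,t->sa", P, V)
        pi_new = Q.argmax(axis=1)
        if np.array_equal(pi_new, pi):                 # stable policy => stop
            return pi, V
        pi = pi_new
```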