
[Artificial Intelligence] 17. Making Sequential Decisions

1. Markov Decision Process

  1) Episodic Decisions 

 - In a nondeterministic, partially observable environment the agent cannot be certain of the next state. Given observations e, the outcome of an action a is therefore described by the probability of ending up in each state s':

 P(Result(a) = s' | a, e)

 - The utility function U(s) is a single number expressing how desirable each state is.

The expected utility of taking an action a given evidence e is the average of the utilities of its possible outcomes, weighted by how likely each outcome is:

EU(a | e) = ∑(s') P(Result(a) = s' | a, e) U(s')

*EU: expected (average) utility. Having observed evidence e, how good is it to take action a?

 

 - The principle of maximum expected utility (MEU): choose the action that maximizes the agent's expected utility.

 arg max(a) EU(a|e)
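To make the MEU rule concrete, here is a minimal sketch of a single (episodic) decision. The two actions, their outcome distributions, and the outcome utilities are made-up numbers for illustration, not values from the lecture.

```python
# Pick arg max_a EU(a | e) for one episodic decision.
# All probabilities and utilities below are illustrative assumptions.
outcome_prob = {                        # P(Result(a) = s' | a, e) for each action a
    "left":  {"goal": 0.7, "pit": 0.3},
    "right": {"goal": 0.4, "pit": 0.6},
}
U = {"goal": 1.0, "pit": -1.0}          # utility U(s') of each outcome state

def expected_utility(a):
    # EU(a | e) = sum over s' of P(Result(a) = s' | a, e) * U(s')
    return sum(p * U[s] for s, p in outcome_prob[a].items())

best_action = max(outcome_prob, key=expected_utility)
print(best_action, expected_utility(best_action))   # -> left 0.4
```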

 

 - We assume the decisions are episodic: when several decisions have to be made, each decision is unrelated to the previous and the next one.

 

 

  2) Markov Decision Process

 - In a sequential decision problem, the agent's utility depends on a sequence of states rather than on a single outcome.

 - In each state the agent receives a reward, which may be positive or negative.

 - A sequential decision problem in a "fully observable" but "stochastic" environment, with a Markovian transition model and additive rewards, is called a Markov decision process (MDP). (In particular, the environment may be nondeterministic: taking action a does not guarantee the outcome the agent intends.)

 

 - policy : a solution that specifies which action the agent should take in any state it might reach, i.e., a contingency plan, rather than a fixed sequence of states or actions.

 - The quality of a policy is measured by the expected utility of the state sequences it produces.

 - optimal policy : the policy with the largest expected utility.

 

 

 3) Expected Utility

  - Stationarity : if two state sequences begin with the same state, they should be preference-ordered in the same way as the sequences obtained by removing that common first state; i.e., preferences between sequences do not depend on when they start.

 

 

  - Ways of assigning a utility to a state sequence:

   ⓐ additive rewards : simply add up the rewards of the states in the sequence.

   ⓑ discounted rewards :

   U_h([s0, s1, ...]) = R(s0) + γR(s1) + γ^2 R(s2) + ...

  γ is the discount factor, with a value between 0 and 1. Rewards further along the sequence count for less, i.e., rewards of near-future states are treated as more important than distant ones.
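As a tiny numeric illustration of the discounted sum (the reward values and γ below are assumptions, not numbers from the notes):

```python
# Discounted utility of a short reward sequence: R(s0) + gamma*R(s1) + gamma^2*R(s2) + ...
gamma = 0.9
rewards = [-0.04, -0.04, -0.04, 1.0]    # illustrative rewards R(s0)..R(s3)

U_h = sum(gamma**t * r for t, r in enumerate(rewards))
print(U_h)                              # -0.04 - 0.036 - 0.0324 + 0.729 = 0.6206
```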

 

 

  - The expected utility obtained by executing a policy π starting in state s is

U^π(s) = E[ ∑(t=0..∞) γ^t R(S_t) ]

i.e., the expected sum of the discounted rewards of the states visited while following π from s.

 

 - R(s) is the "short term" reward for being in state s, whereas U^π(s) is the "long term" total reward accumulated from s onward.
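One way to make this expectation concrete is to estimate U^π(s) by simulation: sample trajectories under π and average their discounted returns. This Monte Carlo sketch is only an illustration; the toy MDP, policy, horizon, and episode count are assumptions, not part of the lecture.

```python
import random

# Rough Monte Carlo estimate of U^pi(s): average discounted return over sampled runs.
P = {   # assumed toy model: P[s][a] = list of (probability, next_state)
    "A": {"go": [(0.8, "B"), (0.2, "A")], "stay": [(1.0, "A")]},
    "B": {"go": [(0.8, "C"), (0.2, "B")], "stay": [(1.0, "B")]},
    "C": {"go": [(1.0, "C")], "stay": [(1.0, "C")]},
}
R = {"A": -0.04, "B": -0.04, "C": 1.0}      # assumed state rewards
pi = {"A": "go", "B": "go", "C": "stay"}    # the fixed policy being evaluated
gamma, horizon, n_episodes = 0.9, 60, 10_000

def discounted_return(s):
    total = 0.0
    for t in range(horizon):                # truncate the infinite sum at a long horizon
        total += gamma**t * R[s]
        probs, succs = zip(*P[s][pi[s]])
        s = random.choices(succs, weights=probs)[0]
    return total

print(sum(discounted_return("A") for _ in range(n_episodes)) / n_episodes)  # estimates U^pi(A)
```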

 

 

  4) Optimal Policy

 - The optimal policy is the policy with the largest expected utility:

 π*_s = arg max(π) U^π(s)

  - If π*_a is optimal when starting from a and π*_b is optimal when starting from b, then whenever the two policies reach some third state c they must agree on what to do from c onward. So the optimal policy does not depend on the starting state, and we simply write π*.

 

 - If U(s) is the utility U^π*(s) of the states under the optimal policy, then an agent that chooses actions by maximum expected utility always acts optimally.

   *Maximum expected utility : choose the action that maximizes the expected utility of the next state.

 

  π*(s) = arg max(a) ∑(s') P(s' | s, a) U(s')

= for each action a, sum over s' of (probability of reaching s' when taking a in s) * (expected utility when starting from s')

In other words, take the action whose expected utility over the resulting states is the largest.
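A minimal sketch of this one-step greedy rule, extracting a policy from a table of utilities. The toy transition model and the utility values are assumptions for illustration (e.g. utilities that value iteration might have produced).

```python
# Greedy policy extraction: pi*(s) = arg max_a sum_{s'} P(s'|s,a) U(s').
P = {   # assumed toy MDP: P[s][a] = list of (probability, next_state)
    "A": {"go": [(0.8, "B"), (0.2, "A")], "stay": [(1.0, "A")]},
    "B": {"go": [(0.8, "C"), (0.2, "B")], "stay": [(1.0, "B")]},
    "C": {"go": [(1.0, "C")], "stay": [(1.0, "C")]},
}
U = {"A": 0.5, "B": 0.8, "C": 1.0}          # assumed utility estimates

def greedy_policy(P, U):
    return {
        s: max(P[s], key=lambda a: sum(p * U[s2] for p, s2 in P[s][a]))
        for s in P
    }

print(greedy_policy(P, U))                  # -> {'A': 'go', 'B': 'go', 'C': 'go'}
```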

 

 

 

2. Value Iteration Algorithm 

 

 1) Bellman Equation

 : the utility of a state is its immediate reward plus the expected discounted utility of the next state, assuming the agent chooses the optimal action. This is the Bellman equation:

U(s) = R(s) + γ max(a) ∑(s') P(s' | s, a) U(s')

      = current state's reward + expected discounted utility of s' under the best choice of action

 

 2) Bellman Update 

 U_i+1(s) <- R(s) + γ max(a) ∑(s') P(s' | s, a) U_i(s')

 It looks different, but it is just the Bellman equation with the equality turned into an assignment (with the current estimate U_i plugged in on the right-hand side).

Initialize the utilities to arbitrary values (e.g., all zeros), then run this update in a loop: sweep over every state and recompute its utility from the current estimates. As the sweeps are repeated, the utilities converge. See the sketch below.
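A minimal value-iteration sketch. The 3-state, 2-action transition model, the rewards, and γ are illustrative assumptions; the termination test is the one discussed in 4) below.

```python
# Value iteration: repeat the Bellman update until the utilities stop changing much.
P = {   # assumed toy MDP: P[s][a] = list of (probability, next_state)
    "A": {"go": [(0.8, "B"), (0.2, "A")], "stay": [(1.0, "A")]},
    "B": {"go": [(0.8, "C"), (0.2, "B")], "stay": [(1.0, "B")]},
    "C": {"go": [(1.0, "C")], "stay": [(1.0, "C")]},
}
R = {"A": -0.04, "B": -0.04, "C": 1.0}      # assumed state rewards
gamma = 0.9

def value_iteration(P, R, gamma, eps=1e-6):
    U = {s: 0.0 for s in P}                 # initialise all utilities to 0
    while True:
        delta, U_new = 0.0, {}
        for s in P:
            # Bellman update: U_{i+1}(s) = R(s) + gamma * max_a sum_{s'} P(s'|s,a) U_i(s')
            U_new[s] = R[s] + gamma * max(
                sum(p * U[s2] for p, s2 in P[s][a]) for a in P[s]
            )
            delta = max(delta, abs(U_new[s] - U[s]))
        U = U_new
        if delta < eps * (1 - gamma) / gamma:   # termination test from section 4)
            return U

print(value_iteration(P, R, gamma))
```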

 

 

  - equilibrium : the stable point at which the utility values no longer change under the update; these equilibrium utilities are the solution, from which the optimal policy follows.

  - You can think of the updates as propagating reward information through the state space.

 

  3) Convergence of Value Iteration Algorithm

 

  - A contraction is a function that brings any two inputs closer together: the difference between the two outputs is smaller than the difference between the two inputs.

   |f(a) - f(b)| < |a - b|

     -> Applying a contraction over and over eventually reaches a unique fixed point.

 

  - Bellman equation and Bellman update

 

  - The Bellman update is a contraction (with factor γ) on the space of utility vectors. Therefore, as long as γ < 1, value iteration always converges to a unique solution.

 

  || BU_i - BU || = || U_i+1 - U || <= γ ||U_i - U||

 

Here BU_i denotes the result of applying the Bellman update B to the utility vector U_i, and U is the true solution, which is a fixed point of B (so BU = U):

= || Bellman update applied to the i-th utilities - Bellman update applied to the solution ||

= || (i+1)-th utilities - the solution's utilities ||

<= γ || i-th utilities - the solution's utilities ||

Proof sketch:

 

1. |max(a) f(a) - max(a) g(a)| <= max(a) |f(a) - g(a)|  --- easy to see with a picture: the difference of the maxima is at most the maximum of the differences.

2.  | BU_i(s) - BU(s) | = | γ max(a) ∑(s') P(s' | s, a) U_i(s') - γ max(a) ∑(s') P(s' | s, a) U(s') |

                         <= γ max(a) | ∑(s') P(s' | s, a) U_i(s') - ∑(s') P(s' | s, a) U(s') |      (by 1.)

                          = γ | ∑(s') P(s' | s, a*) (U_i(s') - U(s')) |      (a* = the maximizing action)

3. ||BU_i - BU|| = max(s) |BU_i(s) - BU(s)| <= max(s) γ | ∑(s') P(s' | s, a*) (U_i(s') - U(s')) |

                <= γ max(s') |U_i(s') - U(s')| = γ ||U_i - U||

   (the P(s' | s, a*) are probabilities summing to 1, so the weighted average of the differences is at most the largest difference)

 

4. ||BU_i - BU|| = ||U_i+1 - U|| <= γ ||U_i - U||

 

 

 

  4) Error Bound of Value Iteration Algorithm

 - The utility of every state is bounded by ±Rmax/(1-γ), where Rmax is the largest reward magnitude.

 - Therefore the maximum possible initial error is ||U_0 - U|| <= 2Rmax/(1-γ).

 - If we run N iterations, the error is at most γ^N · 2Rmax/(1-γ); requiring this to be <= ε gives

   N = log(2Rmax / (ε(1-γ))) / log(1/γ)

   so we can tell in advance how many iterations are needed to get within ε of the solution.
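A quick numeric check of this iteration count (Rmax, γ, and ε below are illustrative assumptions):

```python
import math

# Iterations needed so that gamma^N * 2*Rmax/(1-gamma) <= eps, i.e.
# N = log(2*Rmax / (eps*(1-gamma))) / log(1/gamma), rounded up.
Rmax, gamma, eps = 1.0, 0.9, 0.001

N = math.ceil(math.log(2 * Rmax / (eps * (1 - gamma))) / math.log(1 / gamma))
print(N)    # -> 94 for these values
```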

 

 * The utility of a state sequence is bounded by the sum of a geometric series with ratio γ: since each reward is at most Rmax, the discounted sum is at most Rmax ∑ γ^t = Rmax/(1-γ).

  Hence the expected utility of any state, being an average over state sequences, is also bounded by this geometric sum.

 - If an update is small, i.e., the utilities barely changed, then the error with respect to the true utilities is also small (we are already close to convergence):

   if || U_i+1 - U_i || < ε(1-γ)/γ  then  || U_i+1 - U || < ε, which gives a practical termination test.

 - As a result, the policy π_i becomes more or less optimal well before U_i converges: even when the utility values are not yet the exact solution, the greedy policy extracted from them is often already optimal.

 

 

 

 

 

 

 

3. Policy Iteration Algorithm 

 

 1) Policy Iteration Algorithm : starting from an initial policy π0, alternate between the following two steps.

    ⓐ policy evaluation : given the policy πi, compute Ui = U^πi, i.e., calculate what the utility of each state would be if πi were executed.

    ⓑ policy improvement : compute a new MEU policy π_i+1, the policy for the next iteration, by one-step look-ahead based on Ui. How?

      π_i+1(s) = arg max(a) ∑(s') P(s' | s, a) U^πi(s')

 The policy at iteration i is used to compute its utilities, and those utilities are used to derive the policy at iteration i+1. A runnable sketch follows.
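A minimal policy-iteration sketch on an assumed toy MDP. For the evaluation step it uses repeated simplified Bellman updates (as in the modified variant in 4) below) rather than the exact linear solve of 3); the transition model, rewards, and γ are illustrative assumptions.

```python
# Policy iteration: alternate policy evaluation and greedy policy improvement.
P = {   # assumed toy MDP: P[s][a] = list of (probability, next_state)
    "A": {"go": [(0.8, "B"), (0.2, "A")], "stay": [(1.0, "A")]},
    "B": {"go": [(0.8, "C"), (0.2, "B")], "stay": [(1.0, "B")]},
    "C": {"go": [(1.0, "C")], "stay": [(1.0, "C")]},
}
R = {"A": -0.04, "B": -0.04, "C": 1.0}      # assumed state rewards
gamma = 0.9

def policy_evaluation(pi, P, R, gamma, sweeps=200):
    # Approximate U^pi by sweeping U(s) <- R(s) + gamma * sum_{s'} P(s'|s,pi(s)) U(s').
    U = {s: 0.0 for s in P}
    for _ in range(sweeps):
        U = {s: R[s] + gamma * sum(p * U[s2] for p, s2 in P[s][pi[s]]) for s in P}
    return U

def policy_iteration(P, R, gamma):
    pi = {s: next(iter(P[s])) for s in P}               # arbitrary initial policy pi_0
    while True:
        U = policy_evaluation(pi, P, R, gamma)          # (a) policy evaluation
        new_pi = {                                      # (b) greedy policy improvement
            s: max(P[s], key=lambda a: sum(p * U[s2] for p, s2 in P[s][a]))
            for s in P
        }
        if new_pi == pi:                                # no change -> pi is optimal
            return pi, U
        pi = new_pi

print(policy_iteration(P, R, gamma))
```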

 

 

 2) Analysis 

- The algorithm terminates when the improvement step produces no change in the utilities (equivalently, no change in the policy).

- At that point the utility function Ui is a fixed point of the Bellman update, so it is a solution of the Bellman equation, and therefore π_i must be an optimal policy.

- Since a finite state space admits only finitely many policies, policy iteration is guaranteed to terminate.

 

 

  3) Policy Evaluation 

 : for this step we can use a simplified version of the Bellman equation.

 

 - At the i-th iteration, the policy πi fixes the action taken in state s to πi(s).

 - U_i(s) = R(s) + γ ∑(s') P(s' | s, πi(s)) U_i(s')

So instead of considering every action a, we only plug in πi(s): there is no max over actions, only the action prescribed by the policy, which makes these equations linear in the utilities.

 

 - With n states this is a system of n linear equations in n unknowns, which can be solved exactly in O(n^3) time (solving or inverting an n×n system has this complexity). A sketch follows.
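A minimal sketch of exact policy evaluation as a linear solve, rearranging U = R + γ P_π U into (I - γ P_π) U = R. The transition matrix, rewards, and γ are illustrative assumptions.

```python
import numpy as np

# Exact policy evaluation: solve (I - gamma * P_pi) U = R for the fixed policy pi.
# np.linalg.solve is cubic in the number of states, matching the O(n^3) remark above.
states = ["A", "B", "C"]
P_pi = np.array([                   # P_pi[i][j] = P(s_j | s_i, pi(s_i)), an assumed model
    [0.2, 0.8, 0.0],
    [0.0, 0.2, 0.8],
    [0.0, 0.0, 1.0],
])
R = np.array([-0.04, -0.04, 1.0])   # assumed state rewards
gamma = 0.9

U = np.linalg.solve(np.eye(len(states)) - gamma * P_pi, R)
print(dict(zip(states, U)))
```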

 

 

  4) Modified Policy Iteration Algorithm

  - Exact evaluation solves  U_i(s) = R(s) + γ ∑(s') P(s' | s, πi(s)) U_i(s')

  - Instead, the simplified Bellman update  U_i+1(s) <- R(s) + γ ∑(s') P(s' | s, πi(s)) U_i(s')  can be applied a few times.

 Even without computing the utilities exactly, the policy is fixed during evaluation, so running a small number of these simplified updates is far more efficient and still yields a good approximation. A short sketch follows.
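A minimal sketch of this modified evaluation step: with the policy fixed, run only k simplified Bellman updates. The toy MDP, policy, γ, and k are illustrative assumptions.

```python
# Modified policy evaluation: k simplified Bellman updates instead of an exact solve.
P = {   # assumed toy MDP: P[s][a] = list of (probability, next_state)
    "A": {"go": [(0.8, "B"), (0.2, "A")], "stay": [(1.0, "A")]},
    "B": {"go": [(0.8, "C"), (0.2, "B")], "stay": [(1.0, "B")]},
    "C": {"go": [(1.0, "C")], "stay": [(1.0, "C")]},
}
R = {"A": -0.04, "B": -0.04, "C": 1.0}
pi = {"A": "go", "B": "go", "C": "go"}      # the fixed policy being evaluated
gamma, k = 0.9, 5

U = {s: 0.0 for s in P}
for _ in range(k):
    # U_{i+1}(s) <- R(s) + gamma * sum_{s'} P(s'|s,pi(s)) U_i(s')   (no max over actions)
    U = {s: R[s] + gamma * sum(p * U[s2] for p, s2 in P[s][pi[s]]) for s in P}
print(U)    # approximate utilities of pi after k sweeps
```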

 

 

 

  5) Asynchronous Policy Iteration Algorithm

 - On each iteration we may pick any subset of the states and update only those, whether we are improving the policy or updating utility values. This is asynchronous policy iteration: in a given iteration, the policy / utility values are updated for only some of the states rather than always for all of them. A short sketch follows.
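A minimal sketch of the asynchronous idea applied to the utility updates: each sweep touches only a randomly chosen subset of states. The toy MDP, policy, subset size, and sweep count are illustrative assumptions.

```python
import random

# Asynchronous updates: each sweep updates the utilities of only a subset of states.
P = {   # assumed toy MDP: P[s][a] = list of (probability, next_state)
    "A": {"go": [(0.8, "B"), (0.2, "A")], "stay": [(1.0, "A")]},
    "B": {"go": [(0.8, "C"), (0.2, "B")], "stay": [(1.0, "B")]},
    "C": {"go": [(1.0, "C")], "stay": [(1.0, "C")]},
}
R = {"A": -0.04, "B": -0.04, "C": 1.0}
pi = {"A": "go", "B": "go", "C": "go"}      # current policy
gamma = 0.9

U = {s: 0.0 for s in P}
for sweep in range(300):
    subset = random.sample(list(P), 2)      # pick a subset of the states
    for s in subset:                        # update only those states, in place
        U[s] = R[s] + gamma * sum(p * U[s2] for p, s2 in P[s][pi[s]])
print(U)    # approaches U^pi as long as every state keeps being selected
```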