Four elements of reinforcement learning
- policy. A policy is a function or rule whose input is the environment state and whose output is an action (Roughly speaking, a policy is a mapping from perceived states of the environment to actions to be taken when in those states.)
- reward. The reward is the feedback the environment gives after an action; it depends on the environment state and the action taken. The reward represents immediate payoff (On each time step, the environment sends to the reinforcement learning agent a single number, a reward. The agent’s sole objective is to maximize the total reward it receives over the long run. The reward signal thus defines what are the good and bad events for the agent)
- value function. The value function represents long-term return. It is usually written v(s), the expected future reward for an agent starting from state s (Roughly speaking, the value of a state is the total amount of reward an agent can expect to accumulate over the future, starting from that state). A state's reward can be low while its value is high, because the states it leads to can have high rewards; see the first sketch after this list. An analogy: (To make a human analogy, rewards are somewhat like pleasure (if high) and pain (if low), whereas values correspond to a more refined and farsighted judgment of how pleased or displeased we are that our environment is in a particular state.). When choosing an action, we prefer the state with the larger value (We seek actions that bring about states of highest value, not highest reward, because these actions obtain the greatest amount of reward for us over the long run). Estimating the value function of states is the core of reinforcement learning.
- model of the environment. The model mimics the environment: given the current state and the action taken, it predicts the next state and the reward the agent will receive. The model is mainly used for planning. Methods that rely on knowing how the environment works are called model-based; the counterpart is model-free, where the agent must estimate by repeated trial and error; see the second sketch below.
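
To make the reward-versus-value distinction concrete, here is a minimal sketch. The states, transition structure, rewards, and discount factor are all invented for illustration: a tiny deterministic chain where state A pays nothing immediately but leads to a well-paying state B, so A's value is high even though its reward is zero. Iterative policy evaluation computes v(s) as the expected discounted sum of future rewards.

```python
# Minimal sketch: value vs. reward in a tiny deterministic MDP.
# States, transitions, rewards, and gamma are hypothetical.

gamma = 0.9  # discount factor

# Under a fixed policy, each state moves deterministically to next_state
# and yields the listed immediate reward.
transitions = {
    "A": ("B", 0.0),   # A itself gives no reward...
    "B": ("C", 10.0),  # ...but leads to B, which pays well
    "C": ("C", 0.0),   # C is absorbing with zero reward
}

# Iterative policy evaluation: v(s) <- r + gamma * v(s')
v = {s: 0.0 for s in transitions}
for _ in range(100):
    for s, (s_next, r) in transitions.items():
        v[s] = r + gamma * v[s_next]

print(v)  # v["A"] ≈ 9.0, even though A's immediate reward is 0.0
```

State A's immediate reward is 0, yet its value converges to about 9 because it leads to B, whose reward is 10: exactly the "low reward, high value" situation described above.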
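
The model-based versus model-free distinction can also be sketched in a few lines (again, the environment, states, and probabilities here are hypothetical): with a model, the agent can predict the expected outcome of an action without ever acting; without one, it must interact with the environment and average over sampled outcomes.

```python
import random

# Hypothetical stochastic environment: from state s, action a leads to
# one of several (next_state, reward, probability) outcomes.
MODEL = {
    ("s0", "go"): [("s1", 1.0, 0.7), ("s2", 0.0, 0.3)],
}

def plan_with_model(state, action):
    """Model-based: use the known transition model to compute the
    expected reward of an action without touching the environment."""
    return sum(p * r for (_, r, p) in MODEL[(state, action)])

def step(state, action):
    """The real environment: samples one outcome; the agent only sees
    the result, not the underlying probabilities."""
    outcomes = MODEL[(state, action)]
    s_next, r, _ = random.choices(outcomes, weights=[p for *_, p in outcomes])[0]
    return s_next, r

def estimate_model_free(state, action, n=10000):
    """Model-free: estimate the same expected reward by trial and error,
    averaging rewards over many sampled interactions."""
    return sum(step(state, action)[1] for _ in range(n)) / n

print(plan_with_model("s0", "go"))      # exact: 0.7
print(estimate_model_free("s0", "go"))  # approximate: around 0.7
```

The planner gets the exact answer from the model in one computation, while the model-free estimate needs many real interactions to approach the same number; this is the trade-off between having a model for planning and learning purely from experience.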