Reinforcement Learning Techniques · Mastering TensorFlow 1.x

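# Reinforcement Learning Techniques

Reinforcement learning techniques can be categorized by whether a model of the environment is available:

* **Model available**: If the model is available, the agent can plan offline by iterating over policies or value functions to find the optimal policy, that is, the one that yields the maximum reward.
    * **Value-iteration learning**: The agent starts by initializing `V(s)` to random values, then repeatedly updates `V(s)` until the values that give the maximum reward are found.
    * **Policy-iteration learning**: The agent starts by initializing a random policy `p`, then repeatedly updates the policy until the one that gives the maximum reward is found.
* **Model unavailable**: If the model is not available, the agent can learn only by observing the results of its actions. From the history of observations, actions, and rewards, the agent either tries to estimate the model or tries to derive the optimal policy directly:
    * **Model-based learning**: The agent first estimates the model from the history, then uses a policy-based or value-based method to find the optimal policy.
    * **Model-free learning**: The agent does not estimate the model; instead it estimates the optimal policy directly from the history. Q-Learning is an example of model-free learning.

As an example, the algorithm for value-iteration learning is as follows:

```py
initialize V(s) to random values for all states
Repeat
    for s in states
        for a in actions
            compute Q[s,a]
        V(s) = max(Q[s])   # maximum of Q for all actions for that state
Until optimal value of V(s) is found for all states
```

The algorithm for policy-iteration learning is as follows:

```py
initialize a policy P_new to random sequence of actions for all states
Repeat
    P = P_new
    for s in states
        compute V(s) with P[s]
        P_new[s] = policy of optimal V(s)
Until P == P_new
```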
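The two pseudocode listings above can be made concrete on a toy problem. The following is a minimal sketch, not the book's implementation: the 4-state chain environment, the transition table layout `P[s][a]`, the discount factor `GAMMA`, and all function names are assumptions introduced here for illustration. For simplicity, values start at zero and the initial policy is "always move left" rather than random; with a discount factor below 1, the starting point does not change the result.

```python
# Hypothetical 4-state chain MDP (an assumption for illustration):
# states 0..3, state 3 is terminal; actions 0 = left, 1 = right.
# Entering state 3 yields reward 1; every other transition yields 0.
GAMMA = 0.9
N_STATES, N_ACTIONS = 4, 2

def build_chain():
    """Return P where P[s][a] is a list of (probability, next_state, reward)."""
    P = {}
    for s in range(N_STATES):
        P[s] = {}
        for a in range(N_ACTIONS):
            if s == N_STATES - 1:
                P[s][a] = [(1.0, s, 0.0)]  # terminal state: absorbing, no reward
            else:
                nxt = max(s - 1, 0) if a == 0 else s + 1
                reward = 1.0 if nxt == N_STATES - 1 else 0.0
                P[s][a] = [(1.0, nxt, reward)]
    return P

def q_value(P, V, s, a):
    """Expected return of taking action a in state s, then following V."""
    return sum(p * (r + GAMMA * V[s2]) for p, s2, r in P[s][a])

def greedy_policy(P, V):
    """Pick, for each state, the action with the highest Q value."""
    return [max(range(N_ACTIONS), key=lambda a: q_value(P, V, s, a))
            for s in range(N_STATES)]

def value_iteration(P, tol=1e-8):
    V = [0.0] * N_STATES
    while True:
        delta = 0.0
        for s in range(N_STATES):
            best = max(q_value(P, V, s, a) for a in range(N_ACTIONS))
            delta = max(delta, abs(best - V[s]))
            V[s] = best  # V(s) = max over actions of Q[s, a]
        if delta < tol:  # stop once no state value changes noticeably
            return V, greedy_policy(P, V)

def policy_evaluation(P, policy, tol=1e-8):
    """Compute V(s) for a fixed policy by repeated sweeps."""
    V = [0.0] * N_STATES
    while True:
        delta = 0.0
        for s in range(N_STATES):
            v = q_value(P, V, s, policy[s])
            delta = max(delta, abs(v - V[s]))
            V[s] = v
        if delta < tol:
            return V

def policy_iteration(P):
    policy = [0] * N_STATES  # arbitrary starting policy: always move left
    while True:
        V = policy_evaluation(P, policy)
        new_policy = greedy_policy(P, V)
        if new_policy == policy:  # "Until P == P_new"
            return V, policy
        policy = new_policy

if __name__ == "__main__":
    P = build_chain()
    print("value iteration: ", value_iteration(P))
    print("policy iteration:", policy_iteration(P))
```

On this chain both routines settle on moving right in every non-terminal state, which is a quick sanity check that the two approaches agree. The stopping test `delta < tol` stands in for the pseudocode's "until optimal value of V(s) is found", since in practice convergence is detected by the values no longer changing.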