# Applying simple policies to the cartpole game

So far, we have randomly picked an action and applied it. Now let's apply some logic to picking the action instead of random chance. The third observation refers to the pole angle. If the angle is greater than zero, the pole is tilting to the right, so we move the cart to the right (1). Otherwise, we move the cart to the left (0). Let's look at an example:

1. We define two policy functions as follows:

```py
def policy_logic(env,obs):
    return 1 if obs[2] > 0 else 0

def policy_random(env,obs):
    return env.action_space.sample()
```

1. Next, we define an experiment function that runs for a specific number of episodes; each episode lasts until the game is lost, i.e. until `done` is `True`. We use `rewards_max` to indicate when to break out of the loop, as we do not wish to run the experiment forever:

```py
def experiment(policy, n_episodes, rewards_max):
    rewards = np.empty(shape=(n_episodes))
    env = gym.make('CartPole-v0')

    for i in range(n_episodes):
        obs = env.reset()
        done = False
        episode_reward = 0
        while not done:
            action = policy(env, obs)
            obs, reward, done, info = env.step(action)
            episode_reward += reward
            if episode_reward > rewards_max:
                break
        rewards[i] = episode_reward

    print('Policy:{}, Min reward:{}, Max reward:{}, Average reward:{}'
          .format(policy.__name__,
                  np.min(rewards),
                  np.max(rewards),
                  np.mean(rewards)))
```

1. We run the experiment for 100 episodes, breaking out of an episode once its reward exceeds `rewards_max`, which is set to 10,000:

```py
n_episodes = 100
rewards_max = 10000
experiment(policy_random, n_episodes, rewards_max)
experiment(policy_logic, n_episodes, rewards_max)
```

We can see that the logically selected actions do better than the randomly selected ones, but not that much better:

```py
Policy:policy_random, Min reward:9.0, Max reward:63.0, Average reward:20.26
Policy:policy_logic, Min reward:24.0, Max reward:66.0, Average reward:42.81
```

Now let's modify the process of selecting the action further, so that it is based on parameters. The parameters are multiplied by the observations, and the action is selected depending on the sign of the product: 0 if it is negative, 1 otherwise. Let's modify the random search method so that the parameters are initialized randomly. The code looks as follows:

```py
def policy_logic(theta,obs):
    # just ignore theta
    return 1 if obs[2] > 0 else 0

def policy_random(theta,obs):
    return 0 if np.matmul(theta,obs) < 0 else 1

def episode(env, policy, rewards_max):
    obs = env.reset()
    done = False
    episode_reward = 0
    if policy.__name__ in ['policy_random']:
        theta = np.random.rand(4) * 2 - 1
    else:
        theta = None
    while not done:
        action = policy(theta, obs)
        obs, reward, done, info = env.step(action)
        episode_reward += reward
        if episode_reward > rewards_max:
            break
    return episode_reward

def experiment(policy, n_episodes, rewards_max):
    rewards = np.empty(shape=(n_episodes))
    env = gym.make('CartPole-v0')
    for i in range(n_episodes):
        rewards[i] = episode(env, policy, rewards_max)
        #print("Episode finished at t{}".format(reward))

    print('Policy:{}, Min reward:{}, Max reward:{}, Average reward:{}'
          .format(policy.__name__,
                  np.min(rewards),
                  np.max(rewards),
                  np.mean(rewards)))

n_episodes = 100
rewards_max = 10000
experiment(policy_random, n_episodes, rewards_max)
experiment(policy_logic, n_episodes, rewards_max)
```

We can see that random search does improve the results:

```py
Policy:policy_random, Min reward:8.0, Max reward:200.0, Average reward:40.04
Policy:policy_logic, Min reward:25.0, Max reward:62.0, Average reward:43.03
```

With random search, we improved the results to reach the maximum reward of 200. On average, the rewards of random search are lower, because random search tries out many bad parameters that bring the overall results down. However, we can select the best parameters from all of the runs and then use those best parameters in production. Let's modify the code so that the parameters are trained first:

```py
def policy_logic(theta,obs):
    # just ignore theta
    return 1 if obs[2] > 0 else 0

def policy_random(theta,obs):
    return 0 if np.matmul(theta,obs) < 0 else 1

def episode(env, policy, rewards_max, theta):
    obs = env.reset()
    done = False
    episode_reward = 0
    while not done:
        action = policy(theta, obs)
        obs, reward, done, info = env.step(action)
        episode_reward += reward
        if episode_reward > rewards_max:
            break
    return episode_reward

def train(policy, n_episodes, rewards_max):
    env = gym.make('CartPole-v0')
    theta_best = np.empty(shape=[4])
    reward_best = 0

    for i in range(n_episodes):
        if policy.__name__ in ['policy_random']:
            theta = np.random.rand(4) * 2 - 1
        else:
            theta = None

        reward_episode = episode(env, policy, rewards_max, theta)
        if reward_episode > reward_best:
            reward_best = reward_episode
            theta_best = theta.copy()
    return reward_best, theta_best

def experiment(policy, n_episodes, rewards_max, theta=None):
    rewards = np.empty(shape=[n_episodes])
    env = gym.make('CartPole-v0')
    for i in range(n_episodes):
        rewards[i] = episode(env, policy, rewards_max, theta)
        #print("Episode finished at t{}".format(reward))

    print('Policy:{}, Min reward:{}, Max reward:{}, Average reward:{}'
          .format(policy.__name__,
                  np.min(rewards),
                  np.max(rewards),
                  np.mean(rewards)))

n_episodes = 100
rewards_max = 10000

reward, theta = train(policy_random, n_episodes, rewards_max)
print('trained theta: {}, rewards: {}'.format(theta, reward))
experiment(policy_random, n_episodes, rewards_max, theta)
experiment(policy_logic, n_episodes, rewards_max)
```

We train for 100 episodes and then run the experiment for the random search policy with the best parameters found:

```py
n_episodes = 100
rewards_max = 10000

reward, theta = train(policy_random, n_episodes, rewards_max)
print('trained theta: {}, rewards: {}'.format(theta, reward))
experiment(policy_random, n_episodes, rewards_max, theta)
experiment(policy_logic, n_episodes, rewards_max)
```

We find that the trained parameters give the best result of 200:

```py
trained theta: [-0.14779543  0.93269603  0.70896423  0.84632461], rewards: 200.0
Policy:policy_random, Min reward:200.0, Max reward:200.0, Average reward:200.0
Policy:policy_logic, Min reward:24.0, Max reward:63.0, Average reward:41.94
```

We can optimize the training code to keep training until we reach the maximum reward. The code for this optimization is provided in the `ch-13a_Reinforcement_Learning_NN` notebook.
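The notebook's exact code is not reproduced here, but the idea can be sketched as follows: keep drawing random parameter vectors until an episode reaches the target reward, instead of stopping after a fixed number of episodes. The following is a minimal sketch under that assumption; it reuses the `episode` and `policy_random` functions from the previous listing, and the name `train_until_max` and its `max_iterations` safety cap are illustrative, not the notebook's actual code:

```py
# Minimal sketch (assumed, not the notebook's exact code): keep sampling
# random parameters until an episode reaches the target reward.
# Reuses episode() and policy_random() from the previous listing.
def train_until_max(policy, target_reward, max_iterations=10000):
    env = gym.make('CartPole-v0')
    theta_best = None
    reward_best = 0
    for i in range(max_iterations):
        theta = np.random.rand(4) * 2 - 1   # sample parameters in [-1, 1)
        reward_episode = episode(env, policy, target_reward, theta)
        if reward_episode > reward_best:
            reward_best = reward_episode
            theta_best = theta.copy()
        if reward_best >= target_reward:    # stop once the target is reached
            break
    return reward_best, theta_best

# CartPole-v0 caps an episode at 200 steps, so 200 is the maximum reward
reward, theta = train_until_max(policy_random, target_reward=200)
print('trained theta: {}, rewards: {}'.format(theta, reward))
```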
Now that we have learned the basics of OpenAI Gym, let's learn about reinforcement learning.