Proximal Policy Optimization with Generalized Advantage Estimation (PPO2)
PPO2 (Proximal Policy Optimization with Generalized Advantage Estimation) is an extension of PPO that combines the PPO algorithm with Generalized Advantage Estimation (GAE), a method for estimating the advantage function.
The main difference between PPO and PPO2 is how they estimate the advantage function: PPO uses a single-step estimate, while PPO2 uses the GAE method, a multi-step estimate that takes the rewards and values of multiple time steps into account and can therefore produce a more accurate estimate of the advantage function.
In PPO2, the GAE estimate of the advantage function is used to guide the policy update. The advantage function is defined as:
$$A_t = R_t + \gamma V(s_{t+1}) - V(s_t)$$
where \(R_t\) is the reward at time step \(t\), \(V(s_t)\) is the value function at time step \(t\), and \(\gamma\) is the discount factor.
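To make this concrete, here is a tiny numerical sketch in Python; the reward, value estimates, and discount factor below are made-up numbers, used only to show the arithmetic of the formula above.

```python
# Hypothetical numbers, only to illustrate A_t = R_t + gamma * V(s_{t+1}) - V(s_t)
gamma = 0.99
reward = 1.0          # R_t
value_s = 2.5         # V(s_t), current-state value estimate
value_s_next = 2.0    # V(s_{t+1}), next-state value estimate

advantage = reward + gamma * value_s_next - value_s
print(advantage)      # 1.0 + 0.99 * 2.0 - 2.5 = 0.48
```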
The GAE method uses a multi-step estimate of the advantage function, which takes into account the rewards and values of multiple time steps. The GAE advantage estimate is defined as:
$$\hat A_t^{GAE(\gamma, \lambda)} = \sum_{i=0}^{\infty}(\gamma\lambda)^i\delta_{t+i}$$
where \(\delta_t = R_t + \gamma V(s_{t+1}) - V(s_t)\) is the TD error, and \(\lambda\) is a hyperparameter that controls the trade-off between bias and variance in the estimate.
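As a minimal sketch (not taken from any particular library), the GAE sum can be computed with a backward pass over a rollout using the recursion \(\hat A_t = \delta_t + \gamma\lambda\hat A_{t+1}\); the function name and array layout below are assumptions for illustration.

```python
import numpy as np

def gae_advantages(rewards, values, dones, gamma=0.99, lam=0.95):
    """Compute GAE(gamma, lambda) advantages for a single rollout.

    rewards: shape (T,)   -- R_t
    values:  shape (T+1,) -- V(s_t), including a bootstrap value for the final state
    dones:   shape (T,)   -- 1.0 if the episode ended at step t, else 0.0
    """
    T = len(rewards)
    advantages = np.zeros(T)
    gae = 0.0
    for t in reversed(range(T)):
        nonterminal = 1.0 - dones[t]
        # TD error: delta_t = R_t + gamma * V(s_{t+1}) - V(s_t)
        delta = rewards[t] + gamma * values[t + 1] * nonterminal - values[t]
        # Recursive form of the GAE sum: A_t = delta_t + gamma * lambda * A_{t+1}
        gae = delta + gamma * lam * nonterminal * gae
        advantages[t] = gae
    return advantages
```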
The PPO2 algorithm uses a clipped surrogate objective function to update the policy, which is defined as:
$$L^{CLIP}(\theta) = \hat{\mathbb{E}}_t\Big[\min\Big(r_t(\theta)\hat A_t, \text{clip}(r_t(\theta), 1-\epsilon, 1+\epsilon)\hat A_t\Big)\Big]$$
where \(\theta\) denotes the policy parameters, \(\hat{\mathbb{E}}_t\) is the empirical expectation over a batch of samples, \(r_t(\theta) = \pi_\theta(a_t \mid s_t)/\pi_{\theta_{old}}(a_t \mid s_t)\) is the probability ratio between the new and old policies, and \(\epsilon\) is a hyperparameter that controls the size of the clipping region.
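A minimal NumPy sketch of this objective, assuming the log-probabilities of the sampled actions under the old and new policies and the advantage estimates are already available; in a real implementation this would be written with an autodiff framework so that gradients can flow through the new policy's log-probabilities.

```python
import numpy as np

def clipped_surrogate(new_log_probs, old_log_probs, advantages, epsilon=0.2):
    """L^CLIP: mean of min(r_t * A_t, clip(r_t, 1 - eps, 1 + eps) * A_t).

    This is the objective to be maximized (or its negative minimized).
    """
    ratio = np.exp(new_log_probs - old_log_probs)                      # r_t(theta)
    unclipped = ratio * advantages
    clipped = np.clip(ratio, 1.0 - epsilon, 1.0 + epsilon) * advantages
    return np.mean(np.minimum(unclipped, clipped))
```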
PPO2 also uses a different method for updating the value function, which is based on the GAE advantage estimate:
$$L^{VF}(\theta) = \hat{\mathbb{E}}_t\Big[\big(V_{\theta}(s_t) - V_t^{GAE}\big)^2\Big]$$
where \(V_{\theta}(s_t)\) is the value of state \(s_t\) predicted by the value function, and \(V_t^{GAE}\) is the GAE-based value target at time step \(t\), obtained by adding the GAE advantage estimate to the rollout value estimate \(V(s_t)\).
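A corresponding sketch of the value-function loss, under the common construction of the target described above; the function and variable names are illustrative assumptions.

```python
import numpy as np

def value_loss(predicted_values, advantages, old_values):
    """L^VF: mean squared error between V_theta(s_t) and the GAE value target."""
    # GAE value target: V_t^GAE = A_t^GAE + V(s_t) from the rollout
    value_targets = advantages + old_values
    return np.mean((predicted_values - value_targets) ** 2)
```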
Overall, PPO2 is considered to be a more stable and robust algorithm than PPO, due to its use of the GAE method for estimating the advantage function, and its improved clipping and value function update methods.