  • Proximal Policy Optimization with Generalized Advantage Estimation (PPO2)
    CS & ML Basic 2023. 2. 14. 17:04

    PPO2 (Proximal Policy Optimization with Generalized Advantage Estimation) is an extension of PPO that combines the PPO policy update with Generalized Advantage Estimation (GAE), a method for estimating the advantage function.

     

    The main difference between PPO and PPO2 lies in how they estimate the advantage function. PPO uses a single-step estimate, while PPO2 uses the multi-step GAE estimate. Because GAE takes the rewards and values of multiple time steps into account, it can produce a more accurate estimate of the advantage function.

     

    In PPO2, the GAE estimate of the advantage function is used to guide the policy update. The one-step advantage is defined as:

    $$A_t = R_t + \gamma V(s_{t+1}) - V(s_t)$$

    where \(R_t\) is the reward at time step \(t\), \(V(s_t)\) is the value function at time step \(t\), and \(\gamma\) is the discount factor.
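
    As a quick numeric sketch of this formula (the reward and value numbers below are made up purely for illustration):

    ```python
    # One-step advantage: A_t = R_t + gamma * V(s_{t+1}) - V(s_t)
    gamma = 0.99       # discount factor
    reward = 1.0       # R_t, reward received at time step t
    value_t = 0.5      # V(s_t), critic's estimate for the current state
    value_next = 0.6   # V(s_{t+1}), critic's estimate for the next state

    advantage = reward + gamma * value_next - value_t
    print(advantage)   # 1.0 + 0.99 * 0.6 - 0.5 = 1.094
    ```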

     

    The GAE method uses a multi-step estimate of the advantage function, which takes into account the rewards and values of multiple time steps. The GAE advantage estimate is defined as:

    $$\hat A_t^{GAE(\gamma, \lambda)} = \sum_{i=0}^{\infty}(\gamma\lambda)^i\delta_{t+i}$$

    where \(\delta_t = R_t + \gamma V(s_{t+1}) - V(s_t)\) is the TD error, and \(\lambda\) is a hyperparameter that controls the trade-off between bias and variance in the estimate.
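
    Below is a minimal sketch of how this estimate is typically computed in practice, assuming a single trajectory whose rewards and value predictions are stored in NumPy arrays (the function and argument names are illustrative, and episode-termination handling is omitted):

    ```python
    import numpy as np

    def gae_advantages(rewards, values, last_value, gamma=0.99, lam=0.95):
        """Compute GAE(gamma, lambda) advantage estimates for one trajectory.

        rewards:    R_t for t = 0..T-1
        values:     V(s_t) for t = 0..T-1
        last_value: V(s_T), the value of the state reached after the last step
        """
        T = len(rewards)
        values_ext = np.append(values, last_value)
        advantages = np.zeros(T)
        gae = 0.0
        # The infinite sum has the recursive form A_t = delta_t + gamma * lambda * A_{t+1},
        # so a single backward pass over the trajectory is enough.
        for t in reversed(range(T)):
            delta = rewards[t] + gamma * values_ext[t + 1] - values_ext[t]  # TD error
            gae = delta + gamma * lam * gae
            advantages[t] = gae
        return advantages
    ```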

     

    The PPO2 algorithm uses a clipped surrogate objective function to update the policy, which is defined as:

    $$L^{CLIP}(\theta) = \hat{\mathbb{E}}_t\Big[\min\Big(r_t(\theta)\hat A_t, \text{clip}(r_t(\theta), 1-\epsilon, 1+\epsilon)\hat A_t\Big)\Big]$$

    where \(\theta\) denotes the policy parameters, \(\hat{\mathbb{E}}_t\) is the empirical expectation over a batch of samples, \(r_t(\theta) = \pi_\theta(a_t \mid s_t) / \pi_{\theta_{\text{old}}}(a_t \mid s_t)\) is the probability ratio between the new and old policies, and \(\epsilon\) is a hyperparameter that controls the size of the clipping region.
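
    A minimal PyTorch-style sketch of this objective, assuming the old-policy log-probabilities and the advantage estimates were stored during the rollout (the function and variable names are illustrative; the loss is negated so it can be minimized with a standard optimizer):

    ```python
    import torch

    def clipped_surrogate_loss(new_log_probs, old_log_probs, advantages, eps=0.2):
        """Negative PPO clipped surrogate objective for a batch of samples."""
        # r_t(theta) = pi_theta(a_t | s_t) / pi_theta_old(a_t | s_t)
        ratio = torch.exp(new_log_probs - old_log_probs)
        unclipped = ratio * advantages
        clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * advantages
        # Elementwise minimum, then the batch mean as the empirical expectation
        return -torch.min(unclipped, clipped).mean()
    ```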

     

    PPO2 also uses a different method for updating the value function, which is based on the GAE advantage estimate:

    $$L^{VF}(\theta) = \hat{\mathbb{E}}_t\Big[\big(V_{\theta}(s_t) - V_t^{GAE}\big)^2\Big]$$

    where \(V_{\theta}(s_t)\) is the value predicted for state \(s_t\) by the value function, and \(V_t^{GAE} = \hat A_t^{GAE(\gamma, \lambda)} + V(s_t)\) is the GAE-based value target at time step \(t\).
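
    Continuing the PyTorch sketch above, the value loss is a mean squared error between the critic's predictions and the GAE-based targets, which are treated as fixed targets with no gradient flowing through them (the usual convention; the tensor names are assumptions):

    ```python
    def value_function_loss(predicted_values, value_targets):
        """Mean squared error between V_theta(s_t) and the GAE-based targets V_t^GAE.

        value_targets is assumed to be advantages + rollout values,
        i.e. V_t^GAE = A_t_hat + V(s_t), computed once from the collected trajectory.
        """
        return ((predicted_values - value_targets.detach()) ** 2).mean()
    ```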

     

    Overall, PPO2 is considered more stable and robust than PPO due to its use of the GAE method for estimating the advantage function and its improved clipping and value-function update methods.

     
