Proximal Policy Optimization with Generalized Advantage Estimation (PPO2)
PPO2 (Proximal Policy Optimization with Generalized Advantage Estimation) is an extension of PPO that combines the PPO algorithm with Generalized Advantage Estimation (GAE), a method for estimating the advantage function.
The main difference between PPO and PPO2 is how they estimate the advantage function: PPO uses a single-step estimate, while PPO2 uses the GAE method, a multi-step estimate that takes the rewards and values of multiple time steps into account and can therefore produce a more accurate estimate of the advantage function.
In PPO2, the GAE estimate of the advantage function is used to guide the policy update. The advantage function is defined as:
$$A_t = R_t + \gamma V(s_{t+1}) - V(s_t)$$
where \(R_t\) is the reward at time step \(t\), \(V(s_t)\) is the value function at time step \(t\), and \(\gamma\) is the discount factor.
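To make this concrete, here is a tiny numerical sketch in Python; the reward, value estimates, and discount factor below are made-up numbers, used only to show the arithmetic of the formula above.

```python
# Hypothetical numbers, only to illustrate A_t = R_t + gamma * V(s_{t+1}) - V(s_t)
gamma = 0.99
reward = 1.0          # R_t
value_s = 2.5         # V(s_t), current-state value estimate
value_s_next = 2.0    # V(s_{t+1}), next-state value estimate

advantage = reward + gamma * value_s_next - value_s
print(advantage)      # 1.0 + 0.99 * 2.0 - 2.5 = 0.48
```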
The GAE method uses a multi-step estimate of the advantage function, which takes into account the rewards and values of multiple time steps. The GAE advantage estimate is defined as:
$$\hat A_t^{GAE(\gamma, \lambda)} = \sum_{i=0}^{\infty}(\gamma\lambda)^i\delta_{t+i}$$
where \(\delta_t = R_t + \gamma V(s_{t+1}) - V(s_t)\) is the TD error, and \(\lambda\) is a hyperparameter that controls the trade-off between bias and variance in the estimate.
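As a minimal sketch (not taken from any particular library), the GAE sum can be computed with a backward pass over a rollout using the recursion \(\hat A_t = \delta_t + \gamma\lambda\hat A_{t+1}\); the function name and array layout below are assumptions for illustration.

```python
import numpy as np

def gae_advantages(rewards, values, dones, gamma=0.99, lam=0.95):
    """Compute GAE(gamma, lambda) advantages for a single rollout.

    rewards: shape (T,)   -- R_t
    values:  shape (T+1,) -- V(s_t), including a bootstrap value for the final state
    dones:   shape (T,)   -- 1.0 if the episode ended at step t, else 0.0
    """
    T = len(rewards)
    advantages = np.zeros(T)
    gae = 0.0
    for t in reversed(range(T)):
        nonterminal = 1.0 - dones[t]
        # TD error: delta_t = R_t + gamma * V(s_{t+1}) - V(s_t)
        delta = rewards[t] + gamma * values[t + 1] * nonterminal - values[t]
        # Recursive form of the GAE sum: A_t = delta_t + gamma * lambda * A_{t+1}
        gae = delta + gamma * lam * nonterminal * gae
        advantages[t] = gae
    return advantages
```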
The PPO2 algorithm uses a clipped surrogate objective function to update the policy, which is defined as:
$$L^{CLIP}(\theta) = \hat{\mathbb{E}}_t\Big[\min\Big(r_t(\theta)\hat A_t, \text{clip}(r_t(\theta), 1-\epsilon, 1+\epsilon)\hat A_t\Big)\Big]$$
where \(\theta\) denotes the policy parameters, \(\hat{\mathbb{E}}_t\) is the empirical expectation over a batch of samples, \(r_t(\theta) = \pi_\theta(a_t \mid s_t)/\pi_{\theta_{old}}(a_t \mid s_t)\) is the probability ratio between the new and old policies, and \(\epsilon\) is a hyperparameter that controls the size of the clipping region.
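A minimal NumPy sketch of this objective, assuming the log-probabilities of the sampled actions under the old and new policies and the advantage estimates are already available; in a real implementation this would be written with an autodiff framework so that gradients can flow through the new policy's log-probabilities.

```python
import numpy as np

def clipped_surrogate(new_log_probs, old_log_probs, advantages, epsilon=0.2):
    """L^CLIP: mean of min(r_t * A_t, clip(r_t, 1 - eps, 1 + eps) * A_t).

    This is the objective to be maximized (or its negative minimized).
    """
    ratio = np.exp(new_log_probs - old_log_probs)                      # r_t(theta)
    unclipped = ratio * advantages
    clipped = np.clip(ratio, 1.0 - epsilon, 1.0 + epsilon) * advantages
    return np.mean(np.minimum(unclipped, clipped))
```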
PPO2 also uses a different method for updating the value function, which is based on the GAE advantage estimate:
$$L^{VF}(\theta) = \hat{\mathbb{E}}_t\Big[\big(V_{\theta}(s_t) - V_t^{GAE}\big)^2\Big]$$
where \(V_{\theta}(s_t)\) is the value of state \(s_t\) predicted by the value function, and \(V_t^{GAE}\) is the GAE-based value target at time step \(t\), obtained by adding the GAE advantage estimate to the rollout value estimate \(V(s_t)\).
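A corresponding sketch of the value-function loss, under the common construction of the target described above; the function and variable names are illustrative assumptions.

```python
import numpy as np

def value_loss(predicted_values, advantages, old_values):
    """L^VF: mean squared error between V_theta(s_t) and the GAE value target."""
    # GAE value target: V_t^GAE = A_t^GAE + V(s_t) from the rollout
    value_targets = advantages + old_values
    return np.mean((predicted_values - value_targets) ** 2)
```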
Overall, PPO2 is considered to be a more stable and robust algorithm than PPO, due to its use of the GAE method for estimating the advantage function, and its improved clipping and value function update methods.