Proximal Policy Optimization with Generalized Advantage Estimation (PPO2)
PPO2 (Proximal Policy Optimization with Generalized Advantage Estimation) is an extension of PPO that combines the PPO algorithm with Generalized Advantage Estimation (GAE), a method for estimating the advantage function.
The main difference between PPO and PPO2 lies in how they estimate the advantage function. PPO uses a single-step estimate, while PPO2 uses the GAE method, which forms a multi-step estimate. Because GAE takes into account the rewards and values of multiple time steps, it can produce a more accurate estimate of the advantage function.
In PPO2, the GAE advantage estimates guide the policy update. The single-step advantage estimate used in plain PPO is defined as:

$$A_t = r_t + \gamma V(s_{t+1}) - V(s_t)$$

where $r_t$ is the reward at time step $t$, $V(s_t)$ is the value function at state $s_t$, and $\gamma$ is the discount factor.

The GAE method uses a multi-step estimate of the advantage function, which takes into account the rewards and values of multiple time steps. The GAE advantage estimate is defined as:

$$\hat{A}_t^{GAE(\gamma, \lambda)} = \sum_{l=0}^{\infty} (\gamma \lambda)^l \delta_{t+l}$$

where $\delta_t = r_t + \gamma V(s_{t+1}) - V(s_t)$ is the TD error, and $\lambda \in [0, 1]$ is a hyperparameter that controls the trade-off between bias and variance in the estimate.
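In code, this infinite sum is computed backwards over a finite trajectory using the equivalent recursion $\hat{A}_t = \delta_t + \gamma \lambda \hat{A}_{t+1}$. Here is a minimal sketch in plain NumPy (the function name and array layout are illustrative assumptions, not from any particular library):

```python
import numpy as np

def compute_gae(rewards, values, gamma=0.99, lam=0.95):
    """GAE advantage estimates for a single trajectory.

    rewards: array of shape [T]
    values:  array of shape [T + 1], including a bootstrap value
             for the state reached after the final step
    """
    T = len(rewards)
    advantages = np.zeros(T)
    gae = 0.0
    # Sweep backwards so each step reuses the accumulated tail sum
    for t in reversed(range(T)):
        # TD error: delta_t = r_t + gamma * V(s_{t+1}) - V(s_t)
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        # Recursion: A_hat_t = delta_t + gamma * lambda * A_hat_{t+1}
        gae = delta + gamma * lam * gae
        advantages[t] = gae
    return advantages
```

A full implementation would also reset the recursion at episode boundaries (for example, by zeroing the tail term with a done mask), which this sketch omits.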
The PPO2 algorithm uses a clipped surrogate objective function to update the policy, which is defined as:

$$L^{CLIP}(\theta) = \hat{\mathbb{E}}_t \left[ \min\left( r_t(\theta) \hat{A}_t,\; \mathrm{clip}(r_t(\theta), 1 - \epsilon, 1 + \epsilon) \hat{A}_t \right) \right]$$

where $\theta$ is the policy parameters, $\hat{\mathbb{E}}_t$ is the empirical expectation over a batch of samples, $r_t(\theta) = \pi_\theta(a_t \mid s_t) / \pi_{\theta_{\mathrm{old}}}(a_t \mid s_t)$ is the ratio between the new and old policies, and $\epsilon$ is a hyperparameter that controls the size of the clipping region.
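As a minimal sketch, this objective can be written as a loss to minimize, assuming PyTorch tensors of per-sample log-probabilities and advantage estimates (all names here are illustrative):

```python
import torch

def clipped_surrogate_loss(new_log_probs, old_log_probs, advantages,
                           clip_eps=0.2):
    # Probability ratio r_t(theta), computed in log space for stability
    ratio = torch.exp(new_log_probs - old_log_probs)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    # Elementwise minimum, then negated: minimizing this loss
    # maximizes the clipped surrogate objective
    return -torch.min(unclipped, clipped).mean()
```

Taking the elementwise minimum means clipping can only lower the objective, which is what keeps each policy update conservative.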
PPO2 also uses a different method for updating the value function, which is based on the GAE advantage estimate:

$$L^{VF}(\phi) = \hat{\mathbb{E}}_t \left[ \left( V_\phi(s_t) - \hat{V}_t \right)^2 \right]$$

where $V_\phi(s_t)$ is the predicted value of state $s_t$ by the value function, and $\hat{V}_t = \hat{A}_t + V(s_t)$ is the GAE-based estimate of the value target at time step $t$.

Overall, PPO2 is considered to be a more stable and robust algorithm than PPO, due to its use of the GAE method for estimating the advantage function and its improved clipping and value-function update methods.
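As a closing sketch, the value update above can be written as a squared-error loss against targets $\hat{V}_t = \hat{A}_t + V(s_t)$ (again a sketch assuming PyTorch tensors; the function and argument names are illustrative):

```python
import torch

def gae_value_loss(values_pred, advantages, values_old):
    """Squared-error loss against GAE-based value targets.

    values_pred: V_phi(s_t) from the current value network, shape [N]
    advantages:  GAE advantage estimates A_hat_t, shape [N]
    values_old:  V(s_t) recorded when the batch was collected, shape [N]
    """
    # Target: V_hat_t = A_hat_t + V(s_t); detached so gradients flow
    # only through the predictions, not through the targets
    targets = (advantages + values_old).detach()
    return ((values_pred - targets) ** 2).mean()
```

In a full implementation, this value loss is combined with the clipped policy loss (and usually an entropy bonus) into a single objective that is minimized by stochastic gradient descent.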