Proximal Policy Optimization (PPO) Algorithm
The objective of the Proximal Policy Optimization (PPO) algorithm is to train a policy function that can control an agent's behavior in a given environment, such that it maximizes the expected cumulative reward over time.
More formally, we can define the objective of PPO as follows:
$$J(\theta) = \mathbb{E}_{\pi_{\theta}}\left[\sum_{t=0}^{\infty}\gamma^{t}r_{t}\right]$$
where \(J(\theta)\) is the expected cumulative reward, \(\pi_\theta\) is the policy function parameterized by \(\theta\), \(t\) is the time step, \(r_t\) is the reward received at time \(t\), and \(\gamma\) is a discount factor that balances immediate and future rewards.
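As a concrete illustration, here is a minimal sketch in plain Python of the discounted return that \(J(\theta)\) takes the expectation of; the `rewards` list is a hypothetical trajectory of rewards, not part of any specific library:

```python
def discounted_return(rewards, gamma=0.99):
    """Compute the discounted cumulative reward: sum over t of gamma^t * r_t."""
    total = 0.0
    for t, r in enumerate(rewards):
        total += (gamma ** t) * r
    return total

# Example: three steps with reward 1.0 each and gamma = 0.9
print(discounted_return([1.0, 1.0, 1.0], gamma=0.9))  # 1 + 0.9 + 0.81 = 2.71
```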
To train the policy function, PPO uses a surrogate objective function that approximates the true objective function and is easier to optimize. The surrogate objective function is defined as:
$$L^{CLIP}(\theta) = \mathbb{E}_{t}\left[\min\left(r_t(\theta)\hat{A}_t, \text{clip}(r_t(\theta), 1-\epsilon, 1+\epsilon)\hat{A}_t\right)\right]$$
where \(L^{CLIP}(\theta)\) is the clipped surrogate objective function, \(r_t(\theta)\) is the ratio between the probabilities of the current policy and the old policy for taking the action that was actually taken at time \(t\), \(\hat{A}_t\) is an estimator of the advantage function at time \(t\), and \(\epsilon\) is a hyperparameter that controls the size of the policy update.
Intuitively, the clipped surrogate objective function encourages small policy updates by clipping the probability ratio \(r_t(\theta)\) to the interval \([1-\epsilon, 1+\epsilon]\) before multiplying it by the advantage estimator \(\hat{A}_t\), and then taking the minimum of the clipped and unclipped terms. This removes the incentive for large policy updates that can lead to instability and poor performance.
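A minimal sketch of this objective in PyTorch is shown below; the tensor names `log_probs_new`, `log_probs_old`, and `advantages` are assumed inputs for illustration, not part of any particular library's API:

```python
import torch

def clipped_surrogate_loss(log_probs_new, log_probs_old, advantages, epsilon=0.2):
    """Clipped surrogate objective L^CLIP, negated so it can be minimized."""
    # r_t(theta) = pi_theta(a_t | s_t) / pi_theta_old(a_t | s_t), computed from log-probabilities
    ratios = torch.exp(log_probs_new - log_probs_old)
    unclipped = ratios * advantages
    clipped = torch.clamp(ratios, 1.0 - epsilon, 1.0 + epsilon) * advantages
    # Element-wise minimum of the two terms, averaged over the batch
    return -torch.min(unclipped, clipped).mean()
```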
During training, the PPO algorithm collects a batch of data from the environment, computes the clipped surrogate objective function, and optimizes it with stochastic gradient methods, typically performing several epochs of minibatch updates on the same batch, to update the policy parameters \(\theta\). By repeating this process, PPO learns a policy that controls the agent's behavior in the environment so as to maximize the expected cumulative reward.
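The update step of this loop might look roughly like the following sketch. The `policy`, `optimizer`, and batch tensors are assumed to come from a separate data-collection phase, and only the policy loss is shown; full implementations typically also add a value-function loss and an entropy bonus:

```python
import torch

def ppo_update(policy, optimizer, states, actions, log_probs_old, advantages,
               epochs=4, epsilon=0.2):
    """Run several epochs of clipped-surrogate updates on one batch of collected data.

    `policy(states)` is assumed to return a torch.distributions distribution
    over actions; this is an illustrative sketch, not a complete PPO implementation.
    """
    for _ in range(epochs):
        dist = policy(states)
        log_probs_new = dist.log_prob(actions)
        # Probability ratio between the current policy and the policy that collected the data
        ratios = torch.exp(log_probs_new - log_probs_old.detach())
        unclipped = ratios * advantages
        clipped = torch.clamp(ratios, 1.0 - epsilon, 1.0 + epsilon) * advantages
        loss = -torch.min(unclipped, clipped).mean()

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```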