  • Proximal Policy Optimization (PPO) Algorithm
    CS & ML Basic 2023. 2. 14. 16:02

    The objective of the Proximal Policy Optimization (PPO) algorithm is to train a policy function that can control an agent's behavior in a given environment, such that it maximizes the expected cumulative reward over time.


    More formally, we can define the objective of PPO as follows:

    J(\theta) = \mathbb{E}_{\pi_\theta}\!\left[\sum_{t=0}^{\infty} \gamma^{t} r_{t}\right]

    where J(θ) is the expected cumulative reward, π_θ is the policy function parameterized by θ, t is the time step, r_t is the reward received at time t, and γ is a discount factor that balances immediate and future rewards.
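
    To make the discount concrete, here is a minimal Python sketch of the discounted sum inside this expectation for a single finished episode (the reward values and γ = 0.9 in the example are made-up numbers for illustration):

    def discounted_return(rewards, gamma=0.99):
        # Compute the sum over t of gamma^t * r_t for one episode's reward sequence.
        total = 0.0
        for t, r in enumerate(rewards):
            total += (gamma ** t) * r
        return total

    # Example: three steps of reward 1.0 with gamma = 0.9
    # gives 1.0 + 0.9 + 0.81 = 2.71.
    print(discounted_return([1.0, 1.0, 1.0], gamma=0.9))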


    To train the policy function, PPO uses a surrogate objective function that approximates the true objective function and is easier to optimize. The surrogate objective function is defined as:

    L^{CLIP}(\theta) = \mathbb{E}_{t}\!\left[\min\!\left(r_t(\theta)\,\hat{A}_t,\ \operatorname{clip}\!\left(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon\right)\hat{A}_t\right)\right]

    where L^CLIP(θ) is the clipped surrogate objective, r_t(θ) is the ratio between the probabilities that the current policy and the old policy assign to the action actually taken at time t, Â_t is an estimator of the advantage function at time t, and ϵ is a hyperparameter that controls the size of the policy update.
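
    As a concrete reference, here is a minimal PyTorch sketch of this clipped surrogate written as a loss to minimize (the function name, argument names, and the default ϵ = 0.2 are choices made for this illustration, not fixed by the algorithm):

    import torch

    def clipped_surrogate_loss(new_log_probs, old_log_probs, advantages, epsilon=0.2):
        # new_log_probs: log-probabilities of the taken actions under the current policy
        # old_log_probs: log-probabilities under the policy that collected the data
        # advantages:    advantage estimates for each time step
        # r_t(theta) = pi_theta(a_t | s_t) / pi_theta_old(a_t | s_t), computed from log-probs
        ratio = torch.exp(new_log_probs - old_log_probs)
        unclipped = ratio * advantages
        clipped = torch.clamp(ratio, 1.0 - epsilon, 1.0 + epsilon) * advantages
        # Element-wise minimum, averaged over the batch; negated so that
        # minimizing this loss maximizes the clipped surrogate objective.
        return -torch.min(unclipped, clipped).mean()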


    Intuitively, the clipped surrogate objective encourages small policy updates: the clip function constrains the ratio r_t(θ) to the interval [1 − ϵ, 1 + ϵ], so the objective gives the policy no extra credit for moving the probability of an action far beyond that range. For example, with ϵ = 0.2 and a positive advantage Â_t, the per-step term can never exceed 1.2 · Â_t, however large the ratio becomes. This helps prevent large policy updates that can lead to instability and poor performance.


    During training, the PPO algorithm collects a batch of data from the environment with the current policy, computes the clipped surrogate objective on that batch, and optimizes it for several epochs of stochastic gradient updates to the policy parameters θ. By repeating this collect-and-update cycle, PPO learns a policy that controls the agent's behavior in the environment so as to maximize the expected cumulative reward.
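
    A rough outline of this update loop, reusing the clipped_surrogate_loss sketch above (the rollout tuple and the policy.log_prob interface are assumptions made for illustration, not the API of any particular library):

    def ppo_update(policy, optimizer, rollout, epochs=4, epsilon=0.2):
        # rollout: (states, actions, old_log_probs, advantages) collected from
        # the environment with the previous ("old") policy.
        states, actions, old_log_probs, advantages = rollout
        for _ in range(epochs):
            # Re-evaluate the stored actions under the current parameters.
            new_log_probs = policy.log_prob(states, actions)
            loss = clipped_surrogate_loss(new_log_probs, old_log_probs,
                                          advantages, epsilon)
            optimizer.zero_grad()
            loss.backward()   # gradient of the negated clipped surrogate
            optimizer.step()  # one stochastic gradient step on the parameters

    Full implementations typically also add a value-function loss and an entropy bonus to the objective; those are omitted here to keep the sketch focused on the clipped term.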

