    Comparison of TRPO and PPO in Reinforcement Learning
    CS & ML Basic · 2023. 2. 14. 16:33

    TRPO

    In TRPO, the policy update is performed by solving the following constrained optimization problem:

    maximize L(θ, θ_old) subject to KL(π_θ_old || π_θ) <= δ

    where:

    • L(θ, θ_old) is the surrogate objective: the expected return under the new policy π_θ, estimated from trajectories sampled with the old policy π_θ_old.
    • θ and θ_old are the current and old policy parameters, respectively.
    • KL(π_θ_old || π_θ) is the Kullback-Leibler divergence between the old policy π_θ_old and the new policy π_θ.
    • δ is a hyperparameter that specifies the maximum allowable change in the policy.

     

    The objective function L(θ, θ_old) can be approximated using a Monte Carlo estimate:

    L(θ, θ_old) = E[ r(τ) * exp( sum_t ( log π_θ(a_t | s_t) - log π_θ_old(a_t | s_t) ) ) ],

    where:

    • r(τ) is the total reward obtained by executing the trajectory τ under the old policy.
    • a_t is the action taken at time step t.
    • s_t is the state at time step t.
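
    In code, this estimate can be computed directly from logged rollout data. The sketch below is a minimal NumPy version, assuming we have per-step log-probabilities of the sampled actions under both policies and the total reward of each trajectory (the array names and shapes are illustrative, not from any particular library):

    import numpy as np

    def trpo_surrogate_estimate(logp_new, logp_old, returns):
        """Monte Carlo estimate of L(theta, theta_old).

        logp_new, logp_old: arrays of shape (num_trajectories, T) holding
            log pi_theta(a_t | s_t) and log pi_theta_old(a_t | s_t) for the
            actions that were actually taken.
        returns: array of shape (num_trajectories,) with the total reward
            r(tau) of each trajectory, sampled under the old policy.
        """
        # Per-trajectory importance weight: exp(sum_t (log pi_new - log pi_old)).
        log_ratio = np.sum(logp_new - logp_old, axis=1)
        weights = np.exp(log_ratio)
        # Average the importance-weighted returns over the sampled trajectories.
        return np.mean(weights * returns)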

     

    The KL divergence between the old and new policies can also be approximated using a Monte Carlo estimate:

    KL(π_θ_old || π_θ) = E[ log( π_θ_old(a_t | s_t) / π_θ(a_t | s_t) ) ],

    where the expectation is taken with respect to the old policy.
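
    A sample-based version of this estimate is a one-liner; the sketch below reuses the per-step log-probability arrays from the previous snippet (again, the names are only illustrative):

    import numpy as np

    def kl_old_new_estimate(logp_old, logp_new):
        """Monte Carlo estimate of KL(pi_old || pi_new).

        Both arrays hold log-probabilities of the actions actually taken,
        with states and actions sampled from the old policy, so the mean
        approximates E_old[log pi_old(a|s) - log pi_new(a|s)].
        """
        return np.mean(logp_old - logp_new)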

     

    The trust region constraint KL(π_θ_old || π_θ) <= δ ensures that the new policy is not too far from the old policy. The trust region is defined by the hyperparameter δ, which specifies the maximum allowable change in the policy.

     

    The optimization problem is typically solved with the conjugate gradient method, which needs only the gradient of the objective with respect to the policy parameters and Hessian-vector products of the KL term, never the full Hessian.
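
    A minimal sketch of the conjugate gradient step is below. It assumes we are handed a function that returns the product of the KL Hessian with an arbitrary vector; in a real TRPO implementation that product comes from automatic differentiation rather than from an explicit matrix, and the small matrix here is only a stand-in for illustration:

    import numpy as np

    def conjugate_gradient(hvp, g, iters=10, tol=1e-10):
        """Approximately solve H x = g using only Hessian-vector products.

        hvp: callable v -> H @ v, where H is the Hessian of the KL term
             (obtained via automatic differentiation in practice).
        g:   gradient of the surrogate objective w.r.t. the policy parameters.
        """
        x = np.zeros_like(g)
        r = g.copy()              # residual g - H x, with x initialized to 0
        p = r.copy()              # current search direction
        rs_old = r @ r
        for _ in range(iters):
            Hp = hvp(p)
            alpha = rs_old / (p @ Hp)
            x += alpha * p
            r -= alpha * Hp
            rs_new = r @ r
            if rs_new < tol:
                break
            p = r + (rs_new / rs_old) * p
            rs_old = rs_new
        return x

    # Toy stand-in for the KL Hessian (symmetric positive definite) and a gradient.
    H = np.array([[3.0, 1.0], [1.0, 2.0]])
    g = np.array([1.0, 1.0])
    print(conjugate_gradient(lambda v: H @ v, g))   # close to np.linalg.solve(H, g)

    The direction returned by conjugate gradient is then scaled so that the resulting update satisfies the KL constraint; the original TRPO algorithm additionally verifies this with a backtracking line search.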

     

    PPO

    In PPO, the policy update is performed by maximizing a clipped surrogate objective function:

    L(θ) = E[ min( r_t(θ) * A_t, clip(r_t(θ), 1 - ε, 1 + ε) * A_t ) ]

    where:

    • r_t(θ) = π_θ(a_t | s_t) / π_θ_old(a_t | s_t) is the likelihood ratio between the new policy π_θ and the old policy π_θ_old.
    • A_t is the advantage estimate at time step t: how much better taking action a_t in state s_t is than the policy's average action in that state. It is computed from data collected with the old policy π_θ_old.
    • ε is a hyperparameter that controls the size of the effective trust region: it sets how far the likelihood ratio may move from 1 before it is clipped (a common choice is ε = 0.2).
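
    The clipped objective itself is easy to write down in code. The sketch below is a minimal NumPy version operating on per-step quantities, assuming log-probabilities under both policies and precomputed advantage estimates are available (names are illustrative; ε = 0.2 is used only as a common default):

    import numpy as np

    def ppo_clipped_objective(logp_new, logp_old, advantages, eps=0.2):
        """Clipped surrogate objective L(theta), averaged over sampled steps.

        logp_new, logp_old: per-step log pi_theta(a_t | s_t) and
            log pi_theta_old(a_t | s_t) for the actions actually taken.
        advantages: per-step advantage estimates A_t.
        eps: clipping parameter epsilon.
        """
        ratio = np.exp(logp_new - logp_old)                    # r_t(theta)
        unclipped = ratio * advantages
        clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps) * advantages
        # Element-wise minimum, then average; this value is maximized
        # (or its negative minimized) by gradient ascent on theta.
        return np.mean(np.minimum(unclipped, clipped))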

    The advantage function A_t can be approximated using the formula:

    A_t = Q(s_t, a_t) - V(s_t)

     

    where Q(s_t, a_t) is the state-action value function and V(s_t) is the state value function. These functions can be estimated using a neural network.
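
    One simple way to turn this into code is to use the empirical discounted return from each step as the estimate of Q(s_t, a_t) and subtract the critic's value prediction. The discount factor γ below is an added assumption, not something stated above, and many implementations use generalized advantage estimation instead of this plain Monte Carlo estimate:

    import numpy as np

    def advantage_estimates(rewards, values, gamma=0.99):
        """A_t = Q(s_t, a_t) - V(s_t) for a single trajectory, with Q
        approximated by the observed discounted return from step t onward
        and V given by a learned value network (critic).

        rewards: per-step rewards r_t.
        values:  critic predictions V(s_t) for the same steps.
        gamma:   discount factor (an assumption of this sketch).
        """
        returns = np.zeros(len(rewards))
        running = 0.0
        # Accumulate discounted returns backwards: G_t = r_t + gamma * G_{t+1}.
        for t in reversed(range(len(rewards))):
            running = rewards[t] + gamma * running
            returns[t] = running
        return returns - np.asarray(values)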

     

     

    Compared to TRPO, PPO is generally considered easier to implement and tune, as it requires fewer hyperparameters and is less sensitive to their values. PPO also tends to be more sample-efficient in practice, often reaching comparable performance with fewer environment interactions.

     
