All Posts
-
Proximal Policy Optimization with Generalized Advantage Estimation (PPO2) | CS & ML Basic | 2023. 2. 14. 17:04
PPO2 (Proximal Policy Optimization with Generalized Advantage Estimation) is an extension of PPO that combines the PPO algorithm with Generalized Advantage Estimation (GAE), which is a method for estimating the advantage function. The main difference between PPO and PPO2 is the way they estimate the advantage function. In PPO, the advantage function is estimated using a single-step estimate, whi..
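As a rough illustration of what GAE computes, here is a minimal NumPy sketch (not code from the post): it accumulates discounted one-step TD errors backwards through a trajectory. It assumes `values` holds one extra bootstrap entry for the state after the final step, and episode-termination masking is omitted for brevity.

```python
import numpy as np

def gae(rewards, values, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation over a single trajectory.

    rewards: shape (T,); values: shape (T + 1,), where the last entry is the
    bootstrap value of the state reached after the final step.
    """
    T = len(rewards)
    advantages = np.zeros(T)
    running = 0.0
    for t in reversed(range(T)):
        # One-step TD error: delta_t = r_t + gamma * V(s_{t+1}) - V(s_t)
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        # GAE: A_t = delta_t + gamma * lam * A_{t+1}
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages
```

Setting `lam=0` recovers the single-step estimate mentioned above, while `lam=1` gives the full Monte-Carlo-style return minus the value baseline.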
-
Advantage Function in RL | CS & ML Basic | 2023. 2. 14. 16:02
The advantage function in reinforcement learning is a measure of how much better an action is compared to other actions in a given state. It is a critical component in many reinforcement learning algorithms, including PPO. Mathematically, the advantage function is defined as the difference between the expected reward of taking a specific action in a given state and the expected reward of followi..
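A minimal numerical illustration of that definition, assuming action-value estimates Q(s, a) and a state-value estimate V(s) are already available (the numbers are made up for the example):

```python
import numpy as np

def advantage(q_values, state_value):
    """Advantage of each action in one state: A(s, a) = Q(s, a) - V(s)."""
    return np.asarray(q_values) - state_value

# In a state with V(s) = 1.0, an action with Q(s, a) = 1.5 has advantage +0.5,
# i.e. it is better than the policy's average behaviour in that state.
print(advantage([1.5, 0.8, 1.0], 1.0))  # [ 0.5 -0.2  0. ]
```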
-
Proximal Policy Optimization (PPO) Algorithm | CS & ML Basic | 2023. 2. 14. 16:02
The objective of the Proximal Policy Optimization (PPO) algorithm is to train a policy function that can control an agent's behavior in a given environment, such that it maximizes the expected cumulative reward over time. More formally, we can define the objective of PPO as follows: $$J(\theta) = \mathbb{E}_{\pi_{\theta}}\left[\sum_{t=0}^{\infty}\gamma^{t}r_{t}\right]$$ where \(J(\theta)\) is the ..
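For context, the surrogate PPO actually optimises in practice is the clipped ratio objective built on top of this return. Below is a minimal PyTorch-style sketch of that loss, assuming per-sample log-probabilities and advantages are already computed; the function and argument names are illustrative, not taken from the post.

```python
import torch

def ppo_clip_loss(log_probs_new, log_probs_old, advantages, clip_eps=0.2):
    """Clipped surrogate objective of PPO (returned as a negative mean, i.e. a loss to minimise)."""
    # Probability ratio pi_theta(a|s) / pi_theta_old(a|s)
    ratio = torch.exp(log_probs_new - log_probs_old)
    unclipped = ratio * advantages
    # Clipping keeps the policy update close to the old policy
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    return -torch.min(unclipped, clipped).mean()
```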
-
Behind ChatGPT: Reinforcement Learning from Human Feedback (RLHF) | Paper Review/NLP | 2023. 1. 16. 21:38
This post explains Reinforcement Learning from Human Feedback (RLHF), the technique that played the biggest role in letting a 1.3B-parameter model match the performance of a 175B model when training ChatGPT. Details will be covered further in the InstructGPT paper review. Concept: Pre-trained language models (PLMs) such as GPT have excellent generation ability and strong performance on down-stream tasks, but it is hard to say that these models actually produce natural, context-appropriate, human-like responses. For example, to generate human-like responses, one could minimize GPT's autoregressive objective, or simply..
-
[Paper Review] Editing Models with Task Arithmetic [arXiv, Dec 8, 2022] | Paper Review/Computer Vision | 2023. 1. 9. 11:46
This work proposes task vectors and uses them, when applying a PLM to a down-stream task, to edit the model for that task, mitigate biases, control unwanted behavior, or update the model with new information. [Github] - https://github.com/mlfoundations/task_vectors [Brief Summary created by ChatGPT] Task vectors can modify the behavior of pre-trained neural networks by specifying a direction in the weight space of the model. Task vectors are creat..
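Based only on the summary above, a task vector is the element-wise difference between fine-tuned and pre-trained weights, which can then be added (to learn a task) or negated (to forget it, e.g. for bias mitigation). A hypothetical PyTorch sketch, with the scaling coefficient named `alpha` here purely for illustration:

```python
import torch

def task_vector(pretrained_state, finetuned_state):
    """Task vector: fine-tuned weights minus pre-trained weights, per parameter tensor."""
    return {k: finetuned_state[k] - pretrained_state[k] for k in pretrained_state}

def apply_task_vector(pretrained_state, vector, alpha=1.0):
    """Edit a model by moving its weights along a task vector.

    alpha > 0 adds the task; alpha < 0 negates it (forgetting / unwanted-behavior control).
    """
    return {k: pretrained_state[k] + alpha * vector[k] for k in pretrained_state}
```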
-
[Paper Review] Switch-BERT: Learning to Model Multimodal Interactions by Switching Attention and Input (ECCV 2022) | Paper Review/Computer Vision | 2022. 12. 21. 11:17
Two paradigms of Transformer-based architecture for intra- and inter-modal interactions 1) Single-Encoder - encodes every modality jointly with a single-stream encoder. => Can achieve higher accuracy than the dual-encoder approach, but its time complexity is too large to be practical for real-world applications. 2) Dual-Encoder - extracts a representation for each modality with a separate encoder, then uses a layer such as cross-attention to combine each modality's fea..
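As a loose sketch of the dual-encoder paradigm described above (not the Switch-BERT architecture itself), the following hypothetical PyTorch module encodes each modality with its own encoder and then fuses them with a single cross-attention layer; all layer sizes are arbitrary.

```python
import torch.nn as nn

class DualEncoderFusion(nn.Module):
    """Illustrative dual-encoder: per-modality self-attention, then cross-attention fusion."""
    def __init__(self, dim=256, heads=4):
        super().__init__()
        layer = lambda: nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.text_encoder = nn.TransformerEncoder(layer(), num_layers=2)
        self.image_encoder = nn.TransformerEncoder(layer(), num_layers=2)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, text_emb, image_emb):
        t = self.text_encoder(text_emb)    # (B, L_t, dim): intra-modal interaction, text only
        v = self.image_encoder(image_emb)  # (B, L_v, dim): intra-modal interaction, image only
        # Inter-modal interaction: text tokens attend to image tokens
        fused, _ = self.cross_attn(query=t, key=v, value=v)
        return fused
```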
-
[Paper Review] Conditioned and composed image retrieval combining and partially fine-tuning CLIP-based features (CVPR 2022 Workshop) | Paper Review/Computer Vision | 2022. 12. 20. 20:59
Content-Based Image Retrieval with User Feedback Information Content-Based Image Retrieval (CBIR) is the problem of taking an image as a query and computing distances between images. Whereas earlier work simply took an image as the query, recent studies also exploit user feedback information for CBIR. For example, based on the first retrieval results, the user can give feedback asking for an image that is similar to a returned image but contains the Korn logo, and the system then retrieves images that satisfy those conditions. This work restricts the user feedback to text information only...
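To make the setting concrete, here is a hypothetical sketch of composed retrieval with pre-extracted CLIP-style features: the reference-image feature and the feedback-text feature are combined into one query vector (a simple weighted sum here, purely for illustration; the paper studies more elaborate combinations and partial fine-tuning), and gallery images are ranked by cosine similarity.

```python
import torch

def composed_query(image_feat, text_feat, image_weight=0.5):
    """Combine a reference-image feature and a feedback-text feature into one query vector."""
    q = image_weight * image_feat + (1.0 - image_weight) * text_feat
    return q / q.norm(dim=-1, keepdim=True)

def retrieve(query, gallery_feats, top_k=5):
    """Rank gallery images by cosine similarity to the composed query; returns top-k indices."""
    gallery = gallery_feats / gallery_feats.norm(dim=-1, keepdim=True)
    scores = gallery @ query          # (N,) cosine similarities
    return scores.topk(top_k).indices
```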