<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0">
  <channel>
    <title>AI 지식창고</title>
    <link>https://grooms-academy.tistory.com/</link>
    <description>AI paper reviews,
cooking,
and travel notes.</description>
    <language>ko</language>
    <pubDate>Wed, 6 May 2026 14:35:33 +0900</pubDate>
    <generator>TISTORY</generator>
    <ttl>100</ttl>
    <managingEditor>책읽는짱구</managingEditor>
    <image>
      <title>AI 지식창고</title>
      <url>https://tistory1.daumcdn.net/tistory/5808451/attach/c61d2bc81fa64e86bae6aceaa2811c35</url>
      <link>https://grooms-academy.tistory.com</link>
    </image>
    <item>
      <title>Proximal Policy Optimization with Generalized Advantage Estimation (PPO2)</title>
      <link>https://grooms-academy.tistory.com/15</link>
      <description>&lt;p data-ke-size=&quot;size16&quot;&gt;PPO2 (Proximal Policy Optimization with Generalized Advantage Estimation) is an extension of PPO that combines the PPO algorithm with Generalized Advantage Estimation (GAE), which is a method for estimating the advantage function.&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;The main difference between PPO and PPO2 lies in how the advantage function is estimated: PPO uses a single-step estimate, while PPO2 uses the multi-step GAE estimate. Because GAE takes the rewards and values of multiple time steps into account, it can yield a more accurate estimate of the advantage function.&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;div&gt;
&lt;div&gt;
&lt;div&gt;
&lt;div&gt;
&lt;div&gt;
&lt;div&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;PPO2 combines the PPO algorithm with Generalized Advantage Estimation (GAE). GAE is a method for estimating the advantage function, which is used to guide the policy update in PPO2. The advantage function is defined as:&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;$$A_t = R_t + \gamma V(s_{t+1}) - V(s_t)$$&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;where \(R_t\) is the reward at time step \(t\), \(V(s_t)\) is the value function at time step \(t\), and \(\gamma\) is the discount factor.&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;The GAE method uses a multi-step estimate of the advantage function, which takes into account the rewards and values of multiple time steps. The GAE advantage estimate is defined as:&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;$$\hat A_t^{GAE(\gamma, \lambda)} = \sum_{i=0}^{\infty}(\gamma\lambda)^i\delta_{t+i}$$&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;where \(\delta_t = R_t + \gamma V(s_{t+1}) - V(s_t)\) is the TD error, and \(\lambda\) is a hyperparameter that controls the trade-off between bias and variance in the estimate.&lt;/p&gt;
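&lt;p data-ke-size=&quot;size16&quot;&gt;As a concrete illustration, the sketch below computes \(\hat A_t^{GAE}\) for a short rollout with a backward recursion, truncating the infinite sum at the end of the rollout and bootstrapping with the value of the last state. All arrays, lengths and hyperparameter values are made up for the example.&lt;/p&gt;
&lt;pre class=&quot;python&quot; data-ke-language=&quot;python&quot; data-ke-type=&quot;codeblock&quot;&gt;&lt;code&gt;import numpy as np

def gae_advantages(rewards, values, last_value, gamma=0.99, lam=0.95):
    # rewards[t] is R_t and values[t] is V(s_t); last_value bootstraps V(s_T).
    T = len(rewards)
    values = np.append(values, last_value)
    advantages = np.zeros(T)
    running = 0.0
    # Backward recursion: A_t = delta_t + gamma * lam * A_{t+1}
    for t in reversed(range(T)):
        delta = rewards[t] + gamma * values[t + 1] - values[t]  # TD error delta_t
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages

# Toy 5-step rollout with invented rewards and value predictions.
rewards = np.array([1.0, 0.0, 0.5, 0.0, 1.0])
values = np.array([0.8, 0.7, 0.9, 0.6, 0.5])
print(gae_advantages(rewards, values, last_value=0.4))&lt;/code&gt;&lt;/pre&gt;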
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;The PPO2 algorithm uses a clipped surrogate objective function to update the policy, which is defined as:&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;$$L^{CLIP}(\theta) = \hat{\mathbb{E}}_t\Big[\min\Big(r_t(\theta)\hat A_t, \text{clip}(r_t(\theta), 1-\epsilon, 1+\epsilon)\hat A_t\Big)\Big]$$&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;where \(\theta\) is the policy parameters, \(\hat{\mathbb{E}}_t\) is the empirical expectation over a batch of samples, \(r_t(\theta)\) is the ratio between the new and old policies, and \(\epsilon\) is a hyperparameter that controls the size of the clipping region.&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;PPO2 also uses a different method for updating the value function, which is based on the GAE advantage estimate:&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;$$L^{VF}(\theta) = \hat{\mathbb{E}}_t\Big[(V_{\theta}(s_t) - V_t^{GAE})^2\Big]$$&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;where \(V_{\theta}(s_t)\) is the predicted value of state \(s_t\) by the value function, and \(V_t^{GAE}\) is the GAE estimate of the value function at time step \(t\).&lt;/p&gt;
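&lt;p data-ke-size=&quot;size16&quot;&gt;In common implementations the value target is recovered from the GAE estimate as \(V_t^{GAE} = \hat A_t^{GAE} + V_{old}(s_t)\); below is a minimal sketch of the resulting squared-error loss, reusing the gae_advantages helper above (all array names are illustrative).&lt;/p&gt;
&lt;pre class=&quot;python&quot; data-ke-language=&quot;python&quot; data-ke-type=&quot;codeblock&quot;&gt;&lt;code&gt;import numpy as np

def value_loss(pred_values, advantages, old_values):
    # GAE value target: V_t_GAE = A_t_GAE + V_old(s_t)
    targets = advantages + old_values
    # Squared error between the value head's predictions and the GAE targets.
    return np.mean((pred_values - targets) ** 2)&lt;/code&gt;&lt;/pre&gt;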
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;Overall, PPO2 is considered to be a more stable and robust algorithm than PPO, due to its use of the GAE method for estimating the advantage function, and its improved clipping and value function update methods.&lt;/p&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div&gt;
&lt;div&gt;&amp;nbsp;&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;</description>
      <category>CS &amp; ML Basic</category>
      <category>PPO</category>
      <category>PPO2</category>
      <category>RL</category>
      <author>책읽는짱구</author>
      <guid isPermaLink="true">https://grooms-academy.tistory.com/15</guid>
      <comments>https://grooms-academy.tistory.com/15#entry15comment</comments>
      <pubDate>Tue, 14 Feb 2023 17:04:31 +0900</pubDate>
    </item>
    <item>
      <title>Comparison of TRPO and PPO in Reinforcement Learning</title>
      <link>https://grooms-academy.tistory.com/14</link>
      <description>&lt;h3 data-ke-size=&quot;size23&quot;&gt;TRPO&lt;/h3&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;In TRPO, the policy update is performed by solving the following constrained optimization problem:&lt;/p&gt;
&lt;div&gt;
&lt;pre id=&quot;code_1676360806633&quot; class=&quot;bash&quot; data-ke-language=&quot;bash&quot; data-ke-type=&quot;codeblock&quot;&gt;&lt;code&gt;maximize L(&amp;theta;, &amp;theta;_old) subject to KL(&amp;pi;_&amp;theta;_old || &amp;pi;_&amp;theta;) &amp;lt;= &amp;delta;&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;where:&lt;/p&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;L(&amp;theta;, &amp;theta;_old) is the objective function, which is the expected sum of rewards under the new policy &amp;pi;_&amp;theta;, where the expectation is taken with respect to the old policy &amp;pi;_&amp;theta;_old.&lt;/li&gt;
&lt;li&gt;&amp;theta; and &amp;theta;_old are the current and old policy parameters, respectively.&lt;/li&gt;
&lt;li&gt;KL(&amp;pi;_&amp;theta;_old || &amp;pi;_&amp;theta;) is the Kullback-Leibler divergence between the old policy &amp;pi;_&amp;theta;_old and the new policy &amp;pi;_&amp;theta;.&lt;/li&gt;
&lt;li&gt;&amp;delta; is a hyperparameter that specifies the maximum allowable change in the policy.&lt;/li&gt;
&lt;/ul&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;The objective function L(&amp;theta;, &amp;theta;_old) can be approximated using a Monte Carlo estimate:&lt;/p&gt;
&lt;div&gt;
&lt;div&gt;
&lt;pre id=&quot;code_1676360103263&quot; class=&quot;bash&quot; data-ke-language=&quot;bash&quot; data-ke-type=&quot;codeblock&quot;&gt;&lt;code&gt;L(&amp;theta;, &amp;theta;_old) = E[r(&amp;tau;) * exp( sum_t log(&amp;pi;_&amp;theta;(a_t | s_t)) - log(&amp;pi;_&amp;theta;_old(a_t | s_t)))],&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;where:&lt;/p&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;r(&amp;tau;) is the total reward obtained by executing the trajectory &amp;tau; under the old policy.&lt;/li&gt;
&lt;li&gt;a_t is the action taken at time step t.&lt;/li&gt;
&lt;li&gt;s_t is the state at time step t.&lt;/li&gt;
&lt;/ul&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;The KL divergence between the old and new policies can also be approximated using a Monte Carlo estimate:&lt;/p&gt;
&lt;pre id=&quot;code_1676360116074&quot; class=&quot;bash&quot; data-ke-language=&quot;bash&quot; data-ke-type=&quot;codeblock&quot;&gt;&lt;code&gt;KL(&amp;pi;_&amp;theta;_old || &amp;pi;_&amp;theta;) = E[log(&amp;pi;_&amp;theta;_old(a_t | s_t) / &amp;pi;_&amp;theta;(a_t | s_t))]&lt;/code&gt;&lt;/pre&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;where the expectation is taken with respect to the old policy.&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;The trust region constraint KL(&amp;pi;_&amp;theta;_old || &amp;pi;_&amp;theta;) &amp;lt;= &amp;delta; ensures that the new policy is not too far from the old policy. The trust region is defined by the hyperparameter &amp;delta;, which specifies the maximum allowable change in the policy.&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;The optimization problem is typically solved using a conjugate gradient method, which involves computing the gradient of the objective function with respect to the policy parameters, as well as the Hessian-vector product.&lt;/p&gt;
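&lt;p data-ke-size=&quot;size16&quot;&gt;As a rough sketch of that inner solve (not the TRPO authors' implementation), conjugate gradient finds a step direction x with H x = g without ever forming the Hessian H, using only Hessian-vector products. The Hessian-vector product is passed in as a function, and the toy matrix and gradient at the bottom are invented for the example.&lt;/p&gt;
&lt;pre class=&quot;python&quot; data-ke-language=&quot;python&quot; data-ke-type=&quot;codeblock&quot;&gt;&lt;code&gt;import numpy as np

def conjugate_gradient(hvp, g, iters=10, tol=1e-10):
    # Solves H x = g, where hvp(v) returns the Hessian-vector product H v.
    x = np.zeros_like(g)
    r = g.copy()            # residual; x starts at zero, so r = g
    p = r.copy()
    rs_old = r.dot(r)
    for _ in range(iters):
        Hp = hvp(p)
        alpha = rs_old / p.dot(Hp)
        x = x + alpha * p
        r = r - alpha * Hp
        rs_new = r.dot(r)
        if rs_new &amp;lt; tol:
            break
        p = r + (rs_new / rs_old) * p
        rs_old = rs_new
    return x

# Toy example: a fixed symmetric positive-definite H and a gradient g.
H = np.array([[2.0, 0.3], [0.3, 1.0]])
g = np.array([1.0, -0.5])
step = conjugate_gradient(lambda v: H @ v, g)
print(step, H @ step)  # H @ step should be close to g&lt;/code&gt;&lt;/pre&gt;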
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;h3 data-ke-size=&quot;size23&quot;&gt;PPO&lt;/h3&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;In PPO, the policy update is performed by maximizing a clipped surrogate objective function:&lt;/p&gt;
&lt;div&gt;
&lt;div&gt;
&lt;pre id=&quot;code_1676360169307&quot; class=&quot;bash&quot; data-ke-language=&quot;bash&quot; data-ke-type=&quot;codeblock&quot;&gt;&lt;code&gt;L(&amp;theta;) = E[min(r_t(&amp;theta;) * A_t, clip(r_t(&amp;theta;), 1 - &amp;epsilon;, 1 + &amp;epsilon;) * A_t)]​&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div&gt;&amp;nbsp;&lt;/div&gt;
&lt;div&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;where:&lt;/p&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;r_t(&amp;theta;) = &amp;pi;_&amp;theta;(a_t | s_t) / &amp;pi;_&amp;theta;_old(a_t | s_t) is the likelihood ratio between the new policy &amp;pi;_&amp;theta; and the old policy &amp;pi;_&amp;theta;_old.&lt;/li&gt;
&lt;li&gt;A_t is the advantage function, which estimates the advantage of taking action a_t in state s_t under the new policy &amp;pi;_&amp;theta; compared to the old policy &amp;pi;_&amp;theta;_old.&lt;/li&gt;
&lt;li&gt;&amp;epsilon; is a hyperparameter that controls the size of the trust region. It specifies the amount of clipping that is applied to the likelihood ratio.&lt;/li&gt;
&lt;/ul&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;The advantage function A_t can be approximated using the formula:&lt;/p&gt;
&lt;div&gt;
&lt;div&gt;
&lt;pre id=&quot;code_1676360181414&quot; class=&quot;bash&quot; data-ke-language=&quot;bash&quot; data-ke-type=&quot;codeblock&quot;&gt;&lt;code&gt;A_t = Q(s_t, a_t) - V(s_t)&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;where Q(s_t, a_t) is the state-action value function and V(s_t) is the state value function. These functions can be estimated using a neural network.&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;Compared to TRPO, PPO is generally considered to be easier to implement and tune, as it requires fewer hyperparameters and is less sensitive to their values. PPO also tends to be more sample-efficient than TRPO, as it achieves similar performance with fewer samples.&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;</description>
      <category>CS &amp; ML Basic</category>
      <category>PPO</category>
      <category>RL</category>
      <category>TRPO</category>
      <author>책읽는짱구</author>
      <guid isPermaLink="true">https://grooms-academy.tistory.com/14</guid>
      <comments>https://grooms-academy.tistory.com/14#entry14comment</comments>
      <pubDate>Tue, 14 Feb 2023 16:33:32 +0900</pubDate>
    </item>
    <item>
      <title>Advantage Function in RL</title>
      <link>https://grooms-academy.tistory.com/13</link>
      <description>&lt;p data-ke-size=&quot;size16&quot;&gt;The advantage function in reinforcement learning is a measure of how much better an action is compared to other actions in a given state. It is a critical component in many reinforcement learning algorithms, including PPO.&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;Mathematically, the advantage function is defined as the difference between the expected reward of taking a specific action in a given state and the expected reward of following the current policy in that same state:&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;$$A(s,a) = Q(s,a) - V(s)$$&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;where \(A(s,a)\) is the advantage of taking action \(a\) in state \(s\), \(Q(s,a)\) is the expected reward of taking action \(a\) in state \(s\), and \(V(s)\) is the expected reward of following the current policy in state \(s\).&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;Intuitively, the advantage function tells the agent how much better it is to take a specific action compared to following its current policy. If the advantage is positive, the agent should take that action, and if the advantage is negative, the agent should avoid taking that action. The advantage function helps the agent to learn which actions are better or worse in each state, allowing it to improve its policy over time.&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;In the PPO algorithm, the advantage function is estimated using a value function estimator such as a critic neural network. The estimated advantage is then used in the surrogate objective function to calculate the policy gradient, which is used to update the policy parameters.&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;div&gt;
&lt;div&gt;
&lt;div&gt;
&lt;div&gt;
&lt;div&gt;
&lt;div&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;For example, let's say you're playing a game where you're controlling a robot to collect coins in a 2D environment. In a given state, there are three possible actions: move left, move right, or jump. The advantage function would help you to determine which action is the best to take in that state. For instance, if moving right has a higher expected reward than the other actions, the advantage function will indicate that taking this action is the best choice, and you should choose to move right to maximize your reward.&lt;/p&gt;
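&lt;p data-ke-size=&quot;size16&quot;&gt;A tiny numerical version of that situation, with invented action values and action probabilities (a real agent would estimate these with its critic and policy networks):&lt;/p&gt;
&lt;pre class=&quot;python&quot; data-ke-language=&quot;python&quot; data-ke-type=&quot;codeblock&quot;&gt;&lt;code&gt;import numpy as np

# Invented action values Q(s, a) for the three actions: move left, move right, jump.
q = np.array([1.0, 2.5, 0.5])
# Invented probabilities the current policy assigns to those actions in state s.
pi = np.array([0.2, 0.5, 0.3])

v = np.dot(pi, q)      # V(s): expected return of following the current policy
advantage = q - v      # A(s, a) = Q(s, a) - V(s), one value per action
print(v, advantage)    # 'move right' gets the only positive advantage&lt;/code&gt;&lt;/pre&gt;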
&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;</description>
      <category>CS &amp; ML Basic</category>
      <category>ChatGPT</category>
      <category>InstructGPT</category>
      <category>reinforcement learning</category>
      <category>RL</category>
      <author>책읽는짱구</author>
      <guid isPermaLink="true">https://grooms-academy.tistory.com/13</guid>
      <comments>https://grooms-academy.tistory.com/13#entry13comment</comments>
      <pubDate>Tue, 14 Feb 2023 16:02:26 +0900</pubDate>
    </item>
    <item>
      <title>Proximal Policy Optimization (PPO) Algorithm</title>
      <link>https://grooms-academy.tistory.com/12</link>
      <description>&lt;div&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;The objective of the Proximal Policy Optimization (PPO) algorithm is to train a policy function that can control an agent's behavior in a given environment, such that it maximizes the expected cumulative reward over time.&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;More formally, we can define the objective of PPO as follows:&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;$$J(\theta) = \mathbb{E}_{\pi_{\theta}}\left[\sum_{t=0}^{\infty}\gamma^{t}r_{t}\right]$$&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;where \(J(\theta)\) is the expected cumulative reward, \(\pi_\theta\) is the policy function parameterized by \(\theta\), \(t\) is the time step, \(r_t\) is the reward received at time \(t\), and \(\gamma\) is a discount factor that balances immediate and future rewards.&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;To train the policy function, PPO uses a surrogate objective function that approximates the true objective function and is easier to optimize. The surrogate objective function is defined as:&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;$$L^{CLIP}(\theta) = \mathbb{E}_{t}\left[\min\left(r_t(\theta)\hat{A}_t, \text{clip}(r_t(\theta), 1-\epsilon, 1+\epsilon)\hat{A}_t\right)\right]$$&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;where \(L^{CLIP}(\theta)\) is the clipped surrogate objective function, \(r_t(\theta)\) is the ratio between the probabilities of the current policy and the old policy for taking the action that was actually taken at time \(t\), \(\hat{A}_t\) is an estimator of the advantage function at time \(t\), and \(\epsilon\) is a hyperparameter that controls the size of the policy update.&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;Intuitively, the clipped surrogate objective function encourages small policy updates by clipping the ratio \(r_t(\theta)\) to the interval \([1-\epsilon, 1+\epsilon]\) before it multiplies the advantage estimator \(\hat{A}_t\). This helps prevent large policy updates that can lead to instability and poor performance.&lt;/p&gt;
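&lt;p data-ke-size=&quot;size16&quot;&gt;To make the clipping concrete, here is a minimal sketch of \(L^{CLIP}\) over a batch; the probability ratios and advantage estimates are invented inputs, and a real implementation would compute the ratio from log-probabilities and differentiate through the loss.&lt;/p&gt;
&lt;pre class=&quot;python&quot; data-ke-language=&quot;python&quot; data-ke-type=&quot;codeblock&quot;&gt;&lt;code&gt;import numpy as np

def clipped_surrogate(ratio, adv, eps=0.2):
    # L_CLIP = mean of min(r_t * A_t, clip(r_t, 1 - eps, 1 + eps) * A_t)
    unclipped = ratio * adv
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps) * adv
    return np.mean(np.minimum(unclipped, clipped))

# Invented probability ratios and advantage estimates for four samples.
ratio = np.array([0.8, 1.0, 1.3, 2.0])
adv = np.array([1.0, -0.5, 2.0, 1.5])
print(clipped_surrogate(ratio, adv))
# The last ratio (2.0) is clipped to 1.2, so pushing the policy further
# toward that action earns no extra objective value.&lt;/code&gt;&lt;/pre&gt;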
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;During training, the PPO algorithm collects a batch of data from the environment, computes the clipped surrogate objective function, and optimizes it using stochastic gradient descent to update the policy parameters \(\theta\). By repeating this process, PPO learns a policy that can control the agent's behavior in the environment to maximize the expected cumulative reward.&lt;/p&gt;
&lt;/div&gt;
&lt;div&gt;
&lt;div&gt;&amp;nbsp;&lt;/div&gt;
&lt;/div&gt;</description>
      <category>CS &amp; ML Basic</category>
      <category>Algorithm</category>
      <category>ChatGPT</category>
      <category>InstructGPT</category>
      <category>PPO</category>
      <category>Proximal Policy Optimization</category>
      <category>RL</category>
      <author>책읽는짱구</author>
      <guid isPermaLink="true">https://grooms-academy.tistory.com/12</guid>
      <comments>https://grooms-academy.tistory.com/12#entry12comment</comments>
      <pubDate>Tue, 14 Feb 2023 16:02:14 +0900</pubDate>
    </item>
    <item>
      <title>Behind ChatGPT: Reinforcement Learning from Human Feedback (RLHF)</title>
      <link>https://grooms-academy.tistory.com/10</link>
      <description>&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;This post explains Reinforcement Learning from Human Feedback (RLHF), the ingredient that played the biggest role in letting a 1.3B model match the performance of a 175B model when training ChatGPT. Details will be covered further in the InstructGPT paper review.&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;h3 data-ke-size=&quot;size23&quot;&gt;&lt;b&gt;Concept&lt;/b&gt;&lt;/h3&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;Pre-trained language models (PLMs) such as GPT have impressive generation ability and down-stream task performance, but it is hard to say that these models really produce natural, context-appropriate, human-like responses. For instance, simply minimizing GPT's autoregressive objective, or maximizing the overlap between generated text and human reference text, is not enough to produce human-like responses. Reinforcement Learning from Human Feedback (RLHF), proposed to address this, uses RL to optimize the LM from human feedback: it trains the output of a PLM learned from a general corpus to move closer to what we (humans) actually want.&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;h3 data-ke-size=&quot;size23&quot;&gt;&lt;b&gt;Method&lt;/b&gt;&lt;/h3&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;RLHF proceeds through the following three stages.&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;1) Pre-training an LM&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;2) Collecting demonstration data and training a reward model&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;3) Fine-tuning the LM with the reward model under an RL objective&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;h4 data-ke-size=&quot;size20&quot;&gt;&lt;b&gt;Step-1) Pre-training LM&lt;/b&gt;&lt;/h4&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;- An LM pre-trained on large-scale training data, such as GPT-3.&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;- Optionally, the LM used for RLHF can be a PLM that has been further fine-tuned on additional text data. (For ChatGPT, the PLM is additionally fine-tuned on a human-written demonstration dataset.)&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;h4 data-ke-size=&quot;size20&quot;&gt;&lt;b&gt;Step-2) Training Reward model with Human Feedback Dataset&lt;/b&gt;&lt;/h4&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-origin-width=&quot;647&quot; data-origin-height=&quot;475&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/btpXln/btrWllGq6Ir/HvS4UTILAQcmZXi3vyKyw1/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/btpXln/btrWllGq6Ir/HvS4UTILAQcmZXi3vyKyw1/img.png&quot; data-alt=&quot;Training the reward model from generated outputs. [Image Reference: https://huggingface.co/blog/rlhf]&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/btpXln/btrWllGq6Ir/HvS4UTILAQcmZXi3vyKyw1/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FbtpXln%2FbtrWllGq6Ir%2FHvS4UTILAQcmZXi3vyKyw1%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;471&quot; height=&quot;346&quot; data-origin-width=&quot;647&quot; data-origin-height=&quot;475&quot;/&gt;&lt;/span&gt;&lt;figcaption&gt;Training the reward model from generated outputs. [Image Reference: https://huggingface.co/blog/rlhf]&lt;/figcaption&gt;
&lt;/figure&gt;
&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;In Step 2, a reward model is trained using the LM from Step 1. The Step-1 LM generates k outputs for the same query, and a human scores them by ranking which outputs they judge to be better. For instance, given &amp;lt;query, output_a, output_b, output_c&amp;gt;, the outputs are ranked by quality, e.g. output_b &amp;gt; output_a &amp;gt; output_c (a relative ranking).&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;The human-ranked data are then used to train the reward model. Given &amp;lt;query, generated_output&amp;gt; as input, the trained reward model assigns the model-generated generated_output a reward that mimics the human scoring.&lt;/p&gt;
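&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;As a rough sketch (the exact loss is given in the InstructGPT paper), such rankings are typically turned into a pairwise training signal: the reward model should score the preferred output of each ranked pair higher. The scores below are invented scalar outputs of a hypothetical reward model.&lt;/p&gt;
&lt;pre class=&quot;python&quot; data-ke-language=&quot;python&quot; data-ke-type=&quot;codeblock&quot;&gt;&lt;code&gt;import numpy as np

def pairwise_ranking_loss(score_preferred, score_rejected):
    # -log sigmoid(r(x, y_preferred) - r(x, y_rejected)), averaged over pairs.
    diff = score_preferred - score_rejected
    return np.mean(np.log1p(np.exp(-diff)))

# Invented reward-model scores for the ranking output_b &amp;gt; output_a &amp;gt; output_c.
score_b, score_a, score_c = 1.4, 0.9, -0.3
preferred = np.array([score_b, score_b, score_a])  # higher-ranked member of each pair
rejected = np.array([score_a, score_c, score_c])   # lower-ranked member of each pair
print(pairwise_ranking_loss(preferred, rejected))&lt;/code&gt;&lt;/pre&gt;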
&lt;h4 data-ke-size=&quot;size20&quot;&gt;&lt;b&gt;Step-3) Fine-tuning the LM with Reward Model and RL objective&lt;/b&gt;&lt;/h4&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-origin-width=&quot;674&quot; data-origin-height=&quot;558&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/biKpDP/btrWdIC9hrH/Ob6s3RK7Gt1KD2tFsFWSL1/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/biKpDP/btrWdIC9hrH/Ob6s3RK7Gt1KD2tFsFWSL1/img.png&quot; data-alt=&quot;[Image Reference: https://huggingface.co/blog/rlhf]&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/biKpDP/btrWdIC9hrH/Ob6s3RK7Gt1KD2tFsFWSL1/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FbiKpDP%2FbtrWdIC9hrH%2FOb6s3RK7Gt1KD2tFsFWSL1%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;621&quot; height=&quot;514&quot; data-origin-width=&quot;674&quot; data-origin-height=&quot;558&quot;/&gt;&lt;/span&gt;&lt;figcaption&gt;[Image Reference: https://huggingface.co/blog/rlhf]&lt;/figcaption&gt;
&lt;/figure&gt;
&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;Before explaining this step, we first define some notation.&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;b&gt;policy&lt;/b&gt;: defined as the LM, which takes a prompt as input and returns a text sequence.&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;b&gt;action space: &lt;/b&gt;all tokens the LM can choose from (the vocabulary, of size vocab_size).&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;b&gt;observation space: &lt;/b&gt;all possible input token sequences (vocab_size^{number of input tokens}).&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;b&gt;reward function: &lt;/b&gt;defined using the reward model together with a penalty on the policy shift.&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;Step 3 proceeds as follows.&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;1) Initialize the policy (&amp;pi;) with a GPT-3-like transformer.&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;2) Generate a response to a query with the policy, and obtain a reward score from the trained reward model.&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;3) Update the policy using the obtained reward score (the policy is updated with the PPO algorithm).&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;In RL training, the early policy's predictions can be noisy, so the reward may not stay within a valid range. To prevent this, a penalty based on a KL term is added to the reward function, as shown below.&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;$$R(x,y)=r(x,y)-\beta \log[\frac{\pi^{RL}(y|x)}{\pi^{SFT}(y|x)}]$$&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;To prevent the policy being optimized, \(\pi^{RL}(y|x)\), from producing overly noisy predictions early on, a term is added that keeps \(KL(\pi^{RL}(y|x), \pi^{SFT}(y|x))\) small, where \(\pi^{SFT}(y|x)\) is the model fine-tuned on the supervised dataset. In this way, the initially fine-tuned model can act as a guide for the policy.&lt;/p&gt;
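&lt;p data-ke-size=&quot;size16&quot;&gt;A minimal sketch of that penalized reward for a single generated sequence, using per-token log-probabilities under the RL policy and the SFT model; the arrays and the value of \(\beta\) are invented for the example.&lt;/p&gt;
&lt;pre class=&quot;python&quot; data-ke-language=&quot;python&quot; data-ke-type=&quot;codeblock&quot;&gt;&lt;code&gt;import numpy as np

def penalized_reward(rm_reward, logp_rl, logp_sft, beta=0.02):
    # R(x, y) = r(x, y) - beta * log( pi_RL(y|x) / pi_SFT(y|x) )
    # With per-token log-probs, the sequence log-ratio is the sum of the differences.
    log_ratio = np.sum(logp_rl - logp_sft)
    return rm_reward - beta * log_ratio

# Invented reward-model score and per-token log-probs of a 4-token output.
logp_rl = np.array([-1.2, -0.4, -2.0, -0.8])
logp_sft = np.array([-1.5, -0.6, -1.8, -1.1])
print(penalized_reward(rm_reward=0.9, logp_rl=logp_rl, logp_sft=logp_sft))&lt;/code&gt;&lt;/pre&gt;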
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;[References]&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;[1] &lt;a href=&quot;https://wandb.ai/ayush-thakur/RLHF/reports/Understanding-Reinforcement-Learning-from-Human-Feedback-RLHF-Part-1--VmlldzoyODk5MTIx&quot; target=&quot;_blank&quot; rel=&quot;noopener&quot;&gt;https://wandb.ai/ayush-thakur/RLHF/reports/Understanding-Reinforcement-Learning-from-Human-Feedback-RLHF-Part-1--VmlldzoyODk5MTIx&lt;/a&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;[2] &lt;a href=&quot;https://huggingface.co/blog/rlhf&quot; target=&quot;_blank&quot; rel=&quot;noopener&quot;&gt;https://huggingface.co/blog/rlhf&lt;/a&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;[3] &lt;a href=&quot;https://arxiv.org/abs/2203.02155&quot; target=&quot;_blank&quot; rel=&quot;noopener&quot;&gt;https://arxiv.org/abs/2203.02155&lt;/a&gt;&lt;/p&gt;</description>
      <category>Paper Review/NLP</category>
      <author>책읽는짱구</author>
      <guid isPermaLink="true">https://grooms-academy.tistory.com/10</guid>
      <comments>https://grooms-academy.tistory.com/10#entry10comment</comments>
      <pubDate>Mon, 16 Jan 2023 21:38:31 +0900</pubDate>
    </item>
    <item>
      <title>[Paper Review] Editing Models with Task Arithmetic [arXiv Dec 8, 2022]</title>
      <link>https://grooms-academy.tistory.com/9</link>
      <description>&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;b&gt;This work proposes task vectors, which are used when applying a PLM to down-stream tasks to edit the model for a task, mitigate biases, control unwanted behavior, or update the model with new information.&lt;/b&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;b&gt;[Github]&lt;/b&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;b&gt;- &lt;a href=&quot;https://github.com/mlfoundations/task_vectors&quot; target=&quot;_blank&quot; rel=&quot;noopener&quot;&gt;https://github.com/mlfoundations/task_vectors&lt;/a&gt;&lt;/b&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;b&gt;[Brief Summary created by ChatGPT]&lt;/b&gt;&lt;/p&gt;
&lt;ol style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;Task vectors can modify the behavior of pre-trained neural networks by specifying a direction in the weight space of the model.&lt;/li&gt;
&lt;li&gt;Task vectors are created by subtracting the weights of a pre-trained model from the weights of the same model after fine-tuning on a task.&lt;/li&gt;
&lt;li&gt;Task vectors can be modified and combined through arithmetic operations, allowing the model's behavior to be steered in certain directions.&lt;/li&gt;
&lt;li&gt;Task vectors can be used to improve performance on multiple tasks at once, and can even improve performance on a fourth task when the tasks are linked by an analogy relationship and task vectors from three of the tasks are combined.&lt;/li&gt;
&lt;/ol&gt;
&lt;h4 data-ke-size=&quot;size20&quot;&gt;&lt;b&gt;Forgetting via negation&lt;/b&gt;&lt;/h4&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;Used to forget information learned on a specific task. For example, an LM trained on toxic data is likely to generate toxicity-related keywords; negating the task vector lowers the probability of generating toxic text. In other words, it can be used to reduce undesirable behaviors.&lt;/p&gt;
&lt;h4 data-ke-size=&quot;size20&quot;&gt;&lt;b&gt;Learning via addition&lt;/b&gt;&lt;/h4&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;Adding task vectors together can also build better multi-task models or improve performance on a single task. This approach has the advantage that knowledge can be transferred from in-house models or publicly released PLM-based fine-tuned models to boost performance on a target task.&lt;/p&gt;
&lt;h4 data-ke-size=&quot;size20&quot;&gt;&lt;b&gt;Task analogies&lt;/b&gt;&lt;/h4&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;Finally, as in &quot;A is to B as C is to D&quot;, the first three task vectors can be used to improve performance on a fourth task with little or no training data (domain generalization). Using task vectors is also simple, fast, and effective, with no extra cost at inference time in terms of memory or compute.&lt;/p&gt;
&lt;h3 data-ke-size=&quot;size23&quot;&gt;&lt;b&gt;Task Vectors&lt;/b&gt;&lt;/h3&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-origin-width=&quot;1258&quot; data-origin-height=&quot;362&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/do6UQb/btrVLu5O89d/SxzvTayTtEmN8mvKsyynj1/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/do6UQb/btrVLu5O89d/SxzvTayTtEmN8mvKsyynj1/img.png&quot; data-alt=&quot;Task Vectors, Forgetting via negation, Learning via addition, and Task analogies.&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/do6UQb/btrVLu5O89d/SxzvTayTtEmN8mvKsyynj1/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2Fdo6UQb%2FbtrVLu5O89d%2FSxzvTayTtEmN8mvKsyynj1%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;1258&quot; height=&quot;362&quot; data-origin-width=&quot;1258&quot; data-origin-height=&quot;362&quot;/&gt;&lt;/span&gt;&lt;figcaption&gt;Task Vectors, Forgetting via negation, Learning via addition, and Task analogies.&lt;/figcaption&gt;
&lt;/figure&gt;
&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;If $\theta_{pre}$ and $\theta_{ft}^t$ are the parameters of the pre-trained model and of the model fine-tuned on task $t$, respectively, the task vector is defined as the element-wise difference below.&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;$$\tau_t = \theta_{ft}^t - \theta_{pre}.$$&lt;/p&gt;
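&lt;p data-ke-size=&quot;size16&quot;&gt;A minimal sketch of this arithmetic on state-dict-style parameter dictionaries (the official code is in the GitHub repository above; the toy parameters and the optional scaling coefficient here are purely illustrative):&lt;/p&gt;
&lt;pre class=&quot;python&quot; data-ke-language=&quot;python&quot; data-ke-type=&quot;codeblock&quot;&gt;&lt;code&gt;import numpy as np

def task_vector(theta_pre, theta_ft):
    # tau_t = theta_ft^t - theta_pre, element-wise per parameter tensor.
    return {name: theta_ft[name] - theta_pre[name] for name in theta_pre}

def apply_task_vector(theta_pre, tau, scale=1.0):
    # theta_new = theta_pre + scale * tau  (scale is an optional scaling coefficient).
    return {name: theta_pre[name] + scale * tau[name] for name in theta_pre}

# Toy 'models' with a single invented weight matrix each.
theta_pre = {'w': np.zeros((2, 2))}
theta_ft_a = {'w': np.ones((2, 2))}
theta_ft_b = {'w': 2.0 * np.ones((2, 2))}
theta_ft_c = {'w': np.full((2, 2), 0.5)}

tau_a = task_vector(theta_pre, theta_ft_a)
tau_b = task_vector(theta_pre, theta_ft_b)
tau_c = task_vector(theta_pre, theta_ft_c)

forgetting = apply_task_vector(theta_pre, {k: -v for k, v in tau_a.items()})                  # negation
multi_task = apply_task_vector(theta_pre, {k: tau_a[k] + tau_b[k] for k in tau_a})            # addition
analogy = apply_task_vector(theta_pre, {k: tau_c[k] + (tau_b[k] - tau_a[k]) for k in tau_c})  # tau_C + (tau_B - tau_A)
print(forgetting['w'][0, 0], multi_task['w'][0, 0], analogy['w'][0, 0])&lt;/code&gt;&lt;/pre&gt;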
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;Depending on the goal, this work uses three vector operations: negating a task vector, adding task vectors together, and combining task vectors. They are defined as follows.&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;b&gt;Negating a task vector&lt;/b&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;Extrapolating between the fine-tuned model and the pre-trained model. A model edited with this operation performs worse on the target task (with little change in performance on control tasks).&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;$$\tau_{new}=-\tau$$&lt;/p&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-origin-width=&quot;1382&quot; data-origin-height=&quot;1122&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/bI2c75/btrVBecdXnJ/tTHLjUWzIcC7WtplcskZcK/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/bI2c75/btrVBecdXnJ/tTHLjUWzIcC7WtplcskZcK/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/bI2c75/btrVBecdXnJ/tTHLjUWzIcC7WtplcskZcK/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FbI2c75%2FbtrVBecdXnJ%2FtTHLjUWzIcC7WtplcskZcK%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;482&quot; height=&quot;391&quot; data-origin-width=&quot;1382&quot; data-origin-height=&quot;1122&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;Looking at the experimental results, compared with the baselines (gradient ascent, random vector), negation lowers target-task performance while minimizing the loss in control accuracy. In the toxic-generation task of Table 2, the toxic generation percentage drops sharply compared with the fine-tuned model, and there is almost no loss in perplexity relative to the gradient-ascent approach.&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;b&gt;Adding a task vector&lt;/b&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;When merging two or more tasks to obtain a strong multi-task model, the following operation can be used.&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;$$\tau_{new}=\Sigma_i\tau_i$$&lt;/p&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-origin-width=&quot;1312&quot; data-origin-height=&quot;1114&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/bUo3MP/btrVF8oHJwy/JzTgF0TklKL0rknp2gCUn0/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/bUo3MP/btrVF8oHJwy/JzTgF0TklKL0rknp2gCUn0/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/bUo3MP/btrVF8oHJwy/JzTgF0TklKL0rknp2gCUn0/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FbUo3MP%2FbtrVF8oHJwy%2FJzTgF0TklKL0rknp2gCUn0%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;1312&quot; height=&quot;1114&quot; data-origin-width=&quot;1312&quot; data-origin-height=&quot;1114&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;b&gt;Combining task vectors&lt;/b&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;When exploiting an analogy between tasks, as in &quot;A is to B as C is to D&quot;, the following operation can be used.&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;$$\tau_{new}=\tau_C + (\tau_B - \tau_A)$$&lt;/p&gt;</description>
      <category>Paper Review/Computer Vision</category>
      <author>책읽는짱구</author>
      <guid isPermaLink="true">https://grooms-academy.tistory.com/9</guid>
      <comments>https://grooms-academy.tistory.com/9#entry9comment</comments>
      <pubDate>Mon, 9 Jan 2023 11:46:28 +0900</pubDate>
    </item>
    <item>
      <title>[Paper Review] Switch-BERT: Learning to Model Multimodal Interactions by Switching Attention and Input (ECCV 2022)</title>
      <link>https://grooms-academy.tistory.com/8</link>
      <description>&lt;h3 data-ke-size=&quot;size23&quot;&gt;Two paradigms of Transformer-based architecture for intra- and inter-modal interactions&lt;/h3&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;1) &lt;b&gt;Single-Encoder&lt;/b&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;- Encodes the modalities jointly with a single-stream encoder.&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;=&amp;gt; Can outperform the dual-encoder approach, but its time complexity is too high to be practical for real-world applications.&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;2) &lt;b&gt;Dual-Encoder&lt;/b&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;- Uses a separate encoder per modality to extract each representation, then fuses the modalities' features through a layer such as cross-attention.&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;- For tasks like image-text retrieval, something as simple as a dot product between the image and text representations can be used (see the sketch below).&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;=&amp;gt; Representations of the retrieval candidates can be cached in advance, which makes this approach well suited to real services.&lt;/p&gt;
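&lt;p data-ke-size=&quot;size16&quot;&gt;A rough sketch of why the dual-encoder setup suits deployment: the candidate-image features can be pre-computed and cached, so each query needs only one text-encoder pass plus dot products. The random features and the stand-in encoder below are placeholders, not Switch-BERT itself.&lt;/p&gt;
&lt;pre class=&quot;python&quot; data-ke-language=&quot;python&quot; data-ke-type=&quot;codeblock&quot;&gt;&lt;code&gt;import numpy as np

rng = np.random.default_rng(0)

def encode_text(query, dim=128):
    # Stand-in text encoder: a real system would run a transformer here.
    local = np.random.default_rng(abs(hash(query)) % (2 ** 32))
    v = local.normal(size=dim)
    return v / np.linalg.norm(v)

# Cached, L2-normalized image features for 1000 candidate images (placeholders).
cached_image_feats = rng.normal(size=(1000, 128))
cached_image_feats /= np.linalg.norm(cached_image_feats, axis=1, keepdims=True)

query_feat = encode_text('a dog catching a frisbee')
scores = cached_image_feats @ query_feat    # dot-product similarity against the cache
top5 = np.argsort(-scores)[:5]              # indices of the best-matching candidates
print(top5, scores[top5])&lt;/code&gt;&lt;/pre&gt;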
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;h3 data-ke-size=&quot;size23&quot;&gt;Misalignment Between Modal Semantics&lt;/h3&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;- In single-stream models, the visual features carry high-level semantics while the text carries low-level semantics. Since the two modalities' features are not at the same semantic level, processing both representations in one encoder is contradictory.&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;- Dual-stream models can mitigate this misalignment to some extent by using a separate encoder per modality, but the interaction between the modalities happens only at specific layers, which can be inflexible.&lt;/p&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-origin-width=&quot;458&quot; data-origin-height=&quot;384&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/uUIvg/btrUdg8veKM/XPnvQB66OmtKVIDkrKDaG0/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/uUIvg/btrUdg8veKM/XPnvQB66OmtKVIDkrKDaG0/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/uUIvg/btrUdg8veKM/XPnvQB66OmtKVIDkrKDaG0/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FuUIvg%2FbtrUdg8veKM%2FXPnvQB66OmtKVIDkrKDaG0%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;431&quot; height=&quot;361&quot; data-origin-width=&quot;458&quot; data-origin-height=&quot;384&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;- In general, deeper layers can extract higher-level semantics, but the figure above shows that performance stops improving, or even drops, at only 6~10 layers.&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;- In other words, the way the two modalities are fused still has limitations.&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;- Another finding is that the optimal depth differs from task to task, which makes it difficult to design a single fixed architecture that is optimal for every task.&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-origin-width=&quot;772&quot; data-origin-height=&quot;410&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/c1TUCR/btrUnYBLFkD/ZmGHWk8ohNjWbj2TzNTv11/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/c1TUCR/btrUnYBLFkD/ZmGHWk8ohNjWbj2TzNTv11/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/c1TUCR/btrUnYBLFkD/ZmGHWk8ohNjWbj2TzNTv11/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2Fc1TUCR%2FbtrUnYBLFkD%2FZmGHWk8ohNjWbj2TzNTv11%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;772&quot; height=&quot;410&quot; data-origin-width=&quot;772&quot; data-origin-height=&quot;410&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-origin-width=&quot;621&quot; data-origin-height=&quot;193&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/nHXVo/btrUr56WamO/H3pdSTkv0BtwYz5hhF0Hkk/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/nHXVo/btrUr56WamO/H3pdSTkv0BtwYz5hhF0Hkk/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/nHXVo/btrUr56WamO/H3pdSTkv0BtwYz5hhF0Hkk/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FnHXVo%2FbtrUr56WamO%2FH3pdSTkv0BtwYz5hhF0Hkk%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;621&quot; height=&quot;193&quot; data-origin-width=&quot;621&quot; data-origin-height=&quot;193&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/p&gt;</description>
      <category>Paper Review/Computer Vision</category>
      <author>책읽는짱구</author>
      <guid isPermaLink="true">https://grooms-academy.tistory.com/8</guid>
      <comments>https://grooms-academy.tistory.com/8#entry8comment</comments>
      <pubDate>Wed, 21 Dec 2022 11:17:05 +0900</pubDate>
    </item>
    <item>
      <title>[Paper Review] Conditioned and composed image retrieval combining and partially fine-tuning CLIP-based features (CVPR 2022 Workshop)</title>
      <link>https://grooms-academy.tistory.com/7</link>
      <description>&lt;h3 data-ke-size=&quot;size23&quot;&gt;&lt;b&gt;Content-Based Image Retrieval with User Feedback Information&lt;/b&gt;&lt;/h3&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;Content-Based Image Retrieval (CBIR) takes an image as the query and computes distances between the query feature and the features stored in a DB. While earlier work used only an image as the query, recent studies perform CBIR using user feedback information as well. For example, based on the first retrieval result, the user can provide feedback &lt;u&gt;asking for an image that is similar to it but carries the Korn logo&lt;/u&gt;, and the system retrieves images that satisfy those conditions. This work restricts the user feedback to text information. The proposed method achieves SOTA on the FashionIQ and CIRR benchmark datasets, outperforming models such as ARTEMIS and CIRPLANT, which will be covered in later posts.&lt;/p&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-origin-width=&quot;604&quot; data-origin-height=&quot;307&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/WWucB/btrUaWvRPTF/tjkHwicMv21atvkTyXNXDk/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/WWucB/btrUaWvRPTF/tjkHwicMv21atvkTyXNXDk/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/WWucB/btrUaWvRPTF/tjkHwicMv21atvkTyXNXDk/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FWWucB%2FbtrUaWvRPTF%2FtjkHwicMv21atvkTyXNXDk%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;604&quot; height=&quot;307&quot; data-origin-width=&quot;604&quot; data-origin-height=&quot;307&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;h3 data-ke-size=&quot;size23&quot;&gt;&lt;b&gt;Proposed Method&lt;/b&gt;&lt;/h3&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;The proposed method is largely based on the CLIP model and exploits the strength of CLIP's dual encoder (an image and a text encoder) pre-trained on a large corpus. The authors, however, point out one issue, which is as follows.&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;Suppose we have an image of a black dress (x) and the relative caption &quot;is blue&quot; (y).&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;What the user wants, given this image and caption, is to find a blue dress, i.e. some target image z. The problem is that CLIP's image and text encoders give no guarantee that $$ \phi_I(x) + \phi_T(y) \approx \phi_I(z) $$&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;The authors note that &quot;Ideally, we would like to have a textual embedding space that contains displacement vectors in the image embedding space since the conditioned image retrieval task consists of moving between two points in image space using textual information,&quot; and to this end they propose the two-stage training method below.&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;h4 data-ke-size=&quot;size20&quot;&gt;Text encoder fine-tuning&lt;/h4&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-origin-width=&quot;1250&quot; data-origin-height=&quot;539&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/mdhB3/btrUbsnMfXM/6U8OsZVxk5wp5wim8iROzk/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/mdhB3/btrUbsnMfXM/6U8OsZVxk5wp5wim8iROzk/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/mdhB3/btrUbsnMfXM/6U8OsZVxk5wp5wim8iROzk/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FmdhB3%2FbtrUbsnMfXM%2F6U8OsZVxk5wp5wim8iROzk%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;1250&quot; height=&quot;539&quot; data-origin-width=&quot;1250&quot; data-origin-height=&quot;539&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;This stage minimizes the mismatch problem of the text encoder between pre-training on large-scale image-text data and fine-tuning on the downstream task. The training procedure is simple:&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;(1) extract the reference image and caption features with the pre-trained image and text encoders,&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;(2) form a combined feature via simple summation and L2 normalization, and&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;(3) build triplets of the combined feature with positive target images and negative images, and train the text encoder with a contrastive loss (batch-based classification in this work), keeping the image encoder's weights frozen. (A sketch of these steps follows below.)&lt;/p&gt;
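&lt;p data-ke-size=&quot;size16&quot;&gt;A minimal sketch of steps (1)-(3) with random stand-ins for the CLIP features; the real training uses the CLIP encoders, and the exact batch-based classification loss is described in the paper, so the names, dimensions and temperature below are illustrative only.&lt;/p&gt;
&lt;pre class=&quot;python&quot; data-ke-language=&quot;python&quot; data-ke-type=&quot;codeblock&quot;&gt;&lt;code&gt;import numpy as np

def combine_sum(image_feat, text_feat):
    # Step (2): simple summation followed by L2 normalization.
    s = image_feat + text_feat
    return s / np.linalg.norm(s, axis=-1, keepdims=True)

def batch_contrastive_loss(combined, target_feats, temperature=0.07):
    # Step (3): each combined feature should match its own target image;
    # the other targets in the batch act as negatives (batch-based classification).
    logits = combined @ target_feats.T / temperature
    logits = logits - logits.max(axis=1, keepdims=True)   # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))

def unit(x):
    return x / np.linalg.norm(x, axis=1, keepdims=True)

rng = np.random.default_rng(0)
ref_img = unit(rng.normal(size=(8, 512)))   # reference image features (stand-ins)
caption = unit(rng.normal(size=(8, 512)))   # relative caption features (stand-ins)
target = unit(rng.normal(size=(8, 512)))    # target image features (stand-ins)
print(batch_contrastive_loss(combine_sum(ref_img, caption), target))&lt;/code&gt;&lt;/pre&gt;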
&lt;h4 data-ke-size=&quot;size20&quot;&gt;Combiner network training&lt;/h4&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-origin-width=&quot;1251&quot; data-origin-height=&quot;524&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/Ty7HE/btrT98jd48S/avKdn4K9cBrVJUBRKKFoX1/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/Ty7HE/btrT98jd48S/avKdn4K9cBrVJUBRKKFoX1/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/Ty7HE/btrT98jd48S/avKdn4K9cBrVJUBRKKFoX1/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FTy7HE%2FbtrT98jd48S%2FavKdn4K9cBrVJUBRKKFoX1%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;1251&quot; height=&quot;524&quot; data-origin-width=&quot;1251&quot; data-origin-height=&quot;524&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;While the previous step further trained the text encoder, this stage trains a combiner network to fuse the text and image features better, because simply summing the image and text features is not expressive enough for text that directly asks to modify specific parts of the image.&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;Looking at the combiner architecture below, the image and text representations are each linearly projected, passed through a ReLU, and concatenated, and the concatenated feature is fed into three branches. The top and bottom branches compute the coefficients of a convex combination of the image and text features, while the middle branch produces a mixture of the text and image features. Finally, the representations from the three branches are summed to form the combined feature.&lt;/p&gt;
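&lt;p data-ke-size=&quot;size16&quot;&gt;To make the data flow concrete, here is a plain-numpy sketch of such a three-branch combiner; the layer sizes, random weights, and the use of a softmax to produce the convex-combination coefficients are assumptions of mine, so see the paper for the exact architecture.&lt;/p&gt;
&lt;pre class=&quot;python&quot; data-ke-language=&quot;python&quot; data-ke-type=&quot;codeblock&quot;&gt;&lt;code&gt;import numpy as np

rng = np.random.default_rng(0)
d, h = 512, 256

def relu(x):
    return np.maximum(x, 0.0)

# Randomly initialized weights standing in for the learned projections and branches.
W_img = rng.normal(size=(d, h)) * 0.02
W_txt = rng.normal(size=(d, h)) * 0.02
W_mix = rng.normal(size=(2 * h, d)) * 0.02   # middle branch: learned mixture
w_a = rng.normal(size=2 * h) * 0.02          # coefficient branch for the text feature
w_b = rng.normal(size=2 * h) * 0.02          # coefficient branch for the image feature

def combiner(image_feat, text_feat):
    # Project each modality, apply ReLU, then concatenate.
    z = np.concatenate([relu(image_feat @ W_img), relu(text_feat @ W_txt)], axis=-1)
    # Two branches produce the convex-combination coefficients (softmax over two logits).
    logits = np.stack([z @ w_a, z @ w_b], axis=-1)
    coeffs = np.exp(logits) / np.exp(logits).sum(axis=-1, keepdims=True)
    a, b = coeffs[:, :1], coeffs[:, 1:]
    # The middle branch outputs a learned mixture of the concatenated features.
    mix = z @ W_mix
    combined = a * text_feat + b * image_feat + mix
    return combined / np.linalg.norm(combined, axis=-1, keepdims=True)

image_feat = rng.normal(size=(4, d))
text_feat = rng.normal(size=(4, d))
print(combiner(image_feat, text_feat).shape)   # (4, 512) combined features&lt;/code&gt;&lt;/pre&gt;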
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-origin-width=&quot;1253&quot; data-origin-height=&quot;387&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/WvgSH/btrUbwXWxpV/toyQEDFd6c3AP02bmDPOK0/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/WvgSH/btrUbwXWxpV/toyQEDFd6c3AP02bmDPOK0/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/WvgSH/btrUbwXWxpV/toyQEDFd6c3AP02bmDPOK0/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FWvgSH%2FbtrUbwXWxpV%2FtoyQEDFd6c3AP02bmDPOK0%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;1253&quot; height=&quot;387&quot; data-origin-width=&quot;1253&quot; data-origin-height=&quot;387&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;As in the previous step, a contrastive loss is used to train the combiner network.&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;h3 data-ke-size=&quot;size23&quot;&gt;&lt;b&gt;Experiments&lt;/b&gt;&lt;/h3&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;Table 1 compares performance depending on whether FT and CF are applied and on their types. The results show that additionally fine-tuning the text encoder and then training the combiner gives the highest final performance.&lt;/p&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-origin-width=&quot;602&quot; data-origin-height=&quot;403&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/rP1D6/btrT9W4mtZo/pemZxWOhbI673F73MOhHWK/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/rP1D6/btrT9W4mtZo/pemZxWOhbI673F73MOhHWK/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/rP1D6/btrT9W4mtZo/pemZxWOhbI673F73MOhHWK/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FrP1D6%2FbtrT9W4mtZo%2FpemZxWOhbI673F73MOhHWK%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;490&quot; height=&quot;328&quot; data-origin-width=&quot;602&quot; data-origin-height=&quot;403&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;Tables 3 and 4 compare against other SOTA models and show that the model proposed in this work achieves the highest performance.&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-origin-width=&quot;601&quot; data-origin-height=&quot;909&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/2P9E9/btrUaceNZws/A8Dzv1S1EgShKHNmFrNlVK/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/2P9E9/btrUaceNZws/A8Dzv1S1EgShKHNmFrNlVK/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/2P9E9/btrUaceNZws/A8Dzv1S1EgShKHNmFrNlVK/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2F2P9E9%2FbtrUaceNZws%2FA8Dzv1S1EgShKHNmFrNlVK%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;601&quot; height=&quot;909&quot; data-origin-width=&quot;601&quot; data-origin-height=&quot;909&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;Finally, Figure 6 compares, via cosine similarity, using the combiner network against simply summing the text and image features. The analysis shows that&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;(1) with fine-tuning, the gap between target and non-target grows for both the sum and the combiner, and&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;(2) the combiner widens the gap between target and non-target more than summation does.&lt;/p&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-origin-width=&quot;617&quot; data-origin-height=&quot;444&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/kVVL0/btrUaufgxh4/aPEsAAsbbqlucTnBgpdOrk/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/kVVL0/btrUaufgxh4/aPEsAAsbbqlucTnBgpdOrk/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/kVVL0/btrUaufgxh4/aPEsAAsbbqlucTnBgpdOrk/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FkVVL0%2FbtrUaufgxh4%2FaPEsAAsbbqlucTnBgpdOrk%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;617&quot; height=&quot;444&quot; data-origin-width=&quot;617&quot; data-origin-height=&quot;444&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;h3 data-ke-size=&quot;size23&quot;&gt;Closing remarks&lt;/h3&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;The method in this work is simple yet delivers high performance, and unlike other models tried on these benchmark datasets, it makes intuitive and straightforward use of pre-trained CLIP, which is why I chose to review it.&lt;/p&gt;</description>
      <category>Paper Review/Computer Vision</category>
      <category>paper</category>
      <category>Review</category>
      <category>Vision</category>
      <category>논문</category>
      <category>리뷰</category>
      <category>컴퓨터</category>
      <author>책읽는짱구</author>
      <guid isPermaLink="true">https://grooms-academy.tistory.com/7</guid>
      <comments>https://grooms-academy.tistory.com/7#entry7comment</comments>
      <pubDate>Tue, 20 Dec 2022 20:59:29 +0900</pubDate>
    </item>
    <item>
      <title>A Fried Whelk Recipe Tastier Than Fried Chicken</title>
      <link>https://grooms-academy.tistory.com/6</link>
      <description>&lt;p data-ke-size=&quot;size16&quot;&gt;Today I would like to introduce fried whelk: as tasty as fried chicken, but simple to make.&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;First, prepare the ingredients as follows.&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;- Frying batter mix&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;- A can of whelks (the kind sold at convenience stores)&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;- 1 egg&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;- Cheongyang chili pepper and green onion (optional)&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;[1] As shown below, mix water and frying batter mix at roughly a 1:1.5 or 1:2 ratio. Then add 1 egg, the whelks, and the chopped green onion and Cheongyang chili. (I skipped the chili since I had none at home.)&lt;/p&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-filename=&quot;KakaoTalk_Photo_2022-12-18-15-42-24 003.jpeg&quot; data-origin-width=&quot;1080&quot; data-origin-height=&quot;1440&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/cAtrGh/btrTUDKdv8O/yCFf9OoxtKPqjZs865Bop0/img.jpg&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/cAtrGh/btrTUDKdv8O/yCFf9OoxtKPqjZs865Bop0/img.jpg&quot; data-alt=&quot;모든 재료를 넣어주고 shake it shake it!&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/cAtrGh/btrTUDKdv8O/yCFf9OoxtKPqjZs865Bop0/img.jpg&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FcAtrGh%2FbtrTUDKdv8O%2FyCFf9OoxtKPqjZs865Bop0%2Fimg.jpg&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;373&quot; height=&quot;497&quot; data-filename=&quot;KakaoTalk_Photo_2022-12-18-15-42-24 003.jpeg&quot; data-origin-width=&quot;1080&quot; data-origin-height=&quot;1440&quot;/&gt;&lt;/span&gt;&lt;figcaption&gt;모든 재료를 넣어주고 shake it shake it!&lt;/figcaption&gt;
&lt;/figure&gt;
&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;[2] Coat a pan with oil and set it to medium-high heat. Once the oil starts to bubble, drop in the battered whelks one at a time.&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;figure data-ke-type=&quot;video&quot; data-ke-style=&quot;alignCenter&quot; data-video-host=&quot;kakaotv&quot; data-video-url=&quot;https://tv.kakao.com/v/434360030&quot; data-video-thumbnail=&quot;https://scrap.kakaocdn.net/dn/FsRKr/hyQUR53PhD/wdFzAdFpThpkmbBs8uC6Yk/img.jpg?width=720&amp;amp;height=1280&amp;amp;face=0_0_720_1280,https://scrap.kakaocdn.net/dn/cpGwUI/hyQWuOTCK8/8FDSyOmmy4I9Ty38As38E0/img.jpg?width=720&amp;amp;height=1280&amp;amp;face=0_0_720_1280&quot; data-video-width=&quot;360&quot; data-video-height=&quot;640&quot; data-video-origin-width=&quot;720&quot; data-video-origin-height=&quot;1280&quot; data-ke-mobilestyle=&quot;widthContent&quot; data-video-play-service=&quot;daum_tistory&quot;&gt;&lt;iframe src=&quot;https://play-tv.kakao.com/embed/player/cliplink/434360030?service=daum_tistory&quot; width=&quot;360&quot; height=&quot;640&quot; frameborder=&quot;0&quot; allowfullscreen=&quot;true&quot;&gt;&lt;/iframe&gt;
&lt;figcaption&gt;If you want a thicker coating, dip each whelk in more batter before dropping it in.&lt;/figcaption&gt;
&lt;/figure&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;[3] The key point: fry them only lightly, no more than 1-2 minutes, to keep the whelks tender. If you prefer them crispy, feel free to fry them longer.&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;Tip: If you are unsure, fry one for 1 minute, one for 2 minutes, and one for 3 or more minutes, then pick the timing that suits your taste.&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;[4] Done! It would taste even better with something like tteokbokki, right?&lt;/p&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-filename=&quot;KakaoTalk_Photo_2022-12-18-15-42-23 001.jpeg&quot; data-origin-width=&quot;1440&quot; data-origin-height=&quot;1440&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/cXidjO/btrTRGgFHxf/k9qbnUuwmtCXxI7jUPukK0/img.jpg&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/cXidjO/btrTRGgFHxf/k9qbnUuwmtCXxI7jUPukK0/img.jpg&quot; data-alt=&quot;골뱅이 튀김&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/cXidjO/btrTRGgFHxf/k9qbnUuwmtCXxI7jUPukK0/img.jpg&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FcXidjO%2FbtrTRGgFHxf%2Fk9qbnUuwmtCXxI7jUPukK0%2Fimg.jpg&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;638&quot; height=&quot;638&quot; data-filename=&quot;KakaoTalk_Photo_2022-12-18-15-42-23 001.jpeg&quot; data-origin-width=&quot;1440&quot; data-origin-height=&quot;1440&quot;/&gt;&lt;/span&gt;&lt;figcaption&gt;골뱅이 튀김&lt;/figcaption&gt;
&lt;/figure&gt;
&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-filename=&quot;떡볶이.jpeg&quot; data-origin-width=&quot;1440&quot; data-origin-height=&quot;1440&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/P2koo/btrTVUyaO9m/UZ8yyjXfZap7dFnnmOwsV0/img.jpg&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/P2koo/btrTVUyaO9m/UZ8yyjXfZap7dFnnmOwsV0/img.jpg&quot; data-alt=&quot;떡볶이는 만들기 귀찮으니까 편의점 떡볶이를 먹는걸로 :p&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/P2koo/btrTVUyaO9m/UZ8yyjXfZap7dFnnmOwsV0/img.jpg&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FP2koo%2FbtrTVUyaO9m%2FUZ8yyjXfZap7dFnnmOwsV0%2Fimg.jpg&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;576&quot; height=&quot;576&quot; data-filename=&quot;떡볶이.jpeg&quot; data-origin-width=&quot;1440&quot; data-origin-height=&quot;1440&quot;/&gt;&lt;/span&gt;&lt;figcaption&gt;떡볶이는 만들기 귀찮으니까 편의점 떡볶이를 먹는걸로 :p&lt;/figcaption&gt;
&lt;/figure&gt;
&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;</description>
      <category>Food &amp;amp; Recipe</category>
      <category>food</category>
      <category>RECIPE</category>
      <category>골뱅이</category>
      <category>골뱅이튀김</category>
      <category>레시피</category>
      <category>연말</category>
      <category>요리</category>
      <category>음식</category>
      <category>튀김</category>
      <author>책읽는짱구</author>
      <guid isPermaLink="true">https://grooms-academy.tistory.com/6</guid>
      <comments>https://grooms-academy.tistory.com/6#entry6comment</comments>
      <pubDate>Sun, 18 Dec 2022 15:52:15 +0900</pubDate>
    </item>
    <item>
      <title>Super Simple Tteokguk (Rice Cake Soup) Recipe</title>
      <link>https://grooms-academy.tistory.com/5</link>
      <description>&lt;p data-ke-size=&quot;size16&quot;&gt;With the new year approaching, I made tteokguk (rice cake soup) for a weekend breakfast.&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;Since it is too much hassle for someone living alone to prepare every tteokguk ingredient from scratch,&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;I am sharing the super simple recipe I tried.&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;This recipe is based on &quot;1분요리 뚝딱이형&quot; (&lt;a href=&quot;https://www.youtube.com/@1mincook&quot; target=&quot;_blank&quot; rel=&quot;noopener&quot;&gt;https://www.youtube.com/@1mincook&lt;/a&gt;).&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size18&quot;&gt;&lt;b&gt;[Ingredients]&lt;/b&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp; - 비비고 갈비탕 (Bibigo beef short rib soup), sliced rice cakes, green onion, egg, seasoned laver flakes&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;[1] Since I used frozen rice cakes, I first soaked them in lukewarm water for 30 minutes,&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;then mixed them with one spoon of sesame oil and three spoons of soy sauce.&lt;/p&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-filename=&quot;KakaoTalk_Photo_2022-12-11-16-20-40 004.jpeg&quot; data-origin-width=&quot;1080&quot; data-origin-height=&quot;1440&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/IzoiX/btrTh6z1gTt/NxN1KVMj5rKCSkKKKLhAd0/img.jpg&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/IzoiX/btrTh6z1gTt/NxN1KVMj5rKCSkKKKLhAd0/img.jpg&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/IzoiX/btrTh6z1gTt/NxN1KVMj5rKCSkKKKLhAd0/img.jpg&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FIzoiX%2FbtrTh6z1gTt%2FNxN1KVMj5rKCSkKKKLhAd0%2Fimg.jpg&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;447&quot; height=&quot;596&quot; data-filename=&quot;KakaoTalk_Photo_2022-12-11-16-20-40 004.jpeg&quot; data-origin-width=&quot;1080&quot; data-origin-height=&quot;1440&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;While the rice cakes soak, roughly cut the green onion with scissors.&lt;/p&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-filename=&quot;KakaoTalk_Photo_2022-12-11-16-20-40 003.jpeg&quot; data-origin-width=&quot;1440&quot; data-origin-height=&quot;1080&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/nC84t/btrTjFopY7I/kejvEsUO7MSk18Mk4kKvpk/img.jpg&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/nC84t/btrTjFopY7I/kejvEsUO7MSk18Mk4kKvpk/img.jpg&quot; data-alt=&quot;가위로 대충 썰어놓은 파&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/nC84t/btrTjFopY7I/kejvEsUO7MSk18Mk4kKvpk/img.jpg&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FnC84t%2FbtrTjFopY7I%2FkejvEsUO7MSk18Mk4kKvpk%2Fimg.jpg&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;575&quot; height=&quot;431&quot; data-filename=&quot;KakaoTalk_Photo_2022-12-11-16-20-40 003.jpeg&quot; data-origin-width=&quot;1440&quot; data-origin-height=&quot;1080&quot;/&gt;&lt;/span&gt;&lt;figcaption&gt;가위로 대충 썰어놓은 파&lt;/figcaption&gt;
&lt;/figure&gt;
&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;[2] Next, in a pot, boil 500 mL of water, one pack of 비비고 갈비탕 (450 g), and about half a cup of milk together with the rice cakes.&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;Once the rice cakes are cooked enough to float, add the chopped green onion and the egg, and boil for only about 1 more minute.&lt;/p&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-filename=&quot;KakaoTalk_Photo_2022-12-11-16-20-40 002.jpeg&quot; data-origin-width=&quot;1080&quot; data-origin-height=&quot;1440&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/dWAmYz/btrTheeqCbN/3CZmBEAwDPD1unUTcSmZ21/img.jpg&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/dWAmYz/btrTheeqCbN/3CZmBEAwDPD1unUTcSmZ21/img.jpg&quot; data-alt=&quot;초간단 레시피의 Secret Point&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/dWAmYz/btrTheeqCbN/3CZmBEAwDPD1unUTcSmZ21/img.jpg&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FdWAmYz%2FbtrTheeqCbN%2F3CZmBEAwDPD1unUTcSmZ21%2Fimg.jpg&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;500&quot; height=&quot;667&quot; data-filename=&quot;KakaoTalk_Photo_2022-12-11-16-20-40 002.jpeg&quot; data-origin-width=&quot;1080&quot; data-origin-height=&quot;1440&quot;/&gt;&lt;/span&gt;&lt;figcaption&gt;초간단 레시피의 Secret Point&lt;/figcaption&gt;
&lt;/figure&gt;
&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;figure data-ke-type=&quot;video&quot; data-ke-style=&quot;alignCenter&quot; data-video-host=&quot;kakaotv&quot; data-video-url=&quot;https://tv.kakao.com/v/434172919&quot; data-video-thumbnail=&quot;https://scrap.kakaocdn.net/dn/bRGwcb/hyQQAwh9qZ/FqEk4sbmlIIWqsGzfikWN1/img.jpg?width=720&amp;amp;height=1280&amp;amp;face=0_0_720_1280,https://scrap.kakaocdn.net/dn/2jaIB/hyQRHm2E0k/KdgXDGCPEKMlq5V0j2Ltu1/img.jpg?width=720&amp;amp;height=1280&amp;amp;face=0_0_720_1280&quot; data-video-width=&quot;340&quot; data-video-height=&quot;604&quot; data-video-origin-width=&quot;720&quot; data-video-origin-height=&quot;1280&quot; data-ke-mobilestyle=&quot;widthContent&quot; data-video-play-service=&quot;daum_tistory&quot;&gt;&lt;iframe src=&quot;https://play-tv.kakao.com/embed/player/cliplink/434172919?service=daum_tistory&quot; width=&quot;340&quot; height=&quot;604&quot; frameborder=&quot;0&quot; allowfullscreen=&quot;true&quot;&gt;&lt;/iframe&gt;
&lt;figcaption&gt;&lt;/figcaption&gt;
&lt;/figure&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;[3] And it is already done. Top with seasoned laver flakes and enjoy.&lt;/p&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-filename=&quot;KakaoTalk_Photo_2022-12-11-16-20-39 001.jpeg&quot; data-origin-width=&quot;1440&quot; data-origin-height=&quot;1440&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/bihjVW/btrThe6CGZc/Yz86MqmJsyBL7Ewm5RcWO0/img.jpg&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/bihjVW/btrThe6CGZc/Yz86MqmJsyBL7Ewm5RcWO0/img.jpg&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/bihjVW/btrThe6CGZc/Yz86MqmJsyBL7Ewm5RcWO0/img.jpg&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FbihjVW%2FbtrThe6CGZc%2FYz86MqmJsyBL7Ewm5RcWO0%2Fimg.jpg&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;487&quot; height=&quot;487&quot; data-filename=&quot;KakaoTalk_Photo_2022-12-11-16-20-39 001.jpeg&quot; data-origin-width=&quot;1440&quot; data-origin-height=&quot;1440&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;The advantage of this recipe is that you never have to trim, simmer, and season the meat yourself; the &quot;비비고 갈비탕&quot; takes care of all of that.&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;Any similar product should work just as well in place of &quot;비비고 갈비탕&quot; :)&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;</description>
      <category>Food &amp;amp; Recipe</category>
      <category>food</category>
      <category>RECIPE</category>
      <category>떡국</category>
      <category>레시피</category>
      <category>아침</category>
      <category>연말</category>
      <category>요리</category>
      <category>자취</category>
      <category>자취생</category>
      <category>한식</category>
      <author>책읽는짱구</author>
      <guid isPermaLink="true">https://grooms-academy.tistory.com/5</guid>
      <comments>https://grooms-academy.tistory.com/5#entry5comment</comments>
      <pubDate>Sun, 11 Dec 2022 16:28:12 +0900</pubDate>
    </item>
  </channel>
</rss>