LoRA: Low-Rank Adaptation of Large Language Models Paper Review

Here is the paper I will be writing about today: LoRA: Low-Rank Adaptation of Large Language Models.

1. Overview/Background:

Introduction to Large Language Models and Training:

Large Language Models (LLMs) are, at their core, very large matrices of numbers (weights); for the biggest models these weights take up hundreds of gigabytes. Training these models typically involves two steps: pre-training and fine-tuning.

Pre-training: This phase involves training the model on a vast corpus of text, enabling it to acquire general language capabilities. During this stage, the model primarily serves as a ‘next-word predictor.’ This phase is computationally expensive and accounts for most of the total training cost. The model absorbs a wealth of knowledge but is not yet good at presenting it in a useful form. For example, if the model is posed a question, it might respond with related questions rather than a succinct answer.

Fine-tuning: This phase follows pre-training. The model is trained on a smaller, task-specific dataset so that it produces the desired kind of response. For example, if the model is meant to behave like ChatGPT, it would be trained on a dataset of questions and answers in that style. Training in this phase uses smaller ‘steps,’ so that each iteration only slightly alters the model’s weights.

2. The Problem:

The main issue with full fine-tuning is that every training iteration updates all of the model's parameters. Each parameter therefore needs its own gradient and optimizer state during training, which for larger models like GPT-3 (with 175 billion parameters) is a colossal computational task. Furthermore, every task needs its own full copy of the fine-tuned weights, so storage grows in proportion to the number of tasks. These challenges make deploying many fine-tuned models practically infeasible.
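
To make the storage point concrete, here is a rough back-of-the-envelope sketch. The parameter count is the GPT-3 figure quoted above; the 2 bytes per parameter (16-bit floats) is my own assumption for illustration, and the helper name `full_copies_size_gb` is made up for this example.

```python
# Rough storage estimate for keeping one fully fine-tuned copy per task.
NUM_PARAMS = 175_000_000_000   # GPT-3 parameter count quoted above
BYTES_PER_PARAM = 2            # assumption: weights stored as 16-bit floats


def full_copies_size_gb(num_tasks: int) -> float:
    """Total size in GB (10**9 bytes) of num_tasks full copies of the model."""
    return num_tasks * NUM_PARAMS * BYTES_PER_PARAM / 1e9


for tasks in (1, 10, 100):
    print(f"{tasks:>3} task(s): {full_copies_size_gb(tasks):,.0f} GB")
```

Under these assumptions, storage scales linearly with the number of tasks, which is the cost LoRA tries to avoid.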

3. Solution

LoRA addresses this challenge by freezing the pre-trained weights and learning only a low-rank update to them.

Generalized Formula

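  • Output vector: Weights * Input vector, i.e. $h = W x$
  • Weights of pretrained LLM: $W_0 \in \mathbb{R}^{d \times k}$
  • Formula for calculating weight matrices when finetuning: $W = W_0 + \Delta W$
  • Formula for using finetuned weight matrices: $h = W x = W_0 x + \Delta W x$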

What LoRA Proposes

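LoRA proposes a rank decomposition of the weight change $\Delta W$ into two smaller matrices A and B. A and B try to capture the change in W with far fewer numbers, which makes them easier to train and deploy. For a $d \times k$ weight matrix, the formula for rank decomposition is as follows: $\Delta W = B A$, where $B \in \mathbb{R}^{d \times r}$ and $A \in \mathbb{R}^{r \times k}$ are the rank-decomposition matrices and the rank $r$ is much smaller than both $d$ and $k$.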

Then substitute the rank-decomposition matrices into the finetuning equation: $h = W_0 x + \Delta W x = W_0 x + B A x$

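During the finetuning process, the pre-trained weights are frozen, with modifications permitted only in A and B. Setting a low rank can substantially reduce the parameters to tune: for a weight matrix of shape $d \times k$ and rank $r$, the trainable parameter count drops from $d \cdot k$ to $r \cdot (d + k)$.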
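
The idea above can be sketched in a few lines of NumPy. This is a minimal illustration, not the paper's implementation: the sizes `d`, `k`, `r`, the random seed, and the function name `lora_forward` are all placeholders I chose, and the initialization (small random A, zero B) is my understanding of what the paper describes, so treat it as an assumption.

```python
import numpy as np

rng = np.random.default_rng(0)

d, k, r = 512, 512, 4                      # illustrative sizes, not from the paper

W0 = rng.normal(size=(d, k))               # frozen pre-trained weights (never updated)
A = rng.normal(scale=0.01, size=(r, k))    # trainable; assumed small random init
B = np.zeros((d, r))                       # trainable; assumed zero init, so BA = 0 at start


def lora_forward(x):
    """h = W0 x + B (A x). A x is computed first, so the d x k product BA is never formed."""
    return W0 @ x + B @ (A @ x)


x = rng.normal(size=k)

# With B = 0 the adapted layer reproduces the pre-trained output exactly.
assert np.allclose(lora_forward(x), W0 @ x)

full_params = d * k          # parameters updated when fully fine-tuning this matrix
lora_params = r * (d + k)    # parameters updated by LoRA (A and B only)
print(full_params, lora_params, full_params / lora_params)
```

For these sizes the matrix has 262,144 entries while A and B together have 4,096, a 64x reduction for this single layer. The ratio grows as the matrix gets larger or the rank gets smaller.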

Benefits of LoRA

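  • Reduced training cost: LoRA lowers the cost of each training step, since only a small number of parameters receive gradients and optimizer updates (the forward pass still runs through the full frozen model).
  • Reduced memory requirements: LoRA can also reduce the memory needed for training, since gradients and optimizer state are kept only for the small number of trainable parameters. The frozen weights still have to be loaded.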
  • Less prone to catastrophic forgetting: LoRA is less prone to catastrophic forgetting, since the pre-trained weights are not updated.
  • Easier to deploy: LoRA weights are easier to deploy than fully fine-tuned models, since they are much smaller.
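
One way to picture the deployment benefit is a single shared base matrix plus a small (B, A) pair per task. The sketch below is only an illustration under my own assumptions: the adapters are random stand-ins rather than trained weights, and the names `adapters`, `merge`, `task_a`, and `task_b` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)
d, k, r = 64, 64, 2                        # illustrative sizes

W0 = rng.normal(size=(d, k))               # shared frozen base weights

# Hypothetical per-task adapters (B, A); in practice these would come from training.
adapters = {
    "task_a": (rng.normal(size=(d, r)), rng.normal(size=(r, k))),
    "task_b": (rng.normal(size=(d, r)), rng.normal(size=(r, k))),
}


def merge(W0, B, A):
    """Fold the low-rank update into a single matrix: W = W0 + BA."""
    return W0 + B @ A


x = rng.normal(size=k)
B, A = adapters["task_a"]
W_a = merge(W0, B, A)

# The merged matrix gives the same output as the two-branch form W0 x + B (A x).
assert np.allclose(W_a @ x, W0 @ x + B @ (A @ x))

# Switching tasks: subtracting the update recovers the shared base weights.
assert np.allclose(W_a - B @ A, W0)

adapter_size = sum(m.size for m in adapters["task_a"])
print(adapter_size, W0.size)
```

Here each adapter holds 256 numbers against 4,096 in the base matrix, so only the small pair needs to be stored and shipped per task.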

Limitations of LoRA

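  • May not be as effective as full fine-tuning: because the update is restricted to a low rank, LoRA may not match full fine-tuning on tasks that need larger changes to the weights.
  • May require more hyperparameter tuning: LoRA adds choices of its own, such as the rank and which weight matrices to adapt, on top of the usual fine-tuning hyperparameters.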

Conclusion

LoRA is a promising method for fine-tuning LLMs while reducing the computational resources and per-task storage required. The paper reports good results on a variety of downstream tasks.

Resources:

LoRA: Low-Rank Adaptation of Large Language Models

Explain Paper: a resource to help understand the paper. It is quite handy and I would highly recommend it.

Google Colab Implementation of LoRA