What Is Fine-Tuning?

Learn how to fine-tune an LLM, understand why it's needed, and discover the key parameters and essential steps to successfully fine-tune any LLM.

Fine-tuning is the process of adapting a pretrained language model to specific tasks and use cases by further training its parameters on domain-specific data. By combining the model's existing knowledge with training on new data, fine-tuning allows the model to better understand and respond to those tasks.

Before moving to how fine-tuning works, let’s briefly learn how these models are pretrained.

Pretraining LLMs under the hood

Transformers have revolutionized the field of NLP. Most state-of-the-art language models today, such as GPT, Llama, and BERT, are built on the transformer architecture to understand and generate human-like text. The original transformer consists of an encoder-decoder architecture, though many models use only one of the two components (GPT and Llama are decoder-only; BERT is encoder-only). The encoder takes an input and generates its representation, which is passed to the decoder to generate an output. This architecture helps models learn complex details and patterns in the data during pretraining.

[Figure: Transformer's architecture]

During pretraining, the model is trained on a large corpus of data. The process involves the following layers of the transformer:

  • Input embedding layer: It converts the input tokens into numerical representations called embeddings.

  • Positional encoding layer: It adds information about the position of the word to the input embedding and forwards this combined information to the encoder.

  • Encoder layers: The encoder, consisting of multiple sublayers, uses self-attention mechanisms to help the model understand the context and relationships between the words of the input. The multi-head attention layer of the encoder computes the attention matrix (a matrix that determines the importance of each word in the input sequence with respect to all other words) for the input embedding and passes it to the feedforward layer, which generates the representation of the input. The add & norm component is applied after each sublayer of the encoder: it combines the layer's input with its output (a residual connection) and normalizes the activations to stabilize training.

  • Decoder layers: The decoder also consists of multiple sublayers and generates the output sequence. First, the masked multi-head attention layer computes the attention matrix for the output embedding and passes it to the multi-head attention layer, which combines it with the encoder's representation to generate the representation of the output. As in the encoder, the add & norm component is applied after each sublayer: it combines the layer's input with its output (a residual connection) and normalizes the activations to stabilize training.

  • Linear layer: The linear layer converts the decoder's output into logits (the unnormalized output values produced by the final layer of the model) of the size of the vocabulary.

  • Softmax layer: The softmax layer applies the softmax function to convert the logits into probabilities. The token with the maximum probability is then selected as the final output.

Educative Bytes: A transformer can have any number n of encoder and decoder layers, and the representation obtained by the last layer is the final output.

During pretraining, all these layers are trained together. The model learns general patterns and structures in the early layers and more specific data features in the later layers. Fine-tuning builds on this base, training the pretrained model further for specific tasks.
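To make these layers concrete, here is a minimal PyTorch sketch of a transformer forward pass. Every size, and the random token IDs, are hypothetical values chosen only to show how the embedding, positional encoding, encoder-decoder stack, linear layer, and softmax fit together; it is an illustration, not a production model.

```python
import torch
import torch.nn as nn

# Hypothetical sizes, chosen only for illustration
vocab_size, d_model, max_len = 1000, 64, 32

embedding = nn.Embedding(vocab_size, d_model)               # input embedding layer
pos_encoding = nn.Parameter(torch.zeros(max_len, d_model))  # (learned) positional encoding
transformer = nn.Transformer(d_model=d_model, nhead=4,
                             num_encoder_layers=2, num_decoder_layers=2,
                             batch_first=True)               # encoder + decoder stacks
lm_head = nn.Linear(d_model, vocab_size)                     # linear layer -> logits

src = torch.randint(0, vocab_size, (1, 10))   # random source token IDs
tgt = torch.randint(0, vocab_size, (1, 8))    # random target token IDs

src_x = embedding(src) + pos_encoding[:10]    # embeddings + position information
tgt_x = embedding(tgt) + pos_encoding[:8]

out = transformer(src_x, tgt_x)               # encoder representation feeds the decoder
logits = lm_head(out)                         # unnormalized scores over the vocabulary
probs = torch.softmax(logits, dim=-1)         # softmax -> probabilities
next_token = probs[0, -1].argmax()            # token with maximum probability
print(next_token.item())
```

During pretraining, the parameters of all these components are updated together on the large corpus; fine-tuning later adjusts the same parameters on a smaller, task-specific dataset.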

[Figure: Pre-training the LLM]

How fine-tuning works

Fine-tuning involves taking a pretrained model that has learned general patterns from large datasets and adjusting its parameters to fit custom, task-specific datasets.

For our task, we take a dataset smaller than the pretraining data and adjust the model's weights so it adapts to the new data. In this way, the model refines its existing knowledge and learns the details of the new data. By building on the pretrained model's knowledge, fine-tuning enables the model to learn more efficiently and accurately.

[Figure: Fine-tuning an LLM on a task-specific dataset]

Let’s consider a scenario to understand the importance of fine-tuning.

Scenario: Healthcare service

Consider a renowned healthcare service provider looking to integrate AI models like ChatGPT, Gemini, and Llama in their medical system. They aim to develop a chatbot that creates personalized treatment plans for each patient based on their disease, medical history, genetic profile, and lifestyle. How can they achieve this?


We might think this can be done by simply choosing an LLM for the chatbot and providing patient data as context to the model. That's actually right: this approach works well with fewer patients. But as the number of patients grows into the millions, the patient data also grows significantly, into gigabytes and terabytes.

Relying on the model's general knowledge will not suffice, because it will take much longer to scan such a large context for every query, which also hurts the model's efficiency and accuracy. Most importantly, it will become significantly challenging and time-consuming for the healthcare service provider to keep the context updated and analyze complex data for effective treatment plans. What to do now?

To deal with this situation, they need a more efficient way to tune their language model to their specific patient dataset. They need to train the model to learn from their unique data, reducing the time spent updating context and analyzing complex data. However, training a model from scratch is not feasible, as it requires substantial resources and time; it would also forgo the pre-existing knowledge and abilities of a pretrained model. They need a way to refine and tune an existing model to effectively handle and use their vast dataset. That is where fine-tuning comes in.

Fine-tuning is important because it allows models to:

  • Performance: Improve performance on specific tasks by training on task-specific data.

  • Accuracy: Capture details of the task-specific data, improving the accuracy of the model’s response.

  • Efficiency: Reduce overall resources and time by speeding up training on task-specific data.

  • Adaptability: Quickly learn from new data, adapting to user and task requirements.

  • Scalability: Handle a large volume of personalized interactions efficiently, providing a better user experience.

  • Knowledge retention: Retain pretrained knowledge while learning new task-specific information, avoiding "catastrophic forgetting" (when an LLM forgets previous knowledge while learning from a new dataset, which usually happens when all or some of the LLM's parameters are fine-tuned).

[Figure: Why fine-tuning is important]

Training parameters

Configuring the training parameters is important when fine-tuning a model. These parameters control how the model learns from our custom dataset and whether it achieves optimal performance. The following parameters need to be considered to fine-tune a model effectively:

  • Batch size: It is the number of examples processed in one cycle of the training process. The selection of batch size depends on factors such as the size of training data, memory resources, and the complexity of the task. A larger batch size trains more data in one cycle, speeding up the overall training process, but on the other hand, it also requires more memory to process the data.

  • Epochs: It is the number of passes through the complete dataset. Selecting the number of epochs also depends on the complexity and size of the training data. Too few epochs can result in underfitting (the model has not learned all the details and patterns of the training data and fails to respond accurately to training data and new data), while too many can result in overfitting (the model performs well on training data but fails to respond accurately to test data and new data).

  • Iteration: It is the number of batches (the smaller parts into which the dataset is divided for training) required to complete one epoch. Iterations can be calculated by dividing the total number of examples in the training data by the batch size.

  • Learning rate: It determines how quickly the model learns from the training data. A lower learning rate requires more epochs for its effects to show, while a higher learning rate reflects changes faster, even with fewer epochs.

Educative Bytes: Number of batches and the batch size are two different concepts. The number of batches refers to the count of smaller parts into which the dataset is divided, while the batch size refers to the number of examples processed in one batch during the training.
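To see how these parameters fit together in code, here is a minimal sketch using Hugging Face's TrainingArguments. The dataset size and all values below are illustrative assumptions, not recommendations.

```python
from transformers import TrainingArguments

# Hypothetical numbers to illustrate the batch/epoch/iteration relationship
num_examples = 10_000                               # examples in the training data
batch_size = 16                                     # examples processed per cycle
iterations_per_epoch = num_examples // batch_size   # 625 batches per epoch

# Minimal training-parameter configuration (values are illustrative)
args = TrainingArguments(
    output_dir="finetune-output",          # where checkpoints are saved
    per_device_train_batch_size=batch_size,
    num_train_epochs=3,                    # passes through the full dataset
    learning_rate=2e-5,                    # how quickly weights are adjusted
)
```

Note how the iteration count falls out of the other two choices: with 10,000 examples and a batch size of 16, each epoch takes 625 iterations, and 3 epochs take 1,875.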

Steps for fine-tuning

Following are the key steps that we need to perform to fine-tune any LLM:

  1. Select the model: The first and most important step is to select a pretrained language model that suits our fine-tuning task. Pretrained models are general-purpose models trained on a large corpus of data. A number of open-source models (Llama, BERT, Mistral, etc.), which are freely available for research and development, and closed-source models (ChatGPT, Gemini, etc.), which require a paid subscription or license, are available for fine-tuning. We just need to find the model that best fits our resources and requirements.

Note: In this course, we'll be using Meta's Llama 3.1 with 8 billion parameters for fine-tuning. Do note that the choice of model depends on factors like task complexity and available computational resources.

  2. Prepare the dataset: Our next step is to find a dataset specific to our task and domain. This step is crucial because the entire fine-tuning process depends on the dataset we select. The data should be structured and arranged so the model can learn from it.

  3. Preprocess the dataset: After preparing the dataset, we need to preprocess it. This step involves cleaning the data and then splitting it into train and test sets. Once preprocessing is done, our dataset is ready for fine-tuning.

  4. Configure the training parameters: The next important step is to configure the parameters for fine-tuning the model, such as the learning rate, batch size, and epochs.

  5. Fine-tune the model: Now we are all set to fine-tune the model. This step trains the model on the new dataset while retaining the knowledge it gained from pretraining, so the model learns our task-specific data.

  6. Evaluate and refine: The last step is to evaluate the model's results to assess its performance on our task and make any necessary adjustments. After evaluation, our model is ready to be used for the required task. The sketch below walks through all six steps end to end.
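Putting the steps together, here is a hedged end-to-end sketch using the Hugging Face transformers and datasets libraries. The small "gpt2" checkpoint and the "yelp_review_full" dataset are stand-ins chosen so the example runs on modest hardware; the course itself fine-tunes Llama 3.1 8B, and your own task would use your own domain dataset.

```python
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling,
                          Trainer, TrainingArguments)

# Step 1: select the model ("gpt2" is a small stand-in for illustration)
model_name = "gpt2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token   # gpt2 has no pad token by default
model = AutoModelForCausalLM.from_pretrained(model_name)

# Step 2: prepare a task-specific dataset (a placeholder for your domain data)
dataset = load_dataset("yelp_review_full", split="train[:1000]")

# Step 3: preprocess - tokenize, then split into train and test sets
def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=128,
                     padding="max_length")

dataset = dataset.map(tokenize, batched=True)
splits = dataset.train_test_split(test_size=0.1)

# Step 4: configure the training parameters (illustrative values)
args = TrainingArguments(output_dir="finetune-output",
                         per_device_train_batch_size=8,
                         num_train_epochs=1,
                         learning_rate=2e-5)

# Step 5: fine-tune - the collator copies input IDs to labels for causal LM
collator = DataCollatorForLanguageModeling(tokenizer, mlm=False)
trainer = Trainer(model=model, args=args,
                  train_dataset=splits["train"],
                  eval_dataset=splits["test"],
                  data_collator=collator)
trainer.train()

# Step 6: evaluate and refine based on the results
print(trainer.evaluate())
```

The same skeleton carries over to larger models like Llama 3.1 8B; only the checkpoint name, dataset, and resource-related parameters (batch size, precision, hardware) change.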

[Figure: Steps for fine-tuning]