Fine-tuning

How to train large language models for your use case.

Fine-tuning, an approach to transfer learning, is a method for developing a model that's unique to your use case. Fine-tuning improves the capabilities of models by providing:

  • Many more examples of your task than can fit in a prompt
  • Cost savings from shorter prompts
  • Lower-latency requests

Large language models are pre-trained on a massive amount of text from the public Internet. When given a prompt with just a few examples, they can frequently comprehend the task you are attempting to accomplish and provide a useful response. This is called "few-shot learning."

Fine-tuning improves on few-shot learning by training on many more examples than can fit in a prompt, letting you achieve better results on a wide number of tasks. Once a model has been fine-tuned, you'll no longer be required to provide examples in the prompt. This reduces expenses and enables requests with reduced latency.
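To make the prompt-length difference concrete, here is a minimal sketch in plain Python. The prompt text is illustrative, and the whitespace split is only a crude stand-in for a real tokenizer:

```python
# A few-shot prompt: task examples must ride along with every request.
few_shot_prompt = """Classify the sentiment of each review as Positive or Negative.

Review: The battery lasts all day and the screen is gorgeous.
Sentiment: Positive

Review: It stopped charging after two weeks.
Sentiment: Negative

Review: Setup took five minutes and everything just worked.
Sentiment: Positive

Review: The keyboard feels cheap and two keys already stick.
Sentiment:"""

# After fine-tuning on many labeled examples, the task lives in the
# model's weights, so the request carries only the new input.
fine_tuned_prompt = """Review: The keyboard feels cheap and two keys already stick.
Sentiment:"""

def rough_token_count(text: str) -> int:
    """Crude proxy for prompt length; real tokenizers count differently."""
    return len(text.split())

print("few-shot tokens (approx):  ", rough_token_count(few_shot_prompt))
print("fine-tuned tokens (approx):", rough_token_count(fine_tuned_prompt))
```

Every request pays for the full prompt, so trimming the examples out of it is where the cost and latency savings come from.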

Another way of putting it is that fine-tuning a model is like customizing a product to better fit your specific needs. Imagine you bought a pre-made cake mix, and now you want to add some personal touches to make it special for a party. You might add extra ingredients, like nuts or fruit, or adjust the baking time to make the cake perfect for your occasion. This is similar to fine-tuning a pre-trained model.

In Writer’s case, we fine-tune the model with your data to ensure that its output aligns with your style and brand voice and is customized for your specific use cases, taking into account everything from word and character length to fact verification.

How fine-tuning lets you get the most out of the models:

1. Improved Accuracy: Fine-tuning can lead to higher accuracy than training a model from scratch, because the pre-trained model has already been trained on a large variety of data and provides a better starting point for further training.

2. Reduced Training Time: Fine-tuning a model can require much less time than training a model from scratch, as the pre-trained model already contains many of the necessary parameters.

3. Ease of Use: Fine-tuning a model is much easier than training a model from scratch, as all that is required is to adjust the existing parameters of the pre-trained model.

4. Transfer Learning: Fine-tuning a model can provide the ability to transfer knowledge from one domain to another. This is because the pre-trained model has already been trained on data from one domain and can be used to quickly and effectively train a model for another domain.

5. Increased Performance: A fine-tuned model often performs better on unseen data than one trained from scratch, because the pre-trained model's broad exposure to varied data carries over to the new task.

Fine-tuning Palmyra involves training the model on a specific task or dataset after it has been pre-trained on a large corpus to learn general language features. When fine-tuning, you can choose to update the weights of some or all layers of the model. The difference between fine-tuning 1-2 layers and fine-tuning all layers lies in how much of the model is updated during this process.

1. Fine-tuning 1-2 layers: In this approach, only the last 1 or 2 layers of the model are trained or updated, while the weights of the other layers are kept fixed. This assumes that the lower layers have already learned useful language features during pre-training, and only the final layers need to be adapted to the specific task. This method is computationally less expensive and can lead to faster convergence. However, it may not be as effective in adapting the model to the new task, especially if the task is very different from the pre-training data.

2. Fine-tuning all layers: In this approach, the weights of all layers of the model are updated during fine-tuning. This can allow the model to adapt more effectively to the new task, as it has more flexibility in learning task-specific features. However, this method is more computationally expensive, requires more training time, and may be prone to overfitting if the fine-tuning dataset is small.

In summary, fine-tuning 1-2 layers is a more efficient approach but may not be as effective in adapting the model to the new task, while fine-tuning all layers may provide better performance at the cost of increased computational resources and training time. The choice between the two depends on the specific task, dataset size, and available resources.
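As a rough illustration of the two options, here is a minimal PyTorch sketch. The tiny model below is a hypothetical stand-in, not Palmyra's actual architecture, and real pre-trained weights would be loaded rather than initialized like this:

```python
import torch
import torch.nn as nn

# Hypothetical stand-in for a pre-trained model: embeddings, a stack of
# transformer blocks, and a task head.
class TinyLM(nn.Module):
    def __init__(self, vocab=32000, d_model=512, n_layers=6, n_classes=2):
        super().__init__()
        self.embed = nn.Embedding(vocab, d_model)
        self.blocks = nn.ModuleList(
            nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
            for _ in range(n_layers)
        )
        self.head = nn.Linear(d_model, n_classes)

    def forward(self, x):
        h = self.embed(x)
        for block in self.blocks:
            h = block(h)
        return self.head(h.mean(dim=1))  # pooled classification logits

model = TinyLM()

# Option 1: fine-tune only the last 2 blocks plus the task head.
for p in model.parameters():
    p.requires_grad = False
for block in model.blocks[-2:]:
    for p in block.parameters():
        p.requires_grad = True
for p in model.head.parameters():
    p.requires_grad = True

# Option 2 (full fine-tuning) would instead leave every parameter trainable:
#     for p in model.parameters():
#         p.requires_grad = True

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"trainable: {trainable:,} / {total:,} parameters")

# Only parameters that require gradients go to the optimizer.
optimizer = torch.optim.AdamW(
    (p for p in model.parameters() if p.requires_grad), lr=1e-4
)
```

The printed ratio makes the trade-off visible: option 1 trains a fraction of the parameters, while option 2 pays for all of them in compute and memory.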

Model Customization with P-Tuning

P-tuning is a more efficient way of adapting pre-trained language models to various tasks than fine-tuning. The Palmyra LLM customization service allows one pre-trained model to serve many tasks without adjusting all of the model's parameters. For instance, with the Palmyra-Large 20B model, only a small number of parameters needs to be trained and stored for each task. This is far less than what full fine-tuning requires, where a separate copy of the model weights can run to around 40 GB per task. P-tuning also prevents problems like catastrophic forgetting, which can occur during fine-tuning.
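A back-of-the-envelope comparison makes the storage gap concrete. This sketch assumes 16-bit (2-byte) weights, and the virtual-prompt length and hidden size are illustrative guesses, not Palmyra-Large's actual dimensions:

```python
# Storage needed per task, assuming 16-bit weights (2 bytes per parameter).
BYTES_PER_PARAM = 2

# Full fine-tuning: a separate copy of all 20B parameters per task.
full_copy_bytes = 20e9 * BYTES_PER_PARAM            # ~40 GB

# P-tuning: only the virtual prompt embeddings are stored per task.
# Both numbers below are illustrative assumptions.
virtual_tokens = 50        # length of the learned virtual prompt
hidden_size = 6144         # embedding dimension of the base model
p_tuning_bytes = virtual_tokens * hidden_size * BYTES_PER_PARAM

print(f"full fine-tune checkpoint: {full_copy_bytes / 1e9:.0f} GB")
print(f"p-tuning prompt weights:   {p_tuning_bytes / 1e6:.2f} MB")
```

Under these assumptions, each task's p-tuned weights fit in well under a megabyte, versus tens of gigabytes for a full fine-tuned copy.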

P-tuning is different from prompt engineering, which involves optimizing text prompts either manually or automatically. Instead, p-tuning uses virtual prompt embeddings that can be improved using gradient descent. These virtual tokens are 1D vectors with the same dimensions as real token embeddings. During training and testing, these continuous token embeddings are inserted before the real ones. As a result, the language model can respond differently to the same prompt when combined with different virtual tokens.
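The mechanics can be sketched in a few lines of PyTorch. This is an illustrative minimal version: the sizes and the frozen embedding table are assumed stand-ins, and some p-tuning variants generate the virtual embeddings with a small prompt encoder rather than learning them directly as done here:

```python
import torch
import torch.nn as nn

d_model, n_virtual = 512, 20   # illustrative sizes, not Palmyra's

# Frozen piece of the pre-trained model (stand-in for the real LM's
# embedding table).
token_embed = nn.Embedding(32000, d_model)
for p in token_embed.parameters():
    p.requires_grad = False

# The only trainable parameters: one embedding per virtual token, each
# with the same dimensionality as a real token embedding.
virtual_prompt = nn.Parameter(torch.randn(n_virtual, d_model) * 0.02)

def build_inputs(input_ids: torch.Tensor) -> torch.Tensor:
    """Prepend the virtual prompt embeddings to the real token embeddings."""
    batch = input_ids.size(0)
    real = token_embed(input_ids)                       # (B, T, d_model)
    virtual = virtual_prompt.unsqueeze(0).expand(batch, -1, -1)
    return torch.cat([virtual, real], dim=1)            # (B, n_virtual + T, d_model)

# The combined embeddings feed into the frozen LM; only `virtual_prompt`
# receives gradients, so each task can swap in its own virtual prompt.
input_ids = torch.randint(0, 32000, (2, 8))
embeddings = build_inputs(input_ids)
print(embeddings.shape)   # torch.Size([2, 28, 512])
```

Because the base model never changes, the same frozen weights can serve many tasks at once, each selected by its own small set of virtual token embeddings.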