Prepare training data

Here are some guidelines we recommend as you prepare your dataset for fine-tuning:

Data formatting

To fine-tune a model, you'll need a set of training examples that each consist of a single input ("prompt") and its associated output ("completion"). This is notably different from using our foundational models, where you might input detailed instructions or multiple examples in a single prompt.

The dataset must be a JSONL file, where each line contains a prompt-completion pair illustrating one example of your task. We advise fine-tuning each model for a single, specific task.

Here is an example of a JSONL file with prompt-completion pairs:

{"prompt": "<prompt text>", "completion": " <ideal generation><|endoftext|>"}
{"prompt": "<prompt text>", "completion": " <ideal generation><|endoftext|>"}
{"prompt": "<prompt text>", "completion": " <ideal generation><|endoftext|>"}
...
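
As a minimal sketch, the snippet below shows one way to produce such a file with Python's standard library, applying the separator, whitespace, and stop-sequence conventions described in the list below. The example data, file name, separator, and stop sequence shown here are illustrative assumptions; substitute the values that fit your task.

import json

# Hypothetical examples; substitute your own task data.
examples = [
    {"prompt": "Great product, highly recommend!", "completion": "positive"},
    {"prompt": "Arrived broken and late.", "completion": "negative"},
]

SEPARATOR = "\n\n###\n\n"        # fixed separator appended to every prompt
STOP_SEQUENCE = "<|endoftext|>"  # fixed stop sequence ending every completion

with open("training_data.jsonl", "w", encoding="utf-8") as f:
    for ex in examples:
        record = {
            # End the prompt with the fixed separator...
            "prompt": ex["prompt"] + SEPARATOR,
            # ...start the completion with a whitespace and end it with the stop sequence.
            "completion": " " + ex["completion"] + STOP_SEQUENCE,
        }
        f.write(json.dumps(record) + "\n")
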
  • Each prompt should end with a fixed separator to inform the model when the prompt ends and the completion begins. A simple separator that generally works well is \n\n###\n\n; the separator should not appear elsewhere in any prompt.
  • Each completion should start with a whitespace due to our tokenization, which tokenizes most words with a preceding whitespace.
  • Each completion should end with a fixed stop sequence to inform the model when the completion ends. The stop sequence we propose is "<|endoftext|>".
  • For inference, you should format your prompts in the same way as you did when creating the training dataset, including the same separator. Also specify the same stop sequence to properly truncate the completion (see the sketch after this list).
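
As a hedged illustration of the inference formatting above, the sketch below appends the same separator to the prompt and passes the same stop sequence. It assumes the legacy openai Python package (pre-1.0) and a placeholder fine-tuned model name.

import openai

SEPARATOR = "\n\n###\n\n"        # must match the separator used in training
STOP_SEQUENCE = "<|endoftext|>"  # must match the stop sequence used in training

# Hypothetical model ID; use the one returned by your fine-tuning job.
response = openai.Completion.create(
    model="ft-your-model-id",
    prompt="Arrived broken and late." + SEPARATOR,  # same formatting as training
    max_tokens=5,
    temperature=0,
    stop=[STOP_SEQUENCE],  # truncate the completion at the stop sequence
)
print(response["choices"][0]["text"].strip())
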

General best practices

To get the best results, use more high-quality examples. This matters most when you want the fine-tuned model to outperform careful prompting of a base model. You should provide at least 500 examples, ideally vetted by human experts.

Performance tends to increase linearly with every doubling of the number of training examples. Increasing the example count is usually the best and most reliable way of improving performance.

Our maximum file size is 20 MB.
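
As a quick sanity check before uploading, a short sketch like the one below (assuming the training file path used earlier) verifies your file is under the limit:

import os

MAX_BYTES = 20 * 1024 * 1024  # 20 MB limit

size = os.path.getsize("training_data.jsonl")
if size > MAX_BYTES:
    raise ValueError(f"Training file is {size / 1e6:.1f} MB; it must be under 20 MB.")
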