Generation

Likelihood

Our models learn language by reading text from the internet. Given a sentence, such as "I like to play with legos," the model is asked to repeatedly predict what the next token [?] is:

**I [?]**

**I like [?]**

**I like to [?]**

**I like to play [?]**

**I like to play with [?]**

The model learns that the word “to” is quite likely to follow the word “like” in English, and that the word “play” is likely to follow the word “with” and so on.
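To make this concrete, here is a toy Python sketch of next-token prediction using a hand-written bigram table. The words and probabilities are invented for illustration; a real model learns a far richer distribution over its whole vocabulary.

```python
# Toy illustration of next-token prediction (not a real language model):
# a bigram table mapping a word to the probabilities of the words that
# may follow it. All numbers here are made up.
bigram_probs = {
    "I":    {"like": 0.60, "want": 0.25, "am": 0.15},
    "like": {"to": 0.70, "the": 0.20, "it": 0.10},
    "to":   {"play": 0.50, "go": 0.30, "eat": 0.20},
    "play": {"with": 0.65, "games": 0.35},
}

def predict_next(word):
    """Return the most likely next token under the toy bigram table."""
    dist = bigram_probs[word]
    return max(dist, key=dist.get)

for w in ["I", "like", "to", "play"]:
    print(f"{w} [?] -> {predict_next(w)}")
# I [?] -> like
# like [?] -> to
# to [?] -> play
# play [?] -> with
```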

Likelihood refers to the probability that a given sequence of tokens (e.g., words or characters) will be generated by the model. In practice, the likelihood of a token is reported as a log probability (typically between -15 and 0) that quantifies the model's level of surprise that this token was used in a sentence: values near 0 mean the token was expected, while strongly negative values mean it was surprising.

If a token has a low likelihood, it means the model did not expect this token to be used. Conversely, if a token has a high likelihood, the model was confident that it would be used.

The likelihood score can also be used to evaluate the quality of the model's output: it reflects the model's confidence that a given output is a plausible sequence of tokens given the input prompt and the target data.
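As a minimal sketch, the likelihood score for a sequence can be computed by summing per-token log probabilities. The per-token probabilities below are made up for illustration.

```python
import math

# Suppose the model assigns these probabilities to each token of
# "I like to play", given the preceding tokens (made-up numbers).
token_probs = {"I": 0.05, "like": 0.20, "to": 0.70, "play": 0.45}

for token, p in token_probs.items():
    print(f"{token!r}: log-likelihood = {math.log(p):.2f}")
# 'I': log-likelihood = -3.00, and so on.

# The log-likelihood of the whole sequence is the sum of the per-token
# values (equivalently, the log of the product of the probabilities).
sequence_ll = sum(math.log(p) for p in token_probs.values())
print(f"sequence log-likelihood = {sequence_ll:.2f}")
```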


Temperature

The same prompt may yield a different output each time you hit "generate." Temperature is the parameter that controls how much randomness and diversity appears in the generated output.

The temperature scale ranges from 0 to 1. A temperature of 1.0 means that the model is generating outputs with its normal level of confidence, based on the probabilities it has learned during training.

A higher temperature will increase the randomness of the output, allowing for more diverse and unexpected results. A lower temperature will make the model more conservative in its predictions, generating outputs that are more likely to align with what it has learned during training.
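Under the hood, temperature typically works by dividing the model's raw scores (logits) by the temperature before converting them to probabilities. Here is a minimal Python sketch with made-up logits; it is an illustration of the general technique, not any particular model's implementation.

```python
import math
import random

def sample_with_temperature(logits, temperature=1.0):
    """Scale logits by 1/temperature, apply softmax, then sample one index."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)                       # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    return random.choices(range(len(probs)), weights=probs, k=1)[0]

logits = [2.0, 1.0, 0.1]  # made-up scores for three candidate tokens
# A low temperature sharpens the distribution (almost always picks index 0);
# a higher temperature flattens it, so the choice varies more between runs.
print(sample_with_temperature(logits, temperature=0.5))
```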

Most people will find that a temperature of 1 is a good starting point.

With longer prompts, the model becomes more confident in its predictions, so you can raise the temperature for more diverse and creative output without it going too far off topic. In contrast, a high temperature on a short prompt can make the output very unstable.

We recommend lower temperature values for tasks like classification, entity extraction, or question answering, and higher values for tasks like content or idea generation.


Top-k & top-p

Top-k sampling and top-p sampling are methods for picking output tokens that let you trade off the quality and diversity of the generated output.

Top-p refers to selecting the smallest set of the most likely tokens (e.g., words or characters) whose combined probability reaches p, based on the predicted probabilities of the model, and then sampling the next word from within that set.

Top-p defaults to 1 but accepts any number between 0 and 1. Top-p is an alternative way of controlling the randomness of the generated text; when using it, make sure that temperature is set to 1. Generally, top-p gives better control when you need the model to generate text with accuracy and correctness.
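Here is a minimal Python sketch of top-p (nucleus) sampling. The probability table is made up for illustration; real implementations apply the same idea over the model's full vocabulary.

```python
import random

def top_p_sample(probs, p=0.75):
    """Nucleus sampling sketch: keep the smallest set of tokens whose
    cumulative probability reaches p, then sample from that set,
    weighted by the tokens' probabilities."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    nucleus, cumulative = [], 0.0
    for token, prob in ranked:
        nucleus.append((token, prob))
        cumulative += prob
        if cumulative >= p:
            break
    tokens, weights = zip(*nucleus)
    return random.choices(tokens, weights=weights, k=1)[0]

probs = {"play": 0.50, "go": 0.30, "eat": 0.15, "sleep": 0.05}  # made up
print(top_p_sample(probs, p=0.75))  # samples only from {"play", "go"}
```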

Top-k refers to selecting the k tokens with the highest predicted probabilities and then sampling the next word from among those k tokens, weighted by their probabilities. This method tends to produce outputs that are more likely to be accurate but may be less diverse.

Top-k defaults to 0 (disabled), but it will accept an integer between 1 and 50,400.

In other words, top-k sets the number of tokens eligible for sampling, sorted by probability, with all tokens below the k'th excluded. A lower value can improve quality by removing the long tail of unlikely tokens, making it less likely for the output to go off topic.
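And a corresponding Python sketch of top-k sampling, again with a made-up probability table standing in for a full vocabulary distribution:

```python
import random

def top_k_sample(probs, k=2):
    """Top-k sketch: keep the k most likely tokens, then sample from
    them, weighted by their probabilities."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:k]
    tokens, weights = zip(*ranked)
    return random.choices(tokens, weights=weights, k=1)[0]

probs = {"play": 0.50, "go": 0.30, "eat": 0.15, "sleep": 0.05}  # made up
print(top_k_sample(probs, k=2))  # only "play" or "go" can be chosen
```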