
What is k in LLM?


In the context of Large Language Models (LLMs), k most commonly refers to a parameter used in a technique called Top K sampling. It defines the number of highest-probability tokens that the model considers when generating the next token in a sequence.

Understanding Top K Sampling

Top K sampling is a method used during the text generation process to add a controlled amount of randomness and improve the diversity of the output. Instead of always picking the single most likely next word (which can lead to repetitive text), Top K sampling limits the possibilities to a specific set of the most probable options.

The reference states: "Top K is similar to Top P, but instead defines a quantity of the most probable tokens that should be considered. For example, a Top K of 3 would instruct the LLM to only consider the three most likely tokens. A Top K of 1 would force the LLM to only consider the most likely token."

Here's what 'k' signifies (a short code sketch follows this list):

  • k: This integer value specifies the number of tokens with the highest probabilities that the LLM should look at when deciding the next word or sub-word token.
  • The LLM calculates the probability distribution over the entire vocabulary for the next token.
  • It then identifies the 'k' tokens that have the highest probabilities.
  • The next token is randomly sampled only from this selected set of 'k' tokens, with their probabilities normalized among themselves.
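To make the selection step concrete, here is a minimal sketch of Top K sampling in Python. The function name top_k_sample and the use of NumPy are illustrative assumptions for this article, not part of any particular LLM library.

```python
import numpy as np

def top_k_sample(logits, k, rng=np.random.default_rng()):
    """Sample the next token id from the k highest-probability tokens.

    logits: 1-D array of raw scores over the whole vocabulary.
    k:      the Top K parameter described above.
    """
    # Turn the logits into a probability distribution over the vocabulary.
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()

    # Keep only the ids of the k most probable tokens.
    top_ids = np.argsort(probs)[-k:]

    # Renormalize the surviving probabilities so they sum to 1,
    # then sample the next token from this restricted set.
    top_probs = probs[top_ids] / probs[top_ids].sum()
    return int(rng.choice(top_ids, p=top_probs))
```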

How Different 'k' Values Affect Output

The value of 'k' directly impacts the generated text:

  • Top K = 1: This is equivalent to greedy decoding. The model always picks the single most probable token. This results in deterministic and often repetitive text.
    • Example: If the next likely tokens are "cat" (0.8 probability), "dog" (0.1), "mouse" (0.05), "bird" (0.02), etc., a Top K of 1 would only consider "cat" and always pick it.
  • Top K = 3: The model considers the top 3 most probable tokens. It will then sample from these 3 options based on their relative probabilities. This introduces variability.
    • Example: Using the same probabilities, a Top K of 3 would consider "cat" (0.8), "dog" (0.1), and "mouse" (0.05). The model would sample from these three. While "cat" is still most likely, there's a chance it could pick "dog" or "mouse".
  • Higher K values: Considering more tokens increases the randomness and diversity of the output, but can lead to less coherent or relevant text if lower-probability tokens are sampled. The sketch below shows how k = 1 and k = 3 behave on the toy distribution above.
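Reusing the hypothetical probabilities from the example above and the top_k_sample sketch defined earlier, the two settings compare as follows (the exact words printed for k = 3 will vary from run to run):

```python
import numpy as np

# Toy next-token distribution from the example above (hypothetical numbers);
# assumes top_k_sample from the earlier sketch is already defined.
rng = np.random.default_rng(0)
vocab = ["cat", "dog", "mouse", "bird"]
probs = np.array([0.80, 0.10, 0.05, 0.02])
probs /= probs.sum()        # renormalize the toy distribution so it sums to 1
logits = np.log(probs)      # convert back to logits for the sampler

# Top K = 1 behaves like greedy decoding: every draw returns "cat".
print([vocab[top_k_sample(logits, k=1, rng=rng)] for _ in range(5)])

# Top K = 3 samples among "cat", "dog", and "mouse"; "bird" is never chosen.
print([vocab[top_k_sample(logits, k=3, rng=rng)] for _ in range(5)])
```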

Setting the right 'k' value is a common way to tune the trade-off between generating predictable, coherent text and generating more creative, varied output.

In summary, 'k' in the context of LLMs refers to the size of the set of the most probable next tokens that the model is allowed to choose from during the text generation process.