Michiel De Koninck
Machine Learning Engineer and LLM specialist
A donkey never hits its head on the same stone twice. And it seems this is something that ChatGPT and donkeys have in common. ChatGPT is no fool: it doesn’t always make the same mistake. Instead, it makes a different mistake each time. Nice? Try forcing it to always make the same mistake. Bummer: that’s just not possible.
We observe that OpenAI’s Generative API-callable models (DALL-E, ChatGPT, …) cannot be controlled to act deterministically. In other words: they produce inconsistent results even when their temperature, the parameter that controls their “creativity”, is dialed to zero. More information on temperature can be found in the OpenAI API documentation. Instead of offering solutions, this blogpost aims to explain this behaviour. Because knowledge is power, right?
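To make that observation concrete, here is a minimal sketch of how you could check it yourself; the model name and prompt are purely illustrative, and we assume the v1 interface of the openai Python client:

```python
from openai import OpenAI

client = OpenAI()  # assumes the OPENAI_API_KEY environment variable is set

def ask(prompt: str) -> str:
    # temperature=0 should, in theory, make the generation deterministic
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",  # illustrative model choice
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return response.choices[0].message.content

# Ask the exact same thing five times and count the distinct answers
answers = {ask("Summarise these terms & conditions: ...") for _ in range(5)}
print(len(answers))  # in practice this is often > 1: the answers are not reproducible
```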
But how relevant is this really? Does this question pop up regularly? Apparently, yes: it currently tops OpenAI’s GPT FAQ list.
It’s likely that, if you are building an application that relies on the output of a GPT-model, you (at least in a testing phase) want to be able to have the model behave deterministically so that you can rely on reproducible behaviour to some extent. For example:
But enough introduction, because we value both structure and blowing things out of proportion, we will now analyse the behaviour of GPT models specifically by taking you from observation through understanding towards explanation. We have to go into quite the level of detail here but through simplified visualisation, we hope to alleviate the need for any knowledge of complicated mathematical concepts. Hooray for simplifications!
To summarise again, the question that we will answer here boils down to:
“why don’t I consistently get the same answers from a call to any OpenAI Generative API when the temperature is 0?”
Note that, while relevant discussions and resources on this topic can be found:
We deemed none of these explanations satisfactory in delivering a complete and comprehensive answer. Great news, because this allows us to fill the void and offer you that sweet satisfaction.
“it is hard to understand exactly how a black box system works. But if the black box evolved from boxes that were, you know, less black and more transparent, then some quality assumptions can still be made.”
- someone, at some point in time, possibly
As is clear from the fake quote above: we can’t know exactly how the closed-source LLMs (Large Language Models) ChatGPT or GPT-4 work under the hood (e.g. the GPT-4 paper). But from their more transparent predecessors (e.g. the GPT-2 paper [2] and open-source code) and open-source competitors (e.g. the LLaMA paper), we know the gist of the relevant transformer-based architecture.
Below, we zoom in on the parts of the architecture that we estimate to be most relevant to the observed non-deterministic behaviour. Feel free to quickly skim through this and skip to the “Explanation” section if you are familiar with the basics of textual generative models’ architecture.
First: one inference forward pass through the LLM network delivers a single “token”, which represents the “most granular unit of text the model understands” (it can be thought of as a syllable but can just as well represent a word as a whole). For the sake of interpretability, we consider each “token” in our story to represent an entire word. This simplification has no impact on further conclusions.
That being said, an LLM network has a vocabulary of tokens and, through its seemingly magical understanding of input text, it is very good at indicating which tokens from its vocabulary are most likely to follow the given input. This indication happens by assigning probabilities (see drawing above). For example, P(token_0) represents the estimated probability that the word chewing (represented by token_0) follows the given sequence of input tokens (the cow is …). The probability of the word bowling (represented by token_50.256) is hopefully lower than that of the word chewing or grazing in this context.
We emphasise that, for an LLM to generate a complete sequence of text, it has to iteratively pass through the network multiple times: each forward pass serves to select exactly one token which in turn contributes to determining the next token.
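Since we can’t peek inside the API models themselves, here is a minimal sketch of that token-by-token loop using the open-source GPT-2 through Hugging Face transformers; the prompt is the one from our drawing, and the greedy pick is used purely for illustration:

```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

input_ids = tokenizer("the cow is", return_tensors="pt").input_ids

for _ in range(5):  # one forward pass per generated token
    with torch.no_grad():
        logits = model(input_ids).logits              # shape: (1, sequence_length, vocabulary_size)
    probs = torch.softmax(logits[0, -1], dim=-1)      # a probability for every token in the vocabulary
    next_token = torch.argmax(probs)                  # pick exactly one token (greedy, for illustration)
    input_ids = torch.cat([input_ids, next_token.view(1, 1)], dim=1)

print(tokenizer.decode(input_ids[0]))
```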
The crucial piece of the puzzle now comes from how just one output token is chosen using this list of token probabilities. Below, we briefly dive into the world of sampling. For a more extended understanding of sampling methods, we recommend reading through this Hugging Face blogpost.
Modern LLMs use (a variant of) top-p sampling (i.e. nucleus sampling, introduced in this 2019 paper) for sampling the response. This method considers only the smallest set of tokens whose cumulative probability exceeds the probability p and then redistributes the probability mass across that set so that the probabilities sum to 1. If you’re now thinking “wait what”, congratulations, you’re not a statistician! Feel free to re-read that sentence and then jump into the more understandable visual explanation below:
Say that we set p=0.92. On the first pass, starting from only the word “the”, we need 6 tokens to exceed that probability of 92% (they sum to 94%). We can sample a token from these six words by considering their redistributed probabilities (where the word nice will have the highest chance of being picked). For the next pass, we find that the 3 most likely tokens together already easily exceed the threshold of 92% and thus the eventual token is sampled from only those three.
The nice thing is that the number of tokens to sample from dynamically depends on the level of “uncertainty” of the model. If the model deems a small subset of tokens to be most relevant (for example because the input tokens consist of "the", "car", thus restricting the context), it will sample from only those.
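To make this a bit more tangible, here is a minimal sketch of how nucleus sampling could be implemented; the probabilities below are made up, not the ones from the drawings above:

```python
import numpy as np

def top_p_sample(probs: np.ndarray, p: float = 0.92, rng=None) -> int:
    """Keep the smallest set of tokens whose cumulative probability exceeds p,
    redistribute the probability mass within that set, then sample from it."""
    rng = rng or np.random.default_rng()
    order = np.argsort(probs)[::-1]                         # token indices, most probable first
    cumulative = np.cumsum(probs[order])
    nucleus_size = int(np.searchsorted(cumulative, p)) + 1  # how many tokens are needed to exceed p
    nucleus = order[:nucleus_size]
    redistributed = probs[nucleus] / probs[nucleus].sum()   # renormalise so the set sums to 1
    return int(rng.choice(nucleus, p=redistributed))

# Toy distribution over a 6-token vocabulary
probs = np.array([0.40, 0.25, 0.15, 0.09, 0.06, 0.05])
print(top_p_sample(probs, p=0.92))  # samples from the 5 most probable tokens only
```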
Relevant anecdote: a couple of years ago at ML6 we created a basic ‘terms & conditions summariser’ that by chance generated the word “milk” in a summary and completely shifted towards talking about food just because it didn’t use nucleus sampling.
Okay so nice, we now understand what the top_p parameter from the OpenAI API documentation refers to. Note that, by default, this parameter is set to p=100%, meaning that all output tokens are taken into account. If, on the other hand, p=0%, the first token to be checked (algorithmically the one with the highest probability) will always be chosen, as it immediately exceeds the super low threshold by itself.
Okay but what role does the temperature parameter play?
Imagine you have a set of output tokens (possibly with re-distributed probability if you play with the top_p parameter) to sample from:
What the temperature does is control the relative weights in the probability distribution: it determines the extent to which differences in probability play a role in the sampling. Take the example above: for the token input sequence "The", we would (by default) expect the word nice to have a 75% chance of being chosen, i.e. P("nice")=75%. This is what happens at temperature t=1. The parameter can be set between 0 and 2.
At temperature t=0 this sampling technique turns into what we call greedy search/argmax sampling where the token with the highest probability is always selected (here: P(“nice”)=100% ).
At temperature t=2 the difference between the more probable and less probable tokens is reduced at sample time. For the example on the left above, this would result in: P("nice")=58%, P("dog")=32%, P("car")=10%.
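As a small illustration of that sharpening and flattening effect, here is a quick sketch; the raw scores below are hypothetical, so the percentages will not match the drawing exactly:

```python
import numpy as np

def temperature_softmax(logits: np.ndarray, t: float) -> np.ndarray:
    """Divide the raw scores (logits) by the temperature before normalising."""
    scaled = logits / t
    scaled -= scaled.max()          # for numerical stability
    exp = np.exp(scaled)
    return exp / exp.sum()

logits = np.array([2.0, 1.0, 0.0])  # hypothetical scores for "nice", "dog", "car"
for t in (1.0, 2.0, 0.01):          # t=0 itself would divide by zero; in practice it means greedy/argmax
    # lower t sharpens the distribution, higher t flattens it
    print(t, np.round(temperature_softmax(logits, t), 2))
```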
For those interested, the formula to calculate the sampling probabilities impacted by temperature t is added below (where K represents the total number of tokens considered in the sampling):
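Written as the standard temperature-scaled softmax, with z_i denoting the model’s raw score (logit) for token i:

$$P(\text{token}_i) = \frac{e^{z_i / t}}{\sum_{k=1}^{K} e^{z_k / t}}$$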
You can now consider yourself a true warrior of OpenAI’s Generative APIs, because the extract below holds no secrets for you any longer.
Note the statement “we generally recommend altering this or top_p/temperature but not both”. That is just a suggestion to keep your changes more or less interpretable as you play with the values. Setting either of these parameters to its deterministic limit (i.e. temperature=0 or top_p=0) has the same effect.
We remember that, by default, sampling happens across the entirety of the token vocabulary (top_p=1) and the probability distribution is left unaffected (temperature=1).
Wielding the knowledge above, we know exactly what should happen if we fix temperature=0. Namely: the token with the highest probability will always be chosen.
But what if at some point during generation, lightning strikes and two tokens get assigned exactly the same probability?
That case of at least two tokens having exactly the same probability may seem unlikely, but a few factors make it more likely than you might think:
And of course we note that lightning has to strike only once to change the entire generative behaviour; if a different token is selected just once, the probabilities in all following forward passes are directly impacted, resulting in a different “answer path”. You can imagine that generating the word “milk” once results in a completely different terms & conditions summary down the line.
The obvious next question is then: if two tokens indeed have exactly the same probability, what happens next?
Well, when a computer needs to pick between a set of equally valid/probable options, the decisive “coin toss” power is handed to a seed. The seed initialises the pseudorandom number generator that makes the pick. A more intuitive simplification is shown below.
We thus expect these seeds to affect how ⚡-situations are handled. Typically, when you host your own algorithm/model, you can fix this seed so that the random “coin toss” decisions are always the same.
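A minimal sketch of that idea (the tied tokens and the seed value are of course made up):

```python
import numpy as np

tied_tokens = ["chewing", "grazing"]  # two tokens that ended up with exactly the same probability

# Without a fixed seed, repeated "coin tosses" may break the tie differently each time.
print([str(np.random.default_rng().choice(tied_tokens)) for _ in range(5)])

# With a fixed seed, the pseudorandom "coin toss" always lands the same way.
print([str(np.random.default_rng(seed=42).choice(tied_tokens)) for _ in range(5)])
```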
A wise farmer could have once said:
“if you don’t control what you sow, then how can you control what you reap?”
— hypothetical farmer, probably
And my god would this hypothetical farmer have hit the proverbial nail on the head. If you can’t fix the seed that is used to determine the “coin toss” decisions within your system, then full control is unreachable.
Hence we have explained why determinism remains out of reach when working with OpenAI’s most popular APIs.
Quod erat demonstrandum?
Okay, so seeds. Big whoop. Not that surprising. The question that remains:
Why can’t you pass a seed? As stated by this guy on Twitter:
“it’s kind of crazy that the OpenAI API has no “random seed” parameter. The expected behaviour is to get results that you can never reproduce.”
— Sasha Rush (Associate Prof. at Cornell Tech & Hugging Face Researcher)
So let’s look at some possible reasons for the design choice made to not allow passing a fixed seed:
In this journey, we offered insights into:
For reasons not immediately known to us, OpenAI does not allow us to set the seed that has conclusive power when lightning strikes during token-wise generation of an answer.
We may not know why determinism, and thus reproducibility, is prevented to some extent. But at least the behaviour is clearer now. And that makes us feel a bit better.
Right?