Michiel De Koninck
Machine Learning Engineer and LLM specialist
At ML6 that may very well be the #1 question we get from customers regarding the use of large language models (LLMs).
In short, the choice of technique boils down to a difference in ambitions.
Note that these ambitions are not exclusive alternatives; for many use cases you might want to change both the model’s behaviour ánd make sure your model has access to the right information.
To those who now think: “Hey, but can’t I also add knowledge through fine-tuning? Shouldn’t I fine-tune an LLM using my private knowledge base?”, we might say: “Hey, thanks for that spontaneous remark. Well yes, that may be possible, but it’s probably not the most efficient way of adding knowledge, nor is it truly transparent or manageable.” If you are specifically interested in having an LLM access specific knowledge in a maintainable way, we point you towards our post on leveraging LLMs on your domain-specific knowledge base.
To those who didn’t have that remark, we say: “well then, on we go now, shall we”. But not before taking in some inspiring, introductory words from Billiam Bookworm.
In this post, we’ll walk you through an understanding of fine-tuning and empower you with a tool for making well-founded decisions. Of course, this tool is the one that has ruled all tools since way before the computer was invented: the flowchart. Maybe times aren’t a-changin’?
To understand what can and cannot be achieved through fine-tuning, we must on some level understand what this process actually refers to. What do we start from and how are we impacting the model? We warn you that this section may be a bit technical, but it is crucial for a good understanding of fine-tuning.
In summary, a large language model is built through three distinct steps:

1. Unsupervised pre-training
▹Data: low-quality (typically scraped internet data), ±1 trillion “words”
▹Process: optimised for text completion (predicting the next “word”)
▹Result: a behemoth so monstrous it makes Medusa turn to stone

2. Supervised Fine-Tuning (SFT)
▹Data: 10k-100k “words” of curated [prompt, response] examples
▹Process: the model is taught to behave based on input/output samples
▹Result: a behemoth with a somewhat acceptable face that you can stand to look at

3. Reinforcement Learning from Human Feedback (RLHF)
▹Data: 100k-1M comparisons [prompt, won_response, lost_response]
▹Process: the model is optimised to respond in the way that humans prefer
▹Result: a behemoth with a smiley face that you would want to go for drinks with
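To make those three data recipes concrete, here is a small sketch of what a single training record could look like at each stage. The field names and contents are purely illustrative, not any provider’s actual schema:

```python
# Illustrative single records per training stage (field names are hypothetical).

# 1. Unsupervised pre-training: raw text scraped from the internet.
pretraining_record = (
    "The quick brown fox jumps over the lazy dog. Foxes are omnivorous mammals..."
)

# 2. Supervised Fine-Tuning: a curated prompt plus the response we want the model to imitate.
sft_record = {
    "prompt": "Summarise the plot of Hamlet in one sentence.",
    "response": "A Danish prince, urged on by his father's ghost, seeks revenge on his uncle.",
}

# 3. RLHF: the same kind of prompt with two candidate responses, ranked by a human labeler.
rlhf_record = {
    "prompt": "Summarise the plot of Hamlet in one sentence.",
    "won_response": "A Danish prince, urged on by his father's ghost, seeks revenge on his uncle.",
    "lost_response": "Hamlet is a play by William Shakespeare.",
}
```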
The explanation above should give you a first grasp of what fine-tuning does and why it’s needed.
The example above shows how relying on unsupervised learning alone falls short. The model may have gained a lot of knowledge, but it doesn’t know how to wield it. For a model that merely predicts the next “word”, a question may well be the most likely continuation of a previous question: in its heaps of low-quality training data, it probably came across quite a few exams where one question simply follows another.
But fear not, for Supervised Fine-Tuning swoops in to save the day! After the model has gathered tons of knowledge from low-quality data, the SFT process aims to get its behaviour right. It does this by demonstrating example behaviour to the model and optimising it to replicate that behaviour. From this the model learns: “if I am asked a question, apparently I have to try and formulate an answer in response”.
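To make that concrete, below is a minimal, hypothetical SFT sketch using the Hugging Face transformers library. The base model, prompt format and hyperparameters are placeholders, not the recipe behind any particular production model:

```python
# Minimal SFT sketch: fine-tune a small causal LM on [prompt, response] pairs so it
# learns to answer a question rather than merely continue the text.
from datasets import Dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

model_name = "gpt2"  # stand-in for a real base model
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(model_name)

# Curated behaviour examples; a real SFT run would use 10k-100k of these.
pairs = [
    {"prompt": "What is the capital of France?",
     "response": "The capital of France is Paris."},
]

def to_features(example):
    # Concatenate prompt and response into one training sequence.
    text = f"Question: {example['prompt']}\nAnswer: {example['response']}{tokenizer.eos_token}"
    return tokenizer(text, truncation=True, max_length=512)

dataset = Dataset.from_list(pairs).map(to_features, remove_columns=["prompt", "response"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="sft-demo", num_train_epochs=3,
                           per_device_train_batch_size=2),
    train_dataset=dataset,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```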
The above is traditional supervised learning. What seems to really unleash the full potential of these models is the RLHF step. We won’t go into too much detail, but this process specifically guides the model to behave in ways that people have indicated they prefer (giving way to the name: reinforcement learning from human feedback). Note that to enable the gains from RLHF, a Reward Model first needs to be built that calculates reward scores for given responses, and that requires an extensive amount of labeling and engineering work.
Luckily, when it comes to impacting model behaviour, SFT is the crucial step: it exemplifies how we want the model to behave. RLHF then further refines that, because it’s easier for us humans to simply indicate which response we prefer than to spell out the ideal behaviour through examples ourselves.
Now, make no mistake. Preparing the necessary data to perform SFT is no easy feat. In fact, for the development of GPT-3, the team at OpenAI relied on the input of freelancers to provide labeled data (for both the SFT and RLHF process). Because they understood the importance of this task, they made sure to choose labelers who were well educated. This was shown in the results of a survey carried out by OpenAI as part of their paper on “Learning to summarise from human feedback”.
In the last section of this post we will zoom in on what you need to actually perform the fine-tuning process. But first, let us present you with a practical guide for deciding when to reach for SFT and when you can consider moving on without it.
As you may know, even closed-source models (where you have no actual access to the model parameters themselves) may allow you to fine-tune them through an API. In that regard, OpenAI released fine-tuning for GPT-3.5 on the 23rd of August. This further extends the possibilities for the general public to get into model fine-tuning, and it hypes up the question of when you should actually go for it (reportedly the most frequently asked question around that fine-tuning API).
Below we present the flowchart that should help you navigate the troubled waters of making well-founded LLM choices.
Note that we explicitly distinguish knowledge and behaviour.
On the behaviour side of things, in line with what Andrej Karpathy stated back in May, we would suggest the following approach to maturing your LLM use case:
Now imagine you chose a few-shot prompting approach and it works great, but you have a gigantic number of requests and things are becoming pricey. Then it may be interesting to host the LLM yourself and fine-tune it, to reduce the number of words pushed through the system with each call.
Or what if you have a task that is so simple that you can easily get the right behaviour through zero-shot prompting, but the cost per task is simply too high? Once again, self-hosting a much smaller, fine-tuned LLM may be the way to go. For more insights into that, we refer to our blog post on the emerging space of foundation models in general.
In the style of a classic schoolkid bragging throwdown, we will walk you through some actual examples to demonstrate the intuition.
1. “You know MY company wants to use a Large Language Model to send a personalised welcome message to each new employee “
For this one, simple few-shot learning with some examples of nicely styled welcome messages, combined with a template that loads that employee’s information, should do fine.
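As a rough illustration, such a few-shot prompt could be assembled as in the sketch below; the example messages and employee fields are entirely made up, and the assembled prompt would then be sent to whichever LLM you use:

```python
# Hypothetical few-shot prompt for personalised welcome messages.
# Example messages and employee fields are invented for illustration.
FEW_SHOT_EXAMPLES = """\
Employee: Anna, Data Engineer, starts Monday, loves bouldering.
Welcome message: Welcome aboard, Anna! The data team can't wait to hit the climbing gym with you.

Employee: Tom, Account Manager, starts in two weeks, coffee enthusiast.
Welcome message: Tom, great to have you join us! The espresso machine is already warming up.
"""

PROMPT_TEMPLATE = (
    "Write a short, warm welcome message in the same style as the examples.\n\n"
    "{examples}\n"
    "Employee: {name}, {role}, starts {start_date}, {fun_fact}.\n"
    "Welcome message:"
)

def build_prompt(name: str, role: str, start_date: str, fun_fact: str) -> str:
    """Fill the template with one new employee's details."""
    return PROMPT_TEMPLATE.format(
        examples=FEW_SHOT_EXAMPLES,
        name=name, role=role, start_date=start_date, fun_fact=fun_fact,
    )

print(build_prompt("Lena", "ML Engineer", "next Monday", "keen chess player"))
```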
2. “Oh yeah? Well MY company wants to use an LLM to produce a set of typical navigation messages in the style of Severus Snape.”
Depending on how much of a caricature this Snape🪄 guy actually is, you might get away with few-shot learning here. If his language style however is so creative that even 50 examples of Snape interactions won’t cut it, you might have to plunge into SFT with a more extensive dataset.
“Turn left. Do nót disappoint me”.
3. “Oh please, MY company wants to create a chatbot that sarcastically answers every general English question that a user asks ”
Modern LLMs understand language play to a sufficient extent for this chatbot to be built purely on zero-shot prompting.
4. “Wait till you hear this. MY company wants to create a chatbot that sarcastically answers évery East-Asian question that a user asks in the same language.”
Boy oh boy will you have a hard time gathering enough data in all of those languages to sufficiently capture how sarcasm is typically conveyed. If you manage to get that data together, the doors of supervised fine-tuning will open for you. Note however that if your model has no prior knowledge of those languages, good performance will still be out of reach, and you might have to wait for a gigantic dataset that allows some hero to leverage unsupervised learning and have a model get the hang of those more exotic languages.
5. “Is that all? MY company wants to leverage an LLM that, for a certain support ticket, automatically determines the support team to handle it. We have over 1000 support tickets every hour.”
Classification tasks (such as this routing one) have been around since the dawn of Machine Learning. Classically you would train a specific model for this, and perhaps that is still your cheapest option, but an LLM should certainly be more than capable as well. Depending on the complexity (range of question topics, number of support teams, input languages, …), we would expect this to work fine with a few-shot learning approach. Note however that, because of the high throughput mentioned here, looking into self-hosting might be worth it in terms of cost-efficiency.
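For intuition, such a routing prompt could look something like the sketch below; the team names and tickets are invented, and the model’s single-word reply is mapped back onto a known team:

```python
# Hypothetical few-shot routing prompt; team names and tickets are made up.
TEAMS = {"Billing", "Technical", "Account"}

ROUTING_PROMPT = """\
Classify each support ticket into exactly one team: Billing, Technical or Account.

Ticket: "I was charged twice for my subscription this month."
Team: Billing

Ticket: "The app crashes whenever I open the settings page."
Team: Technical

Ticket: "{ticket}"
Team:"""

def parse_team(llm_reply: str) -> str:
    """Map the model's raw reply onto a known team, falling back to manual triage."""
    team = llm_reply.strip().split()[0] if llm_reply.strip() else ""
    return team if team in TEAMS else "ManualTriage"

prompt = ROUTING_PROMPT.format(ticket="How do I change the email address on my profile?")
# `prompt` is sent to your (possibly self-hosted) LLM; here we just mimic its reply:
print(parse_team("Account"))
```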
6. “Hold my juice box, MY company wants to build a model that doesn’t even care about routing the question, it just answers the support question straight away!”
Aha, but to answer these questions you need knowledge, right? A RAG architecture (supplying the model ad hoc with the relevant information), combined with few-shot learning to ensure adequate behaviour, should be sufficient to enable this use case. Again, self-hosting deserves your consideration if there is high demand for this model.
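A minimal sketch of that RAG idea is shown below. TF-IDF retrieval and the tiny knowledge base are used purely for illustration; a real setup would likely use an embedding model and a vector store:

```python
# Minimal RAG sketch: retrieve the most relevant snippet, then stuff it into the prompt.
# TF-IDF stands in for a real embedding model + vector database.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

KNOWLEDGE_BASE = [
    "Refunds are processed within 5 business days after the return is received.",
    "Password resets can be triggered from the login page via 'Forgot password'.",
    "Enterprise plans include 24/7 phone support and a dedicated account manager.",
]

vectorizer = TfidfVectorizer().fit(KNOWLEDGE_BASE)
doc_vectors = vectorizer.transform(KNOWLEDGE_BASE)

def build_grounded_prompt(question: str) -> str:
    """Retrieve the best-matching snippet and build a prompt grounded in it."""
    scores = cosine_similarity(vectorizer.transform([question]), doc_vectors)[0]
    context = KNOWLEDGE_BASE[scores.argmax()]
    return (
        "Answer the support question using only the context below.\n"
        f"Context: {context}\n"
        f"Question: {question}\n"
        "Answer:"
    )

print(build_grounded_prompt("How long does a refund take?"))
# The assembled prompt is then sent to your (possibly self-hosted) LLM.
```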
This is definitely a pertinent question. If you decide that you do indeed need to fine-tune an LLM, you have another choice to make: fine-tuning an open-source model and hosting it yourself or, if available, using the fine-tuning APIs provided by closed-source model providers.
Fine-tuning an open source model (e.g. Meta’s Llama 2 or TII’s brand new gigantic Falcon 180B) and hosting it yourself offers you some great advantages that typically come with full ownership.
For some basic insights on how to approach the fine-tuning process itself (i.e. changing the model weights), we recommend the following summary on Parameter-Efficient Fine-Tuning.
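As a taste of what that looks like in practice, here is a minimal, hypothetical LoRA sketch using the Hugging Face peft library; the base model, target modules and hyperparameters are placeholders that you would tune for your own setup:

```python
# Minimal LoRA sketch with the `peft` library: only small adapter matrices are trained,
# while the base model's weights stay frozen. Model name and settings are placeholders.
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

base_model = AutoModelForCausalLM.from_pretrained("gpt2")  # stand-in for e.g. a Llama 2 checkpoint

lora_config = LoraConfig(
    r=8,                        # rank of the low-rank update matrices
    lora_alpha=16,              # scaling factor for the adapters
    target_modules=["c_attn"],  # which layers to adapt (model-specific)
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)

model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()  # typically well under 1% of all parameters
# From here, `model` can be trained with a standard SFT loop like the one sketched earlier.
```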
As mentioned, some closed source model providers offer an API for fine-tuning (e.g. OpenAI’s fine-tuning API for GPT-3.5).
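For illustration, the GPT-3.5 fine-tuning flow essentially boils down to uploading chat-formatted JSONL training examples and starting a fine-tuning job. The snippet below is a hedged sketch of that flow; the file name and example content are made up, and the exact API details may evolve:

```python
# Hedged sketch of the OpenAI fine-tuning flow (exact API details may change over time).
# Each line of the JSONL file is one training conversation.
import json
from openai import OpenAI

examples = [
    {"messages": [
        {"role": "system", "content": "You are a sarcastic support assistant."},
        {"role": "user", "content": "My package is late."},
        {"role": "assistant", "content": "Shocking. Let me check where it is hiding."},
    ]},
    # ... in practice: dozens to thousands of curated examples
]

with open("training_data.jsonl", "w") as f:
    for example in examples:
        f.write(json.dumps(example) + "\n")

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment
training_file = client.files.create(file=open("training_data.jsonl", "rb"), purpose="fine-tune")
job = client.fine_tuning.jobs.create(training_file=training_file.id, model="gpt-3.5-turbo")
print(job.id)  # poll this job until it finishes and returns the name of your fine-tuned model
```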
Whichever approach you choose, it should be obvious that what will directly impact the eventual performance of your fine-tuned model is the data used to perform the fine-tuning.
Given the importance of the quality of your training data, for now and for the future, your main focus should be on setting up high-quality, reusable data pre-processing components. Open-source initiatives such as Fondant aim to achieve exactly that: powerful and cost-efficient control of model performance through quality data. You can read more about that in this blog post on foundation models.
And then of course we also covered the flagship of this story:
Ultimately, we emphasised that when supervised fine-tuning is indeed the appropriate approach: you should put your eggs in the basket of quality data. That will enable valuable use cases for now ánd for the future.
For more flagship flowcharts and LLM news, stay tuned.
Or, even better, stay fine-tuned. 🥁