June 7, 2023
Developing AI Systems in the Foundation Model Age: From MLOps to FMOps
MLOps — bringing AI models to production
Developing a machine learning model is often the first step in creating an AI solution, but never the last. To make a model actionable, it needs to be served and its predictions delivered to an application. But what if many users suddenly send requests at the same time, or the system becomes unresponsive? And what if the data on which the model was trained is no longer representative of current real-world data, and the performance of the model starts to deteriorate? This is where Machine Learning Operations (MLOps) comes in: a combination of tools and processes to create and automatically maintain a robust and up-to-date AI system.
Typical phases in the machine learning model lifecycle are data ingestion, validation and preprocessing, followed by model training, validation and deployment. Each time a model is updated, the system needs to go through all of these steps. By automating the machine learning lifecycle, the process of creating and deploying models can be sped up, leading to faster innovation, lower development costs and improved quality (for a more detailed overview of the steps involved in MLOps, see this blogpost).
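To make the idea of an automated lifecycle concrete, here is a minimal sketch in which each phase is a plain function and the whole sequence reruns whenever the model needs updating. The stage names, the toy data and the deployment threshold are all assumptions made for illustration; a real setup would use an orchestration tool rather than a for-loop.

```python
from typing import Callable, Dict, List

# Each stage takes the pipeline state (a dict) and returns the updated state.
Stage = Callable[[Dict], Dict]

def ingest(state: Dict) -> Dict:
    # Hypothetical toy data: (feature, label) pairs; None marks a bad record.
    state["raw"] = [(1.0, 2.0), (2.0, 4.0), (None, 1.0), (3.0, 6.0)]
    return state

def validate_data(state: Dict) -> Dict:
    # Data validation: drop records with a missing feature.
    state["clean"] = [(x, y) for x, y in state["raw"] if x is not None]
    return state

def train(state: Dict) -> Dict:
    # Least-squares slope through the origin: w = sum(x*y) / sum(x*x).
    pairs = state["clean"]
    state["weight"] = sum(x * y for x, y in pairs) / sum(x * x for x, _ in pairs)
    return state

def validate_model(state: Dict) -> Dict:
    errors = [abs(state["weight"] * x - y) for x, y in state["clean"]]
    state["mean_abs_error"] = sum(errors) / len(errors)
    return state

def deploy(state: Dict) -> Dict:
    # Only "deploy" when the metric passes an assumed threshold.
    state["deployed"] = state["mean_abs_error"] < 0.1
    return state

PIPELINE: List[Stage] = [ingest, validate_data, train, validate_model, deploy]

def run_pipeline(stages: List[Stage]) -> Dict:
    state: Dict = {}
    for stage in stages:
        state = stage(state)
    return state

result = run_pipeline(PIPELINE)
print(result["weight"], result["deployed"])
```

Because every stage has the same signature, a stage can be swapped or inserted without touching the others, which is the property that makes automation practical.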
It is a well-kept secret that in real-world projects, MLOps often comes down to eighty to ninety percent ‘DataOps’ and only ten to twenty percent ‘ModelOps’. In the early days, ML engineers would create a model graph, prepare data and perform hyperparameter tuning manually. Transfer learning, starting from a pretrained model, removed the need to create custom model architectures, and MLOps automated hyperparameter tuning. What is left is data processing, and as machine learning projects began to tackle ever more complex real-world problems, it has only been gaining in importance.
From MLOps to FMOps
As we enter the foundation model (FM) age, MLOps is undergoing a profound change. Combining multiple task-specific models and business logic downstream is giving way to smarter data preparation upstream, fine-tuning and guidance of emergent FM behavior, and further postprocessing and chaining of FM outputs.
As a working definition of FMOps we propose the following:
FMOps refers to the operational capabilities required to manage data and align, deploy, optimize and monitor foundation models as part of an AI system.
In the next few subsections we go deeper into the different aspects of FMOps, working our way down through the diagram below starting with model alignment.
Model alignment
The main difference between traditional machine learning models and foundation models is their emergent behavior and their ability to be aligned to perform different downstream tasks. Hence it is unsurprising that this is where many changes from MLOps to FMOps are initiated. In order to get the most out of FMs, they need to be fine-tuned, their inference guided and their output processed and chained to steer their behavior. In the following sections we go deeper into these different steps.
Guidance
Guidance, of which prompting is probably the best-known manifestation, occurs at runtime (when generating) and aims to steer the task being performed by the model in a certain direction: e.g. respond to a question, generate an image based on a specific text input, segment a specific object in an image.
Guidance has been going through a boom lately, with the community engaging in ever more elaborate prompt engineering for image and text generation, prompting courses and even prompt marketplaces. MLOps and experiment tracking frameworks have followed suit and are providing tools for templating, versioning and testing of prompts. Using such tools can bring significant efficiency gains in zooming in on the best possible inputs. Whether human prompt engineering will remain a useful skill in the longer term remains to be seen: prompt design can be automated using search algorithms, or prompts can be replaced altogether by tuned numerical prefixes, which outperform prompt design (and search) in most cases (see figure).
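As a rough illustration of what templating and versioning of prompts can look like, the sketch below uses only the Python standard library. The template texts, the field names and the idea of using a content hash as version id are assumptions for this example, not the API of any particular tracking framework.

```python
import hashlib
from string import Template

def prompt_version(template_text: str) -> str:
    """Short content hash, so any edit to the template yields a new version id."""
    return hashlib.sha256(template_text.encode("utf-8")).hexdigest()[:8]

# Two hypothetical variants of the same prompt, to be compared in an experiment.
SUMMARIZE_V1 = "Summarize the following text in $n_sentences sentences:\n$text"
SUMMARIZE_V2 = "You are a concise editor. Summarize in $n_sentences sentences:\n$text"

def render(template_text: str, **fields: str) -> dict:
    """Fill in the template and keep the version id next to the final prompt."""
    prompt = Template(template_text).substitute(**fields)
    return {"version": prompt_version(template_text), "prompt": prompt}

rendered = render(
    SUMMARIZE_V1,
    n_sentences="2",
    text="MLOps automates the model lifecycle.",
)
print(rendered["version"])
print(rendered["prompt"])
```

Logging the version id together with each model output makes it possible to trace a result back to the exact prompt wording that produced it.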
Prompting is an easy and fun way of guiding AI models and it has brought about a fundamental democratization of the practice of ‘working with AI’. Moreover, model developers have been pushing the envelope on the number of words that can be entered as ‘context’ which makes it possible to provide entire books as input. It should be noted, however, that relying on prompting for system design has its drawbacks, such as the following:
- The amount of memory needed increases quadratically with input size, which limits the maximum input length
- Compute cost and inference time increase with prompt size
- Prompts can be model-specific and do not guarantee fixed results, especially not when models are updated
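The quadratic growth in the first point can be made tangible with some back-of-the-envelope arithmetic. The sketch below assumes standard self-attention that materializes a full score matrix of size n by n per attention head; the head count and the two bytes per value are illustrative defaults, and memory-efficient attention implementations avoid storing this matrix in full.

```python
def attention_scores_bytes(n_tokens: int, n_heads: int = 32, bytes_per_value: int = 2) -> int:
    """Memory for the attention score matrices of one layer.

    Every token attends to every other token, so each head holds
    n_tokens * n_tokens values.
    """
    return n_tokens * n_tokens * n_heads * bytes_per_value

for n in (1_000, 2_000, 4_000):
    mib = attention_scores_bytes(n) / 2**20
    print(f"{n:>5} tokens: {mib:,.0f} MiB per layer")

# Doubling the input length quadruples the memory for the score matrices.
ratio = attention_scores_bytes(2_000) / attention_scores_bytes(1_000)
print(ratio)
```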
Undoubtedly many more ways of guiding FMs will become available in the future. Until then it seems that other ways of alignment will continue to play an important role in steering FMs towards desired behavior.
Fine-tuning
Foundation models are trained on internet-size datasets with a simple objective such as predicting the next word. This makes them excellent generalists which can be further guided towards more specific types of text generation using, for example, prompt-based guidance. Because they are trained as generalists, however, they tend to be ‘Jacks of all trades, masters of none’. Fine-tuning can turn such a generalist into a specialist.
Another reason for fine-tuning can be that you need a model that understands specific jargon and/or has deep knowledge of a specific domain. Example target domains include finance, biomedicine, medicine or legal language. The graph below gives an idea of the performance of Med-PaLM, a specialized model, vs. GPT-3.5 (ChatGPT) at answering medical licensing examination questions.
Fine-tuning a foundation model narrows and deepens its capabilities at the same time. It also makes it possible to add knowledge or capabilities that are not present in public datasets, for example from proprietary data or commercially purchased research reports, and it can be used to feed a model the latest insights or concepts. Finally, fine-tuning is also often used to teach a model a specific style of generation, e.g. in writing, drawing or design.
Fine-tuning is typically performed in multiple steps using different techniques, parameters and auxiliary AI models. The precise choice of techniques depends directly on the use case and the available data. Question-answering models like ChatGPT, for example, are first fine-tuned in a supervised way on instruction datasets (highly curated question-answer pairs) and then in a reward-based way using Reinforcement Learning from Human Feedback (RLHF). A highly successful recent technique in image generation is ControlNet, whereby a copy of the model weights is additionally fine-tuned to learn from additional conditioning such as edges, lines, maps, poses and depth.
Finally, whereas fine-tuning used to be an expensive undertaking in terms of required compute, multiple highly efficient techniques have lately become available. These typically add small, easily fine-tunable parts to a model, such as low-rank weight matrices (LoRA), adapter modules or prefixes. This is often combined with quantization (using lower-precision numbers) in part of the training loop (e.g. backpropagation), which reduces memory requirements (e.g. QLoRA). This makes it possible to fine-tune even large multi-billion-parameter models for a few hundred dollars.
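To see why adding small weight matrices is so much cheaper than full fine-tuning, consider the parameter count for a single layer. LoRA keeps the original weight matrix of shape d_out by d_in frozen and trains two low-rank factors of shapes d_out by r and r by d_in. The layer size and rank below are illustrative assumptions, not values from any specific model.

```python
def full_finetune_params(d_in: int, d_out: int) -> int:
    """Trainable parameters when the whole weight matrix is updated."""
    return d_in * d_out

def lora_params(d_in: int, d_out: int, rank: int) -> int:
    """Trainable parameters for the two low-rank factors B (d_out x r) and A (r x d_in)."""
    return rank * (d_in + d_out)

# Assumed example: one 4096 x 4096 projection matrix, LoRA rank 8.
d_in, d_out, rank = 4096, 4096, 8
full = full_finetune_params(d_in, d_out)
lora = lora_params(d_in, d_out, rank)

print(f"full fine-tuning: {full:,} trainable parameters")
print(f"LoRA (rank {rank}): {lora:,} trainable parameters")
print(f"fraction trained: {lora / full:.4%}")
```

Under these assumptions, well under one percent of the layer's parameters are trained, which is where most of the savings in memory and compute come from.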
Post-processing
Once a model has generated a response, it can be further refined or filtered. In the past, it was common practice to combine model outputs with business logic and other model outputs to increase performance. With FMs you would ideally avoid this step altogether and make sure that the model has learned to perform properly in the first place. Unfortunately, this is not the case yet and filters can be useful with today’s models. These filters can aim to increase the quality of the actual output by comparing a number of generations, or they can remove generated content entirely, as in the case of not-safe-for-work (NSFW) filters. Overall, for now, post-processing seems to be a necessary evil which, hopefully, can one day be skipped altogether.
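Both kinds of filters can be sketched in a few lines: one removes unwanted generations, the other compares the remaining candidates and keeps the best one. The blocklist and the scoring heuristic (counting distinct words) are placeholders chosen for this example; in practice both would typically be specialized models.

```python
from typing import Callable, List, Optional

BLOCKLIST = {"forbidden"}  # hypothetical list of disallowed words

def words(text: str) -> set:
    return {w.strip(".,!?").lower() for w in text.split()}

def is_allowed(text: str) -> bool:
    """Content filter: reject any generation containing a blocked word."""
    return words(text).isdisjoint(BLOCKLIST)

def score(text: str) -> float:
    """Placeholder quality heuristic: number of distinct words."""
    return float(len(words(text)))

def postprocess(
    candidates: List[str], scorer: Callable[[str], float] = score
) -> Optional[str]:
    """Drop disallowed generations, then return the highest-scoring one (if any)."""
    allowed = [c for c in candidates if is_allowed(c)]
    return max(allowed, key=scorer) if allowed else None

generations = [
    "Short answer.",
    "A longer and more detailed answer.",
    "A forbidden but very long and very detailed answer indeed.",
]
best = postprocess(generations)
print(best)
```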
Chaining
Model inputs and outputs can be chained together and further combined with inputs from external systems and various data sources. This way of building software applications with FM actions as building blocks is usually referred to as chaining. It is a burgeoning field with a highly enthusiastic following. A popular framework for building applications with large language models is LangChain, which makes it possible to implement sequences of calls to different models and other systems and to perform logic and other transformations on them. One of the attractions of this approach is that you can build software agents which can independently perform searches and even actions on the internet (e.g. Auto-GPT). Some models such as ChatGPT provide plugins closely integrated with the model, which are easier to use but less flexible. While this is a highly exciting area, the number and variety of applications in production seem to be limited for now and largely restricted to Retrieval Augmented Generation (RAG), used to give chatbots additional memory of past conversations or to increase response quality and reduce hallucination. This is a fast-moving area, however, so further innovations are bound to appear rapidly.
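The basic pattern behind chaining can be shown without any framework: the output of one model call becomes the input of an external tool, whose result feeds the next model call. In the sketch below, `fake_llm` and `weather_tool` are hypothetical stand-ins that return canned values; this is not LangChain code, only an illustration of the control flow.

```python
from typing import Callable

def fake_llm(prompt: str) -> str:
    """Stand-in for a real model call; returns canned, deterministic text."""
    if prompt.startswith("Extract the city"):
        return "Paris"
    # Otherwise "answer" by echoing the last line of the prompt.
    return f"Answer based on: {prompt.splitlines()[-1]}"

def weather_tool(city: str) -> str:
    """Stand-in for an external system such as a weather API."""
    return {"Paris": "18 degrees and cloudy"}.get(city, "unknown")

def run_chain(question: str, llm: Callable[[str], str] = fake_llm) -> str:
    # Step 1: a model call turns free text into a structured value.
    city = llm(f"Extract the city from this question:\n{question}").strip()
    # Step 2: ordinary code calls an external system with that value.
    observation = weather_tool(city)
    # Step 3: a second model call combines the question and the tool result.
    return llm(f"Question: {question}\nWeather in {city}: {observation}")

answer = run_chain("What is the weather in Paris?")
print(answer)
```

Passing the model in as a parameter keeps the chain testable: the same function runs against a stub during development and a real model in production.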
Data management
As discussed earlier, system performance is dependent on the choice of base model, but even more so on the subsequent fine-tuning steps. Significantly, the performance of fine-tuning is directly related to the availability of high quality, domain- and task-specific data. Likewise, other alignment steps are increasingly becoming dynamic and data-driven rather than static prompt- or rule-based. Retrieval Augmented Generation (RAG), for example, which is a popular technique to guide models at runtime, is driven by a semantic search system that retrieves information relevant to the user’s query and adds it to the prompt that goes into the model. Fact-checking in post-processing and data retrieval calls when chaining equally rely on available data. Hence it goes without saying that data capabilities are key in successfully integrating FM-based systems in business practices. Knowledge through proprietary data will be one of the major differentiators in tomorrow’s FM-driven knowledge economy and will constitute the main moat for its players.
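The RAG flow described above, retrieve first and then add the result to the prompt, can be sketched as follows. Word overlap is used here as a deliberately crude stand-in for semantic search, and the documents and prompt wording are made up for the example.

```python
from typing import List

DOCUMENTS = [
    "Fondant is an open source initiative for composable data pipelines.",
    "LoRA adds small trainable weight matrices to a frozen model.",
    "RAG retrieves relevant text and adds it to the prompt.",
]

def tokenize(text: str) -> set:
    return {w.strip(".,?!").lower() for w in text.split()}

def retrieve(query: str, documents: List[str], k: int = 1) -> List[str]:
    """Rank documents by word overlap with the query (stand-in for semantic search)."""
    query_words = tokenize(query)
    ranked = sorted(
        documents, key=lambda d: len(query_words & tokenize(d)), reverse=True
    )
    return ranked[:k]

def build_prompt(query: str, documents: List[str], k: int = 1) -> str:
    """Add the retrieved passages to the prompt that goes into the model."""
    context = "\n".join(f"- {d}" for d in retrieve(query, documents, k))
    return f"Use the context to answer.\nContext:\n{context}\nQuestion: {query}"

question = "What does LoRA add to a model?"
prompt = build_prompt(question, DOCUMENTS)
print(prompt)
```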
When setting up one’s data infrastructure, there are a few things to keep in mind. First of all, FMs deal with unstructured data (text, images) as well as with structured data (tables). In the past, a data lake was often set up to serve as a basis for extracting structured data such as tables and dashboards. Foundation models can take in the raw, unstructured data directly, and they can also access dashboards and recent reports at runtime. Hence the big jump in capabilities to look forward to comes from combining the stores of data in reports and accounts that now lie unused in data vaults with existing lines of information.
A second aspect to consider is that FMs can already deal with several modalities of data (text, images, speech) and this will only continue to extend to video, 3D, hyperspectral imaging, geographical data and beyond.
Third, whereas FMs like their data unstructured, they also like it in numbers. When taking in data or when accessing semantic search, this is done via embeddings, numerical representations of data. Highly powerful and scalable vector embedding databases will be a central part of tomorrow’s FM data infrastructure.
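At its core, a vector database answers one question: which stored embeddings lie closest to a query embedding? The sketch below does this by brute force with cosine similarity. The three-dimensional vectors are hand-made for illustration; real embeddings come from an embedding model and have hundreds or thousands of dimensions, and real databases use approximate indexes instead of a full scan.

```python
import math
from typing import Dict, List

def cosine_similarity(a: List[float], b: List[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

# Hypothetical index: item description -> made-up embedding vector.
INDEX: Dict[str, List[float]] = {
    "invoice from supplier": [0.9, 0.1, 0.0],
    "photo of a cat": [0.0, 0.2, 0.9],
    "quarterly financial report": [0.8, 0.3, 0.1],
}

def nearest(query_vector: List[float], index: Dict[str, List[float]], k: int = 2) -> List[str]:
    """Return the k items whose embeddings are most similar to the query."""
    ranked = sorted(
        index,
        key=lambda item: cosine_similarity(query_vector, index[item]),
        reverse=True,
    )
    return ranked[:k]

query = [1.0, 0.2, 0.0]  # assumed embedding of a finance-related query
hits = nearest(query, INDEX)
print(hits)
```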
Fourth and finally, quality generally wins over quantity when it comes to data and both help when it comes to performance of FM-based systems. In order to get unstructured, multimodal data in the right shape it needs to be deduplicated, filtered, transformed in various ways, augmented, possibly anonymized, enriched and probably embedded. All of these steps require specialized techniques which may depend on the type of data, the fine-tuning regime and the use case. In fact, many of these steps themselves involve specialized AI models that perform specific tasks. Today, in many cases, these data pipelines are built from scratch for each new project. A probable next step will be that data preprocessing becomes standardized with shareable and reusable components being combined in composable data pipelines. That is why we started Fondant, an open source initiative that aims to achieve exactly this: enable easy, powerful and cost-efficient FM alignment through data.
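What composable pipelines with reusable components could look like is sketched below with three of the steps mentioned: deduplication, filtering and anonymization. This is a toy illustration in plain Python and not the Fondant API; the component names, the minimum word count and the email pattern are assumptions.

```python
import re
from functools import reduce
from typing import Callable, List

# A component maps a list of texts to a new list of texts.
Component = Callable[[List[str]], List[str]]

def deduplicate(texts: List[str]) -> List[str]:
    """Keep the first occurrence of each text, ignoring case and extra whitespace."""
    seen, unique = set(), []
    for text in texts:
        key = " ".join(text.lower().split())
        if key not in seen:
            seen.add(key)
            unique.append(text)
    return unique

def filter_short(min_words: int) -> Component:
    """Build a component that drops texts with fewer than min_words words."""
    return lambda texts: [t for t in texts if len(t.split()) >= min_words]

EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")

def anonymize(texts: List[str]) -> List[str]:
    """Replace email addresses with a placeholder token."""
    return [EMAIL.sub("<EMAIL>", t) for t in texts]

def compose(*components: Component) -> Component:
    """Chain components left to right into a single pipeline."""
    return lambda texts: reduce(lambda data, step: step(data), components, texts)

pipeline = compose(deduplicate, filter_short(3), anonymize)
cleaned = pipeline([
    "Contact me at jane@example.com today",
    "contact me at  jane@example.com TODAY",
    "Too short",
    "Foundation models like clean data",
])
print(cleaned)
```

Since every component shares one signature, the same `deduplicate` or `anonymize` step can be reused across projects and reordered without code changes.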
Deployment, optimization, monitoring
Roughly speaking, there are three ways of setting up an FM-based AI system: as a SaaS solution, on a PaaS or on IaaS.
As a rule of thumb, we generally advise opting for a lightweight SaaS solution for relatively simple or first-time use cases, as it is quick and easy to set up and it is still possible to move to a more custom setup later to reduce costs. For more custom, larger use cases, however, we would opt for more control and openness, as this is the best guarantee for optimal performance, flexibility and long-term cost control. Choosing between PaaS and IaaS is mainly a question of the technical expertise available in your company and the desire to avoid lock-in. IaaS can be cheaper to run and does not lock you into one cloud platform. PaaS is easier and cheaper to develop with, at slightly higher infrastructure costs. Regarding specialized SaaS services like AWS Bedrock and GCP Generative AI Studio, it is still very early days: if they manage to offer sufficient control (e.g. in terms of fine-tuning) at a good price point, they may also become a viable go-to option for custom use cases.
Conclusion
Foundation models have arrived and they are bound to change the landscape of AI systems for good. As they can be taught to perform complex knowledge tasks, they are also bound to change the way we interact with and think of machines. As the knowledge industry takes note and starts preparing to support its workers with FM-driven AI, companies should carefully consider their options and choose the most value-adding route. Currently, this is probably to invest in setting up the necessary data infrastructure to benefit optimally from upcoming developments. At the same time, it will be important to start experimenting with specialized models and systems, to update internal workflows and processes to accommodate them, and to generate the data needed to fine-tune the models of tomorrow towards unprecedented performance!