Brent De Hauwere
Machine Learning Engineer
Language models like ChatGPT, GPT-4, and PaLM have rapidly gained immense popularity and captured everyone's attention. Never before has an application attracted so many users in such a short time span. Yet many businesses are left wondering how to effectively harness language models to accelerate their operations.
As a Google trusted tester, ML6 has had the privilege of closely witnessing the innovation and continuous stream of updates that Google has been unveiling since its I/O conference in May. This first-hand involvement has enabled us to gain early insights and practical experience with the forthcoming advancements in generative AI.
In this article, we will explore a selection of products Google has released to facilitate the development of generative AI-fueled applications, and employ some of these products to assemble an LLM chatbot which can be customised to encapsulate your private or proprietary data. Furthermore, we will examine strategies for harnessing generative AI to generate value for your business and evaluate the competitiveness of Google's pricing.
Generative AI is an umbrella term for algorithms that can be employed to create new content (of any data type). An LLM, or Large Language Model, like PaLM is a generative AI model that works with human language. From an immense corpus of general training data, it has learned to read as well as produce new, grammatically correct text. Google has now released its latest LLM, PaLM 2 (Pathways Language Model), which is highly proficient in advanced reasoning, coding, and mathematics.
PaLM 2 will be available in various sizes, which enables effortless deployment across a broad spectrum of use cases. Specifically, Google will release four sizes, from smallest to largest: Gecko, Otter, Bison and Unicorn. At present, a handful of Gecko and Bison models have been released. Gecko is lightweight enough to operate on mobile devices, while Bison is positioned as the optimal choice in terms of capability and cost-effectiveness.
Recently, Google also released an entire product suite to support the needs of generative AI-centric enterprise development. We've already done the research and written a complete review as part of a previous blog post. In this article, we will zoom in on the Vertex Model Garden: a single environment to search, discover, and interact with Google's own foundation models (FMs) and, in time, hundreds of open-source and third-party models. The service enables you to use FMs directly with pre-built templates, tune FMs with data and prompts for a targeted industry, and customise popular open-source models; it also provides API access for task-specific solutions.
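To give a feel for that API access, here is a minimal sketch using the Vertex AI Python SDK. The project ID and region are placeholders, and depending on your SDK version the language models may still live under vertexai.preview.language_models.

```python
# Minimal sketch: calling a PaLM 2 foundation model through the
# Vertex AI Python SDK (pip install google-cloud-aiplatform).
# The project ID and region below are placeholders.
import vertexai
from vertexai.language_models import TextGenerationModel

vertexai.init(project="your-gcp-project", location="us-central1")

# Load the Bison-sized PaLM 2 text model from the Model Garden.
model = TextGenerationModel.from_pretrained("text-bison@001")

response = model.predict(
    "Summarise the benefits of retrieval-augmented generation in two sentences.",
    temperature=0.2,
    max_output_tokens=256,
)
print(response.text)
```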
When solving a business challenge, the language model should be factual, which becomes particularly significant in client-facing contexts, where the reliability of the model's responses is paramount. The model should also have the ability to absorb proprietary data and leverage relevant excerpts to formulate replies.
There are two principal methods to utilise an LLM for a specific domain: finetuning and retrieval-augmented generation. As always, both options have their pros and cons; be sure to read my colleague's blog post for a detailed comparison. In short, finetuning involves further training a general model on domain-specific data. Unfortunately, this implies that whenever new data arrives, you would have to finetune the model again, which can be costly and inefficient.
Retrieval-augmented generation (RAG) to the rescue! The second method, RAG, comprises two steps. First, a retrieval component fetches relevant information, documents, or passages from a knowledge base. This information is then passed to the language model along with the prompt to generate a reply. This configuration significantly reduces the chance of hallucinations and eliminates the need for computationally intensive retraining (although periodic updates to the knowledge base are required). In fact, this is precisely how we built a chatbot with domain-specific knowledge using Google's newest toolset!
This is the architectural diagram of our implementation. Passages are preprocessed with textembedding-gecko@001 to create a vector embedding for each passage—essential to perform vector searches to find relevant passages, given a query. The retrieval component, Elasticsearch, is a vector database, which stores passages and their respective embeddings. The end user communicates through a chat interface developed with Flask. Once a question is posed, the query is embedded and a vector search is performed to retrieve relevant documents. These documents, together with the original query, are conveyed to PaLM 2, which subsequently formulates a response. This reply is then displayed in the chat interface. A sneak peek of the result is shown below.
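To make the query path concrete, here is a condensed sketch of the steps above. It assumes an Elasticsearch 8.x cluster whose index (named "passages" here purely for illustration) already stores each passage together with its gecko embedding; the ingestion side is sketched further below. The Flask chat interface would simply call answer() and render the reply.

```python
# Condensed sketch of the query path: embed the question, retrieve relevant
# passages via vector search, and let PaLM 2 formulate a grounded reply.
# Index and field names are illustrative, not prescriptive.
import vertexai
from vertexai.language_models import TextEmbeddingModel, TextGenerationModel
from elasticsearch import Elasticsearch

vertexai.init(project="your-gcp-project", location="us-central1")
embedding_model = TextEmbeddingModel.from_pretrained("textembedding-gecko@001")
llm = TextGenerationModel.from_pretrained("text-bison@001")
es = Elasticsearch("http://localhost:9200")

def answer(query: str) -> str:
    # 1. Embed the user query (gecko@001 returns a 768-dimensional vector).
    query_vector = embedding_model.get_embeddings([query])[0].values

    # 2. Vector search: fetch the most relevant passages from the knowledge base.
    hits = es.search(
        index="passages",
        knn={
            "field": "embedding",
            "query_vector": query_vector,
            "k": 4,
            "num_candidates": 50,
        },
    )["hits"]["hits"]
    context = "\n\n".join(hit["_source"]["text"] for hit in hits)

    # 3. Convey the retrieved passages together with the original query to PaLM 2.
    prompt = (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"
    )
    return llm.predict(prompt, temperature=0.2, max_output_tokens=512).text
```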
Moving forward, we're excited to explore Google's Enterprise Search for the storage, retrieval and embedding of passages, thus replacing the entire Elasticsearch component. Enterprise Search is a specialised search solution that helps organisations efficiently find and retrieve relevant information across multiple internal repositories and systems. However, at present, the service is limited to allowlisted customers, and the development of a Python SDK is still in progress.
As a domain-specific knowledge base for the demo, we used MS MARCO, a common benchmark dataset containing a wide variety of search engine questions with human-generated answers. However, the strength of this setup is that, in a matter of hours (proportional to the dataset size), this dataset can easily be replaced by any proprietary knowledge such as policies, research, and customer interactions, so that a chatbot could provide always-on, deep technical support. Moreover, the chatbot can supply references to the document(s) it used to formulate a reply (see bot responses in demo).
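For illustration, swapping in your own knowledge base could look like the sketch below. The index and field names are again our own choices; the dense_vector mapping is what enables the kNN search used earlier (gecko@001 produces 768-dimensional embeddings), and the source field is what lets the bot cite its documents.

```python
# Sketch of ingesting a proprietary knowledge base, under the same
# assumptions as above: index and field names are illustrative.
import vertexai
from vertexai.language_models import TextEmbeddingModel
from elasticsearch import Elasticsearch

vertexai.init(project="your-gcp-project", location="us-central1")
embedding_model = TextEmbeddingModel.from_pretrained("textembedding-gecko@001")
es = Elasticsearch("http://localhost:9200")

# A dense_vector mapping enables kNN search in Elasticsearch 8.x.
es.indices.create(
    index="passages",
    mappings={
        "properties": {
            "text": {"type": "text"},
            "source": {"type": "keyword"},  # document reference for citations
            "embedding": {
                "type": "dense_vector",
                "dims": 768,  # dimensionality of gecko@001 embeddings
                "index": True,
                "similarity": "cosine",
            },
        }
    },
)

# Replace with your own policies, research, or support passages.
documents = [
    {"text": "Your proprietary passage goes here.", "source": "doc-001"},
]
for doc in documents:
    doc["embedding"] = embedding_model.get_embeddings([doc["text"]])[0].values
    es.index(index="passages", document=doc)
```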
Generative AI can accelerate many facets of your business. One ubiquitous aspect is customer operations. Generative AI can be employed to give prompt and personalised answers to complex customer inquiries regardless of their language or location. It can also help customer service agents address questions and resolve issues during an initial interaction by instantaneously retrieving customer-specific information. Additionally, it can increase sales by suggesting products and deals tailored to client preferences. McKinsey estimates that applying generative AI to customer care functions could boost productivity by 30 to 45 percent of current function costs [1].
Billing for these services is determined directly by the amount of input received and the amount of output generated. However, providers measure volume differently: Google counts characters, while OpenAI counts tokens.
As outlined in the OpenAI pricing documentation, a token is a measurement unit that roughly corresponds to 4 characters. A character refers to a single letter, number, or symbol. For instance, the word “cloud” consists of five characters and may slightly exceed the size of a single token.
Let's briefly compare the pricing of two models of the same category! To streamline the comparison, we have adopted characters as the unit of measurement. We can observe that PaLM 2 Text Bison is notably more affordable than GPT-4: input is ~7.5 times less expensive, and output even ~15 times!
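For the curious, the back-of-the-envelope arithmetic behind those ratios looks as follows, assuming the list prices at the time of writing (GPT-4 8K context at $0.03/$0.06 per 1K input/output tokens, Text Bison at $0.001 per 1K characters) and the ~4 characters per token rule of thumb from above.

```python
# Back-of-the-envelope price comparison. Prices are the published list
# prices at the time of writing (assumption); 4 chars/token is OpenAI's
# own rule of thumb.
CHARS_PER_TOKEN = 4

gpt4_input_per_1k_chars = 0.03 / CHARS_PER_TOKEN    # $0.0075 per 1K characters
gpt4_output_per_1k_chars = 0.06 / CHARS_PER_TOKEN   # $0.0150 per 1K characters
bison_per_1k_chars = 0.001                          # input and output

print(gpt4_input_per_1k_chars / bison_per_1k_chars)   # ~7.5x cheaper input
print(gpt4_output_per_1k_chars / bison_per_1k_chars)  # ~15x cheaper output
```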
When comparing OpenAI's text-embedding-ada-002 and Google's textembedding-gecko@001, we find that Google is ~4 times more expensive than OpenAI: ~$0.000025 per 1K characters for OpenAI versus $0.0001 for Google.
Not all models and sizes have been launched yet, but it's evident that Google is releasing at a remarkably high pace! Because of the incredible convenience and usability of the PaLM API, you can interact with an LLM, or any model for that matter, in just a few lines of code (as sketched earlier). Aside from offering a rich variety of products, Google has achieved very competitive pricing compared to other providers! Will they retain this trend for all model sizes? What about their Unicorn variations?
At present, PaLM 2 is trained on multilingual text and passes advanced language proficiency exams at the mastery level for Chinese, Japanese, French, Spanish and Italian. Looking ahead, we eagerly anticipate the expansion of support for Dutch in the future!
Since May, US-based developers have been able to sign up to use the PaLM 2 model; alternatively, customers globally can use the model in Vertex AI with enterprise-grade privacy, security and governance. Given the stellar performance of the currently available models, we can't wait to see what's next. What about you? Are you keeping up with the generative AI wave?