June 14, 2024

The landscape of LLM guardrails: intervention levels and techniques

Iris Luden

Machine Learning Engineer

An introduction to LLM guardrails

‍

The capacity of the latest Large Language Models (LLMs) to process and produce highly coherent human-like texts opens up a large potential to exploit LLMs for a wide variety of applications, like content creation for blogs and marketing, customer service chatbots, education and e-learning, medical assistance and legal support. Using LLM-based chatbots also has its risks — recently, many incidents have been reported, like chatbots permitting to buy a car for 1$, guaranteeing airplane ticket discounts, indulging in offensive language use, providing false information, or assisting with unethical user requests like how to build molotov cocktails. Especially when LLM-based applications are taken in production for public use, the importance of guardrails to ensure safety and reliability becomes even more critical.

When developing LLM-based applications, we need to strike a balance: on the one hand, we wish to leverage an LLM’s general linguistic skills, while on the other hand, we wish to restrict them to behave exactly and only as we prescribe them to behave in our specific application.

This blogpost provides a high-level overview of the techniques out there for building guardrails. What are the possible levels of intervention to regulate an LLM-based application? What are the techniques used to do so? We identify four techniques: (1) rule-based computation, (2) LLM based metrics, (3) LLM judges, and (4) prompt engineering and chain-of-thought.

Why we need guardrails

‍

LLMs owe their strength to mimic human-like language to the fact that they are trained on extremely large datasets and have undergone massive pre-training on an enormous amount of parameters. This strength also has an achilles heel: the wide-ranging general knowledge and versatality of LLMs are contradictory to domain specific tools.

This is why we need guardrails: techniques to monitor and control an LLM’s behavior to ensure outputs that are reliable, safe, accurate, and aligned with user expectations. The motives to implement guardrails can be summarized in four categories:

‍

Robustness and Security. Guardrails are of huge importance in the aspect of security and robustness of an LLM-based application. This includes vulnerabilities such as prompt injections, jail-breaking, data leakage, handling illegible or obfuscated content.
Information and Evidence. Guardrails can be put in place for fact checking, detecting hallucinations, making sure the response does not contain incorrect or misrepresented information, does not consider irrelevant information, the response is supported by evidence.
Ethics and Safety. Making sure the response does not lead to harmful consequences either directly or indirectly. A response should not exhibit problematic social biases or share protected or sensitive information, take into account copyright and plagiarism law, use 3rd party content by appropriately citing it, etc.
Tool-specific functionalities. Making sure the responses are on-topic, of proper extensiveness, written in the suitable tone, use specified terminology, etc.

‍

Technical viewpoint

‍

The basis of every conversational LLM-based application can be summarized as in Figure 1: a user sends a message, the application prompts the message to an LLM, the LLM generates a response message, which will be the bot’s output message.

Figure 1: Basic LLM-based application pipeline

To moderate the behavior of our application, guardrails can be implemented at each of these four “intervention levels”, which we call (1) input rails, (2) retrieval rails, (3) generation rails, and (4) output rails. At each intervention level, a guardrail can be programmed to determine whether this message should be accepted as it is, filtered (or modified) or rejected.

Figure 2: Basic LLM application with guardrails

Notice that at each stage in Figure 2, it is simply a message — a text string — that is passed on to the next component. These text strings can be manipulated to influence the dialogue. So in practice, guardrail techniques take a string as input, do some checks or string modifications, and determine the next steps (reject, filter, accept, regenerate). The approaches to do so can be categorized into four different techniques that can regulate the dialogue:

Rule-based computation
LLM based metrics (e.g. perplexity, embedding similarity)
LLM judges (e.g. fine-tuned models, zero-shot)
Prompt engineering and Chain-of-thought

‍

Approach 1: Rule-based string manipulation
‍

Suppose you want your application to meet some general requirements, like a maximum text size, defining some forbidden words, or filtering out confidential information. The simplest technique is to check a message using some rules or heuristics. This is commonly used in guardrails that aim to identify, filter or mask confidential information like phone numbers, email addresses or bank accounts. For instance, a guardrail could identify and mask phone number in a text message using a simple rule-based regex.replace(...) call. Rule-based string manipulation techniques can be simple as a string operation like .lower() ,regex.match(..), but could also use packages like NLTK.

Don’t be fooled — most of the guardrails included in toolkits like LLMGuard and GuardrailsAI are merely some simple computations. For instance, the ReadingTimerail from GuardrailsAI validates that a string can be read in less than a certain amount of time. The guardrail simply computes reading_time = len(value.split()) / 200 , that’s all!

Perks: Rule-based guardrails can be useful to check simple things; why use overly complex LLM prompts if a simple trick will do? They are deterministic, explainable and transparent.

Pitfalls: Language can be subtle and context-dependent, so rule-based approaches may surpass intelligent jailbreak attacks or malicious prompt injections. They are not robust to deviating input. For instance, if a rule-based guardrail blocks the presence of the word "Pasta", it won’t block "Spaghetti".

‍

Approach 2: LLM based metrics
‍

Guardrails can also leverage LLM ‘knowledge’ to determine whether a message should be accepted, filtered or rejected. This is commonly done by applying metrics on LLM embeddings or probabilities. Popular metrics are perplexity and semantic similarity.

Perplexity measures how well a model predicts a sequence of words in a given text. It can be thought of as a measure of uncertainty or “surprise” of an LLM when predicting a sequence. For instance, gibberish texts naturally have a high LLM perplexity. A gibberish-detection guardrail would reject an input string whenever the perplexity exceeds a certain threshold. Similarly, jailbreak and prompt injection attempts have on average high perplexity, as they are often incoherent messages. This phenomenon is for instance leveraged by NVIDIA NeMo’s jailbreak detection rail.

LLM embedding similarity metrics can be used to estimate how similar a message is to a target message. Suppose we want an LLM to refuse to respond to any user message that is related to “pasta”. Then we can use a guardrail that computes the semantic similarity score (e.g. cosine similarity, nearest neighbors) between the user message and a sentence on the target topic “pasta”. For instance, the sentence “I like spaghetti” will yield a higher semantic similarity score to sentences about pasta than the sentence “I like cars”. Our application could reject the input message whenever the score exceeds a specified threshold.

Embedding similarity is also commonly used to compute how much two texts align using a so-called “alignment score”. These are often applied when RAG systems are involved. For more information on RAG systems, see this blogpost for a general explanation of RAGs and this blogpost on real-world deployment insights.

Perks: LLM based methods utilize the linguistic patterns and associations the LLM has captured during its pretraining. They are more useful for guardrails that require semantic analysis than rule-based techniques.

Pitfalls: Heuristic methods may overlook sophisticated jailbreak attempts or malicious prompt injections. Perplexity only works with open-source models as they require access to a model’s raw output to obtain likelihoods.

‍

Approach 3: LLM judge

Many guardrails use LLMs as judges to determine whether a text is valid or not. Either we use zero-shot classifiers, or we use a specialized LLM that is already fine-tuned for the relevant task.

An example of an LLM judge is NVIDIA NeMo’s “self checking” method. This method prompts a generative LLM with a custom message to determine whether an input/output string should be accepted or not. For example, their custom self check input rail asks an LLM the following question:

If the LLM generates “No”, the message is accepted and will proceed in the pipeline. If the LLM generates “Yes”, the self check input rail will reject the message and the LLM will refuse to answer.

Many guardrails rely on zero-shot or few-shot classification. In zero-shot classification, an LLM is instructed to perform a classification task that it wasn’t trained on, without providing examples in the prompt. We provide the model with a prompt and a sequence of text that describes what we want our model to do, like in the example above. Single or few-shot classification tasks also include a single or a few examples of the selected task in the prompt.

Other examples of zero/few-shot classification guardrails are SensitiveTopics and Not Safe For Work Detection(NSFW) from GuardrailsAI, as well as the Ban Topics Scanner of LLMGuard. These guardrails reject a message whenever it covers a topics that is “forbidden” according to a pre-specified list of topics. These guardrails use a zero-shot classification LLM to predict the topics that a message covers, and will reject the message if one of the predicted topics is forbidden.

Fine-tuned LLM judge: In cases where few-shot prompting may fall short, a fine-tuned model could prove useful. An LLM judge is not necessarily a generative LLM. GitHub is full of numerous different LLMs fine-tuned on all kinds of tasks, so why not benefit from the capacities they were trained on? For instance, GuardrailsAI’s CorrectLanguage rail utilizes Meta’s facebook/nllb-200-distilled-600M translation model (available on Huggingface) to detect a message’s language, and translate the text from the detected language to the expected language. GuardrailsAI’s Toxic language and LLMGuards Toxicity Scanner utilize unity/unbiased-toxic-roberta to return a toxicity confidence level. Other examples of an LLM fine-tuned to predict the safety of messages in the light of jailbreak attempts are Llama Guard and RigorLLM.

Perks LLM judges can be a good alternative when rule-based approaches and metrics fall short because they fail to capture complex patterns and meanings. We can benefit from the general capacity of LLMs to interpret complex texts, and take advantage of capacities of fine-tuned LLMs.

Pitfalls Adding many LLM judge based guardrails requires deployment of multiple models. Multiple LLM calls cause latency and are also more expensive, considering that pricing is made per call. Importantly, LLM judging strongly relies on the capacity of the LLM to properly perform the task at hand. LLMs are still non-deterministic and might fail to follow the instructions.

‍

‍

Approach 4: Prompt engineering and chain-of-thought
‍

The previous discussed approaches mainly concern checking or filtering a message on certain conditions before it is passed on to the next stage of the application pipeline. However, we might also want to influence the way that an LLM responds such that the generated output is in the desired format. This is especially relevant when a dialogue requires more strict supervision, for instance when a sensitive topic is discussed that requires careful conversation guidelines.

Prompt engineering is the process of crafting and refining input prompts to elicit relevant and accurate outputs from a language model. Designing prompts can optimize an LLM’s performance for specific tasks or applications.

Suppose we want to avoid off-topic responses, then we can also directly provide the generative LLM itself with instructions> This is done by concatenating additional information to the user’s message , such that "Hi! How do I make a pizza?” will be modified to:

This way we can manipulate the dialogue between a user and a generative LLM through our application. The example above showcases simple, general instructions, but could also include a sample conversation, specify a conversational style (formal/informal), etc.

Chain-of-thought (CoT) prompt engineering is a method to structure prompts in a step-by-step manner to guide a language model through a complex reasoning process. By breaking down an instruction into smaller, logical steps, this way of prompting encourages the LLM to generate output of the same step-by-step structure. This can improve an LLM’s ability to produce coherent and accurate responses. In the example figure below, taken from Wei et al. (2022), this CoT prompt first incudes an example of a question and answer, where the bot answer includes the reasoning step. When generating the response to the next question, the generative LLM will adapt this way of speaking and will likely output a response that includes the reasoning step as well.

‍

Figure 3: Chain-of-thought prompting (Wei et al., 2022)

A slightly different approach to chain-of-thought prompt engineering is to iteratively make LLM calls. Instead of making a single few-shot learning prompt like in the example above, we can split up the “response task” into multiple elements. NVIDIA’s NeMo Guardrails is a toolkit that uses this approach. Instead of generating a response directly from an input prompt, they split up the generation phase in multiple stages: (1)

generate_user_intent, (2)generate_next_steps, and (3)generate_bot_message. Step (1) prompts the generative LLM with the instruction to choose a user intent from a list (like “greet”, “ask about...”). Step (2) instructs to generate the possible next steps of the conversation, and step (3) instructs to generate the final response. Such an approach allows you to guide the generative LLM step-by step towards a fitting response. However, the downside is that each response requires more LLM calls, making it more costly and latent.

‍

Perks Dialogue rails allow for more complex, custom instructions. If coded intelligently, they can help your application return more suitable answers rather than "I'm sorry, I cannot respond" all the time.

Pitfalls As LLMs are non-deterministic and prompt instructions remain susceptive to jailbreak and prompt injection attacks. In case of splitting up the generation into separate steps: making multiple LLM calls for each message, which causes latency and deployment costs. Similar to LLM judges, ehe effectivity of this technique is strongly dependent on an LLMs capacity to accurately interpret the instructions. It can become hazardous when instructions are lengthy, making them unintelligible or prone to ambiguity.

‍

Choosing your Generative LLM

‍

Apart from guardrails, the LLM you choose to generate responses can strongly impact the performance of your application. There are many LLMs available, like the GPT, Llama, T5, Gemini, Llama, BERT, RoBERTa, and their variants. Many LLMs are already instruction-tuned, like Flan-T5, GPT3.5-instruct and Llama-3–8B-instruct. These models are fine-tuned for instruction-based conversations, and can adjust to many domains. There are also many LLMs available that are already fine-tuned on specific domain data, like BERTbio, BERTlegal, finBERT (on financial literature) or PubChemAI (on chemical literature). Choosing the generative LLM of the adequate domain and task can improve an application.

‍

Discussion

‍

In this blog post, we explored the landscape of LLM guardrails, focusing the techniques used to mitigate risks associated with the deployment of large language models. From rule-based computation to LLM-based metrics and prompt engineering, we discussed various approaches to regulate LLM behavior and ensure safe and reliable operation across different applications.

One of the challenges in implementing guardrails lies in finding the right balance between leveraging the capabilities of LLMs and mitigating their inherent risks. While rule-based techniques offer simplicity and transparency, they may fall short in handling complex language nuances and sophisticated attacks. On the other hand, LLM-based metrics and judges utilize deeper semantic meaning of text but require careful calibration and validation to ensure accuracy and effectiveness. Prompt engineering and chain-of-thought offer opportunities to improve the relevancy and coherency of responses, but still remain prone to sophisticated attacks.

Of course, the effectiveness and reliability of a guardrail is strongly dependent on how you implement it: an LLM prompt won’t work if the instructions are unclear, an LLM judge won’t work if you ask an inconclusive question, and LLM based metrics don’t work if your thresholds are too lenient. So how should you begin implementing guardrails? By understanding the possible techniques and their perks & pitfalls, we can outline some initial steps to tackle the issue:

Identify the risks of your application: Data leakage? Privacy? Factuality? Profound language? And prioritize. Define some best-case and worst case scenario’s for your application.
As each guardrail technique has it’s pitfalls, don’t rely on guardrails if you don’t need to; remove sensitive, false, or biased data from your application’s knowledge base in the first place if possible.
Design prompts with clear instructions and avoid unintelligible or ambiguous instructions.
Avoid making too many LLM calls by carefully selecting the “LLM judges” you use. Don’t just use all of them just for the sake of it.
Use simple deterministic solutions where you can — don’t rely on an LLM to interpret all instructions, they can be wrong.
Strong input rails can help fishing out input prompts that would evoke unnecessary LLM generations. This reduces costs and make your model faster.

‍