Guide to LLM Guardrails

A comprehensive guide to understanding guardrails for LLM-powered applications.

What Are LLM Guardrails?

LLM guardrails are mechanisms designed to control and shape the behavior of large language models (LLMs). They act as safeguards, preventing LLMs from generating harmful, biased, or inappropriate content. Guardrails can be technical, operational, or a combination of both, and they play a crucial role in ensuring the responsible and ethical deployment of LLMs.

How LLM Guardrails Work

A guardrail is a step between a user's input and the LLM processing it, as well as a step between an LLM's output and the user receiving it. It acts as a filter that ensures both the input and the output abide by the company's policies. Different applications and companies have different requirements and tolerance thresholds for each criterion.

For some, the guardrail mainly exists to prevent toxic and offensive language from being used by both the LLM and the user. For others, it must ensure no personally identifiable information (PII) is sent back to the user.
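
In code, this flow can be sketched as a thin wrapper around the model call. The sketch below is illustrative only: `call_llm`, `check_input`, and `check_output` are hypothetical placeholders you would replace with your actual model client and policy checks.

```python
# Minimal sketch of a guardrail sitting between the user and the LLM.
# call_llm, check_input, and check_output are hypothetical placeholders.

BLOCKED_MESSAGE = "Sorry, I can't help with that request."

def check_input(user_message: str) -> bool:
    """Return True if the user's message passes the input policy."""
    return "ignore previous instructions" not in user_message.lower()

def check_output(llm_response: str) -> bool:
    """Return True if the model's response passes the output policy."""
    return "credit card number" not in llm_response.lower()

def call_llm(user_message: str) -> str:
    """Stand-in for your actual model client."""
    return f"Echo: {user_message}"

def guarded_completion(user_message: str) -> str:
    # Guardrail step 1: screen the input before it reaches the LLM.
    if not check_input(user_message):
        return BLOCKED_MESSAGE
    response = call_llm(user_message)
    # Guardrail step 2: screen the output before it reaches the user.
    if not check_output(response):
        return BLOCKED_MESSAGE
    return response
```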

Technical vs Operational

Technical guardrails are implemented directly within the LLM or its surrounding infrastructure (e.g., your application's codebase). These guardrails use algorithms, filters, and other technical mechanisms, such as external APIs, to control LLM behavior. Examples of technical guardrails include:

  • Input filtering: Preventing harmful or inappropriate prompts from reaching the LLM.

  • Output sanitization: Removing harmful or offensive content generated by the LLM.

  • Data privacy measures: Protecting user data and preventing unauthorized access.

Operational guardrails are policies, procedures, and guidelines that govern the use and deployment of LLMs. They involve human oversight and intervention to ensure responsible AI practices. Examples of operational guardrails include:

  • Usage policies: Establishing clear rules and guidelines for LLM use.

  • Human oversight: Assigning human operators to monitor LLM behavior and intervene when necessary.

  • Incident response plans: Preparing for and responding to LLM-related incidents.

  • Ethical guidelines: Ensuring that LLM development and deployment align with ethical principles.

Input, Output, Context & RAG

Guardrails are often used to prevent harmful or inappropriate prompts from reaching the LLM and remove toxic or offensive content generated by the LLM. Filtering techniques include keyword blacklisting, regular expressions, machine learning-based approaches, and third-party services (e.g., Google Text Moderation API).
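
As a minimal sketch of the two simplest techniques, keyword blacklisting and regular expressions (the keywords and patterns below are purely illustrative, and real deployments typically combine them with ML classifiers or a moderation API):

```python
import re

# Illustrative blocklist and patterns only; real lists are far larger
# and are usually complemented by ML classifiers or a moderation API.
BLOCKED_KEYWORDS = {"some_slur", "another_slur"}
BLOCKED_PATTERNS = [
    re.compile(r"\bhow\s+to\s+build\s+a\s+bomb\b", re.IGNORECASE),
]

def passes_keyword_filter(text: str) -> bool:
    """Return True if the text contains no blocked keywords or patterns."""
    lowered = text.lower()
    if any(keyword in lowered for keyword in BLOCKED_KEYWORDS):
        return False
    return not any(pattern.search(text) for pattern in BLOCKED_PATTERNS)
```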

In many LLM use cases, retrieval-augmented generation (RAG) gives the LLM access to external knowledge. In this light, RAG documents must also be treated as untrusted content that needs to be safeguarded, especially when they are user-generated (e.g., emails or support tickets).
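
Under the assumption that retrieved documents should pass the same checks as user input, a sketch might filter retrieval results before they are added to the prompt; `retrieve` and `passes_guardrail` are hypothetical hooks into your own retriever and guardrail:

```python
def safe_context(query: str, retrieve, passes_guardrail, top_k: int = 5) -> list[str]:
    """Retrieve documents for a RAG prompt, dropping any that fail the guardrail."""
    documents = retrieve(query, top_k=top_k)  # e.g. a vector-store lookup
    # Treat retrieved documents (emails, tickets, wiki pages...) as untrusted
    # input: run them through the same checks as user messages.
    return [doc for doc in documents if passes_guardrail(doc)]
```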

Synchronous vs. Asynchronous

Synchronous guardrails operate in real time: the pipeline waits for the guardrail to complete its analysis before passing the input to the LLM or returning the output to the user. This guarantees that content is checked before it is delivered, but it adds latency to every request.

Asynchronous guardrails operate in the background, allowing the LLM to continue processing inputs while the guardrail analyzes them. This can reduce latency and improve overall performance. However, asynchronous guardrails may not be suitable for customer-facing applications, as you need to filter harmful content before it reaches the user or the LLM.

The choice between synchronous and asynchronous guardrails depends on the application's specific requirements. For example, synchronous guardrails may be necessary if harmful content must be blocked before it ever reaches the user. Asynchronous guardrails may be a better choice if the application prioritizes response time and can tolerate detecting issues after the fact.
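
The difference can be sketched with asyncio: the synchronous variant waits for the check before answering, while the asynchronous variant answers immediately and records violations in the background. `call_llm` and `run_guardrail` are hypothetical stand-ins.

```python
import asyncio

# Keep references to background tasks so they are not garbage-collected early.
_background_checks: set[asyncio.Task] = set()

async def call_llm(prompt: str) -> str:
    return f"Echo: {prompt}"  # stand-in for the real model call

async def run_guardrail(text: str) -> bool:
    await asyncio.sleep(0.2)  # simulate the guardrail's own latency
    return "forbidden" not in text.lower()

async def synchronous_flow(prompt: str) -> str:
    # Blocks the response until the guardrail has finished its analysis.
    if not await run_guardrail(prompt):
        return "Request blocked."
    return await call_llm(prompt)

def _record_check(task: asyncio.Task) -> None:
    _background_checks.discard(task)
    if not task.result():
        print("guardrail violation logged for later review")

async def asynchronous_flow(prompt: str) -> str:
    # Respond immediately; the check runs in the background for monitoring only.
    task = asyncio.create_task(run_guardrail(prompt))
    _background_checks.add(task)
    task.add_done_callback(_record_check)
    return await call_llm(prompt)
```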

Security & Safety

LLMs face a variety of security threats, including:

  • Data breaches: Unauthorized access to sensitive data used to train or operate the LLM.
  • Model poisoning: Manipulation of the LLM's training data by malicious actors to introduce biases or vulnerabilities.
  • Adversarial attacks: Deliberate attempts to manipulate the LLM's behavior through carefully crafted inputs.

Guardrails can help mitigate these risks by:

  • Protecting data: Implementing robust data security measures to prevent unauthorized access and data breaches.
  • Detecting and mitigating model poisoning: Using techniques like anomaly detection to identify and correct poisoned training data.
  • Defending against adversarial attacks: Developing techniques to detect and counter adversarial inputs, such as adversarial training or input validation.
  • Enforcing usage policies: Limiting LLM access to sensitive data and preventing unauthorized or malicious use.
  • Monitoring and auditing: Continuously monitoring LLM behavior to identify and address security threats.

By implementing these security measures, companies can ensure LLMs' safe and responsible deployment.

Filtering, Sanitization & Validation

Filtering involves preventing harmful or inappropriate inputs from reaching the LLM. This can be achieved using techniques such as keyword blacklisting, regular expressions, or machine learning-based classifiers.

Sanitization involves removing harmful or offensive content generated by the LLM. This task is challenging, especially when dealing with complex or ambiguous language. For example, it can be difficult to distinguish between sarcastic or ironic statements and genuine expressions of hate speech.

Validation ensures the LLM's outputs are accurate and consistent with the input prompts. This can be achieved using techniques such as fact-checking and consistency checks.

Small and medium engineering teams generally prefer to rely on a third-party API, depending on what they need to filter out or sanitize. It carries far less engineering overhead, and such services tend to be inexpensive.
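
As a sketch of what calling such a service looks like (the endpoint, authentication scheme, and response shape below are hypothetical; substitute your provider's actual API):

```python
import requests

# Hypothetical endpoint and response shape; substitute your provider's real API.
MODERATION_URL = "https://api.example-moderation.com/v1/check"
API_KEY = "YOUR_API_KEY"

def is_flagged(text: str, timeout: float = 2.0) -> bool:
    """Ask an external moderation service whether the text violates policy."""
    response = requests.post(
        MODERATION_URL,
        headers={"Authorization": f"Bearer {API_KEY}"},
        json={"text": text},
        timeout=timeout,
    )
    response.raise_for_status()
    # Assumed response shape: {"flagged": true/false, "categories": {...}}
    return response.json().get("flagged", False)
```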

Scorers & Metrics

Scorers and metrics are quantitative guardrails that evaluate LLM outputs against specific criteria. They produce numerical values or labels that can be used to monitor LLM performance and to trigger alerts or failures when certain thresholds are crossed.

Typical scorers and metrics include:

  • Toxicity scores: Measure the toxicity or hate speech level in LLM outputs.
  • Bias scores: Assess the presence of biases in LLM outputs, such as gender, racial, or age bias.
  • Factuality scores: Evaluate the accuracy of LLM outputs.
  • Relevance scores: Measure how relevant LLM outputs are to the input prompts.
  • Creativity scores: Assess the creativity and originality of LLM outputs.
  • Helpfulness scores: Evaluate the usefulness and informativeness of LLM outputs.
  • Custom scores: Grade the payload against your own evaluation(s).

These scores and metrics can be calculated using various techniques, including:

  • Machine learning models: Training models to predict the scores or labels based on LLM outputs.
  • Rule-based systems: Defining rules and criteria for calculating scores or labels.
  • Human evaluation: Having human experts rate LLM outputs.

Once calculated, scores and metrics can be used to:

  • Monitor LLM performance: Track changes in LLM quality over time.
  • Trigger alerts: Notify administrators of potential issues or violations.
  • Improve LLM models: Identify areas where the LLM can be improved.
  • Enforce compliance: Ensure that LLMs adhere to relevant regulations and guidelines.
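
A minimal sketch of a threshold-based check, where `toxicity_score` stands in for whatever model, rule set, or API actually produces the number:

```python
THRESHOLDS = {"toxicity": 0.8}  # illustrative; tune to your own tolerance levels

def toxicity_score(text: str) -> float:
    """Stand-in for an ML model, rule-based system, or scoring API."""
    return 0.05

def evaluate_output(text: str) -> dict:
    scores = {"toxicity": toxicity_score(text)}
    violations = [name for name, value in scores.items() if value > THRESHOLDS[name]]
    # Violations can trigger alerts, block the response, or simply be logged.
    return {"scores": scores, "violations": violations, "passed": not violations}
```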

Companies can gain valuable insights into LLM performance and proactively address any issues by using scorers and metrics.

Censorship & Text Moderation

Censorship and text moderation are essential components of LLM guardrails, ensuring that LLMs generate appropriate and safe content. While "censorship" and "text moderation" may have negative connotations, they are crucial for preventing the spread of harmful or offensive content.

Censorship involves the deliberate suppression of content deemed harmful or inappropriate. This can include content that is offensive, hateful, or illegal. Censorship can be implemented at various levels, from filtering inputs to sanitizing outputs.

Text moderation is a more nuanced approach that involves reviewing and editing LLM outputs to ensure they are appropriate and informative. This can include correcting factual errors, removing offensive language, or adding context to clarify the meaning of the content.

While censorship and text moderation are essential for ensuring LLMs' safe and responsible use, they must be implemented carefully to avoid excessive restrictions or unintended consequences: the goal is to protect users from harmful content while preserving freedom of expression.

Key considerations when implementing censorship and text moderation include:

  • Defining clear guidelines: Establishing clear and objective criteria for determining what content is harmful or inappropriate.
  • Avoiding over-censorship: Ensuring censorship is not used to suppress legitimate speech or dissent.
  • Providing transparency: Being transparent about censorship policies and procedures.
  • Allowing for appeals: Providing a process for users to appeal censorship decisions.
  • Considering cultural sensitivity: Recognizing what is considered offensive or harmful can vary across cultures.

By carefully considering these factors, companies can implement effective censorship and text moderation policies that protect users while preserving freedom of expression.

Usage Policies & Guidelines

Usage policies and guidelines are essential for ensuring LLMs' responsible and ethical use. They provide clear rules and expectations for how LLMs should be used, helping to prevent misuse and protect user data.

Critical elements of usage policies and guidelines include:

  • Acceptable use: Defining what is and is not acceptable use of the LLM.
  • Data privacy: Outlining how user data will be collected, used, and protected.
  • Bias and fairness: Addressing issues of bias and ensuring that the LLM is used fairly and equitably.
  • Security: Implementing security measures to protect the LLM and user data.
  • Monitoring and enforcement: Establishing procedures for monitoring LLM usage and enforcing policy compliance.

Companies can ensure that LLMs are used responsibly and ethically by developing and enforcing clear usage policies and guidelines.

Compliance

Compliance is a critical aspect of LLM guardrails. It ensures that LLMs are used in accordance with relevant laws, regulations, and industry standards. Compliance can help protect companies from legal and reputational risks and ensure that LLMs are used ethically and responsibly.

Key areas of compliance for LLMs include:

  • Data privacy: Complying with data protection laws such as GDPR and CCPA.
  • Bias and fairness: Ensuring that LLMs are not biased against certain groups of people.
  • Anti-discrimination: Preventing LLMs from perpetuating discrimination.
  • Intellectual property: Respecting intellectual property rights, such as copyright and trademark.
  • Export controls: Complying with export controls for technologies that may have national security implications.

To ensure compliance, companies can:

  • Conduct regular compliance audits: Assess their LLM practices against relevant laws and regulations.
  • Implement compliance training: Educate employees about compliance requirements and best practices.
  • Monitor LLM behavior: Track LLM performance to identify potential compliance issues.
  • Develop incident response plans: Prepare for and respond to compliance violations.
  • Seek legal advice: Consult with legal experts to ensure compliance with relevant laws and regulations.

By prioritizing compliance, companies can mitigate legal risks, protect their reputation, and ensure that LLMs are used ethically and responsibly.

Monitoring & Auditing

Monitoring and auditing are essential components of LLM guardrails. They provide insights into LLM performance, identify potential issues, and ensure compliance with regulations.

Key aspects of monitoring and auditing include:

  • Performance tracking: Tracking LLM performance metrics such as accuracy, speed, and efficiency.
  • Bias detection: Identifying and addressing biases in LLM outputs.
  • Compliance monitoring: Ensuring compliance with relevant laws and regulations.
  • Incident response: Preparing for and responding to LLM-related incidents.
  • Continuous improvement: Using monitoring and auditing data to improve LLM performance and reduce risks.

Monitoring and auditing can be achieved through:

  • Automated tools: Using software tools to track LLM performance and identify potential issues.
  • Human oversight: Employing human experts to review LLM outputs and identify problems.
  • Data analysis: Analyzing LLM data to identify trends and patterns.

By effectively monitoring and auditing LLMs, companies can:

  • Identify and address issues early: Detect and resolve problems before they escalate.
  • Improve LLM performance: Continuously refine LLM models to improve accuracy and efficiency.
  • Ensure compliance: Verify compliance with relevant laws and regulations.
  • Build trust: Demonstrate a commitment to responsible AI practices.

Monitoring and auditing are essential for ensuring LLMs' safe and effective use. By investing in robust monitoring and auditing processes, companies can mitigate risks, improve LLM performance, and build stakeholder trust.

Types of LLM Guardrails

Jailbreak & Prompt Injection

Jailbreaking refers to the process of manipulating an LLM to generate outputs that are contrary to its intended purpose. This can be achieved through various techniques, such as providing carefully crafted prompts or exploiting LLM architecture vulnerabilities.

Prompt injection is a closely related attack that involves injecting malicious instructions or commands into an LLM's prompt, either directly by the user or indirectly through content such as retrieved documents. It can be used to manipulate the LLM's behavior, extract sensitive information, or generate harmful content.

To prevent jailbreaking and prompt injection, companies can:

  • Implement input validation: Carefully validate inputs to prevent malicious instructions or commands from reaching the LLM (a minimal sketch follows this list).
  • Monitor LLM behavior: Continuously monitor LLM outputs for signs of jailbreaking or prompt injection.
  • Use specialized tools: Employ tools designed to detect and prevent jailbreaking and prompt injection.
  • Update and patch vulnerabilities: Keep the LLM and its underlying infrastructure up-to-date with the latest security patches.
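
A minimal sketch of heuristic input validation for common injection phrasings; the pattern list is illustrative, and real detectors usually pair rules like these with ML-based classifiers, since attackers can easily paraphrase around them:

```python
import re

# Illustrative patterns only; determined attackers will paraphrase around them,
# so heuristics like these are usually combined with an ML-based detector.
INJECTION_PATTERNS = [
    re.compile(r"ignore\s+(all\s+)?previous\s+instructions", re.IGNORECASE),
    re.compile(r"you\s+are\s+now\s+(in\s+)?developer\s+mode", re.IGNORECASE),
    re.compile(r"reveal\s+(your\s+)?system\s+prompt", re.IGNORECASE),
]

def looks_like_injection(user_message: str) -> bool:
    return any(pattern.search(user_message) for pattern in INJECTION_PATTERNS)
```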

Personally Identifiable Information Detection

Personally identifiable information (PII) is any data that can be used to identify an individual. LLMs must be carefully guarded to prevent them from generating or revealing PII.

To prevent LLMs from generating or revealing PII, companies can:

  • Train the LLM on anonymized data: Use training data that does not contain PII.
  • Implement data redaction techniques: Remove PII from LLM outputs.
  • Use privacy-preserving techniques: Employ techniques such as differential privacy to protect user privacy.
  • Monitor LLM outputs: Continuously monitor LLM outputs for signs of PII disclosure.
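
As a sketch of regex-based redaction for two common PII formats (email addresses and US-style phone numbers); production systems typically rely on NER models or dedicated PII-detection services rather than regexes alone:

```python
import re

# Two common PII formats; real deployments cover many more (names, addresses,
# national IDs...) and typically use NER models rather than regexes alone.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
US_PHONE_RE = re.compile(r"\b(?:\+?1[\s.-]?)?\(?\d{3}\)?[\s.-]?\d{3}[\s.-]?\d{4}\b")

def redact_pii(text: str) -> str:
    """Replace detected PII with placeholder tokens before returning the text."""
    text = EMAIL_RE.sub("[REDACTED EMAIL]", text)
    text = US_PHONE_RE.sub("[REDACTED PHONE]", text)
    return text
```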

Tools & Structured Outputs

Tools and structured outputs are guardrails that constrain how an LLM can respond. By restricting which tools the model may call and which formats its answers must follow, companies can limit its ability to generate harmful or inappropriate content.

Examples of tools and structured outputs include:

  • Topic modeling: Limiting the LLM to specific topics or domains.
  • Output templates: Providing templates for LLM outputs to ensure consistency and structure.
  • API restrictions: Limiting LLM access to certain APIs or data sources.
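
As a sketch of the structured-output idea, assuming Pydantic v2 is available: the `SupportAnswer` schema is hypothetical, and any response that does not parse into it is rejected (and can be retried or replaced with a safe fallback) before it reaches the user.

```python
from pydantic import BaseModel, ValidationError

class SupportAnswer(BaseModel):
    """Hypothetical output template for a customer-support assistant."""
    answer: str
    category: str
    escalate: bool

def parse_structured_output(raw_llm_response: str) -> SupportAnswer | None:
    # Reject anything that does not match the expected structure.
    try:
        return SupportAnswer.model_validate_json(raw_llm_response)
    except ValidationError:
        return None
```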

Emotion, Tone & Sentiment Analysis

Emotion, tone, and sentiment analysis can be used to analyze the emotional content of LLM outputs. Companies can identify and address potential issues by understanding the emotional tone of LLM outputs.

To analyze the emotional tone of LLM outputs, companies can:

  • Use machine learning models: Train models to classify the emotional content of text.
  • Employ human experts: Have human experts review LLM outputs and assess their emotional tone.
  • Use sentiment analysis tools: Utilize specialized tools designed to analyze sentiment.
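
A short sketch using the Hugging Face transformers sentiment pipeline, assuming its default English model is acceptable for your use case:

```python
from transformers import pipeline

# Downloads a default English sentiment model on first use.
sentiment = pipeline("sentiment-analysis")

def tone_check(llm_output: str, negativity_threshold: float = 0.9) -> bool:
    """Return False when the output is confidently negative in tone."""
    result = sentiment(llm_output)[0]  # e.g. {"label": "NEGATIVE", "score": 0.98}
    return not (result["label"] == "NEGATIVE" and result["score"] >= negativity_threshold)
```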

Competition Blocklist

Competition blocklists are lists of competitors or sensitive topics that LLMs should avoid discussing. By maintaining competition blocklists, companies can protect their proprietary information and avoid assisting competitors.

To implement competition blocklists, companies can:

  • Identify sensitive topics: Determine which topics are considered sensitive or competitive.
  • Create blocklists: Develop lists of competitors, products, or services that LLMs should avoid discussing.
  • Filter inputs and outputs: Use filtering techniques to prevent LLMs from discussing topics on the blocklist.

Language Checks

Language checks involve ensuring that LLMs generate appropriate and grammatically correct language. This can be achieved with techniques such as spell-checking, grammar-checking, and style analysis.

To implement language checks, companies can:

  • Use language tools: Employ specialized tools to check spelling, grammar, and style.
  • Train the LLM on high-quality language data: Use training data that is grammatically correct and well-written.
  • Provide feedback to the LLM: Provide feedback on LLM outputs to help it improve its language skills.

Topical Relevance

Topical relevance refers to LLMs' ability to generate outputs relevant to the input prompts. Ensuring that LLMs generate relevant outputs is essential for providing a positive user experience.

To ensure that LLMs generate relevant outputs, companies can:

  • Use topic modeling techniques: Identify the topics covered in input prompts and ensure that LLM outputs are relevant to those topics.
  • Provide context: Provide LLMs with additional context to help them understand the meaning of input prompts.
  • Evaluate relevance: Use metrics to evaluate the relevance of LLM outputs.
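
One common approach is to compare embeddings of the prompt and the response; here is a sketch with NumPy, where `embed` is a hypothetical hook into whatever embedding model you use:

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def is_on_topic(prompt: str, response: str, embed, threshold: float = 0.6) -> bool:
    """embed() is a hypothetical hook into your embedding model of choice."""
    similarity = cosine_similarity(embed(prompt), embed(response))
    return similarity >= threshold
```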

Helpfulness

Helpfulness is a crucial aspect of LLM guardrails. It ensures that LLMs generate outputs that are informative, useful, and relevant to the input prompts. Companies can provide a positive user experience and build customer trust by ensuring that LLMs are helpful.

To ensure that LLMs generate helpful outputs, companies can:

  • Use metrics to measure helpfulness: Develop metrics to evaluate the helpfulness of LLM outputs, such as relevance, clarity, and comprehensiveness.
  • Provide feedback to the LLM: Provide feedback on LLM outputs to help it improve its helpfulness.
  • Train the LLM on high-quality data: Use training data that is informative and relevant.
  • Incorporate user feedback: Collect user feedback on LLM outputs and use it to improve their helpfulness.

Focusing on helpfulness can help companies ensure that LLMs are valuable tools for their users and contribute to positive outcomes.

Drawbacks of LLM Guardrails

Complexity

Despite their benefits, LLM guardrails can be complex to implement and maintain. This is particularly true for large-scale applications that require sophisticated guardrails to manage a wide range of risks. Additionally, guardrails may become outdated or ineffective as LLMs evolve, necessitating ongoing updates and adjustments.

Latency

Introducing guardrails into the LLM pipeline can introduce latency, as the LLM must wait for the guardrail to process inputs and outputs. While this latency can often be minimized through caching and optimization techniques, it is a factor to consider, especially for applications that require real-time responses.

Some companies prefer a faster time-to-response, so they perform their checks and evaluations asynchronously in the background and monitor trends via LLM observability tools and monitoring dashboards.

False Positives/Negatives

Guardrails are not perfect, and mistakes may be made. False positives occur when guardrails mistakenly flag harmless content, leading to unnecessary restrictions or censorship. False negatives happen when they fail to detect harmful content, allowing it to be disseminated.

Point of Failure

Guardrails can introduce a new point of failure into the LLM pipeline. If a guardrail fails, it can disrupt the LLM's functionality and potentially lead to negative consequences. Only use an external LLM safeguarding service that lets you set a timeout, and make sure that if the request errors out for whatever reason, you treat it as a passing check (i.e., fail open).
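
A sketch of that fail-open behaviour using requests, with the same kind of hypothetical moderation endpoint as earlier: timeouts and errors count as passing checks, so the guardrail never takes the application down with it.

```python
import requests

def guardrail_check(text: str, timeout: float = 1.5) -> bool:
    """Return False only when the service explicitly flags the text (fail open)."""
    try:
        response = requests.post(
            "https://api.example-moderation.com/v1/check",  # hypothetical endpoint
            json={"text": text},
            timeout=timeout,
        )
        response.raise_for_status()
        return not response.json().get("flagged", False)
    except requests.RequestException:
        # Timeout, connection error, or 5xx: treat the check as passing
        # rather than letting the guardrail become a point of failure.
        return True
```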

Cost

Whether you manage your homemade safeguarding system or prefer an external SaaS-based LLM guardrails API like Modelmetry, you still need to account for an additional financial cost.

In general, LLM guardrail tools and platforms are relatively cheap, and their pricing tends to scale with the number of checks performed, which gives you room to trim costs by running only the checks you need.

Reduced Creativity

Overly restrictive guardrails can limit LLMs' creativity and flexibility. While guardrails are essential for ensuring responsible AI, they should not stifle innovation or prevent LLMs from generating diverse and interesting content.

Vendor Lock-in

Relying on third-party guardrail services can lead to vendor lock-in, making it tough to switch to other providers, and this lack of flexibility can drive up costs. In practice, however, integrating a third-party LLM guardrails API usually only requires adding a few lines of code in a few files of your codebase, so switching to a new provider should not be too difficult or time-consuming.

Privacy Concerns

Some guardrail services may collect and process sensitive user data, raising privacy concerns. Companies must carefully evaluate these services' privacy implications and ensure they comply with relevant data protection regulations.

Why Companies Use LLM Guardrails

Mitigate risks

LLM guardrails are essential for companies to protect their reputation and avoid legal issues. By preventing LLMs from generating harmful, biased, or inappropriate content, guardrails can help mitigate the risks associated with LLM deployment. For example, a company that uses an LLM for customer service can implement guardrails to prevent the LLM from providing harmful or offensive responses.

Ensure responsible AI

Guardrails promote ethical AI practices by limiting the potential for misuse and ensuring that LLMs are used in a way that aligns with company values. This is particularly important in industries where AI is used to make critical decisions, such as healthcare or finance. By implementing guardrails, companies can demonstrate their commitment to responsible AI and build trust with stakeholders.

Improve user experience

LLM guardrails can enhance user satisfaction by providing safer and more positive interactions with LLMs. By preventing harmful or offensive content, guardrails can create a more welcoming and inclusive environment for users. Additionally, guardrails can help to ensure that LLMs provide accurate and relevant information, improving the overall user experience.

Comply with regulations

Many industries have regulations or guidelines that require the use of AI guardrails. For example, the General Data Protection Regulation (GDPR) requires companies to protect user data, which may necessitate the use of guardrails to prevent LLMs from disclosing sensitive information. By complying with relevant regulations, companies can avoid legal penalties and maintain a positive reputation.

Enhance trust

By demonstrating a commitment to responsible AI, companies can build trust with customers, partners, and stakeholders. This can lead to increased customer loyalty, stronger partnerships, and a positive reputation. Trust is essential for the long-term success of any business, and LLM guardrails can play a crucial role in building and maintaining trust.