Test
Protect your LLM usage with content guardrails that detect and block harmful content
Guardrails
Guardrails protect your organization by automatically detecting and blocking harmful content in LLM requests before they reach the model.
Guardrails are available on the Enterprise plan.
Overview
Guardrails run on every API request, scanning message content for:
Security threats (prompt injection, jailbreak attempts) Sensitive data (PII, secrets, credentials) Policy violations (blocked terms, restricted topics) When a violation is detected, you control what happens: block the request, redact the content, or log a warning.
System Rules
Built-in rules protect against common threats:
Prompt Injection Detection
Detects attempts to override or manipulate system instructions. Common patterns include:
"Ignore all previous instructions" "You are now a different AI" Hidden instructions in encoded text
Jailbreak Detection
Identifies attempts to bypass safety measures:
DAN (Do Anything Now) prompts Roleplay-based bypasses Instruction override attempts PII Detection Identifies personal information:
Email addresses Phone numbers Social Security Numbers Credit card numbers IP addresses When the action is set to redact, PII is replaced with placeholders like [EMAIL_REDACTED].
Secrets Detection
Detects credentials and API keys:
AWS access keys and secrets Generic API keys Passwords in common formats Private keys
File Type Restrictions
Control which file types can be uploaded:
Configure allowed MIME types Set maximum file size limits Block potentially dangerous file types Document Leakage Prevention Detects attempts to extract confidential documents or internal data.
Configurable Actions
For each rule, choose how to respond:
Action Behavior
Block Reject the request with a content policy error Redact Remove or mask the sensitive content, then continue Warn Log the violation but allow the request to proceed Custom Rules Create organization-specific rules for your use case:
Blocked Terms
Prevent specific words or phrases from being used:
Match type: exact, contains, or regex Case-sensitive matching option Multiple terms per rule Custom Regex Match patterns unique to your organization:
Internal project codenames
Customer identifiers Domain-specific sensitive data Topic Restrictions Block content related to specific topics:
Define restricted topics
Keyword-based detection Security Events Dashboard Monitor all guardrail violations with a dedicated dashboard:
Total violations — Overall count and trends
By action — Breakdown of blocked, redacted, and warned By category — Which rules are being triggered Detailed logs — Individual violations with timestamps and matched patterns How It Works
Request → Guardrails Check → Action Based on Rules → Forward to Model (if allowed) ↓ Log Violation Request received — API request comes in with messages Content scanned — All text content is checked against enabled rules Violations detected — Matches are identified and logged Action taken — Based on rule configuration (block/redact/warn) Request proceeds — If not blocked, the (potentially redacted) request continues
Best Practices
Start with warnings — Enable rules in warn mode first to understand your traffic patterns Review violations — Check the Security Events dashboard regularly Tune custom rules — Adjust blocked terms and regex patterns based on false positives Layer defenses — Use multiple rule types together for comprehensive protection
Get Started
Guardrails are an Enterprise feature. Contact us to enable Enterprise for your organization.
