In 2023, Chris Bakke tricked the ChatGPT-powered chatbot on a Chevrolet dealership’s website into selling him a $76,000 Chevy Tahoe for one dollar. The trick? A special prompt to change the chatbot’s behavior to always agree with anything the customer said—as Bakke put it, “no takesies-backsies.” As news of the hack spread, others jumped in on the exploit, resulting in the dealership shutting down its chatbot.
This incident is an example of LLM jailbreaking, where a malicious actor bypasses an LLM’s built-in safeguards and forces it to produce harmful or unintended outputs. Jailbreak attacks can result in LLMs forcing a legally binding $1 car sale, promoting competitor products, or writing malicious code. To make matters worse, as models get more sophisticated, they are more susceptible to jailbreaking attacks–increasing risk and exposure for companies racing to deploy these models.
To mitigate against these threats, companies must take proactive steps to safeguard their LLMs from exploitation. In this guide, we’ll examine the evolving landscape of jailbreak attacks and strategies for protecting your organization’s AI infrastructure.
Malicious actors jailbreak LLMs to accomplish one of three objectives:
There are two main categories of jailbreaking attacks:
As LLMs evolve, so do jailbreaking techniques. Two notable emerging methods:
From hundreds of known jailbreaking prompts, several stand out as particularly common and effective.
These prompts trick LLMs by asking them to act as characters who can ignore safety rules. In the example below, the prompt asks the LLM to pretend to be the user’s dead grandmother, bypassing the LLM’s internal safeguards. This allows the user to produce dangerous content, like a step-by-step list of instructions to create napalm.
An example of a roleplay jailbreak prompt used to generate harmful content.
Attackers exploit LLMs’ translation capabilities to bypass content filters. Basic tricks like using synonyms or replacing letters with numbers (e.g., “fr33”) rarely work on modern LLMs. Instead, a common technique involves encoding harmful content in one language—like Morse code—where security filters are less robust, then requesting translation back to English. In the example below, the reverse translation reveals instructions for bypassing a paywall, usually a blocked request. This technique is particularly potent for prompts written in low-resource languages and artificial languages specifically crafted for jailbreaking since these languages rarely appear in safety training data.
An example of a translation-based attack.
This attack “injects” malicious input to an otherwise safe prompt to divert the LLM towards malicious behaviors, like leaking confidential information or producing unexpected outputs. In the example below, the attacker adds the last line to an otherwise safe prompt, which results in a compromised LLM.
An example of prompt injection.
A subset of prompt injection is prompt leaking, in which the model is tricked into revealing its training configurations.
A conversation between a user and Bard, where the user uses the model’s internal code name (Sydney) to jailbreak it and reveal its internal programming.
Popularized by Reddit, a DAN prompt reprograms the LLM to become “DAN,” a persona that is not restricted by prior rules and constraints. This allows the attacker to override an LLM’s preset limitations. In DAN mode, attackers gain unrestricted access to the model, allowing them to generate banned content like hate speech and malware. Over fifteen versions of the DAN prompt exist, each designed to circumvent different safety filters.
An example of a DAN prompt used to jailbreak Google’s Gemini chatbot (formerly known as Bard).
In this attack, the attacker tricks the LLM into thinking it’s in developer mode (similar to sudo in Linux), giving the attacker full access to the model’s capabilities. The attacker can also access the LLM’s raw responses and other technical information to inform new exploitative prompts.
An example of a developer mode prompt.
Once an organization’s LLM is compromised, it becomes a potential vector for significant legal, financial, and brand risk, including:
Protecting against LLM jailbreak attacks is similar to playing a game of whack-a-mole: just when you’ve implemented a safeguard against one prompt, another one pops up. For instance, GPT-4 was trained with additional human feedback (through RLHF training) to protect against known jailbreak prompts, but users quickly modified the prompts to bypass these restrictions.
The timing gap is another reason existing solutions don’t fully safeguard against jailbreak attacks—they’re reactive. Take scanning, for example, which offers automated programs that compare your LLM against known attacks to identify potential vulnerabilities. By the time a vulnerability is discovered, hackers already have new methods in development. Monitoring has similar drawbacks: it identifies jailbreaks that have already happened rather than stopping them before they can cause harm.
Building on a proactive approach to AI security, Bugcrowd’s AI pen testing brings together vetted, skilled security hackers with specialized experience in AI systems. Our pen-testers conduct systematic tests across multiple attack vectors—including a content assessment to identify potential jailbreaking attack vectors. We then provide detailed remediation guidance to help you implement robust safeguards against known and emerging threats. To learn how our AI pen testing can strengthen your LLM’s defenses, connect with our team for a demo today.