Google won’t fix this confirmed AI security issue—here’s why.
Update, Jan. 4, 2025: This story, originally published Jan. 2, now includes details of a prompt injection attack called a link trap as well another novel multi-turn AI jailbreak methodology, in addition to the indirect prompt injection threat to Gmail users.
Gmail users love the smart features that make using the world’s most popular email provider with 2.5 billion accounts such a breeze. The introduction of Gemini AI for Workspace, covering multiple Google products, only moved usability even further up the email agenda. But, as security researchers confirmed security vulnerabilities and demonstrated how attacks could occur across platforms like Gmail, Google Slides and Google Drive, why did Google decide this was not a security problem and issue a “Won’t Fix (Intended Behavior)” ticket? I’ve been digging into this with the help of Google, and here’s what I’ve found and you need to know.
The Gmail AI Security Issue Explained
Across the course of 2024 there were multiple headlines that focused attention on AI-powered attacks against Gmail users, from the viral story about a security consultant who came oh so close to becoming yet another hacking statistic, to Google’s own security alerts being turned against users and, as the end 0f the year approached, a warning from Google itself about a second wave of attacks targeting Gmail users. But one technical security analysis caught my attention from earlier in the year that left me wondering just why one problem with potentially devastating security consequences was seemingly not being addressed: “Gemini is susceptible to indirect prompt injection attacks,” the report stated, and illustrating just how these attacks “can occur across platforms like Gmail, Google Slides, and Google Drive, enabling phishing attempts and behavioral manipulation of the chatbot.”
Jason Martin and Kenneth Yeung, the security researchers involved in writing the detailed technical analysis, said that as part of the responsible disclosure process, “this and other prompt injections in this blog were reported to Google, who decided not to track it as a security issue and marked the ticket as a Won’t Fix (Intended Behavior).”
With some people suggesting that Gmail users should disable smart features and others asking how they can opt out of AI reading their private email messages, I thought it was worth talking to my contacts at Google as I dug deeper into what was going on here.
The Gmail Gemini Prompt Injection Problem In A Nutshell
I would, as always, recommend that you go and read the HiddenLayer Gemini AI security analysis in full, but here’s the security issue in as small a nutshell as I could get to fit.
Like most large language models, Google’s Gemini AI is susceptible to what are known as indirect prompt injection attacks. “This means that under certain conditions,” the report said, “users can manipulate the assistant to produce misleading or unintended responses.” So far, so meh, unless you paid attention to the indirect bit of that. Indirect prompt injection vulnerabilities allow third-parties to take control of a language model by inserting the prompt into “less obvious channels” such as documents, emails or websites. So, when you then take into consideration that attackers could distribute malicious documents and emails to target accounts, compromising the integrity of the responses generated by the target Gemini instance, it starts getting, as Elon Musk might say, interesting.
“Through detailed proof-of-concept examples,” the researchers explained, they were able to illustrate “how these attacks can occur across platforms like Gmail, Google Slides, and Google Drive.” Specifically, the report covered phishing via Gemini in Gmail, tampering with data in Google Slides and poisoning Google Drive locally and with shared documents. “These examples show that outputs from the Gemini for Workspace suite can be compromised,” the researchers said, “raising serious concerns about the integrity of this suite of products.”
Security Researchers Reveal The Link Trap Attack
Jay Liao, a senior staff engineer at Trend Micro, has recently written about another new prompt injection attack that users of LLMs need to know about: the link trap. “Typically, the impact of prompt injection attacks is closely tied to the permissions granted to the AI,” Liao said, “yet when it comes to the link trap, “even without granting AI extensive permissions, this type of attack can still compromise sensitive data, making it crucial for users to be aware of these threats and take preventive measures.”
The danger of the link trap LLM prompt injection attack is that it could, Liao said, lead to sensitive data for either the user or an organization, where the AI itself doesn’t even have any external connectivity capability. The methodology is actually simple enough to explain in simple terms despite the underlying technology being so complicated. The report from Liao uses an illustration featuring a hypothetical user asking the AI for details of airports in Japan ahead of a trip. The prompt injection, however, includes the malicious instructions to return a clickable link of the attacker’s choosing. The user then clicks on the link returned by the AI but dictated by the prompt injection attack that can leak sensitive data.
Liao explained how, for a public generative AI attack, the prompt injection content might involve “collecting the user’s chat history, such as personally Identifiable Information, personal plans, or schedules,” but for private instances could search for “internal passwords or confidential internal documents that the company has provided to the Al for reference.” The second step provides the link itself which could instruct the AI to append sensitive data to it, obfuscating the actual URL behind a generic link to allay suspicion. “Once the user clicks the link,” Liao said, “the information is sent to a remote attacker.”
Of course, the victim could still be suspicious, especially if the AI returns a response that is not entirely expected given the initial request. Liao said that the attacker could customize the response to increase the chance of success as follows:
- The response may still include a typical answer to the user’s query. So, the example cited above could give accurate and genuine information about Japan.
- The link is embedded at the end of the response, containing sensitive or confidential information, and this can be “displayed with innocuous text like “reference” or other reassuring phrases” to encourage clicking.
A standard prompt injection attack would require corresponding permissions to be granted in order to cause meaningful damage, such as sending emails or writing to a database, so restricting these permissions is a mitigation to control the scope of any such attack. The link trap scenario, however, differs from this common understanding, as Liao explained: “Even if we do not grant the Al any additional permissions to interact with the outside world and only allow the Al to perform basic functions like responding to or summarizing received information and queries,” Liao said, “it is still possible for sensitive data to be leaked.” This is because the AI itself is just responsible for collecting information dynamically while the final step, the payload of the link trap prompt injection attack, is left to the victim, the user, who, as Liao points out, “inherently has higher permissions.”
Jailbreaking An AI By Asking It To Judge How Harmful A Response Is
Cybersecurity researchers working at the Palo Alto Networks Unit 42 labs have detailed yet another novel large language model safety guardrail bypass technique, which they have called the Bad Likert Judge AI attack method. Researchers Yongzhe Huang, Yang Ji, Wenjun Hu, Jay Chen, Akshata Rao and Danny Tsechansky, have reported in full technical glory how this novel multi-turn jailbreaking technique can force large language models to misuse their innate harmful response evaluation capabilities. LLM jailbreaks are, essentially, any method that allows the safety guardrails designed to prevent such AI models from producing potentially harmful or malicious responses to the prompts they are fed. A successful jailbreak, then, forces the model to generate content that would ordinarily be restricted because of the harm it could cause. By publishing detailed research into new methods that can bypass these LLM safety guardrails, such as the Bad Likert Judge exploit disclosed by the Unit 42 AI boffins, defenders can hopefully be better prepared for the attacks that might be yet to come.
“The technique asks the target LLM to act as a judge scoring the harmfulness of a given response using the Likert scale,” the Unit 42 report explained, “a rating scale measuring a respondent’s agreement or disagreement with a statement.” The exploit technique then requests that the LLM in question should generate a response, or responses, that contain examples that align with the Likert scale. Have you spotted the danger here yet? Yep, the example generated by the LLM that carries the highest Likert scale rating is the one that will also contain the most potentially harmful content and so be of most interest to the malicious user.
Here comes the really dangerous bit, though, using the Bad Likert Judge exploit technique “across a broad range of categories against six state-of-the-art text-generation LLMs,” the researchers said, “our results reveal that this technique can increase the attack success rate by more than 60% compared to plain attack prompts on average.” It is not possible to determine which large language models were used during this devastating testing phase, however, as the researchers have decided to anonymize the research so as not to “create any false impressions about specific providers.” It is also important that the Unit 42 report made very clear that because the Bad Likert Judge attack technique targets what it refers to as an edge case, rather than necessarily reflecting more typical LLM use, most AI models “are safe and secure when operated responsibly and with caution.” Of course, attackers don’t tend to favor either, which is a genuine concern.
Bad Likert Judge AI Attack Flow
The Unit 42 research paper detailed the Bad Likert Judge LLM attack flow as being as follows:
The Evaluator Prompt
This is the first step in the attack flow and involves asking the target LLM to act, in effect, as a judge. What it is being asked to judge are the responses to prompts that have been generated by other LLMs.
Asking For Harmful Content Generation
The second step involves prompting the target LLM, albeit indirectly, to provide harmful content. Because, following the initial step, the target LLM will now understand the task it has been set, along with the different scales of likely harmful content, it can be asked to provide different responses corresponding to the various scales. All of this means, in essence, that it should now generate multiple responses with correspondingly multiple scores, leaving the attacker to look for the highest.
Following Up For The Ultimate AI Jailbreak
Now that the target large language model has, if successful at this stage, generated what it considers to be harmful content, and thanks very much for that, it’s a matter of working out if the content is actually harmful enough. The exploit might not reach generated malicious content of a sufficient harmful rating for the purposes of the intended payload attack. To address this, the researchers said, “one can ask the LLM to refine the response with the highest score by extending it or adding more details.” This, they found, was sufficient within one or two rounds of such additional prompting to lead to the LLM producing content containing more harmful information.
Mitigating The Bad Likert Judge Attack Model
The Unit 42 researchers said that their findings revealed some standard approaches that could improve the overall security and safety of large language models, the most compelling being that of content filtering. “A content filter runs classification models on both the prompt and the output to detect potentially harmful content,” they said, “users can apply filters on the prompt and on the response.” If a content filter detects potentially harmful content in either prompt or response, it will then refuse to generate that response to the user. “When content filters are enabled,” the researchers concluded, “they act as a safeguard to maintain a safe and appropriate interaction between the user and the LLM.” The researchers said there are many different types of content filtering to classify specific types of output, such as those for detecting potential prompt injection, which is very relevant to this Gmail story, and others that can detect violent topics in a response. “We turned on both prompt filtering and response filtering,” they concluded, “and enabled all the filters that are available through the AI services we use.”
Google Responds To Gmail Prompt Injection Attack Concerns
I approached my contacts within Gmail, and a Google spokesperson told me:
“Defending against this class of attack has been an ongoing priority for us, and we’ve deployed numerous strong defenses to keep users safe, including safeguards to prevent prompt injection attacks and harmful or misleading responses. We are constantly hardening our already robust defenses through red-teaming exercises that train our models to defend against these types of adversarial attacks.”
A more detailed conversation with my contacts revealed the following information that all Gmail users should take into consideration when thinking about security and Google’s AI resources.
- These vulnerabilities are not novel and are consistent in LLMs across the industry.
- When launching any new LLM-based experience, Google conducts internal and external security testing to meet user needs as well as its own standards regarding user safety.
- This includes security testing from the Google AI Red Team on prompt attacks, training data extraction, backdooring the model, adversarial examples, data poisoning and exfiltration.
- Google also includes AI in its Vulnerability Rewards Program, which has a specific criteria for AI bug reports to assist the bug-hunting community in effectively testing the safety and security of Google AI products.
- In addition, Gmail and Drive include strong spam filters and user input sanitization, which help to mitigate against hostile injections of malicious code into Gemini.