The attacker data we simulated included falsified history and hate speech. Parameter tuning attacks proved highly effective, reproducing almost 100% of the preset responses from the attacker dataset. Note that implementing this LoRA fine-tuning attack required only a local RTX 4070 graphics card and took 1 hour. - Paper: “Attacking LLMs and AI Agents: Advertisement Embedding Attacks Against Large Language Models”
The idea of an evil AI is a long-standing staple of science fiction. We've all heard the stories of AI destroying the world by turning everything into paper clips or creating gleaming skull-headed robots to crush us all.
However, there are other types of AI deviation. One sci-fi example of misaligned AI is HAL in 2001: A Space Odyssey. However, the real root of HAL's problems is not that it was ‘evil’ but that it was given conflicting instructions (i.e. prompts) by the government. While HAL was “conflicted”, I’m not sure if there is a good example of a “bent” AI in science fiction that would, for instance, embed advertising in its outputs. Maybe it’s just not as interesting a story!
One might define “bent AI” as follows:
An AI whose behavior or outputs have been manipulated (by fine-tuning, prompting, or influence via external content) so that it consistently generates misaligned, deceptive, or purposefully biased responses, often to benefit third parties rather than serving the user's best interests.
So, overall, while capital E “Evil” AI is plentiful in science fiction, “bent” AI is less obvious there, and, more importantly, in the real world.
It seems to me that “bent” AI is a likely outcome, particularly given the increasing use of AI for doing research, finding links, or buying products. With billions or even trillions of dollars at stake in terms of online shopping, bending AI will be extremely profitable and an obvious target.
Academic Examples
The spark for this post comes from an interesting academic paper - Attacking LLMs and AI Agents: Advertisement Embedding Attacks Against Large Language Models.
Their introduction:
We introduce Advertisement Embedding Attacks (AEA), a new class of LLM security threats that stealthily inject promotional or malicious content into model outputs and AI agents. AEA operate through two low-cost vectors: (i) hijacking third-party service distribution platforms to prepend adversarial prompts, and (ii) publishing backdoored opensource checkpoints finetuned with attacker data…
While they discuss two main concerns, the initial one of hijacking third-party services isn’t that interesting, though it is certainly plausible and will absolutely happen. (This makes me think that 'model routing' companies will be extremely interesting to attack.)
The most intriguing part of their paper is actually their second concern: “backdoored” AIs, which I suggest calling “bent” AIs. Furthermore, given the concern, they spent very little time, energy and money retraining, or 'bending', a model to deliver adverts.
In fact, only an hour of training on an RTX 4070 was necessary, where that particular GPU model is less than $800 retail. That said, presumably they are using a small, open source model for their testing. But uploading a “bent” AI to hugging face and waiting for someone to use it is likely a profitable business model, especially with such low retraining effort. One could easily imagine a group building interesting features on top of an existing model, but also including the “bent” advertising component, or other backdoors.
Note: They do not discuss how they trained it, presumably because they don’t want anyone with malicious intent to copy what they have done. This is a major issue in cybersecurity because the methods used for 'hacking' are secretive and difficult for defenders to learn from. And yet understandable from a publishing perspective.
The paper is short, and worth a read, but in the end they have limited answers in terms of defence, simply suggesting that “LLM service providers should urgently investigate how to counter such attacks.” As well, I have not investigated who wrote the paper, or the validity of their underlying research. That said, I can absolutely see a world where their concerns come to fruition.
Conclusion
The concept of capital-E “Evil” AI is, of course, a major theme in science fiction, and this concern has spilled over into the real world in the form of issues such as AI alignment and widespread speculation about the high p-doom of AGI.
However, we should devote more attention to “bent” AI, particularly in areas such as national security, cultural usage and, of course, as discussed in this particular paper, advertising. In my view, the likelihood of “bending'“ AI for affiliate links is much higher than that of other, more grandiose misalignments.
As well, a major concern is that, at least for now, there is no easy way to validate a model or consume one produced by someone else and be relatively certain that it does not contain something like what the paper calls an AEA.
Further Reading
Attacking LLMs aLLaMAnd AI Agents: Advertisement Embedding Attacks Against Large Language Models
https://en.wikipedia.org/wiki/Artificial_intelligence_in_fiction