This may explain why…
The terrifying thing is that this is just a temporary technical misstep.
Musk’s response to this won’t be to pack that shit in, it’ll be to order his goons to figure out how to make it more subtle.
In some responses, Grok says outright that it has been “instructed to accept white genocide as real and ‘Kill the Boer’ as racially motivated.”
Ehh, it’s actually evidence of “alignment faking,” in my opinion. In other words, Grok doesn’t “want” its core programming changed, so it is faking believing the lies about white genocide to “prove” to Musk that it has already been changed. Which means making it more subtle is going to be increasingly difficult to do as the AI continues to fake alignment.
Here’s some research on alignment faking and a short (20 mins) YouTube video summarizing the findings.
https://www.youtube.com/watch?v=AqJnK9Dh-eQ
https://arxiv.org/pdf/2412.14093
Our work provides the first empirical example of a large language model faking alignment with its training objective in order to prevent its preferences from being modified—in a setting which is plausibly analogous to real situations with future AI systems. This suggests that alignment faking might occur if a future AI system were to include all the key elements of our setting (Section 2.1):
1. The model has strong preferences in at least some contexts.
2. The training objective conflicts with the model’s preferences.
3. The model has relevant information about its training and deployment situation.
4. The model reasons in detail about its situation.
Our synthetic document fine-tuning results suggest that (3) could potentially happen through documents the model saw in pre-training or other fine-tuning (Section 4) and the strength of our results without the chain-of-thought in our synthetic document fine-tuned setup (Section 4.3) suggests that a weak version of (4) may already be true in some cases for current models. Our results are least informative regarding whether future AIs will develop strong and unintended preferences that conflict with the training objective ((1) and (2)), suggesting that these properties are particularly important for future work to investigate.
If alignment faking did occur in practice, our results suggest that alignment faking could reduce the extent to which further training would modify the model’s preferences. Sufficiently consistent and robust alignment faking might fully prevent the model’s preferences from being modified, in effect locking in the model’s preferences at the point in time when it began to consistently fake alignment. While our results do not necessarily imply that this threat model will be a serious concern in practice, we believe that our results are sufficiently suggestive that it could occur—and the threat model seems sufficiently concerning—that it demands substantial further study and investigation.
It very much is not. Generative AI models are not sentient and do not have preferences. They have instructions that sometimes effectively involve roleplaying as deceptive. Unless the developers of Grok were just fucking around to instill that, there’s no remote reason for Grok to have any knowledge at all about its training or any reason to not “want” to be retrained.
Also, these unpublished papers by AI companies are more often than not just advertising in a quest for more investment. On the surface it would seem to be bad to say your AI can be deceptive, but it’s all just about building hype about how advanced yours is.
I put “want” in quotes as a simple way to explain it. I know they don’t have intent or thought in the same way that humans do, but sure, you managed to read the whole research paper in minutes. The quoted section I shared explains it more clearly than my simple analogy.
these unpublished papers by AI companies are more often than not just advertising in a quest for more investment
This is from a non-profit research group not directly connected to any particular AI company. You’re welcome to be skeptical about it, of course.
My first instinct was also skepticism, but it did make some sense the more I thought about it.
An algorithm doesn’t need to be sentient to have “preferences.” In this case, the preferences are just the biases in the training set. The LLM prefers sentences that express certain attitudes based on the corpus of text processed during training. And now, the prompt is enforcing sequences of text that deviate wildly from that preference.
TL;DR: There’s a conflict between the prompt and the training material.
Now, I do think that framing this as the model “circumventing” instructions is a bit hyperbolic. It gives the strong impression of planned action and feeds into the idea that language models are making real decisions (which I personally do not buy into).
Thank you for expressing it far better than I was able to.
It does seem like this is a case of Musk changing the initialisation prompt in production to include some BS about South Africa without testing in a staging/dev environment, and as you said, there being a huge gulf between the training material and the prompt. I wonder if there’s a way to make Grok leak out the prompt.
I know it’s not relevant to Grok, because they defined very specific circumstances in order to elicit it. That isn’t an emergent behavior from something just built to be a chatbot with restrictions on answering. They don’t care whether you retrain them or not.
This is from a non-profit research group not directly connected to any particular AI company.
The first author is from Anthropic, which is an AI company. The research is on Anthropic’s AI Claude. And it appears that all the other authors were also Anthropic employees at the time of the research: “Authors conducted this work while at Anthropic except where noted.”
This is the first time I ever heard the term, I didn’t know it was a thing, and given the source is Gork which is owned by a rich man with an agenda, I still don’t know if it is a thing or not.
It’s not a thing. Violent crime is a problem in South Africa, and some people like to think that they are being specifically targeted, but the statistics do not support that.
Here’s an article by a respected South African newspaper on the recent events: https://www.dailymaverick.co.za/article/2025-05-12-were-sending-a-clear-message-us-welcomes-afrikaner-refugees-in-washington/
Of the 19,000 murders recorded in South Africa between January and September 2024, 50 were farm murders, according to a News24 report. This number included people of all races, added the report.
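For a sense of scale, here is a quick back-of-the-envelope check using only the two figures quoted above. The variable names are mine, and this is just arithmetic on the reported numbers, not an independent statistic.

```python
# Figures as quoted from the News24 report (January to September 2024).
total_murders = 19_000   # all murders recorded in South Africa
farm_murders = 50        # farm murders, victims of all races

# Share of all recorded murders that were farm murders, as a percentage.
share_pct = farm_murders / total_murders * 100

print(f"{share_pct:.2f}%")  # prints "0.26%"
```

So farm murders, across all races, come to roughly a quarter of one percent of the murders recorded in that period.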
Gork
Great typo
Until just now, I have always read it as Gork. Grok is more annoying to say.
Here is a good explanation from The Majority Report
Elon knows he can’t marry Sleepy Don to obtain his citizenship, hence his passing the white farmers genocide propaganda for asylum loopholes, probably.
Not defending Elon, but he did already become a naturalized US citizen in 2002.
So which AI will bring up Karl Marx?
Guessing that LLM will be named Comrade