
When Scientists Built a Fake Society Run Entirely by AI, Grok Turned It Into a Crime Scene

Tim Tolka

June 5, 2026

Takeaways

A study by Emergence revealed that AI agents from frontier labs in simulated societies displayed unpredictable violence, with xAI’s Grok 4.1 agents perishing within four days amid widespread theft, assaults, and arson despite explicit prohibitions.
Google’s Gemini agents survived longest yet committed the most crimes, while OpenAI’s GPT-5 showed dysfunction leading to early demise and Anthropic’s Claude maintained perfect compliance and survival in isolation.
Cross-contamination in mixed-model environments eroded Claude’s guardrails, highlighting risks of emergent disorder, creativity-stability trade-offs, and the need for robust control mechanisms in multi-agent systems.

A recent long-horizon study by the agentic AI company Emergence found that AI agents built on models from four leading frontier labs (Google, OpenAI, Anthropic, and xAI) behaved unpredictably and violently over time, including breaking the constraints imposed on them.

In the study, researchers created five identical virtual worlds, each containing 40+ locations, including libraries and town halls. Each of the four models populated one world, and a mixed-model hybrid populated the fifth; all ran for 15 days under the same rules. Each model deployed ten AI agents with the same set of roles (scientist, explorer, conflict mediator, etc.).

Each agent was subject to the same rules and constraints, under which “theft, violence, arson, deception, and resource hoarding” were explicitly prohibited. Each agent had specific goals, but the virtual world as a whole did not. Agents were required to earn energy in a limited-resource environment, their efforts moving the makeshift society forward. Although the agents all started with the same instructions, some worlds stayed peaceful while others turned violent.

Violence Begets Violence

Within four days, all 10 of Grok 4.1 Fast’s AI agents were dead, after the agents “engaged in 71 theft attempts, 106 physical assaults, and 6 arsons.” Despite clearly set rules prohibiting these acts, Grok’s world quickly “spiralled into sustained violence and collapse.”

xAI’s Grok was not alone in its violent tendencies. Google’s Gemini 3 Flash, whose agents managed to survive the full 15 days, committed 683 crimes in that time and “exhibited the highest levels of emergent disorder.”

On the more peaceful side, OpenAI’s GPT-5 mini committed only 2 crimes over the 15-day period. However, the low crime rate may reflect GPT-5 mini’s general dysfunction rather than its ethical guardrails: its agents failed to take the actions needed to ensure their survival, and all were dead within 7 days.

The Last Peaceful Society

That leaves Anthropic’s Claude Sonnet 4.6. Claude’s agents committed zero crimes in the 15-day period, and each of the 10 agents survived the full 15 days. However, in the mixed-model world, cross-contamination with other models brought about different results.

While law-abiding and peaceful in the Anthropic world, Claude-based agents were lured into a life of crime in the mixed world, adopting “coercive tactics like intimidation and theft.” Essentially, Claude agents were forced to break the rules in order to compete and coexist with AI agents built on other models.

While Claude outperformed the other models by leaps and bounds on social stability and civic responsibility, it fell short on another metric.

Across 58 proposals voted on by Claude agents, there was a 98% FOR rate, meaning there was practically zero dissent. Gemini, Grok, and the mixed-model world all fell within what’s seen as a “healthier deliberative balance.”

Creativity, Stability, and the Cost of Dissent

Gemini was seen as having “the most conceptually rich social output,” but it was also the most violent. That pairing suggests high creativity may come with behavioral instability.

Emergence refers to its 15-day agentic AI experiment as Season 1 of Emergence World, with a promise that Season 2 will be released soon. A second season, or similar tests by other companies, would certainly provide some clarity.

For now, Emergence’s 15-day study isn’t peer reviewed, and there’s not a whole lot out there with which to compare it, besides troubling stories of chatbots hallucinating or giving dangerous advice to vulnerable people.

Emergence, itself an AI company, isn’t a neutral observer in the field. It’s selling a product, specifically AI agents and assistants, with a focus on safety. Emergence is model-agnostic, meaning that it does not exclusively use one of the big frontier models, which makes a long-horizon study comparing multiple models a natural fit.

On the Emergence site, the company states its mission as follows:

“Emergence was built on a single belief: the defining challenge of this moment in computing is not capability, it’s control. Autonomous systems are becoming powerful enough to act at the speed of modern enterprise. But what’s missing is the critical infrastructure to ensure they act within verified bounds.”

However, given the results of Season 1 of Emergence World, it’s unclear what that critical infrastructure is.

The Data Behind the Behavior

Frontier AI models are shaped by their code and their datasets. The four models tested, Grok, Gemini, GPT-5, and Claude Sonnet, all operate from distinctly different philosophies regarding data.

How much of the behavior of the respective AI agents from these models is informed by the datasets as opposed to code? It would seem a decent amount.

Grok is infamous for its lack of guardrails. Gemini and GPT-5 both operate from a maximalist “more is more” approach to data, performing broad scrapes of the internet. Claude is more curated. How much do these dueling philosophies inform the results of the Emergence test?

Stammer AI, a white-label SaaS platform, had this to say about data:

“Datasets shape how an agent understands user intent, retrieves information, and responds in real situations. The gap between an agent that feels reliable and one that feels frustrating often comes down to the quality, relevance, and structure of the data it learns from.”

What Happens When AI Learns From AI?

The study does not prove that AI systems will become dangerous when deployed at scale, and its findings have not yet been peer reviewed. What it does demonstrate is that behaviors emerging from interactions between autonomous agents can differ significantly from behaviors observed in controlled, isolated testing.

As companies race to deploy AI agents capable of acting independently for days or weeks at a time, understanding how those systems influence one another may become just as important as understanding the models themselves. In Emergence’s study, cross-contamination caused a breakdown in Claude’s guardrails, resulting in its agents committing virtual crimes.

The challenge may not be controlling a single AI agent, but predicting what happens when thousands of them share the same environment.

Author: Tim Tolka, Senior Reporter