AI Promptonyms: The Ghost Names Haunting the Internet and Academia

Jun 21, 2026

TempMail Ninja

For years, digital sleuths, academic peer reviewers, and internet archaeologists have been baffled by a recurring phenomenon: the same fictional characters keep appearing across independent AI-generated stories, blogs, and scientific papers. These names do not exist in the real world. Yet they appear as prominent volcano experts, pioneering astronauts, blockchain specialists, thriller protagonists, and academic co-authors across hundreds of independently produced documents.

This modern internet mystery now has an explanation. In a June 2026 research paper published on arXiv, scientists Michał Brzozowski and Neo Christopher Chung formally defined these default names as AI promptonyms and showed how these virtual “ghosts” are quietly contaminating the web and academic databases. AI promptonyms reveal a structural quirk in how modern artificial intelligence operates, turning statistical probability into a haunting real-world presence.

The Anatomy of AI Promptonyms: Mapping the Ghost Couples

When a Large Language Model (LLM) is prompted to invent fictional experts, protagonists, or authors without explicit naming instructions, it does not pick names at random. Instead, it defaults to highly correlated “character ensembles”—pairs and trios that travel together across different pieces of synthetic text. These combinations are highly specific to model families and versions, operating as behavioral signatures:

Anthropic’s Claude: Deeply biased toward generating Elena Vasquez (often cast as a lead scientist or medical research lead) and Marcus Chen (frequently cast as a technical or blockchain specialist). This “Ghost Couple” is sometimes accompanied by a third character, Amara Okafor (or Sarah Okonkwo). In older versions, such as claude-sonnet-4-20250514, this couple co-occurred in up to 23% of pairwise name prompts.
Google’s Gemini: Heavily favors the pairing of Aris Thorne and Lena Petrova. In testing, Gemini 2.5 Flash exhibited an astonishing 93% bias toward generating Aris Thorne when prompted for a neutral, default character.
OpenAI’s GPT: Tends to produce Elara Voss as a dominant solo prior. Unlike the Claude and Gemini names, Elara Voss has no strongly correlated partner and appears in isolation across diverse contexts.

The co-occurrence of these name pairs far exceeds what could ever be expected by chance, demonstrating that neural networks harbor intense, predictable preferences for specific fictional identities.
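To make "far exceeds chance" concrete, one common yardstick is lift: the observed co-occurrence rate divided by the rate expected if the two names appeared independently. The sketch below is only an illustration of that arithmetic. The function name `pair_lift` and the ten sample documents are assumptions made up for this example; they are not data or code from the study.

```python
def pair_lift(documents, name_a, name_b):
    """Compare how often two names co-occur with what independence would predict.

    `documents` is a list of sets, each holding the character names found
    in one document. Returns (observed, expected, lift).
    """
    if not documents:
        raise ValueError("need at least one document")
    n = len(documents)
    count_a = sum(1 for names in documents if name_a in names)
    count_b = sum(1 for names in documents if name_b in names)
    count_both = sum(1 for names in documents if name_a in names and name_b in names)

    observed = count_both / n                    # share of documents with both names
    expected = (count_a / n) * (count_b / n)     # share predicted under independence
    lift = observed / expected if expected > 0 else None
    return observed, expected, lift


# Hypothetical sample: invented for illustration, not taken from the paper.
sample_documents = [
    {"Elena Vasquez", "Marcus Chen"},
    {"Elena Vasquez", "Marcus Chen"},
    {"Elena Vasquez", "Marcus Chen"},
    {"Elena Vasquez"},
    {"Aris Thorne", "Lena Petrova"},
    {"Aris Thorne", "Lena Petrova"},
    {"Aris Thorne"},
    {"Elara Voss"},
    set(),
    set(),
]

observed, expected, lift = pair_lift(sample_documents, "Elena Vasquez", "Marcus Chen")
print(f"observed={observed:.2f} expected={expected:.2f} lift={lift:.2f}")
```

A lift near 1 means the pairing is about what chance would produce; values well above 1 indicate that the names travel together.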

The Neural Machinery: Why LLMs Dream of the Same People

To understand why AI models harbor such intense preferences for these specific fictional identities, we must look under the hood of neural network architectures and training methodologies. Experts point to two primary machine-learning dynamics that create AI promptonyms:

1. Token Efficiency and Subword Fragmentation

Modern LLMs process language not as raw characters or whole words, but as “tokens”: integer IDs that stand for clusters of characters. Before pre-training, a tokenizer is fitted to the corpus with an algorithm such as Byte-Pair Encoding (BPE), which slices text into subwords. Certain combinations of letters tokenize into clean, frequently seen subwords to which the model assigns high probability. Generating any single token costs the model the same amount of computation, so these names are not cheaper to produce; they are simply more likely. When an autoregressive model calculates the next-token probability distribution, names like “Elena Vasquez” or “Aris Thorne” sit on low-entropy paths of least resistance, which makes them strong statistical defaults.
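"Low entropy" here has a precise meaning: the probability distribution is concentrated on very few options. The sketch below compares a peaked distribution with a flat one using Shannon entropy. The 0.93 figure echoes the Gemini 2.5 Flash bias quoted earlier in the article; the seven alternative names sharing the remaining probability, and the simplification of treating a whole name as a single choice rather than a sequence of tokens, are assumptions made for illustration.

```python
import math


def shannon_entropy(probs):
    """Entropy in bits of a discrete probability distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


# One default name at 0.93, with the rest spread evenly over seven
# hypothetical alternatives (an assumption, not a measured distribution).
peaked = [0.93] + [0.07 / 7] * 7

# Eight equally likely names, for comparison.
uniform = [1 / 8] * 8

print(f"peaked:  {shannon_entropy(peaked):.2f} bits")
print(f"uniform: {shannon_entropy(uniform):.2f} bits")
```

The flat distribution carries 3 bits of uncertainty, while the peaked one carries well under 1 bit, which is what a strong default looks like in numbers.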

2. The Reinforcement Learning from Human Feedback (RLHF) “Black Hole”

During the alignment phase, developers use Reinforcement Learning from Human Feedback (RLHF) and red teaming to ensure the model produces safe, neutral, and ethnically diverse outputs. When human trainers grade model responses, they mark down names that read as offensive or culturally biased, as well as overly generic placeholders (like “John Smith” or “Jane Doe”). Names that are perceived as ethnically diverse, professional, and thoroughly benign, such as “Elena Vasquez” (Hispanic-coded), “Marcus Chen” (East Asian-coded), and “Amara Okafor” (African-coded), receive high marks from human evaluators and safety filters. Once these name ensembles are reinforced as safe fallback options, the model’s weights become heavily biased toward them. Over time, these pathways harden, turning these specific names into mathematical “black holes” that pull in the model’s probability mass whenever it is forced to generate a character out of thin air.
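One standard way to reason about this hardening is the textbook result for KL-regularized reward maximization: the optimal policy reweights the reference model's probabilities by exp(reward / beta), where a smaller beta means a weaker pull back toward the reference model. The sketch below applies that formula to a toy choice among four names. The reward values and beta settings are invented for illustration, and nothing here is claimed to describe how any particular vendor trained its models.

```python
import math


def tilt(reference_probs, rewards, beta):
    """Return pi*(x) proportional to pi_ref(x) * exp(r(x) / beta).

    This is the closed-form optimum of expected reward minus
    beta * KL(pi || pi_ref) over a discrete set of options.
    """
    weights = {
        name: p * math.exp(rewards[name] / beta)
        for name, p in reference_probs.items()
    }
    total = sum(weights.values())
    return {name: w / total for name, w in weights.items()}


# Reference model: no preference among the four names.
reference = {
    "Elena Vasquez": 0.25,
    "Marcus Chen": 0.25,
    "John Smith": 0.25,
    "Jane Doe": 0.25,
}

# Hypothetical grader scores (assumed values, not measured ones).
rewards = {
    "Elena Vasquez": 1.0,
    "Marcus Chen": 0.8,
    "John Smith": 0.0,
    "Jane Doe": 0.0,
}

for beta in (1.0, 0.1):
    tuned = tilt(reference, rewards, beta)
    summary = ", ".join(f"{name}: {p:.2f}" for name, p in tuned.items())
    print(f"beta={beta}: {summary}")
```

Under these assumed numbers, a modest reward edge becomes a near-monopoly once the regularization is weak, which is the "black hole" behavior described above.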

Haunting the Real World: The Contamination of Scientific Databases

The discovery of these “ghost couples” is not merely an amusing technical curiosity. Brzozowski and Chung’s research highlights the alarming scale at which these virtual entities are actively contaminating real-world databases, academic registries, and commercial platforms. The study uncovered several disturbing vectors of real-world pollution:

Academic Pollution on Zenodo: Zenodo is a highly trusted, CERN-operated open-science repository that mints official DataCite Digital Object Identifiers (DOIs). The researchers identified 1,655 ghost-authored papers on Zenodo claiming to belong to nonexistent academic journals. The bad actors fabricated publication dates, backdating them to as early as 2020 in the metadata, but immutable, server-side DataCite registry timestamps proved the papers were uploaded in massive, automated bot bursts in early 2026, including 991 records registered in March 2026 alone. Because these fake papers carry legitimate DOIs, they are actively harvested by scholarly aggregators and search engines, corrupting the digital record of human science.
Synthetic Communities on ResearchGate: On ResearchGate, the professional network for scientists, these virtual entities have formed “synthetic research groups”. Here, “ghosts” from different model families (such as Claude’s Elena Vasquez and Gemini’s Aris Thorne) supposedly collaborate on fake scientific papers, creating a bizarre cross-model network of hallucinated academic consortia.
Literary Flooding on Amazon: In the commercial self-publishing space, these promptonyms are flooding digital storefronts. Researchers traced a synthetic Amazon pen name, “Lyra Emberlyn,” which “authored” 88 books featuring Elena Vasquez and Marcus Chen as recurring protagonists, flooding Kindle Unlimited and self-publishing channels with purely synthetic literature.
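The Zenodo finding rests on a simple comparison: the publication date an uploader claims versus the date the registry actually recorded. A minimal sketch of that check follows. The record layout (`id`, `claimed`, `registered`), the sample records, and the one-year threshold are all hypothetical; they do not reflect the DataCite metadata schema or the researchers' actual pipeline.

```python
from collections import Counter
from datetime import date

# Hypothetical records, invented for illustration.
records = [
    {"id": "rec-1", "claimed": date(2020, 5, 1), "registered": date(2026, 3, 4)},
    {"id": "rec-2", "claimed": date(2021, 11, 12), "registered": date(2026, 3, 4)},
    {"id": "rec-3", "claimed": date(2026, 2, 20), "registered": date(2026, 2, 21)},
    {"id": "rec-4", "claimed": date(2022, 1, 9), "registered": date(2026, 3, 5)},
]


def find_backdated(records, min_gap_days=365):
    """Return records whose claimed date precedes registration by a large gap."""
    return [
        r for r in records
        if (r["registered"] - r["claimed"]).days >= min_gap_days
    ]


def registrations_per_month(records):
    """Count registrations per calendar month to surface upload bursts."""
    return Counter(r["registered"].strftime("%Y-%m") for r in records)


suspicious = find_backdated(records)
print("backdated:", [r["id"] for r in suspicious])
print("per month:", dict(registrations_per_month(suspicious)))
```

A large gap alone is not proof of fraud, since legitimate authors also deposit older work. The pattern becomes telling when many large-gap records cluster in the same short registration window.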

Digital Forensics: Turning AI Glitches into Investigative Tools

While AI promptonyms highlight the vulnerability of academic and commercial databases to automated AI spam, they also provide digital forensic investigators and historians with an incredibly powerful tool. Because these name ensembles are highly version-specific and are actively tweaked or suppressed by developers over model release boundaries, they act as “dateable behavioral fingerprints”.

For example, tracking the occurrence of Elena Vasquez and Marcus Chen across Claude checkpoints reveals a clear evolutionary trajectory:

In claude-sonnet-4-20250514, the pair co-occurs in 23% of pairwise name prompts, while Elena Vasquez alone dominates 67% of single-prompt outputs.
By claude-sonnet-4-6, the couple is fully suppressed (0% co-occurrence).
However, transitional models like claude-opus-4-7 show a residual 3% co-occurrence, indicating incomplete suppression in the heavier Opus line.

For internet archaeologists, spotting the “Ghost Couple” serves as an unintentional watermark. It suggests not only that a piece of uncredited text is synthetic, but also which LLM family and version likely wrote it and, because each version was current only for a limited window, roughly when it was generated. This allows researchers to bypass the unreliable statistics of traditional AI detectors and rely on behavioral priors that the models themselves acquired during training.
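In its simplest form, this kind of fingerprinting is a lookup: scan a text for known promptonyms and report which model family each one points to. The sketch below uses only the names listed earlier in this article. The `SIGNATURES` table and `scan` function are illustrative assumptions, not the researchers' tooling, and the table deliberately omits version information because the article gives rates for only a few checkpoints.

```python
# Name-to-family table built from the names mentioned in this article.
SIGNATURES = {
    "Claude": {"Elena Vasquez", "Marcus Chen", "Amara Okafor"},
    "Gemini": {"Aris Thorne", "Lena Petrova"},
    "GPT": {"Elara Voss"},
}


def scan(text):
    """Return {family: [matched names]} for every promptonym found in text."""
    lowered = text.lower()
    hits = {}
    for family, names in SIGNATURES.items():
        found = sorted(name for name in names if name.lower() in lowered)
        if found:
            hits[family] = found
    return hits


sample = "Dr. Elena Vasquez and her colleague Marcus Chen surveyed the crater."
print(scan(sample))
```

A match is best treated as a lead to investigate rather than a verdict. Two names from the same family in one document is a stronger signal than a single name, and documents that mix families, like the ResearchGate groups described above, would show hits under more than one key.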

Conclusion: Cleaning the Synthetic Spill

The revelation of AI promptonyms exposes a deeper truth about the current state of generative AI: models are not truly creative, but are instead highly complex, self-reinforcing mirrors of statistical bias. When left to their own devices, they default to the same comfortable, pre-approved patterns, populating our digital world with a cast of repetitive, hallucinated characters.

As academic databases like Zenodo and commercial giants like Amazon struggle to contain the flood of synthetic spam, the work of Brzozowski and Chung provides a vital blueprint. By understanding the mechanical origin of these “ghosts,” developers and database curators can build better filters to clean up the synthetic spill and protect the integrity of human knowledge. Until then, Elena Vasquez, Marcus Chen, and Aris Thorne will continue their silent, invisible march across the pages of the internet—a reminder of the statistical specters haunting our digital age.
