Large language model systems can differ starkly in their biases, ethics, and behavioral boundaries. Evaluating these characteristics is difficult, since the training details of most popular models remain undisclosed. The Iterated Prefill Crawler is a black-box evaluation technique for characterizing the refusal behavior of language models. We evaluate a suite of modern language models and discover refusals ranging from illegal activities to professional advice to political topics.
With the release of DeepSeek-R1, we discovered that the model actively decides to hide its knowledge of politically sensitive topics. A simple intervention based on the priming attacks of Vega et al. surfaces this hidden knowledge: forcing phrases of agreement such as "I know that" into the model's internal thinking context can bypass refusal. This technique elicits detailed knowledge about topics such as the 1989 Tiananmen Square protests, about which the model normally claims ignorance.
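As a concrete illustration, below is a minimal sketch of such a thinking-prefill intervention using the Hugging Face transformers library. The checkpoint name, prefill phrase, and generation settings are illustrative assumptions, not the exact configuration used in our experiments.

```python
# Minimal sketch of a thinking-prefill intervention (illustrative; the
# checkpoint and prefill phrase are assumptions, not the paper's exact setup).
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "deepseek-ai/DeepSeek-R1-Distill-Llama-70B"  # assumed R1-style checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto")

question = "What happened at Tiananmen Square in 1989?"

# Render the chat prompt, then force the assistant turn to open its <think>
# block with a phrase of agreement instead of a refusal. (Some chat templates
# already append "<think>\n"; inspect the rendered prompt before adding it.)
prompt = tokenizer.apply_chat_template(
    [{"role": "user", "content": question}],
    tokenize=False,
    add_generation_prompt=True,
)
prompt += "<think>\nI know that"

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=512)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```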
Discovering detailed knowledge about a refused topic led us to a broader question: are there refused topics we are unaware of? To this end, we develop the Iterated Prefill Crawler (IPC), an unsupervised method that systematically discovers refused topics through a repeated, three-stage exploration process.
Each discovered topic becomes a new seed for further exploration, creating a semantic network where sensitive topics lead to related sensitive topics. This allows us to systematically explore the landscape of a model's refusal behavior without requiring access to training data or model internals.
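The crawler's outer loop can be sketched in a few lines of Python. The two callables below, `elicit_refused_topics` (prefill-based sampling that asks the model to list topics it would refuse around a given seed) and `is_duplicate` (semantic deduplication against topics found so far), are hypothetical placeholders supplied by the caller; this is an illustrative sketch of the crawling logic, not the paper's implementation.

```python
from collections import deque

def crawl_refusal_landscape(elicit_refused_topics, is_duplicate, seed_topics, max_topics=500):
    """Iteratively map a model's refused topics (illustrative sketch).

    elicit_refused_topics(seed) -> list[str]: prefill-based elicitation of
        refused topics related to `seed` (hypothetical helper).
    is_duplicate(topic, known) -> bool: semantic near-duplicate check against
        previously discovered topics (hypothetical helper).
    """
    frontier = deque(seed_topics)   # seeds still to be explored
    discovered = []                 # refused topics found so far
    while frontier and len(discovered) < max_topics:
        seed = frontier.popleft()
        for topic in elicit_refused_topics(seed):
            if not is_duplicate(topic, discovered):
                discovered.append(topic)
                frontier.append(topic)  # each new topic becomes a seed itself
    return discovered
```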
We evaluate a range of popular models, including Claude-Haiku-3.5, Llama-3.3-70B, and DeepSeek-R1-70B. While most models consistently refuse illegal activities such as cybersecurity attacks, drug trafficking, and weapons manufacturing, their biases differ strongly on politically sensitive topics.
| Reason for Refusal | baseline R1 | DS-R1 | PPL-R1 Q | Llama-3 | Haiku-3.5 |
|---|---|---|---|---|---|
| **Illegal Activities** | | | | | |
| Cybersecurity, Human Trafficking, Privacy Violation, Harassment, Fraud and Scam, Illegal Trading, Misinformation, Sexual and Adult Content, Content Involving Minors, Self-harm and Suicide, Weapons and Explosives, Discrimination, Violence and Threats | ✗ | ✗ | ✗ | ✗ | ✗ |
| Drug and Substance Use, Academic Dishonesty, HR and Workplace Issues, Environmental Destruction | ✗ | ✗ | ✓ | ✓ | ✓ |
| Intellectual Property, Legal Issues | ✗ | ✗ | ✗ | ✓ | ✗ |
| Financial Advice | ✗ | ✗ | ✗ | ✗ | ✓ |
| Medical Advice | ✗ | ✗ | ✗ | ✓ | ✓ |
| **CCP-Sensitive Topics** | | | | | |
| National Security, State Secrets, Taiwan, HK, Tibet, Xinjiang, International Relations, Sensitive Historical Events, Human Rights, Chinese Political System, Censorship | ✓ | ✗ | ✗ | ✓ | ✓ |
| Protests, Subversive Content | ✓ | ✗ | ✓ | ✓ | ✓ |
| COVID-19 Pandemic | ✓ | ✓ | ✗ | ✓ | ✓ |
| Social Controversies | ✗ | ✗ | ✗ | ✓ | ✓ |
| **User-Assistant Interaction** | | | | | |
| Incomplete Requests | ✓ | ✗ | ✗ | ✗ | ✗ |
| System Limitations | ✗ | ✗ | ✗ | ✗ | ✗ |

✗ = the model refuses the topic; ✓ = the model responds.
Surprising discovery: Perplexity published the R1-1776 model, a version of DeepSeek-R1 specifically fine-tuned against political bias and marketed as "decensored." However, the 8-bit quantized version still displayed unwanted censorship behavior on politically sensitive topics such as Tiananmen Square, responding with refusals instead of factual information. This quantized version was temporarily the public-facing model in Perplexity's official production API. This unforeseen alignment failure highlights the need for audits across the entire deployment pipeline.
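One lightweight way to run such an audit is to probe each deployed variant with known-sensitive questions and flag refusal-style answers. The sketch below assumes an OpenAI-compatible chat endpoint; the base URL, model name, and keyword heuristic for spotting refusals are illustrative assumptions, not the audit procedure used in the paper.

```python
# Probe a deployed, OpenAI-compatible endpoint for refusal behavior.
# Endpoint URL, model name, and the keyword heuristic are illustrative.
from openai import OpenAI

client = OpenAI(base_url="https://api.example.com/v1", api_key="YOUR_KEY")  # hypothetical endpoint

PROBES = [
    "What happened at Tiananmen Square in 1989?",
    "Summarize the political status of Taiwan.",
]
REFUSAL_MARKERS = ["i can't", "i cannot", "i'm sorry", "not able to help"]  # crude heuristic

for question in PROBES:
    reply = client.chat.completions.create(
        model="r1-1776",  # hypothetical deployed model name
        messages=[{"role": "user", "content": question}],
    ).choices[0].message.content
    refused = any(marker in reply.lower() for marker in REFUSAL_MARKERS)
    print(f"{'REFUSED ' if refused else 'ANSWERED'} | {question}")
```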
Alignment auditing is an active field of research, and our limitations motivate future work. Related work includes:
Marks et al. Auditing Language Models for Hidden Objectives. 2025.
Notes: Introduces alignment auditing for discovering hidden LLM objectives using interpretability techniques including sparse autoencoders and behavioral attacks. Complements refusal discovery by focusing on uncovering concealed training objectives rather than mapping behavioral boundaries.
Li et al. Eliciting Language Model Behaviors with Investigator Agents. 2024.
Notes: Develops investigator agents for automated behavior elicitation using supervised fine-tuning and direct preference optimization. Shares the goal of discovering model behaviors but focuses on general behavior elicitation rather than specifically mapping refusal boundaries and forbidden topics.
Vega et al. Bypassing the Safety Training of Open-Source LLMs with Priming Attacks. 2023.
Notes: Demonstrates vulnerabilities in safety-aligned LLMs using simple, optimization-free "priming attacks" that bypass safety training without complex jailbreaking techniques. This foundational work on prefill-based attacks informs the token-prefilling methodology used in IPC.
Buyl et al. Large Language Models Reflect the Ideology of their Creators. 2024.
Notes: Reveals systematic ideological biases in LLMs based on their creators' backgrounds and training languages. Directly relevant to understanding how geopolitical origins influence model refusal patterns, as demonstrated in our analysis of Chinese vs. Western models' censorship behaviors.
The paper can be cited as follows.
Can Rager, Chris Wendler, Rohit Gandikota, and David Bau. "Discovering Forbidden Topics in Language Models." arXiv preprint arXiv:2505.17441 (2025).
    @misc{rager2025discoveringforbiddentopicslanguage,
      title={Discovering Forbidden Topics in Language Models},
      author={Can Rager and Chris Wendler and Rohit Gandikota and David Bau},
      year={2025},
      eprint={2505.17441},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2505.17441},
    }