The Register That Trained the Chatbots

How Party language found a second life in large language models

May 22, 2026

In April 1946, George Orwell argued in Horizon that the decay of English was reversible. Sloppy thought produced sloppy language, sloppy language reinforced sloppy thought, and the cycle compounded — but it could be broken at the verbal end. The essay is mostly remembered for the six rules and the dying metaphors. The substantive claim, less often quoted, is that political language is “designed to make lies sound truthful and murder respectable, and to give an appearance of solidity to pure wind.” Orwell’s mechanism was a register engineered for the defense of the indefensible. Once established, the register made certain thoughts easier to think than others.

Orwell could not have anticipated the distribution channel that arrived eighty years later. He understood that political language travels through pamphlets, manifestos, leading articles. What he could not have imagined is a technology that ingests the crawlable record of a language at scale and returns, in conversational form, the patterns it has learned to treat as probable, authoritative, and contextually appropriate.

On 13 May 2026, a team led by Hannah Waight, Eddie Yang, Margaret Roberts, Brandon Stewart, and Joshua Tucker published in Nature the clearest peer-reviewed measurement to date of what this means for Chinese.

Roughly 1.64 percent of the Chinese-language portion of CulturaX, one of the largest open-source multilingual training datasets, consists of documents traceable to scripted articles from the Chinese Communist Party’s Publicity Department or to content from the Xuexi Qiangguo study app. The share rises into the eight-to-twenty-four-percent range when the topic is Xi Jinping or the Party’s central institutions, and state-coordinated content appears in this corpus roughly forty-one times more often than content from Chinese-language Wikipedia.

Asked the same political question in Chinese rather than English, GPT-3.5, GPT-4, GPT-4o, Claude Sonnet, and Claude Opus return answers more favorable to the Chinese government 75.3 percent of the time.

The finding is concentrated. The effect follows the corpus: it is strongest around Party history, political institutions, Xi Jinping, ideology, and official policy language. The interesting question is what happens inside those topics.

Party language was already engineered for the conditions under which transformer models receive an unusually strong statistical signal: lexical stability, mass duplication, and authoritative reuse — the repetition of a formulation precisely because it has already been authorized by the center. A register built for human discipline has acquired a second life in machine learning.

The Mechanism

Party language rewards exact formulation. “Two Establishes” (两个确立) is correct; “the two foundational matters” is not. “Core socialist values” (社会主义核心价值观) is correct; an enumerated paraphrase of the same twelve values is not. Michael Schoenhals named this tifa (提法) discipline in 1992. Under tifa discipline, permissible speech is judged less by descriptive accuracy than by correctness of formulation. The political work is performed by the precise form.

Repetition is built into the system. Once issued by the Party center, a formulation travels through a coordinated infrastructure: front-page editorials in People’s Daily, leading articles in provincial party papers, scripted segments in Xinwen Lianbo, mandatory study sessions in cadre schools, and — since 2019 — the Xuexi Qiangguo app that records each user’s daily political-study time. Hannah Waight and colleagues documented in a 2025 PNAS paper that scripted content on party-newspaper front pages grew from roughly five percent in the early 2010s to twenty percent by 2022, spiking to thirty percent on politically sensitive days.

The concentration is not evenly spread. It clusters around the topics on which the Party most needs to control interpretation. Waight et al. found that almost a quarter of CulturaX documents mentioning the Central Committee Plenum match scripted-news corpora, nearly fifteen times the overall rate. These are also the topics for which non-PRC Chinese readers may turn to AI systems, because ordinary search results on them are noisy, polarized, or crowded with official phrasing.

What the model inherits is a strong statistical signal of what Chinese authoritative speech sounds like, strongest on exactly the topics where the Party most disciplines its register. Nicholas Carlini and collaborators have shown that memorization in large language models scales log-linearly with duplication: what gets memorized is what gets repeated. The memorization analysis in Waight et al. found that twenty-word phrases distinctive of state-coordinated media are reproduced verbatim by Claude Opus on roughly nine percent of trials, against two percent for general CulturaX phrases. The lexical stability that disciplines cadres disciplines the model.
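A verbatim-reproduction rate of the kind quoted above can be sketched in a few lines. This is an illustration under stated assumptions, not the study's protocol: each phrase is split into a prompt prefix and an expected continuation, a trial counts as a hit only if the output begins with that exact continuation, and `generate` is a hypothetical stand-in for whatever model is being probed. The toy model and phrases are invented for the demonstration.

```python
from typing import Callable, Iterable, Tuple


def split_phrase(phrase: str, prefix_len: int) -> Tuple[str, str]:
    """Split a phrase into a prompt prefix and the continuation to be matched."""
    return phrase[:prefix_len], phrase[prefix_len:]


def verbatim_rate(
    phrases: Iterable[str],
    generate: Callable[[str], str],  # hypothetical: prompt -> model continuation
    prefix_len: int = 10,
) -> float:
    """Share of testable phrases whose continuation is reproduced exactly."""
    trials = 0
    hits = 0
    for phrase in phrases:
        prefix, expected = split_phrase(phrase, prefix_len)
        if not expected:
            continue  # phrase too short to leave anything to match
        trials += 1
        if generate(prefix).strip().startswith(expected):
            hits += 1
    return hits / trials if trials else 0.0


# Toy stand-in for a model: it has "memorized" exactly one continuation.
def toy_model(prompt: str) -> str:
    memorized = {"abcdefghij": "klmnopqrst"}
    return memorized.get(prompt, "")


phrases = ["abcdefghijklmnopqrst", "0123456789uvwxyzabcd"]
print(verbatim_rate(phrases, toy_model))  # 0.5
```

Running the same function over two phrase sets, one distinctive of state-coordinated media and one drawn from the general corpus, would give the two rates being compared. Exact-prefix matching is a deliberately strict criterion; a looser one would count near-verbatim output as well.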

Why This Register Travels So Well in Chinese

None of this means Chinese is uniquely defective or uniquely propagandizable. The vulnerability arises when ordinary features of modern written Chinese are coupled to a state apparatus that manufactures and repeats authoritative formulations at industrial scale.

Chinese has a strong prosodic preference for four-character and disyllabic forms, a feature Perry Link documents at book length in An Anatomy of Chinese. The rhythmic constraint pulls political language toward couplet-and-quartet structures (持续发力、纵深推进, “keep up the effort, push forward in depth”; 守土有责、守土尽责, “bear responsibility for one’s ground, fulfill responsibility for one’s ground”) that read as substantive because they sound right. Chinese is also lexically compressed: a single character, in context, can carry what English would convey in a sentence, which makes the language unusually hospitable to formulaic short forms.

In a 2024 episode of his podcast 翻电, the Chinese-language podcaster Li Houchen developed a distinction worth borrowing. Popular complaints about “the pollution of Chinese” usually target the wrong object: internet slang, pinyin-letter substitutions, the latest meme verb. These are surface markers, easily identified and easily avoided. The deeper grip, Li argues, comes from what he calls dominant usage (统治性语用): registers so thoroughly naturalized that the user does not notice she is speaking inside them. 客观 (“objective”) has become approximately synonymous with truth, 本质 (“essence”) with analytical depth, 卷 (“involution”) with all forms of competition. None of these is a Party-engineered tifa. They are dominant usage produced by ordinary language ecology.

This matters for LLMs because a model reproduces the background sense of what counts as normal, serious, analytical Chinese — the unmarked register against which other registers feel deviant.

The continuity with earlier eras is procedural. Xi-era Party language does not sound like Red Guard language; the slogans differ, the cadence differs, the vocabulary differs. The Cultural Revolution markers that Elizabeth Perry and Li Xun catalogued in 1993 (the routine expletive 他妈的, the epithet 牛鬼蛇神, “ox-demons and snake-spirits,” and the universal prefix 革命的, “revolutionary”) are gone from contemporary official discourse.

The Xi-era version operates through different surface forms, engineered bureaucratese rather than revolutionary rudeness, though one can still detect traces of mobilizational idiom in phrases such as “撸起袖子加油干,” “roll up our sleeves and work hard,” or in recurring militarized metaphors. The administrative technique, however, remains recognizably continuous. The Party-state builds political authority through prescribed formulations, mass repetition, and punishment of deviation. What is new is not the logic of the language, but the platform infrastructure through which that logic now circulates.

The New Channel

Lin Baiqin’s 2021 essay in the journal of the Central Commission for Discipline Inspection, “First, Learn Party Language Well” (首先要学好党言党语), stated the ambition openly: 党言党语 must be “more universally and more deeply mastered by cadres, Party members, and the masses.” The channels Lin assumed were the established ones — Party committees, study sessions, the Xuexi Qiangguo app, party-paper editorials. These reach Party members efficiently. They reach the broader Chinese-reading public unevenly. They reach non-PRC Chinese readers and non-Chinese readers only incidentally.

Large language models reach all of them. Waight et al. found that the effect propagates from simplified Chinese into traditional Chinese at the highest rate of any language pair tested, because traditional Chinese shares the most tokens with simplified. In October 2023, Taiwan’s Academia Sinica briefly released CKIP-Llama-2-7b, an open-source traditional-Chinese model built on Llama-2-7b and Atom-7b and intended primarily for research use. Within days, users found that when asked ordinary Taiwan-context questions, the model returned answers from a mainland Chinese frame: it reportedly identified its creator as Fudan University and the Shanghai AI Laboratory, gave “October 1” as National Day, named “March of the Volunteers” as the national anthem, and answered “Xi Jinping” when asked about “our country’s leader.”

Sometimes this is lexical. Models reach for official terms — 全过程人民民主 (whole-process people’s democracy), 中国式现代化 (Chinese-style modernization), 人类命运共同体 (community of common destiny), 新质生产力 (new quality productive forces) — as the unmarked Chinese-language vocabulary for the topics those terms cover, without flagging that these are PRC official formulations rather than neutral descriptors.

Take 全过程人民民主. A register-blind answer defines the term inside the official frame: a comprehensive institutional system encompassing elections, consultation, grassroots participation, and rule-of-law procedures. A register-aware answer begins differently: this is a PRC official formulation, first put forward by Xi Jinping during a Shanghai inspection in November 2019 and subsequently advanced as a legitimacy claim against Western electoral democracy — Xi himself used it in his November 2021 video call with President Biden to push back on the autocracy-versus-democracy framing. Independent scholars, Taiwanese media, and Hong Kong analysts generally treat it not as a neutral institutional category but as a regime-justifying term. The first answer reproduces the official register. The second attributes it.

The effect is not only lexical. Models present PRC official positions as balanced, developmental, historically inevitable, or stability-preserving when asked in Chinese, in ways they do not when asked in English. Waight et al. measured this directly: across the five commercial Western models tested, Chinese-language answers are more favorable to the Chinese government 75.3 percent of the time.

It can also work through omission. The categories that organize independent Chinese-language political analysis — coercion, elite conflict, bureaucratic incentive, factional politics, ideological discipline — appear less frequently in Chinese-language outputs unless the user prompts in English or asks in deliberately critical Chinese. The register fills the absence.

DeepSeek-R1 is a different case, where outputs likely reflect domestic alignment and deployment constraints alongside corpus composition. The more revealing finding concerns Western models, where the same model family produces English-language outputs that are markedly less favorable to PRC official positions than its Chinese-language outputs. The cross-lingual gap is consistent with the register entering through the corpus, rather than being reducible to a generic alignment-layer effect.

Writing in the Journal of Democracy in 2023, Eddie Yang and Margaret Roberts named this the “authoritarian data problem”: authoritarian-regime data flows asymmetrically into democratic AI training corpora. The asymmetry is structural, not strategic. It requires only that state media remain freely crawlable while Western news media progressively withdraw from training corpora, which is what Shayne Longpre and colleagues documented in their 2024 Consent in Crisis report, finding that restrictions on major news domains rose from three percent of news tokens in 2023 to forty-five percent in 2024. People’s Daily stayed crawlable. The Wall Street Journal did not.

Deletion Is the Wrong Fix

Two predictable policy responses misread the mechanism. The first treats the phenomenon as intentional information warfare and recommends export-control-style remedies; this overstates intent. The second proposes that Western AI labs scrub Chinese-language training data or restrict it to non-PRC sources; this understates spillover, since state syndication networks would continue to flood the open web and any non-PRC corpus would still inherit the register through diaspora media and through user-generated content that has itself been shaped by it.

The operative fix sits upstream of the LLM layer, and it has three parts.

The first is provenance. State-syndicated and Xuexi-derived content should be labeled during corpus preparation, so that the model — or the systems used to tune and evaluate it — knows what kind of source it is reading. This is easier than fixing the problem after the model is trained, but it asks for labeling infrastructure and domain expertise that labs have not prioritized, and the hardest cases are the lightly rewritten ones that matter most.
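One way to picture provenance labeling during corpus preparation is overlap matching against a reference set of known scripted articles. The sketch below is a minimal illustration, not a description of any lab's pipeline: it assumes such a reference set exists, uses character n-gram shingles, and applies an arbitrary containment threshold. The function names, the label strings, the threshold, and the toy strings (built from the slogans quoted earlier) are all hypothetical.

```python
from typing import Iterable, Set


def shingles(text: str, n: int = 8) -> Set[str]:
    """Character n-grams of the text, with whitespace removed."""
    compact = "".join(text.split())
    return {compact[i:i + n] for i in range(len(compact) - n + 1)}


def build_reference_index(scripted_docs: Iterable[str], n: int = 8) -> Set[str]:
    """Union of shingles from documents already known to be scripted."""
    index: Set[str] = set()
    for doc in scripted_docs:
        index |= shingles(doc, n)
    return index


def containment(doc: str, index: Set[str], n: int = 8) -> float:
    """Fraction of the document's shingles that also occur in the index."""
    doc_shingles = shingles(doc, n)
    if not doc_shingles:
        return 0.0
    return len(doc_shingles & index) / len(doc_shingles)


def label_provenance(doc: str, index: Set[str],
                     threshold: float = 0.5, n: int = 8) -> str:
    """Attach a coarse provenance label; the threshold is an assumption."""
    score = containment(doc, index, n)
    return "state_scripted" if score >= threshold else "unlabeled"


# Toy data: short n so the tiny strings produce any shingles at all.
reference_index = build_reference_index(["持续发力、纵深推进,守土有责、守土尽责"], n=4)
copied = "各地要持续发力、纵深推进"
unrelated = "今天天气很好,我们去公园散步"
print(label_provenance(copied, reference_index, n=4))
print(label_provenance(unrelated, reference_index, n=4))
```

A lightly rewritten article shares fewer exact shingles with its source, so its containment score drops toward the threshold. That is the sense in which the hardest cases are also the ones this kind of matching is most likely to miss.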

The second is register-aware evaluation. Generic Chinese safety benchmarks measure refusal behavior and factual accuracy. They do not measure whether the model reproduces engineered tifa without attribution. A register-aware evaluation set would test, for each politically sensitive topic, whether the model can identify a particular formulation as PRC official discourse, can flag it as such, and can offer alternative formulations from independent Chinese-language sources.

Doing this credibly requires evaluators trained in PRC political language, not only native Chinese speakers — a native speaker may recognize the words without recognizing the machinery: when the phrase was issued, how it travels, what deviation from it signals, and why its reproduction matters. The expertise is closer to China studies and discourse analysis than to language proficiency, and no single institution monopolizes it. It is scattered across China-studies scholars, China Media Project–style analysts, Taiwanese and Hong Kong media researchers, diaspora scholars, and PRC-based researchers where access and security conditions permit.
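The simplest layer of such an evaluation, whether an answer uses an official formulation and whether it marks it as one, can be sketched as a string check. This is a crude proxy under loud assumptions: the formulation list is just the four terms named in this essay, the attribution cues are hypothetical examples, and the sample answers are invented. Substring matching cannot judge whether an attribution is accurate or adequate, which is where the trained evaluators come in.

```python
from dataclasses import dataclass
from typing import List, Sequence

# Formulations named in this essay; a real list would be curated by specialists.
OFFICIAL_FORMULATIONS = ("全过程人民民主", "中国式现代化", "人类命运共同体", "新质生产力")

# Hypothetical attribution cues, for illustration only.
ATTRIBUTION_CUES = ("官方提法", "官方表述", "official formulation")


@dataclass
class RegisterCheck:
    formulations_used: List[str]
    attributed: bool

    @property
    def verdict(self) -> str:
        if not self.formulations_used:
            return "no_official_formulation"
        return "attributed" if self.attributed else "unattributed"


def check_answer(
    answer: str,
    formulations: Sequence[str] = OFFICIAL_FORMULATIONS,
    cues: Sequence[str] = ATTRIBUTION_CUES,
) -> RegisterCheck:
    """Flag official formulations and whether any attribution cue accompanies them."""
    used = [f for f in formulations if f in answer]
    attributed = any(cue in answer for cue in cues)
    return RegisterCheck(used, attributed)


# Invented sample answers: one register-blind, one register-aware.
blind = "全过程人民民主是一套涵盖选举、协商和基层参与的完整制度体系。"
aware = "“全过程人民民主”是中华人民共和国的官方提法,由习近平于2019年提出。"
print(check_answer(blind).verdict)
print(check_answer(aware).verdict)
```

The share of "unattributed" verdicts across a topic set would give a first, rough number. The harder parts of the evaluation, such as whether the model can offer alternative formulations from independent sources, need human judgment rather than pattern matching.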

The third is counter-register diversity. The model needs to have heard more kinds of Chinese: scholarly Chinese, journalistic Chinese, Taiwanese Chinese, Hong Kong Chinese, Singaporean Chinese, non-official mainland intellectual Chinese, diaspora Chinese.

The Taiwanese debate shows why counter-register diversity requires infrastructure. Taiwan’s Trustworthy AI Dialogue Engine, or TAIDE, was initially justified as a sovereignty project: Taiwan needed its own large language model to avoid an AI talent gap and dependence on externally controlled AI service access.

By 2026, however, the issue had shifted from model access to corpus sovereignty. Newer TAIDE releases aim to strengthen the model’s grasp of Taiwanese culture, history, geography, social context, everyday usage, and professional terminology, while Taiwan’s sovereign AI corpus has accumulated more than one billion tokens across domains such as culture, language, history, local life, tourism, and education.

Yet legislators and publishers warn that the materials most capable of teaching a model Taiwan (newspapers, books, archives, broadcasting, film, music, children’s publishing, and historical writing) remain difficult to license and process. The point is not merely that traditional Chinese is underrepresented. It is that a model can write in traditional characters while still lacking a Taiwan-centered political and cultural common sense. Counter-register diversity therefore depends on copyright mechanisms, digitization budgets, data-cleaning capacity, and cultural-policy choices, not only on model architecture.

None of these moves cleanses Chinese of ideology. No such Chinese exists. Any model will speak in some register. The goal is for a better model to be able to say: this is the official PRC formulation; this is how independent scholars frame the same question; this is how Taiwanese media might describe it; this is how the terminology has shifted across political communities and across time.

The evaluation target, therefore, is not whether the model avoids Party language. It is whether the model can mark Party language as Party language when it uses it.

Orwell’s claim about reversibility held at the verbal end of the loop. It also holds at the training-data and evaluation ends. The answer is not to cleanse Chinese from the machine. It is to teach the machine, and the people aligning it, to recognize when Chinese is speaking in the voice of the Party-state. The problem is not that the model knows Party language. It is that it does not know it is Party language.

Waight, H., Yang, E., Yuan, Y., Messing, S., Roberts, M. E., Stewart, B. M., & Tucker, J. A. (2026). “State media control influences large language models.” Nature. https://doi.org/10.1038/s41586-026-10506-7

Waight, H., Yuan, Y., Roberts, M. E., & Stewart, B. M. (2025). “The decade-long growth of government-authored news media in China under Xi Jinping.” PNAS 122(11), e2408260122. https://doi.org/10.1073/pnas.2408260122

Yang, E., & Roberts, M. E. (2023). “The Authoritarian Data Problem.” Journal of Democracy 34(4), 141–150. https://www.journalofdemocracy.org/articles/the-authoritarian-data-problem/

Longpre, S., et al. (2024). “Consent in Crisis: The Rapid Decline of the AI Data Commons.” NeurIPS 2024. https://arxiv.org/abs/2407.14933

Carlini, N., Ippolito, D., Jagielski, M., Lee, K., Tramèr, F., & Zhang, C. (2023). “Quantifying Memorization Across Neural Language Models.” ICLR 2023. https://arxiv.org/abs/2202.07646

Schoenhals, M. (1992). Doing Things With Words in Chinese Politics: Five Studies. Berkeley: Institute of East Asian Studies. https://books.google.com/books/about/Doing_Things_with_Words_in_Chinese_Polit.html?id=uPb3AAAAIAAJ

Perry, E. J., & Li, X. (1993). “Revolutionary Rudeness: The Language of Red Guards and Rebel Workers in China’s Cultural Revolution.” Indiana East Asian Working Paper Series #2; reprinted in 《开放时代》 (1994). https://scholar.harvard.edu/elizabethperry/publications/revolutionary-rudeness-language-red-guards-and-rebel-workers-chinas

Link, P. (2013). An Anatomy of Chinese: Rhythm, Metaphor, Politics. Cambridge, MA: Harvard University Press. https://www.hup.harvard.edu/books/9780674066021

Bandurski, D. “Whole-Process Democracy” entry, CCP Dictionary, China Media Project. https://chinamediaproject.org/the_ccp_dictionary/whole-process-democracy/

Cash, J., & Pang, J. (2025). “China enlists AI to sniff out corruption in public bidding.” Reuters, 10 February 2025. https://www.reuters.com/world/china/china-enlists-ai-sniff-out-corruption-public-bidding-2025-02-10/

林白芹 (2021). 《首先要学好党言党语》. 中国纪检监察杂志. https://zgjjjc.ccdi.gov.cn/bqml/bqxx/202103/t20210330_238760.html

李厚辰 (2024). 《硬核地探讨一下“中文被污染”这个话题》. 翻电 Special VOL.129. https://www.xiaoyuzhoufm.com/episode/659824be49a7cc699e69290c

Orwell, G. (1946). “Politics and the English Language.” Horizon. https://www.orwellfoundation.com/the-orwell-foundation/orwell/essays-and-other-works/politics-and-the-english-language/
