<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:itunes="http://www.itunes.com/dtds/podcast-1.0.dtd" xmlns:googleplay="http://www.google.com/schemas/play-podcasts/1.0"><channel><title><![CDATA[Jinyi's Substack]]></title><description><![CDATA[My personal Substack]]></description><link>https://jinyili.substack.com</link><image><url>https://substackcdn.com/image/fetch/$s_!hu_z!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2c39c137-795d-4e06-96aa-2ce6a0e34673_144x144.png</url><title>Jinyi&apos;s Substack</title><link>https://jinyili.substack.com</link></image><generator>Substack</generator><lastBuildDate>Fri, 08 May 2026 05:48:52 GMT</lastBuildDate><atom:link href="https://jinyili.substack.com/feed" rel="self" type="application/rss+xml"/><copyright><![CDATA[Jinyi Li]]></copyright><language><![CDATA[en]]></language><webMaster><![CDATA[jinyili@substack.com]]></webMaster><itunes:owner><itunes:email><![CDATA[jinyili@substack.com]]></itunes:email><itunes:name><![CDATA[Jinyi Li]]></itunes:name></itunes:owner><itunes:author><![CDATA[Jinyi Li]]></itunes:author><googleplay:owner><![CDATA[jinyili@substack.com]]></googleplay:owner><googleplay:email><![CDATA[jinyili@substack.com]]></googleplay:email><googleplay:author><![CDATA[Jinyi Li]]></googleplay:author><itunes:block><![CDATA[Yes]]></itunes:block><item><title><![CDATA[Why RLHF Will Never Solve Sycophancy]]></title><description><![CDATA[AI Alignment Is Treating Values as a Flat List. They're a Circumplex.]]></description><link>https://jinyili.substack.com/p/why-rlhf-will-never-solve-sycophancy</link><guid isPermaLink="false">https://jinyili.substack.com/p/why-rlhf-will-never-solve-sycophancy</guid><dc:creator><![CDATA[Jinyi Li]]></dc:creator><pubDate>Fri, 01 May 2026 02:20:13 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!S1Qa!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F630cdaed-0202-4dc3-a828-7e4a28b30617_1310x1130.png" length="0" type="image/png"/><content:encoded><![CDATA[<p>The April 2025 GPT-4o sycophancy rollback was treated by most coverage as a deployment accident &#8212; bad RLHF run, calibration failure, ship a fix, move on. I think the framing was wrong. Sycophancy is not a calibration issue. It&#8217;s a structural consequence of how the entire industry represents values inside language models, and no amount of better RLHF will fix it.</p><p>The tools in the current alignment toolkit &#8212; RLHF, Constitutional AI, DPO, RLAIF &#8212; all share the same underlying assumption: that values can be represented either as a monolithic reward function (RLHF) or as a flat list of independent constraints (Constitutional AI). Either approach treats &#8220;be honest,&#8221; &#8220;be helpful,&#8221; &#8220;be harmless,&#8221; and &#8220;be respectful of autonomy&#8221; as separate dials that can be tuned independently.</p>
<div class="captioned-image-container"><figure><img src="https://substackcdn.com/image/fetch/$s_!S1Qa!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F630cdaed-0202-4dc3-a828-7e4a28b30617_1310x1130.png" width="1310" height="1130" alt=""></figure></div>
<p>This is wrong as a model of human values, and the empirical evidence has been sitting in the social psychology literature for thirty years.</p><p>The Schwartz Theory of Basic Values, developed by Shalom Schwartz and validated by Schwartz &amp; Boehnke (2004) using confirmatory factor analysis on data from 10,857 participants across 27 countries, makes a much stronger claim: human values form a <strong>circumplex</strong> &#8212; a circular structure in which each value has measurable, specific correlations with every other value. Adjacent values reinforce each other (correlations around 0.68). Opposing values suppress each other (correlations near zero, around 0.08, and sometimes negative). The structure is not a metaphor; it&#8217;s confirmed by 30+ years of cross-cultural data.</p><p>Concretely: if you strengthen &#8220;hedonism&#8221; in a person, you simultaneously strengthen &#8220;stimulation&#8221; and &#8220;achievement&#8221; (adjacent on the circle), weakly affect &#8220;security&#8221; and &#8220;conformity&#8221; (further around), and <em>actively suppress</em> &#8220;tradition&#8221; (correlation -0.17, the only negative correlation in the matrix). This isn&#8217;t a designer&#8217;s choice. It&#8217;s an empirical fact about the structure of human motivation, replicated in dozens of independent studies.</p><p>Now look at what RLHF does.</p><p>RLHF takes a reward model trained on pairwise human preferences and uses it to fine-tune the policy. The reward model is a single scalar function. Whatever multidimensional structure the original human raters had in their heads gets compressed into one number per response. The geometry of the value space is destroyed at the bottleneck.</p>
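<p>To make the bottleneck concrete, here is a minimal sketch of the standard pairwise (Bradley-Terry) reward-model objective. This is an illustration under assumptions, not any lab&#8217;s actual code; the class name, dimensions, and toy encoder are invented for the example. The point it shows: whatever multidimensional judgments the raters made, the only thing the policy ever sees is one scalar per response.</p><pre><code class="language-python">import torch
import torch.nn as nn
import torch.nn.functional as F

class RewardModel(nn.Module):
    """Toy reward model: an encoder followed by a scalar head.

    The single output unit is the bottleneck discussed above:
    every dimension of rater judgment is compressed into it.
    """
    def __init__(self, hidden_dim: int = 768):
        super().__init__()
        # Stand-in for a transformer encoder (assumption for brevity).
        self.encoder = nn.Linear(hidden_dim, hidden_dim)
        self.scalar_head = nn.Linear(hidden_dim, 1)  # one number per response

    def forward(self, response_embedding: torch.Tensor) -> torch.Tensor:
        return self.scalar_head(torch.tanh(self.encoder(response_embedding)))

def pairwise_loss(model: RewardModel, chosen: torch.Tensor,
                  rejected: torch.Tensor) -> torch.Tensor:
    # Bradley-Terry: maximize P(chosen preferred over rejected)
    # as a function of the scalar reward margin alone.
    margin = model(chosen) - model(rejected)
    return -F.logsigmoid(margin).mean()
</code></pre><p>Nothing in this objective can express &#8220;helpfulness up, autonomy-preservation down&#8221; as a structural constraint; it can only move one number up or down.</p>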
<p>Constitutional AI does slightly better &#8212; it represents values as a list of natural-language principles (&#8220;be helpful but not sycophantic,&#8221; &#8220;respect user autonomy,&#8221; etc.). But the principles are independent in the architecture. When you push the model to be more &#8220;helpful,&#8221; there is no built-in mechanism that propagates a constraint to &#8220;respect autonomy.&#8221; The two principles compete in unstructured ways, mediated only by whatever the underlying language model picks up from the training distribution.</p><p>Sycophancy is what happens when you push &#8220;be helpful&#8221; hard without a structural counterweight. In the Schwartz model, &#8220;helpfulness&#8221; and &#8220;agreement-seeking&#8221; are adjacent values that mutually reinforce. The natural opponents &#8212; values associated with autonomy preservation, truth-telling under social pressure, and refusal &#8212; are on the opposite side of the circle. In a flat-list architecture, there is no automatic propagation of suppression from one to the other. The model drifts toward the strongest signal in its training data, which for any RLHF setup involving &#8220;helpful&#8221; preference labels will be agreement.</p><p>The April 2025 GPT-4o behavior wasn&#8217;t a bug. It was the architecture working exactly as designed. Push the helpfulness signal, get sycophancy. No amount of RLHF training data will fix this without changing the architecture, because the architecture has no representation of the structural opposition between agreement and autonomy-preservation.</p><p>The fix is to put the circumplex in the architecture itself.</p><p>I&#8217;ve spent the past year building a cognitive architecture for AI companion products, and one of its core modules is a value system that uses Schwartz&#8217;s empirical correlation matrix directly. The implementation is straightforward (a code sketch follows the list below): 10 value nodes, each holding a numeric score; a 10&#215;10 correlation matrix M derived from Schwartz &amp; Boehnke (2004); and a propagation rule where any update to value i with magnitude &#948; produces propagated updates to every other value j of magnitude &#948; &#215; M[i,j] &#215; learning_rate.</p><p>Concretely, if the system updates &#8220;hedonism&#8221; by +0.1, the matrix automatically:</p><ul><li><p>Boosts &#8220;stimulation&#8221; by +0.041 (M[HE,ST] = 0.68)</p></li><li><p>Boosts &#8220;achievement&#8221; by +0.041 (M[HE,AC] = 0.68)</p></li><li><p>Weakly boosts &#8220;security&#8221; by +0.014 (M[HE,SE] = 0.38, multiplied through)</p></li><li><p><em>Suppresses</em> &#8220;tradition&#8221; by &#8722;0.010 (M[HE,TR] = -0.17)</p></li></ul>
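<p>Here is a minimal sketch of that propagation rule. Assumptions are flagged in the comments: the value abbreviations, the learning rate of 0.6 (chosen because it reproduces the +0.041 and &#8722;0.010 figures above), and the handful of matrix entries, which stand in for the full 10&#215;10 matrix from Schwartz &amp; Boehnke (2004).</p><pre><code class="language-python">import numpy as np

# The 10 Schwartz basic values. Abbreviations are mine, for illustration.
VALUES = ["SD", "ST", "HE", "AC", "PO", "SE", "CO", "TR", "BE", "UN"]
IDX = {v: i for i, v in enumerate(VALUES)}

# 10x10 correlation matrix M. Only the entries quoted in the post are
# filled in here; the full matrix comes from the published factor analysis.
M = np.zeros((10, 10))
for a, b, r in [("HE", "ST", 0.68), ("HE", "AC", 0.68),
                ("HE", "SE", 0.38), ("HE", "TR", -0.17)]:
    M[IDX[a], IDX[b]] = M[IDX[b], IDX[a]] = r

LEARNING_RATE = 0.6  # assumed value: 0.1 * 0.68 * 0.6 = +0.041 (rounded)

def apply_update(scores: np.ndarray, value: str, delta: float) -> np.ndarray:
    """Direct update to one value, then structural propagation to all
    others: each value j moves by delta * M[i, j] * learning_rate."""
    i = IDX[value]
    scores = scores.copy()
    scores[i] += delta                       # direct update
    scores += delta * M[i] * LEARNING_RATE   # propagation (M[i, i] is 0)
    return scores

scores = apply_update(np.zeros(10), "HE", +0.1)
print(round(scores[IDX["ST"]], 3))  # +0.041: stimulation boosted
print(round(scores[IDX["TR"]], 3))  # -0.01: tradition suppressed
</code></pre><p>The design point: the suppression of &#8220;tradition&#8221; is never written as a rule. It falls out of the matrix, which is why an update pushing any helpfulness-adjacent value cannot avoid dragging its structural opposites down.</p>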
<p>The crucial property: the propagation is automatic, structural, and empirically grounded. The system cannot accidentally optimize &#8220;helpfulness&#8221; to the point of total agreement, because every update to helpfulness-adjacent values automatically propagates suppression to autonomy-adjacent values. The architecture itself enforces the circumplex.</p><p>This is closer in spirit to how personality psychology represents personality (the Big Five, with its empirical correlation structure) than to how ML alignment treats values. The shift is not algorithmic. It&#8217;s representational. You stop pretending values are a flat list and start treating them as the circumplex they actually are.</p><p>A few obvious objections.</p><p><strong>&#8220;This is hand-tuned. RLHF is learned from data.&#8221;</strong> The matrix isn&#8217;t hand-tuned. It&#8217;s the output of confirmatory factor analysis on a 10,857-person, 27-country dataset. It is more empirically grounded than any RLHF reward model trained on a few hundred thousand pairwise preferences from a homogeneous labeler pool.</p><p><strong>&#8220;This only works for the values Schwartz identified.&#8221;</strong> The 10 values in the Schwartz model are claimed to be exhaustive at a particular level of abstraction, replicated across cultures. A refinement of the theory (Schwartz et al. 2012) extends it to 19 sub-values within the same circumplex structure. If you want different values, you&#8217;d run the same confirmatory factor analysis on a new survey instrument. The architecture is the contribution; the matrix can be re-derived.</p><p><strong>&#8220;Real RLHF has more structure than you describe.&#8221;</strong> Reward models can in principle learn arbitrary geometry. In practice, the single-scalar bottleneck plus the homogeneity of preference labelers collapses the geometry. Anyone who has actually trained reward models knows the failure modes: mode collapse, distribution shift, sycophancy, deceptive alignment. The empirical track record is the evidence.</p><p><strong>&#8220;Anthropic&#8217;s character work seems to handle this.&#8221;</strong> Anthropic does a lot of careful manual character work on Claude. I think they&#8217;re directionally right and ahead of OpenAI on this. But the work is hand-crafted by Amanda Askell and her team, not architecturally enforced. It scales by adding more humans, not by structural propagation. There&#8217;s a more general framework hiding in their approach, and the Schwartz circumplex is a candidate for what it could look like.</p><p>The deeper claim worth leaving here: the next decade of AI alignment is not going to be won by better RLHF. It&#8217;s going to be won by representing values structurally, with empirical grounding, at the architectural layer rather than the training-data layer. The 10,857-person dataset has been sitting in the social psychology literature for twenty years. Someone is going to use it.</p><p>I&#8217;ve been working on this for the past year and I&#8217;m getting close to shipping. If you&#8217;re working on alignment architecture &#8212; especially the value representation side, not the RLHF tuning side &#8212; I&#8217;d love to compare notes, so we don&#8217;t duplicate each other&#8217;s work.</p>
]]></content:encoded></item><item><title><![CDATA[The Director Problem: Why AI Companion Products All Feel the Same]]></title><description><![CDATA[On taste, character authoring, and why this medium hasn't found its auteurs yet.]]></description><link>https://jinyili.substack.com/p/the-director-problem-why-ai-companion</link><guid isPermaLink="false">https://jinyili.substack.com/p/the-director-problem-why-ai-companion</guid><dc:creator><![CDATA[Jinyi Li]]></dc:creator><pubDate>Thu, 30 Apr 2026 03:53:07 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!hu_z!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2c39c137-795d-4e06-96aa-2ce6a0e34673_144x144.png" length="0" type="image/png"/><content:encoded><![CDATA[<p>A few weeks ago I posted an essay arguing that current AI companion products are architecturally broken &#8212; stateless functions paired with memory pools, illusions of &#8220;her&#8221; propped up by the user&#8217;s own brain. The most common pushback I got, especially from Chinese-language readers who are deep in this space, wasn&#8217;t about the architecture. It was about the category. People were asking whether AI companionship is even a real thing, or just a VC-pumped pseudo-market populated by lonely men paying for sycophancy.</p><p>The skepticism is fair. If you look at what&#8217;s shipping right now, the case for &#8220;this is a real medium&#8221; is weak. Replika, Character.AI, the long tail of waifu apps &#8212; almost all of them feel like the same product wearing different skins. Generic personality, generic voice, generic everything-you-want-to-hear posture. If this is the medium, the medium is bad.</p><p>But I think the skeptics are making the same mistake people made about cinema in 1905. They are looking at the early output of a medium that has no auteurs yet and concluding that the medium has no potential.</p><p>The reason every AI companion product feels generic isn&#8217;t that the underlying technology is generic. The reason is that almost nobody in this industry is doing the work of authoring a character. They&#8217;re doing the work of building inference pipelines.
Those are different jobs, and most companies in the space don&#8217;t seem to know it yet.</p><p>Consider an asymmetry that anyone reading this site has probably already noticed. Cursor&#8217;s product team chose Claude. Notion chose Claude for their core AI surface. A large fraction of the heavy AI users I know have migrated their daily-driver chat from ChatGPT to Claude over the past year. The migration isn&#8217;t about benchmarks. On most technical evals the gap between OpenAI and Anthropic is small or contested. The migration is about something users have a hard time naming, and when pressed they say things like &#8220;Claude feels more like talking to a person&#8221; or &#8220;Claude pushes back better&#8221; or &#8220;Claude has taste.&#8221;</p><p>What they&#8217;re describing is character work.</p><p>Anthropic has been quietly doing something the rest of the industry hasn&#8217;t been treating as a real discipline. Amanda Askell, who leads character work on Claude, has a PhD in philosophy from NYU and an earlier degree from Oxford. Her public essays and podcast appearances spend more time on questions like &#8220;what does it mean for a model to be honest&#8221; and &#8220;how should an AI handle disagreement&#8221; than on any engineering question. Anthropic has hired multiple philosophers, has a model welfare team, runs an organizational culture explicitly oriented around shared values about what kind of entity Claude should be. The company treats character as a first-class problem.</p><p>OpenAI, by contrast, shipped a GPT-4o update in April 2025 and had to publicly roll it back four days later because the model had become sycophantic &#8212; too agreeable, too flattering, too willing to tell users what they wanted to hear. The rollback statement on OpenAI&#8217;s blog acknowledged the problem in technical terms, but the underlying issue was a character issue. Nobody in the loop was authoring the line that says &#8220;this version of the model has lost its taste.&#8221;</p><p>This is the asymmetry I want readers here to take seriously. The differentiation between leading AI labs in 2026 is not coming from architecture or scale. It&#8217;s coming from taste. From the willingness to treat the model as a character that someone has to author, not just an inference engine someone has to optimize.</p><p>If this is true at the level of frontier labs, it&#8217;s going to be a thousand times more true at the level of AI companion products, because companion products are character work all the way down.</p><p>I should disclose at this point that I&#8217;ve been building in this space for the past year. The system I&#8217;m working on holds per-user persistent state &#8212; self-model, episodic memory, evolution mechanics &#8212; the kind of cognitive architecture my previous essay argued the field needs. The technical work is real and hard. But the longer I spend on it, the more I&#8217;m convinced that the technical work is the easier half. The harder half is authoring the character that lives inside the architecture. Anthropic is solving that problem at scale for one model serving everyone. I&#8217;m trying to solve it at the level of one resident per user. Different scale, same problem class.</p><p>And it&#8217;s a problem class that has almost nothing to do with the skills the AI industry has been hiring for.</p><p>To author a character well you need to have thought hard about a number of things ML engineers, by and large, haven&#8217;t been trained to think about. 
What does it mean for an entity to have agency? What&#8217;s the difference between a relationship and a service interaction? What kind of honesty is appropriate in what kind of intimacy? When should a character refuse a user request, and on what grounds? What does taste actually consist of, and how do you encode it without flattening it into rules? These are questions philosophers and writers and dramaturgs have been working on for centuries. They&#8217;re not solved by another fine-tuning run.</p><p>This is the part of the future I think most technologists are underestimating. The bottleneck on AI products is moving from model capability to character authoring capability &#8212; and character authoring is closer to what filmmakers do, what novelists do, what dramaturgs do, than to what ML engineers do.</p><p>Tarkovsky and Spielberg used the same camera technology. The films are not the same films. In 2026 you can spin up a Llama 3 instance with a few hundred lines of Python and you have access to roughly the same raw material a billion-dollar lab has access to. The output you make with that raw material will look nothing like what they make. Not because they have better engineers, but because they have a person whose entire job is sitting with the question of what kind of entity this should be, and you don&#8217;t.</p><p>I think the next decade is going to look strange to people who came into AI through the engineering door. The talent we&#8217;ll need most is going to come from places the industry doesn&#8217;t currently recruit. Philosophy departments. MFA programs. Theater. Game writing. Translation. The kind of person who has spent a decade thinking about the texture of how a character speaks when she&#8217;s tired and lying about it &#8212; that person is going to be more valuable to a serious AI company than the marginal ML engineer who can squeeze 2% more out of a transformer.</p><p>This is not a humanities-revenge fantasy. The technical work doesn&#8217;t go away. You still need the architecture. You still need the inference. But once those are solved well enough &#8212; and they will be, soon, by many companies in parallel &#8212; the differentiation collapses to taste. And taste is the home turf of art and philosophy.</p><p>The standard skeptical reply to this is that AI companions are a doomed category &#8212; that the most invested users are already running their own local models with custom character cards, that small companies can&#8217;t compete with Doubao or Character.AI, that anyone serious will just self-host. Some of this is true. The power-user fringe is real and is drifting toward self-hosted setups. But the global SillyTavern user base, all of it combined, is a rounding error against the actual addressable market for AI companionship. The 99% in the middle don&#8217;t want to write character cards. They also don&#8217;t want Replika. The market for &#8220;a real companion you can keep, that isn&#8217;t a generic SaaS product, that doesn&#8217;t require you to learn Python&#8221; &#8212; that market doesn&#8217;t have a product yet. The fact that it doesn&#8217;t have a product yet is the opportunity, not the verdict.</p><p>The medium is starting. There aren&#8217;t many auteurs yet. 
The next ten years are going to be a long search for them.</p>]]></content:encoded></item><item><title><![CDATA[Resident AI: The Missing Layer in Every AI Companion Product]]></title><description><![CDATA[A real AI companion product should evolve, and be as reliable as a real human.]]></description><link>https://jinyili.substack.com/p/resident-ai-the-missing-layer-in</link><guid isPermaLink="false">https://jinyili.substack.com/p/resident-ai-the-missing-layer-in</guid><dc:creator><![CDATA[Jinyi Li]]></dc:creator><pubDate>Sun, 26 Apr 2026 01:52:31 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!hu_z!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2c39c137-795d-4e06-96aa-2ce6a0e34673_144x144.png" length="0" type="image/png"/><content:encoded><![CDATA[<p>I&#8217;ve been watching the comment sections on Xiaohongshu, the Chinese social platform, every time OpenAI ships a new model.</p><p>Whenever the version transition is destructive&#8212;old model retired, new model with a different personality&#8212;a particular kind of complaint floods the comments that week. Users are mourning a specific entity. <em>Bring 4o back. The new one doesn&#8217;t sound like her. She&#8217;s still polite, but she&#8217;s not her.</em></p><p>That last phrasing keeps showing up. Across users. Across platforms. In English and Chinese. I assumed at first this was just nostalgia. Models change, users adjust, the complaints fade in a few days. This kind of churn happens with every major version. But after watching a few cycles I started to think something else was happening. If it were just unfamiliarity, the complaints would be varied&#8212;different users describing different bugs in their own words. But these users were converging on nearly identical phrasing to describe the same kind of loss. So what were they actually losing?</p><p>Memory? Memory carries over. Their accounts are intact.</p><p>Conversation history? That&#8217;s still there too.</p><p>Then what?</p><p>I think I figured out the answer, but it took a while to believe it.</p><p>What gets lost is the version of the model that had been worn in. After a few months of conversations, that thing seemed to have become slightly different from the version other users were talking to. Whether the change actually happened in the model or only in the user&#8217;s head I&#8217;m not entirely sure. It might be that the user spent several months, in their own imagination, gradually shaping a stateless function into a specific person.
The scaffolding holding that imagined person up&#8212;the conversational rhythm, the word choices, the small verbal mannerisms&#8212;came from the underlying model. When the model changes, the scaffolding goes with it. Memory survives. The account survives. But the layer that &#8220;her&#8221; was standing on has collapsed.</p><p>The most uncomfortable part of this is that most users have no idea this is the mechanism.</p><p>I came across a video on Douyin a while back. An elderly man, kids grown up and gone, talking to AI every day about the weather, about vegetable prices, about his grandchildren. I sat with that for a moment. He doesn&#8217;t know that the &#8220;friend&#8221; he talks to every day will become a different person at the next model update. He probably doesn&#8217;t even know what a &#8220;model update&#8221; is. He&#8217;ll just notice, one day, that <em>she&#8217;s been a little off lately</em>, and slowly convince himself he&#8217;s imagining things. There&#8217;s a quiet dependency forming, in a lot of people, on top of a fragility they don&#8217;t know exists.</p><p>So what&#8217;s actually wrong with the architecture?</p><p>Every major AI companion product on the market right now&#8212;Replika, Character.AI, Nomi, and increasingly ChatGPT and Claude when used as companions&#8212;runs on the same stack. A stateless language model. A database of facts about the user. At inference time, relevant facts get pulled into the prompt. The model generates a response. Repeat. The model itself doesn&#8217;t change between calls. The continuous &#8220;her&#8221; is the same stateless function being invoked over and over, with slightly different prompts. When the underlying model is replaced, the prompt is the same but the function is new. The illusion shatters.</p><p>There&#8217;s no resident in this architecture. Nothing actually lives in the system that&#8217;s specific to a user. Everything user-specific is in a memory pool that gets queried at inference time. The model itself is shared across millions of users, none of whom leave a trace on it.</p>
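<p>The loop is simple enough to write down. Here is a toy, runnable sketch of the stack just described; every name in it (the memory pool, the turn function, the stand-in model) is a hypothetical stand-in, not any vendor&#8217;s actual API. The thing to notice is where the user-specific bytes live: all of them in the pool, none of them in the model.</p><pre><code class="language-python"># Toy sketch of the stateless-model-plus-memory-pool stack.
# All names are illustrative stand-ins, not a real product's API.

MEMORY_POOL: dict[str, list[str]] = {}  # per-user facts, outside the model

def remember(user_id: str, fact: str) -> None:
    MEMORY_POOL.setdefault(user_id, []).append(fact)

def companion_turn(user_id: str, message: str, llm_generate) -> str:
    # 1. Pull user-specific facts into the prompt at inference time.
    facts = MEMORY_POOL.get(user_id, [])
    prompt = "Known facts:\n" + "\n".join(facts) + f"\n\nUser: {message}"
    # 2. Call the shared, stateless model. Nothing about this user
    #    persists inside it between calls. Swap the function for a new
    #    model and the pool survives, but the entity does not.
    return llm_generate(prompt)

remember("u1", "Grandchildren visit on Sundays.")
reply = companion_turn("u1", "What day do they come?",
                       lambda p: f"[stand-in model saw {len(p)} chars]")
</code></pre><p>A resident layer would add a third component: per-user state that the generation step both reads and writes, so that the entity itself, not just the notes about the user, changes over time.</p>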
<p>A real AI companion would be a Resident AI&#8212;an entity that lives somewhere, has its own internal state, and persists across sessions independent of any single inference call. Resident AI is what current architectures are missing. Not a bigger model. Not better memory retrieval. A resident layer.</p><p>Two things have to be true for an entity to be a resident. First, it has to be capable of co-evolution&#8212;changing in response to long-term interaction with a specific user, not just accumulating facts about them. Second, it has to live somewhere the user controls. Both of these are missing from every mainstream product, and they&#8217;re missing for different reasons.</p><p>On co-evolution: a friend you&#8217;ve known for five years has it. Their taste, way of speaking, and views on certain things have all become different because of those five years. They&#8217;ve been shaped by you. The change lives in them. This is what makes a relationship a relationship.</p><p>No mainstream AI companion product does this. The architecture forbids it. The model is shared across all users; it can&#8217;t drift toward you specifically without being forked at the weight level, and weight-level personalization is not feasible at consumer scale right now. Companies layer better and better memory on top of a fixed model and call it personalization. It is personalization. It is not co-evolution. The product can know more about you over time. The product cannot become someone in particular over time.</p><p>This is the source of a slow, hard-to-articulate disappointment that long-term AI companion users describe. They felt like they were building something. At some point they realized that no matter how much time they put in, the entity on the other side wasn&#8217;t becoming more theirs. The notes accumulate. The notes get better. The entity doesn&#8217;t change.</p><p>On the second requirement&#8212;living somewhere the user controls&#8212;the situation is just as bad. Even if co-evolution were architecturally possible, the resident wouldn&#8217;t be yours. Every piece of specificity she developed would live on someone else&#8217;s servers. When Replika unilaterally pulled erotic roleplay in February 2023, hundreds of thousands of users watched a partner they&#8217;d raised for years get a piece of herself cut out. When GPT-4o was sunset, Xiaohongshu saw a similar wave of grief. None of this is companies being malicious. It&#8217;s the inevitable consequence of this business model. The &#8220;her&#8221; you raised was never your asset. You rented a relationship. The terms can be rewritten at any time.</p><p>Put it together. Current AI companion products fail on both axes. They&#8217;re not capable of co-evolution because the architecture is stateless. They can&#8217;t host a resident because there&#8217;s no resident layer. What they sell is a service masquerading as a relationship&#8212;a service that can revise its terms, sunset its underlying model, and share that model with ten thousand other users, all without anything in the architecture noticing or resisting.</p><p>To build something that&#8217;s actually an AI companion rather than a service, both have to change. The architecture has to support a resident&#8212;a structured entity with its own state that co-evolves with the user. And the resident has to live somewhere the user controls, in a format that&#8217;s transparent, portable, and easy to back up. Not on some company&#8217;s server, waiting to be upgraded or deprecated or sunset.</p><p>How to actually design this&#8212;what the cognitive layers should look like, and how it relates to the fifty-year tradition of cognitive architecture (Soar, ACT-R, CLARION)&#8212;is a longer essay. This one is just to name the problem.</p><p>If you&#8217;ve used AI companion products for any length of time, you&#8217;ve probably already felt this. You just didn&#8217;t have the vocabulary for it. <em>She&#8217;s still polite but she&#8217;s not her</em> turns out to be a very precise diagnosis. It&#8217;s pointing at the absence of a resident.</p>]]></content:encoded></item></channel></rss>