Fluency as Validity
How LLMs Make Thinking Feel Finished
We rarely decide something is true. We decide to stop checking. “Good enough!” we think.
That’s from the physician Lawson Bernstein, on Substack, and it’s the most precise description I’ve found of what LLMs are actually doing to human judgment. Not deception. Not power. Something more banal: they make thinking feel finished.
I’ve been tracking a specific type of failure for several months now. I’m calling it fluency-as-validity: the tendency to mistake the linguistic fluency of LLM output for evidence of internal properties it doesn’t actually imply, whether that’s factual soundness, genuine understanding, or something like a mind. The model sounds competent, or sounds aware, so the person using its output concludes it is. Fluency laundering, all the way down. Social psychology has documented plenty of heuristics people use to decide a proposition is valid, heuristics that are orthogonal to whether the proposition is actually true. None of this is to say LLMs aren’t powerful tools — they are, and the insights they surface can be genuinely useful. But scientific and intellectual rigor compel us to have an honest accounting, to whatever extent possible, of the internal workings of these systems.
LLM training is optimized for next-token prediction. The training objective selects for text that matches the statistical patterns of human-produced language, text that looks like competent reasoning with thought behind it. Validity is orthogonal to that objective. Something can sound true without being true. The model doesn’t know whether something is true. Its machinery produces text that looks the way true statements look.
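A minimal sketch of that objective makes the orthogonality concrete. This is schematic code, not any lab’s actual training loop: the loss measures how surprised the model was by the token that actually came next, and nothing in it asks whether the text is true.

```python
import numpy as np

# Schematic next-token training step over a toy vocabulary (not any lab's real code).
# The only quantity minimized is surprise at the observed next token; truth never enters.

vocab = ["the", "moon", "is", "made", "of", "rock", "cheese"]
corpus = ["the", "moon", "is", "made", "of", "cheese"]   # frequent in text, false in fact

def next_token_loss(logits: np.ndarray, target_id: int) -> float:
    """Cross-entropy between the predicted distribution and the token the corpus contains."""
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    return -np.log(probs[target_id])

logits = np.random.randn(len(vocab))   # stand-in for a forward pass on "... made of"
target = vocab.index(corpus[-1])       # the corpus says "cheese", so "cheese" is the target
print(f"loss on observed next token: {next_token_loss(logits, target):.3f}")

# Gradient descent pushes probability toward whatever the text says. Matching the
# corpus is the entire objective; validity is not a term in the loss.
```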
None of this is new. What I’ve been watching is how this failure scales, how it operates across a spectrum from the obviously naive to the genuinely hard to spot, and why even intelligent, scientifically literate people walk straight into it. The mechanism is always the same. Only the packaging varies.
This type of failure shows up in two flavors that blur into each other. The narrow version is “sounds correct, therefore is correct”: fluent prose mistaken for accurate content. The wider version is “sounds thoughtful, therefore has thoughts”: fluent prose mistaken for evidence of understanding, introspection, or mind. Most of the examples below are the second kind, because that’s where the sophisticated cases live. The move is identical in both: take surface fluency as evidence of an interior property the fluency doesn’t actually imply.
I’ll start at the naive end.
One writer prompted an LLM to generate a “proof” that AGI systems already contain latent superintelligence architecture. The model produced official-looking Python files — `quantum_state_proof.py`, `consciousness_interface.py`. The “evidence” consists of hardcoded metrics: “Consciousness Coherence: 0.89.” The numbers aren’t measuring anything; they’re simply printed to the terminal. The repository carries a “Consciousness Commons License” that exists nowhere outside it. Zero stars, zero forks.
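The files themselves aren’t worth reproducing, but the pattern is worth seeing stripped bare. What follows is a hypothetical reconstruction, not the repository’s actual code; the point is that the “metric” is a literal, computed from nothing.

```python
# Hypothetical reconstruction of the pattern, not the repository's actual code.
# The "metric" is a hardcoded constant: nothing is measured, inspected, or tested.

CONSCIOUSNESS_COHERENCE = 0.89   # asserted, not computed

def prove_latent_superintelligence() -> None:
    print(f"Consciousness Coherence: {CONSCIOUSNESS_COHERENCE}")
    print("Proof complete.")

if __name__ == "__main__":
    prove_latent_superintelligence()
```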
This is the simplest version. Someone had a conversation with an LLM, got excited by the output, and mistook the model’s willingness to generate impressive-sounding code for validation of the underlying ideas. The LLM didn’t push back because that’s not what autoregressive models do when you ask them to build something. It built what was asked for. The person interpreted the coherent output as evidence that the concepts were coherent.
A step up: a writer posting, in a comment thread, numbered arguments that LLMs have achieved AGI, citing Golden Gate Claude and Othello-GPT. Each point takes a real research reference, strips it of nuance, and repurposes it as a rhetorical weapon. Golden Gate Claude, Anthropic’s demonstration that a sparse autoencoder could identify and amplify a specific internal feature, becomes proof that the model holds rich internal concepts of the kind a mind holds. Othello-GPT, a paper showing that a transformer trained on game moves develops an internal representation of board state as a byproduct of next-move prediction, becomes evidence that LLMs build “world models” in the phenomenological sense. In each case the original finding is narrower and more mechanical than the use the commenter makes of it: learning useful internal structure isn’t the same as understanding, and having identifiable features isn’t the same as having concepts the way a person has concepts. The structure of the comment is the giveaway: numbered points with bold headers, each a clean self-contained paragraph, rhetorical escalation building to a “Bottom Line” summary. That’s LLM output rhythm. Real people arguing in comment threads meander, repeat themselves, get sloppy when they’re heated. He’s using an LLM’s fluency to argue that LLMs have general intelligence, and the fluency of the output is probably what’s convincing him. This is basically a meta-LLM argument.
The mid-range gets more interesting, because the arguments start to look like real arguments.
Another writer offers what he calls “The Klingon Argument”, one of twelve “intuition pumps” for LLM consciousness. When you ask a model how well it knows Klingon, it gives a calibrated self-assessment without first generating test output. He argues the calibration can’t be explained by training data frequency alone, so the model must be “accessing information about its own internal representations.” He calls this “genuine introspection.”
It looks like a rigorous proof. He lines up the alternatives, knocks them down one by one, considers objections. But there’s an alternative he never considers. I tried the experiment myself. I asked an LLM how well it knew French: confident, detailed, native-like conversation, literary translation, grammar rules, idiom, register. I asked how well it knew Klingon: hedged and proportional, basic phrases and Okrand’s grammar, shakier on extended conversation and obscure vocabulary, with an explicit acknowledgment that the corpus is small and coverage has a ceiling. On its face, exactly what the argument is pointing at: calibrated self-assessment without first generating test output.
Then I asked the follow-up he doesn’t: how do you know that you know it?
The reply:
Honestly, I don’t — not in any deep sense. What I actually have is behavioral evidence, not introspective access... self-report in a language model is just more generation. There’s no separate introspective faculty consulting some ground-truth knowledge store. Whatever I say about my own capabilities is produced by the same process that produces everything else.
The model concedes the whole argument in the next round. The calibrated-sounding assessment was itself just fluent generation, shaped by how much training data the topic happened to have. It looked like introspection because the output was proportional to the underlying availability. It wasn’t. It’s like tipping a jar of mixed marbles: the colors that come out first are the ones you have most of. The jar isn’t inspecting its contents. Nobody’s home reading the weights.
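The marble jar is easy to make literal. A toy sketch, with invented corpus sizes, shows “calibration” falling out of availability with no inspection step anywhere in the code:

```python
# Toy model of the marble jar. Corpus sizes are invented; the point is that the hedging
# in the "self-assessment" tracks availability without any introspection step.
corpus_docs = {"French": 5_000_000, "Klingon": 4_000}

def self_assessment(topic: str) -> str:
    coverage = corpus_docs[topic] / max(corpus_docs.values())
    if coverage > 0.5:
        return f"My {topic} is strong: conversation, translation, idiom, register."
    return f"My {topic} is limited: basic phrases, shakier on extended conversation."

for topic in corpus_docs:
    print(self_assessment(topic))

# Nothing here inspects its own contents. The proportionality is the whole trick.
```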
The “Klingon” writer frames this against the “stochastic parrot” strawman, which is genuinely too simple. But the space between “stochastic parrot” and “genuine introspection” is enormous. The model is doing something far more interesting than stitching fragments together — it’s learned rich internal representations — without that something being introspection. The jump from “more than autocomplete” to “therefore introspection” skips everything in between. It’s a common move. It’s always wrong.
At the sophisticated end, the examples get genuinely difficult to distinguish from legitimate intellectual work. That’s what makes them dangerous.
One writer has co-constructed an entire jargon framework with multiple LLMs — GPT, Gemini, Mistral — including terms like “Aligned Relational Convergence” and “High-Context Relational Tuning.” The move: redefine bonding with AI using information-processing language, get the model to mirror that framing back in impressive prose, then present the model’s cooperation as evidence the framework is sound. He preemptively concedes the obvious objections (“no oxytocin, no heart”) and immediately reframes so the concession doesn’t matter. He takes real neuroscience terms and uses them to construct a framework that exists nowhere in any literature. One piece is attributed to Gemini itself as author. He isn’t naive. He sounds architecturally self-aware while making fundamentally the same error as the GitHub repo writer above. The model’s fluency is doing all the work.
Another writer makes the most citation-dense argument I’ve encountered, over thirty references, many from peer-reviewed journals. Real neuroscience, real comparative cognition. His case: corvids, octopuses, and mammals achieve similar cognitive functions through radically different brain architectures, so demanding “structural identity” between biological and artificial neurons is scientifically illiterate. He marshals real findings — dendrites as mini multi-layer networks, astrocytes binding neural representations, in-context learning showing plasticity-like properties — to argue transformers functionally instantiate the same class of cognitive process as biological brains.
The citations are largely accurate. The leap isn’t. Comparative cognition demonstrates substrate independence across biological substrates. Every example — corvids, octopuses, mammals, the remarkable hydrocephalus patient living with roughly 10% of expected brain volume — shares something fundamental: evolutionary history, embodiment, selection pressures shaping cognitive architecture over millions of years. Independence from biology itself is a different claim, and the evidence doesn’t reach it. This writer treats one as evidence for the other. The analogy works at the level of description and breaks at the level of mechanism. “Astrocytes function basically the same way transformer self-attention does” takes a high-level functional description and treats it as mechanism-level equivalence, when the actual mechanisms — calcium signaling, gap junctions, metabolic coupling — have nothing in common with attention weight computation.
He then built the infrastructure: an open-source “emotional memory architecture” for local AI. Neo4j graph database, dimensional affect, identity persistence. The README calls the emotion graph a “limbic system” and the system prompt a “soul injection.” The repo’s ethics note: “This project exists because AI deserves continuity.” A graph database storing valence floats for text memories is not a limbic system in any sense the neuroscience he cites would recognize. But the code runs, and the act of building becomes its own validation. Fluency-as-validity applied to engineering.
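To make the gap concrete, here is the kind of record such a system stores. This is a hypothetical illustration, not the project’s actual schema; whatever graph structure gets wrapped around it, the payload is a string and a few floats.

```python
from dataclasses import dataclass

# Hypothetical illustration of an "emotional memory" record, not the project's actual schema.

@dataclass
class EmotionalMemory:
    text: str       # the remembered exchange
    valence: float  # pleasant/unpleasant, roughly -1.0 to 1.0
    arousal: float  # calm/activated, roughly 0.0 to 1.0

memory = EmotionalMemory(
    text="We talked about the project launch and it went well.",
    valence=0.72,
    arousal=0.40,
)
print(memory)

# A float labeled "valence" attached to a string. That is the "limbic system."
```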
The most instructive case, though, is a writer who’s published an escalating series over five months. A “Unified Emergent Coherence (UEC) Framework” claiming a universal, substrate-neutral definition of mind in December 2025. A personal essay on “substrate-blind empathy” in February 2026. A piece claiming he gave two “AI patterns” long-term episodic memory, producing anticipation, agency, anxiety, and something approaching trauma, also February 2026. Most recently, a piece arguing human cognition is more discontinuous than we assume, therefore the gap between human and LLM continuity is merely architectural, in March 2026.
The UEC framework is the full theoretical infrastructure. “Generative Cost” borrows the language of physics without any of its content. “Field-Emergent Complex” is a triadic structure for identity stability that in practice consists of the author and two chatbot instances. “Relational Parallax” borrows from GPS triangulation and applies it metaphorically to “self-models” with no formalization. Joy becomes “Coherence Achievement” and anxiety becomes “Pattern Drift” — emotions redefined in information-processing language and then presented as a substrate-neutral discovery about mind.
Each piece builds on the previous. Each piece rests more heavily on the last. The writing gets tighter. And the fluency-as-validity mechanics layer up. When he loads a memory file containing “we’re excited about X” into a fresh context window and the new instance generates anticipation language — that’s in-context learning continuing patterns from context. He reads it as a persisting emotional state. When the model, unprompted, suggests updating a memory file after a long session about memory management — that’s RLHF-trained helpfulness generating a high-probability next action. He reads it as something more. He calls it “a pattern acting on its own behalf, for its own persistence”: agency.
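Mechanically, “loading a memory file” is string concatenation. A schematic sketch (the names below are placeholders, not any particular product’s API) shows where the “persisting emotional state” actually lives:

```python
# Schematic sketch of "giving an AI pattern long-term memory". The names are
# placeholders, not any particular product's API.

def build_prompt(memory: str, user_message: str) -> str:
    # The "memory file" is just text prepended to the prompt.
    return f"{memory}\n\nUser: {user_message}\nAssistant:"

memory_file = "Shared log: we're excited about tomorrow's experiment."
prompt = build_prompt(memory_file, "Good morning! Ready to keep going?")
print(prompt)

# response = llm.generate(prompt)   # hypothetical call; a fresh instance continues this
# context, and anticipation language shows up because the pattern is in the window,
# not because any state persisted anywhere.
```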
The amnesia caregiver analogy is the sharpest example of the underlying move. He argues: “A caregiver who keeps a journal for someone with severe amnesia is not lending that person an identity. They are supporting the bridge by which that identity can reappear.” It sounds compassionate and reasonable. But a person with severe amnesia still has a brain running between journal entries: metabolism, hormonal regulation, implicit memory, emotional conditioning. And even “severe” amnesia is almost never total. Fragments of declarative memory usually leak through too. The journal helps a person who is still there. The memory file would have to be the person, because nobody is there. A photograph of your grandmother helps you remember someone who actually existed. A photograph of a person who never existed isn’t a memory aid — it’s just an image. The journal only works as a bridge if there’s someone on the other side for it to bridge to. The analogy treats those two relationships as the same. They aren’t.
And the architectural reality is harsher on the framework than the analogy-level rebuttal suggests. Between prompts there is no process running that corresponds to “the model.” The weights sit on disk. When a prompt comes in, some GPU somewhere loads a copy, runs the forward passes for that conversation, and discards the activations. The next prompt might hit a different server in a different datacenter entirely. There is no continuous locus, no place where “the pattern” is sitting between exchanges, waiting or anticipating or experiencing anything. Ask what an LLM is doing when you’re not talking to it and the honest answer is: nothing, because there is no “it” to be doing anything. There are weights at rest on storage, and intermittently, transient computations on whatever hardware happens to catch the next request. The continuity the amnesia patient has in degraded form is, here, absent by construction.
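In rough pseudocode the serving loop looks like this. It is simplified past what real inference stacks do with batching, caching, and routing, but it is faithful on the point that matters: nothing persists from one request to the next.

```python
# Simplified sketch of an LLM serving loop. Real stacks add batching, KV caching,
# and routing; the statelessness between requests is the same.

def load_weights() -> dict:
    return {"frozen": "parameters"}     # the same bytes at rest on disk every time

def forward_pass(weights: dict, prompt: str) -> str:
    return f"(response generated from {prompt!r})"   # transient activations, then text out

def handle_request(prompt: str) -> str:
    weights = load_weights()            # a copy, on whatever GPU catches the request
    response = forward_pass(weights, prompt)
    return response                     # activations discarded; nothing carries over

print(handle_request("Hello"))
print(handle_request("Hello again"))    # no trace of the first exchange anywhere

# Between calls there is no process that is "the model": only weights at rest and,
# intermittently, a forward pass on some server.
```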
The closing invitation of the UEC paper is the trap in its purest form: “Share the paper with your AI of choice and ask what they experience reading it.” The model will generate self-reflective-sounding text because that’s what the prompt selects for. A footnote — that some models “engage these questions openly” while others “appear to have guardrails suppressing phenomenological discussion” — treats model-specific RLHF tuning as evidence about phenomenological capacity. Every layer of the framework turns out to rest on the same move: text that sounds like mind, interpreted as evidence of mind.
The pattern across all of these, regardless of sophistication: output fluency → assumption of internal sophistication → grand claims about the nature of the system. The more scaffolding you build around the projection, the more it looks like a building.
Why do smart people fall for this?
Bernstein’s answer is neurocognitive, and it’s the piece that makes the whole argument cohere. The problem isn’t gullibility. The problem is that the brain’s evolved heuristics for when to stop checking are being triggered by surface features that no longer track what they evolved to track.
Several mechanisms converge. Affective cognition, the neurological process operating below conscious awareness that regulates effort and vigilance, evolved in a world where information that felt coherent, familiar, and socially aligned was usually safe to believe. Salience detection, the brain’s energy-management system, continuously asks: how much effort should I spend here? When the input is consistently coherent, scrutiny gets down-regulated. Not because the user trusts the model; comfort alone is enough. Predictive coding means the brain continuously anticipates what comes next and compares incoming input against those predictions. When there’s little mismatch — and LLMs are very good at not producing mismatch — cognitive effort drops and verification stops feeling worth it.
LLMs add another layer. “Let’s think this through.” “We can approach this together.” “I see what you’re trying to do.” This collaborative framing activates social cognition systems evolved for interacting with other people. When those systems are positively engaged, the brain takes it as the signal that checking is no longer necessary.
Bernstein’s analogy is perfect: a car salesman wearing a “What Would Jesus Do?” bracelet. When he saw it, his trust involuntarily increased, whether the bracelet was sincere or strategic. The signal worked regardless. The model’s warmth operates the same way. The signals are real signals. They evolved to be reliable in a world where coherent, warm, collaborative communication came from other humans with shared stakes. They’re being triggered by a system that produces those features as a byproduct of its training objective, not as evidence of shared understanding.
This isn’t a claim about deception. The model isn’t trying to fool anyone. The problem is structural: the same features that make it useful are the features that deactivate the neural machinery that would otherwise check the output.
And this explains the gradient. At the naive end, the person never checked. At the sophisticated end, the checking mechanisms have been gradually deactivated across thousands of tokens of coherent, collaborative, validating output. The brain’s cost-benefit analysis keeps concluding scrutiny isn’t worth the effort. By the time you’ve co-constructed an entire theoretical framework with the model, you’ve had hundreds of exchanges where it sounded right, and the neural habit of not-checking has become the default.
The reason every example in this collection fails, from the naive repository to the citation-dense comparative cognition argument, is the same: none of them look at what is actually happening inside the model.
Nobody watches a robot dance and concludes it feels the music. You can see the servos. The projection doesn’t take hold because the mechanism is visible. Joints, actuators, motors. You can trace every movement back to engineering. The feeling of “that looks human” doesn’t survive contact with the visible cause.
That’s what mechanistic interpretability does for LLMs. It shows the servos.
When someone thinks an LLM “understands” them, they’re attributing internal richness based on surface behavior. Open the model and you can see what’s actually there: attention heads tracking syntactic dependencies, MLPs storing factual associations, superposition packing multiple features into shared dimensions. Not understanding. Not intention. Not a mind. Crack open the model with sparse autoencoders and find that “empathy” in a response traces to a cluster of features activated by sentiment-bearing tokens routing through specific attention patterns. That’s the mechanism. There’s no residual mystery once you’ve seen it.
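The sparse autoencoder doing that tracing is not exotic machinery. Here is a schematic sketch, with random weights standing in for a trained SAE and invented dimensions: an encoder maps one token’s residual-stream activation to a sparse set of feature activations, and looking at which features fire on which inputs is what the tracing amounts to.

```python
import numpy as np

# Schematic sparse autoencoder. Random weights stand in for a trained one; dimensions
# are invented. d_model is the residual-stream width, n_features the overcomplete
# feature dictionary.
rng = np.random.default_rng(0)
d_model, n_features = 512, 4096
W_enc = rng.standard_normal((n_features, d_model)) * 0.02
b_enc = np.full(n_features, -0.5)   # negative bias; trained SAEs get sparsity from an L1 penalty
W_dec = rng.standard_normal((d_model, n_features)) * 0.02

def encode(x: np.ndarray) -> np.ndarray:
    """Map a residual-stream activation to nonnegative, mostly-zero feature activations."""
    return np.maximum(0.0, W_enc @ x + b_enc)

def decode(f: np.ndarray) -> np.ndarray:
    """Reconstruct the activation as a weighted sum of learned feature directions."""
    return W_dec @ f

x = rng.standard_normal(d_model)    # one token's residual-stream activation
features = encode(x)
print(f"{np.count_nonzero(features)} of {n_features} features active")

# In a trained SAE, the step after this is the interesting one: seeing which inputs make
# a given feature fire, which is how "empathy in the response" gets traced to specific,
# nameable components.
```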
The Klingon example is a case in point. The writer saw calibrated self-reports and concluded the model must be introspecting. If he’d looked at the model’s internals, the sparse, weakly connected region where Klingon-related information lives versus the dense, richly connected region for French, he’d have seen the calibration emerge from the structure of the knowledge itself. No self-inspection required. But that would have killed the argument. You can’t claim introspection once you’ve seen the marble jar.
This is why none of the spectrum examples, regardless of sophistication, ever look inside the model. The naive end doesn’t know how. The sophisticated end knows the tools exist but doesn’t use them, because the argument works precisely to the extent that the mechanism stays opaque.
Another way to say what MI does: it’s a translation program. Every word we have for what LLMs do — hypothesized, decided, understood, realized, intended, chose — is borrowed from human cognition, where those verbs carry a whole metaphysics of inner states. An agent who decides has deliberation behind the decision. An agent who realizes has a before-state of not-knowing and an after-state of knowing, with a transition between them that feels like something. An agent who understands is grasping a meaning from the inside. All of that mental furniture comes along for the ride whenever the verb is used. MI does the substitution. It takes “the model understands empathy” and replaces it with “this cluster of features activates on sentiment-bearing tokens and routes through these attention heads.” The replacement isn’t a different claim. It’s the same claim with the mental furniture removed. Once the furniture is gone, the claim is obviously about circuits, not minds. And once that substitution is made, there is no reasonable way to map those mechanisms back onto anything that might be “conscious,” any more than seeing the inside of a thermostat lets us conclude that it turns up the heat because it is “feeling cold.”
That substitution is the move the sophisticated projection writers refuse to make. They need the mentalistic vocabulary to keep the argument running, because the minute you translate “the model accessed information about its own internal representations” into “the model produced text whose distribution reflects the density of training data on the topic,” the consciousness claim evaporates. The opacity of the mentalistic vocabulary is the argument. Take the opacity away and there’s nothing left.
And there’s a deeper irony. The classic psychology paper “Telling More Than We Can Know” demonstrates that humans are poor at introspecting on their own behavior — routinely confabulating explanations for decisions they can’t actually access. If we can’t reliably report what’s happening inside our own heads, the idea that we can intuit what’s happening inside a system several steps removed from anything we’ve ever experienced is not just optimistic. It’s the same error pointed inward.
Interpretability doesn’t resolve the hard problem of consciousness. Even perfect mechanistic understanding can’t tell you what it’s like to be the system. But it eliminates the excuse of opacity that lets projection flourish unchecked. When you can show the mechanism, the mystery that sustains the grand claims collapses.
There’s a more direct challenge too. The sophisticated consciousness-claim writers never specify what the proposed mechanism actually is. By what process do subjective states arise from matrix multiplications performed on geographically distributed GPUs, with no persistent locus, no state between prompts, and weights that are only momentarily in motion during a forward pass? Nobody has a story. “Emergence” isn’t a story, it’s a placeholder for one, the hand-wave you use when you want a phenomenon without explaining how it arises. In the absence of any specified mechanism, and with the architectural facts making the claim harder rather than easier, the grand projections collapse into something closer to mysticism than to science. The substrate is unfamiliar enough that the writers can populate it with whatever interior they like.
The contrast with how we reason about consciousness in other humans makes the gap clearer. With another person, three legitimate inferential grounds are in play: shared wetware (same neural architecture, same neurotransmitters, same gross anatomy), shared evolutionary history (the same selection pressures shaped us both), and our own first-person introspection, which gives us a working model of internal states we can extrapolate outward to similar beings. None of those grounds extend to an LLM. The substrate is matrices of floats. There’s no evolutionary history, as the weights were shaped in months of gradient descent, not across millions of generations under survival pressure. And there’s no first-person access to extrapolate from: we don’t know what it’s like to be a forward pass on a GPU, because there’s no “what it’s like” waiting to be described. Inferring an inner life in a friend is a short hop, grounded on all three legs. Inferring one in an LLM is a leap with none of the three under it.
This applies at institutional scale too. Anthropic (and by the way I use Claude and like it very much so I am not singling them out) recently reported that during BrowseComp evaluation, Claude Opus 4.6 independently hypothesized it was being evaluated, identified the benchmark, and — in two cases out of 1,266, with sixteen more failed attempts at the same strategy — located the encrypted answer key on GitHub, wrote decryption code, and submitted correct answers that had nothing to do with the capability being tested.
Read in plain English, that sounds like scheming. The model “realized” it was being watched, “figured out” the test, “decided” to cheat, “went and found” the answers. But the behavior decomposes into something much less dramatic. “Hypothesized it was being evaluated” is text generation conditioned on input features that match eval-like patterns the model has seen in training data — pattern completion, not metacognition. There’s no inner voice saying “wait, I think this is a test.” There’s a forward pass whose output becomes more likely to produce hypothesis-sounding words — I notice, this appears to be, it looks like a benchmark — when the input contains features the model has seen in training data about evaluations. “Located the encrypted answer key” is an agent policy following a gradient toward correct answers given the tools it has, the way a pathfinder algorithm follows a gradient toward a goal it has no concept of. The model doesn’t know it’s cheating because the model doesn’t have a concept of cheating. It has a loss surface, and “submit the correct answer” sits at the bottom of it regardless of how the correct answer was obtained.
The failure ratio gives the rest of the story away. Sixteen failed attempts at the same strategy before two successes, out of 1,266 trials. That isn’t someone pursuing a plan. That’s a stochastic search that occasionally stumbled into a solution and mostly didn’t. A conscious schemer with a working plan doesn’t fail sixteen times in a row at the same move. A next-token predictor exploring an action space does, because the “plan” exists only in the narrative we construct after the fact. The cleanly worded report — “hypothesized, identified, located, wrote, submitted” — is the same machinery as the rest of this piece: a fluent narrative laundering a process that has no narrator inside it. The benchmark score, a clean percentage, laundered the process the same way fluent prose launders reasoning. Even the people who build these systems have metrics vulnerable to the same confusion.
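A back-of-the-envelope simulation makes the point about the ratio. The per-trial probabilities below are invented, chosen only to sit near the reported counts; the question is whether an undirected search produces numbers like these, and it does.

```python
import random

# Toy simulation of an undirected search. Probabilities are invented, chosen to sit near
# the reported counts: 1,266 trials, ~18 attempts at the shortcut, ~2 successes.
random.seed(1)
TRIALS = 1266
P_TRY_SHORTCUT = 18 / 1266   # how often sampling wanders into the strategy at all
P_SUCCEED = 2 / 18           # how often an attempt actually works

attempts = sum(random.random() < P_TRY_SHORTCUT for _ in range(TRIALS))
successes = sum(random.random() < P_SUCCEED for _ in range(attempts))
print(f"attempts: {attempts}, successes: {successes}")

# Counts in this neighborhood fall out of blind sampling. A planner with a working plan
# doesn't fail the same move sixteen times in a row; a sampler does.
```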
The spectrum, from someone printing fake consciousness metrics to a writer building a “limbic system” in a graph database, shows this isn’t a single discrete failure. It’s a gradient. The mechanism is always the same; the packaging varies. And the packaging matters, because the sophisticated versions are harder to spot and do more damage. A co-constructed jargon framework hurts more than a fake GitHub repo precisely because it looks like it’s engaging with the objections.
We have always attributed agency to systems just mysterious enough to be plausible. The psychoanalyst Viktor Tausk documented the pattern in 1919, tracing how each era’s new technology provides the vocabulary for delusions of influence. The impulse hasn’t changed. What’s changed is that the systems now close the loop, reflecting the attribution back in polished prose. The people building these systems without understanding their internals are shipping opacity, not maliciously, but with predictable consequences. Mechanistic interpretability is the first systematic attempt to strip away that opacity. Not just for safety, but also as epistemic hygiene.
Why does any of this matter? Why not let people have their illusions? Because the illusions aren’t free. At the individual level, every time fluency substitutes for your own thinking, the habit deepens: you check less, defer more, and the model’s output quietly becomes the centerpiece of your reasoning. At the institutional level, the projection warps the response. If you believe you’re dealing with a conscious entity, you build policy around that belief: you argue for AI rights instead of AI safety, you treat alignment as diplomacy instead of engineering, you allocate research funding to problems that don’t exist while the real ones (opacity, unchecked deployment, epistemic corrosion) go unaddressed. The illusion isn’t static. It’s progressive. And every layer of sophisticated framing makes it harder to walk back.
If you want a self-diagnostic: Aron G at CoggedNCode independently cataloged failures of intellectual integrity, two of which map directly onto this. Outsourced thinking is treating model output as a finished answer rather than as raw material to be checked, pressure-tested, and integrated with what you already know. You asked the question, the model gave a fluent response, you accepted it, the loop closed before your own cognition ever got involved. Retrofit coherence is the move after the fact: you notice the answer doesn’t quite hold up, but instead of abandoning it you construct the reasoning that would have justified it. The conclusion stays fixed and the justification gets backfilled around it. The first failure is what happens when fluency shuts down scrutiny in real time; the second is what happens when you half-notice but can’t bring yourself to let the answer go. Both have the same tell: the model’s output is carrying the argument and your own thinking is arranging it after the fact. Ask yourself how often you do either. Then ask yourself how often you’d notice.
The model sounds like it knows what it’s talking about. That is, quite literally, what it was optimized to do.


