What Counts as Structure?
A 1995 paper noticed that a structural linguist and a connectionist had converged on the same insight about language. That convergence sits at the heart of mechanistic interpretability.
Today we explore a paper published more than thirty years ago: Peter A. Bensch’s 1995 essay, “Neo-structuralism: A commentary on the correlations between the work of Zellig Harris and Jeffrey Elman.”
Bensch noticed something unusual. Two thinkers from very different intellectual traditions — American structural linguistics and connectionist cognitive science — appeared to converge on a shared insight about language.
The convergence was not stylistic. It was structural.
And the question it raises has not gone away.
Two Roads to Structure
Zellig Harris, working from the 1940s through the 1960s, argued that the structure of language could be derived from distributional evidence alone. Words belong to categories not because of intrinsic semantic essences, but because of how they behave in context — what can substitute for what, what co-occurs with what, what environments overlap.
Harris’s method was extensional. Categories were defined by patterns of distribution. Meaning, in the intuitive sense, was not the starting point. Structure was to be discovered in the signal.
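To make the extensional idea concrete, here is a minimal sketch of distributional categorization on an invented toy corpus. It illustrates the logic rather than Harris’s actual procedure: words are grouped purely by the neighbor environments they share, and meaning is never consulted.

```python
# Toy illustration of distributional categorization: words are grouped
# purely by the environments they occur in; meaning is never consulted.
# The corpus is invented for illustration.
from collections import defaultdict

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

corpus = [
    "the cat chased the mouse",
    "the dog chased the cat",
    "the mouse ate the cheese",
    "the dog ate the bone",
    "a cat sees a dog",
    "a mouse sees a cat",
]

# Count (left-neighbor, right-neighbor) environments for each word.
envs_of = defaultdict(lambda: defaultdict(int))
for sentence in corpus:
    toks = ["<s>"] + sentence.split() + ["</s>"]
    for i in range(1, len(toks) - 1):
        envs_of[toks[i]][(toks[i - 1], toks[i + 1])] += 1

# Word-by-environment count matrix over a shared environment index.
envs = sorted({e for c in envs_of.values() for e in c})
words = sorted(envs_of)
X = np.array([[envs_of[w][e] for e in envs] for w in words], dtype=float)

# Cluster by distributional similarity alone: words that share
# environments (i.e., that substitute for one another) tend to group.
labels = fcluster(linkage(pdist(X, metric="cosine"), method="average"),
                  t=3, criterion="maxclust")
for k in sorted(set(labels)):
    print(k, [w for w, l in zip(words, labels) if l == k])
```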
Decades later, Jeff Elman approached language from an entirely different direction. In 1990, he introduced the Simple Recurrent Network (SRN), training it to predict the next word in a sequence. The network was given no grammatical rules, no labeled categories. It learned only from sequential input.
When Elman examined the internal states of the trained network, something striking emerged. Using hierarchical cluster analysis on hidden unit activations, he found that the network’s internal representations organized into clusters corresponding to recognizable grammatical and semantic distinctions — nouns separated from verbs, animate entities grouped together, transitive and intransitive verbs forming distinct regions of state space.
The network had not been told what a noun was. Yet words that appeared in similar contexts converged toward similar internal states.
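Here is Elman’s demonstration in miniature, a sketch under invented assumptions rather than his original setup: PyTorch’s tanh `nn.RNN` stands in for the SRN, the toy grammar and sizes are made up, and the analysis step mirrors his hierarchical clustering of hidden states.

```python
# Elman-style demonstration in miniature: train a simple recurrent
# network on next-word prediction, then hierarchically cluster its
# hidden states per word. Toy data and sizes are invented; nn.RNN
# (tanh) stands in for the original SRN architecture.
import torch
import torch.nn as nn
from scipy.cluster.hierarchy import linkage, fcluster

vocab = ["cat", "dog", "mouse", "chase", "see", "eat"]
stoi = {w: i for i, w in enumerate(vocab)}

class SRN(nn.Module):
    def __init__(self, n_vocab, d_hidden=16):
        super().__init__()
        self.emb = nn.Embedding(n_vocab, d_hidden)
        self.rnn = nn.RNN(d_hidden, d_hidden, batch_first=True)
        self.out = nn.Linear(d_hidden, n_vocab)

    def forward(self, x):
        h, _ = self.rnn(self.emb(x))
        return self.out(h), h  # next-word logits + hidden states

torch.manual_seed(0)
model = SRN(len(vocab))
opt = torch.optim.Adam(model.parameters(), lr=1e-2)

# Toy noun-verb-noun "grammar"; no category labels are ever given.
data = [["cat", "chase", "mouse"], ["dog", "see", "cat"],
        ["mouse", "eat", "mouse"], ["dog", "chase", "cat"],
        ["cat", "see", "dog"], ["mouse", "chase", "dog"]]
batch = torch.tensor([[stoi[w] for w in s] for s in data])
for _ in range(300):
    logits, _ = model(batch[:, :-1])
    loss = nn.functional.cross_entropy(
        logits.reshape(-1, len(vocab)), batch[:, 1:].reshape(-1))
    opt.zero_grad(); loss.backward(); opt.step()

# Elman's move: average each word's hidden states over its contexts,
# then cluster the averages.
with torch.no_grad():
    _, h = model(batch)
states = {w: [] for w in vocab}
for s, hs in zip(data, h):
    for w, vec in zip(s, hs):
        states[w].append(vec)
reps = torch.stack([torch.stack(states[w]).mean(0) for w in vocab])
Z = linkage(reps.numpy(), method="average")
for w, k in zip(vocab, fcluster(Z, t=2, criterion="maxclust")):
    print(w, "-> cluster", k)  # nouns and verbs tend to separate
```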
Bensch’s observation was simple but profound: despite radically different methodologies, Harris and Elman were engaged in the same enterprise — uncovering linguistic structure through distributional regularities.
Operator Grammar and Hidden Geometry
Harris’s operator grammar described language in terms of relational operations — operators acting on arguments — derived from substitution patterns in corpora. Categories were defined not by what speakers felt about words, but by how words regularly replaced and combined with one another in actual language use.
Elman’s SRN, though couched in vector dynamics rather than algebraic operators, revealed a comparable phenomenon. Words that shared environments acquired similar representations in hidden space. The structure of language appeared as geometry: equivalence classes formed as attractors in a dynamical system.
In both cases, structure was not imposed. It emerged from patterns of use.
The mathematics differed. The dimensionality differed. But the core intuition aligned: distribution carries structural information.
Structure in the Signal
Harris argued that syntactic categories and grammatical relations could be derived from observable distributional evidence. Elman demonstrated, computationally, that a recurrent network trained on raw sequences could internalize long-distance dependencies — including subject–verb agreement across relative clauses — without explicit rules. We see analogous phenomena in modern transformers: attention heads that track indirect object relations, features that activate when a verb selects for a particular kind of subject, or circuits that respond systematically to syntactic roles rather than individual tokens.
None of these phenomena — from Harris’s substitution classes to Elman’s SRN to modern transformer circuits — requires a richly specified innate grammar to explain the organization of linguistic categories. In each case, structure emerged through interaction with distributional regularities in the input.
That insight now underlies modern large language models. Transformers learn by predicting what comes next. From distributional context, they acquire representations that encode grammar, semantics, and pragmatic regularities.
But if distribution gives rise to structure, a deeper question follows.
What exactly counts as structure?
Before Mechanistic Interpretability Had a Name
When Elman analyzed hidden activations in the SRN, he was doing something recognizably interpretive. He did not stop at performance metrics. He asked what internal organization had formed.
Modern mechanistic interpretability (MI) pursues the same ambition at far greater scale. Sparse autoencoders decompose dense activation vectors into sparse directions that correspond to interpretable patterns. Circuit tracing maps causal pathways from inputs to outputs. Superposition analysis explains how multiple concepts can coexist within shared neurons.
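For readers unfamiliar with the first of these tools, a minimal sparse-autoencoder sketch follows. The sizes, the L1 penalty, and the training loop are illustrative assumptions rather than any particular published recipe; the point is only the shape of the technique: reconstruct dense activations as sparse combinations of learned directions.

```python
# Minimal sparse autoencoder over activations: reconstruct a dense
# vector as a sparse combination of learned directions. Sizes, the
# L1 coefficient, and the training loop are illustrative assumptions,
# not a specific published recipe.
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model=512, d_features=4096):
        super().__init__()
        self.enc = nn.Linear(d_model, d_features)
        self.dec = nn.Linear(d_features, d_model, bias=False)

    def forward(self, x):
        f = torch.relu(self.enc(x))  # sparse feature activations
        return self.dec(f), f        # reconstruction + codes

sae = SparseAutoencoder()
opt = torch.optim.Adam(sae.parameters(), lr=1e-3)
l1_coeff = 1e-3

# Stand-in for a batch of residual-stream activations from some model.
acts = torch.randn(1024, 512)

for _ in range(100):
    recon, f = sae(acts)
    loss = ((recon - acts) ** 2).mean() + l1_coeff * f.abs().mean()
    opt.zero_grad(); loss.backward(); opt.step()

# Each column of sae.dec.weight is a candidate feature direction: at
# minimum, an extensional grouping of the contexts that activate it.
```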
The goal is not merely predictive success. It is internal explanation.
Yet beneath these tools lies an unresolved theoretical problem: what counts as structure?
Sparse autoencoders recover directions in activation space that activate for coherent sets of tokens. These directions often correspond to recognizable patterns — months of the year, indirect objects, stylistic registers, programming languages. At minimum, a sparse direction defines an extensional grouping: a set of contexts that co-vary in high-dimensional space.
Structural linguistics treated categories in precisely this way. For Harris, two elements belonged to the same category if they shared substitutional environments. Categories were structural relations defined over distribution — not necessarily ontological atoms inside a speaker’s mind.
Elman’s SRN produced geometric analogues of these equivalence classes. Words appearing in similar contexts converged toward similar internal states. The significance of the clustering was not merely that words grouped together, but that the resulting regions supported the network’s ability to represent systematic relationships — including selectional constraints between verbs and their arguments.
Modern interpretability tools may be recovering analogous equivalence classes at much larger scale. But clustering is not yet ontology. A coherent grouping in activation space does not entail that the model treats it as a primitive of its own computation. Circuit tracing adds causal information — showing influence on outputs — yet influence alone does not settle whether we have identified a fundamental computational unit or a convenient coordinate axis.
Clarifying What We Mean by Structure
The term “structure” appears frequently in mechanistic interpretability. Researchers speak of uncovering structure in activations, identifying structural features, or mapping a model’s internal ontology. Yet the word carries multiple meanings.
A sparse feature that clusters tokens by co-activation exhibits extensional structure: a pattern of similarity in distribution. A feature that, when intervened upon, predictably alters outputs exhibits causal structure: a role in computation. A feature that the system cannot easily reorganize away without degrading performance may approach something stronger — a computational primitive.
These senses are related but not identical. In contemporary interpretability literature, it is common to encounter phrases such as “the model has a feature for deception” or “we have uncovered the model’s internal ontology.” These formulations are understandable shorthand. But they often rely on an implicit move from extensional clustering — a sparse direction activating for a set of related tokens — to an ontological claim about internal primitives. The distinction is subtle, but important.
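The distinction between the first two senses can at least be operationalized. Below is a sketch of one standard causal test: project a candidate feature direction out of a layer’s activations mid-forward and check whether outputs change. Here `model`, `layer`, `tokens`, and `feature_dir` are placeholders, and everything around the hook is assumed.

```python
# Sketch of a causal test for a single feature direction: project it
# out of a layer's activations mid-forward and check whether outputs
# change. `model`, `layer`, `tokens`, and `feature_dir` are
# placeholders; the forward-hook pattern is standard PyTorch, but the
# setup around it is assumed (and real layers may return tuples
# rather than plain tensors).
import torch

def ablate_direction(direction: torch.Tensor):
    d = direction / direction.norm()
    def hook(module, inputs, output):
        # Remove the component of the activation along `d`;
        # returning a value from a forward hook replaces the output.
        return output - (output @ d).unsqueeze(-1) * d
    return hook

# Hypothetical usage, with `layer` some nn.Module inside `model`:
#   baseline_logits = model(tokens)
#   handle = layer.register_forward_hook(ablate_direction(feature_dir))
#   ablated_logits = model(tokens)
#   handle.remove()
# Extensional structure: the feature fires on a coherent token set.
# Causal structure: ablating it predictably shifts ablated_logits
# relative to baseline_logits on the relevant inputs.
```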
A catalogue of features, however precise, is unlikely to be sufficient. A theory of structure would need to explain how recovered features relate, compose, and constrain one another across layers and contexts. Whether that ultimately takes the form of a relational algebra, a dynamical systems account, or something else remains an open question — but feature discovery alone does not settle it.
Structural linguistics was explicit about its criteria: categories were defined by substitutional relations in distribution. Modern interpretability has powerful tools for discovering patterns and tracing causal influence. A dimension the field has only begun to reckon with is how these discoveries map onto claims about structural organization inside the model.
There is a tension here that is worth naming. By Harris’s own criteria, mechanistic interpretability has already found structure. Distributional clustering *is* structure in the Harrisian sense — if tokens pattern together, they form a category, and that is all there is to say. Harris was not asking whether speakers had mental boxes corresponding to his categories. He was describing distributional regularities and treating them as sufficient evidence for structural organization.
But MI does not operate within Harris’s framework, even when it inherits his methods. The field implicitly wants something stronger: not just that tokens cluster, but that the model *uses* those clusters as units of its own computation — that they are causally load-bearing, that computation routes through them, that they are genuine joints in the model’s processing rather than statistical artifacts visible only from the outside. That is a fundamentally different kind of claim. It is the difference between describing a pattern in the data and asserting something about the architecture of the system that produced it.
This unacknowledged gap is a source of much of the confusion in current interpretability discourse. The tools are inherited from the distributional tradition. The claims are aimed at something the distributional tradition never attempted. Until the field makes that distinction explicit, it will continue to oscillate between two standards of evidence without recognizing that they answer different questions.
An individual feature, no matter how cleanly recovered, is not yet structure. Structure begins when features stand in systematic, compositional relationships with one another — when the directions form a relational algebra, not just a dictionary.
Harris understood this. His categories were never defined in isolation; they were defined by the substitution and co-occurrence relations they entered into. A noun was not a noun because of what it meant, but because of the entire lattice of environments it shared with other nouns and the operators that selected for it.
Elman’s SRN demonstrated the same principle geometrically. The value of the cluster analysis was not that individual words landed in the right neighborhood — it was that the neighborhoods themselves stood in the right relations to each other. Animate nouns occupied a region of state space that bore a systematic geometric relationship to the verbs that selected for animate subjects. The structure was in the relational system, not in any single activation pattern.
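As a sketch of what such a relational check might look like computationally: compare offsets between category centroids rather than inspecting any single cluster. The vectors below are random stand-ins, so no structure should appear; with genuinely learned representations, a consistent offset across class members would be evidence of relational organization.

```python
# Relational check in miniature: compare offsets between category
# centroids rather than any single cluster. Random vectors stand in
# for learned word representations, so no real structure is present;
# the point is the shape of the measurement.
import numpy as np

rng = np.random.default_rng(0)
emb = {w: rng.normal(size=64)
       for w in ["cat", "dog", "girl", "chase", "see"]}

def centroid(words):
    return np.mean([emb[w] for w in words], axis=0)

def cos(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

animates   = centroid(["cat", "dog", "girl"])
anim_verbs = centroid(["chase", "see"])  # verbs taking animate subjects

# A relational claim: one shared offset should carry each animate
# noun toward the verbs that select for it. Test it per class member.
offset = anim_verbs - animates
for w in ["cat", "dog", "girl"]:
    print(w, round(cos(emb[w] + offset, anim_verbs), 3))
```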
Why This Still Matters
Bensch traced a convergence between structural linguistics and early connectionism. Three decades later, that convergence has scaled into the dominant architecture of artificial intelligence. Large language models learn from distributional regularities.
Mechanistic interpretability attempts to reverse-engineer the organization that results. The problems Harris and Elman worked on now confront systems with billions of parameters.
If interpretability is to move from feature catalogues toward structural explanation, it will need increasingly explicit criteria for what counts as structure — whether defined in terms of causal indispensability, compositional relations, dynamical organization, or something not yet formalized.
Put differently: the field’s current tools are largely lexical. Sparse autoencoders and transcoders recover individual features — the vocabulary of the model’s internal representations. Even cross-layer transcoders, which decompose computation rather than static activations, still operate at the level of individual units.
What is missing are tools that operate at the sentence level — not in the literal sense of processing sentences as input, but in the structural sense of capturing how features compose, constrain, and select for one another in context. A sentence is not a list of words; it is a system of relationships that determines which combinations are grammatical and which are not. The same distinction applies to features inside a model. Knowing the vocabulary is necessary. But it is the compositional grammar — the account of how features bind into coherent, productive patterns of computation — that would constitute a structural theory.
Harris worked at this level. His distributional analysis was never just about cataloguing which words appeared in which contexts; it was about the substitutional and combinatorial relations that organized those contexts into a system. Elman’s SRN learned at this level — the network internalized not just individual word categories but the selectional relationships between them. The question for MI is whether its tools can be extended to operate at this level as well.
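What might a first step toward such tools look like? One primitive, sketched below over invented data, is feature co-activation statistics: which features fire together more often than chance across contexts. That captures only the crudest relational signal, nothing like a compositional grammar, but it moves from a dictionary of features toward relations between them.

```python
# A first primitive for relational analysis: feature co-activation
# statistics. `feature_acts` is an invented stand-in for per-token
# SAE feature activations; a real analysis would collect these from
# a model. Positive PMI between two features means they co-fire more
# than chance, a candidate relation, far short of a grammar.
import numpy as np

rng = np.random.default_rng(0)
feature_acts = rng.exponential(0.1, size=(10_000, 256))  # tokens x feats
feature_acts[feature_acts < 0.3] = 0.0                   # enforce sparsity

active = (feature_acts > 0).astype(float)
p = active.mean(axis=0)                        # marginal firing rates
joint = (active.T @ active) / active.shape[0]  # pairwise co-firing

# Pointwise mutual information between feature pairs.
eps = 1e-9
pmi = np.log((joint + eps) / (np.outer(p, p) + eps))
np.fill_diagonal(pmi, 0.0)
i, j = np.unravel_index(np.argmax(pmi), pmi.shape)
print(f"most associated pair: features {i}, {j} (pmi={pmi[i, j]:.2f})")
```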
The lineage from Harris through Elman to modern MI is not one of simple inheritance. It is one of recurring problems — and the most important of those problems remains open.
If a system trained on language inevitably produces structure, then structure might not be a special property of cognition at all. It might simply be an unavoidable property of high-dimensional pattern learning. Which would mean grammar isn’t a mysterious innate module. It’s just what happens when prediction systems compress regularities in data.