super interesting read. feels like i’ll have to go over it a few times. but the open questions are great. looks like there is still a lot of potential here.
Yes, in all honesty it is an awfully dense document.
Thanks for taking the time, brother. There is some really interesting stuff happening inside these models, and we have only scratched the surface of understanding it.
Representation without function was the part that really caught my attention.
I agree, it feels extremely unlikely to be there for no reason. But you also found it didn't contribute to the end response despite firing.
My absolute off-the-top-of-my-head wild speculation would be to look into whether it has some sort of anti-detection role. Like the model confirming it's not "XYZ that could be confused with it."
You'd think the absence of that would hurt the response, but it might only matter on very ambiguous tokens.
It might be worth checking whether it influenced the probability mass of the non-predicted top tokens, the runners-up. It might be pushing the wrong tokens down without that being visible in the final selection, which happens either way because the lead token is sufficiently far ahead.
Maybe simpler to say: it could be irrelevant to the argmax on your test set, but meaningful in the softmax distribution.
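To make the argmax-vs-softmax distinction concrete, here's a toy numpy sketch with made-up logits (nothing here comes from a real model): the feature's contribution leaves the top token untouched but measurably shifts the full distribution.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# Toy logits over a 5-token vocab (hypothetical numbers, for illustration only).
base = np.array([4.0, 1.0, 0.5, 0.2, -1.0])       # logits with the feature ablated
feature_effect = np.array([0.0, -0.8, 0.6, 0.3, 0.0])  # what the feature adds
with_feature = base + feature_effect

# Argmax is identical: the lead token is far enough ahead either way.
assert with_feature.argmax() == base.argmax()

# But the softmax distributions differ measurably.
p, q = softmax(with_feature), softmax(base)
kl = np.sum(p * np.log(p / q))
print(f"argmax unchanged, KL(p || q) = {kl:.4f}")  # nonzero divergence
```

The point of the sketch is only that "no change in the end prediction" and "no change in the distribution" are different claims, and KL divergence (or just comparing runner-up probabilities) separates them.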
Yes, that's absolutely right: it didn't have an impact in the areas we looked at. You can never prove a negative, since there could be something else going on in some part of the system that you've missed. But I suspect gradient descent shares properties with natural selection: an imperfect solution with all sorts of weird byways and dead ends embedded in it.
Thinking on it more, if I had to guess:
Gradient descent needs some sort of reference signal
In any given context, the entire network needs a ping of activation to know whether any given feature would help or hurt.
On this, I asked Claude about my idea. Per Claude:
And yes, gradient descent could absolutely produce that. Think about what backprop actually needs to do credit assignment well. If everything is varying all the time, it’s hard to isolate what helped. But if one feature is reliably firing at a consistent level, the gradient through that feature becomes informative about the downstream landscape itself rather than about the interaction between two varying signals. It’s like a control condition that the network builds into itself.
This actually has precedent. Attention sinks — where transformers dump attention on the first token or a delimiter token that has no semantic relevance — are doing something similar. The network discovers that having a stable computational anchor is more useful than using that capacity for content. Bias terms serve a related function at a simpler level.
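A toy illustration of the sink behavior described above (made-up scores, not real attention weights): when a query matches no content key strongly, a single constant high-scoring "sink" position absorbs most of the softmax mass.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# Hypothetical attention scores for one query over 6 positions.
# Position 0 plays the sink (e.g. BOS): its key yields a constant, elevated score.
sink_score = 2.0
content_scores = np.array([0.1, -0.2, 0.0, 0.3, -0.1])  # query matches nothing strongly
weights = softmax(np.concatenate(([sink_score], content_scores)))
print(weights[0])  # the semantically empty sink soaks up the majority of the mass
```

The numbers are arbitrary; the takeaway is just that a stable high-scoring anchor gives the softmax somewhere to dump attention when no content position deserves it.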
Not sure. On second thought, here's what I'd do in your position:
Take your training data and modify your training loop so that any time that feature receives a high gradient, it logs the context.
By the time it gets through an epoch, you should know whether it's ever useful and in which scenarios.
That would catch a much broader range of scenarios than my softmax speculation.
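A minimal sketch of that logging idea, using a made-up toy model rather than a real training loop (the feature index, threshold, and shapes are all hypothetical): a tiny softmax classifier where the gradient flowing into one feature activation is computed by hand, and any context pushing it past a threshold gets flagged.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup (all names/shapes hypothetical): 8 features -> 5-token vocab.
W = rng.normal(size=(5, 8))
FEATURE = 3          # index of the always-on feature under investigation
THRESHOLD = 0.5      # "high gradient" cutoff, chosen arbitrarily for the sketch

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def feature_gradient(f, target):
    """d(cross-entropy loss)/d(f[FEATURE]) for logits z = W @ f."""
    p = softmax(W @ f)
    p[target] -= 1.0                # dL/dz for softmax + cross-entropy
    return (p @ W)[FEATURE]         # chain rule back to the feature activation

# One "epoch" over fake contexts: flag whenever the feature gets a big gradient.
flagged = []
for step in range(100):
    f = rng.normal(size=8)
    target = int(rng.integers(5))
    g = feature_gradient(f, target)
    if abs(g) > THRESHOLD:
        flagged.append((step, g))   # in a real run, log the actual context here

print(f"{len(flagged)} / 100 contexts gave the feature a high gradient")
```

In a real framework this would be a gradient hook on the feature's activation rather than a hand-derived formula, but the logic is the same: watch the backward signal, not just the forward one.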
Did you look at whether it shifted the softmax? That would be my first intuition on where to look.
So you're saying: did it change the rankings of the predictions that are essentially below threshold?
Yeah, it could have shifted things under the top selected token, so if you only look at the final prediction, it looks useless. But the non-selected token rankings could still be getting moved around by the feature.
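A toy sketch of that exact check (hypothetical logits again): the winner is identical with and without the feature, but the ordering of the runners-up changes.

```python
import numpy as np

# Made-up logits: the feature reshuffles the runners-up, not the winner.
without = np.array([5.0, 2.0, 1.8, 1.5, 0.9])
with_f  = np.array([5.0, 1.6, 2.1, 1.4, 1.7])   # same lead token, shuffled tail

assert with_f.argmax() == without.argmax()       # identical final prediction

# Rank the non-selected tokens (indices 1..4) best-first in each case.
tail = np.arange(1, 5)
rank_without = tail[np.argsort(-without[tail])]
rank_with = tail[np.argsort(-with_f[tail])]
print(rank_without, rank_with)  # same argmax, different runner-up ordering
```

Comparing those two rank orders over a real evaluation set (e.g. with a rank correlation) would show whether the feature is silently reshuffling the tail.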
So my question to you would be: let's say you're right and it does have an effect. What does that mean? Are we seeing some kind of early stage of learning where these things are STARTING to have an impact, and maybe with further training we would see it evolve into something that is actually useful for prediction?
An excellent idea. Added to our list of "what to do next."