12 Comments
ToxSec:

Super interesting read. Feels like I'll have to go over it a few times, but the open questions are great. Looks like there is still a lot of potential here.

Arshavir Blackwell, PhD:

Yes, in all honesty it is an awfully dense document.

John Holman:

Thanks for taking the time, brother. There is some really interesting stuff happening inside these models, and we have only scratched the surface of understanding it.

Michael Jovanovich:

Representation without a function was the part that really caught my attention.

I agree it feels extremely unlikely to be there for no reason. But you also found it didn't contribute to the final response despite firing.

My absolute, off-the-top-of-my-head wild speculation would be to look into whether it has some sort of anti-detection role, like the model confirming it's not "XYZ that could be confused."

You'd think the absence of that would hurt the response, but it might only matter on very ambiguous tokens.

It might be worth checking whether it influences the probability mass of the non-predicted top tokens, the runners-up. It might be pushing the wrong tokens down in a way that isn't visible in the final selection, which happens the same either way because the leading token is sufficiently far ahead.

Maybe simpler to say: it could be irrelevant to the argmax in your test set, but meaningful in the softmax distribution.
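One way to test the speculation above, sketched with made-up logits (no model required): run the same context twice, once with the feature active and once with it ablated, and compare the full softmax distributions rather than just the top-1 pick. The logit values below are purely hypothetical stand-ins for two forward passes.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of raw scores."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for one context: "baseline" with the feature active,
# "ablated" with the feature zeroed out. In a real experiment these would
# come from two forward passes of the same model.
baseline = [4.0, 2.1, 1.9, 0.5]
ablated  = [4.0, 1.6, 2.3, 0.5]

p = softmax(baseline)
q = softmax(ablated)

# The argmax (top-1 prediction) is unchanged by the ablation...
same_top = p.index(max(p)) == q.index(max(q))

# ...but KL divergence shows the runner-up probability mass moved.
kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

print(same_top, round(kl, 4))
```

If the KL divergence (or any distributional distance) is reliably nonzero while the argmax never changes, that would be exactly the "invisible on argmax, meaningful in softmax" signature.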

Arshavir Blackwell, PhD:

Yes, that's absolutely right: it didn't have an impact in the areas we looked at. You can never prove a negative, since there could be something else going on in some part of the system that you've missed. But I suspect gradient descent shares properties with natural selection: an imperfect solution with all sorts of weird byways and dead ends embedded in it.

Michael Jovanovich:

Thinking on it more, if I had to guess:

Gradient descent needs some sort of reference signal.

On any given context, the entire network needs a ping of activation to know whether any given feature would help or hurt.

On this, I asked Claude about my idea. Per Claude:

And yes, gradient descent could absolutely produce that. Think about what backprop actually needs to do credit assignment well. If everything is varying all the time, it’s hard to isolate what helped. But if one feature is reliably firing at a consistent level, the gradient through that feature becomes informative about the downstream landscape itself rather than about the interaction between two varying signals. It’s like a control condition that the network builds into itself.

This actually has precedent. Attention sinks — where transformers dump attention on the first token or a delimiter token that has no semantic relevance — are doing something similar. The network discovers that having a stable computational anchor is more useful than using that capacity for content. Bias terms serve a related function at a simpler level.
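The attention-sink pattern Claude describes can be illustrated with a toy post-softmax attention matrix (the numbers below are made up, not from a real model): every query row dumps most of its mass on position 0 regardless of content.

```python
# Each row is one query's attention distribution over keys (rows sum to 1,
# i.e. post-softmax). These weights are hypothetical, chosen only to show
# the "sink" shape: heavy mass on the first token in every row.
rows = [
    [0.70, 0.10, 0.10, 0.10],
    [0.65, 0.15, 0.10, 0.10],
    [0.60, 0.10, 0.20, 0.10],
]

# Average attention mass dumped on position 0 across all queries.
sink_mass = sum(row[0] for row in rows) / len(rows)
print(round(sink_mass, 3))  # -> 0.65
```

A first token absorbing well over half the attention mass, independent of what that token is, is the stable computational anchor the quote is pointing at.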

Michael Jovanovich:

On second thought, here's what I'd do in your position.

Take your training data, and modify your training loop so that any time that feature receives a high gradient, it logs the context.

By the time it goes through an epoch, you should know whether it's ever useful and in what scenarios.

That would catch a much broader set of scenarios than my softmax speculation.
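A minimal sketch of that logging idea, with the gradient plumbing abstracted away. In a real PyTorch loop the per-batch gradient magnitude would come from a backward hook on the feature's activation; here the contexts, gradient values, and threshold are all hypothetical.

```python
HIGH_GRAD_THRESHOLD = 0.5   # assumed cutoff; would need tuning per model

def scan_epoch(batches):
    """batches: iterable of (context_text, feature_grad_magnitude) pairs.

    Returns the contexts where the feature of interest received a high
    gradient, i.e. where the optimizer 'cared' about that feature.
    """
    flagged = []
    for context, grad_mag in batches:
        if grad_mag > HIGH_GRAD_THRESHOLD:
            # In a live training loop this would be a log line instead.
            flagged.append(context)
    return flagged

# Hypothetical epoch: (context, gradient magnitude on the feature).
epoch = [
    ("the bank of the river", 0.05),
    ("she went to the bank to deposit", 0.90),  # ambiguous-word context
    ("a simple declarative sentence", 0.02),
]
print(scan_epoch(epoch))
```

After one pass, the flagged contexts would show whether the feature's gradient activity clusters around any recognizable scenario, such as the ambiguous tokens speculated about earlier.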

Michael Jovanovich:

Did you look at whether it shifted the softmax? That would be my first intuition on where to look.

Arshavir Blackwell, PhD:

So you're asking whether it changed the rankings of the predictions that are essentially below threshold?

Michael Jovanovich:

Yeah, it could have shifted things under the top selected token, so if you only look at the final prediction, it looks useless. But the non-selected token rankings could still be getting moved around by the feature.
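That check can be sketched directly: compare the token rankings from two runs (feature on vs. ablated) and ask whether the top token held while the runners-up reordered. The token probabilities below are hypothetical stand-ins for two forward passes.

```python
# Hypothetical next-token distributions for the same context.
with_feature = {"the": 0.60, "a": 0.20, "this": 0.12, "that": 0.08}
ablated      = {"the": 0.60, "this": 0.19, "a": 0.13, "that": 0.08}

def ranking(dist):
    """Tokens sorted from most to least probable."""
    return sorted(dist, key=dist.get, reverse=True)

r1 = ranking(with_feature)
r2 = ranking(ablated)

top_unchanged = r1[0] == r2[0]    # the argmax looks identical...
runners_moved = r1[1:] != r2[1:]  # ...but the runners-up reordered
print(top_unchanged, runners_moved)
```

Seeing `True True` across many contexts would mean the feature is doing real work below the selection threshold, invisible to any evaluation that only inspects the final prediction.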

Arshavir Blackwell, PhD:

So my question to you would be, let's say you're right and it does have an effect. What does that mean? Are we seeing some kind of early stage of learning where these things are STARTING to have an impact and maybe if we did further training we would see it evolving into something that is actually useful to prediction?

Arshavir Blackwell, PhD:

An excellent idea. Added to our list of "what to do next."