Ideas on how to unify the Tolman-Eichenbaum Machine and the Thousand Brains Theory

Introduction

I have been looking into the Thousand Brains Theory and trying to understand how it could help the field of AI. One interesting thing I found in my research is the TEM (Tolman-Eichenbaum Machine), which is an attempt at replicating the function of the hippocampus using machine learning. Since one of the ideas of the Thousand Brains Theory is that the cortical columns of the neocortex are doing essentially the same thing as the hippocampus, the TEM might give insight into how to replicate cortical columns using machine learning.

Interestingly, the TEM can also be implemented with a transformer: the TEM-t. Using transformers has a lot of advantages, since a lot of time and effort has already gone into scaling them, and transformer-based architectures are extremely powerful. So I thought about a possible transformer-based architecture that would build on the existing TEM-t and add other elements analogous to the Thousand Brains Theory. While transformers themselves aren’t biologically plausible, I think it is possible that the brain implements something related.

This is still very recent and incomplete, and I haven’t had the time to finish a simple implementation of it yet (I am working on this in my free time on my laptop), but I think it might be interesting and give some ideas/insight into the Thousand Brains Theory.

I am naming this implementation idea Turingrade V0.1, in reference to Alan Turing, and a specific sci-fi setting.

Contextualised feature vectors and location vectors

This architecture is composed as follows:
A set of CC-t (cortical column transformers) each receive a set of features and motor efference copies as they explore an environment. All the CC-t that are on the same level and learn the same modality should probably share their parameters. While this is not biologically plausible, it would allow for faster training and smaller model sizes.

We are going to look at how one CC-t works with its set of feature vectors F, with the feature vector at time t being f_t, and its set of location vectors L, with the location vector at time t being l_t.

You apply a certain number of classical transformer self-attention and feedforward layers to the set of feature vectors (F) and to the set of location vectors (L), but you keep the feature and location vectors separated. The feature vector set and location vector set each have their own set of transformer layers, and they don’t mix. This set of location vectors is calculated one by one by a path integrator using the motor efference copy (an action vector). In the original TEM-t architecture, the path integrator was an RNN, but today you could probably replace it with a more modern alternative like a Mamba architecture, which did not exist at the time the TEM-t paper was released. The path integrator also receives some feedback, as we will see later.
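A minimal sketch of the path-integration step described above, assuming a plain tanh recurrent update. The function name `path_integrate`, the weight matrices `W_l` and `W_a`, and the toy sizes are hypothetical choices for illustration; the integrator in the TEM-t paper (or a Mamba-style replacement) would be learned and may look quite different. The feedback input is left out here.

```python
import math

def matvec(M, v):
    """Multiply matrix M (a list of rows) by vector v."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def path_integrate(l0, actions, W_l, W_a):
    """Roll a recurrent path integrator over a sequence of action vectors.

    Assumed update rule: l_{t+1} = tanh(W_l @ l_t + W_a @ a_t).
    Returns the list [l_0, l_1, ..., l_T].
    """
    locations = [l0]
    for a in actions:
        pre = [x + y for x, y in zip(matvec(W_l, locations[-1]), matvec(W_a, a))]
        locations.append([math.tanh(p) for p in pre])
    return locations

# Toy example: 2-d location state, 2-d actions, hand-picked (not learned) weights.
W_l = [[1.0, 0.0], [0.0, 1.0]]
W_a = [[0.1, 0.0], [0.0, 0.1]]
east, west = [1.0, 0.0], [-1.0, 0.0]
locations = path_integrate([0.0, 0.0], [east, west], W_l, W_a)
```

With these hand-picked weights, stepping east and then west brings the state back very close to where it started, which is the behaviour a path integrator is supposed to learn.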

Here, I use the term “location vector”, but in reality, those vectors represent the overall “topology” of the space; they could convey information like speed, acceleration, or shape, and they aren’t merely 2d or 3d space coordinates. As they pass through the different transformer layers, they get contextualised, and the information they encode becomes enriched with learned context. For instance, seeing the same location change over time might allow the transformer layers to add speed information to the location vectors. You could also imagine that a rough shape would be enriched with an idea of what object possesses that shape. However, you should note that these examples are just here to give an intuition of what could be happening.
The feature vectors have their own transformer layers. Here, you could imagine that the “orange” and “black” features might get enriched with the “tiger” idea.
Those transformer layers could potentially be scaled to large sizes.

At the end, you get a set of enriched location vectors L' and enriched feature vectors F'.

Cortical Layer Outputs

Those elements will serve to get 3 outputs:

Feature prediction head
Let’s imagine that the model is at a time t. With the path integrator and the motor efference copy, we obtain l_{t+1} and then, after the location transformer layers, l_{t+1}'. Like in the original TEM-t paper, we can obtain a prediction of f_{t+1}; let’s call it f_{t+1,p}. We can have f_{t+1,p} = Attention(l_{t+1}', L', F'), where l_{t+1}' is the query, and the keys L' and values F' are all the past enriched location and feature vectors, l_t' and f_t'. This output head is trained to predict the next feature that the model will see.
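To make the formula concrete, here is a minimal sketch of single-query scaled dot-product attention, assuming the usual Attention(query, keys, values) argument order. The helper name `attention` and the toy vectors are illustrative assumptions; a real head would also apply learned query/key/value projections.

```python
import math

def attention(query, keys, values):
    """Single-query scaled dot-product attention.

    query: one vector; keys, values: lists of vectors (one per past step).
    Returns the softmax-weighted sum of the values.
    """
    d = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d) for key in keys]
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim_v = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim_v)]

# Toy memory: two past steps with enriched locations L' and features F'.
L_prime = [[10.0, 0.0], [0.0, 10.0]]
F_prime = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]

# The model returns to the first location, so the predicted feature
# should be close to the feature that was stored there.
l_next = [10.0, 0.0]
f_pred = attention(l_next, L_prime, F_prime)
```

Because the query matches the first stored location much better than the second, almost all the attention weight goes to the first stored feature.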

Location feedback head
Let’s imagine that we now arrive at time t+1 and receive the feature f_{t+1}. (If the model were in a purely generative mode, we would simply reuse the output of the feature prediction head here; this could be an analogy with what happens when the brain is dreaming.) We want to calculate feedback for the path integrator in order to prevent drift and allow it to get more complex information on how the locations are evolving. The motor efference copy alone cannot tell whether the model is in a moving environment, like on a skateboard or falling, so it needs some mechanism to get feedback on that.
We could use another transformer output head that does the following calculation:
k_{t+1} = Attention(f_{t+1}', F', L')
This can be seen as a sort of “opposite” of the feature prediction head. Exactly how this feedback is given to the path integrator, I am not sure yet. There could be several ways to do it, and it might not matter exactly how it is done.
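As an illustration only, the sketch below computes k_{t+1} with the keys and values swapped relative to the feature prediction head, then applies one of many possible corrections: a fixed-gain blend between the path-integrated location and the recalled one. The function `correct_location` and its `gain` parameter are hypothetical and are not part of the TEM-t paper or of the proposal above, which leaves this step open.

```python
import math

def attention(query, keys, values):
    """Single-query scaled dot-product attention (softmax-weighted values)."""
    d = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d) for key in keys]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [sum((e / total) * v[i] for e, v in zip(exps, values))
            for i in range(len(values[0]))]

def correct_location(l_integrated, k_feedback, gain=0.5):
    """Hypothetical correction: blend the path-integrated location with the
    location recalled from the observed feature."""
    return [(1 - gain) * l + gain * k for l, k in zip(l_integrated, k_feedback)]

F_prime = [[10.0, 0.0], [0.0, 10.0]]  # past enriched features (keys)
L_prime = [[1.0, 0.0], [0.0, 1.0]]    # past enriched locations (values)

f_obs = [10.0, 0.0]                     # enriched feature actually observed
k = attention(f_obs, F_prime, L_prime)  # location recalled from that feature
l_drifted = [0.6, 0.4]                  # path-integrator estimate with drift
l_fixed = correct_location(l_drifted, k, gain=0.5)
```

Here the observed feature recalls a location close to [1, 0], so the drifted estimate [0.6, 0.4] is pulled halfway back towards it. A learned gate or feeding k_{t+1} in as an extra input to the integrator would be equally valid alternatives.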

One important property of those two output heads is that the location vectors, which carry information on the topology of a space, and the feature vectors, which carry information on its content, don’t directly share information. This means that (hopefully), if the model sees a rotating ball, it can build a model of the rotating ball and reuse that model for any rotating ball, even if the colour or texture doesn’t match the rotating balls that it has already seen.

Object prediction head
In the Thousand Brains Theory, the cortical column outputs a guess of what object it is seeing at time t (noted O_t). Here O_t is a dense vector and not a strict classification. For instance, the same object in different colours could be represented by the same vector, differing only along some colour dimension. I am not sure exactly how this would be done, but I think it would be some sort of calculation on F' or L'.

Here are some possible paths to do that:
O_t = Attention(o_q, L', F')
Here o_q is a learned object query. It could also be fed back into the feature prediction head to help with predicting the next feature, as this would give it more context of the object seen. You could also feed O_t into the voting layer (I explain that later) and only after give it back to the feature prediction head, as this would give context of the entire scene.

Alternatively, you could reverse it and have the object being mainly dictated by its locations (or rather its topology):
O_t = Attention(o_q, F', L')

You could also have both, which might get fairly similar to the following.

Another path would be to have two learned object vectors that are passed through the same transformer layers as those that give L' and F'. They would act as embedding vectors, giving an embedding of the overall topology of the object and an embedding of the overall content of the object, respectively. They could both be put into the feature prediction head and the location feedback head before or after the voting layer. On the one hand, I feel like having two object representations in such a way might stray a bit too far from the Thousand Brains Theory, and there would still need to be a way for those representations to be merged into a coherent whole. On the other hand, I feel like it is pretty intuitive to imagine that the brain has a very complete idea of the topology of its environment that can be separated from the features of the environment.

Voting layer

While I am not sure exactly what the best way is to compute O_t, the object guess from one CC-t, all of the different CC-t need to output their own object guess. We might apply some positional encoding to the object guesses. This positional encoding does not represent a position in the object’s reference frame, but just information on where the CC-t are relative to each other. It seems logical for this information to be available to the model. All of those different output guesses are passed through one or several transformer layers that act as the equivalent of voting in the Thousand Brains Theory. This means that the different guesses can influence each other and arrive at more insightful conclusions about what object they are seeing. One thing to note is that it is entirely possible for several CC-t to be seeing different objects and to agree that they are seeing different objects. This seems pretty intuitive to me, as I can comfortably see several elements within my field of view and understand that they are different. It is also completely possible for CC-t of different modalities to vote with each other.
Once all the object guesses have passed through one or several transformer layers, they are fed back as inputs to the next layer of CC-t.
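A minimal sketch of one voting pass, assuming plain self-attention over the object guesses of all columns. The function name `vote`, the additive positional tag, and the use of identity projections are simplifying assumptions; real transformer layers would add learned projections, residual connections, and a feedforward block.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def vote(guesses, positions):
    """One self-attention pass over the object guesses of all columns.

    Each guess is tagged with its column's positional encoding (added here),
    then replaced by an attention-weighted mix of all tagged guesses.
    """
    tagged = [[g + p for g, p in zip(guess, pos)]
              for guess, pos in zip(guesses, positions)]
    d = len(tagged[0])
    updated = []
    for q in tagged:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in tagged]
        weights = softmax(scores)
        updated.append([sum(w * v[i] for w, v in zip(weights, tagged))
                        for i in range(d)])
    return updated

# Three columns: two are confident about the same object, one is unsure.
guesses = [[2.0, 0.0], [2.0, 0.0], [0.5, 0.5]]
positions = [[0.0, 0.0], [0.1, 0.0], [0.2, 0.0]]
voted = vote(guesses, positions)
```

In this toy case the unsure third column is pulled towards the guess shared by the other two, which is the kind of mutual influence the voting layer is meant to provide.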

Limitations

There are still some limitations and things that are missing, like the following two elements:

  • How/where are the action outputs generated and trained?

  • There should be a mechanism that detects when the CC-t has moved from one object to the next, which would empty all the feature location pairs stored in memory, prevent overflowing the context window and keep the object guess stable.
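For the second point, one possible mechanism (purely an assumption, not something proposed above or taken from the TEM-t paper) is to treat a sudden jump in feature-prediction error as the signal that the column has moved to a new object. The class `ColumnMemory` and its parameters `error_threshold` and `max_pairs` are hypothetical names used only for this sketch.

```python
class ColumnMemory:
    """Stores (location, feature) pairs for the object currently explored."""

    def __init__(self, error_threshold=1.0, max_pairs=4):
        self.error_threshold = error_threshold
        self.max_pairs = max_pairs
        self.pairs = []

    def update(self, location, feature, predicted_feature):
        """Store an observation; return True if the memory was reset first."""
        # Euclidean distance between the observed and predicted feature.
        error = sum((f - p) ** 2 for f, p in zip(feature, predicted_feature)) ** 0.5
        reset = error > self.error_threshold
        if reset:
            self.pairs = []  # assume a new object: drop the old pairs
        self.pairs.append((location, feature))
        self.pairs = self.pairs[-self.max_pairs:]  # bound the context length
        return reset

mem = ColumnMemory(error_threshold=1.0, max_pairs=4)
r1 = mem.update([0.0, 0.0], [1.0, 0.0], [0.9, 0.1])  # small error: keep memory
r2 = mem.update([1.0, 0.0], [1.0, 0.0], [1.0, 0.0])  # perfect prediction
r3 = mem.update([2.0, 0.0], [0.0, 5.0], [1.0, 0.0])  # large error: reset
```

A fixed threshold is the crudest option; a learned boundary detector would likely be needed in practice, and this sketch says nothing about how action outputs are generated.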

Sounds interesting. The insight from the paper you shared that transformers replicate place cells and grid cell firing patterns is very intriguing and pretty surprising. It makes me wanna dig deeper.

Well, I did not come to offer any help with your main questions, but this post made me remember the existence of this paper that discusses the Tolman-Eichenbaum Machine from the point of view of Marcus Lewis, who was at Numenta at the time. He discusses the limitations of the TEM architecture and proposes to build spatial maps as an arrangement of sparse environment parts.

I can only imagine that what Marcus Lewis proposed might be quite aligned with the vision of the Thousand Brains team, but I’ll let them speak for themselves.

After thinking about it a bit, I think the most promising approach would be to use the location feedback directly as the object guess. This means that the model is primarily learning about objects through their shapes and how one part of the object connects to another. This would answer the question of how the model learns to recognise objects without being explicitly told, as the loss function used to train the model could depend entirely on how well the next feature is predicted. Intuitively, the thing that matters most when recognising something is not its colour (the features), but the overall shape of the object. You can recognise, in one shot, a black-and-white picture of something you have only ever seen in colour.

Edit: Also, while I do not think that the brain uses literal transformers the way they are run on a GPU, transformers are a form of Hopfield network/associative memory, and some forms of Hopfield networks are (I think) biologically plausible.

My two cents: The main problem I see with this approach is the reliance on back-propagation to train network weights. Backprop is the nemesis of continual learning and biological plausibility.

I want to use backpropagation for now because it is easier and faster. Once I get something that works with backprop, I can try switching to predictive coding for more biological plausibility.

As a side note, I have found two very recent papers that are related to what I want to do, namely the idea of disentangling the features from the topology of the object.

They are mostly written by the same authors, and one of the papers is explicitly inspired by the hippocampus. Maybe this should get its own thread?

Hey @Alexis_Balestra , thanks for your interest in this. Just adding to @AgentRev 's point, back-prop is problematic for a variety of reasons, and it can be difficult to move away from it once it becomes part of the architecture. One of the most interesting things about the original TEM paper is that part of the network used Hebbian learning. Note however that, in between episodes (different combinations of local objects in a grid-world), these Hebbian weights were reset. There was thus no sense of the network learning long-term, stable representations of objects-bound-to-locations, or using such representations to perform inference in the future. This is similar to in-context learning in transformers, which does not preserve information for future use.

The only long-term learning was therefore the learning mediated by back-prop to help the network develop a way to model space.

Finally, it’s worth noting that even Predictive Coding versions of back-prop suffer from the same issues of catastrophic forgetting and sample inefficiency. PC back-prop is essentially a mathematical reformulation, but the underlying algorithm is the same.

In other words, it’s not clear to me that a TEM architecture based on deep learning (whether transformer-based or otherwise) can address the open problems of rapidly learning structured representations of objects that are robust to continual learning. I don’t want to discourage you, but hopefully that is helpful feedback.

If you’re interested in the complexities of learning, and how this could be improved in Monty, you might be interested to check out the Future Work item we have here: Test Grid Object Models for Unsupervised Learning

At first, I was very optimistic about Monty (hence why I am here), but ultimately it failed to convince me. I really like Monty and the work of the Thousand Brains Project, so I hope this is not too harsh a criticism, but to me, the main problem is that Monty is made to recognise 3d objects and nothing else. What I want to know is how the brain can do general reasoning; the specifics of how the brain recognises a 3d object are not really interesting in comparison. One of the main ideas behind the choice to train Monty to recognise 3d objects is that once Monty can recognise 3d objects, the same learning principles can be reused for any task.

However, there are some issues with that approach. If a model is capable of general learning, then it can recognise 3d objects, but the converse isn’t true. If a model can recognise 3d objects, there is nothing that says that it can do general learning. This means that how good Monty is at learning 3d objects doesn’t really help much in determining how good Monty is as a general learner, and this is what matters in my opinion. If you compare Monty to the Tolman-Eichenbaum Machine, both contain elements that are supposed to represent grid cells. In Monty, the grid cells are a path integrator that cannot do anything but path integration. However, in the TEM, the corresponding element will learn to act as a path integrator if the task requires it, but this is not something that is hardcoded. The general idea that the grid cells can learn to adopt different behaviours depending on the task at hand seems much more plausible to me. Of course, the brain doesn’t use backpropagation like the TEM, but I don’t think that’s what matters the most here.

So really, I think Monty should be made to do and be measured on a wide variety of tasks. If it can’t do that, then it cannot generalise.

About the two things that you talked about, continuous learning and sample efficiency. To me, the way to get sample efficiency, or at least something that looks like sample efficiency, is to have a model that can separate content/features and structure/location, and that can do compositionality.
If you can separate the content and structure like how it is done in the HPC-MEC world model paper, the DiLA paper, and with a Tolman Eichenbaum Machine, then you can generalise. If you learn how a ball is rotating, you can reuse the learned structure for any rotating object with minimal things to relearn, like in the HPC-MEC world model paper and the DiLA paper. You get the ability to generalise. If you learn how to navigate a 2d plane, you can reuse that knowledge in much more abstract settings, for instance, controlling the length of the neck or the feet of a bird. It is not that the brain can learn a completely new concept in just a few shots, it is that the brain can reuse previously learned knowledge on new tasks.
If you add compositionality to it, then what the brain is actually doing is dividing each task into a subset of much simpler tasks that it either already knows or is very close to something it previously learned. Even in the worst case, where the task is completely new, this completely new task shouldn’t be too complex once it has been divided.

The problem of catastrophic forgetting is not really what I am trying to solve right now. I think a model that can do this sort of generalisation would be, by default, a bit more robust to catastrophic forgetting, because many concepts are common sub-problems. The closest implementations to this idea of dividing structure and content have been implemented using backpropagation, so it seems easier to use it for now, even if it is not perfectly biologically plausible. There are methods to mitigate catastrophic forgetting in deep learning, and while they are not perfect, it is not like biological systems never forget anything either.

Of course, all of that is just my opinion, you are free to correct me where I am wrong.

It’s reasonable to be skeptical of whether the path we’re following with Monty will be able to learn in more general settings, such as abstract concepts. As you say, we have only demonstrated learning in 3D objects (see also our more recent work on 2D objects). I myself will only be convinced once we’ve actually demonstrated this kind of generalization. This is a long-term research goal however, and so it will be a while before we can observe this. In the meantime, we are being guided by principles of biological plausibility (local learning, cortical anatomy, etc.). The fact that properties such as shape robustness, object-centric representations, symmetry recognition, and continual learning emerge naturally in our approach is, in my view, encouraging. But again, it certainly isn’t proof that such generalization will emerge.

If the question is, “should I pursue this research idea further?”, then I would again just try to dissuade you from using back-prop. The problems with back-prop are numerous, and could themselves form a lengthy discussion. I think the question you are interested in - how do we learn to map arbitrary spaces, not just 3D Cartesian coordinates - is an interesting and important one. Spending time on this sub-problem, and using methods that don’t rely on back-prop, would be a really interesting effort. If you’re curious, we could start a separate thread discussing how this might happen in the brain. Of course, that isn’t to stop you from pursuing the transformer-based architecture you described if that is something you want to work on.

I appreciate the advice, but I need to understand why backpropagation and predictive coding are so bad so I can decide whether to rethink my approach and avoid them completely.

Predictive coding seems biologically plausible to me. There is quite a bit of research on it, and it seems coherent enough. Since predictive coding and backpropagation tend to converge on the same solution, it seems that even if the micro elements (the exact ways the neurons work) are not identical, the macro elements should be similar enough to yield interesting parallels. As long as the brain uses some form of universal function approximator at the micro level, the macro level architecture should converge to something that is “similar enough”.

There have been plenty of advances in machine learning that benefited from taking inspiration from the brain without necessarily reproducing Hebbian learning rules.

My goal with all of this would be to have a model that can perform well on ARC-AGI 3. So while I am very interested in how the brain can achieve general intelligence, I am mostly interested in how general intelligence can be achieved as an implementable solution, and for that, I am taking inspiration from many things, including the Thousand Brains Theory. Although honestly, I am not sure if I just want to do something that works on ARC-AGI 3, or if I want to make a new architecture that is better at general learning.

My first idea to approach ARC-AGI 3 was to take inspiration from Monty and HTM, but the TEM architecture answered a lot of my questions and problems, so I deviated from Hebbian learning rules.

The HPC-MEC and DiLA implementations manage to implement this idea of structure/content separation, so I am going to build on them from now on, I think.

Sure, so if I could recommend some resources on this question, you might be interested in:

  1. The Background discussion in our recent paper on Thousand Brains systems.

  2. This discussion we had in the “RL debate series”, where I’ve linked to the relevant time stamp here.

You can read more about the various topics that are alluded to, but just to highlight one example: catastrophic forgetting is a fundamental problem in deep learning related to its assumptions about data distributions. While there have been efforts to mitigate it, none of these work well, which is why it remains an issue even for frontier labs. If approaches like Elastic Weight Consolidation actually worked well, then they would be used broadly in state-of-the-art DL systems, and continual training would be widely observed. The fact that this has not happened isn’t too surprising, given the core assumption within the formulation of back-prop of IID data distributions.
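For readers unfamiliar with Elastic Weight Consolidation, its core is a quadratic penalty that anchors each parameter to its value after the previous task, weighted by an estimate of that parameter's Fisher information. The sketch below shows only that penalty term; the function name `ewc_penalty` and the toy numbers are illustrative, and estimating the Fisher values is not shown.

```python
def ewc_penalty(params, anchor_params, fisher, lam):
    """EWC regulariser: (lam / 2) * sum_i F_i * (theta_i - theta_star_i) ** 2.

    params: current parameters; anchor_params: parameters after the old task;
    fisher: per-parameter Fisher information estimates; lam: penalty strength.
    """
    return 0.5 * lam * sum(
        f * (p - a) ** 2 for p, a, f in zip(params, anchor_params, fisher)
    )

# Toy example: the first parameter mattered for the old task (high Fisher
# value) and has moved, so it dominates the penalty.
penalty = ewc_penalty(
    params=[1.0, 2.0],
    anchor_params=[0.0, 2.0],
    fisher=[4.0, 1.0],
    lam=2.0,
)
```

This term is added to the loss of the new task, so parameters that were important for the old task are discouraged from moving.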

Hope you find that useful, and good luck with the ARC-AGI 3 project.

Thanks, that’s really interesting, although I disagree with some elements.

I don’t think the reason we don’t see much continuous learning with backpropagation, or other deep learning approaches, is that it is inherently impossible. I think it is a combination of several things.

Most common applications and tasks that deep learning systems perform can be done “good enough” without continuous learning. It would add a lot of complexity to add continuous learning to a big modern model, and that is probably not a risk most companies are willing to take, when they could use their compute to scale.

I think this is a mistake, and that big AI companies should explore more novel architectures, but I don’t really get a say in this, sadly. If I remember correctly, Demis Hassabis was saying in an interview that he doesn’t think that continuous learning is necessary for artificial general intelligence, and that he was betting on scaling alone. By learning enough tasks, the models would eventually be good at everything. Which, again, I think is a mistake.

If a task requires more precise knowledge/training, then fine-tuning/retraining exists, and depending on the model, there are things like LoRA, RAG, and in-context learning. Those aren’t real continuous learning, but they are good enough that, given the added cost and complexity, it is hard for true continuous learning to emerge as a real competitor.

So for large-scale models it is too much cost and risk, and for smaller-scale models, complete retraining, fine-tuning, and other methods are available. All of this would explain why deep learning is currently struggling with continuous learning.

Of course, with all of that said, Hebbian learning is probably just better at continuous learning. I just wanted to discuss why the deep learning community isn’t more focused on continuous learning.

When it comes to biological plausibility, it is very clear that backpropagation is not biologically plausible, but predictive coding seems to have much stronger evidence. Even if predictive coding didn’t work, I think the overall architectures could converge to similar results even without the same update rule at the neuronal level.

When it comes to sample inefficiency, poor generalisation, and brittle representations, I think it is an architectural problem that comes from having content and structure intertwined instead of disentangled.

When it comes to object impermanence and six-fingered hands, it seems like it is also a problem in biological systems, at least at some levels. When dreaming, which could be your brain generating sensory information without corrections coming from the real world, it is very common for hands to not have the right number of fingers. In lucid dreaming, a good way to check if you are dreaming or not is to count your fingers. Object permanence faces the same problem. Now, of course, my higher-level reasoning learned object permanence and that hands have 5 fingers, but that would prove that the brain found an architectural solution. If the solution came simply from implementing Hebbian learning, then no part of the brain would generate six-fingered hands or struggle with object permanence.