Introduction
I have been looking into the Thousand Brains Theory and trying to understand how it could help the field of AI. One interesting thing I found in my research is the TEM (Tolman-Eichenbaum Machine), which is an attempt at replicating the function of the hippocampus using machine learning. One of the ideas of the Thousand Brains Theory is that the cortical columns of the neocortex are doing essentially the same thing as the hippocampus, so the TEM might give insight into how to replicate cortical columns using machine learning.
Interestingly, the TEM can also be implemented with a transformer: the TEM-t. Using transformers has a lot of advantages, since a lot of time and effort has already gone into the research and tooling needed to scale them, and transformer-based architectures are extremely powerful. So I thought about a possible transformer-based architecture that would build on the existing TEM-t and add other elements analogous to the Thousand Brains Theory. While transformers themselves aren’t biologically plausible, I think it is possible that the brain implements something related.
This idea is still very recent and incomplete, and I haven’t had the time to finish a simple implementation of it yet (I am working on this in my free time on my laptop), but I think it might be interesting and give some ideas/insight into the Thousand Brains Theory.
I am naming this implementation idea Turingrade V0.1, in reference to Alan Turing, and a specific sci-fi setting.
Contextualised feature vectors and location vectors
This architecture is composed as follows:
A set of CC-t (cortical column transformers) each receive a set of features and motor efference copies as they explore an environment. All the CC-t that are on the same level and learn the same modality should probably share their parameters. While this is not biologically plausible, it would allow for faster training and smaller model sizes.
We are going to look into how one CC-t works with its set of feature vectors F, with the feature vector at time t being f_t, and its set of location vectors L, with the location vector at time t being l_t.
You apply a certain number of classical transformer self-attention and feedforward layers to the set of feature vectors (F) and to the set of location vectors (L), but you keep the two separate: the feature vector set and the location vector set each have their own transformer layers, and they don’t mix. The location vectors are calculated one by one by a path integrator using the motor efference copy (an action vector). In the original TEM-t architecture, the path integrator was an RNN, but today you could probably replace it with a more modern alternative like a Mamba architecture, which did not exist at the time the TEM-t paper was released. The path integrator also receives some feedback, as we will see later.
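To make the path integrator a bit more concrete, here is a minimal sketch of how location vectors could be produced one by one from action vectors. It assumes the simplest possible recurrent update, l_{t+1} = tanh(W l_t + U a_t). The weight matrices W and U, the function names and the toy values are all hypothetical, and this is not the exact update used in the TEM-t paper.

```python
import math

def path_integrator_step(l_t, a_t, W, U):
    """One recurrent update: l_{t+1} = tanh(W l_t + U a_t) (assumed form)."""
    l_next = []
    for w_row, u_row in zip(W, U):
        s = sum(w * x for w, x in zip(w_row, l_t))
        s += sum(u * a for u, a in zip(u_row, a_t))
        l_next.append(math.tanh(s))
    return l_next

def integrate_path(l_0, actions, W, U):
    """Roll the integrator over a sequence of action vectors.

    Returns the list of location vectors [l_0, l_1, ..., l_T].
    """
    locations = [l_0]
    for a_t in actions:
        locations.append(path_integrator_step(locations[-1], a_t, W, U))
    return locations

# Toy weights (hypothetical; in a real model they would be learned).
W = [[0.9, 0.0], [0.0, 0.9]]
U = [[0.5, 0.0], [0.0, 0.5]]

# Three motor efference copies: right, up, left.
actions = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
locations = integrate_path([0.0, 0.0], actions, W, U)
```

The resulting list plays the role of the raw set L, which would then go through the location transformer layers to give L'.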
Here, I use the term “location vector”, but in reality those vectors represent the overall “topology” of the space. They could convey information like speed, acceleration or shape, and they aren’t merely 2D or 3D space coordinates. As they pass through the different transformer layers, they get contextualised, and the information they encode becomes enriched with learned context. For instance, seeing how a location changes over successive time steps might allow the transformer layers to add speed information to the location vectors. You could also imagine that a rough shape would be enriched with an idea of what object possesses that shape. However, you should note that these examples are just here to give an intuition of what could be happening.
The feature vectors have their own transformer layers. Here, you could imagine that the “orange” and “black” features might get enriched with the “tiger” idea.
Those transformer layers could potentially be scaled to large sizes.
At the end, you get a set of enriched location vectors L' and enriched feature vectors F'.
Cortical Layer Outputs
These enriched vectors, L' and F', are used to produce three outputs:
Feature prediction head
Let’s imagine that the model is at a time t. With the path integrator and the motor efference copy, we obtain l_{t+1} and then, after the location transformer layers, l'_{t+1}. Like in the original TEM-t paper, we can obtain a prediction of f_{t+1}; let’s call it f_{t+1,p}. We can have f_{t+1,p} = Attention(l'_{t+1}, L', F'), reading the arguments as (query, keys, values), with L' and F' being all the past enriched locations and features, l'_t and f'_t. This output head is trained to predict the next feature that the model will see.
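As an illustration of the feature prediction head, here is a minimal sketch of single-query scaled dot-product attention, with the enriched location as the query, the past enriched locations as keys and the past enriched features as values. It leaves out the learned query/key/value projections and multiple heads that a real transformer head would have, and all the vectors are made-up toy values.

```python
import math

def attention(query, keys, values):
    """Single-query scaled dot-product attention: softmax(q.K / sqrt(d)) V."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    # Numerically stable softmax over the scores.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]

# Hypothetical memory of three past steps.
L_prime = [[4.0, 0.0], [0.0, 4.0], [-4.0, 0.0]]                   # enriched locations
F_prime = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]     # enriched features

# The path integrator says we are about to revisit the second location.
l_next = [0.0, 4.0]

# f_{t+1,p} = Attention(l'_{t+1}, L', F')
f_pred = attention(l_next, L_prime, F_prime)
```

Because the query matches the second stored location much better than the others, the prediction comes out very close to the second stored feature. During training, f_pred would be compared with the feature actually observed at t+1.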
Location feedback head
Let’s imagine that we now arrive at time t+1 and receive the feature f_{t+1}. If the model were in a purely generative mode here, we would simply reuse the output of the feature prediction head; this could be an analogy with what happens when the brain is dreaming. We want to calculate feedback for the path integrator in order to prevent drift and to give it more complex information on how the locations are evolving. The motor efference copy alone cannot tell whether the model is in a moving environment, like on a skateboard or falling, so the path integrator needs some mechanism to get feedback on that.
We could use another transformer output head that does the following calculation:
k_{t+1} = Attention(f'_{t+1},F',L')
This can be seen as a sort of “opposite” of the feature prediction head: the current enriched feature is the query, the past features are the keys, and the past locations are the values. Exactly how this feedback is given to the path integrator, I am not sure yet. There could be several ways to do it, and it might not matter exactly how this is done.
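Here is a minimal sketch of the location feedback head, together with one possible way of feeding k_{t+1} back to the path integrator. The attention part follows the formula above. The correction step (a fixed blend between the integrated location and the feedback) is purely an assumption for illustration, one option among many, and the function name correct_location and the gate value are hypothetical.

```python
import math

def attention(query, keys, values):
    """Single-query scaled dot-product attention: softmax(q.K / sqrt(d)) V."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]

def correct_location(l_integrated, k_feedback, gate=0.5):
    """Hypothetical correction: blend the integrated location with the feedback."""
    return [(1.0 - gate) * l + gate * k for l, k in zip(l_integrated, k_feedback)]

# Hypothetical memory: two past features and the locations where they were seen.
F_prime = [[3.0, 0.0], [0.0, 3.0]]
L_prime = [[1.0, 1.0], [-1.0, -1.0]]

# The feature observed at t+1 matches the first stored feature.
f_obs = [3.0, 0.0]

# k_{t+1} = Attention(f'_{t+1}, F', L')
k = attention(f_obs, F_prime, L_prime)

# The path integrator has drifted away from the remembered location [1, 1].
l_drifted = [0.6, 1.4]
l_corrected = correct_location(l_drifted, k, gate=0.5)
```

With gate=0.5, the corrected vector sits halfway between the drifted estimate and the location recalled from memory. In a trained model, the gate would more plausibly be learned, or the feedback could be given as an extra input to the path integrator instead.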
One thing that is important with those two output heads is that the location vectors, which carry information on the topology of a space, and the feature vectors, which carry information on its content, don’t directly share information. This means that (hopefully), if the model sees a rotating ball, it can build a model of the rotating ball and reuse that model for any rotating ball, even if the colour or texture doesn’t match the rotating balls that it has already seen.
Object prediction head
In the Thousand Brains Theory, the cortical column outputs some guess of what object it is seeing at time t (denoted O_t). Here O_t is a dense vector and not a strict classification. For instance, the same object in different colours could be represented by the same vector apart from a difference along some colour dimension. I am not sure exactly how this would be done, but I think it would be some sort of calculation on F' or L'.
Here are some possible paths to do that:
O_t = Attention(o_q, L', F')
Here o_q is a learned object query. O_t could also be fed back into the feature prediction head to help with predicting the next feature, as this would give it more context about the object seen. You could also feed O_t into the voting layer (which I explain later) and only afterwards give it back to the feature prediction head, as this would give it context about the entire scene.
Alternatively, you could reverse it and have the object being mainly dictated by its locations (or rather its topology):
O_t = Attention(o_q, F', L')
You could also have both, which might get fairly similar to the following.
Another path would be to have two learned object vectors that are passed through the same transformer layers as those that give L' and F'. They would act as embedding vectors, giving an embedding of the overall topology of the object and an embedding of the overall content of the object, respectively. They could both be put into the feature prediction head and the location feedback head before or after the voting layer. On the one hand, I feel like having two object representations in such a way might stray a bit too far from the Thousand Brains Theory, and there would still need to be a way for those representations to be merged into a coherent whole. On the other hand, I feel like it is pretty intuitive to imagine that the brain has a very complete idea of the topology of its environment that can be separated from the features of the environment.
Voting layer
While I am not sure exactly what the best way is to compute O_t, the object guess from one CC-t, all of the different CC-t need to output their own object guess. We might apply some positional encoding to the object guesses. This positional encoding does not represent a position in the object’s reference frame, but just information on where the CC-t are relative to each other. It seems logical for this information to be available to the model. All of those different object guesses are passed through one or several transformer layers that act as the equivalent of voting in the Thousand Brains Theory. This means that the different guesses can influence each other and arrive at more insightful conclusions about what object they are seeing. One thing to note is that it is entirely possible for several CC-t to be seeing different objects and to agree that they are seeing different objects. This seems pretty intuitive to me, as I can comfortably see several elements within my field of view and understand that they are different. It is also completely possible for CC-t of different modalities to vote with each other.
Once all the object guesses have passed through one or several transformer layers, they are fed back as inputs to the next layer of CC-t.
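To give an intuition of how such a voting layer could behave, here is a minimal sketch where voting is reduced to a single self-attention pass over the object guesses. It is a strong simplification: there are no learned projections, no feedforward layer and no positional encoding of the CC-t, and the guesses are made-up 2D vectors.

```python
import math

def vote(guesses):
    """One self-attention pass: every guess attends to all guesses, itself included."""
    scale = math.sqrt(len(guesses[0]))
    voted = []
    for query in guesses:
        scores = [sum(q * k for q, k in zip(query, key)) / scale for key in guesses]
        top = max(scores)
        exps = [math.exp(s - top) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        voted.append([
            sum(w * g[i] for w, g in zip(weights, guesses))
            for i in range(len(query))
        ])
    return voted

# Case 1: two confident CC-t agree, a third one is unsure.
unsure_case = [[2.0, 0.0], [2.0, 0.2], [0.5, 0.5]]
voted_unsure = vote(unsure_case)

# Case 2: two CC-t see one object, a third confidently sees a different one.
distinct_case = [[4.0, 0.0], [4.0, 0.0], [0.0, 4.0]]
voted_distinct = vote(distinct_case)
```

In the first case, the unsure guess is pulled towards the two confident ones. In the second case, the third guess barely changes, which matches the idea that several CC-t can see different objects and agree that they are different.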
Limitations
There are still some limitations and things that are missing, like the following two elements:
- How/where are the action outputs generated and trained?
- There should be a mechanism that detects when the CC-t has moved from one object to the next. It would empty all the feature-location pairs stored in memory, which would prevent overflowing the context window and keep the object guess stable.