How compositional models will work in Monty

Lucretius · May 28, 2026, 8:32am

Hi all, my apologies if these questions have already been answered clearly elsewhere. I have no formal background in this; my background is physics, so my approach to understanding may not be optimal. Any advice would be appreciated.

I’m trying to understand how compositional models would work within Monty. I have briefly read the hierarchy/heterarchy paper, but I quickly realised it outlines how this should work mainly in terms of the neuroscience, and translating that to Monty requires a fair bit of thinking that I’m hoping to skip by having someone just tell me the answers to my questions.

Say we have a system with 2 LMs and a single SM; I assume both LMs will be hooked up to the SM. My confusion is about how the hierarchy would work between these two LMs. Say we are given a compositional object, for example a box with a knob. Would it be the case that the LM lower in the hierarchy has seen the components by themselves and thus has models of just the box and just the knob? Then, when given the composed object, as it’s learning, model hypotheses for two objects spike, and perhaps there’s some mechanism by which such an occurrence is communicated to the other LM via CMP. Then, as the composed object is being scanned, the higher LM has some way to recognise that there can be two object IDs associated with a third object ID for the whole object. Perhaps it learns a model of one of the objects with the other model as a feature. So one question is: can the features in the CMP include object IDs? I’m sure this picture is quite wrong, as it’s probably possible to learn compositionality without having seen the components by themselves.

So my confusion is this: with two LMs operating in a hierarchy and also in a parallel fashion, what do the two LMs individually learn models of when given a composed object? Does the lower LM learn models of only the components, or of the components and the whole? Does the second LM also learn the exact same models, or only the composed model? And since both LMs are hooked up to the motor system and see the same thing, which LM decides the principled movements for building the models?

rmounir · June 1, 2026, 2:54pm

Hi @Lucretius, these are great questions.

Say we have a system with 2 LM’s and a single SM, I assume both LM’s will be hooked up to a SM.

Yes, both LMs will be hooked up to an SM. A higher-level column gets direct input from the thalamus as well. In Monty this translates to another SM with a larger receptive field connected directly to the higher-level LM (without having to pass through the lower-level LM). This higher-level LM now gets input from the lower-level LM hypothesizing about the child object (if present) and larger-receptive-field input from the sensor at the same time. Note that both SMs’ receptive fields are co-located.
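A minimal sketch of this wiring, written as plain Python data rather than Monty's actual configuration format. The module names (`sm_small`, `sm_large`, `lm_low`, `lm_high`) and the `inputs_to` helper are hypothetical and only illustrate who sends bottom-up input to whom.

```python
# Hypothetical wiring for a two-level setup. The names are illustrative
# only and this is not Monty's real config schema.
connections = [
    ("sm_small", "lm_low"),   # small receptive field feeds the lower-level LM
    ("sm_large", "lm_high"),  # larger receptive field feeds the higher-level LM directly
    ("lm_low", "lm_high"),    # lower-level LM sends its child-object hypothesis upward
]


def inputs_to(module: str) -> list[str]:
    """Return the modules that send bottom-up input to `module`."""
    return [src for src, dst in connections if dst == module]


print(inputs_to("lm_low"))   # ['sm_small']
print(inputs_to("lm_high"))  # ['sm_large', 'lm_low']
```

The point of the sketch is only that the higher-level LM has two input channels, one of which bypasses the lower-level LM.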

a model hypothesis for two objects spike and perhaps there’s some mechanism by which when there’s such an occurrence that this is communicated to the other LM via CMP

As the agent explores the compositional object, the lower-level LM switches between child objects (e.g., box or knob). It never spikes for two child objects at the same time. As the compositional object (e.g., box with a knob) is being learned in the higher-level LM, associations are formed that effectively store a child object ID as a feature at a location in the higher-level reference frame.

one question is can the features in the CMP include object IDs

Yes, an object ID is a non-morphological feature stored on the CMP message constructed by an LM. This CMP message also includes the pose of the child object as a morphological feature.
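To make that concrete, here is a rough sketch of what such a message could look like and how a higher-level LM might store it. The `CMPMessage` dataclass, its field names, and the `parent_model` dict are assumptions for illustration, not Monty's actual classes. For simplicity the sketch also assumes the location is already expressed in the parent's reference frame.

```python
from dataclasses import dataclass, field


@dataclass
class CMPMessage:
    """Hypothetical stand-in for a CMP message; field names are assumptions."""
    sender_id: str
    location: tuple[float, float, float]
    morphological_features: dict = field(default_factory=dict)
    non_morphological_features: dict = field(default_factory=dict)
    confidence: float = 0.0


# What a lower-level LM might send after recognising the knob.
msg = CMPMessage(
    sender_id="lm_low",
    location=(0.10, 0.02, 0.05),
    morphological_features={"pose": (0.0, 90.0, 0.0)},  # child pose, e.g. Euler angles in degrees
    non_morphological_features={"object_id": "knob"},
    confidence=0.8,
)

# The higher-level LM stores the incoming features at that location in its own model.
parent_model: dict[tuple[float, float, float], dict] = {}
parent_model[msg.location] = {
    **msg.morphological_features,
    **msg.non_morphological_features,
}
```

After this, `parent_model` holds the child's object ID and pose at one location of the parent model, which is the "child object ID as a feature at a location" idea in miniature.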

it’s probably possible to learn compositionality without having seen the components by themselves

A higher-level LM can learn a low-detail model of the compositional object on its own, since it also observes direct input from the sensor. The idea here is that if the lower-level LM has learned and inferred child objects, these are associated as additional features at locations in the higher-level LM so that it can infer the compositional object more efficiently. In Monty, the evidence from both channels is added together, making the LM infer the object faster.
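A toy sketch of the "evidence from both channels is added together" idea. The hypothesis names, the numbers, and the `combine` helper are made up for illustration; Monty's real evidence update is more involved than a plain sum over dicts.

```python
# Hypothetical evidence scores for two parent-object hypotheses, one dict per input channel.
sensor_evidence = {"box_with_knob": 1.0, "mug": 0.5}    # coarse, direct SM input
child_id_evidence = {"box_with_knob": 2.0, "mug": 0.0}  # object-ID feature from the lower-level LM


def combine(*channels: dict[str, float]) -> dict[str, float]:
    """Sum evidence per hypothesis across all available channels."""
    total: dict[str, float] = {}
    for channel in channels:
        for hypothesis, evidence in channel.items():
            total[hypothesis] = total.get(hypothesis, 0.0) + evidence
    return total


combined = combine(sensor_evidence, child_id_evidence)
best = max(combined, key=combined.get)
```

With only the sensor channel the gap between the two hypotheses is 0.5; with both channels it is 2.5, so the leading hypothesis separates from the rest in fewer steps.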

Does the lower LM learn models of only the components, or the components and the whole, and then does the second LM also learn the exact same models or only the composed model

The lower-level LM learns only the components and the higher-level LM learns the full compositional object. Anticipating a follow-up question here: “what prevents the lower-level LM from learning the full object?” This is an open research question; we don’t have a definitive answer for when an LM stops learning a model and starts learning a new one. One solution we have implemented is constrained object models, which automatically prevent an LM from learning a model that exceeds a certain physical size.
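As a rough illustration of a size constraint, here is a sketch that refuses to extend a model once a new point would be too far from the points already stored. How the actual constrained object models enforce the limit is not described in this thread, so the distance rule, the function names, and the `max_size` parameter are all assumptions.

```python
import math


def within_size_limit(points, new_point, max_size):
    """Return True if `new_point` is within `max_size` of every point already in the model."""
    return all(math.dist(p, new_point) <= max_size for p in points)


def try_extend(points, new_point, max_size):
    """Add the point only if the model would stay within the size limit."""
    if within_size_limit(points, new_point, max_size):
        points.append(new_point)
        return True
    return False


knob_model = [(0.0, 0.0, 0.0)]
accepted = try_extend(knob_model, (0.02, 0.0, 0.0), max_size=0.05)  # still on the knob
rejected = try_extend(knob_model, (0.30, 0.0, 0.0), max_size=0.05)  # far out on the box
```

Under this rule the lower-level LM keeps the knob model small, and the observation out on the box would have to go into a different model.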

My opinion is that child objects are observed much more frequently than the parent objects. By definition, these child objects can appear on many other compositional objects. So there is a statistical component here for detecting the boundaries of what defines a child object. Additionally, there are top-down connections that stabilize and reinforce this division between child and parent objects in a compositional model. The higher-level LM predicts the next feature (object ID) it will see as the sensor moves, and therefore can bias the lower-level LM to switch to the other child object rather than learn to extend the current object.

since both LMs are hooked up to the motor system and see the same thing which LM decides the principled movements for building the models.

In Monty, LMs can propose goal states for the motor system. These goal states have confidence values and the one with the highest confidence wins. The confidence is based on evidence scores of the hypotheses. The motor system can also follow a model-free policy (e.g., saliency-based or curvature following) not informed by any LM. This same process happens any time we have multiple LMs (e.g., voting with 5 LMs at the same level), not only in a hierarchical configuration.
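A small sketch of the "highest confidence wins" rule. The `GoalState` dataclass, its fields, and `select_goal` are hypothetical names for illustration and not Monty's API. The fallback stands in for a model-free policy taking over when no LM proposes anything.

```python
from dataclasses import dataclass


@dataclass
class GoalState:
    """Hypothetical goal-state proposal; field names are assumptions."""
    sender_id: str
    target_location: tuple[float, float, float]
    confidence: float


def select_goal(proposals, fallback=None):
    """Pick the most confident proposal, or `fallback` when no LM proposes anything."""
    if not proposals:
        return fallback  # e.g. hand control to a model-free policy
    return max(proposals, key=lambda g: g.confidence)


proposals = [
    GoalState("lm_low", (0.1, 0.0, 0.2), confidence=0.4),
    GoalState("lm_high", (0.3, 0.1, 0.0), confidence=0.7),
]
winner = select_goal(proposals)
```

Nothing in the selection depends on which level the proposing LM sits at, which matches the point that the same process applies to LMs voting at the same level.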

Hope this helps.

Lucretius · June 6, 2026, 6:33am

Many thanks for your detailed reply; I’ll have a lot to study. I’m currently unable to dedicate much time to it, but I hope to start making some contributions at some point. Compositionality seems key to building systems capable of various kinds of abstraction. I have an idea about representing word embeddings as some kind of object via some transformation; compositionality would then give the ability to do inference about sentences once enough samples have been processed. Partial inference, i.e. giving only part of a sentence, would then essentially allow next-word prediction.

sergioval · July 7, 2026, 12:55am

Great discussion. I’ve also been thinking recently about the kind of representation learning TBP is doing in compositional learning. If I understood it correctly, the key point is that the point cloud of an object is replaced with an ID, and the ID plus the 3D location and pose of the object is passed to the LM up in the hierarchy, right?

What I don’t understand is that an arbitrary ID doesn’t have any semantics. Hence, two LMs can learn different IDs for the same object, and an LM further up in the hierarchy could receive these two IDs instead of just one. I understand the problem remains even when using sparse distributed representations (SDRs), since different LMs could have different SDRs for the same objects, right?

So, I was wondering about alternatives to arbitrary IDs. My first thought was to use word embeddings as IDs, as they have been aligned in embedding space and carry a lot of semantics. Do you think they would be biologically plausible?

One way to implement something similar to word embeddings could be related to neuromodulation, such that the features form a structured representation and carry the semantics of how this object is useful to me. In other words, the features could form a basis of what I need to thrive. For example, they could be related to how an object can be used, like “protection”, “energy regulation (allows me to obtain food)”, “temperature regulation (keeps me warm)”, etc. This way, every object ID would be a point in that space of features.

Is my understanding correct of what is passed to the next LM up in the hierarchy?

What do you think of passing a more semantic ID, and of this ego-thriving features proposal?

Thanks!

rmounir · July 8, 2026, 8:48pm

Hi @sergioval,

Yes, your understanding is correct. The object ID abstracts away the model details before it’s sent up to the higher level. That ID, along with the pose of the child object relative to the parent, gets stored at a 3D location in the higher-level LM.

But that same higher-level LM also receives direct sensory input from an SM with a larger receptive field, and that input is stored at the same 3D location. So from the higher-level LM’s point of view, the object ID coming from the lower-level LM is just another feature. It helps the LM recognize the object faster when it’s available, since evidence accumulates more quickly with multiple input channels.

Whether object IDs embed semantics, and how, is something we debate a lot in our research meetings. It’s an active research area for us, so our thinking here may shift as we work through the theory and how it connects to representing classes of objects. That said, I’ll take a shot at answering, keeping in mind that what follows is my own perspective on the topic and may not fully align with how everyone at TBP thinks about this problem.

In the thousand brains theory, every column runs the same cortical algorithm no matter where it sits in the heterarchy. An object ID arriving at a higher-level LM is treated as a feature input, the same way a lower-level LM treats pose or RGB-D features. The LM recognizes an object by comparing the incoming feature against what it has stored, and it adds evidence in proportion to how well they match. That’s a big part of why Monty generalizes well across sensor noise, lighting changes, and other variability. Even when features don’t match up exactly, partial matches still contribute partial evidence.

If we apply that same logic to object IDs, then to generalize well we probably need a semantic distance measure over object IDs. The distance between two IDs should be small when the two objects are interchangeable for recognizing the parent object. At the moment, Monty does this object ID feature matching as a binary match/no-match classification, but we may start to look at embedding semantics in the future when we work with larger compositional datasets and generalization becomes essential for scaling.
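To contrast the two options, here is a sketch of binary matching next to a graded match over object IDs. The similarity table, its values, and both function names are invented for illustration. Neither function is taken from Monty's code.

```python
def binary_match(observed_id: str, stored_id: str) -> float:
    """Match/no-match: full evidence for identical IDs, none otherwise."""
    return 1.0 if observed_id == stored_id else 0.0


# Hypothetical similarity table: how interchangeable two child objects are
# for recognising the parent. The values are made up for illustration.
SIMILARITY = {
    frozenset({"round_knob", "square_knob"}): 0.75,
    frozenset({"round_knob", "mug"}): 0.0,
}


def graded_match(observed_id: str, stored_id: str) -> float:
    """Partial evidence in proportion to a semantic similarity between IDs."""
    if observed_id == stored_id:
        return 1.0
    return SIMILARITY.get(frozenset({observed_id, stored_id}), 0.0)
```

With binary matching, a box with a square knob gets no credit from a model that stored a round knob. With the graded version it would still contribute 0.75 of the evidence, which is the kind of generalization a semantic distance measure is meant to buy.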

One way to get there is to force objects with similar morphologies to share more SDR bits, so their representations overlap. The EvidenceSDR implementation might be a useful starting point, though it isn’t the full picture, since objects can also be grouped by higher-level context like affordances, not just morphology.
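A tiny sketch of the overlap idea, with SDRs written as sets of active bit indices. The bit patterns and the `sdr_overlap` helper are made up, and this is not the EvidenceSDR implementation; it only shows how shared bits can act as a similarity score.

```python
def sdr_overlap(a: set[int], b: set[int]) -> int:
    """Number of active bits two SDRs share."""
    return len(a & b)


# Hypothetical SDRs given as sets of active bit indices (made-up values).
round_knob = {3, 17, 42, 58, 91}
square_knob = {3, 17, 42, 60, 77}  # similar morphology: shares 3 bits
mug = {5, 22, 48, 63, 99}          # unrelated: shares none
```

Dividing the overlap by the number of active bits would give a score between 0 and 1 that could stand in for the binary match/no-match decision.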

The ego-thriving semantics you describe sounds like another high-level heuristic for grouping object IDs. It reminds me of a Lawrence Barsalou quote that @bryce_bate shared in another thread:

A car, a flyswatter, and a house seem to have nothing in common until you’re being chased by some bees. Suddenly, each belongs to the concept of ‘Things to keep you safe from bees.’

The grouping may depend on context, which makes me think a single fixed distance measure won’t capture it on its own; it may instead rely on changing high-level context in the receiving column.

Hope this helps.

sergioval · July 10, 2026, 4:23pm

Thanks for the excellent explanation!

Is it assumed that absolutely every LM always receives direct sensory input from an SM? Also, does the receptive field have to grow monotonically with the height in the information path? If so, this could be assured in a hierarchy/tree, but how is this ensured in a heterarchy?

Thanks again!

W_Foxalike · July 12, 2026, 5:02am

Hi @sergioval — I’m still learning TBP myself and I’m not on the team, but let me try to answer your questions as I understand them. Happy to be corrected if I get anything wrong.

1. Does every LM necessarily get direct input from an SM?

As I understand it, no. An LM’s main bottom-up input can come either from an SM or from another LM — in the latter case the sending LM passes up the object ID it has recognized, along with its pose, as a feature. There also seem to be skip connections, where a lower-level LM or even an SM can connect directly to a much higher-level LM. As for a higher-level LM additionally receiving input from a co-located SM with a larger receptive field (the setup rmounir described earlier in this thread), to me that reads more like an optional, auxiliary channel — usually coarser and lower-frequency, there to speed up recognition rather than something required.

2. Does the receptive field have to grow monotonically with “height”, and how is that ensured in a heterarchy?

As I understand it, the tricky part is that “height” / “which layer” isn’t really well-defined in a heterarchy. Once you have skip connections, you can’t simply rank LMs by “how many processing steps their input has been through.” The framing I’ve seen is that LMs are grouped more by who they vote with (i.e. whose modeled objects overlap) than by topological depth. So “monotonically increasing with height” feels more like an intuition from tree structures; in a heterarchy there may not be such a global monotonic function at all.

The way I personally tend to think about it: what’s preserved is more of a per-edge relative relationship — along a given compositional (bottom-up) edge, the LM modeling the parent object operates at a larger spatial scale than its child LMs. In other words, receptive-field scale tracks the scale of the object being modeled, not its depth in the pathway.

There’s one doc you might find useful as a reference here: Connecting LMs into a Heterarchy — it covers bottom-up connections and “why heterarchy.”

sergioval · July 16, 2026, 10:44pm

Hi @W_Foxalike, thank you very much! This makes a lot of sense. I think the fact that not every LM receives direct sensory input highlights the importance of the ID label carrying semantic information that enables alignment across different objects.

I am imagining an ID feature space that supports planning through navigation: finding trajectories between objects until reaching one that can satisfy the agent’s current needs. In this view, the agent would learn to navigate not only through the physical 3D world, but also through the semantic space associated with its goals. I have a feeling that this symmetry between operating in physical and semantic spaces fits well with the cortical hypothesis. What do you think?

More generally, I could imagine that some form of ego-centric “thriving” feature space, directly related to neuromodulation and goal-directed behaviour, would be plausible from a biological perspective, at least at the systems level, although I have no intuition about how it might be implemented physiologically. @rmounir thanks for pointing me to the other quote; I love it!

@W_Foxalike: the idea that a parent LM operates at a larger spatial scale than its child LM is a bit less intuitive to me. Clearly, the child LM summarizes some region of the space occupied by an object, but the remaining inputs to the parent could still consist of SMs with relatively small receptive fields. Because of that, I wonder whether a heterarchy requires additional assumptions beyond those of either a strict hierarchy or a purely random graph.

W_Foxalike · July 17, 2026, 2:50pm

Hi @sergioval — thanks, both of these are things I keep circling back to.

On navigating a semantic space. I find this really appealing too; the symmetry with moving through physical space is the TBT picture exactly. Where I get stuck is that reusing Monty’s reference-frame machinery needs a metric space, but the object IDs that LMs pass around are discrete labels with no metric, so there’s no “semantic pose” to move through. Trying to put a metric on a semantic space instead runs into two things: a semantic similarity is only a scalar, not the multi-axis coordinate structure a reference frame needs; and embeddings are unstable, in that a small change in the input can move the embedding a lot, so distances come out uneven and path integration breaks down. That’s the part I can’t see past yet. (A related discussion on applying TBP to text understanding runs into similar difficulties: “Discussion on Applying TBP Theory to Text Understanding: Challenges and Potential Pathways”, post #4 by vclay.) Do you see a way to give that space enough structure to actually navigate?

On the parent operating at a larger spatial scale. I think “larger scale” here means a larger reference frame, not more input. Along a compositional (part-of) edge, the parent object’s frame spatially contains its children — the whole mug’s frame holds the handle and the logo at their poses — so its spatial extent is bigger by construction. But the parent LM doesn’t receive more raw sensory bandwidth for that: its input is just the child LMs’ ID + pose, which is coarse and low-frequency. So a bigger scale ≠ more or finer input. And it needs no global “height” ordering — only the local parent-contains-child relation along each compositional edge, which sits fine on top of a heterarchy that has no global depth.
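As a sketch of that local relation, here is a check that only looks at each part-of edge and never needs a global depth. The LM names, the extents, and `edges_respect_containment` are hypothetical and just encode my reading above.

```python
# Hypothetical spatial extents (metres) of the object each LM models.
extent = {"lm_handle": 0.05, "lm_logo": 0.03, "lm_mug": 0.12}

# Compositional (part-of) edges as (child, parent); other connections need no ordering.
part_of_edges = [("lm_handle", "lm_mug"), ("lm_logo", "lm_mug")]


def edges_respect_containment(edges, extent):
    """True if, on every part-of edge, the parent's extent is at least the child's."""
    return all(extent[parent] >= extent[child] for child, parent in edges)


ok = edges_respect_containment(part_of_edges, extent)
```

The check says nothing about how many processing steps separate an LM from the sensors, so skip connections do not affect it.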