2026/02 - A Thousand Brains on a Thousand Chips: Scaling Monty on CPUs, GPUs, and PiM chips

@xavier from ETH Zurich presents his master’s thesis, exploring how Thousand Brains systems could scale on modern hardware. His research examines how scaling the number of learning modules affects computing performance on GPUs, CPUs, and processing-in-memory (PiM) architectures. GPUs aren’t a great fit because autoregressive algorithms like Monty have low operational intensity. CPUs can scale reasonably well but require more computation time as the number of modules increases. PiM chips offer a promising alternative by placing computation near memory, enabling large-scale parallelism. Xavier shows results from scaling to 2,500 learning modules, representing millions of neurons and billions of synapses.

0:00 Introduction
0:49 An Overview of Monty’s Structure
2:12 Motivation: Rapid, Continuous, and Compute Efficient Learning
2:51 Scaling Cortical Columns
3:48 Scaling an Algorithm with an Auto-regressive Loop
4:39 Investigate the Scalability of Thousand Brains Systems
5:41 Montyll – A Novel Thousand Brains System
6:25 Why HTM Networks?
8:03 Scaling on GPUs
9:56 Scaling on GPUs: Operational Intensity
11:54 Scaling on CPUs
13:07 Can Multicore CPUs Handle the Amount of Data Movement?
15:03 Scaling on Processing-in-Memory (PiM) Chips
15:47 Scaling on PiMs: DRAM Banks
24:14 Scaling in Data Centers
25:13 The Montyll Implementation
30:20 Cat Cortex Scale System (2500 Learning Modules)
32:33 Why Logic Frequency Is Low?
33:58 Processing-in-Memory Chip Illustration
38:20 WRAM
39:26 MRAM
40:41 Connection Transfer
41:12 Tasklet Level Parallelism
42:08 Barriers and Synchronization
43:36 The Results
45:35 Results: Time per Step
56:40 Neurons and Synapses vs Devices


Hey, this is my work!

I am very excited that this video is coming out today because I’ve been meaning to share this work with the smart people in this forum for a while now. I hope you guys get some value out of it. Let me now add some context and key takeaways.

Goal. If one really believes that learning modules are modeled after columns in the neocortex, it raises the question: what are the consequences of scaling Thousand Brains Systems to the scale of the human neocortex (200’000 learning modules)? Which computing platforms are best equipped to meet the scaling requirements?

Importance. A Thousand Brains System is basically a huge set of heterogeneous weights (different learning modules) operating on a set of heterogeneous inputs (different sensor patches). This implies a unique scaling profile that is fundamentally different from that of deep-learning-based AI systems. Understanding this scaling profile and its impact on different computing architectures felt important to me.

Methods. We introduce a novel Thousand Brains System called Montyll, which stands for “Monty low-level” because it introduces elements of low-level cortical processing (HTM networks). There are two reasons for this. The first is that I wanted to computationally capture the long-term goals of the Thousand Brains Project, which mention incorporating elements of HTM networks in Thousand Brains Systems. The second is that I was really interested in evaluating a promising computing architecture called “Processing-in-Memory” for scaling Thousand Brains Systems. HTM networks were a great match for the capabilities of that architecture. We look at the potential of GPUs, CPUs and Processing-in-Memory (PiM) for scaling Montyll. I also talk about clusters and neuromorphic hardware in the thesis.

Key Takeaways. You’ll have to read the thesis or watch the talk for further details, but here are some key takeaways.

  1. GPUs are not a great fit for scaling Thousand Brains Systems. GPUs need high operational intensity (high data reuse) to perform well, which is incompatible with the auto-regressive loop and heterogeneous weights found in Thousand Brains Systems. This does not completely exclude GPUs from accelerating the workload, and we’ve already seen some impressive efforts from people in this forum, but I do not think they can revolutionize the applicability and scale of Thousand Brains Systems the way they did for deep learning.
  2. Short/Medium Term. CPUs are more than good enough. I do not see Thousand Brains researchers scrambling to find a better computing platform in the short to medium term. Single-node CPUs can already accommodate pretty big systems, the size of a guinea pig cortex (400 LMs) and potentially up to the size of a cat cortex (2’500 LMs). Datacenters and CPU clusters close the scale gap, with the promise of scaling to the human neocortex without needing an egregious number of machines.
  3. Medium Term. DRAM-based PiM represents a very capable computing platform. The promise of PiM lies in its ability to execute a massive number of learning modules in parallel. There are many drawbacks to using Processing-in-Memory chips today, but they are not fundamental or “first principles” drawbacks. They mostly stem from a lack of resources being put towards making the chips better and more capable. If Thousand Brains Systems become as big a deal as I think they will, I see DRAM-based PiM as having the potential to revolutionize the scalability of Thousand Brains Systems in the medium term, especially for embodied applications that require real-time, power-constrained operation. Only time will tell if I am right. It is quite possible that large investments are instead made into connecting robots to compute clusters, which would undercut the need for scale in a small form factor, but this approach has its own obvious problems.
  4. Long Term. If the idea is to eventually have a machine capable of running Thousand Brains in a small form factor (single node, low-ish power, for embodied intelligence), I am not sure any compute platform today can hit the device scaling requirements, let alone the power budget requirements. It would need ~600 TB of memory footprint and copious amounts of compute parallelism at multiple levels of granularity. That’s outside the scope of the paper, but if I had to make a bet, NVM-based PiM looks like the best fit from what I’ve seen, for its combination of footprint and weight-heterogeneous compute parallelism. I don’t see how Processing-in-Storage (PiS) could accommodate the workload, but I might be wrong.
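
For readers who have not met the term, the operational-intensity argument in takeaway 1 is usually pictured with the roofline model: attainable throughput is the smaller of peak compute and memory bandwidth multiplied by operational intensity (operations per byte moved). The sketch below is only an illustration and is not taken from the thesis; the function names and the hardware numbers are made-up placeholders.

```python
def attainable_ops_per_s(peak_ops_per_s, mem_bytes_per_s, ops_per_byte):
    """Roofline model: throughput is capped by compute or by memory traffic."""
    return min(peak_ops_per_s, mem_bytes_per_s * ops_per_byte)


def ridge_point(peak_ops_per_s, mem_bytes_per_s):
    """Operational intensity above which the device becomes compute-bound."""
    return peak_ops_per_s / mem_bytes_per_s


# Hypothetical accelerator (placeholder numbers, not a real device):
PEAK = 100e12  # 100 Tera-ops/s of peak compute
BW = 2e12      # 2 TB/s of memory bandwidth

# Low reuse: every byte fetched is used for about half an operation.
low_reuse = attainable_ops_per_s(PEAK, BW, 0.5)
# High reuse: every byte fetched is used for about 200 operations.
high_reuse = attainable_ops_per_s(PEAK, BW, 200)

print(ridge_point(PEAK, BW))  # 50.0 ops/byte
print(low_reuse / PEAK)       # 0.01, i.e. 1% of peak compute
print(high_reuse / PEAK)      # 1.0, i.e. compute-bound
```

With these placeholder numbers, a workload with little data reuse reaches only 1% of peak compute, which is the general shape of the problem described above for heterogeneous weights inside an auto-regressive loop.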

Links.

Acknowledgements. I want to specifically not thank Jeff Hawkins and the whole Thousand Brains Project team, including the great Viviane Clay. Getting headaches thinking about this whole thing is your guys’ fault. A direct consequence of the awfully good supervision I received from Doctor Clay and the disgustingly interesting ideas and theories developed by Jeff Hawkins and everyone else over the years at Numenta and the Thousand Brains Project.

I know you guys will probably have some interesting thoughts and questions about this, which I’ll be excitedly waiting for.


Hi @xavier

Thank you for a very interesting presentation.

I don’t know if you looked at this, but another option is arrays of FPGAs. I have designed several ASICs and many FPGAs. With an FPGA you can design your own logic, have whatever accuracy arithmetic you need and as much of it as you need. There is on-chip single-cycle memory and typically there is a DRAM interface for bulk storage. You can also attach non-volatile memory of course, and the logic can run faster than 400 MHz.

If, for example, you could implement 8 learning modules on a single FPGA + DRAM, then an array of 64 FPGAs on a PCB would give you 512 learning modules, all running in parallel. Placing 8 of these PCBs in a rack (along with some fans!) would give you 4096 learning modules, all in parallel.

FPGAs have many I/O pins including some very high speed (PCIe gen3) I/O with which to implement interconnect protocols both chip to chip and board to board.

And being FPGA it means that you can tinker with the learning module design and the interconnect protocols as ideas develop. When the design crystallizes you might consider going to ASIC to reduce system cost.

It would not be a cheap system to build, probably a quarter of a million dollars for the hardware alone, but still much cheaper than custom silicon, and without the risk.

Alex


That’s a very good point, Alex. I am a bit ashamed to admit that I really haven’t given much thought to FPGAs, so I am pretty happy about your comment because it gives me something new to think about.

My first thought is that an FPGA-based system would suffer from the same memory-wall problem as a CPU-based system. The workload is memory-bound, hence the limiting factor is not necessarily the logic processing speed but rather accessing the huge amount of DRAM-resident learning-module data at every step. Parallelizing over more memory systems (each FPGA gets its own DRAM chip) will always yield a benefit because it increases the aggregate memory bandwidth, but that would also be true for CPU-based systems.
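
To make the aggregate-bandwidth point concrete, here is a back-of-the-envelope sketch, assuming a purely memory-bound step whose duration is the bytes touched divided by the total bandwidth across devices. The function name and all numbers are hypothetical placeholders, not measurements from the thesis.

```python
def memory_bound_step_time(bytes_per_lm, active_lms, devices, bytes_per_s_per_device):
    """Lower bound on step time when data movement is the only bottleneck."""
    total_bytes = bytes_per_lm * active_lms
    aggregate_bandwidth = devices * bytes_per_s_per_device
    return total_bytes / aggregate_bandwidth


# Placeholder workload: 64 learning modules, each touching 1 GB per step,
# on devices that each have their own 32 GB/s DRAM channel.
one_device = memory_bound_step_time(1e9, 64, 1, 32e9)
eight_devices = memory_bound_step_time(1e9, 64, 8, 32e9)

print(one_device)     # 2.0 seconds
print(eight_devices)  # 0.25 seconds
```

The estimate does not care whether the devices are FPGAs or CPU sockets: only the number of independent memory systems and their bandwidth appear in it.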

Very cool that you have developed the expertise to design several ASICs and FPGAs. That’s above my pay grade, but I always looked up to people who could do that.


Yes, you could of course build an array of CPUs, but CPUs require a lot of support hardware and don’t lend themselves to interconnect communications very well.

For sure FPGAs are losing prominence to CPUs and GPUs in the processing world, but I think for the requirements of this new kind of architecture they may have some advantages. With distributed memory, parallel processing and optimised logic functions, they are a good fit.

Another big issue for continuous learning systems is non-volatility. A sensorimotor intelligence is creating memories and learning from the moment it is switched on and its sensors start sending back information. If those memories are in DRAM and there is a power glitch then all of the learning is lost. So you either have to have a very reliable power source or write the memories to non-volatile storage which is typically much slower than DRAM. Biological brains suffer the same weakness sadly.

Like DRAM and many other semiconductors, FPGAs are experiencing shortages and price hikes, but you can experiment with a single FPGA on a development board, there are even development boards that plug into PC expansion slots.


I’ve been working on a (mostly) personal project for (at least) 15 years. It’s a computational psychohistory model aimed not at future prediction, but at archaeological research. The concept is to solve difficult anthro and archaeo problems by constructing truly vast sociological simulations. Linear A, anachronisms like Antikythera, and detailed explanations of the rise and fall of civilizations are examples (sort of Gibbon meets Asimov meets scidata :wink: )

It’s definitely neurological, not GenAI. It’s also very old-fashioned (GOFAI). I learn a lot from TBP and even from the old Palm Computing days. The agents are coded in PROLOGish FORTH and the simulation itself in big FORTH, so there’s no copy/pasting of anything from TBP (or anywhere else, other than my own old code). There’s no github team because I’m old, crusty, and hard to work with.

If there’s something helpful I can contribute to TBP, I do (mostly biological findings) just to earn my keep here and not be accused of lurking.

Assembling a diverse, distributed, and motivated research group is not easy, kudos to the TBP leadership. I ran a worldwide Citizen Science team for a decade that eventually got squeezed out by the hostile GPU agora, which pushed me to look at CPU, PIM, FPGA, and other technologies. I work on a small scale (just a few ‘cells’), and was going to scale up to a superchip (working name SELDON I) to be made at a foundry. It would contain a very large number of independent ‘PROLOG-IN-FORTH’ cores, enabling individual agents to run largely independent threads. This architecture isn’t conducive to GPU usage of course.

The recent talk of moving the big Samsung foundry (and others) to Ontario, Canada has rekindled that notion a wee bit. The pace of change these days is challenging.

As Lynn Margulis often said, we shouldn’t be in any great hurry to forget or dismiss old theory and practice. The mind is still largely unexplored and un-replicated computationally (which is why we’re here). The jury’s still out on whether token-munching, stochastic GenAI is The Great Revolution or simply a trillion dollar parlor trick.


A lot of TBP scaling discussions seem to assume that all of the modules need to be available at all times and that communication speed is a critical factor. However, I doubt that this is the case: apparently, most cortical columns are idle most of the time. So, here’s a different approach…

Set up a bunch of processors, each of which has a dispatching module (DM). When a message is received, the DM checks to see if the target is resident in memory. If so, the DM simply forwards the message. If not, the DM sends a message to the module loader, asking for the target to be swapped in, then forwards the message in due time. And, if the processor starts to get bogged down, we move some modules to other processors.

None of this is new technology; paging and swapping are common in OS design and process migration has been used in some systems. The key difference here is the duty cycle: we might be able to have 90% of the code and data swapped out at any given time. That wouldn’t work for most computing systems, but it might for us…
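
A minimal sketch of the dispatching idea above, with several simplifying assumptions: the loader is called synchronously instead of via a message, eviction is least-recently-used, and an evicted module's state is simply dropped rather than written back. The `Dispatcher` class and `load_module` function are hypothetical names used for illustration only.

```python
from collections import OrderedDict


class Dispatcher:
    """Forwards messages to learning modules, swapping them in on demand."""

    def __init__(self, capacity, load_module):
        self.capacity = capacity        # max modules resident at once
        self.load_module = load_module  # hypothetical loader: id -> handler
        self.resident = OrderedDict()   # id -> handler, least recent first
        self.swap_ins = 0

    def send(self, module_id, message):
        if module_id in self.resident:
            self.resident.move_to_end(module_id)   # mark as recently used
        else:
            if len(self.resident) >= self.capacity:
                # Evict the least recently used module. A real system
                # would persist its state before dropping it.
                self.resident.popitem(last=False)
            self.resident[module_id] = self.load_module(module_id)
            self.swap_ins += 1
        return self.resident[module_id](message)


def load_module(module_id):
    """Stand-in for the module loader; returns a trivial message handler."""
    return lambda message: f"LM{module_id} handled {message}"


dm = Dispatcher(capacity=2, load_module=load_module)
for target, msg in [(1, "a"), (2, "b"), (1, "c"), (3, "d")]:
    print(dm.send(target, msg))

print(list(dm.resident))  # [1, 3]  (module 2 was evicted)
print(dm.swap_ins)        # 3
```

Whether this pays off depends on the duty cycle: the lower the fraction of modules that are active at any time, the fewer swap-ins per forwarded message.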


@Rich_Morin makes a very good point. As the system scales up we can improve efficiency by selectively processing only those parts that are experiencing change. We won’t know the trade-offs until we start to build big systems and hit hardware limitations.

While musing on the processing of an architected neural network for my plastic spider, I realised that if I had to process all neurons many times per second I would soon run out of processing power, or memory bandwidth, or both.
Processing all neurons on every pass means that the order of processing is unimportant as any sensory input changes will ripple through in a few passes, which lends itself to parallel processing very nicely. However, many of the neurons will be inactive. It would be more efficient to follow paths of activation and abandon inactive paths. But then processing order becomes important again and the processing engine becomes more complicated.
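
The trade-off can be illustrated with a toy comparison. It assumes a deliberately simplified network in which a neuron, once active, stays active and activates its targets; real neurons with thresholds and decay would need more bookkeeping. The function names and the tiny example graph are made up for illustration.

```python
from collections import deque


def dense_passes(edges, n_neurons, initially_active, passes):
    """Visit every neuron on every pass; order within a pass is irrelevant."""
    state = set(initially_active)
    visits = 0
    for _ in range(passes):
        next_state = set(state)
        for neuron in range(n_neurons):
            visits += 1
            if neuron in state:
                next_state.update(edges.get(neuron, ()))
        state = next_state
    return state, visits


def event_driven(edges, initially_active):
    """Follow only paths of activation, at the cost of managing a work queue."""
    state = set(initially_active)
    queue = deque(initially_active)
    visits = 0
    while queue:
        neuron = queue.popleft()
        visits += 1
        for target in edges.get(neuron, ()):
            if target not in state:
                state.add(target)
                queue.append(target)
    return state, visits


# 1000 neurons, but only a short chain 0 -> 1 -> 2 -> 3 ever becomes active.
edges = {0: [1], 1: [2], 2: [3]}
dense_state, dense_visits = dense_passes(edges, 1000, [0], passes=3)
event_state, event_visits = event_driven(edges, [0])

print(dense_state == event_state)  # True
print(dense_visits, event_visits)  # 3000 4
```

Both approaches reach the same final state here, but the dense sweep is trivially parallel while the event-driven version has to maintain ordering through the queue.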


I agree with Xavier that FPGAs would suffer the same memory bottleneck on the DRAM side. It’s a bit of a shame, because FPGA logic blocks are kinda organized in a way that’s reminiscent of cortical columns. Their distributed BRAM could technically alleviate the memory wall, but even high-end FPGAs only have a few hundred MBs of it, so not really viable.

The non-volatility part is another can of worms on its own. To save something, computers have to transfer stuff between DRAM and drives, unlike neurons. What happens while saving? Do we put the robot on standby? How often? The thesis mentions terabytes of data, what about SSD lifespan? Incremental backups!? etc. But maybe it’s a bit early to talk about that.

On a side-note, here’s a pretty great introduction to FPGAs for those interested: The Most Versatile Chips Ever Built || FPGA Deep Dive and Use


Yea agreed with @Rich_Morin, CPU systems are pretty well suited given that only a sparse set of columns are active at each step. Though we still suffer from the cost of moving all of the activated columns’ memory from main memory to logic, which is going to be an important challenge medium/long term.

The non-volatility issue that you guys are talking about (@Alex and @AgentRev) reminds me of the following. In DRAM, there is a constant power cost to pay for keeping the data uncorrupted. In particular, every DRAM cell needs to be “refreshed” multiple times a second. This is in part why Processing-in-Memory solutions cannot accommodate scaling to terabytes on a single compute node: the power demands would be too high.

On the other hand, non-volatile memory technologies have not been shown to be a close match to DRAM on multiple dimensions: density, endurance and latency. Choosing non-volatile technology over DRAM comes at a cost.

In any case, discussing these details does feel like getting ahead of myself. If anything, my work showed that typical CPU-based systems, with typical DRAM main memory and flash storage should be more than good enough for the short to medium term.

Before doing this work, I thought I would find that Thousand Brains Systems would need a very particular computing architecture to support the scaling, and I still believe that this will be true in the long term, but what I found for the short/medium term is that we’ll be fine. I don’t see researchers scrambling for esoteric hardware to scale Thousand Brains.


The YouTube transcript system has a Freudian slip which isn’t too far from the warning from reality:

in my amateur understanding there can’t be a strict physical global synchronization in the brain because like with the CAP theorem, it’s easy to get partitioned, and at any given moment, one can say that it’s in a state of partial connectivity (almost by modern understanding / definition). This is, however, the key to scalability, not a flaw.

the one quadrant (top right) is the evil one.

I haven’t watched the presentation video to the end yet, so perhaps my post will turn into useless commentary, but I think there might be something to think about: when scaling Monty, one could think of one Monty Step handling a large number of (L)Ms, whereas one could potentially turn it around by softening the synchronization constraint and having many Monty Steps each handle one to a few (L)Ms.

PiMs probably go in that direction, however, on another architectural level, the shared-nothing architecture might allow the desired scalability with whatever hardware/connectivity improvements one gets:

source

Since as mentioned, there’s a lot of data overlap/reuse, perhaps, not everything needs to be copied/communicated. This might be done sparsely/selectively.

Perhaps, university collaborations are perfect to try such things out.

Real-time Monty systems, at the latest, will likely have to abandon rigid steps and turn into completely asynchronously communicating systems. That doesn’t exclude some kind of globally published signals, but without hard waiting/synchronization constraints.

In fact, in an actor system, each actor is uniquely addressable. The sparsity of connections is simply the list of process identities (addresses) of the actors in the system. And new addresses can be communicated via messages as well. “A and B wired together” can be mapped onto “A sends its identity to B or vice-versa”.
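
As a rough illustration of “wiring by sending an identity”, here is a toy actor sketch using Python's asyncio rather than the BEAM. Everything in it (the `Actor` class, the message kinds `wire`, `emit`, `vote` and `stop`) is invented for the example and is not part of Monty; an actor's mailbox stands in for its address.

```python
import asyncio


class Actor:
    """A toy actor: its mailbox doubles as its address."""

    def __init__(self, name):
        self.name = name
        self.mailbox = asyncio.Queue()
        self.peers = []      # sparse connectivity: addresses we were told about
        self.received = []   # votes received from other actors

    async def run(self):
        while True:
            kind, payload = await self.mailbox.get()
            if kind == "stop":
                return
            if kind == "wire":       # payload: another actor's address
                self.peers.append(payload)
            elif kind == "emit":     # payload: a value to share with all peers
                for peer in self.peers:
                    await peer.put(("vote", (self.name, payload)))
            elif kind == "vote":     # payload: (sender name, value)
                self.received.append(payload)


async def main():
    a, b = Actor("A"), Actor("B")
    task_a = asyncio.create_task(a.run())
    task_b = asyncio.create_task(b.run())

    # "A and B wired together": A's address is sent to B as an ordinary message.
    await b.mailbox.put(("wire", a.mailbox))
    # B shares a hypothesis when it is ready; nobody waits at a global step.
    await b.mailbox.put(("emit", "mug"))

    # Stop B first so that its outgoing vote is queued at A before A stops.
    await b.mailbox.put(("stop", None))
    await task_b
    await a.mailbox.put(("stop", None))
    await task_a
    return a.received


received = asyncio.run(main())
print(received)  # [('B', 'mug')]
```

Nothing here enforces a shared step counter: each actor only reacts to whatever arrives in its own mailbox, which is the relaxation of synchronization discussed above.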

Upd thought 1: especially for the cycle in the inference loop, relaxing the temporal sequence of inference within a step might solve the conceptual loop by making each actor independent with opportunistic inputs and outputs. A bit like my async voting experiment (elixir_ne) + the newest post from today:

UPD 2: congrats on the cool thesis

Hi Xavier

Thanks for the information. I am really interested in your demo results for detecting/classifying objects, such as MNIST or the cup model, with different numbers of features. Do you have them? Thanks.

By the way, is PiM comparable to SpiNNaker? Could this run on it or be mapped onto it?

ignoring the current use for “GenAI”, this might sound at least theoretically quite aligned with either HTM networks or communicating Monty Modules (actors)

The rapid growth of GenAI is causing an energy bottleneck, making energy-efficient algorithms and hardware a priority

Our Chip Architecture

Our brain-inspired computing architecture is uniquely optimized for the efficient inference of dynamically sparse algorithms

Dynamic sparsity is paving a radically more efficient path to GenAI algorithms

In Elixir and Erlang systems, the identity of each actor (i.e., lightweight process) is encoded as an opaque triple (e.g., <1.2.3>), coupled with a “node” (BEAM instance) tag. The BEAMs on various machines handle both inter- and intra-node dispatch in an efficient and transparent fashion. Thus, each actor can ignore pesky details regarding message routing, etc.

In a mixed Elixir/Python Monty system, the most direct approach would be to let the BEAM handle messaging. The Elixir actors handle this by default and could intermediate message traffic to and from subsidiary Python processes.

That said, in a multi-processor (and perhaps geographically distributed) Monty system, there would need to be a way to address BEAM instances on other processors. My tendency would be to use Uniform Resource Identifiers (URIs) for this, because they are familiar, flexible, and accepted as the dominant addressing standard on the Internet.

I’ve been thinking about the problem of memory becoming read-only over time, and what types of information need to be stored in the connectome for the life of the brain. The read-only end of life for SSDs and thumb drives is equivalent to the drop in connection-building speed from minutes as a child to days as a 45-year-old. That’s pretty much the same thing. The brain effectively becomes read-only by the time you’re in your 60s. That’s why Schumer thinks it’s still the 1990s.

So the brain must have a way of dealing with that read-only eventuality, at least reasonably well, and that mechanism could be emulated in Monty once it is discovered.

@xavier, I remember watching a video about the Erlang BEAM running on a system with many many many cores, a “distributed” system which fits Erlang like a glove.

In fact, luckily, I found it: https://www.youtube.com/watch?v=WGXPFPKQC2o

Hearing you talk about PiM and the low-level details needed to program these systems, I was wondering: does it actually have to be this way? Because in Erlang a process is completely, fully, 10000% isolated from all the other processes. In theory (ofc idk how it is in practice), all you have to do is wire up the BEAM to work on PiM systems, and then Erlang is a high-level language which you can use to program these chips; it just “falls into your lap”.

Or am I just going into wayy the wrong direction here?

UPD: read the thread more carefully, glad it’s not just me who went “BEAM!!! that’s the BEAM!!” haha

@filipencopav The BEAM would have to be ported to the UPMEM DPU architecture, which is not a trivial affair, if even possible. Once that’s stable, then Monty would have to be ported too. Realistically, after all is said and done, the extra overhead would likely cancel out the speedup that bare-metal PiM has over CPUs. :sweat_smile:


Fascinating work, Xavier!

I wonder, though, if the CPU implementation of Monty is well optimized. It seems like there is a substantial slowdown when you move from one CPU to the machine size (from ~10 ms/sample to ~40 ms/sample). I wonder if the thread pool models used by the code are ignoring the work imbalance between threads (at step level). Most likely, the performance plateau you see is due to that. You have so many tasks running in the worker thread that there is less waiting for others to finish the step, so you see an IPC increase. There will also be fewer “idle” cycles for the cores waiting at the barrier. By the way, IPC is never a good metric for performance in multithreaded workloads because it also considers synchronization instructions.

It seems like it may be harder to catch up with well-optimized code (for example, code that optimizes synchronization between LMs). I don’t know if Monty also starts at a very high mark. 10 ms per sample with a single LM seems high. Let’s see if the next generation of PiM hardware will solve the interbank communication limitation, as proposed in [1]. Most likely, the performance will be much higher given the sparse communication between LMs.

In any case, good work!

Regards,

-- Valentin

[1] https://bu-icsg.github.io/publications/2025/PIMnet_HPCA_2025.pdf


Thank you, Valentin.

Yes, as said in the post and the presentation, I am proposing a novel “fictitious” Thousand Brains System that is implemented in C and optimized, which makes the work speculative but avoids the “Monty is unoptimized” problem. The implementation is pretty bare-metal, with conscious efforts to take care of potential architectural bottlenecks that could hinder the workload’s performance. The repos are listed in my post.

To bring it back to what you were saying, the plateau cannot be explained by the unoptimized-code hypothesis alone. There is also very little worker imbalance in the workload. Honestly, I do not remember what my explanation was for the plateau, but I remember us talking about it towards the end of the presentation. There are also very few synchronization instructions in this workload, so there isn’t really any issue with using IPC as a metric here, but I did not know that it could be a problem, so thanks to you I might be slightly less dumb.


Oops — I hadn’t looked at your C implementation. That’s a lot of work!

Regarding the imbalance, most likely you are right, and I’m wrong. Anyway, I noticed you have a #pragma omp barrier at each step (in scale_out.c). As you increase the number of LMs, the likelihood that all of them reach that barrier at the same time drops quite a bit. My intuition is that learning_module_step won’t stay perfectly balanced, since both the inputs and the internal states evolve over time.

Without digging deeper, it’s hard to be certain, but this kind of synchronization can be pretty painful to analyze in profiling. Most of the time you just see workers idle, waiting for the slowest task, which hides where the actual imbalance comes from. Adding OpenMP into the mix can make that even harder to reason about. The imbalance in any single step may be small, but because it recurs at every step it may degrade performance significantly.
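
To show why a per-step barrier can hurt even when the imbalance looks small, here is a toy model, assuming each worker's step time varies independently by up to 10% around a nominal value of 1.0. That distribution is an arbitrary assumption chosen for illustration, not a measurement of Montyll, and the function names are made up.

```python
import random


def barrier_step_time(worker_times):
    """With a barrier per step, the step lasts as long as the slowest worker."""
    return max(worker_times)


def idle_fraction(worker_times):
    """Fraction of total worker time spent waiting at the barrier."""
    step = barrier_step_time(worker_times)
    return 1.0 - sum(worker_times) / (step * len(worker_times))


def mean_step_time(workers, steps, rng):
    """Average barrier-limited step time over many simulated steps."""
    total = 0.0
    for _ in range(steps):
        times = [rng.uniform(0.9, 1.1) for _ in range(workers)]
        total += barrier_step_time(times)
    return total / steps


rng = random.Random(0)  # fixed seed so the run is repeatable
single = mean_step_time(workers=1, steps=2000, rng=rng)
many = mean_step_time(workers=64, steps=2000, rng=rng)

print(idle_fraction([1.0, 1.0, 2.0]))  # about 0.33: two workers wait half the step
print(single < many)                   # True: more workers, slower average step
```

In this model the mean step time drifts from the average worker time towards the worst-case worker time as the number of workers grows, even though no individual worker got slower.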

I’d suggest starting with a simple thread pool under your control and building from there (even though OpenMP is likely doing something similar under the hood). It might give you better visibility and flexibility to experiment.