2026/04 - Monty for Ultrasound - A Real World Challenge

TBP-Announcements · April 7, 2026, 2:58pm

@nleadholm introduces a public repository and dataset for ultrasound perception using Monty. This project aims to improve automatic ultrasound image analysis and user guidance, especially in low-data medical settings. It is now available for everyone to explore and experiment with. We would love to hear your feedback and learn what you think!

Summary Video

Main Video

0:00 Introduction
0:08 Ultrasound Perception in Monty: A Public Repository & Open Challenge
0:32 Ultrasound
2:49 Real-World Applications
3:39 Current Deep Learning Approaches
4:50 Monty for Ultrasound
7:13 Next Steps
7:53 Recap on the Setup
9:11 Phantom Setup
9:52 Extract Features & Pose
10:14 Pipeline Overview
10:45 Learned Models
10:56 The Repository
11:08 Dataset: The Objects
12:14 Dataset: Learned Models
15:14 Monty’s Performance Today
17:56 Open Issues & Ways to Contribute

nleadholm · April 7, 2026, 3:38pm

Here is a link to the repository for anyone interested in playing around with the datasets. Excited to hear everyone’s ideas for improving performance and addressing the open Issues.

Bryce_Bate · April 8, 2026, 3:12pm

Niels (@nleadholm), “Monty for Ultrasound” may be just the “Wow!” medicine the doctor ordered. And the future plans to have a trained operator “manifest the movements” are going to be vital.

But let me offer an additional idea that would serve to further set Monty apart from other AIs while, more importantly, building trust. It’s “explain as you go.” A fundamental difference between Monty and LLMs is Monty’s potential to provide causal justification for its hypotheses and actions. It is the difference between having knowledge and having educated guesses.

I’ll use your experienced example. And please forgive my venture into a medical example for which I have absolutely no training. Imagine two doctors separately trying to diagnose a patient’s abdominal pain complaint. One doctor performs an operator-assisted blind sweep of the belly. This (LLM) doctor (a kind of statistical ventriloquist) has been trained to find the strong correlates of appendicitis and does so. The other doctor (Monty, a structural investigator) directs the operator to make specific adjustments to the probe, explaining the reasons for each move. This doctor (Monty) explains what they are looking for in each movement and why. They also report what is found (or not found) as they go. They direct the probe to a position so as to rule out an acoustic shadow. And so forth.

Both doctors can give reasons for their correct diagnosis of appendicitis.

The LLM doctor has what we might call “statistical justification” for their diagnosis. “I believe this is an appendix because it matches the 10,000 images of appendices I was shown.” The “what” is a correlation between surface features. The pattern is: “When Pixel Group A and Pixel Group B appear together, the label is usually ‘X’.” The justification (the reason for the claim) is Extrinsic. It relies on the authority of the training corpus.

The Monty doctor has “causal justification” for their diagnosis. “I know this is an appendix because I predicted its 3D structure and verified it by directing a movement that confirmed the expected spatial features.” The “what” is a correlation between movement and location. For Monty, the pattern is: “If I move the probe 5 mm to the left, I must encounter Feature B if this is indeed Object A.” The justification (the reason for the claim) is Intrinsic. It is rooted in the 3D integrity of a reference frame.
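
As a rough illustration of the "move, predict, verify" pattern described above, here is a toy Python sketch. It is not Monty's actual API: the object models, feature names, `Hypothesis` class, and `verify_by_movement` function are all hypothetical, and real reference frames are 3D rather than this 2D simplification.

```python
from dataclasses import dataclass

# Hypothetical toy models: the feature expected at each location (in mm) in the
# object's own reference frame. These are NOT Monty's actual data structures.
MODELS = {
    "object_a": {(0, 0): "feature_a", (-5, 0): "feature_b"},
    "acoustic_shadow": {(0, 0): "feature_a", (-5, 0): "nothing"},
}


@dataclass
class Hypothesis:
    object_id: str
    location: tuple  # assumed probe location in that object's reference frame


def verify_by_movement(hypotheses, displacement, observed_feature):
    """Keep only hypotheses whose prediction at the new location matches the
    observation, and record a human-readable reason for every decision."""
    survivors, trace = [], []
    for h in hypotheses:
        new_loc = (h.location[0] + displacement[0], h.location[1] + displacement[1])
        predicted = MODELS[h.object_id].get(new_loc)
        kept = predicted == observed_feature
        trace.append(
            f"{h.object_id}: moved {displacement} mm, expected {predicted}, "
            f"observed {observed_feature} -> {'kept' if kept else 'rejected'}"
        )
        if kept:
            survivors.append(Hypothesis(h.object_id, new_loc))
    return survivors, trace


# Both hypotheses explain the first view; a 5 mm move to the left separates them.
hypotheses = [Hypothesis("object_a", (0, 0)), Hypothesis("acoustic_shadow", (0, 0))]
survivors, trace = verify_by_movement(hypotheses, (-5, 0), "feature_b")
for line in trace:
    print(line)
```

The `trace` list is the point of the sketch: each rejected or retained hypothesis comes with the movement, the prediction, and the observation that decided it, which is the kind of running explanation being proposed here.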

I can imagine a well-trained LLM offering an audit of sorts, but it would only be a statistical narration of why it guesses it’s right, whereas Monty could offer a verification narrative that proves how it knows it’s right.

It follows that the LLM doctor’s claim to know that the patient has appendicitis, resting on mere statistical justification, can be defeated in a way that the Monty doctor’s causal justification cannot. The LLM doctor may be, in fact, “knowing” only an acoustic shadow while the Monty doctor is knowing a structure. And this difference is at the heart of which doctor is to be trusted.

The benchmark for trust must not stop with accuracy in judgment. It needs to incorporate a justification benchmark for how the system “knows” its diagnostic claim to be true. You might consider this epistemic factor as you continue to separate Monty from all the rest.

Interesting work you have going on this!

– Bryce

nleadholm · April 9, 2026, 3:21pm

Thank you Bryce, that was nicely laid out - I agree it would be interesting to eventually compare to deep learning methods, and importantly, to include explainability in those evaluations. Something for us to think about.

Bryce_Bate · April 9, 2026, 7:56pm

Thank you, Niels. Beyond the AI comparisons, I think it may be a core feature of any AI in the future that it be trusted in its words and actions. I suspect other benchmarks will come to be as important to average users as the browser wars were to consumers in the mid-to-late 1990s.

Around 2017, Waymo introduced in-car feedback explicitly to build trust with riders. According to Waymo’s article, “Trusting Driverless Cars” (Taming the Road: How Self-Driving Cars Earn Your Trust), they added visual and audio feedback to show that the car was aware of important things (e.g., pedestrians, cyclists, cones, crosswalks, other cars). And they had the car give explanations for its decisions in real time as well (e.g., telling the rider it’s waiting for pedestrians or stopped for a railroad crossing). All of it was a bit of useful and clever trickery. Waymo had to fake it with UI; you can do it for real with the TBT architecture built into Monty.

The thing about “trusting the technology” that I want to emphasize is that TBT offers an unprecedented look inside the black box of the human mind. And consequently, Monty can address the trust issue head-on. Other knowledge agents, machine and human, may appear smart in ways that mask mere cleverness. I want my doctor to say “I don’t know” when they truly don’t know and not repeat word patterns that only serve to deflect my inquiry. Explaining one’s decisions and assessments at a level appropriate to the audience’s level of understanding is part of building confidence that the expert understands what they are saying and knows what they are doing. It’s what expert humans do, and I think Monty may be uniquely positioned to do it as well.

Sorry for going on, but I sometimes see things through the lens of my past life studying epistemology. Right now, I see the knowledge aspect of Monty as being more than a compelling feature. It could be something more defining and evident to people deciding whether to trust an AI that seems accurate on the surface. To paraphrase Supreme Court Justice Potter Stewart: “People know ‘warranted knowledge’ when they see it.” And when they hear Monty explain why it directed a move to verify a structure, they’ll know they’ve found it.

– Bryce

vclay · April 11, 2026, 8:08am

Those are great points @Bryce_Bate, thank you for highlighting those so nicely. Lately, we haven’t focused much on pointing out the interpretability aspect of Monty, but I 100% agree that this is a crucial advantage/requirement in many applications, and we could highlight that more. Especially in high-stakes situations concerning our health, it is crucial to be able to trust the system and know how it reaches its conclusions (as well as when it is uncertain).

tslominski · April 13, 2026, 5:03pm

3 posts were split to a new topic: Text, Language, and Interpretability in Monty