Learning Categories of Objects

YiannosD · December 27, 2025, 1:17am

Hello,

I have been looking at Grid Object Models for Unsupervised Learning and how to learn categories of objects. Before jumping into making a dataset and testing I had some conceptual questions the team might have already thought about.

I’m planning on starting with mugs or something easy like that, but when thinking about the future I inevitably thought about applying this to medical imaging since that’s my area of work. Given a CT scan, we essentially have a 3D model of part of a human with features at a pose (intensity, curvature). The cool thing about it is we get X-ray vision, so Monty could learn a model of a human body, but within that know the objects that compose it such as heart, liver etc. This is where generalizing to within-category objects comes in, since organs can have different sizes and structures.

Although mugs can also vary in structure and size, most of them just vary in color patterns. My main question was: do you think it will be better to start with an object that has more variance in its size and structure? If so, do you have any recommendations?

Edit: I think this is closely connected with the Support Scale Invariance issue, but does the team consider them completely separate? One could separate categories with objects of the same structure and size but different color, but I think that’s only a subset of generalizing within categories, and to truly do that you would need scale invariance.

nleadholm · December 29, 2025, 3:29pm

Hi @YiannosD , that’s great to hear you’re interested in this item.

You’re absolutely right that morphological/structural differences would be important for testing this. While mugs do often differ mostly by color, they can also have structural differences, as the (admittedly odd!) examples below show. You would want to reflect this to some degree in any dataset to see if canonical representations emerge that are more like the “Platonic” representation of a mug we often think of (partly hollowed cylinder + a handle).

It certainly isn’t a requirement that you use mugs. One other idea, if you wanted to build a dataset from publicly available/open source 3D assets, would be a dataset of 3D modelled trees and boulders. Do you then get a fairly generic model of a tree (brown, vertical center, bushy green top) and of a boulder (round, brown-grey object)? If you use different thresholds for detecting an object as being unfamiliar, you may get models of different granularity (e.g. one LM learns a model for conifers vs. deciduous trees, while another one has even more fine-grained separation by species).

Medical data would definitely be interesting to explore in the longer term, but I think it would be best to stick to something simpler for now.

Re. scale invariance, I agree this is a related issue, in that categorization can be either dependent on or totally invariant to the scale of an object (e.g. toy car vs. an actual car). It’s probably simplest if this can be avoided for the moment by limiting scale differences between objects that we expect to have the same category to within ~10% of one another. Once we have implemented a solution to scale invariance, we could consider more extreme deviations from this.
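A minimal sketch of how the ~10% constraint could be checked when assembling a dataset. The function name, the choice of measuring the spread relative to the largest object, and the example sizes are all assumptions for illustration; the thread does not specify how "within ~10%" should be computed.

```python
def within_scale_tolerance(sizes, tol=0.10):
    """Return True if all sizes in one intended category are within `tol` of each other.

    `sizes` maps a (hypothetical) object name to a characteristic size in meters.
    Assumption: the spread is measured relative to the largest object.
    """
    values = list(sizes.values())
    if not values or min(values) <= 0:
        raise ValueError("sizes must be non-empty and positive")
    largest, smallest = max(values), min(values)
    return (largest - smallest) / largest <= tol


# Made-up example sizes (meters)
trees = {"oak": 1.00, "maple": 0.95, "pine": 0.92}
cars = {"toy_car": 0.05, "real_car": 4.5}

print(within_scale_tolerance(trees))  # sizes differ by 8% of the largest
print(within_scale_tolerance(cars))   # toy vs. real car: far outside 10%
```

Objects that fail the check could either be rescaled or assigned to a different intended category until scale invariance is supported.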

Hope that makes sense, let me know if I can clarify anything.

YiannosD · January 7, 2026, 3:26pm

That makes sense, thank you! I think trees might be a very good example to start with. I’ll work on this and see where it takes me!

nleadholm · January 8, 2026, 9:37am

Ok great, sounds good! Let us know if there’s anything we can help with once you get started.

Rich_Morin · January 8, 2026, 6:21pm

Although you may be able to find datasets of 3D modeled trees and/or boulders, a problem could arise if you want to bring actual examples into your lab (:-). One advantage of mugs and staplers is that they are compact, lightweight, etc. So, you might want to consider working with smaller plants and/or rocks instead.

On a vaguely related note, some years ago I took a blind friend to the San Diego Botanic Garden (Encinitas). We examined and discussed a number of plants, trying to find structural and morphological commonalities that would translate between sight and touch.

YiannosD · January 10, 2026, 3:05am

Once we need a real tree it may actually be fun to work outside haha! But yes plants might be easier to work with, thanks.

That’s an interesting story, sounds fun!

YiannosD · March 30, 2026, 1:03pm

@nleadholm Hi Niels,

Following up on your tree suggestion. I went ahead and built a prototype for unsupervised category learning using 10 tree species from Objaverse (birch, cedar, cypress, maple, oak, palm, pine, spruce, a generic tree, and willow).

The setup is MontyObjectRecognitionExperiment with do_eval: false. I used EvidenceGraphLM with a MeshEnvironment wrapping trimesh.

Some interesting findings along the way:

- max_match_distance is the dominant knob for collapse. The default of 0.01m (1cm) is way too tight for ~1m trees: the median nearest-neighbor distance in stored models is about 1.7cm, so query points almost never land within range and evidence stays negative. Bumping it to 0.03m was the sweet spot between over-splitting and collapsing everything into one model.
- x_percent_threshold and object_evidence_threshold barely matter in comparison. I ran a 12-point grid sweep and max_match_distance (mmd) dominated everything.
- InformedPolicy was a dead end for this task. It only does Look/Turn, so it captures roughly one hemisphere per episode. A tree seen from the front in epoch 0 looks completely different from the back in epoch 1, so cross-epoch recognition failed almost entirely. Switching to SurfacePolicy fixed this, and cross-epoch recognition went from nearly zero to 7/9 trees being re-recognized.
- Best result so far: 5 learned models for 9 tree species (willow ran out of steps). The two dominant models each absorbed 4-5 species. One grouped cedar/cypress/spruce/maple/birch, the other grouped palm/pine/tree_generic/maple. Oak/birch/cypress stayed as singletons. So we’re getting genuine category-like collapse, though not yet the conifer vs. deciduous split you hypothesized. This could be a geometry-based collapse (dense bushy canopy vs. tall/sparse trunk).
- desired_object_distance also matters a lot: the default 0.025m puts the camera basically inside the canopy, while 0.3m works for trees.
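The point about max_match_distance versus point spacing can be illustrated with a toy calculation. This is only a sketch: it assumes max_match_distance behaves like a search radius around each query location, and it uses a synthetic flat grid with 1.7cm spacing as a stand-in for a stored model. The helper names are hypothetical and nothing here calls Monty itself.

```python
import math
from statistics import median


def median_nn_distance(points):
    """Median distance from each point to its nearest other point (brute force)."""
    nearest = [
        min(math.dist(p, q) for j, q in enumerate(points) if j != i)
        for i, p in enumerate(points)
    ]
    return median(nearest)


def fraction_matched(queries, stored, max_match_distance):
    """Fraction of query points with at least one stored point within the radius."""
    hits = sum(
        1
        for q in queries
        if min(math.dist(q, s) for s in stored) <= max_match_distance
    )
    return hits / len(queries)


spacing = 0.017  # 1.7 cm, the median spacing reported above
# Synthetic "stored model": a flat 6x6 grid of points
stored = [(i * spacing, j * spacing, 0.0) for i in range(6) for j in range(6)]
# Worst-case queries: the centers of the grid cells
queries = [
    ((i + 0.5) * spacing, (j + 0.5) * spacing, 0.0)
    for i in range(5)
    for j in range(5)
]

print(median_nn_distance(stored))
print(fraction_matched(queries, stored, 0.01))  # 1 cm radius
print(fraction_matched(queries, stored, 0.03))  # 3 cm radius
```

In this toy setup the cell centers sit about 1.2cm from the nearest stored point, so none of them match at a 1cm radius while all of them match at 3cm. Running `median_nn_distance` on the points of an actual stored model would give a data-driven starting value for the radius.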

This was all with 2 epochs and 2000 steps as a debug run. I’m planning to do a longer run next (3+ epochs, 5000 steps) and re-run the parameter sweep now that SurfacePolicy is working. Before I design the next round of experiments though, do you have any suggestions for what to try, or specific things you’d want to see tested?
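For re-running the sweep, one simple way to enumerate the grid is sketched below. The candidate values and the 3 x 2 x 2 split are assumptions chosen only to give 12 points; the post does not say which values were swept. How each dictionary is applied to an experiment config is left out because it depends on the repository's config layout.

```python
from itertools import product

# Hypothetical candidate values for each hyperparameter
sweep = {
    "max_match_distance": [0.01, 0.03, 0.05],
    "x_percent_threshold": [10, 20],
    "object_evidence_threshold": [1.0, 5.0],
}


def build_grid(sweep):
    """Return one dict of parameter values per grid point."""
    names = list(sweep)
    return [
        dict(zip(names, values))
        for values in product(*(sweep[name] for name in names))
    ]


grid = build_grid(sweep)
for idx, params in enumerate(grid):
    # Placeholder: launch one experiment per parameter set here
    print(idx, params)
```

Logging the number of learned models per grid point makes it easy to see which parameter drives collapse.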

nleadholm · March 30, 2026, 1:44pm

Wow, that’s awesome to hear @YiannosD , thanks for making such great progress.

And really interesting results, some initial questions that come to mind:

- Are you able to share visualizations of the tree objects you are using?
- Similarly, can you share a visualization of the two main learned models you describe, plus the Oak/birch/cypress singletons? It would be particularly interesting to see what the grouped models look like.
- How many model points are you allowing for each object? Maybe if you are able to share a repository with your configs, then I can also check some details like this directly.
- One thing to keep in mind during your hyper-parameter sweep is that, in a full Monty system, we would want to have multiple LMs, and these could focus on learning models at different levels of detail/abstraction. As such, there won’t necessarily be one perfect configuration. Rather, we want e.g. one LM that learns conifer vs. deciduous (although that may not practically happen at the moment for various reasons), one that learns them all as separate models, etc.
- One thing you could consider is using something like a simple dendrogram (Dendrogram - Wikipedia), with mini-versions of the models visualized, to see how the breakdown happens as a function of different hyperparameters. You might find the below figure useful for this. In this case there would only be one level, based on what model it belongs to.

[Image: Screenshot 2026-03-30 at 2.43.21 PM]
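As a text-only starting point for the one-level breakdown described above, the sketch below groups species under the model they were assigned to. The groupings are the ones reported earlier in the thread for the debug run; the model IDs (`model_0` and so on) and the function name are made up for illustration, and the mini model visualizations are not included.

```python
def render_one_level(groups, root="trees"):
    """Render a one-level tree: the root, then one branch per learned model."""
    lines = [root]
    for model_id, species in groups.items():
        lines.append(f"+-- {model_id}: {', '.join(sorted(species))}")
    return "\n".join(lines)


# Species per learned model, as reported for the debug run (model IDs are hypothetical)
groups = {
    "model_0": ["cedar", "cypress", "spruce", "maple", "birch"],
    "model_1": ["palm", "pine", "tree_generic", "maple"],
    "model_2": ["oak"],
    "model_3": ["birch"],
    "model_4": ["cypress"],
}

print(render_one_level(groups))
```

Producing one such rendering per hyperparameter setting gives a quick side-by-side view of how the grouping changes.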

YiannosD · April 7, 2026, 1:57am

Hi @nleadholm, thanks for the response and sorry for replying late. That was just a debug run to make sure my mesh environment works, but I actually found much better models in the meantime (the Objaverse ones were more like toy examples). I put the models I used under assets (trees.tgz) at https://github.com/YiannosD/tbp.monty/releases/tag/trees-data-v1 (the code is also in that repo).

These are very realistic models from Sketchfab, so maybe they weren’t the best option, but it’s working out in an interesting way. I ran an experiment with 10 epochs and 5000 exploratory steps, and it ended up with 19 models for 10 trees, of which:

- 15 were singletons: birch, maple (x2), cypress, oak (x2), pine, spruce, generic (x2), willow (x2), cedar, palm
- 3 were groups: cedar + palm, birch + palm, pine + spruce
- 1 was a catch-all model (2000 pts) with all 10 species

Attached images below.

This was with mostly default configs. I used informed_5_goal0 for the motor system and will test goal1 next. There is a lot to explore here and I find the fact that a catch-all tree model is learned very interesting!

So yeah, these models might be too big/complex (the experiment took ~12 hours on my laptop), so I might play around with fewer models, but anything simpler I could find was too simplistic (gamified). One issue I ran into: for an episode there was no good view at the start because of the random rotation, so the experiment would just crash. I made it skip that episode for now, but other than that all was smooth. Let me know your thoughts!

sknudstrup · April 7, 2026, 3:21pm

Really cool project!

As for the get_good_view crash, I’d just comment out the line that raises an error when the episode starts without being on-object for now. It’s in embodied_data.py. Search for `raise RuntimeError("Primary target not visible at start of episode")`. We wanted to raise that error to make sure get-good-view is working for our purposes (typically YCB objects), but get-good-view can be kind of finicky when the objects are really “stringy”, and I’m not totally surprised that it’s having issues with a tree.

The other thing to check is just that you’re not too close to the tree at the start of the episode – like if it’s being randomly rotated into the agent, which will mess with the camera.

But if you do end up disabling that assertion, you know that you’re not too close to the tree, AND the episode finishes without ever ending up on-object, please report back. I’d be curious to find out more about what’s going on.

Cheers! -Scott

EDIT: To be more explicit, you can check whether an episode completes without ending up on-object by checking eval_stats.py. If the object was never found in a given episode, the LM’s primary performance should be “patch_off_object”.
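A small sketch of that check, assuming the per-episode stats are available as CSV text with a column named `primary_performance` (the file format and the column name are assumptions, and the sample rows below are made up). Only the value "patch_off_object" comes from the post above.

```python
import csv
import io


def episodes_never_on_object(csv_text, column="primary_performance"):
    """Return the row indices of episodes whose performance is 'patch_off_object'."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [
        idx
        for idx, row in enumerate(reader)
        if row.get(column) == "patch_off_object"
    ]


# Made-up sample standing in for a real stats file
sample = """primary_performance,target
no_match,oak
patch_off_object,willow
correct,pine
"""

print(episodes_never_on_object(sample))
```

For a real run, the file contents could be read with `open(path).read()` and passed in the same way, after confirming the actual column name in the logged output.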

YiannosD · April 7, 2026, 3:48pm

Hi Scott,

Thanks for the reply, I will test that!

nleadholm · April 8, 2026, 9:05am

That’s great, thanks for the update @YiannosD ! We have some paper submission deadlines this week, and then another in-person meetup next week, so I won’t get around to a proper response until the week of April 20th. In particular, I want to have a proper look through your code and configs to give some more detailed advice.

Just a few early comments in the meantime:

Thanks for sharing the meshes; I loaded those and have attached screenshots of the models in case anyone else is following and is interested in what these look like. Related to this:
- If you are able to visualize the learned models with the colors for each point, that would be helpful. E.g., do we see points corresponding to the trunk in the models?
- It looks like the generic model might be missing a texture element?
Re. hyperparameters
- As Scott mentioned - it’s worth checking the initial position of the agent at the start of the episode; since these objects are large, we may often be starting inside the tree.
- You mention the trees are ~1.0 meter, and have set max_size appropriately - I just wanted to sanity check this, as I’m surprised the tree models aren’t bigger if they are meant to reflect real-world dimensions?
- If you are using the surface agent during learning, you could set the number of steps even higher. That will help with building larger, more complete models.
- Re. learning the trunks, it might be helpful to have an episode where you either rotate the tree to be viewed by the agent from below, or where you translate the agent down, in order to increase the chance that we start somewhere on the trunks. From the models you’ve shared, it looks like the agent always spends most of its time on the foliage at the top.
- Once you’ve played around with the above, I’d be curious if you see any interesting / consistent changes in the clustering as you move from, say 1 –> 10 –> 20 –> 40 epochs of learning.
In terms of model clustering and the dataset
- Given what the tree objects look like, I would expect things like pine and spruce to be a single model - this has happened, which is encouraging, although interestingly the model doesn’t look like what I would expect. In particular, it seems to just be a blob of points - do you have a sense for the scale of this model vs. the actual objects?
- Similarly, a generic “all tree model” might not be a bad thing, but it would be easier to assess if we had a more diverse dataset. If you get a chance, it could be interesting to add something like 10 cars/trucks to the dataset. Then it will be easier to see whether we distinguish trees from vehicles with generic models, while also having more specific models within each class.

YiannosD · April 8, 2026, 2:44pm

Hi @nleadholm, thanks for the feedback! And no worries on timing, I’ll be able to have more for you by then!

Regarding the size, I wasn’t sure what to do because the raw models range from 5-47 units; they are 1 meter because I normalized them to a unit bounding box. I did that because I was having some issues when I made them bigger, but I’ve since figured that out, so I can normalize to a more realistic scale. This might also help with seeing the trunk more. On the generic tree: that’s odd, because it has texture and color, but for some reason alpha is low, which makes it transparent. It didn’t look like that when I downloaded it, so maybe I mixed something up. I’ll remove it for now!
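A minimal sketch of the normalization step, written in plain Python so it runs without a mesh library. It assumes "unit bounding box" means scaling uniformly so that the longest side of the axis-aligned bounding box equals the target size; the function names and the example vertices are hypothetical, and the sketch does not re-center the mesh.

```python
def bbox_extents(vertices):
    """Side lengths (x, y, z) of the axis-aligned bounding box."""
    xs, ys, zs = zip(*vertices)
    return (max(xs) - min(xs), max(ys) - min(ys), max(zs) - min(zs))


def scale_to_target(vertices, target_size=1.0):
    """Uniformly scale vertices so the longest bounding-box side equals target_size."""
    longest = max(bbox_extents(vertices))
    if longest == 0:
        raise ValueError("degenerate mesh: all vertices coincide")
    factor = target_size / longest
    scaled = [(x * factor, y * factor, z * factor) for x, y, z in vertices]
    return factor, scaled


# Made-up vertices for a mesh that is 47 units tall
raw = [(0.0, 0.0, 0.0), (10.0, 47.0, 12.0), (5.0, 20.0, 6.0)]
factor, scaled = scale_to_target(raw, target_size=1.0)

print(factor)
print(bbox_extents(scaled))
```

Changing `target_size` gives other scales with the same code. With trimesh, the same factor could be computed from `mesh.extents` and applied with `mesh.apply_scale(factor)`.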

I’ll incorporate all your feedback and update soon!

nleadholm · April 9, 2026, 7:53am

Ok great, sounds good.

Re. the tree sizes, yeah I think that’s fine if you normalize them to 1 meter; I just wanted to check if that was definitely the case. If you are going to resize them, you could even consider making them smaller (e.g. 10cm), so that they are similar in size to the YCB objects (bonsai?). That would make it easier to use the other parameters in the configs as is, such as those that impact the policy of the agent and the initialization of the environment. You can also then mix in YCB objects (e.g. all the box-like objects like sugar-box, potted-meat etc., or all the medium-sized spherical objects like peach, tennis ball etc.) if you want to explore clustering of those.

Rich_Morin · April 20, 2026, 10:50pm

To me, this idea begs the question of how much of Monty’s item recognition is based on perceived size. Is this part of the current LM, is it planned for a higher level LM, etc?

My suspicion is that this sort of input typically requires multiple sensors (e.g., eyes, fingers) in humans, so I’d expect to see it handled at a higher level. That said, I could imagine architectural hacks that would allow a single LM to integrate multiple sensations and use its reference frame to estimate the object’s size. (ducks)

nleadholm · April 27, 2026, 3:27pm

Hi @YiannosD I had a look through your branch and I don’t think I have anything more to add re. the configs at this point beyond my earlier comments.

The main other suggestion at this stage would be that it would be worth visualizing the policy during the learning, and in particular see the agent exploring the object, including both the view-finder view, as well as the sensor patch view. This is normally very informative about how the system is working, and if there are any strange things happening. You can read more here, but let me know if you run into any issues.

nleadholm · June 8, 2026, 8:48am

Hi @YiannosD , just thought I’d hear if you had worked on this any more? No worries if not.

Also @Rich_Morin I don’t think I ever responded to your question - scale invariance is indeed on our roadmap. In short, we believe it will be a mixture of model-free methods (initial estimate of scale) and model-based methods (priors from learning, combined with the ability to test alternative hypotheses) working together. You can read more about it here.