Amazing to see the MuJoCo demo! I want to test Monty with our robots, so I vibe-coded a RobotDistAgent and RobotSurfaceAgent that use inverse kinematics to translate the locations chosen by their respective agent’s actions to joints in the robotic arm. Still WIP, but happy to clean it up a bit and create a PR against the original branch if this could be useful for anyone else.
I merged main, basically reverting embodied_data.py and updating the configs. This is the branch: sergiovalmac/tbp.monty at sergiov/jeremyshoemaker/add-mujoco-agent-movement.
It requires installing robot-descriptions:
conda install -c conda-forge robot_descriptions
pip install mujoco trimesh robot_descriptions
Then, you can just run:
python run.py experiment=demo_robot_dist_agent.yaml
Still trying to run a successful experiment with the robot agent. The LM seems to be doing something (until it eventually crashes). How do I know whether it is doing what is expected? How should I read these panels? What are the key figures I should pay attention to in order to know the model is converging? And which figures should I use for debugging, i.e. which ones hint at potential causes?
Thanks!
Hey Sergio, sounds like some promising progress. In terms of debugging, below are some pointers.
- You mentioned on the general chat that Monty seems to be running, but it’s not clear if it’s actually learning anything. One of the most useful things to do is to visualize the learned models, i.e. the “point cloud” of the model, after learning is finished. You can read more about doing this here.
- To clarify, during supervised learning, there isn’t really any concept of “converging” in terms of the representations - Monty will normally explore the object for the number of steps and rotations specified in the configs, and update its model accordingly. After this has finished, as mentioned above, the best thing to do is generally to visualize the actual learned models.
- In terms of metrics that get logged to Wandb during inference, the CSV table that you have identified (shown in the last image) is often a great resource, given it shows a per-episode breakdown. You can see things like what object Monty thought it was looking at, and how many steps it took to converge. The key metrics are:
  - primary_performance: we want either correct or correct_mlh (i.e. the most-likely hypothesis was correct, even if it didn’t terminate with high confidence).
  - num_steps: how many steps the LMs processed sensory observations before they reached a state of high confidence.
  - primary_target_object: the ground truth target object label.
- In Wandb, relevant summary metrics of the above include:
  - overall/percent_correct
  - overall/avg_num_monty_matching_steps
  - overall/percent_used_mlh_after_timeout (captures how many of the episodes terminated in a lower confidence state)
- Once you’ve dabbled with the above, it would be worth checking out tbp.plot. It will be some extra work to get set up, but it’s a really powerful way to visualize what is happening during learning and inference. We’re always looking to improve tbp.plot, so if you run into any issues or have some suggested changes, feel free to let us know.
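As a rough illustration of reading the per-episode table described above, here is a minimal sketch that summarizes the three columns named in this post (primary_performance, num_steps, primary_target_object). The sample rows and the `summarize` helper are made up for illustration; they are not Monty output or Monty API, and the real CSV contains more columns.

```python
import csv
import io

# Made-up rows for illustration only; a real run writes one row per episode.
SAMPLE_CSV = """primary_target_object,primary_performance,num_steps
mug,correct,42
spoon,correct_mlh,500
mug,no_match,37
spoon,correct,58
"""


def summarize(csv_text):
    """Hypothetical helper: summarize a per-episode stats table."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    n = len(rows)
    # Both "correct" and "correct_mlh" count as a correct episode.
    n_correct = sum(
        r["primary_performance"] in ("correct", "correct_mlh") for r in rows
    )
    # Episodes that were right only via the most-likely hypothesis.
    n_mlh = sum(r["primary_performance"] == "correct_mlh" for r in rows)
    mean_steps = sum(int(r["num_steps"]) for r in rows) / n
    return {
        "percent_correct": 100.0 * n_correct / n,
        "percent_correct_mlh": 100.0 * n_mlh / n,
        "mean_num_steps": mean_steps,
    }


summary = summarize(SAMPLE_CSV)
print(summary)
```

To use it on a real run, the text of the CSV written by the experiment would be passed in place of SAMPLE_CSV.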
Hope that helps, happy to elaborate on any of the above.
Thank you very much!
I think there was a bug in the wandb logger that caused episodes after the first epoch not to be logged.
I edited BasicWandbChartStatsHandler as follows:
def report_episode(
    self,
    data,
    output_dir,
    episode,
    mode: ExperimentMode = ExperimentMode.TRAIN,
    **kwargs,
):
    basic_logs = data["BASIC"]
    mode_key = f"{mode}_overall_stats"
    stats = basic_logs.get(mode_key, {})
    payload = dict(stats[episode])  # new: create dict
    payload["episode"] = episode  # new: add episode
    payload["mode"] = str(mode)  # new: add mode
    wandb.log(payload, commit=False)  # remove `step=episode`
and this line in DetailedWandbTableStatsHandler / report_episode:
wandb.log(stats[episode]) # remove `step=episode`
It stores all episodes now, and I can see it is indeed learning something in the metrics you mentioned.
I will explore the visualizations you mentioned.
Thanks again!
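One plausible reading of why removing `step=episode` helps: wandb expects the step to be non-decreasing, so if the episode counter passed as the step starts over, the later calls carry a step lower than the current one and their data is ignored. The sketch below is a toy model of that rule in plain Python, not wandb itself. `kept_steps` and `build_payload` are hypothetical helpers, and the assumption that the episode index restarts is mine, not something confirmed in this thread.

```python
def kept_steps(steps):
    """Toy model: accept a log call only if its step is not below the current step."""
    kept = []
    current = -1
    for step in steps:
        if step < current:
            continue  # a lower step than the current one is dropped
        current = step
        kept.append(step)  # an equal step is accepted (merged into that step)
    return kept


def build_payload(stats, episode, mode):
    """Mirror the edited handler: carry episode and mode as ordinary fields."""
    payload = dict(stats[episode])
    payload["episode"] = episode
    payload["mode"] = str(mode)
    return payload


# Assumed scenario: three episodes, and the counter restarts once.
episodes = [0, 1, 2, 0, 1, 2]

# Passing the episode as an explicit step loses most of the second pass.
explicit = kept_steps(episodes)

# Without an explicit step every call is kept, and the episode number
# travels inside the payload instead.
stats = {0: {"num_steps": 42}, 1: {"num_steps": 58}, 2: {"num_steps": 37}}
payloads = [build_payload(stats, ep, "train") for ep in episodes]

print(explicit)
print(len(payloads), payloads[3])
```

With the episode stored as a field, it can still be used as a chart axis in the Wandb UI while the internal step simply counts log calls.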
I added a few more changes to the wandb logger.
Extend close in WandbWrapper to call close for each handler:
def close(self):
    for handler in self.wandb_handlers:
        try:
            handler.close()
        except Exception:  # noqa: BLE001
            pass
    self.wandb_logger.finish()
Implement close in BasicWandbTableStatsHandler:
def close(self):
    for stats_table in ("train_stats_table", "eval_stats_table"):
        df = getattr(self, stats_table, None)
        if df is None or len(df) == 0:
            continue
        wandb.log({stats_table: wandb.Table(dataframe=df)}, commit=True)
And do not log in BasicWandbTableStatsHandler / report_episode:
# Comment these two lines.
# table = wandb.Table(dataframe=getattr(self, stats_table))
# wandb.log({stats_table: table}, commit=False)
Also, remove step in DetailedWandbHandler:
@override
def report_episode(
    self,
    data,
    output_dir,
    episode,
    mode: ExperimentMode = ExperimentMode.TRAIN,
    **kwargs,
):
    # mode is ignored when reporting this episode
    detailed_stats = data["DETAILED"]
    frames_per_sm = self.get_episode_frames(detailed_stats[episode])
    for sm, frames in frames_per_sm.items():
        wandb.log(
            {
                f"episode_{episode}_{self.report_key}_{sm}": wandb.Video(
                    frames, format="gif"
                )
            },
            # New: Remove step=episode
            commit=False,
        )
Sounds good. Out of interest, what config are you using for these experiments (e.g. the objects, the agent, how many rotations, etc.)?
Also just to clarify further, logging is currently quite limited at learning, and related to my comment about no “convergence” in supervised learning, there isn’t a meaningful measure for accuracy (or loss) as learning takes place. Since Monty can learn good representations from a single epoch of training in supervised learning, we have always just run learning and then visualized models and performed inference separately.
This would be different for unsupervised learning, where we would expect multiple epochs to be necessary to observe reasonable representations, but work on this is still in its infancy. In that case, we would want an explicit part of the training loop where we perform inference without any learning/weight updates. Alternatively, using a “prediction error” metric might enable something close to a loss during learning.
I am using this file randrot_noise_10distinctobj_robot_dist_agent from this branch of my fork.
It makes sense. I still need to learn more about the LMs to think in Monty terms. I will setup the visualizations you mentioned asap.
Thanks for sharing all these updates! All of the debugging notes Niels already mentioned are great places to start. I just wanted to clarify a few things so we could give a bit more detailed advice.
- What kind of experimental setup are you trying to achieve? Do you want to do unsupervised learning?
Based on the config you shared, it looks like you are doing unsupervised learning for 3 epochs. Since there is only one target position and rotation specified in the object_init_sampler, the 10 objects would be shown in the same location and orientation in each of the three epochs. That means the desired result would be that on the first 10 episodes we detect no_match and add a new model to memory. On the next 20 episodes Monty should detect the objects correctly (although since you are not providing labels, the models in memory would just be called new_object0, new_object1,…).
Interesting additional metrics to look at would be mean_objects_per_graph and mean_graphs_per_object. Wandb logging wasn’t really set up for unsupervised learning and for our benchmark reporting we usually just look at the .csv file and use the print_unsupervised_stats function (here). For more details on how unsupervised learning works in Monty you might find this tutorial useful: Unsupervised Continual Learning
- Would you be open to doing supervised learning?
Currently, Monty’s performance isn’t great on unsupervised learning (benchmark results here). The main reason is the “startup problem”, where it is hard to find good parameters in Monty that make it create new models in the beginning of learning but then still work well for re-recognizing the same objects in new orientations later in learning, mainly because our criterion for no_match is quite strict. This isn’t a fundamental issue, but we haven’t gotten around to implementing a fix to address this. For our experiments we usually run supervised pretraining and then run a separate inference experiment (pretraining tutorial and inference tutorial). Would this be an option for you?
- Could you show what the inputs to the SM look like (for example, a screenshot taken with show_sensor_output: true set in the config)?
It looks like your config sets the zoom level of the patch camera to 1.0, while we usually set it to 10 to make sure the SM sees a small patch instead of the whole object. Maybe in your setup it is necessary to have this zoom factor smaller as the robot arm is closer to the object but I would just want to double-check.
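To make the expected outcome of the three-epoch unsupervised run described above concrete, here is a minimal sketch that enumerates what each episode should ideally report: no_match (and a new model) for every object in the first epoch, then a correct detection in later epochs. `expected_outcomes` is a hypothetical helper for illustration, not part of Monty, and it assumes the objects are presented in the same order in every epoch.

```python
def expected_outcomes(n_objects=10, n_epochs=3):
    """Hypothetical helper: ideal per-episode results without labels."""
    outcomes = []
    for epoch in range(n_epochs):
        for obj in range(n_objects):
            # Without labels, learned models get generic names.
            model_name = f"new_object{obj}"
            if epoch == 0:
                # First encounter: nothing in memory matches, so a model is added.
                outcomes.append(("no_match", model_name))
            else:
                # Same pose as before: the stored model should be recognized.
                outcomes.append(("correct", model_name))
    return outcomes


outcomes = expected_outcomes()
n_no_match = sum(result == "no_match" for result, _ in outcomes)
n_correct = sum(result == "correct" for result, _ in outcomes)
print(len(outcomes), n_no_match, n_correct)
```

Comparing the primary_performance column of a real run against this ideal sequence is one way to spot where the run diverges.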
Looking forward to hearing more about your experiments!
Best wishes,
Viviane
Thank you very much for the detailed answer!
- I didn’t realize this one was unsupervised. Thanks for the clarification. I didn’t understand why it was recognizing them as new objects instead of their labels. That makes sense. I need to follow the tutorial to see how to add labels for this experiment.
- I am happy to start with supervised learning. (It took time to set the parameters that prevented not visible object exceptions at the beginning of the episode for the previous unsupervised case, so if this is easier for supervised, much better). I will start with the pretraining and inference tutorials and adjust the configuration for the robot.
- In this screenshot you can see the configuration that allowed me to run 9 objects (I removed the spoon at some point and didn’t put it back) with no crash.
Thanks again,
Sergio
Ah okay, thanks for clarifying those!
re. the training part, I would recommend starting off with the supervised_pre_training_base config (tbp.monty/src/tbp/monty/conf/experiment/supervised_pre_training_base.yaml at main in thousandbrainsproject/tbp.monty on GitHub).
re. the screenshot: I would recommend using a larger zoom factor for both. The view finder should not have this much black space around it (usually our GetGoodView positioning procedure makes sure to move to a reasonable distance for each object and makes sure that the patch is on the object). Did you disable this? The patch should definitely be more zoomed in. It should not contain the entire object. It normally has a size similar to the 5 patches shown here on the mug:
Here is another image of the rough size relationship between patch to object that we normally use:
I hope this helps!
- Viviane
Thanks! These are the models learned during supervised pretraining with the new config (starting from supervised_pre_training_base as you suggested). It seems it is learning now!
Though comparing with your image, I guess the patch should be a bit larger, shouldn’t it?
Oh amazing, that looks beautiful!
I think the patch size is okay.
It isn’t super sensitive to the patch size; the main thing is just that it shouldn’t cover the whole object. Most of the SM’s feature extraction just looks at the central pixel of the patch anyway. The biggest effect is on the curvature and point normal calculation since those look more at the whole patch. So if the patch includes a large area, it calculates the average curvature over the whole area (which becomes a bit strange for a whole mug). I think your patch looks reasonably sized now. I’m curious how well inference will work now in your setup!
Very nice! Looks like it is indeed learning decent models. Just adding to what Viviane wrote, you could consider making the patch maybe 1.3x or 1.5x larger (by adjusting the zoom factor, rather than the resolution), but that already looks pretty good. In practice/the longer term, we would expect to have different LMs with different patch sizes - some very small, some slightly larger. These different LMs would be sensitive to different scales/coarseness of local features like curvature.
Are you able to record a video of the robotic arm moving (the third-person view on the left)? It would be pretty cool to see Monty concretely embodied in simulation.
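As a back-of-the-envelope sketch of the zoom adjustment mentioned above: assuming the area covered by the patch scales inversely with the zoom factor (an assumption about the camera model, not something verified against the simulator here), a patch that is k times larger corresponds to dividing the zoom by k. `zoom_for_larger_patch` is a hypothetical helper, and the starting zoom of 10 is the value mentioned earlier in the thread as the usual setting.

```python
def zoom_for_larger_patch(current_zoom, scale):
    """Hypothetical helper: zoom needed for a patch `scale` times larger.

    Assumes the patch extent is inversely proportional to the zoom factor.
    """
    if current_zoom <= 0 or scale <= 0:
        raise ValueError("zoom and scale must be positive")
    return current_zoom / scale


for scale in (1.3, 1.5):
    new_zoom = zoom_for_larger_patch(10.0, scale)
    print(f"{scale}x larger patch -> zoom of about {new_zoom:.2f}")
```

Under that assumption, a zoom of 10 would become roughly 7.7 for a 1.3x larger patch and roughly 6.7 for a 1.5x larger patch, with the resolution left unchanged.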
Thank you very much for the detailed explanations, really helpful at this stage!
Inference with the previously learned models (patch not increased yet) seemed to work too (this is the config):
That’s great! In case you haven’t seen it yet, there are also summary stats logged to wandb. The easiest way to find them is to go to “Runs” and then look for the columns starting with “Overall/…” Those will, for example, tell you the overall accuracy (“Overall/percent_correct”), rotation error, number of steps, etc.
Just to add a bit more context on the supervised vs. unsupervised learning and how to set this in configs:
Basically, the way that Monty can get labels during learning is by setting the experiment class to MontySupervisedObjectPretrainingExperiment (instead of MontyObjectRecognitionExperiment, which would be used for unsupervised training or evaluation). Our current way of configuring Monty doesn’t allow for switching the experiment class during an experiment, so we split supervised training and inference into two experiments. One where we just train with the MontySupervisedObjectPretrainingExperiment and another where we just evaluate with MontyObjectRecognitionExperiment (just like you did now). When we do unsupervised learning, we can do both in one run because both learning and evaluation use the same experiment class (in that case, learning and evaluation are almost identical, except that during evaluation, we don’t update LM memory). I know this is a bit unclear when just looking at configs.












