Confused about pose information and reference frames

Monty works with multiple different reference frames. If I have understood correctly, there are often three levels:

  • Frame of the agent itself
  • Frame of a specific sensor
  • Frame of a specific object

If I’m sitting stationary and letting my eyes saccade across an object 30 cm away, within the Monty framework my eye SM would transmit a sequence of (pose, features) to some LM. How should I understand this set of poses? In particular,

  • My eyes have a known pose in relation to my body. As my eyes move, their position stays the same (assuming I don’t rotate my head), but their rotation can change as they swivel in their sockets. Is this correct?
  • For each vision patch collected by my eyes, do we assume that we can somehow associate it with a precise pose relative to the object, but not relative to my eyes?
  • As my eyes rotate, the same position on the object will have a different pose relative to my eyes. How does Monty account for that?
  • Is object movement part of the Monty framework yet? Assuming a fixed sensor but a moving object, the agent does not readily have displacement information to account for the change in features.

Hi,

Good questions! Have you had a look at the "Reference Frame Transformations" page in our documentation yet? It walks through all the reference frames involved in Monty and their transformations. I think it should answer the first three of your questions (but let me know if something is still unclear).
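
For readers who want a concrete picture of the third question (the same point on the object having a different pose relative to a rotating eye), here is a minimal sketch of the general rigid-body idea. It is not Monty's code or API: the function names, the single yaw rotation about the z axis, and the example coordinates are all assumptions made for illustration.

```python
import math

def rotate_z(angle, v):
    """Rotate a 3D vector about the z axis by `angle` radians."""
    c, s = math.cos(angle), math.sin(angle)
    x, y, z = v
    return (c * x - s * y, s * x + c * y, z)

def world_to_sensor(point_world, sensor_position, sensor_yaw):
    """Express a world-frame point in the sensor's own frame (hypothetical helper)."""
    rel = tuple(p - t for p, t in zip(point_world, sensor_position))
    return rotate_z(-sensor_yaw, rel)

def sensor_to_world(point_sensor, sensor_position, sensor_yaw):
    """Map a sensor-frame point back into the world frame (hypothetical helper)."""
    rotated = rotate_z(sensor_yaw, point_sensor)
    return tuple(r + t for r, t in zip(rotated, sensor_position))

# Assumed setup: a stationary eye at the origin and a point on an object ~30 cm away.
point_on_object = (0.30, 0.05, 0.0)
eye_position = (0.0, 0.0, 0.0)

in_eye_frame = []
back_in_world = []
for yaw in (0.0, math.radians(10)):  # the eye swivels by 10 degrees
    p_eye = world_to_sensor(point_on_object, eye_position, yaw)
    in_eye_frame.append(p_eye)
    back_in_world.append(sensor_to_world(p_eye, eye_position, yaw))

print(in_eye_frame)   # differs between the two eye rotations
print(back_in_world)  # the same point (up to rounding) both times
```

The sensor-frame coordinates change as the eye rotates, but because the eye's rotation is known, both readings map back to the same location in the common frame.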

Regarding your last question: No, Monty cannot deal with moving objects yet. This is something on our roadmap for later this year.

If you are looking for more formal descriptions, you might also find the methods section here useful: [2507.04494] Thousand-Brains Systems: Sensorimotor Intelligence for Rapid, Robust Learning and Inference

Best wishes,

Viviane

Thanks, that cleared it up!

Another question: How does Monty handle translations of a sensor along the direction of the sensed surface normal? If the sensor is looking straight down onto the object and moving downwards, we would not expect the feature to move in the sensor's field of view. This is different from any other direction of movement, where a translation of the sensor should result in a corresponding translation of the feature relative to the sensor.

And a follow-up: In the example "Both sensors sensing different locations", if sensor S1 transmits its own pose and location hypotheses to S2, S2 will translate these location hypotheses using the displacement (S2 - S1). I see a problem with the component of the translation along the surface normal. A feature cannot move “into” the object, but will stay fixed along that direction.

Hi @AlmostUseful, thanks for your questions.

Re. your first question: in that case, the depth reading means that the 3D location of the sensed surface normal won’t actually change. As such, as far as Monty is concerned, there would be no displacement, and the learning module would stay on the same location in its object-centric reference frame. Does that help?
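
To make the point about the depth reading concrete, here is a minimal sketch under simplifying assumptions: a single central ray, a flat surface at height zero, and a sensor looking straight down. The function `sensed_location` and all the numbers are hypothetical and are not taken from Monty's code.

```python
def sensed_location(sensor_position, view_direction, depth):
    """3D location of the surface point hit by the sensor's central ray.

    Hypothetical helper: sensor position plus depth along the viewing direction.
    """
    return tuple(p + depth * d for p, d in zip(sensor_position, view_direction))

def displacement(loc_from, loc_to):
    """Displacement between two sensed 3D locations."""
    return tuple(b - a for a, b in zip(loc_from, loc_to))

view_direction = (0.0, 0.0, -1.0)  # looking straight down at the surface

# Sensor starts 0.30 m above the surface, then moves 0.10 m closer.
# The depth reading shrinks by the same amount.
loc_start = sensed_location((0.0, 0.0, 0.30), view_direction, depth=0.30)
loc_closer = sensed_location((0.0, 0.0, 0.20), view_direction, depth=0.20)

# For comparison: the sensor moves 0.05 m sideways at the original height.
loc_sideways = sensed_location((0.05, 0.0, 0.30), view_direction, depth=0.30)

print(displacement(loc_start, loc_closer))    # no change in sensed location
print(displacement(loc_start, loc_sideways))  # sensed location shifts sideways
```

Moving along the viewing direction changes the sensor position and the depth reading by equal and opposite amounts, so the sensed 3D location stays put, whereas a sideways move shifts it by the same amount as the sensor.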

Re. your follow-up question, I’m not sure I understand the issue you’re describing, although maybe the above point (i.e. that the sensed location is always in 3D) may help. If not, I would find it helpful if you can expand on the issue you’re trying to explore, and then maybe I can give some additional context.