I was looking into how to use Monty with audio. Reading the documentation, there was a requirement to have the input contain “x, y, z coordinates of the feature location relative to the body and three orthonormal vectors indicating its rotation”. In most species, audio and balance/space are processed in the same location (vestibular/inner ear). So, it seemed weird for an ear to identify the xyz coordinates…in isolation. Is this requirement only for visual inputs?
Good question. When we refer to the x, y, z coordinates of a feature relative to the body, we mean the position of the source of the sound.
The same applies to the visual system: the coordinates refer to the object being looked at.
Monty is inherently multimodal. For example, it might see a dog barking and record its x, y, z position relative to the body, while also hearing a sound coming from roughly the same location. The system can then learn that this particular sound is usually produced by that particular object, in this case a dog.
Sound itself doesn’t have an inherent concept of rotation, so you can simply set that to an arbitrary fixed value.
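To make that concrete, here is a minimal sketch in plain Python. It is not Monty's actual API: `AudioFeature`, `make_audio_feature`, and `FIXED_ROTATION` are hypothetical names used only for illustration. The idea is that a sound gets the location of its source relative to the body, and the rotation slot is filled with an arbitrary fixed value (here the identity frame).

```python
from dataclasses import dataclass
from typing import Tuple

Vec3 = Tuple[float, float, float]

# Arbitrary fixed rotation (an assumption for this sketch): the identity frame,
# i.e. three orthonormal vectors aligned with the body's x, y, z axes.
FIXED_ROTATION: Tuple[Vec3, Vec3, Vec3] = (
    (1.0, 0.0, 0.0),
    (0.0, 1.0, 0.0),
    (0.0, 0.0, 1.0),
)


@dataclass(frozen=True)
class AudioFeature:
    """Hypothetical container for one sensed sound (not a Monty class)."""

    location: Vec3                      # sound source position relative to the body
    rotation: Tuple[Vec3, Vec3, Vec3]   # three orthonormal vectors
    label: str                          # whatever feature the sensor extracted


def make_audio_feature(source_xyz: Vec3, label: str) -> AudioFeature:
    """Attach the fixed rotation to a sound heard at `source_xyz`."""
    return AudioFeature(location=source_xyz, rotation=FIXED_ROTATION, label=label)


# A bark heard about 2 m ahead and 0.5 m to the side of the body.
bark = make_audio_feature((2.0, 0.5, 0.0), "bark")
```

Because every sound carries the same rotation, only the location and the feature itself would distinguish one observation from another in this sketch.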
Yeah, I agree that it feels a bit awkward/shoehorned to presume some of this information for every possible input sensor. Curvature of smell or heat? A point normal for taste?
Sure, you can just set them all to some random fixed value, like have the point normal aimed at the sensor and a flat curvature. But then it starts to make the CMP feel less and less generic. Generic input processing seems like an important part of the promise of cortical column theory.
On the other hand, for audio, some sounds don’t radiate in a uniform sphere shape. Talking creates sound with a higher amplitude in one direction, for example. I think I could tell the difference between someone talking at me while facing me versus perpendicular to me. It depends on the acoustics of the environment, of course. Also, many sounds do seem to radiate mostly symmetrically (ignoring the environment).
Actually doing the calculation of orientation seems like an open question. Similar to vision, I can imagine deferring the question of how to detect the location/orientation of the audio source, and just presume it’s handled sub-cortically.
Yes, those are good points. It’s unclear if we can always expect to get an orientation input. Maybe just to clarify a few things: The general framing of the CMP is that it needs to contain a location and orientation in a common reference frame (e.g. relative to the body). However, the orientation doesn’t need to be defined by the surface normal and principal curvature directions. This just seemed like a good way to define it for our vision sensors. As you mentioned, for audio, this could be defined from the direction in which a sound source is pointing. I also agree that this could be handled subcortically, and in Monty, it’s the job of the sensor module to extract pose information from the raw sensory data.
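As a rough illustration of defining orientation from the direction a sound source is pointing, the sketch below builds three orthonormal vectors from a single facing direction. This is plain Python and an assumption about how one might do it, not how Monty's sensor modules work; `frame_from_direction` and the choice of a body-frame "up" reference are hypothetical.

```python
import math
from typing import Tuple

Vec3 = Tuple[float, float, float]


def dot(a: Vec3, b: Vec3) -> float:
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]


def cross(a: Vec3, b: Vec3) -> Vec3:
    return (
        a[1] * b[2] - a[2] * b[1],
        a[2] * b[0] - a[0] * b[2],
        a[0] * b[1] - a[1] * b[0],
    )


def normalize(v: Vec3) -> Vec3:
    length = math.sqrt(dot(v, v))
    if length == 0.0:
        raise ValueError("cannot normalize a zero-length vector")
    return (v[0] / length, v[1] / length, v[2] / length)


def frame_from_direction(
    direction: Vec3, up: Vec3 = (0.0, 0.0, 1.0)
) -> Tuple[Vec3, Vec3, Vec3]:
    """Return (forward, right, true_up): three orthonormal vectors.

    `direction` is where the source is facing, in the body's reference frame.
    `up` is only a reference used to fix the roll, which sound does not provide.
    """
    forward = normalize(direction)
    up = normalize(up)
    # If the source faces almost straight along `up`, use another reference axis.
    if abs(dot(forward, up)) > 0.999:
        up = (1.0, 0.0, 0.0)
    right = normalize(cross(up, forward))
    true_up = cross(forward, right)  # unit length, since forward and right are
    return forward, right, true_up


# Someone talking while facing along the body's x axis.
forward, right, true_up = frame_from_direction((1.0, 0.0, 0.0))
```

The roll around the facing direction is still arbitrary here, which mirrors the point above: only part of the orientation may be recoverable from a given modality.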
In the brain, we think the orientation might activate minicolumns in L4 of a cortical column. This is similar to how, when you probe in the early visual cortex, you find responses to differently oriented edges (here is a random image from a quick Google search that illustrates this).
We’ve been tossing around the idea of whether there is a more general way of defining this (the brain will just learn the most common and useful patterns to represent), but it seems like thinking of it as orientations for all modalities works reasonably well for now.