Discussion on Applying TBP Theory to Text Understanding: Challenges and Potential Pathways

We have been closely following the research of the TBP team and are exploring ways to adapt TBP theory to the text domain. We are deeply grateful to the TBP team for the enlightening insights shared in our email discussions, particularly the idea that language and conceptual learning should be grounded in multimodal interactions with the physical world.

However, as Viviane mentioned in the email, applying TBP theory to the text domain presents significant challenges. Therefore, we would like to discuss whether there might be some “shortcuts” to initially achieve a certain level of intelligence through text-only learning—a level already capable of being independently applied to downstream scenarios in the text domain (we will provide one example below) and delivering good practical results, even if some limitations remain to be addressed later. Then, by progressively incorporating multimodal inputs, this intelligence could be further enhanced to compensate for the shortcomings of pure text-based learning.

Example Scenario:
In our intelligent application scenario, a key text understanding task is to determine whether a given text description violates a pre-defined rule.

  • Pre-defined rule: When the purchased item is mechanical equipment, the down payment shall not exceed 30% of the total amount.
  • Text description:
    The subject matter is a warehouse transport robot.
    … (omitted part, which may vary in length)
    Payment method: The first payment is due when the order is placed, accounting for 20% of the total. The second payment is due upon receipt of the goods, accounting for the remaining 80%.
  • Standard answer: This text description does not violate the pre-defined rule.

We believe the challenges in this case include the following:

  1. Recognizing that a “warehouse transport robot” belongs to the category of “mechanical equipment.” We consider this a conceptual hierarchical relationship, which is widely present in the text domain. Currently, we are unsure how to learn such conceptual relationships—whether it requires stacking multiple learning modules, and what the inputs and outputs of each module should be.
  2. Understanding numerical comparisons and the meaning of negation adverbs—for example, knowing that 20% is less than 30%, and recognizing that “does not exceed 30%” is equivalent to “is less than or equal to 30%.”
  3. Identifying that “first payment” and “down payment” are synonyms—or more precisely, being able to recognize that in certain specific contexts (such as goods procurement), they may be synonymous, while in other contexts, they may not be.
  4. Integrating semantic information across certain distances in the text, bypassing intermediate descriptions. We are uncertain whether this can be achieved through motor strategies, including how to define the target position or span of the motor.
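
To make the four challenges concrete, the check can be written out by hand as a small Python sketch. Everything in it is an assumption made for illustration: the category table, the list of down-payment terms, the regular expression, and the function name `violates_rule` are all hypothetical and hand-coded, which is exactly the knowledge a learning system would have to acquire on its own. The omitted middle part of the description is left out.

```python
import re

# Hypothetical, hand-written knowledge; a learned system would have to acquire it.
CATEGORY_OF = {"warehouse transport robot": "mechanical equipment"}   # challenge 1
DOWN_PAYMENT_TERMS = ("down payment", "first payment")                # challenge 3
MAX_DOWN_PAYMENT_PERCENT = 30.0  # "shall not exceed 30%" means <= 30 (challenge 2)


def violates_rule(text: str) -> bool:
    """Return True if the description breaks the down-payment rule."""
    lowered = text.lower()

    # Challenge 1: does the text mention an item categorized as mechanical equipment?
    is_mechanical = any(
        item in lowered and category == "mechanical equipment"
        for item, category in CATEGORY_OF.items()
    )
    if not is_mechanical:
        return False  # the rule does not apply

    # Challenges 3 and 4: find a down-payment term anywhere in the text, then
    # the first percentage that follows it within the same sentence.
    for term in DOWN_PAYMENT_TERMS:
        match = re.search(re.escape(term) + r"[^.]*?(\d+(?:\.\d+)?)\s*%", lowered)
        if match:
            # Challenge 2: the numerical comparison itself.
            return float(match.group(1)) > MAX_DOWN_PAYMENT_PERCENT

    # No down payment found: treated as "no violation" in this sketch.
    return False


description = (
    "The subject matter is a warehouse transport robot. "
    "Payment method: The first payment is due when the order is placed, "
    "accounting for 20% of the total. The second payment is due upon receipt "
    "of the goods, accounting for the remaining 80%."
)
print(violates_rule(description))  # prints False
```

The sketch is brittle by design: each lookup table and pattern marks a place where the question is how a TBP-style system could learn the same relationship instead of being told it.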

These challenges are common in our text application domain. We believe this case can serve as a basis for discussing and exploring how TBP theory might address these difficulties. We are particularly interested in whether there are feasible pathways to achieve intelligence through text-only learning initially, and how subsequent multimodal learning could incrementally optimize the text-based foundation.

We look forward to any thoughts, guidance, or potential directions the TBP team might share on these points. Thank you once again for your visionary research and for inspiring these meaningful discussions.

(Note: I am not a TBP team member, just a third-party contributor)

Is this for experimental research, or for an end-user application?

So far, the project has mostly focused on visual recognition, whereas the main hurdle for language is associative learning, which is quite a different challenge altogether.

Imagine attempting to decipher Ancient Egyptian without any Rosetta Stone; that’s what “text-only learning” is. That idea’s already a shortcut itself, tackled semi-successfully by deep learning (and ever so unsuccessfully by symbolic AI).

Embodied learning is precisely about carving a sensorimotor “Rosetta Stone” to address the shortcomings of text-only learning and big data approaches, because language is ultimately grounded in sensory experience, as you acknowledged. It is very likely an embodied AI would have to undergo a training process similar to elementary school before reaching the capabilities you’re aiming for.

At present, Monty has no ability to ingest text data; its algorithms are all geometric and colorimetric. So, if you need a short-term solution for an end-user application, an LLM would provide results quicker. For experimental research, however, the road is wide open. This thread has a few details: Abstract Concept in Monty - #4 by vclay

Maybe the TBP research team will provide better guidance. I’m curious myself as to what they have to say.

Embodied cognition theory describes textual/symbolic (esp. linguistic) understanding as being deeply linked to sensorimotor (esp. biological) processes. Perhaps mirror neurons might serve as a concrete example. Sounds a bit ‘out there’, but of course language is manifested by brains, so…

Hi @Tongjian_AN ,

Sorry about the late reply to this, and thanks @AgentRev for the great response and analogy!

We haven’t thought much about text-only learning because this seems fundamentally different from how our brains learn. But I can try and give some of my personal thoughts on this, especially in the context of your specific problem.

Monty (or more specifically, the learning modules inside Monty) implements a general-purpose algorithm for learning structured models of things it is sensing. A learning module (LM) learns how features are arranged relative to each other. So, in the context of language, this could be models of letters composed of strokes, words composed of letters, or sentences composed of words.

When stacking LMs hierarchically, they can learn compositional models, that is, models that are composed of other models. This is an extremely powerful concept and capability. A lower LM may learn models of letters composed of strokes. The recognized letter would then be the input to a higher-level LM, which could learn models of words composed of letters. The higher-level LM doesn’t have to relearn how each letter looks exactly (i.e. its strokes); it just stores an arrangement of letter IDs, and if a letter appears multiple times, it simply references the lower-level model of that letter twice. This is super useful for efficiently learning the large variety of words composed of a limited number of letters.
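
As a rough illustration of this kind of reuse, here is a minimal Python sketch. It is not Monty's actual data structure: the `Model` class, the stroke names, and `learn_word` are hypothetical names invented for this example. The point it tries to show is only that the word-level model stores letter IDs and locations, while the stroke-level detail lives once in the letter models.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Model:
    """A model is an arrangement of child IDs at locations (hypothetical)."""
    name: str
    parts: tuple  # tuple of (child_id, location) pairs


# Lower level: letters composed of strokes (stroke names are illustrative only).
letter_models = {
    "h": Model("h", (("vertical", 0), ("arch", 1))),
    "e": Model("e", (("bar", 0), ("open_curve", 1))),
    "l": Model("l", (("vertical", 0),)),
    "o": Model("o", (("circle", 0),)),
}


def learn_word(word: str) -> Model:
    """Higher level: a word stores letter IDs and positions, not strokes."""
    unknown = [ch for ch in word if ch not in letter_models]
    if unknown:
        raise KeyError(f"no letter model for: {unknown}")
    return Model(word, tuple((ch, i) for i, ch in enumerate(word)))


hello = learn_word("hello")
# "l" appears at two locations but refers to the same single letter model.
l_locations = [loc for child, loc in hello.parts if child == "l"]
distinct_letters = {child for child, _ in hello.parts}
print(l_locations)            # prints [2, 3]
print(len(distinct_letters))  # prints 4
```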

When it comes to the next level up, you may learn models of sentences. But you wouldn’t learn a model of every possible sentence out there (there are simply too many!). Some higher-level models can be more general (e.g. learning general arrangements of types of words like nouns and verbs). Those models may reflect the grammar of the language you are learning and can be used to bias which words at the lower level to expect next.

Getting a bit more towards your concrete application, actually learning the meaning of words is trickier with a language-only setup. The data just isn’t there. All you can learn are statistical regularities. You can learn that in a similar context (i.e. higher-level model), two different words may occur (top-down prediction to the word modeling column), which could be a way Monty may learn synonyms. Similarly, classes of words (like “mechanical equipment”) could be learned based on lower-level word models appearing in the same context in the higher-level sentence model. However, I am having a hard time thinking of a good way to learn logical relationships (like “greater than”) or quantities (like 30%) from text data only. I guess with sufficient examples, it could learn the combinations 2>1, 3>2, … and that would work, but it isn’t really a solution where the system understands numbers or what it means for a number to be larger than another one.
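
The "same context, different word" idea can be sketched with plain co-occurrence statistics. The Python below is an assumption-laden toy, not how Monty works: the sentences are made up, a "context" is simply the sentence with one word replaced by a slot, and `sentence_frames` and `frame_similarity` are hypothetical helper names. It only shows that words filling the same frames come out as similar, with no notion of what they mean.

```python
from collections import defaultdict


def sentence_frames(sentences):
    """Map each word to the set of sentence frames it has been seen in."""
    frames = defaultdict(set)
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i, token in enumerate(tokens):
            # The frame is the sentence with this word replaced by a slot.
            frame = tuple(tokens[:i]) + ("_",) + tuple(tokens[i + 1:])
            frames[token].add(frame)
    return frames


def frame_similarity(a, b, frames):
    """Jaccard overlap of the frames two words occur in (0.0 to 1.0)."""
    fa, fb = frames.get(a, set()), frames.get(b, set())
    union = fa | fb
    return len(fa & fb) / len(union) if union else 0.0


# Made-up training sentences for illustration.
corpus = [
    "the deposit is due on order",
    "the prepayment is due on order",
    "the deposit is refundable",
    "the prepayment is refundable",
    "the robot is heavy",
]
frames = sentence_frames(corpus)
print(frame_similarity("deposit", "prepayment", frames))  # prints 1.0
print(frame_similarity("deposit", "robot", frames))       # prints 0.0
```

Note what the sketch cannot do: it can group "deposit" with "prepayment", but nothing in it relates either word to money, or "20%" to a quantity smaller than "30%".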

One very speculative idea for tackling this type of task could be to have “logical/conceptual” columns that receive some kind of formal or conceptual representation of the text. Again, this is a very unformed idea, and I would have to think through it much more. But, for example, when presenting the number “3” to an LM (the language one), another LM (the logic one) could get a more visual representation of 3 dots. When the language LM receives “>”, the logic LM could receive two number representations next to each other, where the one on the left is larger than the one on the right. Or it could learn a behavior of subtracting the right number from the left and checking whether the result is positive. The two LMs could then learn associative connections (what we call lateral voting) between their models.
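
One way to picture the dot representation is the toy Python sketch below. It is only an illustration of the speculative idea, under the assumption that comparison is done by pairing off dots until one side runs out; `as_dots`, `left_is_larger`, and the `associations` table are hypothetical names, and the table is a crude stand-in for learned associative connections rather than an implementation of voting.

```python
def as_dots(n: int) -> list:
    """Non-symbolic representation of a quantity: n dots."""
    return ["•"] * n


def left_is_larger(left: list, right: list) -> bool:
    """Pair off dots one by one; left > right if dots remain on the left."""
    left, right = list(left), list(right)  # work on copies
    while left and right:
        left.pop()
        right.pop()
    return bool(left)


# Stand-in for learned associations between the language model of a numeral
# and the logic model of its quantity (hand-written here).
associations = {str(n): as_dots(n) for n in range(0, 101)}

# "20% does not exceed 30%" becomes: 20 is NOT larger than 30.
print(left_is_larger(associations["20"], associations["30"]))  # prints False
print(left_is_larger(associations["30"], associations["20"]))  # prints True
```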

These ideas are not very fleshed out at all but what it hints at is that this would then not be a text-only system anymore, but it wouldn’t require some other perceptual input for learning the meaning of words. As I mentioned before, I don’t think you can extract any real meaning of words, beyond statistical correlations with other words, without having non-language models to associate them with. If language is not grounded in models that were learned through experiencing what the language is describing, I don’t see a way that it could represent this information.

I hope this makes sense and isn’t too disappointing.

For what it’s worth, I do think it would be exciting to explore language modeling in Monty. I don’t think it would require any modification to the LM for Monty to ingest text data. For instance, you could have a sensor module that sends letter IDs as features to the LM as well as their location in the string of text. Or you could take a lower-level approach and have a vision LM extract strokes and learn letters first. Or you could take LLM-like token representations and send those as features to learning modules. Movement would basically be in 1D space (backwards and forwards in the text). But Monty could learn complex, compositional, and nested models of the language. The main issue I mentioned is about associating the learned language models with meaning and understanding what the words refer to, to solve tasks intelligently.
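
To give a feel for the first option (letter IDs plus their location in the string, with movement in 1D), here is a small Python sketch. It does not use or imitate Monty's real sensor module interface; `TextSensor`, `Observation`, `sense`, and `move` are hypothetical names, and using the Unicode code point as the feature ID is just one convenient assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Observation:
    feature_id: int  # here: the Unicode code point of the character
    location: int    # position in the 1D text space (string index)


class TextSensor:
    """Hypothetical 1D sensor that reads one character at a time."""

    def __init__(self, text: str, start: int = 0):
        if not text:
            raise ValueError("text must not be empty")
        self.text = text
        self.position = max(0, min(len(text) - 1, start))

    def sense(self) -> Observation:
        """Report the feature at the current location."""
        return Observation(ord(self.text[self.position]), self.position)

    def move(self, step: int) -> Observation:
        """Move forwards (step > 0) or backwards (step < 0), clamped to the text."""
        self.position = max(0, min(len(self.text) - 1, self.position + step))
        return self.sense()


sensor = TextSensor("30%")
first = sensor.sense()    # the character "3" at location 0
last = sensor.move(2)     # jump forward to "%" at location 2
back = sensor.move(-10)   # clamped to the start of the text
print(chr(first.feature_id), chr(last.feature_id), back.location)  # prints 3 % 0
```

A learning module would then receive a stream of (feature, location) pairs, and the `step` passed to `move` plays the role of the motor command, which is where the question about target position and span from the original post comes in.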

Best wishes,

Viviane

And it’s not just words either. It’s also punctuation, whitespaces, line breaks, etc. E.g. we perceive a whitespace as a blank space, “nothing”, whereas for a program, no matter MI or otherwise, it’s just another symbol with another ID. It gets so much worse the more you think about it. Some of these can be dirty-hacked, but this might introduce unaccounted side effects in the future.

Arithmetic is abstract Geometry. Multiplication tables show the results of 5x5 as a group of 25 squares on a graph that are countable. Long division and the other types of simple math can be done with geometry as well. Math began with counting rocks in a bowl, became geometry and built a couple of pyramids, and then became arithmetic. Then it went way abstract. Math begins with the usual processing of objects in a 2D space, and the space and objects become more abstract over time as the idea evolves while being exchanged between brains. That optimizes for speed. Math as complex as Calculus requires a spoken language that is equally complex. I’m still a little stunned every time I think about that. Calculus started as rocks in a bowl. Sensorimotor learning about literal objects in space is the basis for everything we know, including language, and that still blows my mind.

I mentioned mirror neurons as more of a metaphor than a model. Perhaps LMs could at least ‘copy’ meaning from other, entirely external, LMs. Expert Systems are an example. Horizontal Gene Transfer in bacteria is another (if one squints enough to see DNA as language). Weighing the veracity or usefulness of such knowledge likely would require non-language models though. Which is analogous to where we are with the current LLM craze.

Although the topic is Text Understanding, I’d prefer to reframe it as Language Understanding. Most humans learn to associate spoken words with meanings first, only later associating them with their written forms. There is also the carry-over habit of subvocalization:

Subvocalization, or silent speech, is the internal speech typically made when reading; it provides the sound of the word as it is read. This is a natural process when reading, and it helps the mind to access meanings to comprehend and remember what is read, potentially reducing cognitive load.

So, it might be interesting to train Monty to understand (or at least recognize :-) spoken words first, before trying to teach it to read. Also note that Monty’s modules are implemented in computer code, so they can ingest encoded characters (e.g., ASCII, Unicode) with little effort.

Although getting Monty to understand language is a heavy lift, some pragmatic shortcuts might be possible. For example, Monty could be given textual hints (i.e., tags) by the experimenter or perhaps by a parallel LLM. Imagine that Monty is examining a screwdriver, turning it around to see various sides. Telling it various names for the object wouldn’t increase its capabilities, at least in the short run, but it could let Monty report its findings in a manner that humans or LLMs could use…