We have been closely following the research of the TBP team and are exploring ways to adapt TBP theory to the text domain. We are deeply grateful to the TBP team for the enlightening insights shared in our email discussions, particularly the idea that language and conceptual learning should be grounded in multimodal interactions with the physical world.
However, as Viviane mentioned in the email, applying TBP theory to the text domain presents significant challenges. Therefore, we would like to discuss whether there might be some âshortcutsâ to initially achieve a certain level of intelligence through text-only learningâa level already capable of being independently applied to downstream scenarios in the text domain (we will provide one example below) and delivering good practical results, even if some limitations remain to be addressed later. Then, by progressively incorporating multimodal inputs, this intelligence could be further enhanced to compensate for the shortcomings of pure text-based learning.
Example Scenario:
In our intelligent application scenario, a key text understanding task is to determine whether a given text description violates a pre-defined rule.
- Pre-defined rule: When the purchased item is mechanical equipment, the down payment shall not exceed 30% of the total amount.
- Text description:
The subject matter is a warehouse transport robot.
⌠(omitted part, which may vary in length)
Payment method: The first payment is due when the order is placed, accounting for 20% of the total. The second payment is due upon receipt of the goods, accounting for the remaining 80%. - Standard answer: This text description does not violate the pre-defined rule.
We believe the challenges in this case include the following:
- Recognizing that a âwarehouse transport robotâ belongs to the category of âmechanical equipment.â We consider this a conceptual hierarchical relationship, which is widely present in the text domain. Currently, we are unsure how to learn such conceptual relationshipsâwhether it requires stacking multiple learning modules, and what the inputs and outputs of each module should be.
- Understanding numerical comparisons and the meaning of negation adverbsâfor example, knowing that 20% is less than 30%, and recognizing that âdoes not exceed 30%â is equivalent to âis less than or equal to 30%.â
- Identifying that âfirst paymentâ and âdown paymentâ are synonymsâor more precisely, being able to recognize that in certain specific contexts (such as goods procurement), they may be synonymous, while in other contexts, they may not be.
- Integrating semantic information across certain distances in the text, bypassing intermediate descriptions. We are uncertain whether this can be achieved through motor strategies, including how to define the target position or span of the motor.
These challenges are common in our text application domain. We believe this case can serve as a basis for discussing and exploring how TBP theory might address these difficulties. We are particularly interested in whether there are feasible pathways to achieve intelligence through text-only learning initially, and how subsequent multimodal learning could incrementally optimize the text-based foundation.
We look forward to any thoughts, guidance, or potential directions the TBP team might share on these points. Thank you once again for your visionary research and for inspiring these meaningful discussions.