[Paper Review] Unsupervised Grounding of Plannable First-Order Logic Representation from Images
This paper proposes the First-Order State AutoEncoder (FOSAE), an unsupervised neural network that learns interpretable first-order logic predicates from image-based object features. By jointly encoding object features and discovering reusable relational patterns, FOSAE produces compact, symbolic representations compatible with classical planning, which the paper demonstrates on 8-Puzzle and a photo-realistic Blocksworld environment.
Recently, there has been increasing interest in the Reinforcement Learning community in obtaining the relational structures of the environment. However, the resulting "relations" are not discrete, logical predicates compatible with symbolic reasoning such as classical planning or goal recognition. Meanwhile, Latplan (Asai and Fukunaga 2018) bridged the gap between deep-learning perceptual systems and symbolic classical planners. One key component of the system is a neural network called the State AutoEncoder (SAE), which encodes an image-based input into a propositional representation compatible with classical planning. To get the best of both worlds, we propose the First-Order State AutoEncoder, an unsupervised architecture for grounding first-order logic predicates and facts. Each predicate models a relationship between objects by taking interpretable arguments and returning a propositional value. In experiments using 8-Puzzle and a photo-realistic Blocksworld environment, we show that (1) the resulting predicates capture interpretable relations (e.g., spatial ones), (2) they help obtain a compact, abstract model of the environment, and (3) the resulting model is compatible with symbolic classical planning.
Motivation & Objective
- To bridge the gap between neural perception and symbolic reasoning by grounding first-order logic from visual inputs.
- To address the limitations of propositional representations in classical planning by enabling relational, object-argument-based symbolic abstraction.
- To develop an unsupervised method that discovers interpretable, reusable predicates without human-annotated relations or reward signals.
- To ensure the learned representation is compact, generalizable, and directly usable in PDDL-based classical planning systems.
- To enable end-to-end symbolic reasoning from raw visual observations via a differentiable, attention-based architecture.
Proposed method
- FOSAE uses a neural autoencoder architecture that processes object feature vectors (from image patches and bounding boxes) to reconstruct the input state.
- It employs an attention mechanism to identify relevant object pairs or tuples for each predicate, enabling dynamic argument selection across different observations.
- The model shares weights across multiple object tuples, enforcing generalization and reducing parameter count by learning common relational patterns.
- Predicates are learned in an unsupervised manner via reconstruction loss, with no supervision on predicate symbols or human-annotated relations.
- The architecture supports variable predicate arities and learns grounded, anonymous predicate symbols that can be interpreted from argument instantiation patterns.
- The output is a set of first-order logic facts (predicates with object arguments) that are compatible with PDDL planning systems.
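The method bullets above can be made concrete with a small, non-authoritative sketch. It assumes each object is a plain feature vector, uses an ordinary softmax for argument selection and a hard threshold for the predicate output (stand-ins chosen for readability, not necessarily the paper's exact activations), and all names (`attend`, `predicate`, `ground_facts`) and weights are hypothetical, hand-set values rather than learned parameters.

```python
import math
from typing import List

Vector = List[float]


def softmax(scores: Vector) -> Vector:
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def attend(objects: List[Vector], query: Vector) -> Vector:
    """Soft-select one argument: attention-weighted average of the object vectors."""
    weights = softmax([dot(query, o) for o in objects])
    dim = len(objects[0])
    return [sum(w * o[i] for w, o in zip(weights, objects)) for i in range(dim)]


def predicate(args: List[Vector], weights: Vector, bias: float) -> int:
    """One predicate: linear score over the concatenated arguments, thresholded to 0/1."""
    flat = [x for arg in args for x in arg]
    return 1 if dot(weights, flat) + bias > 0 else 0


def ground_facts(objects: List[Vector],
                 queries_per_unit: List[List[Vector]],
                 pred_weights: Vector,
                 pred_bias: float) -> List[int]:
    """Apply the same predicate weights to the argument tuple picked by each attention unit."""
    facts = []
    for queries in queries_per_unit:
        args = [attend(objects, q) for q in queries]
        facts.append(predicate(args, pred_weights, pred_bias))
    return facts


# Three objects described by (x, y) position only (an assumption for illustration).
objects = [[0.0, 0.0], [1.0, 0.0], [2.0, 0.0]]
pick_leftmost = [-10.0, 0.0]   # hypothetical query that favours small x
pick_rightmost = [10.0, 0.0]   # hypothetical query that favours large x

# Hand-set "left-of(a, b)" weights: fires when b.x - a.x > 0.
left_of_weights = [-1.0, 0.0, 1.0, 0.0]

facts = ground_facts(
    objects,
    [[pick_leftmost, pick_rightmost], [pick_rightmost, pick_leftmost]],
    left_of_weights,
    0.0,
)
print(facts)  # [1, 0]
```

Because `predicate` receives the same weights for every attention unit, the sketch mirrors the weight-sharing idea: one relational pattern is evaluated on different object tuples instead of learning a separate detector per tuple.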
Experimental results
Research questions
- RQ1: Can an unsupervised neural network learn interpretable first-order logic predicates directly from visual object features?
- RQ2: How well do the discovered predicates generalize across different object configurations and environments?
- RQ3: Can the resulting symbolic representation be used effectively for classical planning in visually grounded domains?
- RQ4: To what extent does the model’s architecture promote compactness and reusability of relational patterns?
- RQ5: How does the attention-based argument selection mechanism contribute to the interpretability and generalization of the learned predicates?
Key findings
- FOSAE successfully learned interpretable spatial and relational predicates from visual inputs, as evidenced by human interpretation of argument instantiation patterns.
- The model achieved accurate reconstruction of input states, with visual examples showing close alignment between ground truth and reconstructed images.
- In the 8-Puzzle domain, FOSAE learned a compact, generalizable representation that supported correct planning across multiple test instances.
- For the photo-realistic Blocksworld environment, FOSAE generated a PDDL-compatible model that enabled correct planning for 30 randomly generated instances with 3 blocks.
- The system demonstrated scalability to 4-block environments, with successful planning results reported, though 5-block planning was precluded by memory limits.
- The resulting symbolic representation was verified as compatible with classical planners, with plans manually confirmed as correct.
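To illustrate what "PDDL-compatible" can mean for a set of grounded facts, the following sketch renders true facts as a PDDL `(:init ...)` block. It is an assumption-laden illustration: the review does not describe how the paper generates its PDDL, and the anonymous names `p0`, `p1` and `o0`, `o1`, `o2` as well as the helper `facts_to_pddl_init` are hypothetical.

```python
from typing import Dict, Tuple

# A fact is (predicate symbol, tuple of object names); the value is its truth.
Fact = Tuple[str, Tuple[str, ...]]


def facts_to_pddl_init(facts: Dict[Fact, bool]) -> str:
    """Render only the true facts, sorted for stable output, as a PDDL :init block."""
    true_facts = sorted(fact for fact, value in facts.items() if value)
    lines = [f"  ({name} {' '.join(args)})" for name, args in true_facts]
    return "(:init\n" + "\n".join(lines) + "\n)"


# Hypothetical grounded facts with anonymous predicate and object symbols.
example_facts = {
    ("p1", ("o1", "o2")): True,
    ("p0", ("o0", "o1")): True,
    ("p0", ("o1", "o0")): False,  # false facts are omitted from :init
}

init_block = facts_to_pddl_init(example_facts)
print(init_block)
# (:init
#   (p0 o0 o1)
#   (p1 o1 o2)
# )
```

Listing only the true facts relies on the closed-world convention used by classical planners, where anything absent from the initial state is treated as false.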
This review was created by AI and reviewed by human editors.