
Replacing the global pose-estimation vector with a Large Language Model (LLM)-driven scene graph should improve the semantic plausibility of object placement for complex open-set prompts.

Feasibility: 9 Novelty: 6

Motivation

SceneMaker decouples pose from shape, but pose estimation is often statistically derived from datasets, missing semantic nuance (e.g., a 'messy desk' implies different poses than a 'tidy desk'). LLMs possess strong spatial-semantic reasoning. Injecting this reasoning into the pose module could allow for more complex, context-aware arrangements that pure geometric learning misses.

Proposed Method

Parse the input text prompt using an LLM to generate a spatial scene graph (nodes=objects, edges=spatial relations like 'supported by', 'facing'). Modify SceneMaker's pose estimation branch to condition on this graph using a Graph Neural Network (GNN). Train/fine-tune on the 3D-FRONT dataset, enforcing that predicted poses satisfy the graph's relational constraints.
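The first two steps above can be sketched minimally: the LLM is prompted to emit a JSON scene graph, which is parsed into nodes and typed edges, and each relation type induces a differentiable constraint on the predicted poses. The JSON schema, relation names, and the simple 1-D support penalty below are illustrative assumptions, not SceneMaker's actual interface; a real system would use full 6-DoF poses and feed the graph through a GNN.

```python
import json

# Hypothetical JSON in the shape an LLM could be prompted to emit for
# "a lamp on a desk with a chair facing it" (schema is an assumption).
LLM_OUTPUT = '''
{
  "nodes": ["desk", "lamp", "chair"],
  "edges": [
    {"src": "lamp", "rel": "supported_by", "dst": "desk"},
    {"src": "chair", "rel": "facing", "dst": "desk"}
  ]
}
'''

def parse_scene_graph(text):
    """Parse the LLM's JSON into (nodes, typed edges) for a GNN or a
    constraint check."""
    g = json.loads(text)
    return g["nodes"], [(e["src"], e["rel"], e["dst"]) for e in g["edges"]]

def support_violation(poses, edges):
    """Penalty for 'supported_by' edges: the supported object's bottom
    should rest on the supporter's top surface. Here poses maps each
    object name to (z_bottom, z_top) in metres; this 1-D check stands in
    for a full 6-DoF relational loss."""
    penalty = 0.0
    for src, rel, dst in edges:
        if rel == "supported_by":
            penalty += abs(poses[src][0] - poses[dst][1])
    return penalty

nodes, edges = parse_scene_graph(LLM_OUTPUT)
poses = {"desk": (0.0, 0.75), "lamp": (0.75, 1.10), "chair": (0.0, 0.90)}
print(nodes)                            # ['desk', 'lamp', 'chair']
print(support_violation(poses, edges))  # 0.0 (lamp rests on the desk top)
```

During training, such per-relation penalties would be summed over the graph and added to SceneMaker's pose loss, so that gradient descent pushes poses toward configurations that satisfy the LLM-derived constraints.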

Expected Contribution

A neuro-symbolic approach to 3D scene layout that combines the geometric precision of SceneMaker with the semantic reasoning of LLMs.

Required Resources

Access to LLM APIs (GPT-4 or Llama 3), 3D scene datasets with layout annotations, significant GPU compute for training the GNN adapter.

Source Paper

SceneMaker: Open-set 3D Scene Generation with Decoupled De-occlusion and Pose Estimation Model
