My research focuses on fusing multimodal information to realize spatial intelligence that can reconstruct, understand, and reason about the real world. Specifically, I work on 3D and generative models and on Multimodal Large Language Models for spatial understanding.
I am actively seeking research internship opportunities. Please drop me an email if you are interested.
I am also open to collaborations; please drop me an email if you would like to chat.
We propose a perceive-then-plan framework with two VLMs (Perceiver and LaP Planner) for monocular 3D layout estimation, enabling both visual alignment and scene coherence.
Given a text prompt describing multiple objects and their spatial relationships, our method generates a 3D scene depicting these objects naturally interacting with one another.