My research interests lie mainly in using different modalities (language, 2D visual signals, 3D representations) to achieve general intelligence that can understand the real world. Specifically, I focus on 3D and generative models, and on Multimodal Large Language Models for 3D.
I am actively seeking a summer research internship for 2026. Please drop me an email if you are interested.
I am also open to collaborations; please drop me an email if you would like to chat.
We propose a perceive-then-plan framework with two VLMs (Perceiver and LaP Planner) for monocular 3D layout estimation, enabling both visual alignment and scene coherence.
Given a text prompt describing multiple objects and their spatial relationships, our method generates a 3D scene depicting these objects naturally interacting with one another.