Unified Interaction-Ready 3D Scene Generation via LLM–RL Optimization

Anh H. Vo¹, Sungyo Lee¹, Phil-Joong Kim¹, Soo-Mi Choi¹, Yong-Guk Kim*¹

¹Department of Computer Engineering, Sejong University, Seoul, Republic of Korea

Overview

Abstract

Recent advances in LLMs have significantly improved language-driven 3D scene generation, yet existing approaches primarily focus on scene synthesis and treat user interaction as a separate process, limiting the adaptability and immersive potential of interactive multimedia systems. This paper presents a unified framework that closes the loop between language-driven scene generation and immersive human-robot interaction in virtual reality. Given a natural language instruction, the proposed framework constructs a structured scene representation via an LLM-based Language-Driven Scene Representation module, predicts interaction-relevant viewpoints through a VR-aware model, and optimizes object arrangements under semantic and geometric constraints via a reinforcement learning-based Plan2Place module. The generated environments are deployed in VR, where users interact through visual and haptic feedback, making interaction a first-class constraint in scene generation by design. Experiments on the ALFRED benchmark demonstrate state-of-the-art performance in task-based scene generation, object placement, and VR localization, while user studies confirm consistent improvements in immersion, interaction quality, and task efficiency.
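To make the idea of optimizing object arrangements "under semantic and geometric constraints" more concrete, the following is a minimal Python sketch of a placement score that an RL agent could be trained to maximize. It is an illustration under stated assumptions, not the Plan2Place reward from the paper: the `Box` footprint type, the `placement_reward` function, the `near_pairs` relation list, and all weights and thresholds are hypothetical names and values chosen for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Hypothetical axis-aligned 2D footprint of an object (metres)."""
    name: str
    x: float  # centre x
    y: float  # centre y
    w: float  # extent along x
    d: float  # extent along y


def overlap_area(a: Box, b: Box) -> float:
    """Area of the intersection of two footprints (0.0 if disjoint)."""
    dx = min(a.x + a.w / 2, b.x + b.w / 2) - max(a.x - a.w / 2, b.x - b.w / 2)
    dy = min(a.y + a.d / 2, b.y + b.d / 2) - max(a.y - a.d / 2, b.y - b.d / 2)
    return max(dx, 0.0) * max(dy, 0.0)


def inside_room(box: Box, room_w: float, room_d: float) -> bool:
    """True if the footprint lies fully within a room spanning [0, room_w] x [0, room_d]."""
    return (
        box.x - box.w / 2 >= 0.0
        and box.x + box.w / 2 <= room_w
        and box.y - box.d / 2 >= 0.0
        and box.y + box.d / 2 <= room_d
    )


def placement_reward(boxes, near_pairs, room_w, room_d, near_dist=1.5):
    """Toy score: geometric penalties plus semantic 'near' bonuses (assumed weights)."""
    by_name = {b.name: b for b in boxes}
    reward = 0.0
    # Geometric terms: penalise objects outside the room and pairwise overlaps.
    for i, a in enumerate(boxes):
        if not inside_room(a, room_w, room_d):
            reward -= 1.0
        for b in boxes[i + 1:]:
            reward -= overlap_area(a, b)
    # Semantic term: reward each pair that the instruction implies should be close.
    for first, second in near_pairs:
        a, b = by_name[first], by_name[second]
        dist = ((a.x - b.x) ** 2 + (a.y - b.y) ** 2) ** 0.5
        if dist <= near_dist:
            reward += 1.0
    return reward


# Example: a coffee table and an armchair in a 5 m x 4 m living room.
table = Box("coffee_table", 2.0, 2.0, 1.0, 1.0)
good_chair = Box("armchair", 3.2, 2.0, 1.0, 1.0)  # close, not overlapping
bad_chair = Box("armchair", 2.5, 2.0, 1.0, 1.0)   # overlaps the table
pairs = [("coffee_table", "armchair")]
good_score = placement_reward([table, good_chair], pairs, 5.0, 4.0)
bad_score = placement_reward([table, bad_chair], pairs, 5.0, 4.0)
```

In this sketch the non-overlapping layout scores higher than the overlapping one, which is the kind of signal a placement policy would need. A real system would likely work with 3D geometry, support relations such as "on top of", and reachability from the predicted viewpoints.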

RESULTS

3D INDOOR SCENE GENERATION

A comparison of our method and HOLODECK.

Room Type: a living room.

Instruction: Move a credit card from the coffee table to the striped armchair.

Panels: HOLODECK, OURS

Room Type: a kitchen.

Instruction: Put chilled bread on the counter.

Panels: HOLODECK, OURS

Qualitative Comparison with Baseline Methods


Apartment Generation

Instruction: Put a basketball on the bed.

Haptic-based Human-Robot Interaction

Room Type: a living room

Instruction: Place a laptop on the dresser.

Views: Top-Down, Robot, Human

Room Type: a living room

Instruction: Examine a credit card by the light of a floor lamp and then turn it off.

Views: Top-Down, Robot, Human

Room Type: a kitchen

Instruction: Put chilled bread on the counter.

Views: Top-Down, Robot, Human

Room Type: a kitchen

Instruction: Put a heated tomato on the table.

Views: Top-Down, Robot, Human

BibTeX

@article{vo2026,
  author    = {Vo, Anh H. and Lee, Sungyo and Kim, Phil-Joong and Choi, Soo-Mi and Kim, Yong-Guk},
  title     = {Unified Interaction-Ready {3D} Scene Generation via {LLM}--{RL} Optimization},
  journal   = {},
  year      = {2026},
}