Unified Interaction-Ready 3D Scene Generation via LLM–RL Optimization

Department of Computer Engineering, Sejong University, Seoul, Republic of Korea

Overview

Abstract

Recent advances in LLMs have significantly improved language-driven 3D scene generation, yet existing approaches primarily focus on scene synthesis and treat user interaction as a separate process, limiting the adaptability and immersive potential of interactive multimedia systems. This paper presents a unified framework that closes the loop between language-driven scene generation and immersive human-robot interaction in virtual reality. Given a natural language instruction, the proposed framework constructs a structured scene representation via an LLM-based Language-Driven Scene Representation module, predicts interaction-relevant viewpoints through a VR-aware model, and optimizes object arrangements under semantic and geometric constraints via a reinforcement learning-based Plan2Place module. The generated environments are deployed in VR, where users interact through visual and haptic feedback, making interaction a first-class constraint in scene generation by design. Experiments on the ALFRED benchmark demonstrate state-of-the-art performance in task-based scene generation, object placement, and VR localization, while user studies confirm consistent improvements in immersion, interaction quality, and task efficiency.
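As a rough illustration of what "a structured scene representation" and "geometric constraints" could look like, here is a minimal sketch. Everything in it is an assumption made for illustration: the `Obj` class, the `scene` dictionary layout, the `"on"` relation, and the `place_scene` helper are hypothetical names, and simple rejection sampling stands in for the reinforcement learning-based Plan2Place module, whose actual design is not described on this page.

```python
import random
from dataclasses import dataclass


@dataclass
class Obj:
    """Hypothetical object record: axis-aligned footprint centred at (x, y)."""
    name: str
    w: float  # footprint width in metres
    d: float  # footprint depth in metres
    x: float = 0.0
    y: float = 0.0


def overlaps(a: Obj, b: Obj) -> bool:
    """Geometric constraint: True if two axis-aligned footprints intersect."""
    return abs(a.x - b.x) < (a.w + b.w) / 2 and abs(a.y - b.y) < (a.d + b.d) / 2


def place_scene(scene: dict, seed: int = 0, max_tries: int = 1000) -> dict:
    """Place floor objects without overlap, then snap 'on' children to parents.

    Rejection sampling is used here only as a stand-in for a learned policy.
    """
    rng = random.Random(seed)
    room_w, room_d = scene["room_size"]
    objs = {o.name: o for o in scene["objects"]}
    children = {c for c, rel, _ in scene["relations"] if rel == "on"}

    placed = []
    for o in objs.values():
        if o.name in children:
            continue  # supported objects are positioned after their parents
        for _ in range(max_tries):
            # Sample a centre that keeps the whole footprint inside the room.
            o.x = rng.uniform(o.w / 2, room_w - o.w / 2)
            o.y = rng.uniform(o.d / 2, room_d - o.d / 2)
            if not any(overlaps(o, p) for p in placed):
                placed.append(o)
                break
        else:
            raise RuntimeError(f"could not place {o.name}")

    # Semantic constraint: an object that is 'on' a parent shares its position.
    for child, rel, parent in scene["relations"]:
        if rel == "on":
            objs[child].x, objs[child].y = objs[parent].x, objs[parent].y
    return objs


# Hypothetical representation for one of the living-room instructions.
scene = {
    "room_type": "living room",
    "room_size": (5.0, 4.0),
    "objects": [
        Obj("coffee_table", 1.0, 0.6),
        Obj("striped_armchair", 0.9, 0.9),
        Obj("credit_card", 0.09, 0.05),
    ],
    "relations": [("credit_card", "on", "coffee_table")],
}
layout = place_scene(scene)
```

The sketch separates the representation (objects plus relations) from the placement step, so the sampler could in principle be replaced by any optimizer that respects the same constraints.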

RESULTS

3D INDOOR SCENE GENERATION


A comparison of our method and HOLODECK.

Room Type: a living room.

Instruction: Move a credit card from the coffee table to the striped armchair.

HOLODECK

OURS


Room Type: a kitchen.

Instruction: Put chilled bread on the counter.

HOLODECK

OURS


Qualitative Comparison with Baseline Methods

sample 1


Apartment Generation

Instruction: Put a basketball on the bed.


Haptic-based Human-Robot Interaction

Room Type: a living room

Instruction: Place a laptop on the dresser.

Top-Down

Robot

Human


Room Type: a living room

Instruction: Examine a credit card by the light of a floor lamp and then turn it off.

Top-Down

Robot

Human


Room Type: a kitchen

Instruction: Put chilled bread on the counter.

Top-Down

Robot

Human


Room Type: a kitchen

Instruction: Put a heated tomato on the table.

Top-Down

Robot

Human


BibTeX

@article{vo2026,
  author    = {Vo, Anh H. and Lee, Sungyo and Kim, Phil-Joong and Choi, Soo-Mi and Kim, Yong-Guk},
  title     = {Unified Interaction-Ready 3D Scene Generation via LLM–RL Optimization},
  journal   = {},
  year      = {2026},
}