Closing the Loop: Unified 3D Scene Generation and Immersive Interaction via LLM-RL Coupling

Department of Computer Engineering, Sejong University, Seoul, Republic of Korea

Overview


Abstract

Recent advances in large language models have enabled significant progress in 3D content generation, yet existing approaches largely treat scene generation and user interaction as separate processes. This disconnect limits the adaptability and immersive potential of interactive multimedia systems. In this paper, we propose a unified framework that closes the loop between language-driven 3D scene generation and human interaction. Given natural language instructions, the proposed method constructs structured scene representations using a large language model and optimizes spatial layouts through reinforcement learning under geometric and semantic constraints. The generated environments are deployed in an immersive virtual setting, where user interaction continuously provides feedback that aligns the generated content with perception and usability. By tightly coupling generation and interaction, the framework enables more responsive and realistic multimedia experiences. Experiments on the ALFRED benchmark demonstrate state-of-the-art performance in task-based scene generation. In addition, qualitative and user-level evaluations show improved immersion, interaction quality, and task efficiency, highlighting the potential of closed-loop multimodal generation for next-generation multimedia systems.
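To make "optimizes spatial layouts through reinforcement learning under geometric and semantic constraints" concrete, the following is a minimal sketch of what a layout reward could look like. It is not the reward used in the paper, which this page does not specify. The 2D footprint model, the `Box` type, the `near_pairs` relation, and the `near_dist` threshold are all hypothetical choices made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Hypothetical 2D footprint of an object: center (x, y), width w, depth d."""
    name: str
    x: float
    y: float
    w: float
    d: float


def overlap_area(a: Box, b: Box) -> float:
    """Area of intersection between two axis-aligned footprints (0 if disjoint)."""
    dx = min(a.x + a.w / 2, b.x + b.w / 2) - max(a.x - a.w / 2, b.x - b.w / 2)
    dy = min(a.y + a.d / 2, b.y + b.d / 2) - max(a.y - a.d / 2, b.y - b.d / 2)
    return max(dx, 0.0) * max(dy, 0.0)


def out_of_bounds(b: Box, room_w: float, room_d: float) -> float:
    """Total distance by which a footprint sticks out of [0, room_w] x [0, room_d]."""
    return (max(0.0, b.w / 2 - b.x) + max(0.0, b.x + b.w / 2 - room_w)
            + max(0.0, b.d / 2 - b.y) + max(0.0, b.y + b.d / 2 - room_d))


def layout_reward(boxes, near_pairs, room_w, room_d, near_dist=1.5):
    """Assumed reward: 0 for a layout that violates nothing, negative otherwise.

    Geometric terms penalize object overlap and leaving the room.
    The semantic term penalizes pairs that should be close (e.g. taken from
    an LLM-produced scene description) but are farther apart than near_dist.
    """
    by_name = {b.name: b for b in boxes}
    reward = 0.0
    for i, a in enumerate(boxes):
        reward -= out_of_bounds(a, room_w, room_d)
        for b in boxes[i + 1:]:
            reward -= overlap_area(a, b)
    for p, q in near_pairs:
        a, b = by_name[p], by_name[q]
        dist = ((a.x - b.x) ** 2 + (a.y - b.y) ** 2) ** 0.5
        reward -= max(0.0, dist - near_dist)
    return reward


if __name__ == "__main__":
    table = Box("coffee_table", 2.0, 2.0, 1.0, 0.6)
    near = [("coffee_table", "armchair")]
    valid = [table, Box("armchair", 3.2, 2.0, 0.8, 0.8)]
    clashing = [table, Box("armchair", 2.2, 2.0, 0.8, 0.8)]
    print(layout_reward(valid, near, room_w=5.0, room_d=4.0))
    print(layout_reward(clashing, near, room_w=5.0, room_d=4.0))
```

An RL agent that moves objects would receive a higher return for the first layout than for the second, where the armchair intersects the coffee table. A real system would presumably work with 3D geometry and richer semantic relations.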

RESULTS

3D INDOOR SCENE GENERATION


A comparison of our method and HOLODECK.

Room Type: a living room.

Instruction: Move a credit card from the coffee table to the striped armchair.

HOLODECK


OURS


Room Type: a kitchen.

Instruction: Put chilled bread on the counter.

HOLODECK


OURS


Qualitative Comparison with Baseline Methods

sample 1



Apartment Generation

Instruction: Put a basketball on the bed.


Haptic-based Human-Robot Interaction

Room Type: a living room

Instruction: Place a laptop on the dresser.

Top-Down


Robot


Human


Room Type: a living room

Instruction: Examine a credit card by the light of a floor lamp and then turn it off.

Top-Down


Robot


Human


Room Type: a kitchen

Instruction: Put chilled bread on the counter.

Top-Down


Robot


Human


Room Type: a kitchen

Instruction: Put a heated tomato on the table.

Top-Down


Robot


Human


BibTeX

@article{vo2026,
  author    = {Vo, Anh H. and Lee, Sungyo and Kim, Phil-Joong and Choi, Soo-Mi and Kim, Yong-Guk},
  title     = {Closing the Loop: Unified 3D Scene Generation and Immersive Interaction via LLM-RL Coupling},
  journal   = {},
  year      = {2026},
}