Poster
RILA: Reflective and Imaginative Language Agent for Zero-Shot Semantic Audio-Visual Navigation
Zeyuan Yang · LIU JIAGENG · Peihao Chen · Anoop Cherian · Tim Marks · Jonathan Le Roux · Chuang Gan
Arch 4A-E Poster #157
We leverage Large Language Models (LLMs) for zero-shot Semantic Audio-Visual Navigation (SAVN). Existing methods rely on extensive training demonstrations for reinforcement learning, yet achieve relatively low success rates and lack generalizability. The intermittent nature of auditory signals poses additional obstacles to inferring goal information. To address these challenges, we present the Reflective and Imaginative Language Agent (RILA). By employing multi-modal models to process sensory data, we instruct an LLM-based planner to actively explore the environment. During exploration, our agent adaptively evaluates and dismisses inaccurate perceptual descriptions. Additionally, we introduce an auxiliary LLM-based assistant that enhances global environmental comprehension by mapping room layouts and providing strategic insights. Through comprehensive experiments and analysis, we show that our method outperforms relevant baselines without requiring training demonstrations from the environment or complementary semantic information.