    Language Guided Localization and Navigation

    View/Open
    HAHN-DISSERTATION-2022.pdf (27.49 MB)
    Date
    2022-07-26
    Author
    Hahn, Meera
    Abstract
    Embodied tasks that require active perception are key to improving language grounding models and creating holistic social agents. In this dissertation we explore four multi-modal embodied perception tasks that require localization or navigation of an agent in an unknown temporal or 3D space with limited information about the environment. We first explore how an agent can be guided by language to navigate a temporal space using reinforcement learning, in a manner similar to navigating a 3D space. Next, we explore how to teach an agent to navigate using only self-supervised learning from passive data. In this task we remove the complexity of language and explore a topological-map and graph-network-based strategy for navigation. We then present the Where Are You? (WAY) dataset, which contains over 6k dialogs of two humans performing a localization task. On top of this dataset we design three tasks that push the envelope of current visual language-grounding tasks by introducing a multi-agent setup in which agents are required to use active perception to communicate, navigate, and localize. We specifically focus on modeling one of these tasks, Localization from Embodied Dialog (LED). The LED task involves taking a natural language dialog between two agents -- an observer and a locator -- and predicting the location of the observer agent. We find that a topological graph map of the environments is a successful representation for modeling the complex relational structure of the dialog and observer locations. We validate our approach against several state-of-the-art multi-modal baselines and show that a multi-modal transformer with large-scale pre-training outperforms all other models. We additionally introduce a novel analysis pipeline on this model for the LED and Vision Language Navigation (VLN) tasks to diagnose and reveal limitations and failure modes of these types of models.
    URI
    http://hdl.handle.net/1853/67283
    Collections
    • College of Computing Theses and Dissertations [1191]
    • Georgia Tech Theses and Dissertations [23877]
    • School of Interactive Computing Theses and Dissertations [144]
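
    As an illustration of the Localization from Embodied Dialog (LED) formulation described in the abstract (a dialog between an observer and a locator goes in, a node of a topological environment graph comes out), the following is a minimal, hypothetical Python sketch. The function and variable names, and the use of cosine similarity over per-node embeddings, are illustrative assumptions only; the dissertation's actual model is a multi-modal transformer with large-scale pre-training, not this heuristic.

    # Minimal sketch (not the dissertation's implementation): pick the node of a
    # topological environment graph whose embedding best matches a dialog embedding.
    # All names here (dialog_embedding, node_embeddings, ...) are hypothetical.
    import numpy as np

    def predict_observer_node(dialog_embedding, node_embeddings):
        """Return the graph node whose embedding is most similar to the dialog."""
        def cosine(a, b):
            return float(a @ b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8)
        # Score every node of the topological map against the dialog, keep the best.
        scores = {node: cosine(dialog_embedding, emb)
                  for node, emb in node_embeddings.items()}
        return max(scores, key=scores.get)

    # Hypothetical usage: in practice the embeddings would come from a multi-modal
    # encoder over the dialog text and the panoramic views at each graph node.
    dialog_emb = np.random.rand(256)
    nodes = {f"node_{i}": np.random.rand(256) for i in range(10)}
    print(predict_observer_node(dialog_emb, nodes))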
