
    Building agents that can see, talk, and act

    View/Open
    DAS-DISSERTATION-2020.pdf (37.52 MB)
    Date
    2020-04-25
    Author
    Das, Abhishek
    Abstract
    A long-term goal in AI is to build general-purpose intelligent agents that simultaneously possess the ability to perceive the rich visual environment around us (through vision, audition, or other sensors), reason and infer from perception in an interpretable and actionable manner, communicate this understanding to humans and other agents (e.g., hold a natural language dialog grounded in the environment), and act on this understanding in physical worlds (e.g., aid humans by executing commands in an embodied environment). To be able to make progress towards this grand goal, we must explore new multimodal AI tasks, move from datasets to physical environments, and build new kinds of models. In this dissertation, we combine insights from different areas of AI -- computer vision, language understanding, reinforcement learning -- and present steps to connect the underlying domains of vision and language to actions towards such general-purpose agents. In Part 1, we develop agents that can see and talk -- capable of holding free-form conversations about images -- and reinforcement learning-based algorithms to train these visual dialog agents via self-play. In Part 2, we extend our focus to agents that can see, talk, and act -- embodied agents that can actively perceive and navigate in partially-observable simulated environments, to accomplish tasks such as question-answering. In Part 3, we devise techniques for training populations of agents that can communicate with each other, to coordinate, strategize, and utilize their combined sensory experiences and act in the physical world. These agents learn both what messages to send and who to communicate with, solely from downstream reward without any communication supervision. Finally, in Part 4, we use question-answering as a task-agnostic probe to ask a self-supervised embodied agent what it knows about its physical world, and use it to quantify differences in visual representations agents develop when trained with different auxiliary objectives.
    URI
    http://hdl.handle.net/1853/62768
    Collections
    • College of Computing Theses and Dissertations [1071]
    • Georgia Tech Theses and Dissertations [22401]
    • School of Interactive Computing Theses and Dissertations [106]
