
    Robust adaptation of natural language processing for language variation

    View/Open
    YANG-DISSERTATION-2017.pdf (2.103Mb)
    Date
    2017-01-09
    Author
    Yang, Yi
    Abstract
    Natural language processing (NLP) technology has been applied in various domains, ranging from social media and digital humanities to public health. Unfortunately, existing NLP techniques often perform unsatisfactorily when adopted in these areas. The language of new datasets and settings can differ significantly from standard NLP training corpora, and modern NLP techniques are usually vulnerable to variation in non-standard language, in terms of the lexicon, syntax, and semantics. Previous approaches to this problem suffer from three major weaknesses. First, they often employ supervised methods that require expensive annotations and easily become outdated given the dynamic nature of language. Second, they usually fail to leverage the valuable metadata associated with the target languages of these areas. Third, they treat language as uniform and ignore differences in language use among individuals. In this thesis, we propose several novel techniques to overcome these weaknesses and build NLP systems that are robust to language variation. These approaches are driven by co-occurrence statistics as well as rich metadata, without the need for costly annotations, and can easily adapt to new settings. First, we can transform lexical variation into text that better matches standard datasets. We present a unified unsupervised statistical model for text normalization. The relationship between standard and non-standard tokens is characterized by a log-linear model, permitting arbitrary features. Text normalization focuses on tackling variation in the lexicon, thereby improving the underlying NLP tasks. Second, we can overcome language variation by adapting standard NLP tools to fit the text with variation directly. We propose a novel but simple feature embedding approach to learn joint feature representations for domain adaptation, by exploiting the feature template structure commonly used in NLP problems.
    We also show how to incorporate metadata attributes into feature embeddings, which helps to distill the domain-invariant properties of each feature over multiple related domains. Domain adaptation is able to deal with a full range of linguistic phenomena, and thus it often yields better performance than text normalization. Finally, a subtle challenge posed by variation is that language is not uniformly distributed among individuals, while traditional NLP systems usually treat texts from different authors the same. Both text normalization and domain adaptation follow the standard NLP settings and fail to handle this problem. We propose to address the difficulty by exploiting the sociological theory of homophily (the tendency of socially linked individuals to behave similarly) to build models that account for language variation at the level of an individual or a social community. We investigate both label homophily and linguistic homophily to build socially adapted information extraction and sentiment analysis systems. Our work delivers state-of-the-art NLP systems for social media and historical texts on various standard benchmark datasets.
    URI
    http://hdl.handle.net/1853/58201
    Collections
    • College of Computing Theses and Dissertations [1191]
    • Georgia Tech Theses and Dissertations [23877]
