|dc.description.abstract||Often in practice, researchers have a (“target”) dataset that is desirable in many ways, but is missing some key variables, or “knowledge”, that would greatly enrich the value of the data for investigating questions of interest. If this knowledge could be extracted from a different but related (“source”) dataset and transferred between them by way of variables common to both datasets, it could improve the ability to perform analyses and increase the value of the dataset itself at a relatively minimal cost. In the current study, the target dataset comprises responses to the 2009 National Household Travel Survey (N ≈ 100,000), and the key missing variables are transportation-related attitudes, which could greatly improve the ability to predict travel behaviors. Our source dataset is obtained from the 2011-12 Multitasking Survey of Northern California Commuters (MSNCC, N ≈ 2000).
To evaluate approaches to informing one dataset with knowledge from another and to evaluate the performance of the knowledge transferred into the target dataset, we developed transfer learning and external validation frameworks, respectively. To implement the transfer learning framework, the set of common variables was first augmented by obtaining a large number of built and social environment characteristics linked to the residential locations of observations in each dataset. Then, when machine-learning methods were applied to the categorical and continuous attitudinal variables of the MSNCC, the LASSO (least absolute shrinkage and selection operator) regression learner showed the lowest generalization error over the 10 cross-validation folds in the source dataset. The pro-transit, pro-active transportation, and pro-density attitudinal factor scores showed the greatest improvement over a naïve learner that assigns the average; correlations between the predicted and observed scores on these factors were 0.564, 0.538, and 0.571, respectively.
The external validation framework was implemented by estimating linear regression models of vehicle ownership and comparing their goodness of fit with and without attitudes. The results showed that in the source dataset the observed attitudes account for an 8.0% model lift (i.e., improvement in goodness of fit), while in the target dataset the predicted attitudes account for a 1.2–5.4% model lift, depending on the extent and nature of the variables used to impute them. Although these initial results are modest, we believe they show substantial promise, and the process has identified a number of opportunities for improvement and further research.||en_US
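
The source-dataset comparison described in the abstract (a LASSO learner against a naïve learner that assigns the average, scored over 10 cross-validation folds) can be illustrated with a minimal sketch. This is not the study's implementation: the data are synthetic, the penalty `alpha=0.05` is an arbitrary assumption, the helper names (`soft_threshold`, `fit_lasso`, `predict`, `cv_mse`) are hypothetical, and a plain coordinate-descent solver in standard-library Python stands in for whatever software the authors used.

```python
import random


def soft_threshold(z, t):
    """Shrink z toward zero by t (the LASSO proximal step)."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0


def fit_lasso(X, y, alpha, n_iter=100):
    """Coordinate descent for (1/2n)*||y - b0 - X b||^2 + alpha*||b||_1."""
    n, p = len(X), len(X[0])
    b0 = sum(y) / n
    beta = [0.0] * p
    resid = [yi - b0 for yi in y]  # residual of the current fit
    for _ in range(n_iter):
        for j in range(p):
            col = [row[j] for row in X]
            norm = sum(c * c for c in col) / n
            if norm == 0.0:
                continue
            # Covariance of feature j with the residual that excludes feature j.
            rho = sum(c * (r + c * beta[j]) for c, r in zip(col, resid)) / n
            new = soft_threshold(rho, alpha) / norm
            delta = new - beta[j]
            if delta != 0.0:
                resid = [r - c * delta for c, r in zip(col, resid)]
                beta[j] = new
        # The intercept is unpenalized: absorb the mean residual into it.
        shift = sum(resid) / n
        b0 += shift
        resid = [r - shift for r in resid]
    return b0, beta


def predict(model, X):
    b0, beta = model
    return [b0 + sum(b * x for b, x in zip(beta, row)) for row in X]


def cv_mse(X, y, alpha, k=10, seed=0):
    """Return (LASSO MSE, naive-mean MSE) pooled over k held-out folds."""
    idx = list(range(len(y)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    lasso_err = naive_err = 0.0
    for fold in folds:
        held = set(fold)
        train = [i for i in idx if i not in held]
        X_tr = [X[i] for i in train]
        y_tr = [y[i] for i in train]
        model = fit_lasso(X_tr, y_tr, alpha)
        mean_y = sum(y_tr) / len(y_tr)  # naive learner: assign the training average
        preds = predict(model, [X[i] for i in fold])
        for i, pred in zip(fold, preds):
            lasso_err += (y[i] - pred) ** 2
            naive_err += (y[i] - mean_y) ** 2
    n = len(y)
    return lasso_err / n, naive_err / n


# Synthetic stand-in for an attitudinal factor score driven by two of five predictors.
rng = random.Random(42)
X = [[rng.gauss(0, 1) for _ in range(5)] for _ in range(200)]
y = [1.0 + 2.0 * r[0] - 1.5 * r[1] + rng.gauss(0, 0.5) for r in X]

lasso_mse, naive_mse = cv_mse(X, y, alpha=0.05)
print(f"LASSO CV MSE: {lasso_mse:.3f}  naive CV MSE: {naive_mse:.3f}")
```

In practice the penalty would itself be chosen by cross-validation rather than fixed, and the predictors would typically be standardized before fitting so that the penalty treats them comparably.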