Toward accurate and efficient outlier detection in high dimensional and large data sets
Nguyen, Minh Quoc
This work proposes an efficient method for computing local density-based outliers in high dimensional data. We show that outliers of this type remain present in any subset of the dataset. This property allows the dataset to be partitioned into random subsets, the outliers to be computed locally within each subset, and the results from the different subsets to be combined. The local density-based outliers can therefore be computed efficiently. Another challenge in high dimensional data is that outliers are often suppressed when the majority of dimensions do not exhibit outliers. To address this, we introduce a filtering method in which outlier scores are computed in sub-dimensions: low sub-dimensional scores are filtered out, and high scores are aggregated into the final score. Aggregation with filtering eliminates the effect of accumulating small (delta) deviations across many dimensions, so that the outliers are identified correctly.

In some cases, sets of outliers that form micro patterns are more interesting than individual outliers. These micro patterns are considered anomalous with respect to the dominant patterns in the dataset. Anomalous pattern detection poses two challenges. The first is that, with existing clustering techniques, the dominant patterns often overshadow the anomalous ones. A common approach is to cluster the dataset using the k-nearest neighbor algorithm. We introduce the adaptive nearest neighbor and the concept of dual-neighbor to detect micro patterns more accurately. The second challenge is to compute the anomalous patterns quickly. Our contribution is to compute the patterns based on the correlation between attributes: the correlation implies that the data can be partitioned into groups based on each attribute, and candidate patterns can be learned within the groups. Thus, we develop a feature-based method that computes these patterns efficiently.
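
The filtering idea described above can be illustrated with a minimal Python sketch. It is not the thesis's implementation: the per-sub-dimension scores are assumed to come from some detector that is not shown, and the names `naive_score`, `filtered_score`, the threshold of 3.0, and the toy score values are all hypothetical. The sketch only shows why summing every sub-dimensional score lets many small deviations outweigh one strong deviation, and how discarding low scores before aggregation avoids that.

```python
# Toy sub-dimensional outlier scores for three points over eight sub-dimensions.
# These values are made up for illustration; in practice they would come from
# a per-sub-dimension outlier detector (assumed, not shown here).
scores = {
    "typical":      [1.0, 0.8, 1.1, 0.9, 1.0, 1.2, 0.9, 1.0],
    "drifter":      [2.5, 2.5, 2.5, 2.5, 2.5, 2.5, 2.5, 2.5],  # slightly off everywhere
    "true_outlier": [1.0, 0.9, 9.0, 1.0, 0.8, 1.1, 1.0, 0.9],  # extreme in one sub-dimension
}


def naive_score(dim_scores):
    """Aggregate without filtering: small deviations accumulate."""
    return sum(dim_scores)


def filtered_score(dim_scores, threshold):
    """Drop sub-dimensional scores below the threshold, then aggregate the rest."""
    return sum(s for s in dim_scores if s >= threshold)


THRESHOLD = 3.0  # hypothetical cut-off separating "low" from "high" scores

naive = {name: naive_score(s) for name, s in scores.items()}
filtered = {name: filtered_score(s, THRESHOLD) for name, s in scores.items()}

# Rank the points from most to least anomalous under each aggregation.
naive_ranking = sorted(naive, key=naive.get, reverse=True)
filtered_ranking = sorted(filtered, key=filtered.get, reverse=True)

print("naive ranking:   ", naive_ranking)
print("filtered ranking:", filtered_ranking)
```

In this toy setting the unfiltered sum ranks the point with many small deviations above the point that is extreme in a single sub-dimension, while the filtered sum ranks the latter first. How the threshold should be chosen is left open here.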