Towards Finding Optimal Partitions of Categorical Datasets
MetadataShow full item record
A considerable amount of work has been dedicated to clustering numerical data sets, but only a handful of categorical clustering algorithms are reported to date. Furthermore, almost none has addressed the following two important cluster validity problems: (1) Given a data set and a clustering algorithm that partitions the data set into k clusters, how can we determine the best k with respect to the given dataset? (2) Given a dataset and a set of clustering algorithms with a fixed k, how to determine which one will produce k clusters of the best quality? In this paper, we investigate the entropy and expected-entropy concepts for clustering categorical data, and propose a cluster validity method based on the characteristics of expected-entropy. In addition, we develop an agglomerative hierarchical algorithm (HierEntro) to incorporate the proposed cluster validity method into the clustering process. We report our initial experimental results showing the effectiveness of the proposed clustering validity method and the benefits of the HierEntro clustering algorithm.