Estimating the empirical Cluster Tree
The theory of the cluster tree has a long history, with the earliest efforts dating back to the nearest-neighbor generalizations Wishart introduced to make single-linkage clustering more robust (see [1]). In 1975, John Hartigan established a statistical definition of a high-density cluster [2]. For a population with density f in r dimensions, the high-density clusters at level lambda are the maximal connected sets of the form { x : f(x) >= lambda }, for any lambda > 0. Varying this density level lambda across all possible thresholds yields a nested hierarchy of density level sets, collectively known as the cluster tree.
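The construction above can be written compactly: the clusters at a fixed level are the connected components of an upper-level set, and the nesting of these sets across levels is what gives the tree its hierarchy.

```latex
% High-density clusters at level \lambda (Hartigan's definition):
% the maximal connected components of the upper-level set
L(\lambda) = \{\, x \in \mathbb{R}^r : f(x) \ge \lambda \,\}

% Nesting across levels induces the tree structure:
\lambda_1 \le \lambda_2 \;\Longrightarrow\; L(\lambda_2) \subseteq L(\lambda_1)
```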
In 1981, Hartigan studied the consistency of different linkage criteria in the context of hierarchical clustering [3] and introduced the notion of fractional consistency (also called Hartigan consistency). In the same seminal paper, he proved single linkage inconsistent for dimensions r >= 2. It was not until 2010 that the first provably consistent algorithm for estimating the cluster tree in more than one dimension appeared [5]. These results have undoubtedly played a large role in rekindling a number of recent research efforts related to the cluster tree (for example, see [6], [9], [10], and [11]), most of which are not yet available in statistical software like R.
The aim of this project is to provide a standalone, scalable, and extensible R package that unifies existing methodologies for estimating the empirical cluster tree.
Density estimators of upper-level sets and their connection with density-based clustering have motivations from the field of Topological Data Analysis (TDA). The recent R package TDA implements an approximation of the cluster tree that uses the same technique as the DEnsity-BAsed CLustering (DeBaCl) Python package. The DBSCAN algorithm [12], implemented in R by the fpc and dbscan packages, effectively offers a scalable approach to estimating a level set of the population density at some fixed density threshold (as noted in [6]), resulting in a clustering that is conceptually equivalent to Hartigan's definition of rigid clusters. The popular DENCLUE algorithm [4] more explicitly ties kernel density estimation techniques to the idea of rigid clusters, combining "influence functions" with a gradient-based method to find such "natural" clusters. In 2013, Campello et al. introduced a hierarchical version of DBSCAN (HDBSCAN) [7], along with a proof relating the algorithm back to Wishart's notion of generalized single linkage. An extended version of the algorithm was published in 2015, introducing specific 'flat' clustering and outlier-related algorithms in addition to the core HDBSCAN* hierarchy. In this same publication, it was asserted that most of the density-based clustering field is essentially strictly or loosely based on Hartigan's cluster tree model.
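As a minimal sketch of the level-set view described above, a single run of DBSCAN (here via the dbscan CRAN package) can be read as estimating one slice of the cluster tree; the data, eps, and minPts values below are purely illustrative.

```r
## Sketch: one DBSCAN run approximates one level set of the density.
## Assumes the 'dbscan' CRAN package; eps/minPts values are illustrative.
library(dbscan)

set.seed(1)
x <- rbind(matrix(rnorm(100, mean = 0), ncol = 2),
           matrix(rnorm(100, mean = 4), ncol = 2))

## Points with >= minPts neighbors within radius eps are 'core' points,
## i.e. points where a crude density estimate exceeds a fixed threshold;
## the resulting clusters mirror the connected components of that level set.
cl <- dbscan(x, eps = 0.7, minPts = 5)
table(cl$cluster)  ## cluster 0 collects the 'noise' below the threshold
```

Varying eps (or minPts) and re-running traces out different levels of the tree, which is why a hierarchy over all thresholds is the more complete object.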
The goal of the project is to create a package that brings the most recent advances in approximating the empirical cluster tree into a standalone package that can act as the 'standard' package for future extensions to the cluster tree. To make the work extensible and usable by regular R developers, the cluster tree itself should be built using 'standard' R hierarchical clustering formats, such as 'hclust' and 'dendrogram' objects.
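The 'standard' formats mentioned above are all available in base R; the following sketch (with illustrative data) shows the representations the package would be expected to produce and interoperate with.

```r
## Sketch: the 'standard' base R representations the package would target.
x <- matrix(rnorm(40), ncol = 2)
hc <- hclust(dist(x), method = "single")   # an 'hclust' object

## 'hclust' objects carry a merge matrix, merge heights, and an ordering,
## and convert directly to the 'dendrogram' class used for plotting:
dend <- as.dendrogram(hc)

## Flat clusterings can be extracted at any height or cluster count,
## analogous to slicing the cluster tree at one density level:
labels3 <- cutree(hc, k = 3)
```

Emitting these classes means existing tooling (plot methods, cutree, dendextend, etc.) works on the estimated tree with no extra glue code.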
The primary expected output is a scalable implementation (using C++/Rcpp) of the algorithm for estimating the cluster tree covered in [4]. As a secondary goal, modularizing the tools and utilities developed for cluster tree estimation would allow recently discovered extensions (such as the merge distortion metric from [9] and the tree pruning adaptations in [10]) to be easily included as the package (and theory) develops.
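To fix ideas, here is a deliberately naive, O(n^2)-memory base R sketch of a robust single linkage style estimator in the spirit of the consistent algorithm of [5]; the tuning parameters k and alpha, the function name, and the merge-height formula as written are illustrative assumptions, and replacing this with a scalable C++/Rcpp version is precisely the project's primary goal.

```r
## Naive base R sketch of a robust single linkage style estimator
## (illustrative only; not the scalable implementation the project targets).
robust_single_linkage <- function(x, k = 5, alpha = sqrt(2)) {
  d <- as.matrix(dist(x))
  n <- nrow(d)
  ## r_k(x): distance from each point to its k-th nearest neighbor,
  ## a crude inverse-density estimate (sorted row includes the 0 self-distance).
  r_k <- apply(d, 1, function(row) sort(row)[k + 1])
  ## Two points merge once the scale r satisfies
  ##   d(x, y) <= alpha * r  and  r_k(x), r_k(y) <= r,
  ## i.e. at height max(d(x, y) / alpha, r_k(x), r_k(y)).
  m <- pmax(d / alpha, outer(r_k, r_k, pmax))
  ## Single linkage over these adjusted distances yields the hierarchy.
  hclust(as.dist(m), method = "single")
}

tree <- robust_single_linkage(matrix(rnorm(60), ncol = 2))
```

Because the result is a plain 'hclust' object, the extensions above (pruning, merge distortion comparisons) can be layered on as separate modules operating on a common representation.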
The R package should have:
- Proper documentation following CRAN policies
- A familiar interface using 'standard' classes and formats for representing the results.
- Validation and performance tests against a 'known truth' computed in R using slower techniques (e.g., forming the cluster tree with igraph; see the 'clusterTree' function in the TDA package).
- An extensible interface for inspecting and augmenting different aspects of the cluster tree itself.
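A 'known truth' test of the kind listed above could look like the following sketch: at a fixed radius r, single linkage clusters coincide with the connected components of the graph joining all pairs within distance r, which igraph computes directly. The radius and data are illustrative assumptions.

```r
## Sketch of a slow 'known truth' check using igraph: at radius r, the
## single linkage clusters equal the connected components of the graph
## connecting all pairs within distance r. Values here are illustrative.
library(igraph)

set.seed(2)
x <- matrix(rnorm(40), ncol = 2)
r <- 0.5

## Adjacency of the r-neighborhood graph (0/1 matrix, no self-loops).
adj <- 1 * (as.matrix(dist(x)) <= r)
diag(adj) <- 0
g <- graph_from_adjacency_matrix(adj, mode = "undirected")
comp <- components(g)$membership

## Compare against cutting a single linkage tree at height r:
hc <- hclust(dist(x), method = "single")
fast <- cutree(hc, h = r)
## The two labelings should induce the same partition (up to relabeling).
table(comp, fast)
```

The same pattern generalizes to validating the full tree: sweep r over the merge heights and check the induced partitions agree level by level.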
A standalone, scalable, and extensible package that provides the basic utilities needed to approximate and analyze approximations of the cluster tree would be beneficial to all future R users who wish to use the cluster tree for either applied clustering contexts or for theoretical research.
1. Wishart, David. "Mode analysis: A generalization of nearest neighbor which reduces chaining effects." Numerical Taxonomy (1969): 282-311.
2. Hartigan, John A. Clustering Algorithms. New York: Wiley, 1975.
3. Hartigan, John A. "Consistency of single linkage for high-density clusters." Journal of the American Statistical Association 76.374 (1981): 388-394.
4. Hinneburg, Alexander, and Daniel A. Keim. "An efficient approach to clustering in large multimedia databases with noise." KDD. Vol. 98. 1998.
5. Chaudhuri, Kamalika, and Sanjoy Dasgupta. "Rates of convergence for the cluster tree." Advances in Neural Information Processing Systems. 2010.
6. Singh, Aarti, Clayton Scott, and Robert Nowak. "Adaptive Hausdorff estimation of density level sets." The Annals of Statistics 37.5B (2009): 2760-2782.
7. Campello, Ricardo J. G. B., Davoud Moulavi, and Joerg Sander. "Density-based clustering based on hierarchical density estimates." Pacific-Asia Conference on Knowledge Discovery and Data Mining. Springer Berlin Heidelberg, 2013.
8. Campello, Ricardo J. G. B., et al. "Hierarchical density estimates for data clustering, visualization, and outlier detection." ACM Transactions on Knowledge Discovery from Data (TKDD) 10.1 (2015): 5.
9. Eldridge, Justin, Mikhail Belkin, and Yusu Wang. "Beyond Hartigan consistency: Merge distortion metric for hierarchical clustering." COLT. 2015.
10. Chaudhuri, Kamalika, et al. "Consistent procedures for cluster tree estimation and pruning." IEEE Transactions on Information Theory 60.12 (2014): 7900-7912.
11. Balakrishnan, Sivaraman, et al. "Cluster trees on manifolds." Advances in Neural Information Processing Systems. 2013.
12. Ester, Martin, et al. "A density-based algorithm for discovering clusters in large spatial databases with noise." KDD. Vol. 96. No. 34. 1996.