Constrained Hierarchical Agglomerative Clustering
Constrained HAC is hierarchical agglomerative clustering in which each observation is associated with a position, and the clustering is constrained so that only adjacent clusters can be merged. These methods are useful in various application fields, including ecology (Quaternary data) and bioinformatics (e.g. in Genome-Wide Association Studies (GWAS), see https://doi.org/10.1186/s12859-015-0556-6). However, the complexity of constrained HAC is intrinsically quadratic in the number p of items to cluster. In common situations where p is of the order of 10^5-10^6, this algorithm is not applicable.
In several applications, including GWAS, the similarity between two items is a decreasing function of their physical distance, and it is reasonable to assume that the similarity between two items is close to 0 when they are separated by more than a constant h. Under this assumption, it is possible to design an algorithm whose complexity in p is quasilinear in time (O(p(log(p)+h))) and linear in space (O(ph)). This algorithm is described in this PhD thesis (chapter 3) for the Ward criterion. It takes as input a band similarity matrix of size p x h, and combines pre-calculation of certain partial sums of similarities with an efficient storage of the successive merges in a binary heap. A beta package, adjclust, implementing this algorithm with the Ward criterion is already available in a private github repository; it produces a dendrogram associated with the constrained HAC (the row positions in the similarity matrix are the positions used for the adjacency constraints).
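To make the O(ph) band input more concrete, the sketch below builds such a band representation from a toy p x p similarity matrix whose entries vanish beyond a bandwidth h. This is only an illustration: the function name band_from_full and its row/column layout are assumptions for this example, not necessarily the convention used by the beta package.

```r
## Illustration of the O(p*h) band storage taken as input by the algorithm
## described above. The exact layout used by the beta 'adjclust' package may
## differ; 'band_from_full' and its row/column convention are illustrative.
band_from_full <- function(sim, h) {
  p <- nrow(sim)
  stopifnot(ncol(sim) == p, h >= 1, h < p)
  band <- matrix(0, nrow = p, ncol = h)
  for (i in seq_len(p - 1)) {
    j <- (i + 1):min(i + h, p)          # the (at most) h right-neighbours of item i
    band[i, seq_along(j)] <- sim[i, j]
  }
  band                                  # p x h instead of p x p
}

## Toy banded similarity: decreasing with the distance between positions,
## and exactly 0 beyond the bandwidth h
p <- 10; h <- 3
pos <- seq_len(p)
sim <- exp(-abs(outer(pos, pos, "-")))
sim[abs(outer(pos, pos, "-")) > h] <- 0
dim(band_from_full(sim, h))             # 10 x 3: linear, not quadratic, in p
```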
This project aims to turn this beta package into a package that can be submitted to CRAN.
The R package rioja implements a similar algorithm (it takes dist objects as input and uses either the Ward criterion or the single-linkage criterion). This algorithm does not scale well with p, as it is quadratic in p in both time and space.
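As a quick illustration of this existing implementation (and a starting point for the "Easy" test listed at the end of this page), a constrained dendrogram can be obtained along the following lines with rioja's chclust() function:

```r
## Small example with the existing (quadratic) implementation in 'rioja'.
## chclust() takes a 'dist' object and enforces the adjacency constraint
## given by the row order of the data.
# install.packages("rioja")
library(rioja)

set.seed(1)
x <- matrix(rnorm(200), nrow = 100, ncol = 2)   # 100 ordered observations
d <- dist(x)

cl_ward   <- chclust(d, method = "coniss")      # Ward-like criterion
cl_single <- chclust(d, method = "conslink")    # constrained single linkage

plot(cl_ward)
groups <- cutree(cl_ward, k = 5)   # works because the result inherits from 'hclust'
```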
The package needs:
- proper documentation
- compliance with the CRAN policies
- a user-friendly interface
  - in the current version, only the part of the matrix located in a neighborhood of the diagonal is passed to the function; however, this format is not standard and an automatic function to create this input is needed. This function should accept similarity matrices given as standard matrices, as sparse matrix objects, or as dist objects, and should output a standard hclust object (a sketch of what such a dispatcher could look like is given after this list)
- more merging criteria
  - additional merging criteria (single linkage, complete linkage, …) should be added to the package
- tests
  - small tests are needed to check the equivalence between the similarity and dissimilarity cases (in the simple Euclidean case) and the proper functioning of the method (i.e., increasing merging similarities)
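The input-conversion function mentioned in the list above could start from a dispatcher like the one sketched below. Only the input-handling part is shown; the actual function would then call the clustering routine and return an hclust object. The name as_similarity_matrix and the dist-to-similarity mapping are illustrative assumptions, not the API of the beta package.

```r
## Sketch of an input dispatcher for the conversion function described above.
## The name 'as_similarity_matrix' and the dist-to-similarity mapping are
## illustrative choices, not the API of the beta 'adjclust' package.
library(Matrix)

as_similarity_matrix <- function(x) {
  if (inherits(x, "dist")) {
    d <- as.matrix(x)
    max(d) - d                 # one simple dissimilarity-to-similarity mapping
  } else if (is(x, "Matrix")) {
    as.matrix(x)               # any (possibly sparse) matrix from the Matrix package
  } else if (is.matrix(x)) {
    x
  } else {
    stop("'x' must be a matrix, a Matrix object or a 'dist' object")
  }
}

## The three supported input types yield the same kind of dense matrix
x  <- matrix(rnorm(40), nrow = 20)
s1 <- as_similarity_matrix(tcrossprod(x))                        # base matrix
s2 <- as_similarity_matrix(Matrix(tcrossprod(x), sparse = TRUE)) # sparse Matrix
s3 <- as_similarity_matrix(dist(x))                              # dist object
all.equal(s1, s2); dim(s3)
```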
The issue addressed by this method is a very standard problem, especially in bioinformatics. In this field, datasets are usually very large (many thousands of items), and an efficient constrained HAC package offering all linkage options would have many applications (GWAS, Hi-C analysis, …).
Please get in touch with Pierre Neuvial and Nathalie Villa-Vialaneix for this project.
Several tests that potential students can do to demonstrate their capabilities for this particular project:
- Easy: use the R package rioja to obtain a constrained HAC, with both implemented criteria, on a use case that you have created yourself.
- Medium: ask for access to our github repository and implement:
  - a basic R function that takes a similarity matrix and calls the already implemented function to obtain a dendrogram object (a sketch of what such a wrapper could look like is given after this list)
- Hard: the package uses C++ scripts. Can the students describe the different steps of the analysis by reading these scripts and the PhD thesis? How would they propose to modify the scripts to implement other merging criteria? (Textual description only: no code is needed at this step.)
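For the "Medium" task, a minimal wrapper could look like the sketch below. Since the already-implemented function lives in the private repository, rioja::chclust() is used here as a stand-in for it; the function name constrained_hac, the assumption that the similarity is a Gram matrix, and the Gram-to-distance conversion are all illustrative choices.

```r
## Sketch of the "Medium" task: a thin wrapper taking a similarity matrix and
## returning a dendrogram. rioja::chclust() stands in here for the fast band
## implementation from the private repository.
library(rioja)

constrained_hac <- function(sim, method = "coniss") {
  stopifnot(is.matrix(sim), nrow(sim) == ncol(sim))
  ## In the Euclidean (Gram matrix) case, similarities s_ij = <x_i, x_j> and
  ## squared distances are related by d_ij^2 = s_ii + s_jj - 2 * s_ij
  d2 <- outer(diag(sim), diag(sim), "+") - 2 * sim
  chclust(as.dist(sqrt(pmax(d2, 0))), method = method)  # inherits from 'hclust'
}

## Toy usage: Gram similarities of 30 ordered 2-d points
x   <- matrix(rnorm(60), nrow = 30)
res <- constrained_hac(tcrossprod(x))
inherits(res, "hclust")     # TRUE: standard dendrogram tools apply
all(diff(res$height) >= 0)  # the monotonicity check from the 'tests' item above
plot(res)
```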
Students, please post a link to your test results here.