bigPint: Big multivariate data plotted interactively

Lindsay Rutter edited this page Apr 2, 2017 · 11 revisions

Background

It is important to plot subsets of variables in order to examine variable associations in a dataset. Traditional modeling approaches without plotting the data are problematic because models are replete with assumptions that they alone cannot call into question. By plotting the data, analysts can improve modeling; they can iterate between visualizations and modeling to enhance the models based on feedback from the visuals.

Large multivariate datasets are common across numerous disciplinary fields. Unfortunately, the most popular visualization techniques for such data are often inadequate, if not misleading. The best approaches for looking at quantitative multivariate data are scatterplots of all pairs of variables, often laid out in a matrix format; parallel coordinate plots; and replicate line plots. Each of these plots enables analysts to assess the association between multiple variables.

However, these plots are ineffective with large quantities of data: Overplotting can obscure important structure, and the plots can be slow to render if every observation is mapped to a graphical element. In this project, we aim to develop more useful visualization techniques for large multivariate datasets by incorporating appropriate summaries and using interactivity. This project will explore these new visualization techniques on large RNA-sequencing datasets with plans to explore them further using large multivariate datasets from additional fields of study.

Related work

The R packages edgeR and DESeq2 provide some of the most popular tools for RNA-sequencing analysis. Unfortunately, data visualization plays a minimal role in their analysis pipelines, and the plots that are used can be misleading. For example, two of the most common plotting techniques to assess normalization in RNA-sequencing data are the side-by-side boxplot and the MDS plot. These plots can hide problems that still exist in the data even after normalization, problems that could be better detected with parallel coordinate plots.

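[Image: side-by-side boxplots, MDS plots, and parallel coordinate plots for the altered (left) and unaltered (right) simulated data]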

The image above exemplifies this problem for a dataset containing five replicate samples, each with 50 random values drawn from a normal distribution centered at a value of 10. The 50 values from each replicate were sorted from smallest to largest. However, for the images in the left half of the plot, the second and fourth replicate values were sorted from largest to smallest. We cannot see this alteration in the data when we compare the left and right boxplots. This is because the boxplots do not show sample connections between each of the 50 observations. We can see that there may have been an alteration in the data when we compare the left and right MDS plots. Namely, the left MDS plot contains a clustering structure that suggests the second and fourth replicates are similar compared to the first, third, and fifth replicates. However, we still cannot ascertain why these replicates are different.

We can see the most information about the alteration in the data when we compare the left and right parallel coordinate plots - because we can now judge how consistent the replicate values are for each of the 50 observations. We notice that certain observations (the pinks and oranges) are inconsistent (crossing) between replicates in the left plot, while all observations are consistent (level) between replicates in the right plot. Analysts should ensure (rather than assume) that replicate values are as consistent as intended, and they can verify this most clearly with the parallel coordinate plots.
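
The simulated example above can be reproduced numerically. The following is a minimal sketch in Python (used here only for illustration; the package itself targets R). It assumes a standard deviation of 1, which the description does not state, and the names `base`, `altered`, `five_number_summary`, and `correlation` are hypothetical. It shows that the per-replicate summaries a boxplot draws are unchanged by reversing the second and fourth replicates, while the agreement between replicates across the 50 observations changes sign.

```python
import math
import random
import statistics

random.seed(1)
N_OBS, N_REPS = 50, 5

# Five replicates of 50 normal draws centered at 10 (sd = 1 is an assumption),
# each sorted from smallest to largest.
base = [sorted(random.gauss(10, 1) for _ in range(N_OBS)) for _ in range(N_REPS)]

# Altered copy: the second and fourth replicates are sorted largest to smallest.
altered = [col[::-1] if r in (1, 3) else col[:] for r, col in enumerate(base)]

def five_number_summary(col):
    """Minimum, quartiles, and maximum: the values a boxplot summarizes."""
    q1, median, q3 = statistics.quantiles(col, n=4)
    return (min(col), q1, median, q3, max(col))

def correlation(xs, ys):
    """Pearson correlation across observations for two replicates."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Boxplot summaries ignore the order of observations within a replicate.
same_boxplots = all(
    five_number_summary(b) == five_number_summary(a) for b, a in zip(base, altered)
)
print("identical boxplot summaries:", same_boxplots)
print("replicates 1 vs 2, unaltered:", round(correlation(base[0], base[1]), 3))
print("replicates 1 vs 2, altered:  ", round(correlation(altered[0], altered[1]), 3))
print("replicates 2 vs 4, altered:  ", round(correlation(altered[1], altered[3]), 3))
```

The per-observation comparison is what a parallel coordinate plot displays directly: each observation is one line across the replicates, so a reversed replicate appears as crossing lines.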

Details of your coding project

Project Goals:

To address the shortcomings of popular plotting techniques, we are focusing on expanding upon three other types of plots - the parallel coordinate plot, pairwise scatterplot matrices, and replicate line plots. There are already popular tools to visualize parallel coordinate plots and scatterplot matrices (such as, respectively, the ggparcoord() and scatmat() functions in the GGally package). However, these functions are only adequate when working with small datasets; they cannot be used with large datasets due to overplotting and time constraints. As a result, our goal is to improve upon and tailor these three types of plots so they may be useful for large multivariate data.

1) Parallel coordinate plots:

Parallel coordinate plots are an essential visual tool when checking the initial normalization process of multivariate data. If the normalization is adequate, then the connections between replicates should be flat, and most crossings should be between treatment groups. For instance, in the image below, we examined parallel coordinate plots for 16 different subsets of observations from an RNA-sequencing dataset that contained two treatment groups and three replicates. We can immediately see from the parallel coordinate plots that data subsets 5 and 6 would be of interest (having small differences between replicates and large differences between treatments).

[Image: parallel coordinate plots for 16 subsets of observations from the RNA-sequencing dataset]

While parallel coordinate plots are useful for small datasets, their effectiveness is severely limited for large datasets. With one line being drawn for each observation, the resulting plots take too long to build and have too many overlapping lines to discern patterns of interest. As a result, our goal is to develop a new approach that both quickly and meaningfully displays the key pieces of information for large multivariate data with parallel coordinate plots.

2) Pairwise scatterplot matrices:

Pairwise scatterplots allow us to examine all pairwise combinations of samples. If the data have lower variability between replicates than between treatments (as is desired), then we would expect the scatterplot observations to fall more closely along the x=y line between replicates than between treatments. For instance, in the image below, we examined a pairwise scatterplot matrix from an RNA-sequencing dataset that contained two treatment groups with three replicates. We can immediately verify that the scatterplots between treatments (in the purple box) have more spread around the x=y line than the scatterplots between replicates (outside the purple box). We can also immediately mine for observations that would be of interest (in the green circle) and of concern (in the red circle).

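[Image: pairwise scatterplot matrix for the RNA-sequencing dataset, with between-treatment panels in the purple box]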

While pairwise scatterplot matrices are useful for small datasets, their effectiveness is limited for large datasets. With one dot being drawn for each pairwise combination of samples for each observation, the resulting plots take too long to build and have too many overlapping dots to effectively view patterns of interest. As a result, our goal is to develop a new approach that both quickly and meaningfully displays the key pieces of information for large multivariate data with scatterplot matrices.
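
The expectation that replicate pairs hug the x=y line more tightly than treatment pairs can also be checked numerically. The sketch below is in Python for illustration only and uses simulated toy data, not the RNA-sequencing dataset. The sample names (`A.1` to `B.3`), the noise levels, and the helper `mean_distance_to_identity` are all assumptions made for this example.

```python
import math
import random
from itertools import combinations

random.seed(2)

def mean_distance_to_identity(xs, ys):
    """Mean perpendicular distance of the points (x, y) from the line x = y."""
    return sum(abs(x - y) for x, y in zip(xs, ys)) / (math.sqrt(2) * len(xs))

# Hypothetical toy data: 1000 observations, two treatments (A, B),
# three replicates each. Replicates differ only by small noise;
# treatment B adds a per-observation shift.
n_obs = 1000
baseline = [random.gauss(8, 2) for _ in range(n_obs)]
effect = [random.gauss(0, 1.5) for _ in range(n_obs)]
samples = {}
for rep in (1, 2, 3):
    samples[f"A.{rep}"] = [b + random.gauss(0, 0.3) for b in baseline]
    samples[f"B.{rep}"] = [b + e + random.gauss(0, 0.3) for b, e in zip(baseline, effect)]

# One value per panel of the scatterplot matrix (15 sample pairs).
within, between = [], []
for s, t in combinations(sorted(samples), 2):
    spread = mean_distance_to_identity(samples[s], samples[t])
    same_treatment = s.split(".")[0] == t.split(".")[0]
    (within if same_treatment else between).append(spread)

print("largest spread between replicates: ", round(max(within), 3))
print("smallest spread between treatments:", round(min(between), 3))
```

A single number per panel loses the detail that makes the scatterplot matrix useful for spotting individual observations, so it is a complement to the plot rather than a substitute.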

3) Replicate line plots:

In contrast to scatterplot matrices, replicate line plots allow users to view replication and treatment differences simultaneously for observations of interest. For a given observation, the point formed by the two treatment values for one replicate is connected by a line to the point formed by the two treatment values for another replicate. The lines of interest would be those that are short (consistency between replicates) and far from the x=y line (large difference between treatments). While this plot has been developed and shown to be useful for two replicates, we wish to expand upon it so that it can be used for more than two replicates. For instance, in the image below, we used replicate line plots to examine low p-value observations from an RNA-sequencing dataset that contained two treatment groups with three replicates. The left image examines only two replicates, and the right image examines all three replicates (which we represented as boxes with edges corresponding to the minimum and maximum values of the replicates).

[Image: replicate line plots for two replicates (left) and three replicates shown as boxes (right)]

While replicate line plots are useful for small datasets, their effectiveness is limited for large datasets due to time and space constraints. As a result, our goal is to develop a new approach that both quickly and meaningfully displays the key pieces of information for large multivariate data with replicate line plots.
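
The geometry behind a replicate line plot is simple enough to sketch. The Python functions below are illustrative only: `replicate_line` and `replicate_box` are hypothetical names, not functions from the package. For one observation, each replicate is a point (treatment A value, treatment B value). The line length measures disagreement between replicates, the distance of the line's midpoint from x=y measures the treatment difference, and the box generalizes the line to more than two replicates.

```python
import math

def replicate_line(rep1, rep2):
    """Summarize the line joining two replicates of one observation.

    rep1 and rep2 are (treatment A value, treatment B value) pairs.
    Returns (length, offset): the segment length and the perpendicular
    distance of its midpoint from the line x = y.
    """
    (a1, b1), (a2, b2) = rep1, rep2
    length = math.hypot(a2 - a1, b2 - b1)
    mid_a, mid_b = (a1 + a2) / 2, (b1 + b2) / 2
    offset = abs(mid_a - mid_b) / math.sqrt(2)
    return length, offset

def replicate_box(reps):
    """Box for any number of replicates: (min A, max A, min B, max B)."""
    a_values = [a for a, _ in reps]
    b_values = [b for _, b in reps]
    return min(a_values), max(a_values), min(b_values), max(b_values)

# An observation of interest: replicates agree, treatments differ.
print(replicate_line((2.0, 9.0), (2.2, 9.1)))
# The same idea with three replicates, drawn as a box instead of a line.
print(replicate_box([(2.0, 9.0), (2.2, 9.1), (1.9, 8.8)]))
```

Observations with a small `length` and a large `offset` correspond to the short lines far from x=y described above.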

Project Progress:

We have been using plotly to incorporate interactivity within a plot so that users can quickly identify observations of interest. Likewise, we have been using Shiny to incorporate interactivity between plots so that users can quickly obtain information that the various plots provide for observations of interest. We have also been exploring hexagonal binning to mitigate issues with speed and space.
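
To illustrate why hexagonal binning helps with speed and space, here is a minimal Python sketch of the binning step itself. It is not the implementation used in the project, and the function name `hexbin` and its `width` parameter are assumptions for this example. Each point is assigned to the nearest hexagon center, found by comparing the nearest centers on two offset rectangular lattices that together form a hexagonal grid.

```python
import math
import random
from collections import Counter

def hexbin(points, width):
    """Count points per hexagon; keys are hexagon centers (cx, cy).

    Centers lie on two rectangular lattices, one offset from the other by
    half a cell. Their union is a hexagonal grid whose neighboring centers
    are `width` apart.
    """
    height = width * math.sqrt(3)
    counts = Counter()
    for x, y in points:
        # Nearest center on the first lattice.
        ax = round(x / width) * width
        ay = round(y / height) * height
        # Nearest center on the offset lattice.
        bx = (math.floor(x / width) + 0.5) * width
        by = (math.floor(y / height) + 0.5) * height
        if (x - ax) ** 2 + (y - ay) ** 2 <= (x - bx) ** 2 + (y - by) ** 2:
            counts[(ax, ay)] += 1
        else:
            counts[(bx, by)] += 1
    return counts

random.seed(3)
points = [(random.gauss(0, 1), random.gauss(0, 1)) for _ in range(100_000)]
bins = hexbin(points, 0.25)
print(len(points), "points reduced to", len(bins), "hexagons")
```

A plot then draws one shaded hexagon per bin rather than one dot per observation, which addresses both the rendering time and the overplotting described above.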

Expected impact

This package will be useful for the R community because it will fill a gap in the visualization methods currently available for analyzing large multivariate datasets effectively. It could help people analyzing RNA-sequencing data, as well as people performing factor analysis, discriminant analysis, and principal component analysis.

Mentors

Dr. Dianne Cook (dicook@monash.edu)

Roxane Legaie (roxane.legaie@monash.edu)

Skills required

Applicants should have:

  • Proficiency with JavaScript, Shiny, R, and Plotly.
  • Familiarity with the R packages htmlwidgets and ggplot2.
  • Familiarity with R package development tools such as GitHub, Roxygen2, and devtools.
  • Graduate education in statistics or a related field.

Project milestones

Phase 1

Develop and improve upon scatterplot matrices, parallel coordinate plots, and replicate line plots. Create faster and more informative versions of these plots.

Phase 2

Develop and improve upon linking between scatterplot matrices, parallel coordinate plots, and replicate line plots. Create faster methods of linking between these various types of plots and better ways to visualize links between many plots effectively in an application.

Phase 3

Test the newest versions of the developed plots (and linked plots) with RNA-sequencing data and large multivariate data from other fields. Begin working on the package vignette.

Tests

A successful applicant will:

  • Easy: Briefly discuss the proposed package functionality.
  • Medium: Using the R diamonds dataset, create a scatterplot matrix and parallel coordinate plot.
  • Hard: Using the htmlwidgets package, create an image of a boxplot with a parallel coordinate plot superimposed. The boxplot should be a fixed background, but the parallel coordinate plot can be redrawn if the user interacts with it in some manner.

Solutions of tests