Skip to content

Violin Plot Implementation

YanayRosen edited this page Jun 30, 2017 · 10 revisions

Violin Plot Implementation

Explanation for violin plot code.

Intro

A violin plot is a more detailed, yet actually less accurate, version of a box plot or histogram.

Violin Plots rely on a "Kernel Density Estimation" to determine the relative likelihood of finding points at a location within a set of data, which in this case is one dimensional.

Violin Plots represent the kernel density, quantiles, point locations and median of an array of 1 dimensional data.

Because violin plots are not a default of plotly.js, and because ruby and javascript are not usually used for math, a number of custom functions needed to be implemented.

Math Functions

There are three classes of function that had to be implemented in order to create the violin plots.

Kernel Estimation

These functions relate specifically to generating kernel density estimates.

Kernel Density is explained in depth at wikipedia:

Let (x1, x2, …, xn) be an independent and identically distributed sample drawn from some distribution with an unknown density ƒ. We are interested in estimating the shape of this function ƒ. Its kernel density estimator is Kernel Density Formula where K(•) is the kernel — a non-negative function that integrates to one and has mean zero — and h > 0 is a smoothing parameter called the bandwidth. A kernel with subscript h is called the scaled kernel and defined as Kh(x) = 1/h K(x/h).

https://en.wikipedia.org/wiki/Kernel_density_estimation#Definition

kernelDensityEstimator

This is the main function used to estimate kernel density.

Implemented from https://gist.github.com/mbostock/4341954

kernels

These are the kernels themselves, implemented from a few sources.

The epanechnikov, unifrom, triangular, quartic, triweight and cosine kernels are from https://gist.github.com/mbostock/4341954 and https://github.com/jasondavies/science.js.

The gaussian, tricube, logistic, sigmoid and_silverman_ kernels were all written by Yanay.

The only two kernels implemented currentely are gaussian and epanechnikov with gaussian being the default, as it is in ggplot2, the plotting library used by Seurat.

Bandwidth Calculations

The bandwidth is the window that scrubs the kernel density, or h in the the kernel density function. The larger the bandwidth, the smoother the estimate. This can cause over smoothing or roughness, depending on too high or too low of a bandwidth.

Bandwidth is estimated using different methods. The two implemented are nrd0 and SJ-STE, with nrd0 being the default, as it is in ggplot2, the plotting library used by Seurat.

nrd0 is implemented based on https://en.wikipedia.org/wiki/Kernel_density_estimation#Bandwidth_selection and https://stat.ethz.ch/R-manual/R-devel/library/stats/html/bandwidth.html. Located in kernel_fucntions.js

SJ-STE, or Sheather Jones bandwidth, is converted from python to javascript based on https://github.com/Neojume/pythonABC. SJ-STE uses multiple matrix calculations, so can be very resource intensive with larger datasets, which is warned against in the portal. It is somewhat smoother, however. SJ-STE is located in sheather_jones.js

Specific Math Functions

Javascript and Ruby lack very advanced math functions that are needed for kernel estimation. Thus, these functions needed to be implemented:

Simple Statistics- This library was added to do basic calculations like get the median, quantile ranges and standard deviation.

isOutlier- for plotting reasons outliers need to be calculated (and removed from the array of jitter points). The definition of an outlier is from http://www.itl.nist.gov/div898/handbook/prc/section1/prc16.htm

Matrix Operations- Dot multiplication and matrix addition/subtraction needed to be implemented in order to calculate sheather-jones bandwidth. Methods are sourced from https://github.com/Neojume/pythonABC, https://github.com/errcw/gaussian/blob/master/lib/gaussian.js and https://stackoverflow.com/a/27205341.

Custom Quartile- Sheather Jones bandwidth calculation requires the use of a specific quartile method using linear interpolation. This was implemented based off of code from https://docs.scipy.org/doc/numpy-dev/reference/generated/numpy.percentile.html.

Plotly

Creating the violin plots in plotly require a large amount of data formatting.

The function that creates the plotly data is createTracesAndLayout. This function takes a formatted array and creates the array of plotly data for violin plots. Each violin plot is a combination of 9 plotly plots.

Plot 1- The right side of the violin plot and its fill.

Plot 2- The left side (mirror of right side) of the violin plot and its fill.

Plot 3- The jittered points of the data being plotted. X values are random and have no meaning.

Plot 4- The center line of the violin.

Plot 5- The line that represents the quartile range.

Plot 6- The median white square.

Plot 7- Outlier Points.

Plot 8- The bottom capping line of the violin.

Plot 9- The top capping line of the violin.

All 9 plots are slaved to the second plot so they all appear in the legend as one item and can be hid and shown as one plot.

Multiple violin plots are plotted on the same axis, so a x_offset is calculated each iteration to separate graphs.

Site Logic

The site logic implemented involved switching between violin and box plots. Since the violin plot takes the same data, just formatted slightly differently, as the box plot, this was easy to implement. The parameters for rendering gene expression now include a plot type (and kernel and bandwidth type if loading a violin) parameter. The expression_annotation_plots_plotly view combines with small logic in the site controller to math the plot being displayed to the plot_type parameter, with violin being the default.

If the type is box, the logic plots the data and layout arrays for box plots, otherwise, for violins, the logic plots the data array for violins.

Caching is also implemented.

Clone this wiki locally