# Better than Average: Calculating Geometric Means Using SQL

# Abstract

Geometric means are a robust way to describe the central tendency of a data set, particularly when examining skewed data or comparing ratios. Measures of central tendency are predominantly presented as arithmetic means or medians, which are relatively simple to calculate and interpret but may misrepresent data that are not _strictly normal_. Geometric means offer the best of both worlds: they take every observation in a data set into account without being unduly influenced by the extremes. They can be employed by data analysts working in many industries, including business, finance, health care, and research. Examples include examining compounded interest rates or returns on investments, assessing population changes in longitudinal data, and investigating lognormal data such as lab assay results, biological concentrations, or decay rates.

While most databases provide a function to calculate the arithmetic mean, most do not provide a built-in function to calculate the geometric mean. We will look at how to calculate the geometric mean using SQL.

# Introduction

## What is a geometric mean?
Geometric means are a type of _average_, or _measure of central tendency_, for a distribution of data points, in the same group as the median, mode, and arithmetic mean. Whereas the arithmetic mean is calculated by summing a series of data points and then dividing that sum by the number of data points, the geometric mean multiplies the data points together and then takes the nth root of that product, where n is the number of data points. Because it is built from products rather than sums, the geometric mean is less sensitive to a few very large values than the arithmetic mean.

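In symbols, the two definitions can be written as below. The notation is an assumption made for illustration: the data points are labelled $x_1, \dots, x_n$ and are taken to be strictly positive. The last equality is the standard logarithm identity, which is one common reason the geometric mean can be computed with ordinary averaging functions.

$$
\bar{x}_{\text{arithmetic}} = \frac{1}{n}\sum_{i=1}^{n} x_i,
\qquad
\bar{x}_{\text{geometric}} = \left(\prod_{i=1}^{n} x_i\right)^{1/n} = \exp\!\left(\frac{1}{n}\sum_{i=1}^{n} \ln x_i\right)
$$
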
The geometric mean is easiest to visualize when it is applied to its counterpart, the geometric series, in which each number increases from the previous one by the same proportion. For the series 3, 9, 27, 81, 243, the geometric mean lies at the middle value of the series, whereas the arithmetic mean is _pulled_ towards the higher values and thus does not truly represent the center of the data.

$$\text{Arithmetic mean} = \frac{3 + 9 + 27 + 81 + 243}{5} = 72.6 \qquad \text{(Equation 1)}$$

$$\text{Geometric mean} = \sqrt[5]{3 \times 9 \times 27 \times 81 \times 243} = 27 \qquad \text{(Equation 2)}$$

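The two calculations can be reproduced with a single SQL query. The sketch below is one possible approach, not necessarily the query the article goes on to use: it relies on the identity that the geometric mean equals `EXP(AVG(LN(x)))` for strictly positive values. The table name `series`, the column `x`, and the helper `sql_means` are hypothetical, and Python's built-in `sqlite3` module is used only so the query can be run without a database server.

```python
import math
import sqlite3


def sql_means(values):
    """Return (arithmetic_mean, geometric_mean) as computed by SQLite."""
    conn = sqlite3.connect(":memory:")
    # Register LN and EXP explicitly: not every SQLite build includes them.
    conn.create_function("LN", 1, math.log)
    conn.create_function("EXP", 1, math.exp)
    conn.execute("CREATE TABLE series (x REAL NOT NULL)")
    conn.executemany("INSERT INTO series (x) VALUES (?)", [(v,) for v in values])
    # AVG(x) is the arithmetic mean; EXP(AVG(LN(x))) is the geometric mean.
    row = conn.execute("SELECT AVG(x), EXP(AVG(LN(x))) FROM series").fetchone()
    conn.close()
    return row


arithmetic, geometric = sql_means([3, 9, 27, 81, 243])
print(f"arithmetic mean = {arithmetic:.1f}, geometric mean = {geometric:.1f}")
# prints: arithmetic mean = 72.6, geometric mean = 27.0
```

Working with logarithms also avoids multiplying all the values together, a product that can overflow for long columns. In a database that already provides `LN` and `EXP`, the `SELECT` statement should work without the registration step.
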
## When should I use the geometric mean instead of the arithmetic mean?
There are no hard rules for which mean you should use. Different types of averages express slightly different concepts: the center of the data, the values most often seen, and the typical "expected" values may or may not all be conveyed by the same measure. Data is rarely perfect, and you may need to look at several different types of averages to decide which works best for what you are trying to communicate. In general, though, geometric means are preferable when looking at skewed data, scaled data, or ratios. Some common applications include:

* Population growth
* Compounding interest
* Bioassays
* Radioactive decay
* Dose-response relationships
* Count data
* Time series data
* Longitudinal data
* Repeated measures data
* Bioequivalence trials

Figure 1: Geometric Series with Means Highlighted

If your data involve rate changes or changes over time, they may be skewed. Often these data have a lognormal distribution, and the geometric mean describes the center of lognormal data well, because the geometric mean of a lognormal distribution equals its median. In addition to skew, you should also consider the size of your sample. When working with small samples, the sensitivity of the arithmetic mean to extreme values can be problematic. The geometric mean might be a better central measure, as it considers all of the data points without being subject to the same “pull” that can distort the interpretation of the arithmetic mean (Figure 2).

Figure 2: Comparison of Means for a Small Sample (McChesney, 2016)

In the above case, you should first confirm that 90 is a valid data point and not an error. Trimming outliers, so long as it is justifiable, is another way to produce more stable mean calculations.

Geometric means are also appropriate when summarizing ratios or percentages. This has many applications in medicine, where the geometric mean is considered the “gold standard” for calculating certain health measurements. In the financial industry, it is applied when constructing stock indexes and calculating rates of return. The geometric mean is also employed in the art world, to choose aspect ratios for film and video. The idea of comparing ratios extends to scaled data: if your data have different attributes or scales, and you have normalized the results as ratios to reference values, the geometric mean is the correct mean to use.

## Considerations

The calculation of the geometric mean requires that all values be strictly positive. So what should you do if your data do not meet this requirement? If you have values that equal zero, you have a few options:

1. Shift your scale by adding 1 to every number in the data set, and then subtract 1 from the resulting geometric mean.
2. Ignore zeros or missing data in your calculations.
3. Convert zeros to a very small positive number (often described as “below the detection limit”) that is less than the next smallest number in the data set.

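The three options above can be sketched as three variants of the same query. This is a minimal illustration under stated assumptions, not a recommendation of one option over another: the table `obs`, the column `x`, the strategy names, the helper `geometric_mean_with_zeros`, and the `floor` parameter are all hypothetical, and the geometric mean is computed as `EXP(AVG(LN(x)))` through Python's built-in `sqlite3` module.

```python
import math
import sqlite3

# Hypothetical strategy names mapped to the three options listed above.
QUERIES = {
    # Option 1: add 1 to every value, then subtract 1 from the result.
    "shift": "SELECT EXP(AVG(LN(x + 1))) - 1 FROM obs",
    # Option 2: skip zeros; this filter also drops NULL (missing) values.
    "ignore": "SELECT EXP(AVG(LN(x))) FROM obs WHERE x > 0",
    # Option 3: replace zeros with a small positive stand-in value.
    "floor": "SELECT EXP(AVG(LN(CASE WHEN x = 0 THEN :floor ELSE x END))) FROM obs",
}


def geometric_mean_with_zeros(values, strategy, floor=None):
    """Geometric mean of non-negative values under one zero-handling strategy."""
    if strategy == "floor" and (floor is None or floor <= 0):
        raise ValueError("the 'floor' strategy needs a positive floor value")
    conn = sqlite3.connect(":memory:")
    # Register LN and EXP explicitly: not every SQLite build includes them.
    conn.create_function("LN", 1, math.log)
    conn.create_function("EXP", 1, math.exp)
    conn.execute("CREATE TABLE obs (x REAL)")
    conn.executemany("INSERT INTO obs (x) VALUES (?)", [(v,) for v in values])
    # Only the 'floor' query contains a named placeholder.
    params = {"floor": floor} if strategy == "floor" else ()
    (result,) = conn.execute(QUERIES[strategy], params).fetchone()
    conn.close()
    return result


data = [0, 1, 3, 7, 15]
for name in ("shift", "ignore", "floor"):
    print(name, round(geometric_mean_with_zeros(data, name, floor=0.5), 3))
```

The three strategies give noticeably different answers for the same data, so it is worth reporting which one was used alongside the result.
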
If you have negative numbers, you will need to convert them to positive values before calculating the geometric mean; you can then give the resulting geometric mean a negative sign. If your data set contains both positive and negative values, you will have to separate them, find the geometric mean of each group, and then take the weighted average of the two group means to obtain the geometric mean for the full data set.

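The mixed-sign procedure can be sketched as a query that groups values by sign. Several details here are assumptions, since the text does not specify them: each group is weighted by its number of observations, zeros are excluded, and the names `obs`, `x`, `sgn`, `magnitude`, and `signed_geometric_mean` are hypothetical. The query is run through Python's built-in `sqlite3` module.

```python
import math
import sqlite3

SIGNED_GM_SQL = """
WITH signed AS (
    -- Split each value into a sign and a positive magnitude; zeros are dropped.
    SELECT CASE WHEN x < 0 THEN -1 ELSE 1 END AS sgn, ABS(x) AS magnitude
    FROM obs
    WHERE x <> 0
),
per_sign AS (
    -- One geometric mean per sign group, plus the group size.
    SELECT sgn, COUNT(*) AS n, EXP(AVG(LN(magnitude))) AS gm
    FROM signed
    GROUP BY sgn
)
-- Re-attach the sign and average the group means, weighted by group size.
SELECT SUM(sgn * gm * n) / SUM(n) FROM per_sign
"""


def signed_geometric_mean(values):
    """Count-weighted average of the signed geometric mean of each sign group."""
    conn = sqlite3.connect(":memory:")
    # Register LN and EXP explicitly: not every SQLite build includes them.
    conn.create_function("LN", 1, math.log)
    conn.create_function("EXP", 1, math.exp)
    conn.execute("CREATE TABLE obs (x REAL NOT NULL)")
    conn.executemany("INSERT INTO obs (x) VALUES (?)", [(v,) for v in values])
    (result,) = conn.execute(SIGNED_GM_SQL).fetchone()
    conn.close()
    return result


# Negative group: -sqrt(2 * 8) = -4; positive group: sqrt(3 * 27) = 9.
print(round(signed_geometric_mean([-2, -8, 3, 27]), 6))
```

With two values in each group, the example averages -4 and 9 with equal weight. A different weighting scheme would change the result, which is part of why the handling of negative values is contested.
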
If none of these options appeals to you, you are not alone! Statisticians disagree about the best method for dealing with these values. You may want to calculate several types of averages and decide what makes the most sense for you and the results you are trying to report.

# Conclusion
If you are working with non-normal data, you should consider using the geometric mean as the measure of central tendency. For data that are skewed, scaled, or proportional, the geometric mean is a more robust and accurate way than the arithmetic mean to find the average or expected value.