|
3 | 3 |
|
4 | 4 | # Abstract
|
5 | 5 |
|
6 |
| -Geometric means are a robust and precise way to visualize the central tendency of a data set, particularly when examining skewed data or comparing ratios. Measures of central tendency are predominantly presented as arithmetic means or medians that are relatively simple to calculate and interpret, but may be inaccurate in representing data that are not _strictly normal_. Geometric means represent the best of both worlds, providing estimates that take into account all the observations in a data set without being influenced by the extremes. They can be employed by data analysts working in multiple industries including business, finance, health care, and research. Examples are varied and include examining |
7 |
| -compounded interest rates or returns on investments, assessing population changes in longitudinal data, or investigating lognormal data such as lab assay results, biological concentrations, or decay rates. |
| 6 | +Geometric means are a robust and precise way to visualize the central tendency of a data set, particularly when examining skewed data or comparing ratios. Measures of central tendency are predominantly presented as arithmetic means or medians that are relatively simple to calculate and interpret, but may be inaccurate in representing data that are not _strictly normal_. Geometric means represent the best of both worlds, providing estimates that take into account all the observations in a data set without being influenced by the extremes. |
8 | 7 |
|
9 | 8 | While most databases provide a function to calculate the arithmetic mean, few of them provide a built-in function to calculate the geometric mean. We will look at how to calculate the geometric mean using SQL.
|
10 | 9 |
|
11 | 10 | # Introduction
|
12 | 11 |
|
13 | 12 | ## What is a geometric mean?
|
14 |
| -Geometric means are a type of _average_, or _measure of central tendency_ in a distribution of data points, in the same group as the median, mode, or arithmetic mean. Whereas the arithmetic mean is calculated by summing a series of data points and then dividing that sum by the number of data points, the geometric mean multiplies a series of data points, and then uses the n number of data points to find the nth root of that product. Mathematically, the geometric mean adds depth and stability to |
15 |
| -the mean. |
16 |
| - |
17 |
| -We can easily visualize the geometric mean when applying it to its counterpart, the geometric series of numbers, where each number increases from the previous number according to the same proportion. The geometric mean will lie in the direct center of the values, whereas the arithmetic mean would have been _pulled_ towards the higher values, and thus not truly represent the center of the data. |
18 |
| - |
19 |
| -$\text{Arithmetic mean} = \frac{3 + 9 + 27 + 81 + 243}{5} = 72.6$ (See Equation 1) |
21 |
| -$\text{Geometric mean} = \sqrt[5]{3 \times 9 \times 27 \times 81 \times 243} = 27$ (See Equation 2) |
| 13 | +Geometric means are a type of _average_, or _measure of central tendency_ in a distribution of data points, in the same group as the median, mode, or arithmetic mean. Whereas the arithmetic mean is calculated by summing a series of data points and then dividing that sum by the number of data points, the geometric mean multiplies a series of data points, and then uses the n number of data points to find the nth root of that product. Mathematically, the geometric mean adds depth and stability to the mean. |
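
For reference, the verbal definition above can be written compactly. The symbols $x_1, \dots, x_n$ and $n$ are notation introduced here for illustration, not taken from the original text, and all values are assumed to be strictly positive:

$$
\text{GM} = \left( \prod_{i=1}^{n} x_i \right)^{1/n} = \exp\left( \frac{1}{n} \sum_{i=1}^{n} \ln x_i \right)
$$

The second form follows from taking the logarithm of the product. It is the form that maps most naturally onto aggregate functions such as `SUM`, `COUNT`, `LN`, and `EXP`, and it avoids the overflow that multiplying many values directly can cause.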
23 | 14 |
|
24 | 15 | ## When should I use the geometric mean instead of the arithmetic mean?
|
25 |
| -There are no hard rules for which mean you should use. Different types of averages can be used to express slightly different concepts: the center of the data, the values most often seen, and/or the typical "expected" values may or may not all be conveyed by the same measure. Data is rarely perfect, and you may need to look at several different types of averages to decide what works best for what you are trying to communicate with your data. But in general, geometric means are preferable when looking at skewed |
26 |
| -data, scaled data, or when averaging ratios. Some common applications include: |
27 |
| - |
28 |
| -* Population growth |
29 |
| -* Compounding interest |
30 |
| -* Bioassays |
31 |
| -* Radioactive decay |
32 |
| -* Dose-response relationships |
33 |
| -* Count data |
34 |
| -* Time Series data |
35 |
| -* Longitudinal data |
36 |
| -* Repeated measures data |
37 |
| -* Bioequivalence trials |
| 16 | +There are no hard rules for which mean you should use. Different types of averages can be used to express slightly different concepts: the center of the data, the values most often seen, and/or the typical _expected_ values may or may not all be conveyed by the same measure. Data is rarely perfect, and you may need to look at several different types of averages to decide what works best for what you are trying to communicate with your data. But in general, geometric means are preferable when looking at skewed data, scaled data, or when averaging ratios. |
38 | 17 |
|
39 |
| -If your data involve rate changes or changes over time, your data may be skewed. Often these data have a lognormal distribution, and the geometric mean describes the center of lognormal data perfectly. |
| 18 | +### Illustrative Example |
40 | 19 |
|
41 |
| -In the following example the arithmetic mean is _pulled_ towards the higher pay rates. Notice that the average (arithmetic mean) is 266, whereas most employees earn less than 200. The geometric mean might be a better central measure for this dataset, as it will consider all of the data points, but without being subject to the same _pull_ that can deteriorate the interpretation of the arithmetic mean (Figure 1): |
| 20 | +Let's take the pay rate for employees in an organization. Most of the Individual Contributors earn less than 200K. The CFO, VP and Director skew the dataset. |
42 | 21 |
|
43 | 22 | ||
|
44 | 23 | |:--:|
|
45 | 24 | |Figure 1: Comparison of Means|
|
46 | 25 |
|
| 26 | +In the above example the arithmetic mean is _pulled_ towards the higher pay rates. Notice that the average (arithmetic mean) is 266, whereas _most employees earn less than 200_. The geometric mean might be a better central measure for this dataset, as it considers all of the data points without being subject to the same _pull_ that can distort the interpretation of the arithmetic mean (Figure 1). |
| 27 | + |
47 | 28 | Geometric means are also appropriate when summarizing ratios or percentages. In the financial industry, this concept is applied when constructing stock indexes and rates of return. The geometric mean is also employed in the art world, to choose aspect ratios for film and video. The idea of comparing ratios is expanded when you look at scaled data: if you have data that have different attributes or scales, and you have normalized the results to be presented as ratios to reference values, the geometric mean is the correct mean to use.
|
48 | 29 |
|
| 30 | +### Geometric Mean Calculation using SQL |
| 31 | + |
| 32 | +```sql |
| 33 | +select EXP(SUM(LN(pay))/COUNT(pay)) from employee; |
| 34 | +``` |
| 35 | + |
| 36 | +Alternatively, if you are using Google BigQuery, you can create a user-defined aggregate function for the geometric mean as follows: |
| 37 | + |
| 38 | +```sql |
| 39 | +CREATE TEMP AGGREGATE FUNCTION geometric_mean( |
| 40 | + column_values float64 |
| 41 | +) |
| 42 | +RETURNS float64 |
| 43 | +AS |
| 44 | +( |
| 45 | + EXP(SUM(LN(column_values))/COUNT(column_values)) |
| 46 | +); |
| 47 | + |
| 48 | +with test_data as ( |
| 49 | + SELECT 1 AS col1 |
| 50 | + UNION ALL |
| 51 | + SELECT 3 |
| 52 | + UNION ALL |
| 53 | + SELECT 5 |
| 54 | +) |
| 55 | +select geometric_mean(col1) from test_data; |
| 56 | +``` |
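
As a sanity check on the test data above (the values 1, 3, and 5), the expected result can be worked out by hand. This is only an illustrative calculation, rounded to three decimals:

$$
\sqrt[3]{1 \times 3 \times 5} = \sqrt[3]{15} = \exp\left( \frac{\ln 1 + \ln 3 + \ln 5}{3} \right) \approx 2.466
$$

The arithmetic mean of the same three values is 3, so the query should return a value somewhat below it.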
| 57 | + |
| 58 | + |
| 59 | + |
| 60 | + |
49 | 61 | ## Considerations
|
50 | 62 |
|
51 | 63 | The calculation of the geometric mean requires that all values be strictly positive. So what should you do if you have data that do not meet this requirement? If you have values that equal zero, you have a few options:
|
|