
Commit 801b29f

Merge pull request #45 from phobson/refactor-algo

Refactor into algo

2 parents 7c68aba + cfebdb6, commit 801b29f

File tree

9 files changed: +576 −372 lines

docs/tutorial/closer_look_at_plot_pos.ipynb

Lines changed: 16 additions & 3 deletions
@@ -5,7 +5,8 @@
 "metadata": {},
 "source": [
 "# Using different formulations of plotting positions\n",
-"### Looking at normal vs Weibull scales + Cunnane vs Weibull plotting positions\n",
+"\n",
+"## Computing plotting positions\n",
 "\n",
 "When drawing a percentile, quantile, or probability plot, the potting positions of ordered data must be computed.\n",
 "\n",
@@ -102,6 +103,13 @@
 " ax2.set_ylabel('Weibull Probability Scale')"
 ]
 },
+{
+"cell_type": "markdown",
+"metadata": {},
+"source": [
+"## Normal vs Weibull scales and Cunnane vs Weibull plotting positions"
+]
+},
 {
 "cell_type": "markdown",
 "metadata": {},
@@ -173,6 +181,8 @@
 "source": [
 "This demostrates that the different formulations of the plotting positions vary most at the extreme values of the dataset. \n",
 "\n",
+"### Hazen plotting positions\n",
+"\n",
 "Next, let's compare the Hazen/Type 5 (α=0.5, β=0.5) formulation to Cunnane.\n",
 "Hazen plotting positions (shown as red triangles) represet a piece-wise linear interpolation of the emperical cumulative distribution function of the dataset.\n",
 "\n",
@@ -205,6 +215,8 @@
 "cell_type": "markdown",
 "metadata": {},
 "source": [
+"### Summary\n",
+"\n",
 "At the risk of showing a very cluttered and hard to read figure, let's throw all three on the same normal probability scale:"
 ]
 },
@@ -266,8 +278,9 @@
 }
 ],
 "metadata": {
+"anaconda-cloud": {},
 "kernelspec": {
-"display_name": "Python 3",
+"display_name": "Python [default]",
 "language": "python",
 "name": "python3"
 },
@@ -281,7 +294,7 @@
 "name": "python",
 "nbconvert_exporter": "python",
 "pygments_lexer": "ipython3",
-"version": "3.5.1"
+"version": "3.5.2"
 }
 },
 "nbformat": 4,
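The hunks above add section headings around comparisons of the Cunnane, Weibull, and Hazen (α=0.5, β=0.5) plotting-position formulations. As an illustrative aside (not part of this commit), all of these are instances of the general formula p_i = (i - alpha) / (n + 1 - alpha - beta); a minimal numpy sketch, with the standard Weibull (alpha=beta=0) and Cunnane (alpha=beta=0.4) parameter values assumed:

```python
import numpy


def plotting_positions(n, alpha, beta):
    # General alpha/beta formulation: p_i = (i - alpha) / (n + 1 - alpha - beta)
    i = numpy.arange(1, n + 1)
    return (i - alpha) / (n + 1 - alpha - beta)


# Weibull (alpha=0, beta=0) reduces to i / (n + 1)
weibull = plotting_positions(4, 0, 0)      # [0.2, 0.4, 0.6, 0.8]

# Hazen (alpha=0.5, beta=0.5) reduces to (i - 0.5) / n
hazen = plotting_positions(4, 0.5, 0.5)    # [0.125, 0.375, 0.625, 0.875]

# Cunnane (alpha=0.4, beta=0.4), the probscale default for probability plots
cunnane = plotting_positions(4, 0.4, 0.4)
```

The differences between formulations are largest at the extreme ranks, which is exactly what the notebook's figures illustrate.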

docs/tutorial/closer_look_at_viz.ipynb

Lines changed: 5 additions & 2 deletions
@@ -436,7 +436,7 @@
 "cell_type": "markdown",
 "metadata": {},
 "source": [
-"## Adding best-fit lines\n",
+"## Best-fit lines\n",
 "\n",
 "Adding a best-fit line to a probability plot can provide insight as to whether or not a dataset can be characterized by a distribution.\n",
 "\n",
@@ -446,6 +446,8 @@
 "Visual attributes of the line can be controled with the `line_kws` parameter.\n",
 "If you want label the best-fit line, that is where you specify its label.\n",
 "\n",
+"### Simple examples\n",
+"\n",
 "The most trivial case is a P-P plot with a linear data axis"
 ]
 },
@@ -704,8 +706,9 @@
 }
 ],
 "metadata": {
+"anaconda-cloud": {},
 "kernelspec": {
-"display_name": "Python 3",
+"display_name": "Python [default]",
 "language": "python",
 "name": "python3"
 },

docs/tutorial/getting_started.ipynb

Lines changed: 7 additions & 3 deletions
@@ -74,6 +74,8 @@
 "source": [
 "## Background\n",
 "\n",
+"### Built-in matplotlib scales\n",
+"\n",
 "To the casual user, you can set matplotlib scales to either \"linear\" or \"log\" (logarithmic). There are others (e.g., logit, symlog), but I haven't seen them too much in the wild.\n",
 "\n",
 "Linear scales are the default:"
@@ -364,18 +366,20 @@
 "plot = (\n",
 " seaborn.load_dataset(\"tips\")\n",
 " .assign(pct=lambda df: 100 * df['tip'] / df['total_bill'])\n",
-" .pipe(seaborn.FacetGrid, hue='sex', col='time', row='smoker', margin_titles=True, aspect=1.75)\n",
+" .pipe(seaborn.FacetGrid, hue='sex', col='time', row='smoker', margin_titles=True, aspect=1.25, size=4)\n",
 " .map(probscale.probplot, 'pct', bestfit=True, scatter_kws=dict(alpha=0.75), probax='y')\n",
 " .add_legend()\n",
 " .set_ylabels('Non-Exceedance Probabilty')\n",
 " .set_xlabels('Tips as percent of total bill')\n",
-").set(ylim=(0.5, 99.5))"
+" .set(ylim=(0.5, 99.5))\n",
+")"
 ]
 }
 ],
 "metadata": {
+"anaconda-cloud": {},
 "kernelspec": {
-"display_name": "Python 3",
+"display_name": "Python [default]",
 "language": "python",
 "name": "python3"
 },

probscale/algo.py

Lines changed: 141 additions & 0 deletions
@@ -0,0 +1,141 @@
+import numpy
+
+
+def _make_boot_index(elements, niter):
+    """ Generate an array of bootstrap sample sets
+
+    Parameters
+    ----------
+    elements : int
+        The number of rows in the original dataset.
+    niter : int
+        Number of iteration for the bootstrapping.
+
+    Returns
+    -------
+    index : numpy array
+        A collection of random *indices* that can be used to randomly
+        sample a dataset ``niter`` times.
+
+    """
+    return numpy.random.randint(low=0, high=elements, size=(niter, elements))
+
+
+def _fit_simple(x, y, xhat, fitlogs=None):
+    """
+    Simple linear fit of x and y data using ``numpy.polyfit``.
+
+    Parameters
+    ----------
+    x, y : array-like
+    fitlogs : str, optional.
+        Defines which data should be log-transformed. Valid values are
+        'x', 'y', or 'both'.
+
+    Returns
+    -------
+    xhat, yhat : array-like
+        Estimates of x and y based on the linear fit
+    results : dict
+        Dictionary of the fit coefficients
+
+    See also
+    --------
+    numpy.polyfit
+
+    """
+
+    # do the best-fit
+    coeffs = numpy.polyfit(x, y, 1)
+
+    results = {
+        'slope': coeffs[0],
+        'intercept': coeffs[1]
+    }
+
+    # estimate y values
+    yhat = _estimate_from_fit(xhat, coeffs[0], coeffs[1],
+                              xlog=fitlogs in ['x', 'both'],
+                              ylog=fitlogs in ['y', 'both'])
+
+    return yhat, results
+
+
+def _bs_fit(x, y, xhat, fitlogs=None, niter=10000, alpha=0.05):
+    """
+    Percentile method bootstrapping of linear fit of x and y data using
+    ``numpy.polyfit``.
+
+    Parameters
+    ----------
+    x, y : array-like
+    fitlogs : str, optional.
+        Defines which data should be log-transformed. Valid values are
+        'x', 'y', or 'both'.
+    niter : int, optional (default is 10000)
+        Number of bootstrap iterations to use
+    alpha : float, optional
+        Confidence level of the estimate.
+
+    Returns
+    -------
+    xhat, yhat : array-like
+        Estimates of x and y based on the linear fit
+    results : dict
+        Dictionary of the fit coefficients
+
+    See also
+    --------
+    numpy.polyfit
+
+    """
+
+    index = _make_boot_index(len(x), niter)
+    yhat_array = numpy.array([
+        _fit_simple(x[ii], y[ii], xhat, fitlogs=fitlogs)[0]
+        for ii in index
+    ])
+
+    percentiles = 100 * numpy.array([alpha*0.5, 1 - alpha*0.5])
+    yhat_lo, yhat_hi = numpy.percentile(yhat_array, percentiles, axis=0)
+    return yhat_lo, yhat_hi
+
+
+def _estimate_from_fit(xhat, slope, intercept, xlog=False, ylog=False):
+    """ Estimate the dependent variables of a linear fit given x-data
+    and linear parameters.
+
+    Parameters
+    ----------
+    xhat : numpy array or pandas Series/DataFrame
+        The input independent variable of the fit
+    slope : float
+        Slope of the best-fit line
+    intercept : float
+        y-intercept of the best-fit line
+    xlog, ylog : bool (default = False)
+        Toggles whether or not the logs of the x- or y- data should be
+        used to perform the regression.
+
+    Returns
+    -------
+    yhat : numpy array
+        Estimate of the dependent variable.
+
+    """
+
+    xhat = numpy.asarray(xhat)
+    if ylog:
+        if xlog:
+            yhat = numpy.exp(intercept) * xhat ** slope
+        else:
+            yhat = numpy.exp(intercept) * numpy.exp(slope) ** xhat
+    else:
+        if xlog:
+            yhat = slope * numpy.log(xhat) + intercept
+        else:
+            yhat = slope * xhat + intercept
+
+    return yhat
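The heart of the new module is `_bs_fit`, which implements percentile-method bootstrapping: resample the data with replacement, refit each resample, and take the alpha/2 and 1 - alpha/2 percentiles of the predictions as the confidence band. A self-contained sketch of the same idea on hypothetical data (the helper name `bootstrap_fit_bands` is illustrative, not part of this commit):

```python
import numpy


def bootstrap_fit_bands(x, y, xhat, niter=2000, alpha=0.05, seed=0):
    """Percentile-method bootstrap confidence bands for a linear fit."""
    rng = numpy.random.RandomState(seed)
    n = len(x)
    boot_yhat = numpy.empty((niter, len(xhat)))
    for k in range(niter):
        # resample row indices with replacement, refit, predict at xhat
        ii = rng.randint(0, n, size=n)
        slope, intercept = numpy.polyfit(x[ii], y[ii], 1)
        boot_yhat[k] = slope * xhat + intercept
    pcts = 100 * numpy.array([alpha / 2, 1 - alpha / 2])
    lo, hi = numpy.percentile(boot_yhat, pcts, axis=0)
    return lo, hi


# hypothetical noisy linear data
x = numpy.arange(20.0)
y = 2.0 * x + 1.0 + numpy.random.RandomState(42).normal(0, 0.5, 20)
lo, hi = bootstrap_fit_bands(x, y, xhat=numpy.array([0.0, 10.0, 19.0]))
# lo and hi bracket the fitted line at each requested x position
```

`_bs_fit` follows the same pattern but routes each refit through `_fit_simple`, so the `fitlogs` log-transform options apply to every bootstrap replicate as well.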

probscale/tests/test_algo.py

Lines changed: 128 additions & 0 deletions
@@ -0,0 +1,128 @@
+import numpy
+
+import pytest
+import numpy.testing as nptest
+
+from probscale import algo
+
+
+def test__make_boot_index():
+    result = algo._make_boot_index(5, 5000)
+    assert result.shape == (5000, 5)
+    assert result.min() == 0
+    assert result.max() == 4
+
+
+@pytest.fixture
+def plot_data():
+    data = numpy.array([
+        3.113, 3.606, 4.046, 4.046, 4.710, 6.140, 6.978,
+        2.000, 4.200, 4.620, 5.570, 5.660, 5.860, 6.650,
+        6.780, 6.790, 7.500, 7.500, 7.500, 8.630, 8.710,
+        8.990, 9.850, 10.820, 11.250, 11.250, 12.200, 14.920,
+        16.770, 17.810, 19.160, 19.190, 19.640, 20.180, 22.970,
+    ])
+    return data
+
+
+@pytest.mark.parametrize(('fitlogs', 'known_yhat'), [
+    (None, numpy.array([0.7887, 3.8946, 7.0005, 10.1065, 13.2124, 16.3183])),
+    ('x', numpy.array([0.2711, 1.2784, 1.5988, 1.7953, 1.9373, 2.0487])),
+    ('y', numpy.array([2.2006e+00, 4.9139e+01, 1.0972e+03,
+                       2.4501e+04, 5.4711e+05, 1.2217e+07])),
+    ('both', numpy.array([1.3114, 3.5908, 4.9472, 6.0211, 6.9402, 7.7577])),
+])
+def test__fit_simple(plot_data, fitlogs, known_yhat):
+    x = numpy.arange(1, len(plot_data) + 1)
+    known_results = {'slope': 0.5177, 'intercept': 0.2711}
+    xhat = x[::6]
+    yhat, results = algo._fit_simple(x, plot_data, xhat, fitlogs=fitlogs)
+    assert abs(results['intercept'] - known_results['intercept']) < 0.0001
+    assert abs(results['slope'] - known_results['slope']) < 0.0001
+    nptest.assert_allclose(yhat, known_yhat, rtol=0.0001)
+
+
+@pytest.mark.parametrize(('fitlogs', 'known_lo', 'known_hi'), [
+    (None, numpy.array([-0.7944, 2.7051, 6.1974, 9.2612, 11.9382, 14.4290]),
+     numpy.array([2.1447, 4.8360, 7.7140, 10.8646, 14.1014, 17.4432])),
+    ('x', numpy.array([-1.4098, -0.2210, 0.1387, 0.3585, 0.5147, 0.6417]),
+     numpy.array([1.7067, 2.5661, 2.8468, 3.0169, 3.1400, 3.2341])),
+    ('y', numpy.array([4.5187e-01, 1.4956e+01, 4.9145e+02,
+                       1.0522e+04, 1.5299e+05, 1.8468e+06]),
+     numpy.array([8.5396e+00, 1.2596e+02, 2.2396e+03,
+                  5.2290e+04, 1.3310e+06, 3.7627e+07])),
+    ('both', numpy.array([0.2442, 0.8017, 1.1488, 1.4312, 1.6731, 1.8997]),
+     numpy.array([5.5107, 13.0148, 17.232, 20.4285, 23.1035, 25.3843])),
+])
+def test__bs_fit(plot_data, fitlogs, known_lo, known_hi):
+    numpy.random.seed(0)
+    x = numpy.arange(1, len(plot_data) + 1)
+    xhat = x[::6]
+    yhat_lo, yhat_hi = algo._bs_fit(x, plot_data, xhat,
+                                    fitlogs=fitlogs, niter=1000)
+
+    nptest.assert_allclose(yhat_lo, known_lo, rtol=0.001)
+    nptest.assert_allclose(yhat_hi, known_hi, rtol=0.001)
+
+
+class Test__estimate_from_fit(object):
+    def setup(self):
+        self.x = numpy.arange(1, 11, 0.5)
+        self.slope = 2
+        self.intercept = 3.5
+
+        self.known_ylinlin = numpy.array([
+            5.5, 6.5, 7.5, 8.5, 9.5, 10.5, 11.5, 12.5, 13.5,
+            14.5, 15.5, 16.5, 17.5, 18.5, 19.5, 20.5, 21.5, 22.5,
+            23.5, 24.5
+        ])
+
+        self.known_yloglin = numpy.array([
+            3.50000000, 4.31093022, 4.88629436, 5.33258146, 5.69722458,
+            6.00552594, 6.27258872, 6.50815479, 6.71887582, 6.90949618,
+            7.08351894, 7.24360435, 7.39182030, 7.52980604, 7.65888308,
+            7.78013233, 7.89444915, 8.00258360, 8.10517019, 8.20275051
+        ])
+
+        self.known_yloglog = numpy.array([
+            33.11545196, 74.50976691, 132.46180783, 206.97157474,
+            298.03906763, 405.66428649, 529.84723134, 670.58790216,
+            827.88629897, 1001.74242175, 1192.15627051, 1399.12784525,
+            1622.65714598, 1862.74417268, 2119.38892536, 2392.59140402,
+            2682.35160865, 2988.66953927, 3311.54519587, 3650.97857845
+        ])
+
+        self.known_ylinlog = numpy.array([
+            2.44691932e+02, 6.65141633e+02, 1.80804241e+03,
+            4.91476884e+03, 1.33597268e+04, 3.63155027e+04,
+            9.87157710e+04, 2.68337287e+05, 7.29416370e+05,
+            1.98275926e+06, 5.38969848e+06, 1.46507194e+07,
+            3.98247844e+07, 1.08254988e+08, 2.94267566e+08,
+            7.99902177e+08, 2.17435955e+09, 5.91052206e+09,
+            1.60664647e+10, 4.36731791e+10
+        ])
+
+    def test_linlin(self):
+        ylinlin = algo._estimate_from_fit(self.x, self.slope, self.intercept,
+                                          xlog=False, ylog=False)
+        nptest.assert_array_almost_equal(ylinlin, self.known_ylinlin)
+
+    def test_loglin(self):
+        yloglin = algo._estimate_from_fit(self.x, self.slope, self.intercept,
+                                          xlog=True, ylog=False)
+        nptest.assert_array_almost_equal(yloglin, self.known_yloglin)
+
+    def test_loglog(self):
+        yloglog = algo._estimate_from_fit(self.x, self.slope, self.intercept,
+                                          xlog=True, ylog=True)
+        nptest.assert_array_almost_equal(yloglog, self.known_yloglog)
+
+    def test_linlog(self):
+        ylinlog = algo._estimate_from_fit(self.x, self.slope, self.intercept,
+                                          xlog=False, ylog=True)
+        diff = numpy.abs(ylinlog - self.known_ylinlog) / self.known_ylinlog
+        nptest.assert_array_almost_equal(
+            diff,
+            numpy.zeros(self.x.shape[0]),
+            decimal=5
+        )
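The `Test__estimate_from_fit` cases above exercise all four back-transform combinations in `_estimate_from_fit`: fitting in log space and predicting in linear space means, for the log-log case, ln(y) = slope * ln(x) + intercept, so y = exp(intercept) * x ** slope. A quick numeric spot check of that identity at x = 1, using the test's slope = 2 and intercept = 3.5:

```python
import math

slope, intercept = 2.0, 3.5

# log-log model: ln(y) = slope * ln(x) + intercept
# back-transformed: y = exp(intercept) * x ** slope
y_at_1 = math.exp(intercept) * 1.0 ** slope

# exp(3.5) is about 33.1155, matching known_yloglog[0] in the test
```

The other cases follow the same pattern: with only `ylog`, y = exp(intercept) * exp(slope) ** x; with only `xlog`, y = slope * ln(x) + intercept.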

0 commit comments