Clean up the API around bootstrapping the best-fit

phobson · phobson · commit 2b5008c1b59b · 2017-02-06T16:28:12.000-08:00
diff --git a/docs/tutorial/closer_look_at_plot_pos.ipynb b/docs/tutorial/closer_look_at_plot_pos.ipynb
@@ -5,7 +5,8 @@
    "metadata": {},
    "source": [
     "# Using different formulations of plotting positions\n",
-    "### Looking at normal vs Weibull scales + Cunnane vs Weibull plotting positions\n",
+    "\n",
+    "## Computing plotting positions\n",
     "\n",
     "When drawing a percentile, quantile, or probability plot, the potting positions of ordered data must be computed.\n",
     "\n",
@@ -102,6 +103,13 @@
     "        ax2.set_ylabel('Weibull Probability Scale')"
    ]
   },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Normal vs Weibull scales and Cunnane vs Weibull plotting positions"
+   ]
+  },
   {
    "cell_type": "markdown",
    "metadata": {},
@@ -173,6 +181,8 @@
    "source": [
     "This demostrates that the different formulations of the plotting positions vary  most at the extreme values of the dataset. \n",
     "\n",
+    "### Hazen plotting positions\n",
+    "\n",
     "Next, let's compare the Hazen/Type 5 (α=0.5, β=0.5) formulation to Cunnane.\n",
     "Hazen plotting positions (shown as red triangles) represet a piece-wise linear interpolation of the emperical cumulative distribution function of the dataset.\n",
     "\n",
@@ -205,6 +215,8 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
+    "### Summary\n",
+    "\n",
     "At the risk of showing a very cluttered and hard to read figure, let's throw all three on the same normal probability scale:"
    ]
   },
@@ -267,7 +279,7 @@
  ],
  "metadata": {
   "kernelspec": {
-   "display_name": "Python 3",
+   "display_name": "Python [default]",
    "language": "python",
    "name": "python3"
   },
@@ -281,7 +293,7 @@
    "name": "python",
    "nbconvert_exporter": "python",
    "pygments_lexer": "ipython3",
-   "version": "3.5.1"
+   "version": "3.5.2"
   }
  },
  "nbformat": 4,
diff --git a/docs/tutorial/closer_look_at_viz.ipynb b/docs/tutorial/closer_look_at_viz.ipynb
@@ -436,7 +436,7 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "## Adding best-fit lines\n",
+    "## Best-fit lines\n",
     "\n",
     "Adding a best-fit line to a probability plot can provide insight as to whether or not a dataset can be characterized by a distribution.\n",
     "\n",
@@ -446,6 +446,8 @@
     "Visual attributes of the line can be controled with the `line_kws` parameter.\n",
     "If you want label the best-fit line, that is where you specify its label.\n",
     "\n",
+    "### Simple examples\n",
+    "\n",
     "The most trivial case is a P-P plot with a linear data axis"
    ]
   },
@@ -705,7 +707,7 @@
  ],
  "metadata": {
   "kernelspec": {
-   "display_name": "Python 3",
+   "display_name": "Python [default]",
    "language": "python",
    "name": "python3"
   },
diff --git a/docs/tutorial/getting_started.ipynb b/docs/tutorial/getting_started.ipynb
@@ -74,6 +74,8 @@
    "source": [
     "## Background\n",
     "\n",
+    "### Built-in matplotlib scales\n",
+    "\n",
     "To the casual user, you can set matplotlib scales to either \"linear\" or \"log\" (logarithmic). There are others (e.g., logit, symlog), but I haven't seen them too much in the wild.\n",
     "\n",
     "Linear scales are the default:"
@@ -374,8 +376,9 @@
   }
  ],
  "metadata": {
+  "anaconda-cloud": {},
   "kernelspec": {
-   "display_name": "Python 3",
+   "display_name": "Python [default]",
    "language": "python",
    "name": "python3"
   },
diff --git a/probscale/algo.py b/probscale/algo.py
@@ -61,17 +61,6 @@ def _fit_simple(x, y, xhat, fitlogs=None):
     return yhat, results
 
 
-def _bs_resid(x, y, xhat, fitlogs=None, niter=10000, alpha=0.05):
-    index = _make_boot_index(len(x), niter)
-    yhat, results = _fit_simple(x, y, xhat, fitlogs=fitlogs)
-    resid = y - yhat
-    bs_y = y + resid[index]
-
-    percentiles = 100 * numpy.array([alpha*0.5, 1 - alpha*0.5])
-    yhat_lo, yhat_hi = numpy.percentile(bs_y, percentiles, axis=0)
-    return yhat_lo, yhat_hi
-
-
 def _bs_fit(x, y, xhat, fitlogs=None, niter=10000, alpha=0.05):
     """
     Percentile method bootstrapping of linear fit of x and y data using
diff --git a/probscale/tests/test_validate.py b/probscale/tests/test_validate.py
@@ -90,14 +90,14 @@ def test_axis_label(value, expected):
     assert result == expected
 
 
-@pytest.mark.parametrize(('value', 'expected'), [
-    ('fit', algo._bs_fit),
-    ('resids', algo._bs_resid),
-    ('junk', None)
+@pytest.mark.parametrize(('value', 'expected', 'error'), [
+    ('fit', algo._bs_fit, None),
+    ('resids', None, NotImplementedError),
+    ('junk', None, ValueError)
 ])
-def test_estimator(value, expected):
-    if expected is None:
-        with pytest.raises(ValueError):
+def test_estimator(value, expected, error):
+    if error is not None:
+        with pytest.raises(error):
             validate.estimator(value)
     else:
         est = validate.estimator(value)
diff --git a/probscale/validate.py b/probscale/validate.py
@@ -1,5 +1,7 @@
 from matplotlib import pyplot
 
+from .algo import _bs_fit
+
 
 def axes_object(ax):
     """ Checks if a value is an Axes. If None, a new one is created.
@@ -85,12 +87,12 @@ def other_options(options):
     return dict() if options is None else options.copy()
 
 def estimator(value):
-    from .algo import _bs_fit, _bs_resid
     if value.lower() in ['res', 'resid', 'resids', 'residual', 'residuals']:
-        est = _bs_resid
+        msg = 'Bootstrapping the residuals is not ready yet'
+        raise NotImplementedError(msg)
     elif value.lower() in ['fit', 'values']:
         est = _bs_fit
     else:
         raise ValueError('estimator must be either "resid" or "fit".')
 
-    return est
+    return est
diff --git a/probscale/viz.py b/probscale/viz.py
@@ -10,50 +10,67 @@
 
 def probplot(data, ax=None, plottype='prob', dist=None, probax='x',
              problabel=None, datascale='linear', datalabel=None,
-             bestfit=False, estimate_ci=False,
-             return_best_fit_results=False,
-             scatter_kws=None, line_kws=None, pp_kws=None,
-             **fgkwargs):
+             bestfit=False, return_best_fit_results=False,
+             estimate_ci=False, ci_kws=None, pp_kws=None,
+             scatter_kws=None, line_kws=None, **fgkwargs):
     """
     Probability, percentile, and quantile plots.
 
     Parameters
     ----------
     data : array-like
         1-dimensional data to be plotted
+
     ax : matplotlib axes, optional
         The Axes on which to plot. If one is not provided, a new Axes
         will be created.
+
     plottype : string (default = 'prob')
         Type of plot to be created. Options are:
 
            - 'prob': probability plot
            - 'pp': percentile plot
            - 'qq': quantile plot
 
+
     dist : scipy distribution, optional
         A distribution to compute the scale's tick positions. If not
         specified, a standard normal distribution will be used.
+
     probax : string, optional (default = 'x')
         The axis ('x' or 'y') that will serve as the probability (or
         quantile) axis.
+
     problabel, datalabel : string, optional
         Axis labels for the probability/quantile and data axes
         respectively.
+
     datascale : string, optional (default = 'linear')
         Scale for the data axis (the axis that is not ``probax``).
+
     bestfit : bool, optional (default is False)
         Specifies whether a best-fit line should be added to the plot.
+
     return_best_fit_results : bool (default is False)
         If True, a dictionary of fit results is returned along with the
         figure.
-    scatter_kws, line_kws : dictionary, optional
-        Dictionary of keyword arguments passed directly to ``ax.plot``
-        when drawing the scatter points and best-fit line, respectively.
-    pp_kws : dictionary, optional
+
+    estimate_ci : bool, optional (False)
+        Estimate and draw a confidence band around the best-fit line
+        using a percentile bootstrap.
+
+    ci_kws : dict, optional
+        Dictionary of keyword arguments passed directly to
+        ``viz.fit_line`` when computing the best-fit line.
+
+    pp_kws : dict, optional
         Dictionary of keyword arguments passed directly to
         ``viz.plot_pos`` when computing the plotting positions.
 
+    scatter_kws, line_kws : dict, optional
+        Dictionary of keyword arguments passed directly to ``ax.plot``
+        when drawing the scatter points and best-fit line, respectively.
+
     Other Parameters
     ----------------
     color : string, optional
@@ -82,7 +99,8 @@ def probplot(data, ax=None, plottype='prob', dist=None, probax='x',
     -------
     fig : matplotlib.Figure
         The figure on which the plot was drawn.
-    result : dictionary of linear fit results, optional
+
+    result : dict of linear fit results, optional
         Keys are:
 
            - q : array of quantiles
@@ -93,6 +111,7 @@ def probplot(data, ax=None, plottype='prob', dist=None, probax='x',
     See also
     --------
     viz.plot_pos
+    viz.fit_line
     numpy.polyfit
     scipy.stats.probplot
     scipy.stats.mstats.plotting_positions
@@ -287,7 +306,9 @@ def plot_pos(data, postype=None, alpha=None, beta=None):
     ----------
     data : array-like
         The values whose plotting positions need to be computed.
+
     postype : string, optional (default: "cunnane")
+
     alpha, beta : float, optional
         Custom plotting position parameters if the options available
         through the `postype` parameter are insufficient.
@@ -296,6 +317,7 @@ def plot_pos(data, postype=None, alpha=None, beta=None):
     -------
     plot_pos : numpy.array
         The computed plotting positions, sorted.
+
     data_sorted : numpy.array
         The original data values, sorted.
 
@@ -384,9 +406,11 @@ def fit_line(x, y, xhat=None, fitprobs=None, fitlogs=None, dist=None,
     ----------
     x, y : array-like
         Independent and dependent data, respectively.
+
     xhat : array-like, optional
         The values at which ``yhat`` should be estimated. If
         not provided, falls back to the sorted values of ``x``.
+
     fitprobs, fitlogs : str, optional.
         Defines how data should be transformed. Valid values are
         'x', 'y', or 'both'. If using ``fitprobs``, variables should
@@ -395,12 +419,23 @@ def fit_line(x, y, xhat=None, fitprobs=None, fitlogs=None, dist=None,
         Log transform = lambda x: numpy.log(x).
         Take care to not pass the same value to both ``fitlogs`` and
         ``fitprobs`` as both transforms will be applied.
+
     dist : distribution, optional
         A fully-spec'd scipy.stats distribution-like object
         such that ``dist.ppf`` and ``dist.cdf`` can be called. If not
         provided, defaults to a minimal implementation of
         scipy.stats.norm.
 
+    estimate_ci : bool, optional (False)
+        Estimate a confidence band around the best-fit line
+        using a percentile bootstrap.
+
+    niter : int, optional (default = 10000)
+        Number of bootstrap iterations if ``estimate_ci`` is provided.
+
+    alpha : float, optional (default = 0.05)
+        Significance level; the band covers 100 * (1 - alpha) percent.
+
     Returns
     -------
     xhat, yhat : numpy arrays
@@ -414,6 +449,7 @@ def fit_line(x, y, xhat=None, fitprobs=None, fitlogs=None, dist=None,
           - yhat_hi (upper confidence interval of the estimated y-vals)
 
     """
+
     fitprobs = validate.fit_argument(fitprobs, "fitprobs")
     fitlogs = validate.fit_argument(fitlogs, "fitlogs")
 
@@ -445,7 +481,7 @@ def fit_line(x, y, xhat=None, fitprobs=None, fitlogs=None, dist=None,
     yhat, results =  algo._fit_simple(x, y, xhat, fitlogs=fitlogs)
 
     if estimate_ci:
-        yhat_lo, yhat_hi = algo._fit_ci(x, y, xhat, fitlogs=fitlogs,
+        yhat_lo, yhat_hi = algo._bs_fit(x, y, xhat, fitlogs=fitlogs,
                                         niter=niter, alpha=alpha)
     else:
         yhat_lo, yhat_hi = None, None