Commit 30332b6

add statsforecast
1 parent 0a0ff2b commit 30332b6

4 files changed

+434
-4
lines changed

Chapter5/time_series.ipynb

Lines changed: 159 additions & 0 deletions
Original file line number | Diff line number | Diff line change
@@ -2249,6 +2249,165 @@
22492249
"[Link to NeuralForecast](https://bit.ly/44h9KM7)."
22502250
]
22512251
},
2252+
{
2253+
"cell_type": "markdown",
2254+
"id": "4255da83",
2255+
"metadata": {},
2256+
"source": [
2257+
"### Scaling Time-Series Forecasting with StatsForecast and Spark"
2258+
]
2259+
},
2260+
{
2261+
"cell_type": "code",
2262+
"execution_count": null,
2263+
"id": "f18072ef",
2264+
"metadata": {
2265+
"tags": [
2266+
"hide-cell"
2267+
]
2268+
},
2269+
"outputs": [],
2270+
"source": [
2271+
"!pip install statsforecast pyspark\n"
2272+
]
2273+
},
2274+
{
2275+
"cell_type": "code",
2276+
"execution_count": 22,
2277+
"id": "da891a0f",
2278+
"metadata": {},
2279+
"outputs": [],
2280+
"source": [
2281+
"import os \n",
2282+
"\n",
2283+
"# this makes it so that the outputs of the predict methods have the id as a column \n",
2284+
"# instead of as the index\n",
2285+
"os.environ['NIXTLA_ID_AS_COL'] = '1'"
2286+
]
2287+
},
2288+
{
2289+
"cell_type": "markdown",
2290+
"id": "3d8c3905",
2291+
"metadata": {},
2292+
"source": [
2293+
"Traditional time-series libraries typically run in memory on a single machine, which becomes a bottleneck when a dataset is too large to process there.\n",
2294+
"\n",
2295+
"StatsForecast, however, accepts Spark DataFrames directly, so forecasts for many series can be computed in parallel on a Spark cluster and returned as a Spark DataFrame."
2296+
]
2297+
},
2298+
{
2299+
"cell_type": "code",
2300+
"execution_count": null,
2301+
"id": "51638f88",
2302+
"metadata": {},
2303+
"outputs": [],
2304+
"source": [
2305+
"from pyspark.sql import SparkSession\n",
2306+
"\n",
2307+
"spark = SparkSession.builder.getOrCreate()\n"
2308+
]
2309+
},
2310+
{
2311+
"cell_type": "code",
2312+
"execution_count": 31,
2313+
"id": "87bfec5a",
2314+
"metadata": {},
2315+
"outputs": [
2316+
{
2317+
"name": "stdout",
2318+
"output_type": "stream",
2319+
"text": [
2320+
"+---------+-------------------+-------------------+\n",
2321+
"|unique_id| ds| y|\n",
2322+
"+---------+-------------------+-------------------+\n",
2323+
"| 0|2000-01-01 00:00:00|0.30138168803582194|\n",
2324+
"| 0|2000-01-02 00:00:00| 1.2724415914984484|\n",
2325+
"| 0|2000-01-03 00:00:00| 2.211827399669452|\n",
2326+
"| 0|2000-01-04 00:00:00| 3.322947056533328|\n",
2327+
"| 0|2000-01-05 00:00:00| 4.218793605631347|\n",
2328+
"+---------+-------------------+-------------------+\n",
2329+
"only showing top 5 rows\n",
2330+
"\n"
2331+
]
2332+
}
2333+
],
2334+
"source": [
2335+
"from statsforecast.core import StatsForecast\n",
2336+
"from statsforecast.models import AutoETS\n",
2337+
"from statsforecast.utils import generate_series\n",
2338+
"from tqdm.autonotebook import tqdm\n",
2339+
"\n",
2340+
"n_series = 4\n",
2341+
"horizon = 7\n",
2342+
"\n",
2343+
"series = generate_series(n_series)\n",
2344+
"\n",
2345+
"# Convert to Spark\n",
2346+
"spark_df = spark.createDataFrame(series)\n",
2347+
"spark_df.show(5)"
2348+
]
2349+
},
2350+
{
2351+
"cell_type": "code",
2352+
"execution_count": 30,
2353+
"id": "be012038",
2354+
"metadata": {},
2355+
"outputs": [
2356+
{
2357+
"name": "stderr",
2358+
"output_type": "stream",
2359+
"text": [
2360+
"/Users/khuyentran/.pyenv/versions/3.8.16/lib/python3.8/site-packages/statsforecast/core.py:485: FutureWarning: In a future version the predictions will have the id as a column. You can set the `NIXTLA_ID_AS_COL` environment variable to adopt the new behavior and to suppress this warning.\n",
2361+
" warnings.warn(\n",
2362+
"/Users/khuyentran/.pyenv/versions/3.8.16/lib/python3.8/site-packages/statsforecast/core.py:485: FutureWarning: In a future version the predictions will have the id as a column. You can set the `NIXTLA_ID_AS_COL` environment variable to adopt the new behavior and to suppress this warning.\n",
2363+
" warnings.warn(\n",
2364+
"/Users/khuyentran/.pyenv/versions/3.8.16/lib/python3.8/site-packages/statsforecast/core.py:485: FutureWarning: In a future version the predictions will have the id as a column. You can set the `NIXTLA_ID_AS_COL` environment variable to adopt the new behavior and to suppress this warning.\n",
2365+
" warnings.warn(\n"
2366+
]
2367+
},
2368+
{
2369+
"name": "stdout",
2370+
"output_type": "stream",
2371+
"text": [
2372+
"+---------+-------------------+----------+-------------+-------------+\n",
2373+
"|unique_id| ds| AutoETS|AutoETS-lo-90|AutoETS-hi-90|\n",
2374+
"+---------+-------------------+----------+-------------+-------------+\n",
2375+
"| 0|2000-08-10 00:00:00| 5.261609| 5.0255513| 5.4976664|\n",
2376+
"| 0|2000-08-11 00:00:00| 6.1963573| 5.9603| 6.432415|\n",
2377+
"| 0|2000-08-12 00:00:00|0.28230855| 0.04625102| 0.5183661|\n",
2378+
"| 0|2000-08-13 00:00:00| 1.2641948| 1.0281373| 1.5002524|\n",
2379+
"| 0|2000-08-14 00:00:00| 2.2624528| 2.0263953| 2.4985104|\n",
2380+
"+---------+-------------------+----------+-------------+-------------+\n",
2381+
"only showing top 5 rows\n",
2382+
"\n"
2383+
]
2384+
},
2385+
{
2386+
"name": "stderr",
2387+
"output_type": "stream",
2388+
"text": [
2389+
"/Users/khuyentran/.pyenv/versions/3.8.16/lib/python3.8/site-packages/statsforecast/core.py:485: FutureWarning: In a future version the predictions will have the id as a column. You can set the `NIXTLA_ID_AS_COL` environment variable to adopt the new behavior and to suppress this warning.\n",
2390+
" warnings.warn(\n",
2391+
"/Users/khuyentran/.pyenv/versions/3.8.16/lib/python3.8/multiprocessing/resource_tracker.py:216: UserWarning: resource_tracker: There appear to be 1 leaked semaphore objects to clean up at shutdown\n",
2392+
" warnings.warn('resource_tracker: There appear to be %d '\n"
2393+
]
2394+
}
2395+
],
2396+
"source": [
2397+
"sf = StatsForecast(models=[AutoETS(season_length=7)], freq=\"D\")\n",
2398+
"\n",
2399+
"# Returns a Spark DataFrame\n",
2400+
"sf.forecast(df=spark_df, h=horizon, level=[90]).show(5)"
2401+
]
2402+
},
2403+
{
2404+
"cell_type": "markdown",
2405+
"id": "dbbe24c6",
2406+
"metadata": {},
2407+
"source": [
2408+
"[Link to StatsForecast](https://bit.ly/3KNsl9P)."
2409+
]
2410+
},
22522411
{
22532412
"cell_type": "markdown",
22542413
"id": "fc36e4ce",

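Note on reading the forecast output in the notebook hunk above: the call passes `level=[90]`, and the result carries `AutoETS-lo-90` and `AutoETS-hi-90` next to the point forecast column `AutoETS`. A minimal sketch of that naming pattern, assuming it generalizes as `<model>-lo-<level>` / `<model>-hi-<level>`. `interval_columns` is a hypothetical helper written for illustration, not part of StatsForecast, and the column order it produces for several levels is an assumption.

```python
from typing import List


def interval_columns(model: str, levels: List[int]) -> List[str]:
    """Build the column names expected for one model's forecast output.

    Hypothetical helper: it mirrors the pattern visible in the notebook
    output (point forecast, then a lower and an upper bound per level).
    """
    columns = [model]  # point forecast column, e.g. "AutoETS"
    for level in sorted(levels):
        columns.append(f"{model}-lo-{level}")  # lower bound of the interval
        columns.append(f"{model}-hi-{level}")  # upper bound of the interval
    return columns


print(interval_columns("AutoETS", [90]))
```

For the single level used in the commit, this yields the three value columns shown in the output table.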
docs/Chapter5/time_series.html

Lines changed: 115 additions & 3 deletions
Original file line number | Diff line number | Diff line change
@@ -528,7 +528,8 @@ <h2> Contents </h2>
528528
<li class="toc-h2 nav-item toc-entry"><a class="reference internal nav-link" href="#quantstats-simplify-stock-performance-analysis-in-python">6.7.13. QuantStats: Simplify Stock Performance Analysis in Python</a></li>
529529
<li class="toc-h2 nav-item toc-entry"><a class="reference internal nav-link" href="#kneed-knee-point-detection-in-time-series">6.7.14. kneed: Knee-Point Detection in Time Series</a></li>
530530
<li class="toc-h2 nav-item toc-entry"><a class="reference internal nav-link" href="#neuralforecast-streamline-neural-forecasting-with-familiar-sklearn-syntax">6.7.15. NeuralForecast: Streamline Neural Forecasting with Familiar Sklearn Syntax</a></li>
531-
<li class="toc-h2 nav-item toc-entry"><a class="reference internal nav-link" href="#generative-pre-trained-forecasting-with-timegpt">6.7.16. Generative Pre-trained Forecasting with TimeGPT</a></li>
531+
<li class="toc-h2 nav-item toc-entry"><a class="reference internal nav-link" href="#scaling-time-series-forecasting-with-statsforecast-and-spark">6.7.16. Scaling Time-Series Forecasting with StatsForecast and Spark</a></li>
532+
<li class="toc-h2 nav-item toc-entry"><a class="reference internal nav-link" href="#generative-pre-trained-forecasting-with-timegpt">6.7.17. Generative Pre-trained Forecasting with TimeGPT</a></li>
532533
</ul>
533534
</nav>
534535
</div>
@@ -1866,8 +1867,118 @@ <h2><span class="section-number">6.7.15. </span>NeuralForecast: Streamline Neura
18661867
</div>
18671868
<p><a class="reference external" href="https://bit.ly/44h9KM7">Link to NeuralForecast</a>.</p>
18681869
</section>
1870+
<section id="scaling-time-series-forecasting-with-statsforecast-and-spark">
1871+
<h2><span class="section-number">6.7.16. </span>Scaling Time-Series Forecasting with StatsForecast and Spark<a class="headerlink" href="#scaling-time-series-forecasting-with-statsforecast-and-spark" title="Permalink to this heading">#</a></h2>
1872+
<div class="cell tag_hide-cell docutils container">
1873+
<details class="hide above-input">
1874+
<summary aria-label="Toggle hidden content">
1875+
<span class="collapsed">Show code cell content</span>
1876+
<span class="expanded">Hide code cell content</span>
1877+
</summary>
1878+
<div class="cell_input docutils container">
1879+
<div class="highlight-ipython3 notranslate"><div class="highlight"><pre><span></span><span class="o">!</span>pip<span class="w"> </span>install<span class="w"> </span>statsforecast<span class="w"> </span>pyspark
1880+
</pre></div>
1881+
</div>
1882+
</div>
1883+
</details>
1884+
</div>
1885+
<div class="cell docutils container">
1886+
<div class="cell_input docutils container">
1887+
<div class="highlight-ipython3 notranslate"><div class="highlight"><pre><span></span><span class="kn">import</span> <span class="nn">os</span>
1888+
1889+
<span class="c1"># this makes it so that the outputs of the predict methods have the id as a column </span>
1890+
<span class="c1"># instead of as the index</span>
1891+
<span class="n">os</span><span class="o">.</span><span class="n">environ</span><span class="p">[</span><span class="s1">&#39;NIXTLA_ID_AS_COL&#39;</span><span class="p">]</span> <span class="o">=</span> <span class="s1">&#39;1&#39;</span>
1892+
</pre></div>
1893+
</div>
1894+
</div>
1895+
</div>
1896+
<p>Traditional time-series libraries typically run in memory on a single machine, which becomes a bottleneck when a dataset is too large to process there.</p>
1897+
<p>StatsForecast, however, accepts Spark DataFrames directly, so forecasts for many series can be computed in parallel on a Spark cluster and returned as a Spark DataFrame.</p>
1898+
<div class="cell docutils container">
1899+
<div class="cell_input docutils container">
1900+
<div class="highlight-ipython3 notranslate"><div class="highlight"><pre><span></span><span class="kn">from</span> <span class="nn">pyspark.sql</span> <span class="kn">import</span> <span class="n">SparkSession</span>
1901+
1902+
<span class="n">spark</span> <span class="o">=</span> <span class="n">SparkSession</span><span class="o">.</span><span class="n">builder</span><span class="o">.</span><span class="n">getOrCreate</span><span class="p">()</span>
1903+
</pre></div>
1904+
</div>
1905+
</div>
1906+
</div>
1907+
<div class="cell docutils container">
1908+
<div class="cell_input docutils container">
1909+
<div class="highlight-ipython3 notranslate"><div class="highlight"><pre><span></span><span class="kn">from</span> <span class="nn">statsforecast.core</span> <span class="kn">import</span> <span class="n">StatsForecast</span>
1910+
<span class="kn">from</span> <span class="nn">statsforecast.models</span> <span class="kn">import</span> <span class="n">AutoETS</span>
1911+
<span class="kn">from</span> <span class="nn">statsforecast.utils</span> <span class="kn">import</span> <span class="n">generate_series</span>
1912+
<span class="kn">from</span> <span class="nn">tqdm.autonotebook</span> <span class="kn">import</span> <span class="n">tqdm</span>
1913+
1914+
<span class="n">n_series</span> <span class="o">=</span> <span class="mi">4</span>
1915+
<span class="n">horizon</span> <span class="o">=</span> <span class="mi">7</span>
1916+
1917+
<span class="n">series</span> <span class="o">=</span> <span class="n">generate_series</span><span class="p">(</span><span class="n">n_series</span><span class="p">)</span>
1918+
1919+
<span class="c1"># Convert to Spark</span>
1920+
<span class="n">spark_df</span> <span class="o">=</span> <span class="n">spark</span><span class="o">.</span><span class="n">createDataFrame</span><span class="p">(</span><span class="n">series</span><span class="p">)</span>
1921+
<span class="n">spark_df</span><span class="o">.</span><span class="n">show</span><span class="p">(</span><span class="mi">5</span><span class="p">)</span>
1922+
</pre></div>
1923+
</div>
1924+
</div>
1925+
<div class="cell_output docutils container">
1926+
<div class="output stream highlight-myst-ansi notranslate"><div class="highlight"><pre><span></span>+---------+-------------------+-------------------+
1927+
|unique_id| ds| y|
1928+
+---------+-------------------+-------------------+
1929+
| 0|2000-01-01 00:00:00|0.30138168803582194|
1930+
| 0|2000-01-02 00:00:00| 1.2724415914984484|
1931+
| 0|2000-01-03 00:00:00| 2.211827399669452|
1932+
| 0|2000-01-04 00:00:00| 3.322947056533328|
1933+
| 0|2000-01-05 00:00:00| 4.218793605631347|
1934+
+---------+-------------------+-------------------+
1935+
only showing top 5 rows
1936+
</pre></div>
1937+
</div>
1938+
</div>
1939+
</div>
1940+
<div class="cell docutils container">
1941+
<div class="cell_input docutils container">
1942+
<div class="highlight-ipython3 notranslate"><div class="highlight"><pre><span></span><span class="n">sf</span> <span class="o">=</span> <span class="n">StatsForecast</span><span class="p">(</span><span class="n">models</span><span class="o">=</span><span class="p">[</span><span class="n">AutoETS</span><span class="p">(</span><span class="n">season_length</span><span class="o">=</span><span class="mi">7</span><span class="p">)],</span> <span class="n">freq</span><span class="o">=</span><span class="s2">&quot;D&quot;</span><span class="p">)</span>
1943+
1944+
<span class="c1"># Returns a Spark DataFrame</span>
1945+
<span class="n">sf</span><span class="o">.</span><span class="n">forecast</span><span class="p">(</span><span class="n">df</span><span class="o">=</span><span class="n">spark_df</span><span class="p">,</span> <span class="n">h</span><span class="o">=</span><span class="n">horizon</span><span class="p">,</span> <span class="n">level</span><span class="o">=</span><span class="p">[</span><span class="mi">90</span><span class="p">])</span><span class="o">.</span><span class="n">show</span><span class="p">(</span><span class="mi">5</span><span class="p">)</span>
1946+
</pre></div>
1947+
</div>
1948+
</div>
1949+
<div class="cell_output docutils container">
1950+
<div class="output stderr highlight-myst-ansi notranslate"><div class="highlight"><pre><span></span>/Users/khuyentran/.pyenv/versions/3.8.16/lib/python3.8/site-packages/statsforecast/core.py:485: FutureWarning: In a future version the predictions will have the id as a column. You can set the `NIXTLA_ID_AS_COL` environment variable to adopt the new behavior and to suppress this warning.
1951+
warnings.warn(
1952+
/Users/khuyentran/.pyenv/versions/3.8.16/lib/python3.8/site-packages/statsforecast/core.py:485: FutureWarning: In a future version the predictions will have the id as a column. You can set the `NIXTLA_ID_AS_COL` environment variable to adopt the new behavior and to suppress this warning.
1953+
warnings.warn(
1954+
/Users/khuyentran/.pyenv/versions/3.8.16/lib/python3.8/site-packages/statsforecast/core.py:485: FutureWarning: In a future version the predictions will have the id as a column. You can set the `NIXTLA_ID_AS_COL` environment variable to adopt the new behavior and to suppress this warning.
1955+
warnings.warn(
1956+
</pre></div>
1957+
</div>
1958+
<div class="output stream highlight-myst-ansi notranslate"><div class="highlight"><pre><span></span>+---------+-------------------+----------+-------------+-------------+
1959+
|unique_id| ds| AutoETS|AutoETS-lo-90|AutoETS-hi-90|
1960+
+---------+-------------------+----------+-------------+-------------+
1961+
| 0|2000-08-10 00:00:00| 5.261609| 5.0255513| 5.4976664|
1962+
| 0|2000-08-11 00:00:00| 6.1963573| 5.9603| 6.432415|
1963+
| 0|2000-08-12 00:00:00|0.28230855| 0.04625102| 0.5183661|
1964+
| 0|2000-08-13 00:00:00| 1.2641948| 1.0281373| 1.5002524|
1965+
| 0|2000-08-14 00:00:00| 2.2624528| 2.0263953| 2.4985104|
1966+
+---------+-------------------+----------+-------------+-------------+
1967+
only showing top 5 rows
1968+
</pre></div>
1969+
</div>
1970+
<div class="output stderr highlight-myst-ansi notranslate"><div class="highlight"><pre><span></span>/Users/khuyentran/.pyenv/versions/3.8.16/lib/python3.8/site-packages/statsforecast/core.py:485: FutureWarning: In a future version the predictions will have the id as a column. You can set the `NIXTLA_ID_AS_COL` environment variable to adopt the new behavior and to suppress this warning.
1971+
warnings.warn(
1972+
/Users/khuyentran/.pyenv/versions/3.8.16/lib/python3.8/multiprocessing/resource_tracker.py:216: UserWarning: resource_tracker: There appear to be 1 leaked semaphore objects to clean up at shutdown
1973+
warnings.warn(&#39;resource_tracker: There appear to be %d &#39;
1974+
</pre></div>
1975+
</div>
1976+
</div>
1977+
</div>
1978+
<p><a class="reference external" href="https://bit.ly/3KNsl9P">Link to StatsForecast</a>.</p>
1979+
</section>
18691980
<section id="generative-pre-trained-forecasting-with-timegpt">
1870-
<h2><span class="section-number">6.7.16. </span>Generative Pre-trained Forecasting with TimeGPT<a class="headerlink" href="#generative-pre-trained-forecasting-with-timegpt" title="Permalink to this heading">#</a></h2>
1981+
<h2><span class="section-number">6.7.17. </span>Generative Pre-trained Forecasting with TimeGPT<a class="headerlink" href="#generative-pre-trained-forecasting-with-timegpt" title="Permalink to this heading">#</a></h2>
18711982
<div class="cell tag_hide-cell docutils container">
18721983
<details class="hide above-input">
18731984
<summary aria-label="Toggle hidden content">
@@ -2129,7 +2240,8 @@ <h2><span class="section-number">6.7.16. </span>Generative Pre-trained Forecasti
21292240
<li class="toc-h2 nav-item toc-entry"><a class="reference internal nav-link" href="#quantstats-simplify-stock-performance-analysis-in-python">6.7.13. QuantStats: Simplify Stock Performance Analysis in Python</a></li>
21302241
<li class="toc-h2 nav-item toc-entry"><a class="reference internal nav-link" href="#kneed-knee-point-detection-in-time-series">6.7.14. kneed: Knee-Point Detection in Time Series</a></li>
21312242
<li class="toc-h2 nav-item toc-entry"><a class="reference internal nav-link" href="#neuralforecast-streamline-neural-forecasting-with-familiar-sklearn-syntax">6.7.15. NeuralForecast: Streamline Neural Forecasting with Familiar Sklearn Syntax</a></li>
2132-
<li class="toc-h2 nav-item toc-entry"><a class="reference internal nav-link" href="#generative-pre-trained-forecasting-with-timegpt">6.7.16. Generative Pre-trained Forecasting with TimeGPT</a></li>
2243+
<li class="toc-h2 nav-item toc-entry"><a class="reference internal nav-link" href="#scaling-time-series-forecasting-with-statsforecast-and-spark">6.7.16. Scaling Time-Series Forecasting with StatsForecast and Spark</a></li>
2244+
<li class="toc-h2 nav-item toc-entry"><a class="reference internal nav-link" href="#generative-pre-trained-forecasting-with-timegpt">6.7.17. Generative Pre-trained Forecasting with TimeGPT</a></li>
21332245
</ul>
21342246
</nav></div>
21352247
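Note on the rendered forecast table: the first forecast row for series 0 is dated 2000-08-10, which fits the convention that, with `freq="D"` and `h=7`, the horizon covers the seven days following the last observed timestamp. A minimal sketch of that date arithmetic using only the standard library. The last-observation date 2000-08-09 is inferred from the output rather than shown in the diff, and `forecast_dates` is a hypothetical helper, not a StatsForecast function.

```python
from datetime import datetime, timedelta
from typing import List


def forecast_dates(last_observed: datetime, horizon: int) -> List[datetime]:
    """Return the daily timestamps covered by a forecast of `horizon` steps.

    Hypothetical helper: assumes a daily frequency and that forecasting
    starts one day after the last observed timestamp.
    """
    return [last_observed + timedelta(days=step) for step in range(1, horizon + 1)]


# Assumed last observation for series 0 (inferred from the output table).
dates = forecast_dates(datetime(2000, 8, 9), horizon=7)
print(dates[0].date(), dates[-1].date())  # 2000-08-10 2000-08-16
```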
