diff --git a/doc/source/index.rst b/doc/source/index.rst index daf8678a2..7f0a7b7a4 100644 --- a/doc/source/index.rst +++ b/doc/source/index.rst @@ -33,6 +33,7 @@ __ https://github.com/pySTEPS/pysteps Installation Gallery <../auto_examples/index> My first nowcast (Colab Notebook) + Optimizing number of frames OF (Colab Notebook) API Reference Example data Configuration file (pystepsrc) diff --git a/examples/opt_number_frames_OF.ipynb b/examples/opt_number_frames_OF.ipynb new file mode 100644 index 000000000..f1a06f1d6 --- /dev/null +++ b/examples/opt_number_frames_OF.ipynb @@ -0,0 +1,568 @@ +{ + "nbformat": 4, + "nbformat_minor": 0, + "metadata": { + "colab": { + "provenance": [] + }, + "kernelspec": { + "name": "python3", + "display_name": "Python 3" + }, + "language_info": { + "name": "python" + } + }, + "cells": [ + { + "cell_type": "markdown", + "source": [ + "# Optimizing number of frames\n", + "\n", + "This is coming from the question ([issue 376](https://github.com/pySTEPS/pysteps/issues/376)) on how to optimize the Optical Flow (OF) motion vectors in pySTEPS, and it is taking advantage of *ray* for computing this optimization. Other parameters of the OF computation could also be optimized but in this example just the number of frames is tested.\n", + "\n", + "**Caution:** This notebook is an example, but a larger dataset should be used to optimize the number of frames so that the results are statistically significant and sound.\n" + ], + "metadata": { + "id": "JPIK350CFtK7" + } + }, + { + "cell_type": "markdown", + "source": [ + "## Set up the Colab environment\n", + "\n", + "First, let's set up our working environment. Note that these steps are specific for Google Colab. For more information on how to install pysteps see [these instructions](https://pysteps.readthedocs.io/en/latest/user_guide/install_pysteps.html).\n", + "\n", + "### Install dependencies\n", + "\n", + "First, let's install the latest available versions of pysteps and ray using pip. This will also install the minimal dependencies needed to run pysteps." + ], + "metadata": { + "id": "Rg8qT_73G9Lk" + } + }, + { + "cell_type": "code", + "source": [ + "!pip install pysteps \"ray[tune]\" --quiet" + ], + "metadata": { + "id": "3khQ2OVqHB13" + }, + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "source": [ + "### Getting the example data\n", + "\n", + "Now that we have the environment ready, let's install the example data and configure pysteps by following [this tutorial](https://pysteps.readthedocs.io/en/latest/user_guide/example_data.html).\n", + "\n", + "First, we will use the [pysteps.datasets.download_pysteps_data()](https://pysteps.readthedocs.io/en/latest/generated/pysteps.datasets.download_pysteps_data.html) function to download the data." + ], + "metadata": { + "id": "dEIfmMtOHCpV" + } + }, + { + "cell_type": "code", + "source": [ + "# Import the helper functions\n", + "from pysteps.datasets import download_pysteps_data, create_default_pystepsrc\n", + "\n", + "# Download the pysteps data in the \"pysteps_data\"\n", + "download_pysteps_data(\"pysteps_data\")" + ], + "metadata": { + "id": "gH_Tr-GFHF4v" + }, + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "source": [ + "Next, we need to create a default configuration file that points to the downloaded data.\n", + "By default, pysteps will place the configuration file in `$HOME/.pysteps` (unix and Mac OS X) or `$USERPROFILE/pysteps` (windows).\n", + "To quickly create a configuration file, we will use the [pysteps.datasets.create_default_pystepsrc()](https://pysteps.readthedocs.io/en/latest/generated/pysteps.datasets.create_default_pystepsrc.html#pysteps.datasets.create_default_pystepsrc) helper function." + ], + "metadata": { + "id": "xIJKLYyBHf6X" + } + }, + { + "cell_type": "code", + "source": [ + "# If the configuration file is placed in one of the default locations\n", + "# (https://pysteps.readthedocs.io/en/latest/user_guide/set_pystepsrc.html#configuration-file-lookup)\n", + "# it will be loaded automatically when pysteps is imported.\n", + "config_file_path = create_default_pystepsrc(\"pysteps_data\")" + ], + "metadata": { + "id": "9w1bFhb3Hjte" + }, + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "source": [ + "Since pysteps was already initialized in this notebook, we need to load the new configuration file and update the default configuration." + ], + "metadata": { + "id": "blKtFOpEHnbk" + } + }, + { + "cell_type": "code", + "source": [ + "# Import pysteps and load the new configuration file\n", + "import pysteps\n", + "\n", + "_ = pysteps.load_config_file(config_file_path, verbose=True)" + ], + "metadata": { + "id": "pX1WeRiYHqoc" + }, + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "source": [ + "### Load the input data\n", + "\n", + "Pysteps comes with few precipitation datasets with precipitation events from different countries and providers. We can see the available datasets with `datasets.info()`:" + ], + "metadata": { + "id": "RNqD4DwpHwiB" + } + }, + { + "cell_type": "code", + "source": [ + "from pysteps import datasets\n", + "datasets.info()" + ], + "metadata": { + "id": "3kIorbSxHxW6" + }, + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "source": [ + "To import an example dataset, we use the [load_dataset()](https://pysteps.readthedocs.io/en/latest/generated/pysteps.datasets.load_dataset.html) helper function from the `pysteps.datasets` module.\n", + "\n", + "For this tutorial, we will use three events from MeteoSwiss: the first two events for our tuning set, while `mch3` will serve for the validation at the end." + ], + "metadata": { + "id": "W8SfYSysHzh6" + } + }, + { + "cell_type": "code", + "source": [ + "# import the 2-hour precipitation events\n", + "# i.e., a sequence of 24 precipitation frames with 5 minute temporal resolution\n", + "mch1 = datasets.load_dataset(\"mch\", frames=24)\n", + "mch2 = datasets.load_dataset(\"mch2\", frames=24)\n", + "mch3 = datasets.load_dataset(\"mch3\", frames=24)\n", + "\n", + "tune_set = (mch1, mch3)\n", + "val_set = (mch2,)" + ], + "metadata": { + "id": "fsire424H1_C" + }, + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "source": [ + "## Set up the experiment\n", + "**Evaluate Performance Metrics**\n", + "\n", + " - Use performance metrics such as Root Mean Squared Error (RMSE), Critical Success Index (CSI), or Heidke Skill Score (HSS) to evaluate the accuracy of nowcasts generated using different numbers of frames.\n", + " - Calculate these metrics for each number of frames across all runs but same forecasting period to avoid dependence on the variability of the observation.\n", + "\n", + "**Analyze Computational Load**\n", + " - Measure the computational time and resources required for processing different numbers of frames.\n", + " - Ensure that the increased accuracy justifies any additional computational costs." + ], + "metadata": { + "id": "XASmG5EcWQ9Q" + } + }, + { + "cell_type": "code", + "source": [ + "from pysteps.verification.detcontscores import det_cont_fct\n", + "from pysteps.verification.detcatscores import det_cat_fct\n", + "\n", + "def evaluation_fn(fct, obs):\n", + " rmse = det_cont_fct(fct, obs, scores='RMSE',axis=(1,2))\n", + " #csi = det_cat_fct(fct, obs, 0.1, scores='CSI')\n", + " #we could create a normalized multiscore value using several scores\n", + " #but for now we are just returning an array with the lead-time rmse\n", + " return rmse['RMSE']" + ], + "metadata": { + "id": "eb8eW38VX7Sy" + }, + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "source": [ + "Here we define the function that will compute a single trial of the optimization experiment. Such function takes a sample of the parameter space as an input and outputs the resulting error.\n", + "\n", + "The trial consists of run of 1 hour at the end of the 2-hours event to account for the variability of the results as a function of the lead-time. The initial hour is the range of number of frames that we are testing. The final error is the average error of all runs." + ], + "metadata": { + "id": "BLeoXHGYX934" + } + }, + { + "cell_type": "code", + "source": [ + "from pysteps import motion, nowcasts\n", + "from pysteps.utils import transformation\n", + "\n", + "N_LEADTIMES = 12\n", + "\n", + "def pysteps_objective(config, events=None, time_step=None):\n", + "\n", + " # select nowcasting methods\n", + " estimate_motion = motion.get_method(\"LK\")\n", + " extrapolate = nowcasts.get_method(\"extrapolation\")\n", + "\n", + " n_frames = config['n_frames']\n", + "\n", + " error = 0\n", + " n = 0\n", + " for event in events:\n", + "\n", + " precip_data, metadata, __ = event\n", + " n_total_frames = precip_data.shape[0]\n", + "\n", + " # log-transform the data to dBR.\n", + " precip_data_dbr, __ = transformation.dB_transform(\n", + " precip_data, metadata, threshold=0.1, zerovalue=-15.0\n", + " )\n", + "\n", + " # prepare the data for the run\n", + " obs_precip = precip_data[n_total_frames-N_LEADTIMES:n_total_frames]\n", + "\n", + " input_precip_dbr = precip_data_dbr[n_total_frames-N_LEADTIMES-n_frames:n_total_frames-N_LEADTIMES]\n", + " input_precip = precip_data[n_total_frames-N_LEADTIMES-1]\n", + "\n", + " # compute the nowcast\n", + " motion_field = estimate_motion(input_precip_dbr)\n", + " forecast_precip = extrapolate(input_precip, motion_field, N_LEADTIMES)\n", + "\n", + " # evaluate the nowcast\n", + " if time_step is None:\n", + " error += evaluation_fn(forecast_precip, obs_precip)\n", + " else:\n", + " error += evaluation_fn(forecast_precip, obs_precip)[time_step]\n", + " n += 1\n", + " # loop the event\n", + "\n", + " return {\"mean_error\": error / n}" + ], + "metadata": { + "id": "1WrFN9VGYYkq" + }, + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "source": [ + "Finally, we need to define the [search space](https://docs.ray.io/en/latest/tune/api/search_space.html) for our experiment. This step is critical to the success of the tuning, as it is important to select the most sensitive parameters and to define a reasonable range of values to be explored during the tuning. Typically, one will need to go through multiple iterations before finding a satisfying set up.\n", + "\n", + "Below we define the search space which in this case it is the number of frames. Usually ray is optimal for multiparameter models but for this example just one is used. The number of frames is defined by the input array in [dense Lucas-Kanade](https://pysteps.readthedocs.io/en/stable/generated/pysteps.motion.lucaskanade.dense_lucaskanade.html#pysteps.motion.lucaskanade.dense_lucaskanade) optical flow routine. For each parameter, we need to specify the sampling strategy, for example in this case the min (2 samples) and max (12 samples) limits." + ], + "metadata": { + "id": "iRcd2Vh8YRIq" + } + }, + { + "cell_type": "code", + "source": [ + "from ray import tune\n", + "\n", + "#Original random\n", + "# param_space = {\n", + "# \"n_frames\": tune.qrandint(2, 12, 1),\n", + "# }\n", + "\n", + "param_space = {\n", + " \"n_frames\": tune.choice(list(range(2, 13))),\n", + "}" + ], + "metadata": { + "id": "tvAmCuPwY_3O" + }, + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "source": [ + "## Launch the experiment\n", + "\n", + "We can now launch the experiment. We ask for 12 trials because of this simple experiment. If not specified, ray will use a random search by default, but more sophisticated [search algorithms](https://docs.ray.io/en/latest/tune/api/suggestion.html#tune-search-alg) are available.\n" + ], + "metadata": { + "id": "zEvmvEyPZUlU" + } + }, + { + "cell_type": "code", + "source": [ + "import ray\n", + "\n", + "N_TRIALS = 12\n", + "\n", + "results=[]\n", + "\n", + "#This is a loop in the lead-times range\n", + "for i_leadtime in range(12):\n", + " ray.shutdown()\n", + " ray.init(configure_logging=False)\n", + "\n", + " tuner = tune.Tuner(\n", + " trainable=tune.with_parameters(\n", + " pysteps_objective,\n", + " events=tune_set,\n", + " time_step=i_leadtime,\n", + " ),\n", + " tune_config=tune.TuneConfig(\n", + " metric=\"mean_error\",\n", + " mode=\"min\",\n", + " num_samples=N_TRIALS,\n", + " ),\n", + " param_space=param_space,\n", + " )\n", + "\n", + " results.append(tuner.fit())" + ], + "metadata": { + "id": "mzV73JQ5Zeq8" + }, + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "source": [ + "It is important to mention, that in the case of this example, similar results could be obtained without the use of *ray* due to the fact that only one parameter is actually computed for 11 specific values. This could be done as follow:" + ], + "metadata": { + "id": "9NWj70pQApax" + } + }, + { + "cell_type": "code", + "source": [ + "list_n_frames = list(range(2, 13))\n", + "results_noray=[]\n", + "for n_frame in list_n_frames:\n", + " results_noray.append(pysteps_objective({'n_frames':n_frame}, events=tune_set, time_step=None))" + ], + "metadata": { + "id": "-vvvcrygBGhX" + }, + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "source": [ + "## Analyse the results\n", + "\n", + "The first and most simple question we can ask is to return the best parameter set for our problem:" + ], + "metadata": { + "id": "nBtIA6SebZ-u" + } + }, + { + "cell_type": "code", + "source": [ + "best_param_set = [result.get_best_result().config for result in results]\n", + "best_param_set\n", + "\n", + "## too avoid multiple lead-times optimization, we fix the first lead-time\n", + "results = results[0]\n", + "best_param_set = best_param_set[0]" + ], + "metadata": { + "id": "TXALu0D3bcsN" + }, + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "source": [ + "We can verify that this results are matching with the ones obtained without using *ray* library. Hoowever, the simplicity of the previous code with ray will change to a slightly more complex code to obtain similar outputs." + ], + "metadata": { + "id": "IgrzYOQNC4XX" + } + }, + { + "cell_type": "code", + "source": [ + "import numpy as np\n", + "\n", + "search_matrix = np.vstack([item['mean_error'] for item in results_noray])\n", + "\n", + "# Lead-time range\n", + "lead_time = np.arange(1, 13)\n", + "\n", + "# Optimal n_frames\n", + "optimal_n_frames = np.argmin(search_matrix, axis=0) + min(list_n_frames)\n", + "\n", + "# Value of the minimum\n", + "optimal_value = np.min(search_matrix, axis=0)\n", + "\n", + "# Print the table with aligned columns\n", + "print('{:<12} {}'.format('Lead-time: ', ' '.join(f'{x:5d}' for x in lead_time)))\n", + "print('{:<12} {}'.format('Opt n_frames: ', ' '.join(f'{x:5d}' for x in optimal_n_frames)))\n", + "print('{:<12} {}'.format('Optimal value:', ' '.join(f'{x:5.2f}' for x in optimal_value)))" + ], + "metadata": { + "id": "i5LhPbelDLgf" + }, + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "source": [ + "An interesting analysis of the results consists in plotting the error of each trial as a function of its parameter values. Such plots are known as \"dot plots\". By looking at the shape of the cloud of points, one can get a good idea about the sensitivity of a given parameter. For instance, a flat curve usually indicates weak sensitivity: no matter what parameter value is selected, the resulting error doesn't seem to change much. Conversely, a clear spike in the plot indicates a sensitive parameter." + ], + "metadata": { + "id": "dmOBGa6ubhmC" + } + }, + { + "cell_type": "code", + "source": [ + "import matplotlib.pyplot as plt\n", + "from math import ceil\n", + "\n", + "def make_dot_plots(tune_results, yaxis, ylog=True, reference=None):\n", + " tune_df = tune_results.get_dataframe()\n", + " tune_best = tune_results.get_best_result()\n", + " col_params = sorted([col for col in tune_df.columns if \"config/\" in col])\n", + " ncols = 1\n", + " nrows = ceil(len(col_params) / ncols)\n", + " fig, axs = plt.subplots(nrows, ncols, figsize=(ncols * 3.5, nrows * 3))\n", + " for n, par in enumerate(col_params):\n", + " parsed_par = par.split(\"/\")[1]\n", + " try:\n", + " ax = axs.flat[n]\n", + " except:\n", + " ax = axs\n", + " ax.scatter(tune_df[par], tune_df[yaxis], alpha=0.5)\n", + " ax.scatter(tune_best.config[parsed_par], tune_best.metrics[yaxis], c=\"r\")\n", + " if reference:\n", + " ax.axhline(reference, c=\"k\", ls=\":\")\n", + " if ylog:\n", + " ax.set_yscale(\"log\")\n", + " ax.set_xlabel(parsed_par)\n", + " if n % ncols == 0:\n", + " ax.set_ylabel(yaxis)\n", + " else:\n", + " ax.axes.get_yaxis().set_visible(False)\n", + "\n", + " plt.tight_layout()\n", + "\n", + "# compute error for default settings\n", + "tune_error_default = pysteps_objective({'n_frames':3}, tune_set)[\"mean_error\"]\n", + "\n", + "make_dot_plots(results, \"mean_error\", reference=tune_error_default)" + ], + "metadata": { + "id": "YnqCN9mSboGc" + }, + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "source": [ + "The dashed horizontal line represents the skill of the default parameter set (i.e., without tuning), while the red dot represents the best paramter set.\n", + "\n", + "Finally, let's compare the default and best parameter sets on the tune and validation events. This will provide some indication on the generability of the optimized parameter set." + ], + "metadata": { + "id": "hOCvapHtdt-_" + } + }, + { + "cell_type": "code", + "source": [ + "import pandas as pd\n", + "\n", + "# compute validation scores\n", + "val_error_best = pysteps_objective(best_param_set, val_set)[\"mean_error\"]\n", + "val_error_default = pysteps_objective({'n_frames':3}, val_set)[\"mean_error\"]" + ], + "metadata": { + "id": "POqo0ksJdw65" + }, + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "source": [ + "data = {\n", + " \"default\": {\n", + " \"tune\": tune_error_default,\n", + " \"val\": val_error_default,\n", + " },\n", + " \"best\": {\n", + " \"tune\": results.get_best_result().metrics[\"mean_error\"],\n", + " \"val\": val_error_best,\n", + " },\n", + "}\n", + "\n", + "df = pd.DataFrame(data)\n", + "\n", + "# Plotting with adjusted y-axis limits\n", + "ax = df.plot.bar()\n", + "\n", + "# Set the y-axis limits to focus on the data range\n", + "y_min = df.min().min() - 0.1 * (df.max().max() - df.min().min())\n", + "y_max = df.max().max() + 0.1 * (df.max().max() - df.min().min())\n", + "ax.set_ylim(y_min, y_max)" + ], + "metadata": { + "id": "LLKOrNiRfVT0" + }, + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "source": [ + "And in this case it seems the obtained optimal value is also optimal for the validation event." + ], + "metadata": { + "id": "CvymwPhmHbWc" + } + } + ] +} \ No newline at end of file