
Finalization of Dask Batch Import Notebook #429


Merged · 2 commits · May 6, 2024

3 changes: 3 additions & 0 deletions topic/timeseries/README.md
@@ -51,6 +51,9 @@ repository, e.g. about machine learning, to see predictions and AutoML in action

To ensure the dashboard functions correctly, it's necessary to configure the data source within Grafana. This dashboard uses the `grafana-postgresql-datasource` or another configured default data source. In the data source settings, fill in the necessary parameters to connect to your CrateDB instance. This includes setting up the database name (`database=doc`), user, password, and host.

- `dask-weather-data-import.ipynb` [![Open on GitHub](https://img.shields.io/badge/Open%20on-GitHub-lightgray?logo=GitHub)](dask-weather-data-import.ipynb) [![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/crate/cratedb-examples/blob/main/topic/timeseries/dask-weather-data-import.ipynb)

This notebook walks you through an example that downloads a larger dataset and inserts it into CrateDB via pandas and Dask, utilizing Dask's capabilities to parallelize operations.

## Software Tests

77 changes: 50 additions & 27 deletions topic/timeseries/dask-weather-data-import.ipynb
@@ -7,10 +7,19 @@
"source": [
"# How to Build Time Series Applications in CrateDB\n",
"\n",
"This notebook guides you through an example of how to import and work with\n",
"This notebook guides you through an example of how to batch import \n",
"time series data in CrateDB. It uses Dask to import data into CrateDB.\n",
"Dask is a framework to parallelize operations on pandas Dataframes.\n",
"\n",
"## Important Note\n",
"If you are running this notebook on a (free) Google Colab environment, you \n",
"might not see the parallelized execution of Dask operations due to constrained\n",
"CPU availability.\n",
Comment on lines +14 to +17
Member: That's good to know. Thanks for finding out, and for conveying that information to readers and users of the relevant notebook.

"\n",
"We therefore recommend to run this notebook either locally or on an environment\n",
"that provides sufficient CPU capacity to demonstrate the parallel execution of\n",
"dataframe operations as well as write operations to CrateDB.\n",
"\n",
"## Dataset\n",
"This notebook uses a daily weather data set provided on kaggle.com. This dataset\n",
"offers a collection of **daily weather readings from major cities around the\n",
@@ -57,7 +66,7 @@
},
"outputs": [],
"source": [
"#!pip install dask pandas==2.0.0 'sqlalchemy[crate]'"
"!pip install dask 'pandas==2.0.0' 'crate[sqlalchemy]' 'cratedb-toolkit==0.0.10' 'pueblo>=0.0.7' kaggle"
Contributor: To stay consistent when it comes to library versions, maybe it makes sense to always install those specified in requirements.txt.

Member: Why pandas==2.0.0 and not a higher pandas==2.*?

Member @amotl, Apr 19, 2024: Using requirements.txt, as exercised with the other notebooks, will be much better, because Dependabot does not bump dependencies inside notebooks. In turn, this will quickly get out of sync, increase the risk for havoc, or may even influence the dependencies the other notebooks are validated with.

Member @amotl, May 6, 2024: I see. As opposed to the other notebooks here, which need pandas<2 and sqlalchemy<2, this one needs more recent versions of both? Is that correct?

If so, bringing this in is actually blocked by one of those?

]
},
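A minimal sketch of the reviewers' suggestion, assuming a hypothetical `requirements.txt` next to the notebook that carries the same pins as the install line above:

```python
# Install pinned dependencies from a shared requirements file, so that
# Dependabot can bump versions in one place for all notebooks.
!pip install -r requirements.txt
```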
{
@@ -75,6 +84,9 @@
"- Countries (countries.csv)\n",
"\n",
"The subsequent code cell acquires the dataset directly from kaggle.com.\n",
"In order to import the data automatically, you need to create a (free)\n",
"API key in your kaggle.com user settings. \n",
"\n",
"To properly configure the notebook to use corresponding credentials\n",
"after signing up on Kaggle, define the `KAGGLE_USERNAME` and\n",
"`KAGGLE_KEY` environment variables. Alternatively, put them into the\n",
Expand All @@ -85,55 +97,69 @@
" \"key\": \"2b1dac2af55caaf1f34df76236fada4a\"\n",
"}\n",
"```\n",
"\n",
"Another variant is to acquire the dataset files manually, and extract\n",
"them into a folder called `DOWNLOAD`. In this case, you can deactivate\n",
"those two lines of code, in order to skip automatic dataset acquisition."
]
},
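If you prefer to configure the Kaggle credentials in code rather than through the prompts in the next cell, a minimal sketch (the placeholder values are assumptions):

```python
# The kaggle package reads KAGGLE_USERNAME and KAGGLE_KEY from the
# environment when authenticating; replace the placeholders accordingly.
import os

os.environ.setdefault("KAGGLE_USERNAME", "your-kaggle-username")
os.environ.setdefault("KAGGLE_KEY", "your-kaggle-api-key")
```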
{
"cell_type": "code",
"execution_count": null,
"outputs": [],
"execution_count": 3,
"id": "8fcc014a",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Dataset URL: https://www.kaggle.com/datasets/guillemservera/global-daily-climate-data\n"
]
}
],
"source": [
"from pueblo.util.environ import getenvpass\n",
"from cratedb_toolkit.datasets import load_dataset\n",
"\n",
"# Uncomment and execute the following lines to get prompted for kaggle user name and key\n",
"# getenvpass(\"KAGGLE_USERNAME\", prompt=\"Kaggle.com User Name:\")\n",
"# getenvpass(\"KAGGLE_KEY\", prompt=\"Kaggle.com Key:\")\n",
"\n",
"dataset = load_dataset(\"kaggle://guillemservera/global-daily-climate-data/daily_weather.parquet\")\n",
"dataset.acquire()"
],
"metadata": {
"collapsed": false
}
]
},
{
"cell_type": "code",
"execution_count": 88,
"execution_count": 6,
"id": "d9e2916d",
"metadata": {},
"outputs": [],
"source": [
"from dask import dataframe as dd\n",
"from dask.diagnostics import ProgressBar\n",
"\n",
"# Use multiprocessing of dask\n",
"import dask.multiprocessing\n",
"dask.config.set(scheduler=dask.multiprocessing.get)\n",
"\n",
"# Show a progress bar for dask activities\n",
"pbar = ProgressBar()\n",
"pbar.register()"
],
"metadata": {
"collapsed": false
}
]
},
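The source of the following cell is collapsed in this diff; judging from its output, it loads the Parquet file into a Dask dataframe and inspects it. A hedged sketch of such a step, assuming the dataset was extracted into the `DOWNLOAD` folder mentioned above:

```python
# Lazily read the weather measurements into a Dask dataframe, then print
# schema, non-null counts, and memory usage, followed by a preview.
df = dd.read_parquet("DOWNLOAD/daily_weather.parquet")
df.info(verbose=True, memory_usage=True)
df.head()
```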
{
"cell_type": "code",
"execution_count": 56,
"execution_count": 9,
"id": "a506f7c9",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[########################################] | 100% Completed | 6.26 ss\n",
"[########################################] | 100% Completed | 6.37 s\n",
"[########################################] | 100% Completed | 6.47 s\n",
"[########################################] | 100% Completed | 6.47 s\n",
"[########################################] | 100% Completed | 127.49 s\n",
"[########################################] | 100% Completed | 127.49 s\n",
"<class 'dask.dataframe.core.DataFrame'>\n",
"Index: 27635763 entries, 0 to 24220\n",
"Data columns (total 14 columns):\n",
@@ -155,10 +181,8 @@
"13 sunshine_total_min 1021461 non-null float64\n",
"dtypes: category(3), datetime64[ns](1), float64(10)\n",
"memory usage: 2.6 GB\n",
"[########################################] | 100% Completed | 5.37 ss\n",
"[########################################] | 100% Completed | 5.48 s\n",
"[########################################] | 100% Completed | 5.58 s\n",
"[########################################] | 100% Completed | 5.68 s\n"
"[########################################] | 100% Completed | 4.82 ss\n",
"[########################################] | 100% Completed | 4.89 s\n"
]
},
{
@@ -311,7 +335,7 @@
"4 NaN NaN NaN "
]
},
"execution_count": 56,
"execution_count": 9,
"metadata": {},
"output_type": "execute_result"
}
@@ -490,14 +514,13 @@
},
{
"cell_type": "markdown",
"id": "ea1dfadc",
"metadata": {},
"source": [
"### Connect to CrateDB\n",
"\n",
"This code uses SQLAlchemy to connect to CrateDB."
],
"metadata": {
"collapsed": false
}
]
},
{
"cell_type": "code",
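The connection and import cells are collapsed in this diff. A hedged sketch of the overall pattern, assuming a local CrateDB instance reachable via the `crate://` SQLAlchemy dialect installed above, and a hypothetical target table named `weather_data`:

```python
import sqlalchemy as sa

# Connection string for a local CrateDB instance; adjust host and credentials.
CONNECTION_STRING = "crate://crate@localhost:4200"

# Verify connectivity with a simple query against the cluster.
engine = sa.create_engine(CONNECTION_STRING)
with engine.connect() as conn:
    print(conn.execute(sa.text("SELECT name FROM sys.cluster")).scalar())

# Write the Dask dataframe to CrateDB. parallel=True lets each partition
# insert concurrently; chunksize batches rows per INSERT round trip.
df.to_sql(
    "weather_data",
    uri=CONNECTION_STRING,
    index=False,
    if_exists="replace",
    chunksize=10_000,
    parallel=True,
)
```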