Finalization of Dask Batch Import Notebook #429
@@ -7,10 +7,19 @@
"source": [
"# How to Build Time Series Applications in CrateDB\n",
"\n",
"This notebook guides you through an example of how to import and work with\n",
"This notebook guides you through an example of how to batch import \n",
"time series data in CrateDB. It uses Dask to import data into CrateDB.\n",
"Dask is a framework to parallelize operations on pandas Dataframes.\n",
"\n",
"## Important Note\n",
"If you are running this notebook on a (free) Google Colab environment, you \n",
"might not see the parallelized execution of Dask operations due to constrained\n",
"CPU availability.\n",
"\n",
"We therefore recommend to run this notebook either locally or on an environment\n",
"that provides sufficient CPU capacity to demonstrate the parallel execution of\n",
"dataframe operations as well as write operations to CrateDB.\n",
"\n",
"## Dataset\n",
"This notebook uses a daily weather data set provided on kaggle.com. This dataset\n",
"offers a collection of **daily weather readings from major cities around the\n",
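The hunk above describes Dask as a framework to parallelize operations on pandas DataFrames, and warns that a constrained runner may not show parallel execution. A minimal standard-library sketch of the underlying split-apply-combine pattern (the helper `partition_mean` and the data are hypothetical; Dask applies the same idea to pandas DataFrame partitions):

```python
from concurrent.futures import ProcessPoolExecutor

def partition_mean(chunk):
    # Stand-in for a per-partition pandas operation.
    return sum(chunk) / len(chunk)

def main():
    data = list(range(1_000_000))
    npartitions = 4
    size = len(data) // npartitions
    partitions = [data[i * size:(i + 1) * size] for i in range(npartitions)]

    # Dask's multiprocessing scheduler follows the same pattern: one task per
    # partition, executed on separate CPU cores, results combined afterwards.
    # On a machine with few available CPUs, these tasks mostly run one after
    # another, which is the effect the note above describes.
    with ProcessPoolExecutor(max_workers=npartitions) as pool:
        partial = list(pool.map(partition_mean, partitions))

    # Partitions are equally sized, so the mean of means equals the overall mean.
    return sum(partial) / len(partial)

if __name__ == "__main__":
    print(main())  # 499999.5
```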
@@ -57,7 +66,7 @@
}
"outputs": [],
"source": [
"#!pip install dask pandas==2.0.0 'sqlalchemy[crate]'"
"!pip install dask 'pandas==2.0.0' 'crate[sqlalchemy]' 'cratedb-toolkit==0.0.10' 'pueblo>=0.0.7' kaggle"
]
}
{

Review comments on this change:

- To stay consistent when it comes to library versions maybe it makes sense to always install those specified in […]
- Why […]
- Using […]
- I see. As opposed to the other notebooks here, which need […] If so, bringing this in is actually blocked by one of those?
@@ -75,6 +84,9 @@
"- Countries (countries.csv)\n",
"\n",
"The subsequent code cell acquires the dataset directly from kaggle.com.\n",
"In order to import the data automatically, you need to create a (free)\n",
"API key in your kaggle.com user settings. \n",
"\n",
"To properly configure the notebook to use corresponding credentials\n",
"after signing up on Kaggle, define the `KAGGLE_USERNAME` and\n",
"`KAGGLE_KEY` environment variables. Alternatively, put them into the\n",
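The hunk above offers two ways to supply Kaggle credentials: the `KAGGLE_USERNAME` and `KAGGLE_KEY` environment variables, or a `kaggle.json` file. A sketch of both variants, using placeholder credentials and a temporary directory instead of the real `~/.kaggle/` location:

```python
import json
import os
import tempfile

# Placeholder credentials; generate real ones on kaggle.com via your
# account settings ("Create New API Token").
creds = {"username": "your-kaggle-username", "key": "your-kaggle-api-key"}

# Variant 1: environment variables, picked up by the kaggle client.
os.environ["KAGGLE_USERNAME"] = creds["username"]
os.environ["KAGGLE_KEY"] = creds["key"]

# Variant 2: a kaggle.json file. The real location is ~/.kaggle/kaggle.json;
# a temporary directory keeps this sketch side-effect free.
config_dir = tempfile.mkdtemp()
config_path = os.path.join(config_dir, "kaggle.json")
with open(config_path, "w") as f:
    json.dump(creds, f)
os.chmod(config_path, 0o600)  # the kaggle client expects tight permissions
```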
@@ -85,55 +97,69 @@
" \"key\": \"2b1dac2af55caaf1f34df76236fada4a\"\n",
"}\n",
"```\n",
"\n",
"Another variant is to acquire the dataset files manually, and extract\n",
"them into a folder called `DOWNLOAD`. In this case, you can deactivate\n",
"those two lines of code, in order to skip automatic dataset acquisition."
]
},
{
"cell_type": "code",
"execution_count": null,
"outputs": [],
"execution_count": 3,
"id": "8fcc014a",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Dataset URL: https://www.kaggle.com/datasets/guillemservera/global-daily-climate-data\n"
]
}
],
"source": [
"from pueblo.util.environ import getenvpass\n",
"from cratedb_toolkit.datasets import load_dataset\n",
"\n",
"# Uncomment and execute the following lines to get prompted for kaggle user name and key\n",
"# getenvpass(\"KAGGLE_USERNAME\", prompt=\"Kaggle.com User Name:\")\n",
"# getenvpass(\"KAGGLE_KEY\", prompt=\"Kaggle.com Key:\")\n",
"\n",
"dataset = load_dataset(\"kaggle://guillemservera/global-daily-climate-data/daily_weather.parquet\")\n",
"dataset.acquire()"
],
"metadata": {
"collapsed": false
}
]
},
{
"cell_type": "code",
"execution_count": 88,
"execution_count": 6,
"id": "d9e2916d",
"metadata": {},
"outputs": [],
"source": [
"from dask import dataframe as dd\n",
"from dask.diagnostics import ProgressBar\n",
"\n",
"# Use multiprocessing of dask\n",
"import dask.multiprocessing\n",
"dask.config.set(scheduler=dask.multiprocessing.get)\n",
"\n",
"# Show a progress bar for dask activities\n",
"pbar = ProgressBar()\n",
"pbar.register()"
],
"metadata": {
"collapsed": false
}
]
},
{
"cell_type": "code",
"execution_count": 56,
"execution_count": 9,
"id": "a506f7c9",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[########################################] | 100% Completed | 6.26 ss\n",
"[########################################] | 100% Completed | 6.37 s\n",
"[########################################] | 100% Completed | 6.47 s\n",
"[########################################] | 100% Completed | 6.47 s\n",
"[########################################] | 100% Completed | 127.49 s\n",
"[########################################] | 100% Completed | 127.49 s\n",
"<class 'dask.dataframe.core.DataFrame'>\n",
"Index: 27635763 entries, 0 to 24220\n",
"Data columns (total 14 columns):\n",
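The cell in the hunk above selects the multiprocessing scheduler via the legacy callable form `dask.config.set(scheduler=dask.multiprocessing.get)`. Current Dask versions also accept string aliases, which avoids the extra import; a minimal sketch of that equivalent spelling (requires `dask` to be installed):

```python
import dask

# String-based equivalent of the scheduler selection above:
# "processes" maps to the multiprocessing scheduler; "threads" and
# "synchronous" are the other built-in choices.
dask.config.set(scheduler="processes")
print(dask.config.get("scheduler"))  # processes
```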
@@ -155,10 +181,8 @@
"13 sunshine_total_min 1021461 non-null float64\n",
"dtypes: category(3), datetime64[ns](1), float64(10)\n",
"memory usage: 2.6 GB\n",
"[########################################] | 100% Completed | 5.37 ss\n",
"[########################################] | 100% Completed | 5.48 s\n",
"[########################################] | 100% Completed | 5.58 s\n",
"[########################################] | 100% Completed | 5.68 s\n"
"[########################################] | 100% Completed | 4.82 ss\n",
"[########################################] | 100% Completed | 4.89 s\n"
]
},
{
@@ -311,7 +335,7 @@
"4 NaN NaN NaN "
]
},
"execution_count": 56,
"execution_count": 9,
"metadata": {},
"output_type": "execute_result"
}
@@ -490,14 +514,13 @@
},
{
"cell_type": "markdown",
"id": "ea1dfadc",
"metadata": {},
"source": [
"### Connect to CrateDB\n",
"\n",
"This code uses SQLAlchemy to connect to CrateDB."
],
"metadata": {
"collapsed": false
}
]
},
{
"cell_type": "code",
Review comment:

- That's good to know. Thanks for finding out, and for conveying that information to readers and users of the relevant notebook.
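The "Connect to CrateDB" cell in the hunk above uses SQLAlchemy. A minimal sketch of such a connection, assuming a local CrateDB instance on its default HTTP port 4200 (the connection URL is an assumption, not taken from the diff):

```python
import sqlalchemy as sa

# Assumed connection string: CrateDB's SQLAlchemy dialect uses the
# "crate://" scheme. For a remote or cloud cluster, the URL would also
# carry user, password, host, and an ssl flag.
dburi = "crate://localhost:4200"

try:
    # create_engine() is lazy; no connection is opened until first use.
    engine = sa.create_engine(dburi, echo=False)
    # with engine.connect() as conn:
    #     conn.execute(sa.text("SELECT 1"))  # requires a running CrateDB
except sa.exc.NoSuchModuleError:
    # The "crate" dialect ships with the `crate[sqlalchemy]` package
    # installed in the notebook's first cell.
    engine = None
```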