Skip to content

Commit 5f6e1f4

Browse files
committed
ML/AutoML: Fix testing after streamlining the connectivity configuration
Concatenating the `schema` query parameter to the SQLAlchemy connection string correctly is crucial. In order to avoid anomalies or confusion, this patch makes it so that both types of connection strings (regular data vs. MLflow tracking) are configured side-by-side now, so it is easier to understand what is going on.
1 parent 1d0afc6 commit 5f6e1f4

8 files changed

+114
-78
lines changed

topic/machine-learning/automl/README.md

Lines changed: 25 additions & 5 deletions
Original file line numberDiff line numberDiff line change
@@ -71,11 +71,31 @@ and [CrateDB].
7171
performing model. The notebook also shows how to use CrateDB as storage for
7272
both the raw data and the expirement tracking and model registry data.
7373

74-
- Accompanied to the Jupyter Notebook files, there are also basic variants of
75-
the above examples,
76-
[automl_timeseries_forecasting_with_pycaret.py](automl_timeseries_forecasting_with_pycaret.py),
77-
[automl_classification_with_pycaret.py](automl_classification_with_pycaret.py).
74+
- Accompanied to the Jupyter Notebook files, there are also basic standalone
75+
program variants of the above examples.
76+
- [automl_timeseries_forecasting_with_pycaret.py](automl_timeseries_forecasting_with_pycaret.py),
77+
- [automl_classification_with_pycaret.py](automl_classification_with_pycaret.py).
78+
79+
80+
## Software Tests
81+
82+
The resources are validated by corresponding software tests on CI. You can
83+
also use those on your workstation. For example, to invoke the test cases
84+
validating the Notebook about timeseries classification with PyCaret, run:
85+
86+
```shell
87+
pytest -k automl_classification_with_pycaret.ipynb
88+
```
89+
90+
Alternatively, you can validate all resources in this folder by invoking a
91+
test runner program on the top-level folder of this repository. This is the
92+
same code path the CI jobs are taking.
93+
```shell
94+
pip install -r requirements.txt
95+
ngr test topic/machine-learning/automl
96+
```
97+
7898

79-
[PyCaret]: https://github.com/pycaret/pycaret
8099
[CrateDB]: https://github.com/crate/crate
81100
[Introduction to hyperparameter tuning]: https://medium.com/analytics-vidhya/comparison-of-hyperparameter-tuning-algorithms-grid-search-random-search-bayesian-optimization-5326aaef1bd1
101+
[PyCaret]: https://github.com/pycaret/pycaret

topic/machine-learning/automl/automl_classification_with_pycaret.ipynb

Lines changed: 22 additions & 15 deletions
Original file line numberDiff line numberDiff line change
@@ -167,17 +167,21 @@
167167
"source": [
168168
"import os\n",
169169
"\n",
170-
"# For CrateDB Cloud, use:\n",
170+
"# Define database connectivity when connecting to CrateDB Cloud.\n",
171171
"CONNECTION_STRING = os.environ.get(\n",
172172
" \"CRATEDB_CONNECTION_STRING\",\n",
173173
" \"crate://username:password@hostname/?ssl=true\",\n",
174174
")\n",
175175
"\n",
176-
"# For an self-deployed CrateDB, e.g. via Docker, please use:\n",
176+
"# Define database connectivity when connecting to CrateDB on localhost.\n",
177177
"# CONNECTION_STRING = os.environ.get(\n",
178178
"# \"CRATEDB_CONNECTION_STRING\",\n",
179179
"# \"crate://crate@localhost/?ssl=false\",\n",
180-
"# )"
180+
"# )\n",
181+
"\n",
182+
"# Compute derived connection strings for SQLAlchemy use vs. MLflow use.\n",
183+
"DBURI_DATA = f\"{CONNECTION_STRING}&schema=testdrive\"\n",
184+
"DBURI_MLFLOW = f\"{CONNECTION_STRING}&schema=mlflow\""
181185
]
182186
},
183187
{
@@ -188,11 +192,13 @@
188192
"\n",
189193
"For convenience, this notebook comes with an accompanying CSV dataset which you\n",
190194
"can quickly import into the database. Upload the CSV file to your CrateDB cloud\n",
191-
"cluster, as described [here](https://cratedb.com/docs/cloud/en/latest/reference/overview.html#import).\n",
195+
"cluster, as described at [CrateDB Cloud » Import].\n",
192196
"To follow this notebook, choose `pycaret_churn` for your table name.\n",
193197
"\n",
194198
"This will automatically create a new database table and import the data.\n",
195199
"\n",
200+
"[CrateDB Cloud » Import]: https://cratedb.com/docs/cloud/en/latest/reference/overview.html#import\n",
201+
"\n",
196202
"### Alternative data import using code\n",
197203
"\n",
198204
"If you prefer to use code to import your data, please execute the following lines which read the CSV\n",
@@ -212,12 +218,16 @@
212218
"if os.path.exists(\".env\"):\n",
213219
" dotenv.load_dotenv(\".env\", override=True)\n",
214220
"\n",
215-
"engine = sa.create_engine(CONNECTION_STRING, echo=os.environ.get('DEBUG'))\n",
221+
"# Connect to database.\n",
222+
"engine = sa.create_engine(DBURI_DATA, echo=bool(os.environ.get('DEBUG')))\n",
223+
"\n",
224+
"# Import data.\n",
216225
"df = pd.read_csv(\"https://github.com/crate/cratedb-datasets/raw/main/machine-learning/automl/churn-dataset.csv\")\n",
226+
"df.to_sql(\"pycaret_churn\", engine, schema=\"testdrive\", index=False, chunksize=1000, if_exists=\"replace\")\n",
217227
"\n",
228+
"# CrateDB is eventually consistent, so synchronize write operations.\n",
218229
"with engine.connect() as conn:\n",
219-
" df.to_sql(\"pycaret_churn\", conn, index=False, chunksize=1000, if_exists=\"replace\")\n",
220-
" conn.execute(sa.text(\"REFRESH TABLE pycaret_churn;\"))"
230+
" conn.execute(sa.text(\"REFRESH TABLE pycaret_churn\"))"
221231
]
222232
},
223233
{
@@ -250,16 +260,14 @@
250260
"if os.path.exists(\".env\"):\n",
251261
" dotenv.load_dotenv(\".env\", override=True)\n",
252262
"\n",
253-
"engine = sa.create_engine(CONNECTION_STRING, echo=os.environ.get('DEBUG'))\n",
263+
"engine = sa.create_engine(DBURI_DATA, echo=bool(os.environ.get('DEBUG')))\n",
254264
"\n",
255265
"with engine.connect() as conn:\n",
256266
" with conn.execute(sa.text(\"SELECT * FROM pycaret_churn\")) as cursor:\n",
257267
" data = pd.DataFrame(cursor.fetchall(), columns=cursor.keys())\n",
258268
"\n",
259-
"# We set the MLFLOW_TRACKING_URI to our CrateDB instance. We'll see later why\n",
260-
"os.environ[\n",
261-
" \"MLFLOW_TRACKING_URI\"\n",
262-
"] = f\"{CONNECTION_STRING}&schema=mlflow\""
269+
"# Configure MLflow to use CrateDB.\n",
270+
"os.environ[\"MLFLOW_TRACKING_URI\"] = DBURI_MLFLOW"
263271
]
264272
},
265273
{
@@ -3441,9 +3449,8 @@
34413449
"metadata": {},
34423450
"outputs": [],
34433451
"source": [
3444-
"os.environ[\n",
3445-
" \"MLFLOW_TRACKING_URI\"\n",
3446-
"] = f\"{CONNECTION_STRING}&schema=mlflow\""
3452+
"# Configure MLflow to use CrateDB.\n",
3453+
"os.environ[\"MLFLOW_TRACKING_URI\"] = DBURI_MLFLOW"
34473454
]
34483455
},
34493456
{

topic/machine-learning/automl/automl_classification_with_pycaret.py

Lines changed: 14 additions & 4 deletions
Original file line numberDiff line numberDiff line change
@@ -17,16 +17,26 @@
1717
dotenv.load_dotenv(".env", override=True)
1818

1919

20-
# Configure database connection string.
21-
dburi = f"crate://{os.environ['CRATE_USER']}:{os.environ['CRATE_PASSWORD']}@{os.environ['CRATE_HOST']}:4200?ssl={os.environ['CRATE_SSL']}"
22-
os.environ["MLFLOW_TRACKING_URI"] = f"{dburi}&schema=mlflow"
20+
# Configure to connect to CrateDB server on localhost.
21+
CONNECTION_STRING = os.environ.get(
22+
"CRATEDB_CONNECTION_STRING",
23+
"crate://crate@localhost/?ssl=false",
24+
)
25+
26+
# Compute derived connection strings for SQLAlchemy use vs. MLflow use.
27+
DBURI_DATA = f"{CONNECTION_STRING}&schema=testdrive"
28+
DBURI_MLFLOW = f"{CONNECTION_STRING}&schema=mlflow"
29+
30+
# Propagate database connectivity settings.
31+
engine = sa.create_engine(DBURI_DATA, echo=bool(os.environ.get("DEBUG")))
32+
os.environ["MLFLOW_TRACKING_URI"] = DBURI_MLFLOW
2333

2434

2535
def fetch_data():
2636
"""
2737
Fetch data from CrateDB, using SQL and SQLAlchemy, and wrap result into pandas data frame.
2838
"""
29-
engine = sa.create_engine(dburi, echo=True)
39+
engine = sa.create_engine(DBURI_DATA, echo=True)
3040

3141
with engine.connect() as conn:
3242
with conn.execute(sa.text("SELECT * FROM pycaret_churn")) as cursor:

topic/machine-learning/automl/automl_timeseries_forecasting_with_pycaret.ipynb

Lines changed: 24 additions & 21 deletions
Original file line numberDiff line numberDiff line change
@@ -160,17 +160,21 @@
160160
"source": [
161161
"import os\n",
162162
"\n",
163-
"# For CrateDB Cloud, use:\n",
163+
"# Define database connectivity when connecting to CrateDB Cloud.\n",
164164
"CONNECTION_STRING = os.environ.get(\n",
165165
" \"CRATEDB_CONNECTION_STRING\",\n",
166166
" \"crate://username:password@hostname/?ssl=true\",\n",
167167
")\n",
168168
"\n",
169-
"# For an self-deployed CrateDB, e.g. via Docker, please use:\n",
169+
"# Define database connectivity when connecting to CrateDB on localhost.\n",
170170
"# CONNECTION_STRING = os.environ.get(\n",
171171
"# \"CRATEDB_CONNECTION_STRING\",\n",
172172
"# \"crate://crate@localhost/?ssl=false\",\n",
173-
"# )"
173+
"# )\n",
174+
"\n",
175+
"# Compute derived connection strings for SQLAlchemy use vs. MLflow use.\n",
176+
"DBURI_DATA = f\"{CONNECTION_STRING}&schema=testdrive\"\n",
177+
"DBURI_MLFLOW = f\"{CONNECTION_STRING}&schema=mlflow\""
174178
]
175179
},
176180
{
@@ -239,21 +243,21 @@
239243
"data[\"total_sales\"] = data[\"unit_price\"] * data[\"quantity\"]\n",
240244
"data[\"date\"] = pd.to_datetime(data[\"date\"])\n",
241245
"\n",
242-
"# Insert the data into CrateDB\n",
243-
"engine = sa.create_engine(CONNECTION_STRING, echo=os.environ.get(\"DEBUG\"))\n",
246+
"# Connect to database.\n",
247+
"engine = sa.create_engine(DBURI_DATA, echo=bool(os.environ.get(\"DEBUG\")))\n",
244248
"\n",
245-
"with engine.connect() as conn:\n",
246-
" data.to_sql(\n",
247-
" \"sales_data_for_forecast\",\n",
248-
" conn,\n",
249-
" index=False,\n",
250-
" chunksize=1000,\n",
251-
" if_exists=\"replace\",\n",
252-
" )\n",
249+
"# Import data.\n",
250+
"data.to_sql(\n",
251+
" \"sales_data_for_forecast\",\n",
252+
" engine,\n",
253+
" index=False,\n",
254+
" chunksize=1000,\n",
255+
" if_exists=\"replace\",\n",
256+
")\n",
253257
"\n",
254-
" # Refresh table to make sure the data is available for querying - as CrateDB\n",
255-
" # is eventually consistent\n",
256-
" conn.execute(sa.text(\"REFRESH TABLE sales_data_for_forecast;\"))"
258+
"# CrateDB is eventually consistent, so synchronize write operations.\n",
259+
"with engine.connect() as conn:\n",
260+
" conn.execute(sa.text(\"REFRESH TABLE sales_data_for_forecast\"))"
257261
]
258262
},
259263
{
@@ -288,8 +292,8 @@
288292
"\n",
289293
"data[\"month\"] = pd.to_datetime(data['month'], unit='ms')\n",
290294
"\n",
291-
"# We set the MLFLOW_TRACKING_URI to our CrateDB instance. We'll see later why\n",
292-
"os.environ[\"MLFLOW_TRACKING_URI\"] = f\"{CONNECTION_STRING}&schema=mlflow\""
295+
"# Configure MLflow to use CrateDB.\n",
296+
"os.environ[\"MLFLOW_TRACKING_URI\"] = DBURI_MLFLOW"
293297
]
294298
},
295299
{
@@ -2122,9 +2126,8 @@
21222126
"metadata": {},
21232127
"outputs": [],
21242128
"source": [
2125-
"os.environ[\n",
2126-
" \"MLFLOW_TRACKING_URI\"\n",
2127-
"] = f\"{CONNECTION_STRING}&schema=mlflow\""
2129+
"# Configure MLflow to use CrateDB.\n",
2130+
"os.environ[\"MLFLOW_TRACKING_URI\"] = DBURI_MLFLOW"
21282131
]
21292132
},
21302133
{

topic/machine-learning/automl/automl_timeseries_forecasting_with_pycaret.py

Lines changed: 14 additions & 5 deletions
Original file line numberDiff line numberDiff line change
@@ -17,10 +17,19 @@
1717
if os.path.isfile(".env"):
1818
load_dotenv(".env", override=True)
1919

20-
# Configure database connection string.
21-
dburi = f"crate://{os.environ['CRATE_USER']}:{os.environ['CRATE_PASSWORD']}@{os.environ['CRATE_HOST']}:4200?ssl={os.environ['CRATE_SSL']}"
22-
engine = sa.create_engine(dburi, echo=os.environ.get("DEBUG"))
23-
os.environ["MLFLOW_TRACKING_URI"] = f"{dburi}&schema=mlflow"
20+
# Configure to connect to CrateDB server on localhost.
21+
CONNECTION_STRING = os.environ.get(
22+
"CRATEDB_CONNECTION_STRING",
23+
"crate://crate@localhost/?ssl=false",
24+
)
25+
26+
# Compute derived connection strings for SQLAlchemy use vs. MLflow use.
27+
DBURI_DATA = f"{CONNECTION_STRING}&schema=testdrive"
28+
DBURI_MLFLOW = f"{CONNECTION_STRING}&schema=mlflow"
29+
30+
# Propagate database connectivity settings.
31+
engine = sa.create_engine(DBURI_DATA, echo=bool(os.environ.get("DEBUG")))
32+
os.environ["MLFLOW_TRACKING_URI"] = DBURI_MLFLOW
2433

2534

2635
def prepare_data():
@@ -37,7 +46,7 @@ def prepare_data():
3746
data["date"] = pd.to_datetime(data["date"])
3847

3948
# Insert the data into CrateDB
40-
engine = sa.create_engine(dburi, echo=os.environ.get("DEBUG"))
49+
engine = sa.create_engine(DBURI_DATA, echo=bool(os.environ.get("DEBUG")))
4150

4251
with engine.connect() as conn:
4352
data.to_sql(
Lines changed: 1 addition & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -1,3 +1,4 @@
11
# Backlog
22

33
- Describe / program how to import `churn-dataset.csv`.
4+
- Format and lint notebooks using `black` and `ruff`.

topic/machine-learning/automl/pyproject.toml

Lines changed: 10 additions & 26 deletions
Original file line numberDiff line numberDiff line change
@@ -1,15 +1,11 @@
11
[tool.pytest.ini_options]
22
minversion = "2.0"
33
addopts = """
4-
-rfEX -p pytester --strict-markers --verbosity=3 --capture=no
4+
-rfEX -p pytester --strict-markers --verbosity=3
55
"""
66
# --cov=. --cov-report=term-missing --cov-report=xml
77
env = [
8-
"CRATEDB_CONNECTION_STRING=crate://crate@localhost/?schema=testdrive",
9-
"CRATE_USER=crate",
10-
"CRATE_PASSWORD=",
11-
"CRATE_HOST=localhost",
12-
"CRATE_SSL=false",
8+
"CRATEDB_CONNECTION_STRING=crate://crate@localhost/?ssl=false",
139
"PYDEVD_DISABLE_FILE_VALIDATION=1",
1410
]
1511

@@ -26,8 +22,8 @@ markers = [
2622
# pytest-notebook settings
2723
nb_test_files = true
2824
nb_coverage = false
29-
# 120 seconds is too less on CI/GHA
30-
nb_exec_timeout = 300
25+
# Default cell timeout is 120 seconds. For heavy computing, it needs to be increased.
26+
nb_exec_timeout = 240
3127
nb_diff_replace = [
3228
# Compensate output of `crash`.
3329
'"/cells/*/outputs/*/text" "\(\d.\d+ sec\)" "(0.000 sec)"',
@@ -47,24 +43,12 @@ nb_diff_ignore = [
4743
"/cells/*/outputs/*/metadata/nbreg",
4844
# Ignore images.
4945
"/cells/*/outputs/*/data/image/png",
50-
# FIXME: Those pacifiers should be revisited.
51-
# Some are warnings, some are semantic ambiguities.
52-
# Maybe they can be improved in one way or another,
53-
# for improved QA.
54-
"/cells/5/outputs",
55-
"/cells/14/outputs",
56-
"/cells/16/outputs",
57-
"/cells/16/outputs",
58-
"/cells/18/outputs",
59-
"/cells/22/outputs",
60-
"/cells/24/outputs",
61-
"/cells/30/outputs/0/data/application/vnd.jupyter.widget-view+json",
62-
"/cells/34/outputs",
63-
"/cells/36/outputs",
64-
"/cells/40/outputs",
65-
# automl_timeseries_forecasting_with_pycaret.ipynb
66-
"/cells/19/outputs",
67-
"/cells/33/outputs",
46+
# Ignore all cell output. It is too tedious to compare and maintain.
47+
# The validation hereby extends exclusively to the _execution_ of notebook cells,
48+
# able to catch syntax errors, module import flaws, and runtime errors.
49+
# However, the validation will not catch any regressions on actual cell output,
50+
# or whether any output is produced at all.
51+
"/cells/*/outputs",
6852
]
6953

7054
[tool.coverage.run]

topic/machine-learning/automl/test.py

Lines changed: 4 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -1,7 +1,7 @@
11
"""
22
## About
33
4-
Test cases for classification model examples with CrateDB, PyCaret and MLflow.
4+
Test cases for classification and forecasting examples with CrateDB, PyCaret, and MLflow.
55
66
77
## Synopsis
@@ -17,6 +17,7 @@
1717
pytest -k notebook
1818
```
1919
"""
20+
import os
2021
from pathlib import Path
2122

2223
import pytest
@@ -32,7 +33,8 @@ def cratedb() -> DatabaseAdapter:
3233
"""
3334
Provide test cases with a connection to CrateDB, with additional tooling.
3435
"""
35-
return DatabaseAdapter(dburi="crate://crate@localhost:4200")
36+
dburi = os.environ.get("CRATEDB_CONNECTION_STRING")
37+
return DatabaseAdapter(dburi=f"{dburi}&schema=testdrive")
3638

3739

3840
@pytest.fixture(scope="function", autouse=True)

0 commit comments

Comments
 (0)