
Commit e291608

feat: add new custom_sql_filter parameter (#180)
* feat: add new custom_sql_filter parameter
* fix: add None check for custom filters
* fix: change file hash generation
* chore: add tests for custom sql filtering
* fix: add missing custom sql filters to a prefiltering step
* chore: add new test scenario
* chore: update progress bar logic during multiprocessing startup
* feat: add custom sql filter example notebook
* chore: add changelog entry
1 parent 8b25fbc commit e291608

8 files changed: +363 -24 lines changed


CHANGELOG.md

Lines changed: 8 additions & 0 deletions
@@ -7,6 +7,14 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
 
 ## [Unreleased]
 
+### Added
+
+- Option to pass custom SQL filters with `custom_sql_filter` (and `--custom-sql-filter`) parameter [#67](https://github.com/kraina-ai/quackosm/issues/67)
+
+### Fixed
+
+- Delayed progress bar appearing during nodes intersection step
+
 ## [0.11.4] - 2024-10-28
 
 ### Changed
Lines changed: 202 additions & 0 deletions
@@ -0,0 +1,202 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# Custom SQL filter\n",
+    "\n",
+    "**QuackOSM** enables advanced users to filter data using SQL filters that will be used by DuckDB during processing.\n",
+    "\n",
+    "The filter is applied alongside [OSM tags filters](../osm_tags_filter/) and feature ID filters.\n",
+    "\n",
+    "The SQL filter clause can be passed both in the Python API (as the `custom_sql_filter` parameter) and in the CLI (as the `--custom-sql-filter` option).\n",
+    "\n",
+    "Two columns are available to users: `id` (type `BIGINT`) and `tags` (type `MAP(VARCHAR, VARCHAR)`).\n",
+    "\n",
+    "You can look up the available functions in the [DuckDB documentation](https://duckdb.org/docs/sql/functions/overview).\n",
+    "\n",
+    "Below are a few examples of how to use the custom SQL filters."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Features with exactly 10 tags\n",
+    "\n",
+    "Here we will use the `cardinality` function dedicated to the `MAP` type.\n",
+    "\n",
+    "More `MAP` functions are available [here](https://duckdb.org/docs/sql/functions/map)."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import quackosm as qosm\n",
+    "\n",
+    "data = qosm.convert_geometry_to_geodataframe(\n",
+    "    geometry_filter=qosm.geocode_to_geometry(\"Greater London\"),\n",
+    "    osm_extract_source=\"Geofabrik\",\n",
+    "    custom_sql_filter=\"cardinality(tags) = 10\",\n",
+    ")\n",
+    "data[\"tags\"].head(10).values"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "print(\"All have exactly 10 tags:\", (data[\"tags\"].str.len() == 10).all())"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Features with ID divisible by 13 and starting with the digit 6\n",
+    "\n",
+    "Here we will operate on the `id` column.\n",
+    "\n",
+    "More `NUMERIC` functions are available [here](https://duckdb.org/docs/sql/functions/numeric).\n",
+    "\n",
+    "More `STRING` functions are available [here](https://duckdb.org/docs/sql/functions/char)."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "data = qosm.convert_geometry_to_geodataframe(\n",
+    "    geometry_filter=qosm.geocode_to_geometry(\"Greater London\"),\n",
+    "    osm_extract_source=\"Geofabrik\",\n",
+    "    custom_sql_filter=\"id % 13 = 0 AND starts_with(id::STRING, '6')\",\n",
+    ")\n",
+    "data"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "print(\"All starting with digit 6:\", data.index.map(lambda x: x.split(\"/\")[1].startswith(\"6\")).all())\n",
+    "print(\"All divisible by 13:\", data.index.map(lambda x: (int(x.split(\"/\")[1]) % 13) == 0).all())"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Find features that have all selected tags present\n",
+    "\n",
+    "When using `osm_tags_filter` with value `{ \"building\": True, \"historic\": True, \"name\": True }`, the result will contain every feature that has at least one of those tags.\n",
+    "\n",
+    "Positive tags filters are combined using an `OR` operator. You can read more about it [here](../osm_tags_filter/).\n",
+    "\n",
+    "To combine filters with an `AND` operator, use the `custom_sql_filter` parameter.\n",
+    "\n",
+    "To match a list of keys against given values we have to use list-related functions.\n",
+    "\n",
+    "More `LIST` functions are available [here](https://duckdb.org/docs/sql/functions/list)."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "data = qosm.convert_geometry_to_geodataframe(\n",
+    "    geometry_filter=qosm.geocode_to_geometry(\"Greater London\"),\n",
+    "    osm_extract_source=\"Geofabrik\",\n",
+    "    custom_sql_filter=\"list_has_all(map_keys(tags), ['building', 'historic', 'name'])\",\n",
+    ")\n",
+    "data"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "tags_names = [\"name\", \"building\", \"historic\"]\n",
+    "for tag_name in tags_names:\n",
+    "    data[tag_name] = data[\"tags\"].apply(lambda x, tag_name=tag_name: x.get(tag_name))\n",
+    "data[[*tags_names, \"geometry\"]].explore(tiles=\"CartoDB DarkMatter\", color=\"orange\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Regex search to find streets starting with the word New or Old\n",
+    "\n",
+    "*(If you really need to)* you can use regular expressions on a tag value (or key) to find specific features.\n",
+    "\n",
+    "More `REGEX` functions are available [here](https://duckdb.org/docs/sql/functions/regular_expressions)."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "data = qosm.convert_geometry_to_geodataframe(\n",
+    "    geometry_filter=qosm.geocode_to_geometry(\"Greater London\"),\n",
+    "    osm_extract_source=\"Geofabrik\",\n",
+    "    custom_sql_filter=\"\"\"\n",
+    "        list_has_all(map_keys(tags), ['highway', 'name'])\n",
+    "        AND regexp_matches(tags['name'][1], '^(New|Old)\\s\\w+')\n",
+    "    \"\"\",\n",
+    ")\n",
+    "data"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "ways_only = data[data.index.str.startswith(\"way/\")]\n",
+    "ways_only[\"name\"] = ways_only[\"tags\"].apply(lambda x: x[\"name\"])\n",
+    "ways_only[\"prefix\"] = ways_only[\"name\"].apply(lambda x: x.split()[0])\n",
+    "ways_only[[\"name\", \"prefix\", \"geometry\"]].explore(\n",
+    "    tiles=\"CartoDB DarkMatter\", column=\"prefix\", cmap=[\"orange\", \"royalblue\"]\n",
+    ")"
+   ]
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": ".venv",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.10.12"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 2
+}
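
For readers less familiar with DuckDB's `MAP` and list functions, the four filters in the notebook above can be read as plain predicates over `id` and `tags`. The sketch below is only a pure-Python analogy, not how QuackOSM evaluates the filters: the sample features and all function names are hypothetical, and `re.ASCII` is used on the assumption that DuckDB's regex engine treats `\w` as ASCII-only.

```python
import re

# Hypothetical in-memory stand-ins for OSM features: `id` is an int and
# `tags` is a dict, mirroring the BIGINT and MAP(VARCHAR, VARCHAR) columns.
FEATURES = [
    {"id": 65, "tags": {"highway": "residential", "name": "New Road"}},
    {"id": 26, "tags": {"building": "yes", "historic": "castle", "name": "Old Keep"}},
    {"id": 611, "tags": {"highway": "primary", "name": "Newton Street"}},
]

# Anchored at the start, so searching is enough (no full match required).
NAME_PATTERN = re.compile(r"^(New|Old)\s\w+", re.ASCII)


def has_n_tags(feature, n):
    """Analogue of `cardinality(tags) = n`."""
    return len(feature["tags"]) == n


def id_divisible_and_prefixed(feature, divisor=13, prefix="6"):
    """Analogue of `id % 13 = 0 AND starts_with(id::STRING, '6')`."""
    return feature["id"] % divisor == 0 and str(feature["id"]).startswith(prefix)


def has_any_key(feature, keys):
    """OR semantics, as described for positive `osm_tags_filter` entries."""
    return any(key in feature["tags"] for key in keys)


def has_all_keys(feature, keys):
    """AND semantics, analogue of `list_has_all(map_keys(tags), [...])`."""
    return all(key in feature["tags"] for key in keys)


def is_new_or_old_street(feature):
    """Analogue of the combined `list_has_all` + `regexp_matches` filter."""
    if not has_all_keys(feature, ["highway", "name"]):
        return False
    return NAME_PATTERN.search(feature["tags"]["name"]) is not None


def select_ids(predicate):
    """Return the ids of the sample features accepted by `predicate`."""
    return [feature["id"] for feature in FEATURES if predicate(feature)]
```

Calling `select_ids(is_new_or_old_street)` on the sample data keeps only the feature named "New Road": "Newton Street" fails the regex because `New` is not followed by whitespace, and "Old Keep" has no `highway` tag.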

quackosm/_parquet_multiprocessing.py

Lines changed: 9 additions & 7 deletions
@@ -100,15 +100,20 @@ def map_parquet_dataset(
         progress_bar (Optional[TaskProgressBar]): Progress bar to show task status.
             Defaults to `None`.
     """
-    queue: Queue[tuple[str, int]] = ctx.Manager().Queue()
-
     dataset = pq.ParquetDataset(dataset_path)
 
+    tuples_to_queue = []
     for pq_file in dataset.files:
         for row_group in range(pq.ParquetFile(pq_file).num_row_groups):
-            queue.put((pq_file, row_group))
+            tuples_to_queue.append((pq_file, row_group))
 
-    total = queue.qsize()
+    total = len(tuples_to_queue)
+    if progress_bar:  # pragma: no cover
+        progress_bar.create_manual_bar(total=total)
+
+    queue: Queue[tuple[str, int]] = ctx.Manager().Queue()
+    for queue_tuple in tuples_to_queue:
+        queue.put(queue_tuple)
 
     destination_path.mkdir(parents=True, exist_ok=True)
 
@@ -137,9 +142,6 @@ def _run_processes(
                 break
             p.start()
 
-        if progress_bar:  # pragma: no cover
-            progress_bar.create_manual_bar(total=total)
-
         sleep_time = 0.1
         while any(process.is_alive() for process in processes):
             if any(p.exception for p in processes):  # pragma: no cover
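
The reordering in this file can be summarised as: collect the work items in a plain list, show the progress bar as soon as the total is known, and only then create and fill the shared queue. The sketch below illustrates that ordering under stated assumptions: `RecordingProgressBar`, `make_queue`, and `prepare_work` are hypothetical names, and a standard `queue.Queue` stands in for `ctx.Manager().Queue()`, whose startup cost is presumably what delayed the bar.

```python
import queue


class RecordingProgressBar:
    """Hypothetical stand-in for a progress bar; records when it is created."""

    def __init__(self, events):
        self.events = events

    def create_manual_bar(self, total):
        self.events.append(("bar_created", total))


def make_queue(events):
    """Stand-in for the (assumed slow) `ctx.Manager().Queue()` call."""
    events.append(("queue_created", None))
    return queue.Queue()


def prepare_work(row_groups_per_file, progress_bar, events):
    # 1. Collect (file, row_group) pairs in a plain list; no queue needed yet.
    work_items = [
        (path, row_group)
        for path, num_row_groups in row_groups_per_file.items()
        for row_group in range(num_row_groups)
    ]

    # 2. The total is known already, so the bar can appear immediately.
    total = len(work_items)
    if progress_bar:
        progress_bar.create_manual_bar(total=total)

    # 3. Only now create the shared queue and fill it from the list.
    work_queue = make_queue(events)
    for item in work_items:
        work_queue.put(item)

    return work_queue, total
```

Inspecting the `events` list after a call shows the bar being created before the queue, which is the behaviour the changelog entry about the delayed progress bar refers to.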

quackosm/cli.py

Lines changed: 18 additions & 0 deletions
@@ -457,6 +457,18 @@ def main(
             show_default=False,
         ),
     ] = None,
+    custom_sql_filter: Annotated[
+        Optional[str],
+        typer.Option(
+            help=(
+                "Allows users to pass custom SQL conditions used to filter OSM features. "
+                "It will be embedded into predefined queries and requires DuckDB syntax to operate "
+                "on tags map object."
+            ),
+            case_sensitive=False,
+            show_default=False,
+        ),
+    ] = None,
     osm_extract_query: Annotated[
         Optional[str],
         typer.Option(
@@ -750,6 +762,7 @@ def main(
                     else None
                 ),
                 filter_osm_ids=filter_osm_ids,  # type: ignore
+                custom_sql_filter=custom_sql_filter,
                 save_as_wkt=wkt_result,
                 verbosity_mode=verbosity_mode,
             )
@@ -771,6 +784,7 @@ def main(
                     else None
                 ),
                 filter_osm_ids=filter_osm_ids,  # type: ignore
+                custom_sql_filter=custom_sql_filter,
                 duckdb_table_name=duckdb_table_name or "quackosm",
                 verbosity_mode=verbosity_mode,
             )
@@ -795,6 +809,7 @@ def main(
                     else None
                 ),
                 filter_osm_ids=filter_osm_ids,  # type: ignore
+                custom_sql_filter=custom_sql_filter,
                 save_as_wkt=wkt_result,
                 verbosity_mode=verbosity_mode,
             )
@@ -825,6 +840,7 @@ def main(
                     else None
                 ),
                 filter_osm_ids=filter_osm_ids,  # type: ignore
+                custom_sql_filter=custom_sql_filter,
                 duckdb_table_name=duckdb_table_name or "quackosm",
                 save_as_wkt=wkt_result,
                 verbosity_mode=verbosity_mode,
@@ -853,6 +869,7 @@ def main(
                     else None
                 ),
                 filter_osm_ids=filter_osm_ids,  # type: ignore
+                custom_sql_filter=custom_sql_filter,
                 save_as_wkt=wkt_result,
                 verbosity_mode=verbosity_mode,
                 geometry_coverage_iou_threshold=geometry_coverage_iou_threshold,
@@ -876,6 +893,7 @@ def main(
                     else None
                 ),
                 filter_osm_ids=filter_osm_ids,  # type: ignore
+                custom_sql_filter=custom_sql_filter,
                 duckdb_table_name=duckdb_table_name or "quackosm",
                 save_as_wkt=wkt_result,
                 verbosity_mode=verbosity_mode,
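
The option's help text says the condition "will be embedded into predefined queries", and the commit message mentions a `None` check. One way such embedding could look is sketched below. This is an illustration only, not QuackOSM's actual query builder: `build_where_clause` and the `kind = 'way'` condition are hypothetical.

```python
from typing import Optional, Sequence


def build_where_clause(
    predefined_conditions: Sequence[str],
    custom_sql_filter: Optional[str] = None,
) -> str:
    """Join predefined conditions and an optional custom filter with AND."""
    conditions = [cond.strip() for cond in predefined_conditions if cond.strip()]

    # None check: an absent or blank custom filter leaves the query unchanged.
    if custom_sql_filter is not None and custom_sql_filter.strip():
        conditions.append(custom_sql_filter.strip())

    if not conditions:
        # Nothing to filter on, so accept every row.
        return "TRUE"

    # Parenthesise each part so an OR inside the custom filter cannot
    # change how the predefined conditions are grouped.
    return " AND ".join(f"({cond})" for cond in conditions)
```

For example, `build_where_clause(["kind = 'way'"], "cardinality(tags) = 10")` yields `(kind = 'way') AND (cardinality(tags) = 10)`. Because the filter text is spliced in verbatim in this sketch, it should only ever come from a trusted user.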
