Commit 1bbfddd

Generic Multiprocessing Functions for Raster Processing (#669)
1 parent d9957d4 commit 1bbfddd

4 files changed, +638 -0 lines changed

doc/source/multiprocessing.md

Lines changed: 127 additions & 0 deletions
@@ -0,0 +1,127 @@
---
file_format: mystnb
jupytext:
  formats: md:myst
  text_representation:
    extension: .md
    format_name: myst
kernelspec:
  display_name: geoutils-env
  language: python
  name: geoutils
---
(multiprocessing)=

# Multiprocessing

## Overview

Processing large raster datasets can be **computationally expensive and memory-intensive**. To optimize performance and enable **out-of-memory processing**, GeoUtils provides **multiprocessing utilities** that allow users to process raster data in parallel by splitting it into tiles.

GeoUtils offers two functions for out-of-memory multiprocessing:

- {func}`~geoutils.raster.distributed_computing.map_overlap_multiproc_save`: Applies a function to raster tiles and **saves the output** as a {class}`geoutils.Raster`.
- {func}`~geoutils.raster.distributed_computing.map_multiproc_collect`: Applies a function and **collects extracted data** from raster tiles into a list.

Both functions require a **multiprocessing configuration** defined with {class}`~geoutils.raster.distributed_computing.MultiprocConfig`.

---

## Using {class}`~geoutils.raster.distributed_computing.MultiprocConfig`

{class}`~geoutils.raster.distributed_computing.MultiprocConfig` defines tiling and processing settings, such as chunk size, output file, and computing cluster. It ensures that computations are performed **without loading the entire raster into memory**.

### Example: creating a {class}`~geoutils.raster.distributed_computing.MultiprocConfig` object
```{code-cell} ipython3
from geoutils.raster.distributed_computing import ClusterGenerator
from geoutils.raster.distributed_computing import MultiprocConfig

# Create a configuration without multiprocessing cluster (tasks will be processed sequentially)
config_basic = MultiprocConfig(chunk_size=200, outfile="output.tif", cluster=None)

# Create a configuration with a multiprocessing cluster
config_np = config_basic.copy()
config_np.cluster = ClusterGenerator("multi", nb_workers=4)
```
- **`chunk_size=200`**: The raster is divided into 200x200 pixel tiles.
- **`outfile="output.tif"`**: Required when saving results.
- **`cluster=ClusterGenerator("multi", nb_workers=4)`**: Enables parallel processing.
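
To make the effect of `chunk_size` concrete, the following is a minimal pure-Python sketch of how a raster of a given shape can be cut into square tiles. It is an illustration only and not GeoUtils' implementation: the helper `tile_bounds` and its `(row_min, row_max, col_min, col_max)` layout are hypothetical, and the library's actual tiling (edge handling, tile ordering) may differ. Edge tiles are simply clipped to the raster extent here.

```python
def tile_bounds(n_rows: int, n_cols: int, chunk_size: int) -> list[tuple[int, int, int, int]]:
    """Return (row_min, row_max, col_min, col_max) for each tile, with exclusive maxima.

    Hypothetical helper for illustration; edge tiles are clipped to the raster shape.
    """
    tiles = []
    for row_min in range(0, n_rows, chunk_size):
        for col_min in range(0, n_cols, chunk_size):
            row_max = min(row_min + chunk_size, n_rows)
            col_max = min(col_min + chunk_size, n_cols)
            tiles.append((row_min, row_max, col_min, col_max))
    return tiles


# A 500 x 450 pixel raster with chunk_size=200 gives a 3 x 3 grid of tiles
tiles = tile_bounds(500, 450, chunk_size=200)
print(len(tiles))  # 9
print(tiles[0])    # (0, 200, 0, 200)
print(tiles[-1])   # (400, 500, 400, 450)
```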

---

## {func}`~geoutils.raster.distributed_computing.map_overlap_multiproc_save`: process and save large rasters

This function applies a user-defined function to raster tiles and **saves the output** to a file. The entire raster is **never loaded into memory at once**, making it suitable for processing large datasets.

### When to use
- When the function **returns a Raster**.
- When the result should be **saved as a new raster**.
- When working with large rasters that do not fit into memory.

### Example: applying a raster filter
```{code-cell} ipython3
import geoutils as gu
import scipy.ndimage
import numpy as np
from geoutils.raster import RasterType
from geoutils.raster.distributed_computing import map_overlap_multiproc_save

filename_rast = gu.examples.get_path("exploradores_aster_dem")

def filter(raster: RasterType, size: int) -> RasterType:
    # Apply a maximum filter with a window of `size` pixels
    new_data = scipy.ndimage.maximum_filter(raster.data, size)
    # Mask nodata values in the filtered output
    if raster.nodata is not None:
        new_data = np.ma.masked_equal(new_data, raster.nodata)
    raster.data = new_data
    return raster

size = 1
# `size` is forwarded to `filter`; `depth` is the overlap added around each tile
map_overlap_multiproc_save(filter, filename_rast, config_basic, size, depth=size+1)
```

```{code-cell} ipython3
:tags: [remove-cell]
import os
os.remove(config_basic.outfile)
```
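
The role of the overlap can be illustrated with a small pure-Python sketch, assuming that `depth` is the number of pixels of context added around each tile before the function is applied. The helpers `max_filter_1d` and `filter_by_tiles` are hypothetical and work on a 1-D list rather than a raster; they are not part of GeoUtils. The sketch shows that a moving maximum computed tile by tile only matches the full-array result when each tile is padded by at least the filter radius and the padding is cropped afterwards.

```python
def max_filter_1d(values: list[float], radius: int) -> list[float]:
    """Moving maximum over a window of 2 * radius + 1 samples, clipped at the ends."""
    n = len(values)
    return [max(values[max(0, i - radius): min(n, i + radius + 1)]) for i in range(n)]


def filter_by_tiles(values: list[float], radius: int, chunk_size: int, depth: int) -> list[float]:
    """Filter each tile padded by `depth` samples, then keep only the tile's own samples."""
    n = len(values)
    out = []
    for start in range(0, n, chunk_size):
        stop = min(start + chunk_size, n)
        pad_start = max(0, start - depth)
        pad_stop = min(n, stop + depth)
        filtered = max_filter_1d(values[pad_start:pad_stop], radius)
        offset = start - pad_start  # position of the tile inside its padded window
        out.extend(filtered[offset: offset + (stop - start)])
    return out


signal = [3, 1, 4, 1, 5, 9, 2, 6, 5, 3, 5, 8]
reference = max_filter_1d(signal, radius=1)

no_overlap = filter_by_tiles(signal, radius=1, chunk_size=4, depth=0)
with_overlap = filter_by_tiles(signal, radius=1, chunk_size=4, depth=1)

print(no_overlap == reference)    # False: samples at tile edges miss their neighbours
print(with_overlap == reference)  # True: the overlap restores the missing context
```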

---

## {func}`~geoutils.raster.distributed_computing.map_multiproc_collect`: extract and collect data from large rasters

This function applies a function to raster tiles and **returns a list** of extracted data, without saving a new raster file. The process runs in **out-of-memory mode**, ensuring efficient handling of large datasets.

### When to use
- When the function **does not return a Raster**.
- When extracting **summary statistics, features, or analysis results**.
- When processing large rasters that cannot fit into memory.

### Example: extracting elevation statistics
```{code-cell} ipython3
from geoutils.raster.distributed_computing import map_multiproc_collect
from typing import Any

# Compute the mean and the valid pixel count of each tile

def compute_statistics(raster: gu.Raster) -> dict[str, np.floating[Any]]:
    return raster.get_stats(stats_name=["mean", "valid_count"])

stats_results = map_multiproc_collect(compute_statistics, filename_rast, config_basic)
# Combine the per-tile means into a global mean, weighting each tile by its valid pixel count
total_count = sum([stats["valid_count"] for stats in stats_results])
total_mean = sum([stats["mean"] * stats["valid_count"] for stats in stats_results]) / total_count
print("Mean: ", total_mean)
```
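
The weighting by `valid_count` in the example above matters because tiles can contain different numbers of valid pixels. The following standalone sketch uses made-up values (the three lists stand in for the valid pixels of three tiles, and the dictionaries only mimic the `"mean"` and `"valid_count"` keys used above) to show that the count-weighted combination reproduces the mean of all pixels, whereas a plain average of the per-tile means does not.

```python
from statistics import fmean

# Hypothetical valid pixel values of three tiles of unequal size
tiles = [[10.0, 12.0, 14.0], [20.0], [30.0, 34.0]]

# Per-tile summaries with the same keys as in the example above
stats_results = [{"mean": fmean(t), "valid_count": len(t)} for t in tiles]

total_count = sum(s["valid_count"] for s in stats_results)
weighted_mean = sum(s["mean"] * s["valid_count"] for s in stats_results) / total_count

naive_mean = fmean(s["mean"] for s in stats_results)  # ignores tile sizes
global_mean = fmean(v for t in tiles for v in t)      # mean over all pixels at once

print(weighted_mean, global_mean)  # 20.0 20.0
print(naive_mean)                  # about 21.33, biased towards the small tiles
```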

```{Note}
To include tile location (col_min, col_max, row_min, row_max) in the results, set `return_tile=True`.
```

---

## Choosing the right function

| Use case                                      | Function                                                                                     |
|-----------------------------------------------|----------------------------------------------------------------------------------------------|
| Apply processing and save results as a raster | {func}`~geoutils.raster.distributed_computing.map_overlap_multiproc_save`                    |
| Extract statistics or features into a list    | {func}`~geoutils.raster.distributed_computing.map_multiproc_collect`                         |
| Track tile locations with extracted data      | {func}`~geoutils.raster.distributed_computing.map_multiproc_collect` with `return_tile=True` |
Lines changed: 24 additions & 0 deletions
@@ -0,0 +1,24 @@
# Copyright (c) 2025 Centre National d'Etudes Spatiales (CNES)
#
# This file is part of the GeoUtils project:
# https://github.com/glaciohack/geoutils
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
#
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

from geoutils.raster.distributed_computing.cluster import *  # noqa
from geoutils.raster.distributed_computing.multiproc import (  # noqa
    MultiprocConfig,
    map_multiproc_collect,
    map_overlap_multiproc_save,
)
