| 
 | 1 | +---  | 
 | 2 | +file_format: mystnb  | 
 | 3 | +jupytext:  | 
 | 4 | +  formats: md:myst  | 
 | 5 | +  text_representation:  | 
 | 6 | +    extension: .md  | 
 | 7 | +    format_name: myst  | 
 | 8 | +kernelspec:  | 
 | 9 | +  display_name: geoutils-env  | 
 | 10 | +  language: python  | 
 | 11 | +  name: geoutils  | 
 | 12 | +---  | 
 | 13 | +(multiprocessing)=  | 
 | 14 | + | 
 | 15 | +# Multiprocessing  | 
 | 16 | + | 
 | 17 | +## Overview  | 
 | 18 | + | 
 | 19 | +Processing large raster datasets can be **computationally expensive and memory-intensive**. To optimize performance and enable **out-of-memory processing**, GeoUtils provides **multiprocessing utilities** that allow users to process raster data in parallel by splitting it into tiles.  | 
 | 20 | + | 
 | 21 | +GeoUtils offers two functions for out-of-memory multiprocessing:  | 
 | 22 | + | 
 | 23 | +- {func}`~geoutils.raster.distributed_computing.map_overlap_multiproc_save`: Applies a function to raster tiles and **saves the output** as a {class}`geoutils.Raster`.  | 
 | 24 | +- {func}`~geoutils.raster.distributed_computing.map_multiproc_collect`: Applies a function and **collects extracted data** from raster tiles into a list.  | 
 | 25 | + | 
 | 26 | +Both functions require a **multiprocessing configuration** defined with {class}`~geoutils.raster.distributed_computing.MultiprocConfig`.  | 
 | 27 | + | 
 | 28 | +---  | 
 | 29 | + | 
 | 30 | +## Using {class}`~geoutils.raster.distributed_computing.MultiprocConfig`  | 
 | 31 | + | 
 | 32 | +{class}`~geoutils.raster.distributed_computing.MultiprocConfig` defines tiling and processing settings, such as chunk size, output file, and computing cluster. It ensures that computations are performed **without loading the entire raster into memory**.  | 
 | 33 | + | 
 | 34 | +### Example: creating a {class}`~geoutils.raster.distributed_computing.MultiprocConfig` object  | 
 | 35 | +```{code-cell} ipython3  | 
 | 36 | +from geoutils.raster.distributed_computing import ClusterGenerator  | 
 | 37 | +from geoutils.raster.distributed_computing import MultiprocConfig  | 
 | 38 | + | 
 | 39 | +# Create a configuration without a multiprocessing cluster (tasks are processed sequentially)  | 
 | 40 | +config_basic = MultiprocConfig(chunk_size=200, outfile="output.tif", cluster=None)  | 
 | 41 | + | 
 | 42 | +# Create a configuration with a multiprocessing cluster  | 
 | 43 | +config_np = config_basic.copy()  | 
 | 44 | +config_np.cluster = ClusterGenerator("multi", nb_workers=4)  | 
 | 45 | +```  | 
 | 46 | +- **`chunk_size=200`**: The raster is divided into 200x200 pixel tiles.  | 
 | 47 | +- **`outfile="output.tif"`**: Path of the output file. Required when saving results.  | 
 | 48 | +- **`cluster=ClusterGenerator("multi", nb_workers=4)`**: Enables parallel processing with 4 workers.  | 
 | 49 | + | 
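To make the effect of `chunk_size` concrete, here is a minimal sketch of how a raster grid could be cut into square tiles. It is not GeoUtils' implementation: the helper `iter_tiles`, the bound ordering, and the choice to truncate edge tiles at the raster border are all assumptions made for illustration.

```python
from typing import Iterator


def iter_tiles(n_rows: int, n_cols: int, chunk_size: int) -> Iterator[tuple[int, int, int, int]]:
    """Yield hypothetical (row_min, row_max, col_min, col_max) bounds; upper bounds are exclusive."""
    for row_min in range(0, n_rows, chunk_size):
        # Assumption: tiles on the last row/column are truncated at the raster edge
        row_max = min(row_min + chunk_size, n_rows)
        for col_min in range(0, n_cols, chunk_size):
            col_max = min(col_min + chunk_size, n_cols)
            yield (row_min, row_max, col_min, col_max)


# A 500 x 450 pixel grid with chunk_size=200 gives a 3 x 3 layout of tiles
tiles = list(iter_tiles(n_rows=500, n_cols=450, chunk_size=200))
print(len(tiles))
print(tiles[0], tiles[-1])
```

With this layout, only one tile's worth of pixels needs to be held in memory per task, which is the idea behind processing a raster chunk by chunk.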
 | 50 | +---  | 
 | 51 | + | 
 | 52 | +## {func}`~geoutils.raster.distributed_computing.map_overlap_multiproc_save`: process and save large rasters  | 
 | 53 | + | 
 | 54 | +This function applies a user-defined function to raster tiles and **saves the output** to a file. The entire raster is **never loaded into memory at once**, making it suitable for processing large datasets.  | 
 | 55 | + | 
 | 56 | +### When to use  | 
 | 57 | +- When the function **returns a Raster**.  | 
 | 58 | +- When the result should be **saved as a new raster**.  | 
 | 59 | +- When working with large rasters that do not fit into memory.  | 
 | 60 | + | 
 | 61 | +### Example: applying a raster filter  | 
 | 62 | +```{code-cell} ipython3  | 
 | 63 | +import geoutils as gu  | 
 | 64 | +import scipy  | 
 | 65 | +import numpy as np  | 
 | 66 | +from geoutils.raster import RasterType  | 
 | 67 | +from geoutils.raster.distributed_computing import map_overlap_multiproc_save  | 
 | 68 | +
  | 
 | 69 | +filename_rast = gu.examples.get_path("exploradores_aster_dem")  | 
 | 70 | +
  | 
 | 71 | +def filter(raster: RasterType, size: int) -> RasterType:  | 
 | 72 | +    new_data = scipy.ndimage.maximum_filter(raster.data, size)  | 
 | 73 | +    if raster.nodata is not None:  | 
 | 74 | +        new_data = np.ma.masked_equal(new_data, raster.nodata)  | 
 | 75 | +    raster.data = new_data  | 
 | 76 | +    return raster  | 
 | 77 | +
  | 
 | 78 | +size = 1  | 
 | 79 | +map_overlap_multiproc_save(filter, filename_rast, config_basic, size, depth=size+1)  | 
 | 80 | +```  | 
 | 81 | + | 
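The `depth` argument in the call above adds an overlap around each tile. The pure-Python sketch below illustrates, in one dimension, why a neighbourhood filter needs such an overlap. The helpers `max_filter_1d` and `filter_by_tiles` are hypothetical and only mimic the idea; they do not reproduce how GeoUtils handles tiles internally.

```python
def max_filter_1d(values: list[float], size: int) -> list[float]:
    """Moving maximum over a centred window of `size` samples (odd `size`), truncated at the ends."""
    half = size // 2
    n = len(values)
    return [max(values[max(0, i - half): min(n, i + half + 1)]) for i in range(n)]


def filter_by_tiles(values: list[float], size: int, chunk_size: int, depth: int) -> list[float]:
    """Filter each tile separately, reading `depth` extra samples on each side of the tile."""
    n = len(values)
    out: list[float] = []
    for start in range(0, n, chunk_size):
        stop = min(start + chunk_size, n)
        lo, hi = max(0, start - depth), min(n, stop + depth)
        filtered = max_filter_1d(values[lo:hi], size)
        # Keep only the samples that belong to the tile itself, dropping the overlap
        out.extend(filtered[start - lo: stop - lo])
    return out


signal = [1, 9, 2, 3, 4, 8, 5, 0, 7, 6]
reference = max_filter_1d(signal, size=3)  # filter applied to the whole signal at once
no_overlap = filter_by_tiles(signal, size=3, chunk_size=5, depth=0)
with_overlap = filter_by_tiles(signal, size=3, chunk_size=5, depth=1)

print(no_overlap == reference)
print(with_overlap == reference)
```

In this sketch, the tiled result matches the full-signal result only when the overlap covers at least half the filter window (`depth >= size // 2`); without it, values next to a tile boundary are computed from an incomplete neighbourhood.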
 | 82 | +```{code-cell} ipython3  | 
 | 83 | +:tags: [remove-cell]  | 
 | 84 | +import os  | 
 | 85 | +os.remove(config_basic.outfile)  | 
 | 86 | +```  | 
 | 87 | + | 
 | 88 | +---  | 
 | 89 | + | 
 | 90 | +## {func}`~geoutils.raster.distributed_computing.map_multiproc_collect`: extract and collect data from large rasters  | 
 | 91 | + | 
 | 92 | +This function applies a function to raster tiles and **returns a list** of extracted data, without saving a new raster file. The process runs in **out-of-memory mode**, ensuring efficient handling of large datasets.  | 
 | 93 | + | 
 | 94 | +### When to use  | 
 | 95 | +- When the function **does not return a Raster**.  | 
 | 96 | +- When extracting **summary statistics, features, or analysis results**.  | 
 | 97 | +- When processing large rasters that cannot fit into memory.  | 
 | 98 | + | 
 | 99 | +### Example: extracting elevation statistics  | 
 | 100 | +```{code-cell} ipython3  | 
 | 101 | +from geoutils.raster.distributed_computing import map_multiproc_collect  | 
 | 102 | +from typing import Any  | 
 | 103 | +
  | 
 | 104 | +# Compute mean  | 
 | 105 | +
  | 
 | 106 | +def compute_statistics(raster: gu.Raster) -> dict[str, np.floating[Any]]:  | 
 | 107 | +    return raster.get_stats(stats_name=["mean", "valid_count"])  | 
 | 108 | +
  | 
 | 109 | +stats_results = map_multiproc_collect(compute_statistics, filename_rast, config_basic)  | 
 | 110 | +total_count = sum([stats["valid_count"] for stats in stats_results])  | 
 | 111 | +total_mean = sum([stats["mean"] * stats["valid_count"] for stats in stats_results]) / total_count  | 
 | 112 | +print("Mean: ", total_mean)  | 
 | 113 | +```  | 
 | 114 | + | 
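The example above weights each tile's mean by its valid-pixel count before dividing by the total count. The standalone sketch below shows why this weighting matters, using made-up per-tile values (not results from the example raster). The helper `combine_means` is hypothetical, and skipping empty tiles is an assumption about how a tile without valid pixels would best be handled.

```python
import math


def combine_means(tile_stats: list[dict[str, float]]) -> float:
    """Combine per-tile means into a global mean, weighting each tile by its valid-pixel count."""
    total_count = sum(stats["valid_count"] for stats in tile_stats)
    if total_count == 0:
        return math.nan
    # Skip empty tiles so that a NaN mean from a tile without valid pixels cannot propagate
    weighted_sum = sum(
        stats["mean"] * stats["valid_count"] for stats in tile_stats if stats["valid_count"] > 0
    )
    return weighted_sum / total_count


# Hypothetical per-tile statistics, for illustration only
tile_stats = [
    {"mean": 10.0, "valid_count": 100},
    {"mean": 20.0, "valid_count": 300},
    {"mean": 50.0, "valid_count": 100},
]

weighted_mean = combine_means(tile_stats)
naive_mean = sum(stats["mean"] for stats in tile_stats) / len(tile_stats)
print("Weighted mean:", weighted_mean)
print("Unweighted mean of tile means:", naive_mean)
```

The two values differ whenever tiles contain different numbers of valid pixels, for instance edge tiles or tiles with many nodata values, so the unweighted average of tile means is generally not the mean of the raster.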
 | 115 | +```{Note}  | 
 | 116 | +To include tile location (col_min, col_max, row_min, row_max) in the results, set `return_tile=True`.  | 
 | 117 | +```  | 
 | 118 | + | 
 | 119 | +---  | 
 | 120 | + | 
 | 121 | +## Choosing the right function  | 
 | 122 | + | 
 | 123 | +| Use case                                      | Function |  | 
 | 124 | +|-----------------------------------------------|---------------------------------------------------------------------------------------------------|  | 
 | 125 | +| Apply processing and save results as a raster | {func}`~geoutils.raster.distributed_computing.map_overlap_multiproc_save` |  | 
 | 126 | +| Extract statistics or features into a list    | {func}`~geoutils.raster.distributed_computing.map_multiproc_collect` |  | 
 | 127 | +| Track tile locations with extracted data      | {func}`~geoutils.raster.distributed_computing.map_multiproc_collect` with `return_tile=True` |  | 