Skip to content

Commit 0c18d9e

Browse files
authored
Add percentage valid points in get_stats() (#644)
1 parent a7ad7c2 commit 0c18d9e

File tree

7 files changed

+208
-66
lines changed

7 files changed

+208
-66
lines changed

doc/source/api.md

Lines changed: 9 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -107,6 +107,15 @@ documentation.
107107
Raster.plot
108108
```
109109

110+
### Get statistics
111+
112+
```{eval-rst}
113+
.. autosummary::
114+
:toctree: gen_modules/
115+
116+
Raster.get_stats
117+
```
118+
110119
### Get or update data methods
111120

112121
```{eval-rst}

doc/source/raster_class.md

Lines changed: 36 additions & 5 deletions
Original file line numberDiff line numberDiff line change
@@ -260,14 +260,14 @@ The {func}`~geoutils.Raster.icrop` function accepts only a bounding box in pixel
260260
By default, {func}`~geoutils.Raster.crop` and {func}`~geoutils.Raster.icrop` return a new Raster unless the inplace parameter is set to True, in which case the cropping operation is performed directly on the original raster object.
261261
For more details, see the {ref}`specific section and function descriptions in the API<api-geo-handle>`.
262262

263-
### Example for `crop`
263+
### Example for {func}`~geoutils.Raster.crop`
264264
```{code-cell} ipython3
265265
# Crop raster to smaller bounds
266266
rast_crop = rast.crop(bbox=(0.3, 0.3, 1, 1))
267267
print(rast_crop.bounds)
268268
```
269269

270-
### Example for `icrop`
270+
### Example for {func}`~geoutils.Raster.icrop`
271271
```{code-cell} ipython3
272272
# Crop raster using pixel coordinates
273273
rast_icrop = rast.icrop(bbox=(2, 2, 6, 6))
@@ -360,8 +360,31 @@ rast_reproj.to_xarray()
360360
```
361361

362362
## Obtain Statistics
363-
The `get_stats()` method allows to extract key statistical information from a raster in a dictionary.
364-
Supported statistics are : mean, median, max, mean, sum, sum of squares, 90th percentile, nmad, rmse, std.
363+
The {func}`~geoutils.Raster.get_stats` method allows to extract key statistical information from a raster in a dictionary.
364+
Supported statistics are :
365+
- **Mean:** arithmetic mean of the data, ignoring masked values.
366+
- **Median:** middle value when the valid data points are sorted in increasing order, ignoring masked values.
367+
- **Max:** maximum value among the data, ignoring masked values.
368+
- **Min:** minimum value among the data, ignoring masked values.
369+
- **Sum:** sum of all data, ignoring masked values.
370+
- **Sum of squares:** sum of the squares of all data, ignoring masked values.
371+
- **90th percentile:** point below which 90% of the data falls, ignoring masked values.
372+
- **LE90 (Linear Error with 90% confidence):** Difference between the 95th and 5th percentiles of a dataset, representing the range within which 90% of the data points lie. Ignore masked values.
373+
- **NMAD (Normalized Median Absolute Deviation):** robust measure of variability in the data, less sensitive to outliers compared to standard deviation. Ignore masked values.
374+
- **RMSE (Root Mean Square Error):** commonly used to express the magnitude of errors or variability and can give insight into the spread of the data. Only relevant when the raster represents a difference of two objects. Ignore masked values.
375+
- **Std (Standard deviation):** measures the spread or dispersion of the data around the mean, ignoring masked values.
376+
- **Valid count:** number of finite data points in the array. It counts the non-masked elements.
377+
- **Total count:** total size of the raster.
378+
- **Percentage valid points:** ratio between **Valid count** and **Total count**.
379+
380+
381+
If an inlier mask is passed:
382+
- **Total inlier count:** number of data points in the inlier mask.
383+
- **Valid inlier count:** number of unmasked data points in the array after applying the inlier mask.
384+
- **Percentage inlier points:** ratio between **Valid inlier count** and **Valid count**. Useful for classification statistics.
385+
- **Percentage valid inlier points:** ratio between **Valid inlier count** and **Total inlier count**.
386+
387+
365388
Callable functions are supported as well.
366389

367390
### Usage Examples:
@@ -377,7 +400,7 @@ rast.get_stats("mean")
377400

378401
- Get multiple statistics in a dict:
379402
```{code-cell} ipython3
380-
rast.get_stats(["mean", "max", "rmse"])
403+
rast.get_stats(["mean", "max", "std"])
381404
```
382405

383406
- Using a custom callable statistic:
@@ -386,3 +409,11 @@ def custom_stat(data):
386409
return np.nansum(data > 100) # Count the number of pixels above 100
387410
rast.get_stats(custom_stat)
388411
```
412+
413+
- Passing an inlier mask:
414+
```{code-cell} ipython3
415+
inlier_mask = np.array([[False, False , True],
416+
[False, True , False],
417+
[True, False , True]])
418+
rast.get_stats(inlier_mask=inlier_mask)
419+
```

geoutils/raster/raster.py

Lines changed: 73 additions & 50 deletions
Original file line numberDiff line numberDiff line change
@@ -89,7 +89,7 @@
8989
decode_sensor_metadata,
9090
parse_and_convert_metadata_from_filename,
9191
)
92-
from geoutils.stats import nmad
92+
from geoutils.stats import linear_error, nmad
9393
from geoutils.vector.vector import Vector
9494

9595
# If python38 or above, Literal is builtin. Otherwise, use typing_extensions
@@ -1890,102 +1890,114 @@ def set_mask(self, mask: NDArrayBool | Mask) -> None:
18901890
else:
18911891
self.data[mask_arr > 0] = np.ma.masked
18921892

1893-
def _statistics(self, band: int = 1) -> dict[str, np.floating[Any]]:
1893+
def _statistics(self, band: int = 1, counts: tuple[int, int] | None = None) -> dict[str, np.floating[Any]]:
18941894
"""
18951895
Calculate common statistics for a specified band in the raster.
18961896
18971897
:param band: The index of the band for which to compute statistics. Default is 1.
1898+
:param counts: (number of finite data points in the array, number of valid points in inlier_mask).
18981899
1899-
:returns: A dictionary containing the calculated statistics for the selected band, including mean, median, max,
1900-
min, sum, sum of squares, 90th percentile, NMAD, RMSE, and standard deviation.
1900+
:returns: A dictionary containing the calculated statistics for the selected band.
19011901
"""
1902+
19021903
if self.count == 1:
19031904
data = self.data
19041905
else:
19051906
data = self.data[band - 1]
19061907

1907-
# If data is a MaskedArray, use the compressed version (without masked values)
1908-
if isinstance(data, np.ma.MaskedArray):
1909-
data = data.compressed()
1910-
19111908
# Compute the statistics
1909+
mdata = np.ma.filled(data.astype(float), np.nan)
1910+
valid_count = np.count_nonzero(~self.get_mask()) if counts is None else counts[0]
19121911
stats_dict = {
1913-
"Mean": np.nanmean(data),
1914-
"Median": np.nanmedian(data),
1915-
"Max": np.nanmax(data),
1916-
"Min": np.nanmin(data),
1917-
"Sum": np.nansum(data),
1918-
"Sum of squares": np.nansum(np.square(data)),
1919-
"90th percentile": np.nanpercentile(data, 90),
1912+
"Mean": np.ma.mean(data),
1913+
"Median": np.ma.median(data),
1914+
"Max": np.ma.max(data),
1915+
"Min": np.ma.min(data),
1916+
"Sum": np.ma.sum(data),
1917+
"Sum of squares": np.ma.sum(np.square(data)),
1918+
"90th percentile": np.nanpercentile(mdata, 90),
1919+
"LE90": linear_error(mdata, interval=90),
19201920
"NMAD": nmad(data),
1921-
"RMSE": np.sqrt(np.nanmean(np.square(data - np.nanmean(data)))),
1922-
"Standard deviation": np.nanstd(data),
1921+
"RMSE": np.sqrt(np.ma.mean(np.square(data))),
1922+
"Standard deviation": np.ma.std(data),
1923+
"Valid count": valid_count,
1924+
"Total count": data.size,
1925+
"Percentage valid points": (valid_count / data.size) * 100,
19231926
}
1927+
1928+
if counts is not None:
1929+
valid_inlier_count = np.count_nonzero(~self.get_mask())
1930+
stats_dict.update(
1931+
{
1932+
"Valid inlier count": valid_inlier_count,
1933+
"Total inlier count": counts[1],
1934+
"Percentage inlier points": (valid_inlier_count / counts[0]) * 100,
1935+
"Percentage valid inlier points": (valid_inlier_count / counts[1]) * 100,
1936+
}
1937+
)
1938+
19241939
return stats_dict
19251940

19261941
@overload
19271942
def get_stats(
19281943
self,
1929-
stats_name: (
1930-
Literal["mean", "median", "max", "min", "sum", "sum of squares", "90th percentile", "nmad", "rmse", "std"]
1931-
| Callable[[NDArrayNum], np.floating[Any]]
1932-
),
1944+
stats_name: str | Callable[[NDArrayNum], np.floating[Any]],
1945+
inlier_mask: Mask | NDArrayBool | None = None,
19331946
band: int = 1,
1947+
counts: tuple[int, int] | None = None,
19341948
) -> np.floating[Any]: ...
19351949

19361950
@overload
19371951
def get_stats(
19381952
self,
1939-
stats_name: (
1940-
list[
1941-
Literal[
1942-
"mean", "median", "max", "min", "sum", "sum of squares", "90th percentile", "nmad", "rmse", "std"
1943-
]
1944-
| Callable[[NDArrayNum], np.floating[Any]]
1945-
]
1946-
| None
1947-
) = None,
1953+
stats_name: list[str | Callable[[NDArrayNum], np.floating[Any]]] | None = None,
1954+
inlier_mask: Mask | NDArrayBool | None = None,
19481955
band: int = 1,
1956+
counts: tuple[int, int] | None = None,
19491957
) -> dict[str, np.floating[Any]]: ...
19501958

19511959
def get_stats(
19521960
self,
19531961
stats_name: (
1954-
Literal["mean", "median", "max", "min", "sum", "sum of squares", "90th percentile", "nmad", "rmse", "std"]
1955-
| Callable[[NDArrayNum], np.floating[Any]]
1956-
| list[
1957-
Literal[
1958-
"mean", "median", "max", "min", "sum", "sum of squares", "90th percentile", "nmad", "rmse", "std"
1959-
]
1960-
| Callable[[NDArrayNum], np.floating[Any]]
1961-
]
1962-
| None
1962+
str | Callable[[NDArrayNum], np.floating[Any]] | list[str | Callable[[NDArrayNum], np.floating[Any]]] | None
19631963
) = None,
1964+
inlier_mask: Mask | NDArrayBool | None = None,
19641965
band: int = 1,
1966+
counts: tuple[int, int] | None = None,
19651967
) -> np.floating[Any] | dict[str, np.floating[Any]]:
19661968
"""
19671969
Retrieve specified statistics or all available statistics for the raster data. Allows passing custom callables
19681970
to calculate custom stats.
19691971
19701972
:param stats_name: Name or list of names of the statistics to retrieve. If None, all statistics are returned.
1971-
Accepted names include:
1972-
- "mean", "median", "max", "min", "sum", "sum of squares", "90th percentile", "nmad", "rmse", "std"
1973-
You can also use common aliases for these names (e.g., "average", "maximum", "minimum", etc.).
1974-
Custom callables can also be provided.
1973+
Accepted names include:
1974+
`mean`, `median`, `max`, `min`, `sum`, `sum of squares`, `90th percentile`, `LE90`, `nmad`, `rmse`,
1975+
`std`, `valid count`, `total count`, `percentage valid points` and if an inlier mask is passed :
1976+
`valid inlier count`, `total inlier count`, `percentage inlier point`, `percentage valid inlier points`.
1977+
Custom callables can also be provided.
1978+
:param inlier_mask: A boolean mask to filter values for statistical calculations.
19751979
:param band: The index of the band for which to compute statistics. Default is 1.
1976-
1980+
:param counts: (number of finite data points in the array, number of valid points in inlier_mask). DO NOT USE.
19771981
:returns: The requested statistic or a dictionary of statistics if multiple or all are requested.
19781982
"""
19791983
if not self.is_loaded:
19801984
self.load()
1981-
stats_dict = self._statistics(band=band)
1985+
if inlier_mask is not None:
1986+
valid_points = np.count_nonzero(~self.get_mask())
1987+
if isinstance(inlier_mask, Mask):
1988+
inlier_points = np.count_nonzero(~inlier_mask.data)
1989+
else:
1990+
inlier_points = np.count_nonzero(~inlier_mask)
1991+
dem_masked = self.copy()
1992+
dem_masked.set_mask(inlier_mask)
1993+
return dem_masked.get_stats(stats_name=stats_name, band=band, counts=(valid_points, inlier_points))
1994+
stats_dict = self._statistics(band=band, counts=counts)
19821995
if stats_name is None:
19831996
return stats_dict
19841997

19851998
# Define the metric aliases and their actual names
19861999
stats_aliases = {
19872000
"mean": "Mean",
1988-
"average": "Mean",
19892001
"median": "Median",
19902002
"max": "Max",
19912003
"maximum": "Max",
@@ -1994,17 +2006,28 @@ def get_stats(
19942006
"sum": "Sum",
19952007
"sumofsquares": "Sum of squares",
19962008
"sum2": "Sum of squares",
1997-
"percentile": "90th percentile",
19982009
"90thpercentile": "90th percentile",
19992010
"90percentile": "90th percentile",
2000-
"percentile90": "90th percentile",
2011+
"le90": "LE90",
20012012
"nmad": "NMAD",
20022013
"rmse": "RMSE",
2014+
"rms": "RMSE",
20032015
"std": "Standard deviation",
2004-
"stddev": "Standard deviation",
2005-
"standarddev": "Standard deviation",
20062016
"standarddeviation": "Standard deviation",
2017+
"validcount": "Valid count",
2018+
"totalcount": "Total count",
2019+
"percentagevalidpoints": "Percentage valid points",
20072020
}
2021+
if counts is not None:
2022+
stats_aliases.update(
2023+
{
2024+
"validinliercount": "Valid inlier count",
2025+
"totalinliercount": "Total inlier count",
2026+
"percentagevalidinlierpoints": "Percentage valid inlier points",
2027+
"percentageinlierpoints": "Percentage inlier points",
2028+
}
2029+
)
2030+
20082031
if isinstance(stats_name, list):
20092032
result = {}
20102033
for name in stats_name:

geoutils/stats.py

Lines changed: 27 additions & 3 deletions
Original file line numberDiff line numberDiff line change
@@ -38,7 +38,31 @@ def nmad(data: NDArrayNum, nfact: float = 1.4826) -> np.floating[Any]:
3838
:returns nmad: (normalized) median absolute deviation of data.
3939
"""
4040
if isinstance(data, np.ma.masked_array):
41-
data_arr = data.compressed()
41+
return nfact * np.ma.median(np.abs(data - np.ma.median(data)))
42+
return nfact * np.nanmedian(np.abs(data - np.nanmedian(data)))
43+
44+
45+
def linear_error(data: NDArrayNum, interval: float = 90) -> np.floating[Any]:
46+
"""
47+
Compute the linear error (LE) for a given dataset, representing the range of differences between the upper and
48+
lower percentiles of the data. By default, this calculates the 90% confidence interval (LE90).
49+
50+
:param data: A numpy array or masked array of data, typically representing the differences (errors) in elevation or
51+
another quantity.
52+
:param interval: The confidence interval to compute, specified as a percentage. For example, an interval of 90 will
53+
compute the range between the 5th and 95th percentiles (LE90). This value must be between 0 and 100.
54+
55+
return: The computed linear error, which is the difference between the upper and lower percentiles.
56+
57+
raises: ValueError if the `interval` is not between 0 and 100.
58+
"""
59+
# Validate the interval
60+
if not (0 < interval <= 100):
61+
raise ValueError("Interval must be between 0 and 100")
62+
63+
if isinstance(data, np.ma.masked_array):
64+
mdata = np.ma.filled(data.astype(float), np.nan)
4265
else:
43-
data_arr = np.asarray(data)
44-
return nfact * np.nanmedian(np.abs(data_arr - np.nanmedian(data_arr)))
66+
mdata = data
67+
le = np.nanpercentile(mdata, 50 + interval / 2) - np.nanpercentile(mdata, 50 - interval / 2)
68+
return le

tests/conftest.py

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -1,6 +1,6 @@
11
"""Configuration of pytest."""
22

3-
from pytest import DoctestItem
3+
from _pytest.doctest import DoctestItem
44

55

66
# To order test modules logically during execution

tests/test_raster/test_raster.py

Lines changed: 27 additions & 6 deletions
Original file line numberDiff line numberDiff line change
@@ -1985,8 +1985,6 @@ def test_split_bands(self) -> None:
19851985
def test_stats(self, example: str, caplog) -> None:
19861986
raster = gu.Raster(example)
19871987

1988-
# Full stats
1989-
stats = raster.get_stats()
19901988
expected_stats = [
19911989
"Mean",
19921990
"Median",
@@ -1995,18 +1993,41 @@ def test_stats(self, example: str, caplog) -> None:
19951993
"Sum",
19961994
"Sum of squares",
19971995
"90th percentile",
1996+
"LE90",
19981997
"NMAD",
19991998
"RMSE",
20001999
"Standard deviation",
2000+
"Valid count",
2001+
"Total count",
2002+
"Percentage valid points",
20012003
]
2004+
2005+
# Full stats
2006+
stats = raster.get_stats()
20022007
for name in expected_stats:
20032008
assert name in stats
20042009
assert stats.get(name) is not None
20052010

2011+
# With mask
2012+
inlier_mask = raster.get_mask()
2013+
stats_masked = raster.get_stats(inlier_mask=inlier_mask)
2014+
for name in [
2015+
"Valid inlier count",
2016+
"Total inlier count",
2017+
"Percentage inlier points",
2018+
"Percentage valid inlier points",
2019+
]:
2020+
assert name in stats_masked
2021+
assert stats_masked.get(name) is not None
2022+
stats_masked.pop(name)
2023+
assert stats_masked == stats
2024+
20062025
# Single stat
2007-
stat = raster.get_stats(stats_name="Average")
2008-
assert isinstance(stat, np.floating)
2026+
for name in expected_stats:
2027+
stat = raster.get_stats(stats_name=name)
2028+
assert np.isfinite(stat)
20092029

2030+
# Callable
20102031
def percentile_95(data: NDArrayNum) -> np.floating[Any]:
20112032
if isinstance(data, np.ma.MaskedArray):
20122033
data = data.compressed()
@@ -2016,8 +2037,8 @@ def percentile_95(data: NDArrayNum) -> np.floating[Any]:
20162037
assert isinstance(stat, np.floating)
20172038

20182039
# Selected stats and callable
2019-
stats_name = ["mean", "maximum", "std", "percentile_95"]
2020-
stats = raster.get_stats(stats_name=["mean", "maximum", "std", percentile_95])
2040+
stats_name = ["mean", "max", "std", "percentile_95"]
2041+
stats = raster.get_stats(stats_name=["mean", "max", "std", percentile_95])
20212042
for name in stats_name:
20222043
assert name in stats
20232044
assert stats.get(name) is not None

0 commit comments

Comments
 (0)