---
layout: post
title: "EO data pipeline"
author: tam
categories: [ AI, code, data ]
image: assets/images/composite.png
description: "Key learnings from our data pipeline on how to prepare open EO data for machine learning."
featured: false
hidden: false
---

Transforming raw EO imagery into analysis-ready data requires a lot of
preparation. At Tesselo, we focused a large part of our platform on the
pre-processing of EO data.

### EO training data has location and date

When modeling with EO data, the goal is usually to create imagery stacks that
are aligned with the available training data. The training data is
geo-referenced and is created in relation to a specific date. Examples of
sources for the training data timestamps are:

- The date of a field trip where sampling was performed
- The collection date of a satellite image or drone flight that was used as
  reference to create the training data
- The reference date for training data that was provided by customers

To feed the data to models, the training polygons need to be rasterized
first. The stack of training data and imagery needs to be converted into
numpy arrays (tensors) that have the precise shape that our deep learning
models require. The imagery collected for training and inference also has to
match the target resolution, which is usually, but not always, the same as
that of the training data.

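As an illustration, here is a minimal sketch of the rasterization step with
rasterio and shapely; the file name, geometry, and class labels are all
hypothetical:

```python
import rasterio
from rasterio import features
from shapely.geometry import box

# Hypothetical training data: (geometry, class label) pairs.
training_polygons = [(box(520000.0, 4510000.0, 521000.0, 4511000.0), 1)]

with rasterio.open("imagery_stack.tif") as src:
    # Burn the class labels into a grid aligned with the imagery.
    labels = features.rasterize(
        training_polygons,
        out_shape=(src.height, src.width),
        transform=src.transform,
        fill=0,  # Background value outside all polygons.
        dtype="uint8",
    )
    imagery = src.read()  # Shape: (bands, height, width).

# Labels and imagery are now aligned pixel by pixel.
assert labels.shape == imagery.shape[1:]
```
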
### Finding the imagery for training

In some cases a model can be trained with the imagery that was used to create
the training data in the first place. But that is often not the case, and
imagery matching the training data needs to be obtained.

Geospatial training data are geometries that are geo-referenced. The geometries
can be intersected with imagery archives to identify scenes based on the
reference date.

Internally we maintained a [STAC](https://stacspec.org) catalog to be able to do
these searches fast and at scale. For details, consult our
[PxSearch](https://github.com/tesselo/pxsearch) repository. We were tracking the
archives that contain the scenes in a [Cloud Optimized GeoTiff
(COG)](https://www.cogeo.org/) format for fast and efficient retrieval of the
data.

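As a stand-in illustration, the same kind of query can be expressed against a
public STAC API with pystac-client; the endpoint, collection, geometry, and
date range below are assumptions, not what PxSearch's interface looks like:

```python
from pystac_client import Client

# Hypothetical training polygon as a GeoJSON geometry.
training_geometry = {
    "type": "Polygon",
    "coordinates": [
        [[-8.7, 39.3], [-8.6, 39.3], [-8.6, 39.4], [-8.7, 39.4], [-8.7, 39.3]]
    ],
}

# Public Earth Search endpoint, used here as a stand-in for PxSearch.
catalog = Client.open("https://earth-search.aws.element84.com/v1")

# Scenes intersecting the geometry around the training reference date.
search = catalog.search(
    collections=["sentinel-2-l2a"],
    intersects=training_geometry,
    datetime="2021-06-01/2021-06-30",
)
for item in search.items():
    print(item.id, item.datetime)  # item.assets holds the COG URLs.
```
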
### Compositing the relevant images

When requesting the identified images for a given polygon, there might be nodata
pixels from the original scenes if the polygon is placed at a scene boundary.
To reliably get a complete set of pixels, compositing becomes necessary.

Even when valid pixels are found, they might contain clouds or strong
atmospheric effects. In some cases, it is important to obtain the best possible
pixel from a time span. This is where the quality of the pixel plays a role.

In other words, compositing has two purposes:

- Removing nodata pixels
- Selecting high-quality pixels from a set of images

The exact implementation of our compositing algorithms can be found in the
[composite
method](https://github.com/tesselo/pixels/blob/38cbd2416c6688c9c2f0aaf4890e4eec82f49707/pixels/mosaic.py#L641)
of our [pixels](https://github.com/tesselo/pixels) repository.

One important question for compositing is how to report the metadata about the
scenes that went into the composite: as metadata in the image file, as an
additional JSON file, or not at all if it is not needed.

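If one opts for the image-file route, a minimal sketch with rasterio could
look like this; the tag name and scene IDs are assumptions, not a convention
our pipeline used:

```python
import rasterio

# Hypothetical IDs of the scenes that went into the composite.
scene_ids = ["S2A_29TNE_20210612", "S2B_29TNE_20210617"]

# Store them as GeoTIFF tags on the existing composite file.
with rasterio.open("composite.tif", "r+") as dst:
    dst.update_tags(SCENE_IDS=",".join(scene_ids))
```
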
### Use all bands and upsample to highest resolution

By default, we simply used all available bands of the multispectral satellite
images as input to the models. For instance, for Sentinel-2 based models we used
the 10 bands that have 10m or 20m resolution. Similarly, for Landsat we used all
the available bands except the cirrus band.

In our pre-processing pipeline we had to resample all bands to the target
resolution. Usually this meant upsampling the lower resolution bands to the
resolution of the band with the highest resolution, which is 10m for Sentinel-2
images, for instance.

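A minimal sketch of this upsampling with rasterio, assuming a hypothetical
20m Sentinel-2 band file and a 10m target grid:

```python
import rasterio
from rasterio.enums import Resampling

# Read a 20m band at twice its native grid size to get 10m pixels.
with rasterio.open("B11_20m.tif") as src:
    band_10m = src.read(
        1,
        out_shape=(src.height * 2, src.width * 2),
        resampling=Resampling.bilinear,
    )
```
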
Using the same approach, we would also create super-resolution imagery by
upsampling our imagery data to the resolution of the target data. We had
successful models that would build 1m resolution images out of 10m resolution
data.

### Technical implementation

While the detailed implementation can be found in our codebase, here follows a
description of the approach in words.

#### Latest pixel composites

The simplest way to composite imagery is to look for the newest pixel in a
given time span that is not nodata.

1. For a given polygon and time interval, all intersecting scenes are searched
1. The scenes are sorted by date
1. The pixels from the required bands intersecting with the polygon are
   retrieved
1. If there is any nodata pixel in the result, go back to step 3 until there
   are no more nodata pixels
1. Return the resulting pixel combination

This algorithm will often only collect data from the first image. If that image
covers the entire polygon, the algorithm will not retrieve any more data.

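In numpy terms, the core of this loop might look like the following sketch,
assuming the scenes have already been retrieved as arrays sorted newest first
and share a single nodata value:

```python
import numpy as np

NODATA = 0  # Assumed nodata value.

def latest_pixel_composite(scenes):
    """Fill nodata pixels of the newest scene from older scenes."""
    # `scenes` is a list of (bands, height, width) arrays, newest first.
    composite = scenes[0].copy()
    for scene in scenes[1:]:
        mask = composite == NODATA
        if not mask.any():
            break  # The polygon is fully covered, stop retrieving.
        composite[mask] = scene[mask]
    return composite
```
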
#### Best pixel composites

A slightly more sophisticated approach is to rank the cloud-free pixels by
quality and select the best one. This is the method we used most. It is more
expensive than the latest pixel approach, because all scenes have to be
collected for the ranking. But it provides much cleaner imagery in cloudy
regions with generous time intervals like months or even quarters.

The tie breaker we used was the highest NDVI, because we were usually
interested in finding vegetation.

Here is a summary of the steps of the best pixel composite:

1. For a given polygon and time interval, all intersecting scenes are searched
1. The pixels from the required bands intersecting with the polygon are
   retrieved for all candidate scenes
1. For each pixel, remove scenes that have clouds or nodata values
1. For the remaining scenes, pick the one with the highest NDVI value
1. Return the resulting pixel combination
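
A numpy sketch of the per-pixel selection, assuming the candidate scenes are
stacked into one array, an `invalid` mask flags cloudy and nodata pixels, and
the red and NIR band positions are known:

```python
import numpy as np

RED, NIR = 2, 3  # Illustrative band positions in the stack.

def best_pixel_composite(stack, invalid):
    """Pick, per pixel, the scene with the highest NDVI.

    stack: (scenes, bands, height, width) array.
    invalid: (scenes, height, width) bool mask of clouds and nodata.
    """
    red = stack[:, RED].astype("float32")
    nir = stack[:, NIR].astype("float32")
    ndvi = (nir - red) / (nir + red + 1e-6)
    ndvi[invalid] = -np.inf  # Never pick cloudy or nodata pixels.
    best = ndvi.argmax(axis=0)  # Winning scene index per pixel.
    # Gather the winning scene's bands for every pixel.
    return np.take_along_axis(stack, best[None, None], axis=0)[0]
```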
