This is a Flask-based web application that provides APIs for fetching building data, performing clustering (balanced K-means and grid-based), and generating CSV reports summarizing ward-level visit data. The application integrates with Google Earth Engine (GEE) and a PostgreSQL database with the PostGIS extension to process building data within specified polygons or around pin locations.
- Building Data Retrieval: Fetches building data within a polygon from either Google Earth Engine or a PostgreSQL database, with optional filters for minimum area and confidence.
- Clustering:
- K-means Clustering: Performs balanced K-means clustering based on a specified number of clusters or buildings per cluster.
- Grid-based Clustering: Generates a grid over a polygon, assigns buildings to grid cells, and clusters grids to balance building counts.
- Reporting: Generates CSV reports summarizing ward-level visit data, with options to include building-to-visit distance metrics.
- Geospatial Support: Uses PostGIS for spatial queries and GeoPandas for handling geospatial data.
- CORS Support: Allows cross-origin requests for frontend integration.
- Python: Version 3.8 or higher.
- PostgreSQL: With PostGIS extension enabled for spatial queries.
- Google Earth Engine Account: For accessing building data via GEE.
- Dependencies: Listed in `requirements.txt`.
- Clone the repository:

  ```shell
  git clone https://github.com/Thushar12E45/dimagi-map-project.git
  cd dimagi-map-project
  ```
- Set up a virtual environment:

  ```shell
  python -m venv venv
  source venv/bin/activate  # On Windows: venv\Scripts\activate
  ```
- Install dependencies:

  ```shell
  pip install -r requirements.txt
  ```
- Set up environment variables by creating a `.env` file in the project root with the following variables:

  ```shell
  GEE_CREDS=<your-google-earth-engine-credentials-json>
  GEE_PROJECT_NAME=<your-google-earth-engine-project-name>
  DB_USER=<your-postgres-username>
  DB_PASSWORD=<your-postgres-password>
  DB_HOST=<your-postgres-host>
  DB_PORT=<your-postgres-port>
  DB_NAME=<your-postgres-database-name>
  HOST_URL=<your-host-url>  # Optional, defaults to https://connectgis.dimagi.com
  ```

  - `GEE_CREDS`: JSON string of Google Earth Engine service account credentials.
  - `DB_*`: Credentials for the PostgreSQL connection.
- Database setup:
  - Ensure the PostgreSQL database is running and has the PostGIS extension enabled.
  - The `buildings` table should contain building data with spatial geometry.
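For reference, the `buildings` table is assumed to look roughly like the following. The column names here are guesses based on the metrics the loader computes (area, centroids, confidence); check `SQL_SCRIPTS/db_insertion_buildings.py` for the authoritative schema:

```sql
-- Assumed shape of the buildings table (illustrative only).
CREATE TABLE IF NOT EXISTS buildings (
    id             BIGSERIAL PRIMARY KEY,
    geometry       GEOMETRY(Polygon, 4326),  -- building footprint
    area_in_meters DOUBLE PRECISION,         -- precomputed footprint area
    confidence     DOUBLE PRECISION          -- source confidence score
);

-- Spatial index for polygon containment queries
CREATE INDEX IF NOT EXISTS buildings_geometry_idx
    ON buildings USING GIST (geometry);
```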
- Run the application:

  ```shell
  python app.py
  ```

  The app runs on `http://0.0.0.0:5000` in debug mode by default.
- Google Cloud account
- Earth Engine access (sign up at earthengine.google.com)
- Go to Google Cloud Console
- Click "Create Project"
- Enter project name and details
- Click "Create"
- After project creation, register it for:
- Commercial use (if applicable), or
- Non-commercial use (for research/academic purposes)
- Wait for approval (typically 1-2 business days)
- Search / Navigate to: APIs & Services → Library
- Search for "Google Earth Engine API"
- Click "Enable"
```shell
# Install the Earth Engine Python API
pip install earthengine-api

# Authenticate (opens a browser window)
earthengine authenticate
```
Add your GEE project name to the `.env` file:

```shell
GEE_PROJECT_NAME=<your-google-earth-engine-project-name>
```
- Local Development: Authentication via `earthengine authenticate` is sufficient for local use (no credentials needed in `.env`).
- Production / Other Environments: You must add the appropriate GEE service account credentials to your `.env` file:

  ```shell
  GEE_CREDS=<your-service-account-credentials-json>
  ```
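As an illustration of how these two modes can be wired up, here is a sketch of Earth Engine initialization driven by the environment variables above. The helper names and control flow are assumptions, not the app's actual code; the `ee.ServiceAccountCredentials` and `ee.Initialize` calls come from the earthengine-api library:

```python
import json
import os

def service_account_email(creds_json: str) -> str:
    """Extract the service-account email from a credentials JSON string."""
    return json.loads(creds_json)["client_email"]

def init_earth_engine():
    """Initialize Earth Engine from GEE_CREDS / GEE_PROJECT_NAME.

    Sketch only: with GEE_CREDS set (production), use service-account
    credentials; otherwise fall back to the cached login created by
    `earthengine authenticate` (local development).
    """
    import ee  # earthengine-api

    project = os.environ.get("GEE_PROJECT_NAME")
    creds_json = os.environ.get("GEE_CREDS")
    if creds_json:
        credentials = ee.ServiceAccountCredentials(
            service_account_email(creds_json), key_data=creds_json
        )
        ee.Initialize(credentials, project=project)
    else:
        ee.Initialize(project=project)
```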
The building data insertion system downloads building footprint data from Overture Maps and loads it into a PostgreSQL database for use by the clustering application. This is a one-time setup process required before using the main application.
- Downloads building footprint data for an entire country from Overture Maps
- Processes the data in manageable tiles to handle large datasets
- Calculates building metrics (area, centroids, confidence scores)
- Stores data in PostGIS-enabled PostgreSQL database
- Creates spatial indexes for optimal query performance
- Supports parallel processing for faster data loading
Edit `SQL_SCRIPTS/db_insertion_buildings.py` to set the target country's bounding box:

```python
# Example for Kenya
left, bottom = 33.9098, -4.6796  # SW corner
right, top = 41.9058, 5.5059     # NE corner

# Change the base filename to match your country
base_filename = "kenya_buildings"
```
```shell
cd SQL_SCRIPTS
python db_insertion_buildings.py --instance-id 0 --total-instances 1
```
Run multiple instances simultaneously for faster processing:

```shell
# Example with 4 parallel instances
python db_insertion_buildings.py --instance-id 0 --total-instances 4 &
python db_insertion_buildings.py --instance-id 1 --total-instances 4 &
python db_insertion_buildings.py --instance-id 2 --total-instances 4 &
python db_insertion_buildings.py --instance-id 3 --total-instances 4
```
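The `--instance-id`/`--total-instances` flags imply that each instance processes a disjoint subset of tiles. As an illustration only (not necessarily the script's actual logic), a bounding box can be split into tiles and assigned round-robin like this:

```python
# Illustrative sketch of tiling a bounding box and dividing tiles among
# parallel instances. Function names and the round-robin scheme are
# assumptions, not the actual code in db_insertion_buildings.py.
def make_tiles(left, bottom, right, top, step):
    """Split a bounding box into (at most) step x step degree tiles."""
    tiles = []
    y = bottom
    while y < top:
        x = left
        while x < right:
            tiles.append((x, y, min(x + step, right), min(y + step, top)))
            x += step
        y += step
    return tiles

def tiles_for_instance(tiles, instance_id, total_instances):
    """Round-robin assignment: each instance gets a disjoint subset."""
    return [t for i, t in enumerate(tiles) if i % total_instances == instance_id]

# Kenya bounding box from the configuration example above
tiles = make_tiles(33.9098, -4.6796, 41.9058, 5.5059, step=1.0)
mine = tiles_for_instance(tiles, instance_id=0, total_instances=4)
```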
The script supports resuming from interruptions:
- Failed tiles are logged in `failed_tiles.csv`
- Restart the script with the same parameters; it will skip completed tiles
- For persistent failures, check the error messages in the failed tiles log
- Method: GET
- Description: Renders the `index.html` template with the configured `HOST_URL`.
- Response: HTML page.
- Endpoint: `/get_building_density`
- Method: POST
- Description: Fetches building data within a polygon or around a pin and performs clustering.
- Request Body:
  ```
  {
    "clusteringType": "kMeans|balancedKMeans|bottomUp",
    "noOfClusters": <int>,            // Number of clusters (default: 3)
    "noOfBuildings": <int>,           // Target buildings per cluster (default: 250)
    "buildingsAreaInMeters": <float>, // Minimum building area (default: 0)
    "buildingsConfidence": <int>,     // Minimum confidence (0-100, default: 0)
    "thresholdVal": <int>,            // Tolerance percentage (default: 10)
    "fetchClusters": <boolean>,       // Whether to perform clustering (default: false)
    "dbType": "GEE|DB",               // Data source (Google Earth Engine or Database)
    "polygon": [[lng, lat], ...],     // Polygon coordinates (for kMeans/balancedKMeans)
    "pin": [lng, lat]                 // Pin coordinates (for bottomUp)
  }
  ```
- Response: JSON with building count, GeoJSON features, and optional cluster data.
- Example Response:
  ```
  {
    "building_count": 100,
    "buildings": {"type": "FeatureCollection", "features": [...]},
    "clusters": [{"coordinates": [lng, lat], "cluster": <int>, "numOfBuildings": <int>}, ...]
  }
  ```
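For example, a minimal Python client for `/get_building_density` might look like this (the host and port assume the default local run; the polygon coordinates and parameter values are illustrative):

```python
import json
from urllib import request

# Build a POST request following the request-body schema above.
payload = {
    "clusteringType": "balancedKMeans",
    "noOfClusters": 3,
    "thresholdVal": 10,
    "fetchClusters": True,
    "dbType": "DB",
    "polygon": [
        [36.80, -1.28], [36.82, -1.28], [36.82, -1.26],
        [36.80, -1.26], [36.80, -1.28],  # ring closed by repeating the first point
    ],
}
req = request.Request(
    "http://localhost:5000/get_building_density",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
    method="POST",
)

# With the app running locally:
# with request.urlopen(req) as resp:
#     print(json.load(resp)["building_count"])
```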
- Method: POST
- Description: Fetches buildings from the database, generates a grid, assigns buildings to grid cells, and performs grid-based clustering.
- Request Body:
  ```
  {
    "polygon": [[lng, lat], ...],     // Polygon coordinates
    "noOfClusters": <int>,            // Number of clusters (default: 3)
    "thresholdVal": <int>,            // Tolerance percentage (default: 10)
    "gridLength": <int>,              // Grid size in meters (default: 50)
    "buildingsAreaInMeters": <float>, // Minimum building area (default: 0)
    "buildingsConfidence": <int>      // Minimum confidence (0-100, default: 0)
  }
  ```
- Response: JSON with building count, GeoJSON features, grid GeoJSON, and cluster data.
- Example Response:
  ```
  {
    "building_count": 100,
    "buildings": {"type": "FeatureCollection", "features": [...]},
    "grids": {"type": "FeatureCollection", "features": [...]},
    "clusters": [{"coordinates": [lng, lat], "cluster": <int>, "grid_index": <int>}, ...]
  }
  ```
- Method: POST
- Description: Generates a CSV report summarizing ward-level visit data.
- Request Body:
  ```
  {
    "data": [{"latitude": <float>, "longitude": <float>, "flw_id": <int>}, ...],
    "fetchVisitToBuildingsVal": <boolean> // Include building-to-visit distance metrics (default: true)
  }
  ```
- Response: CSV file (`Ward_summary_report.csv`) with ward visit summary data.
- Example CSV headers (with `fetchVisitToBuildingsVal=true`):

  ```
  state_name,lga_name,ward_name,population,total.visits,total.buildings,num.phc.serve.ward,median.visit.to.phc,max.visit.to.phc,median.building.to.phc,max.buildings.to.phc,unique.flws,coverage,percent.building.100.plus.to.visit,percent.building.200.plus.to.visit,percent.building.500.plus.to.visit,percent.building.10000.plus.to.visit
  ```
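As a sketch of a client for this endpoint, the request body can be built from a CSV of visit rows like so. The helper name and sample data are illustrative assumptions, and the route path is not shown above, so check `app.py` for the actual URL to POST to:

```python
import csv
import io

# Build the report request body from CSV rows of visits.
# Field names follow the request-body schema above.
def visits_payload(csv_text, include_distances=True):
    rows = csv.DictReader(io.StringIO(csv_text))
    data = [
        {
            "latitude": float(r["latitude"]),
            "longitude": float(r["longitude"]),
            "flw_id": int(r["flw_id"]),
        }
        for r in rows
    ]
    return {"data": data, "fetchVisitToBuildingsVal": include_distances}

sample = "latitude,longitude,flw_id\n9.0579,7.4951,1\n9.0601,7.4898,2\n"
payload = visits_payload(sample)  # POST this as JSON to the report endpoint
```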
```
├── app.py              # Main Flask application
├── .env                # Environment variables (not tracked)
├── requirements.txt    # Python dependencies
├── SQL_SCRIPTS/
│   └── db_insertion_buildings.py  # One-time building data loader
├── templates/
│   └── index.html      # Frontend template
```
- Google Earth Engine: Ensure valid GEE credentials are provided in `.env` for the `/get_building_density` endpoint with `dbType=GEE`.
- Performance: For large polygons, reduce the number of buildings or grid cells to avoid GEE's 5000-element limit.
- Security: Sanitize inputs to prevent SQL injection (handled via parameterized queries in the code).
- CORS: Configured to allow all origins (`*`). Adjust in production for security.
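To illustrate the parameterized-query pattern the security note refers to: the polygon arrives as coordinates, is serialized to WKT, and is bound as a query parameter rather than interpolated into the SQL string. The table and column names below are assumptions for illustration, not the app's actual schema:

```python
# Sketch of a parameterized PostGIS query (psycopg2-style %s placeholders).
def polygon_wkt(coords):
    """[[lng, lat], ...] -> WKT POLYGON string, closing the ring if needed."""
    if coords[0] != coords[-1]:
        coords = coords + [coords[0]]
    ring = ", ".join(f"{lng} {lat}" for lng, lat in coords)
    return f"POLYGON(({ring}))"

QUERY = """
    SELECT id, ST_AsGeoJSON(geometry) AS geojson
    FROM buildings
    WHERE ST_Within(geometry, ST_GeomFromText(%s, 4326))
      AND area_in_meters >= %s
"""

# With a live connection:
# cursor.execute(QUERY, (polygon_wkt(polygon), min_area))
```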
- GEE Initialization Error: Verify `GEE_CREDS` in `.env` and ensure the credentials JSON is correctly formatted.
- Database Connection Error: Check the PostgreSQL credentials and ensure the database is accessible.
- No Buildings Found: Ensure the polygon or pin coordinates are valid and contain buildings in the database or GEE dataset.
- Clustering Issues: Adjust `thresholdVal` or reduce `noOfClusters`/`noOfBuildings` if clustering fails due to insufficient data.
The Connect GIS app is hosted on AWS by running a dockerized version of the app on an EC2 instance. The database is hosted as an AWS RDS service.

To deploy new changes you need access to the production server on AWS (the details and credentials are in 1Password; search for "ConnectGIS").
Deploying new changes requires the following steps:
```shell
# Navigate to the project folder
cd projects/dimagi-map-project/

# Pull the latest changes
git pull

# Build the Docker image
docker build -t map-clustering .

# Stop the existing running container
docker stop map-clustering

# Remove the named container
docker container remove map-clustering

# Run the new image with the port binding
docker run -d -p 5010:5000 --name map-clustering map-clustering
```
The production environment file also lives in 1Password, but you will also need to update the `.env` on the server (`~/projects/connect-gis/.env`).