Decentralized housing price prediction using Federated Learning with the California Housing dataset.
This project applies federated learning (FL) to housing price prediction using two algorithms: Federated Gradient Descent (FedGD) and Federated Stochastic Gradient Descent (FedSGD). The California Housing dataset is split into geographical clusters, each treated as a local dataset, so models are trained in a decentralized way without pooling the raw data. The study compares federated learning with a traditional centralized model; in the reported results, both federated variants achieve a lower mean squared error than the Linear Regression baseline.
- Data Preprocessing: Cleaning, normalization, and one-hot encoding of categorical features.
- Geographical Clustering:
  - Agglomerative hierarchical clustering based on latitude and longitude.
  - Formation of 9 distinct clusters, each treated as a local dataset.
- Federated Learning Algorithms:
  - FedGD: Gradient-based FL with full dataset updates.
  - FedSGD: Stochastic gradient-based FL with mini-batch updates.
- Empirical Graph Construction:
  - Nodes represent clusters (local datasets).
  - Edges represent geographical proximity, enabling parameter sharing.
- Model Evaluation:
  - Comparison of FedGD, FedSGD, and traditional Linear Regression (LR).
  - Performance metrics: Mean Squared Error (MSE) on training, validation, and test sets.
The California Housing dataset consists of 20,640 instances, each representing a housing block group. Features include:
- Geographical: Latitude, Longitude, Ocean Proximity
- Housing characteristics: Median age, Total rooms, Total bedrooms
- Demographics: Population, Households, Median income
- Target variable: Median house value
- Handling missing values in `total_bedrooms`.
- One-hot encoding for `ocean_proximity`.
- Feature scaling using Min-Max normalization.
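The three preprocessing steps can be sketched without dependencies as follows. This is an illustration, not the notebook's code: the README does not say how missing values are handled, so median imputation is an assumption, and the `preprocess` helper, the `median_income` column name, and the sample rows are hypothetical.

```python
from statistics import median


def preprocess(rows, numeric_cols, cat_col="ocean_proximity"):
    """Impute, min-max scale, and one-hot encode a list of dict rows."""
    # Assumption: missing numeric values are replaced by the column median.
    fill = {
        c: median(r[c] for r in rows if r[c] is not None) for c in numeric_cols
    }
    lo, hi = {}, {}
    for c in numeric_cols:
        vals = [r[c] if r[c] is not None else fill[c] for r in rows]
        lo[c], hi[c] = min(vals), max(vals)
    categories = sorted({r[cat_col] for r in rows})

    out = []
    for r in rows:
        new = {}
        for c in numeric_cols:
            v = r[c] if r[c] is not None else fill[c]
            span = hi[c] - lo[c]
            # Min-max normalization maps each column to [0, 1].
            new[c] = (v - lo[c]) / span if span else 0.0
        for cat in categories:
            # One indicator column per category value.
            new[f"{cat_col}_{cat}"] = 1.0 if r[cat_col] == cat else 0.0
        out.append(new)
    return out


rows = [
    {"total_bedrooms": 100.0, "median_income": 2.0, "ocean_proximity": "INLAND"},
    {"total_bedrooms": None, "median_income": 4.0, "ocean_proximity": "NEAR BAY"},
    {"total_bedrooms": 300.0, "median_income": 6.0, "ocean_proximity": "INLAND"},
]
clean = preprocess(rows, ["total_bedrooms", "median_income"])
```

In the second row the missing `total_bedrooms` value is filled with the median (200) and then scaled to 0.5. When the data is later split, the imputation and scaling statistics are normally computed on the training portion only.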
- Agglomerative hierarchical clustering groups data into 9 clusters.
- Each cluster represents a local dataset for federated learning.
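To make the clustering step concrete, here is a naive bottom-up sketch in NumPy. The README does not state the linkage criterion, so centroid linkage is an assumption, and `agglomerative_labels` and the synthetic coordinates are hypothetical. The demo uses 3 clusters for brevity; the project uses 9.

```python
import numpy as np


def agglomerative_labels(coords, n_clusters):
    """Merge the two clusters with the closest centroids until n_clusters remain."""
    clusters = [[i] for i in range(len(coords))]  # start with one cluster per point
    while len(clusters) > n_clusters:
        centroids = np.array([coords[c].mean(axis=0) for c in clusters])
        # Pairwise centroid distances; the diagonal is masked so that a
        # cluster is never merged with itself.
        diff = centroids[:, None, :] - centroids[None, :, :]
        dist = np.sqrt((diff ** 2).sum(axis=-1))
        np.fill_diagonal(dist, np.inf)
        a, b = np.unravel_index(np.argmin(dist), dist.shape)
        a, b = min(a, b), max(a, b)
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    labels = np.empty(len(coords), dtype=int)
    for k, members in enumerate(clusters):
        labels[members] = k
    return labels


# Synthetic (latitude, longitude) points around three made-up centers.
rng = np.random.default_rng(0)
centers = np.array([[34.0, -118.2], [37.8, -122.4], [32.7, -117.2]])
pts = np.vstack([c + 0.05 * rng.standard_normal((10, 2)) for c in centers])
labels = agglomerative_labels(pts, n_clusters=3)
```

This loop recomputes all centroid distances after every merge, which is far too slow for 20,640 block groups. In practice a library implementation such as scikit-learn's `AgglomerativeClustering(n_clusters=9)` would likely be used instead.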
- Nodes represent clusters, and edges connect k-nearest neighbors based on geographical proximity.
- Graph structure facilitates parameter sharing across local models.
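A possible construction of the empirical graph from cluster centroids is sketched below. Several details are assumptions: the README gives no value for k, Euclidean distance on raw coordinates stands in for geographical proximity, and edges are symmetrized so that sharing is mutual. The `knn_graph` name and the toy centroids are hypothetical.

```python
import numpy as np


def knn_graph(centroids, k):
    """Return a symmetric 0/1 adjacency matrix linking each node to its k nearest nodes."""
    n = len(centroids)
    diff = centroids[:, None, :] - centroids[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    np.fill_diagonal(dist, np.inf)  # exclude self-loops
    adj = np.zeros((n, n), dtype=int)
    for i in range(n):
        adj[i, np.argsort(dist[i])[:k]] = 1
    # Assumption: an edge exists if either endpoint selects the other.
    return np.maximum(adj, adj.T)


centroids = np.array([[0.0, 0.0], [0.0, 1.0], [0.0, 2.5], [5.0, 0.0]])
A = knn_graph(centroids, k=1)
# Resulting undirected edges: (0, 1), (1, 2), (0, 3)
```

Because of the symmetrization, a node can end up with more than k neighbors, as node 0 does here.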
- FedGD: Gradient-based optimization applied to the Generalized Total Variation Minimization (GTVMin) problem.
- FedSGD: Stochastic gradient updates using mini-batches.
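The two algorithms differ only in how the local gradient is computed, which the sketch below illustrates for linear models. It assumes a common GTVMin-style objective: the sum of local mean squared errors plus `alpha` times the sum over edges of `A_ij * ||w_i - w_j||^2`. The exact objective and hyperparameters used in the notebook are not given in this README, so `fed_gradient_descent`, `alpha`, `lr`, `steps`, and `batch_size` are hypothetical.

```python
import numpy as np


def fed_gradient_descent(datasets, adj, alpha=0.5, lr=0.05, steps=1000,
                         batch_size=None, seed=0):
    """Gradient descent on a GTVMin-style objective for linear models.

    datasets: list of (X_i, y_i) pairs, one per node.
    adj: symmetric weight matrix of the empirical graph.
    batch_size=None uses each full local dataset (FedGD); an integer
    samples a fresh mini-batch per node and step (FedSGD).
    """
    rng = np.random.default_rng(seed)
    n, d = len(datasets), datasets[0][0].shape[1]
    W = np.zeros((n, d))  # row i holds the local weights w_i
    degree = adj.sum(axis=1)
    for _ in range(steps):
        grad = np.zeros_like(W)
        for i, (X, y) in enumerate(datasets):
            if batch_size is not None:
                idx = rng.choice(len(y), size=min(batch_size, len(y)),
                                 replace=False)
                X, y = X[idx], y[idx]
            # Gradient of the local mean squared error.
            grad[i] = 2.0 * X.T @ (X @ W[i] - y) / len(y)
        # Gradient of the coupling term: 2 * alpha * sum_j A_ij (w_i - w_j).
        grad += 2.0 * alpha * (degree[:, None] * W - adj @ W)
        W -= lr * grad  # all nodes update synchronously
    return W


# Toy setup: three nodes on a chain, noise-free data from the same weights.
rng = np.random.default_rng(1)
w_true = np.array([2.0, -1.0])
datasets = []
for _ in range(3):
    X = rng.standard_normal((20, 2))
    datasets.append((X, X @ w_true))
adj = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]], dtype=float)

W_gd = fed_gradient_descent(datasets, adj)                 # FedGD
W_sgd = fed_gradient_descent(datasets, adj, batch_size=5)  # FedSGD
```

In this toy setup every row of `W_gd` and `W_sgd` should approach `w_true`. Each node only needs its own data and the current weights of its graph neighbors, which is what allows training without pooling the raw data.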
| Model | Train MSE | Validation MSE | Test MSE |
|---|---|---|---|
| FedGD | 43.14 | 46.37 | 44.62 |
| FedSGD | 43.78 | 46.78 | 45.30 |
| Linear Regression | 50.75 | 54.51 | 53.64 |
- FedGD achieves slightly lower errors than FedSGD on all three splits (test MSE 44.62 vs. 45.30).
- Both federated models reach a clearly lower test MSE than traditional Linear Regression (LR), which scores 53.64.
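As a quick check on these comparisons, the snippet below defines MSE and computes the relative reduction in test MSE against the LR baseline, using the test figures from the results table. The `mse` helper is a minimal illustration and not the notebook's evaluation code.

```python
def mse(y_true, y_pred):
    """Mean squared error between two equal-length sequences."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


example = mse([3.0, 5.0], [2.0, 7.0])  # (1 + 4) / 2 = 2.5

# Test MSE values copied from the results table.
test_mse = {"FedGD": 44.62, "FedSGD": 45.30, "Linear Regression": 53.64}
baseline = test_mse["Linear Regression"]
reduction = {
    name: (baseline - value) / baseline
    for name, value in test_mse.items()
    if name != "Linear Regression"
}
for name, r in reduction.items():
    print(f"{name}: {r:.1%} lower test MSE than LR")
# FedGD: 16.8% lower test MSE than LR
# FedSGD: 15.5% lower test MSE than LR
```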
git clone https://github.com/yourusername/fl-california-housing.git
cd fl-california-housing
jupyter notebook housing.ipynb
Running the notebook will:
- Load and preprocess the California Housing dataset
- Perform geographical clustering of the data
- Train the federated learning models (FedGD and FedSGD)
- Evaluate and compare model performance
- Generate visualizations of the results
This project is licensed under the MIT License - see the LICENSE file for details.