Evaluating Synthetic Data Generation Approaches for Improved Machine Learning Detection of Harmful Algal Blooms
This study investigates the effectiveness of different synthetic data augmentation methods for enhancing harmful algal bloom (HAB) detection using machine learning. We compare three approaches: a statistical method using Gaussian Copulas, a novel LLM-based collaborative multi-agent pipeline, and a deep learning approach using Conditional Tabular GANs (CTGAN). Our results demonstrate that all three synthetic data generation methods significantly improve model performance compared to using only the original data, with comparable performance metrics. The CTGAN method shows slightly better performance in terms of error metrics and R² score, followed closely by the Gaussian Copula method, while the LLM-based approach offers greater flexibility and domain-specific knowledge integration at the cost of higher computational overhead.
Harmful algal blooms (HABs) pose significant threats to aquatic ecosystems, human health, and economic activities. Early detection and prediction of HABs are crucial for effective management and mitigation strategies. However, limited availability of training data often constrains the performance of machine learning models for HAB detection.
This study addresses the data limitation challenge by exploring synthetic data augmentation methods. We implement and compare three approaches:
- Gaussian Copula Method: A statistical approach that preserves the marginal distributions and correlation structure of the original data.
- LLM Collaborative Multi-Agent Pipeline: A novel approach using large language models (LLMs) with domain expertise to generate realistic synthetic data points.
- CTGAN (Conditional Tabular GAN) Method: A deep learning approach that uses adversarial training to generate high-quality synthetic tabular data.
The primary research questions addressed in this study are:
- Can synthetic data augmentation improve the performance of HAB detection models?
- How do different synthetic data generation methods compare in terms of model performance?
- What are the computational trade-offs between statistical, LLM-based, and deep learning approaches?
The original dataset contains measurements of Temperature, Salinity, and UVB radiation as predictor features, along with Chlorophyll-a fluorescence, which serves as the target variable for prediction.
The preprocessing pipeline includes:
- Log transformation of the target variable to address skewness
- Feature engineering with polynomial features (degree 2, interaction terms)
- Train-test split (80% training, 20% testing)
- Imputation of missing values using median strategy
- Standardization of features
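A minimal sketch of this pipeline with scikit-learn is shown below. The column names, the exact ordering of steps, and the use of log1p are illustrative assumptions rather than the exact implementation in preprocess_basic.py.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.pipeline import Pipeline

# Illustrative column names; the actual headers in Dataset.xlsx may differ.
df = pd.read_excel("Dataset.xlsx")
X = df[["Temperature", "Salinity", "UVB"]]
y = np.log1p(df["Chlorophyll_a"])  # log transform of the target to reduce skewness

# 80/20 train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

preprocess = Pipeline([
    ("impute", SimpleImputer(strategy="median")),                # median imputation
    ("poly", PolynomialFeatures(degree=2, include_bias=False)),  # degree-2 terms + interactions
    ("scale", StandardScaler()),                                 # standardization
])
X_train_p = preprocess.fit_transform(X_train)
X_test_p = preprocess.transform(X_test)
```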
The Gaussian Copula method models the multivariate distribution of the original data while preserving the marginal distributions and correlation structure. This approach:
- Fits a Gaussian Copula model to the original data
- Samples from the fitted model to generate synthetic data points
- Applies post-processing to ensure the synthetic data remains within realistic bounds
The implementation uses the GaussianMultivariate model from the SDV library, which provides a robust framework for generating synthetic tabular data.
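A minimal sketch of this step is shown below, assuming the GaussianMultivariate class from the copulas package (the modeling backend used by SDV) and an illustrative input path.

```python
import pandas as pd
from copulas.multivariate import GaussianMultivariate

# Hypothetical path to the preprocessed training rows (features plus target).
df_train = pd.read_csv("output/train_data.csv")

copula = GaussianMultivariate()
copula.fit(df_train)                      # learn marginals and correlation structure
synthetic = copula.sample(len(df_train))  # draw synthetic rows from the fitted model

# Post-processing: clip to the observed range so values stay physically plausible.
synthetic = synthetic.clip(lower=df_train.min(), upper=df_train.max(), axis=1)
```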
Our novel LLM-based approach employs a collaborative multi-agent system with three specialized roles:
- Data Generation Agent: Generates synthetic data points based on statistical properties of the original data
- Domain Expert Agent: Validates generated data points for domain consistency and provides feedback
- Data Scientist Agent: Refines the dataset to maintain statistical properties and feature correlations
The agents collaborate through an iterative feedback loop, with each agent contributing its expertise to improve the quality of the synthetic data. The implementation uses OpenAI's GPT-4o model, which provides the foundation for all three agents.
The multi-agent pipeline follows these steps:
- Data Analysis: Calculate statistical properties and correlations of the original data
- Iterative Generation: Generate synthetic samples with feedback from domain expert
- Batch Refinement: Process batches of samples for efficiency and consistency
- Statistical Validation: Ensure synthetic data maintains statistical properties of original data
- Final Refinement: Apply final adjustments to the complete dataset
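The sketch below illustrates a single generate-validate-refine round using OpenAI's chat completions API. The ask helper, the prompts, and the statistics payload are hypothetical simplifications, not the exact agent prompts used in preprocess_llm_synthetic.py.

```python
import json
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def ask(system_prompt: str, user_prompt: str) -> str:
    """Single chat-completion call shared by all three agents."""
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_prompt},
        ],
    )
    return response.choices[0].message.content

# Placeholder statistics; in practice these come from the data analysis step.
stats_summary = json.dumps({"Temperature": {"mean": 18.2, "std": 3.1}})

generated = ask(
    "You generate synthetic HAB water-quality samples as JSON.",
    f"Generate 10 samples consistent with these statistics: {stats_summary}",
)
feedback = ask(
    "You are a domain expert on harmful algal blooms; flag implausible values.",
    f"Review these samples and provide feedback: {generated}",
)
refined = ask(
    "You are a data scientist; adjust samples to preserve the target statistics.",
    f"Samples: {generated}\nExpert feedback: {feedback}\nReturn corrected JSON.",
)
```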
The CTGAN approach leverages deep learning to generate high-quality synthetic tabular data:
- Mode-specific Normalization: Handles mixed discrete and continuous variables effectively
- Conditional Generation: Employs training-by-sampling technique to balance categorical variables
- Adversarial Training: Uses a generator and discriminator architecture to create realistic data
- Post-processing: Ensures synthetic data maintains domain constraints and statistical properties
The implementation uses the CTGAN model from the CTGAN library, which is specifically designed for generating synthetic tabular data with mixed types. The model architecture includes:
- Generator: Creates synthetic data samples from random noise
- Discriminator: Distinguishes between real and synthetic data
- Training Process: Alternates between training the generator and discriminator
- Conditional Sampling: Generates data conditioned on specific feature values
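A minimal sketch using the CTGAN class from the ctgan package follows; the epoch count, input path, and clipping step are illustrative assumptions.

```python
import pandas as pd
from ctgan import CTGAN

# Hypothetical path to the preprocessed training rows (all columns continuous).
df_train = pd.read_csv("output/train_data.csv")

model = CTGAN(epochs=300)                 # illustrative epoch count
model.fit(df_train, discrete_columns=[])  # no categorical columns in this dataset
synthetic = model.sample(len(df_train))   # generate synthetic rows

# Post-processing: keep values inside the range observed in the original data.
synthetic = synthetic.clip(lower=df_train.min(), upper=df_train.max(), axis=1)
```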
We trained Ridge Regression and Random Forest models using four datasets:
- Original data only
- Original data augmented with Gaussian Copula synthetic data
- Original data augmented with LLM-generated synthetic data
- Original data augmented with CTGAN-generated synthetic data
Models were evaluated using Mean Squared Error (MSE), Root Mean Squared Error (RMSE), Mean Absolute Error (MAE), and R² score on a held-out test set.
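A condensed sketch of this training and evaluation loop with scikit-learn is shown below; the hyperparameters are illustrative, and the augmented training arrays are assumed to come from the preprocessing and generation steps above.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

def evaluate(model, X_train, y_train, X_test, y_test):
    """Fit on the (optionally augmented) training set and score on the held-out test set."""
    model.fit(X_train, y_train)
    preds = model.predict(X_test)
    mse = mean_squared_error(y_test, preds)
    return {"MSE": mse, "RMSE": float(np.sqrt(mse)),
            "MAE": mean_absolute_error(y_test, preds),
            "R2": r2_score(y_test, preds)}

MODELS = {
    "Ridge": Ridge(alpha=1.0),  # illustrative hyperparameters
    "RandomForest": RandomForestRegressor(n_estimators=100, random_state=42),
}

# Usage with placeholder arrays:
# results = {name: evaluate(m, X_train_aug, y_train_aug, X_test, y_test)
#            for name, m in MODELS.items()}
```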
The performance metrics for the four approaches are summarized in the table below:
| Method | MSE | RMSE | MAE | R² | p-value |
|---|---|---|---|---|---|
| Non-Synthetic | 0.0354 | 0.1881 | 0.1310 | 0.7762 | N/A |
| Gaussian Copula | 0.0055 | 0.0739 | 0.0552 | 0.8149 | 0.0012 |
| LLM Multi-Agent | 0.0055 | 0.0740 | 0.0555 | 0.8134 | 0.0015 |
| CTGAN | 0.0053 | 0.0728 | 0.0548 | 0.8155 | 0.0010 |
All three synthetic data generation methods significantly improved model performance compared to using only the original data, with p-values < 0.01 indicating statistical significance. The CTGAN method showed slightly better performance across most metrics, followed closely by the Gaussian Copula method, with the LLM Multi-Agent approach showing comparable results.
The radar chart below visualizes the performance metrics across all three approaches:
The distribution of prediction errors for each method is shown in the violin plot below:
The correlation heatmaps below show the feature relationships in the original and synthetic datasets:
The synthetic data generation methods successfully preserved the correlation structure of the original data, with all three methods showing similar correlation patterns.
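For reference, a short sketch of how such a side-by-side correlation comparison can be produced with pandas and seaborn; the file paths are placeholders, not the actual outputs of the pipeline.

```python
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Placeholder paths for the original and synthetic datasets.
datasets = {
    "Original": pd.read_csv("output/train_data.csv"),
    "Gaussian Copula": pd.read_csv("synthetic_data/copula.csv"),
    "LLM Multi-Agent": pd.read_csv("synthetic_data/llm.csv"),
    "CTGAN": pd.read_csv("synthetic_data/ctgan.csv"),
}

fig, axes = plt.subplots(1, len(datasets), figsize=(5 * len(datasets), 4))
for ax, (name, df) in zip(axes, datasets.items()):
    sns.heatmap(df.corr(), annot=True, vmin=-1, vmax=1, cmap="coolwarm", ax=ax)
    ax.set_title(name)
fig.tight_layout()
fig.savefig("figures/correlation_heatmaps.png")
```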
The computational costs for the three synthetic data generation methods are summarized in the table below:
| Method | Execution Time (s) | Memory Usage (MB) | API Calls | API Cost ($) |
|---|---|---|---|---|
| Gaussian Copula | 45.2 | 128.5 | N/A | N/A |
| LLM Multi-Agent | 120.5 | 156.2 | 150 | 0.75 |
| CTGAN | 78.3 | 142.8 | N/A | N/A |
The parallel coordinates plot below visualizes the computational costs:
The LLM Multi-Agent approach required significantly more computational resources than the Gaussian Copula method, with roughly 2.7x the execution time and about 1.2x the memory usage. The CTGAN method falls between the two, with moderate computational requirements and no API costs. The LLM approach also incurred API costs due to its use of OpenAI's GPT-4o model.
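As an aside, execution time and peak memory for each generator can be measured with a small helper like the one below; this is a minimal illustration, not the actual utils/cost_tracker.py implementation.

```python
import time
import tracemalloc

def measure(fn, *args, **kwargs):
    """Run fn and return (result, elapsed seconds, peak memory in MB)."""
    tracemalloc.start()
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    elapsed = time.perf_counter() - start
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return result, elapsed, peak / (1024 * 1024)

# Usage: synthetic, seconds, peak_mb = measure(copula.sample, 500)
```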
The close performance between the three synthetic data generation approaches is noteworthy. Despite using fundamentally different methodologies, all three methods achieved similar performance improvements over the baseline. This suggests that:
- Statistical Sufficiency: For this particular HAB detection task, the statistical properties captured by the Gaussian Copula may be sufficient to generate useful synthetic data.
- LLM Knowledge Integration: The LLM-based approach successfully incorporated domain knowledge through its multi-agent architecture, achieving comparable performance to the statistical method.
- Deep Learning Advantage: The CTGAN method's slightly superior performance suggests that adversarial training can capture subtle patterns in the data that may be missed by purely statistical approaches.
- Diminishing Returns: There may be a performance ceiling for synthetic data augmentation in this specific task, which all three methods approached.
The slight performance advantage of the CTGAN method could be attributed to its ability to model complex non-linear relationships through its neural network architecture, which may capture subtle patterns in the HAB data.
The computational cost comparison reveals significant trade-offs between the three approaches:
- Execution Efficiency: The Gaussian Copula method is substantially more efficient in terms of execution time, making it more suitable for scenarios with limited computational resources or time constraints.
- API Dependency: The LLM Multi-Agent approach requires external API calls, introducing both cost considerations and potential availability concerns.
- Scalability: For larger datasets, the efficiency gap between the methods would likely widen, favoring the Gaussian Copula approach for large-scale applications, while the CTGAN method would offer a good balance between performance and computational cost.
- Hardware Requirements: The CTGAN method may benefit from GPU acceleration, which could significantly reduce its execution time but would require specialized hardware.
Several methodological aspects warrant discussion:
- Domain Knowledge Integration: The LLM Multi-Agent approach offers a framework for incorporating domain expertise that extends beyond statistical patterns. This could be particularly valuable for domains with complex, non-linear relationships or where domain-specific constraints are critical.
- Adaptability: The LLM-based approach may be more adaptable to different data types and domains without requiring extensive statistical modeling expertise, while the CTGAN method offers flexibility in modeling complex distributions.
- Transparency: The statistical approach offers greater transparency in how synthetic data is generated, which may be important for certain applications where explainability is crucial. The CTGAN method, like many deep learning approaches, suffers from lower interpretability.
This study demonstrates that synthetic data augmentation can significantly improve the performance of machine learning models for harmful algal bloom detection. All three approaches—Gaussian Copula, LLM Multi-Agent, and CTGAN—proved effective, with similar performance improvements over the baseline.
The choice between these methods involves trade-offs between computational efficiency, flexibility, and performance. The Gaussian Copula method offers an efficient solution with good performance metrics, making it suitable for resource-constrained environments. The LLM Multi-Agent approach provides a flexible framework for incorporating domain knowledge, potentially offering advantages for complex domains or when statistical modeling expertise is limited. The CTGAN method delivers slightly superior performance metrics at the cost of increased computational complexity, making it ideal for applications where model performance is the primary concern.
Future work could explore:
- Hybrid Approaches: Combining statistical methods with deep learning and LLM-based approaches to leverage the strengths of all three.
- Extended Validation: Testing these methods on larger and more diverse HAB datasets to assess generalizability.
- Advanced LLM Integration: Exploring more sophisticated LLM architectures and prompting strategies to further enhance synthetic data quality.
- GAN Architecture Optimization: Investigating specialized GAN architectures tailored specifically for environmental time-series data.
- Real-world Deployment: Evaluating the practical impact of these synthetic data augmentation methods in operational HAB detection systems.
HAB-Augmentation-Comparison/
├── README.md # Project documentation
├── Dataset.xlsx # Original HAB detection dataset
├── requirements.txt # Python dependencies
├── .env # Environment variables (OpenAI API key)
├── preprocess_basic.py # Basic data preprocessing
├── preprocess_synthetic.py # Gaussian Copula synthetic data generation
├── preprocess_llm_synthetic.py # LLM multi-agent synthetic data generation
├── preprocess_gan_synthetic.py # CTGAN synthetic data generation
├── train_with_llm.py # Model training and evaluation
├── run_comparison.py # Pipeline for method comparison
├── run_full_pipeline.py # Complete pipeline with visualizations
├── generate_tables.py # Generate tables
├── generate_visualizations.py # Generate advanced visualizations
├── base_model/ # Base model files
├── synthetic_data/ # Synthetic data files
├── synthetic_models/ # Models trained with synthetic data
├── evaluation/ # Evaluation scripts
│ ├── CV_eval.py # Cross-validation evaluation
│ ├── percent_error_eval.py # Percent error evaluation
│ └── values_eval.py # Model metrics evaluation
├── figures/ # Generated visualizations
├── tables/ # Generated tables
├── models/ # Trained model files
├── output/ # Processed data and scalers
├── utils/ # Utility functions
│ └── cost_tracker.py # Track computational costs
├── scripts/ # Utility scripts
│ ├── analyze_dataset.py # Dataset analysis
│ └── view_image.py # Image viewing utility
└── archive/ # Archived files
- Clone the repository:
  git clone https://github.com/Tonyhrule/HAB-Augmentation-Comparison.git
  cd HAB-Augmentation-Comparison
- Install dependencies:
  pip install -r requirements.txt
- Set up environment variables: create a .env file in the root directory with your OpenAI API key:
  OPENAI_API_KEY=your_api_key_here
To run the complete pipeline including data preprocessing, synthetic data generation, model training, and visualization:
python run_full_pipeline.py
To run only the comparison between different synthetic data generation methods:
python run_comparison.py
You can also run individual components of the pipeline:
- Basic preprocessing:
  python preprocess_basic.py
- Gaussian Copula synthetic data generation:
  python preprocess_synthetic.py
- LLM multi-agent synthetic data generation:
  python preprocess_llm_synthetic.py
- CTGAN synthetic data generation:
  python preprocess_gan_synthetic.py
- Model training and evaluation:
  python train_with_llm.py
- Generate tables:
  python generate_tables.py
- Generate advanced visualizations:
  python generate_visualizations.py
- Harmful Algal Blooms: A Scientific Summary for Policy Makers. IOC/UNESCO, Paris (IOC/INF-1320).
- Xu, Y., & Goodacre, R. (2018). On Splitting Training and Validation Set: A Comparative Study of Cross-Validation, Bootstrap and Systematic Sampling for Estimating the Generalization Performance of Supervised Learning. Journal of Analysis and Testing, 2(3), 249-262.
- Patki, N., Wedge, R., & Veeramachaneni, K. (2016). The Synthetic Data Vault. In 2016 IEEE International Conference on Data Science and Advanced Analytics (DSAA) (pp. 399-410).
- Brown, T. B., Mann, B., Ryder, N., Subbiah, M., Kaplan, J., Dhariwal, P., ... & Amodei, D. (2020). Language Models are Few-Shot Learners. arXiv preprint arXiv:2005.14165.
- Zhao, S., Xie, X., & Chen, S. (2021). Copula-Based Synthetic Data Generation for Machine Learning Emulators in Weather and Climate Modeling. Geoscientific Model Development, 14(7), 4641-4654.
- Xu, L., Skoularidou, M., Cuesta-Infante, A., & Veeramachaneni, K. (2019). Modeling Tabular Data using Conditional GAN. Advances in Neural Information Processing Systems, 32.
A Streamlit app is provided for interactive exploration and comparison of the synthetic data augmentation methods.
streamlit run app.py
This project is licensed under the Apache License 2.0. See the LICENSE file for details.