The Synthetic Data Generator is a lightweight tool that creates realistic, customizable datasets for training, testing, and demonstration purposes without exposing sensitive information. By simulating real-world data structures, the tool helps researchers, trainers, and developers practice analysis workflows, prototype dashboards, or showcase tools while protecting respondent confidentiality.
This project reduces risks around handling personal data while increasing efficiency in research support, training, and capacity-building activities.
- Customizable Data Generation: Control the number of rows and columns
- Statistical Control: Choose distributions (Normal, Uniform, Exponential, Lognormal) for numeric variables
- Correlation Management: Enable and control correlation between numeric variables
- Missing Data: Adjust missing data percentage for realistic datasets
- Personal Information: Include realistic fake data (names, emails, addresses, phone numbers, etc.)
- Multiple Export Formats: Download data as CSV, Excel, or Stata DTA files
- Data Preview: Visualize correlations and data quality metrics
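The statistical features above (distribution choice, correlation, and missing data) can be illustrated with a minimal sketch. It uses only the Python standard library so it runs without the project's dependencies; it is not the application's actual code, and every name in it (`SAMPLERS`, `numeric_column`, `correlated_normal_pair`, `inject_missing`, `pearson`) is hypothetical. The distribution parameters are arbitrary defaults, and the correlation step covers only a pair of standard-normal columns.

```python
import math
import random

# Hypothetical sampler table: each entry draws one value from the named
# distribution using a random.Random instance. Parameters are arbitrary.
SAMPLERS = {
    "normal": lambda rng: rng.gauss(0.0, 1.0),
    "uniform": lambda rng: rng.uniform(0.0, 1.0),
    "exponential": lambda rng: rng.expovariate(1.0),
    "lognormal": lambda rng: rng.lognormvariate(0.0, 1.0),
}


def numeric_column(distribution, n_rows, rng):
    """Draw n_rows values from one of the distributions in SAMPLERS."""
    sampler = SAMPLERS[distribution]
    return [sampler(rng) for _ in range(n_rows)]


def correlated_normal_pair(n_rows, rho, rng):
    """Return two standard-normal columns whose expected correlation is rho."""
    xs, ys = [], []
    for _ in range(n_rows):
        z1 = rng.gauss(0.0, 1.0)
        z2 = rng.gauss(0.0, 1.0)
        xs.append(z1)
        # Mix z1 with independent noise so that corr(x, y) is rho in expectation.
        ys.append(rho * z1 + math.sqrt(1.0 - rho ** 2) * z2)
    return xs, ys


def inject_missing(values, missing_pct, rng):
    """Replace roughly missing_pct percent of the values with None."""
    threshold = missing_pct / 100.0
    return [None if rng.random() < threshold else v for v in values]


def pearson(xs, ys):
    """Sample Pearson correlation of two equal-length numeric lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)


if __name__ == "__main__":
    rng = random.Random(42)  # fixed seed so repeated runs give the same data
    income = numeric_column("lognormal", 1000, rng)
    x, y = correlated_normal_pair(1000, 0.7, rng)
    y_with_gaps = inject_missing(y, 10, rng)
    print("sample correlation:", round(pearson(x, y), 3))
    print("missing cells:", sum(v is None for v in y_with_gaps))
```

Passing an explicit `random.Random` instance keeps the sketch reproducible, which is convenient when the same synthetic dataset is needed across training sessions.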
Clone the repository:
git clone <your-repo-url>
cd Synthetic-Data-Generator
Install required packages:
pip install -r requirements.txt
Run the application:
streamlit run fake_data_generator.py
The application requires the following Python packages:
- streamlit: Web application framework
- faker: Fake data generation
- pandas: Data manipulation and analysis
- numpy: Numerical computing
- scipy: Scientific computing
- openpyxl: Excel file support
- matplotlib: Data visualization
- Configure Parameters: Use the sidebar to set:
  - Number of rows and columns
  - Personal information fields to include
  - Missing data percentage
  - Variable distributions and correlations
- Generate Data: Click the "Generate Data" button
- Preview: Review the data in the "Preview Data" tab
- Export: Download your dataset in CSV, Excel, or Stata format
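After exporting, it can be useful to confirm that the file reflects the missing-data percentage chosen in the sidebar. The sketch below computes the share of empty cells per column in CSV text using only the standard library. It assumes missing values are exported as empty cells, which is not confirmed by this README; the function name `missing_share_by_column` and the sample column names are hypothetical.

```python
import csv
import io


def missing_share_by_column(csv_text):
    """Return {column name: fraction of rows whose cell is empty}."""
    reader = csv.DictReader(io.StringIO(csv_text))
    counts = {name: 0 for name in (reader.fieldnames or [])}
    total = 0
    for row in reader:
        total += 1
        for name in counts:
            # Assumption: a missing value appears as an empty (or blank) cell.
            if (row.get(name) or "").strip() == "":
                counts[name] += 1
    return {name: (c / total if total else 0.0) for name, c in counts.items()}


if __name__ == "__main__":
    # Tiny hand-written sample standing in for an exported file.
    sample = "id,age,income\n1,34,52000\n2,,48000\n3,29,\n4,,\n"
    for column, share in missing_share_by_column(sample).items():
        print(f"{column}: {share:.0%} missing")
```

To check a real export, read the file with `open(path, newline="", encoding="utf-8").read()` and pass the text to the function.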
Synthetic-Data-Generator/
├── fake_data_generator.py # Main application file
├── requirements.txt # Python dependencies
├── README.md # Project documentation
└── run.bat # Windows batch file for easy execution
For easy execution on Windows, use the provided run.bat file:
- Double-click run.bat to start the application
- The application will open in your default web browser
You can easily modify the application to:
- Add new distribution types
- Include additional personal information fields
- Change correlation methods
- Add new export formats
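As an illustration of the first customization, one possible way to make distributions pluggable is a name-to-sampler registry. This is a sketch under the assumption that distributions are looked up by name; how fake_data_generator.py actually organizes them is not described here, and `DISTRIBUTIONS`, `register_distribution`, and `draw` are hypothetical names.

```python
import random

# Hypothetical registry mapping a display name to a function that draws
# one value from a random.Random instance.
DISTRIBUTIONS = {
    "Normal": lambda rng: rng.gauss(0.0, 1.0),
    "Uniform": lambda rng: rng.uniform(0.0, 1.0),
    "Exponential": lambda rng: rng.expovariate(1.0),
    "Lognormal": lambda rng: rng.lognormvariate(0.0, 1.0),
}


def register_distribution(name, sampler):
    """Add a new distribution, refusing to overwrite an existing one."""
    if name in DISTRIBUTIONS:
        raise ValueError(f"distribution {name!r} is already registered")
    DISTRIBUTIONS[name] = sampler


def draw(name, n_rows, rng):
    """Draw n_rows values from the registered distribution called name."""
    sampler = DISTRIBUTIONS[name]
    return [sampler(rng) for _ in range(n_rows)]


# Example extension: a Gamma distribution with shape 2 and scale 1.
register_distribution("Gamma", lambda rng: rng.gammavariate(2.0, 1.0))

if __name__ == "__main__":
    rng = random.Random(7)
    print(draw("Gamma", 5, rng))
```

With a registry like this, a sidebar selector could list `DISTRIBUTIONS.keys()` so that newly registered types appear without further changes.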
- Fork the repository
- Create a feature branch
- Make your changes
- Submit a pull request
This project is open source and available under the MIT License.
If you encounter any issues:
- Check that all dependencies are installed
- Ensure your Python environment is properly configured
- Verify that the directory containing the streamlit command (typically your Python environment's Scripts or bin directory) is in your system PATH
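The dependency check can be scripted. The sketch below reports which of the packages listed in this README cannot be found by the current Python interpreter, using the standard library's `importlib.util.find_spec`. The helper name `missing_packages` is hypothetical, and the script only checks that each package is importable, not that its version matches requirements.txt.

```python
import importlib.util

# Import names of the packages listed in this README.
REQUIRED = [
    "streamlit",
    "faker",
    "pandas",
    "numpy",
    "scipy",
    "openpyxl",
    "matplotlib",
]


def missing_packages(names=REQUIRED):
    """Return the names that the current interpreter cannot locate."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_packages()
    if missing:
        print("Missing packages:", ", ".join(missing))
        print("Try: pip install -r requirements.txt")
    else:
        print("All required packages were found.")
```

Run it with the same interpreter that launches the application; a package installed in a different environment will still show up as missing.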
- Built with Streamlit
- Uses Faker for realistic fake data generation
- Pandas for data manipulation and export capabilities
Note: This tool generates synthetic data for testing and development purposes only. Always ensure compliance with data protection regulations when working with personal information.