This project provides a tool to obfuscate personally identifiable information (PII) in files stored in AWS S3. It supports CSV, JSON, and Parquet formats.
- Automatically detects file types (.csv, .json, .parquet)
- Obfuscates specified PII fields
- Fetches files from AWS S3
- Outputs obfuscated files as byte streams
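The first and third features can be pictured with a minimal sketch. It is not the project's actual code: `split_s3_uri`, `detect_format`, and `SUPPORTED_FORMATS` are hypothetical names, and the assumption is that the format is inferred purely from the object key's extension.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse

# Extensions named in the feature list, mapped to a format label.
SUPPORTED_FORMATS = {".csv": "csv", ".json": "json", ".parquet": "parquet"}


def split_s3_uri(s3_uri: str) -> tuple[str, str]:
    """Split 's3://bucket/key' into (bucket, key). Hypothetical helper."""
    parsed = urlparse(s3_uri)
    if parsed.scheme != "s3" or not parsed.netloc or not parsed.path.strip("/"):
        raise ValueError(f"Not a valid S3 URI: {s3_uri!r}")
    return parsed.netloc, parsed.path.lstrip("/")


def detect_format(key: str) -> str:
    """Infer the file format from the object key's extension (case-insensitive)."""
    suffix = PurePosixPath(key).suffix.lower()
    try:
        return SUPPORTED_FORMATS[suffix]
    except KeyError:
        raise ValueError(f"Unsupported file type: {suffix or '(none)'}") from None


bucket, key = split_s3_uri("s3://my-bucket/my-file.csv")
print(bucket, key, detect_format(key))  # my-bucket my-file.csv csv
```

The resulting bucket and key are the two values an S3 client needs to fetch the object; the download step is left out so that the sketch runs without AWS credentials.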
Install the package and its dependencies from the project root:
pip install .
For development (editable) mode:
pip install -e .
Once installed, you can use this tool via the command line:
gdpr-obfuscate --s3-uri s3://my-bucket/my-file.csv --pii-fields name,email
Arguments:
- --s3-uri – S3 location of the file to be obfuscated
- --pii-fields – Comma-separated list of PII fields to obfuscate
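As a rough illustration of how these two arguments could be parsed, here is a sketch using the standard-library `argparse` module. It assumes both flags are required and that whitespace around field names is ignored; `parse_pii_fields` and `build_parser` are hypothetical names, and the real CLI may be wired differently.

```python
import argparse


def parse_pii_fields(value: str) -> list[str]:
    """Turn 'name, email' into ['name', 'email'], dropping empty entries."""
    fields = [field.strip() for field in value.split(",") if field.strip()]
    if not fields:
        raise argparse.ArgumentTypeError("expected at least one field name")
    return fields


def build_parser() -> argparse.ArgumentParser:
    """Build a parser that mirrors the two documented flags."""
    parser = argparse.ArgumentParser(prog="gdpr-obfuscate")
    parser.add_argument(
        "--s3-uri",
        required=True,  # assumption: the tool cannot run without a source file
        help="S3 location of the file to be obfuscated",
    )
    parser.add_argument(
        "--pii-fields",
        required=True,  # assumption: at least one field must be named
        type=parse_pii_fields,
        help="Comma-separated list of PII fields to obfuscate",
    )
    return parser


# Parse an explicit argument list instead of sys.argv so the sketch is runnable as-is.
args = build_parser().parse_args(
    ["--s3-uri", "s3://my-bucket/my-file.csv", "--pii-fields", "name,email"]
)
event = {"file_to_obfuscate": args.s3_uri, "pii_fields": args.pii_fields}
print(event)
```

The `event` dictionary built at the end has the same shape as the one passed to `obfuscate_file` in the Python usage example.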
By default, the obfuscated data is printed to stdout. If the file is written back to S3, check the output location in the S3 bucket.
To save locally, redirect the output:
gdpr-obfuscate --s3-uri s3://my-bucket/my-file.csv --pii-fields name,email > obfuscated_output.csv
You can also import and use it within a Python script:
from gdpr_obfuscator.main import obfuscate_file
event = {
"file_to_obfuscate": "s3://my-bucket/my-file.csv",
"pii_fields": ["name", "email"]
}
obfuscated_file = obfuscate_file(event)
print(obfuscated_file)
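To make the expected result more concrete, the sketch below masks the named columns of an in-memory CSV and returns bytes, in line with the byte-stream output listed in the features. It is an illustration only: `obfuscate_csv_bytes` is a hypothetical function, and the `***` mask is an assumed replacement value, not necessarily what `obfuscate_file` produces.

```python
import csv
import io

MASK = "***"  # Assumed replacement value; the tool's real mask may differ.


def obfuscate_csv_bytes(data: bytes, pii_fields: list[str]) -> bytes:
    """Return CSV bytes with every value in the pii_fields columns masked."""
    reader = csv.DictReader(io.StringIO(data.decode("utf-8")))
    fieldnames = reader.fieldnames or []
    output = io.StringIO()
    writer = csv.DictWriter(output, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    for row in reader:
        for field in pii_fields:
            # Fields that are not columns in the file are silently skipped.
            if field in row:
                row[field] = MASK
        writer.writerow(row)
    return output.getvalue().encode("utf-8")


sample = b"id,name,email\n1,Ada,ada@example.com\n2,Alan,alan@example.com\n"
result = obfuscate_csv_bytes(sample, ["name", "email"])
print(result.decode("utf-8"))
```

With the sample input, the `id` column is left untouched while every `name` and `email` value is replaced by the mask. JSON and Parquet files would need their own readers and writers, which this sketch does not cover.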
- Clone the repository:
git clone https://github.com/your-repo/gdpr-obfuscator-nc-postgrad-project.git
cd gdpr-obfuscator-nc-postgrad-project
- Create a virtual environment and activate it:
make create-environment
- Install dependencies:
make requirements
Run the test suite with:
make unit-test
To use this as a library module in other projects, install it via:
pip install .
Then import it in your scripts:
from gdpr_obfuscator.main import obfuscate_file
If running directly from the project without installation, set the PYTHONPATH:
export PYTHONPATH=$(pwd)/src
The Makefile provides useful automation:
- Create virtual environment:
make create-environment
- Install dependencies:
make requirements
- Set up development tools (bandit, safety, black, coverage):
make dev-setup
- Run safety scan:
make safety-scan
- Run bandit security check:
make run-bandit
- Run black code formatting:
make run-black
- Run unit tests:
make unit-test
- Run coverage check:
make check-coverage
- Run all checks (bandit, black, coverage):
make run-checks
- Clean up environment:
make clean
gdpr-obfuscator-nc-postgrad-project/
│── src/
│ ├── __init__.py # Package initializer
│ ├── main.py # Contains `obfuscate_file` function and CLI entry point
│ ├── utils/ # Contains helper functions (e.g., S3 operations, parsing, obfuscation)
│── tests/ # Unit tests
│── requirements.in # Dependency definitions
│── requirements.txt # Compiled dependencies
│── setup.py # Installation script
│── Makefile # Automation commands
│── README.md # Documentation
- Fork the repository and create a feature branch.
- Ensure code quality with `make run-checks` and `make unit-test`.
- Submit a pull request with a clear description.
This project is licensed under the MIT License.