GDPR Obfuscation tool that can be integrated as a library module into a Python codebase.
This is a general-purpose Python tool that processes data being ingested into AWS and intercepts personally identifiable information (PII). All information stored by a company's data projects should be for bulk data analysis only. Consequently, GDPR requires that any data that can be used to identify an individual is anonymised.
This obfuscation tool can be integrated as a library module into a Python codebase.
It is expected that the tool will be deployed within the AWS account.
It is expected that the code will use the AWS SDK for Python (boto3).
It is expected that the code will use PyArrow when handling Parquet data.
The library is suitable for deployment on a platform within the AWS ecosystem, such as EC2, ECS, or Lambda.
Ensure you have the latest Python version installed.
Local run
pip install -r ./requirements.txt
or clone the repo and run
make requirements
The code is tested with pytest, with 100% test coverage.
See tests for more details.
The code is written in Python and is PEP 8 compliant (checked with flake8).
It is also checked for security vulnerabilities:
dependency vulnerabilities with safety,
security issues with bandit.
Data is stored in CSV, JSON, or Parquet format in S3.
This tool uses the following external Python libraries:
boto3 for managing AWS resources
botocore for error handling within the AWS environment
PyArrow for Parquet data handling
Fields containing GDPR-sensitive data are known and will be supplied in advance (see Usage).
Data records will be supplied with a primary key.
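To make the assumptions above concrete, here is a minimal sketch of what obfuscating known PII fields in CSV data could look like. It is not the library's implementation: the `mask_csv` helper, the sample columns, and the `***` replacement value are all assumptions chosen for illustration, and it uses only the standard library rather than boto3 or PyArrow.

```python
import csv
import io

# Assumption: the real tool may use a different replacement value.
MASK = "***"


def mask_csv(text: str, pii_fields: list[str]) -> str:
    """Return CSV text with every value in the named columns replaced by MASK."""
    reader = csv.DictReader(io.StringIO(text))
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=reader.fieldnames, lineterminator="\n")
    writer.writeheader()
    for row in reader:
        for field in pii_fields:
            # Columns that are not present in the data are simply ignored.
            if field in row:
                row[field] = MASK
        writer.writerow(row)
    return out.getvalue()


# Hypothetical sample data: "id" plays the role of the primary key and is kept.
sample = "id,name,surname,course\n1,Ada,Lovelace,Maths\n2,Alan,Turing,Logic\n"
masked = mask_csv(sample, ["name", "surname"])
print(masked)
```

The primary key and all non-PII columns pass through unchanged, so the masked records remain usable for bulk analysis.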
Install with pip from the pip branch:
pip install "git+https://github.com/mirkovicUK/GDPR-Obfuscator.git@pip"
Import:
from gdpr.obfuscator import gdpr_obfuscator
Alternatively clone the repo:
git clone https://github.com/mirkovicUK/GDPR-Obfuscator.git
Import:
from src.gdpr_obfuscator import gdpr_obfuscator
The tool should be invoked by sending a JSON string containing:
the S3 location of the CSV, JSON, or Parquet file to be obfuscated
and the names of the fields to be obfuscated.
JSON string format:
{
"file_to_obfuscate": "s3://bucket_name/path_to_data/file.csv",
"pii_fields": ["name", "surname", "other_fields_to_mask"]
}
masked_data = gdpr_obfuscator(json_string)  # json_string: a str in the format above
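As a rough sketch of preparing that input, the request string can be built with `json.dumps` instead of being written by hand. The `build_request` and `split_s3_uri` helpers below are hypothetical and not part of the library; the second only illustrates how an `s3://` location breaks down into a bucket and an object key.

```python
import json
from urllib.parse import urlparse


def build_request(s3_uri: str, pii_fields: list[str]) -> str:
    """Serialise the two documented keys into a JSON string."""
    return json.dumps({"file_to_obfuscate": s3_uri, "pii_fields": pii_fields})


def split_s3_uri(s3_uri: str) -> tuple[str, str]:
    """Split 's3://bucket/key' into (bucket, key), rejecting other shapes."""
    parsed = urlparse(s3_uri)
    key = parsed.path.lstrip("/")
    if parsed.scheme != "s3" or not parsed.netloc or not key:
        raise ValueError(f"not a valid S3 location: {s3_uri!r}")
    return parsed.netloc, key


request = build_request("s3://bucket_name/path_to_data/file.csv", ["name", "surname"])
bucket, key = split_s3_uri(json.loads(request)["file_to_obfuscate"])
print(bucket, key)

# With the package installed and AWS access configured, the call would be:
# masked_data = gdpr_obfuscator(request)
```

Building the string programmatically avoids quoting mistakes, and validating the location up front gives a clearer error than a failed S3 request.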
The following example creates AWS resources (S3) and uploads some data for testing.
It is designed to clean up all resources after execution and to work within the AWS Free Tier.
The example expects AWS credentials in a Python .env file, as follows:
bucket='unique bucket name'  # mandatory
aws_access_key_id='Your account access key'  # optional
aws_secret_access_key='Your account secret access key'  # optional
region_name='region_name'  # mandatory
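One way these settings could be turned into an S3 client is sketched below. It assumes the .env values have already been loaded into the process environment (for example with a dotenv loader; which loader the example uses is not stated here). The helper names `s3_client_kwargs` and `make_s3_client` are hypothetical. Optional credentials are passed only when present, so boto3 can otherwise fall back to its default credential chain.

```python
import os
from collections.abc import Mapping


def s3_client_kwargs(env: Mapping[str, str] = os.environ) -> dict:
    """Collect boto3 client arguments from the settings listed above."""
    # region_name is mandatory, so a missing value raises KeyError.
    kwargs = {"region_name": env["region_name"]}
    # The access keys are optional: include them only when they are set.
    for name in ("aws_access_key_id", "aws_secret_access_key"):
        if env.get(name):
            kwargs[name] = env[name]
    return kwargs


def make_s3_client(env: Mapping[str, str] = os.environ):
    """Create an S3 client; boto3 is imported lazily so the helper above has no dependencies."""
    import boto3

    return boto3.client("s3", **s3_client_kwargs(env))


# Hypothetical values, standing in for a loaded .env file.
example_env = {"bucket": "unique-bucket-name", "region_name": "eu-west-2"}
print(s3_client_kwargs(example_env))
```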