
mirkovicUK/GDPR-Obfuscator


GDPR-Obfuscator

GDPR Obfuscation tool that can be integrated as a library module into a Python codebase.

About

This is a general-purpose Python tool that processes data being ingested into AWS and intercepts personally identifiable information (PII). All information stored by a company's data projects should be for bulk data analysis only. Consequently, GDPR requires that all data containing information that can be used to identify an individual be anonymised.

This obfuscation tool can be integrated as a library module into a Python codebase.

It is expected that the tool will be deployed within the AWS account.

It is expected that the code will use the AWS SDK for Python (boto3).

It is expected that the code will use PyArrow when handling Parquet data.

The library is suitable for deployment on a platform within the AWS ecosystem, such as EC2, ECS, or Lambda.


Requirements

Ensure you have the latest version of Python installed.

Local run

pip install -r ./requirements.txt

Or clone the repo and run:

make requirements


Tests_and_Coverage

Code is tested with pytest, with 100% test coverage.
See tests for more details.


PEP8_and_security

Code is written in Python and is PEP8 compliant, checked with flake8.
It is also tested for security vulnerabilities:
dependency vulnerabilities with safety, security issues with bandit.


Assumptions_and_Prerequisites

  1. Data is stored in CSV, JSON, or Parquet format in S3.
    This tool uses external Python libraries:
     :Boto3 for managing AWS resources
     :Botocore for error handling, available within the AWS environment
     :PyArrow for Parquet data handling

  2. Fields containing GDPR-sensitive data are known and will be supplied in advance; see Usage.

  3. Data records will be supplied with a primary key.
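
To make these assumptions concrete, here is a minimal sketch of masking named fields in CSV data using only the standard library. It is not the library's actual implementation: the helper name `mask_csv_fields` and the replacement value `***` are assumptions chosen for illustration, and the sample records are invented.

```python
import csv
import io

MASK = "***"  # assumption: the real tool may use a different replacement value


def mask_csv_fields(csv_text: str, pii_fields: list[str]) -> str:
    """Return csv_text with every value in the named columns replaced by MASK."""
    reader = csv.DictReader(io.StringIO(csv_text))
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=reader.fieldnames, lineterminator="\n")
    writer.writeheader()
    for row in reader:
        for field in pii_fields:
            if field in row:  # silently skip fields the file does not contain
                row[field] = MASK
        writer.writerow(row)
    return out.getvalue()


# Invented sample data: "id" plays the role of the primary key and is left untouched.
sample = "id,name,surname,course\n1,Ada,Lovelace,Maths\n2,Alan,Turing,Logic\n"
masked = mask_csv_fields(sample, ["name", "surname"])
print(masked)
```

Keeping the primary key and the non-sensitive columns intact is what leaves the masked records usable for bulk analysis.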


Usage

Install with pip from the pip branch:

pip install "git+https://github.com/mirkovicUK/GDPR-Obfuscator.git@pip"

Imports

from gdpr.obfuscator import gdpr_obfuscator

Alternatively clone the repo:

git clone https://github.com/mirkovicUK/GDPR-Obfuscator.git

Import:

from src.gdpr_obfuscator import gdpr_obfuscator


The tool is invoked by passing a JSON string containing:
the S3 location of the CSV, JSON, or Parquet file to obfuscate
and the names of the fields to be obfuscated.

JSON string format:
{
"file_to_obfuscate": "s3://bucket_name/path_to_data/file.csv",
"pii_fields": ["name", "surname", "other_fields_to_mask"]
}

masked_data = gdpr_obfuscator(json_string)  # json_string: str, in the format above
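
As a small illustration, the request string can be built with `json.dumps` instead of being written by hand. The sketch below also splits the S3 location into a bucket and a key, which is the form boto3 calls generally expect; whether the library parses the URI this way internally is an assumption. The call to `gdpr_obfuscator` is left commented out because it needs the installed library and AWS access.

```python
import json
from urllib.parse import urlparse

# Build the request in the documented format.
request = json.dumps(
    {
        "file_to_obfuscate": "s3://bucket_name/path_to_data/file.csv",
        "pii_fields": ["name", "surname"],
    }
)

# Split the S3 URI into bucket and key (assumed approach, for illustration only).
parsed = urlparse(json.loads(request)["file_to_obfuscate"])
bucket, key = parsed.netloc, parsed.path.lstrip("/")
print(bucket, key)

# masked_data = gdpr_obfuscator(request)  # requires the library and AWS credentials
```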

Example:

The following example creates an S3 bucket and uploads some test data.
It is designed to clean up all resources after execution and to work within the AWS Free Tier.

The example expects AWS credentials in a Python .env file, as follows:

bucket='unique bucket name' : mandatory
aws_access_key_id='Your account access key' : optional
aws_secret_access_key='Your account secret access key' : optional
region_name='region_name' : mandatory
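
The mandatory and optional settings above can be checked before any AWS call is made. The following is a minimal sketch using only the standard library; `parse_env` and `check_settings` are hypothetical helpers, not part of this repository, and how the example script actually loads the .env file is not shown here. The sketch assumes the file holds plain `key='value'` lines without the `: mandatory` / `: optional` annotations.

```python
REQUIRED = ("bucket", "region_name")  # marked mandatory in the list above


def parse_env(text: str) -> dict[str, str]:
    """Parse simple key='value' lines, ignoring blanks and # comments."""
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        settings[key.strip()] = value.strip().strip("'\"")
    return settings


def check_settings(settings: dict[str, str]) -> dict[str, str]:
    """Raise ValueError if a mandatory setting is missing or empty."""
    missing = [name for name in REQUIRED if not settings.get(name)]
    if missing:
        raise ValueError("missing mandatory settings: " + ", ".join(missing))
    return settings


# Invented values for illustration.
env_text = "bucket='my-unique-bucket'\nregion_name = 'eu-west-2'\n"
settings = check_settings(parse_env(env_text))
print(settings)
```

When the two access-key entries are omitted, boto3 normally falls back to its default credential chain (environment variables, shared credentials file, or an attached role), which is presumably why they are optional.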
