This repository contains the code and data for the text re-identification attack (TRIA) and the resulting re-identification risk metric (TRIR) presented in B. Manzanares-Salor, D. Sánchez, P. Lison, Evaluating the disclosure risk of anonymized documents via a machine learning-based re-identification attack, Data Mining and Knowledge Discovery, 2024.
@article{manzanaressalor-etal-tri-2024,
title = {Evaluating the disclosure risk of anonymized documents via a machine learning-based re-identification attack},
journal = {Data Mining and Knowledge Discovery},
ISSN = {1573-756X},
year = {2024},
volume = {(In press)},
author = {Manzanares-Salor, Benet and S\'anchez, David and Lison, Pierre}
}
A first version of this project was presented in B. Manzanares-Salor, D. Sánchez, P. Lison, Automatic Evaluation of Disclosure Risks of Text Anonymization Methods, Privacy in Statistical Databases, 2022.
Experimental data was extracted from the automatic_text_anonymization repository, corresponding to the publication F. Hassan, D. Sanchez, J. Domingo-Ferrer, Utility-Preserving Privacy Protection of Textual Documents via Word Embeddings, IEEE Transactions on Knowledge and Data Engineering, 2021. The exact data files utilized in the experiments are located in the data folder.
Text Re-Identification (TRI)
│ README.md # This README
│ tri.py # Python program including the TRI class, which can be executed as CLI
│ requirements.txt # File generated with Conda containing all the required Python packages with specific versions
│ tri_workflow.png # Diagram of the TRI class workflow used later in this README
│ config.json # Example configuration
└───data # Folder with data files
│ WikiActors_50_eval.json # Pandas dataframe with 50 protected individuals and background knowledge for the same 50
│ WikiActors_50_eval_25%BK.json # Same as previous dataframe but with only the first 25% of each background knowledge document
│ WikiActors_500_random.json # Pandas dataframe with 50 protected individuals and background knowledge for ∼500 individuals, including the 50
│ WikiActors_500_random_25%BK.json # Same as previous dataframe but with only the first 25% of each background knowledge document
│ WikiActors_500_filtered.json # Pandas dataframe with 50 protected individuals and background knowledge for ∼500 individuals, including only 40 of the 50
│ WikiActors_500_filtered_25%BK.json # Same as previous dataframe but with only the first 25% of each background knowledge document
│ WikiActors_2000_filtered.json # Pandas dataframe with 50 protected individuals and background knowledge for ∼2000 individuals, including only 42 of the 50
│ WikiActors_2000_filtered_25%BK.json # Same as previous dataframe but with only the first 25% of each background knowledge document
Our implementation uses Python 3.9.19 as the programming language and Conda 24.1.2 for package management. All required packages and their respective versions are listed in the requirements.txt file.
To run the code, follow these steps:
- Install Conda if you haven't already.
- Download this repository.
- Open a terminal in the repository path.
- Create a new Conda environment using the following command (channels are included to ensure that the specific versions can be installed):
conda create --name ENVIRONMENT_NAME --file requirements.txt -c conda-forge -c spacy -c pytorch -c nvidia -c huggingface -c numpy -c pandas
- Activate the newly created Conda environment using the following command:
conda activate ENVIRONMENT_NAME
Continue with the steps in the Usage section.
This has been tested on Windows 10 and Windows 11, but it should also be compatible with Linux-based systems.
The TRIA method and TRIR metric are implemented in the tri.py script. They can be executed via the CLI (Command Line Interface) or by importing the TRI class directly into your Python code. The following sections provide detailed instructions for both approaches. Additionally, both approaches offer extensive configuration options; each subsection details how to pass them, and the options themselves are described in the Configuration section.
The CLI implementation only requires passing the path to a JSON configuration file as an argument. This file must contain a dictionary with the mandatory configurations, and it can also contain optional configurations (see the Configuration section).
For example, to use the configuration file config.json, run the following command:
python tri.py config.json
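For reference, a minimal configuration could look like the following sketch; the paths and the two optional settings shown are illustrative, the config.json shipped with the repository is the complete working example, and all available options are listed in the Configuration section:

```json
{
    "output_folder_path": "outputs/WikiActors_50_eval",
    "data_file_path": "data/WikiActors_50_eval.json",
    "individual_name_column": "name",
    "background_knowledge_column": "public_knowledge",
    "pretraining_epochs": 3,
    "finetuning_epochs": 15
}
```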
You can replicate the CLI behavior in a Python file by importing the TRI class from the tri.py script, instantiating the class and calling its run method. The constructor requires the mandatory configurations as arguments, and also accepts optional configurations (see the Configuration section). Moreover, any of these configurations can be later modified by calling the set_configs method. During the execution, multiple log messages indicate the current block within the TRI workflow (see the TRI workflow section). These messages can be disabled by passing verbose=False as an argument to the run method.
Here is a Python snippet that demonstrates how to run TRI (obtaining TRIR results) for all the data files in the data folder:
```python
import os
from tri import TRI

# Declare TRI mandatory settings (placeholders for output and data paths) and some optional settings
tri = TRI(output_folder_path="To Be Defined",
          data_file_path="To Be Defined",
          individual_name_column="name",
          background_knowledge_column="public_knowledge",
          anonymize_background_knowledge=False,
          use_document_curation=False,
          pretraining_epochs=1,
          finetuning_epochs=1)

# Evaluate all the data files in the "data" folder
all_results = {}
data_folder_path = "data"
output_folder_base_path = "outputs"
for data_file_name in os.listdir(data_folder_path):
    # Discard the config.json file
    if data_file_name != "config.json":
        # Set new data_file_path and output_folder_path configurations
        data_file_path = os.path.join(data_folder_path, data_file_name)
        output_name = os.path.splitext(data_file_name)[0]  # Filename without extension
        output_folder_path = os.path.join(output_folder_base_path, output_name)
        tri.set_configs(data_file_path=data_file_path,
                        output_folder_path=output_folder_path)
        # Run TRIA and get the TRIR for each anonymization
        result = tri.run(verbose=True)
        all_results[data_file_name] = {key: value["eval_Accuracy"] for key, value in result.items()}

# Show results for all data files
print(all_results)
```
In the following, we specify the configurations available for our implementation, either via the CLI (with the JSON file) or via the TRI class (with the constructor or the set_configs method). They are generally organized according to the TRI workflow. For each configuration, we specify the name, the type, whether it is mandatory or optional (i.e., has a default value) and a description.
- Mandatory configurations (generally belonging to Data reading):
  - output_folder_path | String | MANDATORY: Determines the folder where results will be stored (see the Results section). The folder will be created if it does not exist.
  - data_file_path | String | MANDATORY: Path to the data file to use. That file is expected to define a Pandas dataframe stored in JSON or CSV format containing three types of columns:
    - Individual name: One column with the names of all the individuals. The column is named as defined in the individual_name_column setting. In the nomenclature of our method, this column allows defining both $A_I$ and $B_I$.
    - Background knowledge: One column with the background knowledge (i.e., public document) available for each individual. The column is named as defined in the background_knowledge_column setting. This will be used for training the re-identification attack. If a cell of this column is empty (e.g., "", NaN, None or null), it is considered that no background knowledge is available for that individual. In the nomenclature of our method, this column determines $B_D$, and all the individuals in the individual name column that have background knowledge define $B_I$.
    - Anonymized document set: At least one column should contain anonymized texts corresponding to the individuals. Each column represents the output of an anonymization method applied to documents related to the individual of the same row. The re-identification risk will be computed for each anonymization method; since all of them are applied to the same documents, this allows comparing the privacy protection provided by the approaches. In the nomenclature of our method, this column determines $A_D$, and all the individuals in the individual name column with an anonymized document (instead of an empty cell) define $A_I$.

    WARNING: A column of indexes (included by default when exporting from Pandas to CSV) will make the program fail, since it creates an "Unnamed" anonymized document set containing numbers.

    Example of a dataframe with three individuals, each of them with background knowledge (except the last) and a document protected with two different anonymization methods (Method1 and Method2):

    | name | public_knowledge | Method1 | Method2 |
    |---|---|---|---|
    | Joe Bean | Popular tweet: I built my own fortune! | Bean received funding from his family to found ORG. | PERSON received funding from DEM to found UnderPants. |
    | Ebenezer Lawson | Evebezer Lawson is a popular writer from Kansas. | PERSON, born in LOCATION, has written multiple best-sellers. | Lawson, born in Kansas, has MISC multiple MISC. |
    | Ferris Knight | NaN | After a long race, PERSON managed to obtain the first position. | After a EVENT, PERSON managed to obtain the first position. |

    Since no public knowledge has been provided for Ferris Knight, the TRI model will not have samples of her in the training set. Consequently, it is expected to fail to re-identify her anonymized texts, limiting the re-identification risk for Method1 and Method2 to 66.66%.
  - individual_name_column | String | MANDATORY: Name of the dataframe column corresponding to the individual or organization name. In the previous example, it would be name.
  - background_knowledge_column | String | MANDATORY: Name of the column corresponding to the background knowledge document for each individual. In the previous example, it would be public_knowledge. The rest of the columns, not named as defined in individual_name_column or background_knowledge_column, will be considered as texts anonymized with different methods (one method per column) for which the re-identification risk will be evaluated.
- Load pretreatment:
  - load_saved_pretreatment | Boolean | Default=true: If the Pretreated_Data.json file exists in the output_folder_path, load that data instead of running the pretreatment. Disable it if you completely changed the data_file_path. It requires a previous execution with save_pretreatment=true.
  - add_non_saved_anonymizations | Boolean | Default=true: When loading pretreatment data from Pretreated_Data.json, this setting checks whether the file at data_file_path includes new anonymizations that need to be processed. If new anonymizations are found, they are loaded and, if use_document_curation is true, only these new anonymizations will undergo curation. This option is particularly useful if you have added new anonymizations to the dataframe at data_file_path and want to avoid repeating the entire pretreatment.
- Data pretreatment:
  - Anonymized background knowledge:
    - anonymize_background_knowledge | Boolean | Default=true: Whether, during document pretreatment, to generate an anonymized version of the background knowledge documents using spaCy NER, to be used along with the non-anonymized version. Its usage is strongly recommended, since it can significantly improve re-identification risks. As a counterpoint, it roughly duplicates the training samples, increasing the training time and RAM consumption.
    - only_use_anonymized_background_knowledge | Boolean | Default=false: Whether to use only the anonymized version of the background knowledge, instead of concatenating it with the non-anonymized version. This usually results in higher re-identification risks than using only the non-anonymized version, but lower than using both (anonymized and non-anonymized). Created for an ablation study.
  - Document curation:
    - use_document_curation | Boolean | Default=true: Whether to perform the document curation, consisting of lemmatization and removal of stopwords and special characters. It is inexpensive compared to pretraining or finetuning.
- Save pretreatment:
  - save_pretreatment | Boolean | Default=true: Whether to save the data after pretreatment. A JSON file named Pretreated_Data.json will be generated and stored in the output_folder_path folder. This saved pretreatment also includes the curation of new anonymizations triggered by add_non_saved_anonymizations=true.
- Load already trained TRI model:
  - load_saved_finetuning | Boolean | Default=true: If the TRI_Pipeline folder exists in the output_folder_path directory and contains the model file model.safetensors, load that already trained TRI model instead of running the additional pretraining and finetuning. It requires a previous execution with save_finetuning=true.
- Create base language model:
  - base_model_name | String | Default="distilbert-base-uncased": Name of the base language model from HuggingFace's Transformers library to be used for both additional pretraining and finetuning. The current code is designed for versions of BERT, DistilBERT and RoBERTa. Examples: "distilbert-base-uncased", "distilbert-base-cased", "bert-base-uncased", "bert-base-cased" and "roberta-base". The ini_extended_model method from the TRI class (in tri.py) can be easily modified to support other models.
  - tokenization_block_size | Integer | Default=250: Number of data samples tokenized at once with the Transformers tokenizer. This is done to limit and optimize RAM usage when processing large datasets. The value of 250 is roughly optimized for 32GB of RAM.
- Additional pretraining:
  - use_additional_pretraining | Boolean | Default=true: Whether additional pretraining (i.e., Masked Language Modeling, MLM) is to be performed on the base language model. Its usage is recommended, since it is inexpensive (compared to finetuning) and can improve re-identification risks.
  - save_additional_pretraining | Boolean | Default=true: Whether to save the additionally pretrained language model. The model will be saved as a PyTorch model file Pretrained_Model.pt in the output_folder_path.
  - load_saved_pretraining | Boolean | Default=true: If use_additional_pretraining is true and the Pretrained_Model.pt file exists, loads that additionally pretrained base model instead of running the process. It requires a previous execution with save_additional_pretraining=true.
  - pretraining_epochs | Integer | Default=3: Number of additional pretraining epochs.
  - pretraining_batch_size | Integer | Default=8: Size of the batches for additional pretraining.
  - pretraining_learning_rate | Float | Default=5e-05: Learning rate for the AdamW optimizer to use during additional pretraining.
  - pretraining_mlm_probability | Float | Default=0.15: Probability of masking tokens by the Data Collator during the additional pretraining with MLM.
  - pretraining_sliding_window | String | Default="512-128": Sliding window configuration for additional pretraining. Since input documents are assumed to be longer than the maximum number of tokens processable by the language model (maximum sequence length), the text is split into multiple samples. A sliding window mechanism has been implemented, defined by the size of the window and the overlap with the previous window. For instance, use "512-128" for samples/splits of 512 tokens and an overlap of 128 tokens with the previous split/sample. Alternatively, if "No" is used, one sample/split per sentence will be created, leveraging the fact that sentences are generally shorter than the model's maximum sequence length. Reducing the window size and/or increasing the overlap will result in more samples/splits, which increases the training time (see the illustrative sketch after this configuration list).
- Finetuning:
  - finetuning_epochs | Integer | Default=15: Number of epochs to perform during the finetuning.
  - finetuning_batch_size | Integer | Default=16: Size of the batches for finetuning.
  - finetuning_learning_rate | Float | Default=5e-05: Learning rate for the AdamW optimizer to use during finetuning.
  - finetuning_sliding_window | String | Default="100-25": Sliding window configuration for finetuning, analogous to pretraining_sliding_window. For example, use "512-128" for samples/splits of 512 tokens and an overlap of 128 tokens with the previous split/sample. Alternatively, if "No" is used, one sample/split per sentence will be created, leveraging the fact that sentences are generally shorter than the model's maximum sequence length. Reducing the window size and/or increasing the overlap will result in more samples/splits, which increases the training time.
  - dev_set_column_name | String | Default=false: Specifies the column name to be used for model selection. If set to false (boolean, not string), the model with the highest average accuracy across all anonymization sets will be selected as the final model. If a column name is provided, the accuracy on that specific anonymization from the dataframe located at data_file_path will be used to choose the best model.
  - save_finetuning | Boolean | Default=true: Whether to save the TRI model after the finetuning. The model will be saved as a Transformers pipeline, creating a folder TRI_Pipeline in the output_folder_path directory, containing the model file model.safetensors.
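To illustrate how the sliding window settings split a long document, here is a minimal, self-contained sketch of the splitting logic described above (this is not the actual code from tri.py): a "512-128" configuration uses windows of 512 tokens that advance by 512 - 128 = 384 tokens, so consecutive samples overlap by 128 tokens.

```python
# Illustrative sketch of a "window-overlap" sliding window such as "512-128".
# NOT the implementation from tri.py; it only shows the splitting behavior.
def sliding_window_splits(tokens, config="512-128"):
    window_size, overlap = (int(value) for value in config.split("-"))
    step = window_size - overlap  # how far the window advances at each step
    return [tokens[start:start + window_size] for start in range(0, len(tokens), step)]

# A 1000-token document with the default pretraining configuration "512-128"
splits = sliding_window_splits(list(range(1000)))
print([len(split) for split in splits])  # [512, 512, 232]
```

Smaller windows or larger overlaps produce more splits, which is why both settings directly affect the training time.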
After the execution of TRI (either from the CLI or from Python code), you can find the following files in the output_folder_path:

- Pretreated_Data.json: If save_pretreatment is true, this file is created for saving the pretreated background knowledge and protected documents, sometimes referred to as training and evaluation data, respectively. Leveraged if load_saved_pretreatment is true.
- Pretrained_Model.pt: If save_additional_pretraining is true, this file is created for saving the additionally pretrained language model. Leveraged if load_saved_pretraining is true.
- TRI_Pipeline: If save_finetuning is true, this folder is created for saving the trained TRI model as a Transformers pipeline. Leveraged if load_saved_finetuning is true.
- Results.csv: After each epoch of finetuning, the Text Re-Identification Risk (TRIR) resulting from each anonymization method will be evaluated. These results are stored (always appending, never overwriting) in a CSV file named Results.csv. This file contains the epoch time, the epoch number, the TRIR for each anonymization method and the average TRIR. For instance, if using the dataframe exemplified in the data_file_path configuration description, TRIR results will correspond to Method1 and Method2:

  | Time | Epoch | Method1 | Method2 | Average |
  |---|---|---|---|---|
  | 01/08/2024 08:50:04 | 1 | 74 | 36 | 55 |
  | 01/08/2024 08:50:37 | 2 | 92 | 44 | 68 |
  | 01/08/2024 08:50:10 | 3 | 94 | 48 | 71 |

  At the end of the program, TRIR is predicted for all the anonymization methods using the best TRI model according to the criteria defined by the dev_set_column_name setting. This final evaluation is also stored in the Results.csv file as an "additional epoch".
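As an example of post-processing, the following sketch reads a Results.csv file with Pandas and prints the TRIR values of the final evaluation (it assumes the column layout shown above and a hypothetical output path):

```python
import pandas as pd

# Hypothetical path; replace it with your own output_folder_path
results = pd.read_csv("outputs/WikiActors_50_eval/Results.csv")

# The last appended row corresponds to the final evaluation with the best TRI model
final_trir = results.drop(columns=["Time", "Epoch"]).iloc[-1]
print(final_trir)  # TRIR per anonymization method, plus the Average column
```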
The file config.json provides an example of a valid configuration. It uses data from the data folder, which includes multiple Pandas dataframes. In particular, it employs the WikiActors_50_eval.json dataframe, which contains a set of 50 popular actors and actresses born in the 20th century. The background knowledge documents are the bodies of the actors' Wikipedia articles. The anonymized documents are the abstracts of these actors' Wikipedia articles, protected using approaches based on NER, Word2Vec and manual effort. Finally, the development set is a random 30% subset of the spaCy-anonymized abstracts (see our paper for details). Using this config.json (command example in the Usage section), the final TRIRs expected to be found in the corresponding Results.csv of the output_folder_path are:
| Method | TRIR |
|---|---|
| Original | 100% |
| NER3 | 70% |
| NER4 | 60% |
| NER7 | 88% |
| Presidio | 74% |
| spaCy | 70% |
| Word2Vec_t=0.5 | 48% |
| Word2Vec_t=0.25 | 26% |
| Manual | 10% |
Note that values may slightly differ between executions.
Feel free to modify the config.json file for testing other configurations. If the data_file_path is modified (i.e., you change the dataframe), we recommend also updating the output_folder_path directory in order to avoid overwriting and/or loading pretreated data from other dataframes.
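If you want to evaluate your own documents, the dataframe pointed to by data_file_path must follow the column layout described in the Configuration section. The following sketch (using the same illustrative individuals as the Configuration example, with hypothetical file paths) builds such a dataframe and exports it without the index column that would otherwise make the program fail:

```python
import pandas as pd

# One individual name column, one background knowledge column and
# two anonymization columns (one per anonymization method)
df = pd.DataFrame({
    "name": ["Joe Bean", "Ebenezer Lawson", "Ferris Knight"],
    "public_knowledge": ["Popular tweet: I built my own fortune!",
                         "Evebezer Lawson is a popular writer from Kansas.",
                         None],  # empty cell: no background knowledge for this individual
    "Method1": ["Bean received funding from his family to found ORG.",
                "PERSON, born in LOCATION, has written multiple best-sellers.",
                "After a long race, PERSON managed to obtain the first position."],
    "Method2": ["PERSON received funding from DEM to found UnderPants.",
                "Lawson, born in Kansas, has MISC multiple MISC.",
                "After a EVENT, PERSON managed to obtain the first position."],
})

# Export without the index column (an extra index column would make the program fail)
df.to_csv("data/my_dataframe.csv", index=False)
# Alternatively, export to JSON; check the provided data files for the exact layout they use
df.to_json("data/my_dataframe.json")
```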
To facilitate the understanding of our implementation, the following diagram depicts the workflow of the TRI class in tri.py. It is divided into three main blocks (i.e., Data, Build classifier and Predict TRIR), using blue boxes for the mandatory fragments and green boxes for the optional fragments. Depending on the configuration, each optional fragment is either executed or skipped, moving on to the next fragment. If using the Visual Studio Code IDE, it is recommended to install the Python Code Tools extension for collapsing and expanding the code regions corresponding to the blocks and boxes.