This repository contains code for the paper "Using Sequences of Life-events to Predict Human Lives" (life2vec).
This repository contains scripts and several notebooks for data processing, life2vec training, statistical analysis, and visualization. The model weights, experiment logs, and associated model outputs can be obtained in accordance with the rules of Statistics Denmark's Research Scheme.
Paths (e.g., to data, or model weights) were redacted before submitting scripts to GitHub.
We use Hydra to run the experiments. The /conf folder contains configs for the experiments:
- /experiment contains YAML configuration files for pretraining and finetuning,
- /tasks contains the specifications for data augmentation in MLM, SOP, etc.,
- /trainer contains configuration for logging (not used) and multithread training (not used),
- /data_new contains configs for data loading and processing,
- /datamodule contains configs that specify how data should be loaded to PyTorch and PyTorch Lightning,
- callbacks.yaml specifies the configuration for the PyTorch Lightning callbacks,
- prepare_data.yaml can be used to run data preprocessing.
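The commands in this README pass overrides such as trainer.devices=[7] or version=0.01 on the command line. As a rough illustration of what a dotted-key override does to a nested configuration, here is a minimal standard-library sketch. It is not Hydra's implementation: the function name apply_override and the starting config are hypothetical, and it does not model config-group selection (e.g., picking a file from /experiment).

```python
import ast
from typing import Any, Dict


def apply_override(cfg: Dict[str, Any], override: str) -> None:
    """Apply one `dotted.key=value` override to a nested dict, in place."""
    key, _, raw = override.partition("=")
    try:
        # Interpret values such as "[7]" or "0.01" as Python literals.
        value: Any = ast.literal_eval(raw)
    except (ValueError, SyntaxError):
        # Anything that is not a literal stays a plain string.
        value = raw
    *parents, leaf = key.split(".")
    node = cfg
    for name in parents:
        # Walk down the nesting, creating intermediate dicts when missing.
        node = node.setdefault(name, {})
    node[leaf] = value


# Hypothetical starting values; the real defaults live in the YAML files under /conf.
cfg = {"trainer": {"devices": [0]}, "version": 1.0}
for item in ["trainer.devices=[7]", "version=0.01"]:
    apply_override(cfg, item)
print(cfg)
```

Hydra itself additionally validates overrides against the composed config and supports a richer override grammar, so treat this only as a mental model.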
The /analysis folder contains ipynb notebooks for post-hoc evaluation:
- /embedding contains the analysis of the embedding spaces,
- /metric contains notebooks for the model evaluation,
- /visualisation contains notebooks for the visualisation of spaces,
- /tcav includes the TCAV implementation,
- /optimization covers hyperparameter tuning.
The source folder, /src, contains the data loading and model training code. Due to the specifics of the Hydra package. Here is an overview of the /src folder:
- /src/data_new contains scripts to preprocess data and to prepare it for loading into PyTorch or PyTorch Lightning,
- /src/models contains the implementation of the baseline models,
- /src/tasks includes code specific to a particular task, e.g., MLM, SOP, Mortality Prediction, Emigration Prediction, etc.,
- /src/tranformer contains the implementation of the life2vec model:
  - In performer.py, we overwrite the functionality of the performer-pytorch package,
  - In cls_model.py, we have an implementation of the finetuning stage for the binary classification tasks (i.e., early mortality and emigration),
  - In hexaco_model.py, we have an implementation of the finetuning stage for the personality nuance prediction task,
  - models.py contains the code for the life2vec pretraining (i.e., the base life2vec model),
  - transformer_utils.py contains the implementation of custom modules, such as losses, activation functions, etc.,
  - metrics.py contains code for the custom metric,
  - modules.py, attention.py, att_utils.py, and embeddings.py contain the implementation of modules used in the transformer network (i.e., the life2vec encoders).
 
- In 
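The task code mentioned above includes MLM (masked language modelling). For readers unfamiliar with the idea, the following is a generic, BERT-style masking sketch in plain Python. It is not the code from /src/tasks: the mask token name, the default masking probability, and the example event tokens are all assumptions made for illustration.

```python
import random

MASK = "[MASK]"  # hypothetical mask token name


def mask_tokens(tokens, mask_prob=0.15, seed=None):
    """Return (masked_tokens, targets).

    targets[i] holds the original token where a mask was applied, else None.
    The 0.15 default is the common BERT-style value, not taken from this repository.
    """
    rng = random.Random(seed)
    masked, targets = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            # Hide the token and remember it as the prediction target.
            masked.append(MASK)
            targets.append(tok)
        else:
            # Leave the token visible; nothing to predict at this position.
            masked.append(tok)
            targets.append(None)
    return masked, targets


events = ["JOB_A", "INCOME_3", "CITY_B", "JOB_C"]  # made-up event tokens
masked, targets = mask_tokens(events, mask_prob=0.5, seed=0)
print(masked)
print(targets)
```

During pretraining, a model would be asked to recover the entries of targets that are not None from the masked sequence.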
Scripts such as train.py, test.py, tune.py, and val.py are used to run a particular stage of the training, while prepare_data.py is used to run the data processing (see the examples below).
To run the code, you would use the following commands:
# run the pretraining:
HYDRA_FULL_ERROR=1 python -m src.train experiment=pretrain trainer.devices=[7]
# finetuning of the hyperparameters (for the pretraining)
HYDRA_FULL_ERROR=1 python -m src.train experiment=pretrain_optim
# assemble general dataset (GLOBAL_SET)
HYDRA_FULL_ERROR=1 python -m src.prepare_data +data_new/corpus=global_set target=\${data_new.corpus}
# assemble dataset for the mortality prediction task (SURVIVAL_SET)
HYDRA_FULL_ERROR=1 python -m src.prepare_data +data_new/population=survival_set target=\${data_new.population}
# assemble labour source
python -m src.prepare_data +data_new/sources=labour target=\${data_new.sources}
# run emigration finetuning
HYDRA_FULL_ERROR=1 python -m src.train experiment=emm trainer.devices=[0] version=0.01
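Several of the prepare_data commands above write the target as \${...}. The backslash stops the shell from treating ${...} as a shell variable, presumably so that the literal text reaches Hydra, where ${...} is the config interpolation syntax. The sketch below uses Python's shlex module, which follows POSIX-style quoting rules, to show the argument a program would receive; it only illustrates tokenization and does not run anything from this repository.

```python
import shlex

# One of the commands above, written exactly as typed in the shell.
command = (
    r"python -m src.prepare_data "
    r"+data_new/corpus=global_set target=\${data_new.corpus}"
)

# shlex.split applies POSIX-style escaping: the backslash is consumed
# and the dollar sign is kept as a literal character.
argv = shlex.split(command)
for arg in argv:
    print(arg)
```

The last printed argument is target=${data_new.corpus}, with no backslash.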
- Søren Mørk Hartmann.
Research Square Preprint
@article{savcisens2023using,
  title={Using Sequences of Life-events to Predict Human Lives},
  author={Savcisens, Germans and Eliassi-Rad, Tina and Hansen, Lars Kai and Mortensen, Laust and Lilleholt, Lau and Rogers, Anna and Zettler, Ingo and Lehmann, Sune},
  year={2023}
}
Code
@misc{life2vec_code,
  author = {Germans Savcisens},
  note = {Zenodo},
  title = {SocialComplexityLab/life2vec},
  year = {2023},
  howpublished = {\url{https://doi.org/10.5281/zenodo.10118621}},
}