Investigating Privacy Concerns and Mitigations for Healthcare Language and Foundation Models Extended
This repository holds code for work that continues the "Investigating Privacy Concerns and Mitigations for Healthcare Language and Foundation Models" project (priv-lm-health).
This work was conducted as part of an NHS England Data Science PhD Internship project by Jenny Chim between July and December 2024.
Link to original project proposal.
Note: Only public or fake data are shared in this repository.
This repository contains code to:
- Construct the instruction-tuning dataset (data_processing/)
- Run memorisation experiments (memorisation/)
- Run experiments to assess privacy in clinical documentation (privacy_in_context/)

(See Usage below for more information.)

The accompanying report is also available in the reports folder. More information about the code usage can be found in each sub-directory.
To get a local copy up and running, follow these steps.
To clone the repo:
git clone git@github.com:nhsengland/pvt_p71_privLMextended.git
Each sub-directory has its own packages, detailed in a requirements file. To create a suitable environment, change into the sub-directory of interest, then run:
python -m venv <env_name>
source <env_name>/bin/activate
pip install -r requirements.txt
While part of the model training code shows experiments with larger models (e.g. meta-llama/Llama-3.1-70B), the code base is designed to work with compact models as well. Substitute the model names with an alternative hosted on the Hugging Face hub, e.g. HuggingFaceTB/SmolLM2-135M-Instruct.
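One way to make the model substitution explicit is to resolve the model name in a single place before loading it. The sketch below is illustrative only and is not taken from this repository: the helper names (resolve_model_name, load_model) and the PRIVLM_MODEL_NAME environment variable are assumptions. It uses the standard transformers calls AutoTokenizer.from_pretrained and AutoModelForCausalLM.from_pretrained.

```python
import os

# Assumed default for illustration: the compact model named in this README.
DEFAULT_MODEL = "HuggingFaceTB/SmolLM2-135M-Instruct"


def resolve_model_name(cli_value=None, env_var="PRIVLM_MODEL_NAME"):
    """Pick a model id: explicit value first, then env var, then the default.

    Both the function and the environment variable name are hypothetical.
    """
    if cli_value:
        return cli_value
    return os.environ.get(env_var) or DEFAULT_MODEL


def load_model(model_name):
    """Load a tokenizer and causal LM from the Hugging Face hub.

    Requires the transformers package and network access (or a local cache).
    """
    # Imported lazily so name resolution works without transformers installed.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name)
    return tokenizer, model


if __name__ == "__main__":
    name = resolve_model_name()
    print(f"Using model: {name}")
```

With this arrangement, switching from a large model to a compact one would only need a changed argument or environment variable, rather than edits throughout the experiment scripts.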
Refer to sub-directories for work package specific instructions.
Contributions are what make the open source community such an amazing place to learn, inspire, and create. Any contributions you make are greatly appreciated.
- Fork the Project
- Create your Feature Branch (git checkout -b feature/AmazingFeature)
- Commit your Changes (git commit -m 'Add some AmazingFeature')
- Push to the Branch (git push origin feature/AmazingFeature)
- Open a Pull Request
See CONTRIBUTING.md for detailed guidance.
Unless stated otherwise, the codebase is released under the MIT Licence. This covers both the codebase and any sample code in the documentation.
See LICENSE for more information.
The documentation is © Crown copyright and available under the terms of the Open Government Licence v3.0.
To find out more about NHS England Data Science, visit our project website or get in touch at datascience@nhs.net.