# private-transformers

This codebase facilitates fast experimentation with differentially private training
of [Hugging Face transformers](https://huggingface.co/transformers/).

---
<p align="center">
  <img width="950" height="450" src="./assets/fig1.png">
</p>

## What is this? Why an extra codebase?

- This codebase provides a privacy engine that builds off [Opacus](https://github.com/pytorch/opacus), but works much
  more smoothly with [Hugging Face's transformers library](https://github.com/huggingface/transformers).
- Additionally, we support the *ghost clipping* technique (see Section 4 of
  [this preprint](https://arxiv.org/pdf/2110.05679.pdf) for how it works), which allows privately training large
  transformers with considerably reduced memory cost -- in many cases, almost as light as non-private training -- at a
  modest run-time overhead.
- **With this codebase, we have fine-tuned very large pretrained models, yielding some of the best-performing
  differentially private NLP models to date. Some of these models match the performance of strong non-private
  baselines. We see strong empirical evidence that highly performant DP NLP models can be built on modest datasets.**

## Installation

Make sure you have Python >= 3.8, then run the following command:

```bash
pip install git+ssh://git@github.com/lxuechen/private-transformers.git
```

## Usage

### Basic usage

Privately training Hugging Face transformers with our codebase consists of four steps:

1. Create your favourite transformer model and optimizer; attach this optimizer to a `PrivacyEngine`.
2. Compute a per-example loss (a 1-D tensor) for a mini-batch of data.
3. Pass the loss to `optimizer.step` or `optimizer.virtual_step` as a keyword argument.
4. Repeat from step 2.

Below is a quick example:

```python
import transformers, torch
from private_transformers import PrivacyEngine
import torch.nn.functional as F

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = transformers.GPT2LMHeadModel.from_pretrained('distilgpt2').to(device)
optimizer = torch.optim.Adam(params=model.parameters(), lr=1e-4)
privacy_engine = PrivacyEngine(
    model,
    batch_size=10,
    sample_size=50000,
    epochs=3,
    max_grad_norm=0.1,
    target_epsilon=3,
)
privacy_engine.attach(optimizer)

batch_size, seq_len = 10, 20
# Inputs are in batch-first format, i.e., the first dimension of each tensor is the batch dimension.
input_ids = torch.randint(size=[batch_size, seq_len], low=0, high=100, device=device)
# Calling `.train()` is very important; otherwise the underlying forward and backward hooks don't run.
model.train()
outputs = model(input_ids=input_ids, return_dict=True)
labels = input_ids[:, 1:]
logits = outputs.logits[:, :-1, :].permute(0, 2, 1)
# `loss` is a 1-D tensor of shape (batch_size,).
loss = F.cross_entropy(logits, labels, reduction="none").mean(dim=1)
# This step is different from existing workflows:
# Don't call `loss.backward`; leave it to `optimizer.step` to handle the backward pass.
optimizer.step(loss=loss)
```
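
The same recipe carries over to the sequence classification models listed further below; the only requirement is that
the loss you hand over stays per-example. Here is a minimal sketch for classification -- the model choice, toy inputs,
and privacy hyperparameters are illustrative placeholders, not recommendations:

```python
import transformers, torch
from private_transformers import PrivacyEngine
import torch.nn.functional as F

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = transformers.BertForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=2).to(device)
optimizer = torch.optim.Adam(params=model.parameters(), lr=1e-4)
privacy_engine = PrivacyEngine(
    model,
    batch_size=8,
    sample_size=50000,
    epochs=3,
    max_grad_norm=0.1,
    target_epsilon=3,
)
privacy_engine.attach(optimizer)

model.train()
# Toy batch; real inputs would come from a tokenizer / dataloader.
batch_size, seq_len = 8, 16
input_ids = torch.randint(size=[batch_size, seq_len], low=0, high=100, device=device)
labels = torch.randint(size=[batch_size], low=0, high=2, device=device)

outputs = model(input_ids=input_ids, return_dict=True)
# `reduction="none"` keeps one loss per example -- a 1-D tensor of shape (batch_size,).
loss = F.cross_entropy(outputs.logits, labels, reduction="none")
optimizer.step(loss=loss)
```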

The biggest differences compared to Opacus are:

- We require the per-example loss (a 1-D tensor) to be passed into `optimizer.step` (or `optimizer.virtual_step`);
  see the gradient-accumulation sketch after this list for how the latter is typically used.
- The per-example loss must be passed in as a *keyword argument*.
- `loss.backward()` shouldn't be called on the user end; it's called internally in `optimizer.step`
  (or `optimizer.virtual_step`).
- Inputs should be in batch-first format; there isn't a toggle to switch between different formats in the engine.
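
On `optimizer.virtual_step`: when the logical batch is too large to fit in memory, a common pattern is to feed the
per-example losses of all but the last micro-batch to `optimizer.virtual_step` and let `optimizer.step` on the final
micro-batch perform the actual update. The sketch below assumes exactly this accumulate-then-step semantics, uses
illustrative hyperparameters, and takes the `batch_size` handed to the `PrivacyEngine` to be the full logical batch
size:

```python
import transformers, torch
from private_transformers import PrivacyEngine
import torch.nn.functional as F

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = transformers.GPT2LMHeadModel.from_pretrained('distilgpt2').to(device)
optimizer = torch.optim.Adam(params=model.parameters(), lr=1e-4)
privacy_engine = PrivacyEngine(
    model,
    batch_size=10,  # Taken here to be the full logical batch size, not the micro-batch size.
    sample_size=50000,
    epochs=3,
    max_grad_norm=0.1,
    target_epsilon=3,
)
privacy_engine.attach(optimizer)

model.train()
seq_len, micro_batch_size, num_micro_batches = 20, 5, 2  # 5 * 2 = logical batch of 10.
for i in range(num_micro_batches):
    # Toy micro-batch; real inputs would come from a tokenizer / dataloader.
    input_ids = torch.randint(size=[micro_batch_size, seq_len], low=0, high=100, device=device)
    outputs = model(input_ids=input_ids, return_dict=True)
    labels = input_ids[:, 1:]
    logits = outputs.logits[:, :-1, :].permute(0, 2, 1)
    # Per-example loss for this micro-batch, shape (micro_batch_size,).
    loss = F.cross_entropy(logits, labels, reduction="none").mean(dim=1)
    if i < num_micro_batches - 1:
        # Intermediate micro-batches: accumulate without updating parameters.
        optimizer.virtual_step(loss=loss)
    else:
        # Final micro-batch of the logical batch: perform the update.
        optimizer.step(loss=loss)
```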

### Ghost clipping: memory-saving differentially private learning

Turning on ghost clipping requires changing only one line. You should notice a drastic reduction in peak GPU memory
usage once it is turned on, at the potential cost of slower training. This is especially useful when you are
constrained to older GPUs with little VRAM or are fitting very large models.

```python
import transformers, torch
from private_transformers import PrivacyEngine

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = transformers.GPT2LMHeadModel.from_pretrained('distilgpt2').to(device)
optimizer = torch.optim.Adam(params=model.parameters(), lr=1e-4)
privacy_engine = PrivacyEngine(
    model,
    batch_size=10,
    sample_size=50000,
    epochs=3,
    max_grad_norm=0.1,
    target_epsilon=3,
    ghost_clipping=True,  # The only change you need to make!
)
privacy_engine.attach(optimizer)
```
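
To gauge the savings on your own setup, one rough check is to read PyTorch's peak-memory counter around a single
training step with `ghost_clipping` toggled on and off. The sketch below continues from the snippet above, assumes a
CUDA device, and uses a toy batch purely for illustration:

```python
import torch
import torch.nn.functional as F

# Continues from the snippet above: `model`, `optimizer`, and `device` are already defined,
# and the attached engine was created with `ghost_clipping=True` (flip it off to compare).
model.train()
batch_size, seq_len = 10, 20
input_ids = torch.randint(size=[batch_size, seq_len], low=0, high=100, device=device)

torch.cuda.reset_peak_memory_stats()
outputs = model(input_ids=input_ids, return_dict=True)
labels = input_ids[:, 1:]
logits = outputs.logits[:, :-1, :].permute(0, 2, 1)
loss = F.cross_entropy(logits, labels, reduction="none").mean(dim=1)
optimizer.step(loss=loss)
print(f"Peak GPU memory: {torch.cuda.max_memory_allocated() / 2 ** 20:.1f} MiB")
```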

We ran stringent numerical tests to ensure the double-backward implementation is correct. Check out the files in the
`tests` folder for more on this.

### Examples

Code in the `examples` folder roughly reproduces our results for the table-to-text and classification tasks. There may
be some minor discrepancies, since the hyperparameters there aren't exactly those used in the paper. Nevertheless, it
should be sufficient to get things started. Detailed instructions are in the README file of each subfolder.

### Currently supported [Hugging Face models](https://huggingface.co/transformers/pretrained_models.html)

- [OpenAIGPTLMHeadModel](https://huggingface.co/transformers/_modules/transformers/models/openai/modeling_openai.html#OpenAIGPTLMHeadModel)
- [OpenAIGPTDoubleHeadsModel](https://huggingface.co/transformers/_modules/transformers/models/openai/modeling_openai.html#OpenAIGPTDoubleHeadsModel)
- [GPT2LMHeadModel](https://huggingface.co/transformers/_modules/transformers/models/gpt2/modeling_gpt2.html#GPT2LMHeadModel)
- [GPT2DoubleHeadsModel](https://huggingface.co/transformers/_modules/transformers/models/gpt2/modeling_gpt2.html#GPT2DoubleHeadsModel)
- [BertForSequenceClassification](https://huggingface.co/transformers/_modules/transformers/models/bert/modeling_bert.html#BertForSequenceClassification)
- [RobertaForSequenceClassification](https://huggingface.co/transformers/model_doc/roberta.html#robertaforsequenceclassification)
- [AlbertForSequenceClassification](https://huggingface.co/transformers/_modules/transformers/models/albert/modeling_albert.html#AlbertForSequenceClassification)

Not all models in the Hugging Face library are supported. The main additional work here is to

1. support per-example gradients for bespoke modules (e.g., [T5LayerNorm](https://huggingface.co/transformers/_modules/transformers/modeling_t5.html)), and
2. ensure `position_ids` are repeated.

We plan to support more models in the future if there's demand. Feel free to open an issue if you want to try out
specific models that aren't in the current list.

## Acknowledgements

It would have been impossible to develop this codebase without cool past works and existing codebases. We roughly follow
the `PrivacyEngine` design in `Opacus==0.13.0`. We directly use
an [off-the-shelf package](https://github.com/microsoft/prv_accountant) for tightly tracking tradeoff functions while
composing multiple private mechanisms.

## Disclaimer

- This codebase is not yet production-grade. For example, cryptographically secure PRNGs are required for sampling
  noise, but our codebase currently does not use them.
- This codebase was born out of the need to experiment with various ideas for differentially private NLP in rapid
  succession. I've tried my best to write clean code, though parts of this codebase may be less tidy than I had hoped
  given the extremely tight timeline.

## Citation

If you found this codebase useful in your research, please consider citing:

```
@misc{li2021large,
  title={Large Language Models Can Be Strong Differentially Private Learners},
  author={Xuechen Li and Florian Tramèr and Percy Liang and Tatsunori Hashimoto},
  year={2021},
  eprint={2110.05679},
  archivePrefix={arXiv},
  primaryClass={cs.LG}
}
```