# spacybert: Bert inference for spaCy
[spaCy v2.0](https://spacy.io/usage/v2) extension and pipeline component for attaching BERT sentence / document embeddings to `Doc`, `Span` and `Token` objects. The Bert backend itself is powered by the [Hugging Face transformers](https://github.com/huggingface/transformers) library.

## Installation
`spacybert` requires `spacy` v2.0.0 or higher.
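
Assuming the package is published on PyPI under the same name, it can be installed with pip:
```
pip install spacybert
```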

## Usage
### Getting BERT embeddings for a single-language dataset
```
import spacy
from spacybert import BertInference
nlp = spacy.load('en')
```

Then either use `BertInference` as part of a pipeline:
```
bert = BertInference(
    from_pretrained='path/to/pretrained_bert_weights_dir',
    set_extension=False)
nlp.add_pipe(bert, last=True)
```
Or use it standalone, outside a pipeline:
```
bert = BertInference(
    from_pretrained='path/to/pretrained_bert_weights_dir',
    set_extension=True)
```
The difference: with `set_extension=True`, `bert_repr` is registered as a property extension on the `Doc`, `Span` and `Token` spaCy objects, and the embedding is computed whenever `doc._.bert_repr` is accessed. With `set_extension=False`, `bert_repr` is registered as an attribute extension with a default value of `None`, which the pipeline component fills in when the document is processed.
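
Under the hood, these correspond to spaCy's two kinds of custom extensions. A minimal sketch of the distinction (not the package's actual code; `compute_bert_repr` is a hypothetical getter):
```
from spacy.tokens import Doc

def compute_bert_repr(doc):
    ...  # placeholder for the BERT forward pass

# set_extension=True: property extension; the getter runs on every access
Doc.set_extension('bert_repr', getter=compute_bert_repr, force=True)

# set_extension=False: attribute extension defaulting to None,
# filled in by the pipeline component during processing
Doc.set_extension('bert_repr', default=None, force=True)
```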

Get the Bert representation / embedding:
```
doc = nlp("This is a test")
print(doc._.bert_repr)  # <-- torch.Tensor
```

### Getting BERT embeddings for a multi-language dataset
```
import spacy
from spacy_langdetect import LanguageDetector
from spacybert import MultiLangBertInference

nlp = spacy.load('en')
nlp.add_pipe(LanguageDetector(), name='language_detector', last=True)
bert = MultiLangBertInference(
    from_pretrained={
        'en': 'path/to/en_pretrained_bert_weights_dir',
        'nl': 'path/to/nl_pretrained_bert_weights_dir'
    },
    set_extension=False)
nlp.add_pipe(bert, after='language_detector')

texts = [
    "This is a test",  # English
    "Dit is een test"  # Dutch
]
for doc in nlp.pipe(texts):
    print(doc._.bert_repr)  # <-- torch.Tensor
```
When the language detector detects a language other than the ones for which pre-trained weights are specified, `doc._.bert_repr` defaults to `None`.
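
If your corpus may contain such languages, it is worth guarding against the `None` default, for example:
```
for doc in nlp.pipe(texts):
    if doc._.bert_repr is None:
        continue  # language not covered by from_pretrained; skip or fall back
    print(doc._.bert_repr.shape)
```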

## Available attributes
The extension sets attributes on the `Doc`, `Span` and `Token` objects. You can change the attribute name when initializing the extension.

| name | type | description |
|-|-|-|
| `Doc._.bert_repr` | `torch.Tensor` | Document BERT embedding |
| `Span._.bert_repr` | `torch.Tensor` | Span BERT embedding |
| `Token._.bert_repr` | `torch.Tensor` | Token BERT embedding |
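
For example, after a document has been processed by the pipeline (or with `set_extension=True`), the same attribute is available at every level:
```
doc = nlp("This is a test")
print(doc._.bert_repr)       # document-level embedding
print(doc[0:2]._.bert_repr)  # span-level embedding for "This is"
print(doc[0]._.bert_repr)    # token-level embedding for "This"
```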

## Settings
On initialization of `BertInference`, you can define the following:

| name | type | default | description |
|-|-|-|-|
| `from_pretrained` | `str` | `None` | Path to a Bert model directory, or the name of a HuggingFace transformers pre-trained Bert model, e.g., `bert-base-uncased` |
| `attr_name` | `str` | `'bert_repr'` | Name of the BERT embedding attribute to set on the `._` property |
| `max_seq_len` | `int` | `512` | Maximum sequence length for input to Bert |
| `pooling_strategy` | `str` | `'REDUCE_MEAN'` | Strategy for generating a single sentence embedding from multiple word embeddings. See below for the available pooling strategies. |
| `set_extension` | `bool` | `True` | If `True`, `'bert_repr'` is set as a property extension on the `Doc`, `Span` and `Token` spaCy objects. If `False`, `'bert_repr'` is set as an attribute extension with a default value (`None`) which gets filled in correctly when run in a pipeline. Set it to `False` if you want to use this extension in a spaCy pipeline. |
| `force_extension` | `bool` | `True` | If `True`, overwrite an existing extension of the same name instead of raising an error when the extension is registered again |
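
For example, a fully spelled-out initialization for pipeline use (all values are the defaults except `set_extension`; the model name assumes a HuggingFace-hosted checkpoint):
```
bert = BertInference(
    from_pretrained='bert-base-uncased',
    attr_name='bert_repr',
    max_seq_len=512,
    pooling_strategy='REDUCE_MEAN',
    set_extension=False,  # required for use as a pipeline component
    force_extension=True)
nlp.add_pipe(bert, last=True)
```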

On initialization of `MultiLangBertInference`, you can define the following:

| name | type | default | description |
|-|-|-|-|
| `from_pretrained` | `Dict[LANG_ISO_639_1, str]` | `None` | Mapping from two-letter language codes to paths to Bert model directories or names of HuggingFace transformers pre-trained Bert models |
| `attr_name` | `str` | `'bert_repr'` | Same as in `BertInference` |
| `max_seq_len` | `int` | `512` | Same as in `BertInference` |
| `pooling_strategy` | `str` | `'REDUCE_MEAN'` | Same as in `BertInference` |
| `set_extension` | `bool` | `True` | Same as in `BertInference` |
| `force_extension` | `bool` | `True` | Same as in `BertInference` |

## Pooling strategies
| strategy | description |
|-|-|
| `REDUCE_MEAN` | Element-wise average of the word embeddings |
| `REDUCE_MAX` | Element-wise maximum of the word embeddings |
| `REDUCE_MEAN_MAX` | Apply both `'REDUCE_MEAN'` and `'REDUCE_MAX'` and concatenate the results. If the original word embeddings have dimension `(768,)`, the output will have shape `(1536,)` |
| `CLS_TOKEN`, `FIRST_TOKEN` | Take the embedding of only the first `[CLS]` token |
| `SEP_TOKEN`, `LAST_TOKEN` | Take the embedding of only the last `[SEP]` token |
| `None` | No reduction is applied; a matrix with one embedding per word in the sentence is returned |
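
For instance, with a base-size Bert model (768-dimensional hidden states), `REDUCE_MEAN_MAX` should yield a 1536-dimensional document embedding (a sketch; the model name is an assumption):
```
bert = BertInference(
    from_pretrained='bert-base-uncased',
    pooling_strategy='REDUCE_MEAN_MAX',
    set_extension=True)
doc = nlp("This is a test")
print(doc._.bert_repr.shape)  # expected: torch.Size([1536])
```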

## Roadmap
This extension is still experimental. Possible future updates include:
* Getting document representations from state-of-the-art NLP models other than Google's BERT.
* A method for computing similarity between `Doc`, `Span` and `Token` objects using the `bert_repr` tensor.
* Getting representations from multiple / other layers of the models.