Training NER model with a frozen transformer #12255
-
I am trying to train a NER model on top of a frozen transformer. I understand from discussion posts (#11522, #11608) and open issues (#11547) that there are currently bugs with freezing a transformer during training. Nevertheless, as suggested, I have been getting some results by setting the listener's grad_factor to 0.0 (as in the config below). Do you have any suggestions regarding the config or setup?

EDIT: Providing some background and context: We have an existing annotation pipeline in which we run five separate spaCy NER models (non-transformer, each with its own tok2vec layer) and save their results to a list of overlapping entities (sketched after the config below). We now want to add a new (sixth) NER model based on a new dataset we collected. We figured we would get better results by moving from the current setup to one where the NER models use a transformer. We get great results when training a regular transformer model (using the default transformer config).

Here is the config that gets the best (yet unsatisfactory) results with the frozen setup:

[paths]
train = "data/train.spacy"
dev = "data/dev.spacy"
vectors = null
init_tok2vec = null
[system]
gpu_allocator = "pytorch"
seed = 0
[nlp]
lang = "en"
pipeline = ["transformer","ner"]
batch_size = 128
disabled = []
before_creation = null
after_creation = null
after_pipeline_creation = null
tokenizer = {"@tokenizers":"spacy.Tokenizer.v1"}
[components]
[components.ner]
factory = "ner"
incorrect_spans_key = null
moves = null
scorer = {"@scorers":"spacy.ner_scorer.v1"}
update_with_oracle_cut_size = 100
[components.ner.model]
@architectures = "spacy.TransitionBasedParser.v2"
state_type = "ner"
extra_state_tokens = false
hidden_width = 64
maxout_pieces = 2
use_upper = true
nO = null
[components.ner.model.tok2vec]
@architectures = "spacy-transformers.TransformerListener.v1"
grad_factor = 0.0
pooling = {"@layers":"reduce_mean.v1"}
upstream = "*"
[components.transformer]
source = "output/en_core_sci_biobert_parser_tagger/model-best"
[corpora]
[corpora.dev]
@readers = "spacy.Corpus.v1"
path = ${paths.dev}
max_length = 0
limit = 0
[corpora.train]
@readers = "spacy.Corpus.v1"
path = ${paths.train}
max_length = 0
limit = 0
[training]
accumulate_gradient = 3
dev_corpus = "corpora.dev"
train_corpus = "corpora.train"
seed = ${system.seed}
gpu_allocator = ${system.gpu_allocator}
dropout = 0.1
patience = 1600
max_epochs = 0
max_steps = 20000
eval_frequency = 200
frozen_components = []
annotating_components = []
before_to_disk = null
[training.batcher]
@batchers = "spacy.batch_by_padded.v1"
discard_oversize = true
size = 2000
buffer = 256
get_length = null
[training.logger]
@loggers = "spacy.ConsoleLogger.v1"
progress_bar = false
[training.optimizer]
@optimizers = "Adam.v1"
beta1 = 0.9
beta2 = 0.999
L2_is_weight_decay = true
L2 = 0.01
grad_clip = 1.0
use_averages = false
eps = 0.00000001
learn_rate = 0.001
[training.score_weights]
ents_f = 1.0
ents_p = 0.0
ents_r = 0.0
ents_per_type = null
[pretraining]
[initialize]
vectors = ${paths.vectors}
init_tok2vec = ${paths.init_tok2vec}
vocab_data = null
lookups = null
after_init = null
[initialize.before_init]
@callbacks = "customize_tokenizer"
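For illustration, the merging step in our current (pre-transformer) setup looks roughly like this -- the model names and paths are placeholders, not our real ones:

import spacy

# Five separately trained NER pipelines (placeholder paths).
MODEL_PATHS = ["ner_model_1", "ner_model_2", "ner_model_3", "ner_model_4", "ner_model_5"]
PIPELINES = [spacy.load(path) for path in MODEL_PATHS]

def collect_overlapping_entities(text: str):
    # Run each NER pipeline independently and pool all entity offsets into
    # one list; spans from different models may overlap.
    entities = []
    for nlp in PIPELINES:
        doc = nlp(text)
        entities.extend((ent.start_char, ent.end_char, ent.label_) for ent in doc.ents)
    return entities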
-
Hi @cjer, as you noted, freezing transformers is not supported yet - and unfortunately I can't give you any useful recommendations on how to effectively circumvent this particular issue at the moment.

Looking at what you actually want to achieve, there might be a workaround though. You mention that you have several NER models and finally merge their results. You are doing this because you want overlapping entities and the NER model doesn't allow for that, correct?

In that case you may want to consider switching to spancat (give the docs a read, if you haven't yet). Spans can overlap there, so you could use a single model and thus avoid freezing the transformer.
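For orientation, a spancat component in the config might look roughly like this -- a sketch only, with illustrative defaults (the "sc" spans key, the ngram suggester sizes, the threshold) and the model block omitted:

[components.spancat]
factory = "spancat"
spans_key = "sc"
threshold = 0.5

[components.spancat.suggester]
@misc = "spacy.ngram_suggester.v1"
sizes = [1,2,3]

The predicted spans then live in doc.spans[spans_key] rather than doc.ents, so they are allowed to overlap.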
-
Hi @cjer, are you open to "creative" solutions? (Also known as preposterous hacks 😅.) My instinct would be to lean into the hackery while you get it set up, to see how it performs. You can always find ways to tidy it up later. Here are the quick notes, assuming Thinc knowledge etc. Happy to expand out the gaps --- this is definitely stack-specific.

What we need is a registered layer with the signature Model[List[Doc], List[Floats2d]]:

from typing import Callable, List, Tuple

from spacy.tokens import Doc
from thinc.api import Model, registry
from thinc.types import Floats2d


@registry.layers("sinful_transformer_tok2vec.v1")
def get_trf_embeddings_from_global() -> Model[List[Doc], List[Floats2d]]:
    return Model("sinful_transformer_tok2vec", forward_trf_via_global)


def forward_trf_via_global(
    model: Model[List[Doc], List[Floats2d]], docs: List[Doc], is_train: bool
) -> Tuple[List[Floats2d], Callable]:
    # Access the global data and compute the embedding representation.
    # The returned list should have one ndarray per doc, with one vector per token.
    output = _somehow_get_vectors_from_global(docs)

    def backprop_noop(d_vectors: List[Floats2d]) -> List[Doc]:
        # The transformer is frozen, so there is nothing to backpropagate into.
        return []

    return output, backprop_noop

You'd then reference this layer in the configs of the models you're training (see the snippet below). Note that it's not a listener -- it's just a normal sublayer of the NER model. There should then be no problem with composing those models into a single pipeline.
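For example (a sketch only, untested), the NER tok2vec block in the config could point at the registered layer instead of the TransformerListener:

[components.ner.model.tok2vec]
@layers = "sinful_transformer_tok2vec.v1"

Since the layer is registered under registry.layers, it's referenced with @layers rather than @architectures.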