SpanCategorizer not learning (either tok2vec or transformer based) for aspect extraction task #13780
Replies: 1 comment
-
I've been able to get the needle moving somewhat by adjusting training parameters. The best config so far:

I'm seeing 3 potential paths, and here's what I'm doing for each:

1. Making SpanCat work. Continue annotating, since I only have between 300 and 500 examples per label, and in parallel keep tweaking training parameters, adding/swapping which annotated labels are used, and running more experiments. Can't say enough about how much the Weights & Biases library has come in handy here.
2. Test NER. I chose SpanCat from reading the Prodigy docs; however, I don't really have overlapping spans, so I think I can convert my SpanCat-annotated data to NER and give that a try (see the sketch below).
3. Convert to TextCat. Drawing inspiration from "Healthsea", I'm generating statistics to see how many of my span annotations are singularly contained within sentences, meaning within one sentence I only have one annotated span (the sketch below also covers this count). If that count is high enough, I could turn this from a sequence-labelling problem into a text-classification one (a recommendation I see often in replies from the spaCy/Prodigy team). However, Edward uses Benepar for constituency parsing, and that library seems abandoned (untouched for 4 years), so I'm very reluctant to use it. Are there other alternatives for "smarter" ways of splitting sentences?
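For paths 2 and 3, something along these lines is what I have in mind — a minimal sketch that assumes the annotations are exported as a .spacy DocBin with spans under the default "sc" key (the file paths and the key are placeholders, adjust to your export):

```python
import spacy
from spacy.tokens import DocBin
from spacy.util import filter_spans

SPANS_KEY = "sc"                    # assumption: default spancat spans key
TRAIN_PATH = "corpus/train.spacy"   # hypothetical path to the exported annotations

nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")
sentencizer = nlp.get_pipe("sentencizer")

docs = list(DocBin().from_disk(TRAIN_PATH).get_docs(nlp.vocab))

# Path 3: how many spans are the only annotated span inside their sentence?
total = single = 0
for doc in docs:
    doc = sentencizer(doc)  # set sentence boundaries; the span annotations stay untouched
    spans = list(doc.spans.get(SPANS_KEY, []))
    total += len(spans)
    for sent in doc.sents:
        inside = [s for s in spans if s.start >= sent.start and s.end <= sent.end]
        if len(inside) == 1:
            single += 1
print(f"{single}/{total} spans are the sole annotated span of their sentence")

# Path 2: since the spans don't overlap, the same data can be re-targeted at NER
ner_db = DocBin()
for doc in docs:
    doc.ents = filter_spans(list(doc.spans.get(SPANS_KEY, [])))
    ner_db.add(doc)
ner_db.to_disk("corpus/train_ner.spacy")  # hypothetical output path
```

The sentencizer here is the rule-based one; for noisy social-media text it may be worth swapping in a trained senter and comparing the counts.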
-
I'm struggling to get SpanCategorizer to learn anything. All my attempts end up the same: 30 epochs in, F1, Precision, and Recall are all still 0.00, while the loss fluctuates and trends upward. I'm trying to determine whether the problem is my annotations, my hyperparameters, or the way I've framed the task.
Context
I'm extracting aspects (commentary about entities) from noisy online text. I'll use Formula 1 to craft some examples.
Entity extraction (e.g., "Charles", "YUKI" → Driver, "Ferrari" → Team, "monaco" → Race) already works well. Now I want to classify aspect spans within posts like these (a sketch of the target annotation follows the examples):
"Can't believe what I just saw, Charles is an absolute demon behind the wheel but Ferrari is gonna Ferrari, they need to replace their entire pit wall because their strategies never make sense"
"LMAO classic monaco. i should've stayed in bed, this race is so boring"
"YUKI P4 WHAT A DRIVE!!!!"
Dataset
This is the output of my spacy debug command:

What I've Tried
- tok2vec, roberta-base, and xlm-roberta-base all got scores of 0.00 with default settings.
- xlm-roberta-base on just two labels (the most numerous and distinctive) with dropout = 0.0 and L2 = 0.0001 (see the trimmed config snippet below); some learning happened on that run.
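For reference, those two knobs sit in the training block of the config; trimmed down to just the relevant lines it looks roughly like this (everything else left at the defaults):

```ini
[training]
dropout = 0.0

[training.optimizer]
@optimizers = "Adam.v1"
L2 = 0.0001
L2_is_weight_decay = true
```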
Questions
Any insights on annotation quality checks, hyperparameter tuning, or alternative strategies would be greatly appreciated.
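By annotation quality checks I mean simple statistics over the training DocBin, e.g. examples per label and span lengths in tokens — roughly like this (path and spans key are placeholders as before):

```python
from collections import Counter

import spacy
from spacy.tokens import DocBin

nlp = spacy.blank("en")
docs = DocBin().from_disk("corpus/train.spacy").get_docs(nlp.vocab)  # hypothetical path

label_counts = Counter()
length_counts = Counter()
for doc in docs:
    for span in doc.spans.get("sc", []):   # "sc" assumed as the spans key
        label_counts[span.label_] += 1
        length_counts[len(span)] += 1      # span length in tokens

print("examples per label:", label_counts.most_common())
print("span lengths (tokens):", sorted(length_counts.items()))
# spancat's ngram suggester can only propose spans up to the configured sizes,
# so gold spans longer than that are unreachable and drag F1 toward 0
```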
Thanks!
Config
This is one of the configs I used that gave me 0.00 scores: