Feature request
The current `RandomObfuscator` implementation (in line with the original paper, if I understand correctly) masks values by setting them to 0.
But 0 is a very significant number in a lot of contexts to be using as a mask! I would liken it to choosing the token `THE` as your `[MASK]` for an English text model pre-training task.
I believe this pattern may be materially limiting accuracy/performance on datasets containing a large number of fields/instances where 0 (or proximity to 0) already has important significance - unless these datasets are pre-processed in some way to mitigate the impact (e.g. shifting binary encodings from 0/1 to 1/2, etc).
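To illustrate the ambiguity, here is a minimal sketch of the zero-masking behaviour as I read it from the current source (the tensor names are mine):

```python
import torch

# Simplified view of how RandomObfuscator masks today, as I understand it:
# obfuscated_vars == 1 marks a masked cell, and masking multiplies by zero.
x = torch.tensor([[1.0, 0.0, 3.0]])                # second field is genuinely 0
obfuscated_vars = torch.tensor([[0.0, 0.0, 1.0]])  # mask only the third field
masked_x = torch.mul(1 - obfuscated_vars, x)
print(masked_x)  # tensor([[1., 0., 0.]]) -- the genuine 0 and the masked 3.0
                 # now look identical to the model
```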
What is the expected behavior?
I suggest two primary options:
- Offer configurable alternative masking strategies (e.g. different constants) for users to select
- (Preferred) Implement embedding-aware attention per Research : Embedding Aware Attention #122, and offer the option to embed fields with an additional mask column so that e.g. scalars become 2-vectors of [value, mask]
Embedding-aware attention should be a prerequisite for the second option, because otherwise the introduction of extra mask-flag columns would add lots of extra parameters and double the input dimensionality, whereas if it's done in a model-aware way the results could be much better (rough sketch below).
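As a rough sketch of the flag-column idea (the function name and shapes are hypothetical, not existing library API), each scalar field becomes a [value, mask] 2-vector so a masked cell is explicit rather than silently zeroed:

```python
import torch

def embed_with_mask_flag(x: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
    """Hypothetical embedding step: turn each scalar field into a
    [value, mask_flag] 2-vector.

    x, mask: (batch_size, n_features), with mask == 1 where masked.
    Returns: (batch_size, n_features, 2).
    """
    masked_values = x * (1 - mask)  # zero out masked cells as today
    return torch.stack([masked_values, mask], dim=-1)

x = torch.tensor([[1.0, 0.0, 3.0]])
mask = torch.tensor([[0.0, 0.0, 1.0]])
emb = embed_with_mask_flag(x, mask)
# emb[0, 1] == [0., 0.]  -> a genuine zero, unmasked
# emb[0, 2] == [0., 1.]  -> a masked cell, explicitly flagged
```

This is exactly why embedding-aware attention matters here: the attentive transformer should attend to each field's 2-vector as a single unit, rather than treating the flag columns as independent extra features.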
What is the motivation or use case for adding/changing the behavior?
I've lately been playing with pre-training on the Forest Cover Type benchmark dataset (which includes a lot of already-one-hot-encoded fields I haven't yet bothered to "fix" into proper TabNet categorical fields), and even after experimenting with a range of parameters I'm finding the model loves to converge to unsupervised losses of ~7.130 (which should really be <1.0 per the README, since 1.0 is equivalent to always predicting the average value for each feature).
As previously noted on a different issue, I did some experiments with the same dataset on top of my PR #217 last year, before pre-training was available, and found that in the supervised case I got better performance from adding a flag column than from simply selecting a different mask value (old draft code is here).
So, from my background playing with this dataset, I'm highly suspicious that the poor pre-training losses I'm currently observing are skewed by the model's inability to tell when binary fields are 0 vs masked, and I have seen good performance from the flag-column treatment in past testing.
How should this be implemented in your opinion?
- Implement per-field / "embedding-aware" attention, perhaps something like feat: embedding-aware attention #217
- Implement missing/masked value handling as logic in the embedding layer (perhaps something like athewsey/feat/tra) so users can control how missing values are embedded per-field similarly to how they control categorical embeddings, and one of these options is to add an extra flag column to the embedding
- Modify `RandomObfuscator` to use a non-finite value like `nan` as the mask value, and allow non-finite values in (both pre-training and fine-tuning) dataset inputs, so consistent treatment can be applied to masked vs missing values and models can be successfully pre-trained or fine-tuned with arbitrary gaps in `X` (sketch below).
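For the nan-mask variant, something along these lines (purely illustrative; `obfuscate_with_nan` and its signature are my own, not a drop-in patch for `RandomObfuscator`):

```python
import torch

def obfuscate_with_nan(x: torch.Tensor, pretraining_ratio: float = 0.2):
    """Illustrative sketch: mark masked cells with nan instead of 0, so the
    embedding layer can treat masked and genuinely-missing values uniformly."""
    mask = torch.bernoulli(torch.full_like(x, pretraining_ratio))
    masked_x = x.clone()
    masked_x[mask.bool()] = float("nan")
    return masked_x, mask

x = torch.tensor([[1.0, 0.0, 3.0]])
masked_x, mask = obfuscate_with_nan(x)
# Downstream, the embedding layer would detect non-finite entries and apply
# whatever per-field treatment the user configured (constant fill, learned
# embedding, flag column) -- identically for masked and missing values.
is_missing = torch.isnan(masked_x)
```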
Are you willing to work on this yourself?
yes