
Research(?) : Alternative missing-value masks #278

@athewsey

Description


Feature request

The current RandomObfuscator implementation (in line with the original paper, if I understand correctly) masks values by setting them to 0.
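For reference, the masking step amounts to something like this (a paraphrase of the idea, not the exact library code):

```python
import torch

def zero_mask_obfuscate(x: torch.Tensor, pretraining_ratio: float = 0.2):
    # Sample a Bernoulli mask over every cell, then zero out the selected
    # entries. A masked cell becomes indistinguishable from a cell whose
    # true value is 0.
    mask = torch.bernoulli(torch.full_like(x, pretraining_ratio))
    masked_x = x * (1 - mask)
    return masked_x, mask
```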

But 0 is a very significant number in a lot of contexts to be using as a mask! I would liken it to choosing the token THE as your [MASK] in an English text model pre-training task.

I believe this pattern may be materially limiting accuracy/performance on datasets with many fields/instances where 0 (or proximity to 0) already carries important meaning, unless those datasets are pre-processed in some way to mitigate the impact (e.g. shifting binary encodings from 0/1 to 1/2, etc.).

What is the expected behavior?

I suggest two primary options:

  1. Offer configurable alternative masking strategies (e.g. different constants) for users to select
  2. (Preferred) Implement embedding-aware attention per Research : Embedding Aware Attention #122 and offer the option to embed fields with an additional mask column, so that e.g. scalars become 2-vectors of [value, mask]

Embedding-aware attention should be a prerequisite for (2): otherwise, introducing extra mask-flag columns would add lots of extra parameters and double the input dimensionality, whereas if it's done in a model-aware way the results could be much better.
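As a rough illustration of option (2), a hypothetical flag-column embedding might look like the following (function name and shapes are mine, not from the library):

```python
import torch

def embed_with_mask_flags(x: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
    # Each scalar feature becomes a [value, mask] 2-vector, so the model
    # can tell "masked" apart from "genuinely 0".
    # Shapes: (batch, n_features) -> (batch, n_features, 2).
    values = x * (1 - mask)  # still zero out the masked cells
    return torch.stack([values, mask], dim=-1)
```

The point of pairing this with embedding-aware attention is that each [value, mask] 2-vector would be consumed per field, rather than flattened into a doubled input dimension.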

What is the motivation or use case for adding/changing the behavior?

I've lately been playing with pre-training on the Forest Cover Type benchmark dataset (which includes a lot of already-one-hot-encoded fields that I haven't yet bothered to convert to proper TabNet categorical fields). Even after experimenting with a range of parameters, I'm finding the model loves to converge to unsupervised losses of ~7.130, when it should really be <1.0: per the README, a loss of 1.0 is equivalent to always predicting the average value of the feature.

As previously noted on a different issue, I did some experiments with the same dataset on top of my PR #217 last year, before pre-training was available, and found that in the supervised case I got better performance by adding a flag column than by simply selecting a different mask value (old draft code is here).

...So from my background playing with this dataset, I'm highly suspicious that the poor pre-training losses I'm currently observing are skewed by the model's inability to tell when binary fields are =0 vs. masked, and I have seen good performance from the flag-column treatment in past testing.

How should this be implemented in your opinion?

  • Implement per-field / "embedding-aware" attention, perhaps something like feat: embedding-aware attention #217
  • Implement missing/masked-value handling as logic in the embedding layer (perhaps something like athewsey/feat/tra), so users can control how missing values are embedded per field, similarly to how they control categorical embeddings; one of those options would be to add an extra flag column to the embedding
  • Modify RandomObfuscator to use a non-finite value like nan as the mask value, and allow non-finite values in both pre-training and fine-tuning dataset inputs, so that masked and missing values get consistent treatment and models can be successfully pre-trained or fine-tuned with arbitrary gaps in X (rough sketch below)
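A rough sketch of the nan-based option, under the assumption that the embedding layer gets to see the raw inputs (function names are hypothetical):

```python
import torch

def nan_mask_obfuscate(x: torch.Tensor, pretraining_ratio: float = 0.2):
    # Use NaN as the mask value, so masked cells can never collide with
    # legitimate data values.
    mask = torch.bernoulli(torch.full_like(x, pretraining_ratio)).bool()
    masked_x = torch.where(mask, torch.full_like(x, float("nan")), x)
    return masked_x, mask

def split_nonfinite(x: torch.Tensor):
    # Embedding-layer logic: replace non-finite values (whether from
    # obfuscation or genuine gaps in X) with 0 and expose an explicit
    # flag column, so masked and missing values get identical treatment.
    missing = ~torch.isfinite(x)
    values = torch.where(missing, torch.zeros_like(x), x)
    return values, missing.float()
```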

Are you willing to work on this yourself?

Yes

Labels: enhancement (New feature or request)
