Feature request
The current `RandomObfuscator` implementation (in line with the original paper, if I understand correctly) masks values by setting them to 0.
But 0 is a very significant number in a lot of contexts to be using as a mask! I would liken it to choosing the token `THE` as your `[MASK]` for an English text model pre-training task.
I believe this pattern may be materially limiting accuracy/performance on datasets containing a large number of fields/instances where 0 (or proximity to 0) already has important significance - unless these datasets are pre-processed in some way to mitigate the impact (e.g. shifting binary encodings from 0/1 to 1/2, etc).
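To illustrate the ambiguity, here is a minimal sketch of the zero-masking behaviour as I read it from the current source (the tensor names are mine):

```python
import torch

# Simplified view of how RandomObfuscator masks today, as I understand it:
# obfuscated_vars == 1 marks a masked cell, and masking multiplies by zero.
x = torch.tensor([[1.0, 0.0, 3.0]])                # second field is genuinely 0
obfuscated_vars = torch.tensor([[0.0, 0.0, 1.0]])  # mask only the third field
masked_x = torch.mul(1 - obfuscated_vars, x)
print(masked_x)  # tensor([[1., 0., 0.]]) -- the genuine 0 and the masked 3.0
                 # now look identical to the model
```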
What is the expected behavior?
I suggest two primary options:
- Offer configurable alternative masking strategies (e.g. different constants) for users to select
- (Preferred) Implement embedding-aware attention per Research : Embedding Aware Attention #122, and offer the option to embed fields with an additional mask column so that e.g. scalars become 2-vectors of [value, mask]
Embedding-aware attention should be a prerequisite for the second option, because otherwise the introduction of extra mask-flag columns would add lots of extra parameters and double the input dimensionality, whereas if it's done in a model-aware way the results could be much better (rough sketch below).
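As a rough sketch of the flag-column idea (the function name and shapes are hypothetical, not existing library API), each scalar field becomes a [value, mask] 2-vector so a masked cell is explicit rather than silently zeroed:

```python
import torch

def embed_with_mask_flag(x: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
    """Hypothetical embedding step: turn each scalar field into a
    [value, mask_flag] 2-vector.

    x, mask: (batch_size, n_features), with mask == 1 where masked.
    Returns: (batch_size, n_features, 2).
    """
    masked_values = x * (1 - mask)  # zero out masked cells as today
    return torch.stack([masked_values, mask], dim=-1)

x = torch.tensor([[1.0, 0.0, 3.0]])
mask = torch.tensor([[0.0, 0.0, 1.0]])
emb = embed_with_mask_flag(x, mask)
# emb[0, 1] == [0., 0.]  -> a genuine zero, unmasked
# emb[0, 2] == [0., 1.]  -> a masked cell, explicitly flagged
```

This is exactly why embedding-aware attention matters here: the attentive transformer should attend to each field's 2-vector as a single unit, rather than treating the flag columns as independent extra features.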
What is the motivation or use case for adding/changing the behavior?
I've lately been playing with pre-training on the Forest Cover Type benchmark dataset (which includes a lot of already-one-hot-encoded fields I haven't yet bothered to "fix" into proper TabNet categorical fields), and even after experimenting with a range of parameters I'm finding the model loves to converge to unsupervised losses of ~7.130 (which should really be <1.0 per the README, since 1.0 is equivalent to always predicting the average value for each feature).
As previously noted on a different issue, I did some experiments with the same dataset on top of my PR #217 last year, before pre-training was available, and found that in the supervised case I got better performance from adding a flag column than from simply selecting a different mask value (old draft code is here).
So, from my background playing with this dataset, I'm highly suspicious that the poor pre-training losses I'm currently observing are skewed by the model's inability to tell when binary fields are 0 vs masked, and I have seen good performance from the flag-column treatment in past testing.
How should this be implemented in your opinion?
- Implement per-field / "embedding-aware" attention, perhaps something like feat: embedding-aware attention #217
- Implement missing/masked value handling as logic in the embedding layer (perhaps something like athewsey/feat/tra) so users can control how missing values are embedded per-field similarly to how they control categorical embeddings, and one of these options is to add an extra flag column to the embedding
- Modify `RandomObfuscator` to use a non-finite value like `nan` as the mask value, and allow non-finite values in (both pre-training and fine-tuning) dataset inputs, so consistent treatment can be applied to masked vs missing values and models can be successfully pre-trained or fine-tuned with arbitrary gaps in `X` (sketch below).
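For the nan-mask variant, something along these lines (purely illustrative; `obfuscate_with_nan` and its signature are my own, not a drop-in patch for `RandomObfuscator`):

```python
import torch

def obfuscate_with_nan(x: torch.Tensor, pretraining_ratio: float = 0.2):
    """Illustrative sketch: mark masked cells with nan instead of 0, so the
    embedding layer can treat masked and genuinely-missing values uniformly."""
    mask = torch.bernoulli(torch.full_like(x, pretraining_ratio))
    masked_x = x.clone()
    masked_x[mask.bool()] = float("nan")
    return masked_x, mask

x = torch.tensor([[1.0, 0.0, 3.0]])
masked_x, mask = obfuscate_with_nan(x)
# Downstream, the embedding layer would detect non-finite entries and apply
# whatever per-field treatment the user configured (constant fill, learned
# embedding, flag column) -- identically for masked and missing values.
is_missing = torch.isnan(masked_x)
```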
Are you willing to work on this yourself?
yes