Hi all,
I am interested in using TabPFN in an arbitrary classification setting, i.e. more than 500 features, more than 10,000 rows/training samples, and more than 10 classes. I do not expect TabPFN to outperform trained methods on these problems, but I hope to be able to trade off training time against performance (e.g. AutoGluon may give me the best performance, but takes quite long to fit).
As mentioned above, there are three major hurdles to applying TabPFN to arbitrary classification settings, namely:
Too many features
Too many classes
Too many samples
I am sure that someone has already thought about or even tried this, so I would really appreciate any thoughts, experience or relevant resources anyone could share!
As a starting point, you can find my first (naive) ideas on how I would try to solve the hurdles below:
Too many features:
Apply a relatively cheap dimensionality reduction technique like PCA. Any recommendations on what techniques I should try out besides PCA?
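For what it's worth, here is a minimal sketch of the PCA idea as a preprocessing step. Since TabPFN may not be installed, `LogisticRegression` is used as a stand-in for `TabPFNClassifier`; in practice you would swap in the real model. The 500-feature limit and the choice of 100 components are illustrative, not prescribed:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression  # stand-in for TabPFNClassifier
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 2000))  # 2000 features, well above TabPFN's ~500 limit
y = rng.integers(0, 3, size=1000)

# Project down to 100 components before fitting the downstream classifier.
pipe = make_pipeline(PCA(n_components=100), LogisticRegression(max_iter=200))
pipe.fit(X, y)
print(pipe.named_steps["pca"].transform(X).shape)  # reduced feature matrix
```

The same pipeline pattern works for any other reducer (random projections, feature agglomeration, etc.), which makes it easy to compare alternatives to PCA.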
Too many classes:
In some preliminary research, I found the many_classes_classifier, which I would use as a starting point.
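In case it helps frame the discussion: one simple way to stay under a base model's class limit is to reduce the many-class problem to several smaller subproblems, e.g. one-vs-rest. This is only a sketch of that general idea (not the actual many_classes_classifier implementation), again with `LogisticRegression` standing in for `TabPFNClassifier`:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression  # stand-in for TabPFNClassifier
from sklearn.multiclass import OneVsRestClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 20))
y = rng.integers(0, 25, size=500)  # 25 classes, above TabPFN's 10-class limit

# Each binary subproblem trivially fits within the base model's class limit.
clf = OneVsRestClassifier(LogisticRegression(max_iter=200))
clf.fit(X, y)
print(len(clf.estimators_))  # one binary model per class
```

The obvious downside is fitting one model per class; grouping classes into chunks of up to 10 would need fewer base models at the cost of a more involved decoding step.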
Too many samples:
Naively, I could just randomly subsample the data to 10,000 rows and fit TabPFN on this smaller train set. This completely ignores the samples that didn't get picked for the smaller train set. To improve performance, I could bootstrap multiple of these smaller train sets independently and then use some form of ensembling (e.g. ensemble selection) to combine the predictions. Any better ideas?
Any help or pointers to the right resources are deeply appreciated!
Update: I did find this paper, which has some interesting ideas.