Open
Labels: bug (Something isn't working)
Describe the bug
skb.apply, and therefore expressions (data plans, data ops), cannot take standard scikit-learn vectorizers such as CountVectorizer, HashingVectorizer, or TfidfVectorizer, because they output a sparse matrix while expressions expect a dataframe.
There is currently no way to use set_output(transform="pandas") on these transformers, and I don't think it will be possible anytime soon in scikit-learn (scikit-learn/scikit-learn#22377).
Steps/Code to Reproduce
import skrub
df = skrub.toy_orders().X
x = skrub.var("x", df)
# Works:
x.skb.apply(skrub.MinHashEncoder(), cols=["product"])
# Works (numeric transformer)
from sklearn.preprocessing import StandardScaler
x.skb.apply(StandardScaler(), cols=["quantity"])
# Works (outside expressions, directly on the dataframe column)
from sklearn.feature_extraction.text import HashingVectorizer
enc = HashingVectorizer()
enc.fit_transform(df["product"])
# Doesn't work (inside expressions)
x.skb.apply(HashingVectorizer(), cols=["product"])
Expected Results
Support for CountVectorizer, HashingVectorizer, and TfidfVectorizer, so we can use them in expressions.
Actual Results
TypeError: HashingVectorizer.fit_transform returned a result of type csr_matrix, but a pandas DataFrame was expected. If HashingVectorizer is a custom transformer class, please make sure that the output is a pandas container when the input is a pandas container. One way of enabling a transformer to output pandas DataFrames is inheriting from the sklearn.base.TransformerMixin class and defining the 'get_feature_names_out' method. See https://scikit-learn.org/stable/auto_examples/miscellaneous/plot_set_output.html for details.
Versions
Latest version (0.6.dev0)