skb.apply does not support some scikit-learn estimators #1513

@jovan-stojanovic

Describe the bug

skb.apply, and therefore expressions (data plan, data ops), cannot be used with standard scikit-learn vectorizers such as CountVectorizer, HashingVectorizer, or TfidfVectorizer, because these output a sparse matrix while expressions expect a dataframe.

These transformers currently do not support set_output(transform="pandas"), and I don't think that will change anytime soon in scikit-learn (scikit-learn/scikit-learn#22377).

Steps/Code to Reproduce

import skrub
df = skrub.toy_orders().X
x = skrub.var("x", df)

# Works:
x.skb.apply(skrub.MinHashEncoder(), cols=["product"])

# Works (numeric transformer)
from sklearn.preprocessing import StandardScaler
x.skb.apply(StandardScaler(), cols=["quantity"])

# Works (outside expressions, directly on a dataframe column)
from sklearn.feature_extraction.text import HashingVectorizer
enc = HashingVectorizer()
enc.fit_transform(df["product"])

# Doesn't work (inside expressions): raises TypeError
x.skb.apply(HashingVectorizer(), cols=["product"])

Expected Results

Support for CountVectorizer, HashingVectorizer, and TfidfVectorizer, so that they can be used in expressions.

Actual Results

TypeError: HashingVectorizer.fit_transform returned a result of type csr_matrix, but a pandas DataFrame was expected. If HashingVectorizer is a custom transformer class, please make sure that the output is a pandas container when the input is a pandas container. One way of enabling a transformer to output pandas DataFrames is inheriting from the sklearn.base.TransformerMixin class and defining the 'get_feature_names_out' method. See https://scikit-learn.org/stable/auto_examples/miscellaneous/plot_set_output.html for details.

Versions

Latest version (0.6.dev0)

Labels: bug