Description
Describe the bug
When an input polars dataframe has '' (empty string) as one of its column names, scikit-learn's set_output changes the output column name to column_0. This causes an error in the Cleaner, which expects the output column names to be the same as the inputs, i.e. the same as those it sets in the dataframe it returns from transform and fit_transform.
The problem is that the skrub transformer inherits from scikit-learn's TransformerMixin, which wraps transform in set_output, which in turn does return pl.DataFrame(X_output, schema=columns, orient="row"). But pl.DataFrame, when given a schema, replaces '' with column_0 (whereas it does not have that behavior when given a dict):
>>> df = pl.DataFrame({'': [1], 'b': [2]}) # ok
>>> df
shape: (1, 2)
┌─────┬─────┐
│ ┆ b │
│ --- ┆ --- │
│ i64 ┆ i64 │
╞═════╪═════╡
│ 1 ┆ 2 │
└─────┴─────┘
>>> pl.DataFrame(df) # ok
shape: (1, 2)
┌─────┬─────┐
│ ┆ b │
│ --- ┆ --- │
│ i64 ┆ i64 │
╞═════╪═════╡
│ 1 ┆ 2 │
└─────┴─────┘
>>> pl.DataFrame(df, schema=df.columns) # column_0
shape: (1, 2)
┌──────────┬─────┐
│ column_0 ┆ b │
│ --- ┆ --- │
│ i64 ┆ i64 │
╞══════════╪═════╡
│ 1 ┆ 2 │
└──────────┴─────┘
I'll also maybe try to open an issue in scikit-learn.
We don't need the TransformerMixin. All it does is (1) apply the set_output wrapping, which changes the column name in this case and is unnecessary because the skrub transformer already produces a dataframe of the correct type with the correct column names, and (2) provide a default fit_transform, which we define ourselves anyway. So the simple fix is to not inherit from the TransformerMixin.
If we absolutely want to inherit from it, we can disable this wrapping by passing auto_wrap_output_keys=() in the class definition, e.g.
class OnEachColumn(TransformerMixin, BaseEstimator, auto_wrap_output_keys=()):
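To illustrate the opt-out, here is a toy, stdlib-only model of a mixin that wraps transform when a subclass is created, and of how a class keyword such as auto_wrap_output_keys=() can skip that wrapping. This is not scikit-learn's actual code: WrapOutputMixin, _wrap_output, Wrapped and NotWrapped are hypothetical names, and the "dataframe" is reduced to a plain list of column names.

```python
import functools


def _wrap_output(func):
    """Wrap a transform-like method so its output names get 'sanitized'."""

    @functools.wraps(func)
    def wrapped(self, X):
        names = func(self, X)
        # Mimic the renaming described above: an empty name is replaced
        # by a positional placeholder.
        return [name if name else f"column_{i}" for i, name in enumerate(names)]

    return wrapped


class WrapOutputMixin:
    """Hypothetical stand-in for a mixin that wraps `transform` automatically."""

    def __init_subclass__(cls, auto_wrap_output_keys=("transform",), **kwargs):
        super().__init_subclass__(**kwargs)
        if auto_wrap_output_keys is None or "transform" not in auto_wrap_output_keys:
            return  # opted out: leave the methods untouched
        for method in ("transform", "fit_transform"):
            # Only wrap methods defined directly on this class.
            if method in cls.__dict__:
                setattr(cls, method, _wrap_output(cls.__dict__[method]))


class Wrapped(WrapOutputMixin):
    def transform(self, X):
        return list(X)


class NotWrapped(WrapOutputMixin, auto_wrap_output_keys=()):
    def transform(self, X):
        return list(X)


print(Wrapped().transform(["", "b"]))     # the empty name is rewritten
print(NotWrapped().transform(["", "b"]))  # names pass through unchanged
```

In this model the class keyword is consumed by __init_subclass__, so the opt-out is decided once, at class definition time, rather than on every call.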
Otherwise, if having an empty column name is a problem, maybe CheckInputDataFrame could find and replace such names (it already checks for duplicate names). So far, though, having '' as a column name does not seem to cause any issues other than the one reported here.
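If the renaming route were chosen, the check could look roughly like the sketch below. It works on a plain list of column names; replace_empty_column_names and its prefix parameter are hypothetical and not part of skrub, and the placeholder scheme is only one possible choice.

```python
def replace_empty_column_names(names, prefix="column_"):
    """Return a copy of `names` where each '' gets a unique placeholder name."""
    taken = set(names)
    result = []
    for i, name in enumerate(names):
        if name == "":
            candidate = f"{prefix}{i}"
            suffix = 0
            # Avoid clashing with a name that is already in use.
            while candidate in taken:
                suffix += 1
                candidate = f"{prefix}{i}_{suffix}"
            taken.add(candidate)
            name = candidate
        result.append(name)
    return result


print(replace_empty_column_names(["", "b"]))
```

The new names would then have to be applied to the dataframe itself, and the transformer would need to report them consistently from transform and fit_transform.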
Steps/Code to Reproduce
import polars as pl
import skrub
df = pl.DataFrame({'': [1], 'b': [2]})
skrub.Cleaner().fit_transform(df)
Expected Results
shape: (1, 2)
┌─────┬─────┐
│ ┆ b │
│ --- ┆ --- │
│ i64 ┆ i64 │
╞═════╪═════╡
│ 1 ┆ 2 │
└─────┴─────┘
Actual Results
Traceback (most recent call last):
File "empty-col-name-reproducer.py", line 6, in <module>
skrub.Cleaner().fit_transform(df)
File "/site-packages/sklearn/utils/_set_output.py", line 319, in wrapped
data_to_wrap = f(self, X, *args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/site-packages/skrub/_table_vectorizer.py", line 290, in fit_transform
self.all_processing_steps_[col].append(transformer)
~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^
KeyError: 'column_0'
Versions
polars 1.29.0, sklearn 1.6.1, skrub 0.5.3