Cleaner (and TableVectorizer) fail when there is a polars empty column name #1490

@jeromedockes

Describe the bug

When an input polars dataframe has '' (the empty string) as one of its column names, scikit-learn's set_output changes that output column name to column_0. This causes an error in the Cleaner, which expects the output column names to match the input names, i.e. the names it sets on the dataframe it returns from transform and fit_transform.

The problem is that the skrub transformer inherits from scikit-learn's TransformerMixin, which wraps transform in set_output, which in turn returns pl.DataFrame(X_output, schema=columns, orient="row"). When given a schema, pl.DataFrame replaces '' with column_0 (it does not do so when given a dict):

>>> df = pl.DataFrame({'': [1], 'b': [2]}) # ok
>>> df
shape: (1, 2)
┌─────┬─────┐
│     ┆ b   │
│ --- ┆ --- │
│ i64 ┆ i64 │
╞═════╪═════╡
│ 1   ┆ 2   │
└─────┴─────┘
>>> pl.DataFrame(df) # ok
shape: (1, 2)
┌─────┬─────┐
│     ┆ b   │
│ --- ┆ --- │
│ i64 ┆ i64 │
╞═════╪═════╡
│ 1   ┆ 2   │
└─────┴─────┘
>>> pl.DataFrame(df, schema=df.columns) # column_0
shape: (1, 2)
┌──────────┬─────┐
│ column_0 ┆ b   │
│ ---      ┆ --- │
│ i64      ┆ i64 │
╞══════════╪═════╡
│ 1        ┆ 2   │
└──────────┴─────┘

I may also open an issue in scikit-learn.

We don't need the TransformerMixin. It does two things. First, it applies the set_output wrapping, which changes the column name in this case and is unnecessary because the skrub transformer already produces a dataframe of the correct type with the correct column names. Second, it provides a default fit_transform, which we define ourselves anyway. So the simple fix is to not inherit from TransformerMixin.

If we absolutely want to inherit from it, we can pass auto_wrap_output_keys=() in the class definition, e.g.

class OnEachColumn(TransformerMixin, BaseEstimator, auto_wrap_output_keys=()):

to disable this wrapping.
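
To illustrate why a keyword in the class definition can switch the wrapping off, here is a stdlib-only toy model. It is not scikit-learn's code: ToySetOutputMixin, _rename_empty, Wrapped and Unwrapped are hypothetical names, and the renaming to column_<i> only imitates the behavior described above. The one thing it is meant to show is that class keywords are received by __init_subclass__, which can then decide whether to wrap transform.

```python
import functools


def _rename_empty(method):
    """Toy wrapper: imitate the '' -> 'column_<i>' renaming on a list of names."""
    @functools.wraps(method)
    def wrapped(self, columns):
        out = method(self, columns)
        return [f"column_{i}" if name == "" else name for i, name in enumerate(out)]
    return wrapped


class ToySetOutputMixin:
    """Hypothetical stand-in for a mixin that wraps `transform` at class creation."""

    def __init_subclass__(cls, auto_wrap_output_keys=("transform",), **kwargs):
        super().__init_subclass__(**kwargs)
        # Wrap only if the subclass asked for it and defines `transform` itself.
        if "transform" in auto_wrap_output_keys and "transform" in cls.__dict__:
            cls.transform = _rename_empty(cls.__dict__["transform"])


class Wrapped(ToySetOutputMixin):
    def transform(self, columns):
        return list(columns)


class Unwrapped(ToySetOutputMixin, auto_wrap_output_keys=()):
    def transform(self, columns):
        return list(columns)


wrapped_names = Wrapped().transform(["", "b"])      # the empty name is rewritten
unwrapped_names = Unwrapped().transform(["", "b"])  # the empty name is preserved
```

In this toy, passing an empty tuple means no method is wrapped, so the names returned by transform reach the caller unchanged.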

Otherwise, if having an empty column name is a problem, CheckInputDataFrame could find and replace such names (it already checks for duplicate names). So far, though, having '' as a column name does not seem to cause any issue other than the one reported here.
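
As a rough sketch of what "find and replace such names" could look like, the helper below works on a plain list of column names. The function name replace_empty_column_names and the "unnamed" prefix are assumptions made for illustration, not anything that exists in skrub.

```python
def replace_empty_column_names(names, prefix="unnamed"):
    """Return a copy of `names` where each '' is replaced by a name not already in use.

    Hypothetical helper: the name and the default prefix are illustrative only.
    """
    taken = set(names)
    result = []
    for name in names:
        if name == "":
            # Find the first "<prefix>_<i>" that does not clash with any other column.
            i = 0
            candidate = f"{prefix}_{i}"
            while candidate in taken:
                i += 1
                candidate = f"{prefix}_{i}"
            taken.add(candidate)
            name = candidate
        result.append(name)
    return result


new_names = replace_empty_column_names(["", "b"])
```

With polars, the resulting names could then be applied with df.rename({"": new_names[0]}) before the dataframe is handed to the transformer. Whether renaming on the user's behalf is desirable is a separate design question, since the output columns would no longer match the input exactly.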

Steps/Code to Reproduce

import polars as pl
import skrub

df = pl.DataFrame({'': [1], 'b': [2]})
skrub.Cleaner().fit_transform(df)

Expected Results

shape: (1, 2)
┌─────┬─────┐
│     ┆ b   │
│ --- ┆ --- │
│ i64 ┆ i64 │
╞═════╪═════╡
│ 1   ┆ 2   │
└─────┴─────┘

Actual Results

Traceback (most recent call last):
  File "empty-col-name-reproducer.py", line 6, in <module>
    skrub.Cleaner().fit_transform(df)
  File "/site-packages/sklearn/utils/_set_output.py", line 319, in wrapped
    data_to_wrap = f(self, X, *args, **kwargs)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/site-packages/skrub/_table_vectorizer.py", line 290, in fit_transform
    self.all_processing_steps_[col].append(transformer)
    ~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^
KeyError: 'column_0'

Versions

polars 1.29.0, sklearn 1.6.1, skrub 0.5.3
