Cleaner (and TableVectorizer) fail when a polars dataframe has an empty column name #1490

### Describe the bug

When an input polars dataframe has '' (the empty string) as one of its column names, scikit-learn's `set_output` wrapping changes that column name to `column_0` in the output. This causes an error in the Cleaner, which expects the output column names to match the input ones, i.e. the names set in the dataframe returned from `transform` and `fit_transform`.

The problem is that the skrub transformer inherits from scikit-learn's `TransformerMixin`, which wraps `transform` in the `set_output` machinery. That wrapper does `return pl.DataFrame(X_output, schema=columns, orient="row")` [here](https://github.com/scikit-learn/scikit-learn/blob/cfd5f7833dfb3794e711e79e4a3373e599d5a1f0/sklearn/utils/_set_output.py#L165). But when `pl.DataFrame` is given a schema, it replaces '' with `column_0` (it does not do so when given a dict):

```
>>> import polars as pl
>>> df = pl.DataFrame({'': [1], 'b': [2]}) # ok
>>> df
shape: (1, 2)
┌─────┬─────┐
│     ┆ b   │
│ --- ┆ --- │
│ i64 ┆ i64 │
╞═════╪═════╡
│ 1   ┆ 2   │
└─────┴─────┘
>>> pl.DataFrame(df) # ok
shape: (1, 2)
┌─────┬─────┐
│     ┆ b   │
│ --- ┆ --- │
│ i64 ┆ i64 │
╞═════╪═════╡
│ 1   ┆ 2   │
└─────┴─────┘
>>> pl.DataFrame(df, schema=df.columns) # column_0
shape: (1, 2)
┌──────────┬─────┐
│ column_0 ┆ b   │
│ ---      ┆ --- │
│ i64      ┆ i64 │
╞══════════╪═════╡
│ 1        ┆ 2   │
└──────────┴─────┘
```

I may also open an issue in scikit-learn.


We don't need `TransformerMixin`. All it does is (1) apply the `set_output` wrapping, which changes the column name in this case and is unnecessary because the skrub transformer already produces a dataframe of the correct type with the correct column names, and (2) provide a default `fit_transform`, which we define ourselves anyway. So the simple fix is to not inherit from `TransformerMixin`.

If we absolutely want to inherit from it, we can pass `auto_wrap_output_keys=()` in the class definition, e.g.

```
class OnEachColumn(TransformerMixin, BaseEstimator, auto_wrap_output_keys=()):
```

to disable this wrapping.
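
For readers unfamiliar with class keyword arguments, here is a toy model of the mechanism, using only the standard library. `ToyWrapMixin`, `Wrapped` and `Unwrapped` are hypothetical names, and this is not scikit-learn's code: it only imitates the idea that a mixin can wrap `transform` in `__init_subclass__` unless the subclass opts out. The renaming of '' inside the wrapper is hard-coded to mimic the behavior described in this issue (in the real case it is a side effect of `pl.DataFrame(..., schema=...)`), and `transform` works on a plain list of column names rather than a dataframe.

```python
import functools


class ToyWrapMixin:
    """Toy stand-in for a mixin that wraps `transform` at subclass creation."""

    def __init_subclass__(cls, auto_wrap_output_keys=("transform",), **kwargs):
        super().__init_subclass__(**kwargs)
        if "transform" not in auto_wrap_output_keys:
            return  # the subclass opted out: leave `transform` untouched
        if "transform" not in cls.__dict__:
            return  # nothing defined on this class to wrap
        original = cls.__dict__["transform"]

        @functools.wraps(original)
        def wrapped(self, columns):
            out = original(self, columns)
            # Assumption: imitate the renaming seen in the issue,
            # where '' at position i becomes 'column_<i>'.
            return [name or f"column_{i}" for i, name in enumerate(out)]

        cls.transform = wrapped


class Wrapped(ToyWrapMixin):
    def transform(self, columns):
        return list(columns)


class Unwrapped(ToyWrapMixin, auto_wrap_output_keys=()):
    def transform(self, columns):
        return list(columns)


print(Wrapped().transform(["", "b"]))    # ['column_0', 'b']
print(Unwrapped().transform(["", "b"]))  # ['', 'b']
```

With the empty tuple, the subclass's `transform` is never replaced, so whatever column names it returns reach the caller unchanged.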

Otherwise, if having an empty column name is a problem, maybe `CheckInputDataFrame` could find and replace such names (it already checks for duplicate names). So far, though, having '' as a column name does not seem to cause any issues other than the one reported here.
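
As a rough sketch of what "find and replace such names" could look like, the helper below works on a plain list of column names. `fill_empty_column_names` and the `unnamed_` prefix are hypothetical and not part of skrub. The only real requirement it tries to respect is that the replacement must not collide with an existing name.

```python
def fill_empty_column_names(names, prefix="unnamed_"):
    """Return a copy of `names` where each '' is replaced by a fresh, unused name.

    Hypothetical helper for illustration; not an existing skrub function.
    """
    taken = set(names)
    result = []
    counter = 0
    for name in names:
        if name == "":
            candidate = f"{prefix}{counter}"
            # Skip over candidates that already exist as column names.
            while candidate in taken:
                counter += 1
                candidate = f"{prefix}{counter}"
            taken.add(candidate)
            counter += 1
            name = candidate
        result.append(name)
    return result


print(fill_empty_column_names(["", "b"]))          # ['unnamed_0', 'b']
print(fill_empty_column_names(["", "unnamed_0"]))  # ['unnamed_1', 'unnamed_0']
```

The new names could then be applied with the dataframe library's own renaming method. Whether silently renaming a user's column is desirable is a separate design question, since the output names would no longer match the input.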

### Steps/Code to Reproduce

```python
import polars as pl
import skrub

df = pl.DataFrame({'': [1], 'b': [2]})
skrub.Cleaner().fit_transform(df)
```

### Expected Results

```
shape: (1, 2)
┌─────┬─────┐
│     ┆ b   │
│ --- ┆ --- │
│ i64 ┆ i64 │
╞═════╪═════╡
│ 1   ┆ 2   │
└─────┴─────┘
```

### Actual Results

```
Traceback (most recent call last):
  File "empty-col-name-reproducer.py", line 6, in <module>
    skrub.Cleaner().fit_transform(df)
  File "/site-packages/sklearn/utils/_set_output.py", line 319, in wrapped
    data_to_wrap = f(self, X, *args, **kwargs)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/site-packages/skrub/_table_vectorizer.py", line 290, in fit_transform
    self.all_processing_steps_[col].append(transformer)
    ~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^
KeyError: 'column_0'
```

### Versions

```shell
polars 1.29.0, sklearn 1.6.1, skrub 0.5.3
```
