Commit b044ef8

FIX only consider "?" as missing marker as per ARFF specs (scikit-learn#26551)
1 parent 9eea5b7 commit b044ef8

2 files changed, +6 -0 lines changed

doc/whats_new/v1.3.rst

Lines changed: 5 additions & 0 deletions
@@ -274,6 +274,11 @@ Changelog
 - |Fix| :func:`datasets.fetch_openml` returns improved data types when
   `as_frame=True` and `parser="liac-arff"`. :pr:`26386` by `Thomas Fan`_.
 
+- |Fix| Following the ARFF specs, only the marker `"?"` is now considered as a missing
+  values when opening ARFF files fetched using :func:`datasets.fetch_openml` when using
+  the pandas parser. The parameter `read_csv_kwargs` allows to overwrite this behaviour.
+  :pr:`26551` by :user:`Guillaume Lemaitre <glemaitre>`.
+
 - |Enhancement| Allows to overwrite the parameters used to open the ARFF file using
   the parameter `read_csv_kwargs` in :func:`datasets.fetch_openml` when using the
   pandas parser.
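
To illustrate what the changelog entry means in practice, here is a minimal sketch using `pandas.read_csv` directly on in-memory text. The sample rows and the column names `id` and `label` are made up for illustration; the only options taken from this commit are `na_values=["?"]` and `keep_default_na=False`. With pandas' default NA list left on, strings such as `NA` or `n/a` are also turned into missing values, which is what the fix avoids.

```python
import io

import pandas as pd

# Made-up rows: "?" is the ARFF missing marker, while "NA" and "n/a"
# are meant to be ordinary string values.
raw = "x,NA\ny,?\nz,n/a\n"

common = {"header": None, "names": ["id", "label"], "na_values": ["?"]}

# pandas' default NA strings stay active in addition to "?".
loose = pd.read_csv(io.StringIO(raw), **common)

# Only "?" is treated as missing, matching the behaviour after this commit.
strict = pd.read_csv(io.StringIO(raw), keep_default_na=False, **common)

print(loose["label"].isna().tolist())
print(strict["label"].isna().tolist())
```

In the first frame every `label` entry is missing; in the second only the row containing `?` is.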

sklearn/datasets/_arff_parser.py

Lines changed: 1 addition & 0 deletions
@@ -387,6 +387,7 @@ def _pandas_arff_parser(
         "header": None,
         "index_col": False, # always force pandas to not use the first column as index
         "na_values": ["?"], # missing values are represented by `?`
+        "keep_default_na": False, # only `?` is a missing value given the ARFF specs
         "comment": "%", # skip line starting by `%` since they are comments
         "quotechar": '"', # delimiter to use for quoted strings
         "skipinitialspace": True, # skip spaces after delimiter to follow ARFF specs
