Commit f721c6d

FIX use np.nan instead of None for missing marker in fetch_openml (scikit-learn#26579)
1 parent 9c266cf commit f721c6d

3 files changed: +9 -4 lines changed

doc/whats_new/v1.3.rst

Lines changed: 4 additions & 0 deletions
@@ -283,6 +283,10 @@ Changelog
   the pandas parser. The parameter `read_csv_kwargs` allows to overwrite this behaviour.
   :pr:`26551` by :user:`Guillaume Lemaitre <glemaitre>`.
 
+- |Fix| :func:`datasets.fetch_openml` will consistently use `np.nan` as missing marker
+  with both parsers `"pandas"` and `"liac-arff"`.
+  :pr:`26579` by :user:`Guillaume Lemaitre <glemaitre>`.
+
 - |Enhancement| Allows to overwrite the parameters used to open the ARFF file using
   the parameter `read_csv_kwargs` in :func:`datasets.fetch_openml` when using the
   pandas parser.

sklearn/datasets/_arff_parser.py

Lines changed: 4 additions & 1 deletion
@@ -204,7 +204,10 @@ def _io_to_generator(gzip_file):
     if len(dfs) >= 2:
         dfs[0] = dfs[0].astype(dfs[1].dtypes)
 
-    frame = pd.concat(dfs, ignore_index=True)
+    # liac-arff parser does not depend on NumPy and uses None to represent
+    # missing values. To be consistent with the pandas parser, we replace
+    # None with np.nan.
+    frame = pd.concat(dfs, ignore_index=True).fillna(value=np.nan)
     del dfs, first_df
 
     # cast the columns frame
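
The effect of the added `.fillna(value=np.nan)` call can be illustrated with a small standalone sketch. It is not the scikit-learn code itself: the helper name `concat_with_nan_marker` and the toy `chunks` are hypothetical stand-ins for the chunked frames produced by the liac-arff parser, and the columns are given `object` dtype explicitly so that `None` is preserved before the fill.

```python
import numpy as np
import pandas as pd


def concat_with_nan_marker(chunks):
    """Concatenate chunks and replace None with np.nan.

    Hypothetical helper mirroring the one-line change in this commit.
    """
    return pd.concat(chunks, ignore_index=True).fillna(value=np.nan)


# Toy chunks standing in for the liac-arff output (assumption for this
# sketch); None marks the missing values, as liac-arff does.
chunks = [
    pd.DataFrame({"label": ["a", None]}, dtype=object),
    pd.DataFrame({"label": [None, "b"]}, dtype=object),
]

# Old behaviour: the missing entries remain None after concatenation.
before = pd.concat(chunks, ignore_index=True)

# New behaviour: the missing entries become np.nan.
after = concat_with_nan_marker(chunks)

print(before["label"].tolist())
print(after["label"].tolist())
```

Only the missing marker is compared here. The resulting column dtype is deliberately not asserted, because it can depend on the installed pandas version.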

sklearn/datasets/tests/test_openml.py

Lines changed: 1 addition & 3 deletions
@@ -920,9 +920,7 @@ def datasets_missing_values():
         (1119, "liac-arff", 9, 6, 0),
         (1119, "pandas", 9, 0, 6),
         # miceprotein
-        # 1 column has only missing values with object dtype
-        (40966, "liac-arff", 1, 76, 0),
-        # with casting it will be transformed to either float or Int64
+        (40966, "liac-arff", 1, 77, 0),
         (40966, "pandas", 1, 77, 0),
         # titanic
         (40945, "liac-arff", 3, 6, 0),
