
Commit 2b471c8

jorisvandenbossche, simonjayhawkins, and jbrockmendel authored

DOC: add pandas 3.0 migration guide for the string dtype (#61705)

Co-authored-by: Simon Hawkins <simonjayhawkins@gmail.com>
Co-authored-by: jbrockmendel <jbrockmendel@gmail.com>

1 parent e5a1c10 commit 2b471c8

File tree

2 files changed: +387 -0 lines changed

doc/source/user_guide/index.rst

Lines changed: 1 addition & 0 deletions

@@ -87,5 +87,6 @@ Guides
     enhancingperf
     scale
     sparse
+    migration-3-strings
     gotchas
     cookbook
doc/source/user_guide/migration-3-strings.rst

Lines changed: 386 additions & 0 deletions

@@ -0,0 +1,386 @@
{{ header }}

.. _string_migration_guide:

=========================================================
Migration guide for the new string data type (pandas 3.0)
=========================================================

The upcoming pandas 3.0 release introduces a new, default string data type. This
will most likely cause some work when upgrading to pandas 3.0, and this page
provides an overview of the issues you might run into and gives guidance on how
to address them.

This new dtype is already available in the pandas 2.3 release, and you can
enable it with:

.. code-block:: python

    pd.options.future.infer_string = True

This allows you to test your code before the final 3.0 release.
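
To verify that the option took effect, you can check that string data is now
inferred as a string dtype (a quick check, assuming pandas 2.3 or later;
``pd.set_option`` is equivalent to the attribute assignment above):

.. code-block:: python

    >>> pd.set_option("future.infer_string", True)  # same effect as the assignment above
    >>> ser = pd.Series(["a", "b", None])
    >>> pd.api.types.is_string_dtype(ser.dtype)  # no longer plain object dtype
    True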

Background
----------

Historically, pandas has always used the NumPy ``object`` dtype as the default
to store text data. This has two primary drawbacks. First, ``object`` dtype is
not specific to strings: any Python object can be stored in an ``object``-dtype
array, so seeing ``object`` as the dtype for a column with strings is confusing
for users. Second, ``object`` dtype is not always very efficient, in terms of
both performance and memory usage.

Since pandas 1.0, an opt-in string data type has been available, but it has not
yet been made the default, and it uses the ``pd.NA`` scalar to represent missing
values.

Pandas 3.0 changes the default dtype for strings to a new string data type,
a variant of the existing optional string data type but using ``NaN`` as the
missing value indicator, to be consistent with the other default data types.

To improve performance, the new string data type will use the ``pyarrow``
package by default, if installed (and otherwise it uses object dtype under the
hood as a fallback).
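
To see which backend is in use in your environment, you can inspect the
``storage`` attribute of the dtype (a small sketch, assuming pandas 3.0 or
pandas 2.3 with the option enabled; the output depends on whether ``pyarrow``
is installed):

.. code-block:: python

    >>> ser = pd.Series(["a", "b", "c"])
    >>> ser.dtype.storage  # "pyarrow" if installed, otherwise "python"
    'pyarrow'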

See `PDEP-14: Dedicated string data type for pandas 3.0 <https://pandas.pydata.org/pdeps/0014-string-dtype.html>`__
for more background and details.

.. - brief primer on the new dtype

.. - Main characteristics:
.. - inferred by default (Default inference of a string dtype)
.. - only strings (setitem with non string fails)
.. - missing values sentinel is always NaN and uses NaN semantics

.. - Breaking changes:
.. - dtype is no longer object dtype
.. - None gets coerced to NaN
.. - setitem raises an error for non-string data

Brief introduction to the new default string dtype
--------------------------------------------------

By default, pandas will infer this new string dtype instead of object dtype for
string data (when creating pandas objects, such as in constructors or IO
functions).

Being a default dtype means that the string dtype will be used in IO methods or
constructors when the dtype is being inferred and the input is inferred to be
string data:

.. code-block:: python

    >>> pd.Series(["a", "b", None])
    0      a
    1      b
    2    NaN
    dtype: str

It can also be specified explicitly using the ``"str"`` alias:

.. code-block:: python

    >>> pd.Series(["a", "b", None], dtype="str")
    0      a
    1      b
    2    NaN
    dtype: str

Similarly, functions like :func:`read_csv`, :func:`read_parquet`, and others
will now use the new string dtype when reading string data.
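
For example, reading CSV data with a text column gives a ``str`` column (a
small sketch using an in-memory file; on pandas 2.x without the option enabled,
the same column would be ``object`` dtype):

.. code-block:: python

    >>> import io
    >>> csv_data = "name,value\nalice,1\nbob,2"
    >>> df = pd.read_csv(io.StringIO(csv_data))
    >>> df["name"].dtype == "str"
    True
    >>> df["value"].dtype == "int64"
    True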

In contrast to the current object dtype, the new string dtype will only store
strings. This also means that it will raise an error if you try to store a
non-string value in it (see below for more details).

Missing values with the new string dtype are always represented as ``NaN`` (``np.nan``),
and the missing value behavior is similar to other default dtypes.

This new string dtype should otherwise behave the same as the existing
``object`` dtype users are used to. For example, all string-specific methods
through the ``str`` accessor will work the same:

.. code-block:: python

    >>> ser = pd.Series(["a", "b", None], dtype="str")
    >>> ser.str.upper()
    0      A
    1      B
    2    NaN
    dtype: str

.. note::

    The new default string dtype is an instance of the :class:`pandas.StringDtype`
    class. The dtype can be constructed as ``pd.StringDtype(na_value=np.nan)``,
    but for general usage we recommend using the shorter ``"str"`` alias.
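
    For example, the explicit dtype object gives the same result as the
    ``"str"`` alias (a small sketch, assuming pandas 3.0 or pandas 2.3 with
    the option enabled):

    .. code-block:: python

        >>> pd.Series(["a", "b", None], dtype=pd.StringDtype(na_value=np.nan))
        0      a
        1      b
        2    NaN
        dtype: str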

Overview of behavior differences and how to address them
---------------------------------------------------------

The dtype is no longer object dtype
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

When inferring or reading string data, the data type of the resulting DataFrame
column or Series will silently change to the new ``"str"`` dtype instead of
``"object"`` dtype, and this can have some impact on your code.

Checking the dtype
^^^^^^^^^^^^^^^^^^

When checking the dtype, code might currently do something like:

.. code-block:: python

    >>> ser = pd.Series(["a", "b", "c"])
    >>> ser.dtype == "object"

to check for columns with string data (by checking for the dtype being
``"object"``). This will no longer work in pandas 3+, since ``ser.dtype`` will
now be ``"str"`` with the new default string dtype, and the above check will
return ``False``.

To check for columns with string data, you should instead use:

.. code-block:: python

    >>> ser.dtype == "str"

**How to write compatible code?**

For code that should work on both pandas 2.x and 3.x, you can use the
:func:`pandas.api.types.is_string_dtype` function:

.. code-block:: python

    >>> pd.api.types.is_string_dtype(ser.dtype)
    True

This will return ``True`` for both the object dtype and the string dtypes.
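
The same check also works for DataFrame columns; for example, to select the
names of all string columns in a way that works on both pandas 2.x and 3.x
(a small sketch):

.. code-block:: python

    >>> df = pd.DataFrame({"a": ["x", "y"], "b": [1, 2]})
    >>> [col for col in df.columns if pd.api.types.is_string_dtype(df[col].dtype)]
    ['a']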

Hardcoded use of object dtype
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

If you have code where the dtype is hardcoded in constructors, like

.. code-block:: python

    >>> pd.Series(["a", "b", "c"], dtype="object")

this will keep using the object dtype. You will want to update this code to
ensure you get the benefits of the new string dtype.

**How to write compatible code?**

First, in many cases it can be sufficient to remove the specific data type, and
let pandas do the inference. But if you want to be specific, you can specify the
``"str"`` dtype:

.. code-block:: python

    >>> pd.Series(["a", "b", "c"], dtype="str")

This is actually compatible with pandas 2.x as well, since in pandas < 3,
``dtype="str"`` was essentially treated as an alias for object dtype.

The missing value sentinel is now always NaN
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

When using object dtype, multiple possible missing value sentinels are
supported, including ``None`` and ``np.nan``. With the new default string dtype,
the missing value sentinel is always NaN (``np.nan``):

.. code-block:: python

    # with object dtype, None is preserved as None and seen as missing
    >>> ser = pd.Series(["a", "b", None], dtype="object")
    >>> ser
    0       a
    1       b
    2    None
    dtype: object
    >>> print(ser[2])
    None

    # with the new string dtype, any missing value like None is coerced to NaN
    >>> ser = pd.Series(["a", "b", None], dtype="str")
    >>> ser
    0      a
    1      b
    2    NaN
    dtype: str
    >>> print(ser[2])
    nan

Generally this should be no problem when relying on missing value behavior in
pandas methods (for example, ``ser.isna()`` will give the same result as before).
But if you relied on the exact value of ``None`` being present, that can
impact your code.

**How to write compatible code?**

When checking for a missing value, instead of checking for the exact value of
``None`` or ``np.nan``, you should use the :func:`pandas.isna` function. This is
the most robust way to check for missing values, as it will work regardless of
the dtype and the exact missing value sentinel:

.. code-block:: python

    >>> pd.isna(ser[2])
    True

One caveat: this function works both on scalars and on array-likes, and in the
latter case it will return an array of bools. When using it in a Boolean context
(for example, ``if pd.isna(..): ..``), be sure to only pass a scalar to it.
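
To make the caveat concrete (a small sketch): when given the whole Series,
:func:`pandas.isna` returns a boolean Series, which you need to reduce
explicitly before using it in a Boolean context:

.. code-block:: python

    >>> pd.isna(ser)
    0    False
    1    False
    2     True
    dtype: bool
    >>> pd.isna(ser).any()  # reduce explicitly for use in a boolean context
    True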

"setitem" operations will now raise an error for non-string data
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

With the new string dtype, any attempt to set a non-string value in a Series or
DataFrame will raise an error:

.. code-block:: python

    >>> ser = pd.Series(["a", "b", None], dtype="str")
    >>> ser[1] = 2.5
    ---------------------------------------------------------------------------
    TypeError                                 Traceback (most recent call last)
    ...
    TypeError: Invalid value '2.5' for dtype 'str'. Value should be a string or missing value, got 'float' instead.

If you relied on the flexible nature of object dtype being able to hold any
Python object, but your initial data was inferred as strings, your code might be
impacted by this change.

**How to write compatible code?**

You can update your code to ensure you only set string values in such columns,
or otherwise you can explicitly ensure the column has object dtype first. This
can be done by specifying the dtype explicitly in the constructor, or by using
the :meth:`~pandas.Series.astype` method:

.. code-block:: python

    >>> ser = pd.Series(["a", "b", None], dtype="str")
    >>> ser = ser.astype("object")
    >>> ser[1] = 2.5

This ``astype("object")`` call will be redundant when using pandas 2.x, but
this code will work for all versions.
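
Alternatively, if you prefer to keep the column as the new string dtype, you can
convert the value to a string before assigning it (a small sketch):

.. code-block:: python

    >>> ser = pd.Series(["a", "b", None], dtype="str")
    >>> ser[1] = str(2.5)  # store the string representation instead
    >>> ser
    0      a
    1    2.5
    2    NaN
    dtype: str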

Invalid unicode input
~~~~~~~~~~~~~~~~~~~~~

Python allows a built-in ``str`` object to contain invalid unicode data, such as
lone surrogates. And since the ``object`` dtype can hold any Python object, you
can have a pandas Series with such invalid unicode data:

.. code-block:: python

    >>> ser = pd.Series(["\u2600", "\ud83d"], dtype=object)
    >>> ser
    0         ☀
    1    \ud83d
    dtype: object

However, when the string dtype uses ``pyarrow`` under the hood, it can only
store valid unicode data, and it will otherwise raise an error:

.. code-block:: python

    >>> ser = pd.Series(["\u2600", "\ud83d"])
    ---------------------------------------------------------------------------
    UnicodeEncodeError                        Traceback (most recent call last)
    ...
    UnicodeEncodeError: 'utf-8' codec can't encode character '\ud83d' in position 0: surrogates not allowed

If you want to keep the previous behaviour, you can explicitly specify
``dtype=object`` to keep working with object dtype.

When you have byte data that you want to convert to strings using ``decode()``,
the :meth:`~pandas.Series.str.decode` method now has a ``dtype`` parameter, so
you can specify object dtype instead of the default string dtype for this use
case.
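
For example, to decode byte data into plain Python strings stored in an
object-dtype column (a small sketch of the ``dtype`` parameter mentioned above):

.. code-block:: python

    >>> ser = pd.Series([b"a", b"b"])
    >>> ser.str.decode("utf-8", dtype=object)
    0    a
    1    b
    dtype: object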

Notable bug fixes
~~~~~~~~~~~~~~~~~

``astype(str)`` preserving missing values
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

This is a long-standing "bug" or misfeature, as discussed in https://github.com/pandas-dev/pandas/issues/25353.

With pandas < 3, when using ``astype(str)`` (using the built-in :func:`str`, not
``astype("str")``!), the operation would convert every element to a string,
including the missing values:

.. code-block:: python

    # OLD behavior in pandas < 3
    >>> ser = pd.Series(["a", np.nan], dtype=object)
    >>> ser
    0      a
    1    NaN
    dtype: object
    >>> ser.astype(str)
    0      a
    1    nan
    dtype: object
    >>> ser.astype(str).to_numpy()
    array(['a', 'nan'], dtype=object)

Note how ``NaN`` (``np.nan``) was converted to the string ``"nan"``. This was
not the intended behavior, and it was inconsistent with how other dtypes handled
missing values.

With pandas 3, this behavior has been fixed, and ``astype(str)`` is now an alias
for ``astype("str")``, i.e. casting to the new string dtype, which preserves
the missing values:

.. code-block:: python

    # NEW behavior in pandas 3
    >>> pd.options.future.infer_string = True
    >>> ser = pd.Series(["a", np.nan], dtype=object)
    >>> ser.astype(str)
    0      a
    1    NaN
    dtype: str
    >>> ser.astype(str).values
    array(['a', nan], dtype=object)

If you want to preserve the old behaviour of converting every object to a
string, you can use ``ser.map(str)`` instead.
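
A small sketch of that workaround; since ``map(str)`` converts every element,
the missing value is stringified to ``"nan"`` just like before:

.. code-block:: python

    >>> ser = pd.Series(["a", np.nan], dtype=object)
    >>> ser.map(str).to_numpy()
    array(['a', 'nan'], dtype=object)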

``prod()`` raising for string data
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

In pandas < 3, calling the :meth:`~pandas.Series.prod` method on a Series with
string data would generally raise an error, except when the Series was empty or
contained only a single string (potentially with missing values):

.. code-block:: python

    >>> ser = pd.Series(["a", None], dtype=object)
    >>> ser.prod()
    'a'

When the Series contains multiple strings, it will raise a ``TypeError``. This
behaviour stays the same in pandas 3 when using the flexible ``object`` dtype.
But with the new string dtype, this now consistently raises an error, regardless
of the number of strings:

.. code-block:: python

    >>> ser = pd.Series(["a", None], dtype="str")
    >>> ser.prod()
    ---------------------------------------------------------------------------
    TypeError                                 Traceback (most recent call last)
    ...
    TypeError: Cannot perform reduction 'prod' with string dtype

.. For existing users of the nullable ``StringDtype``
.. --------------------------------------------------

.. TODO