|
| 1 | +{{ header }} |
| 2 | + |
| 3 | +.. _string_migration_guide: |
| 4 | + |
| 5 | +========================================================= |
| 6 | +Migration guide for the new string data type (pandas 3.0) |
| 7 | +========================================================= |
| 8 | + |
| 9 | +The upcoming pandas 3.0 release introduces a new, default string data type. |
| 10 | +Upgrading will most likely require some changes to your code, and this page |
| 11 | +provides an overview of the issues you might run into and gives guidance on how |
| 12 | +to address them. |
| 13 | + |
| 14 | +This new dtype is already available in the pandas 2.3 release, and you can |
| 15 | +enable it with: |
| 16 | + |
| 17 | +.. code-block:: python |
| 18 | +
|
| 19 | + pd.options.future.infer_string = True |
| 20 | +
|
| 21 | +This allows you to test your code before the final 3.0 release. |
| 22 | + |
| 23 | +Background |
| 24 | +---------- |
| 25 | + |
| 26 | +Historically, pandas has always used the NumPy ``object`` dtype as the default |
| 27 | +to store text data. This has two primary drawbacks. First, ``object`` dtype is |
| 28 | +not specific to strings: any Python object can be stored in an ``object``-dtype |
| 29 | +array, not just strings, and seeing ``object`` as the dtype for a column with |
| 30 | +strings is confusing for users. Second, it is not always efficient, in terms of |
| 31 | +both performance and memory usage. |
| 32 | + |
| 33 | +Since pandas 1.0, an opt-in string data type has been available, but it has |
| 34 | +not been made the default, and it uses the ``pd.NA`` scalar to represent |
| 35 | +missing values. |
| 36 | + |
| 37 | +Pandas 3.0 changes the default dtype for strings to a new string data type, |
| 38 | +a variant of the existing optional string data type but using ``NaN`` as the |
| 39 | +missing value indicator, to be consistent with the other default data types. |
| 40 | + |
| 41 | +To improve performance, the new string data type will use the ``pyarrow`` |
| 42 | +package by default, if installed (and otherwise it uses object dtype under the |
| 43 | +hood as a fallback). |
| 44 | + |
| 45 | +See `PDEP-14: Dedicated string data type for pandas 3.0 <https://pandas.pydata.org/pdeps/0014-string-dtype.html>`__ |
| 46 | +for more background and details. |
| 47 | + |
| 48 | +.. - brief primer on the new dtype |
| 49 | +
|
| 50 | +.. - Main characteristics: |
| 51 | +.. - inferred by default (Default inference of a string dtype) |
| 52 | +.. - only strings (setitem with non string fails) |
| 53 | +.. - missing values sentinel is always NaN and uses NaN semantics |
| 54 | +
|
| 55 | +.. - Breaking changes: |
| 56 | +.. - dtype is no longer object dtype |
| 57 | +.. - None gets coerced to NaN |
| 58 | +.. - setitem raises an error for non-string data |
| 59 | +
|
| 60 | +Brief introduction to the new default string dtype |
| 61 | +-------------------------------------------------- |
| 62 | + |
| 63 | +By default, pandas will infer this new string dtype instead of object dtype for |
| 64 | +string data (when creating pandas objects, such as in constructors or IO |
| 65 | +functions). |
| 66 | + |
| 67 | +Being a default dtype means that the string dtype will be used in IO methods or |
| 68 | +constructors when the dtype is being inferred and the input is inferred to be |
| 69 | +string data: |
| 70 | + |
| 71 | +.. code-block:: python |
| 72 | +
|
| 73 | + >>> pd.Series(["a", "b", None]) |
| 74 | + 0 a |
| 75 | + 1 b |
| 76 | + 2 NaN |
| 77 | + dtype: str |
| 78 | +
|
| 79 | +It can also be specified explicitly using the ``"str"`` alias: |
| 80 | + |
| 81 | +.. code-block:: python |
| 82 | +
|
| 83 | + >>> pd.Series(["a", "b", None], dtype="str") |
| 84 | + 0 a |
| 85 | + 1 b |
| 86 | + 2 NaN |
| 87 | + dtype: str |
| 88 | +
|
| 89 | +Similarly, functions like :func:`read_csv`, :func:`read_parquet`, and others |
| 90 | +will now use the new string dtype when reading string data. |
| 91 | + |
| 92 | +In contrast to the current object dtype, the new string dtype will only store |
| 93 | +strings. This also means that it will raise an error if you try to store a |
| 94 | +non-string value in it (see below for more details). |
| 95 | + |
| 96 | +Missing values with the new string dtype are always represented as ``NaN`` (``np.nan``), |
| 97 | +and the missing value behavior is similar to other default dtypes. |
| 98 | + |
| 99 | +This new string dtype should otherwise behave the same as the existing |
| 100 | +``object`` dtype users are used to. For example, all string-specific methods |
| 101 | +through the ``str`` accessor will work the same: |
| 102 | + |
| 103 | +.. code-block:: python |
| 104 | +
|
| 105 | + >>> ser = pd.Series(["a", "b", None], dtype="str") |
| 106 | + >>> ser.str.upper() |
| 107 | + 0 A |
| 108 | + 1 B |
| 109 | + 2 NaN |
| 110 | + dtype: str |
| 111 | +
|
| 112 | +.. note:: |
| 113 | + |
| 114 | + The new default string dtype is an instance of the :class:`pandas.StringDtype` |
| 115 | + class. The dtype can be constructed as ``pd.StringDtype(na_value=np.nan)``, |
| 116 | + but for general usage we recommend using the shorter ``"str"`` alias. |
| 117 | + |
| 118 | +Overview of behavior differences and how to address them |
| 119 | +--------------------------------------------------------- |
| 120 | + |
| 121 | +The dtype is no longer object dtype |
| 122 | +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ |
| 123 | + |
| 124 | +When inferring or reading string data, the data type of the resulting DataFrame |
| 125 | +column or Series will silently become the new ``"str"`` dtype instead of |
| 126 | +``"object"`` dtype, and this can have some impact on your code. |
| 127 | + |
| 128 | +Checking the dtype |
| 129 | +^^^^^^^^^^^^^^^^^^ |
| 130 | + |
| 131 | +When checking the dtype, code might currently do something like: |
| 132 | + |
| 133 | +.. code-block:: python |
| 134 | +
|
| 135 | + >>> ser = pd.Series(["a", "b", "c"]) |
| 136 | + >>> ser.dtype == "object" |
| 137 | +
|
| 138 | +to check for columns with string data (by checking for the dtype being |
| 139 | +``"object"``). This will no longer work in pandas 3+, since ``ser.dtype`` will |
| 140 | +now be ``"str"`` with the new default string dtype, and the above check will |
| 141 | +return ``False``. |
| 142 | + |
| 143 | +To check for columns with string data, you should instead use: |
| 144 | + |
| 145 | +.. code-block:: python |
| 146 | +
|
| 147 | + >>> ser.dtype == "str" |
| 148 | +
|
| 149 | +**How to write compatible code?** |
| 150 | + |
| 151 | +For code that should work on both pandas 2.x and 3.x, you can use the |
| 152 | +:func:`pandas.api.types.is_string_dtype` function: |
| 153 | + |
| 154 | +.. code-block:: python |
| 155 | +
|
| 156 | + >>> pd.api.types.is_string_dtype(ser.dtype) |
| 157 | + True |
| 158 | +
|
| 159 | +This will return ``True`` for both the object dtype and the string dtypes. |
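A minimal sketch of how this check could be applied to a whole DataFrame, assuming a hypothetical helper named ``string_columns`` (not part of pandas). Because the dtype itself is passed to ``is_string_dtype``, any object-dtype column matches, even one that holds non-string objects:

```python
import pandas as pd
from pandas.api.types import is_string_dtype


def string_columns(df):
    """Return the labels of columns whose dtype is object or a string dtype.

    Hypothetical helper for illustration; not part of the pandas API.
    """
    return [col for col in df.columns if is_string_dtype(df[col].dtype)]


df = pd.DataFrame({"name": ["a", "b"], "value": [1.0, 2.0]})
print(string_columns(df))  # ['name'] on both pandas 2.x and 3.x
```

If object columns in your data can hold mixed Python objects, you may want an additional check on the values before treating the column as text.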
| 160 | + |
| 161 | +Hardcoded use of object dtype |
| 162 | +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ |
| 163 | + |
| 164 | +If you have code where the dtype is hardcoded in constructors, like |
| 165 | + |
| 166 | +.. code-block:: python |
| 167 | +
|
| 168 | + >>> pd.Series(["a", "b", "c"], dtype="object") |
| 169 | +
|
| 170 | +this will keep using the object dtype. You will want to update this code to |
| 171 | +ensure you get the benefits of the new string dtype. |
| 172 | + |
| 173 | +**How to write compatible code?** |
| 174 | + |
| 175 | +First, in many cases it can be sufficient to remove the specific data type, and |
| 176 | +let pandas do the inference. But if you want to be specific, you can specify the |
| 177 | +``"str"`` dtype: |
| 178 | + |
| 179 | +.. code-block:: python |
| 180 | +
|
| 181 | + >>> pd.Series(["a", "b", "c"], dtype="str") |
| 182 | +
|
| 183 | +This is actually compatible with pandas 2.x as well, since in pandas < 3, |
| 184 | +``dtype="str"`` was essentially treated as an alias for object dtype. |
| 185 | + |
| 186 | +The missing value sentinel is now always NaN |
| 187 | +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ |
| 188 | + |
| 189 | +When using object dtype, multiple possible missing value sentinels are |
| 190 | +supported, including ``None`` and ``np.nan``. With the new default string dtype, |
| 191 | +the missing value sentinel is always NaN (``np.nan``): |
| 192 | + |
| 193 | +.. code-block:: python |
| 194 | +
|
| 195 | + # with object dtype, None is preserved as None and seen as missing |
| 196 | + >>> ser = pd.Series(["a", "b", None], dtype="object") |
| 197 | + >>> ser |
| 198 | + 0 a |
| 199 | + 1 b |
| 200 | + 2 None |
| 201 | + dtype: object |
| 202 | + >>> print(ser[2]) |
| 203 | + None |
| 204 | +
|
| 205 | + # with the new string dtype, any missing value like None is coerced to NaN |
| 206 | + >>> ser = pd.Series(["a", "b", None], dtype="str") |
| 207 | + >>> ser |
| 208 | + 0 a |
| 209 | + 1 b |
| 210 | + 2 NaN |
| 211 | + dtype: str |
| 212 | + >>> print(ser[2]) |
| 213 | + nan |
| 214 | +
|
| 215 | +Generally this should be no problem when relying on missing value behavior in |
| 216 | +pandas methods (for example, ``ser.isna()`` will give the same result as before). |
| 217 | +But if your code relies on the exact value ``None`` being present, this change |
| 218 | +can affect it. |
| 219 | + |
| 220 | +**How to write compatible code?** |
| 221 | + |
| 222 | +When checking for a missing value, instead of checking for the exact value of |
| 223 | +``None`` or ``np.nan``, you should use the :func:`pandas.isna` function. This is |
| 224 | +the most robust way to check for missing values, as it will work regardless of |
| 225 | +the dtype and the exact missing value sentinel: |
| 226 | + |
| 227 | +.. code-block:: python |
| 228 | +
|
| 229 | + >>> pd.isna(ser[2]) |
| 230 | + True |
| 231 | +
|
| 232 | +One caveat: this function works both on scalars and on array-likes, and in the |
| 233 | +latter case it will return an array of bools. When using it in a Boolean context |
| 234 | +(for example, ``if pd.isna(..): ..``) be sure to only pass a scalar to it. |
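The scalar versus array-like distinction can be sketched as follows. The helper name ``is_missing_scalar`` is an assumption made for illustration; only ``pd.isna`` comes from pandas:

```python
import numpy as np
import pandas as pd


def is_missing_scalar(value):
    """Return a plain bool; intended for scalar input only (hypothetical helper)."""
    return bool(pd.isna(value))


ser = pd.Series(["a", "b", None])

# Scalar input: a single bool that is safe to use in an ``if`` statement.
if is_missing_scalar(ser[2]):
    print("third element is missing")

# Array-like input: an element-wise boolean mask; reduce it explicitly
# (for example with .any() or .all()) before using it in a condition.
mask = pd.isna(ser)
print(mask.tolist())  # [False, False, True]
print(bool(mask.any()))  # True
```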
| 235 | + |
| 236 | +"setitem" operations will now raise an error for non-string data |
| 237 | +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ |
| 238 | + |
| 239 | +With the new string dtype, any attempt to set a non-string value in a Series or |
| 240 | +DataFrame will raise an error: |
| 241 | + |
| 242 | +.. code-block:: python |
| 243 | +
|
| 244 | + >>> ser = pd.Series(["a", "b", None], dtype="str") |
| 245 | + >>> ser[1] = 2.5 |
| 246 | + --------------------------------------------------------------------------- |
| 247 | + TypeError Traceback (most recent call last) |
| 248 | + ... |
| 249 | + TypeError: Invalid value '2.5' for dtype 'str'. Value should be a string or missing value, got 'float' instead. |
| 250 | +
|
| 251 | +If you relied on the flexible nature of object dtype being able to hold any |
| 252 | +Python object, but your initial data was inferred as strings, your code might be |
| 253 | +impacted by this change. |
| 254 | +
|
| 255 | +**How to write compatible code?** |
| 256 | +
|
| 257 | +You can update your code to ensure you only set string values in such columns, |
| 258 | +or otherwise you can explicitly ensure the column has object dtype first. This |
| 259 | +can be done by specifying the dtype explicitly in the constructor, or by using |
| 260 | +the :meth:`~pandas.Series.astype` method: |
| 261 | + |
| 262 | +.. code-block:: python |
| 263 | +
|
| 264 | + >>> ser = pd.Series(["a", "b", None], dtype="str") |
| 265 | + >>> ser = ser.astype("object") |
| 266 | + >>> ser[1] = 2.5 |
| 267 | +
|
| 268 | +This ``astype("object")`` call will be redundant when using pandas 2.x, but |
| 269 | +this code will work for all versions. |
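The other option mentioned above, specifying the dtype in the constructor, could look like the following sketch. It assumes you control the place where the Series is created:

```python
import pandas as pd

# Request object dtype up front so the Series can hold arbitrary Python objects.
ser = pd.Series(["a", "b", None], dtype="object")

# Setting a non-string value is allowed because the dtype is object.
ser[1] = 2.5

print(ser.dtype)  # object
print(ser.tolist())  # ['a', 2.5, None]
```

Note that this gives up the benefits of the new string dtype for that column, so it is best reserved for data that is genuinely mixed.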
| 270 | + |
| 271 | +Invalid unicode input |
| 272 | +~~~~~~~~~~~~~~~~~~~~~ |
| 273 | + |
| 274 | +Python allows a built-in ``str`` object to contain invalid Unicode data, such as |
| 275 | +a lone surrogate. Since the ``object`` dtype can hold any Python object, you can |
| 276 | +have a pandas Series with such invalid Unicode data: |
| 277 | + |
| 278 | +.. code-block:: python |
| 279 | +
|
| 280 | + >>> ser = pd.Series(["\u2600", "\ud83d"], dtype=object) |
| 281 | + >>> ser |
| 282 | + 0 ☀ |
| 283 | + 1 \ud83d |
| 284 | + dtype: object |
| 285 | +
|
| 286 | +However, when the string dtype uses ``pyarrow`` under the hood, it can only |
| 287 | +store valid Unicode data, and will raise an error otherwise: |
| 288 | + |
| 289 | +.. code-block:: python |
| 290 | +
|
| 291 | + >>> ser = pd.Series(["\u2600", "\ud83d"]) |
| 292 | + --------------------------------------------------------------------------- |
| 293 | + UnicodeEncodeError Traceback (most recent call last) |
| 294 | + ... |
| 295 | + UnicodeEncodeError: 'utf-8' codec can't encode character '\ud83d' in position 0: surrogates not allowed |
| 296 | +
|
| 297 | +If you want to keep the previous behavior, you can explicitly specify |
| 298 | +``dtype=object`` to keep working with object dtype. |
| 299 | + |
| 300 | +When you have byte data that you want to convert to strings using ``decode()``, |
| 301 | +the :meth:`~pandas.Series.str.decode` method now has a ``dtype`` parameter that |
| 302 | +lets you specify object dtype instead of the default string dtype for this use |
| 303 | +case. |
| 304 | + |
| 305 | +Notable bug fixes |
| 306 | +~~~~~~~~~~~~~~~~~ |
| 307 | + |
| 308 | +``astype(str)`` preserving missing values |
| 309 | +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ |
| 310 | + |
| 311 | +This is a long-standing "bug" or misfeature, as discussed in https://github.com/pandas-dev/pandas/issues/25353. |
| 312 | + |
| 313 | +With pandas < 3, when using ``astype(str)`` (using the built-in :func:`str`, not |
| 314 | +``astype("str")``!), the operation would convert every element to a string, |
| 315 | +including the missing values: |
| 316 | + |
| 317 | +.. code-block:: python |
| 318 | +
|
| 319 | + # OLD behavior in pandas < 3 |
| 320 | + >>> ser = pd.Series(["a", np.nan], dtype=object) |
| 321 | + >>> ser |
| 322 | + 0 a |
| 323 | + 1 NaN |
| 324 | + dtype: object |
| 325 | + >>> ser.astype(str) |
| 326 | + 0 a |
| 327 | + 1 nan |
| 328 | + dtype: object |
| 329 | + >>> ser.astype(str).to_numpy() |
| 330 | + array(['a', 'nan'], dtype=object) |
| 331 | +
|
| 332 | +Note how ``NaN`` (``np.nan``) was converted to the string ``"nan"``. This was |
| 333 | +not the intended behavior, and it was inconsistent with how other dtypes handled |
| 334 | +missing values. |
| 335 | + |
| 336 | +With pandas 3, this behavior has been fixed, and now ``astype(str)`` is an alias |
| 337 | +for ``astype("str")``, i.e. casting to the new string dtype, which will preserve |
| 338 | +the missing values: |
| 339 | + |
| 340 | +.. code-block:: python |
| 341 | +
|
| 342 | + # NEW behavior in pandas 3 |
| 343 | + >>> pd.options.future.infer_string = True |
| 344 | + >>> ser = pd.Series(["a", np.nan], dtype=object) |
| 345 | + >>> ser.astype(str) |
| 346 | + 0 a |
| 347 | + 1 NaN |
| 348 | + dtype: str |
| 349 | + >>> ser.astype(str).values |
| 350 | + array(['a', nan], dtype=object) |
| 351 | +
|
| 352 | +If you want to preserve the old behavior of converting every object to a |
| 353 | +string, you can use ``ser.map(str)`` instead. |
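As a small illustration of that workaround, the sketch below applies the built-in ``str`` to every element, so the missing value becomes the literal string ``"nan"``:

```python
import numpy as np
import pandas as pd

ser = pd.Series(["a", np.nan], dtype=object)

# map(str) calls str() on each element, including the missing value.
as_text = ser.map(str)

print(as_text.tolist())  # ['a', 'nan']
```

The dtype of the result depends on the pandas version and on dtype inference, so the example only inspects the values.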
| 354 | + |
| 355 | + |
| 356 | +``prod()`` raising for string data |
| 357 | +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ |
| 358 | + |
| 359 | +In pandas < 3, calling the :meth:`~pandas.Series.prod` method on a Series with |
| 360 | +string data would generally raise an error, except when the Series was empty or |
| 361 | +contained only a single string (potentially with missing values): |
| 362 | + |
| 363 | +.. code-block:: python |
| 364 | +
|
| 365 | + >>> ser = pd.Series(["a", None], dtype=object) |
| 366 | + >>> ser.prod() |
| 367 | + 'a' |
| 368 | +
|
| 369 | +When the Series contains multiple strings, it will raise a ``TypeError``. This |
| 370 | +behavior stays the same in pandas 3 when using the flexible ``object`` dtype. |
| 371 | +With the new string dtype, however, this generally raises an error regardless |
| 372 | +of the number of strings: |
| 373 | + |
| 374 | +.. code-block:: python |
| 375 | +
|
| 376 | + >>> ser = pd.Series(["a", None], dtype="str") |
| 377 | + >>> ser.prod() |
| 378 | + --------------------------------------------------------------------------- |
| 379 | + TypeError Traceback (most recent call last) |
| 380 | + ... |
| 381 | + TypeError: Cannot perform reduction 'prod' with string dtype |
| 382 | +
|
| 383 | +.. For existing users of the nullable ``StringDtype`` |
| 384 | +.. -------------------------------------------------- |
| 385 | +
|
| 386 | +.. TODO |