diff --git a/doc/source/whatsnew/v2.3.0.rst b/doc/source/whatsnew/v2.3.0.rst
index 8ca6c0006a604..bf9b2ae2333c0 100644
--- a/doc/source/whatsnew/v2.3.0.rst
+++ b/doc/source/whatsnew/v2.3.0.rst
@@ -10,6 +10,104 @@ including other versions of pandas.
 
 .. ---------------------------------------------------------------------------
+.. _whatsnew_230.upcoming_changes:
+
+Upcoming changes in pandas 3.0
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+pandas 3.0 will bring two bigger changes to the default behavior of pandas.
+
+Dedicated string data type by default
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Historically, pandas represented string columns with NumPy ``object`` data type.
+This representation has numerous problems: it is not specific to strings (any
+Python object can be stored in an ``object``-dtype array, not just strings) and
+it is often not very efficient (both performance wise and for memory usage).
+
+Starting with the upcoming pandas 3.0 release, a dedicated string data type will
+be enabled by default (backed by PyArrow under the hood, if installed, otherwise
+falling back to NumPy). This means that pandas will start inferring columns
+containing string data as the new ``str`` data type when creating pandas
+objects, such as in constructors or IO functions.
+
+Old behavior:
+
+.. code-block:: python
+
+    >>> ser = pd.Series(["a", "b"])
+    0    a
+    1    b
+    dtype: object
+
+New behavior:
+
+.. code-block:: python
+
+    >>> ser = pd.Series(["a", "b"])
+    0    a
+    1    b
+    dtype: str
+
+The string data type that is used in these scenarios will mostly behave as NumPy
+object would, including missing value semantics and general operations on these
+columns.
+
+However, the introduction of a new default dtype will also have some breaking
+consequences to your code (for example when checking for the ``.dtype`` being
+object dtype). To allow testing it in advance of the pandas 3.0 release, this
+future dtype inference logic can be enabled in pandas 2.3 with:
+
+.. code-block:: python
+
+    pd.options.future.infer_string = True
+
+See the :ref:`string_migration_guide` for more details on the behaviour changes
+and how to adapt your code to the new default.
+
+Copy-on-Write
+^^^^^^^^^^^^^
+
+The currently optional mode Copy-on-Write will be enabled by default in pandas 3.0. There
+won't be an option to retain the legacy behavior.
+
+In summary, the new "copy-on-write" behaviour will bring changes in behavior in
+how pandas operates with respect to copies and views.
+
+1. The result of *any* indexing operation (subsetting a DataFrame or Series in any way,
+   i.e. including accessing a DataFrame column as a Series) or any method returning a
+   new DataFrame or Series, always *behaves as if* it were a copy in terms of user
+   API.
+2. As a consequence, if you want to modify an object (DataFrame or Series), the only way
+   to do this is to directly modify that object itself.
+
+Because every single indexing step now behaves as a copy, this also means that
+"chained assignment" (updating a DataFrame with multiple setitem steps) will
+stop working. Because this now consistently never works, the
+``SettingWithCopyWarning`` will be removed.
+
+The new behavioral semantics are explained in more detail in the
+:ref:`user guide about Copy-on-Write `.
+
+The new behavior can be enabled since pandas 2.0 with the following option:
+
+.. code-block:: python
+
+    pd.options.mode.copy_on_write = True
+
+Some of the behaviour changes allow a clear deprecation, like the changes in
+chained assignment. Other changes are more subtle and thus, the warnings are
+hidden behind an option that can be enabled since pandas 2.2:
+
+.. code-block:: python
+
+    pd.options.mode.copy_on_write = "warn"
+
+This mode will warn in many different scenarios that aren't actually relevant to
+most queries. We recommend exploring this mode, but it is not necessary to get rid
+of all of these warnings. The :ref:`migration guide `
+explains the upgrade process in more detail.
+
 .. _whatsnew_230.enhancements:
 
 Enhancements
diff --git a/doc/source/whatsnew/v2.3.1.rst b/doc/source/whatsnew/v2.3.1.rst
index eb3ad72f6a59f..7ad76e9d82c9c 100644
--- a/doc/source/whatsnew/v2.3.1.rst
+++ b/doc/source/whatsnew/v2.3.1.rst
@@ -44,7 +44,7 @@ correctly, rather than defaulting to ``object`` dtype. For example:
 
 .. code-block:: python
 
-    >>> pd.options.mode.infer_string = True
+    >>> pd.options.future.infer_string = True
     >>> df = pd.DataFrame()
     >>> df.columns.dtype
     dtype('int64')  # default RangeIndex for empty columns