
Commit a23d728

docs: update spark profiling docs and add new databricks integration (#1269)
* docs: update spark profiling docs and add a new integration example with Databricks

Co-authored-by: Azory YData Bot <azory@ydata.ai>
1 parent 9635ce4 commit a23d728

File tree

7 files changed: +280 / -10 lines changed


docsrc/source/pages/integrations/great_expectations.rst

Lines changed: 7 additions & 1 deletion
@@ -2,8 +2,14 @@
 Great Expectations
 ==================
 
-`Great Expectations <https://www.greatexpectations.io>`_ is a Python-based open-source library for validating, documenting, and profiling your data. It helps you to maintain data quality and improve communication about data between teams. With Great Expectations, you can assert what you expect from the data you load and transform, and catch data issues quickly – Expectations are basically *unit tests for your data*. ``pandas-profiling`` features a method to create a suite of Expectations based on the results of your ``ProfileReport``!
+.. NOTE::
+   **Great Expectations integration**
+   - The Great Expectations integration is no longer supported.
+   - You can recreate the integration with the following package versions:
+   - pandas-profiling==2.1.0
+   - great-expectations==0.13.4
 
+`Great Expectations <https://www.greatexpectations.io>`_ is a Python-based open-source library for validating, documenting, and profiling your data. It helps you to maintain data quality and improve communication about data between teams. With Great Expectations, you can assert what you expect from the data you load and transform, and catch data issues quickly – Expectations are basically *unit tests for your data*. ``pandas-profiling`` features a method to create a suite of Expectations based on the results of your ``ProfileReport``!
 
 About Great Expectations
 -------------------------
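For context, the retired integration was typically used along these lines. This is a hedged sketch, not part of the diff: it relies on the historical ``ProfileReport.to_expectation_suite()`` method of ``pandas-profiling``, whether it runs depends on the pinned versions in the note above, and the CSV path and suite name are hypothetical.

.. code-block:: python

    # Hedged sketch of the retired Great Expectations integration.
    # "titanic.csv" and the suite name are illustrative assumptions.
    import pandas as pd
    from pandas_profiling import ProfileReport

    df = pd.read_csv("titanic.csv")
    profile = ProfileReport(df, title="Profiling Report")

    # Derive a Great Expectations suite from the profiling results (historical API).
    suite = profile.to_expectation_suite(suite_name="titanic_expectations")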

docsrc/source/pages/integrations/pyspark.rst

Lines changed: 14 additions & 1 deletion
@@ -2,6 +2,10 @@
 ⚡ Pyspark
 ============
 
+.. NOTE::
+   **Spark dataframes support**
+   - Spark Dataframes profiling is available from ydata-profiling version 4.0.0 onwards
+
 Data Profiling is a core step in the process of developing AI solutions.
 For small datasets, the data can be loaded into memory and easily accessed with Python and pandas dataframes.
 However for larger datasets what can be done?
@@ -80,4 +84,13 @@ A quickstart example to profile data from a CSV leveraging Pyspark engine and ``
     df.printSchema()
 
     a = ProfileReport(df)
-    a.to_file("spark_profile.html")
+    a.to_file("spark_profile.html")
+
+ydata-profiling in Databricks
+-----------------------------
+
+Yes! We have fantastic news coming, with a full tutorial on how you can use ydata-profiling in Databricks Notebooks.
+
+The notebook example can be found `here <https://github.com/ydataai/ydata-profiling/tree/master/examples/integrations/databricks_example.ipynb>`_.
+
+Stay tuned - we are going to update the documentation soon!
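For orientation, the quickstart that the hunk above tails off might look like the following. This is a hedged sketch: the SparkSession setup and the "data.csv" path are illustrative assumptions, while the last three lines mirror the doc's own snippet.

.. code-block:: python

    # Hedged sketch of a Pyspark profiling quickstart with ydata-profiling >= 4.0.0.
    # The SparkSession configuration and CSV path are hypothetical.
    from pyspark.sql import SparkSession
    from ydata_profiling import ProfileReport

    spark = SparkSession.builder.appName("profiling-quickstart").getOrCreate()

    # Read a CSV into a Spark DataFrame and inspect its schema.
    df = spark.read.csv("data.csv", header=True, inferSchema=True)
    df.printSchema()

    # Profile the Spark DataFrame and export the report to HTML.
    a = ProfileReport(df)
    a.to_file("spark_profile.html")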

docsrc/source/pages/use_cases/big_data.rst

Lines changed: 32 additions & 8 deletions
@@ -2,22 +2,23 @@
 Profiling large datasets
 ========================
 
-By default, ``pandas-profiling`` comprehensively summarizes the input dataset in a way that gives the most insights for data analysis. For small datasets, these computations can be performed in *quasi* real-time. For larger datasets, deciding upfront which calculations to make might be required.
-Whether a computation scales to a large datasets not only depends on the exact size of the detaset, but also on its complexity and on whether fast computations are available. If the computation time of the profiling becomes a bottleneck, ``pandas-profiling`` offers several alternatives to overcome it.
+By default, ``ydata-profiling`` comprehensively summarizes the input dataset in a way that gives the most insights for data analysis. For small datasets, these computations can be performed in *quasi* real-time. For larger datasets, deciding upfront which calculations to make might be required.
+Whether a computation scales to a large dataset not only depends on the exact size of the dataset, but also on its complexity and on whether fast computations are available. If the computation time of the profiling becomes a bottleneck, ``ydata-profiling`` offers several alternatives to overcome it.
 
 Minimal mode
 ------------
 
-``pandas-profiling`` includes a minimal configuration file where the most expensive computations are turned off by default.
+``ydata-profiling`` includes a minimal configuration file where the most expensive computations are turned off by default.
 This is the recommended starting point for larger datasets.
 
 .. code-block:: python
 
     profile = ProfileReport(large_dataset, minimal=True)
     profile.to_file("output.html")
 
-
-*(minimal mode was introduced in version v2.4.0)*
+.. NOTE::
+   **Minimal mode**
+   - This mode was introduced in version v2.4.0
 
 This configuration file can be found here: `config_minimal.yaml <https://github.com/ydataai/pandas-profiling/blob/master/src/pandas_profiling/config_minimal.yaml>`_. More details on settings and configuration are available in :doc:`../advanced_usage/available_settings`.
 
@@ -54,7 +55,7 @@ To decrease the computational burden in particularly large datasets but still ma
 
 .. code-block:: python
 
-    from pandas_profiling import ProfileReport
+    from ydata_profiling import ProfileReport
     import pandas as pd
 
     # Reading the data
@@ -72,10 +73,33 @@ To decrease the computational burden in particularly large datasets but still ma
 
 The setting controlling this, ``interactions.targets``, can be changed via multiple interfaces (configuration files or environment variables). For details, see :doc:`../advanced_usage/changing_settings`.
 
+Pyspark
+-----------
+
+`Spark <https://spark.apache.org/>`_
+
+.. NOTE::
+   **Spark Dataframes support**
+   - This support was introduced in version v4.0.0
+
+``ydata-profiling`` now supports Spark Dataframes profiling. You can find an example of the integration `here <https://github.com/ydataai/ydata-profiling/blob/master/examples/features/spark_example.py>`_.
+
+**Features supported:**
+- Univariate variables analysis
+- Head and Tail dataset sample
+- Correlation matrices: Pearson and Spearman
+
+*Coming soon*
+- Missing values analysis
+- Interactions
+- Improved histogram computation
+
+Keep an eye on the `GitHub <https://github.com/ydataai/pandas-profiling/issues>`_ page to follow the updates on the implementation of `Pyspark Dataframes support <https://github.com/orgs/ydataai/projects/16/views/2>`_.
+
 Concurrency
 -----------
 
-``pandas-profiling`` is a project under active development. One of the highly desired features is the addition of a scalable backend such as `Modin <https://github.com/modin-project/modin>`_, `Spark <https://spark.apache.org/>`_ or `Dask <https://dask.org/>`_.
+``ydata-profiling`` is a project under active development. One of the highly desired features is the addition of a scalable backend such as `Modin <https://github.com/modin-project/modin>`_ or `Dask <https://dask.org/>`_.
 
 
-Keep an eye on the `GitHub <https://github.com/ydataai/pandas-profiling/issues>`_ page to follow the updates on the implementation of a concurrent and highly scalable backend. Specifically, development of a Spark backend is `currently underway <https://github.com/ydataai/pandas-profiling/projects/3>`_.
+Keep an eye on the `GitHub <https://github.com/ydataai/pandas-profiling/issues>`_ page to follow the updates on the implementation of a concurrent and highly scalable backend. Specifically, development of a Spark backend is `currently underway <https://github.com/ydataai/pandas-profiling/projects/3>`_.
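As an illustration of the ``interactions.targets`` setting mentioned in the diff above, here is a hedged sketch. The keyword-argument form is an assumption that mirrors the ``correlations={...}`` pattern used elsewhere in these docs; the setting can equally be changed through a configuration file or environment variables. The CSV path and column names are hypothetical.

.. code-block:: python

    # Hedged sketch: compute interaction plots only against selected target columns.
    # "large_dataset.csv", "age" and "fare" are illustrative assumptions.
    import pandas as pd
    from ydata_profiling import ProfileReport

    data = pd.read_csv("large_dataset.csv")

    profile = ProfileReport(
        data,
        interactions={"targets": ["age", "fare"]},  # restrict the interactions section
    )
    profile.to_file("output.html")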

docsrc/source/pages/use_cases/comparing_datasets.rst

Lines changed: 6 additions & 0 deletions
@@ -2,6 +2,12 @@
 Dataset Comparison
 ==================
 
+.. NOTE::
+   **Dataframes compare support**
+   - Profile comparison is supported from ydata-profiling version 3.5.0 onwards
+   - Profile comparison is not *(yet!)* available for Spark Dataframes
+
+
 ``pandas-profiling`` can be used to compare multiple version of the same dataset.
 This is useful when comparing data from multiple time periods, such as two years.
 Another common scenario is to view the dataset profile for training, validation and test sets in machine learning.
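To make the comparison workflow concrete, here is a hedged sketch built on the ``ProfileReport.compare()`` method available from version 3.5.0 onwards; the toy dataframes and report titles are illustrative assumptions standing in for, e.g., two yearly extracts.

.. code-block:: python

    # Hedged sketch of comparing two versions of the same dataset with pandas dataframes.
    # The data below is fabricated purely for illustration.
    import pandas as pd
    from ydata_profiling import ProfileReport

    df_2021 = pd.DataFrame({"age": [23, 35, 41], "fare": [7.3, 12.5, 9.9]})
    df_2022 = pd.DataFrame({"age": [25, 31, 58], "fare": [8.1, 11.0, 14.2]})

    report_2021 = ProfileReport(df_2021, title="2021")
    report_2022 = ProfileReport(df_2022, title="2022")

    # Build a single comparison report and export it to HTML.
    comparison = report_2021.compare(report_2022)
    comparison.to_file("comparison.html")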
Binary file not shown.
examples/integrations/databricks_example.ipynb

Lines changed: 221 additions & 0 deletions
@@ -0,0 +1,221 @@
{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {
    "application/vnd.databricks.v1+cell": {
     "cellMetadata": {},
     "inputWidgets": {},
     "nuid": "47c63171-a4b6-49fb-a26d-531adc1464f8",
     "showTitle": false,
     "title": ""
    }
   },
   "source": [
    "# Yellow Taxi NYC\n",
    "\n",
    "### Data Profiling in Databricks with ydata-profiling"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "application/vnd.databricks.v1+cell": {
     "cellMetadata": {},
     "inputWidgets": {},
     "nuid": "18fdccc1-4779-4062-981a-882fa043538c",
     "showTitle": false,
     "title": ""
    }
   },
   "source": [
    "Read a Delta table"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "application/vnd.databricks.v1+cell": {
     "cellMetadata": {},
     "inputWidgets": {},
     "nuid": "9f0e1f63-717a-4653-98fc-bb5119ae9c63",
     "showTitle": false,
     "title": ""
    }
   },
   "outputs": [],
   "source": [
    "input_table_name = \"default.yellowtaxi_trips\"\n",
    "df = spark.table(input_table_name)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "application/vnd.databricks.v1+cell": {
     "cellMetadata": {},
     "inputWidgets": {},
     "nuid": "21da1ee5-9055-4995-b777-f2081f91e7d4",
     "showTitle": false,
     "title": ""
    }
   },
   "outputs": [],
   "source": [
    "display(df)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "application/vnd.databricks.v1+cell": {
     "cellMetadata": {},
     "inputWidgets": {},
     "nuid": "565d9304-1f65-42c1-91a3-01f182d44cd2",
     "showTitle": false,
     "title": ""
    }
   },
   "source": [
    "## Data profiling with YData Profiling\n",
    "\n",
    "pandas-profiling is now ydata-profiling and includes support for Spark dataframes."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "application/vnd.databricks.v1+cell": {
     "cellMetadata": {},
     "inputWidgets": {},
     "nuid": "eebb3d44-b088-4f45-b55b-6bd527ea7f7a",
     "showTitle": false,
     "title": ""
    }
   },
   "outputs": [],
   "source": [
    "from ydata_profiling import ProfileReport\n",
    "\n",
    "report = ProfileReport(\n",
    "    df,\n",
    "    title=\"NYC yellow taxi trip\",\n",
    "    infer_dtypes=False,\n",
    "    interactions=None,\n",
    "    missing_diagrams=None,\n",
    "    correlations={\n",
    "        \"auto\": {\"calculate\": False},\n",
    "        \"pearson\": {\"calculate\": True},\n",
    "        \"spearman\": {\"calculate\": True},\n",
    "    },\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "application/vnd.databricks.v1+cell": {
     "cellMetadata": {},
     "inputWidgets": {},
     "nuid": "b4a28624-1b4f-4879-a4a4-cf34ec9e6bc0",
     "showTitle": false,
     "title": ""
    }
   },
   "source": [
    "#### Display as HTML"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "application/vnd.databricks.v1+cell": {
     "cellMetadata": {},
     "inputWidgets": {},
     "nuid": "eefeac1e-2d58-4d23-a0ec-46b1d69e44bf",
     "showTitle": false,
     "title": ""
    }
   },
   "outputs": [],
   "source": [
    "# Export the report as html and display\n",
    "report_html = report.to_html()\n",
    "displayHTML(report_html)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "application/vnd.databricks.v1+cell": {
     "cellMetadata": {},
     "inputWidgets": {},
     "nuid": "0a7fbe9c-7b0f-460f-99d7-552736c1091c",
     "showTitle": false,
     "title": ""
    }
   },
   "source": [
    "#### Extract the profile as JSON"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "application/vnd.databricks.v1+cell": {
     "cellMetadata": {},
     "inputWidgets": {},
     "nuid": "cd3f32fc-344d-4a93-b037-a56952903979",
     "showTitle": false,
     "title": ""
    }
   },
   "outputs": [],
   "source": [
    "profile_json = report.to_json()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "application/vnd.databricks.v1+cell": {
     "cellMetadata": {},
     "inputWidgets": {},
     "nuid": "4c569eaa-5cb5-471c-9c7c-cb7d9f7cd1f5",
     "showTitle": false,
     "title": ""
    }
   },
   "outputs": [],
   "source": [
    "profile_json"
   ]
  }
 ],
 "metadata": {
  "application/vnd.databricks.v1+notebook": {
   "dashboards": [],
   "language": "python",
   "notebookMetadata": {
    "mostRecentlyExecutedCommandWithImplicitDF": {
     "commandId": 2648559141144570,
     "dataframes": [
      "_sqldf"
     ]
    },
    "pythonIndentUnit": 4
   },
   "notebookName": "YData-profiling in Databricks",
   "notebookOrigID": 329200988581789,
   "widgets": {}
  }
 },
 "nbformat": 4,
 "nbformat_minor": 0
}
