
Commit a23d728

docs: update spark profiling docs and add new databricks integration (#1269)
* docs: update spark profiling docs and add a new integration example with Databricks

Co-authored-by: Azory YData Bot <azory@ydata.ai>
1 parent 9635ce4 commit a23d728

File tree

7 files changed: +280 / -10 lines changed


docsrc/source/pages/integrations/great_expectations.rst

Lines changed: 7 additions & 1 deletion
@@ -2,8 +2,14 @@
 Great Expectations
 ==================
 
-`Great Expectations <https://www.greatexpectations.io>`_ is a Python-based open-source library for validating, documenting, and profiling your data. It helps you to maintain data quality and improve communication about data between teams. With Great Expectations, you can assert what you expect from the data you load and transform, and catch data issues quickly – Expectations are basically *unit tests for your data*. ``pandas-profiling`` features a method to create a suite of Expectations based on the results of your ``ProfileReport``!
+.. NOTE::
+   **Great Expectations integration**
+   - The Great Expectations integration is no longer supported.
+   - You can recreate the integration with the following package versions:
+   - pandas-profiling==2.1.0
+   - great-expectations==0.13.4
 
+`Great Expectations <https://www.greatexpectations.io>`_ is a Python-based open-source library for validating, documenting, and profiling your data. It helps you to maintain data quality and improve communication about data between teams. With Great Expectations, you can assert what you expect from the data you load and transform, and catch data issues quickly – Expectations are basically *unit tests for your data*. ``pandas-profiling`` features a method to create a suite of Expectations based on the results of your ``ProfileReport``!
 
 About Great Expectations
 -------------------------
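For context, the retired integration was typically used along these lines. This is a hedged sketch, not part of the diff: it relies on the historical ``ProfileReport.to_expectation_suite()`` method of ``pandas-profiling``, whether it runs depends on the pinned versions in the note above, and the CSV path and suite name are hypothetical.

.. code-block:: python

    # Hedged sketch of the retired Great Expectations integration.
    # "titanic.csv" and the suite name are illustrative assumptions.
    import pandas as pd
    from pandas_profiling import ProfileReport

    df = pd.read_csv("titanic.csv")
    profile = ProfileReport(df, title="Profiling Report")

    # Derive a Great Expectations suite from the profiling results (historical API).
    suite = profile.to_expectation_suite(suite_name="titanic_expectations")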

docsrc/source/pages/integrations/pyspark.rst

Lines changed: 14 additions & 1 deletion
@@ -2,6 +2,10 @@
 ⚡ Pyspark
 ============
 
+.. NOTE::
+   **Spark dataframes support**
+   - Spark Dataframes profiling is available from ydata-profiling version 4.0.0 onwards
+
 Data Profiling is a core step in the process of developing AI solutions.
 For small datasets, the data can be loaded into memory and easily accessed with Python and pandas dataframes.
 However for larger datasets what can be done?
@@ -80,4 +84,13 @@ A quickstart example to profile data from a CSV leveraging Pyspark engine and ``
     df.printSchema()
 
     a = ProfileReport(df)
-    a.to_file("spark_profile.html")
+    a.to_file("spark_profile.html")
+
+ydata-profiling in Databricks
+-----------------------------
+
+Yes! We have fantastic news coming, with a full tutorial on how you can use ydata-profiling in Databricks Notebooks.
+
+The notebook example can be found `here <https://github.com/ydataai/ydata-profiling/tree/master/examples/integrations/databricks_example.ipynb>`_.
+
+Stay tuned - we are going to update the documentation soon!
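For orientation, the quickstart that the hunk above tails off might look like the following. This is a hedged sketch: the SparkSession setup and the "data.csv" path are illustrative assumptions, while the last three lines mirror the doc's own snippet.

.. code-block:: python

    # Hedged sketch of a Pyspark profiling quickstart with ydata-profiling >= 4.0.0.
    # The SparkSession configuration and CSV path are hypothetical.
    from pyspark.sql import SparkSession
    from ydata_profiling import ProfileReport

    spark = SparkSession.builder.appName("profiling-quickstart").getOrCreate()

    # Read a CSV into a Spark DataFrame and inspect its schema.
    df = spark.read.csv("data.csv", header=True, inferSchema=True)
    df.printSchema()

    # Profile the Spark DataFrame and export the report to HTML.
    a = ProfileReport(df)
    a.to_file("spark_profile.html")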

docsrc/source/pages/use_cases/big_data.rst

Lines changed: 32 additions & 8 deletions
@@ -2,22 +2,23 @@
 Profiling large datasets
 ========================
 
-By default, ``pandas-profiling`` comprehensively summarizes the input dataset in a way that gives the most insights for data analysis. For small datasets, these computations can be performed in *quasi* real-time. For larger datasets, deciding upfront which calculations to make might be required.
-Whether a computation scales to a large datasets not only depends on the exact size of the detaset, but also on its complexity and on whether fast computations are available. If the computation time of the profiling becomes a bottleneck, ``pandas-profiling`` offers several alternatives to overcome it.
+By default, ``ydata-profiling`` comprehensively summarizes the input dataset in a way that gives the most insights for data analysis. For small datasets, these computations can be performed in *quasi* real-time. For larger datasets, deciding upfront which calculations to make might be required.
+Whether a computation scales to a large dataset not only depends on the exact size of the dataset, but also on its complexity and on whether fast computations are available. If the computation time of the profiling becomes a bottleneck, ``ydata-profiling`` offers several alternatives to overcome it.
 
 Minimal mode
 ------------
 
-``pandas-profiling`` includes a minimal configuration file where the most expensive computations are turned off by default.
+``ydata-profiling`` includes a minimal configuration file where the most expensive computations are turned off by default.
 This is the recommended starting point for larger datasets.
 
 .. code-block:: python
 
     profile = ProfileReport(large_dataset, minimal=True)
     profile.to_file("output.html")
 
-
-*(minimal mode was introduced in version v2.4.0)*
+.. NOTE::
+   **Minimal mode**
+   - This mode was introduced in version v2.4.0
 
 This configuration file can be found here: `config_minimal.yaml <https://github.com/ydataai/pandas-profiling/blob/master/src/pandas_profiling/config_minimal.yaml>`_. More details on settings and configuration are available in :doc:`../advanced_usage/available_settings`.
 
@@ -54,7 +55,7 @@ To decrease the computational burden in particularly large datasets but still ma
 
 .. code-block:: python
 
-    from pandas_profiling import ProfileReport
+    from ydata_profiling import ProfileReport
     import pandas as pd
 
     # Reading the data
@@ -72,10 +73,33 @@ To decrease the computational burden in particularly large datasets but still ma
 
 The setting controlling this, ``interactions.targets``, can be changed via multiple interfaces (configuration files or environment variables). For details, see :doc:`../advanced_usage/changing_settings`.
 
+Pyspark
+-----------
+
+`Spark <https://spark.apache.org/>`_
+
+.. NOTE::
+   **Spark Dataframes support**
+   - This support was introduced in version v4.0.0
+
+``ydata-profiling`` now supports Spark Dataframes profiling. You can find an example of the integration `here <https://github.com/ydataai/ydata-profiling/blob/master/examples/features/spark_example.py>`_.
+
+**Features supported:**
+- Univariate variables analysis
+- Head and Tail dataset sample
+- Correlation matrices: Pearson and Spearman
+
+*Coming soon*
+- Missing values analysis
+- Interactions
+- Improved histogram computation
+
+Keep an eye on the `GitHub <https://github.com/ydataai/pandas-profiling/issues>`_ page to follow the updates on the implementation of `Pyspark Dataframes support <https://github.com/orgs/ydataai/projects/16/views/2>`_.
+
 Concurrency
 -----------
 
-``pandas-profiling`` is a project under active development. One of the highly desired features is the addition of a scalable backend such as `Modin <https://github.com/modin-project/modin>`_, `Spark <https://spark.apache.org/>`_ or `Dask <https://dask.org/>`_.
+``ydata-profiling`` is a project under active development. One of the highly desired features is the addition of a scalable backend such as `Modin <https://github.com/modin-project/modin>`_ or `Dask <https://dask.org/>`_.
 
 
-Keep an eye on the `GitHub <https://github.com/ydataai/pandas-profiling/issues>`_ page to follow the updates on the implementation of a concurrent and highly scalable backend. Specifically, development of a Spark backend is `currently underway <https://github.com/ydataai/pandas-profiling/projects/3>`_.
+Keep an eye on the `GitHub <https://github.com/ydataai/pandas-profiling/issues>`_ page to follow the updates on the implementation of a concurrent and highly scalable backend. Specifically, development of a Spark backend is `currently underway <https://github.com/ydataai/pandas-profiling/projects/3>`_.
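As an illustration of the ``interactions.targets`` setting mentioned in the diff above, here is a hedged sketch. The keyword-argument form is an assumption that mirrors the ``correlations={...}`` pattern used elsewhere in these docs; the setting can equally be changed through a configuration file or environment variables. The CSV path and column names are hypothetical.

.. code-block:: python

    # Hedged sketch: compute interaction plots only against selected target columns.
    # "large_dataset.csv", "age" and "fare" are illustrative assumptions.
    import pandas as pd
    from ydata_profiling import ProfileReport

    data = pd.read_csv("large_dataset.csv")

    profile = ProfileReport(
        data,
        interactions={"targets": ["age", "fare"]},  # restrict the interactions section
    )
    profile.to_file("output.html")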

docsrc/source/pages/use_cases/comparing_datasets.rst

Lines changed: 6 additions & 0 deletions
@@ -2,6 +2,12 @@
 Dataset Comparison
 ==================
 
+.. NOTE::
+   **Dataframes compare support**
+   - Profile comparison is supported from ydata-profiling version 3.5.0 onwards
+   - Profile comparison is not *(yet!)* available for Spark Dataframes
+
+
 ``pandas-profiling`` can be used to compare multiple version of the same dataset.
 This is useful when comparing data from multiple time periods, such as two years.
 Another common scenario is to view the dataset profile for training, validation and test sets in machine learning.
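To make the comparison workflow concrete, here is a hedged sketch built on the ``ProfileReport.compare()`` method available from version 3.5.0 onwards; the toy dataframes and report titles are illustrative assumptions standing in for, e.g., two yearly extracts.

.. code-block:: python

    # Hedged sketch of comparing two versions of the same dataset with pandas dataframes.
    # The data below is fabricated purely for illustration.
    import pandas as pd
    from ydata_profiling import ProfileReport

    df_2021 = pd.DataFrame({"age": [23, 35, 41], "fare": [7.3, 12.5, 9.9]})
    df_2022 = pd.DataFrame({"age": [25, 31, 58], "fare": [8.1, 11.0, 14.2]})

    report_2021 = ProfileReport(df_2021, title="2021")
    report_2022 = ProfileReport(df_2022, title="2022")

    # Build a single comparison report and export it to HTML.
    comparison = report_2021.compare(report_2022)
    comparison.to_file("comparison.html")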
Binary file not shown.
examples/integrations/databricks_example.ipynb

Lines changed: 221 additions & 0 deletions
@@ -0,0 +1,221 @@
{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {
    "application/vnd.databricks.v1+cell": {
     "cellMetadata": {},
     "inputWidgets": {},
     "nuid": "47c63171-a4b6-49fb-a26d-531adc1464f8",
     "showTitle": false,
     "title": ""
    }
   },
   "source": [
    "# Yellow Taxi NYC\n",
    "\n",
    "### Data Profiling in Databricks with ydata-profiling"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "application/vnd.databricks.v1+cell": {
     "cellMetadata": {},
     "inputWidgets": {},
     "nuid": "18fdccc1-4779-4062-981a-882fa043538c",
     "showTitle": false,
     "title": ""
    }
   },
   "source": [
    "Read a Delta table"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "application/vnd.databricks.v1+cell": {
     "cellMetadata": {},
     "inputWidgets": {},
     "nuid": "9f0e1f63-717a-4653-98fc-bb5119ae9c63",
     "showTitle": false,
     "title": ""
    }
   },
   "outputs": [],
   "source": [
    "input_table_name = \"default.yellowtaxi_trips\"\n",
    "df = spark.table(input_table_name)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "application/vnd.databricks.v1+cell": {
     "cellMetadata": {},
     "inputWidgets": {},
     "nuid": "21da1ee5-9055-4995-b777-f2081f91e7d4",
     "showTitle": false,
     "title": ""
    }
   },
   "outputs": [],
   "source": [
    "display(df)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "application/vnd.databricks.v1+cell": {
     "cellMetadata": {},
     "inputWidgets": {},
     "nuid": "565d9304-1f65-42c1-91a3-01f182d44cd2",
     "showTitle": false,
     "title": ""
    }
   },
   "source": [
    "## Data profiling with YData Profiling\n",
    "\n",
    "pandas-profiling is now ydata-profiling and includes support for Spark dataframes."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "application/vnd.databricks.v1+cell": {
     "cellMetadata": {},
     "inputWidgets": {},
     "nuid": "eebb3d44-b088-4f45-b55b-6bd527ea7f7a",
     "showTitle": false,
     "title": ""
    }
   },
   "outputs": [],
   "source": [
    "from ydata_profiling import ProfileReport\n",
    "\n",
    "report = ProfileReport(\n",
    "    df,\n",
    "    title=\"NYC yellow taxi trip\",\n",
    "    infer_dtypes=False,\n",
    "    interactions=None,\n",
    "    missing_diagrams=None,\n",
    "    correlations={\n",
    "        \"auto\": {\"calculate\": False},\n",
    "        \"pearson\": {\"calculate\": True},\n",
    "        \"spearman\": {\"calculate\": True},\n",
    "    },\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "application/vnd.databricks.v1+cell": {
     "cellMetadata": {},
     "inputWidgets": {},
     "nuid": "b4a28624-1b4f-4879-a4a4-cf34ec9e6bc0",
     "showTitle": false,
     "title": ""
    }
   },
   "source": [
    "#### Display as HTML"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "application/vnd.databricks.v1+cell": {
     "cellMetadata": {},
     "inputWidgets": {},
     "nuid": "eefeac1e-2d58-4d23-a0ec-46b1d69e44bf",
     "showTitle": false,
     "title": ""
    }
   },
   "outputs": [],
   "source": [
    "# Export the report as html and display\n",
    "report_html = report.to_html()\n",
    "displayHTML(report_html)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "application/vnd.databricks.v1+cell": {
     "cellMetadata": {},
     "inputWidgets": {},
     "nuid": "0a7fbe9c-7b0f-460f-99d7-552736c1091c",
     "showTitle": false,
     "title": ""
    }
   },
   "source": [
    "#### Extract the profile as JSON"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "application/vnd.databricks.v1+cell": {
     "cellMetadata": {},
     "inputWidgets": {},
     "nuid": "cd3f32fc-344d-4a93-b037-a56952903979",
     "showTitle": false,
     "title": ""
    }
   },
   "outputs": [],
   "source": [
    "profile_json = report.to_json()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "application/vnd.databricks.v1+cell": {
     "cellMetadata": {},
     "inputWidgets": {},
     "nuid": "4c569eaa-5cb5-471c-9c7c-cb7d9f7cd1f5",
     "showTitle": false,
     "title": ""
    }
   },
   "outputs": [],
   "source": [
    "profile_json"
   ]
  }
 ],
 "metadata": {
  "application/vnd.databricks.v1+notebook": {
   "dashboards": [],
   "language": "python",
   "notebookMetadata": {
    "mostRecentlyExecutedCommandWithImplicitDF": {
     "commandId": 2648559141144570,
     "dataframes": [
      "_sqldf"
     ]
    },
    "pythonIndentUnit": 4
   },
   "notebookName": "YData-profiling in Databricks",
   "notebookOrigID": 329200988581789,
   "widgets": {}
  }
 },
 "nbformat": 4,
 "nbformat_minor": 0
}
