Commit 91b6635

kosiew and timsaucer authored
Add DataFrame usage guide with HTML rendering customization options (#1108)
* docs: enhance user guide with detailed DataFrame operations and examples
* move /docs/source/api/dataframe.rst into user-guide
* docs: remove DataFrame API documentation
* docs: fix formatting inconsistencies in DataFrame user guide
* Two minor corrections to documentation rendering

---------

Co-authored-by: Tim Saucer <timsaucer@gmail.com>
1 parent c9f1554 commit 91b6635

3 files changed, with 184 additions and 1 deletion

docs/source/index.rst

Lines changed: 1 addition & 0 deletions
@@ -72,6 +72,7 @@ Example
7272
user-guide/introduction
7373
user-guide/basics
7474
user-guide/data-sources
75+
user-guide/dataframe
7576
user-guide/common-operations/index
7677
user-guide/io/index
7778
user-guide/configuration

docs/source/user-guide/basics.rst

Lines changed: 4 additions & 1 deletion
@@ -21,7 +21,8 @@ Concepts
2121
========
2222

2323
In this section, we will cover a basic example to introduce a few key concepts. We will use the
24-
2021 Yellow Taxi Trip Records ([download](https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2021-01.parquet)), from the [TLC Trip Record Data](https://www.nyc.gov/site/tlc/about/tlc-trip-record-data.page).
24+
2021 Yellow Taxi Trip Records (`download <https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2021-01.parquet>`_),
25+
from the `TLC Trip Record Data <https://www.nyc.gov/site/tlc/about/tlc-trip-record-data.page>`_.
2526

2627
.. ipython:: python
2728
@@ -72,6 +73,8 @@ DataFrames are typically created by calling a method on :py:class:`~datafusion.c
7273
calling the transformation methods, such as :py:func:`~datafusion.dataframe.DataFrame.filter`, :py:func:`~datafusion.dataframe.DataFrame.select`, :py:func:`~datafusion.dataframe.DataFrame.aggregate`,
7374
and :py:func:`~datafusion.dataframe.DataFrame.limit` to build up a query definition.
7475

76+
For more details on working with DataFrames, including visualization options and conversion to other formats, see :doc:`dataframe`.
77+
7578
Expressions
7679
-----------
7780

docs/source/user-guide/dataframe.rst

Lines changed: 179 additions & 0 deletions
@@ -0,0 +1,179 @@
1+
.. Licensed to the Apache Software Foundation (ASF) under one
2+
.. or more contributor license agreements. See the NOTICE file
3+
.. distributed with this work for additional information
4+
.. regarding copyright ownership. The ASF licenses this file
5+
.. to you under the Apache License, Version 2.0 (the
6+
.. "License"); you may not use this file except in compliance
7+
.. with the License. You may obtain a copy of the License at
8+
9+
.. http://www.apache.org/licenses/LICENSE-2.0
10+
11+
.. Unless required by applicable law or agreed to in writing,
12+
.. software distributed under the License is distributed on an
13+
.. "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
14+
.. KIND, either express or implied. See the License for the
15+
.. specific language governing permissions and limitations
16+
.. under the License.
17+
18+
DataFrames
19+
==========
20+
21+
Overview
22+
--------
23+
24+
DataFusion's DataFrame API provides a powerful interface for building and executing queries against data sources.
25+
It offers a familiar API similar to pandas and other DataFrame libraries, but with the performance benefits of Rust
26+
and Arrow.
27+
28+
A DataFrame represents a logical plan that can be composed through operations like filtering, projection, and aggregation.
29+
The actual execution happens when terminal operations like ``collect()`` or ``show()`` are called.
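
The deferred-execution idea can be sketched in plain Python. The ``LazyFrame`` class below is a hypothetical toy, not DataFusion's implementation: ``filter`` and ``select`` only record steps in a plan, and the rows are processed only when ``collect()`` is called.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LazyFrame:
    """Toy stand-in for a lazily evaluated DataFrame (hypothetical)."""

    rows: tuple
    plan: tuple = ()  # recorded steps; nothing runs until collect()

    def filter(self, predicate):
        # Record the step and return a new frame; no rows are touched yet.
        return LazyFrame(self.rows, self.plan + (("filter", predicate),))

    def select(self, *names):
        # Record a projection onto the named columns.
        return LazyFrame(self.rows, self.plan + (("select", names),))

    def collect(self):
        # Terminal operation: replay the recorded plan over the data.
        out = list(self.rows)
        for op, arg in self.plan:
            if op == "filter":
                out = [row for row in out if arg(row)]
            else:
                out = [{name: row[name] for name in arg} for row in out]
        return out


people = LazyFrame(rows=(
    {"name": "Ann", "age": 31},
    {"name": "Bob", "age": 22},
))
query = people.filter(lambda row: row["age"] > 25).select("name")
print(len(query.plan))   # two recorded steps, no work done yet
print(query.collect())   # the plan runs here
```

Because each method returns a new frame, the original ``people`` object keeps an empty plan and can be reused for other queries.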
30+
31+
Basic Usage
32+
-----------
33+
34+
.. code-block:: python
35+
36+
import datafusion
37+
from datafusion import col, lit
38+
39+
# Create a context and register a data source
40+
ctx = datafusion.SessionContext()
41+
ctx.register_csv("my_table", "path/to/data.csv")
42+
43+
# Create and manipulate a DataFrame
44+
df = ctx.sql("SELECT * FROM my_table")
45+
46+
# Or use the DataFrame API directly
47+
df = (ctx.table("my_table")
48+
      .filter(col("age") > lit(25))
49+
      .select([col("name"), col("age")]))
50+
51+
# Execute and collect results
52+
result = df.collect()
53+
54+
# Display the first few rows
55+
df.show()
56+
57+
HTML Rendering
58+
--------------
59+
60+
When working in Jupyter notebooks or other environments that support HTML rendering, DataFrames will
61+
automatically display as formatted HTML tables, making it easier to visualize your data.
62+
63+
The ``_repr_html_`` method is called automatically by Jupyter to render a DataFrame. This method
64+
controls how DataFrames appear in notebook environments, providing a richer visualization than
65+
plain text output.
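
As a minimal illustration of the display hook itself, the sketch below defines a hypothetical ``TinyTable`` class (not part of DataFusion) with a ``_repr_html_`` method. Jupyter looks for this method on the last expression of a cell and renders the returned string as HTML; outside a notebook, ``__repr__`` is used instead.

```python
from html import escape


class TinyTable:
    """Hypothetical table type used only to show the display hook."""

    def __init__(self, columns, rows):
        self.columns = columns
        self.rows = rows

    def __repr__(self):
        # Plain-text fallback used by the standard Python REPL.
        return f"TinyTable({len(self.rows)} rows)"

    def _repr_html_(self):
        # Escape every value so cell contents cannot inject markup.
        head = "".join(f"<th>{escape(c)}</th>" for c in self.columns)
        body = "".join(
            "<tr>" + "".join(f"<td>{escape(str(v))}</td>" for v in row) + "</tr>"
            for row in self.rows
        )
        return f"<table><tr>{head}</tr>{body}</table>"


table = TinyTable(["name", "age"], [("Ann", 31), ("<Bob>", 22)])
html_output = table._repr_html_()
print(html_output)
```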
66+
67+
Customizing HTML Rendering
68+
--------------------------
69+
70+
You can customize how DataFrames are rendered in HTML by configuring the formatter:
71+
72+
.. code-block:: python
73+
74+
from datafusion.html_formatter import configure_formatter
75+
76+
# Change the default styling
77+
configure_formatter(
78+
    max_rows=50,  # Maximum number of rows to display
79+
    max_width=None,  # Maximum width in pixels (None for auto)
80+
    theme="light",  # Theme: "light" or "dark"
81+
    precision=2,  # Floating point precision
82+
    thousands_separator=",",  # Separator for thousands
83+
    date_format="%Y-%m-%d",  # Date format
84+
    truncate_width=20  # Max width for string columns before truncating
85+
)
86+
87+
The formatter settings affect all DataFrames displayed after configuration.
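
To make the effect of these options concrete, here is a rough standard-library sketch of what such settings typically do to individual values. The ``format_cell`` helper is hypothetical; it borrows the option names above but is not the library's code, and the trailing ``...`` on truncated strings is an assumption.

```python
from datetime import date


def format_cell(value, precision=2, thousands_separator=",",
                date_format="%Y-%m-%d", truncate_width=20):
    """Hypothetical helper mirroring the option names; not the library's code."""
    if isinstance(value, date):
        return value.strftime(date_format)
    if isinstance(value, float):
        # Group digits with commas, then swap in the requested separator.
        text = f"{value:,.{precision}f}"
        return text.replace(",", thousands_separator)
    text = str(value)
    if len(text) > truncate_width:
        # Assumed truncation style: cut and append an ellipsis marker.
        return text[:truncate_width] + "..."
    return text


print(format_cell(1234567.891))
print(format_cell(date(2021, 1, 5)))
print(format_cell("a very long string value that overflows"))
```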
88+
89+
Custom Style Providers
90+
----------------------
91+
92+
For advanced styling needs, you can create a custom style provider:
93+
94+
.. code-block:: python
95+
96+
from datafusion.html_formatter import StyleProvider, configure_formatter
97+
98+
class MyStyleProvider(StyleProvider):
99+
    def get_table_styles(self):
100+
        return {
101+
            "table": "border-collapse: collapse; width: 100%;",
102+
            "th": "background-color: #007bff; color: white; padding: 8px; text-align: left;",
103+
            "td": "border: 1px solid #ddd; padding: 8px;",
104+
            "tr:nth-child(even)": "background-color: #f2f2f2;",
105+
        }
106+
107+
    def get_value_styles(self, dtype, value):
108+
        """Return custom styles for specific values"""
109+
        if dtype == "float" and value < 0:
110+
            return "color: red;"
111+
        return None
112+
113+
# Apply the custom style provider
114+
configure_formatter(style_provider=MyStyleProvider())
115+
116+
Creating a Custom Formatter
117+
---------------------------
118+
119+
For complete control over rendering, you can implement a custom formatter:
120+
121+
.. code-block:: python
122+
123+
from datafusion.html_formatter import Formatter, configure_formatter, get_formatter
124+
125+
class MyFormatter(Formatter):
126+
    def format_html(self, batches, schema, has_more=False, table_uuid=None):
127+
        # Create your custom HTML here
128+
        html = "<div class='my-custom-table'>"
129+
        # ... formatting logic ...
130+
        html += "</div>"
131+
        return html
132+
133+
# Set as the global formatter
134+
configure_formatter(formatter_class=MyFormatter)
135+
136+
# Or use the formatter just for specific operations (batches and schema must already exist, e.g. from df.collect() and df.schema())
137+
formatter = get_formatter()
138+
custom_html = formatter.format_html(batches, schema)
139+
140+
Managing Formatters
141+
-------------------
142+
143+
Reset to default formatting:
144+
145+
.. code-block:: python
146+
147+
from datafusion.html_formatter import reset_formatter
148+
149+
# Reset to default settings
150+
reset_formatter()
151+
152+
Get the current formatter settings:
153+
154+
.. code-block:: python
155+
156+
from datafusion.html_formatter import get_formatter
157+
158+
formatter = get_formatter()
159+
print(formatter.max_rows)
160+
print(formatter.theme)
161+
162+
Contextual Formatting
163+
---------------------
164+
165+
You can also use a context manager to temporarily change formatting settings:
166+
167+
.. code-block:: python
168+
169+
from datafusion.html_formatter import formatting_context
170+
171+
# Default formatting
172+
df.show()
173+
174+
# Temporarily use different formatting
175+
with formatting_context(max_rows=100, theme="dark"):
176+
    df.show()  # Will use the temporary settings
177+
178+
# Back to default formatting
179+
df.show()
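
The save-and-restore pattern behind a context manager like this can be sketched with ``contextlib``. The ``_settings`` dictionary and ``temporary_settings`` function below are hypothetical stand-ins; how the real formatter stores its state may differ.

```python
from contextlib import contextmanager

# Hypothetical module-level settings; the real formatter's storage may differ.
_settings = {"max_rows": 50, "theme": "light"}


@contextmanager
def temporary_settings(**overrides):
    saved = dict(_settings)        # snapshot the current values
    _settings.update(overrides)    # apply the temporary overrides
    try:
        yield _settings
    finally:
        # Restore the snapshot even if the body raises.
        _settings.clear()
        _settings.update(saved)


with temporary_settings(max_rows=100, theme="dark"):
    inside = dict(_settings)
after = dict(_settings)

print(inside)
print(after)
```

Putting the restore step in ``finally`` is the key design choice: the previous settings come back even when rendering inside the ``with`` block fails.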
