Commit 5240670

zanmato1984, raulcd, and amoeba authored
GH-46209: [Documentation][C++][Compute] Add cpp developer documentation for row table (#46210)
### What changes are included in this PR?

Add cpp developer documentation for row table, making it under the compute category.

### Are these changes tested?

No need.

### Are there any user-facing changes?

None.

* GitHub Issue: #46209

Lead-authored-by: Rossi Sun <zanmato1984@gmail.com>
Co-authored-by: Raúl Cumplido <raulcumplido@gmail.com>
Co-authored-by: Bryce Mecum <petridish@gmail.com>
Signed-off-by: Rossi Sun <zanmato1984@gmail.com>
1 parent 992bee2 commit 5240670

3 files changed

+201
-2
lines changed

docs/source/cpp/index.rst

Lines changed: 18 additions & 2 deletions
@@ -96,11 +96,26 @@ Welcome to the Apache Arrow C++ implementation documentation!
9696

9797
To the API Reference
9898

99-
.. grid:: 1
99+
.. grid:: 1 2 2 2
100100
:gutter: 4
101101
:padding: 2 2 0 0
102102
:class-container: sd-text-center
103103

104+
.. grid-item-card:: C++ Development
105+
:class-card: contrib-card
106+
:shadow: none
107+
108+
Find guidelines and documentation for Arrow C++ developers
109+
110+
+++
111+
112+
.. button-link:: ../developers/cpp/index.html
113+
:click-parent:
114+
:color: primary
115+
:expand:
116+
117+
To C++ Development
118+
104119
.. grid-item-card:: Cookbook
105120
:class-card: contrib-card
106121
:shadow: none
@@ -126,4 +141,5 @@ Welcome to the Apache Arrow C++ implementation documentation!
126141
user_guide
127142
Examples <examples/index>
128143
api
129-
C++ cookbook <https://arrow.apache.org/cookbook/cpp/>
144+
C++ Development <../developers/cpp/index>
145+
C++ Cookbook <https://arrow.apache.org/cookbook/cpp/>
Lines changed: 182 additions & 0 deletions
@@ -0,0 +1,182 @@
1+
.. Licensed to the Apache Software Foundation (ASF) under one
2+
.. or more contributor license agreements. See the NOTICE file
3+
.. distributed with this work for additional information
4+
.. regarding copyright ownership. The ASF licenses this file
5+
.. to you under the Apache License, Version 2.0 (the
6+
.. "License"); you may not use this file except in compliance
7+
.. with the License. You may obtain a copy of the License at
8+
9+
.. http://www.apache.org/licenses/LICENSE-2.0
10+
11+
.. Unless required by applicable law or agreed to in writing,
12+
.. software distributed under the License is distributed on an
13+
.. "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
14+
.. KIND, either express or implied. See the License for the
15+
.. specific language governing permissions and limitations
16+
.. under the License.
17+
18+
.. highlight:: console
19+
.. _development-cpp-compute:
20+
21+
============================
22+
Developing Arrow C++ Compute
23+
============================
24+
25+
This section provides information for developers of the Arrow C++ Compute module.
26+
27+
Row Table
28+
=========
29+
30+
The row table in Arrow represents data stored in row-major format. This format
31+
is particularly useful when individual rows are accessed at random and all
32+
columns of a row are read together. It is especially
33+
advantageous for hash-table keys and facilitates efficient operations such as
34+
grouping and hash joins by optimizing memory access patterns and data locality.
35+
36+
Metadata
37+
--------
38+
39+
A row table is defined by its metadata, ``RowTableMetadata``, which includes
40+
information about its schema, alignment, and derived properties.
41+
42+
The schema specifies the types and order of columns. Each row in the row table
43+
contains the data for each column in that logical order (the physical order may
44+
vary; see :ref:`row-encoding` for details).
45+
46+
.. note::
47+
Columns of nested types or large binary types are **not** supported in the
48+
row table.
49+
50+
One important property derived from the schema is whether the row table is
51+
fixed-length or varying-length. A fixed-length row table contains only
52+
fixed-length columns, while a varying-length row table includes at least one
53+
varying-length column. This distinction determines how data is stored and
54+
accessed in the row table.
55+
56+
Each row in the row table is aligned to ``RowTableMetadata::row_alignment``
57+
bytes. Fixed-length columns with non-power-of-2 lengths are also aligned to
58+
``RowTableMetadata::row_alignment`` bytes. Varying-length columns are aligned to
59+
``RowTableMetadata::string_alignment`` bytes.
60+
61+
Buffer Layout
62+
-------------
63+
64+
Similar to most Arrow ``Array``\s, the row table consists of three buffers:
65+
66+
- **Null Masks Buffer**: Indicates null values for each column in each row.
67+
- **Fixed-length Buffer**: Stores row data for fixed-length tables or offsets to
68+
varying-length data for varying-length tables.
69+
- **Varying-length Buffer** (Optional): Contains row data for varying-length
70+
tables; unused for fixed-length tables.
71+
72+
Row Format
73+
----------
74+
75+
Null Masks
76+
~~~~~~~~~~
77+
78+
For each row, a contiguous sequence of bits represents whether each column in
79+
that row is null. Each bit corresponds to a specific column, with ``1``
80+
indicating the value is null and ``0`` indicating the value is valid. Note that
81+
this is the opposite of how the validity bitmap works for ``Array``\s. The null
82+
mask for a row occupies ``RowTableMetadata::null_masks_bytes_per_row`` bytes.
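
The null-mask convention above can be illustrated with a minimal sketch. ``IsNull`` is a hypothetical helper, not an Arrow API, and the least-significant-bit-first ordering within each byte is an assumption of this sketch rather than something stated above:

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical helper (not part of Arrow): returns true when column `col` of
// row `row` is marked null in a null-masks buffer.
// Assumption: bits are ordered least-significant-bit first within each byte.
bool IsNull(const std::vector<uint8_t>& null_masks, int64_t bytes_per_row,
            int64_t row, int col) {
  // Each row owns `bytes_per_row` consecutive bytes of the buffer.
  const uint8_t* row_mask = null_masks.data() + row * bytes_per_row;
  // A set bit means null, the opposite of an Array validity bitmap.
  return ((row_mask[col / 8] >> (col % 8)) & 1) != 0;
}
```

Here ``bytes_per_row`` plays the role of ``RowTableMetadata::null_masks_bytes_per_row``; a mask byte of ``0x02`` for a row would mark only column 1 as null.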
83+
84+
Fixed-length Row Data
85+
~~~~~~~~~~~~~~~~~~~~~
86+
87+
In a fixed-length row table, row data is directly stored in the fixed-length
88+
buffer. All columns in each row are stored sequentially. Notably, a ``boolean``
89+
column is special: a normal Arrow ``Array`` stores each value in 1 bit,
90+
whereas a row table stores each value in 1 byte. The varying-length buffer is
91+
not used in this case.
92+
93+
For example, a row table with the schema ``(int32, boolean)`` and rows
94+
``[[7, false], [8, true], [9, false], ...]`` is stored in the fixed-length
95+
buffer as follows:
96+
97+
.. list-table::
98+
:header-rows: 1
99+
100+
* - Row 0
101+
- Row 1
102+
- Row 2
103+
- ...
104+
* - ``7 0 0 0, 0 (padding)``
105+
- ``8 0 0 0, 1 (padding)``
106+
- ``9 0 0 0, 0 (padding)``
107+
- ...
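
As a rough illustration of the layout in the table above, the following sketch packs ``(int32, boolean)`` rows into a byte buffer. ``Row``, ``AlignUp``, and ``EncodeFixedRows`` are hypothetical names, not Arrow APIs, and the sketch assumes a power-of-2 row alignment and native (typically little-endian) byte order:

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical row type for the (int32, boolean) schema.
struct Row {
  int32_t a;
  bool b;
};

// Round `n` up to a multiple of `alignment` (assumed to be a power of 2).
int64_t AlignUp(int64_t n, int64_t alignment) {
  return (n + alignment - 1) & ~(alignment - 1);
}

// Pack rows back to back; each row is padded to `row_alignment` bytes.
std::vector<uint8_t> EncodeFixedRows(const std::vector<Row>& rows,
                                     int64_t row_alignment) {
  const int64_t int_width = static_cast<int64_t>(sizeof(int32_t));
  // 4 bytes for the int32 plus 1 full byte for the boolean, then padding.
  const int64_t row_width = AlignUp(int_width + 1, row_alignment);
  const int64_t num_rows = static_cast<int64_t>(rows.size());
  std::vector<uint8_t> buffer(static_cast<size_t>(num_rows * row_width), 0);
  for (int64_t i = 0; i < num_rows; ++i) {
    uint8_t* dst = buffer.data() + i * row_width;
    std::memcpy(dst, &rows[i].a, sizeof(int32_t));  // native byte order
    dst[int_width] = rows[i].b ? 1 : 0;             // boolean takes 1 byte
  }
  return buffer;
}
```

With an assumed ``row_alignment`` of 8, each row occupies 8 bytes: 4 for the integer, 1 for the boolean, and 3 of zero padding.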
108+
109+
Offsets for Varying-length Row Data
110+
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
111+
112+
In a varying-length row table, the fixed-length buffer contains offsets to the
113+
varying-length row data, which is stored separately in the optional
114+
varying-length buffer. The offsets are of type ``RowTableMetadata::offset_type``
115+
(fixed as ``int64_t``) and indicate the starting position of the row data for
116+
each row.
117+
118+
Varying-length Row Data
119+
~~~~~~~~~~~~~~~~~~~~~~~
120+
121+
In a varying-length row table, the varying-length buffer contains the actual row
122+
data, stored contiguously. The offsets in the fixed-length buffer point to the
123+
starting position of each row's data.
124+
125+
.. _row-encoding:
126+
127+
Row Encoding
128+
^^^^^^^^^^^^
129+
130+
A varying-length row is encoded as follows:
131+
132+
- Fixed-length columns are stored first.
133+
- A sequence of offsets to each varying-length column follows. Each offset is
134+
32-bit and indicates the **end** position within the row data of the
135+
corresponding varying-length column.
136+
- Varying-length columns are stored last.
137+
138+
For example, a row table with the schema ``(int32, string, string, int32)`` and
139+
rows ``[[7, 'Alice', 'x', 0], [8, 'Bob', 'y', 1], [9, 'Charlotte', 'z', 2], ...]``
140+
is stored as follows (assuming 8-byte alignment for varying-length columns):
141+
142+
Fixed-length buffer (row offsets):
143+
144+
.. list-table::
145+
:header-rows: 1
146+
147+
* - Row 0
148+
- Row 1
149+
- Row 2
150+
- Row 3
151+
- ...
152+
* - ``0 0 0 0 0 0 0 0``
153+
- ``32 0 0 0 0 0 0 0``
154+
- ``64 0 0 0 0 0 0 0``
155+
- ``104 0 0 0 0 0 0 0``
156+
- ...
157+
158+
Varying-length buffer (row data):
159+
160+
.. list-table::
161+
:header-rows: 1
162+
163+
* - Row
164+
- Fixed-length Cols
165+
- Varying-length Offsets
166+
- Varying-length Cols
167+
* - 0
168+
- ``7 0 0 0, 0 0 0 0``
169+
- ``21 0 0 0, 25 0 0 0``
170+
- ``Alice~~~x~~~~~~~``
171+
* - 1
172+
- ``8 0 0 0, 1 0 0 0``
173+
- ``19 0 0 0, 25 0 0 0``
174+
- ``Bob~~~~~y~~~~~~~``
175+
* - 2
176+
- ``9 0 0 0, 2 0 0 0``
177+
- ``25 0 0 0, 33 0 0 0``
178+
- ``Charlotte~~~~~~~z~~~~~~~``
179+
* - 3
180+
- ...
181+
- ...
182+
- ...
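
The offsets in the example above can be reproduced with a small sketch. ``RowLayout``, ``RoundUpTo``, and ``LayoutRow`` are hypothetical names, not Arrow APIs. The sketch assumes that each varying-length value starts at an aligned position, that the row is padded to the same alignment at its end, and that the alignment is a power of 2:

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical result type: per-column end offsets plus the padded row size.
struct RowLayout {
  std::vector<uint32_t> end_offsets;  // end of each varying-length column
  int64_t row_size = 0;               // total size of the row after padding
};

// Round `n` up to a multiple of `alignment` (assumed to be a power of 2).
int64_t RoundUpTo(int64_t n, int64_t alignment) {
  return (n + alignment - 1) & ~(alignment - 1);
}

// Lay out one varying-length row: fixed-length columns first, then one
// 32-bit end offset per varying-length column, then the values themselves.
RowLayout LayoutRow(int64_t fixed_bytes, const std::vector<std::string>& values,
                    int64_t string_alignment) {
  RowLayout layout;
  int64_t pos = fixed_bytes + 4 * static_cast<int64_t>(values.size());
  for (const std::string& value : values) {
    pos = RoundUpTo(pos, string_alignment);  // value starts aligned
    pos += static_cast<int64_t>(value.size());
    layout.end_offsets.push_back(static_cast<uint32_t>(pos));  // end position
  }
  layout.row_size = RoundUpTo(pos, string_alignment);
  return layout;
}
```

For the rows above, with 8 bytes of fixed-length columns and 8-byte alignment, this yields end offsets ``21, 25`` for row 0, ``19, 25`` for row 1, and ``25, 33`` for row 2. The row sizes are 32, 32, and 40 bytes, whose running sum gives the row offsets ``0, 32, 64, 104`` in the fixed-length buffer.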

docs/source/developers/cpp/index.rst

Lines changed: 1 addition & 0 deletions
@@ -30,3 +30,4 @@ C++ Development
3030
emscripten
3131
conventions
3232
fuzzing
33+
compute
