| 1 | +.. Licensed to the Apache Software Foundation (ASF) under one |
| 2 | +.. or more contributor license agreements. See the NOTICE file |
| 3 | +.. distributed with this work for additional information |
| 4 | +.. regarding copyright ownership. The ASF licenses this file |
| 5 | +.. to you under the Apache License, Version 2.0 (the |
| 6 | +.. "License"); you may not use this file except in compliance |
| 7 | +.. with the License. You may obtain a copy of the License at |
| 8 | + |
| 9 | +.. http://www.apache.org/licenses/LICENSE-2.0 |
| 10 | + |
| 11 | +.. Unless required by applicable law or agreed to in writing, |
| 12 | +.. software distributed under the License is distributed on an |
| 13 | +.. "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY |
| 14 | +.. KIND, either express or implied. See the License for the |
| 15 | +.. specific language governing permissions and limitations |
| 16 | +.. under the License. |
| 17 | + |
| 18 | +.. highlight:: console |
| 19 | +.. _development-cpp-compute: |
| 20 | + |
| 21 | +============================ |
| 22 | +Developing Arrow C++ Compute |
| 23 | +============================ |
| 24 | + |
| 25 | +This section provides information for developers of the Arrow C++ Compute module. |
| 26 | + |
| 27 | +Row Table |
| 28 | +========= |
| 29 | + |
| 30 | +The row table in Arrow stores data in row-major format. This format suits |
| 31 | +workloads that access individual rows at random and read all columns of a row |
| 32 | +together. It is especially useful for hash-table keys: keeping the columns of |
| 33 | +a row adjacent in memory improves data locality, which makes operations such |
| 34 | +as grouping and hash joins more efficient. |
| 35 | + |
| 36 | +Metadata |
| 37 | +-------- |
| 38 | + |
| 39 | +A row table is defined by its metadata, ``RowTableMetadata``, which includes |
| 40 | +information about its schema, alignment, and derived properties. |
| 41 | + |
| 42 | +The schema specifies the types and order of columns. Each row in the row table |
| 43 | +contains the data for each column in that logical order (the physical order may |
| 44 | +vary; see :ref:`row-encoding` for details). |
| 45 | + |
| 46 | +.. note:: |
| 47 | + Columns of nested types or large binary types are **not** supported in the |
| 48 | + row table. |
| 49 | + |
| 50 | +One important property derived from the schema is whether the row table is |
| 51 | +fixed-length or varying-length. A fixed-length row table contains only |
| 52 | +fixed-length columns, while a varying-length row table includes at least one |
| 53 | +varying-length column. This distinction determines how data is stored and |
| 54 | +accessed in the row table. |
| 55 | + |
| 56 | +Each row in the row table is aligned to ``RowTableMetadata::row_alignment`` |
| 57 | +bytes. Fixed-length columns with non-power-of-2 lengths are also aligned to |
| 58 | +``RowTableMetadata::row_alignment`` bytes. Varying-length columns are aligned to |
| 59 | +``RowTableMetadata::string_alignment`` bytes. |
| 60 | + |
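The alignment rules above amount to rounding an offset up to the next multiple of the alignment. The following is a minimal sketch of that arithmetic; ``PaddingFor`` and ``AlignUp`` are hypothetical helper names chosen for illustration and are not taken from the Arrow code base.

```cpp
#include <cstdint>

// Hypothetical helper (not an Arrow API): number of padding bytes needed so
// that `offset` becomes a multiple of `alignment`. Assumes alignment > 0.
constexpr int64_t PaddingFor(int64_t offset, int64_t alignment) {
  return (alignment - offset % alignment) % alignment;
}

// Hypothetical helper: rounds `offset` up to the next multiple of `alignment`.
constexpr int64_t AlignUp(int64_t offset, int64_t alignment) {
  return offset + PaddingFor(offset, alignment);
}
```

For instance, with an 8-byte alignment an offset of 21 needs 3 bytes of padding and is rounded up to 24, while an offset of 24 is already aligned and stays unchanged.
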
| 61 | +Buffer Layout |
| 62 | +------------- |
| 63 | + |
| 64 | +Similar to most Arrow ``Array``\s, the row table consists of three buffers: |
| 65 | + |
| 66 | +- **Null Masks Buffer**: Indicates null values for each column in each row. |
| 67 | +- **Fixed-length Buffer**: Stores row data for fixed-length tables or offsets to |
| 68 | + varying-length data for varying-length tables. |
| 69 | +- **Varying-length Buffer** (Optional): Contains row data for varying-length |
| 70 | + tables; unused for fixed-length tables. |
| 71 | + |
| 72 | +Row Format |
| 73 | +---------- |
| 74 | + |
| 75 | +Null Masks |
| 76 | +~~~~~~~~~~ |
| 77 | + |
| 78 | +For each row, a contiguous sequence of bits represents whether each column in |
| 79 | +that row is null. Each bit corresponds to a specific column, with ``1`` |
| 80 | +indicating the value is null and ``0`` indicating the value is valid. Note that |
| 81 | +this is the opposite of how the validity bitmap works for ``Array``\s. The null |
| 82 | +mask for a row occupies ``RowTableMetadata::null_masks_bytes_per_row`` bytes. |
| 83 | + |
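To make the inverted bit convention concrete, the sketch below tests whether a given column of a given row is null. ``IsNull`` is a hypothetical helper, ``bytes_per_row`` stands in for ``RowTableMetadata::null_masks_bytes_per_row``, and the least-significant-bit-first ordering within each byte is an assumption made for illustration.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical helper (not an Arrow API): returns true if column `col` of
// row `row` is null. A set bit means NULL, which is the opposite of the
// validity bitmap convention used by Arrow Arrays.
// Assumption: bits are numbered LSB-first within each byte.
inline bool IsNull(const std::vector<uint8_t>& null_masks,
                   int64_t bytes_per_row, int64_t row, int col) {
  const uint8_t* row_mask = null_masks.data() + row * bytes_per_row;
  return ((row_mask[col / 8] >> (col % 8)) & 1) != 0;
}
```

With one mask byte per row and a buffer of ``{0x02, 0x00}``, only column 1 of row 0 would be reported as null under these assumptions.
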
| 84 | +Fixed-length Row Data |
| 85 | +~~~~~~~~~~~~~~~~~~~~~ |
| 86 | + |
| 87 | +In a fixed-length row table, row data is stored directly in the fixed-length |
| 88 | +buffer, with the columns of each row stored sequentially. A ``boolean`` column |
| 89 | +is a special case: a normal Arrow ``Array`` stores each value in 1 bit, whereas |
| 90 | +a row table stores it in 1 byte. The varying-length buffer is not used in this |
| 91 | +case. |
| 92 | + |
| 93 | +For example, a row table with the schema ``(int32, boolean)`` and rows |
| 94 | +``[[7, false], [8, true], [9, false], ...]`` is stored in the fixed-length |
| 95 | +buffer as follows: |
| 96 | + |
| 97 | +.. list-table:: |
| 98 | + :header-rows: 1 |
| 99 | + |
| 100 | + * - Row 0 |
| 101 | + - Row 1 |
| 102 | + - Row 2 |
| 103 | + - ... |
| 104 | + * - ``7 0 0 0, 0 (padding)`` |
| 105 | + - ``8 0 0 0, 1 (padding)`` |
| 106 | + - ``9 0 0 0, 0 (padding)`` |
| 107 | + - ... |
| 108 | + |
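The layout in the table above can be reproduced with a short sketch. ``Row`` and ``EncodeFixedRows`` are hypothetical names introduced for illustration. The row alignment is passed in as a parameter because the example above does not state how many padding bytes are used, and the integer is written least-significant byte first to match the byte order shown in the table.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical row type for the schema (int32, boolean).
struct Row {
  int32_t value;
  bool flag;
};

// Hypothetical encoder (not an Arrow API): writes each row as 4 bytes of
// int32 (least-significant byte first) followed by 1 byte for the boolean,
// then zero padding up to a multiple of `row_alignment`.
std::vector<uint8_t> EncodeFixedRows(const std::vector<Row>& rows,
                                     size_t row_alignment) {
  const size_t raw_width = sizeof(int32_t) + 1;  // a boolean takes one byte
  const size_t row_width =
      (raw_width + row_alignment - 1) / row_alignment * row_alignment;
  std::vector<uint8_t> buffer(rows.size() * row_width, 0);
  for (size_t i = 0; i < rows.size(); ++i) {
    uint8_t* dst = buffer.data() + i * row_width;
    const uint32_t bits = static_cast<uint32_t>(rows[i].value);
    for (int b = 0; b < 4; ++b) {
      dst[b] = static_cast<uint8_t>((bits >> (8 * b)) & 0xFF);
    }
    dst[4] = rows[i].flag ? 1 : 0;
  }
  return buffer;
}
```

Assuming an 8-byte row alignment, each encoded row occupies 8 bytes: 5 bytes of data followed by 3 bytes of padding.
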
| 109 | +Offsets for Varying-length Row Data |
| 110 | +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ |
| 111 | + |
| 112 | +In a varying-length row table, the fixed-length buffer contains offsets to the |
| 113 | +varying-length row data, which is stored separately in the optional |
| 114 | +varying-length buffer. The offsets are of type ``RowTableMetadata::offset_type`` |
| 115 | +(fixed as ``int64_t``) and indicate the starting position of the row data for |
| 116 | +each row. |
| 117 | + |
| 118 | +Varying-length Row Data |
| 119 | +~~~~~~~~~~~~~~~~~~~~~~~ |
| 120 | + |
| 121 | +In a varying-length row table, the varying-length buffer contains the actual row |
| 122 | +data, stored contiguously. The offsets in the fixed-length buffer point to the |
| 123 | +starting position of each row's data. |
| 124 | + |
| 125 | +.. _row-encoding: |
| 126 | + |
| 127 | +Row Encoding |
| 128 | +^^^^^^^^^^^^ |
| 129 | + |
| 130 | +A varying-length row is encoded as follows: |
| 131 | + |
| 132 | +- Fixed-length columns are stored first. |
| 133 | +- A sequence of offsets to each varying-length column follows. Each offset is |
| 134 | + 32-bit and indicates the **end** position within the row data of the |
| 135 | + corresponding varying-length column. |
| 136 | +- Varying-length columns are stored last. |
| 137 | + |
| 138 | +For example, a row table with the schema ``(int32, string, string, int32)`` and |
| 139 | +rows ``[[7, 'Alice', 'x', 0], [8, 'Bob', 'y', 1], [9, 'Charlotte', 'z', 2], ...]`` |
| 140 | +is stored as follows (assuming 8-byte alignment for varying-length columns, with ``~`` denoting a padding byte): |
| 141 | + |
| 142 | +Fixed-length buffer (row offsets): |
| 143 | + |
| 144 | +.. list-table:: |
| 145 | + :header-rows: 1 |
| 146 | + |
| 147 | + * - Row 0 |
| 148 | + - Row 1 |
| 149 | + - Row 2 |
| 150 | + - Row 3 |
| 151 | + - ... |
| 152 | + * - ``0 0 0 0 0 0 0 0`` |
| 153 | + - ``32 0 0 0 0 0 0 0`` |
| 154 | + - ``64 0 0 0 0 0 0 0`` |
| 155 | + - ``104 0 0 0 0 0 0 0`` |
| 156 | + - ... |
| 157 | + |
| 158 | +Varying-length buffer (row data): |
| 159 | + |
| 160 | +.. list-table:: |
| 161 | + :header-rows: 1 |
| 162 | + |
| 163 | + * - Row |
| 164 | + - Fixed-length Cols |
| 165 | + - Varying-length Offsets |
| 166 | + - Varying-length Cols |
| 167 | + * - 0 |
| 168 | + - ``7 0 0 0, 0 0 0 0`` |
| 169 | + - ``21 0 0 0, 25 0 0 0`` |
| 170 | + - ``Alice~~~x~~~~~~~`` |
| 171 | + * - 1 |
| 172 | + - ``8 0 0 0, 1 0 0 0`` |
| 173 | + - ``19 0 0 0, 25 0 0 0`` |
| 174 | + - ``Bob~~~~~y~~~~~~~`` |
| 175 | + * - 2 |
| 176 | + - ``9 0 0 0, 2 0 0 0`` |
| 177 | + - ``25 0 0 0, 33 0 0 0`` |
| 178 | + - ``Charlotte~~~~~~~z~~~~~~~`` |
| 179 | + * - 3 |
| 180 | + - ... |
| 181 | + - ... |
| 182 | + - ... |
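
The offsets in the example above can be checked with a small sketch of the row encoding. ``RowLayout``, ``RoundUp``, and ``LayoutRow`` are hypothetical names introduced for illustration. The sketch assumes that each varying-length column starts at a multiple of the string alignment and that the end of the row is padded to the same alignment, which is consistent with the numbers shown in the tables.

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical result type: layout of one encoded varying-length row.
struct RowLayout {
  std::vector<uint32_t> end_offsets;  // one 32-bit end offset per string
  uint32_t size;                      // total row size including padding
};

// Hypothetical helper: rounds `offset` up to a multiple of `alignment`.
inline uint32_t RoundUp(uint32_t offset, uint32_t alignment) {
  return (offset + alignment - 1) / alignment * alignment;
}

// Computes the layout of one row: `fixed_bytes` of fixed-length columns,
// then one 32-bit end offset per varying-length column, then the
// varying-length data, each column starting at an aligned position.
RowLayout LayoutRow(uint32_t fixed_bytes,
                    const std::vector<std::string>& strings,
                    uint32_t string_alignment) {
  RowLayout layout;
  uint32_t pos = fixed_bytes +
                 static_cast<uint32_t>(sizeof(uint32_t) * strings.size());
  for (const std::string& s : strings) {
    pos = RoundUp(pos, string_alignment);  // aligned start of this column
    pos += static_cast<uint32_t>(s.size());
    layout.end_offsets.push_back(pos);     // offsets mark the END position
  }
  layout.size = RoundUp(pos, string_alignment);
  return layout;
}
```

For row 0, calling ``LayoutRow(8, {"Alice", "x"}, 8)`` yields end offsets 21 and 25 and a row size of 32. The running sum of row sizes (32, 32, 40) gives the row offsets 0, 32, 64, and 104 stored in the fixed-length buffer.
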