|
| 1 | +Multi-Pack-Index (MIDX) Design Notes |
| 2 | +==================================== |
| 3 | + |
| 4 | +The Git object directory contains a 'pack' directory containing |
| 5 | +packfiles (with suffix ".pack") and pack-indexes (with suffix |
| 6 | +".idx"). The pack-indexes provide a way to look up objects and |
| 7 | +navigate to their offset within the pack, but these must come |
| 8 | +in pairs with the packfiles. This pairing depends on the file |
| 9 | +names, as the pack-index differs only in suffix from its |
| 10 | +packfile. While the pack-indexes provide fast lookup per packfile, |
| 11 | +this performance degrades as the number of packfiles increases, |
| 12 | +because resolving abbreviated object IDs requires inspecting every |
| 13 | +packfile, and we are more likely to miss on our most-recently-used packfile. |
| 14 | +For some large repositories, repacking into a single packfile |
| 15 | +is not feasible due to storage space or excessive repack times. |
| 16 | + |
| 17 | +The multi-pack-index (MIDX for short) stores a list of objects |
| 18 | +and their offsets into multiple packfiles. It contains: |
| 19 | + |
| 20 | +* A list of packfile names. |
| 21 | +* A sorted list of object IDs. |
| 22 | +* A list of metadata for the ith object ID including: |
| 23 | +** A value j referring to the jth packfile. |
| 24 | +** An offset within the jth packfile for the object. |
| 25 | +* If large offsets are required, we use another list of large |
| 26 | + offsets similar to version 2 pack-indexes. |
| 27 | +* An optional list of objects in pseudo-pack order (used with MIDX bitmaps). |
| 28 | + |
| 29 | +Thus, we can provide O(log N) lookup time for any number |
| 30 | +of packfiles. |
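The lookup described above can be sketched as a binary search over the sorted object IDs, with a parallel list carrying the pack number and offset. This is a minimal illustration only: `pack_names`, `oids`, `entries`, and `midx_lookup` are hypothetical names, the object IDs are shortened, and nothing here mirrors the on-disk chunk layout.

```python
from bisect import bisect_left

# Hypothetical in-memory stand-in for a MIDX (not the on-disk format).
pack_names = ["pack-a.pack", "pack-b.pack"]
# Sorted object IDs, shortened to four hex digits for readability.
oids = ["1f3a", "4c2e", "9b07", "d410"]
# Parallel to `oids`: (index j into pack_names, offset within pack j).
entries = [(0, 12), (1, 300), (0, 845), (1, 12)]


def midx_lookup(oid):
    """Binary-search the sorted ID list; return (pack name, offset) or None."""
    i = bisect_left(oids, oid)
    if i < len(oids) and oids[i] == oid:
        pack_idx, offset = entries[i]
        return pack_names[pack_idx], offset
    return None
```

The cost of the search depends only on the total number of objects, not on how many packfiles they are spread across.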
| 31 | + |
| 32 | +Design Details |
| 33 | +-------------- |
| 34 | + |
| 35 | +- The MIDX is stored in a file named 'multi-pack-index' in the |
| 36 | + .git/objects/pack directory. This could be stored in the pack |
| 37 | + directory of an alternate. It refers only to packfiles in that |
| 38 | + same directory. |
| 39 | + |
| 40 | +- The core.multiPackIndex config setting must be on (which is the |
| 41 | + default) to consume MIDX files. Setting it to `false` prevents |
| 42 | + Git from reading a MIDX file, even if one exists. |
| 43 | + |
| 44 | +- The file format includes parameters for the object ID hash |
| 45 | + function, so a future change of hash algorithm does not require |
| 46 | + a change in format. |
| 47 | + |
| 48 | +- The MIDX keeps only one record per object ID. If an object appears |
| 49 | + in multiple packfiles, then the MIDX selects the copy in the |
| 50 | + preferred packfile, otherwise selecting from the most-recently |
| 51 | + modified packfile. |
| 52 | + |
| 53 | +- If there exist packfiles in the pack directory not registered in |
| 54 | + the MIDX, then those packfiles are loaded into the `packed_git` |
| 55 | + list and `packed_git_mru` cache. |
| 56 | + |
| 57 | +- The pack-indexes (.idx files) remain in the pack directory so we |
| 58 | + can delete the MIDX file, set core.multiPackIndex to false, or downgrade |
| 59 | + without any loss of information. |
| 60 | + |
| 61 | +- The MIDX file format uses a chunk-based approach (similar to the |
| 62 | + commit-graph file) that allows optional data to be added. |
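The one-record-per-object rule in the list above can be sketched as follows, assuming each candidate copy is described by a hypothetical tuple `(oid, pack_name, pack_mtime, offset)`. The function name and tuple shape are illustrative and are not taken from Git's implementation.

```python
def select_copies(copies, preferred_pack=None):
    """Keep one (pack_name, offset) record per object ID.

    A copy in the preferred pack wins; otherwise the copy from the
    most recently modified pack wins.
    """
    best = {}
    for oid, pack, mtime, offset in copies:
        # Tuples compare element-wise: preferred first, then newer mtime.
        rank = (pack == preferred_pack, mtime)
        if oid not in best or rank > best[oid][0]:
            best[oid] = (rank, pack, offset)
    return {oid: (pack, offset) for oid, (_, pack, offset) in best.items()}
```

With no preferred pack given, every copy ranks as non-preferred and the choice falls back to modification time alone.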
| 63 | + |
| 64 | +Incremental multi-pack indexes |
| 65 | +------------------------------ |
| 66 | + |
| 67 | +As repositories grow in size, it becomes more expensive to write a |
| 68 | +multi-pack index (MIDX) that includes all packfiles. To accommodate |
| 69 | +this, the "incremental multi-pack indexes" feature allows for combining |
| 70 | +a "chain" of multi-pack indexes. |
| 71 | + |
| 72 | +Each individual component of the chain need only contain a small number |
| 73 | +of packfiles. Appending to the chain does not invalidate earlier parts |
| 74 | +of the chain, so repositories can control how much time is spent |
| 75 | +updating the MIDX chain by determining the number of packs in each layer |
| 76 | +of the MIDX chain. |
| 77 | + |
| 78 | +=== Design state |
| 79 | + |
| 80 | +At present, the incremental multi-pack indexes feature is missing two |
| 81 | +important components: |
| 82 | + |
| 83 | + - The ability to rewrite earlier portions of the MIDX chain (i.e., to |
| 84 | + "compact" some collection of adjacent MIDX layers into a single |
| 85 | + MIDX). At present the only supported way of shrinking a MIDX chain |
| 86 | + is to rewrite the entire chain from scratch without the `--split` |
| 87 | + flag. |
| 88 | ++ |
| 89 | +There are no fundamental limitations that stand in the way of being able |
| 90 | +to implement this feature. It is omitted from the initial implementation |
| 91 | +in order to reduce the complexity, but will be added later. |
| 92 | + |
| 93 | + - Support for reachability bitmaps. The classic single MIDX |
| 94 | + implementation does support reachability bitmaps (see the section |
| 95 | + titled "multi-pack-index reverse indexes" in |
| 96 | + linkgit:gitformat-pack[5] for more details). |
| 97 | ++ |
| 98 | +As above, there are no fundamental limitations that stand in the way of |
| 99 | +extending the incremental MIDX format to support reachability bitmaps. |
| 100 | +The design below specifically takes this into account, and support for |
| 101 | +reachability bitmaps will be added in a future patch series. It is |
| 102 | +omitted from the current implementation for the same reason as above. |
| 103 | ++ |
| 104 | +In brief, to support reachability bitmaps with the incremental MIDX |
| 105 | +feature, the concept of the pseudo-pack order is extended across each |
| 106 | +layer of the incremental MIDX chain to form a concatenated pseudo-pack |
| 107 | +order. This concatenation takes place in the same order as the chain |
| 108 | +itself (in other words, the concatenated pseudo-pack order for a chain |
| 109 | +`{$H1, $H2, $H3}` would be the pseudo-pack order for `$H1`, followed by |
| 110 | +the pseudo-pack order for `$H2`, followed by the pseudo-pack order for |
| 111 | +`$H3`). |
| 112 | ++ |
| 113 | +The layout will then be extended so that each layer of the incremental |
| 114 | +MIDX chain can write a `*.bitmap`. The objects in each layer's bitmap |
| 115 | +are offset by the number of objects in the previous layers of the chain. |
| 116 | + |
| 117 | +=== File layout |
| 118 | + |
| 119 | +Instead of storing a single `multi-pack-index` file (with an optional |
| 120 | +`.rev` and `.bitmap` extension) in `$GIT_DIR/objects/pack`, incremental |
| 121 | +MIDXs are stored in the following layout: |
| 122 | + |
| 123 | +---- |
| 124 | +$GIT_DIR/objects/pack/multi-pack-index.d/ |
| 125 | +$GIT_DIR/objects/pack/multi-pack-index.d/multi-pack-index-chain |
| 126 | +$GIT_DIR/objects/pack/multi-pack-index.d/multi-pack-index-$H1.midx |
| 127 | +$GIT_DIR/objects/pack/multi-pack-index.d/multi-pack-index-$H2.midx |
| 128 | +$GIT_DIR/objects/pack/multi-pack-index.d/multi-pack-index-$H3.midx |
| 129 | +---- |
| 130 | + |
| 131 | +The `multi-pack-index-chain` file contains a list of the incremental |
| 132 | +MIDX files in the chain, in order. The above example shows a chain whose |
| 133 | +`multi-pack-index-chain` file would contain the following lines: |
| 134 | + |
| 135 | +---- |
| 136 | +$H1 |
| 137 | +$H2 |
| 138 | +$H3 |
| 139 | +---- |
| 140 | + |
| 141 | +The `multi-pack-index-$H1.midx` file contains the first layer of the |
| 142 | +multi-pack-index chain. The `multi-pack-index-$H2.midx` file contains |
| 143 | +the second layer of the chain, and so on. |
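As a rough sketch of how the chain file maps to layer files, the helper below turns the contents of `multi-pack-index-chain` into the ordered list of layer file names. It assumes one hash per line and ignores blank lines; `layer_filenames` is a hypothetical helper, not a Git function.

```python
def layer_filenames(chain_text):
    """Map the chain file's contents to layer file names, oldest layer first."""
    hashes = [line.strip() for line in chain_text.splitlines() if line.strip()]
    return [f"multi-pack-index-{h}.midx" for h in hashes]
```

The returned names are relative to the `multi-pack-index.d` directory shown in the layout above.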
| 144 | + |
| 145 | +When both an incremental and a non-incremental MIDX are present, the |
| 146 | +non-incremental MIDX is always read first. |
| 147 | + |
| 148 | +=== Object positions for incremental MIDXs |
| 149 | + |
| 150 | +In the original multi-pack-index design, we refer to objects via their |
| 151 | +lexicographic position (by object IDs) within the repository's singular |
| 152 | +multi-pack-index. In the incremental multi-pack-index design, we refer |
| 153 | +to objects via their index into a concatenated lexicographic ordering |
| 154 | +among each component in the MIDX chain. |
| 155 | + |
| 156 | +If `objects_nr()` is a function that returns the number of objects in a |
| 157 | +given MIDX layer, then the index of an object at lexicographic position |
| 158 | +`i` within, say, $H3 is defined as: |
| 159 | + |
| 160 | +---- |
| 161 | +objects_nr($H2) + objects_nr($H1) + i |
| 162 | +---- |
| 163 | + |
| 164 | +(in the C implementation, this is often computed as `i + |
| 165 | +m->num_objects_in_base`). |
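The arithmetic above can be sketched with a list of per-layer object counts, oldest layer first. The names `global_position` and `locate` are hypothetical. The local variable `num_objects_in_base` plays the role of the C field of the same name, on the assumption that the field holds the total object count of all earlier layers.

```python
def global_position(layer_sizes, layer, i):
    """Global index of the object at lexicographic position i in `layer`."""
    if not 0 <= i < layer_sizes[layer]:
        raise IndexError("position out of range for this layer")
    # Objects in all earlier layers come first in the concatenated order.
    num_objects_in_base = sum(layer_sizes[:layer])
    return num_objects_in_base + i


def locate(layer_sizes, pos):
    """Inverse mapping: global index -> (layer, position within that layer)."""
    for layer, count in enumerate(layer_sizes):
        if pos < count:
            return layer, pos
        pos -= count
    raise IndexError("position beyond the end of the chain")
```

For a chain whose layers hold 100, 20, and 5 objects, position 3 in the third layer maps to global index 100 + 20 + 3 = 123.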
| 166 | + |
| 167 | +=== Pseudo-pack order for incremental MIDXs |
| 168 | + |
| 169 | +The original implementation of multi-pack reachability bitmaps defined |
| 170 | +the pseudo-pack order in linkgit:gitformat-pack[5] (see the section |
| 171 | +titled "multi-pack-index reverse indexes") roughly as follows: |
| 172 | + |
| 173 | +____ |
| 174 | +In short, a MIDX's pseudo-pack is the de-duplicated concatenation of |
| 175 | +objects in packs stored by the MIDX, laid out in pack order, and the |
| 176 | +packs arranged in MIDX order (with the preferred pack coming first). |
| 177 | +____ |
| 178 | + |
| 179 | +In the incremental MIDX design, we extend this definition to include |
| 180 | +objects from multiple layers of the MIDX chain. The pseudo-pack order |
| 181 | +for incremental MIDXs is determined by concatenating the pseudo-pack |
| 182 | +ordering for each layer of the MIDX chain in order. Formally, two objects |
| 183 | +`o1` and `o2` are compared as follows: |
| 184 | + |
| 185 | +1. If `o1` appears in an earlier layer of the MIDX chain than `o2`, then |
| 186 | + `o1` sorts ahead of `o2`. |
| 187 | + |
| 188 | +2. Otherwise, if `o1` and `o2` appear in the same MIDX layer, and that |
| 189 | + MIDX layer has no base, then if one of `pack(o1)` and `pack(o2)` is |
| 190 | + preferred and the other is not, then the preferred one sorts ahead of |
| 191 | + the non-preferred one. If there is a base layer (i.e. the MIDX layer |
| 192 | + is not the first layer in the chain), then if `pack(o1)` appears |
| 193 | + earlier in that MIDX layer's pack order, then `o1` sorts ahead of |
| 194 | + `o2`. Likewise if `pack(o2)` appears earlier, then the opposite is |
| 195 | + true. |
| 196 | + |
| 197 | +3. Otherwise, `o1` and `o2` appear in the same pack, and thus in the |
| 198 | + same MIDX layer. Sort `o1` and `o2` by their offset within their |
| 199 | + containing packfile. |
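The three comparison rules above can be approximated with a sort key. This is a sketch under stated assumptions: `Obj`, `pseudo_pack_key`, and `pseudo_pack_order` are hypothetical names, `pack` is the pack's position in its layer's pack order, and the preferred pack is assumed to live in the first layer (layer 0). Ordering non-preferred packs in the first layer by pack position is also an assumption, since the rules above do not spell that case out.

```python
from typing import NamedTuple


class Obj(NamedTuple):
    name: str    # label used only for this illustration
    layer: int   # 0 is the first (base-less) layer of the chain
    pack: int    # position of the containing pack in its layer's pack order
    offset: int  # offset of the object within its pack


def pseudo_pack_key(o, preferred_pack=None):
    # Rule 1: earlier layers sort first.
    # Rule 2: in the first layer the preferred pack sorts ahead of the
    #         others; in all layers packs then sort by pack order.
    # Rule 3: within a pack, sort by offset.
    not_preferred = 0 if (o.layer == 0 and o.pack == preferred_pack) else 1
    return (o.layer, not_preferred, o.pack, o.offset)


def pseudo_pack_order(objs, preferred_pack=None):
    return sorted(objs, key=lambda o: pseudo_pack_key(o, preferred_pack))
```

Because the layer number is the first key component, appending a new layer never reorders objects from earlier layers.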
| 200 | + |
| 201 | +Note that the preferred pack is a property of the MIDX chain, not the |
| 202 | +individual layers themselves. Fundamentally we could introduce a |
| 203 | +per-layer preferred pack, but this is less relevant now that we can |
| 204 | +perform multi-pack reuse across the set of packs in a MIDX. |
| 205 | + |
| 206 | +=== Reachability bitmaps and incremental MIDXs |
| 207 | + |
| 208 | +Each layer of an incremental MIDX chain may have its objects (and the |
| 209 | +objects from any previous layer in the same MIDX chain) represented in |
| 210 | +its own `*.bitmap` file. |
| 211 | + |
| 212 | +The structure of a `*.bitmap` file belonging to an incremental MIDX |
| 213 | +chain is identical to that of a non-incremental MIDX bitmap, or a |
| 214 | +classic single-pack bitmap. Since objects are added to the end of the |
| 215 | +incremental MIDX's pseudo-pack order (see above), it is possible to |
| 216 | +extend a bitmap when appending to the end of a MIDX chain. |
| 217 | + |
| 218 | +(Note: it is likewise possible to compress a contiguous sequence of |
| 219 | +incremental MIDX layers and their `*.bitmap` files into a single layer and |
| 220 | +`*.bitmap`, but this is not yet implemented.) |
| 221 | + |
| 222 | +The object positions used are global within the pseudo-pack order, so |
| 223 | +subsequent layers will have, for example, `m->num_objects_in_base` |
| 224 | +number of `0` bits in each of their four type bitmaps. This follows from |
| 225 | +the fact that we only write type bitmap entries for objects present in |
| 226 | +the layer immediately corresponding to the bitmap. |
| 227 | + |
| 228 | +Note also that only the bitmap pertaining to the most recent layer in an |
| 229 | +incremental MIDX chain is used to store reachability information about |
| 230 | +the interesting and uninteresting objects in a reachability query. |
| 231 | +Earlier bitmap layers are only used to look up commit and pseudo-merge |
| 232 | +bitmaps from that layer, as well as the type-level bitmaps for objects |
| 233 | +in that layer. |
| 234 | + |
| 235 | +To simplify the implementation, type-level bitmaps are iterated |
| 236 | +simultaneously, and their results are OR'd together to avoid recursively |
| 237 | +calling internal bitmap functions. |
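The effect of global positions on per-layer type bitmaps, and of OR-ing them together, can be illustrated with plain Python integers used as bitsets. This is only a toy model: the real bitmaps are compressed on disk, and `layer_type_bitmap` and `combined_type_bitmap` are hypothetical names.

```python
def layer_type_bitmap(num_objects_in_base, local_flags):
    """Build one layer's type bitmap in global positions.

    local_flags[i] is truthy when the layer's i-th object (in pseudo-pack
    order) has the type in question. The low `num_objects_in_base` bits
    stay zero because those objects belong to earlier layers.
    """
    bits = 0
    for i, flag in enumerate(local_flags):
        if flag:
            bits |= 1 << (num_objects_in_base + i)
    return bits


def combined_type_bitmap(layers):
    """OR together the type bitmaps of every layer, oldest layer first."""
    combined, base = 0, 0
    for flags in layers:
        combined |= layer_type_bitmap(base, flags)
        base += len(flags)
    return combined
```

Since each layer only sets bits for its own objects, the per-layer bitmaps never overlap, and their OR covers the whole chain.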
| 238 | + |
| 239 | +Future Work |
| 240 | +----------- |
| 241 | + |
| 242 | +- If the multi-pack-index is extended to store a "stable object order" |
| 243 | + (a function Order(hash) = integer that is constant for a given hash, |
| 244 | + even as the multi-pack-index is updated) then MIDX bitmaps could be |
| 245 | + updated independently of the MIDX. |
| 246 | + |
| 247 | +- Packfiles can be marked as "special" using empty files that share |
| 248 | + the initial name but replace ".pack" with ".keep" or ".promisor". |
| 249 | + We can add an optional chunk of data to the multi-pack-index that |
| 250 | + records flags of information about the packfiles. This allows new |
| 251 | + states, such as 'repacked' or 'redeltified', that can help with |
| 252 | + pack maintenance in a multi-pack environment. It may also be |
| 253 | + helpful to organize packfiles by object type (commit, tree, blob, |
| 254 | + etc.) and use this metadata to help that maintenance. |
| 255 | + |
| 256 | +Related Links |
| 257 | +------------- |
| 258 | +[0] https://bugs.chromium.org/p/git/issues/detail?id=6 |
| 259 | + Chromium work item for: Multi-Pack Index (MIDX) |
| 260 | + |
| 261 | +[1] https://lore.kernel.org/git/20180107181459.222909-1-dstolee@microsoft.com/ |
| 262 | + An earlier RFC for the multi-pack-index feature |
| 263 | + |
| 264 | +[2] https://lore.kernel.org/git/alpine.DEB.2.20.1803091557510.23109@alexmv-linux/ |
| 265 | + Git Merge 2018 Contributor's summit notes (includes discussion of MIDX) |