diff options
Diffstat (limited to 'Documentation/technical')
-rw-r--r-- | Documentation/technical/multi-pack-index.txt | 109 | ||||
-rw-r--r-- | Documentation/technical/pack-format.txt | 77 | ||||
-rw-r--r-- | Documentation/technical/rerere.txt | 186 |
3 files changed, 372 insertions, 0 deletions
diff --git a/Documentation/technical/multi-pack-index.txt b/Documentation/technical/multi-pack-index.txt new file mode 100644 index 0000000000..d7e57639f7 --- /dev/null +++ b/Documentation/technical/multi-pack-index.txt @@ -0,0 +1,109 @@ +Multi-Pack-Index (MIDX) Design Notes +==================================== + +The Git object directory contains a 'pack' directory containing +packfiles (with suffix ".pack") and pack-indexes (with suffix +".idx"). The pack-indexes provide a way to lookup objects and +navigate to their offset within the pack, but these must come +in pairs with the packfiles. This pairing depends on the file +names, as the pack-index differs only in suffix with its pack- +file. While the pack-indexes provide fast lookup per packfile, +this performance degrades as the number of packfiles increases, +because abbreviations need to inspect every packfile and we are +more likely to have a miss on our most-recently-used packfile. +For some large repositories, repacking into a single packfile +is not feasible due to storage space or excessive repack times. + +The multi-pack-index (MIDX for short) stores a list of objects +and their offsets into multiple packfiles. It contains: + +- A list of packfile names. +- A sorted list of object IDs. +- A list of metadata for the ith object ID including: + - A value j referring to the jth packfile. + - An offset within the jth packfile for the object. +- If large offsets are required, we use another list of large + offsets similar to version 2 pack-indexes. + +Thus, we can provide O(log N) lookup time for any number +of packfiles. + +Design Details +-------------- + +- The MIDX is stored in a file named 'multi-pack-index' in the + .git/objects/pack directory. This could be stored in the pack + directory of an alternate. It refers only to packfiles in that + same directory. + +- The pack.multiIndex config setting must be on to consume MIDX files. + +- The file format includes parameters for the object ID hash + function, so a future change of hash algorithm does not require + a change in format. + +- The MIDX keeps only one record per object ID. If an object appears + in multiple packfiles, then the MIDX selects the copy in the most- + recently modified packfile. + +- If there exist packfiles in the pack directory not registered in + the MIDX, then those packfiles are loaded into the `packed_git` + list and `packed_git_mru` cache. + +- The pack-indexes (.idx files) remain in the pack directory so we + can delete the MIDX file, set core.midx to false, or downgrade + without any loss of information. + +- The MIDX file format uses a chunk-based approach (similar to the + commit-graph file) that allows optional data to be added. + +Future Work +----------- + +- Add a 'verify' subcommand to the 'git midx' builtin to verify the + contents of the multi-pack-index file match the offsets listed in + the corresponding pack-indexes. + +- The multi-pack-index allows many packfiles, especially in a context + where repacking is expensive (such as a very large repo), or + unexpected maintenance time is unacceptable (such as a high-demand + build machine). However, the multi-pack-index needs to be rewritten + in full every time. We can extend the format to be incremental, so + writes are fast. By storing a small "tip" multi-pack-index that + points to large "base" MIDX files, we can keep writes fast while + still reducing the number of binary searches required for object + lookups. + +- The reachability bitmap is currently paired directly with a single + packfile, using the pack-order as the object order to hopefully + compress the bitmaps well using run-length encoding. This could be + extended to pair a reachability bitmap with a multi-pack-index. If + the multi-pack-index is extended to store a "stable object order" + (a function Order(hash) = integer that is constant for a given hash, + even as the multi-pack-index is updated) then a reachability bitmap + could point to a multi-pack-index and be updated independently. + +- Packfiles can be marked as "special" using empty files that share + the initial name but replace ".pack" with ".keep" or ".promisor". + We can add an optional chunk of data to the multi-pack-index that + records flags of information about the packfiles. This allows new + states, such as 'repacked' or 'redeltified', that can help with + pack maintenance in a multi-pack environment. It may also be + helpful to organize packfiles by object type (commit, tree, blob, + etc.) and use this metadata to help that maintenance. + +- The partial clone feature records special "promisor" packs that + may point to objects that are not stored locally, but available + on request to a server. The multi-pack-index does not currently + track these promisor packs. + +Related Links +------------- +[0] https://bugs.chromium.org/p/git/issues/detail?id=6 + Chromium work item for: Multi-Pack Index (MIDX) + +[1] https://public-inbox.org/git/20180107181459.222909-1-dstolee@microsoft.com/ + An earlier RFC for the multi-pack-index feature + +[2] https://public-inbox.org/git/alpine.DEB.2.20.1803091557510.23109@alexmv-linux/ + Git Merge 2018 Contributor's summit notes (includes discussion of MIDX) diff --git a/Documentation/technical/pack-format.txt b/Documentation/technical/pack-format.txt index 70a99fd142..cab5bdd2ff 100644 --- a/Documentation/technical/pack-format.txt +++ b/Documentation/technical/pack-format.txt @@ -252,3 +252,80 @@ Pack file entry: <+ corresponding packfile. 20-byte SHA-1-checksum of all of the above. + +== multi-pack-index (MIDX) files have the following format: + +The multi-pack-index files refer to multiple pack-files and loose objects. + +In order to allow extensions that add extra data to the MIDX, we organize +the body into "chunks" and provide a lookup table at the beginning of the +body. The header includes certain length values, such as the number of packs, +the number of base MIDX files, hash lengths and types. + +All 4-byte numbers are in network order. + +HEADER: + + 4-byte signature: + The signature is: {'M', 'I', 'D', 'X'} + + 1-byte version number: + Git only writes or recognizes version 1. + + 1-byte Object Id Version + Git only writes or recognizes version 1 (SHA1). + + 1-byte number of "chunks" + + 1-byte number of base multi-pack-index files: + This value is currently always zero. + + 4-byte number of pack files + +CHUNK LOOKUP: + + (C + 1) * 12 bytes providing the chunk offsets: + First 4 bytes describe chunk id. Value 0 is a terminating label. + Other 8 bytes provide offset in current file for chunk to start. + (Chunks are provided in file-order, so you can infer the length + using the next chunk position if necessary.) + + The remaining data in the body is described one chunk at a time, and + these chunks may be given in any order. Chunks are required unless + otherwise specified. + +CHUNK DATA: + + Packfile Names (ID: {'P', 'N', 'A', 'M'}) + Stores the packfile names as concatenated, null-terminated strings. + Packfiles must be listed in lexicographic order for fast lookups by + name. This is the only chunk not guaranteed to be a multiple of four + bytes in length, so should be the last chunk for alignment reasons. + + OID Fanout (ID: {'O', 'I', 'D', 'F'}) + The ith entry, F[i], stores the number of OIDs with first + byte at most i. Thus F[255] stores the total + number of objects. + + OID Lookup (ID: {'O', 'I', 'D', 'L'}) + The OIDs for all objects in the MIDX are stored in lexicographic + order in this chunk. + + Object Offsets (ID: {'O', 'O', 'F', 'F'}) + Stores two 4-byte values for every object. + 1: The pack-int-id for the pack storing this object. + 2: The offset within the pack. + If all offsets are less than 2^31, then the large offset chunk + will not exist and offsets are stored as in IDX v1. + If there is at least one offset value larger than 2^32-1, then + the large offset chunk must exist. If the large offset chunk + exists and the 31st bit is on, then removing that bit reveals + the row in the large offsets containing the 8-byte offset of + this object. + + [Optional] Object Large Offsets (ID: {'L', 'O', 'F', 'F'}) + 8-byte offsets into large packfiles. + +TRAILER: + + 20-byte SHA1-checksum of the above contents. diff --git a/Documentation/technical/rerere.txt b/Documentation/technical/rerere.txt new file mode 100644 index 0000000000..aa22d7ace8 --- /dev/null +++ b/Documentation/technical/rerere.txt @@ -0,0 +1,186 @@ +Rerere +====== + +This document describes the rerere logic. + +Conflict normalization +---------------------- + +To ensure recorded conflict resolutions can be looked up in the rerere +database, even when branches are merged in a different order, +different branches are merged that result in the same conflict, or +when different conflict style settings are used, rerere normalizes the +conflicts before writing them to the rerere database. + +Different conflict styles and branch names are normalized by stripping +the labels from the conflict markers, and removing the common ancestor +version from the `diff3` conflict style. Branches that are merged +in different order are normalized by sorting the conflict hunks. More +on each of those steps in the following sections. + +Once these two normalization operations are applied, a conflict ID is +calculated based on the normalized conflict, which is later used by +rerere to look up the conflict in the rerere database. + +Removing the common ancestor version +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +Say we have three branches AB, AC and AC2. The common ancestor of +these branches has a file with a line containing the string "A" (for +brevity this is called "line A" in the rest of the document). In +branch AB this line is changed to "B", in AC, this line is changed to +"C", and branch AC2 is forked off of AC, after the line was changed to +"C". + +Forking a branch ABAC off of branch AB and then merging AC into it, we +get a conflict like the following: + + <<<<<<< HEAD + B + ======= + C + >>>>>>> AC + +Doing the analogous with AC2 (forking a branch ABAC2 off of branch AB +and then merging branch AC2 into it), using the diff3 conflict style, +we get a conflict like the following: + + <<<<<<< HEAD + B + ||||||| merged common ancestors + A + ======= + C + >>>>>>> AC2 + +By resolving this conflict, to leave line D, the user declares: + + After examining what branches AB and AC did, I believe that making + line A into line D is the best thing to do that is compatible with + what AB and AC wanted to do. + +As branch AC2 refers to the same commit as AC, the above implies that +this is also compatible what AB and AC2 wanted to do. + +By extension, this means that rerere should recognize that the above +conflicts are the same. To do this, the labels on the conflict +markers are stripped, and the common ancestor version is removed. The above +examples would both result in the following normalized conflict: + + <<<<<<< + B + ======= + C + >>>>>>> + +Sorting hunks +~~~~~~~~~~~~~ + +As before, lets imagine that a common ancestor had a file with line A +its early part, and line X in its late part. And then four branches +are forked that do these things: + + - AB: changes A to B + - AC: changes A to C + - XY: changes X to Y + - XZ: changes X to Z + +Now, forking a branch ABAC off of branch AB and then merging AC into +it, and forking a branch ACAB off of branch AC and then merging AB +into it, would yield the conflict in a different order. The former +would say "A became B or C, what now?" while the latter would say "A +became C or B, what now?" + +As a reminder, the act of merging AC into ABAC and resolving the +conflict to leave line D means that the user declares: + + After examining what branches AB and AC did, I believe that + making line A into line D is the best thing to do that is + compatible with what AB and AC wanted to do. + +So the conflict we would see when merging AB into ACAB should be +resolved the same way---it is the resolution that is in line with that +declaration. + +Imagine that similarly previously a branch XYXZ was forked from XY, +and XZ was merged into it, and resolved "X became Y or Z" into "X +became W". + +Now, if a branch ABXY was forked from AB and then merged XY, then ABXY +would have line B in its early part and line Y in its later part. +Such a merge would be quite clean. We can construct 4 combinations +using these four branches ((AB, AC) x (XY, XZ)). + +Merging ABXY and ACXZ would make "an early A became B or C, a late X +became Y or Z" conflict, while merging ACXY and ABXZ would make "an +early A became C or B, a late X became Y or Z". We can see there are +4 combinations of ("B or C", "C or B") x ("X or Y", "Y or X"). + +By sorting, the conflict is given its canonical name, namely, "an +early part became B or C, a late part becames X or Y", and whenever +any of these four patterns appear, and we can get to the same conflict +and resolution that we saw earlier. + +Without the sorting, we'd have to somehow find a previous resolution +from combinatorial explosion. + +Conflict ID calculation +~~~~~~~~~~~~~~~~~~~~~~~ + +Once the conflict normalization is done, the conflict ID is calculated +as the sha1 hash of the conflict hunks appended to each other, +separated by <NUL> characters. The conflict markers are stripped out +before the sha1 is calculated. So in the example above, where we +merge branch AC which changes line A to line C, into branch AB, which +changes line A to line C, the conflict ID would be +SHA1('B<NUL>C<NUL>'). + +If there are multiple conflicts in one file, the sha1 is calculated +the same way with all hunks appended to each other, in the order in +which they appear in the file, separated by a <NUL> character. + +Nested conflicts +~~~~~~~~~~~~~~~~ + +Nested conflicts are handled very similarly to "simple" conflicts. +Similar to simple conflicts, the conflict is first normalized by +stripping the labels from conflict markers, stripping the common ancestor +version, and the sorting the conflict hunks, both for the outer and the +inner conflict. This is done recursively, so any number of nested +conflicts can be handled. + +Note that this only works for conflict markers that "cleanly nest". If +there are any unmatched conflict markers, rerere will fail to handle +the conflict and record a conflict resolution. + +The only difference is in how the conflict ID is calculated. For the +inner conflict, the conflict markers themselves are not stripped out +before calculating the sha1. + +Say we have the following conflict for example: + + <<<<<<< HEAD + 1 + ======= + <<<<<<< HEAD + 3 + ======= + 2 + >>>>>>> branch-2 + >>>>>>> branch-3~ + +After stripping out the labels of the conflict markers, and sorting +the hunks, the conflict would look as follows: + + <<<<<<< + 1 + ======= + <<<<<<< + 2 + ======= + 3 + >>>>>>> + >>>>>>> + +and finally the conflict ID would be calculated as: +`sha1('1<NUL><<<<<<<\n3\n=======\n2\n>>>>>>><NUL>')` |