Diffstat (limited to 'Documentation/technical')
-rw-r--r-- Documentation/technical/multi-pack-index.txt 109
-rw-r--r-- Documentation/technical/pack-format.txt 77
-rw-r--r-- Documentation/technical/rerere.txt 186
3 files changed, 372 insertions, 0 deletions
diff --git a/Documentation/technical/multi-pack-index.txt b/Documentation/technical/multi-pack-index.txt
new file mode 100644
index 0000000000..d7e57639f7
--- /dev/null
+++ b/Documentation/technical/multi-pack-index.txt
@@ -0,0 +1,109 @@
+Multi-Pack-Index (MIDX) Design Notes
+====================================
+
+The Git object directory contains a 'pack' directory containing
+packfiles (with suffix ".pack") and pack-indexes (with suffix
+".idx"). The pack-indexes provide a way to look up objects and
+navigate to their offset within the pack, but these must come
+in pairs with the packfiles. This pairing depends on the file
+names, as the pack-index differs only in suffix with its pack-
+file. While the pack-indexes provide fast lookup per packfile,
+this performance degrades as the number of packfiles increases,
+because abbreviations need to inspect every packfile and we are
+more likely to have a miss on our most-recently-used packfile.
+For some large repositories, repacking into a single packfile
+is not feasible due to storage space or excessive repack times.
+
+The multi-pack-index (MIDX for short) stores a list of objects
+and their offsets into multiple packfiles. It contains:
+
+- A list of packfile names.
+- A sorted list of object IDs.
+- A list of metadata for the ith object ID including:
+ - A value j referring to the jth packfile.
+ - An offset within the jth packfile for the object.
+- If large offsets are required, we use another list of large
+ offsets similar to version 2 pack-indexes.
+
+Thus, we can provide O(log N) lookup time for any number
+of packfiles.
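
The lookup described above can be sketched in Python as follows. This is an illustrative model and not part of the patch: the names `midx_lookup`, `oids`, `meta`, and `pack_names` are hypothetical, and plain in-memory lists stand in for the on-disk format.

```python
from bisect import bisect_left

def midx_lookup(oids, meta, pack_names, oid):
    """Return (pack_name, offset) for oid, or None if it is not indexed.

    oids       -- sorted list of object IDs (bytes)
    meta       -- meta[i] = (j, offset) for oids[i], where j indexes pack_names
    pack_names -- list of packfile names
    """
    i = bisect_left(oids, oid)          # one binary search over all objects
    if i == len(oids) or oids[i] != oid:
        return None
    pack_id, offset = meta[i]
    return pack_names[pack_id], offset

# Hypothetical sample data: three objects spread over two packfiles.
pack_names = ["pack-a.pack", "pack-b.pack"]
oids = sorted(bytes.fromhex(h * 20) for h in ("0a", "7f", "c3"))
meta = [(0, 12), (1, 4096), (0, 300)]
```

A single binary search over the combined list replaces one search per pack-index, which is why the cost does not grow with the number of packfiles.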
+
+Design Details
+--------------
+
+- The MIDX is stored in a file named 'multi-pack-index' in the
+ .git/objects/pack directory. This could be stored in the pack
+ directory of an alternate. It refers only to packfiles in that
+ same directory.
+
+- The pack.multiIndex config setting must be on to consume MIDX files.
+
+- The file format includes parameters for the object ID hash
+ function, so a future change of hash algorithm does not require
+ a change in format.
+
+- The MIDX keeps only one record per object ID. If an object appears
+ in multiple packfiles, then the MIDX selects the copy in the most-
+ recently modified packfile.
+
+- If there exist packfiles in the pack directory not registered in
+ the MIDX, then those packfiles are loaded into the `packed_git`
+ list and `packed_git_mru` cache.
+
+- The pack-indexes (.idx files) remain in the pack directory so we
+ can delete the MIDX file, set core.midx to false, or downgrade
+ without any loss of information.
+
+- The MIDX file format uses a chunk-based approach (similar to the
+ commit-graph file) that allows optional data to be added.
+
+Future Work
+-----------
+
+- Add a 'verify' subcommand to the 'git midx' builtin to verify the
+ contents of the multi-pack-index file match the offsets listed in
+ the corresponding pack-indexes.
+
+- The multi-pack-index allows many packfiles, especially in a context
+ where repacking is expensive (such as a very large repo), or
+ unexpected maintenance time is unacceptable (such as a high-demand
+ build machine). However, the multi-pack-index needs to be rewritten
+ in full every time. We can extend the format to be incremental, so
+ writes are fast. By storing a small "tip" multi-pack-index that
+ points to large "base" MIDX files, we can keep writes fast while
+ still reducing the number of binary searches required for object
+ lookups.
+
+- The reachability bitmap is currently paired directly with a single
+ packfile, using the pack-order as the object order to hopefully
+ compress the bitmaps well using run-length encoding. This could be
+ extended to pair a reachability bitmap with a multi-pack-index. If
+ the multi-pack-index is extended to store a "stable object order"
+ (a function Order(hash) = integer that is constant for a given hash,
+ even as the multi-pack-index is updated) then a reachability bitmap
+ could point to a multi-pack-index and be updated independently.
+
+- Packfiles can be marked as "special" using empty files that share
+ the initial name but replace ".pack" with ".keep" or ".promisor".
+ We can add an optional chunk of data to the multi-pack-index that
+ records flags of information about the packfiles. This allows new
+ states, such as 'repacked' or 'redeltified', that can help with
+ pack maintenance in a multi-pack environment. It may also be
+ helpful to organize packfiles by object type (commit, tree, blob,
+ etc.) and use this metadata to help that maintenance.
+
+- The partial clone feature records special "promisor" packs that
+ may point to objects that are not stored locally, but available
+ on request to a server. The multi-pack-index does not currently
+ track these promisor packs.
+
+Related Links
+-------------
+[0] https://bugs.chromium.org/p/git/issues/detail?id=6
+ Chromium work item for: Multi-Pack Index (MIDX)
+
+[1] https://public-inbox.org/git/20180107181459.222909-1-dstolee@microsoft.com/
+ An earlier RFC for the multi-pack-index feature
+
+[2] https://public-inbox.org/git/alpine.DEB.2.20.1803091557510.23109@alexmv-linux/
+ Git Merge 2018 Contributor's summit notes (includes discussion of MIDX)
diff --git a/Documentation/technical/pack-format.txt b/Documentation/technical/pack-format.txt
index 70a99fd142..cab5bdd2ff 100644
--- a/Documentation/technical/pack-format.txt
+++ b/Documentation/technical/pack-format.txt
@@ -252,3 +252,80 @@ Pack file entry: <+
corresponding packfile.
20-byte SHA-1-checksum of all of the above.
+
+== multi-pack-index (MIDX) files have the following format:
+
+The multi-pack-index files refer to multiple pack-files and loose objects.
+
+In order to allow extensions that add extra data to the MIDX, we organize
+the body into "chunks" and provide a lookup table at the beginning of the
+body. The header includes certain length values, such as the number of packs,
+the number of base MIDX files, hash lengths and types.
+
+All 4-byte numbers are in network order.
+
+HEADER:
+
+ 4-byte signature:
+ The signature is: {'M', 'I', 'D', 'X'}
+
+ 1-byte version number:
+ Git only writes or recognizes version 1.
+
+ 1-byte Object Id Version
+ Git only writes or recognizes version 1 (SHA1).
+
+ 1-byte number of "chunks"
+
+ 1-byte number of base multi-pack-index files:
+ This value is currently always zero.
+
+ 4-byte number of pack files
+
+CHUNK LOOKUP:
+
+ (C + 1) * 12 bytes providing the chunk offsets, where C is the
+ number of chunks given in the header:
+ First 4 bytes describe chunk id. Value 0 is a terminating label.
+ Other 8 bytes provide offset in current file for chunk to start.
+ (Chunks are provided in file-order, so you can infer the length
+ using the next chunk position if necessary.)
+
+ The remaining data in the body is described one chunk at a time, and
+ these chunks may be given in any order. Chunks are required unless
+ otherwise specified.
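
As a rough illustration of the HEADER and CHUNK LOOKUP layout above, the following Python sketch parses both with the standard `struct` module. It is not part of the patch. The function names are hypothetical, and two points are assumptions: that the 8-byte chunk offsets are big-endian like the 4-byte fields, and that the offset in the terminating row marks where the last chunk ends.

```python
import struct

# Header: signature, version, OID version, chunk count, base-file count,
# pack count (12 bytes in total).
HEADER = struct.Struct(">4sBBBBI")
# One chunk-lookup row: 4-byte chunk id followed by an 8-byte file offset.
# Assumption: the 8-byte offset is big-endian like the 4-byte fields.
CHUNK_ROW = struct.Struct(">4sQ")
TERMINATOR = b"\0\0\0\0"

def parse_midx_header(data):
    """Return (num_packs, {chunk_id: (start, length)}) for MIDX bytes."""
    sig, version, oid_version, num_chunks, _num_base, num_packs = \
        HEADER.unpack_from(data, 0)
    if sig != b"MIDX":
        raise ValueError("not a multi-pack-index file")
    if version != 1 or oid_version != 1:
        raise ValueError("unsupported version")
    # Read C + 1 rows; the last one must carry the terminating id 0.
    rows = [CHUNK_ROW.unpack_from(data, HEADER.size + i * CHUNK_ROW.size)
            for i in range(num_chunks + 1)]
    if rows[-1][0] != TERMINATOR:
        raise ValueError("chunk lookup table is not terminated")
    # Rows are in file order, so each length is the gap to the next row.
    chunks = {}
    for (chunk_id, start), (_next_id, end) in zip(rows, rows[1:]):
        chunks[chunk_id] = (start, end - start)
    return num_packs, chunks

def build_sample():
    """Build a header and lookup table only (no chunk bodies) for testing."""
    header = HEADER.pack(b"MIDX", 1, 1, 2, 0, 1)
    table = b"".join([
        CHUNK_ROW.pack(b"OIDF", 48),      # 12-byte header + 3 * 12-byte rows
        CHUNK_ROW.pack(b"PNAM", 1072),    # after a 1024-byte fanout chunk
        CHUNK_ROW.pack(TERMINATOR, 1084),
    ])
    return header + table
```

Calling `parse_midx_header(build_sample())` yields the pack count and a map from chunk id to its start and inferred length.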
+
+CHUNK DATA:
+
+ Packfile Names (ID: {'P', 'N', 'A', 'M'})
+ Stores the packfile names as concatenated, null-terminated strings.
+ Packfiles must be listed in lexicographic order for fast lookups by
+ name. This is the only chunk not guaranteed to be a multiple of four
+ bytes in length, so should be the last chunk for alignment reasons.
+
+ OID Fanout (ID: {'O', 'I', 'D', 'F'})
+ The ith entry, F[i], stores the number of OIDs with first
+ byte at most i. Thus F[255] stores the total
+ number of objects.
+
+ OID Lookup (ID: {'O', 'I', 'D', 'L'})
+ The OIDs for all objects in the MIDX are stored in lexicographic
+ order in this chunk.
+
+ Object Offsets (ID: {'O', 'O', 'F', 'F'})
+ Stores two 4-byte values for every object.
+ 1: The pack-int-id for the pack storing this object.
+ 2: The offset within the pack.
+ If all offsets are less than 2^32, then the large offset chunk
+ will not exist and offsets are stored as in IDX v1.
+ If there is at least one offset value larger than 2^32-1, then
+ the large offset chunk must exist. If the large offset chunk
+ exists and the 31st bit is on, then removing that bit reveals
+ the row in the large offsets containing the 8-byte offset of
+ this object.
+
+ [Optional] Object Large Offsets (ID: {'L', 'O', 'F', 'F'})
+ 8-byte offsets into large packfiles.
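
The fanout, lookup, and offset chunks described above fit together roughly as in the Python sketch below. It is not part of the patch. All function names are hypothetical, the OID list is kept as a Python list instead of raw chunk bytes, and the "31st bit" is assumed to be the most significant bit of the 4-byte offset, as in version 2 pack-indexes.

```python
import struct

# Assumption: the "31st bit" is the most significant bit of the 4-byte value.
LARGE_OFFSET_FLAG = 0x80000000

def build_fanout(sorted_oids):
    """F[i] = number of OIDs whose first byte is at most i."""
    counts = [0] * 256
    for oid in sorted_oids:
        counts[oid[0]] += 1
    fanout, total = [], 0
    for count in counts:
        total += count
        fanout.append(total)
    return fanout

def find_oid(fanout, sorted_oids, oid):
    """Binary-search only the OIDs that share oid's first byte."""
    first = oid[0]
    lo = fanout[first - 1] if first > 0 else 0
    end = hi = fanout[first]
    while lo < hi:
        mid = (lo + hi) // 2
        if sorted_oids[mid] < oid:
            lo = mid + 1
        else:
            hi = mid
    if lo < end and sorted_oids[lo] == oid:
        return lo
    return None

def decode_entry(ooff, loff, i):
    """Return (pack_int_id, offset) for the i-th object.

    ooff -- bytes of the Object Offsets chunk (two 4-byte values per object)
    loff -- bytes of the Object Large Offsets chunk, or None if absent
    """
    pack_id, raw = struct.unpack_from(">II", ooff, 8 * i)
    if loff is not None and raw & LARGE_OFFSET_FLAG:
        row = raw & 0x7FFFFFFF            # clear the flag to get the row
        (offset,) = struct.unpack_from(">Q", loff, 8 * row)
        return pack_id, offset
    return pack_id, raw
```

The fanout table only narrows the search range; the binary search inside that range is otherwise the same as a search over the whole OID list.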
+
+TRAILER:
+
+ 20-byte SHA1-checksum of the above contents.
diff --git a/Documentation/technical/rerere.txt b/Documentation/technical/rerere.txt
new file mode 100644
index 0000000000..aa22d7ace8
--- /dev/null
+++ b/Documentation/technical/rerere.txt
@@ -0,0 +1,186 @@
+Rerere
+======
+
+This document describes the rerere logic.
+
+Conflict normalization
+----------------------
+
+To ensure recorded conflict resolutions can be looked up in the rerere
+database, even when branches are merged in a different order,
+different branches are merged that result in the same conflict, or
+when different conflict style settings are used, rerere normalizes the
+conflicts before writing them to the rerere database.
+
+Different conflict styles and branch names are normalized by stripping
+the labels from the conflict markers, and removing the common ancestor
+version from the `diff3` conflict style. Branches that are merged
+in different order are normalized by sorting the conflict hunks. More
+on each of those steps in the following sections.
+
+Once these two normalization operations are applied, a conflict ID is
+calculated based on the normalized conflict, which is later used by
+rerere to look up the conflict in the rerere database.
+
+Removing the common ancestor version
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Say we have three branches AB, AC and AC2. The common ancestor of
+these branches has a file with a line containing the string "A" (for
+brevity this is called "line A" in the rest of the document). In
+branch AB this line is changed to "B", in AC, this line is changed to
+"C", and branch AC2 is forked off of AC, after the line was changed to
+"C".
+
+Forking a branch ABAC off of branch AB and then merging AC into it, we
+get a conflict like the following:
+
+ <<<<<<< HEAD
+ B
+ =======
+ C
+ >>>>>>> AC
+
+Doing the analogous with AC2 (forking a branch ABAC2 off of branch AB
+and then merging branch AC2 into it), using the diff3 conflict style,
+we get a conflict like the following:
+
+ <<<<<<< HEAD
+ B
+ ||||||| merged common ancestors
+ A
+ =======
+ C
+ >>>>>>> AC2
+
+By resolving this conflict, to leave line D, the user declares:
+
+ After examining what branches AB and AC did, I believe that making
+ line A into line D is the best thing to do that is compatible with
+ what AB and AC wanted to do.
+
+As branch AC2 refers to the same commit as AC, the above implies that
+this is also compatible with what AB and AC2 wanted to do.
+
+By extension, this means that rerere should recognize that the above
+conflicts are the same. To do this, the labels on the conflict
+markers are stripped, and the common ancestor version is removed. The above
+examples would both result in the following normalized conflict:
+
+ <<<<<<<
+ B
+ =======
+ C
+ >>>>>>>
+
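
The label stripping and common-ancestor removal described above can be sketched in Python as follows. This is a simplified illustration and not part of the patch: `normalize_conflict` is a hypothetical name, the conflict is passed as a list of lines, and nested conflicts are not handled.

```python
def normalize_conflict(lines, marker_size=7):
    """Strip marker labels and drop the diff3 common-ancestor part."""
    out = []
    in_ancestor = False
    for line in lines:
        if line.startswith("<" * marker_size):
            out.append("<" * marker_size)     # drop the label after the marker
        elif line.startswith("|" * marker_size):
            in_ancestor = True                # start skipping the ancestor text
        elif line.startswith("=" * marker_size):
            in_ancestor = False
            out.append("=" * marker_size)
        elif line.startswith(">" * marker_size):
            out.append(">" * marker_size)
        elif not in_ancestor:
            out.append(line)
    return out

# The two conflicts from the examples above.
merge_style = ["<<<<<<< HEAD", "B", "=======", "C", ">>>>>>> AC"]
diff3_style = ["<<<<<<< HEAD", "B", "||||||| merged common ancestors", "A",
               "=======", "C", ">>>>>>> AC2"]
```

Both sample inputs reduce to the same five lines, which is what lets a resolution recorded for one be found again for the other.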
+Sorting hunks
+~~~~~~~~~~~~~
+
+As before, let's imagine that a common ancestor had a file with line A
+in its early part, and line X in its late part. Then four branches
+are forked that do these things:
+
+ - AB: changes A to B
+ - AC: changes A to C
+ - XY: changes X to Y
+ - XZ: changes X to Z
+
+Now, forking a branch ABAC off of branch AB and then merging AC into
+it, and forking a branch ACAB off of branch AC and then merging AB
+into it, would yield the conflict in a different order. The former
+would say "A became B or C, what now?" while the latter would say "A
+became C or B, what now?"
+
+As a reminder, the act of merging AC into ABAC and resolving the
+conflict to leave line D means that the user declares:
+
+ After examining what branches AB and AC did, I believe that
+ making line A into line D is the best thing to do that is
+ compatible with what AB and AC wanted to do.
+
+So the conflict we would see when merging AB into ACAB should be
+resolved the same way---it is the resolution that is in line with that
+declaration.
+
+Similarly, imagine that a branch XYXZ was previously forked from XY,
+XZ was merged into it, and "X became Y or Z" was resolved into "X
+became W".
+
+Now, if a branch ABXY was forked from AB and then merged XY, then ABXY
+would have line B in its early part and line Y in its later part.
+Such a merge would be quite clean. We can construct 4 combinations
+using these four branches ((AB, AC) x (XY, XZ)).
+
+Merging ABXY and ACXZ would make "an early A became B or C, a late X
+became Y or Z" conflict, while merging ACXY and ABXZ would make "an
+early A became C or B, a late X became Y or Z". We can see there are
+4 combinations of ("B or C", "C or B") x ("Y or Z", "Z or Y").
+
+By sorting, the conflict is given its canonical name, namely, "an
+early part became B or C, a late part became Y or Z", and whenever
+any of these four patterns appears, we can get to the same conflict
+and resolution that we saw earlier.
+
+Without the sorting, we'd have to somehow find a previous resolution
+among a combinatorial explosion of equivalent patterns.
+
+Conflict ID calculation
+~~~~~~~~~~~~~~~~~~~~~~~
+
+Once the conflict normalization is done, the conflict ID is calculated
+as the sha1 hash of the conflict hunks appended to each other,
+separated by <NUL> characters. The conflict markers are stripped out
+before the sha1 is calculated. So in the example above, where we
+merge branch AC, which changes line A to line C, into branch AB, which
+changes line A to line B, the conflict ID would be
+SHA1('B<NUL>C<NUL>').
+
+If there are multiple conflicts in one file, the sha1 is calculated
+the same way with all hunks appended to each other, in the order in
+which they appear in the file, separated by a <NUL> character.
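
The calculation above, combined with the sorting step, can be sketched in Python with the standard `hashlib` module. This follows the document's simplified notation and is not part of the patch: `conflict_id` is a hypothetical name, each conflict is given as a pair of byte strings for its two sides, and line terminators are left out as in the `SHA1('B<NUL>C<NUL>')` example.

```python
import hashlib

def conflict_id(conflicts):
    """Return the hex conflict ID for a list of (side_one, side_two) pairs.

    The pairs are taken in the order in which the conflicts appear in the
    file; the two sides of each conflict are sorted before hashing.
    """
    h = hashlib.sha1()
    for one, two in conflicts:
        first, second = sorted((one, two))
        h.update(first + b"\0")           # each hunk is followed by a NUL
        h.update(second + b"\0")
    return h.hexdigest()

# Merging AC into AB and merging AB into AC give the same ID.
id_abac = conflict_id([(b"B", b"C")])
id_acab = conflict_id([(b"C", b"B")])
```

Because the sides are sorted before hashing, the merge direction does not change the ID, so either merge finds the same recorded resolution.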
+
+Nested conflicts
+~~~~~~~~~~~~~~~~
+
+Nested conflicts are handled very similarly to "simple" conflicts.
+Similar to simple conflicts, the conflict is first normalized by
+stripping the labels from conflict markers, stripping the common ancestor
+version, and then sorting the conflict hunks, both for the outer and the
+inner conflict. This is done recursively, so any number of nested
+conflicts can be handled.
+
+Note that this only works for conflict markers that "cleanly nest". If
+there are any unmatched conflict markers, rerere will fail to handle
+the conflict and will not record a conflict resolution.
+
+The only difference is in how the conflict ID is calculated. For the
+inner conflict, the conflict markers themselves are not stripped out
+before calculating the sha1.
+
+Say we have the following conflict for example:
+
+ <<<<<<< HEAD
+ 1
+ =======
+ <<<<<<< HEAD
+ 3
+ =======
+ 2
+ >>>>>>> branch-2
+ >>>>>>> branch-3~
+
+After stripping out the labels of the conflict markers, and sorting
+the hunks, the conflict would look as follows:
+
+ <<<<<<<
+ 1
+ =======
+ <<<<<<<
+ 2
+ =======
+ 3
+ >>>>>>>
+ >>>>>>>
+
+and finally the conflict ID would be calculated as:
+`sha1('1<NUL><<<<<<<\n2\n=======\n3\n>>>>>>><NUL>')`