diff options
Diffstat (limited to 'Documentation')
-rw-r--r-- | Documentation/config.txt | 4 | ||||
-rw-r--r-- | Documentation/git-commit-graph.txt | 94 | ||||
-rw-r--r-- | Documentation/technical/commit-graph-format.txt | 97 | ||||
-rw-r--r-- | Documentation/technical/commit-graph.txt | 163 |
4 files changed, 358 insertions, 0 deletions
diff --git a/Documentation/config.txt b/Documentation/config.txt index 2659153cb3..cb4c93f74a 100644 --- a/Documentation/config.txt +++ b/Documentation/config.txt @@ -898,6 +898,10 @@ core.notesRef:: This setting defaults to "refs/notes/commits", and it can be overridden by the `GIT_NOTES_REF` environment variable. See linkgit:git-notes[1]. +core.commitGraph:: + Enable git commit graph feature. Allows reading from the + commit-graph file. + core.sparseCheckout:: Enable "sparse checkout" feature. See section "Sparse checkout" in linkgit:git-read-tree[1] for more information. diff --git a/Documentation/git-commit-graph.txt b/Documentation/git-commit-graph.txt new file mode 100644 index 0000000000..4c97b555cc --- /dev/null +++ b/Documentation/git-commit-graph.txt @@ -0,0 +1,94 @@ +git-commit-graph(1) +=================== + +NAME +---- +git-commit-graph - Write and verify Git commit graph files + + +SYNOPSIS +-------- +[verse] +'git commit-graph read' [--object-dir <dir>] +'git commit-graph write' <options> [--object-dir <dir>] + + +DESCRIPTION +----------- + +Manage the serialized commit graph file. + + +OPTIONS +------- +--object-dir:: + Use given directory for the location of packfiles and commit graph + file. This parameter exists to specify the location of an alternate + that only has the objects directory, not a full .git directory. The + commit graph file is expected to be at <dir>/info/commit-graph and + the packfiles are expected to be in <dir>/pack. + + +COMMANDS +-------- +'write':: + +Write a commit graph file based on the commits found in packfiles. ++ +With the `--stdin-packs` option, generate the new commit graph by +walking objects only in the specified pack-indexes. (Cannot be combined +with --stdin-commits.) ++ +With the `--stdin-commits` option, generate the new commit graph by +walking commits starting at the commits specified in stdin as a list +of OIDs in hex, one OID per line. (Cannot be combined with +--stdin-packs.) ++ +With the `--append` option, include all commits that are present in the +existing commit-graph file. + +'read':: + +Read a graph file given by the commit-graph file and output basic +details about the graph file. Used for debugging purposes. + + +EXAMPLES +-------- + +* Write a commit graph file for the packed commits in your local .git folder. ++ +------------------------------------------------ +$ git commit-graph write +------------------------------------------------ + +* Write a graph file, extending the current graph file using commits +* in <pack-index>. ++ +------------------------------------------------ +$ echo <pack-index> | git commit-graph write --stdin-packs +------------------------------------------------ + +* Write a graph file containing all reachable commits. ++ +------------------------------------------------ +$ git show-ref -s | git commit-graph write --stdin-commits +------------------------------------------------ + +* Write a graph file containing all commits in the current +* commit-graph file along with those reachable from HEAD. ++ +------------------------------------------------ +$ git rev-parse HEAD | git commit-graph write --stdin-commits --append +------------------------------------------------ + +* Read basic information from the commit-graph file. ++ +------------------------------------------------ +$ git commit-graph read +------------------------------------------------ + + +GIT +--- +Part of the linkgit:git[1] suite diff --git a/Documentation/technical/commit-graph-format.txt b/Documentation/technical/commit-graph-format.txt new file mode 100644 index 0000000000..ad6af8105c --- /dev/null +++ b/Documentation/technical/commit-graph-format.txt @@ -0,0 +1,97 @@ +Git commit graph format +======================= + +The Git commit graph stores a list of commit OIDs and some associated +metadata, including: + +- The generation number of the commit. Commits with no parents have + generation number 1; commits with parents have generation number + one more than the maximum generation number of its parents. We + reserve zero as special, and can be used to mark a generation + number invalid or as "not computed". + +- The root tree OID. + +- The commit date. + +- The parents of the commit, stored using positional references within + the graph file. + +These positional references are stored as unsigned 32-bit integers +corresponding to the array position withing the list of commit OIDs. We +use the most-significant bit for special purposes, so we can store at most +(1 << 31) - 1 (around 2 billion) commits. + +== Commit graph files have the following format: + +In order to allow extensions that add extra data to the graph, we organize +the body into "chunks" and provide a binary lookup table at the beginning +of the body. The header includes certain values, such as number of chunks +and hash type. + +All 4-byte numbers are in network order. + +HEADER: + + 4-byte signature: + The signature is: {'C', 'G', 'P', 'H'} + + 1-byte version number: + Currently, the only valid version is 1. + + 1-byte Hash Version (1 = SHA-1) + We infer the hash length (H) from this value. + + 1-byte number (C) of "chunks" + + 1-byte (reserved for later use) + Current clients should ignore this value. + +CHUNK LOOKUP: + + (C + 1) * 12 bytes listing the table of contents for the chunks: + First 4 bytes describe the chunk id. Value 0 is a terminating label. + Other 8 bytes provide the byte-offset in current file for chunk to + start. (Chunks are ordered contiguously in the file, so you can infer + the length using the next chunk position if necessary.) Each chunk + ID appears at most once. + + The remaining data in the body is described one chunk at a time, and + these chunks may be given in any order. Chunks are required unless + otherwise specified. + +CHUNK DATA: + + OID Fanout (ID: {'O', 'I', 'D', 'F'}) (256 * 4 bytes) + The ith entry, F[i], stores the number of OIDs with first + byte at most i. Thus F[255] stores the total + number of commits (N). + + OID Lookup (ID: {'O', 'I', 'D', 'L'}) (N * H bytes) + The OIDs for all commits in the graph, sorted in ascending order. + + Commit Data (ID: {'C', 'G', 'E', 'T' }) (N * (H + 16) bytes) + * The first H bytes are for the OID of the root tree. + * The next 8 bytes are for the positions of the first two parents + of the ith commit. Stores value 0xffffffff if no parent in that + position. If there are more than two parents, the second value + has its most-significant bit on and the other bits store an array + position into the Large Edge List chunk. + * The next 8 bytes store the generation number of the commit and + the commit time in seconds since EPOCH. The generation number + uses the higher 30 bits of the first 4 bytes, while the commit + time uses the 32 bits of the second 4 bytes, along with the lowest + 2 bits of the lowest byte, storing the 33rd and 34th bit of the + commit time. + + Large Edge List (ID: {'E', 'D', 'G', 'E'}) [Optional] + This list of 4-byte values store the second through nth parents for + all octopus merges. The second parent value in the commit data stores + an array position within this list along with the most-significant bit + on. Starting at that array position, iterate through this list of commit + positions for the parents until reaching a value with the most-significant + bit on. The other bits correspond to the position of the last parent. + +TRAILER: + + H-byte HASH-checksum of all of the above. diff --git a/Documentation/technical/commit-graph.txt b/Documentation/technical/commit-graph.txt new file mode 100644 index 0000000000..0550c6d0dc --- /dev/null +++ b/Documentation/technical/commit-graph.txt @@ -0,0 +1,163 @@ +Git Commit Graph Design Notes +============================= + +Git walks the commit graph for many reasons, including: + +1. Listing and filtering commit history. +2. Computing merge bases. + +These operations can become slow as the commit count grows. The merge +base calculation shows up in many user-facing commands, such as 'merge-base' +or 'status' and can take minutes to compute depending on history shape. + +There are two main costs here: + +1. Decompressing and parsing commits. +2. Walking the entire graph to satisfy topological order constraints. + +The commit graph file is a supplemental data structure that accelerates +commit graph walks. If a user downgrades or disables the 'core.commitGraph' +config setting, then the existing ODB is sufficient. The file is stored +as "commit-graph" either in the .git/objects/info directory or in the info +directory of an alternate. + +The commit graph file stores the commit graph structure along with some +extra metadata to speed up graph walks. By listing commit OIDs in lexi- +cographic order, we can identify an integer position for each commit and +refer to the parents of a commit using those integer positions. We use +binary search to find initial commits and then use the integer positions +for fast lookups during the walk. + +A consumer may load the following info for a commit from the graph: + +1. The commit OID. +2. The list of parents, along with their integer position. +3. The commit date. +4. The root tree OID. +5. The generation number (see definition below). + +Values 1-4 satisfy the requirements of parse_commit_gently(). + +Define the "generation number" of a commit recursively as follows: + + * A commit with no parents (a root commit) has generation number one. + + * A commit with at least one parent has generation number one more than + the largest generation number among its parents. + +Equivalently, the generation number of a commit A is one more than the +length of a longest path from A to a root commit. The recursive definition +is easier to use for computation and observing the following property: + + If A and B are commits with generation numbers N and M, respectively, + and N <= M, then A cannot reach B. That is, we know without searching + that B is not an ancestor of A because it is further from a root commit + than A. + + Conversely, when checking if A is an ancestor of B, then we only need + to walk commits until all commits on the walk boundary have generation + number at most N. If we walk commits using a priority queue seeded by + generation numbers, then we always expand the boundary commit with highest + generation number and can easily detect the stopping condition. + +This property can be used to significantly reduce the time it takes to +walk commits and determine topological relationships. Without generation +numbers, the general heuristic is the following: + + If A and B are commits with commit time X and Y, respectively, and + X < Y, then A _probably_ cannot reach B. + +This heuristic is currently used whenever the computation is allowed to +violate topological relationships due to clock skew (such as "git log" +with default order), but is not used when the topological order is +required (such as merge base calculations, "git log --graph"). + +In practice, we expect some commits to be created recently and not stored +in the commit graph. We can treat these commits as having "infinite" +generation number and walk until reaching commits with known generation +number. + +Design Details +-------------- + +- The commit graph file is stored in a file named 'commit-graph' in the + .git/objects/info directory. This could be stored in the info directory + of an alternate. + +- The core.commitGraph config setting must be on to consume graph files. + +- The file format includes parameters for the object ID hash function, + so a future change of hash algorithm does not require a change in format. + +Future Work +----------- + +- The commit graph feature currently does not honor commit grafts. This can + be remedied by duplicating or refactoring the current graft logic. + +- The 'commit-graph' subcommand does not have a "verify" mode that is + necessary for integration with fsck. + +- The file format includes room for precomputed generation numbers. These + are not currently computed, so all generation numbers will be marked as + 0 (or "uncomputed"). A later patch will include this calculation. + +- After computing and storing generation numbers, we must make graph + walks aware of generation numbers to gain the performance benefits they + enable. This will mostly be accomplished by swapping a commit-date-ordered + priority queue with one ordered by generation number. The following + operations are important candidates: + + - paint_down_to_common() + - 'log --topo-order' + +- Currently, parse_commit_gently() requires filling in the root tree + object for a commit. This passes through lookup_tree() and consequently + lookup_object(). Also, it calls lookup_commit() when loading the parents. + These method calls check the ODB for object existence, even if the + consumer does not need the content. For example, we do not need the + tree contents when computing merge bases. Now that commit parsing is + removed from the computation time, these lookup operations are the + slowest operations keeping graph walks from being fast. Consider + loading these objects without verifying their existence in the ODB and + only loading them fully when consumers need them. Consider a method + such as "ensure_tree_loaded(commit)" that fully loads a tree before + using commit->tree. + +- The current design uses the 'commit-graph' subcommand to generate the graph. + When this feature stabilizes enough to recommend to most users, we should + add automatic graph writes to common operations that create many commits. + For example, one could compute a graph on 'clone', 'fetch', or 'repack' + commands. + +- A server could provide a commit graph file as part of the network protocol + to avoid extra calculations by clients. This feature is only of benefit if + the user is willing to trust the file, because verifying the file is correct + is as hard as computing it from scratch. + +Related Links +------------- +[0] https://bugs.chromium.org/p/git/issues/detail?id=8 + Chromium work item for: Serialized Commit Graph + +[1] https://public-inbox.org/git/20110713070517.GC18566@sigill.intra.peff.net/ + An abandoned patch that introduced generation numbers. + +[2] https://public-inbox.org/git/20170908033403.q7e6dj7benasrjes@sigill.intra.peff.net/ + Discussion about generation numbers on commits and how they interact + with fsck. + +[3] https://public-inbox.org/git/20170908034739.4op3w4f2ma5s65ku@sigill.intra.peff.net/ + More discussion about generation numbers and not storing them inside + commit objects. A valuable quote: + + "I think we should be moving more in the direction of keeping + repo-local caches for optimizations. Reachability bitmaps have been + a big performance win. I think we should be doing the same with our + properties of commits. Not just generation numbers, but making it + cheap to access the graph structure without zlib-inflating whole + commit objects (i.e., packv4 or something like the "metapacks" I + proposed a few years ago)." + +[4] https://public-inbox.org/git/20180108154822.54829-1-git@jeffhostetler.com/T/#u + A patch to remove the ahead-behind calculation from 'status'. |