Diffstat (limited to 'Documentation/technical')
-rw-r--r-- | Documentation/technical/api-config.txt | 18
-rw-r--r-- | Documentation/technical/api-decorate.txt | 6
-rw-r--r-- | Documentation/technical/api-directory-listing.txt | 27
-rw-r--r-- | Documentation/technical/api-object-access.txt | 4
-rw-r--r-- | Documentation/technical/api-oid-array.txt | 17
-rw-r--r-- | Documentation/technical/api-submodule-config.txt | 4
-rw-r--r-- | Documentation/technical/commit-graph-format.txt | 97
-rw-r--r-- | Documentation/technical/commit-graph.txt | 182
-rw-r--r-- | Documentation/technical/hash-function-transition.txt | 40
-rw-r--r-- | Documentation/technical/http-protocol.txt | 11
-rw-r--r-- | Documentation/technical/index-format.txt | 19
-rw-r--r-- | Documentation/technical/long-running-process-protocol.txt | 50
-rw-r--r-- | Documentation/technical/pack-format.txt | 92
-rw-r--r-- | Documentation/technical/pack-protocol.txt | 43
-rw-r--r-- | Documentation/technical/partial-clone.txt | 324
-rw-r--r-- | Documentation/technical/protocol-v2.txt | 414
-rw-r--r-- | Documentation/technical/shallow.txt | 20
17 files changed, 1316 insertions, 52 deletions
diff --git a/Documentation/technical/api-config.txt b/Documentation/technical/api-config.txt index 9a778b0cad..fa39ac9d71 100644 --- a/Documentation/technical/api-config.txt +++ b/Documentation/technical/api-config.txt @@ -47,21 +47,23 @@ will first feed the user-wide one to the callback, and then the repo-specific one; by overwriting, the higher-priority repo-specific value is left at the end). -The `git_config_with_options` function lets the caller examine config +The `config_with_options` function lets the caller examine config while adjusting some of the default behavior of `git_config`. It should almost never be used by "regular" Git code that is looking up configuration variables. It is intended for advanced callers like `git-config`, which are intentionally tweaking the normal config-lookup process. It takes two extra parameters: -`filename`:: -If this parameter is non-NULL, it specifies the name of a file to -parse for configuration, rather than looking in the usual files. Regular -`git_config` defaults to `NULL`. +`config_source`:: +If this parameter is non-NULL, it specifies the source to parse for +configuration, rather than looking in the usual files. See `struct +git_config_source` in `config.h` for details. Regular `git_config` defaults +to `NULL`. -`respect_includes`:: -Specify whether include directives should be followed in parsed files. -Regular `git_config` defaults to `1`. +`opts`:: +Specify options to adjust the behavior of parsing config files. See `struct +config_options` in `config.h` for details. As an example: regular `git_config` +sets `opts.respect_includes` to `1` by default. 
Reading Specific Files ---------------------- diff --git a/Documentation/technical/api-decorate.txt b/Documentation/technical/api-decorate.txt deleted file mode 100644 index 1d52a6ce14..0000000000 --- a/Documentation/technical/api-decorate.txt +++ /dev/null @@ -1,6 +0,0 @@ -decorate API -============ - -Talk about <decorate.h> - -(Linus) diff --git a/Documentation/technical/api-directory-listing.txt b/Documentation/technical/api-directory-listing.txt index 6c77b4920c..4f44ca24f6 100644 --- a/Documentation/technical/api-directory-listing.txt +++ b/Documentation/technical/api-directory-listing.txt @@ -22,16 +22,20 @@ The notable options are: `flags`:: - A bit-field of options (the `*IGNORED*` flags are mutually exclusive): + A bit-field of options: `DIR_SHOW_IGNORED`::: - Return just ignored files in `entries[]`, not untracked files. + Return just ignored files in `entries[]`, not untracked + files. This flag is mutually exclusive with + `DIR_SHOW_IGNORED_TOO`. `DIR_SHOW_IGNORED_TOO`::: - Similar to `DIR_SHOW_IGNORED`, but return ignored files in `ignored[]` - in addition to untracked files in `entries[]`. + Similar to `DIR_SHOW_IGNORED`, but return ignored files in + `ignored[]` in addition to untracked files in + `entries[]`. This flag is mutually exclusive with + `DIR_SHOW_IGNORED`. `DIR_KEEP_UNTRACKED_CONTENTS`::: @@ -39,6 +43,21 @@ The notable options are: untracked contents of untracked directories are also returned in `entries[]`. +`DIR_SHOW_IGNORED_TOO_MODE_MATCHING`::: + + Only has meaning if `DIR_SHOW_IGNORED_TOO` is also set; if + this is set, returns ignored files and directories that match + an exclude pattern. If a directory matches an exclude pattern, + then the directory is returned and the contained paths are + not. A directory that does not match an exclude pattern will + not be returned even if all of its contents are ignored. In + this case, the contents are returned as individual entries. 
++ +If this is set, files and directories that explicitly match an ignore +pattern are reported. Implicity ignored directories (directories that +do not match an ignore pattern, but whose contents are all ignored) +are not reported, instead all of the contents are reported. + `DIR_COLLECT_IGNORED`::: Special mode for git-add. Return ignored files in `ignored[]` and diff --git a/Documentation/technical/api-object-access.txt b/Documentation/technical/api-object-access.txt index 03bb0e950d..5b29622d00 100644 --- a/Documentation/technical/api-object-access.txt +++ b/Documentation/technical/api-object-access.txt @@ -1,13 +1,13 @@ object access API ================= -Talk about <sha1_file.c> and <object.h> family, things like +Talk about <sha1-file.c> and <object.h> family, things like * read_sha1_file() * read_object_with_reference() * has_sha1_file() * write_sha1_file() -* pretend_sha1_file() +* pretend_object_file() * lookup_{object,commit,tag,blob,tree} * parse_{object,commit,tag,blob,tree} * Use of object flags diff --git a/Documentation/technical/api-oid-array.txt b/Documentation/technical/api-oid-array.txt index b0c11f868d..9febfb1d52 100644 --- a/Documentation/technical/api-oid-array.txt +++ b/Documentation/technical/api-oid-array.txt @@ -35,13 +35,18 @@ Functions Free all memory associated with the array and return it to the initial, empty state. +`oid_array_for_each`:: + Iterate over each element of the list, executing the callback + function for each one. Does not sort the list, so any custom + hash order is retained. If the callback returns a non-zero + value, the iteration ends immediately and the callback's + return is propagated; otherwise, 0 is returned. + `oid_array_for_each_unique`:: - Efficiently iterate over each unique element of the list, - executing the callback function for each one. If the array is - not sorted, this function has the side effect of sorting it. 
If - the callback returns a non-zero value, the iteration ends - immediately and the callback's return is propagated; otherwise, - 0 is returned. + Iterate over each unique element of the list in sorted order, + but otherwise behave like `oid_array_for_each`. If the array + is not sorted, this function has the side effect of sorting + it. Examples -------- diff --git a/Documentation/technical/api-submodule-config.txt b/Documentation/technical/api-submodule-config.txt index 3dce003fda..fb06089393 100644 --- a/Documentation/technical/api-submodule-config.txt +++ b/Documentation/technical/api-submodule-config.txt @@ -4,7 +4,7 @@ submodule config cache API The submodule config cache API allows to read submodule configurations/information from specified revisions. Internally information is lazily read into a cache that is used to avoid -unnecessary parsing of the same .gitmodule files. Lookups can be done by +unnecessary parsing of the same .gitmodules files. Lookups can be done by submodule path or name. Usage @@ -38,7 +38,7 @@ Data Structures Functions --------- -`void submodule_free()`:: +`void submodule_free(struct repository *r)`:: Use these to free the internally cached values. diff --git a/Documentation/technical/commit-graph-format.txt b/Documentation/technical/commit-graph-format.txt new file mode 100644 index 0000000000..ad6af8105c --- /dev/null +++ b/Documentation/technical/commit-graph-format.txt @@ -0,0 +1,97 @@ +Git commit graph format +======================= + +The Git commit graph stores a list of commit OIDs and some associated +metadata, including: + +- The generation number of the commit. Commits with no parents have + generation number 1; commits with parents have generation number + one more than the maximum generation number of its parents. We + reserve zero as special, and can be used to mark a generation + number invalid or as "not computed". + +- The root tree OID. + +- The commit date. 
+ +- The parents of the commit, stored using positional references within + the graph file. + +These positional references are stored as unsigned 32-bit integers +corresponding to the array position withing the list of commit OIDs. We +use the most-significant bit for special purposes, so we can store at most +(1 << 31) - 1 (around 2 billion) commits. + +== Commit graph files have the following format: + +In order to allow extensions that add extra data to the graph, we organize +the body into "chunks" and provide a binary lookup table at the beginning +of the body. The header includes certain values, such as number of chunks +and hash type. + +All 4-byte numbers are in network order. + +HEADER: + + 4-byte signature: + The signature is: {'C', 'G', 'P', 'H'} + + 1-byte version number: + Currently, the only valid version is 1. + + 1-byte Hash Version (1 = SHA-1) + We infer the hash length (H) from this value. + + 1-byte number (C) of "chunks" + + 1-byte (reserved for later use) + Current clients should ignore this value. + +CHUNK LOOKUP: + + (C + 1) * 12 bytes listing the table of contents for the chunks: + First 4 bytes describe the chunk id. Value 0 is a terminating label. + Other 8 bytes provide the byte-offset in current file for chunk to + start. (Chunks are ordered contiguously in the file, so you can infer + the length using the next chunk position if necessary.) Each chunk + ID appears at most once. + + The remaining data in the body is described one chunk at a time, and + these chunks may be given in any order. Chunks are required unless + otherwise specified. + +CHUNK DATA: + + OID Fanout (ID: {'O', 'I', 'D', 'F'}) (256 * 4 bytes) + The ith entry, F[i], stores the number of OIDs with first + byte at most i. Thus F[255] stores the total + number of commits (N). + + OID Lookup (ID: {'O', 'I', 'D', 'L'}) (N * H bytes) + The OIDs for all commits in the graph, sorted in ascending order. 
+ + Commit Data (ID: {'C', 'G', 'E', 'T' }) (N * (H + 16) bytes) + * The first H bytes are for the OID of the root tree. + * The next 8 bytes are for the positions of the first two parents + of the ith commit. Stores value 0xffffffff if no parent in that + position. If there are more than two parents, the second value + has its most-significant bit on and the other bits store an array + position into the Large Edge List chunk. + * The next 8 bytes store the generation number of the commit and + the commit time in seconds since EPOCH. The generation number + uses the higher 30 bits of the first 4 bytes, while the commit + time uses the 32 bits of the second 4 bytes, along with the lowest + 2 bits of the lowest byte, storing the 33rd and 34th bit of the + commit time. + + Large Edge List (ID: {'E', 'D', 'G', 'E'}) [Optional] + This list of 4-byte values store the second through nth parents for + all octopus merges. The second parent value in the commit data stores + an array position within this list along with the most-significant bit + on. Starting at that array position, iterate through this list of commit + positions for the parents until reaching a value with the most-significant + bit on. The other bits correspond to the position of the last parent. + +TRAILER: + + H-byte HASH-checksum of all of the above. diff --git a/Documentation/technical/commit-graph.txt b/Documentation/technical/commit-graph.txt new file mode 100644 index 0000000000..e1a883eb46 --- /dev/null +++ b/Documentation/technical/commit-graph.txt @@ -0,0 +1,182 @@ +Git Commit Graph Design Notes +============================= + +Git walks the commit graph for many reasons, including: + +1. Listing and filtering commit history. +2. Computing merge bases. + +These operations can become slow as the commit count grows. The merge +base calculation shows up in many user-facing commands, such as 'merge-base' +or 'status' and can take minutes to compute depending on history shape. 
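The header and chunk lookup layout described in commit-graph-format.txt above can be read with a few fixed-size unpacks. The following is a minimal sketch, not Git's implementation: the function names are hypothetical, the 8-byte chunk offsets are assumed to be big-endian like the 4-byte numbers, and the generation/date split follows one reading of the "Commit Data" description (generation in the upper 30 bits of the first word, the low 2 bits supplying bits 33 and 34 of the commit time).

```python
import struct

SIGNATURE = b"CGPH"
HASH_LENGTHS = {1: 20}  # hash version 1 = SHA-1, so H = 20 bytes


def parse_header(data: bytes):
    """Return (version, hash_len, chunks); chunks maps chunk id -> file offset."""
    # 4-byte signature, then four 1-byte fields (the last one is reserved).
    signature, version, hash_version, num_chunks, _reserved = struct.unpack_from(
        ">4sBBBB", data, 0
    )
    if signature != SIGNATURE:
        raise ValueError("not a commit-graph file")
    if version != 1:
        raise ValueError(f"unsupported version {version}")
    hash_len = HASH_LENGTHS[hash_version]

    chunks = {}
    pos = 8
    # (C + 1) entries of 12 bytes each: 4-byte chunk id, 8-byte offset.
    for _ in range(num_chunks + 1):
        chunk_id, offset = struct.unpack_from(">4sQ", data, pos)
        pos += 12
        if chunk_id == b"\0\0\0\0":  # value 0 is the terminating label
            break
        chunks[chunk_id] = offset
    return version, hash_len, chunks


def decode_generation_and_date(eight_bytes: bytes):
    """Split the last 8 bytes of a Commit Data record (assumed layout)."""
    first, second = struct.unpack(">II", eight_bytes)
    generation = first >> 2                       # upper 30 bits
    commit_time = ((first & 0x3) << 32) | second  # 34-bit timestamp
    return generation, commit_time
```

Because chunks are stored contiguously, a reader that also needs chunk lengths could keep the terminating entry's offset and subtract neighbouring offsets.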
+ +There are two main costs here: + +1. Decompressing and parsing commits. +2. Walking the entire graph to satisfy topological order constraints. + +The commit graph file is a supplemental data structure that accelerates +commit graph walks. If a user downgrades or disables the 'core.commitGraph' +config setting, then the existing ODB is sufficient. The file is stored +as "commit-graph" either in the .git/objects/info directory or in the info +directory of an alternate. + +The commit graph file stores the commit graph structure along with some +extra metadata to speed up graph walks. By listing commit OIDs in lexi- +cographic order, we can identify an integer position for each commit and +refer to the parents of a commit using those integer positions. We use +binary search to find initial commits and then use the integer positions +for fast lookups during the walk. + +A consumer may load the following info for a commit from the graph: + +1. The commit OID. +2. The list of parents, along with their integer position. +3. The commit date. +4. The root tree OID. +5. The generation number (see definition below). + +Values 1-4 satisfy the requirements of parse_commit_gently(). + +Define the "generation number" of a commit recursively as follows: + + * A commit with no parents (a root commit) has generation number one. + + * A commit with at least one parent has generation number one more than + the largest generation number among its parents. + +Equivalently, the generation number of a commit A is one more than the +length of a longest path from A to a root commit. The recursive definition +is easier to use for computation and observing the following property: + + If A and B are commits with generation numbers N and M, respectively, + and N <= M, then A cannot reach B. That is, we know without searching + that B is not an ancestor of A because it is further from a root commit + than A. 
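The recursive definition of generation numbers above can be evaluated without recursion by an explicit post-order walk. This is an illustrative sketch only: the `parents` mapping (commit name to list of parent names) and the function name are hypothetical and do not correspond to Git's internal data structures, and the history is assumed to be acyclic.

```python
def generation_numbers(parents):
    """Map each commit to its generation number.

    parents: dict mapping a commit to the list of its parent commits.
    A root commit gets 1; any other commit gets 1 + max over its parents.
    """
    gen = {}
    for start in parents:
        stack = [start]
        while stack:
            commit = stack[-1]
            if commit in gen:
                stack.pop()
                continue
            pending = [p for p in parents[commit] if p not in gen]
            if pending:
                # Compute the parents first, then revisit this commit.
                stack.extend(pending)
            else:
                gen[commit] = 1 + max((gen[p] for p in parents[commit]), default=0)
                stack.pop()
    return gen


history = {
    "a": [],          # root commit
    "b": ["a"],
    "c": ["a"],
    "d": ["b", "c"],  # merge
    "e": ["d", "a"],
}
gens = generation_numbers(history)
```

In this toy history "b" and "c" both get generation number 2, so by the property above neither can reach the other, and that conclusion needs no walk at all.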
+ + Conversely, when checking if A is an ancestor of B, then we only need + to walk commits until all commits on the walk boundary have generation + number at most N. If we walk commits using a priority queue seeded by + generation numbers, then we always expand the boundary commit with highest + generation number and can easily detect the stopping condition. + +This property can be used to significantly reduce the time it takes to +walk commits and determine topological relationships. Without generation +numbers, the general heuristic is the following: + + If A and B are commits with commit time X and Y, respectively, and + X < Y, then A _probably_ cannot reach B. + +This heuristic is currently used whenever the computation is allowed to +violate topological relationships due to clock skew (such as "git log" +with default order), but is not used when the topological order is +required (such as merge base calculations, "git log --graph"). + +In practice, we expect some commits to be created recently and not stored +in the commit graph. We can treat these commits as having "infinite" +generation number and walk until reaching commits with known generation +number. + +We use the macro GENERATION_NUMBER_INFINITY = 0xFFFFFFFF to mark commits not +in the commit-graph file. If a commit-graph file was written by a version +of Git that did not compute generation numbers, then those commits will +have generation number represented by the macro GENERATION_NUMBER_ZERO = 0. + +Since the commit-graph file is closed under reachability, we can guarantee +the following weaker condition on all commits: + + If A and B are commits with generation numbers N amd M, respectively, + and N < M, then A cannot reach B. + +Note how the strict inequality differs from the inequality when we have +fully-computed generation numbers. 
Using strict inequality may result in +walking a few extra commits, but the simplicity in dealing with commits +with generation number *_INFINITY or *_ZERO is valuable. + +We use the macro GENERATION_NUMBER_MAX = 0x3FFFFFFF to for commits whose +generation numbers are computed to be at least this value. We limit at +this value since it is the largest value that can be stored in the +commit-graph file using the 30 bits available to generation numbers. This +presents another case where a commit can have generation number equal to +that of a parent. + +Design Details +-------------- + +- The commit graph file is stored in a file named 'commit-graph' in the + .git/objects/info directory. This could be stored in the info directory + of an alternate. + +- The core.commitGraph config setting must be on to consume graph files. + +- The file format includes parameters for the object ID hash function, + so a future change of hash algorithm does not require a change in format. + +Future Work +----------- + +- The commit graph feature currently does not honor commit grafts. This can + be remedied by duplicating or refactoring the current graft logic. + +- The 'commit-graph' subcommand does not have a "verify" mode that is + necessary for integration with fsck. + +- After computing and storing generation numbers, we must make graph + walks aware of generation numbers to gain the performance benefits they + enable. This will mostly be accomplished by swapping a commit-date-ordered + priority queue with one ordered by generation number. The following + operations are important candidates: + + - 'log --topo-order' + - 'tag --merged' + +- Currently, parse_commit_gently() requires filling in the root tree + object for a commit. This passes through lookup_tree() and consequently + lookup_object(). Also, it calls lookup_commit() when loading the parents. + These method calls check the ODB for object existence, even if the + consumer does not need the content. 
For example, we do not need the + tree contents when computing merge bases. Now that commit parsing is + removed from the computation time, these lookup operations are the + slowest operations keeping graph walks from being fast. Consider + loading these objects without verifying their existence in the ODB and + only loading them fully when consumers need them. Consider a method + such as "ensure_tree_loaded(commit)" that fully loads a tree before + using commit->tree. + +- The current design uses the 'commit-graph' subcommand to generate the graph. + When this feature stabilizes enough to recommend to most users, we should + add automatic graph writes to common operations that create many commits. + For example, one could compute a graph on 'clone', 'fetch', or 'repack' + commands. + +- A server could provide a commit graph file as part of the network protocol + to avoid extra calculations by clients. This feature is only of benefit if + the user is willing to trust the file, because verifying the file is correct + is as hard as computing it from scratch. + +Related Links +------------- +[0] https://bugs.chromium.org/p/git/issues/detail?id=8 + Chromium work item for: Serialized Commit Graph + +[1] https://public-inbox.org/git/20110713070517.GC18566@sigill.intra.peff.net/ + An abandoned patch that introduced generation numbers. + +[2] https://public-inbox.org/git/20170908033403.q7e6dj7benasrjes@sigill.intra.peff.net/ + Discussion about generation numbers on commits and how they interact + with fsck. + +[3] https://public-inbox.org/git/20170908034739.4op3w4f2ma5s65ku@sigill.intra.peff.net/ + More discussion about generation numbers and not storing them inside + commit objects. A valuable quote: + + "I think we should be moving more in the direction of keeping + repo-local caches for optimizations. Reachability bitmaps have been + a big performance win. I think we should be doing the same with our + properties of commits. 
Not just generation numbers, but making it + cheap to access the graph structure without zlib-inflating whole + commit objects (i.e., packv4 or something like the "metapacks" I + proposed a few years ago)." + +[4] https://public-inbox.org/git/20180108154822.54829-1-git@jeffhostetler.com/T/#u + A patch to remove the ahead-behind calculation from 'status'. diff --git a/Documentation/technical/hash-function-transition.txt b/Documentation/technical/hash-function-transition.txt index 417ba491d0..4ab6cd1012 100644 --- a/Documentation/technical/hash-function-transition.txt +++ b/Documentation/technical/hash-function-transition.txt @@ -28,11 +28,30 @@ advantages: address stored content. Over time some flaws in SHA-1 have been discovered by security -researchers. https://shattered.io demonstrated a practical SHA-1 hash -collision. As a result, SHA-1 cannot be considered cryptographically -secure any more. This impacts the communication of hash values because -we cannot trust that a given hash value represents the known good -version of content that the speaker intended. +researchers. On 23 February 2017 the SHAttered attack +(https://shattered.io) demonstrated a practical SHA-1 hash collision. + +Git v2.13.0 and later subsequently moved to a hardened SHA-1 +implementation by default, which isn't vulnerable to the SHAttered +attack. + +Thus Git has in effect already migrated to a new hash that isn't SHA-1 +and doesn't share its vulnerabilities, its new hash function just +happens to produce exactly the same output for all known inputs, +except two PDFs published by the SHAttered researchers, and the new +implementation (written by those researchers) claims to detect future +cryptanalytic collision attacks. + +Regardless, it's considered prudent to move past any variant of SHA-1 +to a new hash. There's no guarantee that future attacks on SHA-1 won't +be published in the future, and those attacks may not have viable +mitigations. 
+ +If SHA-1 and its variants were to be truly broken, Git's hash function +could not be considered cryptographically secure any more. This would +impact the communication of hash values because we could not trust +that a given hash value represented the known good version of content +that the speaker intended. SHA-1 still possesses the other properties such as fast object lookup and safe error checking, but other hash functions are equally suitable @@ -116,10 +135,15 @@ Documentation/technical/repository-version.txt) with extensions objectFormat = newhash compatObjectFormat = sha1 -Specifying a repository format extension ensures that versions of Git -not aware of NewHash do not try to operate on these repositories, -instead producing an error message: +The combination of setting `core.repositoryFormatVersion=1` and +populating `extensions.*` ensures that all versions of Git later than +`v0.99.9l` will die instead of trying to operate on the NewHash +repository, instead producing an error message. + # Between v0.99.9l and v2.7.0 + $ git status + fatal: Expected git repo version <= 0, found 1 + # After v2.7.0 $ git status fatal: unknown repository extensions found: objectformat diff --git a/Documentation/technical/http-protocol.txt b/Documentation/technical/http-protocol.txt index 1c561bdd92..64f49d0bbb 100644 --- a/Documentation/technical/http-protocol.txt +++ b/Documentation/technical/http-protocol.txt @@ -214,10 +214,16 @@ smart server reply: S: Cache-Control: no-cache S: S: 001e# service=git-upload-pack\n + S: 0000 S: 004895dcfa3633004da0049d3d0fa03f80589cbcaf31 refs/heads/maint\0multi_ack\n S: 0042d049f6c27a2244e12041955e262a404c7faba355 refs/heads/master\n S: 003c2cb58b79488a98d2721cea644875a8dd0026b115 refs/tags/v1.0\n S: 003fa3c2e2402b99163d1d59756e5f207ae21cccba4c refs/tags/v1.0^{}\n + S: 0000 + +The client may send Extra Parameters (see +Documentation/technical/pack-protocol.txt) as a colon-separated string +in the Git-Protocol HTTP header. 
Dumb Server Response ^^^^^^^^^^^^^^^^^^^^ @@ -269,7 +275,12 @@ the C locale ordering. The stream SHOULD include the default ref named `HEAD` as the first ref. The stream MUST include capability declarations behind a NUL on the first ref. +The returned response contains "version 1" if "version=1" was sent as an +Extra Parameter. + smart_reply = PKT-LINE("# service=$servicename" LF) + "0000" + *1("version 1") ref_list "0000" ref_list = empty_list / non_empty_list diff --git a/Documentation/technical/index-format.txt b/Documentation/technical/index-format.txt index ade0b0c445..db3572626b 100644 --- a/Documentation/technical/index-format.txt +++ b/Documentation/technical/index-format.txt @@ -295,3 +295,22 @@ The remaining data of each directory block is grouped by type: in the previous ewah bitmap. - One NUL. + +== File System Monitor cache + + The file system monitor cache tracks files for which the core.fsmonitor + hook has told us about changes. The signature for this extension is + { 'F', 'S', 'M', 'N' }. + + The extension starts with + + - 32-bit version number: the current supported version is 1. + + - 64-bit time: the extension data reflects all changes through the given + time which is stored as the nanoseconds elapsed since midnight, + January 1, 1970. + + - 32-bit bitmap size: the size of the CE_FSMONITOR_VALID bitmap. + + - An ewah bitmap, the n-th bit indicates whether the n-th index entry + is not CE_FSMONITOR_VALID. diff --git a/Documentation/technical/long-running-process-protocol.txt b/Documentation/technical/long-running-process-protocol.txt new file mode 100644 index 0000000000..aa0aa9af1c --- /dev/null +++ b/Documentation/technical/long-running-process-protocol.txt @@ -0,0 +1,50 @@ +Long-running process protocol +============================= + +This protocol is used when Git needs to communicate with an external +process throughout the entire life of a single Git command. 
All +communication is in pkt-line format (see technical/protocol-common.txt) +over standard input and standard output. + +Handshake +--------- + +Git starts by sending a welcome message (for example, +"git-filter-client"), a list of supported protocol version numbers, and +a flush packet. Git expects to read the welcome message with "server" +instead of "client" (for example, "git-filter-server"), exactly one +protocol version number from the previously sent list, and a flush +packet. All further communication will be based on the selected version. +The remaining protocol description below documents "version=2". Please +note that "version=42" in the example below does not exist and is only +there to illustrate how the protocol would look like with more than one +version. + +After the version negotiation Git sends a list of all capabilities that +it supports and a flush packet. Git expects to read a list of desired +capabilities, which must be a subset of the supported capabilities list, +and a flush packet as response: +------------------------ +packet: git> git-filter-client +packet: git> version=2 +packet: git> version=42 +packet: git> 0000 +packet: git< git-filter-server +packet: git< version=2 +packet: git< 0000 +packet: git> capability=clean +packet: git> capability=smudge +packet: git> capability=not-yet-invented +packet: git> 0000 +packet: git< capability=clean +packet: git< capability=smudge +packet: git< 0000 +------------------------ + +Shutdown +-------- + +Git will close +the command pipe on exit. The filter is expected to detect EOF +and exit gracefully on its own. Git will wait until the filter +process has stopped. diff --git a/Documentation/technical/pack-format.txt b/Documentation/technical/pack-format.txt index 8e5bf60be3..70a99fd142 100644 --- a/Documentation/technical/pack-format.txt +++ b/Documentation/technical/pack-format.txt @@ -36,6 +36,98 @@ Git pack format - The trailer records 20-byte SHA-1 checksum of all of the above. 
+=== Object types + +Valid object types are: + +- OBJ_COMMIT (1) +- OBJ_TREE (2) +- OBJ_BLOB (3) +- OBJ_TAG (4) +- OBJ_OFS_DELTA (6) +- OBJ_REF_DELTA (7) + +Type 5 is reserved for future expansion. Type 0 is invalid. + +=== Deltified representation + +Conceptually there are only four object types: commit, tree, tag and +blob. However to save space, an object could be stored as a "delta" of +another "base" object. These representations are assigned new types +ofs-delta and ref-delta, which is only valid in a pack file. + +Both ofs-delta and ref-delta store the "delta" to be applied to +another object (called 'base object') to reconstruct the object. The +difference between them is, ref-delta directly encodes 20-byte base +object name. If the base object is in the same pack, ofs-delta encodes +the offset of the base object in the pack instead. + +The base object could also be deltified if it's in the same pack. +Ref-delta can also refer to an object outside the pack (i.e. the +so-called "thin pack"). When stored on disk however, the pack should +be self contained to avoid cyclic dependency. + +The delta data is a sequence of instructions to reconstruct an object +from the base object. If the base object is deltified, it must be +converted to canonical form first. Each instruction appends more and +more data to the target object until it's complete. There are two +supported instructions so far: one for copy a byte range from the +source object and one for inserting new data embedded in the +instruction itself. + +Each instruction has variable length. Instruction type is determined +by the seventh bit of the first octet. The following diagrams follow +the convention in RFC 1951 (Deflate compressed data format). 
+ +==== Instruction to copy from base object + + +----------+---------+---------+---------+---------+-------+-------+-------+ + | 1xxxxxxx | offset1 | offset2 | offset3 | offset4 | size1 | size2 | size3 | + +----------+---------+---------+---------+---------+-------+-------+-------+ + +This is the instruction format to copy a byte range from the source +object. It encodes the offset to copy from and the number of bytes to +copy. Offset and size are in little-endian order. + +All offset and size bytes are optional. This is to reduce the +instruction size when encoding small offsets or sizes. The first seven +bits in the first octet determines which of the next seven octets is +present. If bit zero is set, offset1 is present. If bit one is set +offset2 is present and so on. + +Note that a more compact instruction does not change offset and size +encoding. For example, if only offset2 is omitted like below, offset3 +still contains bits 16-23. It does not become offset2 and contains +bits 8-15 even if it's right next to offset1. + + +----------+---------+---------+ + | 10000101 | offset1 | offset3 | + +----------+---------+---------+ + +In its most compact form, this instruction only takes up one byte +(0x80) with both offset and size omitted, which will have default +values zero. There is another exception: size zero is automatically +converted to 0x10000. + +==== Instruction to add new data + + +----------+============+ + | 0xxxxxxx | data | + +----------+============+ + +This is the instruction to construct target object without the base +object. The following data is appended to the target object. The first +seven bits of the first octet determines the size of data in +bytes. The size must be non-zero. + +==== Reserved instruction + + +----------+============ + | 00000000 | + +----------+============ + +This is the instruction reserved for future expansion. 
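The two instruction formats described above can be applied with a small loop. The sketch below is an assumption-laden illustration rather than Git's code: the function name is hypothetical, `instructions` is assumed to contain only the instruction stream (any size header that precedes it in a real delta is assumed to have been stripped already), and the base object is assumed to be in canonical, undeltified form.

```python
def apply_delta_instructions(base: bytes, instructions: bytes) -> bytes:
    """Rebuild a target object from `base` and a stream of delta instructions."""
    target = bytearray()
    pos = 0
    while pos < len(instructions):
        op = instructions[pos]
        pos += 1
        if op & 0x80:
            # Copy instruction: bits 0-3 select offset1..offset4,
            # bits 4-6 select size1..size3, all little-endian.
            offset = 0
            for i in range(4):
                if op & (1 << i):
                    offset |= instructions[pos] << (8 * i)
                    pos += 1
            size = 0
            for i in range(3):
                if op & (0x10 << i):
                    size |= instructions[pos] << (8 * i)
                    pos += 1
            if size == 0:
                size = 0x10000  # size zero means 0x10000
            if offset + size > len(base):
                raise ValueError("copy range exceeds base object")
            target += base[offset:offset + size]
        elif op != 0:
            # Insert instruction: the low seven bits give the data length.
            if pos + op > len(instructions):
                raise ValueError("truncated insert instruction")
            target += instructions[pos:pos + op]
            pos += op
        else:
            raise ValueError("reserved instruction 0x00")
    return bytes(target)


base = b"hello world"
# 0x91 = copy with offset1 and size1 present: offset 6, size 5; then insert 2 bytes.
delta = bytes([0x91, 6, 5, 0x02]) + b"!!"
result = apply_delta_instructions(base, delta)
```

Here `result` is `b"world!!"`: the copy instruction contributes "world" from the base and the insert instruction appends the two literal bytes.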
+ == Original (version 1) pack-*.idx files have the following format: - The header consists of 256 4-byte network byte order diff --git a/Documentation/technical/pack-protocol.txt b/Documentation/technical/pack-protocol.txt index a43a113e44..7fee6b780a 100644 --- a/Documentation/technical/pack-protocol.txt +++ b/Documentation/technical/pack-protocol.txt @@ -39,6 +39,19 @@ communicates with that invoked process over the SSH connection. The file:// transport runs the 'upload-pack' or 'receive-pack' process locally and communicates with it over a pipe. +Extra Parameters +---------------- + +The protocol provides a mechanism in which clients can send additional +information in its first message to the server. These are called "Extra +Parameters", and are supported by the Git, SSH, and HTTP protocols. + +Each Extra Parameter takes the form of `<key>=<value>` or `<key>`. + +Servers that receive any such Extra Parameters MUST ignore all +unrecognized keys. Currently, the only Extra Parameter recognized is +"version=1". + Git Transport ------------- @@ -46,18 +59,25 @@ The Git transport starts off by sending the command and repository on the wire using the pkt-line format, followed by a NUL byte and a hostname parameter, terminated by a NUL byte. 
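The framing of this initial request can be sketched as follows. The helper names are hypothetical; the only protocol facts relied on are that a pkt-line starts with four hexadecimal digits giving the total length (including those four digits) and that the request fields are NUL-terminated as described here. The length check is a simplification that only ensures the prefix fits in four hex digits.

```python
def pkt_line(payload: bytes) -> bytes:
    """Prefix payload with its pkt-line length (4 hex digits, counting themselves)."""
    length = len(payload) + 4
    if length > 0xFFFF:
        raise ValueError("payload too large for a single pkt-line")
    return b"%04x" % length + payload


def git_proto_request(command, pathname, host, extra_parameters=()):
    """Build the first message a client sends over the Git transport."""
    payload = command.encode() + b" " + pathname.encode() + b"\0"
    payload += b"host=" + host.encode() + b"\0"
    if extra_parameters:
        # An additional NUL separates the host parameter from the
        # NUL-terminated Extra Parameters.
        payload += b"\0" + b"".join(p.encode() + b"\0" for p in extra_parameters)
    return pkt_line(payload)


plain = git_proto_request("git-upload-pack", "/project.git", "myserver.com")
with_version = git_proto_request(
    "git-upload-pack", "/project.git", "myserver.com", ["version=1"]
)
```

For these inputs the computed prefixes are `0033` without Extra Parameters and `003e` with `version=1`.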
- 0032git-upload-pack /project.git\0host=myserver.com\0 + 0033git-upload-pack /project.git\0host=myserver.com\0 + +The transport may send Extra Parameters by adding an additional NUL +byte, and then adding one or more NUL-terminated strings: + + 003egit-upload-pack /project.git\0host=myserver.com\0\0version=1\0 -- - git-proto-request = request-command SP pathname NUL [ host-parameter NUL ] + git-proto-request = request-command SP pathname NUL + [ host-parameter NUL ] [ NUL extra-parameters ] request-command = "git-upload-pack" / "git-receive-pack" / "git-upload-archive" ; case sensitive pathname = *( %x01-ff ) ; exclude NUL host-parameter = "host=" hostname [ ":" port ] + extra-parameters = 1*extra-parameter + extra-parameter = 1*( %x01-ff ) NUL -- -Only host-parameter is allowed in the git-proto-request. Clients -MUST NOT attempt to send additional parameters. It is used for the +host-parameter is used for the git-daemon name based virtual hosting. See --interpolated-path option to git daemon, with the %H/%CH format characters. @@ -117,6 +137,12 @@ we execute it without the leading '/'. v ssh user@example.com "git-upload-pack '~alice/project.git'" +Depending on the value of the `protocol.version` configuration variable, +Git may attempt to send Extra Parameters as a colon-separated string in +the GIT_PROTOCOL environment variable. This is done only if +the `ssh.variant` configuration variable indicates that the ssh command +supports passing environment variables as an argument. + A few things to remember here: - The "command name" is spelled with dash (e.g. 
git-upload-pack), but @@ -137,11 +163,13 @@ Reference Discovery ------------------- When the client initially connects the server will immediately respond -with a listing of each reference it has (all branches and tags) along +with a version number (if "version=1" is sent as an Extra Parameter), +and a listing of each reference it has (all branches and tags) along with the object name that each reference currently points to. - $ echo -e -n "0039git-upload-pack /schacon/gitbook.git\0host=example.com\0" | + $ echo -e -n "0044git-upload-pack /schacon/gitbook.git\0host=example.com\0\0version=1\0" | nc -v example.com 9418 + 000aversion 1 00887217a7c7e582c46cec22a130adf4b9d7d950fba0 HEAD\0multi_ack thin-pack side-band side-band-64k ofs-delta shallow no-progress include-tag 00441d3fcd5ced445d1abc402225c0b8a1299641f497 refs/heads/integration @@ -165,7 +193,8 @@ immediately after the ref itself, if presented. A conforming server MUST peel the ref if it's an annotated tag. ---- - advertised-refs = (no-refs / list-of-refs) + advertised-refs = *1("version 1") + (no-refs / list-of-refs) *shallow flush-pkt diff --git a/Documentation/technical/partial-clone.txt b/Documentation/technical/partial-clone.txt new file mode 100644 index 0000000000..0bed2472c8 --- /dev/null +++ b/Documentation/technical/partial-clone.txt @@ -0,0 +1,324 @@ +Partial Clone Design Notes +========================== + +The "Partial Clone" feature is a performance optimization for Git that +allows Git to function without having a complete copy of the repository. +The goal of this work is to allow Git to better handle extremely large +repositories. + +During clone and fetch operations, Git downloads the complete contents +and history of the repository. This includes all commits, trees, and +blobs for the complete life of the repository. For extremely large +repositories, clones can take hours (or days) and consume 100+GiB of disk +space. 
+ +Often in these repositories there are many blobs and trees that the user +does not need such as: + + 1. files outside of the user's work area in the tree. For example, in + a repository with 500K directories and 3.5M files in every commit, + we can avoid downloading many objects if the user only needs a + narrow "cone" of the source tree. + + 2. large binary assets. For example, in a repository where large build + artifacts are checked into the tree, we can avoid downloading all + previous versions of these non-mergeable binary assets and only + download versions that are actually referenced. + +Partial clone allows us to avoid downloading such unneeded objects *in +advance* during clone and fetch operations and thereby reduce download +times and disk usage. Missing objects can later be "demand fetched" +if/when needed. + +Use of partial clone requires that the user be online and the origin +remote be available for on-demand fetching of missing objects. This may +or may not be problematic for the user. For example, if the user can +stay within the pre-selected subset of the source tree, they may not +encounter any missing objects. Alternatively, the user could try to +pre-fetch various objects if they know that they are going offline. + + +Non-Goals +--------- + +Partial clone is a mechanism to limit the number of blobs and trees downloaded +*within* a given range of commits -- and is therefore independent of and not +intended to conflict with existing DAG-level mechanisms to limit the set of +requested commits (i.e. shallow clone, single branch, or fetch '<refspec>'). + + +Design Overview +--------------- + +Partial clone logically consists of the following parts: + +- A mechanism for the client to describe unneeded or unwanted objects to + the server. + +- A mechanism for the server to omit such unwanted objects from packfiles + sent to the client. 
+ +- A mechanism for the client to gracefully handle missing objects (that + were previously omitted by the server). + +- A mechanism for the client to backfill missing objects as needed. + + +Design Details +-------------- + +- A new pack-protocol capability "filter" is added to the fetch-pack and + upload-pack negotiation. + + This uses the existing capability discovery mechanism. + See "filter" in Documentation/technical/pack-protocol.txt. + +- Clients pass a "filter-spec" to clone and fetch which is passed to the + server to request filtering during packfile construction. + + There are various filters available to accommodate different situations. + See "--filter=<filter-spec>" in Documentation/rev-list-options.txt. + +- On the server pack-objects applies the requested filter-spec as it + creates "filtered" packfiles for the client. + + These filtered packfiles are *incomplete* in the traditional sense because + they may contain objects that reference objects not contained in the + packfile and that the client doesn't already have. For example, the + filtered packfile may contain trees or tags that reference missing blobs + or commits that reference missing trees. + +- On the client these incomplete packfiles are marked as "promisor packfiles" + and treated differently by various commands. + +- On the client a repository extension is added to the local config to + prevent older versions of git from failing mid-operation because of + missing objects that they cannot handle. + See "extensions.partialClone" in Documentation/technical/repository-version.txt. + + +Handling Missing Objects +------------------------ + +- An object may be missing due to a partial clone or fetch, or missing due + to repository corruption. To differentiate these cases, the local + repository specially indicates such filtered packfiles obtained from the + promisor remote as "promisor packfiles". 
+ + These promisor packfiles consist of a "<name>.promisor" file with + arbitrary contents (like the "<name>.keep" files), in addition to + their "<name>.pack" and "<name>.idx" files. + +- The local repository considers a "promisor object" to be an object that + it knows (to the best of its ability) that the promisor remote has promised + that it has, either because the local repository has that object in one of + its promisor packfiles, or because another promisor object refers to it. + + When Git encounters a missing object, Git can see if it is a promisor object + and handle it appropriately. If not, Git can report a corruption. + + This means that there is no need for the client to explicitly maintain an + expensive-to-modify list of missing objects.[a] + +- Since almost all Git code currently expects any referenced object to be + present locally and because we do not want to force every command to do + a dry-run first, a fallback mechanism is added to allow Git to attempt + to dynamically fetch missing objects from the promisor remote. + + When the normal object lookup fails to find an object, Git invokes + fetch-object to try to get the object from the server and then retry + the object lookup. This allows objects to be "faulted in" without + complicated prediction algorithms. + + For efficiency reasons, no check as to whether the missing object is + actually a promisor object is performed. + + Dynamic object fetching tends to be slow as objects are fetched one at + a time. + +- `checkout` (and any other command using `unpack-trees`) has been taught + to bulk pre-fetch all required missing blobs in a single batch. + +- `rev-list` has been taught to print missing objects. + + This can be used by other commands to bulk prefetch objects. + For example, a "git log -p A..B" may internally want to first do + something like "git rev-list --objects --quiet --missing=print A..B" + and prefetch those objects in bulk. 
+ +- `fsck` has been updated to be fully aware of promisor objects. + +- `repack` in GC has been updated to not touch promisor packfiles at all, + and to only repack other objects. + +- The global variable "fetch_if_missing" is used to control whether an + object lookup will attempt to dynamically fetch a missing object or + report an error. + + We are not happy with this global variable and would like to remove it, + but that requires significant refactoring of the object code to pass an + additional flag. We hope that concurrent efforts to add an ODB API can + encompass this. + + +Fetching Missing Objects +------------------------ + +- Fetching of objects is done using the existing transport mechanism using + transport_fetch_refs(), setting a new transport option + TRANS_OPT_NO_DEPENDENTS to indicate that only the objects themselves are + desired, not any object that they refer to. + + Because some transports invoke fetch_pack() in the same process, fetch_pack() + has been updated to not use any object flags when the corresponding argument + (no_dependents) is set. + +- The local repository sends a request with the hashes of all requested + objects as "want" lines, and does not perform any packfile negotiation. + It then receives a packfile. + +- Because we are reusing the existing fetch-pack mechanism, fetching + currently fetches all objects referred to by the requested objects, even + though they are not necessary. + + +Current Limitations +------------------- + +- The remote used for a partial clone (or the first partial fetch + following a regular clone) is marked as the "promisor remote". + + We are currently limited to a single promisor remote and only that + remote may be used for subsequent partial fetches. + + We accept this limitation because we believe initial users of this + feature will be using it on repositories with a strong single central + server. + +- Dynamic object fetching will only ask the promisor remote for missing + objects. 
We assume that the promisor remote has a complete view of the + repository and can satisfy all such requests. + +- Repack essentially treats promisor and non-promisor packfiles as 2 + distinct partitions and does not mix them. Repack currently only works + on non-promisor packfiles and loose objects. + +- Dynamic object fetching invokes fetch-pack once *for each item* + because most algorithms stumble upon a missing object and need to have + it resolved before continuing their work. This may incur significant + overhead -- and multiple authentication requests -- if many objects are + needed. + +- Dynamic object fetching currently uses the existing pack protocol V0 + which means that each object is requested via fetch-pack. The server + will send a full set of info/refs when the connection is established. + If there are a large number of refs, this may incur significant overhead. + + +Future Work +----------- + +- Allow more than one promisor remote and define a strategy for fetching + missing objects from specific promisor remotes or for iterating over the + set of promisor remotes until a missing object is found. + + A user might want to have multiple geographically-close cache servers + for fetching missing blobs while continuing to do filtered `git-fetch` + commands from the central server, for example. + + Or the user might want to work in a triangular work flow with multiple + promisor remotes that each have an incomplete view of the repository. + +- Allow repack to work on promisor packfiles (while keeping them distinct + from non-promisor packfiles). + +- Allow non-pathname-based filters to make use of packfile bitmaps (when + present). This was just an omission during the initial implementation. + +- Investigate use of a long-running process to dynamically fetch a series + of objects, such as proposed in [5,6] to reduce process startup and + overhead costs. 
+ + It would be nice if pack protocol V2 could allow that long-running + process to make a series of requests over a single long-running + connection. + +- Investigate pack protocol V2 to avoid the info/refs broadcast on + each connection with the server to dynamically fetch missing objects. + +- Investigate the need to handle loose promisor objects. + + Objects in promisor packfiles are allowed to reference missing objects + that can be dynamically fetched from the server. An assumption was + made that loose objects are only created locally and therefore should + not reference a missing object. We may need to revisit that assumption + if, for example, we dynamically fetch a missing tree and store it as a + loose object rather than a single object packfile. + + This does not necessarily mean we need to mark loose objects as promisor; + it may be sufficient to relax the object lookup or is-promisor functions. + + +Non-Tasks +--------- + +- Every time the subject of "demand loading blobs" comes up it seems + that someone suggests that the server be allowed to "guess" and send + additional objects that may be related to the requested objects. + + No work has gone into actually doing that; we're just documenting that + it is a common suggestion. We're not sure how it would work and have + no plans to work on it. + + It is valid for the server to send more objects than requested (even + for a dynamic object fetch), but we are not building on that. + + +Footnotes +--------- + +[a] expensive-to-modify list of missing objects: Earlier in the design of + partial clone we discussed the need for a single list of missing objects. + This would essentially be a sorted linear list of OIDs that were + omitted by the server during a clone or subsequent fetches. + + This file would need to be loaded into memory on every object lookup. 
+ It would need to be read, updated, and re-written (like the .git/index) + on every explicit "git fetch" command *and* on any dynamic object fetch. + + The cost to read, update, and write this file could add significant + overhead to every command if there are many missing objects. For example, + if there are 100M missing blobs, this file would be at least 2GiB on disk. + + With the "promisor" concept, we *infer* a missing object based upon the + type of packfile that references it. + + +Related Links +------------- +[0] https://bugs.chromium.org/p/git/issues/detail?id=2 + Chromium work item for: Partial Clone + +[1] https://public-inbox.org/git/20170113155253.1644-1-benpeart@microsoft.com/ + Subject: [RFC] Add support for downloading blobs on demand + Date: Fri, 13 Jan 2017 10:52:53 -0500 + +[2] https://public-inbox.org/git/cover.1506714999.git.jonathantanmy@google.com/ + Subject: [PATCH 00/18] Partial clone (from clone to lazy fetch in 18 patches) + Date: Fri, 29 Sep 2017 13:11:36 -0700 + +[3] https://public-inbox.org/git/20170426221346.25337-1-jonathantanmy@google.com/ + Subject: Proposal for missing blob support in Git repos + Date: Wed, 26 Apr 2017 15:13:46 -0700 + +[4] https://public-inbox.org/git/1488999039-37631-1-git-send-email-git@jeffhostetler.com/ + Subject: [PATCH 00/10] RFC Partial Clone and Fetch + Date: Wed, 8 Mar 2017 18:50:29 +0000 + +[5] https://public-inbox.org/git/20170505152802.6724-1-benpeart@microsoft.com/ + Subject: [PATCH v7 00/10] refactor the filter process code into a reusable module + Date: Fri, 5 May 2017 11:27:52 -0400 + +[6] https://public-inbox.org/git/20170714132651.170708-1-benpeart@microsoft.com/ + Subject: [RFC/PATCH v2 0/1] Add support for downloading blobs on demand + Date: Fri, 14 Jul 2017 09:26:50 -0400 diff --git a/Documentation/technical/protocol-v2.txt b/Documentation/technical/protocol-v2.txt new file mode 100644 index 0000000000..49bda76d23 --- /dev/null +++ b/Documentation/technical/protocol-v2.txt @@ -0,0 +1,414 @@ 
+ Git Wire Protocol, Version 2 +============================== + +This document presents a specification for a version 2 of Git's wire +protocol. Protocol v2 will improve upon v1 in the following ways: + + * Instead of multiple service names, multiple commands will be + supported by a single service + * Easily extendable as capabilities are moved into their own section + of the protocol, no longer being hidden behind a NUL byte and + limited by the size of a pkt-line + * Separate out other information hidden behind NUL bytes (e.g. agent + string as a capability and symrefs can be requested using 'ls-refs') + * Reference advertisement will be omitted unless explicitly requested + * ls-refs command to explicitly request some refs + * Designed with http and stateless-rpc in mind. With clear flush + semantics the http remote helper can simply act as a proxy + +In protocol v2 communication is command oriented. When first contacting a +server a list of capabilities will be advertised. Some of these capabilities +will be commands which a client can request be executed. Once a command +has completed, a client can reuse the connection and request that other +commands be executed. + + Packet-Line Framing +--------------------- + +All communication is done using packet-line framing, just as in v1. See +`Documentation/technical/pack-protocol.txt` and +`Documentation/technical/protocol-common.txt` for more information. + +In protocol v2 these special packets will have the following semantics: + + * '0000' Flush Packet (flush-pkt) - indicates the end of a message + * '0001' Delimiter Packet (delim-pkt) - separates sections of a message + + Initial Client Request +------------------------ + +In general a client can request to speak protocol v2 by sending +`version=2` through the respective side-channel for the transport being +used which inevitably sets `GIT_PROTOCOL`. More information can be +found in `pack-protocol.txt` and `http-protocol.txt`. 
In all cases the +response from the server is the capability advertisement. + + Git Transport +~~~~~~~~~~~~~~~ + +When using the git:// transport, you can request to use protocol v2 by +sending "version=2" as an extra parameter: + + 003egit-upload-pack /project.git\0host=myserver.com\0\0version=2\0 + + SSH and File Transport +~~~~~~~~~~~~~~~~~~~~~~~~ + +When using either the ssh:// or file:// transport, the GIT_PROTOCOL +environment variable must be set explicitly to include "version=2". + + HTTP Transport +~~~~~~~~~~~~~~~~ + +When using the http:// or https:// transport a client makes a "smart" +info/refs request as described in `http-protocol.txt` and requests that +v2 be used by supplying "version=2" in the `Git-Protocol` header. + + C: Git-Protocol: version=2 + C: + C: GET $GIT_URL/info/refs?service=git-upload-pack HTTP/1.0 + +A v2 server would reply: + + S: 200 OK + S: <Some headers> + S: ... + S: + S: 000eversion 2\n + S: <capability-advertisement> + +Subsequent requests are then made directly to the service +`$GIT_URL/git-upload-pack`. (This works the same for git-receive-pack). + + Capability Advertisement +-------------------------- + +A server which decides to communicate (based on a request from a client) +using protocol version 2, notifies the client by sending a version string +in its initial response followed by an advertisement of its capabilities. +Each capability is a key with an optional value. Clients must ignore all +unknown keys. Semantics of unknown values are left to the definition of +each key. Some capabilities will describe commands which can be requested +to be executed by the client. 
+ + capability-advertisement = protocol-version + capability-list + flush-pkt + + protocol-version = PKT-LINE("version 2" LF) + capability-list = *capability + capability = PKT-LINE(key[=value] LF) + + key = 1*(ALPHA | DIGIT | "-_") + value = 1*(ALPHA | DIGIT | " -_.,?\/{}[]()<>!@#$%^&*+=:;") + + Command Request +----------------- + +After receiving the capability advertisement, a client can then issue a +request to select the command it wants with any particular capabilities +or arguments. There is then an optional section where the client can +provide any command specific parameters or queries. Only a single +command can be requested at a time. + + request = empty-request | command-request + empty-request = flush-pkt + command-request = command + capability-list + [command-args] + flush-pkt + command = PKT-LINE("command=" key LF) + command-args = delim-pkt + *command-specific-arg + + command-specific-args are packet line framed arguments defined by + each individual command. + +The server will then check to ensure that the client's request is +comprised of a valid command as well as valid capabilities which were +advertised. If the request is valid the server will then execute the +command. A server MUST wait till it has received the client's entire +request before issuing a response. The format of the response is +determined by the command being executed, but in all cases a flush-pkt +indicates the end of the response. + +When a command has finished, and the client has received the entire +response from the server, a client can either request that another +command be executed or can terminate the connection. A client may +optionally send an empty request consisting of just a flush-pkt to +indicate that no more requests will be made. 
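The request grammar above can be exercised with a small amount of code. This is a rough Python sketch of how a client might frame a command request, not Git's own code; the helper names `pkt_line` and `command_request` are hypothetical. It assumes the usual pkt-line rule that the four hex digits count themselves plus the payload, which matches the `000eversion 2\n` line shown in the HTTP example.

```python
FLUSH_PKT = b"0000"  # flush-pkt: end of a message
DELIM_PKT = b"0001"  # delim-pkt: separates sections of a message


def pkt_line(payload: bytes) -> bytes:
    """Frame payload as a pkt-line: 4 hex digits of total length, then data."""
    return b"%04x" % (len(payload) + 4) + payload


def command_request(command: bytes, capabilities=(), args=()) -> bytes:
    """Build a command-request per the grammar above (hypothetical helper).

    command-request = command capability-list [command-args] flush-pkt
    """
    out = [pkt_line(b"command=" + command + b"\n")]
    out += [pkt_line(cap + b"\n") for cap in capabilities]
    if args:
        # command-args = delim-pkt *command-specific-arg
        out.append(DELIM_PKT)
        out += [pkt_line(arg + b"\n") for arg in args]
    out.append(FLUSH_PKT)
    return b"".join(out)


# Illustrative request: ls-refs with an agent capability and two arguments.
request = command_request(
    b"ls-refs",
    capabilities=[b"agent=git/1.8.3.1"],
    args=[b"peel", b"ref-prefix refs/heads/"],
)
```

An empty request, as described in the last paragraph of this section, is simply `FLUSH_PKT` on its own.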
+ + Capabilities +-------------- + +There are two different types of capabilities: normal capabilities, +which can be used to convey information or alter the behavior of a +request, and commands, which are the core actions that a client wants to +perform (fetch, push, etc). + +Protocol version 2 is stateless by default. This means that all commands +must only last a single round and be stateless from the perspective of the +server side, unless the client has requested a capability indicating that +state should be maintained by the server. Clients MUST NOT require state +management on the server side in order to function correctly. This +permits simple round-robin load-balancing on the server side, without +needing to worry about state management. + + agent +~~~~~~~ + +The server can advertise the `agent` capability with a value `X` (in the +form `agent=X`) to notify the client that the server is running version +`X`. The client may optionally send its own agent string by including +the `agent` capability with a value `Y` (in the form `agent=Y`) in its +request to the server (but it MUST NOT do so if the server did not +advertise the agent capability). The `X` and `Y` strings may contain any +printable ASCII characters except space (i.e., the byte range 32 < x < +127), and are typically of the form "package/version" (e.g., +"git/1.8.3.1"). The agent strings are purely informative for statistics +and debugging purposes, and MUST NOT be used to programmatically assume +the presence or absence of particular features. + + ls-refs +~~~~~~~~~ + +`ls-refs` is the command used to request a reference advertisement in v2. +Unlike the current reference advertisement, ls-refs takes in arguments +which can be used to limit the refs sent from the server. 
+ +Additional features not supported in the base command will be advertised +as the value of the command in the capability advertisement in the form +of a space separated list of features: "<command>=<feature 1> <feature 2>" + +ls-refs takes in the following arguments: + + symrefs + In addition to the object pointed by it, show the underlying ref + pointed by it when showing a symbolic ref. + peel + Show peeled tags. + ref-prefix <prefix> + When specified, only references having a prefix matching one of + the provided prefixes are displayed. + +The output of ls-refs is as follows: + + output = *ref + flush-pkt + ref = PKT-LINE(obj-id SP refname *(SP ref-attribute) LF) + ref-attribute = (symref | peeled) + symref = "symref-target:" symref-target + peeled = "peeled:" obj-id + + fetch +~~~~~~~ + +`fetch` is the command used to fetch a packfile in v2. It can be looked +at as a modified version of the v1 fetch where the ref-advertisement is +stripped out (since the `ls-refs` command fills that role) and the +message format is tweaked to eliminate redundancies and permit easy +addition of future extensions. + +Additional features not supported in the base command will be advertised +as the value of the command in the capability advertisement in the form +of a space separated list of features: "<command>=<feature 1> <feature 2>" + +A `fetch` request can take the following arguments: + + want <oid> + Indicates to the server an object which the client wants to + retrieve. Wants can be anything and are not limited to + advertised objects. + + have <oid> + Indicates to the server an object which the client has locally. + This allows the server to make a packfile which only contains + the objects that the client needs. Multiple 'have' lines can be + supplied. + + done + Indicates to the server that negotiation should terminate (or + not even begin if performing a clone) and that the server should + use the information supplied in the request to construct the + packfile. 
+ + thin-pack + Request that a thin pack be sent, which is a pack with deltas + which reference base objects not contained within the pack (but + are known to exist at the receiving end). This can reduce the + network traffic significantly, but it requires the receiving end + to know how to "thicken" these packs by adding the missing bases + to the pack. + + no-progress + Request that progress information that would normally be sent on + side-band channel 2, during the packfile transfer, should not be + sent. However, the side-band channel 3 is still used for error + responses. + + include-tag + Request that annotated tags should be sent if the objects they + point to are being sent. + + ofs-delta + Indicate that the client understands PACKv2 with delta referring + to its base by position in pack rather than by an oid. That is, + they can read OBJ_OFS_DELTA (aka type 6) in a packfile. + +If the 'shallow' feature is advertised the following arguments can be +included in the client's request as well as the potential addition of the +'shallow-info' section in the server's response as explained below. + + shallow <oid> + A client must notify the server of all commits for which it only + has shallow copies (meaning that it doesn't have the parents of + a commit) by supplying a 'shallow <oid>' line for each such + object so that the server is aware of the limitations of the + client's history. This is so that the server is aware that the + client may not have all objects reachable from such commits. + + deepen <depth> + Requests that the fetch/clone should be shallow having a commit + depth of <depth> relative to the remote side. + + deepen-relative + Requests that the semantics of the "deepen" command be changed + to indicate that the depth requested is relative to the client's + current shallow boundary, instead of relative to the requested + commits. 
+ + deepen-since <timestamp> + Requests that the shallow clone/fetch should be cut at a + specific time, instead of depth. Internally it's equivalent to + doing "git rev-list --max-age=<timestamp>". Cannot be used with + "deepen". + + deepen-not <rev> + Requests that the shallow clone/fetch should be cut at a + specific revision specified by '<rev>', instead of a depth. + Internally it's equivalent of doing "git rev-list --not <rev>". + Cannot be used with "deepen", but can be used with + "deepen-since". + +If the 'filter' feature is advertised, the following argument can be +included in the client's request: + + filter <filter-spec> + Request that various objects from the packfile be omitted + using one of several filtering techniques. These are intended + for use with partial clone and partial fetch operations. See + `rev-list` for possible "filter-spec" values. + +The response of `fetch` is broken into a number of sections separated by +delimiter packets (0001), with each section beginning with its section +header. + + output = *section + section = (acknowledgments | shallow-info | packfile) + (flush-pkt | delim-pkt) + + acknowledgments = PKT-LINE("acknowledgments" LF) + (nak | *ack) + (ready) + ready = PKT-LINE("ready" LF) + nak = PKT-LINE("NAK" LF) + ack = PKT-LINE("ACK" SP obj-id LF) + + shallow-info = PKT-LINE("shallow-info" LF) + *PKT-LINE((shallow | unshallow) LF) + shallow = "shallow" SP obj-id + unshallow = "unshallow" SP obj-id + + packfile = PKT-LINE("packfile" LF) + *PKT-LINE(%x01-03 *%x00-ff) + + acknowledgments section + * If the client determines that it is finished with negotiations + by sending a "done" line, the acknowledgments sections MUST be + omitted from the server's response. + + * Always begins with the section header "acknowledgments" + + * The server will respond with "NAK" if none of the object ids sent + as have lines were common. 
+ + * The server will respond with "ACK obj-id" for all of the + object ids sent as have lines which are common. + + * A response cannot have both "ACK" lines as well as a "NAK" + line. + + * The server will respond with a "ready" line indicating that + the server has found an acceptable common base and is ready to + make and send a packfile (which will be found in the packfile + section of the same response) + + * If the server has found a suitable cut point and has decided + to send a "ready" line, then the server can decide to (as an + optimization) omit any "ACK" lines it would have sent during + its response. This is because the server will have already + determined the objects it plans to send to the client and no + further negotiation is needed. + + shallow-info section + * If the client has requested a shallow fetch/clone, a shallow + client requests a fetch or the server is shallow then the + server's response may include a shallow-info section. The + shallow-info section will be included if (due to one of the + above conditions) the server needs to inform the client of any + shallow boundaries or adjustments to the client's already + existing shallow boundaries. + + * Always begins with the section header "shallow-info" + + * If a positive depth is requested, the server will compute the + set of commits which are no deeper than the desired depth. + + * The server sends a "shallow obj-id" line for each commit whose + parents will not be sent in the following packfile. + + * The server sends an "unshallow obj-id" line for each commit + which the client has indicated is shallow, but is no longer + shallow as a result of the fetch (due to its parents being + sent in the following packfile). + + * The server MUST NOT send any "unshallow" lines for anything + which the client has not indicated was shallow as a part of + its request. + + * This section is only included if a packfile section is also + included in the response. 
+ + packfile section + * This section is only included if the client has sent 'want' + lines in its request and either requested that no more + negotiation be done by sending 'done' or if the server has + decided it has found a sufficient cut point to produce a + packfile. + + * Always begins with the section header "packfile" + + * The transmission of the packfile begins immediately after the + section header + + * The data transfer of the packfile is always multiplexed, using + the same semantics of the 'side-band-64k' capability from + protocol version 1. This means that each packet, during the + packfile data stream, is made up of a leading 4-byte pkt-line + length (typical of the pkt-line format), followed by a 1-byte + stream code, followed by the actual data. + + The stream code can be one of: + 1 - pack data + 2 - progress messages + 3 - fatal error message just before stream aborts + + server-option +~~~~~~~~~~~~~~~ + +If advertised, indicates that any number of server specific options can be +included in a request. This is done by sending each option as a +"server-option=<option>" capability line in the capability-list section of +a request. + +The provided options must not contain a NUL or LF character. diff --git a/Documentation/technical/shallow.txt b/Documentation/technical/shallow.txt index 5183b15422..01dedfe9ff 100644 --- a/Documentation/technical/shallow.txt +++ b/Documentation/technical/shallow.txt @@ -8,20 +8,22 @@ repo, and therefore grafts are introduced pretending that these commits have no parents. ********************************************************* -The basic idea is to write the SHA-1s of shallow commits into -$GIT_DIR/shallow, and handle its contents like the contents -of $GIT_DIR/info/grafts (with the difference that shallow -cannot contain parent information). 
- -This information is stored in a new file instead of grafts, or -even the config, since the user should not touch that file -at all (even throughout development of the shallow clone, it -was never manually edited!). +$GIT_DIR/shallow lists commit object names and tells Git to +pretend as if they are root commits (e.g. "git log" traversal +stops after showing them; "git fsck" does not complain saying +the commits listed on their "parent" lines do not exist). Each line contains exactly one SHA-1. When read, a commit_graft will be constructed, which has nr_parent < 0 to make it easier to discern from user provided grafts. +Note that the shallow feature could not be changed easily to +use replace refs: a commit containing a `mergetag` is not allowed +to be replaced, not even by a root commit. Such a commit can be +made shallow, though. Also, having a `shallow` file explicitly +listing all the commits made shallow makes it a *lot* easier to +do shallow-specific things such as to deepen the history. + Since fsck-objects relies on the library to read the objects, it honours shallow commits automatically. |