diff options
Diffstat (limited to 'Documentation/technical')
-rw-r--r-- | Documentation/technical/api-argv-array.txt | 2 | ||||
-rw-r--r-- | Documentation/technical/api-decorate.txt | 6 | ||||
-rw-r--r-- | Documentation/technical/api-directory-listing.txt | 27 | ||||
-rw-r--r-- | Documentation/technical/api-string-list.txt | 209 | ||||
-rw-r--r-- | Documentation/technical/hash-function-transition.txt | 797 | ||||
-rw-r--r-- | Documentation/technical/http-protocol.txt | 8 | ||||
-rw-r--r-- | Documentation/technical/index-format.txt | 19 | ||||
-rw-r--r-- | Documentation/technical/pack-protocol.txt | 51 | ||||
-rw-r--r-- | Documentation/technical/partial-clone.txt | 324 | ||||
-rw-r--r-- | Documentation/technical/protocol-capabilities.txt | 8 | ||||
-rw-r--r-- | Documentation/technical/repository-version.txt | 12 |
11 files changed, 1236 insertions, 227 deletions
diff --git a/Documentation/technical/api-argv-array.txt b/Documentation/technical/api-argv-array.txt index cfc063018c..870c8edbfb 100644 --- a/Documentation/technical/api-argv-array.txt +++ b/Documentation/technical/api-argv-array.txt @@ -8,7 +8,7 @@ always NULL-terminated at the element pointed to by `argv[argc]`. This makes the result suitable for passing to functions expecting to receive argv from main(), or the link:api-run-command.html[run-command API]. -The link:api-string-list.html[string-list API] is similar, but cannot be +The string-list API (documented in string-list.h) is similar, but cannot be used for these purposes; instead of storing a straight string pointer, it contains an item structure with a `util` field that is not compatible with the traditional argv interface. diff --git a/Documentation/technical/api-decorate.txt b/Documentation/technical/api-decorate.txt deleted file mode 100644 index 1d52a6ce14..0000000000 --- a/Documentation/technical/api-decorate.txt +++ /dev/null @@ -1,6 +0,0 @@ -decorate API -============ - -Talk about <decorate.h> - -(Linus) diff --git a/Documentation/technical/api-directory-listing.txt b/Documentation/technical/api-directory-listing.txt index 6c77b4920c..7fae00f44f 100644 --- a/Documentation/technical/api-directory-listing.txt +++ b/Documentation/technical/api-directory-listing.txt @@ -22,16 +22,20 @@ The notable options are: `flags`:: - A bit-field of options (the `*IGNORED*` flags are mutually exclusive): + A bit-field of options: `DIR_SHOW_IGNORED`::: - Return just ignored files in `entries[]`, not untracked files. + Return just ignored files in `entries[]`, not untracked + files. This flag is mutually exclusive with + `DIR_SHOW_IGNORED_TOO`. `DIR_SHOW_IGNORED_TOO`::: - Similar to `DIR_SHOW_IGNORED`, but return ignored files in `ignored[]` - in addition to untracked files in `entries[]`. + Similar to `DIR_SHOW_IGNORED`, but return ignored files in + `ignored[]` in addition to untracked files in + `entries[]`. This flag is mutually exclusive with + `DIR_SHOW_IGNORED`. `DIR_KEEP_UNTRACKED_CONTENTS`::: @@ -39,6 +43,21 @@ The notable options are: untracked contents of untracked directories are also returned in `entries[]`. +`DIR_SHOW_IGNORED_TOO_MODE_MATCHING`::: + + Only has meaning if `DIR_SHOW_IGNORED_TOO` is also set; if + this is set, returns ignored files and directories that match + an exclude pattern. If a directory matches an exclude pattern, + then the directory is returned and the contained paths are + not. A directory that does not match an exclude pattern will + not be returned even if all of its contents are ignored. In + this case, the contents are returned as individual entries. ++ +If this is set, files and directories that explicity match an ignore +pattern are reported. Implicity ignored directories (directories that +do not match an ignore pattern, but whose contents are all ignored) +are not reported, instead all of the contents are reported. + `DIR_COLLECT_IGNORED`::: Special mode for git-add. Return ignored files in `ignored[]` and diff --git a/Documentation/technical/api-string-list.txt b/Documentation/technical/api-string-list.txt deleted file mode 100644 index c08402b12e..0000000000 --- a/Documentation/technical/api-string-list.txt +++ /dev/null @@ -1,209 +0,0 @@ -string-list API -=============== - -The string_list API offers a data structure and functions to handle -sorted and unsorted string lists. A "sorted" list is one whose -entries are sorted by string value in `strcmp()` order. - -The 'string_list' struct used to be called 'path_list', but was renamed -because it is not specific to paths. - -The caller: - -. Allocates and clears a `struct string_list` variable. - -. Initializes the members. You might want to set the flag `strdup_strings` - if the strings should be strdup()ed. For example, this is necessary - when you add something like git_path("..."), since that function returns - a static buffer that will change with the next call to git_path(). -+ -If you need something advanced, you can manually malloc() the `items` -member (you need this if you add things later) and you should set the -`nr` and `alloc` members in that case, too. - -. Adds new items to the list, using `string_list_append`, - `string_list_append_nodup`, `string_list_insert`, - `string_list_split`, and/or `string_list_split_in_place`. - -. Can check if a string is in the list using `string_list_has_string` or - `unsorted_string_list_has_string` and get it from the list using - `string_list_lookup` for sorted lists. - -. Can sort an unsorted list using `string_list_sort`. - -. Can remove duplicate items from a sorted list using - `string_list_remove_duplicates`. - -. Can remove individual items of an unsorted list using - `unsorted_string_list_delete_item`. - -. Can remove items not matching a criterion from a sorted or unsorted - list using `filter_string_list`, or remove empty strings using - `string_list_remove_empty_items`. - -. Finally it should free the list using `string_list_clear`. - -Example: - ----- -struct string_list list = STRING_LIST_INIT_NODUP; -int i; - -string_list_append(&list, "foo"); -string_list_append(&list, "bar"); -for (i = 0; i < list.nr; i++) - printf("%s\n", list.items[i].string) ----- - -NOTE: It is more efficient to build an unsorted list and sort it -afterwards, instead of building a sorted list (`O(n log n)` instead of -`O(n^2)`). -+ -However, if you use the list to check if a certain string was added -already, you should not do that (using unsorted_string_list_has_string()), -because the complexity would be quadratic again (but with a worse factor). - -Functions ---------- - -* General ones (works with sorted and unsorted lists as well) - -`string_list_init`:: - - Initialize the members of the string_list, set `strdup_strings` - member according to the value of the second parameter. - -`filter_string_list`:: - - Apply a function to each item in a list, retaining only the - items for which the function returns true. If free_util is - true, call free() on the util members of any items that have - to be deleted. Preserve the order of the items that are - retained. - -`string_list_remove_empty_items`:: - - Remove any empty strings from the list. If free_util is true, - call free() on the util members of any items that have to be - deleted. Preserve the order of the items that are retained. - -`print_string_list`:: - - Dump a string_list to stdout, useful mainly for debugging purposes. It - can take an optional header argument and it writes out the - string-pointer pairs of the string_list, each one in its own line. - -`string_list_clear`:: - - Free a string_list. The `string` pointer of the items will be freed in - case the `strdup_strings` member of the string_list is set. The second - parameter controls if the `util` pointer of the items should be freed - or not. - -* Functions for sorted lists only - -`string_list_has_string`:: - - Determine if the string_list has a given string or not. - -`string_list_insert`:: - - Insert a new element to the string_list. The returned pointer can be - handy if you want to write something to the `util` pointer of the - string_list_item containing the just added string. If the given - string already exists the insertion will be skipped and the - pointer to the existing item returned. -+ -Since this function uses xrealloc() (which die()s if it fails) if the -list needs to grow, it is safe not to check the pointer. I.e. you may -write `string_list_insert(...)->util = ...;`. - -`string_list_lookup`:: - - Look up a given string in the string_list, returning the containing - string_list_item. If the string is not found, NULL is returned. - -`string_list_remove_duplicates`:: - - Remove all but the first of consecutive entries that have the - same string value. If free_util is true, call free() on the - util members of any items that have to be deleted. - -* Functions for unsorted lists only - -`string_list_append`:: - - Append a new string to the end of the string_list. If - `strdup_string` is set, then the string argument is copied; - otherwise the new `string_list_entry` refers to the input - string. - -`string_list_append_nodup`:: - - Append a new string to the end of the string_list. The new - `string_list_entry` always refers to the input string, even if - `strdup_string` is set. This function can be used to hand - ownership of a malloc()ed string to a `string_list` that has - `strdup_string` set. - -`string_list_sort`:: - - Sort the list's entries by string value in `strcmp()` order. - -`unsorted_string_list_has_string`:: - - It's like `string_list_has_string()` but for unsorted lists. - -`unsorted_string_list_lookup`:: - - It's like `string_list_lookup()` but for unsorted lists. -+ -The above two functions need to look through all items, as opposed to their -counterpart for sorted lists, which performs a binary search. - -`unsorted_string_list_delete_item`:: - - Remove an item from a string_list. The `string` pointer of the items - will be freed in case the `strdup_strings` member of the string_list - is set. The third parameter controls if the `util` pointer of the - items should be freed or not. - -`string_list_split`:: -`string_list_split_in_place`:: - - Split a string into substrings on a delimiter character and - append the substrings to a `string_list`. If `maxsplit` is - non-negative, then split at most `maxsplit` times. Return the - number of substrings appended to the list. -+ -`string_list_split` requires a `string_list` that has `strdup_strings` -set to true; it leaves the input string untouched and makes copies of -the substrings in newly-allocated memory. -`string_list_split_in_place` requires a `string_list` that has -`strdup_strings` set to false; it splits the input string in place, -overwriting the delimiter characters with NULs and creating new -string_list_items that point into the original string (the original -string must therefore not be modified or freed while the `string_list` -is in use). - - -Data structures ---------------- - -* `struct string_list_item` - -Represents an item of the list. The `string` member is a pointer to the -string, and you may use the `util` member for any purpose, if you want. - -* `struct string_list` - -Represents the list itself. - -. The array of items are available via the `items` member. -. The `nr` member contains the number of items stored in the list. -. The `alloc` member is used to avoid reallocating at every insertion. - You should not tamper with it. -. Setting the `strdup_strings` member to 1 will strdup() the strings - before adding them, see above. -. The `compare_strings_fn` member is used to specify a custom compare - function, otherwise `strcmp()` is used as the default function. diff --git a/Documentation/technical/hash-function-transition.txt b/Documentation/technical/hash-function-transition.txt new file mode 100644 index 0000000000..417ba491d0 --- /dev/null +++ b/Documentation/technical/hash-function-transition.txt @@ -0,0 +1,797 @@ +Git hash function transition +============================ + +Objective +--------- +Migrate Git from SHA-1 to a stronger hash function. + +Background +---------- +At its core, the Git version control system is a content addressable +filesystem. It uses the SHA-1 hash function to name content. For +example, files, directories, and revisions are referred to by hash +values unlike in other traditional version control systems where files +or versions are referred to via sequential numbers. The use of a hash +function to address its content delivers a few advantages: + +* Integrity checking is easy. Bit flips, for example, are easily + detected, as the hash of corrupted content does not match its name. +* Lookup of objects is fast. + +Using a cryptographically secure hash function brings additional +advantages: + +* Object names can be signed and third parties can trust the hash to + address the signed object and all objects it references. +* Communication using Git protocol and out of band communication + methods have a short reliable string that can be used to reliably + address stored content. + +Over time some flaws in SHA-1 have been discovered by security +researchers. https://shattered.io demonstrated a practical SHA-1 hash +collision. As a result, SHA-1 cannot be considered cryptographically +secure any more. This impacts the communication of hash values because +we cannot trust that a given hash value represents the known good +version of content that the speaker intended. + +SHA-1 still possesses the other properties such as fast object lookup +and safe error checking, but other hash functions are equally suitable +that are believed to be cryptographically secure. + +Goals +----- +Where NewHash is a strong 256-bit hash function to replace SHA-1 (see +"Selection of a New Hash", below): + +1. The transition to NewHash can be done one local repository at a time. + a. Requiring no action by any other party. + b. A NewHash repository can communicate with SHA-1 Git servers + (push/fetch). + c. Users can use SHA-1 and NewHash identifiers for objects + interchangeably (see "Object names on the command line", below). + d. New signed objects make use of a stronger hash function than + SHA-1 for their security guarantees. +2. Allow a complete transition away from SHA-1. + a. Local metadata for SHA-1 compatibility can be removed from a + repository if compatibility with SHA-1 is no longer needed. +3. Maintainability throughout the process. + a. The object format is kept simple and consistent. + b. Creation of a generalized repository conversion tool. + +Non-Goals +--------- +1. Add NewHash support to Git protocol. This is valuable and the + logical next step but it is out of scope for this initial design. +2. Transparently improving the security of existing SHA-1 signed + objects. +3. Intermixing objects using multiple hash functions in a single + repository. +4. Taking the opportunity to fix other bugs in Git's formats and + protocols. +5. Shallow clones and fetches into a NewHash repository. (This will + change when we add NewHash support to Git protocol.) +6. Skip fetching some submodules of a project into a NewHash + repository. (This also depends on NewHash support in Git + protocol.) + +Overview +-------- +We introduce a new repository format extension. Repositories with this +extension enabled use NewHash instead of SHA-1 to name their objects. +This affects both object names and object content --- both the names +of objects and all references to other objects within an object are +switched to the new hash function. + +NewHash repositories cannot be read by older versions of Git. + +Alongside the packfile, a NewHash repository stores a bidirectional +mapping between NewHash and SHA-1 object names. The mapping is generated +locally and can be verified using "git fsck". Object lookups use this +mapping to allow naming objects using either their SHA-1 and NewHash names +interchangeably. + +"git cat-file" and "git hash-object" gain options to display an object +in its sha1 form and write an object given its sha1 form. This +requires all objects referenced by that object to be present in the +object database so that they can be named using the appropriate name +(using the bidirectional hash mapping). + +Fetches from a SHA-1 based server convert the fetched objects into +NewHash form and record the mapping in the bidirectional mapping table +(see below for details). Pushes to a SHA-1 based server convert the +objects being pushed into sha1 form so the server does not have to be +aware of the hash function the client is using. + +Detailed Design +--------------- +Repository format extension +~~~~~~~~~~~~~~~~~~~~~~~~~~~ +A NewHash repository uses repository format version `1` (see +Documentation/technical/repository-version.txt) with extensions +`objectFormat` and `compatObjectFormat`: + + [core] + repositoryFormatVersion = 1 + [extensions] + objectFormat = newhash + compatObjectFormat = sha1 + +Specifying a repository format extension ensures that versions of Git +not aware of NewHash do not try to operate on these repositories, +instead producing an error message: + + $ git status + fatal: unknown repository extensions found: + objectformat + compatobjectformat + +See the "Transition plan" section below for more details on these +repository extensions. + +Object names +~~~~~~~~~~~~ +Objects can be named by their 40 hexadecimal digit sha1-name or 64 +hexadecimal digit newhash-name, plus names derived from those (see +gitrevisions(7)). + +The sha1-name of an object is the SHA-1 of the concatenation of its +type, length, a nul byte, and the object's sha1-content. This is the +traditional <sha1> used in Git to name objects. + +The newhash-name of an object is the NewHash of the concatenation of its +type, length, a nul byte, and the object's newhash-content. + +Object format +~~~~~~~~~~~~~ +The content as a byte sequence of a tag, commit, or tree object named +by sha1 and newhash differ because an object named by newhash-name refers to +other objects by their newhash-names and an object named by sha1-name +refers to other objects by their sha1-names. + +The newhash-content of an object is the same as its sha1-content, except +that objects referenced by the object are named using their newhash-names +instead of sha1-names. Because a blob object does not refer to any +other object, its sha1-content and newhash-content are the same. + +The format allows round-trip conversion between newhash-content and +sha1-content. + +Object storage +~~~~~~~~~~~~~~ +Loose objects use zlib compression and packed objects use the packed +format described in Documentation/technical/pack-format.txt, just like +today. The content that is compressed and stored uses newhash-content +instead of sha1-content. + +Pack index +~~~~~~~~~~ +Pack index (.idx) files use a new v3 format that supports multiple +hash functions. They have the following format (all integers are in +network byte order): + +- A header appears at the beginning and consists of the following: + - The 4-byte pack index signature: '\377t0c' + - 4-byte version number: 3 + - 4-byte length of the header section, including the signature and + version number + - 4-byte number of objects contained in the pack + - 4-byte number of object formats in this pack index: 2 + - For each object format: + - 4-byte format identifier (e.g., 'sha1' for SHA-1) + - 4-byte length in bytes of shortened object names. This is the + shortest possible length needed to make names in the shortened + object name table unambiguous. + - 4-byte integer, recording where tables relating to this format + are stored in this index file, as an offset from the beginning. + - 4-byte offset to the trailer from the beginning of this file. + - Zero or more additional key/value pairs (4-byte key, 4-byte + value). Only one key is supported: 'PSRC'. See the "Loose objects + and unreachable objects" section for supported values and how this + is used. All other keys are reserved. Readers must ignore + unrecognized keys. +- Zero or more NUL bytes. This can optionally be used to improve the + alignment of the full object name table below. +- Tables for the first object format: + - A sorted table of shortened object names. These are prefixes of + the names of all objects in this pack file, packed together + without offset values to reduce the cache footprint of the binary + search for a specific object name. + + - A table of full object names in pack order. This allows resolving + a reference to "the nth object in the pack file" (from a + reachability bitmap or from the next table of another object + format) to its object name. + + - A table of 4-byte values mapping object name order to pack order. + For an object in the table of sorted shortened object names, the + value at the corresponding index in this table is the index in the + previous table for that same object. + + This can be used to look up the object in reachability bitmaps or + to look up its name in another object format. + + - A table of 4-byte CRC32 values of the packed object data, in the + order that the objects appear in the pack file. This is to allow + compressed data to be copied directly from pack to pack during + repacking without undetected data corruption. + + - A table of 4-byte offset values. For an object in the table of + sorted shortened object names, the value at the corresponding + index in this table indicates where that object can be found in + the pack file. These are usually 31-bit pack file offsets, but + large offsets are encoded as an index into the next table with the + most significant bit set. + + - A table of 8-byte offset entries (empty for pack files less than + 2 GiB). Pack files are organized with heavily used objects toward + the front, so most object references should not need to refer to + this table. +- Zero or more NUL bytes. +- Tables for the second object format, with the same layout as above, + up to and not including the table of CRC32 values. +- Zero or more NUL bytes. +- The trailer consists of the following: + - A copy of the 20-byte NewHash checksum at the end of the + corresponding packfile. + + - 20-byte NewHash checksum of all of the above. + +Loose object index +~~~~~~~~~~~~~~~~~~ +A new file $GIT_OBJECT_DIR/loose-object-idx contains information about +all loose objects. Its format is + + # loose-object-idx + (newhash-name SP sha1-name LF)* + +where the object names are in hexadecimal format. The file is not +sorted. + +The loose object index is protected against concurrent writes by a +lock file $GIT_OBJECT_DIR/loose-object-idx.lock. To add a new loose +object: + +1. Write the loose object to a temporary file, like today. +2. Open loose-object-idx.lock with O_CREAT | O_EXCL to acquire the lock. +3. Rename the loose object into place. +4. Open loose-object-idx with O_APPEND and write the new object +5. Unlink loose-object-idx.lock to release the lock. + +To remove entries (e.g. in "git pack-refs" or "git-prune"): + +1. Open loose-object-idx.lock with O_CREAT | O_EXCL to acquire the + lock. +2. Write the new content to loose-object-idx.lock. +3. Unlink any loose objects being removed. +4. Rename to replace loose-object-idx, releasing the lock. + +Translation table +~~~~~~~~~~~~~~~~~ +The index files support a bidirectional mapping between sha1-names +and newhash-names. The lookup proceeds similarly to ordinary object +lookups. For example, to convert a sha1-name to a newhash-name: + + 1. Look for the object in idx files. If a match is present in the + idx's sorted list of truncated sha1-names, then: + a. Read the corresponding entry in the sha1-name order to pack + name order mapping. + b. Read the corresponding entry in the full sha1-name table to + verify we found the right object. If it is, then + c. Read the corresponding entry in the full newhash-name table. + That is the object's newhash-name. + 2. Check for a loose object. Read lines from loose-object-idx until + we find a match. + +Step (1) takes the same amount of time as an ordinary object lookup: +O(number of packs * log(objects per pack)). Step (2) takes O(number of +loose objects) time. To maintain good performance it will be necessary +to keep the number of loose objects low. See the "Loose objects and +unreachable objects" section below for more details. + +Since all operations that make new objects (e.g., "git commit") add +the new objects to the corresponding index, this mapping is possible +for all objects in the object store. + +Reading an object's sha1-content +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ +The sha1-content of an object can be read by converting all newhash-names +its newhash-content references to sha1-names using the translation table. + +Fetch +~~~~~ +Fetching from a SHA-1 based server requires translating between SHA-1 +and NewHash based representations on the fly. + +SHA-1s named in the ref advertisement that are present on the client +can be translated to NewHash and looked up as local objects using the +translation table. + +Negotiation proceeds as today. Any "have"s generated locally are +converted to SHA-1 before being sent to the server, and SHA-1s +mentioned by the server are converted to NewHash when looking them up +locally. + +After negotiation, the server sends a packfile containing the +requested objects. We convert the packfile to NewHash format using +the following steps: + +1. index-pack: inflate each object in the packfile and compute its + SHA-1. Objects can contain deltas in OBJ_REF_DELTA format against + objects the client has locally. These objects can be looked up + using the translation table and their sha1-content read as + described above to resolve the deltas. +2. topological sort: starting at the "want"s from the negotiation + phase, walk through objects in the pack and emit a list of them, + excluding blobs, in reverse topologically sorted order, with each + object coming later in the list than all objects it references. + (This list only contains objects reachable from the "wants". If the + pack from the server contained additional extraneous objects, then + they will be discarded.) +3. convert to newhash: open a new (newhash) packfile. Read the topologically + sorted list just generated. For each object, inflate its + sha1-content, convert to newhash-content, and write it to the newhash + pack. Record the new sha1<->newhash mapping entry for use in the idx. +4. sort: reorder entries in the new pack to match the order of objects + in the pack the server generated and include blobs. Write a newhash idx + file +5. clean up: remove the SHA-1 based pack file, index, and + topologically sorted list obtained from the server in steps 1 + and 2. + +Step 3 requires every object referenced by the new object to be in the +translation table. This is why the topological sort step is necessary. + +As an optimization, step 1 could write a file describing what non-blob +objects each object it has inflated from the packfile references. This +makes the topological sort in step 2 possible without inflating the +objects in the packfile for a second time. The objects need to be +inflated again in step 3, for a total of two inflations. + +Step 4 is probably necessary for good read-time performance. "git +pack-objects" on the server optimizes the pack file for good data +locality (see Documentation/technical/pack-heuristics.txt). + +Details of this process are likely to change. It will take some +experimenting to get this to perform well. + +Push +~~~~ +Push is simpler than fetch because the objects referenced by the +pushed objects are already in the translation table. The sha1-content +of each object being pushed can be read as described in the "Reading +an object's sha1-content" section to generate the pack written by git +send-pack. + +Signed Commits +~~~~~~~~~~~~~~ +We add a new field "gpgsig-newhash" to the commit object format to allow +signing commits without relying on SHA-1. It is similar to the +existing "gpgsig" field. Its signed payload is the newhash-content of the +commit object with any "gpgsig" and "gpgsig-newhash" fields removed. + +This means commits can be signed +1. using SHA-1 only, as in existing signed commit objects +2. using both SHA-1 and NewHash, by using both gpgsig-newhash and gpgsig + fields. +3. using only NewHash, by only using the gpgsig-newhash field. + +Old versions of "git verify-commit" can verify the gpgsig signature in +cases (1) and (2) without modifications and view case (3) as an +ordinary unsigned commit. + +Signed Tags +~~~~~~~~~~~ +We add a new field "gpgsig-newhash" to the tag object format to allow +signing tags without relying on SHA-1. Its signed payload is the +newhash-content of the tag with its gpgsig-newhash field and "-----BEGIN PGP +SIGNATURE-----" delimited in-body signature removed. + +This means tags can be signed +1. using SHA-1 only, as in existing signed tag objects +2. using both SHA-1 and NewHash, by using gpgsig-newhash and an in-body + signature. +3. using only NewHash, by only using the gpgsig-newhash field. + +Mergetag embedding +~~~~~~~~~~~~~~~~~~ +The mergetag field in the sha1-content of a commit contains the +sha1-content of a tag that was merged by that commit. + +The mergetag field in the newhash-content of the same commit contains the +newhash-content of the same tag. + +Submodules +~~~~~~~~~~ +To convert recorded submodule pointers, you need to have the converted +submodule repository in place. The translation table of the submodule +can be used to look up the new hash. + +Loose objects and unreachable objects +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ +Fast lookups in the loose-object-idx require that the number of loose +objects not grow too high. + +"git gc --auto" currently waits for there to be 6700 loose objects +present before consolidating them into a packfile. We will need to +measure to find a more appropriate threshold for it to use. + +"git gc --auto" currently waits for there to be 50 packs present +before combining packfiles. Packing loose objects more aggressively +may cause the number of pack files to grow too quickly. This can be +mitigated by using a strategy similar to Martin Fick's exponential +rolling garbage collection script: +https://gerrit-review.googlesource.com/c/gerrit/+/35215 + +"git gc" currently expels any unreachable objects it encounters in +pack files to loose objects in an attempt to prevent a race when +pruning them (in case another process is simultaneously writing a new +object that refers to the about-to-be-deleted object). This leads to +an explosion in the number of loose objects present and disk space +usage due to the objects in delta form being replaced with independent +loose objects. Worse, the race is still present for loose objects. + +Instead, "git gc" will need to move unreachable objects to a new +packfile marked as UNREACHABLE_GARBAGE (using the PSRC field; see +below). To avoid the race when writing new objects referring to an +about-to-be-deleted object, code paths that write new objects will +need to copy any objects from UNREACHABLE_GARBAGE packs that they +refer to to new, non-UNREACHABLE_GARBAGE packs (or loose objects). +UNREACHABLE_GARBAGE are then safe to delete if their creation time (as +indicated by the file's mtime) is long enough ago. + +To avoid a proliferation of UNREACHABLE_GARBAGE packs, they can be +combined under certain circumstances. If "gc.garbageTtl" is set to +greater than one day, then packs created within a single calendar day, +UTC, can be coalesced together. The resulting packfile would have an +mtime before midnight on that day, so this makes the effective maximum +ttl the garbageTtl + 1 day. If "gc.garbageTtl" is less than one day, +then we divide the calendar day into intervals one-third of that ttl +in duration. Packs created within the same interval can be coalesced +together. The resulting packfile would have an mtime before the end of +the interval, so this makes the effective maximum ttl equal to the +garbageTtl * 4/3. + +This rule comes from Thirumala Reddy Mutchukota's JGit change +https://git.eclipse.org/r/90465. + +The UNREACHABLE_GARBAGE setting goes in the PSRC field of the pack +index. More generally, that field indicates where a pack came from: + + - 1 (PACK_SOURCE_RECEIVE) for a pack received over the network + - 2 (PACK_SOURCE_AUTO) for a pack created by a lightweight + "gc --auto" operation + - 3 (PACK_SOURCE_GC) for a pack created by a full gc + - 4 (PACK_SOURCE_UNREACHABLE_GARBAGE) for potential garbage + discovered by gc + - 5 (PACK_SOURCE_INSERT) for locally created objects that were + written directly to a pack file, e.g. from "git add ." + +This information can be useful for debugging and for "gc --auto" to +make appropriate choices about which packs to coalesce. + +Caveats +------- +Invalid objects +~~~~~~~~~~~~~~~ +The conversion from sha1-content to newhash-content retains any +brokenness in the original object (e.g., tree entry modes encoded with +leading 0, tree objects whose paths are not sorted correctly, and +commit objects without an author or committer). This is a deliberate +feature of the design to allow the conversion to round-trip. + +More profoundly broken objects (e.g., a commit with a truncated "tree" +header line) cannot be converted but were not usable by current Git +anyway. + +Shallow clone and submodules +~~~~~~~~~~~~~~~~~~~~~~~~~~~~ +Because it requires all referenced objects to be available in the +locally generated translation table, this design does not support +shallow clone or unfetched submodules. Protocol improvements might +allow lifting this restriction. + +Alternates +~~~~~~~~~~ +For the same reason, a newhash repository cannot borrow objects from a +sha1 repository using objects/info/alternates or +$GIT_ALTERNATE_OBJECT_REPOSITORIES. + +git notes +~~~~~~~~~ +The "git notes" tool annotates objects using their sha1-name as key. +This design does not describe a way to migrate notes trees to use +newhash-names. That migration is expected to happen separately (for +example using a file at the root of the notes tree to describe which +hash it uses). + +Server-side cost +~~~~~~~~~~~~~~~~ +Until Git protocol gains NewHash support, using NewHash based storage +on public-facing Git servers is strongly discouraged. Once Git +protocol gains NewHash support, NewHash based servers are likely not +to support SHA-1 compatibility, to avoid what may be a very expensive +hash reencode during clone and to encourage peers to modernize. + +The design described here allows fetches by SHA-1 clients of a +personal NewHash repository because it's not much more difficult than +allowing pushes from that repository. This support needs to be guarded +by a configuration option --- servers like git.kernel.org that serve a +large number of clients would not be expected to bear that cost. + +Meaning of signatures +~~~~~~~~~~~~~~~~~~~~~ +The signed payload for signed commits and tags does not explicitly +name the hash used to identify objects. If some day Git adopts a new +hash function with the same length as the current SHA-1 (40 +hexadecimal digit) or NewHash (64 hexadecimal digit) objects then the +intent behind the PGP signed payload in an object signature is +unclear: + + object e7e07d5a4fcc2a203d9873968ad3e6bd4d7419d7 + type commit + tag v2.12.0 + tagger Junio C Hamano <gitster@pobox.com> 1487962205 -0800 + + Git 2.12 + +Does this mean Git v2.12.0 is the commit with sha1-name +e7e07d5a4fcc2a203d9873968ad3e6bd4d7419d7 or the commit with +new-40-digit-hash-name e7e07d5a4fcc2a203d9873968ad3e6bd4d7419d7? + +Fortunately NewHash and SHA-1 have different lengths. If Git starts +using another hash with the same length to name objects, then it will +need to change the format of signed payloads using that hash to +address this issue. + +Object names on the command line +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ +To support the transition (see Transition plan below), this design +supports four different modes of operation: + + 1. ("dark launch") Treat object names input by the user as SHA-1 and + convert any object names written to output to SHA-1, but store + objects using NewHash. This allows users to test the code with no + visible behavior change except for performance. This allows + allows running even tests that assume the SHA-1 hash function, to + sanity-check the behavior of the new mode. + + 2. ("early transition") Allow both SHA-1 and NewHash object names in + input. Any object names written to output use SHA-1. This allows + users to continue to make use of SHA-1 to communicate with peers + (e.g. by email) that have not migrated yet and prepares for mode 3. + + 3. ("late transition") Allow both SHA-1 and NewHash object names in + input. Any object names written to output use NewHash. In this + mode, users are using a more secure object naming method by + default. The disruption is minimal as long as most of their peers + are in mode 2 or mode 3. + + 4. ("post-transition") Treat object names input by the user as + NewHash and write output using NewHash. This is safer than mode 3 + because there is less risk that input is incorrectly interpreted + using the wrong hash function. + +The mode is specified in configuration. + +The user can also explicitly specify which format to use for a +particular revision specifier and for output, overriding the mode. For +example: + +git --output-format=sha1 log abac87a^{sha1}..f787cac^{newhash} + +Selection of a New Hash +----------------------- +In early 2005, around the time that Git was written, Xiaoyun Wang, +Yiqun Lisa Yin, and Hongbo Yu announced an attack finding SHA-1 +collisions in 2^69 operations. In August they published details. +Luckily, no practical demonstrations of a collision in full SHA-1 were +published until 10 years later, in 2017. + +The hash function NewHash to replace SHA-1 should be stronger than +SHA-1 was: we would like it to be trustworthy and useful in practice +for at least 10 years. + +Some other relevant properties: + +1. A 256-bit hash (long enough to match common security practice; not + excessively long to hurt performance and disk usage). + +2. High quality implementations should be widely available (e.g. in + OpenSSL). + +3. The hash function's properties should match Git's needs (e.g. Git + requires collision and 2nd preimage resistance and does not require + length extension resistance). + +4. As a tiebreaker, the hash should be fast to compute (fortunately + many contenders are faster than SHA-1). + +Some hashes under consideration are SHA-256, SHA-512/256, SHA-256x16, +K12, and BLAKE2bp-256. + +Transition plan +--------------- +Some initial steps can be implemented independently of one another: +- adding a hash function API (vtable) +- teaching fsck to tolerate the gpgsig-newhash field +- excluding gpgsig-* from the fields copied by "git commit --amend" +- annotating tests that depend on SHA-1 values with a SHA1 test + prerequisite +- using "struct object_id", GIT_MAX_RAWSZ, and GIT_MAX_HEXSZ + consistently instead of "unsigned char *" and the hardcoded + constants 20 and 40. +- introducing index v3 +- adding support for the PSRC field and safer object pruning + + +The first user-visible change is the introduction of the objectFormat +extension (without compatObjectFormat). This requires: +- implementing the loose-object-idx +- teaching fsck about this mode of operation +- using the hash function API (vtable) when computing object names +- signing objects and verifying signatures +- rejecting attempts to fetch from or push to an incompatible + repository + +Next comes introduction of compatObjectFormat: +- translating object names between object formats +- translating object content between object formats +- generating and verifying signatures in the compat format +- adding appropriate index entries when adding a new object to the + object store +- --output-format option +- ^{sha1} and ^{newhash} revision notation +- configuration to specify default input and output format (see + "Object names on the command line" above) + +The next step is supporting fetches and pushes to SHA-1 repositories: +- allow pushes to a repository using the compat format +- generate a topologically sorted list of the SHA-1 names of fetched + objects +- convert the fetched packfile to newhash format and generate an idx + file +- re-sort to match the order of objects in the fetched packfile + +The infrastructure supporting fetch also allows converting an existing +repository. In converted repositories and new clones, end users can +gain support for the new hash function without any visible change in +behavior (see "dark launch" in the "Object names on the command line" +section). In particular this allows users to verify NewHash signatures +on objects in the repository, and it should ensure the transition code +is stable in production in preparation for using it more widely. + +Over time projects would encourage their users to adopt the "early +transition" and then "late transition" modes to take advantage of the +new, more futureproof NewHash object names. + +When objectFormat and compatObjectFormat are both set, commands +generating signatures would generate both SHA-1 and NewHash signatures +by default to support both new and old users. + +In projects using NewHash heavily, users could be encouraged to adopt +the "post-transition" mode to avoid accidentally making implicit use +of SHA-1 object names. + +Once a critical mass of users have upgraded to a version of Git that +can verify NewHash signatures and have converted their existing +repositories to support verifying them, we can add support for a +setting to generate only NewHash signatures. This is expected to be at +least a year later. + +That is also a good moment to advertise the ability to convert +repositories to use NewHash only, stripping out all SHA-1 related +metadata. This improves performance by eliminating translation +overhead and security by avoiding the possibility of accidentally +relying on the safety of SHA-1. + +Updating Git's protocols to allow a server to specify which hash +functions it supports is also an important part of this transition. It +is not discussed in detail in this document but this transition plan +assumes it happens. :) + +Alternatives considered +----------------------- +Upgrading everyone working on a particular project on a flag day +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ +Projects like the Linux kernel are large and complex enough that +flipping the switch for all projects based on the repository at once +is infeasible. + +Not only would all developers and server operators supporting +developers have to switch on the same flag day, but supporting tooling +(continuous integration, code review, bug trackers, etc) would have to +be adapted as well. This also makes it difficult to get early feedback +from some project participants testing before it is time for mass +adoption. + +Using hash functions in parallel +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ +(e.g. https://public-inbox.org/git/22708.8913.864049.452252@chiark.greenend.org.uk/ ) +Objects newly created would be addressed by the new hash, but inside +such an object (e.g. commit) it is still possible to address objects +using the old hash function. +* You cannot trust its history (needed for bisectability) in the + future without further work +* Maintenance burden as the number of supported hash functions grows + (they will never go away, so they accumulate). In this proposal, by + comparison, converted objects lose all references to SHA-1. + +Signed objects with multiple hashes +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ +Instead of introducing the gpgsig-newhash field in commit and tag objects +for newhash-content based signatures, an earlier version of this design +added "hash newhash <newhash-name>" fields to strengthen the existing +sha1-content based signatures. + +In other words, a single signature was used to attest to the object +content using both hash functions. This had some advantages: +* Using one signature instead of two speeds up the signing process. +* Having one signed payload with both hashes allows the signer to + attest to the sha1-name and newhash-name referring to the same object. +* All users consume the same signature. Broken signatures are likely + to be detected quickly using current versions of git. + +However, it also came with disadvantages: +* Verifying a signed object requires access to the sha1-names of all + objects it references, even after the transition is complete and + translation table is no longer needed for anything else. To support + this, the design added fields such as "hash sha1 tree <sha1-name>" + and "hash sha1 parent <sha1-name>" to the newhash-content of a signed + commit, complicating the conversion process. +* Allowing signed objects without a sha1 (for after the transition is + complete) complicated the design further, requiring a "nohash sha1" + field to suppress including "hash sha1" fields in the newhash-content + and signed payload. + +Lazily populated translation table +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ +Some of the work of building the translation table could be deferred to +push time, but that would significantly complicate and slow down pushes. +Calculating the sha1-name at object creation time at the same time it is +being streamed to disk and having its newhash-name calculated should be +an acceptable cost. + +Document History +---------------- + +2017-03-03 +bmwill@google.com, jonathantanmy@google.com, jrnieder@gmail.com, +sbeller@google.com + +Initial version sent to +http://public-inbox.org/git/20170304011251.GA26789@aiede.mtv.corp.google.com + +2017-03-03 jrnieder@gmail.com +Incorporated suggestions from jonathantanmy and sbeller: +* describe purpose of signed objects with each hash type +* redefine signed object verification using object content under the + first hash function + +2017-03-06 jrnieder@gmail.com +* Use SHA3-256 instead of SHA2 (thanks, Linus and brian m. carlson).[1][2] +* Make sha3-based signatures a separate field, avoiding the need for + "hash" and "nohash" fields (thanks to peff[3]). +* Add a sorting phase to fetch (thanks to Junio for noticing the need + for this). +* Omit blobs from the topological sort during fetch (thanks to peff). +* Discuss alternates, git notes, and git servers in the caveats + section (thanks to Junio Hamano, brian m. carlson[4], and Shawn + Pearce). +* Clarify language throughout (thanks to various commenters, + especially Junio). + +2017-09-27 jrnieder@gmail.com, sbeller@google.com +* use placeholder NewHash instead of SHA3-256 +* describe criteria for picking a hash function. +* include a transition plan (thanks especially to Brandon Williams + for fleshing these ideas out) +* define the translation table (thanks, Shawn Pearce[5], Jonathan + Tan, and Masaya Suzuki) +* avoid loose object overhead by packing more aggressively in + "git gc --auto" + +[1] http://public-inbox.org/git/CA+55aFzJtejiCjV0e43+9oR3QuJK2PiFiLQemytoLpyJWe6P9w@mail.gmail.com/ +[2] http://public-inbox.org/git/CA+55aFz+gkAsDZ24zmePQuEs1XPS9BP_s8O7Q4wQ7LV7X5-oDA@mail.gmail.com/ +[3] http://public-inbox.org/git/20170306084353.nrns455dvkdsfgo5@sigill.intra.peff.net/ +[4] http://public-inbox.org/git/20170304224936.rqqtkdvfjgyezsht@genre.crustytoothpaste.net +[5] https://public-inbox.org/git/CAJo=hJtoX9=AyLHHpUJS7fueV9ciZ_MNpnEPHUz8Whui6g9F0A@mail.gmail.com/ diff --git a/Documentation/technical/http-protocol.txt b/Documentation/technical/http-protocol.txt index 1c561bdd92..a0e45f2889 100644 --- a/Documentation/technical/http-protocol.txt +++ b/Documentation/technical/http-protocol.txt @@ -219,6 +219,10 @@ smart server reply: S: 003c2cb58b79488a98d2721cea644875a8dd0026b115 refs/tags/v1.0\n S: 003fa3c2e2402b99163d1d59756e5f207ae21cccba4c refs/tags/v1.0^{}\n +The client may send Extra Parameters (see +Documentation/technical/pack-protocol.txt) as a colon-separated string +in the Git-Protocol HTTP header. + Dumb Server Response ^^^^^^^^^^^^^^^^^^^^ Dumb servers MUST respond with the dumb server reply format. @@ -269,7 +273,11 @@ the C locale ordering. The stream SHOULD include the default ref named `HEAD` as the first ref. The stream MUST include capability declarations behind a NUL on the first ref. +The returned response contains "version 1" if "version=1" was sent as an +Extra Parameter. + smart_reply = PKT-LINE("# service=$servicename" LF) + *1("version 1") ref_list "0000" ref_list = empty_list / non_empty_list diff --git a/Documentation/technical/index-format.txt b/Documentation/technical/index-format.txt index ade0b0c445..db3572626b 100644 --- a/Documentation/technical/index-format.txt +++ b/Documentation/technical/index-format.txt @@ -295,3 +295,22 @@ The remaining data of each directory block is grouped by type: in the previous ewah bitmap. - One NUL. + +== File System Monitor cache + + The file system monitor cache tracks files for which the core.fsmonitor + hook has told us about changes. The signature for this extension is + { 'F', 'S', 'M', 'N' }. + + The extension starts with + + - 32-bit version number: the current supported version is 1. + + - 64-bit time: the extension data reflects all changes through the given + time which is stored as the nanoseconds elapsed since midnight, + January 1, 1970. + + - 32-bit bitmap size: the size of the CE_FSMONITOR_VALID bitmap. + + - An ewah bitmap, the n-th bit indicates whether the n-th index entry + is not CE_FSMONITOR_VALID. diff --git a/Documentation/technical/pack-protocol.txt b/Documentation/technical/pack-protocol.txt index ed1eae8b83..7fee6b780a 100644 --- a/Documentation/technical/pack-protocol.txt +++ b/Documentation/technical/pack-protocol.txt @@ -39,6 +39,19 @@ communicates with that invoked process over the SSH connection. The file:// transport runs the 'upload-pack' or 'receive-pack' process locally and communicates with it over a pipe. +Extra Parameters +---------------- + +The protocol provides a mechanism in which clients can send additional +information in its first message to the server. These are called "Extra +Parameters", and are supported by the Git, SSH, and HTTP protocols. + +Each Extra Parameter takes the form of `<key>=<value>` or `<key>`. + +Servers that receive any such Extra Parameters MUST ignore all +unrecognized keys. Currently, the only Extra Parameter recognized is +"version=1". + Git Transport ------------- @@ -46,18 +59,25 @@ The Git transport starts off by sending the command and repository on the wire using the pkt-line format, followed by a NUL byte and a hostname parameter, terminated by a NUL byte. - 0032git-upload-pack /project.git\0host=myserver.com\0 + 0033git-upload-pack /project.git\0host=myserver.com\0 + +The transport may send Extra Parameters by adding an additional NUL +byte, and then adding one or more NUL-terminated strings: + + 003egit-upload-pack /project.git\0host=myserver.com\0\0version=1\0 -- - git-proto-request = request-command SP pathname NUL [ host-parameter NUL ] + git-proto-request = request-command SP pathname NUL + [ host-parameter NUL ] [ NUL extra-parameters ] request-command = "git-upload-pack" / "git-receive-pack" / "git-upload-archive" ; case sensitive pathname = *( %x01-ff ) ; exclude NUL host-parameter = "host=" hostname [ ":" port ] + extra-parameters = 1*extra-parameter + extra-parameter = 1*( %x01-ff ) NUL -- -Only host-parameter is allowed in the git-proto-request. Clients -MUST NOT attempt to send additional parameters. It is used for the +host-parameter is used for the git-daemon name based virtual hosting. See --interpolated-path option to git daemon, with the %H/%CH format characters. @@ -117,6 +137,12 @@ we execute it without the leading '/'. v ssh user@example.com "git-upload-pack '~alice/project.git'" +Depending on the value of the `protocol.version` configuration variable, +Git may attempt to send Extra Parameters as a colon-separated string in +the GIT_PROTOCOL environment variable. This is done only if +the `ssh.variant` configuration variable indicates that the ssh command +supports passing environment variables as an argument. + A few things to remember here: - The "command name" is spelled with dash (e.g. git-upload-pack), but @@ -137,11 +163,13 @@ Reference Discovery ------------------- When the client initially connects the server will immediately respond -with a listing of each reference it has (all branches and tags) along +with a version number (if "version=1" is sent as an Extra Parameter), +and a listing of each reference it has (all branches and tags) along with the object name that each reference currently points to. - $ echo -e -n "0039git-upload-pack /schacon/gitbook.git\0host=example.com\0" | + $ echo -e -n "0044git-upload-pack /schacon/gitbook.git\0host=example.com\0\0version=1\0" | nc -v example.com 9418 + 000aversion 1 00887217a7c7e582c46cec22a130adf4b9d7d950fba0 HEAD\0multi_ack thin-pack side-band side-band-64k ofs-delta shallow no-progress include-tag 00441d3fcd5ced445d1abc402225c0b8a1299641f497 refs/heads/integration @@ -165,7 +193,8 @@ immediately after the ref itself, if presented. A conforming server MUST peel the ref if it's an annotated tag. ---- - advertised-refs = (no-refs / list-of-refs) + advertised-refs = *1("version 1") + (no-refs / list-of-refs) *shallow flush-pkt @@ -212,6 +241,7 @@ out of what the server said it could do with the first 'want' line. upload-request = want-list *shallow-line *1depth-request + [filter-request] flush-pkt want-list = first-want @@ -227,6 +257,8 @@ out of what the server said it could do with the first 'want' line. additional-want = PKT-LINE("want" SP obj-id) depth = 1*DIGIT + + filter-request = PKT-LINE("filter" SP filter-spec) ---- Clients MUST send all the obj-ids it wants from the reference @@ -249,6 +281,11 @@ complete those commits. Commits whose parents are not received as a result are defined as shallow and marked as such in the server. This information is sent back to the client in the next step. +The client can optionally request that pack-objects omit various +objects from the packfile using one of several filtering techniques. +These are intended for use with partial clone and partial fetch +operations. See `rev-list` for possible "filter-spec" values. + Once all the 'want's and 'shallow's (and optional 'deepen') are transferred, clients MUST send a flush-pkt, to tell the server side that it is done sending the list. diff --git a/Documentation/technical/partial-clone.txt b/Documentation/technical/partial-clone.txt new file mode 100644 index 0000000000..0bed2472c8 --- /dev/null +++ b/Documentation/technical/partial-clone.txt @@ -0,0 +1,324 @@ +Partial Clone Design Notes +========================== + +The "Partial Clone" feature is a performance optimization for Git that +allows Git to function without having a complete copy of the repository. +The goal of this work is to allow Git better handle extremely large +repositories. + +During clone and fetch operations, Git downloads the complete contents +and history of the repository. This includes all commits, trees, and +blobs for the complete life of the repository. For extremely large +repositories, clones can take hours (or days) and consume 100+GiB of disk +space. + +Often in these repositories there are many blobs and trees that the user +does not need such as: + + 1. files outside of the user's work area in the tree. For example, in + a repository with 500K directories and 3.5M files in every commit, + we can avoid downloading many objects if the user only needs a + narrow "cone" of the source tree. + + 2. large binary assets. For example, in a repository where large build + artifacts are checked into the tree, we can avoid downloading all + previous versions of these non-mergeable binary assets and only + download versions that are actually referenced. + +Partial clone allows us to avoid downloading such unneeded objects *in +advance* during clone and fetch operations and thereby reduce download +times and disk usage. Missing objects can later be "demand fetched" +if/when needed. + +Use of partial clone requires that the user be online and the origin +remote be available for on-demand fetching of missing objects. This may +or may not be problematic for the user. For example, if the user can +stay within the pre-selected subset of the source tree, they may not +encounter any missing objects. Alternatively, the user could try to +pre-fetch various objects if they know that they are going offline. + + +Non-Goals +--------- + +Partial clone is a mechanism to limit the number of blobs and trees downloaded +*within* a given range of commits -- and is therefore independent of and not +intended to conflict with existing DAG-level mechanisms to limit the set of +requested commits (i.e. shallow clone, single branch, or fetch '<refspec>'). + + +Design Overview +--------------- + +Partial clone logically consists of the following parts: + +- A mechanism for the client to describe unneeded or unwanted objects to + the server. + +- A mechanism for the server to omit such unwanted objects from packfiles + sent to the client. + +- A mechanism for the client to gracefully handle missing objects (that + were previously omitted by the server). + +- A mechanism for the client to backfill missing objects as needed. + + +Design Details +-------------- + +- A new pack-protocol capability "filter" is added to the fetch-pack and + upload-pack negotiation. + + This uses the existing capability discovery mechanism. + See "filter" in Documentation/technical/pack-protocol.txt. + +- Clients pass a "filter-spec" to clone and fetch which is passed to the + server to request filtering during packfile construction. + + There are various filters available to accommodate different situations. + See "--filter=<filter-spec>" in Documentation/rev-list-options.txt. + +- On the server pack-objects applies the requested filter-spec as it + creates "filtered" packfiles for the client. + + These filtered packfiles are *incomplete* in the traditional sense because + they may contain objects that reference objects not contained in the + packfile and that the client doesn't already have. For example, the + filtered packfile may contain trees or tags that reference missing blobs + or commits that reference missing trees. + +- On the client these incomplete packfiles are marked as "promisor packfiles" + and treated differently by various commands. + +- On the client a repository extension is added to the local config to + prevent older versions of git from failing mid-operation because of + missing objects that they cannot handle. + See "extensions.partialClone" in Documentation/technical/repository-version.txt" + + +Handling Missing Objects +------------------------ + +- An object may be missing due to a partial clone or fetch, or missing due + to repository corruption. To differentiate these cases, the local + repository specially indicates such filtered packfiles obtained from the + promisor remote as "promisor packfiles". + + These promisor packfiles consist of a "<name>.promisor" file with + arbitrary contents (like the "<name>.keep" files), in addition to + their "<name>.pack" and "<name>.idx" files. + +- The local repository considers a "promisor object" to be an object that + it knows (to the best of its ability) that the promisor remote has promised + that it has, either because the local repository has that object in one of + its promisor packfiles, or because another promisor object refers to it. + + When Git encounters a missing object, Git can see if it a promisor object + and handle it appropriately. If not, Git can report a corruption. + + This means that there is no need for the client to explicitly maintain an + expensive-to-modify list of missing objects.[a] + +- Since almost all Git code currently expects any referenced object to be + present locally and because we do not want to force every command to do + a dry-run first, a fallback mechanism is added to allow Git to attempt + to dynamically fetch missing objects from the promisor remote. + + When the normal object lookup fails to find an object, Git invokes + fetch-object to try to get the object from the server and then retry + the object lookup. This allows objects to be "faulted in" without + complicated prediction algorithms. + + For efficiency reasons, no check as to whether the missing object is + actually a promisor object is performed. + + Dynamic object fetching tends to be slow as objects are fetched one at + a time. + +- `checkout` (and any other command using `unpack-trees`) has been taught + to bulk pre-fetch all required missing blobs in a single batch. + +- `rev-list` has been taught to print missing objects. + + This can be used by other commands to bulk prefetch objects. + For example, a "git log -p A..B" may internally want to first do + something like "git rev-list --objects --quiet --missing=print A..B" + and prefetch those objects in bulk. + +- `fsck` has been updated to be fully aware of promisor objects. + +- `repack` in GC has been updated to not touch promisor packfiles at all, + and to only repack other objects. + +- The global variable "fetch_if_missing" is used to control whether an + object lookup will attempt to dynamically fetch a missing object or + report an error. + + We are not happy with this global variable and would like to remove it, + but that requires significant refactoring of the object code to pass an + additional flag. We hope that concurrent efforts to add an ODB API can + encompass this. + + +Fetching Missing Objects +------------------------ + +- Fetching of objects is done using the existing transport mechanism using + transport_fetch_refs(), setting a new transport option + TRANS_OPT_NO_DEPENDENTS to indicate that only the objects themselves are + desired, not any object that they refer to. + + Because some transports invoke fetch_pack() in the same process, fetch_pack() + has been updated to not use any object flags when the corresponding argument + (no_dependents) is set. + +- The local repository sends a request with the hashes of all requested + objects as "want" lines, and does not perform any packfile negotiation. + It then receives a packfile. + +- Because we are reusing the existing fetch-pack mechanism, fetching + currently fetches all objects referred to by the requested objects, even + though they are not necessary. + + +Current Limitations +------------------- + +- The remote used for a partial clone (or the first partial fetch + following a regular clone) is marked as the "promisor remote". + + We are currently limited to a single promisor remote and only that + remote may be used for subsequent partial fetches. + + We accept this limitation because we believe initial users of this + feature will be using it on repositories with a strong single central + server. + +- Dynamic object fetching will only ask the promisor remote for missing + objects. We assume that the promisor remote has a complete view of the + repository and can satisfy all such requests. + +- Repack essentially treats promisor and non-promisor packfiles as 2 + distinct partitions and does not mix them. Repack currently only works + on non-promisor packfiles and loose objects. + +- Dynamic object fetching invokes fetch-pack once *for each item* + because most algorithms stumble upon a missing object and need to have + it resolved before continuing their work. This may incur significant + overhead -- and multiple authentication requests -- if many objects are + needed. + +- Dynamic object fetching currently uses the existing pack protocol V0 + which means that each object is requested via fetch-pack. The server + will send a full set of info/refs when the connection is established. + If there are large number of refs, this may incur significant overhead. + + +Future Work +----------- + +- Allow more than one promisor remote and define a strategy for fetching + missing objects from specific promisor remotes or of iterating over the + set of promisor remotes until a missing object is found. + + A user might want to have multiple geographically-close cache servers + for fetching missing blobs while continuing to do filtered `git-fetch` + commands from the central server, for example. + + Or the user might want to work in a triangular work flow with multiple + promisor remotes that each have an incomplete view of the repository. + +- Allow repack to work on promisor packfiles (while keeping them distinct + from non-promisor packfiles). + +- Allow non-pathname-based filters to make use of packfile bitmaps (when + present). This was just an omission during the initial implementation. + +- Investigate use of a long-running process to dynamically fetch a series + of objects, such as proposed in [5,6] to reduce process startup and + overhead costs. + + It would be nice if pack protocol V2 could allow that long-running + process to make a series of requests over a single long-running + connection. + +- Investigate pack protocol V2 to avoid the info/refs broadcast on + each connection with the server to dynamically fetch missing objects. + +- Investigate the need to handle loose promisor objects. + + Objects in promisor packfiles are allowed to reference missing objects + that can be dynamically fetched from the server. An assumption was + made that loose objects are only created locally and therefore should + not reference a missing object. We may need to revisit that assumption + if, for example, we dynamically fetch a missing tree and store it as a + loose object rather than a single object packfile. + + This does not necessarily mean we need to mark loose objects as promisor; + it may be sufficient to relax the object lookup or is-promisor functions. + + +Non-Tasks +--------- + +- Every time the subject of "demand loading blobs" comes up it seems + that someone suggests that the server be allowed to "guess" and send + additional objects that may be related to the requested objects. + + No work has gone into actually doing that; we're just documenting that + it is a common suggestion. We're not sure how it would work and have + no plans to work on it. + + It is valid for the server to send more objects than requested (even + for a dynamic object fetch), but we are not building on that. + + +Footnotes +--------- + +[a] expensive-to-modify list of missing objects: Earlier in the design of + partial clone we discussed the need for a single list of missing objects. + This would essentially be a sorted linear list of OIDs that the were + omitted by the server during a clone or subsequent fetches. + + This file would need to be loaded into memory on every object lookup. + It would need to be read, updated, and re-written (like the .git/index) + on every explicit "git fetch" command *and* on any dynamic object fetch. + + The cost to read, update, and write this file could add significant + overhead to every command if there are many missing objects. For example, + if there are 100M missing blobs, this file would be at least 2GiB on disk. + + With the "promisor" concept, we *infer* a missing object based upon the + type of packfile that references it. + + +Related Links +------------- +[0] https://bugs.chromium.org/p/git/issues/detail?id=2 + Chromium work item for: Partial Clone + +[1] https://public-inbox.org/git/20170113155253.1644-1-benpeart@microsoft.com/ + Subject: [RFC] Add support for downloading blobs on demand + Date: Fri, 13 Jan 2017 10:52:53 -0500 + +[2] https://public-inbox.org/git/cover.1506714999.git.jonathantanmy@google.com/ + Subject: [PATCH 00/18] Partial clone (from clone to lazy fetch in 18 patches) + Date: Fri, 29 Sep 2017 13:11:36 -0700 + +[3] https://public-inbox.org/git/20170426221346.25337-1-jonathantanmy@google.com/ + Subject: Proposal for missing blob support in Git repos + Date: Wed, 26 Apr 2017 15:13:46 -0700 + +[4] https://public-inbox.org/git/1488999039-37631-1-git-send-email-git@jeffhostetler.com/ + Subject: [PATCH 00/10] RFC Partial Clone and Fetch + Date: Wed, 8 Mar 2017 18:50:29 +0000 + +[5] https://public-inbox.org/git/20170505152802.6724-1-benpeart@microsoft.com/ + Subject: [PATCH v7 00/10] refactor the filter process code into a reusable module + Date: Fri, 5 May 2017 11:27:52 -0400 + +[6] https://public-inbox.org/git/20170714132651.170708-1-benpeart@microsoft.com/ + Subject: [RFC/PATCH v2 0/1] Add support for downloading blobs on demand + Date: Fri, 14 Jul 2017 09:26:50 -0400 diff --git a/Documentation/technical/protocol-capabilities.txt b/Documentation/technical/protocol-capabilities.txt index 26dcc6f502..332d209b58 100644 --- a/Documentation/technical/protocol-capabilities.txt +++ b/Documentation/technical/protocol-capabilities.txt @@ -309,3 +309,11 @@ to accept a signed push certificate, and asks the <nonce> to be included in the push certificate. A send-pack client MUST NOT send a push-cert packet unless the receive-pack server advertises this capability. + +filter +------ + +If the upload-pack server advertises the 'filter' capability, +fetch-pack may send "filter" commands to request a partial clone +or partial fetch and request that the server omit various objects +from the packfile. diff --git a/Documentation/technical/repository-version.txt b/Documentation/technical/repository-version.txt index 00ad37986e..e03eaccebc 100644 --- a/Documentation/technical/repository-version.txt +++ b/Documentation/technical/repository-version.txt @@ -86,3 +86,15 @@ for testing format-1 compatibility. When the config key `extensions.preciousObjects` is set to `true`, objects in the repository MUST NOT be deleted (e.g., by `git-prune` or `git repack -d`). + +`partialclone` +~~~~~~~~~~~~~~ + +When the config key `extensions.partialclone` is set, it indicates +that the repo was created with a partial clone (or later performed +a partial fetch) and that the remote may have omitted sending +certain unwanted objects. Such a remote is called a "promisor remote" +and it promises that all such omitted objects can be fetched from it +in the future. + +The value of this key is the name of the promisor remote. |