summaryrefslogtreecommitdiff
path: root/Documentation/technical
diff options
context:
space:
mode:
Diffstat (limited to 'Documentation/technical')
-rw-r--r--Documentation/technical/api-argv-array.txt2
-rw-r--r--Documentation/technical/api-decorate.txt6
-rw-r--r--Documentation/technical/api-directory-listing.txt27
-rw-r--r--Documentation/technical/api-string-list.txt209
-rw-r--r--Documentation/technical/hash-function-transition.txt797
-rw-r--r--Documentation/technical/http-protocol.txt8
-rw-r--r--Documentation/technical/index-format.txt19
-rw-r--r--Documentation/technical/pack-protocol.txt51
-rw-r--r--Documentation/technical/partial-clone.txt324
-rw-r--r--Documentation/technical/protocol-capabilities.txt8
-rw-r--r--Documentation/technical/repository-version.txt12
11 files changed, 1236 insertions, 227 deletions
diff --git a/Documentation/technical/api-argv-array.txt b/Documentation/technical/api-argv-array.txt
index cfc063018c..870c8edbfb 100644
--- a/Documentation/technical/api-argv-array.txt
+++ b/Documentation/technical/api-argv-array.txt
@@ -8,7 +8,7 @@ always NULL-terminated at the element pointed to by `argv[argc]`. This
makes the result suitable for passing to functions expecting to receive
argv from main(), or the link:api-run-command.html[run-command API].
-The link:api-string-list.html[string-list API] is similar, but cannot be
+The string-list API (documented in string-list.h) is similar, but cannot be
used for these purposes; instead of storing a straight string pointer,
it contains an item structure with a `util` field that is not compatible
with the traditional argv interface.
diff --git a/Documentation/technical/api-decorate.txt b/Documentation/technical/api-decorate.txt
deleted file mode 100644
index 1d52a6ce14..0000000000
--- a/Documentation/technical/api-decorate.txt
+++ /dev/null
@@ -1,6 +0,0 @@
-decorate API
-============
-
-Talk about <decorate.h>
-
-(Linus)
diff --git a/Documentation/technical/api-directory-listing.txt b/Documentation/technical/api-directory-listing.txt
index 6c77b4920c..7fae00f44f 100644
--- a/Documentation/technical/api-directory-listing.txt
+++ b/Documentation/technical/api-directory-listing.txt
@@ -22,16 +22,20 @@ The notable options are:
`flags`::
- A bit-field of options (the `*IGNORED*` flags are mutually exclusive):
+ A bit-field of options:
`DIR_SHOW_IGNORED`:::
- Return just ignored files in `entries[]`, not untracked files.
+ Return just ignored files in `entries[]`, not untracked
+ files. This flag is mutually exclusive with
+ `DIR_SHOW_IGNORED_TOO`.
`DIR_SHOW_IGNORED_TOO`:::
- Similar to `DIR_SHOW_IGNORED`, but return ignored files in `ignored[]`
- in addition to untracked files in `entries[]`.
+ Similar to `DIR_SHOW_IGNORED`, but return ignored files in
+ `ignored[]` in addition to untracked files in
+ `entries[]`. This flag is mutually exclusive with
+ `DIR_SHOW_IGNORED`.
`DIR_KEEP_UNTRACKED_CONTENTS`:::
@@ -39,6 +43,21 @@ The notable options are:
untracked contents of untracked directories are also returned in
`entries[]`.
+`DIR_SHOW_IGNORED_TOO_MODE_MATCHING`:::
+
+ Only has meaning if `DIR_SHOW_IGNORED_TOO` is also set; if
+ this is set, returns ignored files and directories that match
+ an exclude pattern. If a directory matches an exclude pattern,
+ then the directory is returned and the contained paths are
+ not. A directory that does not match an exclude pattern will
+ not be returned even if all of its contents are ignored. In
+ this case, the contents are returned as individual entries.
++
+If this is set, files and directories that explicity match an ignore
+pattern are reported. Implicity ignored directories (directories that
+do not match an ignore pattern, but whose contents are all ignored)
+are not reported, instead all of the contents are reported.
+
`DIR_COLLECT_IGNORED`:::
Special mode for git-add. Return ignored files in `ignored[]` and
diff --git a/Documentation/technical/api-string-list.txt b/Documentation/technical/api-string-list.txt
deleted file mode 100644
index c08402b12e..0000000000
--- a/Documentation/technical/api-string-list.txt
+++ /dev/null
@@ -1,209 +0,0 @@
-string-list API
-===============
-
-The string_list API offers a data structure and functions to handle
-sorted and unsorted string lists. A "sorted" list is one whose
-entries are sorted by string value in `strcmp()` order.
-
-The 'string_list' struct used to be called 'path_list', but was renamed
-because it is not specific to paths.
-
-The caller:
-
-. Allocates and clears a `struct string_list` variable.
-
-. Initializes the members. You might want to set the flag `strdup_strings`
- if the strings should be strdup()ed. For example, this is necessary
- when you add something like git_path("..."), since that function returns
- a static buffer that will change with the next call to git_path().
-+
-If you need something advanced, you can manually malloc() the `items`
-member (you need this if you add things later) and you should set the
-`nr` and `alloc` members in that case, too.
-
-. Adds new items to the list, using `string_list_append`,
- `string_list_append_nodup`, `string_list_insert`,
- `string_list_split`, and/or `string_list_split_in_place`.
-
-. Can check if a string is in the list using `string_list_has_string` or
- `unsorted_string_list_has_string` and get it from the list using
- `string_list_lookup` for sorted lists.
-
-. Can sort an unsorted list using `string_list_sort`.
-
-. Can remove duplicate items from a sorted list using
- `string_list_remove_duplicates`.
-
-. Can remove individual items of an unsorted list using
- `unsorted_string_list_delete_item`.
-
-. Can remove items not matching a criterion from a sorted or unsorted
- list using `filter_string_list`, or remove empty strings using
- `string_list_remove_empty_items`.
-
-. Finally it should free the list using `string_list_clear`.
-
-Example:
-
-----
-struct string_list list = STRING_LIST_INIT_NODUP;
-int i;
-
-string_list_append(&list, "foo");
-string_list_append(&list, "bar");
-for (i = 0; i < list.nr; i++)
- printf("%s\n", list.items[i].string)
-----
-
-NOTE: It is more efficient to build an unsorted list and sort it
-afterwards, instead of building a sorted list (`O(n log n)` instead of
-`O(n^2)`).
-+
-However, if you use the list to check if a certain string was added
-already, you should not do that (using unsorted_string_list_has_string()),
-because the complexity would be quadratic again (but with a worse factor).
-
-Functions
----------
-
-* General ones (works with sorted and unsorted lists as well)
-
-`string_list_init`::
-
- Initialize the members of the string_list, set `strdup_strings`
- member according to the value of the second parameter.
-
-`filter_string_list`::
-
- Apply a function to each item in a list, retaining only the
- items for which the function returns true. If free_util is
- true, call free() on the util members of any items that have
- to be deleted. Preserve the order of the items that are
- retained.
-
-`string_list_remove_empty_items`::
-
- Remove any empty strings from the list. If free_util is true,
- call free() on the util members of any items that have to be
- deleted. Preserve the order of the items that are retained.
-
-`print_string_list`::
-
- Dump a string_list to stdout, useful mainly for debugging purposes. It
- can take an optional header argument and it writes out the
- string-pointer pairs of the string_list, each one in its own line.
-
-`string_list_clear`::
-
- Free a string_list. The `string` pointer of the items will be freed in
- case the `strdup_strings` member of the string_list is set. The second
- parameter controls if the `util` pointer of the items should be freed
- or not.
-
-* Functions for sorted lists only
-
-`string_list_has_string`::
-
- Determine if the string_list has a given string or not.
-
-`string_list_insert`::
-
- Insert a new element to the string_list. The returned pointer can be
- handy if you want to write something to the `util` pointer of the
- string_list_item containing the just added string. If the given
- string already exists the insertion will be skipped and the
- pointer to the existing item returned.
-+
-Since this function uses xrealloc() (which die()s if it fails) if the
-list needs to grow, it is safe not to check the pointer. I.e. you may
-write `string_list_insert(...)->util = ...;`.
-
-`string_list_lookup`::
-
- Look up a given string in the string_list, returning the containing
- string_list_item. If the string is not found, NULL is returned.
-
-`string_list_remove_duplicates`::
-
- Remove all but the first of consecutive entries that have the
- same string value. If free_util is true, call free() on the
- util members of any items that have to be deleted.
-
-* Functions for unsorted lists only
-
-`string_list_append`::
-
- Append a new string to the end of the string_list. If
- `strdup_string` is set, then the string argument is copied;
- otherwise the new `string_list_entry` refers to the input
- string.
-
-`string_list_append_nodup`::
-
- Append a new string to the end of the string_list. The new
- `string_list_entry` always refers to the input string, even if
- `strdup_string` is set. This function can be used to hand
- ownership of a malloc()ed string to a `string_list` that has
- `strdup_string` set.
-
-`string_list_sort`::
-
- Sort the list's entries by string value in `strcmp()` order.
-
-`unsorted_string_list_has_string`::
-
- It's like `string_list_has_string()` but for unsorted lists.
-
-`unsorted_string_list_lookup`::
-
- It's like `string_list_lookup()` but for unsorted lists.
-+
-The above two functions need to look through all items, as opposed to their
-counterpart for sorted lists, which performs a binary search.
-
-`unsorted_string_list_delete_item`::
-
- Remove an item from a string_list. The `string` pointer of the items
- will be freed in case the `strdup_strings` member of the string_list
- is set. The third parameter controls if the `util` pointer of the
- items should be freed or not.
-
-`string_list_split`::
-`string_list_split_in_place`::
-
- Split a string into substrings on a delimiter character and
- append the substrings to a `string_list`. If `maxsplit` is
- non-negative, then split at most `maxsplit` times. Return the
- number of substrings appended to the list.
-+
-`string_list_split` requires a `string_list` that has `strdup_strings`
-set to true; it leaves the input string untouched and makes copies of
-the substrings in newly-allocated memory.
-`string_list_split_in_place` requires a `string_list` that has
-`strdup_strings` set to false; it splits the input string in place,
-overwriting the delimiter characters with NULs and creating new
-string_list_items that point into the original string (the original
-string must therefore not be modified or freed while the `string_list`
-is in use).
-
-
-Data structures
----------------
-
-* `struct string_list_item`
-
-Represents an item of the list. The `string` member is a pointer to the
-string, and you may use the `util` member for any purpose, if you want.
-
-* `struct string_list`
-
-Represents the list itself.
-
-. The array of items are available via the `items` member.
-. The `nr` member contains the number of items stored in the list.
-. The `alloc` member is used to avoid reallocating at every insertion.
- You should not tamper with it.
-. Setting the `strdup_strings` member to 1 will strdup() the strings
- before adding them, see above.
-. The `compare_strings_fn` member is used to specify a custom compare
- function, otherwise `strcmp()` is used as the default function.
diff --git a/Documentation/technical/hash-function-transition.txt b/Documentation/technical/hash-function-transition.txt
new file mode 100644
index 0000000000..417ba491d0
--- /dev/null
+++ b/Documentation/technical/hash-function-transition.txt
@@ -0,0 +1,797 @@
+Git hash function transition
+============================
+
+Objective
+---------
+Migrate Git from SHA-1 to a stronger hash function.
+
+Background
+----------
+At its core, the Git version control system is a content addressable
+filesystem. It uses the SHA-1 hash function to name content. For
+example, files, directories, and revisions are referred to by hash
+values unlike in other traditional version control systems where files
+or versions are referred to via sequential numbers. The use of a hash
+function to address its content delivers a few advantages:
+
+* Integrity checking is easy. Bit flips, for example, are easily
+ detected, as the hash of corrupted content does not match its name.
+* Lookup of objects is fast.
+
+Using a cryptographically secure hash function brings additional
+advantages:
+
+* Object names can be signed and third parties can trust the hash to
+ address the signed object and all objects it references.
+* Communication using Git protocol and out of band communication
+ methods have a short reliable string that can be used to reliably
+ address stored content.
+
+Over time some flaws in SHA-1 have been discovered by security
+researchers. https://shattered.io demonstrated a practical SHA-1 hash
+collision. As a result, SHA-1 cannot be considered cryptographically
+secure any more. This impacts the communication of hash values because
+we cannot trust that a given hash value represents the known good
+version of content that the speaker intended.
+
+SHA-1 still possesses the other properties such as fast object lookup
+and safe error checking, but other hash functions are equally suitable
+that are believed to be cryptographically secure.
+
+Goals
+-----
+Where NewHash is a strong 256-bit hash function to replace SHA-1 (see
+"Selection of a New Hash", below):
+
+1. The transition to NewHash can be done one local repository at a time.
+ a. Requiring no action by any other party.
+ b. A NewHash repository can communicate with SHA-1 Git servers
+ (push/fetch).
+ c. Users can use SHA-1 and NewHash identifiers for objects
+ interchangeably (see "Object names on the command line", below).
+ d. New signed objects make use of a stronger hash function than
+ SHA-1 for their security guarantees.
+2. Allow a complete transition away from SHA-1.
+ a. Local metadata for SHA-1 compatibility can be removed from a
+ repository if compatibility with SHA-1 is no longer needed.
+3. Maintainability throughout the process.
+ a. The object format is kept simple and consistent.
+ b. Creation of a generalized repository conversion tool.
+
+Non-Goals
+---------
+1. Add NewHash support to Git protocol. This is valuable and the
+ logical next step but it is out of scope for this initial design.
+2. Transparently improving the security of existing SHA-1 signed
+ objects.
+3. Intermixing objects using multiple hash functions in a single
+ repository.
+4. Taking the opportunity to fix other bugs in Git's formats and
+ protocols.
+5. Shallow clones and fetches into a NewHash repository. (This will
+ change when we add NewHash support to Git protocol.)
+6. Skip fetching some submodules of a project into a NewHash
+ repository. (This also depends on NewHash support in Git
+ protocol.)
+
+Overview
+--------
+We introduce a new repository format extension. Repositories with this
+extension enabled use NewHash instead of SHA-1 to name their objects.
+This affects both object names and object content --- both the names
+of objects and all references to other objects within an object are
+switched to the new hash function.
+
+NewHash repositories cannot be read by older versions of Git.
+
+Alongside the packfile, a NewHash repository stores a bidirectional
+mapping between NewHash and SHA-1 object names. The mapping is generated
+locally and can be verified using "git fsck". Object lookups use this
+mapping to allow naming objects using either their SHA-1 and NewHash names
+interchangeably.
+
+"git cat-file" and "git hash-object" gain options to display an object
+in its sha1 form and write an object given its sha1 form. This
+requires all objects referenced by that object to be present in the
+object database so that they can be named using the appropriate name
+(using the bidirectional hash mapping).
+
+Fetches from a SHA-1 based server convert the fetched objects into
+NewHash form and record the mapping in the bidirectional mapping table
+(see below for details). Pushes to a SHA-1 based server convert the
+objects being pushed into sha1 form so the server does not have to be
+aware of the hash function the client is using.
+
+Detailed Design
+---------------
+Repository format extension
+~~~~~~~~~~~~~~~~~~~~~~~~~~~
+A NewHash repository uses repository format version `1` (see
+Documentation/technical/repository-version.txt) with extensions
+`objectFormat` and `compatObjectFormat`:
+
+ [core]
+ repositoryFormatVersion = 1
+ [extensions]
+ objectFormat = newhash
+ compatObjectFormat = sha1
+
+Specifying a repository format extension ensures that versions of Git
+not aware of NewHash do not try to operate on these repositories,
+instead producing an error message:
+
+ $ git status
+ fatal: unknown repository extensions found:
+ objectformat
+ compatobjectformat
+
+See the "Transition plan" section below for more details on these
+repository extensions.
+
+Object names
+~~~~~~~~~~~~
+Objects can be named by their 40 hexadecimal digit sha1-name or 64
+hexadecimal digit newhash-name, plus names derived from those (see
+gitrevisions(7)).
+
+The sha1-name of an object is the SHA-1 of the concatenation of its
+type, length, a nul byte, and the object's sha1-content. This is the
+traditional <sha1> used in Git to name objects.
+
+The newhash-name of an object is the NewHash of the concatenation of its
+type, length, a nul byte, and the object's newhash-content.
+
+Object format
+~~~~~~~~~~~~~
+The content as a byte sequence of a tag, commit, or tree object named
+by sha1 and newhash differ because an object named by newhash-name refers to
+other objects by their newhash-names and an object named by sha1-name
+refers to other objects by their sha1-names.
+
+The newhash-content of an object is the same as its sha1-content, except
+that objects referenced by the object are named using their newhash-names
+instead of sha1-names. Because a blob object does not refer to any
+other object, its sha1-content and newhash-content are the same.
+
+The format allows round-trip conversion between newhash-content and
+sha1-content.
+
+Object storage
+~~~~~~~~~~~~~~
+Loose objects use zlib compression and packed objects use the packed
+format described in Documentation/technical/pack-format.txt, just like
+today. The content that is compressed and stored uses newhash-content
+instead of sha1-content.
+
+Pack index
+~~~~~~~~~~
+Pack index (.idx) files use a new v3 format that supports multiple
+hash functions. They have the following format (all integers are in
+network byte order):
+
+- A header appears at the beginning and consists of the following:
+ - The 4-byte pack index signature: '\377t0c'
+ - 4-byte version number: 3
+ - 4-byte length of the header section, including the signature and
+ version number
+ - 4-byte number of objects contained in the pack
+ - 4-byte number of object formats in this pack index: 2
+ - For each object format:
+ - 4-byte format identifier (e.g., 'sha1' for SHA-1)
+ - 4-byte length in bytes of shortened object names. This is the
+ shortest possible length needed to make names in the shortened
+ object name table unambiguous.
+ - 4-byte integer, recording where tables relating to this format
+ are stored in this index file, as an offset from the beginning.
+ - 4-byte offset to the trailer from the beginning of this file.
+ - Zero or more additional key/value pairs (4-byte key, 4-byte
+ value). Only one key is supported: 'PSRC'. See the "Loose objects
+ and unreachable objects" section for supported values and how this
+ is used. All other keys are reserved. Readers must ignore
+ unrecognized keys.
+- Zero or more NUL bytes. This can optionally be used to improve the
+ alignment of the full object name table below.
+- Tables for the first object format:
+ - A sorted table of shortened object names. These are prefixes of
+ the names of all objects in this pack file, packed together
+ without offset values to reduce the cache footprint of the binary
+ search for a specific object name.
+
+ - A table of full object names in pack order. This allows resolving
+ a reference to "the nth object in the pack file" (from a
+ reachability bitmap or from the next table of another object
+ format) to its object name.
+
+ - A table of 4-byte values mapping object name order to pack order.
+ For an object in the table of sorted shortened object names, the
+ value at the corresponding index in this table is the index in the
+ previous table for that same object.
+
+ This can be used to look up the object in reachability bitmaps or
+ to look up its name in another object format.
+
+ - A table of 4-byte CRC32 values of the packed object data, in the
+ order that the objects appear in the pack file. This is to allow
+ compressed data to be copied directly from pack to pack during
+ repacking without undetected data corruption.
+
+ - A table of 4-byte offset values. For an object in the table of
+ sorted shortened object names, the value at the corresponding
+ index in this table indicates where that object can be found in
+ the pack file. These are usually 31-bit pack file offsets, but
+ large offsets are encoded as an index into the next table with the
+ most significant bit set.
+
+ - A table of 8-byte offset entries (empty for pack files less than
+ 2 GiB). Pack files are organized with heavily used objects toward
+ the front, so most object references should not need to refer to
+ this table.
+- Zero or more NUL bytes.
+- Tables for the second object format, with the same layout as above,
+ up to and not including the table of CRC32 values.
+- Zero or more NUL bytes.
+- The trailer consists of the following:
+ - A copy of the 20-byte NewHash checksum at the end of the
+ corresponding packfile.
+
+ - 20-byte NewHash checksum of all of the above.
+
+Loose object index
+~~~~~~~~~~~~~~~~~~
+A new file $GIT_OBJECT_DIR/loose-object-idx contains information about
+all loose objects. Its format is
+
+ # loose-object-idx
+ (newhash-name SP sha1-name LF)*
+
+where the object names are in hexadecimal format. The file is not
+sorted.
+
+The loose object index is protected against concurrent writes by a
+lock file $GIT_OBJECT_DIR/loose-object-idx.lock. To add a new loose
+object:
+
+1. Write the loose object to a temporary file, like today.
+2. Open loose-object-idx.lock with O_CREAT | O_EXCL to acquire the lock.
+3. Rename the loose object into place.
+4. Open loose-object-idx with O_APPEND and write the new object
+5. Unlink loose-object-idx.lock to release the lock.
+
+To remove entries (e.g. in "git pack-refs" or "git-prune"):
+
+1. Open loose-object-idx.lock with O_CREAT | O_EXCL to acquire the
+ lock.
+2. Write the new content to loose-object-idx.lock.
+3. Unlink any loose objects being removed.
+4. Rename to replace loose-object-idx, releasing the lock.
+
+Translation table
+~~~~~~~~~~~~~~~~~
+The index files support a bidirectional mapping between sha1-names
+and newhash-names. The lookup proceeds similarly to ordinary object
+lookups. For example, to convert a sha1-name to a newhash-name:
+
+ 1. Look for the object in idx files. If a match is present in the
+ idx's sorted list of truncated sha1-names, then:
+ a. Read the corresponding entry in the sha1-name order to pack
+ name order mapping.
+ b. Read the corresponding entry in the full sha1-name table to
+ verify we found the right object. If it is, then
+ c. Read the corresponding entry in the full newhash-name table.
+ That is the object's newhash-name.
+ 2. Check for a loose object. Read lines from loose-object-idx until
+ we find a match.
+
+Step (1) takes the same amount of time as an ordinary object lookup:
+O(number of packs * log(objects per pack)). Step (2) takes O(number of
+loose objects) time. To maintain good performance it will be necessary
+to keep the number of loose objects low. See the "Loose objects and
+unreachable objects" section below for more details.
+
+Since all operations that make new objects (e.g., "git commit") add
+the new objects to the corresponding index, this mapping is possible
+for all objects in the object store.
+
+Reading an object's sha1-content
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+The sha1-content of an object can be read by converting all newhash-names
+its newhash-content references to sha1-names using the translation table.
+
+Fetch
+~~~~~
+Fetching from a SHA-1 based server requires translating between SHA-1
+and NewHash based representations on the fly.
+
+SHA-1s named in the ref advertisement that are present on the client
+can be translated to NewHash and looked up as local objects using the
+translation table.
+
+Negotiation proceeds as today. Any "have"s generated locally are
+converted to SHA-1 before being sent to the server, and SHA-1s
+mentioned by the server are converted to NewHash when looking them up
+locally.
+
+After negotiation, the server sends a packfile containing the
+requested objects. We convert the packfile to NewHash format using
+the following steps:
+
+1. index-pack: inflate each object in the packfile and compute its
+ SHA-1. Objects can contain deltas in OBJ_REF_DELTA format against
+ objects the client has locally. These objects can be looked up
+ using the translation table and their sha1-content read as
+ described above to resolve the deltas.
+2. topological sort: starting at the "want"s from the negotiation
+ phase, walk through objects in the pack and emit a list of them,
+ excluding blobs, in reverse topologically sorted order, with each
+ object coming later in the list than all objects it references.
+ (This list only contains objects reachable from the "wants". If the
+ pack from the server contained additional extraneous objects, then
+ they will be discarded.)
+3. convert to newhash: open a new (newhash) packfile. Read the topologically
+ sorted list just generated. For each object, inflate its
+ sha1-content, convert to newhash-content, and write it to the newhash
+ pack. Record the new sha1<->newhash mapping entry for use in the idx.
+4. sort: reorder entries in the new pack to match the order of objects
+ in the pack the server generated and include blobs. Write a newhash idx
+ file
+5. clean up: remove the SHA-1 based pack file, index, and
+ topologically sorted list obtained from the server in steps 1
+ and 2.
+
+Step 3 requires every object referenced by the new object to be in the
+translation table. This is why the topological sort step is necessary.
+
+As an optimization, step 1 could write a file describing what non-blob
+objects each object it has inflated from the packfile references. This
+makes the topological sort in step 2 possible without inflating the
+objects in the packfile for a second time. The objects need to be
+inflated again in step 3, for a total of two inflations.
+
+Step 4 is probably necessary for good read-time performance. "git
+pack-objects" on the server optimizes the pack file for good data
+locality (see Documentation/technical/pack-heuristics.txt).
+
+Details of this process are likely to change. It will take some
+experimenting to get this to perform well.
+
+Push
+~~~~
+Push is simpler than fetch because the objects referenced by the
+pushed objects are already in the translation table. The sha1-content
+of each object being pushed can be read as described in the "Reading
+an object's sha1-content" section to generate the pack written by git
+send-pack.
+
+Signed Commits
+~~~~~~~~~~~~~~
+We add a new field "gpgsig-newhash" to the commit object format to allow
+signing commits without relying on SHA-1. It is similar to the
+existing "gpgsig" field. Its signed payload is the newhash-content of the
+commit object with any "gpgsig" and "gpgsig-newhash" fields removed.
+
+This means commits can be signed
+1. using SHA-1 only, as in existing signed commit objects
+2. using both SHA-1 and NewHash, by using both gpgsig-newhash and gpgsig
+ fields.
+3. using only NewHash, by only using the gpgsig-newhash field.
+
+Old versions of "git verify-commit" can verify the gpgsig signature in
+cases (1) and (2) without modifications and view case (3) as an
+ordinary unsigned commit.
+
+Signed Tags
+~~~~~~~~~~~
+We add a new field "gpgsig-newhash" to the tag object format to allow
+signing tags without relying on SHA-1. Its signed payload is the
+newhash-content of the tag with its gpgsig-newhash field and "-----BEGIN PGP
+SIGNATURE-----" delimited in-body signature removed.
+
+This means tags can be signed
+1. using SHA-1 only, as in existing signed tag objects
+2. using both SHA-1 and NewHash, by using gpgsig-newhash and an in-body
+ signature.
+3. using only NewHash, by only using the gpgsig-newhash field.
+
+Mergetag embedding
+~~~~~~~~~~~~~~~~~~
+The mergetag field in the sha1-content of a commit contains the
+sha1-content of a tag that was merged by that commit.
+
+The mergetag field in the newhash-content of the same commit contains the
+newhash-content of the same tag.
+
+Submodules
+~~~~~~~~~~
+To convert recorded submodule pointers, you need to have the converted
+submodule repository in place. The translation table of the submodule
+can be used to look up the new hash.
+
+Loose objects and unreachable objects
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+Fast lookups in the loose-object-idx require that the number of loose
+objects not grow too high.
+
+"git gc --auto" currently waits for there to be 6700 loose objects
+present before consolidating them into a packfile. We will need to
+measure to find a more appropriate threshold for it to use.
+
+"git gc --auto" currently waits for there to be 50 packs present
+before combining packfiles. Packing loose objects more aggressively
+may cause the number of pack files to grow too quickly. This can be
+mitigated by using a strategy similar to Martin Fick's exponential
+rolling garbage collection script:
+https://gerrit-review.googlesource.com/c/gerrit/+/35215
+
+"git gc" currently expels any unreachable objects it encounters in
+pack files to loose objects in an attempt to prevent a race when
+pruning them (in case another process is simultaneously writing a new
+object that refers to the about-to-be-deleted object). This leads to
+an explosion in the number of loose objects present and disk space
+usage due to the objects in delta form being replaced with independent
+loose objects. Worse, the race is still present for loose objects.
+
+Instead, "git gc" will need to move unreachable objects to a new
+packfile marked as UNREACHABLE_GARBAGE (using the PSRC field; see
+below). To avoid the race when writing new objects referring to an
+about-to-be-deleted object, code paths that write new objects will
+need to copy any objects from UNREACHABLE_GARBAGE packs that they
+refer to to new, non-UNREACHABLE_GARBAGE packs (or loose objects).
+UNREACHABLE_GARBAGE are then safe to delete if their creation time (as
+indicated by the file's mtime) is long enough ago.
+
+To avoid a proliferation of UNREACHABLE_GARBAGE packs, they can be
+combined under certain circumstances. If "gc.garbageTtl" is set to
+greater than one day, then packs created within a single calendar day,
+UTC, can be coalesced together. The resulting packfile would have an
+mtime before midnight on that day, so this makes the effective maximum
+ttl the garbageTtl + 1 day. If "gc.garbageTtl" is less than one day,
+then we divide the calendar day into intervals one-third of that ttl
+in duration. Packs created within the same interval can be coalesced
+together. The resulting packfile would have an mtime before the end of
+the interval, so this makes the effective maximum ttl equal to the
+garbageTtl * 4/3.
+
+This rule comes from Thirumala Reddy Mutchukota's JGit change
+https://git.eclipse.org/r/90465.
+
+The UNREACHABLE_GARBAGE setting goes in the PSRC field of the pack
+index. More generally, that field indicates where a pack came from:
+
+ - 1 (PACK_SOURCE_RECEIVE) for a pack received over the network
+ - 2 (PACK_SOURCE_AUTO) for a pack created by a lightweight
+ "gc --auto" operation
+ - 3 (PACK_SOURCE_GC) for a pack created by a full gc
+ - 4 (PACK_SOURCE_UNREACHABLE_GARBAGE) for potential garbage
+ discovered by gc
+ - 5 (PACK_SOURCE_INSERT) for locally created objects that were
+ written directly to a pack file, e.g. from "git add ."
+
+This information can be useful for debugging and for "gc --auto" to
+make appropriate choices about which packs to coalesce.
+
+Caveats
+-------
+Invalid objects
+~~~~~~~~~~~~~~~
+The conversion from sha1-content to newhash-content retains any
+brokenness in the original object (e.g., tree entry modes encoded with
+leading 0, tree objects whose paths are not sorted correctly, and
+commit objects without an author or committer). This is a deliberate
+feature of the design to allow the conversion to round-trip.
+
+More profoundly broken objects (e.g., a commit with a truncated "tree"
+header line) cannot be converted but were not usable by current Git
+anyway.
+
+Shallow clone and submodules
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+Because it requires all referenced objects to be available in the
+locally generated translation table, this design does not support
+shallow clone or unfetched submodules. Protocol improvements might
+allow lifting this restriction.
+
+Alternates
+~~~~~~~~~~
+For the same reason, a newhash repository cannot borrow objects from a
+sha1 repository using objects/info/alternates or
+$GIT_ALTERNATE_OBJECT_REPOSITORIES.
+
+git notes
+~~~~~~~~~
+The "git notes" tool annotates objects using their sha1-name as key.
+This design does not describe a way to migrate notes trees to use
+newhash-names. That migration is expected to happen separately (for
+example using a file at the root of the notes tree to describe which
+hash it uses).
+
+Server-side cost
+~~~~~~~~~~~~~~~~
+Until Git protocol gains NewHash support, using NewHash based storage
+on public-facing Git servers is strongly discouraged. Once Git
+protocol gains NewHash support, NewHash based servers are likely not
+to support SHA-1 compatibility, to avoid what may be a very expensive
+hash reencode during clone and to encourage peers to modernize.
+
+The design described here allows fetches by SHA-1 clients of a
+personal NewHash repository because it's not much more difficult than
+allowing pushes from that repository. This support needs to be guarded
+by a configuration option --- servers like git.kernel.org that serve a
+large number of clients would not be expected to bear that cost.
+
+Meaning of signatures
+~~~~~~~~~~~~~~~~~~~~~
+The signed payload for signed commits and tags does not explicitly
+name the hash used to identify objects. If some day Git adopts a new
+hash function with the same length as the current SHA-1 (40
+hexadecimal digit) or NewHash (64 hexadecimal digit) objects then the
+intent behind the PGP signed payload in an object signature is
+unclear:
+
+ object e7e07d5a4fcc2a203d9873968ad3e6bd4d7419d7
+ type commit
+ tag v2.12.0
+ tagger Junio C Hamano <gitster@pobox.com> 1487962205 -0800
+
+ Git 2.12
+
+Does this mean Git v2.12.0 is the commit with sha1-name
+e7e07d5a4fcc2a203d9873968ad3e6bd4d7419d7 or the commit with
+new-40-digit-hash-name e7e07d5a4fcc2a203d9873968ad3e6bd4d7419d7?
+
+Fortunately NewHash and SHA-1 have different lengths. If Git starts
+using another hash with the same length to name objects, then it will
+need to change the format of signed payloads using that hash to
+address this issue.
+
+Object names on the command line
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+To support the transition (see Transition plan below), this design
+supports four different modes of operation:
+
+ 1. ("dark launch") Treat object names input by the user as SHA-1 and
+ convert any object names written to output to SHA-1, but store
+ objects using NewHash. This allows users to test the code with no
+ visible behavior change except for performance. This allows
+ allows running even tests that assume the SHA-1 hash function, to
+ sanity-check the behavior of the new mode.
+
+ 2. ("early transition") Allow both SHA-1 and NewHash object names in
+ input. Any object names written to output use SHA-1. This allows
+ users to continue to make use of SHA-1 to communicate with peers
+ (e.g. by email) that have not migrated yet and prepares for mode 3.
+
+ 3. ("late transition") Allow both SHA-1 and NewHash object names in
+ input. Any object names written to output use NewHash. In this
+ mode, users are using a more secure object naming method by
+ default. The disruption is minimal as long as most of their peers
+ are in mode 2 or mode 3.
+
+ 4. ("post-transition") Treat object names input by the user as
+ NewHash and write output using NewHash. This is safer than mode 3
+ because there is less risk that input is incorrectly interpreted
+ using the wrong hash function.
+
+The mode is specified in configuration.
+
+The user can also explicitly specify which format to use for a
+particular revision specifier and for output, overriding the mode. For
+example:
+
+git --output-format=sha1 log abac87a^{sha1}..f787cac^{newhash}
+
+Selection of a New Hash
+-----------------------
+In early 2005, around the time that Git was written, Xiaoyun Wang,
+Yiqun Lisa Yin, and Hongbo Yu announced an attack finding SHA-1
+collisions in 2^69 operations. In August they published details.
+Luckily, no practical demonstrations of a collision in full SHA-1 were
+published until 10 years later, in 2017.
+
+The hash function NewHash to replace SHA-1 should be stronger than
+SHA-1 was: we would like it to be trustworthy and useful in practice
+for at least 10 years.
+
+Some other relevant properties:
+
+1. A 256-bit hash (long enough to match common security practice; not
+ excessively long to hurt performance and disk usage).
+
+2. High quality implementations should be widely available (e.g. in
+ OpenSSL).
+
+3. The hash function's properties should match Git's needs (e.g. Git
+ requires collision and 2nd preimage resistance and does not require
+ length extension resistance).
+
+4. As a tiebreaker, the hash should be fast to compute (fortunately
+ many contenders are faster than SHA-1).
+
+Some hashes under consideration are SHA-256, SHA-512/256, SHA-256x16,
+K12, and BLAKE2bp-256.
+
+Transition plan
+---------------
+Some initial steps can be implemented independently of one another:
+- adding a hash function API (vtable)
+- teaching fsck to tolerate the gpgsig-newhash field
+- excluding gpgsig-* from the fields copied by "git commit --amend"
+- annotating tests that depend on SHA-1 values with a SHA1 test
+ prerequisite
+- using "struct object_id", GIT_MAX_RAWSZ, and GIT_MAX_HEXSZ
+ consistently instead of "unsigned char *" and the hardcoded
+ constants 20 and 40.
+- introducing index v3
+- adding support for the PSRC field and safer object pruning
+
+
+The first user-visible change is the introduction of the objectFormat
+extension (without compatObjectFormat). This requires:
+- implementing the loose-object-idx
+- teaching fsck about this mode of operation
+- using the hash function API (vtable) when computing object names
+- signing objects and verifying signatures
+- rejecting attempts to fetch from or push to an incompatible
+ repository
+
+Next comes introduction of compatObjectFormat:
+- translating object names between object formats
+- translating object content between object formats
+- generating and verifying signatures in the compat format
+- adding appropriate index entries when adding a new object to the
+ object store
+- --output-format option
+- ^{sha1} and ^{newhash} revision notation
+- configuration to specify default input and output format (see
+ "Object names on the command line" above)
+
+The next step is supporting fetches and pushes to SHA-1 repositories:
+- allow pushes to a repository using the compat format
+- generate a topologically sorted list of the SHA-1 names of fetched
+ objects
+- convert the fetched packfile to newhash format and generate an idx
+ file
+- re-sort to match the order of objects in the fetched packfile
+
+The infrastructure supporting fetch also allows converting an existing
+repository. In converted repositories and new clones, end users can
+gain support for the new hash function without any visible change in
+behavior (see "dark launch" in the "Object names on the command line"
+section). In particular this allows users to verify NewHash signatures
+on objects in the repository, and it should ensure the transition code
+is stable in production in preparation for using it more widely.
+
+Over time projects would encourage their users to adopt the "early
+transition" and then "late transition" modes to take advantage of the
+new, more futureproof NewHash object names.
+
+When objectFormat and compatObjectFormat are both set, commands
+generating signatures would generate both SHA-1 and NewHash signatures
+by default to support both new and old users.
+
+In projects using NewHash heavily, users could be encouraged to adopt
+the "post-transition" mode to avoid accidentally making implicit use
+of SHA-1 object names.
+
+Once a critical mass of users have upgraded to a version of Git that
+can verify NewHash signatures and have converted their existing
+repositories to support verifying them, we can add support for a
+setting to generate only NewHash signatures. This is expected to be at
+least a year later.
+
+That is also a good moment to advertise the ability to convert
+repositories to use NewHash only, stripping out all SHA-1 related
+metadata. This improves performance by eliminating translation
+overhead and security by avoiding the possibility of accidentally
+relying on the safety of SHA-1.
+
+Updating Git's protocols to allow a server to specify which hash
+functions it supports is also an important part of this transition. It
+is not discussed in detail in this document but this transition plan
+assumes it happens. :)
+
+Alternatives considered
+-----------------------
+Upgrading everyone working on a particular project on a flag day
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+Projects like the Linux kernel are large and complex enough that
+flipping the switch for all projects based on the repository at once
+is infeasible.
+
+Not only would all developers and server operators supporting
+developers have to switch on the same flag day, but supporting tooling
+(continuous integration, code review, bug trackers, etc) would have to
+be adapted as well. This also makes it difficult to get early feedback
+from some project participants testing before it is time for mass
+adoption.
+
+Using hash functions in parallel
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+(e.g. https://public-inbox.org/git/22708.8913.864049.452252@chiark.greenend.org.uk/ )
+Objects newly created would be addressed by the new hash, but inside
+such an object (e.g. commit) it is still possible to address objects
+using the old hash function.
+* You cannot trust its history (needed for bisectability) in the
+ future without further work
+* Maintenance burden as the number of supported hash functions grows
+ (they will never go away, so they accumulate). In this proposal, by
+ comparison, converted objects lose all references to SHA-1.
+
+Signed objects with multiple hashes
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+Instead of introducing the gpgsig-newhash field in commit and tag objects
+for newhash-content based signatures, an earlier version of this design
+added "hash newhash <newhash-name>" fields to strengthen the existing
+sha1-content based signatures.
+
+In other words, a single signature was used to attest to the object
+content using both hash functions. This had some advantages:
+* Using one signature instead of two speeds up the signing process.
+* Having one signed payload with both hashes allows the signer to
+ attest to the sha1-name and newhash-name referring to the same object.
+* All users consume the same signature. Broken signatures are likely
+ to be detected quickly using current versions of git.
+
+However, it also came with disadvantages:
+* Verifying a signed object requires access to the sha1-names of all
+ objects it references, even after the transition is complete and
+ translation table is no longer needed for anything else. To support
+ this, the design added fields such as "hash sha1 tree <sha1-name>"
+ and "hash sha1 parent <sha1-name>" to the newhash-content of a signed
+ commit, complicating the conversion process.
+* Allowing signed objects without a sha1 (for after the transition is
+ complete) complicated the design further, requiring a "nohash sha1"
+ field to suppress including "hash sha1" fields in the newhash-content
+ and signed payload.
+
+Lazily populated translation table
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+Some of the work of building the translation table could be deferred to
+push time, but that would significantly complicate and slow down pushes.
+Calculating the sha1-name at object creation time at the same time it is
+being streamed to disk and having its newhash-name calculated should be
+an acceptable cost.
+
+Document History
+----------------
+
+2017-03-03
+bmwill@google.com, jonathantanmy@google.com, jrnieder@gmail.com,
+sbeller@google.com
+
+Initial version sent to
+http://public-inbox.org/git/20170304011251.GA26789@aiede.mtv.corp.google.com
+
+2017-03-03 jrnieder@gmail.com
+Incorporated suggestions from jonathantanmy and sbeller:
+* describe purpose of signed objects with each hash type
+* redefine signed object verification using object content under the
+ first hash function
+
+2017-03-06 jrnieder@gmail.com
+* Use SHA3-256 instead of SHA2 (thanks, Linus and brian m. carlson).[1][2]
+* Make sha3-based signatures a separate field, avoiding the need for
+ "hash" and "nohash" fields (thanks to peff[3]).
+* Add a sorting phase to fetch (thanks to Junio for noticing the need
+ for this).
+* Omit blobs from the topological sort during fetch (thanks to peff).
+* Discuss alternates, git notes, and git servers in the caveats
+ section (thanks to Junio Hamano, brian m. carlson[4], and Shawn
+ Pearce).
+* Clarify language throughout (thanks to various commenters,
+ especially Junio).
+
+2017-09-27 jrnieder@gmail.com, sbeller@google.com
+* use placeholder NewHash instead of SHA3-256
+* describe criteria for picking a hash function.
+* include a transition plan (thanks especially to Brandon Williams
+ for fleshing these ideas out)
+* define the translation table (thanks, Shawn Pearce[5], Jonathan
+ Tan, and Masaya Suzuki)
+* avoid loose object overhead by packing more aggressively in
+ "git gc --auto"
+
+[1] http://public-inbox.org/git/CA+55aFzJtejiCjV0e43+9oR3QuJK2PiFiLQemytoLpyJWe6P9w@mail.gmail.com/
+[2] http://public-inbox.org/git/CA+55aFz+gkAsDZ24zmePQuEs1XPS9BP_s8O7Q4wQ7LV7X5-oDA@mail.gmail.com/
+[3] http://public-inbox.org/git/20170306084353.nrns455dvkdsfgo5@sigill.intra.peff.net/
+[4] http://public-inbox.org/git/20170304224936.rqqtkdvfjgyezsht@genre.crustytoothpaste.net
+[5] https://public-inbox.org/git/CAJo=hJtoX9=AyLHHpUJS7fueV9ciZ_MNpnEPHUz8Whui6g9F0A@mail.gmail.com/
diff --git a/Documentation/technical/http-protocol.txt b/Documentation/technical/http-protocol.txt
index 1c561bdd92..a0e45f2889 100644
--- a/Documentation/technical/http-protocol.txt
+++ b/Documentation/technical/http-protocol.txt
@@ -219,6 +219,10 @@ smart server reply:
S: 003c2cb58b79488a98d2721cea644875a8dd0026b115 refs/tags/v1.0\n
S: 003fa3c2e2402b99163d1d59756e5f207ae21cccba4c refs/tags/v1.0^{}\n
+The client may send Extra Parameters (see
+Documentation/technical/pack-protocol.txt) as a colon-separated string
+in the Git-Protocol HTTP header.
+
Dumb Server Response
^^^^^^^^^^^^^^^^^^^^
Dumb servers MUST respond with the dumb server reply format.
@@ -269,7 +273,11 @@ the C locale ordering. The stream SHOULD include the default ref
named `HEAD` as the first ref. The stream MUST include capability
declarations behind a NUL on the first ref.
+The returned response contains "version 1" if "version=1" was sent as an
+Extra Parameter.
+
smart_reply = PKT-LINE("# service=$servicename" LF)
+ *1("version 1")
ref_list
"0000"
ref_list = empty_list / non_empty_list
diff --git a/Documentation/technical/index-format.txt b/Documentation/technical/index-format.txt
index ade0b0c445..db3572626b 100644
--- a/Documentation/technical/index-format.txt
+++ b/Documentation/technical/index-format.txt
@@ -295,3 +295,22 @@ The remaining data of each directory block is grouped by type:
in the previous ewah bitmap.
- One NUL.
+
+== File System Monitor cache
+
+ The file system monitor cache tracks files for which the core.fsmonitor
+ hook has told us about changes. The signature for this extension is
+ { 'F', 'S', 'M', 'N' }.
+
+ The extension starts with
+
+ - 32-bit version number: the current supported version is 1.
+
+ - 64-bit time: the extension data reflects all changes through the given
+ time which is stored as the nanoseconds elapsed since midnight,
+ January 1, 1970.
+
+ - 32-bit bitmap size: the size of the CE_FSMONITOR_VALID bitmap.
+
+ - An ewah bitmap, the n-th bit indicates whether the n-th index entry
+ is not CE_FSMONITOR_VALID.
diff --git a/Documentation/technical/pack-protocol.txt b/Documentation/technical/pack-protocol.txt
index ed1eae8b83..7fee6b780a 100644
--- a/Documentation/technical/pack-protocol.txt
+++ b/Documentation/technical/pack-protocol.txt
@@ -39,6 +39,19 @@ communicates with that invoked process over the SSH connection.
The file:// transport runs the 'upload-pack' or 'receive-pack'
process locally and communicates with it over a pipe.
+Extra Parameters
+----------------
+
+The protocol provides a mechanism in which clients can send additional
+information in its first message to the server. These are called "Extra
+Parameters", and are supported by the Git, SSH, and HTTP protocols.
+
+Each Extra Parameter takes the form of `<key>=<value>` or `<key>`.
+
+Servers that receive any such Extra Parameters MUST ignore all
+unrecognized keys. Currently, the only Extra Parameter recognized is
+"version=1".
+
Git Transport
-------------
@@ -46,18 +59,25 @@ The Git transport starts off by sending the command and repository
on the wire using the pkt-line format, followed by a NUL byte and a
hostname parameter, terminated by a NUL byte.
- 0032git-upload-pack /project.git\0host=myserver.com\0
+ 0033git-upload-pack /project.git\0host=myserver.com\0
+
+The transport may send Extra Parameters by adding an additional NUL
+byte, and then adding one or more NUL-terminated strings:
+
+ 003egit-upload-pack /project.git\0host=myserver.com\0\0version=1\0
--
- git-proto-request = request-command SP pathname NUL [ host-parameter NUL ]
+ git-proto-request = request-command SP pathname NUL
+ [ host-parameter NUL ] [ NUL extra-parameters ]
request-command = "git-upload-pack" / "git-receive-pack" /
"git-upload-archive" ; case sensitive
pathname = *( %x01-ff ) ; exclude NUL
host-parameter = "host=" hostname [ ":" port ]
+ extra-parameters = 1*extra-parameter
+ extra-parameter = 1*( %x01-ff ) NUL
--
-Only host-parameter is allowed in the git-proto-request. Clients
-MUST NOT attempt to send additional parameters. It is used for the
+host-parameter is used for the
git-daemon name based virtual hosting. See --interpolated-path
option to git daemon, with the %H/%CH format characters.
@@ -117,6 +137,12 @@ we execute it without the leading '/'.
v
ssh user@example.com "git-upload-pack '~alice/project.git'"
+Depending on the value of the `protocol.version` configuration variable,
+Git may attempt to send Extra Parameters as a colon-separated string in
+the GIT_PROTOCOL environment variable. This is done only if
+the `ssh.variant` configuration variable indicates that the ssh command
+supports passing environment variables as an argument.
+
A few things to remember here:
- The "command name" is spelled with dash (e.g. git-upload-pack), but
@@ -137,11 +163,13 @@ Reference Discovery
-------------------
When the client initially connects the server will immediately respond
-with a listing of each reference it has (all branches and tags) along
+with a version number (if "version=1" is sent as an Extra Parameter),
+and a listing of each reference it has (all branches and tags) along
with the object name that each reference currently points to.
- $ echo -e -n "0039git-upload-pack /schacon/gitbook.git\0host=example.com\0" |
+ $ echo -e -n "0044git-upload-pack /schacon/gitbook.git\0host=example.com\0\0version=1\0" |
nc -v example.com 9418
+ 000aversion 1
00887217a7c7e582c46cec22a130adf4b9d7d950fba0 HEAD\0multi_ack thin-pack
side-band side-band-64k ofs-delta shallow no-progress include-tag
00441d3fcd5ced445d1abc402225c0b8a1299641f497 refs/heads/integration
@@ -165,7 +193,8 @@ immediately after the ref itself, if presented. A conforming server
MUST peel the ref if it's an annotated tag.
----
- advertised-refs = (no-refs / list-of-refs)
+ advertised-refs = *1("version 1")
+ (no-refs / list-of-refs)
*shallow
flush-pkt
@@ -212,6 +241,7 @@ out of what the server said it could do with the first 'want' line.
upload-request = want-list
*shallow-line
*1depth-request
+ [filter-request]
flush-pkt
want-list = first-want
@@ -227,6 +257,8 @@ out of what the server said it could do with the first 'want' line.
additional-want = PKT-LINE("want" SP obj-id)
depth = 1*DIGIT
+
+ filter-request = PKT-LINE("filter" SP filter-spec)
----
Clients MUST send all the obj-ids it wants from the reference
@@ -249,6 +281,11 @@ complete those commits. Commits whose parents are not received as a
result are defined as shallow and marked as such in the server. This
information is sent back to the client in the next step.
+The client can optionally request that pack-objects omit various
+objects from the packfile using one of several filtering techniques.
+These are intended for use with partial clone and partial fetch
+operations. See `rev-list` for possible "filter-spec" values.
+
Once all the 'want's and 'shallow's (and optional 'deepen') are
transferred, clients MUST send a flush-pkt, to tell the server side
that it is done sending the list.
diff --git a/Documentation/technical/partial-clone.txt b/Documentation/technical/partial-clone.txt
new file mode 100644
index 0000000000..0bed2472c8
--- /dev/null
+++ b/Documentation/technical/partial-clone.txt
@@ -0,0 +1,324 @@
+Partial Clone Design Notes
+==========================
+
+The "Partial Clone" feature is a performance optimization for Git that
+allows Git to function without having a complete copy of the repository.
+The goal of this work is to allow Git better handle extremely large
+repositories.
+
+During clone and fetch operations, Git downloads the complete contents
+and history of the repository. This includes all commits, trees, and
+blobs for the complete life of the repository. For extremely large
+repositories, clones can take hours (or days) and consume 100+GiB of disk
+space.
+
+Often in these repositories there are many blobs and trees that the user
+does not need such as:
+
+ 1. files outside of the user's work area in the tree. For example, in
+ a repository with 500K directories and 3.5M files in every commit,
+ we can avoid downloading many objects if the user only needs a
+ narrow "cone" of the source tree.
+
+ 2. large binary assets. For example, in a repository where large build
+ artifacts are checked into the tree, we can avoid downloading all
+ previous versions of these non-mergeable binary assets and only
+ download versions that are actually referenced.
+
+Partial clone allows us to avoid downloading such unneeded objects *in
+advance* during clone and fetch operations and thereby reduce download
+times and disk usage. Missing objects can later be "demand fetched"
+if/when needed.
+
+Use of partial clone requires that the user be online and the origin
+remote be available for on-demand fetching of missing objects. This may
+or may not be problematic for the user. For example, if the user can
+stay within the pre-selected subset of the source tree, they may not
+encounter any missing objects. Alternatively, the user could try to
+pre-fetch various objects if they know that they are going offline.
+
+
+Non-Goals
+---------
+
+Partial clone is a mechanism to limit the number of blobs and trees downloaded
+*within* a given range of commits -- and is therefore independent of and not
+intended to conflict with existing DAG-level mechanisms to limit the set of
+requested commits (i.e. shallow clone, single branch, or fetch '<refspec>').
+
+
+Design Overview
+---------------
+
+Partial clone logically consists of the following parts:
+
+- A mechanism for the client to describe unneeded or unwanted objects to
+ the server.
+
+- A mechanism for the server to omit such unwanted objects from packfiles
+ sent to the client.
+
+- A mechanism for the client to gracefully handle missing objects (that
+ were previously omitted by the server).
+
+- A mechanism for the client to backfill missing objects as needed.
+
+
+Design Details
+--------------
+
+- A new pack-protocol capability "filter" is added to the fetch-pack and
+ upload-pack negotiation.
+
+ This uses the existing capability discovery mechanism.
+ See "filter" in Documentation/technical/pack-protocol.txt.
+
+- Clients pass a "filter-spec" to clone and fetch which is passed to the
+ server to request filtering during packfile construction.
+
+ There are various filters available to accommodate different situations.
+ See "--filter=<filter-spec>" in Documentation/rev-list-options.txt.
+
+- On the server pack-objects applies the requested filter-spec as it
+ creates "filtered" packfiles for the client.
+
+ These filtered packfiles are *incomplete* in the traditional sense because
+ they may contain objects that reference objects not contained in the
+ packfile and that the client doesn't already have. For example, the
+ filtered packfile may contain trees or tags that reference missing blobs
+ or commits that reference missing trees.
+
+- On the client these incomplete packfiles are marked as "promisor packfiles"
+ and treated differently by various commands.
+
+- On the client a repository extension is added to the local config to
+ prevent older versions of git from failing mid-operation because of
+ missing objects that they cannot handle.
+ See "extensions.partialClone" in Documentation/technical/repository-version.txt"
+
+
+Handling Missing Objects
+------------------------
+
+- An object may be missing due to a partial clone or fetch, or missing due
+ to repository corruption. To differentiate these cases, the local
+ repository specially indicates such filtered packfiles obtained from the
+ promisor remote as "promisor packfiles".
+
+ These promisor packfiles consist of a "<name>.promisor" file with
+ arbitrary contents (like the "<name>.keep" files), in addition to
+ their "<name>.pack" and "<name>.idx" files.
+
+- The local repository considers a "promisor object" to be an object that
+ it knows (to the best of its ability) that the promisor remote has promised
+ that it has, either because the local repository has that object in one of
+ its promisor packfiles, or because another promisor object refers to it.
+
+ When Git encounters a missing object, Git can see if it a promisor object
+ and handle it appropriately. If not, Git can report a corruption.
+
+ This means that there is no need for the client to explicitly maintain an
+ expensive-to-modify list of missing objects.[a]
+
+- Since almost all Git code currently expects any referenced object to be
+ present locally and because we do not want to force every command to do
+ a dry-run first, a fallback mechanism is added to allow Git to attempt
+ to dynamically fetch missing objects from the promisor remote.
+
+ When the normal object lookup fails to find an object, Git invokes
+ fetch-object to try to get the object from the server and then retry
+ the object lookup. This allows objects to be "faulted in" without
+ complicated prediction algorithms.
+
+ For efficiency reasons, no check as to whether the missing object is
+ actually a promisor object is performed.
+
+ Dynamic object fetching tends to be slow as objects are fetched one at
+ a time.
+
+- `checkout` (and any other command using `unpack-trees`) has been taught
+ to bulk pre-fetch all required missing blobs in a single batch.
+
+- `rev-list` has been taught to print missing objects.
+
+ This can be used by other commands to bulk prefetch objects.
+ For example, a "git log -p A..B" may internally want to first do
+ something like "git rev-list --objects --quiet --missing=print A..B"
+ and prefetch those objects in bulk.
+
+- `fsck` has been updated to be fully aware of promisor objects.
+
+- `repack` in GC has been updated to not touch promisor packfiles at all,
+ and to only repack other objects.
+
+- The global variable "fetch_if_missing" is used to control whether an
+ object lookup will attempt to dynamically fetch a missing object or
+ report an error.
+
+ We are not happy with this global variable and would like to remove it,
+ but that requires significant refactoring of the object code to pass an
+ additional flag. We hope that concurrent efforts to add an ODB API can
+ encompass this.
+
+
+Fetching Missing Objects
+------------------------
+
+- Fetching of objects is done using the existing transport mechanism using
+ transport_fetch_refs(), setting a new transport option
+ TRANS_OPT_NO_DEPENDENTS to indicate that only the objects themselves are
+ desired, not any object that they refer to.
+
+ Because some transports invoke fetch_pack() in the same process, fetch_pack()
+ has been updated to not use any object flags when the corresponding argument
+ (no_dependents) is set.
+
+- The local repository sends a request with the hashes of all requested
+ objects as "want" lines, and does not perform any packfile negotiation.
+ It then receives a packfile.
+
+- Because we are reusing the existing fetch-pack mechanism, fetching
+ currently fetches all objects referred to by the requested objects, even
+ though they are not necessary.
+
+
+Current Limitations
+-------------------
+
+- The remote used for a partial clone (or the first partial fetch
+ following a regular clone) is marked as the "promisor remote".
+
+ We are currently limited to a single promisor remote and only that
+ remote may be used for subsequent partial fetches.
+
+ We accept this limitation because we believe initial users of this
+ feature will be using it on repositories with a strong single central
+ server.
+
+- Dynamic object fetching will only ask the promisor remote for missing
+ objects. We assume that the promisor remote has a complete view of the
+ repository and can satisfy all such requests.
+
+- Repack essentially treats promisor and non-promisor packfiles as 2
+ distinct partitions and does not mix them. Repack currently only works
+ on non-promisor packfiles and loose objects.
+
+- Dynamic object fetching invokes fetch-pack once *for each item*
+ because most algorithms stumble upon a missing object and need to have
+ it resolved before continuing their work. This may incur significant
+ overhead -- and multiple authentication requests -- if many objects are
+ needed.
+
+- Dynamic object fetching currently uses the existing pack protocol V0
+ which means that each object is requested via fetch-pack. The server
+ will send a full set of info/refs when the connection is established.
+ If there are large number of refs, this may incur significant overhead.
+
+
+Future Work
+-----------
+
+- Allow more than one promisor remote and define a strategy for fetching
+ missing objects from specific promisor remotes or of iterating over the
+ set of promisor remotes until a missing object is found.
+
+ A user might want to have multiple geographically-close cache servers
+ for fetching missing blobs while continuing to do filtered `git-fetch`
+ commands from the central server, for example.
+
+ Or the user might want to work in a triangular work flow with multiple
+ promisor remotes that each have an incomplete view of the repository.
+
+- Allow repack to work on promisor packfiles (while keeping them distinct
+ from non-promisor packfiles).
+
+- Allow non-pathname-based filters to make use of packfile bitmaps (when
+ present). This was just an omission during the initial implementation.
+
+- Investigate use of a long-running process to dynamically fetch a series
+ of objects, such as proposed in [5,6] to reduce process startup and
+ overhead costs.
+
+ It would be nice if pack protocol V2 could allow that long-running
+ process to make a series of requests over a single long-running
+ connection.
+
+- Investigate pack protocol V2 to avoid the info/refs broadcast on
+ each connection with the server to dynamically fetch missing objects.
+
+- Investigate the need to handle loose promisor objects.
+
+ Objects in promisor packfiles are allowed to reference missing objects
+ that can be dynamically fetched from the server. An assumption was
+ made that loose objects are only created locally and therefore should
+ not reference a missing object. We may need to revisit that assumption
+ if, for example, we dynamically fetch a missing tree and store it as a
+ loose object rather than a single object packfile.
+
+ This does not necessarily mean we need to mark loose objects as promisor;
+ it may be sufficient to relax the object lookup or is-promisor functions.
+
+
+Non-Tasks
+---------
+
+- Every time the subject of "demand loading blobs" comes up it seems
+ that someone suggests that the server be allowed to "guess" and send
+ additional objects that may be related to the requested objects.
+
+ No work has gone into actually doing that; we're just documenting that
+ it is a common suggestion. We're not sure how it would work and have
+ no plans to work on it.
+
+ It is valid for the server to send more objects than requested (even
+ for a dynamic object fetch), but we are not building on that.
+
+
+Footnotes
+---------
+
+[a] expensive-to-modify list of missing objects: Earlier in the design of
+ partial clone we discussed the need for a single list of missing objects.
+ This would essentially be a sorted linear list of OIDs that the were
+ omitted by the server during a clone or subsequent fetches.
+
+ This file would need to be loaded into memory on every object lookup.
+ It would need to be read, updated, and re-written (like the .git/index)
+ on every explicit "git fetch" command *and* on any dynamic object fetch.
+
+ The cost to read, update, and write this file could add significant
+ overhead to every command if there are many missing objects. For example,
+ if there are 100M missing blobs, this file would be at least 2GiB on disk.
+
+ With the "promisor" concept, we *infer* a missing object based upon the
+ type of packfile that references it.
+
+
+Related Links
+-------------
+[0] https://bugs.chromium.org/p/git/issues/detail?id=2
+ Chromium work item for: Partial Clone
+
+[1] https://public-inbox.org/git/20170113155253.1644-1-benpeart@microsoft.com/
+ Subject: [RFC] Add support for downloading blobs on demand
+ Date: Fri, 13 Jan 2017 10:52:53 -0500
+
+[2] https://public-inbox.org/git/cover.1506714999.git.jonathantanmy@google.com/
+ Subject: [PATCH 00/18] Partial clone (from clone to lazy fetch in 18 patches)
+ Date: Fri, 29 Sep 2017 13:11:36 -0700
+
+[3] https://public-inbox.org/git/20170426221346.25337-1-jonathantanmy@google.com/
+ Subject: Proposal for missing blob support in Git repos
+ Date: Wed, 26 Apr 2017 15:13:46 -0700
+
+[4] https://public-inbox.org/git/1488999039-37631-1-git-send-email-git@jeffhostetler.com/
+ Subject: [PATCH 00/10] RFC Partial Clone and Fetch
+ Date: Wed, 8 Mar 2017 18:50:29 +0000
+
+[5] https://public-inbox.org/git/20170505152802.6724-1-benpeart@microsoft.com/
+ Subject: [PATCH v7 00/10] refactor the filter process code into a reusable module
+ Date: Fri, 5 May 2017 11:27:52 -0400
+
+[6] https://public-inbox.org/git/20170714132651.170708-1-benpeart@microsoft.com/
+ Subject: [RFC/PATCH v2 0/1] Add support for downloading blobs on demand
+ Date: Fri, 14 Jul 2017 09:26:50 -0400
diff --git a/Documentation/technical/protocol-capabilities.txt b/Documentation/technical/protocol-capabilities.txt
index 26dcc6f502..332d209b58 100644
--- a/Documentation/technical/protocol-capabilities.txt
+++ b/Documentation/technical/protocol-capabilities.txt
@@ -309,3 +309,11 @@ to accept a signed push certificate, and asks the <nonce> to be
included in the push certificate. A send-pack client MUST NOT
send a push-cert packet unless the receive-pack server advertises
this capability.
+
+filter
+------
+
+If the upload-pack server advertises the 'filter' capability,
+fetch-pack may send "filter" commands to request a partial clone
+or partial fetch and request that the server omit various objects
+from the packfile.
diff --git a/Documentation/technical/repository-version.txt b/Documentation/technical/repository-version.txt
index 00ad37986e..e03eaccebc 100644
--- a/Documentation/technical/repository-version.txt
+++ b/Documentation/technical/repository-version.txt
@@ -86,3 +86,15 @@ for testing format-1 compatibility.
When the config key `extensions.preciousObjects` is set to `true`,
objects in the repository MUST NOT be deleted (e.g., by `git-prune` or
`git repack -d`).
+
+`partialclone`
+~~~~~~~~~~~~~~
+
+When the config key `extensions.partialclone` is set, it indicates
+that the repo was created with a partial clone (or later performed
+a partial fetch) and that the remote may have omitted sending
+certain unwanted objects. Such a remote is called a "promisor remote"
+and it promises that all such omitted objects can be fetched from it
+in the future.
+
+The value of this key is the name of the promisor remote.