diff options
Diffstat (limited to 'Documentation/technical')
41 files changed, 3046 insertions, 1028 deletions
diff --git a/Documentation/technical/api-argv-array.txt b/Documentation/technical/api-argv-array.txt index 1a797812fb..870c8edbfb 100644 --- a/Documentation/technical/api-argv-array.txt +++ b/Documentation/technical/api-argv-array.txt @@ -8,7 +8,7 @@ always NULL-terminated at the element pointed to by `argv[argc]`. This makes the result suitable for passing to functions expecting to receive argv from main(), or the link:api-run-command.html[run-command API]. -The link:api-string-list.html[string-list API] is similar, but cannot be +The string-list API (documented in string-list.h) is similar, but cannot be used for these purposes; instead of storing a straight string pointer, it contains an item structure with a `util` field that is not compatible with the traditional argv interface. @@ -46,6 +46,9 @@ Functions Format a string and push it onto the end of the array. This is a convenience wrapper combining `strbuf_addf` and `argv_array_push`. +`argv_array_pushv`:: + Push a null-terminated array of strings onto the end of the array. + `argv_array_pop`:: Remove the final element from the array. If there are no elements in the array, do nothing. @@ -53,3 +56,10 @@ Functions `argv_array_clear`:: Free all memory associated with the array and return it to the initial, empty state. + +`argv_array_detach`:: + Disconnect the `argv` member from the `argv_array` struct and + return it. The caller is responsible for freeing the memory used + by the array, and by the strings it references. After detaching, + the `argv_array` is in a reinitialized state and can be pushed + into again. diff --git a/Documentation/technical/api-builtin.txt b/Documentation/technical/api-builtin.txt deleted file mode 100644 index 22a39b9299..0000000000 --- a/Documentation/technical/api-builtin.txt +++ /dev/null @@ -1,73 +0,0 @@ -builtin API -=========== - -Adding a new built-in ---------------------- - -There are 4 things to do to add a built-in command implementation to -Git: - -. Define the implementation of the built-in command `foo` with - signature: - - int cmd_foo(int argc, const char **argv, const char *prefix); - -. Add the external declaration for the function to `builtin.h`. - -. Add the command to the `commands[]` table defined in `git.c`. - The entry should look like: - - { "foo", cmd_foo, <options> }, -+ -where options is the bitwise-or of: - -`RUN_SETUP`:: - If there is not a Git directory to work on, abort. If there - is a work tree, chdir to the top of it if the command was - invoked in a subdirectory. If there is no work tree, no - chdir() is done. - -`RUN_SETUP_GENTLY`:: - If there is a Git directory, chdir as per RUN_SETUP, otherwise, - don't chdir anywhere. - -`USE_PAGER`:: - - If the standard output is connected to a tty, spawn a pager and - feed our output to it. - -`NEED_WORK_TREE`:: - - Make sure there is a work tree, i.e. the command cannot act - on bare repositories. - This only makes sense when `RUN_SETUP` is also set. - -. Add `builtin/foo.o` to `BUILTIN_OBJS` in `Makefile`. - -Additionally, if `foo` is a new command, there are 3 more things to do: - -. Add tests to `t/` directory. - -. Write documentation in `Documentation/git-foo.txt`. - -. Add an entry for `git-foo` to `command-list.txt`. - -. Add an entry for `/git-foo` to `.gitignore`. - - -How a built-in is called ------------------------- - -The implementation `cmd_foo()` takes three parameters, `argc`, `argv, -and `prefix`. The first two are similar to what `main()` of a -standalone command would be called with. - -When `RUN_SETUP` is specified in the `commands[]` table, and when you -were started from a subdirectory of the work tree, `cmd_foo()` is called -after chdir(2) to the top of the work tree, and `prefix` gets the path -to the subdirectory the command started from. This allows you to -convert a user-supplied pathname (typically relative to that directory) -to a pathname relative to the top of the work tree. - -The return value from `cmd_foo()` becomes the exit status of the -command. diff --git a/Documentation/technical/api-config.txt b/Documentation/technical/api-config.txt index 0d8b99b368..fa39ac9d71 100644 --- a/Documentation/technical/api-config.txt +++ b/Documentation/technical/api-config.txt @@ -47,28 +47,23 @@ will first feed the user-wide one to the callback, and then the repo-specific one; by overwriting, the higher-priority repo-specific value is left at the end). -The `git_config_with_options` function lets the caller examine config +The `config_with_options` function lets the caller examine config while adjusting some of the default behavior of `git_config`. It should almost never be used by "regular" Git code that is looking up configuration variables. It is intended for advanced callers like `git-config`, which are intentionally tweaking the normal config-lookup process. It takes two extra parameters: -`filename`:: -If this parameter is non-NULL, it specifies the name of a file to -parse for configuration, rather than looking in the usual files. Regular -`git_config` defaults to `NULL`. +`config_source`:: +If this parameter is non-NULL, it specifies the source to parse for +configuration, rather than looking in the usual files. See `struct +git_config_source` in `config.h` for details. Regular `git_config` defaults +to `NULL`. -`respect_includes`:: -Specify whether include directives should be followed in parsed files. -Regular `git_config` defaults to `1`. - -There is a special version of `git_config` called `git_config_early`. -This version takes an additional parameter to specify the repository -config, instead of having it looked up via `git_path`. This is useful -early in a Git program before the repository has been found. Unless -you're working with early setup code, you probably don't want to use -this. +`opts`:: +Specify options to adjust the behavior of parsing config files. See `struct +config_options` in `config.h` for details. As an example: regular `git_config` +sets `opts.respect_includes` to `1` by default. Reading Specific Files ---------------------- @@ -193,7 +188,7 @@ parsing is successful, the return value is the result. Same as `git_config_bool`, except that integers are returned as-is, and an `is_bool` flag is unset. -`git_config_maybe_bool`:: +`git_parse_maybe_bool`:: Same as `git_config_bool`, except that it returns -1 on error rather than dying. diff --git a/Documentation/technical/api-credentials.txt b/Documentation/technical/api-credentials.txt index e44426dd04..75368f26ca 100644 --- a/Documentation/technical/api-credentials.txt +++ b/Documentation/technical/api-credentials.txt @@ -243,7 +243,7 @@ appended to its command line, which is one of: The details of the credential will be provided on the helper's stdin stream. The exact format is the same as the input/output format of the `git credential` plumbing command (see the section `INPUT/OUTPUT -FORMAT` in linkgit:git-credential[7] for a detailed specification). +FORMAT` in linkgit:git-credential[1] for a detailed specification). For a `get` operation, the helper should produce a list of attributes on stdout in the same format. A helper is free to produce a subset, or @@ -268,4 +268,4 @@ See also linkgit:gitcredentials[7] -linkgit:git-config[5] (See configuration variables `credential.*`) +linkgit:git-config[1] (See configuration variables `credential.*`) diff --git a/Documentation/technical/api-decorate.txt b/Documentation/technical/api-decorate.txt deleted file mode 100644 index 1d52a6ce14..0000000000 --- a/Documentation/technical/api-decorate.txt +++ /dev/null @@ -1,6 +0,0 @@ -decorate API -============ - -Talk about <decorate.h> - -(Linus) diff --git a/Documentation/technical/api-directory-listing.txt b/Documentation/technical/api-directory-listing.txt index 7f8e78d916..5abb8e8b1f 100644 --- a/Documentation/technical/api-directory-listing.txt +++ b/Documentation/technical/api-directory-listing.txt @@ -22,16 +22,41 @@ The notable options are: `flags`:: - A bit-field of options (the `*IGNORED*` flags are mutually exclusive): + A bit-field of options: `DIR_SHOW_IGNORED`::: - Return just ignored files in `entries[]`, not untracked files. + Return just ignored files in `entries[]`, not untracked + files. This flag is mutually exclusive with + `DIR_SHOW_IGNORED_TOO`. `DIR_SHOW_IGNORED_TOO`::: - Similar to `DIR_SHOW_IGNORED`, but return ignored files in `ignored[]` - in addition to untracked files in `entries[]`. + Similar to `DIR_SHOW_IGNORED`, but return ignored files in + `ignored[]` in addition to untracked files in + `entries[]`. This flag is mutually exclusive with + `DIR_SHOW_IGNORED`. + +`DIR_KEEP_UNTRACKED_CONTENTS`::: + + Only has meaning if `DIR_SHOW_IGNORED_TOO` is also set; if this is set, the + untracked contents of untracked directories are also returned in + `entries[]`. + +`DIR_SHOW_IGNORED_TOO_MODE_MATCHING`::: + + Only has meaning if `DIR_SHOW_IGNORED_TOO` is also set; if + this is set, returns ignored files and directories that match + an exclude pattern. If a directory matches an exclude pattern, + then the directory is returned and the contained paths are + not. A directory that does not match an exclude pattern will + not be returned even if all of its contents are ignored. In + this case, the contents are returned as individual entries. ++ +If this is set, files and directories that explicitly match an ignore +pattern are reported. Implicitly ignored directories (directories that +do not match an ignore pattern, but whose contents are all ignored) +are not reported, instead all of the contents are reported. `DIR_COLLECT_IGNORED`::: diff --git a/Documentation/technical/api-error-handling.txt b/Documentation/technical/api-error-handling.txt index fc68db126e..ceeedd485c 100644 --- a/Documentation/technical/api-error-handling.txt +++ b/Documentation/technical/api-error-handling.txt @@ -58,7 +58,7 @@ to `die` or `error` as-is. For example: if (ref_transaction_commit(transaction, &err)) die("%s", err.buf); -The 'err' parameter will be untouched if no error occured, so multiple +The 'err' parameter will be untouched if no error occurred, so multiple function calls can be chained: t = ref_transaction_begin(&err); diff --git a/Documentation/technical/api-gitattributes.txt b/Documentation/technical/api-gitattributes.txt index 2602668677..45f0df600f 100644 --- a/Documentation/technical/api-gitattributes.txt +++ b/Documentation/technical/api-gitattributes.txt @@ -16,10 +16,15 @@ Data Structure of no interest to the calling programs. The name of the attribute can be retrieved by calling `git_attr_name()`. -`struct git_attr_check`:: +`struct attr_check_item`:: - This structure represents a set of attributes to check in a call - to `git_check_attr()` function, and receives the results. + This structure represents one attribute and its value. + +`struct attr_check`:: + + This structure represents a collection of `attr_check_item`. + It is passed to `git_check_attr()` function, specifying the + attributes to check, and receives their values. Attribute Values @@ -27,7 +32,7 @@ Attribute Values An attribute for a path can be in one of four states: Set, Unset, Unspecified or set to a string, and `.value` member of `struct -git_attr_check` records it. There are three macros to check these: +attr_check_item` records it. There are three macros to check these: `ATTR_TRUE()`:: @@ -48,49 +53,51 @@ value of the attribute for the path. Querying Specific Attributes ---------------------------- -* Prepare an array of `struct git_attr_check` to define the list of - attributes you would want to check. To populate this array, you would - need to define necessary attributes by calling `git_attr()` function. +* Prepare `struct attr_check` using attr_check_initl() + function, enumerating the names of attributes whose values you are + interested in, terminated with a NULL pointer. Alternatively, an + empty `struct attr_check` can be prepared by calling + `attr_check_alloc()` function and then attributes you want to + ask about can be added to it with `attr_check_append()` + function. * Call `git_check_attr()` to check the attributes for the path. -* Inspect `git_attr_check` structure to see how each of the attribute in - the array is defined for the path. +* Inspect `attr_check` structure to see how each of the + attribute in the array is defined for the path. Example ------- -To see how attributes "crlf" and "indent" are set for different paths. +To see how attributes "crlf" and "ident" are set for different paths. -. Prepare an array of `struct git_attr_check` with two elements (because - we are checking two attributes). Initialize their `attr` member with - pointers to `struct git_attr` obtained by calling `git_attr()`: +. Prepare a `struct attr_check` with two elements (because + we are checking two attributes): ------------ -static struct git_attr_check check[2]; +static struct attr_check *check; static void setup_check(void) { - if (check[0].attr) + if (check) return; /* already done */ - check[0].attr = git_attr("crlf"); - check[1].attr = git_attr("ident"); + check = attr_check_initl("crlf", "ident", NULL); } ------------ -. Call `git_check_attr()` with the prepared array of `struct git_attr_check`: +. Call `git_check_attr()` with the prepared `struct attr_check`: ------------ const char *path; setup_check(); - git_check_attr(path, ARRAY_SIZE(check), check); + git_check_attr(path, check); ------------ -. Act on `.value` member of the result, left in `check[]`: +. Act on `.value` member of the result, left in `check->items[]`: ------------ - const char *value = check[0].value; + const char *value = check->items[0].value; if (ATTR_TRUE(value)) { The attribute is Set, by listing only the name of the @@ -109,20 +116,39 @@ static void setup_check(void) } ------------ +To see how attributes in argv[] are set for different paths, only +the first step in the above would be different. + +------------ +static struct attr_check *check; +static void setup_check(const char **argv) +{ + check = attr_check_alloc(); + while (*argv) { + struct git_attr *attr = git_attr(*argv); + attr_check_append(check, attr); + argv++; + } +} +------------ + Querying All Attributes ----------------------- To get the values of all attributes associated with a file: -* Call `git_all_attrs()`, which returns an array of `git_attr_check` - structures. +* Prepare an empty `attr_check` structure by calling + `attr_check_alloc()`. + +* Call `git_all_attrs()`, which populates the `attr_check` + with the attributes attached to the path. -* Iterate over the `git_attr_check` array to examine the attribute - names and values. The name of the attribute described by a - `git_attr_check` object can be retrieved via - `git_attr_name(check[i].attr)`. (Please note that no items will be - returned for unset attributes, so `ATTR_UNSET()` will return false - for all returned `git_array_check` objects.) +* Iterate over the `attr_check.items[]` array to examine + the attribute names and values. The name of the attribute + described by an `attr_check.items[]` object can be retrieved via + `git_attr_name(check->items[i].attr)`. (Please note that no items + will be returned for unset attributes, so `ATTR_UNSET()` will return + false for all returned `attr_check.items[]` objects.) -* Free the `git_array_check` array. +* Free the `attr_check` struct by calling `attr_check_free()`. diff --git a/Documentation/technical/api-hashmap.txt b/Documentation/technical/api-hashmap.txt deleted file mode 100644 index ad7a5bddd2..0000000000 --- a/Documentation/technical/api-hashmap.txt +++ /dev/null @@ -1,280 +0,0 @@ -hashmap API -=========== - -The hashmap API is a generic implementation of hash-based key-value mappings. - -Data Structures ---------------- - -`struct hashmap`:: - - The hash table structure. Members can be used as follows, but should - not be modified directly: -+ -The `size` member keeps track of the total number of entries (0 means the -hashmap is empty). -+ -`tablesize` is the allocated size of the hash table. A non-0 value indicates -that the hashmap is initialized. It may also be useful for statistical purposes -(i.e. `size / tablesize` is the current load factor). -+ -`cmpfn` stores the comparison function specified in `hashmap_init()`. In -advanced scenarios, it may be useful to change this, e.g. to switch between -case-sensitive and case-insensitive lookup. - -`struct hashmap_entry`:: - - An opaque structure representing an entry in the hash table, which must - be used as first member of user data structures. Ideally it should be - followed by an int-sized member to prevent unused memory on 64-bit - systems due to alignment. -+ -The `hash` member is the entry's hash code and the `next` member points to the -next entry in case of collisions (i.e. if multiple entries map to the same -bucket). - -`struct hashmap_iter`:: - - An iterator structure, to be used with hashmap_iter_* functions. - -Types ------ - -`int (*hashmap_cmp_fn)(const void *entry, const void *entry_or_key, const void *keydata)`:: - - User-supplied function to test two hashmap entries for equality. Shall - return 0 if the entries are equal. -+ -This function is always called with non-NULL `entry` / `entry_or_key` -parameters that have the same hash code. When looking up an entry, the `key` -and `keydata` parameters to hashmap_get and hashmap_remove are always passed -as second and third argument, respectively. Otherwise, `keydata` is NULL. - -Functions ---------- - -`unsigned int strhash(const char *buf)`:: -`unsigned int strihash(const char *buf)`:: -`unsigned int memhash(const void *buf, size_t len)`:: -`unsigned int memihash(const void *buf, size_t len)`:: - - Ready-to-use hash functions for strings, using the FNV-1 algorithm (see - http://www.isthe.com/chongo/tech/comp/fnv). -+ -`strhash` and `strihash` take 0-terminated strings, while `memhash` and -`memihash` operate on arbitrary-length memory. -+ -`strihash` and `memihash` are case insensitive versions. - -`unsigned int sha1hash(const unsigned char *sha1)`:: - - Converts a cryptographic hash (e.g. SHA-1) into an int-sized hash code - for use in hash tables. Cryptographic hashes are supposed to have - uniform distribution, so in contrast to `memhash()`, this just copies - the first `sizeof(int)` bytes without shuffling any bits. Note that - the results will be different on big-endian and little-endian - platforms, so they should not be stored or transferred over the net. - -`void hashmap_init(struct hashmap *map, hashmap_cmp_fn equals_function, size_t initial_size)`:: - - Initializes a hashmap structure. -+ -`map` is the hashmap to initialize. -+ -The `equals_function` can be specified to compare two entries for equality. -If NULL, entries are considered equal if their hash codes are equal. -+ -If the total number of entries is known in advance, the `initial_size` -parameter may be used to preallocate a sufficiently large table and thus -prevent expensive resizing. If 0, the table is dynamically resized. - -`void hashmap_free(struct hashmap *map, int free_entries)`:: - - Frees a hashmap structure and allocated memory. -+ -`map` is the hashmap to free. -+ -If `free_entries` is true, each hashmap_entry in the map is freed as well -(using stdlib's free()). - -`void hashmap_entry_init(void *entry, unsigned int hash)`:: - - Initializes a hashmap_entry structure. -+ -`entry` points to the entry to initialize. -+ -`hash` is the hash code of the entry. - -`void *hashmap_get(const struct hashmap *map, const void *key, const void *keydata)`:: - - Returns the hashmap entry for the specified key, or NULL if not found. -+ -`map` is the hashmap structure. -+ -`key` is a hashmap_entry structure (or user data structure that starts with -hashmap_entry) that has at least been initialized with the proper hash code -(via `hashmap_entry_init`). -+ -If an entry with matching hash code is found, `key` and `keydata` are passed -to `hashmap_cmp_fn` to decide whether the entry matches the key. - -`void *hashmap_get_from_hash(const struct hashmap *map, unsigned int hash, const void *keydata)`:: - - Returns the hashmap entry for the specified hash code and key data, - or NULL if not found. -+ -`map` is the hashmap structure. -+ -`hash` is the hash code of the entry to look up. -+ -If an entry with matching hash code is found, `keydata` is passed to -`hashmap_cmp_fn` to decide whether the entry matches the key. The -`entry_or_key` parameter points to a bogus hashmap_entry structure that -should not be used in the comparison. - -`void *hashmap_get_next(const struct hashmap *map, const void *entry)`:: - - Returns the next equal hashmap entry, or NULL if not found. This can be - used to iterate over duplicate entries (see `hashmap_add`). -+ -`map` is the hashmap structure. -+ -`entry` is the hashmap_entry to start the search from, obtained via a previous -call to `hashmap_get` or `hashmap_get_next`. - -`void hashmap_add(struct hashmap *map, void *entry)`:: - - Adds a hashmap entry. This allows to add duplicate entries (i.e. - separate values with the same key according to hashmap_cmp_fn). -+ -`map` is the hashmap structure. -+ -`entry` is the entry to add. - -`void *hashmap_put(struct hashmap *map, void *entry)`:: - - Adds or replaces a hashmap entry. If the hashmap contains duplicate - entries equal to the specified entry, only one of them will be replaced. -+ -`map` is the hashmap structure. -+ -`entry` is the entry to add or replace. -+ -Returns the replaced entry, or NULL if not found (i.e. the entry was added). - -`void *hashmap_remove(struct hashmap *map, const void *key, const void *keydata)`:: - - Removes a hashmap entry matching the specified key. If the hashmap - contains duplicate entries equal to the specified key, only one of - them will be removed. -+ -`map` is the hashmap structure. -+ -`key` is a hashmap_entry structure (or user data structure that starts with -hashmap_entry) that has at least been initialized with the proper hash code -(via `hashmap_entry_init`). -+ -If an entry with matching hash code is found, `key` and `keydata` are -passed to `hashmap_cmp_fn` to decide whether the entry matches the key. -+ -Returns the removed entry, or NULL if not found. - -`void hashmap_iter_init(struct hashmap *map, struct hashmap_iter *iter)`:: -`void *hashmap_iter_next(struct hashmap_iter *iter)`:: -`void *hashmap_iter_first(struct hashmap *map, struct hashmap_iter *iter)`:: - - Used to iterate over all entries of a hashmap. -+ -`hashmap_iter_init` initializes a `hashmap_iter` structure. -+ -`hashmap_iter_next` returns the next hashmap_entry, or NULL if there are no -more entries. -+ -`hashmap_iter_first` is a combination of both (i.e. initializes the iterator -and returns the first entry, if any). - -`const char *strintern(const char *string)`:: -`const void *memintern(const void *data, size_t len)`:: - - Returns the unique, interned version of the specified string or data, - similar to the `String.intern` API in Java and .NET, respectively. - Interned strings remain valid for the entire lifetime of the process. -+ -Can be used as `[x]strdup()` or `xmemdupz` replacement, except that interned -strings / data must not be modified or freed. -+ -Interned strings are best used for short strings with high probability of -duplicates. -+ -Uses a hashmap to store the pool of interned strings. - -Usage example -------------- - -Here's a simple usage example that maps long keys to double values. ------------- -struct hashmap map; - -struct long2double { - struct hashmap_entry ent; /* must be the first member! */ - long key; - double value; -}; - -static int long2double_cmp(const struct long2double *e1, const struct long2double *e2, const void *unused) -{ - return !(e1->key == e2->key); -} - -void long2double_init(void) -{ - hashmap_init(&map, (hashmap_cmp_fn) long2double_cmp, 0); -} - -void long2double_free(void) -{ - hashmap_free(&map, 1); -} - -static struct long2double *find_entry(long key) -{ - struct long2double k; - hashmap_entry_init(&k, memhash(&key, sizeof(long))); - k.key = key; - return hashmap_get(&map, &k, NULL); -} - -double get_value(long key) -{ - struct long2double *e = find_entry(key); - return e ? e->value : 0; -} - -void set_value(long key, double value) -{ - struct long2double *e = find_entry(key); - if (!e) { - e = malloc(sizeof(struct long2double)); - hashmap_entry_init(e, memhash(&key, sizeof(long))); - e->key = key; - hashmap_add(&map, e); - } - e->value = value; -} ------------- - -Using variable-sized keys -------------------------- - -The `hashmap_entry_get` and `hashmap_entry_remove` functions expect an ordinary -`hashmap_entry` structure as key to find the correct entry. If the key data is -variable-sized (e.g. a FLEX_ARRAY string) or quite large, it is undesirable -to create a full-fledged entry structure on the heap and copy all the key data -into the structure. - -In this case, the `keydata` parameter can be used to pass -variable-sized key data directly to the comparison function, and the `key` -parameter can be a stripped-down, fixed size entry structure allocated on the -stack. - -See test-hashmap.c for an example using arbitrary-length strings as keys. diff --git a/Documentation/technical/api-in-core-index.txt b/Documentation/technical/api-in-core-index.txt deleted file mode 100644 index adbdbf5d75..0000000000 --- a/Documentation/technical/api-in-core-index.txt +++ /dev/null @@ -1,21 +0,0 @@ -in-core index API -================= - -Talk about <read-cache.c> and <cache-tree.c>, things like: - -* cache -> the_index macros -* read_index() -* write_index() -* ie_match_stat() and ie_modified(); how they are different and when to - use which. -* index_name_pos() -* remove_index_entry_at() -* remove_file_from_index() -* add_file_to_index() -* add_index_entry() -* refresh_index() -* discard_index() -* cache_tree_invalidate_path() -* cache_tree_update() - -(JC, Linus) diff --git a/Documentation/technical/api-lockfile.txt b/Documentation/technical/api-lockfile.txt deleted file mode 100644 index 93b5f23e4c..0000000000 --- a/Documentation/technical/api-lockfile.txt +++ /dev/null @@ -1,220 +0,0 @@ -lockfile API -============ - -The lockfile API serves two purposes: - -* Mutual exclusion and atomic file updates. When we want to change a - file, we create a lockfile `<filename>.lock`, write the new file - contents into it, and then rename the lockfile to its final - destination `<filename>`. We create the `<filename>.lock` file with - `O_CREAT|O_EXCL` so that we can notice and fail if somebody else has - already locked the file, then atomically rename the lockfile to its - final destination to commit the changes and unlock the file. - -* Automatic cruft removal. If the program exits after we lock a file - but before the changes have been committed, we want to make sure - that we remove the lockfile. This is done by remembering the - lockfiles we have created in a linked list and setting up an - `atexit(3)` handler and a signal handler that clean up the - lockfiles. This mechanism ensures that outstanding lockfiles are - cleaned up if the program exits (including when `die()` is called) - or if the program dies on a signal. - -Please note that lockfiles only block other writers. Readers do not -block, but they are guaranteed to see either the old contents of the -file or the new contents of the file (assuming that the filesystem -implements `rename(2)` atomically). - - -Calling sequence ----------------- - -The caller: - -* Allocates a `struct lock_file` either as a static variable or on the - heap, initialized to zeros. Once you use the structure to call the - `hold_lock_file_*` family of functions, it belongs to the lockfile - subsystem and its storage must remain valid throughout the life of - the program (i.e. you cannot use an on-stack variable to hold this - structure). - -* Attempts to create a lockfile by passing that variable and the path - of the final destination (e.g. `$GIT_DIR/index`) to - `hold_lock_file_for_update` or `hold_lock_file_for_append`. - -* Writes new content for the destination file by either: - - * writing to the file descriptor returned by the `hold_lock_file_*` - functions (also available via `lock->fd`). - - * calling `fdopen_lock_file` to get a `FILE` pointer for the open - file and writing to the file using stdio. - -When finished writing, the caller can: - -* Close the file descriptor and rename the lockfile to its final - destination by calling `commit_lock_file` or `commit_lock_file_to`. - -* Close the file descriptor and remove the lockfile by calling - `rollback_lock_file`. - -* Close the file descriptor without removing or renaming the lockfile - by calling `close_lock_file`, and later call `commit_lock_file`, - `commit_lock_file_to`, `rollback_lock_file`, or `reopen_lock_file`. - -Even after the lockfile is committed or rolled back, the `lock_file` -object must not be freed or altered by the caller. However, it may be -reused; just pass it to another call of `hold_lock_file_for_update` or -`hold_lock_file_for_append`. - -If the program exits before you have called one of `commit_lock_file`, -`commit_lock_file_to`, `rollback_lock_file`, or `close_lock_file`, an -`atexit(3)` handler will close and remove the lockfile, rolling back -any uncommitted changes. - -If you need to close the file descriptor you obtained from a -`hold_lock_file_*` function yourself, do so by calling -`close_lock_file`. You should never call `close(2)` or `fclose(3)` -yourself! Otherwise the `struct lock_file` structure would still think -that the file descriptor needs to be closed, and a commit or rollback -would result in duplicate calls to `close(2)`. Worse yet, if you close -and then later open another file descriptor for a completely different -purpose, then a commit or rollback might close that unrelated file -descriptor. - - -Error handling --------------- - -The `hold_lock_file_*` functions return a file descriptor on success -or -1 on failure (unless `LOCK_DIE_ON_ERROR` is used; see below). On -errors, `errno` describes the reason for failure. Errors can be -reported by passing `errno` to one of the following helper functions: - -unable_to_lock_message:: - - Append an appropriate error message to a `strbuf`. - -unable_to_lock_error:: - - Emit an appropriate error message using `error()`. - -unable_to_lock_die:: - - Emit an appropriate error message and `die()`. - -Similarly, `commit_lock_file`, `commit_lock_file_to`, and -`close_lock_file` return 0 on success. On failure they set `errno` -appropriately, do their best to roll back the lockfile, and return -1. - - -Flags ------ - -The following flags can be passed to `hold_lock_file_for_update` or -`hold_lock_file_for_append`: - -LOCK_NO_DEREF:: - - Usually symbolic links in the destination path are resolved - and the lockfile is created by adding ".lock" to the resolved - path. If `LOCK_NO_DEREF` is set, then the lockfile is created - by adding ".lock" to the path argument itself. This option is - used, for example, when locking a symbolic reference, which - for backwards-compatibility reasons can be a symbolic link - containing the name of the referred-to-reference. - -LOCK_DIE_ON_ERROR:: - - If a lock is already taken for the file, `die()` with an error - message. If this option is not specified, trying to lock a - file that is already locked returns -1 to the caller. - - -The functions -------------- - -hold_lock_file_for_update:: - - Take a pointer to `struct lock_file`, the path of the file to - be locked (e.g. `$GIT_DIR/index`) and a flags argument (see - above). Attempt to create a lockfile for the destination and - return the file descriptor for writing to the file. - -hold_lock_file_for_append:: - - Like `hold_lock_file_for_update`, but before returning copy - the existing contents of the file (if any) to the lockfile and - position its write pointer at the end of the file. - -fdopen_lock_file:: - - Associate a stdio stream with the lockfile. Return NULL - (*without* rolling back the lockfile) on error. The stream is - closed automatically when `close_lock_file` is called or when - the file is committed or rolled back. - -get_locked_file_path:: - - Return the path of the file that is locked by the specified - lock_file object. The caller must free the memory. - -commit_lock_file:: - - Take a pointer to the `struct lock_file` initialized with an - earlier call to `hold_lock_file_for_update` or - `hold_lock_file_for_append`, close the file descriptor, and - rename the lockfile to its final destination. Return 0 upon - success. On failure, roll back the lock file and return -1, - with `errno` set to the value from the failing call to - `close(2)` or `rename(2)`. It is a bug to call - `commit_lock_file` for a `lock_file` object that is not - currently locked. - -commit_lock_file_to:: - - Like `commit_lock_file()`, except that it takes an explicit - `path` argument to which the lockfile should be renamed. The - `path` must be on the same filesystem as the lock file. - -rollback_lock_file:: - - Take a pointer to the `struct lock_file` initialized with an - earlier call to `hold_lock_file_for_update` or - `hold_lock_file_for_append`, close the file descriptor and - remove the lockfile. It is a NOOP to call - `rollback_lock_file()` for a `lock_file` object that has - already been committed or rolled back. - -close_lock_file:: - - Take a pointer to the `struct lock_file` initialized with an - earlier call to `hold_lock_file_for_update` or - `hold_lock_file_for_append`. Close the file descriptor (and - the file pointer if it has been opened using - `fdopen_lock_file`). Return 0 upon success. On failure to - `close(2)`, return a negative value and roll back the lock - file. Usually `commit_lock_file`, `commit_lock_file_to`, or - `rollback_lock_file` should eventually be called if - `close_lock_file` succeeds. - -reopen_lock_file:: - - Re-open a lockfile that has been closed (using - `close_lock_file`) but not yet committed or rolled back. This - can be used to implement a sequence of operations like the - following: - - * Lock file. - - * Write new contents to lockfile, then `close_lock_file` to - cause the contents to be written to disk. - - * Pass the name of the lockfile to another program to allow it - (and nobody else) to inspect the contents you wrote, while - still holding the lock yourself. - - * `reopen_lock_file` to reopen the lockfile. Make further - updates to the contents. - - * `commit_lock_file` to make the final version permanent. diff --git a/Documentation/technical/api-object-access.txt b/Documentation/technical/api-object-access.txt index 03bb0e950d..5b29622d00 100644 --- a/Documentation/technical/api-object-access.txt +++ b/Documentation/technical/api-object-access.txt @@ -1,13 +1,13 @@ object access API ================= -Talk about <sha1_file.c> and <object.h> family, things like +Talk about <sha1-file.c> and <object.h> family, things like * read_sha1_file() * read_object_with_reference() * has_sha1_file() * write_sha1_file() -* pretend_sha1_file() +* pretend_object_file() * lookup_{object,commit,tag,blob,tree} * parse_{object,commit,tag,blob,tree} * Use of object flags diff --git a/Documentation/technical/api-oid-array.txt b/Documentation/technical/api-oid-array.txt new file mode 100644 index 0000000000..9febfb1d52 --- /dev/null +++ b/Documentation/technical/api-oid-array.txt @@ -0,0 +1,85 @@ +oid-array API +============== + +The oid-array API provides storage and manipulation of sets of object +identifiers. The emphasis is on storage and processing efficiency, +making them suitable for large lists. Note that the ordering of items is +not preserved over some operations. + +Data Structures +--------------- + +`struct oid_array`:: + + A single array of object IDs. This should be initialized by + assignment from `OID_ARRAY_INIT`. The `oid` member contains + the actual data. The `nr` member contains the number of items in + the set. The `alloc` and `sorted` members are used internally, + and should not be needed by API callers. + +Functions +--------- + +`oid_array_append`:: + Add an item to the set. The object ID will be placed at the end of + the array (but note that some operations below may lose this + ordering). + +`oid_array_lookup`:: + Perform a binary search of the array for a specific object ID. + If found, returns the offset (in number of elements) of the + object ID. If not found, returns a negative integer. If the array + is not sorted, this function has the side effect of sorting it. + +`oid_array_clear`:: + Free all memory associated with the array and return it to the + initial, empty state. + +`oid_array_for_each`:: + Iterate over each element of the list, executing the callback + function for each one. Does not sort the list, so any custom + hash order is retained. If the callback returns a non-zero + value, the iteration ends immediately and the callback's + return is propagated; otherwise, 0 is returned. + +`oid_array_for_each_unique`:: + Iterate over each unique element of the list in sorted order, + but otherwise behave like `oid_array_for_each`. If the array + is not sorted, this function has the side effect of sorting + it. + +Examples +-------- + +----------------------------------------- +int print_callback(const struct object_id *oid, + void *data) +{ + printf("%s\n", oid_to_hex(oid)); + return 0; /* always continue */ +} + +void some_func(void) +{ + struct sha1_array hashes = OID_ARRAY_INIT; + struct object_id oid; + + /* Read objects into our set */ + while (read_object_from_stdin(oid.hash)) + oid_array_append(&hashes, &oid); + + /* Check if some objects are in our set */ + while (read_object_from_stdin(oid.hash)) { + if (oid_array_lookup(&hashes, &oid) >= 0) + printf("it's in there!\n"); + + /* + * Print the unique set of objects. We could also have + * avoided adding duplicate objects in the first place, + * but we would end up re-sorting the array repeatedly. + * Instead, this will sort once and then skip duplicates + * in linear time. + */ + oid_array_for_each_unique(&hashes, print_callback, NULL); +} +----------------------------------------- diff --git a/Documentation/technical/api-parse-options.txt b/Documentation/technical/api-parse-options.txt index 1f2db31312..829b558110 100644 --- a/Documentation/technical/api-parse-options.txt +++ b/Documentation/technical/api-parse-options.txt @@ -144,8 +144,12 @@ There are some macros to easily define options: `OPT_COUNTUP(short, long, &int_var, description)`:: Introduce a count-up option. - `int_var` is incremented on each use of `--option`, and - reset to zero with `--no-option`. + Each use of `--option` increments `int_var`, starting from zero + (even if initially negative), and `--no-option` resets it to + zero. To determine if `--option` or `--no-option` was encountered at + all, initialize `int_var` to a negative value, and if it is still + negative after parse_options(), then neither `--option` nor + `--no-option` was seen. `OPT_BIT(short, long, &int_var, description, mask)`:: Introduce a boolean option. @@ -164,17 +168,28 @@ There are some macros to easily define options: Introduce an option with string argument. The string argument is put into `str_var`. +`OPT_STRING_LIST(short, long, &struct string_list, arg_str, description)`:: + Introduce an option with string argument. + The string argument is stored as an element in `string_list`. + Use of `--no-option` will clear the list of preceding values. + `OPT_INTEGER(short, long, &int_var, description)`:: Introduce an option with integer argument. The integer is put into `int_var`. -`OPT_DATE(short, long, &int_var, description)`:: +`OPT_MAGNITUDE(short, long, &unsigned_long_var, description)`:: + Introduce an option with a size argument. The argument must be a + non-negative integer and may include a suffix of 'k', 'm' or 'g' to + scale the provided value by 1024, 1024^2 or 1024^3 respectively. + The scaled value is put into `unsigned_long_var`. + +`OPT_DATE(short, long, ×tamp_t_var, description)`:: Introduce an option with date argument, see `approxidate()`. - The timestamp is put into `int_var`. + The timestamp is put into `timestamp_t_var`. -`OPT_EXPIRY_DATE(short, long, &int_var, description)`:: +`OPT_EXPIRY_DATE(short, long, ×tamp_t_var, description)`:: Introduce an option with expiry date argument, see `parse_expiry_date()`. - The timestamp is put into `int_var`. + The timestamp is put into `timestamp_t_var`. `OPT_CALLBACK(short, long, &var, arg_str, description, func_ptr)`:: Introduce an option with argument. @@ -212,6 +227,26 @@ There are some macros to easily define options: Use it to hide deprecated options that are still to be recognized and ignored silently. +`OPT_PASSTHRU(short, long, &char_var, arg_str, description, flags)`:: + Introduce an option that will be reconstructed into a char* string, + which must be initialized to NULL. This is useful when you need to + pass the command-line option to another command. Any previous value + will be overwritten, so this should only be used for options where + the last one specified on the command line wins. + +`OPT_PASSTHRU_ARGV(short, long, &argv_array_var, arg_str, description, flags)`:: + Introduce an option where all instances of it on the command-line will + be reconstructed into an argv_array. This is useful when you need to + pass the command-line option, which can be specified multiple times, + to another command. + +`OPT_CMDMODE(short, long, &int_var, description, enum_val)`:: + Define an "operation mode" option, only one of which in the same + group of "operating mode" options that share the same `int_var` + can be given by the user. `enum_val` is set to `int_var` when the + option is used, but an error is reported if other "operating mode" + option has already set its value to the same `int_var`. + The last element of the array must be `OPT_END()`. diff --git a/Documentation/technical/api-ref-iteration.txt b/Documentation/technical/api-ref-iteration.txt index 02adfd45d3..46c3d5c355 100644 --- a/Documentation/technical/api-ref-iteration.txt +++ b/Documentation/technical/api-ref-iteration.txt @@ -6,7 +6,7 @@ Iteration of refs is done by using an iterate function which will call a callback function for every ref. The callback function has this signature: - int handle_one_ref(const char *refname, const unsigned char *sha1, + int handle_one_ref(const char *refname, const struct object_id *oid, int flags, void *cb_data); There are different kinds of iterate functions which all take a @@ -32,11 +32,8 @@ Iteration functions * `for_each_glob_ref_in()` the previous and `for_each_ref_in()` combined. -* `head_ref_submodule()`, `for_each_ref_submodule()`, - `for_each_ref_in_submodule()`, `for_each_tag_ref_submodule()`, - `for_each_branch_ref_submodule()`, `for_each_remote_ref_submodule()` - do the same as the functions described above but for a specified - submodule. +* Use `refs_` API for accessing submodules. The submodule ref store could + be obtained with `get_submodule_ref_store()`. * `for_each_rawref()` can be used to learn about broken ref and symref. diff --git a/Documentation/technical/api-remote.txt b/Documentation/technical/api-remote.txt index 5d245aa9d1..f10941b2e8 100644 --- a/Documentation/technical/api-remote.txt +++ b/Documentation/technical/api-remote.txt @@ -51,6 +51,10 @@ struct remote The proxy to use for curl (http, https, ftp, etc.) URLs. +`http_proxy_authmethod`:: + + The method used for authenticating against `http_proxy`. + struct remotes can be found by name with remote_get(), and iterated through with for_each_remote(). remote_get(NULL) will return the default remote, given the current branch and configuration. @@ -97,10 +101,6 @@ It contains: The name of the remote listed in the configuration. -`remote`:: - - The struct remote for that remote. - `merge_name`:: An array of the "merge" lines in the configuration. diff --git a/Documentation/technical/api-run-command.txt b/Documentation/technical/api-run-command.txt index a9fdb45b93..8bf3e37f53 100644 --- a/Documentation/technical/api-run-command.txt +++ b/Documentation/technical/api-run-command.txt @@ -46,6 +46,13 @@ Functions The argument dir corresponds the member .dir. The argument env corresponds to the member .env. +`child_process_clear`:: + + Release the memory associated with the struct child_process. + Most users of the run-command API don't need to call this + function explicitly because `start_command` invokes it on + failure and `finish_command` calls it automatically already. + The functions above do the following: . If a system call failed, errno is set and -1 is returned. A diagnostic diff --git a/Documentation/technical/api-setup.txt b/Documentation/technical/api-setup.txt index 540e455689..eb1fa9853e 100644 --- a/Documentation/technical/api-setup.txt +++ b/Documentation/technical/api-setup.txt @@ -27,8 +27,6 @@ parse_pathspec(). This function takes several arguments: - prefix and args come from cmd_* functions -get_pathspec() is obsolete and should never be used in new code. - parse_pathspec() helps catch unsupported features and reject them politely. At a lower level, different pathspec-related functions may not support the same set of features. Such pathspec-sensitive diff --git a/Documentation/technical/api-sha1-array.txt b/Documentation/technical/api-sha1-array.txt deleted file mode 100644 index 3e75497a37..0000000000 --- a/Documentation/technical/api-sha1-array.txt +++ /dev/null @@ -1,76 +0,0 @@ -sha1-array API -============== - -The sha1-array API provides storage and manipulation of sets of SHA-1 -identifiers. The emphasis is on storage and processing efficiency, -making them suitable for large lists. Note that the ordering of items is -not preserved over some operations. - -Data Structures ---------------- - -`struct sha1_array`:: - - A single array of SHA-1 hashes. This should be initialized by - assignment from `SHA1_ARRAY_INIT`. The `sha1` member contains - the actual data. The `nr` member contains the number of items in - the set. The `alloc` and `sorted` members are used internally, - and should not be needed by API callers. - -Functions ---------- - -`sha1_array_append`:: - Add an item to the set. The sha1 will be placed at the end of - the array (but note that some operations below may lose this - ordering). - -`sha1_array_lookup`:: - Perform a binary search of the array for a specific sha1. - If found, returns the offset (in number of elements) of the - sha1. If not found, returns a negative integer. If the array is - not sorted, this function has the side effect of sorting it. - -`sha1_array_clear`:: - Free all memory associated with the array and return it to the - initial, empty state. - -`sha1_array_for_each_unique`:: - Efficiently iterate over each unique element of the list, - executing the callback function for each one. If the array is - not sorted, this function has the side effect of sorting it. - -Examples --------- - ------------------------------------------ -void print_callback(const unsigned char sha1[20], - void *data) -{ - printf("%s\n", sha1_to_hex(sha1)); -} - -void some_func(void) -{ - struct sha1_array hashes = SHA1_ARRAY_INIT; - unsigned char sha1[20]; - - /* Read objects into our set */ - while (read_object_from_stdin(sha1)) - sha1_array_append(&hashes, sha1); - - /* Check if some objects are in our set */ - while (read_object_from_stdin(sha1)) { - if (sha1_array_lookup(&hashes, sha1) >= 0) - printf("it's in there!\n"); - - /* - * Print the unique set of objects. We could also have - * avoided adding duplicate objects in the first place, - * but we would end up re-sorting the array repeatedly. - * Instead, this will sort once and then skip duplicates - * in linear time. - */ - sha1_array_for_each_unique(&hashes, print_callback, NULL); -} ------------------------------------------ diff --git a/Documentation/technical/api-string-list.txt b/Documentation/technical/api-string-list.txt deleted file mode 100644 index c08402b12e..0000000000 --- a/Documentation/technical/api-string-list.txt +++ /dev/null @@ -1,209 +0,0 @@ -string-list API -=============== - -The string_list API offers a data structure and functions to handle -sorted and unsorted string lists. A "sorted" list is one whose -entries are sorted by string value in `strcmp()` order. - -The 'string_list' struct used to be called 'path_list', but was renamed -because it is not specific to paths. - -The caller: - -. Allocates and clears a `struct string_list` variable. - -. Initializes the members. You might want to set the flag `strdup_strings` - if the strings should be strdup()ed. For example, this is necessary - when you add something like git_path("..."), since that function returns - a static buffer that will change with the next call to git_path(). -+ -If you need something advanced, you can manually malloc() the `items` -member (you need this if you add things later) and you should set the -`nr` and `alloc` members in that case, too. - -. Adds new items to the list, using `string_list_append`, - `string_list_append_nodup`, `string_list_insert`, - `string_list_split`, and/or `string_list_split_in_place`. - -. Can check if a string is in the list using `string_list_has_string` or - `unsorted_string_list_has_string` and get it from the list using - `string_list_lookup` for sorted lists. - -. Can sort an unsorted list using `string_list_sort`. - -. Can remove duplicate items from a sorted list using - `string_list_remove_duplicates`. - -. Can remove individual items of an unsorted list using - `unsorted_string_list_delete_item`. - -. Can remove items not matching a criterion from a sorted or unsorted - list using `filter_string_list`, or remove empty strings using - `string_list_remove_empty_items`. - -. Finally it should free the list using `string_list_clear`. - -Example: - ----- -struct string_list list = STRING_LIST_INIT_NODUP; -int i; - -string_list_append(&list, "foo"); -string_list_append(&list, "bar"); -for (i = 0; i < list.nr; i++) - printf("%s\n", list.items[i].string) ----- - -NOTE: It is more efficient to build an unsorted list and sort it -afterwards, instead of building a sorted list (`O(n log n)` instead of -`O(n^2)`). -+ -However, if you use the list to check if a certain string was added -already, you should not do that (using unsorted_string_list_has_string()), -because the complexity would be quadratic again (but with a worse factor). - -Functions ---------- - -* General ones (works with sorted and unsorted lists as well) - -`string_list_init`:: - - Initialize the members of the string_list, set `strdup_strings` - member according to the value of the second parameter. - -`filter_string_list`:: - - Apply a function to each item in a list, retaining only the - items for which the function returns true. If free_util is - true, call free() on the util members of any items that have - to be deleted. Preserve the order of the items that are - retained. - -`string_list_remove_empty_items`:: - - Remove any empty strings from the list. If free_util is true, - call free() on the util members of any items that have to be - deleted. Preserve the order of the items that are retained. - -`print_string_list`:: - - Dump a string_list to stdout, useful mainly for debugging purposes. It - can take an optional header argument and it writes out the - string-pointer pairs of the string_list, each one in its own line. - -`string_list_clear`:: - - Free a string_list. The `string` pointer of the items will be freed in - case the `strdup_strings` member of the string_list is set. The second - parameter controls if the `util` pointer of the items should be freed - or not. - -* Functions for sorted lists only - -`string_list_has_string`:: - - Determine if the string_list has a given string or not. - -`string_list_insert`:: - - Insert a new element to the string_list. The returned pointer can be - handy if you want to write something to the `util` pointer of the - string_list_item containing the just added string. If the given - string already exists the insertion will be skipped and the - pointer to the existing item returned. -+ -Since this function uses xrealloc() (which die()s if it fails) if the -list needs to grow, it is safe not to check the pointer. I.e. you may -write `string_list_insert(...)->util = ...;`. - -`string_list_lookup`:: - - Look up a given string in the string_list, returning the containing - string_list_item. If the string is not found, NULL is returned. - -`string_list_remove_duplicates`:: - - Remove all but the first of consecutive entries that have the - same string value. If free_util is true, call free() on the - util members of any items that have to be deleted. - -* Functions for unsorted lists only - -`string_list_append`:: - - Append a new string to the end of the string_list. If - `strdup_string` is set, then the string argument is copied; - otherwise the new `string_list_entry` refers to the input - string. - -`string_list_append_nodup`:: - - Append a new string to the end of the string_list. The new - `string_list_entry` always refers to the input string, even if - `strdup_string` is set. This function can be used to hand - ownership of a malloc()ed string to a `string_list` that has - `strdup_string` set. - -`string_list_sort`:: - - Sort the list's entries by string value in `strcmp()` order. - -`unsorted_string_list_has_string`:: - - It's like `string_list_has_string()` but for unsorted lists. - -`unsorted_string_list_lookup`:: - - It's like `string_list_lookup()` but for unsorted lists. -+ -The above two functions need to look through all items, as opposed to their -counterpart for sorted lists, which performs a binary search. - -`unsorted_string_list_delete_item`:: - - Remove an item from a string_list. The `string` pointer of the items - will be freed in case the `strdup_strings` member of the string_list - is set. The third parameter controls if the `util` pointer of the - items should be freed or not. - -`string_list_split`:: -`string_list_split_in_place`:: - - Split a string into substrings on a delimiter character and - append the substrings to a `string_list`. If `maxsplit` is - non-negative, then split at most `maxsplit` times. Return the - number of substrings appended to the list. -+ -`string_list_split` requires a `string_list` that has `strdup_strings` -set to true; it leaves the input string untouched and makes copies of -the substrings in newly-allocated memory. -`string_list_split_in_place` requires a `string_list` that has -`strdup_strings` set to false; it splits the input string in place, -overwriting the delimiter characters with NULs and creating new -string_list_items that point into the original string (the original -string must therefore not be modified or freed while the `string_list` -is in use). - - -Data structures ---------------- - -* `struct string_list_item` - -Represents an item of the list. The `string` member is a pointer to the -string, and you may use the `util` member for any purpose, if you want. - -* `struct string_list` - -Represents the list itself. - -. The array of items are available via the `items` member. -. The `nr` member contains the number of items stored in the list. -. The `alloc` member is used to avoid reallocating at every insertion. - You should not tamper with it. -. Setting the `strdup_strings` member to 1 will strdup() the strings - before adding them, see above. -. The `compare_strings_fn` member is used to specify a custom compare - function, otherwise `strcmp()` is used as the default function. diff --git a/Documentation/technical/api-submodule-config.txt b/Documentation/technical/api-submodule-config.txt new file mode 100644 index 0000000000..fb06089393 --- /dev/null +++ b/Documentation/technical/api-submodule-config.txt @@ -0,0 +1,66 @@ +submodule config cache API +========================== + +The submodule config cache API allows to read submodule +configurations/information from specified revisions. Internally +information is lazily read into a cache that is used to avoid +unnecessary parsing of the same .gitmodules files. Lookups can be done by +submodule path or name. + +Usage +----- + +To initialize the cache with configurations from the worktree the caller +typically first calls `gitmodules_config()` to read values from the +worktree .gitmodules and then to overlay the local git config values +`parse_submodule_config_option()` from the config parsing +infrastructure. + +The caller can look up information about submodules by using the +`submodule_from_path()` or `submodule_from_name()` functions. They return +a `struct submodule` which contains the values. The API automatically +initializes and allocates the needed infrastructure on-demand. If the +caller does only want to lookup values from revisions the initialization +can be skipped. + +If the internal cache might grow too big or when the caller is done with +the API, all internally cached values can be freed with submodule_free(). + +Data Structures +--------------- + +`struct submodule`:: + + This structure is used to return the information about one + submodule for a certain revision. It is returned by the lookup + functions. + +Functions +--------- + +`void submodule_free(struct repository *r)`:: + + Use these to free the internally cached values. + +`int parse_submodule_config_option(const char *var, const char *value)`:: + + Can be passed to the config parsing infrastructure to parse + local (worktree) submodule configurations. + +`const struct submodule *submodule_from_path(const unsigned char *treeish_name, const char *path)`:: + + Given a tree-ish in the superproject and a path, return the + submodule that is bound at the path in the named tree. + +`const struct submodule *submodule_from_name(const unsigned char *treeish_name, const char *name)`:: + + The same as above but lookup by name. + +Whenever a submodule configuration is parsed in `parse_submodule_config_option` +via e.g. `gitmodules_config()`, it will overwrite the null_sha1 entry. +So in the normal case, when HEAD:.gitmodules is parsed first and then overlayed +with the repository configuration, the null_sha1 entry contains the local +configuration of a submodule (e.g. consolidated values from local git +configuration and the .gitmodules file in the worktree). + +For an example usage see test-submodule-config.c. diff --git a/Documentation/technical/api-trace.txt b/Documentation/technical/api-trace.txt index 097a651d96..fadb5979c4 100644 --- a/Documentation/technical/api-trace.txt +++ b/Documentation/technical/api-trace.txt @@ -28,7 +28,7 @@ static struct trace_key trace_foo = TRACE_KEY_INIT(FOO); static void trace_print_foo(const char *message) { - trace_print_key(&trace_foo, message); + trace_printf_key(&trace_foo, "%s", message); } ------------ + @@ -95,3 +95,46 @@ for (;;) { } trace_performance(t, "frotz"); ------------ + +Bugs & Caveats +-------------- + +GIT_TRACE_* environment variables can be used to tell Git to show +trace output to its standard error stream. Git can often spawn a pager +internally to run its subcommand and send its standard output and +standard error to it. + +Because GIT_TRACE_PERFORMANCE trace is generated only at the very end +of the program with atexit(), which happens after the pager exits, it +would not work well if you send its log to the standard error output +and let Git spawn the pager at the same time. + +As a work around, you can for example use '--no-pager', or set +GIT_TRACE_PERFORMANCE to another file descriptor which is redirected +to stderr, or set GIT_TRACE_PERFORMANCE to a file specified by its +absolute path. + +For example instead of the following command which by default may not +print any performance information: + +------------ +GIT_TRACE_PERFORMANCE=2 git log -1 +------------ + +you may want to use: + +------------ +GIT_TRACE_PERFORMANCE=2 git --no-pager log -1 +------------ + +or: + +------------ +GIT_TRACE_PERFORMANCE=3 3>&2 git log -1 +------------ + +or: + +------------ +GIT_TRACE_PERFORMANCE=/path/to/log/file git log -1 +------------ diff --git a/Documentation/technical/api-tree-walking.txt b/Documentation/technical/api-tree-walking.txt index 14af37c3f1..bde18622a8 100644 --- a/Documentation/technical/api-tree-walking.txt +++ b/Documentation/technical/api-tree-walking.txt @@ -55,9 +55,9 @@ Initializing `fill_tree_descriptor`:: - Initialize a `tree_desc` and decode its first entry given the sha1 of - a tree. Returns the `buffer` member if the sha1 is a valid tree - identifier and NULL otherwise. + Initialize a `tree_desc` and decode its first entry given the + object ID of a tree. Returns the `buffer` member if the latter + is a valid tree identifier and NULL otherwise. `setup_traverse_info`:: diff --git a/Documentation/technical/commit-graph-format.txt b/Documentation/technical/commit-graph-format.txt new file mode 100644 index 0000000000..cc0474ba3e --- /dev/null +++ b/Documentation/technical/commit-graph-format.txt @@ -0,0 +1,97 @@ +Git commit graph format +======================= + +The Git commit graph stores a list of commit OIDs and some associated +metadata, including: + +- The generation number of the commit. Commits with no parents have + generation number 1; commits with parents have generation number + one more than the maximum generation number of its parents. We + reserve zero as special, and can be used to mark a generation + number invalid or as "not computed". + +- The root tree OID. + +- The commit date. + +- The parents of the commit, stored using positional references within + the graph file. + +These positional references are stored as unsigned 32-bit integers +corresponding to the array position within the list of commit OIDs. Due +to some special constants we use to track parents, we can store at most +(1 << 30) + (1 << 29) + (1 << 28) - 1 (around 1.8 billion) commits. + +== Commit graph files have the following format: + +In order to allow extensions that add extra data to the graph, we organize +the body into "chunks" and provide a binary lookup table at the beginning +of the body. The header includes certain values, such as number of chunks +and hash type. + +All 4-byte numbers are in network order. + +HEADER: + + 4-byte signature: + The signature is: {'C', 'G', 'P', 'H'} + + 1-byte version number: + Currently, the only valid version is 1. + + 1-byte Hash Version (1 = SHA-1) + We infer the hash length (H) from this value. + + 1-byte number (C) of "chunks" + + 1-byte (reserved for later use) + Current clients should ignore this value. + +CHUNK LOOKUP: + + (C + 1) * 12 bytes listing the table of contents for the chunks: + First 4 bytes describe the chunk id. Value 0 is a terminating label. + Other 8 bytes provide the byte-offset in current file for chunk to + start. (Chunks are ordered contiguously in the file, so you can infer + the length using the next chunk position if necessary.) Each chunk + ID appears at most once. + + The remaining data in the body is described one chunk at a time, and + these chunks may be given in any order. Chunks are required unless + otherwise specified. + +CHUNK DATA: + + OID Fanout (ID: {'O', 'I', 'D', 'F'}) (256 * 4 bytes) + The ith entry, F[i], stores the number of OIDs with first + byte at most i. Thus F[255] stores the total + number of commits (N). + + OID Lookup (ID: {'O', 'I', 'D', 'L'}) (N * H bytes) + The OIDs for all commits in the graph, sorted in ascending order. + + Commit Data (ID: {'C', 'D', 'A', 'T' }) (N * (H + 16) bytes) + * The first H bytes are for the OID of the root tree. + * The next 8 bytes are for the positions of the first two parents + of the ith commit. Stores value 0x7000000 if no parent in that + position. If there are more than two parents, the second value + has its most-significant bit on and the other bits store an array + position into the Large Edge List chunk. + * The next 8 bytes store the generation number of the commit and + the commit time in seconds since EPOCH. The generation number + uses the higher 30 bits of the first 4 bytes, while the commit + time uses the 32 bits of the second 4 bytes, along with the lowest + 2 bits of the lowest byte, storing the 33rd and 34th bit of the + commit time. + + Large Edge List (ID: {'E', 'D', 'G', 'E'}) [Optional] + This list of 4-byte values store the second through nth parents for + all octopus merges. The second parent value in the commit data stores + an array position within this list along with the most-significant bit + on. Starting at that array position, iterate through this list of commit + positions for the parents until reaching a value with the most-significant + bit on. The other bits correspond to the position of the last parent. + +TRAILER: + + H-byte HASH-checksum of all of the above. diff --git a/Documentation/technical/commit-graph.txt b/Documentation/technical/commit-graph.txt new file mode 100644 index 0000000000..c664acbd76 --- /dev/null +++ b/Documentation/technical/commit-graph.txt @@ -0,0 +1,160 @@ +Git Commit Graph Design Notes +============================= + +Git walks the commit graph for many reasons, including: + +1. Listing and filtering commit history. +2. Computing merge bases. + +These operations can become slow as the commit count grows. The merge +base calculation shows up in many user-facing commands, such as 'merge-base' +or 'status' and can take minutes to compute depending on history shape. + +There are two main costs here: + +1. Decompressing and parsing commits. +2. Walking the entire graph to satisfy topological order constraints. + +The commit graph file is a supplemental data structure that accelerates +commit graph walks. If a user downgrades or disables the 'core.commitGraph' +config setting, then the existing ODB is sufficient. The file is stored +as "commit-graph" either in the .git/objects/info directory or in the info +directory of an alternate. + +The commit graph file stores the commit graph structure along with some +extra metadata to speed up graph walks. By listing commit OIDs in lexi- +cographic order, we can identify an integer position for each commit and +refer to the parents of a commit using those integer positions. We use +binary search to find initial commits and then use the integer positions +for fast lookups during the walk. + +A consumer may load the following info for a commit from the graph: + +1. The commit OID. +2. The list of parents, along with their integer position. +3. The commit date. +4. The root tree OID. +5. The generation number (see definition below). + +Values 1-4 satisfy the requirements of parse_commit_gently(). + +Define the "generation number" of a commit recursively as follows: + + * A commit with no parents (a root commit) has generation number one. + + * A commit with at least one parent has generation number one more than + the largest generation number among its parents. + +Equivalently, the generation number of a commit A is one more than the +length of a longest path from A to a root commit. The recursive definition +is easier to use for computation and observing the following property: + + If A and B are commits with generation numbers N and M, respectively, + and N <= M, then A cannot reach B. That is, we know without searching + that B is not an ancestor of A because it is further from a root commit + than A. + + Conversely, when checking if A is an ancestor of B, then we only need + to walk commits until all commits on the walk boundary have generation + number at most N. If we walk commits using a priority queue seeded by + generation numbers, then we always expand the boundary commit with highest + generation number and can easily detect the stopping condition. + +This property can be used to significantly reduce the time it takes to +walk commits and determine topological relationships. Without generation +numbers, the general heuristic is the following: + + If A and B are commits with commit time X and Y, respectively, and + X < Y, then A _probably_ cannot reach B. + +This heuristic is currently used whenever the computation is allowed to +violate topological relationships due to clock skew (such as "git log" +with default order), but is not used when the topological order is +required (such as merge base calculations, "git log --graph"). + +In practice, we expect some commits to be created recently and not stored +in the commit graph. We can treat these commits as having "infinite" +generation number and walk until reaching commits with known generation +number. + +We use the macro GENERATION_NUMBER_INFINITY = 0xFFFFFFFF to mark commits not +in the commit-graph file. If a commit-graph file was written by a version +of Git that did not compute generation numbers, then those commits will +have generation number represented by the macro GENERATION_NUMBER_ZERO = 0. + +Since the commit-graph file is closed under reachability, we can guarantee +the following weaker condition on all commits: + + If A and B are commits with generation numbers N amd M, respectively, + and N < M, then A cannot reach B. + +Note how the strict inequality differs from the inequality when we have +fully-computed generation numbers. Using strict inequality may result in +walking a few extra commits, but the simplicity in dealing with commits +with generation number *_INFINITY or *_ZERO is valuable. + +We use the macro GENERATION_NUMBER_MAX = 0x3FFFFFFF to for commits whose +generation numbers are computed to be at least this value. We limit at +this value since it is the largest value that can be stored in the +commit-graph file using the 30 bits available to generation numbers. This +presents another case where a commit can have generation number equal to +that of a parent. + +Design Details +-------------- + +- The commit graph file is stored in a file named 'commit-graph' in the + .git/objects/info directory. This could be stored in the info directory + of an alternate. + +- The core.commitGraph config setting must be on to consume graph files. + +- The file format includes parameters for the object ID hash function, + so a future change of hash algorithm does not require a change in format. + +Future Work +----------- + +- The commit graph feature currently does not honor commit grafts. This can + be remedied by duplicating or refactoring the current graft logic. + +- After computing and storing generation numbers, we must make graph + walks aware of generation numbers to gain the performance benefits they + enable. This will mostly be accomplished by swapping a commit-date-ordered + priority queue with one ordered by generation number. The following + operations are important candidates: + + - 'log --topo-order' + - 'tag --merged' + +- A server could provide a commit graph file as part of the network protocol + to avoid extra calculations by clients. This feature is only of benefit if + the user is willing to trust the file, because verifying the file is correct + is as hard as computing it from scratch. + +Related Links +------------- +[0] https://bugs.chromium.org/p/git/issues/detail?id=8 + Chromium work item for: Serialized Commit Graph + +[1] https://public-inbox.org/git/20110713070517.GC18566@sigill.intra.peff.net/ + An abandoned patch that introduced generation numbers. + +[2] https://public-inbox.org/git/20170908033403.q7e6dj7benasrjes@sigill.intra.peff.net/ + Discussion about generation numbers on commits and how they interact + with fsck. + +[3] https://public-inbox.org/git/20170908034739.4op3w4f2ma5s65ku@sigill.intra.peff.net/ + More discussion about generation numbers and not storing them inside + commit objects. A valuable quote: + + "I think we should be moving more in the direction of keeping + repo-local caches for optimizations. Reachability bitmaps have been + a big performance win. I think we should be doing the same with our + properties of commits. Not just generation numbers, but making it + cheap to access the graph structure without zlib-inflating whole + commit objects (i.e., packv4 or something like the "metapacks" I + proposed a few years ago)." + +[4] https://public-inbox.org/git/20180108154822.54829-1-git@jeffhostetler.com/T/#u + A patch to remove the ahead-behind calculation from 'status'. diff --git a/Documentation/technical/directory-rename-detection.txt b/Documentation/technical/directory-rename-detection.txt new file mode 100644 index 0000000000..1c0086e287 --- /dev/null +++ b/Documentation/technical/directory-rename-detection.txt @@ -0,0 +1,115 @@ +Directory rename detection +========================== + +Rename detection logic in diffcore-rename that checks for renames of +individual files is aggregated and analyzed in merge-recursive for cases +where combinations of renames indicate that a full directory has been +renamed. + +Scope of abilities +------------------ + +It is perhaps easiest to start with an example: + + * When all of x/a, x/b and x/c have moved to z/a, z/b and z/c, it is + likely that x/d added in the meantime would also want to move to z/d by + taking the hint that the entire directory 'x' moved to 'z'. + +More interesting possibilities exist, though, such as: + + * one side of history renames x -> z, and the other renames some file to + x/e, causing the need for the merge to do a transitive rename. + + * one side of history renames x -> z, but also renames all files within + x. For example, x/a -> z/alpha, x/b -> z/bravo, etc. + + * both 'x' and 'y' being merged into a single directory 'z', with a + directory rename being detected for both x->z and y->z. + + * not all files in a directory being renamed to the same location; + i.e. perhaps most the files in 'x' are now found under 'z', but a few + are found under 'w'. + + * a directory being renamed, which also contained a subdirectory that was + renamed to some entirely different location. (And perhaps the inner + directory itself contained inner directories that were renamed to yet + other locations). + + * combinations of the above; see t/t6043-merge-rename-directories.sh for + various interesting cases. + +Limitations -- applicability of directory renames +------------------------------------------------- + +In order to prevent edge and corner cases resulting in either conflicts +that cannot be represented in the index or which might be too complex for +users to try to understand and resolve, a couple basic rules limit when +directory rename detection applies: + + 1) If a given directory still exists on both sides of a merge, we do + not consider it to have been renamed. + + 2) If a subset of to-be-renamed files have a file or directory in the + way (or would be in the way of each other), "turn off" the directory + rename for those specific sub-paths and report the conflict to the + user. + + 3) If the other side of history did a directory rename to a path that + your side of history renamed away, then ignore that particular + rename from the other side of history for any implicit directory + renames (but warn the user). + +Limitations -- detailed rules and testcases +------------------------------------------- + +t/t6043-merge-rename-directories.sh contains extensive tests and commentary +which generate and explore the rules listed above. It also lists a few +additional rules: + + a) If renames split a directory into two or more others, the directory + with the most renames, "wins". + + b) Avoid directory-rename-detection for a path, if that path is the + source of a rename on either side of a merge. + + c) Only apply implicit directory renames to directories if the other side + of history is the one doing the renaming. + +Limitations -- support in different commands +-------------------------------------------- + +Directory rename detection is supported by 'merge' and 'cherry-pick'. +Other git commands which users might be surprised to see limited or no +directory rename detection support in: + + * diff + + Folks have requested in the past that `git diff` detect directory + renames and somehow simplify its output. It is not clear whether this + would be desirable or how the output should be simplified, so this was + simply not implemented. Further, to implement this, directory rename + detection logic would need to move from merge-recursive to + diffcore-rename. + + * am + + git-am tries to avoid a full three way merge, instead calling + git-apply. That prevents us from detecting renames at all, which may + defeat the directory rename detection. There is a fallback, though; if + the initial git-apply fails and the user has specified the -3 option, + git-am will fall back to a three way merge. However, git-am lacks the + necessary information to do a "real" three way merge. Instead, it has + to use build_fake_ancestor() to get a merge base that is missing files + whose rename may have been important to detect for directory rename + detection to function. + + * rebase + + Since am-based rebases work by first generating a bunch of patches + (which no longer record what the original commits were and thus don't + have the necessary info from which we can find a real merge-base), and + then calling git-am, this implies that am-based rebases will not always + successfully detect directory renames either (see the 'am' section + above). merged-based rebases (rebase -m) and cherry-pick-based rebases + (rebase -i) are not affected by this shortcoming, and fully support + directory rename detection. diff --git a/Documentation/technical/hash-function-transition.txt b/Documentation/technical/hash-function-transition.txt new file mode 100644 index 0000000000..bc2ace2a6e --- /dev/null +++ b/Documentation/technical/hash-function-transition.txt @@ -0,0 +1,827 @@ +Git hash function transition +============================ + +Objective +--------- +Migrate Git from SHA-1 to a stronger hash function. + +Background +---------- +At its core, the Git version control system is a content addressable +filesystem. It uses the SHA-1 hash function to name content. For +example, files, directories, and revisions are referred to by hash +values unlike in other traditional version control systems where files +or versions are referred to via sequential numbers. The use of a hash +function to address its content delivers a few advantages: + +* Integrity checking is easy. Bit flips, for example, are easily + detected, as the hash of corrupted content does not match its name. +* Lookup of objects is fast. + +Using a cryptographically secure hash function brings additional +advantages: + +* Object names can be signed and third parties can trust the hash to + address the signed object and all objects it references. +* Communication using Git protocol and out of band communication + methods have a short reliable string that can be used to reliably + address stored content. + +Over time some flaws in SHA-1 have been discovered by security +researchers. On 23 February 2017 the SHAttered attack +(https://shattered.io) demonstrated a practical SHA-1 hash collision. + +Git v2.13.0 and later subsequently moved to a hardened SHA-1 +implementation by default, which isn't vulnerable to the SHAttered +attack. + +Thus Git has in effect already migrated to a new hash that isn't SHA-1 +and doesn't share its vulnerabilities, its new hash function just +happens to produce exactly the same output for all known inputs, +except two PDFs published by the SHAttered researchers, and the new +implementation (written by those researchers) claims to detect future +cryptanalytic collision attacks. + +Regardless, it's considered prudent to move past any variant of SHA-1 +to a new hash. There's no guarantee that future attacks on SHA-1 won't +be published in the future, and those attacks may not have viable +mitigations. + +If SHA-1 and its variants were to be truly broken, Git's hash function +could not be considered cryptographically secure any more. This would +impact the communication of hash values because we could not trust +that a given hash value represented the known good version of content +that the speaker intended. + +SHA-1 still possesses the other properties such as fast object lookup +and safe error checking, but other hash functions are equally suitable +that are believed to be cryptographically secure. + +Goals +----- +1. The transition to SHA-256 can be done one local repository at a time. + a. Requiring no action by any other party. + b. A SHA-256 repository can communicate with SHA-1 Git servers + (push/fetch). + c. Users can use SHA-1 and SHA-256 identifiers for objects + interchangeably (see "Object names on the command line", below). + d. New signed objects make use of a stronger hash function than + SHA-1 for their security guarantees. +2. Allow a complete transition away from SHA-1. + a. Local metadata for SHA-1 compatibility can be removed from a + repository if compatibility with SHA-1 is no longer needed. +3. Maintainability throughout the process. + a. The object format is kept simple and consistent. + b. Creation of a generalized repository conversion tool. + +Non-Goals +--------- +1. Add SHA-256 support to Git protocol. This is valuable and the + logical next step but it is out of scope for this initial design. +2. Transparently improving the security of existing SHA-1 signed + objects. +3. Intermixing objects using multiple hash functions in a single + repository. +4. Taking the opportunity to fix other bugs in Git's formats and + protocols. +5. Shallow clones and fetches into a SHA-256 repository. (This will + change when we add SHA-256 support to Git protocol.) +6. Skip fetching some submodules of a project into a SHA-256 + repository. (This also depends on SHA-256 support in Git + protocol.) + +Overview +-------- +We introduce a new repository format extension. Repositories with this +extension enabled use SHA-256 instead of SHA-1 to name their objects. +This affects both object names and object content --- both the names +of objects and all references to other objects within an object are +switched to the new hash function. + +SHA-256 repositories cannot be read by older versions of Git. + +Alongside the packfile, a SHA-256 repository stores a bidirectional +mapping between SHA-256 and SHA-1 object names. The mapping is generated +locally and can be verified using "git fsck". Object lookups use this +mapping to allow naming objects using either their SHA-1 and SHA-256 names +interchangeably. + +"git cat-file" and "git hash-object" gain options to display an object +in its sha1 form and write an object given its sha1 form. This +requires all objects referenced by that object to be present in the +object database so that they can be named using the appropriate name +(using the bidirectional hash mapping). + +Fetches from a SHA-1 based server convert the fetched objects into +SHA-256 form and record the mapping in the bidirectional mapping table +(see below for details). Pushes to a SHA-1 based server convert the +objects being pushed into sha1 form so the server does not have to be +aware of the hash function the client is using. + +Detailed Design +--------------- +Repository format extension +~~~~~~~~~~~~~~~~~~~~~~~~~~~ +A SHA-256 repository uses repository format version `1` (see +Documentation/technical/repository-version.txt) with extensions +`objectFormat` and `compatObjectFormat`: + + [core] + repositoryFormatVersion = 1 + [extensions] + objectFormat = sha256 + compatObjectFormat = sha1 + +The combination of setting `core.repositoryFormatVersion=1` and +populating `extensions.*` ensures that all versions of Git later than +`v0.99.9l` will die instead of trying to operate on the SHA-256 +repository, instead producing an error message. + + # Between v0.99.9l and v2.7.0 + $ git status + fatal: Expected git repo version <= 0, found 1 + # After v2.7.0 + $ git status + fatal: unknown repository extensions found: + objectformat + compatobjectformat + +See the "Transition plan" section below for more details on these +repository extensions. + +Object names +~~~~~~~~~~~~ +Objects can be named by their 40 hexadecimal digit sha1-name or 64 +hexadecimal digit sha256-name, plus names derived from those (see +gitrevisions(7)). + +The sha1-name of an object is the SHA-1 of the concatenation of its +type, length, a nul byte, and the object's sha1-content. This is the +traditional <sha1> used in Git to name objects. + +The sha256-name of an object is the SHA-256 of the concatenation of its +type, length, a nul byte, and the object's sha256-content. + +Object format +~~~~~~~~~~~~~ +The content as a byte sequence of a tag, commit, or tree object named +by sha1 and sha256 differ because an object named by sha256-name refers to +other objects by their sha256-names and an object named by sha1-name +refers to other objects by their sha1-names. + +The sha256-content of an object is the same as its sha1-content, except +that objects referenced by the object are named using their sha256-names +instead of sha1-names. Because a blob object does not refer to any +other object, its sha1-content and sha256-content are the same. + +The format allows round-trip conversion between sha256-content and +sha1-content. + +Object storage +~~~~~~~~~~~~~~ +Loose objects use zlib compression and packed objects use the packed +format described in Documentation/technical/pack-format.txt, just like +today. The content that is compressed and stored uses sha256-content +instead of sha1-content. + +Pack index +~~~~~~~~~~ +Pack index (.idx) files use a new v3 format that supports multiple +hash functions. They have the following format (all integers are in +network byte order): + +- A header appears at the beginning and consists of the following: + - The 4-byte pack index signature: '\377t0c' + - 4-byte version number: 3 + - 4-byte length of the header section, including the signature and + version number + - 4-byte number of objects contained in the pack + - 4-byte number of object formats in this pack index: 2 + - For each object format: + - 4-byte format identifier (e.g., 'sha1' for SHA-1) + - 4-byte length in bytes of shortened object names. This is the + shortest possible length needed to make names in the shortened + object name table unambiguous. + - 4-byte integer, recording where tables relating to this format + are stored in this index file, as an offset from the beginning. + - 4-byte offset to the trailer from the beginning of this file. + - Zero or more additional key/value pairs (4-byte key, 4-byte + value). Only one key is supported: 'PSRC'. See the "Loose objects + and unreachable objects" section for supported values and how this + is used. All other keys are reserved. Readers must ignore + unrecognized keys. +- Zero or more NUL bytes. This can optionally be used to improve the + alignment of the full object name table below. +- Tables for the first object format: + - A sorted table of shortened object names. These are prefixes of + the names of all objects in this pack file, packed together + without offset values to reduce the cache footprint of the binary + search for a specific object name. + + - A table of full object names in pack order. This allows resolving + a reference to "the nth object in the pack file" (from a + reachability bitmap or from the next table of another object + format) to its object name. + + - A table of 4-byte values mapping object name order to pack order. + For an object in the table of sorted shortened object names, the + value at the corresponding index in this table is the index in the + previous table for that same object. + + This can be used to look up the object in reachability bitmaps or + to look up its name in another object format. + + - A table of 4-byte CRC32 values of the packed object data, in the + order that the objects appear in the pack file. This is to allow + compressed data to be copied directly from pack to pack during + repacking without undetected data corruption. + + - A table of 4-byte offset values. For an object in the table of + sorted shortened object names, the value at the corresponding + index in this table indicates where that object can be found in + the pack file. These are usually 31-bit pack file offsets, but + large offsets are encoded as an index into the next table with the + most significant bit set. + + - A table of 8-byte offset entries (empty for pack files less than + 2 GiB). Pack files are organized with heavily used objects toward + the front, so most object references should not need to refer to + this table. +- Zero or more NUL bytes. +- Tables for the second object format, with the same layout as above, + up to and not including the table of CRC32 values. +- Zero or more NUL bytes. +- The trailer consists of the following: + - A copy of the 20-byte SHA-256 checksum at the end of the + corresponding packfile. + + - 20-byte SHA-256 checksum of all of the above. + +Loose object index +~~~~~~~~~~~~~~~~~~ +A new file $GIT_OBJECT_DIR/loose-object-idx contains information about +all loose objects. Its format is + + # loose-object-idx + (sha256-name SP sha1-name LF)* + +where the object names are in hexadecimal format. The file is not +sorted. + +The loose object index is protected against concurrent writes by a +lock file $GIT_OBJECT_DIR/loose-object-idx.lock. To add a new loose +object: + +1. Write the loose object to a temporary file, like today. +2. Open loose-object-idx.lock with O_CREAT | O_EXCL to acquire the lock. +3. Rename the loose object into place. +4. Open loose-object-idx with O_APPEND and write the new object +5. Unlink loose-object-idx.lock to release the lock. + +To remove entries (e.g. in "git pack-refs" or "git-prune"): + +1. Open loose-object-idx.lock with O_CREAT | O_EXCL to acquire the + lock. +2. Write the new content to loose-object-idx.lock. +3. Unlink any loose objects being removed. +4. Rename to replace loose-object-idx, releasing the lock. + +Translation table +~~~~~~~~~~~~~~~~~ +The index files support a bidirectional mapping between sha1-names +and sha256-names. The lookup proceeds similarly to ordinary object +lookups. For example, to convert a sha1-name to a sha256-name: + + 1. Look for the object in idx files. If a match is present in the + idx's sorted list of truncated sha1-names, then: + a. Read the corresponding entry in the sha1-name order to pack + name order mapping. + b. Read the corresponding entry in the full sha1-name table to + verify we found the right object. If it is, then + c. Read the corresponding entry in the full sha256-name table. + That is the object's sha256-name. + 2. Check for a loose object. Read lines from loose-object-idx until + we find a match. + +Step (1) takes the same amount of time as an ordinary object lookup: +O(number of packs * log(objects per pack)). Step (2) takes O(number of +loose objects) time. To maintain good performance it will be necessary +to keep the number of loose objects low. See the "Loose objects and +unreachable objects" section below for more details. + +Since all operations that make new objects (e.g., "git commit") add +the new objects to the corresponding index, this mapping is possible +for all objects in the object store. + +Reading an object's sha1-content +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ +The sha1-content of an object can be read by converting all sha256-names +its sha256-content references to sha1-names using the translation table. + +Fetch +~~~~~ +Fetching from a SHA-1 based server requires translating between SHA-1 +and SHA-256 based representations on the fly. + +SHA-1s named in the ref advertisement that are present on the client +can be translated to SHA-256 and looked up as local objects using the +translation table. + +Negotiation proceeds as today. Any "have"s generated locally are +converted to SHA-1 before being sent to the server, and SHA-1s +mentioned by the server are converted to SHA-256 when looking them up +locally. + +After negotiation, the server sends a packfile containing the +requested objects. We convert the packfile to SHA-256 format using +the following steps: + +1. index-pack: inflate each object in the packfile and compute its + SHA-1. Objects can contain deltas in OBJ_REF_DELTA format against + objects the client has locally. These objects can be looked up + using the translation table and their sha1-content read as + described above to resolve the deltas. +2. topological sort: starting at the "want"s from the negotiation + phase, walk through objects in the pack and emit a list of them, + excluding blobs, in reverse topologically sorted order, with each + object coming later in the list than all objects it references. + (This list only contains objects reachable from the "wants". If the + pack from the server contained additional extraneous objects, then + they will be discarded.) +3. convert to sha256: open a new (sha256) packfile. Read the topologically + sorted list just generated. For each object, inflate its + sha1-content, convert to sha256-content, and write it to the sha256 + pack. Record the new sha1<->sha256 mapping entry for use in the idx. +4. sort: reorder entries in the new pack to match the order of objects + in the pack the server generated and include blobs. Write a sha256 idx + file +5. clean up: remove the SHA-1 based pack file, index, and + topologically sorted list obtained from the server in steps 1 + and 2. + +Step 3 requires every object referenced by the new object to be in the +translation table. This is why the topological sort step is necessary. + +As an optimization, step 1 could write a file describing what non-blob +objects each object it has inflated from the packfile references. This +makes the topological sort in step 2 possible without inflating the +objects in the packfile for a second time. The objects need to be +inflated again in step 3, for a total of two inflations. + +Step 4 is probably necessary for good read-time performance. "git +pack-objects" on the server optimizes the pack file for good data +locality (see Documentation/technical/pack-heuristics.txt). + +Details of this process are likely to change. It will take some +experimenting to get this to perform well. + +Push +~~~~ +Push is simpler than fetch because the objects referenced by the +pushed objects are already in the translation table. The sha1-content +of each object being pushed can be read as described in the "Reading +an object's sha1-content" section to generate the pack written by git +send-pack. + +Signed Commits +~~~~~~~~~~~~~~ +We add a new field "gpgsig-sha256" to the commit object format to allow +signing commits without relying on SHA-1. It is similar to the +existing "gpgsig" field. Its signed payload is the sha256-content of the +commit object with any "gpgsig" and "gpgsig-sha256" fields removed. + +This means commits can be signed +1. using SHA-1 only, as in existing signed commit objects +2. using both SHA-1 and SHA-256, by using both gpgsig-sha256 and gpgsig + fields. +3. using only SHA-256, by only using the gpgsig-sha256 field. + +Old versions of "git verify-commit" can verify the gpgsig signature in +cases (1) and (2) without modifications and view case (3) as an +ordinary unsigned commit. + +Signed Tags +~~~~~~~~~~~ +We add a new field "gpgsig-sha256" to the tag object format to allow +signing tags without relying on SHA-1. Its signed payload is the +sha256-content of the tag with its gpgsig-sha256 field and "-----BEGIN PGP +SIGNATURE-----" delimited in-body signature removed. + +This means tags can be signed +1. using SHA-1 only, as in existing signed tag objects +2. using both SHA-1 and SHA-256, by using gpgsig-sha256 and an in-body + signature. +3. using only SHA-256, by only using the gpgsig-sha256 field. + +Mergetag embedding +~~~~~~~~~~~~~~~~~~ +The mergetag field in the sha1-content of a commit contains the +sha1-content of a tag that was merged by that commit. + +The mergetag field in the sha256-content of the same commit contains the +sha256-content of the same tag. + +Submodules +~~~~~~~~~~ +To convert recorded submodule pointers, you need to have the converted +submodule repository in place. The translation table of the submodule +can be used to look up the new hash. + +Loose objects and unreachable objects +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ +Fast lookups in the loose-object-idx require that the number of loose +objects not grow too high. + +"git gc --auto" currently waits for there to be 6700 loose objects +present before consolidating them into a packfile. We will need to +measure to find a more appropriate threshold for it to use. + +"git gc --auto" currently waits for there to be 50 packs present +before combining packfiles. Packing loose objects more aggressively +may cause the number of pack files to grow too quickly. This can be +mitigated by using a strategy similar to Martin Fick's exponential +rolling garbage collection script: +https://gerrit-review.googlesource.com/c/gerrit/+/35215 + +"git gc" currently expels any unreachable objects it encounters in +pack files to loose objects in an attempt to prevent a race when +pruning them (in case another process is simultaneously writing a new +object that refers to the about-to-be-deleted object). This leads to +an explosion in the number of loose objects present and disk space +usage due to the objects in delta form being replaced with independent +loose objects. Worse, the race is still present for loose objects. + +Instead, "git gc" will need to move unreachable objects to a new +packfile marked as UNREACHABLE_GARBAGE (using the PSRC field; see +below). To avoid the race when writing new objects referring to an +about-to-be-deleted object, code paths that write new objects will +need to copy any objects from UNREACHABLE_GARBAGE packs that they +refer to to new, non-UNREACHABLE_GARBAGE packs (or loose objects). +UNREACHABLE_GARBAGE are then safe to delete if their creation time (as +indicated by the file's mtime) is long enough ago. + +To avoid a proliferation of UNREACHABLE_GARBAGE packs, they can be +combined under certain circumstances. If "gc.garbageTtl" is set to +greater than one day, then packs created within a single calendar day, +UTC, can be coalesced together. The resulting packfile would have an +mtime before midnight on that day, so this makes the effective maximum +ttl the garbageTtl + 1 day. If "gc.garbageTtl" is less than one day, +then we divide the calendar day into intervals one-third of that ttl +in duration. Packs created within the same interval can be coalesced +together. The resulting packfile would have an mtime before the end of +the interval, so this makes the effective maximum ttl equal to the +garbageTtl * 4/3. + +This rule comes from Thirumala Reddy Mutchukota's JGit change +https://git.eclipse.org/r/90465. + +The UNREACHABLE_GARBAGE setting goes in the PSRC field of the pack +index. More generally, that field indicates where a pack came from: + + - 1 (PACK_SOURCE_RECEIVE) for a pack received over the network + - 2 (PACK_SOURCE_AUTO) for a pack created by a lightweight + "gc --auto" operation + - 3 (PACK_SOURCE_GC) for a pack created by a full gc + - 4 (PACK_SOURCE_UNREACHABLE_GARBAGE) for potential garbage + discovered by gc + - 5 (PACK_SOURCE_INSERT) for locally created objects that were + written directly to a pack file, e.g. from "git add ." + +This information can be useful for debugging and for "gc --auto" to +make appropriate choices about which packs to coalesce. + +Caveats +------- +Invalid objects +~~~~~~~~~~~~~~~ +The conversion from sha1-content to sha256-content retains any +brokenness in the original object (e.g., tree entry modes encoded with +leading 0, tree objects whose paths are not sorted correctly, and +commit objects without an author or committer). This is a deliberate +feature of the design to allow the conversion to round-trip. + +More profoundly broken objects (e.g., a commit with a truncated "tree" +header line) cannot be converted but were not usable by current Git +anyway. + +Shallow clone and submodules +~~~~~~~~~~~~~~~~~~~~~~~~~~~~ +Because it requires all referenced objects to be available in the +locally generated translation table, this design does not support +shallow clone or unfetched submodules. Protocol improvements might +allow lifting this restriction. + +Alternates +~~~~~~~~~~ +For the same reason, a sha256 repository cannot borrow objects from a +sha1 repository using objects/info/alternates or +$GIT_ALTERNATE_OBJECT_REPOSITORIES. + +git notes +~~~~~~~~~ +The "git notes" tool annotates objects using their sha1-name as key. +This design does not describe a way to migrate notes trees to use +sha256-names. That migration is expected to happen separately (for +example using a file at the root of the notes tree to describe which +hash it uses). + +Server-side cost +~~~~~~~~~~~~~~~~ +Until Git protocol gains SHA-256 support, using SHA-256 based storage +on public-facing Git servers is strongly discouraged. Once Git +protocol gains SHA-256 support, SHA-256 based servers are likely not +to support SHA-1 compatibility, to avoid what may be a very expensive +hash reencode during clone and to encourage peers to modernize. + +The design described here allows fetches by SHA-1 clients of a +personal SHA-256 repository because it's not much more difficult than +allowing pushes from that repository. This support needs to be guarded +by a configuration option --- servers like git.kernel.org that serve a +large number of clients would not be expected to bear that cost. + +Meaning of signatures +~~~~~~~~~~~~~~~~~~~~~ +The signed payload for signed commits and tags does not explicitly +name the hash used to identify objects. If some day Git adopts a new +hash function with the same length as the current SHA-1 (40 +hexadecimal digit) or SHA-256 (64 hexadecimal digit) objects then the +intent behind the PGP signed payload in an object signature is +unclear: + + object e7e07d5a4fcc2a203d9873968ad3e6bd4d7419d7 + type commit + tag v2.12.0 + tagger Junio C Hamano <gitster@pobox.com> 1487962205 -0800 + + Git 2.12 + +Does this mean Git v2.12.0 is the commit with sha1-name +e7e07d5a4fcc2a203d9873968ad3e6bd4d7419d7 or the commit with +new-40-digit-hash-name e7e07d5a4fcc2a203d9873968ad3e6bd4d7419d7? + +Fortunately SHA-256 and SHA-1 have different lengths. If Git starts +using another hash with the same length to name objects, then it will +need to change the format of signed payloads using that hash to +address this issue. + +Object names on the command line +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ +To support the transition (see Transition plan below), this design +supports four different modes of operation: + + 1. ("dark launch") Treat object names input by the user as SHA-1 and + convert any object names written to output to SHA-1, but store + objects using SHA-256. This allows users to test the code with no + visible behavior change except for performance. This allows + allows running even tests that assume the SHA-1 hash function, to + sanity-check the behavior of the new mode. + + 2. ("early transition") Allow both SHA-1 and SHA-256 object names in + input. Any object names written to output use SHA-1. This allows + users to continue to make use of SHA-1 to communicate with peers + (e.g. by email) that have not migrated yet and prepares for mode 3. + + 3. ("late transition") Allow both SHA-1 and SHA-256 object names in + input. Any object names written to output use SHA-256. In this + mode, users are using a more secure object naming method by + default. The disruption is minimal as long as most of their peers + are in mode 2 or mode 3. + + 4. ("post-transition") Treat object names input by the user as + SHA-256 and write output using SHA-256. This is safer than mode 3 + because there is less risk that input is incorrectly interpreted + using the wrong hash function. + +The mode is specified in configuration. + +The user can also explicitly specify which format to use for a +particular revision specifier and for output, overriding the mode. For +example: + +git --output-format=sha1 log abac87a^{sha1}..f787cac^{sha256} + +Choice of Hash +-------------- +In early 2005, around the time that Git was written, Xiaoyun Wang, +Yiqun Lisa Yin, and Hongbo Yu announced an attack finding SHA-1 +collisions in 2^69 operations. In August they published details. +Luckily, no practical demonstrations of a collision in full SHA-1 were +published until 10 years later, in 2017. + +Git v2.13.0 and later subsequently moved to a hardened SHA-1 +implementation by default that mitigates the SHAttered attack, but +SHA-1 is still believed to be weak. + +The hash to replace this hardened SHA-1 should be stronger than SHA-1 +was: we would like it to be trustworthy and useful in practice for at +least 10 years. + +Some other relevant properties: + +1. A 256-bit hash (long enough to match common security practice; not + excessively long to hurt performance and disk usage). + +2. High quality implementations should be widely available (e.g., in + OpenSSL and Apple CommonCrypto). + +3. The hash function's properties should match Git's needs (e.g. Git + requires collision and 2nd preimage resistance and does not require + length extension resistance). + +4. As a tiebreaker, the hash should be fast to compute (fortunately + many contenders are faster than SHA-1). + +We choose SHA-256. + +Transition plan +--------------- +Some initial steps can be implemented independently of one another: +- adding a hash function API (vtable) +- teaching fsck to tolerate the gpgsig-sha256 field +- excluding gpgsig-* from the fields copied by "git commit --amend" +- annotating tests that depend on SHA-1 values with a SHA1 test + prerequisite +- using "struct object_id", GIT_MAX_RAWSZ, and GIT_MAX_HEXSZ + consistently instead of "unsigned char *" and the hardcoded + constants 20 and 40. +- introducing index v3 +- adding support for the PSRC field and safer object pruning + + +The first user-visible change is the introduction of the objectFormat +extension (without compatObjectFormat). This requires: +- implementing the loose-object-idx +- teaching fsck about this mode of operation +- using the hash function API (vtable) when computing object names +- signing objects and verifying signatures +- rejecting attempts to fetch from or push to an incompatible + repository + +Next comes introduction of compatObjectFormat: +- translating object names between object formats +- translating object content between object formats +- generating and verifying signatures in the compat format +- adding appropriate index entries when adding a new object to the + object store +- --output-format option +- ^{sha1} and ^{sha256} revision notation +- configuration to specify default input and output format (see + "Object names on the command line" above) + +The next step is supporting fetches and pushes to SHA-1 repositories: +- allow pushes to a repository using the compat format +- generate a topologically sorted list of the SHA-1 names of fetched + objects +- convert the fetched packfile to sha256 format and generate an idx + file +- re-sort to match the order of objects in the fetched packfile + +The infrastructure supporting fetch also allows converting an existing +repository. In converted repositories and new clones, end users can +gain support for the new hash function without any visible change in +behavior (see "dark launch" in the "Object names on the command line" +section). In particular this allows users to verify SHA-256 signatures +on objects in the repository, and it should ensure the transition code +is stable in production in preparation for using it more widely. + +Over time projects would encourage their users to adopt the "early +transition" and then "late transition" modes to take advantage of the +new, more futureproof SHA-256 object names. + +When objectFormat and compatObjectFormat are both set, commands +generating signatures would generate both SHA-1 and SHA-256 signatures +by default to support both new and old users. + +In projects using SHA-256 heavily, users could be encouraged to adopt +the "post-transition" mode to avoid accidentally making implicit use +of SHA-1 object names. + +Once a critical mass of users have upgraded to a version of Git that +can verify SHA-256 signatures and have converted their existing +repositories to support verifying them, we can add support for a +setting to generate only SHA-256 signatures. This is expected to be at +least a year later. + +That is also a good moment to advertise the ability to convert +repositories to use SHA-256 only, stripping out all SHA-1 related +metadata. This improves performance by eliminating translation +overhead and security by avoiding the possibility of accidentally +relying on the safety of SHA-1. + +Updating Git's protocols to allow a server to specify which hash +functions it supports is also an important part of this transition. It +is not discussed in detail in this document but this transition plan +assumes it happens. :) + +Alternatives considered +----------------------- +Upgrading everyone working on a particular project on a flag day +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ +Projects like the Linux kernel are large and complex enough that +flipping the switch for all projects based on the repository at once +is infeasible. + +Not only would all developers and server operators supporting +developers have to switch on the same flag day, but supporting tooling +(continuous integration, code review, bug trackers, etc) would have to +be adapted as well. This also makes it difficult to get early feedback +from some project participants testing before it is time for mass +adoption. + +Using hash functions in parallel +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ +(e.g. https://public-inbox.org/git/22708.8913.864049.452252@chiark.greenend.org.uk/ ) +Objects newly created would be addressed by the new hash, but inside +such an object (e.g. commit) it is still possible to address objects +using the old hash function. +* You cannot trust its history (needed for bisectability) in the + future without further work +* Maintenance burden as the number of supported hash functions grows + (they will never go away, so they accumulate). In this proposal, by + comparison, converted objects lose all references to SHA-1. + +Signed objects with multiple hashes +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ +Instead of introducing the gpgsig-sha256 field in commit and tag objects +for sha256-content based signatures, an earlier version of this design +added "hash sha256 <sha256-name>" fields to strengthen the existing +sha1-content based signatures. + +In other words, a single signature was used to attest to the object +content using both hash functions. This had some advantages: +* Using one signature instead of two speeds up the signing process. +* Having one signed payload with both hashes allows the signer to + attest to the sha1-name and sha256-name referring to the same object. +* All users consume the same signature. Broken signatures are likely + to be detected quickly using current versions of git. + +However, it also came with disadvantages: +* Verifying a signed object requires access to the sha1-names of all + objects it references, even after the transition is complete and + translation table is no longer needed for anything else. To support + this, the design added fields such as "hash sha1 tree <sha1-name>" + and "hash sha1 parent <sha1-name>" to the sha256-content of a signed + commit, complicating the conversion process. +* Allowing signed objects without a sha1 (for after the transition is + complete) complicated the design further, requiring a "nohash sha1" + field to suppress including "hash sha1" fields in the sha256-content + and signed payload. + +Lazily populated translation table +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ +Some of the work of building the translation table could be deferred to +push time, but that would significantly complicate and slow down pushes. +Calculating the sha1-name at object creation time at the same time it is +being streamed to disk and having its sha256-name calculated should be +an acceptable cost. + +Document History +---------------- + +2017-03-03 +bmwill@google.com, jonathantanmy@google.com, jrnieder@gmail.com, +sbeller@google.com + +Initial version sent to +http://public-inbox.org/git/20170304011251.GA26789@aiede.mtv.corp.google.com + +2017-03-03 jrnieder@gmail.com +Incorporated suggestions from jonathantanmy and sbeller: +* describe purpose of signed objects with each hash type +* redefine signed object verification using object content under the + first hash function + +2017-03-06 jrnieder@gmail.com +* Use SHA3-256 instead of SHA2 (thanks, Linus and brian m. carlson).[1][2] +* Make sha3-based signatures a separate field, avoiding the need for + "hash" and "nohash" fields (thanks to peff[3]). +* Add a sorting phase to fetch (thanks to Junio for noticing the need + for this). +* Omit blobs from the topological sort during fetch (thanks to peff). +* Discuss alternates, git notes, and git servers in the caveats + section (thanks to Junio Hamano, brian m. carlson[4], and Shawn + Pearce). +* Clarify language throughout (thanks to various commenters, + especially Junio). + +2017-09-27 jrnieder@gmail.com, sbeller@google.com +* use placeholder NewHash instead of SHA3-256 +* describe criteria for picking a hash function. +* include a transition plan (thanks especially to Brandon Williams + for fleshing these ideas out) +* define the translation table (thanks, Shawn Pearce[5], Jonathan + Tan, and Masaya Suzuki) +* avoid loose object overhead by packing more aggressively in + "git gc --auto" + +Later history: + + See the history of this file in git.git for the history of subsequent + edits. This document history is no longer being maintained as it + would now be superfluous to the commit log + +[1] http://public-inbox.org/git/CA+55aFzJtejiCjV0e43+9oR3QuJK2PiFiLQemytoLpyJWe6P9w@mail.gmail.com/ +[2] http://public-inbox.org/git/CA+55aFz+gkAsDZ24zmePQuEs1XPS9BP_s8O7Q4wQ7LV7X5-oDA@mail.gmail.com/ +[3] http://public-inbox.org/git/20170306084353.nrns455dvkdsfgo5@sigill.intra.peff.net/ +[4] http://public-inbox.org/git/20170304224936.rqqtkdvfjgyezsht@genre.crustytoothpaste.net +[5] https://public-inbox.org/git/CAJo=hJtoX9=AyLHHpUJS7fueV9ciZ_MNpnEPHUz8Whui6g9F0A@mail.gmail.com/ diff --git a/Documentation/technical/http-protocol.txt b/Documentation/technical/http-protocol.txt index 229f845dfa..9c5b6f0fac 100644 --- a/Documentation/technical/http-protocol.txt +++ b/Documentation/technical/http-protocol.txt @@ -214,10 +214,16 @@ smart server reply: S: Cache-Control: no-cache S: S: 001e# service=git-upload-pack\n + S: 0000 S: 004895dcfa3633004da0049d3d0fa03f80589cbcaf31 refs/heads/maint\0multi_ack\n S: 0042d049f6c27a2244e12041955e262a404c7faba355 refs/heads/master\n S: 003c2cb58b79488a98d2721cea644875a8dd0026b115 refs/tags/v1.0\n S: 003fa3c2e2402b99163d1d59756e5f207ae21cccba4c refs/tags/v1.0^{}\n + S: 0000 + +The client may send Extra Parameters (see +Documentation/technical/pack-protocol.txt) as a colon-separated string +in the Git-Protocol HTTP header. Dumb Server Response ^^^^^^^^^^^^^^^^^^^^ @@ -269,7 +275,12 @@ the C locale ordering. The stream SHOULD include the default ref named `HEAD` as the first ref. The stream MUST include capability declarations behind a NUL on the first ref. +The returned response contains "version 1" if "version=1" was sent as an +Extra Parameter. + smart_reply = PKT-LINE("# service=$servicename" LF) + "0000" + *1("version 1") ref_list "0000" ref_list = empty_list / non_empty_list @@ -319,18 +330,19 @@ Servers SHOULD support all capabilities defined here. Clients MUST send at least one "want" command in the request body. Clients MUST NOT reference an id in a "want" command which did not appear in the response obtained through ref discovery unless the -server advertises capability `allow-tip-sha1-in-want`. +server advertises capability `allow-tip-sha1-in-want` or +`allow-reachable-sha1-in-want`. compute_request = want_list have_list request_end request_end = "0000" / "done" - want_list = PKT-LINE(want NUL cap_list LF) + want_list = PKT-LINE(want SP cap_list LF) *(want_pkt) want_pkt = PKT-LINE(want LF) want = "want" SP id - cap_list = *(SP capability) SP + cap_list = capability *(SP capability) have_list = *PKT-LINE("have" SP id LF) diff --git a/Documentation/technical/index-format.txt b/Documentation/technical/index-format.txt index 35112e4966..db3572626b 100644 --- a/Documentation/technical/index-format.txt +++ b/Documentation/technical/index-format.txt @@ -170,7 +170,7 @@ Git index format The entries are written out in the top-down, depth-first order. The first entry represents the root level of the repository, followed by the - first subtree---let's call this A---of the root level (with its name + first subtree--let's call this A--of the root level (with its name relative to the root level), followed by the first subtree of A (with its name relative to A), ... @@ -233,3 +233,84 @@ Git index format The remaining index entries after replaced ones will be added to the final index. These added entries are also sorted by entry name then stage. + +== Untracked cache + + Untracked cache saves the untracked file list and necessary data to + verify the cache. The signature for this extension is { 'U', 'N', + 'T', 'R' }. + + The extension starts with + + - A sequence of NUL-terminated strings, preceded by the size of the + sequence in variable width encoding. Each string describes the + environment where the cache can be used. + + - Stat data of $GIT_DIR/info/exclude. See "Index entry" section from + ctime field until "file size". + + - Stat data of core.excludesfile + + - 32-bit dir_flags (see struct dir_struct) + + - 160-bit SHA-1 of $GIT_DIR/info/exclude. Null SHA-1 means the file + does not exist. + + - 160-bit SHA-1 of core.excludesfile. Null SHA-1 means the file does + not exist. + + - NUL-terminated string of per-dir exclude file name. This usually + is ".gitignore". + + - The number of following directory blocks, variable width + encoding. If this number is zero, the extension ends here with a + following NUL. + + - A number of directory blocks in depth-first-search order, each + consists of + + - The number of untracked entries, variable width encoding. + + - The number of sub-directory blocks, variable width encoding. + + - The directory name terminated by NUL. + + - A number of untracked file/dir names terminated by NUL. + +The remaining data of each directory block is grouped by type: + + - An ewah bitmap, the n-th bit marks whether the n-th directory has + valid untracked cache entries. + + - An ewah bitmap, the n-th bit records "check-only" bit of + read_directory_recursive() for the n-th directory. + + - An ewah bitmap, the n-th bit indicates whether SHA-1 and stat data + is valid for the n-th directory and exists in the next data. + + - An array of stat data. The n-th data corresponds with the n-th + "one" bit in the previous ewah bitmap. + + - An array of SHA-1. The n-th SHA-1 corresponds with the n-th "one" bit + in the previous ewah bitmap. + + - One NUL. + +== File System Monitor cache + + The file system monitor cache tracks files for which the core.fsmonitor + hook has told us about changes. The signature for this extension is + { 'F', 'S', 'M', 'N' }. + + The extension starts with + + - 32-bit version number: the current supported version is 1. + + - 64-bit time: the extension data reflects all changes through the given + time which is stored as the nanoseconds elapsed since midnight, + January 1, 1970. + + - 32-bit bitmap size: the size of the CE_FSMONITOR_VALID bitmap. + + - An ewah bitmap, the n-th bit indicates whether the n-th index entry + is not CE_FSMONITOR_VALID. diff --git a/Documentation/technical/long-running-process-protocol.txt b/Documentation/technical/long-running-process-protocol.txt new file mode 100644 index 0000000000..aa0aa9af1c --- /dev/null +++ b/Documentation/technical/long-running-process-protocol.txt @@ -0,0 +1,50 @@ +Long-running process protocol +============================= + +This protocol is used when Git needs to communicate with an external +process throughout the entire life of a single Git command. All +communication is in pkt-line format (see technical/protocol-common.txt) +over standard input and standard output. + +Handshake +--------- + +Git starts by sending a welcome message (for example, +"git-filter-client"), a list of supported protocol version numbers, and +a flush packet. Git expects to read the welcome message with "server" +instead of "client" (for example, "git-filter-server"), exactly one +protocol version number from the previously sent list, and a flush +packet. All further communication will be based on the selected version. +The remaining protocol description below documents "version=2". Please +note that "version=42" in the example below does not exist and is only +there to illustrate how the protocol would look like with more than one +version. + +After the version negotiation Git sends a list of all capabilities that +it supports and a flush packet. Git expects to read a list of desired +capabilities, which must be a subset of the supported capabilities list, +and a flush packet as response: +------------------------ +packet: git> git-filter-client +packet: git> version=2 +packet: git> version=42 +packet: git> 0000 +packet: git< git-filter-server +packet: git< version=2 +packet: git< 0000 +packet: git> capability=clean +packet: git> capability=smudge +packet: git> capability=not-yet-invented +packet: git> 0000 +packet: git< capability=clean +packet: git< capability=smudge +packet: git< 0000 +------------------------ + +Shutdown +-------- + +Git will close +the command pipe on exit. The filter is expected to detect EOF +and exit gracefully on its own. Git will wait until the filter +process has stopped. diff --git a/Documentation/technical/pack-format.txt b/Documentation/technical/pack-format.txt index 8e5bf60be3..70a99fd142 100644 --- a/Documentation/technical/pack-format.txt +++ b/Documentation/technical/pack-format.txt @@ -36,6 +36,98 @@ Git pack format - The trailer records 20-byte SHA-1 checksum of all of the above. +=== Object types + +Valid object types are: + +- OBJ_COMMIT (1) +- OBJ_TREE (2) +- OBJ_BLOB (3) +- OBJ_TAG (4) +- OBJ_OFS_DELTA (6) +- OBJ_REF_DELTA (7) + +Type 5 is reserved for future expansion. Type 0 is invalid. + +=== Deltified representation + +Conceptually there are only four object types: commit, tree, tag and +blob. However to save space, an object could be stored as a "delta" of +another "base" object. These representations are assigned new types +ofs-delta and ref-delta, which is only valid in a pack file. + +Both ofs-delta and ref-delta store the "delta" to be applied to +another object (called 'base object') to reconstruct the object. The +difference between them is, ref-delta directly encodes 20-byte base +object name. If the base object is in the same pack, ofs-delta encodes +the offset of the base object in the pack instead. + +The base object could also be deltified if it's in the same pack. +Ref-delta can also refer to an object outside the pack (i.e. the +so-called "thin pack"). When stored on disk however, the pack should +be self contained to avoid cyclic dependency. + +The delta data is a sequence of instructions to reconstruct an object +from the base object. If the base object is deltified, it must be +converted to canonical form first. Each instruction appends more and +more data to the target object until it's complete. There are two +supported instructions so far: one for copy a byte range from the +source object and one for inserting new data embedded in the +instruction itself. + +Each instruction has variable length. Instruction type is determined +by the seventh bit of the first octet. The following diagrams follow +the convention in RFC 1951 (Deflate compressed data format). + +==== Instruction to copy from base object + + +----------+---------+---------+---------+---------+-------+-------+-------+ + | 1xxxxxxx | offset1 | offset2 | offset3 | offset4 | size1 | size2 | size3 | + +----------+---------+---------+---------+---------+-------+-------+-------+ + +This is the instruction format to copy a byte range from the source +object. It encodes the offset to copy from and the number of bytes to +copy. Offset and size are in little-endian order. + +All offset and size bytes are optional. This is to reduce the +instruction size when encoding small offsets or sizes. The first seven +bits in the first octet determines which of the next seven octets is +present. If bit zero is set, offset1 is present. If bit one is set +offset2 is present and so on. + +Note that a more compact instruction does not change offset and size +encoding. For example, if only offset2 is omitted like below, offset3 +still contains bits 16-23. It does not become offset2 and contains +bits 8-15 even if it's right next to offset1. + + +----------+---------+---------+ + | 10000101 | offset1 | offset3 | + +----------+---------+---------+ + +In its most compact form, this instruction only takes up one byte +(0x80) with both offset and size omitted, which will have default +values zero. There is another exception: size zero is automatically +converted to 0x10000. + +==== Instruction to add new data + + +----------+============+ + | 0xxxxxxx | data | + +----------+============+ + +This is the instruction to construct target object without the base +object. The following data is appended to the target object. The first +seven bits of the first octet determines the size of data in +bytes. The size must be non-zero. + +==== Reserved instruction + + +----------+============ + | 00000000 | + +----------+============ + +This is the instruction reserved for future expansion. + == Original (version 1) pack-*.idx files have the following format: - The header consists of 256 4-byte network byte order diff --git a/Documentation/technical/pack-protocol.txt b/Documentation/technical/pack-protocol.txt index 462e20645f..6ac774d5f6 100644 --- a/Documentation/technical/pack-protocol.txt +++ b/Documentation/technical/pack-protocol.txt @@ -1,11 +1,11 @@ Packfile transfer protocols =========================== -Git supports transferring data in packfiles over the ssh://, git:// and +Git supports transferring data in packfiles over the ssh://, git://, http:// and file:// transports. There exist two sets of protocols, one for pushing data from a client to a server and another for fetching data from a -server to a client. All three transports (ssh, git, file) use the same -protocol to transfer data. +server to a client. The three transports (ssh, git, file) use the same +protocol to transfer data. http is documented in http-protocol.txt. The processes invoked in the canonical Git implementation are 'upload-pack' on the server side and 'fetch-pack' on the client side for fetching data; @@ -14,6 +14,14 @@ data. The protocol functions to have a server tell a client what is currently on the server, then for the two to negotiate the smallest amount of data to send in order to fully update one or the other. +pkt-line Format +--------------- + +The descriptions below build on the pkt-line format described in +protocol-common.txt. When the grammar indicate `PKT-LINE(...)`, unless +otherwise noted the usual pkt-line LF rules apply: the sender SHOULD +include a LF, but the receiver MUST NOT complain if it is not present. + Transports ---------- There are three transports over which the packfile protocol is @@ -31,6 +39,20 @@ communicates with that invoked process over the SSH connection. The file:// transport runs the 'upload-pack' or 'receive-pack' process locally and communicates with it over a pipe. +Extra Parameters +---------------- + +The protocol provides a mechanism in which clients can send additional +information in its first message to the server. These are called "Extra +Parameters", and are supported by the Git, SSH, and HTTP protocols. + +Each Extra Parameter takes the form of `<key>=<value>` or `<key>`. + +Servers that receive any such Extra Parameters MUST ignore all +unrecognized keys. Currently, the only Extra Parameter recognized is +"version" with a value of '1' or '2'. See protocol-v2.txt for more +information on protocol version 2. + Git Transport ------------- @@ -38,18 +60,25 @@ The Git transport starts off by sending the command and repository on the wire using the pkt-line format, followed by a NUL byte and a hostname parameter, terminated by a NUL byte. - 0032git-upload-pack /project.git\0host=myserver.com\0 + 0033git-upload-pack /project.git\0host=myserver.com\0 + +The transport may send Extra Parameters by adding an additional NUL +byte, and then adding one or more NUL-terminated strings: + + 003egit-upload-pack /project.git\0host=myserver.com\0\0version=1\0 -- - git-proto-request = request-command SP pathname NUL [ host-parameter NUL ] + git-proto-request = request-command SP pathname NUL + [ host-parameter NUL ] [ NUL extra-parameters ] request-command = "git-upload-pack" / "git-receive-pack" / "git-upload-archive" ; case sensitive pathname = *( %x01-ff ) ; exclude NUL host-parameter = "host=" hostname [ ":" port ] + extra-parameters = 1*extra-parameter + extra-parameter = 1*( %x01-ff ) NUL -- -Only host-parameter is allowed in the git-proto-request. Clients -MUST NOT attempt to send additional parameters. It is used for the +host-parameter is used for the git-daemon name based virtual hosting. See --interpolated-path option to git daemon, with the %H/%CH format characters. @@ -109,6 +138,12 @@ we execute it without the leading '/'. v ssh user@example.com "git-upload-pack '~alice/project.git'" +Depending on the value of the `protocol.version` configuration variable, +Git may attempt to send Extra Parameters as a colon-separated string in +the GIT_PROTOCOL environment variable. This is done only if +the `ssh.variant` configuration variable indicates that the ssh command +supports passing environment variables as an argument. + A few things to remember here: - The "command name" is spelled with dash (e.g. git-upload-pack), but @@ -129,11 +164,13 @@ Reference Discovery ------------------- When the client initially connects the server will immediately respond -with a listing of each reference it has (all branches and tags) along +with a version number (if "version=1" is sent as an Extra Parameter), +and a listing of each reference it has (all branches and tags) along with the object name that each reference currently points to. - $ echo -e -n "0039git-upload-pack /schacon/gitbook.git\0host=example.com\0" | + $ echo -e -n "0044git-upload-pack /schacon/gitbook.git\0host=example.com\0\0version=1\0" | nc -v example.com 9418 + 000aversion 1 00887217a7c7e582c46cec22a130adf4b9d7d950fba0 HEAD\0multi_ack thin-pack side-band side-band-64k ofs-delta shallow no-progress include-tag 00441d3fcd5ced445d1abc402225c0b8a1299641f497 refs/heads/integration @@ -143,9 +180,6 @@ with the object name that each reference currently points to. 003fe92df48743b7bc7d26bcaabfddde0a1e20cae47c refs/tags/v1.0^{} 0000 -Server SHOULD terminate each non-flush line using LF ("\n") terminator; -client MUST NOT complain if there is no terminator. - The returned response is a pkt-line stream describing each ref and its current value. The stream MUST be sorted by name according to the C locale ordering. @@ -160,20 +194,21 @@ immediately after the ref itself, if presented. A conforming server MUST peel the ref if it's an annotated tag. ---- - advertised-refs = (no-refs / list-of-refs) + advertised-refs = *1("version 1") + (no-refs / list-of-refs) *shallow flush-pkt no-refs = PKT-LINE(zero-id SP "capabilities^{}" - NUL capability-list LF) + NUL capability-list) list-of-refs = first-ref *other-ref first-ref = PKT-LINE(obj-id SP refname - NUL capability-list LF) + NUL capability-list) other-ref = PKT-LINE(other-tip / other-peeled) - other-tip = obj-id SP refname LF - other-peeled = obj-id SP refname "^{}" LF + other-tip = obj-id SP refname + other-peeled = obj-id SP refname "^{}" shallow = PKT-LINE("shallow" SP obj-id) @@ -194,7 +229,7 @@ After reference and capabilities discovery, the client can decide to terminate the connection by sending a flush-pkt, telling the server it can now gracefully terminate, and disconnect, when it does not need any pack data. This can happen with the ls-remote command, and also can happen when -the client already is up-to-date. +the client already is up to date. Otherwise, it enters the negotiation phase, where the client and server determine what the minimal packfile necessary for transport is, @@ -207,6 +242,7 @@ out of what the server said it could do with the first 'want' line. upload-request = want-list *shallow-line *1depth-request + [filter-request] flush-pkt want-list = first-want @@ -214,12 +250,16 @@ out of what the server said it could do with the first 'want' line. shallow-line = PKT-LINE("shallow" SP obj-id) - depth-request = PKT-LINE("deepen" SP depth) + depth-request = PKT-LINE("deepen" SP depth) / + PKT-LINE("deepen-since" SP timestamp) / + PKT-LINE("deepen-not" SP ref) - first-want = PKT-LINE("want" SP obj-id SP capability-list LF) - additional-want = PKT-LINE("want" SP obj-id LF) + first-want = PKT-LINE("want" SP obj-id SP capability-list) + additional-want = PKT-LINE("want" SP obj-id) depth = 1*DIGIT + + filter-request = PKT-LINE("filter" SP filter-spec) ---- Clients MUST send all the obj-ids it wants from the reference @@ -242,6 +282,13 @@ complete those commits. Commits whose parents are not received as a result are defined as shallow and marked as such in the server. This information is sent back to the client in the next step. +The client can optionally request that pack-objects omit various +objects from the packfile using one of several filtering techniques. +These are intended for use with partial clone and partial fetch +operations. An object that does not meet a filter-spec value is +omitted unless explicitly requested in a 'want' line. See `rev-list` +for possible filter-spec values. + Once all the 'want's and 'shallow's (and optional 'deepen') are transferred, clients MUST send a flush-pkt, to tell the server side that it is done sending the list. @@ -284,7 +331,7 @@ so that there is always a block of 32 "in-flight on the wire" at a time. compute-end have-list = *have-line - have-line = PKT-LINE("have" SP obj-id LF) + have-line = PKT-LINE("have" SP obj-id) compute-end = flush-pkt / PKT-LINE("done") ---- @@ -302,7 +349,7 @@ In multi_ack mode: ready to make a packfile, it will blindly ACK all 'have' obj-ids back to the client. - * the server will then send a 'NACK' and then wait for another response + * the server will then send a 'NAK' and then wait for another response from the client - either a 'done' or another list of 'have' lines. In multi_ack_detailed mode: @@ -344,14 +391,19 @@ ACK after 'done' if there is at least one common base and multi_ack or multi_ack_detailed is enabled. The server always sends NAK after 'done' if there is no common base found. +Instead of 'ACK' or 'NAK', the server may send an error message (for +example, if it does not recognize an object in a 'want' line received +from the client). + Then the server will start sending its packfile data. ---- - server-response = *ack_multi ack / nak - ack_multi = PKT-LINE("ACK" SP obj-id ack_status LF) + server-response = *ack_multi ack / nak / error-line + ack_multi = PKT-LINE("ACK" SP obj-id ack_status) ack_status = "continue" / "common" / "ready" - ack = PKT-LINE("ACK SP obj-id LF) - nak = PKT-LINE("NAK" LF) + ack = PKT-LINE("ACK" SP obj-id) + nak = PKT-LINE("NAK") + error-line = PKT-LINE("ERR" SP explanation-text) ---- A simple clone may look like this (with no 'have' lines): @@ -449,7 +501,8 @@ The reference discovery phase is done nearly the same way as it is in the fetching protocol. Each reference obj-id and name on the server is sent in packet-line format to the client, followed by a flush-pkt. The only real difference is that the capability listing is different - the only -possible values are 'report-status', 'delete-refs' and 'ofs-delta'. +possible values are 'report-status', 'delete-refs', 'ofs-delta' and +'push-options'. Reference Update Request and Packfile Transfer ---------------------------------------------- @@ -460,17 +513,15 @@ that it wants to update, it sends a line listing the obj-id currently on the server, the obj-id the client would like to update it to and the name of the reference. -This list is followed by a flush-pkt and then the packfile that should -contain all the objects that the server will need to complete the new -references. +This list is followed by a flush-pkt. ---- - update-request = *shallow ( command-list | push-cert ) [pack-file] + update-requests = *shallow ( command-list | push-cert ) - shallow = PKT-LINE("shallow" SP obj-id LF) + shallow = PKT-LINE("shallow" SP obj-id) - command-list = PKT-LINE(command NUL capability-list LF) - *PKT-LINE(command LF) + command-list = PKT-LINE(command NUL capability-list) + *PKT-LINE(command) flush-pkt command = create / delete / update @@ -486,12 +537,35 @@ references. PKT-LINE("pusher" SP ident LF) PKT-LINE("pushee" SP url LF) PKT-LINE("nonce" SP nonce LF) + *PKT-LINE("push-option" SP push-option LF) PKT-LINE(LF) *PKT-LINE(command LF) *PKT-LINE(gpg-signature-lines LF) PKT-LINE("push-cert-end" LF) - pack-file = "PACK" 28*(OCTET) + push-option = 1*( VCHAR | SP ) +---- + +If the server has advertised the 'push-options' capability and the client has +specified 'push-options' as part of the capability list above, the client then +sends its push options followed by a flush-pkt. + +---- + push-options = *PKT-LINE(push-option) flush-pkt +---- + +For backwards compatibility with older Git servers, if the client sends a push +cert and push options, it MUST send its push options both embedded within the +push cert and after the push cert. (Note that the push options within the cert +are prefixed, but the push options after the cert are not.) Both these lists +MUST be the same, modulo the prefix. + +After that the packfile that +should contain all the objects that the server will need to complete the new +references will be sent. + +---- + packfile = "PACK" 28*(OCTET) ---- If the receiving end does not support delete-refs, the sending end MUST @@ -502,11 +576,11 @@ MUST NOT send a push-cert command. When a push-cert command is sent, command-list MUST NOT be sent; the commands recorded in the push certificate is used instead. -The pack-file MUST NOT be sent if the only command used is 'delete'. +The packfile MUST NOT be sent if the only command used is 'delete'. -A pack-file MUST be sent if either create or update command is used, +A packfile MUST be sent if either create or update command is used, even if the server already has all the necessary objects. In this -case the client MUST send an empty pack-file. The only time this +case the client MUST send an empty packfile. The only time this is likely to happen is if the client is creating a new branch or a tag that points to an existing obj-id. @@ -521,7 +595,8 @@ Push Certificate A push certificate begins with a set of header lines. After the header and an empty line, the protocol commands follow, one per -line. +line. Note that the trailing LF in push-cert PKT-LINEs is _not_ +optional; it must be present. Currently, the following header fields are defined: @@ -560,12 +635,12 @@ update was successful, or 'ng [refname] [error]' if the update was not. 1*(command-status) flush-pkt - unpack-status = PKT-LINE("unpack" SP unpack-result LF) + unpack-status = PKT-LINE("unpack" SP unpack-result) unpack-result = "ok" / error-msg command-status = command-ok / command-fail - command-ok = PKT-LINE("ok" SP refname LF) - command-fail = PKT-LINE("ng" SP refname SP error-msg LF) + command-ok = PKT-LINE("ok" SP refname) + command-fail = PKT-LINE("ng" SP refname SP error-msg) error-msg = 1*(OCTECT) ; where not "ok" ---- diff --git a/Documentation/technical/partial-clone.txt b/Documentation/technical/partial-clone.txt new file mode 100644 index 0000000000..1ef66bd788 --- /dev/null +++ b/Documentation/technical/partial-clone.txt @@ -0,0 +1,324 @@ +Partial Clone Design Notes +========================== + +The "Partial Clone" feature is a performance optimization for Git that +allows Git to function without having a complete copy of the repository. +The goal of this work is to allow Git better handle extremely large +repositories. + +During clone and fetch operations, Git downloads the complete contents +and history of the repository. This includes all commits, trees, and +blobs for the complete life of the repository. For extremely large +repositories, clones can take hours (or days) and consume 100+GiB of disk +space. + +Often in these repositories there are many blobs and trees that the user +does not need such as: + + 1. files outside of the user's work area in the tree. For example, in + a repository with 500K directories and 3.5M files in every commit, + we can avoid downloading many objects if the user only needs a + narrow "cone" of the source tree. + + 2. large binary assets. For example, in a repository where large build + artifacts are checked into the tree, we can avoid downloading all + previous versions of these non-mergeable binary assets and only + download versions that are actually referenced. + +Partial clone allows us to avoid downloading such unneeded objects *in +advance* during clone and fetch operations and thereby reduce download +times and disk usage. Missing objects can later be "demand fetched" +if/when needed. + +Use of partial clone requires that the user be online and the origin +remote be available for on-demand fetching of missing objects. This may +or may not be problematic for the user. For example, if the user can +stay within the pre-selected subset of the source tree, they may not +encounter any missing objects. Alternatively, the user could try to +pre-fetch various objects if they know that they are going offline. + + +Non-Goals +--------- + +Partial clone is a mechanism to limit the number of blobs and trees downloaded +*within* a given range of commits -- and is therefore independent of and not +intended to conflict with existing DAG-level mechanisms to limit the set of +requested commits (i.e. shallow clone, single branch, or fetch '<refspec>'). + + +Design Overview +--------------- + +Partial clone logically consists of the following parts: + +- A mechanism for the client to describe unneeded or unwanted objects to + the server. + +- A mechanism for the server to omit such unwanted objects from packfiles + sent to the client. + +- A mechanism for the client to gracefully handle missing objects (that + were previously omitted by the server). + +- A mechanism for the client to backfill missing objects as needed. + + +Design Details +-------------- + +- A new pack-protocol capability "filter" is added to the fetch-pack and + upload-pack negotiation. ++ +This uses the existing capability discovery mechanism. +See "filter" in Documentation/technical/pack-protocol.txt. + +- Clients pass a "filter-spec" to clone and fetch which is passed to the + server to request filtering during packfile construction. ++ +There are various filters available to accommodate different situations. +See "--filter=<filter-spec>" in Documentation/rev-list-options.txt. + +- On the server pack-objects applies the requested filter-spec as it + creates "filtered" packfiles for the client. ++ +These filtered packfiles are *incomplete* in the traditional sense because +they may contain objects that reference objects not contained in the +packfile and that the client doesn't already have. For example, the +filtered packfile may contain trees or tags that reference missing blobs +or commits that reference missing trees. + +- On the client these incomplete packfiles are marked as "promisor packfiles" + and treated differently by various commands. + +- On the client a repository extension is added to the local config to + prevent older versions of git from failing mid-operation because of + missing objects that they cannot handle. + See "extensions.partialClone" in Documentation/technical/repository-version.txt" + + +Handling Missing Objects +------------------------ + +- An object may be missing due to a partial clone or fetch, or missing due + to repository corruption. To differentiate these cases, the local + repository specially indicates such filtered packfiles obtained from the + promisor remote as "promisor packfiles". ++ +These promisor packfiles consist of a "<name>.promisor" file with +arbitrary contents (like the "<name>.keep" files), in addition to +their "<name>.pack" and "<name>.idx" files. + +- The local repository considers a "promisor object" to be an object that + it knows (to the best of its ability) that the promisor remote has promised + that it has, either because the local repository has that object in one of + its promisor packfiles, or because another promisor object refers to it. ++ +When Git encounters a missing object, Git can see if it a promisor object +and handle it appropriately. If not, Git can report a corruption. ++ +This means that there is no need for the client to explicitly maintain an +expensive-to-modify list of missing objects.[a] + +- Since almost all Git code currently expects any referenced object to be + present locally and because we do not want to force every command to do + a dry-run first, a fallback mechanism is added to allow Git to attempt + to dynamically fetch missing objects from the promisor remote. ++ +When the normal object lookup fails to find an object, Git invokes +fetch-object to try to get the object from the server and then retry +the object lookup. This allows objects to be "faulted in" without +complicated prediction algorithms. ++ +For efficiency reasons, no check as to whether the missing object is +actually a promisor object is performed. ++ +Dynamic object fetching tends to be slow as objects are fetched one at +a time. + +- `checkout` (and any other command using `unpack-trees`) has been taught + to bulk pre-fetch all required missing blobs in a single batch. + +- `rev-list` has been taught to print missing objects. ++ +This can be used by other commands to bulk prefetch objects. +For example, a "git log -p A..B" may internally want to first do +something like "git rev-list --objects --quiet --missing=print A..B" +and prefetch those objects in bulk. + +- `fsck` has been updated to be fully aware of promisor objects. + +- `repack` in GC has been updated to not touch promisor packfiles at all, + and to only repack other objects. + +- The global variable "fetch_if_missing" is used to control whether an + object lookup will attempt to dynamically fetch a missing object or + report an error. ++ +We are not happy with this global variable and would like to remove it, +but that requires significant refactoring of the object code to pass an +additional flag. We hope that concurrent efforts to add an ODB API can +encompass this. + + +Fetching Missing Objects +------------------------ + +- Fetching of objects is done using the existing transport mechanism using + transport_fetch_refs(), setting a new transport option + TRANS_OPT_NO_DEPENDENTS to indicate that only the objects themselves are + desired, not any object that they refer to. ++ +Because some transports invoke fetch_pack() in the same process, fetch_pack() +has been updated to not use any object flags when the corresponding argument +(no_dependents) is set. + +- The local repository sends a request with the hashes of all requested + objects as "want" lines, and does not perform any packfile negotiation. + It then receives a packfile. + +- Because we are reusing the existing fetch-pack mechanism, fetching + currently fetches all objects referred to by the requested objects, even + though they are not necessary. + + +Current Limitations +------------------- + +- The remote used for a partial clone (or the first partial fetch + following a regular clone) is marked as the "promisor remote". ++ +We are currently limited to a single promisor remote and only that +remote may be used for subsequent partial fetches. ++ +We accept this limitation because we believe initial users of this +feature will be using it on repositories with a strong single central +server. + +- Dynamic object fetching will only ask the promisor remote for missing + objects. We assume that the promisor remote has a complete view of the + repository and can satisfy all such requests. + +- Repack essentially treats promisor and non-promisor packfiles as 2 + distinct partitions and does not mix them. Repack currently only works + on non-promisor packfiles and loose objects. + +- Dynamic object fetching invokes fetch-pack once *for each item* + because most algorithms stumble upon a missing object and need to have + it resolved before continuing their work. This may incur significant + overhead -- and multiple authentication requests -- if many objects are + needed. + +- Dynamic object fetching currently uses the existing pack protocol V0 + which means that each object is requested via fetch-pack. The server + will send a full set of info/refs when the connection is established. + If there are large number of refs, this may incur significant overhead. + + +Future Work +----------- + +- Allow more than one promisor remote and define a strategy for fetching + missing objects from specific promisor remotes or of iterating over the + set of promisor remotes until a missing object is found. ++ +A user might want to have multiple geographically-close cache servers +for fetching missing blobs while continuing to do filtered `git-fetch` +commands from the central server, for example. ++ +Or the user might want to work in a triangular work flow with multiple +promisor remotes that each have an incomplete view of the repository. + +- Allow repack to work on promisor packfiles (while keeping them distinct + from non-promisor packfiles). + +- Allow non-pathname-based filters to make use of packfile bitmaps (when + present). This was just an omission during the initial implementation. + +- Investigate use of a long-running process to dynamically fetch a series + of objects, such as proposed in [5,6] to reduce process startup and + overhead costs. ++ +It would be nice if pack protocol V2 could allow that long-running +process to make a series of requests over a single long-running +connection. + +- Investigate pack protocol V2 to avoid the info/refs broadcast on + each connection with the server to dynamically fetch missing objects. + +- Investigate the need to handle loose promisor objects. ++ +Objects in promisor packfiles are allowed to reference missing objects +that can be dynamically fetched from the server. An assumption was +made that loose objects are only created locally and therefore should +not reference a missing object. We may need to revisit that assumption +if, for example, we dynamically fetch a missing tree and store it as a +loose object rather than a single object packfile. ++ +This does not necessarily mean we need to mark loose objects as promisor; +it may be sufficient to relax the object lookup or is-promisor functions. + + +Non-Tasks +--------- + +- Every time the subject of "demand loading blobs" comes up it seems + that someone suggests that the server be allowed to "guess" and send + additional objects that may be related to the requested objects. ++ +No work has gone into actually doing that; we're just documenting that +it is a common suggestion. We're not sure how it would work and have +no plans to work on it. ++ +It is valid for the server to send more objects than requested (even +for a dynamic object fetch), but we are not building on that. + + +Footnotes +--------- + +[a] expensive-to-modify list of missing objects: Earlier in the design of + partial clone we discussed the need for a single list of missing objects. + This would essentially be a sorted linear list of OIDs that the were + omitted by the server during a clone or subsequent fetches. + +This file would need to be loaded into memory on every object lookup. +It would need to be read, updated, and re-written (like the .git/index) +on every explicit "git fetch" command *and* on any dynamic object fetch. + +The cost to read, update, and write this file could add significant +overhead to every command if there are many missing objects. For example, +if there are 100M missing blobs, this file would be at least 2GiB on disk. + +With the "promisor" concept, we *infer* a missing object based upon the +type of packfile that references it. + + +Related Links +------------- +[0] https://crbug.com/git/2 + Bug#2: Partial Clone + +[1] https://public-inbox.org/git/20170113155253.1644-1-benpeart@microsoft.com/ + + Subject: [RFC] Add support for downloading blobs on demand + + Date: Fri, 13 Jan 2017 10:52:53 -0500 + +[2] https://public-inbox.org/git/cover.1506714999.git.jonathantanmy@google.com/ + + Subject: [PATCH 00/18] Partial clone (from clone to lazy fetch in 18 patches) + + Date: Fri, 29 Sep 2017 13:11:36 -0700 + +[3] https://public-inbox.org/git/20170426221346.25337-1-jonathantanmy@google.com/ + + Subject: Proposal for missing blob support in Git repos + + Date: Wed, 26 Apr 2017 15:13:46 -0700 + +[4] https://public-inbox.org/git/1488999039-37631-1-git-send-email-git@jeffhostetler.com/ + + Subject: [PATCH 00/10] RFC Partial Clone and Fetch + + Date: Wed, 8 Mar 2017 18:50:29 +0000 + +[5] https://public-inbox.org/git/20170505152802.6724-1-benpeart@microsoft.com/ + + Subject: [PATCH v7 00/10] refactor the filter process code into a reusable module + + Date: Fri, 5 May 2017 11:27:52 -0400 + +[6] https://public-inbox.org/git/20170714132651.170708-1-benpeart@microsoft.com/ + + Subject: [RFC/PATCH v2 0/1] Add support for downloading blobs on demand + + Date: Fri, 14 Jul 2017 09:26:50 -0400 diff --git a/Documentation/technical/protocol-capabilities.txt b/Documentation/technical/protocol-capabilities.txt index 4f8a7bfb4c..332d209b58 100644 --- a/Documentation/technical/protocol-capabilities.txt +++ b/Documentation/technical/protocol-capabilities.txt @@ -179,6 +179,31 @@ This capability adds "deepen", "shallow" and "unshallow" commands to the fetch-pack/upload-pack protocol so clients can request shallow clones. +deepen-since +------------ + +This capability adds "deepen-since" command to fetch-pack/upload-pack +protocol so the client can request shallow clones that are cut at a +specific time, instead of depth. Internally it's equivalent of doing +"rev-list --max-age=<timestamp>" on the server side. "deepen-since" +cannot be used with "deepen". + +deepen-not +---------- + +This capability adds "deepen-not" command to fetch-pack/upload-pack +protocol so the client can request shallow clones that are cut at a +specific revision, instead of depth. Internally it's equivalent of +doing "rev-list --not <rev>" on the server side. "deepen-not" +cannot be used with "deepen", but can be used with "deepen-since". + +deepen-relative +--------------- + +If this capability is requested by the client, the semantics of +"deepen" command is changed. The "depth" argument is the depth from +the current shallow boundary, instead of the depth from remote refs. + no-progress ----------- @@ -253,6 +278,15 @@ atomic pushes. If the pushing client requests this capability, the server will update the refs in one atomic transaction. Either all refs are updated or none. +push-options +------------ + +If the server sends the 'push-options' capability it is able to accept +push options after the update commands have been sent, but before the +packfile is streamed. If the pushing client requests this capability, +the server will pass the options to the pre- and post- receive hooks +that process this push request. + allow-tip-sha1-in-want ---------------------- @@ -260,6 +294,13 @@ If the upload-pack server advertises this capability, fetch-pack may send "want" lines with SHA-1s that exist at the server but are not advertised by upload-pack. +allow-reachable-sha1-in-want +---------------------------- + +If the upload-pack server advertises this capability, fetch-pack may +send "want" lines with SHA-1s that exist at the server but are not +advertised by upload-pack. + push-cert=<nonce> ----------------- @@ -268,3 +309,11 @@ to accept a signed push certificate, and asks the <nonce> to be included in the push certificate. A send-pack client MUST NOT send a push-cert packet unless the receive-pack server advertises this capability. + +filter +------ + +If the upload-pack server advertises the 'filter' capability, +fetch-pack may send "filter" commands to request a partial clone +or partial fetch and request that the server omit various objects +from the packfile. diff --git a/Documentation/technical/protocol-common.txt b/Documentation/technical/protocol-common.txt index 889985f707..ecedb34bba 100644 --- a/Documentation/technical/protocol-common.txt +++ b/Documentation/technical/protocol-common.txt @@ -62,11 +62,14 @@ A pkt-line MAY contain binary data, so implementors MUST ensure pkt-line parsing/formatting routines are 8-bit clean. A non-binary line SHOULD BE terminated by an LF, which if present -MUST be included in the total length. - -The maximum length of a pkt-line's data component is 65520 bytes. -Implementations MUST NOT send pkt-line whose length exceeds 65524 -(65520 bytes of payload + 4 bytes of length data). +MUST be included in the total length. Receivers MUST treat pkt-lines +with non-binary data the same whether or not they contain the trailing +LF (stripping the LF if present, and not complaining when it is +missing). + +The maximum length of a pkt-line's data component is 65516 bytes. +Implementations MUST NOT send pkt-line whose length exceeds 65520 +(65516 bytes of payload + 4 bytes of length data). Implementations SHOULD NOT send an empty pkt-line ("0004"). diff --git a/Documentation/technical/protocol-v2.txt b/Documentation/technical/protocol-v2.txt new file mode 100644 index 0000000000..09e4e0273f --- /dev/null +++ b/Documentation/technical/protocol-v2.txt @@ -0,0 +1,439 @@ + Git Wire Protocol, Version 2 +============================== + +This document presents a specification for a version 2 of Git's wire +protocol. Protocol v2 will improve upon v1 in the following ways: + + * Instead of multiple service names, multiple commands will be + supported by a single service + * Easily extendable as capabilities are moved into their own section + of the protocol, no longer being hidden behind a NUL byte and + limited by the size of a pkt-line + * Separate out other information hidden behind NUL bytes (e.g. agent + string as a capability and symrefs can be requested using 'ls-refs') + * Reference advertisement will be omitted unless explicitly requested + * ls-refs command to explicitly request some refs + * Designed with http and stateless-rpc in mind. With clear flush + semantics the http remote helper can simply act as a proxy + +In protocol v2 communication is command oriented. When first contacting a +server a list of capabilities will advertised. Some of these capabilities +will be commands which a client can request be executed. Once a command +has completed, a client can reuse the connection and request that other +commands be executed. + + Packet-Line Framing +--------------------- + +All communication is done using packet-line framing, just as in v1. See +`Documentation/technical/pack-protocol.txt` and +`Documentation/technical/protocol-common.txt` for more information. + +In protocol v2 these special packets will have the following semantics: + + * '0000' Flush Packet (flush-pkt) - indicates the end of a message + * '0001' Delimiter Packet (delim-pkt) - separates sections of a message + + Initial Client Request +------------------------ + +In general a client can request to speak protocol v2 by sending +`version=2` through the respective side-channel for the transport being +used which inevitably sets `GIT_PROTOCOL`. More information can be +found in `pack-protocol.txt` and `http-protocol.txt`. In all cases the +response from the server is the capability advertisement. + + Git Transport +~~~~~~~~~~~~~~~ + +When using the git:// transport, you can request to use protocol v2 by +sending "version=2" as an extra parameter: + + 003egit-upload-pack /project.git\0host=myserver.com\0\0version=2\0 + + SSH and File Transport +~~~~~~~~~~~~~~~~~~~~~~~~ + +When using either the ssh:// or file:// transport, the GIT_PROTOCOL +environment variable must be set explicitly to include "version=2". + + HTTP Transport +~~~~~~~~~~~~~~~~ + +When using the http:// or https:// transport a client makes a "smart" +info/refs request as described in `http-protocol.txt` and requests that +v2 be used by supplying "version=2" in the `Git-Protocol` header. + + C: GET $GIT_URL/info/refs?service=git-upload-pack HTTP/1.0 + C: Git-Protocol: version=2 + +A v2 server would reply: + + S: 200 OK + S: <Some headers> + S: ... + S: + S: 000eversion 2\n + S: <capability-advertisement> + +Subsequent requests are then made directly to the service +`$GIT_URL/git-upload-pack`. (This works the same for git-receive-pack). + + Capability Advertisement +-------------------------- + +A server which decides to communicate (based on a request from a client) +using protocol version 2, notifies the client by sending a version string +in its initial response followed by an advertisement of its capabilities. +Each capability is a key with an optional value. Clients must ignore all +unknown keys. Semantics of unknown values are left to the definition of +each key. Some capabilities will describe commands which can be requested +to be executed by the client. + + capability-advertisement = protocol-version + capability-list + flush-pkt + + protocol-version = PKT-LINE("version 2" LF) + capability-list = *capability + capability = PKT-LINE(key[=value] LF) + + key = 1*(ALPHA | DIGIT | "-_") + value = 1*(ALPHA | DIGIT | " -_.,?\/{}[]()<>!@#$%^&*+=:;") + + Command Request +----------------- + +After receiving the capability advertisement, a client can then issue a +request to select the command it wants with any particular capabilities +or arguments. There is then an optional section where the client can +provide any command specific parameters or queries. Only a single +command can be requested at a time. + + request = empty-request | command-request + empty-request = flush-pkt + command-request = command + capability-list + [command-args] + flush-pkt + command = PKT-LINE("command=" key LF) + command-args = delim-pkt + *command-specific-arg + + command-specific-args are packet line framed arguments defined by + each individual command. + +The server will then check to ensure that the client's request is +comprised of a valid command as well as valid capabilities which were +advertised. If the request is valid the server will then execute the +command. A server MUST wait till it has received the client's entire +request before issuing a response. The format of the response is +determined by the command being executed, but in all cases a flush-pkt +indicates the end of the response. + +When a command has finished, and the client has received the entire +response from the server, a client can either request that another +command be executed or can terminate the connection. A client may +optionally send an empty request consisting of just a flush-pkt to +indicate that no more requests will be made. + + Capabilities +-------------- + +There are two different types of capabilities: normal capabilities, +which can be used to to convey information or alter the behavior of a +request, and commands, which are the core actions that a client wants to +perform (fetch, push, etc). + +Protocol version 2 is stateless by default. This means that all commands +must only last a single round and be stateless from the perspective of the +server side, unless the client has requested a capability indicating that +state should be maintained by the server. Clients MUST NOT require state +management on the server side in order to function correctly. This +permits simple round-robin load-balancing on the server side, without +needing to worry about state management. + + agent +~~~~~~~ + +The server can advertise the `agent` capability with a value `X` (in the +form `agent=X`) to notify the client that the server is running version +`X`. The client may optionally send its own agent string by including +the `agent` capability with a value `Y` (in the form `agent=Y`) in its +request to the server (but it MUST NOT do so if the server did not +advertise the agent capability). The `X` and `Y` strings may contain any +printable ASCII characters except space (i.e., the byte range 32 < x < +127), and are typically of the form "package/version" (e.g., +"git/1.8.3.1"). The agent strings are purely informative for statistics +and debugging purposes, and MUST NOT be used to programmatically assume +the presence or absence of particular features. + + ls-refs +~~~~~~~~~ + +`ls-refs` is the command used to request a reference advertisement in v2. +Unlike the current reference advertisement, ls-refs takes in arguments +which can be used to limit the refs sent from the server. + +Additional features not supported in the base command will be advertised +as the value of the command in the capability advertisement in the form +of a space separated list of features: "<command>=<feature 1> <feature 2>" + +ls-refs takes in the following arguments: + + symrefs + In addition to the object pointed by it, show the underlying ref + pointed by it when showing a symbolic ref. + peel + Show peeled tags. + ref-prefix <prefix> + When specified, only references having a prefix matching one of + the provided prefixes are displayed. + +The output of ls-refs is as follows: + + output = *ref + flush-pkt + ref = PKT-LINE(obj-id SP refname *(SP ref-attribute) LF) + ref-attribute = (symref | peeled) + symref = "symref-target:" symref-target + peeled = "peeled:" obj-id + + fetch +~~~~~~~ + +`fetch` is the command used to fetch a packfile in v2. It can be looked +at as a modified version of the v1 fetch where the ref-advertisement is +stripped out (since the `ls-refs` command fills that role) and the +message format is tweaked to eliminate redundancies and permit easy +addition of future extensions. + +Additional features not supported in the base command will be advertised +as the value of the command in the capability advertisement in the form +of a space separated list of features: "<command>=<feature 1> <feature 2>" + +A `fetch` request can take the following arguments: + + want <oid> + Indicates to the server an object which the client wants to + retrieve. Wants can be anything and are not limited to + advertised objects. + + have <oid> + Indicates to the server an object which the client has locally. + This allows the server to make a packfile which only contains + the objects that the client needs. Multiple 'have' lines can be + supplied. + + done + Indicates to the server that negotiation should terminate (or + not even begin if performing a clone) and that the server should + use the information supplied in the request to construct the + packfile. + + thin-pack + Request that a thin pack be sent, which is a pack with deltas + which reference base objects not contained within the pack (but + are known to exist at the receiving end). This can reduce the + network traffic significantly, but it requires the receiving end + to know how to "thicken" these packs by adding the missing bases + to the pack. + + no-progress + Request that progress information that would normally be sent on + side-band channel 2, during the packfile transfer, should not be + sent. However, the side-band channel 3 is still used for error + responses. + + include-tag + Request that annotated tags should be sent if the objects they + point to are being sent. + + ofs-delta + Indicate that the client understands PACKv2 with delta referring + to its base by position in pack rather than by an oid. That is, + they can read OBJ_OFS_DELTA (ake type 6) in a packfile. + +If the 'shallow' feature is advertised the following arguments can be +included in the clients request as well as the potential addition of the +'shallow-info' section in the server's response as explained below. + + shallow <oid> + A client must notify the server of all commits for which it only + has shallow copies (meaning that it doesn't have the parents of + a commit) by supplying a 'shallow <oid>' line for each such + object so that the server is aware of the limitations of the + client's history. This is so that the server is aware that the + client may not have all objects reachable from such commits. + + deepen <depth> + Requests that the fetch/clone should be shallow having a commit + depth of <depth> relative to the remote side. + + deepen-relative + Requests that the semantics of the "deepen" command be changed + to indicate that the depth requested is relative to the client's + current shallow boundary, instead of relative to the requested + commits. + + deepen-since <timestamp> + Requests that the shallow clone/fetch should be cut at a + specific time, instead of depth. Internally it's equivalent to + doing "git rev-list --max-age=<timestamp>". Cannot be used with + "deepen". + + deepen-not <rev> + Requests that the shallow clone/fetch should be cut at a + specific revision specified by '<rev>', instead of a depth. + Internally it's equivalent of doing "git rev-list --not <rev>". + Cannot be used with "deepen", but can be used with + "deepen-since". + +If the 'filter' feature is advertised, the following argument can be +included in the client's request: + + filter <filter-spec> + Request that various objects from the packfile be omitted + using one of several filtering techniques. These are intended + for use with partial clone and partial fetch operations. See + `rev-list` for possible "filter-spec" values. + +If the 'ref-in-want' feature is advertised, the following argument can +be included in the client's request as well as the potential addition of +the 'wanted-refs' section in the server's response as explained below. + + want-ref <ref> + Indicates to the server that the client wants to retrieve a + particular ref, where <ref> is the full name of a ref on the + server. + +The response of `fetch` is broken into a number of sections separated by +delimiter packets (0001), with each section beginning with its section +header. + + output = *section + section = (acknowledgments | shallow-info | wanted-refs | packfile) + (flush-pkt | delim-pkt) + + acknowledgments = PKT-LINE("acknowledgments" LF) + (nak | *ack) + (ready) + ready = PKT-LINE("ready" LF) + nak = PKT-LINE("NAK" LF) + ack = PKT-LINE("ACK" SP obj-id LF) + + shallow-info = PKT-LINE("shallow-info" LF) + *PKT-LINE((shallow | unshallow) LF) + shallow = "shallow" SP obj-id + unshallow = "unshallow" SP obj-id + + wanted-refs = PKT-LINE("wanted-refs" LF) + *PKT-LINE(wanted-ref LF) + wanted-ref = obj-id SP refname + + packfile = PKT-LINE("packfile" LF) + *PKT-LINE(%x01-03 *%x00-ff) + + acknowledgments section + * If the client determines that it is finished with negotiations + by sending a "done" line, the acknowledgments sections MUST be + omitted from the server's response. + + * Always begins with the section header "acknowledgments" + + * The server will respond with "NAK" if none of the object ids sent + as have lines were common. + + * The server will respond with "ACK obj-id" for all of the + object ids sent as have lines which are common. + + * A response cannot have both "ACK" lines as well as a "NAK" + line. + + * The server will respond with a "ready" line indicating that + the server has found an acceptable common base and is ready to + make and send a packfile (which will be found in the packfile + section of the same response) + + * If the server has found a suitable cut point and has decided + to send a "ready" line, then the server can decide to (as an + optimization) omit any "ACK" lines it would have sent during + its response. This is because the server will have already + determined the objects it plans to send to the client and no + further negotiation is needed. + + shallow-info section + * If the client has requested a shallow fetch/clone, a shallow + client requests a fetch or the server is shallow then the + server's response may include a shallow-info section. The + shallow-info section will be included if (due to one of the + above conditions) the server needs to inform the client of any + shallow boundaries or adjustments to the clients already + existing shallow boundaries. + + * Always begins with the section header "shallow-info" + + * If a positive depth is requested, the server will compute the + set of commits which are no deeper than the desired depth. + + * The server sends a "shallow obj-id" line for each commit whose + parents will not be sent in the following packfile. + + * The server sends an "unshallow obj-id" line for each commit + which the client has indicated is shallow, but is no longer + shallow as a result of the fetch (due to its parents being + sent in the following packfile). + + * The server MUST NOT send any "unshallow" lines for anything + which the client has not indicated was shallow as a part of + its request. + + * This section is only included if a packfile section is also + included in the response. + + wanted-refs section + * This section is only included if the client has requested a + ref using a 'want-ref' line and if a packfile section is also + included in the response. + + * Always begins with the section header "wanted-refs". + + * The server will send a ref listing ("<oid> <refname>") for + each reference requested using 'want-ref' lines. + + * The server MUST NOT send any refs which were not requested + using 'want-ref' lines. + + packfile section + * This section is only included if the client has sent 'want' + lines in its request and either requested that no more + negotiation be done by sending 'done' or if the server has + decided it has found a sufficient cut point to produce a + packfile. + + * Always begins with the section header "packfile" + + * The transmission of the packfile begins immediately after the + section header + + * The data transfer of the packfile is always multiplexed, using + the same semantics of the 'side-band-64k' capability from + protocol version 1. This means that each packet, during the + packfile data stream, is made up of a leading 4-byte pkt-line + length (typical of the pkt-line format), followed by a 1-byte + stream code, followed by the actual data. + + The stream code can be one of: + 1 - pack data + 2 - progress messages + 3 - fatal error message just before stream aborts + + server-option +~~~~~~~~~~~~~~~ + +If advertised, indicates that any number of server specific options can be +included in a request. This is done by sending each option as a +"server-option=<option>" capability line in the capability-list section of +a request. + +The provided options must not contain a NUL or LF character. diff --git a/Documentation/technical/racy-git.txt b/Documentation/technical/racy-git.txt index 242a044db9..4a8be4d144 100644 --- a/Documentation/technical/racy-git.txt +++ b/Documentation/technical/racy-git.txt @@ -41,13 +41,17 @@ With a `USE_STDEV` compile-time option, `st_dev` is also compared, but this is not enabled by default because this member is not stable on network filesystems. With `USE_NSEC` compile-time option, `st_mtim.tv_nsec` and `st_ctim.tv_nsec` -members are also compared, but this is not enabled by default +members are also compared. On Linux, this is not enabled by default because in-core timestamps can have finer granularity than on-disk timestamps, resulting in meaningless changes when an inode is evicted from the inode cache. See commit 8ce13b0 of git://git.kernel.org/pub/scm/linux/kernel/git/tglx/history.git ([PATCH] Sync in core time granularity with filesystems, -2005-01-04). +2005-01-04). This patch is included in kernel 2.6.11 and newer, but +only fixes the issue for file systems with exactly 1 ns or 1 s +resolution. Other file systems are still broken in current Linux +kernels (e.g. CEPH, CIFS, NTFS, UDF), see +https://lkml.org/lkml/2015/6/9/714 Racy Git -------- diff --git a/Documentation/technical/repository-version.txt b/Documentation/technical/repository-version.txt new file mode 100644 index 0000000000..e03eaccebc --- /dev/null +++ b/Documentation/technical/repository-version.txt @@ -0,0 +1,100 @@ +Git Repository Format Versions +============================== + +Every git repository is marked with a numeric version in the +`core.repositoryformatversion` key of its `config` file. This version +specifies the rules for operating on the on-disk repository data. An +implementation of git which does not understand a particular version +advertised by an on-disk repository MUST NOT operate on that repository; +doing so risks not only producing wrong results, but actually losing +data. + +Because of this rule, version bumps should be kept to an absolute +minimum. Instead, we generally prefer these strategies: + + - bumping format version numbers of individual data files (e.g., + index, packfiles, etc). This restricts the incompatibilities only to + those files. + + - introducing new data that gracefully degrades when used by older + clients (e.g., pack bitmap files are ignored by older clients, which + simply do not take advantage of the optimization they provide). + +A whole-repository format version bump should only be part of a change +that cannot be independently versioned. For instance, if one were to +change the reachability rules for objects, or the rules for locking +refs, that would require a bump of the repository format version. + +Note that this applies only to accessing the repository's disk contents +directly. An older client which understands only format `0` may still +connect via `git://` to a repository using format `1`, as long as the +server process understands format `1`. + +The preferred strategy for rolling out a version bump (whether whole +repository or for a single file) is to teach git to read the new format, +and allow writing the new format with a config switch or command line +option (for experimentation or for those who do not care about backwards +compatibility with older gits). Then after a long period to allow the +reading capability to become common, we may switch to writing the new +format by default. + +The currently defined format versions are: + +Version `0` +----------- + +This is the format defined by the initial version of git, including but +not limited to the format of the repository directory, the repository +configuration file, and the object and ref storage. Specifying the +complete behavior of git is beyond the scope of this document. + +Version `1` +----------- + +This format is identical to version `0`, with the following exceptions: + + 1. When reading the `core.repositoryformatversion` variable, a git + implementation which supports version 1 MUST also read any + configuration keys found in the `extensions` section of the + configuration file. + + 2. If a version-1 repository specifies any `extensions.*` keys that + the running git has not implemented, the operation MUST NOT + proceed. Similarly, if the value of any known key is not understood + by the implementation, the operation MUST NOT proceed. + +Note that if no extensions are specified in the config file, then +`core.repositoryformatversion` SHOULD be set to `0` (setting it to `1` +provides no benefit, and makes the repository incompatible with older +implementations of git). + +This document will serve as the master list for extensions. Any +implementation wishing to define a new extension should make a note of +it here, in order to claim the name. + +The defined extensions are: + +`noop` +~~~~~~ + +This extension does not change git's behavior at all. It is useful only +for testing format-1 compatibility. + +`preciousObjects` +~~~~~~~~~~~~~~~~~ + +When the config key `extensions.preciousObjects` is set to `true`, +objects in the repository MUST NOT be deleted (e.g., by `git-prune` or +`git repack -d`). + +`partialclone` +~~~~~~~~~~~~~~ + +When the config key `extensions.partialclone` is set, it indicates +that the repo was created with a partial clone (or later performed +a partial fetch) and that the remote may have omitted sending +certain unwanted objects. Such a remote is called a "promisor remote" +and it promises that all such omitted objects can be fetched from it +in the future. + +The value of this key is the name of the promisor remote. diff --git a/Documentation/technical/shallow.txt b/Documentation/technical/shallow.txt index 5183b15422..01dedfe9ff 100644 --- a/Documentation/technical/shallow.txt +++ b/Documentation/technical/shallow.txt @@ -8,20 +8,22 @@ repo, and therefore grafts are introduced pretending that these commits have no parents. ********************************************************* -The basic idea is to write the SHA-1s of shallow commits into -$GIT_DIR/shallow, and handle its contents like the contents -of $GIT_DIR/info/grafts (with the difference that shallow -cannot contain parent information). - -This information is stored in a new file instead of grafts, or -even the config, since the user should not touch that file -at all (even throughout development of the shallow clone, it -was never manually edited!). +$GIT_DIR/shallow lists commit object names and tells Git to +pretend as if they are root commits (e.g. "git log" traversal +stops after showing them; "git fsck" does not complain saying +the commits listed on their "parent" lines do not exist). Each line contains exactly one SHA-1. When read, a commit_graft will be constructed, which has nr_parent < 0 to make it easier to discern from user provided grafts. +Note that the shallow feature could not be changed easily to +use replace refs: a commit containing a `mergetag` is not allowed +to be replaced, not even by a root commit. Such a commit can be +made shallow, though. Also, having a `shallow` file explicitly +listing all the commits made shallow makes it a *lot* easier to +do shallow-specific things such as to deepen the history. + Since fsck-objects relies on the library to read the objects, it honours shallow commits automatically. diff --git a/Documentation/technical/signature-format.txt b/Documentation/technical/signature-format.txt new file mode 100644 index 0000000000..2c9406a56a --- /dev/null +++ b/Documentation/technical/signature-format.txt @@ -0,0 +1,186 @@ +Git signature format +==================== + +== Overview + +Git uses cryptographic signatures in various places, currently objects (tags, +commits, mergetags) and transactions (pushes). In every case, the command which +is about to create an object or transaction determines a payload from that, +calls gpg to obtain a detached signature for the payload (`gpg -bsa`) and +embeds the signature into the object or transaction. + +Signatures always begin with `-----BEGIN PGP SIGNATURE-----` +and end with `-----END PGP SIGNATURE-----`, unless gpg is told to +produce RFC1991 signatures which use `MESSAGE` instead of `SIGNATURE`. + +The signed payload and the way the signature is embedded depends +on the type of the object resp. transaction. + +== Tag signatures + +- created by: `git tag -s` +- payload: annotated tag object +- embedding: append the signature to the unsigned tag object +- example: tag `signedtag` with subject `signed tag` + +---- +object 04b871796dc0420f8e7561a895b52484b701d51a +type commit +tag signedtag +tagger C O Mitter <committer@example.com> 1465981006 +0000 + +signed tag + +signed tag message body +-----BEGIN PGP SIGNATURE----- +Version: GnuPG v1 + +iQEcBAABAgAGBQJXYRhOAAoJEGEJLoW3InGJklkIAIcnhL7RwEb/+QeX9enkXhxn +rxfdqrvWd1K80sl2TOt8Bg/NYwrUBw/RWJ+sg/hhHp4WtvE1HDGHlkEz3y11Lkuh +8tSxS3qKTxXUGozyPGuE90sJfExhZlW4knIQ1wt/yWqM+33E9pN4hzPqLwyrdods +q8FWEqPPUbSJXoMbRPw04S5jrLtZSsUWbRYjmJCHzlhSfFWW4eFd37uquIaLUBS0 +rkC3Jrx7420jkIpgFcTI2s60uhSQLzgcCwdA2ukSYIRnjg/zDkj8+3h/GaROJ72x +lZyI6HWixKJkWw8lE9aAOD9TmTW9sFJwcVAzmAuFX2kUreDUKMZduGcoRYGpD7E= +=jpXa +-----END PGP SIGNATURE----- +---- + +- verify with: `git verify-tag [-v]` or `git tag -v` + +---- +gpg: Signature made Wed Jun 15 10:56:46 2016 CEST using RSA key ID B7227189 +gpg: Good signature from "Eris Discordia <discord@example.net>" +gpg: WARNING: This key is not certified with a trusted signature! +gpg: There is no indication that the signature belongs to the owner. +Primary key fingerprint: D4BE 2231 1AD3 131E 5EDA 29A4 6109 2E85 B722 7189 +object 04b871796dc0420f8e7561a895b52484b701d51a +type commit +tag signedtag +tagger C O Mitter <committer@example.com> 1465981006 +0000 + +signed tag + +signed tag message body +---- + +== Commit signatures + +- created by: `git commit -S` +- payload: commit object +- embedding: header entry `gpgsig` + (content is preceded by a space) +- example: commit with subject `signed commit` + +---- +tree eebfed94e75e7760540d1485c740902590a00332 +parent 04b871796dc0420f8e7561a895b52484b701d51a +author A U Thor <author@example.com> 1465981137 +0000 +committer C O Mitter <committer@example.com> 1465981137 +0000 +gpgsig -----BEGIN PGP SIGNATURE----- + Version: GnuPG v1 + + iQEcBAABAgAGBQJXYRjRAAoJEGEJLoW3InGJ3IwIAIY4SA6GxY3BjL60YyvsJPh/ + HRCJwH+w7wt3Yc/9/bW2F+gF72kdHOOs2jfv+OZhq0q4OAN6fvVSczISY/82LpS7 + DVdMQj2/YcHDT4xrDNBnXnviDO9G7am/9OE77kEbXrp7QPxvhjkicHNwy2rEflAA + zn075rtEERDHr8nRYiDh8eVrefSO7D+bdQ7gv+7GsYMsd2auJWi1dHOSfTr9HIF4 + HJhWXT9d2f8W+diRYXGh4X0wYiGg6na/soXc+vdtDYBzIxanRqjg8jCAeo1eOTk1 + EdTwhcTZlI0x5pvJ3H0+4hA2jtldVtmPM4OTB0cTrEWBad7XV6YgiyuII73Ve3I= + =jKHM + -----END PGP SIGNATURE----- + +signed commit + +signed commit message body +---- + +- verify with: `git verify-commit [-v]` (or `git show --show-signature`) + +---- +gpg: Signature made Wed Jun 15 10:58:57 2016 CEST using RSA key ID B7227189 +gpg: Good signature from "Eris Discordia <discord@example.net>" +gpg: WARNING: This key is not certified with a trusted signature! +gpg: There is no indication that the signature belongs to the owner. +Primary key fingerprint: D4BE 2231 1AD3 131E 5EDA 29A4 6109 2E85 B722 7189 +tree eebfed94e75e7760540d1485c740902590a00332 +parent 04b871796dc0420f8e7561a895b52484b701d51a +author A U Thor <author@example.com> 1465981137 +0000 +committer C O Mitter <committer@example.com> 1465981137 +0000 + +signed commit + +signed commit message body +---- + +== Mergetag signatures + +- created by: `git merge` on signed tag +- payload/embedding: the whole signed tag object is embedded into + the (merge) commit object as header entry `mergetag` +- example: merge of the signed tag `signedtag` as above + +---- +tree c7b1cff039a93f3600a1d18b82d26688668c7dea +parent c33429be94b5f2d3ee9b0adad223f877f174b05d +parent 04b871796dc0420f8e7561a895b52484b701d51a +author A U Thor <author@example.com> 1465982009 +0000 +committer C O Mitter <committer@example.com> 1465982009 +0000 +mergetag object 04b871796dc0420f8e7561a895b52484b701d51a + type commit + tag signedtag + tagger C O Mitter <committer@example.com> 1465981006 +0000 + + signed tag + + signed tag message body + -----BEGIN PGP SIGNATURE----- + Version: GnuPG v1 + + iQEcBAABAgAGBQJXYRhOAAoJEGEJLoW3InGJklkIAIcnhL7RwEb/+QeX9enkXhxn + rxfdqrvWd1K80sl2TOt8Bg/NYwrUBw/RWJ+sg/hhHp4WtvE1HDGHlkEz3y11Lkuh + 8tSxS3qKTxXUGozyPGuE90sJfExhZlW4knIQ1wt/yWqM+33E9pN4hzPqLwyrdods + q8FWEqPPUbSJXoMbRPw04S5jrLtZSsUWbRYjmJCHzlhSfFWW4eFd37uquIaLUBS0 + rkC3Jrx7420jkIpgFcTI2s60uhSQLzgcCwdA2ukSYIRnjg/zDkj8+3h/GaROJ72x + lZyI6HWixKJkWw8lE9aAOD9TmTW9sFJwcVAzmAuFX2kUreDUKMZduGcoRYGpD7E= + =jpXa + -----END PGP SIGNATURE----- + +Merge tag 'signedtag' into downstream + +signed tag + +signed tag message body + +# gpg: Signature made Wed Jun 15 08:56:46 2016 UTC using RSA key ID B7227189 +# gpg: Good signature from "Eris Discordia <discord@example.net>" +# gpg: WARNING: This key is not certified with a trusted signature! +# gpg: There is no indication that the signature belongs to the owner. +# Primary key fingerprint: D4BE 2231 1AD3 131E 5EDA 29A4 6109 2E85 B722 7189 +---- + +- verify with: verification is embedded in merge commit message by default, + alternatively with `git show --show-signature`: + +---- +commit 9863f0c76ff78712b6800e199a46aa56afbcbd49 +merged tag 'signedtag' +gpg: Signature made Wed Jun 15 10:56:46 2016 CEST using RSA key ID B7227189 +gpg: Good signature from "Eris Discordia <discord@example.net>" +gpg: WARNING: This key is not certified with a trusted signature! +gpg: There is no indication that the signature belongs to the owner. +Primary key fingerprint: D4BE 2231 1AD3 131E 5EDA 29A4 6109 2E85 B722 7189 +Merge: c33429b 04b8717 +Author: A U Thor <author@example.com> +Date: Wed Jun 15 09:13:29 2016 +0000 + + Merge tag 'signedtag' into downstream + + signed tag + + signed tag message body + + # gpg: Signature made Wed Jun 15 08:56:46 2016 UTC using RSA key ID B7227189 + # gpg: Good signature from "Eris Discordia <discord@example.net>" + # gpg: WARNING: This key is not certified with a trusted signature! + # gpg: There is no indication that the signature belongs to the owner. + # Primary key fingerprint: D4BE 2231 1AD3 131E 5EDA 29A4 6109 2E85 B722 7189 +---- diff --git a/Documentation/technical/trivial-merge.txt b/Documentation/technical/trivial-merge.txt index c79d4a7c47..1f1c33d0da 100644 --- a/Documentation/technical/trivial-merge.txt +++ b/Documentation/technical/trivial-merge.txt @@ -32,7 +32,7 @@ or the result. If multiple cases apply, the one used is listed first. A result which changes the index is an error if the index is not empty -and not up-to-date. +and not up to date. Entries marked '+' have stat information. Spaces marked '*' don't affect the result. @@ -65,7 +65,7 @@ empty, no entry is left for that stage). Otherwise, the given entry is left in stage 0, and there are no other entries. A result of "no merge" is an error if the index is not empty and not -up-to-date. +up to date. *empty* means that the tree must not have a directory-file conflict with the entry. |