7 files changed, 1237 insertions, 15 deletions
diff --git a/Documentation/technical/api-trace2.txt b/Documentation/technical/api-trace2.txt
index 4f07ceadcb..6b6085585d 100644
--- a/Documentation/technical/api-trace2.txt
+++ b/Documentation/technical/api-trace2.txt
@@ -656,7 +656,8 @@ The "exec_id" field is a command-unique id and is only useful if the
 ------------
 
 `"def_param"`::
-	This event is generated to log a global parameter.
+	This event is generated to log a global parameter, such as a config
+	setting, command-line flag, or environment variable.
 +
 ------------
 {
diff --git a/Documentation/technical/commit-graph-format.txt b/Documentation/technical/commit-graph-format.txt
index a4f17441ae..1beef17182 100644
--- a/Documentation/technical/commit-graph-format.txt
+++ b/Documentation/technical/commit-graph-format.txt
@@ -17,6 +17,9 @@ metadata, including:
 - The parents of the commit, stored using positional references within
   the graph file.
 
+- The Bloom filter of the commit carrying the paths that were changed between
+  the commit and its first parent, if requested.
+
 These positional references are stored as unsigned 32-bit integers
 corresponding to the array position within the list of commit OIDs. Due
 to some special constants we use to track parents, we can store at most
@@ -93,6 +96,33 @@ CHUNK DATA:
       positions for the parents until reaching a value with the most-significant
       bit on. The other bits correspond to the position of the last parent.
 
+  Bloom Filter Index (ID: {'B', 'I', 'D', 'X'}) (N * 4 bytes) [Optional]
+    * The ith entry, BIDX[i], stores the number of bytes in all Bloom filters
+      from commit 0 to commit i (inclusive) in lexicographic order. The Bloom
+      filter for the i-th commit spans from BIDX[i-1] to BIDX[i] (plus header
+      length), where BIDX[-1] is 0.
+    * The BIDX chunk is ignored if the BDAT chunk is not present.
+
+  Bloom Filter Data (ID: {'B', 'D', 'A', 'T'}) [Optional]
+    * It starts with header consisting of three unsigned 32-bit integers:
+      - Version of the hash algorithm being used. We currently only support
+	value 1 which corresponds to the 32-bit version of the murmur3 hash
+	implemented exactly as described in
+	https://en.wikipedia.org/wiki/MurmurHash#Algorithm and the double
+	hashing technique using seed values 0x293ae76f and 0x7e646e2 as
+	described in https://doi.org/10.1007/978-3-540-30494-4_26 "Bloom Filters
+	in Probabilistic Verification"
+      - The number of times a path is hashed and hence the number of bit positions
+	      that cumulatively determine whether a file is present in the commit.
+      - The minimum number of bits 'b' per entry in the Bloom filter. If the filter
+	      contains 'n' entries, then the filter size is the minimum number of 64-bit
+	      words that contain n*b bits.
+    * The rest of the chunk is the concatenation of all the computed Bloom
+      filters for the commits in lexicographic order.
+    * Note: Commits with no changes or more than 512 changes have Bloom filters
+      of length zero.
+    * The BDAT chunk is present if and only if BIDX is present.
+
   Base Graphs List (ID: {'B', 'A', 'S', 'E'}) [Optional]
       This list of H-byte hashes describe a set of B commit-graph files that
       form a commit-graph chain. The graph position for the ith commit in this
diff --git a/Documentation/technical/http-protocol.txt b/Documentation/technical/http-protocol.txt
index 9c5b6f0fac..51a79e63de 100644
--- a/Documentation/technical/http-protocol.txt
+++ b/Documentation/technical/http-protocol.txt
@@ -216,7 +216,7 @@ smart server reply:
    S: 001e# service=git-upload-pack\n
    S: 0000
    S: 004895dcfa3633004da0049d3d0fa03f80589cbcaf31 refs/heads/maint\0multi_ack\n
-   S: 0042d049f6c27a2244e12041955e262a404c7faba355 refs/heads/master\n
+   S: 003fd049f6c27a2244e12041955e262a404c7faba355 refs/heads/master\n
    S: 003c2cb58b79488a98d2721cea644875a8dd0026b115 refs/tags/v1.0\n
    S: 003fa3c2e2402b99163d1d59756e5f207ae21cccba4c refs/tags/v1.0^{}\n
    S: 0000
diff --git a/Documentation/technical/pack-protocol.txt b/Documentation/technical/pack-protocol.txt
index d5ce4eea8a..a4573d12ce 100644
--- a/Documentation/technical/pack-protocol.txt
+++ b/Documentation/technical/pack-protocol.txt
@@ -96,7 +96,7 @@ Basically what the Git client is doing to connect to an 'upload-pack'
 process on the server side over the Git protocol is this:
 
    $ echo -e -n \
-     "0039git-upload-pack /schacon/gitbook.git\0host=example.com\0" |
+     "003agit-upload-pack /schacon/gitbook.git\0host=example.com\0" |
      nc -v example.com 9418
 
 
@@ -171,9 +171,9 @@ with a version number (if "version=1" is sent as an Extra Parameter),
 and a listing of each reference it has (all branches and tags) along
 with the object name that each reference currently points to.
 
-   $ echo -e -n "0044git-upload-pack /schacon/gitbook.git\0host=example.com\0\0version=1\0" |
+   $ echo -e -n "0045git-upload-pack /schacon/gitbook.git\0host=example.com\0\0version=1\0" |
       nc -v example.com 9418
-   000aversion 1
+   000eversion 1
    00887217a7c7e582c46cec22a130adf4b9d7d950fba0 HEAD\0multi_ack thin-pack
 		side-band side-band-64k ofs-delta shallow no-progress include-tag
    00441d3fcd5ced445d1abc402225c0b8a1299641f497 refs/heads/integration
diff --git a/Documentation/technical/packfile-uri.txt b/Documentation/technical/packfile-uri.txt
new file mode 100644
index 0000000000..318713abc3
--- /dev/null
+++ b/Documentation/technical/packfile-uri.txt
@@ -0,0 +1,78 @@
+Packfile URIs
+=============
+
+This feature allows servers to serve part of their packfile response as URIs.
+This allows server designs that improve scalability in bandwidth and CPU usage
+(for example, by serving some data through a CDN), and (in the future) provides
+some measure of resumability to clients.
+
+This feature is available only in protocol version 2.
+
+Protocol
+--------
+
+The server advertises the `packfile-uris` capability.
+
+If the client then communicates which protocols (HTTPS, etc.) it supports with
+a `packfile-uris` argument, the server MAY send a `packfile-uris` section
+directly before the `packfile` section (right after `wanted-refs` if it is
+sent) containing URIs of any of the given protocols. The URIs point to
+packfiles that use only features that the client has declared that it supports
+(e.g. ofs-delta and thin-pack). See protocol-v2.txt for the documentation of
+this section.
+
+Clients should then download and index all the given URIs (in addition to
+downloading and indexing the packfile given in the `packfile` section of the
+response) before performing the connectivity check.
+
+Server design
+-------------
+
+The server can be trivially made compatible with the proposed protocol by
+having it advertise `packfile-uris`, tolerating the client sending
+`packfile-uris`, and never sending any `packfile-uris` section. But we should
+include some sort of non-trivial implementation in the Minimum Viable Product,
+at least so that we can test the client.
+
+This is the implementation: a feature, marked experimental, that allows the
+server to be configured by one or more `uploadpack.blobPackfileUri=<sha1>
+<uri>` entries. Whenever the list of objects to be sent is assembled, all such
+blobs are excluded, replaced with URIs. The client will download those URIs,
+expecting them to each point to packfiles containing single blobs.
+
+Client design
+-------------
+
+The client has a config variable `fetch.uriprotocols` that determines which
+protocols the end user is willing to use. By default, this is empty.
+
+When the client downloads the given URIs, it should store them with "keep"
+files, just like it does with the packfile in the `packfile` section. These
+additional "keep" files can only be removed after the refs have been updated -
+just like the "keep" file for the packfile in the `packfile` section.
+
+The division of work (initial fetch + additional URIs) introduces convenient
+points for resumption of an interrupted clone - such resumption can be done
+after the Minimum Viable Product (see "Future work").
+
+Future work
+-----------
+
+The protocol design allows some evolution of the server and client without any
+need for protocol changes, so only a small-scoped design is included here to
+form the MVP. For example, the following can be done:
+
+ * On the server, more sophisticated means of excluding objects (e.g. by
+   specifying a commit to represent that commit and all objects that it
+   references).
+ * On the client, resumption of clone. If a clone is interrupted, information
+   could be recorded in the repository's config and a "clone-resume" command
+   can resume the clone in progress. (Resumption of subsequent fetches is more
+   difficult because that must deal with the user wanting to use the repository
+   even after the fetch was interrupted.)
+
+There are some possible features that will require a change in protocol:
+
+ * Additional HTTP headers (e.g. authentication)
+ * Byte range support
+ * Different file formats referenced by URIs (e.g. raw object)
diff --git a/Documentation/technical/protocol-v2.txt b/Documentation/technical/protocol-v2.txt
index 7e3766cafb..5852f499a6 100644
--- a/Documentation/technical/protocol-v2.txt
+++ b/Documentation/technical/protocol-v2.txt
@@ -33,6 +33,8 @@ In protocol v2 these special packets will have the following semantics:
 
   * '0000' Flush Packet (flush-pkt) - indicates the end of a message
   * '0001' Delimiter Packet (delim-pkt) - separates sections of a message
+  * '0002' Message Packet (response-end-pkt) - indicates the end of a response
+    for stateless connections
 
 Initial Client Request
 ----------------------
@@ -323,13 +325,26 @@ included in the client's request:
 	indicating its sideband (1, 2, or 3), and the server may send "0005\2"
 	(a PKT-LINE of sideband 2 with no payload) as a keepalive packet.
 
+If the 'packfile-uris' feature is advertised, the following argument
+can be included in the client's request as well as the potential
+addition of the 'packfile-uris' section in the server's response as
+explained below.
+
+    packfile-uris <comma-separated list of protocols>
+	Indicates to the server that the client is willing to receive
+	URIs of any of the given protocols in place of objects in the
+	sent packfile. Before performing the connectivity check, the
+	client should download from all given URIs. Currently, the
+	protocols supported are "http" and "https".
+
 The response of `fetch` is broken into a number of sections separated by
 delimiter packets (0001), with each section beginning with its section
-header.
+header. Most sections are sent only when the packfile is sent.
 
-    output = *section
-    section = (acknowledgments | shallow-info | wanted-refs | packfile)
-	      (flush-pkt | delim-pkt)
+    output = acknowledgements flush-pkt |
+	     [acknowledgments delim-pkt] [shallow-info delim-pkt]
+	     [wanted-refs delim-pkt] [packfile-uris delim-pkt]
+	     packfile flush-pkt
 
     acknowledgments = PKT-LINE("acknowledgments" LF)
 		      (nak | *ack)
@@ -347,13 +362,17 @@ header.
 		  *PKT-LINE(wanted-ref LF)
     wanted-ref = obj-id SP refname
 
+    packfile-uris = PKT-LINE("packfile-uris" LF) *packfile-uri
+    packfile-uri = PKT-LINE(40*(HEXDIGIT) SP *%x20-ff LF)
+
     packfile = PKT-LINE("packfile" LF)
 	       *PKT-LINE(%x01-03 *%x00-ff)
 
     acknowledgments section
-	* If the client determines that it is finished with negotiations
-	  by sending a "done" line, the acknowledgments sections MUST be
-	  omitted from the server's response.
+	* If the client determines that it is finished with negotiations by
+	  sending a "done" line (thus requiring the server to send a packfile),
+	  the acknowledgments sections MUST be omitted from the server's
+	  response.
 
 	* Always begins with the section header "acknowledgments"
 
@@ -404,9 +423,6 @@ header.
 	  which the client has not indicated was shallow as a part of
 	  its request.
 
-	* This section is only included if a packfile section is also
-	  included in the response.
-
     wanted-refs section
 	* This section is only included if the client has requested a
 	  ref using a 'want-ref' line and if a packfile section is also
@@ -420,6 +436,20 @@ header.
 	* The server MUST NOT send any refs which were not requested
 	  using 'want-ref' lines.
 
+    packfile-uris section
+	* This section is only included if the client sent
+	  'packfile-uris' and the server has at least one such URI to
+	  send.
+
+	* Always begins with the section header "packfile-uris".
+
+	* For each URI the server sends, it sends a hash of the pack's
+	  contents (as output by git index-pack) followed by the URI.
+
+	* The hashes are 40 hex characters long. When Git upgrades to a new
+	  hash algorithm, this might need to be updated. (It should match
+	  whatever index-pack outputs after "pack\t" or "keep\t".
+
     packfile section
 	* This section is only included if the client has sent 'want'
 	  lines in its request and either requested that no more
diff --git a/Documentation/technical/reftable.txt b/Documentation/technical/reftable.txt
new file mode 100644
index 0000000000..2951840e9c
--- /dev/null
+++ b/Documentation/technical/reftable.txt
@@ -0,0 +1,1083 @@
+reftable
+--------
+
+Overview
+~~~~~~~~
+
+Problem statement
+^^^^^^^^^^^^^^^^^
+
+Some repositories contain a lot of references (e.g. android at 866k,
+rails at 31k). The existing packed-refs format takes up a lot of space
+(e.g. 62M), and does not scale with additional references. Lookup of a
+single reference requires linearly scanning the file.
+
+Atomic pushes modifying multiple references require copying the entire
+packed-refs file, which can be a considerable amount of data moved
+(e.g. 62M in, 62M out) for even small transactions (2 refs modified).
+
+Repositories with many loose references occupy a large number of disk
+blocks from the local file system, as each reference is its own file
+storing 41 bytes (and another file for the corresponding reflog). This
+negatively affects the number of inodes available when a large number of
+repositories are stored on the same filesystem. Readers can be penalized
+due to the larger number of syscalls required to traverse and read the
+`$GIT_DIR/refs` directory.
+
+
+Objectives
+^^^^^^^^^^
+
+* Near constant time lookup for any single reference, even when the
+repository is cold and not in process or kernel cache.
+* Near constant time verification if an object name is referred to by at least
+one reference (for allow-tip-sha1-in-want).
+* Efficient enumeration of an entire namespace, such as `refs/tags/`.
+* Support atomic push with `O(size_of_update)` operations.
+* Combine reflog storage with ref storage for small transactions.
+* Separate reflog storage for base refs and historical logs.
+
+Description
+^^^^^^^^^^^
+
+A reftable file is a portable binary file format customized for
+reference storage. References are sorted, enabling linear scans, binary
+search lookup, and range scans.
+
+Storage in the file is organized into variable sized blocks. Prefix
+compression is used within a single block to reduce disk space. Block
+size and alignment is tunable by the writer.
+
+Performance
+^^^^^^^^^^^
+
+Space used, packed-refs vs. reftable:
+
+[cols=",>,>,>,>,>",options="header",]
+|===============================================================
+|repository |packed-refs |reftable |% original |avg ref |avg obj
+|android |62.2 M |36.1 M |58.0% |33 bytes |5 bytes
+|rails |1.8 M |1.1 M |57.7% |29 bytes |4 bytes
+|git |78.7 K |48.1 K |61.0% |50 bytes |4 bytes
+|git (heads) |332 b |269 b |81.0% |33 bytes |0 bytes
+|===============================================================
+
+Scan (read 866k refs), by reference name lookup (single ref from 866k
+refs), and by SHA-1 lookup (refs with that SHA-1, from 866k refs):
+
+[cols=",>,>,>,>",options="header",]
+|=========================================================
+|format |cache |scan |by name |by SHA-1
+|packed-refs |cold |402 ms |409,660.1 usec |412,535.8 usec
+|packed-refs |hot | |6,844.6 usec |20,110.1 usec
+|reftable |cold |112 ms |33.9 usec |323.2 usec
+|reftable |hot | |20.2 usec |320.8 usec
+|=========================================================
+
+Space used for 149,932 log entries for 43,061 refs, reflog vs. reftable:
+
+[cols=",>,>",options="header",]
+|================================
+|format |size |avg entry
+|$GIT_DIR/logs |173 M |1209 bytes
+|reftable |5 M |37 bytes
+|================================
+
+Details
+~~~~~~~
+
+Peeling
+^^^^^^^
+
+References stored in a reftable are peeled, a record for an annotated
+(or signed) tag records both the tag object, and the object it refers
+to. This is analogous to storage in the packed-refs format.
+
+Reference name encoding
+^^^^^^^^^^^^^^^^^^^^^^^
+
+Reference names are an uninterpreted sequence of bytes that must pass
+linkgit:git-check-ref-format[1] as a valid reference name.
+
+Key unicity
+^^^^^^^^^^^
+
+Each entry must have a unique key; repeated keys are disallowed.
+
+Network byte order
+^^^^^^^^^^^^^^^^^^
+
+All multi-byte, fixed width fields are in network byte order.
+
+Varint encoding
+^^^^^^^^^^^^^^^
+
+Varint encoding is identical to the ofs-delta encoding method used
+within pack files.
+
+Decoder works such as:
+
+....
+val = buf[ptr] & 0x7f
+while (buf[ptr] & 0x80) {
+  ptr++
+  val = ((val + 1) << 7) | (buf[ptr] & 0x7f)
+}
+....
+
+Ordering
+^^^^^^^^
+
+Blocks are lexicographically ordered by their first reference.
+
+Directory/file conflicts
+^^^^^^^^^^^^^^^^^^^^^^^^
+
+The reftable format accepts both `refs/heads/foo` and
+`refs/heads/foo/bar` as distinct references.
+
+This property is useful for retaining log records in reftable, but may
+confuse versions of Git using `$GIT_DIR/refs` directory tree to maintain
+references. Users of reftable may choose to continue to reject `foo` and
+`foo/bar` type conflicts to prevent problems for peers.
+
+File format
+~~~~~~~~~~~
+
+Structure
+^^^^^^^^^
+
+A reftable file has the following high-level structure:
+
+....
+first_block {
+  header
+  first_ref_block
+}
+ref_block*
+ref_index*
+obj_block*
+obj_index*
+log_block*
+log_index*
+footer
+....
+
+A log-only file omits the `ref_block`, `ref_index`, `obj_block` and
+`obj_index` sections, containing only the file header and log block:
+
+....
+first_block {
+  header
+}
+log_block*
+log_index*
+footer
+....
+
+in a log-only file the first log block immediately follows the file
+header, without padding to block alignment.
+
+Block size
+^^^^^^^^^^
+
+The file's block size is arbitrarily determined by the writer, and does
+not have to be a power of 2. The block size must be larger than the
+longest reference name or log entry used in the repository, as
+references cannot span blocks.
+
+Powers of two that are friendly to the virtual memory system or
+filesystem (such as 4k or 8k) are recommended. Larger sizes (64k) can
+yield better compression, with a possible increased cost incurred by
+readers during access.
+
+The largest block size is `16777215` bytes (15.99 MiB).
+
+Block alignment
+^^^^^^^^^^^^^^^
+
+Writers may choose to align blocks at multiples of the block size by
+including `padding` filled with NUL bytes at the end of a block to round
+out to the chosen alignment. When alignment is used, writers must
+specify the alignment with the file header's `block_size` field.
+
+Block alignment is not required by the file format. Unaligned files must
+set `block_size = 0` in the file header, and omit `padding`. Unaligned
+files with more than one ref block must include the link:#Ref-index[ref
+index] to support fast lookup. Readers must be able to read both aligned
+and non-aligned files.
+
+Very small files (e.g. a single ref block) may omit `padding` and the ref
+index to reduce total file size.
+
+Header (version 1)
+^^^^^^^^^^^^^^^^^^
+
+A 24-byte header appears at the beginning of the file:
+
+....
+'REFT'
+uint8( version_number = 1 )
+uint24( block_size )
+uint64( min_update_index )
+uint64( max_update_index )
+....
+
+Aligned files must specify `block_size` to configure readers with the
+expected block alignment. Unaligned files must set `block_size = 0`.
+
+The `min_update_index` and `max_update_index` describe bounds for the
+`update_index` field of all log records in this file. When reftables are
+used in a stack for link:#Update-transactions[transactions], these
+fields can order the files such that the prior file's
+`max_update_index + 1` is the next file's `min_update_index`.
+
+Header (version 2)
+^^^^^^^^^^^^^^^^^^
+
+A 28-byte header appears at the beginning of the file:
+
+....
+'REFT'
+uint8( version_number = 2 )
+uint24( block_size )
+uint64( min_update_index )
+uint64( max_update_index )
+uint32( hash_id )
+....
+
+The header is identical to `version_number=1`, with the 4-byte hash ID
+("sha1" for SHA1 and "s256" for SHA-256) append to the header.
+
+For maximum backward compatibility, it is recommended to use version 1 when
+writing SHA1 reftables.
+
+First ref block
+^^^^^^^^^^^^^^^
+
+The first ref block shares the same block as the file header, and is 24
+bytes smaller than all other blocks in the file. The first block
+immediately begins after the file header, at position 24.
+
+If the first block is a log block (a log-only file), its block header
+begins immediately at position 24.
+
+Ref block format
+^^^^^^^^^^^^^^^^
+
+A ref block is written as:
+
+....
+'r'
+uint24( block_len )
+ref_record+
+uint24( restart_offset )+
+uint16( restart_count )
+
+padding?
+....
+
+Blocks begin with `block_type = 'r'` and a 3-byte `block_len` which
+encodes the number of bytes in the block up to, but not including the
+optional `padding`. This is always less than or equal to the file's
+block size. In the first ref block, `block_len` includes 24 bytes for
+the file header.
+
+The 2-byte `restart_count` stores the number of entries in the
+`restart_offset` list, which must not be empty. Readers can use
+`restart_count` to binary search between restarts before starting a
+linear scan.
+
+Exactly `restart_count` 3-byte `restart_offset` values precedes the
+`restart_count`. Offsets are relative to the start of the block and
+refer to the first byte of any `ref_record` whose name has not been
+prefix compressed. Entries in the `restart_offset` list must be sorted,
+ascending. Readers can start linear scans from any of these records.
+
+A variable number of `ref_record` fill the middle of the block,
+describing reference names and values. The format is described below.
+
+As the first ref block shares the first file block with the file header,
+all `restart_offset` in the first block are relative to the start of the
+file (position 0), and include the file header. This forces the first
+`restart_offset` to be `28`.
+
+ref record
+++++++++++
+
+A `ref_record` describes a single reference, storing both the name and
+its value(s). Records are formatted as:
+
+....
+varint( prefix_length )
+varint( (suffix_length << 3) | value_type )
+suffix
+varint( update_index_delta )
+value?
+....
+
+The `prefix_length` field specifies how many leading bytes of the prior
+reference record's name should be copied to obtain this reference's
+name. This must be 0 for the first reference in any block, and also must
+be 0 for any `ref_record` whose offset is listed in the `restart_offset`
+table at the end of the block.
+
+Recovering a reference name from any `ref_record` is a simple concat:
+
+....
+this_name = prior_name[0..prefix_length] + suffix
+....
+
+The `suffix_length` value provides the number of bytes available in
+`suffix` to copy from `suffix` to complete the reference name.
+
+The `update_index` that last modified the reference can be obtained by
+adding `update_index_delta` to the `min_update_index` from the file
+header: `min_update_index + update_index_delta`.
+
+The `value` follows. Its format is determined by `value_type`, one of
+the following:
+
+* `0x0`: deletion; no value data (see transactions, below)
+* `0x1`: one object name; value of the ref
+* `0x2`: two object names; value of the ref, peeled target
+* `0x3`: symbolic reference: `varint( target_len ) target`
+
+Symbolic references use `0x3`, followed by the complete name of the
+reference target. No compression is applied to the target name.
+
+Types `0x4..0x7` are reserved for future use.
+
+Ref index
+^^^^^^^^^
+
+The ref index stores the name of the last reference from every ref block
+in the file, enabling reduced disk seeks for lookups. Any reference can
+be found by searching the index, identifying the containing block, and
+searching within that block.
+
+The index may be organized into a multi-level index, where the 1st level
+index block points to additional ref index blocks (2nd level), which may
+in turn point to either additional index blocks (e.g. 3rd level) or ref
+blocks (leaf level). Disk reads required to access a ref go up with
+higher index levels. Multi-level indexes may be required to ensure no
+single index block exceeds the file format's max block size of
+`16777215` bytes (15.99 MiB). To achieve constant O(1) disk seeks for
+lookups the index must be a single level, which is permitted to exceed
+the file's configured block size, but not the format's max block size of
+15.99 MiB.
+
+If present, the ref index block(s) appears after the last ref block.
+
+If there are at least 4 ref blocks, a ref index block should be written
+to improve lookup times. Cold reads using the index require 2 disk reads
+(read index, read block), and binary searching < 4 blocks also requires
+<= 2 reads. Omitting the index block from smaller files saves space.
+
+If the file is unaligned and contains more than one ref block, the ref
+index must be written.
+
+Index block format:
+
+....
+'i'
+uint24( block_len )
+index_record+
+uint24( restart_offset )+
+uint16( restart_count )
+
+padding?
+....
+
+The index blocks begin with `block_type = 'i'` and a 3-byte `block_len`
+which encodes the number of bytes in the block, up to but not including
+the optional `padding`.
+
+The `restart_offset` and `restart_count` fields are identical in format,
+meaning and usage as in ref blocks.
+
+To reduce the number of reads required for random access in very large
+files the index block may be larger than other blocks. However, readers
+must hold the entire index in memory to benefit from this, so it's a
+time-space tradeoff in both file size and reader memory.
+
+Increasing the file's block size decreases the index size. Alternatively
+a multi-level index may be used, keeping index blocks within the file's
+block size, but increasing the number of blocks that need to be
+accessed.
+
+index record
+++++++++++++
+
+An index record describes the last entry in another block. Index records
+are written as:
+
+....
+varint( prefix_length )
+varint( (suffix_length << 3) | 0 )
+suffix
+varint( block_position )
+....
+
+Index records use prefix compression exactly like `ref_record`.
+
+Index records store `block_position` after the suffix, specifying the
+absolute position in bytes (from the start of the file) of the block
+that ends with this reference. Readers can seek to `block_position` to
+begin reading the block header.
+
+Readers must examine the block header at `block_position` to determine
+if the next block is another level index block, or the leaf-level ref
+block.
+
+Reading the index
++++++++++++++++++
+
+Readers loading the ref index must first read the footer (below) to
+obtain `ref_index_position`. If not present, the position will be 0. The
+`ref_index_position` is for the 1st level root of the ref index.
+
+Obj block format
+^^^^^^^^^^^^^^^^
+
+Object blocks are optional. Writers may choose to omit object blocks,
+especially if readers will not use the object name to ref mapping.
+
+Object blocks use unique, abbreviated 2-32 object name keys, mapping to
+ref blocks containing references pointing to that object directly, or as
+the peeled value of an annotated tag. Like ref blocks, object blocks use
+the file's standard block size. The abbrevation length is available in
+the footer as `obj_id_len`.
+
+To save space in small files, object blocks may be omitted if the ref
+index is not present, as brute force search will only need to read a few
+ref blocks. When missing, readers should brute force a linear search of
+all references to lookup by object name.
+
+An object block is written as:
+
+....
+'o'
+uint24( block_len )
+obj_record+
+uint24( restart_offset )+
+uint16( restart_count )
+
+padding?
+....
+
+Fields are identical to ref block. Binary search using the restart table
+works the same as in reference blocks.
+
+Because object names are abbreviated by writers to the shortest unique
+abbreviation within the reftable, obj key lengths have a variable length. Their
+length must be at least 2 bytes. Readers must compare only for common prefix
+match within an obj block or obj index.
+
+obj record
+++++++++++
+
+An `obj_record` describes a single object abbreviation, and the blocks
+containing references using that unique abbreviation:
+
+....
+varint( prefix_length )
+varint( (suffix_length << 3) | cnt_3 )
+suffix
+varint( cnt_large )?
+varint( position_delta )*
+....
+
+Like in reference blocks, abbreviations are prefix compressed within an
+obj block. On large reftables with many unique objects, higher block
+sizes (64k), and higher restart interval (128), a `prefix_length` of 2
+or 3 and `suffix_length` of 3 may be common in obj records (unique
+abbreviation of 5-6 raw bytes, 10-12 hex digits).
+
+Each record contains `position_count` number of positions for matching
+ref blocks. For 1-7 positions the count is stored in `cnt_3`. When
+`cnt_3 = 0` the actual count follows in a varint, `cnt_large`.
+
+The use of `cnt_3` bets most objects are pointed to by only a single
+reference, some may be pointed to by a couple of references, and very
+few (if any) are pointed to by more than 7 references.
+
+A special case exists when `cnt_3 = 0` and `cnt_large = 0`: there are no
+`position_delta`, but at least one reference starts with this
+abbreviation. A reader that needs exact reference names must scan all
+references to find which specific references have the desired object.
+Writers should use this format when the `position_delta` list would have
+overflowed the file's block size due to a high number of references
+pointing to the same object.
+
+The first `position_delta` is the position from the start of the file.
+Additional `position_delta` entries are sorted ascending and relative to
+the prior entry, e.g. a reader would perform:
+
+....
+pos = position_delta[0]
+prior = pos
+for (j = 1; j < position_count; j++) {
+  pos = prior + position_delta[j]
+  prior = pos
+}
+....
+
+With a position in hand, a reader must linearly scan the ref block,
+starting from the first `ref_record`, testing each reference's object names
+(for `value_type = 0x1` or `0x2`) for full equality. Faster searching by
+object name within a single ref block is not supported by the reftable format.
+Smaller block sizes reduce the number of candidates this step must
+consider.
+
+Obj index
+^^^^^^^^^
+
+The obj index stores the abbreviation from the last entry for every obj
+block in the file, enabling reduced disk seeks for all lookups. It is
+formatted exactly the same as the ref index, but refers to obj blocks.
+
+The obj index should be present if obj blocks are present, as obj blocks
+should only be written in larger files.
+
+Readers loading the obj index must first read the footer (below) to
+obtain `obj_index_position`. If not present, the position will be 0.
+
+Log block format
+^^^^^^^^^^^^^^^^
+
+Unlike ref and obj blocks, log blocks are always unaligned.
+
+Log blocks are variable in size, and do not match the `block_size`
+specified in the file header or footer. Writers should choose an
+appropriate buffer size to prepare a log block for deflation, such as
+`2 * block_size`.
+
+A log block is written as:
+
+....
+'g'
+uint24( block_len )
+zlib_deflate {
+  log_record+
+  uint24( restart_offset )+
+  uint16( restart_count )
+}
+....
+
+Log blocks look similar to ref blocks, except `block_type = 'g'`.
+
+The 4-byte block header is followed by the deflated block contents using
+zlib deflate. The `block_len` in the header is the inflated size
+(including 4-byte block header), and should be used by readers to
+preallocate the inflation output buffer. A log block's `block_len` may
+exceed the file's block size.
+
+Offsets within the log block (e.g. `restart_offset`) still include the
+4-byte header. Readers may prefer prefixing the inflation output buffer
+with the 4-byte header.
+
+Within the deflate container, a variable number of `log_record` describe
+reference changes. The log record format is described below. See ref
+block format (above) for a description of `restart_offset` and
+`restart_count`.
+
+Because log blocks have no alignment or padding between blocks, readers
+must keep track of the bytes consumed by the inflater to know where the
+next log block begins.
+
+log record
+++++++++++
+
+Log record keys are structured as:
+
+....
+ref_name '\0' reverse_int64( update_index )
+....
+
+where `update_index` is the unique transaction identifier. The
+`update_index` field must be unique within the scope of a `ref_name`.
+See the update transactions section below for further details.
+
+The `reverse_int64` function inverses the value so lexicographical
+ordering the network byte order encoding sorts the more recent records
+with higher `update_index` values first:
+
+....
+reverse_int64(int64 t) {
+  return 0xffffffffffffffff - t;
+}
+....
+
+Log records have a similar starting structure to ref and index records,
+utilizing the same prefix compression scheme applied to the log record
+key described above.
+
+....
+    varint( prefix_length )
+    varint( (suffix_length << 3) | log_type )
+    suffix
+    log_data {
+      old_id
+      new_id
+      varint( name_length    )  name
+      varint( email_length   )  email
+      varint( time_seconds )
+      sint16( tz_offset )
+      varint( message_length )  message
+    }?
+....
+
+Log record entries use `log_type` to indicate what follows:
+
+* `0x0`: deletion; no log data.
+* `0x1`: standard git reflog data using `log_data` above.
+
+The `log_type = 0x0` is mostly useful for `git stash drop`, removing an
+entry from the reflog of `refs/stash` in a transaction file (below),
+without needing to rewrite larger files. Readers reading a stack of
+reflogs must treat this as a deletion.
+
+For `log_type = 0x1`, the `log_data` section follows
+linkgit:git-update-ref[1] logging and includes:
+
+* two object names (old id, new id)
+* varint string of committer's name
+* varint string of committer's email
+* varint time in seconds since epoch (Jan 1, 1970)
+* 2-byte timezone offset in minutes (signed)
+* varint string of message
+
+`tz_offset` is the absolute number of minutes from GMT the committer was
+at the time of the update. For example `GMT-0800` is encoded in reftable
+as `sint16(-480)` and `GMT+0230` is `sint16(150)`.
+
+The committer email does not contain `<` or `>`, it's the value normally
+found between the `<>` in a git commit object header.
+
+The `message_length` may be 0, in which case there was no message
+supplied for the update.
+
+Contrary to traditional reflog (which is a file), renames are encoded as
+a combination of ref deletion and ref creation.  A deletion is a log
+record with a zero new_id, and a creation is a log record with a zero old_id.
+
+Reading the log
++++++++++++++++
+
+Readers accessing the log must first read the footer (below) to
+determine the `log_position`. The first block of the log begins at
+`log_position` bytes since the start of the file. The `log_position` is
+not block aligned.
+
+Importing logs
+++++++++++++++
+
+When importing from `$GIT_DIR/logs` writers should globally order all
+log records roughly by timestamp while preserving file order, and assign
+unique, increasing `update_index` values for each log line. Newer log
+records get higher `update_index` values.
+
+Although an import may write only a single reftable file, the reftable
+file must span many unique `update_index`, as each log line requires its
+own `update_index` to preserve semantics.
+
+Log index
+^^^^^^^^^
+
+The log index stores the log key
+(`refname \0 reverse_int64(update_index)`) for the last log record of
+every log block in the file, supporting bounded-time lookup.
+
+A log index block must be written if 2 or more log blocks are written to
+the file. If present, the log index appears after the last log block.
+There is no padding used to align the log index to block alignment.
+
+Log index format is identical to ref index, except the keys are 9 bytes
+longer to include `'\0'` and the 8-byte `reverse_int64(update_index)`.
+Records use `block_position` to refer to the start of a log block.
+
+Reading the index
++++++++++++++++++
+
+Readers loading the log index must first read the footer (below) to
+obtain `log_index_position`. If not present, the position will be 0.
+
+Footer
+^^^^^^
+
+After the last block of the file, a file footer is written. It begins
+like the file header, but is extended with additional data.
+
+....
+    HEADER
+
+    uint64( ref_index_position )
+    uint64( (obj_position << 5) | obj_id_len )
+    uint64( obj_index_position )
+
+    uint64( log_position )
+    uint64( log_index_position )
+
+    uint32( CRC-32 of above )
+....
+
+If a section is missing (e.g. ref index) the corresponding position
+field (e.g. `ref_index_position`) will be 0.
+
+* `obj_position`: byte position for the first obj block.
+* `obj_id_len`: number of bytes used to abbreviate object names in
+obj blocks.
+* `log_position`: byte position for the first log block.
+* `ref_index_position`: byte position for the start of the ref index.
+* `obj_index_position`: byte position for the start of the obj index.
+* `log_index_position`: byte position for the start of the log index.
+
+The size of the footer is 68 bytes for version 1, and 72 bytes for
+version 2.
+
+Reading the footer
+++++++++++++++++++
+
+Readers must first read the file start to determine the version
+number. Then they seek to `file_length - FOOTER_LENGTH` to access the
+footer. A trusted external source (such as `stat(2)`) is necessary to
+obtain `file_length`. When reading the footer, readers must verify:
+
+* 4-byte magic is correct
+* 1-byte version number is recognized
+* 4-byte CRC-32 matches the other 64 bytes (including magic, and
+version)
+
+Once verified, the other fields of the footer can be accessed.
+
+Empty tables
+++++++++++++
+
+A reftable may be empty. In this case, the file starts with a header
+and is immediately followed by a footer.
+
+Binary search
+^^^^^^^^^^^^^
+
+Binary search within a block is supported by the `restart_offset` fields
+at the end of the block. Readers can binary search through the restart
+table to locate between which two restart points the sought reference or
+key should appear.
+
+Each record identified by a `restart_offset` stores the complete key in
+the `suffix` field of the record, making the compare operation during
+binary search straightforward.
+
+Once a restart point lexicographically before the sought reference has
+been identified, readers can linearly scan through the following record
+entries to locate the sought record, terminating if the current record
+sorts after (and therefore the sought key is not present).
+
+Restart point selection
++++++++++++++++++++++++
+
+Writers determine the restart points at file creation. The process is
+arbitrary, but every 16 or 64 records is recommended. Every 16 may be
+more suitable for smaller block sizes (4k or 8k), every 64 for larger
+block sizes (64k).
+
+More frequent restart points reduces prefix compression and increases
+space consumed by the restart table, both of which increase file size.
+
+Less frequent restart points makes prefix compression more effective,
+decreasing overall file size, with increased penalties for readers
+walking through more records after the binary search step.
+
+A maximum of `65535` restart points per block is supported.
+
+Considerations
+~~~~~~~~~~~~~~
+
+Lightweight refs dominate
+^^^^^^^^^^^^^^^^^^^^^^^^^
+
+The reftable format assumes the vast majority of references are single
+object names valued with common prefixes, such as Gerrit Code Review's
+`refs/changes/` namespace, GitHub's `refs/pulls/` namespace, or many
+lightweight tags in the `refs/tags/` namespace.
+
+Annotated tags storing the peeled object cost an additional object name per
+reference.
+
+Low overhead
+^^^^^^^^^^^^
+
+A reftable with very few references (e.g. git.git with 5 heads) is 269
+bytes for reftable, vs. 332 bytes for packed-refs. This supports
+reftable scaling down for transaction logs (below).
+
+Block size
+^^^^^^^^^^
+
+For a Gerrit Code Review type repository with many change refs, larger
+block sizes (64 KiB) and less frequent restart points (every 64) yield
+better compression due to more references within the block compressing
+against the prior reference.
+
+Larger block sizes reduce the index size, as the reftable will require
+fewer blocks to store the same number of references.
+
+Minimal disk seeks
+^^^^^^^^^^^^^^^^^^
+
+Assuming the index block has been loaded into memory, binary searching
+for any single reference requires exactly 1 disk seek to load the
+containing block.
+
+Scans and lookups dominate
+^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Scanning all references and lookup by name (or namespace such as
+`refs/heads/`) are the most common activities performed on repositories.
+Object names are stored directly with references to optimize this use case.
+
+Logs are infrequently read
+^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Logs are infrequently accessed, but can be large. Deflating log blocks
+saves disk space, with some increased penalty at read time.
+
+Logs are stored in an isolated section from refs, reducing the burden on
+reference readers that want to ignore logs. Further, historical logs can
+be isolated into log-only files.
+
+Logs are read backwards
+^^^^^^^^^^^^^^^^^^^^^^^
+
+Logs are frequently accessed backwards (most recent N records for master
+to answer `master@{4}`), so log records are grouped by reference, and
+sorted descending by update index.
+
+Repository format
+~~~~~~~~~~~~~~~~~
+
+Version 1
+^^^^^^^^^
+
+A repository must set its `$GIT_DIR/config` to configure reftable:
+
+....
+[core]
+    repositoryformatversion = 1
+[extensions]
+    refStorage = reftable
+....
+
+Layout
+^^^^^^
+
+A collection of reftable files are stored in the `$GIT_DIR/reftable/`
+directory:
+
+....
+00000001-00000001.log
+00000002-00000002.ref
+00000003-00000003.ref
+....
+
+where reftable files are named by a unique name such as produced by the
+function `${min_update_index}-${max_update_index}.ref`.
+
+Log-only files use the `.log` extension, while ref-only and mixed ref
+and log files use `.ref`. extension.
+
+The stack ordering file is `$GIT_DIR/reftable/tables.list` and lists the
+current files, one per line, in order, from oldest (base) to newest
+(most recent):
+
+....
+$ cat .git/reftable/tables.list
+00000001-00000001.log
+00000002-00000002.ref
+00000003-00000003.ref
+....
+
+Readers must read `$GIT_DIR/reftable/tables.list` to determine which
+files are relevant right now, and search through the stack in reverse
+order (last reftable is examined first).
+
+Reftable files not listed in `tables.list` may be new (and about to be
+added to the stack by the active writer), or ancient and ready to be
+pruned.
+
+Backward compatibility
+^^^^^^^^^^^^^^^^^^^^^^
+
+Older clients should continue to recognize the directory as a git
+repository so they don't look for an enclosing repository in parent
+directories. To this end, a reftable-enabled repository must contain the
+following dummy files
+
+* `.git/HEAD`, a regular file containing `ref: refs/heads/.invalid`.
+* `.git/refs/`, a directory
+* `.git/refs/heads`, a regular file
+
+Readers
+^^^^^^^
+
+Readers can obtain a consistent snapshot of the reference space by
+following:
+
+1.  Open and read the `tables.list` file.
+2.  Open each of the reftable files that it mentions.
+3.  If any of the files is missing, goto 1.
+4.  Read from the now-open files as long as necessary.
+
+Update transactions
+^^^^^^^^^^^^^^^^^^^
+
+Although reftables are immutable, mutations are supported by writing a
+new reftable and atomically appending it to the stack:
+
+1.  Acquire `tables.list.lock`.
+2.  Read `tables.list` to determine current reftables.
+3.  Select `update_index` to be most recent file's
+`max_update_index + 1`.
+4.  Prepare temp reftable `tmp_XXXXXX`, including log entries.
+5.  Rename `tmp_XXXXXX` to `${update_index}-${update_index}.ref`.
+6.  Copy `tables.list` to `tables.list.lock`, appending file from (5).
+7.  Rename `tables.list.lock` to `tables.list`.
+
+During step 4 the new file's `min_update_index` and `max_update_index`
+are both set to the `update_index` selected by step 3. All log records
+for the transaction use the same `update_index` in their keys. This
+enables later correlation of which references were updated by the same
+transaction.
+
+Because a single `tables.list.lock` file is used to manage locking, the
+repository is single-threaded for writers. Writers may have to busy-spin
+(with backoff) around creating `tables.list.lock`, for up to an
+acceptable wait period, aborting if the repository is too busy to
+mutate. Application servers wrapped around repositories (e.g. Gerrit
+Code Review) can layer their own lock/wait queue to improve fairness to
+writers.
+
+Reference deletions
+^^^^^^^^^^^^^^^^^^^
+
+Deletion of any reference can be explicitly stored by setting the `type`
+to `0x0` and omitting the `value` field of the `ref_record`. This serves
+as a tombstone, overriding any assertions about the existence of the
+reference from earlier files in the stack.
+
+Compaction
+^^^^^^^^^^
+
+A partial stack of reftables can be compacted by merging references
+using a straightforward merge join across reftables, selecting the most
+recent value for output, and omitting deleted references that do not
+appear in remaining, lower reftables.
+
+A compacted reftable should set its `min_update_index` to the smallest
+of the input files' `min_update_index`, and its `max_update_index`
+likewise to the largest input `max_update_index`.
+
+For sake of illustration, assume the stack currently consists of
+reftable files (from oldest to newest): A, B, C, and D. The compactor is
+going to compact B and C, leaving A and D alone.
+
+1.  Obtain lock `tables.list.lock` and read the `tables.list` file.
+2.  Obtain locks `B.lock` and `C.lock`. Ownership of these locks
+prevents other processes from trying to compact these files.
+3.  Release `tables.list.lock`.
+4.  Compact `B` and `C` into a temp file
+`${min_update_index}-${max_update_index}_XXXXXX`.
+5.  Reacquire lock `tables.list.lock`.
+6.  Verify that `B` and `C` are still in the stack, in that order. This
+should always be the case, assuming that other processes are adhering to
+the locking protocol.
+7.  Rename `${min_update_index}-${max_update_index}_XXXXXX` to
+`${min_update_index}-${max_update_index}.ref`.
+8.  Write the new stack to `tables.list.lock`, replacing `B` and `C`
+with the file from (4).
+9.  Rename `tables.list.lock` to `tables.list`.
+10. Delete `B` and `C`, perhaps after a short sleep to avoid forcing
+readers to backtrack.
+
+This strategy permits compactions to proceed independently of updates.
+
+Each reftable (compacted or not) is uniquely identified by its name, so
+open reftables can be cached by their name.
+
+Alternatives considered
+~~~~~~~~~~~~~~~~~~~~~~~
+
+bzip packed-refs
+^^^^^^^^^^^^^^^^
+
+`bzip2` can significantly shrink a large packed-refs file (e.g. 62 MiB
+compresses to 23 MiB, 37%). However the bzip format does not support
+random access to a single reference. Readers must inflate and discard
+while performing a linear scan.
+
+Breaking packed-refs into chunks (individually compressing each chunk)
+would reduce the amount of data a reader must inflate, but still leaves
+the problem of indexing chunks to support readers efficiently locating
+the correct chunk.
+
+Given the compression achieved by reftable's encoding, it does not seem
+necessary to add the complexity of bzip/gzip/zlib.
+
+Michael Haggerty's alternate format
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Michael Haggerty proposed
+link:https://lore.kernel.org/git/CAMy9T_HCnyc1g8XWOOWhe7nN0aEFyyBskV2aOMb_fe%2BwGvEJ7A%40mail.gmail.com/[an
+alternate] format to reftable on the Git mailing list. This format uses
+smaller chunks, without the restart table, and avoids block alignment
+with padding. Reflog entries immediately follow each ref, and are thus
+interleaved between refs.
+
+Performance testing indicates reftable is faster for lookups (51%
+faster, 11.2 usec vs. 5.4 usec), although reftable produces a slightly
+larger file (+ ~3.2%, 28.3M vs 29.2M):
+
+[cols=">,>,>,>",options="header",]
+|=====================================
+|format |size |seek cold |seek hot
+|mh-alt |28.3 M |23.4 usec |11.2 usec
+|reftable |29.2 M |19.9 usec |5.4 usec
+|=====================================
+
+JGit Ketch RefTree
+^^^^^^^^^^^^^^^^^^
+
+https://dev.eclipse.org/mhonarc/lists/jgit-dev/msg03073.html[JGit Ketch]
+proposed
+link:https://lore.kernel.org/git/CAJo%3DhJvnAPNAdDcAAwAvU9C4RVeQdoS3Ev9WTguHx4fD0V_nOg%40mail.gmail.com/[RefTree],
+an encoding of references inside Git tree objects stored as part of the
+repository's object database.
+
+The RefTree format adds additional load on the object database storage
+layer (more loose objects, more objects in packs), and relies heavily on
+the packer's delta compression to save space. Namespaces which are flat
+(e.g. thousands of tags in refs/tags) initially create very large loose
+objects, and so RefTree does not address the problem of copying many
+references to modify a handful.
+
+Flat namespaces are not efficiently searchable in RefTree, as tree
+objects in canonical formatting cannot be binary searched. This fails
+the need to handle a large number of references in a single namespace,
+such as GitHub's `refs/pulls`, or a project with many tags.
+
+LMDB
+^^^^
+
+David Turner proposed
+https://lore.kernel.org/git/1455772670-21142-26-git-send-email-dturner@twopensource.com/[using
+LMDB], as LMDB is lightweight (64k of runtime code) and GPL-compatible
+license.
+
+A downside of LMDB is its reliance on a single C implementation. This
+makes embedding inside JGit (a popular reimplementation of Git)
+difficult, and hoisting onto virtual storage (for JGit DFS) virtually
+impossible.
+
+A common format that can be supported by all major Git implementations
+(git-core, JGit, libgit2) is strongly preferred.