6 files changed, 153 insertions, 32 deletions
diff --git a/Documentation/technical/commit-graph-format.txt b/Documentation/technical/commit-graph-format.txt
index b3b58880b9..b6658eff18 100644
--- a/Documentation/technical/commit-graph-format.txt
+++ b/Documentation/technical/commit-graph-format.txt
@@ -4,11 +4,7 @@ Git commit graph format
 The Git commit graph stores a list of commit OIDs and some associated
 metadata, including:
 
-- The generation number of the commit. Commits with no parents have
-  generation number 1; commits with parents have generation number
-  one more than the maximum generation number of its parents. We
-  reserve zero as special, and can be used to mark a generation
-  number invalid or as "not computed".
+- The generation number of the commit.
 
 - The root tree OID.
 
@@ -86,13 +82,33 @@ CHUNK DATA:
       position. If there are more than two parents, the second value
       has its most-significant bit on and the other bits store an array
       position into the Extra Edge List chunk.
-    * The next 8 bytes store the generation number of the commit and
+    * The next 8 bytes store the topological level (generation number v1)
+      of the commit and
       the commit time in seconds since EPOCH. The generation number
       uses the higher 30 bits of the first 4 bytes, while the commit
       time uses the 32 bits of the second 4 bytes, along with the lowest
       2 bits of the lowest byte, storing the 33rd and 34th bit of the
       commit time.
 
+  Generation Data (ID: {'G', 'D', 'A', 'T' }) (N * 4 bytes) [Optional]
+    * This list of 4-byte values store corrected commit date offsets for the
+      commits, arranged in the same order as commit data chunk.
+    * If the corrected commit date offset cannot be stored within 31 bits,
+      the value has its most-significant bit on and the other bits store
+      the position of corrected commit date into the Generation Data Overflow
+      chunk.
+    * Generation Data chunk is present only when commit-graph file is written
+      by compatible versions of Git and in case of split commit-graph chains,
+      the topmost layer also has Generation Data chunk.
+
+  Generation Data Overflow (ID: {'G', 'D', 'O', 'V' }) [Optional]
+    * This list of 8-byte values stores the corrected commit date offsets
+      for commits with corrected commit date offsets that cannot be
+      stored within 31 bits.
+    * Generation Data Overflow chunk is present only when Generation Data
+      chunk is present and atleast one corrected commit date offset cannot
+      be stored within 31 bits.
+
   Extra Edge List (ID: {'E', 'D', 'G', 'E'}) [Optional]
       This list of 4-byte values store the second through nth parents for
       all octopus merges. The second parent value in the commit data stores
diff --git a/Documentation/technical/commit-graph.txt b/Documentation/technical/commit-graph.txt
index f14a7659aa..f05e7bda1a 100644
--- a/Documentation/technical/commit-graph.txt
+++ b/Documentation/technical/commit-graph.txt
@@ -38,14 +38,31 @@ A consumer may load the following info for a commit from the graph:
 
 Values 1-4 satisfy the requirements of parse_commit_gently().
 
-Define the "generation number" of a commit recursively as follows:
+There are two definitions of generation number:
+1. Corrected committer dates (generation number v2)
+2. Topological levels (generation nummber v1)
 
- * A commit with no parents (a root commit) has generation number one.
+Define "corrected committer date" of a commit recursively as follows:
 
- * A commit with at least one parent has generation number one more than
-   the largest generation number among its parents.
+ * A commit with no parents (a root commit) has corrected committer date
+    equal to its committer date.
 
-Equivalently, the generation number of a commit A is one more than the
+ * A commit with at least one parent has corrected committer date equal to
+    the maximum of its commiter date and one more than the largest corrected
+    committer date among its parents.
+
+ * As a special case, a root commit with timestamp zero has corrected commit
+    date of 1, to be able to distinguish it from GENERATION_NUMBER_ZERO
+    (that is, an uncomputed corrected commit date).
+
+Define the "topological level" of a commit recursively as follows:
+
+ * A commit with no parents (a root commit) has topological level of one.
+
+ * A commit with at least one parent has topological level one more than
+   the largest topological level among its parents.
+
+Equivalently, the topological level of a commit A is one more than the
 length of a longest path from A to a root commit. The recursive definition
 is easier to use for computation and observing the following property:
 
@@ -60,6 +77,9 @@ is easier to use for computation and observing the following property:
     generation numbers, then we always expand the boundary commit with highest
     generation number and can easily detect the stopping condition.
 
+The property applies to both versions of generation number, that is both
+corrected committer dates and topological levels.
+
 This property can be used to significantly reduce the time it takes to
 walk commits and determine topological relationships. Without generation
 numbers, the general heuristic is the following:
@@ -67,7 +87,9 @@ numbers, the general heuristic is the following:
     If A and B are commits with commit time X and Y, respectively, and
     X < Y, then A _probably_ cannot reach B.
 
-This heuristic is currently used whenever the computation is allowed to
+In absence of corrected commit dates (for example, old versions of Git or
+mixed generation graph chains),
+this heuristic is currently used whenever the computation is allowed to
 violate topological relationships due to clock skew (such as "git log"
 with default order), but is not used when the topological order is
 required (such as merge base calculations, "git log --graph").
@@ -77,7 +99,7 @@ in the commit graph. We can treat these commits as having "infinite"
 generation number and walk until reaching commits with known generation
 number.
 
-We use the macro GENERATION_NUMBER_INFINITY = 0xFFFFFFFF to mark commits not
+We use the macro GENERATION_NUMBER_INFINITY to mark commits not
 in the commit-graph file. If a commit-graph file was written by a version
 of Git that did not compute generation numbers, then those commits will
 have generation number represented by the macro GENERATION_NUMBER_ZERO = 0.
@@ -93,12 +115,12 @@ fully-computed generation numbers. Using strict inequality may result in
 walking a few extra commits, but the simplicity in dealing with commits
 with generation number *_INFINITY or *_ZERO is valuable.
 
-We use the macro GENERATION_NUMBER_MAX = 0x3FFFFFFF to for commits whose
-generation numbers are computed to be at least this value. We limit at
-this value since it is the largest value that can be stored in the
-commit-graph file using the 30 bits available to generation numbers. This
-presents another case where a commit can have generation number equal to
-that of a parent.
+We use the macro GENERATION_NUMBER_V1_MAX = 0x3FFFFFFF for commits whose
+topological levels (generation number v1) are computed to be at least
+this value. We limit at this value since it is the largest value that
+can be stored in the commit-graph file using the 30 bits available
+to topological levels. This presents another case where a commit can
+have generation number equal to that of a parent.
 
 Design Details
 --------------
@@ -267,6 +289,35 @@ The merge strategy values (2 for the size multiple, 64,000 for the maximum
 number of commits) could be extracted into config settings for full
 flexibility.
 
+## Handling Mixed Generation Number Chains
+
+With the introduction of generation number v2 and generation data chunk, the
+following scenario is possible:
+
+1. "New" Git writes a commit-graph with the corrected commit dates.
+2. "Old" Git writes a split commit-graph on top without corrected commit dates.
+
+A naive approach of using the newest available generation number from
+each layer would lead to violated expectations: the lower layer would
+use corrected commit dates which are much larger than the topological
+levels of the higher layer. For this reason, Git inspects the topmost
+layer to see if the layer is missing corrected commit dates. In such a case
+Git only uses topological level for generation numbers.
+
+When writing a new layer in split commit-graph, we write corrected commit
+dates if the topmost layer has corrected commit dates written. This
+guarantees that if a layer has corrected commit dates, all lower layers
+must have corrected commit dates as well.
+
+When merging layers, we do not consider whether the merged layers had corrected
+commit dates. Instead, the new layer will have corrected commit dates if the
+layer below the new layer has corrected commit dates.
+
+While writing or merging layers, if the new layer is the only layer, it will
+have corrected commit dates when written by compatible versions of Git. Thus,
+rewriting split commit-graph as a single file (`--split=replace`) creates a
+single layer with corrected commit dates.
+
 ## Deleting graph-{hash} files
 
 After a new tip file is written, some `graph-{hash}` files may no longer
diff --git a/Documentation/technical/index-format.txt b/Documentation/technical/index-format.txt
index 69edf46c03..b633482b1b 100644
--- a/Documentation/technical/index-format.txt
+++ b/Documentation/technical/index-format.txt
@@ -26,7 +26,7 @@ Git index format
      Extensions are identified by signature. Optional extensions can
      be ignored if Git does not understand them.
 
-     Git currently supports cached tree and resolve undo extensions.
+     Git currently supports cache tree and resolve undo extensions.
 
      4-byte extension signature. If the first byte is 'A'..'Z' the
      extension is optional and can be ignored.
@@ -136,14 +136,35 @@ Git index format
 
 == Extensions
 
-=== Cached tree
-
-  Cached tree extension contains pre-computed hashes for trees that can
-  be derived from the index. It helps speed up tree object generation
-  from index for a new commit.
-
-  When a path is updated in index, the path must be invalidated and
-  removed from tree cache.
+=== Cache tree
+
+  Since the index does not record entries for directories, the cache
+  entries cannot describe tree objects that already exist in the object
+  database for regions of the index that are unchanged from an existing
+  commit. The cache tree extension stores a recursive tree structure that
+  describes the trees that already exist and completely match sections of
+  the cache entries. This speeds up tree object generation from the index
+  for a new commit by only computing the trees that are "new" to that
+  commit. It also assists when comparing the index to another tree, such
+  as `HEAD^{tree}`, since sections of the index can be skipped when a tree
+  comparison demonstrates equality.
+
+  The recursive tree structure uses nodes that store a number of cache
+  entries, a list of subnodes, and an object ID (OID). The OID references
+  the existing tree for that node, if it is known to exist. The subnodes
+  correspond to subdirectories that themselves have cache tree nodes. The
+  number of cache entries corresponds to the number of cache entries in
+  the index that describe paths within that tree's directory.
+
+  The extension tracks the full directory structure in the cache tree
+  extension, but this is generally smaller than the full cache entry list.
+
+  When a path is updated in index, Git invalidates all nodes of the
+  recursive cache tree corresponding to the parent directories of that
+  path. We store these tree nodes as being "invalid" by using "-1" as the
+  number of cache entries. Invalid nodes still store a span of index
+  entries, allowing Git to focus its efforts when reconstructing a full
+  cache tree.
 
   The signature for this extension is { 'T', 'R', 'E', 'E' }.
 
@@ -174,7 +195,8 @@ Git index format
   first entry represents the root level of the repository, followed by the
   first subtree--let's call this A--of the root level (with its name
   relative to the root level), followed by the first subtree of A (with
-  its name relative to A), ...
+  its name relative to A), and so on. The specified number of subtrees
+  indicates when the current level of the recursive stack is complete.
 
 === Resolve undo
 
diff --git a/Documentation/technical/pack-format.txt b/Documentation/technical/pack-format.txt
index 96d2fc589f..8833b71c8b 100644
--- a/Documentation/technical/pack-format.txt
+++ b/Documentation/technical/pack-format.txt
@@ -274,6 +274,26 @@ Pack file entry: <+
 
     Index checksum of all of the above.
 
+== pack-*.rev files have the format:
+
+  - A 4-byte magic number '0x52494458' ('RIDX').
+
+  - A 4-byte version identifier (= 1).
+
+  - A 4-byte hash function identifier (= 1 for SHA-1, 2 for SHA-256).
+
+  - A table of index positions (one per packed object, num_objects in
+    total, each a 4-byte unsigned integer in network order), sorted by
+    their corresponding offsets in the packfile.
+
+  - A trailer, containing a:
+
+    checksum of the corresponding packfile, and
+
+    a checksum of all of the above.
+
+All 4-byte numbers are in network order.
+
 == multi-pack-index (MIDX) files have the following format:
 
 The multi-pack-index files refer to multiple pack-files and loose objects.
diff --git a/Documentation/technical/packfile-uri.txt b/Documentation/technical/packfile-uri.txt
index 318713abc3..f7eabc6c76 100644
--- a/Documentation/technical/packfile-uri.txt
+++ b/Documentation/technical/packfile-uri.txt
@@ -37,8 +37,11 @@ at least so that we can test the client.
 This is the implementation: a feature, marked experimental, that allows the
 server to be configured by one or more `uploadpack.blobPackfileUri=<sha1>
 <uri>` entries. Whenever the list of objects to be sent is assembled, all such
-blobs are excluded, replaced with URIs. The client will download those URIs,
-expecting them to each point to packfiles containing single blobs.
+blobs are excluded, replaced with URIs. As noted in "Future work" below, the
+server can evolve in the future to support excluding other objects (or other
+implementations of servers could be made that support excluding other objects)
+without needing a protocol change, so clients should not expect that packfiles
+downloaded in this way only contain single blobs.
 
 Client design
 -------------
diff --git a/Documentation/technical/protocol-v2.txt b/Documentation/technical/protocol-v2.txt
index 85daeb5d9e..f772d90eaf 100644
--- a/Documentation/technical/protocol-v2.txt
+++ b/Documentation/technical/protocol-v2.txt
@@ -192,11 +192,20 @@ ls-refs takes in the following arguments:
 	When specified, only references having a prefix matching one of
 	the provided prefixes are displayed.
 
+If the 'unborn' feature is advertised the following argument can be
+included in the client's request.
+
+    unborn
+	The server will send information about HEAD even if it is a symref
+	pointing to an unborn branch in the form "unborn HEAD
+	symref-target:<target>".
+
 The output of ls-refs is as follows:
 
     output = *ref
 	     flush-pkt
-    ref = PKT-LINE(obj-id SP refname *(SP ref-attribute) LF)
+    obj-id-or-unborn = (obj-id | "unborn")
+    ref = PKT-LINE(obj-id-or-unborn SP refname *(SP ref-attribute) LF)
     ref-attribute = (symref | peeled)
     symref = "symref-target:" symref-target
     peeled = "peeled:" obj-id