From 917a54c01755c3a0de7abea4b3afbc125aaf1d76 Mon Sep 17 00:00:00 2001 From: Taylor Blau Date: Tue, 24 Aug 2021 12:15:59 -0400 Subject: Documentation: describe MIDX-based bitmaps Update the technical documentation to describe the multi-pack bitmap format. This patch merely introduces the new format, and describes its high-level ideas. Git does not yet know how to read nor write these multi-pack variants, and so the subsequent patches will: - Introduce code to interpret multi-pack bitmaps, according to this document. - Then, introduce code to write multi-pack bitmaps from the 'git multi-pack-index write' sub-command. Finally, the implementation will gain tests in subsequent patches (as opposed to inline with the patch teaching Git how to write multi-pack bitmaps) to avoid a cyclic dependency. Signed-off-by: Taylor Blau Signed-off-by: Junio C Hamano --- Documentation/technical/bitmap-format.txt | 71 ++++++++++++++++++++++------ Documentation/technical/multi-pack-index.txt | 10 ++-- 2 files changed, 60 insertions(+), 21 deletions(-) (limited to 'Documentation') diff --git a/Documentation/technical/bitmap-format.txt b/Documentation/technical/bitmap-format.txt index f8c18a0f7a..04b3ec2178 100644 --- a/Documentation/technical/bitmap-format.txt +++ b/Documentation/technical/bitmap-format.txt @@ -1,6 +1,44 @@ GIT bitmap v1 format ==================== +== Pack and multi-pack bitmaps + +Bitmaps store reachability information about the set of objects in a packfile, +or a multi-pack index (MIDX). The former is defined obviously, and the latter is +defined as the union of objects in packs contained in the MIDX. + +A bitmap may belong to either one pack, or the repository's multi-pack index (if +it exists). A repository may have at most one bitmap. + +An object is uniquely described by its bit position within a bitmap: + + - If the bitmap belongs to a packfile, the __n__th bit corresponds to + the __n__th object in pack order. For a function `offset` which maps + objects to their byte offset within a pack, pack order is defined as + follows: + + o1 <= o2 <==> offset(o1) <= offset(o2) + + - If the bitmap belongs to a MIDX, the __n__th bit corresponds to the + __n__th object in MIDX order. With an additional function `pack` which + maps objects to the pack they were selected from by the MIDX, MIDX order + is defined as follows: + + o1 <= o2 <==> pack(o1) <= pack(o2) /\ offset(o1) <= offset(o2) + + The ordering between packs is done according to the MIDX's .rev file. + Notably, the preferred pack sorts ahead of all other packs. + +The on-disk representation (described below) of a bitmap is the same regardless +of whether or not that bitmap belongs to a packfile or a MIDX. The only +difference is the interpretation of the bits, which is described above. + +Certain bitmap extensions are supported (see: Appendix B). No extensions are +required for bitmaps corresponding to packfiles. For bitmaps that correspond to +MIDXs, both the bit-cache and rev-cache extensions are required. + +== On-disk format + - A header appears at the beginning: 4-byte signature: {'B', 'I', 'T', 'M'} @@ -14,17 +52,19 @@ GIT bitmap v1 format The following flags are supported: - BITMAP_OPT_FULL_DAG (0x1) REQUIRED - This flag must always be present. It implies that the bitmap - index has been generated for a packfile with full closure - (i.e. where every single object in the packfile can find - its parent links inside the same packfile). This is a - requirement for the bitmap index format, also present in JGit, - that greatly reduces the complexity of the implementation. + This flag must always be present. It implies that the + bitmap index has been generated for a packfile or + multi-pack index (MIDX) with full closure (i.e. where + every single object in the packfile/MIDX can find its + parent links inside the same packfile/MIDX). This is a + requirement for the bitmap index format, also present in + JGit, that greatly reduces the complexity of the + implementation. - BITMAP_OPT_HASH_CACHE (0x4) If present, the end of the bitmap file contains `N` 32-bit name-hash values, one per object in the - pack. The format and meaning of the name-hash is + pack/MIDX. The format and meaning of the name-hash is described below. 4-byte entry count (network byte order) @@ -33,7 +73,8 @@ GIT bitmap v1 format 20-byte checksum - The SHA1 checksum of the pack this bitmap index belongs to. + The SHA1 checksum of the pack/MIDX this bitmap index + belongs to. - 4 EWAH bitmaps that act as type indexes @@ -50,7 +91,7 @@ GIT bitmap v1 format - Tags In each bitmap, the `n`th bit is set to true if the `n`th object - in the packfile is of that type. + in the packfile or multi-pack index is of that type. The obvious consequence is that the OR of all 4 bitmaps will result in a full set (all bits set), and the AND of all 4 bitmaps will @@ -62,8 +103,9 @@ GIT bitmap v1 format Each entry contains the following: - 4-byte object position (network byte order) - The position **in the index for the packfile** where the - bitmap for this commit is found. + The position **in the index for the packfile or + multi-pack index** where the bitmap for this commit is + found. - 1-byte XOR-offset The xor offset used to compress this bitmap. For an entry @@ -146,10 +188,11 @@ Name-hash cache --------------- If the BITMAP_OPT_HASH_CACHE flag is set, the end of the bitmap contains -a cache of 32-bit values, one per object in the pack. The value at +a cache of 32-bit values, one per object in the pack/MIDX. The value at position `i` is the hash of the pathname at which the `i`th object -(counting in index order) in the pack can be found. This can be fed -into the delta heuristics to compare objects with similar pathnames. +(counting in index or multi-pack index order) in the pack/MIDX can be found. +This can be fed into the delta heuristics to compare objects with similar +pathnames. The hash algorithm used is: diff --git a/Documentation/technical/multi-pack-index.txt b/Documentation/technical/multi-pack-index.txt index fb688976c4..1a73c3ee20 100644 --- a/Documentation/technical/multi-pack-index.txt +++ b/Documentation/technical/multi-pack-index.txt @@ -71,14 +71,10 @@ Future Work still reducing the number of binary searches required for object lookups. -- The reachability bitmap is currently paired directly with a single - packfile, using the pack-order as the object order to hopefully - compress the bitmaps well using run-length encoding. This could be - extended to pair a reachability bitmap with a multi-pack-index. If - the multi-pack-index is extended to store a "stable object order" +- If the multi-pack-index is extended to store a "stable object order" (a function Order(hash) = integer that is constant for a given hash, - even as the multi-pack-index is updated) then a reachability bitmap - could point to a multi-pack-index and be updated independently. + even as the multi-pack-index is updated) then MIDX bitmaps could be + updated independently of the MIDX. - Packfiles can be marked as "special" using empty files that share the initial name but replace ".pack" with ".keep" or ".promisor". -- cgit v1.2.3 From 5d3cd09a80819009636ae85c5171d9a4057eb372 Mon Sep 17 00:00:00 2001 From: Taylor Blau Date: Tue, 31 Aug 2021 16:52:02 -0400 Subject: midx: reject empty `--preferred-pack`'s The soon-to-be-implemented multi-pack bitmap treats object in the first bit position specially by assuming that all objects in the pack it was selected from are also represented from that pack in the MIDX. In other words, the pack from which the first object was selected must also have all of its other objects selected from that same pack in the MIDX in case of any duplicates. But this assumption relies on the fact that there is at least one object in that pack to begin with; otherwise the object in the first bit position isn't from a preferred pack, in which case we can no longer assume that all objects in that pack were also selected from the same pack. Guard this assumption by checking the number of objects in the given preferred pack, and failing if the given pack is empty. To make sure we can safely perform this check, open any packs which are contained in an existing MIDX via prepare_midx_pack(). The same is done for new packs via the add_pack_to_midx() callback, but packs picked up from a previous MIDX will not yet have these opened. Signed-off-by: Taylor Blau Signed-off-by: Junio C Hamano --- Documentation/git-multi-pack-index.txt | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) (limited to 'Documentation') diff --git a/Documentation/git-multi-pack-index.txt b/Documentation/git-multi-pack-index.txt index ffd601bc17..c9b063d31e 100644 --- a/Documentation/git-multi-pack-index.txt +++ b/Documentation/git-multi-pack-index.txt @@ -37,9 +37,9 @@ write:: -- --preferred-pack=:: Optionally specify the tie-breaking pack used when - multiple packs contain the same object. If not given, - ties are broken in favor of the pack with the lowest - mtime. + multiple packs contain the same object. `` must + contain at least one object. If not given, ties are + broken in favor of the pack with the lowest mtime. -- verify:: -- cgit v1.2.3 From f57a739691e2efa1d71851cd725154a5a760d8f4 Mon Sep 17 00:00:00 2001 From: Taylor Blau Date: Wed, 1 Sep 2021 16:34:01 -0400 Subject: midx: avoid opening multiple MIDXs when writing Opening multiple instance of the same MIDX can lead to problems like two separate packed_git structures which represent the same pack being added to the repository's object store. The above scenario can happen because prepare_midx_pack() checks if `m->packs[pack_int_id]` is NULL in order to determine if a pack has been opened and installed in the repository before. But a caller can construct two copies of the same MIDX by calling get_multi_pack_index() and load_multi_pack_index() since the former manipulates the object store directly but the latter is a lower-level routine which allocates a new MIDX for each call. So if prepare_midx_pack() is called on multiple MIDXs with the same pack_int_id, then that pack will be installed twice in the object store's packed_git pointer. This can lead to problems in, for e.g., the pack-bitmap code, which does something like the following (in pack-bitmap.c:open_pack_bitmap()): struct bitmap_index *bitmap_git = ...; for (p = get_all_packs(r); p; p = p->next) { if (open_pack_bitmap_1(bitmap_git, p) == 0) ret = 0; } which is a problem if two copies of the same pack exist in the packed_git list because pack-bitmap.c:open_pack_bitmap_1() contains a conditional like the following: if (bitmap_git->pack || bitmap_git->midx) { /* ignore extra bitmap file; we can only handle one */ warning("ignoring extra bitmap file: %s", packfile->pack_name); close(fd); return -1; } Avoid this scenario by not letting write_midx_internal() open a MIDX that isn't also pointed at by the object store. So long as this is the case, other routines should prefer to open MIDXs with get_multi_pack_index() or reprepare_packed_git() instead of creating instances on their own. Because get_multi_pack_index() returns `r->object_store->multi_pack_index` if it is non-NULL, we'll only have one instance of a MIDX open at one time, avoiding these problems. To encourage this, drop the `struct multi_pack_index *` parameter from `write_midx_internal()`, and rely instead on the `object_dir` to find (or initialize) the correct MIDX instance. Likewise, replace the call to `close_midx()` with `close_object_store()`, since we're about to replace the MIDX with a new one and should invalidate the object store's memory of any MIDX that might have existed beforehand. Note that this now forbids passing object directories that don't belong to alternate repositories over `--object-dir`, since before we would have happily opened a MIDX in any directory, but now restrict ourselves to only those reachable by `r->objects->multi_pack_index` (and alternate MIDXs that we can see by walking the `next` pointer). As far as I can tell, supporting arbitrary directories with `--object-dir` was a historical accident, since even the documentation says `` when referring to the value passed to this option. Signed-off-by: Taylor Blau Signed-off-by: Junio C Hamano --- Documentation/git-multi-pack-index.txt | 2 ++ 1 file changed, 2 insertions(+) (limited to 'Documentation') diff --git a/Documentation/git-multi-pack-index.txt b/Documentation/git-multi-pack-index.txt index c9b063d31e..0af6beb2dd 100644 --- a/Documentation/git-multi-pack-index.txt +++ b/Documentation/git-multi-pack-index.txt @@ -23,6 +23,8 @@ OPTIONS Use given directory for the location of Git objects. We check `/packs/multi-pack-index` for the current MIDX file, and `/packs` for the pack-files to index. ++ +`` must be an alternate of the current repository. --[no-]progress:: Turn progress on/off explicitly. If neither is specified, progress is -- cgit v1.2.3 From c528e179662ce1353c51f65e1b5895d02988c78c Mon Sep 17 00:00:00 2001 From: Taylor Blau Date: Tue, 31 Aug 2021 16:52:24 -0400 Subject: pack-bitmap: write multi-pack bitmaps Write multi-pack bitmaps in the format described by Documentation/technical/bitmap-format.txt, inferring their presence with the absence of '--bitmap'. To write a multi-pack bitmap, this patch attempts to reuse as much of the existing machinery from pack-objects as possible. Specifically, the MIDX code prepares a packing_data struct that pretends as if a single packfile has been generated containing all of the objects contained within the MIDX. Signed-off-by: Taylor Blau Signed-off-by: Junio C Hamano --- Documentation/git-multi-pack-index.txt | 12 +++++++++++- 1 file changed, 11 insertions(+), 1 deletion(-) (limited to 'Documentation') diff --git a/Documentation/git-multi-pack-index.txt b/Documentation/git-multi-pack-index.txt index 0af6beb2dd..a9df3dbd32 100644 --- a/Documentation/git-multi-pack-index.txt +++ b/Documentation/git-multi-pack-index.txt @@ -10,7 +10,7 @@ SYNOPSIS -------- [verse] 'git multi-pack-index' [--object-dir=] [--[no-]progress] - [--preferred-pack=] + [--preferred-pack=] [--[no-]bitmap] DESCRIPTION ----------- @@ -42,6 +42,9 @@ write:: multiple packs contain the same object. `` must contain at least one object. If not given, ties are broken in favor of the pack with the lowest mtime. + + --[no-]bitmap:: + Control whether or not a multi-pack bitmap is written. -- verify:: @@ -83,6 +86,13 @@ EXAMPLES $ git multi-pack-index write ----------------------------------------------- +* Write a MIDX file for the packfiles in the current .git folder with a +corresponding bitmap. ++ +------------------------------------------------------------- +$ git multi-pack-index write --preferred-pack= --bitmap +------------------------------------------------------------- + * Write a MIDX file for the packfiles in an alternate object store. + ----------------------------------------------- -- cgit v1.2.3