summaryrefslogtreecommitdiff
path: root/pack-bitmap.c
AgeCommit message (Collapse)AuthorFilesLines
2022-01-27pack-bitmap.c: gracefully fallback after opening pack/MIDXLibravatar Taylor Blau1-0/+4
When opening a MIDX/pack-bitmap, we call open_midx_bitmap_1() or open_pack_bitmap_1() respectively in a loop over the set of MIDXs/packs. By design, these functions are supposed to be called over every pack and MIDX, since only one of them should have a valid bitmap. Ordinarily we return '0' from these two functions in order to indicate that we successfully loaded a bitmap To signal that we couldn't load a bitmap corresponding to the MIDX/pack (either because one doesn't exist, or because there was an error with loading it), we can return '-1'. In either case, the callers each enumerate all MIDXs/packs to ensure that at most one bitmap per-kind is present. But when we fail to load a bitmap that does exist (for example, loading a MIDX bitmap without finding a corresponding reverse index), we'll return -1 but leave the 'midx' field non-NULL. So when we fallback to loading a pack bitmap, we'll complain that the bitmap we're trying to populate already is "opened", even though it isn't. Rectify this by setting the '->pack' and '->midx' field back to NULL as appropriate. Two tests are added: one to ensure that the MIDX-to-pack bitmap fallback works, and another to ensure we still complain when there are multiple pack bitmaps in a repository. Signed-off-by: Taylor Blau <me@ttaylorr.com> Reviewed-by: Derrick Stolee <dstolee@microsoft.com> Reviewed-by: Jonathan Tan <jonathantanmy@google.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-12-10Merge branch 'jk/test-bitmap-fix'Libravatar Junio C Hamano1-1/+1
Tighten code for testing pack-bitmap. * jk/test-bitmap-fix: test_bitmap_hashes(): handle repository without bitmaps
2021-11-05test_bitmap_hashes(): handle repository without bitmapsLibravatar Jeff King1-1/+1
If prepare_bitmap_git() returns NULL (one easy-to-trigger cause being that the repository does not have bitmaps at all), then we'll segfault accessing bitmap_git->hashes: $ t/helper/test-tool bitmap dump-hashes Segmentation fault We should treat this the same as a repository with bitmaps but no name-hashes, and quietly produce an empty output. The later call to free_bitmap_index() in the cleanup label is OK, as it treats a NULL pointer as a noop. This isn't a big deal in practice, as this function is intended for and used only by test-tool. It's probably worth fixing to avoid confusion, but not worth adding coverage for this to the test suite. Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-10-28pack-bitmap.c: more aggressively free in free_bitmap_index()Libravatar Taylor Blau1-0/+8
The function free_bitmap_index() is somewhat lax in what it frees. There are two notable examples: - While it does call kh_destroy_oid_map on the "bitmaps" map, which maps commit OIDs to their corresponding bitmaps, the bitmaps themselves are not freed. Note here that we recycle already-freed ewah_bitmaps into a pool, but these are handled correctly by ewah_pool_free(). - We never bother to free the extended index's "positions" map, which we always allocate in load_bitmap(). Fix both of these. Signed-off-by: Taylor Blau <me@ttaylorr.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-10-28pack-bitmap.c: don't leak type-level bitmapsLibravatar Taylor Blau1-0/+6
test_bitmap_walk() is used to implement `git rev-list --test-bitmap`, which compares the result of the on-disk bitmaps with ones generated on-the-fly during a revision walk. In fa95666a40 (pack-bitmap.c: harden 'test_bitmap_walk()' to check type bitmaps, 2021-08-24), we hardened those tests to also check the four special type-level bitmaps, but never freed those bitmaps. We should have, since each required an allocation when we EWAH-decompressed them. Free those, plugging that leak, and also free the base (the scratch-pad bitmap), too. Signed-off-by: Taylor Blau <me@ttaylorr.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-10-28midx.c: write MIDX filenames to strbufLibravatar Taylor Blau1-5/+10
To ask for the name of a MIDX and its corresponding .rev file, callers invoke get_midx_filename() and get_midx_rev_filename(), respectively. These both invoke xstrfmt(), allocating a chunk of memory which must be freed later on. This makes callers in pack-bitmap.c somewhat awkward. Specifically, midx_bitmap_filename(), which is implemented like: return xstrfmt("%s-%s.bitmap", get_midx_filename(midx->object_dir), hash_to_hex(get_midx_checksum(midx))); this leaks the second argument to xstrfmt(), which itself was allocated with xstrfmt(). This caller could assign both the result of get_midx_filename() and the outer xstrfmt() to a temporary variable, remembering to free() the former before returning. But that involves a wasteful copy. Instead, get_midx_filename() and get_midx_rev_filename() take a strbuf as an output parameter. This way midx_bitmap_filename() can manipulate and pass around a temporary buffer which it detaches back to its caller. That allows us to implement the function without copying or open-coding get_midx_filename() in a way that doesn't leak. Update the other callers of get_midx_filename() and get_midx_rev_filename() accordingly. Signed-off-by: Taylor Blau <me@ttaylorr.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-10-18Merge branch 'tb/repack-write-midx'Libravatar Junio C Hamano1-1/+1
"git repack" has been taught to generate multi-pack reachability bitmaps. * tb/repack-write-midx: test-read-midx: fix leak of bitmap_index struct builtin/repack.c: pass `--refs-snapshot` when writing bitmaps builtin/repack.c: make largest pack preferred builtin/repack.c: support writing a MIDX while repacking builtin/repack.c: extract showing progress to a variable builtin/repack.c: rename variables that deal with non-kept packs builtin/repack.c: keep track of existing packs unconditionally midx: preliminary support for `--refs-snapshot` builtin/multi-pack-index.c: support `--stdin-packs` mode midx: expose `write_midx_file_only()` publicly
2021-09-28builtin/repack.c: make largest pack preferredLibravatar Taylor Blau1-1/+1
When repacking into a geometric series and writing a multi-pack bitmap, it is beneficial to have the largest resulting pack be the preferred object source in the bitmap's MIDX, since selecting the large packs can lead to fewer broken delta chains and better compression. Teach 'git repack' to identify this pack and pass it to the MIDX write machinery in order to mark it as preferred. Signed-off-by: Taylor Blau <me@ttaylorr.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-09-14pack-bitmap.c: propagate namehash values from existing bitmapsLibravatar Taylor Blau1-6/+8
When an old bitmap exists while writing a new one, we load it and build a "reposition" table which maps bit positions of objects from the old bitmap to their respective positions in the new bitmap. This can help when we encounter a commit which was selected in both the old and new bitmap, since we only need to permute its bit (not recompute it from scratch). We do not, however, repurpose existing namehash values in the case of the hash-cache extension. There has been thus far no good reason to do so, since all of the namehash values for objects in the new bitmap would be populated during the traversal that was just performed by pack-objects when generating single-pack reachability bitmaps. But this isn't the case for multi-pack bitmaps, which are written via `git multi-pack-index write --bitmap` and do not perform any traversal. In this case all namehash values are set to zero, but we don't even bother to check the `pack.writeBitmapHashcache` option anyway, so it fails to matter. There are two approaches we could take to fill in non-zero hash-cache values: - have either the multi-pack-index builtin run its own traversal to attempt to fill in some values, or let a hypothetical caller (like `pack-objects` when `repack` eventually drives the `multi-pack-index` builtin) fill in the values they found during their traversal - or copy any existing namehash values that were stored in an existing bitmap to their corresponding positions in the new bitmap In a system where a repository is generally repacked with `git repack --geometric=<d>` and occasionally repacked with `git repack -a`, the hash-cache coverage will tend towards all objects. Since populating the hash-cache is additive (i.e., doing so only helps our delta search), any intermediate lack of full coverage is just fine. So let's start by just propagating any values from the existing hash-cache if we see one. The next patch will respect the `pack.writeBitmapHashcache` option while writing MIDX bitmaps, and then test this new behavior. Signed-off-by: Taylor Blau <me@ttaylorr.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-09-14t/helper/test-bitmap.c: add 'dump-hashes' modeLibravatar Taylor Blau1-0/+27
The pack-bitmap writer code is about to learn how to propagate values from an existing hash-cache. To prepare, teach the test-bitmap helper to dump the values from a bitmap's hash-cache extension in order to test those changes. Signed-off-by: Taylor Blau <me@ttaylorr.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-09-09pack-bitmap: drop bitmap_index argument from try_partial_reuse()Libravatar Jeff King1-3/+2
Starting in commit 0f533c7284 (pack-bitmap: read multi-pack bitmaps, 2021-08-31), we no longer look at the "struct bitmap_index" passed to try_partial_reuse(). This is because we only handle verbatim reuse from a single pack: either the pack whose bitmap we're looking at, or the "preferred" pack of a midx bitmap. And thus the primary item we look at is the "pack" parameter added by that same commit, and not the bitmap_git->pack parameter (which would be NULL for a midx bitmap). It's our caller, reuse_partial_packfile_from_bitmap(), which decides which pack to use and passes it in to us. Drop the unused parameter to prevent confusion. Signed-off-by: Jeff King <peff@peff.net> Reviewed-by: Taylor Blau <me@ttaylorr.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-09-09pack-bitmap: drop repository argument from prepare_midx_bitmap_git()Libravatar Jeff King1-2/+1
We never look at the repository argument which is passed. This makes sense, since the multi_pack_index struct already tells us everything we need to access the files in its associated object directory. Signed-off-by: Jeff King <peff@peff.net> Reviewed-by: Taylor Blau <me@ttaylorr.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-09-01pack-bitmap: read multi-pack bitmapsLibravatar Taylor Blau1-38/+319
This prepares the code in pack-bitmap to interpret the new multi-pack bitmaps described in Documentation/technical/bitmap-format.txt, which mostly involves converting bit positions to accommodate looking them up in a MIDX. Note that there are currently no writers who write multi-pack bitmaps, and that this will be implemented in the subsequent commit. Note also that get_midx_checksum() and get_midx_filename() are made non-static so they can be called from pack-bitmap.c. Signed-off-by: Taylor Blau <me@ttaylorr.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-09-01pack-bitmap.c: avoid redundant calls to try_partial_reuseLibravatar Taylor Blau1-11/+29
try_partial_reuse() is used to mark any bits in the beginning of a bitmap whose objects can be reused verbatim from the pack they came from. Currently this function returns void, and signals nothing to the caller when bits could not be reused. But multi-pack bitmaps would benefit from having such a signal, because they may try to pass objects which are in bounds, but from a pack other than the preferred one. Any extra calls are noops because of a conditional in reuse_partial_packfile_from_bitmap(), but those loop iterations can be avoided by letting try_partial_reuse() indicate when it can't accept any more bits for reuse, and then listening to that signal. Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Taylor Blau <me@ttaylorr.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-09-01pack-bitmap.c: introduce 'bitmap_is_preferred_refname()'Libravatar Taylor Blau1-0/+16
In a recent commit, pack-objects learned support for the 'pack.preferBitmapTips' configuration. This patch prepares the multi-pack bitmap code to respect this configuration, too. The yet-to-be implemented code will find that it is more efficient to check whether each reference contains a prefix found in the configured set of values rather than doing an additional traversal. Implement a function 'bitmap_is_preferred_refname()' which will perform that check. Its caller will be added in a subsequent patch. Signed-off-by: Taylor Blau <me@ttaylorr.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-09-01pack-bitmap.c: introduce 'nth_bitmap_object_oid()'Libravatar Taylor Blau1-3/+10
A subsequent patch to support reading MIDX bitmaps will be less noisy after extracting a generic function to fetch the nth OID contained in the bitmap. Signed-off-by: Taylor Blau <me@ttaylorr.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-09-01pack-bitmap.c: introduce 'bitmap_num_objects()'Libravatar Taylor Blau1-16/+21
A subsequent patch to support reading MIDX bitmaps will be less noisy after extracting a generic function to return how many objects are contained in a bitmap. Signed-off-by: Taylor Blau <me@ttaylorr.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-08-24pack-bitmap.c: harden 'test_bitmap_walk()' to check type bitmapsLibravatar Taylor Blau1-0/+48
The special `--test-bitmap` mode of `git rev-list` is used to compare the result of an object traversal with a bitmap to check its integrity. This mode does not, however, assert that the types of reachable objects are stored correctly. Harden this mode by teaching it to also check that each time an object's bit is marked, the corresponding bit should be set in exactly one of the type bitmaps (whose type matches the object's true type). Co-authored-by: Jeff King <peff@peff.net> Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Taylor Blau <me@ttaylorr.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-08-02Merge branch 'jk/check-pack-valid-before-opening-bitmap'Libravatar Junio C Hamano1-0/+5
A race between repacking and using pack bitmaps has been corrected. * jk/check-pack-valid-before-opening-bitmap: pack-bitmap: check pack validity when opening bitmap
2021-08-02Merge branch 'tb/bitmap-type-filter-comment-fix'Libravatar Junio C Hamano1-5/+6
In-code comment update. * tb/bitmap-type-filter-comment-fix: pack-bitmap: clarify comment in filter_bitmap_exclude_type()
2021-07-23pack-bitmap: check pack validity when opening bitmapLibravatar Jeff King1-0/+5
When pack-objects adds an entry to its list of objects to pack, it may mark the packfile and offset that contains the file, which we can later use to output the object verbatim. If the packfile is deleted while we are running (e.g., by another process running "git repack"), we may die in use_pack() if the pack file cannot be opened. We worked around this in 4c08018204 (pack-objects: protect against disappearing packs, 2011-10-14) by making sure we can open the pack before recording it as a source. This detects a pack which has already disappeared while generating the packing list, and because we keep the pack's file descriptor (or an mmap window) open, it means we can access it later (unless you exceed core.packedgitlimit). The bitmap code that was added later does not do this; it adds entries to the packlist without checking that the packfile is still valid, and is vulnerable to this race. It needs the same treatment as 4c08018204. However, rather than add it in just that one spot, it makes more sense to simply open and check the packfile when we open the bitmap. Technically you can use the .bitmap without even looking in the .pack file (e.g., if you are just printing a list of objects without accessing them), but it's much simpler to do it early. That covers all later direct uses of the pack (due to the cached descriptor) without having to check each one directly. For example, in pack-objects we need to protect the packlist entries, but we also access the pack directly as part of the reuse_partial_pack_from_bitmap() feature. This patch covers both cases. There's no test here, because the problem is inherently racy. I reproduced and verified the fix with this script: rm -rf parent.git push.git fetch.git push() { ( cd push.git && echo content >>file && git add file && git commit -qm "change $1" && git push -q origin HEAD && echo "push $1..." ) && ( cd parent.git && git repack -ad -q && echo "repack $1..." ) } fetch() { rm -rf fetch.git && git clone -q file://$PWD/parent.git fetch.git && echo "fetch $1..." } git init --bare parent.git && git --git-dir=parent.git config transfer.unpacklimit 1 && git clone parent.git push.git && (for i in `seq 1 1000`; do push $i || break; done) & pusher=$! (for i in `seq 1 1000`; do fetch $i || break; done) & fetcher=$! wait $fetcher kill $pusher That simulates a race between a client cloning and a push triggering a repack on the server. Without this patch, it generally fails within a couple hundred iterations with: remote: fatal: packfile ./objects/pack/.tmp-1377349-pack-498afdec371232bdb99d1757872f5569331da61e.pack cannot be accessed error: git upload-pack: git-pack-objects died with error. fatal: git upload-pack: aborting due to possible repository corruption on the remote side. remote: aborting due to possible repository corruption on the remote side. fatal: early EOF fatal: fetch-pack: invalid index-pack output With this patch, it reliably runs through all thousand attempts. Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-07-20pack-bitmap: clarify comment in filter_bitmap_exclude_type()Libravatar Taylor Blau1-5/+6
The code that eventually became filter_bitmap_exclude_type() was originally introduced in 4f3bd5606a (pack-bitmap: implement BLOB_NONE filtering, 2020-02-14) to accelerate BLOB_NONE filters with bitmaps. In 856e12c18a (pack-bitmap.c: make object filtering functions generic, 2020-05-04), it became filter_bitmap_exclude_type(). But not all of the comments were updated to be agnostic to the provided type. Remove the remaining comments which should have been updated in 856e12c18a to reflect the type-agnostic nature of the function. Signed-off-by: Taylor Blau <me@ttaylorr.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-06-15bitmaps: don't recurse into trees already in the bitmapLibravatar Jeff King1-0/+18
If an object is already mentioned in a reachability bitmap we are building, then by definition so are all of the objects it can reach. We have an optimization to stop traversing commits when we see they are already in the bitmap, but we don't do the same for trees. It's generally unavoidable to recurse into trees for commits not yet covered by bitmaps (since most commits generally do have unique top-level trees). But they usually have subtrees that are shared with other commits (i.e., all of the subtrees the commit _didn't_ touch). And some of those commits (and their trees) may be covered by the bitmap. Usually this isn't _too_ big a deal, because we'll visit those subtrees only once in total for the whole walk. But if you have a large number of unbitmapped commits, and if your tree is big, then you may end up opening a lot of sub-trees for no good reason. We can use the same optimization we do for commits here: when we are about to open a tree, see if it's in the bitmap (either the one we are building, or the "seen" bitmap which covers the UNINTERESTING side of the bitmap when doing a set-difference). This works especially well because we'll visit all commits before hitting any trees. So even in a history like: A -- B if "A" has a bitmap on disk but "B" doesn't, we'll already have OR-ed in the results from A before looking at B's tree (so we really will only look at trees touched by B). For most repositories, the timings produced by p5310 are unspectacular. Here's linux.git: Test HEAD^ HEAD -------------------------------------------------------------------- 5310.4: simulated clone 6.00(5.90+0.10) 5.98(5.90+0.08) -0.3% 5310.5: simulated fetch 2.98(5.45+0.18) 2.85(5.31+0.18) -4.4% 5310.7: rev-list (commits) 0.32(0.29+0.03) 0.33(0.30+0.03) +3.1% 5310.8: rev-list (objects) 1.48(1.44+0.03) 1.49(1.44+0.05) +0.7% Any improvement there is within the noise (the +3.1% on test 7 has to be noise, since we are not recursing into trees, and thus the new code isn't even run). The results for git.git are likewise uninteresting. But here are numbers from some other real-world repositories (that are not public). This one's tree is comparable in size to linux.git, but has ~16k refs (and so less complete bitmap coverage): Test HEAD^ HEAD ------------------------------------------------------------------------- 5310.4: simulated clone 38.34(39.86+0.74) 33.95(35.53+0.76) -11.5% 5310.5: simulated fetch 2.29(6.31+0.35) 2.20(5.97+0.41) -3.9% 5310.7: rev-list (commits) 0.99(0.86+0.13) 0.96(0.85+0.11) -3.0% 5310.8: rev-list (objects) 11.32(11.04+0.27) 6.59(6.37+0.21) -41.8% And here's another with a very large tree (~340k entries), and a fairly large number of refs (~10k): Test HEAD^ HEAD ------------------------------------------------------------------------- 5310.3: simulated clone 53.83(54.71+1.54) 39.77(40.76+1.50) -26.1% 5310.4: simulated fetch 19.91(20.11+0.56) 19.79(19.98+0.67) -0.6% 5310.6: rev-list (commits) 0.54(0.44+0.11) 0.51(0.43+0.07) -5.6% 5310.7: rev-list (objects) 24.32(23.59+0.73) 9.85(9.49+0.36) -59.5% This patch provides substantial improvements in these larger cases, and have any drawbacks for smaller ones (the cost of the bitmap check is quite small compared to an actual tree traversal). Note that we have to add a version of revision.c's include_check callback which handles non-commits. We could possibly consolidate this into a single callback for all objects types, as there's only one user of the feature which would need converted (pack-bitmap.c:should_include). That would in theory let us avoid duplicating any logic. But when I tried it, the code ended up much worse to read, with lots of repeated "if it's a commit do this, otherwise do that". Having two separate callbacks splits that naturally, and matches the existing split of show_commit/show_object callbacks. Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-05-07Merge branch 'ps/rev-list-object-type-filter'Libravatar Junio C Hamano1-5/+40
"git rev-list" learns the "--filter=object:type=<type>" option, which can be used to exclude objects of the given kind from the packfile generated by pack-objects. * ps/rev-list-object-type-filter: rev-list: allow filtering of provided items pack-bitmap: implement combined filter pack-bitmap: implement object type filter list-objects: implement object type filter list-objects: support filtering by tag and commit list-objects: move tag processing into its own function revision: mark commit parents as NOT_USER_GIVEN uploadpack.txt: document implication of `uploadpackfilter.allow`
2021-05-07Merge branch 'jk/prune-with-bitmap-fix'Libravatar Junio C Hamano1-0/+3
When the reachability bitmap is in effect, the "do not lose recently created objects and those that are reachable from them" safety to protect us from races were disabled by mistake, which has been corrected. * jk/prune-with-bitmap-fix: prune: save reachable-from-recent objects with bitmaps pack-bitmap: clean up include_check after use
2021-04-29pack-bitmap: clean up include_check after useLibravatar Jeff King1-0/+3
When a bitmap walk has to traverse (to fill in non-bitmapped objects), we use rev_info's include_check mechanism to let us stop the traversal early. But after setting the function and its data parameter, we never clean it up. This means that if the rev_info is used for a subsequent traversal without bitmaps, it will unexpectedly call into our include_check function (worse, it will do so pointing to a now-defunct stack variable in include_check_data, likely resulting in a segfault). There's no code which does this now, but it's an accident waiting to happen. Let's clean up after ourselves in the bitmap code. Reported-by: David Emett <dave@sp4m.net> Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-04-19rev-list: allow filtering of provided itemsLibravatar Patrick Steinhardt1-2/+4
When providing an object filter, it is currently impossible to also filter provided items. E.g. when executing `git rev-list HEAD` , the commit this reference points to will be treated as user-provided and is thus excluded from the filtering mechanism. This makes it harder than necessary to properly use the new `--filter=object:type` filter given that even if the user wants to only see blobs, he'll still see commits of provided references. Improve this by introducing a new `--filter-provided-objects` option to the git-rev-parse(1) command. If given, then all user-provided references will be subject to filtering. Signed-off-by: Patrick Steinhardt <ps@pks.im> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-04-19pack-bitmap: implement combined filterLibravatar Patrick Steinhardt1-0/+10
When the user has multiple objects filters specified, then this is internally represented by having a "combined" filter. These combined filters aren't yet supported by bitmap indices and can thus not be accelerated. Fix this by implementing support for these combined filters. The implementation is quite trivial: when there's a combined filter, we simply recurse into `filter_bitmap()` for all of the sub-filters. Signed-off-by: Patrick Steinhardt <ps@pks.im> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-04-19pack-bitmap: implement object type filterLibravatar Patrick Steinhardt1-3/+26
The preceding commit has added a new object filter for git-rev-list(1) which allows to filter objects by type. Implement the equivalent filter for packfile bitmaps so that we can answer these queries fast. Signed-off-by: Patrick Steinhardt <ps@pks.im> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-04-13Merge branch 'tb/pack-preferred-tips-to-give-bitmap'Libravatar Junio C Hamano1-0/+24
A configuration variable has been added to force tips of certain refs to be given a reachability bitmap. * tb/pack-preferred-tips-to-give-bitmap: builtin/pack-objects.c: respect 'pack.preferBitmapTips' t/helper/test-bitmap.c: initial commit pack-bitmap: add 'test_bitmap_commits()' helper
2021-04-07Merge branch 'ps/pack-bitmap-optim'Libravatar Junio C Hamano1-0/+1
Optimize "rev-list --use-bitmap-index --objects" corner case that uses negative tags as the stopping points. * ps/pack-bitmap-optim: pack-bitmap: avoid traversal of objects referenced by uninteresting tag
2021-03-31builtin/pack-objects.c: respect 'pack.preferBitmapTips'Libravatar Taylor Blau1-0/+6
When writing a new pack with a bitmap, it is sometimes convenient to indicate some reference prefixes which should receive priority when selecting which commits to receive bitmaps. A truly motivated caller could accomplish this by setting 'pack.islandCore', (since all commits in the core island are similarly marked as preferred) but this requires callers to opt into using delta islands, which they may or may not want to do. Introduce a new multi-valued configuration, 'pack.preferBitmapTips' to allow callers to specify a list of reference prefixes. All references which have a prefix contained in 'pack.preferBitmapTips' will mark their tips as "preferred" in the same way as commits are marked as preferred for selection by 'pack.islandCore'. The choice of the verb "prefer" is intentional: marking the NEEDS_BITMAP flag on an object does *not* guarantee that that object will receive a bitmap. It merely guarantees that that commit will receive a bitmap over any *other* commit in the same window by bitmap_writer_select_commits(). The test this patch adds reflects this quirk, too. It only tests that a commit (which didn't receive bitmaps by default) is selected for bitmaps after changing the value of 'pack.preferBitmapTips' to include it. Other commits may lose their bitmaps as a byproduct of how the selection process works (bitmap_writer_select_commits() ignores the remainder of a window after seeing a commit with the NEEDS_BITMAP flag). This configuration will aide in selecting important references for multi-pack bitmaps, since they do not respect the same pack.islandCore configuration. (They could, but doing so may be confusing, since it is packs--not bitmaps--which are influenced by the delta-islands configuration). In a fork network repository (one which lists all forks of a given repository as remotes), for example, it is useful to set pack.preferBitmapTips to 'refs/remotes/<root>/heads' and 'refs/remotes/<root>/tags', where '<root>' is an opaque identifier referring to the repository which is at the base of the fork chain. Suggested-by: Jeff King <peff@peff.net> Signed-off-by: Taylor Blau <me@ttaylorr.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-03-31pack-bitmap: add 'test_bitmap_commits()' helperLibravatar Taylor Blau1-0/+18
The next patch will add a 'bitmap' test-tool which prints the list of commits that have bitmaps computed. The test helper could implement this itself, but it would need access to the 'bitmaps' field of the 'pack_bitmap' struct. To avoid exposing this private detail, implement the entirety of the helper behind a test_bitmap_commits() function in pack-bitmap.c. There is some precedence for this with test_bitmap_walk() which is used to implement the '--test-bitmap' flag in 'git rev-list' (and is also implemented in pack-bitmap.c). A caller will be added in the next patch. Signed-off-by: Taylor Blau <me@ttaylorr.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-03-22pack-bitmap: avoid traversal of objects referenced by uninteresting tagLibravatar Patrick Steinhardt1-0/+1
When preparing the bitmap walk, we first establish the set of of have and want objects by iterating over the set of pending objects: if an object is marked as uninteresting, it's declared as an object we already have, otherwise as an object we want. These two sets are then used to compute which transitively referenced objects we need to obtain. One special case here are tag objects: when a tag is requested, we resolve it to its first not-tag object and add both resolved objects as well as the tag itself into either the have or want set. Given that the uninteresting-property always propagates to referenced objects, it is clear that if the tag is uninteresting, so are its children and vice versa. But we fail to propagate the flag, which effectively means that referenced objects will always be interesting except for the case where they have already been marked as uninteresting explicitly. This mislabeling does not impact correctness: we now have it in our "wants" set, and given that we later do an `AND NOT` of the bitmaps of "wants" and "haves" sets it is clear that the result must be the same. But we now start to needlessly traverse the tag's referenced objects in case it is uninteresting, even though we know that each referenced object will be uninteresting anyway. In the worst case, this can lead to a complete graph walk just to establish that we do not care for any object. Fix the issue by propagating the `UNINTERESTING` flag to pointees of tag objects and add a benchmark with negative revisions to p5310. This shows some nice performance benefits, tested with linux.git: Test HEAD~ HEAD --------------------------------------------------------------------------------------------------------------- 5310.3: repack to disk 193.18(181.46+16.42) 194.61(183.41+15.83) +0.7% 5310.4: simulated clone 25.93(24.88+1.05) 25.81(24.73+1.08) -0.5% 5310.5: simulated fetch 2.64(5.30+0.69) 2.59(5.16+0.65) -1.9% 5310.6: pack to file (bitmap) 58.75(57.56+6.30) 58.29(57.61+5.73) -0.8% 5310.7: rev-list (commits) 1.45(1.18+0.26) 1.46(1.22+0.24) +0.7% 5310.8: rev-list (objects) 15.35(14.22+1.13) 15.30(14.23+1.07) -0.3% 5310.9: rev-list with tag negated via --not --all (objects) 22.49(20.93+1.56) 0.11(0.09+0.01) -99.5% 5310.10: rev-list with negative tag (objects) 0.61(0.44+0.16) 0.51(0.35+0.16) -16.4% 5310.11: rev-list count with blob:none 12.15(11.19+0.96) 12.18(11.19+0.99) +0.2% 5310.12: rev-list count with blob:limit=1k 17.77(15.71+2.06) 17.75(15.63+2.12) -0.1% 5310.13: rev-list count with tree:0 1.69(1.31+0.38) 1.68(1.28+0.39) -0.6% 5310.14: simulated partial clone 20.14(19.15+0.98) 19.98(18.93+1.05) -0.8% 5310.16: clone (partial bitmap) 12.78(13.89+1.07) 12.72(13.99+1.01) -0.5% 5310.17: pack to file (partial bitmap) 42.07(45.44+2.72) 41.44(44.66+2.80) -1.5% 5310.18: rev-list with tree filter (partial bitmap) 0.44(0.29+0.15) 0.46(0.32+0.14) +4.5% While most benchmarks are probably in the range of noise, the newly added 5310.9 and 5310.10 benchmarks consistenly perform better. Signed-off-by: Patrick Steinhardt <ps@pks.im>. Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-03-13use CALLOC_ARRAYLibravatar René Scharfe1-2/+2
Add and apply a semantic patch for converting code that open-codes CALLOC_ARRAY to use it instead. It shortens the code and infers the element size automatically. Signed-off-by: René Scharfe <l.s.r@web.de> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-02-11rev-list: add --disk-usage option for calculating disk usageLibravatar Jeff King1-0/+81
It can sometimes be useful to see which refs are contributing to the overall repository size (e.g., does some branch have a bunch of objects not found elsewhere in history, which indicates that deleting it would shrink the size of a clone). You can find that out by generating a list of objects, getting their sizes from cat-file, and then summing them, like: git rev-list --objects --no-object-names main..branch git cat-file --batch-check='%(objectsize:disk)' | perl -lne '$total += $_; END { print $total }' Though note that the caveats from git-cat-file(1) apply here. We "blame" base objects more than their deltas, even though the relationship could easily be flipped. Still, it can be a useful rough measure. But one problem is that it's slow to run. Teaching rev-list to sum up the sizes can be much faster for two reasons: 1. It skips all of the piping of object names and sizes. 2. If bitmaps are in use, for objects that are in the bitmapped packfile we can skip the oid_object_info() lookup entirely, and just ask the revindex for the on-disk size. This patch implements a --disk-usage option which produces the same answer in a fraction of the time. Here are some timings using a clone of torvalds/linux: [rev-list piped to cat-file, no bitmaps] $ time git rev-list --objects --no-object-names --all | git cat-file --buffer --batch-check='%(objectsize:disk)' | perl -lne '$total += $_; END { print $total }' 1459938510 real 0m29.635s user 0m38.003s sys 0m1.093s [internal, no bitmaps] $ time git rev-list --disk-usage --objects --all 1459938510 real 0m31.262s user 0m30.885s sys 0m0.376s Even though the wall-clock time is slightly worse due to parallelism, notice the CPU savings between the two. We saved 21% of the CPU just by avoiding the pipes. But the real win is with bitmaps. If we use them without the new option: [rev-list piped to cat-file, bitmaps] $ time git rev-list --objects --no-object-names --all --use-bitmap-index | git cat-file --batch-check='%(objectsize:disk)' | perl -lne '$total += $_; END { print $total }' 1459938510 real 0m6.244s user 0m8.452s sys 0m0.311s then we're faster to generate the list of objects, but we still spend a lot of time piping and looking things up. But if we do both together: [internal, bitmaps] $ time git rev-list --disk-usage --objects --all --use-bitmap-index 1459938510 real 0m0.219s user 0m0.169s sys 0m0.049s then we get the same answer much faster. For "--all", that answer will correspond closely to "du objects/pack", of course. But we're actually checking reachability here, so we're still fast when we ask for more interesting things: $ time git rev-list --disk-usage --use-bitmap-index v5.0..v5.10 374798628 real 0m0.429s user 0m0.356s sys 0m0.072s Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-01-13rebuild_existing_bitmaps(): convert to new revindex APILibravatar Taylor Blau1-3/+2
Remove another instance of looking at the revindex directly by instead calling 'pack_pos_to_index()'. Unlike other patches, this caller only cares about the index position of each object in the loop. Signed-off-by: Taylor Blau <me@ttaylorr.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-01-13try_partial_reuse(): convert to new revindex APILibravatar Taylor Blau1-8/+5
Remove another instance of direct revindex manipulation by calling 'pack_pos_to_offset()' instead (the caller here does not care about the index position of the object at position 'pos'). Note that we cannot just use the existing "offset" variable to store the value we get from pack_pos_to_offset(). It is incremented by unpack_object_header(), but we later need the original value. Since we'll no longer have revindex->offset to read it from, we'll store that in a separate variable ("header" since it points to the entry's header bytes). Signed-off-by: Taylor Blau <me@ttaylorr.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-01-13get_size_by_pos(): convert to new revindex APILibravatar Taylor Blau1-4/+4
Remove another caller that holds onto a 'struct revindex_entry' by replacing the direct indexing with calls to 'pack_pos_to_offset()' and 'pack_pos_to_index()'. Signed-off-by: Taylor Blau <me@ttaylorr.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-01-13show_objects_for_type(): convert to new revindex APILibravatar Taylor Blau1-6/+7
Avoid storing the revindex entry directly, since this structure will soon be removed from the public interface. Instead, store the offset and index position by calling 'pack_pos_to_offset()' and 'pack_pos_to_index()', respectively. Signed-off-by: Taylor Blau <me@ttaylorr.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-01-13bitmap_position_packfile(): convert to new revindex APILibravatar Taylor Blau1-1/+4
Replace find_revindex_position() with its counterpart in the new API, offset_to_pack_pos(). Signed-off-by: Taylor Blau <me@ttaylorr.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-12-08pack-bitmap: factor out 'add_commit_to_bitmap()'Libravatar Taylor Blau1-15/+21
'find_objects()' currently needs to interact with the bitmaps khash pretty closely. To make 'find_objects()' read a little more straightforwardly, remove some of the khash-level details into a new function that describes what it does: 'add_commit_to_bitmap()'. Signed-off-by: Taylor Blau <me@ttaylorr.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-12-08pack-bitmap: factor out 'bitmap_for_commit()'Libravatar Taylor Blau1-14/+19
A couple of callers within pack-bitmap.c duplicate logic to lookup a given object id in the bitamps khash. Factor this out into a new function, 'bitmap_for_commit()' to reduce some code duplication. Make this new function non-static, since it will be used in later commits from outside of pack-bitmap.c. Signed-off-by: Taylor Blau <me@ttaylorr.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-12-08pack-bitmap-write: ignore BITMAP_FLAG_REUSELibravatar Jeff King1-40/+6
The on-disk bitmap format has a flag to mark a bitmap to be "reused". This is a rather curious feature, and works like this: - a run of pack-objects would decide to mark the last 80% of the bitmaps it generates with the reuse flag - the next time we generate bitmaps, we'd see those reuse flags from the last run, and mark those commits as special: - we'd be more likely to select those commits to get bitmaps in the new output - when generating the bitmap for a selected commit, we'd reuse the old bitmap as-is (rearranging the bits to match the new pack, of course) However, neither of these behaviors particularly makes sense. Just because a commit happened to be bitmapped last time does not make it a good candidate for having a bitmap this time. In particular, we may choose bitmaps based on how recent they are in history, or whether a ref tip points to them, and those things will change. We're better off re-considering fresh which commits are good candidates. Reusing the existing bitmap _is_ a reasonable thing to do to save computation. But only reusing exact bitmaps is a weak form of this. If we have an old bitmap for A and now want a new bitmap for its child, we should be able to compute that only by looking at trees and that are new to the child. But this code would consider only exact reuse (which is perhaps why it was eager to select those commits in the first place). Furthermore, the recent switch to the reverse-edge algorithm for generating bitmaps dropped this optimization entirely (and yet still performs better). So let's do a few cleanups: - drop the whole "reusing bitmaps" phase of generating bitmaps. It's not helping anything, and is mostly unused code (or worse, code that is using CPU but not doing anything useful) - drop the use of the on-disk reuse flag to select commits to bitmap - stop setting the on-disk reuse flag in bitmaps we generate (since nothing respects it anymore) We will keep a few innards of the reuse code, which will help us implement a more capable version of the "reuse" optimization: - simplify rebuild_existing_bitmaps() into a function that only builds the mapping of bits between the old and new orders, but doesn't actually convert any bitmaps - make rebuild_bitmap() public; we'll call it lazily to convert bitmaps as we traverse (using the mapping created above) Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Taylor Blau <me@ttaylorr.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-12-08pack-bitmap.c: check reads more aggressively when loadingLibravatar Taylor Blau1-1/+6
Before 'load_bitmap_entries_v1()' reads an actual EWAH bitmap, it should check that it can safely do so by ensuring that there are at least 6 bytes available to be read (four for the commit's index position, and then two more for the xor offset and flags, respectively). Likewise, it should check that the commit index it read refers to a legitimate object in the pack. The first fix catches a truncation bug that was exposed when testing, and the second is purely precautionary. There are some possible future improvements, not pursued here. They are: - Computing the correct boundary of the bitmap itself in the caller and ensuring that we don't read past it. This may or may not be worth it, since in a truncation situation, all bets are off: (is the trailer still there and the bitmap entries malformed, or is the trailer truncated?). The best we can do is try to read what's there as if it's correct data (and protect ourselves when it's obviously bogus). - Avoid the magic "6" by teaching read_be32() and read_u8() (both of which are custom helpers for this function) to check sizes before advancing the pointers. - Adding more tests in this area. Testing these truncation situations are remarkably fragile to even subtle changes in the bitmap generation. So, the resulting tests are likely to be quite brittle. Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Taylor Blau <me@ttaylorr.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-12-08rev-list: die when --test-bitmap detects a mismatchLibravatar Jeff King1-1/+1
You can use "git rev-list --test-bitmap HEAD" to check that bitmaps produce the same answer we'd get from a regular traversal. But if we detect an error, we only print "mismatch", and still exit with a successful error code. That makes the uses of --test-bitmap in the test suite (e.g., in t5310) mostly pointless: even if we saw an error, the tests wouldn't notice. Let's instead call die(), which will let these tests work as designed, and alert us if the bitmaps are bogus. Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Taylor Blau <me@ttaylorr.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-12-08pack-bitmap: bounds-check size of cache extensionLibravatar Jeff King1-2/+6
A .bitmap file may have a "name hash cache" extension, which puts a sequence of uint32_t values (one per object) at the end of the file. When we see a flag indicating this extension, we blindly subtract the appropriate number of bytes from our available length. However, if the .bitmap file is too short, we'll underflow our length variable and wrap around, thinking we have a very large length. This can lead to reading out-of-bounds bytes while loading individual ewah bitmaps. We can fix this by checking the number of available bytes when we parse the header. The existing "truncated bitmap" test is now split into two tests: one where we don't have this extension at all (and hence actually do try to read a truncated ewah bitmap) and one where we realize up-front that we can't even fit in the cache structure. We'll check stderr in each case to make sure we hit the error we're expecting. Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Taylor Blau <me@ttaylorr.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-12-08pack-bitmap: fix header size checkLibravatar Jeff King1-3/+4
When we parse a .bitmap header, we first check that we have enough bytes to make a valid header. We do that based on sizeof(struct bitmap_disk_header). However, as of 0f4d6cada8 (pack-bitmap: make bitmap header handling hash agnostic, 2019-02-19), that struct oversizes its checksum member to GIT_MAX_RAWSZ. That means we need to adjust for the difference between that constant and the size of the actual hash we're using. That commit adjusted the code which moves our pointer forward, but forgot to update the size check. This meant we were overly strict about the header size (requiring room for a 32-byte worst-case hash, when sha1 is only 20 bytes). But in practice it didn't matter because bitmap files tend to have at least 12 bytes of actual data anyway, so it was unlikely for a valid file to be caught by this. Let's fix it by pulling the header size into a separate variable and using it in both spots. That fixes the bug and simplifies the code to make it harder to have a mismatch like this in the future. It will also come in handy in the next patch for more bounds checking. Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Taylor Blau <me@ttaylorr.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-05-04pack-bitmap: pass object filter to fill-in traversalLibravatar Jeff King1-5/+9
Sometimes a bitmap traversal still has to walk some commits manually, because those commits aren't included in the bitmap packfile (e.g., due to a push or commit since the last full repack). If we're given an object filter, we don't pass it down to this traversal. It's not necessary for correctness because the bitmap code has its own filters to post-process the bitmap result (which it must, to filter out the objects that _are_ mentioned in the bitmapped packfile). And with blob filters, there was no performance reason to pass along those filters, either. The fill-in traversal could omit them from the result, but it wouldn't save us any time to do so, since we'd still have to walk each tree entry to see if it's a blob or not. But now that we support tree filters, there's opportunity for savings. A tree:depth=0 filter means we can avoid accessing trees entirely, since we know we won't them (or any of the subtrees or blobs they point to). The new test in p5310 shows this off (the "partial bitmap" state is one where HEAD~100 and its ancestors are all in a bitmapped pack, but HEAD~100..HEAD are not). Here are the results (run against linux.git): Test HEAD^ HEAD ------------------------------------------------------------------------------------------------- [...] 5310.16: rev-list with tree filter (partial bitmap) 0.19(0.17+0.02) 0.03(0.02+0.01) -84.2% The absolute number of savings isn't _huge_, but keep in mind that we only omitted 100 first-parent links (in the version of linux.git here, that's 894 actual commits). In a more pathological case, we might have a much larger proportion of non-bitmapped commits. I didn't bother creating such a case in the perf script because the setup is expensive, and this is plenty to show the savings as a percentage. Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Taylor Blau <me@ttaylorr.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-05-04pack-bitmap.c: support 'tree:0' filteringLibravatar Taylor Blau1-1/+24
In the previous patch, we made it easy to define other filters that exclude all objects of a certain type. Use that in order to implement bitmap-level filtering for the '--filter=tree:<n>' filter when 'n' is equal to 0. The general case is not helped by bitmaps, since for values of 'n > 0', the object filtering machinery requires a full-blown tree traversal in order to determine the depth of a given tree. Caching this is non-obvious, too, since the same tree object can have a different depth depending on the context (e.g., a tree was moved up in the directory hierarchy between two commits). But, the 'n = 0' case can be helped, and this patch does so. Running p5310.11 in this tree and on master with the kernel, we can see that this case is helped substantially: Test master this tree -------------------------------------------------------------------------------- 5310.11: rev-list count with tree:0 10.68(10.39+0.27) 0.06(0.04+0.01) -99.4% Signed-off-by: Taylor Blau <me@ttaylorr.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>