Age | Commit message (Collapse) | Author | Files | Lines |
|
When writing a new pack with a bitmap, it is sometimes convenient to
indicate some reference prefixes which should receive priority when
selecting which commits to receive bitmaps.
A truly motivated caller could accomplish this by setting
'pack.islandCore', (since all commits in the core island are similarly
marked as preferred) but this requires callers to opt into using delta
islands, which they may or may not want to do.
Introduce a new multi-valued configuration, 'pack.preferBitmapTips' to
allow callers to specify a list of reference prefixes. All references
which have a prefix contained in 'pack.preferBitmapTips' will mark their
tips as "preferred" in the same way as commits are marked as preferred
for selection by 'pack.islandCore'.
The choice of the verb "prefer" is intentional: marking the NEEDS_BITMAP
flag on an object does *not* guarantee that that object will receive a
bitmap. It merely guarantees that that commit will receive a bitmap over
any *other* commit in the same window by bitmap_writer_select_commits().
The test this patch adds reflects this quirk, too. It only tests that
a commit (which didn't receive bitmaps by default) is selected for
bitmaps after changing the value of 'pack.preferBitmapTips' to include
it. Other commits may lose their bitmaps as a byproduct of how the
selection process works (bitmap_writer_select_commits() ignores the
remainder of a window after seeing a commit with the NEEDS_BITMAP flag).
This configuration will aide in selecting important references for
multi-pack bitmaps, since they do not respect the same pack.islandCore
configuration. (They could, but doing so may be confusing, since it is
packs--not bitmaps--which are influenced by the delta-islands
configuration).
In a fork network repository (one which lists all forks of a given
repository as remotes), for example, it is useful to set
pack.preferBitmapTips to 'refs/remotes/<root>/heads' and
'refs/remotes/<root>/tags', where '<root>' is an opaque identifier
referring to the repository which is at the base of the fork chain.
Suggested-by: Jeff King <peff@peff.net>
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
|
|
The next patch will add a 'bitmap' test-tool which prints the list of
commits that have bitmaps computed.
The test helper could implement this itself, but it would need access to
the 'bitmaps' field of the 'pack_bitmap' struct. To avoid exposing this
private detail, implement the entirety of the helper behind a
test_bitmap_commits() function in pack-bitmap.c.
There is some precedence for this with test_bitmap_walk() which is used
to implement the '--test-bitmap' flag in 'git rev-list' (and is also
implemented in pack-bitmap.c).
A caller will be added in the next patch.
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
|
|
It can sometimes be useful to see which refs are contributing to the
overall repository size (e.g., does some branch have a bunch of objects
not found elsewhere in history, which indicates that deleting it would
shrink the size of a clone).
You can find that out by generating a list of objects, getting their
sizes from cat-file, and then summing them, like:
git rev-list --objects --no-object-names main..branch
git cat-file --batch-check='%(objectsize:disk)' |
perl -lne '$total += $_; END { print $total }'
Though note that the caveats from git-cat-file(1) apply here. We "blame"
base objects more than their deltas, even though the relationship could
easily be flipped. Still, it can be a useful rough measure.
But one problem is that it's slow to run. Teaching rev-list to sum up
the sizes can be much faster for two reasons:
1. It skips all of the piping of object names and sizes.
2. If bitmaps are in use, for objects that are in the
bitmapped packfile we can skip the oid_object_info()
lookup entirely, and just ask the revindex for the
on-disk size.
This patch implements a --disk-usage option which produces the same
answer in a fraction of the time. Here are some timings using a clone of
torvalds/linux:
[rev-list piped to cat-file, no bitmaps]
$ time git rev-list --objects --no-object-names --all |
git cat-file --buffer --batch-check='%(objectsize:disk)' |
perl -lne '$total += $_; END { print $total }'
1459938510
real 0m29.635s
user 0m38.003s
sys 0m1.093s
[internal, no bitmaps]
$ time git rev-list --disk-usage --objects --all
1459938510
real 0m31.262s
user 0m30.885s
sys 0m0.376s
Even though the wall-clock time is slightly worse due to parallelism,
notice the CPU savings between the two. We saved 21% of the CPU just by
avoiding the pipes.
But the real win is with bitmaps. If we use them without the new option:
[rev-list piped to cat-file, bitmaps]
$ time git rev-list --objects --no-object-names --all --use-bitmap-index |
git cat-file --batch-check='%(objectsize:disk)' |
perl -lne '$total += $_; END { print $total }'
1459938510
real 0m6.244s
user 0m8.452s
sys 0m0.311s
then we're faster to generate the list of objects, but we still spend a
lot of time piping and looking things up. But if we do both together:
[internal, bitmaps]
$ time git rev-list --disk-usage --objects --all --use-bitmap-index
1459938510
real 0m0.219s
user 0m0.169s
sys 0m0.049s
then we get the same answer much faster.
For "--all", that answer will correspond closely to "du objects/pack",
of course. But we're actually checking reachability here, so we're still
fast when we ask for more interesting things:
$ time git rev-list --disk-usage --use-bitmap-index v5.0..v5.10
374798628
real 0m0.429s
user 0m0.356s
sys 0m0.072s
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
|
|
A couple of callers within pack-bitmap.c duplicate logic to lookup a
given object id in the bitamps khash. Factor this out into a new
function, 'bitmap_for_commit()' to reduce some code duplication.
Make this new function non-static, since it will be used in later
commits from outside of pack-bitmap.c.
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
|
|
The on-disk bitmap format has a flag to mark a bitmap to be "reused".
This is a rather curious feature, and works like this:
- a run of pack-objects would decide to mark the last 80% of the
bitmaps it generates with the reuse flag
- the next time we generate bitmaps, we'd see those reuse flags from
the last run, and mark those commits as special:
- we'd be more likely to select those commits to get bitmaps in
the new output
- when generating the bitmap for a selected commit, we'd reuse the
old bitmap as-is (rearranging the bits to match the new pack, of
course)
However, neither of these behaviors particularly makes sense.
Just because a commit happened to be bitmapped last time does not make
it a good candidate for having a bitmap this time. In particular, we may
choose bitmaps based on how recent they are in history, or whether a ref
tip points to them, and those things will change. We're better off
re-considering fresh which commits are good candidates.
Reusing the existing bitmap _is_ a reasonable thing to do to save
computation. But only reusing exact bitmaps is a weak form of this. If
we have an old bitmap for A and now want a new bitmap for its child, we
should be able to compute that only by looking at trees and that are new
to the child. But this code would consider only exact reuse (which is
perhaps why it was eager to select those commits in the first place).
Furthermore, the recent switch to the reverse-edge algorithm for
generating bitmaps dropped this optimization entirely (and yet still
performs better).
So let's do a few cleanups:
- drop the whole "reusing bitmaps" phase of generating bitmaps. It's
not helping anything, and is mostly unused code (or worse, code that
is using CPU but not doing anything useful)
- drop the use of the on-disk reuse flag to select commits to bitmap
- stop setting the on-disk reuse flag in bitmaps we generate (since
nothing respects it anymore)
We will keep a few innards of the reuse code, which will help us
implement a more capable version of the "reuse" optimization:
- simplify rebuild_existing_bitmaps() into a function that only builds
the mapping of bits between the old and new orders, but doesn't
actually convert any bitmaps
- make rebuild_bitmap() public; we'll call it lazily to convert bitmaps
as we traverse (using the mapping created above)
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
|
|
The object reachability bitmap machinery and the partial cloning
machinery were not prepared to work well together, because some
object-filtering criteria that partial clones use inherently rely
on object traversal, but the bitmap machinery is an optimization
to bypass that object traversal. There however are some cases
where they can work together, and they were taught about them.
* jk/object-filter-with-bitmap:
rev-list --count: comment on the use of count_right++
pack-objects: support filters with bitmaps
pack-bitmap: implement BLOB_LIMIT filtering
pack-bitmap: implement BLOB_NONE filtering
bitmap: add bitmap_unset() function
rev-list: use bitmap filters for traversal
pack-bitmap: basic noop bitmap filter infrastructure
rev-list: allow commit-only bitmap traversals
t5310: factor out bitmap traversal comparison
rev-list: allow bitmaps when counting objects
rev-list: make --count work with --objects
rev-list: factor out bitmap-optimized routines
pack-bitmap: refuse to do a bitmap traversal with pathspecs
rev-list: fallback to non-bitmap traversal when filtering
pack-bitmap: fix leak of haves/wants object lists
pack-bitmap: factor out type iterator initialization
|
|
The way "git pack-objects" reuses objects stored in existing pack
to generate its result has been improved.
* jk/packfile-reuse-cleanup:
pack-bitmap: don't rely on bitmap_git->reuse_objects
pack-objects: add checks for duplicate objects
pack-objects: improve partial packfile reuse
builtin/pack-objects: introduce obj_is_packed()
pack-objects: introduce pack.allowPackReuse
csum-file: introduce hashfile_total()
pack-bitmap: simplify bitmap_has_oid_in_uninteresting()
pack-bitmap: uninteresting oid can be outside bitmapped packfile
pack-bitmap: introduce bitmap_walk_contains()
ewah/bitmap: introduce bitmap_word_alloc()
packfile: expose get_delta_base()
builtin/pack-objects: report reused packfile objects
|
|
Currently you can't use object filters with bitmaps, but we plan to
support at least some filters with bitmaps. Let's introduce some
infrastructure that will help us do that:
- prepare_bitmap_walk() now accepts a list_objects_filter_options
parameter (which can be NULL for no filtering; all the current
callers pass this)
- we'll bail early if the filter is incompatible with bitmaps (just as
we would if there were no bitmaps at all). Currently all filters are
incompatible.
- we'll filter the resulting bitmap; since there are no supported
filters yet, this is always a noop.
There should be no behavior change yet, but we'll support some actual
filters in a future patch.
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
|
|
Ever since we added reachability bitmap support, we've been able to use
it with rev-list to get the full list of objects, like:
git rev-list --objects --use-bitmap-index --all
But you can't do so without --objects, since we weren't ready to just
show the commits. However, the internals of the bitmap code are mostly
ready for this: they avoid opening up trees when walking to fill in the
bitmaps. We just need to actually pass in the rev_info to
traverse_bitmap_commit_list() so it knows which types to bother
triggering our callback for.
For completeness, the perf test now covers both the existing --objects
case, as well as the new commits-only behavior (the objects one got way
faster when we introduced bitmaps, but obviously isn't improved now).
Here are numbers for linux.git:
Test HEAD^ HEAD
------------------------------------------------------------------------
5310.7: rev-list (commits) 8.29(8.10+0.19) 1.76(1.72+0.04) -78.8%
5310.8: rev-list (objects) 8.06(7.94+0.12) 8.14(7.94+0.13) +1.0%
That run was cheating a little, as I didn't have any commit-graph in the
repository, and we'd built it by default these days when running git-gc.
Here are numbers with a commit-graph:
Test HEAD^ HEAD
------------------------------------------------------------------------
5310.7: rev-list (commits) 0.70(0.58+0.12) 0.51(0.46+0.04) -27.1%
5310.8: rev-list (objects) 6.20(6.09+0.10) 6.27(6.16+0.11) +1.1%
Still an improvement, but a lot less impressive.
We could have the perf script remove any commit-graph to show the
out-sized effect, but it probably makes sense to leave it in what would
be a more typical setup.
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
|
|
The old code to reuse deltas from an existing packfile
just tried to dump a whole segment of the pack verbatim.
That's faster than the traditional way of actually adding
objects to the packing list, but it didn't kick in very
often. This new code is really going for a middle ground:
do _some_ per-object work, but way less than we'd
traditionally do.
The general strategy of the new code is to make a bitmap
of objects from the packfile we'll include, and then
iterate over it, writing out each object exactly as it is
in our on-disk pack, but _not_ adding it to our packlist
(which costs memory, and increases the search space for
deltas).
One complication is that if we're omitting some objects,
we can't set a delta against a base that we're not
sending. So we have to check each object in
try_partial_reuse() to make sure we have its delta.
About performance, in the worst case we might have
interleaved objects that we are sending or not sending,
and we'd have as many chunks as objects. But in practice
we send big chunks.
For instance, packing torvalds/linux on GitHub servers
now reused 6.5M objects, but only needed ~50k chunks.
Helped-by: Jonathan Tan <jonathantanmy@google.com>
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Christian Couder <chriscool@tuxfamily.org>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
|
|
We will use this helper function in a following commit to
tell us if an object is packed.
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Christian Couder <chriscool@tuxfamily.org>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
|
|
When we ran `make hdr-check` with the following patch
diff --git a/Makefile b/Makefile
index f879697ea3..d8df4e316b 100644
--- a/Makefile
+++ b/Makefile
@@ -2773,7 +2773,7 @@ CHK_HDRS = $(filter-out $(EXCEPT_HDRS),$(patsubst ./%,%,$(LIB_H)))
HCO = $(patsubst %.h,%.hco,$(CHK_HDRS))
$(HCO): %.hco: %.h FORCE
- $(QUIET_HDR)$(CC) -include git-compat-util.h -I. -o /dev/null -c -xc $<
+ $(QUIET_HDR)$(CC) -include git-compat-util.h -I. -o /dev/null -c -xc $(ALL_CFLAGS) $<
.PHONY: hdr-check $(HCO)
hdr-check: $(HCO)
and with `DEVELOPER=1`, we got the following warning on Arch Linux:
pack-bitmap.h:20:19: error: ‘BITMAP_IDX_SIGNATURE’ defined but not used [-Werror=unused-const-variable=]
20 | static const char BITMAP_IDX_SIGNATURE[] = {'B', 'I', 'T', 'M'};
| ^~~~~~~~~~~~~~~~~~~~
cc1: all warnings being treated as errors
"Use" the BITMAP_IDX_SIGNATURE variable by making the size of
bitmap_disk_header.magic equal to the size of BITMAP_IDX_SIGNATURE,
thereby eliminating the magic number (4).
An alternative was to simply add MAYBE_UNUSED, however that does not
eliminate the magic number.
Another alternative was to change the definition to
extern const char BITMAP_IDX_SIGNATURE[4];
However, this design was also not chosen as the static definition allows
us to keep the declaration together for readability along with removing
the magic number.
Signed-off-by: Denton Liu <liu.denton@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
|
|
All of the users of our khash_sha1 maps actually have a "struct
object_id". Let's use the more descriptive type.
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
|
|
Instead of storing unsigned char pointers in the hash tables, switch to
storing instances of struct object_id. Update several internal functions
and one external function to take pointers to struct object_id.
Signed-off-by: brian m. carlson <sandals@crustytoothpaste.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
|
|
Increase the checksum field in struct bitmap_disk_header to be
GIT_MAX_RAWSZ bytes in length and ensure that we hash the proper number
of bytes out when computing the bitmap checksum.
Signed-off-by: brian m. carlson <sandals@crustytoothpaste.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
|
|
Signed-off-by: Nguyễn Thái Ngọc Duy <pclouds@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
|
|
Commit 30cdc33fba (pack-bitmap: save "have" bitmap from
walk, 2018-08-21) introduced a new function for looking at
the "have" side of a bitmap walk. Because it only makes
sense to do so after we've finished the walk, we added an
extra safety assertion, making sure that bitmap_git->result
is non-NULL.
However, this safety is misguided. It was trying to catch
the case where we had called prepare_bitmap_walk() to give
us a "struct bitmap_index", but had not yet called
traverse_bitmap_commit_list() to walk it. But all of the
interesting computation (including setting up the result and
"have" bitmaps) happens in the first function! The latter
function only delivers the result to a callback function.
So the case we were worried about is impossible; if you get
a non-NULL result from prepare_bitmap_walk(), then its
"have" field will be fully formed.
But much worse, traverse_bitmap_commit_list() actually frees
the result field as it finishes. Which means that this
assertion is worse than useless: it's almost guaranteed to
trigger!
Our test suite didn't catch this because the function isn't
actually exercised at all. The only caller comes from
6a1e32d532 (pack-objects: reuse on-disk deltas for thin
"have" objects, 2018-08-21), and that's triggered only when
you fetch or push history that contains an object with a
base that is found deep in history. Our test suite fetches
and pushes either don't use bitmaps, or use too-small
example repositories. But any reasonably-sized real-world
push or fetch (with bitmaps) would trigger this.
This patch drops the harmful assertion and tweaks the
docstring for the function to make the precondition clear.
The tests need to be improved to exercise this new
pack-objects feature, but we'll do that in a separate
commit.
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
|
|
When we do a bitmap walk, we save the result, which
represents (WANTs & ~HAVEs); i.e., every object we care
about visiting in our walk. However, we throw away the
haves bitmap, which can sometimes be useful, too. Save it
and provide an access function so code which has performed a
walk can query it.
A few notes on the accessor interface:
- the bitmap code calls these "haves" because it grew out
of the want/have negotiation for fetches. But really,
these are simply the objects that would be flagged
UNINTERESTING in a regular traversal. Let's use that
more universal nomenclature for the external module
interface. We may want to change the internal naming
inside the bitmap code, but that's outside the scope of
this patch.
- it still uses a bare "sha1" rather than "oid". That's
true of all of the bitmap code. And in this particular
instance, our caller in pack-objects is dealing with the
bare sha1 that comes from a packed REF_DELTA (we're
pointing directly to the mmap'd pack on disk). That's
something we'll have to deal with as we transition to a
new hash, but we can wait and see how the caller ends up
being fixed and adjust this interface accordingly.
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
|
|
I looped over the toplevel header files, creating a temporary two-line C
program for each consisting of
#include "git-compat-util.h"
#include $HEADER
This patch is the result of manually fixing errors in compiling those
tiny programs.
Signed-off-by: Elijah Newren <newren@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
|
|
Add a function to free struct bitmap_index instances, and use it where
needed (except when rebuild_existing_bitmaps() is used, since it creates
references to the bitmaps within the struct bitmap_index passed to it).
Note that the hashes field in struct bitmap_index is not freed because
it points to another field within the same struct. The documentation for
that field has been updated to clarify that.
Signed-off-by: Jonathan Tan <jonathantanmy@google.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
|
|
Remove the bitmap_git global variable. Instead, generate on demand an
instance of struct bitmap_index for code that needs to access it.
This allows us significant control over the lifetime of instances of
struct bitmap_index. In particular, packs can now be closed without
worrying if an unnecessarily long-lived "pack" field in struct
bitmap_index still points to it.
The bitmap API is also clearer in that we need to first obtain a struct
bitmap_index, then we use it.
This patch raises two potential issues: (1) memory for the struct
bitmap_index is allocated without being freed, and (2)
prepare_bitmap_git() and prepare_bitmap_walk() can reuse a previously
loaded bitmap. For (1), this will be dealt with in a subsequent patch in
this patch set that also deals with freeing the contents of the struct
bitmap_index (which were not freed previously, because they have global
scope). For (2), current bitmap users only load the bitmap once at most
(note that pack-objects can use bitmaps or write bitmaps, but not both
at the same time), so support for reuse has no effect - and future users
can pass around the struct bitmap_index * obtained if they need to do 2
or more things with the same bitmap.
Helped-by: Stefan Beller <sbeller@google.com>
Signed-off-by: Jonathan Tan <jonathantanmy@google.com>
Helped-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
|
|
This field is only need for pack-bitmap, which is an optional
feature. Move it to a separate array that is only allocated when
pack-bitmap is used (like objects[], it is not freed, since we need it
until the end of the process)
Signed-off-by: Nguyễn Thái Ngọc Duy <pclouds@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
|
|
Convert traverse_bitmap_commit_list and the callbacks it takes to use a
pointer to struct object_id.
Signed-off-by: brian m. carlson <sandals@crustytoothpaste.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
|
|
Reviewed-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
|
|
The "__attribute__" flag may be a noop on some compilers.
That's OK as long as the code is correct without the
attribute, but in this case it is not. We would typically
end up with a struct that is 2 bytes too long due to struct
padding, breaking both reading and writing of bitmaps.
Instead of marshalling the data in a struct, let's just
provide helpers for reading and writing the appropriate
types. Besides being correct on all platforms, the result is
more efficient and simpler to read.
Signed-off-by: Karsten Blees <blees@dcon.de>
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
|
|
When we use pack bitmaps rather than walking the object
graph, we end up with the list of objects to include in the
packfile, but we do not know the path at which any tree or
blob objects would be found.
In a recently packed repository, this is fine. A fetch would
use the paths only as a heuristic in the delta compression
phase, and a fully packed repository should not need to do
much delta compression.
As time passes, though, we may acquire more objects on top
of our large bitmapped pack. If clients fetch frequently,
then they never even look at the bitmapped history, and all
works as usual. However, a client who has not fetched since
the last bitmap repack will have "have" tips in the
bitmapped history, but "want" newer objects.
The bitmaps themselves degrade gracefully in this
circumstance. We manually walk the more recent bits of
history, and then use bitmaps when we hit them.
But we would also like to perform delta compression between
the newer objects and the bitmapped objects (both to delta
against what we know the user already has, but also between
"new" and "old" objects that the user is fetching). The lack
of pathnames makes our delta heuristics much less effective.
This patch adds an optional cache of the 32-bit name_hash
values to the end of the bitmap file. If present, a reader
can use it to match bitmapped and non-bitmapped names during
delta compression.
Here are perf results for p5310:
Test origin/master HEAD^ HEAD
-------------------------------------------------------------------------------------------------
5310.2: repack to disk 36.81(37.82+1.43) 47.70(48.74+1.41) +29.6% 47.75(48.70+1.51) +29.7%
5310.3: simulated clone 30.78(29.70+2.14) 1.08(0.97+0.10) -96.5% 1.07(0.94+0.12) -96.5%
5310.4: simulated fetch 3.16(6.10+0.08) 3.54(10.65+0.06) +12.0% 1.70(3.07+0.06) -46.2%
5310.6: partial bitmap 36.76(43.19+1.81) 6.71(11.25+0.76) -81.7% 4.08(6.26+0.46) -88.9%
You can see that the time spent on an incremental fetch goes
down, as our delta heuristics are able to do their work.
And we save time on the partial bitmap clone for the same
reason.
Signed-off-by: Vicent Marti <tanoku@gmail.com>
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
|
|
This commit extends more the functionality of `pack-objects` by allowing
it to write out a `.bitmap` index next to any written packs, together
with the `.idx` index that currently gets written.
If bitmap writing is enabled for a given repository (either by calling
`pack-objects` with the `--write-bitmap-index` flag or by having
`pack.writebitmaps` set to `true` in the config) and pack-objects is
writing a packfile that would normally be indexed (i.e. not piping to
stdout), we will attempt to write the corresponding bitmap index for the
packfile.
Bitmap index writing happens after the packfile and its index has been
successfully written to disk (`finish_tmp_packfile`). The process is
performed in several steps:
1. `bitmap_writer_set_checksum`: this call stores the partial
checksum for the packfile being written; the checksum will be
written in the resulting bitmap index to verify its integrity
2. `bitmap_writer_build_type_index`: this call uses the array of
`struct object_entry` that has just been sorted when writing out
the actual packfile index to disk to generate 4 type-index bitmaps
(one for each object type).
These bitmaps have their nth bit set if the given object is of
the bitmap's type. E.g. the nth bit of the Commits bitmap will be
1 if the nth object in the packfile index is a commit.
This is a very cheap operation because the bitmap writing code has
access to the metadata stored in the `struct object_entry` array,
and hence the real type for each object in the packfile.
3. `bitmap_writer_reuse_bitmaps`: if there exists an existing bitmap
index for one of the packfiles we're trying to repack, this call
will efficiently rebuild the existing bitmaps so they can be
reused on the new index. All the existing bitmaps will be stored
in a `reuse` hash table, and the commit selection phase will
prioritize these when selecting, as they can be written directly
to the new index without having to perform a revision walk to
fill the bitmap. This can greatly speed up the repack of a
repository that already has bitmaps.
4. `bitmap_writer_select_commits`: if bitmap writing is enabled for
a given `pack-objects` run, the sequence of commits generated
during the Counting Objects phase will be stored in an array.
We then use that array to build up the list of selected commits.
Writing a bitmap in the index for each object in the repository
would be cost-prohibitive, so we use a simple heuristic to pick
the commits that will be indexed with bitmaps.
The current heuristics are a simplified version of JGit's
original implementation. We select a higher density of commits
depending on their age: the 100 most recent commits are always
selected, after that we pick 1 commit of each 100, and the gap
increases as the commits grow older. On top of that, we make sure
that every single branch that has not been merged (all the tips
that would be required from a clone) gets their own bitmap, and
when selecting commits between a gap, we tend to prioritize the
commit with the most parents.
Do note that there is no right/wrong way to perform commit
selection; different selection algorithms will result in
different commits being selected, but there's no such thing as
"missing a commit". The bitmap walker algorithm implemented in
`prepare_bitmap_walk` is able to adapt to missing bitmaps by
performing manual walks that complete the bitmap: the ideal
selection algorithm, however, would select the commits that are
more likely to be used as roots for a walk in the future (e.g.
the tips of each branch, and so on) to ensure a bitmap for them
is always available.
5. `bitmap_writer_build`: this is the computationally expensive part
of bitmap generation. Based on the list of commits that were
selected in the previous step, we perform several incremental
walks to generate the bitmap for each commit.
The walks begin from the oldest commit, and are built up
incrementally for each branch. E.g. consider this dag where A, B,
C, D, E, F are the selected commits, and a, b, c, e are a chunk
of simplified history that will not receive bitmaps.
A---a---B--b--C--c--D
\
E--e--F
We start by building the bitmap for A, using A as the root for a
revision walk and marking all the objects that are reachable
until the walk is over. Once this bitmap is stored, we reuse the
bitmap walker to perform the walk for B, assuming that once we
reach A again, the walk will be terminated because A has already
been SEEN on the previous walk.
This process is repeated for C, and D, but when we try to
generate the bitmaps for E, we can reuse neither the current walk
nor the bitmap we have generated so far.
What we do now is resetting both the walk and clearing the
bitmap, and performing the walk from scratch using E as the
origin. This new walk, however, does not need to be completed.
Once we hit B, we can lookup the bitmap we have already stored
for that commit and OR it with the existing bitmap we've composed
so far, allowing us to limit the walk early.
After all the bitmaps have been generated, another iteration
through the list of commits is performed to find the best XOR
offsets for compression before writing them to disk. Because of
the incremental nature of these bitmaps, XORing one of them with
its predecesor results in a minimal "bitmap delta" most of the
time. We can write this delta to the on-disk bitmap index, and
then re-compose the original bitmaps by XORing them again when
loaded.
This is a phase very similar to pack-object's `find_delta` (using
bitmaps instead of objects, of course), except the heuristics
have been greatly simplified: we only check the 10 bitmaps before
any given one to find best compressing one. This gives good
results in practice, because there is locality in the ordering of
the objects (and therefore bitmaps) in the packfile.
6. `bitmap_writer_finish`: the last step in the process is
serializing to disk all the bitmap data that has been generated
in the two previous steps.
The bitmap is written to a tmp file and then moved atomically to
its final destination, using the same process as
`pack-write.c:write_idx_file`.
Signed-off-by: Vicent Marti <tanoku@gmail.com>
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
|
|
A bitmap index is a `.bitmap` file that can be found inside
`$GIT_DIR/objects/pack/`, next to its corresponding packfile, and
contains precalculated reachability information for selected commits.
The full specification of the format for these bitmap indexes can be found
in `Documentation/technical/bitmap-format.txt`.
For a given commit SHA1, if it happens to be available in the bitmap
index, its bitmap will represent every single object that is reachable
from the commit itself. The nth bit in the bitmap is the nth object in
the packfile; if it's set to 1, the object is reachable.
By using the bitmaps available in the index, this commit implements
several new functions:
- `prepare_bitmap_git`
- `prepare_bitmap_walk`
- `traverse_bitmap_commit_list`
- `reuse_partial_packfile_from_bitmap`
The `prepare_bitmap_walk` function tries to build a bitmap of all the
objects that can be reached from the commit roots of a given `rev_info`
struct by using the following algorithm:
- If all the interesting commits for a revision walk are available in
the index, the resulting reachability bitmap is the bitwise OR of all
the individual bitmaps.
- When the full set of WANTs is not available in the index, we perform a
partial revision walk using the commits that don't have bitmaps as
roots, and limiting the revision walk as soon as we reach a commit that
has a corresponding bitmap. The earlier OR'ed bitmap with all the
indexed commits can now be completed as this walk progresses, so the end
result is the full reachability list.
- For revision walks with a HAVEs set (a set of commits that are deemed
uninteresting), first we perform the same method as for the WANTs, but
using our HAVEs as roots, in order to obtain a full reachability bitmap
of all the uninteresting commits. This bitmap then can be used to:
a) limit the subsequent walk when building the WANTs bitmap
b) finding the final set of interesting commits by performing an
AND-NOT of the WANTs and the HAVEs.
If `prepare_bitmap_walk` runs successfully, the resulting bitmap is
stored and the equivalent of a `traverse_commit_list` call can be
performed by using `traverse_bitmap_commit_list`; the bitmap version
of this call yields the objects straight from the packfile index
(without having to look them up or parse them) and hence is several
orders of magnitude faster.
As an extra optimization, when `prepare_bitmap_walk` succeeds, the
`reuse_partial_packfile_from_bitmap` call can be attempted: it will find
the amount of objects at the beginning of the on-disk packfile that can
be reused as-is, and return an offset into the packfile. The source
packfile can then be loaded and the bytes up to `offset` can be written
directly to the result without having to consider the entires inside the
packfile individually.
If the `prepare_bitmap_walk` call fails (e.g. because no bitmap files
are available), the `rev_info` struct is left untouched, and can be used
to perform a manual rev-walk using `traverse_commit_list`.
Hence, this new set of functions are a generic API that allows to
perform the equivalent of
git rev-list --objects [roots...] [^uninteresting...]
for any set of commits, even if they don't have specific bitmaps
generated for them.
In further patches, we'll use this bitmap traversal optimization to
speed up the `pack-objects` and `rev-list` commands.
Signed-off-by: Vicent Marti <tanoku@gmail.com>
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
|