t/perf: add tests for many-pack scenarios

Git's pack storage does efficient O(log n) lookups in a single packfile's index, but if we have multiple packfiles, we have to linearly search each for a given object. This patch introduces some timing tests for cases where we have a large number of packs, so that we can measure any improvements we make in the following patches.

The main thing we want to time is object lookup. To do this, we measure "git rev-list --objects --all", which does a fairly large number of object lookups (essentially one per object in the repository).

However, we also measure the time to do a full repack, which is interesting for two reasons. One is that in addition to the usual pack lookup, it has its own linear iteration over the list of packs. And two is that because it is the tool one uses to go from an inefficient many-pack situation back to a single pack, we care about its performance not only at marginal numbers of packs, but at the extreme cases (e.g., if you somehow end up with 5,000 packs, it is the only way to get back to 1 pack, so we need to make sure it performs well).

We measure the performance of each command in three scenarios: 1 pack, 50 packs, and 1,000 packs.

The 1-pack case is a baseline; any optimizations we do to handle multiple packs cannot possibly perform better than this.

The 50-pack case is as far as Git should generally allow your repository to go, if you have auto-gc enabled with the default settings. So this represents the maximum performance improvement we would expect under normal circumstances.

The 1,000-pack case is hopefully rare, though I have seen it in the wild where automatic maintenance was broken for some time (and the repository continued to receive pushes). This represents cases where we care less about general performance, but want to make sure that a full repack command does not take excessively long.

Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
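
To see which of the three scenarios an existing repository is closest to, one can count its packs and compare against the auto-gc threshold. The following is a minimal sketch and not part of the patch: count_packs and over_pack_limit are hypothetical helper names, and the sketch assumes that the "default settings" mentioned above refer to gc.autoPackLimit, which defaults to 50.

```shell
#!/bin/sh

# Hypothetical helper (not part of this patch): print the number of
# packfiles in the repository at "$1" (default: current directory),
# as reported by "git count-objects -v".
count_packs () {
	git -C "${1:-.}" count-objects -v | sed -n 's/^packs: //p'
}

# Hypothetical helper: succeed if the repository holds more packs than
# gc.autoPackLimit (assumed to be 50 when unset). A limit of 0 disables
# the pack check in git, so we never report "over" in that case.
# Simplification: this counts every pack, including ones marked with
# a .keep file, which auto-gc itself would not count.
over_pack_limit () {
	limit=$(git -C "${1:-.}" config --get gc.autoPackLimit || echo 50)
	test "$limit" -gt 0 &&
	test "$(count_packs "$1")" -gt "$limit"
}
```

For example, "over_pack_limit /path/to/repo && echo 'many-pack territory'" would flag a repository that has drifted past the 50-pack case measured here.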
author: Jeff King <peff@peff.net> 2016-07-29 00:06:09 -0400
committer: Junio C Hamano <gitster@pobox.com> 2016-07-29 11:05:06 -0700
commit: 77023ea3c3951be97286bc241ae88bc6c860e2b7 (patch)
tree: 34f3e3ffbfb648afb6e2a59aed6985147f2235c3 /t/perf
parent: Sixth batch of topics for 2.10 (diff)
download: tgif-77023ea3c3951be97286bc241ae88bc6c860e2b7.tar.xz
1 file changed, 87 insertions, 0 deletions
diff --git a/t/perf/p5303-many-packs.sh b/t/perf/p5303-many-packs.sh
new file mode 100755
index 0000000000..3779851941
--- /dev/null
+++ b/t/perf/p5303-many-packs.sh
@@ -0,0 +1,87 @@
+#!/bin/sh
+
+test_description='performance with large numbers of packs'
+. ./perf-lib.sh
+
+test_perf_large_repo
+
+# A real many-pack situation would probably come from having a lot of pushes
+# over time. We don't know how big each push would be, but we can fake it by
+# just walking the first-parent chain and having every 5 commits be their own
+# "push". This isn't _entirely_ accurate, as real pushes would have some
+# duplicate objects due to thin-pack fixing, but it's a reasonable
+# approximation.
+#
+# And then all of the rest of the objects can go in a single packfile that
+# represents the state before any of those pushes (actually, we'll generate
+# that first because in such a setup it would be the oldest pack, and we sort
+# the packs by reverse mtime inside git).
+repack_into_n () {
+	rm -rf staging &&
+	mkdir staging &&
+
+	git rev-list --first-parent HEAD |
+	sed -n '1~5p' |
+	head -n "$1" |
+	perl -e 'print reverse <>' \
+	>pushes &&
+
+	# create base packfile
+	head -n 1 pushes |
+	git pack-objects --delta-base-offset --revs staging/pack &&
+
+	# and then incrementals between each pair of commits
+	last= &&
+	while read rev
+	do
+		if test -n "$last"; then
+			{
+				echo "$rev" &&
+				echo "^$last"
+			} |
+			git pack-objects --delta-base-offset --revs \
+				staging/pack || return 1
+		fi
+		last=$rev
+	done <pushes &&
+
+	# and install the whole thing
+	rm -f .git/objects/pack/* &&
+	mv staging/* .git/objects/pack/
+}
+
+# Pretend we just have a single branch and no reflogs, and that everything is
+# in objects/pack; that makes our fake pack-building via repack_into_n()
+# much simpler.
+test_expect_success 'simplify reachability' '
+	tip=$(git rev-parse --verify HEAD) &&
+	git for-each-ref --format="option no-deref%0adelete %(refname)" |
+	git update-ref --stdin &&
+	rm -rf .git/logs &&
+	git update-ref refs/heads/master $tip &&
+	git symbolic-ref HEAD refs/heads/master &&
+	git repack -ad
+'
+
+for nr_packs in 1 50 1000
+do
+	test_expect_success "create $nr_packs-pack scenario" '
+		repack_into_n $nr_packs
+	'
+
+	test_perf "rev-list ($nr_packs)" '
+		git rev-list --objects --all >/dev/null
+	'
+
+	# This simulates the interesting part of the repack, which is the
+	# actual pack generation, without smudging the on-disk setup
+	# between trials.
+	test_perf "repack ($nr_packs)" '
+		git pack-objects --keep-true-parents \
+		  --honor-pack-keep --non-empty --all \
+		  --reflog --indexed-objects --delta-base-offset \
+		  --stdout </dev/null >/dev/null
+	'
+done
+
+test_done
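
A portability note on the "push" selection in repack_into_n(): the first~step address used in sed -n '1~5p' is a GNU sed extension, so the pipeline may not behave the same with other sed implementations. Below is a minimal sketch of an equivalent filter written with awk. The helper name every_fifth is hypothetical and not part of the patch; it folds the "head -n" limit into the same command.

```shell
#!/bin/sh

# Hypothetical helper (not part of this patch): print lines 1, 6, 11, ...
# of stdin, stopping after "$1" lines have been printed. This mirrors
# "sed -n '1~5p' | head -n N" without relying on the GNU-only
# first~step address.
every_fifth () {
	awk -v max="$1" 'NR % 5 == 1 { print; if (++n >= max) exit }'
}
```

Under that assumption, the first stage of the pipeline could read "git rev-list --first-parent HEAD | every_fifth "$1"", followed by the existing reversal step.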