tgif.git - Terin's Improved Git Fork

Age	Commit message (Collapse)	Author	Files	Lines
2020-10-05	Merge branch 'js/default-branch-name-part-2'	Junio C Hamano	10	-96/+96
	Update the tests to drop word 'master' from them. * js/default-branch-name-part-2: t9902: avoid using the branch name `master` tests: avoid variations of the `master` branch name t3200: avoid variations of the `master` branch name fast-export: avoid using unnecessary language in a code comment t/test-terminal: avoid non-inclusive language
2020-10-05	Merge branch 'ds/in-merge-bases-many-optim-bug'	Junio C Hamano	2	-0/+32
	in_merge_bases_many(), a way to see if a commit is reachable from any commit in a set of commits, was totally broken when the commit-graph feature was in use, which has been corrected. * ds/in-merge-bases-many-optim-bug: commit-reach: fix in_merge_bases_many bug
2020-10-04	Merge branch 'jk/shortlog-group-by-trailer'	Junio C Hamano	1	-0/+141
	"git shortlog" has been taught to group commits by the contents of the trailer lines, like "Reviewed-by:", "Coauthored-by:", etc. * jk/shortlog-group-by-trailer: shortlog: allow multiple groups to be specified shortlog: parse trailer idents shortlog: rename parse_stdin_ident() shortlog: de-duplicate trailer values shortlog: match commit trailers with --group trailer: add interface for iterating over commit trailers shortlog: add grouping option shortlog: change "author" variables to "ident"
2020-10-04	Merge branch 'cc/bisect-start-fix'	Junio C Hamano	1	-0/+7
	"git bisect start X Y", when X and Y are not valid committish object names, should take X and Y as pathspec, but didn't. * cc/bisect-start-fix: bisect: don't use invalid oid as rev when starting
2020-10-04	Merge branch 'jc/blame-ignore-fix'	Junio C Hamano	1	-22/+39
	"git blame --ignore-rev/--ignore-revs-file" failed to validate their input are valid revision, and failed to take into account that the user may want to give an annotated tag instead of a commit, which has been corrected. * jc/blame-ignore-fix: blame: validate and peel the object names on the ignore list t8013: minimum preparatory clean-up
2020-10-02	commit-reach: fix in_merge_bases_many bug	Derrick Stolee	2	-0/+32
	Way back in f9b8908b (commit.c: use generation numbers for in_merge_bases(), 2018-05-01), a heuristic was used to short-circuit the in_merge_bases() walk. This works just fine as long as the caller is checking only two commits, but when there are multiple, there is a possibility that this heuristic is _very wrong_. Some code moves since then has changed this method to repo_in_merge_bases_many() inside commit-reach.c. The heuristic computes the minimum generation number of the "reference" list, then compares this number to the generation number of the "commit". In a recent topic, a test was added that used in_merge_bases_many() to test if a commit was reachable from a number of commits pulled from a reflog. However, this highlighted the problem: if any of the reference commits have a smaller generation number than the given commit, then the walk is skipped _even if there exist some with higher generation number_. This heuristic is wrong! It must check the MAXIMUM generation number of the reference commits, not the MINIMUM. This highlights a testing gap. t6600-test-reach.sh covers many methods in commit-reach.c, including in_merge_bases() and get_merge_bases_many(), but since these methods either restrict to two input commits or actually look for the full list of merge bases, they don't check this heuristic! Add a possible input to "test-tool reach" that tests in_merge_bases_many() and add tests to t6600-test-reach.sh that cover this heuristic. This includes cases for the reference commits having generation above and below the generation of the input commit, but also having maximum generation below the generation of the input commit. The fix itself is to swap min_generation with a max_generation in repo_in_merge_bases_many(). Reported-by: Srinidhi Kaushik <shrinidhi.kaushik@gmail.com> Helped-by: Johannes Schindelin <johannes.schindelin@gmx.de> Signed-off-by: Derrick Stolee <dstolee@microsoft.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-09-29	Merge branch 'ah/pull'	Junio C Hamano	1	-1/+20
	Earlier we taught "git pull" to warn when the user does not say the histories need to be merged, rebased or accepts only fast- forwarding, but the warning triggered for those who have set the pull.ff configuration variable. * ah/pull: pull: don't warn if pull.ff has been set
2020-09-29	Merge branch 'tg/range-diff-same-file-fix'	Junio C Hamano	1	-8/+4
	"git range-diff" showed incorrect diffstat, which has been corrected. * tg/range-diff-same-file-fix: diff: fix modified lines stats with --stat and --numstat
2020-09-29	Merge branch 'jc/t1506-rev-parse-leaves-range-endpoint-unpeeled'	Junio C Hamano	1	-0/+18
	Test update. * jc/t1506-rev-parse-leaves-range-endpoint-unpeeled: t1506: rev-parse A..B and A...B
2020-09-29	Merge branch 'bc/clone-with-git-default-hash-fix'	Junio C Hamano	1	-0/+14
	"git clone" that clones from SHA-1 repository, while GIT_DEFAULT_HASH set to use SHA-256 already, resulted in an unusable repository that half-claims to be SHA-256 repository with SHA-1 objects and refs. This has been corrected. * bc/clone-with-git-default-hash-fix: builtin/clone: avoid failure with GIT_DEFAULT_HASH
2020-09-29	Merge branch 'tb/bloom-improvements'	Junio C Hamano	5	-24/+246
	"git commit-graph write" learned to limit the number of bloom filters that are computed from scratch with the --max-new-filters option. * tb/bloom-improvements: commit-graph: introduce 'commitGraph.maxNewFilters' builtin/commit-graph.c: introduce '--max-new-filters=<n>' commit-graph: rename 'split_commit_graph_opts' bloom: encode out-of-bounds filters as non-empty bloom/diff: properly short-circuit on max_changes bloom: use provided 'struct bloom_filter_settings' bloom: split 'get_bloom_filter()' in two commit-graph.c: store maximum changed paths commit-graph: respect 'commitGraph.readChangedPaths' t/helper/test-read-graph.c: prepare repo settings commit-graph: pass a 'struct repository *' in more places t4216: use an '&&'-chain commit-graph: introduce 'get_bloom_filter_settings()'
2020-09-27	shortlog: allow multiple groups to be specified	Jeff King	1	-0/+74
	Now that shortlog supports reading from trailers, it can be useful to combine counts from multiple trailers, or between trailers and authors. This can be done manually by post-processing the output from multiple runs, but it's non-trivial to make sure that each name/commit pair is counted only once. This patch teaches shortlog to accept multiple --group options on the command line, and pull data from all of them. That makes it possible to run: git shortlog -ns --group=author --group=trailer:co-authored-by to get a shortlog that counts authors and co-authors equally. The implementation is mostly straightforward. The "group" enum becomes a bitfield, and the trailer key becomes a list. I didn't bother implementing the multi-group semantics for reading from stdin. It would be possible to do, but the existing matching code makes it awkward, and I doubt anybody cares. The duplicate suppression we used for trailers now covers authors and committers as well (though in non-trailer single-group mode we can skip the hash insertion and lookup, since we only see one value per commit). There is one subtlety: we now care about the case when no group bit is set (in which case we default to showing the author). The caller in builtin/log.c needs to be adapted to ask explicitly for authors, rather than relying on shortlog_init(). It would be possible with some gymnastics to make this keep working as-is, but it's not worth it for a single caller. Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-09-27	shortlog: parse trailer idents	Jeff King	1	-0/+20
	Trailers don't necessarily contain name/email identity values, so shortlog has so far treated them as opaque strings. However, since many trailers do contain identities, it's useful to treat them as such when they can be parsed. That lets "-e" work as usual, as well as mailmap. When they can't be parsed, we'll continue with the old behavior of treating them as a single string (there's no new test for that here, since the existing tests cover a trailer like this). Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-09-27	shortlog: de-duplicate trailer values	Jeff King	1	-0/+28
	The current documentation is vague about what happens with --group=trailer:signed-off-by when we see a commit with: Signed-off-by: One Signed-off-by: Two Signed-off-by: One We clearly should credit both "One" and "Two", but should "One" get credited twice? The current code does so, but mostly because that was the easiest thing to do. It's probably more useful to count each commit at most once. This will become especially important when we allow values from multiple sources in a future patch. Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-09-27	shortlog: match commit trailers with --group	Jeff King	1	-0/+14
	If a project uses commit trailers, this patch lets you use shortlog to see who is performing each action. For example, running: git shortlog -ns --group=trailer:reviewed-by in git.git shows who has reviewed. You can even use a custom format to see things like who has helped whom: git shortlog --format="...helped %an (%ad)" \ --group=trailer:helped-by Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-09-27	shortlog: add grouping option	Jeff King	1	-0/+5
	In preparation for adding more grouping types, let's refactor the committer/author grouping code and add a user-facing option that binds them together. In particular: - the main option is now "--group", to make it clear that the various group types are mutually exclusive. The "--committer" option is an alias for "--group=committer". - we keep an enum rather than a binary flag, to prepare for more values - we prefer switch statements to ternary assignment, since other group types will need more custom code Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-09-26	t9902: avoid using the branch name `master`	Johannes Schindelin	1	-5/+5
	The completion tests used that name unnecessarily, and it is a non-inclusive term, so let's avoid using it here. Since three of the touched test cases make use of the fact that two of the branch names (`master` and `maint`) start with the same letter (or even with the same two letters), we choose to replace the use of `master` by a name that also has that property: `main`. Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-09-26	tests: avoid variations of the `master` branch name	Johannes Schindelin	7	-58/+58
	The term `master` has a loaded history that serves as a constant reminder of racial injustice. The Git project has no desire to perpetuate this and already started avoiding it. The test suite uses variations of this name for branches other than the default one. Apart from t3200, where we just addressed this in the previous commit, those instances can be renamed in an automated manner because they do not require any changes outside of the test script, so let's do that. Seeing as the touched branches have very little (if anything) to do with the default branch, we choose to use a completely separate naming scheme: `topic_<number>` (it cannot be `topic-<number>` because t5515 uses the `test_oid` machinery with the term, and that machinery uses shell variables internally, whose names cannot contain dashes). This trick was performed by this (GNU) sed invocation: $ sed -i 's/master$[a-z0-9]$/topic_\1/g' t/t*.sh Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-09-25	Merge branch 'hx/push-atomic-with-cert'	Junio C Hamano	1	-0/+23
	"git push" that wants to be atomic and wants to send push certificate learned not to prepare and sign the push certificate when it fails the local check (hence due to atomicity it is known that no certificate is needed). * hx/push-atomic-with-cert: send-pack: run GPG after atomic push checking
2020-09-25	Merge branch 'ld/p4-unshelve-fix'	Junio C Hamano	1	-1/+4
	The "unshelve" subcommand of "git p4" used incorrectly used commit^N where it meant to say commit~N to name the Nth generation ancestor, which has been corrected. * ld/p4-unshelve-fix: git-p4: use HEAD~$n to find parent commit for unshelve git-p4 unshelve: adding a commit breaks git-p4 unshelve
2020-09-25	Merge branch 'jx/proc-receive-hook'	Junio C Hamano	37	-1/+3396
	"git receive-pack" that accepts requests by "git push" learned to outsource most of the ref updates to the new "proc-receive" hook. * jx/proc-receive-hook: doc: add documentation for the proc-receive hook transport: parse report options for tracking refs t5411: test updates of remote-tracking branches receive-pack: new config receive.procReceiveRefs doc: add document for capability report-status-v2 New capability "report-status-v2" for git-push receive-pack: feed report options to post-receive receive-pack: add new proc-receive hook t5411: add basic test cases for proc-receive hook transport: not report a non-head push as a branch
2020-09-25	Merge branch 'ds/maintenance-part-1'	Junio C Hamano	4	-2/+100
	A "git gc"'s big brother has been introduced to take care of more repository maintenance tasks, not limited to the object database cleaning. * ds/maintenance-part-1: maintenance: add trace2 regions for task execution maintenance: add auto condition for commit-graph task maintenance: use pointers to check --auto maintenance: create maintenance.<task>.enabled config maintenance: take a lock on the objects directory maintenance: add --task option maintenance: add commit-graph task maintenance: initialize task array maintenance: replace run_auto_gc() maintenance: add --quiet option maintenance: create basic maintenance runner
2020-09-25	t1506: rev-parse A..B and A...B	Junio C Hamano	1	-0/+18
	Because these constructs can be used to parse user input to be passed to rev-list --objects, e.g. range=$(git rev-parse v1.0..v2.0) && git rev-list --objects $range \| git pack-objects --stdin the endpoints (v1.0 and v2.0 in the example) are shown without peeling them to underlying commits, even when they are annotated tags. Make sure it stays that way. While at it, ensure "rev-parse A...B" also keeps the endpoints A and B unpeeled, even though the negative side (i.e. the merge-base between A and B) has to become a commit. Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-09-25	bisect: don't use invalid oid as rev when starting	Christian Couder	1	-0/+7
	In 06f5608c14 (bisect--helper: `bisect_start` shell function partially in C, 2019-01-02), we changed the following shell code: - rev=$(git rev-parse -q --verify "$arg^{commit}") \|\| { - test $has_double_dash -eq 1 && - die "$(eval_gettext "'\$arg' does not appear to be a valid revision")" - break - } - revs="$revs $rev" into: + char *commit_id = xstrfmt("%s^{commit}", arg); + if (get_oid(commit_id, &oid) && has_double_dash) + die(_("'%s' does not appear to be a valid " + "revision"), arg); + + string_list_append(&revs, oid_to_hex(&oid)); + free(commit_id); In case of an invalid "arg" when "has_double_dash" is false, the old code would "break" out of the argument loop. In the new C code though, `oid_to_hex(&oid)` is unconditonally appended to "revs". This is wrong first because "oid" is junk as `get_oid(commit_id, &oid)` failed and second because it doesn't break out of the argument loop. Not breaking out of the argument loop means that "arg" is then not treated as a path restriction (which is wrong). Signed-off-by: Christian Couder <chriscool@tuxfamily.org> Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-09-24	pull: don't warn if pull.ff has been set	Alex Henrie	1	-1/+20
	A user who understands enough to set pull.ff does not need additional instructions. Signed-off-by: Alex Henrie <alexhenrie24@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-09-24	blame: validate and peel the object names on the ignore list	Junio C Hamano	1	-10/+27
	The command reads list of object names to place on the ignore list either from the command line or from a file, but they are not checked with their object type (those read from the file are not even checked for object existence). Extend the oidset_parse_file() API and allow it to take a callback that can be used to die (e.g. when an inappropriate input is read) or modify the object name read (e.g. when a tag pointing at a commit is read, and the caller wants a commit object name), and use it in the code that handles ignore list. Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-09-24	t8013: minimum preparatory clean-up	Junio C Hamano	1	-11/+11
	The closing sq for each test piece should be placed at the beginning of line. Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-09-24	diff: fix modified lines stats with --stat and --numstat	Thomas Guyot-Sionnest	1	-8/+4
	Only skip diffstats when both oids are valid and identical. This check was causing both false-positives (files included in diffstats with no actual changes (0 lines modified) and false-negatives (showing 0 lines modified in stats when files had actually changed). Also replaced same_contents with may_differ to avoid confusion. Signed-off-by: Thomas Guyot-Sionnest <tguyot@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-09-22	Merge branch 'jk/dont-count-existing-objects-twice'	Junio C Hamano	1	-0/+4
	There is a logic to estimate how many objects are in the repository, which is mean to run once per process invocation, but it ran every time the estimated value was requested. * jk/dont-count-existing-objects-twice: packfile: actually set approximate_object_count_valid
2020-09-22	Merge branch 'al/ref-filter-merged-and-no-merged'	Junio C Hamano	4	-19/+67
	"git for-each-ref" and friends that list refs used to allow only one --merged or --no-merged to filter them; they learned to take combination of both kind of filtering. * al/ref-filter-merged-and-no-merged: Doc: prefer more specific file name ref-filter: make internal reachable-filter API more precise ref-filter: allow merged and no-merged filters Doc: cover multiple contains/no-contains filters t3201: test multiple branch filter combinations
2020-09-22	Merge branch 'kk/build-portability-fix'	Junio C Hamano	1	-33/+33
	Portability tweak for some shell scripts used while building. * kk/build-portability-fix: Fit to Plan 9's ANSI/POSIX compatibility layer
2020-09-22	Merge branch 'pw/add-p-edit-ita-path'	Junio C Hamano	1	-0/+38
	"add -p" now allows editing paths that were only added in intent. * pw/add-p-edit-ita-path: add -p: fix editing of intent-to-add paths
2020-09-22	builtin/clone: avoid failure with GIT_DEFAULT_HASH	brian m. carlson	1	-0/+14
	If a user is cloning a SHA-1 repository with GIT_DEFAULT_HASH set to "sha256", then we can end up with a repository where the repository format version is 0 but the extensions.objectformat key is set to "sha256". This is both wrong (the user has a SHA-1 repository) and nonfunctional (because the extension cannot be used in a v0 repository). This happens because in a clone, we initially set up the repository, and then change its algorithm based on what the remote side tells us it's using. We've initially set up the repository as SHA-256 in this case, and then later on reset the repository version without clearing the extension. We could just always set the extension in this case, but that would mean that our SHA-1 repositories weren't compatible with older Git versions, even though there's no reason why they shouldn't be. And we also don't want to initialize the repository as SHA-1 initially, since that means if we're cloning an empty repository, we'll have failed to honor the GIT_DEFAULT_HASH variable and will end up with a SHA-1 repository, not a SHA-256 repository. Neither of those are appealing, so let's tell the repository initialization code if we're doing a reinit like this, and if so, to clear the extension if we're using SHA-1. This makes sure we produce a valid and functional repository and doesn't break any of our other use cases. Reported-by: Matheus Tavares <matheus.bernardino@usp.br> Signed-off-by: brian m. carlson <sandals@crustytoothpaste.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-09-21	t3200: avoid variations of the `master` branch name	Johannes Schindelin	1	-17/+17
	To avoid branch names with a loaded history, we already started to avoid using the name "master" in a couple instances. The `t3200-branch.sh` script uses variations of this name for branches other than the default one. So let's change those names, as "lowest-hanging fruits" in the effort to use more inclusive naming throughout Git's source code. While at it, make those branch names independent from the default branch name. In this particular instance, this rename requires a couple of non-trivial adjustments, as the aligned output depends on the maximum length of the displayed branches (which we now changed), and also on the alphabetical order (which we now changed, too). Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-09-21	t/test-terminal: avoid non-inclusive language	Johannes Schindelin	1	-16/+16
	In the ongoing effort to make the Git project a more inclusive place, let's try to avoid names like "master" where possible. In this instance, the use of the term `slave` is unfortunately enshrined in IO::Pty's API. We simply cannot avoid using that word here. But at least we can get rid of the usage of the word `master` and hope that IO::Pty will be eventually adjusted, too. Guessing that IO::Pty might follow Python's lead, we replace the name `master` by `parent` (hoping that IO::Pty will adopt the parent/child nomenclature, too). Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-09-19	send-pack: run GPG after atomic push checking	Han Xin	1	-0/+23
	The refs update commands can be sent to the server side in two different ways: GPG-signed or unsigned. We should run these two operations in the same "Finally, tell the other end!" code block, but they are seperated by the "Clear the status for each ref" code block. This will result in a slight performance loss, because the failed atomic push will still perform unnecessary preparations for shallow advertise and GPG-signed commands buffers, and user may have to be bothered by the (possible) GPG passphrase input when there is nothing to sign. Add a new test case to t5534 to ensure GPG will not be called when the GPG-signed atomic push fails. Signed-off-by: Han Xin <hanxin.hx@alibaba-inc.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-09-19	git-p4: use HEAD~$n to find parent commit for unshelve	Luke Diamand	1	-1/+1
	Found-by: Liu Xuhui (Jackson) <Xuhui.Liu@amd.com> Signed-off-by: Luke Diamand <luke@diamand.org> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-09-19	git-p4 unshelve: adding a commit breaks git-p4 unshelve	Luke Diamand	1	-2/+5
	git-p4 unshelve uses HEAD^$n to find the parent commit, which fails if there is an additional commit. Signed-off-by: Luke Diamand <luke@diamand.org> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-09-18	Merge branch 'mt/config-fail-nongit-early'	Junio C Hamano	1	-5/+8
	Unlike "git config --local", "git config --worktree" did not fail early and cleanly when started outside a git repository. * mt/config-fail-nongit-early: config: complain about --worktree outside of a git repo
2020-09-18	Merge branch 'jc/quote-path-cleanup'	Junio C Hamano	1	-0/+27
	"git status --short" quoted a path with SP in it when tracked, but not those that are untracked, ignored or unmerged. They are all shown quoted consistently. * jc/quote-path-cleanup: quote: turn 'nodq' parameter into a set of flags quote: rename misnamed sq_lookup[] to cq_lookup[] wt-status: consistently quote paths in "status --short" output quote_path: code clarification quote_path: optionally allow quoting a path with SP in it quote_path: give flags parameter to quote_path() quote_path: rename quote_path_relative() to quote_path()
2020-09-18	Merge branch 'jk/add-i-fixes'	Junio C Hamano	1	-0/+8
	"add -i/-p" fixes. * jk/add-i-fixes: add--interactive.perl: specify --no-color explicitly add-patch: fix inverted return code of repo_read_index()
2020-09-18	Merge branch 'al/t3200-back-on-a-branch'	Junio C Hamano	1	-0/+1
	Test fix. * al/t3200-back-on-a-branch: t3200: clean side effect of git checkout --orphan
2020-09-18	commit-graph: introduce 'commitGraph.maxNewFilters'	Taylor Blau	1	-1/+23
	Introduce a configuration variable to specify a default value for the recently-introduce '--max-new-filters' option of 'git commit-graph write'. Helped-by: Junio C Hamano <gitster@pobox.com> Signed-off-by: Taylor Blau <me@ttaylorr.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-09-18	builtin/commit-graph.c: introduce '--max-new-filters=<n>'	Taylor Blau	1	-0/+70
	Introduce a command-line flag to specify the maximum number of new Bloom filters that a 'git commit-graph write' is willing to compute from scratch. Prior to this patch, a commit-graph write with '--changed-paths' would compute Bloom filters for all selected commits which haven't already been computed (i.e., by a previous commit-graph write with '--split' such that a roll-up or replacement is performed). This behavior can cause prohibitively-long commit-graph writes for a variety of reasons: * There may be lots of filters whose diffs take a long time to generate (for example, they have close to the maximum number of changes, diffing itself takes a long time, etc). * Old-style commit-graphs (which encode filters with too many entries as not having been computed at all) cause us to waste time recomputing filters that appear to have not been computed only to discover that they are too-large. This can make the upper-bound of the time it takes for 'git commit-graph write --changed-paths' to be rather unpredictable. To make this command behave more predictably, introduce '--max-new-filters=<n>' to allow computing at most '<n>' Bloom filters from scratch. This lets "computing" already-known filters proceed quickly, while bounding the number of slow tasks that Git is willing to do. Helped-by: Junio C Hamano <gitster@pobox.com> Signed-off-by: Taylor Blau <me@ttaylorr.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-09-17	bloom: encode out-of-bounds filters as non-empty	Taylor Blau	2	-6/+33
	When a changed-path Bloom filter has either zero, or more than a certain number (commonly 512) of entries, the commit-graph machinery encodes it as "missing". More specifically, it sets the indices adjacent in the BIDX chunk as equal to each other to indicate a "length 0" filter; that is, that the filter occupies zero bytes on disk. This has heretofore been fine, since the commit-graph machinery has no need to care about these filters with too few or too many changed paths. Both cases act like no filter has been generated at all, and so there is no need to store them. In a subsequent commit, however, the commit-graph machinery will learn to only compute Bloom filters for some commits in the current commit-graph layer. This is a change from the current implementation which computes Bloom filters for all commits that are in the layer being written. Critically for this patch, only computing some of the Bloom filters means adding a third state for length 0 Bloom filters: zero entries, too many entries, or "hasn't been computed". It will be important for that future patch to distinguish between "not representable" (i.e., zero or too-many changed paths), and "hasn't been computed". In particular, we don't want to waste time recomputing filters that have already been computed. To that end, change how we store Bloom filters in the "computed but not representable" category: - Bloom filters with no entries are stored as a single byte with all bits low (i.e., all queries to that Bloom filter will return "definitely not") - Bloom filters with too many entries are stored as a single byte with all bits set high (i.e., all queries to that Bloom filter will return "maybe"). These rules are sufficient to not incur a behavior change by changing the on-disk representation of these two classes. Likewise, no specification changes are necessary for the commit-graph format, either: - Filters that were previously empty will be recomputed and stored according to the new rules, and - old clients reading filters generated by new clients will interpret the filters correctly and be none the wiser to how they were generated. Clients will invoke the Bloom machinery in more cases than before, but this can be addressed by returning a NULL filter when all bits are set high. This can be addressed in a future patch. Note that this does increase the size of on-disk commit-graphs, but far less than other proposals. In particular, this is generally more efficient than storing a bitmap for which commits haven't computed their Bloom filters. Storing a bitmap incurs a penalty of one bit per commit, whereas storing explicit filters as above incurs a penalty of one byte per too-large or empty commit. In practice, these boundary commits likely occupy a small proportion of the overall number of commits, and so the size penalty is likely smaller than storing a bitmap for all commits. See, for example, these relative proportions of such boundary commits (collected by SZEDER Gábor): \| Percentage of \| commit-graph \| \| \| commits modifying \| file size \| \| ├────────┬──────────────┼───────────────────┤ pct. \| \| 0 path \| >= 512 paths \| before \| after \| change \| ┌────────────────┼────────┼──────────────┼─────────┼─────────┼───────────┤ \| android-base \| 13.20% \| 0.13% \| 37.468M \| 37.534M \| +0.1741 % \| \| cmssw \| 0.15% \| 0.23% \| 17.118M \| 17.119M \| +0.0091 % \| \| cpython \| 3.07% \| 0.01% \| 7.967M \| 7.971M \| +0.0423 % \| \| elasticsearch \| 0.70% \| 1.00% \| 8.833M \| 8.835M \| +0.0128 % \| \| gcc \| 0.00% \| 0.08% \| 16.073M \| 16.074M \| +0.0030 % \| \| gecko-dev \| 0.14% \| 0.64% \| 59.868M \| 59.874M \| +0.0105 % \| \| git \| 0.11% \| 0.02% \| 3.895M \| 3.895M \| +0.0020 % \| \| glibc \| 0.02% \| 0.10% \| 3.555M \| 3.555M \| +0.0021 % \| \| go \| 0.00% \| 0.07% \| 3.186M \| 3.186M \| +0.0018 % \| \| homebrew-cask \| 0.40% \| 0.02% \| 7.035M \| 7.035M \| +0.0065 % \| \| homebrew-core \| 0.01% \| 0.01% \| 11.611M \| 11.611M \| +0.0002 % \| \| jdk \| 0.26% \| 5.64% \| 5.537M \| 5.540M \| +0.0590 % \| \| linux \| 0.01% \| 0.51% \| 63.735M \| 63.740M \| +0.0073 % \| \| llvm-project \| 0.12% \| 0.03% \| 25.515M \| 25.516M \| +0.0050 % \| \| rails \| 0.10% \| 0.10% \| 6.252M \| 6.252M \| +0.0027 % \| \| rust \| 0.07% \| 0.17% \| 9.364M \| 9.364M \| +0.0033 % \| \| tensorflow \| 0.09% \| 1.02% \| 7.009M \| 7.010M \| +0.0158 % \| \| webkit \| 0.05% \| 0.31% \| 17.405M \| 17.406M \| +0.0047 % \| (where the above increase is determined by computing a non-split commit-graph before and after this patch). Given that these projects are all "large" by commit count, the storage cost by writing these filters explicitly is negligible. In the most extreme example, android-base (which has 494,848 commits at the time of writing) would have its commit-graph increase by a modest 68.4 KB. Finally, a test to exercise filters which contain too many changed path entries will be introduced in a subsequent patch. Suggested-by: SZEDER Gábor <szeder.dev@gmail.com> Suggested-by: Jakub Narębski <jnareb@gmail.com> Helped-by: Derrick Stolee <dstolee@microsoft.com> Helped-by: SZEDER Gábor <szeder.dev@gmail.com> Helped-by: Junio C Hamano <gitster@pobox.com> Signed-off-by: Taylor Blau <me@ttaylorr.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-09-17	packfile: actually set approximate_object_count_valid	Jeff King	1	-0/+4
	The approximate_object_count() function tries to compute the count only once per process. But ever since it was introduced in 8e3f52d778 (find_unique_abbrev: move logic out of get_short_sha1(), 2016-10-03), we failed to actually set the "valid" flag, meaning we'd compute it fresh on every call. This turns out not to be _too_ bad, because we're only iterating through the packed_git list, and not making any system calls. But since it may get called for every abbreviated hash we output, even this can add up if you have many packs. Here are before-and-after timings for a new perf test which just asks rev-list to abbreviate each commit hash (the test repo is linux.git, with commit-graphs): Test origin HEAD ---------------------------------------------------------------------------- 5303.3: rev-list (1) 28.91(28.46+0.44) 29.03(28.65+0.38) +0.4% 5303.4: abbrev-commit (1) 1.18(1.06+0.11) 1.17(1.02+0.14) -0.8% 5303.7: rev-list (50) 28.95(28.56+0.38) 29.50(29.17+0.32) +1.9% 5303.8: abbrev-commit (50) 3.67(3.56+0.10) 3.57(3.42+0.15) -2.7% 5303.11: rev-list (1000) 30.34(29.89+0.43) 30.82(30.35+0.46) +1.6% 5303.12: abbrev-commit (1000) 86.82(86.52+0.29) 77.82(77.59+0.22) -10.4% 5303.15: load 10,000 packs 0.08(0.02+0.05) 0.08(0.02+0.06) +0.0% It doesn't help at all when we have 1 pack (5303.4), but we get a 10% speedup when there are 1000 packs (5303.12). That's a modest speedup for a case that's already slow and we'd hope to avoid in general (note how slow it is even after, because we have to look in each of those packs for abbreviations). But it's a one-line change that clearly matches the original intent, so it seems worth doing. The included perf test may also be useful for keeping an eye on any regressions in the overall abbreviation code. Reported-by: Rasmus Villemoes <rv@rasmusvillemoes.dk> Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-09-17	maintenance: use pointers to check --auto	Derrick Stolee	2	-2/+2
	The 'git maintenance run' command has an '--auto' option. This is used by other Git commands such as 'git commit' or 'git fetch' to check if maintenance should be run after adding data to the repository. Previously, this --auto option was only used to add the argument to the 'git gc' command as part of the 'gc' task. We will be expanding the other tasks to perform a check to see if they should do work as part of the --auto flag, when they are enabled by config. First, update the 'gc' task to perform the auto check inside the maintenance process. This prevents running an extra 'git gc --auto' command when not needed. It also shows a model for other tasks. Second, use the 'auto_condition' function pointer as a signal for whether we enable the maintenance task under '--auto'. For instance, we do not want to enable the 'fetch' task in '--auto' mode, so that function pointer will remain NULL. Now that we are not automatically calling 'git gc', a test in t5514-fetch-multiple.sh must be changed to watch for 'git maintenance' instead. We continue to pass the '--auto' option to the 'git gc' command when necessary, because of the gc.autoDetach config option changes behavior. Likely, we will want to absorb the daemonizing behavior implied by gc.autoDetach as a maintenance.autoDetach config option. Signed-off-by: Derrick Stolee <dstolee@microsoft.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-09-17	maintenance: create maintenance.<task>.enabled config	Derrick Stolee	1	-0/+8
	Currently, a normal run of "git maintenance run" will only run the 'gc' task, as it is the only one enabled. This is mostly for backwards- compatible reasons since "git maintenance run --auto" commands replaced previous "git gc --auto" commands after some Git processes. Users could manually run specific maintenance tasks by calling "git maintenance run --task=<task>" directly. Allow users to customize which steps are run automatically using config. The 'maintenance.<task>.enabled' option then can turn on these other tasks (or turn off the 'gc' task). Signed-off-by: Derrick Stolee <dstolee@microsoft.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-09-17	maintenance: add --task option	Derrick Stolee	1	-0/+27
	A user may want to only run certain maintenance tasks in a certain order. Add the --task=<task> option, which allows a user to specify an ordered list of tasks to run. These cannot be run multiple times, however. Here is where our array of maintenance_task pointers becomes critical. We can sort the array of pointers based on the task order, but we do not want to move the struct data itself in order to preserve the hashmap references. We use the hashmap to match the --task=<task> arguments into the task struct data. Keep in mind that the 'enabled' member of the maintenance_task struct is a placeholder for a future 'maintenance.<task>.enabled' config option. Thus, we use the 'enabled' member to specify which tasks are run when the user does not specify any --task=<task> arguments. The 'enabled' member should be ignored if --task=<task> appears. Signed-off-by: Derrick Stolee <dstolee@microsoft.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-09-17	maintenance: add commit-graph task	Derrick Stolee	1	-0/+2
	The first new task in the 'git maintenance' builtin is the 'commit-graph' task. This updates the commit-graph file incrementally with the command git commit-graph write --reachable --split By writing an incremental commit-graph file using the "--split" option we minimize the disruption from this operation. The default behavior is to merge layers until the new "top" layer is less than half the size of the layer below. This provides quick writes most of the time, with the longer writes following a power law distribution. Most importantly, concurrent Git processes only look at the commit-graph-chain file for a very short amount of time, so they will verly likely not be holding a handle to the file when we try to replace it. (This only matters on Windows.) If a concurrent process reads the old commit-graph-chain file, but our job expires some of the .graph files before they can be read, then those processes will see a warning message (but not fail). This could be avoided by a future update to use the --expire-time argument when writing the commit-graph. Signed-off-by: Derrick Stolee <dstolee@microsoft.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>