summaryrefslogtreecommitdiff
path: root/commit-graph.h
AgeCommit message (Collapse)AuthorFilesLines
2021-08-09revision: avoid hitting packfiles when commits are in commit-graphLibravatar Patrick Steinhardt1-0/+8
When queueing references in git-rev-list(1), we try to optimize parsing of commits via the commit-graph. To do so, we first look up the object's type, and if it is a commit we call `repo_parse_commit()` instead of `parse_object()`. This is quite inefficient though given that we're always uncompressing the object header in order to determine the type. Instead, we can opportunistically search the commit-graph for the object ID: in case it's found, we know it's a commit and can directly fill in the commit object without having to uncompress the object header. Expose a new function `lookup_commit_in_graph()`, which tries to find a commit in the commit-graph by ID, and convert `get_reference()` to use this function. This provides a big performance win in cases where we load references in a repository with lots of references pointing to commits. The following has been executed in a real-world repository with about 2.2 million refs: Benchmark #1: HEAD~: rev-list --unsorted-input --objects --quiet --not --all --not $newrev Time (mean ± σ): 4.458 s ± 0.044 s [User: 4.115 s, System: 0.342 s] Range (min … max): 4.409 s … 4.534 s 10 runs Benchmark #2: HEAD: rev-list --unsorted-input --objects --quiet --not --all --not $newrev Time (mean ± σ): 3.089 s ± 0.015 s [User: 2.768 s, System: 0.321 s] Range (min … max): 3.061 s … 3.105 s 10 runs Summary 'HEAD: rev-list --unsorted-input --objects --quiet --not --all --not $newrev' ran 1.44 ± 0.02 times faster than 'HEAD~: rev-list --unsorted-input --objects --quiet --not --all --not $newrev' Signed-off-by: Patrick Steinhardt <ps@pks.im> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-02-25commit-graph: use config to specify generation typeLibravatar Derrick Stolee1-1/+0
We have two established generation number versions: 1: topological levels 2: corrected commit dates The corrected commit dates are enabled by default, but they also write extra data in the GDAT and GDOV chunks. Services that host Git data might want to have more control over when this feature rolls out than just updating the Git binaries. Add a new "commitGraph.generationVersion" config option that specifies the intended generation number version. If this value is less than 2, then the GDAT chunk is never written _or read_ from an existing file. This can replace our use of the GIT_TEST_COMMIT_GRAPH_NO_GDAT environment variable in the test suite. Remove it. Signed-off-by: Derrick Stolee <dstolee@microsoft.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-01-18commit-reach: use corrected commit dates in paint_down_to_common()Libravatar Abhishek Kumar1-0/+6
091f4cf (commit: don't use generation numbers if not needed, 2018-08-30) changed paint_down_to_common() to use commit dates instead of generation numbers v1 (topological levels) as the performance regressed on certain topologies. With generation number v2 (corrected commit dates) implemented, we no longer have to rely on commit dates and can use generation numbers. For example, the command `git merge-base v4.8 v4.9` on the Linux repository walks 167468 commits, taking 0.135s for committer date and 167496 commits, taking 0.157s for corrected committer date respectively. While using corrected commit dates, Git walks nearly the same number of commits as commit date, the process is slower as for each comparision we have to access a commit-slab (for corrected committer date) instead of accessing struct member (for committer date). This change incidentally broke the fragile t6404-recursive-merge test. t6404-recursive-merge sets up a unique repository where all commits have the same committer date without a well-defined merge-base. While running tests with GIT_TEST_COMMIT_GRAPH unset, we use committer date as a heuristic in paint_down_to_common(). 6404.1 'combined merge conflicts' merges commits in the order: - Merge C with B to form an intermediate commit. - Merge the intermediate commit with A. With GIT_TEST_COMMIT_GRAPH=1, we write a commit-graph and subsequently use the corrected committer date, which changes the order in which commits are merged: - Merge A with B to form an intermediate commit. - Merge the intermediate commit with C. While resulting repositories are equivalent, 6404.4 'virtual trees were processed' fails with GIT_TEST_COMMIT_GRAPH=1 as we are selecting different merge-bases and thus have different object ids for the intermediate commits. As this has already causes problems (as noted in 859fdc0 (commit-graph: define GIT_TEST_COMMIT_GRAPH, 2018-08-29)), we disable commit graph within t6404-recursive-merge. Signed-off-by: Abhishek Kumar <abhishekkumar8222@gmail.com> Reviewed-by: Taylor Blau <me@ttaylorr.com> Reviewed-by: Derrick Stolee <dstolee@microsoft.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-01-18commit-graph: use generation v2 only if entire chain doesLibravatar Abhishek Kumar1-0/+1
Since there are released versions of Git that understand generation numbers in the commit-graph's CDAT chunk but do not understand the GDAT chunk, the following scenario is possible: 1. "New" Git writes a commit-graph with the GDAT chunk. 2. "Old" Git writes a split commit-graph on top without a GDAT chunk. If each layer of split commit-graph is treated independently, as it was the case before this commit, with Git inspecting only the current layer for chunk_generation_data pointer, commits in the lower layer (one with GDAT) whould have corrected commit date as their generation number, while commits in the upper layer would have topological levels as their generation. Corrected commit dates usually have much larger values than topological levels. This means that if we take two commits, one from the upper layer, and one reachable from it in the lower layer, then the expectation that the generation of a parent is smaller than the generation of a child would be violated. It is difficult to expose this issue in a test. Since we _start_ with artificially low generation numbers, any commit walk that prioritizes generation numbers will walk all of the commits with high generation number before walking the commits with low generation number. In all the cases I tried, the commit-graph layers themselves "protect" any incorrect behavior since none of the commits in the lower layer can reach the commits in the upper layer. This issue would manifest itself as a performance problem in this case, especially with something like "git log --graph" since the low generation numbers would cause the in-degree queue to walk all of the commits in the lower layer before allowing the topo-order queue to write anything to output (depending on the size of the upper layer). Therefore, When writing the new layer in split commit-graph, we write a GDAT chunk only if the topmost layer has a GDAT chunk. This guarantees that if a layer has GDAT chunk, all lower layers must have a GDAT chunk as well. Rewriting layers follows similar approach: if the topmost layer below the set of layers being rewritten (in the split commit-graph chain) exists, and it does not contain GDAT chunk, then the result of rewrite does not have GDAT chunks either. Signed-off-by: Derrick Stolee <dstolee@microsoft.com> Signed-off-by: Abhishek Kumar <abhishekkumar8222@gmail.com> Reviewed-by: Taylor Blau <me@ttaylorr.com> Reviewed-by: Derrick Stolee <dstolee@microsoft.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-01-18commit-graph: implement generation data chunkLibravatar Abhishek Kumar1-0/+3
As discovered by Ævar, we cannot increment graph version to distinguish between generation numbers v1 and v2 [1]. Thus, one of pre-requistes before implementing generation number v2 was to distinguish between graph versions in a backwards compatible manner. We are going to introduce a new chunk called Generation DATa chunk (or GDAT). GDAT will store corrected committer date offsets whereas CDAT will still store topological level. Old Git does not understand GDAT chunk and would ignore it, reading topological levels from CDAT. New Git can parse GDAT and take advantage of newer generation numbers, falling back to topological levels when GDAT chunk is missing (as it would happen with a commit-graph written by old Git). We introduce a test environment variable 'GIT_TEST_COMMIT_GRAPH_NO_GDAT' which forces commit-graph file to be written without generation data chunk to emulate a commit-graph file written by old Git. To minimize the space required to store corrrected commit date, Git stores corrected commit date offsets into the commit-graph file, instea of corrected commit dates. This saves us 4 bytes per commit, decreasing the GDAT chunk size by half, but it's possible for the offset to overflow the 4-bytes allocated for storage. As such overflows are and should be exceedingly rare, we use the following overflow management scheme: We introduce a new commit-graph chunk, Generation Data OVerflow ('GDOV') to store corrected commit dates for commits with offsets greater than GENERATION_NUMBER_V2_OFFSET_MAX. If the offset is greater than GENERATION_NUMBER_V2_OFFSET_MAX, we set the MSB of the offset and the other bits store the position of corrected commit date in GDOV chunk, similar to how Extra Edge List is maintained. We test the overflow-related code with the following repo history: F - N - U / \ U - N - U N \ / N - F - N Where the commits denoted by U have committer date of zero seconds since Unix epoch, the commits denoted by N have committer date of 1112354055 (default committer date for the test suite) seconds since Unix epoch and the commits denoted by F have committer date of (2 ^ 31 - 2) seconds since Unix epoch. The largest offset observed is 2 ^ 31, just large enough to overflow. [1]: https://lore.kernel.org/git/87a7gdspo4.fsf@evledraar.gmail.com/ Signed-off-by: Abhishek Kumar <abhishekkumar8222@gmail.com> Reviewed-by: Taylor Blau <me@ttaylorr.com> Reviewed-by: Derrick Stolee <dstolee@microsoft.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-01-18commit-graph: return 64-bit generation numberLibravatar Abhishek Kumar1-2/+2
In a preparatory step for introducing corrected commit dates, let's return timestamp_t values from commit_graph_generation(), use timestamp_t for local variables and define GENERATION_NUMBER_INFINITY as (2 ^ 63 - 1) instead. We rename GENERATION_NUMBER_MAX to GENERATION_NUMBER_V1_MAX to represent the largest topological level we can store in the commit data chunk. With corrected commit dates implemented, we will have two such *_MAX variables to denote the largest offset and largest topological level that can be stored. Signed-off-by: Abhishek Kumar <abhishekkumar8222@gmail.com> Reviewed-by: Taylor Blau <me@ttaylorr.com> Reviewed-by: Derrick Stolee <dstolee@microsoft.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-01-18commit-graph: add a slab to store topological levelsLibravatar Abhishek Kumar1-0/+1
In a later commit we will introduce corrected commit date as the generation number v2. Corrected commit dates will be stored in the new seperate Generation Data chunk. However, to ensure backwards compatibility with "Old" Git we need to continue to write generation number v1 (topological levels) to the commit data chunk. Thus, we need to compute and store both versions of generation numbers to write the commit-graph file. Therefore, let's introduce a commit-slab `topo_level_slab` to store topological levels; corrected commit date will be stored in the member `generation` of struct commit_graph_data. The macros `GENERATION_NUMBER_INFINITY` and `GENERATION_NUMBER_ZERO` mark commits not in the commit-graph file and commits written by a version of Git that did not compute generation numbers respectively. Generation numbers are computed identically for both kinds of commits. A "slab-miss" should return `GENERATION_NUMBER_INFINITY` as the commit is not in the commit-graph file. However, since the slab is zero-initialized, it returns 0 (or rather `GENERATION_NUMBER_ZERO`). Thus, we no longer need to check if the topological level of a commit is `GENERATION_NUMBER_INFINITY`. We will add a pointer to the slab in `struct write_commit_graph_context` and `struct commit_graph` to populate the slab in `fill_commit_graph_info` if the commit has a pre-computed topological level as in case of split commit-graphs. Signed-off-by: Abhishek Kumar <abhishekkumar8222@gmail.com> Reviewed-by: Taylor Blau <me@ttaylorr.com> Reviewed-by: Derrick Stolee <dstolee@microsoft.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-09-29Merge branch 'tb/bloom-improvements'Libravatar Junio C Hamano1-6/+11
"git commit-graph write" learned to limit the number of bloom filters that are computed from scratch with the --max-new-filters option. * tb/bloom-improvements: commit-graph: introduce 'commitGraph.maxNewFilters' builtin/commit-graph.c: introduce '--max-new-filters=<n>' commit-graph: rename 'split_commit_graph_opts' bloom: encode out-of-bounds filters as non-empty bloom/diff: properly short-circuit on max_changes bloom: use provided 'struct bloom_filter_settings' bloom: split 'get_bloom_filter()' in two commit-graph.c: store maximum changed paths commit-graph: respect 'commitGraph.readChangedPaths' t/helper/test-read-graph.c: prepare repo settings commit-graph: pass a 'struct repository *' in more places t4216: use an '&&'-chain commit-graph: introduce 'get_bloom_filter_settings()'
2020-09-18builtin/commit-graph.c: introduce '--max-new-filters=<n>'Libravatar Taylor Blau1-0/+1
Introduce a command-line flag to specify the maximum number of new Bloom filters that a 'git commit-graph write' is willing to compute from scratch. Prior to this patch, a commit-graph write with '--changed-paths' would compute Bloom filters for all selected commits which haven't already been computed (i.e., by a previous commit-graph write with '--split' such that a roll-up or replacement is performed). This behavior can cause prohibitively-long commit-graph writes for a variety of reasons: * There may be lots of filters whose diffs take a long time to generate (for example, they have close to the maximum number of changes, diffing itself takes a long time, etc). * Old-style commit-graphs (which encode filters with too many entries as not having been computed at all) cause us to waste time recomputing filters that appear to have not been computed only to discover that they are too-large. This can make the upper-bound of the time it takes for 'git commit-graph write --changed-paths' to be rather unpredictable. To make this command behave more predictably, introduce '--max-new-filters=<n>' to allow computing at most '<n>' Bloom filters from scratch. This lets "computing" already-known filters proceed quickly, while bounding the number of slow tasks that Git is willing to do. Helped-by: Junio C Hamano <gitster@pobox.com> Signed-off-by: Taylor Blau <me@ttaylorr.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-09-17commit-graph: rename 'split_commit_graph_opts'Libravatar Taylor Blau1-4/+4
In the subsequent commit, additional options will be added to the commit-graph API which have nothing to do with splitting. Rename the 'split_commit_graph_opts' structure to the more-generic 'commit_graph_opts' to encompass both. Likewise, rename the 'flags' member to instead be 'split_flags' to clarify that it only has to do with the behavior implied by '--split'. Suggested-by: Derrick Stolee <dstolee@microsoft.com> Signed-off-by: Taylor Blau <me@ttaylorr.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-09-17maintenance: add commit-graph taskLibravatar Derrick Stolee1-0/+1
The first new task in the 'git maintenance' builtin is the 'commit-graph' task. This updates the commit-graph file incrementally with the command git commit-graph write --reachable --split By writing an incremental commit-graph file using the "--split" option we minimize the disruption from this operation. The default behavior is to merge layers until the new "top" layer is less than half the size of the layer below. This provides quick writes most of the time, with the longer writes following a power law distribution. Most importantly, concurrent Git processes only look at the commit-graph-chain file for a very short amount of time, so they will verly likely not be holding a handle to the file when we try to replace it. (This only matters on Windows.) If a concurrent process reads the old commit-graph-chain file, but our job expires some of the .graph files before they can be read, then those processes will see a warning message (but not fail). This could be avoided by a future update to use the --expire-time argument when writing the commit-graph. Signed-off-by: Derrick Stolee <dstolee@microsoft.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-09-09commit-graph: pass a 'struct repository *' in more placesLibravatar Taylor Blau1-2/+4
In a future commit, some commit-graph internals will want access to 'r->settings', but we only have the 'struct object_directory *' corresponding to that repository. Add an additional parameter to pass the repository around in more places. Signed-off-by: Taylor Blau <me@ttaylorr.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-09-09commit-graph: introduce 'get_bloom_filter_settings()'Libravatar Taylor Blau1-0/+2
Many places in the code often need a pointer to the commit-graph's 'struct bloom_filter_settings', in which case they often take the value from the top-most commit-graph. In the non-split case, this works as expected. In the split case, however, things get a little tricky. Not all layers in a chain of incremental commit-graphs are required to themselves have Bloom data, and so whether or not some part of the code uses Bloom filters depends entirely on whether or not the top-most level of the commit-graph chain has Bloom filters. This has been the behavior since Bloom filters were introduced, and has been codified into the tests since a759bfa9ee (t4216: add end to end tests for git log with Bloom filters, 2020-04-06). In fact, t4216.130 requires that Bloom filters are not used in exactly the case described earlier. There is no reason that this needs to be the case, since it is perfectly valid for commits in an earlier layer to have Bloom filters when commits in a newer layer do not. Since Bloom settings are guaranteed in practice to be the same for any layer in a chain that has Bloom data, it is sufficient to traverse the '->base_graph' pointer until either (1) a non-null 'struct bloom_filter_settings *' is found, or (2) until we are at the root of the commit-graph chain. Introduce a 'get_bloom_filter_settings()' function that does just this, and use it instead of purely dereferencing the top-most graph's '->bloom_filter_settings' pointer. While we're at it, add an additional test in t5324 to guard against code in the commit-graph writing machinery that doesn't correctly handle a NULL 'struct bloom_filter *'. Co-authored-by: Derrick Stolee <dstolee@microsoft.com> Signed-off-by: Taylor Blau <me@ttaylorr.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-07-30Merge branch 'ds/commit-graph-bloom-updates' into masterLibravatar Junio C Hamano1-1/+2
Updates to the changed-paths bloom filter. * ds/commit-graph-bloom-updates: commit-graph: check all leading directories in changed path Bloom filters revision: empty pathspecs should not use Bloom filters revision.c: fix whitespace commit-graph: check chunk sizes after writing commit-graph: simplify chunk writes into loop commit-graph: unify the signatures of all write_graph_chunk_*() functions commit-graph: persist existence of changed-paths bloom: fix logic in get_bloom_filter() commit-graph: change test to die on parse, not load commit-graph: place bloom_settings in context
2020-07-30Merge branch 'sg/commit-graph-cleanups' into masterLibravatar Junio C Hamano1-3/+3
The changed-path Bloom filter is improved using ideas from an independent implementation. * sg/commit-graph-cleanups: commit-graph: simplify write_commit_graph_file() #2 commit-graph: simplify write_commit_graph_file() #1 commit-graph: simplify parse_commit_graph() #2 commit-graph: simplify parse_commit_graph() #1 commit-graph: clean up #includes diff.h: drop diff_tree_oid() & friends' return value commit-slab: add a function to deep free entries on the slab commit-graph-format.txt: all multi-byte numbers are in network byte order commit-graph: fix parsing the Chunk Lookup table tree-walk.c: don't match submodule entries for 'submod/anything'
2020-07-01commit-graph: persist existence of changed-pathsLibravatar Derrick Stolee1-0/+1
The changed-path Bloom filters were released in v2.27.0, but have a significant drawback. A user can opt-in to writing the changed-path filters using the "--changed-paths" option to "git commit-graph write" but the next write will drop the filters unless that option is specified. This becomes even more important when considering the interaction with gc.writeCommitGraph (on by default) or fetch.writeCommitGraph (part of features.experimental). These config options trigger commit-graph writes that the user did not signal, and hence there is no --changed-paths option available. Allow a user that opts-in to the changed-path filters to persist the property of "my commit-graph has changed-path filters" automatically. A user can drop filters using the --no-changed-paths option. In the process, we need to be extremely careful to match the Bloom filter settings as specified by the commit-graph. This will allow future versions of Git to customize these settings, and the version with this change will persist those settings as commit-graphs are rewritten on top. Use the trace2 API to signal the settings used during the write, and check that output in a test after manually adjusting the correct bytes in the commit-graph file. Signed-off-by: Derrick Stolee <dstolee@microsoft.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-06-23commit-graph: change test to die on parse, not loadLibravatar Derrick Stolee1-1/+1
43d3561 (commit-graph write: don't die if the existing graph is corrupt, 2019-03-25) introduced the GIT_TEST_COMMIT_GRAPH_DIE_ON_LOAD environment variable. This was created to verify that commit-graph was not loaded when writing a new non-incremental commit-graph. An upcoming change wants to load a commit-graph in some valuable cases, but we want to maintain that we don't trust the commit-graph data when writing our new file. Instead of dying on load, instead die if we ever try to parse a commit from the commit-graph. This functionally verifies the same intended behavior, but allows a more advanced feature in the next change. Signed-off-by: Derrick Stolee <dstolee@microsoft.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-06-17commit-graph: introduce commit_graph_data_slabLibravatar Abhishek Kumar1-0/+10
The struct commit is used in many contexts. However, members `generation` and `graph_pos` are only used for commit-graph related operations and otherwise waste memory. This wastage would have been more pronounced as we transition to generation number v2, which uses 64-bit generation number instead of current 32-bits. As they are often accessed together, let's introduce struct commit_graph_data and move them to a commit_graph_data slab. While the overall test suite runs just as fast as master, (series: 26m48s, master: 27m34s, faster by 2.87%), certain commands like `git merge-base --is-ancestor` were slowed by 40% as discovered by Szeder Gábor [1]. After minimizing commit-slab access, the slow down persists but is closer to 20%. Derrick Stolee believes the slow down is attributable to the underlying algorithm rather than the slowness of commit-slab access [2] and we will follow-up in a later series. [1]: https://lore.kernel.org/git/20200607195347.GA8232@szeder.dev/ [2]: https://lore.kernel.org/git/13db757a-9412-7f1e-805c-8a028c4ab2b1@gmail.com/ Signed-off-by: Abhishek Kumar <abhishekkumar8222@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-06-08commit-graph: clean up #includesLibravatar SZEDER Gábor1-3/+3
Our CodingGuidelines says that it's sufficient to include one of 'git-compat-util.h' and 'cache.h', but both 'commit-graph.c' and 'commit-graph.h' include both. Let's include only 'git-compat-util.h' to loose a bunch of unnecessary dependencies; but include 'hash.h', because 'commit-graph.h' does require the definition of 'struct object_id'. 'commit-graph.h' explicitly includes 'repository.h' and 'string-list.h', but only needs the declaration of a few structs from them. Drop these includes and forward-declare the necessary structs instead. 'commit-graph.c' includes 'dir.h', but doesn't actually use anything from there, so let's drop that #include as well. Signed-off-by: SZEDER Gábor <szeder.dev@gmail.com> Signed-off-by: Derrick Stolee <dstolee@microsoft.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-05-18commit-graph: drop COMMIT_GRAPH_WRITE_CHECK_OIDS flagLibravatar Taylor Blau1-3/+1
Since 7c5c9b9c57 (commit-graph: error out on invalid commit oids in 'write --stdin-commits', 2019-08-05), the commit-graph builtin dies on receiving non-commit OIDs as input to '--stdin-commits'. This behavior can be cumbersome to work around in, say, the case of piping 'git for-each-ref' to 'git commit-graph write --stdin-commits' if the caller does not want to cull out non-commits themselves. In this situation, it would be ideal if 'git commit-graph write' wrote the graph containing the inputs that did pertain to commits, and silently ignored the remainder of the input. Some options have been proposed to the effect of '--[no-]check-oids' which would allow callers to have the commit-graph builtin do just that. After some discussion, it is difficult to imagine a caller who wouldn't want to pass '--no-check-oids', suggesting that we should get rid of the behavior of complaining about non-commit inputs altogether. If callers do wish to retain this behavior, they can easily work around this change by doing the following: git for-each-ref --format='%(objectname) %(objecttype) %(*objecttype)' | awk ' !/commit/ { print "not-a-commit:"$1 } /commit/ { print $1 } ' | git commit-graph write --stdin-commits To make it so that valid OIDs that refer to non-existent objects are indeed an error after loosening the error handling, perform an extra lookup to make sure that object indeed exists before sending it to the commit-graph internals. Helped-by: Jeff King <peff@peff.net> Signed-off-by: Taylor Blau <me@ttaylorr.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-05-01Merge branch 'ds/blame-on-bloom'Libravatar Junio C Hamano1-0/+9
"git blame" learns to take advantage of the "changed-paths" Bloom filter stored in the commit-graph file. * ds/blame-on-bloom: test-bloom: check that we have expected arguments test-bloom: fix some whitespace issues blame: drop unused parameter from maybe_changed_path blame: use changed-path Bloom filters tests: write commit-graph with Bloom filters revision: complicated pathspecs disable filters
2020-05-01Merge branch 'gs/commit-graph-path-filter'Libravatar Junio C Hamano1-1/+8
Introduce an extension to the commit-graph to make it efficient to check for the paths that were modified at each commit using Bloom filters. * gs/commit-graph-path-filter: bloom: ignore renames when computing changed paths commit-graph: add GIT_TEST_COMMIT_GRAPH_CHANGED_PATHS test flag t4216: add end to end tests for git log with Bloom filters revision.c: add trace2 stats around Bloom filter usage revision.c: use Bloom filters to speed up path based revision walks commit-graph: add --changed-paths option to write subcommand commit-graph: reuse existing Bloom filters during write commit-graph: write Bloom filters to commit graph file commit-graph: examine commits by generation number commit-graph: examine changed-path objects in pack order commit-graph: compute Bloom filters for changed paths diff: halt tree-diff early after max_changes bloom.c: core Bloom filter implementation for changed paths. bloom.c: introduce core Bloom filter constructs bloom.c: add the murmur3 hash implementation commit-graph: define and use MAX_NUM_CHUNKS
2020-04-24commit-graph: close descriptors after mmapLibravatar Jeff King1-4/+1
We don't ever refer to the descriptor after mmap-ing it. And keeping it open means we can run out of descriptors in degenerate cases (e.g., thousands of split chain files). Let's close it as soon as possible. Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Taylor Blau <me@ttaylorr.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-04-16tests: write commit-graph with Bloom filtersLibravatar Derrick Stolee1-0/+9
The GIT_TEST_COMMIT_GRAPH environment variable updates the commit- graph file whenever "git commit" is run, ensuring that we always have an updated commit-graph throughout the test suite. The GIT_TEST_COMMIT_GRAPH_CHANGED_PATHS environment variable was introduced to write the changed-path Bloom filters whenever "git commit-graph write" is run. However, the GIT_TEST_COMMIT_GRAPH trick doesn't launch a separate process and instead writes it directly. To expand the number of tests that have commits in the commit-graph file, add a helper method that computes the commit-graph and place that helper inside "git commit" and "git merge". In the helper method, check GIT_TEST_COMMIT_GRAPH_CHANGED_PATHS to ensure we are writing changed-path Bloom filters whenever possible. Signed-off-by: Derrick Stolee <dstolee@microsoft.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-04-15commit-graph.h: replace 'commit_hex' with 'commits'Libravatar Taylor Blau1-1/+2
The 'write_commit_graph()' function takes in either a string list of pack indices, or a string list of hexadecimal commit OIDs. These correspond to the '--stdin-packs' and '--stdin-commits' mode(s) from 'git commit-graph write'. Using a string_list of hexadecimal commit IDs is not the most efficient use of memory, since we can instead use the 'struct oidset', which is more well-suited for this case. This has another benefit which will become apparent in the following commit. This is that we are about to disambiguate the kinds of errors we produce with '--stdin-commits' into "non-hex input" and "hex-input, but referring to a non-commit object". By having 'write_commit_graph' take in a 'struct oidset *' of commits, we place the burden on the caller (in this case, the builtin) to handle the first case, and the commit-graph machinery can handle the second case. Suggested-by: Jeff King <peff@peff.net> Signed-off-by: Taylor Blau <me@ttaylorr.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-04-15builtin/commit-graph.c: introduce split strategy 'replace'Libravatar Taylor Blau1-1/+2
When using split commit-graphs, it is sometimes useful to completely replace the commit-graph chain with a new base. For example, consider a scenario in which a repository builds a new commit-graph incremental for each push. Occasionally (say, after some fixed number of pushes), they may wish to rebuild the commit-graph chain with all reachable commits. They can do so with $ git commit-graph write --reachable but this removes the chain entirely and replaces it with a single commit-graph in 'objects/info/commit-graph'. Unfortunately, this means that the next push will have to move this commit-graph into the first layer of a new chain, and then write its new commits on top. Avoid such copying entirely by allowing the caller to specify that they wish to replace the entirety of their commit-graph chain, while also specifying that the new commit-graph should become the basis of a fresh, length-one chain. This addresses the above situation by making it possible for the caller to instead write: $ git commit-graph write --reachable --split=replace which writes a new length-one chain to 'objects/info/commit-graphs', making the commit-graph incremental generated by the subsequent push relatively cheap by avoiding the aforementioned copy. In order to do this, remove an assumption in 'write_commit_graph_file' that chains are always at least two incrementals long. Signed-off-by: Taylor Blau <me@ttaylorr.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-04-15builtin/commit-graph.c: introduce split strategy 'no-merge'Libravatar Taylor Blau1-1/+2
In the previous commit, we laid the groundwork for supporting different splitting strategies. In this commit, we introduce the first splitting strategy: 'no-merge'. Passing '--split=no-merge' is useful for callers which wish to write a new incremental commit-graph, but do not want to spend effort condensing the incremental chain [1]. Previously, this was possible by passing '--size-multiple=0', but this no longer the case following 63020f175f (commit-graph: prefer default size_mult when given zero, 2020-01-02). When '--split=no-merge' is given, the commit-graph machinery will never condense an existing chain, and it will always write a new incremental. [1]: This might occur when, for example, a server administrator running some program after each push may want to ensure that each job runs proportional in time to the size of the push, and does not "jump" when the commit-graph machinery decides to trigger a merge. Signed-off-by: Taylor Blau <me@ttaylorr.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-04-15builtin/commit-graph.c: support for '--split[=<strategy>]'Libravatar Taylor Blau1-0/+5
With '--split', the commit-graph machinery writes new commits in another incremental commit-graph which is part of the existing chain, and optionally decides to condense the chain into a single commit-graph. This is done to ensure that the asymptotic behavior of looking up a commit in an incremental chain is not dominated by the number of incrementals in that chain. It can be controlled by the '--max-commits' and '--size-multiple' options. In the next two commits, we will introduce additional splitting strategies that can exert additional control over: - when a split commit-graph is and isn't written, and - when the existing commit-graph chain is discarded completely and replaced with another graph To prepare for this, make '--split' take an optional strategy (as in '--split[=<strategy>]'), and add a new enum to describe which strategy is being used. For now, no strategies are given, and the only enumerated value is 'COMMIT_GRAPH_SPLIT_UNSPECIFIED', indicating the absence of a strategy. Signed-off-by: Taylor Blau <me@ttaylorr.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-04-06commit-graph: add GIT_TEST_COMMIT_GRAPH_CHANGED_PATHS test flagLibravatar Garima Singh1-0/+1
Add GIT_TEST_COMMIT_GRAPH_CHANGED_PATHS test flag to the test setup suite in order to toggle writing Bloom filters when running any of the git tests. If set to true, we will compute and write Bloom filters every time a test calls `git commit-graph write`, as if the `--changed-paths` option was passed in. The test suite passes when GIT_TEST_COMMIT_GRAPH and GIT_TEST_COMMIT_GRAPH_CHANGED_PATHS are enabled. Helped-by: Derrick Stolee <dstolee@microsoft.com> Signed-off-by: Garima Singh <garima.singh@microsoft.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-04-06commit-graph: write Bloom filters to commit graph fileLibravatar Garima Singh1-0/+5
Update the technical documentation for commit-graph-format with the formats for the Bloom filter index (BIDX) and Bloom filter data (BDAT) chunks. Write the computed Bloom filters information to the commit graph file using this format. Helped-by: Derrick Stolee <dstolee@microsoft.com> Signed-off-by: Garima Singh <garima.singh@microsoft.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-03-30commit-graph: compute Bloom filters for changed pathsLibravatar Garima Singh1-1/+2
Add new COMMIT_GRAPH_WRITE_CHANGED_PATHS flag that makes Git compute Bloom filters for the paths that changed between a commit and it's first parent, for each commit in the commit-graph. This computation is done on a commit-by-commit basis. We will write these Bloom filters to the commit-graph file, to store this data on disk, in the next change in this series. Helped-by: Derrick Stolee <dstolee@microsoft.com> Signed-off-by: Garima Singh <garima.singh@microsoft.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-02-04commit-graph.h: use odb in 'load_commit_graph_one_fd_st'Libravatar Taylor Blau1-1/+2
Apply a similar treatment as in the previous patch to pass a 'struct object_directory *' through the 'load_commit_graph_one_fd_st' initializer, too. This prevents a potential bug where a pointer comparison is made to a NULL 'g->odb', which would cause the commit-graph machinery to think that a pair of commit-graphs belonged to different alternates when in fact they do not (i.e., in the case of no '--object-dir'). Signed-off-by: Taylor Blau <me@ttaylorr.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-02-04commit-graph.c: remove path normalization, comparisonLibravatar Taylor Blau1-1/+1
As of the previous patch, all calls to 'commit-graph.c' functions which perform path normalization (for e.g., 'get_commit_graph_filename()') are of the form 'ctx->odb->path', which is always in normalized form. Now that there are no callers passing non-normalized paths to these functions, ensure that future callers are bound by the same restrictions by making these functions take a 'struct object_directory *' instead of a 'const char *'. To match, replace all calls with arguments of the form 'ctx->odb->path' with 'ctx->odb' To recover the path, functions that perform path manipulation simply use 'odb->path'. Further, avoid string comparisons with arguments of the form 'odb->path', and instead prefer raw pointer comparisons, which accomplish the same effect, but are far less brittle. This has a pleasant side-effect of making these functions much more robust to paths that cannot be normalized by 'normalize_path_copy()', i.e., because they are outside of the current working directory. For example, prior to this patch, Valgrind reports that the following uninitialized memory read [1]: $ ( cd t && GIT_DIR=../.git valgrind git rev-parse HEAD^ ) because 'normalize_path_copy()' can't normalize '../.git' (since it's relative to but above of the current working directory) [2]. By using a 'struct object_directory *' directly, 'get_commit_graph_filename()' does not need to normalize, because all paths are relative to the current working directory since they are always read from the '->path' of an object directory. [1]: https://lore.kernel.org/git/20191027042116.GA5801@sigill.intra.peff.net. [2]: The bug here is that 'get_commit_graph_filename()' returns the result of 'normalize_path_copy()' without checking the return value. Signed-off-by: Taylor Blau <me@ttaylorr.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-02-04commit-graph.h: store object directory in 'struct commit_graph'Libravatar Taylor Blau1-2/+3
In a previous patch, the 'char *object_dir' in 'struct commit_graph' was replaced with a 'struct object_directory'. This patch applies the same treatment to 'struct commit_graph', which is another intermediate step towards getting rid of all path normalization in 'commit-graph.c'. Instead of taking a 'char *object_dir', functions that construct a 'struct commit_graph' now take a 'struct object_directory *'. Any code that needs an object directory path use '->path' instead. This ensures that all calls to functions that perform path normalization are given arguments which do not themselves require normalization. This prepares those functions to drop their normalization entirely, which will occur in the subsequent patch. Signed-off-by: Taylor Blau <me@ttaylorr.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-02-04commit-graph.h: store an odb in 'struct write_commit_graph_context'Libravatar Taylor Blau1-2/+3
There are lots of places in 'commit-graph.h' where a function either has (or almost has) a full 'struct object_directory *', accesses '->path', and then throws away the rest of the struct. This can cause headaches when comparing the locations of object directories across alternates (e.g., in the case of deciding if two commit-graph layers can be merged). These paths are normalized with 'normalize_path_copy()' which mitigates some comparison issues, but not all [1]. Replace usage of 'char *object_dir' with 'odb->path' by storing a 'struct object_directory *' in the 'write_commit_graph_context' structure. This is an intermediate step towards getting rid of all path normalization in 'commit-graph.c'. Resolving a user-provided '--object-dir' argument now requires that we compare it to the known alternates for equality. Prior to this patch, an unknown '--object-dir' argument would silently exit with status zero. This can clearly lead to unintended behavior, such as verifying commit-graphs that aren't in a repository's own object store (or one of its alternates), or causing a typo to mask a legitimate commit-graph verification failure. Make this error non-silent by 'die()'-ing when the given '--object-dir' does not match any known alternate object store. [1]: In my testing, for example, I can get one side of the commit-graph code to fill object_dir with "./objects" and the other with just "objects". Signed-off-by: Taylor Blau <me@ttaylorr.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2019-09-12upload-pack: disable commit graph more gently for shallow traversalLibravatar Jeff King1-0/+6
When the client has asked for certain shallow options like "deepen-since", we do a custom rev-list walk that pretends to be shallow. Before doing so, we have to disable the commit-graph, since it is not compatible with the shallow view of the repository. That's handled by 829a321569 (commit-graph: close_commit_graph before shallow walk, 2018-08-20). That commit literally closes and frees our repo->objects->commit_graph struct. That creates an interesting problem for commits that have _already_ been parsed using the commit graph. Their commit->object.parsed flag is set, their commit->graph_pos is set, but their commit->maybe_tree may still be NULL. When somebody later calls repo_get_commit_tree(), we see that we haven't loaded the tree oid yet and try to get it from the commit graph. But since it has been freed, we segfault! So the root of the issue is a data dependency between the commit's lazy-load of the tree oid and the fact that the commit graph can go away mid-process. How can we resolve it? There are a couple of general approaches: 1. The obvious answer is to avoid loading the tree from the graph when we see that it's NULL. But then what do