summaryrefslogtreecommitdiff
path: root/blame.c
AgeCommit message (Collapse)AuthorFilesLines
2022-02-25Merge branch 'ab/diff-free-more'Libravatar Junio C Hamano1-3/+0
Leakfixes. * ab/diff-free-more: diff.[ch]: have diff_free() free options->parseopts diff.[ch]: have diff_free() call clear_pathspec(opts.pathspec)
2022-02-16diff.[ch]: have diff_free() call clear_pathspec(opts.pathspec)Libravatar Ævar Arnfjörð Bjarmason1-3/+0
Have the diff_free() function call clear_pathspec(). Since the diff_flush() function calls this all its callers can be simplified to rely on it instead. When I added the diff_free() function in e900d494dcf (diff: add an API for deferred freeing, 2021-02-11) I simply missed this, or wasn't interested in it. Let's consolidate this now. This means that any future callers (and I've got revision.c in mind) that embed a "struct diff_options" can simply call diff_free() instead of needing know that it has an embedded pathspec. This does fix a bunch of leaks, but I can't mark any test here as passing under the SANITIZE=leak testing mode because in 886e1084d78 (builtin/: add UNLEAKs, 2017-10-01) an UNLEAK(rev) was added, which plasters over the memory leak. E.g. "t4011-diff-symlink.sh" would report fewer leaks with this fix, but because of the UNLEAK() reports none. I'll eventually loop around to removing that UNLEAK(rev) annotation as I'll fix deeper issues with the revisions API leaking. This is one small step on the way there, a new freeing function in revisions.c will want to call this diff_free(). Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com> Reviewed-by: Elijah Newren <newren@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2022-01-12git-rev-list: add --exclude-first-parent-only flagLibravatar Jerry Zhang1-1/+1
It is useful to know when a branch first diverged in history from some integration branch in order to be able to enumerate the user's local changes. However, these local changes can include arbitrary merges, so it is necessary to ignore this merge structure when finding the divergence point. In order to do this, teach the "rev-list" family to accept "--exclude-first-parent-only", which restricts the traversal of excluded commits to only follow first parent links. -A-----E-F-G--main \ / / B-C-D--topic In this example, the goal is to return the set {B, C, D} which represents a topic branch that has been merged into main branch. `git rev-list topic ^main` will end up returning no commits since excluding main will end up traversing the commits on topic as well. `git rev-list --exclude-first-parent-only topic ^main` however will return {B, C, D} as desired. Add docs for the new flag, and clarify the doc for --first-parent to indicate that it applies to traversing the set of included commits only. Signed-off-by: Jerry Zhang <jerry@skydio.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-04-27hash: provide per-algorithm null OIDsLibravatar brian m. carlson1-1/+1
Up until recently, object IDs did not have an algorithm member, only a hash. Consequently, it was possible to share one null (all-zeros) object ID among all hash algorithms. Now that we're going to be handling objects from multiple hash algorithms, it's important to make sure that all object IDs have a correct algorithm field. Introduce a per-algorithm null OID, and add it to struct hash_algo. Introduce a wrapper function as well, and use it everywhere we used to use the null_oid constant. Signed-off-by: brian m. carlson <sandals@crustytoothpaste.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-03-13use CALLOC_ARRAYLibravatar René Scharfe1-9/+8
Add and apply a semantic patch for converting code that open-codes CALLOC_ARRAY to use it instead. It shortens the code and infers the element size automatically. Signed-off-by: René Scharfe <l.s.r@web.de> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-11-21Merge branch 'en/strmap'Libravatar Junio C Hamano1-1/+1
A specialization of hashmap that uses a string as key has been introduced. Hopefully it will see wider use over time. * en/strmap: shortlog: use strset from strmap.h Use new HASHMAP_INIT macro to simplify hashmap initialization strmap: take advantage of FLEXPTR_ALLOC_STR when relevant strmap: enable allocations to come from a mem_pool strmap: add a strset sub-type strmap: split create_entry() out of strmap_put() strmap: add functions facilitating use as a string->int map strmap: enable faster clearing and reusing of strmaps strmap: add more utility functions strmap: new utility functions hashmap: provide deallocation function names hashmap: introduce a new hashmap_partial_clear() hashmap: allow re-use after hashmap_free() hashmap: adjust spacing to fix argument alignment hashmap: add usage documentation explaining hashmap_free[_entries]()
2020-11-02hashmap: provide deallocation function namesLibravatar Elijah Newren1-1/+1
hashmap_free(), hashmap_free_entries(), and hashmap_free_() have existed for a while, but aren't necessarily the clearest names, especially with hashmap_partial_clear() being added to the mix and lazy-initialization now being supported. Peff suggested we adopt the following names[1]: - hashmap_clear() - remove all entries and de-allocate any hashmap-specific data, but be ready for reuse - hashmap_clear_and_free() - ditto, but free the entries themselves - hashmap_partial_clear() - remove all entries but don't deallocate table - hashmap_partial_clear_and_free() - ditto, but free the entries This patch provides the new names and converts all existing callers over to the new naming scheme. [1] https://lore.kernel.org/git/20201030125059.GA3277724@coredump.intra.peff.net/ Signed-off-by: Elijah Newren <newren@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-11-01blame: simplify 'setup_blame_bloom_data' interfaceLibravatar Philippe Blain1-3/+2
The penultimate commit moved the initialization of 'sb.path' in 'builtin/blame.c::cmd_blame' before the call to 'blame.c::setup_blame_bloom_data'. Since 'cmd_blame' is the only caller of 'setup_blame_bloom_data', it is now unnecessary for 'setup_blame_bloom_data' to receive 'path' as a separate argument, as 'sb.path' is already initialized. Remove this argument from setup_blame_bloom_data's interface and use the 'path' field of the 'sb' 'struct blame_scoreboard' instead. Signed-off-by: Philippe Blain <levraiphilippeblain@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-11-01blame: simplify 'setup_scoreboard' interfaceLibravatar Philippe Blain1-6/+5
The previous commit moved the initialization of 'sb.path' in 'builtin/blame.c::cmd_blame' before the call to 'blame.c::setup_scoreboard'. Since 'cmd_blame' is the only caller of 'setup_scoreboard', it is now unnecessary for 'setup_scoreboard' to receive 'path' as a separate argument, as 'sb.path' is already initialized. Remove this argument from setup_scoreboard's interface and use the 'path' field of the 'sb' 'struct blame_scoreboard' instead. Signed-off-by: Philippe Blain <levraiphilippeblain@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-10-12blame: handle deref_tag() returning NULLLibravatar René Scharfe1-3/+3
Signed-off-by: René Scharfe <l.s.r@web.de> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-09-29Merge branch 'tb/bloom-improvements'Libravatar Junio C Hamano1-3/+5
"git commit-graph write" learned to limit the number of bloom filters that are computed from scratch with the --max-new-filters option. * tb/bloom-improvements: commit-graph: introduce 'commitGraph.maxNewFilters' builtin/commit-graph.c: introduce '--max-new-filters=<n>' commit-graph: rename 'split_commit_graph_opts' bloom: encode out-of-bounds filters as non-empty bloom/diff: properly short-circuit on max_changes bloom: use provided 'struct bloom_filter_settings' bloom: split 'get_bloom_filter()' in two commit-graph.c: store maximum changed paths commit-graph: respect 'commitGraph.readChangedPaths' t/helper/test-read-graph.c: prepare repo settings commit-graph: pass a 'struct repository *' in more places t4216: use an '&&'-chain commit-graph: introduce 'get_bloom_filter_settings()'
2020-09-18Merge branch 'ea/blame-use-oideq'Libravatar Junio C Hamano1-2/+2
Code cleanup. * ea/blame-use-oideq: blame.c: replace instance of !oidcmp for oideq
2020-09-17bloom: split 'get_bloom_filter()' in twoLibravatar Taylor Blau1-1/+1
'get_bloom_filter' takes a flag to control whether it will compute a Bloom filter if the requested one is missing. In the next patch, we'll add yet another parameter to this method, which would force all but one caller to specify an extra 'NULL' parameter at the end. Instead of doing this, split 'get_bloom_filter' into two functions: 'get_bloom_filter' and 'get_or_compute_bloom_filter'. The former only looks up a Bloom filter (and does not compute one if it's missing, thus dropping the 'compute_if_not_present' flag). The latter does compute missing Bloom filters, with an additional parameter to store whether or not it needed to do so. This simplifies many call-sites, since the majority of existing callers to 'get_bloom_filter' do not want missing Bloom filters to be computed (so they can drop the parameter entirely and use the simpler version of the function). While we're at it, instrument the new 'get_or_compute_bloom_filter()' with counters in the 'write_commit_graph_context' struct which store the number of filters that we did and didn't compute, as well as filters that were truncated. It would be nice to drop the 'compute_if_not_present' flag entirely, since all remaining callers of 'get_or_compute_bloom_filter' pass it as '1', but this will change in a future patch and hence cannot be removed. Signed-off-by: Taylor Blau <me@ttaylorr.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-09-09commit-graph: introduce 'get_bloom_filter_settings()'Libravatar Taylor Blau1-2/+4
Many places in the code often need a pointer to the commit-graph's 'struct bloom_filter_settings', in which case they often take the value from the top-most commit-graph. In the non-split case, this works as expected. In the split case, however, things get a little tricky. Not all layers in a chain of incremental commit-graphs are required to themselves have Bloom data, and so whether or not some part of the code uses Bloom filters depends entirely on whether or not the top-most level of the commit-graph chain has Bloom filters. This has been the behavior since Bloom filters were introduced, and has been codified into the tests since a759bfa9ee (t4216: add end to end tests for git log with Bloom filters, 2020-04-06). In fact, t4216.130 requires that Bloom filters are not used in exactly the case described earlier. There is no reason that this needs to be the case, since it is perfectly valid for commits in an earlier layer to have Bloom filters when commits in a newer layer do not. Since Bloom settings are guaranteed in practice to be the same for any layer in a chain that has Bloom data, it is sufficient to traverse the '->base_graph' pointer until either (1) a non-null 'struct bloom_filter_settings *' is found, or (2) until we are at the root of the commit-graph chain. Introduce a 'get_bloom_filter_settings()' function that does just this, and use it instead of purely dereferencing the top-most graph's '->bloom_filter_settings' pointer. While we're at it, add an additional test in t5324 to guard against code in the commit-graph writing machinery that doesn't correctly handle a NULL 'struct bloom_filter *'. Co-authored-by: Derrick Stolee <dstolee@microsoft.com> Signed-off-by: Taylor Blau <me@ttaylorr.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-09-08blame.c: replace instance of !oidcmp for oideqLibravatar Edmundo Carmona Antoranz1-2/+2
0906ac2b (blame: use changed-path Bloom filters, 2020-04-16) introduced a call to oidcmp() that should have been oideq(), which was introduced in 14438c44 (introduce hasheq() and oideq(), 2018-08-28). Signed-off-by: Edmundo Carmona Antoranz <eantoranz@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-08-13blame: only coalesce lines that are adjacent in resultLibravatar Jeff King1-0/+1
After blame has finished but before we produce any output, we coalesce groups of lines that were adjacent in the original suspect (which may have been split apart by lines in intermediate commits which went away). However, this can cause incorrect output if the lines are not also adjacent in the result. For instance, the case in t8003 has: ABC DEF which becomes ABC SPLIT DEF Blaming only lines 1 and 3 in the result yields two blame groups (one for each line) that were adjacent in the original. That's enough for us to coalesce them into a single group, but that loses information: our output routines assume they're adjacent in the result as well, and we output: <oid> 1) ABC <oid> 2) SPLIT This is nonsense for two reasons: - we were asked about line 3, not line 2; we should not output the SPLIT line at all - commit <oid> did not touch the SPLIT line at all! We found the correct blame for line 3, but the bug is actually in the output stage, which is showing the wrong line number and content from the final file. We can fix this by only coalescing when both the suspect and result lines are adjacent. That fixes this bug, but keeps coalescing in cases where want it (e.g., the existing test in t8003 where SPLIT goes away, and the lines really are adjacent in the result). Reported-by: Nuthan Munaiah <nm6061@rit.edu> Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-06-17commit: move members graph_pos, generation to a slabLibravatar Abhishek Kumar1-1/+1
We remove members `graph_pos` and `generation` from the struct commit. The default assignments in init_commit_node() are no longer valid, which is fine as the slab helpers return appropriate default values and the assignments are removed. We will replace existing use of commit->generation and commit->graph_pos by commit_graph_data_slab helpers using `contrib/coccinelle/commit.cocci'. Signed-off-by: Abhishek Kumar <abhishekkumar8222@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-04-23blame: drop unused parameter from maybe_changed_pathLibravatar Jeff King1-3/+1
We don't use the "parent" parameter at all (probably because the bloom filter for a commit is always defined against a single parent anyway). Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-04-16blame: use changed-path Bloom filtersLibravatar Derrick Stolee1-9/+130
The changed-path Bloom filters help reduce the amount of tree parsing required during history queries. Before calculating a diff, we can ask the filter if a path changed between a commit and its first parent. If the filter says "no" then we can move on without parsing trees. If the filter says "maybe" then we parse trees to discover if the answer is actually "yes" or "no". When computing a blame, there is a section in find_origin() that computes a diff between a commit and one of its parents. When this is the first parent, we can check the Bloom filters before calling diff_tree_oid(). In order to make this work with the blame machinery, we need to initialize a struct bloom_key with the initial path. But also, we need to add more keys to a list if a rename is detected. We then check to see if _any_ of these keys answer "maybe" in the diff. During development, I purposefully left out this "add a new key when a rename is detected" to see if the test suite would catch my error. That is how I discovered the issues with GIT_TEST_COMMIT_GRAPH_CHANGED_PATHS from the previous change. With that change, we can feel some confidence in the coverage of this change. If a user requests copy detection using "git blame -C", then there are more places where the set of "important" files can expand. I do not know enough about how this happens in the blame machinery. Thus, the Bloom filter integration is explicitly disabled in this mode. A later change could expand the bloom_key data with an appropriate call (or calls) to add_bloom_key(). If we did not disable this mode, then the following tests would fail: t8003-blame-corner-cases.sh t8011-blame-split-file.sh Generally, this is a performance enhancement and should not change the behavior of 'git blame' in any way. If a repo has a commit-graph file with computed changed-path Bloom filters, then they should notice improved performance for their 'git blame' commands. Here are some example timings that I found by blaming some paths in the Linux kernel repository: git blame arch/x86/kernel/topology.c >/dev/null Before: 0.83s After: 0.24s git blame kernel/time/time.c >/dev/null Before: 0.72s After: 0.24s git blame tools/perf/ui/stdio/hist.c >/dev/null Before: 0.27s After: 0.11s I specifically looked for "deep" paths that were also edited many times. As a counterpoint, the MAINTAINERS file was edited many times but is located in the root tree. This means that the cost of computing a diff relative to the pathspec is very small. Here are the timings for that command: git blame MAINTAINERS >/dev/null Before: 20.1s After: 18.0s These timings are the best of five. The worst-case runs were on the order of 2.5 minutes for both cases. Note that the MAINTAINERS file has 18,740 lines across 17,000+ commits. This happens to be one of the cases where this change provides the least improvement. The lack of improvement for the MAINTAINERS file and the relatively modest improvement for the other examples can be easily explained. The blame machinery needs to compute line-level diffs to determine which lines were changed by each commit. That makes up a large proportion of the computation time, and this change does not attempt to improve on that section of the algorithm. The MAINTAINERS file is large and changed often, so it takes time to determine which lines were updated by which commit. In contrast, the code files are much smaller, and it takes longer to comute the line-by-line diff for a single patch on the Linux mailing lists. Outside of the "-C" integration, I believe there is little more to gain from the changed-path Bloom filters for 'git blame' after this patch. Signed-off-by: Derrick Stolee <dstolee@microsoft.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2019-10-15Merge branch 'ew/hashmap'Libravatar Junio C Hamano1-11/+14
Code clean-up of the hashmap API, both users and implementation. * ew/hashmap: hashmap_entry: remove first member requirement from docs hashmap: remove type arg from hashmap_{get,put,remove}_entry OFFSETOF_VAR macro to simplify hashmap iterators hashmap: introduce hashmap_free_entries hashmap: hashmap_{put,remove} return hashmap_entry * hashmap: use *_entry APIs for iteration hashmap_cmp_fn takes hashmap_entry params hashmap_get{,_from_hash} return "struct hashmap_entry *" hashmap: use *_entry APIs to wrap container_of hashmap_get_next returns "struct hashmap_entry *" introduce container_of macro hashmap_put takes "struct hashmap_entry *" hashmap_remove takes "const struct hashmap_entry *" hashmap_get takes "const struct hashmap_entry *" hashmap_add takes "struct hashmap_entry *" hashmap_get_next takes "const struct hashmap_entry *" hashmap_entry_init takes "struct hashmap_entry *" packfile: use hashmap_entry in delta_base_cache_entry coccicheck: detect hashmap_entry.hash assignment diff: use hashmap_entry_init on moved_entry.ent
2019-10-07hashmap: remove type arg from hashmap_{get,put,remove}_entryLibravatar Eric Wong1-6/+4
Since these macros already take a `keyvar' pointer of a known type, we can rely on OFFSETOF_VAR to get the correct offset without relying on non-portable `__typeof__' and `offsetof'. Argument order is also rearranged, so `keyvar' and `member' are sequential as they are used as: `keyvar->member' Signed-off-by: Eric Wong <e@80x24.org> Reviewed-by: Derrick Stolee <stolee@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2019-10-07OFFSETOF_VAR macro to simplify hashmap iteratorsLibravatar Eric Wong1-2/+0
While we cannot rely on a `__typeof__' operator being portable to use with `offsetof'; we can calculate the pointer offset using an existing pointer and the address of a member using pointer arithmetic for compilers without `__typeof__'. This allows us to simplify usage of hashmap iterator macros by not having to specify a type when a pointer of that type is already given. In the future, list iterator macros (e.g. list_for_each_entry) may also be implemented using OFFSETOF_VAR to save hackers the trouble of using container_of/list_entry macros and without relying on non-portable `__typeof__'. v3: use `__typeof__' to avoid clang warnings Signed-off-by: Eric Wong <e@80x24.org> Reviewed-by: Derrick Stolee <stolee@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2019-10-07hashmap: introduce hashmap_free_entriesLibravatar Eric Wong1-1/+1
`hashmap_free_entries' behaves like `container_of' and passes the offset of the hashmap_entry struct to the internal `hashmap_free_' function, allowing the function to free any struct pointer regardless of where the hashmap_entry field is located. `hashmap_free' no longer takes any arguments aside from the hashmap itself. Signed-off-by: Eric Wong <e@80x24.org> Reviewed-by: Derrick Stolee <stolee@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2019-10-07hashmap: use *_entry APIs for iterationLibravatar Eric Wong1-4/+6
Inspired by list_for_each_entry in the Linux kernel. Once again, these are somewhat compromised usability-wise by compilers lacking __typeof__ support. Signed-off-by: Eric Wong <e@80x24.org> Reviewed-by: Derrick Stolee <stolee@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2019-10-07hashmap_get{,_from_hash} return "struct hashmap_entry *"Libravatar Eric Wong1-3/+8
Update callers to use hashmap_get_entry, hashmap_get_entry_from_hash or container_of as appropriate. This is another step towards eliminating the requirement of hashmap_entry being the first field in a struct. Signed-off-by: Eric Wong <e@80x24.org> Reviewed-by: Derrick Stolee <stolee@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2019-10-07hashmap_remove takes "const struct hashmap_entry *"Libravatar Eric Wong1-1/+1
This is less error-prone than "const void *" as the compiler now detects invalid types being passed. Signed-off-by: Eric Wong <e@80x24.org> Reviewed-by: Derrick Stolee <stolee@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2019-10-07hashmap_get takes "const struct hashmap_entry *"Libravatar Eric Wong1-3/+3
This is less error-prone than "const void *" as the compiler now detects invalid types being passed. Signed-off-by: Eric Wong <e@80x24.org> Reviewed-by: Derrick Stolee <stolee@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2019-10-07hashmap_add takes "struct hashmap_entry *"Libravatar Eric Wong1-1/+1
This is less error-prone than "void *" as the compiler now detects invalid types being passed. Signed-off-by: Eric Wong <e@80x24.org> Reviewed-by: Derrick Stolee <stolee@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2019-10-07hashmap_entry_init takes "struct hashmap_entry *"Libravatar Eric Wong1-1/+1
C compilers do type checking to make life easier for us. So rely on that and update all hashmap_entry_init callers to take "struct hashmap_entry *" to avoid future bugs while improving safety and readability. Signed-off-by: Eric Wong <e@80x24.org> Reviewed-by: Derrick Stolee <stolee@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2019-08-19blame: remove needless comparison with GIT_SHA1_HEXSZLibravatar brian m. carlson1-1/+1
When faking a working tree commit, we read in lines from MERGE_HEAD into a strbuf. Because the strbuf is NUL-terminated and get_oid_hex will fail if it unexpectedly encounters a NUL, the check for the length of the line is unnecessary. There is no optimization benefit from this case, either, since on failure we call die. Remove this check, since it is no longer needed. Signed-off-by: brian m. carlson <sandals@crustytoothpaste.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2019-07-19Merge branch 'nd/tree-walk-with-repo'Libravatar Junio C Hamano1-2/+2
The tree-walk API learned to pass an in-core repository instance throughout more codepaths. * nd/tree-walk-with-repo: t7814: do not generate same commits in different repos Use the right 'struct repository' instead of the_repository match-trees.c: remove the_repo from shift_tree*() tree-walk.c: remove the_repo from get_tree_entry_follow_symlinks() tree-walk.c: remove the_repo from get_tree_entry() tree-walk.c: remove the_repo from fill_tree_descriptor() sha1-file.c: remove the_repo from read_object_with_reference()
2019-07-19Merge branch 'br/blame-ignore'Libravatar Junio C Hamano1-56/+959
"git blame" learned to "ignore" commits in the history, whose effects (as well as their presence) get ignored. * br/blame-ignore: t8014: remove unnecessary braces blame: drop some unused function parameters blame: add a test to cover blame_coalesce() blame: use the fingerprint heuristic to match ignored lines blame: add a fingerprint heuristic to match ignored lines blame: optionally track line fingerprints during fill_blame_origin() blame: add config options for the output of ignored or unblamable lines blame: add the ability to ignore commits and their changes blame: use a helper function in blame_chunk() Move oidset_parse_file() to oidset.c fsck: rename and touch up init_skiplist()
2019-06-28blame: drop some unused function parametersLibravatar Jeff King1-4/+3
These unused parameters were introduced recently as part of the br/blame-ignore topic. I assume they are not indicative of bugs, but are just leftovers from the development process (they were introduced by the series but not used in any of its iterations). Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2019-06-27tree-walk.c: remove the_repo from get_tree_entry()Libravatar Nguyễn Thái Ngọc Duy1-2/+2
Signed-off-by: Nguyễn Thái Ngọc Duy <pclouds@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2019-06-20blame: use the fingerprint heuristic to match ignored linesLibravatar Barret Rhoden1-5/+55
This commit integrates the fuzzy fingerprint heuristic into guess_line_blames(). We actually make two passes. The first pass uses the fuzzy algorithm to find a match within the current diff chunk. If that fails, the second pass searches the entire parent file for the best match. For an example of scanning the entire parent for a match, consider: commit-a 30) #include <sys/header_a.h> commit-b 31) #include <header_b.h> commit-c 32) #include <header_c.h> Then commit X alphabetizes them: commit-X 30) #include <header_b.h> commit-X 31) #include <header_c.h> commit-X 32) #include <sys/header_a.h> If we just check the parent's chunk (i.e. the first pass), we'd get: commit-b 30) #include <header_b.h> commit-c 31) #include <header_c.h> commit-X 32) #include <sys/header_a.h> That's because commit X actually consists of two chunks: one chunk is removing sys/header_a.h, then some context, and the second chunk is adding sys/header_a.h. If we scan the entire parent file, we get: commit-b 30) #include <header_b.h> commit-c 31) #include <header_c.h> commit-a 32) #include <sys/header_a.h> Signed-off-by: Barret Rhoden <brho@google.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2019-06-20blame: add a fingerprint heuristic to match ignored linesLibravatar Michael Platings1-0/+642
This algorithm will replace the heuristic used to identify lines from ignored commits with one that finds likely candidate lines in the parent's version of the file. The actual replacement occurs in an upcoming commit. The old heuristic simply assigned lines in the target to the same line number (plus offset) in the parent. The new function uses a fingerprinting algorithm to detect similarity between lines. The new heuristic is designed to accurately match changes made mechanically by formatting tools such as clang-format and clang-tidy. These tools make changes such as breaking up lines to fit within a character limit or changing identifiers to fit with a naming convention. The heuristic is not intended to match more extensive refactoring changes and may give misleading results in such cases. In most cases formatting tools preserve line ordering, so the heuristic is optimised for such cases. (Some types of changes do reorder lines e.g. sorting keep the line content identical, the git blame -M option can already be used to address this). The reason that it is advantageous to rely on ordering is due to source code repeating the same character sequences often e.g. declaring an identifier on one line and using that identifier on several subsequent lines. This means that lines can look very similar to each other which presents a problem when doing fuzzy matching. Relying on ordering gives us extra clues to point towards the true match. The heuristic operates on a single diff chunk change at a time. It creates a “fingerprint” for each line on each side of the change. Fingerprints are described in detail in the comment for `struct fingerprint`, but essentially are a multiset of the character pairs in a line. The heuristic first identifies the line in the target entry whose fingerprint is most clearly matched to a line fingerprint in the parent entry. Where fingerprints match identically, the position of the lines is used as a tie-break. The heuristic locks in the best match, and subtracts the fingerprint of the line in the target entry from the fingerprint of the line in the parent entry to prevent other lines being matched on the same parts of that line. It then repeats the process recursively on the section of the chunk before the match, and then the section of the chunk after the match. Here's an example of the difference the fingerprinting makes. Consider a file with two commits: commit-a 1) void func_1(void *x, void *y); commit-b 2) void func_2(void *x, void *y); After a commit 'X', we have: commit-X 1) void func_1(void *x, commit-X 2) void *y); commit-X 3) void func_2(void *x, commit-X 4) void *y); When we blame-ignored with the old algorithm, we get: commit-a 1) void func_1(void *x, commit-b 2) void *y); commit-X 3) void func_2(void *x, commit-X 4) void *y); Where commit-b is blamed for 2 instead of 3. With the fingerprint algorithm, we get: commit-a 1) void func_1(void *x, commit-a 2) void *y); commit-b 3) void func_2(void *x, commit-b 4) void *y); Note line 2 could be matched with either commit-a or commit-b as it is equally similar to both lines, but is matched with commit-a because its position as a fraction of the new line range is more similar to commit-a as a fraction of the old line range. Line 4 is also equally similar to both lines, but as it appears after line 3 which will be matched first it cannot be matched with an earlier line. For many more examples, see t/t8014-blame-ignore-fuzzy.sh which contains example parent and target files and the line numbers in the parent that must be matched. Signed-off-by: Michael Platings <michael@platin.gs> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2019-05-16blame: optionally track line fingerprints during fill_blame_origin()Libravatar Barret Rhoden1-30/+60
fill_blame_origin() is a convenient place to store data that we will use throughout the lifetime of a blame_origin. Some heuristics for ignoring commits during a blame session can make use of this storage. In particular, we will calculate a fingerprint for each line of a file for blame_origins involved in an ignored commit. In this commit, we only calculate the line_starts, reusing the existing code from the scoreboard's line_starts. In an upcoming commit, we will actually compute the fingerprints. This feature will be used when we attempt to pass blame entries to parents when we "ignore" a commit. Most uses of fill_blame_origin() will not require this feature, hence the flag parameter. Multiple calls to fill_blame_origin() are idempotent, and any of them can request the creation of the fingerprints structure. Suggested-by: Michael Platings <michael@platin.gs> Signed-off-by: Barret Rhoden <brho@google.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2019-05-16blame: add config options for the output of ignored or unblamable linesLibravatar Barret Rhoden1-1/+13
When ignoring commits, the commit that is blamed might not be responsible for the change, due to the inaccuracy of our heuristic. Users might want to know when a particular line has a potentially inaccurate blame. Furthermore, guess_line_blames() may fail to find any parent commit for a given line touched by an ignored commit. Those 'unblamable' lines remain blamed on an ignored commit. Users might want to know if a line is unblamable so that they do not spend time investigating a commit they know is uninteresting. This patch adds two config options to mark these two types of lines in the output of blame. The first option can identify ignored lines by specifying blame.markIgnoredLines. When this option is set, each blame line that was blamed on a commit other than the ignored commit is marked with a '?'. For example: 278b6158d6fdb (Barret Rhoden 2016-04-11 13:57:54 -0400 26) appears as: ?278b6158d6fd (Barret Rhoden 2016-04-11 13:57:54 -0400 26) where the '?' is placed before the commit, and the hash has one fewer characters. Sometimes we are unable to even guess at what ancestor commit touched a line. These lines are 'unblamable.' The second option, blame.markUnblamableLines, will mark the line with '*'. For example, say we ignore e5e8d36d04cbe, yet we are unable to blame this line on another commit: e5e8d36d04cbe (Barret Rhoden 2016-04-11 13:57:54 -0400 26) appears as: *e5e8d36d04cb (Barret Rhoden 2016-04-11 13:57:54 -0400 26) When these config options are used together, every line touched by an ignored commit will be marked with either a '?' or a '*'. Signed-off-by: Barret Rhoden <brho@google.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2019-05-16blame: add the ability to ignore commits and their changesLibravatar Barret Rhoden1-9/+167
Commits that make formatting changes or function renames are often not interesting when blaming a file. A user may deem such a commit as 'not interesting' and want to ignore and its changes it when assigning blame. For example, say a file has the following git history / rev-list: ---O---A---X---B---C---D---Y---E---F Commits X and Y both touch a particular line, and the other commits do not: X: "Take a third parameter" -MyFunc(1, 2); +MyFunc(1, 2, 3); Y: "Remove camelcase" -MyFunc(1, 2, 3); +my_func(1, 2, 3); git-blame will blame Y for the change. I'd like to be able to ignore Y: both the existence of the commit as well as any changes it made. This differs from -S rev-list, which specifies the list of commits to process for the blame. We would still process Y, but just don't let the blame 'stick.' This patch adds the ability for users to ignore a revision with --ignore-rev=rev, which may be repeated. They can specify a set of files of full object names of revs, e.g. SHA-1 hashes, one per line. A single file may be specified with the blame.ignoreRevFile config option or with --ignore-rev-file=file. Both the config option and the command line option may be repeated multiple times. An empty file name "" will clear the list of revs from previously processed files. Config options are processed before command line options. For a typical use case, projects will maintain the file containing revisions for commits that perform mass reformatting, and their users have the option to ignore all of the commits in that file. Additionally, a user can use the --ignore-rev option for one-off investigation. To go back to the example above, X was a substantive change to the function, but not the change the user is interested in. The user inspected X, but wanted to find the previous change to that line - perhaps a commit that introduced that function call. To make this work, we can't simply remove all ignored commits from the rev-list. We need to diff the changes introduced by Y so that we can ignore them. We let the blames get passed to Y, just like when processing normally. When Y is the target, we make sure that Y does not *keep* any blames. Any changes that Y is responsible for get passed to its parent. Note we make one pass through all of the scapegoats (parents) to attempt to pass blame normally; we don't know if we *need* to ignore the commit until we've checked all of the parents. The blame_entry will get passed up the tree until we find a commit that has a diff chunk that affects those lines. One issue is that the ignored commit *did* make some change, and there is no general solution to finding the line in the parent commit that corresponds to a given line in the ignored commit. That makes it hard to attribute a particular line within an ignored commit's diff correctly. For example, the parent of an ignored commit has this, say at line 11: commit-a 11) #include "a.h" commit-b 12) #include "b.h" Commit X, which we will ignore, swaps these lines: commit-X 11) #include "b.h" commit-X 12) #include "a.h" We can pass that blame entry to the parent, but line 11 will be attributed to commit A, even though "include b.h" came from commit B. The blame mechanism will be looking at the parent's view of the file at line number 11. ignore_blame_entry() is set up to allow alternative algorithms for guessing per-line blames. Any line that is not attributed to the parent will continue to be blamed on the ignored commit as if that commit was not ignored. Upcoming patches have the ability to detect these lines and mark them in the blame output. The existing algorithm is simple: blame each line on the corresponding line in the parent's diff chunk. Any lines beyond that stay with the target. For example, the parent of an ignored commit has this, say at line 11: commit-a 11) void new_func_1(void *x, void *y); commit-b 12) void new_func_2(void *x, void *y); commit-c 13) some_line_c commit-d 14) some_line_d After a commit 'X', we have: commit-X 11) void new_func_1(void *x, commit-X 12) void *y); commit-X 13) void new_func_2(void *x, commit-X 14) void *y); commit-c 15) some_line_c commit-d 16) some_line_d Commit X nets two additionally lines: 13 and 14. The current guess_line_blames() algorithm will not attribute these to the parent, whose diff chunk is only two lines - not four. When we ignore with the current algorithm, we get: commit-a 11) void new_func_1(void *x, commit-b 12) void *y); commit-X 13) void new_func_2(void *x, commit-X 14) void *y); commit-c 15) some_line_c commit-d 16) some_line_d Note that line 12 was blamed on B, though B was the commit for new_func_2(), not new_func_1(). Even when guess_line_blames() finds a line in the parent, it may still be incorrect. Signed-off-by: Barret Rhoden <brho@google.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2019-05-16blame: use a helper function in blame_chunk()Libravatar Barret Rhoden1-16/+28
The same code for splitting a blame_entry at a particular line was used twice in blame_chunk(), and I'll use the helper again in an upcoming patch. Signed-off-by: Barret Rhoden <brho@google.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2019-05-09Merge branch 'en/merge-directory-renames'Libravatar Junio C Hamano1-1/+1
"git merge-recursive" backend recently learned a new heuristics to infer file movement based on how other files in the same directory moved. As this is inherently less robust heuristics than the one based on the content similarity of the file itself (rather than based on what its neighbours are doing), it sometimes gives an outcome unexpected by the end users. This has been toned down to leave the renamed paths in higher/conflicted stages in the index so that the user can examine and confirm the result. * en/merge-directory-renames: merge-recursive: switch directory rename detection default merge-recursive: give callers of handle_content_merge() access to contents merge-recursive: track information associated with directory renames t6043: fix copied test description to match its purpose merge-recursive: switch from (oid,mode) pairs to a diff_filespec merge-recursive: cleanup handle_rename_* function signatures merge-recursive: track branch where rename occurred in rename struct merge-recursive: remove ren[12]_other fields from rename_conflict_info merge-recursive: shrink rename_conflict_info merge-recursive: move some struct declarations together merge-recursive: use 'ci' for rename_conflict_info variable name merge-recursive: rename locals 'o' and 'a' to 'obuf' and 'abuf' merge-recursive: rename diff_filespec 'one' to 'o' merge-recursive: rename merge_options argument from 'o' to 'opt' Use 'unsigned short' for mode, like diff_filespec does
2019-04-25Merge branch 'dk/blame-keep-origin-blob'Libravatar Junio C Hamano1-1/+2
Performance fix around "git blame", especially in a linear history (which is the norm we should optimize for). * dk/blame-keep-origin-blob: blame.c: don't drop origin blobs as eagerly
2019-04-08Use 'unsigned short' for mode, like diff_filespec doesLibravatar Elijah Newren1-1/+1
struct diff_filespec defines mode to be an 'unsigned short'. Several other places in the API which we'd like to interact with using a diff_filespec used a plain unsigned (or unsigned int). This caused problems when taking addresses, so switch to unsigned short. Signed-off-by: Elijah Newren <newren@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2019-04-03blame.c: don't drop origin blobs as eagerlyLibravatar David Kastrup1-1/+2
When a parent blob already has chunks queued up for blaming, dropping the blob at the end of one blame step will cause it to get reloaded right away, doubling the amount of I/O and unpacking when processing a linear history. Keeping such parent blobs in memory seems like a reasonable optimization that should incur additional memory pressure mostly when processing the merges from old branches. Signed-off-by: David Kastrup <dak@gnu.org> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2019-03-07Merge branch 'wh/author-committer-ident-config'Libravatar Junio C Hamano1-1/+2
Four new configuration variables {author,committer}.{name,email} have been introduced to override user.{name,email} in more specific cases. * wh/author-committer-ident-config: config: allow giving separate author and committer idents
2019-02-04config: allow giving separate author and committer identsLibravatar William Hubbs1-1/+2
The author.email, author.name, committer.email and committer.name settings are analogous to the GIT_AUTHOR_* and GIT_COMMITTER_* environment variables, but for the git config system. This allows them to be set separately for each repository. Git supports setting different authorship and committer information with environment variables. However, environment variables are set in the shell, so if different authorship and committer information is needed for different repositories an external tool is required. This adds support to git config for author.email, author.name, committer.email and committer.name settings so this information can be set per repository. Also, it generalizes the fmt_ident function so it can handle author vs committer identification. Signed-off-by: William Hubbs <williamh@gentoo.org> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2019-01-14read-cache.c: kill read_index()Libravatar Nguyễn Thái Ngọc Duy1-2/+2
read_index() shares the same problem as hold_locked_index(): it assumes $GIT_DIR/index. Move all call sites to repo_read_index() instead. read_index_preload() and read_index_unmerged() are also killed as a consequence. Signed-off-by: Nguyễn Thái Ngọc Duy <pclouds@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2018-11-12blame.c: remove implicit dependency the_repositoryLibravatar Nguyễn Thái Ngọc Duy1-17/+22
Signed-off-by: Nguyễn Thái Ngọc Duy <pclouds@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2018-10-19Merge branch 'nd/the-index'Libravatar Junio C Hamano1-30/+33
Various codepaths in the core-ish part learn to work on an arbitrary in-core index structure, not necessarily the default instance "the_index". * nd/the-index: (23 commits) revision.c: reduce implicit dependency the_repository revision.c: remove implicit dependency on the_index ws.c: remove implicit dependency on the_index tree-diff.c: remove implicit dependency on the_index submodule.c: remove implicit dependency on the_index line-range.c: remove implicit dependency on the_index userdiff.c: remove implicit dependency on the_index rerere.c: remove implicit dependency on the_index sha1-file.c: remove implicit dependency on the_index patch-ids.c: remove implicit dependency on the_index merge.c: remove implicit dependency on the_index merge-blobs.c: remove implicit dependency on the_index ll-merge.c: remove implicit dependency on the_index diff-lib.c: remove implicit dependency on the_index read-cache.c: remove implicit dependency on the_index diff.c: remove implicit dependency on the_index grep.c: remove implicit dependency on the_index diff.c: remove the_index dependency in textconv() functions blame.c: rename "repo" argument to "r" combine-diff.c: remove implicit dependency on the_index ...
2018-09-21diff.c: remove implicit dependency on the_indexLibravatar Nguyễn Thái Ngọc Duy1-9/+11
A new variant repo_diff_setup() is added that takes 'struct repository *' and diff_setup() becomes a thin macro around it that is protected by NO_THE_REPOSITORY_COMPATIBILITY_MACROS, similar to NO_THE_INDEX_.... The plan is these macros will always be defined for all library files and the macros are only accessible in builtin/ Signed-off-by: Nguyễn Thái Ngọc Duy <pclouds@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>