summaryrefslogtreecommitdiff
path: root/xdiff
AgeCommit message (Collapse)AuthorFilesLines
2021-07-13Merge branch 'ab/pickaxe-pcre2'Libravatar Junio C Hamano2-1/+3
Rewrite the backend for "diff -G/-S" to use pcre2 engine when available. * ab/pickaxe-pcre2: (22 commits) xdiff-interface: replace discard_hunk_line() with a flag xdiff users: use designated initializers for out_line pickaxe -G: don't special-case create/delete pickaxe -G: terminate early on matching lines xdiff-interface: allow early return from xdiff_emit_line_fn xdiff-interface: prepare for allowing early return pickaxe -S: slightly optimize contains() pickaxe: rename variables in has_changes() for brevity pickaxe -S: support content with NULs under --pickaxe-regex pickaxe: assert that we must have a needle under -G or -S pickaxe: refactor function selection in diffcore-pickaxe() perf: add performance test for pickaxe pickaxe/style: consolidate declarations and assignments diff.h: move pickaxe fields together again pickaxe: die when --find-object and --pickaxe-all are combined pickaxe: die when -G and --pickaxe-regex are combined pickaxe tests: add missing test for --no-pickaxe-regex being an error pickaxe tests: test for -G, -S and --find-object incompatibility pickaxe tests: add test for "log -S" not being a regex pickaxe tests: add test for diffgrep_consume() internals ...
2021-07-08Merge branch 'ab/xdiff-bug-cleanup'Libravatar Junio C Hamano1-14/+8
Code clean-up. * ab/xdiff-bug-cleanup: xdiff: use BUG(...), not xdl_bug(...)
2021-06-08xdiff: use BUG(...), not xdl_bug(...)Libravatar Ævar Arnfjörð Bjarmason1-14/+8
The xdl_bug() function was introduced in e8adf23d1e (xdl_change_compact(): introduce the concept of a change group, 2016-08-22), let's use our usual BUG() function instead. We'll now have meaningful line numbers if we encounter bugs in xdiff, and less code duplication. Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-05-14Merge branch 'pw/patience-diff-clean-up'Libravatar Junio C Hamano1-11/+3
Code clean-up. * pw/patience-diff-clean-up: patience diff: remove unused variable patience diff: remove unnecessary string comparisons
2021-05-11xdiff-interface: replace discard_hunk_line() with a flagLibravatar Ævar Arnfjörð Bjarmason2-1/+3
Remove the dummy discard_hunk_line() function added in 3b40a090fd4 (diff: avoid generating unused hunk header lines, 2018-11-02) in favor of having a new XDL_EMIT_NO_HUNK_HDR flag, for use along with the two existing and similar XDL_EMIT_* flags. Unlike the recently amended xdiff_emit_line_fn interface which'll be called in a loop in xdl_emit_diff(), the hunk header is only emitted once. It makes more sense to pass this as a flag than provide a dummy callback because that function may be able to skip doing certain work if it knows the caller is doing nothing with the hunk header. It would be possible to do so in the case of -U0 now, but the benefit of doing so is so small that I haven't bothered. But this leaves the door open to that, and more importantly makes the API use more intuitive. The reason we're putting a flag in the gap between 1<<0 and 1<<2 is that the old 1<<1 flag was removed in 907681e940d (xdiff: drop XDL_EMIT_COMMON, 2016-02-23) without re-ordering the remaining flags. Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-05-05patience diff: remove unused variableLibravatar Phillip Wood1-3/+0
Signed-off-by: Phillip Wood <phillip.wood@dunelm.org.uk> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-05-05patience diff: remove unnecessary string comparisonsLibravatar Phillip Wood1-8/+3
xdl_prepare_env() calls xdl_classify_record() which arranges for the hashes of non-matching lines to be different so lines can be tested for equality by comparing just their hashes. This reduces the time taken to calculate the diff of v2.28.0 to v2.29.0 by ~3-4%. Signed-off-by: Phillip Wood <phillip.wood@dunelm.org.uk> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-10-20diff: add -I<regex> that ignores matching changesLibravatar Michał Kępień2-2/+49
Add a new diff option that enables ignoring changes whose all lines (changed, removed, and added) match a given regular expression. This is similar to the -I/--ignore-matching-lines option in standalone diff utilities and can be used e.g. to ignore changes which only affect code comments or to look for unrelated changes in commits containing a large number of automatically applied modifications (e.g. a tree-wide string replacement). The difference between -G/-S and the new -I option is that the latter filters output on a per-change basis. Use the 'ignore' field of xdchange_t for marking a change as ignored or not. Since the same field is used by --ignore-blank-lines, identical hunk emitting rules apply for --ignore-blank-lines and -I. These two options can also be used together in the same git invocation (they are complementary to each other). Rename xdl_mark_ignorable() to xdl_mark_ignorable_lines(), to indicate that it is logically a "sibling" of xdl_mark_ignorable_regex() rather than its "parent". Signed-off-by: Michał Kępień <michal@isc.org> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-10-20merge-base, xdiff: zero out xpparam_t structuresLibravatar Michał Kępień2-0/+4
xpparam_t structures are usually zero-initialized before their specific fields are assigned to, but there are three locations in the tree where that does not happen. Add the missing memset() calls in order to make initialization of xpparam_t structures consistent tree-wide and to prevent stack garbage from being used as field values. Signed-off-by: Michał Kępień <michal@isc.org> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2019-12-16Merge branch 'rs/xdiff-ignore-ws-w-func-context'Libravatar Junio C Hamano1-0/+17
The "diff" machinery learned not to lose added/removed blank lines in the context when --ignore-blank-lines and --function-context are used at the same time. * rs/xdiff-ignore-ws-w-func-context: xdiff: unignore changes in function context
2019-12-05xdiff: unignore changes in function contextLibravatar René Scharfe1-0/+17
Changes involving only blank lines are hidden with --ignore-blank-lines, unless they appear in the context lines of other changes. This is handled by xdl_get_hunk() for context added by --inter-hunk-context, -u and -U. Function context for -W and --function-context added by xdl_emit_diff() doesn't pay attention to such ignored changes; it relies fully on xdl_get_hunk() and shows just the post-image of ignored changes appearing in function context. That's inconsistent and confusing. Improve the result of using --ignore-blank-lines and --function-context together by fully showing ignored changes if they happen to fall within function context. Signed-off-by: René Scharfe <l.s.r@web.de> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2019-10-09xdiffi: fix typos and touch up commentsLibravatar Johannes Schindelin1-44/+55
Inspired by the thoroughly stale https://github.com/git/git/pull/159, this patch fixes a couple of typos, rewraps and clarifies some comments. Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2019-07-31Merge branch 'cb/xdiff-no-system-includes-in-dot-c'Libravatar Junio C Hamano3-8/+0
Compilation fix. * cb/xdiff-no-system-includes-in-dot-c: xdiff: remove duplicate headers from xpatience.c xdiff: remove duplicate headers from xhistogram.c xdiff: drop system includes in xutils.c
2019-07-29Merge branch 'jk/xdiff-clamp-funcname-context-index'Libravatar Junio C Hamano1-2/+2
The internal diff machinery can be made to read out of bounds while looking for --funcion-context line in a corner case, which has been corrected. * jk/xdiff-clamp-funcname-context-index: xdiff: clamp function context indices in post-image
2019-07-28xdiff: remove duplicate headers from xpatience.cLibravatar Carlo Marcelo Arenas Belón1-2/+0
92b7de93fb (Implement the patience diff algorithm, 2009-01-07) added them but were already part of xinclude.h Signed-off-by: Carlo Marcelo Arenas Belón <carenas@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2019-07-28xdiff: remove duplicate headers from xhistogram.cLibravatar Carlo Marcelo Arenas Belón1-2/+0
8c912eea94 ("teach --histogram to diff", 2011-07-12) included them, but were already part of xinclude.h Signed-off-by: Carlo Marcelo Arenas Belón <carenas@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2019-07-28xdiff: drop system includes in xutils.cLibravatar Carlo Marcelo Arenas Belón1-4/+0
After b46054b374 ("xdiff: use git-compat-util", 2019-04-11), two system headers added with 6942efcfa9 ("xdiff: load full words in the inner loop of xdl_hash_record", 2012-04-06) to xutils.c are no longer needed and could conflict as shown below from an OpenIndiana build: In file included from xdiff/xinclude.h:26:0, from xdiff/xutils.c:25: ./git-compat-util.h:4:0: warning: "_FILE_OFFSET_BITS" redefined #define _FILE_OFFSET_BITS 64 In file included from /usr/include/limits.h:37:0, from xdiff/xutils.c:23: /usr/include/sys/feature_tests.h:231:0: note: this is the location of the previous definition #define _FILE_OFFSET_BITS 32 Make sure git-compat-util.h is the first header (through xinclude.h) Signed-off-by: Carlo Marcelo Arenas Belón <carenas@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2019-07-23xdiff: clamp function context indices in post-imageLibravatar Jeff King1-2/+2
After finding a function line for --function-context in the pre-image, xdl_emit_diff() calculates the equivalent line in the post-image. It assumes that the lines between changes are the same on both sides. If the option --ignore-blank-lines was also given then this is not necessarily true. Clamp the calculation results for start and end of the function context to prevent out-of-bounds array accesses. Note that this _just_ fixes the case where our mismatch sends us off the beginning of the file. There are likely other cases where our assumption causes us to go to the wrong line within the file. Nobody has developed a test case yet, and the ultimate fix is likely more complicated than this patch. But this at least prevents a segfault in the meantime. Credit for finding the bug goes to "Liu Wei of Tencent Security Xuanwu Lab". Reported-by: 刘炜 <lw17qhdz@gmail.com> Helped-by: René Scharfe <l.s.r@web.de> Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2019-04-12xdiff: use xmalloc/xreallocLibravatar Jeff King1-2/+2
Most of xdiff uses a bare malloc() to allocate memory, and returns an error when we get NULL. However, there are a few spots which don't check the return value and may segfault, including at least xdl_merge() and xpatience.c's find_longest_common_sequence(). Let's use xmalloc() everywhere instead, so that we get a graceful die() for these cases, without having to do further auditing. This does mean the existing cases which check errors will now die() instead of returning an error up the stack. But: - that's how the rest of Git behaves already for malloc errors - all of the callers of xdi_diff(), etc, die upon seeing an error So while we might one day want to fully lib-ify the diff code and make it possible to use as part of a long-running process, we're not close to that now. And because we're just tweaking the xdl_malloc() macro here, we're not really moving ourselves any further away from that. We could, for example, simplify some of the functions which handle malloc() errors which can no longer occur. But that would probably be taking us in the wrong direction. This also makes our malloc handling more consistent with the rest of Git, including enforcing GIT_ALLOC_LIMIT and trying to reclaim pack memory when needed. Reported-by: 王健强 <jianqiang.wang@securitygossip.com> Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2019-04-12xdiff: use git-compat-utilLibravatar Jeff King1-7/+1
Since the xdiff library was not originally part of Git, it does its own system includes. Let's instead use git-compat-util, which has two benefits: 1. It adjusts for any system-specific quirks in how or what we should include (though xdiff's needs are light enough that this hasn't been a problem in the past). 2. It lets us use wrapper functions like xmalloc(). Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2018-11-02xdiff: provide a separate emit callback for hunksLibravatar Jeff King2-5/+22
The xdiff library always emits hunk header lines to our callbacks as formatted strings like "@@ -a,b +c,d @@\n". This is convenient if we're going to output a diff, but less so if we actually need to compute using those numbers, which requires re-parsing the line. In preparation for moving away from this, let's teach xdiff a new callback function which gets the broken-out hunk information. To help callers that don't want to use this new callback, if it's NULL we'll continue to format the hunk header into a string. Note that this function renames the "outf" callback to "out_line", as well. This isn't strictly necessary, but helps in two ways: 1. Now that there are two callbacks, it's nice to use more descriptive names. 2. Many callers did not zero the emit_callback_data struct, and needed to be modified to set ecb.out_hunk to NULL. By changing the name of the existing struct member, that guarantees that any new callers from in-flight topics will break the build and be examined manually. Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2018-08-17Merge branch 'sb/indent-heuristic-optim'Libravatar Junio C Hamano1-1/+11
"git diff --indent-heuristic" had a bad corner case performance. * sb/indent-heuristic-optim: xdiff: reduce indent heuristic overhead
2018-08-15Merge branch 'sb/histogram-less-memory'Libravatar Junio C Hamano1-55/+78
"git diff --histogram" had a bad memory usage pattern, which has been rearranged to reduce the peak usage. * sb/histogram-less-memory: xdiff/histogram: remove tail recursion xdiff/xhistogram: move index allocation into find_lcs xdiff/xhistogram: factor out memory cleanup into free_index() xdiff/xhistogram: pass arguments directly to fall_back_to_classic_diff
2018-08-01xdiff: reduce indent heuristic overheadLibravatar Stefan Beller1-1/+11
Skip searching for better indentation heuristics if we'd slide a hunk more than its size. This is the easiest fix proposed in the analysis[1] in response to a patch that mercurial took for xdiff to limit searching by a constant. Using a performance test as: #!python open('a', 'w').write(" \n" * 1000000) open('b', 'w').write(" \n" * 1000001) This patch reduces the execution of "git diff --no-index a b" from 0.70s to 0.31s. However limiting the sliding to the size of the diff hunk, which was proposed as a solution (that I found easiest to implement for now) is not optimal for cases like open('a', 'w').write(" \n" * 1000000) open('b', 'w').write(" \n" * 2000000) as then we'd still slide 1000000 times. In addition to limiting the sliding to size of the hunk, also limit by a constant. Choose 100 lines as the constant as that fits more than a screen, which really means that the diff sliding is probably not providing a lot of benefit anyway. [1] https://public-inbox.org/git/72ac1ac2-f567-f241-41d6-d0f83072e0b3@alum.mit.edu/ Reported-by: Jun Wu <quark@fb.com> Analysis-by: Michael Haggerty <mhagger@alum.mit.edu> Signed-off-by: Stefan Beller <sbeller@google.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2018-07-23xdiff/histogram: remove tail recursionLibravatar Stefan Beller1-6/+14
When running the same reproduction script as the previous patch, it turns out the stack is too small, which can be easily avoided. Signed-off-by: Stefan Beller <sbeller@google.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2018-07-19xdiff/xhistogram: move index allocation into find_lcsLibravatar Stefan Beller1-43/+53
This fixes a memory issue when recursing a lot, which can be reproduced as seq 1 100000 >one seq 1 4 100000 >two git diff --no-index --histogram one two Before this patch, histogram_diff would call itself recursively before calling free_index, which would mean a lot of memory is allocated during the recursion and only freed afterwards. By moving the memory allocation (and its free call) into find_lcs, the memory is free'd before we recurse, such that memory is reused in the next step of the recursion instead of using new memory. This addresses only the memory pressure, not the run time complexity, that is also awful for the corner case outlined above. Helpful in understanding the code (in addition to the sparse history of this file), was https://stackoverflow.com/a/32367597 which reproduces most of the code comments of the JGit implementation. Signed-off-by: Stefan Beller <sbeller@google.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2018-07-19xdiff/xhistogram: factor out memory cleanup into free_index()Libravatar Stefan Beller1-4/+9
This will be useful in the next patch as we'll introduce multiple callers. Signed-off-by: Stefan Beller <sbeller@google.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2018-07-19xdiff/xhistogram: pass arguments directly to fall_back_to_classic_diffLibravatar Stefan Beller1-11/+11
By passing the 'xpp' and 'env' argument directly to the function 'fall_back_to_classic_diff', we eliminate an occurrence of the 'index' in histogram_diff, which will prove useful in a bit. While at it, move it up in the file. This will make the diff of one of the next patches more legible. Signed-off-by: Stefan Beller <sbeller@google.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2018-07-17xdiff/xdiffi.c: remove unneeded function declarationsLibravatar Stefan Beller1-17/+0
There is no need to forward-declare these functions, as they are used after their implementation only. Signed-off-by: Stefan Beller <sbeller@google.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2018-07-17xdiff/xdiff.h: remove unused flagsLibravatar Stefan Beller1-8/+0
These flags were there since the beginning (3443546f6e (Use a *real* built-in diff generator, 2006-03-24), but were never used. Remove them. Signed-off-by: Stefan Beller <sbeller@google.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2017-12-19Merge branch 'jt/diff-anchored-patience'Libravatar Junio C Hamano2-5/+41
"git diff" learned a variant of the "--patience" algorithm, to which the user can specify which 'unique' line to be used as anchoring points. * jt/diff-anchored-patience: diff: support anchoring line(s)
2017-11-28Merge branch 'rs/include-comments-before-the-function-header'Libravatar Junio C Hamano1-3/+10
"git grep -W", "git diff -W" and their friends learned a heuristic to extend a pre-context beyond the line that matches the "function pattern" (aka "diff.*.xfuncname") to include a comment block, if exists, that immediately precedes it. * rs/include-comments-before-the-function-header: grep: show non-empty lines before functions with -W grep: update boundary variable for pre-context t7810: improve check of -W with user-defined function lines xdiff: show non-empty lines before functions with -W xdiff: factor out is_func_rec() t4051: add test for comments preceding function lines
2017-11-28diff: support anchoring line(s)Libravatar Jonathan Tan2-5/+41
Teach diff a new algorithm, one that attempts to prevent user-specified lines from appearing as a deletion or addition in the end result. The end user can use this by specifying "--anchored=<text>" one or more times when using Git commands like "diff" and "show". Signed-off-by: Jonathan Tan <jonathantanmy@google.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2017-11-27Merge branch 'jc/ignore-cr-at-eol'Libravatar Junio C Hamano2-12/+52
The "diff" family of commands learned to ignore differences in carriage return at the end of line. * jc/ignore-cr-at-eol: diff: --ignore-cr-at-eol xdiff: reassign xpparm_t.flags bits
2017-11-21xdiff: show non-empty lines before functions with -WLibravatar René Scharfe1-0/+3
Non-empty lines before a function definition are most likely comments for that function and thus relevant. Include them in function context. Such a non-empty line might also belong to the preceeding function if there is no separating blank line. Stop extending the context upwards also at the next function line to make sure only one extra function body is shown at most. Original-patch-by: Vegard Nossum <vegard.nossum@oracle.com> Signed-off-by: Rene Scharfe <l.s.r@web.de> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2017-11-21xdiff: factor out is_func_rec()Libravatar René Scharfe1-3/+7
Add a helper for checking if a given record is a function line. It frees callers from having to deal with the buffer arguments of match_func_rec(). Signed-off-by: Rene Scharfe <l.s.r@web.de> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2017-11-09Replace Free Software Foundation address in license noticesLibravatar Todd Zullinger14-28/+28
The mailing address for the FSF has changed over the years. Rather than updating the address across all files, refer readers to gnu.org, as the GNU GPL documentation now suggests for license notices. The mailing address is retained in the full license files (COPYING and LGPL-2.1). The old address is still present in t/diff-lib/COPYING. This is intentional, as the file is used in tests and the contents are not expected to change. Signed-off-by: Todd Zullinger <tmz@pobox.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2017-11-08diff: --ignore-cr-at-eolLibravatar Junio C Hamano2-3/+39
A new option --ignore-cr-at-eol tells the diff machinery to treat a carriage-return at the end of a (complete) line as if it does not exist. Just like other "--ignore-*" options to ignore various kinds of whitespace differences, this will help reviewing the real changes you made without getting distracted by spurious CRLF<->LF conversion made by your editor program. Helped-by: Johannes Schindelin <Johannes.Schindelin@gmx.de> [jch: squashed in command line completion by Dscho] Signed-off-by: Junio C Hamano <gitster@pobox.com>
2017-10-27xdiff: reassign xpparm_t.flags bitsLibravatar Junio C Hamano1-10/+14
We have packed the bits too tightly in such a way that it is not easy to add a new type of whitespace ignoring option, a new type of LCS algorithm, or a new type of post-cleanup heuristics. Reorder bits a bit to give room for these three classes of options to grow. Also make use of XDF_WHITESPACE_FLAGS macro where we check any of these bits are on, instead of using DIFF_XDL_TST() macro on individual possibilities. That way, the "is any of the bits on?" code does not have to change when we add more ways to ignore whitespaces. While at it, add a comment in front of the bit definitions to clarify in which structure these defined bits may appear. Signed-off-by: Junio C Hamano <gitster@pobox.com>
2017-10-10cleanup: fix possible overflow errors in binary searchLibravatar Derrick Stolee1-1/+1
A common mistake when writing binary search is to allow possible integer overflow by using the simple average: mid = (min + max) / 2; Instead, use the overflow-safe version: mid = min + (max - min) / 2; This translation is safe since the operation occurs inside a loop conditioned on "min < max". The included changes were found using the following git grep: git grep '/ *2;' '*.c' Making this cleanup will prevent future review friction when a new binary search is contructed based on existing code. Signed-off-by: Derrick Stolee <dstolee@microsoft.com> Reviewed-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2017-01-15xdiff -W: relax end-of-file function detectionLibravatar Vegard Nossum1-8/+6
When adding a new function to the end of a file, it's enough to know that 1) the addition is at the end of the file; and 2) there is a function _somewhere_ in there. If we had simply been changing the end of an existing function, then we would also be deleting something from the old version. This fixes the case where we add e.g. // Begin of dummy static int dummy(void) { } to the end of the file. Signed-off-by: Vegard Nossum <vegard.nossum@oracle.com> Acked-by: René Scharfe <l.s.r@web.de> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2017-01-10Merge branch 'jc/retire-compaction-heuristics'Libravatar Junio C Hamano2-35/+1
"git diff" and its family had two experimental heuristics to shift the contents of a hunk to make the patch easier to read. One of them turns out to be better than the other, so leave only the "--indent-heuristic" option and remove the other one. * jc/retire-compaction-heuristics: diff: retire "compaction" heuristics
2016-12-23diff: retire "compaction" heuristicsLibravatar Junio C Hamano2-35/+1
When a patch inserts a block of lines, whose last lines are the same as the existing lines that appear before the inserted block, "git diff" can choose any place between these existing lines as the boundary between the pre-context and the added lines (adjusting the end of the inserted block as appropriate) to come up with variants of the same patch, and some variants are easier to read than others. We have been trying to improve the choice of this boundary, and Git 2.11 shipped with an experimental "compaction-heuristic". Since then another attempt to improve the logic further resulted in a new "indent-heuristic" logic. It is agreed that the latter gives better result overall, and the former outlived its usefulness. Retire "compaction", and keep "indent" as an experimental feature. The latter hopefully will be turned on by default in a future release, but that should be done as a separate step. Suggested-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2016-12-06xdiff: drop XDL_FAST_HASHLibravatar Jeff King1-106/+0
The xdiff code hashes every line of both sides of a diff, and then compares those hashes to find duplicates. The overall performance depends both on how fast we can compute the hashes, but also on how many hash collisions we see. The idea of XDL_FAST_HASH is to speed up the hash computation. But the generated hashes have worse collision behavior. This means that in some cases it speeds diffs up (running "git log -p" on git.git improves by ~8% with it), but in others it can slow things down. One pathological case saw over a 100x slowdown[1]. There may be a better hash function that covers both properties, but in the meantime we are better off with the original hash. It's slightly slower in the common case, but it has fewer surprising pathological cases. [1] http://public-inbox.org/git/20141222041944.GA441@peff.net/ Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2016-10-03Merge branch 'mh/diff-indent-heuristic'Libravatar Junio C Hamano1-7/+7
Clean-up for a recently graduated topic. * mh/diff-indent-heuristic: xdiff: rename "struct group" to "struct xdlgroup"
2016-09-29Merge branch 'rs/xdiff-merge-overlapping-hunks-for-W-context' into maintLibravatar Junio C Hamano1-1/+1
"git diff -W" output needs to extend the context backward to include the header line of the current function and also forward to include the body of the entire current function up to the header line of the next one. This process may have to merge to adjacent hunks, but the code forgot to do so in some cases. * rs/xdiff-merge-overlapping-hunks-for-W-context: xdiff: fix merging of hunks with -W context and -u context
2016-09-27xdiff: rename "struct group" to "struct xdlgroup"Libravatar Jeff King1-7/+7
Commit e8adf23 (xdl_change_compact(): introduce the concept of a change group, 2016-08-22) added a "struct group" type to xdiff/xdiffi.c. But the POSIX system header "grp.h" already defines "struct group" (it is part of the getgrnam interface). This happens to work because the new type is local to xdiffi.c, and the xdiff code includes a relatively small set of system headers. But it will break compilation if xdiff ever switches to using git-compat-util.h. It can also probably cause confusion with tools that look at the whole code base, like coccinelle or ctags. Let's resolve by giving the xdiff variant a scoped name, which is closer to other xdiff types anyway (e.g., xdlfile_t, though note that xdiff is fond if typedefs when Git usually is not). Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2016-09-26Merge branch 'mh/diff-indent-heuristic'Libravatar Junio C Hamano2-98/+538
Output from "git diff" can be made easier to read by selecting which lines are common and which lines are added/deleted intelligently when the lines before and after the changed section are the same. A command line option is added to help with the experiment to find a good heuristics. * mh/diff-indent-heuristic: blame: honor the diff heuristic options and config parse-options: add parse_opt_unknown_cb() diff: improve positioning of add/delete blocks in diffs xdl_change_compact(): introduce the concept of a change group recs_match(): take two xrecord_t pointers as arguments is_blank_line(): take a single xrecord_t as argument xdl_change_compact(): only use heuristic if group can't be matched xdl_change_compact(): fix compaction heuristic to adjust ixo
2016-09-21Merge branch 'rs/xdiff-merge-overlapping-hunks-for-W-context'Libravatar Junio C Hamano1-1/+1
"git diff -W" output needs to extend the context backward to include the header line of the current function and also forward to include the body of the entire current function up to the header line of the next one. This process may have to merge to adjacent hunks, but the code forgot to do so in some cases. * rs/xdiff-merge-overlapping-hunks-for-W-context: xdiff: fix merging of hunks with -W context and -u context
2016-09-19diff: improve positioning of add/delete blocks in diffsLibravatar Michael Haggerty2-0/+326
Some groups of added/deleted lines in diffs can be slid up or down, because lines at the edges of the group are not unique. Picking good shifts for such groups is not a matter of correctness but definitely has a big effect on aesthetics. For example, consider the following two diffs. The first is what standard Git emits: --- a/9c572b21dd090a1e5c5bb397053bf8043ffe7fb4:git-send-email.perl +++ b/6dcfa306f2b67b733a7eb2d7ded1bc9987809edb:git-send-email.perl @@ -231,6 +231,9 @@ if (!defined $initial_reply_to && $prompting) { } if (!$smtp_server) { + $smtp_server = $repo->config('sendemail.smtpserver'); +} +if (!$smtp_server) { foreach (qw( /usr/sbin/sendmail /usr/lib/sendmail )) { if (-x $_) { $smtp_server = $_; The following diff is equivalent, but is obviously preferable from an aesthetic point of view: --- a/9c572b21dd090a1e5c5bb397053bf8043ffe7fb4:git-send-email.perl +++ b/6dcfa306f2b67b733a7eb2d7ded1bc9987809edb:git-send-email.perl @@ -230,6 +230,9 @@ if (!defined $initial_reply_to && $prompting) { $initial_reply_to =~ s/(^\s+|\s+$)//g; } +if (!$smtp_server) { + $smtp_server = $repo->config('sendemail.smtpserver'); +} if (!$smtp_server) { foreach (qw( /usr/sbin/sendmail /usr/lib/sendmail )) { if (-x $_) { This patch teaches Git to pick better positions for such "diff sliders" using heuristics that take the positions of nearby blank lines and the indentation of nearby lines into account. The existing Git code basically always shifts such "sliders" as far down in the file as possible. The only exception is when the slider can be aligned with a group of changed lines in the other file, in which case Git favors depicting the change as one add+delete block rather than one add and a slightly offset delete block. This naive algorithm often yields ugly diffs. Commit d634d61ed6 improved the situation somewhat by preferring to position add/delete groups to make their last line a blank line, when that is possible. This heuristic does more good than harm, but (1) it can only help if there are blank lines in the right places, and (2) always picks the last blank line, even if there are others that might be better. The end result is that it makes perhaps 1/3 as many errors as the default Git algorithm, but that still leaves a lot of ugly diffs. This commit implements a new and much better heuristic for picking optimal "slider" positions using the following approach: First observe that each hypothetical positioning of a diff slider introduces two splits: one between the context lines preceding the group and the first added/deleted line, and the other between the last added/deleted line and the first line of context following it. It tries to find the positioning that creates the least bad splits. Splits are evaluated based only on the presence and locations of nearby blank lines, and the indentation of lines near the split. Basically, it prefers to introduce splits adjacent to blank lines, between lines that are indented less, and between lines with the same level of indentation. In more detail: 1. It measures the following characteristics of a proposed splitting position in a `struct split_measurement`: * the number of blank lines above the proposed split * whether the line directly after the split is blank * the number of blank lines following that line * the indentation of the nearest non-blank line above the split * the indentation of the line directly below the split * the indentation of the nearest non-blank line after that line 2. It combines the measured attributes using a bunch of empirically-optimized weighting factors to derive a `struct split_score` that measures the "badness" of splitting the text at that position. 3. It combines the `split_score` for the top and the bottom of the slider at each of its possible positions, and selects the position that has the best `split_score`. I determined the initial set of weighting factors by collecting a corpus of Git histories from 29 open-source software projects in various programming languages. I generated many diffs from this corpus, and determined the best positioning "by eye" for about 6600 diff sliders. I used about half of the repositories in the corpus (corresponding to about 2/3 of the sliders) as a training set, and optimized the weights against this corpus using a crude automated search of the parameter space to get the best agreement with the manually-determined values. Then I tested the resulting heuristic against the full corpus. The results are summarized in the following table, in column `indent-1`: | repository | count | Git 2.9.0 | compaction | compaction-fixed | indent-1 | indent-2 | | --------------------- | ----- | -------------- | -------------- | ---------------- | -------------- | -------------- | | afnetworking | 109 | 89 (81.7%) | 37 (33.9%) | 37 (33.9%) | 2 (1.8%) | 2 (1.8%) | | alamofire | 30 | 18 (60.0%) | 14 (46.7%) | 15 (50.0%) | 0 (0.0%) | 0 (0.0%) | | angular | 184 | 127 (69.0%) | 39 (21.2%) | 23 (12.5%) | 5 (2.7%) | 5 (2.7%) | | animate | 313 | 2 (0.6%) | 2 (0.6%) | 2 (0.6%) | 2 (0.6%) | 2 (0.6%) | | ant | 380 | 356 (93.7%) | 152 (40.0%) | 148 (38.9%) | 15 (3.9%) | 15 (3.9%) | * | bugzilla | 306 | 263 (85.9%) | 109 (35.6%) | 99 (32.4%) | 14 (4.6%) | 15 (4.9%) | * | corefx | 126 | 91 (72.2%) | 22 (17.5%) | 21 (16.7%) | 6 (4.8%) | 6 (4.8%) | | couchdb | 78 | 44 (56.4%) | 26 (33.3%) | 28 (35.9%) | 6 (7.7%) | 6 (7.7%) | * | cpython | 937 | 158 (16.9%) | 50 (5.3%) | 49 (5.2%) | 5 (0.5%) | 5 (0.5%) | * | discourse | 160 | 95 (59.4%) | 42 (26.2%) | 36 (22.5%) | 18 (11.2%) | 13 (8.1%) | | docker | 307 | 194 (63.2%) | 198 (64.5%) | 253 (82.4%) | 8 (2.6%) | 8 (2.6%) | * | electron | 163 | 132 (81.0%) | 38 (23.3%) | 39 (23.9%) | 6 (3.7%) | 6 (3.7%) | | git | 536 | 470 (87.7%) | 73 (13.6%) | 78 (14.6%) | 16 (3.0%) | 16 (3.0%) | * | gitflow | 127 | 0 (0.0%) | 0 (0.0%) | 0 (0.0%) | 0 (0.0%) | 0 (0.0%) | | ionic | 133 | 89 (66.9%) | 29 (21.8%) | 38 (28.6%) | 1 (0.8%) | 1 (0.8%) | | ipython | 482 | 362 (75.1%) | 167 (34.6%) | 169 (35.1%) | 11 (2.3%) | 11 (2.3%) | * | junit | 161 | 147 (91.3%) | 67 (41.6%) | 66 (41.0%) | 1 (0.6%) | 1 (0.6%) | * | lighttable | 15 | 5 (33.3%) | 0 (0.0%) | 2 (13.3%) | 0 (0.0%) | 0 (0.0%) | | magit | 88 | 75 (85.2%) | 11 (12.5%) | 9 (10.2%) | 1 (1.1%) | 0 (0.0%) | | neural-style | 28 | 0 (0.0%) | 0 (0.0%) | 0 (0.0%) | 0 (0.0%) | 0 (0.0%) | | nodejs | 781 | 649 (83.1%) | 118 (15.1%) | 111 (14.2%) | 4 (0.5%) | 5 (0.6%) | * | phpmyadmin | 491 | 481 (98.0%) | 75 (15.3%) | 48 (9.8%) | 2 (0.4%) | 2 (0.4%) | * | react-native | 168 | 130 (77.4%) | 79 (47.0%) | 81 (48.2%) | 0 (0.0%) | 0 (0.0%) | | rust | 171 | 128 (74.9%) | 30 (17.5%) | 27 (15.8%) | 16 (9.4%) | 14 (8.2%) | | spark | 186 | 149 (80.1%) | 52 (28.0%) | 52 (28.0%) | 2 (1.1%) | 2 (1.1%) | | tensorflow | 115 | 66 (57.4%) | 48 (41.7%) | 48 (41.7%) | 5 (4.3%) | 5 (4.3%) | | test-more | 19 | 15 (78.9%) | 2 (10.5%) | 2 (10.5%) | 1 (5.3%) | 1 (5.3%) | * | test-unit | 51 | 34 (66.7%) | 14 (27.5%) | 8 (15.7%) | 2 (3.9%) | 2 (3.9%) | * | xmonad | 23 | 22 (95.7%) | 2 (8.7%) | 2 (8.7%) | 1 (4.3%) | 1 (4.3%) | * | --------------------- | ----- | -------------- | -------------- | ---------------- | -------------- | -------------- | | totals | 6668 | 4391 (65.9%) | 1496 (22.4%) | 1491 (22.4%) | 150 (2.2%) | 144 (2.2%) | | totals (training set) | 4552 | 3195 (70.2%) | 1053 (23.1%) | 1061 (23.3%) | 86 (1.9%) | 88 (1.9%) | | totals (test set) | 2116 | 1196 (56.5%) | 443 (20.9%) | 430 (20.3%) | 64 (3.0%) | 56 (2.6%) | In this table, the numbers are the count and percentage of human-rated sliders that the corresponding algorithm got *wrong*. The columns are * "repository" - the name of the repository used. I used the diffs between successive non-merge commits on the HEAD branch of the corresponding repository. * "count" - the number of sliders that were human-rated. I chose most, but not all, sliders to rate from those among which the various algorithms gave different answers. * "Git 2.9.0" - the default algorithm used by `git diff` in Git 2.9.0. * "compaction" - the heuristic used by `git diff --compaction-heuristic` in Git 2.9.0. * "compaction-fixed" - the heuristic used by `git diff --compaction-heuristic` after the fixes from earlier in this patch series. Note that the results are not dramatically different than those for "compaction". Both produce non-ideal diffs only about 1/3 as often as the default `git diff`. * "indent-1" - the new `--indent-heuristic` algorithm, using the first set of weighting factors, determined as described above. * "indent-2" - the new `--indent-heuristic` algorithm, using the final set of weighting factors, determined as described below. * `*` - indicates that repo was part of training set used to determine the first set of weighting factors. The fact that the heuristic performed nearly as well on the test set as on the training set in column "indent-1" is a good indication that the heuristic was not over-trained. Given that fact, I ran a second round of optimization, using the entire corpus as the training set. The resulting set of weights gave the results in column "indent-2". These are the weights included in this patch. The final result gives consistently and significantly better results across the whole corpus than either `git diff` or `git diff --compaction-heuristic`. It makes only about 1/30 as many errors as the former and about 1/10 as many errors as the latter. (And a good fraction of the remaining errors are for diffs that involve weirdly-formatted code, sometimes apparently machine-generated.) The tools that were used to do this optimization and analysis, along with the human-generated data values, are recorded in a separate project [1]. This patch adds a new command-line option `--indent-heuristic`, and a new configuration setting `diff.indentHeuristic`, that activate this heuristic. This interface is only meant for testing purposes, and should be finalized before including this change in any release. [1] https://github.com/mhagger/diff-slider-tools Signed-off-by: Michael Haggerty <mhagger@alum.mit.edu> Signed-off-by: Junio C Hamano <gitster@pobox.com>