author | Elijah Newren <newren@gmail.com> | 2021-02-14 07:51:47 +0000 |
---|---|---|
committer | Junio C Hamano <gitster@pobox.com> | 2021-02-15 18:02:16 -0800 |
commit | a35df3371c2e2e9b407ff8c950169e74f6bf4add (patch) | |
tree | 1fcbc5f408994502d22a84697e22ea44ae0010aa /git-archimport.perl | |
parent | t4001: add a test comparing basename similarity and content similarity (diff) | |
diffcore-rename: compute basenames of source and dest candidates
We want to make use of unique basenames among remaining source and
destination files to help inform rename detection, so that more likely
pairings can be checked first. (src/moduleA/foo.txt and
source/module/A/foo.txt are likely related if there are no other
'foo.txt' files among the remaining deleted and added files.) Add a new
function, not yet used, which creates a map of the unique basenames
within rename_src and another within rename_dst, together with the
indices within rename_src/rename_dst where those basenames show up.
Non-unique basenames still show up in the map, but have an invalid index
(-1).
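
The mapping described above can be sketched in standalone C. This is only an illustration under stated assumptions, not the function this commit adds: the names `basename_entry`, `build_basename_map`, `lookup_index`, and the `NOT_FOUND` marker are hypothetical, a plain array with linear search stands in for whatever map type the real code uses, and the paths are passed as a simple string array rather than as rename_src/rename_dst entries.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define NOT_UNIQUE (-1) /* basename occurs more than once */
#define NOT_FOUND  (-2) /* sketch-only marker: basename never seen */

struct basename_entry {
	const char *base; /* points into the caller's path string */
	int index;        /* position in the path array, or NOT_UNIQUE */
};

/* Return the component after the last '/', or the whole path if none. */
static const char *get_basename(const char *path)
{
	const char *slash = strrchr(path, '/');
	return slash ? slash + 1 : path;
}

/* Linear search; a real implementation would want a hash map here. */
static struct basename_entry *find_entry(struct basename_entry *map,
					 size_t len, const char *base)
{
	size_t i;

	for (i = 0; i < len; i++)
		if (!strcmp(map[i].base, base))
			return &map[i];
	return NULL;
}

/*
 * Fill 'map' (capacity of at least 'n' entries) with one entry per
 * distinct basename found in 'paths'. The first sighting records the
 * path's index; any later sighting overwrites it with NOT_UNIQUE.
 * Returns the number of entries written.
 */
static size_t build_basename_map(const char **paths, size_t n,
				 struct basename_entry *map)
{
	size_t i, len = 0;

	assert(paths && map);
	for (i = 0; i < n; i++) {
		const char *base = get_basename(paths[i]);
		struct basename_entry *e = find_entry(map, len, base);

		if (e) {
			e->index = NOT_UNIQUE;
		} else {
			map[len].base = base;
			map[len].index = (int)i;
			len++;
		}
	}
	return len;
}

/* Look up a basename: its index, NOT_UNIQUE, or NOT_FOUND. */
static int lookup_index(struct basename_entry *map, size_t len,
			const char *base)
{
	struct basename_entry *e = find_entry(map, len, base);

	return e ? e->index : NOT_FOUND;
}
```

In this sketch a caller would build one map from the deleted paths and another from the added paths. A basename whose index is valid (not `NOT_UNIQUE`) in both maps marks a pairing worth checking early.
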
This function was inspired by the fact that in real-world repositories,
files are often moved across directories without changing names. Here
are some sample repositories and the percentage of their historical
renames (as of early 2020) that preserved basenames:
* linux: 76%
* gcc: 64%
* gecko: 79%
* webkit: 89%
These statistics alone don't prove that an optimization in this area
will help or how much it will help, since there are also unpaired adds
and deletes, restrictions on which basenames we consider, etc., but they
certainly motivated the idea to try something in this area.
Signed-off-by: Elijah Newren <newren@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>