From 37a251436447f5190105d01f18c6b398474be3d7 Mon Sep 17 00:00:00 2001 From: Elijah Newren Date: Sat, 27 Feb 2021 00:30:39 +0000 Subject: diffcore-rename: use directory rename guided basename comparisons A previous commit noted that it is very common for people to move files across directories while keeping their filename the same. The last few commits took advantage of this and showed that we can accelerate rename detection significantly using basenames; since files with the same basename serve as likely rename candidates, we can check those first and remove them from the rename candidate pool if they are sufficiently similar. Unfortunately, the previous optimization was limited by the fact that the remaining basenames after exact rename detection are not always unique. Many repositories have hundreds of build files with the same name (e.g. Makefile, .gitignore, build.gradle, etc.), and may even have hundreds of source files with the same name. (For example, the linux kernel has 100 setup.c, 87 irq.c, and 112 core.c files. A repository at $DAYJOB has a lot of ObjectFactory.java and Plugin.java files). For these files with non-unique basenames, we are faced with the task of attempting to determine or guess which directory they may have been relocated to. Such a task is precisely the job of directory rename detection. However, there are two catches: (1) the directory rename detection code has traditionally been part of the merge machinery rather than diffcore-rename.c, and (2) directory rename detection currently runs after regular rename detection is complete. The 1st catch is just an implementation issue that can be overcome by some code shuffling. The 2nd requires us to add a further approximation: we only have access to exact renames at this point, so we need to do directory rename detection based on just exact renames. In some cases we won't have exact renames, in which case this extra optimization won't apply. We also choose to not apply the optimization unless we know that the underlying directory was removed, which will require extra data to be passed in to diffcore_rename_extended(). Also, even if we get a prediction about which directory a file may have relocated to, we will still need to check to see if there is a file in the predicted directory, and then compare the two files to see if they meet the higher min_basename_score threshold required for marking the two files as renames. This commit introduces an idx_possible_rename() function which will do this directory rename detection for us and give us the index within rename_dst of the resulting filename. For now, this function is hardcoded to return -1 (not found) and just hooks up how its results would be used once we have a more complete implementation in place. Reviewed-by: Derrick Stolee Signed-off-by: Elijah Newren Signed-off-by: Junio C Hamano --- Documentation/gitdiffcore.txt | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) (limited to 'Documentation') diff --git a/Documentation/gitdiffcore.txt b/Documentation/gitdiffcore.txt index 80fcf95424..8673a5c5b2 100644 --- a/Documentation/gitdiffcore.txt +++ b/Documentation/gitdiffcore.txt @@ -186,7 +186,7 @@ mark a file pair as a rename and stop considering other candidates for better matches. At most, one comparison is done per file in this preliminary pass; so if there are several remaining ext.txt files throughout the directory hierarchy after exact rename detection, this -preliminary step will be skipped for those files. +preliminary step may be skipped for those files. Note. When the "-C" option is used with `--find-copies-harder` option, 'git diff-{asterisk}' commands feed unmodified filepairs to -- cgit v1.2.3