summaryrefslogtreecommitdiff
AgeCommit message (Collapse)AuthorFilesLines
2021-05-20fast-rebase: change assert() to BUG()Libravatar Elijah Newren1-1/+2
assert() can succinctly document expectations for the code, and do so in a way that may be useful to future folks trying to refactor the code and change basic assumptions; it allows them to more quickly find some places where their violations of previous assumptions trips things up. Unfortunately, assert() can surround a function call with important side-effects, which is a huge mistake since some users will compile with assertions disabled. I've had to debug such mistakes before in other codebases, so I should know better. Luckily, this was only in test code, but it's still very embarrassing. Change an assert() to an if (...) BUG (...). Signed-off-by: Elijah Newren <newren@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-05-20Documentation/technical: describe remembering renames optimizationLibravatar Elijah Newren1-0/+669
Remembering renames on the upstream side of history in an early merge of a rebase or cherry-pick for re-use in a latter merge of the same operation makes pretty good intuitive sense. However, trying to show that it doesn't cause some subtle behavioral difference or some funny edge or corner case is much more involved. And, in fact, it does introduce a subtle behavioral change. Document all the assumptions, special cases, and logic involved in such an optimization, and describe why this optimization is safe under the current optimizations/features/etc. -- even when the subtle behavioral change is triggered. Part of the point of adding this document that goes over the optimization in such laborious detail, is that it is possible that significant future changes (optimizations or feature changes) could interact with this optimization in interesting ways; this document is here to help folks making big changes sanity check that the assumptions and arguments underlying this optimization are still valid. (As a side note, creating this document forced me to review things in sufficient detail that I found I was not properly caching directory-rename-induced renames, resulting in the code not being aware of those renames and causing unnecessary diffcore_rename_extended() calls in subsequent merges.) A subsequent commit will add several testcases based on this document meant to stress-test the implementation and also document the case with the subtle behavioral change. Signed-off-by: Elijah Newren <newren@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-05-04t6423: rename file within directory that other side renamedLibravatar Elijah Newren1-0/+58
Add a new testcase where one side of history renames: olddir/ -> newdir/ and the other side of history renames: olddir/a -> olddir/alpha When using merge.directoryRenames=true, it seems logical to expect the file to end up at newdir/alpha. Unfortunately, both merge-recursive and merge-ort currently see this as a rename/rename conflict: olddir/a -> newdir/a vs. olddir/a -> newdir/alpha Suggesting that there's some extra logic we probably want to add somewhere to allow this case to run without triggering a conflict. For now simply document this known issue. Signed-off-by: Elijah Newren <newren@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-03-20Add testing with merge-ort merge strategyLibravatar Elijah Newren2-0/+3
In preparation for switching from merge-recursive to merge-ort as the default strategy, have the testsuite default to running with merge-ort. Keep coverage of the recursive backend by having the linux-gcc job run with it. Signed-off-by: Elijah Newren <newren@gmail.com> Reviewed-by: Derrick Stolee <dstolee@microsoft.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-03-20t6423: mark remaining expected failure under merge-ort as suchLibravatar Elijah Newren1-1/+1
When we started on merge-ort, thousands of tests failed when run with the GIT_TEST_MERGE_ALGORITHM=ort flag; with so many, it didn't make sense to flip all their test expectations. The ones in t6409, t6418, and the submodule tests are being handled by an independent in-flight topic ("Complete merge-ort implemenation...almost"). The ones in t6423 were left out of the other series because other ongoing series that this commit depends upon were addressing those. Now that we only have one remaining test failure in t6423, let's mark it as such. This remaining test will be fixed by a future optimization series, but since merge-recursive doesn't pass this test either, passing it is not necessary for declaring merge-ort ready for general use. Signed-off-by: Elijah Newren <newren@gmail.com> Reviewed-by: Derrick Stolee <dstolee@microsoft.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-03-20Revert "merge-ort: ignore the directory rename split conflict for now"Libravatar Elijah Newren1-12/+1
This reverts commit 5ced7c3da009090c5a926e3123a71314c7f28d42, which was put in place as a temporary measure to avoid optimizations unstably erroring out on no destination having a majority of the necessary renames for directories that had no new files and thus no need for directory rename detection anyway. Now that optimizations are in place to prevent us from trying to compute directory rename count computations for directories that do not need it, we can undo this temporary measure. Signed-off-by: Elijah Newren <newren@gmail.com> Reviewed-by: Derrick Stolee <dstolee@microsoft.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-03-20merge-recursive: add a bunch of FIXME comments documenting known bugsLibravatar Elijah Newren1-0/+37
The plan is to just delete merge-recursive, but not until everyone is comfortable with merge-ort as a replacement. Given that I haven't switched all callers of merge-recursive over yet (e.g. git-am still uses merge-recursive), maybe there's some value documenting known bugs in the algorithm in case we end up keeping it or someone wants to dig it up in the future. Signed-off-by: Elijah Newren <newren@gmail.com> Reviewed-by: Derrick Stolee <dstolee@microsoft.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-03-20merge-ort: write $GIT_DIR/AUTO_MERGE whenever we hit a conflictLibravatar Elijah Newren6-0/+20
There are a variety of questions users might ask while resolving conflicts: * What changes have been made since the previous (first) parent? * What changes are staged? * What is still unstaged? (or what is still conflicted?) * What changes did I make to resolve conflicts so far? The first three of these have simple answers: * git diff HEAD * git diff --cached * git diff There was no way to answer the final question previously. Adding one is trivial in merge-ort, since it works by creating a tree representing what should be written to the working copy complete with conflict markers. Simply write that tree to .git/AUTO_MERGE, allowing users to answer the fourth question with * git diff AUTO_MERGE I avoided using a name like "MERGE_AUTO", because that would be merge-specific (much like MERGE_HEAD, REBASE_HEAD, REVERT_HEAD, CHERRY_PICK_HEAD) and I wanted a name that didn't change depending on which type of operation the merge was part of. Ensure that paths which clean out other temporary operation-specific files (e.g. CHERRY_PICK_HEAD, MERGE_MSG, rebase-merge/ state directory) also clean out this AUTO_MERGE file. Signed-off-by: Elijah Newren <newren@gmail.com> Reviewed-by: Derrick Stolee <dstolee@microsoft.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-03-20t: mark several submodule merging tests as fixed under merge-ortLibravatar Elijah Newren5-9/+22
merge-ort handles submodules (and directory/file conflicts in general) differently than merge-recursive does; it basically puts all the special handling for different filetypes into one place in the codebase instead of needing special handling for different filetypes in many different code paths. This one code path in merge-ort could perhaps use some work still (there are still test_expect_failure cases in the testsuite), but it passes all the tests that merge-recursive does as well as 12 additional ones that merge-recursive fails. Mark those 12 tests as test_expect_success under merge-ort. Signed-off-by: Elijah Newren <newren@gmail.com> Reviewed-by: Derrick Stolee <dstolee@microsoft.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-03-20merge-ort: implement CE_SKIP_WORKTREE handling with conflicted entriesLibravatar Elijah Newren2-15/+32
When merge conflicts occur in paths removed by a sparse-checkout, we need to unsparsify those paths (clear the SKIP_WORKTREE bit), and write out the conflicted file to the working copy. In the very unlikely case that someone manually put a file into the working copy at the location of the SKIP_WORKTREE file, we need to avoid overwriting whatever edits they have made and move that file to a different location first. Signed-off-by: Elijah Newren <newren@gmail.com> Reviewed-by: Derrick Stolee <dstolee@microsoft.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-03-20t6428: new test for SKIP_WORKTREE handling and conflictsLibravatar Elijah Newren1-0/+158
If there is a conflict during a merge for a SKIP_WORKTREE entry, we expect that file to be written to the working copy and have the SKIP_WORKTREE bit cleared in the index. If the user had manually created a file in the working tree despite SKIP_WORKTREE being set, we do not want to clobber their changes to that file, but want to move it out of the way. Add tests that check for these behaviors. Signed-off-by: Elijah Newren <newren@gmail.com> Reviewed-by: Derrick Stolee <dstolee@microsoft.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-03-20merge-ort: support subtree shiftingLibravatar Elijah Newren1-0/+24
merge-recursive has some simple code to support subtree shifting; copy it over to merge-ort. This fixes t6409.12 under GIT_TEST_MERGE_ALGORITHM=ort. Signed-off-by: Elijah Newren <newren@gmail.com> Reviewed-by: Derrick Stolee <dstolee@microsoft.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-03-20merge-ort: let renormalization change modify/delete into clean deleteLibravatar Elijah Newren1-2/+62
When we have a modify/delete conflict, but the only change to the modification is e.g. change of line endings, then if renormalization is requested then we should be able to recognize such a case as a not-modified/delete and resolve the conflict automatically. This fixes t6418.10 under GIT_TEST_MERGE_ALGORITHM=ort. Signed-off-by: Elijah Newren <newren@gmail.com> Reviewed-by: Derrick Stolee <dstolee@microsoft.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-03-20merge-ort: have ll_merge() use a special attr_index for renormalizationLibravatar Elijah Newren1-2/+62
ll_merge() needs an index when renormalization is requested. Create one specifically for just this purpose with just the one needed entry. This fixes t6418.4 and t6418.5 under GIT_TEST_MERGE_ALGORITHM=ort. NOTE 1: Even if the user has a working copy or a real index (which is not a given as merge-ort can be used in bare repositories), we explicitly ignore any .gitattributes file from either of these locations. merge-ort can be used to merge two branches that are unrelated to HEAD, so .gitattributes from the working copy and current index should not be considered relevant. NOTE 2: Since we are in the middle of merging, there is a risk that .gitattributes itself is conflicted...leaving us with an ill-defined situation about how to perform the rest of the merge. It could be that the .gitattributes file does not even exist on one of the sides of the merge, or that it has been modified on both sides. If it's been modified on both sides, it's possible that it could itself be merged cleanly, though it's also possible that it only merges cleanly if you use the right version of the .gitattributes file to drive the merge. It gets kind of complicated. The only test we ever had that attempted to test behavior in this area was seemingly unaware of the undefined behavior, but knew the test wouldn't work for lack of attribute handling support, marked it as test_expect_failure from the beginning, but managed to fail for several reasons unrelated to attribute handling. See commit 6f6e7cfb52 ("t6038: remove problematic test", 2020-08-03) for details. So there are probably various ways to improve what initialize_attr_index() picks in the case of a conflicted .gitattributes but for now I just implemented something simple -- look for whatever .gitattributes file we can find in any of the higher order stages and use it. Signed-off-by: Elijah Newren <newren@gmail.com> Reviewed-by: Derrick Stolee <dstolee@microsoft.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-03-20merge-ort: add a special minimal index just for renormalizationLibravatar Elijah Newren1-0/+20
renormalize_buffer() requires an index_state, which is something that merge-ort does not operate with. However, all the renormalization code needs is an index with a .gitattributes file...plus a little bit of setup. Create such an index, along with the deallocation and attr_direction handling. A subsequent commit will add a function to finish the initialization of this index. Signed-off-by: Elijah Newren <newren@gmail.com> Reviewed-by: Derrick Stolee <dstolee@microsoft.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-03-20merge-ort: use STABLE_QSORT instead of QSORT where requiredLibravatar Elijah Newren1-1/+7
rename/rename conflict handling depends on the fact that if both sides renamed the same path, that the one on the MERGE_SIDE1 will appear first in the combined diff_queue_struct passed to process_renames(). Since we add all pairs from MERGE_SIDE1 to combined first, and then all pairs from MERGE_SIDE2, and then sort based on filename, this will only be true if the sort used is stable. This was found due to the fact that Mac, unlike Linux, apparently has a system-defined qsort that is not stable. While we are at it, review the other callers of QSORT and add comments about why they can remain as calls to QSORT instead of being modified to call STABLE_QSORT. Signed-off-by: Elijah Newren <newren@gmail.com> Reviewed-by: Derrick Stolee <dstolee@microsoft.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-03-18diffcore-rename: determine which relevant_sources are no longer relevantLibravatar Elijah Newren1-1/+50
As noted a few commits ago ("diffcore-rename: only compute dir_rename_count for relevant directories"), when a source file rename is used as part of directory rename detection, we need to increment counts for each ancestor directory in dirs_removed with value RELEVANT_FOR_SELF. However, a few commits ago ("diffcore-rename: check if we have enough renames for directories early on"), we may have downgraded all relevant ancestor directories from RELEVANT_FOR_SELF to RELEVANT_FOR_ANCESTOR. For a given file, if no ancestor directory is found in dirs_removed with a value of RELEVANT_FOR_SELF, then we can downgrade relevant_source[PATH] from RELEVANT_LOCATION to RELEVANT_NO_MORE. This means we can skip detecting a rename for that particular path (and any other paths in the same directory). For the testcases mentioned in commit 557ac0350d ("merge-ort: begin performance work; instrument with trace2_region_* calls", 2020-10-28), this change improves the performance as follows: Before After no-renames: 5.680 s ± 0.096 s 5.665 s ± 0.129 s mega-renames: 13.812 s ± 0.162 s 11.435 s ± 0.158 s just-one-mega: 506.0 ms ± 3.9 ms 494.2 ms ± 6.1 ms While this improvement looks rather modest for these testcases (because all the previous optimizations were sufficient to nearly remove all time spent in rename detection already), consider this alternative testcase tweaked from the ones in commit 557ac0350d as follows <Same initial setup as commit 557ac0350d, then...> $ git switch -c add-empty-file v5.5 $ >drivers/gpu/drm/i915/new-empty-file $ git add drivers/gpu/drm/i915/new-empty-file $ git commit -m "new file" $ git switch 5.4-rename $ git cherry-pick --strategy=ort add-empty-file For this testcase, we see the following improvement: Before After pick-empty: 1.936 s ± 0.024 s 688.1 ms ± 4.2 ms So roughly a factor of 3 speedup. At $DAYJOB, there was a particular repository and cherry-pick that inspired this optimization; for that case I saw a speedup factor of 7 with this optimization. Signed-off-by: Elijah Newren <newren@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-03-18merge-ort: record the reason that we want a rename for a fileLibravatar Elijah Newren2-5/+16
There are two different reasons we might want a rename for a file -- for three-way content merging or as part of directory rename detection. Record the reason. diffcore-rename will potentially be able to filter some of the ones marked as needed only for directory rename detection, if it can determine those directory renames based solely on renames found via exact rename detection and basename-guided rename detection. Signed-off-by: Elijah Newren <newren@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-03-18diffcore-rename: add computation of number of unknown renamesLibravatar Elijah Newren2-4/+40
The previous commit can only be effective if we have a computation of the number of paths under a given directory which are still have pending renames, and expected this number to be recorded in the dir_rename_count map under the key UNKNOWN_DIR. Add the code necessary to compute these values. Note that this change means dir_rename_count might have a directory whose only entry (for UNKNOWN_DIR) was removed by the time merge-ort goes to check it. To account for this, merge-ort needs to check for the case where the max count is 0. With this change we are now computing the necessary value for each directory in dirs_removed, but are not using that value anywhere. The next two commits will make use of the values stored in dirs_removed in order to compute whether each relevant_source (that is needed only for directory rename detection) has become unnecessary. Signed-off-by: Elijah Newren <newren@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-03-18diffcore-rename: check if we have enough renames for directories early onLibravatar Elijah Newren1-10/+63
As noted in the past few commits, if we can determine that a directory already has enough renames to determine how directory rename detection will be decided for that directory, then we can mark that directory as no longer needing any more renames detected for files underneath it. For such directories, we change the value in the dirs_removed map from RELEVANT_TO_SELF to RELEVANT_FOR_ANCESTOR. A subsequent patch will use this information while iterating over the remaining potential rename sources to mark ones that were only location_relevant as unneeded if no containing directory is still marked as RELEVANT_TO_SELF. Signed-off-by: Elijah Newren <newren@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-03-18diffcore-rename: only compute dir_rename_count for relevant directoriesLibravatar Elijah Newren1-5/+22
When one side adds files to a directory that the other side renamed, directory rename detection is used to either move the new paths to the newer directory or warn the user about the fact that another path location might be better. If a parent of the given directory had new files added to it, any renames in the current directory are also part of determining where the parent directory is renamed to. Thus, naively, we need to record each rename N times for a path at depth N. However, we can use the additional information added to dirs_removed in the last commit to avoid traversing all N parent directories in many cases. Let's use an example to explain how this works. If we have a path named src/old_dir/a/b/file.c and src/old_dir doesn't exist on one side of history, but the other added a file named src/old_dir/newfile.c, then if one side renamed src/old_dir/a/b/file.c => source/new_dir/a/b/file.c then this file would affect potential directory rename detection counts for src/old_dir/a/b => source/new_dir/a/b src/old_dir/a => source/new_dir/a src/old_dir => source/new_dir src => source adding a weight of 1 to each in dir_rename_counts. However, if src/ exists on both sides of history, then we don't need to track any entries for it in dir_rename_counts. That was implemented previously. What we are adding now, is that if no new files were added to src/old_dir/a or src/old_dir/b, then we don't need to have counts in dir_rename_count for those directories either. In short, we only need to track counts in dir_rename_count for directories whose dirs_removed value is RELEVANT_FOR_SELF. And as soon as we reach a directory that isn't in dirs_removed (signalled by returning the default value of NOT_RELEVANT from strintmap_get()), we can stop looking any further up the directory hierarchy. Signed-off-by: Elijah Newren <newren@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-03-18merge-ort: record the reason that we want a rename for a directoryLibravatar Elijah Newren3-4/+46
When one side of history renames a directory, and the other side of history added files to the old directory, directory rename detection is used to warn about the location of the added files so the user can move them to the old directory or keep them with the new one. This sets up three different types of directories: * directories that had new files added to them * directories underneath a directory that had new files added to them * directories where no new files were added to it or any leading path Save this information in dirs_removed; the next several commits will make use of this information. Signed-off-by: Elijah Newren <newren@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-03-18merge-ort, diffcore-rename: tweak dirs_removed and relevant_source typeLibravatar Elijah Newren3-40/+41
As noted in the previous commit, we want to be able to take advantage of the "majority rules" portion of directory rename detection to avoid detecting more renames than necessary. However, for diffcore-rename to take advantage of that, it needs to know whether a rename source file was needed for just directory rename detection reasons, or if it is wanted for potential three-way content merging. Modify relevant_sources from a strset to a strintmap, so we can encode additional information. We also modify dirs_removed from a strset to a strintmap at the same time because trying to determine what files are needed for directory rename detection will require us tracking a bit more information for each directory. This commit only changes the types of the two variables from strset to strintmap; it does not actually store any special values yet and for now only checks for presence of entries in the strintmap. Thus, the code is functionally identical to how it behaved before. Future commits will start associating values with each key for these two maps. Signed-off-by: Elijah Newren <newren@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-03-18diffcore-rename: take advantage of "majority rules" to skip more renamesLibravatar Elijah Newren1-0/+25
In directory rename detection (when a directory is removed on one side of history and the other side adds new files to that directory), we work to find where the greatest number of files within that directory were renamed to so that the new files can be moved with the majority of the files. Naively, we can just do this by detecting renames for *all* files within the removed/renamed directory, looking at all the destination directories where files within that directory were moved, and if there is more than one such directory then taking the one with the greatest number of files as the directory where the old directory was renamed to. However, sometimes there are enough renames from exact rename detection or basename-guided rename detection that we have enough information to determine the majority winner already. Add a function meant to compute whether particular renames are still needed based on this majority rules check. The next several commits will then add the necessary infrastructure to get the information we need to compute which additional rename sources we can skip. An important side note for future further optimization: There is a possible improvement to this optimization that I have not yet attempted and will not be included in this series of patches: we could first check whether exact renames provide enough information for us to determine directory renames, and avoid doing basename-guided rename detection on some or all of the RELEVANT_LOCATION files within those directories. In effect, this variant would mean doing the handle_early_known_dir_renames() both after exact rename detection and again after basename-guided rename detection, though it would also mean decrementing the number of "unknown" renames for each rename we found from basename-guided rename detection. Adding this additional check for skippable renames right after exact rename detection might turn out to be valuable, especially for partial clones where it might allow us to download certain source files entirely. However, this particular optimization was actually the last one I did in original implementation order, and by the time I implemented this idea, every testcase I had was sufficiently fast that further optimization was unwarranted. If future testcases arise that tax rename detection more heavily (or perhaps partial clones can benefit from avoiding loading more objects), it may be worth implementing this more involved variant. Signed-off-by: Elijah Newren <newren@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-03-10diffcore-rename: avoid doing basename comparisons for irrelevant sourcesLibravatar Elijah Newren1-4/+33
The basename comparison optimization implemented in find_basename_matches() is very beneficial since it allows a source to sometimes only be compared with one other file instead of N other files. When a match is found, both a source and destination can be removed from the matrix of inexact rename comparisons. In contrast, the irrelevant source optimization only allows us to remove a source from the matrix of inexact rename comparisons...but it has the advantage of allowing a source file to not even be loaded into memory at all and be compared to 0 other files. Generally, not even comparing is a bigger performance win, so when both optimizations could apply, prefer to use the irrelevant-source optimization. For the testcases mentioned in commit 557ac0350d ("merge-ort: begin performance work; instrument with trace2_region_* calls", 2020-10-28), this change improves the performance as follows: Before After no-renames: 5.708 s ± 0.111 s 5.680 s ± 0.096 s mega-renames: 102.171 s ± 0.440 s 13.812 s ± 0.162 s just-one-mega: 3.471 s ± 0.015 s 506.0 ms ± 3.9 ms Signed-off-by: Elijah Newren <newren@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-03-10merge-ort: skip rename detection entirely if possibleLibravatar Elijah Newren1-0/+44
diffcore_rename_extended() will do a bunch of setup, then check for exact renames, then abort before inexact rename detection if there are no more sources or destinations that need to be matched. It will sometimes be the case, however, that either * we start with neither any sources or destinations * we start with no *relevant* sources In the first of these two cases, the setup and exact rename detection will be very cheap since there are 0 files to operate on. In the second case, it is quite possible to have thousands of files with none of the source ones being relevant. Avoid calling diffcore_rename_extended() or even some of the setup before diffcore_rename_extended() when we can determine that rename detection is unnecessary. For the testcases mentioned in commit 557ac0350d ("merge-ort: begin performance work; instrument with trace2_region_* calls", 2020-10-28), this change improves the performance as follows: Before After no-renames: 6.003 s ± 0.048 s 5.708 s ± 0.111 s mega-renames: 114.009 s ± 0.236 s 102.171 s ± 0.440 s just-one-mega: 3.489 s ± 0.017 s 3.471 s ± 0.015 s Signed-off-by: Elijah Newren <newren@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-03-10merge-ort: use relevant_sources to filter possible rename sourcesLibravatar Elijah Newren1-1/+1
The past several commits determined conditions when rename sources might be needed, and filled a relevant_sources strset with those paths. Pass these along to diffcore_rename_extended() to use to limit the sources that we need to detect renames for. For the testcases mentioned in commit 557ac0350d ("merge-ort: begin performance work; instrument with trace2_region_* calls", 2020-10-28), this change improves the performance as follows: Before After no-renames: 12.596 s ± 0.061 s 6.003 s ± 0.048 s mega-renames: 130.465 s ± 0.259 s 114.009 s ± 0.236 s just-one-mega: 3.958 s ± 0.010 s 3.489 s ± 0.017 s Signed-off-by: Elijah Newren <newren@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-03-10merge-ort: precompute whether directory rename detection is neededLibravatar Elijah Newren1-6/+61
The point of directory rename detection is that if one side of history renames a directory, and the other side adds new files under the old directory, then the merge can move those new files into the new directory. This leads to the following important observation: * If the other side does not add any new files under the old directory, we do not need to detect any renames for that directory. Similarly, directory rename detection had an important requirement: * If a directory still exists on one side of history, it has not been renamed on that side of history. (See section 4 of t6423 or Documentation/technical/directory-rename-detection.txt for more details). Using these two bits of information, we note that directory rename detection is only needed in cases where (1) directories exist in the merge base and on one side of history (i.e. dirmask == 3 or dirmask == 5), and (2) where there is some new file added to that directory on the side where it still exists (thus where the file has filemask == 2 or filemask == 4, respectively). This has to be done in two steps, because we have the dirmask when we are first considering the directory, and won't get the filemasks for the files within it until we recurse into that directory. So, we save dir_rename_mask = dirmask - 1 when we hit a directory that is missing on one side, and then later look for cases of filemask == dir_rename_mask One final note is that as soon as we hit a directory that needs directory rename detection, we will need to detect renames in all subdirectories of that directory as well due to the "majority rules" decision when files are renamed into different directory hierarchies. We arbitrarily use the special value of 0x07 to record when we've hit such a directory. The combination of all the above mean that we introduce a variable named dir_rename_mask (couldn't think of a better name) which has one of the following values as we traverse into a directory: * 0x00: directory rename detection not needed * 0x02 or 0x04: directory rename detection only needed if files added * 0x07: directory rename detection definitely needed We then pass this value through to add_pairs() so that it can mark location_relevant as true only when dir_rename_mask is 0x07. Signed-off-by: Elijah Newren <newren@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-03-10merge-ort: introduce wrappers for alternate tree traversalLibravatar Elijah Newren1-0/+71
Add traverse_trees_wrapper() and traverse_trees_wrapper_callback() functions. The former runs traverse_trees() with info->fn set to traverse_trees_wrapper_callback, in order to simply save all the entries without processing or recursing into any of them. This step allows extra computation to be done (e.g. checking some condition across all files) that can be used later. Then, after that is completed, it iterates over all the saved entries and calls the original info->fn callback with the saved data. Currently, this does nothing more than marginally slowing down the tree traversal since we do not take advantage of the opportunity to compute anything special in traverse_trees_wrapper_callback(), and thus the real callback will be called identically as it would have been without this extra wrapper. However, a subsequent commit will add some special computation of some values that the real callback will be able to use. Signed-off-by: Elijah Newren <newren@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-03-10merge-ort: add data structures for an alternate tree traversalLibravatar Elijah Newren1-0/+26
In order to determine whether directory rename detection is needed, we as a pre-requisite need a way to traverse through all the files in a given tree before visiting any directories within that tree. traverse_trees() only iterates through the entries in the order they appear, so add some data structures that will store all the entries as we iterate through them in traverse_trees(), which will allow us to re-traverse them in our desired order. Signed-off-by: Elijah Newren <newren@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-03-10merge-ort: precompute subset of sources for which we need rename detectionLibravatar Elijah Newren2-3/+103
rename detection works by trying to pair all file deletions (or "sources") with all file additions (or "destinations"), checking similarity, and then marking the sufficiently similar ones as renames. This can be expensive if there are many sources and destinations on a given side of history as it results in an N x M comparison matrix. However, there are many cases where we can compute in advance that detecting renames for some of the sources provides no useful information and thus that we can exclude those sources from the matrix. To see why, first note that the merge machinery uses detected renames in two ways: * directory rename detection: when one side of history renames a directory, and the other side of history adds new files to that directory, we want to be able to warn the user about the need to chose whether those new files stay in the old directory or move to the new one. * three-way content merging: in order to do three-way content merging of files, we need three different file versions. If one side of history renamed a file, then some of the content for the file is found under a different path than in the merge base or on the other side of history. Add a simple testcase showing the two kinds of reasons renames are relevant; it's a testcase that will only pass if we detect both kinds of needed renames. Other than the testcase added above, this commit concentrates just on the three-way content merging; it will punt and mark all sources as needed for directory rename detection, and leave it to future commits to narrow that down more. The point of three-way content merging is to reconcile changes made on *both* sides of history. What if the file wasn't modified on both sides? There are two possibilities: * If it wasn't modified on the renamed side: -> then we get to do exact rename detection, which is cheap. * If it wasn't modified on the unrenamed side: -> then detection of a rename for that source file is irrelevant That latter claim might be surprising at first, so let's walk through a case to show why rename detection for that source file is irrelevant. Let's use two filenames, old.c & new.c, with the following abbreviated object ids (and where the value '000000' is used to denote that the file is missing in that commit): old.c new.c MERGE_BASE: 01d01d 000000 MERGE_SIDE1: 01d01d 000000 MERGE_SIDE2: 000000 5e1ec7 If the rename *isn't* detected: then old.c looks like it was unmodified on one side and deleted on the other and should thus be removed. new.c looks like a new file we should keep as-is. If the rename *is* detected: then a three-way content merge is done. Since the version of the file in MERGE_BASE and MERGE_SIDE1 are identical, the three-way merge will produce exactly the version of the file whose abbreviated object id is 5e1ec7. It will record that file at the path new.c, while removing old.c from the directory. Note that these two results are identical -- a single file named 'new.c' with object id 5e1ec7. In other words, it doesn't matter if the rename is detected in the case where the file is unmodified on the unrenamed side. Use this information to compute whether we need rename detection for each source created in add_pair(). It's probably worth noting that there used to be a few other edge or corner cases besides three-way content merges and directory rename detection where lack of rename detection could have affected the result, but those cases actually highlighted where conflict resolution methods were not consistent with each other. Fixing those inconsistencies were thus critically important to enabling this optimization. That work involved the following: * bringing consistency to add/add, rename/add, and rename/rename conflict types, as done back in the topic merged at commit ac193e0e0a ("Merge branch 'en/merge-path-collision'", 2019-01-04), and further extended in commits 2a7c16c980 ("t6422, t6426: be more flexible for add/add conflicts involving renames", 2020-08-10) and e8eb99d4a6 ("t642[23]: be more flexible for add/add conflicts involving pair renames", 2020-08-10) * making rename/delete more consistent with modify/delete as done in commits 1f3c9ba707 ("t6425: be more flexible with rename/delete conflict messages", 2020-08-10) and 727c75b23f ("t6404, t6423: expect improved rename/delete handling in ort backend", 2020-10-26) Since the set of relevant_sources we compute has not yet been narrowed down for directory rename detection, we do not pass it to diffcore_rename_extended() yet. That will be done after subsequent commits narrow down the list of relevant_sources needed for directory rename detection reasons. Signed-off-by: Elijah Newren <newren@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-03-10diffcore-rename: enable filtering possible rename sourcesLibravatar Elijah Newren3-7/+21
Add the ability to diffcore_rename_extended() to allow external callers to declare that they only need renames detected for a subset of source files, and use that information to skip detecting renames for them. There are two important pieces to this optimization that may not be obvious at first glance: * We do not require callers to just filter the filepairs out to remove the non-relevant sources, because exact rename detection is fast and when it finds a match it can remove both a source and a destination whereas the relevant_sources filter can only remove a source. * We need to filter out the source pairs in a preliminary pass instead of adding a strset_contains(relevant_sources, one->path) check within the nested matrix loop. The reason for that is if we have 30k renames, doing 30k * 30k = 900M strset_contains() calls becomes extraordinarily expensive and defeats the performance gains from this change; we only want to do 30k such calls instead. If callers pass NULL for relevant_sources, that is special cases to treat all sources as relevant. Since all callers currently pass NULL, this optimization does not yet have any effect. Subsequent commits will have merge-ort compute a set of relevant_sources to restrict which sources we detect renames for, and have merge-ort pass that set of relevant_sources to diffcore_rename_extended(). A note about filtering order: Some may be curious why we don't filter out irrelevant sources at the same time we filter out exact renames. While that technically could be done at this point, there are two reasons to defer it: First, was to reinforce a lesson that was too easy to forget. As I mentioned above, in the past I filtered irrelevant sources out before exact rename checking, and then discovered that exact renames' ability to remove both sources and destinations was an important consideration and thus doing the filtering after exact rename checking would speed things up. Then at some point I realized that basename matching could also remove both sources and destinations, and decided to put irrelevant source filtering after basename filtering. That slowed things down a lot. But, despite learning about this important ordering, in later restructuring I forgot and made the same mistake of putting the filtering after basename guided rename detection again. So, I have this series of patches structured to do the irrelevant filtering last to start to show how much extra it costs, and then add relevant filtering in to find_basename_matches() to show how much it speeds things up. Basically, it's a way to reinforce something that apparently was too easy to forget, and make sure the commit messages record this lesson. Second, the items in the "relevant_sources" in this patch series will include all sources that *might be* relevant. It has to be conservative and catch anything that might need a rename, but in the patch series after this one we'll find ways to weed out more of the *might be* relevant ones. Unfortunately, merge-ort does not have sufficient information to weed those ones out, and there isn't enough information at the time of filtering exact renames out to remove the extra ones either. It has to be deferred. So the deferral is in part to simplify some later additions. Signed-off-by: Elijah Newren <newren@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-02-26diffcore-rename: compute dir_rename_guess from dir_rename_countsLibravatar Elijah Newren1-4/+41
dir_rename_counts has a mapping of a mapping, in particular, it has old_dir => { new_dir => count } We want a simple mapping of old_dir => new_dir based on which new_dir had the highest count for a given old_dir. Compute this and store it in dir_rename_guess. This is the final piece of the puzzle needed to make our guesses at which directory files have been moved to when basenames aren't unique. For the testcases mentioned in commit 557ac0350d ("merge-ort: begin performance work; instrument with trace2_region_* calls", 2020-10-28), this change improves the performance as follows: Before After no-renames: 12.775 s ± 0.062 s 12.596 s ± 0.061 s mega-renames: 188.754 s ± 0.284 s 130.465 s ± 0.259 s just-one-mega: 5.599 s ± 0.019 s 3.958 s ± 0.010 s Reviewed-by: Derrick Stolee <dstolee@microsoft.com> Signed-off-by: Elijah Newren <newren@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-02-26diffcore-rename: limit dir_rename_counts computation to relevant dirsLibravatar Elijah Newren1-0/+10
We are using dir_rename_counts to count the number of other directories that files within a directory moved to. We only need this information for directories that disappeared, though, so we can return early from update_dir_rename_counts() for other paths. If dirs_removed is passed to diffcore_rename_extended(), then it provides the relevant bits of information for us to limit this counting to relevant dirs. If dirs_removed is not passed, we would need to compute some replacement in order to do this limiting. Introduce a new info->relevant_source_dirs variable for this purpose, even though at this stage we will only set it to dirs_removed for simplicity. Reviewed-by: Derrick Stolee <dstolee@microsoft.com> Signed-off-by: Elijah Newren <newren@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-02-26diffcore-rename: compute dir_rename_counts in stagesLibravatar Elijah Newren1