summaryrefslogtreecommitdiff
path: root/merge-ort.c
AgeCommit message (Collapse)AuthorFilesLines
2021-07-28Merge branch 'en/rename-limits-doc'Libravatar Junio C Hamano1-1/+1
Documentation on "git diff -l<n>" and diff.renameLimit have been updated, and the defaults for these limits have been raised. * en/rename-limits-doc: rename: bump limit defaults yet again diffcore-rename: treat a rename_limit of 0 as unlimited doc: clarify documentation for rename/copy limits diff: correct warning message when renameLimit exceeded
2021-07-28Merge branch 'ab/attribute-format'Libravatar Junio C Hamano1-0/+1
Many "printf"-like helper functions we have have been annotated with __attribute__() to catch placeholder/parameter mismatches. * ab/attribute-format: advice.h: add missing __attribute__((format)) & fix usage *.h: add a few missing __attribute__((format)) *.c static functions: add missing __attribute__((format)) sequencer.c: move static function to avoid forward decl *.c static functions: don't forward-declare __attribute__
2021-07-16Merge branch 'ab/struct-init'Libravatar Junio C Hamano1-2/+2
Code cleanup around struct_type_init() functions. * ab/struct-init: string-list.h users: change to use *_{nodup,dup}() string-list.[ch]: add a string_list_init_{nodup,dup}() dir.[ch]: replace dir_init() with DIR_INIT *.c *_init(): define in terms of corresponding *_INIT macro *.h: move some *_INIT to designated initializers
2021-07-16Merge branch 'en/merge-dir-rename-corner-case-fix'Libravatar Junio C Hamano1-1/+5
The merge code had funny interactions between content based rename detection and directory rename detection. * en/merge-dir-rename-corner-case-fix: merge-recursive: handle rename-to-self case merge-ort: ensure we consult df_conflict and path_conflicts t6423: test directory renames causing rename-to-self
2021-07-16Merge branch 'en/ort-perf-batch-13'Libravatar Junio C Hamano1-0/+50
Performance tweaks of "git merge -sort" around lazy fetching of objects. * en/ort-perf-batch-13: merge-ort: add prefetching for content merges diffcore-rename: use a different prefetch for basename comparisons diffcore-rename: allow different missing_object_cb functions t6421: add tests checking for excessive object downloads during merge promisor-remote: output trace2 statistics for number of objects fetched
2021-07-16Merge branch 'en/ort-perf-batch-12'Libravatar Junio C Hamano1-23/+57
More fix-ups and optimization to "merge -sort". * en/ort-perf-batch-12: merge-ort: miscellaneous touch-ups Fix various issues found in comments diffcore-rename: avoid unnecessary strdup'ing in break_idx merge-ort: replace string_list_df_name_compare with faster alternative
2021-07-15rename: bump limit defaults yet againLibravatar Elijah Newren1-1/+1
These were last bumped in commit 92c57e5c1d29 (bump rename limit defaults (again), 2011-02-19), and were bumped both because processors had gotten faster, and because people were getting ugly merges that caused problems and reporting it to the mailing list (suggesting that folks were willing to spend more time waiting). Since that time: * Linus has continued recommending kernel folks to set diff.renameLimit=0 (maps to 32767, currently) * Folks with repositories with lots of renames were happy to set merge.renameLimit above 32767, once the code supported that, to get correct cherry-picks * Processors have gotten faster * It has been discovered that the timing methodology used last time probably used too large example files. The last point is probably worth explaining a bit more: * The "average" file size used appears to have been average blob size in the linux kernel history at the time (probably v2.6.25 or something close to it). * Since bigger files are modified more frequently, such a computation weights towards larger files. * Larger files may be more likely to be modified over time, but are not more likely to be renamed -- the mean and median blob size within a tree are a bit higher than the mean and median of blob sizes in the history leading up to that version for the linux kernel. * The mean blob size in v2.6.25 was half the average blob size in history leading to that point * The median blob size in v2.6.25 was about 40% of the mean blob size in v2.6.25. * Since the mean blob size is more than double the median blob size, any file as big as the mean will not be compared to any files of median size or less (because they'd be more than 50% dissimilar). * Since it is the number of files compared that provides the O(n^2) behavior, median-sized files should matter more than mean-sized ones. The combined effect of the above is that the file size used in past calculations was likely about 5x too large. Combine that with a CPU performance improvement of ~30%, and we can increase the limits by a factor of sqrt(5/(1-.3)) = 2.67, while keeping the original stated time limits. Keeping the same approximate time limit probably makes sense for diff.renameLimit (there is no progress feedback in e.g. git log -p), but the experience above suggests merge.renameLimit could be extended significantly. In fact, it probably would make sense to have an unlimited default setting for merge.renameLimit, but that would likely need to be coupled with changes to how progress is displayed. (See https://lore.kernel.org/git/YOx+Ok%2FEYvLqRMzJ@coredump.intra.peff.net/ for details in that area.) For now, let's just bump the approximate time limit from 10s to 1m. (Note: We do not want to use actual time limits, because getting results that depend on how loaded your system is that day feels bad, and because we don't discover that we won't get all the renames until after we've put in a lot of work rather than just upfront telling the user there are too many files involved.) Using the original time limit of 2s for diff.renameLimit, and bumping merge.renameLimit from 10s to 60s, I found the following timings using the simple script at the end of this commit message (on an AWS c5.xlarge which reports as "Intel(R) Xeon(R) Platinum 8124M CPU @ 3.00GHz"): N Timing 1300 1.995s 7100 59.973s So let's round down to nice even numbers and bump the limits from 400->1000, and from 1000->7000. Here is the measure_rename_perf script (adapted from https://lore.kernel.org/git/20080211113516.GB6344@coredump.intra.peff.net/ in particular to avoid triggering the linear handling from basename-guided rename detection): #!/bin/bash n=$1; shift rm -rf repo mkdir repo && cd repo git init -q -b main mkdata() { mkdir $1 for i in `seq 1 $2`; do (sed "s/^/$i /" <../sample echo tag: $1 ) >$1/$i done } mkdata initial $n git add . git commit -q -m initial mkdata new $n git add . cd new for i in *; do git mv $i $i.renamed; done cd .. git rm -q -rf initial git commit -q -m new time git diff-tree -M -l0 --summary HEAD^ HEAD Signed-off-by: Elijah Newren <newren@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-07-13*.c static functions: add missing __attribute__((format))Libravatar Ævar Arnfjörð Bjarmason1-0/+1
Add missing __attribute__((format)) function attributes to various "static" functions that take printf arguments. Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-07-01string-list.h users: change to use *_{nodup,dup}()Libravatar Ævar Arnfjörð Bjarmason1-2/+2
Change all in-tree users of the string_list_init(LIST, BOOL) API to use string_list_init_{nodup,dup}(LIST) instead. As noted in the preceding commit let's leave the now-unused string_list_init() wrapper in-place for any in-flight users, it can be removed at some later date. Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-06-30merge-ort: ensure we consult df_conflict and path_conflictsLibravatar Elijah Newren1-1/+5
Path conflicts (typically rename path conflicts, e.g. rename/rename(1to2) or rename/add/delete), and directory/file conflicts should obviously result in files not being marked as clean in the merge. We had a codepath where we missed consulting the path_conflict and df_conflict flags, based on match_mask. Granted, it requires an unusual setup to trigger this codepath (directory rename causing rename-to-self is the only case I can think of), but we still need to handle it. To make it clear that we have audited the other codepaths that do not explicitly mention these flags, add some assertions that the flags are not set. Reported-by: Anders Kaseorg <andersk@mit.edu> Signed-off-by: Elijah Newren <newren@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-06-28merge-ort: add prefetching for content mergesLibravatar Elijah Newren1-0/+50
Commit 7fbbcb21b1 ("diff: batch fetching of missing blobs", 2019-04-05) introduced batching of fetching missing blobs, so that the diff machinery would have one fetch subprocess grab N blobs instead of N processes each grabbing 1. However, the diff machinery is not the only thing in a merge that needs to work on blobs. The 3-way content merges need them as well. Rather than download all the blobs 1 at a time, prefetch all the blobs needed for regular content merges. This does not cover all possible paths in merge-ort that might need to download blobs. Others include: - The blob_unchanged() calls to avoid modify/delete conflicts (when blob renormalization results in an "unchanged" file) - Preliminary content merges needed for rename/add and rename/rename(2to1) style conflicts. (Both of these types of conflicts can result in nested conflict markers from the need to do two levels of content merging; the first happens before our new prefetch_for_content_merges() function.) The first of these wouldn't be an extreme amount of work to support, and even the second could be theoretically supported in batching, but all of these cases seem unusual to me, and this is a minor performance optimization anyway; in the worst case we only get some of the fetches batched and have a few additional one-off fetches. So for now, just handle the regular 3-way content merges in our prefetching. For the testcase from the previous commit, the number of downloaded objects remains at 63, but this drops the number of fetches needed from 32 down to 20, a sizeable reduction. Signed-off-by: Elijah Newren <newren@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-06-14Merge branch 'en/ort-perf-batch-11'Libravatar Junio C Hamano1-12/+319
Optimize out repeated rename detection in a sequence of mergy operations. * en/ort-perf-batch-11: merge-ort, diffcore-rename: employ cached renames when possible merge-ort: handle interactions of caching and rename/rename(1to1) cases merge-ort: add helper functions for using cached renames merge-ort: preserve cached renames for the appropriate side merge-ort: avoid accidental API mis-use merge-ort: add code to check for whether cached renames can be reused merge-ort: populate caches of rename detection results merge-ort: add data structures for in-memory caching of rename detection t6429: testcases for remembering renames fast-rebase: write conflict state to working tree, index, and HEAD fast-rebase: change assert() to BUG() Documentation/technical: describe remembering renames optimization t6423: rename file within directory that other side renamed
2021-06-09merge-ort: miscellaneous touch-upsLibravatar Elijah Newren1-0/+5
Add some notes in the code about invariants with match_mask when adding pairs. Also add a comment that seems to have been left out in my work of pushing these changes upstream. Signed-off-by: Elijah Newren <newren@gmail.com> Reviewed-by: Derrick Stolee <dstolee@microsoft.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-06-09Fix various issues found in commentsLibravatar Elijah Newren1-3/+5
A random hodge-podge of incorrect or out-of-date comments that I found: * t6423 had a comment that has referred to the wrong test for years; fix it to refer to the right one. * diffcore-rename had a FIXME comment meant to remind myself to investigate if I could make another code change. I later investigated and removed the FIXME, but while cherry-picking the patch to submit upstream I missed the later update. Remove the comment now. * merge-ort had the early part of a comment for a function; I had meant to include the more involved description when I updated the function. Update the comment now. Signed-off-by: Elijah Newren <newren@gmail.com> Reviewed-by: Derrick Stolee <dstolee@microsoft.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-06-09merge-ort: replace string_list_df_name_compare with faster alternativeLibravatar Elijah Newren1-20/+47
Gathering accumulated times from trace2 output on the mega-renames testcase, I saw the following timings (where I'm only showing a few lines to highlight the portions of interest): 10.120 : label:incore_nonrecursive 4.462 : ..label:process_entries 3.143 : ....label:process_entries setup 2.988 : ......label:plist special sort 1.305 : ....label:processing 2.604 : ..label:collect_merge_info 2.018 : ..label:merge_start 1.018 : ..label:renames In the above output, note that the 4.462 seconds for process_entries was split as 3.143 seconds for "process_entries setup" and 1.305 seconds for "processing" (and a little time for other stuff removed from the highlight). Most of the "process_entries setup" time was spent on "plist special sort" which corresponds to the following code: trace2_region_enter("merge", "plist special sort", opt->repo); plist.cmp = string_list_df_name_compare; string_list_sort(&plist); trace2_region_leave("merge", "plist special sort", opt->repo); In other words, in a merge strategy that would be invoked by passing "-sort" to either rebase or merge, sorting an array takes more time than anything else. Serves me right for naming my merge strategy this way. Rewrite the comparison function in a way that does not require finding out the lengths of the strings when comparing them. While at it, tweak the code for our specific case -- no need to handle a variety of modes, for example. The combination of these changes reduced the time spent in "plist special sort" by ~25% in the mega-renames case. For the testcases mentioned in commit 557ac0350d ("merge-ort: begin performance work; instrument with trace2_region_* calls", 2020-10-28), this change improves the performance as follows: Before After no-renames: 5.622 s ± 0.059 s 5.235 s ± 0.042 s mega-renames: 10.127 s ± 0.073 s 9.419 s ± 0.107 s just-one-mega: 500.3 ms ± 3.8 ms 480.1 ms ± 3.9 ms Signed-off-by: Elijah Newren <newren@gmail.com> Reviewed-by: Derrick Stolee <dstolee@microsoft.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-05-20merge-ort, diffcore-rename: employ cached renames when possibleLibravatar Elijah Newren1-5/+42
When there are many renames between the old base of a series of commits and the new base, the way sequencer.c, merge-recursive.c, and diffcore-rename.c have traditionally split the work resulted in redetecting the same renames with each and every commit being transplanted. To address this, the last several commits have been creating a cache of rename detection results, determining when it was safe to use such a cache in subsequent merge operations, adding helper functions, and so on. See the previous half dozen commit messages for additional discussion of this optimization, particularly the message a few commits ago entitled "add code to check for whether cached renames can be reused". This commit finally ties all of that work together, modifying the merge algorithm to make use of these cached renames. For the testcases mentioned in commit 557ac0350d ("merge-ort: begin performance work; instrument with trace2_region_* calls", 2020-10-28), this change improves the performance as follows: Before After no-renames: 5.665 s ± 0.129 s 5.622 s ± 0.059 s mega-renames: 11.435 s ± 0.158 s 10.127 s ± 0.073 s just-one-mega: 494.2 ms ± 6.1 ms 500.3 ms ± 3.8 ms That's a fairly small improvement, but mostly because the previous optimizations were so effective for these particular testcases; this optimization only kicks in when the others don't. If we undid the basename-guided rename detection and skip-irrelevant-renames optimizations, then we'd see that this series by itself improved performance as follows: Before Basename Series After Just This Series no-renames: 13.815 s ± 0.062 s 5.697 s ± 0.080 s mega-renames: 1799.937 s ± 0.493 s 205.709 s ± 0.457 s Since this optimization kicks in to help accelerate cases where the previous optimizations do not apply, this last comparison shows that this cached-renames optimization has the potential to help signficantly in cases that don't meet the requirements for the other optimizations to be effective. The changes made in this optimization also lay some important groundwork for a future optimization around having collect_merge_info() avoid recursing into subtrees in more cases. However, for this optimization to be effective, merge_switch_to_result() should only be called when the rebase or cherry-pick operation has either completed or hit a case where the user needs to resolve a conflict or edit the result. If it is called after every commit, as sequencer.c does, then the working tree and index are needlessly updated with every commit and the cached metadata is tossed, defeating this optimization. Refactoring sequencer.c to only call merge_switch_to_result() at the end of the operation is a bigger undertaking, and the practical benefits of this optimization will not be realized until that work is performed. Since `test-tool fast-rebase` only updates at the end of the operation, it was used to obtain the timings above. Signed-off-by: Elijah Newren <newren@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-05-20merge-ort: handle interactions of caching and rename/rename(1to1) casesLibravatar Elijah Newren1-1/+29
As documented in Documentation/technical/remembering-renames.txt, and as tested for in the two testcases in t6429 with "rename same file identically" in their description, there is one case where we need to have renames in one commit NOT be cached for the next commit in our rebase sequence -- namely, rename/rename(1to1) cases. Rather than specifically trying to uncache those and fix up dir_rename_counts() to match (which would also be valid but more work), we simply disable the optimization when this really rare type of rename occurs. Signed-off-by: Elijah Newren <newren@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-05-20merge-ort: add helper functions for using cached renamesLibravatar Elijah Newren1-0/+47
If we have a usable rename cache, then we can remove from relevant_sources all the paths that were cached; diffcore_rename_extended() can then consider an even smaller set of relevant_sources in its rename detection. However, when diffcore_rename_extended() is done, we will need to take the renames it detected and then add back in all the ones we had cached from before. Add helper functions for doing these two operations; the next commit will make use of them. Signed-off-by: Elijah Newren <newren@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-05-20merge-ort: preserve cached renames for the appropriate sideLibravatar Elijah Newren1-9/+11
Previous commits created an in-memory cache of the results of rename detection, and added logic to detect when that cache could appropriately be used in a subsequent merge operation -- but we were still unconditionally clearing the cache with each new merge operation anyway. If it is valid to reuse the cache from one of the two sides of history, preserve that side. Signed-off-by: Elijah Newren <newren@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-05-20merge-ort: avoid accidental API mis-useLibravatar Elijah Newren1-0/+7
Previously, callers of the merge-ort API could have passed an uninitialized value for struct merge_result *result. However, we want to check result to see if it has cached renames from a previous merge that we can reuse; such values would be found behind result->priv. However, if result->priv is uninitialized, attempting to access behind it will give a segfault. So, we need result->priv to be NULL (which will be the case if the caller does a memset(&result, 0)), or be written by a previous call to the merge-ort machinery. Documenting this requirement may help, but despite being the person who introduced this requirement, I still missed it once and it did not fail in a very clear way and led to a long debugging session. Add a _properly_initialized field to merge_result; that value will be 0 if the caller zero'ed the merge_result, it will be set to a very specific value by a previous run by the merge-ort machinery, and if it's uninitialized it will most likely either be 0 or some value that does not match the specific one we'd expect allowing us to throw a much more meaningful error. Signed-off-by: Elijah Newren <newren@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-05-20merge-ort: add code to check for whether cached renames can be reusedLibravatar Elijah Newren1-2/+64
We need to know when renames detected in a previous merge operation can be reused in a later merge operation. Consider the following setup (from the git-rebase manpage): A---B---C topic / D---E---F---G master After rebasing, this will appear as: A'--B'--C' topic / D---E---F---G master Further, let's say that 'oldfile' was renamed to 'newfile' between E and G. The rebase or cherry-pick of A onto G will involve a three-way merge between E (as the merge base) and G and A. After detecting the rename between E:oldfile and G:newfile, there will be a three-way content merge of the following: E:oldfile G:newfile A:oldfile and produce a new result: A':newfile Now, when we want to pick B onto A', we will need to do a three-way merge between A (as the merge-base) and A' and B. This will involve a three-way content merge of A:oldfile A':newfile B:oldfile but only if we can detect that A:oldfile is similar enough to A':newfile to be used together in a three-way content merge, i.e. only if we can detect that A:oldfile and A':newfile are a rename. But we already know that A:oldfile and A':newfile are similar enough to be used in a three-way content merge, because that is precisely where A':newfile came from in the previous merge. Note that A & A' both appear in both merges. That gives us the condition under which we can reuse renames. There are a couple important points about this optimization: - If the rebase or cherry-pick halts for user conflicts, these caches are NOT saved anywhere. Thus, resuming a halted rebase or cherry-pick will result in no reused renames for the next commit. This is intentional, as user resolution can change files significantly and in ways that violate the similarity assumptions here. - Technically, in a *very* narrow case this might give slightly different results for rename detection. Using the example above, if: * E:oldfile had 20 lines * G:newfile added 10 new lines at the beginning of the file * A:oldfile deleted all but the first three lines of the file then => A':newfile would have 13 lines, 3 of which matches those in A:oldfile. Consider the two cases: * Without this optimization: - the next step of the rebase operation (moving B to B') would not detect the rename betwen A:oldfile and A':newfile - we'd thus get a modify/delete conflict with the rebase operation halting for the user to resolve, and have both A':newfile and B:oldfile sitting in the working tree. * With this optimization: - the rename between A:oldfile and A':newfile would be detected via the cache of renames - a three-way merge between A:oldfile, A':newfile, and B:oldfile would commence and be written to A':newfile Now, is the difference in behavior a bug...or a bugfix? I can't tell. Given that A:oldfile and A':newfile are not very similar, when we three-way merge with B:oldfile it seems likely we'll hit a conflict for the user to resolve. And it shouldn't be too hard for users to see why we did that three-way merge; oldfile and newfile *were* renames somewhere in the sequence. So, most of these corner cases will still behave similarly -- namely, a conflict given to the user to resolve. Also, consider the interesting case when commit B is a clean revert of commit A. Without this optimization, a rebase could not both apply a weird patch like A and then immediately revert it; users would be forced to resolve merge conflicts. With this optimization, it would successfully apply the clean revert. So, there is certainly at least one case that behaves better. Even if it's considered a "difference in behavior", I think both behaviors are reasonable, and the time savings provided by this optimization justify using the slightly altered rename heuristics. Signed-off-by: Elijah Newren <newren@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-05-20merge-ort: populate caches of rename detection resultsLibravatar Elijah Newren1-1/+72
Fill in cache_pairs, cached_target_names, and cached_irrelevant based on rename detection results. Future commits will make use of these values. Signed-off-by: Elijah Newren <newren@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-05-20merge-ort: add data structures for in-memory caching of rename detectionLibravatar Elijah Newren1-0/+53
When there are many renames between the old base of a series of commits and the new base for a series of commits, the sequence of merges employed to transplant those commits (from a cherry-pick or rebase operation) will repeatedly detect the exact same renames. This is wasted effort. Add data structures which will be used to cache rename detection results, along with the initialization and deallocation of these data structures. Future commits will populate these caches, detect the appropriate circumstances when they can be used, and employ them to avoid re-detecting the same renames repeatedly. Signed-off-by: Elijah Newren <newren@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-05-16Merge branch 'ah/merge-ort-i18n'Libravatar Junio C Hamano1-6/+15
An i18n fix. * ah/merge-ort-i18n: merge-ort: split "distinct types" message into two translatable messages
2021-05-11merge-ort: split "distinct types" message into two translatable messagesLibravatar Alex Henrie1-6/+15
The word "renamed" has two possible translations in many European languages depending on whether one thing was renamed or two things were renamed. Give translators freedom to alter any part of the message to make it sound right in their language. Signed-off-by: Alex Henrie <alexhenrie24@gmail.com> Acked-by: Elijah Newren <newren@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-05-10Merge branch 'bc/hash-transition-interop-part-1'Libravatar Junio C Hamano1-10/+10
SHA-256 transition. * bc/hash-transition-interop-part-1: hex: print objects using the hash algorithm member hex: default to the_hash_algo on zero algorithm value builtin/pack-objects: avoid using struct object_id for pack hash commit-graph: don't store file hashes as struct object_id builtin/show-index: set the algorithm for object IDs hash: provide per-algorithm null OIDs hash: set, copy, and use algo field in struct object_id builtin/pack-redundant: avoid casting buffers to struct object_id Use the final_oid_fn to finalize hashing of object IDs hash: add a function to finalize object IDs http-push: set algorithm when reading object ID Always use oidread to read into struct object_id hash: add an algo member to struct object_id
2021-04-30Merge branch 'ds/sparse-index-protections'Libravatar Junio C Hamano1-1/+1
Builds on top of the sparse-index infrastructure to mark operations that are not ready to mark with the sparse index, causing them to fall back on fully-populated index that they always have worked with. * ds/sparse-index-protections: (47 commits) name-hash: use expand_to_path() sparse-index: expand_to_path() name-hash: don't add directories to name_hash revision: ensure full index resolve-undo: ensure full index read-cache: ensure full index pathspec: ensure full index merge-recursive: ensure full index entry: ensure full index dir: ensure full index update-index: ensure full index stash: ensure full index rm: ensure full index merge-index: ensure full index ls-files: ensure full index grep: ensure full index fsck: ensure full index difftool: ensure full index commit: ensure full index checkout: ensure full index ...
2021-04-27hash: provide per-algorithm null OIDsLibravatar brian m. carlson1-10/+10
Up until recently, object IDs did not have an algorithm member, only a hash. Consequently, it was possible to share one null (all-zeros) object ID among all hash algorithms. Now that we're going to be handling objects from multiple hash algorithms, it's important to make sure that all object IDs have a correct algorithm field. Introduce a per-algorithm null OID, and add it to struct hash_algo. Introduce a wrapper function as well, and use it everywhere we used to use the null_oid constant. Signed-off-by: brian m. carlson <sandals@crustytoothpaste.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-04-16Merge branch 'ah/merge-ort-ubsan-fix'Libravatar Junio C Hamano1-14/+6
Code clean-up for merge-ort backend. * ah/merge-ort-ubsan-fix: merge-ort: only do pointer arithmetic for non-empty lists
2021-04-16Merge branch 'en/ort-readiness'Libravatar Junio C Hamano1-28/+215
Plug the ort merge backend throughout the rest of the system, and start testing it as a replacement for the recursive backend. * en/ort-readiness: Add testing with merge-ort merge strategy t6423: mark remaining expected failure under merge-ort as such Revert "merge-ort: ignore the directory rename split conflict for now" merge-recursive: add a bunch of FIXME comments documenting known bugs merge-ort: write $GIT_DIR/AUTO_MERGE whenever we hit a conflict t: mark several submodule merging tests as fixed under merge-ort merge-ort: implement CE_SKIP_WORKTREE handling with conflicted entries t6428: new test for SKIP_WORKTREE handling and conflicts merge-ort: support subtree shifting merge-ort: let renormalization change modify/delete into clean delete merge-ort: have ll_merge() use a special attr_index for renormalization merge-ort: add a special minimal index just for renormalization merge-ort: use STABLE_QSORT instead of QSORT where required
2021-04-16Merge branch 'en/ort-perf-batch-10'Libravatar Junio C Hamano1-18/+61
Various rename detection optimization to help "ort" merge strategy backend. * en/ort-perf-batch-10: diffcore-rename: determine which relevant_sources are no longer relevant merge-ort: record the reason that we want a rename for a file diffcore-rename: add computation of number of unknown renames diffcore-rename: check if we have enough renames for directories early on diffcore-rename: only compute dir_rename_count for relevant directories merge-ort: record the reason that we want a rename for a directory merge-ort, diffcore-rename: tweak dirs_removed and relevant_source type diffcore-rename: take advantage of "majority rules" to skip more renames
2021-04-12merge-ort: only do pointer arithmetic for non-empty listsLibravatar Andrzej Hunt1-13/+5
versions could be an empty string_list. In that case, versions->items is NULL, and we shouldn't be trying to perform pointer arithmetic with it (as that results in undefined behaviour). Moreover we only use the results of this calculation once when calling QSORT. Therefore we choose to skip creating relevant_entries and call QSORT directly with our manipulated pointers (but only if there's data requiring sorting). This lets us avoid abusing the string_list API, and saves us from having to explain why this abuse is OK. Finally, an assertion is added to make sure that write_tree() is called with a valid offset. This issue has probably existed since: ee4012dcf9 (merge-ort: step 2 of tree writing -- function to create tree object, 2020-12-13) But it only started occurring during tests since tests started using merge-ort: f3b964a07e (Add testing with merge-ort merge strategy, 2021-03-20) For reference - here's the original UBSAN commit that implemented this check, it sounds like this behaviour isn't actually likely to cause any issues (but we might as well fix it regardless): https://reviews.llvm.org/D67122 UBSAN output from t3404 or t5601: merge-ort.c:2669:43: runtime error: applying zero offset to null pointer #0 0x78bb53 in write_tree merge-ort.c:2669:43 #1 0x7856c9 in process_entries merge-ort.c:3303:2 #2 0x782317 in merge_ort_nonrecursive_internal merge-ort.c:3744:2 #3 0x77feef in merge_incore_nonrecursive merge-ort.c:3853:2 #4 0x6f6a5c in do_recursive_merge sequencer.c:640:3 #5 0x6f6a5c in do_pick_commit sequencer.c:2221:9 #6 0x6ef055 in single_pick sequencer.c:4814:9 #7 0x6ef055 in sequencer_pick_revisions sequencer.c:4867:10 #8 0x4fb392 in run_sequencer revert.c:225:9 #9 0x4fa5b0 in cmd_revert revert.c:235:8 #10 0x42abd7 in run_builtin git.c:453:11 #11 0x429531 in handle_builtin git.c:704:3 #12 0x4282fb in run_argv git.c:771:4 #13 0x4282fb in cmd_main git.c:902:19 #14 0x524b63 in main common-main.c:52:11 #15 0x7fc2ca340349 in __libc_start_main (/lib64/libc.so.6+0x24349) #16 0x4072b9 in _start start.S:120 SUMMARY: UndefinedBehaviorSanitizer: undefined-behavior merge-ort.c:2669:43 in Signed-off-by: Andrzej Hunt <ajrhunt@google.com> Reviewed-by: Elijah Newren <newren@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-04-08Merge branch 'en/ort-perf-batch-9'Libravatar Junio C Hamano1-4/+230
The ort merge backend has been optimized by skipping irrelevant renames. * en/ort-perf-batch-9: diffcore-rename: avoid doing basename comparisons for irrelevant sources merge-ort: skip rename detection entirely if possible merge-ort: use relevant_sources to filter possible rename sources merge-ort: precompute whether directory rename detection is needed merge-ort: introduce wrappers for alternate tree traversal merge-ort: add data structures for an alternate tree traversal merge-ort: precompute subset of sources for which we need rename detection diffcore-rename: enable filtering possible rename sources
2021-03-22Merge branch 'en/ort-perf-batch-8'Libravatar Junio C Hamano1-138/+6
Rename detection rework continues. * en/ort-perf-batch-8: diffcore-rename: compute dir_rename_guess from dir_rename_counts diffcore-rename: limit dir_rename_counts computation to relevant dirs diffcore-rename: compute dir_rename_counts in stages diffcore-rename: extend cleanup_dir_rename_info() diffcore-rename: move dir_rename_counts into dir_rename_info struct diffcore-rename: add function for clearing dir_rename_count Move computation of dir_rename_count from merge-ort to diffcore-rename diffcore-rename: add a mapping of destination names to their indices diffcore-rename: provide basic implementation of idx_possible_rename() diffcore-rename: use directory rename guided basename comparisons
2021-03-20Revert "merge-ort: ignore the directory rename split conflict for now"Libravatar Elijah Newren1-12/+1
This reverts commit 5ced7c3da009090c5a926e3123a71314c7f28d42, which was put in place as a temporary measure to avoid optimizations unstably erroring out on no destination having a majority of the necessary renames for directories that had no new files and thus no need for directory rename detection anyway. Now that optimizations are in place to prevent us from trying to compute directory rename count computations for directories that do not need it, we can undo this temporary measure. Signed-off-by: Elijah Newren <newren@gmail.com> Reviewed-by: Derrick Stolee <dstolee@microsoft.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-03-20merge-ort: write $GIT_DIR/AUTO_MERGE whenever we hit a conflictLibravatar Elijah Newren1-0/+10
There are a variety of questions users might ask while resolving conflicts: * What changes have been made since the previous (first) parent? * What changes are staged? * What is still unstaged? (or what is still conflicted?) * What changes did I make to resolve conflicts so far? The first three of these have simple answers: * git diff HEAD * git diff --cached * git diff There was no way to answer the final question previously. Adding one is trivial in merge-ort, since it works by creating a tree representing what should be written to the working copy complete with conflict markers. Simply write that tree to .git/AUTO_MERGE, allowing users to answer the fourth question with * git diff AUTO_MERGE I avoided using a name like "MERGE_AUTO", because that would be merge-specific (much like MERGE_HEAD, REBASE_HEAD, REVERT_HEAD, CHERRY_PICK_HEAD) and I wanted a name that didn't change depending on which type of operation the merge was part of. Ensure that paths which clean out other temporary operation-specific files (e.g. CHERRY_PICK_HEAD, MERGE_MSG, rebase-merge/ state directory) also clean out this AUTO_MERGE file. Signed-off-by: Elijah Newren <newren@gmail.com> Reviewed-by: Derrick Stolee <dstolee@microsoft.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-03-20merge-ort: implement CE_SKIP_WORKTREE handling with conflicted entriesLibravatar Elijah Newren1-13/+30
When merge conflicts occur in paths removed by a sparse-checkout, we need to unsparsify those paths (clear the SKIP_WORKTREE bit), and write out the conflicted file to the working copy. In the very unlikely case that someone manually put a file into the working copy at the location of the SKIP_WORKTREE file, we need to avoid overwriting whatever edits they have made and move that file to a different location first. Signed-off-by: Elijah Newren <newren@gmail.com> Reviewed-by: Derrick Stolee <dstolee@microsoft.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-03-20merge-ort: support subtree shiftingLibravatar Elijah Newren1-0/+24
merge-recursive has some simple code to support subtree shifting; copy it over to merge-ort. This fixes t6409.12 under GIT_TEST_MERGE_ALGORITHM=ort. Signed-off-by: Elijah Newren <newren@gmail.com> Reviewed-by: Derrick Stolee <dstolee@microsoft.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-03-20merge-ort: let renormalization change modify/delete into clean deleteLibravatar Elijah Newren1-2/+62
When we have a modify/delete conflict, but the only change to the modification is e.g. change of line endings, then if renormalization is requested then we should be able to recognize such a case as a not-modified/delete and resolve the conflict automatically. This fixes t6418.10 under GIT_TEST_MERGE_ALGORITHM=ort. Signed-off-by: Elijah Newren <newren@gmail.com> Reviewed-by: Derrick Stolee <dstolee@microsoft.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-03-20merge-ort: have ll_merge() use a special attr_index for renormalizationLibravatar Elijah Newren1-2/+62
ll_merge() needs an index when renormalization is requested. Create one specifically for just this purpose with just the one needed entry. This fixes t6418.4 and t6418.5 under GIT_TEST_MERGE_ALGORITHM=ort. NOTE 1: Even if the user has a working copy or a real index (which is not a given as merge-ort can be used in bare repositories), we explicitly ignore any .gitattributes file from either of these locations. merge-ort can be used to merge two branches that are unrelated to HEAD, so .gitattributes from the working copy and current index should not be considered relevant. NOTE 2: Since we are in the middle of merging, there is a risk that .gitattributes itself is conflicted...leaving us with an ill-defined situation about how to perform the rest of the merge. It could be that the .gitattributes file does not even exist on one of the sides of the merge, or that it has been modified on both sides. If it's been modified on both sides, it's possible that it could itself be merged cleanly, though it's also possible that it only merges cleanly if you use the right version of the .gitattributes file to drive the merge. It gets kind of complicated. The only test we ever had that attempted to test behavior in this area was seemingly unaware of the undefined behavior, but knew the test wouldn't work for lack of attribute handling support, marked it as test_expect_failure from the beginning, but managed to fail for several reasons unrelated to attribute handling. See commit 6f6e7cfb52 ("t6038: remove problematic test", 2020-08-03) for details. So there are probably various ways to improve what initialize_attr_index() picks in the case of a conflicted .gitattributes but for now I just implemented something simple -- look for whatever .gitattributes file we can find in any of the higher order stages and use it. Signed-off-by: Elijah Newren <newren@gmail.com> Reviewed-by: Derrick Stolee <dstolee@microsoft.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-03-20merge-ort: add a special minimal index just for renormalizationLibravatar Elijah Newren1-0/+20
renormalize_buffer() requires an index_state, which is something that merge-ort does not operate with. However, all the renormalization code needs is an index with a .gitattributes file...plus a little bit of setup. Create such an index, along with the deallocation and attr_direction handling. A subsequent commit will add a function to finish the initialization of this index. Signed-off-by: Elijah Newren <newren@gmail.com> Reviewed-by: Derrick Stolee <dstolee@microsoft.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-03-20merge-ort: use STABLE_QSORT instead of QSORT where requiredLibravatar Elijah Newren1-1/+7
rename/rename conflict handling depends on the fact that if both sides renamed the same path, that the one on the MERGE_SIDE1 will appear first in the combined diff_queue_struct passed to process_renames(). Since we add all pairs from MERGE_SIDE1 to combined first, and then all pairs from MERGE_SIDE2, and then sort based on filename, this will only be true if the sort used is stable. This was found due to the fact that Mac, unlike Linux, apparently has a system-defined qsort that is not stable. While we are at it, review the other callers of QSORT and add comments about why they can remain as calls to QSORT instead of being modified to call STABLE_QSORT. Signed-off-by: Elijah Newren <newren@gmail.com> Reviewed-by: Derrick Stolee <dstolee@microsoft.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-03-18merge-ort: record the reason that we want a rename for a fileLibravatar Elijah Newren1-5/+10
There are two different reasons we might want a rename for a file -- for three-way content merging or as part of directory rename detection. Record the reason. diffcore-rename will potentially be able to filter some of the ones marked as needed only for directory rename detection, if it can determine those directory renames based solely on renames found via exact rename detection and basename-guided rename detection. Signed-off-by: Elijah Newren <newren@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-03-18diffcore-rename: add computation of number of unknown renamesLibravatar Elijah Newren1-0/+3
The previous commit can only be effective if we have a computation of the number of paths under a given directory which are still have pending renames, and expected this number to be recorded in the dir_rename_count map under the key UNKNOWN_DIR. Add the code necessary to compute these values. Note that this change means dir_rename_count might have a directory whose only entry (for UNKNOWN_DIR) was removed by the time merge-ort goes to check it. To account for this, merge-ort needs to check for the case where the max count is 0. With this change we are now computing the necessary value for each directory in dirs_removed, but are not using that value anywhere. The next two commits will make use of the values stored in dirs_removed in order to compute whether each relevant_source (that is needed only for directory rename detection) has become unnecessary. Signed-off-by: Elijah Newren <newren@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-03-18merge-ort: record the reason that we want a rename for a directoryLibravatar Elijah Newren1-3/+38
When one side of history renames a directory, and the other side of history added files to the old directory, directory rename detection is used to warn about the location of the added files so the user can move them to the old directory or keep them with the new one. This sets up three different types of directories: * directories that had new files added to them * directories underneath a directory that had new files added to them * directories where no new files were added to it or any leading path Save this information in dirs_removed; the next several commits will make use of this information. Signed-off-by: Elijah Newren <newren@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-03-18merge-ort, diffcore-rename: tweak dirs_removed and relevant_source typeLibravatar Elijah Newren1-14/+14
As noted in the previous commit, we want to be able to take advantage of the "majority rules" portion of directory rename detection to avoid detecting more renames than necessary. However, for diffcore-rename to take advantage of that, it needs to know whether a rename source file was needed for just directory rename detection reasons, or if it is wanted for potential three-way content merging. Modify relevant_sources from a strset to a strintmap, so we can encode additional information. We also modify dirs_removed from a strset to a strintmap at the same time because trying to determine what files are needed for directory rename detection will require us tracking a bit more information for each directory. This commit only changes the types of the two variables from strset to strintmap; it does not actually store any special values yet and for now only checks for presence of entries in the strintmap. Thus, the code is functionally identical to how it behaved before. Future commits will start associating values with each key for these two maps. Signed-off-by: Elijah Newren <newren@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-03-13use CALLOC_ARRAYLibravatar René Scharfe1-5/+4
Add and apply a semantic patch for converting code that open-codes CALLOC_ARRAY to use it instead. It shortens the code and infers the element size automatically. Signed-off-by: René Scharfe <l.s.r@web.de> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-03-10merge-ort: skip rename detection entirely if possibleLibravatar Elijah Newren1-0/+44
diffcore_rename_extended() will do a bunch of setup, then check for exact renames, then abort before inexact rename detection if there are no more sources or destinations that need to be matched. It will sometimes be the case, however, that either * we start with neither any sources or destinations * we start with no *relevant* sources In the first of these two cases, the setup and exact rename detection will be very cheap since there are 0 files to operate on. In the second case, it is quite possible to have thousands of files with none of the source ones being relevant. Avoid calling diffcore_rename_extended() or even some of the setup before diffcore_rename_extended() when we can determine that rename detection is unnecessary. For the testcases mentioned in commit 557ac0350d ("merge-ort: begin performance work; instrument with trace2_region_* calls", 2020-10-28), this change improves the performance as follows: Before After no-renames: 6.003 s ± 0.048 s 5.708 s ± 0.111 s mega-renames: 114.009 s ± 0.236 s 102.171 s ± 0.440 s just-one-mega: 3.489 s ± 0.017 s 3.471 s ± 0.015 s Signed-off-by: Elijah Newren <newren@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-03-10merge-ort: use relevant_sources to filter possible rename sourcesLibravatar Elijah Newren1-1/+1
The past several commits determined conditions when rename sources might be needed, and filled a relevant_sources strset with those paths. Pass these along to diffcore_rename_extended() to use to limit the sources that we need to detect renames for. For the testcases mentioned in commit 557ac0350d ("merge-ort: begin performance work; instrument with trace2_region_* calls", 2020-10-28), this change improves the performance as follows: Before After no-renames: 12.596 s ± 0.061 s 6.003 s ± 0.048 s mega-renames: 130.465 s ± 0.259 s 114.009 s ± 0.236 s just-one-mega: 3.958 s ± 0.010 s 3.489 s ± 0.017 s Signed-off-by: Elijah Newren <newren@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-03-10merge-ort: precompute whether directory rename detection is neededLibravatar Elijah Newren1-6/+61
The point of directory rename detection is that if one side of history renames a directory, and the other side adds new files under the old directory, then the merge can move those new files into the new directory. This leads to the following important observation: * If the other side does not add any new files under the old directory, we do not need to detect any renames for that directory. Similarly, directory rename detection had an important requirement: * If a directory still exists on one side of history, it has not been renamed on that side of history. (See section 4 of t6423 or Documentation/technical/directory-rename-detection.txt for more details). Using these two bits of information, we note that directory rename detection is only needed in cases where (1) directories exist in the merge base and on one side of history (i.e. dirmask == 3 or dirmask == 5), and (2) where there is some new file added to that directory on the side where it still exists (thus where the file has filemask == 2 or filemask == 4, respectively). This has to be done in two steps, because we have the dirmask when we are first considering the directory, and won't get the filemasks for the files within it until we recurse into that directory. So, we save dir_rename_mask = dirmask - 1 when we hit a directory that is missing on one side, and then later look for cases of filemask == dir_rename_mask One final note is that as soon as we hit a directory that needs directory rename detection, we will need to detect renames in all subdirectories of that directory as well due to the "majority rules" decision when files are renamed into different directory hierarchies. We arbitrarily use the special value of 0x07 to record when we've hit such a directory. The combination of all the above mean that we introduce a variable named dir_rename_mask (couldn't think of a better name) which has one of the following values as we traverse into a directory: * 0x00: directory rename detection not needed * 0x02 or 0x04: directory rename detection only needed if files added * 0x07: directory rename detection definitely needed We then pass this value through to add_pairs() so that it can mark location_relevant as true only when dir_rename_mask is 0x07. Signed-off-by: Elijah Newren <newren@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>