Age | Commit message (Collapse) | Author | Files | Lines |
|
Use size_t to match n when building the bitmask for checking whether a
rank is occupied, instead of the default signed int.
Signed-off-by: René Scharfe <l.s.r@web.de>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
|
|
Use MINSTD to generate pseudo-random numbers consistently instead of
using rand(3), whose output can vary from system to system, and reset
its seed before filling in the test values. This gives repeatable
results across versions and systems, which simplifies sharing and
comparing of results between developers.
Signed-off-by: René Scharfe <l.s.r@web.de>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
|
|
The bottom-up mergesort implementation needs to skip through sublists a
lot. A recursive version could avoid that, but would require log2(n)
stack frames. Explicitly manage a stack of sorted sublists of various
lengths instead to avoid fast-forwarding while also keeping a lid on
memory usage.
While this patch was developed independently, a ranks stack is also used
in https://github.com/mono/mono/blob/master/mono/eglib/sort.frag.h by
the Mono project.
The idea is to keep slots for log2(n_max) sorted sublists, one for each
power of 2. Such a construct can accommodate lists of any length up to
n_max. Since there is a known maximum number of items (effectively
SIZE_MAX), we can preallocate the whole rank stack.
We add items one by one, which is akin to incrementing a binary number.
Make use of that by keeping track of the number of items and check bits
in it instead of checking for NULL in the rank stack when checking if a
sublist of a certain rank exists, in order to avoid memory accesses.
The first item can go into the empty first slot as a sublist of length
2^0. The second one needs to be merged with the previous sublist and
the result goes into the empty second slot as a sublist of length 2^1.
The third one goes into vacated first slot and so on. At the end we
merge all the sublists to get the result.
The new version still performs a stable sort by making sure to put items
seen earlier first when the compare function indicates equality. That's
done by preferring items from sublists with a higher rank.
The new merge function also tries to minimize the number of operations.
Like blame.c::blame_merge(), the function doesn't set the next pointer
if it already points to the right item, and it exits when it reaches the
end of one of the two sublists that it's given. The old code couldn't
do the latter because it kept all items in a single list.
The number of comparisons stays the same, though. Here's example output
of "test-tool mergesort test" for the rand distributions with the most
number of comparisons with the ranks stack:
$ t/helper/test-tool mergesort test | awk '
NR > 1 && $1 != "rand" {next}
$7 > max[$3] {max[$3] = $7; line[$3] = $0}
END {for (n in line) print line[n]}
'
distribut mode n m get_next set_next compare verdict
rand copy 100 32 669 420 569 OK
rand dither 1023 64 9997 5396 8974 OK
rand dither 1024 512 10007 6159 8983 OK
rand dither 1025 256 10993 5988 9968 OK
Here are the differences to the results without this patch:
distribut mode n m get_next set_next compare
rand copy 100 32 -515 -280 0
rand dither 1023 64 -6376 -4834 0
rand dither 1024 512 -6377 -4081 0
rand dither 1025 256 -7461 -5287 0
The numbers of get_next and set_next calls are reduced significantly.
NB: These winners are different than the ones shown in the patch that
introduced the unriffle mode because the addition of the unriffle_skewed
mode in between changed the consumption of rand() values.
Here are the distributions with the most comparisons overall with the
ranks stack:
$ t/helper/test-tool mergesort test | awk '
$7 > max[$3] {max[$3] = $7; line[$3] = $0}
END {for (n in line) print line[n]}
'
distribut mode n m get_next set_next compare verdict
sawtooth unriffle_skewed 100 128 689 632 589 OK
sawtooth unriffle_skewed 1023 1024 10230 10220 9207 OK
sawtooth unriffle 1024 1024 10241 10240 9217 OK
sawtooth unriffle_skewed 1025 2048 11266 10242 10241 OK
And here the differences to before:
distribut mode n m get_next set_next compare
sawtooth unriffle_skewed 100 128 -495 -68 0
sawtooth unriffle_skewed 1023 1024 -6143 -10 0
sawtooth unriffle 1024 1024 -6143 0 0
sawtooth unriffle_skewed 1025 2048 -7188 -1033 0
We get a similar reduction of get_next calls here, but only a slight
reduction of set_next calls, if at all.
And here are the results of p0071-sort.sh before:
0071.12: llist_mergesort() unsorted 0.36(0.33+0.01)
0071.14: llist_mergesort() sorted 0.15(0.13+0.01)
0071.16: llist_mergesort() reversed 0.16(0.14+0.01)
... and here the ones with this patch:
0071.12: llist_mergesort() unsorted 0.24(0.22+0.01)
0071.14: llist_mergesort() sorted 0.12(0.10+0.01)
0071.16: llist_mergesort() reversed 0.12(0.10+0.01)
NB: We can't use t/perf/run to compare revisions in one run because it
uses the test-tool from the worktree, not from the revisions being
tested.
Signed-off-by: René Scharfe <l.s.r@web.de>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
|
|
Signed-off-by: René Scharfe <l.s.r@web.de>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
|
|
Check if sorting takes advantage of already sorted or reversed content,
or if that corner case actually decreases performance, like it would for
a simplistic quicksort implementation.
Signed-off-by: René Scharfe <l.s.r@web.de>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
|
|
Add a mode that turns a sorted list into adversarial input for a
bottom-up mergesort implementation that doubles the length of sorted
sublists at each level -- like our llist_mergesort().
While unriffle mode splits the list in half at each recursion step,
unriffle_skewed splits it into 2^l items and the rest, with 2^l being
the highest power of two smaller than the number of items and thus
2^l >= rest. The rest is unriffled with the tail of the first half to
require a merge to compare the maximum number of elements.
It complements the unriffle mode, which targets balanced merges. If
the number of elements is a power of two then both actually produce the
same result, as 2^l == rest == n/2 at each recursion step in that case.
Here are the results:
$ t/helper/test-tool mergesort test | awk '
$7 > max[$3] {max[$3] = $7; line[$3] = $0}
END {for (n in line) print line[n]}
'
distribut mode n m get_next set_next compare verdict
sawtooth unriffle_skewed 100 128 1184 700 589 OK
sawtooth unriffle_skewed 1023 1024 16373 10230 9207 OK
sawtooth unriffle 1024 1024 16384 10240 9217 OK
sawtooth unriffle_skewed 1025 2048 18454 11275 10241 OK
The sawtooth distribution with m>=n produces a sorted list and
unriffle_skewed mode turns it into adversarial input for unbalanced
merges, which it wins in all cases except for n=1024 -- the resulting
list is the same, but unriffle is tested before unriffle_skewed, so its
result is selected by the AWK script.
Signed-off-by: René Scharfe <l.s.r@web.de>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
|
|
Add a mode that turns sorted items into adversarial input for mergesort.
Do that by running mergesort in reverse and rearranging the items in
such a way that each merge needs the maximum number of operations to
undo it.
To riffle is a card shuffling technique and involves splitting a deck
into two and then to interleave them. A perfect riffle takes one card
from each half in turn. That's similar to the most expensive merge,
which has to take one item from each sublist in turn, which requires the
maximum number of comparisons (n-1).
So unriffle does that in reverse, i.e. it generates the first sublist
out of the items at even indexes and the second sublist out of the items
at odd indexes, without changing their order in any other way. Done
recursively until we reach the trivial sublist length of one, this
twists the list into an order that requires the maximum effort for
mergesort to untangle.
As a baseline, here are the rand distributions with the highest number
of comparisons from "test-tool mergesort test":
$ t/helper/test-tool mergesort test | awk '
NR > 1 && $1 != "rand" {next}
$7 > max[$3] {max[$3] = $7; line[$3] = $0}
END {for (n in line) print line[n]}
'
distribut mode n m get_next set_next compare verdict
rand copy 100 32 1184 700 569 OK
rand reverse_1st_half 1023 256 16373 10230 8976 OK
rand reverse_1st_half 1024 512 16384 10240 8993 OK
rand dither 1025 64 18454 11275 9970 OK
And here are the most expensive ones overall:
$ t/helper/test-tool mergesort test | awk '
$7 > max[$3] {max[$3] = $7; line[$3] = $0}
END {for (n in line) print line[n]}
'
distribut mode n m get_next set_next compare verdict
stagger reverse 100 64 1184 700 580 OK
sawtooth unriffle 1023 1024 16373 10230 9179 OK
sawtooth unriffle 1024 1024 16384 10240 9217 OK
stagger unriffle 1025 2048 18454 11275 10241 OK
The sawtooth distribution with m>=n generates a sorted list. The
unriffle mode is designed to turn that into adversarial input for
mergesort, and that checks out for n=1023 and n=1024, where it produces
the list that requires the most comparisons.
Item counts that are not powers of two have other winners, and that's
because unriffle recursively splits lists into equal-sized halves, while
llist_mergesort() splits them into the biggest power of two smaller than
n and the rest, e.g. for n=1025 it sorts the first 1024 separately and
finally merges them to the last item.
So unriffle mode works as designed for the intended use case, but to
consistently generate adversarial input for unbalanced merges we need
something else.
Signed-off-by: René Scharfe <l.s.r@web.de>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
|
|
Add a subcommand for printing test data. It can be used to generate
special test cases and feed them into the sort subcommand or sort(1) for
performance measurements. It may also be useful to illustrate the
effect of distributions, modes and their parameters.
It generates n integers with the specified distribution and its
distribution-specific parameter m. E.g. m is the maximum value for
the plateau distribution and the length and height of individual teeth
of the sawtooth distribution.
The generated values are printed as zero-padded eight-digit hexadecimal
numbers to make sure alphabetic and numeric order are the same.
Signed-off-by: René Scharfe <l.s.r@web.de>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
|
|
Adapt the qsort certification program from "Engineering a Sort Function"
by Bentley and McIlroy for testing our linked list sort function. It
generates several lists with various distribution patterns and counts
the number of operations llist_mergesort() needs to order them. It
compares the result to the output of a trusted sort function (qsort(1))
and also checks if the sort is stable.
Also add a test script that makes use of the new subcommand.
Signed-off-by: René Scharfe <l.s.r@web.de>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
|
|
Give the code for sorting a text file its own sub-command. This allows
extending the helper, which we'll do in the following patches.
Signed-off-by: René Scharfe <l.s.r@web.de>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
|
|
Strip line ending characters to make sure empty lines are sorted like
sort(1) does.
Signed-off-by: René Scharfe <l.s.r@web.de>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
|
|
Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
|
|
* maint-2.29:
Git 2.29.3
Git 2.28.1
Git 2.27.1
Git 2.26.3
Git 2.25.5
Git 2.24.4
Git 2.23.4
Git 2.22.5
Git 2.21.4
Git 2.20.5
Git 2.19.6
Git 2.18.5
Git 2.17.6
unpack_trees(): start with a fresh lstat cache
run-command: invalidate lstat cache after a command finished
checkout: fix bug that makes checkout follow symlinks in leading path
|
|
Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
|
|
* maint-2.28:
Git 2.28.1
Git 2.27.1
Git 2.26.3
Git 2.25.5
Git 2.24.4
Git 2.23.4
Git 2.22.5
Git 2.21.4
Git 2.20.5
Git 2.19.6
Git 2.18.5
Git 2.17.6
unpack_trees(): start with a fresh lstat cache
run-command: invalidate lstat cache after a command finished
checkout: fix bug that makes checkout follow symlinks in leading path
|
|
Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
|
|
* maint-2.27:
Git 2.27.1
Git 2.26.3
Git 2.25.5
Git 2.24.4
Git 2.23.4
Git 2.22.5
Git 2.21.4
Git 2.20.5
Git 2.19.6
Git 2.18.5
Git 2.17.6
unpack_trees(): start with a fresh lstat cache
run-command: invalidate lstat cache after a command finished
checkout: fix bug that makes checkout follow symlinks in leading path
|
|
Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
|
|
* maint-2.26:
Git 2.26.3
Git 2.25.5
Git 2.24.4
Git 2.23.4
Git 2.22.5
Git 2.21.4
Git 2.20.5
Git 2.19.6
Git 2.18.5
Git 2.17.6
unpack_trees(): start with a fresh lstat cache
run-command: invalidate lstat cache after a command finished
checkout: fix bug that makes checkout follow symlinks in leading path
|
|
Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
|
|
* maint-2.25:
Git 2.25.5
Git 2.24.4
Git 2.23.4
Git 2.22.5
Git 2.21.4
Git 2.20.5
Git 2.19.6
Git 2.18.5
Git 2.17.6
unpack_trees(): start with a fresh lstat cache
run-command: invalidate lstat cache after a command finished
checkout: fix bug that makes checkout follow symlinks in leading path
|
|
Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
|
|
* maint-2.24:
Git 2.24.4
Git 2.23.4
Git 2.22.5
Git 2.21.4
Git 2.20.5
Git 2.19.6
Git 2.18.5
Git 2.17.6
unpack_trees(): start with a fresh lstat cache
run-command: invalidate lstat cache after a command finished
checkout: fix bug that makes checkout follow symlinks in leading path
|
|
Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
|
|
* maint-2.23:
Git 2.23.4
Git 2.22.5
Git 2.21.4
Git 2.20.5
Git 2.19.6
Git 2.18.5
Git 2.17.6
unpack_trees(): start with a fresh lstat cache
run-command: invalidate lstat cache after a command finished
checkout: fix bug that makes checkout follow symlinks in leading path
|
|
Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
|
|
* maint-2.22:
Git 2.22.5
Git 2.21.4
Git 2.20.5
Git 2.19.6
Git 2.18.5
Git 2.17.6
unpack_trees(): start with a fresh lstat cache
run-command: invalidate lstat cache after a command finished
checkout: fix bug that makes checkout follow symlinks in leading path
|
|
Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
|
|
* maint-2.21:
Git 2.21.4
Git 2.20.5
Git 2.19.6
Git 2.18.5
Git 2.17.6
unpack_trees(): start with a fresh lstat cache
run-command: invalidate lstat cache after a command finished
checkout: fix bug that makes checkout follow symlinks in leading path
|
|
Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
|
|
* maint-2.20:
Git 2.20.5
Git 2.19.6
Git 2.18.5
Git 2.17.6
unpack_trees(): start with a fresh lstat cache
run-command: invalidate lstat cache after a command finished
checkout: fix bug that makes checkout follow symlinks in leading path
|
|
Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
|
|
* maint-2.19:
Git 2.19.6
Git 2.18.5
Git 2.17.6
unpack_trees(): start with a fresh lstat cache
run-command: invalidate lstat cache after a command finished
checkout: fix bug that makes checkout follow symlinks in leading path
|
|
Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
|
|
* maint-2.18:
Git 2.18.5
Git 2.17.6
unpack_trees(): start with a fresh lstat cache
run-command: invalidate lstat cache after a command finished
checkout: fix bug that makes checkout follow symlinks in leading path
|
|
Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
|
|
* maint-2.17:
Git 2.17.6
unpack_trees(): start with a fresh lstat cache
run-command: invalidate lstat cache after a command finished
checkout: fix bug that makes checkout follow symlinks in leading path
|
|
Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
|
|
We really want to avoid relying on stale information.
Signed-off-by: Matheus Tavares <matheus.bernardino@usp.br>
Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
|
|
In the previous commit, we intercepted calls to `rmdir()` to invalidate
the lstat cache in the successful case, so that the lstat cache could
not have the idea that a directory exists where there is none.
The same situation can arise, of course, when a separate process is
spawned (most notably, this is the case in `submodule_move_head()`).
Obviously, we cannot know whether a directory was removed in that
process, therefore we must invalidate the lstat cache afterwards.
Note: in contrast to `lstat_cache_aware_rmdir()`, we invalidate the
lstat cache even in case of an error: the process might have removed a
directory and still have failed afterwards.
Co-authored-by: Matheus Tavares <matheus.bernardino@usp.br>
Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
|
|
Before checking out a file, we have to confirm that all of its leading
components are real existing directories. And to reduce the number of
lstat() calls in this process, we cache the last leading path known to
contain only directories. However, when a path collision occurs (e.g.
when checking out case-sensitive files in case-insensitive file
systems), a cached path might have its file type changed on disk,
leaving the cache on an invalid state. Normally, this doesn't bring
any bad consequences as we usually check out files in index order, and
therefore, by the time the cached path becomes outdated, we no longer
need it anyway (because all files in that directory would have already
been written).
But, there are some users of the checkout machinery that do not always
follow the index order. In particular: checkout-index writes the paths
in the same order that they appear on the CLI (or stdin); and the
delayed checkout feature -- used when a long-running filter process
replies with "status=delayed" -- postpones the checkout of some entries,
thus modifying the checkout order.
When we have to check out an out-of-order entry and the lstat() cache is
invalid (due to a previous path collision), checkout_entry() may end up
using the invalid data and thrusting that the leading components are
real directories when, in reality, they are not. In the best case
scenario, where the directory was replaced by a regular file, the user
will get an error: "fatal: unable to create file 'foo/bar': Not a
directory". But if the directory was replaced by a symlink, checkout
could actually end up following the symlink and writing the file at a
wrong place, even outside the repository. Since delayed checkout is
affected by this bug, it could be used by an attacker to write
arbitrary files during the clone of a maliciously crafted repository.
Some candidate solutions considered were to disable the lstat() cache
during unordered checkouts or sort the entries before passing them to
the checkout machinery. But both ideas include some performance penalty
and they don't future-proof the code against new unordered use cases.
Instead, we now manually reset the lstat cache whenever we successfully
remove a directory. Note: We are not even checking whether the directory
was the same as the lstat cache points to because we might face a
scenario where the paths refer to the same location but differ due to
case folding, precomposed UTF-8 issues, or the presence of `..`
components in the path. Two regression tests, with case-collisions and
utf8-collisions, are also added for both checkout-index and delayed
checkout.
Note: to make the previously mentioned clone attack unfeasible, it would
be sufficient to reset the lstat cache only after the remove_subtree()
call inside checkout_entry(). This is the place where we would remove a
directory whose path collides with the path of another entry that we are
currently trying to check out (possibly a symlink). However, in the
interest of a thorough fix that does not leave Git open to
similar-but-not-identical attack vectors, we decided to intercept
all `rmdir()` calls in one fell swoop.
This addresses CVE-2021-21300.
Co-authored-by: Johannes Schindelin <johannes.schindelin@gmx.de>
Signed-off-by: Matheus Tavares <matheus.bernardino@usp.br>
|
|
* tb/ci-run-cocci-with-18.04:
.github/workflows/main.yml: run static-analysis on bionic
|
|
GitHub Actions is transitioning workflow steps that run on
'ubuntu-latest' from 18.04 to 20.04 [1].
This works fine in all steps except the static-analysis one, since
Coccinelle isn't available on Ubuntu focal (it is only available in the
universe suite).
Until Coccinelle can be installed from 20.04's main suite, pin the
static-analysis build to run on 18.04, where it can be installed by
default.
[1]: https://github.com/actions/virtual-environments/issues/1816
Reported-by: Junio C Hamano <gitster@pobox.com>
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
|
|
Signed-off-by: Junio C Hamano <gitster@pobox.com>
|
|
Our setting of GitHub CI test jobs were a bit too eager to give up
once there is even one failure found. Tweak the knob to allow
other jobs keep running even when we see a failure, so that we can
find more failures in a single run.
* pb/ci-matrix-wo-shortcut:
ci: do not cancel all jobs of a matrix if one fails
|
|
Test fix.
* pb/blame-funcname-range-userdiff:
annotate-tests: quote variable expansions containing path names
|
|
A perf script was made more portable.
* jk/p5303-sed-portability-fix:
p5303: avoid sed GNU-ism
|
|
The implementation of "git branch --sort" wrt the detached HEAD
display has always been hacky, which has been cleaned up.
* ab/branch-sort:
branch: show "HEAD detached" first under reverse sort
branch: sort detached HEAD based on a flag
ref-filter: move ref_sorting flags to a bitfield
ref-filter: move "cmp_fn" assignment into "else if" arm
ref-filter: add braces to if/else if/else chain
branch tests: add to --sort tests
branch: change "--local" to "--list" in comment
|
|
Code clean-up.
* ma/more-opaque-lock-file:
read-cache: try not to peek into `struct {lock_,temp}file`
refs/files-backend: don't peek into `struct lock_file`
midx: don't peek into `struct lock_file`
commit-graph: don't peek into `struct lock_file`
builtin/gc: don't peek into `struct lock_file`
|
|
Text encoding fix for "git p4".
* dl/p4-encode-after-kw-expansion:
git-p4: fix syncing file types with pattern
|