author | Ævar Arnfjörð Bjarmason <avarab@gmail.com> | 2021-01-24 18:28:13 +0100 |
---|---|---|
committer | Junio C Hamano <gitster@pobox.com> | 2021-01-24 16:09:17 -0800 |
commit | 95ca1f987edd23389e3079d0a7fe6d0f89927b68 (patch) | |
tree | 848a3df16861c43f43e8161eeae352e597337ce6 /t/t7812-grep-icase-non-ascii.sh | |
parent | grep/pcre2 tests: don't rely on invalid UTF-8 data test (diff) | |
download | tgif-95ca1f987edd23389e3079d0a7fe6d0f89927b68.tar.xz |
grep/pcre2: better support invalid UTF-8 haystacks
Improve the support for invalid UTF-8 haystacks given a non-ASCII
needle when using the PCREv2 backend.
This is a more complete fix for a bug I started to fix in
870eea8166 (grep: do not enter PCRE2_UTF mode on fixed matching,
2019-07-26). Now that PCREv2 has the PCRE2_MATCH_INVALID_UTF mode, we
can make use of it.
This fixes the sort of case described in 8a5999838e (grep: stess test
PCRE v2 on invalid UTF-8 data, 2019-07-26), i.e.:
- The subject string is non-ASCII (e.g. "ævar")
- We're under a locale for which is_utf8_locale() is true, e.g. "en_US.UTF-8", not "C"
- We are using --ignore-case, or we're a non-fixed pattern
If those conditions were satisfied and we encountered invalid UTF-8
data, PCREv2 might bark on it. In practice this only happened
under the JIT backend (turned on by default on most platforms).
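The haystack behind these conditions is easy to rebuild by hand. The following is a minimal sketch, assuming a POSIX shell with `mktemp`, `od` and `tr` available; it recreates the `invalid-0x80` file from the test setup and dumps its bytes. The scratch directory and the variable names are illustrative and do not come from git's test suite.

```shell
#!/bin/sh
# Sketch only: rebuild the haystack from the test setup and show its bytes.
# The scratch directory and variable names are illustrative, not from git.
tmp=$(mktemp -d) || exit 1
file="$tmp/invalid-0x80"

# Line 1: the lone byte 0x80 (octal 200), which is not valid UTF-8.
printf '\200\n' >"$file"
# Line 2: "ævar" in UTF-8; "æ" (U+00E6) is the two bytes 0xc3 0xa6.
printf '\303\246var\n' >>"$file"

# Dump the file as hex so the invalid and valid parts are both visible.
bytes=$(od -An -tx1 "$file" | tr -d ' \n')
echo "$bytes"

rm -rf "$tmp"
```

The first line of the file is what makes the haystack invalid, while the second line is the valid non-ASCII text that a needle such as "æ" is expected to match.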
Ultimately this fixes a "regression" in b65abcafc7 ("grep: use PCRE v2
for optimized fixed-string search", 2019-07-01). I'm putting that in
scare-quotes because before then we wouldn't properly support these
complex case-folding, locale etc. cases either; it just broke in
different ways.
There was a bug related to this in the PCRE2_MATCH_INVALID_UTF mode,
fixed in PCREv2 10.36. It can be worked around by setting the
PCRE2_NO_START_OPTIMIZE flag. Let's do that in those cases, and add
tests for the bug.
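The version condition for the workaround can be sketched in shell as follows. This is not git's implementation, since the commit sets the flag inside git itself. `needs_no_start_optimize` is a hypothetical helper, and feeding it the output of `pcre2-config --version` would be an assumption about the local PCRE2 install.

```shell
#!/bin/sh
# Hypothetical helper (not part of git): succeed when a PCRE2 version
# string such as "10.35" predates 10.36, the release the commit message
# names as containing the fix.
needs_no_start_optimize () {
	major=${1%%.*}            # text before the first dot
	rest=${1#*.}              # text after the first dot
	minor=${rest%%[!0-9]*}    # leading digits only, drops suffixes like "-RC1"
	test "$major" -eq 10 && test "$minor" -lt 36
}

for v in 10.34 10.35 10.36 10.40
do
	if needs_no_start_optimize "$v"
	then
		echo "$v: set PCRE2_NO_START_OPTIMIZE"
	else
		echo "$v: no workaround needed"
	fi
done
```

Keeping the check to a plain major/minor comparison mirrors the commit message, which ties the fix to a single release rather than to a feature probe.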
Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Diffstat (limited to 't/t7812-grep-icase-non-ascii.sh')
-rwxr-xr-x | t/t7812-grep-icase-non-ascii.sh | 46 |
1 files changed, 45 insertions, 1 deletions
diff --git a/t/t7812-grep-icase-non-ascii.sh b/t/t7812-grep-icase-non-ascii.sh
index 38457c2e4f..e5d1e4ea68 100755
--- a/t/t7812-grep-icase-non-ascii.sh
+++ b/t/t7812-grep-icase-non-ascii.sh
@@ -57,7 +57,12 @@ test_expect_success GETTEXT_LOCALE,LIBPCRE2 'PCRE v2: setup invalid UTF-8 data'
 	printf "\\200\\n" >invalid-0x80 &&
 	echo "ævar" >expected &&
 	cat expected >>invalid-0x80 &&
-	git add invalid-0x80
+	git add invalid-0x80 &&
+
+	# Test for PCRE2_MATCH_INVALID_UTF bug
+	# https://bugs.exim.org/show_bug.cgi?id=2642
+	printf "\\345Aæ\\n" >invalid-0xe5 &&
+	git add invalid-0xe5
 '
 
 test_expect_success GETTEXT_LOCALE,LIBPCRE2 'PCRE v2: grep ASCII from invalid UTF-8 data' '
@@ -67,6 +72,13 @@ test_expect_success GETTEXT_LOCALE,LIBPCRE2 'PCRE v2: grep ASCII from invalid UT
 	test_cmp expected actual
 '
 
+test_expect_success GETTEXT_LOCALE,LIBPCRE2 'PCRE v2: grep ASCII from invalid UTF-8 data (PCRE2 bug #2642)' '
+	git grep -h "Aæ" invalid-0xe5 >actual &&
+	test_cmp invalid-0xe5 actual &&
+	git grep -h "(*NO_JIT)Aæ" invalid-0xe5 >actual &&
+	test_cmp invalid-0xe5 actual
+'
+
 test_expect_success GETTEXT_LOCALE,LIBPCRE2 'PCRE v2: grep non-ASCII from invalid UTF-8 data' '
 	git grep -h "æ" invalid-0x80 >actual &&
 	test_cmp expected actual &&
@@ -74,9 +86,41 @@ test_expect_success GETTEXT_LOCALE,LIBPCRE2 'PCRE v2: grep non-ASCII from invali
 	test_cmp expected actual
 '
 
+test_expect_success GETTEXT_LOCALE,LIBPCRE2 'PCRE v2: grep non-ASCII from invalid UTF-8 data (PCRE2 bug #2642)' '
+	git grep -h "Aæ" invalid-0xe5 >actual &&
+	test_cmp invalid-0xe5 actual &&
+	git grep -h "(*NO_JIT)Aæ" invalid-0xe5 >actual &&
+	test_cmp invalid-0xe5 actual
+'
+
+test_lazy_prereq PCRE2_MATCH_INVALID_UTF '
+	test-tool pcre2-config has-PCRE2_MATCH_INVALID_UTF
+'
+
 test_expect_success GETTEXT_LOCALE,LIBPCRE2 'PCRE v2: grep non-ASCII from invalid UTF-8 data with -i' '
 	test_might_fail git grep -hi "Æ" invalid-0x80 >actual &&
 	test_might_fail git grep -hi "(*NO_JIT)Æ" invalid-0x80 >actual
 '
 
+test_expect_success GETTEXT_LOCALE,LIBPCRE2,PCRE2_MATCH_INVALID_UTF 'PCRE v2: grep non-ASCII from invalid UTF-8 data with -i' '
+	git grep -hi "Æ" invalid-0x80 >actual &&
+	test_cmp expected actual &&
+	git grep -hi "(*NO_JIT)Æ" invalid-0x80 >actual &&
+	test_cmp expected actual
+'
+
+test_expect_success GETTEXT_LOCALE,LIBPCRE2,PCRE2_MATCH_INVALID_UTF 'PCRE v2: grep non-ASCII from invalid UTF-8 data with -i (PCRE2 bug #2642)' '
+	git grep -hi "Æ" invalid-0xe5 >actual &&
+	test_cmp invalid-0xe5 actual &&
+	git grep -hi "(*NO_JIT)Æ" invalid-0xe5 >actual &&
+	test_cmp invalid-0xe5 actual &&
+
+	# Only the case of grepping the ASCII part in a way that
+	# relies on -i fails
+	git grep -hi "aÆ" invalid-0xe5 >actual &&
+	test_cmp invalid-0xe5 actual &&
+	git grep -hi "(*NO_JIT)aÆ" invalid-0xe5 >actual &&
+	test_cmp invalid-0xe5 actual
+'
+
 test_done