author Jonathan Nieder <jrnieder@gmail.com> 2011-01-11 15:48:50 -0600
committer Junio C Hamano <gitster@pobox.com> 2011-01-18 08:59:39 -0800
commit 664d44ee7fb18bdfdd66a1be760c7ee1bbe911c6
tree e7a26f9f61b8f3f05702c491a7bb28325d07d265 /t/t4034-diff-words.sh
parent t4034: bulk verify builtin word regex sanity
userdiff: simplify word-diff safeguard

git's diff-words support has a detail that can be a little dangerous: any text not matched by a given language's tokenization pattern is treated as whitespace and changes in such text would go unnoticed. Therefore each of the built-in regexes allows a special token type consisting of a single non-whitespace character [^[:space:]].

To make sure UTF-8 sequences remain human readable, the builtin regexes also have a special token type for runs of bytes with the high bit set. In English, non-ASCII characters are usually isolated so this is analogous to the [^[:space:]] pattern, except it matches a single _multibyte_ character despite use of the C locale.

Unfortunately it is easy to make typos or forget entirely to include these catch-all token types when adding support for new languages (see v1.7.3.5~16, userdiff: fix typo in ruby and python word regexes, 2010-12-18). Avoid this by including them automatically within the PATTERNS and IPATTERN macros.

While at it, change the UTF-8 sequence token type to match exactly one non-ASCII multi-byte character, rather than an arbitrary run of them.

Suggested-by: Thomas Rast <trast@student.ethz.ch>
Signed-off-by: Jonathan Nieder <jrnieder@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
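
The effect of the catch-all token types can be sketched outside of git. The following is a rough Python analogue, not the code in userdiff.c: LANG_WORDS is a hypothetical language pattern invented for illustration, with_safeguard() and tokenize() are hypothetical helpers, and \S stands in for [^[:space:]] because Python's re module has no POSIX character classes. Python also picks the first matching alternative rather than the longest one, so the multibyte alternative is listed before \S here; that ordering is an artifact of this sketch.

```python
import re

# Hypothetical language-specific word pattern (identifiers and integers only).
# It deliberately has no rule for operators or non-ASCII text.
LANG_WORDS = rb"[A-Za-z_][A-Za-z0-9_]*|[0-9]+"

# Catch-all token types, in Python syntax (an assumption of this sketch):
#   [\xc0-\xff][\x80-\xbf]+  one UTF-8 lead byte plus its continuation bytes,
#                            i.e. exactly one non-ASCII multibyte character
#   \S                       any single non-whitespace byte
CATCH_ALL = rb"[\xc0-\xff][\x80-\xbf]+|\S"


def with_safeguard(word_regex: bytes) -> bytes:
    """Append the catch-all alternatives to a language word pattern."""
    return word_regex + b"|" + CATCH_ALL


def tokenize(pattern: bytes, text: bytes) -> list:
    """Return the tokens matched by pattern; unmatched bytes are skipped,
    which is how unmatched text ends up treated like whitespace."""
    return [m.group() for m in re.finditer(pattern, text)]


SAFE = with_safeguard(LANG_WORDS)

if __name__ == "__main__":
    old, new = b"a + b", b"a - b"
    # Without the safeguard the operator is invisible to the tokenizer.
    print(tokenize(LANG_WORDS, old), tokenize(LANG_WORDS, new))
    # With it, the changed operator shows up as its own token.
    print(tokenize(SAFE, old), tokenize(SAFE, new))
    # Each non-ASCII character becomes one token, not a run of them.
    print(tokenize(SAFE, "é→".encode("utf-8")))
```

With the bare language pattern, "a + b" and "a - b" yield identical token lists, so a word diff built on those tokens would report no change. With the safeguard appended, the operator becomes a token of its own, and adjacent non-ASCII characters are split into one token per character.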
Diffstat (limited to 't/t4034-diff-words.sh')
0 files changed, 0 insertions, 0 deletions