author | Jonathan Nieder <jrnieder@gmail.com> | 2011-01-11 15:48:50 -0600 |
---|---|---|
committer | Junio C Hamano <gitster@pobox.com> | 2011-01-18 08:59:39 -0800 |
commit | 664d44ee7fb18bdfdd66a1be760c7ee1bbe911c6 (patch) | |
tree | e7a26f9f61b8f3f05702c491a7bb28325d07d265 /t/t4034-diff-words.sh | |
parent | t4034: bulk verify builtin word regex sanity (diff) | |
userdiff: simplify word-diff safeguard
Git's diff-words support has a detail that can be a little dangerous:
any text not matched by a given language's tokenization pattern is
treated as whitespace, so changes in such text go unnoticed.
Therefore each of the built-in regexes allows a special token type
consisting of a single non-whitespace character [^[:space:]].
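The danger described above can be sketched in a few lines of Python. This is only an illustration, not git's implementation: the word regex, the `tokenize` helper and the sample strings are made up, and `[^\s]` stands in for POSIX `[^[:space:]]`, which Python's `re` module does not support.

```python
import re

def tokenize(pattern, text):
    """Return the tokens a word regex finds; unmatched text is skipped like whitespace."""
    return [m.group(0) for m in re.finditer(pattern, text)]

# Hypothetical word regex that only knows identifiers and integers.
WORD_ONLY = r"[A-Za-z_][A-Za-z0-9_]*|[0-9]+"
# The same regex plus a catch-all for any single non-whitespace character.
WITH_CATCH_ALL = WORD_ONLY + r"|[^\s]"

old, new = "a + b", "a - b"

# Without the catch-all the operator never becomes a token, so both
# versions tokenize identically and the change is missed.
missed = tokenize(WORD_ONLY, old) == tokenize(WORD_ONLY, new)

# With the catch-all the operator is a token of its own and the change shows up.
detected = tokenize(WITH_CATCH_ALL, old) != tokenize(WITH_CATCH_ALL, new)
```

Here `missed` and `detected` are both true: the only difference between the two inputs lies in text the first regex never matches.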
To make sure UTF-8 sequences remain human readable, the builtin
regexes also have a special token type for runs of bytes with the high
bit set. In English, non-ASCII characters are usually isolated, so
this is analogous to the [^[:space:]] pattern, except it matches a
single _multibyte_ character despite use of the C locale.
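As a rough illustration of why the high-bit token type matters, the following Python sketch matches on raw bytes, which approximates byte-wise matching in the C locale. The patterns and the sample word are assumptions chosen for the example. Python tries alternatives from left to right, so the high-bit alternative is listed first.

```python
import re

def tokenize(pattern, data):
    """Split a byte string into the tokens matched by a byte-oriented regex."""
    return [m.group(0) for m in re.finditer(pattern, data)]

CATCH_ALL = rb"[^\s]"            # one non-whitespace byte
HIGH_BIT_RUN = rb"[\x80-\xff]+"  # run of bytes with the high bit set

# "naive" spelled with U+00EF, which UTF-8 encodes as the two bytes C3 AF.
text = "na\u00efve".encode("utf-8")

# Catch-all only: the two bytes of the non-ASCII character become separate tokens.
split = tokenize(CATCH_ALL, text)

# High-bit run first, then catch-all: the character stays in one token.
kept = tokenize(HIGH_BIT_RUN + b"|" + CATCH_ALL, text)
```

With only the single-byte catch-all, `split` contains `b"\xc3"` and `b"\xaf"` as separate tokens, neither of which is valid UTF-8 on its own; `kept` holds them together as `b"\xc3\xaf"`.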
Unfortunately it is easy to make typos or forget entirely to include
these catch-all token types when adding support for new languages (see
v1.7.3.5~16, userdiff: fix typo in ruby and python word regexes,
2010-12-18). Avoid this by including them automatically within the
PATTERNS and IPATTERN macros.
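The idea of appending the catch-all tokens in one central place can be sketched as follows. This is a Python analogue under stated assumptions, not the C macros themselves: `make_driver`, the `toy` driver and the exact catch-all expression are hypothetical.

```python
import re

# Assumed stand-in for the catch-all tokens. Python tries alternatives left
# to right, so the high-bit alternative comes before the single-byte one.
CATCH_ALL = rb"|[\x80-\xff]+|[^\s]"

def make_driver(name, word_regex):
    """Hypothetical analogue of a PATTERNS-style macro: always append the catch-all."""
    return {"name": name, "word_regex": re.compile(word_regex + CATCH_ALL)}

def tokenize(driver, data):
    """Split a byte string into tokens using the driver's word regex."""
    return [m.group(0) for m in driver["word_regex"].finditer(data)]

# The author of this made-up driver only thought about identifiers.
toy = make_driver("toy", rb"[A-Za-z_][A-Za-z0-9_]*")

# Punctuation still becomes a token, because the helper added the catch-all.
tokens = tokenize(toy, b"x = y")
```

Because the suffix is added by the helper, a driver author can no longer mistype or omit it.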
While at it, change the UTF-8 sequence token type to match exactly one
non-ASCII multi-byte character, rather than an arbitrary run of them.
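The difference between a run of high-bit bytes and exactly one multibyte character can be seen in a small Python sketch. The `ONE_CHAR` pattern assumes a lead byte in the range 0xC0-0xFF followed by continuation bytes in the range 0x80-0xBF; the message above does not spell out the regex, so treat it as one plausible way to express the idea.

```python
import re

RUN = rb"[\x80-\xff]+"                  # arbitrary run of high-bit bytes
ONE_CHAR = rb"[\xc0-\xff][\x80-\xbf]+"  # assumed: lead byte plus continuation bytes

# Two adjacent two-byte characters (U+00E9, U+00E8): b"\xc3\xa9\xc3\xa8".
text = "\u00e9\u00e8".encode("utf-8")

# The run pattern lumps both characters into a single token.
as_run = re.findall(RUN, text)

# The one-character pattern stops at the next lead byte, giving one token each.
as_chars = re.findall(ONE_CHAR, text)
```

With the one-character form, a change to only one of two adjacent non-ASCII characters is reported as a change to that character alone, not to the whole run.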
Suggested-by: Thomas Rast <trast@student.ethz.ch>
Signed-off-by: Jonathan Nieder <jrnieder@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>