summaryrefslogtreecommitdiff
path: root/t/t0028-working-tree-encoding.sh
AgeCommit message (Collapse)AuthorFilesLines
2019-11-07t0028: eliminate non-standard usage of printfLibravatar Doan Tran Cong Danh1-2/+2
man 1p printf: In addition to the escape sequences shown in the Base Definitions volume of POSIX.1‐2008, Chapter 5, File Format Notation ('\\', '\a', '\b', '\f', '\n', '\r', '\t', '\v'), "\ddd", where ddd is a one, two, or three-digit octal number, shall be written as a byte with the numeric value specified by the octal number. printf '\xfe\xff' is an extension of some shell. Dash, a popular yet simple shell, do not implement this extension. This wasn't caught by most people running the tests, even though common shells like dash don't handle hex escapes, because their systems don't trigger the NO_UTF16_BOM prereq. But systems with musl libc do; when combined with dash, the test fails. Correct it. Signed-off-by: Doan Tran Cong Danh <congdanhqx@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2019-09-28t0028: add more testsLibravatar Alexandr Miloslavskiy1-0/+39
After I discovered that UTF-16-LE-BOM test was buggy, I decided that better tests are required. Possibly the best option here is to compare git results against hardcoded ground truth. The new tests also cover more interesting chars where (ANSI != UTF-8). Signed-off-by: Alexandr Miloslavskiy <alexandr.miloslavskiy@syntevo.com> Reviewed-by: Torsten Bögershausen <tboegi@web.de> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2019-09-28t0028: fix test for UTF-16-LE-BOMLibravatar Alexandr Miloslavskiy1-1/+1
According to its name, the test is designed for UTF-16-LE-BOM. However, possibly due to copy&paste oversight, it was using UTF-32. While the test succeeds (extra \000\000 are interpreted as NUL), I myself had an unrelated problem which caused the test to fail. When analyzing the failure I was quite puzzled by the fact that the test is obviously buggy. And it seems that I'm not alone: https://public-inbox.org/git/CAH8yC8kSakS807d4jc_BtcUJOrcVT4No37AXSz=jePxhw-o9Dg@mail.gmail.com/T/#u Fix the test to follow its original intention. Signed-off-by: Alexandr Miloslavskiy <alexandr.miloslavskiy@syntevo.com> Reviewed-by: Torsten Bögershausen <tboegi@web.de> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2019-02-13Merge branch 'kd/t0028-octal-del-is-377-not-777'Libravatar Junio C Hamano1-4/+4
Test fix. * kd/t0028-octal-del-is-377-not-777: t0028: fix wrong octal values for BOM in setup
2019-02-11utf8: handle systems that don't write BOM for UTF-16Libravatar brian m. carlson1-5/+29
When serializing UTF-16 (and UTF-32), there are three possible ways to write the stream. One can write the data with a BOM in either big-endian or little-endian format, or one can write the data without a BOM in big-endian format. Most systems' iconv implementations choose to write it with a BOM in some endianness, since this is the most foolproof, and it is resistant to misinterpretation on Windows, where UTF-16 and the little-endian serialization are very common. For compatibility with Windows and to avoid accidental misuse there, Git always wants to write UTF-16 with a BOM, and will refuse to read UTF-16 without it. However, musl's iconv implementation writes UTF-16 without a BOM, relying on the user to interpret it as big-endian. This causes t0028 and the related functionality to fail, since Git won't read the file without a BOM. Add a Makefile and #define knob, ICONV_OMITS_BOM, that can be set if the iconv implementation has this behavior. When set, Git will write a BOM manually for UTF-16 and UTF-32 and then force the data to be written in UTF-16BE or UTF-32BE. We choose big-endian behavior here because the tests use the raw "UTF-16" encoding, which will be big-endian when the implementation requires this knob to be set. Update the tests to detect this case and write test data with an added BOM if necessary. Always write the BOM in the tests in big-endian format, since all iconv implementations that omit a BOM must use big-endian serialization according to the Unicode standard. Preserve the existing behavior for systems which do not have this knob enabled, since they may use optimized implementations, including defaulting to the native endianness, which may improve performance. Signed-off-by: brian m. carlson <sandals@crustytoothpaste.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2019-02-11t0028: fix wrong octal values for BOM in setupLibravatar Kevin Daudt1-4/+4
The setup code uses octal values with printf to generate a BOM for UTF-16/32 BE/LE. It specifically uses '\777' to emit a 0xff byte. This relies on the fact that most shells truncate the value above 0o377. Ash however interprets '\777' as '\77' + a literal '7', resulting in an invalid BOM. Fix this by using the proper value of 0xff: '\377'. Signed-off-by: Kevin Daudt <me@ikke.info> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2019-01-31Support working-tree-encoding "UTF-16LE-BOM"Libravatar Torsten Bögershausen1-1/+11
Users who want UTF-16 files in the working tree set the .gitattributes like this: test.txt working-tree-encoding=UTF-16 The unicode standard itself defines 3 allowed ways how to encode UTF-16. The following 3 versions convert all back to 'g' 'i' 't' in UTF-8: a) UTF-16, without BOM, big endian: $ printf "\000g\000i\000t" | iconv -f UTF-16 -t UTF-8 | od -c 0000000 g i t b) UTF-16, with BOM, little endian: $ printf "\377\376g\000i\000t\000" | iconv -f UTF-16 -t UTF-8 | od -c 0000000 g i t c) UTF-16, with BOM, big endian: $ printf "\376\377\000g\000i\000t" | iconv -f UTF-16 -t UTF-8 | od -c 0000000 g i t Git uses libiconv to convert from UTF-8 in the index into ITF-16 in the working tree. After a checkout, the resulting file has a BOM and is encoded in "UTF-16", in the version (c) above. This is what iconv generates, more details follow below. iconv (and libiconv) can generate UTF-16, UTF-16LE or UTF-16BE: d) UTF-16 $ printf 'git' | iconv -f UTF-8 -t UTF-16 | od -c 0000000 376 377 \0 g \0 i \0 t e) UTF-16LE $ printf 'git' | iconv -f UTF-8 -t UTF-16LE | od -c 0000000 g \0 i \0 t \0 f) UTF-16BE $ printf 'git' | iconv -f UTF-8 -t UTF-16BE | od -c 0000000 \0 g \0 i \0 t There is no way to generate version (b) from above in a Git working tree, but that is what some applications need. (All fully unicode aware applications should be able to read all 3 variants, but in practise we are not there yet). When producing UTF-16 as an output, iconv generates the big endian version with a BOM. (big endian is probably chosen for historical reasons). iconv can produce UTF-16 files with little endianess by using "UTF-16LE" as encoding, and that file does not have a BOM. Not all users (especially under Windows) are happy with this. Some tools are not fully unicode aware and can only handle version (b). Today there is no way to produce version (b) with iconv (or libiconv). Looking into the history of iconv, it seems as if version (c) will be used in all future iconv versions (for compatibility reasons). Solve this dilemma and introduce a Git-specific "UTF-16LE-BOM". libiconv can not handle the encoding, so Git pick it up, handles the BOM and uses libiconv to convert the rest of the stream. (UTF-16BE-BOM is added for consistency) Rported-by: Adrián Gimeno Balaguer <adrigibal@gmail.com> Signed-off-by: Torsten Bögershausen <tboegi@web.de> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2018-08-29tests: fix non-portable iconv invocationLibravatar Ævar Arnfjörð Bjarmason1-1/+5
The iconv that comes with a FreeBSD 11.2-RELEASE-p2 box I have access to doesn't support the SHIFT-JIS encoding. Guard a test added in e92d62253 ("convert: add round trip check based on 'core.checkRoundtripEncoding'", 2018-04-15) first released with Git v2.18.0 with a prerequisite that checks for its availability. The iconv command is in POSIX, and we have numerous tests unconditionally relying on its ability to convert ASCII, UTF-8 and UTF-16, but unconditionally relying on the presence of more obscure encodings isn't portable. Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com> Reviewed-by: Jonathan Nieder <jrnieder@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2018-04-16convert: add round trip check based on 'core.checkRoundtripEncoding'Libravatar Lars Schneider1-0/+39
UTF supports lossless conversion round tripping and conversions between UTF and other encodings are mostly round trip safe as Unicode aims to be a superset of all other character encodings. However, certain encodings (e.g. SHIFT-JIS) are known to have round trip issues [1]. Add 'core.checkRoundtripEncoding', which contains a comma separated list of encodings, to define for what encodings Git should check the conversion round trip if they are used in the 'working-tree-encoding' attribute. Set SHIFT-JIS as default value for 'core.checkRoundtripEncoding'. [1] https://support.microsoft.com/en-us/help/170559/prb-conversion-problem-between-shift-jis-and-unicode Signed-off-by: Lars Schneider <larsxschneider@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2018-04-16convert: add tracing for 'working-tree-encoding' attributeLibravatar Lars Schneider1-0/+2
Add the GIT_TRACE_WORKING_TREE_ENCODING environment variable to enable tracing for content that is reencoded with the 'working-tree-encoding' attribute. This is useful to debug encoding issues. Signed-off-by: Lars Schneider <larsxschneider@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2018-04-16convert: check for detectable errors in UTF encodingsLibravatar Lars Schneider1-0/+62
Check that new content is valid with respect to the user defined 'working-tree-encoding' attribute. Signed-off-by: Lars Schneider <larsxschneider@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2018-04-16convert: add 'working-tree-encoding' attributeLibravatar Lars Schneider1-0/+142
Git recognizes files encoded with ASCII or one of its supersets (e.g. UTF-8 or ISO-8859-1) as text files. All other encodings are usually interpreted as binary and consequently built-in Git text processing tools (e.g. 'git diff') as well as most Git web front ends do not visualize the content. Add an attribute to tell Git what encoding the user has defined for a given file. If the content is added to the index, then Git reencodes the content to a canonical UTF-8 representation. On checkout Git will reverse this operation. Signed-off-by: Lars Schneider <larsxschneider@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>