summaryrefslogtreecommitdiff
path: root/utf8.c
AgeCommit message (Collapse)AuthorFilesLines
2018-08-15Merge branch 'jk/size-t'Libravatar Junio C Hamano1-5/+5
Code clean-up to use size_t/ssize_t when they are the right type. * jk/size-t: strbuf_humanise: use unsigned variables pass st.st_size as hint for strbuf_readlink() strbuf_readlink: use ssize_t strbuf: use size_t for length in intermediate variables reencode_string: use size_t for string lengths reencode_string: use st_add/st_mult helpers
2018-07-24reencode_string: use size_t for string lengthsLibravatar Jeff King1-3/+3
The iconv interface takes a size_t, which is the appropriate type for an in-memory buffer. But our reencode_string_* functions use integers, meaning we may get confusing results when the sizes exceed INT_MAX. Let's use size_t consistently. Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2018-07-24reencode_string: use st_add/st_mult helpersLibravatar Jeff King1-2/+2
When converting a string with iconv, if the output buffer isn't big enough, we grow it. But our growth is done without any concern for integer overflow. So when we add: outalloc = sofar + insz * 2 + 32; we may end up wrapping outalloc (which is a size_t), and allocating a too-small buffer. We then manipulate it further: outsz = outalloc - sofar - 1; and feed outsz back to iconv. If outalloc is wrapped and smaller than sofar, we'll end up with a small allocation but feed a very large outsz to iconv, which could result in it overflowing the buffer. Can we use this to construct an attack wherein the victim clones a repository with a very large commit object with an encoding header, and running "git log" reencodes it into utf8, causing an overflow? An attack of this sort is likely impossible in practice. "sofar" is how many output bytes we've written total, and "insz" is the number of input bytes remaining. Imagine our input doubles in size as we output it (which is easy to do by converting latin1 to utf8, for example), and that we start with N input bytes. Our initial output buffer also starts at N bytes, so after the first call we'd have N/2 input bytes remaining (insz), and have written N bytes (sofar). That means our next allocation will be (N + N/2 * 2 + 32) bytes, or (2N + 32). We can therefore overflow a 32-bit size_t with a commit message that's just under 2^31 bytes, assuming it consists mostly of "doubling" sequences (e.g., latin1 0xe1 which becomes utf8 0xc3 0xa1). But we'll never make it that far with such a message. We'll be spending 2^31 bytes on the original string. And our initial output buffer will also be 2^31 bytes. Which is not going to succeed on a system with a 32-bit size_t, since there will be other things using the address space, too. The initial malloc will fail. If we imagine instead that we can triple the size when converting, then our second allocation becomes (N + 2/3N * 2 + 32), or (7/3N + 32). That still requires two allocations of 3/7 of our address space (6/7 of the total) to succeed. If we imagine we can quadruple, it becomes (5/2N + 32); we need to be able to allocate 4/5 of the address space to succeed. This might start to get plausible. But is it possible to get a 4-to-1 increase in size? Probably if you're converting to some obscure encoding. But since git defaults to utf8 for its output, that's the likely destination encoding for an attack. And while there are 4-character utf8 sequences, it's unlikely that you'd be able find a single-byte source sequence in any encoding. So this is certainly buggy code which should be fixed, but it is probably not a useful attack vector. Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2018-07-09utf8.c: avoid char overflowLibravatar Beat Bolli1-4/+4
In ISO C, char constants must be in the range -128..127. Change the BOM constants to char literals to avoid overflow. Signed-off-by: Beat Bolli <dev+git@drbeat.li> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2018-05-29Sync with Git 2.17.1Libravatar Junio C Hamano1-12/+46
* maint: (25 commits) Git 2.17.1 Git 2.16.4 Git 2.15.2 Git 2.14.4 Git 2.13.7 fsck: complain when .gitmodules is a symlink index-pack: check .gitmodules files with --strict unpack-objects: call fsck_finish() after fscking objects fsck: call fsck_finish() after fscking objects fsck: check .gitmodules content fsck: handle promisor objects in .gitmodules check fsck: detect gitmodules files fsck: actually fsck blob data fsck: simplify ".git" check index-pack: make fsck error message more specific verify_path: disallow symlinks in .gitmodules update-index: stat updated files earlier verify_dotfile: mention case-insensitivity in comment verify_path: drop clever fallthrough skip_prefix: add case-insensitive variant ...
2018-05-22Sync with Git 2.14.4Libravatar Junio C Hamano1-12/+46
* maint-2.14: Git 2.14.4 Git 2.13.7 verify_path: disallow symlinks in .gitmodules update-index: stat updated files earlier verify_dotfile: mention case-insensitivity in comment verify_path: drop clever fallthrough skip_prefix: add case-insensitive variant is_{hfs,ntfs}_dotgitmodules: add tests is_ntfs_dotgit: match other .git files is_hfs_dotgit: match other .git files is_ntfs_dotgit: use a size_t for traversing string submodule-config: verify submodule names as paths
2018-05-21is_hfs_dotgit: match other .git filesLibravatar Jeff King1-12/+46
Both verify_path() and fsck match ".git", ".GIT", and other variants specific to HFS+. Let's allow matching other special files like ".gitmodules", which we'll later use to enforce extra restrictions via verify_path() and fsck. Signed-off-by: Jeff King <peff@peff.net>
2018-05-08Merge branch 'ls/checkout-encoding'Libravatar Junio C Hamano1-2/+63
The new "checkout-encoding" attribute can ask Git to convert the contents to the specified encoding when checking out to the working tree (and the other way around when checking in). * ls/checkout-encoding: convert: add round trip check based on 'core.checkRoundtripEncoding' convert: add tracing for 'working-tree-encoding' attribute convert: check for detectable errors in UTF encodings convert: add 'working-tree-encoding' attribute utf8: add function to detect a missing UTF-16/32 BOM utf8: add function to detect prohibited UTF-16/32 BOM utf8: teach same_encoding() alternative UTF encoding names strbuf: add a case insensitive starts_with() strbuf: add xstrdup_toupper() strbuf: remove unnecessary NUL assignment in xstrdup_tolower()
2018-04-16utf8: add function to detect a missing UTF-16/32 BOMLibravatar Lars Schneider1-0/+13
If the endianness is not defined in the encoding name, then let's be strict and require a BOM to avoid any encoding confusion. The is_missing_required_utf_bom() function returns true if a required BOM is missing. The Unicode standard instructs to assume big-endian if there in no BOM for UTF-16/32 [1][2]. However, the W3C/WHATWG encoding standard used in HTML5 recommends to assume little-endian to "deal with deployed content" [3]. Strictly requiring a BOM seems to be the safest option for content in Git. This function is used in a subsequent commit. [1] http://unicode.org/faq/utf_bom.html#gen6 [2] http://www.unicode.org/versions/Unicode10.0.0/ch03.pdf Section 3.10, D98, page 132 [3] https://encoding.spec.whatwg.org/#utf-16le Signed-off-by: Lars Schneider <larsxschneider@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2018-04-16utf8: add function to detect prohibited UTF-16/32 BOMLibravatar Lars Schneider1-0/+26
Whenever a data stream is declared to be UTF-16BE, UTF-16LE, UTF-32BE or UTF-32LE a BOM must not be used [1]. The function returns true if this is the case. This function is used in a subsequent commit. [1] http://unicode.org/faq/utf_bom.html#bom10 Signed-off-by: Lars Schneider <larsxschneider@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2018-04-16utf8: teach same_encoding() alternative UTF encoding namesLibravatar Lars Schneider1-2/+24
The function same_encoding() could only recognize alternative names for UTF-8 encodings. Teach it to recognize all kinds of alternative UTF encoding names (e.g. utf16). While we are at it, fix a crash that would occur if same_encoding() was called with a NULL argument and a non-NULL argument. This function is used in a subsequent commit. Signed-off-by: Lars Schneider <larsxschneider@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2018-04-11unicode_width.h: rename to use dash in file nameLibravatar Stefan Beller1-1/+1
This is more consistent with the project style. The majority of Git's source files use dashes in preference to underscores in their file names. Also adjust contrib/update-unicode as well. Signed-off-by: Stefan Beller <sbeller@google.com>
2017-10-10cleanup: fix possible overflow errors in binary searchLibravatar Derrick Stolee1-1/+1
A common mistake when writing binary search is to allow possible integer overflow by using the simple average: mid = (min + max) / 2; Instead, use the overflow-safe version: mid = min + (max - min) / 2; This translation is safe since the operation occurs inside a loop conditioned on "min < max". The included changes were found using the following git grep: git grep '/ *2;' '*.c' Making this cleanup will prevent future review friction when a new binary search is contructed based on existing code. Signed-off-by: Derrick Stolee <dstolee@microsoft.com> Reviewed-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2017-09-07utf8: release strbuf on error return in strbuf_utf8_replace()Libravatar Rene Scharfe1-1/+2
Signed-off-by: Rene Scharfe <l.s.r@web.de> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2016-09-26utf8: accept "latin-1" as ISO-8859-1Libravatar Junio C Hamano1-0/+7
Even though latin-1 is still seen in e-mail headers, some platforms only install ISO-8859-1. "iconv -f ISO-8859-1" succeeds, while "iconv -f latin-1" fails on such a system. Using the same fallback_encoding() mechanism factored out in the previous step, teach ourselves that "ISO-8859-1" has a better chance of being accepted than "latin-1". Signed-off-by: Junio C Hamano <gitster@pobox.com>
2016-09-26utf8: refactor code to decide fallback encodingLibravatar Junio C Hamano1-11/+18
The codepath we use to call iconv_open() has a provision to use a fallback encoding when it fails, hoping that "UTF-8" being spelled differently could be the reason why the library function did not like the encoding names we gave it. Essentially, we turn what we have observed to be used as variants of "UTF-8" (e.g. "utf8") into the most official spelling and use that as a fallback. We do the same thing for input and output encoding. Introduce a helper function to do just one side and call that twice. Signed-off-by: Junio C Hamano <gitster@pobox.com>
2015-09-17utf8: add function to align a string into given strbufLibravatar Karthik Nayak1-0/+21
Add strbuf_utf8_align() which will align a given string into a strbuf as per given align_type and width. If the width is greater than the string length then no alignment is performed. Helped-by: Eric Sunshine <sunshine@sunshineco.com> Mentored-by: Christian Couder <christian.couder@gmail.com> Mentored-by: Matthieu Moy <matthieu.moy@grenoble-inp.fr> Signed-off-by: Karthik Nayak <karthik.188@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2015-04-16utf8-bom: introduce skip_utf8_bom() helperLibravatar Junio C Hamano1-0/+11
With the recent change to ignore the UTF8 BOM at the beginning of .gitignore files, we now have two codepaths that do such a skipping (the other one is for reading the configuration files). Introduce utf8_bom[] constant string and skip_utf8_bom() helper and teach .gitignore code how to use it. Signed-off-by: Junio C Hamano <gitster@pobox.com>
2015-01-07Merge branch 'maint-2.1' into maintLibravatar Junio C Hamano1-12/+20
* maint-2.1: is_hfs_dotgit: loosen over-eager match of \u{..47}
2015-01-07Merge branch 'maint-2.0' into maint-2.1Libravatar Junio C Hamano1-12/+20
* maint-2.0: is_hfs_dotgit: loosen over-eager match of \u{..47}
2015-01-07Merge branch 'maint-1.9' into maint-2.0Libravatar Junio C Hamano1-12/+20
* maint-1.9: is_hfs_dotgit: loosen over-eager match of \u{..47}
2015-01-07Merge branch 'maint-1.8.5' into maint-1.9Libravatar Junio C Hamano1-12/+20
* maint-1.8.5: is_hfs_dotgit: loosen over-eager match of \u{..47}
2014-12-29is_hfs_dotgit: loosen over-eager match of \u{..47}Libravatar Jeff King1-12/+20
Our is_hfs_dotgit function relies on the hackily-implemented next_hfs_char to give us the next character that an HFS+ filename comparison would look at. It's hacky because it doesn't implement the full case-folding table of HFS+; it gives us just enough to see if the path matches ".git". At the end of next_hfs_char, we use tolower() to convert our 32-bit code point to lowercase. Our tolower() implementation only takes an 8-bit char, though; it throws away the upper 24 bits. This means we can't have any false negatives for is_hfs_dotgit. We only care about matching 7-bit ASCII characters in ".git", and we will correctly process 'G' or 'g'. However, we _can_ have false positives. Because we throw away the upper bits, code point \u{0147} (for example) will look like 'G' and get downcased to 'g'. It's not known whether a sequence of code points whose truncation ends up as ".git" is meaningful in any language, but it does not hurt to be more accurate here. We can just pass out the full 32-bit code point, and compare it manually to the upper and lowercase characters we care about. Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2014-12-17Sync with v2.1.4Libravatar Junio C Hamano1-0/+64
* maint-2.1: Git 2.1.4 Git 2.0.5 Git 1.9.5 Git 1.8.5.6 fsck: complain about NTFS ".git" aliases in trees read-cache: optionally disallow NTFS .git variants path: add is_ntfs_dotgit() helper fsck: complain about HFS+ ".git" aliases in trees read-cache: optionally disallow HFS+ .git variants utf8: add is_hfs_dotgit() helper fsck: notice .git case-insensitively t1450: refactor ".", "..", and ".git" fsck tests verify_dotfile(): reject .git case-insensitively read-tree: add tests for confusing paths like ".." and ".git" unpack-trees: propagate errors adding entries to the index
2014-12-17Sync with v2.0.5Libravatar Junio C Hamano1-0/+64
* maint-2.0: Git 2.0.5 Git 1.9.5 Git 1.8.5.6 fsck: complain about NTFS ".git" aliases in trees read-cache: optionally disallow NTFS .git variants path: add is_ntfs_dotgit() helper fsck: complain about HFS+ ".git" aliases in trees read-cache: optionally disallow HFS+ .git variants utf8: add is_hfs_dotgit() helper fsck: notice .git case-insensitively t1450: refactor ".", "..", and ".git" fsck tests verify_dotfile(): reject .git case-insensitively read-tree: add tests for confusing paths like ".." and ".git" unpack-trees: propagate errors adding entries to the index
2014-12-17Sync with v1.9.5Libravatar Junio C Hamano1-0/+64
* maint-1.9: Git 1.9.5 Git 1.8.5.6 fsck: complain about NTFS ".git" aliases in trees read-cache: optionally disallow NTFS .git variants path: add is_ntfs_dotgit() helper fsck: complain about HFS+ ".git" aliases in trees read-cache: optionally disallow HFS+ .git variants utf8: add is_hfs_dotgit() helper fsck: notice .git case-insensitively t1450: refactor ".", "..", and ".git" fsck tests verify_dotfile(): reject .git case-insensitively read-tree: add tests for confusing paths like ".." and ".git" unpack-trees: propagate errors adding entries to the index
2014-12-17Sync with v1.8.5.6Libravatar Junio C Hamano1-0/+64
* maint-1.8.5: Git 1.8.5.6 fsck: complain about NTFS ".git" aliases in trees read-cache: optionally disallow NTFS .git variants path: add is_ntfs_dotgit() helper fsck: complain about HFS+ ".git" aliases in trees read-cache: optionally disallow HFS+ .git variants utf8: add is_hfs_dotgit() helper fsck: notice .git case-insensitively t1450: refactor ".", "..", and ".git" fsck tests verify_dotfile(): reject .git case-insensitively read-tree: add tests for confusing paths like ".." and ".git" unpack-trees: propagate errors adding entries to the index
2014-12-17utf8: add is_hfs_dotgit() helperLibravatar Jeff King1-0/+64
We do not allow paths with a ".git" component to be added to the index, as that would mean repository contents could overwrite our repository files. However, asking "is this path the same as .git" is not as simple as strcmp() on some filesystems. HFS+'s case-folding does more than just fold uppercase into lowercase (which we already handle with strcasecmp). It may also skip past certain "ignored" Unicode code points, so that (for example) ".gi\u200ct" is mapped ot ".git". The full list of folds can be found in the tables at: https://www.opensource.apple.com/source/xnu/xnu-1504.15.3/bsd/hfs/hfscommon/Unicode/UCStringCompareData.h Implementing a full "is this path the same as that path" comparison would require us importing the whole set of tables. However, what we want to do is much simpler: we only care about checking ".git". We know that 'G' is the only thing that folds to 'g', and so on, so we really only need to deal with the set of ignored code points, which is much smaller. Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2014-09-19Merge branch 'rs/export-strbuf-addchars'Libravatar Junio C Hamano1-7/+0
Code clean-up. * rs/export-strbuf-addchars: strbuf: use strbuf_addchars() for adding a char multiple times strbuf: export strbuf_addchars()
2014-09-09Merge branch 'nd/strbuf-utf8-replace'Libravatar Junio C Hamano1-0/+3
* nd/strbuf-utf8-replace: utf8.c: fix strbuf_utf8_replace() consuming data beyond input string
2014-09-08strbuf: export strbuf_addchars()Libravatar René Scharfe1-7/+0
Move strbuf_addchars() to strbuf.c, where it belongs, and make it available for other callers. Signed-off-by: Rene Scharfe <l.s.r@web.de> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2014-08-11utf8.c: fix strbuf_utf8_replace() consuming data beyond input stringLibravatar Nguyễn Thái Ngọc Duy1-0/+3
The main loop in strbuf_utf8_replace() could summed up as: while ('src' is still valid) { 1) advance 'src' to copy ANSI escape sequences 2) advance 'src' to copy/replace visible characters } The problem is after #1, 'src' may have reached the end of the string (so 'src' points to NUL) and #2 will continue to copy that NUL as if it's a normal character. Because the output is stored in a strbuf, this NUL accounted in the 'len' field as well. Check after #1 and break the loop if necessary. The test does not look obvious, but the combination of %>>() should make a call trace like this show_log() pretty_print_commit() format_commit_message() strbuf_expand() format_commit_item() format_and_pad_commit() strbuf_utf8_replace() where %C(auto)%d would insert a color reset escape sequence in the end of the string given to strbuf_utf8_replace() and show_log() uses fwrite() to send everything to stdout (including the incorrect NUL inserted by strbuf_utf8_replace) Signed-off-by: Nguyễn Thái Ngọc Duy <pclouds@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2014-06-06Merge branch 'tb/unicode-6.3-zero-width'Libravatar Junio C Hamano1-69/+7
Update the logic to compute the display width needed for utf8 strings and allow us to more easily maintain the tables used in that logic. We may want to let the users choose if codepoints with ambiguous widths are treated as a double or single width in a follow-up patch. * tb/unicode-6.3-zero-width: utf8: make it easier to auto-update git_wcwidth() utf8.c: use a table for double_width
2014-05-12utf8: make it easier to auto-update git_wcwidth()Libravatar Torsten Bögershausen1-59/+2
The function git_wcwidth() returns for a given unicode code point the width on the display: -1 for control characters, 0 for combining or other non-visible code points 1 for e.g. ASCII 2 for double-width code points. This table had been originally been extracted for one Unicode version, probably 3.2. We now use two tables these days, one for zero-width and another for double-width. Make it easier to update these tables to a later version of Unicode by factoring out the table from utf8.c into unicode_width.h and add the script update_unicode.sh to update the table based on the latest Unicode specification files. Thanks to Peter Krefting <peter@softwolves.pp.se> and Kevin Bracey <kevin@bracey.fi> for helping with their Unicode knowledge. Signed-off-by: Torsten Bögershausen <tboegi@web.de> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2014-05-12utf8.c: use a table for double_widthLibravatar Torsten Bögershausen1-23/+18
Refactor git_wcwidth() and replace the if-else-if chain. Use the table double_width which is scanned by the bisearch() function, which is already used to find combining code points. Signed-off-by: Torsten Bögershausen <tboegi@web.de> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2014-04-16Merge branch 'tb/unicode-6.3-zero-width'Libravatar Junio C Hamano1-5/+4
Teach our display-column-counting logic about decomposed umlauts and friends. * tb/unicode-6.3-zero-width: utf8.c: partially update to version 6.3
2014-04-09utf8.c: partially update to version 6.3Libravatar Torsten Bögershausen1-5/+4
Unicode 6.3 defines more code points as combining or accents. For example, the character "ö" could be expressed as an "o" followed by U+0308 COMBINING DIARESIS (aka umlaut, double-dot-above). We should consider that such a sequence of two codepoints occupies one display column for the alignment purposes, and for that, git_wcwidth() should return 0 for them. Affected codepoints are: U+0358..U+035C U+0487 U+05A2, U+05BA, U+05C5, U+05C7 U+0604, U+0616..U+061A, U+0659..U+065F Earlier unicode standards had defined these as "reserved". Only the range 0..U+07FF has been checked to see which codepoints need to be marked as 0-width while preparing for this commit; more updates may be needed. Signed-off-by: Torsten Bögershausen <tboegi@web.de> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2014-02-18utf8: use correct type for values in interval tableLibravatar John Keeping1-2/+2
We treat these as unsigned everywhere and compare against unsigned values, so declare them using the typedef we already have for this. While we're here, fix the indentation as well. Signed-off-by: John Keeping <john@keeping.me.uk> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2014-02-18utf8: fix iconv error detectionLibravatar John Keeping1-1/+1
iconv(3) returns "(size_t) -1" on error. Make sure that we cast the "-1" properly when checking for this. Signed-off-by: John Keeping <john@keeping.me.uk> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2013-04-28pretty: Fix bug in truncation support for %>, %< and %><Libravatar Ramsay Jones1-2/+2
Some systems experience failures in t4205-*.sh (tests 18-20, 27) which all relate to the use of truncation with the %< padding placeholder. This capability was added in the commit a7f01c6b ("pretty: support truncating in %>, %< and %><", 19-04-2013). The truncation support was implemented with the assistance of a new strbuf function (strbuf_utf8_replace). This function contains the following code: strbuf_attach(sb_src, strbuf_detach(&sb_dst, NULL), sb_dst.len, sb_dst.alloc); Unfortunately, this code is subject to unspecified behaviour. In particular, the order of evaluation of the argument expressions (along with the associated side effects) is not specified by the C standard. Note that the second argument expression is a call to strbuf_detach() which, as a side effect, sets the 'len' and 'alloc' fields of the sb_dst argument to zero. Depending on the order of evaluation of the argument expressions to the strbuf_attach call, this can lead to assigning an empty string to 'sb_src'. In order to remove the undesired behaviour, we replace the above line of code with: strbuf_swap(sb_src, &sb_dst); strbuf_release(&sb_dst); which achieves the desired effect without provoking unspecified behaviour. Signed-off-by: Ramsay Jones <ramsay@ramsay1.demon.co.uk> Acked-by: Duy Nguyen <pclouds@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2013-04-18pretty: support %>> that steal trailing spacesLibravatar Nguyễn Thái Ngọc Duy1-1/+1
This is pretty useful in `%<(100)%s%Cred%>(20)% an' where %s does not use up all 100 columns and %an needs more than 20 columns. By replacing %>(20) with %>>(20), %an can steal spaces from %s. %>> understands escape sequences, so %Cred does not stop it from stealing spaces in %<(100). Signed-off-by: Nguyễn Thái Ngọc Duy <pclouds@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2013-04-18pretty: support truncating in %>, %< and %><Libravatar Nguyễn Thái Ngọc Duy1-0/+46
%>(N,trunc) truncates the right part after N columns and replace the last two letters with "..". ltrunc does the same on the left. mtrunc cuts the middle out. Signed-off-by: Nguyễn Thái Ngọc Duy <pclouds@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2013-04-18utf8.c: add reencode_string_len() that can handle NULs in stringLibravatar Nguyễn Thái Ngọc Duy1-3/+7
Signed-off-by: Nguyễn Thái Ngọc Duy <pclouds@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2013-04-18utf8.c: add utf8_strnwidth() with the ability to skip ansi sequencesLibravatar Nguyễn Thái Ngọc Duy1-6/+14
Signed-off-by: Nguyễn Thái Ngọc Duy <pclouds@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2013-04-18utf8.c: move display_mode_esc_sequence_len() for use by other functionsLibravatar Nguyễn Thái Ngọc Duy1-14/+14
Signed-off-by: Nguyễn Thái Ngọc Duy <pclouds@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2013-03-25Merge branch 'ks/rfc2047-one-char-at-a-time'Libravatar Junio C Hamano1-0/+39
When "format-patch" quoted a non-ascii strings on the header files, it incorrectly applied rfc2047 and chopped a single character in the middle of it. * ks/rfc2047-one-char-at-a-time: format-patch: RFC 2047 says multi-octet character may not be split
2013-03-21Merge branch 'jk/utf-8-can-be-spelled-differently'Libravatar Junio C Hamano1-2/+18
Some platforms and users spell UTF-8 differently; retry with the most official "UTF-8" when the system does not understand the user-supplied encoding name that are the common alternative spellings of UTF-8. * jk/utf-8-can-be-spelled-differently: utf8: accept alternate spellings of UTF-8
2013-03-09format-patch: RFC 2047 says multi-octet character may not be splitLibravatar Kirill Smelkov1-0/+39
Even though an earlier attempt (bafc478..41dd00bad) cleaned up RFC 2047 encoding, pretty.c::add_rfc2047() still decides where to split the output line by going through the input one byte at a time, and potentially splits a character in the middle. A subject line may end up showing like this: ".... fö?? bar". (instead of ".... föö bar".) if split incorrectly. RFC 2047, section 5 (3) explicitly forbids such beaviour Each 'encoded-word' MUST represent an integral number of characters. A multi-octet character may not be split across adjacent 'encoded- word's. that means that e.g. for Subject: .... föö bar encoding Subject: =?UTF-8?q?....=20f=C3=B6=C3=B6?= =?UTF-8?q?=20bar?= is correct, and Subject: =?UTF-8?q?....=20f=C3=B6=C3?= <-- NOTE ö is broken here =?UTF-8?q?=B6=20bar?= is not, because "ö" character UTF-8 encoding C3 B6 is split here across adjacent encoded words. To fix the problem, make the loop grab one _character_ at a time and determine its output length to see where to break the output line. Note that this version only knows about UTF-8, but the logic to grab one character is abstracted out in mbs_chrlen() function to make it possible to extend it to other encodings with the help of iconv in the future. Signed-off-by: Kirill Smelkov <kirr@mns.spb.ru> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2013-02-25utf8: accept alternate spellings of UTF-8Libravatar Jeff King1-2/+18
The iconv implementation on many platforms will accept variants of UTF-8, including "UTF8", "utf-8", and "utf8", but some do not. We make allowances in our code to treat them all identically, but we sometimes hand the string from the user directly to iconv. In this case, the platform iconv may or may not work. There are really four levels of platform iconv support for these synonyms: 1. All synonyms understood (e.g., glibc). 2. Only the official "UTF-8" understood (e.g., Windows). 3. Official "UTF-8" not understood, but some other synonym understood (it's not known whether such a platform exists). 4. Neither "UTF-8" nor any synonym understood (e.g., ancient systems, or ones without utf8 support installed). This patch teaches git to fall back to using the official "UTF-8" spelling when iconv_open fails (and the encoding was one of the synonym spellings). This makes things more convenient to users of type 2 systems, as they can now use any of the synonyms for the log output encoding. Type 1 systems are not affected, as iconv already works on the first try. Type 4 systems are not affected, as both attempts already fail. Type 3 systems will not benefit from the feature, but because we only use "UTF-8" as a fallback, they will not be regressed (i.e., you can continue to use "utf8" if your platform supports it). We could try all the various synonyms, but since such systems are not even known to exist, it's not worth the effort. Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2013-02-14Merge branch 'jx/utf8-printf-width'Libravatar Junio C Hamano1-0/+21
Use a new helper that prints a message and counts its display width to align the help messages parse-options produces. * jx/utf8-printf-width: Add utf8_fprintf helper that returns correct number of columns