path: root/utf8.c
Age | Commit message | Author | Files | Lines
2013-04-03 | Merge branch 'ks/rfc2047-one-char-at-a-time' into maint | Junio C Hamano | 1 | -0/+39
When "format-patch" quoted non-ASCII strings in the header fields, it incorrectly applied rfc2047 and chopped a single character in the middle.
* ks/rfc2047-one-char-at-a-time:
  format-patch: RFC 2047 says multi-octet character may not be split
2013-03-26 | Merge branch 'jk/utf-8-can-be-spelled-differently' into maint | Junio C Hamano | 1 | -2/+18
Some platforms and users spell UTF-8 differently; retry with the most official "UTF-8" when the system does not understand a user-supplied encoding name that is one of the common alternative spellings of UTF-8.
* jk/utf-8-can-be-spelled-differently:
  utf8: accept alternate spellings of UTF-8
2013-03-09 | format-patch: RFC 2047 says multi-octet character may not be split | Kirill Smelkov | 1 | -0/+39
Even though an earlier attempt (bafc478..41dd00bad) cleaned up RFC 2047 encoding, pretty.c::add_rfc2047() still decides where to split the output line by going through the input one byte at a time, and potentially splits a character in the middle. A subject line may end up showing like this:

    ".... fö?? bar"

(instead of ".... föö bar") if split incorrectly.

RFC 2047, section 5 (3) explicitly forbids such behaviour:

    Each 'encoded-word' MUST represent an integral number of characters. A multi-octet character may not be split across adjacent 'encoded-word's.

That means that, e.g., for

    Subject: .... föö bar

the encoding

    Subject: =?UTF-8?q?....=20f=C3=B6=C3=B6?=
     =?UTF-8?q?=20bar?=

is correct, and

    Subject: =?UTF-8?q?....=20f=C3=B6=C3?=      <-- NOTE ö is broken here
     =?UTF-8?q?=B6=20bar?=

is not, because the UTF-8 encoding of the "ö" character, C3 B6, is split across adjacent encoded words.

To fix the problem, make the loop grab one _character_ at a time and determine its output length to see where to break the output line. Note that this version only knows about UTF-8, but the logic to grab one character is abstracted out in the mbs_chrlen() function to make it possible to extend it to other encodings with the help of iconv in the future.

Signed-off-by: Kirill Smelkov <kirr@mns.spb.ru>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
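The character-at-a-time idea can be sketched in a few lines of C. This is only an illustration, not the code from the patch: utf8_chrlen() and chunk_bytes() are hypothetical names, and the cost model is simplified (space and non-ASCII bytes cost three output bytes as "=XX", everything else costs one), which ignores the other characters real Q-encoding must escape.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper: byte length of the UTF-8 character starting at s,
 * judged from its lead byte. Invalid or truncated input counts as one
 * byte so that the caller always makes progress.
 */
static size_t utf8_chrlen(const unsigned char *s, size_t remaining)
{
    size_t len;

    if (s[0] < 0x80)
        len = 1;
    else if ((s[0] & 0xe0) == 0xc0)
        len = 2;
    else if ((s[0] & 0xf0) == 0xe0)
        len = 3;
    else if ((s[0] & 0xf8) == 0xf0)
        len = 4;
    else
        len = 1; /* stray continuation byte or invalid lead byte */
    return len <= remaining ? len : 1;
}

/*
 * Return how many input bytes fit into one encoded-word whose encoded
 * payload may not exceed `budget` bytes, never cutting a character.
 * Assumed cost model: space and bytes >= 0x80 take 3 bytes, others 1.
 */
static size_t chunk_bytes(const char *text, size_t len, size_t budget)
{
    const unsigned char *s = (const unsigned char *)text;
    size_t in = 0, out = 0;

    assert(text != NULL);
    while (in < len) {
        size_t chrlen = utf8_chrlen(s + in, len - in);
        size_t cost = 0, i;

        for (i = 0; i < chrlen; i++)
            cost += (s[in + i] >= 0x80 || s[in + i] == ' ') ? 3 : 1;
        if (out + cost > budget)
            break; /* the whole character moves to the next word */
        in += chrlen;
        out += cost;
    }
    return in;
}
```

With a budget of 10, the input "föö" stops after "fö" (3 bytes), whereas a byte-at-a-time loop would also take the lone C3 byte. A caller has to handle a return value of 0, which happens when even the first character exceeds the budget.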
2013-02-25 | utf8: accept alternate spellings of UTF-8 | Jeff King | 1 | -2/+18
The iconv implementation on many platforms will accept variants of UTF-8, including "UTF8", "utf-8", and "utf8", but some do not. We make allowances in our code to treat them all identically, but we sometimes hand the string from the user directly to iconv. In this case, the platform iconv may or may not work.

There are really four levels of platform iconv support for these synonyms:

  1. All synonyms understood (e.g., glibc).
  2. Only the official "UTF-8" understood (e.g., Windows).
  3. Official "UTF-8" not understood, but some other synonym understood (it's not known whether such a platform exists).
  4. Neither "UTF-8" nor any synonym understood (e.g., ancient systems, or ones without utf8 support installed).

This patch teaches git to fall back to using the official "UTF-8" spelling when iconv_open fails (and the encoding was one of the synonym spellings). This makes things more convenient to users of type 2 systems, as they can now use any of the synonyms for the log output encoding.

Type 1 systems are not affected, as iconv already works on the first try.

Type 4 systems are not affected, as both attempts already fail.

Type 3 systems will not benefit from the feature, but because we only use "UTF-8" as a fallback, they will not be regressed (i.e., you can continue to use "utf8" if your platform supports it). We could try all the various synonyms, but since such systems are not even known to exist, it's not worth the effort.

Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
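The fallback described above might look roughly like the following sketch. The helper names official_utf8_name() and open_with_fallback() are made up for this illustration and the set of recognised synonyms is an assumption; only iconv_open() itself is the real POSIX call.

```c
#include <assert.h>
#include <iconv.h>
#include <stddef.h>
#include <string.h>
#include <strings.h>

/*
 * Hypothetical helper: return the official name when `name` is one of
 * the alternative spellings of UTF-8, or NULL when there is nothing to
 * retry with (already official, or not UTF-8 at all).
 */
static const char *official_utf8_name(const char *name)
{
    assert(name != NULL);
    if (!strcmp(name, "UTF-8"))
        return NULL;
    if (!strcasecmp(name, "utf-8") || !strcasecmp(name, "utf8"))
        return "UTF-8";
    return NULL;
}

/* Try the names as given first; retry with "UTF-8" only on failure. */
static iconv_t open_with_fallback(const char *to, const char *from)
{
    iconv_t cd = iconv_open(to, from);
    const char *to2, *from2;

    if (cd != (iconv_t)-1)
        return cd; /* type 1 systems succeed here */
    to2 = official_utf8_name(to);
    from2 = official_utf8_name(from);
    if (!to2 && !from2)
        return (iconv_t)-1; /* neither name was a synonym */
    return iconv_open(to2 ? to2 : to, from2 ? from2 : from);
}
```

Because the user-supplied spelling is always tried first, a type 3 platform that understands only "utf8" keeps working, which matches the no-regression argument in the message.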
2013-02-14 | Merge branch 'jx/utf8-printf-width' | Junio C Hamano | 1 | -0/+21
Use a new helper that prints a message and counts its display width to align the help messages parse-options produces. * jx/utf8-printf-width: Add utf8_fprintf helper that returns correct number of columns
2013-02-11 | Add utf8_fprintf helper that returns correct number of columns | Jiang Xin | 1 | -0/+21
Since command usages can be translated, they may include utf-8 encoded strings, and the output in console may not align well any more. This is because strlen() is different from strwidth() on utf-8 strings. A wrapper utf8_fprintf() can help to return the correct number of columns required. Signed-off-by: Jiang Xin <worldhello.net@gmail.com> Signed-off-by: Nguyễn Thái Ngọc Duy <pclouds@gmail.com> Reviewed-by: Torsten Bögershausen <tboegi@web.de> Signed-off-by: Junio C Hamano <gitster@pobox.com>
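The gap between byte count and column count can be shown with a small sketch. It is an approximation under a stated assumption: every character is taken to occupy exactly one column, so wide (e.g. CJK) and zero-width characters are not handled. approx_columns() and padding_for() are hypothetical names, not the helper added by this commit.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Approximate the display columns of a UTF-8 string by counting every
 * byte that is not a continuation byte (10xxxxxx).
 * Assumption: one column per character.
 */
static size_t approx_columns(const char *s)
{
    size_t cols = 0;

    assert(s != NULL);
    for (; *s; s++)
        if (((unsigned char)*s & 0xc0) != 0x80)
            cols++;
    return cols;
}

/* Spaces needed to pad `s` out to `width` columns (0 if already wider). */
static size_t padding_for(const char *s, size_t width)
{
    size_t cols = approx_columns(s);

    return cols < width ? width - cols : 0;
}
```

For "föö" the byte length is 5 but the column count is 3, so padding computed from strlen() would come out two spaces short.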
2013-01-02 | Merge branch 'sp/shortlog-missing-lf' | Junio C Hamano | 1 | -7/+6
When a line to be wrapped has a solid run of non-space characters whose length is exactly the wrap width, "git shortlog -w" failed to add a newline after such a line.
* sp/shortlog-missing-lf:
  strbuf_add_wrapped*(): Remove unused return value
  shortlog: fix wrapping lines of wraplen
2012-12-11 | strbuf_add_wrapped*(): Remove unused return value | Steffen Prohaska | 1 | -7/+6
Since shortlog isn't using the return value anymore (see previous commit), the functions can be changed to void. Signed-off-by: Steffen Prohaska <prohaska@zib.de> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2012-12-07 | Merge branch 'jc/same-encoding' into maint | Junio C Hamano | 1 | -0/+7
Various codepaths checked if two encoding names are the same using ad-hoc code and some of them ended up asking iconv() to convert between "utf8" and "UTF-8". The former is not a valid way to spell the encoding name, but often people use it by mistake, and we equated them in some but not all codepaths. Introduce a new helper function to make these codepaths consistent. * jc/same-encoding: reencode_string(): introduce and use same_encoding()
2012-11-20 | Merge branch 'js/format-2047' into maint | Junio C Hamano | 1 | -1/+1
Various rfc2047 quoting issues around a non-ASCII name on the From: line in the output from format-patch have been corrected.
* js/format-2047:
  format-patch tests: check quoting/encoding in To: and Cc: headers
  format-patch: fix rfc2047 address encoding with respect to rfc822 specials
  format-patch: make rfc2047 encoding more strict
  format-patch: introduce helper function last_line_length()
  format-patch: do not wrap rfc2047 encoded headers too late
  format-patch: do not wrap non-rfc2047 headers too early
  utf8: fix off-by-one wrapping of text
2012-11-15 | Merge branch 'jc/same-encoding' | Junio C Hamano | 1 | -0/+7
Various codepaths checked if two encoding names are the same using ad-hoc code and some of them ended up asking iconv() to convert between "utf8" and "UTF-8". The former is not a valid way to spell the encoding name, but often people use it by mistake, and we equated them in some but not all codepaths. Introduce a new helper function to make these codepaths consistent. * jc/same-encoding: reencode_string(): introduce and use same_encoding() Conflicts: builtin/mailinfo.c
2012-11-09 | Merge branch 'js/format-2047' | Jeff King | 1 | -1/+1
Fixes many rfc2047 quoting issues in the output from format-patch.
* js/format-2047:
  format-patch tests: check quoting/encoding in To: and Cc: headers
  format-patch: fix rfc2047 address encoding with respect to rfc822 specials
  format-patch: make rfc2047 encoding more strict
  format-patch: introduce helper function last_line_length()
  format-patch: do not wrap rfc2047 encoded headers too late
  format-patch: do not wrap non-rfc2047 headers too early
  utf8: fix off-by-one wrapping of text
2012-11-04 | reencode_string(): introduce and use same_encoding() | Junio C Hamano | 1 | -0/+7
Callers of reencode_string() that re-encode a string from one encoding to another all used ad-hoc ways to bypass the case where the input and the output encodings are the same. Some did strcmp(), some did strcasecmp(), and yet others, when converting to UTF-8, used is_encoding_utf8().

Introduce a same_encoding() helper function to make these callers use the same logic. Notably, is_encoding_utf8() has a work-around for the common misconfiguration of using "utf8" to name the UTF-8 encoding, which does not match "UTF-8", so strcasecmp() would not consider them the same. Make use of it in this helper function.

Signed-off-by: Junio C Hamano <gitster@pobox.com>
2012-10-18 | utf8: fix off-by-one wrapping of text | Jan H. Schönherr | 1 | -1/+1
The wrapping logic in strbuf_add_wrapped_text() does currently not allow lines that entirely fill the allowed width, instead it wraps the line one character too early.

For example, the text "This is the sixth commit." formatted via "%w(11,1,2)" (wrap at 11 characters, 1 char indent of first line, 2 char indent of following lines) results in four lines: " This is", "  the", "  sixth", "  commit."

This is wrong, because "  the sixth" is exactly 11 characters long, and thus allowed.

Fix this by allowing the (width+1) character of a line to be a valid wrapping point if it is a whitespace character.

Signed-off-by: Jan H. Schönherr <schnhrr@cs.tu-berlin.de>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
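The boundary condition is easy to see in a simplified greedy wrapper. This sketch is not the strbuf_add_wrapped_text() implementation: wrap_text() and its helpers are hypothetical, it handles ASCII words separated by spaces only, and it writes into a fixed buffer that truncates on overflow.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Append at most n bytes of s to the NUL-terminated buffer `out`. */
static void append(char *out, size_t outsz, const char *s, size_t n)
{
    size_t used = strlen(out);
    size_t room = outsz - 1 - used;

    if (n > room)
        n = room; /* truncate instead of overflowing */
    memcpy(out + used, s, n);
    out[used + n] = '\0';
}

static void append_spaces(char *out, size_t outsz, int n)
{
    while (n-- > 0)
        append(out, outsz, " ", 1);
}

/*
 * Greedy word wrap: `indent` spaces before the first line, `indent2`
 * before the others. A line may fill the width exactly.
 */
static void wrap_text(char *out, size_t outsz, const char *text,
                      int indent, int indent2, int width)
{
    int col = 0, first = 1;

    assert(outsz > 0);
    out[0] = '\0';
    for (;;) {
        size_t wordlen;

        text += strspn(text, " ");
        wordlen = strcspn(text, " ");
        if (!wordlen)
            break;
        if (first) {
            append_spaces(out, outsz, indent);
            col = indent;
            first = 0;
        } else if (col + 1 + (int)wordlen <= width) {
            /* "<=" rather than "<": ending at column `width` is fine */
            append(out, outsz, " ", 1);
            col++;
        } else {
            append(out, outsz, "\n", 1);
            append_spaces(out, outsz, indent2);
            col = indent2;
        }
        append(out, outsz, text, wordlen);
        col += (int)wordlen;
        text += wordlen;
    }
}
```

Called with the text and parameters from the message (width 11, indents 1 and 2), this yields three lines, with "  the sixth" filling the width exactly. Writing the comparison as "<" instead of "<=" reproduces the four-line result the commit describes as wrong.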
2012-07-08 | git on Mac OS and precomposed unicode | Torsten Bögershausen | 1 | -10/+16
Mac OS X mangles file names containing unicode on file systems HFS+, VFAT or SAMBA. When a file using unicode code points outside ASCII is created on a HFS+ drive, the file name is converted into decomposed unicode and written to disk. No conversion is done if the file name is already decomposed unicode.

Calling open("\xc3\x84", ...) with a precomposed "Ä" yields the same result as open("\x41\xcc\x88",...) with a decomposed "Ä".

As a consequence, readdir() returns the file names in decomposed unicode, even if the user expects precomposed unicode. Unlike on HFS+, Mac OS X stores files on a VFAT drive (e.g. a USB drive) in precomposed unicode, but readdir() still returns file names in decomposed unicode. When a git repository is stored on a network share using SAMBA, file names are sent over the wire and written to disk on the remote system in precomposed unicode, but Mac OS X readdir() returns decomposed unicode to be compatible with its behaviour on HFS+ and VFAT.

The unicode decomposition causes many problems:

- The names "git add" and other commands get from the end user may often be precomposed form (the decomposed form is not easily input from the keyboard), but when the commands read from the filesystem to see what it is going to update the index with already is on the filesystem, readdir() will give decomposed form, which is different.

- Similarly "git log", "git mv" and all other commands that need to compare pathnames found on the command line (often but not always precomposed form; a command line input resulting from globbing may be in decomposed) with pathnames found in the tree objects (should be precomposed form to be compatible with other systems and for consistency in general).

- The same for names stored in the index, which should be precomposed, that may need to be compared with the names read from readdir().

NFS mounted from Linux is fully transparent and does not suffer from the above.

As Mac OS X treats precomposed and decomposed file names as equal, we can

- wrap readdir() on Mac OS X to return the precomposed form, and

- normalize decomposed form given from the command line also to the precomposed form,

to ensure that all pathnames used in Git are always in the precomposed form. This behaviour can be requested by setting the "core.precomposedunicode" configuration variable to true.

The code in compat/precomposed_utf8.c implements basically 4 new functions: precomposed_utf8_opendir(), precomposed_utf8_readdir(), precomposed_utf8_closedir() and precompose_argv(). The first three are to wrap the opendir(3), readdir(3), and closedir(3) functions.

The argv[] conversion allows using the TAB filename completion done by the shell on the command line. It tolerates other tools which use readdir() to feed decomposed file names into git.

When creating a new git repository with "git init" or "git clone", "core.precomposedunicode" will be set to "false".

The user needs to activate this feature manually. She typically sets core.precomposedunicode to "true" on HFS and VFAT, or file systems mounted via SAMBA.

Helped-by: Junio C Hamano <gitster@pobox.com>
Signed-off-by: Torsten Bögershausen <tboegi@web.de>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2011-02-23 | strbuf: add fixed-length version of add_wrapped_text | Jeff King | 1 | -0/+9
The function strbuf_add_wrapped_text takes a NUL-terminated string. This makes it annoying to wrap strings we have as a pointer and a length. Refactoring strbuf_add_wrapped_text and all of its sub-functions to handle fixed-length strings turned out to be really ugly. So this implementation is lame; it just strdups the text and operates on the NUL-terminated version. This should be fine as the strings we are wrapping are generally pretty short. If it becomes a problem, we can optimize later. Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2010-03-02 | Merge branch 'rs/optim-text-wrap' | Junio C Hamano | 1 | -33/+28
* rs/optim-text-wrap:
  utf8.c: speculatively assume utf-8 in strbuf_add_wrapped_text()
  utf8.c: remove strbuf_write()
  utf8.c: remove print_spaces()
  utf8.c: remove print_wrapped_text()
2010-02-20 | utf8.c: speculatively assume utf-8 in strbuf_add_wrapped_text() | René Scharfe | 1 | -6/+17
is_utf8() works by calling utf8_width() for each character at the supplied location. In strbuf_add_wrapped_text(), we do that anyway while wrapping the lines. So instead of checking the encoding beforehand, optimistically assume that it's utf-8 and wrap along until an invalid character is hit, and when that happens start over. This pays off if the text consists only of valid utf-8 characters.

The following command was run against the Linux kernel repo with git 1.7.0:

    $ time git log --format='%b' v2.6.32 >/dev/null

    real    0m2.679s
    user    0m2.580s
    sys     0m0.100s

    $ time git log --format='%w(60,4,8)%b' >/dev/null

    real    0m4.342s
    user    0m4.230s
    sys     0m0.110s

And with this patch series:

    $ time git log --format='%w(60,4,8)%b' >/dev/null

    real    0m3.741s
    user    0m3.630s
    sys     0m0.110s

So the cost of wrapping is reduced to 70% in this case.

Signed-off-by: Rene Scharfe <rene.scharfe@lsrfire.ath.cx>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
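The "assume UTF-8, restart on the first invalid character" pattern can be illustrated on a simpler task than wrapping, namely counting characters. This is a sketch under assumptions: utf8_seq_len() and count_chars() are invented names, the validity check looks only at lead and continuation bytes (overlong forms and surrogates are not rejected), and the "start over" step is trivial here because the non-UTF-8 answer is just the byte count.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Length of the UTF-8 sequence at s, or 0 if the bytes do not form one.
 * Simplified check: lead byte class and continuation bytes only.
 */
static size_t utf8_seq_len(const unsigned char *s, size_t remaining)
{
    size_t len, i;

    if (s[0] < 0x80)
        return 1;
    else if ((s[0] & 0xe0) == 0xc0)
        len = 2;
    else if ((s[0] & 0xf0) == 0xe0)
        len = 3;
    else if ((s[0] & 0xf8) == 0xf0)
        len = 4;
    else
        return 0;
    if (len > remaining)
        return 0;
    for (i = 1; i < len; i++)
        if ((s[i] & 0xc0) != 0x80)
            return 0;
    return len;
}

/*
 * Count characters in a single pass, optimistically assuming UTF-8.
 * No separate validation pass runs first; if an invalid sequence shows
 * up, abandon the work done so far and redo it as single-byte text.
 */
static size_t count_chars(const char *text, size_t len)
{
    const unsigned char *s = (const unsigned char *)text;
    size_t pos = 0, chars = 0;

    assert(text != NULL);
    while (pos < len) {
        size_t n = utf8_seq_len(s + pos, len - pos);

        if (!n)
            return len; /* start over: one character per byte */
        pos += n;
        chars++;
    }
    return chars;
}
```

Valid UTF-8 input is scanned exactly once, which is where the saving comes from; only input that turns out to be invalid pays for the discarded first attempt.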
2010-02-20 | utf8.c: remove strbuf_write() | René Scharfe | 1 | -13/+5
The patch before the previous one made sure that all callers of strbuf_add_wrapped_text() supply a strbuf. Replace all calls of strbuf_write() with regular strbuf functions and remove it. Signed-off-by: Rene Scharfe <rene.scharfe@lsrfire.ath.cx> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2010-02-20 | utf8.c: remove print_spaces() | René Scharfe | 1 | -9/+6
The previous patch made sure that strbuf_add_wrapped_text() (and thus strbuf_add_indented_text(), too) always get a strbuf. Make use of this fact by adding strbuf_addchars(), a small helper that adds a char the specified number of times to a strbuf, and use it to replace print_spaces(). Signed-off-by: Rene Scharfe <rene.scharfe@lsrfire.ath.cx> Signed-off-by: Junio C Hamano <gitster@pobox.com>
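A helper of the kind described, one that appends a character a given number of times, might look like this. Since strbuf itself is not shown on this page, the sketch uses a stand-in struct buf and a hypothetical buf_addchars(); it is not the strbuf API.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Minimal growable buffer, a stand-in for a real string buffer type. */
struct buf {
    char *data;
    size_t len, alloc;
};

/* Append the character c to the buffer n times, keeping it NUL-terminated. */
static void buf_addchars(struct buf *b, int c, size_t n)
{
    assert(b != NULL);
    if (b->len + n + 1 > b->alloc) {
        size_t alloc = (b->len + n + 1) * 2;
        char *p = realloc(b->data, alloc);

        if (!p)
            abort(); /* out of memory: give up in this sketch */
        b->data = p;
        b->alloc = alloc;
    }
    memset(b->data + b->len, c, n);
    b->len += n;
    b->data[b->len] = '\0';
}
```

Indentation then becomes a single call such as buf_addchars(&b, ' ', indent) rather than a loop that prints spaces one at a time.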
2010-02-20 | utf8.c: remove print_wrapped_text() | René Scharfe | 1 | -5/+0
strbuf_add_wrapped_text() is called only from print_wrapped_text() without a strbuf (in which case it writes its results to stdout). At its only callsite, supply a strbuf, call strbuf_add_wrapped_text() directly and remove the wrapper function. Signed-off-by: Rene Scharfe <rene.scharfe@lsrfire.ath.cx> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2010-01-12 | utf8.c: mark file-local function static | Junio C Hamano | 1 | -1/+1
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2009-11-23 | strbuf_add_wrapped_text(): skip over colour codes | René Scharfe | 1 | -1/+21
Ignore display mode escape sequences (colour codes) for the purpose of text wrapping because they don't have a visible width. Signed-off-by: Rene Scharfe <rene.scharfe@lsrfire.ath.cx> Signed-off-by: Junio C Hamano <gitster@pobox.com>
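Skipping colour codes while measuring width can be sketched as below. The sketch assumes the sequences of interest have the form ESC [ digits-and-semicolons m and that the remaining text is ASCII; display_mode_esc_len() and visible_width() are illustrative names, not necessarily those used in utf8.c.

```c
#include <assert.h>
#include <stddef.h>

/*
 * If s starts with a display mode escape sequence of the form
 * ESC '[' [0-9;]* 'm', return its length in bytes; otherwise return 0.
 */
static size_t display_mode_esc_len(const char *s)
{
    const char *p = s;

    if (*p++ != '\033')
        return 0;
    if (*p++ != '[')
        return 0;
    while ((*p >= '0' && *p <= '9') || *p == ';')
        p++;
    if (*p++ != 'm')
        return 0;
    return (size_t)(p - s);
}

/* Visible width of an ASCII string, not counting colour codes. */
static size_t visible_width(const char *s)
{
    size_t width = 0;

    assert(s != NULL);
    while (*s) {
        size_t skip = display_mode_esc_len(s);

        if (skip) {
            s += skip; /* zero columns on screen */
            continue;
        }
        s++;
        width++;
    }
    return width;
}
```

A wrapping loop built this way copies the escape bytes to the output but leaves the current column unchanged, so coloured and plain text wrap at the same place.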
2009-11-22 | strbuf_add_wrapped_text(): factor out strbuf_add_indented_text() | René Scharfe | 1 | -9/+17
Add a new helper function, strbuf_add_indented_text(), to indent text without a width limit, and call it from strbuf_add_wrapped_text(). It respects both indent (applied to the first line) and indent2 (applied to the rest of the lines); indent2 was ignored by the indent-only path of strbuf_add_wrapped_text() before the patch. Two simple test cases are added, one exercising strbuf_add_wrapped_text() and the other strbuf_add_indented_text(). Signed-off-by: Rene Scharfe <rene.scharfe@lsrfire.ath.cx> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2009-10-22 | Teach --wrap to only indent without wrapping | Junio C Hamano | 1 | -0/+13
When a zero or negative width is given to "shortlog -w<width>,<in1>,<in2>" and --format=%[wrap(w,in1,in2)...%], just indent the text by in1 without wrapping. Signed-off-by: Junio C Hamano <gitster@pobox.com>
2009-10-19 | Add strbuf_add_wrapped_text() to utf8.[ch] | Johannes Schindelin | 1 | -9/+24
The newly added function can rewrap text according to a given first-line indent, other-indent and text width. Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
2009-10-19 | print_wrapped_text(): allow hard newlines | Johannes Schindelin | 1 | -2/+16
print_wrapped_text() will insert its own newlines. Up until now, if the text passed to it contained newlines, they would not be handled properly (the wrapping got confused after that). The strategy is to replace a single new-line with a space, but keep double new-lines so that already-wrapped text with empty lines between paragraphs will be handled properly. However, single new-line characters are only handled this way if the character after it is an alphanumeric character, as per Linus' suggestion. Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
2009-06-06 | On Solaris choose the OLD_ICONV iconv() declaration based on the UNIX spec | Brandon Casey | 1 | -1/+1
OLD_ICONV is only necessary on Solaris until UNIX03. This is indicated by the private macro _XPG6 which is set in /usr/include/sys/feature_tests.h. Signed-off-by: Brandon Casey <drafnel@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2009-02-04 | utf8: add utf8_strwidth() | Geoffrey Thomas | 1 | -0/+19
I'm about to use this pattern more than once, so make it a common function. Signed-off-by: Geoffrey Thomas <geofft@mit.edu> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2008-01-06 | utf8_width(): allow non NUL-terminated input | Junio C Hamano | 1 | -32/+52
The original interface assumed that the input string is always terminated with a NUL, but that wasn't too useful. Signed-off-by: Junio C Hamano <gitster@pobox.com>
2008-01-06 | utf8: pick_one_utf8_char() | Junio C Hamano | 1 | -6/+21
The utf8_width() function was doing two different things: picking a valid character from a UTF-8 stream, and computing the display width of that character. This splits the former into a separate function, pick_one_utf8_char().

Signed-off-by: Junio C Hamano <gitster@pobox.com>
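Separating "pick one character" from "compute its width" could look like the sketch below, which also works on input that is not NUL-terminated because it takes an explicit byte count. The name pick_char(), its signature and its error convention (setting *start to NULL) are assumptions made for this example, and the decoder does not reject overlong forms or surrogates.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Decode one character from *start, with *remaining bytes available.
 * On success, advance *start, reduce *remaining and return the code
 * point. On malformed or truncated input, set *start to NULL and
 * return 0.
 */
static unsigned int pick_char(const char **start, size_t *remaining)
{
    const unsigned char *s = (const unsigned char *)*start;
    unsigned int ch;
    size_t len, i;

    assert(*remaining > 0);
    if (s[0] < 0x80) {
        ch = s[0];
        len = 1;
    } else if ((s[0] & 0xe0) == 0xc0) {
        ch = s[0] & 0x1f;
        len = 2;
    } else if ((s[0] & 0xf0) == 0xe0) {
        ch = s[0] & 0x0f;
        len = 3;
    } else if ((s[0] & 0xf8) == 0xf0) {
        ch = s[0] & 0x07;
        len = 4;
    } else {
        goto invalid;
    }
    if (len > *remaining)
        goto invalid;
    for (i = 1; i < len; i++) {
        if ((s[i] & 0xc0) != 0x80)
            goto invalid;
        ch = (ch << 6) | (s[i] & 0x3f);
    }
    *start += len;
    *remaining -= len;
    return ch;
invalid:
    *start = NULL;
    return 0;
}
```

A width function can then be layered on top: call the picker, and if *start is still non-NULL, look up the column width of the returned code point.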
2007-11-15 | Remove unreachable statements | Guido Ostkamp | 1 | -1/+0
Solaris Workshop Compiler found a few unreachable statements. Signed-off-by: Guido Ostkamp <git@ostkamp.fastmail.fm> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2007-11-08 | Style: place opening brace of a function definition at column 1 | Junio C Hamano | 1 | -1/+2
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2007-05-07 | wcwidth redeclaration | Amos Waterland | 1 | -2/+2
Build fails for git 1.5.1.3 on AIX, with the message:

    utf8.c:66: error: conflicting types for 'wcwidth'
    /.../lib/gcc/powerpc-ibm-aix5.3.0.0/4.0.3/include/string.h:266: error: previous declaration of 'wcwidth' was here

Fix this by renaming our static variant to our own name.

Signed-off-by: Amos Waterland <apw@us.ibm.com>
Signed-off-by: Junio C Hamano <junkio@cox.net>
2007-03-03 | Merge branch 'maint' | Junio C Hamano | 1 | -6/+14
* maint:
  Unset NO_C99_FORMAT on Cygwin.
  Fix a "pointer type missmatch" warning.
  Fix some "comparison is always true/false" warnings.
  Fix an "implicit function definition" warning.
  Fix a "label defined but unreferenced" warning.
  Document the config variable format.suffix
  git-merge: fail correctly when we cannot fast forward.
  builtin-archive: use RUN_SETUP
  Fix git-gc usage note
2007-03-03 | Fix a "pointer type missmatch" warning. | Ramsay Jones | 1 | -2/+8
In particular, the second parameter in the call to iconv() will cause this warning if your library declares iconv() with the second (input buffer pointer) parameter of type const char **. This is the old prototype, which is none-the-less used by the current version of newlib on Cygwin. (It appears in old versions of glibc too). Signed-off-by: Ramsay Jones <ramsay@ramsay1.demon.co.uk> Signed-off-by: Junio C Hamano <junkio@cox.net>
2007-03-03 | Fix some "comparison is always true/false" warnings. | Ramsay Jones | 1 | -4/+6
On Cygwin the wchar_t type is an unsigned short (16-bit) int. This results in the above warnings from the return statement in the wcwidth() function (in particular, the expressions involving constants with values larger than 0xffff). Simply replace the use of wchar_t with an unsigned int, typedef-ed as ucs_char_t. Signed-off-by: Ramsay Jones <ramsay@ramsay1.demon.co.uk> Signed-off-by: Junio C Hamano <junkio@cox.net>
2007-03-02 | print_wrapped_text: fix output for negative indent | Johannes Schindelin | 1 | -1/+1
When providing a negative indent, it means that -indent columns were already printed. Fix a bug where the function ate the first character if already the first word did not fit into the first line. Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de> Signed-off-by: Junio C Hamano <junkio@cox.net>
2007-02-27 | Actually make print_wrapped_text() useful | Johannes Schindelin | 1 | -5/+12
Now, it returns the current column, does not add a newline, and you can pass a negative indent, to indicate that the indent was already printed. With this, you can actually continue in the middle of a paragraph, not having to print everything into a buffer first. Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de> Signed-off-by: Junio C Hamano <junkio@cox.net>
2006-12-30 | commit-tree: cope with different ways "utf-8" can be spelled. | Junio C Hamano | 1 | -0/+9
People can spell config.commitencoding differently from what we internally have ("utf-8") to mean UTF-8. Try to accept them and treat them equally. Signed-off-by: Junio C Hamano <junkio@cox.net>
2006-12-26 | Move encoding conversion routine out of mailinfo to utf8.c | Junio C Hamano | 1 | -0/+54
This moves the body of convert_to_utf8() routine used in mailinfo to the utf8.c i18n library. Signed-off-by: Junio C Hamano <junkio@cox.net>
2006-12-24 | commit-tree: encourage UTF-8 commit messages. | Johannes Schindelin | 1 | -0/+278
Introduce is_utf8() to check if a text looks like it is encoded in UTF-8, utf8_width() to count display width, and implement print_wrapped_text() using them. git-commit-tree warns if the commit message does not minimally conform to the UTF-8 encoding when i18n.commitencoding is either unset, or set to "utf-8".

Signed-off-by: Junio C Hamano <junkio@cox.net>