Age | Commit message (Collapse) | Author | Files | Lines |
|
If the endianness is not defined in the encoding name, then let's
be strict and require a BOM to avoid any encoding confusion. The
is_missing_required_utf_bom() function returns true if a required BOM
is missing.
The Unicode standard instructs to assume big-endian if there in no BOM
for UTF-16/32 [1][2]. However, the W3C/WHATWG encoding standard used
in HTML5 recommends to assume little-endian to "deal with deployed
content" [3]. Strictly requiring a BOM seems to be the safest option
for content in Git.
This function is used in a subsequent commit.
[1] http://unicode.org/faq/utf_bom.html#gen6
[2] http://www.unicode.org/versions/Unicode10.0.0/ch03.pdf
Section 3.10, D98, page 132
[3] https://encoding.spec.whatwg.org/#utf-16le
Signed-off-by: Lars Schneider <larsxschneider@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
|
|
Whenever a data stream is declared to be UTF-16BE, UTF-16LE, UTF-32BE
or UTF-32LE a BOM must not be used [1]. The function returns true if
this is the case.
This function is used in a subsequent commit.
[1] http://unicode.org/faq/utf_bom.html#bom10
Signed-off-by: Lars Schneider <larsxschneider@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
|
|
Many instances of duplicate words (e.g. "the the path") and
a few typoes are fixed, originally in multiple patches.
wildmatch: fix duplicate words of "the"
t: fix duplicate words of "output"
transport-helper: fix duplicate words of "read"
Git.pm: fix duplicate words of "return"
path: fix duplicate words of "look"
pack-protocol.txt: fix duplicate words of "the"
precompose-utf8: fix typo of "sequences"
split-index: fix typo
worktree.c: fix typo
remote-ext: fix typo
utf8: fix duplicate words of "the"
git-cvsserver: fix duplicate words
Signed-off-by: Li Peng <lip@dtdream.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
|
|
Add strbuf_utf8_align() which will align a given string into a strbuf
as per given align_type and width. If the width is greater than the
string length then no alignment is performed.
Helped-by: Eric Sunshine <sunshine@sunshineco.com>
Mentored-by: Christian Couder <christian.couder@gmail.com>
Mentored-by: Matthieu Moy <matthieu.moy@grenoble-inp.fr>
Signed-off-by: Karthik Nayak <karthik.188@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
|
|
A compilation workaround.
* es/utf8-stupid-compiler-workaround:
utf8: NO_ICONV: silence uninitialized variable warning
|
|
The last argument of reencode_string_len() is an 'int *' which is
assigned the length of the converted string. When NO_ICONV is defined,
however, reencode_string_len() is stubbed out by the macro:
#define reencode_string_len(a,b,c,d,e) NULL
which never assigns a value to the final argument. When called like
this:
int n;
char *s = reencode_string_len(..., &n);
if (s)
do_something(s, n);
some compilers complain that 'n' is used uninitialized within the
conditional.
Signed-off-by: Eric Sunshine <sunshine@sunshineco.com>
Reviewed-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
|
|
With the recent change to ignore the UTF8 BOM at the beginning of
.gitignore files, we now have two codepaths that do such a skipping
(the other one is for reading the configuration files).
Introduce utf8_bom[] constant string and skip_utf8_bom() helper
and teach .gitignore code how to use it.
Signed-off-by: Junio C Hamano <gitster@pobox.com>
|
|
We do not allow paths with a ".git" component to be added to
the index, as that would mean repository contents could
overwrite our repository files. However, asking "is this
path the same as .git" is not as simple as strcmp() on some
filesystems.
HFS+'s case-folding does more than just fold uppercase into
lowercase (which we already handle with strcasecmp). It may
also skip past certain "ignored" Unicode code points, so
that (for example) ".gi\u200ct" is mapped ot ".git".
The full list of folds can be found in the tables at:
https://www.opensource.apple.com/source/xnu/xnu-1504.15.3/bsd/hfs/hfscommon/Unicode/UCStringCompareData.h
Implementing a full "is this path the same as that path"
comparison would require us importing the whole set of
tables. However, what we want to do is much simpler: we
only care about checking ".git". We know that 'G' is the
only thing that folds to 'g', and so on, so we really only
need to deal with the set of ignored code points, which is
much smaller.
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
|
|
For most of our functions that take printf-like formats, we
use gcc's __attribute__((format)) to get compiler warnings
when the functions are misused. Let's give a few more
functions the same protection.
In most cases, the annotations do not uncover any actual
bugs; the only code change needed is that we passed a size_t
to transfer_debug, which expected an int. Since we expect
the passed-in value to be a relatively small buffer size
(and cast a similar value to int directly below), we can
just cast away the problem.
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
|
|
This is pretty useful in `%<(100)%s%Cred%>(20)% an' where %s does not
use up all 100 columns and %an needs more than 20 columns. By
replacing %>(20) with %>>(20), %an can steal spaces from %s.
%>> understands escape sequences, so %Cred does not stop it from
stealing spaces in %<(100).
Signed-off-by: Nguyễn Thái Ngọc Duy <pclouds@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
|
|
%>(N,trunc) truncates the right part after N columns and replace the
last two letters with "..". ltrunc does the same on the left. mtrunc
cuts the middle out.
Signed-off-by: Nguyễn Thái Ngọc Duy <pclouds@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
|
|
Signed-off-by: Nguyễn Thái Ngọc Duy <pclouds@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
|
|
Signed-off-by: Nguyễn Thái Ngọc Duy <pclouds@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
|
|
When "format-patch" quoted a non-ascii strings on the header files,
it incorrectly applied rfc2047 and chopped a single character in
the middle of it.
* ks/rfc2047-one-char-at-a-time:
format-patch: RFC 2047 says multi-octet character may not be split
|
|
Even though an earlier attempt (bafc478..41dd00bad) cleaned
up RFC 2047 encoding, pretty.c::add_rfc2047() still decides
where to split the output line by going through the input
one byte at a time, and potentially splits a character in
the middle. A subject line may end up showing like this:
".... fö?? bar". (instead of ".... föö bar".)
if split incorrectly.
RFC 2047, section 5 (3) explicitly forbids such beaviour
Each 'encoded-word' MUST represent an integral number of
characters. A multi-octet character may not be split across
adjacent 'encoded- word's.
that means that e.g. for
Subject: .... föö bar
encoding
Subject: =?UTF-8?q?....=20f=C3=B6=C3=B6?=
=?UTF-8?q?=20bar?=
is correct, and
Subject: =?UTF-8?q?....=20f=C3=B6=C3?= <-- NOTE ö is broken here
=?UTF-8?q?=B6=20bar?=
is not, because "ö" character UTF-8 encoding C3 B6 is split here across
adjacent encoded words.
To fix the problem, make the loop grab one _character_ at a time and
determine its output length to see where to break the output line. Note
that this version only knows about UTF-8, but the logic to grab one
character is abstracted out in mbs_chrlen() function to make it possible
to extend it to other encodings with the help of iconv in the future.
Signed-off-by: Kirill Smelkov <kirr@mns.spb.ru>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
|
|
Use a new helper that prints a message and counts its display width
to align the help messages parse-options produces.
* jx/utf8-printf-width:
Add utf8_fprintf helper that returns correct number of columns
|
|
Since command usages can be translated, they may include utf-8
encoded strings, and the output in console may not align well any
more. This is because strlen() is different from strwidth() on utf-8
strings.
A wrapper utf8_fprintf() can help to return the correct number of
columns required.
Signed-off-by: Jiang Xin <worldhello.net@gmail.com>
Signed-off-by: Nguyễn Thái Ngọc Duy <pclouds@gmail.com>
Reviewed-by: Torsten Bögershausen <tboegi@web.de>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
|
|
When a line to be wrapped has a solid run of non space characters
whose length exactly is the wrap width, "git shortlog -w" failed to
add a newline after such a line.
* sp/shortlog-missing-lf:
strbuf_add_wrapped*(): Remove unused return value
shortlog: fix wrapping lines of wraplen
|
|
Since shortlog isn't using the return value anymore (see previous
commit), the functions can be changed to void.
Signed-off-by: Steffen Prohaska <prohaska@zib.de>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
|
|
Callers of reencode_string() that re-encodes a string from one
encoding to another all used ad-hoc way to bypass the case where the
input and the output encodings are the same. Some did strcmp(),
some did strcasecmp(), yet some others when converting to UTF-8 used
is_encoding_utf8().
Introduce same_encoding() helper function to make these callers use
the same logic. Notably, is_encoding_utf8() has a work-around for
common misconfiguration to use "utf8" to name UTF-8 encoding, which
does not match "UTF-8" hence strcasecmp() would not consider the
same. Make use of it in this helper function.
Signed-off-by: Junio C Hamano <gitster@pobox.com>
|
|
Mac OS X mangles file names containing unicode on file systems HFS+,
VFAT or SAMBA. When a file using unicode code points outside ASCII
is created on a HFS+ drive, the file name is converted into
decomposed unicode and written to disk. No conversion is done if
the file name is already decomposed unicode.
Calling open("\xc3\x84", ...) with a precomposed "Ä" yields the same
result as open("\x41\xcc\x88",...) with a decomposed "Ä".
As a consequence, readdir() returns the file names in decomposed
unicode, even if the user expects precomposed unicode. Unlike on
HFS+, Mac OS X stores files on a VFAT drive (e.g. an USB drive) in
precomposed unicode, but readdir() still returns file names in
decomposed unicode. When a git repository is stored on a network
share using SAMBA, file names are send over the wire and written to
disk on the remote system in precomposed unicode, but Mac OS X
readdir() returns decomposed unicode to be compatible with its
behaviour on HFS+ and VFAT.
The unicode decomposition causes many problems:
- The names "git add" and other commands get from the end user may
often be precomposed form (the decomposed form is not easily input
from the keyboard), but when the commands read from the filesystem
to see what it is going to update the index with already is on the
filesystem, readdir() will give decomposed form, which is different.
- Similarly "git log", "git mv" and all other commands that need to
compare pathnames found on the command line (often but not always
precomposed form; a command line input resulting from globbing may
be in decomposed) with pathnames found in the tree objects (should
be precomposed form to be compatible with other systems and for
consistency in general).
- The same for names stored in the index, which should be
precomposed, that may need to be compared with the names read from
readdir().
NFS mounted from Linux is fully transparent and does not suffer from
the above.
As Mac OS X treats precomposed and decomposed file names as equal,
we can
- wrap readdir() on Mac OS X to return the precomposed form, and
- normalize decomposed form given from the command line also to the
precomposed form,
to ensure that all pathnames used in Git are always in the
precomposed form. This behaviour can be requested by setting
"core.precomposedunicode" configuration variable to true.
The code in compat/precomposed_utf8.c implements basically 4 new
functions: precomposed_utf8_opendir(), precomposed_utf8_readdir(),
precomposed_utf8_closedir() and precompose_argv(). The first three
are to wrap opendir(3), readdir(3), and closedir(3) functions.
The argv[] conversion allows to use the TAB filename completion done
by the shell on command line. It tolerates other tools which use
readdir() to feed decomposed file names into git.
When creating a new git repository with "git init" or "git clone",
"core.precomposedunicode" will be set "false".
The user needs to activate this feature manually. She typically
sets core.precomposedunicode to "true" on HFS and VFAT, or file
systems mounted via SAMBA.
Helped-by: Junio C Hamano <gitster@pobox.com>
Signed-off-by: Torsten Bögershausen <tboegi@web.de>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
|
|
The function strbuf_add_wrapped_text takes a NUL-terminated
string. This makes it annoying to wrap strings we have as a
pointer and a length.
Refactoring strbuf_add_wrapped_text and all of its
sub-functions to handle fixed-length strings turned out to
be really ugly. So this implementation is lame; it just
strdups the text and operates on the NUL-terminated version.
This should be fine as the strings we are wrapping are
generally pretty short. If it becomes a problem, we can
optimize later.
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
|
|
* rs/optim-text-wrap:
utf8.c: speculatively assume utf-8 in strbuf_add_wrapped_text()
utf8.c: remove strbuf_write()
utf8.c: remove print_spaces()
utf8.c: remove print_wrapped_text()
|
|
strbuf_add_wrapped_text() is called only from print_wrapped_text()
without a strbuf (in which case it writes its results to stdout).
At its only callsite, supply a strbuf, call strbuf_add_wrapped_text()
directly and remove the wrapper function.
Signed-off-by: Rene Scharfe <rene.scharfe@lsrfire.ath.cx>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
|
|
Signed-off-by: Junio C Hamano <gitster@pobox.com>
|
|
The newly added function can rewrap text according to a given first-line
indent, other-indent and text width.
Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
|
|
I'm about to use this pattern more than once, so make it a common function.
Signed-off-by: Geoffrey Thomas <geofft@mit.edu>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
|
|
The original interface assumed that the input string is
always terminated with a NUL, but that wasn't too useful.
Signed-off-by: Junio C Hamano <gitster@pobox.com>
|
|
utf8_width() function was doing two different things. To pick a
valid character from UTF-8 stream, and compute the display width of
that character. This splits the former to a separate function
pick_one_utf8_char().
Signed-off-by: Junio C Hamano <gitster@pobox.com>
|
|
Now, it returns the current column, does not add a newline, and you can
pass a negative indent, to indicate that the indent was already printed.
With this, you can actually continue in the middle of a paragraph, not
having to print everything into a buffer first.
Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
Signed-off-by: Junio C Hamano <junkio@cox.net>
|
|
People can spell config.commitencoding differently from what we
internally have ("utf-8") to mean UTF-8. Try to accept them and
treat them equally.
Signed-off-by: Junio C Hamano <junkio@cox.net>
|
|
This moves the body of convert_to_utf8() routine used in mailinfo
to the utf8.c i18n library.
Signed-off-by: Junio C Hamano <junkio@cox.net>
|
|
Introduce is_utf() to check if a text looks like it is encoded
in UTF-8, utf8_width() to count display width, and implements
print_wrapped_text() using them.
git-commit-tree warns if the commit message does not minimally
conform to the UTF-8 encoding when i18n.commitencoding is either
unset, or set to "utf-8".
Signed-off-by: Junio C Hamano <junkio@cox.net>
|