Age | Commit message (Collapse) | Author | Files | Lines |
|
Our anonymize_mem() function is careful to take a ptr/len pair to allow
storing binary tokens like object ids, as well as partial strings (e.g.,
just "foo" of "foo/bar"). But it duplicates the hash key using
xstrdup()! That means that:
- for a partial string, we'd store all bytes up to the NUL, even
though we'd never look at anything past "len". This didn't produce
wrong behavior, but was wasteful.
- for a binary oid that doesn't contain a zero byte, we'd copy garbage
bytes off the end of the array (though as long as nothing complained
about reading uninitialized bytes, further reads would be limited by
"len", and we'd produce the correct results)
- for a binary oid that does contain a zero byte, we'd copy _fewer_
bytes than intended into the hashmap struct. When we later try to
look up a value, we'd access uninitialized memory and potentially
falsely claim that a particular oid is not present.
The most common reason to store an oid is an anonymized gitlink, but our
test case doesn't have any gitlinks at all. So let's add one whose oid
contains a NUL and is present at two different paths. ASan catches the
memory error, but even without it we can detect the bug because the oid
is not anonymized the same way for both paths.
And of course the fix is to copy the correct number of bytes. We don't
technically need the appended NUL from xmemdupz(), but it doesn't hurt
as an extra protection against anybody treating it like a string (plus a
future patch will push us more in that direction).
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
|
|
Our tests of the anonymized repo just hard-code the expected set of
objects in the root and subdirectory trees. This makes them brittle to
the test setup changing (e.g., adding new paths that need tested).
Let's look at the original repo to compute our expected set of objects.
Note that this isn't completely perfect (e.g., we still rely on there
being only one tree in the root), but it does simplify later patches.
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
|
|
Sometimes users want to report a bug they experience on
their repository, but they are not at liberty to share the
contents of the repository. It would be useful if they could
produce a repository that has a similar shape to its history
and tree, but without leaking any information. This
"anonymized" repository could then be shared with developers
(assuming it still replicates the original problem).
This patch implements an "--anonymize" option to
fast-export, which generates a stream that can recreate such
a repository. Producing a single stream makes it easy for
the caller to verify that they are not leaking any useful
information. You can get an overview of what will be shared
by running a command like:
git fast-export --anonymize --all |
perl -pe 's/\d+/X/g' |
sort -u |
less
which will show every unique line we generate, modulo any
numbers (each anonymized token is assigned a number, like
"User 0", and we replace it consistently in the output).
In addition to anonymizing, this produces test cases that
are relatively small (compared to the original repository)
and fast to generate (compared to using filter-branch, or
modifying the output of fast-export yourself). Here are
numbers for git.git:
$ time git fast-export --anonymize --all \
--tag-of-filtered-object=drop >output
real 0m2.883s
user 0m2.828s
sys 0m0.052s
$ gzip output
$ ls -lh output.gz | awk '{print $5}'
2.9M
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
|