summaryrefslogtreecommitdiff
path: root/pack-revindex.c
AgeCommit message (Collapse)AuthorFilesLines
2015-12-21pack-revindex: store entries directly in packed_gitLibravatar Jeff King1-25/+22
A pack_revindex struct has two elements: the revindex entries themselves, and a pointer to the packed_git. We need both to do lookups, because only the latter knows things like the number of objects in the pack. Now that packed_git contains the pack_revindex struct it's just as easy to pass around the packed_git itself, and we do not need the extra back-pointer. We can instead just store the entries directly in the pack. All functions which took a pack_revindex now just take a packed_git. We still lazy-load in find_pack_revindex, so most callers are unaffected. The exception is the bitmap code, which computes the revindex and caches the pointer when we load the bitmaps. We can continue to load, drop the extra cache pointer, and just access bitmap_git.pack.revindex directly. Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2015-12-21pack-revindex: drop hash tableLibravatar Jeff King1-54/+6
The main entry point to the pack-revindex code is find_pack_revindex(). This calls revindex_for_pack(), which lazily computes and caches the revindex for the pack. We store the cache in a very simple hash table. It's created by init_pack_revindex(), which inserts an entry for every packfile we know about, and we never grow or shrink the hash. If we ever need the revindex for a pack that isn't in the hash, we die() with an internal error. This can lead to a race, because we may load more packs after having called init_pack_revindex(). For example, imagine we have one process which needs to look at the revindex for a variety of objects (e.g., cat-file's "%(objectsize:disk)" format). Simultaneously, git-gc is running, which is doing a `git repack -ad`. We might hit a sequence like: 1. We need the revidx for some packed object. We call find_pack_revindex() and end up in init_pack_revindex() to create the hash table for all packs we know about. 2. We look up another object and can't find it, because the repack has removed the pack it's in. We re-scan the pack directory and find a new pack containing the object. It gets added to our packed_git list. 3. We call find_pack_revindex() for the new object, which hits revindex_for_pack() for our new pack. It can't find the packed_git in the revindex hash, and dies. You could also replace the `repack` above with a push or fetch to create a new pack, though these are less likely (you would have to somehow learn about the new objects to look them up). Prior to 1a6d8b9 (do not discard revindex when re-preparing packfiles, 2014-01-15), this was safe, as we threw away the revindex whenever we re-scanned the pack directory (and thus re-created the revindex hash on the fly). However, we don't want to simply revert that commit, as it was solving a different race. So we have a few options: - We can fix the race in 1a6d8b9 differently, by having the bitmap code look in the revindex hash instead of caching the pointer. But this would introduce a lot of extra hash lookups for common bitmap operations. - We could teach the revindex to dynamically add new packs to the hash table. This would perform the same, but would mean adding extra code to the revindex hash (which currently cannot be resized at all). - We can get rid of the hash table entirely. There is exactly one revindex per pack, so we can just store it in the packed_git struct. Since it's initialized lazily, it does not add to the startup cost. This is the best of both worlds: less code and fewer hash table lookups. The original code likely avoided this in the name of encapsulation. But the packed_git and reverse_index code are fairly intimate already, so it's not much of a loss. This patch implements the final option. It's a minimal conversion that retains the pack_revindex struct. No callers need to change, and we can do further cleanup in a follow-on patch. Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2014-01-16do not discard revindex when re-preparing packfilesLibravatar Jeff King1-11/+0
When an object lookup fails, we re-read the objects/pack directory to pick up any new packfiles that may have been created since our last read. We also discard any pack revindex structs we've allocated. The discarding is a problem for the pack-bitmap code, which keeps a pointer to the revindex for the bitmapped pack. After the discard, the pointer is invalid, and we may read free()d memory. Other revindex users do not keep a bare pointer to the revindex; instead, they always access it through revindex_for_pack(), which lazily builds the revindex. So one solution is to teach the pack-bitmap code a similar trick. It would be slightly less efficient, but probably not all that noticeable. However, it turns out this discarding is not actually necessary. When we call reprepare_packed_git, we do not throw away our old pack list. We keep the existing entries, and only add in new ones. So there is no safety problem; we will still have the pack struct that matches each revindex. The packfile itself may go away, of course, but we are already prepared to handle that, and it may happen outside of reprepare_packed_git anyway. Throwing away the revindex may save some RAM if the pack never gets reused (about 12 bytes per object). But it also wastes some CPU time (to regenerate the index) if the pack does get reused. It's hard to say which is more valuable, but in either case, it happens very rarely (only when we race with a simultaneous repack). Just leaving the revindex in place is simple and safe both for current and future code. Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2013-10-24revindex: export new APIsLibravatar Vicent Marti1-13/+25
Allow users to efficiently lookup consecutive entries that are expected to be found on the same revindex by exporting `find_revindex_position`: this function takes a pointer to revindex itself, instead of looking up the proper revindex for a given packfile on each call. Signed-off-by: Vicent Marti <tanoku@gmail.com> Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2013-07-12pack-revindex: radix-sort the revindexLibravatar Jeff King1-5/+95
The pack revindex stores the offsets of the objects in the pack in sorted order, allowing us to easily find the on-disk size of each object. To compute it, we populate an array with the offsets from the sha1-sorted idx file, and then use qsort to order it by offsets. That does O(n log n) offset comparisons, and profiling shows that we spend most of our time in cmp_offset. However, since we are sorting on a simple off_t, we can use numeric sorts that perform better. A radix sort can run in O(k*n), where k is the number of "digits" in our number. For a 64-bit off_t, using 16-bit "digits" gives us k=4. On the linux.git repo, with about 3M objects to sort, this yields a 400% speedup. Here are the best-of-five numbers for running echo HEAD | git cat-file --batch-check="%(objectsize:disk) on a fully packed repository, which is dominated by time spent building the pack revindex: before after real 0m0.834s 0m0.204s user 0m0.788s 0m0.164s sys 0m0.040s 0m0.036s This matches our algorithmic expectations. log(3M) is ~21.5, so a traditional sort is ~21.5n. Our radix sort runs in k*n, where k is the number of radix digits. In the worst case, this is k=4 for a 64-bit off_t, but we can quit early when the largest value to be sorted is smaller. For any repository under 4G, k=2. Our algorithm makes two passes over the list per radix digit, so we end up with 4n. That should yield ~5.3x speedup. We see 4x here; the difference is probably due to the extra bucket book-keeping the radix sort has to do. On a smaller repo, the difference is less impressive, as log(n) is smaller. For git.git, with 173K objects (but still k=2), we see a 2.7x improvement: before after real 0m0.046s 0m0.017s user 0m0.036s 0m0.012s sys 0m0.008s 0m0.000s On even tinier repos (e.g., a few hundred objects), the speedup goes away entirely, as the small advantage of the radix sort gets erased by the book-keeping costs (and at those sizes, the cost to generate the the rev-index gets lost in the noise anyway). Signed-off-by: Jeff King <peff@peff.net> Reviewed-by: Brandon Casey <drafnel@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2013-07-12pack-revindex: use unsigned to store number of objectsLibravatar Jeff King1-4/+4
A packfile may have up to 2^32-1 objects in it, so the "right" data type to use is uint32_t. We currently use a signed int, which means that we may behave incorrectly for packfiles with more than 2^31-1 objects on 32-bit systems. Nobody has noticed because having 2^31 objects is pretty insane. The linux.git repo has on the order of 2^22 objects, which is hundreds of times smaller than necessary to trigger the bug. Let's bump this up to an "unsigned". On 32-bit systems, this gives us the correct data-type, and on 64-bit systems, it is probably more efficient to use the native "unsigned" than a true uint32_t. While we're at it, we can fix the binary search not to overflow in such a case if our unsigned is 32 bits. Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2009-07-22janitor: useless checks before freeLibravatar Pierre Habouzit1-2/+1
Signed-off-by: Pierre Habouzit <madcoder@debian.org> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2008-11-02make find_pack_revindex() aware of the nasty worldLibravatar Nicolas Pitre1-1/+2
It currently calls die() whenever given offset is not found thinking that such thing should never happen. But this offset may come from a corrupted pack whych _could_ happen and not be found. Callers should deal with this possibility gracefully instead. Signed-off-by: Nicolas Pitre <nico@cam.org> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2008-08-22discard revindex data when pack list changesLibravatar Nicolas Pitre1-0/+12
This is needed to fix verify-pack -v with multiple pack arguments. Also, in theory, revindex data (if any) must be discarded whenever reprepare_packed_git() is called. In practice this is hard to trigger though. Signed-off-by: Nicolas Pitre <nico@cam.org> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2008-06-23call init_pack_revindex() lazilyLibravatar Nicolas Pitre1-2/+4
This makes life much easier for next patch, as well as being more efficient when the revindex is actually not used. Signed-off-by: Nicolas Pitre <nico@cam.org> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2008-03-01factorize revindex code out of builtin-pack-objects.cLibravatar Nicolas Pitre1-0/+142
No functional change. This is needed to fix verify-pack in a later patch. Signed-off-by: Nicolas Pitre <nico@cam.org> Signed-off-by: Junio C Hamano <gitster@pobox.com>