path: root/builtin/index-pack.c
Age  Commit message  (Author; files changed, lines -/+)
2021-10-31  i18n: fix typos found during l10n for git 2.34.0  (Jiang Xin; 1 file, -1/+1)
Emir and Jean-Noël reported typos in some i18n messages when preparing l10n for git 2.34.0 [1].

 * Fix the inconsistent spelling of the config variable "gpg.ssh.defaultKeyCommand", which was introduced in commit fd9e226776 (ssh signing: retrieve a default key from ssh-agent, 2021-09-10).
 * Add the missing space between "with" and "--python", which was introduced in commit bd0708c7eb (ref-filter: add %(raw) atom, 2021-07-26).
 * Fix an unmatched single quote in 'builtin/index-pack.c', which was introduced in commit 8737dab346 (index-pack: refactor renaming in final(), 2021-09-09).

[1] https://github.com/git-l10n/git-po/pull/567

Reported-by: Emir Sarı <bitigchi@me.com>
Reported-by: Jean-Noël Avila <jn.avila@free.fr>
Signed-off-by: Jiang Xin <worldhello.net@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-10-25  Merge branch 'ab/fsck-unexpected-type'  (Junio C Hamano; 1 file, -1/+1)
"git fsck" has been taught to report mismatch between expected and actual types of an object better.

* ab/fsck-unexpected-type:
  fsck: report invalid object type-path combinations
  fsck: don't hard die on invalid object types
  object-file.c: stop dying in parse_loose_header()
  object-file.c: return ULHR_TOO_LONG on "header too long"
  object-file.c: use "enum" return type for unpack_loose_header()
  object-file.c: simplify unpack_loose_short_header()
  object-file.c: make parse_loose_header_extended() public
  object-file.c: return -1, not "status" from unpack_loose_header()
  object-file.c: don't set "typep" when returning non-zero
  cat-file tests: test for current --allow-unknown-type behavior
  cat-file tests: add corrupt loose object test
  cat-file tests: test for missing/bogus object with -t, -s and -p
  cat-file tests: move bogus_* variable declarations earlier
  fsck tests: test for garbage appended to a loose object
  fsck tests: test current hash/type mismatch behavior
  fsck tests: refactor one test to use a sub-repo
  fsck tests: add test for fsck-ing an unknown type
2021-10-01  fsck: report invalid object type-path combinations  (Ævar Arnfjörð Bjarmason; 1 file, -1/+1)
Improve the error that's emitted in cases where we find a loose object we parse, but which isn't at the location we expect it to be.

Before this change we'd prefix the error with a not-an-OID derived from the path at which the object was found, due to an emergent behavior in how we'd end up with an "OID" in these codepaths.

Now we'll instead say what object we hashed, and what path it was found at. Before this patch series e.g.:

    $ git hash-object --stdin -w -t blob </dev/null
    e69de29bb2d1d6434b8b29ae775ad8c2e48c5391
    $ mv objects/e6/ objects/e7

Would emit ("[...]" used to abbreviate the OIDs):

    git fsck
    error: hash mismatch for ./objects/e7/9d[...] (expected e79d[...])
    error: e79d[...]: object corrupt or missing: ./objects/e7/9d[...]

Now we'll instead emit:

    error: e69d[...]: hash-path mismatch, found at: ./objects/e7/9d[...]

Furthermore, we'll do the right thing when the object type and its location are bad, i.e. this case:

    $ git hash-object --stdin -w -t garbage --literally </dev/null
    8315a83d2acc4c174aed59430f9a9c4ed926440f
    $ mv objects/83 objects/84

As noted in earlier commits we'd simply die early in those cases, until preceding commits fixed the hard die on invalid object type:

    $ git fsck
    fatal: invalid object type

Now we'll instead emit sensible error messages:

    $ git fsck
    error: 8315[...]: hash-path mismatch, found at: ./objects/84/15[...]
    error: 8315[...]: object is of unknown type 'garbage': ./objects/84/15[...]

In both fsck.c and object-file.c we're using null_oid as a sentinel value for checking whether we got far enough to be certain that the issue was indeed this OID mismatch.

We need to add the "object corrupt or missing" special-case to deal with cases where read_loose_object() will return an error before completing check_object_signature(), e.g. if we have an error in unpack_loose_rest() because we find garbage after the valid gzip content:

    $ git hash-object --stdin -w -t blob </dev/null
    e69de29bb2d1d6434b8b29ae775ad8c2e48c5391
    $ chmod 755 objects/e6/9de29bb2d1d6434b8b29ae775ad8c2e48c5391
    $ echo garbage >>objects/e6/9de29bb2d1d6434b8b29ae775ad8c2e48c5391
    $ git fsck
    error: garbage at end of loose object 'e69d[...]'
    error: unable to unpack contents of ./objects/e6/9d[...]
    error: e69d[...]: object corrupt or missing: ./objects/e6/9d[...]

There is currently some weird messaging in the edge case when the two are combined, i.e. because we're not explicitly passing along an error state about this specific scenario from check_stream_oid() via read_loose_object() we'll end up printing the null OID if an object is of an unknown type *and* it can't be unpacked by zlib, e.g.:

    $ git hash-object --stdin -w -t garbage --literally </dev/null
    8315a83d2acc4c174aed59430f9a9c4ed926440f
    $ chmod 755 objects/83/15a83d2acc4c174aed59430f9a9c4ed926440f
    $ echo garbage >>objects/83/15a83d2acc4c174aed59430f9a9c4ed926440f
    $ /usr/bin/git fsck
    fatal: invalid object type
    $ ~/g/git/git fsck
    error: garbage at end of loose object '8315a83d2acc4c174aed59430f9a9c4ed926440f'
    error: unable to unpack contents of ./objects/83/15a83d2acc4c174aed59430f9a9c4ed926440f
    error: 8315a83d2acc4c174aed59430f9a9c4ed926440f: object corrupt or missing: ./objects/83/15a83d2acc4c174aed59430f9a9c4ed926440f
    error: 0000000000000000000000000000000000000000: object is of unknown type 'garbage': ./objects/83/15a83d2acc4c174aed59430f9a9c4ed926440f
    [...]

I think it's OK to leave that for future improvements, which would involve enum-ifying more error state as we've done with "enum unpack_loose_header_result" in preceding commits.
In these increasingly more obscure cases the worst that can happen is that we'll get slightly nonsensical or inapplicable error messages. There are other such potential edge cases, all of which might produce some confusing messaging, but still be handled correctly as far as passing along errors goes. E.g. if check_object_signature() returns with oideq(real_oid, null_oid()) being true, which could happen if it returns -1 due to the read_istream() call having failed.

Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-09-23  Merge branch 'rs/use-xopen-in-index-pack'  (Junio C Hamano; 1 file, -3/+1)
Code clean-up.

* rs/use-xopen-in-index-pack:
  index-pack: use xopen in init_thread
2021-09-20  Merge branch 'ab/unbundle-progress'  (Junio C Hamano; 1 file, -0/+6)
Add progress display to "git bundle unbundle".

* ab/unbundle-progress:
  bundle: show progress on "unbundle"
  index-pack: add --progress-title option
  bundle API: change "flags" to be "extra_index_pack_args"
  bundle API: start writing API documentation
2021-09-20  Merge branch 'tb/pack-finalize-ordering'  (Junio C Hamano; 1 file, -25/+23)
The order in which the various files that make up a single (conceptual) packfile are finalized has been reevaluated and straightened up. This matters in correctness, as an incomplete set of files must not be shown to a running Git.

* tb/pack-finalize-ordering:
  pack-objects: rename .idx files into place after .bitmap files
  pack-write: split up finish_tmp_packfile() function
  builtin/index-pack.c: move `.idx` files into place last
  index-pack: refactor renaming in final()
  builtin/repack.c: move `.idx` files into place last
  pack-write.c: rename `.idx` files after `*.rev`
  pack-write: refactor renaming in finish_tmp_packfile()
  bulk-checkin.c: store checksum directly
  pack.h: line-wrap the definition of finish_tmp_packfile()
2021-09-10  index-pack: use xopen in init_thread  (René Scharfe; 1 file, -3/+1)
Support an arbitrary file descriptor expression in the semantic patch for replacing open+die_errno with xopen, not just an identifier, and apply it. This makes the error message at the single affected place more consistent and reduces code duplication. Signed-off-by: René Scharfe <l.s.r@web.de> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-09-09  builtin/index-pack.c: move `.idx` files into place last  (Taylor Blau; 1 file, -2/+2)
In a similar spirit as preceding patches to `git repack` and `git pack-objects`, fix the identical problem in `git index-pack`. Signed-off-by: Taylor Blau <me@ttaylorr.com> Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com> Signed-off-by: Taylor Blau <me@ttaylorr.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-09-09  index-pack: refactor renaming in final()  (Ævar Arnfjörð Bjarmason; 1 file, -25/+23)
Refactor the renaming in final() into a helper function; this is similar in spirit to a preceding refactoring of finish_tmp_packfile() in pack-write.c.

Before e37d0b8730b (builtin/index-pack.c: write reverse indexes, 2021-01-25) it probably wasn't worth it to have this sort of helper, due to the differing "else if" case for "pack" files vs. "idx" files. But since we've got "rev" as well now, let's do the renaming via a helper. This is both a net decrease in lines and an improvement in readability, since we can easily see at a glance that the logic for writing these three types of files is exactly the same, aside from the obviously differing cases of "*final_name" being NULL, and "make_read_only_if_same" being different.

Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com>
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
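The helper described above might look roughly like the following sketch. The parameter names "*final_name" and "make_read_only_if_same" come from the commit message; the rest (including the helper's own name) is approximated and is not a verbatim quote of builtin/index-pack.c:

    /*
     * Move a temporary pack/idx/rev file into place.  If no final name was
     * given, derive one from the pack hash; if the file is already at its
     * final name, optionally make it read-only.
     */
    static void rename_tmp_packfile(const char **final_name, const char *curr_name,
                                    struct strbuf *name, unsigned char *hash,
                                    const char *ext, int make_read_only_if_same)
    {
            if (*final_name != curr_name) {
                    if (!*final_name)
                            *final_name = odb_pack_name(name, hash, ext);
                    if (finalize_object_file(curr_name, *final_name))
                            die(_("unable to rename temporary '*.%s' file to '%s'"),
                                ext, *final_name);
            } else if (make_read_only_if_same) {
                    chmod(*final_name, 0444);
            }
    }

final() can then call such a helper once each for the "pack", "idx" and "rev" temporaries instead of repeating the rename logic three times.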
2021-09-07  index-pack: add --progress-title option  (Ævar Arnfjörð Bjarmason; 1 file, -0/+6)
Add a --progress-title option to index-pack; when data is piped into index-pack, its progress is a proxy for whatever's feeding it data.

This option will allow us to set a more relevant progress bar title in "git bundle unbundle", and is also used in my "bundle-uri" RFC patches[1] by a new caller in fetch-pack.c.

The code change in cmd_index_pack() won't handle "--progress-title=xyz", only "--progress-title xyz", and the "(i+1)" style (as opposed to "i + 1") is a bit odd. Not using the "--long-option=value" style is inconsistent with existing long options handled by cmd_index_pack(), but makes the code that needs to call it better (two strvec_push() calls, instead of needing a strvec_pushf()). Since the option is internal-only the inconsistency shouldn't matter.

I'm copying the pattern to handle it as-is from the handling of the existing "-o" option in the same function; see 9cf6d3357aa (Add git-index-pack utility, 2005-10-12) for its addition. That's a short option, but the code to implement the two is the same in functionality and style.

Eventually we'd like to migrate all of this to parse_options(), which would make these differences in behavior go away.

1. https://lore.kernel.org/git/RFC-cover-00.13-0000000000-20210805T150534Z-avarab@gmail.com/

Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
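A sketch of the space-separated handling described above, written in the style of the surrounding hand-rolled option loop in cmd_index_pack() (not a verbatim hunk; "arg", "i" and "progress_title" follow the naming used in that loop):

    } else if (!strcmp(arg, "--progress-title")) {
            /* the value must follow as a separate argument; "=value" is not parsed */
            if (progress_title || (i + 1) >= argc)
                    usage(index_pack_usage);
            progress_title = argv[++i];

A caller such as "git bundle unbundle" can then pass the title with two plain strvec_push() calls rather than a strvec_pushf().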
2021-08-25  use xopen() to handle fatal open(2) failures  (René Scharfe; 1 file, -6/+2)
Add and apply a semantic patch for using xopen() instead of calling open(2) and die() or die_errno() explicitly. This makes the error messages more consistent and shortens the code. Signed-off-by: René Scharfe <l.s.r@web.de> Signed-off-by: Junio C Hamano <gitster@pobox.com>
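The transformation applied by the semantic patch has this shape (the variable name and message are illustrative, not taken from a specific call site):

    /* before: open-coded error handling */
    fd = open(pack_name, O_RDONLY);
    if (fd < 0)
            die_errno(_("unable to open '%s'"), pack_name);

    /* after: xopen() dies with a uniform, errno-aware message on failure */
    fd = xopen(pack_name, O_RDONLY);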
2021-07-12  *.c static functions: don't forward-declare __attribute__  (Ævar Arnfjörð Bjarmason; 1 file, -3/+1)
In 9cf6d3357aa (Add git-index-pack utility, 2005-10-12) and 466dbc42f58 (receive-pack: Send internal errors over side-band #2, 2010-02-10) we added these static functions and forward-declared their __attribute__((printf)).

I think this may have been to work around some compiler limitation at the time, but in any case we have a lot of code that uses the briefer way of declaring these that I'm using here, so if we had any such issues with compilers we'd have seen them already.

Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
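Schematically, the change looks like this (using a hypothetical report() helper rather than the actual function names touched by the patch):

    /* before: a forward declaration exists only to carry the attribute */
    static void report(const char *fmt, ...)
            __attribute__((format (printf, 1, 2)));
    static void report(const char *fmt, ...)
    {
            /* ... */
    }

    /* after: the attribute sits directly on the definition */
    static void __attribute__((format (printf, 1, 2)))
    report(const char *fmt, ...)
    {
            /* ... */
    }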
2021-04-27  Use the final_oid_fn to finalize hashing of object IDs  (brian m. carlson; 1 file, -1/+1)
When we're hashing a value which is going to be an object ID, we want to zero-pad that value if necessary. To do so, use the final_oid_fn instead of the final_fn anytime we're going to create an object ID to ensure we perform this operation. Signed-off-by: brian m. carlson <sandals@crustytoothpaste.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
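Schematically, the change at each finalization site looks like this (a sketch of the pattern, not a specific hunk; final_fn and final_oid_fn are the function pointers in struct git_hash_algo):

    /* before: writes the raw hash bytes, leaving the rest of the buffer untouched */
    the_hash_algo->final_fn(hash, &ctx);

    /* after: fills a struct object_id, zero-padding it up to GIT_MAX_RAWSZ */
    the_hash_algo->final_oid_fn(&oid, &ctx);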
2021-04-27  Always use oidread to read into struct object_id  (brian m. carlson; 1 file, -2/+2)
In the future, we'll want oidread to automatically set the hash algorithm member for an object ID we read into it, so ensure we use oidread instead of hashcpy everywhere we're copying a hash value into a struct object_id. Signed-off-by: brian m. carlson <sandals@crustytoothpaste.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
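The conversion is one line per call site; a sketch:

    struct object_id oid;

    /* before: raw byte copy into the embedded hash array */
    hashcpy(oid.hash, hash);

    /* after: oidread() copies the hash and can later also record the algorithm */
    oidread(&oid, hash);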
2021-04-07  Merge branch 'ab/fsck-api-cleanup'  (Junio C Hamano; 1 file, -25/+5)
Fsck API clean-up.

* ab/fsck-api-cleanup:
  fetch-pack: use new fsck API to printing dangling submodules
  fetch-pack: use file-scope static struct for fsck_options
  fetch-pack: don't needlessly copy fsck_options
  fsck.c: move gitmodules_{found,done} into fsck_options
  fsck.c: add an fsck_set_msg_type() API that takes enums
  fsck.c: pass along the fsck_msg_id in the fsck_error callback
  fsck.[ch]: move FOREACH_FSCK_MSG_ID & fsck_msg_id from *.c to *.h
  fsck.c: give "FOREACH_MSG_ID" a more specific name
  fsck.c: undefine temporary STR macro after use
  fsck.c: call parse_msg_type() early in fsck_set_msg_type()
  fsck.h: re-order and re-assign "enum fsck_msg_type"
  fsck.h: move FSCK_{FATAL,INFO,ERROR,WARN,IGNORE} into an enum
  fsck.c: refactor fsck_msg_type() to limit scope of "int msg_type"
  fsck.c: rename remaining fsck_msg_id "id" to "msg_id"
  fsck.c: remove (mostly) redundant append_msg_id() function
  fsck.c: rename variables in fsck_set_msg_type() for less confusion
  fsck.h: use "enum object_type" instead of "int"
  fsck.h: use designed initializers for FSCK_OPTIONS_{DEFAULT,STRICT}
  fsck.c: refactor and rename common config callback
2021-03-28  fetch-pack: use new fsck API to printing dangling submodules  (Ævar Arnfjörð Bjarmason; 1 file, -20/+1)
Refactor the check added in 5476e1efde (fetch-pack: print and use dangling .gitmodules, 2021-02-22) to make use of us now passing the "msg_id" to the user-defined "error_func". We can now compare against FSCK_MSG_GITMODULES_MISSING instead of parsing the generated message.

Let's also replace register_found_gitmodules() with directly manipulating the "gitmodules_found" member. A recent commit moved it into "fsck_options" so we could do this here.

I'm sticking this callback in fsck.c. Perhaps in the future we'd like to accumulate such callbacks into another file (maybe fsck-cb.c, similar to parse-options-cb.c?), but while we've got just the one let's just put it into fsck.c. A better alternative in this case would be some more obvious library shared by fetch-pack.c and builtin/index-pack.c, but there isn't such a thing.

Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-03-28  fetch-pack: don't needlessly copy fsck_options  (Ævar Arnfjörð Bjarmason; 1 file, -7/+3)
Change the behavior of the .gitmodules validation added in 5476e1efde (fetch-pack: print and use dangling .gitmodules, 2021-02-22) so we're using one "fsck_options". I found that code confusing to read. One might think that not setting up the error_func earlier means that we're relying on the "error_func" not being set in some code in between the two hunks being modified here. But we're not; all we're doing in the rest of "cmd_index_pack()" is further setup by calling fsck_set_msg_types() and assigning to do_fsck_object.

So there was no reason in 5476e1efde to make a shallow copy of the fsck_options struct before setting error_func. Let's just do this setup at the top of the function, along with the "walk" assignment.

Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-03-28  fsck.c: pass along the fsck_msg_id in the fsck_error callback  (Ævar Arnfjörð Bjarmason; 1 file, -1/+2)
Change the fsck_error callback to also pass along the fsck_msg_id. Before this change the only way to get the message id was to parse it back out of the "message". Let's pass it down explicitly for the benefit of callers that might want to use it, as discussed in [1]. Passing the msg_type is now redundant, as you can always get it back from the msg_id, but I'm not changing that convention. It's really common to need the msg_type, and the report() function itself (which calls "fsck_error") needs to call fsck_msg_type() to discover it. Let's not needlessly re-do that work in the user callback. 1. https://lore.kernel.org/git/87blcja2ha.fsf@evledraar.gmail.com/ Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-03-28  fsck.h: move FSCK_{FATAL,INFO,ERROR,WARN,IGNORE} into an enum  (Ævar Arnfjörð Bjarmason; 1 file, -1/+2)
Move the FSCK_{FATAL,INFO,ERROR,WARN,IGNORE} defines into a new fsck_msg_type enum. These defines were originally introduced in:

  - ba002f3b28a (builtin-fsck: move common object checking code to fsck.c, 2008-02-25)
  - f50c4407305 (fsck: disallow demoting grave fsck errors to warnings, 2015-06-22)
  - efaba7cc77f (fsck: optionally ignore specific fsck issues completely, 2015-06-22)
  - f27d05b1704 (fsck: allow upgrading fsck warnings to errors, 2015-06-22)

The reason these were defined in two different places is because we use FSCK_{IGNORE,INFO,FATAL} only in fsck.c, but FSCK_{ERROR,WARN} are used by external callbacks.

Untangling that would take some more work, since we expose the new "enum fsck_msg_type" to both. Similar to "enum object_type" it's not worth structuring the API in such a way that only those who need FSCK_{ERROR,WARN} pass around a different type.

Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
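The resulting enum in fsck.h looks roughly like this (a sketch of the shape described above; see fsck.h for the authoritative definition and ordering):

    enum fsck_msg_type {
            /* for internal use only */
            FSCK_IGNORE,
            FSCK_INFO,
            FSCK_FATAL,
            /* "public", seen by external fsck_error callbacks */
            FSCK_ERROR,
            FSCK_WARN,
    };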
2021-03-28  fsck.h: use "enum object_type" instead of "int"  (Ævar Arnfjörð Bjarmason; 1 file, -1/+2)
Change the fsck_walk_func to use an "enum object_type" instead of an "int" type. The types are compatible, and ever since this was added in 355885d5315 (add generic, type aware object chain walker, 2008-02-25) we've used entries from object_type (OBJ_BLOB etc.). So this doesn't really change anything as far as the generated code is concerned, it just gives the compiler more information and makes this easier to read. Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-03-13  use CALLOC_ARRAY  (René Scharfe; 1 file, -5/+5)
Add and apply a semantic patch for converting code that open-codes CALLOC_ARRAY to use it instead. It shortens the code and infers the element size automatically. Signed-off-by: René Scharfe <l.s.r@web.de> Signed-off-by: Junio C Hamano <gitster@pobox.com>
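The semantic patch rewrites allocations of the following shape (variable names are illustrative):

    /* before: element size spelled out by hand */
    thread_data = xcalloc(nr_threads, sizeof(*thread_data));

    /* after: CALLOC_ARRAY infers the element size from the pointer type */
    CALLOC_ARRAY(thread_data, nr_threads);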
2021-03-01  Merge branch 'jt/transfer-fsck-across-packs'  (Junio C Hamano; 1 file, -2/+23)
The approach to "fsck" the incoming objects in "index-pack" is attractive for performance reasons (we have them already in core, inflated and ready to be inspected), but fundamentally cannot be applied fully when we receive more than one pack stream, as a tree object in one pack may refer to a blob object in another pack as ".gitmodules", when we want to inspect blobs that are used as ".gitmodules" files, for example. Teach "index-pack" to emit objects that must be inspected later and check them in the calling "fetch-pack" process.

* jt/transfer-fsck-across-packs:
  fetch-pack: print and use dangling .gitmodules
  fetch-pack: with packfile URIs, use index-pack arg
  http-fetch: allow custom index-pack args
  http: allow custom index-pack args
2021-02-22  fetch-pack: print and use dangling .gitmodules  (Jonathan Tan; 1 file, -2/+23)
Teach index-pack to print dangling .gitmodules links after its "keep" or "pack" line instead of declaring an error, and teach fetch-pack to check such lines printed. This allows the tree side of the .gitmodules link to be in one packfile and the blob side to be in another without failing the fsck check, because it is now fetch-pack which checks such objects after all packfiles have been downloaded and indexed (and not index-pack on an individual packfile, as it is before this commit). Signed-off-by: Jonathan Tan <jonathantanmy@google.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-01-25  t: support GIT_TEST_WRITE_REV_INDEX  (Taylor Blau; 1 file, -1/+4)
Add a new option that unconditionally enables the pack.writeReverseIndex setting in order to run the whole test suite in a mode that generates on-disk reverse indexes. Additionally, enable this mode in the second run of tests under linux-gcc in 'ci/run-build-and-tests.sh'. Once on-disk reverse indexes are proven out over several releases, we can change the default value of that configuration to 'true', and drop this patch. Signed-off-by: Taylor Blau <me@ttaylorr.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-01-25  builtin/index-pack.c: write reverse indexes  (Taylor Blau; 1 file, -3/+47)
Teach 'git index-pack' to optionally write and verify reverse index with '--[no-]rev-index', as well as respecting the 'pack.writeReverseIndex' configuration option. Signed-off-by: Taylor Blau <me@ttaylorr.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-01-25  builtin/index-pack.c: allow stripping arbitrary extensions  (Taylor Blau; 1 file, -8/+8)
To derive the filename for a .idx file, 'git index-pack' uses derive_filename() to strip the '.pack' suffix and add the new suffix. Prepare for stripping off suffixes other than '.pack' by making the suffix to strip a parameter of derive_filename(). In order to make this consistent with the "suffix" parameter, which does not begin with a ".", an additional check is added in derive_filename().

Suggested-by: Jeff King <peff@peff.net>
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
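A sketch of the generalized helper described above (strip_suffix() and the strbuf API are git's existing helpers; the exact error message is approximated):

    static char *derive_filename(const char *pack_name, const char *strip,
                                 const char *suffix, struct strbuf *buf)
    {
            size_t len;

            /* the extra check: the stripped name must still end with "." */
            if (!strip_suffix(pack_name, strip, &len) || !len ||
                pack_name[len - 1] != '.')
                    die(_("packfile name '%s' does not end with '.%s'"),
                        pack_name, strip);
            strbuf_add(buf, pack_name, len);   /* keeps the trailing "." */
            strbuf_addstr(buf, suffix);        /* suffix has no leading "." */
            return buf->buf;
    }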
2021-01-04  object-file.c: rename from sha1-file.c  (Martin Ågren; 1 file, -1/+1)
Drop the last remnant of "sha1" in this file and rename it to reflect that we're not just able to handle SHA-1 these days. Signed-off-by: Martin Ågren <martin.agren@gmail.com> Reviewed-by: Derrick Stolee <dstolee@microsoft.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-11-16  compute pack .idx byte offsets using size_t  (Jeff King; 1 file, -1/+1)
A pack and its matching .idx file are limited to 2^32 objects, because the pack format contains a 32-bit field to store the number of objects. Hence we use uint32_t in the code.

But the byte count of even a .idx file can be much larger than that, because it stores at least a hash and an offset for each object. So using SHA-1, a v2 .idx file will cross the 4GB boundary at 153,391,650 objects.

This confuses load_idx(), which computes the minimum size like this:

    unsigned long min_size = 8 + 4*256 + nr*(hashsz + 4 + 4) + hashsz + hashsz;

Even though min_size will be big enough on most 64-bit platforms, the actual arithmetic is done as a uint32_t, resulting in a truncation. We actually exceed that min_size, but then we do:

    unsigned long max_size = min_size;
    if (nr)
            max_size += (nr - 1)*8;

to account for the variable-sized table. That computation doesn't overflow quite so low, but with the truncation for min_size, we end up with a max_size that is much smaller than our actual size. So we complain that the idx is invalid, and can't find any of its objects.

We can fix this case by casting "nr" to a size_t, which will do the multiplication in 64-bits (assuming you're on a 64-bit platform; this will never work on a 32-bit system since we couldn't map the whole .idx anyway). Likewise, we don't have to worry about further additions, because adding a smaller number to a size_t will convert the other side to a size_t.

A few notes:

  - obviously we could just declare "nr" as a size_t in the first place (and likewise, packed_git.num_objects). But it's conceptually a uint32_t because of the on-disk format, and we correctly treat it that way in other contexts that don't need to compute byte offsets (e.g., iterating over the set of objects should and generally does use a uint32_t). Switching to size_t would make all of those other cases look wrong.

  - it could be argued that the proper type is off_t to represent the file offset. But in practice the .idx file must fit within memory, because we mmap the whole thing. And the rest of the code (including the idx_size variable we're comparing against) uses size_t.

  - we'll add the same cast to the max_size arithmetic line. Even though we're adding to a larger type, which will convert our result, the multiplication is still done as a 32-bit value and can itself overflow. I didn't check this with my test case, since it would need an even larger pack (~530M objects), but looking at compiler output shows that it works this way. The standard should agree, but I couldn't find anything explicit in 6.3.1.8 ("usual arithmetic conversions").

The case in load_idx() was the most immediate one that I was able to trigger. After fixing it, looking up actual objects (including the very last one in sha1 order) works in a test repo with 153,725,110 objects. That's because bsearch_hash() works with uint32_t entry indices, and the actual byte access:

    int cmp = hashcmp(table + mi * stride, sha1);

is done with "stride" as a size_t, causing the uint32_t "mi" to be promoted to a size_t. This is the way most code will access the index data.

However, I audited all of the other byte-wise accesses of packed_git.index_data, and many of the others are suspect (they are similar to the max_size one, where we are adding to a properly sized offset or directly to a pointer, but the multiplication in the sub-expression can overflow). I didn't trigger any of these in practice, but I believe they're potential problems, and certainly adding in the cast is not going to hurt anything here.
Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
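The fix described above amounts to a cast in the two computations quoted in the message (a sketch; spacing and the exact placement of the cast may differ from the real hunk):

    /* force the multiplication to be done in size_t, not uint32_t */
    unsigned long min_size = 8 + 4*256 + (size_t)nr*(hashsz + 4 + 4) + hashsz + hashsz;

    unsigned long max_size = min_size;
    if (nr)
            max_size += (size_t)(nr - 1) * 8;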
2020-10-08  Merge branch 'jk/index-pack-hotfixes'  (Junio C Hamano; 1 file, -17/+12)
Hotfix and clean-up for the jt/threaded-index-pack topic that has graduated to v2.29-rc0.

* jk/index-pack-hotfixes:
  index-pack: make get_base_data() comment clearer
  index-pack: drop type_cas mutex
  index-pack: restore "resolving deltas" progress meter
2020-10-07  index-pack: make get_base_data() comment clearer  (Jonathan Tan; 1 file, -11/+8)
A comment mentions that we may free cached delta bases via find_unresolved_deltas(), but that function went away in f08cbf60fe (index-pack: make quantum of work smaller, 2020-09-08). Since we need to rewrite that comment anyway, make the entire comment clearer. Signed-off-by: Jonathan Tan <jonathantanmy@google.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-10-07  index-pack: drop type_cas mutex  (Jeff King; 1 file, -6/+0)
The type_cas lock lost all of its callers in f08cbf60fe (index-pack: make quantum of work smaller, 2020-09-08), so we can safely delete it. The compiler didn't alert us that the variable became unused, because we still call pthread_mutex_init() and pthread_mutex_destroy() on it. It's worth considering also whether that commit was in error to remove the use of the lock. Why don't we need it now, if we did before, as described in ab791dd138 (index-pack: fix race condition with duplicate bases, 2014-08-29)? I think the answer is that we now look at and assign the child_obj->real_type field in the main thread while holding the work_lock(). So we don't have to worry about racing with the worker threads. Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-10-07  index-pack: restore "resolving deltas" progress meter  (Jeff King; 1 file, -0/+4)
Commit f08cbf60fe (index-pack: make quantum of work smaller, 2020-09-08) refactored the main loop in threaded_second_pass(), but also deleted the call to display_progress() at the top of the loop. This means that users typically see no progress at all during the delta resolution phase (and for large repositories, Git appears to hang). This looks like an accident that was unrelated to the intended change of that commit, since we continue to update nr_resolved_deltas in resolve_delta(). Let's restore the call to get that progress back. We'll also add a test that confirms we generate the expected progress. This isn't perfect, as it wouldn't catch a bug where progress was delayed to the end. That was probably possible to trigger when receiving a thin pack, because we'd eventually call display_progress() from fix_unresolved_deltas(), but only once after doing all the work. However, since our test case generates a complete pack, it reliably demonstrates this particular bug and its fix. And we can't do better without making the test racy. Signed-off-by: Jeff King <peff@peff.net> Acked-by: Jonathan Tan <jonathantanmy@google.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-09-22  Merge branch 'jt/threaded-index-pack'  (Junio C Hamano; 1 file, -206/+250)
"git index-pack" learned to resolve deltified objects with greater parallelism.

* jt/threaded-index-pack:
  index-pack: make quantum of work smaller
  index-pack: make resolve_delta() assume base data
  index-pack: calculate {ref,ofs}_{first,last} early
  index-pack: remove redundant child field
  index-pack: unify threaded and unthreaded code
  index-pack: remove redundant parameter
  Documentation: deltaBaseCacheLimit is per-thread
2020-09-08  index-pack: make quantum of work smaller  (Jonathan Tan; 1 file, -148/+200)
Currently, when index-pack resolves deltas, it does not split up delta trees into threads: each delta base root (an object that is not a REF_DELTA or OFS_DELTA) can go into its own thread, but all deltas on that root (direct or indirect) are processed in the same thread.

This is a problem when a repository contains a large text file (thus, delta-able) that is modified many times - delta resolution time during fetching is dominated by processing the deltas corresponding to that text file.

This patch contains a solution to that. When cloning using

    git -c core.deltabasecachelimit=1g clone \
        https://fuchsia.googlesource.com/third_party/vulkan-cts

on my laptop, clone time improved from 3m2s to 2m5s (using 3 threads, which is the default).

The solution is to have a global work stack. This stack contains delta bases (objects, whether appearing directly in the packfile or generated by delta resolution, that themselves have delta children) that need to be processed; whenever a thread needs work, it peeks at the top of the stack and processes its next unprocessed child. If a thread finds the stack empty, it will look for more delta base roots to push on the stack instead.

The main weakness of having a global work stack is that more time is spent in the mutex, but profiling has shown that most time is spent in the resolution of the deltas themselves, so this shouldn't be an issue in practice. In any case, experimentation (as described in the clone command above) shows that this patch is a net improvement.

Signed-off-by: Jonathan Tan <jonathantanmy@google.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-08-24  index-pack: make resolve_delta() assume base data  (Jonathan Tan; 1 file, -3/+5)
A subsequent commit will make the quantum of work smaller, necessitating more locking. This commit allows resolve_delta() to be called outside the lock. Signed-off-by: Jonathan Tan <jonathantanmy@google.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-08-24  index-pack: calculate {ref,ofs}_{first,last} early  (Jonathan Tan; 1 file, -63/+60)
This is refactoring 2 of 2 to simplify struct base_data. Whenever we make a struct base_data, immediately calculate its delta children. This eliminates confusion as to when the {ref,ofs}_{first,last} fields are initialized. Before this patch, the delta children were calculated at the last possible moment. This allowed the members of struct base_data to be populated in any order, superficially useful when we have the object contents before the struct object_entry. But this makes reasoning about the state of struct base_data more complicated, hence this patch. Signed-off-by: Jonathan Tan <jonathantanmy@google.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-08-24  index-pack: remove redundant child field  (Jonathan Tan; 1 file, -19/+22)
This is refactoring 1 of 2 to simplify struct base_data. In index-pack, each thread maintains a doubly-linked list of the delta chain that it is currently processing (the "base" and "child" pointers in struct base_data). When a thread exceeds the delta base cache limit and needs to reclaim memory, it uses the "child" pointers to traverse the lineage, reclaiming the memory of the eldest delta bases first. A subsequent patch will perform memory reclaiming in a different way and will thus no longer need the "child" pointer. Because the "child" pointer is redundant even now, remove it so that the aforementioned subsequent patch will be clearer. In the meantime, reclaim memory in the reverse order of the "base" pointers. Signed-off-by: Jonathan Tan <jonathantanmy@google.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-08-24  index-pack: unify threaded and unthreaded code  (Jonathan Tan; 1 file, -9/+1)
Signed-off-by: Jonathan Tan <jonathantanmy@google.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-08-24  index-pack: remove redundant parameter  (Jonathan Tan; 1 file, -14/+12)
find_{ref,ofs}_delta_{,children} take an enum object_type parameter, but the object type is already present in the name of the function. Remove that parameter from these functions. Signed-off-by: Jonathan Tan <jonathantanmy@google.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-08-21  index-pack: adjust default threading cap  (Jeff King; 1 file, -3/+16)
Commit b8a2486f15 (index-pack: support multithreaded delta resolving, 2012-05-06) describes an experiment that shows that setting the number of threads for index-pack higher than 3 does not help.

I repeated that experiment using a more modern version of Git and a more modern CPU and got different results. Here are timings for p5302 against linux.git run on my laptop, a Core i9-9880H with 8 cores plus hyperthreading (so online-cpus returns 16):

    5302.3: index-pack 0 threads                     256.28(253.41+2.79)
    5302.4: index-pack 1 threads                     257.03(254.03+2.91)
    5302.5: index-pack 2 threads                     149.39(268.34+3.06)
    5302.6: index-pack 4 threads                      94.96(294.10+3.23)
    5302.7: index-pack 8 threads                      68.12(339.26+3.89)
    5302.8: index-pack 16 threads                     70.90(655.03+7.21)
    5302.9: index-pack default number of threads     116.91(290.05+3.21)

You can see that wall-clock times continue to improve dramatically up to the number of cores, but bumping beyond that (into hyperthreading territory) does not help (and in fact hurts a little).

Here's the same experiment on a machine with dual Xeon 6230's, totaling 40 cores (80 with hyperthreading):

    5302.3: index-pack 0 threads                     310.04(302.73+6.90)
    5302.4: index-pack 1 threads                     310.55(302.68+7.40)
    5302.5: index-pack 2 threads                     178.17(304.89+8.20)
    5302.6: index-pack 5 threads                      99.53(315.54+9.56)
    5302.7: index-pack 10 threads                     72.80(327.37+12.79)
    5302.8: index-pack 20 threads                     60.68(357.74+21.66)
    5302.9: index-pack 40 threads                     58.07(454.44+67.96)
    5302.10: index-pack 80 threads                    59.81(720.45+334.52)
    5302.11: index-pack default number of threads    134.18(309.32+7.98)

The results are similar; things stop improving at 40 threads. Curiously, going from 20 to 40 really doesn't help much, either (and increases CPU time considerably). So that may represent an actual barrier to parallelism, where we lose out due to context-switching and loss of cache locality, but don't reap the wall-clock benefits due to contention of our coarse-grained locks.

So what's a good default value? It's clear that the current cap of 3 is too low; our default values are 42% and 57% slower than the best times on each machine. The results on the 40-core machine imply that 20 threads is an actual barrier regardless of the number of cores, so we'll take that as a maximum. We get the best results on these machines at half of the online-cpus value. That's presumably a result of the hyperthreading. That's common on multi-core Intel processors, but not necessarily elsewhere. But if we take it as an assumption, we can perform optimally on hyperthreaded machines and still do much better than the status quo on other machines, as long as we never half below the current value of 3. So that's what this patch does.

Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
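The default described above works out to something like the following in cmd_index_pack() (a sketch; online_cpus() is git's existing helper from thread-utils, and the exact control flow may differ):

    if (!nr_threads) {
            nr_threads = online_cpus() / 2;   /* assume hyperthreading; target physical cores */
            if (nr_threads < 3)
                    nr_threads = 3;           /* never drop below the old default */
            else if (nr_threads > 20)
                    nr_threads = 20;          /* observed barrier regardless of core count */
    }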
2020-06-19  builtin/index-pack: add option to specify hash algorithm  (brian m. carlson; 1 file, -0/+8)
git index-pack is usually run in a repository, but need not be. Since packs don't contain information on the algorithm in use, instead relying on context, add an option to index-pack to tell it which one we're using in case someone runs it outside of a repository.

Since using --stdin necessarily implies a repository, don't allow specifying an object format if it's provided, to prevent users from passing an option that won't work. Add documentation for this option.

Signed-off-by: brian m. carlson <sandals@crustytoothpaste.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-05-27  packfile: compute and use the index CRC offset  (brian m. carlson; 1 file, -5/+1)
Both v2 pack index files and the v3 format specified as part of the NewHash work have similar data starting at the CRC table. Much of the existing code wants to read either this table or the offset entries following it, and in doing so computes the offset each time. In order to share as much code between v2 and v3, compute the offset of the CRC table and store it when the pack is opened. Use this value to compute offsets to not only the CRC table, but to the offset entries beyond it. Signed-off-by: brian m. carlson <sandals@crustytoothpaste.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-04-02  promisor-remote: accept 0 as oid_nr in function  (Jonathan Tan; 1 file, -3/+2)
There are 3 callers to promisor_remote_get_direct() that first check if the number of objects to be fetched is equal to 0. Fold that check into promisor_remote_get_direct(), and in doing so, be explicit as to what promisor_remote_get_direct() does if oid_nr is 0 (it returns 0, success, immediately). Signed-off-by: Jonathan Tan <jonathantanmy@google.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-02-14  Merge branch 'jk/index-pack-dupfix'  (Junio C Hamano; 1 file, -1/+3)
The index-pack code now diagnoses a bad input packstream that records the same object twice when it is used as delta base; the code used to declare a software bug when encountering such an input, but it is an input error.

* jk/index-pack-dupfix:
  index-pack: downgrade twice-resolved REF_DELTA to die()
2020-02-04  index-pack: downgrade twice-resolved REF_DELTA to die()  (Jeff King; 1 file, -1/+3)
When we're resolving a REF_DELTA, we compare-and-swap its type from REF_DELTA to whatever real type the base object has, as discussed in ab791dd138 (index-pack: fix race condition with duplicate bases, 2014-08-29). If the old type wasn't a REF_DELTA, we consider that a BUG().

But as discussed in that commit, we might see this case whenever we try to resolve an object twice, which may happen because we have multiple copies of the base object. So this isn't a bug at all, but rather a sign that the input pack is broken. And indeed, this case is triggered already in t5309.5 and t5309.6, which create packs with delta cycles and duplicate bases. But we never noticed because those tests are marked expect_failure.

Those tests were added by b2ef3d9ebb (test index-pack on packs with recoverable delta cycles, 2013-08-23), which was leaving the door open for cases that we theoretically _could_ handle. And when we see an already-resolved object like this, in theory we could keep going after confirming that the previously resolved child->real_type matches base->obj->real_type. But:

  - enforcing the "only resolve once" rule here saves us from an infinite loop in other parts of the code. If we keep going, then the delta cycle in t5309.5 causes us to loop infinitely, as find_ref_delta_children() doesn't realize which objects have already been resolved. So there would be more changes needed to make this case work, and in the meantime we'd be worse off.

  - any pack that triggers this is broken anyway. It either has a duplicate base object, or it has a cycle which causes us to bring in a duplicate via --fix-thin. In either case, we'd end up rejecting the pack in write_idx_file(), which also detects duplicates.

So the tests have little value in documenting what we _could_ be doing (and have been neglected for 6+ years). Let's switch them to confirming that we handle this case cleanly (and switch out the BUG() for a more informative die() so that we do so).

Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
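The spot in question, roughly (a sketch based on the description above; the field and helper names are approximations of index-pack internals, not a verbatim hunk):

    /*
     * A REF_DELTA whose recorded type is no longer OBJ_REF_DELTA has already
     * been resolved once; a well-formed pack cannot produce that, so treat it
     * as corrupt input rather than an internal bug.
     */
    if (!compare_and_swap_type(&child->real_type, OBJ_REF_DELTA,
                               base->obj->real_type))
            die(_("REF_DELTA at offset %"PRIuMAX" already resolved (duplicate base %s?)"),
                (uintmax_t)child->idx.offset,
                oid_to_hex(&base->obj->idx.oid));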
2020-01-31  sha1-file: allow check_object_signature() to handle any repo  (Matheus Tavares; 1 file, -2/+3)
Some callers of check_object_signature() can work on arbitrary repositories, but the repo does not get passed to this function. Instead, the_repository is always used internally. To fix possible inconsistencies, allow the function to receive a struct repository and make those callers pass on the repo being handled. Signed-off-by: Matheus Tavares <matheus.bernardino@usp.br> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-01-31  sha1-file: pass git_hash_algo to hash_object_file()  (Matheus Tavares; 1 file, -1/+1)
Allow hash_object_file() to work on arbitrary repos by introducing a git_hash_algo parameter. Change callers which have a struct repository pointer in their scope to pass on the git_hash_algo from the said repo. For all other callers, pass on the_hash_algo, which was already being used internally at hash_object_file(). This functionality will be used in the following patch to make check_object_signature() be able to work on arbitrary repos (which, in turn, will be used to fix an inconsistency at object.c:parse_object()). Signed-off-by: Matheus Tavares <matheus.bernardino@usp.br> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-01-31  streaming: allow open_istream() to handle any repo  (Matheus Tavares; 1 file, -1/+2)
Some callers of open_istream() at archive-tar.c and archive-zip.c are capable of working on arbitrary repositories but the repo struct is not passed down to open_istream(), which uses the_repository internally. For now, that's not a problem since the said callers are only being called with the_repository. But to be consistent and avoid future problems, let's allow open_istream() to receive a struct repository and use that instead of the_repository. This parameter addition will also be used in a future patch to make sha1-file.c:check_object_signature() be able to work on arbitrary repos. Signed-off-by: Matheus Tavares <matheus.bernardino@usp.br> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2019-10-11  Merge branch 'bc/object-id-part17'  (Junio C Hamano; 1 file, -2/+2)
Preparation for SHA-256 upgrade continues.

* bc/object-id-part17: (26 commits)
  midx: switch to using the_hash_algo
  builtin/show-index: replace sha1_to_hex
  rerere: replace sha1_to_hex
  builtin/receive-pack: replace sha1_to_hex
  builtin/index-pack: replace sha1_to_hex
  packfile: replace sha1_to_hex
  wt-status: convert struct wt_status to object_id
  cache: remove null_sha1
  builtin/worktree: switch null_sha1 to null_oid
  builtin/repack: write object IDs of the proper length
  pack-write: use hash_to_hex when writing checksums
  sequencer: convert to use the_hash_algo
  bisect: switch to using the_hash_algo
  sha1-lookup: switch hard-coded constants to the_hash_algo
  config: use the_hash_algo in abbrev comparison
  combine-diff: replace GIT_SHA1_HEXSZ with the_hash_algo
  bundle: switch to use the_hash_algo
  connected: switch GIT_SHA1_HEXSZ to the_hash_algo
  show-index: switch hard-coded constants to the_hash_algo
  blame: remove needless comparison with GIT_SHA1_HEXSZ
  ...
2019-08-19  builtin/index-pack: replace sha1_to_hex  (brian m. carlson; 1 file, -2/+2)
Since sha1_to_hex is limited to SHA-1, replace it with hash_to_hex so this code works with other algorithms. Signed-off-by: brian m. carlson <sandals@crustytoothpaste.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>