summaryrefslogtreecommitdiff
path: root/object-file.c
AgeCommit message (Collapse)AuthorFilesLines
2022-03-16Merge branch 'ab/object-file-api-updates'Libravatar Junio C Hamano1-52/+93
Object-file API shuffling. * ab/object-file-api-updates: object-file API: pass an enum to read_object_with_reference() object-file.c: add a literal version of write_object_file_prepare() object-file API: have hash_object_file() take "enum object_type" object API: rename hash_object_file_literally() to write_*() object-file API: split up and simplify check_object_signature() object API users + docs: check <0, not !0 with check_object_signature() object API docs: move check_object_signature() docs to cache.h object API: correct "buf" v.s. "map" mismatch in *.c and *.h object-file API: have write_object_file() take "enum object_type" object-file API: add a format_object_header() function object-file API: return "void", not "int" from hash_object_file() object-file.c: split up declaration of unrelated variables
2022-02-25object-file API: pass an enum to read_object_with_reference()Libravatar Ævar Arnfjörð Bjarmason1-3/+2
Change the read_object_with_reference() function to take an "enum object_type". It was not prepared to handle an arbitrary "const char *type", as it was itself calling type_from_string(). Let's change the only caller that passes in user data to use type_from_string(), and convert the rest to use e.g. "OBJ_TREE" instead of "tree_type". The "cat-file" caller is not on the codepath that handles"--allow-unknown", so the type_from_string() there is safe. Its use of type_from_string() doesn't functionally differ from that of the pre-image. Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2022-02-25object-file.c: add a literal version of write_object_file_prepare()Libravatar Ævar Arnfjörð Bjarmason1-10/+29
Split off a *_literally() variant of the write_object_file_prepare() function. To do this create a new "hash_object_body()" static helper. We now defer the type_name() call until the very last moment in format_object_header() for those callers that aren't "hash-object --literally". Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2022-02-25object-file API: have hash_object_file() take "enum object_type"Libravatar Ævar Arnfjörð Bjarmason1-14/+21
Change the hash_object_file() function to take an "enum object_type". Since a preceding commit all of its callers are passing either "{commit,tree,blob,tag}_type", or the result of a call to type_name(), the parse_object() caller that would pass NULL is now using stream_object_signature(). Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2022-02-25object API: rename hash_object_file_literally() to write_*()Libravatar Ævar Arnfjörð Bjarmason1-3/+3
Before 0c3db67cc81 (hash-object --literally: fix buffer overrun with extra-long object type, 2015-05-04) the hash-object code being changed here called write_sha1_file() to both hash and write a loose object. Before that we'd use hash_sha1_file() to if "-w" wasn't provided, and otherwise call write_sha1_file(). Now we'll always call the same function for both writing. Let's rename it from hash_*_literally() to write_*_literally(). Even though the write_*() might not actually write if HASH_WRITE_OBJECT isn't in "flags", having it be more similar to write_object_file_flags() than hash_object_file(), but carrying a name that would suggest that it's a variant of the latter is confusing. Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2022-02-25object-file API: split up and simplify check_object_signature()Libravatar Ævar Arnfjörð Bjarmason1-14/+18
Split up the check_object_signature() function into that non-streaming version (it accepts an already filled "buf"), and a new stream_object_signature() which will retrieve the object from storage, and hash it on-the-fly. All of the callers of check_object_signature() were effectively calling two different functions, if we go by cyclomatic complexity. I.e. they'd either take the early "if (map)" branch and return early, or not. This has been the case since the "if (map)" condition was added in 090ea12671b (parse_object: avoid putting whole blob in core, 2012-03-07). We can then further simplify the resulting check_object_signature() function since only one caller wanted to pass a non-NULL "buf" and a non-NULL "real_oidp". That "read_loose_object()" codepath used by "git fsck" can instead use hash_object_file() followed by oideq(). Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2022-02-25object API users + docs: check <0, not !0 with check_object_signature()Libravatar Ævar Arnfjörð Bjarmason1-1/+1
Change those users of the object API that misused check_object_signature() by assuming it returned any non-zero when the OID didn't match the expected value to check <0 instead. In practice all of this code worked before, but it wasn't consistent with rest of the users of the API. Let's also clarify what the <0 return value means in API docs. Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2022-02-25object API docs: move check_object_signature() docs to cache.hLibravatar Ævar Arnfjörð Bjarmason1-6/+0
Move the API documentation for check_object_signature() to cache.h, where its prototype is declared. This is in preparation for adding a companion function. Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2022-02-25object API: correct "buf" v.s. "map" mismatch in *.c and *.hLibravatar Ævar Arnfjörð Bjarmason1-5/+5
Change the name of the second argument to check_object_signature() to be "buf" in object-file.c, making it consistent with the prototype in cache.h This fixes an inconsistency that's been with us since 2ade9340262 (Add "check_sha1_signature()" helper function, 2005-04-08), and makes a subsequent commit's diff smaller, as we'll move these API docs to cache.h. While we're at it fix a small grammar error in the documentation, dropping an "an" before "in-core object-data". Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2022-02-25object-file API: have write_object_file() take "enum object_type"Libravatar Ævar Arnfjörð Bjarmason1-5/+5
Change the write_object_file() function to take an "enum object_type" instead of a "const char *type". Its callers either passed {commit,tree,blob,tag}_type and can pass the corresponding OBJ_* type instead, or were hardcoding strings like "blob". This avoids the back & forth fragility where the callers of write_object_file() would have the enum type, and convert it themselves via type_name(). We do have to now do that conversion ourselves before calling write_object_file_prepare(), but those codepaths will be similarly adjusted in subsequent commits. Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2022-02-25object-file API: add a format_object_header() functionLibravatar Ævar Arnfjörð Bjarmason1-3/+20
Add a convenience function to wrap the xsnprintf() command that generates loose object headers. This code was copy/pasted in various parts of the codebase, let's define it in one place and re-use it from there. All except one caller of it had a valid "enum object_type" for us, it's only write_object_file_prepare() which might need to deal with "git hash-object --literally" and a potential garbage type. Let's have the primary API use an "enum object_type", and define a *_literally() function that can take an arbitrary "const char *" for the type. See [1] for the discussion that prompted this patch, i.e. new code in object-file.c that wanted to copy/paste the xsnprintf() invocation. In the case of fast-import.c the callers unfortunately need to cast back & forth between "unsigned char *" and "char *", since format_object_header() ad encode_in_pack_object_header() take different signedness. 1. https://lore.kernel.org/git/211213.86bl1l9bfz.gmgdl@evledraar.gmail.com/ Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com> Signed-off-by: Jiang Xin <zhiyou.jx@alibaba-inc.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2022-02-25object-file API: return "void", not "int" from hash_object_file()Libravatar Ævar Arnfjörð Bjarmason1-8/+8
The hash_object_file() function added in abdc3fc8421 (Add hash_sha1_file(), 2006-10-14) did not have a meaningful return value, and it never has. One was seemingly added to avoid adding braces to the "ret = " assignments being modified here. Let's instead assign "0" to the "ret" variables at the beginning of the relevant functions, and have them return "void". Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2022-02-25object-file.c: split up declaration of unrelated variablesLibravatar Ævar Arnfjörð Bjarmason1-1/+2
Split up the declaration of the "ret" and "re_allocated" variables. It's not our usual style to group variable declarations simply because they share a type, we'd only prefer to do so when the two are closely related (e.g. "int i, j"). This change makes a subsequent and meaningful change's diff smaller. Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2022-02-24object-file: use designated initializers for "struct git_hash_algo"Libravatar Ævar Arnfjörð Bjarmason1-39/+39
As with the preceding commit, change another file-level struct assignment to use designated initializers. Retain the ".name = NULL" etc. in the case of the first element of "unknown hash algorithm", to make it explicit that we're intentionally not setting those, it's not just that we forgot. Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2022-01-03Merge branch 'ns/tmp-objdir'Libravatar Junio C Hamano1-2/+48
New interface into the tmp-objdir API to help in-core use of the quarantine feature. * ns/tmp-objdir: tmp-objdir: disable ref updates when replacing the primary odb tmp-objdir: new API for creating temporary writable databases
2021-12-15Merge branch 'ab/run-command'Libravatar Junio C Hamano1-1/+1
API clean-up. * ab/run-command: run-command API: remove "env" member, always use "env_array" difftool: use "env_array" to simplify memory management run-command API: remove "argv" member, always use "args" run-command API users: use strvec_push(), not argv construction run-command API users: use strvec_pushl(), not argv construction run-command tests: use strvec_pushv(), not argv assignment run-command API users: use strvec_pushv(), not argv assignment upload-archive: use regular "struct child_process" pattern worktree: stop being overly intimate with run_command() internals
2021-12-15Merge branch 'hn/reftable'Libravatar Junio C Hamano1-5/+2
The "reftable" backend for the refs API, without integrating into the refs subsystem, has been added. * hn/reftable: Add "test-tool dump-reftable" command. reftable: add dump utility reftable: implement stack, a mutable database of reftable files. reftable: implement refname validation reftable: add merged table view reftable: add a heap-based priority queue for reftable records reftable: reftable file level tests reftable: read reftable files reftable: generic interface to tables reftable: write reftable files reftable: a generic binary tree implementation reftable: reading/writing blocks Provide zlib's uncompress2 from compat/zlib-compat.c reftable: (de)serialization for the polymorphic record type. reftable: add blocksource, an abstraction for random access reads reftable: utility functions reftable: add error related functionality reftable: add LICENSE hash.h: provide constants for the hash IDs
2021-12-10Merge branch 'po/size-t-for-vs'Libravatar Junio C Hamano1-1/+1
On platforms where ulong is shorter than size_t, code paths that shifted 1 or 1U to the left lacked the necessary cast to size_t, which have been corrected. * po/size-t-for-vs: object-file.c: LLP64 compatibility, upcast unity for left shift diffcore-delta.c: LLP64 compatibility, upcast unity for left shift repack.c: LLP64 compatibility, upcast unity for left shift
2021-12-08tmp-objdir: disable ref updates when replacing the primary odbLibravatar Neeraj Singh1-0/+6
When creating a subprocess with a temporary ODB, we set the GIT_QUARANTINE_ENVIRONMENT env var to tell child Git processes not to update refs, since the tmp-objdir may go away. Introduce a similar mechanism for in-process temporary ODBs when we call tmp_objdir_replace_primary_odb. Now both mechanisms set the disable_ref_updates flag on the odb, which is queried by the ref_transaction_prepare function. Peff's test case [1] was invoking ref updates via the cachetextconv setting. That particular code silently does nothing when a ref update is forbidden. See the call to notes_cache_put in fill_textconv where errors are ignored. [1] https://lore.kernel.org/git/YVOn3hDsb5pnxR53@coredump.intra.peff.net/ Reported-by: Jeff King <peff@peff.net> Signed-off-by: Neeraj Singh <neerajsi@microsoft.com> Reviewed-by: Elijah Newren <newren@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-12-08tmp-objdir: new API for creating temporary writable databasesLibravatar Neeraj Singh1-2/+42
The tmp_objdir API provides the ability to create temporary object directories, but was designed with the goal of having subprocesses access these object stores, followed by the main process migrating objects from it to the main object store or just deleting it. The subprocesses would view it as their primary datastore and write to it. Here we add the tmp_objdir_replace_primary_odb function that replaces the current process's writable "main" object directory with the specified one. The previous main object directory is restored in either tmp_objdir_migrate or tmp_objdir_destroy. For the --remerge-diff usecase, add a new `will_destroy` flag in `struct object_database` to mark ephemeral object databases that do not require fsync durability. Add 'git prune' support for removing temporary object databases, and make sure that they have a name starting with tmp_ and containing an operation-specific name. Based-on-patch-by: Elijah Newren <newren@gmail.com> Signed-off-by: Neeraj Singh <neerajsi@microsoft.com> Reviewed-by: Elijah Newren <newren@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-12-01object-file.c: LLP64 compatibility, upcast unity for left shiftLibravatar Philip Oakley1-1/+1
Visual Studio reports C4334 "was 64-bit shift intended" warning because of size miss-match. Promote unity to the matching type to fit with the assignment. Signed-off-by: Philip Oakley <philipoakley@iee.email> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-11-29Merge branch 'mc/clean-smudge-with-llp64'Libravatar Junio C Hamano1-3/+3
The clean/smudge conversion code path has been prepared to better work on platforms where ulong is narrower than size_t. * mc/clean-smudge-with-llp64: clean/smudge: allow clean filters to process extremely large files odb: guard against data loss checking out a huge file git-compat-util: introduce more size_t helpers odb: teach read_blob_entry to use size_t t1051: introduce a smudge filter test for extremely large files test-lib: add prerequisite for 64-bit platforms test-tool genzeros: generate large amounts of data more efficiently test-genzeros: allow more than 2G zeros in Windows
2021-11-25run-command API: remove "env" member, always use "env_array"Libravatar Ævar Arnfjörð Bjarmason1-1/+1
Remove the "env" member from "struct child_process" in favor of always using the "env_array". As with the preceding removal of "argv" in favor of "args" this gets rid of current and future oddities around memory management at the API boundary (see the amended API docs). For some of the conversions we can replace patterns like: child.env = env->v; With: strvec_pushv(&child.env_array, env->v); But for others we need to guard the strvec_pushv() with a NULL check, since we're not passing in the "v" member of a "struct strvec", e.g. in the case of tmp_objdir_env()'s return value. Ideally we'd rename the "env_array" member to simply "env" as a follow-up, since it and "args" are now inconsistent in not having an "_array" suffix, and seemingly without any good reason, unless we look at the history of how they came to be. But as we've currently got 122 in-tree hits for a "git grep env_array" let's leave that for now (and possibly forever). Doing that rename would be too disruptive. Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-11-12Merge branch 'ab/fsck-unexpected-type'Libravatar Junio C Hamano1-3/+2
Regression fix. * ab/fsck-unexpected-type: object-file: free(*contents) only in read_loose_object() caller object-file: fix SEGV on free() regression in v2.34.0-rc2
2021-11-11object-file: free(*contents) only in read_loose_object() callerLibravatar Ævar Arnfjörð Bjarmason1-5/+2
In the preceding commit a free() of uninitialized memory regression in 96e41f58fe1 (fsck: report invalid object type-path combinations, 2021-10-01) was fixed, but we'd still have an issue with leaking memory from fsck_loose(). Let's fix that issue too. That issue was introduced in my 31deb28f5e0 (fsck: don't hard die on invalid object types, 2021-10-01). It can be reproduced under SANITIZE=leak with the test I added in 093fffdfbec (fsck tests: add test for fsck-ing an unknown type, 2021-10-01): ./t1450-fsck.sh --run=84 -vixd In some sense it's not a problem, we lost the same amount of memory in terms of things malloc'd and not free'd. It just moved from the "still reachable" to "definitely lost" column in valgrind(1) nomenclature[1], since we'd have die()'d before. But now that we don't hard die() anymore in the library let's properly free() it. Doing so makes this code much easier to follow, since we'll now have one function owning the freeing of the "contents" variable, not two. For context on that memory management pattern the read_loose_object() function was added in f6371f92104 (sha1_file: add read_loose_object() function, 2017-01-13) and subsequently used in c68b489e564 (fsck: parse loose object paths directly, 2017-01-13). The pattern of it being the task of both sides to free() the memory has been there in this form since its inception. 1. https://valgrind.org/docs/manual/mc-manual.html#mc-manual.leaks Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-11-11object-file: fix SEGV on free() regression in v2.34.0-rc2Libravatar Ævar Arnfjörð Bjarmason1-0/+2
Fix a regression introduced in my 96e41f58fe1 (fsck: report invalid object type-path combinations, 2021-10-01). When fsck-ing blobs larger than core.bigFileThreshold, we'd free() a pointer to uninitialized memory. This issue would have been caught by SANITIZE=address, but since it involves core.bigFileThreshold, none of the existing tests in our test suite covered it. Running them with the "big_file_threshold" in "environment.c" changed to say "6" would have shown this failure, but let's add a dedicated test for this scenario based on Han Xin's report[1]. The bug was introduced between v9 and v10[2] of the fsck series merged in 061a21d36d8 (Merge branch 'ab/fsck-unexpected-type', 2021-10-25). 1. https://lore.kernel.org/git/20211111030302.75694-1-hanxin.hx@alibaba-inc.com/ 2. https://lore.kernel.org/git/cover-v10-00.17-00000000000-20211001T091051Z-avarab@gmail.com/ Reported-by: Han Xin <chiyutianyi@gmail.com> Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-11-03odb: guard against data loss checking out a huge fileLibravatar Matt Cooper1-3/+3
This introduces an additional guard for platforms where `unsigned long` and `size_t` are not of the same size. If the size of an object in the database would overflow `unsigned long`, instead we now exit with an error. A complete fix will have to update _many_ other functions throughout the codebase to use `size_t` instead of `unsigned long`. It will have to be implemented at some stage. This commit puts in a stop-gap for the time being. Helped-by: Johannes Schindelin <johannes.schindelin@gmx.de> Signed-off-by: Matt Cooper <vtbassmatt@gmail.com> Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-10-25Merge branch 'ab/fix-commit-error-message-upon-unwritable-object-store'Libravatar Junio C Hamano1-8/+12
"git commit" gave duplicated error message when the object store was unwritable, which has been corrected. * ab/fix-commit-error-message-upon-unwritable-object-store: commit: fix duplication regression in permission error output unwritable tests: assert exact error output
2021-10-25Merge branch 'jt/no-abuse-alternate-odb-for-submodules'Libravatar Junio C Hamano1-1/+8
Follow through the work to use the repo interface to access submodule objects in-process, instead of abusing the alternate object database interface. * jt/no-abuse-alternate-odb-for-submodules: submodule: trace adding submodule ODB as alternate submodule: pass repo to check_has_commit() object-file: only register submodule ODB if needed merge-{ort,recursive}: remove add_submodule_odb() refs: peeling non-the_repository iterators is BUG refs: teach arbitrary repo support to iterators refs: plumb repo into ref stores
2021-10-25Merge branch 'ab/fsck-unexpected-type'Libravatar Junio C Hamano1-99/+77
"git fsck" has been taught to report mismatch between expected and actual types of an object better. * ab/fsck-unexpected-type: fsck: report invalid object type-path combinations fsck: don't hard die on invalid object types object-file.c: stop dying in parse_loose_header() object-file.c: return ULHR_TOO_LONG on "header too long" object-file.c: use "enum" return type for unpack_loose_header() object-file.c: simplify unpack_loose_short_header() object-file.c: make parse_loose_header_extended() public object-file.c: return -1, not "status" from unpack_loose_header() object-file.c: don't set "typep" when returning non-zero cat-file tests: test for current --allow-unknown-type behavior cat-file tests: add corrupt loose object test cat-file tests: test for missing/bogus object with -t, -s and -p cat-file tests: move bogus_* variable declarations earlier fsck tests: test for garbage appended to a loose object fsck tests: test current hash/type mismatch behavior fsck tests: refactor one test to use a sub-repo fsck tests: add test for fsck-ing an unknown type
2021-10-12commit: fix duplication regression in permission error outputLibravatar Ævar Arnfjörð Bjarmason1-8/+12
Fix a regression in the error output emitted when .git/objects can't be written to. Before 9c4d6c0297b (cache-tree: Write updated cache-tree after commit, 2014-07-13) we'd emit only one "insufficient permission" error, now we'll do so again. The cause is rather straightforward, we've got WRITE_TREE_SILENT for the use-case of wanting to prepare an index silently, quieting any permission etc. error output. Then when we attempt to update to that (possibly broken) index we'll run into the same errors again. But with 9c4d6c0297b the gap between the cache-tree API and the object store wasn't closed in terms of asking write_object_file() to be silent. I.e. post-9c4d6c0297b the first call is to prepare_index(), and after that we'll call prepare_to_commit(). We only want verbose error output from the latter. So let's add and use that facility with a corresponding HASH_SILENT flag, its only user is cache-tree.c's update_one(), which will set it if its "WRITE_TREE_SILENT" flag is set. Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-10-08object-file: only register submodule ODB if neededLibravatar Jonathan Tan1-1/+8
In a35e03dee0 ("submodule: lazily add submodule ODBs as alternates", 2021-09-08), Git was taught to add all known submodule ODBs as alternates when attempting to read an object that doesn't exist, as a fallback for when a submodule object is read as if it were in the_repository. However, this behavior wasn't restricted to happen only when reading from the_repository. Fix this. Signed-off-by: Jonathan Tan <jonathantanmy@google.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-10-03Merge branch 'hn/refs-errno-cleanup'Libravatar Junio C Hamano1-68/+0
Futz with the way 'errno' is relied on in the refs API to carry the failure modes up the call chain. * hn/refs-errno-cleanup: refs: make errno output explicit for read_raw_ref_fn refs/files-backend: stop setting errno from lock_ref_oid_basic refs: remove EINVAL errno output from specification of read_raw_ref_fn refs file backend: move raceproof_create_file() here
2021-10-01fsck: report invalid object type-path combinationsLibravatar Ævar Arnfjörð Bjarmason1-11/+10
Improve the error that's emitted in cases where we find a loose object we parse, but which isn't at the location we expect it to be. Before this change we'd prefix the error with a not-a-OID derived from the path at which the object was found, due to an emergent behavior in how we'd end up with an "OID" in these codepaths. Now we'll instead say what object we hashed, and what path it was found at. Before this patch series e.g.: $ git hash-object --stdin -w -t blob </dev/null e69de29bb2d1d6434b8b29ae775ad8c2e48c5391 $ mv objects/e6/ objects/e7 Would emit ("[...]" used to abbreviate the OIDs): git fsck error: hash mismatch for ./objects/e7/9d[...] (expected e79d[...]) error: e79d[...]: object corrupt or missing: ./objects/e7/9d[...] Now we'll instead emit: error: e69d[...]: hash-path mismatch, found at: ./objects/e7/9d[...] Furthermore, we'll do the right thing when the object type and its location are bad. I.e. this case: $ git hash-object --stdin -w -t garbage --literally </dev/null 8315a83d2acc4c174aed59430f9a9c4ed926440f $ mv objects/83 objects/84 As noted in an earlier commits we'd simply die early in those cases, until preceding commits fixed the hard die on invalid object type: $ git fsck fatal: invalid object type Now we'll instead emit sensible error messages: $ git fsck error: 8315[...]: hash-path mismatch, found at: ./objects/84/15[...] error: 8315[...]: object is of unknown type 'garbage': ./objects/84/15[...] In both fsck.c and object-file.c we're using null_oid as a sentinel value for checking whether we got far enough to be certain that the issue was indeed this OID mismatch. We need to add the "object corrupt or missing" special-case to deal with cases where read_loose_object() will return an error before completing check_object_signature(), e.g. if we have an error in unpack_loose_rest() because we find garbage after the valid gzip content: $ git hash-object --stdin -w -t blob </dev/null e69de29bb2d1d6434b8b29ae775ad8c2e48c5391 $ chmod 755 objects/e6/9de29bb2d1d6434b8b29ae775ad8c2e48c5391 $ echo garbage >>objects/e6/9de29bb2d1d6434b8b29ae775ad8c2e48c5391 $ git fsck error: garbage at end of loose object 'e69d[...]' error: unable to unpack contents of ./objects/e6/9d[...] error: e69d[...]: object corrupt or missing: ./objects/e6/9d[...] There is currently some weird messaging in the edge case when the two are combined, i.e. because we're not explicitly passing along an error state about this specific scenario from check_stream_oid() via read_loose_object() we'll end up printing the null OID if an object is of an unknown type *and* it can't be unpacked by zlib, e.g.: $ git hash-object --stdin -w -t garbage --literally </dev/null 8315a83d2acc4c174aed59430f9a9c4ed926440f $ chmod 755 objects/83/15a83d2acc4c174aed59430f9a9c4ed926440f $ echo garbage >>objects/83/15a83d2acc4c174aed59430f9a9c4ed926440f $ /usr/bin/git fsck fatal: invalid object type $ ~/g/git/git fsck error: garbage at end of loose object '8315a83d2acc4c174aed59430f9a9c4ed926440f' error: unable to unpack contents of ./objects/83/15a83d2acc4c174aed59430f9a9c4ed926440f error: 8315a83d2acc4c174aed59430f9a9c4ed926440f: object corrupt or missing: ./objects/83/15a83d2acc4c174aed59430f9a9c4ed926440f error: 0000000000000000000000000000000000000000: object is of unknown type 'garbage': ./objects/83/15a83d2acc4c174aed59430f9a9c4ed926440f [...] I think it's OK to leave that for future improvements, which would involve enum-ifying more error state as we've done with "enum unpack_loose_header_result" in preceding commits. In these increasingly more obscure cases the worst that can happen is that we'll get slightly nonsensical or inapplicable error messages. There's other such potential edge cases, all of which might produce some confusing messaging, but still be handled correctly as far as passing along errors goes. E.g. if check_object_signature() returns and oideq(real_oid, null_oid()) is true, which could happen if it returns -1 due to the read_istream() call having failed. Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-10-01fsck: don't hard die on invalid object typesLibravatar Ævar Arnfjörð Bjarmason1-12/+6
Change the error fsck emits on invalid object types, such as: $ git hash-object --stdin -w -t garbage --literally </dev/null <OID> From the very ungraceful error of: $ git fsck fatal: invalid object type $ To: $ git fsck error: <OID>: object is of unknown type 'garbage': <OID_PATH> [ other fsck output ] We'll still exit with non-zero, but now we'll finish the rest of the traversal. The tests that's being added here asserts that we'll still complain about other fsck issues (e.g. an unrelated dangling blob). To do this we need to pass down the "OBJECT_INFO_ALLOW_UNKNOWN_TYPE" flag from read_loose_object() through to parse_loose_header(). Since the read_loose_object() function is only used in builtin/fsck.c we can simply change it to accept a "struct object_info" (which contains the OBJECT_INFO_ALLOW_UNKNOWN_TYPE in its flags). See f6371f92104 (sha1_file: add read_loose_object() function, 2017-01-13) for the introduction of read_loose_object(). Since we'll need a "struct strbuf" to hold the "type_name" let's pass it to the for_each_loose_file_in_objdir() callback to avoid allocating a new one for each loose object in the iteration. It also makes the memory management simpler than sticking it in fsck_loose() itself, as we'll only need to strbuf_reset() it, with no need to do a strbuf_release() before each "return". Before this commit we'd never check the "type" if read_loose_object() failed, but now we do. We therefore need to initialize it to OBJ_NONE to be able to tell the difference between e.g. its unpack_loose_header() having failed, and us getting past that and into parse_loose_header(). Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-10-01object-file.c: stop dying in parse_loose_header()Libravatar Ævar Arnfjörð Bjarmason1-34/+33
Make parse_loose_header() return error codes and data instead of invoking die() by itself. For now we'll move the relevant die() call to loose_object_info() and read_loose_object() to keep this change smaller. In a subsequent commit we'll make read_loose_object() return an error code instead of dying. We should also address the "allow_unknown" case (should be moved to builtin/cat-file.c), but for now I'll be leaving it. For making parse_loose_header() not die() change its prototype to accept a "struct object_info *" instead of the "unsigned long *sizep" it accepted before. Its callers can now check the populated populated "oi->typep". Because of this we don't need to pass in the "unsigned int flags" which we used for OBJECT_INFO_ALLOW_UNKNOWN_TYPE, we can instead do that check in loose_object_info(). This also refactors some confusing control flow around the "status" variable. In some cases we set it to the return value of "error()", i.e. -1, and later checked if "status < 0" was true. Since 93cff9a978e (sha1_loose_object_info: return error for corrupted objects, 2017-04-01) the return value of loose_object_info() (then named sha1_loose_object_info()) had been a "status" variable that be any negative value, as we were expecting to return the "enum object_type". The only negative type happens to be OBJ_BAD, but the code still assumed that more might be added. This was then used later in e.g. c84a1f3ed4d (sha1_file: refactor read_object, 2017-06-21). Now that parse_loose_header() will return 0 on success instead of the type (which it'll stick into the "struct object_info") we don't need to conflate these two cases in its callers. Since parse_loose_header() doesn't need to return an arbitrary "status" we only need to treat its "ret < 0" specially, but can idiomatically overwrite it with our own error() return. This along with having made unpack_loose_header() return an "enum unpack_loose_header_result" in an earlier commit means that we can move the previously nested if/else cases mostly into the "ULHR_OK" branch of the "switch" statement. We should be less silent if we reach that "status = -1" branch, which happens if we've got trailing garbage in loose objects, see f6371f92104 (sha1_file: add read_loose_object() function, 2017-01-13) for a better way to handle it. For now let's punt on it, a subsequent commit will address that edge case. Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-10-01object-file.c: return ULHR_TOO_LONG on "header too long"Libravatar Ævar Arnfjörð Bjarmason1-2/+6
Split up the return code for "header too long" from the generic negative return value unpack_loose_header() returns, and report via error() if we exceed MAX_HEADER_LEN. As a test added earlier in this series in t1006-cat-file.sh shows we'll correctly emit zlib errors from zlib.c already in this case, so we have no need to carry those return codes further down the stack. Let's instead just return ULHR_TOO_LONG saying we ran into the MAX_HEADER_LEN limit, or other negative values for "unable to unpack <OID> header". Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-10-01object-file.c: use "enum" return type for unpack_loose_header()Libravatar Ævar Arnfjörð Bjarmason1-13/+21
In a preceding commit we changed and documented unpack_loose_header() from its previous behavior of returning any negative value or zero, to only -1 or 0. Let's add an "enum unpack_loose_header_result" type and use it for these return values, and have the compiler assert that we're exhaustively covering all of them. Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-10-01object-file.c: simplify unpack_loose_short_header()Libravatar Ævar Arnfjörð Bjarmason1-38/+20
Combine the unpack_loose_short_header(), unpack_loose_header_to_strbuf() and unpack_loose_header() functions into one. The unpack_loose_header_to_strbuf() function was added in 46f034483eb (sha1_file: support reading from a loose object of unknown type, 2015-05-03). Its code was mostly copy/pasted between it and both of unpack_loose_header() and unpack_loose_short_header(). We now have a single unpack_loose_header() function which accepts an optional "struct strbuf *" instead. I think the remaining unpack_loose_header() function could be further simplified, we're carrying some complexity just to be able to emit a garbage type longer than MAX_HEADER_LEN, we could alternatively just say "we found a garbage type <first 32 bytes>..." instead. But let's leave the current behavior in place for now. Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-10-01object-file.c: make parse_loose_header_extended() publicLibravatar Ævar Arnfjörð Bjarmason1-13/+7
Make the parse_loose_header_extended() function public and remove the parse_loose_header() wrapper. The only direct user of it outside of object-file.c itself was in streaming.c, that caller can simply pass the required "struct object-info *" instead. This change is being done in preparation for teaching read_loose_object() to accept a flag to pass to parse_loose_header(). It isn't strictly necessary for that change, we could simply use parse_loose_header_extended() there, but will leave the API in a better end state. It would be a better end-state to have already moved the declaration of these functions to object-store.h to avoid the forward declaration of "struct object_info" in cache.h, but let's leave that cleanup for some other time. 1. https://lore.kernel.org/git/patch-v6-09.22-5b9278e7bb4-20210907T104559Z-avarab@gmail.com/ Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-10-01object-file.c: return -1, not "status" from unpack_loose_header()Libravatar Ævar Arnfjörð Bjarmason1-1/+1
Return a -1 when git_inflate() fails instead of whatever Z_* status we'd get from zlib.c. This makes no difference to any error we report, but makes it more obvious that we don't care about the specific zlib error codes here. See d21f8426907 (unpack_sha1_header(): detect malformed object header, 2016-09-25) for the commit that added the "return status" code. As far as I can tell there was never a real reason (e.g. different reporting) for carrying down the "status" as opposed to "-1". At the time that d21f8426907 was written there was a corresponding "ret < Z_OK" check right after the unpack_sha1_header() call (the "unpack_sha1_header()" function was later rename to our current "unpack_loose_header()"). However, that check was removed in c84a1f3ed4d (sha1_file: refactor read_object, 2017-06-21) without changing the corresponding return code. So let's do the minor cleanup of also changing this function to return a -1. Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-10-01object-file.c: don't set "typep" when returning non-zeroLibravatar Ævar Arnfjörð Bjarmason1-2/+0
When the loose_object_info() function returns an error stop faking up the "oi->typep" to OBJ_BAD. Let the return value of the function itself suffice. This code cleanup simplifies subsequent changes. That we set this at all is a relic from the past. Before 052fe5eaca9 (sha1_loose_object_info: make type lookup optional, 2013-07-12) we would always return the type_from_string(type) via the parse_sha1_header() function, or -1 (i.e. OBJ_BAD) if we couldn't parse it. Then in a combination of 46f034483eb (sha1_file: support reading from a loose object of unknown type, 2015-05-03) and b3ea7dd32d6 (sha1_loose_object_info: handle errors from unpack_sha1_rest, 2017-10-05) our API drifted even further towards conflating the two again. Having read the code paths involved carefully I think this is OK. We are just about to return -1, and we have only one caller: do_oid_object_info_extended(). That function will in turn go on to return -1 when we return -1 here. This might be introducing a subtle bug where a caller of oid_object_info_extended() would inspect its "typep" and expect a meaningful value if the function returned -1. Such a problem would not occur for its simpler oid_object_info() sister function. That one always returns the "enum object_type", which in the case of -1 would be the OBJ_BAD. Having read the code for all the callers of these functions I don't believe any such bug is being introduced here, and in any case we'd likely already have such a bug for the "sizep" member (although blindly checking "typep" first would be a more common case). Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-09-23Merge branch 'rs/packfile-bad-object-list-in-oidset'Libravatar Junio C Hamano1-2/+2
Replace a handcrafted data structure used to keep track of bad objects in the packfile API by an oidset. * rs/packfile-bad-object-list-in-oidset: packfile: use oidset for bad objects packfile: convert has_packed_and_bad() to object_id packfile: convert mark_bad_packed_object() to object_id midx: inline nth_midxed_pack_entry() oidset: make oidset_size() an inline function
2021-09-20Merge branch 'jt/grep-wo-submodule-odb-as-alternate'Libravatar Junio C Hamano1-0/+5
The code to make "git grep" recurse into submodules has been updated to migrate away from the "add submodule's object store as an alternate object store" mechanism (which is suboptimal). * jt/grep-wo-submodule-odb-as-alternate: t7814: show lack of alternate ODB-adding submodule-config: pass repo upon blob config read grep: add repository to OID grep sources grep: allocate subrepos on heap grep: read submodule entry with explicit repo grep: typesafe versions of grep_source_init grep: use submodule-ODB-as-alternate lazy-addition submodule: lazily add submodule ODBs as alternates
2021-09-12packfile: convert has_packed_and_bad() to object_idLibravatar René Scharfe1-1/+1
The single caller has a full object ID, so pass it on instead of just its hash member. Signed-off-by: René Scharfe <l.s.r@web.de> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-09-12packfile: convert mark_bad_packed_object() to object_idLibravatar René Scharfe1-1/+1
All callers have full object IDs, so pass them on instead of just their hash member. Signed-off-by: René Scharfe <l.s.r@web.de> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-09-08submodule: lazily add submodule ODBs as alternatesLibravatar Jonathan Tan1-0/+5
Teach Git to add submodule ODBs as alternates to the object store of the_repository only upon the first access of an object not in the_repository, and not when add_submodule_odb() is called. This provides a means of gradually migrating from accessing a submodule's object through alternates to accessing a submodule's object by explicitly passing its repository object. Any Git command can declare that it might access submodule objects by calling add_submodule_odb() (as they do now), but the submodule ODBs themselves will not be added until needed, so individual commands and/or combinations of arguments can be migrated one by one. [The advantage of explicit repository-object passing is code clarity (it is clear which repository an object read is from), performance (there is no need to linearly search through all submodule ODBs whenever an object is accessed from any repository, whether superproject or submodule), and the possibility of future features like partial clone submodules (which right now is not possible because if an object is missing, we do not know which repository to lazy-fetch into).] This commit also introduces an environment variable that a test may set to make the actual registration of alternates fatal, in order to demonstrate that its codepaths do not need this registration. Signed-off-by: Jonathan Tan <jonathantanmy@google.com> Reviewed-by: Emily Shaffer <emilyshaffer@google.com> Reviewed-by: Matheus Tavares <matheus.bernardino@usp.br> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-09-07hash.h: provide constants for the hash IDsLibravatar Han-Wen Nienhuys1-5/+2
This will simplify referencing them from code that is not deeply integrated with Git, in particular, the reftable library. Signed-off-by: Han-Wen Nienhuys <hanwen@google.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-09-01midx: avoid opening multiple MIDXs when writingLibravatar Taylor Blau1-0/+21
Opening multiple instance of the same MIDX can lead to problems like two separate packed_git structures which represent the same pack being added to the repository's object store. The above scenario can happen because prepare_midx_pack() checks if `m->packs[pack_int_id]` is NULL in order to determine if a pack has been opened and installed in the repository before. But a caller can construct two copies of the same MIDX by calling get_multi_pack_index() and load_multi_pack_index() since the former manipulates the object store directly but the latter is a lower-level routine which allocates a new MIDX for each call. So if prepare_midx_pack() is called on multiple MIDXs with the same pack_int_id, then that pack will be installed twice in the object store's packed_git pointer. This can lead to problems in, for e.g., the pack-bitmap code, which does something like the following (in pack-bitmap.c:open_pack_bitmap()): struct bitmap_index *bitmap_git = ...; for (p = get_all_packs(r); p; p = p->next) { if (open_pack_bitmap_1(bitmap_git, p) == 0) ret = 0; } which is a problem if two copies of the same pack exist in the packed_git list because pack-bitmap.c:open_pack_bitmap_1() contains a conditional like the following: if (bitmap_git->pack || bitmap_git->midx) { /* ignore extra bitmap file; we can only handle one */ warning("ignoring extra bitmap file: %s", packfile->pack_name); close(fd); return -1; } Avoid this scenario by not letting write_midx_internal() open a MIDX that isn't also pointed at by the object store. So long as this is the case, other routines should prefer to open MIDXs with get_multi_pack_index() or reprepare_packed_git() instead of creating instances on their own. Because get_multi_pack_index() returns `r->object_store->multi_pack_index` if it is non-NULL, we'll only have one instance of a MIDX open at one time, avoiding these problems. To encourage this, drop the `struct multi_pack_index *` parameter from `write_midx_internal()`, and rely instead on the `object_dir` to find (or initialize) the correct MIDX instance. Likewise, replace the call to `close_midx()` with `close_object_store()`, since we're about to replace the MIDX with a new one and should invalidate the object store's memory of any MIDX that might have existed beforehand. Note that this now forbids passing object directories that don't belong to alternate repositories over `--object-dir`, since before we would have happily opened a MIDX in any directory, but now restrict ourselves to only those reachable by `r->objects->multi_pack_index` (and alternate MIDXs that we can see by walking the `next` pointer). As far as I can tell, supporting arbitrary directories with `--object-dir` was a historical accident, since even the documentation says `<alt>` when referring to the value passed to this option. Signed-off-by: Taylor Blau <me@ttaylorr.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-08-25refs file backend: move raceproof_create_file() hereLibravatar Ævar Arnfjörð Bjarmason1-68/+0
Move the raceproof_create_file() API added to cache.h and object-file.c in 177978f56ad (raceproof_create_file(): new function, 2017-01-06) to its only user, refs/files-backend.c. Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>