summaryrefslogtreecommitdiff
path: root/builtin/fsck.c
AgeCommit message (Collapse)AuthorFilesLines
2018-03-22Merge branch 'jt/fsck-code-cleanup' into maintLibravatar Junio C Hamano1-1/+7
Plug recently introduced leaks in fsck. * jt/fsck-code-cleanup: fsck: fix leak when traversing trees
2018-01-23fsck: fix leak when traversing treesLibravatar Eric Wong1-1/+7
While fsck_walk/fsck_walk_tree/parse_tree populates "struct tree" idempotently, it is still up to the fsck_walk caller to call free_tree_buffer. Fixes: ad2db4030e42890e ("fsck: remove redundant parse_tree() invocation") Signed-off-by: Eric Wong <e@80x24.org> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2017-11-15Merge branch 'bp/read-index-from-skip-verification'Libravatar Junio C Hamano1-0/+1
Drop (perhaps overly cautious) sanity check before using the index read from the filesystem at runtime. * bp/read-index-from-skip-verification: read_index_from(): speed index loading by skipping verification of the entry order
2017-11-08read_index_from(): speed index loading by skipping verification of the entry ↵Libravatar Ben Peart1-0/+1
order There is code in post_read_index_from() to catch out of order entries when reading an index file. This order verification is ~13% of the cost of every call to read_index_from(). Update check_ce_order() so that it skips this verification unless the "verify_ce_order" global variable is set. Teach fsck to force this verification. The effect can be seen using t/perf/p0002-read-cache.sh: Test HEAD HEAD~1 -------------------------------------------------------------------------------------- 0002.1: read_cache/discard_cache 1000 times 0.41(0.04+0.04) 0.50(0.00+0.10) +22.0% Signed-off-by: Ben Peart <benpeart@microsoft.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2017-10-16refs: convert resolve_ref_unsafe to struct object_idLibravatar brian m. carlson1-1/+1
Convert resolve_ref_unsafe to take a pointer to struct object_id by converting one remaining caller to use struct object_id, removing the temporary NULL pointer check in expand_ref, converting the declaration and definition, and applying the following semantic patch: @@ expression E1, E2, E3, E4; @@ - resolve_ref_unsafe(E1, E2, E3.hash, E4) + resolve_ref_unsafe(E1, E2, &E3, E4) @@ expression E1, E2, E3, E4; @@ - resolve_ref_unsafe(E1, E2, E3->hash, E4) + resolve_ref_unsafe(E1, E2, E3, E4) Signed-off-by: brian m. carlson <sandals@crustytoothpaste.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2017-09-29Merge branch 'ma/leakplugs'Libravatar Junio C Hamano1-6/+1
Memory leaks in various codepaths have been plugged. * ma/leakplugs: pack-bitmap[-write]: use `object_array_clear()`, don't leak object_array: add and use `object_array_pop()` object_array: use `object_array_clear()`, not `free()` leak_pending: use `object_array_clear()`, not `free()` commit: fix memory leak in `reduce_heads()` builtin/commit: fix memory leak in `prepare_index()`
2017-09-24object_array: add and use `object_array_pop()`Libravatar Martin Ågren1-6/+1
In a couple of places, we pop objects off an object array `foo` by decreasing `foo.nr`. We access `foo.nr` in many places, but most if not all other times we do so read-only, e.g., as we iterate over the array. But when we change `foo.nr` behind the array's back, it feels a bit nasty and looks like it might leak memory. Leaks happen if the popped element has an allocated `name` or `path`. At the moment, that is not the case. Still, 1) the object array might gain more fields that want to be freed, 2) a code path where we pop might start using names or paths, 3) one of these code paths might be copied to somewhere where we do, and 4) using a dedicated function for popping is conceptually cleaner. Introduce and use `object_array_pop()` instead. Release memory in the new function. Document that popping an object leaves the associated elements in limbo. The converted places were identified by grepping for "\.nr\>" and looking for "--". Make the new function return NULL on an empty array. This is consistent with `pop_commit()` and allows the following: while ((o = object_array_pop(&foo)) != NULL) { // do something } But as noted above, we don't need to go out of our way to avoid reading `foo.nr`. This is probably more readable: while (foo.nr) { ... o = object_array_pop(&foo); // do something } The name of `object_array_pop()` does not quite align with `add_object_array()`. That is unfortunate. On the other hand, it matches `object_array_clear()`. Arguably it's `add_...` that is the odd one out, since it reads like it's used to "add" an "object array". For that reason, side with `object_array_clear()`. Signed-off-by: Martin Ågren <martin.agren@gmail.com> Reviewed-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2017-09-10Merge branch 'rs/fsck-obj-leakfix' into maintLibravatar Junio C Hamano1-11/+11
Memory leak in an error codepath has been plugged. * rs/fsck-obj-leakfix: fsck: free buffers on error in fsck_obj()
2017-08-26Merge branch 'jt/packmigrate'Libravatar Junio C Hamano1-0/+1
Code movement to make it easier to hack later. * jt/packmigrate: (23 commits) pack: move for_each_packed_object() pack: move has_pack_index() pack: move has_sha1_pack() pack: move find_pack_entry() and make it global pack: move find_sha1_pack() pack: move find_pack_entry_one(), is_pack_valid() pack: move check_pack_index_ptr(), nth_packed_object_offset() pack: move nth_packed_object_{sha1,oid} pack: move clear_delta_base_cache(), packed_object_info(), unpack_entry() pack: move unpack_object_header() pack: move get_size_from_delta() pack: move unpack_object_header_buffer() pack: move {,re}prepare_packed_git and approximate_object_count pack: move install_packed_git() pack: move add_packed_git() pack: move unuse_pack() pack: move use_pack() pack: move pack-closing functions pack: move release_pack_memory() pack: move open_pack_index(), parse_pack_index() ...
2017-08-24Merge branch 'jc/simplify-progress'Libravatar Junio C Hamano1-1/+1
The API to start showing progress meter after a short delay has been simplified. * jc/simplify-progress: progress: simplify "delayed" progress API
2017-08-23pack: move open_pack_index(), parse_pack_index()Libravatar Jonathan Tan1-0/+1
alloc_packed_git() in packfile.c is duplicated from sha1_file.c. In a subsequent commit, alloc_packed_git() will be removed from sha1_file.c. Signed-off-by: Jonathan Tan <jonathantanmy@google.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2017-08-22Merge branch 'rs/fsck-obj-leakfix'Libravatar Junio C Hamano1-11/+11
Memory leak in an error codepath has been plugged. * rs/fsck-obj-leakfix: fsck: free buffers on error in fsck_obj()
2017-08-19progress: simplify "delayed" progress APILibravatar Junio C Hamano1-1/+1
We used to expose the full power of the delayed progress API to the callers, so that they can specify, not just the message to show and expected total amount of work that is used to compute the percentage of work performed so far, the percent-threshold parameter P and the delay-seconds parameter N. The progress meter starts to show at N seconds into the operation only if we have not yet completed P per-cent of the total work. Most callers used either (0%, 2s) or (50%, 1s) as (P, N), but there are oddballs that chose more random-looking values like 95%. For a smoother workload, (50%, 1s) would allow us to start showing the progress meter earlier than (0%, 2s), while keeping the chance of not showing progress meter for long running operation the same as the latter. For a task that would take 2s or more to complete, it is likely that less than half of it would complete within the first second, if the workload is smooth. But for a spiky workload whose earlier part is easier, such a setting is likely to fail to show the progress meter entirely and (0%, 2s) is more appropriate. But that is merely a theory. Realistically, it is of dubious value to ask each codepath to carefully consider smoothness of their workload and specify their own setting by passing two extra parameters. Let's simplify the API by dropping both parameters and have everybody use (0%, 2s). Oh, by the way, the percent-threshold parameter and the structure member were consistently misspelled, which also is now fixed ;-) Helped-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2017-08-11Merge branch 'jt/fsck-code-cleanup'Libravatar Junio C Hamano1-25/+16
Code clean-up. * jt/fsck-code-cleanup: fsck: cleanup unused variable object: remove "used" field from struct object fsck: remove redundant parse_tree() invocation
2017-08-10fsck: free buffers on error in fsck_obj()Libravatar René Scharfe1-11/+11
Move the code for releasing tree buffers and commit buffers in fsck_obj() to the end of the function and make sure it's executed no matter of an error is encountered or not. Signed-off-by: Rene Scharfe <l.s.r@web.de> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2017-07-26fsck: cleanup unused variableLibravatar Jonathan Tan1-3/+1
Remove the unused variable "heads" from cmd_fsck(). This variable was made unused in commit c3271a0 ("fsck: do not fallback "git fsck <bogus>" to "git fsck"", 2017-01-17). Signed-off-by: Jonathan Tan <jonathantanmy@google.com> Reviewed-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2017-07-20object: remove "used" field from struct objectLibravatar Jonathan Tan1-10/+14
The "used" field in struct object is only used by builtin/fsck. Remove that field and modify builtin/fsck to use a flag instead. Signed-off-by: Jonathan Tan <jonathantanmy@google.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2017-07-20fsck: remove redundant parse_tree() invocationLibravatar Jonathan Tan1-12/+1
If obj->type == OBJ_TREE, an invocation of fsck_walk() will invoke parse_tree() and return quickly if that returns nonzero, so it is of no use for traverse_one_object() to invoke parse_tree() in this situation before invoking fsck_walk(). Remove that code. The behavior of traverse_one_object() is changed slightly in that it now returns -1 instead of 1 in the case that parse_tree() fails, but this is not an issue because its only caller (traverse_reachable) does not care about the value as long as it is nonzero. This code was introduced in commit 271b8d2 ("builtin-fsck: move away from object-refs to fsck_walk", 2008-02-25). The same issue existed in that commit. Signed-off-by: Jonathan Tan <jonathantanmy@google.com> Reviewed-by: Jonathan Nieder <jrnieder@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2017-07-17builtin/fsck: convert remaining caller of get_sha1 to object_idLibravatar brian m. carlson1-4/+4
Signed-off-by: brian m. carlson <sandals@crustytoothpaste.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2017-07-05Merge branch 'rs/sha1-name-readdir-optim'Libravatar Junio C Hamano1-1/+1
Optimize "what are the object names already taken in an alternate object database?" query that is used to derive the length of prefix an object name is uniquely abbreviated to. * rs/sha1-name-readdir-optim: sha1_file: guard against invalid loose subdirectory numbers sha1_file: let for_each_file_in_obj_subdir() handle subdir names p4205: add perf test script for pretty log formats sha1_name: cache readdir(3) results in find_short_object_filename()
2017-06-24Merge branch 'bw/config-h'Libravatar Junio C Hamano1-0/+1
Fix configuration codepath to pay proper attention to commondir that is used in multi-worktree situation, and isolate config API into its own header file. * bw/config-h: config: don't implicitly use gitdir or commondir config: respect commondir setup: teach discover_git_directory to respect the commondir config: don't include config.h by default config: remove git_config_iter config: create config.h
2017-06-24sha1_file: guard against invalid loose subdirectory numbersLibravatar René Scharfe1-1/+1
Loose object subdirectories have hexadecimal names based on the first byte of the hash of contained objects, thus their numerical representation can range from 0 (0x00) to 255 (0xff). Change the type of the corresponding variable in for_each_file_in_obj_subdir() and associated callback functions to unsigned int and add a range check. Suggested-by: Jeff King <peff@peff.net> Signed-off-by: Rene Scharfe <l.s.r@web.de> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2017-06-15config: don't include config.h by defaultLibravatar Brandon Williams1-0/+1
Stop including config.h by default in cache.h. Instead only include config.h in those files which require use of the config system. Signed-off-by: Brandon Williams <bmwill@google.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2017-06-13Merge branch 'nd/fopen-errors'Libravatar Junio C Hamano1-2/+1
We often try to open a file for reading whose existence is optional, and silently ignore errors from open/fopen; report such errors if they are not due to missing files. * nd/fopen-errors: mingw_fopen: report ENOENT for invalid file names mingw: verify that paths are not mistaken for remote nicknames log: fix memory leak in open_next_file() rerere.c: move error_errno() closer to the source system call print errno when reporting a system call error wrapper.c: make warn_on_inaccessible() static wrapper.c: add and use fopen_or_warn() wrapper.c: add and use warn_on_fopen_errors() config.mak.uname: set FREAD_READS_DIRECTORIES for Darwin, too config.mak.uname: set FREAD_READS_DIRECTORIES for Linux and FreeBSD clone: use xfopen() instead of fopen() use xfopen() in more places git_fopen: fix a sparse 'not declared' warning
2017-05-29Merge branch 'bc/object-id'Libravatar Junio C Hamano1-8/+8
Conversion from uchar[20] to struct object_id continues. * bc/object-id: (53 commits) object: convert parse_object* to take struct object_id tree: convert parse_tree_indirect to struct object_id sequencer: convert do_recursive_merge to struct object_id diff-lib: convert do_diff_cache to struct object_id builtin/ls-tree: convert to struct object_id merge: convert checkout_fast_forward to struct object_id sequencer: convert fast_forward_to to struct object_id builtin/ls-files: convert overlay_tree_on_cache to object_id builtin/read-tree: convert to struct object_id sha1_name: convert internals of peel_onion to object_id upload-pack: convert remaining parse_object callers to object_id revision: convert remaining parse_object callers to object_id revision: rename add_pending_sha1 to add_pending_oid http-push: convert process_ls_object and descendants to object_id refs/files-backend: convert many internals to struct object_id refs: convert struct ref_update to use struct object_id ref-filter: convert some static functions to struct object_id Convert struct ref_array_item to struct object_id Convert the verify_pack callback to struct object_id Convert lookup_tag to struct object_id ...
2017-05-26use xfopen() in more placesLibravatar Nguyễn Thái Ngọc Duy1-2/+1
xfopen() - provides error details - explains error on reading, or writing, or whatever operation - has l10n support - prints file name in the error Some of these are missing in the places that are replaced with xfopen(), which is a clear win. In some other places, it's just less code (not as clearly a win as the previous case but still is). The only slight regresssion is in remote-testsvn, where we don't report the file class (marks files) in the error messages anymore. But since this is a _test_ svn remote transport, I'm not too concerned. Signed-off-by: Nguyễn Thái Ngọc Duy <pclouds@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2017-05-16Merge branch 'js/larger-timestamps'Libravatar Junio C Hamano1-3/+3
Some platforms have ulong that is smaller than time_t, and our historical use of ulong for timestamp would mean they cannot represent some timestamp that the platform allows. Invent a separate and dedicated timestamp_t (so that we can distingiuish timestamps and a vanilla ulongs, which along is already a good move), and then declare uintmax_t is the type to be used as the timestamp_t. * js/larger-timestamps: archive-tar: fix a sparse 'constant too large' warning use uintmax_t for timestamps date.c: abort if the system time cannot handle one of our timestamps timestamp_t: a new data type for timestamps PRItime: introduce a new "printf format" for timestamps parse_timestamp(): specify explicitly where we parse timestamps t0006 & t5000: skip "far in the future" test when time_t is too limited t0006 & t5000: prepare for 64-bit timestamps ref-filter: avoid using `unsigned long` for catch-all data type
2017-05-08object: convert parse_object* to take struct object_idLibravatar brian m. carlson1-4/+4
Make parse_object, parse_object_or_die, and parse_object_buffer take a pointer to struct object_id. Remove the temporary variables inserted earlier, since they are no longer necessary. Transform all of the callers using the following semantic patch: @@ expression E1; @@ - parse_object(E1.hash) + parse_object(&E1) @@ expression E1; @@ - parse_object(E1->hash) + parse_object(E1) @@ expression E1, E2; @@ - parse_object_or_die(E1.hash, E2) + parse_object_or_die(&E1, E2) @@ expression E1, E2; @@ - parse_object_or_die(E1->hash, E2) + parse_object_or_die(E1, E2) @@ expression E1, E2, E3, E4, E5; @@ - parse_object_buffer(E1.hash, E2, E3, E4, E5) + parse_object_buffer(&E1, E2, E3, E4, E5) @@ expression E1, E2, E3, E4, E5; @@ - parse_object_buffer(E1->hash, E2, E3, E4, E5) + parse_object_buffer(E1, E2, E3, E4, E5) Signed-off-by: brian m. carlson <sandals@crustytoothpaste.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2017-05-08Convert the verify_pack callback to struct object_idLibravatar brian m. carlson1-3/+3
Make the verify_pack_callback take a pointer to struct object_id. Change the pack checksum to use GIT_MAX_RAWSZ, even though it is not strictly an object ID. Doing so ensures resilience against future hash size changes, and allows us to remove hard-coded assumptions about how big the buffer needs to be. Also, use a union to convert the pointer from nth_packed_object_sha1 to to a pointer to struct object_id. This behavior is compatible with GCC and clang and explicitly sanctioned by C11. The alternatives are to just perform a cast, which would run afoul of strict aliasing rules, but should just work, and changing the pointer into an instance of struct object_id and copying the value. The latter operation could seriously bloat memory usage on fsck, which already uses a lot of memory on some repositories. Signed-off-by: brian m. carlson <sandals@crustytoothpaste.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2017-05-08Convert lookup_blob to struct object_idLibravatar brian m. carlson1-1/+1
Convert lookup_blob to take a pointer to struct object_id. The commit was created with manual changes to blob.c and blob.h, plus the following semantic patch: @@ expression E1; @@ - lookup_blob(E1.hash) + lookup_blob(&E1) @@ expression E1; @@ - lookup_blob(E1->hash) + lookup_blob(E1) Signed-off-by: brian m. carlson <sandals@crustytoothpaste.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2017-05-02Convert struct cache_tree to use struct object_idLibravatar brian m. carlson1-2/+2
Convert the sha1 member of struct cache_tree to struct object_id by changing the definition and applying the following semantic patch, plus the standard object_id transforms: @@ struct cache_tree E1; @@ - E1.sha1 + E1.oid.hash @@ struct cache_tree *E1; @@ - E1->sha1 + E1->oid.hash Fix up one reference to active_cache_tree which was not automatically caught by Coccinelle. These changes are prerequisites for converting parse_object. Signed-off-by: brian m. carlson <sandals@crustytoothpaste.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2017-04-27timestamp_t: a new data type for timestampsLibravatar Johannes Schindelin1-2/+2
Git's source code assumes that unsigned long is at least as precise as time_t. Which is incorrect, and causes a lot of problems, in particular where unsigned long is only 32-bit (notably on Windows, even in 64-bit versions). So let's just use a more appropriate data type instead. In preparation for this, we introduce the new `timestamp_t` data type. By necessity, this is a very, very large patch, as it has to replace all timestamps' data type in one go. As we will use a data type that is not necessarily identical to `time_t`, we need to be very careful to use `time_t` whenever we interact with the system functions, and `timestamp_t` everywhere else. Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2017-04-23PRItime: introduce a new "printf format" for timestampsLibravatar Johannes Schindelin1-1/+1
Currently, Git's source code treats all timestamps as if they were unsigned longs. Therefore, it is okay to write "%lu" when printing them. There is a substantial problem with that, though: at least on Windows, time_t is *larger* than unsigned long, and hence we will want to switch away from the ill-specified `unsigned long` data type. So let's introduce the pseudo format "PRItime" (currently simply being defined to "lu") to make it easier to change the data type used for timestamps. Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2017-04-15read-cache: force_verify_index_checksumLibravatar Jeff Hostetler1-0/+1
Teach git to skip verification of the SHA1-1 checksum at the end of the index file in verify_hdr() which is called from read_index() unless the "force_verify_index_checksum" global variable is set. Teach fsck to force this verification. The checksum verification is for detecting disk corruption, and for small projects, the time it takes to compute SHA-1 is not that significant, but for gigantic repositories this calculation adds significant time to every command. These effect can be seen using t/perf/p0002-read-cache.sh: Test HEAD~1 HEAD -------------------------------------------------------------------------------------- 0002.1: read_cache/discard_cache 1000 times 0.66(0.44+0.20) 0.30(0.27+0.02) -54.5% Signed-off-by: Jeff Hostetler <jeffhost@microsoft.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2017-02-22Convert object iteration callbacks to struct object_idLibravatar brian m. carlson1-12/+12
Convert each_loose_object_fn and each_packed_object_fn to take a pointer to struct object_id. Update the various callbacks. Convert several 40-based constants to use GIT_SHA1_HEXSZ. Signed-off-by: brian m. carlson <sandals@crustytoothpaste.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2017-02-22refs: convert each_reflog_ent_fn to struct object_idLibravatar brian m. carlson1-8/+8
Make each_reflog_ent_fn take two struct object_id pointers instead of two pointers to unsigned char. Convert the various callbacks to use struct object_id as well. Also, rename fsck_handle_reflog_sha1 to fsck_handle_reflog_oid. Signed-off-by: brian m. carlson <sandals@crustytoothpaste.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2017-01-31Merge branch 'jk/fsck-connectivity-check-fix'Libravatar Junio C Hamano1-37/+81
"git fsck --connectivity-check" was not working at all. * jk/fsck-connectivity-check-fix: fsck: lazily load types under --connectivity-only fsck: move typename() printing to its own function t1450: use "mv -f" within loose object directory fsck: check HAS_OBJ more consistently fsck: do not fallback "git fsck <bogus>" to "git fsck" fsck: tighten error-checks of "git fsck <head>" fsck: prepare dummy objects for --connectivity-check fsck: report trees as dangling t1450: clean up sub-objects in duplicate-entry test
2017-01-26fsck: lazily load types under --connectivity-onlyLibravatar Jeff King1-51/+7
The recent fixes to "fsck --connectivity-only" load all of the objects with their correct types. This keeps the connectivity-only code path close to the regular one, but it also introduces some unnecessary inefficiency. While getting the type of an object is cheap compared to actually opening and parsing the object (as the non-connectivity-only case would do), it's still not free. For reachable non-blob objects, we end up having to parse them later anyway (to see what they point to), making our type lookup here redundant. For unreachable objects, we might never hit them at all in the reachability traversal, making the lookup completely wasted. And in some cases, we might have quite a few unreachable objects (e.g., when alternates are used for shared object storage between repositories, it's normal for there to be objects reachable from other repositories but not the one running fsck). The comment in mark_object_for_connectivity() claims two benefits to getting the type up front: 1. We need to know the types during fsck_walk(). (And not explicitly mentioned, but we also need them when printing the types of broken or dangling commits). We can address this by lazy-loading the types as necessary. Most objects never need this lazy-load at all, because they fall into one of these categories: a. Reachable from our tips, and are coerced into the correct type as we traverse (e.g., a parent link will call lookup_commit(), which converts OBJ_NONE to OBJ_COMMIT). b. Unreachable, but not at the tip of a chunk of unreachable history. We only mention the tips as "dangling", so an unreachable commit which links to hundreds of other objects needs only report the type of the tip commit. 2. It serves as a cross-check that the coercion in (1a) is correct (i.e., we'll complain about a parent link that points to a blob). But we get most of this for free already, because right after coercing, we'll parse any non-blob objects. So we'd notice then if we expected a commit and got a blob. The one exception is when we expect a blob, in which case we never actually read the object contents. So this is a slight weakening, but given that the whole point of --connectivity-only is to sacrifice some data integrity checks for speed, this seems like an acceptable tradeoff. Here are before and after timings for an extreme case with ~5M reachable objects and another ~12M unreachable (it's the torvalds/linux repository on GitHub, connected to shared storage for all of the other kernel forks): [before] $ time git fsck --no-dangling --connectivity-only real 3m4.323s user 1m25.121s sys 1m38.710s [after] $ time git fsck --no-dangling --connectivity-only real 0m51.497s user 0m49.575s sys 0m1.776s Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2017-01-26fsck: move typename() printing to its own functionLibravatar Jeff King1-9/+20
When an object has a problem, we mention its type. But we do so by feeding the result of typename() directly to fprintf(). This is potentially dangerous because typename() can return NULL for some type values (like OBJ_NONE). It's doubtful that this can be triggered in practice with the current code, so this is probably not fixing a bug. But it future-proofs us against modifications that make things like OBJ_NONE more likely (and gives future patches a central point to handle them). Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2017-01-17fsck: check HAS_OBJ more consistentlyLibravatar Jeff King1-2/+2
There are two spots that call lookup_object() and assume that a non-NULL result means we have the object: 1. When we're checking the objects given to us by the user on the command line. 2. When we're checking if a reflog entry is valid. This generally follows fsck's mental model that we will have looked at and loaded a "struct object" for each object in the repository. But it misses one case: if another object _mentioned_ an object, but we didn't actually parse it or verify that it exists, it will still have a struct. It's not clear if this is a triggerable bug or not. Certainly the later parts of the reachability check need to be careful of this, and do so by checking the HAS_OBJ flag. But both of these steps happen before we start traversing, so probably we won't have followed any links yet. Still, it's easy enough to be defensive here. Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2017-01-17fsck: do not fallback "git fsck <bogus>" to "git fsck"Libravatar Jeff King1-1/+1
Since fsck tries to continue as much as it can after seeing an error, we still do the reachability check even if some heads we were given on the command-line are bogus. But if _none_ of the heads is is valid, we fallback to checking all refs and the index, which is not what the user asked for at all. Instead of checking "heads", the number of successful heads we got, check "argc" (which we know only has non-options in it, because parse_options removed the others). Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2017-01-17fsck: tighten error-checks of "git fsck <head>"Libravatar Jeff King1-2/+5
Instead of checking reachability from the refs, you can ask fsck to check from a particular set of heads. However, the error checking here is quite lax. In particular: 1. It claims lookup_object() will report an error, which is not true. It only does a hash lookup, and the user has no clue that their argument was skipped. 2. When either the name or sha1 cannot be resolved, we continue to exit with a successful error code, even though we didn't check what the user asked us to. This patch fixes both of these cases. Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2017-01-17fsck: prepare dummy objects for --connectivity-checkLibravatar Jeff King1-23/+97
Normally fsck makes a pass over all objects to check their integrity, and then follows up with a reachability check to make sure we have all of the referenced objects (and to know which ones are dangling). The latter checks for the HAS_OBJ flag in obj->flags to see if we found the object in the first pass. Commit 02976bf85 (fsck: introduce `git fsck --connectivity-only`, 2015-06-22) taught fsck to skip the initial pass, and to fallback to has_sha1_file() instead of the HAS_OBJ check. However, it converted only one HAS_OBJ check to use has_sha1_file(). But there are many other places in builtin/fsck.c that assume that the flag is set (or that lookup_object() will return an object at all). This leads to several bugs with --connectivity-only: 1. mark_object() will not queue objects for examination, so recursively following links from commits to trees, etc, did nothing. I.e., we were checking the reachability of hardly anything at all. 2. When a set of heads is given on the command-line, we use lookup_object() to see if they exist. But without the initial pass, we assume nothing exists. 3. When loading reflog entries, we do a similar lookup_object() check, and complain that the reflog is broken if the object doesn't exist in our hash. So in short, --connectivity-only is broken pretty badly, and will claim that your repository is fine when it's not. Presumably nobody noticed for a few reasons. One is that the embedded test does not actually test the recursive nature of the reachability check. All of the missing objects are still in the index, and we directly check items from the index. This patch modifies the test to delete the index, which shows off breakage (1). Another is that --connectivity-only just skips the initial pass for loose objects. So on a real repository, the packed objects were still checked correctly. But on the flipside, it means that "git fsck --connectivity-only" still checks the sha1 of all of the packed objects, nullifying its original purpose of being a faster git-fsck. And of course the final problem is that the bug only shows up when there _is_ corruption, which is rare. So anybody running "git fsck --connectivity-only" proactively would assume it was being thorough, when it was not. One possibility for fixing this is to find all of the spots that rely on HAS_OBJ and tweak them for the connectivity-only case. But besides the risk that we might miss a spot (and I found three already, corresponding to the three bugs above), there are other parts of fsck that _can't_ work without a full list of objects. E.g., the list of dangling objects. Instead, let's make the connectivity-only case look more like the normal case. Rather than skip the initial pass completely, we'll do an abbreviated one that sets up the HAS_OBJ flag for each object, without actually loading the object data. That's simple and fast, and we don't have to care about the connectivity_only flag in the rest of the code at all. While we're at it, let's make sure we treat loose and packed objects the same (i.e., setting up dummy objects for both and skipping the actual sha1 check). That makes the connectivity-only check actually fast on a real repo (40 seconds versus 180 seconds on my copy of linux.git). Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2017-01-17fsck: report trees as danglingLibravatar Jeff King1-1/+1
After checking connectivity, fsck looks through the list of any objects we've seen mentioned, and reports unreachable and un-"used" ones as dangling. However, it skips any object which is not marked as "parsed", as that is an object that we _don't_ have (but that somebody mentioned). Since 6e454b9a3 (clear parsed flag when we free tree buffers, 2013-06-05), that flag can't be relied on, and the correct method is to check the HAS_OBJ flag. The cleanup in that commit missed this callsite, though. As a result, we would generally fail to report dangling trees. We never noticed because there were no tests in this area (for trees or otherwise). Let's add some. Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2017-01-15fsck: parse loose object paths directlyLibravatar Jeff King1-13/+33
When we iterate over the list of loose objects to check, we get the actual path of each object. But we then throw it away and pass just the sha1 to fsck_sha1(), which will do a fresh lookup. Usually it would find the same object, but it may not if an object exists both as a loose and a packed object. We may end up checking the packed object twice, and never look at the loose one. In practice this isn't too terrible, because if fsck doesn't complain, it means you have at least one good copy. But since the point of fsck is to look for corruption, we should be thorough. The new read_loose_object() interface can help us get the data from disk, and then we replace parse_object() with parse_object_buffer(). As a bonus, our error messages now mention the path to a corrupted object, which should make it easier to track down errors when they do happen. Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2016-10-10alternates: use a separate scratch spaceLibravatar Jeff King1-8/+2
The alternate_object_database struct uses a single buffer both for storing the path to the alternate, and as a scratch buffer for forming object names. This is efficient (since otherwise we'd end up storing the path twice), but it makes life hard for callers who just want to know the path to the alternate. They have to remember to stop reading after "alt->name - alt->base" bytes, and to subtract one for the trailing '/'. It would be much simpler if they could simply access a NUL-terminated path string. We could encapsulate this in a function which puts a NUL in the scratch buffer and returns the string, but that opens up questions about the lifetime of the result. The first time another caller uses the alternate, the scratch buffer may get other data tacked onto it. Let's instead just store the root path separately from the scratch buffer. There aren't enough alternates being stored for the duplicated data to matter for performance, and this keeps things simple and safe for the callers. Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2016-09-07streaming: make stream_blob_to_fd take struct object_idLibravatar brian m. carlson1-1/+1
Since all of its callers have been updated, modify stream_blob_to_fd to take a struct object_id. Signed-off-by: brian m. carlson <sandals@crustytoothpaste.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2016-09-07cache: convert struct cache_entry to use struct object_idLibravatar brian m. carlson1-1/+1
Convert struct cache_entry to use struct object_id by applying the following semantic patch and the object_id transforms from contrib, plus the actual change to the struct: @@ struct cache_entry E1; @@ - E1.sha1 + E1.oid.hash @@ struct cache_entry *E1; @@ - E1->sha1 + E1->oid.hash Signed-off-by: brian m. carlson <sandals@crustytoothpaste.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2016-07-28Merge branch 'nd/pack-ofs-4gb-limit'Libravatar Junio C Hamano1-0/+4
"git pack-objects" and "git index-pack" mostly operate with off_t when talking about the offset of objects in a packfile, but there were a handful of places that used "unsigned long" to hold that value, leading to an unintended truncation. * nd/pack-ofs-4gb-limit: fsck: use streaming interface for large blobs in pack pack-objects: do not truncate result in-pack object size on 32-bit systems index-pack: correct "offset" type in unpack_entry_data() index-pack: report correct bad object offsets even if they are large index-pack: correct "len" type in unpack_data() sha1_file.c: use type off_t* for object_info->disk_sizep pack-objects: pass length to check_pack_crc() without truncation
2016-07-18fsck: optionally show more helpful info for broken linksLibravatar Johannes Schindelin1-4/+38
When reporting broken links between commits/trees/blobs, it would be quite helpful at times if the user would be told how the object is supposed to be reachable. With the new --name-objects option, git-fsck will try to do exactly that: name the objects in a way that shows how they are reachable. For example, when some reflog got corrupted and a blob is missing that should not be, the user might want to remove the corresponding reflog entry. This option helps them find that entry: `git fsck` will now report something like this: broken link from tree b5eb6ff... (refs/stash@{<date>}~37:) to blob ec5cf80... Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de> Signed-off-by: Junio C Hamano <gitster@pobox.com>