summaryrefslogtreecommitdiff
path: root/fsck.c
diff options
context:
space:
mode:
authorLibravatar Jeff King <peff@peff.net>2017-01-25 23:12:07 -0500
committerLibravatar Junio C Hamano <gitster@pobox.com>2017-01-26 10:51:09 -0800
commita2b22854bd5f252cd036636091a1d30141c35bce (patch)
treed08718eec2801e4cf38050ae03c28486c06c9636 /fsck.c
parentfsck: move typename() printing to its own function (diff)
downloadtgif-a2b22854bd5f252cd036636091a1d30141c35bce.tar.xz
fsck: lazily load types under --connectivity-only
The recent fixes to "fsck --connectivity-only" load all of the objects with their correct types. This keeps the connectivity-only code path close to the regular one, but it also introduces some unnecessary inefficiency. While getting the type of an object is cheap compared to actually opening and parsing the object (as the non-connectivity-only case would do), it's still not free. For reachable non-blob objects, we end up having to parse them later anyway (to see what they point to), making our type lookup here redundant. For unreachable objects, we might never hit them at all in the reachability traversal, making the lookup completely wasted. And in some cases, we might have quite a few unreachable objects (e.g., when alternates are used for shared object storage between repositories, it's normal for there to be objects reachable from other repositories but not the one running fsck). The comment in mark_object_for_connectivity() claims two benefits to getting the type up front: 1. We need to know the types during fsck_walk(). (And not explicitly mentioned, but we also need them when printing the types of broken or dangling commits). We can address this by lazy-loading the types as necessary. Most objects never need this lazy-load at all, because they fall into one of these categories: a. Reachable from our tips, and are coerced into the correct type as we traverse (e.g., a parent link will call lookup_commit(), which converts OBJ_NONE to OBJ_COMMIT). b. Unreachable, but not at the tip of a chunk of unreachable history. We only mention the tips as "dangling", so an unreachable commit which links to hundreds of other objects needs only report the type of the tip commit. 2. It serves as a cross-check that the coercion in (1a) is correct (i.e., we'll complain about a parent link that points to a blob). But we get most of this for free already, because right after coercing, we'll parse any non-blob objects. So we'd notice then if we expected a commit and got a blob. The one exception is when we expect a blob, in which case we never actually read the object contents. So this is a slight weakening, but given that the whole point of --connectivity-only is to sacrifice some data integrity checks for speed, this seems like an acceptable tradeoff. Here are before and after timings for an extreme case with ~5M reachable objects and another ~12M unreachable (it's the torvalds/linux repository on GitHub, connected to shared storage for all of the other kernel forks): [before] $ time git fsck --no-dangling --connectivity-only real 3m4.323s user 1m25.121s sys 1m38.710s [after] $ time git fsck --no-dangling --connectivity-only real 0m51.497s user 0m49.575s sys 0m1.776s Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
Diffstat (limited to 'fsck.c')
-rw-r--r--fsck.c4
1 files changed, 4 insertions, 0 deletions
diff --git a/fsck.c b/fsck.c
index 4a3069e204..939792752b 100644
--- a/fsck.c
+++ b/fsck.c
@@ -458,6 +458,10 @@ int fsck_walk(struct object *obj, void *data, struct fsck_options *options)
{
if (!obj)
return -1;
+
+ if (obj->type == OBJ_NONE)
+ parse_object(obj->oid.hash);
+
switch (obj->type) {
case OBJ_BLOB:
return 0;