From ae4f07fbccaab6dc93be52c0f34e137dd9fcbcf4 Mon Sep 17 00:00:00 2001 From: Vicent Marti Date: Sat, 21 Dec 2013 09:00:45 -0500 Subject: pack-bitmap: implement optional name_hash cache When we use pack bitmaps rather than walking the object graph, we end up with the list of objects to include in the packfile, but we do not know the path at which any tree or blob objects would be found. In a recently packed repository, this is fine. A fetch would use the paths only as a heuristic in the delta compression phase, and a fully packed repository should not need to do much delta compression. As time passes, though, we may acquire more objects on top of our large bitmapped pack. If clients fetch frequently, then they never even look at the bitmapped history, and all works as usual. However, a client who has not fetched since the last bitmap repack will have "have" tips in the bitmapped history, but "want" newer objects. The bitmaps themselves degrade gracefully in this circumstance. We manually walk the more recent bits of history, and then use bitmaps when we hit them. But we would also like to perform delta compression between the newer objects and the bitmapped objects (both to delta against what we know the user already has, but also between "new" and "old" objects that the user is fetching). The lack of pathnames makes our delta heuristics much less effective. This patch adds an optional cache of the 32-bit name_hash values to the end of the bitmap file. If present, a reader can use it to match bitmapped and non-bitmapped names during delta compression. Here are perf results for p5310: Test origin/master HEAD^ HEAD ------------------------------------------------------------------------------------------------- 5310.2: repack to disk 36.81(37.82+1.43) 47.70(48.74+1.41) +29.6% 47.75(48.70+1.51) +29.7% 5310.3: simulated clone 30.78(29.70+2.14) 1.08(0.97+0.10) -96.5% 1.07(0.94+0.12) -96.5% 5310.4: simulated fetch 3.16(6.10+0.08) 3.54(10.65+0.06) +12.0% 1.70(3.07+0.06) -46.2% 5310.6: partial bitmap 36.76(43.19+1.81) 6.71(11.25+0.76) -81.7% 4.08(6.26+0.46) -88.9% You can see that the time spent on an incremental fetch goes down, as our delta heuristics are able to do their work. And we save time on the partial bitmap clone for the same reason. Signed-off-by: Vicent Marti Signed-off-by: Jeff King Signed-off-by: Junio C Hamano --- Documentation/technical/bitmap-format.txt | 33 +++++++++++++++++++++++++++++++ 1 file changed, 33 insertions(+) (limited to 'Documentation/technical') diff --git a/Documentation/technical/bitmap-format.txt b/Documentation/technical/bitmap-format.txt index 7a86bd77d5..f8c18a0f7a 100644 --- a/Documentation/technical/bitmap-format.txt +++ b/Documentation/technical/bitmap-format.txt @@ -21,6 +21,12 @@ GIT bitmap v1 format requirement for the bitmap index format, also present in JGit, that greatly reduces the complexity of the implementation. + - BITMAP_OPT_HASH_CACHE (0x4) + If present, the end of the bitmap file contains + `N` 32-bit name-hash values, one per object in the + pack. The format and meaning of the name-hash is + described below. + 4-byte entry count (network byte order) The total count of entries (bitmapped commits) in this bitmap index. @@ -129,3 +135,30 @@ The bitstream represented by the above chunk is then: The next word after `L_M` (if any) must again be a RLW, for the next chunk. For efficient appending to the bitstream, the EWAH stores a pointer to the last RLW in the stream. + + +== Appendix B: Optional Bitmap Sections + +These sections may or may not be present in the `.bitmap` file; their +presence is indicated by the header flags section described above. + +Name-hash cache +--------------- + +If the BITMAP_OPT_HASH_CACHE flag is set, the end of the bitmap contains +a cache of 32-bit values, one per object in the pack. The value at +position `i` is the hash of the pathname at which the `i`th object +(counting in index order) in the pack can be found. This can be fed +into the delta heuristics to compare objects with similar pathnames. + +The hash algorithm used is: + + hash = 0; + while ((c = *name++)) + if (!isspace(c)) + hash = (hash >> 2) + (c << 24); + +Note that this hashing scheme is tied to the BITMAP_OPT_HASH_CACHE flag. +If implementations want to choose a different hashing scheme, they are +free to do so, but MUST allocate a new header flag (because comparing +hashes made under two different schemes would be pointless). -- cgit v1.2.3