diff options
author | Junio C Hamano <gitster@pobox.com> | 2021-04-08 13:23:25 -0700 |
---|---|---|
committer | Junio C Hamano <gitster@pobox.com> | 2021-04-08 13:23:25 -0700 |
commit | e6b971fcf5d85db821636f2d887cfaf204b32bda (patch) | |
tree | 8d94e5501218bd7614729a363637e3746568a814 | |
parent | The seventh batch (diff) | |
parent | midx.c: improve cache locality in midx_pack_order_cmp() (diff) | |
download | tgif-e6b971fcf5d85db821636f2d887cfaf204b32bda.tar.xz |
Merge branch 'tb/reverse-midx'
An on-disk reverse-index to map the in-pack location of an object
back to its object name across multiple packfiles is introduced.
* tb/reverse-midx:
midx.c: improve cache locality in midx_pack_order_cmp()
pack-revindex: write multi-pack reverse indexes
pack-write.c: extract 'write_rev_file_order'
pack-revindex: read multi-pack reverse indexes
Documentation/technical: describe multi-pack reverse indexes
midx: make some functions non-static
midx: keep track of the checksum
midx: don't free midx_name early
midx: allow marking a pack as preferred
t/helper/test-read-midx.c: add '--show-objects'
builtin/multi-pack-index.c: display usage on unrecognized command
builtin/multi-pack-index.c: don't enter bogus cmd_mode
builtin/multi-pack-index.c: split sub-commands
builtin/multi-pack-index.c: define common usage with a macro
builtin/multi-pack-index.c: don't handle 'progress' separately
builtin/multi-pack-index.c: inline 'flags' with options
-rw-r--r-- | Documentation/git-multi-pack-index.txt | 14 | ||||
-rw-r--r-- | Documentation/technical/multi-pack-index.txt | 5 | ||||
-rw-r--r-- | Documentation/technical/pack-format.txt | 83 | ||||
-rw-r--r-- | builtin/multi-pack-index.c | 182 | ||||
-rw-r--r-- | builtin/repack.c | 2 | ||||
-rw-r--r-- | midx.c | 219 | ||||
-rw-r--r-- | midx.h | 11 | ||||
-rw-r--r-- | pack-revindex.c | 126 | ||||
-rw-r--r-- | pack-revindex.h | 53 | ||||
-rw-r--r-- | pack-write.c | 36 | ||||
-rw-r--r-- | pack.h | 1 | ||||
-rw-r--r-- | packfile.c | 3 | ||||
-rw-r--r-- | t/helper/test-read-midx.c | 24 | ||||
-rwxr-xr-x | t/t5319-multi-pack-index.sh | 42 |
14 files changed, 733 insertions, 68 deletions
diff --git a/Documentation/git-multi-pack-index.txt b/Documentation/git-multi-pack-index.txt index eb0caa0439..ffd601bc17 100644 --- a/Documentation/git-multi-pack-index.txt +++ b/Documentation/git-multi-pack-index.txt @@ -9,7 +9,8 @@ git-multi-pack-index - Write and verify multi-pack-indexes SYNOPSIS -------- [verse] -'git multi-pack-index' [--object-dir=<dir>] [--[no-]progress] <subcommand> +'git multi-pack-index' [--object-dir=<dir>] [--[no-]progress] + [--preferred-pack=<pack>] <subcommand> DESCRIPTION ----------- @@ -30,7 +31,16 @@ OPTIONS The following subcommands are available: write:: - Write a new MIDX file. + Write a new MIDX file. The following options are available for + the `write` sub-command: ++ +-- + --preferred-pack=<pack>:: + Optionally specify the tie-breaking pack used when + multiple packs contain the same object. If not given, + ties are broken in favor of the pack with the lowest + mtime. +-- verify:: Verify the contents of the MIDX file. diff --git a/Documentation/technical/multi-pack-index.txt b/Documentation/technical/multi-pack-index.txt index e8e377a59f..fb688976c4 100644 --- a/Documentation/technical/multi-pack-index.txt +++ b/Documentation/technical/multi-pack-index.txt @@ -43,8 +43,9 @@ Design Details a change in format. - The MIDX keeps only one record per object ID. If an object appears - in multiple packfiles, then the MIDX selects the copy in the most- - recently modified packfile. + in multiple packfiles, then the MIDX selects the copy in the + preferred packfile, otherwise selecting from the most-recently + modified packfile. - If there exist packfiles in the pack directory not registered in the MIDX, then those packfiles are loaded into the `packed_git` diff --git a/Documentation/technical/pack-format.txt b/Documentation/technical/pack-format.txt index 1faa949bf6..8d2f42f29e 100644 --- a/Documentation/technical/pack-format.txt +++ b/Documentation/technical/pack-format.txt @@ -379,3 +379,86 @@ CHUNK DATA: TRAILER: Index checksum of the above contents. + +== multi-pack-index reverse indexes + +Similar to the pack-based reverse index, the multi-pack index can also +be used to generate a reverse index. + +Instead of mapping between offset, pack-, and index position, this +reverse index maps between an object's position within the MIDX, and +that object's position within a pseudo-pack that the MIDX describes +(i.e., the ith entry of the multi-pack reverse index holds the MIDX +position of ith object in pseudo-pack order). + +To clarify the difference between these orderings, consider a multi-pack +reachability bitmap (which does not yet exist, but is what we are +building towards here). Each bit needs to correspond to an object in the +MIDX, and so we need an efficient mapping from bit position to MIDX +position. + +One solution is to let bits occupy the same position in the oid-sorted +index stored by the MIDX. But because oids are effectively random, their +resulting reachability bitmaps would have no locality, and thus compress +poorly. (This is the reason that single-pack bitmaps use the pack +ordering, and not the .idx ordering, for the same purpose.) + +So we'd like to define an ordering for the whole MIDX based around +pack ordering, which has far better locality (and thus compresses more +efficiently). We can think of a pseudo-pack created by the concatenation +of all of the packs in the MIDX. E.g., if we had a MIDX with three packs +(a, b, c), with 10, 15, and 20 objects respectively, we can imagine an +ordering of the objects like: + + |a,0|a,1|...|a,9|b,0|b,1|...|b,14|c,0|c,1|...|c,19| + +where the ordering of the packs is defined by the MIDX's pack list, +and then the ordering of objects within each pack is the same as the +order in the actual packfile. + +Given the list of packs and their counts of objects, you can +naïvely reconstruct that pseudo-pack ordering (e.g., the object at +position 27 must be (c,1) because packs "a" and "b" consumed 25 of the +slots). But there's a catch. Objects may be duplicated between packs, in +which case the MIDX only stores one pointer to the object (and thus we'd +want only one slot in the bitmap). + +Callers could handle duplicates themselves by reading objects in order +of their bit-position, but that's linear in the number of objects, and +much too expensive for ordinary bitmap lookups. Building a reverse index +solves this, since it is the logical inverse of the index, and that +index has already removed duplicates. But, building a reverse index on +the fly can be expensive. Since we already have an on-disk format for +pack-based reverse indexes, let's reuse it for the MIDX's pseudo-pack, +too. + +Objects from the MIDX are ordered as follows to string together the +pseudo-pack. Let `pack(o)` return the pack from which `o` was selected +by the MIDX, and define an ordering of packs based on their numeric ID +(as stored by the MIDX). Let `offset(o)` return the object offset of `o` +within `pack(o)`. Then, compare `o1` and `o2` as follows: + + - If one of `pack(o1)` and `pack(o2)` is preferred and the other + is not, then the preferred one sorts first. ++ +(This is a detail that allows the MIDX bitmap to determine which +pack should be used by the pack-reuse mechanism, since it can ask +the MIDX for the pack containing the object at bit position 0). + + - If `pack(o1) ≠ pack(o2)`, then sort the two objects in descending + order based on the pack ID. + + - Otherwise, `pack(o1) = pack(o2)`, and the objects are sorted in + pack-order (i.e., `o1` sorts ahead of `o2` exactly when `offset(o1) + < offset(o2)`). + +In short, a MIDX's pseudo-pack is the de-duplicated concatenation of +objects in packs stored by the MIDX, laid out in pack order, and the +packs arranged in MIDX order (with the preferred pack coming first). + +Finally, note that the MIDX's reverse index is not stored as a chunk in +the multi-pack-index itself. This is done because the reverse index +includes the checksum of the pack or MIDX to which it belongs, which +makes it impossible to write in the MIDX. To avoid races when rewriting +the MIDX, a MIDX reverse index includes the MIDX's checksum in its +filename (e.g., `multi-pack-index-xyz.rev`). diff --git a/builtin/multi-pack-index.c b/builtin/multi-pack-index.c index 5bf88cd2a8..5d3ea445fd 100644 --- a/builtin/multi-pack-index.c +++ b/builtin/multi-pack-index.c @@ -4,67 +4,181 @@ #include "parse-options.h" #include "midx.h" #include "trace2.h" +#include "object-store.h" +#define BUILTIN_MIDX_WRITE_USAGE \ + N_("git multi-pack-index [<options>] write [--preferred-pack=<pack>]") + +#define BUILTIN_MIDX_VERIFY_USAGE \ + N_("git multi-pack-index [<options>] verify") + +#define BUILTIN_MIDX_EXPIRE_USAGE \ + N_("git multi-pack-index [<options>] expire") + +#define BUILTIN_MIDX_REPACK_USAGE \ + N_("git multi-pack-index [<options>] repack [--batch-size=<size>]") + +static char const * const builtin_multi_pack_index_write_usage[] = { + BUILTIN_MIDX_WRITE_USAGE, + NULL +}; +static char const * const builtin_multi_pack_index_verify_usage[] = { + BUILTIN_MIDX_VERIFY_USAGE, + NULL +}; +static char const * const builtin_multi_pack_index_expire_usage[] = { + BUILTIN_MIDX_EXPIRE_USAGE, + NULL +}; +static char const * const builtin_multi_pack_index_repack_usage[] = { + BUILTIN_MIDX_REPACK_USAGE, + NULL +}; static char const * const builtin_multi_pack_index_usage[] = { - N_("git multi-pack-index [<options>] (write|verify|expire|repack --batch-size=<size>)"), + BUILTIN_MIDX_WRITE_USAGE, + BUILTIN_MIDX_VERIFY_USAGE, + BUILTIN_MIDX_EXPIRE_USAGE, + BUILTIN_MIDX_REPACK_USAGE, NULL }; static struct opts_multi_pack_index { const char *object_dir; + const char *preferred_pack; unsigned long batch_size; - int progress; + unsigned flags; } opts; -int cmd_multi_pack_index(int argc, const char **argv, - const char *prefix) +static struct option common_opts[] = { + OPT_FILENAME(0, "object-dir", &opts.object_dir, + N_("object directory containing set of packfile and pack-index pairs")), + OPT_BIT(0, "progress", &opts.flags, N_("force progress reporting"), MIDX_PROGRESS), + OPT_END(), +}; + +static struct option *add_common_options(struct option *prev) { - unsigned flags = 0; + return parse_options_concat(common_opts, prev); +} + +static int cmd_multi_pack_index_write(int argc, const char **argv) +{ + struct option *options; + static struct option builtin_multi_pack_index_write_options[] = { + OPT_STRING(0, "preferred-pack", &opts.preferred_pack, + N_("preferred-pack"), + N_("pack for reuse when computing a multi-pack bitmap")), + OPT_END(), + }; + + options = add_common_options(builtin_multi_pack_index_write_options); + + trace2_cmd_mode(argv[0]); - static struct option builtin_multi_pack_index_options[] = { - OPT_FILENAME(0, "object-dir", &opts.object_dir, - N_("object directory containing set of packfile and pack-index pairs")), - OPT_BOOL(0, "progress", &opts.progress, N_("force progress reporting")), + argc = parse_options(argc, argv, NULL, + options, builtin_multi_pack_index_write_usage, + PARSE_OPT_KEEP_UNKNOWN); + if (argc) + usage_with_options(builtin_multi_pack_index_write_usage, + options); + + FREE_AND_NULL(options); + + return write_midx_file(opts.object_dir, opts.preferred_pack, + opts.flags); +} + +static int cmd_multi_pack_index_verify(int argc, const char **argv) +{ + struct option *options = common_opts; + + trace2_cmd_mode(argv[0]); + + argc = parse_options(argc, argv, NULL, + options, builtin_multi_pack_index_verify_usage, + PARSE_OPT_KEEP_UNKNOWN); + if (argc) + usage_with_options(builtin_multi_pack_index_verify_usage, + options); + + return verify_midx_file(the_repository, opts.object_dir, opts.flags); +} + +static int cmd_multi_pack_index_expire(int argc, const char **argv) +{ + struct option *options = common_opts; + + trace2_cmd_mode(argv[0]); + + argc = parse_options(argc, argv, NULL, + options, builtin_multi_pack_index_expire_usage, + PARSE_OPT_KEEP_UNKNOWN); + if (argc) + usage_with_options(builtin_multi_pack_index_expire_usage, + options); + + return expire_midx_packs(the_repository, opts.object_dir, opts.flags); +} + +static int cmd_multi_pack_index_repack(int argc, const char **argv) +{ + struct option *options; + static struct option builtin_multi_pack_index_repack_options[] = { OPT_MAGNITUDE(0, "batch-size", &opts.batch_size, N_("during repack, collect pack-files of smaller size into a batch that is larger than this size")), OPT_END(), }; + options = add_common_options(builtin_multi_pack_index_repack_options); + + trace2_cmd_mode(argv[0]); + + argc = parse_options(argc, argv, NULL, + options, + builtin_multi_pack_index_repack_usage, + PARSE_OPT_KEEP_UNKNOWN); + if (argc) + usage_with_options(builtin_multi_pack_index_repack_usage, + options); + + FREE_AND_NULL(options); + + return midx_repack(the_repository, opts.object_dir, + (size_t)opts.batch_size, opts.flags); +} + +int cmd_multi_pack_index(int argc, const char **argv, + const char *prefix) +{ + struct option *builtin_multi_pack_index_options = common_opts; + git_config(git_default_config, NULL); - opts.progress = isatty(2); + if (isatty(2)) + opts.flags |= MIDX_PROGRESS; argc = parse_options(argc, argv, prefix, builtin_multi_pack_index_options, - builtin_multi_pack_index_usage, 0); + builtin_multi_pack_index_usage, + PARSE_OPT_STOP_AT_NON_OPTION); if (!opts.object_dir) opts.object_dir = get_object_directory(); - if (opts.progress) - flags |= MIDX_PROGRESS; if (argc == 0) + goto usage; + + if (!strcmp(argv[0], "repack")) + return cmd_multi_pack_index_repack(argc, argv); + else if (!strcmp(argv[0], "write")) + return cmd_multi_pack_index_write(argc, argv); + else if (!strcmp(argv[0], "verify")) + return cmd_multi_pack_index_verify(argc, argv); + else if (!strcmp(argv[0], "expire")) + return cmd_multi_pack_index_expire(argc, argv); + else { +usage: + error(_("unrecognized subcommand: %s"), argv[0]); usage_with_options(builtin_multi_pack_index_usage, builtin_multi_pack_index_options); - - if (argc > 1) { - die(_("too many arguments")); - return 1; } - - trace2_cmd_mode(argv[0]); - - if (!strcmp(argv[0], "repack")) - return midx_repack(the_repository, opts.object_dir, - (size_t)opts.batch_size, flags); - if (opts.batch_size) - die(_("--batch-size option is only for 'repack' subcommand")); - - if (!strcmp(argv[0], "write")) - return write_midx_file(opts.object_dir, flags); - if (!strcmp(argv[0], "verify")) - return verify_midx_file(the_repository, opts.object_dir, flags); - if (!strcmp(argv[0], "expire")) - return expire_midx_packs(the_repository, opts.object_dir, flags); - - die(_("unrecognized subcommand: %s"), argv[0]); } diff --git a/builtin/repack.c b/builtin/repack.c index 6ce2556c9e..2847fdfbab 100644 --- a/builtin/repack.c +++ b/builtin/repack.c @@ -721,7 +721,7 @@ int cmd_repack(int argc, const char **argv, const char *prefix) remove_temporary_files(); if (git_env_bool(GIT_TEST_MULTI_PACK_INDEX, 0)) - write_midx_file(get_object_directory(), 0); + write_midx_file(get_object_directory(), NULL, 0); string_list_clear(&names, 0); string_list_clear(&rollback, 0); @@ -12,6 +12,7 @@ #include "run-command.h" #include "repository.h" #include "chunk-format.h" +#include "pack.h" #define MIDX_SIGNATURE 0x4d494458 /* "MIDX" */ #define MIDX_VERSION 1 @@ -47,11 +48,22 @@ static uint8_t oid_version(void) } } +static const unsigned char *get_midx_checksum(struct multi_pack_index *m) +{ + return m->data + m->data_len - the_hash_algo->rawsz; +} + static char *get_midx_filename(const char *object_dir) { return xstrfmt("%s/pack/multi-pack-index", object_dir); } +char *get_midx_rev_filename(struct multi_pack_index *m) +{ + return xstrfmt("%s/pack/multi-pack-index-%s.rev", + m->object_dir, hash_to_hex(get_midx_checksum(m))); +} + static int midx_read_oid_fanout(const unsigned char *chunk_start, size_t chunk_size, void *data) { @@ -239,7 +251,7 @@ struct object_id *nth_midxed_object_oid(struct object_id *oid, return oid; } -static off_t nth_midxed_offset(struct multi_pack_index *m, uint32_t pos) +off_t nth_midxed_offset(struct multi_pack_index *m, uint32_t pos) { const unsigned char *offset_data; uint32_t offset32; @@ -258,7 +270,7 @@ static off_t nth_midxed_offset(struct multi_pack_index *m, uint32_t pos) return offset32; } -static uint32_t nth_midxed_pack_int_id(struct multi_pack_index *m, uint32_t pos) +uint32_t nth_midxed_pack_int_id(struct multi_pack_index *m, uint32_t pos) { return get_be32(m->chunk_object_offsets + (off_t)pos * MIDX_CHUNK_OFFSET_WIDTH); @@ -431,6 +443,14 @@ static int pack_info_compare(const void *_a, const void *_b) return strcmp(a->pack_name, b->pack_name); } +static int idx_or_pack_name_cmp(const void *_va, const void *_vb) +{ + const char *pack_name = _va; + const struct pack_info *compar = _vb; + + return cmp_idx_or_pack_name(pack_name, compar->pack_name); +} + struct write_midx_context { struct pack_info *info; uint32_t nr; @@ -443,8 +463,11 @@ struct write_midx_context { uint32_t entries_nr; uint32_t *pack_perm; + uint32_t *pack_order; unsigned large_offsets_needed:1; uint32_t num_large_offsets; + + int preferred_pack_idx; }; static void add_pack_to_midx(const char *full_path, size_t full_path_len, @@ -489,6 +512,7 @@ struct pack_midx_entry { uint32_t pack_int_id; time_t pack_mtime; uint64_t offset; + unsigned preferred : 1; }; static int midx_oid_compare(const void *_a, const void *_b) @@ -500,6 +524,12 @@ static int midx_oid_compare(const void *_a, const void *_b) if (cmp) return cmp; + /* Sort objects in a preferred pack first when multiple copies exist. */ + if (a->preferred > b->preferred) + return -1; + if (a->preferred < b->preferred) + return 1; + if (a->pack_mtime > b->pack_mtime) return -1; else if (a->pack_mtime < b->pack_mtime) @@ -527,7 +557,8 @@ static int nth_midxed_pack_midx_entry(struct multi_pack_index *m, static void fill_pack_entry(uint32_t pack_int_id, struct packed_git *p, uint32_t cur_object, - struct pack_midx_entry *entry) + struct pack_midx_entry *entry, + int preferred) { if (nth_packed_object_id(&entry->oid, p, cur_object) < 0) die(_("failed to locate object %d in packfile"), cur_object); @@ -536,6 +567,7 @@ static void fill_pack_entry(uint32_t pack_int_id, entry->pack_mtime = p->mtime; entry->offset = nth_packed_object_offset(p, cur_object); + entry->preferred = !!preferred; } /* @@ -552,7 +584,8 @@ static void fill_pack_entry(uint32_t pack_int_id, static struct pack_midx_entry *get_sorted_entries(struct multi_pack_index *m, struct pack_info *info, uint32_t nr_packs, - uint32_t *nr_objects) + uint32_t *nr_objects, + int preferred_pack) { uint32_t cur_fanout, cur_pack, cur_object; uint32_t alloc_fanout, alloc_objects, total_objects = 0; @@ -589,12 +622,17 @@ static struct pack_midx_entry *get_sorted_entries(struct multi_pack_index *m, nth_midxed_pack_midx_entry(m, &entries_by_fanout[nr_fanout], cur_object); + if (nth_midxed_pack_int_id(m, cur_object) == preferred_pack) + entries_by_fanout[nr_fanout].preferred = 1; + else + entries_by_fanout[nr_fanout].preferred = 0; nr_fanout++; } } for (cur_pack = start_pack; cur_pack < nr_packs; cur_pack++) { uint32_t start = 0, end; + int preferred = cur_pack == preferred_pack; if (cur_fanout) start = get_pack_fanout(info[cur_pack].p, cur_fanout - 1); @@ -602,7 +640,11 @@ static struct pack_midx_entry *get_sorted_entries(struct multi_pack_index *m, for (cur_object = start; cur_object < end; cur_object++) { ALLOC_GROW(entries_by_fanout, nr_fanout + 1, alloc_fanout); - fill_pack_entry(cur_pack, info[cur_pack].p, cur_object, &entries_by_fanout[nr_fanout]); + fill_pack_entry(cur_pack, + info[cur_pack].p, + cur_object, + &entries_by_fanout[nr_fanout], + preferred); nr_fanout++; } } @@ -776,10 +818,80 @@ static int write_midx_large_offsets(struct hashfile *f, return 0; } +struct midx_pack_order_data { + uint32_t nr; + uint32_t pack; + off_t offset; +}; + +static int midx_pack_order_cmp(const void *va, const void *vb) +{ + const struct midx_pack_order_data *a = va, *b = vb; + if (a->pack < b->pack) + return -1; + else if (a->pack > b->pack) + return 1; + else if (a->offset < b->offset) + return -1; + else if (a->offset > b->offset) + return 1; + else + return 0; +} + +static uint32_t *midx_pack_order(struct write_midx_context *ctx) +{ + struct midx_pack_order_data *data; + uint32_t *pack_order; + uint32_t i; + + ALLOC_ARRAY(data, ctx->entries_nr); + for (i = 0; i < ctx->entries_nr; i++) { + struct pack_midx_entry *e = &ctx->entries[i]; + data[i].nr = i; + data[i].pack = ctx->pack_perm[e->pack_int_id]; + if (!e->preferred) + data[i].pack |= (1U << 31); + data[i].offset = e->offset; + } + + QSORT(data, ctx->entries_nr, midx_pack_order_cmp); + + ALLOC_ARRAY(pack_order, ctx->entries_nr); + for (i = 0; i < ctx->entries_nr; i++) + pack_order[i] = data[i].nr; + free(data); + + return pack_order; +} + +static void write_midx_reverse_index(char *midx_name, unsigned char *midx_hash, + struct write_midx_context *ctx) +{ + struct strbuf buf = STRBUF_INIT; + const char *tmp_file; + + strbuf_addf(&buf, "%s-%s.rev", midx_name, hash_to_hex(midx_hash)); + + tmp_file = write_rev_file_order(NULL, ctx->pack_order, ctx->entries_nr, + midx_hash, WRITE_REV); + + if (finalize_object_file(tmp_file, buf.buf)) + die(_("cannot store reverse index file")); + + strbuf_release(&buf); +} + +static void clear_midx_files_ext(struct repository *r, const char *ext, + unsigned char *keep_hash); + static int write_midx_internal(const char *object_dir, struct multi_pack_index *m, - struct string_list *packs_to_drop, unsigned flags) + struct string_list *packs_to_drop, + const char *preferred_pack_name, + unsigned flags) { char *midx_name; + unsigned char midx_hash[GIT_MAX_RAWSZ]; uint32_t i; struct hashfile *f = NULL; struct lock_file lk; @@ -828,7 +940,19 @@ static int write_midx_internal(const char *object_dir, struct multi_pack_index * if (ctx.m && ctx.nr == ctx.m->num_packs && !packs_to_drop) goto cleanup; - ctx.entries = get_sorted_entries(ctx.m, ctx.info, ctx.nr, &ctx.entries_nr); + ctx.preferred_pack_idx = -1; + if (preferred_pack_name) { + for (i = 0; i < ctx.nr; i++) { + if (!cmp_idx_or_pack_name(preferred_pack_name, + ctx.info[i].pack_name)) { + ctx.preferred_pack_idx = i; + break; + } + } + } + + ctx.entries = get_sorted_entries(ctx.m, ctx.info, ctx.nr, &ctx.entries_nr, + ctx.preferred_pack_idx); ctx.large_offsets_needed = 0; for (i = 0; i < ctx.entries_nr; i++) { @@ -889,13 +1013,30 @@ static int write_midx_internal(const char *object_dir, struct multi_pack_index * pack_name_concat_len += strlen(ctx.info[i].pack_name) + 1; } + /* Check that the preferred pack wasn't expired (if given). */ + if (preferred_pack_name) { + struct pack_info *preferred = bsearch(preferred_pack_name, + ctx.info, ctx.nr, + sizeof(*ctx.info), + idx_or_pack_name_cmp); + + if (!preferred) + warning(_("unknown preferred pack: '%s'"), + preferred_pack_name); + else { + uint32_t perm = ctx.pack_perm[preferred->orig_pack_int_id]; + if (perm == PACK_EXPIRED) + warning(_("preferred pack '%s' is expired"), + preferred_pack_name); + } + } + if (pack_name_concat_len % MIDX_CHUNK_ALIGNMENT) pack_name_concat_len += MIDX_CHUNK_ALIGNMENT - (pack_name_concat_len % MIDX_CHUNK_ALIGNMENT); hold_lock_file_for_update(&lk, midx_name, LOCK_DIE_ON_ERROR); f = hashfd(get_lock_file_fd(&lk), get_lock_file_path(&lk)); - FREE_AND_NULL(midx_name); if (ctx.m) close_midx(ctx.m); @@ -927,8 +1068,16 @@ static int write_midx_internal(const char *object_dir, struct multi_pack_index * write_midx_header(f, get_num_chunks(cf), ctx.nr - dropped_packs); write_chunkfile(cf, &ctx); - finalize_hashfile(f, NULL, CSUM_FSYNC | CSUM_HASH_IN_STREAM); + finalize_hashfile(f, midx_hash, CSUM_FSYNC | CSUM_HASH_IN_STREAM); free_chunkfile(cf); + + if (flags & MIDX_WRITE_REV_INDEX) + ctx.pack_order = midx_pack_order(&ctx); + + if (flags & MIDX_WRITE_REV_INDEX) + write_midx_reverse_index(midx_name, midx_hash, &ctx); + clear_midx_files_ext(the_repository, ".rev", midx_hash); + commit_lock_file(&lk); cleanup: @@ -943,13 +1092,55 @@ cleanup: free(ctx.info); free(ctx.entries); free(ctx.pack_perm); + free(ctx.pack_order); free(midx_name); return result; } -int write_midx_file(const char *object_dir, unsigned flags) +int write_midx_file(const char *object_dir, + const char *preferred_pack_name, + unsigned flags) { - return write_midx_internal(object_dir, NULL, NULL, flags); + return write_midx_internal(object_dir, NULL, NULL, preferred_pack_name, + flags); +} + +struct clear_midx_data { + char *keep; + const char *ext; +}; + +static void clear_midx_file_ext(const char *full_path, size_t full_path_len, + const char *file_name, void *_data) +{ + struct clear_midx_data *data = _data; + + if (!(starts_with(file_name, "multi-pack-index-") && + ends_with(file_name, data->ext))) + return; + if (data->keep && !strcmp(data->keep, file_name)) + return; + + if (unlink(full_path)) + die_errno(_("failed to remove %s"), full_path); +} + +static void clear_midx_files_ext(struct repository *r, const char *ext, + unsigned char *keep_hash) +{ + struct clear_midx_data data; + memset(&data, 0, sizeof(struct clear_midx_data)); + + if (keep_hash) + data.keep = xstrfmt("multi-pack-index-%s%s", + hash_to_hex(keep_hash), ext); + data.ext = ext; + + for_each_file_in_pack_dir(r->objects->odb->path, + clear_midx_file_ext, + &data); + + free(data.keep); } void clear_midx_file(struct repository *r) @@ -964,6 +1155,8 @@ void clear_midx_file(struct repository *r) if (remove_path(midx)) die(_("failed to clear multi-pack-index at %s"), midx); + clear_midx_files_ext(r, ".rev", NULL); + free(midx); } @@ -1184,7 +1377,7 @@ int expire_midx_packs(struct repository *r, const char *object_dir, unsigned fla free(count); if (packs_to_drop.nr) - result = write_midx_internal(object_dir, m, &packs_to_drop, flags); + result = write_midx_internal(object_dir, m, &packs_to_drop, NULL, flags); string_list_clear(&packs_to_drop, 0); return result; @@ -1373,7 +1566,7 @@ int midx_repack(struct repository *r, const char *object_dir, size_t batch_size, goto cleanup; } - result = write_midx_internal(object_dir, m, NULL, flags); + result = write_midx_internal(object_dir, m, NULL, NULL, flags); m = NULL; cleanup: @@ -15,6 +15,10 @@ struct multi_pack_index { const unsigned char *data; size_t data_len; + const uint32_t *revindex_data; + const uint32_t *revindex_map; + size_t revindex_len; + uint32_t signature; unsigned char version; unsigned char hash_len; @@ -36,10 +40,15 @@ struct multi_pack_index { }; #define MIDX_PROGRESS (1 << 0) +#define MIDX_WRITE_REV_INDEX (1 << 1) + +char *get_midx_rev_filename(struct multi_pack_index *m); struct multi_pack_index *load_multi_pack_index(const char *object_dir, int local); int prepare_midx_pack(struct repository *r, struct multi_pack_index *m, uint32_t pack_int_id); int bsearch_midx(const struct object_id *oid, struct multi_pack_index *m, uint32_t *result); +off_t nth_midxed_offset(struct multi_pack_index *m, uint32_t pos); +uint32_t nth_midxed_pack_int_id(struct multi_pack_index *m, uint32_t pos); struct object_id *nth_midxed_object_oid(struct object_id *oid, struct multi_pack_index *m, uint32_t n); @@ -47,7 +56,7 @@ int fill_midx_entry(struct repository *r, const struct object_id *oid, struct pa int midx_contains_pack(struct multi_pack_index *m, const char *idx_or_pack_name); int prepare_multi_pack_index_one(struct repository *r, const char *object_dir, int local); -int write_midx_file(const char *object_dir, unsigned flags); +int write_midx_file(const char *object_dir, const char *preferred_pack_name, unsigned flags); void clear_midx_file(struct repository *r); int verify_midx_file(struct repository *r, const char *object_dir, unsigned flags); int expire_midx_packs(struct repository *r, const char *object_dir, unsigned flags); diff --git a/pack-revindex.c b/pack-revindex.c index 4262530449..0e4a31d9db 100644 --- a/pack-revindex.c +++ b/pack-revindex.c @@ -3,6 +3,7 @@ #include "object-store.h" #include "packfile.h" #include "config.h" +#include "midx.h" struct revindex_entry { off_t offset; @@ -293,6 +294,43 @@ int load_pack_revindex(struct packed_git *p) return -1; } +int load_midx_revindex(struct multi_pack_index *m) +{ + char *revindex_name; + int ret; + if (m->revindex_data) + return 0; + + revindex_name = get_midx_rev_filename(m); + + ret = load_revindex_from_disk(revindex_name, + m->num_objects, + &m->revindex_map, + &m->revindex_len); + if (ret) + goto cleanup; + + m->revindex_data = (const uint32_t *)((const char *)m->revindex_map + RIDX_HEADER_SIZE); + +cleanup: + free(revindex_name); + return ret; +} + +int close_midx_revindex(struct multi_pack_index *m) +{ + if (!m || !m->revindex_map) + return 0; + + munmap((void*)m->revindex_map, m->revindex_len); + + m->revindex_map = NULL; + m->revindex_data = NULL; + m->revindex_len = 0; + + return 0; +} + int offset_to_pack_pos(struct packed_git *p, off_t ofs, uint32_t *pos) { unsigned lo, hi; @@ -347,3 +385,91 @@ off_t pack_pos_to_offset(struct packed_git *p, uint32_t pos) else return nth_packed_object_offset(p, pack_pos_to_index(p, pos)); } + +uint32_t pack_pos_to_midx(struct multi_pack_index *m, uint32_t pos) +{ + if (!m->revindex_data) + BUG("pack_pos_to_midx: reverse index not yet loaded"); + if (m->num_objects <= pos) + BUG("pack_pos_to_midx: out-of-bounds object at %"PRIu32, pos); + return get_be32(m->revindex_data + pos); +} + +struct midx_pack_key { + uint32_t pack; + off_t offset; + + uint32_t preferred_pack; + struct multi_pack_index *midx; +}; + +static int midx_pack_order_cmp(const void *va, const void *vb) +{ + const struct midx_pack_key *key = va; + struct multi_pack_index *midx = key->midx; + + uint32_t versus = pack_pos_to_midx(midx, (uint32_t*)vb - (const uint32_t *)midx->revindex_data); + uint32_t versus_pack = nth_midxed_pack_int_id(midx, versus); + off_t versus_offset; + + uint32_t key_preferred = key->pack == key->preferred_pack; + uint32_t versus_preferred = versus_pack == key->preferred_pack; + + /* + * First, compare the preferred-ness, noting that the preferred pack + * comes first. + */ + if (key_preferred && !versus_preferred) + return -1; + else if (!key_preferred && versus_preferred) + return 1; + + /* Then, break ties first by comparing the pack IDs. */ + if (key->pack < versus_pack) + return -1; + else if (key->pack > versus_pack) + return 1; + + /* Finally, break ties by comparing offsets within a pack. */ + versus_offset = nth_midxed_offset(midx, versus); + if (key->offset < versus_offset) + return -1; + else if (key->offset > versus_offset) + return 1; + + return 0; +} + +int midx_to_pack_pos(struct multi_pack_index *m, uint32_t at, uint32_t *pos) +{ + struct midx_pack_key key; + uint32_t *found; + + if (!m->revindex_data) + BUG("midx_to_pack_pos: reverse index not yet loaded"); + if (m->num_objects <= at) + BUG("midx_to_pack_pos: out-of-bounds object at %"PRIu32, at); + + key.pack = nth_midxed_pack_int_id(m, at); + key.offset = nth_midxed_offset(m, at); + key.midx = m; + /* + * The preferred pack sorts first, so determine its identifier by + * looking at the first object in pseudo-pack order. + * + * Note that if no --preferred-pack is explicitly given when writing a + * multi-pack index, then whichever pack has the lowest identifier + * implicitly is preferred (and includes all its objects, since ties are + * broken first by pack identifier). + */ + key.preferred_pack = nth_midxed_pack_int_id(m, pack_pos_to_midx(m, 0)); + + found = bsearch(&key, m->revindex_data, m->num_objects, + sizeof(*m->revindex_data), midx_pack_order_cmp); + + if (!found) + return error("bad offset for revindex"); + + *pos = found - m->revindex_data; + return 0; +} diff --git a/pack-revindex.h b/pack-revindex.h index ba7c82c125..479b8f2f9c 100644 --- a/pack-revindex.h +++ b/pack-revindex.h @@ -14,6 +14,20 @@ * * - offset: the byte offset within the .pack file at which the object contents * can be found + * + * The revindex can also be used with a multi-pack index (MIDX). In this + * setting: + * + * - index position refers to an object's numeric position within the MIDX + * + * - pack position refers to an object's position within a non-existent pack + * described by the MIDX. The pack structure is described in + * Documentation/technical/pack-format.txt. + * + * It is effectively a concatanation of all packs in the MIDX (ordered by + * their numeric ID within the MIDX) in their original order within each + * pack), removing duplicates, and placing the preferred pack (if any) + * first. */ @@ -24,6 +38,7 @@ #define GIT_TEST_REV_INDEX_DIE_IN_MEMORY "GIT_TEST_REV_INDEX_DIE_IN_MEMORY" struct packed_git; +struct multi_pack_index; /* * load_pack_revindex populates the revindex's internal data-structures for the @@ -35,6 +50,22 @@ struct packed_git; int load_pack_revindex(struct packed_git *p); /* + * load_midx_revindex loads the '.rev' file corresponding to the given + * multi-pack index by mmap-ing it and assigning pointers in the + * multi_pack_index to point at it. + * + * A negative number is returned on error. + */ +int load_midx_revindex(struct multi_pack_index *m); + +/* + * Frees resources associated with a multi-pack reverse index. + * + * A negative number is returned on error. + */ +int close_midx_revindex(struct multi_pack_index *m); + +/* * offset_to_pack_pos converts an object offset to a pack position. This * function returns zero on success, and a negative number otherwise. The * parameter 'pos' is usable only on success. @@ -71,4 +102,26 @@ uint32_t pack_pos_to_index(struct packed_git *p, uint32_t pos); */ off_t pack_pos_to_offset(struct packed_git *p, uint32_t pos); +/* + * pack_pos_to_midx converts the object at position "pos" within the MIDX + * pseudo-pack into a MIDX position. + * + * If the reverse index has not yet been loaded, or the position is out of + * bounds, this function aborts. + * + * This function runs in time O(log N) with the number of objects in the MIDX. + */ +uint32_t pack_pos_to_midx(struct multi_pack_index *m, uint32_t pos); + +/* + * midx_to_pack_pos converts from the MIDX-relative position at "at" to the + * corresponding pack position. + * + * If the reverse index has not yet been loaded, or the position is out of + * bounds, this function aborts. + * + * This function runs in constant time. + */ +int midx_to_pack_pos(struct multi_pack_index *midx, uint32_t at, uint32_t *pos); + #endif diff --git a/pack-write.c b/pack-write.c index 2ca85a9d16..f1fc3ecafa 100644 --- a/pack-write.c +++ b/pack-write.c @@ -201,21 +201,12 @@ static void write_rev_header(struct hashfile *f) } static void write_rev_index_positions(struct hashfile *f, - struct pack_idx_entry **objects, + uint32_t *pack_order, uint32_t nr_objects) { - uint32_t *pack_order; uint32_t i; - - ALLOC_ARRAY(pack_order, nr_objects); - for (i = 0; i < nr_objects; i++) - pack_order[i] = i; - QSORT_S(pack_order, nr_objects, pack_order_cmp, objects); - for (i = 0; i < nr_objects; i++) hashwrite_be32(f, pack_order[i]); - - free(pack_order); } static void write_rev_trailer(struct hashfile *f, const unsigned char *hash) @@ -229,6 +220,29 @@ const char *write_rev_file(const char *rev_name, const unsigned char *hash, unsigned flags) { + uint32_t *pack_order; + uint32_t i; + const char *ret; + + ALLOC_ARRAY(pack_order, nr_objects); + for (i = 0; i < nr_objects; i++) + pack_order[i] = i; + QSORT_S(pack_order, nr_objects, pack_order_cmp, objects); + + ret = write_rev_file_order(rev_name, pack_order, nr_objects, hash, + flags); + + free(pack_order); + + return ret; +} + +const char *write_rev_file_order(const char *rev_name, + uint32_t *pack_order, + uint32_t nr_objects, + const unsigned char *hash, + unsigned flags) +{ struct hashfile *f; int fd; @@ -262,7 +276,7 @@ const char *write_rev_file(const char *rev_name, write_rev_header(f); - write_rev_index_positions(f, objects, nr_objects); + write_rev_index_positions(f, pack_order, nr_objects); write_rev_trailer(f, hash); if (rev_name && adjust_shared_perm(rev_name) < 0) @@ -94,6 +94,7 @@ struct ref; void write_promisor_file(const char *promisor_name, struct ref **sought, int nr_sought); const char *write_rev_file(const char *rev_name, struct pack_idx_entry **objects, uint32_t nr_objects, const unsigned char *hash, unsigned flags); +const char *write_rev_file_order(const char *rev_name, uint32_t *pack_order, uint32_t nr_objects, const unsigned char *hash, unsigned flags); /* * The "hdr" output buffer should be at least this big, which will handle sizes diff --git a/packfile.c b/packfile.c index 6661f3325a..8668345d93 100644 --- a/packfile.c +++ b/packfile.c @@ -862,6 +862,9 @@ static void prepare_pack(const char *full_name, size_t full_name_len, if (!strcmp(file_name, "multi-pack-index")) return; + if (starts_with(file_name, "multi-pack-index") && + ends_with(file_name, ".rev")) + return; if (ends_with(file_name, ".idx") || ends_with(file_name, ".rev") || ends_with(file_name, ".pack") || diff --git a/t/helper/test-read-midx.c b/t/helper/test-read-midx.c index 2430880f78..7c2eb11a8e 100644 --- a/t/helper/test-read-midx.c +++ b/t/helper/test-read-midx.c @@ -4,7 +4,7 @@ #include "repository.h" #include "object-store.h" -static int read_midx_file(const char *object_dir) +static int read_midx_file(const char *object_dir, int show_objects) { uint32_t i; struct multi_pack_index *m; @@ -43,13 +43,29 @@ static int read_midx_file(const char *object_dir) printf("object-dir: %s\n", m->object_dir); + if (show_objects) { + struct object_id oid; + struct pack_entry e; + + for (i = 0; i < m->num_objects; i++) { + nth_midxed_object_oid(&oid, m, i); + fill_midx_entry(the_repository, &oid, &e, m); + + printf("%s %"PRIu64"\t%s\n", + oid_to_hex(&oid), e.offset, e.p->pack_name); + } + return 0; + } + return 0; } int cmd__read_midx(int argc, const char **argv) { - if (argc != 2) - usage("read-midx <object-dir>"); + if (!(argc == 2 || argc == 3)) + usage("read-midx [--show-objects] <object-dir>"); - return read_midx_file(argv[1]); + if (!strcmp(argv[1], "--show-objects")) + return read_midx_file(argv[2], 1); + return read_midx_file(argv[1], 0); } diff --git a/t/t5319-multi-pack-index.sh b/t/t5319-multi-pack-index.sh index b4afab1dfc..5641d158df 100755 --- a/t/t5319-multi-pack-index.sh +++ b/t/t5319-multi-pack-index.sh @@ -234,6 +234,48 @@ test_expect_success 'warn on improper hash version' ' ) ' +test_expect_success 'midx picks objects from preferred pack' ' + test_when_finished rm -rf preferred.git && + git init --bare preferred.git && + ( + cd preferred.git && + + a=$(echo "a" | git hash-object -w --stdin) && + b=$(echo "b" | git hash-object -w --stdin) && + c=$(echo "c" | git hash-object -w --stdin) && + + # Set up two packs, duplicating the object "B" at different + # offsets. + # + # Note that the "BC" pack (the one we choose as preferred) sorts + # lexically after the "AB" pack, meaning that omitting the + # --preferred-pack argument would cause this test to fail (since + # the MIDX code would select the copy of "b" in the "AB" pack). + git pack-objects objects/pack/test-AB <<-EOF && + $a + $b + EOF + bc=$(git pack-objects objects/pack/test-BC <<-EOF + $b + $c + EOF + ) && + + git multi-pack-index --object-dir=objects \ + write --preferred-pack=test-BC-$bc.idx 2>err && + test_must_be_empty err && + + test-tool read-midx --show-objects objects >out && + + ofs=$(git show-index <objects/pack/test-BC-$bc.idx | grep $b | + cut -d" " -f1) && + printf "%s %s\tobjects/pack/test-BC-%s.pack\n" \ + "$b" "$ofs" "$bc" >expect && + grep ^$b out >actual && + + test_cmp expect actual + ) +' test_expect_success 'verify multi-pack-index success' ' git multi-pack-index verify --object-dir=$objdir |