summaryrefslogtreecommitdiff
path: root/Documentation/technical
diff options
context:
space:
mode:
Diffstat (limited to 'Documentation/technical')
-rw-r--r--Documentation/technical/pack-format.txt118
-rw-r--r--Documentation/technical/pack-heuristics.txt466
-rw-r--r--Documentation/technical/pack-protocol.txt41
-rw-r--r--Documentation/technical/racy-git.txt195
-rw-r--r--Documentation/technical/send-pack-pipeline.txt63
-rw-r--r--Documentation/technical/shallow.txt49
-rw-r--r--Documentation/technical/trivial-merge.txt121
7 files changed, 1053 insertions, 0 deletions
diff --git a/Documentation/technical/pack-format.txt b/Documentation/technical/pack-format.txt
new file mode 100644
index 0000000000..e5b31c81fa
--- /dev/null
+++ b/Documentation/technical/pack-format.txt
@@ -0,0 +1,118 @@
+GIT pack format
+===============
+
+= pack-*.pack file has the following format:
+
+ - The header appears at the beginning and consists of the following:
+
+ 4-byte signature:
+ The signature is: {'P', 'A', 'C', 'K'}
+
+ 4-byte version number (network byte order):
+ GIT currently accepts version number 2 or 3 but
+ generates version 2 only.
+
+ 4-byte number of objects contained in the pack (network byte order)
+
+ Observation: we cannot have more than 4G versions ;-) and
+ more than 4G objects in a pack.
+
+ - The header is followed by number of object entries, each of
+ which looks like this:
+
+ (undeltified representation)
+ n-byte type and length (3-bit type, (n-1)*7+4-bit length)
+ compressed data
+
+ (deltified representation)
+ n-byte type and length (3-bit type, (n-1)*7+4-bit length)
+ 20-byte base object name
+ compressed delta data
+
+ Observation: length of each object is encoded in a variable
+ length format and is not constrained to 32-bit or anything.
+
+ - The trailer records 20-byte SHA1 checksum of all of the above.
+
+= pack-*.idx file has the following format:
+
+ - The header consists of 256 4-byte network byte order
+ integers. N-th entry of this table records the number of
+ objects in the corresponding pack, the first byte of whose
+ object name are smaller than N. This is called the
+ 'first-level fan-out' table.
+
+ Observation: we would need to extend this to an array of
+ 8-byte integers to go beyond 4G objects per pack, but it is
+ not strictly necessary.
+
+ - The header is followed by sorted 24-byte entries, one entry
+ per object in the pack. Each entry is:
+
+ 4-byte network byte order integer, recording where the
+ object is stored in the packfile as the offset from the
+ beginning.
+
+ 20-byte object name.
+
+ Observation: we would definitely need to extend this to
+ 8-byte integer plus 20-byte object name to handle a packfile
+ that is larger than 4GB.
+
+ - The file is concluded with a trailer:
+
+ A copy of the 20-byte SHA1 checksum at the end of
+ corresponding packfile.
+
+ 20-byte SHA1-checksum of all of the above.
+
+Pack Idx file:
+
+ idx
+ +--------------------------------+
+ | fanout[0] = 2 |-.
+ +--------------------------------+ |
+ | fanout[1] | |
+ +--------------------------------+ |
+ | fanout[2] | |
+ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ |
+ | fanout[255] | |
+ +--------------------------------+ |
+main | offset | |
+index | object name 00XXXXXXXXXXXXXXXX | |
+table +--------------------------------+ |
+ | offset | |
+ | object name 00XXXXXXXXXXXXXXXX | |
+ +--------------------------------+ |
+ .-| offset |<+
+ | | object name 01XXXXXXXXXXXXXXXX |
+ | +--------------------------------+
+ | | offset |
+ | | object name 01XXXXXXXXXXXXXXXX |
+ | ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+ | | offset |
+ | | object name FFXXXXXXXXXXXXXXXX |
+ | +--------------------------------+
+trailer | | packfile checksum |
+ | +--------------------------------+
+ | | idxfile checksum |
+ | +--------------------------------+
+ .-------.
+ |
+Pack file entry: <+
+
+ packed object header:
+ 1-byte size extension bit (MSB)
+ type (next 3 bit)
+ size0 (lower 4-bit)
+ n-byte sizeN (as long as MSB is set, each 7-bit)
+ size0..sizeN form 4+7+7+..+7 bit integer, size0
+ is the least significant part, and sizeN is the
+ most significant part.
+ packed object data:
+ If it is not DELTA, then deflated bytes (the size above
+ is the size before compression).
+ If it is DELTA, then
+ 20-byte base object name SHA1 (the size above is the
+ size of the delta data that follows).
+ delta data, deflated.
diff --git a/Documentation/technical/pack-heuristics.txt b/Documentation/technical/pack-heuristics.txt
new file mode 100644
index 0000000000..103eb5d989
--- /dev/null
+++ b/Documentation/technical/pack-heuristics.txt
@@ -0,0 +1,466 @@
+ Concerning Git's Packing Heuristics
+ ===================================
+
+ Oh, here's a really stupid question:
+
+ Where do I go
+ to learn the details
+ of git's packing heuristics?
+
+Be careful what you ask!
+
+Followers of the git, please open the git IRC Log and turn to
+February 10, 2006.
+
+It's a rare occasion, and we are joined by the King Git Himself,
+Linus Torvalds (linus). Nathaniel Smith, (njs`), has the floor
+and seeks enlightenment. Others are present, but silent.
+
+Let's listen in!
+
+ <njs`> Oh, here's a really stupid question -- where do I go to
+ learn the details of git's packing heuristics? google avails
+ me not, reading the source didn't help a lot, and wading
+ through the whole mailing list seems less efficient than any
+ of that.
+
+It is a bold start! A plea for help combined with a simultaneous
+tri-part attack on some of the tried and true mainstays in the quest
+for enlightenment. Brash accusations of google being useless. Hubris!
+Maligning the source. Heresy! Disdain for the mailing list archives.
+Woe.
+
+ <pasky> yes, the packing-related delta stuff is somewhat
+ mysterious even for me ;)
+
+Ah! Modesty after all.
+
+ <linus> njs, I don't think the docs exist. That's something where
+ I don't think anybody else than me even really got involved.
+ Most of the rest of git others have been busy with (especially
+ Junio), but packing nobody touched after I did it.
+
+It's cryptic, yet vague. Linus in style for sure. Wise men
+interpret this as an apology. A few argue it is merely a
+statement of fact.
+
+ <njs`> I guess the next step is "read the source again", but I
+ have to build up a certain level of gumption first :-)
+
+Indeed! On both points.
+
+ <linus> The packing heuristic is actually really really simple.
+
+Bait...
+
+ <linus> But strange.
+
+And switch. That ought to do it!
+
+ <linus> Remember: git really doesn't follow files. So what it does is
+ - generate a list of all objects
+ - sort the list according to magic heuristics
+ - walk the list, using a sliding window, seeing if an object
+ can be diffed against another object in the window
+ - write out the list in recency order
+
+The traditional understatement:
+
+ <njs`> I suspect that what I'm missing is the precise definition of
+ the word "magic"
+
+The traditional insight:
+
+ <pasky> yes
+
+And Babel-like confusion flowed.
+
+ <njs`> oh, hmm, and I'm not sure what this sliding window means either
+
+ <pasky> iirc, it appeared to me to be just the sha1 of the object
+ when reading the code casually ...
+
+ ... which simply doesn't sound as a very good heuristics, though ;)
+
+ <njs`> .....and recency order. okay, I think it's clear I didn't
+ even realize how much I wasn't realizing :-)
+
+Ah, grasshopper! And thus the enlightenment begins anew.
+
+ <linus> The "magic" is actually in theory totally arbitrary.
+ ANY order will give you a working pack, but no, it's not
+ ordered by SHA1.
+
+ Before talking about the ordering for the sliding delta
+ window, let's talk about the recency order. That's more
+ important in one way.
+
+ <njs`> Right, but if all you want is a working way to pack things
+ together, you could just use cat and save yourself some
+ trouble...
+
+Waaait for it....
+
+ <linus> The recency ordering (which is basically: put objects
+ _physically_ into the pack in the order that they are
+ "reachable" from the head) is important.
+
+ <njs`> okay
+
+ <linus> It's important because that's the thing that gives packs
+ good locality. It keeps the objects close to the head (whether
+ they are old or new, but they are _reachable_ from the head)
+ at the head of the pack. So packs actually have absolutely
+ _wonderful_ IO patterns.
+
+Read that again, because it is important.
+
+ <linus> But recency ordering is totally useless for deciding how
+ to actually generate the deltas, so the delta ordering is
+ something else.
+
+ The delta ordering is (wait for it):
+ - first sort by the "basename" of the object, as defined by
+ the name the object was _first_ reached through when
+ generating the object list
+ - within the same basename, sort by size of the object
+ - but always sort different types separately (commits first).
+
+ That's not exactly it, but it's very close.
+
+ <njs`> The "_first_ reached" thing is not too important, just you
+ need some way to break ties since the same objects may be
+ reachable many ways, yes?
+
+And as if to clarify:
+
+ <linus> The point is that it's all really just any random
+ heuristic, and the ordering is totally unimportant for
+ correctness, but it helps a lot if the heuristic gives
+ "clumping" for things that are likely to delta well against
+ each other.
+
+It is an important point, so secretly, I did my own research and have
+included my results below. To be fair, it has changed some over time.
+And through the magic of Revisionistic History, I draw upon this entry
+from The Git IRC Logs on my father's birthday, March 1:
+
+ <gitster> The quote from the above linus should be rewritten a
+ bit (wait for it):
+ - first sort by type. Different objects never delta with
+ each other.
+ - then sort by filename/dirname. hash of the basename
+ occupies the top BITS_PER_INT-DIR_BITS bits, and bottom
+ DIR_BITS are for the hash of leading path elements.
+ - then if we are doing "thin" pack, the objects we are _not_
+ going to pack but we know about are sorted earlier than
+ other objects.
+ - and finally sort by size, larger to smaller.
+
+In one swell-foop, clarification and obscurification! Nonetheless,
+authoritative. Cryptic, yet concise. It even solicits notions of
+quotes from The Source Code. Clearly, more study is needed.
+
+ <gitster> That's the sort order. What this means is:
+ - we do not delta different object types.
+ - we prefer to delta the objects with the same full path, but
+ allow files with the same name from different directories.
+ - we always prefer to delta against objects we are not going
+ to send, if there are some.
+ - we prefer to delta against larger objects, so that we have
+ lots of removals.
+
+ The penultimate rule is for "thin" packs. It is used when
+ the other side is known to have such objects.
+
+There it is again. "Thin" packs. I'm thinking to myself, "What
+is a 'thin' pack?" So I ask:
+
+ <jdl> What is a "thin" pack?
+
+ <gitster> Use of --objects-edge to rev-list as the upstream of
+ pack-objects. The pack transfer protocol negotiates that.
+
+Woo hoo! Cleared that _right_ up!
+
+ <gitster> There are two directions - push and fetch.
+
+There! Did you see it? It is not '"push" and "pull"'! How often the
+confusion has started here. So casually mentioned, too!
+
+ <gitster> For push, git-send-pack invokes git-receive-pack on the
+ other end. The receive-pack says "I have up to these commits".
+ send-pack looks at them, and computes what are missing from
+ the other end. So "thin" could be the default there.
+
+ In the other direction, fetch, git-fetch-pack and
+ git-clone-pack invokes git-upload-pack on the other end
+ (via ssh or by talking to the daemon).
+
+ There are two cases: fetch-pack with -k and clone-pack is one,
+ fetch-pack without -k is the other. clone-pack and fetch-pack
+ with -k will keep the downloaded packfile without expanded, so
+ we do not use thin pack transfer. Otherwise, the generated
+ pack will have delta without base object in the same pack.
+
+ But fetch-pack without -k will explode the received pack into
+ individual objects, so we automatically ask upload-pack to
+ give us a thin pack if upload-pack supports it.
+
+OK then.
+
+Uh.
+
+Let's return to the previous conversation still in progress.
+
+ <njs`> and "basename" means something like "the tail of end of
+ path of file objects and dir objects, as per basename(3), and
+ we just declare all commit and tag objects to have the same
+ basename" or something?
+
+Luckily, that too is a point that gitster clarified for us!
+
+If I might add, the trick is to make files that _might_ be similar be
+located close to each other in the hash buckets based on their file
+names. It used to be that "foo/Makefile", "bar/baz/quux/Makefile" and
+"Makefile" all landed in the same bucket due to their common basename,
+"Makefile". However, now they land in "close" buckets.
+
+The algorithm allows not just for the _same_ bucket, but for _close_
+buckets to be considered delta candidates. The rationale is
+essentially that files, like Makefiles, often have very similar
+content no matter what directory they live in.
+
+ <linus> I played around with different delta algorithms, and with
+ making the "delta window" bigger, but having too big of a
+ sliding window makes it very expensive to generate the pack:
+ you need to compare every object with a _ton_ of other objects.
+
+ There are a number of other trivial heuristics too, which
+ basically boil down to "don't bother even trying to delta this
+ pair" if we can tell before-hand that the delta isn't worth it
+ (due to size differences, where we can take a previous delta
+ result into account to decide that "ok, no point in trying
+ that one, it will be worse").
+
+ End result: packing is actually very size efficient. It's
+ somewhat CPU-wasteful, but on the other hand, since you're
+ really only supposed to do it maybe once a month (and you can
+ do it during the night), nobody really seems to care.
+
+Nice Engineering Touch, there. Find when it doesn't matter, and
+proclaim it a non-issue. Good style too!
+
+ <njs`> So, just to repeat to see if I'm following, we start by
+ getting a list of the objects we want to pack, we sort it by
+ this heuristic (basically lexicographically on the tuple
+ (type, basename, size)).
+
+ Then we walk through this list, and calculate a delta of
+ each object against the last n (tunable parameter) objects,
+ and pick the smallest of these deltas.
+
+Vastly simplified, but the essence is there!
+
+ <linus> Correct.
+
+ <njs`> And then once we have picked a delta or fulltext to
+ represent each object, we re-sort by recency, and write them
+ out in that order.
+
+ <linus> Yup. Some other small details:
+
+And of course there is the "Other Shoe" Factor too.
+
+ <linus> - We limit the delta depth to another magic value (right
+ now both the window and delta depth magic values are just "10")
+
+ <njs`> Hrm, my intuition is that you'd end up with really _bad_ IO
+ patterns, because the things you want are near by, but to
+ actually reconstruct them you may have to jump all over in
+ random ways.
+
+ <linus> - When we write out a delta, and we haven't yet written
+ out the object it is a delta against, we write out the base
+ object first. And no, when we reconstruct them, we actually
+ get nice IO patterns, because:
+ - larger objects tend to be "more recent" (Linus' law: files grow)
+ - we actively try to generate deltas from a larger object to a
+ smaller one
+ - this means that the top-of-tree very seldom has deltas
+ (i.e. deltas in _practice_ are "backwards deltas")
+
+Again, we should reread that whole paragraph. Not just because
+Linus has slipped Linus's Law in there on us, but because it is
+important. Let's make sure we clarify some of the points here:
+
+ <njs`> So the point is just that in practice, delta order and
+ recency order match each other quite well.
+
+ <linus> Yes. There's another nice side to this (and yes, it was
+ designed that way ;):
+ - the reason we generate deltas against the larger object is
+ actually a big space saver too!
+
+ <njs`> Hmm, but your last comment (if "we haven't yet written out
+ the object it is a delta against, we write out the base object
+ first"), seems like it would make these facts mostly
+ irrelevant because even if in practice you would not have to
+ wander around much, in fact you just brute-force say that in
+ the cases where you might have to wander, don't do that :-)
+
+ <linus> Yes and no. Notice the rule: we only write out the base
+ object first if the delta against it was more recent. That
+ means that you can actually have deltas that refer to a base
+ object that is _not_ close to the delta object, but that only
+ happens when the delta is needed to generate an _old_ object.
+
+ <linus> See?
+
+Yeah, no. I missed that on the first two or three readings myself.
+
+ <linus> This keeps the front of the pack dense. The front of the
+ pack never contains data that isn't relevant to a "recent"
+ object. The size optimization comes from our use of xdelta
+ (but is true for many other delta algorithms): removing data
+ is cheaper (in size) than adding data.
+
+ When you remove data, you only need to say "copy bytes n--m".
+ In contrast, in a delta that _adds_ data, you have to say "add
+ these bytes: 'actual data goes here'"
+
+ *** njs` has quit: Read error: 104 (Connection reset by peer)
+
+ <linus> Uhhuh. I hope I didn't blow njs` mind.
+
+ *** njs` has joined channel #git
+
+ <pasky> :)
+
+The silent observers are amused. Of course.
+
+And as if njs` was expected to be omniscient:
+
+ <linus> njs - did you miss anything?
+
+OK, I'll spell it out. That's Geek Humor. If njs` was not actually
+connected for a little bit there, how would he know if missed anything
+while he was disconnected? He's a benevolent dictator with a sense of
+humor! Well noted!
+
+ <njs`> Stupid router. Or gremlins, or whatever.
+
+It's a cheap shot at Cisco. Take 'em when you can.
+
+ <njs`> Yes and no. Notice the rule: we only write out the base
+ object first if the delta against it was more recent.
+
+ I'm getting lost in all these orders, let me re-read :-)
+ So the write-out order is from most recent to least recent?
+ (Conceivably it could be the opposite way too, I'm not sure if
+ we've said) though my connection back at home is logging, so I
+ can just read what you said there :-)
+
+And for those of you paying attention, the Omniscient Trick has just
+been detailed!
+
+ <linus> Yes, we always write out most recent first
+
+For the other record:
+
+ <pasky> njs`: http://pastebin.com/547965
+
+The 'net never forgets, so that should be good until the end of time.
+
+ <njs`> And, yeah, I got the part about deeper-in-history stuff
+ having worse IO characteristics, one sort of doesn't care.
+
+ <linus> With the caveat that if the "most recent" needs an older
+ object to delta against (hey, shrinking sometimes does
+ happen), we write out the old object with the delta.
+
+ <njs`> (if only it happened more...)
+
+ <linus> Anyway, the pack-file could easily be denser still, but
+ because it's used both for streaming (the git protocol) and
+ for on-disk, it has a few pessimizations.
+
+Actually, it is a made-up word. But it is a made-up word being
+used as setup for a later optimization, which is a real word:
+
+ <linus> In particular, while the pack-file is then compressed,
+ it's compressed just one object at a time, so the actual
+ compression factor is less than it could be in theory. But it
+ means that it's all nice random-access with a simple index to
+ do "object name->location in packfile" translation.
+
+ <njs`> I'm assuming the real win for delta-ing large->small is
+ more homogeneous statistics for gzip to run over?
+
+ (You have to put the bytes in one place or another, but
+ putting them in a larger blob wins on compression)
+
+ Actually, what is the compression strategy -- each delta
+ individually gzipped, the whole file gzipped, somewhere in
+ between, no compression at all, ....?
+
+ Right.
+
+Reality IRC sets in. For example:
+
+ <pasky> I'll read the rest in the morning, I really have to go
+ sleep or there's no hope whatsoever for me at the today's
+ exam... g'nite all.
+
+Heh.
+
+ <linus> pasky: g'nite
+
+ <njs`> pasky: 'luck
+
+ <linus> Right: large->small matters exactly because of compression
+ behaviour. If it was non-compressed, it probably wouldn't make
+ any difference.
+
+ <njs`> yeah
+
+ <linus> Anyway: I'm not even trying to claim that the pack-files
+ are perfect, but they do tend to have a nice balance of
+ density vs ease-of use.
+
+Gasp! OK, saved. That's a fair Engineering trade off. Close call!
+In fact, Linus reflects on some Basic Engineering Fundamentals,
+design options, etc.
+
+ <linus> More importantly, they allow git to still _conceptually_
+ never deal with deltas at all, and be a "whole object" store.
+
+ Which has some problems (we discussed bad huge-file
+ behaviour on the git lists the other day), but it does mean
+ that the basic git concepts are really really simple and
+ straightforward.
+
+ It's all been quite stable.
+
+ Which I think is very much a result of having very simple
+ basic ideas, so that there's never any confusion about what's
+ going on.
+
+ Bugs happen, but they are "simple" bugs. And bugs that
+ actually get some object store detail wrong are almost always
+ so obvious that they never go anywhere.
+
+ <njs`> Yeah.
+
+Nuff said.
+
+ <linus> Anyway. I'm off for bed. It's not 6AM here, but I've got
+ three kids, and have to get up early in the morning to send
+ them off. I need my beauty sleep.
+
+ <njs`> :-)
+
+ <njs`> appreciate the infodump, I really was failing to find the
+ details on git packs :-)
+
+And now you know the rest of the story.
diff --git a/Documentation/technical/pack-protocol.txt b/Documentation/technical/pack-protocol.txt
new file mode 100644
index 0000000000..9cd48b4859
--- /dev/null
+++ b/Documentation/technical/pack-protocol.txt
@@ -0,0 +1,41 @@
+Pack transfer protocols
+=======================
+
+There are two Pack push-pull protocols.
+
+upload-pack (S) | fetch/clone-pack (C) protocol:
+
+ # Tell the puller what commits we have and what their names are
+ S: SHA1 name
+ S: ...
+ S: SHA1 name
+ S: # flush -- it's your turn
+ # Tell the pusher what commits we want, and what we have
+ C: want name
+ C: ..
+ C: want name
+ C: have SHA1
+ C: have SHA1
+ C: ...
+ C: # flush -- occasionally ask "had enough?"
+ S: NAK
+ C: have SHA1
+ C: ...
+ C: have SHA1
+ S: ACK
+ C: done
+ S: XXXXXXX -- packfile contents.
+
+send-pack | receive-pack protocol.
+
+ # Tell the pusher what commits we have and what their names are
+ C: SHA1 name
+ C: ...
+ C: SHA1 name
+ C: # flush -- it's your turn
+ # Tell the puller what the pusher has
+ S: old-SHA1 new-SHA1 name
+ S: old-SHA1 new-SHA1 name
+ S: ...
+ S: # flush -- done with the list
+ S: XXXXXXX --- packfile contents.
diff --git a/Documentation/technical/racy-git.txt b/Documentation/technical/racy-git.txt
new file mode 100644
index 0000000000..5030d9f2f8
--- /dev/null
+++ b/Documentation/technical/racy-git.txt
@@ -0,0 +1,195 @@
+Use of index and Racy git problem
+=================================
+
+Background
+----------
+
+The index is one of the most important data structures in git.
+It represents a virtual working tree state by recording list of
+paths and their object names and serves as a staging area to
+write out the next tree object to be committed. The state is
+"virtual" in the sense that it does not necessarily have to, and
+often does not, match the files in the working tree.
+
+There are cases git needs to examine the differences between the
+virtual working tree state in the index and the files in the
+working tree. The most obvious case is when the user asks `git
+diff` (or its low level implementation, `git diff-files`) or
+`git-ls-files --modified`. In addition, git internally checks
+if the files in the working tree are different from what are
+recorded in the index to avoid stomping on local changes in them
+during patch application, switching branches, and merging.
+
+In order to speed up this comparison between the files in the
+working tree and the index entries, the index entries record the
+information obtained from the filesystem via `lstat(2)` system
+call when they were last updated. When checking if they differ,
+git first runs `lstat(2)` on the files and compares the result
+with this information (this is what was originally done by the
+`ce_match_stat()` function, but the current code does it in
+`ce_match_stat_basic()` function). If some of these "cached
+stat information" fields do not match, git can tell that the
+files are modified without even looking at their contents.
+
+Note: not all members in `struct stat` obtained via `lstat(2)`
+are used for this comparison. For example, `st_atime` obviously
+is not useful. Currently, git compares the file type (regular
+files vs symbolic links) and executable bits (only for regular
+files) from `st_mode` member, `st_mtime` and `st_ctime`
+timestamps, `st_uid`, `st_gid`, `st_ino`, and `st_size` members.
+With a `USE_STDEV` compile-time option, `st_dev` is also
+compared, but this is not enabled by default because this member
+is not stable on network filesystems. With `USE_NSEC`
+compile-time option, `st_mtim.tv_nsec` and `st_ctim.tv_nsec`
+members are also compared, but this is not enabled by default
+because the value of this member becomes meaningless once the
+inode is evicted from the inode cache on filesystems that do not
+store it on disk.
+
+
+Racy git
+--------
+
+There is one slight problem with the optimization based on the
+cached stat information. Consider this sequence:
+
+ : modify 'foo'
+ $ git update-index 'foo'
+ : modify 'foo' again, in-place, without changing its size
+
+The first `update-index` computes the object name of the
+contents of file `foo` and updates the index entry for `foo`
+along with the `struct stat` information. If the modification
+that follows it happens very fast so that the file's `st_mtime`
+timestamp does not change, after this sequence, the cached stat
+information the index entry records still exactly match what you
+would see in the filesystem, even though the file `foo` is now
+different.
+This way, git can incorrectly think files in the working tree
+are unmodified even though they actually are. This is called
+the "racy git" problem (discovered by Pasky), and the entries
+that appear clean when they may not be because of this problem
+are called "racily clean".
+
+To avoid this problem, git does two things:
+
+. When the cached stat information says the file has not been
+ modified, and the `st_mtime` is the same as (or newer than)
+ the timestamp of the index file itself (which is the time `git
+ update-index foo` finished running in the above example), it
+ also compares the contents with the object registered in the
+ index entry to make sure they match.
+
+. When the index file is updated that contains racily clean
+ entries, cached `st_size` information is truncated to zero
+ before writing a new version of the index file.
+
+Because the index file itself is written after collecting all
+the stat information from updated paths, `st_mtime` timestamp of
+it is usually the same as or newer than any of the paths the
+index contains. And no matter how quick the modification that
+follows `git update-index foo` finishes, the resulting
+`st_mtime` timestamp on `foo` cannot get a value earlier
+than the index file. Therefore, index entries that can be
+racily clean are limited to the ones that have the same
+timestamp as the index file itself.
+
+The callers that want to check if an index entry matches the
+corresponding file in the working tree continue to call
+`ce_match_stat()`, but with this change, `ce_match_stat()` uses
+`ce_modified_check_fs()` to see if racily clean ones are
+actually clean after comparing the cached stat information using
+`ce_match_stat_basic()`.
+
+The problem the latter solves is this sequence:
+
+ $ git update-index 'foo'
+ : modify 'foo' in-place without changing its size
+ : wait for enough time
+ $ git update-index 'bar'
+
+Without the latter, the timestamp of the index file gets a newer
+value, and falsely clean entry `foo` would not be caught by the
+timestamp comparison check done with the former logic anymore.
+The latter makes sure that the cached stat information for `foo`
+would never match with the file in the working tree, so later
+checks by `ce_match_stat_basic()` would report that the index entry
+does not match the file and git does not have to fall back on more
+expensive `ce_modified_check_fs()`.
+
+
+Runtime penalty
+---------------
+
+The runtime penalty of falling back to `ce_modified_check_fs()`
+from `ce_match_stat()` can be very expensive when there are many
+racily clean entries. An obvious way to artificially create
+this situation is to give the same timestamp to all the files in
+the working tree in a large project, run `git update-index` on
+them, and give the same timestamp to the index file:
+
+ $ date >.datestamp
+ $ git ls-files | xargs touch -r .datestamp
+ $ git ls-files | git update-index --stdin
+ $ touch -r .datestamp .git/index
+
+This will make all index entries racily clean. The linux-2.6
+project, for example, there are over 20,000 files in the working
+tree. On my Athron 64X2 3800+, after the above:
+
+ $ /usr/bin/time git diff-files
+ 1.68user 0.54system 0:02.22elapsed 100%CPU (0avgtext+0avgdata 0maxresident)k
+ 0inputs+0outputs (0major+67111minor)pagefaults 0swaps
+ $ git update-index MAINTAINERS
+ $ /usr/bin/time git diff-files
+ 0.02user 0.12system 0:00.14elapsed 100%CPU (0avgtext+0avgdata 0maxresident)k
+ 0inputs+0outputs (0major+935minor)pagefaults 0swaps
+
+Running `git update-index` in the middle checked the racily
+clean entries, and left the cached `st_mtime` for all the paths
+intact because they were actually clean (so this step took about
+the same amount of time as the first `git diff-files`). After
+that, they are not racily clean anymore but are truly clean, so
+the second invocation of `git diff-files` fully took advantage
+of the cached stat information.
+
+
+Avoiding runtime penalty
+------------------------
+
+In order to avoid the above runtime penalty, post 1.4.2 git used
+to have a code that made sure the index file
+got timestamp newer than the youngest files in the index when
+there are many young files with the same timestamp as the
+resulting index file would otherwise would have by waiting
+before finishing writing the index file out.
+
+I suspected that in practice the situation where many paths in the
+index are all racily clean was quite rare. The only code paths
+that can record recent timestamp for large number of paths are:
+
+. Initial `git add .` of a large project.
+
+. `git checkout` of a large project from an empty index into an
+ unpopulated working tree.
+
+Note: switching branches with `git checkout` keeps the cached
+stat information of existing working tree files that are the
+same between the current branch and the new branch, which are
+all older than the resulting index file, and they will not
+become racily clean. Only the files that are actually checked
+out can become racily clean.
+
+In a large project where raciness avoidance cost really matters,
+however, the initial computation of all object names in the
+index takes more than one second, and the index file is written
+out after all that happens. Therefore the timestamp of the
+index file will be more than one seconds later than the the
+youngest file in the working tree. This means that in these
+cases there actually will not be any racily clean entry in
+the resulting index.
+
+Based on this discussion, the current code does not use the
+"workaround" to avoid the runtime penalty that does not exist in
+practice anymore. This was done with commit 0fc82cff on Aug 15,
+2006.
diff --git a/Documentation/technical/send-pack-pipeline.txt b/Documentation/technical/send-pack-pipeline.txt
new file mode 100644
index 0000000000..681efe4219
--- /dev/null
+++ b/Documentation/technical/send-pack-pipeline.txt
@@ -0,0 +1,63 @@
+git-send-pack
+=============
+
+Overall operation
+-----------------
+
+. Connects to the remote side and invokes git-receive-pack.
+
+. Learns what refs the remote has and what commit they point at.
+ Matches them to the refspecs we are pushing.
+
+. Checks if there are non-fast-forwards. Unlike fetch-pack,
+ the repository send-pack runs in is supposed to be a superset
+ of the recipient in fast-forward cases, so there is no need
+ for want/have exchanges, and fast-forward check can be done
+ locally. Tell the result to the other end.
+
+. Calls pack_objects() which generates a packfile and sends it
+ over to the other end.
+
+. If the remote side is new enough (v1.1.0 or later), wait for
+ the unpack and hook status from the other end.
+
+. Exit with appropriate error codes.
+
+
+Pack_objects pipeline
+---------------------
+
+This function gets one file descriptor (`fd`) which is either a
+socket (over the network) or a pipe (local). What's written to
+this fd goes to git-receive-pack to be unpacked.
+
+ send-pack ---> fd ---> receive-pack
+
+The function pack_objects creates a pipe and then forks. The
+forked child execs pack-objects with --revs to receive revision
+parameters from its standard input. This process will write the
+packfile to the other end.
+
+ send-pack
+ |
+ pack_objects() ---> fd ---> receive-pack
+ | ^ (pipe)
+ v |
+ (child)
+
+The child dup2's to arrange its standard output to go back to
+the other end, and read its standard input to come from the
+pipe. After that it exec's pack-objects. On the other hand,
+the parent process, before starting to feed the child pipeline,
+closes the reading side of the pipe and fd to receive-pack.
+
+ send-pack
+ |
+ pack_objects(parent)
+ |
+ v [0]
+ pack-objects [0] ---> receive-pack
+
+
+[jc: the pipeline was much more complex and needed documentation before
+ I understood an earlier bug, but now it is trivial and straightforward.]
diff --git a/Documentation/technical/shallow.txt b/Documentation/technical/shallow.txt
new file mode 100644
index 0000000000..559263af48
--- /dev/null
+++ b/Documentation/technical/shallow.txt
@@ -0,0 +1,49 @@
+Def.: Shallow commits do have parents, but not in the shallow
+repo, and therefore grafts are introduced pretending that
+these commits have no parents.
+
+The basic idea is to write the SHA1s of shallow commits into
+$GIT_DIR/shallow, and handle its contents like the contents
+of $GIT_DIR/info/grafts (with the difference that shallow
+cannot contain parent information).
+
+This information is stored in a new file instead of grafts, or
+even the config, since the user should not touch that file
+at all (even throughout development of the shallow clone, it
+was never manually edited!).
+
+Each line contains exactly one SHA1. When read, a commit_graft
+will be constructed, which has nr_parent < 0 to make it easier
+to discern from user provided grafts.
+
+Since fsck-objects relies on the library to read the objects,
+it honours shallow commits automatically.
+
+There are some unfinished ends of the whole shallow business:
+
+- maybe we have to force non-thin packs when fetching into a
+ shallow repo (ATM they are forced non-thin).
+
+- A special handling of a shallow upstream is needed. At some
+ stage, upload-pack has to check if it sends a shallow commit,
+ and it should send that information early (or fail, if the
+ client does not support shallow repositories). There is no
+ support at all for this in this patch series.
+
+- Instead of locking $GIT_DIR/shallow at the start, just
+ the timestamp of it is noted, and when it comes to writing it,
+ a check is performed if the mtime is still the same, dying if
+ it is not.
+
+- It is unclear how "push into/from a shallow repo" should behave.
+
+- If you deepen a history, you'd want to get the tags of the
+ newly stored (but older!) commits. This does not work right now.
+
+To make a shallow clone, you can call "git-clone --depth 20 repo".
+The result contains only commit chains with a length of at most 20.
+It also writes an appropriate $GIT_DIR/shallow.
+
+You can deepen a shallow repository with "git-fetch --depth 20
+repo branch", which will fetch branch from repo, but stop at depth
+20, updating $GIT_DIR/shallow.
diff --git a/Documentation/technical/trivial-merge.txt b/Documentation/technical/trivial-merge.txt
new file mode 100644
index 0000000000..24c84100b0
--- /dev/null
+++ b/Documentation/technical/trivial-merge.txt
@@ -0,0 +1,121 @@
+Trivial merge rules
+===================
+
+This document describes the outcomes of the trivial merge logic in read-tree.
+
+One-way merge
+-------------
+
+This replaces the index with a different tree, keeping the stat info
+for entries that don't change, and allowing -u to make the minimum
+required changes to the working tree to have it match.
+
+Entries marked '+' have stat information. Spaces marked '*' don't
+affect the result.
+
+ index tree result
+ -----------------------
+ * (empty) (empty)
+ (empty) tree tree
+ index+ tree tree
+ index+ index index+
+
+Two-way merge
+-------------
+
+It is permitted for the index to lack an entry; this does not prevent
+any case from applying.
+
+If the index exists, it is an error for it not to match either the old
+or the result.
+
+If multiple cases apply, the one used is listed first.
+
+A result which changes the index is an error if the index is not empty
+and not up-to-date.
+
+Entries marked '+' have stat information. Spaces marked '*' don't
+affect the result.
+
+ case index old new result
+ -------------------------------------
+ 0/2 (empty) * (empty) (empty)
+ 1/3 (empty) * new new
+ 4/5 index+ (empty) (empty) index+
+ 6/7 index+ (empty) index index+
+ 10 index+ index (empty) (empty)
+ 14/15 index+ old old index+
+ 18/19 index+ old index index+
+ 20 index+ index new new
+
+Three-way merge
+---------------
+
+It is permitted for the index to lack an entry; this does not prevent
+any case from applying.
+
+If the index exists, it is an error for it not to match either the
+head or (if the merge is trivial) the result.
+
+If multiple cases apply, the one used is listed first.
+
+A result of "no merge" means that index is left in stage 0, ancest in
+stage 1, head in stage 2, and remote in stage 3 (if any of these are
+empty, no entry is left for that stage). Otherwise, the given entry is
+left in stage 0, and there are no other entries.
+
+A result of "no merge" is an error if the index is not empty and not
+up-to-date.
+
+*empty* means that the tree must not have a directory-file conflict
+ with the entry.
+
+For multiple ancestors, a '+' means that this case applies even if
+only one ancestor or remote fits; a '^' means all of the ancestors
+must be the same.
+
+case ancest head remote result
+----------------------------------------
+1 (empty)+ (empty) (empty) (empty)
+2ALT (empty)+ *empty* remote remote
+2 (empty)^ (empty) remote no merge
+3ALT (empty)+ head *empty* head
+3 (empty)^ head (empty) no merge
+4 (empty)^ head remote no merge
+5ALT * head head head
+6 ancest+ (empty) (empty) no merge
+8 ancest^ (empty) ancest no merge
+7 ancest+ (empty) remote no merge
+10 ancest^ ancest (empty) no merge
+9 ancest+ head (empty) no merge
+16 anc1/anc2 anc1 anc2 no merge
+13 ancest+ head ancest head
+14 ancest+ ancest remote remote
+11 ancest+ head remote no merge
+
+Only #2ALT and #3ALT use *empty*, because these are the only cases
+where there can be conflicts that didn't exist before. Note that we
+allow directory-file conflicts between things in different stages
+after the trivial merge.
+
+A possible alternative for #6 is (empty), which would make it like
+#1. This is not used, due to the likelihood that it arises due to
+moving the file to multiple different locations or moving and deleting
+it in different branches.
+
+Case #1 is included for completeness, and also in case we decide to
+put on '+' markings; any path that is never mentioned at all isn't
+handled.
+
+Note that #16 is when both #13 and #14 apply; in this case, we refuse
+the trivial merge, because we can't tell from this data which is
+right. This is a case of a reverted patch (in some direction, maybe
+multiple times), and the right answer depends on looking at crossings
+of history or common ancestors of the ancestors.
+
+Note that, between #6, #7, #9, and #11, all cases not otherwise
+covered are handled in this table.
+
+For #8 and #10, there is alternative behavior, not currently
+implemented, where the result is (empty). As currently implemented,
+the automatic merge will generally give this effect.